From bugzilla-daemon at portal.open-bio.org Wed Apr 1 07:28:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 07:28:12 -0400 Subject: [Biopython-dev] [Bug 2802] New: Loader.py: load SeqRecord comments as list Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2802 Summary: Loader.py: load SeqRecord comments as list Product: Biopython Version: 1.49b Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: andrea at biodec.com Loader.py version: 1.38 or below python: any Actually seqrecord.annotation['comment'] is a string. SProt parser and GenBank parser parse comment as string. SProt record parser, instead, parse comment as list, according to the "-!-" tag. I'm working on parsing comment as lists, either for Uniprot and for GenBank (ncbi), and I need to have the possibility to manage comment as lists. The biosql schema, also, has in the table "comment", the field "rank" that is suitable to be used for storing list entries. In this way the table is ready and implemented to store list data. The patch is retro-compatible, so the _load_comment function is able to load either string or list entries, according to the data type. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
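[Editor's sketch] The backwards-compatible behaviour described in the report above (accept either a single string or a list of strings, and use the "rank" field of the "comment" table to keep list order) could look roughly like the following. This is not the attached patch and not the code that went into Loader.py. The helper name comment_rows is hypothetical, and starting the rank at 1 is an assumption.

```python
def comment_rows(comment):
    """Return (rank, text) pairs for a comment annotation.

    Hypothetical helper: accepts a single string (the old behaviour)
    or a list of strings (the proposed behaviour).
    """
    if isinstance(comment, str):
        comments = [comment]
    else:
        comments = list(comment)
    # Collapse line breaks and drop entries that are empty after stripping.
    cleaned = [text.replace("\n", " ").strip() for text in comments]
    kept = [text for text in cleaned if text]
    # Rank starting at 1 is an assumption made for this sketch.
    return list(enumerate(kept, start=1))
```

Each returned pair would correspond to one row in the "comment" table, so a plain string simply becomes a one-row list.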
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 07:29:02 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 07:29:02 -0400 Subject: [Biopython-dev] [Bug 2802] Loader.py: load SeqRecord comments as list In-Reply-To: Message-ID: <200904011129.n31BT23k007952@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2802 ------- Comment #1 from andrea at biodec.com 2009-04-01 07:29 EST ------- Created an attachment (id=1270) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1270&action=view) proposed Patch -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 1 07:48:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 07:48:15 -0400 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200904011148.n31BmFmX009292@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 07:48 EST ------- I've updated CVS as per comment 12 to also use record.query_length, and comment 13 to also use record.database_length. Before: >>> from Bio.Blast import NCBIXML >>> for record in NCBIXML.parse(open("xbt007.xml")) : ... print record.query_id ... print record.query_letters, record.query_length ... print record.num_letters_in_database, record.database_letters, record.database_length ... gi|585505|sp|Q08386|MOPB_RHOCA 270 None 13958303 None None gi|129628|sp|P07175.1|PARA_AGRTU 222 None 13958303 None None Now, with Bio/Blast/NCBIXML.py CVS revision 1.20 or 1.21, >>> from Bio.Blast import NCBIXML >>> for record in NCBIXML.parse(open("xbt007.xml")) : ... print record.query_id ... 
print record.query_letters, record.query_length ... print record.num_letters_in_database, record.database_letters, record.database_length ... gi|585505|sp|Q08386|MOPB_RHOCA 270 270 13958303 None 13958303 gi|129628|sp|P07175.1|PARA_AGRTU 222 222 13958303 None 13958303 We could perhaps deprecate record.database_letters immediately, and at a later point, record.query_letters -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 1 07:50:07 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 07:50:07 -0400 Subject: [Biopython-dev] [Bug 2802] Loader.py: load SeqRecord comments as list In-Reply-To: Message-ID: <200904011150.n31Bo7ib009452@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2802 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 07:50 EST ------- See also Bug 2235 for the SwissProt parsing into SeqRecord objects. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 08:33:37 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 08:33:37 -0400 Subject: [Biopython-dev] [Bug 2802] Loader.py: load SeqRecord comments as list In-Reply-To: Message-ID: <200904011233.n31CXbuM012687@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2802 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 08:33 EST ------- Thanks for the report and suggested patch. This is now fixed in CVS (slightly differently though). I'd be grateful if you could test the latest code. A fresh CVS checkout would be easiest - you'll need to update several files as I was working on another issue at the same time: Checking in BioSQL/BioSeq.py; /home/repository/biopython/biopython/BioSQL/BioSeq.py,v <-- BioSeq.py new revision: 1.35; previous revision: 1.34 done Checking in BioSQL/Loader.py; /home/repository/biopython/biopython/BioSQL/Loader.py,v <-- Loader.py new revision: 1.39; previous revision: 1.38 done Checking in Tests/test_BioSQL_SeqIO.py; /home/repository/biopython/biopython/Tests/test_BioSQL_SeqIO.py,v <-- test_BioSQL_SeqIO.py new revision: 1.33; previous revision: 1.32 done Checking in Tests/output/test_BioSQL_SeqIO; /home/repository/biopython/biopython/Tests/output/test_BioSQL_SeqIO,v <-- test_BioSQL_SeqIO new revision: 1.6; previous revision: 1.5 done Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython at maubp.freeserve.co.uk Wed Apr 1 10:23:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Apr 2009 15:23:45 +0100
Subject: [Biopython-dev] Testing Biopython with NumPy 1.3
In-Reply-To: <320fb6e00903310212o29bba163ma9d68a901eabc2c9@mail.gmail.com>
References: <320fb6e00903301535j21ae6659r931c9be0fd17faf3@mail.gmail.com> <730606.962.qm@web62408.mail.re1.yahoo.com> <320fb6e00903310212o29bba163ma9d68a901eabc2c9@mail.gmail.com>
Message-ID: <320fb6e00904010723j594bc958kc721a234c54d4ea5@mail.gmail.com>

On Tue, Mar 31, 2009 at 10:12 AM, Peter wrote:
> On Tue, Mar 31, 2009 at 1:08 AM, Michiel de Hoon wrote:
>>
>>> So, whatever is going wrong on test_Cluster.py seems to be
>>> specific to Windows (XP) and Python 2.6 - and possibly just
>>> my Windows development machine.
>>>
>> I believe that the problem is that msvcr90.dll is missing. This
>> is the C runtime from Microsoft. Earlier Pythons used
>> msvcr71.dll, if I'm not mistaken.
>
> You may be right - there is some stuff on the numpy mailing list
> about this and manifest files etc when using mingw32. It may
> be simplest to try the appropriate MS compiler instead...

OK, good news using the MS compiler:

I went to http://www.microsoft.com/express/download/ and installed the free VC++ 2008 Express Edition (using the web install, unticking the optional Silverlight and SQL Server bits).

Using the "Visual Studio 2008 Command Prompt" shortcut I was able to build, test and install Biopython CVS fine. All this shortcut claims to do is set up suitable environment variables first, so this last bit can probably be simplified for everyday use.

This should mean we can include a Biopython 1.50 (beta) installer for Windows on Python 2.6 using NumPy 1.3 :)

It would still be nice to resolve the mingw32 issue, but it isn't critical right now.

Peter From bugzilla-daemon at portal.open-bio.org Wed Apr 1 10:41:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 10:41:24 -0400 Subject: [Biopython-dev] [Bug 2803] New: Insure Alignment objects are passed to AlignIO.write() Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2803 Summary: Insure Alignment objects are passed to AlignIO.write() Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: cymon.cox at gmail.com Insure Alignment objects are passed to AlignIO.write() Stops this kind of abuse: records = list(SeqIO.parse(open("Tests/NBRF/DMA_nuc.pir", "r"), "pir")) AlignIO.write([records], open("alignIO.fasta", "w"), "fasta") -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 1 10:42:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 10:42:55 -0400 Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to AlignIO.write() In-Reply-To: Message-ID: <200904011442.n31EgtlQ023181@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2803 ------- Comment #1 from cymon.cox at gmail.com 2009-04-01 10:42 EST ------- Created an attachment (id=1271) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1271&action=view) nsure-Alignment-objects-are-passed-to-write-AlignIO -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 11:25:36 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 11:25:36 -0400 Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to AlignIO.write() In-Reply-To: Message-ID: <200904011525.n31FPa3V026200@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2803 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 11:25 EST ------- Thanks for filing the bug (originally raised in our discussion on the mailing list). There is a major drawback to your proposed fix, + if isinstance(alignments, types.GeneratorType): + alignments = list(alignments) This means if you gave the AlignIO.write function a generator returning hundreds or large alignment objects, they would all get loaded into memory at once. One of the big aims with Bio.SeqIO and AlignIO in using generators/iterators is to allow memory efficient working where we try to keep only one record/alignment in memory at a time. Anyway, I'll take a look at this. I think we need to just check the case where Bio.AlignIO.write uses Bio.SeqIO.write internally... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 1 11:36:54 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 11:36:54 -0400 Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to AlignIO.write() In-Reply-To: Message-ID: <200904011536.n31Fasdu027053@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2803 ------- Comment #3 from cymon.cox at gmail.com 2009-04-01 11:36 EST ------- (In reply to comment #2) > Thanks for filing the bug (originally raised in our discussion on the mailing > list). 
> > There is a major drawback to your proposed fix, > > + if isinstance(alignments, types.GeneratorType): > + alignments = list(alignments) > > This means if you gave the AlignIO.write function a generator returning > hundreds or large alignment objects, they would all get loaded into memory at > once. One of the big aims with Bio.SeqIO and AlignIO in using > generators/iterators is to allow memory efficient working where we try to keep > only one record/alignment in memory at a time. > > Anyway, I'll take a look at this. I think we need to just check the case where > Bio.AlignIO.write uses Bio.SeqIO.write internally... > Yes, I see. I had originally intended to check the type while looping through the alignments before calling SeqIO.write, but thought better of it because some alignments may get written before a error occurs, whereas it seems best that either all or none at all get written from the call to AlignIO.write. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 1 11:55:26 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 11:55:26 -0400 Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to AlignIO.write() In-Reply-To: Message-ID: <200904011555.n31FtQ9X028474@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2803 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 11:55 EST ------- (In reply to comment #3) > > Anyway, I'll take a look at this. I think we need to just check the case > > where Bio.AlignIO.write uses Bio.SeqIO.write internally... 
That turned out to be the case, fixed in CVS. See Bio/AlignIO/__init__.py
revision 1.22 and Tests/test_AlignIO.py revision 1.19.

> Yes, I see. I had originally intended to check the type while looping through
> the alignments before calling SeqIO.write, but thought better of it because
> some alignments may get written before a error occurs, whereas it seems best
> that either all or none at all get written from the call to AlignIO.write.

You are right, if we are given a list/iterator containing some real Alignments
but also some non-Alignments we have a problem. We can't pre-check all the
entries before writing without converting to a list (and this ruins the memory
benefits). We just catch the erroneous input when we reach it, even though
that may happen halfway through writing to the file.

Marking as fixed.
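[Editor's sketch] The trade-off discussed in this bug (check each item as it is reached, rather than converting the whole iterator to a list first) can be illustrated as below. This is not the code checked into Bio/AlignIO/__init__.py. The Alignment class here is a stand-in, and checked_alignments and write_names are hypothetical names used only for this example.

```python
import io


class Alignment:
    """Stand-in for Biopython's alignment class (assumption for this sketch)."""

    def __init__(self, name):
        self.name = name


def checked_alignments(alignments):
    """Yield alignments one at a time, failing on the first wrong type."""
    for alignment in alignments:
        if not isinstance(alignment, Alignment):
            raise TypeError("Expected an Alignment object, got %s"
                            % type(alignment).__name__)
        yield alignment


def write_names(alignments, handle):
    """Toy writer: only one alignment is held in memory at a time."""
    count = 0
    for alignment in checked_alignments(alignments):
        handle.write(alignment.name + "\n")
        count += 1
    return count


# A generator mixing a real alignment with a plain list of records.
handle = io.StringIO()
try:
    write_names(iter([Alignment("first"), ["not", "an", "alignment"]]), handle)
except TypeError as error:
    print("Stopped part way through: %s" % error)
```

As noted in comment 4, the valid alignment before the bad entry has already been written by the time the error is raised. That is the price of not loading everything into memory.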
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 14:04:05 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 14:04:05 -0400 Subject: [Biopython-dev] [Bug 2804] New: Clustalw subprocess hangs when large stdout returned Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2804 Summary: Clustalw subprocess hangs when large stdout returned Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: cymon.cox at gmail.com As noted on the mailing list, the following hangs waiting for a return: from Bio import SeqIO from Bio import Clustalw from Bio.Clustalw import MultipleAlignCL records = list(SeqIO.parse(open("Tests/NBRF/Cw_prot.pir", "r"), "pir")) handle = open("temp.fasta", "w") SeqIO.write(records, handle, "fasta") handle.close() cline = MultipleAlignCL("temp.fasta", command="clustalw") align = Clustalw.do_alignment(cline) This appears to be due to a known issue as documented here: http://docs.python.org/library/subprocess.html#subprocess.Popen.wait but wasnt being picked up by the tests - presumably because no test file is large enough to trigger the problem. Instead of using .wait() it suggests .communicate() The attached patch works for me on Linux. But as noted in __init__.py this maybe an issue for Windows: #We don't need to supply any piped input, but we setup the #standard input pipe anyway as a work around for a python #bug if this is called from a Windows GUI program. For #details, see http://bugs.python.org/issue1124861 Also subprocess.returncode is now /3 so moved "if status: value = status / 256 "so that only done if calling os.popen() C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
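[Editor's sketch] The deadlock described above occurs when the child fills the stdout pipe while the parent is blocked in .wait() and never reads it. A minimal illustration of the .communicate() approach follows. It is written for current Python 3 rather than the Python 2.4/2.5 used in this thread, and it uses a small Python child process as a stand-in for clustalw. run_and_collect and noisy_child are names invented for this example.

```python
import subprocess
import sys


def run_and_collect(args):
    """Run a command, draining its pipes so a chatty child cannot block."""
    child = subprocess.Popen(args,
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
    # communicate() reads stdout and stderr until end-of-file and then
    # waits for the child, so the pipe buffers can never fill up.
    stdout_data, stderr_data = child.communicate()
    # The numeric return code lives on the Popen object, not in the
    # tuple returned by communicate().
    return child.returncode, stdout_data, stderr_data


# Stand-in for a tool that prints a lot to stdout (assumption: 200000
# bytes is more than a typical pipe buffer holds).
noisy_child = [sys.executable, "-c",
               "import sys; sys.stdout.write('x' * 200000)"]

code, out, err = run_and_collect(noisy_child)
print(code, len(out), len(err))
```

Replacing the communicate() call with child.wait() in this sketch is the pattern that the subprocess documentation warns can hang when the output is large.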
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 14:05:10 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 14:05:10 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large stdout returned
In-Reply-To:
Message-ID: <200904011805.n31I5ACv005787@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2804

------- Comment #1 from cymon.cox at gmail.com 2009-04-01 14:05 EST -------
Created an attachment (id=1272) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1272&action=view)
clustalw subprocess patch

From bugzilla-daemon at portal.open-bio.org Wed Apr 1 18:05:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 18:05:40 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large stdout returned
In-Reply-To:
Message-ID: <200904012205.n31M5eDa024097@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2804

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 18:05 EST -------
It is great that you've found a simple and reproducible test case. I can
confirm this problem on a Linux machine with Python 2.4.3 (what version of
python do you have?)

(In reply to comment #1)
> Created an attachment (id=1272) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1272&action=view) [details]
> clustalw subprocess patch

Unfortunately the patch is flawed here:

status = child_process.communicate()[1]

We want to get the return code (a numerical error value), but the communicate
method returns two strings giving the contents of stdout and stderr, i.e.

...
CLUSTAL W (1.83) Multiple Sequence Alignments
...
Sequence format is Pearson
Sequence 1: HLA_HLA00401 366 aa
Sequence 2: HLA_HLA00402 366 aa
...
Group 109: Sequences: 3 Score:6519
Group 110: Sequences: 111 Score:4464
Alignment Score 8299041
CLUSTAL-Alignment file created [temp.aln]

for stdout, and an empty string for stderr.

Doing this seems to work on Linux with python 2.4.3,

child_process.communicate() #ignore the stdout and stderr data!
child_process.stdin.close()
child_process.stdout.close()
child_process.stderr.close()
status = child_process.returncode

However, I have only tested this one example so far, and not on Windows or the
Mac yet.

It would be a good idea to extend test_Clustalw_tool.py to cover some
deliberate failures to check we can read the error level (return code) ClustalW
gives back. Of course, this will need testing with both clustalw 1.x and 2.x to
be safe.

Note that the original code using os.popen still works fine for this example.
We switched to subprocess because os.popen* are being deprecated on Python 2.6,
and didn't work well with names with spaces as I recall.

From bugzilla-daemon at portal.open-bio.org Wed Apr 1 18:42:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 18:42:39 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large stdout returned
In-Reply-To:
Message-ID: <200904012242.n31MgdKd026637@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2804

------- Comment #3 from cymon.cox at gmail.com 2009-04-01 18:42 EST -------
(In reply to comment #2)
> It is great that you've found a simple and reproducible test case. I can
> confirm this problem on a Linux machine with Python 2.4.3 (what version of
> python do you have?)

Python 2.5.2 (r252:60911, Oct 5 2008, 19:24:49) [GCC 4.3.2] on linux2 on Ubuntu Intrepid > > (In reply to comment #1) > > Created an attachment (id=1272) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1272&action=view) [details] [details] > > clustalw subprocess patch > > Unfortunately the patch is flawed here: > > status = child_process.communicate()[1] Actually, the 'whole' patch is good. Have a look at the second bit of the patch, where I change my initial commit to my branch: #Grab stderr - status = child_process.communicate()[1] + child_process.communicate() + value = child_process.returncode except ImportError : etc... I've been trying to get to grips with git - and clearly havent succeeded to yet! When you run the command "git format-patch" it creates a separate for each commit to the branch, and I can't figure out how to just get the patch against only the current version of the file. So git gave me two patches, which I cat'ed together and submitted as a composite patch. Sorry I didnt make that clear. If anyone knows how to get the diff against only the current file version, I'd appreciate the answer ;) Cheers, C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 07:00:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 07:00:48 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large stdout returned
In-Reply-To:
Message-ID: <200904021100.n32B0mEZ014206@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2804

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1272 is|0                           |1
           obsolete|                            |

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 07:00 EST -------
Created an attachment (id=1273) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1273&action=view)
Patch to Bio/Clustalw/__init__.py

(In reply to comment #3)
>
> When you run the command "git format-patch" it creates a separate for each
> commit to the branch, and I can't figure out how to just get the patch against
> only the current version of the file. So git gave me two patches, which I
> cat'ed together and submitted as a composite patch.
>

I see - that odd looking patch had confused me. I think you want to look at
"git diff ..." for this, it also can do things like show the diff between the
remote branches.

I have tested this new patch on both Linux and Mac now, using both ClustalW
1.83 and 2.0.10 - next up Windows, and extending the unit test.

Peter

From bugzilla-daemon at portal.open-bio.org Thu Apr 2 07:32:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 2 Apr 2009 07:32:40 -0400 Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large stdout returned In-Reply-To: Message-ID: <200904021132.n32BWdqU016365@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2804 ------- Comment #5 from cymon.cox at gmail.com 2009-04-02 07:32 EST ------- (In reply to comment #4) > Created an attachment (id=1273) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1273&action=view) [details] > Patch to Bio/Clustalw/__init__.py > > (In reply to comment #3) > > > > When you run the command "git format-patch" it creates a separate for each > > commit to the branch, and I can't figure out how to just get the patch against > > only the current version of the file. So git gave me two patches, which I > > cat'ed together and submitted as a composite patch. > > > > I see - that odd looking patch had confused me. I think you want to look at > "giff diff ..." for this, it also can do things like show the diff between the > remote branches. > > I have tested this new patch on both Linux and Mac now, using both ClustalW > 1.83 and 2.0.10 - next up Windows, and extending the unit test. Your new patch doesnt indent the lines (as in my original patch): 113 value = 0 114 if status: value = status / 256 so that they only get executed when run_clust = os.popen(str(command_line)) The return code from child_process.communicate() is already /256 also assign value = child_process.returncode (the return code is 0 for success and never "") """ child_process.communicate() value = child_process.returncode except ImportError : #Fall back for python 2.3 run_clust = os.popen(str(command_line)) status = run_clust.close() # The exit status is the second byte of the termination status # TODO - Check this holds on win32... 
    value = 0
    if status: value = status / 256
# check the return value for errors, as on 1.81 the return value
# from Clustalw is actually helpful for figuring out errors
# 1 => bad command line option
if value == 1:
    raise ValueError("Bad command line option in the command: %s" % str(command_line))
"""

C.

From bugzilla-daemon at portal.open-bio.org Thu Apr 2 10:34:10 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 10:34:10 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large stdout returned
In-Reply-To:
Message-ID: <200904021434.n32EYApO032328@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2804

------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 10:34 EST -------
I've updated test_Clustalw_tool.py in CVS to catch this deadlock, and confirmed
the unit test will fail on Mac and Linux when using subprocess (on the bright
side, Python 2.3 should still work), but the test passes with the fix outlined -
or simply using the os.popen code instead.

Interestingly the lockup seems to happen more readily on Linux than on the Mac.
I've yet to test on Windows.

I also added three tests for standard error conditions - interestingly I don't
ever seem to get an error code back (either with subprocess or os.popen). What
about you? This makes testing these special cases for raising specific IOError
exceptions difficult.
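[Editor's sketch] The "status / 256" step quoted in comment 5 exists because, on POSIX systems, the value returned by os.popen(...).close() is a wait status whose second byte holds the exit code, whereas subprocess's returncode is already the plain exit code. The helper below is a hypothetical illustration of that decoding and is not part of Bio.Clustalw. It assumes a POSIX wait status for a process that exited normally, and it uses integer division because "/" yields a float on Python 3.

```python
def exit_code_from_popen_close(status):
    """Convert the result of os.popen(...).close() into an exit code.

    Assumes a POSIX-style wait status for a process that exited normally
    (not one killed by a signal). close() returns None on success.
    """
    if status is None:
        return 0
    # The exit code is stored in the second byte of the wait status.
    return status // 256


# A wait status of 512 corresponds to the child calling exit(2).
print(exit_code_from_popen_close(512))
print(exit_code_from_popen_close(None))
```

With subprocess no such conversion is needed, which is why comment 5 asks for the division to be applied only on the os.popen fallback path.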
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 11:19:04 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 2 Apr 2009 11:19:04 -0400 Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large stdout returned In-Reply-To: Message-ID: <200904021519.n32FJ4DC003715@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2804 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED OS/Version|Linux |All Resolution| |FIXED ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 11:19 EST ------- Hi Cymon, I've updated the unit test for Windows on Python 2.3 through 2.6 (had to move some file deletions to the end, and watch out for extra error message variations). Windows also deadlocks on this example when using subprocess - the test should normally take about four seconds in total (depending on your computer's speed of course). Using os.popen avoids the deadlock (but can't cope with file names with spaces). Your fix in comment 5 also works :) So, now we have a unit test which catches this deadlock on all three operating systems, which confirms your fix which works on all three. I've checked it into CVS, and marked this bug as fixed. [I'm still not sure what is happening with the return values - if you look into this further please raise a new bug for it.] Thanks! Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 11:32:39 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 2 Apr 2009 11:32:39 -0400 Subject: [Biopython-dev] [Bug 2806] New: Possible deadlock (hang) in Bio.Application using subprocess wait() Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2806 Summary: Possible deadlock (hang) in Bio.Application using subprocess wait() Product: Biopython Version: Not Applicable Platform: PC OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk CC: cymon.cox at gmail.com See Bug 2804 which demonstrated a reproducible hang on Windows, Linux and Mac from the subprocess .wait() method, and a work around. Bio.Application may suffer from the same problem, and could be fixed with the same approach. Patch to follow ... Ideally we'd have a suitable unit test covering this - perhaps using Bio.EMBOSS? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 11:33:30 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 2 Apr 2009 11:33:30 -0400 Subject: [Biopython-dev] [Bug 2806] Possible deadlock (hang) in Bio.Application using subprocess wait() In-Reply-To: Message-ID: <200904021533.n32FXU67004756@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2806 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 11:33 EST ------- Created an attachment (id=1274) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1274&action=view) Patch to Bio/Application/__init__.py Use the .communicate() method instead of .wait() -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 2 15:18:56 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 2 Apr 2009 15:18:56 -0400 Subject: [Biopython-dev] [Bug 2734] db.load problem with postgresql and psycopg2 In-Reply-To: Message-ID: <200904021918.n32JIuXc023154@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2734 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |INVALID ------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 15:18 EST ------- As per comment 8, I'm going to assume Stephen had an old copy of Biopython on his machine, which would explain the error. In the absence of any further information there isn't anything we can do. Marking bug as invalid. Stephen - if you do work out what was going on, or if you still have a problem after sorting out any issue with multiple copies of Biopython installed, please do reopen this report. 
Thanks

Peter

From bugzilla-daemon at portal.open-bio.org Thu Apr 2 18:29:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 18:29:18 -0400
Subject: [Biopython-dev] [Bug 2807] New: Clustalw return codes
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2807

           Summary: Clustalw return codes
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: cymon.cox at gmail.com

see bug 2804

More on clustalw return codes:

Note return codes are the same whether using subprocess.returncode or
(os.popen().close() \3)

                                               clustalw1.81   clustalw2.09
                                               ------------   ------------
error: Bad command line option in the command:
  clustalw_bogus -INFILE=Fasta/f002                     127            127
error: can't open sequence file:
  clustalw -INFILE=no_file_present                        2            255
error: wrong format of input file:
  clustalw -INFILE=Phylip/hennigian.phy                   3            255
error: only one sequence in input:
  clustalw -INFILE=Fasta/f001                             4              0
=========================================================

Clustalw.__init__ tries to catch return codes 1, 2, 3, and 4; others get
caught generically.

I don't think it is possible to generate a return code 1 using 1.81 because the
interface doesn't allow ad hoc options to be added to the command line.
Invalid values of options are just ignored by clustalw and it aligns the
data anyway (i.e. return code 0).

Return codes 127 and 255 could be caught for newer versions and a more
informative error returned.
But given that there are 9 other clustalw versions between 1.81 (June 2003)
and the latest, 2.0.10 (Oct 2008), for which I haven't checked the return
codes, it might be better to just return a generic command line error if the
return value is > 0.

In the case where only one sequence is present, newer versions return code 0,
but a ValueError is then thrown when trying to parse the non-existent output
file (see comment in test_Clustalw_tools.py).

C.

From bugzilla-daemon at portal.open-bio.org Fri Apr 3 05:50:44 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 3 Apr 2009 05:50:44 -0400
Subject: [Biopython-dev] [Bug 2807] Clustalw return codes
In-Reply-To:
Message-ID: <200904030950.n339oiIx019752@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2807

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-03 05:50 EST -------
(In reply to comment #0)
> Clustalw.__init__ tries to catch return codes 1, 2, 3, and 4, others get
> caught generically.

With the CVS code, using clustalw1.81, is it definitely catching these errors
and raising specific IOErrors?

> I dont think it is possible to generate a return code 1 using 1.81 because
> interface doesnt allow ad hoc options to be added to the command line.

The Bio.Clustalw.do_alignment() function accepts any command line string, so
you should be able to feed it a clustalw command with invalid arguments.

> Invalid values of options are just ignore by clustalw and it aligns the
> data anyway (ie return code 0).

We'd have to look at the clustalw source code to confirm what should trigger
a return code of 1.

> Return codes 127 and 255 could be caught for newer versions and a more
> informative error returned.

Yes, that sounds sensible.
> But given that there are 9 other clustalw versions > between 1.81 (June 2003) and the latest 2.0.10 (Oct 2008 the latest) for which > I havent checked the return codes, it might be better to just return a generic > command line error if the return value is > 0. That also sounds sensible. > In the case where only one sequence is present, newer versions return code 0, > but throws a ValueError when trying to parse the non-existent output file (see > comment in test_Clustalw_tools.py). Maybe we should report that as a bug, I think clustalw2.0 is intended to be API compatible with clustalw1.x Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From thamelry at binf.ku.dk Fri Apr 3 09:31:05 2009 From: thamelry at binf.ku.dk (Thomas Hamelryck) Date: Fri, 3 Apr 2009 15:31:05 +0200 Subject: [Biopython-dev] PDB tidy script In-Reply-To: <320fb6e00903231405l479ddcc6of9cd0c1aa8fd98d4@mail.gmail.com> References: <320fb6e00903231405l479ddcc6of9cd0c1aa8fd98d4@mail.gmail.com> Message-ID: <2d7c25310904030631u56c642d6k83355d6bc4cd5d19@mail.gmail.com> Hi everybody, > I haven't been on this list long enough to know -- is Thomas still > > supporting the PDB module? Yes and no. First, I've been pretty busy with establishing a group here in Copenhagen, but it looks like I will have time for Bio.PDB again in the future. There's for example a set of classes dealing with RNA structure coming up. Just have to submit it. Second, I have no interest in doing anything beyond 3D stuff. I am not going to implement header parsing for example. I know many people have donated code, but in general this code is very messy and ad-hoc. The PDB parser is pretty lean, fast and quite stable now - IMO parsing the header should be the responsibility of a helper class, in order not to overload the 3D code with a lot of stuff that most people will not use. 
Also, the header info is for most purposes quite useless, especially in PDB files. It makes no sense to parse the PDB header in fact - if you need header info, use the MMCIF files. > If so, would he give his blessing to some more > > invasive changes to the PDB module, such as unifying PDBParser and > > parse_pdb_header? That separation has always seemed curiously vestigal to > > me. You could provide a uniform interface, but please keep the 3D data processing and the header processing in separate classes! The Structure object has functionality to be 'annotated', so you could transfer data from the header to the Structure object easily. > If you look back over the history, there initially was no header parsing, > it was a contribution from Kristian Rother, and I would agree, it is rather > disjoint from the rest of the code. One thing I personally wanted last > time I was working with PDB files was to have secondary structure > information (for them alpha and beta sheet lines in the header) > mapped onto the residue objects automatically. This is a good example of why header parsing is something of a red herring. You really want to recompute that using some decent program like DSSP or PSEA, or even an internal Bio.PDB procedure. But it's fine of course if you want to add this! I would suggest you try and get Thomas involved now for his input > on the design (before you start coding), but if need be press ahead > anyway for your own use, and he can always comment on your > public branch. I hope the two of you can work together on this, and > if/when Thomas does stand down (or delagate), you could then be > in an excellent position to take over as the Bio.PDB maintainer if > that's what you wanted. Sure, I'm open to this, but I'd like to stay involved if the 3D stuff is altered, even just to discuss new designs. 
Cheers,

-Thomas

From biopython at maubp.freeserve.co.uk Fri Apr 3 12:41:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 3 Apr 2009 17:41:04 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50 (beta)
Message-ID: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com>

On Tue, Mar 31, 2009 at 10:38 PM, Peter wrote:
> Hi all,
>
> OK guys, after a brief chat off the mailing list, I'm hoping to do the
> Biopython 1.50 beta release roughly this weekend, somewhere between
> Friday 4 and Monday 6 April. Until then please consider CVS "frozen"
> for anything other than documentation changes or unit test additions,
> or at a push really tiny changes. Once I'm ready to actually do the
> release, I'll send out an email requesting no further CVS commits.

I'm going to try and do the release tonight (in the next few hours),
so please consider CVS frozen until further notice.

Thanks,

Peter

From biopython at maubp.freeserve.co.uk Fri Apr 3 14:07:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 3 Apr 2009 19:07:58 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50 (beta)
In-Reply-To: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com>
References: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com>
Message-ID: <320fb6e00904031107q3df63df7q3569c22d4a521b7c@mail.gmail.com>

On Fri, Apr 3, 2009 at 5:41 PM, Peter wrote:
>
> I'm going to try and do the release tonight (in the next few hours),
> so please consider CVS frozen until further notice.
>

OK, it's done - uploaded, and tagged in CVS.

If you could all give it a quick test now, that would be great, especially
the Windows installers if possible, as I currently only have ready access to
the one Windows machine on which the installers were built.

I'll prepare the news entry and email announcement later on tonight,
based on the current NEWS file. If there is anything missing which
should be mentioned, please email me ASAP.
I'm happy for CVS to be used again to check in documentation changes,
but no code changes yet please.

Thanks

Peter

From tiagoantao at gmail.com Sat Apr 4 12:43:10 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Sat, 4 Apr 2009 17:43:10 +0100
Subject: [Biopython-dev] Merging branches
Message-ID: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>

Hi,

This might be a lame question but I am completely stuck and don't seem
to understand why.

I am trying to PARTIALLY merge 2 branches: my popgen branch with Giovanni's.
I want to import his changes to Bio/PopGen/Stats, but only that
(nothing in other Bio directories and, above all, not a new test).
These changes do not conflict, so I get no warning and everything gets in:
if I do a git-merge I get the whole bang.

Is there any way to just get a partial merge? In this case I only want
to merge a single sub dir (although, in general, one might just want
to import a single file).

Of course I could do 2 checkouts and copy files across on the local
filesystem, but would that not lose the history of connections between
the files?

Many thanks,
Tiago

--
"A man who dares to waste one hour of time has not discovered the value of life"
- Charles Darwin

From biopython at maubp.freeserve.co.uk Sat Apr 4 13:01:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 4 Apr 2009 18:01:53 +0100
Subject: [Biopython-dev] Merging branches
In-Reply-To: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
Message-ID: <320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>

2009/4/4 Tiago Antão:
> Is there any way to just get partial merge? In this case I only want
> to merge a single sub dir (although, in general one might just want
> to import a single file)

Can you cherry pick the changes you want? Github's fork queue
provides another approach to the same issue.
However, these both work on patches (individual commits) rather than
files/directories.

Peter

From tiagoantao at gmail.com Sat Apr 4 13:29:20 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Sat, 4 Apr 2009 18:29:20 +0100
Subject: [Biopython-dev] Merging branches
In-Reply-To: <320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
	<320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
Message-ID: <6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>

Me thinks I need to get a book on git and understand, once and for
all, the basic concepts. I am getting merge conflicts with cherry
picking and I don't even understand why.

Anyway it would be nice (but not fundamental) to merge just a single file.

2009/4/4 Peter :
> 2009/4/4 Tiago Antão:
>> Is there any way to just get partial merge? In this case I only want
>> to merge a single sub dir (although, in general one might just want
>> to import a single file)
>
> Can you cherry pick the changes you want? Github's fork queue
> provides another approach to the same issue. However, these both work
> on patches (individual commits) rather than files/directories.
>
> Peter
>

--
"A man who dares to waste one hour of time has not discovered the value of life"
- Charles Darwin

From biopython at maubp.freeserve.co.uk Sat Apr 4 15:06:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 4 Apr 2009 20:06:57 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50 (beta)
In-Reply-To: <320fb6e00904031107q3df63df7q3569c22d4a521b7c@mail.gmail.com>
References: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com>
	<320fb6e00904031107q3df63df7q3569c22d4a521b7c@mail.gmail.com>
Message-ID: <320fb6e00904041206yb0e4a29ja715a54faeeca28e@mail.gmail.com>

On Fri, Apr 3, 2009 at 7:07 PM, Peter wrote:
> I'm happy for CVS to be used again to check in documentation changes,
> but no code changes yet please.
Also I should have said before: those with CVS access, please feel free to
add more unit tests. I've started work on one using the EMBOSS tools,
to check not only the command line wrappers in Bio.Emboss but also our
parsers.

I'm repeating myself, but if you have some new code you'd like to check
in while CVS is "frozen" for the release process, this is a nice
chance to try playing with git and github ;)

Peter

From bartek at rezolwenta.eu.org Sun Apr 5 05:49:14 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Sun, 5 Apr 2009 11:49:14 +0200
Subject: [Biopython-dev] Merging branches
In-Reply-To: <6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
	<320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
	<6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
Message-ID: <8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>

Hi Tiago,

2009/4/4 Tiago Antão :
> Me thinks I need to get a book on git and understand, once and for
> all, the basic concepts. I am getting merge conflicts with cherry
> picking and I don't even understand why
>

If you could be a bit more specific (providing the files and revision numbers
would be great), then it would be easier to help. I know it is extra work,
but we need some info, also to improve our wiki documents.

> Anyway it would be nice (but not fundamental) to merge just a single file.
>

This is one of the fundamental changes between CVS and git. CVS uses files as
the atomic piece of data, while git works with changesets (commits).
This means that if you only need a part of what was committed as a
big changeset, you will need to put extra effort into selecting what
you need.

>> 2009/4/4 Tiago Antão:
>>> Is there any way to just get partial merge? In this case I only want
>>> to merge a single sub dir (although, in general one might just want
>>> to import a single file)

Looking at specific files is not the default way things work in git.
The idea is that if someone makes a single commit, it is an atomic
contribution that is either to be accepted or not. You can of course
create a diff file and then split it into specific files.
I'll look into possible easier ways of doing it.

cheers
Bartek

From eric.talevich at gmail.com Sun Apr 5 12:47:39 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 5 Apr 2009 12:47:39 -0400
Subject: [Biopython-dev] Merging branches
In-Reply-To: <8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
	<320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
	<6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
	<8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
Message-ID: <3f6baf360904050947m5d9ec75eh18d64c53b8d9e2a6@mail.gmail.com>

2009/4/5 Bartek Wilczynski
> Hi Tiago,
>
> >> 2009/4/4 Tiago Antão:
> >>> Is there any way to just get partial merge? In this case I only want
> >>> to merge a single sub dir (although, in general one might just want
> >>> to import a single file)
>
> Looking at specific files is not the default way things work in git.
> The idea is that if
> someone makes a single commit, it is an atomic contribution that is
> either to be
> accepted or not. You can of course create a diff file and then split
> it into specific files.
> I'll look into possible easier ways of doing it.
>
> cheers
> Bartek
>

You can get a list of the changes that affected a single subdirectory by
giving the directory name to git log, e.g. "git log Bio/PopGen/Stats/". Those
commits don't necessarily just affect Bio/PopGen/Stats, but assuming there
aren't any single-commit code bombs, then it's probably a good idea to take
those associated modifications anyway.
You can also give a range of versions to git-log to get the commits that
occurred since Gio's branch diverged from yours -- it looks something like
"git log HEAD..[gio's branch] -- [path]"; details are in the help page for
git-rev-parse. Then you can use that list of commits for cherry-picking, in
the original order.

If it's essential to get just a specific file at a specific version, you can
find the SHA1 hash for that blob (probably easiest through github) and use
git-show with a redirect to the file in your tree, or a temporary filename.
This loses the history, though.

Cheers,
Eric

From tiagoantao at gmail.com Mon Apr 6 06:35:47 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Mon, 6 Apr 2009 11:35:47 +0100
Subject: [Biopython-dev] Merging branches
In-Reply-To: <8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
	<320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
	<6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
	<8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
Message-ID: <6d941f120904060335m380ac820k97a558e8332fdf63@mail.gmail.com>

Hi,

2009/4/5 Bartek Wilczynski :
> If you could be a bit more specific (providing the files and revision numbers
> would be great), than it would be easier to help. I know it is an extra work,
> but we need some info, also to improve our wiki documents.
>

I would like to replace this:
http://github.com/tiagoantao/biopython-popgen-test/blob/fa5ebc23e7aaabce94ae594d9a4f83be9bf90215/Bio/PopGen/Stats/Simple.py
With this:
http://github.com/dalloliogm/biopython/blob/cbaf6249cb91ed505cb575f09c2eaef3809872b9/Bio/PopGen/Stats/Simple.py
It would be cool not to lose the history relationship (I suppose that
would be the good practice).

> This means, that if you only need a part of what was committed as a
> big changeset,
> you will need to put an extra effort into selecting what you need.
But how do you do that (other than manually copying files)?
Cherry pick seems to be commit based...

> Looking at specific files is not the default way things work in git.
> The idea is that if
> someone makes a single commit, it is an atomic contribution that is
> either to be
> accepted or not. You can of course create a diff file and then split
> it into specific files.
> I'll look into possible easier ways of doing it.

The point is: wanting to use part of a commit without losing history.
In my case, I don't want to import a test_PopGen_Fst file that Gio has.
That being said, I don't think this is a big deal. I just wanted to
preserve the history connectivity between repositories. I think we
can just use the old-fashioned method of copying some files around.
But it would be good to know if there is a "best practice" (which I
could not find out).

Tiago
PS - I might have to undergo surgery this week. If I stop responding
for a long time, my apologies in advance, but I am probably recovering.

From biopython at maubp.freeserve.co.uk Mon Apr 6 09:25:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 6 Apr 2009 14:25:29 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
Message-ID: <320fb6e00904060625v4a49da2au76159eae18f707eb@mail.gmail.com>

Brad has been working on his GFF parsing code - see progress reports
on his blog http://bcbio.wordpress.com/ and his code on github,
http://github.com/chapmanb/bcbb/tree/master/gff

Potentially this could make it into Biopython 1.51, and I was just
thinking about where the code would go. Brad is supporting both GFF3
and the loosely defined GFF2 variants, so Bio.GFF seems a good place.
There would also be a wrapper under Bio.SeqIO for loading GFF files as
SeqRecord objects (I haven't played with Brad's code, but it can do
this already).

However, we already have a Bio.GFF module from Michael Hoffman created
back in 2002 which accesses MySQL General Feature Format (GFF)
databases created with BioPerl.
Perhaps we should poll the main discussion list now, and if there are
no responses from people using it, we could deprecate Bio.GFF for
Biopython 1.50? Under our current deprecation policy we shouldn't then
remove Bio.GFF until Biopython 1.52 at the earliest,
http://biopython.org/wiki/Deprecation_policy

What do you think Brad? How about using Bio.GFF3 instead?

Peter

From chapmanb at 50mail.com Mon Apr 6 18:08:26 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 6 Apr 2009 18:08:26 -0400
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904060625v4a49da2au76159eae18f707eb@mail.gmail.com>
References: <320fb6e00904060625v4a49da2au76159eae18f707eb@mail.gmail.com>
Message-ID: <20090406220826.GH43636@sobchak.mgh.harvard.edu>

Peter;
Thanks for the plug. GFF parsing is moving along; the two main features
I would like to finish before proposing it for inclusion are writing of
GFF files and putting GFF into BioSQL with the nested features. The code
does work for parsing, and I've been using it for some real projects;
anyone who would like to test it is more than welcome.

As far as the current Bio.GFF goes, that is a bit of a conundrum. The
current code does work, and for some cases it would be nice to have
the utility of working with GFF from a database. Eventually BioSQL
from GFF may supplant that, but that should be finished and tested
first. I would argue for keeping it in.

However, it is a bit confusing if someone is looking for a parser. It
would make more sense if it lived under a namespace like Bio.GFF.DB.
What do you think about adding a warning that it is going to move to
a new namespace and then moving it there, if we don't hear any
complaints, for 1.51? This is less cumbersome than a removal for
users since it's just an import change.
Brad

From mjldehoon at yahoo.com Tue Apr 7 07:32:52 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 7 Apr 2009 04:32:52 -0700 (PDT)
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090406220826.GH43636@sobchak.mgh.harvard.edu>
Message-ID: <316000.69837.qm@web62407.mail.re1.yahoo.com>

Hi Brad,

Thanks for your work on the GFF parser; I'm dealing with GFF files quite a lot.
Could you maybe give a simple example of how to use your GFF parser, once it's included into Biopython?

--Michiel.

From bartek at rezolwenta.eu.org Tue Apr 7 08:35:21 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 7 Apr 2009 14:35:21 +0200
Subject: [Biopython-dev] Merging branches
In-Reply-To: <6d941f120904060335m380ac820k97a558e8332fdf63@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
	<320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
	<6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
	<8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
	<6d941f120904060335m380ac820k97a558e8332fdf63@mail.gmail.com>
Message-ID: <8b34ec180904070535r3a6f23e8w9b917f7592930eda@mail.gmail.com>

Hi,

2009/4/6 Tiago Antão :
>> This means, that if you only need a part of what was committed as a
>> big changeset,
>> you will need to put an extra effort into selecting what you need.
>
> But how do you do that (other than manually copying files)?

I think that in this case you need to do this manually. If you care only
about one file, copying it is the easiest option.
> Cherry pick seems to be commit based...

In fact the whole of git is commit based. It's not tracking files as such,
but blobs of data.

>I would like to replace this:
>http://github.com/tiagoantao/biopython-popgen-test/blob/fa5ebc23e7aaabce94ae594d9a4f83be9bf90215/Bio/PopGen/Stats/Simple.py
>With this:
>http://github.com/dalloliogm/biopython/blob/cbaf6249cb91ed505cb575f09c2eaef3809872b9/Bio/PopGen/Stats/Simple.py
>It would be cool not to loose the history relationship (I suppose that
>would be the good practice).

Indeed, keeping history is the right thing and it was one of the reasons
to switch to git. It would be perfect if Giovanni could "redo" some of
his commits and split them into smaller operations, so that cherry picking
commits would be possible. I know it's a pain...

> The point is: wanting to use part of a commit without loosing history.
> In my case, I dont want to import a test_PopGen_Fst file that Gio has.
> That being said, I dont think this is a big deal. I was just to
> preserve the history connectivity between repositiories. I think we
> can just use the old fashioned method of copying some files around.
> But it would be good to know if there is a "best practice" (which, I
> could not find out)

As far as I can tell, there is no way you could take only a part of a commit.
The best practice is to make smaller, atomic commits. It has many advantages:
- it's easier to document a smaller change (I think it makes up for
  potentially more work because of more commits)
- you can then "undo" small locally committed changes before pushing
  them to the public repo
- cherry picking of nicely documented small changes is an easy job

In this particular case of changes in tests, I really think changes to one
test should be committed separately from changes in other tests.
cheers Bartek From tiagoantao at gmail.com Tue Apr 7 12:43:49 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 7 Apr 2009 17:43:49 +0100 Subject: [Biopython-dev] PopGen Stats Message-ID: <6d941f120904070943n7de7afa7m262dd4f4c0149cb@mail.gmail.com> Hi, I've started a page documenting the effort to implement statstics here http://biopython.org/wiki/PopGen_dev_Statistics anyone is welcomed to participate. I was expecting to have a personal hurdle during this week, which didn't happen. So I expect to be working heavily on this (finally). Tiago -- "A man who dares to waste one hour of time has not discovered the value of life" - Charles Darwin From peter at maubp.freeserve.co.uk Tue Apr 7 15:38:50 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Apr 2009 20:38:50 +0100 Subject: [Biopython-dev] Fwd: [blast-announce] BLAST 2.2.20 now available In-Reply-To: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov> References: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov> Message-ID: <320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com> Hi all, There is a new version of BLAST out - we'll need to check if the NCBI's online server has been updated (if so, our unit test test_NCBI_qblast.py should catch any obvious issues). We'll also want to check the standalone version of BLAST is OK. Point (2) below sounds interesting, previously using BLAST databases with spaces in the path on Windows was rather hairy. Peter ---------- Forwarded message ---------- From: mcginnis Date: Apr 7, 2009 1:50 PM Subject: [blast-announce] BLAST 2.2.20 now available To: blast-announce at ncbi.nlm.nih.gov New BLAST binaries are available on the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/) The list of changes are: 1.) Ungapped blastn searches allow arbitrary reward/penalty scores. 2.) Spaces are allowed in database pathnames on windows 3.) Seedtop now has gilist support. 4.) 
Fix a bug that caused the number and order of queries to affect blastx results. 5.) Modified the 2-hit blastn algorithm so that no overlap is allowed between hits. From jacobporter2002 at yahoo.com Tue Apr 7 22:27:21 2009 From: jacobporter2002 at yahoo.com (Jacob Porter) Date: Tue, 7 Apr 2009 19:27:21 -0700 (PDT) Subject: [Biopython-dev] Phylogeny modules for BioPython Message-ID: <296822.1198.qm@web33706.mail.mud.yahoo.com> Hi all, My name is Jacob Porter, and I am a graduate student in the math department at UC Davis. I've done work before on phylogeny inference using so-called "phylogenetic invariants" that can be found at the website: http://www.shsu.edu/~ldg005/small-trees/ It appears to me that BioPython doesn't have much support for phylogeny inference and tools related to phylogeny inference. I have applied to the Google Summer of Code (12 weeks of working part-time on a programming assignment), and I am looking for a project that could work with BioPython as I see a lot of potential in it. I can bring my expertise on phylogeny inference to this project to add some support for this. I need three things from the community ASAP: 1) Ideas as to which of my several project ideas are the most useful to the BioPython community 2) Information as to what is already included in BioPython concerning phylogeny inference and related tools 3) A mentor that will help me with the project (and possibly work in conjunction with Nascent (https://www.nescent.org/wg_phyloinformatics/Main_Pagementors) I would need a 12 -week schedule of tasks for the project (TBD), and answers to questions related to developing for BioPython. (I've worked with Python a lot before, so I shouldn't need much help with Python so much as I need help with BioPython). Project 1: Add support for popular phylogeny representation standards such as DND files. Give the ability to read and write such files. Convert between such files.
I need help in picking which standards to use and need help in picking which operations on these files is the most useful. Project 2: Add wrappers for modern (hopefully high throughput and accurate) phylogeny inference software written in C++/C. Examples of such software include neighbor-joining, MJOIN software (similar to neighbor-joining) (http://bio.math.berkeley.edu/mjoin/), Garli (http://www.molecularevolution.org/si/software/garli/), treeSVD (http://www.stat.uchicago.edu/~eriksson/software.html), and maximum parsimony. I would like to know which sort of phylogeny inference software is the most useful in your opinion. I assume no wrappers for such software exist. Project 3: Add analytic algorithms that use phylogeny in some way. Examples include bootstrapping and protein-protein interaction inference algorithms. (i.e. "Inferring protein interactions from phylogenetic distance matrices" by Gertz et al.) I need information as to what sort of algorithms would be useful. Project 4: Enhance phylogeny inference software further. MJOIN has bugs (I think it returns negative distances in some cases, and some modifications to it that I developed using phylogenetic invariants are seg-faulting). Not all of these ideas will probably be able to be developed, so I need information as to what might be the most useful. I was thinking of focusing on Project 1 and Project 2 for the initial phase. Any information will be appreciated, and any mentorship will be great. I would like a response quickly, so that I can inform Nascent of my plans.
Thanks, Jacob Porter UC Davis From p.j.a.cock at googlemail.com Wed Apr 8 04:54:35 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 8 Apr 2009 09:54:35 +0100 Subject: [Biopython-dev] Phylogeny modules for BioPython In-Reply-To: <296822.1198.qm@web33706.mail.mud.yahoo.com> References: <296822.1198.qm@web33706.mail.mud.yahoo.com> Message-ID: <320fb6e00904080154j433ef69cmd99b847240ee5b39@mail.gmail.com> On 4/8/09, Jacob Porter wrote: > > Hi all, > > My name is Jacob Porter, and I am a graduate student in the math > department at UC Davis. I've done work before on phylogeny inference > ... > It appears to me that BioPython doesn't have much support for > phylogeny inference and tools related to phylogeny inference. I'm sure there is room for improvement. > I have applied to the Google Summer of Code (12 weeks of > working part-time on a programming assignment), and I am > looking for a project that could work with BioPython as I see > a lot of potential in it. I can bring my expertise on phylogeny > inference to this project to add some support for this. > > I need three things from the community ASAP: > > 1) Ideas as to which of my several project ideas are the > most useful to the BioPython community Personally, I might pick command line wrappers for existing command line tools. However, these don't actually make anything new possible, as writting your own command line is already fairly easy. This in itself wouldn't be that much work either. > 2) Information as to what is already included in BioPython > concerning phylogeny inference and related tools Look at Bio.Nexus, plus somewhat related, Bio.AlignIO. > 3) A mentor that will help me with the project (and > possibly work in conjunction with Nascent > (https://www.nescent.org/wg_phyloinformatics/Main_Pagementors) > I would need a 12 -week schedule of tasks for the > project (TBD), and answers to questions related to > developing for BioPython. 
(I've worked with Python > a lot before, so I shouldn't need much help with > Python so much as I need help with BioPython). Brad Chapman may be willing to mentor a GSoC student, have a look back of the recent email discussions here. In particular, Nick Matzke has already expressed some interest in Biogeographical and community phylogenetics for Biopython (there is a wiki page on open-bio.org on this). > Project 1: > Add support for popular phylogeny representation > standards such as DND files. Give the ability to > read and write such files. Convert between such > files. I need help in picking which standards to use > and need help in picking which operations on these > files is the most useful. We have this already in Bio.Nexus, but there is still room for improvement - see Bug 2788 for example. > Project 2: > Add wrappers for modern (hopefully high throughput > and accurate) phylogeny inference software written in > C++/C. Examples of such software include > neighbor-joining, MJOIN software (similar to > neighbor-joining) (http://bio.math.berkeley.edu/mjoin/), > Garli (http://www.molecularevolution.org/si/software/garli/), > treeSVD (http://www.stat.uchicago.edu/~eriksson/software.html), > and maximum parsimony. I would like to know which > sort of phylogeny inference software is the most useful > in your opinion. I assume no wrappers for such software > exist. Well, Bio.Nexus is a great help with certain tools. There is scope for adding more command line wrappers though (I like quick-join and and also quicktree for NJ tree building). > Project 3: > Add analytic algorithms that use phylogeny in some > way. Examples include bootstrapping and protein-protein > interaction inference algorithms. (i.e. "Inferring protein > interactions from phylogenetic distance matrices" by > Gertz et al.) I need information as to what sort of > algorithms would be useful. I feel that this is still very much an active area of research, and there are no clear gold standards. 
However, perhaps some published algorithms may be worth re-implementing in Biopython. I would still tend to favour more general work for Biopython that would support people implementing any/their own algorithm. > Project 4: > Enhance phylogeny inference software further. > MJOIN has bugs (I think it returns negative distances > in some cases, and some modifications to it that I > developed using phylogenetic invariants are seg-faulting). Fixing any bug in MJOIN sounds like a good idea - but doesn't really affect Biopython directly. > Not all of these ideas will probably be able to be > developed, so I need information as to what might > be the most useful. I was thinking of focusing on > Project 1 and Project 2 for the initial phase. > > Any information will be appreciated, and any > mentorship will be great. I would like a response > quickly, so that I can inform Nascent of my plans. Peter. P.S. Its Biopython, not BioPython From chapmanb at 50mail.com Wed Apr 8 08:32:26 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Apr 2009 08:32:26 -0400 Subject: [Biopython-dev] Phylogeny modules for BioPython In-Reply-To: <320fb6e00904080154j433ef69cmd99b847240ee5b39@mail.gmail.com> References: <296822.1198.qm@web33706.mail.mud.yahoo.com> <320fb6e00904080154j433ef69cmd99b847240ee5b39@mail.gmail.com> Message-ID: <20090408123226.GL43636@sobchak.mgh.harvard.edu> Jacob; Thanks much for your interest in Biopython for Summer of Code; glad to see a discussion here about your proposal. Peter's comments are great; I will add to them from the SoC perspective. > > I have applied to the Google Summer of Code (12 weeks of > > working part-time on a programming assignment) SoC is a full time commitment for the summer. Your proposal also lists some conflicts (classes, other research) for the summer months. On your updated proposal you should be explicit about these and describe how you plan to make up time you miss during the first two weeks of the quarter. 
More generally, your proposal needs a detailed plan of deliverables on a week to week basis over the project timeline, starting with coding on May 23rd: http://socghop.appspot.com/document/show/program/google/gsoc2009/timeline This is the last hour for refining proposals, so you will need to update your proposal quickly for us to still have time to consider it. I would recommend copying your current proposal to a Google Doc, adding all of the specifics needed, and then submitting a link to the open document as a comment to your initial proposal. > Brad Chapman may be willing to mentor a GSoC student, have a look back > of the recent email discussions here. In particular, Nick Matzke has > already expressed some interest in Biogeographical and community > phylogenetics for Biopython (there is a wiki page on open-bio.org on > this). I am definitely willing to help; spots will be very competitive throughout the program. Echoing Peter's comments, I would put together a project proposal that tackles: - Improving parsing support in Bio.Nexus, based on existing code and bug reports, and other suggestions you might have. - Providing code wrapping for other phylogeny software. Since the usefulness of different algorithms depends heavily on the context in which it is used, you will not find a consensus about which program is most useful. My suggestion is to suggest wrappers for several useful programs covering the spectrum of possibilities. In additions to the ones you listed, a couple others are: RAxML http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm FastTree http://www.microbesonline.org/fasttree/index.html - A higher level API over the parsing and command line program support that helps users with specific phylogenetic tasks. Based on your experience and input from the Biopython community of users, this would have the goal of providing a simple way to do common tasks. 
This should be a combination of code to surround repetitive items, and cookbook style documentation to help people with specific phylogenetic problems. Other general suggestions: - Tests. Please describe your plans to write unit tests for all the code you write. - Documentation. Please do leave time in your project plan to fully document using your proposed code. - Projects 3 and 4, as Peter suggests, are out of the scope of GSoC. 3, specifically, is more of a research project. Finally, a few meta-items from your e-mail meant as helpful advice: > It appears to me that BioPython doesn't have much support for > phylogeny inference and tools related to phylogeny inference. I understand this is an attempt to provide motivation for your proposal, but you should do so in a way that does not disparage the work of the people you are soliciting advice from. Your request would be better received if you described it in the context of improving existing phylogenetic support in Biopython. > I need three things from the community ASAP: [...] > I would like a response quickly No one likes to be told what to do, much less a group you are requesting help and hopefully a job from. Again, you should think about how your phrasing will be interpreted by those reading it. > Nascent You twice misspelled this: NESCent. Mistakes happen, but it reflects badly on your commitment to the project to not be able to spell the name of the organization you would like to work with. These are the small things you should be careful and double check.
Thanks again for your interest and looking forward to seeing your revised project plan,
Brad

From chapmanb at 50mail.com Wed Apr 8 08:49:08 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Apr 2009 08:49:08 -0400 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <316000.69837.qm@web62407.mail.re1.yahoo.com> References: <20090406220826.GH43636@sobchak.mgh.harvard.edu> <316000.69837.qm@web62407.mail.re1.yahoo.com> Message-ID: <20090408124908.GN43636@sobchak.mgh.harvard.edu>

Hi Michiel;

> Thanks for your work on the GFF parser; I'm dealing with GFF files
> quite a lot. Could you maybe give a simple example of how to use your
> GFF parser, once it's included into Biopython?

Awesome; I'm glad it will be useful. I'd definitely welcome any feedback you have on the API or implementation. At this stage we can be flexible and hopefully get it finalized before it hits Biopython.

I will get some user documentation together soon, but here is some basic usage. To parse an entire GFF file, getting all features at once:

    from BCBio.GFF.GFFParser import GFFAddingIterator

    gff_iterator = GFFAddingIterator()
    rec_dict = gff_iterator.get_all_features(gff_file)

The returned dictionary is like a dictionary from SeqIO.to_dict; keys are ids and values are SeqRecords. You can also seed the parser with an initial dictionary containing sequences or other features, and the features from the GFF file will be added to those records:

    from Bio import SeqIO

    with open(seq_file) as seq_handle:
        seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
    gff_iterator = GFFAddingIterator(seq_dict)

If a file is very large, you have two ways of limiting the size of items parsed. The first is to specify which items you are interested in and return only those.
This code will parse out coding transcripts on chromosome I:

    cds_limit_info = dict(
        gff_source_type = [('Coding_transcript', 'gene'),
                           ('Coding_transcript', 'mRNA'),
                           ('Coding_transcript', 'CDS')],
        gff_id = ['I']
    )
    rec_dict = gff_iterator.get_all_features(gff_file, limit_info=cds_limit_info)

The second is to use an iterator over a section of the file:

    for rec_dict in gff_iterator.get_features(gff_file, target_lines=1000000):
        # handle partial rec dictionary of first 1000000 lines
        pass

Finally, there is an interface to examine a GFF file and figure out useful ways to limit it. This will give you a dictionary of all possible ways to limit a file along with the counts in each:

    gff_examiner = GFFExaminer()
    possible_limits = gff_examiner.available_limits(gff_file)

and this will give a dictionary of the parent-child relationships in the file:

    gff_examiner = GFFExaminer()
    pc_map = gff_examiner.parent_child_map(gff_file)

Since GFF providers tend to differ in how they structure their information, this helps get a quick overview of the file to determine how to manage it. Happy to hear about thoughts you might have.

Thanks,
Brad

> > --Michiel. > > > --- On Mon, 4/6/09, Brad Chapman wrote: > > > From: Brad Chapman > > Subject: Re: [Biopython-dev] Bio.GFF and Brad's code > > To: biopython-dev at lists.open-bio.org > > Date: Monday, April 6, 2009, 6:08 PM > > Peter; > > Thanks for the plug. GFF parsing is moving along; the main > > feature > > two things I would like to finish before proposing it for > > inclusion > > are writing of GFF files and putting GFF into BioSQL with > > the nested > > features. The code does work for parsing, and I've been > > using it for > > some real projects; anyone who would like to test it is > > more than > > welcome. > > > > As far as the current Bio.GFF, that is a bit of a > > conundrum. The > > current code does work and for some cases it would be nice > > of having > > the utility of working with GFF from a database.
Eventually > > BioSQL > > from GFF may supplant that, but that should be finished and > > tested > > first. I would argue for keeping it in. > > > > However, it is a bit confusing if someone is looking for a > > parser. It > > would make more sense if it lived under a namespace like > > Bio.GFF.DB. > > What do you think about adding a warning that it is going > > to move to > > a new namespace and then moving it there, if we don't > > hear any > > complaints, for 1.51? This is less cumbersome than a > > removal for > > users since it's just an import change. > > > > Brad > > > > > > > > > Brad has been working on his GFF parsing code - see > > progress reports > > > on his blog http://bcbio.wordpress.com/ and his code > > on github, > > > http://github.com/chapmanb/bcbb/tree/master/gff > > > > > > Potentially this could make it into Biopython 1.51, > > and I was just > > > thinking about where the code would go. Brad is > > supporting both GFF3 > > > and the loosely defined GFF2 variants, so Bio.GFF > > seems a good place. > > > There would also be a wrapper under Bio.SeqIO for > > loading GFF files as > > > SeqRecord objects (I haven't played with > > Brad's code, but it can do > > > this already). > > > > > > However, we already have a Bio.GFF module from Michael > > Hoffman created > > > back in 2002 which accesses MySQL General Feature > > Format (GFF) > > > databases created with BioPerl. Perhaps we should > > poll the main > > > discussion list now, and if there are no responses > > from people using > > > it, we could deprecate Bio.GFF for Biopython 1.50? > > Under our current > > > deprecation policy we shouldn't then remove > > Bio.GFF until Biopython > > > 1.52 at the earliest, > > http://biopython.org/wiki/Deprecation_policy > > > > > > What do you think Brad? How about using Bio.GFF3 > > instead? 
> > > > > > Peter > > > _______________________________________________ > > > Biopython-dev mailing list > > > Biopython-dev at lists.open-bio.org > > > > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > From bugzilla-daemon at portal.open-bio.org Wed Apr 8 18:55:59 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Apr 2009 18:55:59 -0400 Subject: [Biopython-dev] [Bug 2808] New: Bio.SeqIO "ig" format parser doesn't deal with optional 1 terminator Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2808 Summary: Bio.SeqIO "ig" format parser doesn't deal with optional 1 terminator Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk While working on new unit test test_Emboss.py I noticed that EMBOSS seqret creates ig files where the sequence includes a terminal digit one. Further research online suggests this is an optional feature of the file format, although not commonly used. See: http://bmerc-www.bu.edu/needle-doc/latest/seq-formats.html#seq-file-format The Bio.SeqIO "ig" parser should be aware of the (optional) terminal "1" marker, and not include it in the returned sequence. Perhaps we should even add this when writing the files. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From chapmanb at 50mail.com Fri Apr 10 09:10:34 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 10 Apr 2009 09:10:34 -0400 Subject: [Biopython-dev] Invitation for Biopython news coordinators In-Reply-To: <49DD5575.4040901@student.otago.ac.nz> References: <20090406230542.GK43636@sobchak.mgh.harvard.edu> <49DD5575.4040901@student.otago.ac.nz> Message-ID: <20090410131034.GH54672@sobchak.mgh.harvard.edu> David; Thanks for taking the time to write; it is great to hear that you are interested. Copying this to the dev list so others can comment and you can feel free to discuss as much as you want. > I'd be keen to help spread the good word about bio-python, I'm a very > novice programmer who has been using the tools to work on some 454 > transcriptome data. I will probably never be a good enough programmer to > contribute code to the project so would see this as a way to "give > something back". Perfect. Getting involved is the first step; you'd be surprised how much you can learn just by taking on new tasks. I started helping with Biopython by writing documentation. > For me as a n00b the most useful resource by far has been the cookbook - > seeing some working scripts that I could change to suit my ends has > helped me get to the point that I can write much more generalised code > for my project 'from scratch'. To that end I think it would be really > helpful to highlight work that other people have done, either published > or made available by authors, with a little detail on the questions > and the way BioPython was used to get at them. We could extend it to > show some "use cases" for BioPython working with other programs or how > new features can be used once they are included in the main release. > > To me the most obvious way of presenting such information would be a > blog, we could invite authors and developers to make short posts and > failing that I'd be happy write up posts summarising published research. 
> We could also try an aggregate blogs from the devs and anyone else > talking about biopython "in the wild". This sounds great. You are welcome to use the twitter account, news posts, the wiki, or a blog -- however you see fit. For your aggregation idea, you might want to take a look at friendfeed. It's pretty simple to set up a room and pull in RSS feeds, twitter postings, and what not. There is a Python for Bioinformatics room: http://friendfeed.com/rooms/python-for-bioinformatics Most feeds come from general Python sources so it is a bit more broad, but is a good starting place. I know some of the admins (Chris, Paulo, Andrew) are around here, and may want to chime in. For publications, Peter has done a lot of work on identifying papers that use Biopython: http://biopython.org/wiki/Publications Building on this to include short reusable examples from the research would be very useful. > Anyway, those are a few ideas, I'm definitely keen to help out and to > take on board any other ideas that are out there. Great, let us know how you want to get started. Feel free to start with something small and expand from there. Peter can help out with account information for twitter; if you need other things just ask away. Brad > Cheers, > David > > Brad Chapman wrote: > > Biopythonistas; > > Communication is a key component of successful open source projects. > > The challenges of distributed programming by volunteers can be > > overcome by ensuring that the whole community is aware of > > interesting discussions, new contributions, and development goals. > > Traditionally, this communication has happened through our mailing > > lists, wiki pages, and bug tracking system. While these will > > continue to to be useful resources, new methods of disseminating > > information are changing how we interact through the web. > > > > I'd like to issue an invitation for anyone interested in helping > > revolutionize how Biopython news is disseminated. 
We are looking for > > contributors from the community to brainstorm new ways to make the > > discussions that happen at biopython.org accessible. You would > > actively follow development here and on the development lists and > > distill this information into useful quick bullet points for those > > interested in Biopython but too busy to follow detailed discussions. > > > > We are proposing two ways to do this: > > > > - Monthly highlights on our news server: > > http://news.open-bio.org/news/category/obf-projects/biopython/ > > The RSS feed from these posts are currently widely distributed around the > > internet. > > > > - More frequent pointers to interesting discussions or other items > > of interest happening in Biopython through our Twitter account: > > http://twitter.com/biopython > > > > This is an opportunity for those of you who are looking to become > > more involved, and would like to learn more about Biopython by > > following all of the coding activity more closely. The position is > > very flexible and we are happy to have one or more people take it > > on; we would also encourage you to be as creative as you want in > > doing so. > > > > I see this as an chance to both provide information and to highlight > > the great work people do at Biopython. If you are interested in > > taking on this role please respond with your ideas. 
Thanks for your > > interest, > > > > Brad > > _______________________________________________ > > BioPython mailing list - BioPython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > From bugzilla-daemon at portal.open-bio.org Fri Apr 10 10:13:58 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 10 Apr 2009 10:13:58 -0400 Subject: [Biopython-dev] [Bug 2809] New: Adding startswith and endswith methods to the Seq object Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2809 Summary: Adding startswith and endswith methods to the Seq object Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk OtherBugsDependingO 2351 nThis: As part of making the Seq object more like the Python string (Bug 2351), we need alphabet aware startswith and endswith methods. Patch to follow. There are many possible use cases for this. One example which prompted me to work on this was taking SeqRecord objects from sequencing reads (a FASTQ file read in with Bio.SeqIO) where some include a PCR primer associated prefix/suffix which I want to strip off (by slicing the SeqRecord). To do this I need to know if a given SeqRecord's sequence starts with (or ends with) a given primer sequence (or tuple of primer sequences). Current work around, str(record.seq).startswith(prefix) Patch to follow, which will allow record.seq.startswith(prefix) directly. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
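[Editorial sketch, not part of the original report.] The workaround described in Bug 2809, calling str(record.seq).startswith(prefix) and then slicing, can be illustrated roughly as follows. This is a minimal sketch using plain Python strings so it runs without Biopython; the helper name strip_primers and the example primer sequences are hypothetical, and it is not the attached patch. With a real SeqRecord one would presumably pass str(record.seq) and slice the record by the same offsets.

```python
def strip_primers(sequence, prefixes=(), suffixes=()):
    """Return the sequence with at most one matching prefix and one suffix removed.

    `sequence` is converted with str(), mirroring the str(record.seq)
    workaround from the bug report. Hypothetical helper for illustration.
    """
    seq = str(sequence)
    for prefix in prefixes:
        # Skip empty primers so we never "match" on nothing.
        if prefix and seq.startswith(prefix):
            seq = seq[len(prefix):]
            break
    for suffix in suffixes:
        if suffix and seq.endswith(suffix):
            seq = seq[:-len(suffix)]
            break
    return seq


# Made-up read: a 5' primer, an insert, and a 3' primer.
read = "GATTACA" + "ACGTACGT" + "TTGACC"
trimmed = strip_primers(read, prefixes=("GATTACA", "CCCGGG"), suffixes=("TTGACC",))
print(trimmed)
```

Python's own str.startswith and str.endswith also accept a tuple of candidates, which is the behaviour the report asks the Seq methods to mirror; the explicit loop is used here only because the length of the matched primer is needed for slicing.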
From bugzilla-daemon at portal.open-bio.org Fri Apr 10 10:13:59 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 10 Apr 2009 10:13:59 -0400 Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string? In-Reply-To: Message-ID: <200904101413.n3AEDx5I004913@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2351 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn| |2809 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 10 10:15:27 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 10 Apr 2009 10:15:27 -0400 Subject: [Biopython-dev] [Bug 2809] Adding startswith and endswith methods to the Seq object In-Reply-To: Message-ID: <200904101415.n3AEFRRb005139@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2809 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-10 10:15 EST ------- Created an attachment (id=1275) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1275&action=view) Patch to Bio/Seq.py and Tests/test_Seq_objs.py Adds startswith and endswith methods to the Seq object, and tests these with simple doctest and a longer separate unit test. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython at maubp.freeserve.co.uk Fri Apr 10 10:46:02 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 10 Apr 2009 15:46:02 +0100 Subject: [Biopython-dev] Tutorial & Cookbook Message-ID: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com> David wrote: >> For me as a n00b the most useful resource by far has been the cookbook - >> seeing some working scripts that I could change to suit my ends has >> helped me get to the point that I can write much more generalised code >> for my project 'from scratch'. ... When you said "cookbook", did you mean the Biopython Tutorial & Cookbook? http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf There are a couple of other documents under the "Cookbook" folder here: http://biopython.org/DIST/docs/cookbook/Restriction.html http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf I have been wondering if the "Biopython Tutorial & Cookbook" should be separated now - it is getting a bit long (which in some ways is a good thing!). Maybe we should re-title it as just the "Biopython Tutorial". Some bits of the current "Cookbook chapter" might be moved into the main body of the tutorial (e.g. the alignment stuff), but having the cookbook entries separate might be a good idea. For a separate "Cookbook", we could again use LaTeX for another HTML/PDF document (or set of documents) but perhaps just a series of pages on the wiki would be more accessible - and much easier for people to contribute to? We'd need to organize things (e.g. a cookbook category on the wiki) to make sure everything is still accessible. As a bonus, it would give us more hits on Google - which is probably a good thing. On the other hand, it would be very good if all our cookbook use cases could be rolled into the unit test framework - which wouldn't be so easy if they live on the wiki. Something based on doctests might work... 
Peter From bugzilla-daemon at portal.open-bio.org Fri Apr 10 13:29:06 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 10 Apr 2009 13:29:06 -0400 Subject: [Biopython-dev] [Bug 2808] Bio.SeqIO "ig" format parser doesn't deal with optional 1 terminator In-Reply-To: Message-ID: <200904101729.n3AHT6g0020169@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2808 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-10 13:29 EST ------- (In reply to comment #0) > > The Bio.SeqIO "ig" parser should be aware of the (optional) terminal "1" > marker, and not include it in the returned sequence. > Fixed in CVS, Bio/SeqIO/IgIO.p revision 1.5 Tests/test_Emboss.py revision 1.10 > > Perhaps we should even add this when writing the files. > We don't write out ig files so this isn't an issue at the moment. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
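[Editorial sketch, not part of the original report.] The behaviour requested in Bug 2808, treating a trailing "1" as an optional end-of-sequence marker rather than as sequence data, might look something like the following. This is only an illustration on plain strings; the function name strip_ig_terminator is hypothetical and this is not the change committed to the Bio.SeqIO "ig" parser.

```python
def strip_ig_terminator(sequence):
    """Drop a single trailing "1" (the optional IG terminator), if present.

    Hypothetical helper: assumes `sequence` is the already-concatenated
    sequence text of one record, with whitespace removed.
    """
    if sequence.endswith("1"):
        return sequence[:-1]
    return sequence


with_marker = strip_ig_terminator("ACGTACGT1")
without_marker = strip_ig_terminator("ACGTACGT")
print(with_marker, without_marker)
```

Because the marker is optional, both inputs give the same sequence, which is what the report asks of the parser.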
From biopython at maubp.freeserve.co.uk Fri Apr 10 14:12:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 10 Apr 2009 19:12:12 +0100 Subject: [Biopython-dev] Bio.EMBOSS wrappers Message-ID: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> Hi Those of you following the CVS RSS feed will have noticed a lot of activity on my new unit test test_Emboss.py, which now works on Windows, Linux and Mac OS (provided EMBOSS is installed), and does four main tasks: - runs needle, checks Bio.AlignIO can parse the output - runs water, checks Bio.AlignIO can parse the output - runs seqret to check Bio.SeqIO - runs seqret to check Bio.AlignIO It would probably be logical to also include tests for the EMBOSS version of primer3 here too, but I am not familiar with this tool and the Biopython parsers. For now I build the command line strings for seqret and needle "by hand", as Bio.EMBOSS doesn't have wrappers for them yet. I also note that the existing wrappers in Bio.EMBOSS don't support the very handy -auto and -filter command line arguments supported by all (or at least most) of the EMBOSS command line tools. Using -auto turns off any user prompting for missing arguments (very important for calling from a script). Using -filter is useful for running the tools with pipes (i.e. no output file is required as stdout can be used instead, and potentially no input file if we write to stdin correctly). Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding these features? The needle wrapper would make an excellent basis for a new water wrapper. For adding -auto and -filter support, there is probably a clever approach with a common EMBOSS specific subclass of Bio.Application.AbstractCommandline, but I haven't tried. 
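The shared -auto / -filter idea could be sketched roughly as follows. This is a standalone illustration that deliberately does not use Bio.Application.AbstractCommandline; the class names `EmbossCommandSketch`, `NeedleSketch` and `WaterSketch` and the `use_filter` argument are hypothetical, and the option names in the usage note are passed through unchecked.

```python
class EmbossCommandSketch:
    """Hypothetical base class collecting switches shared by EMBOSS tools."""

    program = None  # set by subclasses, e.g. "needle"

    def __init__(self, auto=False, use_filter=False, **options):
        self.auto = auto
        self.use_filter = use_filter
        self.options = options

    def build_args(self):
        """Return the command as an argument list (options sorted by name)."""
        args = [self.program]
        for name in sorted(self.options):
            args.extend(["-" + name, str(self.options[name])])
        if self.auto:
            args.append("-auto")  # turn off prompting for missing arguments
        if self.use_filter:
            args.append("-filter")  # read stdin / write stdout for pipes
        return args


class NeedleSketch(EmbossCommandSketch):
    program = "needle"


class WaterSketch(EmbossCommandSketch):
    # A water wrapper can reuse everything from the common base class.
    program = "water"
```

For example, `NeedleSketch(auto=True, asequence="a.fasta", bsequence="b.fasta", gapopen=10, gapextend=0.5, outfile="out.txt").build_args()` gives a list that could be handed to `subprocess`. Keeping the two switches in the base class means each tool-specific wrapper only has to name its program.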
Peter From mjldehoon at yahoo.com Fri Apr 10 22:26:45 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 10 Apr 2009 19:26:45 -0700 (PDT) Subject: [Biopython-dev] Tutorial & Cookbook In-Reply-To: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com> Message-ID: <93403.18413.qm@web62406.mail.re1.yahoo.com> --- On Fri, 4/10/09, Peter wrote: > I have been wondering if the "Biopython Tutorial & > Cookbook" should be separated now - it is getting > a bit long (which in some ways is a good thing!). In my opinion, it doesn't matter if the "Biopython Tutorial & Cookbook" is long. I guess that few people actually print this document anyway. I am in favor of having one "official" documentation for Biopython. If we have one Tutorial and one Cookbook, we'll have lots of overlap between the two, it'll be unclear what should be in the Tutorial and what in the Cookbook, and we'll have to make sure the two are consistent. A cookbook on the Wiki could be helpful though, and since the Wiki pages can be fixed easily we won't have to worry so much about inconsistencies with the official documentation. > Maybe we should re-title it as just the "Biopython Tutorial". That sounds like a good idea. > Some bits of the current "Cookbook chapter" might be moved > into the main body of the tutorial (e.g. the alignment > stuff), Yes. The cookbook chapter has the same problem as a cookbook document; it's not clear what should go there. A more logical place for cookbook-style examples is at the end of each chapter in the documentation. For example, Bio.Entrez has a bunch of cookbook-style examples at the end of its chapter in the Biopython Tutorial & Cookbook. Currently, there are not so many sections left in the cookbook chapter; most of them have become full-fledged chapters and were moved out of the cookbook chapter. 
> For a separate "Cookbook", we could again use LaTeX for another > HTML/PDF document (or set of documents) but perhaps just a > series of pages on the wiki would be more accessible - and much > easier for people to contribute to? +1 for the wiki, -1 for another HTML/PDF document. > On the other hand, it would be very good if all our > cookbook use cases > could be rolled into the unit test framework - which > wouldn't be so > easy if they live on the wiki. Something based on doctests > might work... Whereas it can be useful if some cookbook examples are part of the unit tests, I don't think it's absolutely required. I see a wiki cookbook more as complementary to the unit tests. --Michiel. From mjldehoon at yahoo.com Sat Apr 11 07:29:47 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 11 Apr 2009 04:29:47 -0700 (PDT) Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090408124908.GN43636@sobchak.mgh.harvard.edu> Message-ID: <830379.9837.qm@web62402.mail.re1.yahoo.com> Hi Brad, Thanks for the examples; that clarified it a lot. I have a couple of suggestions of how to make the GFF parser more generally usable, and more consistent with other parsers in Biopython. Looking at your first example: > from BCBio.GFF.GFFParser import GFFAddingIterator > > gff_iterator = GFFAddingIterator() > rec_dict = gff_iterator.get_all_features(gff_file) > > The returned dictionary is like a dictionary from > SeqIO.to_dict; > keys are ids and values are SeqRecords. It's not clear to me why we need an iterator for GFF files. Can't we just use Python's line iterator instead? I would expect code like this: from Bio import GFF handle = open("my_gff_file.gff") for line in handle: # call the appropriate GFF function on the line The second point is about GFFAddingIterator.get_all_features. If this is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict? 
Then the code looks as follows: from Bio import GFF handle = open("my_gff_file.gff") rec_dict = GFF.to_dict(handle) Another thing to consider is that IDs in the GFF file do not need to be unique. For example, consider a GFF file that stores genome mapping locations for short sequences stored in a Fasta file. Since each sequence can have more than one mapping location, we can have multiple lines in the GFF file for one sequence ID. The last point is about storing SeqRecords in rec_dict. A GFF file typically does not store sequences; if it does, it's not clear which field in the GFF file does. On the other hand, a SeqRecord often does not contain the chromosomal location, which is what the GFF file stores. So why use a SeqRecord for GFF information? Sorry for bringing up lots of issues. But I think that a GFF parser will be heavily used, so we should optimize its design as much as possible. Best, --Michiel. From biopython at maubp.freeserve.co.uk Sun Apr 12 09:16:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 12 Apr 2009 14:16:58 +0100 Subject: [Biopython-dev] Bio.EMBOSS wrappers In-Reply-To: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> Message-ID: <320fb6e00904120616u390cfe56w3889804d2bffd385@mail.gmail.com> On 4/10/09, Peter wrote: > Hi > > Those of you following the CVS RSS feed will have noticed a lot of > activity on my new unit test test_Emboss.py, which now works on > Windows, Linux and Mac OS (provided EMBOSS is installed), and does > four main tasks: > > - runs needle, checks Bio.AlignIO can parse the output > - runs water, checks Bio.AlignIO can parse the output > - runs seqret to check Bio.SeqIO > - runs seqret to check Bio.AlignIO It now also runs transeq to check the Bio.Seq translations on all common tables. This has shown up some differences in our translations for ambiguous sequences - I may have found a bug in EMBOSS... 
Peter From sbassi at clubdelarazon.org Sun Apr 12 21:57:52 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Sun, 12 Apr 2009 22:57:52 -0300 Subject: [Biopython-dev] Fwd: [blast-announce] BLAST 2.2.20 now available In-Reply-To: <320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com> References: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov> <320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com> Message-ID: <9e2f512b0904121857o3f8da862ycf40a0510f9dbd51@mail.gmail.com> On Tue, Apr 7, 2009 at 4:38 PM, Peter wrote: > Hi all, .... > We'll also want to check the standalone version of BLAST is OK. I've made the following check: Run a blast query (with blast 2.2.20) with output in xml. Run my python script that converts XML to HTML using Biopython (under Biopython 1.50beta) and it worked OK. The script deals with most information bits found in an XML blast file so if there is any change in the blast output, this program would crash. From eric.talevich at gmail.com Sun Apr 12 23:13:32 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 12 Apr 2009 23:13:32 -0400 Subject: [Biopython-dev] PDB tidy script In-Reply-To: <2d7c25310904030631u56c642d6k83355d6bc4cd5d19@mail.gmail.com> References: <320fb6e00903231405l479ddcc6of9cd0c1aa8fd98d4@mail.gmail.com> <2d7c25310904030631u56c642d6k83355d6bc4cd5d19@mail.gmail.com> Message-ID: <3f6baf360904122013k21aa8efcm4aae0ac872e8e6af@mail.gmail.com> Hi Thomas & everyone, I've started a separate branch on GitHub for this work: http://github.com/etal/biopython/tree/pdbtidy I pushed one small change just now (partly to play with git branches), which is basically the example code I gave earlier. 
It wraps the PDBLoader and parse_pdb_header classes, and sticks a finger into PDBList too, so that parsing and building a structure from a PDB file is a one-liner for both local and RCSB-hosted files: >>> from Bio import PDB >>> prot = PDB.load('pdb2hmb.ent') >>> dir(prot) ['__doc__', '__init__', '__module__', 'author', 'compound', 'deposition_date', 'head', 'journal', 'journal_reference', 'keywords', 'name', 'release_date', 'resolution', 'source', 'structure', 'structure_method', 'structure_reference'] Or: >>> PDB.fetch('2hmb') /usr/lib/python2.5/site-packages/Bio/PDB/PDBList.py:240: UserWarning: Retrieving ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/hm/pdb2hmb.ent.gz warn("Retrieving %s" % url) (The warning is supposed to be a comment, but that cleanup is happening in another branch: http://github.com/etal/biopython/tree/bug2754 ). My idea is to pull all of the parse_pdb_header data out of the PDBParser and Structure classes, and store it in the PDBLoader wrapper instead. The existing "header" attributes can point to the PDBLoader parent if it exists, or temporarily contain None or "" if necessary to avoid breaking scripts, according to the deprecation plan. Annotations could either stay in Structure or move to Loader. Then we'd have a fast, lean, consistent hierarchy of classes for 3D structure work, and an easy API for loading and exploring PDB files interactively. Part of the pdbtidy concept is to check that the PDB header is consistent with the structure it represents, so I'd like the API for metadata to be just as nice as the existing one for 3D structure. So, this is just a start, but I hope the intent is clear enough that someone will tell me to stop if the whole idea is misguided. 
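The delegation idea described above (header data living in the loader, with the structure's "header" attribute pointing back at it) might look something like this. It is a toy sketch only: `LoaderSketch` and `StructureSketch` are hypothetical stand-ins, not the classes in the pdbtidy branch or in Bio.PDB.

```python
class StructureSketch:
    """Stand-in for a 3D structure object with no metadata of its own."""

    def __init__(self, identifier):
        self.id = identifier
        self.parent_loader = None

    @property
    def header(self):
        # Delegate to the loader if there is one; otherwise return None so
        # that older scripts reading the attribute do not break.
        if self.parent_loader is None:
            return None
        return self.parent_loader.header


class LoaderSketch:
    """Stand-in for a wrapper that owns the parsed header and the structure."""

    def __init__(self, header, structure):
        self.header = dict(header)
        self.structure = structure
        structure.parent_loader = self
```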
Thanks, Eric From biopython at maubp.freeserve.co.uk Mon Apr 13 05:51:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 10:51:38 +0100 Subject: [Biopython-dev] Fwd: [blast-announce] BLAST 2.2.20 now available In-Reply-To: <9e2f512b0904121857o3f8da862ycf40a0510f9dbd51@mail.gmail.com> References: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov> <320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com> <9e2f512b0904121857o3f8da862ycf40a0510f9dbd51@mail.gmail.com> Message-ID: <320fb6e00904130251k3e3e77f2x20e03fba19fd8ff7@mail.gmail.com> On Mon, Apr 13, 2009 at 2:57 AM, Sebastian Bassi wrote: > On Tue, Apr 7, 2009 at 4:38 PM, Peter wrote: >> Hi all, > .... >> We'll also want to check the standalone version of BLAST is OK. > > I've made the following check: > Run a blast query (with blast 2.2.20) with output in xml. Run my > python script that converts XML to HTML using Biopython (under > Biopython 1.50beta) and it worked OK. The script deals with most > information bits found in an XML blast file so if there is any change > in the blast output, this program would crash. Great - thanks for checking that :) Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 06:44:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 11:44:29 +0100 Subject: [Biopython-dev] BOSC 2009 Message-ID: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com> Hello Biopythoneers, Those of you following the dev-mailing list or the OBF news feed will know that talk abstracts for BOSC 2009 are due in today, see http://www.open-bio.org/wiki/BOSC_2009 I should be able to attend and present the Biopython Project Update, and a few other Biopython developers may also be around too, so some sort of hackathon is in the air. It is a bit unfortunate the deadline was scheduled on the Easter break, as I'm sure quite a few of you will be on holiday, but here is an outline abstract.
If anyone has comments, please let me know (on the list or directly) in the next couple of hours... Biopython Project Update (draft abstract for BOSC 2009) In this talk we present the current status of the Biopython project, focusing on features developed in the last year, and future plans for the project. The Oxford University Press journal Bioinformatics has recently published an application note describing Biopython: Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, and de Hoon MJ. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009 Mar 20. doi:10.1093/bioinformatics/btp163 Since BOSC 2008, Biopython 1.49 has been released. This was an important milestone in bringing support for Python 2.6, and in terms of our dependence on Numerical Python as we made the transition from the obsolete Numeric library to NumPy. Biopython 1.49 also added more biological methods to our core sequence object. April 2009 will see the release of Biopython 1.50 (at the time of writing, a beta has already been released). Some of the new features include: 1. GenomeDiagram by Leighton Pritchard has been integrated into Biopython as the Bio.Graphics.GenomeDiagram module. 2. A new module Bio.Motif has been added, which is intended to replace the existing Bio.AlignAce and Bio.MEME modules. 3. Bio.SeqIO can now read and write FASTQ and QUAL files used in second generation sequencing work. Biopython will celebrate its 10th Birthday later this year, we will present a brief history of the project and current work. This includes the evaluation of git (and github) as a possible distributed version control system (DVCS) to replace our existing very stable CVS server hosted by the Open Bioinformatics Foundation, which we hope will encourage more participation in the project. 
-- Thanks, Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 08:16:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 13:16:10 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <830379.9837.qm@web62402.mail.re1.yahoo.com> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> Message-ID: <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> On Sat, Apr 11, 2009 at 12:29 PM, Michiel de Hoon wrote: > > Hi Brad, > > Thanks for the examples; that clarified it a lot. I haven't tried the code yet, but I have a GFF file I need to convert into FASTA format. Hopefully later this week I'll get to that... There are a few things I can ask now though: Why are the functions _gff_line_map() and _gff_line_reduce() private (leading underscores)? I had thought you wanted to make the map/reduce approach available to people trying to parse GFF files on multiple threads (e.g. using disco) which would require them to use these two functions, wouldn't it? If so, they should be part of the public API. I don't see any support for the optional FASTA block in a GFF file. Is this something you intend to add later Brad? See also my thoughts below for Bio.SeqIO integration. > I have a couple of suggestions of how to make the GFF parser more generally usable, and more consistent with other parsers in Biopython. > Looking at your first example: > >> from BCBio.GFF.GFFParser import GFFAddingIterator >> >> gff_iterator = GFFAddingIterator() >> rec_dict = gff_iterator.get_all_features(gff_file) >> >> The returned dictionary is like a dictionary from >> SeqIO.to_dict; >> keys are ids and values are SeqRecords. > > It's not clear to me why we need an iterator for GFF files. Can't we just use Python's line iterator instead? I would expect code like this: > > from Bio import GFF > handle = open("my_gff_file.gff") > for line in handle:
>     # call the appropriate GFF function on the line I think the appropriate GFF function here might be Brad's _gff_line_map(). This knows about different GFF line types (e.g. ## header lines). I'm not sure if a line based approach like this can cope with the optional ##FASTA block though. > The second point is about GFFAddingIterator.get_all_features. If this > is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict? > Then the code looks as follows: > > from Bio import GFF > handle = open("my_gff_file.gff") > rec_dict = GFF.to_dict(handle) Well, the Bio.SeqIO.to_dict() function takes a SeqRecord list/iterator rather than a handle, but that might make sense here. > Another thing to consider is that IDs in the GFF file do not need to be unique. > For example, consider a GFF file that stores genome mapping locations for > short sequences stored in a Fasta file. Since each sequence can have more > than one mapping location, we can have multiple lines in the GFF file for one > sequence ID. That sounds nasty. Do you have any example files of this we could use for a test case? > The last point is about storing SeqRecords in rec_dict. A GFF file typically > does not store sequences; if it does, it's not clear which field in the GFF file > does. On the other hand, a SeqRecord often does not contain the > chromosomal location, which is what the GFF file stores. So why use a > SeqRecord for GFF information? I don't think the GFF parser should only return SeqRecord objects, but I do see a use for this (via Bio.SeqIO). GFF files could be represented as a list of SeqFeature objects, and using a SeqRecord to hold this seems very natural to me. It also means we could use Bio.SeqIO to load a GFF file into SeqRecord objects for storage in a BioSQL database. If you look at the NCBI FTP site, they often provide genome sequences in a range of file formats including GenBank and GFF. e.g.
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/ The GenBank files contain the features plus the sequence, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gbk Their GFF3 file only contains the features: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff Some GFF files will include the sequence too, in this case we can fetch it in FASTA format: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna In principle, you could parse this FASTA file and the GFF3 file and put together a GenBank file - or vice versa. As an aside, I would also consider adding protein table support on the same lines, look at this file: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.ptt The header information gives us the genome size, so Bio.SeqIO could return a SeqRecord with lots of SeqFeature objects and for the SeqRecord's seq property use a Bio.Seq.UnknownSeq of length 4639675bp. This is something I might look at implementing myself after Biopython 1.50 is out. We should be able to read in a GenBank file and output a PTT file, and verify it matches the NCBI provided version of the PTT file. Similarly, I would want Bio.SeqIO to be able to parse a GFF3 file, and give me a SeqRecord with lots of SeqFeature objects. If the sequence is present in the file, it should use that (not the case for these NCBI GFF3 files). Otherwise, we wouldn't necessarily know the actual sequence length which we'd need to use the new Bio.Seq.UnknownSeq object. However, we can infer from the maximum feature coordinates a minimum sequence length. For these NCBI GFF3 files, as there is a source feature this does actually give us the genome length, so this should work very nicely.
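The "minimum sequence length from the maximum feature coordinate" idea can be sketched as below. This is a standalone illustration using only the standard library, not a proposed Bio.SeqIO implementation; the function name `minimum_lengths` is hypothetical, and it assumes tab-separated nine-column feature lines with 1-based inclusive end coordinates in the fifth column.

```python
def minimum_lengths(gff_lines):
    """Map each sequence ID to the largest feature end coordinate seen."""
    lengths = {}
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence, not features
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a feature line
        seqid, end = columns[0], int(columns[4])
        lengths[seqid] = max(lengths.get(seqid, 0), end)
    return lengths
```

When the file carries a source feature spanning the whole genome, the value returned for that sequence ID is the genome length itself; otherwise it is only a lower bound.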
Peter From chapmanb at 50mail.com Mon Apr 13 08:32:19 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 13 Apr 2009 08:32:19 -0400 Subject: [Biopython-dev] Bio.EMBOSS wrappers In-Reply-To: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> Message-ID: <20090413123219.GB5429@sobchak.mgh.harvard.edu> Hi Peter; The tests from EMBOSS look great; thanks for putting this together. > For now I build the command line strings for seqret and needle "by > hand", as Bio.EMBOSS doesn't have wrappers for them yet. I also note > that the existing wrappers in Bio.EMBOSS don't support the very handy > -auto and -filter command line arguments supported by all (or at least > most) of the EMBOSS command line tools. Using -auto turns off any > user prompting for missing arguments (very important for calling from > a script). Using -filter is useful for running the tools with pipes > (i.e. no output file is required as stdout can be used instead, and > potentially no input file if we write to stdin correctly). > > Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding > these features? The needle wrapper would make an excellent basis for > a new water wrapper. For adding -auto and -filter support, there is > probably a clever approach with a common EMBOSS specific subclass of > Bio.Application.AbstractCommandline, but I haven't tried. Definitely go for it. My approach on this has mostly been to add command lines as they are requested, or if I need them for something I am doing. Not ideal. Having a subclass with -auto and -filter is a really good idea; unfortunately nothing clever is designed into the command line builders right now. Feel free to add away. 
Brad From chapmanb at 50mail.com Mon Apr 13 08:52:55 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 13 Apr 2009 08:52:55 -0400 Subject: [Biopython-dev] Tutorial & Cookbook In-Reply-To: <93403.18413.qm@web62406.mail.re1.yahoo.com> References: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com> <93403.18413.qm@web62406.mail.re1.yahoo.com> Message-ID: <20090413125255.GC5429@sobchak.mgh.harvard.edu> Hi all; > > I have been wondering if the "Biopython Tutorial & > > Cookbook" should be separated now - it is getting > > a bit long (which in some ways is a good thing!). > > In my opinion, it doesn't matter if the "Biopython Tutorial & > Cookbook" is long. I guess that few people actually print this > document anyway. > > I am in favor of having one "official" documentation for Biopython. > If we have one Tutorial and one Cookbook, we'll have lots of overlap > between the two, it'll be unclear what should be in the Tutorial > and what in the Cookbook, and we'll have to make sure the two are > consistent. I am for whatever is easiest to maintain. Being long isn't a problem as people can just skip to whatever they need; reading things online will be increasingly common. Agreed with Michiel that minimizing overlap is key. It's the same as maintaining code; if you have the same thing in multiple places it is more likely to get out of sync and be confusing. There is a pretty clear distinction between tutorial documentation and cookbook examples, so... > A cookbook on the Wiki could be helpful though, and since the Wiki > pages can be fixed easily we won't have to worry so much about > inconsistencies with the official documentation. [...] > +1 for the wiki, -1 for another HTML/PDF document. Same vote for me. I am responsible for the LaTeX file, but if I were starting it today would do things entirely on the web. The barrier to contributing is much lower. 
> > On the other hand, it would be very good if all our cookbook use cases > > could be rolled into the unit test framework - which wouldn't be so > > easy if they live on the wiki. Something based on doctests might work... This is a good idea; broken examples in documentation are definitely annoying. If we enforce a common format for cookbook items, then we could scrape the wiki pages, extract the python code and run it as part of the tests. The python cookbook could serve as some inspiration: http://code.activestate.com/recipes/langs/python/ Brad From biopython at maubp.freeserve.co.uk Mon Apr 13 08:53:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 13:53:18 +0100 Subject: [Biopython-dev] Bio.EMBOSS wrappers In-Reply-To: <20090413123219.GB5429@sobchak.mgh.harvard.edu> References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> <20090413123219.GB5429@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com> On Mon, Apr 13, 2009 at 1:32 PM, Brad Chapman wrote: >> Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding >> these features? The needle wrapper would make an excellent basis for >> a new water wrapper. For adding -auto and -filter support, there is >> probably a clever approach with a common EMBOSS specific subclass of >> Bio.Application.AbstractCommandline, but I haven't tried. > > Definitely go for it. My approach on this has mostly been to add > command lines as they are requested, or if I need them for something > I am doing. Not ideal. > > Having a subclass with -auto and -filter is a really good idea; > unfortunately nothing clever is designed into the command line builders > right now. Feel free to add away. I need to work on my delegation skills - that seems to have backfired ;) Regarding adding -auto support, I have a question about the needle wrapper and the gap parameters.
Using the needle tool at the command line will prompt for the gap parameters UNLESS the -auto argument has been used. i.e. Without -auto, it makes sense to insist on the gap parameters being included, which is what the current wrapper does. However, if we add support for -auto, then these parameters can be optional. We could handle this in the wrapper, but it would be messy (and there may be similar questions with other EMBOSS tools). What do you think - stick with the simple option of insisting the Biopython user set the gap parameters, even if they are using -auto? Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 09:16:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 14:16:51 +0100 Subject: [Biopython-dev] Tutorial & Cookbook In-Reply-To: <20090413125255.GC5429@sobchak.mgh.harvard.edu> References: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com> <93403.18413.qm@web62406.mail.re1.yahoo.com> <20090413125255.GC5429@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904130616t2cd5f029i582f4a2488d1182c@mail.gmail.com> Brad wrote: >Michiel wrote: >> A cookbook on the Wiki could be helpful though, and since the Wiki >> pages can be fixed easily we won't have to worry so much about >> inconsistencies with the official documentation. >> [...] >> +1 for the wiki, -1 for another HTML/PDF document. > > Same vote for me. I am responsible for the LaTeX file, but if I were > starting it today would do things entirely on the web. The barrier > to contributing is much lower. One of the nice things about the current PDF (and HTML) file is we can ship it with each release, meaning it can be used while offline. Also it means we don't have to worry too much about having our online documentation deal with older versions of Biopython. But you are right that LaTeX is a slight barrier to contributing - although it wasn't an issue for me personally as I learnt LaTeX during my Maths/Physics undergraduate degree. 
In any case, I've previously said that if people have additions for the tutorial, I'll take plain text and do the mark up for them. >> > On the other hand, it would be very good if all our cookbook use cases >> > could be rolled into the unit test framework - which wouldn't be so >> > easy if they live on the wiki. Something based on doctests might work... > > This is a good idea; broken examples in documentation are definitely > annoying. If we enforce a common format for cookbook items, then we > could scrape the wiki pages, extract the python code and run it as > part of the tests. That sounds possible - we might be able to scrape the wiki page, reformat it and feed it into doctests... although testing graphical output will still be a problem. Speaking of doctests, we should do more of those in our docstrings. For our online API documentation at http://biopython.org/DIST/docs/api/ it would be nice to have the python examples within the docstrings (including the doctests) shown with syntax colouring. See http://epydoc.sourceforge.net/manual-epytext.html#doctest-blocks for an example, and compare this to http://biopython.org/DIST/docs/api/Bio.Seq-module.html - maybe we need to adjust our indentation?
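The "scrape the wiki pages, extract the python code and run it" idea might be sketched as below. This is only a rough illustration under stated assumptions: the page text is assumed to have been fetched already, code is assumed to be wrapped in `<python>...</python>` tags (the real wiki markup may differ), and the function names are made up for this example.

```python
import re
import textwrap

# Assumed markup: cookbook code sits between <python> and </python> tags.
CODE_BLOCK = re.compile(r"<python>(.*?)</python>", re.DOTALL)


def extract_code_blocks(page_text):
    """Return the source of each tagged code block found in the page text."""
    return [
        textwrap.dedent(block).strip("\n")
        for block in CODE_BLOCK.findall(page_text)
    ]


def run_code_blocks(page_text):
    """Run each block in a fresh namespace; return (index, error name) pairs."""
    failures = []
    for index, source in enumerate(extract_code_blocks(page_text)):
        try:
            exec(compile(source, "<wiki block %d>" % index, "exec"), {})
        except Exception as error:
            failures.append((index, type(error).__name__))
    return failures
```

An empty list from `run_code_blocks` would mean every block on the page ran without raising. Examples that produce graphical output would still need separate handling, as noted above.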
Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 09:33:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 14:33:03 +0100 Subject: [Biopython-dev] BOSC 2009 In-Reply-To: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com> References: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com> Message-ID: <320fb6e00904130633k68fe32bdj3c0419afc5ada71a@mail.gmail.com> On Mon, Apr 13, 2009 at 11:44 AM, Peter wrote: > Hello Biopythoneers, > > Those of you following the dev-mailing list or the OBF news feed will > know that talk abstracts for BOSC 2009 are due in today, see > http://www.open-bio.org/wiki/BOSC_2009 > I should be able to attend and present the Biopython Project > Update, and a few other Biopython developers may also be > around too, so some sort of hackathon is in the air. > > It is a bit unfortunate the deadline was scheduled on the Easter > break, as I'm sure quite a few of you will be on holiday, but here > is an outline abstract. If anyone has comments, please let me > know (on the list or directly) in the next couple of hours... That's been submitted now, although I can still make revisions at the moment if anyone spots something worth adding/fixing. I did remember to add the website and license information as BOSC request in their instructions. Peter From chapmanb at 50mail.com Mon Apr 13 09:35:39 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 13 Apr 2009 09:35:39 -0400 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> Message-ID: <20090413133539.GD5429@sobchak.mgh.harvard.edu> Michiel and Peter; Thanks for your comments on this. I'm definitely open to modifying the interface and am happy to have y'all giving feedback.
In reading through your comments, there is a bit of a disconnect between what you are expecting the parser to do and how it is designed right now. You both are thinking of the GFF parser as a line oriented parser that emits an object, like a SeqFeature, for each line in the file. This is one way to do it, but the downsides are: - Many features, like coding regions, are actually represented over multiple lines. - As Michiel pointed out, almost all files have many replicating IDs (the first column). Ideally you want all of these features consolidated to a single SeqRecord. So the parser now takes a higher level view and assumes that the user will want those two things done for them. So it is designed as an "adder," that puts features onto SeqRecord objects. A normal use case would be: - Use SeqIO to parse a FASTA file with the sequences => SeqRecords - Use the GFFParser to add features from a separate GFF file to the SeqRecords. These are SeqFeatures, added to the right records and nested in a parent/child relationship as appropriate. Ideally you would parse the entire GFF file and do all this feature adding at once. For big files this fails due to memory issues, which is why the filtering and iterating features were introduced. Okay, so that is the top level view. I will try to hit some of the specifics: > Why are the functions _gff_line_map() and _gff_line_reduce() private > (leading underscores)? I had thought you wanted to make the > map/reduce approach available to people trying to parse GFF files on > multiple threads (e.g. using disco) which would require them to use > these two functions, wouldn't it? If so, they should be part of the > public API. I don't think a standard user would want to deal with these directly. They just parse lines into their components and build an intermediate dictionary object. To parallelize the job, the GFFMapReduceFeatureAdder class has a 'disco_host' parameter which then runs the job in parallel.
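To make the map/reduce description concrete, here is a rough standalone sketch of the general pattern: a map step that parses one line into its components, and a reduce step that consolidates features sharing an ID. It is not the actual _gff_line_map() or _gff_line_reduce() code; the names `line_map` and `line_reduce` and the dictionary layout are assumptions for illustration, and only the standard library is used.

```python
def line_map(line):
    """Parse one GFF feature line into (seqid, feature dict), or None."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None  # directives, comments and blank lines carry no feature
    columns = line.split("\t")
    if len(columns) != 9:
        return None
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = columns
    feature = {
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        # key=value pairs separated by semicolons in the ninth column
        "attributes": dict(
            item.split("=", 1) for item in attrs.split(";") if "=" in item
        ),
    }
    return seqid, feature


def line_reduce(mapped_items):
    """Consolidate parsed features under their sequence ID."""
    by_id = {}
    for item in mapped_items:
        if item is None:
            continue
        seqid, feature = item
        by_id.setdefault(seqid, []).append(feature)
    return by_id
```

Because `line_map` looks at one line at a time, the map step can be farmed out to several workers, with `line_reduce` run once over the combined results, for example `line_reduce(map(line_map, handle))` in the serial case.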
> I don't see any support for the optional FASTA block in a GFF file. > Is this something you intend to add later Brad? See also my thoughts > below for Bio.SeqIO integration. I haven't added anything for parsing header and footer directives but it is on the to do list and I have a good idea how to handle them. Definitely pass along a file that uses these you want to parse and we can work on it. > > I have a couple of suggestions of how to make the GFF parser more > > generally usable, and more consistent with other parsers in Biopython. [...] > > It's not clear to me why we need an iterator for GFF files. Can't we > > just use Python's line iterator instead? I would expect code like this: > > > > from Bio import GFF > > handle = open("my_gff_file.gff") > > for line in handle: > > ? ?# call the appropriate GFF function on the line Right, so this was tackled in the top level overview above. Michiel, does the design make more sense now? > > The second point is about GFFAddingIterator.get_all_features. If this > > is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict? > > Then the code looks as follows: > > > > from Bio import GFF > > handle = open("my_gff_file.gff") > > rec_dict = GFF.to_dict(handle) Yes, except in the more common cases you are adding to a dictionary of records as opposed to generating one from scratch. My thought was that copying the SeqIO behavior made it more confusing because it doesn't do quite the same thing. After my explanation, what are your thoughts? > > Another thing to consider is that IDs in the GFF file do not need to be unique. > > For example, consider a GFF file that stores genome mapping locations for > > short sequences stored in a Fasta file. Since each sequence can have more > > than one mapping location, we can have multiple lines in the GFF file for one > > sequence ID. Yes, this goes back to my explanation above and is why the parser works differently than the standard SeqIO parsers. 
GFF ends up being a different beast. I think it makes sense to copy useful patterns we have already, but I don't want to confuse users with close but not the same functionality. > > The last point is about storing SeqRecords in rec_dict. A GFF file typically > > does not store sequences; if it does, it's not clear which field in the GFF file > > does. On the other hand, a SeqRecord often does not contain the > > chromosomal location, which is what the GFF file stores. So why use a > > SeqRecord for GFF information? Hopefully the SeqRecords make more sense now. What it is really doing is adding SeqFeatures to SeqRecords. When the user doesn't provide one, it creates an empty SeqRecord with the appropriate ID to use and adds SeqFeatures to it. > If you look at the NCBI FTP site, they often provide genome sequences > in a range of file formats including GenBank and GFF. [...] > Their GFF3 file only contains the features: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff > > Some GFF files will include the sequence too, in this case we can > fetch it in FASTA format: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna Right on. So you would first parse the Fasta file with the SeqIO parser to_dict functionality, and then feed this dictionary to the GFF parser to add the features. > In principle, you could parse this FASTA file and the GFF3 file and > put together a GenBank file - or vice versa. Yes. > Similarly, I would want Bio.SeqIO to be able to parse a GFF3 file, and > give me a SeqRecord with lots of SeqFeature objects. If the sequence > is present in the file, it should use that (not the case for these > NCBI GFF3 files). Otherwise, we wouldn't necessarily know the actual > sequence length which we'd need to use the new Bio.Seq.UnknownSeq > object. However, we can infer from the maximum feature coordinates a > minimum sequence length.
For these NCBI GFF3 files, as there is a > source feature this does actually give use the genome length, so this > should work very nicely. Using UnknownSeq is a good idea, and I will do. Whew. Michiel and Peter -- hopefully the high level intentions are a bit more clear. Thanks for your input so far; let's hash this out so it makes sense to everyone. Brad From chapmanb at 50mail.com Mon Apr 13 09:44:29 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 13 Apr 2009 09:44:29 -0400 Subject: [Biopython-dev] Bio.EMBOSS wrappers In-Reply-To: <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com> References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> <20090413123219.GB5429@sobchak.mgh.harvard.edu> <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com> Message-ID: <20090413134429.GE5429@sobchak.mgh.harvard.edu> Hi Peter; > >> Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding > >> these features? ?The needle wrapper would make an excellent basis for > >> a new water wrapper. ?For adding -auto and -filter support, there is > >> probably a clever approach with a common EMBOSS specific subclass of > >> Bio.Application.AbstractCommandline, but I haven't tried. > > > > Definitely go for it. My approach on this has mostly been to add > > command lines as they are requested, or if I need them for something > > I am doing. Not ideal. > > > > Having a subclass with -auto and -filter is a really good idea; > > unfortunately nothing clever is designed into the command line builders > > right now. Feel free to add away. > > I need to work on my delegation skills - that seems to have back fired ;) Oops. I honestly read that as "do I have your permission?" I can of course tackle this, but am a bit underwater now. > Regarding adding -auto support, I have a question about the needle > wrapper and the gap parameters. 
Using the needle tool at the command > line will prompt for the gap parameters UNLESS the -auto argument has > been used. i.e. Without -auto, it makes sense to insist on the gap > parameters being included, which is what the current wrapper does. > However, if we add support for -auto, then these parameters can be > optional. We could handle this in the wrapper, but it would be messy > (and there may be similar questions with other EMBOSS tools). What do > you think - stick with the simple option of insisting the Biopython > user set the gap parameters, even if they are using -auto? I think we should stick with the simple option. These were meant to be pretty dumb specifiers that help users write more modular code than simply pasting in a raw string for the command line. Trying to get too fancy is probably overkill. Brad From biopython at maubp.freeserve.co.uk Mon Apr 13 09:49:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 14:49:56 +0100 Subject: [Biopython-dev] Bio.EMBOSS wrappers In-Reply-To: <20090413134429.GE5429@sobchak.mgh.harvard.edu> References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> <20090413123219.GB5429@sobchak.mgh.harvard.edu> <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com> <20090413134429.GE5429@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com> On Mon, Apr 13, 2009 at 2:44 PM, Brad Chapman wrote: >> > ... Feel free to add away. >> >> I need to work on my delegation skills - that seems to have back fired ;) > > Oops. I honestly read that as "do I have your permission?" I can of > course tackle this, but am a bit underwater now. Looking back, I was a bit ambiguous. I don't mind who does it - let's see who has time free first. >> Regarding adding -auto support, I have a question about the needle >> wrapper and the gap parameters. 
?Using the needle tool at the command >> line will prompt for the gap parameters UNLESS the -auto argument has >> been used. ?i.e. Without -auto, it makes sense to insist on the gap >> parameters being included, which is what the current wrapper does. >> However, if we add support for -auto, then these parameters can be >> optional. ?We could handle this in the wrapper, but it would be messy >> (and there may be similar questions with other EMBOSS tools). ?What do >> you think - stick with the simple option of insisting the Biopython >> user set the gap parameters, even if they are using -auto? > > I think we should stick with the simple option. These were meant to > be pretty dumb specifiers that help users write more modular code than > simply pasting in a raw string for the command line. Trying to get > too fancy is probably overkill. Agreed. Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 10:19:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 15:19:54 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090413133539.GD5429@sobchak.mgh.harvard.edu> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> <20090413133539.GD5429@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com> > Okay, so that is the top level view. I will try to hit some of the > specifics: > >> Why are the functions _gff_line_map() and _gff_line_reduce() private >> (leading underscores)? ?I had thought you wanted to make the >> map/reduce approach available to people trying to parse GFF files on >> multiple threads (e.g. using disco) which would require them to use >> these two functions, wouldn't it? ?If so, they should be part of the >> public API. > > I don't think a standard user would want to deal with these > directly. 
They just parse lines into their components and build an > intermediate dictionary object. To parallelize the job, the > GFFMapReduceFeatureAdder class has a 'disco_host' parameter which > then runs the job in parallel. Are you aware of any alternatives to disco for doing map/reduce on Python, and does that impact your design choices? >> I don't see any support for the optional FASTA block in a GFF file. >> Is this something you intend to add later Brad? ?See also my thoughts >> below for Bio.SeqIO integration. > > I haven't added anything for parsing header and footer directives but > it is on the to do list and I have a good idea how to handle them. Definitely > pass along a file that uses these you want to parse and we can work on it. There are some partial examples here: http://www.sequenceontology.org/gff3.shtml We should have a peep at BioPerl's unit tests and/or ask Lincoln directly. >> > I have a couple of suggestions of how to make the GFF parser more >> > generally usable, and more consistent with other parsers in Biopython. > [...] >> > It's not clear to me why we need an iterator for GFF files. Can't we >> > just use Python's line iterator instead? I would expect code like this: >> > >> > from Bio import GFF >> > handle = open("my_gff_file.gff") >> > for line in handle: >> > ? ?# call the appropriate GFF function on the line > > Right, so this was tackled in the top level overview above. Michiel, > does the design make more sense now? > >> > The second point is about GFFAddingIterator.get_all_features. If this >> > is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict? >> > Then the code looks as follows: >> > >> > from Bio import GFF >> > handle = open("my_gff_file.gff") >> > rec_dict = GFF.to_dict(handle) > > Yes, except in the more common cases you are adding to a dictionary > of records as opposed to generating one from scratch. 
My thought was > that copying the SeqIO behavior made it more confusing because it > doesn't do quite the same thing. After my explanation, what are your > thoughts? Maybe there is a role for a to_dict() function for when you start from scratch, but as you say, it does sound like there is a general need to add to an existing dict. >> > Another thing to consider is that IDs in the GFF file do not need to be unique. >> > For example, consider a GFF file that stores genome mapping locations for >> > short sequences stored in a Fasta file. Since each sequence can have more >> > than one mapping location, we can have multiple lines in the GFF file for one >> > sequence ID. > > Yes, this goes back to my explanation above and is why the > parser works differently than the standard SeqIO parsers. GFF ends > up being a different beast. I think it makes sense to copy useful > patterns we have already, but don't want to confuse users with close > by not the same functionality. > >> > The last point is about storing SeqRecords in rec_dict. A GFF file typically >> > does not store sequences; if it does, it's not clear which field in the GFF file >> > does. On the other hand, a SeqRecord often does not contain the >> > chromosomal location, which is what the GFF file stores. So why use a >> > SeqRecord for GFF information? > > Hopefully the SeqRecords make more sense now. What it is really doing is > adding SeqFeatures to SeqRecords. When the user doesn't provide one, > it creates an empty SeqRecord with the appropriate ID to use and > adds SeqFeatures to it. > >> If you look at the NCBI FTP site, they often provide genome sequences >> in a range of file formats including GenBank and GFF. >> [...] 
>> Their GFF3 file only contains the features: >> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff >> >> Some GFF files will include the sequence too, in this case we can >> fetch it in FASTA format: >> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna > > Right on. So you would first parse the Fasta file with the SeqIO > parser to_dict functionality, and then feed this dictionary to the > GFF parser to add the features. Hmm. I'm with you on the idea that you may need to parse a GFF file and a separate second file to get the actual sequence (e.g. a FASTA file), but there is more than one way to combine the two. For a single sequence, I was thinking more along the lines of: from Bio import SeqIO record = SeqIO.read(open("NC_000913.fna"),"fasta") record.features = SeqIO.read(open("NC_000913.gff"),"gff3").features Or, depending on what other annotation you can extract, perhaps the other way round would be best: from Bio import SeqIO record = SeqIO.read(open("NC_000913.gff"),"gff3") record.seq = SeqIO.read(open("NC_000913.fna"),"fasta").seq The above is pretty trivial I think, as long as we include examples of this in our documentation. This kind of manipulation is also file format neutral - it would work equally well with a FASTA file and a PTT file (assuming we add parsing NCBI protein tables to Bio.SeqIO as outlined in my earlier email). Or for another example, perhaps an annotated GenBank file without the sequence (e.g. just a CONTIG assembly line) plus a FASTA file for the full nucleotide sequence. If the FASTA and GFF file apply to multiple sequences (e.g. 
a set of contigs, rather than a single chromosome), and you have enough memory, then something using dictionaries should work:

    from Bio import SeqIO
    records = SeqIO.to_dict(SeqIO.parse(open("NC_000913.fna"), "fasta"))
    for temp_rec in SeqIO.parse(open("NC_000913.gff"), "gff3"):
        records[temp_rec.id].features = temp_rec.features

or,

    from Bio import SeqIO
    records = SeqIO.to_dict(SeqIO.parse(open("NC_000913.gff"), "gff3"))
    for temp_rec in SeqIO.parse(open("NC_000913.fna"), "fasta"):
        records[temp_rec.id].seq = temp_rec.seq

(You may need to massage the keys to match up, I'm assuming here that isn't required). i.e. It can all be done from Bio.SeqIO without needing to dive into Bio.GFF unless you need to do something special (e.g. filtering the features).

>> In principle, you could parse this FASTA file and the GFF3 file and
>> put together a GenBank file - or vice versa.
>
> Yes.
>
>> Similarly, I would want Bio.SeqIO to be able to parse a GFF3 file, and
>> give me a SeqRecord with lots of SeqFeature objects. If the sequence
>> is present in the file, it should use that (not the case for these
>> NCBI GFF3 files). Otherwise, we wouldn't necessarily know the actual
>> sequence length which we'd need to use the new Bio.Seq.UnknownSeq
>> object. However, we can infer from the maximum feature coordinates a
>> minimum sequence length. For these NCBI GFF3 files, as there is a
>> source feature this does actually give us the genome length, so this
>> should work very nicely.
>
> Using UnknownSeq is a good idea, and I will do.

Great.

> Whew. Michiel and Peter -- hopefully the high level intentions are a
> bit more clear. Thanks for your input so far; let's hash this out so
> it makes sense to everyone.

Good plan :) As you can probably tell, I am concentrating on getting this to match up well with the Bio.SeqIO framework.
It will be nice to know the underlying Bio.GFF module has more options, but I expect most people to start with reading in a GFF file using Bio.SeqIO, and being able to transfer their existing knowledge of SeqFeature objects learnt from using Bio.SeqIO to read in GenBank files. Peter From jflatow at gmail.com Mon Apr 13 10:41:56 2009 From: jflatow at gmail.com (Jared Flatow) Date: Mon, 13 Apr 2009 09:41:56 -0500 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> <20090413133539.GD5429@sobchak.mgh.harvard.edu> <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com> Message-ID: <3050CC48-7365-4746-B30C-F56C2ACAA2F8@gmail.com> FYI: On Apr 13, 2009, at 9:19 AM, Peter wrote: > Are you aware of any alternatives to disco for doing map/reduce on > Python, and does that impact your design choices? You can use Python map/reduce functions with Hadoop via the Streaming contrib package included with Hadoop. An overview: http://docs.google.com/Presentation?id=dgr666gg_31cd4n7qdz Here is an input reader/record reader for FASTA: http://gist.github.com/45551 jared From bugzilla-daemon at portal.open-bio.org Mon Apr 13 11:41:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 13 Apr 2009 11:41:29 -0400 Subject: [Biopython-dev] [Bug 2601] Seq find() method: proposal In-Reply-To: Message-ID: <200904131541.n3DFfTGN022460@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2601 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-13 11:41 EST ------- See also Bug 2809, for the much narrower option of adding string-like startswith and endswith methods to the Seq object (which as proposed would not deal with ambiguity characters). 
From biopython at maubp.freeserve.co.uk Mon Apr 13 13:55:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 18:55:53 +0100 Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format Message-ID: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com> Hi all, At the end of last week I found test_SeqIO_online.py was failing and traced this to a change in Entrez EFetch. EFetch is documented here: http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html The issue is with EFetch and the undocumented rettype=genbank argument which we currently use in our documentation and unit tests. This isn't an "official" argument in that it isn't listed on their website, but until recently it returned plain text GenBank files, acting like the official rettype=gb or gp arguments. However, as of the end of last week, EFetch returns the default format instead (ASN.1), causing test_SeqIO_online.py to fail and rendering some of our examples misleading. I emailed the NCBI and received a very prompt reply, > Dear Colleague, > > As the e-Utils continue to be refined our developers sometimes > address one-off issues, and this was one of them. The 'official' > parameter for GenBank is rettype=gb. Now if the parameter is not > correct you will default to ASN.1 in the nucleotide databases. We > apologize for any inconvenience. > > Regards, > > Steve Pechous, Ph.D.
> NCBI User Services With hindsight we shouldn't have used rettype="genbank", but it did seem to make things simpler for our documentation and I really hadn't expected the NCBI to change this. I think we have two options: (1) Add a special case to Bio.Entrez.efetch to map rettype="genbank" to rettype="gb" (or "gp" for the protein database). This is simple and causes least disruption to Biopython users, but is a bad idea in the long run as it means we are effectively providing our own variant of the Entrez API. (2) Update our documentation and unit tests to use rettype="gb" or "gp" instead of rettype="genbank", and add a special case to Bio.Entrez.efetch to map rettype="genbank" to rettype="gb" (or "gp" for the protein database) and issue a warning that the NCBI have changed their API. At a later point we might change this warning to an error. This would provide a clear transition for end user scripts, and keep us consistent with the official Entrez API. I favour option (2) here. Any other thoughts? Whatever we do should happen before we release Biopython 1.50. Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 14:06:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 19:06:25 +0100 Subject: [Biopython-dev] Plan for Biopython 1.50 (final) Message-ID: <320fb6e00904131106s70028e9el56d334fa732bddf8@mail.gmail.com> On Tue, Mar 31, 2009 at 10:38 PM, Peter wrote: > Hi all, > > OK guys, after a brief chat off the mailing list, I'm hoping to do the > Biopython 1.50 beta release roughly this weekend, ... > > After the release of Biopython 1.50 beta, we'll reopen CVS again for > small changes and documentation. While the beta is being tested by > our user base, I'd like us to push to finish any missing documentation > - in particular for new modules Bio.Motif (Bartek) and > Bio.Graphics.GenomeDiagram (me and/or Leighton), plus the new > SeqRecord slicing and UnknownSeq class (me).
That documentation still needs doing, and it would be nice to have it with Biopython 1.50. If Bartek or Leighton expects to add anything in the next few days, then I'd be happy to hold back the release for that. I'll try and do the SeqRecord stuff myself shortly. > Depending on the feedback from the beta, I'd hope we can do the final > release of Biopython 1.50 well before the end of April, and then > reopen CVS for new code. There haven't been any problems with the beta reported, however there is the issue of EFetch returning ASN.1 not genbank format (see my earlier email) which I think we must resolve before Biopython 1.50 is released. Apart from these two points (documentation and EFetch), are there any issues regarding doing the official release of Biopython 1.50? I think we can aim for a release this week... Peter From lpritc at scri.ac.uk Tue Apr 14 04:29:14 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 14 Apr 2009 09:29:14 +0100 Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format In-Reply-To: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com> Message-ID: On 13/04/2009 18:55, "Peter" wrote: [...] > I think we have two options: > > (1) Add a special case to Bio.Entrez.eftech to map rettype="genbank" > to rettype="gb" (or "gp" for the protein database). This is simple > and causes least disruption to Biopython uses, but is a bad idea in > the long run as it means we are effectively providing our own variant > of the Entrez API. > > (2) Update our documentation and unit tests to use rettype="gb" or > "gp" instead of rettype="genbank", and add a special case to > Bio.Entrez.eftech to map rettype="genbank" to rettype="gb" (or "gp" > for the protein database) and issue a warning that the NCBI have > changed their API. At a later point we might change this warning to > an error. This would provide a clear transition for end user scripts, > and keep us consistent with the official Entrez API. > > I favour option (2) here. 
Any other thoughts? Whatever we do should > happen before we release Biopython 1.50. Option (2). Option (1) risks cementing an argument into place in Biopython that could potentially contradict future Entrez API usage. L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
______________________________________________________ From mjldehoon at yahoo.com Tue Apr 14 04:33:48 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 14 Apr 2009 01:33:48 -0700 (PDT) Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format In-Reply-To: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com> Message-ID: <273080.33626.qm@web62408.mail.re1.yahoo.com> I am also in favor of option (2). --Michiel > I think we have two options: > > (1) Add a special case to Bio.Entrez.eftech to map > rettype="genbank" > to rettype="gb" (or "gp" for the > protein database). This is simple > and causes least disruption to Biopython uses, but is a bad > idea in > the long run as it means we are effectively providing our > own variant > of the Entrez API. > > (2) Update our documentation and unit tests to use > rettype="gb" or > "gp" instead of rettype="genbank", and > add a special case to > Bio.Entrez.eftech to map rettype="genbank" to > rettype="gb" (or "gp" > for the protein database) and issue a warning that the NCBI > have > changed their API. At a later point we might change this > warning to > an error. This would provide a clear transition for end > user scripts, > and keep us consistent with the official Entrez API. 
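For readers following the thread, option (2) could be sketched roughly as below. This is only an illustration under stated assumptions, not the change that went into Biopython: normalise_rettype is a hypothetical helper name, and the check on the database name "protein" is an assumption about how "gb" versus "gp" would be chosen.

```python
import warnings


def normalise_rettype(db, rettype):
    """Map the unofficial rettype 'genbank' to the official 'gb' or 'gp'.

    Hypothetical helper: any other rettype is passed through unchanged.
    """
    if rettype != "genbank":
        return rettype
    # Assumption: the protein database wants 'gp', everything else 'gb'.
    official = "gp" if db == "protein" else "gb"
    warnings.warn(
        "The NCBI no longer accept rettype='genbank' for EFetch; "
        "using rettype=%r instead." % official,
        stacklevel=2,
    )
    return official
```

A wrapper around EFetch could call this on its arguments before building the request, so existing scripts keep working while being told to switch. Turning the warning into an error later, as suggested in the quoted proposal, would only need the warnings.warn call replaced with a raised exception.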
From bugzilla-daemon at portal.open-bio.org Tue Apr 14 04:51:56 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 14 Apr 2009 04:51:56 -0400 Subject: [Biopython-dev] [Bug 2811] New: EFetch returning ASN.1 not GenBank format for rettype=genbank Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2811 Summary: EFetch returning ASN.1 not GenBank format for rettype=genbank Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk At the end of last week I found test_SeqIO_online.py was failing and traced this to a change in Entrez EFetch. EFetch is documented here: http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html The issue is with EFetch and the undocumented rettype=genbank argument which we currently use in our documentation and unit tests. This isn't an "official" argument in that it isn't listed on their website, but until recently it returned plain text GenBank files, acting like the official rettype=gb or gp arguments. However, as of the end of last week, EFtech returns the default format instead (ASN.1), causing test_SeqIO_online.py to fail and rendering some of our examples misleading. I emailed the NCBI and received a very prompt reply, > Dear Colleague, > > As the e-Utils continue to be refined our developers sometimes > address one-off issues, and this was one of them. The 'official' > parameter for GenBank is rettype=gb. Now if the parameter is not > correct you will default to ASN.1 in the nucleotide databases. We > apologize for any inconvenience. > > Regards, > > Steve Pechous, Ph.D. 
> NCBI User Services I then emailed back (before Easter) to ask if they would reconsider this change, and have just had a reply: > Hi Peter, > > This will likely not reverse back as the true parameters are laid out > in the help documents and are now required, so to speak. > > Regards, > > Steve Pechous, Ph.D. > NCBI User Services With hindsight we shouldn't have used rettype="genbank", but it did seem to make things simpler for our documentation and I really hadn't expected the NCBI to change this. After discussion on the mailing list, the plan is to update our documentation and unit tests to use rettype="gb" or "gp" instead of rettype="genbank", and add a special case to Bio.Entrez.efetch to map rettype="genbank" to rettype="gb" (or "gp" for the protein database) and issue a warning that the NCBI have changed their API. At a later point we might change this warning to an error. This would provide a clear transition for end user scripts, and keep us consistent with the official Entrez API. From biopython at maubp.freeserve.co.uk Tue Apr 14 04:53:02 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 14 Apr 2009 09:53:02 +0100 Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format In-Reply-To: <273080.33626.qm@web62408.mail.re1.yahoo.com> References: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com> <273080.33626.qm@web62408.mail.re1.yahoo.com> Message-ID: <320fb6e00904140153w4c659655q64f19540f7bd12b7@mail.gmail.com> On Tue, Apr 14, 2009 at 9:33 AM, Michiel de Hoon wrote: > > I am also in favor of option (2). > > --Michiel > OK. Let's do that then.
I've filed Bug 2811 for this issue, http://bugzilla.open-bio.org/show_bug.cgi?id=2811 Peter From bugzilla-daemon at portal.open-bio.org Tue Apr 14 05:54:23 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 14 Apr 2009 05:54:23 -0400 Subject: [Biopython-dev] [Bug 2811] EFetch returning ASN.1 not GenBank format for rettype=genbank In-Reply-To: Message-ID: <200904140954.n3E9sND0024084@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2811 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-14 05:54 EST ------- Tutorial updated, see Doc/Tutorial.tex revision 1.221 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Tue Apr 14 06:36:03 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 14 Apr 2009 03:36:03 -0700 (PDT) Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090413133539.GD5429@sobchak.mgh.harvard.edu> Message-ID: <322143.67385.qm@web62403.mail.re1.yahoo.com> --- On Mon, 4/13/09, Brad Chapman wrote: > A normal use case would be: > > - Use SeqIO to parse a FASTA file with the sequences => > SeqRecords > - Use the GFFParser to add features from a separate GFF > file to the SeqRecords. These are SeqFeatures, added to > the right records and nested in a parent/child relationship > as appropriate. Usually, when I use a GFF file I either don't have an associated Fasta file, or I am not particularly interested in the original sequences. So while this approach is useful for some people, in its current form it's not exactly generally usable. First, let's discuss how to represent the information contained in a GFF file. SeqRecords are good if the GFF file is associated with a Fasta file (or contains the sequence itself), but if not it seems to be a bit awkward. 
How about the following (and I think Peter was hinting at the same idea): The actual parser lives in Bio.GFF, and produces Bio.GFF.Record objects that closely resemble the GFF file structure. For example, we use the GFF specified fields ( [attributes] [comments]) as attributes to Bio.GFF.Record objects. Bio.SeqIO then uses the parser in Bio.GFF, and puts its information in the appropriate fields of a SeqRecord. Here, we have to think about two cases: Simply creating a SeqRecord based on the GFF file, and adding the information in the GFF file as annotations to a pre-existing set of SeqRecords. (I am not sure if we need a separate function for that, or, as Peter suggested, let the user do that himself, guided by some examples in the documentation). Users then have a choice to use Bio.SeqIO to get SeqRecords, or Bio.GFF to see the "raw" GFF data, depending on their needs. How does that sound? --Michiel From biopython at maubp.freeserve.co.uk Tue Apr 14 07:04:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 14 Apr 2009 12:04:39 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <322143.67385.qm@web62403.mail.re1.yahoo.com> References: <20090413133539.GD5429@sobchak.mgh.harvard.edu> <322143.67385.qm@web62403.mail.re1.yahoo.com> Message-ID: <320fb6e00904140404x35f87a00ude242e6c3c4c7971@mail.gmail.com> On Tue, Apr 14, 2009 at 11:36 AM, Michiel de Hoon wrote: > > Usually, when I use a GFF file I either don't have an associated Fasta file, > or I am not particularly interested in the original sequences. So while this > approach is useful for some people, in its current form it's not exactly > generally usable. > > First, let's discuss how to represent the information contained in a GFF > file. SeqRecords are good if the GFF file is associated with a Fasta file > (or contains the sequence itself), but if not it seems to be a bit awkward. 
I think parsing a GFF file with Bio.SeqIO into SeqRecord object(s) can still be useful even without the sequence. The list of SeqFeature objects belonging to each SeqRecord can be used for example with GenomeDiagram to draw a picture of the organism. Because you lack the sequence, you won't be able to include GC% or GC skew, but it is nice to visualize the annotation all the same. You could also do things like looking for the ratio of genic and inter-genic usage, or hunt for overlapping genes - although for these it may be easier to work with a more low level representation. > How about the following (and I think Peter was hinting at the same idea): > > The actual parser lives in Bio.GFF, and produces Bio.GFF.Record objects > that closely resemble the GFF file structure. For example, we use the > GFF specified fields ( > [attributes] [comments]) as attributes to > Bio.GFF.Record objects. That sounds possible to me - although I haven't given the basic Bio.GFF.Record structure any thought, nor indeed have I examined what data objects Brad is returning at the moment. > Bio.SeqIO then uses the parser in Bio.GFF, and puts its information in the > appropriate fields of a SeqRecord. Yes - much like how Bio.SeqIO calls other modules like Bio.GenBank and Bio.SwissProt now. However, regarding the implementation, I wouldn't automatically insist the Bio.SeqIO GFF wrapper *has* to use a Bio.GFF.Record internally (assuming we have such a thing) as that could be a performance bottleneck. I guess it depends on how simple the Bio.GFF.Record objects are. > Here, we have to think about two cases: > Simply creating a SeqRecord based on the GFF file, and adding the > information in the GFF file as annotations to a pre-existing set of SeqRecords. > (I am not sure if we need a separate function for that, or, as Peter suggested, > let the user do that himself, guided by some examples in the documentation). 
Simply creating SeqRecord objects from a GFF file is the standard Bio.SeqIO approach. For combining data from a GFF file and a FASTA file, this is rather like the FASTA+QUAL situation. Here we do document (in the docstrings, not yet in the tutorial) how to use Bio.SeqIO to read in two sets of SeqRecord objects and combine them, but also provide a "paired file iterator" to do this for you. Right now this function is in Bio.SeqIO.QualityIO, but I am open to moving this and the low level bits to somewhere like Bio.Sequencing.Quality instead (as long as we do this before Biopython 1.50 is released). I have pondered a "paired file iterator" function for Bio.SeqIO for dealing with FASTA+QUAL, FASTA+GFF, FASTA+PPT, etc, which would take TWO file handles and return SeqRecord objects. Interestingly all the examples thus far are FASTA+other. Anyway, this could be added later if need be. > Users then have a choice to use Bio.SeqIO to get SeqRecords, or Bio.GFF to see the "raw" GFF data, depending on their needs. > How does that sound? Pretty much what I had in mind - although as I said, I've not given much thought to how to present the "raw" GFF data. 
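
To make the "paired file iterator" idea concrete, here is a minimal, standard-library-only sketch for the FASTA+QUAL case. It is not the function in Bio.SeqIO.QualityIO: the names read_blocks and paired_iterator are made up for this illustration, and it yields plain (id, sequence, qualities) tuples rather than SeqRecord objects. It also assumes both files list the same records in the same order.

```python
from io import StringIO
from itertools import zip_longest


def read_blocks(handle):
    """Yield (identifier, lines) for each ">"-headed record in a handle."""
    name, lines = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, lines
            fields = line[1:].split()
            name, lines = (fields[0] if fields else ""), []
        elif line and name is not None:
            lines.append(line)
    if name is not None:
        yield name, lines


def paired_iterator(fasta_handle, qual_handle):
    """Walk a FASTA and a QUAL handle in step, yielding combined records."""
    missing = object()  # sentinel marking that one file ran out early
    pairs = zip_longest(read_blocks(fasta_handle), read_blocks(qual_handle),
                        fillvalue=missing)
    for fasta_rec, qual_rec in pairs:
        if fasta_rec is missing or qual_rec is missing:
            raise ValueError("The two files hold different numbers of records")
        (seq_id, seq_lines), (qual_id, qual_lines) = fasta_rec, qual_rec
        if seq_id != qual_id:
            raise ValueError("Identifier mismatch: %r vs %r" % (seq_id, qual_id))
        sequence = "".join(seq_lines)
        # QUAL records hold whitespace-separated integers, possibly wrapped.
        qualities = [int(q) for text in qual_lines for q in text.split()]
        if len(sequence) != len(qualities):
            raise ValueError("Sequence and quality lengths differ for %r" % seq_id)
        yield seq_id, sequence, qualities


if __name__ == "__main__":
    fasta = StringIO(">read1\nACGT\n>read2\nGG\n")
    qual = StringIO(">read1\n30 31\n32 33\n>read2\n20 21\n")
    for record in paired_iterator(fasta, qual):
        print(record)
```

The same shape would work for FASTA+GFF if the second reader grouped feature lines by sequence name, although GFF files are not guaranteed to follow the FASTA record order, so that case may need an index rather than a lock-step walk.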
Peter

From bugzilla-daemon at portal.open-bio.org Tue Apr 14 08:05:07 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Apr 2009 08:05:07 -0400
Subject: [Biopython-dev] [Bug 2811] EFetch returning ASN.1 not GenBank format for rettype=genbank
In-Reply-To:
Message-ID: <200904141205.n3EC570L032323@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2811

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-14 08:05 EST -------
Bio/Entrez/__init__.py CVS revision 1.41
Tests/test_SeqIO_online.py CVS revision 1.7
DEPRECATED CVS revision 1.50

Marking as fixed (although a proof reading of the tutorial wouldn't hurt).

From bugzilla-daemon at portal.open-bio.org Tue Apr 14 19:33:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Apr 2009 19:33:59 -0400
Subject: [Biopython-dev] [Bug 2811] EFetch returning ASN.1 not GenBank format for rettype=genbank
In-Reply-To:
Message-ID: <200904142333.n3ENXxFX018002@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2811

------- Comment #3 from sbassi at gmail.com 2009-04-14 19:33 EST -------
I saw in the online Tutorial this small typo:
"form Bio import SeqIO"

I also have a question regarding this bug: What about adding "gb" as a format type in SeqIO, mapped to "genbank"? This would add consistency (if I retrieve a sequence using "gb" from Entrez, I expect to save it using SeqIO with "gb"). I think it won't hurt to have "gb" as an alias for "genbank" in SeqIO.

From peter at maubp.freeserve.co.uk Tue Apr 14 19:34:02 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 00:34:02 +0100
Subject: [Biopython-dev] Bio.Motif breaks epydoc?
Message-ID: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com>

Hi all,

I forgot to run epydoc when I did Biopython 1.50 beta, but I've just tried and it is failing - apparently due to an issue with Bio.Motif.

First of all there are some warnings which we should probably address now, before the Bio.Motif API is officially released:

Warning: Module Bio.Motif.AlignAceParser is shadowed by a variable with the same name.
Warning: Module Bio.Motif.MEMEParser is shadowed by a variable with the same name.
Warning: Module Bio.Motif.Motif is shadowed by a variable with the same name.

Ignoring these warnings for now, epydoc then crashes for me doing Bio.Motif.Motif.Motif-class.html - which is a bigger problem. This was using Epydoc version 3.0.1 (with python 2.6 on Ubuntu Jaunty). I'll try another machine tomorrow just to make sure this isn't a local setup issue.

Also we should probably fix these "shadowing warnings", they can make the API confusing - in addition to confusing epydoc and making the API doc pages confusing. GenomeDiagram is also doing this, and we should try and fix that too:

Warning: Module Bio.Graphics.GenomeDiagram.Diagram is shadowed by a variable with the same name.
Warning: Module Bio.Graphics.GenomeDiagram.FeatureSet is shadowed by a variable with the same name.
Warning: Module Bio.Graphics.GenomeDiagram.GraphSet is shadowed by a variable with the same name.
Warning: Module Bio.Graphics.GenomeDiagram.Track is shadowed by a variable with the same name.

However it may be a bit late to fix the main source of these warnings, Bio.PDB, without breaking things (i.e. any fix may not be backwards compatible). See also this thread from when I was running epydoc for Biopython 1.49 late last year:
http://lists.open-bio.org/pipermail/biopython-dev/2008-November/004810.html

Peter

From bugzilla-daemon at portal.open-bio.org Tue Apr 14 20:13:49 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Apr 2009 20:13:49 -0400
Subject: [Biopython-dev] [Bug 2811] EFetch returning ASN.1 not GenBank format for rettype=genbank
In-Reply-To:
Message-ID: <200904150013.n3F0DnkE021278@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2811

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-14 20:13 EST -------
(In reply to comment #3)
> I saw in the online Tutorial this small typo:
> "form Bio import SeqIO"

I'd fixed at least one occurrence of that error before, but you are right - there were still two left in CVS. Thanks.

> I also have a question regarding this bug: What about adding "gb" as format
> type in SeqIO, and mapped to "genbank". This would add consistency (if I
> retrieve a sequence using "gb" from Entrez, I expect to save it using SeqIO
> with "gb"). I think it won't hurt to have "gb" as an alias for "genbank" in
> SeqIO.

The reason we have this bug in the first place was that we used an unofficial return type in EFetch in order to use the same format name ("genbank") in both Bio.Entrez and Bio.SeqIO - and this did make the examples straightforward.

Adding aliases (such as "gb", "gp", and maybe also "genpept" for "genbank") might make Bio.Entrez and Bio.SeqIO a little nicer to use together after the changes forced by this bug. There are also several aliases used in EMBOSS that would also make sense (e.g. "pfam" for "stockholm"). On the down side, having more than one name risks confusion. Bring this up on the mailing list if you like.

Leaving this bug as fixed.
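
For readers following Bug 2811, the transition described there (accept the old rettype="genbank", translate it to the official "gb" or "gp", and warn) could be sketched roughly as below. This is an illustration only, not the code committed to Bio/Entrez/__init__.py; the function name normalise_rettype and the exact warning text are assumptions.

```python
import warnings


def normalise_rettype(db, rettype):
    """Translate the legacy rettype="genbank" into the official NCBI value.

    Hypothetical helper: returns "gp" for the protein database and "gb"
    otherwise, warning the caller that the old spelling is no longer
    accepted by the NCBI. Any other rettype is passed through untouched.
    """
    if rettype is not None and rettype.lower() == "genbank":
        official = "gp" if db.lower() == "protein" else "gb"
        warnings.warn(
            'The NCBI no longer accept rettype="genbank"; '
            'using rettype="%s" instead.' % official
        )
        return official
    return rettype


if __name__ == "__main__":
    print(normalise_rettype("nucleotide", "genbank"))
    print(normalise_rettype("protein", "genbank"))
    print(normalise_rettype("protein", "fasta"))
```

Turning the warning into an error at a later release, as suggested in the bug, would only mean replacing the warnings.warn call with a raise.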

Peter

From sbassi at clubdelarazon.org Tue Apr 14 22:05:53 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Tue, 14 Apr 2009 23:05:53 -0300
Subject: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank in SeqIO
Message-ID: <9e2f512b0904141905x69d10c48s95f5a808e1cc430f@mail.gmail.com>

As a follow up to bug 2811 where "gb" is now a valid name in Bio.Entrez, I propose to add "gb" as an alias for "genbank" in SeqIO. This proposal is backward compatible since previous code using "genbank" is unaffected.

The rationale behind my request is consistency. Having fetched a record with

Entrez.efetch(db=db,id=x,rettype='gb')

it seems consistent to save the sequence I got using rettype='gb' with

SeqIO.write(myseq,filehandle,'gb')

--
Sebastián Bassi. Diplomado en Ciencia y Tecnología.

Non standard disclaimer: READ CAREFULLY. By reading this email, you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies ("BOGUS AGREEMENTS") that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer.

From biopython at maubp.freeserve.co.uk Wed Apr 15 05:40:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 10:40:56 +0100
Subject: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank in SeqIO
In-Reply-To: <9e2f512b0904141905x69d10c48s95f5a808e1cc430f@mail.gmail.com>
References: <9e2f512b0904141905x69d10c48s95f5a808e1cc430f@mail.gmail.com>
Message-ID: <320fb6e00904150240l1e6b424tdd34035256876226@mail.gmail.com>

On Wed, Apr 15, 2009 at 3:05 AM, Sebastian Bassi wrote:
> As a follow up to bug 2811 where "gb" is now a valid name in
> Bio.Entrez, ...

Just to note that in Entrez EFetch, using rettype=gb (and the related rettype=gp for proteins in GenPept format) has always been a valid argument (and in fact has always been the documented way to get a GenBank/GenPept file back).

>From my point of view it was a nice feature of Entrez EFetch that they used to (unofficially) support rettype=genbank, which was consistent with Bio.SeqIO. I suppose you could all try lobbying the NCBI to put Entrez EFetch back to the pre Easter 2009 behavior, but realistically we'll just have to live with it.

Now that Entrez EFetch doesn't support the unofficial rettype=genbank argument anymore, we have the current situation where you must use "gb" (or "gp") for Bio.Entrez but "genbank" for Bio.SeqIO. I agree this isn't so nice, but as I wrote on Bug 2811, I'm not keen on having aliases in Bio.SeqIO (but I may be in a minority here, hence suggesting a discussion). On the plus side, EMBOSS offers "gb" (and "ddbj") as alternative aliases for "genbank", so there is precedent.

In a related approach, I suppose we could have Bio.SeqIO take "genbank" to mean GenBank or GenPept as determined from the file or the alphabet (as now), and add "gb" meaning (nucleotide) GenBank files, and "gp" meaning (protein) GenPept files.

But again, this breaks the Python ideal of there being one clear way to do things (having multiple names for the same format).
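
As a rough picture of what the alias proposal would involve, the sketch below resolves alternative format names to a single canonical name before any parser lookup. It is hypothetical and not part of Bio.SeqIO: the table FORMAT_ALIASES and the function canonical_format are invented here, and the alias set is simply the one floated in this thread.

```python
# Hypothetical alias table; keys are alternative names, values are the
# single canonical format name a parser registry would actually know.
FORMAT_ALIASES = {
    "gb": "genbank",
    "gp": "genbank",
    "genpept": "genbank",
    "pfam": "stockholm",
}


def canonical_format(name):
    """Return the canonical format name for a possibly aliased one."""
    if not isinstance(name, str):
        raise TypeError("Format name must be a string")
    return FORMAT_ALIASES.get(name, name)


if __name__ == "__main__":
    for requested in ("gb", "gp", "genbank", "fasta"):
        print(requested, "->", canonical_format(requested))
```

Keeping the mapping in one table means the aliases stay documented in a single place, which goes some way towards the "more than one name risks confusion" concern, though it does not remove it.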

Peter

From peter at maubp.freeserve.co.uk Wed Apr 15 06:43:40 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 11:43:40 +0100
Subject: [Biopython-dev] Bio.Motif breaks epydoc?
In-Reply-To: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com>
References: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com>
Message-ID: <320fb6e00904150343u35f66911pd45520c399e2e5f1@mail.gmail.com>

On Wed, Apr 15, 2009 at 12:34 AM, Peter wrote:
> Hi all,
>
> I forgot to run epydoc when I did Biopython 1.50 beta, but I've just tried
[...]
> we should probably fix these "shadowing warnings", they can make
> the API confusing - in addition to confusing epydoc and making the API
> doc pages confusing. GenomeDiagram is also doing this, and we should
> try and fix that too:
>
> Warning: Module Bio.Graphics.GenomeDiagram.Diagram is shadowed by a
>          variable with the same name.
> Warning: Module Bio.Graphics.GenomeDiagram.FeatureSet is shadowed by a
>          variable with the same name.
> Warning: Module Bio.Graphics.GenomeDiagram.GraphSet is shadowed by a
>          variable with the same name.
> Warning: Module Bio.Graphics.GenomeDiagram.Track is shadowed by a variable
>          with the same name.

The shadowing issue with GenomeDiagram should be OK in CVS now - this was an accidental side effect of renaming the internal modules as part of integrating GenomeDiagram into Biopython. I discussed this with Leighton (off list) and we agreed that renaming the modules was the simplest solution, and opted for adding an underscore which makes it explicit that the modules concerned are intended to be private. This doesn't affect the (intended) public API for GenomeDiagram.

Peter

From mjldehoon at yahoo.com Wed Apr 15 06:57:43 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 15 Apr 2009 03:57:43 -0700 (PDT)
Subject: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank in SeqIO
In-Reply-To: <320fb6e00904150240l1e6b424tdd34035256876226@mail.gmail.com>
Message-ID: <587664.25168.qm@web62402.mail.re1.yahoo.com>

I think it's nice to be consistent with NCBI, and I don't see a big problem in having an alias for GenBank in SeqIO. At least, having "gb" in Bio.Entrez but "genbank" in Bio.SeqIO would go against the principle of least surprise.

--Michiel.

From biopython at maubp.freeserve.co.uk Wed Apr 15 07:01:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 12:01:54 +0100
Subject: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank in SeqIO
In-Reply-To: <587664.25168.qm@web62402.mail.re1.yahoo.com>
References: <320fb6e00904150240l1e6b424tdd34035256876226@mail.gmail.com> <587664.25168.qm@web62402.mail.re1.yahoo.com>
Message-ID: <320fb6e00904150401q209ae99id6746f2a0c4e3532@mail.gmail.com>

On Wed, Apr 15, 2009 at 11:57 AM, Michiel de Hoon wrote:
>
> I think it's nice to be consistent with NCBI, and I don't see a big
> problem in having an alias for GenBank in SeqIO. At least,
> having "gb" in Bio.Entrez but "genbank" in Bio.SeqIO would
> go against the principle of least surprise.

True. Would you support other aliases such as "pfam" for "stockholm", an alias supported in EMBOSS for this alignment format?

Peter

From biopython at maubp.freeserve.co.uk Wed Apr 15 08:21:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 13:21:17 +0100
Subject: [Biopython-dev] Tutorial & Cookbook
In-Reply-To: <320fb6e00904130616t2cd5f029i582f4a2488d1182c@mail.gmail.com>
References: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com> <93403.18413.qm@web62406.mail.re1.yahoo.com> <20090413125255.GC5429@sobchak.mgh.harvard.edu> <320fb6e00904130616t2cd5f029i582f4a2488d1182c@mail.gmail.com>
Message-ID: <320fb6e00904150521q536fa27drd54db5e267876b15@mail.gmail.com>

On Mon, Apr 13, 2009 at 2:16 PM, Peter wrote:
> Speaking of doctests, we should do more of those in our docstrings.
> For our online API documentation at
> http://biopython.org/DIST/docs/api/ it would be nice to have the
> python examples within the docstrings (including the doctests) shown
> with syntax colouring. See
> http://epydoc.sourceforge.net/manual-epytext.html#doctest-blocks for
> an example, and compare this to
> http://biopython.org/DIST/docs/api/Bio.Seq-module.html - maybe we need
> to adjust our indentation?

We currently explicitly use plain text for epydoc, rather than the default epytext markup language. If we switch to epytext (or at least a very simple subset of it, as some of the markup doesn't lend itself to friendly human readable docstrings) then we do get python syntax colouring on the doctests. However, this will require some effort to fine tune the docstrings, and right now it makes a mess of things in some cases.

Peter

From biopython at maubp.freeserve.co.uk Wed Apr 15 09:19:34 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 14:19:34 +0100
Subject: [Biopython-dev] docstrings, doctests and epydoc API pages
Message-ID: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com>

I've changed the thread title to something a little more specific.

On Wed, Apr 15, 2009 at 1:21 PM, Peter wrote:
> We currently explicitly use plain text for epydoc, rather than the
> default epytext markup language. If we switch to epytext (or at least
> a very simple subset of it, as some of the markup doesn't lend itself
> to friendly human readable docstrings) then we do get python syntax
> colouring on the doctests. However, this will require some effort to
> fine tune the docstrings, and right now it makes a mess of things in
> some cases.

As a test, I was able to update Bio/Seq.py to look good as epytext (while still being equally readable as plain text for when reading the API documentation at the python prompt with the help function). I uploaded one new page to the website:

http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html

The rest of the online API pages are currently still from Biopython 1.49, when epydoc parsed the docstrings as plain text. For another example with quite a few docstrings and doctests, look at:

http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html

What do you all think? I don't know if it will encourage more people to look at the API pages, but I certainly like the new version where the doctests are shown boxed with syntax colouring.

Note that this would be a lot easier to do if epydoc supported "plaintext with doctests" as a markup type, or did this automatically when told the markup is just "plaintext" (as I had originally hoped for). I wonder how easy that would be to implement... it might be less work than checking all our API pages by hand and fixing our markup to follow epytext standards.

See also: http://epydoc.sourceforge.net/epytext.html

Peter

From bartek at rezolwenta.eu.org Wed Apr 15 10:43:15 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Wed, 15 Apr 2009 16:43:15 +0200
Subject: [Biopython-dev] Bio.Motif breaks epydoc?
In-Reply-To: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com>
References: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com>
Message-ID: <8b34ec180904150743tffaca5dva03c58bbe2a0eade@mail.gmail.com>

Hi,

I'm working on Bio.Motif to fix this. I'll send a patch later today.

cheers
Bartek

--
Bartek Wilczynski
==================
Postdoctoral fellow
EMBL, Furlong group
Meyerhoffstrasse 1, 69012
Heidelberg, Germany
tel: +49 6221 387 8433

From biopython at maubp.freeserve.co.uk Wed Apr 15 12:46:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 17:46:02 +0100
Subject: [Biopython-dev] docstrings, doctests and epydoc API pages
In-Reply-To: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com>
References: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com>
Message-ID: <320fb6e00904150946y45010c99u8508e8e6fd71eb75@mail.gmail.com>

> As a test, I was able to update Bio/Seq.py to look good as epytext
> (while still being equally readable as plain text for when reading the
> API documentation at the python prompt with the help function). I
> uploaded one new page to the website:
>
> http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html
>
> The rest of the online API pages are currently still from Biopython
> 1.49, when epydoc parsed the docstrings as plain text. For another
> example with quite a few docstrings and doctests, look at:
>
> http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html
>
> What do you all think? I don't know if it will encourage more people
> to look at the API pages, but I certainly like the new version where
> the doctests are shown boxed with syntax colouring.

I've done Bio/SeqIO/QualityIO.py as well which proved harder due to lots of example FASTQ records embedded in the text.

I've also worked out how to set the epydoc markup format on a per file basis with the __docformat__ setting (see also PEP 258).
This means we can gradually convert existing docstrings on a file by file basis - I'd suggest we focus on those with docstrings first, as they will benefit most from this.

The only downside thus far is that the epytext mark up seems rather fragile, and it is easy to "break" a docstring such that epydoc fails to render it nicely. At least epydoc falls back on plain text in this situation, so the text is still human readable.

Tip: You need an EMPTY line before and after each doctest in order for it to work with epydoc as epytext markup. This is annoying as the doctest framework can cope with a line with spaces in it.

Peter

From sbassi at clubdelarazon.org Wed Apr 15 15:19:13 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Wed, 15 Apr 2009 16:19:13 -0300
Subject: [Biopython-dev] Proposal: Parse and read in SeqIO and NCBIXML
Message-ID: <9e2f512b0904151219i60a8eda0xd06c9c86c690b6e3@mail.gmail.com>

In SeqIO there is parse and read. Parse returns an iterable with all the records found in the file, while read returns only one record and is used when we know that the file has only one record. This is OK. But in NCBIXML, there is only parse. If the ncbiblast output has only one record (because it was made from 1 query), now we have to write:
NCBIXML.parse(x).next() or iterate over a "list" of one member. I think it would be nice to add a read method to NCBIXML, such as the one in SeqIO.

From biopython at maubp.freeserve.co.uk Wed Apr 15 17:30:55 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 22:30:55 +0100
Subject: [Biopython-dev] Proposal: Parse and read in SeqIO and NCBIXML
In-Reply-To: <9e2f512b0904151219i60a8eda0xd06c9c86c690b6e3@mail.gmail.com>
References: <9e2f512b0904151219i60a8eda0xd06c9c86c690b6e3@mail.gmail.com>
Message-ID: <320fb6e00904151430i19983fafq43ca1c9395579fb3@mail.gmail.com>

On Wed, Apr 15, 2009 at 8:19 PM, Sebastian Bassi wrote:
> In SeqIO there is parse and read. Parse returns an iterable with all
> the records found in the file, while read returns only one record and
> is used when we know that the file has only one record. This is OK.
> But in NCBIXML, there is only parse. If the ncbiblast output has
> only one record (because it was made from 1 query), now we have to
> write:
> NCBIXML.parse(x).next() or iterate over a "list" of one member. I
> think it would be nice to add a read method to NCBIXML, such as the
> one in SeqIO.

That seems sensible to me, we could probably squeeze that in for Biopython 1.50 too. Could you file an enhancement bug in case I forget about this?

Peter

From bugzilla-daemon at portal.open-bio.org Wed Apr 15 17:42:28 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 15 Apr 2009 17:42:28 -0400
Subject: [Biopython-dev] [Bug 2812] New: Adding read method to NCBIXML (just like SeqIO and SwissProt).
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2812

Summary: Adding read method to NCBIXML (just like SeqIO and SwissProt).
Product: Biopython
Version: 1.50b
Platform: PC
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: sbassi at gmail.com

NCBIXML should have a "read" method. It has a parse method that returns an iterable. If the ncbiblast output has only one record (because it was made from 1 query), now we have to write:
NCBIXML.parse(x).next() or iterate over a "list" of one member.
Other objects like SeqIO and SwissProt have both "read" and "parse" to deal with one entry files. I think for the sake of consistency NCBIXML should also have a read method.
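
The "exactly one record" behaviour requested in this report can be shown independently of NCBIXML. The sketch below is generic and hypothetical (read_single is a made-up name, and a plain list stands in for the parser's iterator); it uses the built-in next() rather than the .next() method shown in the report.

```python
def read_single(records):
    """Return the only item from an iterable, or raise ValueError.

    Mirrors the read/parse split: parse-style functions hand back an
    iterator, while a read-style function insists on exactly one record.
    """
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found") from None
    try:
        next(iterator)
    except StopIteration:
        return first
    raise ValueError("More than one record found")


if __name__ == "__main__":
    print(read_single(["only record"]))
    for bad in ([], ["first", "second"]):
        try:
            read_single(bad)
        except ValueError as err:
            print("ValueError:", err)
```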
From bugzilla-daemon at portal.open-bio.org Wed Apr 15 17:58:15 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 15 Apr 2009 17:58:15 -0400
Subject: [Biopython-dev] [Bug 2812] Adding read method to NCBIXML (just like SeqIO and SwissProt).
In-Reply-To:
Message-ID: <200904152158.n3FLwFYc027155@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2812

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-15 17:58 EST -------
Adding this should do the trick (based on the SeqIO.read function):

def read(handle, debug=0) :
    """Returns a single Blast record (assumes just one query).

    Use the Bio.Blast.NCBIXML.parse() function if you expect more than
    one BLAST record (i.e. if you have more than one query sequence).

    This function is for use when there is one and only one BLAST result.
    """
    iterator = parse(handle, debug)
    # Fetch the first record; an empty handle is an error.
    try :
        first = iterator.next()
    except StopIteration :
        first = None
    if first is None :
        raise ValueError("No records found in handle")
    # Any second record means the caller should have used parse().
    try :
        second = iterator.next()
    except StopIteration :
        second = None
    if second is not None :
        raise ValueError("More than one record found in handle")
    return first

However, on reflection this needs some special testing for when there is a single query giving NO hits. I suspect that means the BLAST XML file will contain no records (at least that's my guess from recent versions - I haven't tried 2.2.20 yet). Would raising a ValueError in this situation be reasonable?

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.

From bartek at rezolwenta.eu.org Wed Apr 15 19:10:03 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 16 Apr 2009 01:10:03 +0200
Subject: [Biopython-dev] Bio.Motif breaks epydoc?
In-Reply-To: <320fb6e00904151514g2b9709fbj7c3de68d88db3f7d@mail.gmail.com> References: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com> <8b34ec180904150743tffaca5dva03c58bbe2a0eade@mail.gmail.com> <8b34ec180904151458k39fec681u53fcf64de9f7590d@mail.gmail.com> <320fb6e00904151514g2b9709fbj7c3de68d88db3f7d@mail.gmail.com> Message-ID: <8b34ec180904151610j4c62b7d7k51be600420aa73c@mail.gmail.com> Hi On Thu, Apr 16, 2009 at 12:14 AM, Peter wrote: > How about putting it in Bio/Motif/_Motif.py? That makes it clear > people are expected to access it via Bio.Motif.Motif, and not go via > the module. This is what Leighton and I did for GenomeDiagram which > was a very similar situation. Using an underscore denotes a private > module, so you could at a later date rename it to something else > without worrying about backwards compatibility (if you do change your > mind). > OK, I'll update the source tomorrow. > Are you planning any documentation to go with this? It would be nice > to include it with Biopython 1.50 but not essential. There is a cookbook-style tutorial in Docs/cookbook/motif. I'm not sure if it's ready for inclusion into the official tutorial. I'm hoping to add some more features soon and then it could be improved and included into the tutorial. cheers Bartek From winda002 at student.otago.ac.nz Wed Apr 15 23:30:43 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Thu, 16 Apr 2009 15:30:43 +1200 Subject: [Biopython-dev] Tutorial & Cookbook In-Reply-To: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com> References: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com> Message-ID: <49E6A663.90900@student.otago.ac.nz> Hi all, Sorry about the delay in replying to this, the Easter holidays are the last chance to play in the sun in the southern hemisphere.
Peter wrote: > David wrote: > >>> For me as a n00b the most useful resource by far has been the cookbook - >>> >>> > > When you said "cookbook", did you mean the Biopython Tutorial & Cookbook? > http://biopython.org/DIST/docs/tutorial/Tutorial.html > http://biopython.org/DIST/docs/tutorial/Tutorial.pdf > > There are a couple of other documents under the "Cookbook" folder here: > http://biopython.org/DIST/docs/cookbook/Restriction.html > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf > I really meant the Tutorial and Cookbook and specifically the examples in it. The first thing I tried to do with BioPython was parse BLAST outputs and actually seeing a loop that would work and I that I could tweak to get what I wanted from by BLAST results was really cool. From my perspective it makes sense to have a tutorial that walks through the main features with some relatively simple examples (like the existing one) with a separate cookbook highlighting what you can actually do when you bring everything together. I think this would fulfill the goals I was talking about in my original post (having nicely documented examples of BioPython in action out there for anyone who's looking) and adding a cookbook catergory to the wiki achieves this with the smallest impediment to participation . If anyone's counting I think that's +3 for wiki and -3 for a new html/pdf document. David From peter at maubp.freeserve.co.uk Thu Apr 16 06:56:23 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Apr 2009 11:56:23 +0100 Subject: [Biopython-dev] Bio.Motif breaks epydoc? In-Reply-To: <8b34ec180904150743tffaca5dva03c58bbe2a0eade@mail.gmail.com> References: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com> <8b34ec180904150743tffaca5dva03c58bbe2a0eade@mail.gmail.com> Message-ID: <320fb6e00904160356i68ca063ak370faa78eda63876@mail.gmail.com> On Wed, Apr 15, 2009 at 3:43 PM, Bartek Wilczynski wrote: > Hi, > > I'm working on Bio.Motif to fix this. [...] 
> > cheers > Bartek Bartek has solved the epydoc problem in CVS now, and I have been able to build the API documentation using a clean installation of Biopython from CVS. :) It looks like the LaTeX equation in Bio/Motif/Motif.py (which was full of backslashes) was causing some of the trouble. Peter From biopython at maubp.freeserve.co.uk Thu Apr 16 12:45:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Apr 2009 17:45:13 +0100 Subject: [Biopython-dev] Where to put command line wrappers Message-ID: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> Hi all, We were recently discussing alignment tools like MUSCLE and ClustalW and putting together a set of command line wrappers under Bio.Align for them. I think Bio.Align.Applications was suggested to match Bio.EMBOSS.Applications. For EMBOSS we have a single file, Bio/Emboss/Applications.py, which has about 15 wrappers (all very similar as the EMBOSS applications are very consistent). This is nice in that all the wrappers are in the Bio.Emboss.Application namespace. Bartek and I have been having a similar discussion for Motif tools, and if the AliceAce wrappers should go in Bio.Motif.Applications to match. For now Bio.Motif has just one wrapper for AlignACE and sister tool CompareACE. Now giving each tool-set its own file is possible (Bio/Motif/Applications/AlignAce.py) but would one (large) file be simpler? (i.e. Bio/Motif/Applications.py). I'm not sure how many wrappers we might eventually expect for multiple sequence alignments, maybe ten or twenty, mostly from different tool sets. Maybe Bio/Align/Applications/Muscle.py etc is the way to go, but we can then import all the command line objects under the Bio.Align.Applications namespace. Any comments? 
Peter From biopython at maubp.freeserve.co.uk Thu Apr 16 13:16:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Apr 2009 18:16:10 +0100 Subject: [Biopython-dev] Where to put command line wrappers In-Reply-To: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> Message-ID: <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> On Thu, Apr 16, 2009 at 5:45 PM, Peter wrote: > Hi all, > > We were recently discussing alignment tools like MUSCLE and ClustalW > and putting together a set of command line wrappers under Bio.Align > for them. I think Bio.Align.Applications was suggested to match > Bio.EMBOSS.Applications. > > For EMBOSS we have a single file, Bio/Emboss/Applications.py, which > has about 15 wrappers (all very similar as the EMBOSS applications are > very consistent). This is nice in that all the wrappers are in the > Bio.Emboss.Application namespace. > > Bartek and I have been having a similar discussion for Motif tools, > and if the AliceAce wrappers should go in Bio.Motif.Applications to > match. For now Bio.Motif has just one wrapper for AlignACE and sister > tool CompareACE. Now giving each tool-set its own file is possible > (Bio/Motif/Applications/AlignAce.py) but would one (large) file be > simpler? (i.e. Bio/Motif/Applications.py). > > I'm not sure how many wrappers we might eventually expect for multiple > sequence alignments, maybe ten or twenty, mostly from different tool > sets. Maybe Bio/Align/Applications/Muscle.py etc is the way to go, > but we can then import all the command line objects under the > Bio.Align.Applications namespace. > > Any comments?
For any that missed the thread last week, I'd like to link back to the end of my post: http://lists.open-bio.org/pipermail/biopython-dev/2009-March/005658.html I see introducing Bio.Align.Applications as chance to get a more consistent approach to Biopython's command line wrappers established (replacing Bio.Clustalw). And as I wrote last month, I think we should focus on the Bio.Application command line wrapper object. For reasons explained in the linked email, I would want to rewrite Bio.Blast.NCBIStandalone in the same way (probably putting the command line wrapper classes in Bio.Blast.Applications, and if there is interesting, include other variants like WUBlast). Are there any other wrappers not using Bio.Application which I have forgotten about? Bio/AlignAce/Applications.py does use Bio.Application, but we are planning to replace this module with Bio.Motif which gives us a chance to review the API without worrying too much about backwards compatibility. As part of moving it to Bio.Motif, I would remove the run methods from AlignAceCommandline and CompareAceCommandline (none of the other Biopython command line objects have them as far as I know), and also remove the AlignAce and CompareAce helper functions (in Bio/AlignAce/AlignAceStandalone.py and Bio/AlignAce/CompareAceStandalone.py). Internally these all call the Bio.Application.generic_run function, and return stdout and stderr as wrapped StringIO handles. Because it reads in all the stdout and stderr output into memory, Bio.Application.generic_run function is only suitable for tools with print very little to the console (or nothing, in which case the return values can be ignored). This method is useless on things like BLAST XML output to stdout which can be hundreds of megabytes in size. 
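As a hedged illustration of the memory concern above: a minimal sketch using only the standard subprocess module (not Biopython code), in which the child's stdout is consumed line by line instead of being read into memory first. The command run here is a placeholder Python one-liner standing in for a real tool, and count_output_lines is a hypothetical helper name.

```python
import subprocess
import sys


def count_output_lines(command):
    """Run command and count its stdout lines without storing the output."""
    # Only stdout is piped; stderr is left attached to the parent process,
    # so there is no second pipe that could fill up and block the child.
    child = subprocess.Popen(command, stdout=subprocess.PIPE,
                             universal_newlines=True)
    count = 0
    for _line in child.stdout:
        # Each line could be handed to a parser here and then discarded.
        count += 1
    child.stdout.close()
    if child.wait() != 0:
        raise RuntimeError("command failed: %r" % (command,))
    return count


# Placeholder command: a child Python process printing 1000 lines.
placeholder = [sys.executable, "-c", "for i in range(1000): print(i)"]
line_count = count_output_lines(placeholder)
```

The caller decides which handles to capture, so large output can be processed as a stream.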
I would generally discourage the use of the Bio.Application.generic_run function and instead we should give examples using the command line object together with the subprocess module (Python 2.3 doesn't have subprocess, but Biopthyon 1.50 will be the last release to care about this) which lets the user choose what if any handles they care about. Peter From bartek at rezolwenta.eu.org Thu Apr 16 13:37:29 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 16 Apr 2009 19:37:29 +0200 Subject: [Biopython-dev] Fwd: Where to put command line wrappers In-Reply-To: <8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com> References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> <8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com> Message-ID: <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com> Hi All, On Thu, Apr 16, 2009 at 5:45 PM, Peter wrote: > For EMBOSS we have a single file, Bio/Emboss/Applications.py, which > has about 15 wrappers (all very similar as the EMBOSS applications are > very consistent). ?This is nice in that all the wrappers are in the > Bio.Emboss.Application namespace. > > Bartek and I have been having a similar discussion for Motif tools, > and if the AliceAce wrappers should go in Bio.Motif.Applications to > match. ?For now Bio.Motif has just one wrapper for AlignACE and sister > tool CompareACE. ?Now giving each tool-set its own file is possible > (Bio/Motif/Applications/AlignAce.py) but would one (large) file be > simpler? (i.e. Bio/Motif/Applications.py). > I think that there is a difference between EMBOSS and Bio.[Motif|Align]. In EMBOSS we have a very nicely comoditized set of tools with similar interfaces, while both for multiple alignment and motif searching the tools vary a lot. 
In case of multiple alignments this is only with respect to parameters and output format, while in motif searching there are also many differences in the types of input (background models etc.). Also, quite likely the parsers for different tools will be written by different people. In this case, I think that it's much easier from the maintainer's point of view to have a directory with separate files rather than a single module. If people are scared by nested namespaces, we can import the important classes into the higher level. >> I'm not sure how many wrappers we might eventually expect for multiple >> sequence alignments, maybe ten or twenty, mostly from different tool >> sets. Maybe Bio/Align/Applications/Muscle.py etc is the way to go, >> but we can then import all the command line objects under the >> Bio.Align.Applications namespace. >> +1 from me. > > Bio/AlignAce/Applications.py does use Bio.Application, but we are > planning to replace this module with Bio.Motif which gives us a chance > to review the API without worrying too much about backwards > compatibility. As part of moving it to Bio.Motif, I would remove the > run methods from AlignAceCommandline and CompareAceCommandline (none > of the other Biopython command line objects have them as far as I > know), and also remove the AlignAce and CompareAce helper functions > (in Bio/AlignAce/AlignAceStandalone.py and > Bio/AlignAce/CompareAceStandalone.py). Internally these all call the > Bio.Application.generic_run function, and return stdout and stderr as > wrapped StringIO handles. > > Because it reads in all the stdout and stderr output into memory, > Bio.Application.generic_run function is only suitable for tools with > print very little to the console (or nothing, in which case the return > values can be ignored). This method is useless on things like BLAST > XML output to stdout which can be hundreds of megabytes in size.
I > would generally discourage the use of the Bio.Application.generic_run > function and instead we should give examples using the command line > object together with the subprocess module (Python 2.3 doesn't have > subprocess, but Biopthyon 1.50 will be the last release to care about > this) which lets the user choose what if any handles they care about. Motif finding programs usually output a lot less than there is input. Normally, you don't want to see more than 10 motifs and each contributes ~1kb so I don't see this as a huge problem in this case. To be honest, I'm not too keen on rewriting this old code (as well as the MEME parser which was contributed by Jason Hackney). But if there are any new motif parsers (I'd like to have Weeder and RSAT one day...) I'm happy to conform to any (reasonable) policy. cheers Bartek From biopython at maubp.freeserve.co.uk Thu Apr 16 14:53:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Apr 2009 19:53:03 +0100 Subject: [Biopython-dev] Fwd: Where to put command line wrappers In-Reply-To: <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com> References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> <8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com> <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com> Message-ID: <320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com> On 4/16/09, Bartek Wilczynski wrote: > Hi All, > > On Thu, Apr 16, 2009 at 5:45 PM, Peter wrote: > > For EMBOSS we have a single file, Bio/Emboss/Applications.py, which > > has about 15 wrappers (all very similar as the EMBOSS applications are > > very consistent). This is nice in that all the wrappers are in the > > Bio.Emboss.Application namespace. > > > > Bartek and I have been having a similar discussion for Motif tools, > > and if the AliceAce wrappers should go in Bio.Motif.Applications to > > match.
For now Bio.Motif has just one wrapper for AlignACE and sister > > tool CompareACE. Now giving each tool-set its own file is possible > > (Bio/Motif/Applications/AlignAce.py) but would one (large) file be > > simpler? (i.e. Bio/Motif/Applications.py). > > > I think that there is a difference between EMBOSS and > Bio.[Motif|Align]. In EMBOSS we have a very nicely comoditized > set of tools with similar interfaces, while both for multiple > alignment and motif searching the tools vary a lot. In case of > multiple alignments this is only with respect to parameters and > output format, while in motif searching there is also a lot of > differences in the types of input (background models etc.). That is a good argument for using Bio/Align/Applications/XXX.py and Bio/Motif/Applications/XXX.py while also having Bio/EMBOSS/Applications.py > Also, quite likely the parsers for different tools will be written by > different people. Biopython's command line wrappers can be quite separate from the parsers - this is a natural break. One can be useful without the other, and keeping them separate allows you to for example use a Biopython wrapper with another parser, or vice versa. > In this case, I think that it's much easier from the maintainers point > of view to have a directory with separate files rather than a single > module. [...] True. > >> I'm not sure how many wrappers we might eventually expect for multiple > >> sequence alignments, maybe ten or twenty, mostly from different tool > >> sets. Maybe Bio/Align/Applications/Muscle.py etc is the way to go, > >> but we can then import all the command line objects under the > >> Bio.Align.Applications namespace. > > +1 from me. > > > Bio/AlignAce/Applications.py does use Bio.Application, but we are > > planning to replace this module with Bio.Motif which gives us a chance > > to review the API without worrying too much about backwards > > compatibility. 
As part of moving it to Bio.Motif, I would remove the > > run methods from AlignAceCommandline and CompareAceCommandline (none > > of the other Biopython command line objects have them as far as I > > know), and also remove the AlignAce and CompareAce helper functions > > (in Bio/AlignAce/AlignAceStandalone.py and > > Bio/AlignAce/CompareAceStandalone.py). Internally these all call the > > Bio.Application.generic_run function, and return stdout and stderr as > > wrapped StringIO handles. > > > > Because it reads in all the stdout and stderr output into memory, > > Bio.Application.generic_run function is only suitable for tools with > > print very little to the console (or nothing, in which case the return > > values can be ignored). This method is useless on things like BLAST > > XML output to stdout which can be hundreds of megabytes in size. I > > would generally discourage the use of the Bio.Application.generic_run > > function and instead we should give examples using the command line > > object together with the subprocess module (Python 2.3 doesn't have > > subprocess, but Biopthyon 1.50 will be the last release to care about > > this) which lets the user choose what if any handles they care about. > > Motif finding programs usually output a lot less than there is input. Normally, > you don't want to see more than 10 motifs and each contributes ~1kb so > I don't see this as a huge problem in this case. I can see that Bio.Application.generic_run function is often handy, but sometimes it is quite inappropriate. For AlignAce obviously it has sufficed. > To be honest, I'm not too keen on rewriting this old code (as well as > MEME parser which was contributed by Jason Hackney). But if there > will be any new motif parsers (I'd like to have weederand RSAT one > day...) I'm happy to conform to any (reasonable) policy. 
In the AlignAce case, in the above I wasn't suggesting rewriting, rather removing some of the what I saw as redundant bits (in an effort at consistency). On reflection, perhaps the core Bio.Application.AbstractCommandline object might benefit from some "run" like methods? However they do morph it from a command line string representation into something bigger... feature creep! ;) Peter From biopython at maubp.freeserve.co.uk Thu Apr 16 16:16:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Apr 2009 21:16:04 +0100 Subject: [Biopython-dev] Where to put command line wrappers In-Reply-To: <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> Message-ID: <320fb6e00904161316m62162af2s506442502b73c8bc@mail.gmail.com> > I see introducing Bio.Align.Applications as chance to get a more > consistent approach to Biopython's command line wrappers established > (replacing Bio.Clustalw). And as I wrote last month, I think we > should focus on the Bio.Application command line wrapper object. For > reasons explained in the linked email, I would want to rewrite > Bio.Blast.NCBIStandalone in the same way (probably putting the command > line wrapper classes in Bio.Blast.Applications, and if there is > interesting, include other variants like WUBlast). Are there any > other wrappers not using Bio.Application which I have forgotten about? Funnily enough, there already is a Bio.Blast.Applications module containing a wrapper for NCBI Fasta and NCBI blastall (a little out of data, also nothing for rpsblast or blastpgpg). The older Bio.Blast.NCBIStandalone was never updated to use this internally. Here's a nice little job for after Biopython 1.50 is out... 
Peter From bugzilla-daemon at portal.open-bio.org Thu Apr 16 18:40:53 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 16 Apr 2009 18:40:53 -0400 Subject: [Biopython-dev] [Bug 2809] Adding startswith and endswith methods to the Seq object In-Reply-To: Message-ID: <200904162240.n3GMerIj001589@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2809 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-16 18:40 EST ------- Checked in after discussion on the mailing list. Checking in Bio/Seq.py; /home/repository/biopython/biopython/Bio/Seq.py,v <-- Seq.py new revision: 1.76; previous revision: 1.75 done Checking in Tests/test_Seq_objs.py; /home/repository/biopython/biopython/Tests/test_Seq_objs.py,v <-- test_Seq_objs.py new revision: 1.5; previous revision: 1.4 done Marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 16 18:40:54 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 16 Apr 2009 18:40:54 -0400 Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string? In-Reply-To: Message-ID: <200904162240.n3GMesOq001602@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2351 Bug 2351 depends on bug 2809, which changed state. 
Bug 2809 Summary: Adding startswith and endswith methods to the Seq object http://bugzilla.open-bio.org/show_bug.cgi?id=2809 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From winda002 at student.otago.ac.nz Fri Apr 17 01:31:45 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Fri, 17 Apr 2009 17:31:45 +1200 Subject: [Biopython-dev] Cookbook recipes on the wiki Message-ID: <49E81441.8040906@student.otago.ac.nz> Hi all, In the recent thread about the cookbook style entries in the tutorial everyone that had an opinion seemed to think it was best to incorporate these into the wiki. I've made a very small start at doing this with a category on the wiki (http://biopython.org/wiki/Category:Cookbook) and an example of what an entry in the cookbook might look like (http://biopython.org/wiki/Split_fasta_file). What do people think of these? If we decide this is the way to go then to have an entry turn up in the cookbook category you need only to add [[Category:Cookbook]] to an entry david From biopython at maubp.freeserve.co.uk Fri Apr 17 05:32:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 10:32:32 +0100 Subject: [Biopython-dev] Cookbook recipes on the wiki In-Reply-To: <49E81441.8040906@student.otago.ac.nz> References: <49E81441.8040906@student.otago.ac.nz> Message-ID: <320fb6e00904170232i75d88a73p5738e54a32de8bdf@mail.gmail.com> On Fri, Apr 17, 2009 at 6:31 AM, David Winter wrote: > Hi all, > > In the recent thread about the cookbook style entries in the tutorial > everyone that had an opinion seemed to think it was best to incorporate > these into the wiki. 
I've made a very small start at doing this with a > category on the wiki (http://biopython.org/wiki/Category:Cookbook) and an > example of what an entry in the cookbook might look like > (http://biopython.org/wiki/Split_fasta_file). > > What do people think of these? If we decide this is the way to go then to > have an entry turn up in the cookbook category you need only to add > [[Category:Cookbook]] to an entry We'd previously discussed using a cookbook category on the wiki, and that looks good: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005715.html I'm tempted to get rid of Category:Wiki_Documentation though - it seems a bit redundant, almost everything on the wiki is documentation. At least rename this to Category:Documentation? Peter From biopython at maubp.freeserve.co.uk Fri Apr 17 07:08:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 12:08:12 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <200904171246.46568.jblanca@btc.upv.es> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> Message-ID: <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> On Fri, Apr 17, 2009 at 11:46 AM, Jose Blanca wrote: > Hi Peter: > Here you have some code to read the sff files. Thanks - I'm not sure when I'll get to look at this, maybe next week. > For the time being it creates a dict for the sequences. I'm not sure about > how to integrate the generated data in BioPython. The sequence and > qualities should go to a SeqRecord, but there is also the information > about the clipping. For Bio.SeqIO, we would need to use a SeqRecord. Ideally we'd want to be able to read and write SFF files, and to do that we'll have to record all the essential annotation (i.e. clipping) somehow. Can you write SFF files? 
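To make the clipping question concrete, here is a minimal sketch using plain Python strings rather than SeqRecord objects. It assumes hypothetical 0-based, half-open clip coordinates (clip_start, clip_end), which may not match how SFF files store them, and the function names are illustrative only. It shows clipping kept as a boolean per-letter mask, marked by letter case, or applied as a trim.

```python
def clip_mask(length, clip_start, clip_end):
    """Return one boolean per letter: True where the letter is kept."""
    return [clip_start <= index < clip_end for index in range(length)]


def mark_clipping_by_case(sequence, clip_start, clip_end):
    """Upper case for the kept region, lower case for the clipped ends."""
    return (sequence[:clip_start].lower()
            + sequence[clip_start:clip_end].upper()
            + sequence[clip_end:].lower())


def trim(sequence, clip_start, clip_end):
    """Return only the kept region of the sequence."""
    return sequence[clip_start:clip_end]


read = "ACGTACGT"
mask = clip_mask(len(read), 2, 6)
cased = mark_clipping_by_case(read, 2, 6)
trimmed = trim(read, 2, 6)
```

The mask and the case marking both keep the full read, so the clip positions can be recovered later; trimming discards that information.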
> For my work I use a kind of SeqRecord with a mask property and the > mask is a Location that shows which part of the sequence is ok. I don't > know if that's a valid model for BioPython. A mask could be done as a list of booleans, and we can treat it as another per-letter-annotation in the SeqRecord. I'm not sure if this is helpful or not. The Roche tools let you choose to extract trimmed reads as FASTA and QUAL, or untrimmed. Perhaps for reading SFF files with Bio.SeqIO we should get the user to choose between these options (e.g. format names "roche-sff" and "roche-sff-notrim")? Roche's FASTA files use upper case for the trimmed region, and lower case for the start/end which would get trimmed off. This is simple and we could do this for Biopython too - meaning you'd get the same data if you read the SFF file directly, or used Roche's FASTA+QUAL files with SeqIO. Note that when reading an SFF file directly, we should probably record the real trim data as well. > In the extract_sff script we generated three files: the fasta sequences, > the fasta qualities and the xml with the clippings. > One option could be to clip the sequences, but I don't know if that's the > desired behaviour in all cases. Trimming is probably a sensible default. If we do give the untrimmed sequences, we'd need a way to easily trim them. > There's also a couple of more tricks with the clipping. > In theory there's clip_qual and clip_adapter, but in the files > we've seen clip_adapter is always zero and clip_quality is used > instead for both quality and adapter. I think we could generate > one clipping combining both. Let me know what do you think. > Also take into account that in some cases the generated clipping > from the 454 software are just wrong. I'll need to learn more about the details before coming to any conclusions about how to deal with this information in Biopython. > If you want to forward this mail to the list you're more than welcome. 
> Best regards, > > Jose Blanca I've CC'd this reply to the list (without the python file attachments). Regards, Peter From chapmanb at 50mail.com Fri Apr 17 09:23:34 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 17 Apr 2009 09:23:34 -0400 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> <20090413133539.GD5429@sobchak.mgh.harvard.edu> <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com> Message-ID: <20090417132334.GA16092@sobchak.mgh.harvard.edu> Peter, Michiel and Jared; Thanks for the comments. My apologies for the late reply; I've been sick the past few days and am trying to catch back up. All your points from the different posts are consolidated below. [Michiel] > First, let's discuss how to represent the information contained in > a GFF file. SeqRecords are good if the GFF file is associated with > a Fasta file (or contains the sequence itself), but if not it seems > to be a bit awkward. How about the following (and I think Peter was > hinting at the same idea): > > The actual parser lives in Bio.GFF, and produces Bio.GFF.Record > objects that closely resemble the GFF file structure. For example, we > use the GFF specified fields ( > [attributes] [comments]) as attributes > to Bio.GFF.Record objects. The GFF parser right now is really generating SeqFeature objects for each GFF line; the top level SeqRecords are a collection that holds the individual features. The SeqFeature object is pretty similar to GFF and the generic object you are proposing. For instance, here is a GFF line and the relevant attributes from SeqFeature for the line: I Orfeome PCR_product 12759747 12764936 . - . 
PCR_product "mv_B0019.1" ; Amplified 1 ; Amplified 1

type: PCR_product
location: [12759746:12764936]
strand: -1
qualifiers:
    Key: amplified, Value: ['1']
    Key: pcr_product, Value: ['mv_B0019.1']
    Key: source, Value: ['Orfeome']

Things are a bit more generalized as key/value pairs in qualifiers, but the mapping is straightforward. My only suggestion would be that we add 'start' and 'end' accessors to SeqFeature that map to feature.location.nofuzzy_start and feature.location.nofuzzy_end, respectively. SeqFeature is more generalized, for GenBank location nastiness, but we should make the common simple case simpler.

> Bio.SeqIO then uses the parser in Bio.GFF, and puts its information
> in the appropriate fields of a SeqRecord. Here, we have to think
> about two cases: Simply creating a SeqRecord based on the GFF file,
> and adding the information in the GFF file as annotations to a
> pre-existing set of SeqRecords.

Yes. Both of these cases are handled now -- a user can supply a seed dictionary of SeqRecords to which SeqFeatures are added. Alternatively, a new SeqRecord is created for features if one is not provided.

> Users then have a choice to use Bio.SeqIO to get SeqRecords, or
> Bio.GFF to see the "raw" GFF data, depending on their needs.
>
> How does that sound?

So we could have two ways to access the GFF file:

- An iterator that returns SeqFeature objects for each line in the file. No other processing is done.
- The higher level interface that we have been discussing, which adds them to records and nests features.

My only question is concerning the nested features, like coding sequences. This is a very common GFF case (see http://www.sequenceontology.org/gff3.shtml; The Canonical Gene section for the GFF). A raw parser iterator cannot handle these as it needs to read multiple lines to build the nested feature. Is this still useful for the use cases you were thinking of?

[Peter]
> Hmm.
I'm with you on the idea that you may need to parse a GFF file > and a separate second file to get the actual sequence (e.g. a FASTA > file), but there is more than one way to combine the two. For a > single sequence, I was thinking more along the lines of: > > from Bio import SeqIO > record = SeqIO.read(open("NC_000913.fna"),"fasta") > record.features = SeqIO.read(open("NC_000913.gff"),"gff3").features Makes sense, but this only works for the case where you have a single FASTA sequence and a single GFF file describing one record. This is a special case for bacterial genomes and GFF from NCBI, but doesn't work for other eukaryotic GFFs and SOLiD GFF files. Do we want different ways to use the parser for custom cases? > If the FASTA and GFF file apply to multiple sequences (e.g. a set of > contigs, rather than a single chromosome), and you have enough memory, > then something using dictionaries should work: > > from Bio import SeqIO > records = SeqIO.to_dict(SeqIO.parse(open("NC_000913.fna"),"fasta")) > for temp_rec in SeqIO.parse(open("NC_000913.gff"),"gff3") : > records[temp_rec.id].features = temp_rec.features Your intention makes good sense here, and this is more or less what it is doing under the covers. Could we think about expanding SeqIO to have functionality for this "adding to a record" case? Something like: from Bio import SeqIO records = SeqIO.to_dict(SeqIO.parse(open("NC_000913.fna"),"fasta")) records = SeqIO.add_to_dict(records, open("NC_000913.gff"), "gff3") This exposes less of the actual implementation details to the user. > As you can probably tell, I am concentrating on getting this to match > up well with the Bio.SeqIO framework. It will be nice to know the > underlying Bio.GFF module has more options, but I expect most people > to start with reading in a GFF file using Bio.SeqIO, and being able to > transfer their existing knowledge of SeqFeature objects learnt from > using Bio.SeqIO to read in GenBank files.
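As a rough illustration of the "adding to a record" idea discussed above (attaching parsed GFF features to records that already exist, keyed by ID, and creating a record when none was supplied), here is a minimal sketch in plain Python. It deliberately does not use Biopython: the Record class, the add_features_to_dict function and the (rec_id, feature) pair layout are hypothetical stand-ins for illustration, not an existing Bio.SeqIO API.

```python
class Record(object):
    """Hypothetical stand-in for a SeqRecord: an id, a sequence, a feature list."""

    def __init__(self, id, seq=None):
        self.id = id
        self.seq = seq
        self.features = []


def add_features_to_dict(records, feature_rows):
    """Attach (rec_id, feature) pairs to records, creating records as needed."""
    for rec_id, feature in feature_rows:
        if rec_id not in records:
            # No seed record was supplied for this id, so start an empty one.
            records[rec_id] = Record(rec_id)
        records[rec_id].features.append(feature)
    return records


# Seed dictionary, as might come from a FASTA file (sequence is made up).
records = {"I": Record("I", seq="ACGT")}

# Parsed GFF lines reduced to (rec_id, feature) pairs; the second is made up.
rows = [
    ("I", {"type": "PCR_product", "location": (12759746, 12764936), "strand": -1}),
    ("II", {"type": "gene", "location": (0, 100), "strand": 1}),
]
records = add_features_to_dict(records, rows)
```

Because the rows are consumed one at a time, the same function would also accept a generator, which matters for the large eukaryotic GFF files mentioned in this thread.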
I'm really glad you are thinking about it from this angle. The limit cases will be pretty common for real life work; most of the eukaryotic GFF dumps from Ensembl or wherever are quite large and are going to need some intelligent parsing to not get into memory issues. I worry that if we try to put this right on top of the existing SeqIO functionality, which deals with different kinds of files, we are going to clutter the interface. > I have pondered a "paired file iterator" function for Bio.SeqIO for > dealing with FASTA+QUAL, FASTA+GFF, FASTA+PPT, etc, which would take > TWO file handles and return SeqRecord objects. Interestingly all the > examples thus far are FASTA+other. Anyway, this could be added later > if need be. I like the way you did this for FASTA/Qual files but am not sure if it would map nicely to GFF for the memory reasons mentioned above. [MapReduce] > Are you aware of any alternatives to disco for doing map/reduce on > Python, and does that impact your design choices? Jared is right on; Hadoop is another MapReduce framework in wide use. More generally, I agree with you; the distributed portion needs to be generalized. Let's lock down the interface and local parsing, and then I will circle around on that again. Thanks all again for the thoughts, Brad From chapmanb at 50mail.com Fri Apr 17 09:30:20 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 17 Apr 2009 09:30:20 -0400 Subject: [Biopython-dev] docstrings, doctests and epydoc API pages In-Reply-To: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com> References: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com> Message-ID: <20090417133020.GB16092@sobchak.mgh.harvard.edu> Peter; > As a test, I was able to update Bio/Seq.py to look good as epytext > (while still being equally readable as plain text for when reading the > API documentation at the python prompt with the help function).
I > uploaded one new page to the website: > > http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html I had some time where I was obsessed with making Biopython look good in one of these API documentation modules (maybe HappyDoc, back in the day). Eventually I came to the sad conclusion that not too many people really seem to actually look at auto generated API docs. Most will fire up the code in their favorite editor if they are interested in the fine details. So, I like the way this looks, but my vote is it is probably not worth the cycles unless you are having fun with it. Also, be ready to get mad when the preferred method of markup changes from epytext to structuredtext or someothertext. Brad From biopython at maubp.freeserve.co.uk Fri Apr 17 09:45:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 14:45:11 +0100 Subject: [Biopython-dev] docstrings, doctests and epydoc API pages In-Reply-To: <20090417133020.GB16092@sobchak.mgh.harvard.edu> References: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com> <20090417133020.GB16092@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904170645u463d0b4ej8be66735bd2889e3@mail.gmail.com> On Fri, Apr 17, 2009 at 2:30 PM, Brad Chapman wrote: > Peter; > >> As a test, I was able to update Bio/Seq.py to look good as epytext >> (while still being equally readable as plain text for when reading the >> API documentation at the python prompt with the help function). I >> uploaded one new page to the website: >> >> http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html > > I had some time where I was obsessed with making Biopython look > good in one of these API documentation modules (maybe HappyDoc, > back in the day). Eventually I came to the sad conclusion that not too > many people really seem to actually look at auto generated API docs. > Most will fire up the code in their favorite editor if they are > interested in the fine details.
I agree that we don't push the API docs page enough (and indeed corresponding built in documentation). This is a shame, as the built in docstrings should really get more attention. To try and raise their profile I've added links to the relevant pages from some of the wiki pages to try and encourage people to look at them. There is probably a cunning redirect link which will get the frames to work, but I've just used deep linking on these pages for now: http://biopython.org/wiki/Seq http://biopython.org/wiki/SeqRecord http://biopython.org/wiki/SeqIO http://biopython.org/wiki/AlignIO In fact, maybe we should simplify/remove these wiki pages and just push the API pages and relevant cookbook wiki pages in their place? Up until now, the wiki was nicer in that it looked better - with the epydoc mark up that isn't the case. The API docs should be the definitive documentation, in that they are kept up to date with the code, and are under version control. > So, I like the way this looks, but my vote is it is probably not > worth the cycles unless you are having fun with it. Also, be ready > get mad when the preferred method of markup changes from epytext to > structuredtext or someothertext. I know what you mean - the novelty has worn off now, and doing further conversions is tedious. I like the idea of a tweak to epydoc to do "plain text + automatic markup of doctests". If that existed it would be a great default option for Biopython, as all I really care about for the markup is getting the python doctests to look good.
Peter From chapmanb at 50mail.com Fri Apr 17 10:02:41 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 17 Apr 2009 10:02:41 -0400 Subject: [Biopython-dev] Fwd: Where to put command line wrappers In-Reply-To: <320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com> References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> <8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com> <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com> <320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com> Message-ID: <20090417140241.GD16092@sobchak.mgh.harvard.edu> Hi all; [Where to put the commandline objects] > > I think that there is a difference between EMBOSS and > > Bio.[Motif|Align]. In EMBOSS we have a very nicely comoditized > > set of tools with similar interfaces, while both for multiple > > alignment and motif searching the tools vary a lot. In case of > > multiple alignments this is only with respect to parameters and > > output format, while in motif searching there is also a lot of > > differences in the types of input (background models etc.). > > That is a good argument for using Bio/Align/Applications/XXX.py and > Bio/Motif/Applications/XXX.py while also having > Bio/EMBOSS/Applications.py There is a natural tension between overgeneralizing and dumping too much into one file. At one end you have deeply nested Java-like directories with a few lines of code in each file. I tend towards the "more in a single file and less nesting" camp. My vote would be that if the Motif Applications file will only contain commandline wrappers, they could live in one file. [generic_run] > > Motif finding programs usually output a lot less than there is input. Normally, > > you don't want to see more than 10 motifs and each contributes ~1kb so > > I don't see this as a huge problem in this case. 
> > I can see that Bio.Application.generic_run function is often handy, > but sometimes it is quite inappropriate. For AlignAce obviously it > has sufficed. Yeah, generic_run is not as generic as it should be. It does have a lot of hard fought logic for working with multiple python versions and windows/unix. Could we make generic_run appropriate for the big standard out cases so we don't end up duplicating that in Blast/Clustalw/wherever runners? Brad From biopython at maubp.freeserve.co.uk Fri Apr 17 10:13:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 15:13:18 +0100 Subject: [Biopython-dev] Fwd: Where to put command line wrappers In-Reply-To: <20090417140241.GD16092@sobchak.mgh.harvard.edu> References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> <8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com> <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com> <320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com> <20090417140241.GD16092@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904170713p30dc4d51m284c897ec1b9b505@mail.gmail.com> On Fri, Apr 17, 2009 at 3:02 PM, Brad Chapman wrote: >> >> I can see that Bio.Application.generic_run function is often handy, >> but sometimes it is quite inappropriate. For AlignAce obviously it >> has sufficed. > > Yeah, generic_run is not as generic as it should be. It does have a > lot of hard fought logic for working with multiple python versions > and windows/unix. Could we make generic_run appropriate for the big > standard out cases so we don't end up duplicating that in > Blast/Clustalw/wherever runners? The AlignAce and Clustalw already call generic_run internally - and for them it is fine. For BLAST, by default the output goes to standard out, so generic_run is a bad idea as this loads all of stdout into memory.
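To make the memory concern concrete, here is a minimal sketch of the alternative: reading a child process's standard output one line at a time instead of collecting it all first. It uses only the standard library subprocess module; iter_stdout_lines is a hypothetical helper name (not part of Bio.Application), and the child command is a small Python one-liner standing in for a tool such as BLAST.

```python
import subprocess
import sys


def iter_stdout_lines(cmd):
    """Run cmd and yield its standard output one line at a time."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)
    try:
        for line in proc.stdout:
            # Only the current line is held in memory here.
            yield line.rstrip("\n")
    finally:
        proc.stdout.close()
        proc.wait()


# Stand-in for a program that writes a lot to standard out.
child = [sys.executable, "-c", "for i in range(3): print('hit %d' % i)"]

# Collected into a list only so the result is easy to inspect; real code
# would handle each line as it arrives.
lines = list(iter_stdout_lines(child))
```

A parser that accepts a file handle could be given proc.stdout directly in the same spirit, which is one possible shape for a "variation on generic_run".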
We may want to add some variations on generic_run for this kind of usage, or say it is up to the user to deal with it as appropriate for their setup. Peter From p.j.a.cock at googlemail.com Fri Apr 17 10:40:42 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 17 Apr 2009 15:40:42 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090417132334.GA16092@sobchak.mgh.harvard.edu> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> <20090413133539.GD5429@sobchak.mgh.harvard.edu> <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com> <20090417132334.GA16092@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904170740y17aec13cta2137d6d45f43e7c@mail.gmail.com> On Fri, Apr 17, 2009 at 2:23 PM, Brad Chapman wrote: > Things are a bit more generalized as key/value pairs in qualifiers, > but the mapping straightforward. My only suggestion would be that we > add 'start' and 'end' accessors to SeqFeature that map to > feature.location.nofuzzy_start and feature.location.nofuzzy_end, > respectively. SeqFeature is more generalized, for GenBank location > nastiness, but we should make the common simple case simpler. The SeqFeature already has start and end "attributes", but they are done with some magic in __getattr__, I was planning to update this to use a modern python property get. 
I can't find an enhancement bug on this so it may just have been on my mental to do list ;) See also, http://lists.open-bio.org/pipermail/biopython/2007-September/003703.html Peter From biopython at maubp.freeserve.co.uk Fri Apr 17 11:25:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 16:25:35 +0100 Subject: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank in SeqIO In-Reply-To: <320fb6e00904150401q209ae99id6746f2a0c4e3532@mail.gmail.com> References: <320fb6e00904150240l1e6b424tdd34035256876226@mail.gmail.com> <587664.25168.qm@web62402.mail.re1.yahoo.com> <320fb6e00904150401q209ae99id6746f2a0c4e3532@mail.gmail.com> Message-ID: <320fb6e00904170825w191f7c90p9cb7f175e3f5be17@mail.gmail.com> On Wed, Apr 15, 2009 at 12:01 PM, Peter wrote: > On Wed, Apr 15, 2009 at 11:57 AM, Michiel de Hoon wrote: >> >> I think it's nice to be consistent with NCBI, and I don't see a big >> problem in having an alias for GenBank in SeqIO. At least, >> having "gb" in Bio.Entrez but "genbank" in Bio.SeqIO would >> go against the principle of least surprise. > > True. OK, in the absence of any objections, I have added "gb" as an alias for "genbank" in Bio.SeqIO: Bio/SeqIO/__init__.py CVS revision 1.52 Tests/test_SeqIO_online.py revision 1.8 Tests/output/test_SeqIO_online CVS revision 1.4 Doc/Tutorial.tex CVS revision 1.229 Peter From mjldehoon at yahoo.com Fri Apr 17 12:44:34 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 17 Apr 2009 09:44:34 -0700 (PDT) Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090417132334.GA16092@sobchak.mgh.harvard.edu> Message-ID: <148828.89199.qm@web62404.mail.re1.yahoo.com> --- On Fri, 4/17/09, Brad Chapman wrote: > The GFF parser right now is really generating SeqFeature > objects for each GFF line; the top level SeqRecords are a > collection that holds the individual features. The SeqFeature > object is pretty similar to GFF and the generic object you are > proposing. 
For instance, here is a GFF line and the relevant > attributes from SeqFeature for the line: > > I Orfeome PCR_product 12759747 12764936 . - . PCR_product "mv_B0019.1" ; Amplified 1 ; Amplified 1 > > type: PCR_product > location: [12759746:12764936] > strand: -1 > qualifiers: > Key: amplified, Value: ['1'] > Key: pcr_product, Value: ['mv_B0019.1'] > Key: source, Value: ['Orfeome'] > Just to make sure I understand how this works, looking at your previous code example: >>> from BCBio.GFF.GFFParser import GFFAddingIterator >>> gff_iterator = GFFAddingIterator() >>> rec_dict = gff_iterator.get_all_features(gff_file) > The returned dictionary is like a dictionary from SeqIO.to_dict; > keys are ids and values are SeqRecords. What will be the key in rec_dict for the example GFF file above? Is that the "I" in the first column, as in rec_dict["I"] = a SeqRecord with the SeqFeature you described above? Best, --Michiel From bugzilla-daemon at portal.open-bio.org Fri Apr 17 13:03:59 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 17 Apr 2009 13:03:59 -0400 Subject: [Biopython-dev] [Bug 2812] Adding read method to NCBIXML (just like SeqIO and SwissProt). In-Reply-To: Message-ID: <200904171703.n3HH3xrq015467@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2812 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-17 13:03 EST ------- Fixed in CVS (without the read/parse typo in the docstring suggested in comment 1).
Checking in Bio/Blast/NCBIXML.py; /home/repository/biopython/biopython/Bio/Blast/NCBIXML.py,v <-- NCBIXML.py new revision: 1.22; previous revision: 1.21 done Checking in Tests/test_NCBIXML.py; /home/repository/biopython/biopython/Tests/test_NCBIXML.py,v <-- test_NCBIXML.py new revision: 1.7; previous revision: 1.6 done Checking in Tests/test_NCBI_qblast.py; /home/repository/biopython/biopython/Tests/test_NCBI_qblast.py,v <-- test_NCBI_qblast.py new revision: 1.6; previous revision: 1.5 done Checking in Tests/output/test_NCBIXML; /home/repository/biopython/biopython/Tests/output/test_NCBIXML,v <-- test_NCBIXML new revision: 1.6; previous revision: 1.5 done RCS file: /home/repository/biopython/biopython/Tests/Blast/blastp_no_hits.xml,v done Checking in blastp_no_hits.xml; /home/repository/biopython/biopython/Tests/Blast/blastp_no_hits.xml,v <-- blastp_no_hits.xml initial revision: 1.1 done -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Fri Apr 17 13:16:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 18:16:55 +0100 Subject: [Biopython-dev] Plan for Biopython 1.50 (final) In-Reply-To: <320fb6e00904131106s70028e9el56d334fa732bddf8@mail.gmail.com> References: <320fb6e00904131106s70028e9el56d334fa732bddf8@mail.gmail.com> Message-ID: <320fb6e00904171016q40f99a3fjda75b3add17ab8c0@mail.gmail.com> On Mon, Apr 13, 2009 at 7:06 PM, Peter wrote: > Apart from these two points (documentation and EFetch), are there any > issues regarding doing the official release of Biopython 1.50? I > think we can aim for a release this week... Other than a little more documentation polishing, I think we are ready for Biopython 1.50 now.
Thanks Bartek and Tiago for dealing with the Bio.Motif and Bio.PopGen issues I raised so promptly :) Are there any release blocking issues I've missed? I was going to do it this evening before leaving work, but I'm tired and wouldn't want to make any mistakes. Instead, I aim to do the release this weekend, and make the Windows installers at some point on Monday. The more rain we get this weekend, the more time I'll try and spend on the docs first - otherwise the lawn needs cutting... ;) I'll send out a warning email before hand - but until then please feel free to check in documentation changes (including docstrings and doctests). We still don't have much on GenomeDiagram in the main tutorial, but I have some plans to improve this. We also don't have the misc GC related functions from the standalone GenomeDiagram which we might add to Bio.SeqUtils, but I think that can wait till Biopython 1.51. Bartek has made a start on the Bio.Motif documentation as a separate "cookbook" LaTeX file (plus we have some basic docstrings done): http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Doc/cookbook/motif/motif.tex?cvsroot=biopython For the long term I think we want to get rid of these misc "cookbook" documents (by moving their content), to focus on the main document ("Biopython Tutorial and Cookbook"), the docstrings, and in future cookbook entries on the wiki (which can be more user driven). 
Peter From ogmaciel at gnome.org Fri Apr 17 13:23:07 2009 From: ogmaciel at gnome.org (Og Maciel) Date: Fri, 17 Apr 2009 13:23:07 -0400 Subject: [Biopython-dev] Plan for Biopython 1.50 (final) In-Reply-To: <320fb6e00904171016q40f99a3fjda75b3add17ab8c0@mail.gmail.com> References: <320fb6e00904131106s70028e9el56d334fa732bddf8@mail.gmail.com> <320fb6e00904171016q40f99a3fjda75b3add17ab8c0@mail.gmail.com> Message-ID: <98a1f5280904171023p13c1e7a9o5686b451fd3da61c@mail.gmail.com> On Fri, Apr 17, 2009 at 1:16 PM, Peter wrote: >> Apart from these two points (documentation and EFetch), are there any >> issues regarding doing the official release of Biopython 1.50? I >> think we can aim for a release this week... Cool! I have 1.50b packaged for Foresight Linux and will update it once the new version is released. :) Cheers, -- Og B. Maciel omaciel at foresightlinux.org ogmaciel at gnome.org ogmaciel at ubuntu.com GPG Keys: D5CFC202 http://www.ogmaciel.com (en_US) http://blog.ogmaciel.com (pt_BR) From chapmanb at 50mail.com Fri Apr 17 16:05:58 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 17 Apr 2009 16:05:58 -0400 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904170740y17aec13cta2137d6d45f43e7c@mail.gmail.com> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> <20090413133539.GD5429@sobchak.mgh.harvard.edu> <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com> <20090417132334.GA16092@sobchak.mgh.harvard.edu> <320fb6e00904170740y17aec13cta2137d6d45f43e7c@mail.gmail.com> Message-ID: <20090417200558.GC19290@sobchak.mgh.harvard.edu> Peter and Michiel; [start/end attributes on SeqFeatures] > The SeqFeature already has start and end "attributes", but they are > done with some magic in __getattr__, I was planning to update this > to use a modern python property get.
I can't find an enhancement > bug on this so it may just have been on my mental to do list ;) These attributes are on the FeatureLocation object. The whole location hierarchy is a bit complicated to represent all of the GenBank fuzziness, but it looks like: SeqFeature -- has_a --> FeatureLocation -- has_two --> Positions (start, end) So if you wanted to get a non-fuzzy start end, you need to do: feature.location.nofuzzy_start, feature.location.nofuzzy_end Your way above would be: feature.location.start.position So, I was thinking of hiding this Location/Position stuff from the end user and just adding a start and end attribute directly on the feature. For everyone that never touches fuzziness, this would make more sense; it is also in line with making SeqFeature like Michiel's proposed GFFRecord object. [GFF to SeqFeature example] > > I Orfeome PCR_product 12759747 12764936 . - . PCR_product "mv_B0019.1" ; Amplified 1 ; Amplified 1 > > > > type: PCR_product > > location: [12759746:12764936] > > strand: -1 > > qualifiers: > > Key: amplified, Value: ['1'] > > Key: pcr_product, Value: ['mv_B0019.1'] > > Key: source, Value: ['Orfeome'] > > > > Just to make I understand how this works, looking at your previous code example: > > >>> from BCBio.GFF.GFFParser import GFFAddingIterator > >>> gff_iterator = GFFAddingIterator() > >>> rec_dict = gff_iterator.get_all_features(gff_file) > > > The returned dictionary is like a dictionary from SeqIO.to_dict; > > keys are ids and values are SeqRecords. > > What will be the key in rec_dict for the example GFF file above? Is that the "I" in the first column, as in > > rec_dict["I"] = a SeqRecord with the SeqFeature you described above? Yes, that is exactly right. If we decide to have a SeqFeature iterator, we should also add a 'rec_id' key/value pair to the qualifiers that would map to the record -- chromosome 'I' in this case. This would let the user do the mapping themselves. 
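To show what "doing the mapping themselves" could look like if a per-line feature iterator carried a 'rec_id' qualifier, here is a minimal sketch using plain dictionaries. The feature layout (a dict with a "qualifiers" entry whose values are lists) mirrors the qualifier output quoted in this thread, but group_by_record and the second example feature are made up for illustration.

```python
from collections import defaultdict


def group_by_record(features):
    """Group feature dicts by the first value of their 'rec_id' qualifier."""
    by_record = defaultdict(list)
    for feature in features:
        rec_id = feature["qualifiers"]["rec_id"][0]
        by_record[rec_id].append(feature)
    return dict(by_record)


features = [
    # Taken from the GFF line discussed above (chromosome 'I').
    {"type": "PCR_product", "location": (12759746, 12764936), "strand": -1,
     "qualifiers": {"rec_id": ["I"], "source": ["Orfeome"]}},
    # Made-up second feature on another record, for illustration only.
    {"type": "gene", "location": (0, 100), "strand": 1,
     "qualifiers": {"rec_id": ["II"]}},
]
grouped = group_by_record(features)
```

The result has the same shape as the rec_dict discussed earlier (keys are record ids), except that the values are plain lists of features rather than SeqRecords.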
Brad From biopython at maubp.freeserve.co.uk Fri Apr 17 18:12:14 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 23:12:14 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090417200558.GC19290@sobchak.mgh.harvard.edu> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> <20090413133539.GD5429@sobchak.mgh.harvard.edu> <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com> <20090417132334.GA16092@sobchak.mgh.harvard.edu> <320fb6e00904170740y17aec13cta2137d6d45f43e7c@mail.gmail.com> <20090417200558.GC19290@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904171512n3ff0090dy8042b1c860cf5a2c@mail.gmail.com> On 4/17/09, Brad Chapman wrote: > Peter and Michiel; > > [start/end attributes on SeqFeatures] > > > The SeqFeature already has start and end "attributes", but they are > > done with some magic in __getattr__, I was planning to update this > > to use a modern python property get. I can't find an enhancement > > bug on this so it may just have been on my mental to do list ;) > > These attributes are on the FeatureLocation object. Sorry - yeah, you're right. I wasn't paying enough attention. > The whole location hierarchy is a bit complicated to represent all > of the GenBank fuzziness, but it looks like: > > SeqFeature -- has_a --> FeatureLocation -- has_two --> Positions (start, end) > And that's the nice case without sub-features and joins ;) Peter From mjldehoon at yahoo.com Sat Apr 18 00:28:09 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 17 Apr 2009 21:28:09 -0700 (PDT) Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090417200558.GC19290@sobchak.mgh.harvard.edu> Message-ID: <252312.21376.qm@web62408.mail.re1.yahoo.com> I tried this code to read a GFF file from miRBase, containing the genome positions of microRNAs in human. 
The good news is that the code works as advertised. At the same time, I think that for a basic parser (as opposed to a parser integrated with Bio.SeqIO), the SeqFeatures are way too complicated for my mind. This is how I used the parser: >>> from GFFParser import GFFAddingIterator >>> gff_iterator = GFFAddingIterator() >>> rec_dict = gff_iterator.get_all_features("Data/miRBase/hsa.gff") # It would be better to pass a handle to get_all_features # instead of a file name. The file may be gzipped or bzipped, # or the user may want to read it from the internet. >>> len(rec_dict['1']) 50 # fifty microRNAs on chromosome 1 >>> rec_dict['1'].features[0] Bio.SeqFeature.SeqFeature(Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366)), type='miRNA', strand=1, id='hsa-mir-1302-2') >>> rec_dict['1'].features[0].qualifiers['ACC'] ['MI0006363'] >>> rec_dict['1'].features[0].qualifiers['ID'] ['hsa-mir-1302-2'] # This is still OK, though a bit more deeply nested than I would like. >>> rec_dict['1'].features[0].location Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366)) >>> rec_dict['1'].features[0].location._start Bio.SeqFeature.ExactPosition(20228) # Am I supposed to use _start here? It looks like a private variable. >>> rec_dict['1'].features[0].location._start.position 20228 # Too much typing for everyday usage. I don't think that I would use it. For a basic parser, I like the _gff_line_map function much better. 
Applied to the first line in the GFF file, it returns >>> result = _gff_line_map(line, params) [('parent', {'quals': {'ACC': ['MI0006363'], 'ID': ['hsa-mir-1302-2']}, 'rec_id': '1', 'location': [20228, 20366], 'is_gff2': False, 'type': 'miRNA', 'id': 'hsa-mir-1302-2', 'strand': 1})] >>> print result[0][1] {'quals': {'ACC': ['MI0006363'], 'ID': ['hsa-mir-1302-2']}, 'rec_id': '1', 'location': [20228, 20366], 'is_gff2': False, 'type': 'miRNA', 'id': 'hsa-mir-1302-2', 'strand': 1} which is exactly what I need, in (almost) the places where I'd expect them. --Michiel From biopython at maubp.freeserve.co.uk Sat Apr 18 09:54:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 18 Apr 2009 14:54:44 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <252312.21376.qm@web62408.mail.re1.yahoo.com> References: <20090417200558.GC19290@sobchak.mgh.harvard.edu> <252312.21376.qm@web62408.mail.re1.yahoo.com> Message-ID: <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> On Sat, Apr 18, 2009 at 5:28 AM, Michiel de Hoon wrote: > > This is how I used the parser: > >>>> from GFFParser import GFFAddingIterator >>>> gff_iterator = GFFAddingIterator() >>>> rec_dict = gff_iterator.get_all_features("Data/miRBase/hsa.gff") > # It would be better to pass a handle to get_all_features > # instead of a file name. The file may be gzipped or bzipped, > # or the user may want to read it from the internet. >>>> len(rec_dict['1']) > 50 > # fifty microRNAs on chromosome 1 >>>> rec_dict['1'].features[0] > Bio.SeqFeature.SeqFeature(Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366)), type='miRNA', strand=1, id='hsa-mir-1302-2') >>>> rec_dict['1'].features[0].qualifiers['ACC'] > ['MI0006363'] >>>> rec_dict['1'].features[0].qualifiers['ID'] > ['hsa-mir-1302-2'] > # This is still OK, though a bit more deeply nested than I would like. 
>>>> rec_dict['1'].features[0].location > Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366)) >>>> rec_dict['1'].features[0].location._start > Bio.SeqFeature.ExactPosition(20228) > # Am I supposed to use _start here? It looks like a private variable. >>>> rec_dict['1'].features[0].location._start.position > 20228 No, you are meant to use start, e.g.: >>> print rec_dict['1'].features[0].location.start 20228 >>> rec_dict['1'].features[0].location.start.position 20228 This is what I was talking about in the earlier email on this thread, the SeqFeature has start and end "attributes", but they are done with some magic in __getattr__. I plan to update this to use a modern python property get (so they will show up in dir(...) and we can give them docstring), but don't recall filing a bug on this issue yet. See also, http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005734.html http://lists.open-bio.org/pipermail/biopython/2007-September/003703.html Related to this, perhaps the position classes (and in particular the ExactPosition class) should have an __int__ method, so you can use the object directly (rather than messing about with subproperties like .position). This should let you do the following (untested): record = ... #e.g. a SeqRecord from a GFF file or GenBank feature = record.features[5] #for example sub_seq = my_seq[feature.location.start:feature.location.end] Coupled with a variation of Brad's suggestion of adding start and end properties to the SeqFeature, if we make these act as proxies for feature.location.start and feature.location.end that would become just: record = ... 
feature = record.features[5] #for example sub_seq = my_seq[feature.start:feature.end] The fuzzy locations (from GenBank or EMBL files) would need a bit of care, ideally matching how the NCBI do things (easily checked by taking an NCBI GenBank files and comparing it to the simpler locations given in their FASTA, PTT or GFF files). Peter From bugzilla-daemon at portal.open-bio.org Sat Apr 18 17:45:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 18 Apr 2009 17:45:12 -0400 Subject: [Biopython-dev] [Bug 2814] New: Use properties instead of __getattr__ in FeatureLocation Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2814 Summary: Use properties instead of __getattr__ in FeatureLocation Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk The SeqFeature's location (i.e. the FeatureLocation object) has start and end "attributes", but they are done with some magic in __getattr__. We should use a modern python property get (so they will show up in dir(...) and we can give them docstrings) See also, http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005781.html http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005734.html http://lists.open-bio.org/pipermail/biopython/2007-September/003703.html Patch to follow -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sat Apr 18 17:47:59 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 18 Apr 2009 17:47:59 -0400 Subject: [Biopython-dev] [Bug 2814] Use properties instead of __getattr__ in FeatureLocation In-Reply-To: Message-ID: <200904182147.n3ILlx88027985@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2814 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-18 17:47 EST ------- Created an attachment (id=1278) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1278&action=view) Patch to Bio/SeqFeature.py This doesn't try and change the functionality or API at all. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Sat Apr 18 17:48:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 18 Apr 2009 22:48:58 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> References: <20090417200558.GC19290@sobchak.mgh.harvard.edu> <252312.21376.qm@web62408.mail.re1.yahoo.com> <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> Message-ID: <320fb6e00904181448u49e92549t70c3a23c1a0c4d4f@mail.gmail.com> On Sat, Apr 18, 2009 at 2:54 PM, Peter wrote: > This is what I was talking about in the earlier email on this thread, > the SeqFeature has start and end "attributes", but they are done > with some magic in __getattr__. I plan to update this to use a > modern python property get (so they will show up in dir(...) and > we can give them docstring), but don't recall filing a bug on this > issue yet. Filed now, Bug 2814 - Use properties instead of __getattr__ in FeatureLocation http://bugzilla.open-bio.org/show_bug.cgi?id=2814 Something for after Biopython 1.50 is done. 
Peter From bugzilla-daemon at portal.open-bio.org Sat Apr 18 18:42:57 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 18 Apr 2009 18:42:57 -0400 Subject: [Biopython-dev] [Bug 2814] Use properties instead of __getattr__ in FeatureLocation In-Reply-To: Message-ID: <200904182242.n3IMgvOq031013@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2814 ------- Comment #2 from eric.talevich at gmail.com 2009-04-18 18:42 EST ------- (In reply to comment #1) > Created an attachment (id=1278) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1278&action=view) [details] > Patch to Bio/SeqFeature.py Peter, you mentioned on the mailing list that this will be applied after the 1.50 release. Since Py2.3 support ends there also, you could use the newer decorator style instead: start = property(fget= lambda self : self._start, doc="Start location (possibly a fuzzy position).") becomes: @property def start(self): """Start location (possibly a fuzzy position).""" return self._start I think this is the preferred style for Python 2.4 and later. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Apr 20 05:03:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Apr 2009 10:03:47 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.50 Message-ID: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> On Fri, Apr 17, 2009 at 6:16 PM, Peter wrote: > > Are there any release blocking issues I've missed? I'm going to assume not. > I was going to do it this evening before leaving work, but I'm tired and > wouldn't want to make any mistakes. Instead, I aim to do the release > this weekend, and make the Windows installers at some point on > Monday. 
The more rain we get this weekend, the more time I'll try > and spend on the docs first - otherwise the lawn needs cutting... ;) Well, the good news is it didn't rain, I had a nice weekend, and cut half the grass. The bad news is obviously I didn't do the Biopython release, although I did work on the documentation. In addition to the nice weather, my other excuse is I had forgotten I'd upgraded my old laptop so I didn't have a Python 2.3 machine handy at home. ;) > I'll send out a warning email beforehand - but until then please > feel free to check in documentation changes (including docstrings > and doctests). This is the CVS freeze email. I'm going to do the release in the next hour or two. > We still don't have much on GenomeDiagram in the main tutorial, but I > have some plans to improve this. [...] I got most of that done at the weekend :) Peter From biopython at maubp.freeserve.co.uk Mon Apr 20 08:11:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Apr 2009 13:11:18 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.50 In-Reply-To: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> Message-ID: <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> On Mon, Apr 20, 2009 at 10:03 AM, Peter wrote: > > This is the CVS freeze email. I'm going to do the release in the next > hour or two. > Well, it's done. CVS is tagged, the packages are online, I've updated the wiki, the epydoc API pages, and the online copy of the tutorial. You can use CVS again, but just in case there are any surprises in the next few days which would force a re-release, minor changes only please. That just leaves the official announcement on the news page (which will be echoed onto twitter automatically) and to the mailing lists. I'll circulate a draft after lunch, unless one of our news coordinator volunteers wants to write something? 
I realize I should have suggested this earlier as this is short notice, and you are in different time zones, but it's worth a try. For reference, here is the 1.50 beta announcement, http://news.open-bio.org/news/2009/04/biopython-150-beta-released/ I can't find anything on http://lists.open-bio.org/pipermail/biopython-announce/ or the main list, so it looks like I forgot that :( This might explain the relatively low amount of feedback... The NEWS and DEPRECATED files are here: http://biopython.open-bio.org/SRC/biopython/NEWS http://biopython.open-bio.org/SRC/biopython/DEPRECATED Peter From bugzilla-daemon at portal.open-bio.org Mon Apr 20 08:18:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 20 Apr 2009 08:18:33 -0400 Subject: [Biopython-dev] [Bug 2815] New: Bio.Application MUSCLE command line interface Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2815 Summary: Bio.Application MUSCLE command line interface Product: Biopython Version: Not Applicable Platform: PC OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: cymon.cox at gmail.com Attached is a module to run the MUSCLE alignment programme based on the Bio.Application interface. A couple of helper functions are included: MuscleAlign and ProfileMuscleAlign. Discussion on the dev-list suggests that helper functions are superfluous. Maybe, but I thought I'd include them anyway. A couple of unittests are included for the helper funcs. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Apr 20 08:19:38 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 20 Apr 2009 08:19:38 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904201219.n3KCJcSu009533@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #1 from cymon.cox at gmail.com 2009-04-20 08:19 EST ------- Created an attachment (id=1279) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1279&action=view) MUSCLE module -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Apr 20 08:21:19 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 20 Apr 2009 08:21:19 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904201221.n3KCLJjf009683@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #2 from cymon.cox at gmail.com 2009-04-20 08:21 EST ------- Created an attachment (id=1280) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1280&action=view) unittest for MuscleAlign -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From chapmanb at 50mail.com Mon Apr 20 09:29:46 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 20 Apr 2009 09:29:46 -0400 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> References: <20090417200558.GC19290@sobchak.mgh.harvard.edu> <252312.21376.qm@web62408.mail.re1.yahoo.com> <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> Message-ID: <20090420132946.GB29652@sobchak.mgh.harvard.edu> Michiel; Thanks for trying this out and your thoughts. > > # It would be better to pass a handle to get_all_features > > # instead of a file name. The file may be gzipped or bzipped, > > # or the user may want to read it from the internet. Yes, this is the way it was originally designed. I changed to files to be consistent with a distributed Disco implementation, which needs to be fed a file instead of a handle. Your suggestion is a good one. Let me give some thought to separating the interfaces, as handles would be more consistent with the rest of Biopython. [accessing start and end] > >>> print rec_dict['1'].features[0].location.start > 20228 > >>> rec_dict['1'].features[0].location.start.position > 20228 [...] > Coupled with a variation of Brad's suggestion of adding start > and end properties to the SeqFeature, if we make these act > as proxies for feature.location.start and feature.location.end > that would become just: > > record = ... > feature = record.features[5] #for example > sub_seq = my_seq[feature.start:feature.end] Thanks Peter, that's exactly right. Accessing the start and end coordinates in SeqFeatures is unnecessarily cumbersome right now, but can be fixed fairly simply. We should be able to get this in now that 1.50 is rolled out. Eric's decorator way of doing this was very nice. 
> The fuzzy locations (from GenBank or EMBL files) would need > a bit of care, ideally matching how the NCBI do things (easily > checked by taking an NCBI GenBank files and comparing it to > the simpler locations given in their FASTA, PTT or GFF files). To be clear, start and end in SeqFeature would be integers and not handle any fuzzy stuff. All of the representation is still there for those actually dealing with fuzziness, but the top level attributes would expose the coordinates nicely for the remaining 99% of cases. > I think that for a basic parser (as opposed to a parser integrated with Bio.SeqIO), > the SeqFeatures are way too complicated for my mind. [...] > For a basic parser, I like the _gff_line_map function much better. > Applied to the first line in the GFF file, it returns [...] > which is exactly what I need, in (almost) the places where I'd expect them. Does solving the start/end problem as described above help bridge the gap between SeqFeatures and the custom representation? Are there other usability issues you found? I would prefer to expose one data structure and think SeqFeature can handle the data well. They scale to nested cases, and will be familiar to those using features in SeqIO or BioSQL. Brad From dave.bridges at gmail.com Mon Apr 20 09:55:40 2009 From: dave.bridges at gmail.com (Dave Bridges) Date: Mon, 20 Apr 2009 09:55:40 -0400 Subject: [Biopython-dev] Bio.Motif Suggestions Message-ID: <49EC7EDC.2030809@gmail.com> From an off-list conversation with Bartek > > Is it possible to give a name to an instance, so that when you > print, say to > > fasta it retains that info > Yes and no... Motifs have a .name property which can be used for storing names of motifs, but it is currently not used in fasta output. BTW. fasta (and other) output functions changed recently in CVS, but I didn't have time to update my branch in git. Please have a look at the .format method of Motif class in the main branch. 
There are also some (minor) changes in the tutorial, so you may want to merge them back into your branch. Bio.Motif got refactored quite a bit (on Peter's request), so you should update the code, but the API didn't change too much. Currently, the fasta output prints only Instance 1, Instance 2 and so on in the ID field but it would be a trivial improvement to add motif name there. > > Is there an alphabet that accepts spaces which might be necessary for > > correct alignment of a motif, and if so will that work with the rest of > > motif.py? > That's a tougher one. It wasn't really needed so far (DNA motifs rarely have spaces), but I guess that for protein motifs it's a very important thing. I have some code for doing that, but I will need to find it. I'll write you later about it. > > in to_horizontal_matrix/to_vertical_matrix is it possible to print > out a > > legend for the matrices (for ex. the alphabet letters and the position) > > along the top and side. > No, not yet, but again, it would be a nice improvement (and easy to make). cheers Bartek From biopython at maubp.freeserve.co.uk Mon Apr 20 10:35:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Apr 2009 15:35:15 +0100 Subject: [Biopython-dev] Bio.Motif Suggestions In-Reply-To: <49EC7EDC.2030809@gmail.com> References: <49EC7EDC.2030809@gmail.com> Message-ID: <320fb6e00904200735y1002ee71i1a2f11c664045567@mail.gmail.com> On Mon, Apr 20, 2009 at 2:55 PM, Dave Bridges wrote: > >> > Is there an alphabet that accepts spaces which might be necessary for >> > correct alignment of a motif, and if so will that work with the rest of >> > motif.py? >> > > That's a tougher one. It wasn't really needed so far (DNA motifs > rarely have spaces), but I guess that for protein motifs it's a very > important thing. > I have some code for doing that, but I will need to find it. I'll > write you later about it. > What would a space in a motif mean? 
Clearly something different from a wildcard like N or X in nucleotide or protein sequences. Does it mean a gap of variable length? If it means a gap of one character then surely just using a "-" would be sensible (as used in multiple sequence alignments), for which we have a gapped alphabet system setup. Note that there are some issues with the current Bio.Motif code and alphabets, which should be addressed. For example, generic alphabets don't have a letters property giving the list of expected letters, so using set() on the sequences themselves might be more appropriate in places. Peter From biopython at maubp.freeserve.co.uk Mon Apr 20 10:37:02 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Apr 2009 15:37:02 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090420132946.GB29652@sobchak.mgh.harvard.edu> References: <20090417200558.GC19290@sobchak.mgh.harvard.edu> <252312.21376.qm@web62408.mail.re1.yahoo.com> <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> <20090420132946.GB29652@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904200737s71e0dfa2y3d7cfbf36324a79d@mail.gmail.com> On Mon, Apr 20, 2009 at 2:29 PM, Brad Chapman wrote: > Michiel; > Thanks for trying this out and your thoughts. > >> > # It would be better to pass a handle to get_all_features >> > # instead of a file name. The file may be gzipped or bzipped, >> > # or the user may want to read it from the internet. > > Yes, this is the way it was originally designed. I changed to files to > be consistent with a distributed Disco implementation, which needs to be > fed a file instead of a handle. Your suggestion is a good one. Let me > give some thought to separating the interfaces, as handles would be more > consistent with the rest of Biopython. I'd second that - definitely go with handles rather than filenames. 
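To make the handle-versus-filename point concrete, here is a small sketch (modern Python 3, not the Bio.GFF code under discussion). The helper name count_feature_lines and the sample GFF lines are made up for illustration; the point is only that a function written against a handle works unchanged for in-memory text and for gzip-compressed data.

```python
import gzip
import io


def count_feature_lines(handle):
    """Count non-blank, non-comment lines from any text handle (hypothetical helper)."""
    count = 0
    for line in handle:
        if line.strip() and not line.startswith("#"):
            count += 1
    return count


# Made-up sample data: one directive line and two feature lines.
gff_text = (
    "##gff-version 3\n"
    "chr1\texample\tgene\t100\t200\t.\t+\t.\tID=gene1\n"
    "chr1\texample\tgene\t300\t400\t.\t-\t.\tID=gene2\n"
)

# A plain in-memory text handle.
plain_count = count_feature_lines(io.StringIO(gff_text))

# The same data gzip-compressed; the function does not need to know.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
    gz.write(gff_text.encode("ascii"))
buffer.seek(0)
with gzip.open(buffer, mode="rt") as gz_handle:
    gzip_count = count_feature_lines(gz_handle)
```

A filename-only interface would force the caller to decompress to disk first; a handle-based one leaves that choice with the caller.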
Peter From biopython at maubp.freeserve.co.uk Mon Apr 20 10:55:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Apr 2009 15:55:21 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.50 In-Reply-To: <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> Message-ID: <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com> On Mon, Apr 20, 2009 at 1:11 PM, Peter wrote: > That just leaves the official announcement on the news page (which > will be echoed onto twitter automatically) and to the mailing lists. > I'll circulate a draft after lunch, unless one of our news coordinator > volunteers wants to write something? I realize I should have > suggested this earlier as this is short notice, and you are in > different time zones, but it's worth a try. And here is my draft - the HTML is just for the links on the news site. Should we add something about the Entrez EFetch change ("genbank" to "gb")? Peter -- We are pleased to announce Biopython release 1.50, featuring some significant additions since Biopython 1.49 was released late last year. GenomeDiagram by Leighton Pritchard has been integrated into Biopython as the Bio.Graphics.GenomeDiagram module. A new module Bio.Motif has been added, which is intended to replace the existing Bio.AlignAce and Bio.MEME modules. Also have a look at Bio.ExPASy and the revised Prosite and Enzyme parsers. As noted in a previous news posting, Bio.SeqIO can now read and write FASTQ and QUAL files used in second generation sequencing work. In connection with this, our SeqRecord object has a new dictionary attribute, letter_annotations, for per-letter-annotation information like sequence quality scores or secondary structure predictions. Also, the SeqRecord object can now be sliced to give a new SeqRecord covering just part of the sequence. 
Biopython 1.50 supports Python 2.3, 2.4, 2.5 and 2.6. However, this is expected to be the final version to support Python 2.3 (see this previous announcement). Also, Biopython 1.50 should be the last release to include our old deprecated parsing infrastructure (Martel and Bio.Mindy). We've also updated the Biopython Tutorial and Cookbook (also available in PDF), and not just by adding our logo to the cover ;) Thank you to everyone who tested the Biopython 1.50 beta release, and to all our contributors. Source distributions and Windows installers are available from the downloads page on the Biopython website (biopython.org). -Peter on behalf of the Biopython developers From bartek at rezolwenta.eu.org Mon Apr 20 11:04:44 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Mon, 20 Apr 2009 17:04:44 +0200 Subject: [Biopython-dev] Bio.Motif Suggestions In-Reply-To: <8b34ec180904200800j52f9accdk27bb9c499c7b0761@mail.gmail.com> References: <49EC7EDC.2030809@gmail.com> <320fb6e00904200735y1002ee71i1a2f11c664045567@mail.gmail.com> <8b34ec180904200800j52f9accdk27bb9c499c7b0761@mail.gmail.com> Message-ID: <8b34ec180904200804o58d531ache7e3110f3b919cfa@mail.gmail.com> On Mon, Apr 20, 2009 at 4:35 PM, Peter wrote: > On Mon, Apr 20, 2009 at 2:55 PM, Dave Bridges wrote: >> >>> > Is there an alphabet that accepts spaces which might be necessary for >>> > correct alignment of a motif, and if so will that work with the rest of >>> > motif.py? >>> >> >> That's a tougher one. It wasn't really needed so far (DNA motifs >> rarely have spaces), but I guess that for protein motifs it's a very >> important thing. >> I have some code for doing that, but I will need to find it. I'll >> write you later about it. >> > > What would a space in a motif mean? Clearly something different from > a wildcard like N or X in nucleotide or protein sequences. Does it > mean a gap of variable length? 
If it means a gap of one character > then surely just using a "-" would be sensible (as used in multiple > sequence alignments), for which we have a gapped alphabet system > setup. > I think that once we start talking about gapped motifs, we are really talking about multiple alignments on steroids. This hasn't been done so far because you don't really need it for DNA motifs, but in the case of protein motifs we need to make it compatible with multiple alignments. I think it would be great to be able to easily convert multiple alignments into motifs. This would allow us to use the power of Bio.AlignIO for IO and Bio.Motif for searching and comparisons. The question is how to design the API for these functions. What about: align = Bio.AlignIO.read(....) motif = Bio.Motif.from_alignment(align) ... > Note that there are some issues with the current Bio.Motif code and > alphabets, which should be addressed. For example, generic alphabets > don't have a letters property giving the list of expected letters, so > using set() on the sequences themselves might be more appropriate in > places. Yes, I was using Bio.Motif only for DNA motifs myself, so there was not much consideration given to proper handling of alphabets. I'll need to clear it up now. cheers Bartek From bartek at rezolwenta.eu.org Mon Apr 20 11:08:57 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Mon, 20 Apr 2009 17:08:57 +0200 Subject: [Biopython-dev] CVS freeze for Biopython 1.50 In-Reply-To: <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com> References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com> Message-ID: <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com> Hi Peter, Looks fine to me. Thanks for your effort put into this release. Bio.Motif certainly benefited from refactoring initiated by your comments before the release. 
cheers Bartek On Mon, Apr 20, 2009 at 4:55 PM, Peter wrote: > On Mon, Apr 20, 2009 at 1:11 PM, Peter wrote: >> That just leaves the official announcement on the news page (which >> will be echoed onto twitter automatically) and to the mailing lists. >> I'll circulate a draft after lunch, unless one of our news coordinator >> volunteers wants to write something? I realize I should have >> suggested this earlier as this is short notice, and you are in >> different time zones, but it's worth a try. > > And here is my draft - the HTML is just for the links on the news > site. Should we add something about the Entrez EFetch change > ("genbank" to "gb")? > > Peter > > -- > > We are pleased to announce Biopython release 1.50, featuring some > significant additions since Biopython 1.49 was released late last > year. > > GenomeDiagram > by Leighton Pritchard has been integrated into Biopython as the > Bio.Graphics.GenomeDiagram module. > > A new module Bio.Motif has been added, which is intended to replace > the existing Bio.AlignAce and Bio.MEME modules. Also have a look at > Bio.ExPASy and the revised Prosite and Enzyme parsers. > > As noted in a previous news posting, Bio.SeqIO (http://biopython.org/wiki/SeqIO) can now read and > write FASTQ > and QUAL files used in second generation sequencing work. In > connection with this, our SeqRecord (http://biopython.org/wiki/SeqRecord) object has a > new dictionary attribute, letter_annotations, for > per-letter-annotation information like sequence quality scores or > secondary structure predictions. Also, the SeqRecord object can now be > sliced to give a new SeqRecord covering just part of the sequence. > > Biopython 1.50 supports Python 2.3, 2.4, 2.5 and 2.6. However, this is > expected to be the final version to support Python 2.3 (see this previous > announcement, http://news.open-bio.org/news/2009/04/2008/11/biopython-and-python-26-and-python-23/). 
Also, Biopython 1.50 should be the last release to > include our old deprecated parsing infrastructure (Martel and > Bio.Mindy). > > We've also updated the Biopython > Tutorial and Cookbook (also available in PDF, http://biopython.org/DIST/docs/tutorial/Tutorial.pdf), > and not just by adding our logo (http://biopython.org/wiki/Logo) to the cover ;) > > Thank you to everyone who tested the Biopython > 1.50 beta release (http://news.open-bio.org/news/2009/04/biopython-150-beta-released/), and to all our contributors. > > Source distributions and Windows installers are available from the downloads page (http://biopython.org/wiki/Download) on the Biopython website (http://biopython.org/). > > -Peter on behalf of the Biopython developers > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > -- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From biopython at maubp.freeserve.co.uk Mon Apr 20 12:04:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Apr 2009 17:04:56 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.50 In-Reply-To: <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com> References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com> <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com> Message-ID: <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com> On Mon, Apr 20, 2009 at 4:08 PM, Bartek Wilczynski wrote: > Hi Peter, > > Looks fine to me. Cool. > Thanks for your effort put into this release. Thanks. 
I'd forgotten how much work these can be - the Biopython 1.50 beta release seemed to go much more smoothly, but there I wasn't aiming quite so high (e.g. it didn't have any GenomeDiagram documentation in it, and I hadn't really looked at Bio.Motif in detail). Michiel did offer this time round, but maybe next time it should be someone else's turn to do the actual release bit? What I mean is the project co-ordination is a bit nebulous, but the actual mechanics of doing a release are fairly simple (assuming you have a Windows machine already set up to do the installers), pretty well documented, and that part could be delegated. See http://biopython.org/wiki/Building_a_release i.e. Maybe in a few months' time I (or Michiel) can say "Right, CVS freeze while XXX does the release", where person XXX gets to scan the documentation, double check the NEWS files, check the unit tests etc, before putting together the packages and uploading them to the server. And maybe then hand over to our "News Coordinator" to do the release announcement? Having more people involved will make it take a little longer, but should mean fewer minor things get missed (e.g. a typo in the NEWS file, or a broken unit test specific to a particular OS or version of Python). > Bio.Motif certainly benefited from refactoring initiated by your > comments before the release. Well, I hope so :) Peter From biopython at maubp.freeserve.co.uk Mon Apr 20 13:27:00 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Apr 2009 18:27:00 +0100 Subject: [Biopython-dev] history on github - where are the tags? 
In-Reply-To: <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com> <8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com> <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com> Message-ID: <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com> On Wed, Mar 25, 2009 at 11:28 AM, Peter wrote: > On Tue, Mar 24, 2009 at 11:58 PM, Bartek Wilczynski wrote: >> For the tags, they were not pushed to github before, because I didn't >> know I need to specifically do it with git push --tags. > > ... > They also show up in github (near the top, drop down menu next to > branches) and in gitx (and I assume other GUI clients). Bartek fixed the tag issue, but I don't like how they show up in github. The most visible sign of the tags is in the downloads menu which lets you get a source code bundle using that tag. If we could turn that off I would - these bundles won't include the compiled PDF and HTML documentation, and could cause confusion when people have a problem and they just say they "downloaded version X from the website". My main concern is the tags don't appear to be shown when looking at the history in github, which is the main reason I wanted them in the first place. e.g. http://github.com/biopython/biopython/commits/master/Bio/Blast/NCBIXML.py Compare this to ViewCVS, which shows the tags in the history: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIXML.py?cvsroot=biopython I find this very handy for investigating bugs, and much easier than messing about at the command line with CVS. The fact that I can do this from almost any networked computer in the world is great for triaging bugs or responding to emails - it lets me look back over the history with our releases clearly labeled. So right now, the github history is a big step backwards for me. 
As an alternative, I had a quick look at GitX (on the Mac). From the GUI, they don't seem to have a history-of-one-file view, just a global history. For how I have been using ViewCVS's history, this is useless. However, interestingly for a GUI tool, they have a command line option which sort of does this, e.g. $ gitx -- Bio/Blast/NCBIXML.py Then the history shows all changes affecting the given file (or path), but as you might guess from git's commit based design, you also get shown other changes made in the same commit. This is kind of nice, just different. But still no tags visible :( Peter P.S. Tags aside, the github history view hasn't been working 100% for me, e.g. http://support.github.com/discussions/site/487-commit-history-sorry-this-commit-log-is-taking-too-long-to-generate From biopython at maubp.freeserve.co.uk Mon Apr 20 13:36:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Apr 2009 18:36:36 +0100 Subject: [Biopython-dev] history on github - where are the tags? In-Reply-To: <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com> <8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com> <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com> <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com> Message-ID: <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com> On Mon, Apr 20, 2009 at 6:27 PM, Peter wrote: > On Wed, Mar 25, 2009 at 11:28 AM, Peter wrote: >> On Tue, Mar 24, 2009 at 11:58 PM, Bartek Wilczynski wrote: >>> For the tags, they were not pushed to github before, because I didn't >>> know I need to specifically do it with git push --tags. >> >> ... >> They also show up in github (near the top, drop down menu next to >> branches) and in gitx (and I assume other GUI clients). > > Bartek fixed the tag issue, but I don't like how they show up in > github. 
From some more reading on this, it sounds like our CVS tags are essentially turned into commit markers in git. See:

http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#how-git-stores-references
http://book.git-scm.com/3_git_tag.html

This shouldn't rule out showing them in the history, but perhaps the CVS to git migration confuses things...

Peter

From biopython at maubp.freeserve.co.uk Mon Apr 20 15:02:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 20:02:18 +0100
Subject: [Biopython-dev] Biopython 1.50 released
Message-ID: <320fb6e00904201202j4bb9666es18c89136ce973a48@mail.gmail.com>

Dear all,

We are pleased to announce Biopython release 1.50, featuring some significant additions since Biopython 1.49 was released late last year.

GenomeDiagram by Leighton Pritchard has been integrated into Biopython as the Bio.Graphics.GenomeDiagram module. A new module Bio.Motif has been added, which is intended to replace the existing Bio.AlignAce and Bio.MEME modules. Also have a look at Bio.SwissProt and Bio.ExPASy and their revised parsers.

As noted in a previous news posting, Bio.SeqIO can now read and write FASTQ and QUAL files used in second generation sequencing work. In connection with this, our SeqRecord object has a new dictionary attribute, letter_annotations, for per-letter-annotation information like sequence quality scores or secondary structure predictions. Also, the SeqRecord object can now be sliced to give a new SeqRecord covering just part of the sequence.

Biopython 1.50 supports Python 2.3, 2.4, 2.5 and 2.6. However, this is expected to be the final version to support Python 2.3 (see this previous announcement). Also, Biopython 1.50 should be the last release to include our old deprecated parsing infrastructure (Martel and Bio.Mindy).
We've also updated the Biopython Tutorial and Cookbook (also available in PDF), and not just by adding our logo to the cover ;)

http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

Thank you to everyone who tested the Biopython 1.50 beta release, and to all our contributors.

Source distributions and Windows installers are available from the downloads page on the Biopython website:

http://biopython.org/wiki/Download

-Peter, on behalf of the Biopython developers

P.S. This news post is online at
http://news.open-bio.org/news/2009/04/biopython-release-150/

You may wish to subscribe to our news feed. For RSS links etc, see:
http://biopython.org/wiki/News

From lpritc at scri.ac.uk Tue Apr 21 04:34:25 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 21 Apr 2009 09:34:25 +0100
Subject: [Biopython-dev] Bio.Motif Suggestions
In-Reply-To: <8b34ec180904200804o58d531ache7e3110f3b919cfa@mail.gmail.com>
Message-ID:

Hi,

Some thoughts and a bit of a wishlist...

On 20/04/2009 16:04, "Bartek Wilczynski" wrote:
> On Mon, Apr 20, 2009 at 4:35 PM, Peter
> wrote:
>>
>> What would a space in a motif mean? Clearly something different from
>> a wildcard like N or X in nucleotide or protein sequences. Does it
>> mean a gap of variable length? If it means a gap of one character
>> then surely just using a "-" would be sensible (as used in multiple
>> sequence alignments), for which we have a gapped alphabet system
>> setup.
>>
> I think that once we start talking about gapped motifs, we are really
> talking about
> multiple alignments on steroids. This hasn't been done so far because you
> don't
> really need it for DNA motifs,

It might not be required for the motifs you've been working with, but we've been doing profile-based searches for bipartite regulatory binding sites in DNA. These sites have a variable-length spacer region, and so require gapped alignments for building motifs.
The spacer region consensus (depending on the level of identity required for the consensus) is usually composed of Ns.

I guess that this comes down to whether we choose to restrict the meaning of "motif" to an ungapped string of symbols (including ambiguity) representing nt/aa, or whether we want to permit the inclusion of variable-length gaps, regions, or ambiguities in a PROSITE or regular expression-like manner (e.g. C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, GAACC.{17,21}AAC or C{,3}A{3,5}TTTT). Although profile methods like HMMer can produce a consensus output that looks like an ungapped string of symbols to represent a motif, it doesn't capture important features of the HMM representation.

I think the latter representations are more useful, even if harder to code/maintain. I think that leaving them out would be a glaring hole in functionality, and that they're a target Biopython should aim for.

> I think it would be great to be able to easily
> convert multiple alignments into motifs. This would allow us to use
> the power of Bio.AlignIO for IO and Bio.Motif for searching and
> comparisons. The question is how to design API for these functions.

I agree. I think that there's another important question: what do we mean, and need to do, when we talk about converting an alignment into a motif? Consensus/majority and PSSM methods from a sequence alignment should be straightforward to implement in Python - even for gapped alignments. Including a representation of variable-length gaps might be a little more difficult, and storing an HMM representation may be too much to manage immediately. That's still three different types of object - with likely different components to their interfaces - to be stored. In their relationship to a source alignment, these representations could be properties of a single alignment, or independent Bio.Motif objects (perhaps each with a link back to their parent alignment).
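A minimal sketch of the consensus/majority idea mentioned above, in plain Python with no Biopython dependency. The function names, the `ambiguous` parameter and the example alignment are all hypothetical; this is not Bio.Motif or Bio.AlignIO code. Gap characters are simply counted like any other symbol, which is one possible convention among several:

```python
from collections import Counter


def column_counts(alignment):
    """Return one Counter of symbols per alignment column."""
    lengths = {len(row) for row in alignment}
    if len(lengths) != 1:
        raise ValueError("all rows must have the same length")
    return [Counter(column) for column in zip(*alignment)]


def consensus(alignment, threshold=0.9, ambiguous="N"):
    """Majority-rule consensus; columns below the threshold get `ambiguous`."""
    result = []
    for counts in column_counts(alignment):
        symbol, count = counts.most_common(1)[0]
        fraction = count / sum(counts.values())
        result.append(symbol if fraction >= threshold else ambiguous)
    return "".join(result)


# Hypothetical toy alignment with a gapped column (index 5).
alignment = [
    "GAACC-TAAC",
    "GAACCATAAC",
    "GAACC-GAAC",
    "GAACCCTAAC",
]
print(consensus(alignment, threshold=0.75))
print(consensus(alignment, threshold=0.9))
```

The per-column counts are also the raw material for a PSSM, so both representations could in principle be built from the same intermediate data.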
The results of searches are also likely to be qualitatively different, depending on the type of motif used for the search, and the results desired by the user.

I think that, for anything other than simple searches (string search, regex), we'd be on a hiding to nothing by implementing search methods within Python. It's not likely to be as fast as dedicated search packages, and it would be a headache for maintenance. So, with apologies if I missed this part of the discussion or documentation, it seems to me that Bio.Motif could be most powerful in the alignment/searching/comparison process as a 'broker' within BioPython, providing a consistent API for interface with external alignment/search/comparison applications that also permits programmatic manipulation of the profile/HMM/alignment. E.g.

align = Bio.AlignIO.read(alignfilehandle)
consensus = align.build_consensus(threshold=0.9)
pssm = align.build_pssm()
hmmer = align.build_hmmer()
hmm = align.build_hmm(order=3)

Or

consensus = Bio.Motif.consensus_from_alignment(align, threshold=0.9)
pssm = Bio.Motif.build_pssm_from_alignment(align)
hmmer = Bio.Motif.build_hmmer_from_alignment(align)
hmm = Bio.Motif.build_hmm_from_alignment(align, order=3)

(which I don't think is as neat an interface, even if all align.build_consensus does is call the Bio.Motif.consensus_from_alignment method)

Followed by things like

pssm.consensus()
pssm.logo()
hmm.generate_sequence(length=100)
hmm.to_graphviz()

And then the consensus, pssm, hmm and hmmer objects could be used as input to interfaces for the relevant applications.

Converting an alignment into an HMM for this purpose may itself benefit from a call to HMMer's hmmbuild (and Pythonic representation of the data structure), rather than implementation of an equivalent internal function - even though I think one of those would be useful, too.

Cheers,

L.
--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405

From biopython at maubp.freeserve.co.uk Tue Apr 21 06:15:05 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Apr 2009 11:15:05 +0100
Subject: [Biopython-dev] Python 2.3 support
Message-ID: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com>

Hi all,

As we've been warning for the last couple of releases, Biopython 1.50 should be the last release to officially support Python 2.3. No one has complained yet, but they may not have noticed.
I suspect there may be people out there using a local Biopython installation on an old Linux/Unix computer where the system Python is rather old. For Biopython 1.50 I added a warning to setup.py when run on Python 2.3 so that may get more attention.

Given the small possibility that we may need to do a fix release with Python 2.3 support, I propose that we don't actively remove any Python 2.3 support in CVS yet (maybe not until after Biopython 1.51?). Any new modules that require Python 2.4+ to run would be OK, but I would like to avoid breaking existing core functionality on Python 2.3 in the short term.

I know I'm dragging my feet on this, but being a bit cautious here shouldn't hurt. Plus I have an ulterior motive: I'm one of the few Biopython users still actually using Python 2.3! To be precise, this is now only on one machine at work - but this is the cluster head node. However, an upgrade is planned in the next month or so, and once that is done, maybe I'll relent and we can remove Python 2.3 support in CVS ;)

Peter

From bugzilla-daemon at portal.open-bio.org Tue Apr 21 07:05:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 21 Apr 2009 07:05:48 -0400
Subject: [Biopython-dev] [Bug 2817] New: Meta-bug for cleanup once we drop Python 2.3 support
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2817

Summary: Meta-bug for cleanup once we drop Python 2.3 support
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

We are going to drop support for Python 2.3, see:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005812.html

This means we can remove a number of workarounds in the code:

Python 2.4+ includes the built in set, so we can remove numerous uses of the following:

    #TODO - Remove this work around once we drop python 2.3 support
    try:
        set = set
    except NameError:
        from sets import Set as set

Python 2.4+ includes the subprocess module, so we can use this unconditionally in Bio.Application.generic_run() etc.

Python 2.4+ includes support for generator expressions. We should update the documentation examples as appropriate, and this may also allow some memory optimizations in places.

Python 2.4+ will also allow us to update our property methods to use decorators as suggested by Eric Talevich on Bug 2814.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Tue Apr 21 07:12:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 21 Apr 2009 07:12:18 -0400
Subject: [Biopython-dev] [Bug 2814] Use properties instead of __getattr__ in FeatureLocation
In-Reply-To:
Message-ID: <200904211112.n3LBCILI021318@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2814

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-21 07:12 EST -------
(In reply to comment #2)
> Peter, you mentioned on the mailing list that this will be applied after the
> 1.50 release. Since Py2.3 support ends there also, you could use the newer
> decorator style instead:
>
> start = property(fget= lambda self : self._start,
>                  doc="Start location (possibly a fuzzy position).")
>
> becomes:
>
> @property
> def start(self):
>     """Start location (possibly a fuzzy position)."""
>     return self._start
>
>
> I think this is the preferred style for Python 2.4 and later.

Thanks for the suggestion Eric. That sounds like a good plan, but not yet. See Bug 2817.
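For comparison, the two property styles quoted above can be put side by side in a small runnable sketch. The class names are hypothetical stand-ins, not the real FeatureLocation; the point is only that both spellings produce an equivalent read-only property with the same docstring, and that the decorator form needs Python 2.4 or later:

```python
class LambdaStyle(object):
    """Hypothetical stand-in using the property(fget=...) spelling."""

    def __init__(self, start):
        self._start = start

    # Works on Python 2.3 as well, since it needs no decorator syntax.
    start = property(fget=lambda self: self._start,
                     doc="Start location (possibly a fuzzy position).")


class DecoratorStyle(object):
    """Hypothetical stand-in using the decorator spelling."""

    def __init__(self, start):
        self._start = start

    @property
    def start(self):
        """Start location (possibly a fuzzy position)."""
        return self._start


old = LambdaStyle(5)
new = DecoratorStyle(5)
print(old.start, new.start)
print(LambdaStyle.start.__doc__ == DecoratorStyle.start.__doc__)
```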
I've checked in this patch and am marking this bug as fixed. See:
Bio/SeqFeature.py CVS revision 1.17

Peter

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From mjldehoon at yahoo.com Tue Apr 21 07:12:20 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 21 Apr 2009 04:12:20 -0700 (PDT)
Subject: [Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt
Message-ID: <393946.5637.qm@web62408.mail.re1.yahoo.com>

Dear all,

I've noticed an inconsistency between how Bio.SeqIO and Bio.SwissProt parse DE (description) lines in SwissProt files.

For these DE lines:

DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;

a SwissProt record created by Bio.SwissProt contains the following:

>>> print swiss_record.description
RecName: Full=11S globulin seed storage protein 2;
AltName: Full=11S globulin seed storage protein II;
AltName: Full=Alpha-globulin;
Contains:
  RecName: Full=11S globulin seed storage protein 2 acidic chain;
  AltName: Full=11S globulin seed storage protein II acidic chain;
Contains:
  RecName: Full=11S globulin seed storage protein 2 basic chain;
  AltName: Full=11S globulin seed storage protein II basic chain;
Flags: Precursor;

but a SeqRecord returned by Bio.SeqIO contains this:

>>> print seq_record.description
RecName: Full=11S globulin seed storage protein 2;
AltName: Full=11S globulin seed storage protein II;
AltName: Full=Alpha-globulin;
Contains:
RecName: Full=11S globulin seed storage protein 2 acidic chain;
AltName: Full=11S globulin seed storage protein II acidic chain;
Contains:
RecName: Full=11S globulin seed storage protein 2 basic chain;
AltName: Full=11S globulin seed storage protein II basic chain;
Flags: Precursor;

So Bio.SeqIO removes the spaces in front of the line, but Bio.SwissProt doesn't.
For consistency, I think it's better to decide on one of these two styles.
My preference is for the approach used by Bio.SwissProt. Any objections to modifying the code used by Bio.SeqIO?

--Michiel.

From p.j.a.cock at googlemail.com Tue Apr 21 07:26:00 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 12:26:00 +0100
Subject: [Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt
In-Reply-To: <393946.5637.qm@web62408.mail.re1.yahoo.com>
References: <393946.5637.qm@web62408.mail.re1.yahoo.com>
Message-ID: <320fb6e00904210426q19ea7a44xdb902661297ec855@mail.gmail.com>

On Tue, Apr 21, 2009 at 12:12 PM, Michiel de Hoon wrote:
>
> Dear all,
>
> I've noticed an inconsistency between how Bio.SeqIO and Bio.SwissProt parse DE (description) lines in SwissProt files.
>
> For these DE lines:
>
> DE   RecName: Full=11S globulin seed storage protein 2;
> DE   AltName: Full=11S globulin seed storage protein II;
> DE   AltName: Full=Alpha-globulin;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
> DE     AltName: Full=11S globulin seed storage protein II acidic chain;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
> DE     AltName: Full=11S globulin seed storage protein II basic chain;
> DE   Flags: Precursor;
>
> a SwissProt record created by Bio.SwissProt contains the following:
> >>> print swiss_record.description
> RecName: Full=11S globulin seed storage protein 2;
> AltName: Full=11S globulin seed storage protein II;
> AltName: Full=Alpha-globulin;
> Contains:
>   RecName: Full=11S globulin seed storage protein 2 acidic chain;
>   AltName: Full=11S globulin seed storage protein II acidic chain;
> Contains:
>   RecName: Full=11S globulin seed storage protein 2 basic chain;
>   AltName: Full=11S globulin seed storage protein II basic chain;
> Flags: Precursor;
>
> but a SeqRecord returned by Bio.SeqIO contains this:
>
> >>> print seq_record.description
> RecName: Full=11S globulin seed storage protein 2;
> AltName: Full=11S globulin seed storage protein II;
> AltName: Full=Alpha-globulin;
> Contains:
> RecName: Full=11S globulin seed storage protein 2 acidic chain;
> AltName: Full=11S globulin seed storage protein II acidic chain;
> Contains:
> RecName: Full=11S globulin seed storage protein 2 basic chain;
> AltName: Full=11S globulin seed storage protein II basic chain;
> Flags: Precursor;
>
> So Bio.SeqIO removes the spaces in front of the line, but Bio.SwissProt doesn't.
> For consistency, I think it's better to decide on one of these two styles.
> My preference is for the approach used by Bio.SwissProt. Any objections to modifying the code used by Bio.SeqIO?

Have you got a link for the full record in your example?

For interaction with other Bio.SeqIO formats, I generally expect the description to be a single line string (with no embedded newlines). If you look at the (old) SwissProt files in our unit tests, the current Bio.SeqIO behaviour makes sense - the DE line(s) just encode a fairly short simple string.

It looks like the SwissProt format has changed, and we should be parsing the new extended DE lines more carefully, and splitting these entries up and recording them in the SeqRecord.annotations dictionary?
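As a rough illustration of what "splitting these entries up" could look like, here is a self-contained sketch that turns the example DE lines into a nested dictionary. The function name `parse_de_lines` and the output layout (a list of one-entry dictionaries per category) are assumptions made for this example, not Biopython's actual parser or annotation format. It also assumes every DE line carries a single `Category: Key=Value;` entry, so continuation lines such as an extra `Short=` name are not handled:

```python
def parse_de_lines(lines):
    """Split new-style DE lines into a nested dictionary (sketch only)."""
    top = {}
    target = top  # dictionary currently being filled
    for line in lines:
        if not line.startswith("DE   "):
            raise ValueError("not a DE line: %r" % line)
        text = line[5:].rstrip()
        nested = text.startswith(" ")  # extra indent marks a Contains entry
        text = text.strip().rstrip(";")
        if text == "Contains:":
            # Start a new sub-record; following indented lines go into it.
            target = {}
            top.setdefault("Contains", []).append(target)
            continue
        if not nested:
            target = top
        category, _, value = text.partition(": ")
        if category == "Flags":
            target.setdefault("Flags", []).extend(value.split("; "))
        else:
            key, _, name = value.partition("=")
            target.setdefault(category, []).append({key: name})
    return top


de_lines = [
    "DE   RecName: Full=11S globulin seed storage protein 2;",
    "DE   AltName: Full=11S globulin seed storage protein II;",
    "DE   AltName: Full=Alpha-globulin;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II acidic chain;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 basic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II basic chain;",
    "DE   Flags: Precursor;",
]
annotations = parse_de_lines(de_lines)
print(annotations["RecName"])
print(len(annotations["Contains"]))
```

Keeping the leading whitespace until after the nesting check is what distinguishes the two behaviours discussed in this thread: stripping it first loses the information about which names belong to a Contains block.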
Peter

From bartek at rezolwenta.eu.org Tue Apr 21 07:29:39 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 21 Apr 2009 13:29:39 +0200
Subject: [Biopython-dev] Bio.Motif Suggestions
In-Reply-To:
References: <8b34ec180904200804o58d531ache7e3110f3b919cfa@mail.gmail.com>
Message-ID: <8b34ec180904210429g36d089a6h578dc0197a94516a@mail.gmail.com>

On Tue, Apr 21, 2009 at 10:34 AM, Leighton Pritchard wrote:
> Hi,
>
> Some thoughts and a bit of a wishlist...

These are always welcome. I can make no promises on timing of making your wishes come true ;)

>>>
>> I think that once we start talking about gapped motifs, we are really
>> talking about
>> multiple alignments on steroids. This hasn't been done so far because you
>> don't
>> really need it for DNA motifs,
>
> It might not be required for the motifs you've been working with, but we've
> been doing profile-based searches for bipartite regulatory binding sites in
> DNA. These sites have a variable-length spacer region, and so require
> gapped alignments for building motifs. The spacer region consensus
> (depending on the level of identity required for the consensus) is usually
> composed of Ns.

Indeed, there are dyadic motifs for some transcription factors. So far I was working only under the assumption that the gap is not too variable (say 3-5 nucleotides), and this you can fake by using multiple PWMs with different sizes of the gap, e.g.:

CACnnnGTG
CACnnnnGTG
CACnnnnnGTG

But it is a workaround rather than a feature... I'd also be interested in knowing about other applications where maybe this assumption (small gaps) is violated. Are there also motifs with multiple gaps?

Implementing this feature would probably require a separate subclass of Motif, since the internal implementation of searching would need to be different. This is a very good feature request, I think it is worth implementing, though currently I have no time to do it properly.
If you don't care too much about efficiency, I could quickly write this dyadic subclass with the implementation based on two motif instances and a variable gap.

>
> I guess that this comes down to whether we choose to restrict the meaning of
> "motif" to an ungapped string of symbols (including ambiguity) representing
> nt/aa, or whether we want to permit the inclusion of variable-length gaps,
> regions, or ambiguities in a PROSITE or regular expression-like manner (e.g.
> C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, GAACC.{17,21}AAC or
> C{,3}A{3,5}TTTT). Although profile methods like HMMer can produce a
> consensus output that looks like an ungapped string of symbols to represent
> a motif, it doesn't capture important features of the HMM representation.
>

I think that you are touching on multiple issues here. I'll try to answer them separately:

- gapped alignments are one thing. If we have a gap in one sequence but not in the others (frequent in protein motifs, not so much in DNA motifs) we just need a way to sensibly use it in creation of PWMs for searching
- dyadic motifs (gaps in otherwise ungapped alignments) are a different issue, since we have a gap in all instances, but it may have a variable length. See above.
- regular expressions are a different way of describing motifs. I think that it is not the purpose of Bio.Motif to compete with regexps, but it would certainly be valuable to be able to create motifs from some sort of (simplified) regexps. This was, to some extent, discussed in a recent thread on Seq.startswith methods
- HMM motifs are a totally different kind of beast. These guys introduce dependencies between positions (doable also with regexps) and there is currently no support for them in Bio.Motif. It would be cool to have support for them, but I'm not an expert here and it looks to me like a lot of work (also probably the methods of Bio.Motif are not exactly right for HMMs).
- finally, supporting PROSITE syntax seems to depend on the variable gap feature, but otherwise it's simply an important input format to support.

> I think the latter representations are more useful, even if harder to
> code/maintain. I think that leaving them out would be a glaring hole in
> functionality, and that they're a target Biopython should aim for.

Usefulness is hard to define in the abstract, without a particular problem, so this is arguable. It is certain that Bio.Motif is not a complete suite for all kinds of motif analysis, but I don't know of any tool that supports all these types of motifs with a single API (if you know one, please tell me). We should have ambitious goals, but I wouldn't call it a glaring hole not to have what is currently not available elsewhere...

>
>> I think it would be great to be able to easily
>> convert multiple alignments into motifs. This would allow us to use
>> the power of
>> Bio.AlignIO for IO and Bio.Motif for searching and comparisons. The question is
>> how to design API for these functions.
>
> I agree. I think that there's another important question: what do we mean,
> and need to do, when we talk about converting an alignment into a motif?
> Consensus/majority and PSSM methods from a sequence alignment should be
> straightforward to implement in Python - even for gapped alignments.
> Including a representation of variable-length gaps might be a little more
> difficult, and storing an HMM representation may be too much to manage
> immediately. That's still three different types of object - with likely
> different components to their interfaces - to be stored. In their
> relationship to a source alignment, these representations could be
> properties of a single alignment, or independent Bio.Motif objects (perhaps
> each with a link back to their parent alignment).
>
> The results of searches are also likely to be qualitatively different,
> depending on the type of motif used for the search, and the results desired
> by the user.
>
> I think that, for anything other than simple searches (string search,
> regex), we'd be on a hiding to nothing by implementing search methods within
> Python. It's not likely to be as fast as dedicated search packages, and it
> would be a headache for maintenance. So, with apologies if I missed this

What do you mean by searching here? Searching for a known motif or searching for a new motif? And what dedicated packages do you have in mind?

> part of the discussion or documentation, it seems to me that Bio.Motif could
> be most powerful in the alignment/searching/comparison process as a 'broker'
> within BioPython, providing a consistent API for interface with external
> alignment/search/comparison applications that also permits programmatic
> manipulation of the profile/HMM/alignment. E.g.
>

That's definitely an important field, though I'm not sure it is _the_ function for Bio.Motif. I think that the most valuable thing would be to internalize some of the complexity of different ways of using motifs in bioinformatics. My modest goal for now is making protein motifs first class citizens (meaning handling alphabets and gaps properly etc.). The next thing would be to make Bio.Motif cooperate nicely with

- Bio.Seq (e.g seq.startswith etc.),
- Bio.Align (conversions from-to alignments)

which includes easy motif creation from simple formats like IUPAC and simple regexps and would correspond to the "broker" function if I understand it correctly. Then I think it would be really cool to have spaced motifs, although here we need to be careful about performance.
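To make the spaced (dyadic) motif idea concrete, here is a deliberately naive sketch that mirrors the "several fixed gap sizes" workaround described earlier in this message: it looks for an exact left half-site followed, after every allowed gap length, by an exact right half-site. The function name and the test sequence are hypothetical, exact string matching stands in for PWM scoring, and no attempt is made at the performance concerns raised above:

```python
def find_dyadic(sequence, left, right, min_gap, max_gap):
    """Return (start, gap_length) pairs for left + gap + right matches.

    Exact matching only; a real implementation would score each half
    with a PWM instead of comparing strings.
    """
    hits = []
    for start in range(len(sequence) - len(left) + 1):
        if not sequence.startswith(left, start):
            continue
        # Try each allowed spacer length, like scanning with
        # CACnnnGTG, CACnnnnGTG and CACnnnnnGTG in turn.
        for gap in range(min_gap, max_gap + 1):
            right_start = start + len(left) + gap
            if sequence.startswith(right, right_start):
                hits.append((start, gap))
    return hits


sequence = "TTCACAAAGTGTTCACGGGGGGTGAA"
hits = find_dyadic(sequence, "CAC", "GTG", 3, 5)
print(hits)
```

The inner loop makes the cost grow with the width of the gap range, which is one way to see why wide or multiple gaps would need a smarter search than this.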
> align = Bio.AlignIO.read(alignfilehandle)
> consensus = align.build_consensus(threshold=0.9)
> pssm = align.build_pssm()
> hmmer = align.build_hmmer()
> hmm = align.build_hmm(order=3)
>
> Or
>
> consensus = Bio.Motif.consensus_from_alignment(align, threshold=0.9)
> pssm = Bio.Motif.build_pssm_from_alignment(align)
> hmmer = Bio.Motif.build_hmmer_from_alignment(align)
> hmm = Bio.Motif.build_hmm_from_alignment(align, order=3)
>

I would guess that the first example is what would actually be used, but it requires the functions on the Motif side to be available. As for more specific things:

- I don't like the usage of PSSM and consensus here. These are just different ways of looking at a Motif.
- Also the difference between HMMer and HMM is unclear to me (isn't HMMer a tool to make HMMs? Do we support HMMer in Biopython currently?) But I'm not too concerned about HMMs at the moment.

I would rather think of something like:

align = Bio.AlignIO.read(alignfilehandle)
motif = align.build_motif()

followed by:

motif.consensus()
motif.search_pwm(seq)
motif.search_instances(seq)
motif.weblogo()

>
> And then the consensus, pssm, hmm and hmmer objects could be used as input
> to interfaces for the relevant applications.
>

I don't understand your idea of separating consensus from pssm motifs. These are not fundamentally different. HMMs though are really different.

> Converting an alignment into an HMM for this purpose may itself benefit from
> a call to HMMer's hmmbuild (and Pythonic representation of the data
> structure), rather than implementation of an equivalent internal function -
> even though I think one of those would be useful, too.
>

Again, I'm not sure whether we have support for HMMer now (it was mentioned on the mailing-list once, but I don't know what happened to it). But I agree it would be useful.
To summarize:

- thanks for so much input, I especially appreciate the input on possible usages
- I will work on the features I mentioned in the direction of unifying the API for DNA and protein motifs, and I would definitely appreciate any help from others
- The dyadic motifs (or more generally gapped motifs) are next, and require taking care of performance issues
- HMM support is currently further down on my to-do list, mostly because it needs a rather different API. But once we have the "glue" functions for motifs, we can try to make similar "glue" functions for HMMs.

cheers
Bartek

From p.j.a.cock at googlemail.com Tue Apr 21 07:52:26 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 12:52:26 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090420132946.GB29652@sobchak.mgh.harvard.edu>
References: <20090417200558.GC19290@sobchak.mgh.harvard.edu> <252312.21376.qm@web62408.mail.re1.yahoo.com> <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> <20090420132946.GB29652@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904210452n6c17fcafl205abcfcc4e3edd1@mail.gmail.com>

On Mon, Apr 20, 2009 at 2:29 PM, Brad Chapman wrote:
> [accessing start and end]
>> >>> print rec_dict['1'].features[0].location.start
>> 20228
>> >>> rec_dict['1'].features[0].location.start.position
>> 20228
> [...]
>> Coupled with a variation of Brad's suggestion of adding start
>> and end properties to the SeqFeature, if we make these act
>> as proxies for feature.location.start and feature.location.end
>> that would become just:
>>
>> record = ...
>> feature = record.features[5] #for example
>> sub_seq = my_seq[feature.start:feature.end]
>
> Thanks Peter, that's exactly right.

Actually, it isn't - my mistake. Adding start and end properties to the SeqFeature as proxies for feature.location.start and feature.location.end wouldn't be a great idea.
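The slicing problem behind this retraction, which the following paragraphs spell out, can be reproduced with a tiny stand-in class. `Position` here is a hypothetical toy, not Biopython's real position object; it defines only `__int__`, which on current Python versions is not enough for use as a slice index (slicing looks for `__index__` instead):

```python
class Position(object):
    """Toy position object that only knows how to convert itself to int."""

    def __init__(self, position):
        self.position = position

    def __int__(self):
        return self.position


seq = "ACGTACGTAC"
start, end = Position(2), Position(6)

# Using the objects directly as slice bounds fails.
try:
    seq[start:end]
except TypeError as err:
    slicing_error = str(err)
else:
    slicing_error = None

# Explicit conversion, or reaching for the integer attribute, both work.
explicit = seq[int(start):int(end)]
via_attribute = seq[start.position:end.position]
print(slicing_error)
print(explicit, via_attribute)
```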
Currently feature.location.start and feature.location.end are position objects, and even if they had an __int__ method you can't do this:

record[feature.location.start:feature.location.end]
or:
record.seq[feature.location.start:feature.location.end]

You would have to do this:

record[int(feature.location.start):int(feature.location.end)]
or:
record.seq[int(feature.location.start):int(feature.location.end)]

The above wouldn't work well for fuzzy locations, we're better off with the current explicit option:

record[feature.location.start.position:feature.location.end.position]
or:
record.seq[feature.location.start.position:feature.location.end.position]

where if the user wants to they can take into account the fuzzy details, such as adding feature.location.end.extension to the end slice point.

----------------

Now the good news, we can instead simply use the FeatureLocation shortcuts for (approximated) plain integers:

record[feature.location.nofuzzy_start:feature.location.nofuzzy_end]
or:
record.seq[feature.location.nofuzzy_start:feature.location.nofuzzy_end]

These shortcuts already take fuzzy ends into consideration, and know to treat the start and end differently to get the wider feature.

So, a slight variation of the proposed internal details would be to make SeqFeature.start and end proxies for SeqFeature.location.nofuzzy_start and SeqFeature.location.nofuzzy_end (i.e. plain integers), achieving the goal of just:

record[feature.start:feature.end]
or:
record.seq[feature.start:feature.end]

(Suitable for non-join features, and gives a reasonable approximation for fuzzy locations).

> Accessing the start and end coordinates in SeqFeatures is unnecessarily
> cumbersome right now, but can be fixed fairly simply. We should be able
> to get this in now that 1.50 is rolled out.
> ...
> To be clear, start and end in SeqFeature would be integers and not
> handle any fuzzy stuff.
All of the representation is still there for > those actually dealing with fuzziness, but the top level attributes > would expose the coordinates nicely for the remaining 99% of cases. Right - and with the above correction that SeqFeature.start and end would be proxies for SeqFeature.location.nofuzzy_start and SeqFeature.location.nofuzzy_end, you would get plain integers, and this should cover most use cases. At least for non-Eukaryotes ;) >> I think that for a basic parser (as opposed to a parser integrated with Bio.SeqIO), >> the SeqFeatures are way too complicated for my mind. > [...] >> For a basic parser, I like the _gff_line_map function much better. >> Applied to the first line in the GFF file, it returns > [...] >> which is exactly what I need, in (almost) the places where I'd expect them. > > Does solving the start/end problem as described above help bridge the > gap between SeqFeatures and the custom representation? Are there other > usability issues you found? I would prefer to expose one data structure > and think SeqFeature can handle the data well. They scale to nested > cases, and will be familiar to those using features in SeqIO or BioSQL. You must agree that SeqFeature and FeatureLocation objects are not very lightweight. I understood that one of your goals with Bio.GFF and map/reduce is to handle massive files, so surely it makes sense to use a simple object structure here? Peter From mjldehoon at yahoo.com Tue Apr 21 07:55:36 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 21 Apr 2009 04:55:36 -0700 (PDT) Subject: [Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt In-Reply-To: <320fb6e00904210426q19ea7a44xdb902661297ec855@mail.gmail.com> Message-ID: <861995.42083.qm@web62406.mail.re1.yahoo.com> > Have you got a link for the full record in your example? 
>
You can find it here:

http://www.uniprot.org/uniprot/Q9XHP0.txt

> For interaction with other Bio.SeqIO formats, I generally
> expect the description to be a single line string (with no
> embedded newlines).

> It looks like the SwissProt format has changed, and we
> should be parsing the new extended DE lines more
> carefully, and splitting these entries up and recording
> them in the SeqRecord.annotations dictionary?
>
That sounds reasonable. The dictionary will have to be nested though. Something like this:

annotations["RecName"] = [{"Full": "11S globulin seed storage protein 2"}]
annotations["AltName"] = [{"Full": "11S globulin seed storage protein II"},
                          {"Full": "Alpha-globulin"}]
annotations["Contains"] = [{"RecName": {"Full": "11S globulin seed storage protein 2 acidic chain"},
                            "AltName": {"Full": "11S globulin seed storage protein II acidic chain"}},
                           {"RecName": {"Full": "11S globulin seed storage protein 2 basic chain"},
                            "AltName": {"Full": "11S globulin seed storage protein II basic chain"}},
                          ]
annotations["Flags"] = "Precursor"

--Michiel

From p.j.a.cock at googlemail.com Tue Apr 21 08:04:44 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 13:04:44 +0100
Subject: [Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt
In-Reply-To: <861995.42083.qm@web62406.mail.re1.yahoo.com>
References: <320fb6e00904210426q19ea7a44xdb902661297ec855@mail.gmail.com> <861995.42083.qm@web62406.mail.re1.yahoo.com>
Message-ID: <320fb6e00904210504g6c7f60f1o96129c9a6759c256@mail.gmail.com>

On Tue, Apr 21, 2009 at 12:55 PM, Michiel de Hoon wrote:
>
>> Have you got a link for the full record in your example?
>>
> You can find it here:
>
> http://www.uniprot.org/uniprot/Q9XHP0.txt
>
>> For interaction with other Bio.SeqIO formats, I generally
>> expect the description to be a single line string (with no
>> embedded newlines).
>
>> It looks like the SwissProt format has changed, and we
>> should be parsing the new extended DE lines more
>> carefully, and splitting these entries up and recording
>> them in the SeqRecord.annotations dictionary?
>>
> That sounds reasonable. The dictionary will have to be nested though. Something like this:
>
> annotations["RecName"] = ["Full=11S globulin seed storage protein 2"]
> annotations["AltName"] = ["Full=11S globulin seed storage protein II", "Full=Alpha-globulin"]
> annotations["Contains"] = [{"RecName": {"Full": "11S globulin seed storage protein 2 acidic chain"},
>                             "AltName": {"Full": "11S globulin seed storage protein II acidic chain"}},
>                            {"RecName": {"Full": "11S globulin seed storage protein 2 basic chain"},
>                             "AltName": {"Full": "11S globulin seed storage protein II basic chain"}},
>                           ]
> annotations["Flags"] = "Precursor"
>

Possible - but for BioSQL we couldn't store those dictionaries. A list of strings should work, but isn't as elegant. Maybe something along these lines?

annotations["RecName"] = ["Full: 11S globulin seed storage protein 2;"]
annotations["AltName"] = ["Full: 11S globulin seed storage protein II", "Full: Alpha-globulin"]
annotations["Contains"] = ["RecName: Full=11S globulin seed storage protein 2 acidic chain;\nAltName: Full=11S globulin seed storage protein II acidic chain;",
                           "RecName: Full=11S globulin seed storage protein 2 basic chain;\nAltName: Full=11S globulin seed storage protein II basic chain;"]
annotations["Flags"] = "Precursor"

Or for "Contains" just have a flat list of strings, one for each name (here four names).

Or for "Contains" just drop the AltName entries, and simply have a list of the RecName entries (here two names).
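One possible reading of the list-of-strings idea above, as a minimal sketch. The helper name `flatten_names` is hypothetical and is not part of Bio.SwissProt or BioSQL; it only shows how a nested "Contains" block could be turned into the plain strings suggested here, under the assumption that each block holds at most a RecName and an AltName dictionary.

```python
def flatten_names(block):
    """Turn {"RecName": {"Full": "..."}} style entries into one plain string.

    Hypothetical helper for illustration only.
    """
    lines = []
    for category in ("RecName", "AltName"):
        # Sort the keys so the output order is stable.
        for key, value in sorted(block.get(category, {}).items()):
            lines.append("%s: %s=%s;" % (category, key, value))
    return "\n".join(lines)


# Nested structure using the names quoted in this thread.
contains = [
    {"RecName": {"Full": "11S globulin seed storage protein 2 acidic chain"},
     "AltName": {"Full": "11S globulin seed storage protein II acidic chain"}},
    {"RecName": {"Full": "11S globulin seed storage protein 2 basic chain"},
     "AltName": {"Full": "11S globulin seed storage protein II basic chain"}},
]

# A flat list of strings, one per contained chain.
flat_contains = [flatten_names(block) for block in contains]
```

Whether the strings should keep the AltName entries at all is exactly the open question in this message; the sketch keeps them only because dropping them is the simpler change.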
Peter

From bugzilla-daemon at portal.open-bio.org Tue Apr 21 08:13:04 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 21 Apr 2009 08:13:04 -0400
Subject: [Biopython-dev] [Bug 2818] New: Add start and end properties to SeqFeature object
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2818

Summary: Add start and end properties to SeqFeature object
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

An enhancement proposed on the mailing list would add start and end properties to the SeqFeature returning plain integers (non-fuzzy approximations to the start and end locations) suitable for slicing most parent sequences. Dealing with a join location would still be tricky.

Example usage:

>>> from Bio import SeqIO
>>> record = SeqIO.read(open("NC_005816.gb"),"gb")
>>> feature = record.features[2]
>>> print feature
type: gene
location: [86:1109]
ref: None:None
strand: 1
qualifiers:
    Key: db_xref, Value: ['GeneID:2767718']
    Key: locus_tag, Value: ['YP_pPCP01']

>>> record[feature.start:feature.end]
SeqRecord(seq=Seq('ATGGTCACTTTTGAGACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAATG...TGA', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=[])
>>> record.seq[feature.start:feature.end]
Seq('ATGGTCACTTTTGAGACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAATG...TGA', IUPACAmbiguousDNA())

Patch to follow.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
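For readers without the attachment to hand, the proposal in this report can be sketched roughly as follows. This is not the attached patch: `SketchLocation` and `SketchFeature` are hypothetical stand-ins that only mimic the `nofuzzy_start` and `nofuzzy_end` attributes named above, and a plain string stands in for the parent sequence.

```python
class SketchLocation(object):
    """Hypothetical stand-in for FeatureLocation (not the Biopython class)."""

    def __init__(self, nofuzzy_start, nofuzzy_end):
        # Plain-integer approximations of the (possibly fuzzy) positions.
        self.nofuzzy_start = nofuzzy_start
        self.nofuzzy_end = nofuzzy_end


class SketchFeature(object):
    """Hypothetical stand-in for SeqFeature with the proposed shortcuts."""

    def __init__(self, location):
        self.location = location

    @property
    def start(self):
        # Read-only proxy for location.nofuzzy_start.
        return self.location.nofuzzy_start

    @property
    def end(self):
        # Read-only proxy for location.nofuzzy_end.
        return self.location.nofuzzy_end


# A plain string stands in for the parent Seq or SeqRecord.
parent = "AAACCCGGGTTT"
feature = SketchFeature(SketchLocation(3, 9))
sub_seq = parent[feature.start:feature.end]
```

Making the shortcuts read-only properties keeps the location object as the single place where coordinates are stored, which seems to match the "proxy" wording of the report.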
From bugzilla-daemon at portal.open-bio.org Tue Apr 21 08:16:17 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Apr 2009 08:16:17 -0400 Subject: [Biopython-dev] [Bug 2818] Add start and end properties to SeqFeature object In-Reply-To: Message-ID: <200904211216.n3LCGHWZ025657@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2818 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-21 08:16 EST ------- Created an attachment (id=1281) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1281&action=view) Patch to Bio/SeqFeature.py Makes SeqFeature.start and end proxies for SeqFeature.location.nofuzzy_start and SeqFeature.location.nofuzzy_end (i.e. plain integers) See also: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005818.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Tue Apr 21 08:17:41 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Apr 2009 13:17:41 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904210452n6c17fcafl205abcfcc4e3edd1@mail.gmail.com> References: <20090417200558.GC19290@sobchak.mgh.harvard.edu> <252312.21376.qm@web62408.mail.re1.yahoo.com> <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> <20090420132946.GB29652@sobchak.mgh.harvard.edu> <320fb6e00904210452n6c17fcafl205abcfcc4e3edd1@mail.gmail.com> Message-ID: <320fb6e00904210517k63edc766xcb830a7150e4c5d1@mail.gmail.com> On Tue, Apr 21, 2009 at 12:52 PM, Peter Cock wrote: >> Accessing the start and end coordinates in SeqFeatures is unnecessarily >> cumbersome right now, but can be fixed fairly simply. We should be able >> to get this in now that 1.50 is rolled out. >> ... 
>> To be clear, start and end in SeqFeature would be integers and not >> handle any fuzzy stuff. All of the representation is still there for >> those actually dealing with fuzziness, but the top level attributes >> would expose the coordinates nicely for the remaining 99% of cases. > > Right - and with the above correction that SeqFeature.start and end > would be proxies for SeqFeature.location.nofuzzy_start and > SeqFeature.location.nofuzzy_end, you would get plain integers, and > this should cover most use cases. ?At least for non-Eukaryotes ;) Patch for this proposal on Bug 2818, http://bugzilla.open-bio.org/show_bug.cgi?id=2818 Peter From bartek at rezolwenta.eu.org Tue Apr 21 08:17:55 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 21 Apr 2009 14:17:55 +0200 Subject: [Biopython-dev] history on github - where are the tags? In-Reply-To: <8b34ec180904210516y21bec2e1r3294b2d15edf386f@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com> <8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com> <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com> <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com> <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com> <8b34ec180904210434t2ee76e8bsc91af814f53e2df4@mail.gmail.com> <320fb6e00904210457p6189e096m966becad772cd610@mail.gmail.com> <8b34ec180904210516y21bec2e1r3294b2d15edf386f@mail.gmail.com> Message-ID: <8b34ec180904210517i259762e7t343e0f773c939a15@mail.gmail.com> On Tue, Apr 21, 2009 at 1:57 PM, Peter wrote: > Maybe. ?We can double check this by creating a trivial project in > github, doing a few commits, tag, commits, tag - and checking the > github interface and also the GitX presentation. ?That should tell us > if the issue is specific to our converted repository or not. no it's not specific. 
You can find a toy repository here: http://github.com/barwil/testing_tags/tree/master (please don't consider this link a permanent one, I'll remove it soon.) cheers Bartek From chapmanb at 50mail.com Tue Apr 21 08:20:45 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 21 Apr 2009 08:20:45 -0400 Subject: [Biopython-dev] Rolling new releases In-Reply-To: <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com> References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com> <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com> <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com> Message-ID: <20090421122045.GD30529@sobchak.mgh.harvard.edu> Hi Peter; > http://biopython.org/wiki/Building_a_release > > i.e. Maybe in a few months time I (or Michiel) can say "Right, CVS > freeze while XXX does the release", where person XXX gets to scan the > documentation, double check the NEWS files, check the unit tests etc, > before putting together the packages and uploading them to the server. > And maybe then hand over to our "News Coordinator" to do the release > announcement? Having more people involved will make it take a little > longer, but should mean less minor things get missed (e.g. a typo in > the NEWS file, or a broken unit test specific to a particular OS or > version of python). It would be great to have others involved in rolling releases. BioPerl often passes the Release Manager hat around for release to release, and perhaps we can get the same tradition going here. I like the idea of people volunteering for this. It would also be worth thinking about what the worst parts of building the releases are and seeing if we can automate or eliminate them. A few things that I can think of: - Remove support for older python versions, which would eliminate all those windows installers. 
I will write more about this in your other thread. - Eliminating the beta releases. Biopython is developed as stable in Git/CVS, so gets testing that way on developer machines. Are we getting enough feedback from betas to make them worthwhile? - Automate building the docs nightly/weekly on biopython.org. If the Tutorial/epydoc stuff is a lot of work, we could work up a script and cron to eliminate this part. That's from my fuzzy memory of rolling releases. Brad From chapmanb at 50mail.com Tue Apr 21 08:35:31 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 21 Apr 2009 08:35:31 -0400 Subject: [Biopython-dev] Python 2.3 support In-Reply-To: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com> References: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com> Message-ID: <20090421123531.GE30529@sobchak.mgh.harvard.edu> Hi Peter; > As we've been warning for the last couple of releases, Biopython 1.50 > should be the last release to officially support Python 2.3. No one > has complained yet, but they may not have noticed. I suspect there may > be people out there using a local Biopython installation on an old > Linux/Unix computer where the system Python is rather old. For > Biopython 1.50 I added a warning to setup.py when run on Python 2.3 so > that may get more attention. Are we getting a lot of feedback that we need to keep supporting these old versions? 2.3 was released in 2003, 2.4 in 2004, and 2.5 in 2006. This means people who need anything prior to 2.5 haven't updated in over 3 years. I understand the problem of non-responsive sysadmins and what not. However, we only have so many cycles for testing and coding; is it worthwhile spending some on these problems? One of the nice selling points of Python is that it's a dynamic language, and I like using new features of the language as much as anyone. Beyond the 2/3 split, it is very back compatible and I've never had any problems moving even very large projects forward to new versions. 
Practically, I'd be for dropping 2.4 support in the next release and being a bit more aggressive in general on moving upwards and onwards.

Brad

From p.j.a.cock at googlemail.com Tue Apr 21 08:43:11 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 13:43:11 +0100
Subject: [Biopython-dev] Rolling new releases
In-Reply-To: <20090421122045.GD30529@sobchak.mgh.harvard.edu>
References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com> <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com> <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com> <20090421122045.GD30529@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com>

On Tue, Apr 21, 2009 at 1:20 PM, Brad Chapman wrote:
> It would also be worth thinking about what the worst parts of
> building the releases are and seeing if we can automate or eliminate
> them. A few things that I can think of:
>
> - Remove support for older python versions, which would eliminate
>   all those windows installers. I will write more about this in your
>   other thread.

That makes almost no difference, it's just one extra line to do at the command line:

c:\python23\python setup.py bdist_wininst
c:\python24\python setup.py bdist_wininst
c:\python25\python setup.py bdist_wininst
c:\python26\python setup.py bdist_wininst

Yes, you also have to build and test on each version of python, but honestly, once the build environment is set up, doing the Windows release on three versus four versions of Python isn't worth worrying about.

> - Eliminating the beta releases. Biopython is developed as stable
>   in Git/CVS, so gets testing that way on developer machines. Are we
>   getting enough feedback from betas to make them worthwhile?

For Biopython's move from Numeric to NumPy, I think doing a beta was worthwhile.
Maybe the feedback from the 1.50 beta release wasn't that big, but it didn't take that much effort, and it helped focus us in readiness for Biopython 1.50. Beta releases are also good for any Windows users, for whom setting up the build environment is quite a hurdle, so running the latest code from the repository is more difficult. Beta releases also give us more press coverage, and give us a clear way to ask people to try out particular new stuff.

> - Automate building the docs nightly/weekly on biopython.org. If the
>   Tutorial/epydoc stuff is a lot of work, we could work up a script
>   and cron to eliminate this part.

Again, building the docs is pretty trivial. We have in the past deliberately NOT updated the online copies, so that they stay in sync with the latest release. I suppose we could have two copies on the website, the "latest release" and the "nightly code".

Peter

From chapmanb at 50mail.com Tue Apr 21 08:44:49 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 21 Apr 2009 08:44:49 -0400
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904210452n6c17fcafl205abcfcc4e3edd1@mail.gmail.com>
References: <20090417200558.GC19290@sobchak.mgh.harvard.edu> <252312.21376.qm@web62408.mail.re1.yahoo.com> <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> <20090420132946.GB29652@sobchak.mgh.harvard.edu> <320fb6e00904210452n6c17fcafl205abcfcc4e3edd1@mail.gmail.com>
Message-ID: <20090421124449.GF30529@sobchak.mgh.harvard.edu>

Hi Peter;

[...fuzzy handling...]
> Right - and with the above correction that SeqFeature.start and end
> would be proxies for SeqFeature.location.nofuzzy_start and
> SeqFeature.location.nofuzzy_end, you would get plain integers, and
> this should cover most use cases. At least for non-Eukaryotes ;)

Yes, that was my proposal. Thanks for fleshing it out and for the patch.

> > Does solving the start/end problem as described above help bridge the
> > gap between SeqFeatures and the custom representation?
Are there other > > usability issues you found? I would prefer to expose one data structure > > and think SeqFeature can handle the data well. They scale to nested > > cases, and will be familiar to those using features in SeqIO or BioSQL. > > You must agree that SeqFeature and FeatureLocation objects are not > very lightweight. I understood that one of your goals with Bio.GFF > and map/reduce is to handle massive files, so surely it makes sense to > use a simple object structure here? Unless you are thinking of having an object representation as being too heavy, the non-light part of SeqFeature is all the FeatureLocation fuzziness. I would be for a SeqFeatureLite class that is API compatible with SeqFeature (with the new start/end attributes) and does not support fuzzy locations. This would handle GFF understandably, be lightweight, and allow access to BioSQL and SeqIO. How does this sound? Brad From p.j.a.cock at googlemail.com Tue Apr 21 08:56:23 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Apr 2009 13:56:23 +0100 Subject: [Biopython-dev] Python 2.3 support In-Reply-To: <20090421123531.GE30529@sobchak.mgh.harvard.edu> References: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com> <20090421123531.GE30529@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904210556i7fd3efe3pe74a6773d367a3e9@mail.gmail.com> On Tue, Apr 21, 2009 at 1:35 PM, Brad Chapman wrote: > Hi Peter; > >> As we've been warning for the last couple of releases, Biopython 1.50 >> should be the last release to officially support Python 2.3. ?No one >> has complained yet, but they may not have noticed. I suspect there may >> be people out there using a local Biopython installation on an old >> Linux/Unix computer where the system Python is rather old. For >> Biopython 1.50 I added a warning to setup.py when run on Python 2.3 so >> that may get more attention. > > Are we getting a lot of feedback that we need to keep supporting these > old versions? 
2.3 was released in 2003, 2.4 in 2004, and 2.5 in 2006.
> This means people who need anything prior to 2.5 haven't updated in over
> 3 years. I understand the problem of non-responsive sysadmins and what
> not. However, we only have so many cycles for testing and coding; is it
> worthwhile spending some on these problems?

Until recently I had a very strong personal interest in keeping Biopython running on Python 2.3, so I never regarded this as "wasted cycles". My personal Windows machine ran Python 2.3 and MSVC 6.0. In order to update the python version and continue to compile Biopython, I would also have had to replace the compiler etc. and the hard drive was pretty full so this didn't appeal. I have recently been trying Ubuntu on this machine instead (on a second hard drive).

For reference, my current (only) Windows machine (at work) has Python 2.3, 2.4 and 2.5 for which I use mingw32 to compile Biopython (same setup as Michiel), plus Python 2.6 for which I'm using Microsoft's free VC++ 2008 Express Edition from http://www.microsoft.com/express/download/

> Practically, I'd be for dropping 2.4 support in the next release and
> being a bit more aggressive in general on moving upwards and onwards.

I wouldn't support that. I would insist on giving at least one release's notice as a minimum.

Peter

From p.j.a.cock at googlemail.com Tue Apr 21 09:05:23 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 14:05:23 +0100
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF)
Message-ID: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>

On Tue, Apr 21, 2009 at 1:44 PM, Brad Chapman wrote:
>> You must agree that SeqFeature and FeatureLocation objects are not
>> very lightweight. I understood that one of your goals with Bio.GFF
>> and map/reduce is to handle massive files, so surely it makes sense to
>> use a simple object structure here?
> > Unless you are thinking of having an object representation as being too > heavy, the non-light part of SeqFeature is all the FeatureLocation > fuzziness. Fair point. > I would be for a SeqFeatureLite class that is API compatible with > SeqFeature (with the new start/end attributes) and does not support > fuzzy locations. This would handle GFF understandably, be lightweight, > and allow access to BioSQL and SeqIO. How does this sound? I have also been thinking about how I would (re)design the SeqFeature and FeatureLocation objects. In particular I would want to put the strand as part of the same object as the location, and also any join-locations. I would still want to cope with fuzzy locations, but make the non-fuzzy approximations more prominent in comparison. Also, I really don't like the way joins are currently stored as more SeqFeatures in the sub_features list (plus this kind of blocks alternative usage for child/parent nesting that might be nice for GFF files). The prime use case to keep in mind is taking a feature location (even a join), and using this to extract that region of nucleotides from the parent sequence (i.e. a Seq object or a SeqRecord object, as now both can be sliced). 
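To make the "prime use case" above concrete, here is a minimal sketch of a location object that carries the strand and any join parts together and can extract its region from a parent sequence. Everything in it is an assumption for illustration: `JoinLocationSketch` and `reverse_complement` are hypothetical names, plain integers stand in for positions (no fuzzy handling), and a plain string stands in for the Seq object.

```python
def reverse_complement(dna):
    """Reverse complement of an unambiguous DNA string (N is kept as N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(dna))


class JoinLocationSketch(object):
    """Hypothetical location holding strand and join parts in one object."""

    def __init__(self, parts, strand=1):
        # parts: (start, end) integer pairs, Python-style half-open,
        # listed in ascending order along the parent sequence.
        self.parts = list(parts)
        self.strand = strand

    @property
    def start(self):
        # Overall non-fuzzy start: the left-most coordinate of any part.
        return min(start for start, end in self.parts)

    @property
    def end(self):
        # Overall non-fuzzy end: the right-most coordinate of any part.
        return max(end for start, end in self.parts)

    def extract(self, parent):
        """Return the joined region, reverse complemented on strand -1."""
        joined = "".join(parent[start:end] for start, end in self.parts)
        if self.strand == -1:
            return reverse_complement(joined)
        return joined


parent = "AAACCCGGGTTT"
forward = JoinLocationSketch([(0, 3), (6, 9)])
reverse = JoinLocationSketch([(0, 3), (6, 9)], strand=-1)
forward_seq = forward.extract(parent)
reverse_seq = reverse.extract(parent)
```

A simple non-join feature is just the one-part case, so the same `extract` call would cover both situations.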
Peter

From dalloliogm at gmail.com Tue Apr 21 09:25:52 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 21 Apr 2009 15:25:52 +0200
Subject: [Biopython-dev] Python 2.3 support
In-Reply-To: <320fb6e00904210556i7fd3efe3pe74a6773d367a3e9@mail.gmail.com>
References: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com> <20090421123531.GE30529@sobchak.mgh.harvard.edu> <320fb6e00904210556i7fd3efe3pe74a6773d367a3e9@mail.gmail.com>
Message-ID: <5aa3b3570904210625v78e604c4j87c4be62b9a0488c@mail.gmail.com>

On Tue, Apr 21, 2009 at 2:56 PM, Peter Cock wrote:
> On Tue, Apr 21, 2009 at 1:35 PM, Brad Chapman wrote:
> > Hi Peter;
> >
> >> As we've been warning for the last couple of releases, Biopython 1.50
> >> should be the last release to officially support Python 2.3. No one
> >> has complained yet, but they may not have noticed.
>
I know of many people (a whole lab) who until recently were still using python 2.3.
However, please, drop support for these older versions or people will never upgrade :)

--
My blog on bioinformatics (now in English): http://bioinfoblog.it

From p.j.a.cock at googlemail.com Tue Apr 21 09:51:26 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 14:51:26 +0100
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF)
In-Reply-To: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
Message-ID: <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com>

On Tue, Apr 21, 2009 at 1:44 PM, Brad Chapman wrote:
> Unless you are thinking of having an object representation as being too
> heavy, the non-light part of SeqFeature is all the FeatureLocation
> fuzziness.

I've just had a quick go at what should be a 100% backwards compatible modification to the FeatureLocation class to store ExactPosition start or end positions as integers.
The idea is that this should be more memory efficient, using the complex position objects only when required. The new __init__ method would look like this:

    def __init__(self, start, end):
        """Specify the start and end of a sequence feature."""
        #Keeps exact locations as plain integers
        #Calculates the non-fuzzy versions now to make accessing
        #them simpler and faster (expected to be used more often)
        if isinstance(start, int) or isinstance(start, long):
            self._start = None
            self._start_int_nofuzzy = start
        elif isinstance(start, ExactPosition) :
            #Don't need to keep the full object
            self._start = None
            self._start_int_nofuzzy = start.position
        else :
            assert isinstance(start, AbstractPosition), repr(start)
            self._start = start
            self._start_int_nofuzzy = min(start.position, start.position + start.extension)
        if isinstance(end, int) or isinstance(end, long) :
            self._end = None
            self._end_int_nofuzzy = end
        elif isinstance(end, ExactPosition) :
            #Don't need to keep the full object
            self._end = None
            self._end_int_nofuzzy = end.position
        else :
            assert isinstance(end, AbstractPosition), repr(end)
            self._end = end
            self._end_int_nofuzzy = max(end.position, end.position + end.extension)

The associated methods are then updated accordingly. When a position object is requested, self._start or self._end is used if it is not None; otherwise an ExactPosition is generated on the fly from the integer self._start_int_nofuzzy or self._end_int_nofuzzy. When the non-fuzzy integer approximation is wanted (the typical use case), we have those cached as the integers.

The unit tests all pass (except test_BioSQL_SeqIO.py), but we'd need to have some sort of benchmark to demonstrate any memory gains in order to justify this kind of change. Maybe try it with Brad's GFF parser on a very large file?

I could stick the full patch on Bugzilla (or perhaps github) if this sounds worth pursuing...
An alternative implementation would use a single private variable to store either the integer position or the position object, and check the type when the public properties are accessed. This should be an even bigger memory saving, but may be slower.

Peter

From p.j.a.cock at googlemail.com Tue Apr 21 09:55:23 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 14:55:23 +0100
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF)
In-Reply-To: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
Message-ID: <320fb6e00904210655u5f497227o24457ec2ee3d711f@mail.gmail.com>

> I have also been thinking about how I would (re)design the SeqFeature
> and FeatureLocation objects. In particular I would want to put the
> strand as part of the same object as the location, and also any
> join-locations. I would still want to cope with fuzzy locations, but
> make the non-fuzzy approximations more prominent in comparison. Also,
> I really don't like the way joins are currently stored as more
> SeqFeatures in the sub_features list (plus this kind of blocks
> alternative usage for child/parent nesting that might be nice for GFF
> files).
>
> The prime use case to keep in mind is taking a feature location (even
> a join), and using this to extract that region of nucleotides from the
> parent sequence (i.e. a Seq object or a SeqRecord object, as now both
> can be sliced).

I forgot to mention the second major use case I'm concerned about, which is recovering the GenBank/EMBL style location string. I have looked at this in the past, by adding methods to the FeatureLocation and all the Position objects, but it is complicated by the fact the Position objects don't know if they are at the start or end (and for the start locations we need to add one to convert from Python counting).
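The accessor behaviour described in this message might look roughly like the sketch below. It is an illustration under stated assumptions rather than the patch itself: `SketchExact` and `CachedLocationSketch` are hypothetical stand-ins for the Bio.SeqFeature classes, and any object with `position` and `extension` attributes is treated as a fuzzy position.

```python
class SketchExact(object):
    """Hypothetical stand-in for an exact position object."""

    def __init__(self, position):
        self.position = position
        self.extension = 0


class CachedLocationSketch(object):
    """Hypothetical location that caches the non-fuzzy integers."""

    def __init__(self, start, end):
        # min() gives the wider bound for a start, max() for an end.
        self._start, self._start_int_nofuzzy = self._split(start, min)
        self._end, self._end_int_nofuzzy = self._split(end, max)

    @staticmethod
    def _split(value, pick):
        # Return (position object or None, cached integer).
        if isinstance(value, int):
            return None, value
        if isinstance(value, SketchExact):
            # Exact positions carry no extra information, keep only the int.
            return None, value.position
        # Fuzzy position: keep the object and cache the wider bound.
        return value, pick(value.position, value.position + value.extension)

    @property
    def start(self):
        # Rebuild an exact position on the fly when none was stored.
        if self._start is None:
            return SketchExact(self._start_int_nofuzzy)
        return self._start

    @property
    def end(self):
        if self._end is None:
            return SketchExact(self._end_int_nofuzzy)
        return self._end

    @property
    def nofuzzy_start(self):
        return self._start_int_nofuzzy

    @property
    def nofuzzy_end(self):
        return self._end_int_nofuzzy


loc = CachedLocationSketch(5, SketchExact(20))
```

Note that in this sketch `loc.start` returns a fresh object on each access when only the integer was stored, which is one of the behaviours a benchmark of the real patch would need to weigh against the memory saved.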
This is the main block on having Bio.SeqIO support writing GenBank (or EMBL) files with their features included. Peter From lpritc at scri.ac.uk Tue Apr 21 09:50:01 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 21 Apr 2009 14:50:01 +0100 Subject: [Biopython-dev] Bio.Motif Suggestions In-Reply-To: <8b34ec180904210429g36d089a6h578dc0197a94516a@mail.gmail.com> Message-ID: Hi Bartek, It's a long one, this... I expect many TLDR response ;) On 21/04/2009 12:29, "Bartek Wilczynski" wrote: > On Tue, Apr 21, 2009 at 10:34 AM, Leighton Pritchard > wrote: >> Some thoughts and a bit of a wishlist... > > These are always welcome. I can make no promises on timing of making > your wishes come true ;) No-one ever does :( > But it is a workaround rather than a feature... I'd be also interested > in knowing about other applications where maybe this assumption (small gaps) > is violated. Are there also motifs with multiple gaps? Yes - it might be a stretch, but if you wanted to represent the organisation of protein domains in a multi-domain protein (e.g. a transposase, or some pathogen effectors) as motifs you might want to do this. > Implementing this feature would probably require a separate > subclass of Motif, since the internal implementation of searching would > need to be different. I'm not sure that this needs to be true. A motif with no gaps can be considered as a special case of a motif with an arbitrary number of gaps. If the base implementation is that of a gapped motif (e.g. Represented as ACT.{5,10}CCC.{,4}TATCAT.{3}GGG) then the basic method of searching - and here using the re module might work - doesn't need to be any different for an ungapped variant representing a particular instance of the multiply-gapped motif (ACTNNNNNNCCCNNNNTATCATNNNGGG), or for any other ungapped sequence (e.g. ACTCCCTATCATGGG). This may not be the case for more complex search algorithms, however. Other classes of Motif may well be necessary, in any case... 
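As a small illustration of the point about the re module, the three representations mentioned above can all be searched with the same function. This is a sketch only: `to_regex` and `find_motif` are hypothetical helpers, treating `N` as "any one symbol" is an assumption, and the target sequences are made up for the example.

```python
import re


def to_regex(motif):
    """Compile a motif string, treating N as 'any one symbol' (an assumption)."""
    return re.compile(motif.replace("N", "."))


def find_motif(pattern, sequence):
    """Return the (start, end) span of the first match, or None."""
    match = pattern.search(sequence)
    return match.span() if match else None


# The multiply-gapped motif from the text, written directly as a regex.
gapped = re.compile(r"ACT.{5,10}CCC.{,4}TATCAT.{3}GGG")
# One particular instance of it, with fixed-length gaps.
fixed_gaps = to_regex("ACTNNNNNNCCCNNNNTATCATNNNGGG")
# An unrelated ungapped motif, searched for in exactly the same way.
ungapped = to_regex("ACTCCCTATCATGGG")

# Made-up target: gaps of 6, 4 and 3 symbols between the fixed blocks.
target = "TT" + "ACT" + "GATTAG" + "CCC" + "GGAA" + "TATCAT" + "AAA" + "GGG" + "TT"

gapped_hit = find_motif(gapped, target)
fixed_hit = find_motif(fixed_gaps, target)
ungapped_hit = find_motif(ungapped, target)
```

The search call is identical in all three cases; only the pattern differs. As the text notes, this says nothing about more complex search algorithms, where a shared implementation may not be possible.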
> This is a very good feature request, I think it is worth implementing, > though currently I have no time to do it properly. I'm right there with you, unfortunately ;) >> I guess that this comes down to whether we choose to restrict the meaning of >> "motif" to an ungapped string of symbols (including ambiguity) representing >> nt/aa, or whether we want to permit the inclusion of variable-length gaps >> > I think that you are touching on multiple issues here. I was trying to focus on one issue, but it does have lots of implications, which you cover below. The one issue I intended is this: A sequence motif can be represented in more than one way, and those ways are not necessarily interchangeable - either conceptually or in code. An ungapped string of symbols isn't able to represent the same information as a regular expression (can do ambiguity of repeat counts), which in turn isn't able to represent the same information as a PSSM (can represent probabilities at each position), which in turn isn't able to represent the same information as an HMM (can represent variable-order dependency). However, the things you want to do with that motif, such as use it to search a set of candidate sequences or produce an example matching sequence for test purposes, can be the same regardless of the coding or conceptual representation of that motif. We come back to this below, but for now this does lead on to... > - gapped alignemnts are one thing. If we have a gap in one sequence > but not in the others > (frequent in protein motifs, not so much in DNA motifs) we just need a > way to sensibly use it in creation of PWMs for searching > - dyadic motifs (gaps in otherwise ungapped alignments) are a > different issue, since we have a > gap in all instances, but it may have a variable length. see above. These are, I think, the same issue. In your first example, PWMs will (mostly) work because the lengths of most sequences are the same and there are few gaps. 
However, unless you have a way of varying the length of your PWM during a query of the target sequence, the PWM need not match the gapped sequence strongly, potentially leading to a false negative. As an example: ABCDE AB-DE ABCDE ABCDE The PWM will be (shorthand) [A1][B1][C.75,-.25][D1][E1], and when applied to the target sequence ABDE (which was in your alignment), will not produce as high a score as it would for the other members of the alignment. For the alignment: A-CDE AB-DE ABC-E ABCDE The PWM is (shorthand) [A1][B.75,-.25][C.75,-.25][D.75,-.25][E1] With corresponding poor scores (potential false negatives) for target sequences ACDE, ABDE and ABCE. Without a way to (intelligently) place gaps in your target sequences, or otherwise account for gaps when searching, the problem is the same whether there is one gap or a dyadic motif. The *practical* issue is different, in that you can probably accept the odd false negative for a motif in which one training sequence has a gap, but PWMs are poor candidates for alignments with many gaps, as they can readily produce false negatives. The key issue is that PWMs are fixed-length, and variable-length representations are common, desirable, and difficult to express in a fixed-width framework. > -regular expressions are a different way of describing motifs. That is true - they are intermediate between consensus sequence, and PSSMs in their ability to describe variation, but also have the capacity to represent variable-length sequences. > I think that it is not a purpose of Bio.Motif to compete with regexps, but it > would be certainly valuable to be able to have a possibility of creating > motifs from some sort of (simplified) regexps. 
This was, to some extent, > discussed in a recent thread on Seq.startswith methods I was involved in that discussion :D I don't think that Bio.Motif needs to compete with the re module, but instead could use its robust, stable code to implement a regular expression representation of sequence motifs, seamlessly. > -HMM motifs are a totally different kind of beast. These guys introduce > dependencies between positions (doable also with regexps) and there is > currently no support for them in Bio.Motif. It would be cool to have > support for them, but I'm not an expert here and it looks to me like a > lot of work (also probably the methods of Bio.Motif are not exactly right for > HMMs). You're right about the dependencies - they're the important features I was alluding to in my post - but I don't think that regular expressions are a good way to approach the same problem; they don't encode the same information. > -finally, supporting prosite syntax seems to depend on the variable gap > feature, but otherwise it's simply an important input format to support. I wasn't suggesting PROSITE syntax as part of any desire for implementation - though a PROSITE <-> regex/consensus translation would be useful, I think - rather as an illustration that more people than me need variable length spacers in their motifs. >> I think the latter representations are more useful, even if harder to >> code/maintain. I think that leaving them out would be a glaring hole in >> functionality > > Usefulness is hard to define in the abstract, away from a particular problem, so > this is arguable. It is certain that Bio.Motif is not a complete suite for all > kinds of motif analysis but I don't know of any tool that supports all > these types of motifs with a single API (if you know one, please tell me). > We should have ambitious goals, but I wouldn't call it a glaring hole not to > have what is currently not available elsewhere... I apologise for my poor wording.
What I meant was that it would seem odd if support for motif representation was considered complete without representing variable-length sequences. Left alone, this would always represent an obvious target for improvement (i.e. 'a glaring hole in functionality'). No criticism was meant by it - I think you've done a great job so far on Bio.Motif - and I apologise if I have caused offence. >> I think that, for anything other than simple searches (string search, >> regex), we'd be on a hiding to nothing by implementing search methods within >> Python. ?It's not likely to be as fast as dedicated search packages, and it >> would be a headache for maintenance. > What do you mean by searching here? Searching for a known motif or searching > for a new motif? And what dedicated packages you have on your mind? Searching for a known motif in a larger sequence. Three packages - two biologically-dedicated, one not - spring to mind. The non-biologically-dedicated one is grep. Representing ambiguity symbols as combinations of bases, e.g. [ACT] . [TA], [^T] and so on - with FASTA files where sequences are not punctuated by \n or \r - is highly effective for finding sequence motifs representable by regular expressions. Dedicated 1: PSI-BLAST - takes PSSMs representing a sequence profile Dedicated 2: HMMer - builds and uses an HMM representation of the sequence profile. There are others, but I'd have to think hard to recall them. You could consider HMMer versions 1, 2 and 3 as different, in a number of ways - including their utility for nucleotide sequence representation... >> it seems to me that Bio.Motif could >> be most powerful in the alignment/searching/comparison process as a 'broker' >> within BioPython, providing a consistent API for interface with external >> alignment/search/comparison applications that also permits programmatic >> manipulation of the profile/HMM/alignment. ?E.g. 
> I think that the most valuable thing would be to internalize some of > the compliexity of different ways of using motifs in bioinformatics. My modest > goal for now is making protein motifs first class citizens (meaning handling > alphabets and gaps properly etc. ). > The next thing would be to make bio.motif cooperate nicely with > - Bio.Seq (e.g seq.startswith etc.), > - Bio.Align (conversions from-to alignments) > which includes easy motif creation from simple formats like IUPAC and > simple regexps and would correspond to the "broker" function if I understand > it correctly. > Then I think it would be really cool to have spaced motifs, although > here we need to be careful about performance. If I might suggest: the main role of the Bio.Motif module as you intend it appears to be to represent motifs of biological sequences, and to provide useful functionality for them. Now, there are several ways of representing these motifs both conceptually, and in code - and they're not all interchangeable. Some of them have a many -> one mapping (PSSM -> consensus sequence), and some have no obvious mapping at all (HMM <-/-> PSSM). There is a decision to be made concerning how motifs are represented internally: PSSM, regex and/or HMM. PSSM has the clear benefit that, given a PSSM, you can easily generate the consensus sequence and a regular expression of fixed-length - but the mapping to a regular expression is not clear, and may not produce the one that the user would prefer. HMMs can't readily be converted to other representations, and regular expressions can't be expanded to PSSMs, or converted to consensus sequences (unless they have no length ambiguities). It is not just performance we need to think about, but the very representation of a motif. Each of these representations is useful under different circumstances. I think it is worth avoiding a structure that enforces a single internal representation and closes off future alternative representations. 
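To make the many -> one mapping concrete, here is a small sketch in plain Python. The helper names `column_frequencies` and `majority_consensus` are invented for illustration, a simple majority rule is assumed for the consensus, and raw per-column frequencies stand in for a real log-odds PSSM. Two different alignments give different frequencies but the same consensus string, so the frequencies cannot be recovered from the consensus:

```python
from collections import Counter


def column_frequencies(alignment):
    """Per-column symbol frequencies for equal-length aligned strings."""
    n = len(alignment)
    return [
        {symbol: count / n for symbol, count in Counter(column).items()}
        for column in zip(*alignment)
    ]


def majority_consensus(frequencies):
    """Most frequent symbol in each column (ties broken alphabetically)."""
    return "".join(
        min(col, key=lambda s: (-col[s], s)) for col in frequencies
    )


# Gapped alignment used earlier in this message.
aln_a = ["ABCDE", "AB-DE", "ABCDE", "ABCDE"]
# A second, ungapped alignment with the same majority symbols.
aln_b = ["ABCDE", "ABCDE", "ABCDE", "ABCDE"]

freq_a = column_frequencies(aln_a)
freq_b = column_frequencies(aln_b)

consensus_a = majority_consensus(freq_a)
consensus_b = majority_consensus(freq_b)
```

Both consensus strings come out the same while `freq_a` and `freq_b` differ in the third column, which is the information a consensus-only representation throws away.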
Giving the user sufficient flexibility/rope to hang themselves with in their choice of internal representation is a Good Thing™, in my opinion. > As for more specific things: > - I don't like the usage of PSSM and consensus here. These are just > different ways of looking at a Motif. > I don't understand your idea of separating consensus from pssm motifs. These > are not fundamentally different. HMMs though are really different. I see what you mean, but I think you're associating PSSM with Motif too strongly. A PSSM can be used to generate a consensus sequence, but the resulting consensus sequence cannot be used to generate the corresponding PSSM uniquely. There is not a one-one mapping, and they do not describe the same information. Consensus sequences, for example, do not indicate the probability of finding a particular symbol at any given position; PSSMs can. PSSMs are fundamentally different from consensus sequences in that consensus sequences don't encode the degree of variability at any position. Consensus, regex, PSSM and HMM are all different ways of looking at a Motif, but they're not all internally-compatible - which is my point. If you build a PSSM motif and make the alignment data nonrecoverable, you cannot reconstruct a corresponding HMM representation, later, for example. So you would have to decide what kind of representation you use at motif build-time, build all of them at once, or keep the alignment around to build what you need later. I'd prefer to choose at build time, but YMMV. > -Also the difference between HMMer and HMM is unclear to me > (isn't hmmer a tool to make HMMS? Do we support HMMER in Biopython currently?) > But I'm not too concerned about HMMs at the moment. There is a fair amount of flexibility in how you choose to define your HMM for a motif, and not just in the order of the HMM. There has been corresponding variation in how HMMer represents its data internally, over the years.
I was meaning to imply by syntax that a HMMer-specific representation could be called 'hmmer', but a generic internal HMM representation could just be called 'hmm', to reflect this. I'm not going to insist on the convention, but it seems simple and obvious to me (again, YMMV). Sorry for the length and likely repetition, but I think these are issues worth thinking about. Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). 
______________________________________________________ From biopython at maubp.freeserve.co.uk Tue Apr 21 10:30:20 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Apr 2009 15:30:20 +0100 Subject: [Biopython-dev] history on github - where are the tags? In-Reply-To: <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com> <8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com> <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com> <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com> <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com> Message-ID: <320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com> > > From some more reading this, it sounds like our CVS tags are > essentially turned into commit markers in git. See: > > http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#how-git-stores-references > http://book.git-scm.com/3_git_tag.html > > This shouldn't rule out showing them in the history, but perhaps the > cvs to git migration confuses things... By setting up a toy repository with tags done through git itself (I assume), Bartek has convinced me that GitHub itself never shows the tags in the history. I think this is a big drawback, and that we should ask GitHub about this. However, using the Mac GUI tool GitX, I was able to see the tags in the history using the toy repository (they show up as nice yellow blobs), but not using the current Biopython CVS to git conversion. There appears to be something less than ideal about our CVS to git conversion. I believe this relates to how the tag commits appear in the commit tree - and it looks like for Biopython they are all tiny branches off the main trunk, i.e. if you look at the main trunk history (overall or for any one file) then the tag commits are not in it.
This hunch appears to be supported by the git log output: $ git clone git://github.com/biopython/biopython.git $ cd biopython $ git log --graph --all ... | * commit 8fb446965d58f266ba8bf41a992a09e4bedbac3e | Author: peterc | Date: Mon Apr 20 16:07:41 2009 +0000 | | Bump the version number now that Biopython 1.50 is released | | * commit 4ed11049092d86704a2a15359c77459bad30e291 |/ Author: cvs2dvcs transform | Date: Mon Apr 20 10:48:32 2009 +0000 | | This commit was manufactured by cvs2svn to create tag 'biopython-150'. | * commit 29aa4df3480cdee803694766f137ab2baf5625b2 | Author: peterc | Date: Mon Apr 20 10:48:31 2009 +0000 | | You don't have to email Iddo to get on the CONTRIB file | ... In comparison, for Bartek's toy repository there is a single branch shown. Peter From sbassi at clubdelarazon.org Tue Apr 21 10:34:20 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Tue, 21 Apr 2009 11:34:20 -0300 Subject: [Biopython-dev] Python 2.3 support In-Reply-To: <5aa3b3570904210625v78e604c4j87c4be62b9a0488c@mail.gmail.com> References: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com> <20090421123531.GE30529@sobchak.mgh.harvard.edu> <320fb6e00904210556i7fd3efe3pe74a6773d367a3e9@mail.gmail.com> <5aa3b3570904210625v78e604c4j87c4be62b9a0488c@mail.gmail.com> Message-ID: <9e2f512b0904210734w46d22856k46c4bdfcddcf0346@mail.gmail.com> On Tue, Apr 21, 2009 at 10:25 AM, Giovanni Marco Dall'Olio wrote: > However, please, drop support for these older version or people won't never > upgrade :) That is true, but it is also true that you can use a new version without upgrading. The reason for not upgrading is in most cases to avoid breaking working scripts. On my (old) OS, the WIFI card uses Python 2.3 to work. But Python allows you to install "alternative" versions without conflicting with your default system version. This way I have Python 2.4, 2.5, 2.6 and 3 all installed on the same machine.
Using alt-install or just compiling a Python version without doing a system install. I even have more than one 2.5 version and each with a different Biopython installation (using virtual_env) for testing purposes. So I don't think there is a valid reason to keep supporting such an old version. From biopython at maubp.freeserve.co.uk Tue Apr 21 10:47:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Apr 2009 15:47:40 +0100 Subject: [Biopython-dev] Python 2.3 support In-Reply-To: <9e2f512b0904210734w46d22856k46c4bdfcddcf0346@mail.gmail.com> References: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com> <20090421123531.GE30529@sobchak.mgh.harvard.edu> <320fb6e00904210556i7fd3efe3pe74a6773d367a3e9@mail.gmail.com> <5aa3b3570904210625v78e604c4j87c4be62b9a0488c@mail.gmail.com> <9e2f512b0904210734w46d22856k46c4bdfcddcf0346@mail.gmail.com> Message-ID: <320fb6e00904210747h73b8881dkcfaf8a53f2f7aab@mail.gmail.com> > ... Python allows to install "alternative" versions without > conflicting with your default system version. This way I have Python > 2.4, 2.5, 2.6 and 3 all installed in the same machine. Using > alt-install or just compiling a Python version without doing a system > install. I even have more than one 2.5 version and each with a > different Biopython installation (using virtual_env) for testing > purposes. So I don't think there is a valid reason to keep supporting > such an old version. OK, OK, no one loves Python 2.3 anymore, and you'll all be glad to see the back of it ;) Shall we say that at the end of April, unless anyone has come forward with a strong need to continue using Biopython on Python 2.3 (or we are forced to do another release to fix something), we'll start work on removing Python 2.3 specific code in May? A lot (hopefully most) of the Python 2.3 bits have a comment about this in the source code, so a quick grep should pull out most of them. 
If any of you remember any other specific things we need to change, please add a note to Bug 2817. http://bugzilla.open-bio.org/show_bug.cgi?id=2817 Thanks Peter From bartek at rezolwenta.eu.org Tue Apr 21 11:19:00 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 21 Apr 2009 17:19:00 +0200 Subject: [Biopython-dev] history on github - where are the tags? In-Reply-To: <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com> <8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com> <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com> <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com> <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com> <320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com> <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com> Message-ID: <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com> Hi, > There appears to be something less than ideal about our CVS to git > conversion. I believe this relates to how the tag commits appear in > the commit tree - and it looks like for Biopython they are all tiny > branches off the main trunk. i.e. If you look at the main trunk > history (overall or for any one file) then tags commits are not in it. > I hadn't noticed this difference. It just seems to be the way cvs2git handles tags. This behavior does not seem to be controllable from the config file. I'll try to ask on the cvs2git mailing list. In case it is not possible to change it in cvs2git itself, the worst-case scenario would be to re-tag the git tree manually (or with the help of some script). So there is no risk of losing tags. I'll post when I have any progress on this issue.
cheers ?Bartek -- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From mjldehoon at yahoo.com Tue Apr 21 11:23:03 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 21 Apr 2009 08:23:03 -0700 (PDT) Subject: [Biopython-dev] Rolling new releases In-Reply-To: <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com> Message-ID: <955867.40270.qm@web62404.mail.re1.yahoo.com> --- On Tue, 4/21/09, Peter Cock wrote: > Again, building the docs is pretty trivial. We have in the > past deliberately NOT updated the online copies, so that it is > in sync with the latest release. I suppose we could have two > copies on the website, the "latest release" and the > "nightly code". > That would be nice. In the past, I've done such things by hand to let people look at the documentation for a piece of code that's about to go into CVS. From mjldehoon at yahoo.com Tue Apr 21 11:28:56 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 21 Apr 2009 08:28:56 -0700 (PDT) Subject: [Biopython-dev] Rolling new releases In-Reply-To: <20090421122045.GD30529@sobchak.mgh.harvard.edu> Message-ID: <268684.34243.qm@web62402.mail.re1.yahoo.com> --- On Tue, 4/21/09, Brad Chapman wrote: > - Eliminating the beta releases. Biopython is developed as > stable in Git/CVS, so gets testing that way on developer > machines. Are we getting enough feedback from betas to make > them worthwhile? I agree. A project like Biopython is destined to be in perpetual beta mode anyway. To my mind, Biopython 1.50-beta is as stable as Biopython 1.49 and Biopython 1.51. In addition, will we be able to remember that Biopython 1.50b is the beta release of version 1.50 (or did we have a 1.50, then a 1.50a, and then a 1.50b release?). 
--Michiel From mjldehoon at yahoo.com Tue Apr 21 11:35:44 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 21 Apr 2009 08:35:44 -0700 (PDT) Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090421124449.GF30529@sobchak.mgh.harvard.edu> Message-ID: <595877.69734.qm@web62408.mail.re1.yahoo.com> --- On Tue, 4/21/09, Brad Chapman wrote: > I would be for a SeqFeatureLite class that is API > compatible with SeqFeature (with the new start/end > attributes) and does not support > fuzzy locations. This would handle GFF understandably, be > lightweight, and allow access to BioSQL and SeqIO. > How does this sound? Depends on whether SeqFeatureLite only exists for the benefit of GFF files. If so, we're better off with a light-weight GFF-specific object. If not, then it may make sense. But even then it sounds a bit like class creep. --Michiel. From p.j.a.cock at googlemail.com Tue Apr 21 11:58:23 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Apr 2009 16:58:23 +0100 Subject: [Biopython-dev] Rolling new releases In-Reply-To: <955867.40270.qm@web62404.mail.re1.yahoo.com> References: <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com> <955867.40270.qm@web62404.mail.re1.yahoo.com> Message-ID: <320fb6e00904210858k1aa4b5cav2a784b75fb3b3f8@mail.gmail.com> On Tue, Apr 21, 2009 at 4:23 PM, Michiel de Hoon wrote: > > Peter wrote: >> Again, building the docs is pretty trivial. ?We have in the >> past deliberately NOT updated the online copies, so that it is >> in sync with the latest release. ?I suppose we could have two >> copies on the website, the "latest release" and the >> "nightly code". > > That would be nice. In the past, I've done such things by hand to > let people look at the documentation for a piece of code that's > about to go into CVS. > This should be trivial to get setup - at least as long as our repository lives on the OBF server. 
There are already scripts or CVS hooks in place to update http://biopython.org/SRC/ although I don't know how exactly this is configured. On Tue, Apr 21, 2009 at 4:28 PM, Michiel de Hoon wrote: >Brad wrote: >>> - Eliminating the beta releases. Biopython is developed as >>> stable in Git/CVS, so gets testing that way on developer >>> machines. Are we getting enough feedback from betas to make >>> them worthwhile? > > I agree. A project like Biopython is destined to be in perpetual beta > mode anyway. To my mind, Biopython 1.50-beta is as stable as > Biopython 1.49 and Biopython 1.51. In addition, will we be able to > remember that Biopython 1.50b is the beta release of version 1.50 > (or did we have a 1.50, then a 1.50a, and then a 1.50b release?). Maybe I have hung about with computer scientists / programmers too long, as to me there is no confusion about the ordering alpha -> beta -> release candidate -> final. However, if the consensus is that explicit beta releases are redundant, then so be it. Peter From bartek at rezolwenta.eu.org Tue Apr 21 11:59:32 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 21 Apr 2009 17:59:32 +0200 Subject: [Biopython-dev] Bio.Motif Suggestions In-Reply-To: References: <8b34ec180904200804o58d531ache7e3110f3b919cfa@mail.gmail.com> Message-ID: <8b34ec180904210859r34a0a034qdfb54d57c3ca85e3@mail.gmail.com> Hi, thanks for your suggestions. To make the long story short: - I mostly agree with your points - I've updated the wiki page to include your requests http://biopython.org/wiki/MotifDev - I'll definitely spend some time working on particular requests and then post specifically. cheers Bartek On Tue, Apr 21, 2009 at 10:34 AM, Leighton Pritchard wrote: > Hi, > > Some thoughts and a bit of a wishlist... > > On 20/04/2009 16:04, "Bartek Wilczynski" wrote: > >> On Mon, Apr 20, 2009 at 4:35 PM, Peter >> wrote: >>> >>> What would a space in a motif mean? 
?Clearly something different from >>> a wildcard like N or X in nucleotide or protein sequences. ?Does it >>> mean a gap of variable length? ?If it means a gap of one character >>> then surely just using a "-" would be sensible (as used in multiple >>> sequence alignments), for which we have a gapped alphabet system >>> setup. >>> >> I think that once we start talking about gapped motifs, we are really >> talking about >> multiple alignments on steroids. This hasn't been done so far because you >> don't >> really need it for DNA motifs, > > It might not be required for the motifs you've been working with, but we've > been doing profile-based searches for bipartite regulatory binding sites in > DNA. ?These sites have a variable-length spacer region, and so require > gapped alignments for building motifs. ?The spacer region consensus > (depending on the level of identity required for the consensus) is usually > composed of Ns. > > I guess that this comes down to whether we choose to restrict the meaning of > "motif" to an ungapped string of symbols (including ambiguity) representing > nt/aa, or whether we want to permit the inclusion of variable-length gaps, > regions, or ambiguities in a PROSITE or regular expression-like manner (e.g. > C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, GAACC.{17,21}AAC or > C{,3}A{3,5}TTTT). ?Although profile methods like HMMer can produce a > consensus output that looks like an ungapped string of symbols to represent > a motif, it doesn't capture important features of the HMM representation. > > I think the latter representations are more useful, even if harder to > code/maintain. ?I think that leaving them out would be a glaring hole in > functionality, and that they're a target Biopython should aim for. > >> I think it would be great to be >> able to easily >> convert multiple alignments into motifs. 
This would allow us to ?use >> the power of >> BIo.AlignIO for IO and Bio.Motif for searching and comparisons.The question is >> how to design API for these ?functions. > > I agree. ?I think that there's another important question: what do we mean, > and need to do, when we talk about converting an alignment into a motif? > Consensus/majority and PSSM methods from a sequence alignment should be > straightforward to implement in Python - even for gapped alignments. > Including a representation of variable-length gaps might be a little more > difficult, and storing an HMM representation may be too much to manage > immediately. ?That's still three different types of object - with likely > different components to their interfaces - to be stored. ?In their > relationship to a source alignment, these representations could be > properties of a single alignment, or independent Bio.Motif objects (perhaps > each with a link back to their parent alignment). > > The results of searches are also likely to be qualitatively different, > depending on the type of motif used for the search, and the results desired > by the user. > > I think that, for anything other than simple searches (string search, > regex), we'd be on a hiding to nothing by implementing search methods within > Python. ?It's not likely to be as fast as dedicated search packages, and it > would be a headache for maintenance. ?So, with apologies if I missed this > part of the discussion or documentation, it seems to me that Bio.Motif could > be most powerful in the alignment/searching/comparison process as a 'broker' > within BioPython, providing a consistent API for interface with external > alignment/search/comparison applications that also permits programmatic > manipulation of the profile/HMM/alignment. ?E.g. 
> > align = Bio.AlignIO.read(alignfilehandle) > consensus = align.build_consensus(threshold=0.9) > pssm = align.build_pssm() > hmmer = align.build_hmmer() > hmm = align.build_hmm(order=3) > > Or > > consensus = Bio.Motif.consensus_from_alignment(align, threshold=0.9) > pssm = Bio.Motif.build_pssm_from_alignment(align) > hmmer = Bio.Motif.build_hmmer_from_alignment(align) > hmm = Bio.Motif.build_hmm_from_alignment(align, order=3) > > (which I don't think is as neat an interface, even if all > align.build_consensus does is call the Bio.Motif.consensus_from_alignment > method) > > Followed by things like > > pssm.consensus() > pssm.logo() > hmm.generate_sequence(length=100) > hmm.to_graphviz() > > And then the consensus, pssm, hmm and hmmer objects could be used as input > to interfaces for the relevant applications. > > Converting an alignment into an HMM for this purpose may itself benefit from > a call to HMMer's hmmbuild (and Pythonic representation of the data > structure), rather than implementation of an equivalent internal function - > even though I think one of those would be useful, too. > > Cheers, > > L. > -- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From biopython at maubp.freeserve.co.uk Tue Apr 21 12:29:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Apr 2009 17:29:19 +0100 Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com> <8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com> <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com> <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com> <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com> <320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com> <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com> <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com> Message-ID: <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com> On Tue, Apr 21, 2009 at 4:19 PM, Bartek Wilczynski wrote: > Hi, > >> There appears to be something less than ideal about our CVS to git >> conversion. ?I believe this relates to how the tag commits appear in >> the commit tree - and it looks like for Biopython they are all tiny >> branches off the main trunk. ?i.e. If you look at the main trunk >> history (overall or for any one file) then tags commits are not in it. >> > > I haven't noticed this difference. It just seems to be the way cvs2got > handles tags. This behavior does not seem to be controllable from > the config file. I'll try to ask on the cvs2git mailing list. > > In case it is not possible to change it in cvs2git itself, the worst > scenario would be to re-tag the git tree manually (or with a help > of some script). So there is no risk of loosing tags. > > I'll post when I have any progress onn this issue. There is another option, redo the import using git cvsimport. This has the downside that we lose all the network history currently in github, but its only going to affect a couple of people and that was always a possibility. I've just done this twice, firstly over the network (just over an hour, probably a bad idea in terms of wasting the OBF bandwidth). 
Then I succeeded in doing it locally (under 15 minutes) on my Mac after logging into dev.open-bio.org and fetching a zipped up copy of the CVS files. The hard bit was working out how to get the CVSROOT directory setup:

cvs -d $PWD/biopython_cvs init
cd biopython_cvs
unzip ../../Biopython-CVS-2009-04-21.zip
cd ..
time nice -n 10 git cvsimport -v -k -d /Users/pjcock/repositories/bp_cvs_local_to_git/biopython_cvs -C biopython_git biopython

Both conversions appear to give the same result. Using GitX the history now shows the tags as I expect them to appear (nice yellow markers on the main branch), and the tag side branches have gone:

$ cd biopython_git
$ git log --graph --all
...
|
* commit 6283ffe77fdd07ae678d2fa35ae9311ee7fd51ee
| Author: peterc
| Date: Mon Apr 20 16:07:41 2009 +0000
|
|     Bump the version number now that Biopython 1.50 is released
|
* commit 17a9b80f89be97fd4cc31d7c3618e82e4c83cafc
| Author: peterc
| Date: Mon Apr 20 10:48:31 2009 +0000
|
|     You don't have to email Iddo to get on the CONTRIB file
|
...

I'm not sure if "git log" can be told to show the tags itself.

Also, just like Bartek's conversion using cvs2svn, this also appears to correctly identify simple file moving (when the add and delete are done in one CVS operation, obviously not when it was done in two steps like my recent changes in Bio.Graphics.GenomeDiagram).

Note - we can probably use http://github.com/guides/change-author-details-in-commit-history to map author names to github user names later, but in theory git cvsimport will do this with the -A option.

Peter

From biopython at maubp.freeserve.co.uk Tue Apr 21 12:58:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Apr 2009 17:58:23 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
	<8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com>
	<8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com>
	<320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com>
	<320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>
	<320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>
	<320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com>
	<8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com>
	<8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com>
	<320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
Message-ID: <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>

On Tue, Apr 21, 2009 at 5:29 PM, Peter wrote:
>
> There is another option, redo the import using git cvsimport.  This
> has the downside that we lose all the network history currently in
> github, but it's only going to affect a couple of people and that was
> always a possibility.
>
> I've just done this twice, firstly over the network (just over an
> hour, probably a bad idea in terms of wasting the OBF bandwidth).
> Then I succeeded in doing it locally (under 15 minutes) on my Mac
> after logging into dev.open-bio.org and fetching a zipped up copy of
> the CVS files.  The hard bit was working out how to get the CVSROOT
> directory setup:
>
> cvs -d $PWD/biopython_cvs init
> cd biopython_cvs
> unzip ../../Biopython-CVS-2009-04-21.zip
> cd ..
> time nice -n 10 git cvsimport -v -k -d
> /Users/pjcock/repositories/bp_cvs_local_to_git/biopython_cvs  -C
> biopython_git biopython
>
> Both conversions appear to give the same result.
Using GitX the
> history now shows the tags as I expect them to appear (nice yellow
> markers on the main branch), and the tag side branches have gone:
>

I've pushed this to github as
http://github.com/peterjc/biopython-cvs-import/tree/master

$ cd biopython_git
$ git remote add origin git at github.com:peterjc/biopython-cvs-import.git
$ git push origin master
$ git push origin master --tags

This won't be automatically updated, so please don't fork it!

Peter

From biopython at maubp.freeserve.co.uk Tue Apr 21 14:18:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Apr 2009 19:18:12 +0100
Subject: [Biopython-dev] Possible re-import from CVS to git
Message-ID: <320fb6e00904211118n23e68b2vc43de8397ff3b08c@mail.gmail.com>

On the thread about the missing history tags in github, I wrote:
>> ... Then I succeeded in doing it locally (under 15 minutes) on my Mac
>> after logging into dev.open-bio.org and fetching a zipped up copy of
>> the CVS files.  The hard bit was working out how to get the CVSROOT
>> directory setup:
>>
>> cvs -d $PWD/biopython_cvs init
>> cd biopython_cvs
>> unzip ../../Biopython-CVS-2009-04-21.zip
>> cd ..
>> time nice -n 10 git cvsimport -v -k -d
>> /Users/pjcock/repositories/bp_cvs_local_to_git/biopython_cvs  -C
>> biopython_git biopython
>>

I've been testing the -A option for git cvsimport to map our CVS usernames to github accounts.
http://www.kernel.org/pub/software/scm/git/docs/git-cvsimport.html

The following format omitting the email address does nothing at all (checking the local repository), which is a shame as I was hoping it would allow a quick and simple way to map the CVS usernames to the github usernames:

peterc=peterjc

However, the documented format (with the email address in angle brackets, which the list archive strips) does work:

peterc=full name <email address>

It seems that as long as the email address matches that used for your github account, once the repository is uploaded to github it will all work nicely - and your github account will be linked to the commit.
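
As an illustration of the author mapping discussed above, here is a minimal Python 3 sketch that writes lines in the `username=Full Name <email>` layout the git-cvsimport documentation describes for its -A file. The helper names (`format_author_line`, `build_author_file`) and all names and addresses in the sample data are placeholders made up for this example, not real accounts or part of any tool.

```python
def format_author_line(cvs_user, full_name, email):
    """Return one 'user=Full Name <email>' line for a cvsimport -A file."""
    # The thread notes that a mapping without an email address did nothing,
    # so this sketch refuses to emit one.
    if not email or "@" not in email:
        raise ValueError("an email address is needed for %s" % cvs_user)
    return "%s=%s <%s>" % (cvs_user, full_name, email)


def build_author_file(authors):
    """authors maps CVS username -> (full name, email); returns file text."""
    lines = [format_author_line(user, name, email)
             for user, (name, email) in sorted(authors.items())]
    return "\n".join(lines) + "\n"


# Placeholder data only; real entries would use each developer's github email.
example_authors = {
    "peterc": ("Example Name", "someone@example.org"),
    "otherdev": ("Another Person", "another@example.org"),
}
author_file_text = build_author_file(example_authors)
print(author_file_text)
```

The resulting text would be saved to a file and passed to git cvsimport with -A; whether github then links the commits depends on the addresses matching the accounts, as described above.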
So, if we are going to re-do the git import (and we may have to fix the tag history), it would be very nice if all the existing CVS users could first: (a) setup an account on github, and (b) tell me the email address you are using for it. If we do move to github, you would need to do this anyway in order to be given collaborator status to make commits direct to the main trunk. > I've pushed this to github as > http://github.com/peterjc/biopython-cvs-import/tree/master That is deleted now. Peter From p.j.a.cock at googlemail.com Tue Apr 21 16:06:56 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Apr 2009 21:06:56 +0100 Subject: [Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt In-Reply-To: <320fb6e00904210504g6c7f60f1o96129c9a6759c256@mail.gmail.com> References: <320fb6e00904210426q19ea7a44xdb902661297ec855@mail.gmail.com> <861995.42083.qm@web62406.mail.re1.yahoo.com> <320fb6e00904210504g6c7f60f1o96129c9a6759c256@mail.gmail.com> Message-ID: <320fb6e00904211306u50955608ndccef5d0cb6ba09b@mail.gmail.com> On Tue, Apr 21, 2009 at 1:04 PM, Peter Cock wrote: >>> It looks like the SwissProt format has changed, and we >>> should be parsing the new extended DE lines more >>> carefully, and splitting these entries up and recording >>> them in the SeqRecord.annotations dictionary? >> >> That sounds reasonable. The dictionary will have to be >> nested though. Something like this ... >> Thinking this over, we should take that SwissProt file and load it into BioSQL using BioPerl, and see how they dealt with the DE lines, and try and do the same for Bio.SeqIO in order that loading it into BioSQL with Biopython gives more or less the same thing. 
Peter From eric.talevich at gmail.com Wed Apr 22 00:32:33 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 22 Apr 2009 00:32:33 -0400 Subject: [Biopython-dev] Possible re-import from CVS to git In-Reply-To: <320fb6e00904211118n23e68b2vc43de8397ff3b08c@mail.gmail.com> References: <320fb6e00904211118n23e68b2vc43de8397ff3b08c@mail.gmail.com> Message-ID: <3f6baf360904212132k7110deeft6829c1b4a7b18f24@mail.gmail.com> On Tue, Apr 21, 2009 at 2:18 PM, Peter wrote: > So, if we are going to re-do the git import (and we may have to fix > the tag history), it would be very nice if all the existing CVS users > could first: > (a) setup an account on github, and > (b) tell me the email address you are using for it. > > If we do move to github, you would need to do this anyway in order to > be given collaborator status to make commits direct to the main trunk. > > Eek. Now that the Summer of Code is under way, I guess this is a good time to bring up the question of how Nick and I should be following the Biopython trunk and publishing our own code. In spite of the warning that the CVS tracker in GitHub was tentative, I was getting comfortable with the setup we had. Should I (we) hold off on pushing anything substantial to GitHub until this tagging situation is resolved, or is there a better way to approach this? For example, does anyone know if it's straightforward to back up a branch's recent history with git-format-patch and apply it directly onto a new repository with different references? Thanks, Eric From lpritc at scri.ac.uk Wed Apr 22 03:56:46 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 22 Apr 2009 08:56:46 +0100 Subject: [Biopython-dev] Bio.Motif Suggestions In-Reply-To: <8b34ec180904210859r34a0a034qdfb54d57c3ca85e3@mail.gmail.com> Message-ID: Hi Bart, On 21/04/2009 16:59, "Bartek Wilczynski" wrote: > Hi, > > thanks for your suggestions. 
> > To make the long story short: > - I mostly agree with your points > - I've updated the wiki page to include your requests > http://biopython.org/wiki/MotifDev > - I'll definitely spend some time working on particular requests and > then post specifically. Many thanks for the quick response - I've seen your wiki update, too. Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). 
______________________________________________________

From bartek at rezolwenta.eu.org Wed Apr 22 04:53:21 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Wed, 22 Apr 2009 10:53:21 +0200
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
	<320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com>
	<320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>
	<320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>
	<320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com>
	<8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com>
	<8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com>
	<320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
	<320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>
	<8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
Message-ID: <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>

Hi,

On Tue, Apr 21, 2009 at 6:58 PM, Peter wrote:
> On Tue, Apr 21, 2009 at 5:29 PM, Peter wrote:
>>
>> There is another option, redo the import using git cvsimport.  This
>> has the downside that we lose all the network history currently in
>> github, but it's only going to affect a couple of people and that was
>> always a possibility.

Yes, it is an option, but I would be quite reluctant to do it. I think this issue with tags is possible to get fixed without re-doing the import.

I'm scared by the possibility that we re-import stuff, fix the tags, everybody switches, people complain how good it was back then with CVS, and one month down the road, we find that there is an issue with something else, that was not present in the previous import.

I think this is becoming a bit chaotic now.
We still haven't removed the first github conversion (biopython_old branch: is anyone using it anyway?), there is this semi-official one that has a (fixable in my opinion) issue with tags and now there is a new one made by Peter.

In summary:
I have no objections to using any particular tool for importing stuff to git. I don't like the idea of not even trying to fix the problem we have but instantly changing the tool we are using.

I now consider re-importing stuff a major problem: everybody will need to port their changes, which is work.

>>
>> I've just done this twice, firstly over the network (just over an
>> hour, probably a bad idea in terms of wasting the OBF bandwidth).
>> Then I succeeded in doing it locally (under 15 minutes) on my Mac
>> after logging into dev.open-bio.org and fetching a zipped up copy of
>> the CVS files.  The hard bit was working out how to get the CVSROOT
>> directory setup:
>>

It's good to know it works, I don't think the time differences are significant.

>
> This won't be automatically updated, so please don't fork it!

exactly

Bartek
--
Bartek Wilczynski
==================
Postdoctoral fellow
EMBL, Furlong group
Meyerhoffstrasse 1, 69012
Heidelberg, Germany
tel: +49 6221 387 8433

From biopython at maubp.freeserve.co.uk Wed Apr 22 05:08:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Apr 2009 10:08:13 +0100
Subject: [Biopython-dev] Possible re-import from CVS to git
In-Reply-To: <3f6baf360904212132k7110deeft6829c1b4a7b18f24@mail.gmail.com>
References: <320fb6e00904211118n23e68b2vc43de8397ff3b08c@mail.gmail.com>
	<3f6baf360904212132k7110deeft6829c1b4a7b18f24@mail.gmail.com>
Message-ID: <320fb6e00904220208oaecd1a5p844ce8642acc51fa@mail.gmail.com>

On Wed, Apr 22, 2009 at 5:32 AM, Eric Talevich wrote:
> On Tue, Apr 21, 2009 at 2:18 PM, Peter wrote:
>
>> So, if we are going to re-do the git import (and we may have to fix
>> the tag history), ...
>
> Eek.
Now that the Summer of Code is under way, I guess this is a good time > to bring up the question of how Nick and I should be following the Biopython > trunk and publishing our own code. > > In spite of the warning that the CVS tracker in GitHub was tentative, I was > getting comfortable with the setup we had. Should I (we) hold off on pushing > anything substantial to GitHub until this tagging situation is resolved, or > is there a better way to approach this? For example, does anyone know if > it's straightforward to back up a branch's recent history with > git-format-patch and apply it directly onto a new repository with different > references? Bartek is looking into fixing the existing CVS to git mirror on github, but that may not be possible. And I do think it is worth fixing the tag history even at the cost of some upheaval in the short term. In terms of you and Nick, for now carry on using github if you are comfortable with it. The new phylogenetics stuff will I assume be mostly new python modules, or modifications to a couple of existing ones (e.g. Bio.Nexus). Merging this later shouldn't be too bad - you should be able to generate a diff against CVS (or its current mirror in git) and we can apply that to CVS (or a new git repository). Peter From biopython at maubp.freeserve.co.uk Wed Apr 22 05:23:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Apr 2009 10:23:46 +0100 Subject: [Biopython-dev] history on github - where are the tags? 
In-Reply-To: <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
	<320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>
	<320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>
	<320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com>
	<8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com>
	<8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com>
	<320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
	<320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>
	<8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
	<8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
Message-ID: <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>

On Wed, Apr 22, 2009 at 9:53 AM, Bartek Wilczynski wrote:
> Hi,
>
> Peter wrote:
>>> There is another option, redo the import using git cvsimport.  This
>>> has the downside that we lose all the network history currently in
>>> github, but it's only going to affect a couple of people and that was
>>> always a possibility.
>
> Yes, it is an option, but I would be quite reluctant to do it. I think this
> issue with tags is possible to get fixed without re-doing the import.

If you can fix the current github repository, great.

> I'm scared by the possibility that we re-import stuff, fix the tags, everybody
> switches, people complain how good it was back then with CVS, and one
> month down the road, we find that there is an issue with something else,
> that was not present in the previous import.

This is why we are testing things: We have found something wrong with the current import, and it wasn't immediately obvious (partly because we were still getting to know git and github).

> I think this is becoming a bit chaotic now.
We still haven't removed the
> first github conversion: (biopython_old branch: is anyone using it anyway?),

The old conversion's deletion is still in progress, it must have stalled:
http://support.github.com/discussions/repos/485-reposiotry-stuck-in-rename

> there is this semi-official one that has a (fixable in my opinion) issue with
> tags ...

If we can fix the tags, great. If we can also remap the authors to their git usernames, even better.

> ... and now there is a new one made by Peter.

I deleted that one - it was just a proof of principle.

> In summary:
> I have no objections to using any particular tool for importing stuff to git.
> I don't like the idea of not even trying to fix the problem we have
> but instantly changing the tool we are using.

It was really to demonstrate to my own satisfaction that we could have the tags in the history properly.

> I consider now re-importing stuff a major problem: everybody will need to port
> their changes which is work.

True - but this was always a possibility. From browsing the github network this will really affect just two people:

* Eric - quite a few changes, some of which we can probably look at merging into CVS now which would solve that.
* Giovanni - quite a few changes (on a couple of files) on one branch, and a couple of other branches for proposed unit tests

Also:

* Dave Bridges - documentation changes to one file which we can merge into CVS and then he can delete that branch
* Tiago - trivial changes to one file (stats in PopGen)
* Peter (me) - I have a few test branches, nothing I care about.

Brad, Bartek and Leighton have no changes made.
Peter

From cy at cymon.org Wed Apr 22 05:48:08 2009
From: cy at cymon.org (Cymon Cox)
Date: Wed, 22 Apr 2009 10:48:08 +0100
Subject: [Biopython-dev] Bio.Application interface
Message-ID: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>

From reading the previous discussion on the list, I gather there is a preference for removing helper functions to the Bio.Application command line interfaces, such that the user interface would be something like:

from Bio import Application
from Bio.Align.Applications import MafftCommandline
cmd = MafftCommandline()
cmd.set_parameter("input", "sample.fa")
[etc...]
i, o, e = Application.generic_run(cmd)

i.e. the user explicitly sets the command line parameters.

I've written Application.AbstractCommandline for both MUSCLE and MAFFT. However, each of these programmes uses a variation on the parameter styles not easily covered by the current _AbstractParameter classes _Option and _Argument. The _Option class deals with parameters of the type "--append=yes" and "-a yes", and the _Argument returns just the value to the command line, i.e. cmd.set_parameter("input", "sample.fa") puts just "sample.fa" on the command line.

A muscle command might be:
"muscle -in Fasta/f002 -out Fasta/temp_align_out2.fa -objscore sp -noanchors"
i.e. with a "-noanchors" command, currently the parameter would need to be an _Argument and set using:
cmd.set_parameter("noanchors", "-noanchors")

A MAFFT command might be:
"mafft --maxiterate 200 --nofft myInputData.fa"
i.e. with a "--nofft" parameter which would need to be an _Argument and set using:
cmd.set_parameter("nofft", "--nofft")
and a "--maxiterate 200" parameter which _Option doesn't cover, because with _Option "--" parameters always have an "=" before the value.
So, it looks like a _OptionNoEquals parameter class is required to cover the "--param value", and I would suggest a _ArgumentName class that returns the parameter name to the command line such that:

cmd.set_parameter("--nofft") returns "--nofft" to the command line, and
cmd.set_parameter("--nofft", value) raises an error via the checker_function

As an aside, MAFFT also has a:
"mafft --seed file1 --seed file2 inputData.fa"
i.e. multiple --seed parameters, which is not covered by the current interface.

Cheers, C.
--

From biopython at maubp.freeserve.co.uk Wed Apr 22 06:26:11 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Apr 2009 11:26:11 +0100
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
Message-ID: <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>

On Wed, Apr 22, 2009 at 10:48 AM, Cymon Cox wrote:
> From reading the previous discussion on the list, I gather there is a
> preference for removing helper functions to the Bio.Application command line
> interfaces, such that the user interface would be something like:
>
> from Bio import Application
> from Bio.Align.Applications import MafftCommandline
> cmd = MafftCommandline()
> cmd.set_parameter("input", "sample.fa")
> [etc...]
> i, o, e = Application.generic_run(cmd)
>
> ie the user explicitly sets the cl parameters.

Yes, that would fit my preference for giving the user direct access to the command line as a string, to invoke as they choose.

We might want to discuss extending the AbstractCommandline __init__ method to take **kwargs, allowing the parameters to be set like this:

from Bio import Application
from Bio.Align.Applications import MafftCommandline
cmd = MafftCommandline(input="sample.fa", ...)
return_code, std_handle, err_handle = Application.generic_run(cmd)

I'm not sure how well this would work in practice as the range of valid argument names in python may not overlap with the valid parameter names.

> I've written Application.AbstractCommandline for both MUSCLE and MAFFT.
> However, each of these programmes uses a variation on the parameter styles
> not easily covered by the current _AbstractParameter classes _Option and
> _Argument. The _Option class deals with parameters of the type
> "--append=yes" and "-a yes", ...
> A muscle command might be:
> "muscle -in Fasta/f002 -out Fasta/temp_align_out2.fa -objscore sp
> -noanchors"
> ie with a "-noanchors" command

Those kind of options which don't take a value are really common on Unix, I suspect we already have things like this in the other wrappers. I'd guess they just use the _Option class and omit the value.

> So, it looks like a _OptionNoEquals parameter class is required to cover the
> "--param value", and I would suggest a _ArgumentName class that returns the
> parameter name to the command line such that:
>
> cmd.set_parameter("--nofft") returns "--nofft" to the cl, and
> cmd.set_parameter("--nofft", value) raises an error via the
> checker_function

You are right, a subclass of _Option which checks there is no value argument could be sensible. Maybe _OptionNoValue rather than _OptionNoEquals?
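
To make the two missing parameter styles concrete, here is a small stand-alone Python 3 sketch. It is not the Bio.Application code: the classes `Option` and `OptionNoValue` and the helper `build_command` are hypothetical names invented for this illustration, and it assumes the MAFFT flags are written with a plain double dash.

```python
class Option:
    """Hypothetical stand-in for an option that takes a value."""

    def __init__(self, flag, equals=False):
        self.flag = flag        # e.g. "--maxiterate" or "-in"
        self.equals = equals    # True: "--flag=value"; False: "--flag value"
        self.value = None
        self.is_set = False

    def set(self, value):
        if value is None:
            raise ValueError("%s requires a value" % self.flag)
        self.value = value
        self.is_set = True

    def __str__(self):
        separator = "=" if self.equals else " "
        return "%s%s%s" % (self.flag, separator, self.value)


class OptionNoValue(Option):
    """Hypothetical switch such as -noanchors or --nofft: name only."""

    def set(self, value=None):
        # Mirrors the idea of a checker that rejects any value.
        if value is not None:
            raise ValueError("%s does not take a value" % self.flag)
        self.is_set = True

    def __str__(self):
        return self.flag


def build_command(program, parameters, trailing_args=()):
    """Join the program name, the parameters that were set, and file names."""
    parts = [program]
    parts.extend(str(p) for p in parameters if p.is_set)
    parts.extend(trailing_args)
    return " ".join(parts)


maxiterate = Option("--maxiterate")     # the "--param value" style
maxiterate.set(200)
nofft = OptionNoValue("--nofft")        # the name-only style
nofft.set()
command = build_command("mafft", [maxiterate, nofft], ["myInputData.fa"])
print(command)
```

Keeping the no-value switch as a subclass of the valued option means the command builder treats both the same way, which is roughly the design being suggested in the thread; repeated options such as multiple --seed entries are not handled by this sketch.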
Peter

From peter at maubp.freeserve.co.uk Wed Apr 22 07:00:19 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Apr 2009 12:00:19 +0100
Subject: [Biopython-dev] Nice small test case for fuzzy locations
Message-ID: <320fb6e00904220400m5c18ad42gbe301b739d54ce99@mail.gmail.com>

Hi all,

This is a nice small GenBank file with fuzzy locations, joins, and fuzzy joins:
ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Cryptosporidium_parvum/NC_006980.gbk

I think this will make an excellent test case, see new unittest based Tests/test_SeqIO_feature.py which we can extend to include GFF or PTT files when they are in Bio.SeqIO too.

The good news is our non-fuzzy locations appear to be doing just what GenBank does - you did a good job there Brad :)

If anyone comes across a better example file let us know (i.e. also very small, but with between positions, one of position etc as well).

Peter

From biopython at maubp.freeserve.co.uk Wed Apr 22 09:30:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Apr 2009 14:30:00 +0100
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
	<320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>
	<7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
Message-ID: <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com>

On Wed, Apr 22, 2009 at 2:23 PM, Cymon Cox wrote:
> 2009/4/22 Peter
>>
>> On Wed, Apr 22, 2009 at 10:48 AM, Cymon Cox wrote:
>>
>> > I've written Application.AbstractCommandline for both MUSCLE and MAFFT.
>> > However, each of these programmes uses a variation on the parameter
>> > styles
>> > not easily covered by the current _AbstractParameter classes _Option and
>> > _Argument. The _Option class deals with parameters of the type
>> > "--append=yes" and "-a yes", ...
>> > A muscle command might be:
>> > "muscle -in Fasta/f002 -out Fasta/temp_align_out2.fa -objscore sp
>> > -noanchors"
>> > ie with a "-noanchors" command
>>
>> Those kind of options which don't take a value are really common on
>> Unix,  I suspect we already have things like this in the other wrappers.
>> I'd guess they just use the _Option class and omit the value.
>
> Yes, I see now... they need to be _Options with a "lambda x: 0" value
> checker function - for some reason was trying to force them into _Argument
>
> This is the current _Option class:
> ...
> So _Option covers: "--param=value", "-param value", "-param", "--param"
>
> What it doesn't cover is "--param value" and "-param=value"
> ...

This might be a silly question, but do you actually need these exact option layouts for MUSCLE and MAFFT? Many Unix tools use something like libopt and will actually take slight variations, and may also offer short and long names for the same option. Perhaps the existing option code in Bio.Application will suffice?

Peter

From mjldehoon at yahoo.com Wed Apr 22 10:31:48 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 22 Apr 2009 07:31:48 -0700 (PDT)
Subject: [Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt
In-Reply-To: <320fb6e00904211306u50955608ndccef5d0cb6ba09b@mail.gmail.com>
Message-ID: <218724.9949.qm@web62406.mail.re1.yahoo.com>

--- On Tue, 4/21/09, Peter Cock wrote:
> Thinking this over, we should take that SwissProt file and
> load it into BioSQL using BioPerl, and see how they dealt
> with the DE lines, and try and do the same for Bio.SeqIO
> in order that loading it into BioSQL with Biopython gives
> more or less the same thing.

Good point. Does anybody know how BioPerl stores SwissProt files in SQL databases? I know neither Perl nor SQL ...
--Michiel From p.j.a.cock at googlemail.com Wed Apr 22 10:44:23 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 Apr 2009 15:44:23 +0100 Subject: [Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt In-Reply-To: <218724.9949.qm@web62406.mail.re1.yahoo.com> References: <320fb6e00904211306u50955608ndccef5d0cb6ba09b@mail.gmail.com> <218724.9949.qm@web62406.mail.re1.yahoo.com> Message-ID: <320fb6e00904220744s1c88c725nb1fa607ce10df723@mail.gmail.com> On Wed, Apr 22, 2009 at 3:31 PM, Michiel de Hoon wrote: > > --- On Tue, 4/21/09, Peter Cock wrote: > >> Thinking this over, we should take that SwissProt file and >> load it into BioSQL using BioPerl, and see how they dealt >> with the DE lines, and try and do the same for Bio.SeqIO >> in order that loading it into BioSQL with Biopython gives >> more or less the same thing. > > Good point. Does anybody know how BioPerl stores SwissProt files in SQL databases? I know neither Perl nor SQL ... > Not off hand, but I know enough about BioPerl to be able to load the file into a BioSQL database. I'll post back later (but probably not today). Peter From bugzilla-daemon at portal.open-bio.org Wed Apr 22 12:14:47 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Apr 2009 12:14:47 -0400 Subject: [Biopython-dev] [Bug 2819] New: Bio.SeqIO support for NCBI protein tables (*.ptt files) Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2819 Summary: Bio.SeqIO support for NCBI protein tables (*.ptt files) Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk On their FTP site the NCBI provide a range of files for each genome/plasmid/chromosome, e.g. 
ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Cryptosporidium_parvum/

The *.ptt files are simple tab separated tables listing all the proteins. They correspond to the CDS features in the GenBank file.

This enhancement bug is about adding "ptt" as an input file format in Bio.SeqIO (and potentially as an output format too), where a single ptt file gives a single SeqRecord object containing a SeqFeature object for each protein. The header line gives the sequence length, so an UnknownSeq can be used for the SeqRecord's seq property.

One example application of this would be to draw a GenomeDiagram showing the protein locations. This can be done using the SeqFeature objects from parsing a GenBank file, but using the ptt file will be much faster.

See earlier suggestions on the mailing list (part of the GFF thread):
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005725.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005745.html

Patch to follow...

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
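
The protein table layout described in Bug 2819 above can be sketched with plain Python 3, without Biopython. This is not the patch attached to the bug (its contents are not shown in this archive); `parse_ptt` and `ProteinRow` are hypothetical names. The sketch assumes the usual layout of these files: a title line ending in "1..LENGTH", a protein count line, a tab separated column header line starting with Location and Strand, then one row per protein. The sample rows are made up for the example.

```python
from collections import namedtuple

# start/end are zero-based, end-exclusive; fields keeps the raw columns.
ProteinRow = namedtuple("ProteinRow", "start end strand fields")


def parse_ptt(text):
    """Return (sequence_length, [ProteinRow, ...]) from protein table text."""
    lines = text.splitlines()
    # Assumption: the title line ends with "- 1..LENGTH".
    seq_length = int(lines[0].rsplit("..", 1)[1])
    # Assumption: line 2 is the protein count, line 3 the column names.
    columns = lines[2].split("\t")
    rows = []
    for line in lines[3:]:
        if not line.strip():
            continue
        record = dict(zip(columns, line.split("\t")))
        start, end = record["Location"].split("..")
        strand = +1 if record["Strand"] == "+" else -1
        # Convert one-based inclusive coordinates to Python-style slices.
        rows.append(ProteinRow(int(start) - 1, int(end), strand, record))
    return seq_length, rows


# Made-up sample data, not a real NCBI record.
SAMPLE = (
    "Made-up plasmid, complete sequence - 1..9609\n"
    "2 proteins\n"
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct\n"
    "87..1109\t+\t340\t0000001\t-\tEX_0001\t-\t-\thypothetical protein A\n"
    "2925..3119\t-\t64\t0000002\t-\tEX_0002\t-\t-\thypothetical protein B\n"
)
seq_length, proteins = parse_ptt(SAMPLE)
for protein in proteins:
    print(protein.start, protein.end, protein.strand, protein.fields["Product"])
```

In the proposal above each row would become a SeqFeature on a single SeqRecord whose sequence is only known by its length; the tuples here simply stand in for those objects.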
From bugzilla-daemon at portal.open-bio.org Wed Apr 22 12:15:26 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Apr 2009 12:15:26 -0400 Subject: [Biopython-dev] [Bug 2819] Bio.SeqIO support for NCBI protein tables (*.ptt files) In-Reply-To: Message-ID: <200904221615.n3MGFQZi027802@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2819 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-22 12:15 EST ------- Created an attachment (id=1282) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1282&action=view) New file Bio/SeqIO/ProteinTableIO.py -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 22 12:16:37 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Apr 2009 12:16:37 -0400 Subject: [Biopython-dev] [Bug 2819] Bio.SeqIO support for NCBI protein tables (*.ptt files) In-Reply-To: Message-ID: <200904221616.n3MGGbXh027904@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2819 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-22 12:16 EST ------- Created an attachment (id=1283) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1283&action=view) Patch to Bio/SeqIO/__init__.py to use "ptt" files for input -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Apr 22 12:19:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Apr 2009 12:19:15 -0400 Subject: [Biopython-dev] [Bug 2819] Bio.SeqIO support for NCBI protein tables (*.ptt files) In-Reply-To: Message-ID: <200904221619.n3MGJF3V028128@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2819 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-22 12:19 EST ------- Created an attachment (id=1284) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1284&action=view) Patch to Tests/test_SeqIO_features.py to check "genbank" vs "ptt" parsing Requires additional input files from the NCBI to go in Tests/GenBank, ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Cryptosporidium_parvum/NC_006980.ptt ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Yersinia_pestis_biovar_Microtus_91001/NC_005816.ptt From biopython at maubp.freeserve.co.uk Wed Apr 22 12:24:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Apr 2009 17:24:36 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> Message-ID: <320fb6e00904220924x38466ac1sc80fe344eec1b200@mail.gmail.com> On Mon, Apr 13, 2009 at 1:16 PM, Peter wrote: > I don't think the GFF parser should only return SeqRecord objects, but > I do see a use for this (via Bio.SeqIO). GFF files could be > represented as a list of SeqFeature objects, and using a SeqRecord to > hold this seems very natural to me.
It also means we could use > Bio.SeqIO to load a GFF file into SeqRecord objects for storage in a > BioSQL database. > > If you look at the NCBI FTP site, they often provide genome sequences > in a range of file formats including GenBank and GFF. > > e.g. > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/ > > The GenBank files contain the features plus the sequence, > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gbk > > Their GFF3 file only contains the features: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff > > Some GFF files will include the sequence too, in this case we can > fetch it in FASTA format: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna > > In principle, you could parse this FASTA file and the GFF3 file and > put together a GenBank file - or vice versa. > > As an aside, I would also consider adding protein table support on the > same lines, look at this file: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.ptt > The header information gives us the genome size, so Bio.SeqIO could > return a SeqRecord with lots of SeqFeature objects and for the > SeqRecord's seq property use a Bio.Seq.UnknownSeq of length 4639675bp. > This is something I might look at implementing myself after Biopython > 1.50 is out. We should be able to read in a GenBank file and output a > PTT file, and verify it matches the NCBI provided version of the PTT > file. There is a working NCBI protein table ("ptt") format parser for Bio.SeqIO on Bug 2819 including unit tests. http://bugzilla.open-bio.org/show_bug.cgi?id=2819 Hopefully this will be useful in integrating the GFF/GFF3 parser into Bio.SeqIO, as well as being worthwhile in its own right.
This "ptt" parser should work fine with BioSQL and GenomeDiagram, offering a light weight alternative to parsing the GenBank or GFF3 file when all you care about is the locations of the proteins (CDS features). Peter From cy at cymon.org Wed Apr 22 13:00:38 2009 From: cy at cymon.org (Cymon Cox) Date: Wed, 22 Apr 2009 18:00:38 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> Message-ID: <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> 2009/4/22 Peter > On Wed, Apr 22, 2009 at 2:23 PM, Cymon Cox wrote: > > 2009/4/22 Peter > >> > >> On Wed, Apr 22, 2009 at 10:48 AM, Cymon Cox wrote: > >> > >> > Ive written Application.AbstractCommandline for both MUSCLE and MAFFT. > >> > However, each of these programmes uses a variation on the parameter > >> > styles > >> > not easily covered by the current _AbstractParameter classes _Option > and > >> > _Argument. The _Option class deals with parameters of the type "- > >> > -append=yes" and "-a yes", ... > >> > A muscle command might be: > >> > "muscle -in Fasta/f002 -out Fasta/temp_align_out2.fa -objscore sp > >> > -noanchors" > >> > ie with a "-noanchors" command > >> > >> Those kind of options which don't take a value are really common on > >> Unix, I suspect we already have things like this in the other wrappers. > >> I'd guess they just use the _Option class and omit the value. > > > > Yes, I see now... they need to be _Options with a "lambda x: 0" value > > checker function - for some reason was trying to force them into > _Argument > > > > This is the current _Option class: > > ... 
> > So _Option covers: "- -param=value", "-param value", "-param", "- -param" > > > > What it doesnt cover is "- -param value" and "-param=value" > > ... > > This might be a silly question, but do you actually these exact option > layouts for MUSCLE and MAFFT? Many Unix tools use something like > libopt and will actually take slight variations, and may also offer short > and long names for the same option. Perhaps the existing option code > in Bio.Application will suffice? MAFFT uses "--param value" style options, and won't accept "--param=value" or "-param value" as alternatives. Neither use "-param=value", but if more applications it may turn up. C. > > > Peter > -- ____________________________________________________________________ Cymon J. Cox Centro de Ciencias do Mar Faculdade de Ciencias do Mar e Ambiente (FCMA) Universidade do Algarve Campus de Gambelas 8005-139 Faro Portugal Phone: +0351 289800909 ext 7909 Fax: +0351 289800051 Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com HomePage : http://biology.duke.edu/bryology/cymon.html -8.63/-6.77 From biopython at maubp.freeserve.co.uk Wed Apr 22 17:25:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Apr 2009 22:25:35 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> Message-ID: <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> On Wed, Apr 22, 2009 at 6:00 PM, Cymon Cox wrote: >> >> This might be a silly question, but do you actually these exact option >> layouts for MUSCLE and MAFFT? 
Many Unix tools use something like >> libopt and will actually take slight variations, and may also offer short >> and long names for the same option. Perhaps the existing option code >> in Bio.Application will suffice? > > MAFFT uses "--param value" style options, and won't accept "--param=value" > or "-param value" as alternatives. OK. Then yes, we should support that. Brad, as Bio.Application is your module, would you like to comment? > > Neither use "-param=value", but if more applications it may turn up. > I don't think I have ever seen a command line application that used that. Peter From chapmanb at 50mail.com Wed Apr 22 18:44:01 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 22 Apr 2009 18:44:01 -0400 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> Message-ID: <20090422224401.GC34546@sobchak.mgh.harvard.edu> Peter and Cymon; > >> This might be a silly question, but do you actually these exact option > >> layouts for MUSCLE and MAFFT? Many Unix tools use something like > >> libopt and will actually take slight variations, and may also offer short > >> and long names for the same option. Perhaps the existing option code > >> in Bio.Application will suffice? > > > > MAFFT uses "--param value" style options, and won't accept "--param=value" > > or "-param value" as alternatives. > > OK. Then yes, we should support that. Brad, as Bio.Application is your > module, would you like to comment? My comment is: I think it is awesome MAFFT made up their own way of doing the command line.
Seriously, y'all are doing the right thing. Add a new class to Bio.Application: _OptionAlt or whatever you'd like to call MAFFT's inventive new way to specify command line arguments. Adapt the __str__ from _Option to do it the "--param val" way in this class. Then use this for your MAFFT commandline. I believe I just summarized your discussion, so you can replace this whole message with +1. Brad From winda002 at student.otago.ac.nz Wed Apr 22 22:14:31 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Thu, 23 Apr 2009 14:14:31 +1200 Subject: [Biopython-dev] main page on wiki Message-ID: <49EFCF07.2050502@student.otago.ac.nz> Hi all, As you probably know the main page of the wiki (http://biopython.org/wiki/Main_Page) is the first place someone washes up when they google 'biopython'. As part of this "news coordinator" idea I have made an alternative version of the main page (http://biopython.org/wiki/User:Davidw/homepage) which acts a bit more as a "portal" for the wiki/project. This is born from my own experience with the wiki as a newcomer; it took me a long time to cotton on to the fact there was a navigation box on each page so I didn't realise what the website had to offer (this may say more about me than the design of the front page). Which version would you like to see as the main page? Obviously this isn't an either-or thing, my 'mock-up' version can be edited by anyone with an account on the wiki (the main page is protected for obvious reasons) so any ideas that you have can be incorporated to that one (older versions of the page are all saved so you can edit as bravely as you like). 
Thanks, David From sbassi at clubdelarazon.org Wed Apr 22 21:53:09 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Wed, 22 Apr 2009 22:53:09 -0300 Subject: [Biopython-dev] main page on wiki In-Reply-To: <49EFCF07.2050502@student.otago.ac.nz> References: <49EFCF07.2050502@student.otago.ac.nz> Message-ID: <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com> On Wed, Apr 22, 2009 at 11:14 PM, David Winter wrote: > Which version would you like to see as the main page? Obviously this isn't I liked the new version. I would add (if I knew how to do it) some icons near each of: Get Started Get help Contribute From argriffi at ncsu.edu Wed Apr 22 21:42:21 2009 From: argriffi at ncsu.edu (alex) Date: Wed, 22 Apr 2009 21:42:21 -0400 Subject: [Biopython-dev] main page on wiki In-Reply-To: <49EFCF07.2050502@student.otago.ac.nz> References: <49EFCF07.2050502@student.otago.ac.nz> Message-ID: <49EFC77D.5070307@ncsu.edu> David Winter wrote: > Hi all, > > As you probably know the main page of the wiki > (http://biopython.org/wiki/Main_Page) is the first place someone washes > up when they google 'biopython'. As part of this "news coordinator" idea > I have made an alternative version of the main page > (http://biopython.org/wiki/User:Davidw/homepage) which acts a bit more > as a "portal" for the wiki/project. This is born from my own experience > with the wiki as a newcomer; it took me a long time to cotton on to the > fact there was a navigation box on each page so I didn't realise what > the website had to offer (this may say more about me than the design of > the front page). > > Which version would you like to see as the main page? Obviously this > isn't an either-or thing, my 'mock-up' version can be edited by anyone > with an account on the wiki (the main page is protected for obvious > reasons) so any ideas that you have can be incorporated to that one > (older versions of the page are all saved so you can edit as bravely as > you like). 
> > Thanks, > David I like your version better than the current main page. From idoerg at gmail.com Wed Apr 22 23:49:39 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Wed, 22 Apr 2009 20:49:39 -0700 Subject: [Biopython-dev] main page on wiki In-Reply-To: <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com> References: <49EFCF07.2050502@student.otago.ac.nz> <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com> Message-ID: <49EFE553.6070405@gmail.com> I second Sebastian on the icons, and third Sebastian and Alex on preferring David's take on a main page. Sebastian Bassi wrote: > On Wed, Apr 22, 2009 at 11:14 PM, David Winter > wrote: >> Which version would you like to see as the main page? Obviously this isn't > > I liked the new version. I would add (if I knew how to do it) some > icons near each of: > Get Started Get help Contribute > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev -- Iddo Friedberg Ph.D. Atkinson Hall MC 0446 University of California San Diego 9500 Gilman Dr. La Jolla, CA 92093-0446 USA http://iddo-friedberg.net From biopython at maubp.freeserve.co.uk Thu Apr 23 05:16:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Apr 2009 10:16:44 +0100 Subject: [Biopython-dev] main page on wiki In-Reply-To: <49EFE553.6070405@gmail.com> References: <49EFCF07.2050502@student.otago.ac.nz> <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com> <49EFE553.6070405@gmail.com> Message-ID: <320fb6e00904230216l4b2cd91k513baa3f7a08172c@mail.gmail.com> On Thu, Apr 23, 2009 at 4:49 AM, Iddo Friedberg wrote: > I second Sebastian on the icons, and third Sebastian and Alex on preferring > David's take on a main page. Are you all looking at the *current* home page which already has a few of David's suggestions (in particular the news feed on the right), or the old version from memory? 
Also, what size screens do you all have? It should ideally look OK on small screens or windows (e.g. 1024 by 768 is what my laptop uses, which isn't that old). From playing with my window size, it should be OK - the proposed layout seems quite flexible :) If there are no counter comments, I'll put David's changes up later today or tomorrow. Peter From biopython at maubp.freeserve.co.uk Thu Apr 23 05:29:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Apr 2009 10:29:04 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <20090422224401.GC34546@sobchak.mgh.harvard.edu> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> <20090422224401.GC34546@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904230229o7efcfbe0ld2da94f10bd1b3b8@mail.gmail.com> On Wed, Apr 22, 2009 at 11:44 PM, Brad Chapman wrote: > Peter and Cymon; > > My comment is: I think it is awesome MAFFT made up their own way > of doing the command line. Was that sarcasm Brad? > Seriously, y'all are doing the right thing. Add a new class to > Bio.Application: _OptionAlt or whatever you'd like to call MAFFT's > inventive new way to specify command line arguments. Adapt the > __str__ from _Option to do it the "--param val" way in this class. > Then use this for your MAFFT commandline. Maybe _DoubleDashOption for the class name? I haven't looked at this closely enough to have a firm opinion - but as this will be a private class anyway, the name doesn't matter so much. > I believe I just summarized your discussion, so you can replace this > whole message with +1. :) What about this bit I wrote earlier: >> ... 
We might want to discuss extending the AbstractCommandline >> __init__ method to take **kwargs, allowing the parameters to be >> set like this: >> >> from Bio import Application >> from Bio.Align.Applications import MafftCommandline >> cmd = MafftCommandline(input="sample.fa", ...) >> return_code, std_handle, err_handle = Application.generic_run(cmd) >> >> I'm not sure how well this would work in practice as the range of >> valid argument names in python may not overlap with the valid >> parameter names. We'll have to see how well the above idea works in practice - it may not be general enough to be useful. Also, perhaps we can automatically generate properties for each argument allowing this: cmd.input = "sample.fa" rather than: cmd.set_parameter("input", "sample.fa") For the "switch" type arguments which take no value, if these are implemented with a separate option class (maybe _Switch or _OptionNoValue) then rather than: cmd.set_parameter("noanchors") we might want to do: cmd.noanchors = True and allow the switch to be removed with: cmd.noanchors = False i.e. For those arguments which take no argument (is "switch" the right term here?), evaluate the property set value as a boolean to add/remove -noanchors from the command line string. I think using properties in this way could make the command line object more intuitive, but again python puts limits on property names which might mean for some arguments you'd have to use the set_parameter version. 
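A rough sketch of how the property idea above could behave, under stated assumptions. SketchCommandline is a hypothetical stand-in written for illustration and is not the Bio.Application API. It maps Python attribute names to flags separately, which is one way around the naming limits mentioned above (for example "in" is a reserved word, so the attribute is called "input" here):

```python
class SketchCommandline(object):
    """Hypothetical command line builder; not Bio.Application."""

    def __init__(self, program, options, switches):
        # Write through __dict__ so our own __setattr__ is not triggered.
        state = self.__dict__
        state["_program"] = program
        state["_options"] = dict(options)    # attribute name -> flag with a value
        state["_switches"] = dict(switches)  # attribute name -> flag without a value
        state["_values"] = {}

    def __setattr__(self, name, value):
        if name in self._switches:
            # Switches are evaluated as booleans: True adds, False removes.
            self._values[name] = bool(value)
        elif name in self._options:
            self._values[name] = value
        else:
            raise AttributeError("Unknown parameter: %s" % name)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        values = self.__dict__.get("_values", {})
        if name in values:
            return values[name]
        raise AttributeError(name)

    def __str__(self):
        parts = [self._program]
        for name, flag in self._options.items():
            if self._values.get(name) is not None:
                parts.append("%s %s" % (flag, self._values[name]))
        for name, flag in self._switches.items():
            if self._values.get(name):
                parts.append(flag)
        return " ".join(parts)


cmd = SketchCommandline("muscle",
                        options={"input": "-in", "out": "-out"},
                        switches={"noanchors": "-noanchors"})
cmd.input = "Fasta/f002"
cmd.noanchors = True
with_switch = str(cmd)
cmd.noanchors = False
without_switch = str(cmd)
```

Setting an unknown name raises AttributeError, so a typo in a parameter name would fail at assignment time in this sketch rather than producing a bad command line.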
Peter From bugzilla-daemon at portal.open-bio.org Thu Apr 23 05:39:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 05:39:11 -0400 Subject: [Biopython-dev] [Bug 2819] Bio.SeqIO support for NCBI protein tables (*.ptt files) In-Reply-To: Message-ID: <200904230939.n3N9dBZ5000718@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2819 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-23 05:39 EST ------- Just to note that Bio/SeqIO/ProteinTableIO.py needs a minor improvement to cope with one special case - features which wrap the origin, e.g. NEQ001 in Nanoarchaeum equitans. ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gbk ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.ptt This is the first CDS in the GenBank file, location given as: complement(join(490883..490885,1..879)) It is the last entry in the Protein Table file, 490883..879 - ... All my code needs to do is spot when start > end, and then add the two appropriate sub-features (using the known genome length, 490885) and set the location operator to join (to match what the GenBank parser does). I'll do this at some point assuming there is interest in adding this parser to Bio.SeqIO.
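The origin-wrapping case in this comment comes down to a small piece of arithmetic, sketched here with plain tuples. The function name split_wrapping_location is hypothetical, and the coordinates are kept one-based and inclusive as they appear in the ptt and GenBank files; this is an illustration, not the code in ProteinTableIO.py:

```python
def split_wrapping_location(start, end, genome_length):
    """Split a one-based inclusive start..end range at the origin.

    Returns a list of (start, end) parts. A normal range comes back as a
    single part; a range with start > end is taken to wrap the origin of
    a circular genome and comes back as two parts, to be combined with a
    "join" operator.
    """
    if start <= end:
        return [(start, end)]
    return [(start, genome_length), (1, end)]


# The NEQ001 entry from the comment above: 490883..879 on a 490885 bp genome.
parts = split_wrapping_location(490883, 879, 490885)
```

For the NEQ001 example this gives the two ranges 490883..490885 and 1..879, matching the sub-locations in the GenBank file's join.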
From chapmanb at 50mail.com Thu Apr 23 08:36:35 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 23 Apr 2009 08:36:35 -0400 Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF) In-Reply-To: <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com> References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com> Message-ID: <20090423123635.GD34546@sobchak.mgh.harvard.edu> Hi all; > > Unless you are thinking of having an object representation as being too > > heavy, the non-light part of SeqFeature is all the FeatureLocation > > fuzziness. > > I've just had a quick go at what should be a 100% backwards compatible > modification to the FeatureLocation class to store ExactPosition start > or end positions as integers. The idea should be more memory > efficient, using the complex position objects only when required. I like the idea here but I would go a step further and get rid of FeatureLocation, collapsing the start and end location onto the SeqFeature itself. FeatureLocation is basically just a holder for a start and end coordinates. In this version, you would store the positions plus extensions and fuzzy type on the Feature, and then instantiate fuzzy objects on demand. I took a look at the resource usage of these objects versus a lightweight implementation. For a GFF file with 70k features, the maximum memory usage is 128M versus 111M for the lightweight version. So the improvement is rather modest, ~15%. > I forgot to mention the second major use case I'm concerned about, > which is recovering the GenBank/EMBL style location string. I have > looked at this in the past, by adding methods to the FeatureLocation > and all the Position objects, but it is complicated by the fact the > Position objects don't know if they are at the start or end (and for > the start locations we need to add one to convert from Python > counting). 
This is the main block on having Bio.SeqIO support writing > GenBank (or EMBL) files with their features included. I admittedly haven't looked at this in a while, but this was designed to be round tripped. The GenBank Record class can be written out back in GenBank format, and test_GenBank explicitly checks that the start and end records are the same. Brad From chapmanb at 50mail.com Thu Apr 23 08:53:56 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 23 Apr 2009 08:53:56 -0400 Subject: [Biopython-dev] Rolling new releases In-Reply-To: <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com> References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com> <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com> <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com> <20090421122045.GD30529@sobchak.mgh.harvard.edu> <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com> Message-ID: <20090423125356.GE34546@sobchak.mgh.harvard.edu> Hi all; > > It would also be worth thinking about what the worst parts of > > building the releases are and seeing if we can automate or eliminate > > them. A few things that I can think of: [Brainstorming a few suggestions] I feel like I derailed from the main point by making suggestions. Separate from a debate about betas and version support and documentation -- how can we make releases easier to roll? Peter, this started when you mentioned that rolling the release felt kind of painful and it would be great if others would pitch in. The idea of soliciting volunteers as release coordinators is great. In addition to that, we should think about streamlining the release process -- what are the parts we can get rid of and still have high quality releases? Peter, since you are doing them right now, what are your thoughts? 
Brad From lpritc at scri.ac.uk Thu Apr 23 09:43:43 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 23 Apr 2009 14:43:43 +0100 Subject: [Biopython-dev] main page on wiki In-Reply-To: <49EFC77D.5070307@ncsu.edu> Message-ID: On 23/04/2009 02:42, "alex" wrote: > David Winter wrote: >> Hi all, [...] >> Which version would you like to see as the main page? > I like your version better than the current main page. +1 I like the layout. Sebastian's idea for icons is also good. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). 
______________________________________________________ From p.j.a.cock at googlemail.com Thu Apr 23 09:58:58 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 23 Apr 2009 14:58:58 +0100 Subject: [Biopython-dev] Rolling new releases In-Reply-To: <20090423125356.GE34546@sobchak.mgh.harvard.edu> References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com> <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com> <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com> <20090421122045.GD30529@sobchak.mgh.harvard.edu> <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com> <20090423125356.GE34546@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904230658y310609c8l89bf27c33bd56d56@mail.gmail.com> On Thu, Apr 23, 2009 at 1:53 PM, Brad Chapman wrote: > Hi all; > >> > It would also be worth thinking about what the worst parts of >> > building the releases are and seeing if we can automate or eliminate >> > them. A few things that I can think of: > > [Brainstorming a few suggestions] > > I feel like I derailed from the main point by making suggestions. > Separate from a debate about betas and version support and > documentation -- how can we make releases easier to roll? > > Peter, this started when you mentioned that rolling the release > felt kind of painful and it would be great if others would pitch in. > The idea of soliciting volunteers as release coordinators is great. I didn't mean painful, so much as time consuming - but this was mostly coordinating final polish/bug fixes and documentation. This kind of thing requires some debate and judgement calls, and will be different for every release. I spent quite a lot of time on documentation for things which I really wanted to get into the Tutorial that shipped with the release (some of which should have happened earlier, so this was partly my own fault). 
In terms of getting the documentation updated for each release, this would be less effort if we as a group were more diligent about putting things in the tutorial and/or docstrings as we go along. It's important that nice new features are demonstrated, otherwise no-one will know they are there without reading the code itself or from following the mailing list discussions carefully. > In addition to that, we should think about streamlining the release > process -- what are the parts we can get rid of and still have high > quality releases? Peter, since you are doing them right now, what > are your thoughts? The complicated bit is getting the code and documentation in CVS ready, and that is harder to delegate. Once that is done though, the actual release process is fairly straight forward - as documented here - and could be delegated to anyone methodical with suitably setup development machine(s): http://biopython.org/wiki/Building_a_release Maybe some of the release process could be automated literally as a script - but doing each step methodically by hand and checking as you go is wise. For the release process, I'm basically proposing splitting this up into up to three jobs: (1) Coordinating final bug fixes and documentation in CVS. This has recently been handled by me or Michiel with most discussion on the dev lists, and some module specific details off list, and this works and I wouldn't change it. (2) Once CVS is ready, building the documentation, doing the release archives, doing epydoc, doing the Windows installers, tagging CVS, and uploading to the website. Part of the job would include scanning the NEWS and DEPRECATED files, plus recent documentation to make sure nothing was missed. This can be delegated. (3) Writing and publishing the release announcement on the news site and email lists (with the timing coordinated with the people doing jobs 1 and 2). I suggest having our new news coordinators take over this bit. 
So, while historically (1), (2) and (3) have been done by one person, I think this could be split up into the "Release Director", "Release Manager" and "News Coordinator" roles (perhaps with different job titles?). Peter From p.j.a.cock at googlemail.com Thu Apr 23 10:06:14 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 23 Apr 2009 15:06:14 +0100 Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF) In-Reply-To: <20090423123635.GD34546@sobchak.mgh.harvard.edu> References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com> <20090423123635.GD34546@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904230706j213d6a47iadc6722581e52588@mail.gmail.com> On Thu, Apr 23, 2009 at 1:36 PM, Brad Chapman wrote: > Hi all; > >> > Unless you are thinking of having an object representation as being too >> > heavy, the non-light part of SeqFeature is all the FeatureLocation >> > fuzziness. >> >> I've just had a quick go at what should be a 100% backwards compatible >> modification to the FeatureLocation class to store ExactPosition start >> or end positions as integers. The idea should be more memory >> efficient, using the complex position objects only when required. > > I like the idea here but I would go a step further and get rid of > FeatureLocation, collapsing the start and end location onto the > SeqFeature itself. FeatureLocation is basically just a holder for a > start and end coordinates. In this version, you would store the > positions plus extensions and fuzzy type on the Feature, and then > instantiate fuzzy objects on demand. > > I took a look at the resource usage of these objects versus > a lightweight implementation. For a GFF file with 70k features, the > maximum memory usage is 128M versus 111M for the lightweight > version. So the improvement is rather modest, ~15%. Thanks for that.
Perhaps the variant idea using a single reference for each location would save more (currently it uses two references, one for the object and one for the integer - so in general we are wasting memory on a pointer to None). Certainly merging the SeqFeature and FeatureLocation should save even more memory. We could do this with full backward compatibility by generating the FeatureLocation object on request (using a property method for the SeqFeature's location), and this can also trigger a deprecation warning. We'd have to think about what to do with the SeqFeature's __init__ method more carefully. >> I forgot to mention the second major use case I'm concerned about, >> which is recovering the GenBank/EMBL style location string. I have >> looked at this in the past, by adding methods to the FeatureLocation >> and all the Position objects, but it is complicated by the fact the >> Position objects don't know if they are at the start or end (and for >> the start locations we need to add one to convert from Python >> counting). This is the main block on having Bio.SeqIO support writing >> GenBank (or EMBL) files with their features included. > > I admittedly haven't looked at this in a while, but this was > designed to be round tripped. The GenBank Record class can be > written out back in GenBank format, and test_GenBank explicitly > checks that the start and end records are the same. Yes - The Bio.GenBank.Record class should round-trip, from memory it stores feature locations as strings. I'm interested in writing a SeqRecord out as a GenBank file (which we already do, but without the features). This would let you do things like load an EMBL or GFF3 file as a SeqRecord, and output it as a GenBank file.
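To make the two points above concrete, here is a small sketch, assuming exact (non-fuzzy) positions only. SketchFeature and SketchLocation are hypothetical classes for illustration and are not the Bio.SeqFeature implementation. The feature stores plain integers, builds a location object only on request with a deprecation warning, and writes a GenBank style location string by adding one to the start:

```python
import warnings


class SketchLocation(object):
    """Hypothetical stand-in for FeatureLocation (exact positions only)."""

    def __init__(self, start, end):
        self.start = start
        self.end = end


class SketchFeature(object):
    """Hypothetical lightweight feature holding plain integers."""

    __slots__ = ("start", "end", "strand")

    def __init__(self, start, end, strand=1):
        self.start = start    # zero-based, Python counting
        self.end = end        # exclusive end
        self.strand = strand  # +1 or -1

    @property
    def location(self):
        # Built only when asked for, so older calling code keeps working.
        warnings.warn("Use feature.start and feature.end instead",
                      DeprecationWarning, stacklevel=2)
        return SketchLocation(self.start, self.end)

    def genbank_location(self):
        # GenBank counts from one, so only the start needs adjusting.
        text = "%i..%i" % (self.start + 1, self.end)
        if self.strand == -1:
            text = "complement(%s)" % text
        return text


feature = SketchFeature(86, 1109, strand=-1)
location_text = feature.genbank_location()
```

Fuzzy positions, joins and other location operators are left out of this sketch; those are the cases that make the real problem harder.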
Peter From cy at cymon.org Thu Apr 23 10:32:10 2009 From: cy at cymon.org (Cymon Cox) Date: Thu, 23 Apr 2009 15:32:10 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <20090422224401.GC34546@sobchak.mgh.harvard.edu> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> <20090422224401.GC34546@sobchak.mgh.harvard.edu> Message-ID: <7265d4f0904230732i124670ebvf859b2e27943ba37@mail.gmail.com> 2009/4/22 Brad Chapman > Peter and Cymon; > > > >> This might be a silly question, but do you actually these exact option > > >> layouts for MUSCLE and MAFFT? Many Unix tools use something like > > >> libopt and will actually take slight variations, and may also offer > short > > >> and long names for the same option. Perhaps the existing option code > > >> in Bio.Application will suffice? > > > > > > MAFFT uses "--param value" style options, and won't accept > "--param=value" > > > or "-param value" as alternatives. > > > > OK. Then yes, we should support that. Brad, as Bio.Application is your > > module, would you like to comment? > > My comment is: I think it is awesome MAFFT made up their own way > of doing the command line. 
I think you'll be likewise inspired by the MUSCLE command line parsing: [cymon at chara mafft]$ muscle -in Tests/Fasta/f002 -anchorspacing -cluster1 upgmb Command-line option "upgmb" must start with '-' But of course, these two are perfectly acceptable: [cymon at chara mafft]$ muscle -in Tests/Fasta/f002 -anchorspacing --cluster1=upgmb [cymon at chara mafft]$ muscle -in Tests/Fasta/f002 -anchorspacing on-balance-I-think-Ill-go-home -cluster1 upgmb At present, there is no current way to force a value argument to an option so cmd.set_parameter("-anchorspacing") is acceptable in the interface. But, in general, I assume the idea is not 'save' the user from niceties of the particular programme command line, ie in command line interface I'm allowing users to set parameters which either dont work or crash the programme... > Seriously, y'all are doing the right thing. Add a new class to > Bio.Application: _OptionAlt or whatever you'd like to call MAFFT's > inventive new way to specify command line arguments. Adapt the > __str__ from _Option to do it the "--param val" way in this class. > Then use this for your MAFFT commandline. class _OptionAlt(_AbstractParameter): """Represent an option that can be set for a program. This holds UNIXish options like: --append yes --append """ def __str__(self): """Return the value of this option for the commandline. """ if self.names[0].find("--") >= 0: output = "%s" % self.names[0] if self.value is not None: output += " %s " % self.value else: output += " " else: raise ValueError("Unrecognized option type: %s" % self.names[0]) return output C. -- ____________________________________________________________________ Cymon J. 
Cox Centro de Ciencias do Mar Faculdade de Ciencias do Mar e Ambiente (FCMA) Universidade do Algarve Campus de Gambelas 8005-139 Faro Portugal Phone: +0351 289800909 ext 7909 Fax: +0351 289800051 Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com HomePage : http://biology.duke.edu/bryology/cymon.html -8.63/-6.77 From bugzilla-daemon at portal.open-bio.org Thu Apr 23 11:22:36 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 11:22:36 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904231522.n3NFMal6026332@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 cymon.cox at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1280 is|0 |1 obsolete| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 23 11:23:05 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 11:23:05 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904231523.n3NFN5va026431@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 cymon.cox at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1279 is|0 |1 obsolete| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 11:25:43 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 11:25:43 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904231525.n3NFPhPH026661@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #3 from cymon.cox at gmail.com 2009-04-23 11:25 EST ------- Created an attachment (id=1285) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1285&action=view) Bio.Align.Applications.py text -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 23 11:32:34 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 11:32:34 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904231532.n3NFWYkO027258@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #4 from cymon.cox at gmail.com 2009-04-23 11:32 EST ------- Created an attachment (id=1286) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1286&action=view) Patch for Bio.Applications __init__.py -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
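For readers following the "--param value" discussion on this bug, here is a self-contained sketch of how such an option could be rendered. The LongOption class and build_commandline helper are hypothetical names invented for this illustration; they are not the _OptionAlt class from the attached patch and do not depend on Bio.Application.

```python
class LongOption(object):
    """Hypothetical holder for a "--name value" style option."""

    def __init__(self, name, value=None):
        # Only accept names written with the double-dash prefix
        if not name.startswith("--"):
            raise ValueError("Unrecognized option type: %s" % name)
        self.name = name
        self.value = value

    def __str__(self):
        # A switch without a value is printed on its own,
        # otherwise name and value are separated by a space.
        if self.value is None:
            return self.name
        return "%s %s" % (self.name, self.value)


def build_commandline(program, options):
    """Join the program name and its options into one string."""
    return " ".join([program] + [str(opt) for opt in options])
```

Checking the prefix with startswith is a deliberate choice in this sketch: a substring search for "--" would also accept a name that merely contains two dashes somewhere in the middle.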
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 11:33:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 11:33:09 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904231533.n3NFX9kw027294@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #5 from cymon.cox at gmail.com 2009-04-23 11:33 EST ------- MUSCLE and MAFFT Bio.Application command lines Patch for Bio.Applications __init__py to add _OptionAlt class covering "--param value" style options C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 23 11:43:04 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 11:43:04 -0400 Subject: [Biopython-dev] [Bug 2754] Bio.PDB: Parse warnings should print to stderr, not stdout In-Reply-To: Message-ID: <200904231543.n3NFh4cT028184@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2754 ------- Comment #13 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-23 11:43 EST ------- In comment #3 Bruce wrote: > > I believe that we should be using the using Python warnings module for these > types of messages: > http://docs.python.org/library/warnings.html > > This permits the user to have a greater control over the output and also > allows redirecting the output as required. In the Bio directory, there are > currently 36 and 25 uses of stderr and stdout, respectively. > > In terms of the patch, my limited understanding is that local import sys will > override any global redirection of the output which in my opinion is a bad > idea. Good points, and yes, using the warnings module here (and probably elsewhere in Biopython) makes sense. 
Eric wrote in comment #9: > Yes, something must be done with test_PDB.py, because I don't think > warnings.warn can be made to play nice with that print-and-compare test > -- or any print-and-compare, since the warning messages contain extra > environment-specific information. I was able to solve this with the following trick: import warnings def send_warnings_to_stdout(message, category, filename, lineno, file=None): print message warnings.showwarning = send_warnings_to_stdout This now prints *just* the message text without the stack trace information etc. This also means it looks like any other output from the print-and-compare test, so test_PDB.py required only a trivial change. Note that I haven't taken Eric's patches/branch as is - for one thing I wanted to use the same import style as elsewhere in Biopython: i.e. import warnings warnings.warn("Message") rather than: from warnings import warn warn("Message") However, I think we can now close Bug 2754. Eric - please try the latest code from CVS (or the mirror on github). Also, could you also open separate bug(s) for the other issues, such as your new unittest based version of test_PDB.py? Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Apr 23 12:34:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Apr 2009 17:34:27 +0100 Subject: [Biopython-dev] How are people doing their git merges from the trunk?
Message-ID: <320fb6e00904230934kac2ca0eoc8ba284fbe2c382e@mail.gmail.com> Hi all, We have the CVS trunk mirrored here: http://github.com/biopython/biopython/tree/master I have a copy of this in my github account here, http://github.com/peterjc/biopython/tree/master I decided that I would (initially at least) treat my master branch as a copy of the master branch, and not commit local changes to this branch. Instead I periodically grab the latest commits from the master using the commands: #Do this once only: #git remote add official_dist git://github.com/biopython/biopython.git echo Checking out my local master branch... git checkout master echo Updating my local master branch with the official dist... git pull official_dist master echo Status: git status echo Pushing to my github master branch... git push origin master This means the github network diagram only advances by one step, even if the operation combined 10s of individual commits (which are still shown individually on my history on github). Alternatively, I could have used github's cherry pick interface (the fork queue), or used git cherry pick at the command line. I can see this is useful if you only want to pick out a few patches. Is there any reason to use this when you want all the commits from another branch? Bartek's latest activity on the github network is a series of points - I think this means he did a "cherry pick", and selected most (maybe even all) of the changes from the main trunk. Am I interpreting this right? 
Thanks Peter From biopython at maubp.freeserve.co.uk Thu Apr 23 17:21:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Apr 2009 22:21:41 +0100 Subject: [Biopython-dev] Fwd: Where to put command line wrappers In-Reply-To: <20090417140241.GD16092@sobchak.mgh.harvard.edu> References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> <8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com> <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com> <320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com> <20090417140241.GD16092@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904231421k3e18d0b1y9003614e906fcb1c@mail.gmail.com> On Fri, Apr 17, 2009 at 3:02 PM, Brad Chapman wrote: > Hi all; > > [Where to put the commandline objects] >> > I think that there is a difference between EMBOSS and >> > Bio.[Motif|Align]. In EMBOSS we have a very nicely comoditized >> > set of tools with similar interfaces, while both for multiple >> > alignment and motif searching the tools vary a lot. In case of >> > multiple alignments this is only with respect to parameters and >> > output format, while in motif searching there is also a lot of >> > differences in the types of input (background models etc.). >> >> That is a good argument for using Bio/Align/Applications/XXX.py and >> Bio/Motif/Applications/XXX.py while also having >> Bio/EMBOSS/Applications.py > > There is a natural tension between overgeneralizing and dumping > too much into one file. At one end you have deeply nested Java-like > directories with a few lines of code in each file. I tend towards the > "more in a single file and less nesting" camp. My vote would be that > if the Motif Applications file will only contain commandline > wrappers, they could live in one file. 
OK, what I propose is that the command line objects are exposed as Bio.Align.Applications.MuscleCommandline, Bio.Align.Applications.ClustalwCommandline, etc but that the implementations live in Bio/Align/Applications/_Muscle.py, _Clustalw.py etc. To do this the Bio/Align/Applications/__init__.py file will look like this: from _Muscle import MuscleCommandline from _Clustalw import ClustalwCommandline This avoids having a single massive file, yet keeps the public namespace simple. For the user, they do this: from Bio.Align.Applications import MuscleCommandline cline = MuscleCommandline(...) or if they prefer, from Bio.Align import Applications cline = Applications.MuscleCommandline(...) From the user's point of view all the alignment command line wrapper objects live together under Bio.Align.Applications. This will be consistent with the public API for the EMBOSS wrappers where you can do: from Bio.Emboss.Applications import Primer3Commandline cline = Primer3Commandline(...) or variants like that. For Bio.Motif.Applications we can do the same as for Bio.Align.Applications, or if there are only one or two wrappers initially put the classes directly in Bio/Motif/Applications/__init__.py and then split them into private files later on if the file gets too big. Peter From bugzilla-daemon at portal.open-bio.org Thu Apr 23 17:52:08 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 17:52:08 -0400 Subject: [Biopython-dev] [Bug 2820] New: Convert test_PDB.py to unittest Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2820 Summary: Convert test_PDB.py to unittest Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: enhancement Priority: P3 Component: Unit Tests AssignedTo: biopython-dev at biopython.org ReportedBy: eric.talevich at gmail.com The current test script for Bio.PDB uses the print-and-compare approach.
I've written an equivalent test script using unittest, assuming that style is the preferred one. It was written to go with Bug 2754, but now lives on my pdbtidy branch: http://github.com/etal/biopython/tree/pdbtidy This script could also live alongside the original test_PDB.py for awhile, as an additional check on Bio.PDB's error handling. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 23 18:01:16 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 18:01:16 -0400 Subject: [Biopython-dev] [Bug 2754] Bio.PDB: Parse warnings should print to stderr, not stdout In-Reply-To: Message-ID: <200904232201.n3NM1GW1025781@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2754 ------- Comment #14 from eric.talevich at gmail.com 2009-04-23 18:01 EST ------- (In reply to comment #13) > I think we can now close Bug 2754. Eric - please try the latest code > from CVS (or the mirror on github). Works for me. I'll delete the bug2754 branch from github. > Also, could you also open separate bug(s) for the other issues, such as your > new unittest based version of test_PDB.py? I opened Bug 2820 for the unittest version of test_PDB.py. The script itself is living on my pdbtidy branch at Tests/test_PDB_unit.py now, although one of the tests broke during the merge (there were a lot of conflicts). I'll open bugs for the other changes once I figure out which modifications are worth sharing. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 18:26:38 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 18:26:38 -0400 Subject: [Biopython-dev] [Bug 2754] Bio.PDB: Parse warnings should print to stderr, not stdout In-Reply-To: Message-ID: <200904232226.n3NMQcPf027372@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2754 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-23 18:26 EST ------- (In reply to comment #14) > (In reply to comment #13) > > I think we can now close Bug 2754. Eric - please try the latest code > > from CVS (or the mirror on github). > > Works for me. > Great - marking this bug as fixed :) > > I'll delete the bug2754 branch from github. > OK - it has served its purpose now :) > > Also, could you also open separate bug(s) for the other issues, > > such as your new unittest based version of test_PDB.py? > > I opened Bug 2820 for the unittest version of test_PDB.py. The script itself > is living on my pdbtidy branch at Tests/test_PDB_unit.py now, although one > of the tests broke during the merge (there were a lot of conflicts). Thanks. > I'll open bugs for the other changes once I figure out which modifications > are worth sharing. Thank you :) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Thu Apr 23 18:54:02 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 23 Apr 2009 18:54:02 -0400 Subject: [Biopython-dev] How are people doing their git merges from the trunk? 
In-Reply-To: <320fb6e00904230934kac2ca0eoc8ba284fbe2c382e@mail.gmail.com> References: <320fb6e00904230934kac2ca0eoc8ba284fbe2c382e@mail.gmail.com> Message-ID: <3f6baf360904231554y7e28084ex79bb0a7f60b4cef7@mail.gmail.com> On Thu, Apr 23, 2009 at 12:34 PM, Peter wrote: > > I decided that I would (initially at least) treat my master branch as > a copy of the master branch, and not commit local changes to this > branch. Instead I periodically grab the latest commits from the > master using the commands: I think this is the recommended way to do it. I read a thread where Mercurial gurus recommended keeping a clean clone of the upstream repository, and never committing to that clone. Git seems to have a cleaner version of this with in-place branches. After a few bad incidents with git-rebase, I resolved to keep 'master' in sync with the biopython trunk, and use new named branches for all modifications. The workflow is: git checkout master git pull origin # if I've pushed commits from a different computer recently git pull upstream master # upstream is the remote biopython/biopython git push origin master git checkout phyloxml # a local branch git merge master # hack, commit, repeat # rebasing commits made in this session on this branch is still safe git push origin phyloxml This means the github network diagram only advances by one step, even > if the operation combined 10s of individual commits (which are still > shown individually on my history on github). > > I think mine shows up as multiple dots, and I don't use cherry-pick. Pulling from upstream on the master branch always results in a fast-forward, though. 
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 19:36:28 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 19:36:28 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200904232336.n3NNaSw6031547@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-23 19:36 EST ------- (In reply to comment #0) > The current test script for Bio.PDB uses the print-and-compare approach. I've > written an equivalent test script using unittest, assuming that style is the > preferred one. Yes, in principle the unittest style is preferred. In practice I am pragmatic about this - a print-and-compare test is better than nothing, and for some things is much easier to write. > It was written to go with Bug 2754, but now lives on my pdbtidy branch: > http://github.com/etal/biopython/tree/pdbtidy > > This script could also live alongside the original test_PDB.py for awhile, as > an additional check on Bio.PDB's error handling. I've checked in a slightly modified version as test_PDB_unit.py - I think having both this and the original test_PDB.py is sensible in the short term. You wrote on Bug 2754 comment 14 that "one of the tests broke during the merge", was that this one: def test_warnings(self): """Parse a flawed PDB file in permissive mode, with warnings""" # Python 2.6+: rewrite this using warnings.catch_warnings parser = PDBParser(PERMISSIVE=1) msg_redef_n = r"Atom N defined twice in residue at line 19\." msg_blank_alt = r"Blank altlocs in duplicate residue SER \(' ', 4, ' '\) at line 41\." msg_redef_o = r"Atom O defined twice in residue at line 820\." warnings.simplefilter('ignore') # NB: Order is important here!
warnings.filterwarnings('error', msg_redef_n, PDBConstructionWarning) self.assertRaises(PDBConstructionWarning, parser.get_structure, "example", "PDB/a_structure.pdb") warnings.filters.pop(0) warnings.filterwarnings('error', msg_blank_alt, PDBConstructionWarning) self.assertRaises(PDBConstructionWarning, parser.get_structure, "example", "PDB/a_structure.pdb") warnings.filters.pop(0) warnings.filterwarnings('error', msg_redef_o, PDBConstructionWarning) self.assertRaises(PDBConstructionWarning, parser.get_structure, "example", "PDB/a_structure.pdb") warnings.filters.pop(0) warnings.filters.pop(0) I tried but couldn't get this to work (on Python 2.4.3 on Linux), even with plenty of warnings.resetwarnings() which seemed cleaner than popping things. I agree with the idea that we should make sure particular errors do get raised (this is checked by the print-and-compare test_PDB.py because we capture these warnings to stdout), but right now how to make it work escapes me. Maybe after a good night's sleep things will make sense ;) Leaving this bug open to address this point. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 23 23:12:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 23:12:15 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200904240312.n3O3CFdn011360@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #2 from eric.talevich at gmail.com 2009-04-23 23:12 EST ------- (In reply to comment #1) > You wrote on Bug 2754 comment 14 that "one of the tests broke during the > merge", was that this one: > > def test_warnings(self): > [...] 
> > I tried but couldn't get this to work (on Python 2.4.3 on Linux), even with > plenty of warnings.resetwarnings() which seemed cleaner than popping things. > Yep, that's the one. The behavior of the warnings module and resetwarnings() is pathological, I think. If a warning is triggered before the warnings.simplefilter('always') function is called, that specific warning will be silent until the interpreter is restarted. That's why order is sensitive in that function, and why the three exceptions aren't three separate functions. The attribute warnings.filters is a list of filters that warnings are checked against as they're raised, and at startup the list is not empty. Calling warnings.resetwarnings() just empties this list, including the default filters and any use of 'ignore' or 'always'. Maybe the popping was just voodoo and an empty filter list is fine... dunno. Python 2.6 includes a context manager that makes all these problems *completely* go away, by catching all of the warnings raised within a context and optionally storing them as a list of warning objects that can be inspected. Would you be interested in having a unit test that does a more thorough check of the warnings system, but only runs on Py2.6? I'm guessing no, but hey, worth a shot. Most likely, some warnings just aren't being caught because my version of the unit test assumed a different variety of warnings coming out of PDB. If that's the case then it should be an easy fix and you can disregard my whining. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
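As a rough illustration of the Python 2.6+ context manager mentioned in the comment above, the sketch below records warnings in a list instead of juggling warnings.filters by hand. ConstructionWarning and parse_flawed_input are hypothetical stand-ins for PDBConstructionWarning and the real parser call, so that the example runs without Biopython or a PDB file; the message texts are borrowed from the test quoted earlier in this thread.

```python
import warnings


class ConstructionWarning(UserWarning):
    """Stand-in for a parser-specific warning category."""


def parse_flawed_input():
    """Pretend parser that emits two warnings and returns a result."""
    warnings.warn("Atom N defined twice in residue at line 19.",
                  ConstructionWarning)
    warnings.warn("Atom O defined twice in residue at line 820.",
                  ConstructionWarning)
    return "structure"


def collect_warnings(func):
    """Call func and return (result, list of warning message strings)."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure nothing is suppressed by earlier filter state
        warnings.simplefilter("always")
        result = func()
    # On leaving the block the previous filter settings are restored
    return result, [str(w.message) for w in caught]
```

A unittest method could then assert on the returned list directly, which avoids both the ordering sensitivity and the need to pop filters afterwards.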
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 23:56:49 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 23:56:49 -0400 Subject: [Biopython-dev] [Bug 2821] New: NCBIXML.parse only returns results for non-empty hits rather than one per query sequence Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2821 Summary: NCBIXML.parse only returns results for non-empty hits rather than one per query sequence Product: Biopython Version: 1.50b Platform: Other OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: camilla at ip.id.au I used NCBIStandalone.blastall to BLAST all records in query database VEKY.faa (a FASTA-format file of 226 proteins) significantly similar in proteins in target database VPOO.faa (a FASTA-format file of 80 proteins). Many of the 'VEKY' proteins do not have a significant hit in the 'VPOO' database (which is what I expect and this is fine). To access the results, I iterate using a loop like the following to parse the raw BLAST results in XML format: blast_out = _open_file(outraw_file, 'r') blast_records = NCBIXML.parse(blast_out) for b_record in blast_records: # deal with each record here However, instead of getting 226 records as I expect, some of which have a description of alignments field of length zero, this returns 64 records - the records that did not have 'no hits'. My problem is that I'd like to work out which VEKY query sequence each 'b_record' corresponds to. But so far I have not been able to find any such information in the b_record. And because it doesn't produce one per query sequence, I cannot infer that information from the order of the query sequences in my input VEKY.faa file. Do you know how I can get around this problem? 
Warm thanks in advance for any help or tips, Camilla -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 24 04:05:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 24 Apr 2009 04:05:29 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200904240805.n3O85TqY030236@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #3 from dalloliogm at gmail.com 2009-04-24 04:05 EST ------- (In reply to comment #0) > The current test script for Bio.PDB uses the print-and-compare approach. I've > written an equivalent test script using unittest, assuming that style is the > preferred one. > > It was written to go with Bug 2754, but now lives on my pdbtidy branch: > http://github.com/etal/biopython/tree/pdbtidy > > This script could also live alongside the original test_PDB.py for awhile, as > an additional check on Bio.PDB's error handling. > I also tried to write an unittest-based test for PDB exposure, just for playing with it a bit: - http://github.com/dalloliogm/biopython/blob/7dabfff5f7b523479bf8d6de120d0f6c7d03f7df/Tests/test_PDBexposure.py I used the approach where one unit test is equivalent to a PDB file, instead of a set of functions. For example: - test case 1: PDB.NeighborSearch is able to read a random generated PDB file - test case 2: PDB.NeighborSearch is able to read a pdb file with only one structure - test case 3: PDB.NeighborSearch is able to read another specific pdb case -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
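The "one test case per input file" layout described in comment #3 could be sketched roughly as follows. This is not the code from the linked branch: count_atoms is a toy stand-in for a real parser, and the class names and sample lines are invented for the illustration.

```python
import unittest


def count_atoms(lines):
    """Toy stand-in for a parser: count lines starting with ATOM."""
    return sum(1 for line in lines if line.startswith("ATOM"))


class AtomCountChecks(object):
    """Shared checks; subclasses supply ``lines`` and ``expected``."""

    lines = []
    expected = 0

    def test_atom_count(self):
        self.assertEqual(count_atoms(self.lines), self.expected)


class SingleAtomCase(AtomCountChecks, unittest.TestCase):
    """One input, one test case class."""

    lines = ["HEADER  toy", "ATOM      1  N   SER A   1"]
    expected = 1


class EmptyCase(AtomCountChecks, unittest.TestCase):
    """A second input reusing the same checks."""

    lines = ["HEADER  empty"]
    expected = 0


def run_all():
    """Run both cases and return the unittest result object."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for case in (SingleAtomCase, EmptyCase):
        suite.addTests(loader.loadTestsFromTestCase(case))
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the shared checks in a plain class that does not itself inherit from unittest.TestCase means the test loader never tries to run them without data attached.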
From bugzilla-daemon at portal.open-bio.org Fri Apr 24 04:06:43 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 24 Apr 2009 04:06:43 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200904240806.n3O86h0q030360@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #4 from dalloliogm at gmail.com 2009-04-24 04:06 EST ------- (In reply to comment #3) This has the advantage that you can write a base test class and then apply the same tests to various files, by subclassing. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 24 05:02:35 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 24 Apr 2009 05:02:35 -0400 Subject: [Biopython-dev] [Bug 2821] NCBIXML.parse only returns results for non-empty hits rather than one per query sequence In-Reply-To: Message-ID: <200904240902.n3O92Z68004987@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2821 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-24 05:02 EST ------- What version of BLAST do you have, and (assuming it's less than say 10 MB) could you attach the XML file to this bug? From memory this is a limitation of the raw XML file from the NCBI - there is no way to tell if there were additional queries with no hits (so Biopython can't help directly). I have not checked BLAST 2.2.20, but had been meaning to ask the NCBI about this. They may not regard it as a "bug", but it was annoying. I have used two workarounds in my own code. (1) Load a list of the query IDs into memory, and as you go through the BLAST results you can see which queries don't appear - and therefore had no hits.
(2) Use the .next() methods on a FASTA iterator on the query file, and the NCBIXML iterator on the BLAST XML file to step through the two files in sync. I have some code to do this somewhere... maybe I should turn this into a cookbook recipe for the wiki: http://biopython.org/wiki/Category:Cookbook Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 24 05:07:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 24 Apr 2009 05:07:48 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200904240907.n3O97mDU005535@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-24 05:07 EST ------- (In reply to comment #3) > > I also tried to write an unittest-based test for PDB exposure, just for > playing with it a bit: > ... > I used the approach where one unit test is equivalent to a PDB file, > instead of a set of functions. Hi Giovanni, Isn't Bug 2759 for the PDB exposure test? I was thinking of just adding that to the new file test_PDB_unit.py, rather than making it into its own file. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
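For reference, workaround (2) from the Bug 2821 comment above (stepping through the query file and the BLAST results in sync) could be sketched roughly as follows. This is not Peter's code: the function name and FakeRecord are hypothetical, and the sketch assumes the BLAST records arrive in the same order as the queries and that each record's query_id equals the corresponding FASTA identifier. In practice the query IDs would come from something like Bio.SeqIO.parse on the query file and the records from NCBIXML.parse.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed BLAST record; only query_id is used.
FakeRecord = namedtuple("FakeRecord", "query_id")


def pair_queries_with_records(query_ids, records):
    """Yield (query_id, record) pairs, with record None for missing queries.

    Assumes 'records' only ever skips queries and never reorders them.
    """
    records = iter(records)
    pending = next(records, None)  # next BLAST record not yet matched
    for query_id in query_ids:
        if pending is not None and pending.query_id == query_id:
            yield query_id, pending
            pending = next(records, None)
        else:
            # No record for this query, so it had no hits.
            yield query_id, None


# Usage: query "b" is absent from the results, so it is paired with None.
demo = list(pair_queries_with_records(
    ["a", "b", "c"], [FakeRecord("a"), FakeRecord("c")]))
```

Only one record is held in memory at a time, which is the main attraction over workaround (1) for large result files.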
From bugzilla-daemon at portal.open-bio.org Fri Apr 24 05:15:05 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 24 Apr 2009 05:15:05 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200904240915.n3O9F5hr006324@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #6 from dalloliogm at gmail.com 2009-04-24 05:15 EST ------- (In reply to comment #5) > (In reply to comment #3) > > > > I also tried to write an unittest-based test for PDB exposure, just for > > playing with it a bit: > > ... > > I used the approach where one unit test is equivalent to a PDB file, > > instead of a set of functions. > > Hi Giovanni, > > Isn't Bug 2759 for the PDB exposure test? I was thinking of just adding that > to the new file test_PDB_unit.py, rather than making it into its own file. > > Peter Ok, of course :) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 24 05:50:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 24 Apr 2009 05:50:33 -0400 Subject: [Biopython-dev] [Bug 2759] Unit test for Bio.PDB.HSExposure In-Reply-To: Message-ID: <200904240950.n3O9oXlV008332@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2759 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1234 is|0 |1 obsolete| | ------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-24 05:50 EST ------- (From update of attachment 1234) I have checked this initial exposure test in as part of new file test_PDB_unit.py (created for Bug 2820). 
Leaving this bug open to look at Martin and/or Giovanni's improvements/extensions. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 24 05:59:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 24 Apr 2009 05:59:09 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200904240959.n3O9x9M8008849@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-24 05:59 EST ------- (In reply to comment #2) > > Yep, that's the one. > > The behavior of the warnings module and resetwarnings() is pathological, I > think. If a warning is triggered before the warnings.simplefilter('always') > function is called, that specific warning will be silent until the interpreter > is restarted. That's why order is sensitive in that function, and ... > Calling warnings.resetwarnings() just empties this list, including the > default filters and any use of 'ignore' or 'always'. The reduced warning test in CVS was working until I added more unit tests (for Bug 2759). This changed the test order, and the warnings were no longer being triggered. I tried a few things like setting warnings.defaultaction="always" at the top of the file, and adding and warnings. onceregistry={} to the test method, but I have given up. We need to be able to *completely* reset the warnings module for this approach to work. > Python 2.6 includes a context manager that makes all these problems > *completely* go away, by catching all of the warnings raised within a > context and optionally storing them as a list of warning objects that > can be inspected. 
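The context manager mentioned in the quoted text is warnings.catch_warnings, added in Python 2.6. A minimal sketch of how a test might use it, where noisy_parser is a hypothetical stand-in for the Bio.PDB call under test and not a real Biopython function:

```python
import warnings


def collect_warnings(func, *args, **kwargs):
    """Call func and return (result, list of warnings it raised)."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure nothing is filtered out inside this block; the
        # previous filters are restored when the block exits.
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    return result, caught


def noisy_parser():
    """Hypothetical function that warns, standing in for a PDB parse."""
    warnings.warn("chain A is discontinuous", UserWarning)
    return "structure"
```

A test can then assert on len(caught) and on caught[0].category or caught[0].message instead of comparing printed output.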
That sounds much better :) > Would you be interested in having a unit test that does a more thorough > check of the warnings system, but only runs on Py2.6? I'm guessing no, > but hey, worth a shot. Yes - other than using the old print-and-compare test, this seems worth doing in order to actually test the warnings we expect are being issued. It could be a whole new file, test_PDB_warnings.py which required Python 2.6+, but as its just one or two tests, maybe just use conditional method(s) within the test_PDB_unit.py file. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Fri Apr 24 06:57:03 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 24 Apr 2009 11:57:03 +0100 Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF) In-Reply-To: <20090423123635.GD34546@sobchak.mgh.harvard.edu> References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com> <20090423123635.GD34546@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904240357j29663811q80d4f7c2e7cf6382@mail.gmail.com> On Thu, Apr 23, 2009 at 1:36 PM, Brad Chapman wrote: > I took a look at the resource usage of these objects versus > a lightweight implementation. For a GFF file with 70k features, the > maximum memory usage is 128M versus 111M for the lightweight > version. So the improvement is rather modest, ~15%. How did you measure these memory figures? And was your 15% comparison between the current "heavy" SeqFeature + FeatureLocation system as in CVS, and my lightweight alternative described earlier? 
Peter From cy at cymon.org Fri Apr 24 07:43:33 2009 From: cy at cymon.org (Cymon Cox) Date: Fri, 24 Apr 2009 12:43:33 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> Message-ID: <7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com> 2009/4/22 Peter > On Wed, Apr 22, 2009 at 6:00 PM, Cymon Cox wrote: > >> > >> This might be a silly question, but do you actually these exact option > >> layouts for MUSCLE and MAFFT? Many Unix tools use something like > >> libopt and will actually take slight variations, and may also offer > short > >> and long names for the same option. Perhaps the existing option code > >> in Bio.Application will suffice? > > > > MAFFT uses "--param value" style options, and won't accept > "--param=value" > > or "-param value" as alternatives. > > OK. Then yes, we should support that. Brad, as Bio.Application is your > module, would you like to comment? > > > > > Neither use "-param=value", but if more applications it may turn up. > > > > I don't think I have ever see a command line application that used that. PRANK - Probabilistic Alignment Kit http://www.ebi.ac.uk/goldman-srv/prank/prank/ Advanced usage: 'prank [optional parameters] -d=sequence_file [optional parameters]' Doesn't accept "-d sequence_file" or "- -d=sequence_file" C. 
-- From biopython at maubp.freeserve.co.uk Fri Apr 24 07:51:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Apr 2009 12:51:58 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> <7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com> Message-ID: <320fb6e00904240451ma79b037y36315270a03724b@mail.gmail.com> On Fri, Apr 24, 2009 at 12:43 PM, Cymon Cox wrote: > 2009/4/22 Peter > >> On Wed, Apr 22, 2009 at 6:00 PM, Cymon Cox wrote: >> >> >> >> This might be a silly question, but do you actually these exact option >> >> layouts for MUSCLE and MAFFT? Many Unix tools use something like >> >> libopt and will actually take slight variations, and may also offer >> >> short and long names for the same option. Perhaps the existing >> >> option code in Bio.Application will suffice? >> > >> > MAFFT uses "--param value" style options, and won't accept >> "--param=value" >> > or "-param value" as alternatives. >> >> OK. Then yes, we should support that. Brad, as Bio.Application is your >> module, would you like to comment? >> >> > >> > Neither use "-param=value", but if more applications it may turn up. >> > >> >> I don't think I have ever see a command line application that used that.
> > > PRANK - Probabilistic Alignment Kit > http://www.ebi.ac.uk/goldman-srv/prank/prank/ > > Advanced usage: 'prank [optional parameters] -d=sequence_file [optional > parameters]' > > Doesn't accept "-d sequence_file" or "- -d=sequence_file" I had misunderstood the quotes to be literally typed on the command line ;) Peter From biopython at maubp.freeserve.co.uk Fri Apr 24 08:39:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Apr 2009 13:39:51 +0100 Subject: [Biopython-dev] How are people doing their git merges from the trunk? In-Reply-To: <3f6baf360904231554y7e28084ex79bb0a7f60b4cef7@mail.gmail.com> References: <320fb6e00904230934kac2ca0eoc8ba284fbe2c382e@mail.gmail.com> <3f6baf360904231554y7e28084ex79bb0a7f60b4cef7@mail.gmail.com> Message-ID: <320fb6e00904240539n616a5e77s7ef4377c2cd4c336@mail.gmail.com> On Thu, Apr 23, 2009 at 11:54 PM, Eric Talevich wrote: > On Thu, Apr 23, 2009 at 12:34 PM, Peter wrote: > >> I decided that I would (initially at least) treat my master branch as >> a copy of the master branch, and not commit local changes to this >> branch. Instead I periodically grab the latest commits from the >> master using the commands: > > I think this is the recommended way to do it. I read a thread where > Mercurial gurus recommended keeping a clean clone of the upstream > repository, and never committing to that clone. Git seems to have a cleaner > version of this with in-place branches. > > After a few bad incidents with git-rebase, I resolved to keep 'master' in > sync with the biopython trunk, and use new named branches for all > modifications. The workflow is: > > git checkout master > git pull origin # if I've pushed commits from a different computer > recently > git pull upstream master
# upstream is the remote biopython/biopython > git push origin master Using "upstream" seems like a very sensible name, I assume you set up: git remote add upstream git://github.com/biopython/biopython.git Peter From chapmanb at 50mail.com Fri Apr 24 08:45:15 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 24 Apr 2009 08:45:15 -0400 Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF) In-Reply-To: <320fb6e00904240357j29663811q80d4f7c2e7cf6382@mail.gmail.com> References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com> <20090423123635.GD34546@sobchak.mgh.harvard.edu> <320fb6e00904240357j29663811q80d4f7c2e7cf6382@mail.gmail.com> Message-ID: <20090424124515.GJ34546@sobchak.mgh.harvard.edu> Hi Peter; > > I took a look at the resource usage of these objects versus > > a lightweight implementation. For a GFF file with 70k features, the > > maximum memory usage is 128M versus 111M for the lightweight > > version. So the improvement is rather modest, ~15%. > > How did you measure these memory figures? With the unix 'time' command; those are the values reported by %M, which is the maximum memory used during the process. > And was your 15% comparison between the current "heavy" SeqFeature + > FeatureLocation system as in CVS, and my lightweight alternative > described earlier? This was with an even lighter version. I just added start/end as attributes to the SeqFeatures. So there was no FeatureLocation or individual position objects. This was a hack to look at the best case scenario to save memory. The baseline was the default SeqFeatures before we started thinking about changing them. > How does this version look? It should save more memory that the > version I sent you three days ago, and again aims for 100% backwards > compatibility - all the unit tests pass. That is nice. 
Do we still want to keep a FeatureLocation, or condense this all onto the SeqFeature itself? Brad From chapmanb at 50mail.com Fri Apr 24 08:47:06 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 24 Apr 2009 08:47:06 -0400 Subject: [Biopython-dev] How are people doing their git merges from the trunk? In-Reply-To: <320fb6e00904240539n616a5e77s7ef4377c2cd4c336@mail.gmail.com> References: <320fb6e00904230934kac2ca0eoc8ba284fbe2c382e@mail.gmail.com> <3f6baf360904231554y7e28084ex79bb0a7f60b4cef7@mail.gmail.com> <320fb6e00904240539n616a5e77s7ef4377c2cd4c336@mail.gmail.com> Message-ID: <20090424124706.GK34546@sobchak.mgh.harvard.edu> Eric and Peter; This is really good stuff. Can we add the details to the wiki? It looks like this section could use the information from this thread: http://biopython.org/wiki/GitUsage#Merging_upstream_changes Brad > On Thu, Apr 23, 2009 at 11:54 PM, Eric Talevich wrote: > > On Thu, Apr 23, 2009 at 12:34 PM, Peter wrote: > > > >> I decided that I would (initially at least) treat my master branch as > >> a copy of the master branch, and not commit local changes to this > >> branch. Instead I periodically grab the latest commits from the > >> master using the commands: > > > > I think this is the recommended way to do it. I read a thread where > > Mercurial gurus recommended keeping a clean clone of the upstream > > repository, and never committing to that clone. Git seems to have a cleaner > > version of this with in-place branches. > > > > After a few bad incidents with git-rebase, I resolved to keep 'master' in > > sync with the biopython trunk, and use new named branches for all > > modifications. The workflow is: > > > > git checkout master > > git pull origin # if I've pushed commits from a different computer > > recently > > git pull upstream master
# upstream is the remote biopython/biopython > > git push origin master > > Using "upstream" seems like a very sensible name, I assume you set up: > git remote add upstream git://github.com/biopython/biopython.git > > Peter > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Fri Apr 24 10:14:10 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 24 Apr 2009 15:14:10 +0100 Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF) In-Reply-To: <20090424124515.GJ34546@sobchak.mgh.harvard.edu> References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com> <20090423123635.GD34546@sobchak.mgh.harvard.edu> <320fb6e00904240357j29663811q80d4f7c2e7cf6382@mail.gmail.com> <20090424124515.GJ34546@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904240714s3a0df8cfk75330fd4025c13a3@mail.gmail.com> On Fri, Apr 24, 2009 at 1:45 PM, Brad Chapman wrote: > With the unix 'time' command; those are the values reported by %M, > which is the maximum memory used during the process. > You said 70k features, but how big was the file on disk? >> >> And was your 15% comparison between the current "heavy" SeqFeature + >> FeatureLocation system as in CVS, and my lightweight alternative >> described earlier? >> > > This was with an even lighter version. I just added start/end as > attributes to the SeqFeatures. So there was no FeatureLocation or > individual position objects. This was a hack to look at the best case > scenario to save memory. The baseline was the default SeqFeatures > before we started thinking about changing them. Right - so even if the FeatureLocation is a bit "heavy", getting rid of it wouldn't make that much difference based on your simple profiling. >> How does this version look? 
It should save more memory that the >> version I sent you three days ago, and again aims for 100% backwards >> compatibility - all the unit tests pass. > > That is nice. Do we still want to keep a FeatureLocation, or > condense this all onto the SeqFeature itself? For the moment I was exploring ways to avoid wasting memory in the FeatureLocation object while retaining 100% compatibility. If your simple profiling numbers are telling the whole story, then there isn't a great deal of point in adding any internal complexity for a small memory saving. If we do want to preserve the current SeqFeature and FeatureLocation API, then the proposal on Bug 2818 is a worthwhile incremental improvement. However, we can probably come up with something even nicer if we change the SeqFeature and FeatureLocation in a non-backwards compatible way. If we did change the API, I would want to stop using the sub_features list to hold join information as child SeqFeatures. I was thinking the FeatureLocation object should hold this, but merging the SeqFeature and FeatureLocation could make sense. Are there any other non-join location operators we really have to deal with? Internally the FeatureLocation (or SeqFeature) could have a list of child locations held as a private list holding two-entry tuples (start and end positions). Typically for a non-join feature this will be just _loc_list=[(start,end)], while more generally it would be _loc_list=[(start1,end1),...,(startN,endN)]. The FeatureLocation (or SeqFeature) would have (fuzzy/non-fuzzy) start and end properties which would access _loc_list[0][0] for the start, and _loc_list[-1][1] for the end. I would still use the existing position objects to store fuzzy positions.
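A rough sketch of the private location list described above, under stated assumptions: the class name is hypothetical, positions are plain integers here (the proposal would keep the existing position objects for fuzzy positions), and nothing below is the actual Biopython API.

```python
class LocationSketch(object):
    """Hypothetical holder for one or more (start, end) parts."""

    def __init__(self, parts):
        # Copy into a private list of two-entry tuples.
        self._loc_list = [(start, end) for start, end in parts]
        if not self._loc_list:
            raise ValueError("at least one (start, end) pair is required")

    @property
    def start(self):
        """Start of the first part."""
        return self._loc_list[0][0]

    @property
    def end(self):
        """End of the last part."""
        return self._loc_list[-1][1]

    @property
    def is_join(self):
        """True when the location has more than one part."""
        return len(self._loc_list) > 1


simple = LocationSketch([(5, 10)])            # non-join feature
joined = LocationSketch([(5, 10), (20, 30)])  # join(5..10, 20..30) style
```

Keeping the parts as tuples in one list avoids creating a child SeqFeature per join segment, which is where the memory saving in this proposal would come from.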
Peter From cy at cymon.org Fri Apr 24 11:31:28 2009 From: cy at cymon.org (Cymon Cox) Date: Fri, 24 Apr 2009 16:31:28 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <320fb6e00904240451ma79b037y36315270a03724b@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> <7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com> <320fb6e00904240451ma79b037y36315270a03724b@mail.gmail.com> Message-ID: <7265d4f0904240831p378a467dx34152951a3f16c52@mail.gmail.com> 2009/4/24 Peter > >> > MAFFT uses "--param value" style options, and won't accept > >> "--param=value" > >> > or "-param value" as alternatives. > >> > >> OK. Then yes, we should support that. Brad, as Bio.Application is your > >> module, would you like to comment? > >> > >> > > >> > Neither use "-param=value", but if more applications it may turn up. > >> > > >> > >> I don't think I have ever see a command line application that used that. > > > > > > PRANK - Probabilistic Alignment Kit > > http://www.ebi.ac.uk/goldman-srv/prank/prank/ > > > > Advanced usage: 'prank [optional parameters] -d=sequence_file [optional > > parameters]' > > > > Doesn't accept "-d sequence_file" or "- -d=sequence_file" > > I had misunderstood the quotes to be literally typed on the command line ;) So the upshot is that both "- -param value" and "-param=value" need to be supported. 
Rather than add another variation on _Option, or alter _OptionAlt to cover "-param=value", and as we only have a few command line interfaces at present, I'd like to suggest the following simplification to _Option:

class _AbstractParameter:
    def __init__(self, names = [], types = [], checker_function = None,
                 is_required = 0, description = "", equate=True):
        self.names = names
        self.param_types = types
        self.checker_function = checker_function
        self.description = description
        self.is_required = is_required
        # equate: join name and value with "=" (True) or a space (False)
        self.equate = equate

[...]

class _Option(_AbstractParameter):
    """Represent an option that can be set for a program.

    This holds UNIXish options like: --append=yes --append yes --append -append=yes -a yes -append
    """
    def __str__(self):
        """Return the value of this option for the commandline.
        """
        output = "%s" % self.names[0]
        if self.value is not None:
            output += "%s%s " % \
                      (self.equate and "=" or " ", self.value)
        return output

ie. add an equate flag
C. -- From biopython at maubp.freeserve.co.uk Fri Apr 24 12:59:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Apr 2009 17:59:28 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <7265d4f0904240831p378a467dx34152951a3f16c52@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> <7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com> <320fb6e00904240451ma79b037y36315270a03724b@mail.gmail.com> <7265d4f0904240831p378a467dx34152951a3f16c52@mail.gmail.com> Message-ID: <320fb6e00904240959k78d0805bo469dc9666c70d3c0@mail.gmail.com> On Fri, Apr 24, 2009 at 4:31 PM, Cymon Cox wrote: > > So the upshot is that both "- -param value" and "-param=value" need to be >
supported. > > Rather than add another variation on _Option, or alter _OptionAlt to cover > "-param=value", and as we only have a few command line interfaces at > present, I'd like to suggest the following simplification to _Option: > ... > ie. add an equate flag That looks very sensible. If there are no counter suggestions, I think that could be checked in :) Peter From bugzilla-daemon at portal.open-bio.org Sat Apr 25 16:36:36 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 25 Apr 2009 16:36:36 -0400 Subject: [Biopython-dev] [Bug 2817] Meta-bug for cleanup once we drop Python 2.3 support In-Reply-To: Message-ID: <200904252036.n3PKaa7G001530@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2817 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-25 16:36 EST ------- Python 2.4+ should let us use the package_data option in setup.py to install the data files needed for Bio.Entrez and Bio.PopGen (and, if we still include it, Bio.EUtils). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Sat Apr 25 19:30:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 26 Apr 2009 00:30:15 +0100 Subject: [Biopython-dev] Removing Bio.Mindy and Martel Message-ID: <320fb6e00904251630t43ec275ehd25906476c6afe18@mail.gmail.com> Hi all, Bio.Mindy and Martel are the old "regular expressions on steroids" parsing framework we used to use in Biopython, which needed the external dependency mxTextTools (v2, we never got things to work fully with v3). These modules were deprecated in Biopython 1.48 (Sept 2008), and I explicitly wrote in the release announcements for Biopython 1.50 (and its beta) that this would be the final release to include them. 
I decided to do this in two steps (partly because of the number of files involved). I've just removed Mindy and associated bits in CVS, and everything looks fine from a setup and unit test point of view. Next comes Martel and its remaining dependent modules. Martel is still used in the following modules, which were also deprecated in Biopython 1.48 (Sept 2008): Bio.MetaTool (parser for output from an obsolete version of MetaTool) Bio.Saf (an obscure alignment format) Bio.NBRF (replaced with "pir" format in Bio.SeqIO) Bio.IntelliGenetics (replaced with "ig" format in Bio.SeqIO) We've actually had three releases where these modules have had a deprecation warning in place, but not quite the full year as stated in the written policy: http://biopython.org/wiki/Deprecation_policy Does anyone have any objections about us pressing ahead with removing Martel and these modules now? Peter From biopython at maubp.freeserve.co.uk Sun Apr 26 06:58:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 26 Apr 2009 11:58:29 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <320fb6e00904240959k78d0805bo469dc9666c70d3c0@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> <7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com> <320fb6e00904240451ma79b037y36315270a03724b@mail.gmail.com> <7265d4f0904240831p378a467dx34152951a3f16c52@mail.gmail.com> <320fb6e00904240959k78d0805bo469dc9666c70d3c0@mail.gmail.com> Message-ID: <320fb6e00904260358i424a6436v24f21e928fffc073@mail.gmail.com> On Fri, Apr 24, 2009 at 5:59 PM, Peter wrote: > On Fri, Apr 24, 2009 at 4:31 PM, Cymon Cox wrote: >> Rather than add another variation 
on _Option, or alter _OptionAlt to cover >> "-param=value", and as we only have a few command line interfaces at >> present, I'd like to suggest the following simplification to _Option: >> ... >> ie. add an equate flag > > That looks very sensible. If there are no counter suggestions, I > think that could be checked in :) The equate argument is now in CVS. One catch was that the old code used an equals sign on options starting "--", e.g. "--append=yes", but not on short options starting "-", e.g. "-append yes" (a bit of magic based on the behaviour of typical Unix tools?). From a grep for "_Option", the only files concerned are: AlignAce/Applications.py Application/__init__.py Blast/Applications.py Emboss/Applications.py Motif/Applications/AlignAce.py And from looking at these, they all use options with a single leading dash, so for backwards compatibility I set equate to False by default (not True as in your outlined code). Does this work for you Cymon? Peter From bartek at rezolwenta.eu.org Sun Apr 26 08:08:51 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Sun, 26 Apr 2009 14:08:51 +0200 Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com> <320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com> <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com> <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com> <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com> <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com> <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com> <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com> <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> Message-ID: <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> Hi all, On Wed, Apr 22, 2009 at 11:23 AM, Peter wrote: > > If you can fix the current git hub repository, great. > I've finally found some time to fix the tag issue in our repository. I've actually spent some time looking at git-rebase (learned a lot, but nothing useful for our problem). Then I realised that since tags are just references to commits, we need to move them to the trunk (instead of re-basing the trunk). Long story short - assuming you are in the directory of your git repo you can fix any particular tag with a single command. E.g. if you want to fix the biopython-149 tag, you do: git tag -f biopython-149 biopython-149~1 The -f option forces the replacement of the existing tag, while biopython-149~1 references the parent commit of our empty tag commit (you can also use ~2 for a grandparent and so on). You can see the effect of this procedure (as seen in gitx -- a very nice tool) in the attached images. If you want to fix all biopython tags, you simply do: for t in `git tag|grep biopython`; do git tag -f $t $t~1; done It works locally, and the changes can be pushed back to github (need --tags -f to force tag renames); I've done this on my branch of biopython on github.
If there are no objections to the way tags are handled, I can try to update the trunk. This is a bit tricky, because I need to make the update scripts work nicely with moving the tags, but it should be doable. > The old conversion's deletion is still in progress, it must have stalled: > http://support.github.com/discussions/repos/485-reposiotry-stuck-in-rename Seems to be gone now. That's one problem less :) > > If we can fix the tags, great. If we can also remap the authors to > their git usernames, even better. > This is doable in the current setup. I don't know whether we need to do this. The old commits are signed with the same credentials (name, e-mail) as on the CVS server. If we start re-mapping them now, we are going to have essentially a new commit history, so everybody would need to rebase their branches... I don't see a problem with having old commits signed with old e-mails, and new commits signed with new ones, especially as everybody can have multiple e-mails assigned to their github account (that's what I did with mine). cheers Bartek -------------- next part -------------- A non-text attachment was scrubbed... Name: before.png Type: image/png Size: 18855 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: after.png Type: image/png Size: 15150 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Sun Apr 26 08:29:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 26 Apr 2009 13:29:01 +0100 Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com> <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com> <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com> <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com> <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com> <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com> <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com> <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> Message-ID: <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> On Sun, Apr 26, 2009 at 1:08 PM, Bartek Wilczynski wrote: > Hi all, > > On Wed, Apr 22, 2009 at 11:23 AM, Peter wrote: >> >> If you can fix the current git hub repository, great. >> > I 've finally found some time to fix the tag issue in our repository. > I've actually spent some time looking at git-rebase (learned a lot, > but nothing useful for our problem. Then I realised that since tags > are just references to commits, we need to move them to the trunk > (instead of re-basing the trunk). > > Long story short - assuming you are in the directory of your git repo > you can fix any particular tag wit a single command. E.g. If you want > to fix the biopython-149 tag, you do: > git tag -f biopython-149 biopython-149~1 > > -f option enforces the replacement of the existing label, while > biopython-149~1 references the parent commit of our empty tag commit > (you can also use ~2 for a grand parent and so on). > > You can see the effect of this procedure (as seen in gitx -- a very > nice tool) in the attached images. 
> > If you want to fix all biopython tags, you simply do: > > for t in `git tag|grep biopython`; do git tag -f $t $t~1; done > > It works locally, the changes can be pushed back to github (need > --tags -f to force tag renames), > I've done this on my branch of biopython on github. > > If there are no objections to the way tags are handled, I can try to > update the trunk. > This is a bit tricky, because I need to make the update scripts work > nicely with moving > the tags, but it should be doable. I say give this a go - fingers crossed :) >> The old conversion's deletion is still in progress, it must have stalled: >> http://support.github.com/discussions/repos/485-reposiotry-stuck-in-rename > > Seems to be gone now. that's one problem less :) Great. I did have reminded them, but they solved it. >> >> If we can fix the tags, great. If we can also remap the authors to >> their git usernames, even better. >> > This is doable in the current setup. I don't know whether we need to > do this. The old commits > are signed by the same credentials (name, e-mail) as on CVS server. If > we start re-mapping them > now, we are going to have essentially a new commit history, so > everybody would need to rebase their > branches... I don't see a problem of having old commits signed with > old e-mails, and new commits > signed by new. Especially, that everybody can have multiple e-mails > assigned to their github account > (that's how I did with mine). That would be simpler. I'll have to try on my github account... Peter From biopython at maubp.freeserve.co.uk Sun Apr 26 08:46:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 26 Apr 2009 13:46:55 +0100 Subject: [Biopython-dev] Properties in Bio.Application interface? Message-ID: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com> On Thu, Apr 23, 2009 at 10:29 AM, Peter wrote: > What about this bit I wrote earlier: >>> ... 
We might want to discuss extending the AbstractCommandline >>> __init__ method to take **kwargs, allowing the parameters to be >>> set like this: >>> >>> from Bio import Application >>> from Bio.Align.Applications import MafftCommandline >>> cmd = MafftCommandline(input="sample.fa", ...) >>> return_code, std_handle, err_handle = Application.generic_run(cmd) >>> >>> I'm not sure how well this would work in practice as the range of >>> valid argument names in python may not overlap with the valid >>> parameter names. > > We'll have to see how well the above idea works in practice - it > may not be general enough to be useful. > > Also, perhaps we can automatically generate properties for each > argument allowing this: > > cmd.input = "sample.fa" > > rather than: > > cmd.set_parameter("input", "sample.fa") > > For the "switch" type arguments which take no value, if these are > implemented with a separate option class (maybe _Switch or > _OptionNoValue) then rather than: > > cmd.set_parameter("noanchors") > > we might want to do: > > cmd.noanchors = True > > and allow the switch to be removed with: > > cmd.noanchors = False > > i.e. For those arguments which take no argument (is "switch" the > right term here?), evaluate the property set value as a boolean to > add/remove -noanchors from the command line string. > > I think using properties in this way could make the command line > object more intuitive, but again python puts limits on property names > which might mean for some arguments you'd have to use the > set_parameter version. > > Peter > I have been cleaning up the existing Bio.Application command line objects in CVS to follow the parameter alias convention already laid out in Bio.Application. i.e. They all now have human readable parameter aliases, which are also valid python identifiers. This means these "human readable names" can also be used for argument names in __init__ (using **kwargs), or as property names.
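To make the idea concrete, here is a minimal sketch of keyword arguments plus run-time generated properties, including the boolean treatment of switches. It is an illustration only, not the Bio.Application code: `Option`, `SketchCommandline`, the program name `exampletool` and the flag `-in` are all made up for the example, and only the names `input`, `noanchors` and `set_parameter` are borrowed from the discussion above.

```python
class Option(object):
    """Hypothetical parameter description (not a Bio.Application class)."""

    def __init__(self, flag, name, is_switch=False):
        self.flag = flag            # text used on the command line
        self.name = name            # human readable name, a valid identifier
        self.is_switch = is_switch  # True if the argument takes no value


def _make_property(name):
    """Create a property that reads and writes one named parameter."""
    def getter(self):
        return self._values.get(name)

    def setter(self, value):
        self.set_parameter(name, value)

    return property(getter, setter)


class SketchCommandline(object):
    """Toy command line object for an imaginary program."""

    program = "exampletool"  # assumption: not a real tool
    options = [
        Option("-in", "input"),
        Option("-noanchors", "noanchors", is_switch=True),
    ]

    def __init__(self, **kwargs):
        self._values = {}
        # Keyword arguments use the human readable parameter names.
        for name, value in kwargs.items():
            self.set_parameter(name, value)

    def set_parameter(self, name, value=True):
        if name not in [opt.name for opt in self.options]:
            raise ValueError("Unknown parameter %r" % name)
        self._values[name] = value

    def __str__(self):
        parts = [self.program]
        for opt in self.options:
            value = self._values.get(opt.name)
            if opt.is_switch:
                # Switches are evaluated as booleans: present or omitted.
                if value:
                    parts.append(opt.flag)
            elif value is not None:
                parts.append("%s %s" % (opt.flag, value))
        return " ".join(parts)


# Generate one property per parameter once the class exists.
for _opt in SketchCommandline.options:
    setattr(SketchCommandline, _opt.name, _make_property(_opt.name))
```

With this sketch, `SketchCommandline(input="sample.fa")` followed by `cmd.noanchors = True` adds the switch to `str(cmd)`, and `cmd.noanchors = False` removes it again. Attaching the properties to the class (rather than to each instance) is a design choice forced by how Python properties work: they are only looked up on the type.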
I think I've got properties working now as an experiment on my machine, generated at run time using the "human readable name" for each parameter. We would need to special case "switch" arguments (i.e. those which take no value) as outlined above. Does this sound worthwhile? If so, I'll put together an enhancement bug with a patch, or a branch on github. Peter From bugzilla-daemon at portal.open-bio.org Sun Apr 26 09:45:47 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 26 Apr 2009 09:45:47 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904261345.n3QDjlkm022449@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 cymon.cox at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1286 is|0 |1 obsolete| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Apr 26 09:49:44 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 26 Apr 2009 09:49:44 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904261349.n3QDniuI022654@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 cymon.cox at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Bio.Application MUSCLE |Bio.Application command line |command line interface |interfaces ------- Comment #6 from cymon.cox at gmail.com 2009-04-26 09:49 EST ------- (Change title of this bug.) 
Now tracking github branch: http://github.com/cymon/biopython-github-master/tree/applic-int Added command line interfaces for: MUSCLE, MAFFT, DALIGN, PRANK To do: Clustalw, T-coffee C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Sun Apr 26 13:22:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 26 Apr 2009 18:22:43 +0100 Subject: [Biopython-dev] Bio.EMBOSS wrappers In-Reply-To: <320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com> References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> <20090413123219.GB5429@sobchak.mgh.harvard.edu> <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com> <20090413134429.GE5429@sobchak.mgh.harvard.edu> <320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com> Message-ID: <320fb6e00904261022l43f799a8g8729a47ba15042f8@mail.gmail.com> On Mon, Apr 13, 2009 at 2:49 PM, Peter wrote: > On Mon, Apr 13, 2009 at 2:44 PM, Brad Chapman wrote: >>> > ... Feel free to add away. >>> >>> I need to work on my delegation skills - that seems to have back fired ;) >> >> Oops. I honestly read that as "do I have your permission?" I can of >> course tackle this, but am a bit underwater now. > > Looking back, I was a bit ambiguous. I don't mind who does it - let's > see who has time free first. OK, I've added a minimal needle wrapper based on the water wrapper. As part of this I remove the -nosimilarity option which doesn't work on the current versions of EMBOSS needle and water (5.0 or 6.0). For -auto and -filter, I think we probably should extend the parameter classes to explicitly cover these switch arguments which take no value (they are either part of the command line, or omitted). We've touched on this already on Cymon's thread... 
Peter From bugzilla-daemon at portal.open-bio.org Sun Apr 26 16:09:51 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 26 Apr 2009 16:09:51 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904262009.n3QK9p8U011039@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #7 from cymon.cox at gmail.com 2009-04-26 16:09 EST ------- Added CLUSTALW Bio.Application command line interface C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Apr 27 05:58:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Apr 2009 10:58:52 +0100 Subject: [Biopython-dev] main page on wiki In-Reply-To: <320fb6e00904230216l4b2cd91k513baa3f7a08172c@mail.gmail.com> References: <49EFCF07.2050502@student.otago.ac.nz> <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com> <49EFE553.6070405@gmail.com> <320fb6e00904230216l4b2cd91k513baa3f7a08172c@mail.gmail.com> Message-ID: <320fb6e00904270258s523c49a1j1bfc5d4a12ca86a9@mail.gmail.com> On Thu, Apr 23, 2009 at 10:16 AM, Peter wrote: > > If there are no counter comments, I'll put David's changes up later > today or tomorrow. > OK - make that a couple of days later ;) This isn't exactly as in David's draft - I shortened some of the link text and omitted a couple of links under "Contribute" which seemed unnecessary on the home page. I've also kept the final line giving the latest release and date (although the text is shorter now). Brad commented (off list?) that having this is a good indicator of the project's activity, and I agree. Alternatively, I'd like to try having dates on the news feed, but the media wiki plugin needs to be updated for that to work... 
Peter From bugzilla-daemon at portal.open-bio.org Mon Apr 27 08:23:00 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 27 Apr 2009 08:23:00 -0400 Subject: [Biopython-dev] [Bug 2671] Including GenomeDiagram in the main Biopython distribution In-Reply-To: Message-ID: <200904271223.n3RCN0GL009972@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2671 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #34 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-27 08:23 EST ------- (In reply to comment #25) > OK, GenomeDiagram is now in CVS, with some basic tests. Still to do: > > * Updating the existing GenomeDiagram manual to match (different imports, > colour to color), which I think can stay as a separate PDF file. Leighton can do that... > * A short introduction to Bio.Graphics including GenomeDiagram as part > of a new chapter in the tutorial? Done. (In reply to comment #33) > Plus (as pointed out on Bug 2711 / Bug 2710): > > * Updating the installation instructions so that the ReportLab > section also covers renderPM (needed for bitmaps). Done. Marking this bug fixed as of Biopython 1.50. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. 
From bugzilla-daemon at portal.open-bio.org Mon Apr 27 10:12:51 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 27 Apr 2009 10:12:51 -0400 Subject: [Biopython-dev] [Bug 2821] NCBIXML.parse only returns results for non-empty hits rather than one per query sequence In-Reply-To: Message-ID: <200904271412.n3RECpZC019165@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2821 camilla at ip.id.au changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from camilla at ip.id.au 2009-04-27 10:12 EST ------- Hi Peter Thanks for the suggestions. In the end, I realised that b_record.query contains the header line of the query sequence all along, so there is no real bug here, just my misunderstanding of what information is stored where. I think this issue can be closed. For anyone else out there with similar problems, if you aren't certain what data is in an object, you can use the dir() function to list them all. Thanks again Camilla -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Apr 27 11:59:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Apr 2009 16:59:03 +0100 Subject: [Biopython-dev] Installation documentation Message-ID: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> I've made some updates to Installation.tex, which I think are an improvement over the version shipped with Biopython 1.50 and currently online. I think we could update these files now: http://biopython.org/DIST/docs/install/Installation.html http://biopython.org/DIST/docs/install/Installation.pdf Does that seem sensible? Before that, would anyone like to proof read the text in CVS, or make further updates? 
For example, are the bits on FreeBSD, Fink and RPMs still valid? Peter From p.j.a.cock at googlemail.com Mon Apr 27 12:09:57 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 27 Apr 2009 17:09:57 +0100 Subject: [Biopython-dev] Rolling new releases In-Reply-To: <320fb6e00904230658y310609c8l89bf27c33bd56d56@mail.gmail.com> References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com> <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com> <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com> <20090421122045.GD30529@sobchak.mgh.harvard.edu> <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com> <20090423125356.GE34546@sobchak.mgh.harvard.edu> <320fb6e00904230658y310609c8l89bf27c33bd56d56@mail.gmail.com> Message-ID: <320fb6e00904270909y35ebc841yd2074d6970b71fe4@mail.gmail.com> On Thu, Apr 23, 2009 at 2:58 PM, Peter Cock wrote: > The complicated bit is getting the code and documentation in CVS > ready, and that is harder to delegate. Once that is done though, the > actual release process is fairly straight forward - as documented here > - and could be delegated to anyone methodical with suitably setup > development machine(s): > http://biopython.org/wiki/Building_a_release > Maybe some of the release process could be automated literally as a > script - but doing each step methodically by hand and checking as you > go is wise.
On the bright side, after dropping Martel the "Building a release" instructions will get a little shorter :) Peter From biopython at maubp.freeserve.co.uk Mon Apr 27 12:26:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Apr 2009 17:26:37 +0100 Subject: [Biopython-dev] Removing Bio.Mindy and Martel In-Reply-To: <320fb6e00904251630t43ec275ehd25906476c6afe18@mail.gmail.com> References: <320fb6e00904251630t43ec275ehd25906476c6afe18@mail.gmail.com> Message-ID: <320fb6e00904270926l7e2db7e0x21a7bde1e47af4b0@mail.gmail.com> On Sun, Apr 26, 2009 at 12:30 AM, Peter wrote: > > Does anyone have any objections about us pressing ahead with removing > Martel and these modules now? > Well I hope not, as I've just made the changes in CVS. Note that I have not deleted all the files in the Martel folder, but simply excluded Martel from setup.py. Peter From biopython at maubp.freeserve.co.uk Mon Apr 27 12:28:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Apr 2009 17:28:15 +0100 Subject: [Biopython-dev] Installation documentation In-Reply-To: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> Message-ID: <320fb6e00904270928w26fb0f1axbb7be88188d0355f@mail.gmail.com> On Mon, Apr 27, 2009 at 4:59 PM, Peter wrote: > I've made some updates to Installation.tex, which I think are an > improvement over the version shipped with Biopython 1.50 and currently > online. I think we could update these files now: > > http://biopython.org/DIST/docs/install/Installation.html > http://biopython.org/DIST/docs/install/Installation.pdf > > Does that seem sensible? Before that, would anyone like to proof read > the text in CVS, or make further updates? For example, are the bits > on FreeBSD, Fink and RPMs still valid? If we are going to update the online version, I'll refrain from removing the mxTextTools bit from Installation.tex for the time being.
Peter From biopython at maubp.freeserve.co.uk Mon Apr 27 12:37:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Apr 2009 17:37:09 +0100 Subject: [Biopython-dev] history on github - where are the tags? In-Reply-To: <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com> <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com> <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com> <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com> <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com> <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com> <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> Message-ID: <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com> > On Sun, Apr 26, 2009 at 1:08 PM, Bartek Wilczynski >>> >>> If we can fix the tags, great. If we can also remap the authors to >>> their git usernames, even better. >>> >> This is doable in the current setup. I don't know whether we need to >> do this. The old commits are signed by the same credentials (name, >> e-mail) as on CVS server. From looking at git log, they just have our CVS username, e.g. Author: peterc i.e. No email address >> If we start re-mapping them now, we are going to have essentially a >> new commit history, so everybody would need to rebase their >> branches... I don't see a problem of having old commits signed with >> old e-mails, and new commits signed by new. Especially, that >> everybody can have multiple e-mails assigned to their github >> account (that's how I did with mine). > > That would be simpler. I'll have to try on my github account...
> Given we don't have email addresses embedded in the old commits, do you think is this going to be possible (without changing the repository)? Peter From chapmanb at 50mail.com Tue Apr 28 08:41:20 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 28 Apr 2009 08:41:20 -0400 Subject: [Biopython-dev] Installation documentation In-Reply-To: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> Message-ID: <20090428124119.GV34546@sobchak.mgh.harvard.edu> Hi Peter; > I've made some updates to Installation.tex, which I think are an > improvement over the version shipped with Biopython 1.50 and currently > online. I think we could update these files now: > > http://biopython.org/DIST/docs/install/Installation.html > http://biopython.org/DIST/docs/install/Installation.pdf > > Does that seem sensible? Before that, would anyone like to proof read > the text in CVS, or make further updates? For example, are the bits > on FreeBSD, Fink and RPMs still valid? The FreeBSD port is out of date now, so I commented that section out and replaced it with a section on using easy_install. This also reminded me that I needed to update the version on the Python Package Index. I added a note to the release details to do this; oh man, another step. Peter, if you have an account on pypi, let me know your login and I can add you as an owner for Biopython. 
Brad From p.j.a.cock at googlemail.com Tue Apr 28 09:36:37 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 Apr 2009 14:36:37 +0100 Subject: [Biopython-dev] [Biopython] Parsing large blast files In-Reply-To: <627305.69090.qm@web62401.mail.re1.yahoo.com> References: <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com> <627305.69090.qm@web62401.mail.re1.yahoo.com> Message-ID: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com> On Tue, Apr 28, 2009 at 2:00 PM, Michiel de Hoon wrote: >> NCBIStandalone.Iterator() is the old semi-obsolete plain >> text parser - it won't parse the XML output, hence the >> "Invalid header" error. Maybe the tutorial >> (or the error message) could be clearer. > > I think part of the problem is the organization of the code in Bio.Blast, > which seems to have grown historically. Bio.Blast.NCBIStandalone > contains blastall, blastpgp, and rpsblast, which makes sense, but also > BlastParser and PsiBlastParser, which are not necessarily connected > to standalone Blast. Bio.Blast.ParseBlastTable contains the parser for > blastpgp output. Bio.Blast.NCBIWWW contains qblast, but also the > parser for Blast HTML output, though qblast does not necessarily > generate output in HTML format. I presumed that initially the standalone tools only produced plain text, and the website (qblast) only produced HTML - hence the use of Bio.Blast.NCBIStandalone for both command line wrappers AND the plain text parser, and Bio.Blast.NCBIWWW for both the qblast function AND the HTML parser. > The usage of this module may be more understandable if all functions > were accessible from Bio.Blast directly in a fashion more consistent > with current Biopython. Bio.Blast would then have the following functions: > > read(handle, format='xml') > parse(handle, format='xml') > blastall > blastpgp > rpsblast > qblast > > with most of the actual code hiding in Bio.Blast.NCBIStandalone etcetera. > > Any objections, comments?
I do like the idea of moving/importing the qblast function directly under Bio.Blast, and perhaps removing Bio.Blast.NCBIXML later on. For read/parse functions, we should probably call the format "blastxml" to match BioPerl. Would you continue to support the plain text output here? Also something to keep in mind is there may be non-NCBI variants of BLAST with their own formats as well. Rather than continuing to encourage the use of blastall, blastpgp and rpsblast I would rather bring Bio.Blast.Applications up to date, and then declare them obsolete. These three "helper" functions are very limiting in how the command line is invoked - you can't choose the exact call used (e.g. subprocess options) or what you want back (e.g. you may not care about the handles). For example, getting BLAST to write its output to a file is confusingly difficult right now using these functions. Also, dealing with errors isn't nice. Peter From biopython at maubp.freeserve.co.uk Tue Apr 28 09:40:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Apr 2009 14:40:43 +0100 Subject: [Biopython-dev] Installation documentation In-Reply-To: <20090428124119.GV34546@sobchak.mgh.harvard.edu> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> <20090428124119.GV34546@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com> On Tue, Apr 28, 2009 at 1:41 PM, Brad Chapman wrote: > > The FreeBSD port is out of date now, so I commented that section out > and replaced it with a section on using easy_install. This also > reminded me that I needed to update the version on the Python > Package Index. I added a note to the release details to do this; oh > man, another step. Well, easy_install isn't (yet) an official python standard so I hadn't previously worried about it - our wiki Downloads page does mention it.
Frankly the fewer "official" ways there are to install, the fewer ways it can go wrong, and then the fewer questions need to be asked when it goes wrong. Nor had I worried about how PyPi's listing might need to be updated. I assumed it was clever enough to scan the http://biopython.org/DIST/ directory and parse the filenames. Is the real answer that you (Brad) kept it up to date? http://pypi.python.org/pypi/biopython/ > Peter, if you have an account on pypi, let me know your login and I > can add you as an owner for Biopython. I don't have an account on pypi. Peter From bugzilla-daemon at portal.open-bio.org Tue Apr 28 11:04:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 11:04:09 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281504.n3SF49so024149@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 11:04 EST ------- I've checked the MUSCLE wrapper into CVS, and added the -diags option. I also created test_Muscle_tool.py which requires MUSCLE be installed, and checks we can invoke it and parse its clustal output OK. A more general alignment wrapper unit test can simply construct some command line objects and check them against an expected string (without requiring the tools to be installed). Note that I am concerned about the file exists check on the input file argument. This is helpful, but also prevents certain reasonable usage examples - e.g. the input file is created on the fly and doesn't exist yet, or, the command line constructed will be submitted to a cluster where the path will be valid (even if the path isn't valid on the local machine where Biopython is running). Also, perhaps we should think about Bio.Application including automatic quoting for filenames with spaces in them...
see the _escape_filename function used in Bio.Clustalw and Bio.Blast.NCBIStandalone. This would be only for parameters explicitly tagged as filenames. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 28 11:25:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 11:25:20 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281525.n3SFPKbd025807@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #9 from cymon.cox at gmail.com 2009-04-28 11:25 EST ------- (In reply to comment #8) > I've checked the MUSCLE wrapper into CVS, and added the -diags option. You pulled this from the applic-int branch yes? (Hmm, missed that -diags...) I also > created test_Muscle_tool.py which requires MUSCLE be installed, and checks we > can invoke it and parse its clustal output OK. Ive also just checked in (to the github branch) some unittests for MUSCLE, MAFFT, PRANK, and DIALIGN. These just check that the app runs and the return code is 0 - a few other checks are made but not much else. > A more general alignment wrapper unit test can simply construct some command > line objects and check them against an expected string (without requiring the > tools to be installed). I will do these - all in one test_ApplicationCommandlines.py unittest suite. > Note that I am concerned about the file exists check on the input file > argument. This is helpful, but also prevents certain reasonable usage examples > - e.g. the input file is created on the fly and doesn't exist yet, or, the > command line constructed will be submitted to a cluster where the path will be > valid (even if the path isn't valid on the local machine where Biopython is > running). 
Good point. Perhaps the os.path.exists on input files needs to be dropped from all wrappers. > > Also, perhaps we should think about Bio.Application including automatic quoting > for filenames with spaces in them... see the _escape_filename function used in > Bio.Clustalw and Bio.Blast.NCBIStandalone. This would be only for parameters > explicitly tagged as filenames. Yes, I thought about doing that but havent acted. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 28 11:44:06 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 11:44:06 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281544.n3SFi61M027248@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 11:44 EST ------- (In reply to comment #9) > (In reply to comment #8) > > I've checked the MUSCLE wrapper into CVS, and added the -diags option. > > You pulled this from the applic-int branch yes? (Hmm, missed that -diags...) Yes. I spotted the -diags because it is an example given if you just run "muscle". > Ive also just checked in (to the github branch) some unittests for MUSCLE, > MAFFT, PRANK, and DIALIGN. These just check that the app runs and the return > code is 0 - a few other checks are made but not much else. I'll look at that. > > A more general alignment wrapper unit test can simply construct some command > > line objects and check them against an expected string (without requiring > > the tools to be installed). > > I will do these - all in one test_ApplicationCommandlines.py unittest suite. Sounds good. Maybe just test_AlignApps.py if it is just for Bio.Align.Applications? 
> > Note that I am concerned about the file exists check on the input file > > argument. This is helpful, but also prevents certain reasonable usage > > examples - e.g. the input file is created on the fly and doesn't exist > > yet, or, the command line constructed will be submitted to a cluster > > where the path will be valid (even if the path isn't valid on the local > > machine where Biopython is running). > > Good point. Perhaps the os.path.exists on input files needs to be dropped > from all wrappers. Maybe - I dropped most of them from the Muscle and Clustalw ones. The matrix arguments are a little trickier, where the argument can be either a special word of a filename. See below for a related issue ... > > Also, perhaps we should think about Bio.Application including automatic > > quoting for filenames with spaces in them... see the _escape_filename > > function used in Bio.Clustalw and Bio.Blast.NCBIStandalone. This would > > be only for parameters explicitly tagged as filenames. > > Yes, I thought about doing that but havent acted. > Another issue is any file exists check needs to be aware that filenames may be quoted (due to containing spaces). i.e. A simple call to os.path.isfile(...) won't work. I've integrated your Clustalw wrapper into CVS, and in order to extend my existing unit tests to use this with spaces in file names, I was forced to drop the existence check. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 12:18:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 12:18:11 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281618.n3SGIBPl029571@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #11 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 12:18 EST ------- (In reply to comment #9) > > Ive also just checked in (to the github branch) some unittests for MUSCLE, > MAFFT, PRANK, and DIALIGN. These just check that the app runs and the return > code is 0 - a few other checks are made but not much else. > I think I was looking at your master branch, rather than the applic-int branch: http://github.com/cymon/biopython-github-master/commits/applic-int I see the changes now... > > Also, perhaps we should think about Bio.Application including automatic > > quoting for filenames with spaces in them... see the _escape_filename > > function used in Bio.Clustalw and Bio.Blast.NCBIStandalone. This would > > be only for parameters explicitly tagged as filenames. > > Yes, I thought about doing that but havent acted. Seeing as we both think this makes sense, I've done that in CVS. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 12:30:53 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 12:30:53 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281630.n3SGUrLn030516@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #12 from cymon.cox at gmail.com 2009-04-28 12:30 EST ------- (In reply to comment #11) > (In reply to comment #9) > > > > Ive also just checked in (to the github branch) some unittests for MUSCLE, > > MAFFT, PRANK, and DIALIGN. These just check that the app runs and the return > > code is 0 - a few other checks are made but not much else. > > > > I think I was looking at your master branch, rather than the applic-int branch: > http://github.com/cymon/biopython-github-master/commits/applic-int > I see the changes now... In those unittests, you'll note that I have no idea about the windows environment! (dont use window, never have used windows). I just copied from the Emboss wrapper... C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 28 12:39:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 12:39:20 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281639.n3SGdKcO030951@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #13 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 12:39 EST ------- (In reply to comment #11) > In those unittests, you'll note that I have no idea about the windows > environment! (dont use window, never have used windows). I just copied > from the Emboss wrapper... > > C. 
That explains things :)

I had already guessed you hadn't run any of these tests on Windows, because the executable isn't recorded properly, and even if it was, you never use it when creating the command line objects:

#Don't do this if you want to actually run the application, as
#it would only work on Unix where the command is on the path:
#cmdline = MafftCommandline()
#Instead, use the exe name we determined earlier:
cmdline = MafftCommandline(mafft_exe)

The EMBOSS installer is nice and *does* set up EMBOSS_ROOT, which is why test_Emboss.py looks for it. However, for test_Clustalw_tool.py I just made a list of the default install locations, and check them. There is no environment variable!

I haven't looked at the documentation but I would be pleasantly surprised if a MAFFT_ROOT environment variable was set up by the default method of installing MAFFT on Windows (and similarly for the other tools).

If the tools do record their install location in the registry, we can do a win32api call to get the path. Then if win32api isn't installed just raise the MissingExternalDependencyError exception.

If you look at my test_Muscle_tool.py in CVS, you'll see I haven't yet determined how best to try and locate MUSCLE.

Peter
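The search strategy described in comment #13 (an environment variable where the installer sets one, otherwise a list of likely install directories) could be sketched roughly as below. This is an illustration only: find_tool, its arguments and the search order are assumptions, the exception class is a local stand-in for the Biopython one named above, and the registry lookup is left out because it is not known whether the tools record anything there.

```python
import os
import shutil
import tempfile


class MissingExternalDependencyError(Exception):
    """Local stand-in for the Biopython exception named in this thread."""


def find_tool(exe_names, root_env_var=None, candidate_dirs=()):
    """Return the path of the first executable found, or raise an error.

    Search order (an assumption for this sketch): the directory named by
    root_env_var if it is set, then candidate_dirs, then the system PATH.
    """
    search_dirs = []
    if root_env_var and os.environ.get(root_env_var):
        search_dirs.append(os.environ[root_env_var])
    search_dirs.extend(candidate_dirs)
    for directory in search_dirs:
        for name in exe_names:
            path = os.path.join(directory, name)
            if os.path.isfile(path):
                return path
    for name in exe_names:
        on_path = shutil.which(name)
        if on_path:
            return on_path
    raise MissingExternalDependencyError(
        "Could not find any of: %s" % ", ".join(exe_names))


# Demonstration using a fake "install directory" rather than a real tool.
fake_install_dir = tempfile.mkdtemp()
fake_exe = os.path.join(fake_install_dir, "muscle.exe")
open(fake_exe, "w").close()

found = find_tool(["muscle", "muscle.exe"], candidate_dirs=[fake_install_dir])
print(found == fake_exe)
```

A test would then pass the returned path to the command line object, as in the MafftCommandline(mafft_exe) example above, instead of relying on the default executable name.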
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 12:54:32 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 28 Apr 2009 12:54:32 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904281654.n3SGsWBX032007@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815

------- Comment #14 from cymon.cox at gmail.com 2009-04-28 12:54 EST -------
(In reply to comment #13)
> (In reply to comment #11)
> > In those unittests, you'll note that I have no idea about the windows
> > environment! (dont use window, never have used windows). I just copied
> > from the Emboss wrapper...
> >
> > C.
>
> That explains things :)
>
> I had already guessed you hadn't run any of these tests on Windows, because the
> executable isn't recorded properly, and even if it was, you never use it when
> creating the command line objects:
>
> #Don't do this if you want to actually run the application, as
> #it would only work on Unix where the command is on the path:
> #cmdline = MafftCommandline()
> #Instead, use the exe name we determined earlier:
> cmdline = MafftCommandline(mafft_exe)

OK, thanks, I'll update the wrappers on my branch.

> If the tools do record their install location in the registry, we can do a
> win32api call to get the path.
> Then if win32api isn't installed just raise the
> MissingExternalDependencyError exception.

We can? ;)

Can you give me some code, or I could just use this in the meantime:

if sys.platform=="win32" :
    raise MissingExternalDependencyError("Testing with MUSCLE not implemented on Windows yet")

C.
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 13:07:11 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 28 Apr 2009 13:07:11 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904281707.n3SH7BTG000522@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2815

------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 13:07 EST -------
(In reply to comment #14)
> > #Don't do this if you want to actually run the application, as
> > #it would only work on Unix where the command is on the path:
> > #cmdline = MafftCommandline()
> > #Instead, use the exe name we determined earlier:
> > cmdline = MafftCommandline(mafft_exe)
>
> OK, thanks, I'll update the wrappers on my branch.

Have a look at this test_Muscle_tool.py CVS revision 1.4 first:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Tests/test_Muscle_tool.py?cvsroot=biopython

> > If the tools do record their install location in the registry,
> > we can do a win32api call to get the path. Then if win32api
> > isn't installed just raise the MissingExternalDependencyError
> > exception.
>
> We can? ;)
>
> Can you give me some code, ...

There are a lot of ifs here. The code is fairly simple (I've done this kind of thing before, but can't find an example right away). The catch is establishing IF the information we want gets written to the registry during the tool installation or not.

> or I could just use this in the meantime:
> if sys.platform=="win32" :
>     raise MissingExternalDependencyError("Testing with MUSCLE not implemented on Windows yet")

Yeah - use something like that, but be aware that the tests shouldn't assume that the executable name is just "muscle". Hopefully test_Muscle_tool.py does this right...
Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 28 13:32:14 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 13:32:14 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281732.n3SHWEip002274@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #16 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 13:32 EST ------- (In reply to comment #15) > > Yeah - use something like that, but be aware that the tests shouldn't assume > that the executable name is just "muscle". Hopefully test_Muscle_tool.py does > this right... > Well, it does now. I've got test_Muscle_tool.py to run on Windows, assuming the user chooses to put MUSCLE under the program files directory in a reasonably predictable folder. Given the MUSCLE installation process on Windows is entirely manual, we can't really do anything else. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Apr 28 13:45:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Apr 2009 18:45:01 +0100 Subject: [Biopython-dev] history on github - where are the tags? 
In-Reply-To: <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
 <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com>
 <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
 <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>
 <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
 <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
 <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
 <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
 <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
 <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
Message-ID: <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>

On Mon, Apr 27, 2009 at 5:37 PM, Peter wrote:
>> On Sun, Apr 26, 2009 at 1:08 PM, Bartek Wilczynski
>>>>
>>>> If we can fix the tags, great. If we can also remap the authors to
>>>> their git usernames, even better.
>>>>
>>> This is doable in the current setup. I don't know whether we need to
>>> do this. The old commits are signed by the same credentials (name,
>>> e-mail) as on CVS server.
>
> From looking at git log, they just have our CVS username, e.g.
> Author: peterc
> i.e. No email address
>
>>> If we start re-mapping them now, we are going to have essentially a
>>> new commit history, so everybody would need to rebase their
>>> branches... I don't see a problem of having old commits signed with
>>> old e-mails, and new commits signed by new. Especially, that
>>> everybody can have multiple e-mails assigned to their github
>>> account (that's how I did with mine).
>>
>> That would be simpler. I'll have to try on my github account...
>
> Given we don't have email addresses embedded in the old commits,
> do you think this is going to be possible (without changing the
> repository)?
I take that back - I added an email address of just "peterc" to my github account (it seems they don't do any validation, perhaps for this very reason?). This had no immediate effect, but one day later and all my CVS commits are now shown with my photo in github. Neat - but it makes it much more obvious that I have a tendency to do lots of small commits!

Peter

From bartek at rezolwenta.eu.org Tue Apr 28 13:50:20 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 28 Apr 2009 19:50:20 +0200
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
 <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
 <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>
 <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
 <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
 <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
 <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
 <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
 <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
 <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
Message-ID: <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>

On Tue, Apr 28, 2009 at 7:45 PM, Peter wrote:
> I take that back - I added an email address of just "peterc" to my
> github account (it seems they don't do any validation, perhaps for
> this very reason?). This had no immediate effect, but one day later
> and all my CVS commits are now shown with my photo in github. Neat -

great

> but it makes it much more obvious that I have a tendency to do lots of
> small commits!
> That's good practice in git :) cheers Bartek From bugzilla-daemon at portal.open-bio.org Tue Apr 28 13:55:00 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 13:55:00 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281755.n3SHt0FK003782@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #17 from cymon.cox at gmail.com 2009-04-28 13:55 EST ------- (In reply to comment #16) > (In reply to comment #15) > > > > Yeah - use something like that, but be aware that the tests shouldn't assume > > that the executable name is just "muscle". Hopefully test_Muscle_tool.py does > > this right... > > > > Well, it does now. I've got test_Muscle_tool.py to run on Windows, assuming > the user chooses to put MUSCLE under the program files directory in a > reasonably predictable folder. Given the MUSCLE installation process on > Windows is entirely manual, we can't really do anything else. > OK, pushed to applic-int updated unittest for PRANK, MAFFT, and DIALIGN - skipping tests on windows. Also changed the names to test_XXXX_tool.py C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 14:28:31 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 14:28:31 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281828.n3SISVTs005955@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #18 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 14:28 EST ------- (In reply to comment #17) > > OK, pushed to applic-int updated unittest for PRANK, MAFFT, and DIALIGN - > skipping tests on windows. > > Also changed the names to test_XXXX_tool.py > > C. Great. In addition to the MUSCLE and ClustalW stuff, I've got the PRANK code and unit tests in CVS now. These three tests all work on a Linux, Mac and Windows machine (with Python 2.4, 2.5 and 2.6). I'm stopping working on this for today. It would be great if you could test a clean checkout from CVS, and we'll resume this merge later on for the remaining tools MAFFT and DIALIGN. Also, would you be able to look into making the Prank test faster to run? Maybe use a smaller example input file? After we do that, I'd like to use it to test the Nexus parser via Bio.AlignIO (just something simple which won't be affected by gap differences between different versions of PRANK - like my tests for MUSCLE). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython at maubp.freeserve.co.uk Tue Apr 28 15:27:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Apr 2009 20:27:41 +0100
Subject: [Biopython-dev] Where to put command line wrappers
In-Reply-To: <320fb6e00904231421k3e18d0b1y9003614e906fcb1c@mail.gmail.com>
References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com>
 <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com>
 <8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com>
 <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com>
 <320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com>
 <20090417140241.GD16092@sobchak.mgh.harvard.edu>
 <320fb6e00904231421k3e18d0b1y9003614e906fcb1c@mail.gmail.com>
Message-ID: <320fb6e00904281227l5e17159g4333fd98d019ad60@mail.gmail.com>

On Thu, Apr 23, 2009 at 10:21 PM, Peter wrote:
>
> OK, what I propose is that the command line objects are exposed as
> Bio.Align.Applications.MuscleCommandline,
> Bio.Align.Applications.ClustalwCommandline, etc but that the
> implementations live in Bio/Align/Applications/_Muscle.py,
> _Clustalw.py etc. To do this the Bio/Align/Applications/__init__.py
> file will look like this:
>
> from _Muscle import MuscleCommandline
> from _Clustalw import ClustalwCommandline
>
> This avoids having a single massive file, yet keeps the public
> namespace simple. For the user, they do this:
>
> from Bio.Align.Applications import MuscleCommandline
> cline = MuscleCommandline(...)
>
> or if they prefer,
>
> from Bio.Align import Applications
> cline = Applications.MuscleCommandline(...)
>
> From the user's point of view all the alignment command line wrapper
> objects live together under Bio.Align.Applications.

As no one objected or put forward an alternative scheme, Cymon and I have been pressing ahead on Bug 2815 using the above file layout. I have also updated Bio.Motif.Applications to match (this module was deliberately left out of Biopython 1.50 while this issue was settled).
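For illustration, the layout above (private implementation modules re-exported from the package __init__.py) can be tried out with a throwaway package. This is only a sketch: the package name demo_align_apps and the one-line MuscleCommandline class are made up, and the explicit relative import (from ._Muscle import ...) is used here because the implicit form quoted above only works on Python 2.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk; "demo_align_apps" is a made-up name.
base_dir = tempfile.mkdtemp()
package_dir = os.path.join(base_dir, "demo_align_apps")
os.mkdir(package_dir)

# Private implementation module, one per tool.
with open(os.path.join(package_dir, "_Muscle.py"), "w") as handle:
    handle.write(
        "class MuscleCommandline(object):\n"
        "    def __init__(self, cmd='muscle'):\n"
        "        self.cmd = cmd\n"
    )

# The package __init__.py re-exports the public class.
with open(os.path.join(package_dir, "__init__.py"), "w") as handle:
    handle.write("from ._Muscle import MuscleCommandline\n")

sys.path.insert(0, base_dir)
importlib.invalidate_caches()

# The user only ever sees the package level name.
from demo_align_apps import MuscleCommandline

cline = MuscleCommandline()
print(cline.cmd)
print(MuscleCommandline.__module__)  # shows where the class really lives
```

The leading underscore signals that _Muscle is an implementation detail, so the module could later be split or renamed without changing what users import.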
Peter From bugzilla-daemon at portal.open-bio.org Tue Apr 28 17:18:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 17:18:11 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904282118.n3SLIB0N015984@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #19 from cymon.cox at gmail.com 2009-04-28 17:18 EST ------- (In reply to comment #18) > (In reply to comment #17) > > > It would be great if you could test a clean checkout from CVS, Done - on Ubuntu 9.04 Python2.6.2 - Clustalw_tool and Prank_tool both good. Cant test Muscle_tool as Muscle 3.7 is broken on this release (builds and core-dumps). > > Also, would you be able to look into making the Prank test faster to run? Will look into this. (merged upstream into applic-int) C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Tue Apr 28 21:28:26 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 28 Apr 2009 18:28:26 -0700 (PDT) Subject: [Biopython-dev] [Biopython] Parsing large blast files In-Reply-To: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com> Message-ID: <290052.25369.qm@web62407.mail.re1.yahoo.com> --- On Tue, 4/28/09, Peter Cock wrote: > I do like the idea of moving/importing the qblast function > directly under Bio.Blast, and perhaps removing Bio.Blast.NCBIXML > later on. Well Bio.Blast.NCBIXML would still be there (containing the code for the XML parser), but users would access it through Bio.Blast.parse/read. > For read/parse functions, we should probably call the > format "blastxml" to match BioPerl. 
We could have both "xml" and "blastxml" for Blast XML output, "text" and "blasttext" for Blast text output, and "table" and "blasttable" for Blast table (-m 8 and 9) output.

> Would you continue to support the plain text output here?

Yes. I'm more thinking about code reorganization than removing/adding functionality.

> Rather than continuing to encourage the use of blastall,
> blastpgp and rpsblast I would rather bring Bio.Blast.Applications
> up to date, and then declare them obsolete.

How would users typically use Bio.Blast.Applications?

--Michiel.

From p.j.a.cock at googlemail.com Wed Apr 29 04:33:03 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 29 Apr 2009 09:33:03 +0100
Subject: [Biopython-dev] [Biopython] Parsing large blast files
In-Reply-To: <290052.25369.qm@web62407.mail.re1.yahoo.com>
References: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com>
 <290052.25369.qm@web62407.mail.re1.yahoo.com>
Message-ID: <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>

On Wed, Apr 29, 2009 at 2:28 AM, Michiel de Hoon wrote:
>
> How would users typically use Bio.Blast.Applications?
>

In the next release, I would aim to have Bio.Blast.Applications updated to cover blastall (fully), plus blastpgp and rpsblast (currently not covered) and for the three helper functions Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast to all use Bio.Blast.Applications internally. I would suggest at some point (perhaps a release later) calling the three helper functions obsolete, and eventually deprecating them, but I appreciate these are well documented and well used, so this should be a gradual transition.

In the future I would see people constructing their application command line object and then using it to spawn the task as needed. The Bio.Application.generic_run might suffice for low output tools, ranging up to using the builtin subprocess module for full control. The command line string can also be used in other ways, e.g.
for submission to a computing cluster using qsub, or writing to a shell script etc.

The point about this is decoupling construction of the command line string, and actually executing it. Right now the Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions do both, and there is (a) no way to see what the command line used was, which makes debugging difficult, and (b) no way to control how it is invoked (e.g. recent Windows GUI questions).

Another immediate benefit is an example usage that I do quite often: Running BLAST and saving the output to a file. The cleanest way to do this is to use the -o option to get BLAST itself to write to a file. If you do this, then there is no useful output written to the handles - but Bio.Blast.NCBIStandalone makes this fiddly (see Bug 2654). Right now the tutorial does something equally indirect - in python read BLAST output from stdout and save it to a file (and probably not in a memory efficient way either!).

See also this thread on where to put new command line wrappers:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005766.html

If you were asking about the actual code for how to build the command line object, well I have some thoughts on making the current Bio.Application base class easier to use (properties and keyword arguments at init) which I have started to discuss on the dev list.
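The decoupling described above (build the command line first, decide separately how and whether to run it) might look roughly like this with the standard subprocess module. It is a sketch under stated assumptions: build_command and run_command are hypothetical helpers, not Bio.Application functions, and the Python interpreter stands in for an external tool such as blastall so the example can run anywhere.

```python
import shlex
import subprocess
import sys


def build_command(executable, options):
    """Build an argument list from (switch, value) pairs; nothing is run."""
    args = [executable]
    for switch, value in options:
        args.append(switch)
        args.append(str(value))
    return args


def run_command(args):
    """Run a previously built command, returning (return code, stdout, stderr)."""
    child = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                           universal_newlines=True)
    return child.returncode, child.stdout, child.stderr


# Step 1: construct. The Python interpreter stands in for an external tool.
command = build_command(sys.executable,
                        [("-c", "print('hello from the child')")])

# The same command as a single string, e.g. for a log file or for writing
# into a POSIX shell script which is submitted to a cluster later.
command_string = " ".join(shlex.quote(arg) for arg in command)

# Step 2: execute, separately and only when wanted.
return_code, stdout, stderr = run_command(command)
print(return_code)
print(stdout.strip())
```

Because the string exists before anything is executed, it can be printed when debugging, which is the first of the two problems listed above.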
Peter

From bugzilla-daemon at portal.open-bio.org Wed Apr 29 05:55:13 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 29 Apr 2009 05:55:13 -0400
Subject: [Biopython-dev] [Bug 2822] New: Bio.Application.AbstractCommandline - properties and kwargs
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2822

           Summary: Bio.Application.AbstractCommandline - properties and kwargs
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

I have two related proposals to make the command line wrapper objects easier to use,

(1) Supporting keyword arguments in __init__
(2) Supporting parameters as python properties

These both require each parameter to have a "human readable alias" which is also a valid python identifier (this should be the case in CVS now). I will attach patches to this bug, and perhaps put this on github too.

For reference, consider this example (based on one in test_Emboss.py) using the old code in CVS:

>>> from Bio.Emboss.Applications import WaterCommandline
>>> water_exe = r"C:\Progra~1\Emboss\water.exe"
>>> cline = WaterCommandline(cmd=water_exe)
>>> cline.set_parameter("-asequence", "asis:ACCCGGGCGCGGT")
>>> cline.set_parameter("-bsequence", "asis:ACCCGAGCGCGGT")
>>> cline.set_parameter("-gapopen", "10")
>>> cline.set_parameter("-gapextend", "0.5")
>>> cline.set_parameter("-outfile", "temp_test.water")
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5 -outfile=temp_test.water

Note that the parameters can have aliases (sometimes at the actual command line, e.g. a long and a short version of the same switch).
Here the following is also supported:

>>> from Bio.Emboss.Applications import WaterCommandline
>>> water_exe = r"C:\Progra~1\Emboss\water.exe"
>>> cline = WaterCommandline(cmd=water_exe)
>>> cline.set_parameter("asequence", "asis:ACCCGGGCGCGGT")
>>> cline.set_parameter("bsequence", "asis:ACCCGAGCGCGGT")
>>> cline.set_parameter("gapopen", "10")
>>> cline.set_parameter("gapextend", "0.5")
>>> cline.set_parameter("outfile", "temp_test.water")
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5 -outfile=temp_test.water

From bugzilla-daemon at portal.open-bio.org Wed Apr 29 06:00:14 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 29 Apr 2009 06:00:14 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs
In-Reply-To:
Message-ID: <200904291000.n3TA0EBu028672@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2822

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-29 06:00 EST -------
Created an attachment (id=1287)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1287&action=view)
Adds keyword argument support to the __init__ method

This patch adds keyword argument support to the __init__ method, although for the purposes of demonstration in this patch I have only updated the EMBOSS wrappers to use it.
As an alternative to the earlier example you would be able to do:

>>> from Bio.Emboss.Applications import WaterCommandline
>>> water_exe = r"C:\Progra~1\Emboss\water.exe"
>>> cline = WaterCommandline(cmd=water_exe,
...                          asequence="asis:ACCCGGGCGCGGT",
...                          bsequence="asis:ACCCGAGCGCGGT",
...                          gapopen="10", gapextend="0.5",
...                          outfile="temp_test.water")
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5 -outfile=temp_test.water

You can of course still use the set_parameter approach as well, for example to change a setting:

>>> cline.set_parameter("gapopen", "20")
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=20 -gapextend=0.5 -outfile=temp_test.water

I think this is much nicer, and also more like some of the existing "helper functions" we have for wrapping command line tools.

From biopython at maubp.freeserve.co.uk Wed Apr 29 06:25:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 29 Apr 2009 11:25:17 +0100
Subject: [Biopython-dev] Properties in Bio.Application interface?
In-Reply-To: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com>
References: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com>
Message-ID: <320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com>

On Sun, Apr 26, 2009 at 1:46 PM, Peter wrote:
>
> I have been cleaning up the existing Bio.Application command line objects
> in CVS to follow the parameter alias convention already laid out in
> Bio.Application. i.e. They all now have human readable parameter
> aliases, which are also valid python identifiers. This means these
> "human readable names" can also be used for argument names in
> __init__ (using **kwargs), or as property names.
>
> I think I've got properties working now as an experiment on my
> machine, generated at run time using the "human readable name" for
> each parameter. We would need to special case "switch" arguments
> (i.e. those which take no value) as outlined above.
>
> Does this sound worthwhile? If so, I'll put together an enhancement
> bug with a patch, or a branch on github.

I've filed Bug 2822 for these enhancements to the Bio.Application based command line objects,
http://bugzilla.open-bio.org/show_bug.cgi?id=2822

So far there is just a patch to support keyword arguments (quite simple really), with an example of how this changes the interface. I'm still working on the code to do properties as well - I thought I'd solved this a few days ago but it doesn't quite work...

Peter

From p.j.a.cock at googlemail.com Wed Apr 29 06:31:26 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 29 Apr 2009 11:31:26 +0100
Subject: [Biopython-dev] [Biopython] Parsing large blast files
In-Reply-To: <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>
References: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com>
 <290052.25369.qm@web62407.mail.re1.yahoo.com>
 <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>
Message-ID: <320fb6e00904290331n654964bficfc68ae92d477387@mail.gmail.com>

On Apr 29, Peter wrote:
> On Apr 29, Michiel de Hoon wrote:
>>
>> How would users typically use Bio.Blast.Applications?
>>
>
> In the next release, I would aim to have Bio.Blast.Applications
> updated to cover blastall (fully), plus blastpgp and rpsblast
> (currently not covered) and for the three helper functions
> Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast to all use
> Bio.Blast.Applications internally. ...
> > If you where asking about the actual code for how to build the command > line object, well I have some thoughts on making the current > Bio.Application base class easier to use (properties and keyword > arguments at init) which I have started to discuss on the dev list. See this dev list thread: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005916.html And Bug 2822 (with examples): http://bugzilla.open-bio.org/show_bug.cgi?id=2822 Peter From bugzilla-daemon at portal.open-bio.org Wed Apr 29 07:05:25 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Apr 2009 07:05:25 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904291105.n3TB5PRe000547@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #20 from cymon.cox at gmail.com 2009-04-29 07:05 EST ------- (In reply to comment #18) > (In reply to comment #17) > Also, would you be able to look into making the Prank test faster to run? Maybe > use a smaller example input file? After we do that, I'd like to use it to test > the Nexus parser via Bio.AlignIO (just something simple which won't be affected > by gap differences between different versions of PRANK - like my tests for > MUSCLE). Reduced run time from 8s to 1s, added asserts for Nexus outfile parsing. Pushed to applic-int C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Apr 29 07:40:56 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Apr 2009 07:40:56 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904291140.n3TBeu6o002524@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #21 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-29 07:40 EST ------- (In reply to comment #20) > (In reply to comment #18) > > (In reply to comment #17) > > Also, would you be able to look into making the Prank test faster to run? > > Maybe use a smaller example input file? After we do that, I'd like to > > use it to test the Nexus parser via Bio.AlignIO (just something simple > > which won't be affected by gap differences between different versions of > > PRANK - like my tests for MUSCLE). > > Reduced run time from 8s to 1s, added asserts for Nexus outfile parsing. > > Pushed to applic-int > C. Lovely - checked into CVS. On the Linux machine I tested this on it went from 16s to 2s :) P.S. See also Bug 2822 for some of my ideas on making the Bio.Application base class easier to use. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From chapmanb at 50mail.com Wed Apr 29 08:11:15 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 29 Apr 2009 08:11:15 -0400 Subject: [Biopython-dev] Installation documentation In-Reply-To: <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> <20090428124119.GV34546@sobchak.mgh.harvard.edu> <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com> Message-ID: <20090429121115.GX34546@sobchak.mgh.harvard.edu> Hi Peter; > Well, easy_install isn't (yet) an official python standard so I hadn't > previously worried about it - our wiki Downloads page does mention it. > Frankly the less "official" ways the are to install, the less ways it > can go wrong, and then the less questions need to be asked when it > goes wrong. I hear you about too many options. I am a fan of easy_install and PyPi seems to have some momentum even if it is not officially endorsed. The way I normally work on cluster/shared machines is to have an up to date local version of Python and easy_install things I need. PyPi can also handle dependencies, which is nice -- I actually wrote some commented out code in setup.py which will help enable automatic numpy installation now that we are supporting only 2.4 or better. > Nor had I worried about how PyPi's listing might need to be updated. > I assumed it was clever enough to scan the http://biopython.org/DIST/ > directory and parse the filenames. Is the real answer you (Brad) kept > it up to date? > http://pypi.python.org/pypi/biopython/ Yes, I've been doing it on PyPi. The -f option you recommended on the wiki is good in case that is out of date, and I copied that into the install docs for consistency. > > Peter, if you have an account on pypi, let me know your login and I > > can add you as an owner for Biopython. > > I don't have an account on pypi. Cool -- if you end up wanting to play with it just let me know and I'll add you. 
Brad From bugzilla-daemon at portal.open-bio.org Wed Apr 29 08:23:19 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Apr 2009 08:23:19 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904291223.n3TCNJaG005773@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #22 from cymon.cox at gmail.com 2009-04-29 08:23 EST ------- (In reply to comment #21) > P.S. See also Bug 2822 for some of my ideas on making the Bio.Application base > class easier to use. Eagerly anticipating the github branch ;) C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 29 08:53:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Apr 2009 08:53:40 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200904291253.n3TCrec4008244@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-29 08:53 EST ------- OK, for the moment I'm going to give up on the property idea. I was trying to add them dynamically in __init__ or __new__ based on the parameter list, but this is actually rather tricky. I still think it should be possible though... We could use __getattr__ but that doesn't create an entry in dir(...), and thus is not discoverable - nor can we use each parameter's description for a docstring this way. Perhaps the simplest idea would be to define the properties explicitly in each subclass, but this would require more upfront effort as all the existing property object lists would need to be replaced. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 29 10:35:41 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Apr 2009 10:35:41 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200904291435.n3TEZfED018571@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1287 is|0 |1 obsolete| | ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-29 10:35 EST ------- Created an attachment (id=1288) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1288&action=view) Adds keyword argument support to the __init__ method AND properties (In reply to comment #2) > OK, for the moment I'm going to give up on the property idea. I was trying to > add them dynamically in __init__ or __new__ based on the parameter list, but > this is actually rather tricky. I still think it should be possible though... I was close earlier, and think I have solved it now :) As before, this patch adds keyword argument support to the __init__ method, but also setups properties dynamically. Again, for the purposes of demonstration in this patch I have only updated the EMBOSS wrappers to use this. 
So, my original example (using the current code) was:

>>> from Bio.Emboss.Applications import WaterCommandline
>>> water_exe = r"C:\Progra~1\Emboss\water.exe"
>>> cline = WaterCommandline(cmd=water_exe)
>>> cline.set_parameter("asequence", "asis:ACCCGGGCGCGGT")
>>> cline.set_parameter("bsequence", "asis:ACCCGAGCGCGGT")
>>> cline.set_parameter("gapopen", "10")
>>> cline.set_parameter("gapextend", "0.5")
>>> cline.set_parameter("outfile", "temp_test.water")
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5 -outfile=temp_test.water

With the __init__ keyword argument support, this becomes valid:

>>> from Bio.Emboss.Applications import WaterCommandline
>>> water_exe = r"C:\Progra~1\Emboss\water.exe"
>>> cline = WaterCommandline(cmd=water_exe, asequence="asis:ACCCGGGCGCGGT", bsequence="asis:ACCCGAGCGCGGT", gapopen="10", gapextend="0.5", outfile="temp_test.water")
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5 -outfile=temp_test.water

You can of course still use the set_parameter approach as well, for example to change a setting:

>>> cline.set_parameter("gapopen", "20")
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=20 -gapextend=0.5 -outfile=temp_test.water

With the property support, you can then read/or set parameter values directly:

>>> cline.gapopen
'20'
>>> cline.gapopen = 15
>>> cline.gapopen
15
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=15 -gapextend=0.5 -outfile=temp_test.water

This is much nicer I think, but perhaps the biggest plus point is the properties have docstrings which show via:

>>> help(cline)
...
and are discoverable: >>> dir(cline) ['__class__', '__delattr__', '__dict__', '__doc__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__str__', '__weakref__', '_check_value', '_get_parameter', 'aformat', 'asequence', 'bsequence', 'datafile', 'gapextend', 'gapopen', 'outfile', 'parameters', 'program_name', 'set_parameter', 'similarity', 'snucleotide', 'sprotein'] This makes the parameters all readily discoverable, without having to resort to looking at Biopython's source code, or the command line application's help. Right now (using the old code in CVS), the information is there but buried: >>> print cline.parameters [, , , , , , , , , ] >>> for p in cline.parameters : ... print p.names, p.description ... ['-asequence', 'asequence'] First sequence to align ['-bsequence', 'bsequence'] Second sequence to align ['-gapopen', 'gapopen'] Gap open penalty ['-gapextend', 'gapextend'] Gap extension penalty ['-outfile', 'outfile'] Output file for the alignment ['-datafile', 'datafile'] Matrix file ['-similarity', 'similarity'] Display percent identity and similarity ['-snucleotide', 'snucleotide'] Sequences are nucleotide (boolean) ['-sprotein', 'sprotein'] Sequences are protein (boolean) ['-aformat', 'aformat'] Display output in a different specified output format So, comments? We can choose to add EITHER the __init__ keyword arguments OR the properties. Or of course, BOTH. Or neither, and just leave the interface as it stand in CVS now. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Wed Apr 29 11:34:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Apr 2009 16:34:26 +0100 Subject: [Biopython-dev] Properties in Bio.Application interface? 
In-Reply-To: <320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com> References: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com> <320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com> Message-ID: <320fb6e00904290834g7a73c7au487564e3b103250@mail.gmail.com> On Wed, Apr 29, 2009 at 11:25 AM, Peter wrote: > I've filed Bug 2822 for these enhancements to the Bio.Application > based command line objects, > http://bugzilla.open-bio.org/show_bug.cgi?id=2822 I think I learnt some more about python in the process, which may be a sign that the code I've come up with is too complicated, but Bug 2822 now has a patch to support both keyword arguments and properties in the Bio.Application style command line wrappers. This will require minor changes to the __init__ method of any command line sub-class (demonstrated using Bio.Emboss.Applications only thus far). I can envision a simpler approach to this code by defining the properties explicitly in each subclass, but that would mean a lot of boring/risky refactoring (or a clever script to do it for us). There are examples using the new code in the bug comments. Apart from preferring this API, the other big difference is the properties provide built in help. I'll be away for the next four days so I (probably) won't be able to reply to any comments or questions till Monday. Peter From biopython at maubp.freeserve.co.uk Wed Apr 29 12:10:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Apr 2009 17:10:28 +0100 Subject: [Biopython-dev] Git on Windows Message-ID: <320fb6e00904290910y29c6386ax8975ef3c2597d09e@mail.gmail.com> Hi all, I just wanted to say I've had a very quick test of git on Windows, using the git package from cygwin, and it seems to work OK. After copying my SSH key over from my main machine, I was able to clone my github repository, merge from the upstream Biopython branch (i.e. the one being updated from CVS), and push this back to my personal github repository. 
Why did I use git from cygwin? Well I have cygwin installed anyway for mingw32 (the compiler used for the Biopython Windows installers for Python 2.3 to 2.5), and was already using the cvs package from cygwin, so this seemed simplest. Peter From dalloliogm at gmail.com Wed Apr 29 12:42:50 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 29 Apr 2009 18:42:50 +0200 Subject: [Biopython-dev] Git on Windows In-Reply-To: <320fb6e00904290910y29c6386ax8975ef3c2597d09e@mail.gmail.com> References: <320fb6e00904290910y29c6386ax8975ef3c2597d09e@mail.gmail.com> Message-ID: <5aa3b3570904290942g6a73fae3k3a53c2e13c95c258@mail.gmail.com> On Wed, Apr 29, 2009 at 6:10 PM, Peter wrote: > Hi all, > > I just wanted to say I've had a very quick test of git on Windows, Hi, by the way, this is a document published by google on a comparison hg/git: - http://code.google.com/p/support/wiki/DVCSAnalysis In the comments, there is some discussion over git clients for Windows. -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From eric.talevich at gmail.com Wed Apr 29 15:28:58 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 29 Apr 2009 15:28:58 -0400 Subject: [Biopython-dev] XML parsing library for new modules Message-ID: <3f6baf360904291228m1bba6008kb26e345510772e7a@mail.gmail.com> Hi all, I'm writing a parser for the PhyloXML format for Google Summer of Code this year, and as the name would imply, it requires parsing some large XML files. The existing modules in Biopython for parsing XML formats seem to use xml.sax in the standard library. In Python 2.5, a faster and more Pythonic parser was added to the standard lib: ElementTree (xml.etree), in pure-Python and C-enhanced flavors. How do you feel about each of these libraries as the basis for a new Biopython module? 
Here are some interesting benchmarks: http://effbot.org/zone/celementtree.htm#benchmarks The ElementTree library is also available as a standalone package, compatible back to Python 2.1, and the lxml package also offers an independent implementation. So maintaining compatibility with Python 2.4 would require the availability of one of these third-party packages, and my code would try each of these imports in order: from xml.etree import cElementTree as ElementTree from xml.etree import ElementTree # Separate lxml package from lxml.etree import ElementTree # Standalone elementtree package import cElementTree as ElementTree from elementtree import ElementTree Then one day, when Python 2.4 is no longer supported, only the first two lines would be needed. (The second line is for sites that disable C extensions, like Google App Engine, or alternate Python implementations like Jython.) Another option is xml.parsers.expat, but just Googling around, it appears that the Python zeitgeist is strongly in favor of xml.etree for new code. Thoughts? Thanks, Eric From chapmanb at 50mail.com Thu Apr 30 08:05:32 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 30 Apr 2009 08:05:32 -0400 Subject: [Biopython-dev] Properties in Bio.Application interface? 
In-Reply-To: <320fb6e00904290834g7a73c7au487564e3b103250@mail.gmail.com> References: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com> <320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com> <320fb6e00904290834g7a73c7au487564e3b103250@mail.gmail.com> Message-ID: <20090430120532.GA50777@sobchak.mgh.harvard.edu> Hi Peter; > > I've filed Bug 2822 for these enhancements to the Bio.Application > > based command line objects, > > http://bugzilla.open-bio.org/show_bug.cgi?id=2822 > > I think I learnt some more about python in the process, which may be a > sign that the code I've come up with is too complicated, but Bug 2822 > now has a patch to support both keyword arguments and properties in > the Bio.Application style command line wrappers. This will require > minor changes to the __init__ method of any command line sub-class > (demonstrated using Bio.Emboss.Applications only thus far). I can > envision a simpler approach to this code by defining the properties > explicitly in each subclass, but that would mean a lot of boring/risky > refactoring (or a clever script to do it for us). I love what you are doing here. The keywords and properties make it much more Pythonic; the old way reeks of Java-style get/sets. My vote is to put them both in. Brad From marcin.swiatek at mail.mcgill.ca Thu Apr 30 11:23:35 2009 From: marcin.swiatek at mail.mcgill.ca (Marcin Swiatek) Date: Thu, 30 Apr 2009 11:23:35 -0400 Subject: [Biopython-dev] MUMmer Message-ID: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca> Hello, I guess I should start with a nice 'hi' to everybody, now that I am sending my first message to this group. So: Hi, Everybody! Now, that we have the formality out of the way, I will get to the point. Recently, I have written some Python code for parsing and processing the output of MUMmer tool (http://mummer.sourceforge.net/). 
More specifically, the code I have manages invocations and handles outputs of the nucmer pipeline (alignment of multiple closely related nucleotide sequences) and of mummer itself (short exact matches). Obviously, the results are ultimately rendered as pairs of biopython's Seq objects. I use this stuff only myself, in work on bacterial genomes, but I would be more than willing to contribute it to the project. It may be rough around the edges at the moment, but I think I could easily give it the necessary polish if there is interest in having it included. Should that be the case, could one of the project leads point me in the right direction, please? How should I go about the submission? Regards, Marcin Swiatek From bartek at rezolwenta.eu.org Thu Apr 30 12:50:41 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 30 Apr 2009 18:50:41 +0200 Subject: [Biopython-dev] MUMmer In-Reply-To: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca> References: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca> Message-ID: <8b34ec180904300950n2c75f010oed27493f52d0da14@mail.gmail.com> Hi Marcin, On Thu, Apr 30, 2009 at 5:23 PM, Marcin Swiatek wrote: > Hello, > > > > I use this stuff only myself, in work on bacterial genomes, but I would > be more than willing to contribute it to the project. It may be rough > around the edges at the moment, but I think I could easily give it the > necessary polish if there is interest in having it included. > Contributions are always welcome > > > Should that be the case, could one of the project leads point me in the > right direction, please? How should I go about the submission? > > I don't think I qualify as a lead, but nonetheless I think I can help here. I think that the best way to submit your code currently is to create a branch (fork) of biopython on github and submit your changes there and then notify people on biopython-dev that there is new code to review. 
You can also submit an enhancement bug to bugzilla. There are a couple of wiki pages which might be of interest to you: - http://biopython.org/wiki/Contributing - http://biopython.org/wiki/GitUsage If you have any questions or problems during the process, ask on the list. As for the code, I'm not sure, but maybe instead of returning a pair of sequences, an alignment object might be a better choice? You might want to also check out a recent code on application wrappers: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005766.html cheers Bartek From bugzilla-daemon at portal.open-bio.org Wed Apr 1 11:28:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 07:28:12 -0400 Subject: [Biopython-dev] [Bug 2802] New: Loader.py: load SeqRecord comments as list Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2802 Summary: Loader.py: load SeqRecord comments as list Product: Biopython Version: 1.49b Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: andrea at biodec.com Loader.py version: 1.38 or below python: any Actually seqrecord.annotation['comment'] is a string. SProt parser and GenBank parser parse comment as string. SProt record parser, instead, parse comment as list, according to the "-!-" tag. I'm working on parsing comment as lists, either for Uniprot and for GenBank (ncbi), and I need to have the possibility to manage comment as lists. The biosql schema, also, has in the table "comment", the field "rank" that is suitable to be used for storing list entries. In this way the table is ready and implemented to store list data. The patch is retro-compatible, so the _load_comment function is able to load either string or list entries, according to the data type. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 1 11:29:02 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 07:29:02 -0400 Subject: [Biopython-dev] [Bug 2802] Loader.py: load SeqRecord comments as list In-Reply-To: Message-ID: <200904011129.n31BT23k007952@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2802 ------- Comment #1 from andrea at biodec.com 2009-04-01 07:29 EST ------- Created an attachment (id=1270) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1270&action=view) proposed Patch -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 1 11:48:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 07:48:15 -0400 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200904011148.n31BmFmX009292@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 07:48 EST ------- I've updated CVS as per comment 12 to also use record.query_length, and comment 13 to also use record.database_length. Before: >>> from Bio.Blast import NCBIXML >>> for record in NCBIXML.parse(open("xbt007.xml")) : ... print record.query_id ... print record.query_letters, record.query_length ... print record.num_letters_in_database, record.database_letters, record.database_length ... 
gi|585505|sp|Q08386|MOPB_RHOCA 270 None 13958303 None None gi|129628|sp|P07175.1|PARA_AGRTU 222 None 13958303 None None Now, with Bio/Blast/NCBIXML.py CVS revision 1.20 or 1.21, >>> from Bio.Blast import NCBIXML >>> for record in NCBIXML.parse(open("xbt007.xml")) : ... print record.query_id ... print record.query_letters, record.query_length ... print record.num_letters_in_database, record.database_letters, record.database_length ... gi|585505|sp|Q08386|MOPB_RHOCA 270 270 13958303 None 13958303 gi|129628|sp|P07175.1|PARA_AGRTU 222 222 13958303 None 13958303 We could perhaps deprecate record.database_letters immediately, and at a later point, record.query_letters -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 1 11:50:07 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 07:50:07 -0400 Subject: [Biopython-dev] [Bug 2802] Loader.py: load SeqRecord comments as list In-Reply-To: Message-ID: <200904011150.n31Bo7ib009452@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2802 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 07:50 EST ------- See also Bug 2235 for the SwissProt parsing into SeqRecord objects. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
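A note on the deprecation idea raised above: one common way to retire an attribute such as record.database_letters is to keep it as a read-only property that warns and forwards to the new name. The sketch below is only an illustration under stated assumptions. BlastRecordSketch is a hypothetical stand-in, not the Bio.Blast record class, and making the old name an alias for database_length is a design choice of this sketch (in the output quoted above the old attribute is simply None). It is written for current Python rather than the Python 2 of the thread.

```python
import warnings


class BlastRecordSketch:
    """Hypothetical stand-in for a parsed BLAST record (not Biopython code)."""

    def __init__(self, query_length=None, database_length=None):
        self.query_length = query_length
        self.database_length = database_length

    @property
    def database_letters(self):
        """Deprecated alias for database_length (assumption of this sketch)."""
        warnings.warn(
            "database_letters is deprecated; use database_length instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this method
        )
        return self.database_length
```

Old scripts reading record.database_letters would keep working but see a DeprecationWarning, which gives users a release or two to switch before the name is removed.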
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 12:33:37 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 08:33:37 -0400 Subject: [Biopython-dev] [Bug 2802] Loader.py: load SeqRecord comments as list In-Reply-To: Message-ID: <200904011233.n31CXbuM012687@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2802 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 08:33 EST ------- Thanks for the report and suggested patch. This is now fixed in CVS (slightly differently though). I'd be grateful if you could test the latest code. A fresh CVS checkout would be easiest - you'll need to update several files as I was working on another issue at the same time: Checking in BioSQL/BioSeq.py; /home/repository/biopython/biopython/BioSQL/BioSeq.py,v <-- BioSeq.py new revision: 1.35; previous revision: 1.34 done Checking in BioSQL/Loader.py; /home/repository/biopython/biopython/BioSQL/Loader.py,v <-- Loader.py new revision: 1.39; previous revision: 1.38 done Checking in Tests/test_BioSQL_SeqIO.py; /home/repository/biopython/biopython/Tests/test_BioSQL_SeqIO.py,v <-- test_BioSQL_SeqIO.py new revision: 1.33; previous revision: 1.32 done Checking in Tests/output/test_BioSQL_SeqIO; /home/repository/biopython/biopython/Tests/output/test_BioSQL_SeqIO,v <-- test_BioSQL_SeqIO new revision: 1.6; previous revision: 1.5 done Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
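For readers following Bug 2802: the backwards-compatible behaviour described in the report (accept either a single string or a list of strings, and use the "rank" field to keep list order) can be sketched roughly as below. This is not the code that went into BioSQL/Loader.py, which was fixed "slightly differently"; comment_rows is a hypothetical helper, and starting the rank at 1 is an assumption of this sketch. It is written for current Python.

```python
def comment_rows(comment):
    """Return (rank, text) pairs for a comment given as a str or a list of str.

    A plain string is treated as a one-element list, so older records with
    string comments and newer ones with list comments take the same path.
    """
    if isinstance(comment, str):
        comments = [comment]
    else:
        comments = list(comment)
    # Rank numbering from 1 is an assumption made for this illustration.
    return [(rank, text) for rank, text in enumerate(comments, start=1)]
```

Each pair would then correspond to one row in the "comment" table, so a list of comments round-trips in its original order.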
From biopython at maubp.freeserve.co.uk Wed Apr 1 14:23:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 1 Apr 2009 15:23:45 +0100 Subject: [Biopython-dev] Testing Biopython with NumPy 1.3 In-Reply-To: <320fb6e00903310212o29bba163ma9d68a901eabc2c9@mail.gmail.com> References: <320fb6e00903301535j21ae6659r931c9be0fd17faf3@mail.gmail.com> <730606.962.qm@web62408.mail.re1.yahoo.com> <320fb6e00903310212o29bba163ma9d68a901eabc2c9@mail.gmail.com> Message-ID: <320fb6e00904010723j594bc958kc721a234c54d4ea5@mail.gmail.com> On Tue, Mar 31, 2009 at 10:12 AM, Peter wrote: > On Tue, Mar 31, 2009 at 1:08 AM, Michiel de Hoon wrote: >> >>> So, whatever is going wrong on test_Cluster.py seems to be >>> specific to Windows (XP) and Python 2.6 - and possibly just >>> my Windows development machine. >>> >> I believe that the problem is that msvcr90.dll is missing. This >> is the C runtime from Microsoft. Earlier Pythons used >> msvcr71.dll, if I'm not mistaken. > > You may be right - there is some stuff on the numpy mailing list > about this and manifest files etc when using mingw32. It may > be simplest to try the appropriate MS compiler instead... OK, good news using the MS compiler: I went to http://www.microsoft.com/express/download/ and installed the free VC++ 2008 Express Edition (using the web install, unticking the optional silverlight and sql server bits). Using the "Visual Studio 2008 Command Prompt" shortcut I was able to build, test, install Biopython CVS fine. All this shortcut claims to do is setup suitable environment variables first, so this last bit can probably be simplified for every day use. This should mean we can include a Biopython 1.50 (beta) installer for Windows on Python 2.6 using NumPy 1.3 :) It would still be nice to resolve the mingw32 issue, but it isn't critical right now. 
Peter From bugzilla-daemon at portal.open-bio.org Wed Apr 1 14:41:24 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 10:41:24 -0400 Subject: [Biopython-dev] [Bug 2803] New: Insure Alignment objects are passed to AlignIO.write() Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2803 Summary: Insure Alignment objects are passed to AlignIO.write() Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: cymon.cox at gmail.com Insure Alignment objects are passed to AlignIO.write() Stops this kind of abuse: records = list(SeqIO.parse(open("Tests/NBRF/DMA_nuc.pir", "r"), "pir")) AlignIO.write([records], open("alignIO.fasta", "w"), "fasta") -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 1 14:42:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 1 Apr 2009 10:42:55 -0400 Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to AlignIO.write() In-Reply-To: Message-ID: <200904011442.n31EgtlQ023181@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2803 ------- Comment #1 from cymon.cox at gmail.com 2009-04-01 10:42 EST ------- Created an attachment (id=1271) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1271&action=view) nsure-Alignment-objects-are-passed-to-write-AlignIO -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 15:25:36 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 11:25:36 -0400
Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to AlignIO.write()
In-Reply-To:
Message-ID: <200904011525.n31FPa3V026200@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2803

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 11:25 EST -------
Thanks for filing the bug (originally raised in our discussion on the mailing
list).

There is a major drawback to your proposed fix,

+    if isinstance(alignments, types.GeneratorType):
+        alignments = list(alignments)

This means if you gave the AlignIO.write function a generator returning
hundreds of large alignment objects, they would all get loaded into memory at
once. One of the big aims with Bio.SeqIO and AlignIO in using
generators/iterators is to allow memory efficient working where we try to keep
only one record/alignment in memory at a time.

Anyway, I'll take a look at this. I think we need to just check the case where
Bio.AlignIO.write uses Bio.SeqIO.write internally...

From bugzilla-daemon at portal.open-bio.org Wed Apr 1 15:36:54 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 11:36:54 -0400
Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to AlignIO.write()
In-Reply-To:
Message-ID: <200904011536.n31Fasdu027053@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2803

------- Comment #3 from cymon.cox at gmail.com 2009-04-01 11:36 EST -------
(In reply to comment #2)
> Thanks for filing the bug (originally raised in our discussion on the mailing
> list).
>
> There is a major drawback to your proposed fix,
>
> +    if isinstance(alignments, types.GeneratorType):
> +        alignments = list(alignments)
>
> This means if you gave the AlignIO.write function a generator returning
> hundreds of large alignment objects, they would all get loaded into memory at
> once. One of the big aims with Bio.SeqIO and AlignIO in using
> generators/iterators is to allow memory efficient working where we try to keep
> only one record/alignment in memory at a time.
>
> Anyway, I'll take a look at this. I think we need to just check the case where
> Bio.AlignIO.write uses Bio.SeqIO.write internally...
>

Yes, I see. I had originally intended to check the type while looping through
the alignments before calling SeqIO.write, but thought better of it because
some alignments may get written before an error occurs, whereas it seems best
that either all or none at all get written from the call to AlignIO.write.

C.

From bugzilla-daemon at portal.open-bio.org Wed Apr 1 15:55:26 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 11:55:26 -0400
Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to AlignIO.write()
In-Reply-To:
Message-ID: <200904011555.n31FtQ9X028474@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2803

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 11:55 EST -------
(In reply to comment #3)
> > Anyway, I'll take a look at this. I think we need to just check the case
> > where Bio.AlignIO.write uses Bio.SeqIO.write internally...
That turned out to be the case, fixed in CVS. See Bio/AlignIO/__init__.py
revision 1.22 and Tests/test_AlignIO.py 1.19

> Yes, I see. I had originally intended to check the type while looping through
> the alignments before calling SeqIO.write, but thought better of it because
> some alignments may get written before an error occurs, whereas it seems best
> that either all or none at all get written from the call to AlignIO.write.

You are right, if we are given a list/iterator containing some real Alignments
but also some non-Alignments we have a problem. We can't pre-check all the
entries before writing without converting to a list (and this ruins the memory
benefits). We are just catching the erroneous input when we reach it, even
though it may happen half way through writing to the file.

Marking as fixed.

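The trade-off in comment #4 (check each item as it is reached rather than calling list() first) can be sketched as a small generator wrapper. This is only an illustration and not the code checked into CVS: the Alignment class below is a hypothetical stand-in for the real Biopython class, and checked() and write_names() are made-up names.

```python
class Alignment(object):
    """Hypothetical stand-in for the real Biopython alignment class."""

    def __init__(self, name):
        self.name = name


def checked(alignments):
    """Yield items one at a time, raising TypeError at the first bad one."""
    for count, item in enumerate(alignments):
        if not isinstance(item, Alignment):
            raise TypeError("Item %i is a %s, not an Alignment"
                            % (count, type(item).__name__))
        yield item


def write_names(alignments, handle):
    """Toy writer: consumes the checked stream without building a list."""
    written = 0
    for alignment in checked(alignments):
        handle.write(alignment.name + "\n")
        written += 1
    return written
```

Only one item is held at a time, but as noted above a bad entry is only detected once the earlier ones have already been written.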
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 18:04:05 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 14:04:05 -0400
Subject: [Biopython-dev] [Bug 2804] New: Clustalw subprocess hangs when large stdout returned
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2804

Summary: Clustalw subprocess hangs when large stdout returned
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: cymon.cox at gmail.com

As noted on the mailing list, the following hangs waiting for a return:

from Bio import SeqIO
from Bio import Clustalw
from Bio.Clustalw import MultipleAlignCL

records = list(SeqIO.parse(open("Tests/NBRF/Cw_prot.pir", "r"), "pir"))
handle = open("temp.fasta", "w")
SeqIO.write(records, handle, "fasta")
handle.close()
cline = MultipleAlignCL("temp.fasta", command="clustalw")
align = Clustalw.do_alignment(cline)

This appears to be due to a known issue as documented here:
http://docs.python.org/library/subprocess.html#subprocess.Popen.wait
but wasn't being picked up by the tests - presumably because no test file is
large enough to trigger the problem.

Instead of using .wait() it suggests .communicate()

The attached patch works for me on Linux. But as noted in __init__.py this may
be an issue for Windows:

#We don't need to supply any piped input, but we setup the
#standard input pipe anyway as a work around for a python
#bug if this is called from a Windows GUI program. For
#details, see http://bugs.python.org/issue1124861

Also subprocess.returncode is now /3 so moved
"if status: value = status / 256"
so that it is only done if calling os.popen()

C.

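The difference between .wait() and .communicate() described in this report can be seen without ClustalW. A minimal sketch, assuming present-day Python 3 and using the Python interpreter itself as the child process; run_and_collect and CHILD_CODE are made-up names and not part of Bio.Clustalw.

```python
import subprocess
import sys

# Child that prints about 1 MB to stdout, far more than a typical OS pipe
# buffer, which is the situation that makes a bare wait() hang.
CHILD_CODE = "import sys; sys.stdout.write('x' * 1000000)"


def run_and_collect(args):
    """Run args, drain stdout/stderr, and return (returncode, stdout, stderr)."""
    child = subprocess.Popen(args,
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
    # communicate() reads both pipes until EOF and then waits for the child,
    # so the child never blocks on a full pipe as it can with wait() alone.
    out, err = child.communicate()
    return child.returncode, out, err


returncode, out, err = run_and_collect([sys.executable, "-c", CHILD_CODE])
print(returncode, len(out))
```

The return code is read from the returncode attribute after communicate() has finished; communicate() itself only hands back the captured stdout and stderr data.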
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 18:05:10 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 14:05:10 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large stdout returned
In-Reply-To:
Message-ID: <200904011805.n31I5ACv005787@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2804

------- Comment #1 from cymon.cox at gmail.com 2009-04-01 14:05 EST -------
Created an attachment (id=1272)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1272&action=view)
clustalw subprocess patch

From bugzilla-daemon at portal.open-bio.org Wed Apr 1 22:05:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 18:05:40 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large stdout returned
In-Reply-To:
Message-ID: <200904012205.n31M5eDa024097@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2804

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 18:05 EST -------
It is great that you've found a simple and reproducible test case. I can
confirm this problem on a Linux machine with Python 2.4.3 (what version of
python do you have?)

(In reply to comment #1)
> Created an attachment (id=1272)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1272&action=view) [details]
> clustalw subprocess patch

Unfortunately the patch is flawed here:

status = child_process.communicate()[1]

We want to get the return code (a numerical error value), but the communicate
method returns two strings giving the contents of stdout and stderr, i.e.

...
CLUSTAL W (1.83) Multiple Sequence Alignments
...
Sequence format is Pearson
Sequence 1: HLA_HLA00401 366 aa
Sequence 2: HLA_HLA00402 366 aa
...
Group 109: Sequences: 3 Score:6519
Group 110: Sequences: 111 Score:4464
Alignment Score 8299041
CLUSTAL-Alignment file created [temp.aln]

for stdout, and an empty string for stderr.

Doing this seems to work on Linux with python 2.4.3,

child_process.communicate() #ignore the stdout and stderr data!
child_process.stdin.close()
child_process.stdout.close()
child_process.stderr.close()
status = child_process.returncode

However, I have only tested this one example so far, and not on Windows or the
Mac yet. It would be a good idea to extend test_Clustalw_tool.py to cover some
deliberate failures to check we can read the error level (return code)
ClustalW gives back. Of course, this will need testing with both clustalw 1.x
and 2.x to be safe.

Note that the original code using os.popen still works fine for this example.
We switched to subprocess because os.popen* are being deprecated on Python
2.6, and didn't work well with names with spaces as I recall.

From bugzilla-daemon at portal.open-bio.org Wed Apr 1 22:42:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 18:42:39 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large stdout returned
In-Reply-To:
Message-ID: <200904012242.n31MgdKd026637@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2804

------- Comment #3 from cymon.cox at gmail.com 2009-04-01 18:42 EST -------
(In reply to comment #2)
> It is great that you've found a simple and reproducible test case. I can
> confirm this problem on a Linux machine with Python 2.4.3 (what version of
> python do you have?)

Python 2.5.2 (r252:60911, Oct 5 2008, 19:24:49)
[GCC 4.3.2] on linux2
on Ubuntu Intrepid

>
> (In reply to comment #1)
> > Created an attachment (id=1272)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1272&action=view) [details] [details]
> > clustalw subprocess patch
>
> Unfortunately the patch is flawed here:
>
> status = child_process.communicate()[1]

Actually, the 'whole' patch is good. Have a look at the second bit of the
patch, where I change my initial commit to my branch:

     #Grab stderr
-    status = child_process.communicate()[1]
+    child_process.communicate()
+    value = child_process.returncode
 except ImportError :
etc...

I've been trying to get to grips with git - and clearly haven't succeeded yet!

When you run the command "git format-patch" it creates a separate patch for
each commit to the branch, and I can't figure out how to just get the patch
against only the current version of the file. So git gave me two patches,
which I cat'ed together and submitted as a composite patch.

Sorry I didn't make that clear. If anyone knows how to get the diff against
only the current file version, I'd appreciate the answer ;)

Cheers, C.

From bugzilla-daemon at portal.open-bio.org Thu Apr 2 11:00:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 07:00:48 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large stdout returned
In-Reply-To:
Message-ID: <200904021100.n32B0mEZ014206@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2804

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1272 is|0                           |1
           obsolete|                            |

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 07:00 EST -------
Created an attachment (id=1273)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1273&action=view)
Patch to Bio/Clustalw/__init__.py

(In reply to comment #3)
>
> When you run the command "git format-patch" it creates a separate patch for
> each commit to the branch, and I can't figure out how to just get the patch
> against only the current version of the file. So git gave me two patches,
> which I cat'ed together and submitted as a composite patch.
>

I see - that odd looking patch had confused me. I think you want to look at
"git diff ..." for this, it also can do things like show the diff between the
remote branches.

I have tested this new patch on both Linux and Mac now, using both ClustalW
1.83 and 2.0.10 - next up Windows, and extending the unit test.

Peter

From bugzilla-daemon at portal.open-bio.org Thu Apr 2 11:32:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 07:32:40 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large stdout returned
In-Reply-To:
Message-ID: <200904021132.n32BWdqU016365@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2804

------- Comment #5 from cymon.cox at gmail.com 2009-04-02 07:32 EST -------
(In reply to comment #4)
> Created an attachment (id=1273)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1273&action=view) [details]
> Patch to Bio/Clustalw/__init__.py
>
> (In reply to comment #3)
> >
> > When you run the command "git format-patch" it creates a separate patch for
> > each commit to the branch, and I can't figure out how to just get the patch
> > against only the current version of the file. So git gave me two patches,
> > which I cat'ed together and submitted as a composite patch.
> >
>
> I see - that odd looking patch had confused me. I think you want to look at
> "git diff ..." for this, it also can do things like show the diff between the
> remote branches.
>
> I have tested this new patch on both Linux and Mac now, using both ClustalW
> 1.83 and 2.0.10 - next up Windows, and extending the unit test.

Your new patch doesn't indent the lines (as in my original patch):

value = 0
if status: value = status / 256

so that they only get executed when

run_clust = os.popen(str(command_line))

The return code from child_process.communicate() is already /256

also assign value = child_process.returncode
(the return code is 0 for success and never "")

"""
    child_process.communicate()
    value = child_process.returncode
except ImportError :
    #Fall back for python 2.3
    run_clust = os.popen(str(command_line))
    status = run_clust.close()
    # The exit status is the second byte of the termination status
    # TODO - Check this holds on win32...
    value = 0
    if status: value = status / 256
# check the return value for errors, as on 1.81 the return value
# from Clustalw is actually helpful for figuring out errors
# 1 => bad command line option
if value == 1:
    raise ValueError("Bad command line option in the command: %s"
                     % str(command_line))
"""

C.

From bugzilla-daemon at portal.open-bio.org Thu Apr 2 14:34:10 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 10:34:10 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large stdout returned
In-Reply-To:
Message-ID: <200904021434.n32EYApO032328@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2804

------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 10:34 EST -------
I've updated test_Clustalw_tool.py in CVS to catch this deadlock, and
confirmed the unit test will fail on Mac and Linux when using subprocess (on
the bright side, Python 2.3 should still work), but the test passes with the
fix outlined - or simply using the os.popen code instead. Interestingly the
lockup seems to happen more readily on Linux than on the Mac. I've yet to test
on Windows.

I also added three tests for standard error conditions - interestingly I don't
ever seem to get an error code back (either with subprocess or os.popen). What
about you? This makes testing these special cases for raising specific IOError
exceptions difficult.

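The "status / 256" step in comment #5 only applies to the value returned by os.popen(...).close(), which on POSIX systems carries the exit code in its second byte; the returncode attribute from subprocess needs no such conversion. A minimal sketch of that distinction, assuming the POSIX convention and present-day Python 3 (the function name is hypothetical, and the Windows behaviour flagged in the TODO above is not covered):

```python
def exit_code_from_popen_status(status):
    """Convert the value of os.popen(...).close() into a plain exit code.

    Assumes the POSIX convention discussed above: close() gives None (or 0)
    on success, otherwise the exit code shifted left by one byte.
    """
    if not status:
        return 0
    # Integer shift rather than "/ 256", which yields a float on Python 3.
    return status >> 8


# Worked values: a child exiting with code 1 is reported as 256 by close().
print(exit_code_from_popen_status(None))  # success
print(exit_code_from_popen_status(256))   # exit code 1
```

A value taken from subprocess's returncode should be used directly and not passed through this function.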
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 15:19:04 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 11:19:04 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large stdout returned
In-Reply-To:
Message-ID: <200904021519.n32FJ4DC003715@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2804

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         OS/Version|Linux                       |All
         Resolution|                            |FIXED

------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 11:19 EST -------
Hi Cymon,

I've updated the unit test for Windows on Python 2.3 through 2.6 (had to move
some file deletions to the end, and watch out for extra error message
variations).

Windows also deadlocks on this example when using subprocess - the test should
normally take about four seconds in total (depending on your computer's speed
of course). Using os.popen avoids the deadlock (but can't cope with file names
with spaces). Your fix in comment 5 also works :)

So, now we have a unit test which catches this deadlock on all three operating
systems, and confirms that your fix works on all three. I've checked it into
CVS, and marked this bug as fixed.

[I'm still not sure what is happening with the return values - if you look
into this further please raise a new bug for it.]

Thanks!

Peter

From bugzilla-daemon at portal.open-bio.org Thu Apr 2 15:32:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 11:32:39 -0400
Subject: [Biopython-dev] [Bug 2806] New: Possible deadlock (hang) in Bio.Application using subprocess wait()
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2806

Summary: Possible deadlock (hang) in Bio.Application using subprocess wait()
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
CC: cymon.cox at gmail.com

See Bug 2804 which demonstrated a reproducible hang on Windows, Linux and Mac
from the subprocess .wait() method, and a work around.

Bio.Application may suffer from the same problem, and could be fixed with the
same approach. Patch to follow ...

Ideally we'd have a suitable unit test covering this - perhaps using
Bio.EMBOSS?

From bugzilla-daemon at portal.open-bio.org Thu Apr 2 15:33:30 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 11:33:30 -0400
Subject: [Biopython-dev] [Bug 2806] Possible deadlock (hang) in Bio.Application using subprocess wait()
In-Reply-To:
Message-ID: <200904021533.n32FXU67004756@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2806

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 11:33 EST -------
Created an attachment (id=1274)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1274&action=view)
Patch to Bio/Application/__init__.py

Use the .communicate() method instead of .wait()

From bugzilla-daemon at portal.open-bio.org Thu Apr 2 19:18:56 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 15:18:56 -0400
Subject: [Biopython-dev] [Bug 2734] db.load problem with postgresql and psycopg2
In-Reply-To:
Message-ID: <200904021918.n32JIuXc023154@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2734

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID

------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 15:18 EST -------
As per comment 8, I'm going to assume Stephen had an old copy of Biopython on
his machine, which would explain the error. In the absence of any further
information there isn't anything we can do. Marking bug as invalid.

Stephen - if you do work out what was going on, or if you still have a problem
after sorting out any issue with multiple copies of Biopython installed,
please do reopen this report.

Thanks

Peter

From bugzilla-daemon at portal.open-bio.org Thu Apr 2 22:29:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 18:29:18 -0400
Subject: [Biopython-dev] [Bug 2807] New: Clustalw return codes
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2807

Summary: Clustalw return codes
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: cymon.cox at gmail.com

see bug 2804

More on clustalw return codes:

Note return codes are the same whether using subprocess.returncode or
(os.popen().close() \3)

                                                clustalw1.81       clustalw2.09
                                                -----------------  ------------------
error: Bad command line option in the command:
clustalw_bogus -INFILE=Fasta/f002               127                127

error: can't open sequence file:
clustalw -INFILE=no_file_present                2                  255

error: wrong format of input file:
clustalw -INFILE=Phylip/hennigian.phy           3                  255

error: only one sequence in input:
clustalw -INFILE=Fasta/f001                     4                  0
=========================================================

Clustalw.__init__ tries to catch return codes 1, 2, 3, and 4, others get
caught generically.

I don't think it is possible to generate a return code 1 using 1.81 because
the interface doesn't allow ad hoc options to be added to the command line.
Invalid values of options are just ignored by clustalw and it aligns the data
anyway (ie return code 0).

Return codes 127 and 255 could be caught for newer versions and a more
informative error returned.
But given that there are 9 other clustalw versions between 1.81 (June 2003)
and the latest 2.0.10 (Oct 2008) for which I haven't checked the return codes,
it might be better to just return a generic command line error if the return
value is > 0.

In the case where only one sequence is present, newer versions return code 0,
but a ValueError is thrown when trying to parse the non-existent output file
(see comment in test_Clustalw_tools.py).

C.

From bugzilla-daemon at portal.open-bio.org Fri Apr 3 09:50:44 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 3 Apr 2009 05:50:44 -0400
Subject: [Biopython-dev] [Bug 2807] Clustalw return codes
In-Reply-To:
Message-ID: <200904030950.n339oiIx019752@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2807

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-03 05:50 EST -------
(In reply to comment #0)
> Clustalw.__init__ tries to catch return codes 1, 2, 3, and 4, others get
> caught generically.

With the CVS code, using clustalw1.81, is it definitely catching these errors
and raising specific IOErrors?

> I don't think it is possible to generate a return code 1 using 1.81 because
> the interface doesn't allow ad hoc options to be added to the command line.

The Bio.Clustalw.do_alignment() function accepts any command line string, so
you should be able to feed it a clustalw command with invalid arguments.

> Invalid values of options are just ignored by clustalw and it aligns the
> data anyway (ie return code 0).

We'd have to look at the clustalw source code to confirm what should trigger a
return error code of 1.

> Return codes 127 and 255 could be caught for newer versions and a more
> informative error returned.

Yes, that sounds sensible.

> But given that there are 9 other clustalw versions
> between 1.81 (June 2003) and the latest 2.0.10 (Oct 2008) for which
> I haven't checked the return codes, it might be better to just return a generic
> command line error if the return value is > 0.

That also sounds sensible.

> In the case where only one sequence is present, newer versions return code 0,
> but a ValueError is thrown when trying to parse the non-existent output file
> (see comment in test_Clustalw_tools.py).

Maybe we should report that as a bug, I think clustalw2.0 is intended to be
API compatible with clustalw1.x

Peter

From thamelry at binf.ku.dk Fri Apr 3 13:31:05 2009
From: thamelry at binf.ku.dk (Thomas Hamelryck)
Date: Fri, 3 Apr 2009 15:31:05 +0200
Subject: [Biopython-dev] PDB tidy script
In-Reply-To: <320fb6e00903231405l479ddcc6of9cd0c1aa8fd98d4@mail.gmail.com>
References: <320fb6e00903231405l479ddcc6of9cd0c1aa8fd98d4@mail.gmail.com>
Message-ID: <2d7c25310904030631u56c642d6k83355d6bc4cd5d19@mail.gmail.com>

Hi everybody,

> > I haven't been on this list long enough to know -- is Thomas still
> > supporting the PDB module?

Yes and no. First, I've been pretty busy with establishing a group here in
Copenhagen, but it looks like I will have time for Bio.PDB again in the
future. There's for example a set of classes dealing with RNA structure coming
up. Just have to submit it.

Second, I have no interest in doing anything beyond 3D stuff. I am not going
to implement header parsing for example. I know many people have donated code,
but in general this code is very messy and ad-hoc. The PDB parser is pretty
lean, fast and quite stable now - IMO parsing the header should be the
responsibility of a helper class, in order not to overload the 3D code with a
lot of stuff that most people will not use.
Also, the header info is for most purposes quite useless, especially in PDB
files. It makes no sense to parse the PDB header in fact - if you need header
info, use the MMCIF files.

> > If so, would he give his blessing to some more
> > invasive changes to the PDB module, such as unifying PDBParser and
> > parse_pdb_header? That separation has always seemed curiously vestigial to
> > me.

You could provide a uniform interface, but please keep the 3D data processing
and the header processing in separate classes! The Structure object has
functionality to be 'annotated', so you could transfer data from the header to
the Structure object easily.

> If you look back over the history, there initially was no header parsing,
> it was a contribution from Kristian Rother, and I would agree, it is rather
> disjoint from the rest of the code. One thing I personally wanted last
> time I was working with PDB files was to have secondary structure
> information (from the alpha and beta sheet lines in the header)
> mapped onto the residue objects automatically.

This is a good example of why header parsing is something of a red herring.
You really want to recompute that using some decent program like DSSP or PSEA,
or even an internal Bio.PDB procedure. But it's fine of course if you want to
add this!

> I would suggest you try and get Thomas involved now for his input
> on the design (before you start coding), but if need be press ahead
> anyway for your own use, and he can always comment on your
> public branch. I hope the two of you can work together on this, and
> if/when Thomas does stand down (or delegate), you could then be
> in an excellent position to take over as the Bio.PDB maintainer if
> that's what you wanted.

Sure, I'm open to this, but I'd like to stay involved if the 3D stuff is
altered, even just to discuss new designs.

Cheers,

-Thomas

From biopython at maubp.freeserve.co.uk Fri Apr 3 16:41:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 3 Apr 2009 17:41:04 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50 (beta)
Message-ID: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com>

On Tue, Mar 31, 2009 at 10:38 PM, Peter wrote:
> Hi all,
>
> OK guys, after a brief chat off the mailing list, I'm hoping to do the
> Biopython 1.50 beta release roughly this weekend, somewhere between
> Friday 4 and Monday 6 April. Until then please consider CVS "frozen"
> for anything other than documentation changes or unit test additions,
> or at a push really tiny changes. Once I'm ready to actually do the
> release, I'll send out an email requesting no further CVS commits.

I'm going to try and do the release tonight (in the next few hours), so please
consider CVS frozen until further notice.

Thanks,

Peter

From biopython at maubp.freeserve.co.uk Fri Apr 3 18:07:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 3 Apr 2009 19:07:58 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50 (beta)
In-Reply-To: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com>
References: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com>
Message-ID: <320fb6e00904031107q3df63df7q3569c22d4a521b7c@mail.gmail.com>

On Fri, Apr 3, 2009 at 5:41 PM, Peter wrote:
>
> I'm going to try and do the release tonight (in the next few hours),
> so please consider CVS frozen until further notice.
>

OK, it's done - uploaded, and tagged in CVS. If you could all give it a quick
test now that would be great, especially the Windows installers if possible as
I currently only have ready access to the one Windows machine which is where
the installers were built.

I'll prepare the news entry and email announcement later on tonight, based on
the current NEWS file. If there is anything missing which should be mentioned,
please email me ASAP.

I'm happy for CVS to be used again to check in documentation changes, but no
code changes yet please.

Thanks

Peter

From tiagoantao at gmail.com Sat Apr 4 16:43:10 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Sat, 4 Apr 2009 17:43:10 +0100
Subject: [Biopython-dev] Merging branches
Message-ID: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>

Hi,

This might be a lame question but I am completely stuck and don't seem to
understand why.

I am trying to PARTIALLY merge 2 branches: my popgen branch with Giovanni's. I
want to import his changes to Bio/PopGen/Stats, but only that (nothing on
other Bio directories, and, above all not a new test). These changes do not
conflict, so I get no warning and everything gets in: if I do a git-merge I
get the whole bang.

Is there any way to just get partial merge? In this case I only want to merge
a single sub dir (although, in general one might just want to import a single
file)

Of course I could do 2 checkouts and copy files across, on the local
filesystem, but is that not losing the history of connections between the
files?

Many thanks,
Tiago

--
"A man who dares to waste one hour of time has not discovered the value of
life" - Charles Darwin

From biopython at maubp.freeserve.co.uk Sat Apr 4 17:01:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 4 Apr 2009 18:01:53 +0100
Subject: [Biopython-dev] Merging branches
In-Reply-To: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
Message-ID: <320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>

2009/4/4 Tiago Antão:
> Is there any way to just get partial merge? In this case I only want
> to merge a single sub dir (although, in general one might just want
> to import a single file)

Can you cherry pick the changes you want? Github's fork queue provides another
approach to the same issue.
However, these both work on patches (individual commits) rather than files/directories.

Peter

From tiagoantao at gmail.com Sat Apr 4 17:29:20 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Sat, 4 Apr 2009 18:29:20 +0100
Subject: [Biopython-dev] Merging branches
In-Reply-To: <320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com> <320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
Message-ID: <6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>

Me thinks I need to get a book on git and understand, once and for all, the basic concepts. I am getting merge conflicts with cherry picking and I don't even understand why.

Anyway it would be nice (but not fundamental) to merge just a single file.

2009/4/4 Peter:
> 2009/4/4 Tiago Antão:
>> Is there any way to just get partial merge? In this case I only want
>> to merge a single sub dir (although, in general one might just want
>> to import a single file)
>
> Can you cherry pick the changes you want? Github's fork queue
> provides another approach to the same issue. However, these both work
> on patches (individual commits) rather than files/directories.
>
> Peter
>

--
"A man who dares to waste one hour of time has not discovered the value of life" - Charles Darwin

From biopython at maubp.freeserve.co.uk Sat Apr 4 19:06:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 4 Apr 2009 20:06:57 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50 (beta)
In-Reply-To: <320fb6e00904031107q3df63df7q3569c22d4a521b7c@mail.gmail.com>
References: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com> <320fb6e00904031107q3df63df7q3569c22d4a521b7c@mail.gmail.com>
Message-ID: <320fb6e00904041206yb0e4a29ja715a54faeeca28e@mail.gmail.com>

On Fri, Apr 3, 2009 at 7:07 PM, Peter wrote:
> I'm happy for CVS to be used again to check in documentation changes,
> but no code changes yet please.
Also I should have said before, those with CVS access, please feel free to add more unit tests. I've started work on one using the EMBOSS tools, to check both the command line wrappers in Bio.Emboss and our parsers.

I'm repeating myself but if you have some new code you'd like to check in, while CVS is "frozen" for the release process, this is a nice chance to try playing with git and github ;)

Peter

From bartek at rezolwenta.eu.org Sun Apr 5 09:49:14 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Sun, 5 Apr 2009 11:49:14 +0200
Subject: [Biopython-dev] Merging branches
In-Reply-To: <6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com> <320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com> <6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
Message-ID: <8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>

Hi Tiago,

2009/4/4 Tiago Antão:
> Me thinks I need to get a book on git and understand, once and for
> all, the basic concepts. I am getting merge conflicts with cherry
> picking and I don't even understand why
>

If you could be a bit more specific (providing the files and revision numbers would be great), then it would be easier to help. I know it is extra work, but we need some info, also to improve our wiki documents.

> Anyway it would be nice (but not fundamental) to merge just a single file.
>

This is one of the fundamental differences between CVS and git. CVS uses files as the atomic piece of data, while git works with changesets (commits). This means that if you only need a part of what was committed as a big changeset, you will need to put extra effort into selecting what you need.

>> 2009/4/4 Tiago Antão:
>>> Is there any way to just get partial merge? In this case I only want
>>> to merge a single sub dir (although, in general one might just want
>>> to import a single file)

Looking at specific files is not the default way things work in git. The idea is that if someone makes a single commit, it is an atomic contribution that is either to be accepted or not. You can of course create a diff file and then split it into specific files. I'll look into possible easier ways of doing it.

cheers
Bartek

From eric.talevich at gmail.com Sun Apr 5 16:47:39 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 5 Apr 2009 12:47:39 -0400
Subject: [Biopython-dev] Merging branches
In-Reply-To: <8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com> <320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com> <6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com> <8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
Message-ID: <3f6baf360904050947m5d9ec75eh18d64c53b8d9e2a6@mail.gmail.com>

2009/4/5 Bartek Wilczynski
> Hi Tiago,
>
> >> 2009/4/4 Tiago Antão:
> >>> Is there any way to just get partial merge? In this case I only want
> >>> to merge a single sub dir (although, in general one might just want
> >>> to import a single file)
>
> Looking at specific files is not the default way things work in git.
> The idea is that if
> someone makes a single commit, it is an atomic contribution that is
> either to be
> accepted or not. You can of course create a diff file and then split
> it into specific files.
> I'll look into possible easier ways of doing it.
>
> cheers
> Bartek
>

You can get a list of the changes that affected a single subdirectory by giving the directory name to git log, e.g. "git log Bio/PopGen/Stats/". Those commits don't necessarily just affect Bio/PopGen/Stats, but assuming there aren't any single-commit code bombs, then it's probably a good idea to take those associated modifications anyway.
You can also give a range of versions to git-log to get the commits that occurred since Gio's branch diverged from yours -- it looks something like "git log HEAD..[gio's branch] -- [path]" (the revision range goes before the path); details are in the help page for git-rev-parse. Then you can use that list of commits for cherry-picking, in the original order.

If it's essential to get just a specific file at a specific version, you can find the SHA1 hash for that blob (probably easiest through github) and use git-show with a redirect to the file in your tree, or a temporary filename. This loses the history, though.

Cheers,
Eric

From tiagoantao at gmail.com Mon Apr 6 10:35:47 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Mon, 6 Apr 2009 11:35:47 +0100
Subject: [Biopython-dev] Merging branches
In-Reply-To: <8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com> <320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com> <6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com> <8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
Message-ID: <6d941f120904060335m380ac820k97a558e8332fdf63@mail.gmail.com>

Hi,

2009/4/5 Bartek Wilczynski:
> If you could be a bit more specific (providing the files and revision numbers
> would be great), then it would be easier to help. I know it is extra work,
> but we need some info, also to improve our wiki documents.
>

I would like to replace this:
http://github.com/tiagoantao/biopython-popgen-test/blob/fa5ebc23e7aaabce94ae594d9a4f83be9bf90215/Bio/PopGen/Stats/Simple.py
With this:
http://github.com/dalloliogm/biopython/blob/cbaf6249cb91ed505cb575f09c2eaef3809872b9/Bio/PopGen/Stats/Simple.py
It would be cool not to lose the history relationship (I suppose that would be the good practice).

> This means, that if you only need a part of what was committed as a
> big changeset,
> you will need to put an extra effort into selecting what you need.
But how do you do that (other than manually copying files)? Cherry pick seems to be commit based...

> Looking at specific files is not the default way things work in git.
> The idea is that if
> someone makes a single commit, it is an atomic contribution that is
> either to be
> accepted or not. You can of course create a diff file and then split
> it into specific files.
> I'll look into possible easier ways of doing it.

The point is: wanting to use part of a commit without losing history. In my case, I don't want to import a test_PopGen_Fst file that Gio has.

That being said, I don't think this is a big deal. I just wanted to preserve the history connectivity between repositories. I think we can just use the old fashioned method of copying some files around. But it would be good to know if there is a "best practice" (which I could not find).

Tiago

PS - I might have to undergo surgery this week; if I stop responding for a long time, my apologies in advance, but I am probably recovering.

From biopython at maubp.freeserve.co.uk Mon Apr 6 13:25:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 6 Apr 2009 14:25:29 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
Message-ID: <320fb6e00904060625v4a49da2au76159eae18f707eb@mail.gmail.com>

Brad has been working on his GFF parsing code - see progress reports on his blog http://bcbio.wordpress.com/ and his code on github, http://github.com/chapmanb/bcbb/tree/master/gff

Potentially this could make it into Biopython 1.51, and I was just thinking about where the code would go. Brad is supporting both GFF3 and the loosely defined GFF2 variants, so Bio.GFF seems a good place. There would also be a wrapper under Bio.SeqIO for loading GFF files as SeqRecord objects (I haven't played with Brad's code, but it can do this already).

However, we already have a Bio.GFF module from Michael Hoffman created back in 2002 which accesses MySQL General Feature Format (GFF) databases created with BioPerl.
Perhaps we should poll the main discussion list now, and if there are no responses from people using it, we could deprecate Bio.GFF for Biopython 1.50? Under our current deprecation policy we shouldn't then remove Bio.GFF until Biopython 1.52 at the earliest, http://biopython.org/wiki/Deprecation_policy

What do you think Brad? How about using Bio.GFF3 instead?

Peter

From chapmanb at 50mail.com Mon Apr 6 22:08:26 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 6 Apr 2009 18:08:26 -0400
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904060625v4a49da2au76159eae18f707eb@mail.gmail.com>
References: <320fb6e00904060625v4a49da2au76159eae18f707eb@mail.gmail.com>
Message-ID: <20090406220826.GH43636@sobchak.mgh.harvard.edu>

Peter;

Thanks for the plug. GFF parsing is moving along; the main two things I would like to finish before proposing it for inclusion are writing of GFF files and putting GFF into BioSQL with the nested features. The code does work for parsing, and I've been using it for some real projects; anyone who would like to test it is more than welcome.

As far as the current Bio.GFF, that is a bit of a conundrum. The current code does work, and for some cases it would be nice to have the utility of working with GFF from a database. Eventually BioSQL from GFF may supplant that, but that should be finished and tested first. I would argue for keeping it in.

However, it is a bit confusing if someone is looking for a parser. It would make more sense if it lived under a namespace like Bio.GFF.DB. What do you think about adding a warning that it is going to move to a new namespace and then moving it there, if we don't hear any complaints, for 1.51? This is less cumbersome than a removal for users since it's just an import change.
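
For illustration only, a minimal sketch of what such a "this module is moving" warning might look like. The helper name warn_moved is hypothetical and not existing Biopython code, and Bio.GFF.DB is only the namespace floated above; the sketch relies on nothing but the standard warnings module:

```python
import warnings


def warn_moved(old_name, new_name, release):
    """Warn that a module is moving to a new namespace (hypothetical helper)."""
    message = "%s is moving to %s in Biopython %s; please update your imports." % (
        old_name, new_name, release)
    # stacklevel=2 reports the line that called this helper rather than
    # the warnings.warn() call itself.
    warnings.warn(message, DeprecationWarning, stacklevel=2)
    return message


# Assumed usage: the old module would call the helper once when imported.
# Here the warning is captured so that it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_moved("Bio.GFF", "Bio.GFF.DB", "1.51")
    print(caught[0].message)
```

Existing code keeps working while the warning is in place, so users would only have to change an import line once the move actually happens.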
Brad

From mjldehoon at yahoo.com Tue Apr 7 11:32:52 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 7 Apr 2009 04:32:52 -0700 (PDT)
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090406220826.GH43636@sobchak.mgh.harvard.edu>
Message-ID: <316000.69837.qm@web62407.mail.re1.yahoo.com>

Hi Brad,

Thanks for your work on the GFF parser; I'm dealing with GFF files quite a lot. Could you maybe give a simple example of how to use your GFF parser, once it's included into Biopython?

--Michiel.

From bartek at rezolwenta.eu.org Tue Apr 7 12:35:21 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 7 Apr 2009 14:35:21 +0200
Subject: [Biopython-dev] Merging branches
In-Reply-To: <6d941f120904060335m380ac820k97a558e8332fdf63@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com> <320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com> <6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com> <8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com> <6d941f120904060335m380ac820k97a558e8332fdf63@mail.gmail.com>
Message-ID: <8b34ec180904070535r3a6f23e8w9b917f7592930eda@mail.gmail.com>

Hi,

2009/4/6 Tiago Antão:
>> This means, that if you only need a part of what was committed as a
>> big changeset,
>> you will need to put an extra effort into selecting what you need.
>
> But how do you do that (other than manually copying files)?

I think that in this case you need to do this manually. If you care only about one file, copying it is the easiest option.
> Cherry pick seems to be commit based...

In fact the whole of git is commit based. It's not tracking files as such, but blobs of data.

>I would like to replace this:
>http://github.com/tiagoantao/biopython-popgen-test/blob/fa5ebc23e7aaabce94ae594d9a4f83be9bf90215/Bio/PopGen/Stats/Simple.py
>With this:
>http://github.com/dalloliogm/biopython/blob/cbaf6249cb91ed505cb575f09c2eaef3809872b9/Bio/PopGen/Stats/Simple.py
>It would be cool not to lose the history relationship (I suppose that
>would be the good practice).

Indeed, keeping history is the right thing and it was one of the reasons to switch to git. It would be perfect if Giovanni could "redo" some of his commits and split them into smaller operations, so that cherry picking commits would be possible. I know it's a pain...

> The point is: wanting to use part of a commit without losing history.
> In my case, I don't want to import a test_PopGen_Fst file that Gio has.
> That being said, I don't think this is a big deal. I was just to
> preserve the history connectivity between repositories. I think we
> can just use the old fashioned method of copying some files around.
> But it would be good to know if there is a "best practice" (which, I
> could not find out)

As far as I can tell, there is no way you could take only a part of a commit. The best practice is to make smaller, atomic commits. It has many advantages:
- it's easier to document a smaller change (I think it makes up for potentially more work because of more commits)
- you can then "undo" small locally committed changes before pushing them to the public repo
- cherry picking of nicely documented small changes is an easy job

In this particular case of changes in tests, I really think changes to one test should be committed separately from changes in other tests.
cheers
Bartek

From tiagoantao at gmail.com Tue Apr 7 16:43:49 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 7 Apr 2009 17:43:49 +0100
Subject: [Biopython-dev] PopGen Stats
Message-ID: <6d941f120904070943n7de7afa7m262dd4f4c0149cb@mail.gmail.com>

Hi,

I've started a page documenting the effort to implement statistics here:
http://biopython.org/wiki/PopGen_dev_Statistics
Anyone is welcome to participate.

I was expecting to have a personal hurdle during this week, which didn't happen. So I expect to be working heavily on this (finally).

Tiago

--
"A man who dares to waste one hour of time has not discovered the value of life" - Charles Darwin

From peter at maubp.freeserve.co.uk Tue Apr 7 19:38:50 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Tue, 7 Apr 2009 20:38:50 +0100
Subject: [Biopython-dev] Fwd: [blast-announce] BLAST 2.2.20 now available
In-Reply-To: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov>
References: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov>
Message-ID: <320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com>

Hi all,

There is a new version of BLAST out - we'll need to check if the NCBI's online server has been updated (if so, our unit test test_NCBI_qblast.py should catch any obvious issues). We'll also want to check the standalone version of BLAST is OK.

Point (2) below sounds interesting; previously, using BLAST databases with spaces in the path on Windows was rather hairy.

Peter

---------- Forwarded message ----------
From: mcginnis
Date: Apr 7, 2009 1:50 PM
Subject: [blast-announce] BLAST 2.2.20 now available
To: blast-announce at ncbi.nlm.nih.gov

New BLAST binaries are available on the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/)

The list of changes are:

1.) Ungapped blastn searches allow arbitrary reward/penalty scores.
2.) Spaces are allowed in database pathnames on windows
3.) Seedtop now has gilist support.
4.) Fix a bug that caused the number and order of queries to affect blastx results.
5.) Modified the 2-hit blastn algorithm so that no overlap is allowed between hits.

From jacobporter2002 at yahoo.com Wed Apr 8 02:27:21 2009
From: jacobporter2002 at yahoo.com (Jacob Porter)
Date: Tue, 7 Apr 2009 19:27:21 -0700 (PDT)
Subject: [Biopython-dev] Phylogeny modules for BioPython
Message-ID: <296822.1198.qm@web33706.mail.mud.yahoo.com>

Hi all,

My name is Jacob Porter, and I am a graduate student in the math department at UC Davis. I've done work before on phylogeny inference using so-called "phylogenetic invariants" that can be found at the website: http://www.shsu.edu/~ldg005/small-trees/

It appears to me that BioPython doesn't have much support for phylogeny inference and tools related to phylogeny inference.

I have applied to the Google Summer of Code (12 weeks of working part-time on a programming assignment), and I am looking for a project that could work with BioPython as I see a lot of potential in it. I can bring my expertise on phylogeny inference to this project to add some support for this.

I need three things from the community ASAP:

1) Ideas as to which of my several project ideas are the most useful to the BioPython community
2) Information as to what is already included in BioPython concerning phylogeny inference and related tools
3) A mentor that will help me with the project (and possibly work in conjunction with Nascent (https://www.nescent.org/wg_phyloinformatics/Main_Pagementors) I would need a 12-week schedule of tasks for the project (TBD), and answers to questions related to developing for BioPython. (I've worked with Python a lot before, so I shouldn't need much help with Python so much as I need help with BioPython).

Project 1:
Add support for popular phylogeny representation standards such as DND files. Give the ability to read and write such files. Convert between such files.
I need help in picking which standards to use and need help in picking which operations on these files are the most useful.

Project 2:
Add wrappers for modern (hopefully high throughput and accurate) phylogeny inference software written in C++/C. Examples of such software include neighbor-joining, MJOIN software (similar to neighbor-joining) (http://bio.math.berkeley.edu/mjoin/), Garli (http://www.molecularevolution.org/si/software/garli/), treeSVD (http://www.stat.uchicago.edu/~eriksson/software.html), and maximum parsimony. I would like to know which sort of phylogeny inference software is the most useful in your opinion. I assume no wrappers for such software exist.

Project 3:
Add analytic algorithms that use phylogeny in some way. Examples include bootstrapping and protein-protein interaction inference algorithms. (i.e. "Inferring protein interactions from phylogenetic distance matrices" by Gertz et al.) I need information as to what sort of algorithms would be useful.

Project 4:
Enhance phylogeny inference software further. MJOIN has bugs (I think it returns negative distances in some cases, and some modifications to it that I developed using phylogenetic invariants are seg-faulting).

Not all of these ideas will probably be able to be developed, so I need information as to what might be the most useful. I was thinking of focusing on Project 1 and Project 2 for the initial phase.

Any information will be appreciated, and any mentorship will be great. I would like a response quickly, so that I can inform Nascent of my plans.
Thanks,
Jacob Porter
UC Davis

From p.j.a.cock at googlemail.com Wed Apr 8 08:54:35 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 8 Apr 2009 09:54:35 +0100
Subject: [Biopython-dev] Phylogeny modules for BioPython
In-Reply-To: <296822.1198.qm@web33706.mail.mud.yahoo.com>
References: <296822.1198.qm@web33706.mail.mud.yahoo.com>
Message-ID: <320fb6e00904080154j433ef69cmd99b847240ee5b39@mail.gmail.com>

On 4/8/09, Jacob Porter wrote:
>
> Hi all,
>
> My name is Jacob Porter, and I am a graduate student in the math
> department at UC Davis. I've done work before on phylogeny inference
> ...
> It appears to me that BioPython doesn't have much support for
> phylogeny inference and tools related to phylogeny inference.

I'm sure there is room for improvement.

> I have applied to the Google Summer of Code (12 weeks of
> working part-time on a programming assignment), and I am
> looking for a project that could work with BioPython as I see
> a lot of potential in it. I can bring my expertise on phylogeny
> inference to this project to add some support for this.
>
> I need three things from the community ASAP:
>
> 1) Ideas as to which of my several project ideas are the
> most useful to the BioPython community

Personally, I might pick command line wrappers for existing command line tools. However, these don't actually make anything new possible, as writing your own command line is already fairly easy. This in itself wouldn't be that much work either.

> 2) Information as to what is already included in BioPython
> concerning phylogeny inference and related tools

Look at Bio.Nexus, plus somewhat related, Bio.AlignIO.

> 3) A mentor that will help me with the project (and
> possibly work in conjunction with Nascent
> (https://www.nescent.org/wg_phyloinformatics/Main_Pagementors)
> I would need a 12 -week schedule of tasks for the
> project (TBD), and answers to questions related to
> developing for BioPython. (I've worked with Python
> a lot before, so I shouldn't need much help with
> Python so much as I need help with BioPython).

Brad Chapman may be willing to mentor a GSoC student; have a look back at the recent email discussions here. In particular, Nick Matzke has already expressed some interest in Biogeographical and community phylogenetics for Biopython (there is a wiki page on open-bio.org on this).

> Project 1:
> Add support for popular phylogeny representation
> standards such as DND files. Give the ability to
> read and write such files. Convert between such
> files. I need help in picking which standards to use
> and need help in picking which operations on these
> files is the most useful.

We have this already in Bio.Nexus, but there is still room for improvement - see Bug 2788 for example.

> Project 2:
> Add wrappers for modern (hopefully high throughput
> and accurate) phylogeny inference software written in
> C++/C. Examples of such software include
> neighbor-joining, MJOIN software (similar to
> neighbor-joining) (http://bio.math.berkeley.edu/mjoin/),
> Garli (http://www.molecularevolution.org/si/software/garli/),
> treeSVD (http://www.stat.uchicago.edu/~eriksson/software.html),
> and maximum parsimony. I would like to know which
> sort of phylogeny inference software is the most useful
> in your opinion. I assume no wrappers for such software
> exist.

Well, Bio.Nexus is a great help with certain tools. There is scope for adding more command line wrappers though (I like quick-join and also quicktree for NJ tree building).

> Project 3:
> Add analytic algorithms that use phylogeny in some
> way. Examples include bootstrapping and protein-protein
> interaction inference algorithms. (i.e. "Inferring protein
> interactions from phylogenetic distance matrices" by
> Gertz et al.) I need information as to what sort of
> algorithms would be useful.

I feel that this is still very much an active area of research, and there are no clear gold standards.
However, perhaps some published algorithms may be worth re-implementing in Biopython. I would still tend to favour more general work for Biopython that would support people implementing any/their own algorithm.

> Project 4:
> Enhance phylogeny inference software further.
> MJOIN has bugs (I think it returns negative distances
> in some cases, and some modifications to it that I
> developed using phylogenetic invariants are seg-faulting).

Fixing any bug in MJOIN sounds like a good idea - but doesn't really affect Biopython directly.

> Not all of these ideas will probably be able to be
> developed, so I need information as to what might
> be the most useful. I was thinking of focusing on
> Project 1 and Project 2 for the initial phase.
>
> Any information will be appreciated, and any
> mentorship will be great. I would like a response
> quickly, so that I can inform Nascent of my plans.

Peter.

P.S. It's Biopython, not BioPython

From chapmanb at 50mail.com Wed Apr 8 12:32:26 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 8 Apr 2009 08:32:26 -0400
Subject: [Biopython-dev] Phylogeny modules for BioPython
In-Reply-To: <320fb6e00904080154j433ef69cmd99b847240ee5b39@mail.gmail.com>
References: <296822.1198.qm@web33706.mail.mud.yahoo.com> <320fb6e00904080154j433ef69cmd99b847240ee5b39@mail.gmail.com>
Message-ID: <20090408123226.GL43636@sobchak.mgh.harvard.edu>

Jacob;

Thanks much for your interest in Biopython for Summer of Code; glad to see a discussion here about your proposal. Peter's comments are great; I will add to them from the SoC perspective.

> > I have applied to the Google Summer of Code (12 weeks of
> > working part-time on a programming assignment)

SoC is a full time commitment for the summer. Your proposal also lists some conflicts (classes, other research) for the summer months. In your updated proposal you should be explicit about these and describe how you plan to make up time you miss during the first two weeks of the quarter.
More generally, your proposal needs a detailed plan of deliverables on a week to week basis over the project timeline, starting with coding on May 23rd:

http://socghop.appspot.com/document/show/program/google/gsoc2009/timeline

This is the last hour for refining proposals, so you will need to update your proposal quickly for us to still have time to consider it. I would recommend copying your current proposal to a Google Doc, adding all of the specifics needed, and then submitting a link to the open document as a comment to your initial proposal.

> Brad Chapman may be willing to mentor a GSoC student, have a look back
> of the recent email discussions here. In particular, Nick Matzke has
> already expressed some interest in Biogeographical and community
> phylogenetics for Biopython (there is a wiki page on open-bio.org on
> this).

I am definitely willing to help; spots will be very competitive throughout the program. Echoing Peter's comments, I would put together a project proposal that tackles:

- Improving parsing support in Bio.Nexus, based on existing code and bug reports, and other suggestions you might have.

- Providing code wrapping for other phylogeny software. Since the usefulness of different algorithms depends heavily on the context in which they are used, you will not find a consensus about which program is most useful. My suggestion is to propose wrappers for several useful programs covering the spectrum of possibilities. In addition to the ones you listed, a couple of others are:
  RAxML http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm
  FastTree http://www.microbesonline.org/fasttree/index.html

- A higher level API over the parsing and command line program support that helps users with specific phylogenetic tasks. Based on your experience and input from the Biopython community of users, this would have the goal of providing a simple way to do common tasks.
This should be a combination of code to surround repetitive items, and cookbook style documentation to help people with specific phylogenetic problems.

Other general suggestions:

- Tests. Please describe your plans to write unit tests for all the code you write.

- Documentation. Please do leave time in your project plan to fully document using your proposed code.

- Projects 3 and 4, as Peter suggests, are out of the scope of GSoC. 3, specifically, is more of a research project.

Finally, a few meta-items from your e-mail meant as helpful advice:

> It appears to me that BioPython doesn't have much support for
> phylogeny inference and tools related to phylogeny inference.

I understand this is an attempt to provide motivation for your proposal, but you should do so in a way that does not disparage the work of the people you are soliciting advice from. Your request would be better received if you described it in the context of improving existing phylogenetic support in Biopython.

> I need three things from the community ASAP:
[...]
> I would like a response quickly

No one likes to be told what to do, much less a group you are requesting help and hopefully a job from. Again, you should think about how your phrasing will be interpreted by those reading it.

> Nascent

You twice misspelled this: NESCent. Mistakes happen, but it reflects badly on your commitment to the project to not be able to spell the name of the organization you would like to work with. These are the small things you should be careful about and double check.
Thanks again for your interest and looking forward to seeing your revised project plan,
Brad

From chapmanb at 50mail.com Wed Apr 8 12:49:08 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 8 Apr 2009 08:49:08 -0400
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <316000.69837.qm@web62407.mail.re1.yahoo.com>
References: <20090406220826.GH43636@sobchak.mgh.harvard.edu> <316000.69837.qm@web62407.mail.re1.yahoo.com>
Message-ID: <20090408124908.GN43636@sobchak.mgh.harvard.edu>

Hi Michiel;

> Thanks for your work on the GFF parser; I'm dealing with GFF files
> quite a lot. Could you maybe give a simple example of how to use your
> GFF parser, once it's included into Biopython?

Awesome; I'm glad it will be useful. I'd definitely welcome any feedback you have on the API or implementation. At this stage we can be flexible and hopefully get it finalized before it hits Biopython.

I will get some user documentation together soon, but here is some basic usage. To parse an entire GFF file, getting all features at once:

    from BCBio.GFF.GFFParser import GFFAddingIterator

    gff_iterator = GFFAddingIterator()
    rec_dict = gff_iterator.get_all_features(gff_file)

The returned dictionary is like a dictionary from SeqIO.to_dict; keys are ids and values are SeqRecords. You can also seed the parser with an initial dictionary containing sequences or other features, and the features from the GFF file will be added to those records:

    from Bio import SeqIO

    with open(seq_file) as seq_handle:
        seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
    gff_iterator = GFFAddingIterator(seq_dict)

If a file is very large, you have two ways of limiting the size of items parsed. The first is to specify which items you are interested in and return only those.
This code will parse out coding transcripts on chromosome I:

    cds_limit_info = dict(
        gff_source_type = [('Coding_transcript', 'gene'),
                           ('Coding_transcript', 'mRNA'),
                           ('Coding_transcript', 'CDS')],
        gff_id = ['I']
        )
    rec_dict = gff_iterator.get_all_features(gff_file,
            limit_info=cds_limit_info)

The second is to use an iterator over a section of the file:

    for rec_dict in gff_iterator.get_features(gff_file,
            target_lines=1000000):
        # handle partial rec dictionary of first 1000000 lines
        pass

Finally, there is an interface to examine a GFF file and figure out useful ways to limit it. This will give you a dictionary of all possible ways to limit a file along with the counts in each:

    gff_examiner = GFFExaminer()
    possible_limits = gff_examiner.available_limits(gff_file)

and this will give a dictionary of the parent-child relationships in the file:

    gff_examiner = GFFExaminer()
    pc_map = gff_examiner.parent_child_map(gff_file)

Since GFF providers tend to differ in how they structure their information, this helps get a quick overview of the file to determine how to manage it.

Happy to hear about thoughts you might have.

Thanks,
Brad

>
> --Michiel.
>
>
> --- On Mon, 4/6/09, Brad Chapman wrote:
>
> > From: Brad Chapman
> > Subject: Re: [Biopython-dev] Bio.GFF and Brad's code
> > To: biopython-dev at lists.open-bio.org
> > Date: Monday, April 6, 2009, 6:08 PM
> > Peter;
> > Thanks for the plug. GFF parsing is moving along; the main feature
> > two things I would like to finish before proposing it for inclusion
> > are writing of GFF files and putting GFF into BioSQL with the nested
> > features. The code does work for parsing, and I've been using it for
> > some real projects; anyone who would like to test it is more than
> > welcome.
> >
> > As far as the current Bio.GFF, that is a bit of a conundrum. The
> > current code does work and for some cases it would be nice of having
> > the utility of working with GFF from a database.
Eventually > > BioSQL > > from GFF may supplant that, but that should be finished and > > tested > > first. I would argue for keeping it in. > > > > However, it is a bit confusing if someone is looking for a > > parser. It > > would make more sense if it lived under a namespace like > > Bio.GFF.DB. > > What do you think about adding a warning that it is going > > to move to > > a new namespace and then moving it there, if we don't > > hear any > > complaints, for 1.51? This is less cumbersome than a > > removal for > > users since it's just an import change. > > > > Brad > > > > > > > > > Brad has been working on his GFF parsing code - see > > progress reports > > > on his blog http://bcbio.wordpress.com/ and his code > > on github, > > > http://github.com/chapmanb/bcbb/tree/master/gff > > > > > > Potentially this could make it into Biopython 1.51, > > and I was just > > > thinking about where the code would go. Brad is > > supporting both GFF3 > > > and the loosely defined GFF2 variants, so Bio.GFF > > seems a good place. > > > There would also be a wrapper under Bio.SeqIO for > > loading GFF files as > > > SeqRecord objects (I haven't played with > > Brad's code, but it can do > > > this already). > > > > > > However, we already have a Bio.GFF module from Michael > > Hoffman created > > > back in 2002 which accesses MySQL General Feature > > Format (GFF) > > > databases created with BioPerl. Perhaps we should > > poll the main > > > discussion list now, and if there are no responses > > from people using > > > it, we could deprecate Bio.GFF for Biopython 1.50? > > Under our current > > > deprecation policy we shouldn't then remove > > Bio.GFF until Biopython > > > 1.52 at the earliest, > > http://biopython.org/wiki/Deprecation_policy > > > > > > What do you think Brad? How about using Bio.GFF3 > > instead? 
> > > > > > Peter > > > _______________________________________________ > > > Biopython-dev mailing list > > > Biopython-dev at lists.open-bio.org > > > > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > From bugzilla-daemon at portal.open-bio.org Wed Apr 8 22:55:59 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 8 Apr 2009 18:55:59 -0400 Subject: [Biopython-dev] [Bug 2808] New: Bio.SeqIO "ig" format parser doesn't deal with optional 1 terminator Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2808 Summary: Bio.SeqIO "ig" format parser doesn't deal with optional 1 terminator Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk While working on new unit test test_Emboss.py I noticed that EMBOSS seqret creates ig files where the sequence includes a terminal digit one. Further research online suggests this is an optional feature of the file format, although not commonly used. See: http://bmerc-www.bu.edu/needle-doc/latest/seq-formats.html#seq-file-format The Bio.SeqIO "ig" parser should be aware of the (optional) terminal "1" marker, and not include it in the returned sequence. Perhaps we should even add this when writing the files. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From chapmanb at 50mail.com Fri Apr 10 13:10:34 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 10 Apr 2009 09:10:34 -0400 Subject: [Biopython-dev] Invitation for Biopython news coordinators In-Reply-To: <49DD5575.4040901@student.otago.ac.nz> References: <20090406230542.GK43636@sobchak.mgh.harvard.edu> <49DD5575.4040901@student.otago.ac.nz> Message-ID: <20090410131034.GH54672@sobchak.mgh.harvard.edu> David; Thanks for taking the time to write; it is great to hear that you are interested. Copying this to the dev list so others can comment and you can feel free to discuss as much as you want. > I'd be keen to help spread the good word about bio-python, I'm a very > novice programmer who has been using the tools to work on some 454 > transcriptome data. I will probably never be a good enough programmer to > contribute code to the project so would see this as a way to "give > something back". Perfect. Getting involved is the first step; you'd be surprised how much you can learn just by taking on new tasks. I started helping with Biopython by writing documentation. > For me as a n00b the most useful resource by far has been the cookbook - > seeing some working scripts that I could change to suit my ends has > helped me get to the point that I can write much more generalised code > for my project 'from scratch'. To that end I think it would be really > helpful to highlight work that other people have done, either published > or made available by authors, with a little detail on the questions > and the way BioPython was used to get at them. We could extend it to > show some "use cases" for BioPython working with other programs or how > new features can be used once they are included in the main release. > > To me the most obvious way of presenting such information would be a > blog, we could invite authors and developers to make short posts and > failing that I'd be happy write up posts summarising published research. 
> We could also try an aggregate blogs from the devs and anyone else > talking about biopython "in the wild". This sounds great. You are welcome to use the twitter account, news posts, the wiki, or a blog -- however you see fit. For your aggregation idea, you might want to take a look at friendfeed. It's pretty simple to set up a room and pull in RSS feeds, twitter postings, and what not. There is a Python for Bioinformatics room: http://friendfeed.com/rooms/python-for-bioinformatics Most feeds come from general Python sources so it is a bit more broad, but is a good starting place. I know some of the admins (Chris, Paulo, Andrew) are around here, and may want to chime in. For publications, Peter has done a lot of work on identifying papers that use Biopython: http://biopython.org/wiki/Publications Building on this to include short reusable examples from the research would be very useful. > Anyway, those are a few ideas, I'm definitely keen to help out and to > take on board any other ideas that are out there. Great, let us know how you want to get started. Feel free to start with something small and expand from there. Peter can help out with account information for twitter; if you need other things just ask away. Brad > Cheers, > David > > Brad Chapman wrote: > > Biopythonistas; > > Communication is a key component of successful open source projects. > > The challenges of distributed programming by volunteers can be > > overcome by ensuring that the whole community is aware of > > interesting discussions, new contributions, and development goals. > > Traditionally, this communication has happened through our mailing > > lists, wiki pages, and bug tracking system. While these will > > continue to to be useful resources, new methods of disseminating > > information are changing how we interact through the web. > > > > I'd like to issue an invitation for anyone interested in helping > > revolutionize how Biopython news is disseminated. 
We are looking for > > contributors from the community to brainstorm new ways to make the > > discussions that happen at biopython.org accessible. You would > > actively follow development here and on the development lists and > > distill this information into useful quick bullet points for those > > interested in Biopython but too busy to follow detailed discussions. > > > > We are proposing two ways to do this: > > > > - Monthly highlights on our news server: > > http://news.open-bio.org/news/category/obf-projects/biopython/ > > The RSS feed from these posts are currently widely distributed around the > > internet. > > > > - More frequent pointers to interesting discussions or other items > > of interest happening in Biopython through our Twitter account: > > http://twitter.com/biopython > > > > This is an opportunity for those of you who are looking to become > > more involved, and would like to learn more about Biopython by > > following all of the coding activity more closely. The position is > > very flexible and we are happy to have one or more people take it > > on; we would also encourage you to be as creative as you want in > > doing so. > > > > I see this as an chance to both provide information and to highlight > > the great work people do at Biopython. If you are interested in > > taking on this role please respond with your ideas. 
Thanks for your interest,
> >
> > Brad
> > _______________________________________________
> > BioPython mailing list - BioPython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
> >

From bugzilla-daemon at portal.open-bio.org Fri Apr 10 14:13:58 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 10 Apr 2009 10:13:58 -0400
Subject: [Biopython-dev] [Bug 2809] New: Adding startswith and endswith methods to the Seq object
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2809

           Summary: Adding startswith and endswith methods to the Seq object
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
OtherBugsDependingOnThis: 2351

As part of making the Seq object more like the Python string (Bug 2351), we need alphabet aware startswith and endswith methods. Patch to follow.

There are many possible use cases for this. One example which prompted me to work on this was taking SeqRecord objects from sequencing reads (a FASTQ file read in with Bio.SeqIO) where some include a PCR primer associated prefix/suffix which I want to strip off (by slicing the SeqRecord). To do this I need to know if a given SeqRecord's sequence starts with (or ends with) a given primer sequence (or tuple of primer sequences).

Current work around,

    str(record.seq).startswith(prefix)

Patch to follow, which will allow record.seq.startswith(prefix) directly.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
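The workaround described in this report can be sketched roughly as follows. This is only an illustration on plain Python strings, standing in for str(record.seq); the helper name strip_primer and the primer sequences are made up for the example and are not part of Biopython or of the patch attached to the bug.

```python
def strip_primer(sequence, prefixes=(), suffixes=()):
    """Return sequence with at most one matching prefix and one suffix removed.

    Hypothetical helper for illustration; 'sequence' is a plain string,
    e.g. the result of str(record.seq).
    """
    # Remove the first primer prefix that matches, if any.
    for prefix in prefixes:
        if prefix and sequence.startswith(prefix):
            sequence = sequence[len(prefix):]
            break
    # Remove the first primer suffix that matches, if any.
    for suffix in suffixes:
        if suffix and sequence.endswith(suffix):
            sequence = sequence[:-len(suffix)]
            break
    return sequence


# Made-up read: primer prefix + insert + primer suffix.
read = "GATTACA" + "ACGTACGT" + "TTGGCC"
trimmed = strip_primer(read, prefixes=("GATTACA", "CCCGGG"), suffixes=("TTGGCC",))
print(trimmed)
```

With a SeqRecord the same lengths would presumably be used to slice the record itself rather than a string, as the report suggests.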
From bugzilla-daemon at portal.open-bio.org Fri Apr 10 14:13:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 10 Apr 2009 10:13:59 -0400
Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string?
In-Reply-To:
Message-ID: <200904101413.n3AEDx5I004913@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2351

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |2809

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri Apr 10 14:15:27 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 10 Apr 2009 10:15:27 -0400
Subject: [Biopython-dev] [Bug 2809] Adding startswith and endswith methods to the Seq object
In-Reply-To:
Message-ID: <200904101415.n3AEFRRb005139@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2809

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-10 10:15 EST -------
Created an attachment (id=1275)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1275&action=view)
Patch to Bio/Seq.py and Tests/test_Seq_objs.py

Adds startswith and endswith methods to the Seq object, and tests these with simple doctest and a longer separate unit test.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Fri Apr 10 14:46:02 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 10 Apr 2009 15:46:02 +0100 Subject: [Biopython-dev] Tutorial & Cookbook Message-ID: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com> David wrote: >> For me as a n00b the most useful resource by far has been the cookbook - >> seeing some working scripts that I could change to suit my ends has >> helped me get to the point that I can write much more generalised code >> for my project 'from scratch'. ... When you said "cookbook", did you mean the Biopython Tutorial & Cookbook? http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf There are a couple of other documents under the "Cookbook" folder here: http://biopython.org/DIST/docs/cookbook/Restriction.html http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf I have been wondering if the "Biopython Tutorial & Cookbook" should be separated now - it is getting a bit long (which in some ways is a good thing!). Maybe we should re-title it as just the "Biopython Tutorial". Some bits of the current "Cookbook chapter" might be moved into the main body of the tutorial (e.g. the alignment stuff), but having the cookbook entries separate might be a good idea. For a separate "Cookbook", we could again use LaTeX for another HTML/PDF document (or set of documents) but perhaps just a series of pages on the wiki would be more accessible - and much easier for people to contribute to? We'd need to organize things (e.g. a cookbook category on the wiki) to make sure everything is still accessible. As a bonus, it would give us more hits on Google - which is probably a good thing. On the other hand, it would be very good if all our cookbook use cases could be rolled into the unit test framework - which wouldn't be so easy if they live on the wiki. Something based on doctests might work... 
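As a rough illustration of the doctest idea, a cookbook-style entry could carry its own checked example in the docstring. The function below is a made-up stand-in (fraction_gc is not a Biopython function), and this is only a sketch of the approach, not a proposal for how the unit test framework would actually pick such examples up.

```python
import doctest


def fraction_gc(sequence):
    """Return the fraction of G and C letters in a sequence string.

    Hypothetical cookbook example; the lines below are run by doctest.

    >>> fraction_gc("GGCCAATT")
    0.5
    >>> fraction_gc("")
    0.0
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / float(len(upper))


if __name__ == "__main__":
    # Runs every docstring example in this module and reports failures.
    print(doctest.testmod())
```

The same text can then serve as both documentation and test, which is the appeal; examples kept only on a wiki page would not get this checking automatically.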
Peter

From bugzilla-daemon at portal.open-bio.org Fri Apr 10 17:29:06 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 10 Apr 2009 13:29:06 -0400
Subject: [Biopython-dev] [Bug 2808] Bio.SeqIO "ig" format parser doesn't deal with optional 1 terminator
In-Reply-To:
Message-ID: <200904101729.n3AHT6g0020169@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2808

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-10 13:29 EST -------
(In reply to comment #0)
>
> The Bio.SeqIO "ig" parser should be aware of the (optional) terminal "1"
> marker, and not include it in the returned sequence.
>

Fixed in CVS,
Bio/SeqIO/IgIO.py revision 1.5
Tests/test_Emboss.py revision 1.10

>
> Perhaps we should even add this when writing the files.
>

We don't write out ig files so this isn't an issue at the moment.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
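The behaviour asked for in this bug can be sketched in a few lines. This is not the code committed to Bio/SeqIO/IgIO.py; strip_ig_terminator is a hypothetical helper that assumes the sequence lines of one record have already been collected as strings.

```python
def strip_ig_terminator(sequence_lines):
    """Join the sequence lines of one IG record, dropping a trailing "1".

    Hypothetical helper for illustration only. The terminal "1" is
    optional, so sequences without it are returned unchanged.
    """
    sequence = "".join(line.strip() for line in sequence_lines)
    if sequence.endswith("1"):
        sequence = sequence[:-1]
    return sequence


# Made-up record body, with and without the optional terminator.
print(strip_ig_terminator(["ACGTAC", "GGTT1"]))
print(strip_ig_terminator(["ACGTAC", "GGTT"]))
```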
From biopython at maubp.freeserve.co.uk Fri Apr 10 18:12:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 10 Apr 2009 19:12:12 +0100 Subject: [Biopython-dev] Bio.EMBOSS wrappers Message-ID: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> Hi Those of you following the CVS RSS feed will have noticed a lot of activity on my new unit test test_Emboss.py, which now works on Windows, Linux and Mac OS (provided EMBOSS is installed), and does four main tasks: - runs needle, checks Bio.AlignIO can parse the output - runs water, checks Bio.AlignIO can parse the output - runs seqret to check Bio.SeqIO - runs seqret to check Bio.AlignIO It would probably be logical to also include tests for the EMBOSS version of primer3 here too, but I am not familiar with this tool and the Biopython parsers. For now I build the command line strings for seqret and needle "by hand", as Bio.EMBOSS doesn't have wrappers for them yet. I also note that the existing wrappers in Bio.EMBOSS don't support the very handy -auto and -filter command line arguments supported by all (or at least most) of the EMBOSS command line tools. Using -auto turns off any user prompting for missing arguments (very important for calling from a script). Using -filter is useful for running the tools with pipes (i.e. no output file is required as stdout can be used instead, and potentially no input file if we write to stdin correctly). Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding these features? The needle wrapper would make an excellent basis for a new water wrapper. For adding -auto and -filter support, there is probably a clever approach with a common EMBOSS specific subclass of Bio.Application.AbstractCommandline, but I haven't tried. 
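For illustration, building the needle command line "by hand" with the two switches discussed above might look something like this. The function name and its defaults are assumptions made for the sketch; it is not the Bio.EMBOSS wrapper and does not use Bio.Application, it only assembles an argument list.

```python
def build_needle_command(asequence, bsequence, outfile=None,
                         gapopen=10.0, gapextend=0.5,
                         auto=True, use_filter=False):
    """Return an argument list for the EMBOSS needle program.

    Hypothetical helper: the option names follow the EMBOSS command
    line, everything else here is an assumption for the example.
    """
    args = ["needle",
            "-asequence", asequence,
            "-bsequence", bsequence,
            "-gapopen", str(gapopen),
            "-gapextend", str(gapextend)]
    if outfile is not None:
        args.extend(["-outfile", outfile])
    if auto:
        # Turn off prompting for missing arguments.
        args.append("-auto")
    if use_filter:
        # Run as a filter, so that pipes can be used.
        args.append("-filter")
    return args


print(" ".join(build_needle_command("alpha.faa", "beta.faa", outfile="needle.txt")))
```

A list like this could then be handed to the subprocess module; a shared base class would presumably add the -auto and -filter handling once for all tools instead of per function.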
Peter From mjldehoon at yahoo.com Sat Apr 11 02:26:45 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 10 Apr 2009 19:26:45 -0700 (PDT) Subject: [Biopython-dev] Tutorial & Cookbook In-Reply-To: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com> Message-ID: <93403.18413.qm@web62406.mail.re1.yahoo.com> --- On Fri, 4/10/09, Peter wrote: > I have been wondering if the "Biopython Tutorial & > Cookbook" should be separated now - it is getting > a bit long (which in some ways is a good thing!). In my opinion, it doesn't matter if the "Biopython Tutorial & Cookbook" is long. I guess that few people actually print this document anyway. I am in favor of having one "official" documentation for Biopython. If we have one Tutorial and one Cookbook, we'll have lots of overlap between the two, it'll be unclear what should be in the Tutorial and what in the Cookbook, and we'll have to make sure the two are consistent. A cookbook on the Wiki could be helpful though, and since the Wiki pages can be fixed easily we won't have to worry so much about inconsistencies with the official documentation. > Maybe we should re-title it as just the "Biopython Tutorial". That sounds like a good idea. > Some bits of the current "Cookbook chapter" might be moved > into the main body of the tutorial (e.g. the alignment > stuff), Yes. The cookbook chapter has the same problem as a cookbook document; it's not clear what should go there. A more logical place for cookbook-style examples is at the end of each chapter in the documentation. For example, Bio.Entrez has a bunch of cookbook-style examples at the end of its chapter in the Biopython Tutorial & Cookbook. Currently, there are not so many sections left in the cookbook chapter; most of them have become full-fledged chapters and were moved out of the cookbook chapter. 
> For a separate "Cookbook", we could again use LaTeX for another > HTML/PDF document (or set of documents) but perhaps just a > series of pages on the wiki would be more accessible - and much > easier for people to contribute to? +1 for the wiki, -1 for another HTML/PDF document. > On the other hand, it would be very good if all our > cookbook use cases > could be rolled into the unit test framework - which > wouldn't be so > easy if they live on the wiki. Something based on doctests > might work... Whereas it can be useful if some cookbook examples are part of the unit tests, I don't think it's absolutely required. I see a wiki cookbook more as complementary to the unit tests. --Michiel. From mjldehoon at yahoo.com Sat Apr 11 11:29:47 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 11 Apr 2009 04:29:47 -0700 (PDT) Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090408124908.GN43636@sobchak.mgh.harvard.edu> Message-ID: <830379.9837.qm@web62402.mail.re1.yahoo.com> Hi Brad, Thanks for the examples; that clarified it a lot. I have a couple of suggestions of how to make the GFF parser more generally usable, and more consistent with other parsers in Biopython. Looking at your first example: > from BCBio.GFF.GFFParser import GFFAddingIterator > > gff_iterator = GFFAddingIterator() > rec_dict = gff_iterator.get_all_features(gff_file) > > The returned dictionary is like a dictionary from > SeqIO.to_dict; > keys are ids and values are SeqRecords. It's not clear to me why we need an iterator for GFF files. Can't we just use Python's line iterator instead? I would expect code like this: from Bio import GFF handle = open("my_gff_file.gff") for line in handle: # call the appropriate GFF function on the line The second point is about GFFAddingIterator.get_all_features. If this is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict? 
Then the code looks as follows:

    from Bio import GFF
    handle = open("my_gff_file.gff")
    rec_dict = GFF.to_dict(handle)

Another thing to consider is that IDs in the GFF file do not need to be unique. For example, consider a GFF file that stores genome mapping locations for short sequences stored in a Fasta file. Since each sequence can have more than one mapping location, we can have multiple lines in the GFF file for one sequence ID.

The last point is about storing SeqRecords in rec_dict. A GFF file typically does not store sequences; if it does, it's not clear which field in the GFF file does. On the other hand, a SeqRecord often does not contain the chromosomal location, which is what the GFF file stores. So why use a SeqRecord for GFF information?

Sorry for bringing up lots of issues. But I think that a GFF parser will be heavily used, so we should optimize its design as much as possible.

Best,
--Michiel.

From biopython at maubp.freeserve.co.uk Sun Apr 12 13:16:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 12 Apr 2009 14:16:58 +0100
Subject: [Biopython-dev] Bio.EMBOSS wrappers
In-Reply-To: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
Message-ID: <320fb6e00904120616u390cfe56w3889804d2bffd385@mail.gmail.com>

On 4/10/09, Peter wrote:
> Hi
>
> Those of you following the CVS RSS feed will have noticed a lot of
> activity on my new unit test test_Emboss.py, which now works on
> Windows, Linux and Mac OS (provided EMBOSS is installed), and does
> four main tasks:
>
> - runs needle, checks Bio.AlignIO can parse the output
> - runs water, checks Bio.AlignIO can parse the output
> - runs seqret to check Bio.SeqIO
> - runs seqret to check Bio.AlignIO

It now also runs transeq to check the Bio.Seq translations on all common tables. This has shown up some differences in our translations for ambiguous sequences - I may have found a bug in EMBOSS...
Peter From sbassi at clubdelarazon.org Mon Apr 13 01:57:52 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Sun, 12 Apr 2009 22:57:52 -0300 Subject: [Biopython-dev] Fwd: [blast-announce] BLAST 2.2.20 now available In-Reply-To: <320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com> References: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov> <320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com> Message-ID: <9e2f512b0904121857o3f8da862ycf40a0510f9dbd51@mail.gmail.com> On Tue, Apr 7, 2009 at 4:38 PM, Peter wrote: > Hi all, .... > We'll also want to check the standalone version of BLAST is OK. I've made the following check: Run a blast query (with blast 2.2.20) with output in xml. Run my python script that converts XML to HTML using Biopython (under Biopython 1.50beta) and it worked OK. The script deals with most information bits found in an XML blast file so if there is any change in the blast output, this program would crash. From eric.talevich at gmail.com Mon Apr 13 03:13:32 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 12 Apr 2009 23:13:32 -0400 Subject: [Biopython-dev] PDB tidy script In-Reply-To: <2d7c25310904030631u56c642d6k83355d6bc4cd5d19@mail.gmail.com> References: <320fb6e00903231405l479ddcc6of9cd0c1aa8fd98d4@mail.gmail.com> <2d7c25310904030631u56c642d6k83355d6bc4cd5d19@mail.gmail.com> Message-ID: <3f6baf360904122013k21aa8efcm4aae0ac872e8e6af@mail.gmail.com> Hi Thomas & everyone, I've started a separate branch on GitHub for this work: http://github.com/etal/biopython/tree/pdbtidy I pushed one small change just now (partly to play with git branches), which is basically the example code I gave earlier. 
It wraps the PDBLoader and parse_pdb_header classes, and sticks a finger into PDBList too, so that parsing and building a structure from a PDB file is a one-liner for both local and RCSB-hosted files:

>>> from Bio import PDB
>>> prot = PDB.load('pdb2hmb.ent')
>>> dir(prot)
['__doc__', '__init__', '__module__', 'author', 'compound',
'deposition_date', 'head', 'journal', 'journal_reference', 'keywords',
'name', 'release_date', 'resolution', 'source', 'structure',
'structure_method', 'structure_reference']

Or:

>>> PDB.fetch('2hmb')
/usr/lib/python2.5/site-packages/Bio/PDB/PDBList.py:240: UserWarning:
Retrieving ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/hm/pdb2hmb.ent.gz
  warn("Retrieving %s" % url)

(The warning is supposed to be a comment, but that cleanup is happening in another branch: http://github.com/etal/biopython/tree/bug2754 ).

My idea is to pull all of the parse_pdb_header data out of the PDBParser and Structure classes, and store it in the PDBLoader wrapper instead. The existing "header" attributes can point to the PDBLoader parent if it exists, or temporarily contain None or "" if necessary to avoid breaking scripts, according to the deprecation plan. Annotations could either stay in Structure or move to Loader. Then we'd have a fast, lean, consistent hierarchy of classes for 3D structure work, and an easy API for loading and exploring PDB files interactively.

Part of the pdbtidy concept is to check that the PDB header is consistent with the structure it represents, so I'd like the API for metadata to be just as nice as the existing one for 3D structure.

So, this is just a start, but I hope the intent is clear enough that someone will tell me to stop if the whole idea is misguided.
Thanks, Eric From biopython at maubp.freeserve.co.uk Mon Apr 13 09:51:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 10:51:38 +0100 Subject: [Biopython-dev] Fwd: [blast-announce] BLAST 2.2.20 now available In-Reply-To: <9e2f512b0904121857o3f8da862ycf40a0510f9dbd51@mail.gmail.com> References: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov> <320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com> <9e2f512b0904121857o3f8da862ycf40a0510f9dbd51@mail.gmail.com> Message-ID: <320fb6e00904130251k3e3e77f2x20e03fba19fd8ff7@mail.gmail.com> On Mon, Apr 13, 2009 at 2:57 AM, Sebastian Bassi wrote: > On Tue, Apr 7, 2009 at 4:38 PM, Peter wrote: >> Hi all, > .... >> We'll also want to check the standalone version of BLAST is OK. > > I've made the following check: > Run a blast query (with blast 2.2.20) with output in xml. Run my > python script that converts XML to HTML using Biopython (under > Biopython 1.50beta) and it worked OK. The script deals with most > information bits found in an XML blast file so if there is any change > in the blast output, this program would crash. Great - thanks for checking that :) Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 10:44:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 11:44:29 +0100 Subject: [Biopython-dev] BOSC 2009 Message-ID: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com> Hello Biopythoneers, Those of you following the dev-mailing list or the OBF news feed will know that talk abstracts for BOSC 2009 are due in today, see http://www.open-bio.org/wiki/BOSC_2009 I should to be able to attend and present the Biopython Project Update, and a few other Biopython developers may also be around too, so some sort of hackathon is in the air. It is a bit unfortunate the deadline was scheduled on the Easter break, as I'm sure quite a few of you will be on holiday, but here is an outline abstract. 
If anyone has comments, please let me know (on the list or directly) in the next couple of hours... Biopython Project Update (draft abstract for BOSC 2009) In this talk we present the current status of the Biopython project, focusing on features developed in the last year, and future plans for the project. The Oxford University Press journal Bioinformatics has recently published an application note describing Biopython: Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, and de Hoon MJ. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009 Mar 20. doi:10.1093/bioinformatics/btp163 Since BOSC 2008, Biopython 1.49 has been released. This was an important milestone in bringing support for Python 2.6, and in terms of our dependence on Numerical Python as we made the transition from the obsolete Numeric library to NumPy. Biopython 1.49 also added more biological methods to our core sequence object. April 2009 will see the release of Biopython 1.50 (at the time of writing, a beta has already been released). Some of the new features include: 1. GenomeDiagram by Leighton Pritchard has been integrated into Biopython as the Bio.Graphics.GenomeDiagram module. 2. A new module Bio.Motif has been added, which is intended to replace the existing Bio.AlignAce and Bio.MEME modules. 3. Bio.SeqIO can now read and write FASTQ and QUAL files used in second generation sequencing work. Biopython will celebrate its 10th Birthday later this year, we will present a brief history of the project and current work. This includes the evaluation of git (and github) as a possible distributed version control system (DVCS) to replace our existing very stable CVS server hosted by the Open Bioinformatics Foundation, which we hope will encourage more participation in the project. 
-- Thanks, Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 12:16:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 13:16:10 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <830379.9837.qm@web62402.mail.re1.yahoo.com> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> Message-ID: <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> On Sat, Apr 11, 2009 at 12:29 PM, Michiel de Hoon wrote: > > Hi Brad, > > Thanks for the examples; that clarified it a lot. I haven't tried the code yet, but I have a GFF file I need to convert into FASTA format. Hopefully later this week I'll get to that... There are a few things I can ask now though: Why are the functions _gff_line_map() and _gff_line_reduce() private (leading underscores)? I had thought you wanted to make the map/reduce approach available to people trying to parse GFF files on multiple threads (e.g. using disco) which would require them to use these two functions, wouldn't it? If so, they should be part of the public API. I don't see any support for the optional FASTA block in a GFF file. Is this something you intend to add later Brad? See also my thoughts below for Bio.SeqIO integration. > I have a couple of suggestions of how to make the GFF parser more generally usable, and more consistent with other parsers in Biopython. > Looking at your first example: > >> from BCBio.GFF.GFFParser import GFFAddingIterator >> >> gff_iterator = GFFAddingIterator() >> rec_dict = gff_iterator.get_all_features(gff_file) >> >> The returned dictionary is like a dictionary from >> SeqIO.to_dict; >> keys are ids and values are SeqRecords. > > It's not clear to me why we need an iterator for GFF files. Can't we just use Python's line iterator instead? I would expect code like this: > > from Bio import GFF > handle = open("my_gff_file.gff") > for line in handle: >     # call the appropriate GFF function on the line
I think the appropriate GFF function here might be Brad's _gff_line_map(). This knows about different GFF line types (e.g. ## header lines). I'm not sure if a line based approach like this can cope with the optional ##FASTA block though. > The second point is about GFFAddingIterator.get_all_features. If this > is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict? > Then the code looks as follows: > > from Bio import GFF > handle = open("my_gff_file.gff") > rec_dict = GFF.to_dict(handle) Well, the Bio.SeqIO.to_dict() function takes a SeqRecord list/iterator rather than a handle, but that might make sense here. > Another thing to consider is that IDs in the GFF file do not need to be unique. > For example, consider a GFF file that stores genome mapping locations for > short sequences stored in a Fasta file. Since each sequence can have more > than one mapping location, we can have multiple lines in the GFF file for one > sequence ID. That sounds nasty. Do you have any example files of this we could use for a test case? > The last point is about storing SeqRecords in rec_dict. A GFF file typically > does not store sequences; if it does, it's not clear which field in the GFF file > does. On the other hand, a SeqRecord often does not contain the > chromosomal location, which is what the GFF file stores. So why use a > SeqRecord for GFF information? I don't think the GFF parser should only return SeqRecord objects, but I do see a use for this (via Bio.SeqIO). GFF files could be represented as a list of SeqFeature objects, and using a SeqRecord to hold this seems very natural to me. It also means we could use Bio.SeqIO to load a GFF file into SeqRecord objects for storage in a BioSQL database. If you look at the NCBI FTP site, they often provide genome sequences in a range of file formats including GenBank and GFF. e.g.
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/ The GenBank files contain the features plus the sequence: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gbk Their GFF3 file only contains the features: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff Some GFF files will include the sequence too; in this case we can fetch it in FASTA format: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna In principle, you could parse this FASTA file and the GFF3 file and put together a GenBank file - or vice versa. As an aside, I would also consider adding protein table support along the same lines; look at this file: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.ptt The header information gives us the genome size, so Bio.SeqIO could return a SeqRecord with lots of SeqFeature objects and for the SeqRecord's seq property use a Bio.Seq.UnknownSeq of length 4639675bp. This is something I might look at implementing myself after Biopython 1.50 is out. We should be able to read in a GenBank file and output a PTT file, and verify it matches the NCBI provided version of the PTT file. Similarly, I would want Bio.SeqIO to be able to parse a GFF3 file, and give me a SeqRecord with lots of SeqFeature objects. If the sequence is present in the file, it should use that (not the case for these NCBI GFF3 files). Otherwise, we wouldn't necessarily know the actual sequence length which we'd need to use the new Bio.Seq.UnknownSeq object. However, we can infer from the maximum feature coordinates a minimum sequence length. For these NCBI GFF3 files, as there is a source feature this does actually give us the genome length, so this should work very nicely.
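As a rough illustration of inferring a minimum sequence length from the maximum feature coordinate, here is a minimal sketch in plain Python. It is not part of Biopython or of Brad's parser: the function name min_lengths_from_gff and the three example lines are made up for illustration, and it assumes tab-separated GFF3 feature lines with the end coordinate in the fifth column.

```python
from collections import defaultdict


def min_lengths_from_gff(lines):
    """Return {seqid: largest end coordinate seen} for GFF feature lines.

    Hypothetical helper for illustration only. GFF coordinates are
    1-based and inclusive, so the largest end coordinate is a lower
    bound on the length of the underlying sequence.
    """
    longest = defaultdict(int)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an optional sequence block follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        parts = line.split("\t")
        if len(parts) != 9:
            continue  # not a well formed feature line
        seqid, end = parts[0], int(parts[4])
        longest[seqid] = max(longest[seqid], end)
    return dict(longest)


# Made-up lines in the style of the NCBI file, including a source feature
example = [
    "##gff-version 3",
    "NC_000913\tRefSeq\tsource\t1\t4639675\t.\t+\t.\tID=src",
    "NC_000913\tRefSeq\tgene\t190\t255\t.\t+\t.\tID=gene0",
]
print(min_lengths_from_gff(example))
```

When a source feature spans the whole record, as in the example, the lower bound equals the genome length, and that number is what would be handed to something like Bio.Seq.UnknownSeq.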
Peter From chapmanb at 50mail.com Mon Apr 13 12:32:19 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 13 Apr 2009 08:32:19 -0400 Subject: [Biopython-dev] Bio.EMBOSS wrappers In-Reply-To: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> Message-ID: <20090413123219.GB5429@sobchak.mgh.harvard.edu> Hi Peter; The tests from EMBOSS look great; thanks for putting this together. > For now I build the command line strings for seqret and needle "by > hand", as Bio.EMBOSS doesn't have wrappers for them yet. I also note > that the existing wrappers in Bio.EMBOSS don't support the very handy > -auto and -filter command line arguments supported by all (or at least > most) of the EMBOSS command line tools. Using -auto turns off any > user prompting for missing arguments (very important for calling from > a script). Using -filter is useful for running the tools with pipes > (i.e. no output file is required as stdout can be used instead, and > potentially no input file if we write to stdin correctly). > > Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding > these features? The needle wrapper would make an excellent basis for > a new water wrapper. For adding -auto and -filter support, there is > probably a clever approach with a common EMBOSS specific subclass of > Bio.Application.AbstractCommandline, but I haven't tried. Definitely go for it. My approach on this has mostly been to add command lines as they are requested, or if I need them for something I am doing. Not ideal. Having a subclass with -auto and -filter is a really good idea; unfortunately nothing clever is designed into the command line builders right now. Feel free to add away. 
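To make the "common EMBOSS specific subclass" idea concrete, here is a minimal sketch in plain Python. It deliberately does not use Bio.Application.AbstractCommandline: the class names EmbossCommandline and NeedleCommandline, the filter_ keyword and the required tuple are all hypothetical, and the option=value output style is only an assumption about how the command line would be built.

```python
class EmbossCommandline(object):
    """Hypothetical base class holding switches shared by EMBOSS tools."""

    program = None   # set by each subclass
    required = ()    # option names a subclass insists on

    def __init__(self, auto=False, filter_=False, **options):
        missing = [name for name in self.required if name not in options]
        if missing:
            raise ValueError("Missing required option(s): %s"
                             % ", ".join(missing))
        self.auto = auto
        self.filter_ = filter_
        self.options = options

    def __str__(self):
        """Build the command line string, shared switches first."""
        parts = [self.program]
        if self.auto:
            parts.append("-auto")      # turn off interactive prompting
        if self.filter_:
            parts.append("-filter")    # read stdin, write stdout
        for name, value in sorted(self.options.items()):
            parts.append("-%s=%s" % (name, value))
        return " ".join(parts)


class NeedleCommandline(EmbossCommandline):
    """Hypothetical needle wrapper; gap parameters stay mandatory."""

    program = "needle"
    required = ("asequence", "bsequence", "gapopen", "gapextend", "outfile")


cmd = NeedleCommandline(auto=True, asequence="a.fasta", bsequence="b.fasta",
                        gapopen=10, gapextend=0.5, outfile="out.needle")
print(cmd)
```

A water wrapper would then only need its own program name and required tuple, which is roughly the reuse being discussed.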
Brad From chapmanb at 50mail.com Mon Apr 13 12:52:55 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 13 Apr 2009 08:52:55 -0400 Subject: [Biopython-dev] Tutorial & Cookbook In-Reply-To: <93403.18413.qm@web62406.mail.re1.yahoo.com> References: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com> <93403.18413.qm@web62406.mail.re1.yahoo.com> Message-ID: <20090413125255.GC5429@sobchak.mgh.harvard.edu> Hi all; > > I have been wondering if the "Biopython Tutorial & > > Cookbook" should be separated now - it is getting > > a bit long (which in some ways is a good thing!). > > In my opinion, it doesn't matter if the "Biopython Tutorial & > Cookbook" is long. I guess that few people actually print this > document anyway. > > I am in favor of having one "official" documentation for Biopython. > If we have one Tutorial and one Cookbook, we'll have lots of overlap > between the two, it'll be unclear what should be in the Tutorial > and what in the Cookbook, and we'll have to make sure the two are > consistent. I am for whatever is easiest to maintain. Being long isn't a problem as people can just skip to whatever they need; reading things online will be increasingly common. Agreed with Michiel that minimizing overlap is key. It's the same as maintaining code; if you have the same thing in multiple places it is more likely to get out of sync and be confusing. There is a pretty clear distinction between tutorial documentation and cookbook examples, so... > A cookbook on the Wiki could be helpful though, and since the Wiki > pages can be fixed easily we won't have to worry so much about > inconsistencies with the official documentation. [...] > +1 for the wiki, -1 for another HTML/PDF document. Same vote for me. I am responsible for the LaTeX file, but if I were starting it today would do things entirely on the web. The barrier to contributing is much lower. 
> > On the other hand, it would be very good if all our cookbook use cases > > could be rolled into the unit test framework - which wouldn't be so > > easy if they live on the wiki. Something based on doctests might work... This is a good idea; broken examples in documentation are definitely annoying. If we enforce a common format for cookbook items, then we could scrape the wiki pages, extract the python code and run it as part of the tests. The python cookbook could serve as some inspiration: http://code.activestate.com/recipes/langs/python/ Brad From biopython at maubp.freeserve.co.uk Mon Apr 13 12:53:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 13:53:18 +0100 Subject: [Biopython-dev] Bio.EMBOSS wrappers In-Reply-To: <20090413123219.GB5429@sobchak.mgh.harvard.edu> References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> <20090413123219.GB5429@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com> On Mon, Apr 13, 2009 at 1:32 PM, Brad Chapman wrote: >> Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding >> these features? The needle wrapper would make an excellent basis for >> a new water wrapper. For adding -auto and -filter support, there is >> probably a clever approach with a common EMBOSS specific subclass of >> Bio.Application.AbstractCommandline, but I haven't tried. > > Definitely go for it. My approach on this has mostly been to add > command lines as they are requested, or if I need them for something > I am doing. Not ideal. > > Having a subclass with -auto and -filter is a really good idea; > unfortunately nothing clever is designed into the command line builders > right now. Feel free to add away. I need to work on my delegation skills - that seems to have backfired ;) Regarding adding -auto support, I have a question about the needle wrapper and the gap parameters.
Using the needle tool at the command line will prompt for the gap parameters UNLESS the -auto argument has been used. i.e. Without -auto, it makes sense to insist on the gap parameters being included, which is what the current wrapper does. However, if we add support for -auto, then these parameters can be optional. We could handle this in the wrapper, but it would be messy (and there may be similar questions with other EMBOSS tools). What do you think - stick with the simple option of insisting the Biopython user set the gap parameters, even if they are using -auto? Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 13:16:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 14:16:51 +0100 Subject: [Biopython-dev] Tutorial & Cookbook In-Reply-To: <20090413125255.GC5429@sobchak.mgh.harvard.edu> References: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com> <93403.18413.qm@web62406.mail.re1.yahoo.com> <20090413125255.GC5429@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904130616t2cd5f029i582f4a2488d1182c@mail.gmail.com> Brad wrote: >Michiel wrote: >> A cookbook on the Wiki could be helpful though, and since the Wiki >> pages can be fixed easily we won't have to worry so much about >> inconsistencies with the official documentation. >> [...] >> +1 for the wiki, -1 for another HTML/PDF document. > > Same vote for me. I am responsible for the LaTeX file, but if I were > starting it today would do things entirely on the web. The barrier > to contributing is much lower. One of the nice things about the current PDF (and HTML) file is we can ship it with each release, meaning it can be used while offline. Also it means we don't have to worry too much about having our online documentation deal with older versions of Biopython. But you are right that LaTeX is a slight barrier to contributing - although it wasn't an issue for me personally as I learnt LaTeX during my Maths/Physics undergraduate degree. 
In any case, I've previously said that if people have additions for the tutorial, I'll take plain text and do the mark up for them. >> > On the other hand, it would be very good if all our cookbook use cases >> > could be rolled into the unit test framework - which wouldn't be so >> > easy if they live on the wiki. Something based on doctests might work... > > This is a good idea; broken examples in documentation are definitely > annoying. If we enforce a common format for cookbook items, then we > could scrape the wiki pages, extract the python code and run it as > part of the tests. That sounds possible - we might be able to scrape the wiki page, reformat it and feed it into doctests... although testing graphical output will still be a problem. Speaking of doctests, we should do more of those in our docstrings. For our online API documentation at http://biopython.org/DIST/docs/api/ it would be nice to have the python examples within the docstrings (including the doctests) shown with syntax colouring. See http://epydoc.sourceforge.net/manual-epytext.html#doctest-blocks for an example, and compare this to http://biopython.org/DIST/docs/api/Bio.Seq-module.html - maybe we need to adjust our indentation?
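As a sketch of the "scrape it and feed it into doctests" idea, the standard library doctest module can pull interactive examples out of arbitrary text and run them. The helper name run_examples and the sample page text below are made up for illustration, and fetching and cleaning up the wiki markup is assumed to have happened already.

```python
import doctest


def run_examples(text, name="cookbook"):
    """Run every >>> example found in text; return (failed, attempted).

    Hypothetical helper: text is assumed to be plain text already
    extracted from a wiki page, with examples in doctest format.
    """
    parser = doctest.DocTestParser()
    test = parser.get_doctest(text, globs={}, name=name,
                              filename=None, lineno=0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report here; a real test suite would keep it
    results = runner.run(test, out=lambda message: None)
    return results.failed, results.attempted


page = '''
Reverse a string:

>>> "ACGT"[::-1]
'TGCA'

Count a letter:

>>> "GATTACA".count("A")
3
'''

print(run_examples(page))
```

Examples that draw graphics would still need some other kind of check, as noted above.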
Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 13:33:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 14:33:03 +0100 Subject: [Biopython-dev] BOSC 2009 In-Reply-To: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com> References: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com> Message-ID: <320fb6e00904130633k68fe32bdj3c0419afc5ada71a@mail.gmail.com> On Mon, Apr 13, 2009 at 11:44 AM, Peter wrote: > Hello Biopythoneers, > > Those of you following the dev-mailing list or the OBF news feed will > know that talk abstracts for BOSC 2009 are due in today, see > http://www.open-bio.org/wiki/BOSC_2009 > I should be able to attend and present the Biopython Project > Update, and a few other Biopython developers may be > around too, so some sort of hackathon is in the air. > > It is a bit unfortunate the deadline was scheduled on the Easter > break, as I'm sure quite a few of you will be on holiday, but here > is an outline abstract. If anyone has comments, please let me > know (on the list or directly) in the next couple of hours... That's been submitted now, although I can still make revisions at the moment if anyone spots something worth adding/fixing. I did remember to add the website and license information as BOSC request in their instructions. Peter From chapmanb at 50mail.com Mon Apr 13 13:35:39 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 13 Apr 2009 09:35:39 -0400 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> Message-ID: <20090413133539.GD5429@sobchak.mgh.harvard.edu> Michiel and Peter; Thanks for your comments on this. I'm definitely open to modifying the interface and am happy to have y'all giving feedback.
In reading through your comments, there is a bit of a disconnect between what you are expecting the parser to do and how it is designed right now. You both are thinking of the GFF parser as a line oriented parser that emits an object, like a SeqFeature, for each line in the file. This is one way to do it, but the downsides are:

- Many features, like coding regions, are actually represented over multiple lines.
- As Michiel pointed out, almost all files have many replicating IDs (the first column). Ideally you want all of these features consolidated to a single SeqRecord.

So the parser now takes a higher level view and assumes that the user will want those two things done for them. So it is designed as an "adder," that puts features onto SeqRecord objects. A normal use case would be:

- Use SeqIO to parse a FASTA file with the sequences => SeqRecords
- Use the GFFParser to add features from a separate GFF file to the SeqRecords. These are SeqFeatures, added to the right records and nested in a parent/child relationship as appropriate.

Ideally you would parse the entire GFF file and do all this feature adding at once. For big files this fails due to memory issues, which is why the filtering and iterating features were introduced.

Okay, so that is the top level view. I will try to hit some of the specifics: > Why are the functions _gff_line_map() and _gff_line_reduce() private > (leading underscores)? I had thought you wanted to make the > map/reduce approach available to people trying to parse GFF files on > multiple threads (e.g. using disco) which would require them to use > these two functions, wouldn't it? If so, they should be part of the > public API. I don't think a standard user would want to deal with these directly. They just parse lines into their components and build an intermediate dictionary object. To parallelize the job, the GFFMapReduceFeatureAdder class has a 'disco_host' parameter which then runs the job in parallel.
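For readers unfamiliar with the map/reduce split described here, a minimal sketch in plain Python follows. It is not the code in the GFF parser: line_map, features_reduce and parse_features are hypothetical names, the intermediate object is a plain dictionary rather than a SeqFeature, and parent/child nesting is left out.

```python
from collections import defaultdict


def line_map(line):
    """Map step: turn one GFF line into a list of (seqid, feature) pairs.

    Returns an empty list for blank lines, comments and directives, so
    each line can be processed independently of all the others.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return []
    (seqid, source, ftype, start, end,
     score, strand, phase, attrs) = line.split("\t")
    qualifiers = dict(item.split("=", 1)
                      for item in attrs.split(";") if "=" in item)
    feature = {"type": ftype, "start": int(start), "end": int(end),
               "strand": strand, "qualifiers": qualifiers}
    return [(seqid, feature)]


def features_reduce(pairs):
    """Reduce step: consolidate features sharing an ID (first column)."""
    grouped = defaultdict(list)
    for seqid, feature in pairs:
        grouped[seqid].append(feature)
    return dict(grouped)


def parse_features(lines):
    """Run the map step over every line, then reduce the results."""
    pairs = (pair for line in lines for pair in line_map(line))
    return features_reduce(pairs)


gff_lines = [
    "##gff-version 3",
    "chr1\ttest\tgene\t10\t90\t.\t+\t.\tID=gene1",
    "chr1\ttest\tCDS\t10\t40\t.\t+\t0\tID=cds1;Parent=gene1",
    "chr2\ttest\tgene\t5\t25\t.\t-\t.\tID=gene2",
]
by_record = parse_features(gff_lines)
print(sorted(by_record))
```

Because the map step only looks at a single line, the lines could be farmed out to separate workers, with the reduce step collecting the pairs afterwards.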
> I don't see any support for the optional FASTA block in a GFF file. > Is this something you intend to add later Brad? See also my thoughts > below for Bio.SeqIO integration. I haven't added anything for parsing header and footer directives but it is on the to do list and I have a good idea how to handle them. Definitely pass along a file that uses these you want to parse and we can work on it. > > I have a couple of suggestions of how to make the GFF parser more > > generally usable, and more consistent with other parsers in Biopython. [...] > > It's not clear to me why we need an iterator for GFF files. Can't we > > just use Python's line iterator instead? I would expect code like this: > > > > from Bio import GFF > > handle = open("my_gff_file.gff") > > for line in handle: > >     # call the appropriate GFF function on the line Right, so this was tackled in the top level overview above. Michiel, does the design make more sense now? > > The second point is about GFFAddingIterator.get_all_features. If this > > is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict? > > Then the code looks as follows: > > > > from Bio import GFF > > handle = open("my_gff_file.gff") > > rec_dict = GFF.to_dict(handle) Yes, except in the more common cases you are adding to a dictionary of records as opposed to generating one from scratch. My thought was that copying the SeqIO behavior made it more confusing because it doesn't do quite the same thing. After my explanation, what are your thoughts? > > Another thing to consider is that IDs in the GFF file do not need to be unique. > > For example, consider a GFF file that stores genome mapping locations for > > short sequences stored in a Fasta file. Since each sequence can have more > > than one mapping location, we can have multiple lines in the GFF file for one > > sequence ID. Yes, this goes back to my explanation above and is why the parser works differently than the standard SeqIO parsers.
GFF ends up being a different beast. I think it makes sense to copy useful patterns we have already, but don't want to confuse users with close but not the same functionality. > > The last point is about storing SeqRecords in rec_dict. A GFF file typically > > does not store sequences; if it does, it's not clear which field in the GFF file > > does. On the other hand, a SeqRecord often does not contain the > > chromosomal location, which is what the GFF file stores. So why use a > > SeqRecord for GFF information? Hopefully the SeqRecords make more sense now. What it is really doing is adding SeqFeatures to SeqRecords. When the user doesn't provide one, it creates an empty SeqRecord with the appropriate ID to use and adds SeqFeatures to it. > If you look at the NCBI FTP site, they often provide genome sequences > in a range of file formats including GenBank and GFF. [...] > Their GFF3 file only contains the features: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff > > Some GFF files will include the sequence too, in this case we can > fetch it in FASTA format: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna Right on. So you would first parse the Fasta file with the SeqIO parser to_dict functionality, and then feed this dictionary to the GFF parser to add the features. > In principle, you could parse this FASTA file and the GFF3 file and > put together a GenBank file - or vice versa. Yes. > Similarly, I would want Bio.SeqIO to be able to parse a GFF3 file, and > give me a SeqRecord with lots of SeqFeature objects. If the sequence > is present in the file, it should use that (not the case for these > NCBI GFF3 files). Otherwise, we wouldn't necessarily know the actual > sequence length which we'd need to use the new Bio.Seq.UnknownSeq > object. However, we can infer from the maximum feature coordinates a > minimum sequence length.
For these NCBI GFF3 files, as there is a > source feature this does actually give us the genome length, so this > should work very nicely. Using UnknownSeq is a good idea, and I will do that. Whew. Michiel and Peter -- hopefully the high level intentions are a bit more clear. Thanks for your input so far; let's hash this out so it makes sense to everyone. Brad From chapmanb at 50mail.com Mon Apr 13 13:44:29 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 13 Apr 2009 09:44:29 -0400 Subject: [Biopython-dev] Bio.EMBOSS wrappers In-Reply-To: <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com> References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> <20090413123219.GB5429@sobchak.mgh.harvard.edu> <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com> Message-ID: <20090413134429.GE5429@sobchak.mgh.harvard.edu> Hi Peter; > >> Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding > >> these features? The needle wrapper would make an excellent basis for > >> a new water wrapper. For adding -auto and -filter support, there is > >> probably a clever approach with a common EMBOSS specific subclass of > >> Bio.Application.AbstractCommandline, but I haven't tried. > > > > Definitely go for it. My approach on this has mostly been to add > > command lines as they are requested, or if I need them for something > > I am doing. Not ideal. > > > > Having a subclass with -auto and -filter is a really good idea; > > unfortunately nothing clever is designed into the command line builders > > right now. Feel free to add away. > > I need to work on my delegation skills - that seems to have backfired ;) Oops. I honestly read that as "do I have your permission?" I can of course tackle this, but am a bit underwater now. > Regarding adding -auto support, I have a question about the needle > wrapper and the gap parameters.
Using the needle tool at the command > line will prompt for the gap parameters UNLESS the -auto argument has > been used. i.e. Without -auto, it makes sense to insist on the gap > parameters being included, which is what the current wrapper does. > However, if we add support for -auto, then these parameters can be > optional. We could handle this in the wrapper, but it would be messy > (and there may be similar questions with other EMBOSS tools). What do > you think - stick with the simple option of insisting the Biopython > user set the gap parameters, even if they are using -auto? I think we should stick with the simple option. These were meant to be pretty dumb specifiers that help users write more modular code than simply pasting in a raw string for the command line. Trying to get too fancy is probably overkill. Brad From biopython at maubp.freeserve.co.uk Mon Apr 13 13:49:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 14:49:56 +0100 Subject: [Biopython-dev] Bio.EMBOSS wrappers In-Reply-To: <20090413134429.GE5429@sobchak.mgh.harvard.edu> References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> <20090413123219.GB5429@sobchak.mgh.harvard.edu> <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com> <20090413134429.GE5429@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com> On Mon, Apr 13, 2009 at 2:44 PM, Brad Chapman wrote: >> > ... Feel free to add away. >> >> I need to work on my delegation skills - that seems to have back fired ;) > > Oops. I honestly read that as "do I have your permission?" I can of > course tackle this, but am a bit underwater now. Looking back, I was a bit ambiguous. I don't mind who does it - let's see who has time free first. >> Regarding adding -auto support, I have a question about the needle >> wrapper and the gap parameters. 
Using the needle tool at the command >> line will prompt for the gap parameters UNLESS the -auto argument has >> been used. i.e. Without -auto, it makes sense to insist on the gap >> parameters being included, which is what the current wrapper does. >> However, if we add support for -auto, then these parameters can be >> optional. We could handle this in the wrapper, but it would be messy >> (and there may be similar questions with other EMBOSS tools). What do >> you think - stick with the simple option of insisting the Biopython >> user set the gap parameters, even if they are using -auto? > > I think we should stick with the simple option. These were meant to > be pretty dumb specifiers that help users write more modular code than > simply pasting in a raw string for the command line. Trying to get > too fancy is probably overkill. Agreed. Peter From biopython at maubp.freeserve.co.uk Mon Apr 13 14:19:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Apr 2009 15:19:54 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090413133539.GD5429@sobchak.mgh.harvard.edu> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> <20090413133539.GD5429@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com> > Okay, so that is the top level view. I will try to hit some of the > specifics: > >> Why are the functions _gff_line_map() and _gff_line_reduce() private >> (leading underscores)? I had thought you wanted to make the >> map/reduce approach available to people trying to parse GFF files on >> multiple threads (e.g. using disco) which would require them to use >> these two functions, wouldn't it? If so, they should be part of the >> public API. > > I don't think a standard user would want to deal with these > directly.
They just parse lines into their components and build an > intermediate dictionary object. To parallelize the job, the > GFFMapReduceFeatureAdder class has a 'disco_host' parameter which > then runs the job in parallel. Are you aware of any alternatives to disco for doing map/reduce in Python, and does that impact your design choices? >> I don't see any support for the optional FASTA block in a GFF file. >> Is this something you intend to add later Brad? See also my thoughts >> below for Bio.SeqIO integration. > > I haven't added anything for parsing header and footer directives but > it is on the to do list and I have a good idea how to handle them. Definitely > pass along a file that uses these you want to parse and we can work on it. There are some partial examples here: http://www.sequenceontology.org/gff3.shtml We should have a peep at BioPerl's unit tests and/or ask Lincoln directly. >> > I have a couple of suggestions of how to make the GFF parser more >> > generally usable, and more consistent with other parsers in Biopython. > [...] >> > It's not clear to me why we need an iterator for GFF files. Can't we >> > just use Python's line iterator instead? I would expect code like this: >> > >> > from Bio import GFF >> > handle = open("my_gff_file.gff") >> > for line in handle: >> >     # call the appropriate GFF function on the line > > Right, so this was tackled in the top level overview above. Michiel, > does the design make more sense now? > >> > The second point is about GFFAddingIterator.get_all_features. If this >> > is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict? >> > Then the code looks as follows: >> > >> > from Bio import GFF >> > handle = open("my_gff_file.gff") >> > rec_dict = GFF.to_dict(handle) > > Yes, except in the more common cases you are adding to a dictionary > of records as opposed to generating one from scratch.
My thought was > that copying the SeqIO behavior made it more confusing because it > doesn't do quite the same thing. After my explanation, what are your > thoughts? Maybe there is a role for a to_dict() function for when you start from scratch, but as you say, it does sound like there is a general need to add to an existing dict. >> > Another thing to consider is that IDs in the GFF file do not need to be unique. >> > For example, consider a GFF file that stores genome mapping locations for >> > short sequences stored in a Fasta file. Since each sequence can have more >> > than one mapping location, we can have multiple lines in the GFF file for one >> > sequence ID. > > Yes, this goes back to my explanation above and is why the > parser works differently than the standard SeqIO parsers. GFF ends > up being a different beast. I think it makes sense to copy useful > patterns we have already, but don't want to confuse users with close > but not the same functionality. > >> > The last point is about storing SeqRecords in rec_dict. A GFF file typically >> > does not store sequences; if it does, it's not clear which field in the GFF file >> > does. On the other hand, a SeqRecord often does not contain the >> > chromosomal location, which is what the GFF file stores. So why use a >> > SeqRecord for GFF information? > > Hopefully the SeqRecords make more sense now. What it is really doing is > adding SeqFeatures to SeqRecords. When the user doesn't provide one, > it creates an empty SeqRecord with the appropriate ID to use and > adds SeqFeatures to it. > >> If you look at the NCBI FTP site, they often provide genome sequences >> in a range of file formats including GenBank and GFF. >> [...]
>> Their GFF3 file only contains the features:
>> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff
>>
>> Some GFF files will include the sequence too, in this case we can
>> fetch it in FASTA format:
>> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna
>
> Right on. So you would first parse the Fasta file with the SeqIO
> parser to_dict functionality, and then feed this dictionary to the
> GFF parser to add the features.

Hmm. I'm with you on the idea that you may need to parse a GFF file
and a separate second file to get the actual sequence (e.g. a FASTA
file), but there is more than one way to combine the two.

For a single sequence, I was thinking more along the lines of:

    from Bio import SeqIO
    record = SeqIO.read(open("NC_000913.fna"),"fasta")
    record.features = SeqIO.read(open("NC_000913.gff"),"gff3").features

Or, depending on what other annotation you can extract, perhaps the
other way round would be best:

    from Bio import SeqIO
    record = SeqIO.read(open("NC_000913.gff"),"gff3")
    record.seq = SeqIO.read(open("NC_000913.fna"),"fasta").seq

The above is pretty trivial I think, as long as we include examples of
this in our documentation. This kind of manipulation is also file
format neutral - it would work equally well with a FASTA file and a
PTT file (assuming we add parsing NCBI protein tables to Bio.SeqIO as
outlined in my earlier email). Or for another example, perhaps an
annotated GenBank file without the sequence (e.g. just a CONTIG
assembly line) plus a FASTA file for the full nucleotide sequence.

If the FASTA and GFF file apply to multiple sequences (e.g.
a set of contigs, rather than a single chromosome), and you have
enough memory, then something using dictionaries should work:

    from Bio import SeqIO
    # to_dict needs an iterator of records, so use parse (not read) here
    records = SeqIO.to_dict(SeqIO.parse(open("NC_000913.fna"),"fasta"))
    for temp_rec in SeqIO.parse(open("NC_000913.gff"),"gff3") :
        records[temp_rec.id].features = temp_rec.features

or,

    from Bio import SeqIO
    # to_dict needs an iterator of records, so use parse (not read) here
    records = SeqIO.to_dict(SeqIO.parse(open("NC_000913.gff"),"gff3"))
    for temp_rec in SeqIO.parse(open("NC_000913.fna"),"fasta") :
        records[temp_rec.id].seq = temp_rec.seq

(You may need to massage the keys to match up, I'm assuming here that
isn't required).

i.e. It can all be done from Bio.SeqIO without needing to dive into
Bio.GFF unless you need to do something special (e.g. filtering the
features).

>> In principle, you could parse this FASTA file and the GFF3 file and
>> put together a GenBank file - or vice versa.
>
> Yes.
>
>> Similarly, I would want Bio.SeqIO to be able to parse a GFF3 file, and
>> give me a SeqRecord with lots of SeqFeature objects. If the sequence
>> is present in the file, it should use that (not the case for these
>> NCBI GFF3 files). Otherwise, we wouldn't necessarily know the actual
>> sequence length which we'd need to use the new Bio.Seq.UnknownSeq
>> object. However, we can infer from the maximum feature coordinates a
>> minimum sequence length. For these NCBI GFF3 files, as there is a
>> source feature this does actually give us the genome length, so this
>> should work very nicely.
>
> Using UnknownSeq is a good idea, and I will do.

Great.

> Whew. Michiel and Peter -- hopefully the high level intentions are a
> bit more clear. Thanks for your input so far; let's hash this out so
> it makes sense to everyone.

Good plan :)

As you can probably tell, I am concentrating on getting this to match
up well with the Bio.SeqIO framework.
It will be nice to know the underlying Bio.GFF module has more
options, but I expect most people to start with reading in a GFF file
using Bio.SeqIO, and being able to transfer their existing knowledge
of SeqFeature objects learnt from using Bio.SeqIO to read in GenBank
files.

Peter

From jflatow at gmail.com Mon Apr 13 14:41:56 2009
From: jflatow at gmail.com (Jared Flatow)
Date: Mon, 13 Apr 2009 09:41:56 -0500
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com>
References: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
 <830379.9837.qm@web62402.mail.re1.yahoo.com>
 <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
 <20090413133539.GD5429@sobchak.mgh.harvard.edu>
 <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com>
Message-ID: <3050CC48-7365-4746-B30C-F56C2ACAA2F8@gmail.com>

FYI:

On Apr 13, 2009, at 9:19 AM, Peter wrote:

> Are you aware of any alternatives to disco for doing map/reduce on
> Python, and does that impact your design choices?

You can use Python map/reduce functions with Hadoop via the Streaming
contrib package included with Hadoop. An overview:

http://docs.google.com/Presentation?id=dgr666gg_31cd4n7qdz

Here is an input reader/record reader for FASTA:

http://gist.github.com/45551

jared

From bugzilla-daemon at portal.open-bio.org Mon Apr 13 15:41:29 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 13 Apr 2009 11:41:29 -0400
Subject: [Biopython-dev] [Bug 2601] Seq find() method: proposal
In-Reply-To:
Message-ID: <200904131541.n3DFfTGN022460@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2601

------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-13 11:41 EST -------
See also Bug 2809, for the much narrower option of adding string-like
startswith and endswith methods to the Seq object (which as proposed
would not deal with ambiguity characters).

From biopython at maubp.freeserve.co.uk Mon Apr 13 17:55:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 18:55:53 +0100
Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format
Message-ID: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com>

Hi all,

At the end of last week I found test_SeqIO_online.py was failing and
traced this to a change in Entrez EFetch. EFetch is documented here:
http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html

The issue is with EFetch and the undocumented rettype=genbank argument
which we currently use in our documentation and unit tests. This
isn't an "official" argument in that it isn't listed on their website,
but until recently it returned plain text GenBank files, acting like
the official rettype=gb or gp arguments. However, as of the end of
last week, EFetch returns the default format instead (ASN.1), causing
test_SeqIO_online.py to fail and rendering some of our examples
misleading.

I emailed the NCBI and received a very prompt reply,

> Dear Colleague,
>
> As the e-Utils continue to be refined our developers sometimes
> address one-off issues, and this was one of them. The 'official'
> parameter for GenBank is rettype=gb. Now if the parameter is not
> correct you will default to ASN.1 in the nucleotide databases. We
> apologize for any inconvenience.
>
> Regards,
>
> Steve Pechous, Ph.D.
> NCBI User Services

I then emailed back (before Easter) to ask if they would reconsider
this change, and have just had a reply:

> Hi Peter,
>
> This will likely not reverse back as the true parameters are laid out
> in the help documents and are now required, so to speak.
>
> Regards,
>
> Steve Pechous, Ph.D.
> NCBI User Services

With hindsight we shouldn't have used rettype="genbank", but it did
seem to make things simpler for our documentation and I really hadn't
expected the NCBI to change this.

I think we have two options:

(1) Add a special case to Bio.Entrez.efetch to map rettype="genbank"
to rettype="gb" (or "gp" for the protein database). This is simple
and causes least disruption to Biopython users, but is a bad idea in
the long run as it means we are effectively providing our own variant
of the Entrez API.

(2) Update our documentation and unit tests to use rettype="gb" or
"gp" instead of rettype="genbank", and add a special case to
Bio.Entrez.efetch to map rettype="genbank" to rettype="gb" (or "gp"
for the protein database) and issue a warning that the NCBI have
changed their API. At a later point we might change this warning to
an error. This would provide a clear transition for end user scripts,
and keep us consistent with the official Entrez API.

I favour option (2) here. Any other thoughts? Whatever we do should
happen before we release Biopython 1.50.

Peter

From biopython at maubp.freeserve.co.uk Mon Apr 13 18:06:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 19:06:25 +0100
Subject: [Biopython-dev] Plan for Biopython 1.50 (final)
Message-ID: <320fb6e00904131106s70028e9el56d334fa732bddf8@mail.gmail.com>

On Tue, Mar 31, 2009 at 10:38 PM, Peter wrote:
> Hi all,
>
> OK guys, after a brief chat off the mailing list, I'm hoping to do the
> Biopython 1.50 beta release roughly this weekend, ...
>
> After the release of Biopython 1.50 beta, we'll reopen CVS again for
> small changes and documentation. While the beta is being tested by
> our user base, I'd like us to push to finish any missing documentation
> - in particular for new modules Bio.Motif (Bartek) and
> Bio.Graphics.GenomeDiagram (me and/or Leighton), plus the new
> SeqRecord slicing and UnknownSeq class (me).

That documentation still needs doing, and it would be nice to have it
with Biopython 1.50. If Bartek or Leighton expects to add anything in
the next few days, then I'd be happy to hold back the release for
that. I'll try and do the SeqRecord stuff myself shortly.

> Depending on the feedback from the beta, I'd hope we can do the final
> release of Biopython 1.50 well before the end of April, and then
> reopen CVS for new code.

There haven't been any problems with the beta reported, however there
is the issue of EFetch returning ASN.1 not genbank format (see my
earlier email) which I think we must resolve before Biopython 1.50 is
released.

Apart from these two points (documentation and EFetch), are there any
issues regarding doing the official release of Biopython 1.50? I
think we can aim for a release this week...

Peter

From lpritc at scri.ac.uk Tue Apr 14 08:29:14 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 14 Apr 2009 09:29:14 +0100
Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format
In-Reply-To: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com>
Message-ID:

On 13/04/2009 18:55, "Peter" wrote:

[...]

> I think we have two options:
>
> (1) Add a special case to Bio.Entrez.efetch to map rettype="genbank"
> to rettype="gb" (or "gp" for the protein database). This is simple
> and causes least disruption to Biopython users, but is a bad idea in
> the long run as it means we are effectively providing our own variant
> of the Entrez API.
>
> (2) Update our documentation and unit tests to use rettype="gb" or
> "gp" instead of rettype="genbank", and add a special case to
> Bio.Entrez.efetch to map rettype="genbank" to rettype="gb" (or "gp"
> for the protein database) and issue a warning that the NCBI have
> changed their API. At a later point we might change this warning to
> an error. This would provide a clear transition for end user scripts,
> and keep us consistent with the official Entrez API.
>
> I favour option (2) here.
> Any other thoughts? Whatever we do should
> happen before we release Biopython 1.50.

Option (2). Option (1) risks cementing an argument into place in
Biopython that could potentially contradict future Entrez API usage.

L.

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie,
Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405

From mjldehoon at yahoo.com Tue Apr 14 08:33:48 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 14 Apr 2009 01:33:48 -0700 (PDT)
Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format
In-Reply-To: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com>
Message-ID: <273080.33626.qm@web62408.mail.re1.yahoo.com>

I am also in favor of option (2).

--Michiel

> I think we have two options:
>
> (1) Add a special case to Bio.Entrez.efetch to map rettype="genbank"
> to rettype="gb" (or "gp" for the protein database). This is simple
> and causes least disruption to Biopython users, but is a bad idea in
> the long run as it means we are effectively providing our own variant
> of the Entrez API.
>
> (2) Update our documentation and unit tests to use rettype="gb" or
> "gp" instead of rettype="genbank", and add a special case to
> Bio.Entrez.efetch to map rettype="genbank" to rettype="gb" (or "gp"
> for the protein database) and issue a warning that the NCBI have
> changed their API. At a later point we might change this warning to
> an error. This would provide a clear transition for end user scripts,
> and keep us consistent with the official Entrez API.
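
As a rough illustration of option (2) as quoted above, the mapping plus
warning could be sketched along these lines. The helper name
normalize_rettype and its (db, rettype) signature are made up for this
example; this is not the code that went into Bio.Entrez, only one way
the described behaviour might look.

```python
import warnings


def normalize_rettype(db, rettype):
    """Return the documented EFetch rettype, warning on the legacy alias.

    Hypothetical helper for illustration only, not the Bio.Entrez code.
    """
    if rettype != "genbank":
        # Anything else is passed through untouched.
        return rettype
    # The documented values are "gp" for the protein database, "gb" otherwise.
    official = "gp" if db == "protein" else "gb"
    warnings.warn(
        'rettype="genbank" is not an official EFetch argument; '
        'using rettype="%s" instead.' % official,
        UserWarning,
    )
    return official
```

Turning the warning into an error later, as the option suggests, would
then only mean replacing the warnings.warn call with a raise.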

From bugzilla-daemon at portal.open-bio.org Tue Apr 14 08:51:56 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Apr 2009 04:51:56 -0400
Subject: [Biopython-dev] [Bug 2811] New: EFetch returning ASN.1 not GenBank format for rettype=genbank
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2811

           Summary: EFetch returning ASN.1 not GenBank format for
                    rettype=genbank
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

At the end of last week I found test_SeqIO_online.py was failing and
traced this to a change in Entrez EFetch. EFetch is documented here:
http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html

The issue is with EFetch and the undocumented rettype=genbank argument
which we currently use in our documentation and unit tests. This
isn't an "official" argument in that it isn't listed on their website,
but until recently it returned plain text GenBank files, acting like
the official rettype=gb or gp arguments. However, as of the end of
last week, EFetch returns the default format instead (ASN.1), causing
test_SeqIO_online.py to fail and rendering some of our examples
misleading.

I emailed the NCBI and received a very prompt reply,

> Dear Colleague,
>
> As the e-Utils continue to be refined our developers sometimes
> address one-off issues, and this was one of them. The 'official'
> parameter for GenBank is rettype=gb. Now if the parameter is not
> correct you will default to ASN.1 in the nucleotide databases. We
> apologize for any inconvenience.
>
> Regards,
>
> Steve Pechous, Ph.D.
> NCBI User Services

I then emailed back (before Easter) to ask if they would reconsider
this change, and have just had a reply:

> Hi Peter,
>
> This will likely not reverse back as the true parameters are laid out
> in the help documents and are now required, so to speak.
>
> Regards,
>
> Steve Pechous, Ph.D.
> NCBI User Services

With hindsight we shouldn't have used rettype="genbank", but it did
seem to make things simpler for our documentation and I really hadn't
expected the NCBI to change this.

After discussion on the mailing list, the plan is to update our
documentation and unit tests to use rettype="gb" or "gp" instead of
rettype="genbank", and add a special case to Bio.Entrez.efetch to map
rettype="genbank" to rettype="gb" (or "gp" for the protein database)
and issue a warning that the NCBI have changed their API. At a later
point we might change this warning to an error. This would provide a
clear transition for end user scripts, and keep us consistent with the
official Entrez API.

From biopython at maubp.freeserve.co.uk Tue Apr 14 08:53:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 14 Apr 2009 09:53:02 +0100
Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format
In-Reply-To: <273080.33626.qm@web62408.mail.re1.yahoo.com>
References: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com>
 <273080.33626.qm@web62408.mail.re1.yahoo.com>
Message-ID: <320fb6e00904140153w4c659655q64f19540f7bd12b7@mail.gmail.com>

On Tue, Apr 14, 2009 at 9:33 AM, Michiel de Hoon wrote:
>
> I am also in favor of option (2).
>
> --Michiel
>

OK. Let's do that then.
I've filed Bug 2811 for this issue,
http://bugzilla.open-bio.org/show_bug.cgi?id=2811

Peter

From bugzilla-daemon at portal.open-bio.org Tue Apr 14 09:54:23 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Apr 2009 05:54:23 -0400
Subject: [Biopython-dev] [Bug 2811] EFetch returning ASN.1 not GenBank format for rettype=genbank
In-Reply-To:
Message-ID: <200904140954.n3E9sND0024084@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2811

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-14 05:54 EST -------
Tutorial updated, see Doc/Tutorial.tex revision 1.221

From mjldehoon at yahoo.com Tue Apr 14 10:36:03 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 14 Apr 2009 03:36:03 -0700 (PDT)
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090413133539.GD5429@sobchak.mgh.harvard.edu>
Message-ID: <322143.67385.qm@web62403.mail.re1.yahoo.com>

--- On Mon, 4/13/09, Brad Chapman wrote:
> A normal use case would be:
>
> - Use SeqIO to parse a FASTA file with the sequences => SeqRecords
> - Use the GFFParser to add features from a separate GFF
> file to the SeqRecords. These are SeqFeatures, added to
> the right records and nested in a parent/child relationship
> as appropriate.

Usually, when I use a GFF file I either don't have an associated Fasta
file, or I am not particularly interested in the original sequences.
So while this approach is useful for some people, in its current form
it's not exactly generally usable.

First, let's discuss how to represent the information contained in a
GFF file. SeqRecords are good if the GFF file is associated with a
Fasta file (or contains the sequence itself), but if not it seems to
be a bit awkward.
How about the following (and I think Peter was hinting at the same
idea):

The actual parser lives in Bio.GFF, and produces Bio.GFF.Record
objects that closely resemble the GFF file structure. For example, we
use the GFF specified fields (<seqname> <source> <feature> <start>
<end> <score> <strand> <frame> [attributes] [comments]) as attributes
to Bio.GFF.Record objects.

Bio.SeqIO then uses the parser in Bio.GFF, and puts its information in
the appropriate fields of a SeqRecord. Here, we have to think about
two cases: Simply creating a SeqRecord based on the GFF file, and
adding the information in the GFF file as annotations to a
pre-existing set of SeqRecords. (I am not sure if we need a separate
function for that, or, as Peter suggested, let the user do that
himself, guided by some examples in the documentation).

Users then have a choice to use Bio.SeqIO to get SeqRecords, or
Bio.GFF to see the "raw" GFF data, depending on their needs.

How does that sound?

--Michiel

From biopython at maubp.freeserve.co.uk Tue Apr 14 11:04:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 14 Apr 2009 12:04:39 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <322143.67385.qm@web62403.mail.re1.yahoo.com>
References: <20090413133539.GD5429@sobchak.mgh.harvard.edu>
 <322143.67385.qm@web62403.mail.re1.yahoo.com>
Message-ID: <320fb6e00904140404x35f87a00ude242e6c3c4c7971@mail.gmail.com>

On Tue, Apr 14, 2009 at 11:36 AM, Michiel de Hoon wrote:
>
> Usually, when I use a GFF file I either don't have an associated Fasta file,
> or I am not particularly interested in the original sequences. So while this
> approach is useful for some people, in its current form it's not exactly
> generally usable.
>
> First, let's discuss how to represent the information contained in a GFF
> file. SeqRecords are good if the GFF file is associated with a Fasta file
> (or contains the sequence itself), but if not it seems to be a bit awkward.

I think parsing a GFF file with Bio.SeqIO into SeqRecord object(s) can
still be useful even without the sequence. The list of SeqFeature
objects belonging to each SeqRecord can be used for example with
GenomeDiagram to draw a picture of the organism. Because you lack the
sequence, you won't be able to include GC% or GC skew, but it is nice
to visualize the annotation all the same. You could also do things
like looking for the ratio of genic and inter-genic usage, or hunt for
overlapping genes - although for these it may be easier to work with a
more low level representation.

> How about the following (and I think Peter was hinting at the same idea):
>
> The actual parser lives in Bio.GFF, and produces Bio.GFF.Record objects
> that closely resemble the GFF file structure. For example, we use the
> GFF specified fields (<seqname> <source> <feature> <start> <end> <score>
> <strand> <frame> [attributes] [comments]) as attributes to
> Bio.GFF.Record objects.

That sounds possible to me - although I haven't given the basic
Bio.GFF.Record structure any thought, nor indeed have I examined what
data objects Brad is returning at the moment.

> Bio.SeqIO then uses the parser in Bio.GFF, and puts its information in the
> appropriate fields of a SeqRecord.

Yes - much like how Bio.SeqIO calls other modules like Bio.GenBank and
Bio.SwissProt now. However, regarding the implementation, I wouldn't
automatically insist the Bio.SeqIO GFF wrapper *has* to use a
Bio.GFF.Record internally (assuming we have such a thing) as that
could be a performance bottleneck. I guess it depends on how simple
the Bio.GFF.Record objects are.

> Here, we have to think about two cases:
> Simply creating a SeqRecord based on the GFF file, and adding the
> information in the GFF file as annotations to a pre-existing set of SeqRecords.
> (I am not sure if we need a separate function for that, or, as Peter suggested,
> let the user do that himself, guided by some examples in the documentation).

Simply creating SeqRecord objects from a GFF file is the standard
Bio.SeqIO approach.

For combining data from a GFF file and a FASTA file, this is rather
like the FASTA+QUAL situation. Here we do document (in the docstrings,
not yet in the tutorial) how to use Bio.SeqIO to read in two sets of
SeqRecord objects and combine them, but also provide a "paired file
iterator" to do this for you. Right now this function is in
Bio.SeqIO.QualityIO, but I am open to moving this and the low level
bits to somewhere like Bio.Sequencing.Quality instead (as long as we
do this before Biopython 1.50 is released).

I have pondered a "paired file iterator" function for Bio.SeqIO for
dealing with FASTA+QUAL, FASTA+GFF, FASTA+PTT, etc, which would take
TWO file handles and return SeqRecord objects. Interestingly all the
examples thus far are FASTA+other. Anyway, this could be added later
if need be.

> Users then have a choice to use Bio.SeqIO to get SeqRecords, or Bio.GFF to see the "raw" GFF data, depending on their needs.
> How does that sound?

Pretty much what I had in mind - although as I said, I've not given
much thought to how to present the "raw" GFF data.
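
To make the "paired file iterator" idea above a little more concrete,
here is a minimal sketch of combining a set of sequence records with a
set of feature-only records by matching IDs. Everything in it is
hypothetical: Rec is a tiny stand-in for a SeqRecord so the example
runs without Biopython, and paired_iterator is not an existing Bio.SeqIO
function. It uses the dictionary approach (the feature records must fit
in memory) rather than stepping through both files in lockstep.

```python
from dataclasses import dataclass, field


@dataclass
class Rec:
    """Hypothetical stand-in for a SeqRecord (id, seq and features only)."""
    id: str
    seq: str = ""
    features: list = field(default_factory=list)


def paired_iterator(seq_records, feature_records):
    """Yield each sequence record with the features that share its id.

    Illustrative only: feature_records is read into a dictionary first,
    so several feature-only records may use the same id.
    """
    features_by_id = {}
    for rec in feature_records:
        features_by_id.setdefault(rec.id, []).extend(rec.features)
    for rec in seq_records:
        # Records without any matching features get an empty list.
        rec.features = features_by_id.get(rec.id, [])
        yield rec
```

For example, list(paired_iterator(fasta_recs, gff_recs)) would give back
the FASTA-derived records in their original order, each carrying the
features found under the same ID in the second input.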

Peter

From bugzilla-daemon at portal.open-bio.org Tue Apr 14 12:05:07 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Apr 2009 08:05:07 -0400
Subject: [Biopython-dev] [Bug 2811] EFetch returning ASN.1 not GenBank format for rettype=genbank
In-Reply-To:
Message-ID: <200904141205.n3EC570L032323@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2811

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-14 08:05 EST -------
Bio/Entrez/__init__.py CVS revision 1.41
Tests/test_SeqIO_online.py CVS revision 1.7
DEPRECATED CVS revision 1.50

Marking as fixed (although a proof reading of the tutorial wouldn't hurt).

From bugzilla-daemon at portal.open-bio.org Tue Apr 14 23:33:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Apr 2009 19:33:59 -0400
Subject: [Biopython-dev] [Bug 2811] EFetch returning ASN.1 not GenBank format for rettype=genbank
In-Reply-To:
Message-ID: <200904142333.n3ENXxFX018002@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2811

------- Comment #3 from sbassi at gmail.com 2009-04-14 19:33 EST -------
I saw in the online Tutorial this small typo:
"form Bio import SeqIO"

I also have a question regarding this bug: What about adding "gb" as
format type in SeqIO, and mapped to "genbank". This would add
consistency (if I retrieve a sequence using "gb" from Entrez, I expect
to save it using SeqIO with "gb"). I think it won't hurt to have "gb"
as an alias for "genbank" in SeqIO.

From peter at maubp.freeserve.co.uk Tue Apr 14 23:34:02 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 00:34:02 +0100
Subject: [Biopython-dev] Bio.Motif breaks epydoc?
Message-ID: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com>

Hi all,

I forgot to run epydoc when I did Biopython 1.50 beta, but I've just
tried and it is failing - apparently due to an issue with Bio.Motif.

First of all there are some warnings which we should probably address
now, before the Bio.Motif API is officially released:

Warning: Module Bio.Motif.AlignAceParser is shadowed by a variable
with the same name.
Warning: Module Bio.Motif.MEMEParser is shadowed by a variable with
the same name.
Warning: Module Bio.Motif.Motif is shadowed by a variable with the same name.

Ignoring these warnings for now, epydoc then crashes for me doing
Bio.Motif.Motif.Motif-class.html - which is a bigger problem. This was
using Epydoc version 3.0.1 (with python 2.6 on Ubuntu Jaunty). I'll
try another machine tomorrow just to make sure this isn't a local
setup issue.

Also we should probably fix these "shadowing warnings", they can make
the API confusing - in addition to confusing epydoc and making the API
doc pages confusing. GenomeDiagram is also doing this, and we should
try and fix that too:

Warning: Module Bio.Graphics.GenomeDiagram.Diagram is shadowed by a
variable with the same name.
Warning: Module Bio.Graphics.GenomeDiagram.FeatureSet is shadowed by a
variable with the same name.
Warning: Module Bio.Graphics.GenomeDiagram.GraphSet is shadowed by a
variable with the same name.
Warning: Module Bio.Graphics.GenomeDiagram.Track is shadowed by a
variable with the same name.

However it may be a bit late to fix the main source of these warnings,
Bio.PDB, without breaking things (i.e.
any fix may not be backwards compatible). See also this thread from
when I was running epydoc for Biopython 1.49 late last year:
http://lists.open-bio.org/pipermail/biopython-dev/2008-November/004810.html

Peter

From bugzilla-daemon at portal.open-bio.org Wed Apr 15 00:13:49 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Apr 2009 20:13:49 -0400
Subject: [Biopython-dev] [Bug 2811] EFetch returning ASN.1 not GenBank format for rettype=genbank
In-Reply-To:
Message-ID: <200904150013.n3F0DnkE021278@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2811

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-14 20:13 EST -------
(In reply to comment #3)
> I saw in the online Tutorial this small typo:
> "form Bio import SeqIO"

I'd fixed at least one occurrence of that error before, but you are
right - there were still two left in CVS. Thanks.

> I also have a question regarding this bug: What about adding "gb" as format
> type in SeqIO, and mapped to "genbank". This would add consistency (if I
> retrieve a sequence using "gb" from Entrez, I expect to save it using SeqIO
> with "gb"). I think it won't hurt to have "gb" as an alias for "genbank" in
> SeqIO.

The reason we have this bug in the first place was we used an
unofficial return type in EFetch in order to use the same format name
("genbank") in both Bio.Entrez and Bio.SeqIO - and this did make the
examples straightforward.

Adding aliases (such as "gb", "gp", and maybe also "genpept" for
"genbank") might make Bio.Entrez and Bio.SeqIO a little nicer to use
together after the changes forced by this bug. There are also several
aliases used in EMBOSS that would also make sense (e.g. "pfam" for
"stockholm"). On the down side, having more than one name risks
confusion.

Bring this up on the mailing list if you like. Leaving this bug as fixed.

Peter

From sbassi at clubdelarazon.org Wed Apr 15 02:05:53 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Tue, 14 Apr 2009 23:05:53 -0300
Subject: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank in SeqIO
Message-ID: <9e2f512b0904141905x69d10c48s95f5a808e1cc430f@mail.gmail.com>

As a follow up to bug 2811 where "gb" is now a valid name in
Bio.Entrez, I propose to add "gb" as an alias for "genbank" in SeqIO.
This proposal is backward compatible since previous code using
"genbank" is unaffected.

The rationale behind my request is consistency. Having fetched a
sequence with:

    Entrez.efetch(db=db,id=x,rettype='gb')

it seems consistent to save the sequence I got using rettype='gb'
with:

    SeqIO.write(myseq,filehandle,'gb')

Bugtrack chat related:

---------- Forwarded message ----------
From:
Date: Tue, Apr 14, 2009 at 9:13 PM
Subject: [Biopython-dev] [Bug 2811] EFetch returning ASN.1 not GenBank
format for rettype=genbank
To: biopython-dev at biopython.org

http://bugzilla.open-bio.org/show_bug.cgi?id=2811

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-14 20:13 EST -------
(In reply to comment #3)
> I saw in the online Tutorial this small typo:
> "form Bio import SeqIO"

I'd fixed at least one occurrence of that error before, but you are
right - there were still two left in CVS. Thanks.

> I also have a question regarding this bug: What about adding "gb" as format
> type in SeqIO, and mapped to "genbank". This would add consistency (if I
> retrieve a sequence using "gb" from Entrez, I expect to save it using SeqIO
> with "gb"). I think it won't hurt to have "gb" as an alias for "genbank" in
> SeqIO.
The reason we have this bug in the first place was that we used an unofficial return type in EFetch in order to use the same format name ("genbank") in both Bio.Entrez and Bio.SeqIO - and this did make the examples straightforward. Adding aliases (such as "gb", "gp", and maybe also "genpept" for "genbank") might make Bio.Entrez and Bio.SeqIO a little nicer to use together after the changes forced by this bug. There are also several aliases used in EMBOSS that would also make sense (e.g. "pfam" for "stockholm"). On the downside, having more than one name risks confusion. Bring this up on the mailing list if you like. Leaving this bug as fixed. Peter _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev -- Sebastián Bassi. Diplomado en Ciencia y Tecnología. Non standard disclaimer: READ CAREFULLY. By reading this email, you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies ("BOGUS AGREEMENTS") that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer.
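The alias idea proposed above can be sketched in a few lines. This is only an illustration, not Bio.SeqIO code: the table name FORMAT_ALIASES and the helper canonical_format are hypothetical, and the entries are just the aliases floated in this thread ("gb", "gp", "genpept" for "genbank", and "pfam" for "stockholm").

```python
# Hypothetical alias table; the names below are NOT part of Bio.SeqIO.
# The entries are the aliases suggested in this thread.
FORMAT_ALIASES = {
    "gb": "genbank",
    "gp": "genbank",
    "genpept": "genbank",
    "pfam": "stockholm",
}


def canonical_format(name):
    """Map an alias to its canonical format name (hypothetical helper).

    Names that are not aliases are returned unchanged, so existing
    code that passes "genbank" would be unaffected.
    """
    return FORMAT_ALIASES.get(name, name)
```

With a lookup like this applied once at the top of a parse or write function, the same "gb" string could be passed to both the download and the save step, at the cost of having more than one name per format.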
From biopython at maubp.freeserve.co.uk Wed Apr 15 09:40:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 15 Apr 2009 10:40:56 +0100 Subject: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank in SeqIO In-Reply-To: <9e2f512b0904141905x69d10c48s95f5a808e1cc430f@mail.gmail.com> References: <9e2f512b0904141905x69d10c48s95f5a808e1cc430f@mail.gmail.com> Message-ID: <320fb6e00904150240l1e6b424tdd34035256876226@mail.gmail.com> On Wed, Apr 15, 2009 at 3:05 AM, Sebastian Bassi wrote: > As a follow up to bug 2811 where "gb" is now a valid name in > Bio.Entrez, ... Just to note that in Entrez EFetch, using rettype=gb (and the related rettype=gp for proteins in GenPept format) has always been a valid argument (and in fact has always been the documented way to get a GenBank/GenPept file back). From my point of view it was a nice feature of Entrez EFetch that they used to (unofficially) support rettype=genbank, which was consistent with Bio.SeqIO. I suppose you could all try lobbying the NCBI to put Entrez EFetch back to the pre-Easter 2009 behavior, but realistically we'll just have to live with it. Now that Entrez EFetch doesn't support the unofficial rettype=genbank argument anymore, we have the current situation where you must use "gb" (or "gp") for Bio.Entrez but "genbank" for Bio.SeqIO. I agree this isn't so nice, but as I wrote on Bug 2811, I'm not keen on having aliases in Bio.SeqIO (but I may be in a minority here, hence suggesting a discussion). On the plus side, EMBOSS offers "gb" (and "ddbj") as alternative aliases for "genbank", so there is precedent. In a related approach, I suppose we could have Bio.SeqIO take "genbank" to mean GenBank or GenPept as determined from the file or the alphabet (as now), and add "gb" meaning (nucleotide) GenBank files, and "gp" meaning (protein) GenPept files. But again, this breaks the Python ideal of there being one clear way to do things (having multiple names for the same format).
Peter From peter at maubp.freeserve.co.uk Wed Apr 15 10:43:40 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Wed, 15 Apr 2009 11:43:40 +0100 Subject: [Biopython-dev] Bio.Motif breaks epydoc? In-Reply-To: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com> References: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com> Message-ID: <320fb6e00904150343u35f66911pd45520c399e2e5f1@mail.gmail.com> On Wed, Apr 15, 2009 at 12:34 AM, Peter wrote: > Hi all, > > I forgot to run epydoc when I did Biopython 1.50 beta, but I've just tried [...] > we should probably fix these "shadowing warnings", they can make > the API confusing - in addition to confusing epydoc and making the API > doc pages confusing. GenomeDiagram is also doing this, and we should > try and fix that too: > > Warning: Module Bio.Graphics.GenomeDiagram.Diagram is shadowed by a > variable with the same name. > Warning: Module Bio.Graphics.GenomeDiagram.FeatureSet is shadowed by a > variable with the same name. > Warning: Module Bio.Graphics.GenomeDiagram.GraphSet is shadowed by a > variable with the same name. > Warning: Module Bio.Graphics.GenomeDiagram.Track is shadowed by a variable > with the same name. The shadowing issue with GenomeDiagram should be OK in CVS now - this was an accidental side effect of renaming the internal modules as part of integrating GenomeDiagram into Biopython. I discussed this with Leighton (off list) and we agreed that renaming the modules was the simplest solution, and opted for adding an underscore which makes it explicit that the modules concerned are intended to be private. This doesn't affect the (intended) public API for GenomeDiagram.
Peter From mjldehoon at yahoo.com Wed Apr 15 10:57:43 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 15 Apr 2009 03:57:43 -0700 (PDT) Subject: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank in SeqIO In-Reply-To: <320fb6e00904150240l1e6b424tdd34035256876226@mail.gmail.com> Message-ID: <587664.25168.qm@web62402.mail.re1.yahoo.com> I think it's nice to be consistent with NCBI, and I don't see a big problem in having an alias for GenBank in SeqIO. At least, having "gb" in Bio.Entrez but "genbank" in Bio.SeqIO would go against the principle of least surprise. --Michiel. --- On Wed, 4/15/09, Peter wrote: > From: Peter > Subject: Re: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank in SeqIO > To: "Sebastian Bassi" > Cc: biopython-dev at lists.open-bio.org > Date: Wednesday, April 15, 2009, 5:40 AM > On Wed, Apr 15, 2009 at 3:05 AM, Sebastian Bassi > wrote: > > As a follow up to bug 2811 where "gb" is now > a valid name in > > Bio.Entrez, ... > > Just to note that in Entrez EFetch, using rettype=gb (and > the related > rettype=gp for proteins in GenPept format) has always been > a valid > argument (and in fact has always been the documented way to > get a > GenBank/GenPept file back). > > From my point of view it was a nice feature of Entrez > EFetch that they > used to (unofficially) support rettype=genbank, which was > consistent with > Bio.SeqIO. I suppose you could all try lobbying the NCBI to > put Entrez > EFetch back to the pre-Easter 2009 behavior, but > realistically we'll just > have to live with it. > > Now that Entrez EFetch doesn't support the unofficial > rettype=genbank > argument anymore, we have the current situation where you > must use > "gb" (or "gp") for Bio.Entrez but > "genbank" for Bio.SeqIO. I agree this > isn't so nice, but as I wrote on Bug 2811, I'm not > keen on having aliases > in Bio.SeqIO (but I may be in a minority here, hence > suggesting a > discussion).
On the plus side, EMBOSS offers > "gb" (and "ddbj") as > alternative aliases for "genbank", so there is > precedent. > > In a related approach, I suppose we could have Bio.SeqIO > take > "genbank" to mean GenBank or GenPept as > determined from the file > or the alphabet (as now), and add "gb" meaning > (nucleotide) GenBank > files, and "gp" meaning (protein) GenPept files. > > But again, this breaks the Python ideal of there being one > clear way to > do things (having multiple names for the same format). > > Peter From biopython at maubp.freeserve.co.uk Wed Apr 15 11:01:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 15 Apr 2009 12:01:54 +0100 Subject: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank in SeqIO In-Reply-To: <587664.25168.qm@web62402.mail.re1.yahoo.com> References: <320fb6e00904150240l1e6b424tdd34035256876226@mail.gmail.com> <587664.25168.qm@web62402.mail.re1.yahoo.com> Message-ID: <320fb6e00904150401q209ae99id6746f2a0c4e3532@mail.gmail.com> On Wed, Apr 15, 2009 at 11:57 AM, Michiel de Hoon wrote: > > I think it's nice to be consistent with NCBI, and I don't see a big > problem in having an alias for GenBank in SeqIO. At least, > having "gb" in Bio.Entrez but "genbank" in Bio.SeqIO would > go against the principle of least surprise. True. Would you support other aliases such as "pfam" for "stockholm", an alias supported in EMBOSS for this alignment format?
Peter From biopython at maubp.freeserve.co.uk Wed Apr 15 12:21:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 15 Apr 2009 13:21:17 +0100 Subject: [Biopython-dev] Tutorial & Cookbook In-Reply-To: <320fb6e00904130616t2cd5f029i582f4a2488d1182c@mail.gmail.com> References: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com> <93403.18413.qm@web62406.mail.re1.yahoo.com> <20090413125255.GC5429@sobchak.mgh.harvard.edu> <320fb6e00904130616t2cd5f029i582f4a2488d1182c@mail.gmail.com> Message-ID: <320fb6e00904150521q536fa27drd54db5e267876b15@mail.gmail.com> On Mon, Apr 13, 2009 at 2:16 PM, Peter wrote: > Speaking of doctests, we should do more of those in our docstrings. > For our online API documentation at > http://biopython.org/DIST/docs/api/ it would be nice to have the > python examples within the docstrings (including the doctests) shown > with syntax colouring. See > http://epydoc.sourceforge.net/manual-epytext.html#doctest-blocks for > an example, and compare this to > http://biopython.org/DIST/docs/api/Bio.Seq-module.html - maybe we need > to adjust our indentation? We currently explicitly use plain text for epydoc, rather than the default epytext markup language. If we switch to epytext (or at least a very simple subset of it, as some of the markup doesn't lend itself to friendly human readable docstrings) then we do get Python syntax colouring on the doctests. However, this will require some effort to fine tune the docstrings, and right now it makes a mess of them in some cases. Peter From biopython at maubp.freeserve.co.uk Wed Apr 15 13:19:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 15 Apr 2009 14:19:34 +0100 Subject: [Biopython-dev] docstrings, doctests and epydoc API pages Message-ID: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com> I've changed the thread title to something a little more specific.
On Wed, Apr 15, 2009 at 1:21 PM, Peter wrote: > We currently explicitly use plain text for epydoc, rather than the > default epytext markup language. If we switch to epytext (or at least > a very simple subset of it, as some of the markup doesn't lend itself > to friendly human readable docstrings) then we do get python syntax > colouring on the doctests. However, this will require some effort to > fine tune the docstrings, and right now it makes a mess of them in some > cases. As a test, I was able to update Bio/Seq.py to look good as epytext (while still being equally readable as plain text for when reading the API documentation at the python prompt with the help function). I uploaded one new page to the website: http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html The rest of the online API pages are currently still from Biopython 1.49, when epydoc parsed the docstrings as plain text. For another example with quite a few docstrings and doctests, look at: http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html What do you all think? I don't know if it will encourage more people to look at the API pages, but I certainly like the new version where the doctests are shown boxed with syntax colouring. Note that this would be a lot easier to do if epydoc supported "plaintext with doctests" as a markup type, or did this automatically when told the markup is just "plaintext" (as I had originally hoped for). I wonder how easy that would be to implement... it might be less work than checking all our API pages by hand and fixing our markup to follow epytext standards. See also: http://epydoc.sourceforge.net/epytext.html Peter From bartek at rezolwenta.eu.org Wed Apr 15 14:43:15 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Wed, 15 Apr 2009 16:43:15 +0200 Subject: [Biopython-dev] Bio.Motif breaks epydoc?
In-Reply-To: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com> References: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com> Message-ID: <8b34ec180904150743tffaca5dva03c58bbe2a0eade@mail.gmail.com> Hi, I'm working on Bio.Motif to fix this. I'll send a patch later today. cheers Bartek On Wed, Apr 15, 2009 at 1:34 AM, Peter wrote: > Hi all, > > I forgot to run epydoc when I did Biopython 1.50 beta, but I've just > tried and it is failing - apparently due to an issue with Bio.Motif. > > First of all there are some warnings which we should probably address > now, before the Bio.Motif API is officially released: > > Warning: Module Bio.Motif.AlignAceParser is shadowed by a variable with the > same name. > Warning: Module Bio.Motif.MEMEParser is shadowed by a variable with the > same name. > Warning: Module Bio.Motif.Motif is shadowed by a variable with the same > name. > > Ignoring these warnings for now, epydoc then crashes for me doing > Bio.Motif.Motif.Motif-class.html - which is a bigger problem. This was > using Epydoc version 3.0.1 (with python 2.6 on Ubuntu Jaunty). I'll > try another machine tomorrow just to make sure this isn't a local > setup issue. > > Also we should probably fix these "shadowing warnings", they can make > the API confusing - in addition to confusing epydoc and making the API > doc pages confusing. GenomeDiagram is also doing this, and we should > try and fix that too: > > Warning: Module Bio.Graphics.GenomeDiagram.Diagram is shadowed by a > variable with the same name. > Warning: Module Bio.Graphics.GenomeDiagram.FeatureSet is shadowed by a > variable with the same name. > Warning: Module Bio.Graphics.GenomeDiagram.GraphSet is shadowed by a > variable with the same name. > Warning: Module Bio.Graphics.GenomeDiagram.Track is shadowed by a variable > with the same name.
> > However it may be a bit late to fix the main source of these warnings, > Bio.PDB, without breaking things (i.e. any fix may not be backwards > compatible). See also this thread from when I was running epydoc for > Biopython 1.49 late last year: > http://lists.open-bio.org/pipermail/biopython-dev/2008-November/004810.html > > Peter > -- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From biopython at maubp.freeserve.co.uk Wed Apr 15 16:46:02 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 15 Apr 2009 17:46:02 +0100 Subject: [Biopython-dev] docstrings, doctests and epydoc API pages In-Reply-To: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com> References: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com> Message-ID: <320fb6e00904150946y45010c99u8508e8e6fd71eb75@mail.gmail.com> > As a test, I was able to update Bio/Seq.py to look good as epytext > (while still being equally readable as plain text for when reading the > API documentation at the python prompt with the help function). I > uploaded one new page to the website: > > http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html > > The rest of the online API pages are currently still from Biopython > 1.49, when epydoc parsed the docstrings as plain text. For another > example with quite a few docstrings and doctests, look at: > > http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html > > What do you all think? I don't know if it will encourage more people > to look at the API pages, but I certainly like the new version where > the doctests are shown boxed with syntax colouring. I've done Bio/SeqIO/QualityIO.py as well which proved harder due to lots of example FASTQ records embedded in the text. I've also worked out how to set the epydoc markup format on a per file basis with the __docformat__ setting (see also PEP 258).
This means we can gradually convert existing docstrings on a file by file basis - I'd suggest we focus on those with docstrings first, as they will benefit most from this. The only downside thus far is that the epytext markup seems rather fragile, and it is easy to "break" a docstring such that epydoc fails to render it nicely. At least epydoc falls back on plain text in this situation, so the text is still human readable. Tip: You need an EMPTY line before and after each doctest in order for it to work with epydoc as epytext markup. This is annoying as the doctest framework can cope with a line with spaces in it. Peter From sbassi at clubdelarazon.org Wed Apr 15 19:19:13 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Wed, 15 Apr 2009 16:19:13 -0300 Subject: [Biopython-dev] Proposal: Parse and read in SeqIO and NCBIXML Message-ID: <9e2f512b0904151219i60a8eda0xd06c9c86c690b6e3@mail.gmail.com> In SeqIO there is parse and read. Parse returns an iterable with all the records found in the file, while read returns only one record and is used when we know that the file has only one record. This is OK. But in NCBIXML, there is only parse. If the ncbiblast output has only one record (because it was made from 1 query), now we have to write: NCBIXML.parse(x).next() or iterate over a "list" of one member. I think it would be nice to add a read method to NCBIXML, such as the one in SeqIO. From biopython at maubp.freeserve.co.uk Wed Apr 15 21:30:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 15 Apr 2009 22:30:55 +0100 Subject: [Biopython-dev] Proposal: Parse and read in SeqIO and NCBIXML In-Reply-To: <9e2f512b0904151219i60a8eda0xd06c9c86c690b6e3@mail.gmail.com> References: <9e2f512b0904151219i60a8eda0xd06c9c86c690b6e3@mail.gmail.com> Message-ID: <320fb6e00904151430i19983fafq43ca1c9395579fb3@mail.gmail.com> On Wed, Apr 15, 2009 at 8:19 PM, Sebastian Bassi wrote: > In SeqIO there is parse and read.
Parse returns an iterable with all > the records found in the file, while read returns only one record and > is used when we know that the file has only one record. This is OK. > But in NCBIXML, there is only parse. If the ncbiblast output has > only one record (because it was made from 1 query), now we have to > write: > NCBIXML.parse(x).next() or iterate over a "list" of one member. I > think it would be nice to add a read method to NCBIXML, such as the > one in SeqIO. That seems sensible to me, we could probably squeeze that in for Biopython 1.50 too. Could you file an enhancement bug in case I forget about this? Peter From bugzilla-daemon at portal.open-bio.org Wed Apr 15 21:42:28 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 15 Apr 2009 17:42:28 -0400 Subject: [Biopython-dev] [Bug 2812] New: Adding read method to NCBIXML (just like SeqIO and SwissProt). Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2812 Summary: Adding read method to NCBIXML (just like SeqIO and SwissProt). Product: Biopython Version: 1.50b Platform: PC OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: sbassi at gmail.com NCBIXML should have a "read" method. It has a parse method that returns an iterable. If the ncbiblast output has only one record (because it was made from 1 query), now we have to write: NCBIXML.parse(x).next() or iterate over a "list" of one member. Other objects like SeqIO and SwissProt have both "read" and "parse" to deal with single-entry files. I think for the sake of consistency NCBIXML should also have a read method.
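The "exactly one record" behaviour requested here can be sketched independently of NCBIXML. This is a minimal illustration written against any iterable, in current Python syntax; the name read_single is hypothetical and is not a Biopython function.

```python
def read_single(records):
    """Return the only item of an iterable (hypothetical helper).

    Raises ValueError if the iterable yields no items or more than one,
    which is the behaviour expected of a "read" counterpart to "parse".
    """
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found") from None
    try:
        next(iterator)
    except StopIteration:
        # Exactly one item was available, which is the expected case.
        return first
    raise ValueError("More than one record found")
```

Compared with calling .next() on the parser directly, a helper like this also reports the case where a file unexpectedly holds several records, rather than silently returning the first one.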
From bugzilla-daemon at portal.open-bio.org Wed Apr 15 21:58:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 15 Apr 2009 17:58:15 -0400 Subject: [Biopython-dev] [Bug 2812] Adding read method to NCBIXML (just like SeqIO and SwissProt). In-Reply-To: Message-ID: <200904152158.n3FLwFYc027155@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2812 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-15 17:58 EST ------- Adding this should do the trick (based on the SeqIO.read function):

def read(handle, debug=0) :
    """Returns a single Blast record (assumes just one query).

    Use the Bio.Blast.NCBIXML.parse() function if you expect more than
    one BLAST record (i.e. if you have more than one query sequence).
    This function is for use when there is one and only one BLAST
    result.
    """
    iterator = parse(handle, debug)
    try :
        first = iterator.next()
    except StopIteration :
        first = None
    if first is None :
        raise ValueError("No records found in handle")
    try :
        second = iterator.next()
    except StopIteration :
        second = None
    if second is not None :
        raise ValueError("More than one record found in handle")
    return first

However, on reflection this needs some special testing for when there is a single query giving NO hits. I suspect that means the BLAST XML file will contain no records (at least that's my guess from recent versions - I haven't tried 2.2.20 yet). Would raising a ValueError in this situation be reasonable? From bartek at rezolwenta.eu.org Wed Apr 15 23:10:03 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 16 Apr 2009 01:10:03 +0200 Subject: [Biopython-dev] Bio.Motif breaks epydoc?
In-Reply-To: <320fb6e00904151514g2b9709fbj7c3de68d88db3f7d@mail.gmail.com> References: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com> <8b34ec180904150743tffaca5dva03c58bbe2a0eade@mail.gmail.com> <8b34ec180904151458k39fec681u53fcf64de9f7590d@mail.gmail.com> <320fb6e00904151514g2b9709fbj7c3de68d88db3f7d@mail.gmail.com> Message-ID: <8b34ec180904151610j4c62b7d7k51be600420aa73c@mail.gmail.com> Hi On Thu, Apr 16, 2009 at 12:14 AM, Peter wrote: > How about putting it in Bio/Motif/_Motif.py? That makes it clear > people are expected to access it via Bio.Motif.Motif, and not go via > the module. This is what Leighton and I did for GenomeDiagram which > was a very similar situation. Using an underscore denotes a private > module, so you could at a later date rename it to something else > without worrying about backwards compatibility (if you do change your > mind). > OK, I'll update the source tomorrow. > Are you planning any documentation to go with this? It would be nice > to include it with Biopython 1.50 but not essential. There is a cookbook-style tutorial in Docs/cookbook/motif. I'm not sure if it's ready for inclusion into the official tutorial. I'm hoping to add some more features soon and then it could be improved and included into the tutorial. cheers Bartek From winda002 at student.otago.ac.nz Thu Apr 16 03:30:43 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Thu, 16 Apr 2009 15:30:43 +1200 Subject: [Biopython-dev] Tutorial & Cookbook In-Reply-To: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com> References: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com> Message-ID: <49E6A663.90900@student.otago.ac.nz> Hi all, Sorry about the delay in replying to this, the Easter holidays are the last chance to play in the sun in the southern hemisphere.
Peter wrote: > David wrote: >> For me as a n00b the most useful resource by far has been the cookbook - > > When you said "cookbook", did you mean the Biopython Tutorial & Cookbook? > http://biopython.org/DIST/docs/tutorial/Tutorial.html > http://biopython.org/DIST/docs/tutorial/Tutorial.pdf > > There are a couple of other documents under the "Cookbook" folder here: > http://biopython.org/DIST/docs/cookbook/Restriction.html > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf > I really meant the Tutorial and Cookbook and specifically the examples in it. The first thing I tried to do with Biopython was parse BLAST outputs, and actually seeing a loop that would work and that I could tweak to get what I wanted from my BLAST results was really cool. From my perspective it makes sense to have a tutorial that walks through the main features with some relatively simple examples (like the existing one) with a separate cookbook highlighting what you can actually do when you bring everything together. I think this would fulfill the goals I was talking about in my original post (having nicely documented examples of Biopython in action out there for anyone who's looking) and adding a cookbook category to the wiki achieves this with the smallest impediment to participation. If anyone's counting I think that's +3 for wiki and -3 for a new html/pdf document. David From peter at maubp.freeserve.co.uk Thu Apr 16 10:56:23 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Apr 2009 11:56:23 +0100 Subject: [Biopython-dev] Bio.Motif breaks epydoc? In-Reply-To: <8b34ec180904150743tffaca5dva03c58bbe2a0eade@mail.gmail.com> References: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com> <8b34ec180904150743tffaca5dva03c58bbe2a0eade@mail.gmail.com> Message-ID: <320fb6e00904160356i68ca063ak370faa78eda63876@mail.gmail.com> On Wed, Apr 15, 2009 at 3:43 PM, Bartek Wilczynski wrote: > Hi, > > I'm working on Bio.Motif to fix this. [...]
> > cheers > Bartek Bartek has solved the epydoc problem in CVS now, and I have been able to build the API documentation using a clean installation of Biopython from CVS. :) It looks like the LaTeX equation in Bio/Motif/Motif.py (which was full of backslashes) was causing some of the trouble. Peter From biopython at maubp.freeserve.co.uk Thu Apr 16 16:45:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Apr 2009 17:45:13 +0100 Subject: [Biopython-dev] Where to put command line wrappers Message-ID: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> Hi all, We were recently discussing alignment tools like MUSCLE and ClustalW and putting together a set of command line wrappers under Bio.Align for them. I think Bio.Align.Applications was suggested to match Bio.Emboss.Applications. For EMBOSS we have a single file, Bio/Emboss/Applications.py, which has about 15 wrappers (all very similar as the EMBOSS applications are very consistent). This is nice in that all the wrappers are in the Bio.Emboss.Applications namespace. Bartek and I have been having a similar discussion for Motif tools, and whether the AlignACE wrappers should go in Bio.Motif.Applications to match. For now Bio.Motif has just one wrapper for AlignACE and its sister tool CompareACE. Now giving each tool-set its own file is possible (Bio/Motif/Applications/AlignAce.py) but would one (large) file be simpler? (i.e. Bio/Motif/Applications.py). I'm not sure how many wrappers we might eventually expect for multiple sequence alignments, maybe ten or twenty, mostly from different tool sets. Maybe Bio/Align/Applications/Muscle.py etc is the way to go, but we can then import all the command line objects under the Bio.Align.Applications namespace. Any comments?
Peter From biopython at maubp.freeserve.co.uk Thu Apr 16 17:16:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Apr 2009 18:16:10 +0100 Subject: [Biopython-dev] Where to put command line wrappers In-Reply-To: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> Message-ID: <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> On Thu, Apr 16, 2009 at 5:45 PM, Peter wrote: > Hi all, > > We were recently discussing alignment tools like MUSCLE and ClustalW > and putting together a set of command line wrappers under Bio.Align > for them. I think Bio.Align.Applications was suggested to match > Bio.Emboss.Applications. > > For EMBOSS we have a single file, Bio/Emboss/Applications.py, which > has about 15 wrappers (all very similar as the EMBOSS applications are > very consistent). This is nice in that all the wrappers are in the > Bio.Emboss.Applications namespace. > > Bartek and I have been having a similar discussion for Motif tools, > and whether the AlignACE wrappers should go in Bio.Motif.Applications to > match. For now Bio.Motif has just one wrapper for AlignACE and its sister > tool CompareACE. Now giving each tool-set its own file is possible > (Bio/Motif/Applications/AlignAce.py) but would one (large) file be > simpler? (i.e. Bio/Motif/Applications.py). > > I'm not sure how many wrappers we might eventually expect for multiple > sequence alignments, maybe ten or twenty, mostly from different tool > sets. Maybe Bio/Align/Applications/Muscle.py etc is the way to go, > but we can then import all the command line objects under the > Bio.Align.Applications namespace. > > Any comments?
For any that missed the thread last week, I'd like to link back to the end of my post: http://lists.open-bio.org/pipermail/biopython-dev/2009-March/005658.html I see introducing Bio.Align.Applications as a chance to get a more consistent approach to Biopython's command line wrappers established (replacing Bio.Clustalw). And as I wrote last month, I think we should focus on the Bio.Application command line wrapper object. For reasons explained in the linked email, I would want to rewrite Bio.Blast.NCBIStandalone in the same way (probably putting the command line wrapper classes in Bio.Blast.Applications, and if there is interest, include other variants like WUBlast). Are there any other wrappers not using Bio.Application which I have forgotten about? Bio/AlignAce/Applications.py does use Bio.Application, but we are planning to replace this module with Bio.Motif which gives us a chance to review the API without worrying too much about backwards compatibility. As part of moving it to Bio.Motif, I would remove the run methods from AlignAceCommandline and CompareAceCommandline (none of the other Biopython command line objects have them as far as I know), and also remove the AlignAce and CompareAce helper functions (in Bio/AlignAce/AlignAceStandalone.py and Bio/AlignAce/CompareAceStandalone.py). Internally these all call the Bio.Application.generic_run function, and return stdout and stderr as wrapped StringIO handles. Because it reads all the stdout and stderr output into memory, the Bio.Application.generic_run function is only suitable for tools which print very little to the console (or nothing, in which case the return values can be ignored). This method is useless on things like BLAST XML output to stdout which can be hundreds of megabytes in size.
I would generally discourage the use of the Bio.Application.generic_run function and instead we should give examples using the command line object together with the subprocess module, which lets the user choose which handles, if any, they care about (Python 2.3 doesn't have subprocess, but Biopython 1.50 will be the last release to care about this). Peter From bartek at rezolwenta.eu.org Thu Apr 16 17:37:29 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 16 Apr 2009 19:37:29 +0200 Subject: [Biopython-dev] Fwd: Where to put command line wrappers In-Reply-To: <8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com> References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> <8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com> Message-ID: <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com> Hi All, On Thu, Apr 16, 2009 at 5:45 PM, Peter wrote: > For EMBOSS we have a single file, Bio/Emboss/Applications.py, which > has about 15 wrappers (all very similar as the EMBOSS applications are > very consistent). This is nice in that all the wrappers are in the > Bio.Emboss.Applications namespace. > > Bartek and I have been having a similar discussion for Motif tools, > and whether the AlignACE wrappers should go in Bio.Motif.Applications to > match. For now Bio.Motif has just one wrapper for AlignACE and its sister > tool CompareACE. Now giving each tool-set its own file is possible > (Bio/Motif/Applications/AlignAce.py) but would one (large) file be > simpler? (i.e. Bio/Motif/Applications.py). > I think that there is a difference between EMBOSS and Bio.[Motif|Align]. In EMBOSS we have a very nicely commoditized set of tools with similar interfaces, while both for multiple alignment and motif searching the tools vary a lot.
In the case of multiple alignments this is only with respect to parameters and output format, while in motif searching there are also a lot of differences in the types of input (background models etc.). Also, quite likely the parsers for different tools will be written by different people. In this case, I think that it's much easier from the maintainer's point of view to have a directory with separate files rather than a single module. If people are scared by nested namespaces, we can import the important classes into the higher level. >> I'm not sure how many wrappers we might eventually expect for multiple >> sequence alignments, maybe ten or twenty, mostly from different tool >> sets. Maybe Bio/Align/Applications/Muscle.py etc is the way to go, >> but we can then import all the command line objects under the >> Bio.Align.Applications namespace. >> +1 from me. > > Bio/AlignAce/Applications.py does use Bio.Application, but we are > planning to replace this module with Bio.Motif which gives us a chance > to review the API without worrying too much about backwards > compatibility. As part of moving it to Bio.Motif, I would remove the > run methods from AlignAceCommandline and CompareAceCommandline (none > of the other Biopython command line objects have them as far as I > know), and also remove the AlignAce and CompareAce helper functions > (in Bio/AlignAce/AlignAceStandalone.py and > Bio/AlignAce/CompareAceStandalone.py). Internally these all call the > Bio.Application.generic_run function, and return stdout and stderr as > wrapped StringIO handles. > > Because it reads in all the stdout and stderr output into memory, > Bio.Application.generic_run function is only suitable for tools which > print very little to the console (or nothing, in which case the return > values can be ignored). This method is useless on things like BLAST > XML output to stdout which can be hundreds of megabytes in size.
I > would generally discourage the use of the Bio.Application.generic_run > function and instead we should give examples using the command line > object together with the subprocess module (Python 2.3 doesn't have > subprocess, but Biopython 1.50 will be the last release to care about > this) which lets the user choose what if any handles they care about. Motif finding programs usually output a lot less than there is input. Normally, you don't want to see more than 10 motifs and each contributes ~1kb so I don't see this as a huge problem in this case. To be honest, I'm not too keen on rewriting this old code (as well as MEME parser which was contributed by Jason Hackney). But if there are any new motif parsers (I'd like to have weeder and RSAT one day...) I'm happy to conform to any (reasonable) policy. cheers Bartek From biopython at maubp.freeserve.co.uk Thu Apr 16 18:53:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Apr 2009 19:53:03 +0100 Subject: [Biopython-dev] Fwd: Where to put command line wrappers In-Reply-To: <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com> References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> <8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com> <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com> Message-ID: <320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com> On 4/16/09, Bartek Wilczynski wrote: > Hi All, > > On Thu, Apr 16, 2009 at 5:45 PM, Peter wrote: > > For EMBOSS we have a single file, Bio/Emboss/Applications.py, which > > has about 15 wrappers (all very similar as the EMBOSS applications are > > very consistent). This is nice in that all the wrappers are in the > > Bio.Emboss.Applications namespace. > > > > Bartek and I have been having a similar discussion for Motif tools, > > and if the AlignAce wrappers should go in Bio.Motif.Applications to > > match.
For now Bio.Motif has just one wrapper for AlignACE and sister > > tool CompareACE. Now giving each tool-set its own file is possible > > (Bio/Motif/Applications/AlignAce.py) but would one (large) file be > > simpler? (i.e. Bio/Motif/Applications.py). > > > I think that there is a difference between EMBOSS and > Bio.[Motif|Align]. In EMBOSS we have a very nicely commoditized > set of tools with similar interfaces, while both for multiple > alignment and motif searching the tools vary a lot. In case of > multiple alignments this is only with respect to parameters and > output format, while in motif searching there is also a lot of > differences in the types of input (background models etc.). That is a good argument for using Bio/Align/Applications/XXX.py and Bio/Motif/Applications/XXX.py while also having Bio/Emboss/Applications.py > Also, quite likely the parsers for different tools will be written by > different people. Biopython's command line wrappers can be quite separate from the parsers - this is a natural break. One can be useful without the other, and keeping them separate allows you to, for example, use a Biopython wrapper with another parser, or vice versa. > In this case, I think that it's much easier from the maintainers point > of view to have a directory with separate files rather than a single > module. [...] True. > >> I'm not sure how many wrappers we might eventually expect for multiple > >> sequence alignments, maybe ten or twenty, mostly from different tool > >> sets. Maybe Bio/Align/Applications/Muscle.py etc is the way to go, > >> but we can then import all the command line objects under the > >> Bio.Align.Applications namespace. > > +1 from me. > > > Bio/AlignAce/Applications.py does use Bio.Application, but we are > > planning to replace this module with Bio.Motif which gives us a chance > > to review the API without worrying too much about backwards > > compatibility.
As part of moving it to Bio.Motif, I would remove the > > run methods from AlignAceCommandline and CompareAceCommandline (none > > of the other Biopython command line objects have them as far as I > > know), and also remove the AlignAce and CompareAce helper functions > > (in Bio/AlignAce/AlignAceStandalone.py and > > Bio/AlignAce/CompareAceStandalone.py). Internally these all call the > > Bio.Application.generic_run function, and return stdout and stderr as > > wrapped StringIO handles. > > > > Because it reads in all the stdout and stderr output into memory, > > Bio.Application.generic_run function is only suitable for tools which > > print very little to the console (or nothing, in which case the return > > values can be ignored). This method is useless on things like BLAST > > XML output to stdout which can be hundreds of megabytes in size. I > > would generally discourage the use of the Bio.Application.generic_run > > function and instead we should give examples using the command line > > object together with the subprocess module (Python 2.3 doesn't have > > subprocess, but Biopython 1.50 will be the last release to care about > > this) which lets the user choose what if any handles they care about. > > Motif finding programs usually output a lot less than there is input. Normally, > you don't want to see more than 10 motifs and each contributes ~1kb so > I don't see this as a huge problem in this case. I can see that the Bio.Application.generic_run function is often handy, but sometimes it is quite inappropriate. For AlignAce obviously it has sufficed. > To be honest, I'm not too keen on rewriting this old code (as well as > MEME parser which was contributed by Jason Hackney). But if there > will be any new motif parsers (I'd like to have weeder and RSAT one > day...) I'm happy to conform to any (reasonable) policy.
In the AlignAce case, in the above I wasn't suggesting rewriting, rather removing some of what I saw as redundant bits (in an effort at consistency). On reflection, perhaps the core Bio.Application.AbstractCommandline object might benefit from some "run" like methods? However they do morph it from a command line string representation into something bigger... feature creep! ;) Peter From biopython at maubp.freeserve.co.uk Thu Apr 16 20:16:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Apr 2009 21:16:04 +0100 Subject: [Biopython-dev] Where to put command line wrappers In-Reply-To: <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> Message-ID: <320fb6e00904161316m62162af2s506442502b73c8bc@mail.gmail.com> > I see introducing Bio.Align.Applications as chance to get a more > consistent approach to Biopython's command line wrappers established > (replacing Bio.Clustalw). And as I wrote last month, I think we > should focus on the Bio.Application command line wrapper object. For > reasons explained in the linked email, I would want to rewrite > Bio.Blast.NCBIStandalone in the same way (probably putting the command > line wrapper classes in Bio.Blast.Applications, and if there is > interest, include other variants like WUBlast). Are there any > other wrappers not using Bio.Application which I have forgotten about? Funnily enough, there already is a Bio.Blast.Applications module containing a wrapper for NCBI Fasta and NCBI blastall (a little out of date, also nothing for rpsblast or blastpgp). The older Bio.Blast.NCBIStandalone was never updated to use this internally. Here's a nice little job for after Biopython 1.50 is out...
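[Editorial sketch] The subprocess approach recommended in this thread can be illustrated roughly as follows. This is a minimal sketch and not Biopython code: stream_stdout and DEMO_COMMAND are hypothetical names, and a tiny Python child process stands in for a real tool such as BLAST. In real use the command would presumably come from the command line wrapper object's string representation, as the thread describes. The point is only that stdout is consumed one line at a time rather than being read into memory in one go.

```python
import subprocess
import sys

# Stand-in for a real tool (assumption): a child Python process printing three lines.
DEMO_COMMAND = [sys.executable, "-c", "for i in range(3): print('hit %d' % i)"]


def stream_stdout(command):
    """Yield the child's stdout one line at a time instead of buffering it all."""
    child = subprocess.Popen(
        command, stdout=subprocess.PIPE, universal_newlines=True
    )
    try:
        for line in child.stdout:
            yield line.rstrip("\n")
    finally:
        # Always release the pipe and reap the child, even if the caller stops early.
        child.stdout.close()
        return_code = child.wait()
    if return_code != 0:
        raise subprocess.CalledProcessError(return_code, command)


if __name__ == "__main__":
    for output_line in stream_stdout(DEMO_COMMAND):
        print(output_line)
```

Because the caller owns the loop, it can hand each line to a parser or write it to a file, and stderr is left alone unless the caller asks for it.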
Peter From bugzilla-daemon at portal.open-bio.org Thu Apr 16 22:40:53 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 16 Apr 2009 18:40:53 -0400 Subject: [Biopython-dev] [Bug 2809] Adding startswith and endswith methods to the Seq object In-Reply-To: Message-ID: <200904162240.n3GMerIj001589@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2809 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-16 18:40 EST ------- Checked in after discussion on the mailing list. Checking in Bio/Seq.py; /home/repository/biopython/biopython/Bio/Seq.py,v <-- Seq.py new revision: 1.76; previous revision: 1.75 done Checking in Tests/test_Seq_objs.py; /home/repository/biopython/biopython/Tests/test_Seq_objs.py,v <-- test_Seq_objs.py new revision: 1.5; previous revision: 1.4 done Marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 16 22:40:54 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 16 Apr 2009 18:40:54 -0400 Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string? In-Reply-To: Message-ID: <200904162240.n3GMesOq001602@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2351 Bug 2351 depends on bug 2809, which changed state. 
Bug 2809 Summary: Adding startswith and endswith methods to the Seq object http://bugzilla.open-bio.org/show_bug.cgi?id=2809 What |Old Value |New Value ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From winda002 at student.otago.ac.nz Fri Apr 17 05:31:45 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Fri, 17 Apr 2009 17:31:45 +1200 Subject: [Biopython-dev] Cookbook recipes on the wiki Message-ID: <49E81441.8040906@student.otago.ac.nz> Hi all, In the recent thread about the cookbook style entries in the tutorial everyone that had an opinion seemed to think it was best to incorporate these into the wiki. I've made a very small start at doing this with a category on the wiki (http://biopython.org/wiki/Category:Cookbook) and an example of what an entry in the cookbook might look like (http://biopython.org/wiki/Split_fasta_file). What do people think of these? If we decide this is the way to go then to have an entry turn up in the cookbook category you need only to add [[Category:Cookbook]] to an entry david From biopython at maubp.freeserve.co.uk Fri Apr 17 09:32:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 10:32:32 +0100 Subject: [Biopython-dev] Cookbook recipes on the wiki In-Reply-To: <49E81441.8040906@student.otago.ac.nz> References: <49E81441.8040906@student.otago.ac.nz> Message-ID: <320fb6e00904170232i75d88a73p5738e54a32de8bdf@mail.gmail.com> On Fri, Apr 17, 2009 at 6:31 AM, David Winter wrote: > Hi all, > > In the recent thread about the cookbook style entries in the tutorial > everyone that had an opinion seemed to think it was best to incorporate > these into the wiki. 
I've made a very small start at doing this with a > category on the wiki (http://biopython.org/wiki/Category:Cookbook) and an > example of what an entry in the cookbook might look like > (http://biopython.org/wiki/Split_fasta_file). > > What do people think of these? If we decide this is the way to go then to > have an entry turn up in the cookbook category you need only to add > [[Category:Cookbook]] to an entry We'd previously discussed using a cookbook category on the wiki, and that looks good: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005715.html I'm tempted to get rid of Category:Wiki_Documentation though - it seems a bit redundant, almost everything on the wiki is documentation. At least rename this to Category:Documentation? Peter From biopython at maubp.freeserve.co.uk Fri Apr 17 11:08:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 12:08:12 +0100 Subject: [Biopython-dev] sff reader In-Reply-To: <200904171246.46568.jblanca@btc.upv.es> References: <200904161146.28203.jblanca@btc.upv.es> <320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com> <200904171246.46568.jblanca@btc.upv.es> Message-ID: <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com> On Fri, Apr 17, 2009 at 11:46 AM, Jose Blanca wrote: > Hi Peter: > Here you have some code to read the sff files. Thanks - I'm not sure when I'll get to look at this, maybe next week. > For the time being it creates a dict for the sequences. I'm not sure about > how to integrate the generated data in BioPython. The sequence and > qualities should go to a SeqRecord, but there is also the information > about the clipping. For Bio.SeqIO, we would need to use a SeqRecord. Ideally we'd want to be able to read and write SFF files, and to do that we'll have to record all the essential annotation (i.e. clipping) somehow. Can you write SFF files? 
> For my work I use a kind of SeqRecord with a mask property and the > mask is a Location that shows which part of the sequence is ok. I don't > know if that's a valid model for BioPython. A mask could be done as a list of booleans, and we can treat it as another per-letter-annotation in the SeqRecord. I'm not sure if this is helpful or not. The Roche tools let you choose to extract trimmed reads as FASTA and QUAL, or untrimmed. Perhaps for reading SFF files with Bio.SeqIO we should get the user to choose between these options (e.g. format names "roche-sff" and "roche-sff-notrim")? Roche's FASTA files use upper case for the trimmed region, and lower case for the start/end which would get trimmed off. This is simple and we could do this for Biopython too - meaning you'd get the same data if you read the SFF file directly, or used Roche's FASTA+QUAL files with SeqIO. Note that when reading an SFF file directly, we should probably record the real trim data as well. > In the extract_sff script we generated three files: the fasta sequences, > the fasta qualities and the xml with the clippings. > One option could be to clip the sequences, but I don't know if that's the > desired behaviour in all cases. Trimming is probably a sensible default. If we do give the untrimmed sequences, we'd need a way to easily trim them. > There's also a couple of more tricks with the clipping. > In theory there's clip_qual and clip_adapter, but in the files > we've seen clip_adapter is always zero and clip_quality is used > instead for both quality and adapter. I think we could generate > one clipping combining both. Let me know what do you think. > Also take into account that in some cases the generated clipping > from the 454 software are just wrong. I'll need to learn more about the details before coming to any conclusions about how to deal with this information in Biopython. > If you want to forward this mail to the list you're more than welcome. 
> Best regards, > > Jose Blanca I've CC'd this reply to the list (without the python file attachments). Regards, Peter From chapmanb at 50mail.com Fri Apr 17 13:23:34 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 17 Apr 2009 09:23:34 -0400 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> <20090413133539.GD5429@sobchak.mgh.harvard.edu> <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com> Message-ID: <20090417132334.GA16092@sobchak.mgh.harvard.edu> Peter, Michiel and Jared; Thanks for the comments. My apologies for the late reply; I've been sick the past few days and am trying to catch back up. All your points from the different posts are consolidated below. [Michiel] > First, let's discuss how to represent the information contained in > a GFF file. SeqRecords are good if the GFF file is associated with > a Fasta file (or contains the sequence itself), but if not it seems > to be a bit awkward. How about the following (and I think Peter was > hinting at the same idea): > > The actual parser lives in Bio.GFF, and produces Bio.GFF.Record > objects that closely resemble the GFF file structure. For example, we > use the GFF specified fields ( > [attributes] [comments]) as attributes > to Bio.GFF.Record objects. The GFF parser right now is really generating SeqFeature objects for each GFF line; the top level SeqRecords are a collection that holds the individual features. The SeqFeature object is pretty similar to GFF and the generic object you are proposing. For instance, here is a GFF line and the relevant attributes from SeqFeature for the line: I Orfeome PCR_product 12759747 12764936 . - . 
PCR_product "mv_B0019.1" ; Amplified 1 ; Amplified 1 type: PCR_product location: [12759746:12764936] strand: -1 qualifiers: Key: amplified, Value: ['1'] Key: pcr_product, Value: ['mv_B0019.1'] Key: source, Value: ['Orfeome'] Things are a bit more generalized as key/value pairs in qualifiers, but the mapping is straightforward. My only suggestion would be that we add 'start' and 'end' accessors to SeqFeature that map to feature.location.nofuzzy_start and feature.location.nofuzzy_end, respectively. SeqFeature is more generalized, for GenBank location nastiness, but we should make the common simple case simpler. > Bio.SeqIO then uses the parser in Bio.GFF, and puts its information > in the appropriate fields of a SeqRecord. Here, we have to think > about two cases: Simply creating a SeqRecord based on the GFF file, > and adding the information in the GFF file as annotations to a > pre-existing set of SeqRecords. Yes. Both of these cases are handled now -- a user can supply a seed dictionary of SeqRecords to which SeqFeatures are added. Alternatively, a new SeqRecord is created for features if one is not provided. > Users then have a choice to use Bio.SeqIO to get SeqRecords, or > Bio.GFF to see the "raw" GFF data, depending on their needs. > > How does that sound? So we could have two ways to access the GFF file: - An iterator that returns SeqFeature objects for each line in the file. No other processing is done. - The higher level interface that we have been discussing, which adds them to records and nests features. My only question is concerning the nested features, like coding sequences. This is a very common GFF case (see http://www.sequenceontology.org/gff3.shtml; The Canonical Gene section for the GFF). A raw parser iterator cannot handle these as it needs to read multiple lines to build the nested feature. Is this still useful for the use cases you were thinking of? [Peter] > Hmm.
I'm with you on the idea that you may need to parse a GFF file > and a separate second file to get the actual sequence (e.g. a FASTA > file), but there is more than one way to combine the two. For a > single sequence, I was thinking more along the lines of: > > from Bio import SeqIO > record = SeqIO.read(open("NC_000913.fna"),"fasta") > record.features = SeqIO.read(open("NC_000913.gff"),"gff3").features Makes sense, but this only works for the case where you have a single FASTA sequence and a single GFF file describing one record. This is a special case for bacterial genomes and GFF from NCBI, but doesn't work for other Eukaryotic GFFs and SOLiD GFF files. Do we want different ways to use the parser for custom cases? > If the FASTA and GFF file apply to multiple sequences (e.g. a set of > contigs, rather than a single chromosome), and you have enough memory, > then something using dictionaries should work: > > from Bio import SeqIO > records = SeqIO.to_dict(SeqIO.parse(open("NC_000913.fna"),"fasta")) > for temp_rec in SeqIO.parse(open("NC_000913.gff"),"gff3") : > records[temp_rec.id].features = temp_rec.features Your intention makes good sense here, and this is more or less what it is doing under the covers. Could we think about expanding SeqIO to have functionality for this "adding to a record" case? Something like: from Bio import SeqIO records = SeqIO.to_dict(SeqIO.parse(open("NC_000913.fna"),"fasta")) records = SeqIO.add_to_dict(records, open("NC_000913.gff"), "gff3") This exposes less of the actual implementation details to the user. > As you can probably tell, I am concentrating on getting this to match > up well with the Bio.SeqIO framework. It will be nice to know the > underlying Bio.GFF module has more options, but I expect most people > to start with reading in a GFF file using Bio.SeqIO, and being able to > transfer their existing knowledge of SeqFeature objects learnt from > using Bio.SeqIO to read in GenBank files.
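[Editorial sketch] The "add features to a dictionary of records keyed by ID" idea discussed above can be sketched without any Biopython dependency. This is only an illustration under stated assumptions: parse_gff_line and add_features are hypothetical helpers, plain dictionaries stand in for SeqRecord and SeqFeature objects, the attribute handling is a simplification of the GFF2-style example quoted in this thread, and none of it is the actual GFF parser under discussion. It does show the coordinate shift from 1-based inclusive GFF positions to a 0-based, end-exclusive location.

```python
def parse_gff_line(line):
    """Turn one tab-separated GFF line into (record_id, feature_dict)."""
    columns = line.rstrip("\n").split("\t")
    record_id, source, feature_type, start, end, _score, strand = columns[:7]
    attributes = columns[8] if len(columns) > 8 else ""
    qualifiers = {"source": [source]}
    for pair in attributes.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition(" ")
        values = qualifiers.setdefault(key.lower(), [])
        value = value.strip().strip('"')
        if value not in values:  # assumption: a repeated key/value pair is kept once
            values.append(value)
    feature = {
        "type": feature_type,
        # GFF is 1-based and inclusive; store a 0-based, end-exclusive pair.
        "location": (int(start) - 1, int(end)),
        "strand": {"+": 1, "-": -1}.get(strand),
        "qualifiers": qualifiers,
    }
    return record_id, feature


def add_features(records, gff_lines):
    """Attach features to seed records by ID, creating records that are missing."""
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        record_id, feature = parse_gff_line(line)
        record = records.setdefault(record_id, {"id": record_id, "features": []})
        record["features"].append(feature)
    return records


EXAMPLE_LINE = (
    "I\tOrfeome\tPCR_product\t12759747\t12764936\t.\t-\t.\t"
    'PCR_product "mv_B0019.1" ; Amplified 1 ; Amplified 1'
)

if __name__ == "__main__":
    seed = {"I": {"id": "I", "features": []}}
    print(add_features(seed, [EXAMPLE_LINE])["I"]["features"][0])
```

In this sketch the first GFF column becomes the dictionary key, and calling add_features with an empty dictionary instead of a seed creates the record on demand, which mirrors the two cases described in the thread.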
I'm really glad you are thinking about it from this angle. The limit cases will be pretty common for real life work; most of the eukaryotic GFF dumps from Ensembl or wherever are quite large and are going to need some intelligent parsing to not get into memory issues. I worry that if we try to put this right on top of the existing SeqIO functionality, which deals with different kinds of files, we are going to clutter the interface. > I have pondered a "paired file iterator" function for Bio.SeqIO for > dealing with FASTA+QUAL, FASTA+GFF, FASTA+PPT, etc, which would take > TWO file handles and return SeqRecord objects. Interestingly all the > examples thus far are FASTA+other. Anyway, this could be added later > if need be. I like the way you did this for FASTA/Qual files but am not sure if it would map nicely to GFF for the memory reasons mentioned above. [MapReduce] > Are you aware of any alternatives to disco for doing map/reduce on > Python, and does that impact your design choices? Jared is right on; Hadoop is another MapReduce framework in wide use. More generally, I agree with you; the distributed portion needs to be generalized. Let's lock down the interface and local parsing, and then I will circle around on that again. Thanks all again for the thoughts, Brad From chapmanb at 50mail.com Fri Apr 17 13:30:20 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 17 Apr 2009 09:30:20 -0400 Subject: [Biopython-dev] docstrings, doctests and epydoc API pages In-Reply-To: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com> References: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com> Message-ID: <20090417133020.GB16092@sobchak.mgh.harvard.edu> Peter; > As a test, I was able to update Bio/Seq.py to look good as epytext > (while still being equally readable as plain text for when reading the > API documentation at the python prompt with the help function).
I > uploaded one new page to the website: > > http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html I had some time where I was obsessed with making Biopython look good in one of these API documentation modules (maybe HappyDoc, back in the day). Eventually I came to the sad conclusion that not too many people really seem to actually look at auto generated API docs. Most will fire up the code in their favorite editor if they are interested in the fine details. So, I like the way this looks, but my vote is it is probably not worth the cycles unless you are having fun with it. Also, be ready to get mad when the preferred method of markup changes from epytext to structuredtext or someothertext. Brad From biopython at maubp.freeserve.co.uk Fri Apr 17 13:45:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 14:45:11 +0100 Subject: [Biopython-dev] docstrings, doctests and epydoc API pages In-Reply-To: <20090417133020.GB16092@sobchak.mgh.harvard.edu> References: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com> <20090417133020.GB16092@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904170645u463d0b4ej8be66735bd2889e3@mail.gmail.com> On Fri, Apr 17, 2009 at 2:30 PM, Brad Chapman wrote: > Peter; > >> As a test, I was able to update Bio/Seq.py to look good as epytext >> (while still being equally readable as plain text for when reading the >> API documentation at the python prompt with the help function). I >> uploaded one new page to the website: >> >> http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html > > I had some time where I was obsessed with making Biopython look > good in one of these API documentation modules (maybe HappyDoc, > back in the day). Eventually I came to the sad conclusion that not too > many people really seem to actually look at auto generated API docs. > Most will fire up the code in their favorite editor if they are > interested in the fine details.
I agree that we don't push the API docs page enough (and indeed the corresponding built in documentation). This is a shame, as the built in docstrings should really get more attention. To try and raise their profile I've added links to the relevant pages from some of the wiki pages to try and encourage people to look at them. There is probably a cunning redirect link which will get the frames to work, but I've just used deep linking on these pages for now: http://biopython.org/wiki/Seq http://biopython.org/wiki/SeqRecord http://biopython.org/wiki/SeqIO http://biopython.org/wiki/AlignIO In fact, maybe we should simplify/remove these wiki pages and just push the API pages and relevant cookbook wiki pages in their place? Up until now, the wiki was nicer in that it looked better - with the epydoc mark up that isn't the case. The API docs should be the definitive documentation, in that they are kept up to date with the code, and are under version control. > So, I like the way this looks, but my vote is it is probably not > worth the cycles unless you are having fun with it. Also, be ready > to get mad when the preferred method of markup changes from epytext to > structuredtext or someothertext. I know what you mean - the novelty has worn off now, and doing further conversions is tedious. I like the idea of a tweak to epydoc to do "plain text + automatic markup of doctests". If that existed it would be a great default option for Biopython, as all I really care about for the markup is getting the python doctests to look good.
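[Editorial sketch] For readers unfamiliar with the "plain text docstring plus doctest" style discussed in this thread, here is a minimal illustration. The function toy_gc_fraction is a hypothetical example and not part of Biopython; the sketch assumes Python 3 division. The docstring reads as ordinary text at the Python prompt via help(), while the lines starting with ">>>" double as tests that the standard doctest module can run.

```python
import doctest


def toy_gc_fraction(sequence):
    """Return the fraction of G and C letters in a sequence string.

    The examples below are plain text to a human reader and
    executable checks to the doctest module:

    >>> toy_gc_fraction("GGCC")
    1.0
    >>> toy_gc_fraction("atgc")
    0.5
    >>> toy_gc_fraction("")
    0.0
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


if __name__ == "__main__":
    # Runs every docstring example in this module and reports failures.
    print(doctest.testmod())
```

No markup language is involved here, which is why this style survives a change of documentation tool.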
Peter From chapmanb at 50mail.com Fri Apr 17 14:02:41 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 17 Apr 2009 10:02:41 -0400 Subject: [Biopython-dev] Fwd: Where to put command line wrappers In-Reply-To: <320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com> References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> <8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com> <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com> <320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com> Message-ID: <20090417140241.GD16092@sobchak.mgh.harvard.edu> Hi all; [Where to put the commandline objects] > > I think that there is a difference between EMBOSS and > > Bio.[Motif|Align]. In EMBOSS we have a very nicely commoditized > > set of tools with similar interfaces, while both for multiple > > alignment and motif searching the tools vary a lot. In case of > > multiple alignments this is only with respect to parameters and > > output format, while in motif searching there is also a lot of > > differences in the types of input (background models etc.). > > That is a good argument for using Bio/Align/Applications/XXX.py and > Bio/Motif/Applications/XXX.py while also having > Bio/Emboss/Applications.py There is a natural tension between overgeneralizing and dumping too much into one file. At one end you have deeply nested Java-like directories with a few lines of code in each file. I tend towards the "more in a single file and less nesting" camp. My vote would be that if the Motif Applications file will only contain commandline wrappers, they could live in one file. [generic_run] > > Motif finding programs usually output a lot less than there is input. Normally, > > you don't want to see more than 10 motifs and each contributes ~1kb so > > I don't see this as a huge problem in this case.
> > I can see that Bio.Application.generic_run function is often handy, > but sometimes it is quite inappropriate. For AlignAce obviously it > has sufficed. Yeah, generic_run is not as generic as it should be. It does have a lot of hard fought logic for working with multiple python versions and windows/unix. Could we make generic_run appropriate for the big standard out cases so we don't end up duplicating that in Blast/Clustalw/wherever runners? Brad From biopython at maubp.freeserve.co.uk Fri Apr 17 14:13:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 15:13:18 +0100 Subject: [Biopython-dev] Fwd: Where to put command line wrappers In-Reply-To: <20090417140241.GD16092@sobchak.mgh.harvard.edu> References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> <8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com> <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com> <320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com> <20090417140241.GD16092@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904170713p30dc4d51m284c897ec1b9b505@mail.gmail.com> On Fri, Apr 17, 2009 at 3:02 PM, Brad Chapman wrote: >> >> I can see that Bio.Application.generic_run function is often handy, >> but sometimes it is quite inappropriate. For AlignAce obviously it >> has sufficed. > > Yeah, generic_run is not as generic as it should be. It does have a > lot of hard fought logic for working with multiple python versions > and windows/unix. Could we make generic_run appropriate for the big > standard out cases so we don't end up duplicating that in > Blast/Clustalw/wherever runners? The AlignAce and Clustalw wrappers already call generic_run internally - and for them it is fine. For BLAST, by default the output goes to standard out, so generic_run is a bad idea as this loads all of stdout into memory.
We may want to add some variations on generic_run for this kind of usage, or say it is up to the user to deal with it as appropriate for their setup. Peter From p.j.a.cock at googlemail.com Fri Apr 17 14:40:42 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 17 Apr 2009 15:40:42 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090417132334.GA16092@sobchak.mgh.harvard.edu> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> <20090413133539.GD5429@sobchak.mgh.harvard.edu> <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com> <20090417132334.GA16092@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904170740y17aec13cta2137d6d45f43e7c@mail.gmail.com> On Fri, Apr 17, 2009 at 2:23 PM, Brad Chapman wrote: > Things are a bit more generalized as key/value pairs in qualifiers, > but the mapping straightforward. My only suggestion would be that we > add 'start' and 'end' accessors to SeqFeature that map to > feature.location.nofuzzy_start and feature.location.nofuzzy_end, > respectively. SeqFeature is more generalized, for GenBank location > nastiness, but we should make the common simple case simpler. The SeqFeature already has start and end "attributes", but they are done with some magic in __getattr__, I was planning to update this to use a modern python property get. 
I can't find an enhancement bug on this so it may just have been on my mental to do list ;) See also, http://lists.open-bio.org/pipermail/biopython/2007-September/003703.html Peter From biopython at maubp.freeserve.co.uk Fri Apr 17 15:25:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 16:25:35 +0100 Subject: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank in SeqIO In-Reply-To: <320fb6e00904150401q209ae99id6746f2a0c4e3532@mail.gmail.com> References: <320fb6e00904150240l1e6b424tdd34035256876226@mail.gmail.com> <587664.25168.qm@web62402.mail.re1.yahoo.com> <320fb6e00904150401q209ae99id6746f2a0c4e3532@mail.gmail.com> Message-ID: <320fb6e00904170825w191f7c90p9cb7f175e3f5be17@mail.gmail.com> On Wed, Apr 15, 2009 at 12:01 PM, Peter wrote: > On Wed, Apr 15, 2009 at 11:57 AM, Michiel de Hoon wrote: >> >> I think it's nice to be consistent with NCBI, and I don't see a big >> problem in having an alias for GenBank in SeqIO. At least, >> having "gb" in Bio.Entrez but "genbank" in Bio.SeqIO would >> go against the principle of least surprise. > > True. OK, in the absence of any objections, I have added "gb" as an alias for "genbank" in Bio.SeqIO: Bio/SeqIO/__init__.py CVS revision 1.52 Tests/test_SeqIO_online.py revision 1.8 Tests/output/test_SeqIO_online CVS revision 1.4 Doc/Tutorial.tex CVS revision 1.229 Peter From mjldehoon at yahoo.com Fri Apr 17 16:44:34 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 17 Apr 2009 09:44:34 -0700 (PDT) Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090417132334.GA16092@sobchak.mgh.harvard.edu> Message-ID: <148828.89199.qm@web62404.mail.re1.yahoo.com> --- On Fri, 4/17/09, Brad Chapman wrote: > The GFF parser right now is really generating SeqFeature > objects for each GFF line; the top level SeqRecords are a > collection that holds the individual features. The SeqFeature > object is pretty similar to GFF and the generic object you are > proposing. 
For instance, here is a GFF line and the relevant > attributes from SeqFeature for the line: > > I Orfeome PCR_product 12759747 12764936 . - . PCR_product "mv_B0019.1" ; Amplified 1 ; Amplified 1 > > type: PCR_product > location: [12759746:12764936] > strand: -1 > qualifiers: > Key: amplified, Value: ['1'] > Key: pcr_product, Value: ['mv_B0019.1'] > Key: source, Value: ['Orfeome'] > Just to make I understand how this works, looking at your previous code example: >>> from BCBio.GFF.GFFParser import GFFAddingIterator >>> gff_iterator = GFFAddingIterator() >>> rec_dict = gff_iterator.get_all_features(gff_file) > The returned dictionary is like a dictionary from SeqIO.to_dict; > keys are ids and values are SeqRecords. What will be the key in rec_dict for the example GFF file above? Is that the "I" in the first column, as in rec_dict["I"] = a SeqRecord with the SeqFeature you described above? Best, --Michiel From bugzilla-daemon at portal.open-bio.org Fri Apr 17 17:03:59 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 17 Apr 2009 13:03:59 -0400 Subject: [Biopython-dev] [Bug 2812] Adding read method to NCBIXML (just like SeqIO and SwissProt). In-Reply-To: Message-ID: <200904171703.n3HH3xrq015467@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2812 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-17 13:03 EST ------- Fixed in CVS (without the read/parse typo in the docstring suggested in comment 1). 
Checking in Bio/Blast/NCBIXML.py; /home/repository/biopython/biopython/Bio/Blast/NCBIXML.py,v <-- NCBIXML.py new revision: 1.22; previous revision: 1.21 done Checking in Tests/test_NCBIXML.py; /home/repository/biopython/biopython/Tests/test_NCBIXML.py,v <-- test_NCBIXML.py new revision: 1.7; previous revision: 1.6 done Checking in Tests/test_NCBI_qblast.py; /home/repository/biopython/biopython/Tests/test_NCBI_qblast.py,v <-- test_NCBI_qblast.py new revision: 1.6; previous revision: 1.5 done Checking in Tests/output/test_NCBIXML; /home/repository/biopython/biopython/Tests/output/test_NCBIXML,v <-- test_NCBIXML new revision: 1.6; previous revision: 1.5 done RCS file: /home/repository/biopython/biopython/Tests/Blast/blastp_no_hits.xml,v done Checking in blastp_no_hits.xml; /home/repository/biopython/biopython/Tests/Blast/blastp_no_hits.xml,v <-- blastp_no_hits.xml initial revision: 1.1 done -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Fri Apr 17 17:16:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 18:16:55 +0100 Subject: [Biopython-dev] Plan for Biopython 1.50 (final) In-Reply-To: <320fb6e00904131106s70028e9el56d334fa732bddf8@mail.gmail.com> References: <320fb6e00904131106s70028e9el56d334fa732bddf8@mail.gmail.com> Message-ID: <320fb6e00904171016q40f99a3fjda75b3add17ab8c0@mail.gmail.com> On Mon, Apr 13, 2009 at 7:06 PM, Peter wrote: > Apart from these two points (documentation and EFetch), are there any > issues regarding doing the official release of Biopython 1.50? I > think we can aim for a release this week... Other than a little more documentation polishing, I think we are ready for Biopython 1.50 now.
Thanks Bartek and Tiago for dealing with the Bio.Motif and Bio.PopGen issues I raised so promptly :) Are there any release blocking issues I've missed? I was going to do it this evening before leaving work, but I'm tired and wouldn't want to make any mistakes. Instead, I aim to do the release this weekend, and make the Windows installers at some point on Monday. The more rain we get this weekend, the more time I'll try and spend on the docs first - otherwise the lawn needs cutting... ;) I'll send out a warning email before hand - but until then please feel free to check in documentation changes (including docstrings and doctests). We still don't have much on GenomeDiagram in the main tutorial, but I have some plans to improve this. We also don't have the misc GC related functions from the standalone GenomeDiagram which we might add to Bio.SeqUtils, but I think that can wait till Biopython 1.51. Bartek has made a start on the Bio.Motif documentation as a separate "cookbook" LaTeX file (plus we have some basic docstrings done): http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Doc/cookbook/motif/motif.tex?cvsroot=biopython For the long term I think we want to get rid of these misc "cookbook" documents (by moving their content), to focus on the main document ("Biopython Tutorial and Cookbook"), the docstrings, and in future cookbook entries on the wiki (which can be more user driven). 
Peter From ogmaciel at gnome.org Fri Apr 17 17:23:07 2009 From: ogmaciel at gnome.org (Og Maciel) Date: Fri, 17 Apr 2009 13:23:07 -0400 Subject: [Biopython-dev] Plan for Biopython 1.50 (final) In-Reply-To: <320fb6e00904171016q40f99a3fjda75b3add17ab8c0@mail.gmail.com> References: <320fb6e00904131106s70028e9el56d334fa732bddf8@mail.gmail.com> <320fb6e00904171016q40f99a3fjda75b3add17ab8c0@mail.gmail.com> Message-ID: <98a1f5280904171023p13c1e7a9o5686b451fd3da61c@mail.gmail.com> On Fri, Apr 17, 2009 at 1:16 PM, Peter wrote: >> Apart from these two points (documentation and EFetch), are there any >> issues regarding doing the official release of Biopython 1.50? I >> think we can aim for a release this week... Cool! I have 1.50b packaged for Foresight Linux and will update it once the new version is released. :) Cheers, -- Og B. Maciel omaciel at foresightlinux.org ogmaciel at gnome.org ogmaciel at ubuntu.com GPG Keys: D5CFC202 http://www.ogmaciel.com (en_US) http://blog.ogmaciel.com (pt_BR) From chapmanb at 50mail.com Fri Apr 17 20:05:58 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 17 Apr 2009 16:05:58 -0400 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904170740y17aec13cta2137d6d45f43e7c@mail.gmail.com> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> <20090413133539.GD5429@sobchak.mgh.harvard.edu> <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com> <20090417132334.GA16092@sobchak.mgh.harvard.edu> <320fb6e00904170740y17aec13cta2137d6d45f43e7c@mail.gmail.com> Message-ID: <20090417200558.GC19290@sobchak.mgh.harvard.edu> Peter and Michiel; [start/end attributes on SeqFeatures] > The SeqFeature already has start and end "attributes", but they are > done with some magic in __getattr__, I was planning to update this > to use a modern python property get.
I can't find an enhancement > bug on this so it may just have been on my mental to do list ;) These attributes are on the FeatureLocation object. The whole location hierarchy is a bit complicated to represent all of the GenBank fuzziness, but it looks like: SeqFeature -- has_a --> FeatureLocation -- has_two --> Positions (start, end) So if you wanted to get a non-fuzzy start end, you need to do: feature.location.nofuzzy_start, feature.location.nofuzzy_end Your way above would be: feature.location.start.position So, I was thinking of hiding this Location/Position stuff from the end user and just adding a start and end attribute directly on the feature. For everyone that never touches fuzziness, this would make more sense; it is also in line with making SeqFeature like Michiel's proposed GFFRecord object. [GFF to SeqFeature example] > > I Orfeome PCR_product 12759747 12764936 . - . PCR_product "mv_B0019.1" ; Amplified 1 ; Amplified 1 > > > > type: PCR_product > > location: [12759746:12764936] > > strand: -1 > > qualifiers: > > Key: amplified, Value: ['1'] > > Key: pcr_product, Value: ['mv_B0019.1'] > > Key: source, Value: ['Orfeome'] > > > > Just to make I understand how this works, looking at your previous code example: > > >>> from BCBio.GFF.GFFParser import GFFAddingIterator > >>> gff_iterator = GFFAddingIterator() > >>> rec_dict = gff_iterator.get_all_features(gff_file) > > > The returned dictionary is like a dictionary from SeqIO.to_dict; > > keys are ids and values are SeqRecords. > > What will be the key in rec_dict for the example GFF file above? Is that the "I" in the first column, as in > > rec_dict["I"] = a SeqRecord with the SeqFeature you described above? Yes, that is exactly right. If we decide to have a SeqFeature iterator, we should also add a 'rec_id' key/value pair to the qualifiers that would map to the record -- chromosome 'I' in this case. This would let the user do the mapping themselves. 
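As a rough illustration of "doing the mapping themselves", the sketch below groups a flat stream of features by a 'rec_id' qualifier. Plain dictionaries stand in for SeqFeature objects, and the function name and the exact qualifier layout are assumptions made for illustration, not the parser's actual output.

```python
from collections import defaultdict


def group_features_by_record(features):
    """Group flat feature dicts by the record ID stored in their qualifiers.

    Assumes each feature carries a 'rec_id' qualifier whose value is a
    one-item list, matching how the other qualifiers above are stored.
    """
    grouped = defaultdict(list)
    for feature in features:
        rec_id = feature["qualifiers"]["rec_id"][0]
        grouped[rec_id].append(feature)
    return dict(grouped)


features = [
    # Values taken from the GFF line quoted above (chromosome 'I').
    {"type": "PCR_product", "location": (12759746, 12764936), "strand": -1,
     "qualifiers": {"rec_id": ["I"], "source": ["Orfeome"]}},
    # Made-up second feature, only here to show a second record ID.
    {"type": "PCR_product", "location": (100, 200), "strand": 1,
     "qualifiers": {"rec_id": ["II"], "source": ["example"]}},
]
by_record = group_features_by_record(features)
```

The result is keyed like rec_dict in the earlier example, with a list of features per record ID rather than a SeqRecord.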
Brad From biopython at maubp.freeserve.co.uk Fri Apr 17 22:12:14 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Apr 2009 23:12:14 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090417200558.GC19290@sobchak.mgh.harvard.edu> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> <20090413133539.GD5429@sobchak.mgh.harvard.edu> <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com> <20090417132334.GA16092@sobchak.mgh.harvard.edu> <320fb6e00904170740y17aec13cta2137d6d45f43e7c@mail.gmail.com> <20090417200558.GC19290@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904171512n3ff0090dy8042b1c860cf5a2c@mail.gmail.com> On 4/17/09, Brad Chapman wrote: > Peter and Michiel; > > [start/end attributes on SeqFeatures] > > > The SeqFeature already has start and end "attributes", but they are > > done with some magic in __getattr__, I was planning to update this > > to use a modern python property get. I can't find an enhancement > > bug on this so it may just have been on my mental to do list ;) > > These attributes are on the FeatureLocation object. Sorry - yeah, you're right. I wasn't paying enough attention. > The whole location hierarchy is a bit complicated to represent all > of the GenBank fuzziness, but it looks like: > > SeqFeature -- has_a --> FeatureLocation -- has_two --> Positions (start, end) > And that's the nice case without sub-features and joins ;) Peter From mjldehoon at yahoo.com Sat Apr 18 04:28:09 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 17 Apr 2009 21:28:09 -0700 (PDT) Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090417200558.GC19290@sobchak.mgh.harvard.edu> Message-ID: <252312.21376.qm@web62408.mail.re1.yahoo.com> I tried this code to read a GFF file from miRBase, containing the genome positions of microRNAs in human. 
The good news is that the code works as advertised. At the same time, I think that for a basic parser (as opposed to a parser integrated with Bio.SeqIO), the SeqFeatures are way too complicated for my mind. This is how I used the parser: >>> from GFFParser import GFFAddingIterator >>> gff_iterator = GFFAddingIterator() >>> rec_dict = gff_iterator.get_all_features("Data/miRBase/hsa.gff") # It would be better to pass a handle to get_all_features # instead of a file name. The file may be gzipped or bzipped, # or the user may want to read it from the internet. >>> len(rec_dict['1']) 50 # fifty microRNAs on chromosome 1 >>> rec_dict['1'].features[0] Bio.SeqFeature.SeqFeature(Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366)), type='miRNA', strand=1, id='hsa-mir-1302-2') >>> rec_dict['1'].features[0].qualifiers['ACC'] ['MI0006363'] >>> rec_dict['1'].features[0].qualifiers['ID'] ['hsa-mir-1302-2'] # This is still OK, though a bit more deeply nested than I would like. >>> rec_dict['1'].features[0].location Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366)) >>> rec_dict['1'].features[0].location._start Bio.SeqFeature.ExactPosition(20228) # Am I supposed to use _start here? It looks like a private variable. >>> rec_dict['1'].features[0].location._start.position 20228 # Too much typing for everyday usage. I don't think that I would use it. For a basic parser, I like the _gff_line_map function much better. 
Applied to the first line in the GFF file, it returns >>> result = _gff_line_map(line, params) [('parent', {'quals': {'ACC': ['MI0006363'], 'ID': ['hsa-mir-1302-2']}, 'rec_id': '1', 'location': [20228, 20366], 'is_gff2': False, 'type': 'miRNA', 'id': 'hsa-mir-1302-2', 'strand': 1})] >>> print result[0][1] {'quals': {'ACC': ['MI0006363'], 'ID': ['hsa-mir-1302-2']}, 'rec_id': '1', 'location': [20228, 20366], 'is_gff2': False, 'type': 'miRNA', 'id': 'hsa-mir-1302-2', 'strand': 1} which is exactly what I need, in (almost) the places where I'd expect them. --Michiel From biopython at maubp.freeserve.co.uk Sat Apr 18 13:54:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 18 Apr 2009 14:54:44 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <252312.21376.qm@web62408.mail.re1.yahoo.com> References: <20090417200558.GC19290@sobchak.mgh.harvard.edu> <252312.21376.qm@web62408.mail.re1.yahoo.com> Message-ID: <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> On Sat, Apr 18, 2009 at 5:28 AM, Michiel de Hoon wrote: > > This is how I used the parser: > >>>> from GFFParser import GFFAddingIterator >>>> gff_iterator = GFFAddingIterator() >>>> rec_dict = gff_iterator.get_all_features("Data/miRBase/hsa.gff") > # It would be better to pass a handle to get_all_features > # instead of a file name. The file may be gzipped or bzipped, > # or the user may want to read it from the internet. >>>> len(rec_dict['1']) > 50 > # fifty microRNAs on chromosome 1 >>>> rec_dict['1'].features[0] > Bio.SeqFeature.SeqFeature(Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366)), type='miRNA', strand=1, id='hsa-mir-1302-2') >>>> rec_dict['1'].features[0].qualifiers['ACC'] > ['MI0006363'] >>>> rec_dict['1'].features[0].qualifiers['ID'] > ['hsa-mir-1302-2'] > # This is still OK, though a bit more deeply nested than I would like. 
>>>> rec_dict['1'].features[0].location > Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366)) >>>> rec_dict['1'].features[0].location._start > Bio.SeqFeature.ExactPosition(20228) > # Am I supposed to use _start here? It looks like a private variable. >>>> rec_dict['1'].features[0].location._start.position > 20228 No, you are meant to use start, e.g.: >>> print rec_dict['1'].features[0].location.start 20228 >>> rec_dict['1'].features[0].location.start.position 20228 This is what I was talking about in the earlier email on this thread, the SeqFeature has start and end "attributes", but they are done with some magic in __getattr__. I plan to update this to use a modern python property get (so they will show up in dir(...) and we can give them docstring), but don't recall filing a bug on this issue yet. See also, http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005734.html http://lists.open-bio.org/pipermail/biopython/2007-September/003703.html Related to this, perhaps the position classes (and in particular the ExactPosition class) should have an __int__ method, so you can use the object directly (rather than messing about with subproperties like .position). This should let you do the following (untested): record = ... #e.g. a SeqRecord from a GFF file or GenBank feature = record.features[5] #for example sub_seq = my_seq[feature.location.start:feature.location.end] Coupled with a variation of Brad's suggestion of adding start and end properties to the SeqFeature, if we make these act as proxies for feature.location.start and feature.location.end that would become just: record = ... 
feature = record.features[5] #for example sub_seq = my_seq[feature.start:feature.end] The fuzzy locations (from GenBank or EMBL files) would need a bit of care, ideally matching how the NCBI do things (easily checked by taking an NCBI GenBank files and comparing it to the simpler locations given in their FASTA, PTT or GFF files). Peter From bugzilla-daemon at portal.open-bio.org Sat Apr 18 21:45:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 18 Apr 2009 17:45:12 -0400 Subject: [Biopython-dev] [Bug 2814] New: Use properties instead of __getattr__ in FeatureLocation Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2814 Summary: Use properties instead of __getattr__ in FeatureLocation Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk The SeqFeature's location (i.e. the FeatureLocation object) has start and end "attributes", but they are done with some magic in __getattr__. We should use a modern python property get (so they will show up in dir(...) and we can give them docstrings) See also, http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005781.html http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005734.html http://lists.open-bio.org/pipermail/biopython/2007-September/003703.html Patch to follow -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Sat Apr 18 21:47:59 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 18 Apr 2009 17:47:59 -0400 Subject: [Biopython-dev] [Bug 2814] Use properties instead of __getattr__ in FeatureLocation In-Reply-To: Message-ID: <200904182147.n3ILlx88027985@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2814 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-18 17:47 EST ------- Created an attachment (id=1278) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1278&action=view) Patch to Bio/SeqFeature.py This doesn't try and change the functionality or API at all. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Sat Apr 18 21:48:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 18 Apr 2009 22:48:58 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> References: <20090417200558.GC19290@sobchak.mgh.harvard.edu> <252312.21376.qm@web62408.mail.re1.yahoo.com> <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> Message-ID: <320fb6e00904181448u49e92549t70c3a23c1a0c4d4f@mail.gmail.com> On Sat, Apr 18, 2009 at 2:54 PM, Peter wrote: > This is what I was talking about in the earlier email on this thread, > the SeqFeature has start and end "attributes", but they are done > with some magic in __getattr__. I plan to update this to use a > modern python property get (so they will show up in dir(...) and > we can give them docstring), but don't recall filing a bug on this > issue yet. Filed now, Bug 2814 - Use properties instead of __getattr__ in FeatureLocation http://bugzilla.open-bio.org/show_bug.cgi?id=2814 Something for after Biopython 1.50 is done. 
Peter From bugzilla-daemon at portal.open-bio.org Sat Apr 18 22:42:57 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 18 Apr 2009 18:42:57 -0400 Subject: [Biopython-dev] [Bug 2814] Use properties instead of __getattr__ in FeatureLocation In-Reply-To: Message-ID: <200904182242.n3IMgvOq031013@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2814 ------- Comment #2 from eric.talevich at gmail.com 2009-04-18 18:42 EST ------- (In reply to comment #1) > Created an attachment (id=1278) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1278&action=view) [details] > Patch to Bio/SeqFeature.py Peter, you mentioned on the mailing list that this will be applied after the 1.50 release. Since Py2.3 support ends there also, you could use the newer decorator style instead: start = property(fget= lambda self : self._start, doc="Start location (possibly a fuzzy position).") becomes: @property def start(self): """Start location (possibly a fuzzy position).""" return self._start I think this is the preferred style for Python 2.4 and later. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Apr 20 09:03:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Apr 2009 10:03:47 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.50 Message-ID: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> On Fri, Apr 17, 2009 at 6:16 PM, Peter wrote: > > Are there any release blocking issues I've missed? I'm going to assume not. > I was going to do it this evening before leaving work, but I'm tired and > wouldn't want to make any mistakes. Instead, I aim to do the release > this weekend, and make the Windows installers at some point on > Monday.
The more rain we get this weekend, the more time I'll try > and spend on the docs first - otherwise the lawn needs cutting... ;) Well, the good news is it didn't rain, I had a nice weekend, and cut half the grass. The bad news is obviously I didn't do the Biopython release, although I did work on the documentation. In addition to the nice weather, my other excuse is I had forgotten I'd upgraded my old laptop so I didn't have a Python 2.3 machine handy at home. ;) > I'll send out a warning email beforehand - but until then please > feel free to check in documentation changes (including docstrings > and doctests). This is the CVS freeze email. I'm going to do the release in the next hour or two. > We still don't have much on GenomeDiagram in the main tutorial, but I > have some plans to improve this. [...] I got most of that done at the weekend :) Peter From biopython at maubp.freeserve.co.uk Mon Apr 20 12:11:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Apr 2009 13:11:18 +0100 Subject: [Biopython-dev] CVS freeze for Biopython 1.50 In-Reply-To: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> Message-ID: <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> On Mon, Apr 20, 2009 at 10:03 AM, Peter wrote: > > This is the CVS freeze email. I'm going to do the release in the next > hour or two. > Well, it's done. CVS is tagged, the packages are online, I've updated the wiki, the epydoc API pages, and the online copy of the tutorial. You can use CVS again, but just in case there are any surprises in the next few days which would force a re-release, minor changes only please. That just leaves the official announcement on the news page (which will be echoed onto twitter automatically) and to the mailing lists. I'll circulate a draft after lunch, unless one of our news coordinator volunteers wants to write something?
I realize I should have suggested this earlier as this is short notice, and you are in different time zones, but it's worth a try. For reference, here is the 1.50 beta announcement, http://news.open-bio.org/news/2009/04/biopython-150-beta-released/ I can't find anything on http://lists.open-bio.org/pipermail/biopython-announce/ or the main list, so it looks like I forgot that :( This might explain the relatively low amount of feedback... The NEWS and DEPRECATED files are here: http://biopython.open-bio.org/SRC/biopython/NEWS http://biopython.open-bio.org/SRC/biopython/DEPRECATED Peter From bugzilla-daemon at portal.open-bio.org Mon Apr 20 12:18:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 20 Apr 2009 08:18:33 -0400 Subject: [Biopython-dev] [Bug 2815] New: Bio.Application MUSCLE command line interface Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2815 Summary: Bio.Application MUSCLE command line interface Product: Biopython Version: Not Applicable Platform: PC OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: cymon.cox at gmail.com Attached is a module to run the MUSCLE alignment programme based on the Bio.Applications interface. A couple of helper functions are included MuscleAlign and ProfileMuscleAlign. Discussion on the dev-list suggests that helper functions are superfluous. Maybe, but I thought I'd include them anyway. A couple of unittests are included for the helper funcs. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Mon Apr 20 12:19:38 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 20 Apr 2009 08:19:38 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904201219.n3KCJcSu009533@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #1 from cymon.cox at gmail.com 2009-04-20 08:19 EST ------- Created an attachment (id=1279) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1279&action=view) MUSCLE module -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Apr 20 12:21:19 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 20 Apr 2009 08:21:19 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904201221.n3KCLJjf009683@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #2 from cymon.cox at gmail.com 2009-04-20 08:21 EST ------- Created an attachment (id=1280) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1280&action=view) unittest for MuscleAlign -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From chapmanb at 50mail.com Mon Apr 20 13:29:46 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 20 Apr 2009 09:29:46 -0400 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> References: <20090417200558.GC19290@sobchak.mgh.harvard.edu> <252312.21376.qm@web62408.mail.re1.yahoo.com> <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> Message-ID: <20090420132946.GB29652@sobchak.mgh.harvard.edu> Michiel; Thanks for trying this out and your thoughts. > > # It would be better to pass a handle to get_all_features > > # instead of a file name. The file may be gzipped or bzipped, > > # or the user may want to read it from the internet. Yes, this is the way it was originally designed. I changed to files to be consistent with a distributed Disco implementation, which needs to be fed a file instead of a handle. Your suggestion is a good one. Let me give some thought to separating the interfaces, as handles would be more consistent with the rest of Biopython. [accessing start and end] > >>> print rec_dict['1'].features[0].location.start > 20228 > >>> rec_dict['1'].features[0].location.start.position > 20228 [...] > Coupled with a variation of Brad's suggestion of adding start > and end properties to the SeqFeature, if we make these act > as proxies for feature.location.start and feature.location.end > that would become just: > > record = ... > feature = record.features[5] #for example > sub_seq = my_seq[feature.start:feature.end] Thanks Peter, that's exactly right. Accessing the start and end coordinates in SeqFeatures is unnecessarily cumbersome right now, but can be fixed fairly simply. We should be able to get this in now that 1.50 is rolled out. Eric's decorator way of doing this was very nice. 
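A minimal sketch of the proxy idea, using the decorator style from Bug 2814: top-level start and end properties on the feature that simply return the location's non-fuzzy integer coordinates. The classes SketchLocation and SketchFeature are hypothetical stand-ins written for this example; they are not the real Bio.SeqFeature classes, and the real patch may look different.

```python
class SketchLocation(object):
    """Hypothetical stand-in for FeatureLocation with integer positions."""

    def __init__(self, start, end):
        self._start = start
        self._end = end

    @property
    def nofuzzy_start(self):
        """Start coordinate as a plain integer."""
        return self._start

    @property
    def nofuzzy_end(self):
        """End coordinate as a plain integer."""
        return self._end


class SketchFeature(object):
    """Hypothetical stand-in for SeqFeature exposing start/end directly."""

    def __init__(self, location, type=""):
        self.location = location
        self.type = type

    @property
    def start(self):
        """Proxy for location.nofuzzy_start (an integer)."""
        return self.location.nofuzzy_start

    @property
    def end(self):
        """Proxy for location.nofuzzy_end (an integer)."""
        return self.location.nofuzzy_end


# Coordinates from the miRBase example earlier in the thread.
feature = SketchFeature(SketchLocation(20228, 20366), "miRNA")
my_seq = "ACGT" * 6000  # placeholder sequence, long enough to slice
sub_seq = my_seq[feature.start:feature.end]
```

Because these are real properties rather than __getattr__ magic, they show up in dir(feature) and can carry docstrings, and the integers can be used directly for slicing.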
> The fuzzy locations (from GenBank or EMBL files) would need > a bit of care, ideally matching how the NCBI do things (easily > checked by taking an NCBI GenBank files and comparing it to > the simpler locations given in their FASTA, PTT or GFF files). To be clear, start and end in SeqFeature would be integers and not handle any fuzzy stuff. All of the representation is still there for those actually dealing with fuzziness, but the top level attributes would expose the coordinates nicely for the remaining 99% of cases. > I think that for a basic parser (as opposed to a parser integrated with Bio.SeqIO), > the SeqFeatures are way too complicated for my mind. [...] > For a basic parser, I like the _gff_line_map function much better. > Applied to the first line in the GFF file, it returns [...] > which is exactly what I need, in (almost) the places where I'd expect them. Does solving the start/end problem as described above help bridge the gap between SeqFeatures and the custom representation? Are there other usability issues you found? I would prefer to expose one data structure and think SeqFeature can handle the data well. They scale to nested cases, and will be familiar to those using features in SeqIO or BioSQL. Brad From dave.bridges at gmail.com Mon Apr 20 13:55:40 2009 From: dave.bridges at gmail.com (Dave Bridges) Date: Mon, 20 Apr 2009 09:55:40 -0400 Subject: [Biopython-dev] Bio.Motif Suggestions Message-ID: <49EC7EDC.2030809@gmail.com> From an off-list conversation with Bartek > > Is it possible to give a name to an instance, so that when you > print, say to > > fasta it retains that info > Yes and no... Motifs have a .name property which can be used for storing names of motifs, but it is currently not used in fasta output. BTW. fasta (and other) output functions changed recently in CVS, but I didn't have time to update my branch in git. Please have a look at the .format method of Motif class in the main branch. 
There are also some (minor) changes in the tutorial, so you may want to merge them back into your branch. Bio.Motif got refactored quite a bit (on Peter's request), so you should update the code, but the API didn't change too much. Currently, the fasta output prints only Instance 1, Instance 2 and so on in the ID field but it would be a trivial improvement to add motif name there. > > Is there an alphabet that accepts spaces which might be necessary for > > correct alignment of a motif, and if so will that work with the rest of > > motif.py? > That's a tougher one. It wasn't really needed so far (DNA motifs rarely have spaces), but I guess that for protein motifs it's a very important thing. I have some code for doing that, but I will need to find it. I'll write you later about it. > > in to_horizontal_matrix/to_vertical_matrix is it possible to print > out a > > legend for the matrices (for ex. the alphabet letters and the position) > > along the top and side. > No, not yet, but again, it would be a nice improvement (and easy to make). cheers Bartek From biopython at maubp.freeserve.co.uk Mon Apr 20 14:35:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Apr 2009 15:35:15 +0100 Subject: [Biopython-dev] Bio.Motif Suggestions In-Reply-To: <49EC7EDC.2030809@gmail.com> References: <49EC7EDC.2030809@gmail.com> Message-ID: <320fb6e00904200735y1002ee71i1a2f11c664045567@mail.gmail.com> On Mon, Apr 20, 2009 at 2:55 PM, Dave Bridges wrote: > >> > Is there an alphabet that accepts spaces which might be necessary for >> > correct alignment of a motif, and if so will that work with the rest of >> > motif.py? >> > > That's a tougher one. It wasn't really needed so far (DNA motifs > rarely have spaces), but I guess that for protein motifs it's a very > important thing. > I have some code for doing that, but I will need to find it. I'll > write you later about it. > What would a space in a motif mean? 
Clearly something different from a wildcard like N or X in nucleotide or protein sequences. Does it mean a gap of variable length? If it means a gap of one character then surely just using a "-" would be sensible (as used in multiple sequence alignments), for which we have a gapped alphabet system setup. Note that there are some issues with the current Bio.Motif code and alphabets, which should be addressed. For example, generic alphabets don't have a letters property giving the list of expected letters, so using set() on the sequences themselves might be more appropriate in places. Peter From biopython at maubp.freeserve.co.uk Mon Apr 20 14:37:02 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Apr 2009 15:37:02 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090420132946.GB29652@sobchak.mgh.harvard.edu> References: <20090417200558.GC19290@sobchak.mgh.harvard.edu> <252312.21376.qm@web62408.mail.re1.yahoo.com> <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> <20090420132946.GB29652@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904200737s71e0dfa2y3d7cfbf36324a79d@mail.gmail.com> On Mon, Apr 20, 2009 at 2:29 PM, Brad Chapman wrote: > Michiel; > Thanks for trying this out and your thoughts. > >> > # It would be better to pass a handle to get_all_features >> > # instead of a file name. The file may be gzipped or bzipped, >> > # or the user may want to read it from the internet. > > Yes, this is the way it was originally designed. I changed to files to > be consistent with a distributed Disco implementation, which needs to be > fed a file instead of a handle. Your suggestion is a good one. Let me > give some thought to separating the interfaces, as handles would be more > consistent with the rest of Biopython. I'd second that - definitely go with handles rather than filenames. 
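To illustrate why handles are the more flexible interface, here is a rough sketch written for a current Python 3. The function name iter_gff_fields and the sample GFF line are made up for this example and are not part of Brad's code; the point is only that one function serves plain and gzipped input alike:

```python
import gzip
import io


def iter_gff_fields(handle):
    """Yield the tab-separated columns of each non-comment line from a handle."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        yield line.split("\t")


gff_text = "##gff-version 3\nchr1\tsrc\tgene\t20229\t23000\t.\t+\t.\tID=gene1\n"

# A plain in-memory text handle...
plain_rows = list(iter_gff_fields(io.StringIO(gff_text)))

# ...and a gzip-compressed one, read through the same function.
compressed = io.BytesIO()
with gzip.GzipFile(fileobj=compressed, mode="wb") as out:
    out.write(gff_text.encode("ascii"))
compressed.seek(0)
with gzip.open(compressed, mode="rt") as gz_handle:
    gzip_rows = list(iter_gff_fields(gz_handle))
```

A filename-based wrapper can still be layered on top by opening the file and passing the handle down.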
Peter

From biopython at maubp.freeserve.co.uk Mon Apr 20 14:55:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 15:55:21 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50
In-Reply-To: <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com>
References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com>
Message-ID: <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com>

On Mon, Apr 20, 2009 at 1:11 PM, Peter wrote:
> That just leaves the official announcement on the news page (which
> will be echoed onto twitter automatically) and to the mailing lists.
> I'll circulate a draft after lunch, unless one of our news coordinator
> volunteers wants to write something? I realize I should have
> suggested this earlier as this is short notice, and you are in
> different time zones, but it's worth a try.

And here is my draft - the HTML is just for the links on the news site. Should we add something about the Entrez EFetch change ("genbank" to "gb")?

Peter

--

We are pleased to announce Biopython release 1.50, featuring some significant additions since Biopython 1.49 was released late last year.

GenomeDiagram by Leighton Pritchard has been integrated into Biopython as the Bio.Graphics.GenomeDiagram module.

A new module Bio.Motif has been added, which is intended to replace the existing Bio.AlignAce and Bio.MEME modules. Also have a look at Bio.ExPASy and the revised Prosite and Enzyme parsers.

As noted in a previous news posting, Bio.SeqIO can now read and write FASTQ and QUAL files used in second generation sequencing work. In connection with this, our SeqRecord object has a new dictionary attribute, letter_annotations, for per-letter-annotation information like sequence quality scores or secondary structure predictions. Also, the SeqRecord object can now be sliced to give a new SeqRecord covering just part of the sequence.
Biopython 1.50 supports Python 2.3, 2.4, 2.5 and 2.6. However, this is expected to be the final version to support Python 2.3 (see this previous announcement). Also, Biopython 1.50 should be the last release to include our old deprecated parsing infrastructure (Martel and Bio.Mindy).

We've also updated the Biopython Tutorial and Cookbook (also available in PDF), and not just by adding our logo to the cover ;)

Thank you to everyone who tested the Biopython 1.50 beta release, and to all our contributors.

Source distributions and Windows installers are available from the downloads page on the Biopython website (biopython.org).

-Peter on behalf of the Biopython developers

From bartek at rezolwenta.eu.org Mon Apr 20 15:04:44 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Mon, 20 Apr 2009 17:04:44 +0200
Subject: [Biopython-dev] Bio.Motif Suggestions
In-Reply-To: <8b34ec180904200800j52f9accdk27bb9c499c7b0761@mail.gmail.com>
References: <49EC7EDC.2030809@gmail.com> <320fb6e00904200735y1002ee71i1a2f11c664045567@mail.gmail.com> <8b34ec180904200800j52f9accdk27bb9c499c7b0761@mail.gmail.com>
Message-ID: <8b34ec180904200804o58d531ache7e3110f3b919cfa@mail.gmail.com>

On Mon, Apr 20, 2009 at 4:35 PM, Peter wrote:
> On Mon, Apr 20, 2009 at 2:55 PM, Dave Bridges wrote:
>>
>>> > Is there an alphabet that accepts spaces which might be necessary for
>>> > correct alignment of a motif, and if so will that work with the rest of
>>> > motif.py?
>>>
>>
>> That's a tougher one. It wasn't really needed so far (DNA motifs
>> rarely have spaces), but I guess that for protein motifs it's a very
>> important thing.
>> I have some code for doing that, but I will need to find it. I'll
>> write you later about it.
>>
>
> What would a space in a motif mean? Clearly something different from
> a wildcard like N or X in nucleotide or protein sequences. Does it
> mean a gap of variable length?
> If it means a gap of one character
> then surely just using a "-" would be sensible (as used in multiple
> sequence alignments), for which we have a gapped alphabet system
> setup.
>

I think that once we start talking about gapped motifs, we are really talking about multiple alignments on steroids. This hasn't been done so far because you don't really need it for DNA motifs, but in the case of protein motifs we need to make it compatible with multiple alignments. I think it would be great to be able to easily convert multiple alignments into motifs. This would allow us to use the power of Bio.AlignIO for IO and Bio.Motif for searching and comparisons. The question is how to design the API for these functions. What about:

    align = Bio.AlignIO.read(....)
    motif = Bio.Motif.from_alignment(align)
    ...

> Note that there are some issues with the current Bio.Motif code and
> alphabets, which should be addressed. For example, generic alphabets
> don't have a letters property giving the list of expected letters, so
> using set() on the sequences themselves might be more appropriate in
> places.

Yes, I was using Bio.Motif only for DNA motifs myself, so there was not much consideration given to proper handling of alphabets. I'll need to clear it up now.

cheers
Bartek

From bartek at rezolwenta.eu.org Mon Apr 20 15:08:57 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Mon, 20 Apr 2009 17:08:57 +0200
Subject: [Biopython-dev] CVS freeze for Biopython 1.50
In-Reply-To: <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com>
References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com>
Message-ID: <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com>

Hi Peter,

Looks fine to me.

Thanks for your effort put into this release. Bio.Motif certainly benefited from refactoring initiated by your comments before the release.
cheers
Bartek

On Mon, Apr 20, 2009 at 4:55 PM, Peter wrote:
> On Mon, Apr 20, 2009 at 1:11 PM, Peter wrote:
>> That just leaves the official announcement on the news page (which
>> will be echoed onto twitter automatically) and to the mailing lists.
>> I'll circulate a draft after lunch, unless one of our news coordinator
>> volunteers wants to write something? I realize I should have
>> suggested this earlier as this is short notice, and you are in
>> different time zones, but it's worth a try.
>
> And here is my draft - the HTML is just for the links on the news
> site. Should we add something about the Entrez EFetch change
> ("genbank" to "gb")?
>
> Peter
>
> --
>
> We are pleased to announce Biopython release 1.50, featuring some
> significant additions since Biopython 1.49 was released late last
> year.
>
> GenomeDiagram by Leighton Pritchard has been integrated into
> Biopython as the Bio.Graphics.GenomeDiagram module.
>
> A new module Bio.Motif has been added, which is intended to replace
> the existing Bio.AlignAce and Bio.MEME modules. Also have a look at
> Bio.ExPASy and the revised Prosite and Enzyme parsers.
>
> As noted in a previous news posting, Bio.SeqIO
> (http://biopython.org/wiki/SeqIO) can now read and write FASTQ
> and QUAL files used in second generation sequencing work. In
> connection with this, our SeqRecord
> (http://biopython.org/wiki/SeqRecord) object has a
> new dictionary attribute, letter_annotations, for
> per-letter-annotation information like sequence quality scores or
> secondary structure predictions. Also, the SeqRecord object can now be
> sliced to give a new SeqRecord covering just part of the sequence.
>
> Biopython 1.50 supports Python 2.3, 2.4, 2.5 and 2.6. However, this is
> expected to be the final version to support Python 2.3 (see this
> previous announcement:
> http://news.open-bio.org/news/2009/04/2008/11/biopython-and-python-26-and-python-23/).
> Also, Biopython 1.50 should be the last release to
> include our old deprecated parsing infrastructure (Martel and
> Bio.Mindy).
>
> We've also updated the Biopython
> Tutorial and Cookbook (also available in PDF:
> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf),
> and not just by adding our logo
> (http://biopython.org/wiki/Logo) to the cover ;)
>
> Thank you to everyone who tested the Biopython
> 1.50 beta release
> (http://news.open-bio.org/news/2009/04/biopython-150-beta-released/),
> and to all our contributors.
>
> Source distributions and Windows installers are available from the
> downloads page (http://biopython.org/wiki/Download) on the
> Biopython website (biopython.org).
>
> -Peter on behalf of the Biopython developers
>

--
Bartek Wilczynski
==================
Postdoctoral fellow
EMBL, Furlong group
Meyerhoffstrasse 1, 69012 Heidelberg, Germany
tel: +49 6221 387 8433

From biopython at maubp.freeserve.co.uk Mon Apr 20 16:04:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 17:04:56 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50
In-Reply-To: <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com>
References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com> <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com>
Message-ID: <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com>

On Mon, Apr 20, 2009 at 4:08 PM, Bartek Wilczynski wrote:
> Hi Peter,
>
> Looks fine to me.

Cool.

> Thanks for your effort put into this release.

Thanks.
I'd forgotten how much work these can be - the Biopython 1.50 beta release seemed to go much more smoothly, but there I wasn't aiming quite so high (e.g. it didn't have any GenomeDiagram documentation in it, and I hadn't really looked at Bio.Motif in detail). Michiel did offer this time round, but maybe next time it should be someone else's turn to do the actual release bit?

What I mean is the project co-ordination is a bit nebulous, but the actual mechanics of doing a release are fairly simple (assuming you have a Windows machine already set up to do the installers), pretty well documented, and that part could be delegated. See http://biopython.org/wiki/Building_a_release

i.e. Maybe in a few months' time I (or Michiel) can say "Right, CVS freeze while XXX does the release", where person XXX gets to scan the documentation, double check the NEWS files, check the unit tests etc, before putting together the packages and uploading them to the server. And maybe then hand over to our "News Coordinator" to do the release announcement? Having more people involved will make it take a little longer, but should mean fewer minor things get missed (e.g. a typo in the NEWS file, or a broken unit test specific to a particular OS or version of Python).

> Bio.Motif certainly benefited from refactoring initiated by your
> comments before the release.

Well, I hope so :)

Peter

From biopython at maubp.freeserve.co.uk Mon Apr 20 17:27:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 18:27:00 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com> <8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com> <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com>
Message-ID: <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>

On Wed, Mar 25, 2009 at 11:28 AM, Peter wrote:
> On Tue, Mar 24, 2009 at 11:58 PM, Bartek Wilczynski wrote:
>> For the tags, they were not pushed to github before, because I didn't
>> know I need to specifically do it with git push --tags.
>
> ...
> They also show up in github (near the top, drop down menu next to
> branches) and in gitx (and I assume other GUI clients).

Bartek fixed the tag issue, but I don't like how they show up in github. The most visible sign of the tags is in the downloads menu which lets you get a source code bundle using that tag. If we could turn that off I would - these bundles won't include the compiled PDF and HTML documentation, and could cause confusion when people have a problem and they just say they "downloaded version X from the website".

My main concern is the tags don't appear to be shown when looking at the history in github, which is the main reason I wanted them in the first place. e.g.
http://github.com/biopython/biopython/commits/master/Bio/Blast/NCBIXML.py

Compare this to ViewCVS, which shows the tags in the history:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIXML.py?cvsroot=biopython

I find this very handy for investigating bugs, and much easier than messing about at the command line with CVS. The fact that I can do this from almost any networked computer in the world is great for triaging bugs or responding to emails - it lets me look back over the history with our releases clearly labeled. So right now, the github history is a big step backwards for me.
As an alternative, I had a quick look at GitX (on the Mac). From the GUI, they don't seem to have a history-of-one-file view, just a global history. For how I have been using ViewCVS's history, this is useless. However, interestingly for a GUI tool, they have a command line option which sort of does this, e.g.

$ gitx -- Bio/Blast/NCBIXML.py

Then the history shows all changes affecting the given file (or path), but as you might guess from git's commit based design, you also get shown other changes made in the same commit. This is kind of nice, just different. But still no tags visible :(

Peter

P.S. Tags aside, the github history view hasn't been working 100% for me, e.g.
http://support.github.com/discussions/site/487-commit-history-sorry-this-commit-log-is-taking-too-long-to-generate

From biopython at maubp.freeserve.co.uk Mon Apr 20 17:36:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 18:36:36 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com> <8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com> <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com> <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>
Message-ID: <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>

On Mon, Apr 20, 2009 at 6:27 PM, Peter wrote:
> On Wed, Mar 25, 2009 at 11:28 AM, Peter wrote:
>> On Tue, Mar 24, 2009 at 11:58 PM, Bartek Wilczynski wrote:
>>> For the tags, they were not pushed to github before, because I didn't
>>> know I need to specifically do it with git push --tags.
>>
>> ...
>> They also show up in github (near the top, drop down menu next to
>> branches) and in gitx (and I assume other GUI clients).
>
> Bartek fixed the tag issue, but I don't like how they show up in
> github.
From some more reading on this, it sounds like our CVS tags are essentially turned into commit markers in git. See:
http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#how-git-stores-references
http://book.git-scm.com/3_git_tag.html

This shouldn't rule out showing them in the history, but perhaps the cvs to git migration confuses things...

Peter

From biopython at maubp.freeserve.co.uk Mon Apr 20 19:02:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 20:02:18 +0100
Subject: [Biopython-dev] Biopython 1.50 released
Message-ID: <320fb6e00904201202j4bb9666es18c89136ce973a48@mail.gmail.com>

Dear all,

We are pleased to announce Biopython release 1.50, featuring some significant additions since Biopython 1.49 was released late last year.

GenomeDiagram by Leighton Pritchard has been integrated into Biopython as the Bio.Graphics.GenomeDiagram module.

A new module Bio.Motif has been added, which is intended to replace the existing Bio.AlignAce and Bio.MEME modules. Also have a look at Bio.SwissProt and Bio.ExPASy and their revised parsers.

As noted in a previous news posting, Bio.SeqIO can now read and write FASTQ and QUAL files used in second generation sequencing work. In connection with this, our SeqRecord object has a new dictionary attribute, letter_annotations, for per-letter-annotation information like sequence quality scores or secondary structure predictions. Also, the SeqRecord object can now be sliced to give a new SeqRecord covering just part of the sequence.

Biopython 1.50 supports Python 2.3, 2.4, 2.5 and 2.6. However, this is expected to be the final version to support Python 2.3 (see this previous announcement). Also, Biopython 1.50 should be the last release to include our old deprecated parsing infrastructure (Martel and Bio.Mindy).
We've also updated the Biopython Tutorial and Cookbook (also available in PDF), and not just by adding our logo to the cover ;)
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

Thank you to everyone who tested the Biopython 1.50 beta release, and to all our contributors.

Source distributions and Windows installers are available from the downloads page on the Biopython website:
http://biopython.org/wiki/Download

-Peter, on behalf of the Biopython developers

P.S. This news post is online at
http://news.open-bio.org/news/2009/04/biopython-release-150/

You may wish to subscribe to our news feed. For RSS links etc, see:
http://biopython.org/wiki/News

From lpritc at scri.ac.uk Tue Apr 21 08:34:25 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 21 Apr 2009 09:34:25 +0100
Subject: [Biopython-dev] Bio.Motif Suggestions
In-Reply-To: <8b34ec180904200804o58d531ache7e3110f3b919cfa@mail.gmail.com>
Message-ID:

Hi,

Some thoughts and a bit of a wishlist...

On 20/04/2009 16:04, "Bartek Wilczynski" wrote:
> On Mon, Apr 20, 2009 at 4:35 PM, Peter wrote:
>>
>> What would a space in a motif mean? Clearly something different from
>> a wildcard like N or X in nucleotide or protein sequences. Does it
>> mean a gap of variable length? If it means a gap of one character
>> then surely just using a "-" would be sensible (as used in multiple
>> sequence alignments), for which we have a gapped alphabet system
>> setup.
>>
> I think that once we start talking about gapped motifs, we are really
> talking about multiple alignments on steroids. This hasn't been done
> so far because you don't really need it for DNA motifs,

It might not be required for the motifs you've been working with, but we've been doing profile-based searches for bipartite regulatory binding sites in DNA. These sites have a variable-length spacer region, and so require gapped alignments for building motifs.
The spacer region consensus (depending on the level of identity required for the consensus) is usually composed of Ns. I guess that this comes down to whether we choose to restrict the meaning of "motif" to an ungapped string of symbols (including ambiguity) representing nt/aa, or whether we want to permit the inclusion of variable-length gaps, regions, or ambiguities in a PROSITE or regular expression-like manner (e.g. C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, GAACC.{17,21}AAC or C{,3}A{3,5}TTTT). Although profile methods like HMMer can produce a consensus output that looks like an ungapped string of symbols to represent a motif, it doesn't capture important features of the HMM representation. I think the latter representations are more useful, even if harder to code/maintain. I think that leaving them out would be a glaring hole in functionality, and that they're a target Biopython should aim for. > I think it would be great to be > able to easily > convert multiple alignments into motifs. This would allow us to ?use > the power of > BIo.AlignIO for IO and Bio.Motif for searching and comparisons.The question is > how to design API for these ?functions. I agree. I think that there's another important question: what do we mean, and need to do, when we talk about converting an alignment into a motif? Consensus/majority and PSSM methods from a sequence alignment should be straightforward to implement in Python - even for gapped alignments. Including a representation of variable-length gaps might be a little more difficult, and storing an HMM representation may be too much to manage immediately. That's still three different types of object - with likely different components to their interfaces - to be stored. In their relationship to a source alignment, these representations could be properties of a single alignment, or independent Bio.Motif objects (perhaps each with a link back to their parent alignment). 
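As a rough illustration of the consensus/majority case for a gapped alignment, here is a small sketch in plain Python. The function names column_counts and simple_consensus, the threshold handling and the toy alignment are all assumptions made for this example; this is not the Bio.Motif or Bio.AlignIO API:

```python
from collections import Counter


def column_counts(aligned_seqs):
    """Return one Counter of symbols per alignment column (gaps included)."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(column) for column in zip(*aligned_seqs)]


def simple_consensus(aligned_seqs, threshold=0.9, ambiguous="N"):
    """Majority consensus; columns below the threshold get the ambiguous symbol."""
    consensus = []
    for counts in column_counts(aligned_seqs):
        symbol, count = counts.most_common(1)[0]
        if count / float(len(aligned_seqs)) >= threshold:
            consensus.append(symbol)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


# Toy gapped alignment, invented for illustration only.
alignment = [
    "GAACC-TAAC",
    "GAACCATAAC",
    "GAACC-GAAC",
    "GAACC-TAAC",
]
counts = column_counts(alignment)
consensus = simple_consensus(alignment, threshold=0.7)
```

The per-column counts are the raw material for a PSSM; turning them into log-odds scores would additionally need pseudocounts and background frequencies, which are left out here.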
The results of searches are also likely to be qualitatively different, depending on the type of motif used for the search, and the results desired by the user. I think that, for anything other than simple searches (string search, regex), we'd be on a hiding to nothing by implementing search methods within Python. It's not likely to be as fast as dedicated search packages, and it would be a headache for maintenance. So, with apologies if I missed this part of the discussion or documentation, it seems to me that Bio.Motif could be most powerful in the alignment/searching/comparison process as a 'broker' within BioPython, providing a consistent API for interface with external alignment/search/comparison applications that also permits programmatic manipulation of the profile/HMM/alignment. E.g. align = Bio.AlignIO.read(alignfilehandle) consensus = align.build_consensus(threshold=0.9) pssm = align.build_pssm() hmmer = align.build_hmmer() hmm = align.build_hmm(order=3) Or consensus = Bio.Motif.consensus_from_alignment(align, threshold=0.9) pssm = Bio.Motif.build_pssm_from_alignment(align) hmmer = Bio.Motif.build_hmmer_from_alignment(align) hmm = Bio.Motif.build_hmm_from_alignment(align, order=3) (which I don't think is as neat an interface, even if all align.build_consensus does is call the Bio.Motif.consensus_from_alignment method) Followed by things like pssm.consensus() pssm.logo() hmm.generate_sequence(length=100) hmm.to_graphviz() And then the consensus, pssm, hmm and hmmer objects could be used as input to interfaces for the relevant applications. Converting an alignment into an HMM for this purpose may itself benefit from a call to HMMer's hmmbuild (and Pythonic representation of the data structure), rather than implementation of an equivalent internal function - even though I think one of those would be useful, too. Cheers, L. 
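For the "simple search" end of the spectrum, a regex-style motif with a variable-length spacer (such as the GAACC.{17,21}AAC pattern mentioned earlier in this message) can be searched with the standard library alone. The helper name find_sites and the test sequence are made up for this sketch:

```python
import re

# Spacer pattern quoted in the message: GAACC, 17-21 arbitrary bases, then AAC.
BIPARTITE = re.compile(r"GAACC.{17,21}AAC")


def find_sites(sequence, pattern=BIPARTITE):
    """Return (start, end, matched_text) for each non-overlapping match."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(sequence)]


# Synthetic sequence with an 18-base spacer between the two half-sites.
spacer = "T" * 18
sequence = "CCCC" + "GAACC" + spacer + "AAC" + "GGGG"
sites = find_sites(sequence)
```

Note that finditer reports non-overlapping matches only, so overlapping candidate sites would need a lookahead-based pattern or a dedicated search tool.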
-- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 From biopython at maubp.freeserve.co.uk Tue Apr 21 10:15:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Apr 2009 11:15:05 +0100 Subject: [Biopython-dev] Python 2.3 support Message-ID: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com> Hi all, As we've been warning for the last couple of releases, Biopython 1.50 should be the last release to officially support Python 2.3. No one has complained yet, but they may not have noticed.
I suspect there may be people out there using a local Biopython installation on an old Linux/Unix computer where the system Python is rather old. For Biopython 1.50 I added a warning to setup.py when run on Python 2.3 so that may get more attention.

Given the small possibility that we may need to do a fix release with Python 2.3 support, I propose that we don't actively remove any Python 2.3 support in CVS yet (maybe not until after Biopython 1.51?). Any new modules that require Python 2.4+ to run would be OK, but I would like to avoid breaking existing core functionality on Python 2.3 in the short term. I know I'm dragging my feet on this, but being a bit cautious here shouldn't hurt.

Plus I have an ulterior motive: I'm one of the few Biopython users still actually using Python 2.3! To be precise, this is now only on one machine at work - but this is the cluster head node. However, an upgrade is planned in the next month or so, and once that is done, maybe I'll relent and we can remove Python 2.3 support in CVS ;)

Peter

From bugzilla-daemon at portal.open-bio.org Tue Apr 21 11:05:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 21 Apr 2009 07:05:48 -0400
Subject: [Biopython-dev] [Bug 2817] New: Meta-bug for cleanup once we drop Python 2.3 support
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2817

           Summary: Meta-bug for cleanup once we drop Python 2.3 support
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

We are going to drop support for Python 2.3, see:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005812.html

This means we can remove a number of workarounds in the code:

Python 2.4+ includes the built in set, so we can remove numerous uses of the following:

    #TODO - Remove this work around once we drop python 2.3 support
    try:
        set = set
    except NameError:
        from sets import Set as set

Python 2.4+ includes the subprocess module, so we can use this unconditionally in Bio.Application.generic_run() etc.

Python 2.4+ includes support for generator expressions. We should update the documentation examples as appropriate, and this may also allow some memory optimizations in places.

Python 2.4+ will also allow us to update our property methods to use decorators as suggested by Eric Talevich on Bug 2814.

From bugzilla-daemon at portal.open-bio.org Tue Apr 21 11:12:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 21 Apr 2009 07:12:18 -0400
Subject: [Biopython-dev] [Bug 2814] Use properties instead of __getattr__ in FeatureLocation
In-Reply-To:
Message-ID: <200904211112.n3LBCILI021318@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2814

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-21 07:12 EST -------
(In reply to comment #2)
> Peter, you mentioned on the mailing list that this will be applied after the
> 1.50 release. Since Py2.3 support ends there also, you could use the newer
> decorator style instead:
>
> start = property(fget= lambda self : self._start,
>                  doc="Start location (possibly a fuzzy position).")
>
> becomes:
>
> @property
> def start(self):
>     """Start location (possibly a fuzzy position)."""
>     return self._start
>
>
> I think this is the preferred style for Python 2.4 and later.

Thanks for the suggestion Eric. That sounds like a good plan, but not yet. See Bug 2817.
I've checked in this patch and am marking this bug as fixed. See: Bio/SeqFeature.py CVS revision 1.17 Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Tue Apr 21 11:12:20 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 21 Apr 2009 04:12:20 -0700 (PDT) Subject: [Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt Message-ID: <393946.5637.qm@web62408.mail.re1.yahoo.com> Dear all, I've noticed an inconsistency between how Bio.SeqIO and Bio.SwissProt parse DE (description) lines in SwissProt files. For these DE lines: DE RecName: Full=11S globulin seed storage protein 2; DE AltName: Full=11S globulin seed storage protein II; DE AltName: Full=Alpha-globulin; DE Contains: DE RecName: Full=11S globulin seed storage protein 2 acidic chain; DE AltName: Full=11S globulin seed storage protein II acidic chain; DE Contains: DE RecName: Full=11S globulin seed storage protein 2 basic chain; DE AltName: Full=11S globulin seed storage protein II basic chain; DE Flags: Precursor; a SwissProt record created by Bio.SwissProt contains the following: >>> print swiss_record.description RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S globulin seed storage protein II; AltName: Full=Alpha-globulin; Contains: RecName: Full=11S globulin seed storage protein 2 acidic chain; AltName: Full=11S globulin seed storage protein II acidic chain; Contains: RecName: Full=11S globulin seed storage protein 2 basic chain; AltName: Full=11S globulin seed storage protein II basic chain; Flags: Precursor; but a SeqRecord returned by Bio.SeqIO contains this: >>> print seq_record.description RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S globulin seed storage protein II; AltName: Full=Alpha-globulin; Contains: RecName: Full=11S globulin seed storage 
protein 2 acidic chain; AltName: Full=11S globulin seed storage protein II acidic chain; Contains: RecName: Full=11S globulin seed storage protein 2 basic chain; AltName: Full=11S globulin seed storage protein II basic chain; Flags: Precursor; So Bio.SeqIO removes the spaces in front of the line, but Bio.SwissProt doesn't. For consistency, I think it's better to decide on one of these two styles. My preference is for the approach used by Bio.SwissProt. Any objections to modifying the code used by Bio.SeqIO? --Michiel. From p.j.a.cock at googlemail.com Tue Apr 21 11:26:00 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Apr 2009 12:26:00 +0100 Subject: [Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt In-Reply-To: <393946.5637.qm@web62408.mail.re1.yahoo.com> References: <393946.5637.qm@web62408.mail.re1.yahoo.com> Message-ID: <320fb6e00904210426q19ea7a44xdb902661297ec855@mail.gmail.com> On Tue, Apr 21, 2009 at 12:12 PM, Michiel de Hoon wrote: > > Dear all, > > I've noticed an inconsistency between how Bio.SeqIO and Bio.SwissProt parse DE (description) lines in SwissProt files. > > For these DE lines: > > DE   RecName: Full=11S globulin seed storage protein 2; > DE   AltName: Full=11S globulin seed storage protein II; > DE   AltName: Full=Alpha-globulin; > DE   Contains: > DE     RecName: Full=11S globulin seed storage protein 2 acidic chain; > DE     AltName: Full=11S globulin seed storage protein II acidic chain; > DE   Contains: > DE     RecName: Full=11S globulin seed storage protein 2 basic chain; > DE     AltName: Full=11S globulin seed storage protein II basic chain; > DE
Flags: Precursor; > > a SwissProt record created by Bio.SwissProt contains the following: >>>> print swiss_record.description > RecName: Full=11S globulin seed storage protein 2; > AltName: Full=11S globulin seed storage protein II; > AltName: Full=Alpha-globulin; > Contains: >  RecName: Full=11S globulin seed storage protein 2 acidic chain; >  AltName: Full=11S globulin seed storage protein II acidic chain; > Contains: >  RecName: Full=11S globulin seed storage protein 2 basic chain; >  AltName: Full=11S globulin seed storage protein II basic chain; > Flags: Precursor; > > but a SeqRecord returned by Bio.SeqIO contains this: > >>>> print seq_record.description > RecName: Full=11S globulin seed storage protein 2; > AltName: Full=11S globulin seed storage protein II; > AltName: Full=Alpha-globulin; > Contains: > RecName: Full=11S globulin seed storage protein 2 acidic chain; > AltName: Full=11S globulin seed storage protein II acidic chain; > Contains: > RecName: Full=11S globulin seed storage protein 2 basic chain; > AltName: Full=11S globulin seed storage protein II basic chain; > Flags: Precursor; > > So Bio.SeqIO removes the spaces in front of the line, but Bio.SwissProt doesn't. > For consistency, I think it's better to decide on one of these two styles. > My preference is for the approach used by Bio.SwissProt. Any objections to modifying the code used by Bio.SeqIO? Have you got a link for the full record in your example? For interaction with other Bio.SeqIO formats, I generally expect the description to be a single line string (with no embedded newlines). If you look at the (old) SwissProt files in our unit tests, the current Bio.SeqIO behaviour makes sense - the DE line(s) just encode a fairly short simple string. It looks like the SwissProt format has changed, and we should be parsing the new extended DE lines more carefully, and splitting these entries up and recording them in the SeqRecord.annotations dictionary?
Peter From bartek at rezolwenta.eu.org Tue Apr 21 11:29:39 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 21 Apr 2009 13:29:39 +0200 Subject: [Biopython-dev] Bio.Motif Suggestions In-Reply-To: References: <8b34ec180904200804o58d531ache7e3110f3b919cfa@mail.gmail.com> Message-ID: <8b34ec180904210429g36d089a6h578dc0197a94516a@mail.gmail.com> On Tue, Apr 21, 2009 at 10:34 AM, Leighton Pritchard wrote: > Hi, > > Some thoughts and a bit of a wishlist... These are always welcome. I can make no promises on timing of making your wishes come true ;) >>> >> I think that once we start talking about gapped motifs, we are really >> talking about >> multiple alignments on steroids. This hasn't been done so far because you >> don't >> really need it for DNA motifs, > > It might not be required for the motifs you've been working with, but we've > been doing profile-based searches for bipartite regulatory binding sites in > DNA. These sites have a variable-length spacer region, and so require > gapped alignments for building motifs. The spacer region consensus > (depending on the level of identity required for the consensus) is usually > composed of Ns. Indeed, there are dyadic motifs for some transcription factors. So far I have been working only under the assumption that the gap is not too variable (say 3-5 nucleotides), which you can fake by using multiple PWMs with different sizes of the gap, e.g.: CACnnnGTG CACnnnnGTG CACnnnnnGTG But it is a workaround rather than a feature... I'd also be interested in knowing about other applications where this assumption (small gaps) is violated. Are there also motifs with multiple gaps? Implementing this feature would probably require a separate subclass of Motif, since the internal implementation of searching would need to be different. This is a very good feature request, I think it is worth implementing, though currently I have no time to do it properly.
If you don't care too much about efficiency, I could quickly write this dyadic subclass with the implementation based on two motif instances and a variable gap. > > I guess that this comes down to whether we choose to restrict the meaning of > "motif" to an ungapped string of symbols (including ambiguity) representing > nt/aa, or whether we want to permit the inclusion of variable-length gaps, > regions, or ambiguities in a PROSITE or regular expression-like manner (e.g. > C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, GAACC.{17,21}AAC or > C{,3}A{3,5}TTTT). Although profile methods like HMMer can produce a > consensus output that looks like an ungapped string of symbols to represent > a motif, it doesn't capture important features of the HMM representation. > I think that you are touching on multiple issues here. I'll try to answer them separately: - gapped alignments are one thing. If we have a gap in one sequence but not in the others (frequent in protein motifs, not so much in DNA motifs) we just need a way to sensibly use it in creation of PWMs for searching - dyadic motifs (gaps in otherwise ungapped alignments) are a different issue, since we have a gap in all instances, but it may have a variable length. See above. - regular expressions are a different way of describing motifs. I think that it is not the purpose of Bio.Motif to compete with regexps, but it would certainly be valuable to be able to create motifs from some sort of (simplified) regexps. This was, to some extent, discussed in a recent thread on Seq.startswith methods - HMM motifs are a totally different kind of beast. These guys introduce dependencies between positions (doable also with regexps) and there is currently no support for them in Bio.Motif. It would be cool to have support for them, but I'm not an expert here and it looks to me like a lot of work (also probably the methods of Bio.Motif are not exactly right for HMMs).
- finally, supporting prosite syntax seems to depend on the variable gap feature, but otherwise it's simply an important input format to support. > I think the latter representations are more useful, even if harder to > code/maintain. I think that leaving them out would be a glaring hole in > functionality, and that they're a target Biopython should aim for. Usefulness is hard to define in the abstract, without a particular problem, so this is arguable. It is certain that Bio.Motif is not a complete suite for all kinds of motif analysis but I don't know of any tool that supports all these types of motifs with a single API (if you know one, please tell me). We should have ambitious goals, but I wouldn't call it a glaring hole not to have what is currently not available elsewhere... > >> I think it would be great to be able to easily >> convert multiple alignments into motifs. This would allow us to use >> the power of >> Bio.AlignIO for IO and Bio.Motif for searching and comparisons. The question is >> how to design API for these functions. > > I agree. I think that there's another important question: what do we mean, > and need to do, when we talk about converting an alignment into a motif? > Consensus/majority and PSSM methods from a sequence alignment should be > straightforward to implement in Python - even for gapped alignments. > Including a representation of variable-length gaps might be a little more > difficult, and storing an HMM representation may be too much to manage > immediately. That's still three different types of object - with likely > different components to their interfaces - to be stored. In their > relationship to a source alignment, these representations could be > properties of a single alignment, or independent Bio.Motif objects (perhaps > each with a link back to their parent alignment).
> > The results of searches are also likely to be qualitatively different, > depending on the type of motif used for the search, and the results desired > by the user. > > I think that, for anything other than simple searches (string search, > regex), we'd be on a hiding to nothing by implementing search methods within > Python. It's not likely to be as fast as dedicated search packages, and it > would be a headache for maintenance. So, with apologies if I missed this What do you mean by searching here? Searching for a known motif or searching for a new motif? And what dedicated packages do you have in mind? > part of the discussion or documentation, it seems to me that Bio.Motif could > be most powerful in the alignment/searching/comparison process as a 'broker' > within BioPython, providing a consistent API for interface with external > alignment/search/comparison applications that also permits programmatic > manipulation of the profile/HMM/alignment. E.g. > That's definitely an important field, though I'm not sure it is _the_ function for Bio.Motif. I think that the most valuable thing would be to internalize some of the complexity of different ways of using motifs in bioinformatics. My modest goal for now is making protein motifs first class citizens (meaning handling alphabets and gaps properly etc.). The next thing would be to make Bio.Motif cooperate nicely with - Bio.Seq (e.g. seq.startswith etc.), - Bio.Align (conversions from-to alignments) which includes easy motif creation from simple formats like IUPAC and simple regexps and would correspond to the "broker" function if I understand it correctly. Then I think it would be really cool to have spaced motifs, although here we need to be careful about performance.
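The fixed-gap workaround discussed in this thread (one pattern per allowed gap length, e.g. CACnnnGTG up to CACnnnnnGTG) could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses exact string halves and the standard re module rather than PWMs, and the names expand_dyadic and search_dyadic are hypothetical, not part of Bio.Motif.

```python
import re


def expand_dyadic(left, right, min_gap, max_gap):
    """Return one fixed-gap pattern per allowed gap length, e.g. CACnnnGTG."""
    return [left + "n" * gap + right for gap in range(min_gap, max_gap + 1)]


def search_dyadic(seq, left, right, min_gap, max_gap):
    """Yield (position, gap_length) for each left-gap-right match in seq.

    Hypothetical helper: exact matching of the two halves only, no scoring.
    """
    for gap in range(min_gap, max_gap + 1):
        # A lookahead is used so that overlapping hits are also reported.
        pattern = re.compile(
            "(?=%s.{%d}%s)" % (re.escape(left), gap, re.escape(right))
        )
        for match in pattern.finditer(seq):
            yield match.start(), gap


patterns = expand_dyadic("CAC", "GTG", 3, 5)
hits = sorted(search_dyadic("TTCACAAAGTGTTCACAAAAAGTGTT", "CAC", "GTG", 3, 5))
```

Each gap length costs one extra pass over the sequence, which is the performance concern for wide gap ranges.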
> align = Bio.AlignIO.read(alignfilehandle) > consensus = align.build_consensus(threshold=0.9) > pssm = align.build_pssm() > hmmer = align.build_hmmer() > hmm = align.build_hmm(order=3) > > Or > > consensus = Bio.Motif.consensus_from_alignment(align, threshold=0.9) > pssm = Bio.Motif.build_pssm_from_alignment(align) > hmmer = Bio.Motif.build_hmmer_from_alignment(align) > hmm = Bio.Motif.build_hmm_from_alignment(align, order=3) > I would guess that the first example is what would actually be used, but it requires the functions on the Motif side to be available. As for more specific things: - I don't like the usage of PSSM and consensus here. These are just different ways of looking at a Motif. - Also the difference between HMMer and HMM is unclear to me (isn't HMMer a tool to make HMMs? Do we support HMMer in Biopython currently?) But I'm not too concerned about HMMs at the moment. I would rather think of something like: align = Bio.AlignIO.read(alignfilehandle) motif = align.build_motif() followed by: motif.consensus() motif.search_pwm(seq) motif.search_instances(seq) motif.weblogo() > > And then the consensus, pssm, hmm and hmmer objects could be used as input > to interfaces for the relevant applications. > I don't understand your idea of separating consensus from pssm motifs. These are not fundamentally different. HMMs though are really different. > Converting an alignment into an HMM for this purpose may itself benefit from > a call to HMMer's hmmbuild (and Pythonic representation of the data > structure), rather than implementation of an equivalent internal function - > even though I think one of those would be useful, too. > Again, I'm not sure whether we have support for HMMer now (it was mentioned on the mailing-list once, but I don't know what happened to it). But I agree it would be useful.
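To illustrate the point that a consensus and a PSSM are two views of the same motif, here is a minimal sketch in plain Python. It assumes equal-length, ungapped instances, and the function names (count_matrix, consensus, frequency_matrix) are made up for this example rather than taken from Bio.Motif or Bio.AlignIO.

```python
from collections import Counter


def count_matrix(instances):
    """Per-column letter counts for equal-length, ungapped motif instances."""
    length = len(instances[0])
    if any(len(inst) != length for inst in instances):
        raise ValueError("all instances must have the same length")
    return [Counter(inst[i] for inst in instances) for i in range(length)]


def consensus(counts, threshold=0.5, ambiguous="N"):
    """Most common letter per column, or `ambiguous` if below the threshold."""
    letters = []
    for column in counts:
        letter, n = column.most_common(1)[0]
        total = sum(column.values())
        letters.append(letter if n / float(total) >= threshold else ambiguous)
    return "".join(letters)


def frequency_matrix(counts, alphabet="ACGT"):
    """Per-column letter frequencies (no pseudocounts in this sketch)."""
    rows = []
    for column in counts:
        total = float(sum(column.values()))
        rows.append(dict((letter, column[letter] / total) for letter in alphabet))
    return rows


# Both views are derived from the same count matrix.
counts = count_matrix(["CACGTG", "CACGTG", "CACGTT", "CAGGTG"])
strict = consensus(counts, threshold=0.9)
loose = consensus(counts, threshold=0.5)
freqs = frequency_matrix(counts)
```

With a single motif object holding the counts, consensus() and a PWM search would simply be two methods on it, which is the shape of the align.build_motif() proposal.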
To summarize: - thanks for so much input, I especially appreciate the input on possible usages - I will work on the features I mentioned in the direction of unifying the API for DNA and protein motifs, and I would definitely appreciate any help from others - The dyadic motifs (or more generally gapped motifs) are next, and require taking care of performance issues - HMM support is currently further down on my to-do list, mostly because it needs a rather different API. But once we have the "glue" functions for motifs, we can try to make similar "glue" functions for HMMs. cheers Bartek From p.j.a.cock at googlemail.com Tue Apr 21 11:52:26 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Apr 2009 12:52:26 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090420132946.GB29652@sobchak.mgh.harvard.edu> References: <20090417200558.GC19290@sobchak.mgh.harvard.edu> <252312.21376.qm@web62408.mail.re1.yahoo.com> <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> <20090420132946.GB29652@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904210452n6c17fcafl205abcfcc4e3edd1@mail.gmail.com> On Mon, Apr 20, 2009 at 2:29 PM, Brad Chapman wrote: > [accessing start and end] >> >>> print rec_dict['1'].features[0].location.start >> 20228 >> >>> rec_dict['1'].features[0].location.start.position >> 20228 > [...] >> Coupled with a variation of Brad's suggestion of adding start >> and end properties to the SeqFeature, if we make these act >> as proxies for feature.location.start and feature.location.end >> that would become just: >> >> record = ... >> feature = record.features[5] #for example >> sub_seq = my_seq[feature.start:feature.end] > > Thanks Peter, that's exactly right. Actually, it isn't - my mistake. Adding start and end properties to the SeqFeature as proxies for feature.location.start and feature.location.end wouldn't be a great idea.
Currently feature.location.start and feature.location.end are position objects, and even if they had an __int__ method you can't do this: record[feature.location.start:feature.location.end] or: record.seq[feature.location.start:feature.location.end] You would have to do this: record[int(feature.location.start):int(feature.location.end)] or: record.seq[int(feature.location.start):int(feature.location.end)] The above wouldn't work well for fuzzy locations, we're better off with the current explicit option: record[feature.location.start.position:feature.location.end.position] or: record.seq[feature.location.start.position:feature.location.end.position] where if the user wants to they can take into account the fuzzy details, such as adding feature.location.end.extension to the end slice point. ---------------- Now the good news, we can instead simply use the FeatureLocation shortcuts for (approximated) plain integers: record[feature.location.nofuzzy_start:feature.location.nofuzzy_end] or: record.seq[feature.location.nofuzzy_start:feature.location.nofuzzy_end] These methods already take fuzzy ends into consideration, and know to treat the start and end differently to get the wider feature. So, a slight variation of the proposed internal details would be to make SeqFeature.start and end proxies for SeqFeature.location.nofuzzy_start and SeqFeature.location.nofuzzy_end (i.e. plain integers), achieving the goal of just: record[feature.start:feature.end] or: record.seq[feature.start:feature.end] (Suitable for non-join features, and gives a reasonable approximation for fuzzy locations). > Accessing the start and end coordinates in SeqFeatures is unnecessarily > cumbersome right now, but can be fixed fairly simply. We should be able > to get this in now that 1.50 is rolled out. > ... > To be clear, start and end in SeqFeature would be integers and not > handle any fuzzy stuff.
All of the representation is still there for > those actually dealing with fuzziness, but the top level attributes > would expose the coordinates nicely for the remaining 99% of cases. Right - and with the above correction that SeqFeature.start and end would be proxies for SeqFeature.location.nofuzzy_start and SeqFeature.location.nofuzzy_end, you would get plain integers, and this should cover most use cases. At least for non-Eukaryotes ;) >> I think that for a basic parser (as opposed to a parser integrated with Bio.SeqIO), >> the SeqFeatures are way too complicated for my mind. > [...] >> For a basic parser, I like the _gff_line_map function much better. >> Applied to the first line in the GFF file, it returns > [...] >> which is exactly what I need, in (almost) the places where I'd expect them. > > Does solving the start/end problem as described above help bridge the > gap between SeqFeatures and the custom representation? Are there other > usability issues you found? I would prefer to expose one data structure > and think SeqFeature can handle the data well. They scale to nested > cases, and will be familiar to those using features in SeqIO or BioSQL. You must agree that SeqFeature and FeatureLocation objects are not very lightweight. I understood that one of your goals with Bio.GFF and map/reduce is to handle massive files, so surely it makes sense to use a simple object structure here? Peter From mjldehoon at yahoo.com Tue Apr 21 11:55:36 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 21 Apr 2009 04:55:36 -0700 (PDT) Subject: [Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt In-Reply-To: <320fb6e00904210426q19ea7a44xdb902661297ec855@mail.gmail.com> Message-ID: <861995.42083.qm@web62406.mail.re1.yahoo.com> > Have you got a link for the full record in your example? 
> You can find it here: http://www.uniprot.org/uniprot/Q9XHP0.txt > For interaction with other Bio.SeqIO formats, I generally > expect the description to be a single line string (with no > embedded newlines). > It looks like the SwissProt format has changed, and we > should be parsing the new extended DE lines more > carefully, and splitting these entries up and recording > them in the SeqRecord.annotations dictionary? > That sounds reasonable. The dictionary will have to be nested though. Something like this: annotations["RecName"] = [{"Full": "11S globulin seed storage protein 2"}] annotations["AltName"] = [{"Full": "11S globulin seed storage protein II"}, {"Full": "Alpha-globulin"}] annotations["Contains"] = [{"RecName": {"Full": "11S globulin seed storage protein 2 acidic chain"}, "AltName": {"Full": "11S globulin seed storage protein II acidic chain"}}, {"RecName": {"Full": "11S globulin seed storage protein 2 basic chain"}, "AltName": {"Full": "11S globulin seed storage protein II basic chain"}}] annotations["Flags"] = "Precursor" --Michiel From p.j.a.cock at googlemail.com Tue Apr 21 12:04:44 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Apr 2009 13:04:44 +0100 Subject: [Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt In-Reply-To: <861995.42083.qm@web62406.mail.re1.yahoo.com> References: <320fb6e00904210426q19ea7a44xdb902661297ec855@mail.gmail.com> <861995.42083.qm@web62406.mail.re1.yahoo.com> Message-ID: <320fb6e00904210504g6c7f60f1o96129c9a6759c256@mail.gmail.com> On Tue, Apr 21, 2009 at 12:55 PM, Michiel de Hoon wrote: > >> Have you got a link for the full record in your example? >> > You can find it here: > > http://www.uniprot.org/uniprot/Q9XHP0.txt > >> For interaction with other Bio.SeqIO formats, I generally >> expect the description to be a single line string (with no >> embedded newlines).
> >> It looks like the SwissProt format has changed, and we >> should be parsing the new extended DE lines more >> carefully, and splitting these entries up and recording >> them in the SeqRecord.annotations dictionary? >> > That sounds reasonable. The dictionary will have to be nested though. Something like this: > > annotations["RecName"] = [{"Full": "11S globulin seed storage protein 2"}] > annotations["AltName"] = [{"Full": "11S globulin seed storage protein II"}, {"Full": "Alpha-globulin"}] > annotations["Contains"] = [{"RecName": {"Full": "11S globulin seed storage protein 2 acidic chain"}, > "AltName": {"Full": "11S globulin seed storage protein II acidic chain"}}, > {"RecName": {"Full": "11S globulin seed storage protein 2 basic chain"}, > "AltName": {"Full": "11S globulin seed storage protein II basic chain"}}, > ] > annotations["Flags"] = "Precursor" > Possible - but for BioSQL we couldn't store those dictionaries. A list of strings should work, but isn't as elegant. Maybe something along these lines? annotations["RecName"] = ["Full: 11S globulin seed storage protein 2;"] annotations["AltName"] = ["Full: 11S globulin seed storage protein II", "Full: Alpha-globulin"] annotations["Contains"] = ["RecName: Full=11S globulin seed storage protein 2 acidic chain;\nAltName: Full=11S globulin seed storage protein II acidic chain;", "RecName: Full=11S globulin seed storage protein 2 basic chain;\nAltName: Full=11S globulin seed storage protein II basic chain;"] annotations["Flags"] = "Precursor" Or for "Contains" just have a flat list of strings, one for each name (here four names). Or for "Contains" just drop the AltName entries, and simply have a list of the RecName entries (here two names).
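As a rough illustration of the "flat list of strings" option, here is a sketch that groups DE lines into a dictionary of string lists that BioSQL could store. The function name parse_de_lines is hypothetical and this is not the Bio.SwissProt or Bio.SeqIO parser. It assumes top-level entries are indented by three spaces after the DE code and nested entries by more, and it ignores continuation lines that have no "Key: " prefix.

```python
def parse_de_lines(lines):
    """Group SwissProt DE lines into a dict mapping keys to lists of strings.

    Hypothetical sketch: nested entries under Contains:/Includes: are kept
    as flat "RecName: Full=..." strings, one per name.
    """
    annotations = {}
    section = None  # set to "Contains" or "Includes" while inside a block
    for line in lines:
        if not line.startswith("DE"):
            continue
        body = line[2:].rstrip()
        indent = len(body) - len(body.lstrip())
        text = body.strip().rstrip(";")
        if text in ("Contains:", "Includes:"):
            section = text[:-1]
            annotations.setdefault(section, [])
            continue
        key, sep, value = text.partition(": ")
        if not sep:
            # Continuation lines without a "Key: " prefix are skipped here.
            continue
        if section is not None and indent > 3:
            annotations[section].append("%s: %s" % (key, value))
        else:
            section = None  # back at top level, e.g. the Flags line
            annotations.setdefault(key, []).append(value)
    return annotations


de_lines = [
    "DE   RecName: Full=11S globulin seed storage protein 2;",
    "DE   AltName: Full=11S globulin seed storage protein II;",
    "DE   AltName: Full=Alpha-globulin;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II acidic chain;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 basic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II basic chain;",
    "DE   Flags: Precursor;",
]
annotations = parse_de_lines(de_lines)
```

This flat form loses which AltName belongs to which Contains block, which is the trade-off against the nested dictionaries.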
Peter From bugzilla-daemon at portal.open-bio.org Tue Apr 21 12:13:04 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Apr 2009 08:13:04 -0400 Subject: [Biopython-dev] [Bug 2818] New: Add start and end properties to SeqFeature object Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2818 Summary: Add start and end properties to SeqFeature object Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk An enhancement proposed on the mailing list would add start and end properties to the SeqFeature returning plain integers (non-fuzzy approximations to the start and end locations) suitable for slicing most parent sequences. Dealing with a join location would still be tricky. Example usage: >>> from Bio import SeqIO >>> record = SeqIO.read(open("NC_005816.gb"),"gb") >>> feature = record.features[2] >>> print feature type: gene location: [86:1109] ref: None:None strand: 1 qualifiers: Key: db_xref, Value: ['GeneID:2767718'] Key: locus_tag, Value: ['YP_pPCP01'] >>> record[feature.start:feature.end] SeqRecord(seq=Seq('ATGGTCACTTTTGAGACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAATG...TGA', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=[]) >>> record.seq[feature.start:feature.end] Seq('ATGGTCACTTTTGAGACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAATG...TGA', IUPACAmbiguousDNA()) Patch to follow. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
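The proxy idea in Bug 2818 could look roughly like the sketch below. It deliberately uses simplified stand-in classes (Position, FeatureLocation, SeqFeature are re-declared here and are not the real Bio.SeqFeature classes), and it assumes a fuzzy end can be modelled as a position plus an extension, as mentioned on the list. It is not the attached patch.

```python
from collections import namedtuple

# Simplified stand-in for a (possibly fuzzy) position: a base position plus
# an extension covering the uncertain range. Hypothetical, for illustration.
Position = namedtuple("Position", ["position", "extension"])


class FeatureLocation(object):
    """Stand-in location exposing plain-integer approximations."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    @property
    def nofuzzy_start(self):
        # Take the lowest possible start to get the wider feature.
        return self.start.position

    @property
    def nofuzzy_end(self):
        # Take the highest possible end to get the wider feature.
        return self.end.position + self.end.extension


class SeqFeature(object):
    """Stand-in feature whose start/end proxy the non-fuzzy integers."""

    def __init__(self, location):
        self.location = location

    @property
    def start(self):
        return self.location.nofuzzy_start

    @property
    def end(self):
        return self.location.nofuzzy_end


seq = "AAACCCGGGTTT"
feature = SeqFeature(FeatureLocation(Position(3, 0), Position(6, 2)))
sub_seq = seq[feature.start:feature.end]
```

The fuzzy details stay available on feature.location for anyone who needs them, while the top-level properties give integers that can be used directly in a slice.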
From bugzilla-daemon at portal.open-bio.org Tue Apr 21 12:16:17 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 21 Apr 2009 08:16:17 -0400 Subject: [Biopython-dev] [Bug 2818] Add start and end properties to SeqFeature object In-Reply-To: Message-ID: <200904211216.n3LCGHWZ025657@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2818 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-21 08:16 EST ------- Created an attachment (id=1281) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1281&action=view) Patch to Bio/SeqFeature.py Makes SeqFeature.start and end proxies for SeqFeature.location.nofuzzy_start and SeqFeature.location.nofuzzy_end (i.e. plain integers) See also: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005818.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Tue Apr 21 12:17:41 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Apr 2009 13:17:41 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904210452n6c17fcafl205abcfcc4e3edd1@mail.gmail.com> References: <20090417200558.GC19290@sobchak.mgh.harvard.edu> <252312.21376.qm@web62408.mail.re1.yahoo.com> <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> <20090420132946.GB29652@sobchak.mgh.harvard.edu> <320fb6e00904210452n6c17fcafl205abcfcc4e3edd1@mail.gmail.com> Message-ID: <320fb6e00904210517k63edc766xcb830a7150e4c5d1@mail.gmail.com> On Tue, Apr 21, 2009 at 12:52 PM, Peter Cock wrote: >> Accessing the start and end coordinates in SeqFeatures is unnecessarily >> cumbersome right now, but can be fixed fairly simply. We should be able >> to get this in now that 1.50 is rolled out. >> ... 
>> To be clear, start and end in SeqFeature would be integers and not >> handle any fuzzy stuff. All of the representation is still there for >> those actually dealing with fuzziness, but the top level attributes >> would expose the coordinates nicely for the remaining 99% of cases. > > Right - and with the above correction that SeqFeature.start and end > would be proxies for SeqFeature.location.nofuzzy_start and > SeqFeature.location.nofuzzy_end, you would get plain integers, and > this should cover most use cases. At least for non-Eukaryotes ;) Patch for this proposal on Bug 2818, http://bugzilla.open-bio.org/show_bug.cgi?id=2818 Peter From bartek at rezolwenta.eu.org Tue Apr 21 12:17:55 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 21 Apr 2009 14:17:55 +0200 Subject: [Biopython-dev] history on github - where are the tags? In-Reply-To: <8b34ec180904210516y21bec2e1r3294b2d15edf386f@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com> <8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com> <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com> <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com> <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com> <8b34ec180904210434t2ee76e8bsc91af814f53e2df4@mail.gmail.com> <320fb6e00904210457p6189e096m966becad772cd610@mail.gmail.com> <8b34ec180904210516y21bec2e1r3294b2d15edf386f@mail.gmail.com> Message-ID: <8b34ec180904210517i259762e7t343e0f773c939a15@mail.gmail.com> On Tue, Apr 21, 2009 at 1:57 PM, Peter wrote: > Maybe. We can double check this by creating a trivial project in > github, doing a few commits, tag, commits, tag - and checking the > github interface and also the GitX presentation. That should tell us > if the issue is specific to our converted repository or not. No, it's not specific.
You can find a toy repository here: http://github.com/barwil/testing_tags/tree/master (please don't consider this link a permanent one, I'll remove it soon.) cheers Bartek From chapmanb at 50mail.com Tue Apr 21 12:20:45 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 21 Apr 2009 08:20:45 -0400 Subject: [Biopython-dev] Rolling new releases In-Reply-To: <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com> References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com> <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com> <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com> Message-ID: <20090421122045.GD30529@sobchak.mgh.harvard.edu> Hi Peter; > http://biopython.org/wiki/Building_a_release > > i.e. Maybe in a few months time I (or Michiel) can say "Right, CVS > freeze while XXX does the release", where person XXX gets to scan the > documentation, double check the NEWS files, check the unit tests etc, > before putting together the packages and uploading them to the server. > And maybe then hand over to our "News Coordinator" to do the release > announcement? Having more people involved will make it take a little > longer, but should mean less minor things get missed (e.g. a typo in > the NEWS file, or a broken unit test specific to a particular OS or > version of python). It would be great to have others involved in rolling releases. BioPerl often passes the Release Manager hat around for release to release, and perhaps we can get the same tradition going here. I like the idea of people volunteering for this. It would also be worth thinking about what the worst parts of building the releases are and seeing if we can automate or eliminate them. A few things that I can think of: - Remove support for older python versions, which would eliminate all those windows installers. 
I will write more about this in your other thread. - Eliminating the beta releases. Biopython is developed as stable in Git/CVS, so gets testing that way on developer machines. Are we getting enough feedback from betas to make them worthwhile? - Automate building the docs nightly/weekly on biopython.org. If the Tutorial/epydoc stuff is a lot of work, we could work up a script and cron to eliminate this part. That's from my fuzzy memory of rolling releases. Brad From chapmanb at 50mail.com Tue Apr 21 12:35:31 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 21 Apr 2009 08:35:31 -0400 Subject: [Biopython-dev] Python 2.3 support In-Reply-To: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com> References: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com> Message-ID: <20090421123531.GE30529@sobchak.mgh.harvard.edu> Hi Peter; > As we've been warning for the last couple of releases, Biopython 1.50 > should be the last release to officially support Python 2.3. No one > has complained yet, but they may not have noticed. I suspect there may > be people out there using a local Biopython installation on an old > Linux/Unix computer where the system Python is rather old. For > Biopython 1.50 I added a warning to setup.py when run on Python 2.3 so > that may get more attention. Are we getting a lot of feedback that we need to keep supporting these old versions? 2.3 was released in 2003, 2.4 in 2004, and 2.5 in 2006. This means people who need anything prior to 2.5 haven't updated in over 3 years. I understand the problem of non-responsive sysadmins and what not. However, we only have so many cycles for testing and coding; is it worthwhile spending some on these problems? One of the nice selling points of Python is that it's a dynamic language, and I like using new features of the language as much as anyone. Beyond the 2/3 split, it is very back compatible and I've never had any problems moving even very large projects forward to new versions. 
Practically, I'd be for dropping 2.4 support in the next release and being a bit more aggressive in general on moving upwards and onwards. Brad From p.j.a.cock at googlemail.com Tue Apr 21 12:43:11 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Apr 2009 13:43:11 +0100 Subject: [Biopython-dev] Rolling new releases In-Reply-To: <20090421122045.GD30529@sobchak.mgh.harvard.edu> References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com> <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com> <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com> <20090421122045.GD30529@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com> On Tue, Apr 21, 2009 at 1:20 PM, Brad Chapman wrote: > It would also be worth thinking about what the worst parts of > building the releases are and seeing if we can automate or eliminate > them. A few things that I can think of: > > - Remove support for older python versions, which would eliminate > all those windows installers. I will write more about this in your > other thread. That makes almost no difference, it's just one extra line to do at the command line:

    c:\python23\python setup.py bdist_wininst
    c:\python24\python setup.py bdist_wininst
    c:\python25\python setup.py bdist_wininst
    c:\python26\python setup.py bdist_wininst

Yes, you also have to build and test on each version of python, but honestly, once the build environment is set up, doing the Windows release on three versus four versions of Python isn't worth worrying about. > - Eliminating the beta releases. Biopython is developed as stable > in Git/CVS, so gets testing that way on developer machines. Are we > getting enough feedback from betas to make them worthwhile? For Biopython's move from Numeric to NumPy, I think doing a beta was worthwhile.
Maybe the feedback from the 1.50 beta release wasn't that big, but it didn't take that much effort, and it focused us ready for Biopython 1.50 well. Beta releases are also good for any Windows users, for whom setting up the build environment is quite a hurdle, so running the latest code from the repository is more difficult. Beta releases also give us more press coverage - and gives us a clear way to ask people to try out particular new stuff. > - Automate building the docs nightly/weekly on biopython.org. If the > ?Tutorial/epydoc stuff is a lot of work, we could work up a script > ?and cron to eliminate this part. Again, building the docs is pretty trivial. We have in the past deliberately NOT updated the online copies, so that it is in sync with the latest release. I suppose we could have two copies on the website, the "latest release" and the "nightly code". Peter From chapmanb at 50mail.com Tue Apr 21 12:44:49 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 21 Apr 2009 08:44:49 -0400 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904210452n6c17fcafl205abcfcc4e3edd1@mail.gmail.com> References: <20090417200558.GC19290@sobchak.mgh.harvard.edu> <252312.21376.qm@web62408.mail.re1.yahoo.com> <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com> <20090420132946.GB29652@sobchak.mgh.harvard.edu> <320fb6e00904210452n6c17fcafl205abcfcc4e3edd1@mail.gmail.com> Message-ID: <20090421124449.GF30529@sobchak.mgh.harvard.edu> Hi Peter; [...fuzzy handling...] > Right - and with the above correction that SeqFeature.start and end > would be proxies for SeqFeature.location.nofuzzy_start and > SeqFeature.location.nofuzzy_end, you would get plain integers, and > this should cover most use cases. At least for non-Eukaryotes ;) Yes, that was my proposal. Thanks for fleshing it out and for the patch. > > Does solving the start/end problem as described above help bridge the > > gap between SeqFeatures and the custom representation? 
Are there other > > usability issues you found? I would prefer to expose one data structure > > and think SeqFeature can handle the data well. They scale to nested > > cases, and will be familiar to those using features in SeqIO or BioSQL. > > You must agree that SeqFeature and FeatureLocation objects are not > very lightweight. I understood that one of your goals with Bio.GFF > and map/reduce is to handle massive files, so surely it makes sense to > use a simple object structure here? Unless you are thinking of having an object representation as being too heavy, the non-light part of SeqFeature is all the FeatureLocation fuzziness. I would be for a SeqFeatureLite class that is API compatible with SeqFeature (with the new start/end attributes) and does not support fuzzy locations. This would handle GFF understandably, be lightweight, and allow access to BioSQL and SeqIO. How does this sound? Brad From p.j.a.cock at googlemail.com Tue Apr 21 12:56:23 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Apr 2009 13:56:23 +0100 Subject: [Biopython-dev] Python 2.3 support In-Reply-To: <20090421123531.GE30529@sobchak.mgh.harvard.edu> References: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com> <20090421123531.GE30529@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904210556i7fd3efe3pe74a6773d367a3e9@mail.gmail.com> On Tue, Apr 21, 2009 at 1:35 PM, Brad Chapman wrote: > Hi Peter; > >> As we've been warning for the last couple of releases, Biopython 1.50 >> should be the last release to officially support Python 2.3. No one >> has complained yet, but they may not have noticed. I suspect there may >> be people out there using a local Biopython installation on an old >> Linux/Unix computer where the system Python is rather old. For >> Biopython 1.50 I added a warning to setup.py when run on Python 2.3 so >> that may get more attention. > > Are we getting a lot of feedback that we need to keep supporting these > old versions?
2.3 was released in 2003, 2.4 in 2004, and 2.5 in 2006. > This means people who need anything prior to 2.5 haven't updated in over > 3 years. I understand the problem of non-responsive sysadmins and what > not. However, we only have so many cycles for testing and coding; is it > worthwhile spending some on these problems? Until recently I had a very strong personal interest in keeping Biopython running on Python 2.3, so I never regarded this as "wasted cycles". My personal Windows machine ran Python 2.3 and MSVC 6.0. In order to update the python version and continue to compile Biopython, I would also have had to replace the compiler etc. and the hard drive was pretty full so this didn't appeal. I have recently been trying Ubuntu on this machine instead (on a second hard drive). For reference, my current (only) Windows machine (at work) has Python 2.3, 2.4 and 2.5 for which I use mingw32 to compile Biopython (same setup as Michiel), plus Python 2.6 for which I'm using Microsoft's free VC++ 2008 Express Edition from http://www.microsoft.com/express/download/ > Practically, I'd be for dropping 2.4 support in the next release and > being a bit more aggressive in general on moving upwards and onwards. I wouldn't support that. I would insist on giving at least one release's notice as a minimum. Peter From p.j.a.cock at googlemail.com Tue Apr 21 13:05:23 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Apr 2009 14:05:23 +0100 Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF) Message-ID: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> On Tue, Apr 21, 2009 at 1:44 PM, Brad Chapman wrote: >> You must agree that SeqFeature and FeatureLocation objects are not >> very lightweight. I understood that one of your goals with Bio.GFF >> and map/reduce is to handle massive files, so surely it makes sense to >> use a simple object structure here?
> > Unless you are thinking of having an object representation as being too > heavy, the non-light part of SeqFeature is all the FeatureLocation > fuzziness. Fair point. > I would be for a SeqFeatureLite class that is API compatible with > SeqFeature (with the new start/end attributes) and does not support > fuzzy locations. This would handle GFF understandably, be lightweight, > and allow access to BioSQL and SeqIO. How does this sound? I have also been thinking about how I would (re)design the SeqFeature and FeatureLocation objects. In particular I would want to put the strand as part of the same object as the location, and also any join-locations. I would still want to cope with fuzzy locations, but make the non-fuzzy approximations more prominent in comparison. Also, I really don't like the way joins are currently stored as more SeqFeatures in the sub_features list (plus this kind of blocks alternative usage for child/parent nesting that might be nice for GFF files). The prime use case to keep in mind is taking a feature location (even a join), and using this to extract that region of nucleotides from the parent sequence (i.e. a Seq object or a SeqRecord object, as now both can be sliced). 
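As a rough illustration of the prime use case above, here is a minimal sketch of a location object that carries its own strand and join parts and can pull its region out of a parent sequence. Everything in it is hypothetical: `SimpleLocation`, `extract` and `reverse_complement` are invented names, not the Bio.SeqFeature API, the parent is assumed to be a plain string of unambiguous DNA, and the code targets a current Python.

```python
# Hypothetical sketch only; none of these names exist in Biopython.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a plain DNA string (unambiguous bases only)."""
    return seq.translate(_COMPLEMENT)[::-1]


class SimpleLocation(object):
    """Integer-only location that holds its own strand and join parts."""

    def __init__(self, parts, strand=1):
        # parts: (start, end) pairs in Python counting (0-based, end
        # exclusive), listed in ascending order along the parent sequence.
        self.parts = [(int(start), int(end)) for start, end in parts]
        self.strand = strand

    @property
    def start(self):
        """Left-most coordinate over all parts, as a plain integer."""
        return min(start for start, end in self.parts)

    @property
    def end(self):
        """Right-most coordinate over all parts, as a plain integer."""
        return max(end for start, end in self.parts)

    def extract(self, parent):
        """Slice each part out of the parent string and join the pieces."""
        joined = "".join(parent[start:end] for start, end in self.parts)
        if self.strand == -1:
            # Assumed convention: join first, then reverse complement.
            joined = reverse_complement(joined)
        return joined


parent = "AAACCCGGGTTT"
print(SimpleLocation([(0, 3), (6, 9)]).extract(parent))             # AAAGGG
print(SimpleLocation([(0, 3), (6, 9)], strand=-1).extract(parent))  # CCCTTT
```

Keeping the join parts on the location itself would leave sub_features free for parent/child nesting, which is the motivation given above.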
Peter From dalloliogm at gmail.com Tue Apr 21 13:25:52 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Tue, 21 Apr 2009 15:25:52 +0200 Subject: [Biopython-dev] Python 2.3 support In-Reply-To: <320fb6e00904210556i7fd3efe3pe74a6773d367a3e9@mail.gmail.com> References: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com> <20090421123531.GE30529@sobchak.mgh.harvard.edu> <320fb6e00904210556i7fd3efe3pe74a6773d367a3e9@mail.gmail.com> Message-ID: <5aa3b3570904210625v78e604c4j87c4be62b9a0488c@mail.gmail.com> On Tue, Apr 21, 2009 at 2:56 PM, Peter Cock wrote: > On Tue, Apr 21, 2009 at 1:35 PM, Brad Chapman wrote: > > Hi Peter; > > > >> As we've been warning for the last couple of releases, Biopython 1.50 > >> should be the last release to officially support Python 2.3. No one > >> has complained yet, but they may not have noticed. > I know of many people (a whole lab) which until recently were still using python 2.3. However, please, drop support for these older version or people won't never upgrade :) -- My blog on bioinformatics (now in English): http://bioinfoblog.it From p.j.a.cock at googlemail.com Tue Apr 21 13:51:26 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Apr 2009 14:51:26 +0100 Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF) In-Reply-To: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> Message-ID: <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com> On Tue, Apr 21, 2009 at 1:44 PM, Brad Chapman wrote: > Unless you are thinking of having an object representation as being too > heavy, the non-light part of SeqFeature is all the FeatureLocation > fuzziness. I've just had a quick go at what should be a 100% backwards compatible modification to the FeatureLocation class to store ExactPosition start or end positions as integers. 
The idea should be more memory efficient, using the complex position objects only when required. The new __init__ method would look like this:

    def __init__(self, start, end):
        """Specify the start and end of a sequence feature."""
        #Keeps exact locations as plain integers
        #Calculates the non-fuzzy versions now to make accessing
        #them simpler and faster (expected to be used more often)
        if isinstance(start, int) or isinstance(start, long):
            self._start = None
            self._start_int_nofuzzy = start
        elif isinstance(start, ExactPosition):
            #Don't need to keep the full object
            self._start = None
            self._start_int_nofuzzy = start.position
        else:
            assert isinstance(start, AbstractPosition), repr(start)
            self._start = start
            self._start_int_nofuzzy = min(start.position,
                                          start.position + start.extension)
        if isinstance(end, int) or isinstance(end, long):
            self._end = None
            self._end_int_nofuzzy = end
        elif isinstance(end, ExactPosition):
            #Don't need to keep the full object
            self._end = None
            self._end_int_nofuzzy = end.position
        else:
            assert isinstance(end, AbstractPosition), repr(end)
            self._end = end
            self._end_int_nofuzzy = max(end.position,
                                        end.position + end.extension)

The associated methods are then updated accordingly. When a position object is requested, self._start or self._end is used if it is not None; otherwise an ExactPosition is generated on the fly from the integer self._start_int_nofuzzy or self._end_int_nofuzzy. When the non-fuzzy integer approximation is wanted (the typical use case), we have those cached as the integers.

The unit tests all pass (except test_BioSQL_SeqIO.py), but we'd need to have some sort of benchmark to demonstrate any memory gains in order to justify this kind of change. Maybe try it with Brad's GFF parser on a very large file? I could stick the full patch on Bugzilla (or perhaps github) if this sounds worth pursuing...
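To make the accessor side of this concrete, here is a minimal, self-contained sketch of the behaviour described above. It is not the actual patch: the position classes are simplified stand-ins for those in Bio.SeqFeature, `LazyLocation` and `_split` are hypothetical names, and the code is written for a current Python (so there is no `long` check).

```python
class AbstractPosition(object):
    """Simplified stand-in: a position with an optional fuzzy extension."""

    def __init__(self, position, extension=0):
        self.position = position
        self.extension = extension


class ExactPosition(AbstractPosition):
    """Stand-in for an exact (non-fuzzy) position."""


class WithinPosition(AbstractPosition):
    """Stand-in for a fuzzy position within position..position+extension."""


class LazyLocation(object):
    """Hypothetical location that keeps exact positions as plain integers."""

    def __init__(self, start, end):
        self._start, self._start_int_nofuzzy = self._split(start, min)
        self._end, self._end_int_nofuzzy = self._split(end, max)

    @staticmethod
    def _split(value, pick):
        """Return (position object or None, cached non-fuzzy integer)."""
        if isinstance(value, int):
            return None, value
        if isinstance(value, ExactPosition):
            # No need to keep the full object for an exact position.
            return None, value.position
        if not isinstance(value, AbstractPosition):
            raise TypeError("expected an int or a position, got %r" % (value,))
        return value, pick(value.position, value.position + value.extension)

    @property
    def start(self):
        """Position object, built on the fly when only an integer is stored."""
        if self._start is None:
            return ExactPosition(self._start_int_nofuzzy)
        return self._start

    @property
    def end(self):
        if self._end is None:
            return ExactPosition(self._end_int_nofuzzy)
        return self._end

    @property
    def nofuzzy_start(self):
        return self._start_int_nofuzzy

    @property
    def nofuzzy_end(self):
        return self._end_int_nofuzzy


loc = LazyLocation(5, WithinPosition(10, 3))
print(loc.nofuzzy_start, loc.nofuzzy_end)  # 5 13
```

One consequence of this design is that each access to start on an exact location returns a new ExactPosition object, so code relying on object identity rather than on the position value would behave differently.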
An alternative implementation would use a single private variable to store either the integer position or the position object, and check the type when the public properties are accessed. This should be an even bigger memory saving, but may be slower. Peter From p.j.a.cock at googlemail.com Tue Apr 21 13:55:23 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Apr 2009 14:55:23 +0100 Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF) In-Reply-To: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> Message-ID: <320fb6e00904210655u5f497227o24457ec2ee3d711f@mail.gmail.com> > I have also been thinking about how I would (re)design the SeqFeature > and FeatureLocation objects. In particular I would want to put the > strand as part of the same object as the location, and also any > join-locations. I would still want to cope with fuzzy locations, but > make the non-fuzzy approximations more prominent in comparison. Also, > I really don't like the way joins are currently stored as more > SeqFeatures in the sub_features list (plus this kind of blocks > alternative usage for child/parent nesting that might be nice for GFF > files). > > The prime use case to keep in mind is taking a feature location (even > a join), and using this to extract that region of nucleotides from the > parent sequence (i.e. a Seq object or a SeqRecord object, as now both > can be sliced). I forgot to mention the second major use case I'm concerned about, which is recovering the GenBank/EMBL style location string. I have looked at this in the past, by adding methods to the FeatureLocation and all the Position objects, but it is complicated by the fact the Position objects don't know if they are at the start or end (and for the start locations we need to add one to convert from Python counting).
This is the main block on having Bio.SeqIO support writing GenBank (or EMBL) files with their features included. Peter From lpritc at scri.ac.uk Tue Apr 21 13:50:01 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 21 Apr 2009 14:50:01 +0100 Subject: [Biopython-dev] Bio.Motif Suggestions In-Reply-To: <8b34ec180904210429g36d089a6h578dc0197a94516a@mail.gmail.com> Message-ID: Hi Bartek, It's a long one, this... I expect many TLDR response ;) On 21/04/2009 12:29, "Bartek Wilczynski" wrote: > On Tue, Apr 21, 2009 at 10:34 AM, Leighton Pritchard > wrote: >> Some thoughts and a bit of a wishlist... > > These are always welcome. I can make no promises on timing of making > your wishes come true ;) No-one ever does :( > But it is a workaround rather than a feature... I'd be also interested > in knowing about other applications where maybe this assumption (small gaps) > is violated. Are there also motifs with multiple gaps? Yes - it might be a stretch, but if you wanted to represent the organisation of protein domains in a multi-domain protein (e.g. a transposase, or some pathogen effectors) as motifs you might want to do this. > Implementing this feature would probably require a separate > subclass of Motif, since the internal implementation of searching would > need to be different. I'm not sure that this needs to be true. A motif with no gaps can be considered as a special case of a motif with an arbitrary number of gaps. If the base implementation is that of a gapped motif (e.g. Represented as ACT.{5,10}CCC.{,4}TATCAT.{3}GGG) then the basic method of searching - and here using the re module might work - doesn't need to be any different for an ungapped variant representing a particular instance of the multiply-gapped motif (ACTNNNNNNCCCNNNNTATCATNNNGGG), or for any other ungapped sequence (e.g. ACTCCCTATCATGGG). This may not be the case for more complex search algorithms, however. Other classes of Motif may well be necessary, in any case... 
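As a rough illustration of the point above, the standard re module can search with a gapped motif, one concrete instance of it, or an unrelated ungapped motif through exactly the same call. This is only a sketch: the helper name `find_motif` and the target sequence are invented for the example, and only the gapped pattern is taken from the text.

```python
import re


def find_motif(pattern, sequence):
    """Return (start, end) spans of all non-overlapping matches."""
    return [match.span() for match in re.finditer(pattern, sequence)]


# Made-up instance: ACT, 6-letter spacer, CCC, 2-letter spacer, TATCAT,
# 3-letter spacer, GGG. The target embeds it between flanking bases.
instance = "ACT" + "AGTCAG" + "CCC" + "TT" + "TATCAT" + "GCA" + "GGG"
target = "TTG" + instance + "AA"

gapped = "ACT.{5,10}CCC.{,4}TATCAT.{3}GGG"  # multiply-gapped motif from the text
ungapped = "TATCAT"                          # plain ungapped motif

for motif in (gapped, instance, ungapped):
    print(motif, find_motif(motif, target))
```

The same function serves all three motifs because an ungapped sequence is just a regular expression with no repeat ranges. A PSSM or HMM search would need a different scoring routine, which is the caveat raised above.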
> This is a very good feature request, I think it is worth implementing, > though currently I have no time to do it properly. I'm right there with you, unfortunately ;) >> I guess that this comes down to whether we choose to restrict the meaning of >> "motif" to an ungapped string of symbols (including ambiguity) representing >> nt/aa, or whether we want to permit the inclusion of variable-length gaps >> > I think that you are touching on multiple issues here. I was trying to focus on one issue, but it does have lots of implications, which you cover below. The one issue I intended is this: A sequence motif can be represented in more than one way, and those ways are not necessarily interchangeable - either conceptually or in code. An ungapped string of symbols isn't able to represent the same information as a regular expression (can do ambiguity of repeat counts), which in turn isn't able to represent the same information as a PSSM (can represent probabilities at each position), which in turn isn't able to represent the same information as an HMM (can represent variable-order dependency). However, the things you want to do with that motif, such as use it to search a set of candidate sequences or produce an example matching sequence for test purposes, can be the same regardless of the coding or conceptual representation of that motif. We come back to this below, but for now this does lead on to... > - gapped alignemnts are one thing. If we have a gap in one sequence > but not in the others > (frequent in protein motifs, not so much in DNA motifs) we just need a > way to sensibly use it in creation of PWMs for searching > - dyadic motifs (gaps in otherwise ungapped alignments) are a > different issue, since we have a > gap in all instances, but it may have a variable length. see above. These are, I think, the same issue. In your first example, PWMs will (mostly) work because the lengths of most sequences are the same and there are few gaps. 
However, unless you have a way of varying the length of your PWM during a query of the target sequence, the PWM need not match the gapped sequence strongly, potentially leading to a false negative. As an example: ABCDE AB-DE ABCDE ABCDE The PWM will be (shorthand) [A1][B1][C.75,-.25][D1][E1], and when applied to the target sequence ABDE (which was in your alignment), will not produce as high a score as it would for the other members of the alignment. For the alignment: A-CDE AB-DE ABC-E ABCDE The PWM is (shorthand) [A1][B.75,-.25][C.75,-.25][D.75,-.25][E1] With corresponding poor scores (potential false negatives) for target sequences ACDE, ABDE and ABCE. Without a way to (intelligently) place gaps in your target sequences, or otherwise account for gaps when searching, the problem is the same whether there is one gap or a dyadic motif. The *practical* issue is different, in that you can probably accept the odd false negative for a motif in which one training sequence has a gap, but PWMs are poor candidates for alignments with many gaps, as they can readily produce false negatives. The key issue is that PWMs are fixed-length, and variable-length representations are common, desirable, and difficult to express in a fixed-width framework. > -regular expressions are a different way of describing motifs. That is true - they are intermediate between consensus sequence, and PSSMs in their ability to describe variation, but also have the capacity to represent variable-length sequences. > I think that it is not a purpose of Bio.Motif to compete with regexps, but it > would be certainly valuable to be able to have a possibility of creating > motifs from some sort of (simplified) regexps. 
This was, to some extent, > discussed in a recent thread on Seq.startswith methods I was involved in that discussion :D I don't think that Bio.Motif needs to compete with the re module, but instead could use its robust, stable code to implement a regular expression representation of sequence motifs, seamlessly. > -HMM motifs are totally different kind of beast. These guys introduce > dependencies between positions (doable also with regexps) and there is > currently no support for them in Bio.Motif. It would be cool to have > support for them, but I'm not an expert here and it looks to me like a > lot of work (also probably the methods of Bio.Motif are not exactly right for > HMMs). You're right about the dependencies - they're the important features I was alluding to in my post - but I don't think that regular expressions are a good way to approach the same problem; they don't encode the same information. > -finally, suporting prosite syntax seems to be depending on the variable gap > feature, but otherwise it's simple an important input fomat to support. I wasn't suggesting PROSITE syntax as part of any desire for implementation - though a PROSITE <-> regex/consensus translation would be useful, I think - rather as an illustration that more people than me need variable length spacers in their motifs. >> I think the latter representations are more useful, even if harder to >> code/maintain. ?I think that leaving them out would be a glaring hole in >> functionality > > Usefulness is hard to define in abstract of a particular problem , so > this is arguable. It is certain that bio.Motif is not complete suite for all > kinds of motif analysis but i don't know of any tool that is supporting alll > these types of motifs with a single API (if you know one, please tell me). > We should have ambitious goals, but I wouldn't call it a glaring hole not to > have what is currently not available elsewhere... I apologise for my poor wording. 
What I meant was that it would seem odd if support for motif representation was considered complete without representing variable-length sequences. Left alone, this would always represent an obvious target for improvement (i.e. 'a glaring hole in functionality'). No criticism was meant by it - I think you've done a great job so far on Bio.Motif - and I apologise if I have caused offence. >> I think that, for anything other than simple searches (string search, >> regex), we'd be on a hiding to nothing by implementing search methods within >> Python. ?It's not likely to be as fast as dedicated search packages, and it >> would be a headache for maintenance. > What do you mean by searching here? Searching for a known motif or searching > for a new motif? And what dedicated packages you have on your mind? Searching for a known motif in a larger sequence. Three packages - two biologically-dedicated, one not - spring to mind. The non-biologically-dedicated one is grep. Representing ambiguity symbols as combinations of bases, e.g. [ACT] . [TA], [^T] and so on - with FASTA files where sequences are not punctuated by \n or \r - is highly effective for finding sequence motifs representable by regular expressions. Dedicated 1: PSI-BLAST - takes PSSMs representing a sequence profile Dedicated 2: HMMer - builds and uses an HMM representation of the sequence profile. There are others, but I'd have to think hard to recall them. You could consider HMMer versions 1, 2 and 3 as different, in a number of ways - including their utility for nucleotide sequence representation... >> it seems to me that Bio.Motif could >> be most powerful in the alignment/searching/comparison process as a 'broker' >> within BioPython, providing a consistent API for interface with external >> alignment/search/comparison applications that also permits programmatic >> manipulation of the profile/HMM/alignment. ?E.g. 
> I think that the most valuable thing would be to internalize some of > the compliexity of different ways of using motifs in bioinformatics. My modest > goal for now is making protein motifs first class citizens (meaning handling > alphabets and gaps properly etc. ). > The next thing would be to make bio.motif cooperate nicely with > - Bio.Seq (e.g seq.startswith etc.), > - Bio.Align (conversions from-to alignments) > which includes easy motif creation from simple formats like IUPAC and > simple regexps and would correspond to the "broker" function if I understand > it correctly. > Then I think it would be really cool to have spaced motifs, although > here we need to be careful about performance. If I might suggest: the main role of the Bio.Motif module as you intend it appears to be to represent motifs of biological sequences, and to provide useful functionality for them. Now, there are several ways of representing these motifs both conceptually, and in code - and they're not all interchangeable. Some of them have a many -> one mapping (PSSM -> consensus sequence), and some have no obvious mapping at all (HMM <-/-> PSSM). There is a decision to be made concerning how motifs are represented internally: PSSM, regex and/or HMM. PSSM has the clear benefit that, given a PSSM, you can easily generate the consensus sequence and a regular expression of fixed-length - but the mapping to a regular expression is not clear, and may not produce the one that the user would prefer. HMMs can't readily be converted to other representations, and regular expressions can't be expanded to PSSMs, or converted to consensus sequences (unless they have no length ambiguities). It is not just performance we need to think about, but the very representation of a motif. Each of these representations is useful under different circumstances. I think it is worth avoiding a structure that enforces a single internal representation and closes off future alternative representations. 
Giving the user sufficient flexibility/rope to hang themselves with in their choice of internal representation is a Good Thing?, in my opinion. > As for more specific things: > - I don't like the usage of PSSM and consensus here. these are just > different ways of looking at a Motif. > I don't understand your idea of separating consensus from pssm motifs. These > are not fundamentally different. HMMs though are really different. I see what you mean, but I think you're associating PSSM with Motif too strongly. A PSSM can be used to generate a consensus sequence, but the resulting consensus sequence cannot be used to generate the corresponding PSSM uniquely. There is not a one-one mapping, and they do not describe the same information. Consensus sequences, for example, do not indicate the probability of finding a particular symbol at any given position; PSSMs can. PSSMs are fundamentally different from consensus sequences in that they don't encode variability at any position. Consensus, regex, PSSM and HMM are all different ways of looking at a Motif, but they're not all internally-compatible - which is my point. If you build a PSSM motif and make the alignment data nonrecoverable, you cannot reconstruct a corresponding HMM representation, later, for example. So you would have to decide what kind of representation you use at motif build-time, build all of them at once, or keep the alignment around to build what you need later. I'd prefer to choose at build time, but YMMV. > -Also the difference between HMMer and HMM is unclear to me > (isn't hmmer a tool to make HMMS? Do we support HMMER in Biopython currently?) > But I'm not too concerned about HMMs at the moment. There is a fair amount of flexibility in how you choose to define your HMM for a motif, and not just in the order of the HMM. There has been corresponding variation in how HMMer represents its data internally, over the years. 
I was meaning to imply by syntax that a HMMer-specific representation could be called 'hmmer', but a generic internal HMM representation could just be called 'hmm', to reflect this. I'm not going to insist on the convention, but it seems simple and obvious to me (again, YMMV). Sorry for the length and likely repetition, but I think these are issues worth thinking about. Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
______________________________________________________

From biopython at maubp.freeserve.co.uk Tue Apr 21 14:30:20 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Apr 2009 15:30:20 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com> <8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com> <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com> <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com> <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>
Message-ID: <320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com>

>
> From some more reading this, it sounds like our CVS tags are
> essentially turned into commit markers in git. See:
>
> http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#how-git-stores-references
> http://book.git-scm.com/3_git_tag.html
>
> This shouldn't rule out showing them in the history, but perhaps the
> cvs to git migration confuses things...

By setting up a toy repository with tags done through git itself (I assume), Bartek has convinced me that GitHub itself never shows the tags in the history. I think this is a big drawback, and that we should ask GitHub about this.

However, using the Mac GUI tool GitX, I was able to see the tags in the history using the toy repository (they show up as nice yellow blobs), but not using the current Biopython CVS to git conversion.

There appears to be something less than ideal about our CVS to git conversion. I believe this relates to how the tag commits appear in the commit tree - and it looks like for Biopython they are all tiny branches off the main trunk. i.e. If you look at the main trunk history (overall or for any one file) then tag commits are not in it.
This hunch appears to be supported by the git log output:

$ git clone git://github.com/biopython/biopython.git
$ cd biopython
$ git log --graph --all
...
|
* commit 8fb446965d58f266ba8bf41a992a09e4bedbac3e
| Author: peterc
| Date:   Mon Apr 20 16:07:41 2009 +0000
|
|     Bump the version number now that Biopython 1.50 is released
|
| * commit 4ed11049092d86704a2a15359c77459bad30e291
|/  Author: cvs2dvcs transform
|   Date:   Mon Apr 20 10:48:32 2009 +0000
|
|       This commit was manufactured by cvs2svn to create tag 'biopython-150'.
|
* commit 29aa4df3480cdee803694766f137ab2baf5625b2
| Author: peterc
| Date:   Mon Apr 20 10:48:31 2009 +0000
|
|     You don't have to email Iddo to get on the CONTRIB file
|
...

In comparison, for Bartek's toy repository there is a single branch shown.

Peter

From sbassi at clubdelarazon.org Tue Apr 21 14:34:20 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Tue, 21 Apr 2009 11:34:20 -0300
Subject: [Biopython-dev] Python 2.3 support
In-Reply-To: <5aa3b3570904210625v78e604c4j87c4be62b9a0488c@mail.gmail.com>
References: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com> <20090421123531.GE30529@sobchak.mgh.harvard.edu> <320fb6e00904210556i7fd3efe3pe74a6773d367a3e9@mail.gmail.com> <5aa3b3570904210625v78e604c4j87c4be62b9a0488c@mail.gmail.com>
Message-ID: <9e2f512b0904210734w46d22856k46c4bdfcddcf0346@mail.gmail.com>

On Tue, Apr 21, 2009 at 10:25 AM, Giovanni Marco Dall'Olio wrote:
> However, please, drop support for these older version or people won't never
> upgrade :)

That is true, but it is also true that you can use a new version without upgrading. The reason for not upgrading is in most cases to avoid breaking working scripts. In my (old) OS, the WIFI card uses Python 2.3 to work. But Python allows you to install "alternative" versions without conflicting with your default system version. This way I have Python 2.4, 2.5, 2.6 and 3 all installed in the same machine.
Using alt-install or just compiling a Python version without doing a system install. I even have more than one 2.5 version and each with a different Biopython installation (using virtual_env) for testing purposes. So I don't think there is a valid reason to keep supporting such an old version. From biopython at maubp.freeserve.co.uk Tue Apr 21 14:47:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Apr 2009 15:47:40 +0100 Subject: [Biopython-dev] Python 2.3 support In-Reply-To: <9e2f512b0904210734w46d22856k46c4bdfcddcf0346@mail.gmail.com> References: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com> <20090421123531.GE30529@sobchak.mgh.harvard.edu> <320fb6e00904210556i7fd3efe3pe74a6773d367a3e9@mail.gmail.com> <5aa3b3570904210625v78e604c4j87c4be62b9a0488c@mail.gmail.com> <9e2f512b0904210734w46d22856k46c4bdfcddcf0346@mail.gmail.com> Message-ID: <320fb6e00904210747h73b8881dkcfaf8a53f2f7aab@mail.gmail.com> > ... Python allows to install "alternative" versions without > conflicting with your default system version. This way I have Python > 2.4, 2.5, 2.6 and 3 all installed in the same machine. Using > alt-install or just compiling a Python version without doing a system > install. I even have more than one 2.5 version and each with a > different Biopython installation (using virtual_env) for testing > purposes. So I don't think there is a valid reason to keep supporting > such an old version. OK, OK, no one loves Python 2.3 anymore, and you'll all be glad to see the back of it ;) Shall we say that at the end of April, unless anyone has come forward with a strong need to continue using Biopython on Python 2.3 (or we are forced to do another release to fix something), we'll start work on removing Python 2.3 specific code in May? A lot (hopefully most) of the Python 2.3 bits have a comment about this in the source code, so a quick grep should pull out most of them. 
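The "quick grep" suggested above could also be done from Python. This is only a sketch, written for a current Python 3 rather than the versions discussed in this thread; the function name, the marker string "Python 2.3" and the "Bio" directory argument are assumptions about how the comments are worded and where the sources live.

```python
import os


def find_marker_lines(root, marker="Python 2.3"):
    """Yield (path, line_number, text) for lines in .py files containing marker."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going on oddly encoded files.
            with open(path, errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if marker in line:
                        yield path, number, line.rstrip()


if __name__ == "__main__":
    # "Bio" is an assumed path to the package sources; adjust as needed.
    for path, number, text in find_marker_lines("Bio"):
        print("%s:%i: %s" % (path, number, text))
```

A plain substring match will miss comments that spell the version differently, so the results are a starting list for Bug 2817 rather than a complete inventory.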
If any of you remember any other specific things we need to change, add a note to Bug 2817 please. http://bugzilla.open-bio.org/show_bug.cgi?id=2817

Thanks

Peter

From bartek at rezolwenta.eu.org Tue Apr 21 15:19:00 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 21 Apr 2009 17:19:00 +0200
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com> <8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com> <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com> <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com> <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com> <320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com> <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com>
Message-ID: <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com>

Hi,

> There appears to be something less than ideal about our CVS to git
> conversion. I believe this relates to how the tag commits appear in
> the commit tree - and it looks like for Biopython they are all tiny
> branches off the main trunk. i.e. If you look at the main trunk
> history (overall or for any one file) then tags commits are not in it.
>

I haven't noticed this difference. It just seems to be the way cvs2git handles tags. This behavior does not seem to be controllable from the config file. I'll try to ask on the cvs2git mailing list.

In case it is not possible to change it in cvs2git itself, the worst scenario would be to re-tag the git tree manually (or with the help of some script). So there is no risk of losing tags.

I'll post when I have any progress on this issue.
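The "re-tag with the help of some script" fallback mentioned above might look roughly like the following sketch. It only builds the `git tag` commands and does not run them. The function name is hypothetical, the commit subject pattern is copied from the git log output quoted earlier in this thread, and the key assumption (not verified here) is that each manufactured commit has the same content as its parent on the trunk, so the tag can be moved to that parent.

```python
import re

# Subject line written by the conversion for each tag, as seen in the log above.
TAG_SUBJECT = re.compile(r"manufactured by cvs2svn to create tag '([^']+)'")


def retag_commands(commits):
    """Build 'git tag' commands from (sha, subject) pairs.

    Assumption: the manufactured commit is identical in content to its
    parent on the trunk, so the tag may point at the parent instead.
    """
    commands = []
    for sha, subject in commits:
        match = TAG_SUBJECT.search(subject)
        if match:
            # "<sha>^" names the first parent of the manufactured commit;
            # -f replaces the existing tag of the same name.
            commands.append(["git", "tag", "-f", match.group(1), sha + "^"])
    return commands


example = [
    ("8fb446965d58f266ba8bf41a992a09e4bedbac3e",
     "Bump the version number now that Biopython 1.50 is released"),
    ("4ed11049092d86704a2a15359c77459bad30e291",
     "This commit was manufactured by cvs2svn to create tag 'biopython-150'."),
]

for command in retag_commands(example):
    print(" ".join(command))
```

If a manufactured commit does differ from its parent (for example a tag that covered only some files), moving the tag would change what it points at, so each case would need checking before running the commands.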
cheers ?Bartek -- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From mjldehoon at yahoo.com Tue Apr 21 15:23:03 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 21 Apr 2009 08:23:03 -0700 (PDT) Subject: [Biopython-dev] Rolling new releases In-Reply-To: <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com> Message-ID: <955867.40270.qm@web62404.mail.re1.yahoo.com> --- On Tue, 4/21/09, Peter Cock wrote: > Again, building the docs is pretty trivial. We have in the > past deliberately NOT updated the online copies, so that it is > in sync with the latest release. I suppose we could have two > copies on the website, the "latest release" and the > "nightly code". > That would be nice. In the past, I've done such things by hand to let people look at the documentation for a piece of code that's about to go into CVS. From mjldehoon at yahoo.com Tue Apr 21 15:28:56 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 21 Apr 2009 08:28:56 -0700 (PDT) Subject: [Biopython-dev] Rolling new releases In-Reply-To: <20090421122045.GD30529@sobchak.mgh.harvard.edu> Message-ID: <268684.34243.qm@web62402.mail.re1.yahoo.com> --- On Tue, 4/21/09, Brad Chapman wrote: > - Eliminating the beta releases. Biopython is developed as > stable in Git/CVS, so gets testing that way on developer > machines. Are we getting enough feedback from betas to make > them worthwhile? I agree. A project like Biopython is destined to be in perpetual beta mode anyway. To my mind, Biopython 1.50-beta is as stable as Biopython 1.49 and Biopython 1.51. In addition, will we be able to remember that Biopython 1.50b is the beta release of version 1.50 (or did we have a 1.50, then a 1.50a, and then a 1.50b release?). 
--Michiel From mjldehoon at yahoo.com Tue Apr 21 15:35:44 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 21 Apr 2009 08:35:44 -0700 (PDT) Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <20090421124449.GF30529@sobchak.mgh.harvard.edu> Message-ID: <595877.69734.qm@web62408.mail.re1.yahoo.com> --- On Tue, 4/21/09, Brad Chapman wrote: > I would be for a SeqFeatureLite class that is API > compatible with SeqFeature (with the new start/end > attributes) and does not support > fuzzy locations. This would handle GFF understandably, be > lightweight, and allow access to BioSQL and SeqIO. > How does this sound? Depends on whether SeqFeatureLite only exists for the benefit of GFF files. If so, we're better off with a light-weight GFF-specific object. If not, then it may make sense. But even then it sounds a bit like class creep. --Michiel. From p.j.a.cock at googlemail.com Tue Apr 21 15:58:23 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Apr 2009 16:58:23 +0100 Subject: [Biopython-dev] Rolling new releases In-Reply-To: <955867.40270.qm@web62404.mail.re1.yahoo.com> References: <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com> <955867.40270.qm@web62404.mail.re1.yahoo.com> Message-ID: <320fb6e00904210858k1aa4b5cav2a784b75fb3b3f8@mail.gmail.com> On Tue, Apr 21, 2009 at 4:23 PM, Michiel de Hoon wrote: > > Peter wrote: >> Again, building the docs is pretty trivial. ?We have in the >> past deliberately NOT updated the online copies, so that it is >> in sync with the latest release. ?I suppose we could have two >> copies on the website, the "latest release" and the >> "nightly code". > > That would be nice. In the past, I've done such things by hand to > let people look at the documentation for a piece of code that's > about to go into CVS. > This should be trivial to get setup - at least as long as our repository lives on the OBF server. 
There are already scripts or CVS hooks in place to update http://biopython.org/SRC/ although I don't know how exactly this is configured. On Tue, Apr 21, 2009 at 4:28 PM, Michiel de Hoon wrote: >Brad wrote: >>> - Eliminating the beta releases. Biopython is developed as >>> stable in Git/CVS, so gets testing that way on developer >>> machines. Are we getting enough feedback from betas to make >>> them worthwhile? > > I agree. A project like Biopython is destined to be in perpetual beta > mode anyway. To my mind, Biopython 1.50-beta is as stable as > Biopython 1.49 and Biopython 1.51. In addition, will we be able to > remember that Biopython 1.50b is the beta release of version 1.50 > (or did we have a 1.50, then a 1.50a, and then a 1.50b release?). Maybe I have hung about with computer scientists / programmers too long, as to me there is no confusion about the ordering alpha -> beta -> release candidate -> final. However, if the consensus is that explicit beta releases are redundant, then so be it. Peter From bartek at rezolwenta.eu.org Tue Apr 21 15:59:32 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 21 Apr 2009 17:59:32 +0200 Subject: [Biopython-dev] Bio.Motif Suggestions In-Reply-To: References: <8b34ec180904200804o58d531ache7e3110f3b919cfa@mail.gmail.com> Message-ID: <8b34ec180904210859r34a0a034qdfb54d57c3ca85e3@mail.gmail.com> Hi, thanks for your suggestions. To make the long story short: - I mostly agree with your points - I've updated the wiki page to include your requests http://biopython.org/wiki/MotifDev - I'll definitely spend some time working on particular requests and then post specifically. cheers Bartek On Tue, Apr 21, 2009 at 10:34 AM, Leighton Pritchard wrote: > Hi, > > Some thoughts and a bit of a wishlist... > > On 20/04/2009 16:04, "Bartek Wilczynski" wrote: > >> On Mon, Apr 20, 2009 at 4:35 PM, Peter >> wrote: >>> >>> What would a space in a motif mean? 
?Clearly something different from >>> a wildcard like N or X in nucleotide or protein sequences. ?Does it >>> mean a gap of variable length? ?If it means a gap of one character >>> then surely just using a "-" would be sensible (as used in multiple >>> sequence alignments), for which we have a gapped alphabet system >>> setup. >>> >> I think that once we start talking about gapped motifs, we are really >> talking about >> multiple alignments on steroids. This hasn't been done so far because you >> don't >> really need it for DNA motifs, > > It might not be required for the motifs you've been working with, but we've > been doing profile-based searches for bipartite regulatory binding sites in > DNA. ?These sites have a variable-length spacer region, and so require > gapped alignments for building motifs. ?The spacer region consensus > (depending on the level of identity required for the consensus) is usually > composed of Ns. > > I guess that this comes down to whether we choose to restrict the meaning of > "motif" to an ungapped string of symbols (including ambiguity) representing > nt/aa, or whether we want to permit the inclusion of variable-length gaps, > regions, or ambiguities in a PROSITE or regular expression-like manner (e.g. > C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, GAACC.{17,21}AAC or > C{,3}A{3,5}TTTT). ?Although profile methods like HMMer can produce a > consensus output that looks like an ungapped string of symbols to represent > a motif, it doesn't capture important features of the HMM representation. > > I think the latter representations are more useful, even if harder to > code/maintain. ?I think that leaving them out would be a glaring hole in > functionality, and that they're a target Biopython should aim for. > >> I think it would be great to be >> able to easily >> convert multiple alignments into motifs. 
This would allow us to ?use >> the power of >> BIo.AlignIO for IO and Bio.Motif for searching and comparisons.The question is >> how to design API for these ?functions. > > I agree. ?I think that there's another important question: what do we mean, > and need to do, when we talk about converting an alignment into a motif? > Consensus/majority and PSSM methods from a sequence alignment should be > straightforward to implement in Python - even for gapped alignments. > Including a representation of variable-length gaps might be a little more > difficult, and storing an HMM representation may be too much to manage > immediately. ?That's still three different types of object - with likely > different components to their interfaces - to be stored. ?In their > relationship to a source alignment, these representations could be > properties of a single alignment, or independent Bio.Motif objects (perhaps > each with a link back to their parent alignment). > > The results of searches are also likely to be qualitatively different, > depending on the type of motif used for the search, and the results desired > by the user. > > I think that, for anything other than simple searches (string search, > regex), we'd be on a hiding to nothing by implementing search methods within > Python. ?It's not likely to be as fast as dedicated search packages, and it > would be a headache for maintenance. ?So, with apologies if I missed this > part of the discussion or documentation, it seems to me that Bio.Motif could > be most powerful in the alignment/searching/comparison process as a 'broker' > within BioPython, providing a consistent API for interface with external > alignment/search/comparison applications that also permits programmatic > manipulation of the profile/HMM/alignment. ?E.g. 
>
> align = Bio.AlignIO.read(alignfilehandle)
> consensus = align.build_consensus(threshold=0.9)
> pssm = align.build_pssm()
> hmmer = align.build_hmmer()
> hmm = align.build_hmm(order=3)
>
> Or
>
> consensus = Bio.Motif.consensus_from_alignment(align, threshold=0.9)
> pssm = Bio.Motif.build_pssm_from_alignment(align)
> hmmer = Bio.Motif.build_hmmer_from_alignment(align)
> hmm = Bio.Motif.build_hmm_from_alignment(align, order=3)
>
> (which I don't think is as neat an interface, even if all
> align.build_consensus does is call the Bio.Motif.consensus_from_alignment
> method)
>
> Followed by things like
>
> pssm.consensus()
> pssm.logo()
> hmm.generate_sequence(length=100)
> hmm.to_graphviz()
>
> And then the consensus, pssm, hmm and hmmer objects could be used as input
> to interfaces for the relevant applications.
>
> Converting an alignment into an HMM for this purpose may itself benefit from
> a call to HMMer's hmmbuild (and Pythonic representation of the data
> structure), rather than implementation of an equivalent internal function -
> even though I think one of those would be useful, too.
>
> Cheers,
>
> L.
>
> --
> Dr Leighton Pritchard MRSC
> D131, Plant Pathology Programme, SCRI
> Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
> e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
> gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
>
>
> ______________________________________________________
> SCRI, Invergowrie, Dundee, DD2 5DA.
> The Scottish Crop Research Institute is a charitable company limited by guarantee.
> Registered in Scotland No: SC 29367.
> Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.
>
>
> DISCLAIMER:
>
> This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries.
?This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. ?It may not be disclosed or used by any other than that > addressee. > If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on > this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. > > Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). > ______________________________________________________ > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > -- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From biopython at maubp.freeserve.co.uk Tue Apr 21 16:29:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Apr 2009 17:29:19 +0100 Subject: [Biopython-dev] history on github - where are the tags? 
In-Reply-To: <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com> <8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com> <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com> <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com> <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com> <320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com> <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com> <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com> Message-ID: <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com> On Tue, Apr 21, 2009 at 4:19 PM, Bartek Wilczynski wrote: > Hi, > >> There appears to be something less than ideal about our CVS to git >> conversion. ?I believe this relates to how the tag commits appear in >> the commit tree - and it looks like for Biopython they are all tiny >> branches off the main trunk. ?i.e. If you look at the main trunk >> history (overall or for any one file) then tags commits are not in it. >> > > I haven't noticed this difference. It just seems to be the way cvs2got > handles tags. This behavior does not seem to be controllable from > the config file. I'll try to ask on the cvs2git mailing list. > > In case it is not possible to change it in cvs2git itself, the worst > scenario would be to re-tag the git tree manually (or with a help > of some script). So there is no risk of loosing tags. > > I'll post when I have any progress onn this issue. There is another option, redo the import using git cvsimport. This has the downside that we lose all the network history currently in github, but its only going to affect a couple of people and that was always a possibility. I've just done this twice, firstly over the network (just over an hour, probably a bad idea in terms of wasting the OBF bandwidth). 
Then I succeeded in doing it locally (under 15 minutes) on my Mac after logging into dev.open-bio.org and fetching a zipped up copy of the CVS files. The hard bit was working out how to get the CVSROOT directory setup:

cvs -d $PWD/biopython_cvs init
cd biopython_cvs
unzip ../../Biopython-CVS-2009-04-21.zip
cd ..
time nice -n 10 git cvsimport -v -k -d /Users/pjcock/repositories/bp_cvs_local_to_git/biopython_cvs -C biopython_git biopython

Both conversions appear to give the same result. Using GitX the history now shows the tags as I expect them to appear (nice yellow markers on the main branch), and the tag side branches have gone:

$ cd biopython_git
$ git log --graph --all
...
|
* commit 6283ffe77fdd07ae678d2fa35ae9311ee7fd51ee
| Author: peterc
| Date:   Mon Apr 20 16:07:41 2009 +0000
|
|     Bump the version number now that Biopython 1.50 is released
|
* commit 17a9b80f89be97fd4cc31d7c3618e82e4c83cafc
| Author: peterc
| Date:   Mon Apr 20 10:48:31 2009 +0000
|
|     You don't have to email Iddo to get on the CONTRIB file
|
...

I'm not sure if "git log" can be told to show the tags itself.

Also, just like Bartek's conversion using cvs2svn, this also appears to correctly identify simple file moving (when the add and delete are done in one CVS operation, obviously not when it was done in two steps like my recent changes in Bio.Graphics.GenomeDiagram).

Note - we can probably use http://github.com/guides/change-author-details-in-commit-history to map author names to github user names later, but in theory git cvsimport will do this with the -A option.

Peter

From biopython at maubp.freeserve.co.uk Tue Apr 21 16:58:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Apr 2009 17:58:23 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com> <8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com> <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com> <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com> <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com> <320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com> <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com> <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com> <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com> Message-ID: <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com> On Tue, Apr 21, 2009 at 5:29 PM, Peter wrote: > > There is another option, redo the import using git cvsimport. ?This > has the downside that we lose all the network history currently in > github, but its only going to affect a couple of people and that was > always a possibility. > > I've just done this twice, firstly over the network (just over an > hour, probably a bad idea in terms of wasting the OBF bandwidth). > Then I succeeded in doing it locally (under 15 minutes) on my Mac > after logging into dev.open-bio.org and fetching a zipped up copy of > the CVS files. ?The hard bit was working out how to get the CVSROOT > directory setup: > > cvs -d $PWD/biopython_cvs init > cd biopython_cvs > unzip ../../Biopython-CVS-2009-04-21.zip > cd .. > time nice -n 10 git cvsimport -v -k -d > /Users/pjcock/repositories/bp_cvs_local_to_git/biopython_cvs ?-C > biopython_git biopython > > Both conversion appear to give the same result. 
Using GitX the
> history now shows the tags as I expect them to appear (nice yellow
> markers on the main branch), and the tag side branches have gone:
>

I've pushed this to github as http://github.com/peterjc/biopython-cvs-import/tree/master

$ cd biopython_git
$ git remote add origin git at github.com:peterjc/biopython-cvs-import.git
$ git push origin master
$ git push origin master --tags

This won't be automatically updated, so please don't fork it!

Peter

From biopython at maubp.freeserve.co.uk Tue Apr 21 18:18:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Apr 2009 19:18:12 +0100
Subject: [Biopython-dev] Possible re-import from CVS to git
Message-ID: <320fb6e00904211118n23e68b2vc43de8397ff3b08c@mail.gmail.com>

On the thread about the missing history tags in github, I wrote:
>> ... Then I succeeded in doing it locally (under 15 minutes) on my Mac
>> after logging into dev.open-bio.org and fetching a zipped up copy of
>> the CVS files. The hard bit was working out how to get the CVSROOT
>> directory setup:
>>
>> cvs -d $PWD/biopython_cvs init
>> cd biopython_cvs
>> unzip ../../Biopython-CVS-2009-04-21.zip
>> cd ..
>> time nice -n 10 git cvsimport -v -k -d
>> /Users/pjcock/repositories/bp_cvs_local_to_git/biopython_cvs -C
>> biopython_git biopython
>>

I've been testing the -A option for git cvsimport to map our CVS usernames to github accounts. http://www.kernel.org/pub/software/scm/git/docs/git-cvsimport.html

The following format omitting the email address does nothing at all (checking the local repository), which is a shame as I was hoping it would allow a quick and simple way to map the CVS usernames to the github usernames:

peterc=peterjc

However, the documented format does work:

peterc=full name <email address>

It seems that as long as the email address matches that used for your github account, once the repository is uploaded to github it will all work nicely - and your github account will be linked to the commit.
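The author mapping file for the -A option described above could be generated rather than typed by hand. A minimal sketch, assuming the documented one-entry-per-line layout of CVS user, full name and email address; the function name is hypothetical and the names and addresses below are placeholders, not real accounts.

```python
def authors_file_lines(authors):
    """Format {cvs_user: (full_name, email)} as 'user=Full Name <email>' lines."""
    lines = []
    for cvs_user in sorted(authors):
        full_name, email = authors[cvs_user]
        if not email:
            # Entries without an email address were reported above to have no effect.
            raise ValueError("no email address for %s" % cvs_user)
        lines.append("%s=%s <%s>" % (cvs_user, full_name, email))
    return lines


# Placeholder entries: these names and addresses are invented.
authors = {
    "peterc": ("Full Name", "someone@example.org"),
    "otheruser": ("Other Name", "other@example.org"),
}

print("\n".join(authors_file_lines(authors)))
```

Refusing entries without an email address up front avoids discovering the silent no-op only after a full import has finished.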
So, if we are going to re-do the git import (and we may have to fix the tag history), it would be very nice if all the existing CVS users could first: (a) setup an account on github, and (b) tell me the email address you are using for it. If we do move to github, you would need to do this anyway in order to be given collaborator status to make commits direct to the main trunk. > I've pushed this to github as > http://github.com/peterjc/biopython-cvs-import/tree/master That is deleted now. Peter From p.j.a.cock at googlemail.com Tue Apr 21 20:06:56 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Apr 2009 21:06:56 +0100 Subject: [Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt In-Reply-To: <320fb6e00904210504g6c7f60f1o96129c9a6759c256@mail.gmail.com> References: <320fb6e00904210426q19ea7a44xdb902661297ec855@mail.gmail.com> <861995.42083.qm@web62406.mail.re1.yahoo.com> <320fb6e00904210504g6c7f60f1o96129c9a6759c256@mail.gmail.com> Message-ID: <320fb6e00904211306u50955608ndccef5d0cb6ba09b@mail.gmail.com> On Tue, Apr 21, 2009 at 1:04 PM, Peter Cock wrote: >>> It looks like the SwissProt format has changed, and we >>> should be parsing the new extended DE lines more >>> carefully, and splitting these entries up and recording >>> them in the SeqRecord.annotations dictionary? >> >> That sounds reasonable. The dictionary will have to be >> nested though. Something like this ... >> Thinking this over, we should take that SwissProt file and load it into BioSQL using BioPerl, and see how they dealt with the DE lines, and try and do the same for Bio.SeqIO in order that loading it into BioSQL with Biopython gives more or less the same thing. 
Peter From eric.talevich at gmail.com Wed Apr 22 04:32:33 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 22 Apr 2009 00:32:33 -0400 Subject: [Biopython-dev] Possible re-import from CVS to git In-Reply-To: <320fb6e00904211118n23e68b2vc43de8397ff3b08c@mail.gmail.com> References: <320fb6e00904211118n23e68b2vc43de8397ff3b08c@mail.gmail.com> Message-ID: <3f6baf360904212132k7110deeft6829c1b4a7b18f24@mail.gmail.com> On Tue, Apr 21, 2009 at 2:18 PM, Peter wrote: > So, if we are going to re-do the git import (and we may have to fix > the tag history), it would be very nice if all the existing CVS users > could first: > (a) setup an account on github, and > (b) tell me the email address you are using for it. > > If we do move to github, you would need to do this anyway in order to > be given collaborator status to make commits direct to the main trunk. > > Eek. Now that the Summer of Code is under way, I guess this is a good time to bring up the question of how Nick and I should be following the Biopython trunk and publishing our own code. In spite of the warning that the CVS tracker in GitHub was tentative, I was getting comfortable with the setup we had. Should I (we) hold off on pushing anything substantial to GitHub until this tagging situation is resolved, or is there a better way to approach this? For example, does anyone know if it's straightforward to back up a branch's recent history with git-format-patch and apply it directly onto a new repository with different references? Thanks, Eric From lpritc at scri.ac.uk Wed Apr 22 07:56:46 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 22 Apr 2009 08:56:46 +0100 Subject: [Biopython-dev] Bio.Motif Suggestions In-Reply-To: <8b34ec180904210859r34a0a034qdfb54d57c3ca85e3@mail.gmail.com> Message-ID: Hi Bart, On 21/04/2009 16:59, "Bartek Wilczynski" wrote: > Hi, > > thanks for your suggestions. 
>
> To make the long story short:
> - I mostly agree with your points
> - I've updated the wiki page to include your requests
> http://biopython.org/wiki/MotifDev
> - I'll definitely spend some time working on particular requests and
> then post specifically.

Many thanks for the quick response - I've seen your wiki update, too.

Cheers,

L.

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
______________________________________________________ From bartek at rezolwenta.eu.org Wed Apr 22 08:53:21 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Wed, 22 Apr 2009 10:53:21 +0200 Subject: [Biopython-dev] history on github - where are the tags? In-Reply-To: <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com> <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com> <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com> <320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com> <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com> <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com> <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com> <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com> <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com> Message-ID: <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com> Hi, On Tue, Apr 21, 2009 at 6:58 PM, Peter wrote: > On Tue, Apr 21, 2009 at 5:29 PM, Peter wrote: >> >> There is another option, redo the import using git cvsimport. This >> has the downside that we lose all the network history currently in >> github, but it's only going to affect a couple of people and that was >> always a possibility. Yes, it is an option, but I would be quite reluctant to do it. I think this issue with tags is possible to get fixed without re-doing the import. I'm scared by the possibility that we re-import stuff, fix the tags, everybody switches, people complain how good it was back then with CVS, and one month down the road, we find that there is an issue with something else, that was not present in the previous import. I think this is becoming a bit chaotic now.
We still haven't removed the first github conversion: (biopython_old branch: is anyone using it anyway?), there is this semi-official one that has a (fixable in my opinion) issue with tags and now there is a new one made by Peter. In summary: I have no objections to using any particular tool for importing stuff to git. I don't like the idea of not even trying to fix the problem we have but instantly changing the tool we are using. I now consider re-importing stuff a major problem: everybody will need to port their changes, which is work. >> >> I've just done this twice, firstly over the network (just over an >> hour, probably a bad idea in terms of wasting the OBF bandwidth). >> Then I succeeded in doing it locally (under 15 minutes) on my Mac >> after logging into dev.open-bio.org and fetching a zipped up copy of >> the CVS files. The hard bit was working out how to get the CVSROOT >> directory setup: >> It's good to know it works, I don't think the time differences are significant. > > This won't be automatically updated, so please don't fork it! Exactly. Bartek -- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From biopython at maubp.freeserve.co.uk Wed Apr 22 09:08:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Apr 2009 10:08:13 +0100 Subject: [Biopython-dev] Possible re-import from CVS to git In-Reply-To: <3f6baf360904212132k7110deeft6829c1b4a7b18f24@mail.gmail.com> References: <320fb6e00904211118n23e68b2vc43de8397ff3b08c@mail.gmail.com> <3f6baf360904212132k7110deeft6829c1b4a7b18f24@mail.gmail.com> Message-ID: <320fb6e00904220208oaecd1a5p844ce8642acc51fa@mail.gmail.com> On Wed, Apr 22, 2009 at 5:32 AM, Eric Talevich wrote: > On Tue, Apr 21, 2009 at 2:18 PM, Peter wrote: > >> So, if we are going to re-do the git import (and we may have to fix >> the tag history), ... > > Eek.
Now that the Summer of Code is under way, I guess this is a good time > to bring up the question of how Nick and I should be following the Biopython > trunk and publishing our own code. > > In spite of the warning that the CVS tracker in GitHub was tentative, I was > getting comfortable with the setup we had. Should I (we) hold off on pushing > anything substantial to GitHub until this tagging situation is resolved, or > is there a better way to approach this? For example, does anyone know if > it's straightforward to back up a branch's recent history with > git-format-patch and apply it directly onto a new repository with different > references? Bartek is looking into fixing the existing CVS to git mirror on github, but that may not be possible. And I do think it is worth fixing the tag history even at the cost of some upheaval in the short term. In terms of you and Nick, for now carry on using github if you are comfortable with it. The new phylogenetics stuff will I assume be mostly new python modules, or modifications to a couple of existing ones (e.g. Bio.Nexus). Merging this later shouldn't be too bad - you should be able to generate a diff against CVS (or its current mirror in git) and we can apply that to CVS (or a new git repository). Peter From biopython at maubp.freeserve.co.uk Wed Apr 22 09:23:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Apr 2009 10:23:46 +0100 Subject: [Biopython-dev] history on github - where are the tags? 
In-Reply-To: <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com> <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com> <320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com> <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com> <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com> <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com> <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com> <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com> <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com> Message-ID: <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> On Wed, Apr 22, 2009 at 9:53 AM, Bartek Wilczynski wrote: > Hi, > > Peter wrote: >>> There is another option, redo the import using git cvsimport. This >>> has the downside that we lose all the network history currently in >>> github, but it's only going to affect a couple of people and that was >>> always a possibility. > > Yes, it is an option, but I would be quite reluctant to do it. I think this > issue with tags is possible to get fixed without re-doing the import. If you can fix the current github repository, great. > I'm scared by the possibility that we re-import stuff, fix the tags, everybody > switches, people complain how good it was back then with CVS, and one > month down the road, we find that there is an issue with something else, > that was not present in the previous import. This is why we are testing things: We have found something wrong with the current import, and it wasn't immediately obvious (partly because we were still getting to know git and github). > I think this is becoming a bit chaotic now.
We still haven't removed the > first github conversion: (biopython_old branch: is anyone using it anyway?), The old conversion's deletion is still in progress; it must have stalled: http://support.github.com/discussions/repos/485-reposiotry-stuck-in-rename > there is this semi-official one that has a (fixable in my opinion) issue with > tags ... If we can fix the tags, great. If we can also remap the authors to their git usernames, even better. > ... and now there is a new one made by Peter. I deleted that one - it was just a proof of principle. > In summary: > I have no objections to using any particular tool for importing stuff to git. > I don't like the idea of not even trying to fix the problem we have > but instantly changing the tool we are using. It was really to demonstrate to my own satisfaction that we could have the tags in the history properly. > I consider now re-importing stuff a major problem: everybody will need to port > their changes which is work. True - but this was always a possibility. From browsing the github network, this will really affect just two people: * Eric - quite a few changes, some of which we can probably look at merging into CVS now which would solve that. * Giovanni - quite a few changes (on a couple of files) on one branch, and a couple of other branches for proposed unit tests Also: * Dave Bridges - documentation changes to one file which we can merge into CVS and then he can delete that branch * Tiago - trivial changes to one file (stats in PopGen) * Peter (me) - I have a few test branches, nothing I care about. Brad, Bartek and Leighton have no changes made.
Peter From cy at cymon.org Wed Apr 22 09:48:08 2009 From: cy at cymon.org (Cymon Cox) Date: Wed, 22 Apr 2009 10:48:08 +0100 Subject: [Biopython-dev] Bio.Application interface Message-ID: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> From reading the previous discussion on the list, I gather there is a preference for removing helper functions to the Bio.Application command line interfaces, such that the user interface would be something like: from Bio import Application from Bio.Align.Applications import MafftCommandline cmd = MafftCommandline() cmd.set_parameter("input", "sample.fa") [etc...] i, o, e = Application.generic_run(cmd) i.e. the user explicitly sets the command line parameters. I've written Application.AbstractCommandline for both MUSCLE and MAFFT. However, each of these programmes uses a variation on the parameter styles not easily covered by the current _AbstractParameter classes _Option and _Argument. The _Option class deals with parameters of the type "--append=yes" and "-a yes", and the _Argument returns just the value to the command line, i.e. cmd.set_parameter("input", "sample.fa") puts just "sample.fa" on the command line. A muscle command might be: "muscle -in Fasta/f002 -out Fasta/temp_align_out2.fa -objscore sp -noanchors" i.e. with a "-noanchors" command; currently the parameter would need to be an _Argument and set using: cmd.set_parameter("noanchors", "-noanchors") A MAFFT command might be: "mafft --maxiterate 200 --nofft myInputData.fa" i.e. with a "--nofft" parameter which would need to be an _Argument and set using: cmd.set_parameter("nofft", "--nofft") and a "--maxiterate 200" parameter which _Option doesn't cover, because with _Option "--" params always have an "=" before the value.
So, it looks like a _OptionNoEquals parameter class is required to cover the "--param value", and I would suggest a _ArgumentName class that returns the parameter name to the command line such that: cmd.set_parameter("--nofft") returns "--nofft" to the command line, and cmd.set_parameter("--nofft", value) raises an error via the checker_function. As an aside, MAFFT also has a: "mafft --seed file1 --seed file2 inputData.fa" i.e. multiple --seed parameters, which is not covered by the current interface. Cheers, C. -- From biopython at maubp.freeserve.co.uk Wed Apr 22 10:26:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Apr 2009 11:26:11 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> Message-ID: <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> On Wed, Apr 22, 2009 at 10:48 AM, Cymon Cox wrote: > From reading the previous discussion on the list, I gather there is a > preference for removing helper functions to the Bio.Application command line > interfaces, such that the user interface would be something like: > > from Bio import Application > from Bio.Align.Applications import MafftCommandline > cmd = MafftCommandline() > cmd.set_parameter("input", "sample.fa") > [etc...] > i, o, e = Application.generic_run(cmd) > > ie the user explicitly sets the cl parameters. Yes, that would fit my preference for giving the user direct access to the command line as a string, to invoke as they choose. We might want to discuss extending the AbstractCommandline __init__ method to take **kwargs, allowing the parameters to be set like this: from Bio import Application from Bio.Align.Applications import MafftCommandline cmd = MafftCommandline(input="sample.fa", ...)
return_code, std_handle, err_handle = Application.generic_run(cmd) I'm not sure how well this would work in practice as the range of valid argument names in python may not overlap with the valid parameter names. > I've written Application.AbstractCommandline for both MUSCLE and MAFFT. > However, each of these programmes uses a variation on the parameter styles > not easily covered by the current _AbstractParameter classes _Option and > _Argument. The _Option class deals with parameters of the type > "--append=yes" and "-a yes", ... > A muscle command might be: > "muscle -in Fasta/f002 -out Fasta/temp_align_out2.fa -objscore sp > -noanchors" > ie with a "-noanchors" command Those kinds of options which don't take a value are really common on Unix, I suspect we already have things like this in the other wrappers. I'd guess they just use the _Option class and omit the value. > So, it looks like a _OptionNoEquals parameter class is required to cover the > "--param value", and I would suggest a _ArgumentName class that returns the > parameter name to the command line such that: > > cmd.set_parameter("--nofft") returns "--nofft" to the cl, and > cmd.set_parameter("--nofft", value) raises an error via the > checker_function You are right, a subclass of _Option which checks there is no value argument could be sensible. Maybe _OptionNoValue rather than _OptionNoEquals?
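The option styles discussed in this thread can be made concrete with a minimal sketch. It assumes hypothetical names (OptionSketch, build_command) chosen for illustration only; it is not the Bio.Application code and does not use the _Option or _Argument classes. It shows a "--param value" option, a switch that takes no value, and a repeated option such as MAFFT's --seed.

```python
class OptionSketch:
    """Hypothetical option formatter for illustration; NOT the Bio.Application API."""

    def __init__(self, name, separator=" ", takes_value=True):
        self.name = name                # e.g. "--maxiterate" or "-noanchors"
        self.separator = separator      # " " gives "--param value", "=" gives "--param=value"
        self.takes_value = takes_value  # False for switches such as "--nofft"

    def render(self, value=None):
        """Return the text this option contributes to the command line."""
        if not self.takes_value:
            if value is not None:
                raise ValueError("%s does not take a value" % self.name)
            return self.name
        if value is None:
            raise ValueError("%s needs a value" % self.name)
        return "%s%s%s" % (self.name, self.separator, value)


def build_command(program, rendered_options, arguments=()):
    """Join the program name, rendered options and trailing arguments."""
    return " ".join([program] + list(rendered_options) + list(arguments))


maxiterate = OptionSketch("--maxiterate")            # "--param value" style
nofft = OptionSketch("--nofft", takes_value=False)   # switch with no value
seed = OptionSketch("--seed")                        # may be given more than once

mafft_cmd = build_command(
    "mafft",
    [maxiterate.render(200), nofft.render(),
     seed.render("file1"), seed.render("file2")],
    ["inputData.fa"],
)
print(mafft_cmd)
```

Keeping the separator as a per-option setting is one possible design; a dedicated subclass per style, as proposed above, would serve equally well.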
Peter From peter at maubp.freeserve.co.uk Wed Apr 22 11:00:19 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Apr 2009 12:00:19 +0100 Subject: [Biopython-dev] Nice small test case for fuzzy locations Message-ID: <320fb6e00904220400m5c18ad42gbe301b739d54ce99@mail.gmail.com> Hi all, This is a nice small GenBank file with fuzzy locations, joins, and fuzzy joins: ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Cryptosporidium_parvum/NC_006980.gbk I think this will make an excellent test case, see new unittest based Tests/test_SeqIO_feature.py which we can extend to include GFF or PTT files when they are in Bio.SeqIO too. The good news is our non-fuzzy locations appear to be doing just what GenBank does - you did a good job there Brad :) If anyone comes across a better example file let us know (i.e. also very small, but with between positions, one of position etc as well). Peter From biopython at maubp.freeserve.co.uk Wed Apr 22 13:30:00 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Apr 2009 14:30:00 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> Message-ID: <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> On Wed, Apr 22, 2009 at 2:23 PM, Cymon Cox wrote: > 2009/4/22 Peter >> >> On Wed, Apr 22, 2009 at 10:48 AM, Cymon Cox wrote: >> >> > Ive written Application.AbstractCommandline for both MUSCLE and MAFFT. >> > However, each of these programmes uses a variation on the parameter >> > styles >> > not easily covered by the current _AbstractParameter classes _Option and >> > _Argument. The _Option class deals with parameters of the type "- >> > -append=yes" and "-a yes", ... 
>> > A muscle command might be: >> > "muscle -in Fasta/f002 -out Fasta/temp_align_out2.fa -objscore sp >> > -noanchors" >> > ie with a "-noanchors" command >> >> Those kind of options which don't take a value are really common on >> Unix, I suspect we already have things like this in the other wrappers. >> I'd guess they just use the _Option class and omit the value. > > Yes, I see now... they need to be _Options with a "lambda x: 0" value > checker function - for some reason was trying to force them into _Argument > > This is the current _Option class: > ... > So _Option covers: "--param=value", "-param value", "-param", "--param" > > What it doesn't cover is "--param value" and "-param=value" > ... This might be a silly question, but do you actually need these exact option layouts for MUSCLE and MAFFT? Many Unix tools use something like libopt and will actually take slight variations, and may also offer short and long names for the same option. Perhaps the existing option code in Bio.Application will suffice? Peter From mjldehoon at yahoo.com Wed Apr 22 14:31:48 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 22 Apr 2009 07:31:48 -0700 (PDT) Subject: [Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt In-Reply-To: <320fb6e00904211306u50955608ndccef5d0cb6ba09b@mail.gmail.com> Message-ID: <218724.9949.qm@web62406.mail.re1.yahoo.com> --- On Tue, 4/21/09, Peter Cock wrote: > Thinking this over, we should take that SwissProt file and > load it into BioSQL using BioPerl, and see how they dealt > with the DE lines, and try and do the same for Bio.SeqIO > in order that loading it into BioSQL with Biopython gives > more or less the same thing. Good point. Does anybody know how BioPerl stores SwissProt files in SQL databases? I know neither Perl nor SQL ...
--Michiel From p.j.a.cock at googlemail.com Wed Apr 22 14:44:23 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 Apr 2009 15:44:23 +0100 Subject: [Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO, Bio.SwissProt In-Reply-To: <218724.9949.qm@web62406.mail.re1.yahoo.com> References: <320fb6e00904211306u50955608ndccef5d0cb6ba09b@mail.gmail.com> <218724.9949.qm@web62406.mail.re1.yahoo.com> Message-ID: <320fb6e00904220744s1c88c725nb1fa607ce10df723@mail.gmail.com> On Wed, Apr 22, 2009 at 3:31 PM, Michiel de Hoon wrote: > > --- On Tue, 4/21/09, Peter Cock wrote: > >> Thinking this over, we should take that SwissProt file and >> load it into BioSQL using BioPerl, and see how they dealt >> with the DE lines, and try and do the same for Bio.SeqIO >> in order that loading it into BioSQL with Biopython gives >> more or less the same thing. > > Good point. Does anybody know how BioPerl stores SwissProt files in SQL databases? I know neither Perl nor SQL ... > Not off hand, but I know enough about BioPerl to be able to load the file into a BioSQL database. I'll post back later (but probably not today). Peter From bugzilla-daemon at portal.open-bio.org Wed Apr 22 16:14:47 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Apr 2009 12:14:47 -0400 Subject: [Biopython-dev] [Bug 2819] New: Bio.SeqIO support for NCBI protein tables (*.ptt files) Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2819 Summary: Bio.SeqIO support for NCBI protein tables (*.ptt files) Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk On their FTP site the NCBI provide a range of files for each genome/plasmid/chromosome, e.g. 
ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Cryptosporidium_parvum/ The *.ptt files are simple tab separated tables listing all the proteins. They correspond to the CDS features in the GenBank file. This enhancement bug is about adding "ptt" as an input file format in Bio.SeqIO (and potentially as an output format too), where a single ptt file gives a single SeqRecord object containing a SeqFeature object for each protein. The header line gives the sequence length, so an UnknownSeq can be used for the SeqRecord's seq property. One example application of this would be to draw a GenomeDiagram showing the protein locations. This can be done using the SeqFeature objects from parsing a GenBank file, but using the ptt file will be much faster. See earlier suggestions on the mailing list (part of the GFF thread): http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005725.html http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005745.html Patch to follow... -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
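As a rough illustration of the idea in the bug report above, here is a minimal sketch of reading a protein table with the standard library only. It is not the attached ProteinTableIO.py patch. It assumes the first line ends with a "1..N" range giving the sequence length, the third line holds tab separated column names including "Location" and "Strand", and each later line is one protein; the sample data and the name parse_ptt are made up for illustration.

```python
import re


def parse_ptt(lines):
    """Return (sequence_length, proteins) from the lines of a ptt-style table.

    Assumed layout (hypothetical, see lead-in): title line ending in "1..N",
    a count line, a tab separated column header, then one row per protein.
    """
    lines = [line.rstrip("\n") for line in lines]
    match = re.search(r"(\d+)\.\.(\d+)\s*$", lines[0])
    if match is None:
        raise ValueError("no '1..N' range found in the title line")
    seq_length = int(match.group(2))
    columns = lines[2].split("\t")
    proteins = []
    for line in lines[3:]:
        if not line.strip():
            continue  # skip blank lines
        row = dict(zip(columns, line.split("\t")))
        start, end = (int(part) for part in row["Location"].split(".."))
        row["start"] = start - 1  # convert to Python style zero-based start
        row["end"] = end
        row["strand"] = 1 if row["Strand"] == "+" else -1
        proteins.append(row)
    return seq_length, proteins


# Made-up example data, not taken from any NCBI file.
EXAMPLE = [
    "Made-up plasmid, complete sequence - 1..9000",
    "2 proteins",
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct",
    "101..400\t+\t99\t111\tabcA\tX_0001\t-\t-\thypothetical protein",
    "901..1200\t-\t99\t112\t-\tX_0002\t-\t-\thypothetical protein",
]

seq_length, proteins = parse_ptt(EXAMPLE)
print(seq_length, len(proteins))
```

Each dictionary carries enough information (start, end, strand) to build one feature per protein, with the sequence itself left unknown but of the stated length.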
From bugzilla-daemon at portal.open-bio.org Wed Apr 22 16:15:26 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Apr 2009 12:15:26 -0400 Subject: [Biopython-dev] [Bug 2819] Bio.SeqIO support for NCBI protein tables (*.ptt files) In-Reply-To: Message-ID: <200904221615.n3MGFQZi027802@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2819 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-22 12:15 EST ------- Created an attachment (id=1282) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1282&action=view) New file Bio/SeqIO/ProteinTableIO.py -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 22 16:16:37 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Apr 2009 12:16:37 -0400 Subject: [Biopython-dev] [Bug 2819] Bio.SeqIO support for NCBI protein tables (*.ptt files) In-Reply-To: Message-ID: <200904221616.n3MGGbXh027904@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2819 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-22 12:16 EST ------- Created an attachment (id=1283) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1283&action=view) Patch to Bio/SeqIO/__init__.py to use "ptt" files for input -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Apr 22 16:19:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 22 Apr 2009 12:19:15 -0400 Subject: [Biopython-dev] [Bug 2819] Bio.SeqIO support for NCBI protein tables (*.ptt files) In-Reply-To: Message-ID: <200904221619.n3MGJF3V028128@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2819 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-22 12:19 EST ------- Created an attachment (id=1284) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1284&action=view) Patch to Tests/test_SeqIO_features.py to check "genbank" vs "ptt" parsing Requires additional input files from the NCBI to go in Tests/GenBank, ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Cryptosporidium_parvum/NC_006980.ptt ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Yersinia_pestis_biovar_Microtus_91001/NC_005816.ptt -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Wed Apr 22 16:24:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Apr 2009 17:24:36 +0100 Subject: [Biopython-dev] Bio.GFF and Brad's code In-Reply-To: <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> References: <20090408124908.GN43636@sobchak.mgh.harvard.edu> <830379.9837.qm@web62402.mail.re1.yahoo.com> <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com> Message-ID: <320fb6e00904220924x38466ac1sc80fe344eec1b200@mail.gmail.com> On Mon, Apr 13, 2009 at 1:16 PM, Peter wrote: > I don't think the GFF parser should only return SeqRecord object, but > I do see a use for this (via Bio.SeqIO). ?GFF files could be > represented as a list of SeqFeature objects, and using a SeqRecord to > hold this seems very natural to me. 
It also means we could use > Bio.SeqIO to load a GFF file into SeqRecord objects for storage in a > BioSQL database. > > If you look at the NCBI FTP site, they often provide genome sequences > in a range of file formats including GenBank and GFF. > > e.g. > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/ > > The GenBank files contain the features plus the sequence, > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gbk > > Their GFF3 file only contains the features: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff > > Some GFF files will include the sequence too, in this case we can > fetch it in FASTA format: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna > > In principle, you could parse this FASTA file and the GFF3 file and > put together a GenBank file - or vice versa. > > As an aside, I would also consider adding protein table support on the > same lines, look at this file: > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.ptt > The header information gives us the genome size, so Bio.SeqIO could > return a SeqRecord with lots of SeqFeature objects and for the > SeqRecord's seq property use a Bio.Seq.UnknownSeq of length 4639675bp. > This is something I might look at implementing myself after Biopython > 1.50 is out. We should be able to read in a GenBank file and output a > PTT file, and verify it matches the NCBI provided version of the PTT > file. There is a working NCBI protein table ("ptt") format parser for Bio.SeqIO on Bug 2819 including unit tests. http://bugzilla.open-bio.org/show_bug.cgi?id=2819 Hopefully this will be useful in integrating the GFF/GFF3 parser into Bio.SeqIO, as well as being worthwhile in its own right.
This "ptt" parser should work fine with BioSQL and GenomeDiagram, offering a light weight alternative to parsing the GenBank or GFF3 file when all you care about is the locations of the proteins (CDS features). Peter From cy at cymon.org Wed Apr 22 17:00:38 2009 From: cy at cymon.org (Cymon Cox) Date: Wed, 22 Apr 2009 18:00:38 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> Message-ID: <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> 2009/4/22 Peter > On Wed, Apr 22, 2009 at 2:23 PM, Cymon Cox wrote: > > 2009/4/22 Peter > >> > >> On Wed, Apr 22, 2009 at 10:48 AM, Cymon Cox wrote: > >> > >> > Ive written Application.AbstractCommandline for both MUSCLE and MAFFT. > >> > However, each of these programmes uses a variation on the parameter > >> > styles > >> > not easily covered by the current _AbstractParameter classes _Option > and > >> > _Argument. The _Option class deals with parameters of the type "- > >> > -append=yes" and "-a yes", ... > >> > A muscle command might be: > >> > "muscle -in Fasta/f002 -out Fasta/temp_align_out2.fa -objscore sp > >> > -noanchors" > >> > ie with a "-noanchors" command > >> > >> Those kind of options which don't take a value are really common on > >> Unix, I suspect we already have things like this in the other wrappers. > >> I'd guess they just use the _Option class and omit the value. > > > > Yes, I see now... they need to be _Options with a "lambda x: 0" value > > checker function - for some reason was trying to force them into > _Argument > > > > This is the current _Option class: > > ... 
> > So _Option covers: "--param=value", "-param value", "-param", "--param" > > > > What it doesn't cover is "--param value" and "-param=value" > > ... > > This might be a silly question, but do you actually need these exact option > layouts for MUSCLE and MAFFT? Many Unix tools use something like > libopt and will actually take slight variations, and may also offer short > and long names for the same option. Perhaps the existing option code > in Bio.Application will suffice? MAFFT uses "--param value" style options, and won't accept "--param=value" or "-param value" as alternatives. Neither uses "-param=value", but it may turn up as more applications are wrapped. C. > > > Peter > -- ____________________________________________________________________ Cymon J. Cox Centro de Ciencias do Mar Faculdade de Ciencias do Mar e Ambiente (FCMA) Universidade do Algarve Campus de Gambelas 8005-139 Faro Portugal Phone: +0351 289800909 ext 7909 Fax: +0351 289800051 Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com HomePage : http://biology.duke.edu/bryology/cymon.html -8.63/-6.77 From biopython at maubp.freeserve.co.uk Wed Apr 22 21:25:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Apr 2009 22:25:35 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> Message-ID: <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> On Wed, Apr 22, 2009 at 6:00 PM, Cymon Cox wrote: >> >> This might be a silly question, but do you actually need these exact option >> layouts for MUSCLE and MAFFT?
Many Unix tools use something like >> libopt and will actually take slight variations, and may also offer short >> and long names for the same option. Perhaps the existing option code >> in Bio.Application will suffice? > > MAFFT uses "--param value" style options, and won't accept "--param=value" > or "-param value" as alternatives. OK. Then yes, we should support that. Brad, as Bio.Application is your module, would you like to comment? > > Neither uses "-param=value", but it may turn up as more applications are wrapped. > I don't think I have ever seen a command line application that used that. Peter From chapmanb at 50mail.com Wed Apr 22 22:44:01 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 22 Apr 2009 18:44:01 -0400 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> Message-ID: <20090422224401.GC34546@sobchak.mgh.harvard.edu> Peter and Cymon; > >> This might be a silly question, but do you actually need these exact option > >> layouts for MUSCLE and MAFFT? Many Unix tools use something like > >> libopt and will actually take slight variations, and may also offer short > >> and long names for the same option. Perhaps the existing option code > >> in Bio.Application will suffice? > > > > MAFFT uses "--param value" style options, and won't accept "--param=value" > > or "-param value" as alternatives. > > OK. Then yes, we should support that. Brad, as Bio.Application is your > module, would you like to comment? My comment is: I think it is awesome MAFFT made up their own way of doing the command line.
Seriously, y'all are doing the right thing. Add a new class to Bio.Application: _OptionAlt or whatever you'd like to call MAFFT's inventive new way to specify command line arguments. Adapt the __str__ from _Option to do it the "--param val" way in this class. Then use this for your MAFFT commandline. I believe I just summarized your discussion, so you can replace this whole message with +1. Brad From winda002 at student.otago.ac.nz Thu Apr 23 02:14:31 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Thu, 23 Apr 2009 14:14:31 +1200 Subject: [Biopython-dev] main page on wiki Message-ID: <49EFCF07.2050502@student.otago.ac.nz> Hi all, As you probably know the main page of the wiki (http://biopython.org/wiki/Main_Page) is the first place someone washes up when they google 'biopython'. As part of this "news coordinator" idea I have made an alternative version of the main page (http://biopython.org/wiki/User:Davidw/homepage) which acts a bit more as a "portal" for the wiki/project. This is born from my own experience with the wiki as a newcomer; it took me a long time to cotton on to the fact there was a navigation box on each page so I didn't realise what the website had to offer (this may say more about me than the design of the front page). Which version would you like to see as the main page? Obviously this isn't an either-or thing, my 'mock-up' version can be edited by anyone with an account on the wiki (the main page is protected for obvious reasons) so any ideas that you have can be incorporated to that one (older versions of the page are all saved so you can edit as bravely as you like). 
Thanks, David From sbassi at clubdelarazon.org Thu Apr 23 01:53:09 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Wed, 22 Apr 2009 22:53:09 -0300 Subject: [Biopython-dev] main page on wiki In-Reply-To: <49EFCF07.2050502@student.otago.ac.nz> References: <49EFCF07.2050502@student.otago.ac.nz> Message-ID: <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com> On Wed, Apr 22, 2009 at 11:14 PM, David Winter wrote: > Which version would you like to see as the main page? Obviously this isn't I liked the new version. I would add (if I knew how to do it) some icons near each of: Get Started Get help Contribute From argriffi at ncsu.edu Thu Apr 23 01:42:21 2009 From: argriffi at ncsu.edu (alex) Date: Wed, 22 Apr 2009 21:42:21 -0400 Subject: [Biopython-dev] main page on wiki In-Reply-To: <49EFCF07.2050502@student.otago.ac.nz> References: <49EFCF07.2050502@student.otago.ac.nz> Message-ID: <49EFC77D.5070307@ncsu.edu> David Winter wrote: > Hi all, > > As you probably know the main page of the wiki > (http://biopython.org/wiki/Main_Page) is the first place someone washes > up when they google 'biopython'. As part of this "news coordinator" idea > I have made an alternative version of the main page > (http://biopython.org/wiki/User:Davidw/homepage) which acts a bit more > as a "portal" for the wiki/project. This is born from my own experience > with the wiki as a newcomer; it took me a long time to cotton on to the > fact there was a navigation box on each page so I didn't realise what > the website had to offer (this may say more about me than the design of > the front page). > > Which version would you like to see as the main page? Obviously this > isn't an either-or thing, my 'mock-up' version can be edited by anyone > with an account on the wiki (the main page is protected for obvious > reasons) so any ideas that you have can be incorporated to that one > (older versions of the page are all saved so you can edit as bravely as > you like). 
> > Thanks, > David I like your version better than the current main page. From idoerg at gmail.com Thu Apr 23 03:49:39 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Wed, 22 Apr 2009 20:49:39 -0700 Subject: [Biopython-dev] main page on wiki In-Reply-To: <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com> References: <49EFCF07.2050502@student.otago.ac.nz> <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com> Message-ID: <49EFE553.6070405@gmail.com> I second Sebastian on the icons, and third Sebastian and Alex on preferring David's take on a main page. Sebastian Bassi wrote: > On Wed, Apr 22, 2009 at 11:14 PM, David Winter > wrote: >> Which version would you like to see as the main page? Obviously this isn't > > I liked the new version. I would add (if I knew how to do it) some > icons near each of: > Get Started Get help Contribute > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev -- Iddo Friedberg Ph.D. Atkinson Hall MC 0446 University of California San Diego 9500 Gilman Dr. La Jolla, CA 92093-0446 USA http://iddo-friedberg.net From biopython at maubp.freeserve.co.uk Thu Apr 23 09:16:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Apr 2009 10:16:44 +0100 Subject: [Biopython-dev] main page on wiki In-Reply-To: <49EFE553.6070405@gmail.com> References: <49EFCF07.2050502@student.otago.ac.nz> <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com> <49EFE553.6070405@gmail.com> Message-ID: <320fb6e00904230216l4b2cd91k513baa3f7a08172c@mail.gmail.com> On Thu, Apr 23, 2009 at 4:49 AM, Iddo Friedberg wrote: > I second Sebastian on the icons, and third Sebastian and Alex on preferring > David's take on a main page. Are you all looking at the *current* home page which already has a few of David's suggestions (in particular the news feed on the right), or the old version from memory? 
Also, what size screens do you all have? It should ideally look OK on small screens or windows (e.g. 1024 by 768 is what my laptop uses, which isn't that old). From playing with my window size, it should be OK - the proposed layout seems quite flexible :) If there are no counter comments, I'll put David's changes up later today or tomorrow. Peter From biopython at maubp.freeserve.co.uk Thu Apr 23 09:29:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Apr 2009 10:29:04 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <20090422224401.GC34546@sobchak.mgh.harvard.edu> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> <20090422224401.GC34546@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904230229o7efcfbe0ld2da94f10bd1b3b8@mail.gmail.com> On Wed, Apr 22, 2009 at 11:44 PM, Brad Chapman wrote: > Peter and Cymon; > > My comment is: I think it is awesome MAFFT made up their own way > of doing the command line. Was that sarcasm Brad? > Seriously, y'all are doing the right thing. Add a new class to > Bio.Application: _OptionAlt or whatever you'd like to call MAFFT's > inventive new way to specify command line arguments. Adapt the > __str__ from _Option to do it the "--param val" way in this class. > Then use this for your MAFFT commandline. Maybe _DoubleDashOption for the class name? I haven't looked at this closely enough to have a firm opinion - but as this will be a private class anyway, the name doesn't matter so much. > I believe I just summarized your discussion, so you can replace this > whole message with +1. :) What about this bit I wrote earlier: >> ... 
We might want to discuss extending the AbstractCommandline >> __init__ method to take **kwargs, allowing the parameters to be >> set like this: >> >> from Bio import Application >> from Bio.Align.Applications import MafftCommandline >> cmd = MafftCommandline(input="sample.fa", ...) >> return_code, std_handle, err_handle = Application.generic_run(cmd) >> >> I'm not sure how well this would work in practice as the range of >> valid argument names in python may not overlap with the valid >> parameter names. We'll have to see how well the above idea works in practice - it may not be general enough to be useful. Also, perhaps we can automatically generate properties for each argument allowing this: cmd.input = "sample.fa" rather than: cmd.set_parameter("input", "sample.fa") For the "switch" type arguments which take no value, if these are implemented with a separate option class (maybe _Switch or _OptionNoValue) then rather than: cmd.set_parameter("noanchors") we might want to do: cmd.noanchors = True and allow the switch to be removed with: cmd.noanchors = False i.e. For those arguments which take no argument (is "switch" the right term here?), evaluate the property set value as a boolean to add/remove -noanchors from the command line string. I think using properties in this way could make the command line object more intuitive, but again python puts limits on property names which might mean for some arguments you'd have to use the set_parameter version. 
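A minimal sketch of the attribute-style interface described above, assuming hypothetical Option, Switch and SketchCommandline classes rather than the real Bio.Application ones. It routes attribute access through __setattr__/__getattr__ instead of generating property objects, which is a simplification of the idea; the "sometool" program and its "input"/"quiet" parameters are made up for illustration.

```python
class Option:
    """Hypothetical parameter taking a value, rendered as "--name value"."""

    def __init__(self, name):
        self.name = name
        self.value = None

    def __str__(self):
        if self.value is None:
            return ""
        return "--%s %s " % (self.name, self.value)


class Switch:
    """Hypothetical parameter taking no value, rendered as "--name"."""

    def __init__(self, name):
        self.name = name
        self.is_set = False

    def __str__(self):
        return "--%s " % self.name if self.is_set else ""


class SketchCommandline:
    """Illustrative command line builder (not the Bio.Application API)."""

    def __init__(self, program, parameters, **kwargs):
        # Bypass our own __setattr__ while setting up internal state.
        object.__setattr__(self, "program", program)
        object.__setattr__(self, "parameters",
                           dict((p.name, p) for p in parameters))
        for name, value in kwargs.items():
            self.set_parameter(name, value)

    def set_parameter(self, name, value=None):
        """Explicit setter; also works for names that are not identifiers."""
        if name not in self.parameters:
            raise ValueError("Option name %s was not found." % name)
        param = self.parameters[name]
        if isinstance(param, Switch):
            # Switches are evaluated as booleans; no value means "turn on".
            param.is_set = True if value is None else bool(value)
        else:
            param.value = value

    def __setattr__(self, name, value):
        # cmd.input = "sample.fa" is routed to set_parameter.
        self.set_parameter(name, value)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        parameters = self.__dict__.get("parameters", {})
        if name not in parameters:
            raise AttributeError(name)
        param = parameters[name]
        return param.is_set if isinstance(param, Switch) else param.value

    def __str__(self):
        args = "".join(str(p) for p in self.parameters.values())
        return (self.program + " " + args).strip()


cmd = SketchCommandline("sometool", [Option("input"), Switch("quiet")],
                        input="sample.fa")
cmd.quiet = True   # add the switch
print(cmd)
cmd.quiet = False  # remove it again
print(cmd)
```

A parameter called "in" could not be written as cmd.in = ... because "in" is a Python keyword, so an explicit call such as cmd.set_parameter("in", "sample.fa") would still be needed for names like that.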
Peter From bugzilla-daemon at portal.open-bio.org Thu Apr 23 09:39:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 05:39:11 -0400 Subject: [Biopython-dev] [Bug 2819] Bio.SeqIO support for NCBI protein tables (*.ptt files) In-Reply-To: Message-ID: <200904230939.n3N9dBZ5000718@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2819 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-23 05:39 EST ------- Just to note that Bio/SeqIO/ProteinTableIO.py needs a minor improvement to cope with one special case - features which wrap the origin, e.g. NEQ001 in Nanoarchaeum equitans. ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gbk ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.ptt This is the first CDS in the GenBank file, location given as: complement(join(490883..490885,1..879)) It is the last entry in the Protein Table file, 490883..879 - ... All my code needs to do is spot when start > end, and then add the two appropriate sub-features (using the known genome length, 490885) and set the location operator to join (to match what the GenBank parser does). I'll do this at some point assuming there is interest in adding this parser to Bio.SeqIO.
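The origin-wrapping check described in Bug 2819 comment #4 can be sketched as follows. This is only an illustration under stated assumptions: split_ptt_location is a hypothetical helper, not part of ProteinTableIO.py, and it returns plain (start, end) tuples in Python counting (zero-based, end-exclusive) rather than building SeqFeature sub-features with a join operator.

```python
def split_ptt_location(start, end, genome_length):
    """Split a protein table location into one or two sub-locations.

    start and end are one-based, inclusive coordinates as written in
    the .ptt file.  The result is a list of (start, end) tuples in
    Python counting (zero-based start, exclusive end).
    """
    if not (1 <= start <= genome_length and 1 <= end <= genome_length):
        raise ValueError("Location %i..%i is outside a genome of length %i"
                         % (start, end, genome_length))
    if start <= end:
        # Ordinary feature, a single span.
        return [(start - 1, end)]
    # start > end means the feature wraps the origin of a circular genome:
    # one piece runs to the end of the sequence, the other starts at base 1.
    return [(start - 1, genome_length), (0, end)]


# The NEQ001 example from the comment above: 490883..879, genome length 490885
for sub_start, sub_end in split_ptt_location(490883, 879, 490885):
    # Print in one-based GenBank style for comparison with the join(...) string
    print("%i..%i" % (sub_start + 1, sub_end))
```

For the NEQ001 entry the two pieces correspond to the 490883..490885 and 1..879 parts of the GenBank join; the strand (here complement) would be carried separately.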
From chapmanb at 50mail.com Thu Apr 23 12:36:35 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 23 Apr 2009 08:36:35 -0400 Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF) In-Reply-To: <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com> References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com> Message-ID: <20090423123635.GD34546@sobchak.mgh.harvard.edu> Hi all; > > Unless you are thinking of having an object representation as being too > > heavy, the non-light part of SeqFeature is all the FeatureLocation > > fuzziness. > > I've just had a quick go at what should be a 100% backwards compatible > modification to the FeatureLocation class to store ExactPosition start > or end positions as integers. The idea should be more memory > efficient, using the complex position objects only when required. I like the idea here but I would go a step further and get rid of FeatureLocation, collapsing the start and end location onto the SeqFeature itself. FeatureLocation is basically just a holder for a start and end coordinates. In this version, you would store the positions plus extensions and fuzzy type on the Feature, and then instantiate fuzzy objects on demand. I took a look at the resource usage of these objects versus a lightweight implementation. For a GFF file with 70k features, the maximum memory usage is 128M versus 111M for the lightweight version. So the improvement is rather modest, ~15%. > I forgot to mention the second major use case I'm concerned about, > which is recovering the GenBank/EMBL style location string. I have > looked at this in the past, by adding methods to the FeatureLocation > and all the Position objects, but it is complicated by the fact the > Position objects don't know if they are at the start or end (and for > the start locations we need to add one to convert from Python > counting). 
This is the main block on having Bio.SeqIO support writing > GenBank (or EMBL) files with their features included. I admittedly haven't looked at this in a while, but this was designed to be round tripped. The GenBank Record class can be written out back in GenBank format, and test_GenBank explicitly checks that the start and end records are the same. Brad From chapmanb at 50mail.com Thu Apr 23 12:53:56 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 23 Apr 2009 08:53:56 -0400 Subject: [Biopython-dev] Rolling new releases In-Reply-To: <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com> References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com> <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com> <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com> <20090421122045.GD30529@sobchak.mgh.harvard.edu> <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com> Message-ID: <20090423125356.GE34546@sobchak.mgh.harvard.edu> Hi all; > > It would also be worth thinking about what the worst parts of > > building the releases are and seeing if we can automate or eliminate > > them. A few things that I can think of: [Brainstorming a few suggestions] I feel like I derailed from the main point by making suggestions. Separate from a debate about betas and version support and documentation -- how can we make releases easier to roll? Peter, this started when you mentioned that rolling the release felt kind of painful and it would be great if others would pitch in. The idea of soliciting volunteers as release coordinators is great. In addition to that, we should think about streamlining the release process -- what are the parts we can get rid of and still have high quality releases? Peter, since you are doing them right now, what are your thoughts? 
Brad From lpritc at scri.ac.uk Thu Apr 23 13:43:43 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 23 Apr 2009 14:43:43 +0100 Subject: [Biopython-dev] main page on wiki In-Reply-To: <49EFC77D.5070307@ncsu.edu> Message-ID: On 23/04/2009 02:42, "alex" wrote: > David Winter wrote: >> Hi all, [...] >> Which version would you like to see as the main page? > I like your version better than the current main page. +1 I like the layout. Sebastian's idea for icons is also good. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
______________________________________________________ From p.j.a.cock at googlemail.com Thu Apr 23 13:58:58 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 23 Apr 2009 14:58:58 +0100 Subject: [Biopython-dev] Rolling new releases In-Reply-To: <20090423125356.GE34546@sobchak.mgh.harvard.edu> References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com> <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com> <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com> <20090421122045.GD30529@sobchak.mgh.harvard.edu> <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com> <20090423125356.GE34546@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904230658y310609c8l89bf27c33bd56d56@mail.gmail.com> On Thu, Apr 23, 2009 at 1:53 PM, Brad Chapman wrote: > Hi all; > >> > It would also be worth thinking about what the worst parts of >> > building the releases are and seeing if we can automate or eliminate >> > them. A few things that I can think of: > > [Brainstorming a few suggestions] > > I feel like I derailed from the main point by making suggestions. > Separate from a debate about betas and version support and > documentation -- how can we make releases easier to roll? > > Peter, this started when you mentioned that rolling the release > felt kind of painful and it would be great if others would pitch in. > The idea of soliciting volunteers as release coordinators is great. I didn't mean painful, so much as time consuming - but this was mostly coordinating final polish/bug fixes and documentation. This kind of thing requires some debate and judgement calls, and will be different for every release. I spent quite a lot of time on documentation for things which I really wanted to get into the Tutorial that shipped with the release (some of which should have happened earlier, so this was partly my own fault). 
In terms of getting the documentation updated for each release, this would be less effort if we as a group were more diligent about putting things in the tutorial and/or docstrings as we go along. It's important that nice new features are demonstrated, otherwise no-one will know they are there without reading the code itself or from following the mailing list discussions carefully. > In addition to that, we should think about streamlining the release > process -- what are the parts we can get rid of and still have high > quality releases? Peter, since you are doing them right now, what > are your thoughts? The complicated bit is getting the code and documentation in CVS ready, and that is harder to delegate. Once that is done though, the actual release process is fairly straight forward - as documented here - and could be delegated to anyone methodical with suitably setup development machine(s): http://biopython.org/wiki/Building_a_release Maybe some of the release process could be automated literally as a script - but doing each step methodically by hand and checking as you go is wise. For the release process, I'm basically proposing splitting this up into up to three jobs: (1) Coordinating final bug fixes and documentation in CVS. This has recently been handled by me or Michiel with most discussion on the dev lists, and some module specific details off list, and this works and I wouldn't change it. (2) Once CVS is ready, building the documentation, doing the release archives, doing epydoc, doing the Windows installers, tagging CVS, and uploading to the website. Part of the job would include scanning the NEWS and DEPRECATED files, plus recent documentation to make sure nothing was missed. This can be delegated. (3) Writing and publishing the release announcement on the news site and email lists (with the timing coordinated with the people doing jobs 1 and 2). I suggest having our new news coordinators take over this bit. 
So, while historically (1), (2) and (3) have been done by one person I think this could be split up into the "Release Director", "Release Manager" and "News Coordinator" roles (perhaps with different job titles?). Peter From p.j.a.cock at googlemail.com Thu Apr 23 14:06:14 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 23 Apr 2009 15:06:14 +0100 Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF) In-Reply-To: <20090423123635.GD34546@sobchak.mgh.harvard.edu> References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com> <20090423123635.GD34546@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904230706j213d6a47iadc6722581e52588@mail.gmail.com> On Thu, Apr 23, 2009 at 1:36 PM, Brad Chapman wrote: > Hi all; > >> > Unless you are thinking of having an object representation as being too >> > heavy, the non-light part of SeqFeature is all the FeatureLocation >> > fuzziness. >> >> I've just had a quick go at what should be a 100% backwards compatible >> modification to the FeatureLocation class to store ExactPosition start >> or end positions as integers. The idea should be more memory >> efficient, using the complex position objects only when required. > > I like the idea here but I would go a step further and get rid of > FeatureLocation, collapsing the start and end location onto the > SeqFeature itself. FeatureLocation is basically just a holder for a > start and end coordinates. In this version, you would store the > positions plus extensions and fuzzy type on the Feature, and then > instantiate fuzzy objects on demand. > > I took a look at the resource usage of these objects versus > a lightweight implementation. For a GFF file with 70k features, the > maximum memory usage is 128M versus 111M for the lightweight > version. So the improvement is rather modest, ~15%. Thanks for that.
Perhaps the variant idea using a single reference for each location would save more (currently it uses two references, one for the object and one for the integer - so in general we are wasting memory on a pointer to None). Certainly merging the SeqFeature and FeatureLocation should save even more memory. We could do this with full backward compatibility by generating the FeatureLocation object on request (using a property method for the SeqFeature's location), and this can also trigger a deprecation warning. We'd have to think about what to do with the SeqFeature's __init__ method more carefully. >> I forgot to mention the second major use case I'm concerned about, >> which is recovering the GenBank/EMBL style location string. I have >> looked at this in the past, by adding methods to the FeatureLocation >> and all the Position objects, but it is complicated by the fact the >> Position objects don't know if they are at the start or end (and for >> the start locations we need to add one to convert from Python >> counting). This is the main block on having Bio.SeqIO support writing >> GenBank (or EMBL) files with their features included. > > I admittedly haven't looked at this in a while, but this was > designed to be round tripped. The GenBank Record class can be > written out back in GenBank format, and test_GenBank explicitly > checks that the start and end records are the same. Yes - The Bio.GenBank.Record class should round-trip, from memory it stores feature locations as strings. I'm interested in writing a SeqRecord out as a GenBank file (which we can already do, but without the features). This would let you do things like load an EMBL or GFF3 file as a SeqRecord, and output it as a GenBank file.
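The backward-compatible merge of SeqFeature and FeatureLocation mentioned above could look roughly like this. It is a minimal sketch, assuming hypothetical LightSeqFeature and LegacyFeatureLocation classes (not the real Bio.SeqFeature code) and ignoring fuzzy positions entirely:

```python
import warnings


class LegacyFeatureLocation:
    """Hypothetical stand-in for the existing start/end holder object."""

    def __init__(self, start, end):
        self.start = start
        self.end = end


class LightSeqFeature:
    """Hypothetical merged feature storing plain integer coordinates."""

    def __init__(self, start, end, type=""):
        self.start = start  # plain int, Python counting
        self.end = end      # plain int, end-exclusive
        self.type = type

    @property
    def location(self):
        """Build the old-style location object only when asked for."""
        warnings.warn("The location object is deprecated in this sketch; "
                      "use the start and end attributes instead.",
                      DeprecationWarning, stacklevel=2)
        return LegacyFeatureLocation(self.start, self.end)


feature = LightSeqFeature(10, 50, type="CDS")
print(feature.start, feature.end)  # new style, no extra object created
loc = feature.location             # old style, built on demand with a warning
print(loc.start, loc.end)
```

One design consequence of this approach is that each access to the location property returns a fresh object, so old code that modified the location in place would no longer change the feature. That is part of what would need thinking through alongside the __init__ method.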
Peter From cy at cymon.org Thu Apr 23 14:32:10 2009 From: cy at cymon.org (Cymon Cox) Date: Thu, 23 Apr 2009 15:32:10 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <20090422224401.GC34546@sobchak.mgh.harvard.edu> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> <20090422224401.GC34546@sobchak.mgh.harvard.edu> Message-ID: <7265d4f0904230732i124670ebvf859b2e27943ba37@mail.gmail.com> 2009/4/22 Brad Chapman > Peter and Cymon; > > > >> This might be a silly question, but do you actually these exact option > > >> layouts for MUSCLE and MAFFT? Many Unix tools use something like > > >> libopt and will actually take slight variations, and may also offer > short > > >> and long names for the same option. Perhaps the existing option code > > >> in Bio.Application will suffice? > > > > > > MAFFT uses "--param value" style options, and won't accept > "--param=value" > > > or "-param value" as alternatives. > > > > OK. Then yes, we should support that. Brad, as Bio.Application is your > > module, would you like to comment? > > My comment is: I think it is awesome MAFFT made up their own way > of doing the command line. 
I think you'll be likewise inspired by the MUSCLE command line parsing:

    [cymon at chara mafft]$ muscle -in Tests/Fasta/f002 -anchorspacing -cluster1 upgmb
    Command-line option "upgmb" must start with '-'

But of course, these two are perfectly acceptable:

    [cymon at chara mafft]$ muscle -in Tests/Fasta/f002 -anchorspacing --cluster1=upgmb
    [cymon at chara mafft]$ muscle -in Tests/Fasta/f002 -anchorspacing on-balance-I-think-Ill-go-home -cluster1 upgmb

At present, there is no way to force a value argument to an option, so cmd.set_parameter("-anchorspacing") is acceptable in the interface. But, in general, I assume the idea is not to 'save' the user from the niceties of the particular programme's command line, i.e. in the command line interface I'm allowing users to set parameters which either don't work or crash the programme...

> Seriously, y'all are doing the right thing. Add a new class to
> Bio.Application: _OptionAlt or whatever you'd like to call MAFFT's
> inventive new way to specify command line arguments. Adapt the
> __str__ from _Option to do it the "--param val" way in this class.
> Then use this for your MAFFT commandline.

    class _OptionAlt(_AbstractParameter):
        """Represent an option that can be set for a program.

        This holds UNIXish options like:
        --append yes
        --append
        """
        def __str__(self):
            """Return the value of this option for the commandline.
            """
            if self.names[0].find("--") >= 0:
                output = "%s" % self.names[0]
                if self.value is not None:
                    output += " %s " % self.value
                else:
                    output += " "
            else:
                raise ValueError("Unrecognized option type: %s" % self.names[0])
            return output

C. -- ____________________________________________________________________ Cymon J.
Cox Centro de Ciencias do Mar Faculdade de Ciencias do Mar e Ambiente (FCMA) Universidade do Algarve Campus de Gambelas 8005-139 Faro Portugal Phone: +0351 289800909 ext 7909 Fax: +0351 289800051 Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com HomePage : http://biology.duke.edu/bryology/cymon.html -8.63/-6.77 From bugzilla-daemon at portal.open-bio.org Thu Apr 23 15:22:36 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 11:22:36 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904231522.n3NFMal6026332@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 cymon.cox at gmail.com changed: What: Attachment #1280 is obsolete | Removed: 0 | Added: 1 From bugzilla-daemon at portal.open-bio.org Thu Apr 23 15:23:05 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 11:23:05 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904231523.n3NFN5va026431@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 cymon.cox at gmail.com changed: What: Attachment #1279 is obsolete | Removed: 0 | Added: 1
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 15:25:43 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 11:25:43 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904231525.n3NFPhPH026661@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #3 from cymon.cox at gmail.com 2009-04-23 11:25 EST ------- Created an attachment (id=1285) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1285&action=view) Bio.Align.Applications.py text From bugzilla-daemon at portal.open-bio.org Thu Apr 23 15:32:34 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 11:32:34 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904231532.n3NFWYkO027258@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #4 from cymon.cox at gmail.com 2009-04-23 11:32 EST ------- Created an attachment (id=1286) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1286&action=view) Patch for Bio.Applications __init__.py
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 15:33:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 11:33:09 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904231533.n3NFX9kw027294@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #5 from cymon.cox at gmail.com 2009-04-23 11:33 EST ------- MUSCLE and MAFFT Bio.Application command lines Patch for Bio.Applications __init__.py to add _OptionAlt class covering "--param value" style options C. From bugzilla-daemon at portal.open-bio.org Thu Apr 23 15:43:04 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 11:43:04 -0400 Subject: [Biopython-dev] [Bug 2754] Bio.PDB: Parse warnings should print to stderr, not stdout In-Reply-To: Message-ID: <200904231543.n3NFh4cT028184@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2754 ------- Comment #13 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-23 11:43 EST ------- In comment #3 Bruce wrote: > > I believe that we should be using the Python warnings module for these > types of messages: > http://docs.python.org/library/warnings.html > > This permits the user to have a greater control over the output and also > allows redirecting the output as required. In the Bio directory, there are > currently 36 and 25 uses of stderr and stdout, respectively. > > In terms of the patch, my limited understanding is that local import sys will > override any global redirection of the output which in my opinion is a bad > idea. Good points, and yes, using the warnings module here (and probably elsewhere in Biopython) makes sense.
Eric wrote in comment #9:
> Yes, something must be done with test_PDB.py, because I don't think
> warnings.warn can be made to play nice with that print-and-compare test
> -- or any print-and-compare, since the warning messages contain extra
> environment-specific information.

I was able to solve this with the following trick:

    import warnings
    # The line argument is accepted because newer Python versions pass it.
    def send_warnings_to_stdout(message, category, filename, lineno,
                                file=None, line=None):
        # Print only the message text, not the filename or line number.
        print(message)
    warnings.showwarning = send_warnings_to_stdout

This now prints *just* the message text without the stack trace information etc. This also means it looks like any other output from the print-and-compare test, so test_PDB.py required only a trivial change. Note that I haven't taken Eric's patches/branch as is - for one thing I wanted to use the same import style as elsewhere in Biopython, i.e.

    import warnings
    warnings.warn("Message")

rather than:

    from warnings import warn
    warn("Message")

However, I think we can now close Bug 2754. Eric - please try the latest code from CVS (or the mirror on github). Also, could you open separate bug(s) for the other issues, such as your new unittest based version of test_PDB.py? Thanks, Peter

From biopython at maubp.freeserve.co.uk Thu Apr 23 16:34:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Apr 2009 17:34:27 +0100 Subject: [Biopython-dev] How are people doing their git merges from the trunk?
Message-ID: <320fb6e00904230934kac2ca0eoc8ba284fbe2c382e@mail.gmail.com>

Hi all,

We have the CVS trunk mirrored here:
http://github.com/biopython/biopython/tree/master

I have a copy of this in my github account here,
http://github.com/peterjc/biopython/tree/master

I decided that I would (initially at least) treat my master branch as
a copy of the master branch, and not commit local changes to this
branch. Instead I periodically grab the latest commits from the
master using the commands:

    #Do this once only:
    #git remote add official_dist git://github.com/biopython/biopython.git
    echo Checking out my local master branch...
    git checkout master
    echo Updating my local master branch with the official dist...
    git pull official_dist master
    echo Status:
    git status
    echo Pushing to my github master branch...
    git push origin master

This means the github network diagram only advances by one step, even
if the operation combined 10s of individual commits (which are still
shown individually on my history on github).

Alternatively, I could have used github's cherry pick interface (the
fork queue), or used git cherry pick at the command line. I can see
this is useful if you only want to pick out a few patches. Is there
any reason to use this when you want all the commits from another
branch?

Bartek's latest activity on the github network is a series of points -
I think this means he did a "cherry pick", and selected most (maybe
even all) of the changes from the main trunk. Am I interpreting this
right?
Thanks Peter From biopython at maubp.freeserve.co.uk Thu Apr 23 21:21:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Apr 2009 22:21:41 +0100 Subject: [Biopython-dev] Fwd: Where to put command line wrappers In-Reply-To: <20090417140241.GD16092@sobchak.mgh.harvard.edu> References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> <8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com> <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com> <320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com> <20090417140241.GD16092@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904231421k3e18d0b1y9003614e906fcb1c@mail.gmail.com> On Fri, Apr 17, 2009 at 3:02 PM, Brad Chapman wrote: > Hi all; > > [Where to put the commandline objects] >> > I think that there is a difference between EMBOSS and >> > Bio.[Motif|Align]. In EMBOSS we have a very nicely comoditized >> > set of tools with similar interfaces, while both for multiple >> > alignment and motif searching the tools vary a lot. In case of >> > multiple alignments this is only with respect to parameters and >> > output format, while in motif searching there is also a lot of >> > differences in the types of input (background models etc.). >> >> That is a good argument for using Bio/Align/Applications/XXX.py and >> Bio/Motif/Applications/XXX.py while also having >> Bio/EMBOSS/Applications.py > > There is a natural tension between overgeneralizing and dumping > too much into one file. At one end you have deeply nested Java-like > directories with a few lines of code in each file. I tend towards the > "more in a single file and less nesting" camp. My vote would be that > if the Motif Applications file will only contain commandline > wrappers, they could live in one file. 
OK, what I propose is that the command line objects are exposed as
Bio.Align.Applications.MuscleCommandline,
Bio.Align.Applications.ClustalwCommandline, etc but that the
implementations live in Bio/Align/Applications/_Muscle.py, _Clustalw.py
etc. To do this the Bio/Align/Applications/__init__.py file will look
like this:

    from _Muscle import MuscleCommandline
    from _Clustalw import ClustalwCommandline

This avoids having a single massive file, yet keeps the public
namespace simple. For the user, they do this:

    from Bio.Align.Applications import MuscleCommandline
    cline = MuscleCommandline(...)

or if they prefer,

    from Bio.Align import Applications
    cline = Applications.MuscleCommandline(...)

From the user's point of view all the alignment command line wrapper
objects live together under Bio.Align.Applications. This will be
consistent with the public API for the EMBOSS wrappers where you can
do:

    from Bio.Emboss.Applications import Primer3Commandline
    cline = Primer3Commandline(...)

or variants like that.

For Bio.Motif.Applications we can do the same as for
Bio.Align.Applications, or if there are only one or two wrappers
initially put the classes directly in
Bio/Motif/Applications/__init__.py and then split them into private
files later on if the file gets too big.

Peter

From bugzilla-daemon at portal.open-bio.org Thu Apr 23 21:52:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 17:52:08 -0400
Subject: [Biopython-dev] [Bug 2820] New: Convert test_PDB.py to unittest
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2820

           Summary: Convert test_PDB.py to unittest
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P3
         Component: Unit Tests
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: eric.talevich at gmail.com

The current test script for Bio.PDB uses the print-and-compare approach.
I've written an equivalent test script using unittest, assuming that style is the preferred one. It was written to go with Bug 2754, but now lives on my pdbtidy branch: http://github.com/etal/biopython/tree/pdbtidy This script could also live alongside the original test_PDB.py for awhile, as an additional check on Bio.PDB's error handling. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Apr 23 22:01:16 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 18:01:16 -0400 Subject: [Biopython-dev] [Bug 2754] Bio.PDB: Parse warnings should print to stderr, not stdout In-Reply-To: Message-ID: <200904232201.n3NM1GW1025781@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2754 ------- Comment #14 from eric.talevich at gmail.com 2009-04-23 18:01 EST ------- (In reply to comment #13) > I think we can now close Bug 2754. Eric - please try the latest code > from CVS (or the mirror on github). Works for me. I'll delete the bug2754 branch from github. > Also, could you also open separate bug(s) for the other issues, such as your > new unittest based version of test_PDB.py? I opened Bug 2820 for the unittest version of test_PDB.py. The script itself is living on my pdbtidy branch at Tests/test_PDB_unit.py now, although one of the tests broke during the merge (there were a lot of conflicts). I'll open bugs for the other changes once I figure out which modifications are worth sharing. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 22:26:38 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 18:26:38 -0400 Subject: [Biopython-dev] [Bug 2754] Bio.PDB: Parse warnings should print to stderr, not stdout In-Reply-To: Message-ID: <200904232226.n3NMQcPf027372@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2754 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-23 18:26 EST ------- (In reply to comment #14) > (In reply to comment #13) > > I think we can now close Bug 2754. Eric - please try the latest code > > from CVS (or the mirror on github). > > Works for me. > Great - marking this bug as fixed :) > > I'll delete the bug2754 branch from github. > OK - it has served its purpose now :) > > Also, could you also open separate bug(s) for the other issues, > > such as your new unittest based version of test_PDB.py? > > I opened Bug 2820 for the unittest version of test_PDB.py. The script itself > is living on my pdbtidy branch at Tests/test_PDB_unit.py now, although one > of the tests broke during the merge (there were a lot of conflicts). Thanks. > I'll open bugs for the other changes once I figure out which modifications > are worth sharing. Thank you :) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Thu Apr 23 22:54:02 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 23 Apr 2009 18:54:02 -0400 Subject: [Biopython-dev] How are people doing their git merges from the trunk? 
In-Reply-To: <320fb6e00904230934kac2ca0eoc8ba284fbe2c382e@mail.gmail.com>
References: <320fb6e00904230934kac2ca0eoc8ba284fbe2c382e@mail.gmail.com>
Message-ID: <3f6baf360904231554y7e28084ex79bb0a7f60b4cef7@mail.gmail.com>

On Thu, Apr 23, 2009 at 12:34 PM, Peter wrote:
>
> I decided that I would (initially at least) treat my master branch as
> a copy of the master branch, and not commit local changes to this
> branch. Instead I periodically grab the latest commits from the
> master using the commands:

I think this is the recommended way to do it. I read a thread where
Mercurial gurus recommended keeping a clean clone of the upstream
repository, and never committing to that clone. Git seems to have a cleaner
version of this with in-place branches.

After a few bad incidents with git-rebase, I resolved to keep 'master' in
sync with the biopython trunk, and use new named branches for all
modifications. The workflow is:

    git checkout master
    git pull origin           # if I've pushed commits from a different computer recently
    git pull upstream master  # upstream is the remote biopython/biopython
    git push origin master

    git checkout phyloxml     # a local branch
    git merge master
    # hack, commit, repeat
    # rebasing commits made in this session on this branch is still safe
    git push origin phyloxml

> This means the github network diagram only advances by one step, even
> if the operation combined 10s of individual commits (which are still
> shown individually on my history on github).
>

I think mine shows up as multiple dots, and I don't use cherry-pick. Pulling
from upstream on the master branch always results in a fast-forward, though.
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 23:36:28 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 19:36:28 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To:
Message-ID: <200904232336.n3NNaSw6031547@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-23 19:36 EST -------
(In reply to comment #0)
> The current test script for Bio.PDB uses the print-and-compare approach. I've
> written an equivalent test script using unittest, assuming that style is the
> preferred one.

Yes, in principle the unittest style is preferred. In practice I am pragmatic
about this - a print-and-compare test is better than nothing, and for some
things is much easier to write.

> It was written to go with Bug 2754, but now lives on my pdbtidy branch:
> http://github.com/etal/biopython/tree/pdbtidy
>
> This script could also live alongside the original test_PDB.py for awhile, as
> an additional check on Bio.PDB's error handling.

I've checked in a slightly modified version as test_PDB_unit.py - I think
having both this and the original test_PDB.py is sensible in the short term.

You wrote on Bug 2754 comment 14 that "one of the tests broke during the
merge", was that this one:

    def test_warnings(self):
        """Parse a flawed PDB file in permissive mode, with warnings"""
        # Python 2.6+: rewrite this using warnings.catch_warnings
        parser = PDBParser(PERMISSIVE=1)
        msg_redef_n = r"Atom N defined twice in residue at line 19\."
        msg_blank_alt = r"Blank altlocs in duplicate residue SER \(' ', 4, ' '\) at line 41\."
        msg_redef_o = r"Atom O defined twice in residue at line 820\."
        warnings.simplefilter('ignore')  # NB: Order is important here!
        warnings.filterwarnings('error', msg_redef_n, PDBConstructionWarning)
        self.assertRaises(PDBConstructionWarning, parser.get_structure,
                          "example", "PDB/a_structure.pdb")
        warnings.filters.pop(0)
        warnings.filterwarnings('error', msg_blank_alt, PDBConstructionWarning)
        self.assertRaises(PDBConstructionWarning, parser.get_structure,
                          "example", "PDB/a_structure.pdb")
        warnings.filters.pop(0)
        warnings.filterwarnings('error', msg_redef_o, PDBConstructionWarning)
        self.assertRaises(PDBConstructionWarning, parser.get_structure,
                          "example", "PDB/a_structure.pdb")
        warnings.filters.pop(0)
        warnings.filters.pop(0)

I tried but couldn't get this to work (on Python 2.4.3 on Linux), even with
plenty of warnings.resetwarnings() which seemed cleaner than popping things.

I agree with the idea that we should make sure particular errors do get raised
(this is checked by the print-and-compare test_PDB.py because we capture these
warnings to stdout), but right now how to make it work escapes me. Maybe after
a good night's sleep things will make sense ;)

Leaving this bug open to address this point.

Peter

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri Apr 24 03:12:15 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 23:12:15 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To:
Message-ID: <200904240312.n3O3CFdn011360@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820

------- Comment #2 from eric.talevich at gmail.com 2009-04-23 23:12 EST -------
(In reply to comment #1)
> You wrote on Bug 2754 comment 14 that "one of the tests broke during the
> merge", was that this one:
>
> def test_warnings(self):
> [...]
> > I tried but couldn't get this to work (on Python 2.4.3 on Linux), even with > plenty of warnings.resetwarnings() which seemed cleaner than popping things. > Yep, that's the one. The behavior of the warnings module and resetwarnings() is pathological, I think. If a warning is triggered before the warnings.simplefilter('always') function is called, that specific warning will be silent until the interpreter is restarted. That's why order is sensitive in that function, and why the three exceptions aren't three separate functions. The attribute warnings.filters is a list of filters that warnings are checked against as they're raised, and at startup the list is not empty. Calling warnings.resetwarnings() just empties this list, including the default filters and any use of 'ignore' or 'always'. Maybe the popping was just voodoo and an empty filter list is fine... dunno. Python 2.6 includes a context manager that makes all these problems *completely* go away, by catching all of the warnings raised within a context and optionally storing them as a list of warning objects that can be inspected. Would you be interested in having a unit test that does a more thorough check of the warnings system, but only runs on Py2.6? I'm guessing no, but hey, worth a shot. Most likely, some warnings just aren't being caught because my version of the unit test assumed a different variety of warnings coming out of PDB. If that's the case then it should be an easy fix and you can disregard my whining. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
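
A rough sketch of the Python 2.6+ approach mentioned in comment #2, using the standard library's warnings.catch_warnings(record=True). ConstructionWarning and parse_flawed_structure() are made-up stand-ins for PDBConstructionWarning and parser.get_structure(...); this is not the test that was checked in.

```python
import warnings


class ConstructionWarning(UserWarning):
    """Hypothetical stand-in for Bio.PDB's PDBConstructionWarning."""


def parse_flawed_structure():
    """Hypothetical stand-in for parser.get_structure() on a flawed file."""
    warnings.warn("Atom N defined twice in residue at line 19.",
                  ConstructionWarning)
    warnings.warn("Atom O defined twice in residue at line 820.",
                  ConstructionWarning)
    return "structure"


def collect_warning_messages(func):
    """Call func() and return (result, list of warning message strings)."""
    with warnings.catch_warnings(record=True) as caught:
        # Within this context, record every warning regardless of any
        # filters that were installed earlier in the process.
        warnings.simplefilter("always")
        result = func()
    # On leaving the context the previous filter list is restored.
    return result, [str(w.message) for w in caught]


result, messages = collect_warning_messages(parse_flawed_structure)
```

Because the filters are saved and restored by the context manager, a test written this way should not depend on which other tests ran first, and needs no filters.pop() or resetwarnings() calls.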
From bugzilla-daemon at portal.open-bio.org Fri Apr 24 03:56:49 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 23 Apr 2009 23:56:49 -0400 Subject: [Biopython-dev] [Bug 2821] New: NCBIXML.parse only returns results for non-empty hits rather than one per query sequence Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2821 Summary: NCBIXML.parse only returns results for non-empty hits rather than one per query sequence Product: Biopython Version: 1.50b Platform: Other OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: camilla at ip.id.au I used NCBIStandalone.blastall to BLAST all records in query database VEKY.faa (a FASTA-format file of 226 proteins) significantly similar in proteins in target database VPOO.faa (a FASTA-format file of 80 proteins). Many of the 'VEKY' proteins do not have a significant hit in the 'VPOO' database (which is what I expect and this is fine). To access the results, I iterate using a loop like the following to parse the raw BLAST results in XML format: blast_out = _open_file(outraw_file, 'r') blast_records = NCBIXML.parse(blast_out) for b_record in blast_records: # deal with each record here However, instead of getting 226 records as I expect, some of which have a description of alignments field of length zero, this returns 64 records - the records that did not have 'no hits'. My problem is that I'd like to work out which VEKY query sequence each 'b_record' corresponds to. But so far I have not been able to find any such information in the b_record. And because it doesn't produce one per query sequence, I cannot infer that information from the order of the query sequences in my input VEKY.faa file. Do you know how I can get around this problem? 
Warm thanks in advance for any help or tips, Camilla -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 24 08:05:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 24 Apr 2009 04:05:29 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200904240805.n3O85TqY030236@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #3 from dalloliogm at gmail.com 2009-04-24 04:05 EST ------- (In reply to comment #0) > The current test script for Bio.PDB uses the print-and-compare approach. I've > written an equivalent test script using unittest, assuming that style is the > preferred one. > > It was written to go with Bug 2754, but now lives on my pdbtidy branch: > http://github.com/etal/biopython/tree/pdbtidy > > This script could also live alongside the original test_PDB.py for awhile, as > an additional check on Bio.PDB's error handling. > I also tried to write an unittest-based test for PDB exposure, just for playing with it a bit: - http://github.com/dalloliogm/biopython/blob/7dabfff5f7b523479bf8d6de120d0f6c7d03f7df/Tests/test_PDBexposure.py I used the approach where one unit test is equivalent to a PDB file, instead of a set of functions. For example: - test case 1: PDB.NeighborSearch is able to read a random generated PDB file - test case 2: PDB.NeighborSearch is able to read a pdb file with only one structure - test case 3: PDB.NeighborSearch is able to read another specific pdb case -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
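
One possible reading of the "one test case per PDB file" idea in comment #3, sketched with a shared mixin and one subclass per input. The class names, the inline PDB-like text and the two checks are invented for illustration; real tests would parse actual files with Bio.PDB rather than scan lines.

```python
import unittest


class StructureFileChecks(object):
    """Checks shared by every input file; not a TestCase on its own."""

    # Hypothetical fixtures: each subclass supplies its own input text.
    pdb_text = ""
    expected_atoms = 0

    def setUp(self):
        # Stand-in for real parsing: keep only the ATOM records.
        self.atoms = [line for line in self.pdb_text.splitlines()
                      if line.startswith("ATOM")]

    def test_atom_count(self):
        self.assertEqual(len(self.atoms), self.expected_atoms)

    def test_atom_serials_are_unique(self):
        # Columns 7-11 of an ATOM record hold the atom serial number.
        serials = [line[6:11].strip() for line in self.atoms]
        self.assertEqual(len(serials), len(set(serials)))


class SingleResidueFile(StructureFileChecks, unittest.TestCase):
    pdb_text = (
        "ATOM      1  N   SER A   1\n"
        "ATOM      2  CA  SER A   1\n"
        "END\n"
    )
    expected_atoms = 2


class EmptyFile(StructureFileChecks, unittest.TestCase):
    pdb_text = "END\n"
    expected_atoms = 0


def run_suite():
    """Run both per-file test cases and return the unittest result object."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for case in (SingleResidueFile, EmptyFile):
        suite.addTests(loader.loadTestsFromTestCase(case))
    outcome = unittest.TestResult()
    suite.run(outcome)
    return outcome
```

Keeping the shared checks in a plain mixin (rather than a TestCase) means the test loader only collects the per-file subclasses.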
From bugzilla-daemon at portal.open-bio.org Fri Apr 24 08:06:43 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 24 Apr 2009 04:06:43 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To:
Message-ID: <200904240806.n3O86h0q030360@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820

------- Comment #4 from dalloliogm at gmail.com 2009-04-24 04:06 EST -------
(In reply to comment #3)

This has the advantage that you can write a base test class and then apply the
same tests to various files, by subclassing.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri Apr 24 09:02:35 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 24 Apr 2009 05:02:35 -0400
Subject: [Biopython-dev] [Bug 2821] NCBIXML.parse only returns results for non-empty hits rather than one per query sequence
In-Reply-To:
Message-ID: <200904240902.n3O92Z68004987@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2821

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-24 05:02 EST -------
What version of BLAST do you have, and (assuming it's less than say 10 MB)
could you attach the XML file to this bug?

From memory this is a limitation of the raw XML file from the NCBI - there is
no way to tell if there were additional queries with no hits (so Biopython
can't help directly). I have not checked BLAST 2.2.20, but had been meaning to
ask the NCBI about this. They may not regard it as a "bug", but it was
annoying.

I have used two workarounds in my own code.

(1) Load a list of the query IDs into memory, and as you go through the BLAST
results you can see which queries don't appear - and therefore had no hits.

(2) Use the .next() methods on a FASTA iterator on the query file, and the
NCBIXML iterator on the BLAST XML file to step through the two files in sync.
I have some code to do this somewhere... maybe I should turn this into a
cookbook recipe for the wiki: http://biopython.org/wiki/Category:Cookbook

Peter

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri Apr 24 09:07:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 24 Apr 2009 05:07:48 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To:
Message-ID: <200904240907.n3O97mDU005535@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2820

------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-24 05:07 EST -------
(In reply to comment #3)
>
> I also tried to write an unittest-based test for PDB exposure, just for
> playing with it a bit:
> ...
> I used the approach where one unit test is equivalent to a PDB file,
> instead of a set of functions.

Hi Giovanni,

Isn't Bug 2759 for the PDB exposure test? I was thinking of just adding that
to the new file test_PDB_unit.py, rather than making it into its own file.

Peter

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
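
For Bug 2821, workaround (1) from comment #1 might look roughly like the sketch below. It reads FASTA identifiers with plain string handling rather than Biopython, and the example data are invented. In real use the list of reported identifiers would be collected while looping over the NCBIXML.parse() records; which record attribute carries the FASTA identifier depends on the XML file, so that mapping is left as an assumption.

```python
import io


def fasta_ids(handle):
    """Yield the identifier (first word after '>') of each FASTA record."""
    for line in handle:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                yield words[0]


def queries_without_hits(query_ids, reported_ids):
    """Return the query IDs, in input order, missing from the BLAST results."""
    reported = set(reported_ids)
    return [qid for qid in query_ids if qid not in reported]


# Made-up data standing in for the query FASTA file and for the identifiers
# gathered from the parsed BLAST records.
example_fasta = io.StringIO(
    u">VEKY_001 first protein\nMKV\n"
    u">VEKY_002 second protein\nMAL\n"
    u">VEKY_003 third protein\nMSS\n"
)
ids_seen_in_results = ["VEKY_001", "VEKY_003"]

missing = queries_without_hits(fasta_ids(example_fasta), ids_seen_in_results)
```

Here missing holds the queries that never showed up in the results, i.e. the ones with no hits.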
From bugzilla-daemon at portal.open-bio.org Fri Apr 24 09:15:05 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 24 Apr 2009 05:15:05 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200904240915.n3O9F5hr006324@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #6 from dalloliogm at gmail.com 2009-04-24 05:15 EST ------- (In reply to comment #5) > (In reply to comment #3) > > > > I also tried to write an unittest-based test for PDB exposure, just for > > playing with it a bit: > > ... > > I used the approach where one unit test is equivalent to a PDB file, > > instead of a set of functions. > > Hi Giovanni, > > Isn't Bug 2759 for the PDB exposure test? I was thinking of just adding that > to the new file test_PDB_unit.py, rather than making it into its own file. > > Peter Ok, of course :) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 24 09:50:33 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 24 Apr 2009 05:50:33 -0400 Subject: [Biopython-dev] [Bug 2759] Unit test for Bio.PDB.HSExposure In-Reply-To: Message-ID: <200904240950.n3O9oXlV008332@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2759 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1234 is|0 |1 obsolete| | ------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-24 05:50 EST ------- (From update of attachment 1234) I have checked this initial exposure test in as part of new file test_PDB_unit.py (created for Bug 2820). 
Leaving this bug open to look at Martin and/or Giovanni's improvements/extensions. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Apr 24 09:59:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 24 Apr 2009 05:59:09 -0400 Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest In-Reply-To: Message-ID: <200904240959.n3O9x9M8008849@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2820 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-24 05:59 EST ------- (In reply to comment #2) > > Yep, that's the one. > > The behavior of the warnings module and resetwarnings() is pathological, I > think. If a warning is triggered before the warnings.simplefilter('always') > function is called, that specific warning will be silent until the interpreter > is restarted. That's why order is sensitive in that function, and ... > Calling warnings.resetwarnings() just empties this list, including the > default filters and any use of 'ignore' or 'always'. The reduced warning test in CVS was working until I added more unit tests (for Bug 2759). This changed the test order, and the warnings were no longer being triggered. I tried a few things like setting warnings.defaultaction="always" at the top of the file, and adding and warnings. onceregistry={} to the test method, but I have given up. We need to be able to *completely* reset the warnings module for this approach to work. > Python 2.6 includes a context manager that makes all these problems > *completely* go away, by catching all of the warnings raised within a > context and optionally storing them as a list of warning objects that > can be inspected. 
That sounds much better :) > Would you be interested in having a unit test that does a more thorough > check of the warnings system, but only runs on Py2.6? I'm guessing no, > but hey, worth a shot. Yes - other than using the old print-and-compare test, this seems worth doing in order to actually test the warnings we expect are being issued. It could be a whole new file, test_PDB_warnings.py which required Python 2.6+, but as its just one or two tests, maybe just use conditional method(s) within the test_PDB_unit.py file. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Fri Apr 24 10:57:03 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 24 Apr 2009 11:57:03 +0100 Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF) In-Reply-To: <20090423123635.GD34546@sobchak.mgh.harvard.edu> References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com> <20090423123635.GD34546@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904240357j29663811q80d4f7c2e7cf6382@mail.gmail.com> On Thu, Apr 23, 2009 at 1:36 PM, Brad Chapman wrote: > I took a look at the resource usage of these objects versus > a lightweight implementation. For a GFF file with 70k features, the > maximum memory usage is 128M versus 111M for the lightweight > version. So the improvement is rather modest, ~15%. How did you measure these memory figures? And was your 15% comparison between the current "heavy" SeqFeature + FeatureLocation system as in CVS, and my lightweight alternative described earlier? 
Peter

From cy at cymon.org Fri Apr 24 11:43:33 2009
From: cy at cymon.org (Cymon Cox)
Date: Fri, 24 Apr 2009 12:43:33 +0100
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
 <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>
 <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
 <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com>
 <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com>
 <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com>
Message-ID: <7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com>

2009/4/22 Peter

> On Wed, Apr 22, 2009 at 6:00 PM, Cymon Cox wrote:
> >>
> >> This might be a silly question, but do you actually need these exact
> >> option layouts for MUSCLE and MAFFT? Many Unix tools use something like
> >> libopt and will actually take slight variations, and may also offer
> >> short and long names for the same option. Perhaps the existing option
> >> code in Bio.Application will suffice?
> >
> > MAFFT uses "--param value" style options, and won't accept
> > "--param=value" or "-param value" as alternatives.
>
> OK. Then yes, we should support that. Brad, as Bio.Application is your
> module, would you like to comment?
>
> >
> > Neither use "-param=value", but if more applications it may turn up.
> >
>
> I don't think I have ever seen a command line application that used that.

PRANK - Probabilistic Alignment Kit
http://www.ebi.ac.uk/goldman-srv/prank/prank/

Advanced usage: 'prank [optional parameters] -d=sequence_file [optional
parameters]'

Doesn't accept "-d sequence_file" or "- -d=sequence_file"

C.
--
From biopython at maubp.freeserve.co.uk Fri Apr 24 11:51:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Apr 2009 12:51:58 +0100
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
 <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>
 <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
 <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com>
 <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com>
 <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com>
 <7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com>
Message-ID: <320fb6e00904240451ma79b037y36315270a03724b@mail.gmail.com>

On Fri, Apr 24, 2009 at 12:43 PM, Cymon Cox wrote:
> 2009/4/22 Peter
>
>> On Wed, Apr 22, 2009 at 6:00 PM, Cymon Cox wrote:
>> >>
>> >> This might be a silly question, but do you actually need these exact
>> >> option layouts for MUSCLE and MAFFT? Many Unix tools use something like
>> >> libopt and will actually take slight variations, and may also offer
>> >> short and long names for the same option. Perhaps the existing
>> >> option code in Bio.Application will suffice?
>> >
>> > MAFFT uses "--param value" style options, and won't accept
>> > "--param=value" or "-param value" as alternatives.
>>
>> OK. Then yes, we should support that. Brad, as Bio.Application is your
>> module, would you like to comment?
>>
>> >
>> > Neither use "-param=value", but if more applications it may turn up.
>> >
>>
>> I don't think I have ever seen a command line application that used that.
>
>
> PRANK - Probabilistic Alignment Kit
> http://www.ebi.ac.uk/goldman-srv/prank/prank/
>
> Advanced usage: 'prank [optional parameters] -d=sequence_file [optional
> parameters]'
>
> Doesn't accept "-d sequence_file" or "- -d=sequence_file"

I had misunderstood the quotes to be literally typed on the command line ;)

Peter

From biopython at maubp.freeserve.co.uk Fri Apr 24 12:39:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Apr 2009 13:39:51 +0100
Subject: [Biopython-dev] How are people doing their git merges from the trunk?
In-Reply-To: <3f6baf360904231554y7e28084ex79bb0a7f60b4cef7@mail.gmail.com>
References: <320fb6e00904230934kac2ca0eoc8ba284fbe2c382e@mail.gmail.com>
 <3f6baf360904231554y7e28084ex79bb0a7f60b4cef7@mail.gmail.com>
Message-ID: <320fb6e00904240539n616a5e77s7ef4377c2cd4c336@mail.gmail.com>

On Thu, Apr 23, 2009 at 11:54 PM, Eric Talevich wrote:
> On Thu, Apr 23, 2009 at 12:34 PM, Peter wrote:
>
>> I decided that I would (initially at least) treat my master branch as
>> a copy of the master branch, and not commit local changes to this
>> branch. Instead I periodically grab the latest commits from the
>> master using the commands:
>
> I think this is the recommended way to do it. I read a thread where
> Mercurial gurus recommended keeping a clean clone of the upstream
> repository, and never committing to that clone. Git seems to have a cleaner
> version of this with in-place branches.
>
> After a few bad incidents with git-rebase, I resolved to keep 'master' in
> sync with the biopython trunk, and use new named branches for all
> modifications. The workflow is:
>
> git checkout master
> git pull origin           # if I've pushed commits from a different computer recently
> git pull upstream master  # upstream is the remote biopython/biopython
> git push origin master

Using "upstream" seems like a very sensible name, I assume you set up:

    git remote add upstream git://github.com/biopython/biopython.git

Peter

From chapmanb at 50mail.com Fri Apr 24 12:45:15 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 24 Apr 2009 08:45:15 -0400
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF)
In-Reply-To: <320fb6e00904240357j29663811q80d4f7c2e7cf6382@mail.gmail.com>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
 <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com>
 <20090423123635.GD34546@sobchak.mgh.harvard.edu>
 <320fb6e00904240357j29663811q80d4f7c2e7cf6382@mail.gmail.com>
Message-ID: <20090424124515.GJ34546@sobchak.mgh.harvard.edu>

Hi Peter;

> > I took a look at the resource usage of these objects versus
> > a lightweight implementation. For a GFF file with 70k features, the
> > maximum memory usage is 128M versus 111M for the lightweight
> > version. So the improvement is rather modest, ~15%.
>
> How did you measure these memory figures?

With the unix 'time' command; those are the values reported by %M,
which is the maximum memory used during the process.

> And was your 15% comparison between the current "heavy" SeqFeature +
> FeatureLocation system as in CVS, and my lightweight alternative
> described earlier?

This was with an even lighter version. I just added start/end as
attributes to the SeqFeatures. So there was no FeatureLocation or
individual position objects. This was a hack to look at the best case
scenario to save memory. The baseline was the default SeqFeatures
before we started thinking about changing them.

> How does this version look? It should save more memory than the
> version I sent you three days ago, and again aims for 100% backwards
> compatibility - all the unit tests pass.

That is nice.
Do we still want to keep a FeatureLocation, or condense this all onto the SeqFeature itself? Brad From chapmanb at 50mail.com Fri Apr 24 12:47:06 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 24 Apr 2009 08:47:06 -0400 Subject: [Biopython-dev] How are people doing their git merges from the trunk? In-Reply-To: <320fb6e00904240539n616a5e77s7ef4377c2cd4c336@mail.gmail.com> References: <320fb6e00904230934kac2ca0eoc8ba284fbe2c382e@mail.gmail.com> <3f6baf360904231554y7e28084ex79bb0a7f60b4cef7@mail.gmail.com> <320fb6e00904240539n616a5e77s7ef4377c2cd4c336@mail.gmail.com> Message-ID: <20090424124706.GK34546@sobchak.mgh.harvard.edu> Eric and Peter; This is really good stuff. Can we add the details to the wiki? It looks like this section could use the information from this thread: http://biopython.org/wiki/GitUsage#Merging_upstream_changes Brad > On Thu, Apr 23, 2009 at 11:54 PM, Eric Talevich wrote: > > On Thu, Apr 23, 2009 at 12:34 PM, Peter wrote: > > > >> I decided that I would (initially at least) treat my master branch as > >> a copy of the master branch, and not commit local changes to this > >> branch. ?Instead I periodically grab the latest commits from the > >> master using the commands: > > > > I think this is the recommended way to do it. I read a thread where > > Mercurial gurus recommended keeping a clean clone of the upstream > > repository, and never committing to that clone. Git seems to have a cleaner > > version of this with in-place branches. > > > > After a few bad incidents with git-rebase, I resolved to keep 'master' in > > sync with the biopython trunk, and use new named branches for all > > modifications. The workflow is: > > > > git checkout master > > git pull origin ? ?# if I've pushed commits from a different computer > > recently > > git pull upstream master ? 
# upstream is the remote biopython/biopython > > git push origin master > > Using "upstream" seems like a very sensible name, I assume you set up: > git remote add upstream git://github.com/biopython/biopython.git > > Peter > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Fri Apr 24 14:14:10 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 24 Apr 2009 15:14:10 +0100 Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF) In-Reply-To: <20090424124515.GJ34546@sobchak.mgh.harvard.edu> References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com> <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com> <20090423123635.GD34546@sobchak.mgh.harvard.edu> <320fb6e00904240357j29663811q80d4f7c2e7cf6382@mail.gmail.com> <20090424124515.GJ34546@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904240714s3a0df8cfk75330fd4025c13a3@mail.gmail.com> On Fri, Apr 24, 2009 at 1:45 PM, Brad Chapman wrote: > With the unix 'time' command; those are the values reported by %M, > which is the maximum memory used during the process. > You said 70k features, but how big was the file on disk? >> >> And was your 15% comparison between the current "heavy" SeqFeature + >> FeatureLocation system as in CVS, and my lightweight alternative >> described earlier? >> > > This was with an even lighter version. I just added start/end as > attributes to the SeqFeatures. So there was no FeatureLocation or > individual position objects. This was a hack to look at the best case > scenario to save memory. The baseline was the default SeqFeatures > before we started thinking about changing them. Right - so even if the FeatureLocation is a bit "heavy", getting rid of it wouldn't make that much difference based on your simple profiling. >> How does this version look? 
It should save more memory that the >> version I sent you three days ago, and again aims for 100% backwards >> compatibility - all the unit tests pass. > > That is nice. Do we still want to keep a FeatureLocation, or > condense this all onto the SeqFeature itself?

For the moment I was exploring ways to avoid wasting memory in the FeatureLocation object while retaining 100% compatibility. If your simple profiling numbers are telling the whole story, then there isn't a great deal of point in adding any internal complexity for a small memory saving.

If we do want to preserve the current SeqFeature and FeatureLocation API, then the proposal on Bug 2818 is a worthwhile incremental improvement. However, we can probably come up with something even nicer if we change the SeqFeature and FeatureLocation in a non-backwards compatible way.

If we did change the API, I would want to stop using the sub_features list to hold join information as child SeqFeatures. I was thinking the FeatureLocation object should hold this, but merging the SeqFeature and FeatureLocation could make sense. Are there any other non-join location operators we really have to deal with?

Internally the FeatureLocation (or SeqFeature) could have a list of child locations held as a private list of two-entry tuples (start and end positions). Typically for a non-join feature this will be just _loc_list=[(start,end)], while more generally it would be _loc_list=[(start1,end1),...,(startN,endN)]. The FeatureLocation (or SeqFeature) would have (fuzzy/non-fuzzy) start and end properties which would access _loc_list[0][0] for the start, and _loc_list[-1][1] for the end. I would still use the existing position objects to store fuzzy positions.
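A minimal sketch of the internal layout described above might look like this. The class name SketchLocation is hypothetical (it is not Biopython code), and plain integers stand in for the existing position objects:

```python
class SketchLocation(object):
    """Hypothetical location holding one or more (start, end) pairs."""

    def __init__(self, *pairs):
        if not pairs:
            raise ValueError("at least one (start, end) pair is required")
        # Private list of two-entry tuples, one per child location.
        self._loc_list = [(start, end) for start, end in pairs]

    @property
    def start(self):
        # Start of the first child location.
        return self._loc_list[0][0]

    @property
    def end(self):
        # End of the last child location.
        return self._loc_list[-1][1]


# A non-join feature has a single pair; a join has several.
simple = SketchLocation((5, 20))
joined = SketchLocation((5, 20), (30, 45), (60, 80))
```

With this layout a simple feature and a join share the same start/end access path, which is the point of the proposal; how fuzzy positions would be stored in the tuples is left open here.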
Peter From cy at cymon.org Fri Apr 24 15:31:28 2009 From: cy at cymon.org (Cymon Cox) Date: Fri, 24 Apr 2009 16:31:28 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <320fb6e00904240451ma79b037y36315270a03724b@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> <7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com> <320fb6e00904240451ma79b037y36315270a03724b@mail.gmail.com> Message-ID: <7265d4f0904240831p378a467dx34152951a3f16c52@mail.gmail.com> 2009/4/24 Peter > >> > MAFFT uses "--param value" style options, and won't accept > >> "--param=value" > >> > or "-param value" as alternatives. > >> > >> OK. Then yes, we should support that. Brad, as Bio.Application is your > >> module, would you like to comment? > >> > >> > > >> > Neither use "-param=value", but if more applications it may turn up. > >> > > >> > >> I don't think I have ever see a command line application that used that. > > > > > > PRANK - Probabilistic Alignment Kit > > http://www.ebi.ac.uk/goldman-srv/prank/prank/ > > > > Advanced usage: 'prank [optional parameters] -d=sequence_file [optional > > parameters]' > > > > Doesn't accept "-d sequence_file" or "- -d=sequence_file" > > I had misunderstood the quotes to be literally typed on the command line ;) So the upshot is that both "- -param value" and "-param=value" need to be supported. 
Rather than add another variation on _Option, or alter _OptionAlt to cover "-param=value", and as we only have a few command line interfaces at present, I'd like to suggest the following simplification to _Option:

    _AbstractParameter.__init__(self, names = [], types = [],
                                checker_function = None, is_required = 0,
                                description = "", equate=True):
        self.names = names
        self.param_types = types
        self.checker_function = checker_function
        self.description = description
        self.is_required = is_required
        self.equate = equate

    [...]

    class _Option(_AbstractParameter):
        """Represent an option that can be set for a program.

        This holds UNIXish options like: --append=yes --append yes --append -append=yes -a yes -append
        """
        def __str__(self):
            """Return the value of this option for the commandline.
            """
            output = "%s" % self.names[0]
            if self.value is not None:
                output += "%s%s " % \
                          (self.equate and "=" or " ", self.value)
            return output

i.e. add an equate flag C. -- From biopython at maubp.freeserve.co.uk Fri Apr 24 16:59:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Apr 2009 17:59:28 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <7265d4f0904240831p378a467dx34152951a3f16c52@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> <7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com> <320fb6e00904240451ma79b037y36315270a03724b@mail.gmail.com> <7265d4f0904240831p378a467dx34152951a3f16c52@mail.gmail.com> Message-ID: <320fb6e00904240959k78d0805bo469dc9666c70d3c0@mail.gmail.com> On Fri, Apr 24, 2009 at 4:31 PM, Cymon Cox wrote: > > So the upshot is that both "- -param value" and "-param=value" need to be > 
supported. > > Rather than add another variation on _Option, or alter _OptionAlt to cover > "-param=value", and as we only have a few command line interfaces at > present, I'd like to suggest the following simplification to _Option: > ... > ie. add an equate flag That looks very sensible. If there are no counter suggestions, I think that could be checked in :) Peter From bugzilla-daemon at portal.open-bio.org Sat Apr 25 20:36:36 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 25 Apr 2009 16:36:36 -0400 Subject: [Biopython-dev] [Bug 2817] Meta-bug for cleanup once we drop Python 2.3 support In-Reply-To: Message-ID: <200904252036.n3PKaa7G001530@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2817 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-25 16:36 EST ------- Python 2.4+ should let us use the package_data option in setup.py to install the data files needed for Bio.Entrez and Bio.PopGen (and, if we still include it, Bio.EUtils). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Sat Apr 25 23:30:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 26 Apr 2009 00:30:15 +0100 Subject: [Biopython-dev] Removing Bio.Mindy and Martel Message-ID: <320fb6e00904251630t43ec275ehd25906476c6afe18@mail.gmail.com> Hi all, Bio.Mindy and Martel are the old "regular expressions on steroids" parsing framework we used to use in Biopython, which needed the external dependency mxTextTools (v2, we never got things to work fully with v3). These modules were deprecated in Biopython 1.48 (Sept 2008), and I explicitly wrote in the release announcements for Biopython 1.50 (and its beta) that this would be the final release to include them. 
I decided to do this in two steps (partly because of the number of files involved). I've just removed Mindy and associated bits in CVS, and everything looks fine from a setup and unit test point of view. Next comes Martel and its remaining dependent modules. Martel is still used in the following modules, which were also deprecated in Biopython 1.48 (Sept 2008): Bio.MetaTool (parser for output from an obsolete version of MetaTool) Bio.Saf (an obscure alignment format) Bio.NBRF (replaced with "pir" format in Bio.SeqIO) Bio.IntelliGenetics (replaced with "ig" format in Bio.SeqIO) We've actually had three releases where these modules have had a deprecation warning in place, but not quite the full year as stated in the written policy: http://biopython.org/wiki/Deprecation_policy Does anyone have any objections about us pressing ahead with removing Martel and these modules now? Peter From biopython at maubp.freeserve.co.uk Sun Apr 26 10:58:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 26 Apr 2009 11:58:29 +0100 Subject: [Biopython-dev] Bio.Application interface In-Reply-To: <320fb6e00904240959k78d0805bo469dc9666c70d3c0@mail.gmail.com> References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com> <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com> <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com> <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com> <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com> <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com> <7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com> <320fb6e00904240451ma79b037y36315270a03724b@mail.gmail.com> <7265d4f0904240831p378a467dx34152951a3f16c52@mail.gmail.com> <320fb6e00904240959k78d0805bo469dc9666c70d3c0@mail.gmail.com> Message-ID: <320fb6e00904260358i424a6436v24f21e928fffc073@mail.gmail.com> On Fri, Apr 24, 2009 at 5:59 PM, Peter wrote: > On Fri, Apr 24, 2009 at 4:31 PM, Cymon Cox wrote: >> Rather than add another variation 
on _Option, or alter _OptionAlt to cover >> "-param=value", and as we only have a few command line interfaces at >> present, I'd like to suggest the following simplification to _Option: >> ... >> ie. add an equate flag > > That looks very sensible. If there are no counter suggestions, I > think that could be checked in :)

The equate argument is now in CVS. One catch was that the old code used an equals sign on options starting "--", e.g. "--append=yes", but not on short options starting "-", e.g. "-append yes" (a bit of magic based on the behaviour of typical Unix tools?). From a grep for "_Option", the only files concerned are:

AlignAce/Applications.py
Application/__init__.py
Blast/Applications.py
Emboss/Applications.py
Motif/Applications/AlignAce.py

And from looking at these, they all use options with a single leading dash, so for backwards compatibility I set equate to False by default (not True as in your outlined code). Does this work for you, Cymon?

Peter From bartek at rezolwenta.eu.org Sun Apr 26 12:08:51 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Sun, 26 Apr 2009 14:08:51 +0200 Subject: [Biopython-dev] history on github - where are the tags? 
In-Reply-To: <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com> <320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com> <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com> <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com> <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com> <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com> <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com> <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com> <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> Message-ID: <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> Hi all, On Wed, Apr 22, 2009 at 11:23 AM, Peter wrote: > > If you can fix the current git hub repository, great. >

I've finally found some time to fix the tag issue in our repository. I've actually spent some time looking at git-rebase (learned a lot, but nothing useful for our problem). Then I realised that since tags are just references to commits, we need to move them to the trunk (instead of re-basing the trunk).

Long story short - assuming you are in the directory of your git repo you can fix any particular tag with a single command. For example, if you want to fix the biopython-149 tag, you do:

git tag -f biopython-149 biopython-149~1

The -f option enforces the replacement of the existing label, while biopython-149~1 references the parent commit of our empty tag commit (you can also use ~2 for a grandparent and so on).

You can see the effect of this procedure (as seen in gitx -- a very nice tool) in the attached images.

If you want to fix all biopython tags, you simply do:

for t in `git tag|grep biopython`; do git tag -f $t $t~1; done

It works locally, and the changes can be pushed back to github (this needs --tags -f to force the tag renames); I've done this on my branch of biopython on github.
If there are no objections to the way tags are handled, I can try to update the trunk. This is a bit tricky, because I need to make the update scripts work nicely with moving the tags, but it should be doable. > The old conversion's deletion is still in progress, it must have stalled: > http://support.github.com/discussions/repos/485-reposiotry-stuck-in-rename Seems to be gone now. that's one problem less :) > > If we can fix the tags, great. ?If we can also remap the authors to > their git usernames, even better. > This is doable in the current setup. I don't know whether we need to do this. The old commits are signed by the same credentials (name, e-mail) as on CVS server. If we start re-mapping them now, we are going to have essentially a new commit history, so everybody would need to rebase their branches... I don't see a problem of having old commits signed with old e-mails, and new commits signed by new. Especially, that everybody can have multiple e-mails assigned to their github account (that's how I did with mine). cheers Bartek -------------- next part -------------- A non-text attachment was scrubbed... Name: before.png Type: image/png Size: 18855 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: after.png Type: image/png Size: 15150 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Sun Apr 26 12:29:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 26 Apr 2009 13:29:01 +0100 Subject: [Biopython-dev] history on github - where are the tags? 
In-Reply-To: <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com> <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com> <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com> <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com> <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com> <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com> <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com> <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> Message-ID: <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> On Sun, Apr 26, 2009 at 1:08 PM, Bartek Wilczynski wrote: > Hi all, > > On Wed, Apr 22, 2009 at 11:23 AM, Peter wrote: >> >> If you can fix the current git hub repository, great. >> > I 've finally found some time to fix the tag issue in our repository. > I've actually spent some time looking at git-rebase (learned a lot, > but nothing useful for our problem. Then I realised that since tags > are just references to commits, we need to move them to the trunk > (instead of re-basing the trunk). > > Long story short - assuming you are in the directory of your git repo > you can fix any particular tag wit a single command. E.g. If you want > to fix the biopython-149 tag, you do: > git tag -f biopython-149 biopython-149~1 > > -f option enforces the replacement of the existing label, while > biopython-149~1 references the parent commit of our empty tag commit > (you can also use ~2 for a grand parent and so on). > > You can see the effect of this procedure (as seen in gitx -- a very > nice tool) in the attached images. 
> > If you want to fix all biopython tags, you simply do: > > for t in `git tag|grep biopython`; do git tag -f $t $t~1; done > > It works locally, the changes can be pushed back to github (need > --tags -f to force tag renames), > I've done this on my branch of biopython on github. > > If there are no objections to the way tags are handled, I can try to > update the trunk. > This is a bit tricky, because I need to make the update scripts work > nicely with moving > the tags, but it should be doable. I say give this a go - fingers crossed :) >> The old conversion's deletion is still in progress, it must have stalled: >> http://support.github.com/discussions/repos/485-reposiotry-stuck-in-rename > > Seems to be gone now. that's one problem less :) Great. I did have reminded them, but they solved it. >> >> If we can fix the tags, great. If we can also remap the authors to >> their git usernames, even better. >> > This is doable in the current setup. I don't know whether we need to > do this. The old commits > are signed by the same credentials (name, e-mail) as on CVS server. If > we start re-mapping them > now, we are going to have essentially a new commit history, so > everybody would need to rebase their > branches... I don't see a problem of having old commits signed with > old e-mails, and new commits > signed by new. Especially, that everybody can have multiple e-mails > assigned to their github account > (that's how I did with mine). That would be simpler. I'll have to try on my github account... Peter From biopython at maubp.freeserve.co.uk Sun Apr 26 12:46:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 26 Apr 2009 13:46:55 +0100 Subject: [Biopython-dev] Properties in Bio.Application interface? Message-ID: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com> On Thu, Apr 23, 2009 at 10:29 AM, Peter wrote: > What about this bit I wrote earlier: >>> ... 
We might want to discuss extending the AbstractCommandline >>> __init__ method to take **kwargs, allowing the parameters to be >>> set like this: >>> >>> from Bio import Application >>> from Bio.Align.Applications import MafftCommandline >>> cmd = MafftCommandline(input="sample.fa", ...) >>> return_code, std_handle, err_handle = Application.generic_run(cmd) >>> >>> I'm not sure how well this would work in practice as the range of >>> valid argument names in python may not overlap with the valid >>> parameter names. > > We'll have to see how well the above idea works in practice - it > may not be general enough to be useful. > > Also, perhaps we can automatically generate properties for each > argument allowing this: > > cmd.input = "sample.fa" > > rather than: > > cmd.set_parameter("input", "sample.fa") > > For the "switch" type arguments which take no value, if these are > implemented with a separate option class (maybe _Switch or > _OptionNoValue) then rather than: > > cmd.set_parameter("noanchors") > > we might want to do: > > cmd.noanchors = True > > and allow the switch to be removed with: > > cmd.noanchors = False > > i.e. For those arguments which take no argument (is "switch" the > right term here?), evaluate the property set value as a boolean to > add/remove -noanchors from the command line string. > > I think using properties in this way could make the command line > object more intuitive, but again python puts limits on property names > which might mean for some arguments you'd have to use the > set_parameter version. > > Peter > I have been cleaning up the existing Bio.Application command line objects in CVS to follow the parameter alias convention already laid out in Bio.Application. That is, they all now have human readable parameter aliases, which are also valid python identifiers. This means these "human readable names" can also be used for argument names in __init__ (using **kwargs), or as property names.
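As a rough sketch of how **kwargs and properties might both be driven by the human readable names: the class SketchCommandline, its parameter table, and the "--in" flag below are hypothetical and purely illustrative; this is not the Bio.Application code.

```python
class SketchCommandline(object):
    """Hypothetical stand-in for a command line wrapper."""

    # (human readable name, flag on the command line); illustrative only.
    parameter_names = [("input", "--in"), ("maxiterate", "--maxiterate")]

    def __init__(self, program, **kwargs):
        self.program = program
        self._values = {}
        known = dict(self.parameter_names)
        for name, value in kwargs.items():
            if name not in known:
                raise ValueError("Unknown parameter %r" % name)
            # Assignment goes through the generated property.
            setattr(self, name, value)

    def set_parameter(self, name, value):
        self._values[name] = value

    def __str__(self):
        parts = [self.program]
        # Emit parameters in table order, skipping any that are unset.
        for name, flag in self.parameter_names:
            if name in self._values:
                parts.append("%s %s" % (flag, self._values[name]))
        return " ".join(parts)


def _make_property(name):
    """Build a property that reads and writes one named parameter."""
    def getter(self):
        return self._values.get(name)

    def setter(self, value):
        self.set_parameter(name, value)

    return property(getter, setter)


# Attach one property per human readable name.
for _name, _flag in SketchCommandline.parameter_names:
    setattr(SketchCommandline, _name, _make_property(_name))

cmd = SketchCommandline("mafft", input="sample.fa")
cmd.maxiterate = 1000
```

Both routes end up in set_parameter, so any validation done there would apply however the value is supplied. Names that are not valid python identifiers would still need the explicit set_parameter call.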
I think I've got properties working now as an experiment on my machine, generated at run time using the "human readable name" for each parameter. We would need to special case "switch" arguments (i.e. those which take no value) as outlined above. Does this sound worthwhile? If so, I'll put together an enhancement bug with a patch, or a branch on github. Peter From bugzilla-daemon at portal.open-bio.org Sun Apr 26 13:45:47 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 26 Apr 2009 09:45:47 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line interface In-Reply-To: Message-ID: <200904261345.n3QDjlkm022449@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 cymon.cox at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1286 is|0 |1 obsolete| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Apr 26 13:49:44 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 26 Apr 2009 09:49:44 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904261349.n3QDniuI022654@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 cymon.cox at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Bio.Application MUSCLE |Bio.Application command line |command line interface |interfaces ------- Comment #6 from cymon.cox at gmail.com 2009-04-26 09:49 EST ------- (Change title of this bug.) 
Now tracking github branch: http://github.com/cymon/biopython-github-master/tree/applic-int Added command line interfaces for: MUSCLE, MAFFT, DALIGN, PRANK To do: Clustalw, T-coffee C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Sun Apr 26 17:22:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 26 Apr 2009 18:22:43 +0100 Subject: [Biopython-dev] Bio.EMBOSS wrappers In-Reply-To: <320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com> References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com> <20090413123219.GB5429@sobchak.mgh.harvard.edu> <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com> <20090413134429.GE5429@sobchak.mgh.harvard.edu> <320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com> Message-ID: <320fb6e00904261022l43f799a8g8729a47ba15042f8@mail.gmail.com> On Mon, Apr 13, 2009 at 2:49 PM, Peter wrote: > On Mon, Apr 13, 2009 at 2:44 PM, Brad Chapman wrote: >>> > ... Feel free to add away. >>> >>> I need to work on my delegation skills - that seems to have back fired ;) >> >> Oops. I honestly read that as "do I have your permission?" I can of >> course tackle this, but am a bit underwater now. > > Looking back, I was a bit ambiguous. I don't mind who does it - let's > see who has time free first. OK, I've added a minimal needle wrapper based on the water wrapper. As part of this I remove the -nosimilarity option which doesn't work on the current versions of EMBOSS needle and water (5.0 or 6.0). For -auto and -filter, I think we probably should extend the parameter classes to explicitly cover these switch arguments which take no value (they are either part of the command line, or omitted). We've touched on this already on Cymon's thread... 
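A minimal sketch of what such a switch parameter class could look like, assuming a hypothetical SketchSwitch class (the real parameter classes in Bio.Application have a different constructor and more attributes):

```python
class SketchSwitch(object):
    """Hypothetical parameter that is either present or absent."""

    def __init__(self, name):
        self.name = name
        self.is_set = False

    def __str__(self):
        # Emit the bare flag when set, and nothing at all otherwise.
        if self.is_set:
            return "-%s " % self.name
        return ""


auto = SketchSwitch("auto")
filter_switch = SketchSwitch("filter")
auto.is_set = True
command = "needle " + str(auto) + str(filter_switch)
```

Here only -auto ends up in the command string, since the filter switch was never turned on.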
Peter From bugzilla-daemon at portal.open-bio.org Sun Apr 26 20:09:51 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 26 Apr 2009 16:09:51 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904262009.n3QK9p8U011039@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #7 from cymon.cox at gmail.com 2009-04-26 16:09 EST ------- Added CLUSTALW Bio.Application command line interface C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Apr 27 09:58:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Apr 2009 10:58:52 +0100 Subject: [Biopython-dev] main page on wiki In-Reply-To: <320fb6e00904230216l4b2cd91k513baa3f7a08172c@mail.gmail.com> References: <49EFCF07.2050502@student.otago.ac.nz> <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com> <49EFE553.6070405@gmail.com> <320fb6e00904230216l4b2cd91k513baa3f7a08172c@mail.gmail.com> Message-ID: <320fb6e00904270258s523c49a1j1bfc5d4a12ca86a9@mail.gmail.com> On Thu, Apr 23, 2009 at 10:16 AM, Peter wrote: > > If there are no counter comments, I'll put David's changes up later > today or tomorrow. > OK - make that a couple of days later ;) This isn't exactly as in David's draft - I shortened some of the link text and omitted a couple of links under "Contribute" which seemed unnecessary on the home page. I've also kept the final line giving the latest release and date (although the text is shorter now). Brad commented (off list?) that having this is a good indicator of the project's activity, and I agree. Alternatively, I'd like to try having dates on the news feed, but the media wiki plugin needs to be updated for that to work... 
Peter From bugzilla-daemon at portal.open-bio.org Mon Apr 27 12:23:00 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 27 Apr 2009 08:23:00 -0400 Subject: [Biopython-dev] [Bug 2671] Including GenomeDiagram in the main Biopython distribution In-Reply-To: Message-ID: <200904271223.n3RCN0GL009972@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2671 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #34 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-27 08:23 EST ------- (In reply to comment #25) > OK, GenomeDiagram is now in CVS, with some basic tests. Still to do: > > * Updating the existing GenomeDiagram manual to match (different imports, > colour to color), which I think can stay as a separate PDF file. Leighton can do that... > * A short introduction to Bio.Graphics including GenomeDiagram as part > of a new chapter in the tutorial? Done. (In reply to comment #33) > Plus (as pointed out on Bug 2711 / Bug 2710): > > * Updating the installation instructions so that the ReportLab > section also covers renderPM (needed for bitmaps). Done. Marking this bug fixed as of Biopython 1.50. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. 
From bugzilla-daemon at portal.open-bio.org Mon Apr 27 14:12:51 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 27 Apr 2009 10:12:51 -0400 Subject: [Biopython-dev] [Bug 2821] NCBIXML.parse only returns results for non-empty hits rather than one per query sequence In-Reply-To: Message-ID: <200904271412.n3RECpZC019165@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2821 camilla at ip.id.au changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from camilla at ip.id.au 2009-04-27 10:12 EST ------- Hi Peter Thanks for the suggestions. In the end, I realised that b_record.query contains the header line of the query sequence all along, so there is no real bug here, just my misunderstanding of what information is stored where. I think this issue can be closed. For anyone else out there with similar problems, if you aren't certain what data is in an object, you can use the dir() function to list them all. Thanks again Camilla -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Apr 27 15:59:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Apr 2009 16:59:03 +0100 Subject: [Biopython-dev] Installation documentation Message-ID: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> I've made some updates to Installation.tex, which I think are an improvement over the version shipped with Biopython 1.50 and currently online. I think we could update these files now: http://biopython.org/DIST/docs/install/Installation.html http://biopython.org/DIST/docs/install/Installation.pdf Does that seem sensible? Before that, would anyone like to proof read the text in CVS, or make further updates? 
For example, are the bits on FreeBSD, Fink and RPMs still valid? Peter From p.j.a.cock at googlemail.com Mon Apr 27 16:09:57 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 27 Apr 2009 17:09:57 +0100 Subject: [Biopython-dev] Rolling new releases In-Reply-To: <320fb6e00904230658y310609c8l89bf27c33bd56d56@mail.gmail.com> References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com> <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com> <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com> <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com> <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com> <20090421122045.GD30529@sobchak.mgh.harvard.edu> <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com> <20090423125356.GE34546@sobchak.mgh.harvard.edu> <320fb6e00904230658y310609c8l89bf27c33bd56d56@mail.gmail.com> Message-ID: <320fb6e00904270909y35ebc841yd2074d6970b71fe4@mail.gmail.com> On Thu, Apr 23, 2009 at 2:58 PM, Peter Cock wrote: > The complicated bit is getting the code and documentation in CVS > ready, and that is harder to delegate. Once that is done though, the > actual release process is fairly straightforward - as documented here > - and could be delegated to anyone methodical with suitably setup > development machine(s): > http://biopython.org/wiki/Building_a_release > Maybe some of the release process could be automated literally as a > script - but doing each step methodically by hand and checking as you > go is wise.
On the bright side, after dropping Martel the "Building a release" instructions will get a little shorter :) Peter From biopython at maubp.freeserve.co.uk Mon Apr 27 16:26:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Apr 2009 17:26:37 +0100 Subject: [Biopython-dev] Removing Bio.Mindy and Martel In-Reply-To: <320fb6e00904251630t43ec275ehd25906476c6afe18@mail.gmail.com> References: <320fb6e00904251630t43ec275ehd25906476c6afe18@mail.gmail.com> Message-ID: <320fb6e00904270926l7e2db7e0x21a7bde1e47af4b0@mail.gmail.com> On Sun, Apr 26, 2009 at 12:30 AM, Peter wrote: > > Does anyone have any objections about us pressing ahead with removing > Martel and these modules now? > Well I hope not, as I've just made the changes in CVS. Note that I have not deleted all the files in the Martel folder, but simply excluded Martel from setup.py. Peter From biopython at maubp.freeserve.co.uk Mon Apr 27 16:28:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Apr 2009 17:28:15 +0100 Subject: [Biopython-dev] Installation documentation In-Reply-To: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> Message-ID: <320fb6e00904270928w26fb0f1axbb7be88188d0355f@mail.gmail.com> On Mon, Apr 27, 2009 at 4:59 PM, Peter wrote: > I've made some updates to Installation.tex, which I think are an > improvement over the version shipped with Biopython 1.50 and currently > online. I think we could update these files now: > > http://biopython.org/DIST/docs/install/Installation.html > http://biopython.org/DIST/docs/install/Installation.pdf > > Does that seem sensible? Before that, would anyone like to proof read > the text in CVS, or make further updates? For example, are the bits > on FreeBSD, Fink and RPMs still valid? If we are going to update the online version, I'll refrain from removing the mxTextTools bit from Installation.tex for the time being.
Peter From biopython at maubp.freeserve.co.uk Mon Apr 27 16:37:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Apr 2009 17:37:09 +0100 Subject: [Biopython-dev] history on github - where are the tags? In-Reply-To: <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com> <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com> <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com> <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com> <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com> <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com> <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> Message-ID: <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com> > On Sun, Apr 26, 2009 at 1:08 PM, Bartek Wilczynski >>> >>> If we can fix the tags, great. If we can also remap the authors to >>> their git usernames, even better. >>> >> This is doable in the current setup. I don't know whether we need to >> do this. The old commits are signed by the same credentials (name, >> e-mail) as on CVS server. From looking at git log, they just have our CVS username, e.g. Author: peterc i.e. No email address >> If we start re-mapping them now, we are going to have essentially a >> new commit history, so everybody would need to rebase their >> branches... I don't see a problem of having old commits signed with >> old e-mails, and new commits signed by new. Especially, that >> everybody can have multiple e-mails assigned to their github >> account (that's how I did with mine). > > That would be simpler. I'll have to try on my github account...
> Given we don't have email addresses embedded in the old commits, do you think this is going to be possible (without changing the repository)? Peter From chapmanb at 50mail.com Tue Apr 28 12:41:20 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 28 Apr 2009 08:41:20 -0400 Subject: [Biopython-dev] Installation documentation In-Reply-To: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> Message-ID: <20090428124119.GV34546@sobchak.mgh.harvard.edu> Hi Peter; > I've made some updates to Installation.tex, which I think are an > improvement over the version shipped with Biopython 1.50 and currently > online. I think we could update these files now: > > http://biopython.org/DIST/docs/install/Installation.html > http://biopython.org/DIST/docs/install/Installation.pdf > > Does that seem sensible? Before that, would anyone like to proof read > the text in CVS, or make further updates? For example, are the bits > on FreeBSD, Fink and RPMs still valid? The FreeBSD port is out of date now, so I commented that section out and replaced it with a section on using easy_install. This also reminded me that I needed to update the version on the Python Package Index. I added a note to the release details to do this; oh man, another step. Peter, if you have an account on pypi, let me know your login and I can add you as an owner for Biopython.
Brad From p.j.a.cock at googlemail.com Tue Apr 28 13:36:37 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 Apr 2009 14:36:37 +0100 Subject: [Biopython-dev] [Biopython] Parsing large blast files In-Reply-To: <627305.69090.qm@web62401.mail.re1.yahoo.com> References: <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com> <627305.69090.qm@web62401.mail.re1.yahoo.com> Message-ID: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com> On Tue, Apr 28, 2009 at 2:00 PM, Michiel de Hoon wrote: >> NCBIStandalone.Iterator() is the old semi-obsolete plain >> text parser - it won't parse the XML output, hence the >> "Invalid header" error. Maybe the tutorial >> (or the error message) could be clearer. > > I think part of the problem is the organization of the code in Bio.Blast, > which seems to have grown historically. Bio.Blast.NCBIStandalone > contains blastall, blastpgp, and rpsblast, which makes sense, but also > BlastParser and PsiBlastParser, which are not necessarily connected > to standalone Blast. Bio.Blast.ParseBlastTable contains the parser for > blastpgp output. Bio.Blast.NCBIWWW contains qblast, but also the > parser for Blast HTML output, though qblast does not necessarily > generate output in HTML format. I presumed that initially the standalone tools only produced plain text, and the website (qblast) only produced HTML - hence the use of Bio.Blast.NCBIStandalone for both command line wrappers AND the plain text parser, and Bio.Blast.NCBIWWW for both the qblast function AND the HTML parser. > The usage of this module may be more understandable if all functions > were accessible from Bio.Blast directly in a fashion more consistent > with current Biopython. Bio.Blast would then have the following functions: > > read(handle, format='xml') > parse(handle, format='xml') > blastall > blastpgp > rpsblast > qblast > > with most of the actual code hiding in Bio.Blast.NCBIStandalone etcetera. > > Any objections, comments?
I do like the idea of moving/importing the qblast function directly under Bio.Blast, and perhaps removing Bio.Blast.NCBIXML later on. For read/parse functions, we should probably call the format "blastxml" to match BioPerl. Would you continue to support the plain text output here? Also something to keep in mind is there may be non-NCBI variants of BLAST with their own formats as well. Rather than continuing to encourage the use of blastall, blastpgp and rpsblast I would rather bring Bio.Blast.Applications up to date, and then declare them obsolete . These three "helper" functions are very limiting in how the command line is invoked - you can't choose the exact call used (e.g. subprocess options) or what you want back (e.g. you may not care about the handles). For example, getting BLAST to write its output to a file is confusingly difficult right now using these functions. Also, dealing with errors isn't nice. Peter From biopython at maubp.freeserve.co.uk Tue Apr 28 13:40:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Apr 2009 14:40:43 +0100 Subject: [Biopython-dev] Installation documentation In-Reply-To: <20090428124119.GV34546@sobchak.mgh.harvard.edu> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> <20090428124119.GV34546@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com> On Tue, Apr 28, 2009 at 1:41 PM, Brad Chapman wrote: > > The FreeBSD port is out of date now, so I commented that section out > and replaced it with a section on using easy_install. This also > reminded me that I needed to update the version on the Python > Package Index. I added a note to the release details to do this; oh > man, another step. Well, easy_install isn't (yet) an official python standard so I hadn't previously worried about it - our wiki Downloads page does mention it. 
Frankly the less "official" ways there are to install, the less ways it can go wrong, and then the less questions need to be asked when it goes wrong. Nor had I worried about how PyPi's listing might need to be updated. I assumed it was clever enough to scan the http://biopython.org/DIST/ directory and parse the filenames. Is the real answer you (Brad) kept it up to date? http://pypi.python.org/pypi/biopython/ > Peter, if you have an account on pypi, let me know your login and I > can add you as an owner for Biopython. I don't have an account on pypi. Peter From bugzilla-daemon at portal.open-bio.org Tue Apr 28 15:04:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 11:04:09 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281504.n3SF49so024149@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 11:04 EST ------- I've checked the MUSCLE wrapper into CVS, and added the -diags option. I also created test_Muscle_tool.py which requires MUSCLE be installed, and checks we can invoke it and parse its clustal output OK. A more general alignment wrapper unit test can simply construct some command line objects and check them against an expected string (without requiring the tools to be installed). Note that I am concerned about the file exists check on the input file argument. This is helpful, but also prevents certain reasonable usage examples - e.g. the input file is created on the fly and doesn't exist yet, or, the command line constructed will be submitted to a cluster where the path will be valid (even if the path isn't valid on the local machine where Biopython is running). Also, perhaps we should think about Bio.Application including automatic quoting for filenames with spaces in them...
see the _escape_filename function used in Bio.Clustalw and Bio.Blast.NCBIStandalone. This would be only for parameters explicitly tagged as filenames. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 28 15:25:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 11:25:20 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281525.n3SFPKbd025807@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #9 from cymon.cox at gmail.com 2009-04-28 11:25 EST ------- (In reply to comment #8) > I've checked the MUSCLE wrapper into CVS, and added the -diags option. You pulled this from the applic-int branch yes? (Hmm, missed that -diags...) I also > created test_Muscle_tool.py which requires MUSCLE be installed, and checks we > can invoke it and parse its clustal output OK. Ive also just checked in (to the github branch) some unittests for MUSCLE, MAFFT, PRANK, and DIALIGN. These just check that the app runs and the return code is 0 - a few other checks are made but not much else. > A more general alignment wrapper unit test can simply construct some command > line objects and check them against an expected string (without requiring the > tools to be installed). I will do these - all in one test_ApplicationCommandlines.py unittest suite. > Note that I am concerned about the file exists check on the input file > argument. This is helpful, but also prevents certain reasonable usage examples > - e.g. the input file is created on the fly and doesn't exist yet, or, the > command line constructed will be submitted to a cluster where the path will be > valid (even if the path isn't valid on the local machine where Biopython is > running). 
Good point. Perhaps the os.path.exists on input files needs to be dropped from all wrappers. > > Also, perhaps we should think about Bio.Application including automatic quoting > for filenames with spaces in them... see the _escape_filename function used in > Bio.Clustalw and Bio.Blast.NCBIStandalone. This would be only for parameters > explicitly tagged as filenames. Yes, I thought about doing that but havent acted. C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 28 15:44:06 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 11:44:06 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281544.n3SFi61M027248@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 11:44 EST ------- (In reply to comment #9) > (In reply to comment #8) > > I've checked the MUSCLE wrapper into CVS, and added the -diags option. > > You pulled this from the applic-int branch yes? (Hmm, missed that -diags...) Yes. I spotted the -diags because it is an example given if you just run "muscle". > Ive also just checked in (to the github branch) some unittests for MUSCLE, > MAFFT, PRANK, and DIALIGN. These just check that the app runs and the return > code is 0 - a few other checks are made but not much else. I'll look at that. > > A more general alignment wrapper unit test can simply construct some command > > line objects and check them against an expected string (without requiring > > the tools to be installed). > > I will do these - all in one test_ApplicationCommandlines.py unittest suite. Sounds good. Maybe just test_AlignApps.py if it is just for Bio.Align.Applications? 
> > Note that I am concerned about the file exists check on the input file > > argument. This is helpful, but also prevents certain reasonable usage > > examples - e.g. the input file is created on the fly and doesn't exist > > yet, or, the command line constructed will be submitted to a cluster > > where the path will be valid (even if the path isn't valid on the local > > machine where Biopython is running). > > Good point. Perhaps the os.path.exists on input files needs to be dropped > from all wrappers. Maybe - I dropped most of them from the Muscle and Clustalw ones. The matrix arguments are a little trickier, where the argument can be either a special word or a filename. See below for a related issue ... > > Also, perhaps we should think about Bio.Application including automatic > > quoting for filenames with spaces in them... see the _escape_filename > > function used in Bio.Clustalw and Bio.Blast.NCBIStandalone. This would > > be only for parameters explicitly tagged as filenames. > > Yes, I thought about doing that but havent acted. > Another issue is any file exists check needs to be aware that filenames may be quoted (due to containing spaces). i.e. A simple call to os.path.isfile(...) won't work. I've integrated your Clustalw wrapper into CVS, and in order to extend my existing unit tests to use this with spaces in file names, I was forced to drop the existence check. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
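A rough sketch of the quoting issue discussed in comment #10 above. The helper names escape_filename, unquote_filename and input_file_exists are hypothetical and written for illustration; this is not a copy of the _escape_filename function in Bio.Clustalw, only an assumption about the general approach (wrap names containing spaces in double quotes, and strip the quotes again before any existence check).

```python
import os


def escape_filename(filename):
    """Wrap a filename in double quotes if it contains a space.

    Names without spaces, and names that are already quoted,
    are returned unchanged.
    """
    if " " not in filename:
        return filename
    if filename.startswith('"') and filename.endswith('"'):
        return filename
    return f'"{filename}"'


def unquote_filename(filename):
    """Remove one pair of surrounding double quotes, if present."""
    if len(filename) >= 2 and filename.startswith('"') and filename.endswith('"'):
        return filename[1:-1]
    return filename


def input_file_exists(filename):
    """Existence check that tolerates a quoted filename.

    Calling os.path.isfile() on the quoted string would look for a
    file whose name literally includes the quote characters.
    """
    return os.path.isfile(unquote_filename(filename))


if __name__ == "__main__":
    quoted = escape_filename("my alignment.fasta")
    print(quoted, input_file_exists(quoted))
```

Note this only addresses the quoting problem; it does not help when the file is created later or only exists on a cluster node, which is the argument for dropping the check altogether.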
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 16:18:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 12:18:11 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281618.n3SGIBPl029571@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #11 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 12:18 EST ------- (In reply to comment #9) > > Ive also just checked in (to the github branch) some unittests for MUSCLE, > MAFFT, PRANK, and DIALIGN. These just check that the app runs and the return > code is 0 - a few other checks are made but not much else. > I think I was looking at your master branch, rather than the applic-int branch: http://github.com/cymon/biopython-github-master/commits/applic-int I see the changes now... > > Also, perhaps we should think about Bio.Application including automatic > > quoting for filenames with spaces in them... see the _escape_filename > > function used in Bio.Clustalw and Bio.Blast.NCBIStandalone. This would > > be only for parameters explicitly tagged as filenames. > > Yes, I thought about doing that but havent acted. Seeing as we both think this makes sense, I've done that in CVS. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 16:30:53 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 12:30:53 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281630.n3SGUrLn030516@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #12 from cymon.cox at gmail.com 2009-04-28 12:30 EST ------- (In reply to comment #11) > (In reply to comment #9) > > > > Ive also just checked in (to the github branch) some unittests for MUSCLE, > > MAFFT, PRANK, and DIALIGN. These just check that the app runs and the return > > code is 0 - a few other checks are made but not much else. > > > > I think I was looking at your master branch, rather than the applic-int branch: > http://github.com/cymon/biopython-github-master/commits/applic-int > I see the changes now... In those unittests, you'll note that I have no idea about the Windows environment! (don't use Windows, never have used Windows). I just copied from the Emboss wrapper... C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 28 16:39:20 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 12:39:20 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281639.n3SGdKcO030951@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #13 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 12:39 EST ------- (In reply to comment #11) > In those unittests, you'll note that I have no idea about the Windows > environment! (don't use Windows, never have used Windows). I just copied > from the Emboss wrapper... > > C.
That explains things :) I had already guessed you hadn't run any of these tests on Windows, because the executable isn't recorded properly, and even if it was, you never use it when creating the command line objects:
#Don't do this if you want to actually run the application, as
#it would only work on Unix where the command is on the path:
#cmdline = MafftCommandline()
#Instead, use the exe name we determined earlier:
cmdline = MafftCommandline(mafft_exe)
The EMBOSS installer is nice and *does* setup EMBOSS_ROOT, which is why test_Emboss.py looks for it. However, for test_Clustalw_tool.py I just made a list of the default install locations, and check them. There is no environment variable! I haven't looked at the documentation but I would be pleasantly surprised if a MAFFT_ROOT environment variable was setup by the default method of installing MAFFT on Windows (and similarly for the other tools). If the tools do record their install location in the registry, we can do a win32api call to get the path. Then if win32api isn't installed just raise the MissingExternalDependencyError exception. If you look at my test_Muscle_tool.py in CVS, you'll see I haven't yet determined how best to try and locate MUSCLE. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
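As a sketch of the "list of default install locations" approach mentioned in comment #13 above: the function find_executable, its windows_folders argument, and the folder names in the usage example are all hypothetical, and the exception class here is only a local stand-in for the MissingExternalDependencyError used by the Biopython tests. It does not attempt the registry lookup.

```python
import os
import shutil
import sys


class MissingExternalDependencyError(Exception):
    """Local stand-in for the exception the test suite raises to skip a test."""


def find_executable(name, windows_folders=(), env=None):
    """Return a path for the tool, or raise MissingExternalDependencyError.

    On Windows, look in the given folders under the program files
    directory first; otherwise (or if that fails) search the PATH.
    """
    env = os.environ if env is None else env
    if sys.platform == "win32":
        prog_files = env.get("PROGRAMFILES", r"C:\Program Files")
        for folder in windows_folders:
            candidate = os.path.join(prog_files, folder, name + ".exe")
            if os.path.isfile(candidate):
                return candidate
    found = shutil.which(name)
    if found is None:
        raise MissingExternalDependencyError(
            "Could not find %s, is it installed?" % name)
    return found


if __name__ == "__main__":
    try:
        # The folder names below are guesses for illustration only.
        exe = find_executable("muscle", windows_folders=("muscle", "MUSCLE"))
        print("Using", exe)
    except MissingExternalDependencyError as err:
        print("Skipping test:", err)
```

The returned path is what should then be passed to the command line object, rather than relying on the default executable name.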
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 16:54:32 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 12:54:32 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281654.n3SGsWBX032007@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #14 from cymon.cox at gmail.com 2009-04-28 12:54 EST ------- (In reply to comment #13) > (In reply to comment #11) > > In those unittests, you'll note that I have no idea about the windows > > environment! (dont use window, never have used windows). I just copied > > from the Emboss wrapper... > > > > C. > > That explains things :) > > I had already guessed you hadn't run any of these tests on Windows, because the > executable isn't recorded properly, and even if it was, you never use it when > creating the command line objects: > > #Don't do this if you want to actually run the application, as > #it would only work on Unix where the command is on the path: > #cmdline = MafftCommandline() > #Instead, use the exe name we determined earlier: > cmdline = MafftCommandline(mafft_exe) OK, thanks, I'll update the wrappers on my branch. > If the tools do record their install location in the registry, we can do a > win32api call to get the path. > Then if win32api isn't installed just raise the > MissingExternalDependencyError exception. We can? ;) Can you give me some code, or I could just use this in the meantime: if sys.platform=="win32" : raise MissingExternalDependencyError("Testing with MUSCLE not implemented on Windows yet") C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 17:07:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 13:07:11 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281707.n3SH7BTG000522@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 13:07 EST ------- (In reply to comment #14) > > #Don't do this if you want to actually run the application, as > > #it would only work on Unix where the command is on the path: > > #cmdline = MafftCommandline() > > #Instead, use the exe name we determined earlier: > > cmdline = MafftCommandline(mafft_exe) > > OK, thanks, I'll update the wrappers on my branch. Have a look at this test_Muscle_tool.py CVS revision 1.4 first: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Tests/test_Muscle_tool.py?cvsroot=biopython > > If the tools do record their install location in the registry, > > we can do a win32api call to get the path. Then if win32api > > isn't installed just raise the MissingExternalDependencyError > > exception. > > We can? ;) > > Can you give me some code, ... There are a lot of ifs here. The code is fairly simple (I've done this kind of thing before, but can't find an example right away). The catch is establishing IF the information we want gets written to the registry during the tool installation or not. > or I could just use this in the meantime: > if sys.platform=="win32" : > raise MissingExternalDependencyError("Testing with MUSCLE not implemented > on Windows yet") Yeah - use something like that, but be aware that the tests shouldn't assume that the executable name is just "muscle". Hopefully test_Muscle_tool.py does this right... 
Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Apr 28 17:32:14 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 13:32:14 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281732.n3SHWEip002274@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #16 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 13:32 EST ------- (In reply to comment #15) > > Yeah - use something like that, but be aware that the tests shouldn't assume > that the executable name is just "muscle". Hopefully test_Muscle_tool.py does > this right... > Well, it does now. I've got test_Muscle_tool.py to run on Windows, assuming the user chooses to put MUSCLE under the program files directory in a reasonably predictable folder. Given the MUSCLE installation process on Windows is entirely manual, we can't really do anything else. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Apr 28 17:45:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Apr 2009 18:45:01 +0100 Subject: [Biopython-dev] history on github - where are the tags? 
In-Reply-To: <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com> <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com> <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com> <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com> <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com> <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com> Message-ID: <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com> On Mon, Apr 27, 2009 at 5:37 PM, Peter wrote: >> On Sun, Apr 26, 2009 at 1:08 PM, Bartek Wilczynski >>>> >>>> If we can fix the tags, great. If we can also remap the authors to >>>> their git usernames, even better. >>>> >>> This is doable in the current setup. I don't know whether we need to >>> do this. The old commits are signed by the same credentials (name, >>> e-mail) as on CVS server. > > From looking at git log, they just have our CVS username, e.g. > Author: peterc > i.e. No email address > >>> If we start re-mapping them now, we are going to have essentially a >>> new commit history, so everybody would need to rebase their >>> branches... I don't see a problem of having old commits signed with >>> old e-mails, and new commits signed by new. Especially, that >>> everybody can have multiple e-mails assigned to their github >>> account (that's how I did with mine). >> >> That would be simpler. I'll have to try on my github account... > > Given we don't have email addresses embedded in the old commits, > do you think is this going to be possible (without changing the > repository)?
I take that back - I added an email address of just "peterc" to my github account (it seems they don't do any validation, perhaps for this very reason?). This had no immediate effect, but one day later and all my CVS commits are now shown with my photo in github. Neat - but it makes it much more obvious that I have a tendency to do lots of small commits! Peter From bartek at rezolwenta.eu.org Tue Apr 28 17:50:20 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 28 Apr 2009 19:50:20 +0200 Subject: [Biopython-dev] history on github - where are the tags? In-Reply-To: <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com> References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com> <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com> <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com> <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com> <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com> <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com> <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com> <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com> <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com> <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com> Message-ID: <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com> On Tue, Apr 28, 2009 at 7:45 PM, Peter wrote: > I take that back - I added an email address of just "peterc" to my > github account (it seems they don't do any validation, perhaps for > this very reason?). ?This had no immediate effect, but one day later > and all my CVS commits are now shown with my photo in github. ?Neat - great > but it makes it much more obvious that I have a tendency to do lots of > small commits! 
> That's good practice in git :) cheers Bartek From bugzilla-daemon at portal.open-bio.org Tue Apr 28 17:55:00 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 13:55:00 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281755.n3SHt0FK003782@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #17 from cymon.cox at gmail.com 2009-04-28 13:55 EST ------- (In reply to comment #16) > (In reply to comment #15) > > > > Yeah - use something like that, but be aware that the tests shouldn't assume > > that the executable name is just "muscle". Hopefully test_Muscle_tool.py does > > this right... > > > > Well, it does now. I've got test_Muscle_tool.py to run on Windows, assuming > the user chooses to put MUSCLE under the program files directory in a > reasonably predictable folder. Given the MUSCLE installation process on > Windows is entirely manual, we can't really do anything else. > OK, pushed to applic-int updated unittest for PRANK, MAFFT, and DIALIGN - skipping tests on windows. Also changed the names to test_XXXX_tool.py C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
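The point in comments #15 and #16, that a test should not assume the executable is simply called "muscle" or that it is on the PATH, could be handled along these lines. This is only a sketch: the helper names, the alternative binary name "muscle3.7" and the Program Files folder prefix are assumptions made for illustration, not what test_Muscle_tool.py actually does.

```python
import os
import shutil
import sys


def find_executable(candidates, extra_dirs=()):
    """Return the full path of the first candidate found, or None."""
    for name in candidates:
        # First try the normal PATH lookup.
        path = shutil.which(name)
        if path:
            return path
        # Then look in any extra folders, with and without ".exe".
        for folder in extra_dirs:
            for suffix in ("", ".exe"):
                full = os.path.join(folder, name + suffix)
                if os.path.isfile(full):
                    return full
    return None


def windows_candidate_dirs(folder_prefix):
    """Folders under %ProgramFiles% whose name starts with folder_prefix."""
    if sys.platform != "win32":
        return []
    root = os.environ.get("ProgramFiles", r"C:\Program Files")
    if not os.path.isdir(root):
        return []
    return [os.path.join(root, entry) for entry in os.listdir(root)
            if entry.lower().startswith(folder_prefix.lower())]


# Hypothetical candidate names; a real test would list the names it expects.
muscle_exe = find_executable(["muscle", "muscle3.7"],
                             windows_candidate_dirs("muscle"))
if muscle_exe is None:
    print("MUSCLE not found; a test would be skipped here.")
else:
    print("Would run the tests with", muscle_exe)
```

Returning None rather than raising lets the caller decide whether a missing tool means "skip" or "fail".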
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 18:28:31 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 14:28:31 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904281828.n3SISVTs005955@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #18 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 14:28 EST ------- (In reply to comment #17) > > OK, pushed to applic-int updated unittest for PRANK, MAFFT, and DIALIGN - > skipping tests on windows. > > Also changed the names to test_XXXX_tool.py > > C. Great. In addition to the MUSCLE and ClustalW stuff, I've got the PRANK code and unit tests in CVS now. These three tests all work on a Linux, Mac and Windows machine (with Python 2.4, 2.5 and 2.6). I'm stopping working on this for today. It would be great if you could test a clean checkout from CVS, and we'll resume this merge later on for the remaining tools MAFFT and DIALIGN. Also, would you be able to look into making the Prank test faster to run? Maybe use a smaller example input file? After we do that, I'd like to use it to test the Nexus parser via Bio.AlignIO (just something simple which won't be affected by gap differences between different versions of PRANK - like my tests for MUSCLE). Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
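A check that is not affected by gap differences between tool versions, as asked for in comment #18, might compare only the ungapped sequences. The sketch below is self-contained and uses plain dictionaries; in a real test the aligned records would presumably come from parsing the tool's output (for example with Bio.AlignIO), and the function names here are invented for illustration.

```python
def ungapped(seq, gap_chars="-."):
    """Return the sequence with gap characters removed, upper-cased."""
    return "".join(ch for ch in seq if ch not in gap_chars).upper()


def same_sequences_ignoring_gaps(inputs, aligned):
    """True if both mappings (id -> sequence) hold the same ungapped records."""
    if set(inputs) != set(aligned):
        return False
    return all(ungapped(inputs[key]) == ungapped(aligned[key])
               for key in inputs)


def is_rectangular(aligned):
    """True if every aligned row has the same length."""
    return len({len(seq) for seq in aligned.values()}) == 1


# Two made-up alignments of the same input which place the gap differently,
# as two versions of an alignment tool might.
inputs = {"seq1": "ACGTACGT", "seq2": "ACGACGT"}
aligned_v1 = {"seq1": "ACGTACGT", "seq2": "ACG-ACGT"}
aligned_v2 = {"seq1": "ACGTACGT", "seq2": "AC-GACGT"}

for aligned in (aligned_v1, aligned_v2):
    assert is_rectangular(aligned)
    assert same_sequences_ignoring_gaps(inputs, aligned)
print("Both alignments pass the gap-insensitive checks.")
```

This verifies that the parser returned the right records and a rectangular alignment, without pinning the test to one version's gap placement.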
From biopython at maubp.freeserve.co.uk Tue Apr 28 19:27:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Apr 2009 20:27:41 +0100 Subject: [Biopython-dev] Where to put command line wrappers In-Reply-To: <320fb6e00904231421k3e18d0b1y9003614e906fcb1c@mail.gmail.com> References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com> <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com> <8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com> <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com> <320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com> <20090417140241.GD16092@sobchak.mgh.harvard.edu> <320fb6e00904231421k3e18d0b1y9003614e906fcb1c@mail.gmail.com> Message-ID: <320fb6e00904281227l5e17159g4333fd98d019ad60@mail.gmail.com> On Thu, Apr 23, 2009 at 10:21 PM, Peter wrote: > > OK, what I propose is that the command line objects are exposed as > Bio.Align.Applications.MuscleCommandline, > Bio.Align.Applications.ClustalwCommandline, etc but that the > implementations live in Bio/Align/Applications/_Muscle.py, > _Clustalw.py etc. To do this the Bio/Align/Applications/__init__.py > file will look like this: > > from _Muscle import MuscleCommandline > from _Clustalw import ClustalwCommandline > > This avoids having a single massive file, yet keeps the public > namespace simple. For the user, they do this: > > from Bio.Align.Applications import MuscleCommandline > cline = MuscleCommandline(...) > > or if they prefer, > > from Bio.Align import Applications > cline = Applications.MuscleCommandline(...) > > From the user's point of view all the alignment command line wrapper > objects live together under Bio.Align.Applications. As no one objected or put forward an alternative scheme, Cymon and I have been pressing ahead on Bug 2815 using the above file layout. I have also updated Bio.Motif.Applications to match (this module was deliberately left out of Biopython 1.50 while this issue was settled). 
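The layout described above (private implementation modules, re-exported from the package's __init__.py) can be tried out in isolation. The sketch below builds a throwaway package in a temporary directory; the package name "demo_applications" and the empty class bodies are stand-ins, and the explicit relative imports ("from ._Muscle import ...") are the Python 3 spelling of the implicit relative imports quoted in the email.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway package: demo_applications/{__init__,_Muscle,_Clustalw}.py
pkg_root = tempfile.mkdtemp()
pkg_dir = os.path.join(pkg_root, "demo_applications")
os.mkdir(pkg_dir)

files = {
    "_Muscle.py": "class MuscleCommandline(object):\n"
                  "    program = 'muscle'\n",
    "_Clustalw.py": "class ClustalwCommandline(object):\n"
                    "    program = 'clustalw'\n",
    # The public namespace is assembled here from the private modules.
    "__init__.py": "from ._Muscle import MuscleCommandline\n"
                   "from ._Clustalw import ClustalwCommandline\n",
}
for name, source in files.items():
    with open(os.path.join(pkg_dir, name), "w") as handle:
        handle.write(source)

sys.path.insert(0, pkg_root)
importlib.invalidate_caches()
demo_applications = importlib.import_module("demo_applications")

# Users import from the package, never from the underscore modules.
from demo_applications import MuscleCommandline

print(MuscleCommandline.__module__)
```

The class still reports its private module as its home, but users only ever need the package-level name.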
Peter From bugzilla-daemon at portal.open-bio.org Tue Apr 28 21:18:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 28 Apr 2009 17:18:11 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904282118.n3SLIB0N015984@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #19 from cymon.cox at gmail.com 2009-04-28 17:18 EST ------- (In reply to comment #18) > (In reply to comment #17) > > > It would be great if you could test a clean checkout from CVS, Done - on Ubuntu 9.04 Python2.6.2 - Clustalw_tool and Prank_tool both good. Cant test Muscle_tool as Muscle 3.7 is broken on this release (builds and core-dumps). > > Also, would you be able to look into making the Prank test faster to run? Will look into this. (merged upstream into applic-int) C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From mjldehoon at yahoo.com Wed Apr 29 01:28:26 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 28 Apr 2009 18:28:26 -0700 (PDT) Subject: [Biopython-dev] [Biopython] Parsing large blast files In-Reply-To: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com> Message-ID: <290052.25369.qm@web62407.mail.re1.yahoo.com> --- On Tue, 4/28/09, Peter Cock wrote: > I do like the idea of moving/importing the qblast function > directly under Bio.Blast, and perhaps removing Bio.Blast.NCBIXML > later on. Well Bio.Blast.NCBIXML would still be there (containing the code for the XML parser), but users would access it through Bio.Blast.parse/read. > For read/parse functions, we should probably call the > format "blastxml" to match BioPerl. 
We could have both "xml" and "blastxml" for Blast XML output, "text" and "blasttext" for Blast text output, and "table" and "blasttable" for Blast table (-m 8 and 9) output. > Would you continue to support the plain text output here? Yes. I'm more thinking about code reorganization than removing/adding functionality. > Rather than continuing to encourage the use of blastall, > blastpgp and rpsblast I would rather bring Bio.Blast.Applications > up to date, and then declare them obsolete. How would users typically use Bio.Blast.Applications? --Michiel. From p.j.a.cock at googlemail.com Wed Apr 29 08:33:03 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 29 Apr 2009 09:33:03 +0100 Subject: [Biopython-dev] [Biopython] Parsing large blast files In-Reply-To: <290052.25369.qm@web62407.mail.re1.yahoo.com> References: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com> <290052.25369.qm@web62407.mail.re1.yahoo.com> Message-ID: <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com> On Wed, Apr 29, 2009 at 2:28 AM, Michiel de Hoon wrote: > > How would users typically use Bio.Blast.Applications? > In the next release, I would aim to have Bio.Blast.Applications updated to cover blastall (fully), plus blastpgp and rpsblast (currently not covered) and for the three helper functions Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast to all use Bio.Blast.Applications internally. I would suggest at some point (perhaps a release later) calling the three helper functions obsolete, and eventually deprecating them, but I appreciate these are well documented and well used, so this should be a gradual transition. In the future I would see people constructing their application command line object and then using it to spawn the task as needed. The Bio.Application.generic_run might suffice for low output tools, ranging up to using the builtin subprocess module for full control. The command line string can also be used in other ways, e.g.
for submission to a computing cluster using qsub, or writing to a shell script etc. The point about this is decoupling construction of the command line string from actually executing it. Right now the Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions do both, and there is no way to (a) see what command line was used, which makes debugging difficult, or (b) control how it is invoked (e.g. recent Windows GUI questions). Another immediate benefit is an example usage that I do quite often: Running BLAST and saving the output to a file. The cleanest way to do this is to use the -o option to get BLAST itself to write to a file. If you do this, then there is no useful output written to the handles - but the Bio.Blast.NCBIStandalone functions make this fiddly (see Bug 2654). Right now the tutorial does something equally indirect - in python read BLAST output from stdout and save it to a file (and probably not in a memory efficient way either!). See also this thread on where to put new command line wrappers: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005766.html If you were asking about the actual code for how to build the command line object, well I have some thoughts on making the current Bio.Application base class easier to use (properties and keyword arguments at init) which I have started to discuss on the dev list.
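The decoupling argued for above, building the command line first and deciding separately how (or whether) to run it, can be illustrated with a small self-contained sketch. DemoCommandline is a hypothetical stand-in, not the Bio.Application class, and the blastall options are only sample values; the second half runs the Python interpreter itself so that the example works without BLAST installed.

```python
import shlex
import subprocess
import sys


class DemoCommandline(object):
    """Hypothetical stand-in for a command line wrapper object."""

    def __init__(self, cmd, **options):
        self.cmd = cmd
        self.options = dict(options)

    def as_args(self):
        """Return the command as an argument list, options sorted by name."""
        args = [self.cmd]
        for name in sorted(self.options):
            args.extend(["-" + name, str(self.options[name])])
        return args

    def __str__(self):
        # POSIX-style quoting; Windows would need different rules.
        return " ".join(shlex.quote(arg) for arg in self.as_args())


# Step 1: build the command line. Nothing is executed yet, so the string can
# be logged, written to a shell script, or handed to a cluster scheduler.
blast = DemoCommandline("blastall", p="blastp", i="query.fasta",
                        d="nr", o="results.xml", m=7)
blast_string = str(blast)
print(blast_string)

# Step 2: execution is a separate decision. To keep this runnable without
# BLAST, the Python interpreter is used here as the external "tool".
demo = DemoCommandline(sys.executable, c="print('hello')")
output = subprocess.check_output(demo.as_args(), universal_newlines=True)
print(output.strip())
```

Because the string exists before anything runs, it can be inspected when debugging, which is exactly what the combined helper functions do not allow.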
Peter From bugzilla-daemon at portal.open-bio.org Wed Apr 29 09:55:13 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Apr 2009 05:55:13 -0400 Subject: [Biopython-dev] [Bug 2822] New: Bio.Application.AbstractCommandline - properties and kwargs Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2822 Summary: Bio.Application.AbstractCommandline - properties and kwargs Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk I have two related proposals to make the command line wrapper objects easier to use, (1) Supporting keyword arguments in __init__ (2) Supporting parameters as python properties These both require each parameter to have a "human readable alias" which is also a valid python identifier (this should be the case in CVS now). I will attach patches to this bug, and perhaps put this on github too. For reference, consider this example (based on one in test_Emboss.py) using the old code in CVS: >>> from Bio.Emboss.Applications import WaterCommandline >>> water_exe = r"C:\Progra~1\Emboss\water.exe" >>> cline = WaterCommandline(cmd=water_exe) >>> cline.set_parameter("-asequence", "asis:ACCCGGGCGCGGT") >>> cline.set_parameter("-bsequence", "asis:ACCCGAGCGCGGT") >>> cline.set_parameter("-gapopen", "10") >>> cline.set_parameter("-gapextend", "0.5") >>> cline.set_parameter("-outfile", "temp_test.water") >>> print cline C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5 -outfile=temp_test.water Note that the parameters can have aliases (sometimes at the actual command line, e.g. a long and a short version of the same switch). 
Here the following is also supported: >>> from Bio.Emboss.Applications import WaterCommandline >>> water_exe = r"C:\Progra~1\Emboss\water.exe" >>> cline = WaterCommandline(cmd=water_exe) >>> cline.set_parameter("asequence", "asis:ACCCGGGCGCGGT") >>> cline.set_parameter("bsequence", "asis:ACCCGAGCGCGGT") >>> cline.set_parameter("gapopen", "10") >>> cline.set_parameter("gapextend", "0.5") >>> cline.set_parameter("outfile", "temp_test.water") >>> print cline C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5 -outfile=temp_test.water -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 29 10:00:14 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Apr 2009 06:00:14 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200904291000.n3TA0EBu028672@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-29 06:00 EST ------- Created an attachment (id=1287) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1287&action=view) Adds keyword argument support to the __init__ method This patch adds keyword argument support to the __init__ method, although for the purposes of demonstration in this patch I have only updated the EMBOSS wrappers to use it. 
As an alternative to the earlier example you would be able to do: >>> from Bio.Emboss.Applications import WaterCommandline >>> water_exe = r"C:\Progra~1\Emboss\water.exe" >>> cline = WaterCommandline(cmd=water_exe, asequence="asis:ACCCGGGCGCGGT", bsequence="asis:ACCCGAGCGCGGT", gapopen="10", gapextend="0.5", outfile="temp_test.water") >>> print cline C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5 -outfile=temp_test.water You can of course still use the set_parameter approach as well, for example to change a setting: >>> cline.set_parameter("gapopen", "20") >>> print cline C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=20 -gapextend=0.5 -outfile=temp_test.water I think this is much nicer, and also more like some of the existing "helper functions" we have for wrapping command line tools. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Wed Apr 29 10:25:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Apr 2009 11:25:17 +0100 Subject: [Biopython-dev] Properties in Bio.Application interface? In-Reply-To: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com> References: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com> Message-ID: <320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com> On Sun, Apr 26, 2009 at 1:46 PM, Peter wrote: > > I have cleaning up the existing Bio.Application command line objects > in CVS to follow the parameter alias convention already laid out in > Bio.Application. ?i.e. They all now have human readable paramater > aliases, which are also valid python identifiers. ?This means these > "human readable names" can also be used for argument names in > __init__ (using **kwargs), or as property names. 
> > I think I've got properties working now as an experiment on my > machine, generated at run time using the "human readable name" for > each parameter. ?We would need to special case "switch" arguments > (i.e. those which take no value) as outlined above. > > Does this sound worthwhile? ?If so, I'll put together an enhancement > bug with a patch, or a branch on github. I've filed Bug 2822 for these enhancements to the Bio.Application based command line objects, http://bugzilla.open-bio.org/show_bug.cgi?id=2822 So far there is just a patch to support keyword arguments (quite simple really), with an example of how this changes the interface. I'm still working on the code to do properties as well - I thought I'd solved this a few days ago but it doesn't quite work... Peter From p.j.a.cock at googlemail.com Wed Apr 29 10:31:26 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 29 Apr 2009 11:31:26 +0100 Subject: [Biopython-dev] [Biopython] Parsing large blast files In-Reply-To: <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com> References: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com> <290052.25369.qm@web62407.mail.re1.yahoo.com> <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com> Message-ID: <320fb6e00904290331n654964bficfc68ae92d477387@mail.gmail.com> On Apr 29, Peter wrote: > On Apr 29, Michiel de Hoon wrote: >> >> How would users typically use Bio.Blast.Applications? >> > > In the next release, I would aim to have Bio.Blast.Applications > updated to cover blastall (fully), plus blastpgp and rpsblast > (currently not covered) and for the three helper functions > Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast to all use > Bio.Blast.Applications internally. ?... 
> > If you where asking about the actual code for how to build the command > line object, well I have some thoughts on making the current > Bio.Application base class easier to use (properties and keyword > arguments at init) which I have started to discuss on the dev list. See this dev list thread: http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005916.html And Bug 2822 (with examples): http://bugzilla.open-bio.org/show_bug.cgi?id=2822 Peter From bugzilla-daemon at portal.open-bio.org Wed Apr 29 11:05:25 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Apr 2009 07:05:25 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904291105.n3TB5PRe000547@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #20 from cymon.cox at gmail.com 2009-04-29 07:05 EST ------- (In reply to comment #18) > (In reply to comment #17) > Also, would you be able to look into making the Prank test faster to run? Maybe > use a smaller example input file? After we do that, I'd like to use it to test > the Nexus parser via Bio.AlignIO (just something simple which won't be affected > by gap differences between different versions of PRANK - like my tests for > MUSCLE). Reduced run time from 8s to 1s, added asserts for Nexus outfile parsing. Pushed to applic-int C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Apr 29 11:40:56 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Apr 2009 07:40:56 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904291140.n3TBeu6o002524@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #21 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-29 07:40 EST ------- (In reply to comment #20) > (In reply to comment #18) > > (In reply to comment #17) > > Also, would you be able to look into making the Prank test faster to run? > > Maybe use a smaller example input file? After we do that, I'd like to > > use it to test the Nexus parser via Bio.AlignIO (just something simple > > which won't be affected by gap differences between different versions of > > PRANK - like my tests for MUSCLE). > > Reduced run time from 8s to 1s, added asserts for Nexus outfile parsing. > > Pushed to applic-int > C. Lovely - checked into CVS. On the Linux machine I tested this on it went from 16s to 2s :) P.S. See also Bug 2822 for some of my ideas on making the Bio.Application base class easier to use. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From chapmanb at 50mail.com Wed Apr 29 12:11:15 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 29 Apr 2009 08:11:15 -0400 Subject: [Biopython-dev] Installation documentation In-Reply-To: <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com> References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com> <20090428124119.GV34546@sobchak.mgh.harvard.edu> <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com> Message-ID: <20090429121115.GX34546@sobchak.mgh.harvard.edu> Hi Peter; > Well, easy_install isn't (yet) an official python standard so I hadn't > previously worried about it - our wiki Downloads page does mention it. > Frankly the less "official" ways the are to install, the less ways it > can go wrong, and then the less questions need to be asked when it > goes wrong. I hear you about too many options. I am a fan of easy_install and PyPi seems to have some momentum even if it is not officially endorsed. The way I normally work on cluster/shared machines is to have an up to date local version of Python and easy_install things I need. PyPi can also handle dependencies, which is nice -- I actually wrote some commented out code in setup.py which will help enable automatic numpy installation now that we are supporting only 2.4 or better. > Nor had I worried about how PyPi's listing might need to be updated. > I assumed it was clever enough to scan the http://biopython.org/DIST/ > directory and parse the filenames. Is the real answer you (Brad) kept > it up to date? > http://pypi.python.org/pypi/biopython/ Yes, I've been doing it on PyPi. The -f option you recommended on the wiki is good in case that is out of date, and I copied that into the install docs for consistency. > > Peter, if you have an account on pypi, let me know your login and I > > can add you as an owner for Biopython. > > I don't have an account on pypi. Cool -- if you end up wanting to play with it just let me know and I'll add you. 
Brad From bugzilla-daemon at portal.open-bio.org Wed Apr 29 12:23:19 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Apr 2009 08:23:19 -0400 Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces In-Reply-To: Message-ID: <200904291223.n3TCNJaG005773@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2815 ------- Comment #22 from cymon.cox at gmail.com 2009-04-29 08:23 EST ------- (In reply to comment #21) > P.S. See also Bug 2822 for some of my ideas on making the Bio.Application base > class easier to use. Eagerly anticipating the github branch ;) C. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 29 12:53:40 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Apr 2009 08:53:40 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200904291253.n3TCrec4008244@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-29 08:53 EST ------- OK, for the moment I'm going to give up on the property idea. I was trying to add them dynamically in __init__ or __new__ based on the parameter list, but this is actually rather tricky. I still think it should be possible though... We could use __getattr__ but that doesn't create an entry in dir(...), and thus is not discoverable - nor can we use each parameter's description for a docstring this way. Perhaps the simplest idea would be to define the properties explicitly in each subclass, but this would require more upfront effort as all the existing property object lists would need to be replaced.
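One way properties can be generated from a parameter list at run time, while staying visible to dir() and carrying docstrings, is to attach property objects to the class (not the instance) from within __init__. The sketch below is an assumption-laden illustration of that general technique, not the code from any patch on this bug; DemoCommandline, parameter_info and _make_property are invented names.

```python
def _make_property(name, doc):
    """Build a property that reads/writes self._values[name]."""
    def getter(self):
        return self._values.get(name)

    def setter(self, value):
        self._values[name] = value

    return property(getter, setter, doc=doc)


class DemoCommandline(object):
    """Hypothetical wrapper; parameter_info stands in for the parameter list."""

    parameter_info = [("gapopen", "Gap open penalty"),
                      ("gapextend", "Gap extension penalty")]

    def __init__(self, cmd, **kwargs):
        self.program_name = cmd
        self._values = {}
        cls = self.__class__
        # Properties only work when defined on the class, so add any that
        # are missing there (this happens once per class, not per instance).
        for name, doc in self.parameter_info:
            if not isinstance(getattr(cls, name, None), property):
                setattr(cls, name, _make_property(name, doc))
        known = dict(self.parameter_info)
        for key, value in kwargs.items():
            if key not in known:
                raise ValueError("Option name %s was not found." % key)
            setattr(self, key, value)

    def __str__(self):
        parts = [self.program_name]
        for name, _ in self.parameter_info:
            if self._values.get(name) is not None:
                parts.append("-%s=%s" % (name, self._values[name]))
        return " ".join(parts)


cline = DemoCommandline("water", gapopen=10)
cline.gapextend = 0.5
print(cline)
print("gapopen" in dir(cline))
print(type(cline).gapopen.__doc__)
```

Unlike a __getattr__ hook, this makes each parameter show up in dir() and help(), at the cost of modifying the class as a side effect of creating an instance.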
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Apr 29 14:35:41 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 29 Apr 2009 10:35:41 -0400 Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline - properties and kwargs In-Reply-To: Message-ID: <200904291435.n3TEZfED018571@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2822 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1287 is|0 |1 obsolete| | ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-29 10:35 EST ------- Created an attachment (id=1288) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1288&action=view) Adds keyword argument support to the __init__ method AND properties (In reply to comment #2) > OK, for the moment I'm going to give up on the property idea. I was trying to > add them dynamically in __init__ or __new__ based on the parameter list, but > this is actually rather tricky. I still think it should be possible though... I was close earlier, and think I have solved it now :) As before, this patch adds keyword argument support to the __init__ method, but also setups properties dynamically. Again, for the purposes of demonstration in this patch I have only updated the EMBOSS wrappers to use this. 
So, my original example (using the current code) was: >>> from Bio.Emboss.Applications import WaterCommandline >>> water_exe = r"C:\Progra~1\Emboss\water.exe" >>> cline = WaterCommandline(cmd=water_exe) >>> cline.set_parameter("asequence", "asis:ACCCGGGCGCGGT") >>> cline.set_parameter("bsequence", "asis:ACCCGAGCGCGGT") >>> cline.set_parameter("gapopen", "10") >>> cline.set_parameter("gapextend", "0.5") >>> cline.set_parameter("outfile", "temp_test.water") >>> print cline C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5 -outfile=temp_test.water With the __init__ keyword argument support, this becomes valid: >>> from Bio.Emboss.Applications import WaterCommandline >>> water_exe = r"C:\Progra~1\Emboss\water.exe" >>> cline = WaterCommandline(cmd=water_exe, asequence="asis:ACCCGGGCGCGGT", bsequence="asis:ACCCGAGCGCGGT", gapopen="10", gapextend="0.5", outfile="temp_test.water") >>> print cline C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5 -outfile=temp_test.water You can of course still use the set_parameter approach as well, for example to change a setting: >>> cline.set_parameter("gapopen", "20") >>> print cline C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=20 -gapextend=0.5 -outfile=temp_test.water With the property support, you can then read/or set parameter values directly: >>> cline.gapopen '20' >>> cline.gapopen = 15 >>> cline.gapopen 15 >>> print cline C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=15 -gapextend=0.5 -outfile=temp_test.water This is much nicer I think, but perhaps the biggest plus point is the properties have docstrings which show via: >>> help(cline) ... 
and are discoverable: >>> dir(cline) ['__class__', '__delattr__', '__dict__', '__doc__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__str__', '__weakref__', '_check_value', '_get_parameter', 'aformat', 'asequence', 'bsequence', 'datafile', 'gapextend', 'gapopen', 'outfile', 'parameters', 'program_name', 'set_parameter', 'similarity', 'snucleotide', 'sprotein'] This makes the parameters all readily discoverable, without having to resort to looking at Biopython's source code, or the command line application's help. Right now (using the old code in CVS), the information is there but buried: >>> print cline.parameters [, , , , , , , , , ] >>> for p in cline.parameters : ... print p.names, p.description ... ['-asequence', 'asequence'] First sequence to align ['-bsequence', 'bsequence'] Second sequence to align ['-gapopen', 'gapopen'] Gap open penalty ['-gapextend', 'gapextend'] Gap extension penalty ['-outfile', 'outfile'] Output file for the alignment ['-datafile', 'datafile'] Matrix file ['-similarity', 'similarity'] Display percent identity and similarity ['-snucleotide', 'snucleotide'] Sequences are nucleotide (boolean) ['-sprotein', 'sprotein'] Sequences are protein (boolean) ['-aformat', 'aformat'] Display output in a different specified output format So, comments? We can choose to add EITHER the __init__ keyword arguments OR the properties. Or of course, BOTH. Or neither, and just leave the interface as it stand in CVS now. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Wed Apr 29 15:34:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Apr 2009 16:34:26 +0100 Subject: [Biopython-dev] Properties in Bio.Application interface? 
In-Reply-To: <320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com> References: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com> <320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com> Message-ID: <320fb6e00904290834g7a73c7au487564e3b103250@mail.gmail.com> On Wed, Apr 29, 2009 at 11:25 AM, Peter wrote: > I've filed Bug 2822 for these enhancements to the Bio.Application > based command line objects, > http://bugzilla.open-bio.org/show_bug.cgi?id=2822 I think I learnt some more about python in the process, which may be a sign that the code I've come up with is too complicated, but Bug 2822 now has a patch to support both keyword arguments and properties in the Bio.Application style command line wrappers. This will require minor changes to the __init__ method of any command line sub-class (demonstrated using Bio.Emboss.Applications only thus far). I can envision a simpler approach to this code by defining the properties explicitly in each subclass, but that would mean a lot of boring/risky refactoring (or a clever script to do it for us). There are examples using the new code in the bug comments. Apart from preferring this API, the other big difference is the properties provide built in help. I'll be away for the next four days so I (probably) won't be able to reply to any comments or questions till Monday. Peter From biopython at maubp.freeserve.co.uk Wed Apr 29 16:10:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Apr 2009 17:10:28 +0100 Subject: [Biopython-dev] Git on Windows Message-ID: <320fb6e00904290910y29c6386ax8975ef3c2597d09e@mail.gmail.com> Hi all, I just wanted to say I've had a very quick test of git on Windows, using the git package from cygwin, and it seems to work OK. After copying my SSH key over from my main machine, I was able to clone my github repository, merge from the upstream Biopython branch (i.e. the one being updated from CVS), and push this back to my personal github repository. 
Why did I use git from cygwin? Well, I have cygwin installed anyway for
mingw32 (the compiler used for the Biopython Windows installers for
Python 2.3 to 2.5), and was already using the cvs package from cygwin,
so this seemed simplest.

Peter

From dalloliogm at gmail.com Wed Apr 29 16:42:50 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 29 Apr 2009 18:42:50 +0200
Subject: [Biopython-dev] Git on Windows
In-Reply-To: <320fb6e00904290910y29c6386ax8975ef3c2597d09e@mail.gmail.com>
References: <320fb6e00904290910y29c6386ax8975ef3c2597d09e@mail.gmail.com>
Message-ID: <5aa3b3570904290942g6a73fae3k3a53c2e13c95c258@mail.gmail.com>

On Wed, Apr 29, 2009 at 6:10 PM, Peter wrote:
> Hi all,
>
> I just wanted to say I've had a very quick test of git on Windows,

Hi,
by the way, here is a document published by Google comparing hg and git:
- http://code.google.com/p/support/wiki/DVCSAnalysis

In the comments, there is some discussion of git clients for Windows.

--
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it

From eric.talevich at gmail.com Wed Apr 29 19:28:58 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 29 Apr 2009 15:28:58 -0400
Subject: [Biopython-dev] XML parsing library for new modules
Message-ID: <3f6baf360904291228m1bba6008kb26e345510772e7a@mail.gmail.com>

Hi all,

I'm writing a parser for the PhyloXML format for Google Summer of Code
this year, and as the name would imply, it requires parsing some large
XML files. The existing modules in Biopython for parsing XML formats
seem to use xml.sax in the standard library. In Python 2.5, a faster
and more Pythonic parser was added to the standard lib: ElementTree
(xml.etree), in pure-Python and C-enhanced flavors.

How do you feel about each of these libraries as the basis for a new
Biopython module?
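
For readers unfamiliar with the ElementTree style mentioned above, here
is a minimal sketch of incremental parsing with
xml.etree.ElementTree.iterparse from the standard library. It is not
code from this thread: the sample document and the "clade"/"name"
element and attribute names are made up for illustration and are not
taken from the PhyloXML schema, and it is written for a current
Python 3 interpreter rather than the Python 2.x versions under
discussion.

```python
# Illustrative sketch only; names below are assumptions, not PhyloXML.
import io
from xml.etree import ElementTree

# Hypothetical sample document standing in for a large XML file.
SAMPLE = b"<trees><clade name='A'/><clade name='B'/><clade name='C'/></trees>"


def clade_names(handle):
    """Yield the 'name' attribute of each <clade> element, one at a time."""
    # iterparse reports each element as soon as its closing tag is seen,
    # so the whole document never needs to be held in memory at once.
    for event, elem in ElementTree.iterparse(handle, events=("end",)):
        if elem.tag == "clade":
            yield elem.get("name")
            elem.clear()  # drop the element's contents once processed


names = list(clade_names(io.BytesIO(SAMPLE)))
print(names)
```

The generator hands back one value per element, which is roughly the
record-at-a-time behaviour a SAX handler gives, but without writing
callback classes.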
Here are some interesting benchmarks:
http://effbot.org/zone/celementtree.htm#benchmarks

The ElementTree library is also available as a standalone package,
compatible back to Python 2.1, and the lxml package also offers an
independent implementation. So maintaining compatibility with Python
2.4 would require the availability of one of these third-party
packages, and my code would try each of these imports in order:

    from xml.etree import cElementTree as ElementTree
    from xml.etree import ElementTree
    # Separate lxml package
    from lxml.etree import ElementTree
    # Standalone elementtree package
    import cElementTree as ElementTree
    from elementtree import ElementTree

Then one day, when Python 2.4 is no longer supported, only the first
two lines would be needed. (The second line is for sites that disable
C extensions, like Google App Engine, or alternate Python
implementations like Jython.)

Another option is xml.parsers.expat, but just Googling around, it
appears that the Python zeitgeist is strongly in favor of xml.etree
for new code.

Thoughts?

Thanks,
Eric

From chapmanb at 50mail.com Thu Apr 30 12:05:32 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 30 Apr 2009 08:05:32 -0400
Subject: [Biopython-dev] Properties in Bio.Application interface?
In-Reply-To: <320fb6e00904290834g7a73c7au487564e3b103250@mail.gmail.com>
References: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com>
 <320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com>
 <320fb6e00904290834g7a73c7au487564e3b103250@mail.gmail.com>
Message-ID: <20090430120532.GA50777@sobchak.mgh.harvard.edu>

Hi Peter;

> > I've filed Bug 2822 for these enhancements to the Bio.Application
> > based command line objects,
> > http://bugzilla.open-bio.org/show_bug.cgi?id=2822
>
> I think I learnt some more about python in the process, which may be a
> sign that the code I've come up with is too complicated, but Bug 2822
> now has a patch to support both keyword arguments and properties in
> the Bio.Application style command line wrappers. This will require
> minor changes to the __init__ method of any command line sub-class
> (demonstrated using Bio.Emboss.Applications only thus far). I can
> envision a simpler approach to this code by defining the properties
> explicitly in each subclass, but that would mean a lot of boring/risky
> refactoring (or a clever script to do it for us).

I love what you are doing here. The keywords and properties make it
much more Pythonic; the old way reeks of Java-style get/sets. My vote
is to put them both in.

Brad

From marcin.swiatek at mail.mcgill.ca Thu Apr 30 15:23:35 2009
From: marcin.swiatek at mail.mcgill.ca (Marcin Swiatek)
Date: Thu, 30 Apr 2009 11:23:35 -0400
Subject: [Biopython-dev] MUMmer
Message-ID: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>

Hello,

I guess I should start with a nice 'hi' to everybody, now that I am
sending my first message to this group. So: Hi, Everybody!

Now that we have the formality out of the way, I will get to the
point. Recently, I have written some Python code for parsing and
processing the output of the MUMmer tool (http://mummer.sourceforge.net/).
More specifically, the code I have manages invocations and handles
outputs of the nucmer pipeline (alignment of multiple closely related
nucleotide sequences) and of mummer itself (short exact matches).
Obviously, the results are ultimately rendered as pairs of biopython's
Seq objects.

I use this stuff only myself, in work on bacterial genomes, but I would
be more than willing to contribute it to the project. It may be rough
around the edges at the moment, but I think I could easily give it the
necessary polish if there is interest in having it included.

Should that be the case, could one of the project leads point me in the
right direction, please? How should I go about the submission?

Regards,
Marcin Swiatek

From bartek at rezolwenta.eu.org Thu Apr 30 16:50:41 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 30 Apr 2009 18:50:41 +0200
Subject: [Biopython-dev] MUMmer
In-Reply-To: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
References: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
Message-ID: <8b34ec180904300950n2c75f010oed27493f52d0da14@mail.gmail.com>

Hi Marcin,

On Thu, Apr 30, 2009 at 5:23 PM, Marcin Swiatek wrote:
> Hello,
>
>
>
> I use this stuff only myself, in work on bacterial genomes, but I would
> be more than willing to contribute it to the project. It may be rough
> around the edges at the moment, but I think I could easily give it the
> necessary polish if there is interest in having it included.
>

Contributions are always welcome

>
>
> Should that be the case, could one of the project leads point me in the
> right direction, please? How should I go about the submission?
>
>

I don't think I qualify as a lead, but nonetheless I think I can help
here. I think that the best way to submit your code currently is to
create a branch (fork) of biopython on github and submit your changes
there and then notify people on biopython-dev that there is new code
to review.
You can also submit an enhancement bug to bugzilla.

There are a couple of wiki pages which might be of interest to you:
- http://biopython.org/wiki/Contributing
- http://biopython.org/wiki/GitUsage

If you have any questions or problems during the process, ask on the
list.

As for the code, I'm not sure, but maybe instead of returning a pair of
sequences, an alignment object might be a better choice? You might
also want to check out some recent code on application wrappers:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005766.html

cheers
Bartek