From biopython at maubp.freeserve.co.uk Sat Jan 1 16:15:43 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 1 Jan 2011 21:15:43 +0000 Subject: [Biopython-dev] Bio.SeqIO.index extension, Bio.SeqIO.index_many In-Reply-To: <20101208124144.GL4621@sobchak.mgh.harvard.edu> References: <20101207135941.GF4621@sobchak.mgh.harvard.edu> <20101208124144.GL4621@sobchak.mgh.harvard.edu> Message-ID: On Wed, Dec 8, 2010 at 12:41 PM, Brad Chapman wrote: > >Eric wrote: >> How about index_db or index_sqlite? The fact that it uses a SQLite >> database for storage seems significant enough to be noted in the name. > > +1 for index_db. That's clearer than index_file(s), which sort of > just implies you are indexing something but not that it is > non-memory. It also allows you to have multiple backends in addition > to SQLite. Nice. > > Brad > Checked in, but I still need to look at Python 3 support. Even plain Bio.SeqIO.index() will need some re-engineering to run at acceptable speed on Python 3, so this isn't unexpected. Peter From biopython at maubp.freeserve.co.uk Sun Jan 2 16:29:52 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 2 Jan 2011 21:29:52 +0000 Subject: [Biopython-dev] Bio.SeqIO.index extension, Bio.SeqIO.index_many In-Reply-To: References: <20101207135941.GF4621@sobchak.mgh.harvard.edu> <20101208124144.GL4621@sobchak.mgh.harvard.edu> Message-ID: On Sat, Jan 1, 2011 at 9:15 PM, Peter wrote: >> > > Checked in, but I still need to look at Python 3 support. Even > plain Bio.SeqIO.index() will need some re-engineering to run > at acceptable speed on Python 3, so this isn't unexpected. > There was also a UserDict thing to sort out for Python 3 which isn't taken care of by 2to3, see http://bugs.python.org/issue2876 However, the biggest annoyance of the index_db stuff was my unit tests on Windows - it turns out repeatedly creating, using, closing and deleting a file with the same name was a bad idea. Something was keeping a stale handle to the file as it wasn't always getting deleted. In the end I just gave in and used a different temp file for each tests, and it works fine. Weird, also rather frustrating - but the tests pass now :) I wonder how SQLite3 support in Jython is coming along... http://bugs.jython.org/issue1682864 Peter From biopython at maubp.freeserve.co.uk Tue Jan 4 17:43:27 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 4 Jan 2011 22:43:27 +0000 Subject: [Biopython-dev] Calling 2to3 from setup.py Message-ID: Hi all, Something we've talked about before is calling lib2to3 or the 2to3 script from within setup.py to make installing Biopython simpler on Python 3. Also, our current arrangement where we recommend calling 2to3 in situ is not very helpful from a source code control point of view - it makes working on Python 3 specific fixes rather fiddly. If we didn't need any special arguments for calling 2to3 we could try this simple solution: try: from distutils.command.build_py import build_py_2to3 as build_py except ImportError: from distutils.command.build_py import build_py then add this to the setup function call, cmdclass = {'build_py': build_py} See http://docs.python.org/py3k/distutils/apiref.html and other pages. However, as far as I can see, that doesn't cater for passing in options like disabling the long fixer (which we require). 
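For readability, here is that simple hook laid out as it would sit in a minimal setup.py (the package name, version and package list below are placeholders, not Biopython's real metadata):

    try:
        # Python 3: this build command runs 2to3 over the sources at build time
        from distutils.command.build_py import build_py_2to3 as build_py
    except ImportError:
        # Python 2: fall back to the standard build command
        from distutils.command.build_py import build_py
    from distutils.core import setup

    setup(name="Example",
          version="0.1",
          packages=["example"],
          cmdclass={'build_py': build_py})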
I then looked at how NumPy are doing this, and they have a hook in setup.py which calls their own Python script called py3tool.py to do the conversion, then change the current directory to the converted code before calling the setup function. See: https://github.com/numpy/numpy/blob/master/setup.py https://github.com/numpy/numpy/blob/master/tools/py3tool.py The NumPy py3tool.py script has some brains - it will not bother to reconvert previously converted but unchanged files. Since 2to3 is quite slow this is important, e.g. for doing: python3 setup.py build python3 setup.py test python3 setup.py install However, this only seems to be a simple check based on the file timestamps. I worry that you'd have to clear the converted files to ensure a clean rebuild after switching branches - but from a brief search online it looks like git will give modified files the current time stamp when you do a checkout. On the following branch I've followed the same basic strategy as NumPy - handle the 2to3 conversion with a script and then before calling the setup function switch to the converted source tree. The main difference is I also track the md5 checksums of the source files and the 2to3 converted python scripts. Perhaps it is over engineered, but it seems safer than looking at the files' time stamps? https://github.com/peterjc/biopython/tree/py3setup I haven't tried this yet on Windows or Mac, just Linux with Python 3.1 for now. Another potential issue with the NumPy code is it doesn't worry about the Python 3.1 and 3.2 (etc) versions of 2to3 giving slightly different results. To be safe, I'm using a separate build folder for each. If you run setup.py under Python 3.x, it calls lib2to3 from that Python. Has anyone else looked at this? Peter From biopython at maubp.freeserve.co.uk Tue Jan 4 18:30:29 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 4 Jan 2011 23:30:29 +0000 Subject: [Biopython-dev] Calling 2to3 from setup.py In-Reply-To: References: Message-ID: On Tue, Jan 4, 2011 at 10:43 PM, Peter wrote: > Hi all, > > Something we've talked about before is calling lib2to3 or the 2to3 script > from within setup.py to make installing Biopython simpler on Python 3. > > ... > > I then looked at how NumPy are doing this, and they have a hook > in setup.py which calls their own Python script called py3tool.py to > do the conversion, ... it will not bother to reconvert previously > converted but unchanged files. ... > > However, this only seems to be a simple check based on the file > timestamps. I worry that you'd have to clear the converted files > to ensure a clean rebuild after switching branches - but from > a brief search online it looks like git will give modified files the > current time stamp when you do a checkout. For an interesting but heated discussion of this and related issues, see this thread: http://www.spinics.net/lists/git/msg24579.html The key point is that although some version control systems do have an option to restore time stamps, git does not. Thus if you switch branches, and changed file gets the current timestamp. This is simple, and ensures simple build systems like make will rebuild all dependencies (but may do unnecessary work). > On the following branch I've followed the same basic strategy as > NumPy - handle the 2to3 conversion with a script and then before > calling the setup function switch to the converted source tree. > The main difference is I also track the md5 checksums of the > source files and the 2to3 converted python scripts. 
Perhaps it > is over engineered, but it seems safer than looking at the files' > time stamps? If you are on the master branch, then checkout another branch, then checkout the master branch again, the net result with git is any files which differed between the two branches would have had their time stamp updated (but with no net change to their contents). Using the NumPy setup.py script this would trigger a needless reconversion of those files with 2to3. Using the md5 approach would not do this extra work. On the other hand, this example is contrived - in practice when I change branch I want to build/install and test that code. So on reflection, using the time stamp to decide if 2to3 needs to be rerun is probable quite sufficient (and will be faster too). Peter From bugzilla-daemon at portal.open-bio.org Wed Jan 5 12:33:13 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Jan 2011 12:33:13 -0500 Subject: [Biopython-dev] [Bug 3166] New: Bio.PDB.DSSP fails to work on PDBs with HETATM Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3166 Summary: Bio.PDB.DSSP fails to work on PDBs with HETATM Product: Biopython Version: 1.50 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: macrozhu+biopy at gmail.com Hi, I am current using BioPython 1.50. It seems Bio.PDB.DSSP fails if the input PDB file contains HETATM. For example, for PDB entry 3jui, the DSSP.__init__() function breaks with an exception: KeyError: (' ', 547, ' ') This is because residue 547 has id ('H_MSE', 547, ' '). But the function Bio.PDB.DSSP.make_dssp_dict() never fill the het field in residue id when parsing DSSP output. See line 135 in DSSP.py: res_id=(" ", resseq, icode) As a matter of fact, there is no way to figure out the het value from DSSP output. Therefore, to address this issue, I suggest to revise the function DSSP.__init__() so that it looks like below (revised lines marked with comments # REVISED): class DSSP(AbstractResiduePropertyMap): """ Run DSSP on a pdb file, and provide a handle to the DSSP secondary structure and accessibility. Note that DSSP can only handle one model. Example: >>> p=PDBParser() >>> structure=parser.get_structure("1fat.pdb") >>> model=structure[0] >>> dssp=DSSP(model, "1fat.pdb") >>> # print dssp data for a residue >>> secondary_structure, accessibility=dssp[(chain_id, res_id)] """ def __init__(self, model, pdb_file, dssp="dssp"): """ @param model: the first model of the structure @type model: L{Model} @param pdb_file: a PDB file @type pdb_file: string @param dssp: the dssp executable (ie. the argument to os.system) @type dssp: string """ # create DSSP dictionary dssp_dict, dssp_keys=dssp_dict_from_pdb_file(pdb_file, dssp) dssp_map={} dssp_list=[] # Now create a dictionary that maps Residue objects to # secondary structure and accessibility, and a list of # (residue, (secondary structure, accessibility)) tuples for key in dssp_keys: chain_id, res_id=key chain=model[chain_id] #################### ### REVISED #################### # in DSSP, HET field is not considered # thus HETATM records may cause unnecessary exceptions # e.g. 3jui. 
try: res=chain[res_id] except KeyError: found = False # try again with all HETATM # consider resseq + icode res_seq_icode = ('%s%s' % (res_id[1],res_id[2])).strip() for r in chain: if r.id[0] != ' ': r_seq_icode = ('%s%s' % (r.id[1],r.id[2])).strip() if r_seq_icode == res_seq_icode: res = r found = True break if not found: raise KeyError(res_id) #################### ### REVISED FINISHES #################### aa, ss, acc=dssp_dict[key] res.xtra["SS_DSSP"]=ss res.xtra["EXP_DSSP_ASA"]=acc # relative accessibility resname=res.get_resname() try: rel_acc=acc/MAX_ACC[resname] if rel_acc>1.0: rel_acc=1.0 except KeyError: rel_acc = 'NA' res.xtra["EXP_DSSP_RASA"]=rel_acc # Verify if AA in DSSP == AA in Structure # Something went wrong if this is not true! resname=to_one_letter_code[resname] if resname=="C": # DSSP renames C in C-bridges to a,b,c,d,... # - we rename it back to 'C' if _dssp_cys.match(aa): aa='C' #################### ### REVISED #################### if not (resname==aa or (res.id[0] != ' ' and aa=='X')): #################### ### REVISED FINISHES #################### raise PDBException("Structure/DSSP mismatch at "+str(res)) dssp_map[key]=((res, ss, acc, rel_acc)) dssp_list.append((res, ss, acc, rel_acc)) AbstractResiduePropertyMap.__init__(self, dssp_map, dssp_keys, dssp_list) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 5 12:44:22 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Jan 2011 12:44:22 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101051744.p05HiMA0006192@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-05 12:44 EST ------- Hi, Biopython 1.50 was released in April 2009, and is nearly two years old. However, checking for recent changes to the file Bio/PDB/DSSP.py nothing looks to have altered the __init__ code you're interested in. https://github.com/biopython/biopython/commits/master/Bio/PDB/DSSP.py Could you submit your proposed changes as a patch against the latest code please? You can attach the patch file to this bug. Also an explicit example (ideally just a few lines of Python) showing how to reproduce the problem would be very helpful. Thanks! Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 5 13:07:15 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Jan 2011 13:07:15 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101051807.p05I7FhZ007739@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #2 from macrozhu+biopy at gmail.com 2011-01-05 13:07 EST ------- Created an attachment (id=1556) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1556&action=view) handle DSSP results on PDBs with HETATM, parse PHI and PSI angles The current version of DSSP can not handle some PDB files with HETATM in it. e.g. 3jui. 
This can be illustrated by the following example: > python DSSP.py 3jui.pdb KeyError: (' ', 547, ' ') In addition, PHI and PSI angles calculated by DSSP are useful in many cases. Therefore, in this patch I also revised the code so that PHI and PSI angles in DSSP output files are parsed and assigned to residue feature xtra. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 5 13:08:24 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Jan 2011 13:08:24 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101051808.p05I8OC8007831@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #3 from macrozhu+biopy at gmail.com 2011-01-05 13:08 EST ------- Created an attachment (id=1557) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1557&action=view) handle DSSP results on PDBs with HETATM, parse PHI and PSI angles Hi, Peter, The current version of DSSP can not handle some PDB files with HETATM in it. e.g. 3jui. This can be illustrated by the following example: > python DSSP.py 3jui.pdb KeyError: (' ', 547, ' ') Here is a patch file DSSP.py for addressing this problem. In addition, PHI and PSI angles calculated by DSSP are useful in many cases. Therefore, in this patch I also revised the code so that PHI and PSI angles in DSSP output files are parsed and assigned to residue feature xtra. regards, hongbo zhu -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 5 13:10:56 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Jan 2011 13:10:56 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101051810.p05IAu8d007991@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #4 from macrozhu+biopy at gmail.com 2011-01-05 13:10 EST ------- oops, the same patch was submitted twice. Please ignore 1556 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From updates at feedmyinbox.com Fri Jan 7 03:45:34 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Fri, 7 Jan 2011 03:45:34 -0500 Subject: [Biopython-dev] 1/7 newest questions tagged biopython - Stack Overflow Message-ID: // python multiprocessing each with own subprocess (Kubuntu,Mac) // January 6, 2011 at 4:25 PM http://stackoverflow.com/questions/4620041/python-multiprocessing-each-with-own-subprocess-kubuntu-mac I've created a script that by default creates one multiprocessing Process; then it works fine. When starting multiple processes, it starts to hang, and not always in the same place. The program's about 700 lines of code, so I'll try to summarise what's going on. I want to make the most of my multi-cores, by parallelising the slowest task, which is aligning DNA sequences. 
For that I use the subprocess module to call a command-line program: 'hmmsearch', which I can feed in sequences through /dev/stdin, and then I read out the aligned sequences through /dev/stdout. I imagine the hang occurs because of these multiple subprocess instances reading / writing from stdout / stdin, and I really don't know the best way to go about this... I was looking into os.fdopen(...) & os.tmpfile(), to create temporary filehandles or pipes where I can flush the data through. However, I've never used either before & I can't picture how to do that with the subprocess module. Ideally I'd like to bypass using the hard-drive entirely, because pipes are much better with high-throughput data processing! Any help with this would be super wonderful!! import multiprocessing, subprocess from Bio import SeqIO class align_seq( multiprocessing.Process ): def __init__( self, inPipe, outPipe, semaphore, options ): multiprocessing.Process.__init__(self) self.in_pipe = inPipe ## Sequences in self.out_pipe = outPipe ## Alignment out self.options = options.copy() ## Modifiable sub-environment self.sem = semaphore def run(self): inp = self.in_pipe.recv() while inp != 'STOP': seq_record , HMM = inp # seq_record is only ever one Bio.Seq.SeqRecord object at a time. # HMM is a file location. align_process = subprocess.Popen( ['hmmsearch', '-A', '/dev/stdout', '-o',os.devnull, HMM, '/dev/stdin'], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE ) self.sem.acquire() align_process.stdin.write( seq_record.format('fasta') ) align_process.stdin.close() for seq in SeqIO.parse( align_process.stdout, 'stockholm' ): # get the alignment output self.out_pipe.send_bytes( seq.seq.tostring() ) # send it to consumer align_process.wait() # Don't know if there's any need for this?? self.sem.release() align_process.stdout.close() inp = self.in_pipe.recv() self.in_pipe.close() #Close handles so don't overshoot max. limit on number of file-handles. self.out_pipe.close() -- Website: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=newest Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/520218/8a8361b28bdb22206ffa317797e7067a6f101db5/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. PO Box 682532 Franklin, TN 37068 From bugzilla-daemon at portal.open-bio.org Wed Jan 12 08:54:30 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 12 Jan 2011 08:54:30 -0500 Subject: [Biopython-dev] [Bug 3168] New: different StringIO import for Python 3 Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3168 Summary: different StringIO import for Python 3 Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: michael.kuhn at gmail.com Bio/File.py fails in Python 3, because StringIO.StringIO has been moved to the io module. These changes fix this (in Bio/File.py): import StringIO --> try: from StringIO import StringIO except ImportError: from io import StringIO and StringHandle = StringIO.StringIO --> StringHandle = StringIO (I didn't see a proper process documented anywhere to submit patches with the whole 2to3 conversion going on at the same time). 
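For clarity, the change proposed above, written out as it would appear in Bio/File.py:

    try:
        from StringIO import StringIO  # Python 2
    except ImportError:
        from io import StringIO        # Python 3

    StringHandle = StringIO

Either way, StringHandle(...) then gives a file-like handle on both Python versions.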
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 12 09:46:47 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 12 Jan 2011 09:46:47 -0500 Subject: [Biopython-dev] [Bug 3169] New: to_one_letter_code in Bio.SCOP.Raf is old Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3169 Summary: to_one_letter_code in Bio.SCOP.Raf is old Product: Biopython Version: 1.56 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: macrozhu+biopy at gmail.com Hi, The dictionary to_one_letter_code in Bio.SCOP.Raf is rather out of date now. The current dictionary is based on a table taken from the RAF release notes of ASTRAL. This is an old table and some newer three-letter codes in the PDB are not found in it (e.g. M3L in 2X4W). ASTRAL has not used the table since v1.73; instead, the PDB Chemical Component Dictionary is used. See http://astral.berkeley.edu/seq.cgi?get=raf-edit-comments;ver=1.75 "Beginning with ASTRAL 1.73, the PDB's chemical dictionary is used to translate chemically modified residues, instead of the translation table from ASTRAL 1.55." The PDB Chemical Component Dictionary can be obtained from: http://deposit.pdb.org/cc_dict_tut.html . I have parsed the dictionary and there are 12054 three-letter codes (as of Jan 2011). Among them, most correspond to a one-letter code of '?'. Still, there are 1245 three-letter codes corresponding to a one-letter code other than '?' (the list is attached at the end). Therefore, I suggest updating the to_one_letter_code dictionary in Bio.SCOP.Raf. 
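To illustrate how such a table could be regenerated, here is a rough sketch; it is not the parsing script attached later in this thread, and it assumes the simple one-value-per-line layout of components.cif (quoted or multi-line values are ignored):

    def parse_chem_comp(path):
        """Map three-letter codes to one-letter codes from components.cif (sketch)."""
        mapping = {}
        three = one = None
        for line in open(path):
            if line.startswith("data_"):
                # A new chemical component block starts; record the previous one.
                if three and one is not None and one != "?":
                    mapping[three] = one
                three = one = None
                continue
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue
            key, value = parts[0], parts[1].strip()
            if key == "_chem_comp.three_letter_code":
                three = value
            elif key == "_chem_comp.one_letter_code":
                one = value
        if three and one is not None and one != "?":
            mapping[three] = one
        return mapping

    # e.g. parse_chem_comp("components.cif").get("MSE") would be expected to give "M"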
Best regards, hongbo zhu to_one_letter_code = { '00C':'C','01W':'X','0A0':'D','0A1':'Y','0A2':'K', '0A8':'C','0AA':'V','0AB':'V','0AC':'G','0AD':'G', '0AF':'W','0AG':'L','0AH':'S','0AK':'D','0AM':'A', '0AP':'C','0AU':'U','0AV':'A','0AZ':'P','0BN':'F', '0C ':'C','0CS':'A','0DC':'C','0DG':'G','0DT':'T', '0G ':'G','0NC':'A','0SP':'A','0U ':'U','0YG':'YG', '10C':'C','125':'U','126':'U','127':'U','128':'N', '12A':'A','143':'C','175':'ASG','193':'X','1AP':'A', '1MA':'A','1MG':'G','1PA':'F','1PI':'A','1PR':'N', '1SC':'C','1TQ':'W','1TY':'Y','200':'F','23F':'F', '23S':'X','26B':'T','2AD':'X','2AG':'G','2AO':'X', '2AR':'A','2AS':'X','2AT':'T','2AU':'U','2BD':'I', '2BT':'T','2BU':'A','2CO':'C','2DA':'A','2DF':'N', '2DM':'N','2DO':'X','2DT':'T','2EG':'G','2FE':'N', '2FI':'N','2FM':'M','2GT':'T','2HF':'H','2LU':'L', '2MA':'A','2MG':'G','2ML':'L','2MR':'R','2MT':'P', '2MU':'U','2NT':'T','2OM':'U','2OT':'T','2PI':'X', '2PR':'G','2SA':'N','2SI':'X','2ST':'T','2TL':'T', '2TY':'Y','2VA':'V','32S':'X','32T':'X','3AH':'H', '3AR':'X','3CF':'F','3DA':'A','3DR':'N','3GA':'A', '3MD':'D','3ME':'U','3NF':'Y','3TY':'X','3XH':'G', '4AC':'N','4BF':'Y','4CF':'F','4CY':'M','4DP':'W', '4F3':'GYG','4FB':'P','4FW':'W','4HT':'W','4IN':'X', '4MF':'N','4MM':'X','4OC':'C','4PC':'C','4PD':'C', '4PE':'C','4PH':'F','4SC':'C','4SU':'U','4TA':'N', '5AA':'A','5AT':'T','5BU':'U','5CG':'G','5CM':'C', '5CS':'C','5FA':'A','5FC':'C','5FU':'U','5HP':'E', '5HT':'T','5HU':'U','5IC':'C','5IT':'T','5IU':'U', '5MC':'C','5MD':'N','5MU':'U','5NC':'C','5PC':'C', '5PY':'T','5SE':'U','5ZA':'TWG','64T':'T','6CL':'K', '6CT':'T','6CW':'W','6HA':'A','6HC':'C','6HG':'G', '6HN':'K','6HT':'T','6IA':'A','6MA':'A','6MC':'A', '6MI':'N','6MT':'A','6MZ':'N','6OG':'G','70U':'U', '7DA':'A','7GU':'G','7JA':'I','7MG':'G','8AN':'A', '8FG':'G','8MG':'G','8OG':'G','9NE':'E','9NF':'F', '9NR':'R','9NV':'V','A ':'A','A1P':'N','A23':'A', 'A2L':'A','A2M':'A','A34':'A','A35':'A','A38':'A', 'A39':'A','A3A':'A','A3P':'A','A40':'A','A43':'A', 'A44':'A','A47':'A','A5L':'A','A5M':'C','A5O':'A', 'A66':'X','AA3':'A','AA4':'A','AAR':'R','AB7':'X', 'ABA':'A','ABR':'A','ABS':'A','ABT':'N','ACB':'D', 'ACL':'R','AD2':'A','ADD':'X','ADX':'N','AEA':'X', 'AEI':'D','AET':'A','AFA':'N','AFF':'N','AFG':'G', 'AGM':'R','AGT':'X','AHB':'N','AHH':'X','AHO':'A', 'AHP':'A','AHS':'X','AHT':'X','AIB':'A','AKL':'D', 'ALA':'A','ALC':'A','ALG':'R','ALM':'A','ALN':'A', 'ALO':'T','ALQ':'X','ALS':'A','ALT':'A','ALY':'K', 'AP7':'A','APE':'X','APH':'A','API':'K','APK':'K', 'APM':'X','APP':'X','AR2':'R','AR4':'E','ARG':'R', 'ARM':'R','ARO':'R','ARV':'X','AS ':'A','AS2':'D', 'AS9':'X','ASA':'D','ASB':'D','ASI':'D','ASK':'D', 'ASL':'D','ASM':'X','ASN':'N','ASP':'D','ASQ':'D', 'ASU':'N','ASX':'B','ATD':'T','ATL':'T','ATM':'T', 'AVC':'A','AVN':'X','AYA':'A','AYG':'AYG','AZK':'K', 'AZS':'S','AZY':'Y','B1F':'F','B1P':'N','B2A':'A', 'B2F':'F','B2I':'I','B2V':'V','B3A':'A','B3D':'D', 'B3E':'E','B3K':'K','B3L':'X','B3M':'X','B3Q':'X', 'B3S':'S','B3T':'X','B3U':'H','B3X':'N','B3Y':'Y', 'BB6':'C','BB7':'C','BB9':'C','BBC':'C','BCS':'C', 'BCX':'C','BE2':'X','BFD':'D','BG1':'S','BGM':'G', 'BHD':'D','BIF':'F','BIL':'X','BIU':'I','BJH':'X', 'BLE':'L','BLY':'K','BMP':'N','BMT':'T','BNN':'A', 'BNO':'X','BOE':'T','BOR':'R','BPE':'C','BRU':'U', 'BSE':'S','BT5':'N','BTA':'L','BTC':'C','BTR':'W', 'BUC':'C','BUG':'V','BVP':'U','BZG':'N','C ':'C', 'C12':'TYG','C1X':'K','C25':'C','C2L':'C','C2S':'C', 'C31':'C','C32':'C','C34':'C','C36':'C','C37':'C', 'C38':'C','C3Y':'C','C42':'C','C43':'C','C45':'C', 
'C46':'C','C49':'C','C4R':'C','C4S':'C','C5C':'C', 'C66':'X','C6C':'C','C99':'TFG','CAF':'C','CAL':'X', 'CAR':'C','CAS':'C','CAV':'X','CAY':'C','CB2':'C', 'CBR':'C','CBV':'C','CCC':'C','CCL':'K','CCS':'C', 'CCY':'CYG','CDE':'X','CDV':'X','CDW':'C','CEA':'C', 'CFL':'C','CFY':'FCYG','CG1':'G','CGA':'E','CGU':'E', 'CH ':'C','CH6':'MYG','CH7':'KYG','CHF':'X','CHG':'X', 'CHP':'G','CHS':'X','CIR':'R','CJO':'GYG','CLE':'L', 'CLG':'K','CLH':'K','CLV':'AFG','CM0':'N','CME':'C', 'CMH':'C','CML':'C','CMR':'C','CMT':'C','CNU':'U', 'CP1':'C','CPC':'X','CPI':'X','CQR':'GYG','CR0':'TLG', 'CR2':'GYG','CR5':'G','CR7':'KYG','CR8':'HYG','CRF':'TWG', 'CRG':'THG','CRK':'MYG','CRO':'GYG','CRQ':'QYG','CRU':'E', 'CRW':'ASG','CRX':'ASG','CS0':'C','CS1':'C','CS3':'C', 'CS4':'C','CS8':'N','CSA':'C','CSB':'C','CSD':'C', 'CSE':'C','CSF':'C','CSH':'SHG','CSI':'G','CSJ':'C', 'CSL':'C','CSO':'C','CSP':'C','CSR':'C','CSS':'C', 'CSU':'C','CSW':'C','CSX':'C','CSY':'SYG','CSZ':'C', 'CTE':'W','CTG':'T','CTH':'T','CUC':'X','CWR':'S', 'CXM':'M','CY0':'C','CY1':'C','CY3':'C','CY4':'C', 'CYA':'C','CYD':'C','CYF':'C','CYG':'C','CYJ':'X', 'CYM':'C','CYQ':'C','CYR':'C','CYS':'C','CZ2':'C', 'CZO':'GYG','CZZ':'C','D11':'T','D1P':'N','D3 ':'N', 'D33':'N','D3P':'G','D3T':'T','D4M':'T','D4P':'X', 'DA ':'A','DA2':'X','DAB':'A','DAH':'F','DAL':'A', 'DAR':'R','DAS':'D','DBB':'T','DBM':'N','DBS':'S', 'DBU':'T','DBY':'Y','DBZ':'A','DC ':'C','DC2':'C', 'DCG':'G','DCI':'X','DCL':'X','DCT':'C','DCY':'C', 'DDE':'H','DDG':'G','DDN':'U','DDX':'N','DFC':'C', 'DFG':'G','DFI':'X','DFO':'X','DFT':'N','DG ':'G', 'DGH':'G','DGI':'G','DGL':'E','DGN':'Q','DHA':'A', 'DHI':'H','DHL':'X','DHN':'V','DHP':'X','DHU':'U', 'DHV':'V','DI ':'I','DIL':'I','DIR':'R','DIV':'V', 'DLE':'L','DLS':'K','DLY':'K','DM0':'K','DMH':'N', 'DMK':'D','DMT':'X','DN ':'N','DNE':'L','DNG':'L', 'DNL':'K','DNM':'L','DNP':'A','DNR':'C','DNS':'K', 'DOA':'X','DOC':'C','DOH':'D','DON':'L','DPB':'T', 'DPH':'F','DPL':'P','DPP':'A','DPQ':'Y','DPR':'P', 'DPY':'N','DRM':'U','DRP':'N','DRT':'T','DRZ':'N', 'DSE':'S','DSG':'N','DSN':'S','DSP':'D','DT ':'T', 'DTH':'T','DTR':'W','DTY':'Y','DU ':'U','DVA':'V', 'DXD':'N','DXN':'N','DYG':'DYG','DYS':'C','DZM':'A', 'E ':'A','E1X':'A','EDA':'A','EDC':'G','EFC':'C', 'EHP':'F','EIT':'T','ENP':'N','ESB':'Y','ESC':'M', 'EXY':'L','EY5':'N','EYS':'X','F2F':'F','FA2':'A', 'FA5':'N','FAG':'N','FAI':'N','FCL':'F','FFD':'N', 'FGL':'G','FGP':'S','FHL':'X','FHO':'K','FHU':'U', 'FLA':'A','FLE':'L','FLT':'Y','FME':'M','FMG':'G', 'FMU':'N','FOE':'C','FOX':'G','FP9':'P','FPA':'F', 'FRD':'X','FT6':'W','FTR':'W','FTY':'Y','FZN':'K', 'G ':'G','G25':'G','G2L':'G','G2S':'G','G31':'G', 'G32':'G','G33':'G','G36':'G','G38':'G','G42':'G', 'G46':'G','G47':'G','G48':'G','G49':'G','G4P':'N', 'G7M':'G','GAO':'G','GAU':'E','GCK':'C','GCM':'X', 'GDP':'G','GDR':'G','GFL':'G','GGL':'E','GH3':'G', 'GHG':'Q','GHP':'G','GL3':'G','GLH':'Q','GLM':'X', 'GLN':'Q','GLQ':'E','GLU':'E','GLX':'Z','GLY':'G', 'GLZ':'G','GMA':'E','GMS':'G','GMU':'U','GN7':'G', 'GND':'X','GNE':'N','GOM':'G','GPL':'K','GS ':'G', 'GSC':'G','GSR':'G','GSS':'G','GSU':'E','GT9':'C', 'GTP':'G','GVL':'X','GYC':'CYG','GYS':'SYG','H2U':'U', 'H5M':'P','HAC':'A','HAR':'R','HBN':'H','HCS':'X', 'HDP':'U','HEU':'U','HFA':'X','HGL':'X','HHI':'H', 'HHK':'AK','HIA':'H','HIC':'H','HIP':'H','HIQ':'H', 'HIS':'H','HL2':'L','HLU':'L','HMF':'A','HMR':'R', 'HOL':'N','HPC':'F','HPE':'F','HPQ':'F','HQA':'A', 'HRG':'R','HRP':'W','HS8':'H','HS9':'H','HSE':'S', 'HSL':'S','HSO':'H','HTI':'C','HTN':'N','HTR':'W', 
'HV5':'A','HVA':'V','HY3':'P','HYP':'P','HZP':'P', 'I ':'I','I2M':'I','I58':'K','I5C':'C','IAM':'A', 'IAR':'R','IAS':'D','IC ':'C','IEL':'K','IEY':'HYG', 'IG ':'G','IGL':'G','IGU':'G','IIC':'SHG','IIL':'I', 'ILE':'I','ILG':'E','ILX':'I','IMC':'C','IML':'I', 'IOY':'F','IPG':'G','IPN':'N','IRN':'N','IT1':'K', 'IU ':'U','IYR':'Y','IYT':'T','JJJ':'C','JJK':'C', 'JJL':'C','JW5':'N','K1R':'C','KAG':'G','KCX':'K', 'KGC':'K','KOR':'M','KPI':'K','KST':'K','KYQ':'K', 'L2A':'X','LA2':'K','LAA':'D','LAL':'A','LBY':'K', 'LC ':'C','LCA':'A','LCC':'N','LCG':'G','LCH':'N', 'LCK':'K','LCX':'K','LDH':'K','LED':'L','LEF':'L', 'LEH':'L','LEI':'V','LEM':'L','LEN':'L','LET':'X', 'LEU':'L','LG ':'G','LGP':'G','LHC':'X','LHU':'U', 'LKC':'N','LLP':'K','LLY':'K','LME':'E','LMQ':'Q', 'LMS':'N','LP6':'K','LPD':'P','LPG':'G','LPL':'X', 'LPS':'S','LSO':'X','LTA':'X','LTR':'W','LVG':'G', 'LVN':'V','LYM':'K','LYN':'K','LYR':'K','LYS':'K', 'LYX':'K','LYZ':'K','M0H':'C','M1G':'G','M2G':'G', 'M2L':'K','M2S':'M','M3L':'K','M5M':'C','MA ':'A', 'MA6':'A','MA7':'A','MAA':'A','MAD':'A','MAI':'R', 'MBQ':'Y','MBZ':'N','MC1':'S','MCG':'X','MCL':'K', 'MCS':'C','MCY':'C','MDH':'X','MDO':'ASG','MDR':'N', 'MEA':'F','MED':'M','MEG':'E','MEN':'N','MEP':'U', 'MEQ':'Q','MET':'M','MEU':'G','MF3':'X','MFC':'GYG', 'MG1':'G','MGG':'R','MGN':'Q','MGQ':'A','MGV':'G', 'MGY':'G','MHL':'L','MHO':'M','MHS':'H','MIA':'A', 'MIS':'S','MK8':'L','ML3':'K','MLE':'L','MLL':'L', 'MLY':'K','MLZ':'K','MME':'M','MMT':'T','MND':'N', 'MNL':'L','MNU':'U','MNV':'V','MOD':'X','MP8':'P', 'MPH':'X','MPJ':'X','MPQ':'G','MRG':'G','MSA':'G', 'MSE':'M','MSL':'M','MSO':'M','MSP':'X','MT2':'M', 'MTR':'T','MTU':'A','MTY':'Y','MVA':'V','N ':'N', 'N10':'S','N2C':'X','N5I':'N','N5M':'C','N6G':'G', 'N7P':'P','NA8':'A','NAL':'A','NAM':'A','NB8':'N', 'NBQ':'Y','NC1':'S','NCB':'A','NCX':'N','NCY':'X', 'NDF':'F','NDN':'U','NEM':'H','NEP':'H','NF2':'N', 'NFA':'F','NHL':'E','NIT':'X','NIY':'Y','NLE':'L', 'NLN':'L','NLO':'L','NLP':'L','NLQ':'Q','NMC':'G', 'NMM':'R','NMS':'T','NMT':'T','NNH':'R','NP3':'N', 'NPH':'C','NRP':'LYG','NRQ':'MYG','NSK':'X','NTY':'Y', 'NVA':'V','NYC':'TWG','NYG':'NYG','NYM':'N','NYS':'C', 'NZH':'H','O12':'X','O2C':'N','O2G':'G','OAD':'N', 'OAS':'S','OBF':'X','OBS':'X','OCS':'C','OCY':'C', 'ODP':'N','OHI':'H','OHS':'D','OIC':'X','OIP':'I', 'OLE':'X','OLT':'T','OLZ':'S','OMC':'C','OMG':'G', 'OMT':'M','OMU':'U','ONE':'U','ONL':'X','OPR':'R', 'ORN':'A','ORQ':'R','OSE':'S','OTB':'X','OTH':'T', 'OTY':'Y','OXX':'D','P ':'G','P1L':'C','P1P':'N', 'P2T':'T','P2U':'U','P2Y':'P','P5P':'A','PAQ':'Y', 'PAS':'D','PAT':'W','PAU':'A','PBB':'C','PBF':'F', 'PBT':'N','PCA':'E','PCC':'P','PCE':'X','PCS':'F', 'PDL':'X','PDU':'U','PEC':'C','PF5':'F','PFF':'F', 'PFX':'X','PG1':'S','PG7':'G','PG9':'G','PGL':'X', 'PGN':'G','PGP':'G','PGY':'G','PHA':'F','PHD':'D', 'PHE':'F','PHI':'F','PHL':'F','PHM':'F','PIV':'X', 'PLE':'L','PM3':'F','PMT':'C','POM':'P','PPN':'F', 'PPU':'A','PPW':'G','PQ1':'N','PR3':'C','PR5':'A', 'PR9':'P','PRN':'A','PRO':'P','PRS':'P','PSA':'F', 'PSH':'H','PST':'T','PSU':'U','PSW':'C','PTA':'X', 'PTH':'Y','PTM':'Y','PTR':'Y','PU ':'A','PUY':'N', 'PVH':'H','PVL':'X','PYA':'A','PYO':'U','PYX':'C', 'PYY':'N','QLG':'QLG','QUO':'G','R ':'A','R1A':'C', 'R1B':'C','R1F':'C','R7A':'C','RC7':'HYG','RCY':'C', 'RIA':'A','RMP':'A','RON':'X','RT ':'T','RTP':'N', 'S1H':'S','S2C':'C','S2D':'A','S2M':'T','S2P':'A', 'S4A':'A','S4C':'C','S4G':'G','S4U':'U','S6G':'G', 'SAC':'S','SAH':'C','SAR':'G','SBL':'S','SC ':'C', 'SCH':'C','SCS':'C','SCY':'C','SD2':'X','SDG':'G', 
'SDP':'S','SEB':'S','SEC':'A','SEG':'A','SEL':'S', 'SEM':'X','SEN':'S','SEP':'S','SER':'S','SET':'S', 'SGB':'S','SHC':'C','SHP':'G','SHR':'K','SIB':'C', 'SIC':'DC','SLA':'P','SLR':'P','SLZ':'K','SMC':'C', 'SME':'M','SMF':'F','SMP':'A','SMT':'T','SNC':'C', 'SNN':'N','SOC':'C','SOS':'N','SOY':'S','SPT':'T', 'SRA':'A','SSU':'U','STY':'Y','SUB':'X','SUI':'DG', 'SUN':'S','SUR':'U','SVA':'S','SVX':'S','SVZ':'X', 'SYS':'C','T ':'T','T11':'F','T23':'T','T2S':'T', 'T2T':'N','T31':'U','T32':'T','T36':'T','T37':'T', 'T38':'T','T39':'T','T3P':'T','T41':'T','T48':'T', 'T49':'T','T4S':'T','T5O':'U','T5S':'T','T66':'X', 'T6A':'A','TA3':'T','TA4':'X','TAF':'T','TAL':'N', 'TAV':'D','TBG':'V','TBM':'T','TC1':'C','TCP':'T', 'TCQ':'X','TCR':'W','TCY':'A','TDD':'L','TDY':'T', 'TFE':'T','TFO':'A','TFQ':'F','TFT':'T','TGP':'G', 'TH6':'T','THC':'T','THO':'X','THR':'T','THX':'N', 'THZ':'R','TIH':'A','TLB':'N','TLC':'T','TLN':'U', 'TMB':'T','TMD':'T','TNB':'C','TNR':'S','TOX':'W', 'TP1':'T','TPC':'C','TPG':'G','TPH':'X','TPL':'W', 'TPO':'T','TPQ':'Y','TQQ':'W','TRF':'W','TRG':'K', 'TRN':'W','TRO':'W','TRP':'W','TRQ':'W','TRW':'W', 'TRX':'W','TS ':'N','TST':'X','TT ':'N','TTD':'T', 'TTI':'U','TTM':'T','TTQ':'W','TTS':'Y','TY2':'Y', 'TY3':'Y','TYB':'Y','TYI':'Y','TYN':'Y','TYO':'Y', 'TYQ':'Y','TYR':'Y','TYS':'Y','TYT':'Y','TYU':'N', 'TYX':'X','TYY':'Y','TZB':'X','TZO':'X','U ':'U', 'U25':'U','U2L':'U','U2N':'U','U2P':'U','U31':'U', 'U33':'U','U34':'U','U36':'U','U37':'U','U8U':'U', 'UAR':'U','UCL':'U','UD5':'U','UDP':'N','UFP':'N', 'UFR':'U','UFT':'U','UMA':'A','UMP':'U','UMS':'U', 'UN1':'X','UN2':'X','UNK':'X','UR3':'U','URD':'U', 'US1':'U','US2':'U','US3':'T','US5':'U','USM':'U', 'V1A':'C','VAD':'V','VAF':'V','VAL':'V','VB1':'K', 'VDL':'X','VLL':'X','VLM':'X','VMS':'X','VOL':'X', 'X ':'G','X2W':'E','X4A':'N','X9Q':'AFG','XAD':'A', 'XAE':'N','XAL':'A','XAR':'N','XCL':'C','XCP':'X', 'XCR':'C','XCS':'N','XCT':'C','XCY':'C','XGA':'N', 'XGL':'G','XGR':'G','XGU':'G','XTH':'T','XTL':'T', 'XTR':'T','XTS':'G','XTY':'N','XUA':'A','XUG':'G', 'XX1':'K','XXY':'THG','XYG':'DYG','Y ':'A','YCM':'C', 'YG ':'G','YOF':'Y','YRR':'N','YYG':'G','Z ':'C', 'ZAD':'A','ZAL':'A','ZBC':'C','ZCY':'C','ZDU':'U', 'ZFB':'X','ZGU':'G','ZHP':'N','ZTH':'T','ZZJ':'A' } -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jan 13 05:51:48 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 13 Jan 2011 05:51:48 -0500 Subject: [Biopython-dev] [Bug 3168] different StringIO import for Python 3 In-Reply-To: Message-ID: <201101131051.p0DApmJx003755@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3168 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-13 05:51 EST ------- Hi Michael, We have started looking at calling 2to3 via the setup.py script, and will update the REAME file if things change. However, for now you must run the 2to3 script twice (as described in the README file) before installing Biopython. The 2to3 script automatically switches this: import StringIO ... StringHandle = StringIO.StringIO to a Python 3 equivalent: import io ... StringHandle = io.StringIO i.e. I can't reproduce any problem. Could you clarify what you are doing? 
Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jan 13 06:37:16 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 13 Jan 2011 06:37:16 -0500 Subject: [Biopython-dev] [Bug 3169] to_one_letter_code in Bio.SCOP.Raf is old In-Reply-To: Message-ID: <201101131137.p0DBbGeS013288@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3169 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-13 06:37 EST ------- Hi Hongbo, Could you share the script you used to parse the mmCIF file to build the to_one_letter_code dictionary from the chem_comp (Table 1)? That would be very helpful since this data will need to be updated again in future. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jan 13 07:54:46 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 13 Jan 2011 07:54:46 -0500 Subject: [Biopython-dev] [Bug 3169] to_one_letter_code in Bio.SCOP.Raf is old In-Reply-To: Message-ID: <201101131254.p0DCskc7029238@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3169 ------- Comment #2 from macrozhu+biopy at gmail.com 2011-01-13 07:54 EST ------- Created an attachment (id=1560) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1560&action=view) python script for parsing PDB Chem Component Hi, Peter, the script is attached. It was a quick hack: I just parsed all the fields of "_chem_comp.one_letter_code" and "_chem_comp.three_letter_code" in the component.cif and wrote output to a .txt file. Hope that helps. cheers, hongbo -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jan 14 05:06:00 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 Jan 2011 05:06:00 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101141006.p0EA60Tc013924@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 macrozhu+biopy at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1556 is|0 |1 obsolete| | Attachment #1557 is|0 |1 obsolete| | ------- Comment #5 from macrozhu+biopy at gmail.com 2011-01-14 05:05 EST ------- Created an attachment (id=1561) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1561&action=view) fix DSSP crash when reading PDB with DisorderedResidue The current version of DSSP.py does not handle DisorderedResidue well. Case 1, Point mutations, e.g. 1h9h chain E resi 22 BioPython uses the last residue as default (resi SER in this case). But DSSP takes the first one (alternative location is blank,A or 1, CYS in this case). >python DSSP.py 1h9h.pdb Case 2, one of the disordered residues is HET. e.g. 
3piu chain A Residue 273 >python DSSP.py 3piu.pdb In the first case, the DisorderedResidue.is_disordered() returns 2, and in the 2nd case, the DisorderedResidue.is_disordered() returns 1. These values are used to cope with DisorderedResidue in DSSP.py. Minors: tempfile.mktemp() should be replaced by tempfile.mkstemp() (see http://docs.python.org/library/tempfile.html#tempfile.mktemp "Deprecated since version 2.3: Use mkstemp() instead.") -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jan 14 05:32:31 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 Jan 2011 05:32:31 -0500 Subject: [Biopython-dev] [Bug 3168] different StringIO import for Python 3 In-Reply-To: Message-ID: <201101141032.p0EAWVqo018072@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3168 ------- Comment #2 from michael.kuhn at gmail.com 2011-01-14 05:32 EST ------- Ok, for me, the 2to3 script does not make the change (using Python 3.1.1). I only updated biopython. I'll ask our sysadmin to upgrade Python 3 and get back to you. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From b.invergo at gmail.com Fri Jan 14 05:56:38 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Fri, 14 Jan 2011 11:56:38 +0100 Subject: [Biopython-dev] pypaml Message-ID: Hi everyone, New subscriber here, and hopefully a new contributer as well! I have written a Python interface to the CODEML program of the PAML package (http://abacus.gene.ucl.ac.uk/software/paml.html), with the intention of eventually covering all of the programs in the package. You can find my package here: http://code.google.com/p/pypaml/ I recently ran across a discussion that occurred on the main Biopython list regarding my interface (http://lists.open-bio.org/pipermail/biopython/2010-September/006743.html) and I realized that perhaps it would be better if I integrated it into Biopython. I know that it's something many people would be interested in. I am very enthusiastic to continue this project and to do whatever I need to do to facilitate the integration. Some immediate tasks that need to be done are: - change the licensing: currently it's GPL, as described in the code and on the project page. Is it sufficient to simply remove its dedicated project page and change the verbiage in the code? - check coding standards as described in the Contributing to Biopython wiki - make some changes to be compatible with Python 2.5: I use @property and @x.setter decorator tags which are only 2.6+. I think that's the only incompatability - double-check the CODEML output parsing for many PAML versions; the output is notoriously non-standard from release to release. I may have to build some version-checking into the parser. I wrote it based on the output of PAML 4.3 - build some unit tests (I'm new to this in Python so I need to learn a bit about that - perhaps making it fit with any other structural standards in the Biopython library? I've tried from the start to make it very generalized so I don't think any major changes need to be made. Plus, I think structurally it should be easy to implement the other PAML programs by copying a lot of the code. 
The output parsing for each program is a different story, though. So, as I understand it, I should file an enhancement bug over at the Bugzilla site. In the meantime I can start working on some of the points listed above. I also need to refresh my memory of using git since I've gotten in the dirty habit of using svn (assuming this is all approved)! Is there anything else I need to do for now? Cheers, Brandon Invergo Pompeu Fabra University Barcelona, Spain From p.j.a.cock at googlemail.com Fri Jan 14 07:28:30 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Jan 2011 12:28:30 +0000 Subject: [Biopython-dev] pypaml In-Reply-To: References: Message-ID: On Fri, Jan 14, 2011 at 10:56 AM, Brandon Invergo wrote: > Hi everyone, > New subscriber here, and hopefully a new contributer as well! > Hi Brandon, Welcome to the list. By the way, apologies for mixing you and the PAML author Ziheng Yang up last year (I misread the pypaml webpage): http://lists.open-bio.org/pipermail/biopython/2010-September/006747.html > I have written a Python interface to the CODEML program of the PAML > package (http://abacus.gene.ucl.ac.uk/software/paml.html), with the > intention of eventually covering all of the programs in the package. > You can find my package here: > http://code.google.com/p/pypaml/ > > I recently ran across a discussion that occurred on the main Biopython > list regarding my interface > (http://lists.open-bio.org/pipermail/biopython/2010-September/006743.html) > and I realized that perhaps it would be better if I integrated it into > Biopython. I know that it's something many people would be interested > in. I am very enthusiastic to continue this project and to do whatever > I need to do to facilitate the integration. That is great news :) > Some immediate tasks that need to be done are: > - change the licensing: currently it's GPL, as described in the code > and on the project page. Is it sufficient to simply remove its > dedicated project page and change the verbiage in the code? Assuming you wrote all the code (or have your co-authors agreement), then yes, you can just change the licence. If you want to you can update the code in your repository and website, maybe make a new release while you are at it. Alternatively, you could just leave the standalone pypaml code as it is (under the GPL), but base your Biopython contributions on it (under the Biopython MIT/BSD licence). I would suggest that you don't make API changes to standalone pypaml, so as not to disrupt your existing users. However some of the work like Python 2.5 support might be worth doing there (before looking at Biopython integration). As a bonus, that should also mean you can use pypaml under Jython (Python on the JVM). > - check coding standards as described in the Contributing to Biopython wiki > - make some changes to be compatible with Python 2.5: I use @property > and @x.setter decorator tags which are only 2.6+. I think that's the > only incompatability If so that doesn't sound too hard to update. > - double-check the CODEML output parsing for many PAML versions; the > output is notoriously non-standard from release to release. I may have > to build some version-checking into the parser. I wrote it based on > the output of PAML 4.3 >From Chris Field's comments last year, that may be a lot of work for relatively little gain. I don't use PAML and have no idea what versions are typically used though. 
http://lists.open-bio.org/pipermail/biopython/2010-September/006760.html > - build some unit tests (I'm new to this in Python so I need to learn > a bit about that We've tried to cover the basics in a chapter in our tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf > - perhaps making it fit with any other structural standards in the > Biopython library? > > I've tried from the start to make it very generalized so I don't think > any major changes need to be made. Plus, I think structurally it > should be easy to implement the other PAML programs by copying a lot > of the code. The output parsing for each program is a different story, > though. Does that mean you have wrappers for calling the PAML command line tools? Can you point me at the code for that - I'd like a quick look to see if it makes sense to switch over to the Bio.Application based system we're trying to standardise on in Biopython. On the other hand, if you have a much higher level wrapper maybe it is fine as it is (e.g. the Bio.PopGen wrappers follow their own route, although they use Bio.Application for the low level API inside). > So, as I understand it, I should file an enhancement bug over at the > Bugzilla site. That would be useful to give us a reference number for tracking it. A lot of your email would make a good introduction to the issue to put in the comment. > In the meantime I can start working on some of the > points listed above. I also need to refresh my memory of using git > since I've gotten in the dirty habit of using svn (assuming this is > all approved)! Is there anything else I need to do for now? Doing your work on a github fork of the Biopython repository would be great (although you may want to start with adding unit tests or doing Python 2.5 changes within standalone pypaml). Peter. From b.invergo at gmail.com Fri Jan 14 08:36:48 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Fri, 14 Jan 2011 14:36:48 +0100 Subject: [Biopython-dev] pypaml In-Reply-To: References: Message-ID: Hi Peter, Thanks for the welcome! > Assuming you wrote all the code (or have your co-authors agreement), > then yes, you can just change the licence. If you want to you can > update the code in your repository and website, maybe make a new > release while you are at it. Alternatively, you could just leave the > standalone pypaml code as it is (under the GPL), but base your > Biopython contributions on it (under the Biopython MIT/BSD licence). I wrote all the code myself so changing it shouldn't be a problem. I tend to license tools with the GPL by habit but I'm not opposed to relicensing it. > I would suggest that you don't make API changes to standalone > pypaml, so as not to disrupt your existing users. However some of > the work like Python 2.5 support might be worth doing there (before > looking at Biopython integration). As a bonus, that should also mean > you can use pypaml under Jython (Python on the JVM). > >> - check coding standards as described in the Contributing to Biopython wiki >> - make some changes to be compatible with Python 2.5: I use @property >> and @x.setter decorator tags which are only 2.6+. I think that's the >> only incompatability > > If so that doesn't sound too hard to update. I think, as it stands, the CODEML api is complete so no real changes need to be made there. As for the decorators, that was actually added in the last commit I made, so rolling back is quite simple. 
>> - double-check the CODEML output parsing for many PAML versions; the >> output is notoriously non-standard from release to release. I may have >> to build some version-checking into the parser. I wrote it based on >> the output of PAML 4.3 > > From Chris Field's comments last year, that may be a lot of work for > relatively little gain. I don't use PAML and have no idea what versions > are typically used though. > http://lists.open-bio.org/pipermail/biopython/2010-September/006760.html I would suggest that we don't support very old versions. Perhaps from 4.x up (currently it's at 4.4c). Most of the parsing is done via regular expressions, so changes in the order of the outputs shouldn't matter. Changes in the wording will. This is something to work on. >> - build some unit tests (I'm new to this in Python so I need to learn >> a bit about that > > We've tried to cover the basics in a chapter in our tutorial, > http://biopython.org/DIST/docs/tutorial/Tutorial.html > http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Thanks I'll check them out > Does that mean you have wrappers for calling the PAML command > line tools? Can you point me at the code for that - I'd like a quick > look to see if it makes sense to switch over to the Bio.Application > based system we're trying to standardise on in Biopython. On the > other hand, if you have a much higher level wrapper maybe it is > fine as it is (e.g. the Bio.PopGen wrappers follow their own route, > although they use Bio.Application for the low level API inside). I use the subprocess library of python to call the command line tool. PAML programs work by calling the tool with a control file as its argument. The control file specifies all of the run arguments, including the data files, output files, and other variables. Basically, pypaml works by dynamically building a control file via properties for the data files and a dictionary for the other variables, running the command line tool with that control file as its parameter, and then grabbing the output file, parsing it and storing the results in a dictionary object. The run() function, line 217, does this: http://code.google.com/p/pypaml/source/browse/trunk/src/pypaml/codeml.py with the actual subprocess call happening at 239/241 (verbose/silent). So, much of the code is dedicated to building the control file and parsing the output. I'm not as familiar with the other PAML programs, but a look through the manual indicates that they operate in a similar manner. (sorry that the code isn't fully commented yet) Ok, well, time to get cracking then. I'll add the Bugzilla item and make some changes in the standalone. I'll then inform the dev-list when things are in better condition for integration! Cheers, Brandon >> So, as I understand it, I should file an enhancement bug over at the >> Bugzilla site. > > That would be useful to give us a reference number for tracking it. > A lot of your email would make a good introduction to the issue to > put in the comment. > >> In the meantime I can start working on some of the >> points listed above. I also need to refresh my memory of using git >> since I've gotten in the dirty habit of using svn (assuming this is >> all approved)! Is there anything else I need to do for now? > > Doing your work on a github fork of the Biopython repository > would be great (although you may want to start with adding unit > tests or doing Python 2.5 changes within standalone pypaml). > > Peter. 
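To make the control-file pattern described above concrete, here is a minimal sketch; it is not pypaml's actual code, the file names and option values are purely illustrative, and the option names follow PAML's documented codeml.ctl format:

    import subprocess

    def run_codeml(ctl_file, seqfile, treefile, outfile, options):
        # Write a minimal codeml control file, one "option = value" pair per line.
        handle = open(ctl_file, "w")
        try:
            handle.write("seqfile = %s\n" % seqfile)
            handle.write("treefile = %s\n" % treefile)
            handle.write("outfile = %s\n" % outfile)
            for key, value in options.items():
                handle.write("%s = %s\n" % (key, value))
        finally:
            handle.close()
        # codeml takes the control file as its single command line argument.
        return subprocess.call(["codeml", ctl_file])

    # Illustrative call (assumes codeml is on the system path):
    # run_codeml("codeml.ctl", "alignment.phy", "tree.nwk", "results.out",
    #            {"seqtype": 1, "model": 0, "NSsites": "0 1 2", "verbose": 1})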
> From p.j.a.cock at googlemail.com Fri Jan 14 08:50:08 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Jan 2011 13:50:08 +0000 Subject: [Biopython-dev] pypaml In-Reply-To: References: Message-ID: On Fri, Jan 14, 2011 at 1:36 PM, Brandon Invergo wrote: > Hi Peter, > Thanks for the welcome! > >> Assuming you wrote all the code (or have your co-authors agreement), >> then yes, you can just change the licence. If you want to you can >> update the code in your repository and website, maybe make a new >> release while you are at it. Alternatively, you could just leave the >> standalone pypaml code as it is (under the GPL), but base your >> Biopython contributions on it (under the Biopython MIT/BSD licence). > > I wrote all the code myself so changing it shouldn't be a problem. I > tend to license tools with the GPL by habit but I'm not opposed to > relicensing it. For standalone projects I also like the GPL, but for libraries LGPL is better. However, in the scientific Python community people have generally followed the Python licence convention and gone with the more flexible MIT/BSD style licence. >> I would suggest that you don't make API changes to standalone >> pypaml, so as not to disrupt your existing users. However some of >> the work like Python 2.5 support might be worth doing there (before >> looking at Biopython integration). As a bonus, that should also mean >> you can use pypaml under Jython (Python on the JVM). >> >>> - check coding standards as described in the Contributing to Biopython wiki >>> - make some changes to be compatible with Python 2.5: I use @property >>> and @x.setter decorator tags which are only 2.6+. I think that's the >>> only incompatability >> >> If so that doesn't sound too hard to update. > > I think, as it stands, the CODEML api is complete so no real changes > need to be made there. As for the decorators, that was actually added > in the last commit I made, so rolling back is quite simple. You can of course define properties, setters, getters etc without using decorators (this is what we do in Biopython). >>> - double-check the CODEML output parsing for many PAML versions; the >>> output is notoriously non-standard from release to release. I may have >>> to build some version-checking into the parser. I wrote it based on >>> the output of PAML 4.3 >> >> From Chris Field's comments last year, that may be a lot of work for >> relatively little gain. I don't use PAML and have no idea what versions >> are typically used though. >> http://lists.open-bio.org/pipermail/biopython/2010-September/006760.html > > I would suggest that we don't support very old versions. Perhaps from > 4.x up (currently it's at 4.4c). Most of the parsing is done via > regular expressions, so changes in the order of the outputs shouldn't > matter. Changes in the wording will. This is something to work on. You may be able to get some comments from any PAML users on the main Biopython discussion list to guide you here. >>> - build some unit tests (I'm new to this in Python so I need to learn >>> a bit about that >> >> We've tried to cover the basics in a chapter in our tutorial, >> http://biopython.org/DIST/docs/tutorial/Tutorial.html >> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf > > Thanks I'll check them out > >> Does that mean you have wrappers for calling the PAML command >> line tools? 
Can you point me at the code for that - I'd like a quick >> look to see if it makes sense to switch over to the Bio.Application >> based system we're trying to standardise on in Biopython. On the >> other hand, if you have a much higher level wrapper maybe it is >> fine as it is (e.g. the Bio.PopGen wrappers follow their own route, >> although they use Bio.Application for the low level API inside). > > I use the subprocess library of python to call the command line tool. > PAML programs work by calling the tool with a control file as its > argument. The control file specifies all of the run arguments, > including the data files, output files, and other variables. > Basically, pypaml works by dynamically building a control file via > properties for the data files and a dictionary for the other > variables, running the command line tool with that control file as its > parameter, and then grabbing the output file, parsing it and storing > the results in a dictionary object. > > The run() function, line 217, does this: > http://code.google.com/p/pypaml/source/browse/trunk/src/pypaml/codeml.py > with the actual subprocess call happening at 239/241 (verbose/silent). > > So, much of the code is dedicated to building the control file and > parsing the output. I'm not as familiar with the other PAML programs, > but a look through the manual indicates that they operate in a similar > manner. (sorry that the code isn't fully commented yet) Having looked at that briefly, since this is a command line tool driven by a configuration input file, rather than command line switches and arguments, I see no reason to bother with using our Bio.Application framework. By the way, have you ever tried using this under Windows? > Ok, well, time to get cracking then. I'll add the Bugzilla item and > make some changes in the standalone. I'll then inform the dev-list > when things are in better condition for integration! That sounds like a plan. Peter From bugzilla-daemon at portal.open-bio.org Fri Jan 14 09:01:13 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 Jan 2011 09:01:13 -0500 Subject: [Biopython-dev] [Bug 3170] New: pypaml Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3170 Summary: pypaml Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: b.invergo at gmail.com PAML (Phylogenetic Analysis by Maximum Likelihood; http://abacus.gene.ucl.ac.uk/software/paml.html) is a package of programs written by Ziheng Yang. The programs are used widely, especially CODEML which is used to estimate evolutionary rate parameters for a given sequence alignment. There is currently a PAML library for BioPerl but, to my knowledge, no such wrapper exists for Python. I have independently written a Python interface to the CODEML program of the PAML package, with the intention of eventually covering all of the programs in the package. You can find my code here: http://code.google.com/p/pypaml/ I believe it would be beneficial to integrate my pypaml package into the main Biopython project and to continue its development as such. Before it can be integrated, some immediate tasks must be done: - change the licensing: currently it's GPL, as described in the code and on the project page. Is it sufficient to simply remove its dedicated project page and change the verbiage in the code? 
- check coding standards as described in the Contributing to Biopython wiki - make some changes to be compatible with Python 2.5: I use @property and @x.setter decorator tags which are only 2.6+. I think that's the only incompatability - double-check the CODEML output parsing for many PAML versions; the output is notoriously non-standard from release to release. I may have to build some version-checking into the parser. I wrote it based on the output of PAML 4.3. I propose that compatibility with only 4.X+ be implemented (current version = 4.4c - build some unit tests (I'm new to this in Python so I need to learn a bit about that I've tried from the start to make it very generalized so I don't think any major changes need to be made. Plus, I think structurally it should be easy to implement the other PAML programs by copying a lot of the code. The output parsing for each program is a different story, though. I will implement many of the above changes first in my stand-alone library before merging it with a branch of the Biopython git repository. Because CODEML appears to be the most commonly used program from the package, for the immediate future it will continue to receive most of the focus, but with time the other programs will be implemented. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From b.invergo at gmail.com Fri Jan 14 09:11:44 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Fri, 14 Jan 2011 15:11:44 +0100 Subject: [Biopython-dev] pypaml In-Reply-To: References: Message-ID: > You can of course define properties, setters, getters etc without using > decorators (this is what we do in Biopython). That's how I had it before. I decided to switch over to new-style classes and in reading up on that topic I came across the decorators and I became a bit excited to implement them. They're certainly not necessary, as you say. > By the way, have you ever tried using this under Windows? I haven't yet but by the looks of it it should work fine assuming the programs are in the system path and thus can be called by name from any location in the file system. I see one line where I accidentally made it *nix-specific (default working directory is "./") but other than that, all files/directories are located via os.path or by user-inputted strings (as they would be in the control file). I have both a Linux and a Windows 7 machine at home though so I can do some testing. Obviously the unit tests here will help catch system-specific errors such as entering file locations incorrectly (I can see a few exceptions that I'm currently not handling). Once I make a couple of the core changes, I'll send a message to the main Biopython list to get some people to try it out and to let me know how it works (esp. re: version numbers and parsing) as well as to indicate if I currently don't support something they want to do. 
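For reference, the decorator-free property style being discussed looks roughly like this (the class and attribute names here are just for illustration):

    class Codeml(object):
        def __init__(self):
            self._alignment = None

        def _get_alignment(self):
            return self._alignment

        def _set_alignment(self, value):
            self._alignment = value

        # Equivalent to using @property plus @alignment.setter, but also
        # works on Python 2.5, which lacks the setter decorator.
        alignment = property(_get_alignment, _set_alignment)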
Regards, Brandon From bugzilla-daemon at portal.open-bio.org Fri Jan 14 09:18:26 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 Jan 2011 09:18:26 -0500 Subject: [Biopython-dev] [Bug 3170] Integration of external package: pypaml In-Reply-To: Message-ID: <201101141418.p0EEIQVi030815@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3170 b.invergo at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|pypaml |Integration of external | |package: pypaml -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jan 14 09:27:51 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 Jan 2011 09:27:51 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101141427.p0EERpkN032207@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 macrozhu+biopy at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1561 is|0 |1 obsolete| | ------- Comment #6 from macrozhu+biopy at gmail.com 2011-01-14 09:27 EST ------- Created an attachment (id=1562) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1562&action=view) temp file created by tempfile.mkstemp() needs os.close() I realize that temp files created using tempfile.mkstemp() needs to be closed using os.close(). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chapmanb at 50mail.com Fri Jan 14 10:40:35 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 14 Jan 2011 10:40:35 -0500 Subject: [Biopython-dev] pypaml In-Reply-To: References: Message-ID: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Brandon; It's great you are looking to contribute your CODEML wrappers to Biopython. It looks like really useful functionality. Peter tackled most of the high level details so I'll chime in with a few more detailed suggestions. > I use the subprocess library of python to call the command line tool. > PAML programs work by calling the tool with a control file as its > argument. The control file specifies all of the run arguments, > including the data files, output files, and other variables. > Basically, pypaml works by dynamically building a control file via > properties for the data files and a dictionary for the other > variables, running the command line tool with that control file as its > parameter, and then grabbing the output file, parsing it and storing > the results in a dictionary object. > > The run() function, line 217, does this: > http://code.google.com/p/pypaml/source/browse/trunk/src/pypaml/codeml.py > with the actual subprocess call happening at 239/241 (verbose/silent). The functionality here looks great. My stylistic suggestion would be to separate the code for running the commandline from that used to parse the output file. 
Ideally these would be two separate classes that could live under the Bio.Phylo namespace: https://github.com/biopython/biopython/tree/master/Bio/Phylo For the commandline code, it would be nice to have a Bio.Phylo.Applications that is organized similar to Bio.Align.Applications: https://github.com/biopython/biopython/tree/master/Bio/Align/Applications This will give you some flexibility as you want to expand out to support other programs, and provide a framework for additional phylogenetic commandline utilities. Eric might have some suggestions about the best module name to use for the parsing code as he has been managing the Phylo namespace. Separating parsing from commandline generation can also let you move the _results dictionary from being a class member to a return value for a parse function. This is a bit more straightforward workflow instead of having the side-effect of assigning an internal class attribute. Thanks again for contributing, Brad From b.invergo at gmail.com Fri Jan 14 10:56:06 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Fri, 14 Jan 2011 16:56:06 +0100 Subject: [Biopython-dev] pypaml In-Reply-To: <20110114154035.GC30193@sobchak.mgh.harvard.edu> References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: Brad, Thanks for your comments! This is something I hinted at in my original post about structural rearrangements; what I meant is that I'm largely unfamiliar with the structure/organization of the Biopython source and how one works within it. I have admittedly not used Biopython much yet in my own research so I don't have much experience with it. Your tips on where pypaml would fit in the namespace are very helpful. Already from thinking about integrating with the project I had thought about separating the parsing engine, especially once I have to start doing version checking. The code may grow quickly for that and it does make more organizational sense to move it. Your comments also indirectly made me realize that eventually it would be nice to be able to run the software on Bio.Align objects and Bio.Phylo objects as inputs (they would have to be written to temporary text files so that PAML could read them). Well, that'll come in time, but it's a thought. It looks like I have a lot of Biopython-related studying to do for homework! Luckily, these things excite me... Cheers, Brandon On Fri, Jan 14, 2011 at 4:40 PM, Brad Chapman wrote: > Brandon; > It's great you are looking to contribute your CODEML wrappers to > Biopython. It looks like really useful functionality. Peter tackled > most of the high level details so I'll chime in with a few more > detailed suggestions. > >> I use the subprocess library of python to call the command line tool. >> PAML programs work by calling the tool with a control file as its >> argument. The control file specifies all of the run arguments, >> including the data files, output files, and other variables. >> Basically, pypaml works by dynamically building a control file via >> properties for the data files and a dictionary for the other >> variables, running the command line tool with that control file as its >> parameter, and then grabbing the output file, parsing it and storing >> the results in a dictionary object. >> >> The run() function, line 217, does this: >> http://code.google.com/p/pypaml/source/browse/trunk/src/pypaml/codeml.py >> with the actual subprocess call happening at 239/241 (verbose/silent). > > The functionality here looks great. 
My stylistic suggestion would be > to separate the code for running the commandline from that used to > parse the output file. Ideally these would be two separate classes > that could live under the Bio.Phylo namespace: > > https://github.com/biopython/biopython/tree/master/Bio/Phylo > > For the commandline code, it would be nice to have a > Bio.Phylo.Applications that is organized similar to > Bio.Align.Applications: > > https://github.com/biopython/biopython/tree/master/Bio/Align/Applications > > This will give you some flexibility as you want to expand out to > support other programs, and provide a framework for additional > phylogenetic commandline utilities. > > Eric might have some suggestions about the best module name to use > for the parsing code as he has been managing the Phylo namespace. > > Separating parsing from commandline generation can also let you move > the _results dictionary from being a class member to a return value for > a parse function. This is a bit more straightforward workflow > instead of having the side-effect of assigning an internal class > attribute. > > Thanks again for contributing, > Brad > > From biopython at maubp.freeserve.co.uk Fri Jan 14 14:02:50 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Jan 2011 19:02:50 +0000 Subject: [Biopython-dev] test_PhyloXML.py on Python 3 In-Reply-To: References: Message-ID: On Fri, Aug 13, 2010 at 2:24 AM, Eric Talevich wrote: > On Thu, Aug 12, 2010 at 12:37 PM, Peter wrote: > >> Hi Eric (et al), >> >> Is test_PhyloXML.py working for you under Python 3? >> >> I'm getting the following (both with and without the 2to3 --nofix=long >> option): >> >> $ python3 test_PhyloXML.py >> ... >> ?File "/home/xxx/lib/python3.1/site-packages/Bio/Phylo/PhyloXMLIO.py", >> line 298, in __init__ >> ? ?event, root = next(context) >> ?File "", line 59, in __iter__ >> TypeError: invalid event tuple >> >> ---------------------------------------------------------------------- >> Ran 47 tests in 0.015s >> >> All the sub-tests in test_PhyloXML.py are failing the same way. >> >> >From memory this was working recently. >> >> > Yeah, it was... it's fixed now/again. > > This is the issue with passing byte/unicode strings to cElementTree in > Python 3. I had a check for Python versions 3.0.0 through 3.1.1, where we > need to import ElementTree instead of cElementTree. Apparently Python 3.1.2 > still has the bug. > > -Eric It looks like Python 3.1.3 also has the same bug in cElementTree :( See this buildbot-slave log for example, http://events.open-bio.org:8010/builders/Linux%2064%20-%20Python%203.1/builds/96/steps/shell_3/logs/stdio I've extended the workaround again: https://github.com/biopython/biopython/commit/105444a340a2ad0e48c8582864104333b90adfc0 Let's see if we get any progress on the Python bug itself, http://bugs.python.org/issue9257 Peter From eric.talevich at gmail.com Sat Jan 15 00:35:48 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 15 Jan 2011 00:35:48 -0500 Subject: [Biopython-dev] pypaml In-Reply-To: <20110114154035.GC30193@sobchak.mgh.harvard.edu> References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: Hi Brandon, Thanks for volunteering! I think this will be a nice addition to Biopython and particularly Bio.Phylo. Some thoughts on organization: On Fri, Jan 14, 2011 at 10:40 AM, Brad Chapman wrote: > > The functionality here looks great. My stylistic suggestion would be > to separate the code for running the commandline from that used to > parse the output file. 
Ideally these would be two separate classes > that could live under the Bio.Phylo namespace: > > https://github.com/biopython/biopython/tree/master/Bio/Phylo > I agree. For the commandline code, it would be nice to have a > Bio.Phylo.Applications that is organized similar to > Bio.Align.Applications: > > https://github.com/biopython/biopython/tree/master/Bio/Align/Applications > > This will give you some flexibility as you want to expand out to > support other programs, and provide a framework for additional > phylogenetic commandline utilities. > Since it sounds like you might eventually write wrappers for other programs in the PAML suite, a layout like this might work: Bio/Phylo/Applications/_codeml.py -- just the wrapper for running the command-line program, perhaps based on the Bio.Application classes. The API for calling the wrapper goes through __init__.py; the user doesn't import this module directly. (See Bio.Align.Applications) Bio/Phylo/PAML/codeml.py -- all the code for parsing the output of the command-line program, and working with that dictionary/class. Any other modules this depends on would also go here, as would the other code for working with the input/output of other PAML programs. Separating parsing from commandline generation can also let you move > the _results dictionary from being a class member to a return value for > a parse function. This is a bit more straightforward workflow > instead of having the side-effect of assigning an internal class > attribute. > Yes. Also, the user might have saved the output from a codeml run previously (maybe from a shell script/pipeline), and want to parse it without re-running codeml through a Python wrapper. Right? (Sorry if I misunderstood your code.) I look forward to seeing your branch on GitHub. Please let us know if you have any problems along the way. All the best, Eric From p.j.a.cock at googlemail.com Sat Jan 15 06:56:23 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 15 Jan 2011 11:56:23 +0000 Subject: [Biopython-dev] pypaml In-Reply-To: References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: On Sat, Jan 15, 2011 at 5:35 AM, Eric Talevich wrote: > Hi Brandon, > > Thanks for volunteering! I think this will be a nice addition to Biopython > and particularly Bio.Phylo. > > Some thoughts on organization: > > On Fri, Jan 14, 2011 at 10:40 AM, Brad Chapman wrote: > >> >> The functionality here looks great. My stylistic suggestion would be >> to separate the code for running the commandline from that used to >> parse the output file. Ideally these would be two separate classes >> that could live under the Bio.Phylo namespace: >> >> https://github.com/biopython/biopython/tree/master/Bio/Phylo >> > > I agree. That sounds good. This will be a big change for anyone already using the stand alone pypaml - but some changes are unavoidable. > For the commandline code, it would be nice to have a >> Bio.Phylo.Applications that is organized similar to >> Bio.Align.Applications: >> >> https://github.com/biopython/biopython/tree/master/Bio/Align/Applications >> >> This will give you some flexibility as you want to expand out to >> support other programs, and provide a framework for additional >> phylogenetic commandline utilities. >> > > Since it sounds like you might eventually write wrappers for other programs > in the PAML suite, a layout like this might work: > > Bio/Phylo/Applications/_codeml.py > ?-- just the wrapper for running the command-line program, perhaps based on > the Bio.Application classes. 
The API for calling the wrapper goes through > __init__.py; the user doesn't import this module directly. (See > Bio.Align.Applications) > Roughly how many applications are there in PAML? What Brad and Eric have outlined would work fine, but we could opt for something a little different, like the namespace Bio.Phylo.Applications for general tools (there are some tree building tools I could write wrappers for - using the same setup as Bio.Align.Applications), and have namespace Bio.Phylo.Applications.PAML for the PAML wrappers. Another reason to separate them is they won't be using the simple Bio.Application framework (due to the way PAML options must be specified via input files). > > Bio/Phylo/PAML/codeml.py > ?-- all the code for parsing the output of the command-line program, and > working with that dictionary/class. Any other modules this depends on would > also go here, as would the other code for working with the input/output of > other PAML programs. > > >> Separating parsing from commandline generation can also let you move >> the _results dictionary from being a class member to a return value for >> a parse function. This is a bit more straightforward workflow >> instead of having the side-effect of assigning an internal class >> attribute. >> > > Yes. Also, the user might have saved the output from a codeml run > previously (maybe from a shell script/pipeline), and want to parse it > without re-running codeml through a Python wrapper. Right? (Sorry > if I misunderstood your code.) > > I look forward to seeing your branch on GitHub. Please let us know > if you have any problems along the way. > > All the best, > Eric Thanks for your comments Brad and Eric :) Peter From b.invergo at gmail.com Sat Jan 15 07:20:05 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Sat, 15 Jan 2011 13:20:05 +0100 Subject: [Biopython-dev] pypaml In-Reply-To: References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: I'll reply to both Eric and Peter in this one... >>> The functionality here looks great. My stylistic suggestion would be >>> to separate the code for running the commandline from that used to >>> parse the output file. Ideally these would be two separate classes >>> that could live under the Bio.Phylo namespace: >>> >>> https://github.com/biopython/biopython/tree/master/Bio/Phylo >>> >> >> I agree. > > That sounds good. This will be a big change for anyone already > using the stand alone pypaml - but some changes are unavoidable. I plan to make a tag of the current version on Google Code and then branch it and start making these structural changes. I'll put a notice on the main page to let the users know how things will be changing as I prepare to migrate to Biopython. It'll be a slow, steady process. >> For the commandline code, it would be nice to have a >>> Bio.Phylo.Applications that is organized similar to >>> Bio.Align.Applications: >>> >>> https://github.com/biopython/biopython/tree/master/Bio/Align/Applications >>> >>> This will give you some flexibility as you want to expand out to >>> support other programs, and provide a framework for additional >>> phylogenetic commandline utilities. >>> >> >> Since it sounds like you might eventually write wrappers for other programs >> in the PAML suite, a layout like this might work: >> >> Bio/Phylo/Applications/_codeml.py >> ?-- just the wrapper for running the command-line program, perhaps based on >> the Bio.Application classes. 
The API for calling the wrapper goes through >> __init__.py; the user doesn't import this module directly. (See >> Bio.Align.Applications) >> > > Roughly how many applications are there in PAML? What Brad and > Eric have outlined would work fine, but we could opt for something > a little different, like the namespace Bio.Phylo.Applications for > general tools (there are some tree building tools I could write > wrappers for - using the same setup as Bio.Align.Applications), > and have namespace Bio.Phylo.Applications.PAML for the PAML > wrappers. Another reason to separate them is they won't be > using the simple Bio.Application framework (due to the way > PAML options must be specified via input files). There are 8 programs in PAML. Copied from the manual:
- Comparison and tests of phylogenetic trees (baseml and codeml);
- Estimation of parameters in sophisticated substitution models, including models of variable rates among sites and models for combined analysis of multiple genes or site partitions (baseml and codeml);
- Likelihood ratio tests of hypotheses through comparison of implemented models (baseml, codeml, chi2);
- Estimation of divergence times under global and local clock models (baseml and codeml);
- Likelihood (Empirical Bayes) reconstruction of ancestral sequences using nucleotide, amino acid and codon models (baseml and codeml);
- Generation of datasets of nucleotide, codon, and amino acid sequence by Monte Carlo simulation (evolver);
- Estimation of synonymous and nonsynonymous substitution rates and detection of positive selection in protein-coding DNA sequences (yn00 and codeml).
- Bayesian estimation of species divergence times incorporating uncertainties in fossil calibrations (mcmctree).
>> Yes. Also, the user might have saved the output from a codeml run >> previously (maybe from a shell script/pipeline), and want to parse it >> without re-running codeml through a Python wrapper. Right? (Sorry >> if I misunderstood your code.) Actually, it currently does support doing this. The parse_results() function takes a string filename as an argument so you can call it without having run any analyses yet. Still, it makes more sense to make the parser a separate class. What I'm torn about is to either have a single PAML parser class or to have separate parsers for each program. The output files contain the program name in the first line so it's simple enough to determine what kind of output you're looking at, but the code might get a bit long and cumbersome. Thanks for the input everyone. I'll have a lot of things done this weekend I hope (it's a busy one with other projects at the same time). Cheers, Brandon From eric.talevich at gmail.com Sat Jan 15 13:23:19 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 15 Jan 2011 13:23:19 -0500 Subject: [Biopython-dev] pypaml In-Reply-To: References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: On Sat, Jan 15, 2011 at 7:20 AM, Brandon Invergo wrote: > I'll reply to both Eric and Peter in this one... > > >>> The functionality here looks great. My stylistic suggestion would be > >>> to separate the code for running the commandline from that used to > >>> parse the output file. Ideally these would be two separate classes > >>> that could live under the Bio.Phylo namespace: > >>> > >>> https://github.com/biopython/biopython/tree/master/Bio/Phylo > >>> > >> > >> I agree. > > > > That sounds good. This will be a big change for anyone already > > using the stand alone pypaml - but some changes are unavoidable.
> > I plan to make a tag of the current version on Google Code and then > branch it and start making these structural changes. I'll put a notice > on the main page to let the users know how things will be changing as > I prepare to migrate to Biopython. It'll be a slow, steady process. > Sounds great to me! Slow and steady wins the race. >> For the commandline code, it would be nice to have a > >>> Bio.Phylo.Applications that is organized similar to > >>> Bio.Align.Applications: > >>> > >>> > https://github.com/biopython/biopython/tree/master/Bio/Align/Applications > >>> > >>> This will give you some flexibility as you want to expand out to > >>> support other programs, and provide a framework for additional > >>> phylogenetic commandline utilities. > >>> > >> > >> Since it sounds like you might eventually write wrappers for other > programs > >> in the PAML suite, a layout like this might work: > >> > >> Bio/Phylo/Applications/_codeml.py > >> -- just the wrapper for running the command-line program, perhaps based > on > >> the Bio.Application classes. The API for calling the wrapper goes > through > >> __init__.py; the user doesn't import this module directly. (See > >> Bio.Align.Applications) > >> > > > > Roughly how many applications are there in PAML? What Brad and > > Eric have outlined would work fine, but we could opt for something > > a little different, like the namespace Bio.Phylo.Applications for > > general tools (there are some tree building tools I could write > > wrappers for - using the same setup as Bio.Align.Applications), > > and have namespace Bio.Phylo.Applications.PAML for the PAML > > wrappers. Another reason to separate them is they won't be > > using the simple Bio.Application framework (due to the way > > PAML options must be specified via input files). > If the Bio.Applications framework won't work for codeml/PAML then I think it would be misleading to put any pypaml code under Bio.Phylo.Applications (at least for now). Later we might find a way to put PAML options into named temporary files and run the command-line applications that way, but that's probably not a priority yet. *Code Philosophy*: It's my understanding that tightly nested namespaces are nicer for library developers, but flatter namespaces are nicer for the users of those libraries (especially those who don't use full-featured IDEs like Eclipse). Python lets us have it both ways, to some extent, by importing protected modules to a higher-level namespace. See if you agree with examples like these: # Common functionality and generalized tree I/O is available at the top level >>> from Bio import Phylo # Everything under *.Applications directly uses the Bio.Application framework >>> from Bio.Phylo.Applications import RAxMLCommandline >>> from Bio.Phylo.Applications import MrBayesCommandline # Extra functionality for a popular application suite goes in a separate sub-package >>> from Bio.Phylo.PAML import codeml # A namespace for web services reminds the user that the network will be used >>> from Bio.Phylo.WWW import Dryad This is basically what we proposed with Bio.Struct for GSoC 2010, and I don't think any of it contradicts the existing conventions of Bio.Align. Namespace collisions are unlikely: the sub-packages would generally be either support for new file formats or helpers for application suites, and those would only match if an application suite defined its own file formats -- in which case the modules do belong under the same sub-package. >> Yes. 
Also, the user might have saved the output from a codeml run > >> previously (maybe from a shell script/pipeline), and want to parse it > >> without re-running codeml through a Python wrapper. Right? (Sorry > >> if I misunderstood your code.) > > Actually, it currently does support doing this. The parse_results() > function takes a string filename as an argument so you can call it > without having run any analyses yet. Still, it makes more sense to > make the parser a separate class. What I'm torn about is to either > have a single PAML parser class or to have separate parsers for each > program. The output files contain the program name in the first line > so it's simple enough to determine what kind of output you're looking > at, but the code might get a bit long and cumbersome. > I'd recommend splitting the parsers into separate modules. Small functions and classes are much easier to maintain. If everyone agrees with this layout, I'd suggest putting your existing __init__.py and codeml.py under Bio/Phylo/PAML/. Inside codeml.py, I'd suggest: 1. Have the run() method raise an exception when the subprocess return code is non-zero, instead of returning the subprocess return code directly (try subprocess.check_call in place of subprocess.call, or see Bio/Application/__init__.py). Most of the time the user will want to throw an error of some sort if the command line fails; this is more direct. Then, since run() no longer needs to return an integer, it's free to return the results dictionary instead. 2. Change parse_results() to return a dictionary, rather than setting it on self.results. So the run() function retrieves this dictionary by calling parse_results(), then returns it (after chdir'ing). 3. Now that parse_results() doesn't need direct access to self._results, move it out of the codeml class and rename it as a standalone function: def read(results_file, version=None): ... Any other optional info that parse_results()/read() needs can also be passed as keyword arguments -- I'm not sure if I missed any places where that's occurring. This is the same overall change Brad was suggesting, I think. It also brings the style of pypaml/codeml pretty much in line with how Biopython and Bio.Application work, so further integration would be easier in the future. Best, Eric From b.invergo at gmail.com Sun Jan 16 09:19:13 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Sun, 16 Jan 2011 15:19:13 +0100 Subject: [Biopython-dev] pypaml In-Reply-To: References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: Hi everyone, A quick question about style: since the name "codeml" is based on a program which is always spelled either in all caps or in all lower-case, what would be the best way to write the class name regarding capitalization? Stick with the usual camel-case convention, "Codeml", anyway? Things are progressing nicely. I've already taken care of a lot of the minor tasks and improvements... Cheers, Brandon From p.j.a.cock at googlemail.com Sun Jan 16 10:09:07 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 16 Jan 2011 15:09:07 +0000 Subject: [Biopython-dev] pypaml In-Reply-To: References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: On Sun, Jan 16, 2011 at 2:19 PM, Brandon Invergo wrote: > Hi everyone, > A quick question about style: since the name "codeml" is based on a > program which is always spelled either in all caps or in all > lower-case, what would be the best way to write the class name > regarding capitalization? 
Stick with the usual camel-case convention, > "Codeml", anyway? I'd go with Codeml for a class name (or something like CodemlResult or whatever). Neither CODEML nor codeml seem good class names in Python. > Things are progressing nicely. I've already taken care of a lot of the > minor tasks and improvements... Sounds good :) Peter From biopython at maubp.freeserve.co.uk Sun Jan 16 19:38:05 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Jan 2011 00:38:05 +0000 Subject: [Biopython-dev] Calling 2to3 from setup.py In-Reply-To: References: Message-ID: On Tue, Jan 4, 2011 at 11:30 PM, Peter wrote: > On Tue, Jan 4, 2011 at 10:43 PM, Peter wrote: >> Hi all, >> >> Something we've talked about before is calling lib2to3 or the 2to3 script >> from within setup.py to make installing Biopython simpler on Python 3. >> >> ... >> >> I then looked at how NumPy are doing this, and they have a hook >> in setup.py which calls their own Python script called py3tool.py to >> do the conversion, ... it will not bother to reconvert previously >> converted but unchanged files. ... > > ... So > on reflection, using the time stamp to decide if 2to3 needs to > be rerun is probable quite sufficient (and will be faster too). > > Peter > I switched to the simpler approach used by NumPy (just look at the last modified timestamp) and committed this: https://github.com/biopython/biopython/commit/1eeed11aefc54787fb836a6b3b5f4c82628edef4 I've had to tweak the buildbot accordingly: it no longer needs to call the 2to3 script, and the tests must be run from the converted version rather than the original Python 2 version of the code. Peter From bugzilla-daemon at portal.open-bio.org Mon Jan 17 05:11:31 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 17 Jan 2011 05:11:31 -0500 Subject: [Biopython-dev] [Bug 3168] different StringIO import for Python 3 In-Reply-To: Message-ID: <201101171011.p0HABVp5004830@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3168 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-17 05:11 EST ------- Which version of Python 3 do you have? We're testing on Python 3.1 and 3.2 at the moment, ignoring Python 3.0. Also, could you retry using the latest Biopython from git? This now calls 2to3 automatically from setup.py which is much simpler than manually calling 2to3 at the command line. Thanks Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
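As an aside, the timestamp check adopted in the setup.py discussion above amounts to something like this sketch (the function and argument names are invented; the real build code also has to cope with files being added or removed):

    import os

    def needs_2to3(source, converted):
        # Re-run the conversion only if there is no converted copy yet,
        # or the Python 2 source has changed since the copy was made.
        if not os.path.exists(converted):
            return True
        return os.path.getmtime(source) > os.path.getmtime(converted)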
From biopython at maubp.freeserve.co.uk Mon Jan 17 08:54:32 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Jan 2011 13:54:32 +0000 Subject: [Biopython-dev] curious error from HMM unit test Message-ID: Hi all, There was a curious failure under one of the buildslaves running Jython 2.5.2rc3 on 64bit Linux: http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/6/steps/shell/logs/stdio ====================================================================== ERROR: test_HMMCasino ---------------------------------------------------------------------- Traceback (most recent call last): File "run_tests.py", line 314, in runTest suite = unittest.TestLoader().loadTestsFromName(name) File "/home/buildslave/jython2.5.2rc3/Lib/unittest.py", line 533, in loadTestsFromName module = __import__('.'.join(parts_copy)) File "/home/buildslave/jython2.5.2rc3/Lib/unittest.py", line 533, in loadTestsFromName module = __import__('.'.join(parts_copy)) File "/home/buildslave/BuildBot/jython252lin64/build/Tests/test_HMMCasino.py", line 180, in trained_mm = trainer.train([training_seq], stop_training) File "/home/buildslave/BuildBot/jython252lin64/build/Bio/HMM/Trainer.py", line 212, in train emission_count = self.update_emissions(emission_count, File "/home/buildslave/BuildBot/jython252lin64/build/Bio/HMM/Trainer.py", line 332, in update_emissions expected_times += (forward_vars[(k, i)] * UnboundLocalError: local variable 'k' referenced before assignment ---------------------------------------------------------------------- The error message is rather odd, since k is defined as the outer loop variable, see: https://github.com/biopython/biopython/blob/master/Bio/HMM/Trainer.py This is in the HMM code, so due to the stochastic nature we expect each run of test may take slightly different branches through the code. We haven't altered this code for a while, so this is either a long standing issue, or perhaps indicative of a problem in Jython 2.5.2rc3 instead (although I don't see any open bugs relevant). I've logged into the buildslave and re-run test_HMMCasino.py under Jython about 30 times - all fine. Rerunning the build also came up clear. Has anyone got any insight into what might be going on? Unless it happens again maybe it was a fluke (cosmic ray, bad ram, etc). Peter From eric.talevich at gmail.com Mon Jan 17 11:17:14 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 17 Jan 2011 11:17:14 -0500 Subject: [Biopython-dev] curious error from HMM unit test In-Reply-To: References: Message-ID: On Mon, Jan 17, 2011 at 8:54 AM, Peter wrote: > Hi all, > > There was a curious failure under one of the buildslaves running > Jython 2.5.2rc3 on 64bit Linux: > > http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/6/steps/shell/logs/stdio > > [...] > > The error message is rather odd, since k is defined as the outer loop > variable, see: > https://github.com/biopython/biopython/blob/master/Bio/HMM/Trainer.py > > This is in the HMM code, so due to the stochastic nature we expect > each run of test may take slightly different branches through the > code. We haven't altered this code for a while, so this is either a > long standing issue, or perhaps indicative of a problem in Jython > 2.5.2rc3 instead (although I don't see any open bugs relevant). I've > logged into the buildslave and re-run test_HMMCasino.py under Jython > about 30 times - all fine. Rerunning the build also came up clear. 
> > Has anyone got any insight into what might be going on? Unless it > happens again maybe it was a fluke (cosmic ray, bad ram, etc). > It could have been an exotic bug in Jython (or its interactions with the JVM) where the JIT or garbage collector is removing local variables too early. I don't see how you could provide a "fix" for it in Biopython, since k definitely exists at that point in the loop in any valid Python and Jython almost always handles it correctly. Maybe you could seed the RNG at the start of the unit test to ensure the same paths are always taken? -E From biopython at maubp.freeserve.co.uk Mon Jan 17 11:37:03 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Jan 2011 16:37:03 +0000 Subject: [Biopython-dev] curious error from HMM unit test In-Reply-To: References: Message-ID: On Mon, Jan 17, 2011 at 4:17 PM, Eric Talevich wrote: > > It could have been an exotic bug in Jython (or its interactions with the > JVM) where the JIT or garbage collector is removing local variables too > early. I don't see how you could provide a "fix" for it in Biopython, since > k definitely exists at that point in the loop in any valid Python and Jython > almost always handles it correctly. > Good point - maybe that is the most likely explanation. > > Maybe you could seed the RNG at the start of the unit test to ensure the > same paths are always taken? > We could do, but as part of that I'd want to increase the test coverage to ensure most of the HMM code is actually covered in the single non- stochastic run. As no-one is actively looking after it, so I'd rather not touch the tests. Peter From biopython at maubp.freeserve.co.uk Mon Jan 17 13:04:27 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Jan 2011 18:04:27 +0000 Subject: [Biopython-dev] Python 3 and Bio.SeqIO.index() In-Reply-To: References: Message-ID: Continuing a thread back in July 2010, http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008004.html On Thu, Jul 15, 2010 at 2:31 PM, Peter wrote: > > I think the clear message (from both Windows and Linux) is that for > Bio.SeqIO.index() to perform at a tolerable speed on Python 3 we > can't use the default text mode with unicode strings, we are going > to have to use binary mode with bytes. > I've now done that - which brings the time for test_SeqIO_index.py down for Python 3.x to roughly the same as Python 2.x (about a four fold speed up). Under Windows there may also be a slight speed up for Python 2, while on Linux/Mac there could be a slight slow down. I expect we can work on this. The good news is this yet another step towards supporting Biopython under Python 3. 
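For the curious, the reason binary mode matters is that the index is essentially a dictionary mapping record identifiers to file offsets, built in one scan of the raw bytes. A much simplified FASTA-only sketch (not the actual Bio.SeqIO.index code) would be:

    def fasta_offsets(filename):
        # Record the offset of each ">" line so a record can be re-read on
        # demand later, instead of holding everything in memory.
        offsets = {}
        handle = open(filename, "rb")  # bytes, not unicode strings
        while True:
            start = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    offsets[fields[0].decode("ascii")] = start
        handle.close()
        return offsets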
Peter From bugzilla-daemon at portal.open-bio.org Tue Jan 18 06:11:51 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 Jan 2011 06:11:51 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101181111.p0IBBpVh029286@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 macrozhu+biopy at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1562 is|0 |1 obsolete| | ------- Comment #7 from macrozhu+biopy at gmail.com 2011-01-18 06:11 EST ------- Created an attachment (id=1563) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1563&action=view) handle DSSP shifted output when there are too many res in a PDB In this version, one more issue is fixed when reading DSSP output: Sometimes DSSP output is not in the correct format. Column 34 to 38, which are supposed to be occupied by solvent accessibility (acc), are partially occupied by the elongated field before it. Thus, the conversion of the string in col 34 to 38 to integer will raise a ValueError exception. e.g. 3kic chain T res 321, or 1VSY chain T res 6077. >python DSSP.py 3kic.pdb In such cases, the acc value, and all the following values in the same line are shifted to the right because residue sequence number is too long (more than 4 digits). Normally, only 4 digits are allocated to seq num in DSSP output. When there are too many residues, this problems appears. Now the ValueError exception is caught and the line is re-examined for shifted acc values. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 18 06:14:38 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 Jan 2011 06:14:38 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101181114.p0IBEclN029403@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #8 from macrozhu+biopy at gmail.com 2011-01-18 06:14 EST ------- the code had been tested on more than 24,000 PDB files (a subset of the PDB) and it seems it works well. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 18 12:10:02 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 Jan 2011 12:10:02 -0500 Subject: [Biopython-dev] [Bug 3168] different StringIO import for Python 3 In-Reply-To: Message-ID: <201101181710.p0IHA2gp002494@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3168 michael.kuhn at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WORKSFORME ------- Comment #4 from michael.kuhn at gmail.com 2011-01-18 12:10 EST ------- Ok, so the 2to3 from Python 3.1.3 handles this correctly, while Python 3.1.1 fails. So I'm closing this bug, though perhaps you can update the install instructions to require Python >= 3.1.3. 
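For anyone stuck on an older 2to3, one common way to sidestep this particular problem in a single code base is the try/except import idiom:

    try:
        from StringIO import StringIO  # Python 2
    except ImportError:
        from io import StringIO  # Python 3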
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 18 23:02:26 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 Jan 2011 23:02:26 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101190402.p0J42QvK002446@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #9 from eric.talevich at gmail.com 2011-01-18 23:02 EST ------- Created an attachment (id=1564) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1564&action=view) Patch based on Hongbo Zhu's attachment I adapted the previous code to the current Biopython head. Of note: - the pseudo-doctest in the DSSP class docstring wasn't runnable. I fixed that. - Instead of mkstemp I used NamedTemporary file -- this way Python will delete the tempfile automatically when the handle is closed - Some code compression, but I kept most of Hongbo Zhu's comments intact -- I found them useful - I tweaked the test script at the end of the file Tested this with PDB 1MOT, 3KIC, 1VSY, 3LLT, 3NR9 and checked with pylint. Seems OK to me. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 19 06:00:09 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 19 Jan 2011 06:00:09 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101191100.p0JB09fZ020533@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #10 from macrozhu+biopy at gmail.com 2011-01-19 06:00 EST ------- Hi, Eric, thanks for the nice code compression :) A mistake in my comments among the code ought to be corrected: the "9999 atoms" should be "9999 residues". In addition, a quick search shows that biopython 1.56 uses still tempfile.mktemp() in the following files. ./Scripts/xbbtools/xbb_blastbg.py ./Tests/test_PhyloXML.py ./Bio/GFF/GenericTools.py ./Bio/PDB/DSSP.py ./Bio/PDB/ResidueDepth.py ./Bio/PDB/NACCESS.py tempfile.mktemp() is deprecated since python release 2.3 https://github.com/biopython/biopython And even python 2.3 support had been dropped by biopython :-) best hongbo -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 19 22:46:02 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 19 Jan 2011 22:46:02 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101200346.p0K3k2Fi023153@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #11 from eric.talevich at gmail.com 2011-01-19 22:46 EST ------- (In reply to comment #10) > Hi, Eric, > thanks for the nice code compression :) Cheers. Have you been able to test this patch, and does it satisfy? Should I commit this change to the Biopython trunk, then? 
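As a side note, the difference between the two tempfile approaches discussed in this bug is roughly the following (a generic illustration, not the actual DSSP wrapper code):

    import os
    import tempfile

    # mkstemp() returns a raw OS-level file descriptor which must be closed
    # explicitly, and the file itself removed by hand afterwards:
    fd, path = tempfile.mkstemp()
    os.close(fd)
    # ... run the external tool on 'path' here ...
    os.remove(path)

    # NamedTemporaryFile wraps this up: the file is deleted automatically
    # as soon as the handle is closed.
    tmp = tempfile.NamedTemporaryFile(mode="w")
    tmp.write("input for the external tool\n")
    tmp.flush()
    # ... run the external tool on tmp.name here ...
    tmp.close()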
> A mistake in my comments among the code ought to be corrected: the "9999 atoms" > should be "9999 residues". OK, I was suspicious of that comment too. I've fixed it on my local branch. > In addition, a quick search shows that biopython 1.56 uses still > tempfile.mktemp() in the following files. [...] Thanks, I've made a note of it. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Jan 20 07:04:23 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Jan 2011 12:04:23 +0000 Subject: [Biopython-dev] curious error from HMM unit test In-Reply-To: References: Message-ID: On Mon, Jan 17, 2011 at 4:37 PM, Peter wrote: > On Mon, Jan 17, 2011 at 4:17 PM, Eric Talevich wrote: >> >> It could have been an exotic bug in Jython (or its interactions with the >> JVM) where the JIT or garbage collector is removing local variables too >> early. I don't see how you could provide a "fix" for it in Biopython, since >> k definitely exists at that point in the loop in any valid Python and Jython >> almost always handles it correctly. >> > > Good point - maybe that is the most likely explanation. > It has happened again on the same install of Jython 2.5.2rc3 on 64bit Linux, previously on build 6, now on build 12: http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/12/steps/shell/logs/stdio Again, repeating the build made the error go away - but the load on the machine would have been different etc. Peter From bugzilla-daemon at portal.open-bio.org Thu Jan 20 10:57:55 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 20 Jan 2011 10:57:55 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101201557.p0KFvt3I008953@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #12 from macrozhu+biopy at gmail.com 2011-01-20 10:57 EST ------- (In reply to comment #11) > (In reply to comment #10) > Cheers. Have you been able to test this patch, and does it satisfy? > I made changes to two lines because: -- input to function res2code() should be instances of Bio.PDB.Residue, not strings; -- variable res is from last loop round, thus not suitable as input to res2code(). So in the end, I changed two of the lines that use function res2code(). It has been tested on around 20,000 PDBs. cheers, 221c221 < res_seq_icode = '%s%s' % (res_id[1],res_id[2]) --- > res_seq_icode = res2code(res_id) 266c266 < res_seq_icode = '%s%s' % (res_id[1],res_id[2]) --- > res_seq_icode = res2code(res_id) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From anaryin at gmail.com Fri Jan 21 10:13:48 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 21 Jan 2011 16:13:48 +0100 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: Message-ID: Hello all, I've been working on the renumbering residues, remove disordered atoms, and biological unit representation functions. I've made quite some changes, specially to the renumbering algorithm. Explanation follows: Before I simply calculated how much to subtract from each residue number based on the first. 
That worked perfectly if all residue numbers were in a growing progression, which was not the case for some structures. Also, HETATMs weren't separated from the main ATOM lines, and in many PDB files you see numbering starting from 1000 for example. What I coded allows for certain discrimination of HETATMs from ATOMs based on the SEQRES field of the PDB file header (added parsing to parse_pdb_header). This ensures HETATMs are numbered from 1000. I've also incorporated a way of filtering modified aminoacids (that show up as HETATM but in between ATOM lines) to be treated as ATOMs if there is no SEQRES header present in the PDB file by looking for a CA atom. A warning is issued along with this "magic" feature turning on annoucing that the results may be a bit unreliable.. I've shown the code and the idea to the people in my lab and I got generally good responses, but of course they are all biased :) Have a look for yourselves, I created a branch for these. https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Thanks! From bugzilla-daemon at portal.open-bio.org Sat Jan 22 00:20:56 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 22 Jan 2011 00:20:56 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101220520.p0M5KuMW019063@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 eric.talevich at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #13 from eric.talevich at gmail.com 2011-01-22 00:20 EST ------- (In reply to comment #12) > I made changes to two lines because: > -- input to function res2code() should be instances of Bio.PDB.Residue, not > strings; > -- variable res is from last loop round, thus not suitable as input to > res2code(). > So in the end, I changed two of the lines that use function res2code(). It has > been tested on around 20,000 PDBs. Committed: https://github.com/biopython/biopython/commit/cc6842e0f79178af6bf9f32ad6ac3025685f55d1 Thanks for your help! Would you like to be added to the contributors list (in the NEWS file), with or without an e-mail address? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From krother at rubor.de Sat Jan 22 08:15:45 2011 From: krother at rubor.de (Kristian Rother) Date: Sat, 22 Jan 2011 14:15:45 +0100 Subject: [Biopython-dev] PDBFINDER module added Message-ID: Hi, PDBFINDER is a weekly updated text file that contains annotation for the entire PDB database. see: http://swift.cmbi.ru.nl/gv/pdbfinder/ The recent publication of PDBFINDER in NAR encouraged me to write a parser. I've added it as Bio.PDB.PDBFINDER in the branch 'pdbfinder' including tests. See: https://github.com/krother/biopython/commit/1ee57fc7ca08357d29fe4d8289c23ab30eecb5f9 Would be nice if someone interested could review the module. Have fun! 
Kristian From eric.talevich at gmail.com Sat Jan 22 18:06:52 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 22 Jan 2011 18:06:52 -0500 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: Message-ID: On Fri, Jan 21, 2011 at 10:13 AM, Jo?o Rodrigues wrote: > Hello all, > > I've been working on the renumbering residues, remove disordered atoms, and > biological unit representation functions. > > I've made quite some changes, specially to the renumbering algorithm. > Explanation follows: > > Before I simply calculated how much to subtract from each residue number > based on the first. That worked perfectly if all residue numbers were in a > growing progression, which was not the case for some structures. Also, > HETATMs weren't separated from the main ATOM lines, and in many PDB files > you see numbering starting from 1000 for example. > > What I coded allows for certain discrimination of HETATMs from ATOMs based > on the SEQRES field of the PDB file header (added parsing to > parse_pdb_header). This ensures HETATMs are numbered from 1000. I've also > incorporated a way of filtering modified aminoacids (that show up as HETATM > but in between ATOM lines) to be treated as ATOMs if there is no SEQRES > header present in the PDB file by looking for a CA atom. A warning is issued > along with this "magic" feature turning on annoucing that the results may be > a bit unreliable.. > > I've shown the code and the idea to the people in my lab and I got generally > good responses, but of course they are all biased :) Have a look for > yourselves, I created a branch for these. > > https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Hi Jo?o, Good stuff. I see you made a nice clean revision history for us, too -- thanks! Whitespace: Some extra spaces crept in and are throwing off the diff in Structure.py. Also, Structure.remove_disordered_atoms has a bunch of blank lines in the function body; could you slim it down? Chain.renumber_residues: 1. How about 'res_init' and 'het_init'? When I see "seed" I think RNG. 2. It looks like het_init does numbering automagically if the given value is 0, and otherwise numbers HETATMs in the expected way starting from the given number. How about letting "het_init=None" be the default, so "if not het_init" triggers the magical behavior, and if a positive integer is given then go from there. (Check that het_init >= 1 and maybe het_init == int(het_init) to allow 1.0 to be coerced to 1 if you want.) 3. I see in the last commit you added a local variable called 'magic'. Could you find a better name for that? I think 'guess_by_ca' would fit, if I'm reading the code correctly. 4. In the last block (lines 170-174 now), could you add a comment explaining why it would be reached? Before this commit there was a comment "Other HETATMs" but I'm not sure I fully understand. Is it for HETATMs not associated with any residue, i.e. not residue modifications? Structure.renumber_residues: 1. OK, I see what you're doing with het_seed=0 -- clever, but maybe more clever than necessary. It's not obvious from reading just this code that the first iteration is a special case for HETATM numbering; a maintainer would have to look at Chain.py too. A comment about that would help, I think. 2. Why (h/1000 >= 1) instead of (h >= 1000) ? 3. If the Chain.renumber_residues arguments change to 'res_init' and 'het_init', then 'seed' here should change to 'init' 4. 
The arguments 'sequential' and 'chain_displace' seem to interact -- I don't think I'd use chain_displace != 1 unless I had set sequential=True. So, it seems like chain_displace should only take effect if sequential=True (i.e. line 77 would be indented another level). To tighten things up further, I'd combine those two arguments into a single 'skip/gap_between_chains' or similar: # UNTESTED for chain in model: r_new, h_new = chain.renumber_residues(res_seed=r, het_seed=h) if skip_between_chains: r = r_new + skip_between_chains if h_new >= 1000: # Each chain's HETATM numbering starts at the next even 1000*N h = 1000*((h_new/1000)+1) else: h = h_new + 1 Structure.build_biological_unit: It looks like if the structure has more than one model, the models after 0 will be clobbered when this method is run. So, unless a better solution appears, it's safest to add an assert for len(self.models) == 1 or check+ValueError for multiple models. All the best, Eric From krother at rubor.de Mon Jan 24 06:38:45 2011 From: krother at rubor.de (Kristian Rother) Date: Mon, 24 Jan 2011 12:38:45 +0100 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged Message-ID: Hi Joao, I got two things to add to Erics comments: - When renumbering a chain, the id's of some residues are changed. Have you tested whether the keys in Chain.child_dict are changed as well? - Could you refactor a method Chain.change_residue_numbers(old_ids, new_ids) that does the changing of the calculated identifiers? I think this would have a some advantages (shorter code is more testable, easier to deal with the point above, I could use this for some custom numbering schemes). - Currently, Chain.renumber_residues in the lines last_num = residue.id[1]+displace residue.id = (residue.id[0], residue.id[1]+displace, residue.id[2]) are repating 3 times. Best regards, Kristian From biopython at maubp.freeserve.co.uk Mon Jan 24 06:50:21 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 24 Jan 2011 11:50:21 +0000 Subject: [Biopython-dev] curious error from HMM unit test In-Reply-To: References: Message-ID: On Thu, Jan 20, 2011 at 12:04 PM, Peter wrote: > On Mon, Jan 17, 2011 at 4:37 PM, Peter wrote: >> On Mon, Jan 17, 2011 at 4:17 PM, Eric Talevich wrote: >>> >>> It could have been an exotic bug in Jython (or its interactions with the >>> JVM) where the JIT or garbage collector is removing local variables too >>> early. I don't see how you could provide a "fix" for it in Biopython, since >>> k definitely exists at that point in the loop in any valid Python and Jython >>> almost always handles it correctly. >>> >> >> Good point - maybe that is the most likely explanation. >> > > It has happened again on the same install of Jython 2.5.2rc3 on 64bit Linux, > previously on build 6, now on build 12: > > http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/12/steps/shell/logs/stdio > > Again, repeating the build made the error go away - but the load on the > machine would have been different etc. 
And a more interesting variant, http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/16/steps/shell/logs/stdio http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/18/steps/shell/logs/stdio # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00002aaaab54c400, pid=13422, tid=1078704448 # # JRE version: 6.0_17-b17 # Java VM: OpenJDK 64-Bit Server VM (14.0-b16 mixed mode linux-amd64 ) # Derivative: IcedTea6 1.7.5 # Distribution: Custom build (Wed Oct 13 13:04:40 EDT 2010) # Problematic frame: # j Bio.HMM.Trainer$py.update_emissions$12(Lorg/python/core/PyFrame;Lorg/python/core/ThreadState;)Lorg/python/core/PyObject;+555 # # An error report file with more information is saved as: # /home/buildslave/BuildBot/jython252lin64/build/Tests/hs_err_pid13422.log # # If you would like to submit a bug report, please include # instructions how to reproduce the bug and visit: # http://icedtea.classpath.org/bugzilla # Looking at the log suggests this could be a low memory issue, perhaps from running multiple test builds at once. Peter From tiagoantao at gmail.com Mon Jan 24 09:02:47 2011 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 24 Jan 2011 14:02:47 +0000 Subject: [Biopython-dev] curious error from HMM unit test In-Reply-To: References: Message-ID: Sorry, I am drenched in work (writing up my PhD thesis) and had no time to attend to this. Maybe it is the Jython version? It is still a release candidate. I think the one installed is RC2, I am going to upgrade to the new RC3 and try again. Tiago On Mon, Jan 24, 2011 at 11:50 AM, Peter wrote: > On Thu, Jan 20, 2011 at 12:04 PM, Peter wrote: >> On Mon, Jan 17, 2011 at 4:37 PM, Peter wrote: >>> On Mon, Jan 17, 2011 at 4:17 PM, Eric Talevich wrote: >>>> >>>> It could have been an exotic bug in Jython (or its interactions with the >>>> JVM) where the JIT or garbage collector is removing local variables too >>>> early. I don't see how you could provide a "fix" for it in Biopython, since >>>> k definitely exists at that point in the loop in any valid Python and Jython >>>> almost always handles it correctly. >>>> >>> >>> Good point - maybe that is the most likely explanation. >>> >> >> It has happened again on the same install of Jython 2.5.2rc3 on 64bit Linux, >> previously on build 6, now on build 12: >> >> http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/12/steps/shell/logs/stdio >> >> Again, repeating the build made the error go away - but the load on the >> machine would have been different etc. 
> > And a more interesting variant, > http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/16/steps/shell/logs/stdio > http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/18/steps/shell/logs/stdio > > # A fatal error has been detected by the Java Runtime Environment: > # > # ?SIGSEGV (0xb) at pc=0x00002aaaab54c400, pid=13422, tid=1078704448 > # > # JRE version: 6.0_17-b17 > # Java VM: OpenJDK 64-Bit Server VM (14.0-b16 mixed mode linux-amd64 ) > # Derivative: IcedTea6 1.7.5 > # Distribution: Custom build (Wed Oct 13 13:04:40 EDT 2010) > # Problematic frame: > # j ?Bio.HMM.Trainer$py.update_emissions$12(Lorg/python/core/PyFrame;Lorg/python/core/ThreadState;)Lorg/python/core/PyObject;+555 > # > # An error report file with more information is saved as: > # /home/buildslave/BuildBot/jython252lin64/build/Tests/hs_err_pid13422.log > # > # If you would like to submit a bug report, please include > # instructions how to reproduce the bug and visit: > # ? http://icedtea.classpath.org/bugzilla > # > > Looking at the log suggests this could be a low memory issue, > perhaps from running multiple test builds at once. > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- "If you want to get laid, go to college.? If you want an education, go to the library." - Frank Zappa From macrozhu at gmail.com Mon Jan 24 13:25:17 2011 From: macrozhu at gmail.com (Hongbo Zhu) Date: Mon, 24 Jan 2011 19:25:17 +0100 Subject: [Biopython-dev] why HETERO-flag in residue identifier (Bio.PDB.Residue)? Message-ID: Hi, I was recently working on the BioPython module DSSP.py . There was some problem in the module when reading DSSP output. One of them was due to different descriptions of residue identifier in DSSP and BioPython. As we all know, in BioPython, residue identifier consists of three fields ( hetero-?ag, sequence identifier, insertion code ). But DSSP uses only the latter two. This can sometimes cause unnecessary exceptions (see http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ). In retrospect, I start to wonder why BioPython included hetero-flag in residue identifier. After checking several BioPython documents, I found that in "The Biopython Structural Bioinformatics FAQ", this question has been answered: "The reason for the hetero-?ag is that many, many PDB ?les use the same sequence identi?er for an amino acid and a hetero-residue or a water, which would create obvious problems if the hetero-?ag was not used." I somehow got interested in the issue and performed a scanning on a subset of PDB (a non-redundant set of ~22,000 pdb entries derived using PISCES http://dunbrack.fccc.edu/PISCES.php ). I found ~30 cases in which same sequence identifier + icode is used for more than one residues (see below). I checked all of them. It turned out that in all of these cases, though same sequence identifier+icode is used for different residues, the residues have different alternative locations. This means they can still be distinguished if alternative locations are considered. In BioPython, alternative location is always very well taken care of. So it seems to me that hetero-flag is a bit redundant in residue identifier. It should also be fine if hetero-flag is just given as an attribute to residues (I still need to scan all the PDB entries to confirm my claim). I want to hear your opinions about the hetero-flag in residue identifier. 
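(For anyone who wants to repeat the check, here is a minimal sketch of the kind of scan described above - not the actual script used, and the PDB code/filename are placeholders. It flags residues within one chain that share the same sequence identifier + icode once the hetero-flag is ignored:)

    # Rough sketch only - assumes local copies of the PDB files.
    from Bio.PDB import PDBParser

    def report_duplicates(pdb_id, pdb_file):
        structure = PDBParser().get_structure(pdb_id, pdb_file)
        for model in structure:
            for chain in model:
                seen = set()
                for residue in chain:
                    het_flag, resseq, icode = residue.id
                    if (resseq, icode) in seen:
                        # same seqnum + icode already used by another residue
                        print "Duplicate:", pdb_id, model.id, chain.id, residue.id
                    seen.add((resseq, icode))

    # report_duplicates("2pxs", "pdb2pxs.ent")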
cheers,
hongbo zhu

Duplicate: 2pxs 0 A ('H_XYG', 66, ' ')
Duplicate: 2pxs 0 B ('H_XYG', 66, ' ')
Duplicate: 3bln 0 A ('H_MPD', 147, ' ')
Duplicate: 3ned 0 A ('H_CH6', 67, ' ')
Duplicate: 3ned 0 A ('H_NRQ', 67, ' ')
Duplicate: 3l4j 0 A ('H_PTR', 782, ' ')
Duplicate: 1ysl 0 B (' ', 111, ' ')
Duplicate: 3gju 0 A (' ', 289, ' ')
Duplicate: 3fcr 0 A ('H_LLP', 288, ' ')
Duplicate: 1xpm 0 A (' ', 111, ' ')
Duplicate: 1xpm 0 B (' ', 111, ' ')
Duplicate: 1xpm 0 C (' ', 111, ' ')
Duplicate: 1xpm 0 D (' ', 111, ' ')
Duplicate: 2vqr 0 A ('H_DDZ', 57, ' ')
Duplicate: 3piu 0 A (' ', 273, ' ')
Duplicate: 2w8s 0 A ('H_FGL', 57, ' ')
Duplicate: 2w8s 0 B ('H_FGL', 57, ' ')
Duplicate: 2w8s 0 C ('H_FGL', 57, ' ')
Duplicate: 2w8s 0 D ('H_FGL', 57, ' ')
Duplicate: 2wpn 0 B ('H_PSW', 489, ' ')
Duplicate: 2wpn 0 B ('H_PSW', 489, ' ')
Duplicate: 3a0m 0 F (' ', 13, ' ')
Duplicate: 3a0m 0 F (' ', 16, ' ')
Duplicate: 3a0m 0 F (' ', 13, ' ')
Duplicate: 3a0m 0 F (' ', 16, ' ')
Duplicate: 2ci1 0 A ('H_K1R', 273, ' ')
Duplicate: 2uv2 0 A ('H_TPO', 183, ' ')
Duplicate: 3d3w 0 B ('H_CSO', 138, ' ')
Duplicate: 3hvy 0 A ('H_LLP', 243, ' ')
Duplicate: 3hvy 0 B ('H_LLP', 243, ' ')
Duplicate: 3hvy 0 C ('H_LLP', 243, ' ')
Duplicate: 3hvy 0 D ('H_LLP', 243, ' ')
Duplicate: 2j6v 0 A ('H_ALY', 229, ' ')
Duplicate: 2j6v 0 B ('H_ALY', 229, ' ')

--
Hongbo

From biopython at maubp.freeserve.co.uk Mon Jan 24 18:05:56 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 24 Jan 2011 23:05:56 +0000
Subject: [Biopython-dev] curious error from HMM unit test
In-Reply-To:
References:
Message-ID:

2011/1/24 Tiago Antão :
> Sorry, I am drenched in work (writing up my PhD thesis) and had no
> time to attend to this. Maybe it is the Jython version? It is still a
> release candidate. I think the one installed is RC2, I am going to
> upgrade to the new RC3 and try again.

Not to worry Tiago - it's not your machine - it's one of mine (with
Jython 2.5.2 RC3). You've got more important things right now ;)

It looks like we may have found a bug worth reporting to the
Jython guys. I'll try and work out if I can reproduce it "by hand"
rather than just via buildbot.

Peter

From biopython at maubp.freeserve.co.uk Mon Jan 24 18:08:59 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 24 Jan 2011 23:08:59 +0000
Subject: [Biopython-dev] why HETERO-flag in residue identifier (Bio.PDB.Residue)?
In-Reply-To:
References:
Message-ID:

On Mon, Jan 24, 2011 at 6:25 PM, Hongbo Zhu wrote:
> Hi,
>
> I was recently working on the BioPython module DSSP.py . There was some
> problem in the module when reading DSSP output. One of them was due to
> different descriptions of residue identifier in DSSP and BioPython. As we
> all know, in BioPython, residue identifier consists of three fields (
> hetero-flag, sequence identifier, insertion code ). ...
>
> I somehow got interested in the issue and performed a scanning on a subset
> of PDB (a non-redundant set of ~22,000 pdb entries derived using PISCES
> http://dunbrack.fccc.edu/PISCES.php ). I found ~30 cases in which same
> sequence identifier + icode is used for more than one residues (see below).
> I checked all of them. It turned out that in all of these cases, though same
> sequence identifier+icode is used for different residues, the residues have
> different alternative locations. This means they can still be distinguished
> if alternative locations are considered. In BioPython, alternative location
> is always very well taken care of.
> > So it seems to me that hetero-flag is a bit redundant in residue identifier. > It should also be fine if hetero-flag is just given as an attribute to > residues ?(I still need to scan all the PDB entries to confirm my claim). I > want to hear your opinions about the hetero-flag in residue identifier. It may be that prior to the big PDB re-mediation (clean up) this was a real and common problem. Certainly your investigation suggests this isn't the case now. Peter From anaryin at gmail.com Mon Jan 24 18:23:06 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 25 Jan 2011 00:23:06 +0100 Subject: [Biopython-dev] why HETERO-flag in residue identifier (Bio.PDB.Residue)? In-Reply-To: References: Message-ID: To be really honest, I don't understand the problem with the flag. I don't really see it as redundant. Could you please explain better? From macrozhu at gmail.com Tue Jan 25 02:19:35 2011 From: macrozhu at gmail.com (Hongbo Zhu) Date: Tue, 25 Jan 2011 08:19:35 +0100 Subject: [Biopython-dev] why HETERO-flag in residue identifier (Bio.PDB.Residue)? In-Reply-To: References: Message-ID: > > So it seems to me that hetero-flag is a bit redundant in residue > identifier. > > It should also be fine if hetero-flag is just given as an attribute to > > residues (I still need to scan all the PDB entries to confirm my claim). > I > > want to hear your opinions about the hetero-flag in residue identifier. > > It may be that prior to the big PDB re-mediation (clean up) this was a > real and common problem. Certainly your investigation suggests > this isn't the case now. > This also occurred to me. You are right, I performed the test on PDB files after remediation. If this is the case, hetero-flag is better kept for backward compatibility. > > Peter > -- Hongbo From macrozhu at gmail.com Tue Jan 25 03:17:13 2011 From: macrozhu at gmail.com (Hongbo Zhu) Date: Tue, 25 Jan 2011 09:17:13 +0100 Subject: [Biopython-dev] why HETERO-flag in residue identifier (Bio.PDB.Residue)? In-Reply-To: References: Message-ID: By redundant, I mean that a residue can be unambiguously determined by using (PDB code, model id, chain id, residue sequence identifier+icode) . HETERO-flag itself is definitely not redundant information for a residue. But it seems to be redundant in residue ID according to the small test on ~22,000 remediated PDB files. This redundancy sometimes causes unnecessary problems. For example, in DSSP, residues are determined by using sequence identifier+icode. When parsing DSSP output, some residues cannot be located the PDB structure stored in Bio.PDB.Structure because sequence identifier + icode is not enough for determining the residues in BioPython. One example is: 3jui 0 A 547 In the protein structure, using sequence identifier + icode, this residue is unambiguously determined. But in BioPython, one has to specify ('H_MSE', 547, ' ') to locate this residue. (Note that we can also simply use 547 without icode to locate it. But we don't want to accidentally forget icode in our script, do we :). Peter pointed out that the existence of hetero-flag in residue ID might be due to the mistakes in the old PDB files before remediation. If it is the case, hetero-flag should better be retained for backwards compatibility. regards, hongbo On Tue, Jan 25, 2011 at 12:23 AM, Jo?o Rodrigues wrote: > To be really honest, I don't understand the problem with the flag. I don't > really see it as redundant. Could you please explain better? 
> > From anaryin at gmail.com Tue Jan 25 04:39:21 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 25 Jan 2011 10:39:21 +0100 Subject: [Biopython-dev] why HETERO-flag in residue identifier (Bio.PDB.Residue)? In-Reply-To: References: Message-ID: Thanks, now I understand what you meant :) I ran into somewhat of a similar problem when trying to deal with the renumbering of residues but I guess indeed old PDB files are perhaps messy in this aspect (just looking at the odd numbering of some..) so I'd agree with Peter. From krother at rubor.de Tue Jan 25 04:40:12 2011 From: krother at rubor.de (Kristian Rother) Date: Tue, 25 Jan 2011 10:40:12 +0100 Subject: [Biopython-dev] Bio.PDB.Residue.id In-Reply-To: References: Message-ID: Hi, In our group, we've been discussing the PDB.Residue.id issue a couple of times. The current notation is fine but it is unintuitive and hard to learn for newbies. We therefore use a wrapper that allows to access residues by one-string ids = str(identifier + icode), like '101', '101A', '3B' etc. I'm sure changing ids in PDB.Residue would break a lot of scripts people use. I could imagine some workarounds that allow ignoring the HETERO flag though. Would work for me. How about you? Best regards, Kristian > By redundant, I mean that a residue can be unambiguously determined by > using > (PDB code, model id, chain id, residue sequence identifier+icode) . > HETERO-flag itself is definitely not redundant information for a residue. > But it seems to be redundant in residue ID according to the small test on > ~22,000 remediated PDB files. > > This redundancy sometimes causes unnecessary problems. For example, in > DSSP, > residues are determined by using sequence identifier+icode. When parsing > DSSP output, some residues cannot be located the PDB structure stored in > Bio.PDB.Structure because sequence identifier + icode is not enough for > determining the residues in BioPython. One example is: > 3jui 0 A 547 > In the protein structure, using sequence identifier + icode, this residue > is > unambiguously determined. But in BioPython, one has to specify ('H_MSE', > 547, ' ') to locate this residue. (Note that we can also simply use 547 > without icode to locate it. But we don't want to accidentally forget icode > in our script, do we :). > > Peter pointed out that the existence of hetero-flag in residue ID might be > due to the mistakes in the old PDB files before remediation. If it is the > case, hetero-flag should better be retained for backwards compatibility. > > regards, > hongbo > > On Tue, Jan 25, 2011 at 12:23 AM, Jo?o Rodrigues > wrote: > >> To be really honest, I don't understand the problem with the flag. I >> don't >> really see it as redundant. Could you please explain better? >> >> > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > From macrozhu at gmail.com Tue Jan 25 05:17:48 2011 From: macrozhu at gmail.com (Hongbo Zhu) Date: Tue, 25 Jan 2011 11:17:48 +0100 Subject: [Biopython-dev] Bio.PDB.Residue.id In-Reply-To: References: Message-ID: I totally agree that removing hetero-flag from residue ID is a big API change. I myself hate that very much when some other libraries I use announce such API changes. That should always be carefully planned and kept to minimal. In my opinion, what's more realistic is to add an additional mechanism to locate residues in PDB.Residue, in which no hetero-flag is required. 
That is, in this mechanism, one can use just sequence identifier + icode.
Internally, PDB.Residue will first check whether residue (' ', seqnum, icode)
exists. If not, it checks all residues with a non-empty hetero-flag. Only if no
residue with that sequence identifier + icode exists (regardless of the
hetero-flag) does it throw an exception, rather than simply throwing an
exception whenever (' ', seqnum, icode) does not exist.

For instance, this can be realized by revising PDB.Chain._translate_id() to:

    def _translate_id(self, id):
        if type(id) == IntType:
            longid = (' ', id, ' ')
            if not self.has_id(longid):
                for r in self:
                    if r.id[0] != ' ' and r.id[1] == id and r.id[2] == ' ':
                        longid = r.id
        else:
            longid = id
        return longid

On Tue, Jan 25, 2011 at 10:40 AM, Kristian Rother wrote:

>
> Hi,
>
> In our group, we've been discussing the PDB.Residue.id issue a couple of
> times. The current notation is fine but it is unintuitive and hard to
> learn for newbies.
> We therefore use a wrapper that allows to access residues by one-string
> ids = str(identifier + icode), like '101', '101A', '3B' etc.
>
> I'm sure changing ids in PDB.Residue would break a lot of scripts people
> use. I could imagine some workarounds that allow ignoring the HETERO flag
> though. Would work for me. How about you?
>
> Best regards,
> Kristian
>
>

--
Hongbo

From anaryin at gmail.com Tue Jan 25 10:00:19 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 25 Jan 2011 16:00:19 +0100
Subject: [Biopython-dev] Atom.py - _assign_element function (small bug)
Message-ID:

Hey all,

I stumbled upon a little bug today in this function. There's a formatting
argument missing in line 92:

    msg = "Could not assign element %r for Atom (name=%s) with given element %r" \
          % (*putative_element,* self.name, element)

Apart from this "typo", there's a problem with hydrogens. For example with
Glutamine, the hydrogens HE21 and HE22 (if present) fail to be assigned
decently with the current setting. I'm adding an additional condition to the
if-clause in line 77 to correct this. This correction now parses correctly
these hydrogens (and everyone else). The tests run fine too and I don't
think I should add anything to them either.

GLN 205
HE22 [  8.46199989  -1.04999995 -15.40400028] 4368
Final Assignment: H

I'm pushing this fix to my github atom-element branch, I guess that's easy
to cherry-pick?

Best!

João

From p.j.a.cock at googlemail.com Tue Jan 25 10:38:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 25 Jan 2011 15:38:03 +0000
Subject: [Biopython-dev] Atom.py - _assign_element function (small bug)
In-Reply-To:
References:
Message-ID:

On Tue, Jan 25, 2011 at 3:00 PM, João Rodrigues wrote:
> Hey all,
>
> I stumbled upon a little bug today in this function. There's a formatting
> argument missing in line 92:
>
>                msg = "Could not assign element %r for Atom (name=%s) with
> given element %r" \
>                      % (*putative_element,* self.name, element)
>
> Apart from this "typo", there's a problem with hydrogens.

Oops.

> For example with
> Glutamine, the hydrogens HE21 and HE22 (if present) fail to be assigned
> decently with the current setting. I'm adding an additional condition to the
> if-clause in line 77 to correct this. This correction now parses correctly
> these hydrogens (and everyone else). The tests run fine too and I don't
> think I should add anything to them either.
> > GLN 205 > HE22 [ ?8.46199989 ?-1.04999995 -15.40400028] 4368 > Final Assignment: H > > I'm pushing this fix to my github atom-element branch, I guess that's easy > to cherry-pick? Cherry-picked, Thanks, Peter From anaryin at gmail.com Tue Jan 25 19:41:42 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 26 Jan 2011 01:41:42 +0100 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: Message-ID: Trying to reply point per point both to Eric and Kristian. Hi Jo?o, > > Good stuff. I see you made a nice clean revision history for us, too -- > thanks! > I'm still trying to get the hang of Git but since I learned 'git reset' life is easier :) > > Whitespace: > Some extra spaces crept in and are throwing off the diff in > Structure.py. Also, Structure.remove_disordered_atoms has a bunch of > blank lines in the function body; could you slim it down? > They were mostly in the biological_unit function I suppose. I cut those down. I've also diff'ed the master and mine and checked differences in whitespace. Almost all gone, those left I can't really erase them, they're probably from my editor... > > > Chain.renumber_residues: > 1. How about 'res_init' and 'het_init'? When I see "seed" I think RNG. > Agreed, changed. > 2. It looks like het_init does numbering automagically if the given > value is 0, and otherwise numbers HETATMs in the expected way starting > from the given number. How about letting "het_init=None" be the > default, so "if not het_init" triggers the magical behavior, and if a > positive integer is given then go from there. (Check that het_init >= > 1 and maybe het_init == int(het_init) to allow 1.0 to be coerced to 1 > if you want.) > I'm using this later on to allow incremental chain renumbering. That's why it's 0 and not False or none, because I then add a number to it and it starts from that number on. I guess you understood when you read Structure.py. I'll add a comment pointing this out. > 3. I see in the last commit you added a local variable called 'magic'. > Could you find a better name for that? I think > 'guess_by_ca' would fit, if I'm reading the code correctly. > Changed 'magic' to 'filter_by_ca'. What it's doing is filtering the HETATMs if they have a CA atom so I guess it's a good name for it. > 4. In the last block (lines 170-174 now), could you add a comment > explaining why it would be reached? Before this commit there was a > comment "Other HETATMs" but I'm not sure I fully understand. Is it for > HETATMs not associated with any residue, i.e. not residue > modifications? > Added. It's for all HETATMs that don't have a CA atom basically and that are not contemplated in SEQRES (if there). Structure.renumber_residues: > 1. OK, I see what you're doing with het_seed=0 -- clever, but maybe > more clever than necessary. It's not obvious from reading just this > code that the first iteration is a special case for HETATM numbering; > a maintainer would have to look at Chain.py too. A comment about that > would help, I think. > Added. > 2. Why (h/1000 >= 1) instead of (h >= 1000) ? > Accumulated frustration over one day results in such logical typos :) > 3. If the Chain.renumber_residues arguments change to 'res_init' and > 'het_init', then 'seed' here should change to 'init' > Done. > 4. The arguments 'sequential' and 'chain_displace' seem to interact -- > I don't think I'd use chain_displace != 1 unless I had set > sequential=True. 
So, it seems like chain_displace should only take > effect if sequential=True (i.e. line 77 would be indented another > level). To tighten things up further, I'd combine those two arguments > into a single 'skip/gap_between_chains' or similar: > > # UNTESTED > for chain in model: > r_new, h_new = chain.renumber_residues(res_seed=r, het_seed=h) > if skip_between_chains: > r = r_new + skip_between_chains > if h_new >= 1000: > # Each chain's HETATM numbering starts at the next even 1000*N > h = 1000*((h_new/1000)+1) > else: > h = h_new + 1 > > I changed it to consecutive_chains and refactored the code a bit. I also changed the increment of the het_init value. This way, having more than 9 chains for example would lead to residue numbers over 10000 which is not allowed. I solved it by making all HETATMs starting (by default) at 1000 and just incrementing. If the numbering is consecutive they are also affected by the value chosen to skip between chains. A bit more logical IMO. Answering Kristian's suggestions: - When renumbering a chain, the id's of some residues are changed. Have > you tested whether the keys in Chain.child_dict are changed as well? > Good question... they didn't... is there an easy way of rebuilding that dictionary? Or should I just "rebuild" it and then overwrite child_dict? - Could you refactor a method Chain.change_residue_numbers(old_ids, new_ids) > that does the changing of the calculated identifiers? I think this would > have a some advantages (shorter code is more testable, easier to deal with > the point above, I could use this for some custom numbering schemes). > Could you elaborate on this? Should it be a new method? - Currently, Chain.renumber_residues in the lines > last_num = residue.id[1]+displace > residue.id = (residue.id[0], residue.id[1]+displace, residue.id[2]) > are repating 3 times. Changed. I merged the if-clauses. A bit more complicated but only one if-else condition. > Structure.build_biological_unit: > It looks like if the structure has more than one model, the models > after 0 will be clobbered when this method is run. So, unless a better > solution appears, it's safest to add an assert for len(self.models) == > 1 or check+ValueError for multiple models. > I would prefer to return a new Structure object just with the Biological Unit. It would save me the deepcopy but I'd have to create a new object so dunno if I could gain some speed there. But this would actually make more sense and avoid that problem. What do you think? I pushed again to the same branch: https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements From anaryin at gmail.com Tue Jan 25 20:47:02 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 26 Jan 2011 02:47:02 +0100 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: Message-ID: Adding to the build_biological_unit question. I re-wrote it to create a new structure and not to use deepcopy. A very crude benchmark (running two versions of the function) shows this: New function (w/out deepcopy): real 1m26.078s user 1m21.993s sys 0m2.048s Old function (w/ deepcopy): real 2m15.544s user 2m9.105s sys 0m3.092s So... a slight improvement I'd say. Pushed it to the pdb_enhancements branch as a new function called apply_transformation_matrix. A perhaps more descriptive and explicit name to the function. 
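(For reference, a minimal sketch of applying one such rotation/translation with the existing Bio.PDB API - this is just the built-in Entity.transform() method, not the new function on the branch, and the matrix values and filenames are made up rather than taken from real REMARK 350 records:)

    # Sketch only - a real biological unit would use the matrices from REMARK 350.
    import numpy
    from Bio.PDB import PDBParser, PDBIO

    structure = PDBParser().get_structure("demo", "demo.pdb")   # made-up filename
    rotation = numpy.identity(3)                  # placeholder 3x3 rotation matrix
    translation = numpy.array([15.0, 0.0, 0.0])   # placeholder translation vector

    # Entity.transform() pushes the transformation down to every Atom below it
    for chain in structure[0]:
        chain.transform(rotation, translation)

    io = PDBIO()
    io.set_structure(structure)
    io.save("demo_transformed.pdb")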
Cheers, Jo?o From bugzilla-daemon at portal.open-bio.org Wed Jan 26 06:08:21 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 06:08:21 -0500 Subject: [Biopython-dev] [Bug 3171] New: inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3171 Summary: inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom Product: Biopython Version: 1.56 Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: edvin.fuglebakk at gmail.com There seems to be an inconsistency in the way missing elements are represented in PDB.PDBIO.PDBIO and PDB.Atom.Atom. The constructor of Atom sets the attribute element to '?' if this is unkown, while PDBIO raises a value error if it encouters atoms with the element set to '?'. PDBIO._get_atom_line checks if Atom.element is falsish (None, False, 0 ...) and chanhes the value of Atom.element to " " if it is. So it seems Atom represents missing elements by "?", while PDBIO represents them by falsish values, presumably None. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From anaryin at gmail.com Wed Jan 26 06:55:46 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 26 Jan 2011 12:55:46 +0100 Subject: [Biopython-dev] [Bug 3171] New: inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom In-Reply-To: References: Message-ID: This being "my" problem, how do I fix this in Bugzilla? The problem is already solved and pushed to Github since a few weeks. Cheers! Jo?o [...] Rodrigues http://doeidoei.wordpress.com On Wed, Jan 26, 2011 at 12:08 PM, wrote: > http://bugzilla.open-bio.org/show_bug.cgi?id=3171 > > Summary: inconsistent representation of missing elements > PDB.PDBIO.PDBIO and PDB.Atom.Atom > Product: Biopython > Version: 1.56 > Platform: Macintosh > OS/Version: Mac OS > Status: NEW > Severity: normal > Priority: P2 > Component: Main Distribution > AssignedTo: biopython-dev at biopython.org > ReportedBy: edvin.fuglebakk at gmail.com > > > There seems to be an inconsistency in the way missing elements are > represented > in PDB.PDBIO.PDBIO and PDB.Atom.Atom. > > The constructor of Atom sets the attribute element to '?' if this is > unkown, > while PDBIO raises a value error if it encouters atoms with the element set > to > '?'. PDBIO._get_atom_line checks if Atom.element is falsish (None, False, 0 > ...) and chanhes the value of Atom.element to " " if it is. > > So it seems Atom represents missing elements by "?", while PDBIO represents > them by falsish values, presumably None. > > > -- > Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are the assignee for the bug, or are watching the assignee. 
> _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython at maubp.freeserve.co.uk Wed Jan 26 08:51:38 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 26 Jan 2011 13:51:38 +0000 Subject: [Biopython-dev] [Bug 3171] New: inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom In-Reply-To: References: Message-ID: On Wed, Jan 26, 2011 at 11:55 AM, Jo?o Rodrigues wrote: > This being "my" problem, how do I fix this in Bugzilla? The problem is > already solved and pushed to Github since a few weeks. > > Cheers! > > Jo?o [...] Rodrigues > http://doeidoei.wordpress.com Hi Joao, If it has been fixed on the master, could you add a bug comment to say so with a link to the github commit (and say if you standardised on " ", "?" or something else). Then change the bug status to fixed (between the comment box and the commit button should be a radio-dialogue, move it from "Leave as NEW" to "Resolve bug, changing resolution to FIXED". Thanks, Peter From bugzilla-daemon at portal.open-bio.org Wed Jan 26 08:56:39 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 08:56:39 -0500 Subject: [Biopython-dev] [Bug 2992] Adding Uniprot XML file format parsing to Biopython In-Reply-To: Message-ID: <201101261356.p0QDud42015498@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2992 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-26 08:56 EST ------- This was included in Biopython 1.56, marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 26 09:07:34 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 09:07:34 -0500 Subject: [Biopython-dev] [Bug 2999] SeqIO.parse() or record.format("genbank") converts input sequence to uppercase or In-Reply-To: Message-ID: <201101261407.p0QE7Yco016101@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2999 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-26 09:07 EST ------- (In reply to comment #1) > In many file formats (e.g. FASTA) mixed case is allowed and useful. > > The sequence in a GenBank file is (by convention) always lower case, > but for historical reasons Biopython converts this to upper case on > parsing (not sure why, but changing it would risk breaking existing > scripts). > > However, I think we should convert to lower case on writing GenBank > output. > Done: https://github.com/biopython/biopython/commit/1f860f445d99794ef3747f7a90d73ac4b4a78a00 Marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
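(In practice the two fixes above surface through Bio.SeqIO - a small sketch with made-up filenames:)

    # Sketch only - the filenames are placeholders.
    from Bio import SeqIO

    # Bug 2992: UniProt XML files can now be read via SeqIO.
    record = SeqIO.read("example_uniprot.xml", "uniprot-xml")

    # Bug 2999: GenBank output now writes the sequence in lower case,
    # although parsing still upper-cases it in the SeqRecord.
    records = SeqIO.parse("example.gbk", "genbank")
    SeqIO.write(records, "example_out.gbk", "genbank")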
From n.j.loman at bham.ac.uk Wed Jan 26 09:23:47 2011 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Wed, 26 Jan 2011 14:23:47 +0000 Subject: [Biopython-dev] XMFA format support Message-ID: <4D402E73.90806@bham.ac.uk> Hi biopython-developers, Has anyone made a start on adding XMFA support to Bio.AlignIO? XMFA files are produced by software such as Mauve (amongst others). Here's an example file: http://www.bioperl.org/wiki/XMFA_multiple_alignment_format It should be relatively straight-forward to parse them in a basic way in that they can be split on the '=' line to produce the equivalent of multi-FASTA alignments. Cheers, Nick. From bugzilla-daemon at portal.open-bio.org Wed Jan 26 09:26:26 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 09:26:26 -0500 Subject: [Biopython-dev] [Bug 3109] Record class in Bio.SCOP.Cla has hierarchy member as list instead of dictionary In-Reply-To: Message-ID: <201101261426.p0QEQQh6016867@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3109 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-26 09:26 EST ------- Committed: https://github.com/biopython/biopython/commit/9ec46f981a1fb7f97eaee4a2a01ad8bb3297234b and: https://github.com/biopython/biopython/commit/ce675b9299bf34e12335330d627385262f59b4e7 Marking as fixed - sorry for the delay. Thanks! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Wed Jan 26 09:32:41 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Jan 2011 14:32:41 +0000 Subject: [Biopython-dev] XMFA format support In-Reply-To: <4D402E73.90806@bham.ac.uk> References: <4D402E73.90806@bham.ac.uk> Message-ID: Hi Nick, On Wed, Jan 26, 2011 at 2:23 PM, Nick Loman wrote: > Hi biopython-developers, > > Has anyone made a start on adding XMFA support to Bio.AlignIO? >?XMFA files are produced by software such as Mauve (amongst others). > > Here's an example file: > http://www.bioperl.org/wiki/XMFA_multiple_alignment_format > > It should be relatively straight-forward to parse them in a basic way in > that they can be split on the '=' line to produce the equivalent of > multi-FASTA alignments. > > Cheers, > > Nick. Nope, but you are right they should be easy to parse - especially if you ignore the loosely defined optional key/value entries on the equals line. Do you want to tackle this? If not, can you at least provide some small example files and help with testing? You could file an enhancement bug on our bugzilla and then upload them as attachments. Also, could you name any other software that outputs these XMFA (extended multiple fasta) files, other than Mauve? I've not ever come across this format before. Thanks, Peter From n.j.loman at bham.ac.uk Wed Jan 26 09:46:51 2011 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Wed, 26 Jan 2011 14:46:51 +0000 Subject: [Biopython-dev] XMFA format support In-Reply-To: References: <4D402E73.90806@bham.ac.uk> Message-ID: <4D4033DB.6030607@bham.ac.uk> On 26/01/2011 14:32, Peter Cock wrote: >> Has anyone made a start on adding XMFA support to Bio.AlignIO? >> XMFA files are produced by software such as Mauve (amongst others). 
>> >> Here's an example file: >> http://www.bioperl.org/wiki/XMFA_multiple_alignment_format >> >> It should be relatively straight-forward to parse them in a basic way in >> that they can be split on the '=' line to produce the equivalent of >> multi-FASTA alignments. > Nope, but you are right they should be easy to parse - especially > if you ignore the loosely defined optional key/value entries on the > equals line. Do you want to tackle this? If not, can you at least > provide some small example files and help with testing? You > could file an enhancement bug on our bugzilla and then upload > them as attachments. Hi Peter, Sure, I'll have a go at writing a parser and let you know how I get on. > Also, could you name any other software that outputs these XMFA > (extended multiple fasta) files, other than Mauve? I've not ever > come across this format before. > I think the format was invented by the author of LAGAN (http://lagan.stanford.edu/lagan_web/index.shtml). From a Google search it looks like it is exported by Bigsdb too (http://pubmlst.org/software/database/bigsdb/userguide/isolates/xmfa.shtml) Cheers Nick From bugzilla-daemon at portal.open-bio.org Wed Jan 26 10:00:07 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 10:00:07 -0500 Subject: [Biopython-dev] [Bug 3171] inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom In-Reply-To: Message-ID: <201101261500.p0QF07Hd018137@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3171 anaryin at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from anaryin at gmail.com 2011-01-26 10:00 EST ------- This problem is already fixed and pushed to the master branch. Commit: https://github.com/biopython/biopython/blob/6c7ef358e5f93599ca165ce8e7b46261106e2b06/Bio/PDB/PDBIO.py -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 26 10:04:56 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 10:04:56 -0500 Subject: [Biopython-dev] [Bug 3171] inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom In-Reply-To: Message-ID: <201101261504.p0QF4uFN018424@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3171 ------- Comment #2 from anaryin at gmail.com 2011-01-26 10:04 EST ------- If the element can't be determined Atom defines it as "" (Empty string) which is correctl interpreted by PDBIO. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Wed Jan 26 10:06:00 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Jan 2011 15:06:00 +0000 Subject: [Biopython-dev] XMFA format support In-Reply-To: <4D4033DB.6030607@bham.ac.uk> References: <4D402E73.90806@bham.ac.uk> <4D4033DB.6030607@bham.ac.uk> Message-ID: On Wed, Jan 26, 2011 at 2:46 PM, Nick Loman wrote: > On 26/01/2011 14:32, Peter Cock wrote: >>> >>> Has anyone made a start on adding XMFA support to Bio.AlignIO? >>> ?XMFA files are produced by software such as Mauve (amongst others). >>> ... 
>> >> Nope, but you are right they should be easy to parse - especially >> if you ignore the loosely defined optional key/value entries on the >> equals line. Do you want to tackle this? If not, can you at least >> provide some small example files and help with testing? You >> could file an enhancement bug on our bugzilla and then upload >> them as attachments. > > Hi Peter, > > Sure, I'll have a go at writing a parser and let you know how I get on. > Great. I'd suggest format name "xmfa" to match BioPerl, and using Bio/SeqIO/XmfaIO.py for your parser (and writer if you do one). Peter From bugzilla-daemon at portal.open-bio.org Wed Jan 26 10:09:38 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 10:09:38 -0500 Subject: [Biopython-dev] [Bug 3171] inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom In-Reply-To: Message-ID: <201101261509.p0QF9cPS018663@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3171 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-26 10:09 EST ------- (In reply to comment #1) > This problem is already fixed and pushed to the master branch. > > Commit: > https://github.com/biopython/biopython/blob/6c7ef358e5f93599ca165ce8e7b46261106e2b06/Bio/PDB/PDBIO.py > That's just a white space change (from a branch merge): https://github.com/biopython/biopython/commit/6c7ef358e5f93599ca165ce8e7b46261106e2b06 Probably the fix you are looking for was applied earlier in the history. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bioinformed at gmail.com Wed Jan 26 10:14:33 2011 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Wed, 26 Jan 2011 10:14:33 -0500 Subject: [Biopython-dev] Sequential SFF IO Message-ID: Any objections/worries about converting the SFF writer to use the sequential/incremental writer object interface? I know it looks specialized for text formats, but I need to split large SFF files into many smaller ones and would rather not materialize the whole thing. The SFF writer code already allows for deferred writing of read counts and index creation, so it looks to be only minor surgery. There doesn't seem to be an obvious API for obtaining such a writer using the SeqIO interface. Am I missing something obvious? -Kevin From p.j.a.cock at googlemail.com Wed Jan 26 10:45:56 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Jan 2011 15:45:56 +0000 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Wed, Jan 26, 2011 at 3:14 PM, Kevin Jacobs wrote: > Any objections/worries about converting the SFF writer to use the > sequential/incremental writer object interface? I know it looks > specialized for text formats, but It already uses Bio.SeqIO.Interfaces.SequenceWriter > ... I need to split large SFF files into many smaller ones > and would rather not materialize the whole thing. ?The SFF writer > code already allows for deferred writing of read counts and index > creation, so it looks to be only minor surgery. I don't understand what problem you are having with the SeqIO API. It should be quite happy to take a generator function, iterator, etc (as opposed to a list of SeqRecord objects which I assume is what you mean by "materialize the whole thing"). 
> There doesn't seem to be an obvious API for obtaining such a writer > using the SeqIO interface. You can do that with: from Bio.SeqIO.SffIO import SffWriter > Am I missing something obvious? > Probably. You can divide a large SFF file into smaller SFF files via the high level Bio.SeqIO.parse/write interface. Personally I like to use generator expressions to do a filtering operation. Note if you want to divide a large SFF file while preserving the Roche XML manifest things are a little more tricky. You should use the ReadRocheXmlManifest function in combination with the SffWriter. You can see an example of this in sff_filter_by_id.py, a tool I wrote for Galaxy - search for "Filter SFF by ID" here: http://community.g2.bx.psu.edu/ Peter From bioinformed at gmail.com Wed Jan 26 11:44:53 2011 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Wed, 26 Jan 2011 11:44:53 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Wed, Jan 26, 2011 at 10:45 AM, Peter Cock wrote: > On Wed, Jan 26, 2011 at 3:14 PM, Kevin Jacobs wrote: > > Any objections/worries about converting the SFF writer to use the > > sequential/incremental writer object interface? I know it looks > > specialized for text formats, but > > It already uses Bio.SeqIO.Interfaces.SequenceWriter > > Sorry-- was shooting from the hip. I meant a SequentialSequenceWriter. > > ... I need to split large SFF files into many smaller ones > > and would rather not materialize the whole thing. The SFF writer > > code already allows for deferred writing of read counts and index > > creation, so it looks to be only minor surgery. > > I don't understand what problem you are having with the SeqIO API. > It should be quite happy to take a generator function, iterator, etc > (as opposed to a list of SeqRecord objects which I assume is what > you mean by "materialize the whole thing"). The goal is to demultiplex a larger file, so I need a "push" interface. e.g. out = dict(...) # of SffWriters for rec in SeqIO(filename,'sff-trim'): out[id(read)].write_record(rec) for writer in out.itervalues(): writer.write_footer() I could use a simple generator if I was merely filtering records, but the write_file interface would require more co-routine functionality than generators provide. > There doesn't seem to be an obvious API for obtaining such a writer > > using the SeqIO interface. > > You can do that with: > > from Bio.SeqIO.SffIO import SffWriter > > For my immediate need, this is fine. However, the more general API doesn't have a SeqIO.writer to get SequentialSequenceWriter objects. -Kevin From bugzilla-daemon at portal.open-bio.org Wed Jan 26 12:01:58 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 12:01:58 -0500 Subject: [Biopython-dev] [Bug 3171] inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom In-Reply-To: Message-ID: <201101261701.p0QH1wSc024517@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3171 ------- Comment #4 from anaryin at gmail.com 2011-01-26 12:01 EST ------- (In reply to comment #3) > (In reply to comment #1) > > This problem is already fixed and pushed to the master branch. 
> > > > Commit: > > https://github.com/biopython/biopython/blob/6c7ef358e5f93599ca165ce8e7b46261106e2b06/Bio/PDB/PDBIO.py > > > > That's just a white space change (from a branch merge): > > https://github.com/biopython/biopython/commit/6c7ef358e5f93599ca165ce8e7b46261106e2b06 > > Probably the fix you are looking for was applied earlier in the history. > True, sorry. https://github.com/biopython/biopython/commit/594526926f29411a83e996799afc8f010d4fd2e2 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Wed Jan 26 12:19:36 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Jan 2011 17:19:36 +0000 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Wed, Jan 26, 2011 at 4:44 PM, Kevin Jacobs wrote: > On Wed, Jan 26, 2011 at 10:45 AM, Peter Cock > wrote: >> >> On Wed, Jan 26, 2011 at 3:14 PM, Kevin Jacobs wrote: >> > Any objections/worries about converting the SFF writer to use the >> > sequential/incremental writer object interface? ?I know it looks >> > specialized for text formats, but >> >> It already uses Bio.SeqIO.Interfaces.SequenceWriter >> > > Sorry-- was shooting from the hip. ?I meant a SequentialSequenceWriter. > The file formats which use SequentialSequenceWriter have trivial (or no) header/footer, which require no additional arguments. The SFF file format has a non-trivial header which records flow space settings etc. Any write_header method would have to be SFF specific, likewise any write_footer method for the index and XML manifest. I don't see what you have in mind. In fact, looking at SffIO.py again now, I think the SffWriter's write_header and write_record method should be private with just write_file as a public method. >> > ... I need to split large SFF files into many smaller ones >> > and would rather not materialize the whole thing. ?The SFF writer >> > code already allows for deferred writing of read counts and index >> > creation, so it looks to be only minor surgery. >> >> I don't understand what problem you are having with the SeqIO API. >> It should be quite happy to take a generator function, iterator, etc >> (as opposed to a list of SeqRecord objects which I assume is what >> you mean by "materialize the whole thing"). > > The goal is to demultiplex a larger file, so I need a "push" interface. > ?e.g. > out = dict(...) # of SffWriters > for rec in SeqIO(filename,'sff-trim'): > ??out[id(read)].write_record(rec) > > for writer in out.itervalues(): > ??writer.write_footer() I don't think the above will work without some "magic" to record the SFF header (which currently would require using private attributes of the SffWriter objects) as done via its write_file method. Also you can't read in SFF files with "sff-trim" if you want to output them, since this discards all the flow space information. You have to use format "sff" instead. > I could use a simple generator if I was merely filtering records, but the > write_file interface would require more co-routine functionality than > generators provide. How many output files do you have? Assuming it is small I'd go for the simple solution of one loop over the input SFF file for each output file. A variation on this would be to make a list of read IDs for each output file, then use the Bio.SeqIO.index for random access to the records to get the records, e.g. 
records = SeqIO.index(original_filename, "sff") for filename in [...]: wanted = [...] # some list or generator records = (records[id] for id in wanted) SeqIO.write(records, filename, "sff") Otherwise look at itertools.tee for splitting the iterator if you really want to make a single pass though the original SFF file. >> > There doesn't seem to be an obvious API for obtaining such a >> > writer using the SeqIO interface. >> >> You can do that with: >> >> from Bio.SeqIO.SffIO import SffWriter >> > > For my immediate need, this is fine. ?However, the more general > API doesn't have a SeqIO.writer to get?SequentialSequenceWriter > objects. For good reason - not all the writers use SequentialSequenceWriter, because for many file formats it is too narrow in scope. Peter From bioinformed at gmail.com Wed Jan 26 13:30:44 2011 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Wed, 26 Jan 2011 13:30:44 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Wed, Jan 26, 2011 at 12:19 PM, Peter Cock wrote: > I don't think the above will work without some "magic" to record the > SFF header (which currently would require using private attributes > of the SffWriter objects) as done via its write_file method. > > Also you can't read in SFF files with "sff-trim" if you want to output > them, since this discards all the flow space information. You have > to use format "sff" instead. > > Agreed-- shooting from the hip again. > > I could use a simple generator if I was merely filtering records, but the > > write_file interface would require more co-routine functionality than > > generators provide. > > How many output files do you have? Assuming it is small I'd go for > the simple solution of one loop over the input SFF file for each output > file. > > We're routinely multiplexing hundreds or thousands of samples per SFF file and using sequence barcodes to identify them. The number of outputs make a one-pass solution is much preferable. Anyhow, it seems that this has gone beyond the scope of generic Biopython, so I'm happy to make my modifications locally (and share the results if anyone is interested). We're currently using the Roche/454 sff tools, but they have known bugs and we have 5' and 3' adapters to consider. Thanks, -Kevin From p.j.a.cock at googlemail.com Wed Jan 26 14:44:10 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Jan 2011 19:44:10 +0000 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Wednesday, January 26, 2011, Kevin Jacobs wrote: > > How many output files do you have? Assuming it is small I'd go for > the simple solution of one loop over the input SFF file for each output > file. > > We're routinely multiplexing hundreds or thousands of samples per SFF file and using sequence barcodes to identify them. ?The number of outputs make a one-pass solution is much preferable. ?Anyhow, it seems that this has gone beyond the scope of generic Biopython, so I'm happy to make my modifications locally (and share the results if anyone is interested). ?We're currently using the Roche/454 sff tools, but they have known bugs and we have 5' and 3' adapters to consider. > > Thanks,?-Kevin > I've got a better feel for what you are attempting to do now. I think one avenue would be to extend the write_header method to take some SFF specific arguments and add a write_footer method taking the optional Roche XML manifest which would (assuming it could seek) write the index block and update the header. 
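(As a concrete illustration of the earlier index-based suggestion applied to the demultiplexing case - a rough sketch only, where barcode_of() is a hypothetical helper, the filenames are made up, and the Roche XML manifest caveat above is ignored:)

    # Rough sketch - one pass to assign read ids to barcodes, then one
    # SeqIO.write per output file using the random access index.
    from Bio import SeqIO

    def barcode_of(record):
        # Hypothetical helper, e.g. compare record.seq[:10] to known barcodes.
        raise NotImplementedError

    reads = SeqIO.index("large.sff", "sff")   # must be "sff", not "sff-trim", for output
    by_barcode = {}
    for name in reads:
        barcode = barcode_of(reads[name])
        if barcode is not None:
            by_barcode.setdefault(barcode, []).append(name)

    for barcode, names in by_barcode.iteritems():
        SeqIO.write((reads[n] for n in names), "%s.sff" % barcode, "sff")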
All this may not make much sense without looking at the code and the SFF format spec. I'm currently looking at trimming 5' and 3' PCR primer sequences - which could equally be used for barcodes etc. I'd probably wrap this as a Galaxy tool (using Biopython). Peter From bioinformed at gmail.com Wed Jan 26 18:24:51 2011 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Wed, 26 Jan 2011 18:24:51 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Wed, Jan 26, 2011 at 2:44 PM, Peter Cock wrote: > I've got a better feel for what you are attempting to do now. I think > one avenue would be to extend the write_header method to take some SFF > specific arguments and add a write_footer method taking the optional > Roche XML manifest which would (assuming it could seek) write the > index block and update the header. All this may not make much sense > without looking at the code and the SFF format spec. > > This is essentially what I'm doing. The index and manifest are written after the flow records, so this approach is quite feasible. > I'm currently looking at trimming 5' and 3' PCR primer sequences - > which could equally be used for barcodes etc. I'd probably wrap this > as a Galaxy tool (using Biopython). > > I have 90% of such a tool written. I use a banded Smith-Waterman alignment to match barcodes and generic PCR adapters/consensus sequence to ensure that adapters and barcodes can be detected at both ends of reads. -Kevin From p.j.a.cock at googlemail.com Thu Jan 27 08:32:45 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Jan 2011 13:32:45 +0000 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Wed, Jan 26, 2011 at 11:24 PM, Kevin wrote: > On Wed, Jan 26, 2011 at 2:44 PM, Peter wrote: >> >> I'm currently looking at trimming 5' and 3' PCR ?primer sequences - >> which could equally be used for barcodes etc. I'd probably wrap this >> as a Galaxy tool (using Biopython). >> > > I have 90% of such a tool written. ?I use a banded Smith-Waterman > alignment to match barcodes and generic PCR adapters/consensus > sequence to ensure that adapters and barcodes can be detected at > both ends of reads. Interesting - and yes, we do seem to have similar aims here. I have been doing ungapped alignments, allowing 0 or 1 (maybe in future 2) mismatches, working on getting this running at reasonable speed. Gapped alignments would be particularly important in 454 reads with homopolymer errors, but most barcodes and PCR primers will avoid homopolymer runs so I don't expect this to be a common problem in this use case. Do you have good reasons to go to the expense of a gapped alignment? Peter From biopython at maubp.freeserve.co.uk Thu Jan 27 10:43:16 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 27 Jan 2011 15:43:16 +0000 Subject: [Biopython-dev] curious error from HMM unit test In-Reply-To: References: Message-ID: 2011/1/24 Peter : > 2011/1/24 Tiago Ant?o : >> Sorry, I am drenched in work (writing up my PhD thesis) and had no >> time to attend to this. Maybe it is the Jython version? It is still a >> release candidate. I think the one installed is RC2, I am going to >> upgrade to the new RC3 and try again. > > Not to worry Tiago - it's not your machine - its one of mine (with > Jython 2.5.2 RC3). You've got more important things right now ;) > > It looks like we may have found a bug worth reporting to the > Jython guys. I'll try and work out if I can reproduce it "by hand" > rather than just via buildbot. 
That turned out to be easy enough, logged in as the buildslave, got the latest Biopython code with git, did "jython setup.py install", switched to the test directory and: $ jython test_HMMCasino.py Training with the Standard Trainer... Training with Baum-Welch... # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00002aaaab550400, pid=16570, tid=1106139456 # # JRE version: 6.0_17-b17 # Java VM: OpenJDK 64-Bit Server VM (14.0-b16 mixed mode linux-amd64 ) # Derivative: IcedTea6 1.7.5 # Distribution: Custom build (Wed Oct 13 13:04:40 EDT 2010) # Problematic frame: # j Bio.HMM.Trainer$py.update_emissions$12(Lorg/python/core/PyFrame;Lorg/python/core/ThreadState;)Lorg/python/core/PyObject;+555 # # An error report file with more information is saved as: # /home/buildslave/repositories/biopython/Tests/hs_err_pid16570.log # # If you would like to submit a bug report, please include # instructions how to reproduce the bug and visit: # http://icedtea.classpath.org/bugzilla # /home/buildslave/bin/jython: line 271: 16570 Aborted "${JAVA_CMD[@]}" $JAVA_OPTS "${java_args[@]}" -Dpython.home="$JYTHON_HOME" -Dpython.executable="$PRG" org.python.util.jython $JYTHON_OPTS "$@" So, we've had several outcomes (note the build number doesn't increase with git changes, some of these are rebuilds of older code): Build - git revision - outcome 0 - ? - success 1 - ? - success 2 - ? - success 3 - ? - success 4 - ? - success 5 - ? - success 6 - ? - UnboundLocalError 7 - ? - success 8 - ? - success 9 - ? - success 10 - ? - success 11 - ? - success 12 - ? - UnboundLocalError 13 - ? - success 14 - 5c729927d79a9f22b89d8a6f794865c8e1209ed5 - success (At this point Java was updated on this machine.) 15 - cc6842e0f79178af6bf9f32ad6ac3025685f55d1 - timeout 16 - cc6842e0f79178af6bf9f32ad6ac3025685f55d1 - fatal error detect by Java 17 - 215d8a37e20b50491613cc153bdf366d875cf251 - fatal error detect by Java 18 - 215d8a37e20b50491613cc153bdf366d875cf251 - fatal error detect by Java 19 - cc6842e0f79178af6bf9f32ad6ac3025685f55d1 - fatal error detect by Java 20 - 215d8a37e20b50491613cc153bdf366d875cf251 - fatal error detect by Java 21 - f36daaf7dada756822c1040cdb1a74ae0794469d - fatal error detect by Java 22 - 9ec46f981a1fb7f97eaee4a2a01ad8bb3297234b - fatal error detect by Java (At this point I removed Biopython from jython's site-packages, just in case that was conflicting with the un-installed builds being tested. No change...) 23 - 9ec46f981a1fb7f97eaee4a2a01ad8bb3297234b - fatal error detect by Java (for the next builds, I went back to revision for last success, build 14, and build 0) 24 - 5c729927d79a9f22b89d8a6f794865c8e1209ed5 - fatal error detect by Java 25 - 5c729927d79a9f22b89d8a6f794865c8e1209ed5 - fatal error detect by Java 26 - b61cb9d34b24a2e24b0b95453b2707f040d44d89 - fatal error detect by Java I'm pretty convinced that the switch from an occasional UnboundLocalError to a repeatable fatal error is down to the change in Java (although why build 15 timed out rather than triggered the fatal error is curious). $ sudo grep java /var/log/yum.log Jan 21 14:18:29 Installed: tzdata-java-2010l-1.el5.x86_64 Jan 21 14:20:31 Updated: 1:java-1.6.0-openjdk-1.6.0.0-1.16.b17.el5.x86_64 Jan 21 14:20:46 Updated: 1:java-1.6.0-openjdk-devel-1.6.0.0-1.16.b17.el5.x86_64 $ jython Jython 2.5.2rc3 (Release_2_5_2rc3:7184, Jan 10 2011, 22:54:57) [OpenJDK 64-Bit Server VM (Sun Microsystems Inc.)] on java1.6.0_17 Type "help", "copyright", "credits" or "license" for more information. 
>>> import sys
>>> sys.version_info
(2, 5, 2, 'candidate', 3)
>>> print sys.version
2.5.2rc3 (Release_2_5_2rc3:7184, Jan 10 2011, 22:54:57)
[OpenJDK 64-Bit Server VM (Sun Microsystems Inc.)]

So this looks like a bug in either Jython 2.5.2 RC3 and/or Java under CentOS. Peter From biopython at maubp.freeserve.co.uk Thu Jan 27 11:13:32 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 27 Jan 2011 16:13:32 +0000 Subject: [Biopython-dev] curious error from HMM unit test In-Reply-To: References: Message-ID: On Thu, Jan 27, 2011 at 3:43 PM, Peter wrote: > > So this looks like a bug in either Jython 2.5.2 RC3 and/or Java under CentOS. > Same fatal error from test_HMMCasino.py using Jython 2.5.1 on this machine. This could be two bugs (one in Java causing the fatal error, and one in Jython causing the intermittent UnboundLocalError). I think for now I'll just disable this test on Jython, https://github.com/biopython/biopython/commit/8a16e24a1e1076a93957b61b28e933a0cf65d49f Peter From bioinformed at gmail.com Fri Jan 28 07:14:39 2011 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Fri, 28 Jan 2011 07:14:39 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Thu, Jan 27, 2011 at 8:32 AM, Peter Cock wrote: > On Wed, Jan 26, 2011 at 11:24 PM, Kevin wrote: > > On Wed, Jan 26, 2011 at 2:44 PM, Peter wrote: > >> I'm currently looking at trimming 5' and 3' PCR primer sequences - > >> which could equally be used for barcodes etc. I'd probably wrap this > >> as a Galaxy tool (using Biopython). > > > > I have 90% of such a tool written. I use a banded Smith-Waterman > > alignment to match barcodes and generic PCR adapters/consensus > > sequence to ensure that adapters and barcodes can be detected at > > both ends of reads. > > Interesting - and yes, we do seem to have similar aims here. I have > been doing ungapped alignments, allowing 0 or 1 (maybe in future 2) > mismatches, working on getting this running at reasonable speed. > For just 5' barcode detection, I am using a memoized scheme that computes anchored alignments and then stores the result in a hash table (match/mismatch, edit distance). This approach allows me to reject barcodes with too small an edit distance to the next best candidate. It is reasonably fast for our fairly long 454 barcode set (10-mers), though I do have an optional Cython version of the edit distance routine. The pure-Python version is pretty zippy and can decode a 454 run in a minute or two. > Gapped alignments would be particularly important in 454 reads > with homopolymer errors, but most barcodes and PCR primers > will avoid homopolymer runs so I don't expect this to be a common > problem in this use case. Do you have good reasons to go to the > expense of a gapped alignment? > > When only trimming short 5' adapters, a gapped alignment may be a bit overkill. However, for our short-amplicon libraries we have a bit of a challenge using simpler approaches. Instead of hand-waving, here are the gory details: We're using Fluidigm Access Arrays to generate libraries and sequence fragments whose forward and reverse reads are built from the following elements, where on, e.g., 454 and Ion Torrent:

key = TCAG
barcode = non-homopolymeric 10-mer (up to 192 of them, min. edit distance 4)
cs1 = 22-mer generic PCR primer (distinct from cs2)
cs2 = 22-mer generic PCR primer (distinct from cs1)
target = ~30-150-mer genomic DNA target

("rc" denoting reverse complement). Our designs for Illumina are a bit different, so I won't go into those right now.
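Since the memoized edit-distance approach keeps coming up, here is a minimal self-contained sketch of the idea (this is not the GLU code; the barcodes, prefix and thresholds are made-up example values):

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    previous = range(len(b) + 1)
    for i, x in enumerate(a):
        current = [i + 1]
        for j, y in enumerate(b):
            current.append(min(previous[j + 1] + 1,      # deletion
                               current[j] + 1,           # insertion
                               previous[j] + (x != y)))  # substitution
        previous = current
    return previous[-1]

_barcode_cache = {}

def call_barcode(read_prefix, barcodes, max_dist=2, min_gap=2):
    # memoize on the observed read prefix so repeated read starts are free
    if read_prefix in _barcode_cache:
        return _barcode_cache[read_prefix]
    scored = sorted((edit_distance(read_prefix, bc), bc) for bc in barcodes)
    best_dist, best = scored[0]
    if len(scored) > 1:
        runner_up = scored[1][0]
    else:
        runner_up = best_dist + min_gap
    if best_dist <= max_dist and runner_up - best_dist >= min_gap:
        result = best
    else:
        result = None  # too many errors or too close to the next candidate
    _barcode_cache[read_prefix] = result
    return result

barcodes = ["ACGTACGTAC", "TGCATGCATG", "GATCGATCGA"]  # made-up 10-mers
print(call_barcode("ACGTACGTAA", barcodes))            # -> ACGTACGTAC

Rejecting a call when the runner-up is within min_gap edits is what gives the "too small an edit distance to the next best candidate" behaviour described above.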
I use the procedure outlined before to determine the barcode. Then I compute left-anchored gapped alignments between the read and constructs that represent perfect matches to the targets to find the most likely boundaries to trim the 5' and 3' elements from the target sequence. I'm in the process of adding position specific scoring and gap penalties, since this adds virtually no computational cost and improves the boundary detection. The results go to a genotype calling algorithm to classify known and novel variants. This approach is a bit overkill for some sequencing platforms with shorter reads (e.g. Illumina or IonTorrent with 100 bp reads), but on 454 (and soon PacBio, we hope) we routinely sequence through the targets and into the 3' elements and have to trim. -Kevin From chapmanb at 50mail.com Fri Jan 28 07:34:18 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 28 Jan 2011 07:34:18 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: <20110128123418.GD7866@sobchak.mgh.harvard.edu> Kevin and Peter; I'm really enjoying this discussion -- thanks for talking this through here. > For just 5' barcode detection, I am using a memoized scheme that computes > anchored alignments and then stores the result in a hash table > (match/mismatch, edit distance). This approach allows me to reject barcodes > with too small an edit distance to the next best candidate. It is > reasonably fast for our fairly long 454 barcode set (10-'mers), though I do > have an optional Cython version of the edit distance routine. The > pure-Python version is pretty zippy and can decode a 454 run in a minute or > two. This sounds like a nice approach. Do you have code available or is it not packaged up yet? I wrote up a barcode detector, remover and sorter for our Illumina reads. There is nothing especially tricky in the implementation: it looks for exact matches and then checks for approximate matches, with gaps, using pairwise2: https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py The "best_match" function could be replaced with different implementations, using the rest of the script as scaffolding to do all of the other sorting, trimming and output. Brad From bioinformed at gmail.com Fri Jan 28 08:54:47 2011 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Fri, 28 Jan 2011 08:54:47 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: <20110128123418.GD7866@sobchak.mgh.harvard.edu> References: <20110128123418.GD7866@sobchak.mgh.harvard.edu> Message-ID: On Fri, Jan 28, 2011 at 7:34 AM, Brad Chapman wrote: > Kevin and Peter; > I'm really enjoying this discussion -- thanks for talking this > through here. > > > For just 5' barcode detection, I am using a memoized scheme that computes > > anchored alignments and then stores the result in a hash table > > (match/mismatch, edit distance). This approach allows me to reject > barcodes > > with too small an edit distance to the next best candidate. It is > > reasonably fast for our fairly long 454 barcode set (10-'mers), though I > do > > have an optional Cython version of the edit distance routine. The > > pure-Python version is pretty zippy and can decode a 454 run in a minute > or > > two. > > This sounds like a nice approach. Do you have code available or is > it not packaged up yet? > It is still under development with some of the refinements I mentioned in a non-public branch and have not percolated out to my Google code version. 
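For a concrete picture of the exact-then-approximate matching mentioned above, here is a minimal sketch; it is not the actual best_match code in barcode_sort_trim.py, and the scoring values and score threshold are made-up:

from Bio import pairwise2

def best_match(read_start, barcodes, min_score=7.0):
    # 1) cheap exact prefix check first
    for barcode in barcodes:
        if read_start.startswith(barcode):
            return barcode
    # 2) fall back to a gapped local alignment with pairwise2
    best, best_score = None, min_score
    for barcode in barcodes:
        alignments = pairwise2.align.localms(read_start, barcode,
                                             1.0, -1.0, -1.5, -0.5)
        if alignments and alignments[0][2] > best_score:
            best, best_score = barcode, alignments[0][2]
    return best

print(best_match("ACGTACGTTCAAGGTT", ["ACGTACGTAC", "TGCATGCATG"]))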
However, a previous version is available from: http://code.google.com/p/glu-genetics/source/browse/glu/modules/seq/unbarcode.py# > I wrote up a barcode detector, remover and sorter for our Illumina > reads. There is nothing especially tricky in the implementation: it > looks for exact matches and then checks for approximate matches, > with gaps, using pairwise2: > > > https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py > > The "best_match" function could be replaced with different > implementations, using the rest of the script as scaffolding to do > all of the other sorting, trimming and output. > > Nice! I didn't know about pairwise2, though I figured BioPython would have something to that effect. -Kevin From anaryin at gmail.com Fri Jan 28 19:51:35 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Sat, 29 Jan 2011 01:51:35 +0100 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: Message-ID: Also refactored a bit and extended the remove_disordered_atoms after a discussion with a colleague in the lab. Pushed as well to my branch pdb_enhancements. From p.j.a.cock at googlemail.com Sat Jan 29 18:28:21 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 29 Jan 2011 23:28:21 +0000 Subject: [Biopython-dev] [GitHub] Bug 2947 viterbi Message-ID: Hi all, I'm forwarding a pull request via github (in future should we try to have these sent to the dev list automatically?). Does anyone familiar with HMM's want to look at this? Thanks, Peter ---------- Forwarded message ---------- From: GitHub Date: Sat, Jan 29, 2011 at 10:34 PM Subject: [GitHub] Bug 2947 viterbi [biopython/biopython GH-1] To: p.j.a.cock at googlemail.com pgarland wants someone to pull from pgarland:bug-2947-viterbi: Hello, I think this fixes bug 2947. There were 2 errors in how the state sequence was calcuated. The first occurs at the beginning of the sequence. viterbi_probs[(state_letters[0], -1)] was initialized to 1 and viterbi_probs[(state_letters[0], -1)] ?to 0 for all state letters other than the zeroth. This is how it is described in Biological Sequence Analysis by Durbin, et al, but it doesn't work for the code as written because the code doesn't provide for an particular begin state. By initializing the zeroth state letter to 1, and all the others to 0, you're starting off by assigning a higher probability to a state sequence that begins with the zeroth state letter. I fixed this error by also setting ?viterbi_probs[(state_letters[0], -1)] to 0, so that all possible initial states are equally probable. There's a second error in the Viterbi termination code. The algorithm described in Durbin et al allows for modeling a particular end state. Since the code as written doesn't provide for specifying an end state, the termination code miscalculates the sequence probability. The pseudocode in Durbin et al confusingly labels the end state as "0", at least in the printing I have, and this seems to have been carried over into the biopython code, where the zeroth state_letter is whichever one is first in the list. The code as written calculated the probability of the discovered state sequence multiplied by the probability of transitioning from the sequence's last element to the zeroth state named in state_letters, when it should just calculate the probability of the discovered state sequence. I fixed this by deleting the lines that were intended to account for the transition to an end state. 
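For anyone looking at this without the Bio.HMM code to hand, here is a small generic Viterbi implementation illustrating the behaviour being described - all initial states equally probable, and termination taken directly from the best final state with no extra end-state transition. It is not the Biopython code itself, and the toy fair/loaded dice parameters are made-up:

import math

def viterbi(observations, states, trans, emit):
    # initialization: every state equally probable at the first position
    v = [dict((s, math.log(1.0 / len(states)) + math.log(emit[s][observations[0]]))
              for s in states)]
    back = [{}]
    for t in range(1, len(observations)):
        v.append({})
        back.append({})
        for s in states:
            best_prev, best_score = None, None
            for p in states:
                score = v[t - 1][p] + math.log(trans[p][s])
                if best_score is None or score > best_score:
                    best_prev, best_score = p, score
            v[t][s] = best_score + math.log(emit[s][observations[t]])
            back[t][s] = best_prev
    # termination: best final state, with no extra "end state" transition
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, v[-1][last]

states = ("F", "L")  # fair and loaded dice, as in the casino example
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
fair = dict((str(i), 1 / 6.0) for i in range(1, 7))
loaded = dict((str(i), 0.1) for i in range(1, 6))
loaded["6"] = 0.5
emit = {"F": fair, "L": loaded}
print(viterbi(list("31662666"), states, trans, emit))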
It could be useful to specify particular begin and end states, but I believe this patch should give correct results for all cases that don't need that ability. ~Phillip View Pull Request: https://github.com/biopython/biopython/pull/1 From bugzilla-daemon at portal.open-bio.org Mon Jan 31 11:50:15 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 31 Jan 2011 11:50:15 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201101311650.p0VGoFsS028981@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-31 11:50 EST ------- Should be fixed now, can anyone confirm this? https://github.com/biopython/biopython/commit/0502ba205bd227655cd5229f5adad63bf9813b23 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 5 17:33:13 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Jan 2011 12:33:13 -0500 Subject: [Biopython-dev] [Bug 3166] New: Bio.PDB.DSSP fails to work on PDBs with HETATM Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3166 Summary: Bio.PDB.DSSP fails to work on PDBs with HETATM Product: Biopython Version: 1.50 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: macrozhu+biopy at gmail.com Hi, I am current using BioPython 1.50. It seems Bio.PDB.DSSP fails if the input PDB file contains HETATM. For example, for PDB entry 3jui, the DSSP.__init__() function breaks with an exception: KeyError: (' ', 547, ' ') This is because residue 547 has id ('H_MSE', 547, ' '). But the function Bio.PDB.DSSP.make_dssp_dict() never fill the het field in residue id when parsing DSSP output. See line 135 in DSSP.py: res_id=(" ", resseq, icode) As a matter of fact, there is no way to figure out the het value from DSSP output. Therefore, to address this issue, I suggest to revise the function DSSP.__init__() so that it looks like below (revised lines marked with comments # REVISED):

class DSSP(AbstractResiduePropertyMap):
    """
    Run DSSP on a pdb file, and provide a handle to the
    DSSP secondary structure and accessibility.

    Note that DSSP can only handle one model.
    Example:
        >>> p=PDBParser()
        >>> structure=parser.get_structure("1fat.pdb")
        >>> model=structure[0]
        >>> dssp=DSSP(model, "1fat.pdb")
        >>> # print dssp data for a residue
        >>> secondary_structure, accessibility=dssp[(chain_id, res_id)]
    """
    def __init__(self, model, pdb_file, dssp="dssp"):
        """
        @param model: the first model of the structure
        @type model: L{Model}

        @param pdb_file: a PDB file
        @type pdb_file: string

        @param dssp: the dssp executable (ie. the argument to os.system)
        @type dssp: string
        """
        # create DSSP dictionary
        dssp_dict, dssp_keys=dssp_dict_from_pdb_file(pdb_file, dssp)
        dssp_map={}
        dssp_list=[]
        # Now create a dictionary that maps Residue objects to
        # secondary structure and accessibility, and a list of
        # (residue, (secondary structure, accessibility)) tuples
        for key in dssp_keys:
            chain_id, res_id=key
            chain=model[chain_id]
            ####################
            ### REVISED
            ####################
            # in DSSP, HET field is not considered
            # thus HETATM records may cause unnecessary exceptions
            # e.g. 3jui.
            try:
                res=chain[res_id]
            except KeyError:
                found = False
                # try again with all HETATM
                # consider resseq + icode
                res_seq_icode = ('%s%s' % (res_id[1],res_id[2])).strip()
                for r in chain:
                    if r.id[0] != ' ':
                        r_seq_icode = ('%s%s' % (r.id[1],r.id[2])).strip()
                        if r_seq_icode == res_seq_icode:
                            res = r
                            found = True
                            break
                if not found:
                    raise KeyError(res_id)
            ####################
            ### REVISED FINISHES
            ####################
            aa, ss, acc=dssp_dict[key]
            res.xtra["SS_DSSP"]=ss
            res.xtra["EXP_DSSP_ASA"]=acc
            # relative accessibility
            resname=res.get_resname()
            try:
                rel_acc=acc/MAX_ACC[resname]
                if rel_acc>1.0:
                    rel_acc=1.0
            except KeyError:
                rel_acc = 'NA'
            res.xtra["EXP_DSSP_RASA"]=rel_acc
            # Verify if AA in DSSP == AA in Structure
            # Something went wrong if this is not true!
            resname=to_one_letter_code[resname]
            if resname=="C":
                # DSSP renames C in C-bridges to a,b,c,d,...
                # - we rename it back to 'C'
                if _dssp_cys.match(aa):
                    aa='C'
            ####################
            ### REVISED
            ####################
            if not (resname==aa or (res.id[0] != ' ' and aa=='X')):
            ####################
            ### REVISED FINISHES
            ####################
                raise PDBException("Structure/DSSP mismatch at "+str(res))
            dssp_map[key]=((res, ss, acc, rel_acc))
            dssp_list.append((res, ss, acc, rel_acc))
        AbstractResiduePropertyMap.__init__(self, dssp_map, dssp_keys, dssp_list)

-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 5 17:44:22 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Jan 2011 12:44:22 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101051744.p05HiMA0006192@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-05 12:44 EST ------- Hi, Biopython 1.50 was released in April 2009, and is nearly two years old. However, checking for recent changes to the file Bio/PDB/DSSP.py nothing looks to have altered the __init__ code you're interested in. https://github.com/biopython/biopython/commits/master/Bio/PDB/DSSP.py Could you submit your proposed changes as a patch against the latest code please? You can attach the patch file to this bug. Also an explicit example (ideally just a few lines of Python) showing how to reproduce the problem would be very helpful. Thanks!
Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 5 18:07:15 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Jan 2011 13:07:15 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101051807.p05I7FhZ007739@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #2 from macrozhu+biopy at gmail.com 2011-01-05 13:07 EST ------- Created an attachment (id=1556) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1556&action=view) handle DSSP results on PDBs with HETATM, parse PHI and PSI angles The current version of DSSP can not handle some PDB files with HETATM in it. e.g. 3jui. This can be illustrated by the following example: > python DSSP.py 3jui.pdb KeyError: (' ', 547, ' ') In addition, PHI and PSI angles calculated by DSSP are useful in many cases. Therefore, in this patch I also revised the code so that PHI and PSI angles in DSSP output files are parsed and assigned to residue feature xtra. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 5 18:08:24 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Jan 2011 13:08:24 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101051808.p05I8OC8007831@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #3 from macrozhu+biopy at gmail.com 2011-01-05 13:08 EST ------- Created an attachment (id=1557) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1557&action=view) handle DSSP results on PDBs with HETATM, parse PHI and PSI angles Hi, Peter, The current version of DSSP can not handle some PDB files with HETATM in it. e.g. 3jui. This can be illustrated by the following example: > python DSSP.py 3jui.pdb KeyError: (' ', 547, ' ') Here is a patch file DSSP.py for addressing this problem. In addition, PHI and PSI angles calculated by DSSP are useful in many cases. Therefore, in this patch I also revised the code so that PHI and PSI angles in DSSP output files are parsed and assigned to residue feature xtra. regards, hongbo zhu -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 5 18:10:56 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 5 Jan 2011 13:10:56 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101051810.p05IAu8d007991@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #4 from macrozhu+biopy at gmail.com 2011-01-05 13:10 EST ------- oops, the same patch was submitted twice. Please ignore 1556 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
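For the record, the few-line reproduction asked for in comment #1 would look something like this (a sketch assuming the dssp binary is on the PATH and that 3jui.pdb has been downloaded locally):

from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

parser = PDBParser()
structure = parser.get_structure("3jui", "3jui.pdb")
model = structure[0]
# On an unpatched Bio.PDB.DSSP this raises KeyError: (' ', 547, ' ')
# because residue 547 is the modified residue MSE, stored as a HETATM.
dssp = DSSP(model, "3jui.pdb")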
From updates at feedmyinbox.com Fri Jan 7 08:45:34 2011 From: updates at feedmyinbox.com (Feed My Inbox) Date: Fri, 7 Jan 2011 03:45:34 -0500 Subject: [Biopython-dev] 1/7 newest questions tagged biopython - Stack Overflow Message-ID: // python multiprocessing each with own subprocess (Kubuntu,Mac) // January 6, 2011 at 4:25 PM http://stackoverflow.com/questions/4620041/python-multiprocessing-each-with-own-subprocess-kubuntu-mac I've created a script that by default creates one multiprocessing Process; then it works fine. When starting multiple processes, it starts to hang, and not always in the same place. The program's about 700 lines of code, so I'll try to summarise what's going on. I want to make the most of my multi-cores, by parallelising the slowest task, which is aligning DNA sequences. For that I use the subprocess module to call a command-line program: 'hmmsearch', which I can feed in sequences through /dev/stdin, and then I read out the aligned sequences through /dev/stdout. I imagine the hang occurs because of these multiple subprocess instances reading / writing from stdout / stdin, and I really don't know the best way to go about this... I was looking into os.fdopen(...) & os.tmpfile(), to create temporary filehandles or pipes where I can flush the data through. However, I've never used either before & I can't picture how to do that with the subprocess module. Ideally I'd like to bypass using the hard-drive entirely, because pipes are much better with high-throughput data processing! Any help with this would be super wonderful!! import multiprocessing, subprocess from Bio import SeqIO class align_seq( multiprocessing.Process ): def __init__( self, inPipe, outPipe, semaphore, options ): multiprocessing.Process.__init__(self) self.in_pipe = inPipe ## Sequences in self.out_pipe = outPipe ## Alignment out self.options = options.copy() ## Modifiable sub-environment self.sem = semaphore def run(self): inp = self.in_pipe.recv() while inp != 'STOP': seq_record , HMM = inp # seq_record is only ever one Bio.Seq.SeqRecord object at a time. # HMM is a file location. align_process = subprocess.Popen( ['hmmsearch', '-A', '/dev/stdout', '-o',os.devnull, HMM, '/dev/stdin'], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE ) self.sem.acquire() align_process.stdin.write( seq_record.format('fasta') ) align_process.stdin.close() for seq in SeqIO.parse( align_process.stdout, 'stockholm' ): # get the alignment output self.out_pipe.send_bytes( seq.seq.tostring() ) # send it to consumer align_process.wait() # Don't know if there's any need for this?? self.sem.release() align_process.stdout.close() inp = self.in_pipe.recv() self.in_pipe.close() #Close handles so don't overshoot max. limit on number of file-handles. self.out_pipe.close() -- Website: http://stackoverflow.com/questions/tagged/?tagnames=biopython&sort=newest Account Login: https://www.feedmyinbox.com/members/login/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/520218/8a8361b28bdb22206ffa317797e7067a6f101db5/?utm_source=fmi&utm_medium=email&utm_campaign=feed-email -- This email was carefully delivered by FeedMyInbox.com. 
PO Box 682532 Franklin, TN 37068 From bugzilla-daemon at portal.open-bio.org Wed Jan 12 13:54:30 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 12 Jan 2011 08:54:30 -0500 Subject: [Biopython-dev] [Bug 3168] New: different StringIO import for Python 3 Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3168 Summary: different StringIO import for Python 3 Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: michael.kuhn at gmail.com Bio/File.py fails in Python 3, because StringIO.StringIO has been moved to the io module. These changes fix this (in Bio/File.py): import StringIO --> try: from StringIO import StringIO except ImportError: from io import StringIO and StringHandle = StringIO.StringIO --> StringHandle = StringIO (I didn't see a proper process documented anywhere to submit patches with the whole 2to3 conversion going on at the same time). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 12 14:46:47 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 12 Jan 2011 09:46:47 -0500 Subject: [Biopython-dev] [Bug 3169] New: to_one_letter_code in Bio.SCOP.Raf is old Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3169 Summary: to_one_letter_code in Bio.SCOP.Raf is old Product: Biopython Version: 1.56 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: macrozhu+biopy at gmail.com Hi, The dictionary to_one_letter_code in Bio.SCOP.Raf is a bit old now. The current dictionary is based on a table taken from the RAF release notes of ASTRAL. This is an old table and some new three-letter codes in the PDB are not found in it (e.g. M3L in 2X4W). ASTRAL does not use the table since v1.73. Rather, PDB Chemical Component Dictionary is used. See http://astral.berkeley.edu/seq.cgi?get=raf-edit-comments;ver=1.75 "Beginning with ASTRAL 1.73, the PDB's chemical dictionary is used to translate chemically modified residues, instead of the translation table from ASTRAL 1.55." The PDB Chemical Component Dictionary can be obtained from: http://deposit.pdb.org/cc_dict_tut.html . I have parsed the dictionary and there are 12054 three-letter codes (as of Jan 2011). Among them, most correspond to a one-letter code '?'. Still, there are 1245 three-letter codes corresponding to a one-letter code other than '?' (the list is attached in the end). Therefore, I suggest to update the to_one_letter_code dictionary in Bio.SCOP.Raf. 
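For reference, a rough sketch of how such a table can be regenerated from the PDB Chemical Component Dictionary - this is only a line-based illustration, not the actual script used here, and the file name, field layout and three-letter padding are assumptions:

def build_to_one_letter_code(path="components.cif"):
    table = {}
    three = one = None
    for line in open(path):
        if line.startswith("data_"):
            three = one = None  # start of a new component block
        elif line.startswith("_chem_comp.three_letter_code"):
            three = line.split()[-1].strip('"')
        elif line.startswith("_chem_comp.one_letter_code"):
            one = line.split()[-1].strip('"')
        if three is not None and one is not None:
            if one != "?":  # skip components with no usable one-letter code
                table[three.ljust(3)] = one
            three = one = None
    return table

table = build_to_one_letter_code()
print("%i three-letter codes with a usable one-letter code" % len(table))

A proper mmCIF parser would be more robust, but this shows the idea.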
Best regards, hongbo zhu to_one_letter_code = { '00C':'C','01W':'X','0A0':'D','0A1':'Y','0A2':'K', '0A8':'C','0AA':'V','0AB':'V','0AC':'G','0AD':'G', '0AF':'W','0AG':'L','0AH':'S','0AK':'D','0AM':'A', '0AP':'C','0AU':'U','0AV':'A','0AZ':'P','0BN':'F', '0C ':'C','0CS':'A','0DC':'C','0DG':'G','0DT':'T', '0G ':'G','0NC':'A','0SP':'A','0U ':'U','0YG':'YG', '10C':'C','125':'U','126':'U','127':'U','128':'N', '12A':'A','143':'C','175':'ASG','193':'X','1AP':'A', '1MA':'A','1MG':'G','1PA':'F','1PI':'A','1PR':'N', '1SC':'C','1TQ':'W','1TY':'Y','200':'F','23F':'F', '23S':'X','26B':'T','2AD':'X','2AG':'G','2AO':'X', '2AR':'A','2AS':'X','2AT':'T','2AU':'U','2BD':'I', '2BT':'T','2BU':'A','2CO':'C','2DA':'A','2DF':'N', '2DM':'N','2DO':'X','2DT':'T','2EG':'G','2FE':'N', '2FI':'N','2FM':'M','2GT':'T','2HF':'H','2LU':'L', '2MA':'A','2MG':'G','2ML':'L','2MR':'R','2MT':'P', '2MU':'U','2NT':'T','2OM':'U','2OT':'T','2PI':'X', '2PR':'G','2SA':'N','2SI':'X','2ST':'T','2TL':'T', '2TY':'Y','2VA':'V','32S':'X','32T':'X','3AH':'H', '3AR':'X','3CF':'F','3DA':'A','3DR':'N','3GA':'A', '3MD':'D','3ME':'U','3NF':'Y','3TY':'X','3XH':'G', '4AC':'N','4BF':'Y','4CF':'F','4CY':'M','4DP':'W', '4F3':'GYG','4FB':'P','4FW':'W','4HT':'W','4IN':'X', '4MF':'N','4MM':'X','4OC':'C','4PC':'C','4PD':'C', '4PE':'C','4PH':'F','4SC':'C','4SU':'U','4TA':'N', '5AA':'A','5AT':'T','5BU':'U','5CG':'G','5CM':'C', '5CS':'C','5FA':'A','5FC':'C','5FU':'U','5HP':'E', '5HT':'T','5HU':'U','5IC':'C','5IT':'T','5IU':'U', '5MC':'C','5MD':'N','5MU':'U','5NC':'C','5PC':'C', '5PY':'T','5SE':'U','5ZA':'TWG','64T':'T','6CL':'K', '6CT':'T','6CW':'W','6HA':'A','6HC':'C','6HG':'G', '6HN':'K','6HT':'T','6IA':'A','6MA':'A','6MC':'A', '6MI':'N','6MT':'A','6MZ':'N','6OG':'G','70U':'U', '7DA':'A','7GU':'G','7JA':'I','7MG':'G','8AN':'A', '8FG':'G','8MG':'G','8OG':'G','9NE':'E','9NF':'F', '9NR':'R','9NV':'V','A ':'A','A1P':'N','A23':'A', 'A2L':'A','A2M':'A','A34':'A','A35':'A','A38':'A', 'A39':'A','A3A':'A','A3P':'A','A40':'A','A43':'A', 'A44':'A','A47':'A','A5L':'A','A5M':'C','A5O':'A', 'A66':'X','AA3':'A','AA4':'A','AAR':'R','AB7':'X', 'ABA':'A','ABR':'A','ABS':'A','ABT':'N','ACB':'D', 'ACL':'R','AD2':'A','ADD':'X','ADX':'N','AEA':'X', 'AEI':'D','AET':'A','AFA':'N','AFF':'N','AFG':'G', 'AGM':'R','AGT':'X','AHB':'N','AHH':'X','AHO':'A', 'AHP':'A','AHS':'X','AHT':'X','AIB':'A','AKL':'D', 'ALA':'A','ALC':'A','ALG':'R','ALM':'A','ALN':'A', 'ALO':'T','ALQ':'X','ALS':'A','ALT':'A','ALY':'K', 'AP7':'A','APE':'X','APH':'A','API':'K','APK':'K', 'APM':'X','APP':'X','AR2':'R','AR4':'E','ARG':'R', 'ARM':'R','ARO':'R','ARV':'X','AS ':'A','AS2':'D', 'AS9':'X','ASA':'D','ASB':'D','ASI':'D','ASK':'D', 'ASL':'D','ASM':'X','ASN':'N','ASP':'D','ASQ':'D', 'ASU':'N','ASX':'B','ATD':'T','ATL':'T','ATM':'T', 'AVC':'A','AVN':'X','AYA':'A','AYG':'AYG','AZK':'K', 'AZS':'S','AZY':'Y','B1F':'F','B1P':'N','B2A':'A', 'B2F':'F','B2I':'I','B2V':'V','B3A':'A','B3D':'D', 'B3E':'E','B3K':'K','B3L':'X','B3M':'X','B3Q':'X', 'B3S':'S','B3T':'X','B3U':'H','B3X':'N','B3Y':'Y', 'BB6':'C','BB7':'C','BB9':'C','BBC':'C','BCS':'C', 'BCX':'C','BE2':'X','BFD':'D','BG1':'S','BGM':'G', 'BHD':'D','BIF':'F','BIL':'X','BIU':'I','BJH':'X', 'BLE':'L','BLY':'K','BMP':'N','BMT':'T','BNN':'A', 'BNO':'X','BOE':'T','BOR':'R','BPE':'C','BRU':'U', 'BSE':'S','BT5':'N','BTA':'L','BTC':'C','BTR':'W', 'BUC':'C','BUG':'V','BVP':'U','BZG':'N','C ':'C', 'C12':'TYG','C1X':'K','C25':'C','C2L':'C','C2S':'C', 'C31':'C','C32':'C','C34':'C','C36':'C','C37':'C', 'C38':'C','C3Y':'C','C42':'C','C43':'C','C45':'C', 
'C46':'C','C49':'C','C4R':'C','C4S':'C','C5C':'C', 'C66':'X','C6C':'C','C99':'TFG','CAF':'C','CAL':'X', 'CAR':'C','CAS':'C','CAV':'X','CAY':'C','CB2':'C', 'CBR':'C','CBV':'C','CCC':'C','CCL':'K','CCS':'C', 'CCY':'CYG','CDE':'X','CDV':'X','CDW':'C','CEA':'C', 'CFL':'C','CFY':'FCYG','CG1':'G','CGA':'E','CGU':'E', 'CH ':'C','CH6':'MYG','CH7':'KYG','CHF':'X','CHG':'X', 'CHP':'G','CHS':'X','CIR':'R','CJO':'GYG','CLE':'L', 'CLG':'K','CLH':'K','CLV':'AFG','CM0':'N','CME':'C', 'CMH':'C','CML':'C','CMR':'C','CMT':'C','CNU':'U', 'CP1':'C','CPC':'X','CPI':'X','CQR':'GYG','CR0':'TLG', 'CR2':'GYG','CR5':'G','CR7':'KYG','CR8':'HYG','CRF':'TWG', 'CRG':'THG','CRK':'MYG','CRO':'GYG','CRQ':'QYG','CRU':'E', 'CRW':'ASG','CRX':'ASG','CS0':'C','CS1':'C','CS3':'C', 'CS4':'C','CS8':'N','CSA':'C','CSB':'C','CSD':'C', 'CSE':'C','CSF':'C','CSH':'SHG','CSI':'G','CSJ':'C', 'CSL':'C','CSO':'C','CSP':'C','CSR':'C','CSS':'C', 'CSU':'C','CSW':'C','CSX':'C','CSY':'SYG','CSZ':'C', 'CTE':'W','CTG':'T','CTH':'T','CUC':'X','CWR':'S', 'CXM':'M','CY0':'C','CY1':'C','CY3':'C','CY4':'C', 'CYA':'C','CYD':'C','CYF':'C','CYG':'C','CYJ':'X', 'CYM':'C','CYQ':'C','CYR':'C','CYS':'C','CZ2':'C', 'CZO':'GYG','CZZ':'C','D11':'T','D1P':'N','D3 ':'N', 'D33':'N','D3P':'G','D3T':'T','D4M':'T','D4P':'X', 'DA ':'A','DA2':'X','DAB':'A','DAH':'F','DAL':'A', 'DAR':'R','DAS':'D','DBB':'T','DBM':'N','DBS':'S', 'DBU':'T','DBY':'Y','DBZ':'A','DC ':'C','DC2':'C', 'DCG':'G','DCI':'X','DCL':'X','DCT':'C','DCY':'C', 'DDE':'H','DDG':'G','DDN':'U','DDX':'N','DFC':'C', 'DFG':'G','DFI':'X','DFO':'X','DFT':'N','DG ':'G', 'DGH':'G','DGI':'G','DGL':'E','DGN':'Q','DHA':'A', 'DHI':'H','DHL':'X','DHN':'V','DHP':'X','DHU':'U', 'DHV':'V','DI ':'I','DIL':'I','DIR':'R','DIV':'V', 'DLE':'L','DLS':'K','DLY':'K','DM0':'K','DMH':'N', 'DMK':'D','DMT':'X','DN ':'N','DNE':'L','DNG':'L', 'DNL':'K','DNM':'L','DNP':'A','DNR':'C','DNS':'K', 'DOA':'X','DOC':'C','DOH':'D','DON':'L','DPB':'T', 'DPH':'F','DPL':'P','DPP':'A','DPQ':'Y','DPR':'P', 'DPY':'N','DRM':'U','DRP':'N','DRT':'T','DRZ':'N', 'DSE':'S','DSG':'N','DSN':'S','DSP':'D','DT ':'T', 'DTH':'T','DTR':'W','DTY':'Y','DU ':'U','DVA':'V', 'DXD':'N','DXN':'N','DYG':'DYG','DYS':'C','DZM':'A', 'E ':'A','E1X':'A','EDA':'A','EDC':'G','EFC':'C', 'EHP':'F','EIT':'T','ENP':'N','ESB':'Y','ESC':'M', 'EXY':'L','EY5':'N','EYS':'X','F2F':'F','FA2':'A', 'FA5':'N','FAG':'N','FAI':'N','FCL':'F','FFD':'N', 'FGL':'G','FGP':'S','FHL':'X','FHO':'K','FHU':'U', 'FLA':'A','FLE':'L','FLT':'Y','FME':'M','FMG':'G', 'FMU':'N','FOE':'C','FOX':'G','FP9':'P','FPA':'F', 'FRD':'X','FT6':'W','FTR':'W','FTY':'Y','FZN':'K', 'G ':'G','G25':'G','G2L':'G','G2S':'G','G31':'G', 'G32':'G','G33':'G','G36':'G','G38':'G','G42':'G', 'G46':'G','G47':'G','G48':'G','G49':'G','G4P':'N', 'G7M':'G','GAO':'G','GAU':'E','GCK':'C','GCM':'X', 'GDP':'G','GDR':'G','GFL':'G','GGL':'E','GH3':'G', 'GHG':'Q','GHP':'G','GL3':'G','GLH':'Q','GLM':'X', 'GLN':'Q','GLQ':'E','GLU':'E','GLX':'Z','GLY':'G', 'GLZ':'G','GMA':'E','GMS':'G','GMU':'U','GN7':'G', 'GND':'X','GNE':'N','GOM':'G','GPL':'K','GS ':'G', 'GSC':'G','GSR':'G','GSS':'G','GSU':'E','GT9':'C', 'GTP':'G','GVL':'X','GYC':'CYG','GYS':'SYG','H2U':'U', 'H5M':'P','HAC':'A','HAR':'R','HBN':'H','HCS':'X', 'HDP':'U','HEU':'U','HFA':'X','HGL':'X','HHI':'H', 'HHK':'AK','HIA':'H','HIC':'H','HIP':'H','HIQ':'H', 'HIS':'H','HL2':'L','HLU':'L','HMF':'A','HMR':'R', 'HOL':'N','HPC':'F','HPE':'F','HPQ':'F','HQA':'A', 'HRG':'R','HRP':'W','HS8':'H','HS9':'H','HSE':'S', 'HSL':'S','HSO':'H','HTI':'C','HTN':'N','HTR':'W', 
'HV5':'A','HVA':'V','HY3':'P','HYP':'P','HZP':'P', 'I ':'I','I2M':'I','I58':'K','I5C':'C','IAM':'A', 'IAR':'R','IAS':'D','IC ':'C','IEL':'K','IEY':'HYG', 'IG ':'G','IGL':'G','IGU':'G','IIC':'SHG','IIL':'I', 'ILE':'I','ILG':'E','ILX':'I','IMC':'C','IML':'I', 'IOY':'F','IPG':'G','IPN':'N','IRN':'N','IT1':'K', 'IU ':'U','IYR':'Y','IYT':'T','JJJ':'C','JJK':'C', 'JJL':'C','JW5':'N','K1R':'C','KAG':'G','KCX':'K', 'KGC':'K','KOR':'M','KPI':'K','KST':'K','KYQ':'K', 'L2A':'X','LA2':'K','LAA':'D','LAL':'A','LBY':'K', 'LC ':'C','LCA':'A','LCC':'N','LCG':'G','LCH':'N', 'LCK':'K','LCX':'K','LDH':'K','LED':'L','LEF':'L', 'LEH':'L','LEI':'V','LEM':'L','LEN':'L','LET':'X', 'LEU':'L','LG ':'G','LGP':'G','LHC':'X','LHU':'U', 'LKC':'N','LLP':'K','LLY':'K','LME':'E','LMQ':'Q', 'LMS':'N','LP6':'K','LPD':'P','LPG':'G','LPL':'X', 'LPS':'S','LSO':'X','LTA':'X','LTR':'W','LVG':'G', 'LVN':'V','LYM':'K','LYN':'K','LYR':'K','LYS':'K', 'LYX':'K','LYZ':'K','M0H':'C','M1G':'G','M2G':'G', 'M2L':'K','M2S':'M','M3L':'K','M5M':'C','MA ':'A', 'MA6':'A','MA7':'A','MAA':'A','MAD':'A','MAI':'R', 'MBQ':'Y','MBZ':'N','MC1':'S','MCG':'X','MCL':'K', 'MCS':'C','MCY':'C','MDH':'X','MDO':'ASG','MDR':'N', 'MEA':'F','MED':'M','MEG':'E','MEN':'N','MEP':'U', 'MEQ':'Q','MET':'M','MEU':'G','MF3':'X','MFC':'GYG', 'MG1':'G','MGG':'R','MGN':'Q','MGQ':'A','MGV':'G', 'MGY':'G','MHL':'L','MHO':'M','MHS':'H','MIA':'A', 'MIS':'S','MK8':'L','ML3':'K','MLE':'L','MLL':'L', 'MLY':'K','MLZ':'K','MME':'M','MMT':'T','MND':'N', 'MNL':'L','MNU':'U','MNV':'V','MOD':'X','MP8':'P', 'MPH':'X','MPJ':'X','MPQ':'G','MRG':'G','MSA':'G', 'MSE':'M','MSL':'M','MSO':'M','MSP':'X','MT2':'M', 'MTR':'T','MTU':'A','MTY':'Y','MVA':'V','N ':'N', 'N10':'S','N2C':'X','N5I':'N','N5M':'C','N6G':'G', 'N7P':'P','NA8':'A','NAL':'A','NAM':'A','NB8':'N', 'NBQ':'Y','NC1':'S','NCB':'A','NCX':'N','NCY':'X', 'NDF':'F','NDN':'U','NEM':'H','NEP':'H','NF2':'N', 'NFA':'F','NHL':'E','NIT':'X','NIY':'Y','NLE':'L', 'NLN':'L','NLO':'L','NLP':'L','NLQ':'Q','NMC':'G', 'NMM':'R','NMS':'T','NMT':'T','NNH':'R','NP3':'N', 'NPH':'C','NRP':'LYG','NRQ':'MYG','NSK':'X','NTY':'Y', 'NVA':'V','NYC':'TWG','NYG':'NYG','NYM':'N','NYS':'C', 'NZH':'H','O12':'X','O2C':'N','O2G':'G','OAD':'N', 'OAS':'S','OBF':'X','OBS':'X','OCS':'C','OCY':'C', 'ODP':'N','OHI':'H','OHS':'D','OIC':'X','OIP':'I', 'OLE':'X','OLT':'T','OLZ':'S','OMC':'C','OMG':'G', 'OMT':'M','OMU':'U','ONE':'U','ONL':'X','OPR':'R', 'ORN':'A','ORQ':'R','OSE':'S','OTB':'X','OTH':'T', 'OTY':'Y','OXX':'D','P ':'G','P1L':'C','P1P':'N', 'P2T':'T','P2U':'U','P2Y':'P','P5P':'A','PAQ':'Y', 'PAS':'D','PAT':'W','PAU':'A','PBB':'C','PBF':'F', 'PBT':'N','PCA':'E','PCC':'P','PCE':'X','PCS':'F', 'PDL':'X','PDU':'U','PEC':'C','PF5':'F','PFF':'F', 'PFX':'X','PG1':'S','PG7':'G','PG9':'G','PGL':'X', 'PGN':'G','PGP':'G','PGY':'G','PHA':'F','PHD':'D', 'PHE':'F','PHI':'F','PHL':'F','PHM':'F','PIV':'X', 'PLE':'L','PM3':'F','PMT':'C','POM':'P','PPN':'F', 'PPU':'A','PPW':'G','PQ1':'N','PR3':'C','PR5':'A', 'PR9':'P','PRN':'A','PRO':'P','PRS':'P','PSA':'F', 'PSH':'H','PST':'T','PSU':'U','PSW':'C','PTA':'X', 'PTH':'Y','PTM':'Y','PTR':'Y','PU ':'A','PUY':'N', 'PVH':'H','PVL':'X','PYA':'A','PYO':'U','PYX':'C', 'PYY':'N','QLG':'QLG','QUO':'G','R ':'A','R1A':'C', 'R1B':'C','R1F':'C','R7A':'C','RC7':'HYG','RCY':'C', 'RIA':'A','RMP':'A','RON':'X','RT ':'T','RTP':'N', 'S1H':'S','S2C':'C','S2D':'A','S2M':'T','S2P':'A', 'S4A':'A','S4C':'C','S4G':'G','S4U':'U','S6G':'G', 'SAC':'S','SAH':'C','SAR':'G','SBL':'S','SC ':'C', 'SCH':'C','SCS':'C','SCY':'C','SD2':'X','SDG':'G', 
'SDP':'S','SEB':'S','SEC':'A','SEG':'A','SEL':'S', 'SEM':'X','SEN':'S','SEP':'S','SER':'S','SET':'S', 'SGB':'S','SHC':'C','SHP':'G','SHR':'K','SIB':'C', 'SIC':'DC','SLA':'P','SLR':'P','SLZ':'K','SMC':'C', 'SME':'M','SMF':'F','SMP':'A','SMT':'T','SNC':'C', 'SNN':'N','SOC':'C','SOS':'N','SOY':'S','SPT':'T', 'SRA':'A','SSU':'U','STY':'Y','SUB':'X','SUI':'DG', 'SUN':'S','SUR':'U','SVA':'S','SVX':'S','SVZ':'X', 'SYS':'C','T ':'T','T11':'F','T23':'T','T2S':'T', 'T2T':'N','T31':'U','T32':'T','T36':'T','T37':'T', 'T38':'T','T39':'T','T3P':'T','T41':'T','T48':'T', 'T49':'T','T4S':'T','T5O':'U','T5S':'T','T66':'X', 'T6A':'A','TA3':'T','TA4':'X','TAF':'T','TAL':'N', 'TAV':'D','TBG':'V','TBM':'T','TC1':'C','TCP':'T', 'TCQ':'X','TCR':'W','TCY':'A','TDD':'L','TDY':'T', 'TFE':'T','TFO':'A','TFQ':'F','TFT':'T','TGP':'G', 'TH6':'T','THC':'T','THO':'X','THR':'T','THX':'N', 'THZ':'R','TIH':'A','TLB':'N','TLC':'T','TLN':'U', 'TMB':'T','TMD':'T','TNB':'C','TNR':'S','TOX':'W', 'TP1':'T','TPC':'C','TPG':'G','TPH':'X','TPL':'W', 'TPO':'T','TPQ':'Y','TQQ':'W','TRF':'W','TRG':'K', 'TRN':'W','TRO':'W','TRP':'W','TRQ':'W','TRW':'W', 'TRX':'W','TS ':'N','TST':'X','TT ':'N','TTD':'T', 'TTI':'U','TTM':'T','TTQ':'W','TTS':'Y','TY2':'Y', 'TY3':'Y','TYB':'Y','TYI':'Y','TYN':'Y','TYO':'Y', 'TYQ':'Y','TYR':'Y','TYS':'Y','TYT':'Y','TYU':'N', 'TYX':'X','TYY':'Y','TZB':'X','TZO':'X','U ':'U', 'U25':'U','U2L':'U','U2N':'U','U2P':'U','U31':'U', 'U33':'U','U34':'U','U36':'U','U37':'U','U8U':'U', 'UAR':'U','UCL':'U','UD5':'U','UDP':'N','UFP':'N', 'UFR':'U','UFT':'U','UMA':'A','UMP':'U','UMS':'U', 'UN1':'X','UN2':'X','UNK':'X','UR3':'U','URD':'U', 'US1':'U','US2':'U','US3':'T','US5':'U','USM':'U', 'V1A':'C','VAD':'V','VAF':'V','VAL':'V','VB1':'K', 'VDL':'X','VLL':'X','VLM':'X','VMS':'X','VOL':'X', 'X ':'G','X2W':'E','X4A':'N','X9Q':'AFG','XAD':'A', 'XAE':'N','XAL':'A','XAR':'N','XCL':'C','XCP':'X', 'XCR':'C','XCS':'N','XCT':'C','XCY':'C','XGA':'N', 'XGL':'G','XGR':'G','XGU':'G','XTH':'T','XTL':'T', 'XTR':'T','XTS':'G','XTY':'N','XUA':'A','XUG':'G', 'XX1':'K','XXY':'THG','XYG':'DYG','Y ':'A','YCM':'C', 'YG ':'G','YOF':'Y','YRR':'N','YYG':'G','Z ':'C', 'ZAD':'A','ZAL':'A','ZBC':'C','ZCY':'C','ZDU':'U', 'ZFB':'X','ZGU':'G','ZHP':'N','ZTH':'T','ZZJ':'A' } -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jan 13 10:51:48 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 13 Jan 2011 05:51:48 -0500 Subject: [Biopython-dev] [Bug 3168] different StringIO import for Python 3 In-Reply-To: Message-ID: <201101131051.p0DApmJx003755@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3168 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-13 05:51 EST ------- Hi Michael, We have started looking at calling 2to3 via the setup.py script, and will update the REAME file if things change. However, for now you must run the 2to3 script twice (as described in the README file) before installing Biopython. The 2to3 script automatically switches this: import StringIO ... StringHandle = StringIO.StringIO to a Python 3 equivalent: import io ... StringHandle = io.StringIO i.e. I can't reproduce any problem. Could you clarify what you are doing? 
Thanks, Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jan 13 11:37:16 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 13 Jan 2011 06:37:16 -0500 Subject: [Biopython-dev] [Bug 3169] to_one_letter_code in Bio.SCOP.Raf is old In-Reply-To: Message-ID: <201101131137.p0DBbGeS013288@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3169 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-13 06:37 EST ------- Hi Hongbo, Could you share the script you used to parse the mmCIF file to build the to_one_letter_code dictionary from the chem_comp (Table 1)? That would be very helpful since this data will need to be updated again in future. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jan 13 12:54:46 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 13 Jan 2011 07:54:46 -0500 Subject: [Biopython-dev] [Bug 3169] to_one_letter_code in Bio.SCOP.Raf is old In-Reply-To: Message-ID: <201101131254.p0DCskc7029238@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3169 ------- Comment #2 from macrozhu+biopy at gmail.com 2011-01-13 07:54 EST ------- Created an attachment (id=1560) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1560&action=view) python script for parsing PDB Chem Component Hi, Peter, the script is attached. It was a quick hack: I just parsed all the fields of "_chem_comp.one_letter_code" and "_chem_comp.three_letter_code" in the component.cif and wrote output to a .txt file. Hope that helps. cheers, hongbo -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jan 14 10:06:00 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 Jan 2011 05:06:00 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101141006.p0EA60Tc013924@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 macrozhu+biopy at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1556 is|0 |1 obsolete| | Attachment #1557 is|0 |1 obsolete| | ------- Comment #5 from macrozhu+biopy at gmail.com 2011-01-14 05:05 EST ------- Created an attachment (id=1561) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1561&action=view) fix DSSP crash when reading PDB with DisorderedResidue The current version of DSSP.py does not handle DisorderedResidue well. Case 1, Point mutations, e.g. 1h9h chain E resi 22 BioPython uses the last residue as default (resi SER in this case). But DSSP takes the first one (alternative location is blank,A or 1, CYS in this case). >python DSSP.py 1h9h.pdb Case 2, one of the disordered residues is HET. e.g. 
3piu chain A Residue 273 >python DSSP.py 3piu.pdb In the first case, the DisorderedResidue.is_disordered() returns 2, and in the 2nd case, the DisorderedResidue.is_disordered() returns 1. These values are used to cope with DisorderedResidue in DSSP.py. Minors: tempfile.mktemp() should be replaced by tempfile.mkstemp() (see http://docs.python.org/library/tempfile.html#tempfile.mktemp "Deprecated since version 2.3: Use mkstemp() instead.") -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jan 14 10:32:31 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 Jan 2011 05:32:31 -0500 Subject: [Biopython-dev] [Bug 3168] different StringIO import for Python 3 In-Reply-To: Message-ID: <201101141032.p0EAWVqo018072@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3168 ------- Comment #2 from michael.kuhn at gmail.com 2011-01-14 05:32 EST ------- Ok, for me, the 2to3 script does not make the change (using Python 3.1.1). I only updated biopython. I'll ask our sysadmin to upgrade Python 3 and get back to you. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From b.invergo at gmail.com Fri Jan 14 10:56:38 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Fri, 14 Jan 2011 11:56:38 +0100 Subject: [Biopython-dev] pypaml Message-ID: Hi everyone, New subscriber here, and hopefully a new contributer as well! I have written a Python interface to the CODEML program of the PAML package (http://abacus.gene.ucl.ac.uk/software/paml.html), with the intention of eventually covering all of the programs in the package. You can find my package here: http://code.google.com/p/pypaml/ I recently ran across a discussion that occurred on the main Biopython list regarding my interface (http://lists.open-bio.org/pipermail/biopython/2010-September/006743.html) and I realized that perhaps it would be better if I integrated it into Biopython. I know that it's something many people would be interested in. I am very enthusiastic to continue this project and to do whatever I need to do to facilitate the integration. Some immediate tasks that need to be done are: - change the licensing: currently it's GPL, as described in the code and on the project page. Is it sufficient to simply remove its dedicated project page and change the verbiage in the code? - check coding standards as described in the Contributing to Biopython wiki - make some changes to be compatible with Python 2.5: I use @property and @x.setter decorator tags which are only 2.6+. I think that's the only incompatability - double-check the CODEML output parsing for many PAML versions; the output is notoriously non-standard from release to release. I may have to build some version-checking into the parser. I wrote it based on the output of PAML 4.3 - build some unit tests (I'm new to this in Python so I need to learn a bit about that - perhaps making it fit with any other structural standards in the Biopython library? I've tried from the start to make it very generalized so I don't think any major changes need to be made. Plus, I think structurally it should be easy to implement the other PAML programs by copying a lot of the code. 
The output parsing for each program is a different story, though. So, as I understand it, I should file an enhancement bug over at the Bugzilla site. In the meantime I can start working on some of the points listed above. I also need to refresh my memory of using git since I've gotten in the dirty habit of using svn (assuming this is all approved)! Is there anything else I need to do for now? Cheers, Brandon Invergo Pompeu Fabra University Barcelona, Spain From p.j.a.cock at googlemail.com Fri Jan 14 12:28:30 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Jan 2011 12:28:30 +0000 Subject: [Biopython-dev] pypaml In-Reply-To: References: Message-ID: On Fri, Jan 14, 2011 at 10:56 AM, Brandon Invergo wrote: > Hi everyone, > New subscriber here, and hopefully a new contributer as well! > Hi Brandon, Welcome to the list. By the way, apologies for mixing you and the PAML author Ziheng Yang up last year (I misread the pypaml webpage): http://lists.open-bio.org/pipermail/biopython/2010-September/006747.html > I have written a Python interface to the CODEML program of the PAML > package (http://abacus.gene.ucl.ac.uk/software/paml.html), with the > intention of eventually covering all of the programs in the package. > You can find my package here: > http://code.google.com/p/pypaml/ > > I recently ran across a discussion that occurred on the main Biopython > list regarding my interface > (http://lists.open-bio.org/pipermail/biopython/2010-September/006743.html) > and I realized that perhaps it would be better if I integrated it into > Biopython. I know that it's something many people would be interested > in. I am very enthusiastic to continue this project and to do whatever > I need to do to facilitate the integration. That is great news :) > Some immediate tasks that need to be done are: > - change the licensing: currently it's GPL, as described in the code > and on the project page. Is it sufficient to simply remove its > dedicated project page and change the verbiage in the code? Assuming you wrote all the code (or have your co-authors agreement), then yes, you can just change the licence. If you want to you can update the code in your repository and website, maybe make a new release while you are at it. Alternatively, you could just leave the standalone pypaml code as it is (under the GPL), but base your Biopython contributions on it (under the Biopython MIT/BSD licence). I would suggest that you don't make API changes to standalone pypaml, so as not to disrupt your existing users. However some of the work like Python 2.5 support might be worth doing there (before looking at Biopython integration). As a bonus, that should also mean you can use pypaml under Jython (Python on the JVM). > - check coding standards as described in the Contributing to Biopython wiki > - make some changes to be compatible with Python 2.5: I use @property > and @x.setter decorator tags which are only 2.6+. I think that's the > only incompatability If so that doesn't sound too hard to update. > - double-check the CODEML output parsing for many PAML versions; the > output is notoriously non-standard from release to release. I may have > to build some version-checking into the parser. I wrote it based on > the output of PAML 4.3 >From Chris Field's comments last year, that may be a lot of work for relatively little gain. I don't use PAML and have no idea what versions are typically used though. 
http://lists.open-bio.org/pipermail/biopython/2010-September/006760.html > - build some unit tests (I'm new to this in Python so I need to learn > a bit about that We've tried to cover the basics in a chapter in our tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf > - perhaps making it fit with any other structural standards in the > Biopython library? > > I've tried from the start to make it very generalized so I don't think > any major changes need to be made. Plus, I think structurally it > should be easy to implement the other PAML programs by copying a lot > of the code. The output parsing for each program is a different story, > though. Does that mean you have wrappers for calling the PAML command line tools? Can you point me at the code for that - I'd like a quick look to see if it makes sense to switch over to the Bio.Application based system we're trying to standardise on in Biopython. On the other hand, if you have a much higher level wrapper maybe it is fine as it is (e.g. the Bio.PopGen wrappers follow their own route, although they use Bio.Application for the low level API inside). > So, as I understand it, I should file an enhancement bug over at the > Bugzilla site. That would be useful to give us a reference number for tracking it. A lot of your email would make a good introduction to the issue to put in the comment. > In the meantime I can start working on some of the > points listed above. I also need to refresh my memory of using git > since I've gotten in the dirty habit of using svn (assuming this is > all approved)! Is there anything else I need to do for now? Doing your work on a github fork of the Biopython repository would be great (although you may want to start with adding unit tests or doing Python 2.5 changes within standalone pypaml). Peter. From b.invergo at gmail.com Fri Jan 14 13:36:48 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Fri, 14 Jan 2011 14:36:48 +0100 Subject: [Biopython-dev] pypaml In-Reply-To: References: Message-ID: Hi Peter, Thanks for the welcome! > Assuming you wrote all the code (or have your co-authors agreement), > then yes, you can just change the licence. If you want to you can > update the code in your repository and website, maybe make a new > release while you are at it. Alternatively, you could just leave the > standalone pypaml code as it is (under the GPL), but base your > Biopython contributions on it (under the Biopython MIT/BSD licence). I wrote all the code myself so changing it shouldn't be a problem. I tend to license tools with the GPL by habit but I'm not opposed to relicensing it. > I would suggest that you don't make API changes to standalone > pypaml, so as not to disrupt your existing users. However some of > the work like Python 2.5 support might be worth doing there (before > looking at Biopython integration). As a bonus, that should also mean > you can use pypaml under Jython (Python on the JVM). > >> - check coding standards as described in the Contributing to Biopython wiki >> - make some changes to be compatible with Python 2.5: I use @property >> and @x.setter decorator tags which are only 2.6+. I think that's the >> only incompatability > > If so that doesn't sound too hard to update. I think, as it stands, the CODEML api is complete so no real changes need to be made there. As for the decorators, that was actually added in the last commit I made, so rolling back is quite simple. 
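For anyone following along, the decorator-free alternative mentioned here is simply the classic property() call, which also runs on Python 2.5. A tiny illustrative sketch (the class and attribute names are invented for the example, not taken from pypaml):

class Codeml(object):
    # Illustration only: a read/write property without the 2.6+ decorator syntax.
    def __init__(self):
        self._alignment = None

    def _get_alignment(self):
        return self._alignment

    def _set_alignment(self, value):
        self._alignment = value

    # Equivalent to @property plus @alignment.setter, but Python 2.5 friendly.
    alignment = property(_get_alignment, _set_alignment)

cml = Codeml()
cml.alignment = "align.phy"   # goes through _set_alignment behind the scenes
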
>> - double-check the CODEML output parsing for many PAML versions; the >> output is notoriously non-standard from release to release. I may have >> to build some version-checking into the parser. I wrote it based on >> the output of PAML 4.3 > > From Chris Field's comments last year, that may be a lot of work for > relatively little gain. I don't use PAML and have no idea what versions > are typically used though. > http://lists.open-bio.org/pipermail/biopython/2010-September/006760.html I would suggest that we don't support very old versions. Perhaps from 4.x up (currently it's at 4.4c). Most of the parsing is done via regular expressions, so changes in the order of the outputs shouldn't matter. Changes in the wording will. This is something to work on. >> - build some unit tests (I'm new to this in Python so I need to learn >> a bit about that > > We've tried to cover the basics in a chapter in our tutorial, > http://biopython.org/DIST/docs/tutorial/Tutorial.html > http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Thanks I'll check them out > Does that mean you have wrappers for calling the PAML command > line tools? Can you point me at the code for that - I'd like a quick > look to see if it makes sense to switch over to the Bio.Application > based system we're trying to standardise on in Biopython. On the > other hand, if you have a much higher level wrapper maybe it is > fine as it is (e.g. the Bio.PopGen wrappers follow their own route, > although they use Bio.Application for the low level API inside). I use the subprocess library of python to call the command line tool. PAML programs work by calling the tool with a control file as its argument. The control file specifies all of the run arguments, including the data files, output files, and other variables. Basically, pypaml works by dynamically building a control file via properties for the data files and a dictionary for the other variables, running the command line tool with that control file as its parameter, and then grabbing the output file, parsing it and storing the results in a dictionary object. The run() function, line 217, does this: http://code.google.com/p/pypaml/source/browse/trunk/src/pypaml/codeml.py with the actual subprocess call happening at 239/241 (verbose/silent). So, much of the code is dedicated to building the control file and parsing the output. I'm not as familiar with the other PAML programs, but a look through the manual indicates that they operate in a similar manner. (sorry that the code isn't fully commented yet) Ok, well, time to get cracking then. I'll add the Bugzilla item and make some changes in the standalone. I'll then inform the dev-list when things are in better condition for integration! Cheers, Brandon >> So, as I understand it, I should file an enhancement bug over at the >> Bugzilla site. > > That would be useful to give us a reference number for tracking it. > A lot of your email would make a good introduction to the issue to > put in the comment. > >> In the meantime I can start working on some of the >> points listed above. I also need to refresh my memory of using git >> since I've gotten in the dirty habit of using svn (assuming this is >> all approved)! Is there anything else I need to do for now? > > Doing your work on a github fork of the Biopython repository > would be great (although you may want to start with adding unit > tests or doing Python 2.5 changes within standalone pypaml). > > Peter. 
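To make the control file mechanism described above concrete, a rough sketch follows. It is illustrative only, not the real pypaml run() code: the file names are placeholders, only a handful of genuine codeml options are shown with arbitrary example values, the codeml binary is assumed to be on the system PATH, and the output parsing is reduced to a single regular expression (real codeml reports vary between PAML versions).

import re
import subprocess

# Sketch only: build a minimal control file for an existing alignment/tree.
options = {"noisy": 0, "verbose": 0, "runmode": 0,
           "seqtype": 1, "model": 0, "NSsites": 0}
lines = ["seqfile = alignment.phy",
         "treefile = species.tree",
         "outfile = results.out"]
for key in sorted(options):
    lines.append("%s = %s" % (key, options[key]))
ctl = open("codeml.ctl", "w")
ctl.write("\n".join(lines) + "\n")
ctl.close()

# check_call raises CalledProcessError if codeml exits with a non-zero code.
subprocess.check_call(["codeml", "codeml.ctl"])

# Minimal parsing example: pull a log-likelihood out of the report with a
# regular expression, storing results in a dictionary as pypaml does.
results = {}
out = open("results.out")
for line in out:
    match = re.search(r"lnL\(.*\):\s*(-?\d+\.\d+)", line)
    if match:
        results["lnL"] = float(match.group(1))
out.close()
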
> From p.j.a.cock at googlemail.com Fri Jan 14 13:50:08 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Jan 2011 13:50:08 +0000 Subject: [Biopython-dev] pypaml In-Reply-To: References: Message-ID: On Fri, Jan 14, 2011 at 1:36 PM, Brandon Invergo wrote: > Hi Peter, > Thanks for the welcome! > >> Assuming you wrote all the code (or have your co-authors agreement), >> then yes, you can just change the licence. If you want to you can >> update the code in your repository and website, maybe make a new >> release while you are at it. Alternatively, you could just leave the >> standalone pypaml code as it is (under the GPL), but base your >> Biopython contributions on it (under the Biopython MIT/BSD licence). > > I wrote all the code myself so changing it shouldn't be a problem. I > tend to license tools with the GPL by habit but I'm not opposed to > relicensing it. For standalone projects I also like the GPL, but for libraries LGPL is better. However, in the scientific Python community people have generally followed the Python licence convention and gone with the more flexible MIT/BSD style licence. >> I would suggest that you don't make API changes to standalone >> pypaml, so as not to disrupt your existing users. However some of >> the work like Python 2.5 support might be worth doing there (before >> looking at Biopython integration). As a bonus, that should also mean >> you can use pypaml under Jython (Python on the JVM). >> >>> - check coding standards as described in the Contributing to Biopython wiki >>> - make some changes to be compatible with Python 2.5: I use @property >>> and @x.setter decorator tags which are only 2.6+. I think that's the >>> only incompatability >> >> If so that doesn't sound too hard to update. > > I think, as it stands, the CODEML api is complete so no real changes > need to be made there. As for the decorators, that was actually added > in the last commit I made, so rolling back is quite simple. You can of course define properties, setters, getters etc without using decorators (this is what we do in Biopython). >>> - double-check the CODEML output parsing for many PAML versions; the >>> output is notoriously non-standard from release to release. I may have >>> to build some version-checking into the parser. I wrote it based on >>> the output of PAML 4.3 >> >> From Chris Field's comments last year, that may be a lot of work for >> relatively little gain. I don't use PAML and have no idea what versions >> are typically used though. >> http://lists.open-bio.org/pipermail/biopython/2010-September/006760.html > > I would suggest that we don't support very old versions. Perhaps from > 4.x up (currently it's at 4.4c). Most of the parsing is done via > regular expressions, so changes in the order of the outputs shouldn't > matter. Changes in the wording will. This is something to work on. You may be able to get some comments from any PAML users on the main Biopython discussion list to guide you here. >>> - build some unit tests (I'm new to this in Python so I need to learn >>> a bit about that >> >> We've tried to cover the basics in a chapter in our tutorial, >> http://biopython.org/DIST/docs/tutorial/Tutorial.html >> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf > > Thanks I'll check them out > >> Does that mean you have wrappers for calling the PAML command >> line tools? 
Can you point me at the code for that - I'd like a quick >> look to see if it makes sense to switch over to the Bio.Application >> based system we're trying to standardise on in Biopython. On the >> other hand, if you have a much higher level wrapper maybe it is >> fine as it is (e.g. the Bio.PopGen wrappers follow their own route, >> although they use Bio.Application for the low level API inside). > > I use the subprocess library of python to call the command line tool. > PAML programs work by calling the tool with a control file as its > argument. The control file specifies all of the run arguments, > including the data files, output files, and other variables. > Basically, pypaml works by dynamically building a control file via > properties for the data files and a dictionary for the other > variables, running the command line tool with that control file as its > parameter, and then grabbing the output file, parsing it and storing > the results in a dictionary object. > > The run() function, line 217, does this: > http://code.google.com/p/pypaml/source/browse/trunk/src/pypaml/codeml.py > with the actual subprocess call happening at 239/241 (verbose/silent). > > So, much of the code is dedicated to building the control file and > parsing the output. I'm not as familiar with the other PAML programs, > but a look through the manual indicates that they operate in a similar > manner. (sorry that the code isn't fully commented yet) Having looked at that briefly, since this is a command line tool driven by a configuration input file, rather than command line switches and arguments, I see no reason to bother with using our Bio.Application framework. By the way, have you ever tried using this under Windows? > Ok, well, time to get cracking then. I'll add the Bugzilla item and > make some changes in the standalone. I'll then inform the dev-list > when things are in better condition for integration! That sounds like a plan. Peter From bugzilla-daemon at portal.open-bio.org Fri Jan 14 14:01:13 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 Jan 2011 09:01:13 -0500 Subject: [Biopython-dev] [Bug 3170] New: pypaml Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3170 Summary: pypaml Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: b.invergo at gmail.com PAML (Phylogenetic Analysis by Maximum Likelihood; http://abacus.gene.ucl.ac.uk/software/paml.html) is a package of programs written by Ziheng Yang. The programs are used widely, especially CODEML which is used to estimate evolutionary rate parameters for a given sequence alignment. There is currently a PAML library for BioPerl but, to my knowledge, no such wrapper exists for Python. I have independently written a Python interface to the CODEML program of the PAML package, with the intention of eventually covering all of the programs in the package. You can find my code here: http://code.google.com/p/pypaml/ I believe it would be beneficial to integrate my pypaml package into the main Biopython project and to continue its development as such. Before it can be integrated, some immediate tasks must be done: - change the licensing: currently it's GPL, as described in the code and on the project page. Is it sufficient to simply remove its dedicated project page and change the verbiage in the code? 
- check coding standards as described in the Contributing to Biopython wiki - make some changes to be compatible with Python 2.5: I use @property and @x.setter decorator tags which are only 2.6+. I think that's the only incompatability - double-check the CODEML output parsing for many PAML versions; the output is notoriously non-standard from release to release. I may have to build some version-checking into the parser. I wrote it based on the output of PAML 4.3. I propose that compatibility with only 4.X+ be implemented (current version = 4.4c - build some unit tests (I'm new to this in Python so I need to learn a bit about that I've tried from the start to make it very generalized so I don't think any major changes need to be made. Plus, I think structurally it should be easy to implement the other PAML programs by copying a lot of the code. The output parsing for each program is a different story, though. I will implement many of the above changes first in my stand-alone library before merging it with a branch of the Biopython git repository. Because CODEML appears to be the most commonly used program from the package, for the immediate future it will continue to receive most of the focus, but with time the other programs will be implemented. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From b.invergo at gmail.com Fri Jan 14 14:11:44 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Fri, 14 Jan 2011 15:11:44 +0100 Subject: [Biopython-dev] pypaml In-Reply-To: References: Message-ID: > You can of course define properties, setters, getters etc without using > decorators (this is what we do in Biopython). That's how I had it before. I decided to switch over to new-style classes and in reading up on that topic I came across the decorators and I became a bit excited to implement them. They're certainly not necessary, as you say. > By the way, have you ever tried using this under Windows? I haven't yet but by the looks of it it should work fine assuming the programs are in the system path and thus can be called by name from any location in the file system. I see one line where I accidentally made it *nix-specific (default working directory is "./") but other than that, all files/directories are located via os.path or by user-inputted strings (as they would be in the control file). I have both a Linux and a Windows 7 machine at home though so I can do some testing. Obviously the unit tests here will help catch system-specific errors such as entering file locations incorrectly (I can see a few exceptions that I'm currently not handling). Once I make a couple of the core changes, I'll send a message to the main Biopython list to get some people to try it out and to let me know how it works (esp. re: version numbers and parsing) as well as to indicate if I currently don't support something they want to do. 
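A small, purely illustrative check along those lines, using only the standard library (not part of pypaml; the program name is just an example):

import os
from distutils.spawn import find_executable

# Sketch only: confirm the PAML binary can be found on the PATH before
# trying to run it, and avoid hard-coding the *nix-style "./" default.
binary = find_executable("codeml")
if binary is None:
    raise OSError("codeml was not found on the system PATH")
working_dir = os.getcwd()   # portable default working directory
print("Using %s in %s" % (binary, working_dir))
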
Regards, Brandon From bugzilla-daemon at portal.open-bio.org Fri Jan 14 14:18:26 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 Jan 2011 09:18:26 -0500 Subject: [Biopython-dev] [Bug 3170] Integration of external package: pypaml In-Reply-To: Message-ID: <201101141418.p0EEIQVi030815@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3170 b.invergo at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|pypaml |Integration of external | |package: pypaml -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Jan 14 14:27:51 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 14 Jan 2011 09:27:51 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101141427.p0EERpkN032207@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 macrozhu+biopy at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1561 is|0 |1 obsolete| | ------- Comment #6 from macrozhu+biopy at gmail.com 2011-01-14 09:27 EST ------- Created an attachment (id=1562) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1562&action=view) temp file created by tempfile.mkstemp() needs os.close() I realize that temp files created using tempfile.mkstemp() needs to be closed using os.close(). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From chapmanb at 50mail.com Fri Jan 14 15:40:35 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 14 Jan 2011 10:40:35 -0500 Subject: [Biopython-dev] pypaml In-Reply-To: References: Message-ID: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Brandon; It's great you are looking to contribute your CODEML wrappers to Biopython. It looks like really useful functionality. Peter tackled most of the high level details so I'll chime in with a few more detailed suggestions. > I use the subprocess library of python to call the command line tool. > PAML programs work by calling the tool with a control file as its > argument. The control file specifies all of the run arguments, > including the data files, output files, and other variables. > Basically, pypaml works by dynamically building a control file via > properties for the data files and a dictionary for the other > variables, running the command line tool with that control file as its > parameter, and then grabbing the output file, parsing it and storing > the results in a dictionary object. > > The run() function, line 217, does this: > http://code.google.com/p/pypaml/source/browse/trunk/src/pypaml/codeml.py > with the actual subprocess call happening at 239/241 (verbose/silent). The functionality here looks great. My stylistic suggestion would be to separate the code for running the commandline from that used to parse the output file. 
Ideally these would be two separate classes that could live under the Bio.Phylo namespace: https://github.com/biopython/biopython/tree/master/Bio/Phylo For the commandline code, it would be nice to have a Bio.Phylo.Applications that is organized similar to Bio.Align.Applications: https://github.com/biopython/biopython/tree/master/Bio/Align/Applications This will give you some flexibility as you want to expand out to support other programs, and provide a framework for additional phylogenetic commandline utilities. Eric might have some suggestions about the best module name to use for the parsing code as he has been managing the Phylo namespace. Separating parsing from commandline generation can also let you move the _results dictionary from being a class member to a return value for a parse function. This is a bit more straightforward workflow instead of having the side-effect of assigning an internal class attribute. Thanks again for contributing, Brad From b.invergo at gmail.com Fri Jan 14 15:56:06 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Fri, 14 Jan 2011 16:56:06 +0100 Subject: [Biopython-dev] pypaml In-Reply-To: <20110114154035.GC30193@sobchak.mgh.harvard.edu> References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: Brad, Thanks for your comments! This is something I hinted at in my original post about structural rearrangements; what I meant is that I'm largely unfamiliar with the structure/organization of the Biopython source and how one works within it. I have admittedly not used Biopython much yet in my own research so I don't have much experience with it. Your tips on where pypaml would fit in the namespace are very helpful. Already from thinking about integrating with the project I had thought about separating the parsing engine, especially once I have to start doing version checking. The code may grow quickly for that and it does make more organizational sense to move it. Your comments also indirectly made me realize that eventually it would be nice to be able to run the software on Bio.Align objects and Bio.Phylo objects as inputs (they would have to be written to temporary text files so that PAML could read them). Well, that'll come in time, but it's a thought. It looks like I have a lot of Biopython-related studying to do for homework! Luckily, these things excite me... Cheers, Brandon On Fri, Jan 14, 2011 at 4:40 PM, Brad Chapman wrote: > Brandon; > It's great you are looking to contribute your CODEML wrappers to > Biopython. It looks like really useful functionality. Peter tackled > most of the high level details so I'll chime in with a few more > detailed suggestions. > >> I use the subprocess library of python to call the command line tool. >> PAML programs work by calling the tool with a control file as its >> argument. The control file specifies all of the run arguments, >> including the data files, output files, and other variables. >> Basically, pypaml works by dynamically building a control file via >> properties for the data files and a dictionary for the other >> variables, running the command line tool with that control file as its >> parameter, and then grabbing the output file, parsing it and storing >> the results in a dictionary object. >> >> The run() function, line 217, does this: >> http://code.google.com/p/pypaml/source/browse/trunk/src/pypaml/codeml.py >> with the actual subprocess call happening at 239/241 (verbose/silent). > > The functionality here looks great. 
My stylistic suggestion would be > to separate the code for running the commandline from that used to > parse the output file. Ideally these would be two separate classes > that could live under the Bio.Phylo namespace: > > https://github.com/biopython/biopython/tree/master/Bio/Phylo > > For the commandline code, it would be nice to have a > Bio.Phylo.Applications that is organized similar to > Bio.Align.Applications: > > https://github.com/biopython/biopython/tree/master/Bio/Align/Applications > > This will give you some flexibility as you want to expand out to > support other programs, and provide a framework for additional > phylogenetic commandline utilities. > > Eric might have some suggestions about the best module name to use > for the parsing code as he has been managing the Phylo namespace. > > Separating parsing from commandline generation can also let you move > the _results dictionary from being a class member to a return value for > a parse function. This is a bit more straightforward workflow > instead of having the side-effect of assigning an internal class > attribute. > > Thanks again for contributing, > Brad > > From biopython at maubp.freeserve.co.uk Fri Jan 14 19:02:50 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 Jan 2011 19:02:50 +0000 Subject: [Biopython-dev] test_PhyloXML.py on Python 3 In-Reply-To: References: Message-ID: On Fri, Aug 13, 2010 at 2:24 AM, Eric Talevich wrote: > On Thu, Aug 12, 2010 at 12:37 PM, Peter wrote: > >> Hi Eric (et al), >> >> Is test_PhyloXML.py working for you under Python 3? >> >> I'm getting the following (both with and without the 2to3 --nofix=long >> option): >> >> $ python3 test_PhyloXML.py >> ... >> ?File "/home/xxx/lib/python3.1/site-packages/Bio/Phylo/PhyloXMLIO.py", >> line 298, in __init__ >> ? ?event, root = next(context) >> ?File "", line 59, in __iter__ >> TypeError: invalid event tuple >> >> ---------------------------------------------------------------------- >> Ran 47 tests in 0.015s >> >> All the sub-tests in test_PhyloXML.py are failing the same way. >> >> >From memory this was working recently. >> >> > Yeah, it was... it's fixed now/again. > > This is the issue with passing byte/unicode strings to cElementTree in > Python 3. I had a check for Python versions 3.0.0 through 3.1.1, where we > need to import ElementTree instead of cElementTree. Apparently Python 3.1.2 > still has the bug. > > -Eric It looks like Python 3.1.3 also has the same bug in cElementTree :( See this buildbot-slave log for example, http://events.open-bio.org:8010/builders/Linux%2064%20-%20Python%203.1/builds/96/steps/shell_3/logs/stdio I've extended the workaround again: https://github.com/biopython/biopython/commit/105444a340a2ad0e48c8582864104333b90adfc0 Let's see if we get any progress on the Python bug itself, http://bugs.python.org/issue9257 Peter From eric.talevich at gmail.com Sat Jan 15 05:35:48 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 15 Jan 2011 00:35:48 -0500 Subject: [Biopython-dev] pypaml In-Reply-To: <20110114154035.GC30193@sobchak.mgh.harvard.edu> References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: Hi Brandon, Thanks for volunteering! I think this will be a nice addition to Biopython and particularly Bio.Phylo. Some thoughts on organization: On Fri, Jan 14, 2011 at 10:40 AM, Brad Chapman wrote: > > The functionality here looks great. My stylistic suggestion would be > to separate the code for running the commandline from that used to > parse the output file. 
Ideally these would be two separate classes > that could live under the Bio.Phylo namespace: > > https://github.com/biopython/biopython/tree/master/Bio/Phylo > I agree. For the commandline code, it would be nice to have a > Bio.Phylo.Applications that is organized similar to > Bio.Align.Applications: > > https://github.com/biopython/biopython/tree/master/Bio/Align/Applications > > This will give you some flexibility as you want to expand out to > support other programs, and provide a framework for additional > phylogenetic commandline utilities. > Since it sounds like you might eventually write wrappers for other programs in the PAML suite, a layout like this might work: Bio/Phylo/Applications/_codeml.py -- just the wrapper for running the command-line program, perhaps based on the Bio.Application classes. The API for calling the wrapper goes through __init__.py; the user doesn't import this module directly. (See Bio.Align.Applications) Bio/Phylo/PAML/codeml.py -- all the code for parsing the output of the command-line program, and working with that dictionary/class. Any other modules this depends on would also go here, as would the other code for working with the input/output of other PAML programs. Separating parsing from commandline generation can also let you move > the _results dictionary from being a class member to a return value for > a parse function. This is a bit more straightforward workflow > instead of having the side-effect of assigning an internal class > attribute. > Yes. Also, the user might have saved the output from a codeml run previously (maybe from a shell script/pipeline), and want to parse it without re-running codeml through a Python wrapper. Right? (Sorry if I misunderstood your code.) I look forward to seeing your branch on GitHub. Please let us know if you have any problems along the way. All the best, Eric From p.j.a.cock at googlemail.com Sat Jan 15 11:56:23 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 15 Jan 2011 11:56:23 +0000 Subject: [Biopython-dev] pypaml In-Reply-To: References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: On Sat, Jan 15, 2011 at 5:35 AM, Eric Talevich wrote: > Hi Brandon, > > Thanks for volunteering! I think this will be a nice addition to Biopython > and particularly Bio.Phylo. > > Some thoughts on organization: > > On Fri, Jan 14, 2011 at 10:40 AM, Brad Chapman wrote: > >> >> The functionality here looks great. My stylistic suggestion would be >> to separate the code for running the commandline from that used to >> parse the output file. Ideally these would be two separate classes >> that could live under the Bio.Phylo namespace: >> >> https://github.com/biopython/biopython/tree/master/Bio/Phylo >> > > I agree. That sounds good. This will be a big change for anyone already using the stand alone pypaml - but some changes are unavoidable. > For the commandline code, it would be nice to have a >> Bio.Phylo.Applications that is organized similar to >> Bio.Align.Applications: >> >> https://github.com/biopython/biopython/tree/master/Bio/Align/Applications >> >> This will give you some flexibility as you want to expand out to >> support other programs, and provide a framework for additional >> phylogenetic commandline utilities. >> > > Since it sounds like you might eventually write wrappers for other programs > in the PAML suite, a layout like this might work: > > Bio/Phylo/Applications/_codeml.py > ?-- just the wrapper for running the command-line program, perhaps based on > the Bio.Application classes. 
The API for calling the wrapper goes through > __init__.py; the user doesn't import this module directly. (See > Bio.Align.Applications) > Roughly how many applications are there in PAML? What Brad and Eric have outlined would work fine, but we could opt for something a little different, like the namespace Bio.Phylo.Applications for general tools (there are some tree building tools I could write wrappers for - using the same setup as Bio.Align.Applications), and have namespace Bio.Phylo.Applications.PAML for the PAML wrappers. Another reason to separate them is they won't be using the simple Bio.Application framework (due to the way PAML options must be specified via input files). > > Bio/Phylo/PAML/codeml.py > ?-- all the code for parsing the output of the command-line program, and > working with that dictionary/class. Any other modules this depends on would > also go here, as would the other code for working with the input/output of > other PAML programs. > > >> Separating parsing from commandline generation can also let you move >> the _results dictionary from being a class member to a return value for >> a parse function. This is a bit more straightforward workflow >> instead of having the side-effect of assigning an internal class >> attribute. >> > > Yes. Also, the user might have saved the output from a codeml run > previously (maybe from a shell script/pipeline), and want to parse it > without re-running codeml through a Python wrapper. Right? (Sorry > if I misunderstood your code.) > > I look forward to seeing your branch on GitHub. Please let us know > if you have any problems along the way. > > All the best, > Eric Thanks for your comments Brad and Eric :) Peter From b.invergo at gmail.com Sat Jan 15 12:20:05 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Sat, 15 Jan 2011 13:20:05 +0100 Subject: [Biopython-dev] pypaml In-Reply-To: References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: I'll reply to both Eric and Peter in this one... >>> The functionality here looks great. My stylistic suggestion would be >>> to separate the code for running the commandline from that used to >>> parse the output file. Ideally these would be two separate classes >>> that could live under the Bio.Phylo namespace: >>> >>> https://github.com/biopython/biopython/tree/master/Bio/Phylo >>> >> >> I agree. > > That sounds good. This will be a big change for anyone already > using the stand alone pypaml - but some changes are unavoidable. I plan to make a tag of the current version on Google Code and then branch it and start making these structural changes. I'll put a notice on the main page to let the users know how things will be changing as I prepare to migrate to Biopython. It'll be a slow, steady process. >> For the commandline code, it would be nice to have a >>> Bio.Phylo.Applications that is organized similar to >>> Bio.Align.Applications: >>> >>> https://github.com/biopython/biopython/tree/master/Bio/Align/Applications >>> >>> This will give you some flexibility as you want to expand out to >>> support other programs, and provide a framework for additional >>> phylogenetic commandline utilities. >>> >> >> Since it sounds like you might eventually write wrappers for other programs >> in the PAML suite, a layout like this might work: >> >> Bio/Phylo/Applications/_codeml.py >> ?-- just the wrapper for running the command-line program, perhaps based on >> the Bio.Application classes. 
The API for calling the wrapper goes through >> __init__.py; the user doesn't import this module directly. (See >> Bio.Align.Applications) >> > Roughly how many applications are there in PAML? What Brad and > Eric have outlined would work fine, but we could opt for something > a little different, like the namespace Bio.Phylo.Applications for > general tools (there are some tree building tools I could write > wrappers for - using the same setup as Bio.Align.Applications), > and have namespace Bio.Phylo.Applications.PAML for the PAML > wrappers. Another reason to separate them is they won't be > using the simple Bio.Application framework (due to the way > PAML options must be specified via input files).
There are 8 programs in PAML. Copied from the manual:
- Comparison and tests of phylogenetic trees (baseml and codeml);
- Estimation of parameters in sophisticated substitution models, including models of variable rates among sites and models for combined analysis of multiple genes or site partitions (baseml and codeml);
- Likelihood ratio tests of hypotheses through comparison of implemented models (baseml, codeml, chi2);
- Estimation of divergence times under global and local clock models (baseml and codeml);
- Likelihood (Empirical Bayes) reconstruction of ancestral sequences using nucleotide, amino acid and codon models (baseml and codeml);
- Generation of datasets of nucleotide, codon, and amino acid sequence by Monte Carlo simulation (evolver);
- Estimation of synonymous and nonsynonymous substitution rates and detection of positive selection in protein-coding DNA sequences (yn00 and codeml).
- Bayesian estimation of species divergence times incorporating uncertainties in fossil calibrations (mcmctree).
>> Yes. Also, the user might have saved the output from a codeml run >> previously (maybe from a shell script/pipeline), and want to parse it >> without re-running codeml through a Python wrapper. Right? (Sorry >> if I misunderstood your code.)
Actually, it currently does support doing this. The parse_results() function takes a string filename as an argument so you can call it without having run any analyses yet. Still, it makes more sense to make the parser a separate class. What I'm torn about is to either have a single PAML parser class or to have separate parsers for each program. The output files contain the program name in the first line so it's simple enough to determine what kind of output you're looking at, but the code might get a bit long and cumbersome. Thanks for the input everyone. I'll have a lot of things done this weekend I hope (it's a busy one with other projects at the same time). Cheers, Brandon
From eric.talevich at gmail.com Sat Jan 15 18:23:19 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 15 Jan 2011 13:23:19 -0500 Subject: [Biopython-dev] pypaml In-Reply-To: References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: On Sat, Jan 15, 2011 at 7:20 AM, Brandon Invergo wrote: > I'll reply to both Eric and Peter in this one... > > >>> The functionality here looks great. My stylistic suggestion would be > >>> to separate the code for running the commandline from that used to > >>> parse the output file. Ideally these would be two separate classes > >>> that could live under the Bio.Phylo namespace... > > That sounds good. This will be a big change for anyone already > > using the stand alone pypaml - but some changes are unavoidable.
> > I plan to make a tag of the current version on Google Code and then > branch it and start making these structural changes. I'll put a notice > on the main page to let the users know how things will be changing as > I prepare to migrate to Biopython. It'll be a slow, steady process. > Sounds great to me! Slow and steady wins the race. >> For the commandline code, it would be nice to have a > >>> Bio.Phylo.Applications that is organized similar to > >>> Bio.Align.Applications: > >>> > >>> > https://github.com/biopython/biopython/tree/master/Bio/Align/Applications > >>> > >>> This will give you some flexibility as you want to expand out to > >>> support other programs, and provide a framework for additional > >>> phylogenetic commandline utilities. > >>> > >> > >> Since it sounds like you might eventually write wrappers for other > programs > >> in the PAML suite, a layout like this might work: > >> > >> Bio/Phylo/Applications/_codeml.py > >> -- just the wrapper for running the command-line program, perhaps based > on > >> the Bio.Application classes. The API for calling the wrapper goes > through > >> __init__.py; the user doesn't import this module directly. (See > >> Bio.Align.Applications) > >> > > > > Roughly how many applications are there in PAML? What Brad and > > Eric have outlined would work fine, but we could opt for something > > a little different, like the namespace Bio.Phylo.Applications for > > general tools (there are some tree building tools I could write > > wrappers for - using the same setup as Bio.Align.Applications), > > and have namespace Bio.Phylo.Applications.PAML for the PAML > > wrappers. Another reason to separate them is they won't be > > using the simple Bio.Application framework (due to the way > > PAML options must be specified via input files). > If the Bio.Applications framework won't work for codeml/PAML then I think it would be misleading to put any pypaml code under Bio.Phylo.Applications (at least for now). Later we might find a way to put PAML options into named temporary files and run the command-line applications that way, but that's probably not a priority yet. *Code Philosophy*: It's my understanding that tightly nested namespaces are nicer for library developers, but flatter namespaces are nicer for the users of those libraries (especially those who don't use full-featured IDEs like Eclipse). Python lets us have it both ways, to some extent, by importing protected modules to a higher-level namespace. See if you agree with examples like these: # Common functionality and generalized tree I/O is available at the top level >>> from Bio import Phylo # Everything under *.Applications directly uses the Bio.Application framework >>> from Bio.Phylo.Applications import RAxMLCommandline >>> from Bio.Phylo.Applications import MrBayesCommandline # Extra functionality for a popular application suite goes in a separate sub-package >>> from Bio.Phylo.PAML import codeml # A namespace for web services reminds the user that the network will be used >>> from Bio.Phylo.WWW import Dryad This is basically what we proposed with Bio.Struct for GSoC 2010, and I don't think any of it contradicts the existing conventions of Bio.Align. Namespace collisions are unlikely: the sub-packages would generally be either support for new file formats or helpers for application suites, and those would only match if an application suite defined its own file formats -- in which case the modules do belong under the same sub-package. >> Yes. 
Also, the user might have saved the output from a codeml run > >> previously (maybe from a shell script/pipeline), and want to parse it > >> without re-running codeml through a Python wrapper. Right? (Sorry > >> if I misunderstood your code.) > > Actually, it currently does support doing this. The parse_results() > function takes a string filename as an argument so you can call it > without having run any analyses yet. Still, it makes more sense to > make the parser a separate class. What I'm torn about is to either > have a single PAML parser class or to have separate parsers for each > program. The output files contain the program name in the first line > so it's simple enough to determine what kind of output you're looking > at, but the code might get a bit long and cumbersome. > I'd recommend splitting the parsers into separate modules. Small functions and classes are much easier to maintain. If everyone agrees with this layout, I'd suggest putting your existing __init__.py and codeml.py under Bio/Phylo/PAML/. Inside codeml.py, I'd suggest: 1. Have the run() method raise an exception when the subprocess return code is non-zero, instead of returning the subprocess return code directly (try subprocess.check_call in place of subprocess.call, or see Bio/Application/__init__.py). Most of the time the user will want to throw an error of some sort if the command line fails; this is more direct. Then, since run() no longer needs to return an integer, it's free to return the results dictionary instead. 2. Change parse_results() to return a dictionary, rather than setting it on self.results. So the run() function retrieves this dictionary by calling parse_results(), then returns it (after chdir'ing). 3. Now that parse_results() doesn't need direct access to self._results, move it out of the codeml class and rename it as a standalone function: def read(results_file, version=None): ... Any other optional info that parse_results()/read() needs can also be passed as keyword arguments -- I'm not sure if I missed any places where that's occurring. This is the same overall change Brad was suggesting, I think. It also brings the style of pypaml/codeml pretty much in line with how Biopython and Bio.Application work, so further integration would be easier in the future. Best, Eric From b.invergo at gmail.com Sun Jan 16 14:19:13 2011 From: b.invergo at gmail.com (Brandon Invergo) Date: Sun, 16 Jan 2011 15:19:13 +0100 Subject: [Biopython-dev] pypaml In-Reply-To: References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: Hi everyone, A quick question about style: since the name "codeml" is based on a program which is always spelled either in all caps or in all lower-case, what would be the best way to write the class name regarding capitalization? Stick with the usual camel-case convention, "Codeml", anyway? Things are progressing nicely. I've already taken care of a lot of the minor tasks and improvements... Cheers, Brandon From p.j.a.cock at googlemail.com Sun Jan 16 15:09:07 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 16 Jan 2011 15:09:07 +0000 Subject: [Biopython-dev] pypaml In-Reply-To: References: <20110114154035.GC30193@sobchak.mgh.harvard.edu> Message-ID: On Sun, Jan 16, 2011 at 2:19 PM, Brandon Invergo wrote: > Hi everyone, > A quick question about style: since the name "codeml" is based on a > program which is always spelled either in all caps or in all > lower-case, what would be the best way to write the class name > regarding capitalization? 
Stick with the usual camel-case convention, > "Codeml", anyway? I'd go with Codeml for a class name (or something like CodemlResult or whatever). Neither CODEML nor codeml seem good class names in Python. > Things are progressing nicely. I've already taken care of a lot of the > minor tasks and improvements... Sounds good :) Peter From biopython at maubp.freeserve.co.uk Mon Jan 17 00:38:05 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Jan 2011 00:38:05 +0000 Subject: [Biopython-dev] Calling 2to3 from setup.py In-Reply-To: References: Message-ID: On Tue, Jan 4, 2011 at 11:30 PM, Peter wrote: > On Tue, Jan 4, 2011 at 10:43 PM, Peter wrote: >> Hi all, >> >> Something we've talked about before is calling lib2to3 or the 2to3 script >> from within setup.py to make installing Biopython simpler on Python 3. >> >> ... >> >> I then looked at how NumPy are doing this, and they have a hook >> in setup.py which calls their own Python script called py3tool.py to >> do the conversion, ... it will not bother to reconvert previously >> converted but unchanged files. ... > > ... So > on reflection, using the time stamp to decide if 2to3 needs to > be rerun is probable quite sufficient (and will be faster too). > > Peter > I switched to the simpler approach used by NumPy (just look at the last modified timestamp) and committed this: https://github.com/biopython/biopython/commit/1eeed11aefc54787fb836a6b3b5f4c82628edef4 I've had to tweak the buildbot accordingly: it no longer needs to call the 2to3 script, and the tests must be run from the converted version rather than the original Python 2 version of the code. Peter From bugzilla-daemon at portal.open-bio.org Mon Jan 17 10:11:31 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 17 Jan 2011 05:11:31 -0500 Subject: [Biopython-dev] [Bug 3168] different StringIO import for Python 3 In-Reply-To: Message-ID: <201101171011.p0HABVp5004830@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3168 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-17 05:11 EST ------- Which version of Python 3 do you have? We're testing on Python 3.1 and 3.2 at the moment, ignoring Python 3.0. Also, could you retry using the latest Biopython from git? This now calls 2to3 automatically from setup.py which is much simpler than manually calling 2to3 at the command line. Thanks Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
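For readers wondering what "just look at the last modified timestamp" means in practice, here is a stripped-down sketch of the idea. It is illustrative only, much simpler than the real setup.py machinery, and it assumes the 2to3 script is on the PATH and that it is run from the unpacked source tree:

import os
import shutil
import subprocess

def convert_if_stale(src, dest):
    # Copy and convert src to dest only if dest is missing or older than src.
    if os.path.exists(dest) and os.path.getmtime(dest) >= os.path.getmtime(src):
        return  # previously converted copy is still up to date
    if not os.path.isdir(os.path.dirname(dest)):
        os.makedirs(os.path.dirname(dest))
    shutil.copyfile(src, dest)
    # -w writes changes in place, -n skips the .bak backups
    subprocess.check_call(["2to3", "-w", "-n", "--no-diffs", dest])

for dirpath, dirnames, filenames in os.walk("Bio"):
    for name in filenames:
        if name.endswith(".py"):
            src = os.path.join(dirpath, name)
            convert_if_stale(src, os.path.join("build_py3", src))
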
From biopython at maubp.freeserve.co.uk Mon Jan 17 13:54:32 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Jan 2011 13:54:32 +0000 Subject: [Biopython-dev] curious error from HMM unit test Message-ID: Hi all, There was a curious failure under one of the buildslaves running Jython 2.5.2rc3 on 64bit Linux: http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/6/steps/shell/logs/stdio ====================================================================== ERROR: test_HMMCasino ---------------------------------------------------------------------- Traceback (most recent call last): File "run_tests.py", line 314, in runTest suite = unittest.TestLoader().loadTestsFromName(name) File "/home/buildslave/jython2.5.2rc3/Lib/unittest.py", line 533, in loadTestsFromName module = __import__('.'.join(parts_copy)) File "/home/buildslave/jython2.5.2rc3/Lib/unittest.py", line 533, in loadTestsFromName module = __import__('.'.join(parts_copy)) File "/home/buildslave/BuildBot/jython252lin64/build/Tests/test_HMMCasino.py", line 180, in trained_mm = trainer.train([training_seq], stop_training) File "/home/buildslave/BuildBot/jython252lin64/build/Bio/HMM/Trainer.py", line 212, in train emission_count = self.update_emissions(emission_count, File "/home/buildslave/BuildBot/jython252lin64/build/Bio/HMM/Trainer.py", line 332, in update_emissions expected_times += (forward_vars[(k, i)] * UnboundLocalError: local variable 'k' referenced before assignment ---------------------------------------------------------------------- The error message is rather odd, since k is defined as the outer loop variable, see: https://github.com/biopython/biopython/blob/master/Bio/HMM/Trainer.py This is in the HMM code, so due to the stochastic nature we expect each run of test may take slightly different branches through the code. We haven't altered this code for a while, so this is either a long standing issue, or perhaps indicative of a problem in Jython 2.5.2rc3 instead (although I don't see any open bugs relevant). I've logged into the buildslave and re-run test_HMMCasino.py under Jython about 30 times - all fine. Rerunning the build also came up clear. Has anyone got any insight into what might be going on? Unless it happens again maybe it was a fluke (cosmic ray, bad ram, etc). Peter From eric.talevich at gmail.com Mon Jan 17 16:17:14 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 17 Jan 2011 11:17:14 -0500 Subject: [Biopython-dev] curious error from HMM unit test In-Reply-To: References: Message-ID: On Mon, Jan 17, 2011 at 8:54 AM, Peter wrote: > Hi all, > > There was a curious failure under one of the buildslaves running > Jython 2.5.2rc3 on 64bit Linux: > > http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/6/steps/shell/logs/stdio > > [...] > > The error message is rather odd, since k is defined as the outer loop > variable, see: > https://github.com/biopython/biopython/blob/master/Bio/HMM/Trainer.py > > This is in the HMM code, so due to the stochastic nature we expect > each run of test may take slightly different branches through the > code. We haven't altered this code for a while, so this is either a > long standing issue, or perhaps indicative of a problem in Jython > 2.5.2rc3 instead (although I don't see any open bugs relevant). I've > logged into the buildslave and re-run test_HMMCasino.py under Jython > about 30 times - all fine. Rerunning the build also came up clear. 
> > Has anyone got any insight into what might be going on? Unless it > happens again maybe it was a fluke (cosmic ray, bad ram, etc). > It could have been an exotic bug in Jython (or its interactions with the JVM) where the JIT or garbage collector is removing local variables too early. I don't see how you could provide a "fix" for it in Biopython, since k definitely exists at that point in the loop in any valid Python and Jython almost always handles it correctly. Maybe you could seed the RNG at the start of the unit test to ensure the same paths are always taken? -E From biopython at maubp.freeserve.co.uk Mon Jan 17 16:37:03 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Jan 2011 16:37:03 +0000 Subject: [Biopython-dev] curious error from HMM unit test In-Reply-To: References: Message-ID: On Mon, Jan 17, 2011 at 4:17 PM, Eric Talevich wrote: > > It could have been an exotic bug in Jython (or its interactions with the > JVM) where the JIT or garbage collector is removing local variables too > early. I don't see how you could provide a "fix" for it in Biopython, since > k definitely exists at that point in the loop in any valid Python and Jython > almost always handles it correctly. > Good point - maybe that is the most likely explanation. > > Maybe you could seed the RNG at the start of the unit test to ensure the > same paths are always taken? > We could do, but as part of that I'd want to increase the test coverage to ensure most of the HMM code is actually covered in the single non-stochastic run. As no-one is actively looking after it, I'd rather not touch the tests. Peter From biopython at maubp.freeserve.co.uk Mon Jan 17 18:04:27 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Jan 2011 18:04:27 +0000 Subject: [Biopython-dev] Python 3 and Bio.SeqIO.index() In-Reply-To: References: Message-ID: Continuing a thread back in July 2010, http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008004.html On Thu, Jul 15, 2010 at 2:31 PM, Peter wrote: > > I think the clear message (from both Windows and Linux) is that for > Bio.SeqIO.index() to perform at a tolerable speed on Python 3 we > can't use the default text mode with unicode strings, we are going > to have to use binary mode with bytes. > I've now done that - which brings the time for test_SeqIO_index.py down for Python 3.x to roughly the same as Python 2.x (about a four-fold speed up). Under Windows there may also be a slight speed up for Python 2, while on Linux/Mac there could be a slight slow down. I expect we can work on this. The good news is this is yet another step towards supporting Biopython under Python 3.
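[Illustration only -- this is not the real Bio.SeqIO.index() code, just a sketch of the binary-mode idea described above: with the file opened in "rb" mode the record markers are matched as bytes, so the bulk of the file is never decoded to unicode strings on Python 3. The FASTA file name is a placeholder.]

    offsets = {}
    handle = open("example.fasta", "rb")   # binary mode, so readline() returns bytes
    while True:
        start = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            key = line[1:].split(None, 1)[0].decode()
            offsets[key] = start
    handle.close()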
Peter From bugzilla-daemon at portal.open-bio.org Tue Jan 18 11:11:51 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 Jan 2011 06:11:51 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101181111.p0IBBpVh029286@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 macrozhu+biopy at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1562 is|0 |1 obsolete| | ------- Comment #7 from macrozhu+biopy at gmail.com 2011-01-18 06:11 EST ------- Created an attachment (id=1563) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1563&action=view) handle DSSP shifted output when there are too many res in a PDB In this version, one more issue is fixed when reading DSSP output: Sometimes DSSP output is not in the correct format. Column 34 to 38, which are supposed to be occupied by solvent accessibility (acc), are partially occupied by the elongated field before it. Thus, the conversion of the string in col 34 to 38 to integer will raise a ValueError exception. e.g. 3kic chain T res 321, or 1VSY chain T res 6077. >python DSSP.py 3kic.pdb In such cases, the acc value, and all the following values in the same line are shifted to the right because residue sequence number is too long (more than 4 digits). Normally, only 4 digits are allocated to seq num in DSSP output. When there are too many residues, this problems appears. Now the ValueError exception is caught and the line is re-examined for shifted acc values. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 18 11:14:38 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 Jan 2011 06:14:38 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101181114.p0IBEclN029403@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #8 from macrozhu+biopy at gmail.com 2011-01-18 06:14 EST ------- the code had been tested on more than 24,000 PDB files (a subset of the PDB) and it seems it works well. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 18 17:10:02 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 Jan 2011 12:10:02 -0500 Subject: [Biopython-dev] [Bug 3168] different StringIO import for Python 3 In-Reply-To: Message-ID: <201101181710.p0IHA2gp002494@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3168 michael.kuhn at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |WORKSFORME ------- Comment #4 from michael.kuhn at gmail.com 2011-01-18 12:10 EST ------- Ok, so the 2to3 from Python 3.1.3 handles this correctly, while Python 3.1.1 fails. So I'm closing this bug, though perhaps you can update the install instructions to require Python >= 3.1.3. 
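[Illustration only: a guard along these lines in setup.py or the installation notes could surface the broken 2to3 early; the 3.1.3 minimum is taken from this comment, it is not an official Biopython requirement.]

    import sys
    if (3, 0) <= sys.version_info[:3] < (3, 1, 3):
        print("Warning: 2to3 before Python 3.1.3 is known to mis-handle "
              "some StringIO imports (see bug 3168)")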
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 19 04:02:26 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 18 Jan 2011 23:02:26 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101190402.p0J42QvK002446@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #9 from eric.talevich at gmail.com 2011-01-18 23:02 EST ------- Created an attachment (id=1564) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1564&action=view) Patch based on Hongbo Zhu's attachment I adapted the previous code to the current Biopython head. Of note: - the pseudo-doctest in the DSSP class docstring wasn't runnable. I fixed that. - Instead of mkstemp I used NamedTemporary file -- this way Python will delete the tempfile automatically when the handle is closed - Some code compression, but I kept most of Hongbo Zhu's comments intact -- I found them useful - I tweaked the test script at the end of the file Tested this with PDB 1MOT, 3KIC, 1VSY, 3LLT, 3NR9 and checked with pylint. Seems OK to me. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 19 11:00:09 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 19 Jan 2011 06:00:09 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101191100.p0JB09fZ020533@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #10 from macrozhu+biopy at gmail.com 2011-01-19 06:00 EST ------- Hi, Eric, thanks for the nice code compression :) A mistake in my comments among the code ought to be corrected: the "9999 atoms" should be "9999 residues". In addition, a quick search shows that biopython 1.56 uses still tempfile.mktemp() in the following files. ./Scripts/xbbtools/xbb_blastbg.py ./Tests/test_PhyloXML.py ./Bio/GFF/GenericTools.py ./Bio/PDB/DSSP.py ./Bio/PDB/ResidueDepth.py ./Bio/PDB/NACCESS.py tempfile.mktemp() is deprecated since python release 2.3 https://github.com/biopython/biopython And even python 2.3 support had been dropped by biopython :-) best hongbo -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Jan 20 03:46:02 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 19 Jan 2011 22:46:02 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101200346.p0K3k2Fi023153@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #11 from eric.talevich at gmail.com 2011-01-19 22:46 EST ------- (In reply to comment #10) > Hi, Eric, > thanks for the nice code compression :) Cheers. Have you been able to test this patch, and does it satisfy? Should I commit this change to the Biopython trunk, then? 
> A mistake in my comments among the code ought to be corrected: the "9999 atoms" > should be "9999 residues". OK, I was suspicious of that comment too. I've fixed it on my local branch. > In addition, a quick search shows that biopython 1.56 uses still > tempfile.mktemp() in the following files. [...] Thanks, I've made a note of it. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Jan 20 12:04:23 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 Jan 2011 12:04:23 +0000 Subject: [Biopython-dev] curious error from HMM unit test In-Reply-To: References: Message-ID: On Mon, Jan 17, 2011 at 4:37 PM, Peter wrote: > On Mon, Jan 17, 2011 at 4:17 PM, Eric Talevich wrote: >> >> It could have been an exotic bug in Jython (or its interactions with the >> JVM) where the JIT or garbage collector is removing local variables too >> early. I don't see how you could provide a "fix" for it in Biopython, since >> k definitely exists at that point in the loop in any valid Python and Jython >> almost always handles it correctly. >> > > Good point - maybe that is the most likely explanation. > It has happened again on the same install of Jython 2.5.2rc3 on 64bit Linux, previously on build 6, now on build 12: http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/12/steps/shell/logs/stdio Again, repeating the build made the error go away - but the load on the machine would have been different etc. Peter From bugzilla-daemon at portal.open-bio.org Thu Jan 20 15:57:55 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 20 Jan 2011 10:57:55 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101201557.p0KFvt3I008953@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ------- Comment #12 from macrozhu+biopy at gmail.com 2011-01-20 10:57 EST ------- (In reply to comment #11) > (In reply to comment #10) > Cheers. Have you been able to test this patch, and does it satisfy? > I made changes to two lines because: -- input to function res2code() should be instances of Bio.PDB.Residue, not strings; -- variable res is from last loop round, thus not suitable as input to res2code(). So in the end, I changed two of the lines that use function res2code(). It has been tested on around 20,000 PDBs. cheers, 221c221 < res_seq_icode = '%s%s' % (res_id[1],res_id[2]) --- > res_seq_icode = res2code(res_id) 266c266 < res_seq_icode = '%s%s' % (res_id[1],res_id[2]) --- > res_seq_icode = res2code(res_id) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From anaryin at gmail.com Fri Jan 21 15:13:48 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 21 Jan 2011 16:13:48 +0100 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: Message-ID: Hello all, I've been working on the renumbering residues, remove disordered atoms, and biological unit representation functions. I've made quite some changes, specially to the renumbering algorithm. Explanation follows: Before I simply calculated how much to subtract from each residue number based on the first. 
That worked perfectly if all residue numbers were in a growing progression, which was not the case for some structures. Also, HETATMs weren't separated from the main ATOM lines, and in many PDB files you see numbering starting from 1000 for example. What I coded allows for certain discrimination of HETATMs from ATOMs based on the SEQRES field of the PDB file header (added parsing to parse_pdb_header). This ensures HETATMs are numbered from 1000. I've also incorporated a way of filtering modified aminoacids (that show up as HETATM but in between ATOM lines) to be treated as ATOMs if there is no SEQRES header present in the PDB file by looking for a CA atom. A warning is issued along with this "magic" feature turning on annoucing that the results may be a bit unreliable.. I've shown the code and the idea to the people in my lab and I got generally good responses, but of course they are all biased :) Have a look for yourselves, I created a branch for these. https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Thanks! From bugzilla-daemon at portal.open-bio.org Sat Jan 22 05:20:56 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 22 Jan 2011 00:20:56 -0500 Subject: [Biopython-dev] [Bug 3166] Bio.PDB.DSSP fails to work on PDBs with HETATM In-Reply-To: Message-ID: <201101220520.p0M5KuMW019063@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3166 eric.talevich at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #13 from eric.talevich at gmail.com 2011-01-22 00:20 EST ------- (In reply to comment #12) > I made changes to two lines because: > -- input to function res2code() should be instances of Bio.PDB.Residue, not > strings; > -- variable res is from last loop round, thus not suitable as input to > res2code(). > So in the end, I changed two of the lines that use function res2code(). It has > been tested on around 20,000 PDBs. Committed: https://github.com/biopython/biopython/commit/cc6842e0f79178af6bf9f32ad6ac3025685f55d1 Thanks for your help! Would you like to be added to the contributors list (in the NEWS file), with or without an e-mail address? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From krother at rubor.de Sat Jan 22 13:15:45 2011 From: krother at rubor.de (Kristian Rother) Date: Sat, 22 Jan 2011 14:15:45 +0100 Subject: [Biopython-dev] PDBFINDER module added Message-ID: Hi, PDBFINDER is a weekly updated text file that contains annotation for the entire PDB database. see: http://swift.cmbi.ru.nl/gv/pdbfinder/ The recent publication of PDBFINDER in NAR encouraged me to write a parser. I've added it as Bio.PDB.PDBFINDER in the branch 'pdbfinder' including tests. See: https://github.com/krother/biopython/commit/1ee57fc7ca08357d29fe4d8289c23ab30eecb5f9 Would be nice if someone interested could review the module. Have fun! 
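[Illustration only -- this is a guess at the PDBFINDER layout (one "Name : value" field per line, nested fields indented, records separated by "//" lines) rather than a description of Kristian's Bio.PDB.PDBFINDER module; check the format documentation at the link above before relying on it.]

    def iter_pdbfinder(handle):
        record = {}
        for line in handle:
            if line.startswith("//"):       # record separator (assumed)
                if record:
                    yield record
                record = {}
            elif ":" in line:
                name, value = line.split(":", 1)
                record.setdefault(name.strip(), []).append(value.strip())
        if record:
            yield record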
Kristian From eric.talevich at gmail.com Sat Jan 22 23:06:52 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 22 Jan 2011 18:06:52 -0500 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: Message-ID: On Fri, Jan 21, 2011 at 10:13 AM, Jo?o Rodrigues wrote: > Hello all, > > I've been working on the renumbering residues, remove disordered atoms, and > biological unit representation functions. > > I've made quite some changes, specially to the renumbering algorithm. > Explanation follows: > > Before I simply calculated how much to subtract from each residue number > based on the first. That worked perfectly if all residue numbers were in a > growing progression, which was not the case for some structures. Also, > HETATMs weren't separated from the main ATOM lines, and in many PDB files > you see numbering starting from 1000 for example. > > What I coded allows for certain discrimination of HETATMs from ATOMs based > on the SEQRES field of the PDB file header (added parsing to > parse_pdb_header). This ensures HETATMs are numbered from 1000. I've also > incorporated a way of filtering modified aminoacids (that show up as HETATM > but in between ATOM lines) to be treated as ATOMs if there is no SEQRES > header present in the PDB file by looking for a CA atom. A warning is issued > along with this "magic" feature turning on annoucing that the results may be > a bit unreliable.. > > I've shown the code and the idea to the people in my lab and I got generally > good responses, but of course they are all biased :) Have a look for > yourselves, I created a branch for these. > > https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements Hi Jo?o, Good stuff. I see you made a nice clean revision history for us, too -- thanks! Whitespace: Some extra spaces crept in and are throwing off the diff in Structure.py. Also, Structure.remove_disordered_atoms has a bunch of blank lines in the function body; could you slim it down? Chain.renumber_residues: 1. How about 'res_init' and 'het_init'? When I see "seed" I think RNG. 2. It looks like het_init does numbering automagically if the given value is 0, and otherwise numbers HETATMs in the expected way starting from the given number. How about letting "het_init=None" be the default, so "if not het_init" triggers the magical behavior, and if a positive integer is given then go from there. (Check that het_init >= 1 and maybe het_init == int(het_init) to allow 1.0 to be coerced to 1 if you want.) 3. I see in the last commit you added a local variable called 'magic'. Could you find a better name for that? I think 'guess_by_ca' would fit, if I'm reading the code correctly. 4. In the last block (lines 170-174 now), could you add a comment explaining why it would be reached? Before this commit there was a comment "Other HETATMs" but I'm not sure I fully understand. Is it for HETATMs not associated with any residue, i.e. not residue modifications? Structure.renumber_residues: 1. OK, I see what you're doing with het_seed=0 -- clever, but maybe more clever than necessary. It's not obvious from reading just this code that the first iteration is a special case for HETATM numbering; a maintainer would have to look at Chain.py too. A comment about that would help, I think. 2. Why (h/1000 >= 1) instead of (h >= 1000) ? 3. If the Chain.renumber_residues arguments change to 'res_init' and 'het_init', then 'seed' here should change to 'init' 4. 
The arguments 'sequential' and 'chain_displace' seem to interact -- I don't think I'd use chain_displace != 1 unless I had set sequential=True. So, it seems like chain_displace should only take effect if sequential=True (i.e. line 77 would be indented another level). To tighten things up further, I'd combine those two arguments into a single 'skip/gap_between_chains' or similar: # UNTESTED for chain in model: r_new, h_new = chain.renumber_residues(res_seed=r, het_seed=h) if skip_between_chains: r = r_new + skip_between_chains if h_new >= 1000: # Each chain's HETATM numbering starts at the next even 1000*N h = 1000*((h_new/1000)+1) else: h = h_new + 1 Structure.build_biological_unit: It looks like if the structure has more than one model, the models after 0 will be clobbered when this method is run. So, unless a better solution appears, it's safest to add an assert for len(self.models) == 1 or check+ValueError for multiple models. All the best, Eric From krother at rubor.de Mon Jan 24 11:38:45 2011 From: krother at rubor.de (Kristian Rother) Date: Mon, 24 Jan 2011 12:38:45 +0100 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged Message-ID: Hi Joao, I got two things to add to Erics comments: - When renumbering a chain, the id's of some residues are changed. Have you tested whether the keys in Chain.child_dict are changed as well? - Could you refactor a method Chain.change_residue_numbers(old_ids, new_ids) that does the changing of the calculated identifiers? I think this would have a some advantages (shorter code is more testable, easier to deal with the point above, I could use this for some custom numbering schemes). - Currently, Chain.renumber_residues in the lines last_num = residue.id[1]+displace residue.id = (residue.id[0], residue.id[1]+displace, residue.id[2]) are repating 3 times. Best regards, Kristian From biopython at maubp.freeserve.co.uk Mon Jan 24 11:50:21 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 24 Jan 2011 11:50:21 +0000 Subject: [Biopython-dev] curious error from HMM unit test In-Reply-To: References: Message-ID: On Thu, Jan 20, 2011 at 12:04 PM, Peter wrote: > On Mon, Jan 17, 2011 at 4:37 PM, Peter wrote: >> On Mon, Jan 17, 2011 at 4:17 PM, Eric Talevich wrote: >>> >>> It could have been an exotic bug in Jython (or its interactions with the >>> JVM) where the JIT or garbage collector is removing local variables too >>> early. I don't see how you could provide a "fix" for it in Biopython, since >>> k definitely exists at that point in the loop in any valid Python and Jython >>> almost always handles it correctly. >>> >> >> Good point - maybe that is the most likely explanation. >> > > It has happened again on the same install of Jython 2.5.2rc3 on 64bit Linux, > previously on build 6, now on build 12: > > http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/12/steps/shell/logs/stdio > > Again, repeating the build made the error go away - but the load on the > machine would have been different etc. 
And a more interesting variant, http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/16/steps/shell/logs/stdio http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/18/steps/shell/logs/stdio # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00002aaaab54c400, pid=13422, tid=1078704448 # # JRE version: 6.0_17-b17 # Java VM: OpenJDK 64-Bit Server VM (14.0-b16 mixed mode linux-amd64 ) # Derivative: IcedTea6 1.7.5 # Distribution: Custom build (Wed Oct 13 13:04:40 EDT 2010) # Problematic frame: # j Bio.HMM.Trainer$py.update_emissions$12(Lorg/python/core/PyFrame;Lorg/python/core/ThreadState;)Lorg/python/core/PyObject;+555 # # An error report file with more information is saved as: # /home/buildslave/BuildBot/jython252lin64/build/Tests/hs_err_pid13422.log # # If you would like to submit a bug report, please include # instructions how to reproduce the bug and visit: # http://icedtea.classpath.org/bugzilla # Looking at the log suggests this could be a low memory issue, perhaps from running multiple test builds at once. Peter From tiagoantao at gmail.com Mon Jan 24 14:02:47 2011 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 24 Jan 2011 14:02:47 +0000 Subject: [Biopython-dev] curious error from HMM unit test In-Reply-To: References: Message-ID: Sorry, I am drenched in work (writing up my PhD thesis) and had no time to attend to this. Maybe it is the Jython version? It is still a release candidate. I think the one installed is RC2, I am going to upgrade to the new RC3 and try again. Tiago On Mon, Jan 24, 2011 at 11:50 AM, Peter wrote: > On Thu, Jan 20, 2011 at 12:04 PM, Peter wrote: >> On Mon, Jan 17, 2011 at 4:37 PM, Peter wrote: >>> On Mon, Jan 17, 2011 at 4:17 PM, Eric Talevich wrote: >>>> >>>> It could have been an exotic bug in Jython (or its interactions with the >>>> JVM) where the JIT or garbage collector is removing local variables too >>>> early. I don't see how you could provide a "fix" for it in Biopython, since >>>> k definitely exists at that point in the loop in any valid Python and Jython >>>> almost always handles it correctly. >>>> >>> >>> Good point - maybe that is the most likely explanation. >>> >> >> It has happened again on the same install of Jython 2.5.2rc3 on 64bit Linux, >> previously on build 6, now on build 12: >> >> http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/12/steps/shell/logs/stdio >> >> Again, repeating the build made the error go away - but the load on the >> machine would have been different etc. 
> > And a more interesting variant, > http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/16/steps/shell/logs/stdio > http://events.open-bio.org:8010/builders/Linux%2064%20-%20Jython%202.5.2/builds/18/steps/shell/logs/stdio > > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x00002aaaab54c400, pid=13422, tid=1078704448 > # > # JRE version: 6.0_17-b17 > # Java VM: OpenJDK 64-Bit Server VM (14.0-b16 mixed mode linux-amd64 ) > # Derivative: IcedTea6 1.7.5 > # Distribution: Custom build (Wed Oct 13 13:04:40 EDT 2010) > # Problematic frame: > # j Bio.HMM.Trainer$py.update_emissions$12(Lorg/python/core/PyFrame;Lorg/python/core/ThreadState;)Lorg/python/core/PyObject;+555 > # > # An error report file with more information is saved as: > # /home/buildslave/BuildBot/jython252lin64/build/Tests/hs_err_pid13422.log > # > # If you would like to submit a bug report, please include > # instructions how to reproduce the bug and visit: > # http://icedtea.classpath.org/bugzilla > # > > Looking at the log suggests this could be a low memory issue, > perhaps from running multiple test builds at once. > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- "If you want to get laid, go to college. If you want an education, go to the library." - Frank Zappa From macrozhu at gmail.com Mon Jan 24 18:25:17 2011 From: macrozhu at gmail.com (Hongbo Zhu) Date: Mon, 24 Jan 2011 19:25:17 +0100 Subject: [Biopython-dev] why HETERO-flag in residue identifier (Bio.PDB.Residue)? Message-ID: Hi, I was recently working on the BioPython module DSSP.py. There was some problem in the module when reading DSSP output. One of them was due to different descriptions of residue identifier in DSSP and BioPython. As we all know, in BioPython, residue identifier consists of three fields ( hetero-flag, sequence identifier, insertion code ). But DSSP uses only the latter two. This can sometimes cause unnecessary exceptions (see http://bugzilla.open-bio.org/show_bug.cgi?id=3166 ). In retrospect, I started to wonder why BioPython included hetero-flag in residue identifier. After checking several BioPython documents, I found that in "The Biopython Structural Bioinformatics FAQ", this question has been answered: "The reason for the hetero-flag is that many, many PDB files use the same sequence identifier for an amino acid and a hetero-residue or a water, which would create obvious problems if the hetero-flag was not used." I somehow got interested in the issue and performed a scan of a subset of PDB (a non-redundant set of ~22,000 pdb entries derived using PISCES http://dunbrack.fccc.edu/PISCES.php ). I found ~30 cases in which the same sequence identifier + icode is used for more than one residue (see below). I checked all of them. It turned out that in all of these cases, though the same sequence identifier+icode is used for different residues, the residues have different alternative locations. This means they can still be distinguished if alternative locations are considered. In BioPython, alternative location is always very well taken care of. So it seems to me that hetero-flag is a bit redundant in residue identifier. It should also be fine if hetero-flag is just given as an attribute to residues (I still need to scan all the PDB entries to confirm my claim). I want to hear your opinions about the hetero-flag in residue identifier.
cheers, hongbo zhu Duplicate: 2pxs 0 A ('H_XYG', 66, ' ') Duplicate: 2pxs 0 B ('H_XYG', 66, ' ') Duplicate: 3bln 0 A ('H_MPD', 147, ' ') Duplicate: 3ned 0 A ('H_CH6', 67, ' ') Duplicate: 3ned 0 A ('H_NRQ', 67, ' ') Duplicate: 3l4j 0 A ('H_PTR', 782, ' ') Duplicate: 1ysl 0 B (' ', 111, ' ') Duplicate: 3gju 0 A (' ', 289, ' ') Duplicate: 3fcr 0 A ('H_LLP', 288, ' ') Duplicate: 1xpm 0 A (' ', 111, ' ') Duplicate: 1xpm 0 B (' ', 111, ' ') Duplicate: 1xpm 0 C (' ', 111, ' ') Duplicate: 1xpm 0 D (' ', 111, ' ') Duplicate: 2vqr 0 A ('H_DDZ', 57, ' ') Duplicate: 3piu 0 A (' ', 273, ' ') Duplicate: 2w8s 0 A ('H_FGL', 57, ' ') Duplicate: 2w8s 0 B ('H_FGL', 57, ' ') Duplicate: 2w8s 0 C ('H_FGL', 57, ' ') Duplicate: 2w8s 0 D ('H_FGL', 57, ' ') Duplicate: 2wpn 0 B ('H_PSW', 489, ' ') Duplicate: 2wpn 0 B ('H_PSW', 489, ' ') Duplicate: 3a0m 0 F (' ', 13, ' ') Duplicate: 3a0m 0 F (' ', 16, ' ') Duplicate: 3a0m 0 F (' ', 13, ' ') Duplicate: 3a0m 0 F (' ', 16, ' ') Duplicate: 2ci1 0 A ('H_K1R', 273, ' ') Duplicate: 2uv2 0 A ('H_TPO', 183, ' ') Duplicate: 3d3w 0 B ('H_CSO', 138, ' ') Duplicate: 3hvy 0 A ('H_LLP', 243, ' ') Duplicate: 3hvy 0 B ('H_LLP', 243, ' ') Duplicate: 3hvy 0 C ('H_LLP', 243, ' ') Duplicate: 3hvy 0 D ('H_LLP', 243, ' ') Duplicate: 2j6v 0 A ('H_ALY', 229, ' ') Duplicate: 2j6v 0 B ('H_ALY', 229, ' ') -- Hongbo From biopython at maubp.freeserve.co.uk Mon Jan 24 23:05:56 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 24 Jan 2011 23:05:56 +0000 Subject: [Biopython-dev] curious error from HMM unit test In-Reply-To: References: Message-ID: 2011/1/24 Tiago Ant?o : > Sorry, I am drenched in work (writing up my PhD thesis) and had no > time to attend to this. Maybe it is the Jython version? It is still a > release candidate. I think the one installed is RC2, I am going to > upgrade to the new RC3 and try again. Not to worry Tiago - it's not your machine - its one of mine (with Jython 2.5.2 RC3). You've got more important things right now ;) It looks like we may have found a bug worth reporting to the Jython guys. I'll try and work out if I can reproduce it "by hand" rather than just via buildbot. Peter From biopython at maubp.freeserve.co.uk Mon Jan 24 23:08:59 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 24 Jan 2011 23:08:59 +0000 Subject: [Biopython-dev] why HETERO-flag in residue identifier (Bio.PDB.Residue)? In-Reply-To: References: Message-ID: On Mon, Jan 24, 2011 at 6:25 PM, Hongbo Zhu wrote: > Hi, > > I was recently working on the BioPython module DSSP.py . There was some > problem in the module when reading DSSP output. One of them was due to > different descriptions of residue identifier in DSSP and BioPython. As we > all know, in BioPython, residue identifier consists of three fields ( > hetero-?ag, sequence identifier, insertion code ). ... > > I somehow got interested in the issue and performed a scanning on a subset > of PDB (a non-redundant set of ~22,000 pdb entries derived using PISCES > http://dunbrack.fccc.edu/PISCES.php ). I found ~30 cases in which same > sequence identifier + icode is used for more than one residues (see below). > I checked all of them. It turned out that in all of these cases, though same > sequence identifier+icode is used for different residues, the residues have > different alternative locations. This means they can still be distinguished > if alternative locations are considered. In BioPython, alternative location > is always very well taken care of. 
> > So it seems to me that hetero-flag is a bit redundant in residue identifier. > It should also be fine if hetero-flag is just given as an attribute to > residues ?(I still need to scan all the PDB entries to confirm my claim). I > want to hear your opinions about the hetero-flag in residue identifier. It may be that prior to the big PDB re-mediation (clean up) this was a real and common problem. Certainly your investigation suggests this isn't the case now. Peter From anaryin at gmail.com Mon Jan 24 23:23:06 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 25 Jan 2011 00:23:06 +0100 Subject: [Biopython-dev] why HETERO-flag in residue identifier (Bio.PDB.Residue)? In-Reply-To: References: Message-ID: To be really honest, I don't understand the problem with the flag. I don't really see it as redundant. Could you please explain better? From macrozhu at gmail.com Tue Jan 25 07:19:35 2011 From: macrozhu at gmail.com (Hongbo Zhu) Date: Tue, 25 Jan 2011 08:19:35 +0100 Subject: [Biopython-dev] why HETERO-flag in residue identifier (Bio.PDB.Residue)? In-Reply-To: References: Message-ID: > > So it seems to me that hetero-flag is a bit redundant in residue > identifier. > > It should also be fine if hetero-flag is just given as an attribute to > > residues (I still need to scan all the PDB entries to confirm my claim). > I > > want to hear your opinions about the hetero-flag in residue identifier. > > It may be that prior to the big PDB re-mediation (clean up) this was a > real and common problem. Certainly your investigation suggests > this isn't the case now. > This also occurred to me. You are right, I performed the test on PDB files after remediation. If this is the case, hetero-flag is better kept for backward compatibility. > > Peter > -- Hongbo From macrozhu at gmail.com Tue Jan 25 08:17:13 2011 From: macrozhu at gmail.com (Hongbo Zhu) Date: Tue, 25 Jan 2011 09:17:13 +0100 Subject: [Biopython-dev] why HETERO-flag in residue identifier (Bio.PDB.Residue)? In-Reply-To: References: Message-ID: By redundant, I mean that a residue can be unambiguously determined by using (PDB code, model id, chain id, residue sequence identifier+icode) . HETERO-flag itself is definitely not redundant information for a residue. But it seems to be redundant in residue ID according to the small test on ~22,000 remediated PDB files. This redundancy sometimes causes unnecessary problems. For example, in DSSP, residues are determined by using sequence identifier+icode. When parsing DSSP output, some residues cannot be located the PDB structure stored in Bio.PDB.Structure because sequence identifier + icode is not enough for determining the residues in BioPython. One example is: 3jui 0 A 547 In the protein structure, using sequence identifier + icode, this residue is unambiguously determined. But in BioPython, one has to specify ('H_MSE', 547, ' ') to locate this residue. (Note that we can also simply use 547 without icode to locate it. But we don't want to accidentally forget icode in our script, do we :). Peter pointed out that the existence of hetero-flag in residue ID might be due to the mistakes in the old PDB files before remediation. If it is the case, hetero-flag should better be retained for backwards compatibility. regards, hongbo On Tue, Jan 25, 2011 at 12:23 AM, Jo?o Rodrigues wrote: > To be really honest, I don't understand the problem with the flag. I don't > really see it as redundant. Could you please explain better? 
> > From anaryin at gmail.com Tue Jan 25 09:39:21 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 25 Jan 2011 10:39:21 +0100 Subject: [Biopython-dev] why HETERO-flag in residue identifier (Bio.PDB.Residue)? In-Reply-To: References: Message-ID: Thanks, now I understand what you meant :) I ran into somewhat of a similar problem when trying to deal with the renumbering of residues but I guess indeed old PDB files are perhaps messy in this aspect (just looking at the odd numbering of some..) so I'd agree with Peter. From krother at rubor.de Tue Jan 25 09:40:12 2011 From: krother at rubor.de (Kristian Rother) Date: Tue, 25 Jan 2011 10:40:12 +0100 Subject: [Biopython-dev] Bio.PDB.Residue.id In-Reply-To: References: Message-ID: Hi, In our group, we've been discussing the PDB.Residue.id issue a couple of times. The current notation is fine but it is unintuitive and hard to learn for newbies. We therefore use a wrapper that allows to access residues by one-string ids = str(identifier + icode), like '101', '101A', '3B' etc. I'm sure changing ids in PDB.Residue would break a lot of scripts people use. I could imagine some workarounds that allow ignoring the HETERO flag though. Would work for me. How about you? Best regards, Kristian > By redundant, I mean that a residue can be unambiguously determined by > using > (PDB code, model id, chain id, residue sequence identifier+icode) . > HETERO-flag itself is definitely not redundant information for a residue. > But it seems to be redundant in residue ID according to the small test on > ~22,000 remediated PDB files. > > This redundancy sometimes causes unnecessary problems. For example, in > DSSP, > residues are determined by using sequence identifier+icode. When parsing > DSSP output, some residues cannot be located the PDB structure stored in > Bio.PDB.Structure because sequence identifier + icode is not enough for > determining the residues in BioPython. One example is: > 3jui 0 A 547 > In the protein structure, using sequence identifier + icode, this residue > is > unambiguously determined. But in BioPython, one has to specify ('H_MSE', > 547, ' ') to locate this residue. (Note that we can also simply use 547 > without icode to locate it. But we don't want to accidentally forget icode > in our script, do we :). > > Peter pointed out that the existence of hetero-flag in residue ID might be > due to the mistakes in the old PDB files before remediation. If it is the > case, hetero-flag should better be retained for backwards compatibility. > > regards, > hongbo > > On Tue, Jan 25, 2011 at 12:23 AM, Jo?o Rodrigues > wrote: > >> To be really honest, I don't understand the problem with the flag. I >> don't >> really see it as redundant. Could you please explain better? >> >> > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > From macrozhu at gmail.com Tue Jan 25 10:17:48 2011 From: macrozhu at gmail.com (Hongbo Zhu) Date: Tue, 25 Jan 2011 11:17:48 +0100 Subject: [Biopython-dev] Bio.PDB.Residue.id In-Reply-To: References: Message-ID: I totally agree that removing hetero-flag from residue ID is a big API change. I myself hate that very much when some other libraries I use announce such API changes. That should always be carefully planned and kept to minimal. In my opinion, what's more realistic is to add an additional mechanism to locate residues in PDB.Residue, in which no hetero-flag is required. 
That is, in this mechanism, one can use just sequence identifier + icode. Internally, PDB.Residue will first check whether residue (' ', seqnum, icode) exists. If not, it checks all residues with non-empty hetero-flag. Only if no residue with the sequence identifier + icode exists (regardless of the hetero-flag) does it throw an exception, rather than simply throwing an exception if (' ', seqnum, icode) does not exist. For instance, this can be realized by revising PDB.Chain._translate_id() to:

    def _translate_id(self, id):
        if type(id)==IntType:
            longid=(' ', id, ' ')
            if not self.has_id(longid):
                for r in self:
                    if r.id[0] != ' ' and r.id[1] == id and r.id[2] == ' ':
                        longid = r.id
        else:
            longid = id
        return longid

On Tue, Jan 25, 2011 at 10:40 AM, Kristian Rother wrote: > > Hi, > > In our group, we've been discussing the PDB.Residue.id issue a couple of > times. The current notation is fine but it is unintuitive and hard to > learn for newbies. > We therefore use a wrapper that allows to access residues by one-string > ids = str(identifier + icode), like '101', '101A', '3B' etc. > > I'm sure changing ids in PDB.Residue would break a lot of scripts people > use. I could imagine some workarounds that allow ignoring the HETERO flag > though. Would work for me. How about you? > > Best regards, > Kristian > > -- Hongbo From anaryin at gmail.com Tue Jan 25 15:00:19 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 25 Jan 2011 16:00:19 +0100 Subject: [Biopython-dev] Atom.py - _assign_element function (small bug) Message-ID: Hey all, I stumbled upon a little bug today in this function. There's a formatting argument missing in line 92:

    msg = "Could not assign element %r for Atom (name=%s) with given element %r" \
          % (*putative_element,* self.name, element)

Apart from this "typo", there's a problem with hydrogens. For example with Glutamine, the hydrogens HE21 and HE22 (if present) fail to be assigned decently with the current setting. I'm adding an additional condition to the if-clause in line 77 to correct this. This correction now parses correctly these hydrogens (and everyone else). The tests run fine too and I don't think I should add anything to them either.
> > GLN 205 > HE22 [ ?8.46199989 ?-1.04999995 -15.40400028] 4368 > Final Assignment: H > > I'm pushing this fix to my github atom-element branch, I guess that's easy > to cherry-pick? Cherry-picked, Thanks, Peter From anaryin at gmail.com Wed Jan 26 00:41:42 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 26 Jan 2011 01:41:42 +0100 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: Message-ID: Trying to reply point per point both to Eric and Kristian. Hi Jo?o, > > Good stuff. I see you made a nice clean revision history for us, too -- > thanks! > I'm still trying to get the hang of Git but since I learned 'git reset' life is easier :) > > Whitespace: > Some extra spaces crept in and are throwing off the diff in > Structure.py. Also, Structure.remove_disordered_atoms has a bunch of > blank lines in the function body; could you slim it down? > They were mostly in the biological_unit function I suppose. I cut those down. I've also diff'ed the master and mine and checked differences in whitespace. Almost all gone, those left I can't really erase them, they're probably from my editor... > > > Chain.renumber_residues: > 1. How about 'res_init' and 'het_init'? When I see "seed" I think RNG. > Agreed, changed. > 2. It looks like het_init does numbering automagically if the given > value is 0, and otherwise numbers HETATMs in the expected way starting > from the given number. How about letting "het_init=None" be the > default, so "if not het_init" triggers the magical behavior, and if a > positive integer is given then go from there. (Check that het_init >= > 1 and maybe het_init == int(het_init) to allow 1.0 to be coerced to 1 > if you want.) > I'm using this later on to allow incremental chain renumbering. That's why it's 0 and not False or none, because I then add a number to it and it starts from that number on. I guess you understood when you read Structure.py. I'll add a comment pointing this out. > 3. I see in the last commit you added a local variable called 'magic'. > Could you find a better name for that? I think > 'guess_by_ca' would fit, if I'm reading the code correctly. > Changed 'magic' to 'filter_by_ca'. What it's doing is filtering the HETATMs if they have a CA atom so I guess it's a good name for it. > 4. In the last block (lines 170-174 now), could you add a comment > explaining why it would be reached? Before this commit there was a > comment "Other HETATMs" but I'm not sure I fully understand. Is it for > HETATMs not associated with any residue, i.e. not residue > modifications? > Added. It's for all HETATMs that don't have a CA atom basically and that are not contemplated in SEQRES (if there). Structure.renumber_residues: > 1. OK, I see what you're doing with het_seed=0 -- clever, but maybe > more clever than necessary. It's not obvious from reading just this > code that the first iteration is a special case for HETATM numbering; > a maintainer would have to look at Chain.py too. A comment about that > would help, I think. > Added. > 2. Why (h/1000 >= 1) instead of (h >= 1000) ? > Accumulated frustration over one day results in such logical typos :) > 3. If the Chain.renumber_residues arguments change to 'res_init' and > 'het_init', then 'seed' here should change to 'init' > Done. > 4. The arguments 'sequential' and 'chain_displace' seem to interact -- > I don't think I'd use chain_displace != 1 unless I had set > sequential=True. 
So, it seems like chain_displace should only take > effect if sequential=True (i.e. line 77 would be indented another > level). To tighten things up further, I'd combine those two arguments > into a single 'skip/gap_between_chains' or similar: > > # UNTESTED > for chain in model: > r_new, h_new = chain.renumber_residues(res_seed=r, het_seed=h) > if skip_between_chains: > r = r_new + skip_between_chains > if h_new >= 1000: > # Each chain's HETATM numbering starts at the next even 1000*N > h = 1000*((h_new/1000)+1) > else: > h = h_new + 1 > > I changed it to consecutive_chains and refactored the code a bit. I also changed the increment of the het_init value. This way, having more than 9 chains for example would lead to residue numbers over 10000 which is not allowed. I solved it by making all HETATMs starting (by default) at 1000 and just incrementing. If the numbering is consecutive they are also affected by the value chosen to skip between chains. A bit more logical IMO. Answering Kristian's suggestions: - When renumbering a chain, the id's of some residues are changed. Have > you tested whether the keys in Chain.child_dict are changed as well? > Good question... they didn't... is there an easy way of rebuilding that dictionary? Or should I just "rebuild" it and then overwrite child_dict? - Could you refactor a method Chain.change_residue_numbers(old_ids, new_ids) > that does the changing of the calculated identifiers? I think this would > have a some advantages (shorter code is more testable, easier to deal with > the point above, I could use this for some custom numbering schemes). > Could you elaborate on this? Should it be a new method? - Currently, Chain.renumber_residues in the lines > last_num = residue.id[1]+displace > residue.id = (residue.id[0], residue.id[1]+displace, residue.id[2]) > are repating 3 times. Changed. I merged the if-clauses. A bit more complicated but only one if-else condition. > Structure.build_biological_unit: > It looks like if the structure has more than one model, the models > after 0 will be clobbered when this method is run. So, unless a better > solution appears, it's safest to add an assert for len(self.models) == > 1 or check+ValueError for multiple models. > I would prefer to return a new Structure object just with the Biological Unit. It would save me the deepcopy but I'd have to create a new object so dunno if I could gain some speed there. But this would actually make more sense and avoid that problem. What do you think? I pushed again to the same branch: https://github.com/JoaoRodrigues/biopython/tree/pdb_enhancements From anaryin at gmail.com Wed Jan 26 01:47:02 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 26 Jan 2011 02:47:02 +0100 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: Message-ID: Adding to the build_biological_unit question. I re-wrote it to create a new structure and not to use deepcopy. A very crude benchmark (running two versions of the function) shows this: New function (w/out deepcopy): real 1m26.078s user 1m21.993s sys 0m2.048s Old function (w/ deepcopy): real 2m15.544s user 2m9.105s sys 0m3.092s So... a slight improvement I'd say. Pushed it to the pdb_enhancements branch as a new function called apply_transformation_matrix. A perhaps more descriptive and explicit name to the function. 
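[Illustration only, not the code from the branch: with made-up matrix values, applying one biological-unit operator through the long-standing Entity.transform() API looks roughly like the sketch below; in practice the rotation/translation pairs come from the REMARK 350 records of the PDB header, and the file name is a placeholder.]

    import numpy
    from Bio.PDB import PDBParser

    rotation = numpy.identity(3)                      # placeholder operator
    translation = numpy.array((0.0, 0.0, 10.0), "f")

    structure = PDBParser().get_structure("example", "example.pdb")
    structure.transform(rotation, translation)        # moves every atom in place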
Cheers, Jo?o From bugzilla-daemon at portal.open-bio.org Wed Jan 26 11:08:21 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 06:08:21 -0500 Subject: [Biopython-dev] [Bug 3171] New: inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3171 Summary: inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom Product: Biopython Version: 1.56 Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: edvin.fuglebakk at gmail.com There seems to be an inconsistency in the way missing elements are represented in PDB.PDBIO.PDBIO and PDB.Atom.Atom. The constructor of Atom sets the attribute element to '?' if this is unkown, while PDBIO raises a value error if it encouters atoms with the element set to '?'. PDBIO._get_atom_line checks if Atom.element is falsish (None, False, 0 ...) and chanhes the value of Atom.element to " " if it is. So it seems Atom represents missing elements by "?", while PDBIO represents them by falsish values, presumably None. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From anaryin at gmail.com Wed Jan 26 11:55:46 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 26 Jan 2011 12:55:46 +0100 Subject: [Biopython-dev] [Bug 3171] New: inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom In-Reply-To: References: Message-ID: This being "my" problem, how do I fix this in Bugzilla? The problem is already solved and pushed to Github since a few weeks. Cheers! Jo?o [...] Rodrigues http://doeidoei.wordpress.com On Wed, Jan 26, 2011 at 12:08 PM, wrote: > http://bugzilla.open-bio.org/show_bug.cgi?id=3171 > > Summary: inconsistent representation of missing elements > PDB.PDBIO.PDBIO and PDB.Atom.Atom > Product: Biopython > Version: 1.56 > Platform: Macintosh > OS/Version: Mac OS > Status: NEW > Severity: normal > Priority: P2 > Component: Main Distribution > AssignedTo: biopython-dev at biopython.org > ReportedBy: edvin.fuglebakk at gmail.com > > > There seems to be an inconsistency in the way missing elements are > represented > in PDB.PDBIO.PDBIO and PDB.Atom.Atom. > > The constructor of Atom sets the attribute element to '?' if this is > unkown, > while PDBIO raises a value error if it encouters atoms with the element set > to > '?'. PDBIO._get_atom_line checks if Atom.element is falsish (None, False, 0 > ...) and chanhes the value of Atom.element to " " if it is. > > So it seems Atom represents missing elements by "?", while PDBIO represents > them by falsish values, presumably None. > > > -- > Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are the assignee for the bug, or are watching the assignee. 
> _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython at maubp.freeserve.co.uk Wed Jan 26 13:51:38 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 26 Jan 2011 13:51:38 +0000 Subject: [Biopython-dev] [Bug 3171] New: inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom In-Reply-To: References: Message-ID: On Wed, Jan 26, 2011 at 11:55 AM, Jo?o Rodrigues wrote: > This being "my" problem, how do I fix this in Bugzilla? The problem is > already solved and pushed to Github since a few weeks. > > Cheers! > > Jo?o [...] Rodrigues > http://doeidoei.wordpress.com Hi Joao, If it has been fixed on the master, could you add a bug comment to say so with a link to the github commit (and say if you standardised on " ", "?" or something else). Then change the bug status to fixed (between the comment box and the commit button should be a radio-dialogue, move it from "Leave as NEW" to "Resolve bug, changing resolution to FIXED". Thanks, Peter From bugzilla-daemon at portal.open-bio.org Wed Jan 26 13:56:39 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 08:56:39 -0500 Subject: [Biopython-dev] [Bug 2992] Adding Uniprot XML file format parsing to Biopython In-Reply-To: Message-ID: <201101261356.p0QDud42015498@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2992 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-26 08:56 EST ------- This was included in Biopython 1.56, marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 26 14:07:34 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 09:07:34 -0500 Subject: [Biopython-dev] [Bug 2999] SeqIO.parse() or record.format("genbank") converts input sequence to uppercase or In-Reply-To: Message-ID: <201101261407.p0QE7Yco016101@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2999 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-26 09:07 EST ------- (In reply to comment #1) > In many file formats (e.g. FASTA) mixed case is allowed and useful. > > The sequence in a GenBank file is (by convention) always lower case, > but for historical reasons Biopython converts this to upper case on > parsing (not sure why, but changing it would risk breaking existing > scripts). > > However, I think we should convert to lower case on writing GenBank > output. > Done: https://github.com/biopython/biopython/commit/1f860f445d99794ef3747f7a90d73ac4b4a78a00 Marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
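[Illustration of the round trip discussed in bug 2999; file names are placeholders. Per the thread, parsing still upper-cases the sequence for historical reasons, and with the commit above the GenBank writer emits lower case again.]

    from Bio import SeqIO

    record = SeqIO.read("example.gbk", "genbank")
    print(record.seq[:10])                              # upper case in memory
    SeqIO.write([record], "roundtrip.gbk", "genbank")   # sequence written back in lower case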
From n.j.loman at bham.ac.uk Wed Jan 26 14:23:47 2011 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Wed, 26 Jan 2011 14:23:47 +0000 Subject: [Biopython-dev] XMFA format support Message-ID: <4D402E73.90806@bham.ac.uk> Hi biopython-developers, Has anyone made a start on adding XMFA support to Bio.AlignIO? XMFA files are produced by software such as Mauve (amongst others). Here's an example file: http://www.bioperl.org/wiki/XMFA_multiple_alignment_format It should be relatively straight-forward to parse them in a basic way in that they can be split on the '=' line to produce the equivalent of multi-FASTA alignments. Cheers, Nick. From bugzilla-daemon at portal.open-bio.org Wed Jan 26 14:26:26 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 09:26:26 -0500 Subject: [Biopython-dev] [Bug 3109] Record class in Bio.SCOP.Cla has hierarchy member as list instead of dictionary In-Reply-To: Message-ID: <201101261426.p0QEQQh6016867@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3109 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-26 09:26 EST ------- Committed: https://github.com/biopython/biopython/commit/9ec46f981a1fb7f97eaee4a2a01ad8bb3297234b and: https://github.com/biopython/biopython/commit/ce675b9299bf34e12335330d627385262f59b4e7 Marking as fixed - sorry for the delay. Thanks! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Wed Jan 26 14:32:41 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Jan 2011 14:32:41 +0000 Subject: [Biopython-dev] XMFA format support In-Reply-To: <4D402E73.90806@bham.ac.uk> References: <4D402E73.90806@bham.ac.uk> Message-ID: Hi Nick, On Wed, Jan 26, 2011 at 2:23 PM, Nick Loman wrote: > Hi biopython-developers, > > Has anyone made a start on adding XMFA support to Bio.AlignIO? >?XMFA files are produced by software such as Mauve (amongst others). > > Here's an example file: > http://www.bioperl.org/wiki/XMFA_multiple_alignment_format > > It should be relatively straight-forward to parse them in a basic way in > that they can be split on the '=' line to produce the equivalent of > multi-FASTA alignments. > > Cheers, > > Nick. Nope, but you are right they should be easy to parse - especially if you ignore the loosely defined optional key/value entries on the equals line. Do you want to tackle this? If not, can you at least provide some small example files and help with testing? You could file an enhancement bug on our bugzilla and then upload them as attachments. Also, could you name any other software that outputs these XMFA (extended multiple fasta) files, other than Mauve? I've not ever come across this format before. Thanks, Peter From n.j.loman at bham.ac.uk Wed Jan 26 14:46:51 2011 From: n.j.loman at bham.ac.uk (Nick Loman) Date: Wed, 26 Jan 2011 14:46:51 +0000 Subject: [Biopython-dev] XMFA format support In-Reply-To: References: <4D402E73.90806@bham.ac.uk> Message-ID: <4D4033DB.6030607@bham.ac.uk> On 26/01/2011 14:32, Peter Cock wrote: >> Has anyone made a start on adding XMFA support to Bio.AlignIO? >> XMFA files are produced by software such as Mauve (amongst others). 
>> >> Here's an example file: >> http://www.bioperl.org/wiki/XMFA_multiple_alignment_format >> >> It should be relatively straight-forward to parse them in a basic way in >> that they can be split on the '=' line to produce the equivalent of >> multi-FASTA alignments. > Nope, but you are right they should be easy to parse - especially > if you ignore the loosely defined optional key/value entries on the > equals line. Do you want to tackle this? If not, can you at least > provide some small example files and help with testing? You > could file an enhancement bug on our bugzilla and then upload > them as attachments. Hi Peter, Sure, I'll have a go at writing a parser and let you know how I get on. > Also, could you name any other software that outputs these XMFA > (extended multiple fasta) files, other than Mauve? I've not ever > come across this format before. > I think the format was invented by the author of LAGAN (http://lagan.stanford.edu/lagan_web/index.shtml). From a Google search it looks like it is exported by Bigsdb too (http://pubmlst.org/software/database/bigsdb/userguide/isolates/xmfa.shtml) Cheers Nick From bugzilla-daemon at portal.open-bio.org Wed Jan 26 15:00:07 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 10:00:07 -0500 Subject: [Biopython-dev] [Bug 3171] inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom In-Reply-To: Message-ID: <201101261500.p0QF07Hd018137@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3171 anaryin at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #1 from anaryin at gmail.com 2011-01-26 10:00 EST ------- This problem is already fixed and pushed to the master branch. Commit: https://github.com/biopython/biopython/blob/6c7ef358e5f93599ca165ce8e7b46261106e2b06/Bio/PDB/PDBIO.py -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 26 15:04:56 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 10:04:56 -0500 Subject: [Biopython-dev] [Bug 3171] inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom In-Reply-To: Message-ID: <201101261504.p0QF4uFN018424@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3171 ------- Comment #2 from anaryin at gmail.com 2011-01-26 10:04 EST ------- If the element can't be determined Atom defines it as "" (Empty string) which is correctl interpreted by PDBIO. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Wed Jan 26 15:06:00 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Jan 2011 15:06:00 +0000 Subject: [Biopython-dev] XMFA format support In-Reply-To: <4D4033DB.6030607@bham.ac.uk> References: <4D402E73.90806@bham.ac.uk> <4D4033DB.6030607@bham.ac.uk> Message-ID: On Wed, Jan 26, 2011 at 2:46 PM, Nick Loman wrote: > On 26/01/2011 14:32, Peter Cock wrote: >>> >>> Has anyone made a start on adding XMFA support to Bio.AlignIO? >>> ?XMFA files are produced by software such as Mauve (amongst others). >>> ... 
>> >> Nope, but you are right they should be easy to parse - especially >> if you ignore the loosely defined optional key/value entries on the >> equals line. Do you want to tackle this? If not, can you at least >> provide some small example files and help with testing? You >> could file an enhancement bug on our bugzilla and then upload >> them as attachments. > > Hi Peter, > > Sure, I'll have a go at writing a parser and let you know how I get on. > Great. I'd suggest format name "xmfa" to match BioPerl, and using Bio/SeqIO/XmfaIO.py for your parser (and writer if you do one). Peter From bugzilla-daemon at portal.open-bio.org Wed Jan 26 15:09:38 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 10:09:38 -0500 Subject: [Biopython-dev] [Bug 3171] inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom In-Reply-To: Message-ID: <201101261509.p0QF9cPS018663@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3171 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-26 10:09 EST ------- (In reply to comment #1) > This problem is already fixed and pushed to the master branch. > > Commit: > https://github.com/biopython/biopython/blob/6c7ef358e5f93599ca165ce8e7b46261106e2b06/Bio/PDB/PDBIO.py > That's just a white space change (from a branch merge): https://github.com/biopython/biopython/commit/6c7ef358e5f93599ca165ce8e7b46261106e2b06 Probably the fix you are looking for was applied earlier in the history. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bioinformed at gmail.com Wed Jan 26 15:14:33 2011 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Wed, 26 Jan 2011 10:14:33 -0500 Subject: [Biopython-dev] Sequential SFF IO Message-ID: Any objections/worries about converting the SFF writer to use the sequential/incremental writer object interface? I know it looks specialized for text formats, but I need to split large SFF files into many smaller ones and would rather not materialize the whole thing. The SFF writer code already allows for deferred writing of read counts and index creation, so it looks to be only minor surgery. There doesn't seem to be an obvious API for obtaining such a writer using the SeqIO interface. Am I missing something obvious? -Kevin From p.j.a.cock at googlemail.com Wed Jan 26 15:45:56 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Jan 2011 15:45:56 +0000 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Wed, Jan 26, 2011 at 3:14 PM, Kevin Jacobs wrote: > Any objections/worries about converting the SFF writer to use the > sequential/incremental writer object interface? I know it looks > specialized for text formats, but It already uses Bio.SeqIO.Interfaces.SequenceWriter > ... I need to split large SFF files into many smaller ones > and would rather not materialize the whole thing. ?The SFF writer > code already allows for deferred writing of read counts and index > creation, so it looks to be only minor surgery. I don't understand what problem you are having with the SeqIO API. It should be quite happy to take a generator function, iterator, etc (as opposed to a list of SeqRecord objects which I assume is what you mean by "materialize the whole thing"). 
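A minimal sketch of that generator-based approach (the file names and the wanted set below are illustrative, not taken from the thread): a generator expression can be passed straight to Bio.SeqIO.write, so only one SeqRecord needs to be in memory at a time.

from Bio import SeqIO

wanted = set(["read_1", "read_2"])  # illustrative read identifiers
records = (rec for rec in SeqIO.parse("large.sff", "sff") if rec.id in wanted)
count = SeqIO.write(records, "subset.sff", "sff")
print "Wrote %i reads" % count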
> There doesn't seem to be an obvious API for obtaining such a writer > using the SeqIO interface. You can do that with: from Bio.SeqIO.SffIO import SffWriter > Am I missing something obvious? > Probably. You can divide a large SFF file into smaller SFF files via the high level Bio.SeqIO.parse/write interface. Personally I like to use generator expressions to do a filtering operation. Note if you want to divide a large SFF file while preserving the Roche XML manifest, things are a little more tricky. You should use the ReadRocheXmlManifest function in combination with the SffWriter. You can see an example of this in sff_filter_by_id.py, a tool I wrote for Galaxy - search for "Filter SFF by ID" here: http://community.g2.bx.psu.edu/ Peter From bioinformed at gmail.com Wed Jan 26 16:44:53 2011 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Wed, 26 Jan 2011 11:44:53 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Wed, Jan 26, 2011 at 10:45 AM, Peter Cock wrote: > On Wed, Jan 26, 2011 at 3:14 PM, Kevin Jacobs wrote: > > Any objections/worries about converting the SFF writer to use the > > sequential/incremental writer object interface? I know it looks > > specialized for text formats, but > > It already uses Bio.SeqIO.Interfaces.SequenceWriter > > Sorry-- was shooting from the hip. I meant a SequentialSequenceWriter. > > ... I need to split large SFF files into many smaller ones > > and would rather not materialize the whole thing. The SFF writer > > code already allows for deferred writing of read counts and index > > creation, so it looks to be only minor surgery. > > I don't understand what problem you are having with the SeqIO API. > It should be quite happy to take a generator function, iterator, etc > (as opposed to a list of SeqRecord objects which I assume is what > you mean by "materialize the whole thing"). The goal is to demultiplex a larger file, so I need a "push" interface, e.g.

out = dict(...)  # of SffWriters
for rec in SeqIO.parse(filename, 'sff-trim'):
    out[id(rec)].write_record(rec)

for writer in out.itervalues():
    writer.write_footer()

I could use a simple generator if I was merely filtering records, but the write_file interface would require more co-routine functionality than generators provide. > There doesn't seem to be an obvious API for obtaining such a writer > > using the SeqIO interface. > > You can do that with: > > from Bio.SeqIO.SffIO import SffWriter > > For my immediate need, this is fine. However, the more general API doesn't have a SeqIO.writer to get SequentialSequenceWriter objects. -Kevin From bugzilla-daemon at portal.open-bio.org Wed Jan 26 17:01:58 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 26 Jan 2011 12:01:58 -0500 Subject: [Biopython-dev] [Bug 3171] inconsistent representation of missing elements PDB.PDBIO.PDBIO and PDB.Atom.Atom In-Reply-To: Message-ID: <201101261701.p0QH1wSc024517@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3171 ------- Comment #4 from anaryin at gmail.com 2011-01-26 12:01 EST ------- (In reply to comment #3) > (In reply to comment #1) > > This problem is already fixed and pushed to the master branch.
> > > > Commit: > > https://github.com/biopython/biopython/blob/6c7ef358e5f93599ca165ce8e7b46261106e2b06/Bio/PDB/PDBIO.py > > > > That's just a white space change (from a branch merge): > > https://github.com/biopython/biopython/commit/6c7ef358e5f93599ca165ce8e7b46261106e2b06 > > Probably the fix you are looking for was applied earlier in the history. > True, sorry. https://github.com/biopython/biopython/commit/594526926f29411a83e996799afc8f010d4fd2e2 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From p.j.a.cock at googlemail.com Wed Jan 26 17:19:36 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Jan 2011 17:19:36 +0000 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Wed, Jan 26, 2011 at 4:44 PM, Kevin Jacobs wrote: > On Wed, Jan 26, 2011 at 10:45 AM, Peter Cock > wrote: >> >> On Wed, Jan 26, 2011 at 3:14 PM, Kevin Jacobs wrote: >> > Any objections/worries about converting the SFF writer to use the >> > sequential/incremental writer object interface? ?I know it looks >> > specialized for text formats, but >> >> It already uses Bio.SeqIO.Interfaces.SequenceWriter >> > > Sorry-- was shooting from the hip. ?I meant a SequentialSequenceWriter. > The file formats which use SequentialSequenceWriter have trivial (or no) header/footer, which require no additional arguments. The SFF file format has a non-trivial header which records flow space settings etc. Any write_header method would have to be SFF specific, likewise any write_footer method for the index and XML manifest. I don't see what you have in mind. In fact, looking at SffIO.py again now, I think the SffWriter's write_header and write_record method should be private with just write_file as a public method. >> > ... I need to split large SFF files into many smaller ones >> > and would rather not materialize the whole thing. ?The SFF writer >> > code already allows for deferred writing of read counts and index >> > creation, so it looks to be only minor surgery. >> >> I don't understand what problem you are having with the SeqIO API. >> It should be quite happy to take a generator function, iterator, etc >> (as opposed to a list of SeqRecord objects which I assume is what >> you mean by "materialize the whole thing"). > > The goal is to demultiplex a larger file, so I need a "push" interface. > ?e.g. > out = dict(...) # of SffWriters > for rec in SeqIO(filename,'sff-trim'): > ??out[id(read)].write_record(rec) > > for writer in out.itervalues(): > ??writer.write_footer() I don't think the above will work without some "magic" to record the SFF header (which currently would require using private attributes of the SffWriter objects) as done via its write_file method. Also you can't read in SFF files with "sff-trim" if you want to output them, since this discards all the flow space information. You have to use format "sff" instead. > I could use a simple generator if I was merely filtering records, but the > write_file interface would require more co-routine functionality than > generators provide. How many output files do you have? Assuming it is small I'd go for the simple solution of one loop over the input SFF file for each output file. A variation on this would be to make a list of read IDs for each output file, then use the Bio.SeqIO.index for random access to the records to get the records, e.g. 
records = SeqIO.index(original_filename, "sff")
for filename in [...]:
    wanted = [...]  # some list or generator
    subset = (records[read_id] for read_id in wanted)
    SeqIO.write(subset, filename, "sff")

Otherwise look at itertools.tee for splitting the iterator if you really want to make a single pass through the original SFF file. >> > There doesn't seem to be an obvious API for obtaining such a >> > writer using the SeqIO interface. >> >> You can do that with: >> >> from Bio.SeqIO.SffIO import SffWriter >> > > For my immediate need, this is fine. However, the more general > API doesn't have a SeqIO.writer to get SequentialSequenceWriter > objects. For good reason - not all the writers use SequentialSequenceWriter, because for many file formats it is too narrow in scope. Peter From bioinformed at gmail.com Wed Jan 26 18:30:44 2011 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Wed, 26 Jan 2011 13:30:44 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Wed, Jan 26, 2011 at 12:19 PM, Peter Cock wrote: > I don't think the above will work without some "magic" to record the > SFF header (which currently would require using private attributes > of the SffWriter objects) as done via its write_file method. > > Also you can't read in SFF files with "sff-trim" if you want to output > them, since this discards all the flow space information. You have > to use format "sff" instead. > > Agreed-- shooting from the hip again. > > I could use a simple generator if I was merely filtering records, but the > > write_file interface would require more co-routine functionality than > > generators provide. > > How many output files do you have? Assuming it is small I'd go for > the simple solution of one loop over the input SFF file for each output > file. > > We're routinely multiplexing hundreds or thousands of samples per SFF file and using sequence barcodes to identify them. The number of outputs makes a one-pass solution much preferable. Anyhow, it seems that this has gone beyond the scope of generic Biopython, so I'm happy to make my modifications locally (and share the results if anyone is interested). We're currently using the Roche/454 sff tools, but they have known bugs and we have 5' and 3' adapters to consider. > > Thanks, -Kevin > From p.j.a.cock at googlemail.com Wed Jan 26 19:44:10 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 26 Jan 2011 19:44:10 +0000 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Wednesday, January 26, 2011, Kevin Jacobs wrote: > > How many output files do you have? Assuming it is small I'd go for > the simple solution of one loop over the input SFF file for each output > file. > > We're routinely multiplexing hundreds or thousands of samples per SFF file and using sequence barcodes to identify them. The number of outputs makes a one-pass solution much preferable. Anyhow, it seems that this has gone beyond the scope of generic Biopython, so I'm happy to make my modifications locally (and share the results if anyone is interested). We're currently using the Roche/454 sff tools, but they have known bugs and we have 5' and 3' adapters to consider. > > Thanks, -Kevin > I've got a better feel for what you are attempting to do now. I think one avenue would be to extend the write_header method to take some SFF specific arguments and add a write_footer method taking the optional Roche XML manifest which would (assuming it could seek) write the index block and update the header.
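A rough sketch of how that proposal might look from the caller's side. This is hypothetical: the barcodes table, the assign_barcode helper and the public write_record/write_footer calls stand in for the API being discussed here, they are not what SffWriter actually offered at the time.

from Bio import SeqIO
from Bio.SeqIO.SffIO import SffWriter

writers = {}
for barcode in barcodes:  # barcodes: assumed collection of expected barcode sequences
    writers[barcode] = SffWriter(open("sample_%s.sff" % barcode, "wb"))

for rec in SeqIO.parse("multiplexed.sff", "sff"):
    writers[assign_barcode(rec)].write_record(rec)  # assign_barcode: assumed helper

for writer in writers.itervalues():
    writer.write_footer()  # proposed method: back-fill read counts, index and header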
All this may not make much sense without looking at the code and the SFF format spec. I'm currently looking at trimming 5' and 3' PCR primer sequences - which could equally be used for barcodes etc. I'd probably wrap this as a Galaxy tool (using Biopython). Peter From bioinformed at gmail.com Wed Jan 26 23:24:51 2011 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Wed, 26 Jan 2011 18:24:51 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Wed, Jan 26, 2011 at 2:44 PM, Peter Cock wrote: > I've got a better feel for what you are attempting to do now. I think > one avenue would be to extend the write_header method to take some SFF > specific arguments and add a write_footer method taking the optional > Roche XML manifest which would (assuming it could seek) write the > index block and update the header. All this may not make much sense > without looking at the code and the SFF format spec. > > This is essentially what I'm doing. The index and manifest are written after the flow records, so this approach is quite feasible. > I'm currently looking at trimming 5' and 3' PCR primer sequences - > which could equally be used for barcodes etc. I'd probably wrap this > as a Galaxy tool (using Biopython). > > I have 90% of such a tool written. I use a banded Smith-Waterman alignment to match barcodes and generic PCR adapters/consensus sequence to ensure that adapters and barcodes can be detected at both ends of reads. -Kevin From p.j.a.cock at googlemail.com Thu Jan 27 13:32:45 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Jan 2011 13:32:45 +0000 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Wed, Jan 26, 2011 at 11:24 PM, Kevin wrote: > On Wed, Jan 26, 2011 at 2:44 PM, Peter wrote: >> >> I'm currently looking at trimming 5' and 3' PCR primer sequences - >> which could equally be used for barcodes etc. I'd probably wrap this >> as a Galaxy tool (using Biopython). >> > > I have 90% of such a tool written. I use a banded Smith-Waterman > alignment to match barcodes and generic PCR adapters/consensus > sequence to ensure that adapters and barcodes can be detected at > both ends of reads. Interesting - and yes, we do seem to have similar aims here. I have been doing ungapped alignments, allowing 0 or 1 (maybe in future 2) mismatches, working on getting this running at reasonable speed. Gapped alignments would be particularly important in 454 reads with homopolymer errors, but most barcodes and PCR primers will avoid homopolymer runs so I don't expect this to be a common problem in this use case. Do you have good reasons to go to the expense of a gapped alignment? Peter From biopython at maubp.freeserve.co.uk Thu Jan 27 15:43:16 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 27 Jan 2011 15:43:16 +0000 Subject: [Biopython-dev] curious error from HMM unit test In-Reply-To: References: Message-ID: 2011/1/24 Peter : > 2011/1/24 Tiago Antão : >> Sorry, I am drenched in work (writing up my PhD thesis) and had no >> time to attend to this. Maybe it is the Jython version? It is still a >> release candidate. I think the one installed is RC2, I am going to >> upgrade to the new RC3 and try again. > > Not to worry Tiago - it's not your machine - it's one of mine (with > Jython 2.5.2 RC3). You've got more important things right now ;) > > It looks like we may have found a bug worth reporting to the > Jython guys. I'll try and work out if I can reproduce it "by hand" > rather than just via buildbot.
That turned out to be easy enough, logged in as the buildslave, got the latest Biopython code with git, did "jython setup.py install", switched to the test directory and: $ jython test_HMMCasino.py Training with the Standard Trainer... Training with Baum-Welch... # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00002aaaab550400, pid=16570, tid=1106139456 # # JRE version: 6.0_17-b17 # Java VM: OpenJDK 64-Bit Server VM (14.0-b16 mixed mode linux-amd64 ) # Derivative: IcedTea6 1.7.5 # Distribution: Custom build (Wed Oct 13 13:04:40 EDT 2010) # Problematic frame: # j Bio.HMM.Trainer$py.update_emissions$12(Lorg/python/core/PyFrame;Lorg/python/core/ThreadState;)Lorg/python/core/PyObject;+555 # # An error report file with more information is saved as: # /home/buildslave/repositories/biopython/Tests/hs_err_pid16570.log # # If you would like to submit a bug report, please include # instructions how to reproduce the bug and visit: # http://icedtea.classpath.org/bugzilla # /home/buildslave/bin/jython: line 271: 16570 Aborted "${JAVA_CMD[@]}" $JAVA_OPTS "${java_args[@]}" -Dpython.home="$JYTHON_HOME" -Dpython.executable="$PRG" org.python.util.jython $JYTHON_OPTS "$@" So, we've had several outcomes (note the build number doesn't increase with git changes, some of these are rebuilds of older code): Build - git revision - outcome 0 - ? - success 1 - ? - success 2 - ? - success 3 - ? - success 4 - ? - success 5 - ? - success 6 - ? - UnboundLocalError 7 - ? - success 8 - ? - success 9 - ? - success 10 - ? - success 11 - ? - success 12 - ? - UnboundLocalError 13 - ? - success 14 - 5c729927d79a9f22b89d8a6f794865c8e1209ed5 - success (At this point Java was updated on this machine.) 15 - cc6842e0f79178af6bf9f32ad6ac3025685f55d1 - timeout 16 - cc6842e0f79178af6bf9f32ad6ac3025685f55d1 - fatal error detect by Java 17 - 215d8a37e20b50491613cc153bdf366d875cf251 - fatal error detect by Java 18 - 215d8a37e20b50491613cc153bdf366d875cf251 - fatal error detect by Java 19 - cc6842e0f79178af6bf9f32ad6ac3025685f55d1 - fatal error detect by Java 20 - 215d8a37e20b50491613cc153bdf366d875cf251 - fatal error detect by Java 21 - f36daaf7dada756822c1040cdb1a74ae0794469d - fatal error detect by Java 22 - 9ec46f981a1fb7f97eaee4a2a01ad8bb3297234b - fatal error detect by Java (At this point I removed Biopython from jython's site-packages, just in case that was conflicting with the un-installed builds being tested. No change...) 23 - 9ec46f981a1fb7f97eaee4a2a01ad8bb3297234b - fatal error detect by Java (for the next builds, I went back to revision for last success, build 14, and build 0) 24 - 5c729927d79a9f22b89d8a6f794865c8e1209ed5 - fatal error detect by Java 25 - 5c729927d79a9f22b89d8a6f794865c8e1209ed5 - fatal error detect by Java 26 - b61cb9d34b24a2e24b0b95453b2707f040d44d89 - fatal error detect by Java I'm pretty convinced that the switch from an occasional UnboundLocalError to a repeatable fatal error is down to the change in Java (although why build 15 timed out rather than triggered the fatal error is curious). $ sudo grep java /var/log/yum.log Jan 21 14:18:29 Installed: tzdata-java-2010l-1.el5.x86_64 Jan 21 14:20:31 Updated: 1:java-1.6.0-openjdk-1.6.0.0-1.16.b17.el5.x86_64 Jan 21 14:20:46 Updated: 1:java-1.6.0-openjdk-devel-1.6.0.0-1.16.b17.el5.x86_64 $ jython Jython 2.5.2rc3 (Release_2_5_2rc3:7184, Jan 10 2011, 22:54:57) [OpenJDK 64-Bit Server VM (Sun Microsystems Inc.)] on java1.6.0_17 Type "help", "copyright", "credits" or "license" for more information. 
>>> import sys >>> sys.version_info (2, 5, 2, 'candidate', 3) >>> print sys.version 2.5.2rc3 (Release_2_5_2rc3:7184, Jan 10 2011, 22:54:57) [OpenJDK 64-Bit Server VM (Sun Microsystems Inc.)] So this looks like a bug in either Jython 2.5.2 RC3 and/or Java under CentOS. Peter From biopython at maubp.freeserve.co.uk Thu Jan 27 16:13:32 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 27 Jan 2011 16:13:32 +0000 Subject: [Biopython-dev] curious error from HMM unit test In-Reply-To: References: Message-ID: On Thu, Jan 27, 2011 at 3:43 PM, Peter wrote: > > So this looks like a bug in either Jython 2.5.2 RC3 and/or Java under CentOS. > Same fatal error from test_HMMCasino.py using Jython 2.5.1 on this machine. This could be two bugs (one in Java causing the fatal error, and one in Jython causing the intermittent UnboundLocalError). I think for now I'll just disable this test on Jython, https://github.com/biopython/biopython/commit/8a16e24a1e1076a93957b61b28e933a0cf65d49f Peter From bioinformed at gmail.com Fri Jan 28 12:14:39 2011 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Fri, 28 Jan 2011 07:14:39 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: On Thu, Jan 27, 2011 at 8:32 AM, Peter Cock wrote: > On Wed, Jan 26, 2011 at 11:24 PM, Kevin wrote: > > On Wed, Jan 26, 2011 at 2:44 PM, Peter wrote: > >> I'm currently looking at trimming 5' and 3' PCR primer sequences - > >> which could equally be used for barcodes etc. I'd probably wrap this > >> as a Galaxy tool (using Biopython). > > > > I have 90% of such a tool written. I use a banded Smith-Waterman > > alignment to match barcodes and generic PCR adapters/consensus > > sequence to ensure that adapters and barcodes can be detected at > > both ends of reads. > > Interesting - and yes, we do seem to have similar aims here. I have > been doing ungapped alignments, allowing 0 or 1 (maybe in future 2) > mismatches, working on getting this running at reasonable speed. > For just 5' barcode detection, I am using a memoized scheme that computes anchored alignments and then stores the result in a hash table (match/mismatch, edit distance). This approach allows me to reject barcodes with too small an edit distance to the next best candidate. It is reasonably fast for our fairly long 454 barcode set (10-mers), though I do have an optional Cython version of the edit distance routine. The pure-Python version is pretty zippy and can decode a 454 run in a minute or two. > Gapped alignments would be particularly important in 454 reads > with homopolymer errors, but most barcodes and PCR primers > will avoid homopolymer runs so I don't expect this to be a common > problem in this use case. Do you have good reasons to go to the > expense of a gapped alignment? > > When only trimming short 5' adapters, a gapped alignment may be a bit overkill. However, for our short-amplicon libraries we have a bit of a challenge using simpler approaches. Instead of hand-waving, here are the gory details: We're using Fluidigm Access Arrays to generate libraries and sequence fragments with the following structure:

Forward: [key][barcode][cs1][target][cs2-rc]
Reverse: [key][barcode][cs2][target][cs1-rc]

where on, e.g., 454 and Ion Torrent:

[key] = TCAG
[barcode] = non-homopolymeric 10-mer (up to 192 of them, min. edit distance 4)
[cs1] = 22-mer generic PCR primer (distinct from cs2)
[cs2] = 22-mer generic PCR primer (distinct from cs1)
[target] = ~30-150-mer genomic DNA target

("-rc" denoting the reverse complement). Our designs for Illumina are a bit different, so I won't go into those right now.
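A toy version of that memoised decoding idea, for illustration only (this is not Kevin's GLU code; the function names, the prefix length and the minimum-gap threshold are all made up):

def edit_distance(a, b):
    # plain dynamic-programming Levenshtein distance
    prev = range(len(b) + 1)
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

cache = {}

def assign_barcode(read_seq, barcodes, length=10, min_gap=2):
    # memoise on the first few bases so repeated prefixes are only aligned once
    prefix = str(read_seq[:length])
    if prefix not in cache:
        hits = sorted((edit_distance(prefix, bc), bc) for bc in barcodes)
        best_dist, best_bc = hits[0]
        runner_up_dist = hits[1][0]
        # reject the call if the runner-up barcode is too close to the best hit
        cache[prefix] = best_bc if runner_up_dist - best_dist >= min_gap else None
    return cache[prefix]

Kevin's description above mentions anchored alignments rather than the fixed-length prefix used here; the memoisation and the rejection of ambiguous calls are the same idea.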
I use the procedure outlined before to determine the barcode. Then I compute left-anchored gapped alignments between the read and constructs that represent perfect matches to the targets to find the most likely boundaries to trim the 5' and 3' elements from the target sequence. I'm in the process of adding position specific scoring and gap penalties, since this adds virtually no computational cost and improves the boundary detection. The results go to a genotype calling algorithm to classify known and novel variants. This approach is a bit overkill for some sequencing platforms with shorter reads (e.g. Illumina or IonTorrent with 100 bp reads), but on 454 (and soon PacBio, we hope) we routinely sequence through the targets and into the 3' elements and have to trim. -Kevin From chapmanb at 50mail.com Fri Jan 28 12:34:18 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 28 Jan 2011 07:34:18 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: References: Message-ID: <20110128123418.GD7866@sobchak.mgh.harvard.edu> Kevin and Peter; I'm really enjoying this discussion -- thanks for talking this through here. > For just 5' barcode detection, I am using a memoized scheme that computes > anchored alignments and then stores the result in a hash table > (match/mismatch, edit distance). This approach allows me to reject barcodes > with too small an edit distance to the next best candidate. It is > reasonably fast for our fairly long 454 barcode set (10-'mers), though I do > have an optional Cython version of the edit distance routine. The > pure-Python version is pretty zippy and can decode a 454 run in a minute or > two. This sounds like a nice approach. Do you have code available or is it not packaged up yet? I wrote up a barcode detector, remover and sorter for our Illumina reads. There is nothing especially tricky in the implementation: it looks for exact matches and then checks for approximate matches, with gaps, using pairwise2: https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py The "best_match" function could be replaced with different implementations, using the rest of the script as scaffolding to do all of the other sorting, trimming and output. Brad From bioinformed at gmail.com Fri Jan 28 13:54:47 2011 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Fri, 28 Jan 2011 08:54:47 -0500 Subject: [Biopython-dev] Sequential SFF IO In-Reply-To: <20110128123418.GD7866@sobchak.mgh.harvard.edu> References: <20110128123418.GD7866@sobchak.mgh.harvard.edu> Message-ID: On Fri, Jan 28, 2011 at 7:34 AM, Brad Chapman wrote: > Kevin and Peter; > I'm really enjoying this discussion -- thanks for talking this > through here. > > > For just 5' barcode detection, I am using a memoized scheme that computes > > anchored alignments and then stores the result in a hash table > > (match/mismatch, edit distance). This approach allows me to reject > barcodes > > with too small an edit distance to the next best candidate. It is > > reasonably fast for our fairly long 454 barcode set (10-'mers), though I > do > > have an optional Cython version of the edit distance routine. The > > pure-Python version is pretty zippy and can decode a 454 run in a minute > or > > two. > > This sounds like a nice approach. Do you have code available or is > it not packaged up yet? > It is still under development with some of the refinements I mentioned in a non-public branch and have not percolated out to my Google code version. 
However, a previous version is available from: http://code.google.com/p/glu-genetics/source/browse/glu/modules/seq/unbarcode.py# > I wrote up a barcode detector, remover and sorter for our Illumina > reads. There is nothing especially tricky in the implementation: it > looks for exact matches and then checks for approximate matches, > with gaps, using pairwise2: > > > https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py > > The "best_match" function could be replaced with different > implementations, using the rest of the script as scaffolding to do > all of the other sorting, trimming and output. > > Nice! I didn't know about pairwise2, though I figured BioPython would have something to that effect. -Kevin From anaryin at gmail.com Sat Jan 29 00:51:35 2011 From: anaryin at gmail.com (João Rodrigues) Date: Sat, 29 Jan 2011 01:51:35 +0100 Subject: [Biopython-dev] Features of the GSOC branch ready to be merged In-Reply-To: References: Message-ID: Also refactored a bit and extended the remove_disordered_atoms after a discussion with a colleague in the lab. Pushed as well to my branch pdb_enhancements. From p.j.a.cock at googlemail.com Sat Jan 29 23:28:21 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 29 Jan 2011 23:28:21 +0000 Subject: [Biopython-dev] [GitHub] Bug 2947 viterbi Message-ID: Hi all, I'm forwarding a pull request via github (in future should we try to have these sent to the dev list automatically?). Does anyone familiar with HMMs want to look at this? Thanks, Peter ---------- Forwarded message ---------- From: GitHub Date: Sat, Jan 29, 2011 at 10:34 PM Subject: [GitHub] Bug 2947 viterbi [biopython/biopython GH-1] To: p.j.a.cock at googlemail.com pgarland wants someone to pull from pgarland:bug-2947-viterbi: Hello, I think this fixes bug 2947. There were 2 errors in how the state sequence was calculated. The first occurs at the beginning of the sequence. viterbi_probs[(state_letters[0], -1)] was initialized to 1 and viterbi_probs[(state_letter, -1)] to 0 for all state letters other than the zeroth. This is how it is described in Biological Sequence Analysis by Durbin, et al, but it doesn't work for the code as written because the code doesn't provide for a particular begin state. By initializing the zeroth state letter to 1, and all the others to 0, you're starting off by assigning a higher probability to a state sequence that begins with the zeroth state letter. I fixed this error by also setting viterbi_probs[(state_letters[0], -1)] to 0, so that all possible initial states are equally probable. There's a second error in the Viterbi termination code. The algorithm described in Durbin et al allows for modeling a particular end state. Since the code as written doesn't provide for specifying an end state, the termination code miscalculates the sequence probability. The pseudocode in Durbin et al confusingly labels the end state as "0", at least in the printing I have, and this seems to have been carried over into the biopython code, where the zeroth state_letter is whichever one is first in the list. The code as written calculated the probability of the discovered state sequence multiplied by the probability of transitioning from the sequence's last element to the zeroth state named in state_letters, when it should just calculate the probability of the discovered state sequence. I fixed this by deleting the lines that were intended to account for the transition to an end state.
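Restating the first fix in toy form (hypothetical state letters; the real change is in the linked pull request, not reproduced here):

state_letters = ("F", "L")  # e.g. a fair and a loaded state

# before the patch: the first state letter is silently favoured at position -1
viterbi_probs = dict(((letter, -1), 0) for letter in state_letters)
viterbi_probs[(state_letters[0], -1)] = 1

# after the patch: every state letter starts from the same value, so the
# recursion's maximisation no longer prefers paths beginning in state_letters[0]
for letter in state_letters:
    viterbi_probs[(letter, -1)] = 0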
It could be useful to specify particular begin and end states, but I believe this patch should give correct results for all cases that don't need that ability. ~Phillip View Pull Request: https://github.com/biopython/biopython/pull/1 From bugzilla-daemon at portal.open-bio.org Mon Jan 31 16:50:15 2011 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 31 Jan 2011 11:50:15 -0500 Subject: [Biopython-dev] [Bug 2947] Bio.HMM calculates wrong viterbi path In-Reply-To: Message-ID: <201101311650.p0VGoFsS028981@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2947 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2011-01-31 11:50 EST ------- Should be fixed now, can anyone confirm this? https://github.com/biopython/biopython/commit/0502ba205bd227655cd5229f5adad63bf9813b23 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.