From tallpaulinjax at yahoo.com Sun Nov 1 14:50:31 2009
From: tallpaulinjax at yahoo.com (Paul B)
Date: Sun, 1 Nov 2009 11:50:31 -0800 (PST)
Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
Message-ID: <882158.31004.qm@web30705.mail.mud.yahoo.com>

Hi,

I'm a computer science guy trying to figure out some chemistry logic to support my thesis, so bear with me! :-) To sum it up, I'm not sure MMCIFParser is handling ATOM and MODEL records correctly because of this code in MMCIFParser:

            if fieldname=="HETATM":
                hetatm_flag="H"
            else:
                hetatm_flag=" "

This causes ATOM (and potentially MODEL) records to die as seen in the exception below (I think!)

My questions are:
1. Am I correct that the current code is insufficient?
2. What additional logic beyond just recognizing whether it's a HETATM, ATOM or MODEL record needs to be added?

Thanks!

Paul

Background:
I understand MMCIFlex.py et cetera is commented out in the Windows setup.py package due to difficulties compiling it. So I re-wrote MMCIFlex strictly in Python to emulate what I THINK the original MMCIFlex did. My version processes a .cif file one line at a time (readline()) then passes tokens back to MMCIF2Dict at each call to get_token(). That seems to work fine for unit testing of my MMCIFlex and MMCIFDict which I had to slightly re-write (to ensure it handled how I passed SEMICOLONS line back etc).

However when I try and use this with MMCIFParser against the 2beg .cif file which has no HETATM records and, as I understand the definition, no disordered atoms, I get:

C:\Python25\Lib\site-packages\Bio\PDB\StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 0.
  PDBConstructionWarning)
C:\Python25\Lib\site-packages\Bio\PDB\StructureBuilder.py:122: PDBConstructionWarning: WARNING: Residue (' ', 17, ' ') redefined at line 0.
  PDBConstructionWarning)
Traceback (most recent call last):
  File "MMCIFParser.py", line 140, in
    structure=p.get_structure("test", filename)
  File "MMCIFParser.py", line 23, in get_structure
    self._build_structure(structure_id)
  File "MMCIFParser.py", line 88, in _build_structure
    icode)
  File "C:\Python25\lib\site-packages\Bio\PDB\StructureBuilder.py", line 148, in init_residue
    % (resname, field, resseq, icode))
PDBExceptions.PDBConstructionException: Blank altlocs in duplicate residue LEU (' ', 17, ' ')

Basically what I think MIGHT be happening is MMCIFParser is currently only handling HETATM records, when some other kind of record comes in (ATOM, MODEL) it is treated incorrectly. See below.

MMCIFParser.py
    def _build_structure(self, structure_id):
        ....
        fieldname_list=mmcif_dict["_atom_site.group_PDB"]
        ....
        for i in xrange(0, len(atom_id_list)):
            ...
            altloc=alt_list[i]
            if altloc==".":
                altloc=" "
            ...
            fieldname=fieldname_list[i]
            #How are ATOM and MODEL records handled?
            if fieldname=="HETATM":
                hetatm_flag="H"
            else:
                hetatm_flag=" "
            if current_chain_id!=chainid:
                current_chain_id=chainid
                structure_builder.init_chain(current_chain_id)
                current_residue_id=resseq
                icode, int_resseq=self._get_icode(resseq)
                #This is line 87-88 in the real file
                structure_builder.init_residue(resname, hetatm_flag, int_resseq,
                    icode)

Class StructureBuilder:
    ...
    def init_residue(self, resname, field, resseq, icode):
        if field!=" ":
            if field=="H":
                # The hetero field consists of H_ + the residue name (e.g. H_FUC)
                field="H_"+resname
        res_id=(field, resseq, icode)
        ...
        #This line will get executed for any non-HETATM record (ie ATOM Or MODEL)
        #because in MMCIFParser, if it wasn't a HETATM, then it's a ' '
        if field==" ":
            if self.chain.has_id(res_id):
=======>But there are no point mutations in 2beg that I know of. Shouldn't be here!
                # There already is a residue with the id (field, resseq, icode).
                # This only makes sense in the case of a point mutation.
                if __debug__:
                    warnings.warn("WARNING: Residue ('%s', %i, '%s') "
                                  "redefined at line %i."
                                  % (field, resseq, icode, self.line_counter),
                                  PDBConstructionWarning)
                duplicate_residue=self.chain[res_id]
                if duplicate_residue.is_disordered()==2:
                    # The residue in the chain is a DisorderedResidue object.
                    # So just add the last Residue object.
                    if duplicate_residue.disordered_has_id(resname):
                        # The residue was already made
                        self.residue=duplicate_residue
                        duplicate_residue.disordered_select(resname)
                    else:
                        # Make a new residue and add it to the already
                        # present DisorderedResidue
                        new_residue=Residue(res_id, resname, self.segid)
                        duplicate_residue.disordered_add(new_residue)
                        self.residue=duplicate_residue
                        return
                else:
                    # Make a new DisorderedResidue object and put all
                    # the Residue objects with the id (field, resseq, icode) in it.
                    # These residues each should have non-blank altlocs for all their atoms.
                    # If not, the PDB file probably contains an error.
====>               #This is the line throwing the exception, but we shouldn't be here!
                    if not self._is_completely_disordered(duplicate_residue):
                        # if this exception is ignored, a residue will be missing
                        self.residue=None
                        raise PDBConstructionException(\
                            "Blank altlocs in duplicate residue %s ('%s', %i, '%s')" \
                            % (resname, field, resseq, icode))
                    self.chain.detach_child(res_id)
                    new_residue=Residue(res_id, resname, self.segid)
                    disordered_residue=DisorderedResidue(res_id)
                    self.chain.add(disordered_residue)
                    disordered_residue.disordered_add(duplicate_residue)
                    disordered_residue.disordered_add(new_residue)
                    self.residue=disordered_residue
                    return
        residue=Residue(res_id, resname, self.segid)
        self.chain.add(residue)
        self.residue=residue

From biopython at maubp.freeserve.co.uk Sun Nov 1 16:28:50 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 1 Nov 2009 21:28:50 +0000
Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
In-Reply-To: <882158.31004.qm@web30705.mail.mud.yahoo.com>
References: <882158.31004.qm@web30705.mail.mud.yahoo.com>
Message-ID: <320fb6e00911011328n57880ddmd47677c9b5ce597f@mail.gmail.com>

On Sun, Nov 1, 2009 at 7:50 PM, Paul B wrote:
>
> Hi,
>
> I'm a computer science guy trying to figure out some chemistry logic
> to support my thesis, so bear with me! :-) To sum it up, I'm not sure
> MMCIFParser is handling ATOM and MODEL records correctly
> because of this code in MMCIFParser:
>             if fieldname=="HETATM":
>                 hetatm_flag="H"
>             else:
>                 hetatm_flag=" "
> This causes ATOM (and potentially MODEL) records to die as seen
> in the exception below (I think!)

I'll answer that below.

> My questions are:
> 1. Am I correct that the current code is insufficient?
> 2. What additional logic beyond just recognizing whether it's a
> HETATM, ATOM or MODEL record needs to be added?
>
> Thanks!
>
> Paul
>
>
> Background:
> I understand MMCIFlex.py et cetera is commented out in the
> Windows setup.py package due to difficulties compiling it.

It is commented out (on all platforms) because we don't know how to get setup.py to detect if flex and the relevant headers are installed, which we would need to compile the code. I'm not sure how this would work on Windows with an installer (i.e. what is a run time dependency versus compile time).

> So I re-wrote MMCIFlex strictly in Python to emulate what

Now that would be very handy (IMO), if you can get it working. Have you benchmarked it against the flex code?

> I THINK the original MMCIFlex did. My version processes
> a .cif file one line at a time (readline()) then passes tokens
> back to MMCIF2Dict at each call to get_token(). That
> seems to work fine for unit testing of my MMCIFlex and
> MMCIFDict which I had to slightly re-write (to ensure it
> handled how I passed SEMICOLONS line back etc).
>
> However when I try and use this with MMCIFParser
> against the 2beg .cif file which has no HETATM records
> and, as I understand the definition, no disordered atoms
> I get:
>
> ...
>
> Basically what I think MIGHT be happening is MMCIFParser
> is currently only handling HETATM records, when some other
> kind of record comes in (ATOM, MODEL) it is treated
> incorrectly. See below.
>
> ...

Have you been able to test the flex code? If not, could you give me a tiny script using the 2beg cif file which should work? If that works, then the problem is in your flex replacement code.

Peter

From tallpaulinjax at yahoo.com Mon Nov 2 08:21:14 2009
From: tallpaulinjax at yahoo.com (Paul B)
Date: Mon, 2 Nov 2009 05:21:14 -0800 (PST)
Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
In-Reply-To: <320fb6e00911011328n57880ddmd47677c9b5ce597f@mail.gmail.com>
Message-ID: <621165.33564.qm@web30706.mail.mud.yahoo.com>

I'll use the conventional response technique in future emails! :-)

Hi Peter,

1. "Did you mean to not CC the list?": Sorry, I replied to your email address instead of the CC: address!
2. Peter: "I should be able to run the flex code and your new code side by side, for testing and profiling. Not sure when I'll find the time exactly, but we'll see. Examples will help as while I know plenty about PDB files, I've not used CIF at all": I'd be glad to run the tests myself as well and I have the time! :-) But without the flex module installed and operational the only way I can think of is with pickle'd .cif dicts.
3. Peter: "P.S. Are you OK with making this contribution under the Biopython license?" Absolutely I'd be glad to contribute to biopython!

This was in response to my followup email to Peter:

"Hi Peter:

Paul: So I re-wrote MMCIFlex strictly in Python to emulate (the lex based MMCIFlex)
Peter: Now that would be very handy (IMO), if you can get it working. Have you benchmarked it against the flex code? Have you been able to test the flex code? If not, could you give me a tiny script using the 2beg cif file which should work? If that works, then the problem is in your flex replacement code.
Paul: It already works, but I have no way to benchmark it against the flex code myself. Perhaps someone could pickle a half dozen PDB .cif files and send me the resultant files? I can then run a test against each one. I'll also clean up the code on both the new MMCIFlex.py as well as the changed MMCIF2Dict.py and send them to you most probably by today. Each will have a __main__ method for testing."
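
The pickle-based comparison proposed above could look roughly like the following. This is only a sketch under stated assumptions: the two dictionaries are small hypothetical stand-ins for what the flex-based and pure-Python tokenizers would each hand to MMCIF2Dict, and `compare_dicts` is an illustrative helper name, not part of Biopython.

```python
import pickle


def compare_dicts(reference, candidate):
    """Return (missing, extra, different) key lists for two parsed dicts."""
    missing = sorted(k for k in reference if k not in candidate)
    extra = sorted(k for k in candidate if k not in reference)
    different = sorted(
        k for k in reference
        if k in candidate and reference[k] != candidate[k]
    )
    return missing, extra, different


# Hypothetical stand-ins for the two parsers' output on the same .cif file.
flex_dict = {"_entry.id": "2BEG", "_atom_site.group_PDB": ["ATOM", "ATOM"]}
python_dict = {"_entry.id": "2BEG", "_atom_site.group_PDB": ["ATOM", "HETATM"]}

# Round-trip through pickle, as if the reference dict had been mailed as a file.
reference = pickle.loads(pickle.dumps(flex_dict))

missing, extra, different = compare_dicts(reference, python_dict)
print(missing, extra, different)
```

Reporting the differing keys rather than a single True/False makes it easier to see which mmCIF category a replacement tokenizer gets wrong.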

From tallpaulinjax at yahoo.com Mon Nov 2 17:03:51 2009
From: tallpaulinjax at yahoo.com (Paul B)
Date: Mon, 2 Nov 2009 14:03:51 -0800 (PST)
Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
Message-ID: <767823.60315.qm@web30702.mail.mud.yahoo.com>

Hi Peter,

I have attached drafts of MMCIFlex.py and MMCIFParser.py. They have __main__ methods that perform decent testing. On my system, I have replaced their same-named counterparts in the appropriate folders. Please note, however, this version of MMCIFlex.py and MMCIFParser.py must work together as a pair! So, I don't know how you guys handle that: give them new names, or replace old files?

I can't test them further right now because I believe MMCIFParser needs corrections. For example, the PDBParser.py calls the following methods in its StructureBuilder object:

structure_builder.init_structure
structure_builder.set_header
structure_builder.set_line_counter
structure_builder.init_model
structure_builder.init_seg
structure_builder.init_chain
structure_builder.init_residue
structure_builder.init_atom
structure_builder.set_anisou
structure_builder.set_siguij
structure_builder.set_sigatm

However, MMCIFParser only calls:

structure_builder.init_structure
structure_builder.init_model
structure_builder.init_seg
structure_builder.init_chain
structure_builder.init_residue
structure_builder.init_atom
structure_builder.set_anisou

leaving out calls to:

structure_builder.set_header
structure_builder.set_line_counter
structure_builder.set_siguij
structure_builder.set_sigatm

I believe the last two might be important for some people, I don't know about the first two whether they are housekeeping, etc... still checking. So I am still looking into MMCIFParser, in particular why it's bombing creating a structure on 2beg.cif when PDBParser correctly works on pdb2beg.ent.

Paul

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: MMCIF2Dict.py
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: MMCIFlex.py
URL:

From bugzilla-daemon at portal.open-bio.org Tue Nov 3 08:20:14 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Nov 2009 08:20:14 -0500
Subject: [Biopython-dev] [Bug 2929] NCBIXML PSI-Blast parser should gather all information from XML blastgpg output
In-Reply-To:
Message-ID: <200911031320.nA3DKErZ024365@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2929

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-03 08:20 EST -------
(In reply to comment #3)
> (In reply to comment #2)
> > What specifically is our parser failing to extract from this example PSI
> > BLAST XML file?
>
> (Sorry, I've been away)
> Well, currently the code tries to get several pieces of information from the
> Blast.Record.PSIBlast (brecord):
>
> brecord.converged

There is a CONVERGED line in the XML we should be able to use here. I don't recall seeing this in blastpgp output from older versions of BLAST.

> brecord.query
> brecord.query_letters

Those work (query and query_letters).

> brecord.rounds
> brecord.rounds.alignments
> brecord.rounds.alignments.title
> brecord.rounds.alignments.hsps

Those also work, not via rounds but as separate BLAST record objects. See mailing list discussion regarding PSI-BLAST and multiple BLAST queries.

> then in the hsps:
> hsp.identities
> hsp.positives
> hsp.query
> hsp.sbjct
> hsp.match
> hsp.expect
> hsp.query_start
> hsp.query_end
> hsp.sbjct_start
> hsp.sbjct_end

Again, those are all parsed fine.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From tallpaulinjax at yahoo.com Tue Nov 3 11:36:08 2009
From: tallpaulinjax at yahoo.com (Paul B)
Date: Tue, 3 Nov 2009 08:36:08 -0800 (PST)
Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
Message-ID: <468854.15801.qm@web30708.mail.mud.yahoo.com>

Hi,

I have found the reason why MMCIFParser is dying. It has no provision for more than one model, so when a second model comes around with the same chain and residue the program throws an exception.

I will be joining github to submit the required changes. I haven't used github before, and this is my first open source project so please give me a few days to acclimate.

My mods so far are as follows in MMCIFParser.py (and require the MMCIFlex.py and MMCIF2Dict.py files I will be submitting via github, and have submitted to Peter privately.)

Change the __doc__ setting:

#Mod by Paul T. Bathen to reflect MMCIFlex built solely in Python
__doc__="mmCIF parser (implemented solely in Python, no lex/flex/C code needed)"

Insert the following model_list line:

        occupancy_list=mmcif_dict["_atom_site.occupancy"]
        fieldname_list=mmcif_dict["_atom_site.group_PDB"]
        #Added by Paul T. Bathen Nov 2009
        model_list=mmcif_dict["_atom_site.pdbx_PDB_model_num"]
        try:

Make the following changes:

        #Modified by Paul T. Bathen Nov 2009: comment out this line
        #current_model_id=0
        structure_builder=self._structure_builder
        structure_builder.init_structure(structure_id)
        #Modified by Paul T. Bathen Nov 2009: comment out this line
        #structure_builder.init_model(current_model_id)
        structure_builder.init_seg(" ")
        #Added by Paul T. Bathen Nov 2009
        current_model_id = -1

Make the following changes in the for loop:

            #Note by Paul T. Bathen: should this include the HOH and WAT stmts in PDBParser?
            if fieldname=="HETATM":
                hetatm_flag="H"
            else:
                hetatm_flag=" "

            #Added by Paul T. Bathen Nov 2009
            model_id = model_list[i]
            if current_model_id != model_id:
                current_model_id = model_id
                structure_builder.init_model(current_model_id)
            #end of addition

After these changes took place, and with the new MMCIFlex and MMCIF2Dict in place, I was able to parse and test 2beg.cif and pdb2beg.ent and both parsed with the same number of models, chains, and residues.

The only difference is the PDBParser incorrectly states the first model as 0 when it should be 1: there is an explicit MODEL line in pdb2beg.ent. So all the models are off by one in 2beg when parsed by PDBParser.py. I can look into the bug in PDBParser.py and submit it if desired?

Paul

From kellrott at gmail.com Tue Nov 3 11:46:57 2009
From: kellrott at gmail.com (Kyle Ellrott)
Date: Tue, 3 Nov 2009 08:46:57 -0800
Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence
In-Reply-To: <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com>
References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com> <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com>
Message-ID:

(Moving this thread to Biopython-dev)

I've hacked together some code, and tested it against the bacterial genome library I had on hand (of course, eukaryotic features will be more complicated, so will need to test against them next). Examples of 'exotic' feature location would be helpful.
I've posted the code below. I'll be moving it into my git fork, and add some testing. Any thoughts where it should go? It seems like it would best work as a SeqRecord method.
def FeatureIDGuess( feature ): id = "N/A" try: id = feature.qualifiers['locus_tag'][0] except KeyError: try: id = feature.qualifiers['plasmid'][0] except KeyError: pass return id def FeatureDescGuess( feature ): desc = "" try: desc=feature.qualifiers['product'][0] except KeyError: pass return desc def ExtractFeatureDNA( record, feature ): dna = None if len( feature.sub_features ): dnaStr = "" for subFeat in feature.sub_features: if subFeat.location_operator=='join': subSeq = ExtractFeatureDNA( record, subFeat ) dnaStr += subSeq.seq dna = Seq( str(dnaStr), IUPAC.unambiguous_dna) if ( feature.strand == -1 ): dna = dna.reverse_complement() else: start_pos = feature.location.start.position end_pos = feature.location.end.position seqStr = record.seq[ start_pos:end_pos ] dna = Seq( str(seqStr), IUPAC.unambiguous_dna) if ( feature.strand == -1 and feature.location_operator != 'join' ): dna = dna.reverse_complement() outSeq = SeqRecord( dna, FeatureIDGuess( feature ) , description=FeatureDescGuess( feature ) ) return outSeq On Mon, Nov 2, 2009 at 2:30 PM, Peter wrote: > On Mon, Nov 2, 2009 at 9:31 PM, Kyle Ellrott wrote: > >> > >> You missed this thread earlier this month: > >> http://lists.open-bio.org/pipermail/biopython/2009-October/005695.html > >> > >> Are you on the dev mailing list? I was hoping to get a little discussion > >> going there, before moving over to the discussion list for more general > >> comment. > > > > I didn't need to do it when the original discussion came through, so it > got > > 'filtered' ;-) I guess if multiple people are asking the same question > > independently, it's probably a timely issue. > > > > I'll probably go ahead and pull the SeqRecord fork into my git fork and > > start playing around with it. > > Cool - sorry if the previous email was brusque - I was in the middle > of dinner preparation and shouldn't have been checking emails. 
>
> If you just want to try the sequence extraction for a SeqFeature,
> the code is on the trunk (as noted, as a function in a unit test).
> My SeqRecord github branch is looking at other issues.
>
> Peter
>

From biopython at maubp.freeserve.co.uk Tue Nov 3 12:09:37 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 3 Nov 2009 17:09:37 +0000
Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence
In-Reply-To:
References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com> <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com>
Message-ID: <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com>

On Tue, Nov 3, 2009 at 4:46 PM, Kyle Ellrott wrote:
> (Moving this thread to Biopython-dev)
>
> I've hacked together some code, and tested it against the bacterial genome
> library I had on hand (of course, eukariotic features will be more
> complicated, so will need to test against them next). Examples of 'exotic'
> feature location would be helpful.
> I've posted the code below. I'll be moving it into my git fork, and add
> some testing. Any thoughts where it should go? It seems like it would best
> work as a SeqRecord method.

i.e. Option (4) of this list of ideas?
http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006922.html

Peter

P.S.

def FeatureDescGuess( feature ):
    desc = ""
    try:
        desc=feature.qualifiers['product'][0]
    except KeyError:
        pass
    return desc

Could be just:

def FeatureDescGuess( feature ):
    return feature.qualifiers.get('product', [""])[0]

and therefore doesn't really need an entire function.
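
A minimal sketch of why the one-line form is safe, assuming the qualifiers behave like a plain dict that maps each key to a list of strings. The helper name feature_desc_guess and the sample values are illustrative only and do not come from the thread; a plain dict stands in for feature.qualifiers so the snippet runs without Biopython.

```python
def feature_desc_guess(qualifiers):
    """Return the first 'product' value, or "" when the key is absent."""
    # dict.get returns the supplied default ([""]) when 'product' is missing,
    # so the [0] index is always applied to a list and never to None.
    return qualifiers.get("product", [""])[0]


# Hypothetical qualifier dicts standing in for feature.qualifiers.
with_product = {"product": ["DNA polymerase III"], "locus_tag": ["b0001"]}
without_product = {"locus_tag": ["b0002"]}

print(repr(feature_desc_guess(with_product)))
print(repr(feature_desc_guess(without_product)))
```

The default only guards against a missing key. A key that is present but maps to an empty list would still raise IndexError, and a value stored as a bare string would return its first character.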

From biopython at maubp.freeserve.co.uk Tue Nov 3 12:13:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 3 Nov 2009 17:13:25 +0000
Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
In-Reply-To: <468854.15801.qm@web30708.mail.mud.yahoo.com>
References: <468854.15801.qm@web30708.mail.mud.yahoo.com>
Message-ID: <320fb6e00911030913u11cf6380q2a2bbc07b2b1863f@mail.gmail.com>

On Tue, Nov 3, 2009 at 4:36 PM, Paul B wrote:
>
> Hi,
>
> I have found the reason why MMCIParser is dying. It has no provision
> for more than one model, so when a second model comes around with
> the same chain and residue the program throws an exception.

Please file a bug report on bugzilla for that. I guess no-one has
tried NMR CIF data with the parser before (!).

> I will be joining github to submit the required changes. I haven't used
> github before, and this is my first open source project so please give
> me a few days to acclimate.

If you like - great. Otherwise we can manage with patches via an
enhancement bug on bugzilla.

> My mods so far are as follows in MMCIFParser.py (and require the
> MMCIFlex.py and MMCIF2Dict.py files I will be submitting via github,
> and have submitted to Peter privately.)

Actually, I think that ended up on the mailing list:
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006938.html

> The only difference is the PDBParser incorrectly states the first model as 0
> when it should be 1: there is an explicit MODEL line in pdb2beg.ent. So all
> the models are off by one in 2beg when parsed by PDBParser.py. I can
> look into the bug in PDBParser.py and submit it if desired?

Are you sure it should be 1 and not 0? Remember, Python counts from zero.

Peter From bugzilla-daemon at portal.open-bio.org Tue Nov 3 12:19:08 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 3 Nov 2009 12:19:08 -0500 Subject: [Biopython-dev] [Bug 2731] Adding .upper() and .lower() methods to the Seq object In-Reply-To: Message-ID: <200911031719.nA3HJ84Y001795@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2731 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-03 12:19 EST ------- Created an attachment (id=1389) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1389&action=view) Patch to Bio/Seq.py Compared to the earlier patch, this takes the less invasive approach of only editing Bio/Seq.py (covering both Seq and UnknownSeq, with doctests), but has the downside that it is not easy to deal with gapped alphabets etc nicely. Adding (private) upper/lower methods as outlined in the earlier patch does seem a better plan. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From kellrott at gmail.com Tue Nov 3 12:23:27 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Tue, 3 Nov 2009 09:23:27 -0800 Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence In-Reply-To: <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com> References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com> <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com> <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com> Message-ID: > > > def FeatureDescGuess( feature ): > desc = "" > try: > desc=feature.qualifiers['product'][0] > except KeyError: > pass > return desc > > Could be just: > > def FeatureDescGuess( feature ): > return feature.qualifiers.get('product', [""])[0] > > and therefore doesn't really need an entire function. 
> That could attempt to get the first element of a None type, if the 'product' qualifier doesn't exist. Actually, I wrote it that way so it could be extended. First it would try 'product' and if that didn't exist replace it with something like a 'db_xref' or a 'note' entry. I was hoping to get some input on what people think would be 'order of importance' of things to try. Kyle From bugzilla-daemon at portal.open-bio.org Tue Nov 3 12:34:05 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 3 Nov 2009 12:34:05 -0500 Subject: [Biopython-dev] [Bug 2731] Adding .upper() and .lower() methods to the Seq object In-Reply-To: Message-ID: <200911031734.nA3HY52U002483@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2731 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1192 is|0 |1 obsolete| | Attachment #1389 is|0 |1 obsolete| | ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-03 12:34 EST ------- Created an attachment (id=1390) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1390&action=view) Patch to Bio/Seq.py and Bio/Alphabet/__init__.py Based on attachment 1192 and recent attachment 1389 with doctests. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From tallpaulinjax at yahoo.com Tue Nov 3 12:34:46 2009 From: tallpaulinjax at yahoo.com (Paul B) Date: Tue, 3 Nov 2009 09:34:46 -0800 (PST) Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex In-Reply-To: <320fb6e00911030913u11cf6380q2a2bbc07b2b1863f@mail.gmail.com> Message-ID: <869703.76049.qm@web30705.mail.mud.yahoo.com> >Are you sure it should it be 1 and not 0? Remember, Python counts from zero. 

If the MODEL record in the .ent record says MODEL 1, should biopython report it as 0? In PDBParser, the code is as follows:

        current_model_id=0
        # Flag we have an open model
        model_open=0
        for i in range(0, len(coords_trailer)):

            if(record_type=='ATOM  ' or record_type=='HETATM'):
                # Initialize the Model - there was no explicit MODEL record
                if not model_open:
                    structure_builder.init_model(current_model_id)
                    current_model_id+=1
                    model_open=1

--- On Tue, 11/3/09, Peter wrote:

From: Peter
Subject: Re: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
To: "Paul B"
Cc: biopython-dev at biopython.org
Date: Tuesday, November 3, 2009, 12:13 PM

On Tue, Nov 3, 2009 at 4:36 PM, Paul B wrote:
>
> Hi,
>
> I have found the reason why MMCIParser is dying. It has no provision
> for more than one model, so when a second model comes around with
> the same chain and residue the program throws an exception.

Please file a bug report on bugzilla for that. I guess no-one has
tried NMR CIF data with the parser before (!).

> I will be joining github to submit the required changes. I haven't used
> github before, and this is my first open source project so please give
> me a few days to acclimate.

I you like - great. Otherwise we can manage with patches via an
enhancement bug on bugzilla.

> My mods so far are as follows in MMCIFParser.py (and require the
> MMCIFlex.py and MMCIF2Dict.py files I will be submitting via github,
> and have submitted to Peter privately.)

Actually, I think that ended up on mailing list:
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006938.html

> The only difference is the PDBParser incorrectly states the first model as 0
> when it should be 1: there is an explicit MODEL line in pdb2beg.ent. So all
> the models are off by one in 2beg when parsed by PDBParser.py.
I can > look into the bug in PDBParser.py and submit it if desired? Are you sure it should it be 1 and not 0? Remember, Python counts from zero. Peter ? From bugzilla-daemon at portal.open-bio.org Tue Nov 3 12:39:46 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 3 Nov 2009 12:39:46 -0500 Subject: [Biopython-dev] [Bug 2731] Adding .upper() and .lower() methods to the Seq object In-Reply-To: Message-ID: <200911031739.nA3Hdk8e002675@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2731 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-03 12:39 EST ------- Marking as fixed - updated patch checked in, with additional unit tests in Tests/test_seq.py -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Nov 3 12:39:48 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 3 Nov 2009 12:39:48 -0500 Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string? In-Reply-To: Message-ID: <200911031739.nA3HdmDM002689@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2351 Bug 2351 depends on bug 2731, which changed state. 
Bug 2731 Summary: Adding .upper() and .lower() methods to the Seq object
http://bugzilla.open-bio.org/show_bug.cgi?id=2731

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk Tue Nov 3 12:41:31 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 3 Nov 2009 17:41:31 +0000
Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence
In-Reply-To:
References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com> <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com> <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com>
Message-ID: <320fb6e00911030941l5f5748f3ofdfc2144af209a5e@mail.gmail.com>

On Tue, Nov 3, 2009 at 5:23 PM, Kyle Ellrott wrote:
>>
>> def FeatureDescGuess( feature ):
>>   desc = ""
>>   try:
>>       desc=feature.qualifiers['product'][0]
>>   except KeyError:
>>       pass
>>   return desc
>>
>> Could be just:
>>
>> def FeatureDescGuess( feature ):
>>   return feature.qualifiers.get('product', [""])[0]
>>
>> and therefore doesn't really need an entire function.
>
> That could attempt to get the first element of a None type, if the 'product'
> qualifier doesn't exist.

No, because we supply a default value (a list containing the empty string).

> Actually, I wrote it that way so it could be extended. First it would try
> 'product' and if that didn't exist replace it with something like a
> 'db_xref' or a 'note' entry. I was hoping to get some input on what people
> think would be 'order of importance' of things to try.

I might try and follow the NCBI's conventions used in FAA files for each
GenBank file - see the bacteria folder on their FTP site.

Peter

From bugzilla-daemon at portal.open-bio.org Tue Nov 3 12:58:37 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Nov 2009 12:58:37 -0500
Subject: [Biopython-dev] [Bug 2943] New: MMCIFParser only handling a single model.
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2943

           Summary: MMCIFParser only handling a single model.
           Product: Biopython
           Version: 1.52
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: TallPaulInJax at yahoo.com

MMCIFParser as written only handles a single model in a protein. Any protein
that has multiple models with repeating chains and residues will get an
exception since the residue ID will already exist. Please make the following
changes in MMCIFParser.py:

Change the __doc__ setting:

#Optional __DOC__ change if the new MMCIFlex is not used nor the changes
#to MMCIF2Dict based on the new MMCIFlex.
#Mod by Paul T. Bathen to reflect MMCIFlex built solely in Python
__doc__="mmCIF parser (implemented solely in Python, no lex/flex/C code needed)"

Regardless of the __doc__ changes:

Insert the following model_list line:

        occupancy_list=mmcif_dict["_atom_site.occupancy"]
        fieldname_list=mmcif_dict["_atom_site.group_PDB"]
        #Added by Paul T. Bathen Nov 2009
        model_list=mmcif_dict["_atom_site.pdbx_PDB_model_num"]
        try:

Make the following changes:

        #Modified by Paul T. Bathen Nov 2009: comment out this line
        #current_model_id=0
        structure_builder=self._structure_builder
        structure_builder.init_structure(structure_id)
        #Modified by Paul T. Bathen Nov 2009: comment out this line
        #structure_builder.init_model(current_model_id)
        structure_builder.init_seg(" ")
        #Added by Paul T. Bathen Nov 2009
        current_model_id = -1

Make the following changes in the for loop:

            #Note by Paul T. Bathen: should MMCIFParser include
            #the HOH and WAT stmts in PDBParser immediately below?
            #if fieldname=="HETATM":
            #    if resname=="HOH" or resname=="WAT":
            #        hetero_flag="W"
            #    else:
            #        hetero_flag="H"
            if fieldname=="HETATM":
                hetatm_flag="H"
            else:
                hetatm_flag=" "
            #Added by Paul T. Bathen Nov 2009
            model_id = model_list[i]
            if current_model_id != model_id:
                current_model_id = model_id
                structure_builder.init_model(current_model_id)
            #end of addition

After these changes took place, and with the new MMCIFlex and MMCIF2Dict in
place, I was able to parse and test 2beg.cif and pdb2beg.ent and both parsed
with the same number of models, chains, and residues.

Paul

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From kellrott at gmail.com Tue Nov 3 13:17:44 2009
From: kellrott at gmail.com (Kyle Ellrott)
Date: Tue, 3 Nov 2009 10:17:44 -0800
Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence
In-Reply-To: <320fb6e00911030941l5f5748f3ofdfc2144af209a5e@mail.gmail.com>
References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com> <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com> <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com> <320fb6e00911030941l5f5748f3ofdfc2144af209a5e@mail.gmail.com>
Message-ID:

> >> def FeatureDescGuess( feature ):
> >>   return feature.qualifiers.get('product', [""])[0]
> >>
> >> and therefore doesn't really need an entire function.
> >
> > That could attempt to get the first element of a None type, if the 'product'
> > qualifier doesn't exist.
>
> No, because we supply a default value (a list containing the empty string).
>

Didn't see that.
That's what I get for programming during a colloquium ;-)

Kyle

From biopython at maubp.freeserve.co.uk Tue Nov 3 16:49:31 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 3 Nov 2009 21:49:31 +0000
Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence
In-Reply-To:
References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com> <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com> <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com> <320fb6e00911030941l5f5748f3ofdfc2144af209a5e@mail.gmail.com>
Message-ID: <320fb6e00911031349v1303b7c3s2b1aaaa6fc695ced@mail.gmail.com>

On Tue, Nov 3, 2009 at 6:17 PM, Kyle Ellrott wrote:
>> >> def FeatureDescGuess( feature ):
>> >>   return feature.qualifiers.get('product', [""])[0]
>> >>
>> >> and therefore doesn't really need an entire function.
>> >
>> > That could attempt to get the first element of a None type, if the
>> > 'product' qualifier doesn't exist.
>>
>> No, because we supply a default value (a list containing the empty
>> string).
>
> Didn't see that. That's what I get for programming during a colloquium ;-)
>

:)

There could be a problem if a SeqFeature qualifier value wasn't a
list, for example a string like "NC_123456" instead of say
["NC_123456"], but the assumption is safe with anything
from the GenBank or EMBL parsers.

Peter

From bugzilla-daemon at portal.open-bio.org Tue Nov 3 16:51:51 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Nov 2009 16:51:51 -0500
Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model.
In-Reply-To:
Message-ID: <200911032151.nA3LppCX010030@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2943

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-03 16:51 EST -------
Could you attach a patch to this bug (using the diff command line tool)?
A short example script parsing one of these problem CIF files would also be
very helpful and could form the basis of a new unit test. If you can suggest
a small (file size) CIF file we could use for this that would be ideal.

Thanks

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk Tue Nov 3 16:54:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 3 Nov 2009 21:54:15 +0000
Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
In-Reply-To: <869703.76049.qm@web30705.mail.mud.yahoo.com>
References: <320fb6e00911030913u11cf6380q2a2bbc07b2b1863f@mail.gmail.com> <869703.76049.qm@web30705.mail.mud.yahoo.com>
Message-ID: <320fb6e00911031354w659f33d2oad838eeffc8ea585@mail.gmail.com>

On Tue, Nov 3, 2009 at 5:34 PM, Paul B wrote:
>>Are you sure it should it be 1 and not 0? Remember, Python counts from zero.
>
> If the MODEL record in the .ent record says MODEL 1, should
> biopython report it as 0? In PDBParser, the code is as follows:
> ...

If for the PDB parser Thomas already chose to report the model as 1,
then yes you are right - to be consistent the CIF parser should do the
same as the PDB parser.

Peter

From tallpaulinjax at yahoo.com Tue Nov 3 17:10:50 2009
From: tallpaulinjax at yahoo.com (Paul B)
Date: Tue, 3 Nov 2009 14:10:50 -0800 (PST)
Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
In-Reply-To: <320fb6e00911031354w659f33d2oad838eeffc8ea585@mail.gmail.com>
Message-ID: <342688.27132.qm@web30705.mail.mud.yahoo.com>

Hmmm... no matter whether there is an (explicit first parsed) MODEL record or not in the PDB file, PDBParser currently records the model id as 0. I can see that being the choice if there wasn't a MODEL record. But if there IS a MODEL record of 1, the model id should be forced to 0?
What if the MODEL record exists, and for some reason it is MODEL 2 and that's the first record PDBParser parses? Should that also have a forced model id of 0? This seems to me to be a bug instead of a feature. I'd hate to promulgate that error in the MMCIFParser code as well. Someone could be thinking they are doing a study on model X when really they are studying X+1, or worse yet X+Y where Y is the offset between the forced 0 id and the true first model id. And if the models in the file have skips, i.e. MODEL 1 then 3, 4, 7, and 9... those should be model ids 0 through 4? I don't know if that can happen, just saying... But if the TRUE model record information is not trapped (and it's not), how would someone know what true model they are looking at instead of the forced model id that starts at 0?

Not to rock the boat, but food for thought. I can make MMCIFParser match whatever is deemed to be the correct thing to do.

Paul

--- On Tue, 11/3/09, Peter wrote:

From: Peter
Subject: Re: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
To: "Paul B"
Cc: "Biopython Development"
Date: Tuesday, November 3, 2009, 4:54 PM

On Tue, Nov 3, 2009 at 5:34 PM, Paul B wrote:
>>Are you sure it should it be 1 and not 0? Remember, Python counts from zero.
>
> If the MODEL record in the .ent record says MODEL 1, should
> biopython report it as 0? In PDBParser, the code is as follows:
> ...

If for the PDB parser Thomas already chose to report the model as 1,
then to yes you are right - to be consistent the CIF parser should do the
same as the PDB parser.

Peter From kellrott at gmail.com Tue Nov 3 18:06:34 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Tue, 3 Nov 2009 15:06:34 -0800 Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence In-Reply-To: <320fb6e00911031349v1303b7c3s2b1aaaa6fc695ced@mail.gmail.com> References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com> <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com> <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com> <320fb6e00911030941l5f5748f3ofdfc2144af209a5e@mail.gmail.com> <320fb6e00911031349v1303b7c3s2b1aaaa6fc695ced@mail.gmail.com> Message-ID: I've posted a branch on my git ( http://github.com/kellrott/biopython/tree/FeatureExtract ). The Name/Description guess functions need to be finalized. I wrote a unit test that extracts CDS feature dna, and then runs translate on the dna and compares it to the translation stored in the feature. It passes all the genbank files in the Test directory except for the ones that have 'N' in the DNA sequence (that causes a translation exception) and one_of.gb (it refers to sequence outside of the file). More test ideas would be appreciated. Kyle There could be a problem if the SeqFeature qualifiers wasn't a > list, for example a string like ""NC_123456" instead of say > ["NC_123456"], but the assumption is safe with anything > from the GenBank or EMBL parsers. 
> > Peter > From biopython at maubp.freeserve.co.uk Tue Nov 3 18:41:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 3 Nov 2009 23:41:57 +0000 Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features In-Reply-To: <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com> References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com> <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com> Message-ID: <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com> On Wed, Oct 28, 2009 at 12:50 PM, Peter wrote: > On Wed, Oct 28, 2009 at 12:07 PM, Peter wrote: >> I think this should be part of Biopython proper (with unit tests etc), and >> would like to discuss where to put it. My ideas include: >> >> (1) Method of the SeqFeature object taking the parent sequence (as a >> string, Seq, ...?) as a required argument. Would return an object of the >> same type as the parent sequence passed in. >> >> (2) Separate function, perhaps in Bio.SeqUtils taking the parent >> sequence (as a string, Seq, ...?) and a SeqFeature object. Would >> return an object of the same type as the parent sequence passed in. >> >> (3) Method of the Seq object taking a SeqFeature, returning a Seq. >> [A drawback is Bio.Seq currently does not depend on Bio.SeqFeature] >> >> (4) Method of the SeqRecord object taking a SeqFeature. Could >> return a SeqRecord using annotation from the SeqFeature. Complex. >> >> Any other ideas? >> >> We could even offer more than one of these approaches, but ideally >> there should be one obvious way for the end user to do this. My >> question is, which is most intuitive? I quite like idea (1). >> >> In terms of code complexity, I expect (1), (2) and (3) to be about the >> same. Building a SeqRecord in (4) is trickier. > > Actually, thinking about this over lunch, for many of the use cases > we do want to turn a SeqFeature into a SeqRecord - either for the > nucleotides, or in some cases their translation. 
And if doing this,
> do something sensible with the SeqFeature annotation (qualifiers)
> seems generally to be useful. This could still be done with approaches
> (1) and (2) as well as (4).

Kyle at least seems to like idea (4), so much so that he has
gone ahead and coded up something:
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006941.html

Certainly there are good reasons for wanting to be able to take
a SeqFeature and the parent sequence (SeqRecord or Seq)
and create a SeqRecord (either plain nucleotides or translated
into protein). e.g. pretty much all non-trivial GenBank to FASTA
conversions. Offering this as a SeqRecord method might be the
best approach, option (4).

However, this is I think on top of the more fundamental step
of just extracting the sequence (without worrying about the
annotation). Here as noted above, I currently favour adding
a method to the SeqFeature, option (1). How about as the
method name get_sequence, extract_sequence or maybe
just extract?

Peter

From biopython at maubp.freeserve.co.uk Tue Nov 3 18:49:33 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 3 Nov 2009 23:49:33 +0000
Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence
In-Reply-To:
References: <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com> <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com> <320fb6e00911030941l5f5748f3ofdfc2144af209a5e@mail.gmail.com> <320fb6e00911031349v1303b7c3s2b1aaaa6fc695ced@mail.gmail.com>
Message-ID: <320fb6e00911031549k5a7e5f7fs1d4c6a6299d20ed6@mail.gmail.com>

On Tue, Nov 3, 2009 at 11:06 PM, Kyle Ellrott wrote:
> I've posted a branch on my git (
> http://github.com/kellrott/biopython/tree/FeatureExtract ). The
> Name/Description guess functions need to be finalized. I wrote a unit test
> that extracts CDS feature dna, and then runs translate on the dna and
> compares it to the translation stored in the feature. It passes all the
> genbank files in the Test directory except for the ones that have 'N' in the
> DNA sequence (that causes a translation exception) and one_of.gb (it refers
> to sequence outside of the file).
> More test ideas would be appreciated.

There are several things I would have done differently there. Firstly,
and perhaps most importantly, you shouldn't assume the SeqRecord
is DNA. It could be RNA or protein after all. Reuse the parent
SeqRecord's seq's alphabet.

Perhaps you could comment on this other thread about the more
general problem of how to make getting the sequence (i.e. a Seq
object) for a SeqFeature available in Biopython?
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006958.html

Peter

From kellrott at gmail.com Tue Nov 3 22:37:43 2009
From: kellrott at gmail.com (Kyle Ellrott)
Date: Tue, 3 Nov 2009 19:37:43 -0800
Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence
In-Reply-To: <320fb6e00911031549k5a7e5f7fs1d4c6a6299d20ed6@mail.gmail.com>
References: <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com> <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com> <320fb6e00911030941l5f5748f3ofdfc2144af209a5e@mail.gmail.com> <320fb6e00911031349v1303b7c3s2b1aaaa6fc695ced@mail.gmail.com> <320fb6e00911031549k5a7e5f7fs1d4c6a6299d20ed6@mail.gmail.com>
Message-ID:

> There are several things I would have done differently there. Firstly,
> and perhaps most importantly, you shouldn't assume the SeqRecord
> is DNA. It could be RNA or protein after all. Reuse the parent
> SeqRecord's seq's alphabet
>

It's an open source rule of thumb: if you want something done quickly, post
broken code and someone will fix it to prove they're smarter than you.
;-)

Kyle

From biopython at maubp.freeserve.co.uk Wed Nov 4 06:37:14 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 4 Nov 2009 11:37:14 +0000
Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features
In-Reply-To: <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com>
References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com> <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com> <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com>
Message-ID: <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com>

On Tue, Nov 3, 2009 at 11:41 PM, Peter wrote:
>
> Certainly there are good reasons for wanting to be able to take
> a SeqFeature and the parent sequence (SeqRecord or Seq)
> and create a SeqRecord (either plain nucleotides or translated
> into protein). e.g. pretty much all non-trivial GenBank to FASTA
> conversions. Offering this as a SeqRecord method might be the
> best approach, option (4).
>
> However, this is I think on top of the more fundamental step
> of just extracting the sequence (without worrying about the
> annotation). Here as noted above, I currently favour adding
> a method to the SeqFeature, option (1). How about as the
> method name get_sequence, extract_sequence or maybe
> just extract?

Done on a github branch, comments welcome:
http://github.com/peterjc/biopython/tree/seqfeature-extract

If that seems the best way to expose this functionality
(i.e. option (1) from my earlier list of suggestions), I can
commit this to the trunk, and we can move on to the
related idea of how to nicely get SeqRecord
objects for SeqFeatures.

Peter From biopython at maubp.freeserve.co.uk Wed Nov 4 09:22:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 4 Nov 2009 14:22:45 +0000 Subject: [Biopython-dev] Adding SeqRecord objects Message-ID: <320fb6e00911040622n1d60c6b1n29511f82f4f6b674@mail.gmail.com> Hi all, I'd like to add support for adding SeqRecord objects to the trunk, i.e. cherry-pick this change from my experimental branch: http://github.com/peterjc/biopython/commit/6fd5675b1c03dc7eb190d84db1fa19ae744559aa Plus some docstring/doctest examples and unit tests of course, e.g. http://github.com/peterjc/biopython/commit/a8405a54406226c6726daea743ea59dc544c5bc0 Any comments? Peter From peter at maubp.freeserve.co.uk Wed Nov 4 12:40:40 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Wed, 4 Nov 2009 17:40:40 +0000 Subject: [Biopython-dev] [blast-announce] BLAST 2.2.22 now available In-Reply-To: <320fb6e00910200456mbac8d28ra1385c102b899c9a@mail.gmail.com> References: <33559E80-E78D-4CCB-8E8C-79C36E89C007@ncbi.nlm.nih.gov> <320fb6e00910200456mbac8d28ra1385c102b899c9a@mail.gmail.com> Message-ID: <320fb6e00911040940h48f88ch87ad9a22d79b4aa3@mail.gmail.com> On Tue, Oct 20, 2009 at 11:56 AM, Peter wrote: > Hi all, > > The new NCBI BLAST tools are out now, and I'd only just updated > my desktop to BLAST 2.2.21 this morning! > > It looks like the "old style" blastall etc (which are written in C) are > much the same, but we will need to add Bio.Blast.Applications > wrappers for the new "BLAST+" tools (written in C++). The bulk of that work is done in the main repository now. However, we still need to go through all the tools and confirm all their arguments are included. This could be partly automated since the BLAST help output is nicely formatted... 
Peter From kellrott at gmail.com Wed Nov 4 13:25:14 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Wed, 4 Nov 2009 10:25:14 -0800 Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features In-Reply-To: <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com> References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com> <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com> <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com> <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com> Message-ID: I pulled the this branch and tested it on the project I originally needed the code for. It looks like it's working. Kyle Done on a github branch, comments welcome: > http://github.com/peterjc/biopython/tree/seqfeature-extract > > If that seems the best way to expose this functionality > (i.e. option (1) from my earlier list of suggestions), I can > commit this to the trunk, and we can move on to the > related idea of how to this nicely with get SeqRecord > objects for SeqFeatures. > > Peter > From bugzilla-daemon at portal.open-bio.org Wed Nov 4 13:52:18 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 4 Nov 2009 13:52:18 -0500 Subject: [Biopython-dev] [Bug 2945] New: update_pdb: shutil.move needs to be indented; try block also? Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2945 Summary: update_pdb: shutil.move needs to be indented; try block also? Product: Biopython Version: 1.52 Platform: PC OS/Version: Windows XP Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: TallPaulInJax at yahoo.com As written, shutil.move will only move the last file if there are any. If there aren't any, it will raise an exception since old_file and new_file are not initialized. 
Finally, any failure to move the file will also raise an exception (I believe),
so a try block should be in place:

Existing code:

        # move the obsolete files to a special folder
        for pdb_code in obsolete:
            if self.flat_tree:
                old_file = self.local_pdb + os.sep + 'pdb%s.ent'%(pdb_code)
                new_file = self.obsolete_pdb + os.sep + 'pdb%s.ent'%(pdb_code)
            else:
                old_file = self.local_pdb + os.sep + pdb_code[1:3] + os.sep + 'pdb%s.ent'%(pdb_code)
                new_file = self.obsolete_pdb + os.sep + pdb_code[1:3] + os.sep + 'pdb%s.ent'%(pdb_code)
        shutil.move(old_file, new_file)

shutil.move needs to be indented one level, and potentially a try/except
block added:

        for pdb_code in obsolete:
            if self.flat_tree:
                old_file = self.local_pdb + os.sep + 'pdb%s.ent'%(pdb_code)
                new_file = self.obsolete_pdb + os.sep + 'pdb%s.ent'%(pdb_code)
            else:
                old_file = self.local_pdb + os.sep + pdb_code[1:3] + os.sep + 'pdb%s.ent'%(pdb_code)
                new_file = self.obsolete_pdb + os.sep + pdb_code[1:3] + os.sep + 'pdb%s.ent'%(pdb_code)
            try:
                shutil.move(old_file, new_file)
            except:
                warnings.warn("Unable to move from old file: \n%s\n to new file: \n%s\n"
                              % (old_file, new_file), RuntimeWarning)

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Wed Nov 4 16:56:00 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 4 Nov 2009 16:56:00 -0500
Subject: [Biopython-dev] [Bug 2945] update_pdb: shutil.move needs to be indented; try block also?
In-Reply-To:
Message-ID: <200911042156.nA4Lu0NU025735@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2945

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-04 16:56 EST -------
Fixed in git, looks like I missed that in fixing Bug 2867. Thanks

I didn't go for a try/except. Have you had this fail in real use?

Peter

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk Wed Nov 4 17:05:50 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 4 Nov 2009 22:05:50 +0000
Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features
In-Reply-To:
References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com>
 <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com>
 <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com>
 <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com>
Message-ID: <320fb6e00911041405x2a424d61n2bf617a99d8ffef4@mail.gmail.com>

On Wed, Nov 4, 2009 at 6:25 PM, Kyle Ellrott wrote:
> I pulled this branch and tested it on the project I originally needed
> the code for. It looks like it's working.

Cool. What do you think of this interface? Does a method of the
SeqFeature seem natural to you?
Peter From kellrott at gmail.com Wed Nov 4 17:16:30 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Wed, 4 Nov 2009 14:16:30 -0800 Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features In-Reply-To: <320fb6e00911041405x2a424d61n2bf617a99d8ffef4@mail.gmail.com> References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com> <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com> <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com> <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com> <320fb6e00911041405x2a424d61n2bf617a99d8ffef4@mail.gmail.com> Message-ID: I guess it's a question of making it a method of the SeqFeature vs a Method of the SeqRecord. My try put the extract feature in the SeqRecord, because that's what SeqFeature belongs to. But it actually makes more sense to have the SeqFeature operate on a Seq. If people want to create features (or copy them from other SeqRecords) and use them to extract subsequences from other Seq objects this format makes it more natural/flexable. Kyle On Wed, Nov 4, 2009 at 2:05 PM, Peter wrote: > On Wed, Nov 4, 2009 at 6:25 PM, Kyle Ellrott wrote: > > I pulled the this branch and tested it on the project I originally needed > > the code for. It looks like it's working. > > Cool. What do you think of this interface? Does a method of the > SeqFeature seem natural to you? > > Peter > From bugzilla-daemon at portal.open-bio.org Wed Nov 4 17:18:29 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 4 Nov 2009 17:18:29 -0500 Subject: [Biopython-dev] [Bug 2945] update_pdb: shutil.move needs to be indented; try block also? 
In-Reply-To: Message-ID: <200911042218.nA4MITCf026322@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2945 TallPaulInJax at yahoo.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |TallPaulInJax at yahoo.com ------- Comment #2 from TallPaulInJax at yahoo.com 2009-11-04 17:18 EST ------- My thought process on the try block (and similar): if something or someone outside of my control can affect the state of my program I am going to try and catch that exception. So, when dealing with a file system who knows what could happen! After all, we are trying to move obsolete files that may have been downloaded months or years ago... who KNOWS where they are now, or what they've been named. We are just calculating their SUPPOSED path and filename. If we had just done a directory walk and found all files matching a regex expression, that might be a different matter. Anyway, that's the logic! :-) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 4 23:34:46 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 4 Nov 2009 23:34:46 -0500 Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model. In-Reply-To: Message-ID: <200911050434.nA54YkFs004138@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2943 ------- Comment #2 from TallPaulInJax at yahoo.com 2009-11-04 23:34 EST ------- Created an attachment (id=1392) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1392&action=view) MMCIFParser patch using WinMerge. I'm not sure if this is in a useable format or not. It was generated by WinMerge, and I had not used the product before tonight. 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 4 23:42:22 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 4 Nov 2009 23:42:22 -0500 Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model. In-Reply-To: Message-ID: <200911050442.nA54gMZe004307@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2943 ------- Comment #3 from TallPaulInJax at yahoo.com 2009-11-04 23:42 EST ------- Important point of discussion: In PDBParser.py as of this date, NO MATTER whether there is an explicit MODEL record or not in the .ent file, PDBParser forces the first model id to be 0 and then increments the counter. If the authors of the .ent file chose to explicitly use MODEL records 2,3,5,7 these would be ignored and instead given model id's of 0,1,2,3 respectively by PDBParser. There is no attribute in the object model for the true MODEL number. To me, this is a bug since some user thinking they are studying model X is instead studying model Y. Currently the code for MMCIFParser.py does NOT follow this logic. Model numbers in the file are faithfully used as the model id in the object structure. However, should it be deemed that the PDBParser method is a feature instead of a bug, then the change is easy enough to make. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 4 23:54:42 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 4 Nov 2009 23:54:42 -0500 Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model. 
In-Reply-To: Message-ID: <200911050454.nA54sgUk004474@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2943 ------- Comment #4 from TallPaulInJax at yahoo.com 2009-11-04 23:54 EST ------- Note: my testing of MMCIFParser.py depends on the changes I have made to MMCIF2Dict.py which requires the new MMCIFlex.py I wrote completely in python with no lex/flex/C support required. Peter, you should have copies of these? To test MMCIFParser.py, download: ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/mmCIF/be/2beg.cif.gz and unzip it. Run MMCIFParser in IDLE, etc. It will prompt you for a filename. Enter the path and filename to the unzipped 2beg file you just downloaded. 2beg has 10 models (1 thru 10), each with 5 chains, and each chain with 26 residues. You should get: Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Note on a different protein: 100d.ent identifies the models, chains, and residues differently than it's counterpart 100d.cif!!! And I thought they came from the same database!? 
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Nov 5 00:03:09 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Nov 2009 00:03:09 -0500 Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model. In-Reply-To: Message-ID: <200911050503.nA5539T2004675@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2943 ------- Comment #5 from TallPaulInJax at yahoo.com 2009-11-05 00:03 EST ------- Created an attachment (id=1393) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1393&action=view) MMCIF2Dict patch for use with new MMCIFlex I don't know if MMCIF2Dict.py has bugs or not with the lex/flex/C version of MMCIFlex.py. However, these corrections are necessary for testing the complete package of MMCIFParser and the MMCIF2Dict and MMCIFlex modules written without need for lex/flex/C. So I am attaching them here! -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Nov 5 00:05:59 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Nov 2009 00:05:59 -0500 Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model. 
In-Reply-To: Message-ID: <200911050505.nA555xpM004734@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2943 ------- Comment #6 from TallPaulInJax at yahoo.com 2009-11-05 00:05 EST ------- Created an attachment (id=1394) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1394&action=view) The complete MMCIF2Dict file modified for the new MMCIFlex.py If you don't want to mess with the diff file, here is the complete MMCIF2Dict.py file written to work with the python-only MMCIFlex.py file I will also add as an attachment. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Nov 5 00:10:19 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Nov 2009 00:10:19 -0500 Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model. In-Reply-To: Message-ID: <200911050510.nA55AJxu004822@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2943 ------- Comment #7 from TallPaulInJax at yahoo.com 2009-11-05 00:10 EST ------- Created an attachment (id=1395) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1395&action=view) The new MMCIFlex.py file written solely in python. And here is the complete MMCIFlex.py file written solely in python, no need for lex/flex/c/etc. It has been tested on Windows (as have the other files). This file should be placed in the mmCIF subfolder of PDB. Note: I don't know if this will be used as a REPLACEMENT for the old MMCIFlex or not so I have no idea what it's name should be. The same goes for the modifications of the MMCIF2Dict.py file: it will only work with this version of MMCIFlex, not the lex/flex/C version!!! 
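For anyone who does not want to open the attachment, the general idea of a lexer that reads a .cif file one line at a time and hands back tokens can be sketched as below. This is not the attached MMCIFlex.py: `tokenize_cif_line` and `iter_cif_tokens` are hypothetical names, and the quoting rules are deliberately simplified (a closing quote must be followed by whitespace or end of line; a line starting with ';' opens or closes a multi-line text field).

```python
import re

# A token is a single-quoted string, a double-quoted string, or a run of
# non-whitespace characters (simplified CIF rules, see the note above).
_TOKEN = re.compile(r"""'(.*?)'(?=\s|$)|"(.*?)"(?=\s|$)|(\S+)""")


def tokenize_cif_line(line):
    """Split one ordinary (non text-field) line into tokens."""
    tokens = []
    for match in _TOKEN.finditer(line):
        single, double, bare = match.groups()
        if bare is not None:
            if bare.startswith("#"):
                break  # the rest of the line is a comment
            tokens.append(bare)
        else:
            tokens.append(single if single is not None else double)
    return tokens


def iter_cif_tokens(lines):
    """Yield tokens from an iterable of lines, joining ';' text fields."""
    in_text = False
    buffer = []
    for line in lines:
        line = line.rstrip("\n")
        if in_text:
            if line.startswith(";"):
                # Closing semicolon: emit the collected text as one token.
                yield "\n".join(buffer)
                in_text = False
                for token in tokenize_cif_line(line[1:]):
                    yield token
            else:
                buffer.append(line)
        elif line.startswith(";"):
            in_text = True
            buffer = [line[1:]]
        else:
            for token in tokenize_cif_line(line):
                yield token


sample = [
    "data_2BEG",
    "loop_",
    "_atom_site.group_PDB",
    "_atom_site.label_atom_id",
    "ATOM 'N A' # trailing comment",
    ";multi",
    "line",
    ";",
]
tokens = list(iter_cif_tokens(sample))
print(tokens)
```

A generator like this could feed a dictionary builder one token per request, which is roughly the role get_token() plays for MMCIF2Dict.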
-- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Nov 5 06:12:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 5 Nov 2009 06:12:55 -0500 Subject: [Biopython-dev] [Bug 2945] update_pdb: shutil.move needs to be indented; try block also? In-Reply-To: Message-ID: <200911051112.nA5BCtAr014626@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2945 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |REOPENED Resolution|FIXED | ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-05 06:12 EST ------- Fair point. Could you try the updated file in the repository? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Nov 5 07:44:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 5 Nov 2009 12:44:24 +0000 Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features In-Reply-To: References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com> <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com> <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com> <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com> <320fb6e00911041405x2a424d61n2bf617a99d8ffef4@mail.gmail.com> Message-ID: <320fb6e00911050444m2a9afbcen5e5a548225fef79c@mail.gmail.com> On Wed, Nov 4, 2009 at 10:16 PM, Kyle Ellrott wrote: > I guess it's a question of making it a method of the SeqFeature vs a Method > of the SeqRecord. 
My try put the extract feature in the SeqRecord, because
> that's what SeqFeature belongs to. But it actually makes more sense to
> have the SeqFeature operate on a Seq. If people want to create features (or
> copy them from other SeqRecords) and use them to extract subsequences from
> other Seq objects this format makes it more natural/flexible.

:)

You are right that the SeqFeature won't always be used with a
SeqRecord (or at least, the parent SeqRecord). There are possible
examples like a GenBank file with no sequence (just CONTIG
information) where the sequence has been loaded from a FASTA
file. Or, perhaps more likely (once Brad's code is merged), a list
of SeqFeature objects loaded from a GFF file, plus a sequence
loaded from a FASTA file.

Does my current choice of "extract" for the name of the proposed
SeqFeature method seem clear? My other suggestions earlier were
get_sequence or extract_sequence.

http://github.com/peterjc/biopython/tree/seqfeature-extract

Peter

From bugzilla-daemon at portal.open-bio.org Thu Nov 5 10:01:14 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Nov 2009 10:01:14 -0500
Subject: [Biopython-dev] [Bug 2945] update_pdb: shutil.move needs to be indented; try block also?
In-Reply-To:
Message-ID: <200911051501.nA5F1EPU023807@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2945

------- Comment #4 from TallPaulInJax at yahoo.com 2009-11-05 10:01 EST -------
I downloaded the updated PDBList.py from github. Unfortunately, testing fails
for several reasons:

1. The code you wrote includes an 'if' statement:

            if os.path.isfile(old_file) :
                try :
                    shutil.move(old_file, new_file)
                except :
                    warnings.warn("Could not move %s to obsolete folder" \
                                  % pdb_code, RuntimeWarning)

If the file does NOT exist, then no warning is issued as if the file HAD
existed and HAD been moved. This is a bug: there should be no if statement.
Simply try and move the file within a try/except as I had written it, without
the if statement:

            try :
                shutil.move(old_file, new_file)
            except :
                warnings.warn("Could not move %s to obsolete folder" \
                              % pdb_code, RuntimeWarning)

That way no matter whether the file does not exist or cannot be moved, the
warning is issued. If you would like to trap the warnings separately, you
should write:

            if os.path.isfile(old_file) :
                try :
                    shutil.move(old_file, new_file)
                except :
                    warnings.warn("Could not move %s to obsolete folder" \
                                  % pdb_code, RuntimeWarning)
            else:
                warnings.warn("File %s not found to move to obsolete folder" \
                              % pdb_code, RuntimeWarning)

2. At least on Windows, if the subfolders of the obsolete folder do not exist,
a warning will be issued as well: python will create the 'obsolete' folder
but will not create the subfolders under that. This will occur even if
the files DO exist and COULD be moved: a different kind of error. We just need
to create the subfolders, and warn if we can't.

3. Not a bug, but an enhancement: As a user, I'd rather see the whole new_file
name instead of the pdb_code.

To sum up, the below code will implement all those changes. Whether there are
other errors or not I have not checked. But I did check the above
warnings/errors with testing:

        for pdb_code in obsolete:
            if self.flat_tree:
                old_file = self.local_pdb + os.sep + 'pdb%s.ent'%(pdb_code)
                new_file = self.obsolete_pdb + os.sep + 'pdb%s.ent'%(pdb_code)
                new_path = self.obsolete_pdb #<=====================
            else:
                old_file = self.local_pdb + os.sep + pdb_code[1:3] + os.sep + 'pdb%s.ent'%(pdb_code)
                new_file = self.obsolete_pdb + os.sep + pdb_code[1:3] + os.sep + 'pdb%s.ent'%(pdb_code)
                new_path = self.obsolete_pdb + os.sep + pdb_code[1:3] #<=====
            #If the old file doesn't exist, maybe someone else moved it
            #or deleted it already. Should we issue a warning?
            if os.path.isfile(old_file) :
                try :
                    #os.makedirs raises if the folder already exists
                    if not os.path.isdir(new_path) :
                        os.makedirs(new_path) #<================
                    shutil.move(old_file, new_file)
                except :
                    warnings.warn("Could not move %s to obsolete folder" \
                                  % old_file, RuntimeWarning) #<====old_file
            else: #<===========
                warnings.warn("Could not find file %s to move to obsolete folder" \
                              % old_file, RuntimeWarning) #<====old_file

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Thu Nov 5 11:13:32 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Nov 2009 11:13:32 -0500
Subject: [Biopython-dev] [Bug 2945] update_pdb: shutil.move needs to be indented; try block also?
In-Reply-To:
Message-ID: <200911051613.nA5GDWqk025397@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2945

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |FIXED

------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-05 11:13 EST -------
(In reply to comment #4)
> I downloaded the updated PDBList.py from github. Unfortunately, testing fails
> for several reasons:
> 1. The code you wrote includes an 'if' statement:
>             if os.path.isfile(old_file) :
>                 try :
>                     shutil.move(old_file, new_file)
>                 except :
>                     warnings.warn("Could not move %s to obsolete folder" \
>                                   % pdb_code, RuntimeWarning)
> If the file does NOT exist, then no warning is issued as if the file HAD
> existed and HAD been moved. This is a bug ...

Why is this a bug? If the file does not exist, there is no point trying
to move it. But OK, a message here could be informative.

> 2.
At least on Windows, if the subfolders of the obsolete folder do not exist, > a warning will be issued as well: python will create the 'obsolete' subolder > folder but will not create the subfolders under that. This will occur even if > the files DO exist and COULD be moved: a different kind of error. We just need > to create the subfolders, and warn if we can't. That's a separate bug. Fixed now. > 3. Not a bug, but an enhancement: As a user, I'd rather see the whole new_file > name instead of the pdb_code. Fair enough. I felt having a very long message with both the old and the new paths was excessive, but at least including the old path would be a useful compromise. > To sum up, the below code will implement all those changes. Whether there are > other errors or not I have not checked. But I did check the above > warnings/errors with testing: I've updated the repository along similar lines: http://github.com/biopython/biopython/blob/master/Bio/PDB/PDBList.py Marking this bug as fixed. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From kellrott at gmail.com Thu Nov 5 14:42:03 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Thu, 5 Nov 2009 11:42:03 -0800 Subject: [Biopython-dev] Building Gene Ontology support into Biopython In-Reply-To: <128a885f0910182126i74f7712bo2cb6e7d612532e5@mail.gmail.com> References: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com> <20091018163436.GA66322@kunkel> <128a885f0910182126i74f7712bo2cb6e7d612532e5@mail.gmail.com> Message-ID: Any notes on the progress? I had to get some GO information for a project, so I checked out your git fork. Looks like a skeleton of the project has been outlined, but no real code yet. I added some code to the oboparser to get what I needed, only basic term parsing, and no network support. 
Can't speak to it's performance, but NetworkX has a very simple install (easy_install on mac, and part of the standard package set on Fedora). Kyle On Sun, Oct 18, 2009 at 7:22 AM, Chris Lasher wrote: > I'm going to go ahead and make the executive decision to use NetworkX. > I think BioPerl's Ontology framework has both third-party > dependency-based (Graph.pm) and non-dependency-based solutions. Maybe > we can figure out something similar, but NetworkX is such an easy > dependency to satisfy that I'm going with it. > > Looks like this is going to be a busy week. > > Chris > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From bugzilla-daemon at portal.open-bio.org Fri Nov 6 10:05:07 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 6 Nov 2009 10:05:07 -0500 Subject: [Biopython-dev] [Bug 2947] New: Bio.HMM calculates wrong viterbi path Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2947 Summary: Bio.HMM calculates wrong viterbi path Product: Biopython Version: 1.47 Platform: PC OS/Version: Windows XP Status: NEW Severity: normal Priority: P2 Component: BioSQL AssignedTo: biopython-dev at biopython.org ReportedBy: georg.lipps at fhnw.ch Hi, I have tested the Bio.HMM with some simple code (see below). However the results of the viterbi path calculation are wrong (apparently they depend upon the order of the state alphabet). I also do not understand the number/score/probability returned along with the viterbi path. 
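As a cross-check on the claim that the reported path depends on the order of the state alphabet, the fair/unfair coin model used in this report can be run through an independent, plain-Python Viterbi sketch. This is not Bio.HMM code: the function and variable names are made up for illustration, and the uniform start distribution is an assumption, since the report does not say which initial probabilities Bio.HMM uses.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log-probability)."""
    # best[s]: highest log-probability of any path that ends in state s.
    best = {s: math.log(start_p[s] * emit_p[s][observations[0]]) for s in states}
    history = []  # one dict of back-pointers per later observation
    for obs in observations[1:]:
        step, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + math.log(trans_p[p][s]))
            step[s] = best[prev] + math.log(trans_p[prev][s] * emit_p[s][obs])
            pointers[s] = prev
        best = step
        history.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    log_prob = best[state]
    path = [state]
    for pointers in reversed(history):
        state = pointers[state]
        path.append(state)
    return path[::-1], log_prob


# Transition and emission values copied from the report;
# the uniform start distribution is an assumption.
start = {"u": 0.5, "f": 0.5}
trans = {"u": {"u": 0.95, "f": 0.05}, "f": {"u": 0.05, "f": 0.95}}
emit = {"u": {"head": 0.25, "tail": 0.75}, "f": {"head": 0.5, "tail": 0.5}}

path_uf, logp_uf = viterbi(["tail"] * 2, ["u", "f"], start, trans, emit)
path_fu, logp_fu = viterbi(["tail"] * 2, ["f", "u"], start, trans, emit)
path_six, _ = viterbi(["tail"] * 6, ["u", "f"], start, trans, emit)
print("".join(path_uf), "".join(path_fu), "".join(path_six))
```

Under these assumptions the result does not depend on the order in which the states are listed: two tails give `uu` with probability 0.5 * 0.75 * 0.95 * 0.75 = 0.2671875, and six tails give `uuuuuu`.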
Greetings,
Georg

from Bio.HMM import MarkovModel, Trainer

## definition of the alphabets for hidden states and emissions
class coin:
    def __init__(self):
        self.letters = ["u", "f"] # must be a single letter alphabet or raises an error in viterbi

class outcome:
    def __init__(self):
        self.letters = ["head", "tail"]

coin=coin()
outcome=outcome()

## initialize HMM model
build=MarkovModel.MarkovModelBuilder(coin, outcome)
build.allow_all_transitions()
build.set_equal_probabilities()

## build HMM model with test frequencies
build.set_transition_score("u", "f", 0.05)
build.set_transition_score("f", "u", 0.05)
build.set_transition_score("f", "f", 0.95)
build.set_transition_score("u", "u", 0.95)
build.set_emission_score("f", "tail", 0.5)
build.set_emission_score("f", "head", 0.5)
build.set_emission_score("u", "tail", 0.75)
build.set_emission_score("u", "head", 0.25)
model=build.get_markov_model()

print "Emission probabilites:\n", model.emission_prob
print "Transitions probabilities:\n", model.transition_prob, "\n"

observed_emissions=["tail"]*2
viterbi=model.viterbi(observed_emissions, coin)
seq=viterbi[0]
prob=viterbi[1]

print "============= Calculation of the most probable path"
## does not work for very short observations
## calculated path is dependant upon order in states alphabet
print observed_emissions
print seq
print prob, "\n"

OUTPUT:
Emission probabilites:
{('u', 'head'): 0.25, ('f', 'head'): 0.5, ('f', 'tail'): 0.5, ('u', 'tail'): 0.75}
Transitions probabilities:
{('f', 'u'): 0.050000000000000003, ('u', 'f'): 0.050000000000000003, ('u', 'u'): 0.94999999999999996, ('f', 'f'): 0.94999999999999996}

============= Calculation of the most probable path
['tail', 'tail']
ff
4.46028871308

This is certainly not true, since the most probable path would be uu
(unfair/unfair)

When the sequence of observation is longer, e.g six the following results are
obtained:

============= Calculation of the most probable path
['tail', 'tail', 'tail', 'tail', 'tail', 'tail']
uuuuuf
13.1325601923

Which is still not true as the last coin should still be u (unfair).

Interestingly when the order of the state alphabet is changed, i.e.:

class coin:
    def __init__(self):
        self.letters = ["f", "u"]

the output is correct.

============= Calculation of the most probable path
['tail', 'tail', 'tail', 'tail', 'tail', 'tail']
uuuuuu
6.09287667828

Thus it appears to me that the viterbi algorithm is not robust enough and
biased towards the last letter of the state alphabet.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From kellrott at gmail.com Fri Nov 6 12:36:16 2009
From: kellrott at gmail.com (Kyle Ellrott)
Date: Fri, 6 Nov 2009 09:36:16 -0800
Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features
In-Reply-To: <320fb6e00911050444m2a9afbcen5e5a548225fef79c@mail.gmail.com>
References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com>
 <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com>
 <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com>
 <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com>
 <320fb6e00911041405x2a424d61n2bf617a99d8ffef4@mail.gmail.com>
 <320fb6e00911050444m2a9afbcen5e5a548225fef79c@mail.gmail.com>
Message-ID:

It's always hard to judge how obvious something is when you already know
it. I don't think there is any sort of name clash. Is there anything else you
would want to 'extract' with/from a Sequence Feature?...
One of the rules I've heard from API designers is that it's best if written
code almost sounds like a proper sentence. ( From
http://www.youtube.com/watch?v=aAb7hSCtvGw if you haven't seen it )
my_feature_sequence = my_feature.extract( my_sequence ) seems to fit that
rule.

Kyle

> Does my current choice of "extract" for the name of the proposed
> SeqFeature method seem clear?
My other suggestions earlier were > get_sequence or extract_sequence. > > http://github.com/peterjc/biopython/tree/seqfeature-extract > > Peter > From biopython at maubp.freeserve.co.uk Fri Nov 6 14:00:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 6 Nov 2009 19:00:08 +0000 Subject: [Biopython-dev] Seq object ungap method Message-ID: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com> Hi all, Something we discussed last year was adding an ungap method to the Seq object. See for example this thread: http://lists.open-bio.org/pipermail/biopython/2008-September/004515.html I've (finally) taken the time to actually implement this, and have posted it on a github branch for comment: http://github.com/peterjc/biopython/tree/ungap The code includes a selection of examples in the docstring which double as doctests. You can read this online here: http://github.com/peterjc/biopython/commit/13de9f793d13d3c9485f8d7cc42a48b99613d931 Peter [At some point we may want to move some of the assorted private functions in Bio.Alphabet into (private) methods of the alphabet objects or something... we'll see.] From biopython at maubp.freeserve.co.uk Tue Nov 10 11:23:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 10 Nov 2009 16:23:16 +0000 Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features In-Reply-To: References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com> <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com> <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com> <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com> <320fb6e00911041405x2a424d61n2bf617a99d8ffef4@mail.gmail.com> <320fb6e00911050444m2a9afbcen5e5a548225fef79c@mail.gmail.com> Message-ID: <320fb6e00911100823n22821fcg1b162d436f235c7f@mail.gmail.com> Peter wrote: >> Does my current choice of "extract" for the name of the proposed >> SeqFeature method seem clear? 
My other suggestions earlier were >> get_sequence or extract_sequence. >> >> http://github.com/peterjc/biopython/tree/seqfeature-extract >> >> Peter On Fri, Nov 6, 2009 at 5:36 PM, Kyle Ellrott wrote: > It's always hard to judge how obvious something is when you already know > it. I don't think there is any sort of name clash. Is there anything else you > would want to 'extract' with/from a Sequence Feature?... > One of the rules I've heard from API designers is that it's best if written > code almost sounds like a proper sentence. ( From > http://www.youtube.com/watch?v=aAb7hSCtvGw if you haven't seen it ) > my_feature_sequence = my_feature.extract( my_sequence ) seems to fit that > rule. > > Kyle So I'm happy with the SeqFeature method name ("extract"), and so is Kyle, and no one else has commented - which is a shame. I've just merged this into the main branch to encourage more testing of this, and will work on covering this in the tutorial at some point. If anyone gives this a go, please let us know on the list - and of course report any issues. Thanks, Peter From eric.talevich at gmail.com Wed Nov 11 00:37:16 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 11 Nov 2009 00:37:16 -0500 Subject: [Biopython-dev] Seq object ungap method In-Reply-To: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com> References: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com> Message-ID: <3f6baf360911102137s2534dc23med16712375db4cb7@mail.gmail.com> On Fri, Nov 6, 2009 at 2:00 PM, Peter wrote: > Hi all, > > Something we discussed last year was adding an ungap method > to the Seq object. See for example this thread: > > http://lists.open-bio.org/pipermail/biopython/2008-September/004515.html > > I've (finally) taken the time to actually implement this, and have > posted it on a github branch for comment: > > http://github.com/peterjc/biopython/tree/ungap > > Neat! Some trivial comments: 1.
There's a typo on line 897 in Seq.py: s/stil/still/ 2. Each colon character has a space before it in this function. I've never seen you use that style before. (Other Biopython code doesn't do that.) 3. In the exception messages, and other places in Biopython, the concatenated string (or compound expression) is contained in parentheses, but there's also a backslash at the end of each line. I don't think the backslash is necessary, since the parens already group the multi-line expression. Cheers, Eric From biopython at maubp.freeserve.co.uk Wed Nov 11 05:32:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 11 Nov 2009 10:32:19 +0000 Subject: [Biopython-dev] Seq object ungap method In-Reply-To: <3f6baf360911102137s2534dc23med16712375db4cb7@mail.gmail.com> References: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com> <3f6baf360911102137s2534dc23med16712375db4cb7@mail.gmail.com> Message-ID: <320fb6e00911110232g3d507c7cg34512a4e9f28c55d@mail.gmail.com> On Wed, Nov 11, 2009 at 5:37 AM, Eric Talevich wrote: > On Fri, Nov 6, 2009 at 2:00 PM, Peter > wrote: >> >> Hi all, >> >> Something we discussed last year was adding an ungap method >> to the Seq object. See for example this thread: >> >> http://lists.open-bio.org/pipermail/biopython/2008-September/004515.html >> >> I've (finally) taken the time to actually implement this, and have >> posted it on a github branch for comment: >> >> http://github.com/peterjc/biopython/tree/ungap >> > > Neat! So worth checking in then? > Some trivial comments: > 1. There's a typo on line 897 in Seq.py: s/stil/still/ Thanks. > 2. Each colon character has a space before it in this function. I've never > seen you use that style before. (Other Biopython code doesn't do that.) I think some other bits of Biopython do this, but I agree, we should be consistent and removing the spaces matches PEP8. > 3. 
In the exception messages, and other places in Biopython, the > concatenated string (or compound expression) is contained in parentheses, > but there's also a backslash at the end of each line. I don't think the > backslash is necessary, since the parens already group the multi-line > expression. Again, you are right, and it would perhaps be worth tidying up cases of that. But not critical. Peter From biopython at maubp.freeserve.co.uk Wed Nov 11 07:44:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 11 Nov 2009 12:44:57 +0000 Subject: [Biopython-dev] Seq object ungap method In-Reply-To: <320fb6e00911110232g3d507c7cg34512a4e9f28c55d@mail.gmail.com> References: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com> <3f6baf360911102137s2534dc23med16712375db4cb7@mail.gmail.com> <320fb6e00911110232g3d507c7cg34512a4e9f28c55d@mail.gmail.com> Message-ID: <320fb6e00911110444q4baa4314ye6cb2fad5931a0a1@mail.gmail.com> On Wed, Nov 11, 2009 at 10:32 AM, Peter wrote: > On Wed, Nov 11, 2009 at 5:37 AM, Eric Talevich wrote: > >> 2. Each colon character has a space before it in this function. I've never >> seen you use that style before. (Other Biopython code doesn't do that.) > > I think some other bits of Biopython do this, but I agree, we should > be consistent and removing the spaces matches PEP8. It turns out lots of bits of Biopython did this for functions, if statements and class definitions - in many cases this is my own fault, as I seem to have adopted a personal style here at odds with PEP8. Anyway, I think I have found and fixed all the cases on the trunk now. 
Peter From biopython at maubp.freeserve.co.uk Wed Nov 11 09:24:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 11 Nov 2009 14:24:15 +0000 Subject: [Biopython-dev] Adding SeqRecord objects In-Reply-To: <320fb6e00911040622n1d60c6b1n29511f82f4f6b674@mail.gmail.com> References: <320fb6e00911040622n1d60c6b1n29511f82f4f6b674@mail.gmail.com> Message-ID: <320fb6e00911110624h2453c39ah423f64f53fd33873@mail.gmail.com> On Wed, Nov 4, 2009 at 2:22 PM, Peter wrote: > Hi all, > > I'd like to add support for adding SeqRecord objects to the trunk, i.e. > cherry-pick this change from my experimental branch: > > http://github.com/peterjc/biopython/commit/6fd5675b1c03dc7eb190d84db1fa19ae744559aa > > Plus some docstring/doctest examples and unit tests of course, e.g. > http://github.com/peterjc/biopython/commit/a8405a54406226c6726daea743ea59dc544c5bc0 > > Any comments? Checked in - comments and testing still welcome of course. Peter From biopython at maubp.freeserve.co.uk Wed Nov 11 11:31:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 11 Nov 2009 16:31:32 +0000 Subject: [Biopython-dev] Adding SeqRecord objects In-Reply-To: <320fb6e00911110624h2453c39ah423f64f53fd33873@mail.gmail.com> References: <320fb6e00911040622n1d60c6b1n29511f82f4f6b674@mail.gmail.com> <320fb6e00911110624h2453c39ah423f64f53fd33873@mail.gmail.com> Message-ID: <320fb6e00911110831n3984bekef6ec2a646131234@mail.gmail.com> On Wed, Nov 11, 2009 at 2:24 PM, Peter wrote: > > Checked in - comments and testing still welcome of course. > I've also added a new unit test, test_SeqRecord.py, and while working on this found two unreported bugs with SeqRecord slicing for the SeqFeatures. Firstly negative slice indices didn't work properly, and secondly there was a corner case where the slice stop was right at the end of a feature. Fixed. 
Peter From bugzilla-daemon at portal.open-bio.org Wed Nov 11 23:00:51 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 11 Nov 2009 23:00:51 -0500 Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model. In-Reply-To: Message-ID: <200911120400.nAC40puR016271@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2943 ------- Comment #8 from TallPaulInJax at yahoo.com 2009-11-11 23:00 EST ------- Created an attachment (id=1396) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1396&action=view) MMCIFlex.py all in python w/ability to read from .cif.gz file; also bug fix Hi, I don't believe this file has been added to the git repository, so I updated it to include the ability to parse .cif.gz files as well as .cif files. In addition, I was unaware that .cif files can have semi-colon lines like: ; ASDasdasd ads askdjlkasjdlakjsdlasd asd asdkjasdl;kjalskdjlasjdlasjdlkasjdalsdkj ; so I updated the logic accordingly and tested it. As stated before, this MMCIFlex.py has to be matched with the MMCIF2Dict.py I had slightly re-written to work with it. See other attachments. Paul From bugzilla-daemon at portal.open-bio.org Wed Nov 11 23:05:50 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 11 Nov 2009 23:05:50 -0500 Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model.
In-Reply-To: Message-ID: <200911120405.nAC45oR5016526@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2943 TallPaulInJax at yahoo.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1395 is|0 |1 obsolete| | ------- Comment #9 from TallPaulInJax at yahoo.com 2009-11-11 23:05 EST ------- (From update of attachment 1395) See #1396 instead. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Nov 15 22:20:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 15 Nov 2009 22:20:11 -0500 Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model. In-Reply-To: Message-ID: <200911160320.nAG3KBh6021138@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2943 TallPaulInJax at yahoo.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1396 is|0 |1 obsolete| | ------- Comment #10 from TallPaulInJax at yahoo.com 2009-11-15 22:20 EST ------- Created an attachment (id=1398) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1398&action=view) MMCIFlex.py all in python. Slight bug fix from previous upload. Oops! Forgot to strip ';' out of the first semi-colon line found! :-) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
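For readers following Bug 2943, here is a minimal sketch of how a pure-Python tokenizer might handle semicolon-delimited text fields and .cif.gz input. It is not the MMCIFlex.py attached to the bug: the names open_cif and iter_tokens are hypothetical, and inline comments, loop_ handling, and any tokens following a closing semicolon are deliberately ignored.

```python
import gzip
import re

# A token is a quoted string that ends before whitespace, or a bare word.
_TOKEN = re.compile(r"""'(.*?)'(?=\s|$)|"(.*?)"(?=\s|$)|(\S+)""")


def open_cif(path):
    """Open a plain or gzip-compressed CIF file in text mode (hypothetical helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)


def iter_tokens(lines):
    """Yield tokens one at a time from an iterable of CIF lines."""
    text_field = None  # collects the lines of an open ';' text field
    for line in lines:
        line = line.rstrip("\r\n")
        if text_field is not None:
            if line.startswith(";"):
                # Closing semicolon: emit the whole field as one token.
                yield "\n".join(text_field)
                text_field = None
            else:
                text_field.append(line)
        elif line.startswith(";"):
            # Opening semicolon: strip the ';' from this first line.
            text_field = [line[1:]]
        elif line.startswith("#"):
            continue  # whole-line comment
        else:
            for match in _TOKEN.finditer(line):
                yield next(g for g in match.groups() if g is not None)


example = [
    "_struct.title\n",
    ";3D STRUCTURE OF\n",
    "ABETA FIBRILS\n",
    ";\n",
    "_x.y 'a b' c\n",
]
for token in iter_tokens(example):
    print(repr(token))
```

Note that the leading ';' is removed from the first line of the text field, which is the slip mentioned in Comment #10.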
From peter at maubp.freeserve.co.uk Mon Nov 16 10:04:27 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Mon, 16 Nov 2009 15:04:27 +0000 Subject: [Biopython-dev] [blast-announce] BLAST 2.2.22 now available In-Reply-To: <320fb6e00911040940h48f88ch87ad9a22d79b4aa3@mail.gmail.com> References: <33559E80-E78D-4CCB-8E8C-79C36E89C007@ncbi.nlm.nih.gov> <320fb6e00910200456mbac8d28ra1385c102b899c9a@mail.gmail.com> <320fb6e00911040940h48f88ch87ad9a22d79b4aa3@mail.gmail.com> Message-ID: <320fb6e00911160704l2ba12596t7b6bf4d69511e9af@mail.gmail.com> On Wed, Nov 4, 2009 at 5:40 PM, Peter wrote: > On Tue, Oct 20, 2009 at 11:56 AM, Peter wrote: >> Hi all, >> >> The new NCBI BLAST tools are out now, and I'd only just updated >> my desktop to BLAST 2.2.21 this morning! >> >> It looks like the "old style" blastall etc (which are written in C) are >> much the same, but we will need to add Bio.Blast.Applications >> wrappers for the new "BLAST+" tools (written in C++). > > The bulk of that work is done in the main repository now. > > However, we still need to go through all the tools and confirm > all their arguments are included. This could be partly automated > since the BLAST help output is nicely formatted... Done - with a basic unit test which confirms the list of arguments the NCBI tools report via -help matches those we handle via the wrapper. We still need to clarify how the -soft_masking and -use_index options should work (i.e. do they take an argument or not), but otherwise the wrapper code should be fit for testing (and we can update the tutorial).
Peter From bugzilla-daemon at portal.open-bio.org Mon Nov 16 18:14:00 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 16 Nov 2009 18:14:00 -0500 Subject: [Biopython-dev] [Bug 2948] New: _parse_pdb_header_list: bug in TITLE handling Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2948 Summary: _parse_pdb_header_list: bug in TITLE handling Product: Biopython Version: 1.52 Platform: PC OS/Version: Windows XP Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: TallPaulInJax at yahoo.com parse_pdb_header.py _parse_pdb_header_list Hi, 1. If the TITLE in a PDB begins with a number, the parse_pdb_header_list method is stripping the prefixed number from the title, I believe because the regex written did not expect this. So the TITLE line: TITLE 3D STRUCTURE OF ALZHEIMER'S ABETA(1-42) FIBRILS becomes: " D STRUCTURE OF ALZHEIMER'S ABETA(1-42) FIBRILS" 2. ... or it should, but it doesn't. This is because for some reason the title is converted to lower case. So it actually becomes: " d structure of alzheimer's abeta(1-42) fibrils" This is fixed by changing the line of code: name=_chop_end_codes(tail).lower() to: name=_chop_end_codes(tail) I don't have a solution for problem #1. Frankly, I think the (whole, or most all of the) method should be re-written to use positional stripping, ie, line[X:Y].strip(). Paul -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Nov 16 18:19:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 16 Nov 2009 18:19:15 -0500 Subject: [Biopython-dev] [Bug 2949] New: _parse_pdb_header_list: REVDAT is for oldest entry. 
Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2949 Summary: _parse_pdb_header_list: REVDAT is for oldest entry. Product: Biopython Version: 1.52 Platform: PC OS/Version: Windows XP Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: TallPaulInJax at yahoo.com Hi, I don't know if this is considered a bug or a feature, but currently _parse_pdb_header_list in parse_pdb_header.py is grabbing the least recent date when I believe it should be grabbing the MOST current date. Here is the current code: elif key=="REVDAT": rr=re.search("\d\d-\w\w\w-\d\d",tail) if rr!=None: dict['release_date']=_format_date(_nice_case(rr.group())) And here is the fix, with additional REVDAT components added (can't hurt! :-) ): elif key=="REVDAT": #Modified by Paul T. Bathen to get most recent date instead of oldest date. #Also added additional dict entries if dict['release_date'] == "1909-01-08": #set in init rr=re.search("\d\d-\w\w\w-\d\d",tail) if rr!=None: dict['release_date']=_format_date(_nice_case(rr.group())) dict['mod_number'] = hh[7:10].strip() dict['mod_id'] = hh[23:28].strip() dict['mod_type'] = hh[31:32].strip() Paul -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
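The positional slicing suggested in Bug 2948, applied to the TITLE and REVDAT records from these two reports, could look roughly like the sketch below. The column ranges are taken from the PDB file format description as assumed here (TITLE text in columns 11-80; REVDAT modNum in 8-10, continuation in 11-12, date in 14-22, ID code in 24-27, modType in 32). The function names, the ID code 1ABC and the REVDAT dates are invented for illustration; this is not the code in parse_pdb_header.py.

```python
def parse_title(lines):
    """Join TITLE records by column, keeping case and any leading digits."""
    parts = [line[10:80].strip() for line in lines if line.startswith("TITLE")]
    return " ".join(parts)


def latest_revdat(lines):
    """Return the fields of the REVDAT record with the highest modNum."""
    best = None
    for line in lines:
        if not line.startswith("REVDAT"):
            continue
        if line[10:12].strip():
            continue  # continuation of a modification already seen
        entry = {
            "mod_number": int(line[7:10]),
            "release_date": line[13:22].strip(),
            "mod_id": line[23:27].strip(),
            "mod_type": line[31:32].strip(),
        }
        if best is None or entry["mod_number"] > best["mod_number"]:
            best = entry
    return best


# Sample records are built with format strings so the columns are explicit.
sample = [
    "TITLE".ljust(10) + "3D STRUCTURE OF ALZHEIMER'S ABETA(1-42) FIBRILS",
    "REVDAT %3d   %s %s    %d" % (2, "24-FEB-09", "1ABC", 1),
    "REVDAT %3d   %s %s    %d" % (1, "22-NOV-05", "1ABC", 0),
]
print(parse_title(sample))
print(latest_revdat(sample))
```

Choosing by modNum rather than by "first match wins" means the result does not depend on the order of the REVDAT records in the file.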
From bugzilla-daemon at portal.open-bio.org Tue Nov 17 00:35:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 00:35:45 -0500 Subject: [Biopython-dev] [Bug 2950] New: Bio.PDBIO.save writes MODEL records without model id Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2950 Summary: Bio.PDBIO.save writes MODEL records without model id Product: Biopython Version: Not Applicable Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: barry_finzel at yahoo.com The MODEL record format for PDB files has an integer model identifier (e.g., "MODEL 1") not currently written to output. Files read (Bio.PDB.PDBIO.PDBParser.get_structure) and then immediately written back out have MODEL records lacking any ID, even though a model id is stored. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Nov 17 01:06:02 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 01:06:02 -0500 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <200911170606.nAH662Vc032551@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 ------- Comment #1 from TallPaulInJax at yahoo.com 2009-11-17 01:06 EST ------- Hi Barry, FYI: PDBParser also starts the model numbering at model 0 no matter what the true model number is in the PDB file. I am going to file that as a bug as well. Look at line 106 and 122 here: http://github.com/biopython/biopython/blob/master/Bio/PDB/PDBParser.py#L122 Wanted to let you know! 
This would be REALLY problematic if the actual model numbers in the PDB file skip around, i.e., MODEL 2, then MODEL 4, etc. I don't know if that's possible, I'm just a comp sci guy! Do you know? Paul From bugzilla-daemon at portal.open-bio.org Tue Nov 17 01:11:32 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 01:11:32 -0500 Subject: [Biopython-dev] [Bug 2951] New: PDBParser assigns model 0 to first model no matter what... Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2951 Summary: PDBParser assigns model 0 to first model no matter what... Product: Biopython Version: 1.52 Platform: PC OS/Version: Windows XP Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: TallPaulInJax at yahoo.com I'm not sure if this is a bug or a feature, but PDBParser assigns the first model it sees as model 0 then increments that. This means someone thinking they are studying model X is actually studying X+1, and that assumes that authors always use sequential model numbers without skips. If authors CAN skip model numbers, i.e., MODEL 2, then MODEL 4, then MODEL 5... then in biopython these would be models 0, 1, and 2 in the structure... yuck. If this needs to be maintained for posterity, I would suggest adding another field to capture the TRUE model number if it exists. See lines 106 and 122 here: http://github.com/biopython/biopython/blob/master/Bio/PDB/PDBParser.py#106 Paul
From bugzilla-daemon at portal.open-bio.org Tue Nov 17 05:31:52 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 05:31:52 -0500 Subject: [Biopython-dev] [Bug 2951] PDBParser assigns model 0 to first model no matter what... In-Reply-To: Message-ID: <200911171031.nAHAVqHY010394@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2951 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-17 05:31 EST ------- We'd have to get Thomas' opinion (the original author), but I would say from a programming point of view having the model indices follow Python norms is very useful (i.e. start at 0 and increment). Consider looping operations based on the number of models etc. I would therefore prefer to see the "reported" model number given as a separate (independent) field from the existing "model index" (assigned incrementally). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
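The idea in Comment #1, keeping the zero-based model index while recording the reported model number separately, might be sketched as follows. This is plain Python rather than Bio.PDB code; the function name index_models and the field name serial_num are hypothetical, and the serial number is assumed to sit in columns 11-14 of the MODEL record.

```python
def index_models(lines):
    """Pair each MODEL record with a 0-based index and its reported serial."""
    models = []
    for line in lines:
        if line.startswith("MODEL"):
            text = line[10:14].strip()
            models.append({
                "index": len(models),  # Python-style position, as at present
                "serial_num": int(text) if text else None,  # as written in the file
            })
    return models


# Models numbered 2, 4 and 5 in the file still get indices 0, 1 and 2.
records = ["MODEL     %4d" % n for n in (2, 4, 5)]
for model in index_models(records):
    print(model["index"], model["serial_num"])
```

Existing code that relies on structure[0] would keep working, while the reported number stays available for anyone who needs it.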
From bugzilla-daemon at portal.open-bio.org Tue Nov 17 05:36:57 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 05:36:57 -0500 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <200911171036.nAHAavZ9010514@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn| |2951 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-17 05:36 EST ------- In order to write out MODEL records with an ID, we probably need to store the ID on parsing - therefore marking this as depending on Bug 2951 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Nov 17 05:37:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 05:37:12 -0500 Subject: [Biopython-dev] [Bug 2951] PDBParser assigns model 0 to first model no matter what... In-Reply-To: Message-ID: <200911171037.nAHAbC1v010530@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2951 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2950 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
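To illustrate the dependency between the two bugs: once a reported model number is stored at parse time, writing it back out is a small step. The sketch below is not PDBIO code. Models are represented as plain dictionaries, the key names serial_num and atom_lines are hypothetical, and falling back to index + 1 when no serial was stored is an assumption rather than agreed behaviour.

```python
import io


def write_models(models, handle):
    """Write MODEL/ENDMDL blocks, with the serial right-justified in columns 11-14."""
    for index, model in enumerate(models):
        serial = model.get("serial_num")
        if serial is None:
            serial = index + 1  # assumed fallback when nothing was stored
        handle.write("MODEL     %4d\n" % serial)
        for atom_line in model["atom_lines"]:
            handle.write(atom_line + "\n")
        handle.write("ENDMDL\n")


# A structure holding only the sixth model of some entry keeps its number.
out = io.StringIO()
write_models([{"serial_num": 6, "atom_lines": []}], out)
print(out.getvalue())
```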
From biopython at maubp.freeserve.co.uk Tue Nov 17 07:16:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Nov 2009 12:16:38 +0000 Subject: [Biopython-dev] Updating http://biopython.open-bio.org/SRC/biopython/ Message-ID: <320fb6e00911170416t4e58d4bpf9d7525262b4fec6@mail.gmail.com> Dear all, Back when Biopython used CVS, we had an hourly checkout of the code published here: http://www.biopython.org/SRC/biopython/ http://biopython.open-bio.org/SRC/biopython/ I did ask how this was setup (which username etc) but no one got back to me. In any case, this can now be turned off - along with making the Biopython CVS read only (OBF support call #857). I've managed to get this working again from the github repository, although the route isn't ideal (and is all running under my username): (1) At quarter past the hour, a cron job on dev.open-bio.org does a "git pull" to update a local copy of the repository to that on github. This doesn't need any passwords. (2) At 25mins past the hour, a cron job on www.biopython.org does an rsync to get a copy of the repository from dev.open-bio.org (using an SSH key for access). I tried having a single job running on dev.open-bio.org to push the files to www.biopython.org but the host policies seem to block that. It would be simpler to have everything done on www.biopython.org, which would require git to be installed on that machine. This would avoid any SSH security problems. Does that seem like a better idea? 
Thanks, Peter From bartek at rezolwenta.eu.org Tue Nov 17 07:42:24 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 17 Nov 2009 13:42:24 +0100 Subject: [Biopython-dev] Updating http://biopython.open-bio.org/SRC/biopython/ In-Reply-To: <320fb6e00911170416t4e58d4bpf9d7525262b4fec6@mail.gmail.com> References: <320fb6e00911170416t4e58d4bpf9d7525262b4fec6@mail.gmail.com> Message-ID: <8b34ec180911170442v31a4e48do196d7d6ae95b725d@mail.gmail.com> On Tue, Nov 17, 2009 at 1:16 PM, Peter wrote: > Dear all, > > Back when Biopython used CVS, we had an hourly checkout of > the code published here: > http://www.biopython.org/SRC/biopython/ > http://biopython.open-bio.org/SRC/biopython/ > > I did ask how this was setup (which username etc) but no one got > back to me. In any case, this can now be turned off - along with > making the Biopython CVS read only (OBF support call #857). > > I've managed to get this working again from the github repository, > although the route isn't ideal (and is all running under my username): Great! > > (1) At quarter past the hour, a cron job on dev.open-bio.org does > a "git pull" to update a local copy of the repository to that on github. > This doesn't need any passwords. > > (2) At 25mins past the hour, a cron job on www.biopython.org does > an rsync to get a copy of the repository from dev.open-bio.org (using > an SSH key for access). > > I tried having a single job running on dev.open-bio.org to push the > files to www.biopython.org but the host policies seem to block that. that's a pity. In case you haven't thought about it, I would only suggest to use some sort of a lockfile: In case github is very slow (happens every so often to me) it might occur that the second job will start before the first one is finished. Leading potentially to a broken repository for download. 
If the first job would create a lockfile before it starts and would delete it after it's done, the second job could condition the rsync operation on the existence of the file. This way, we could have delays in syncing, rather than a potentially broken repo. Of course, if we move to a simpler setup using only one host, this would not be needed. > > It would be simpler to have everything done on www.biopython.org, > which would require git to be installed on that machine. This would > avoid any SSH security problems. Does that seem like a better idea? > indeed, this would be best, and it only requires someone with root privileges to install a package. Also, it would be cool if the biopython user was somehow unlocked. (Currently no-one seems to have the password...) I don't have an account on the www.biopython.org server, so I can't help much at this point... cheers Bartek From bugzilla-daemon at portal.open-bio.org Tue Nov 17 08:13:17 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 08:13:17 -0500 Subject: [Biopython-dev] [Bug 2951] PDBParser assigns model 0 to first model no matter what... In-Reply-To: Message-ID: <200911171313.nAHDDHvS015696@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2951 ------- Comment #2 from TallPaulInJax at yahoo.com 2009-11-17 08:13 EST ------- "Consider looping operations based on the number of models etc." Wouldn't that just be: for m in struc.get_list() or to count: len(struc.get_list()) "I would therefore prefer to see the "reported" model number given as a separate (independent) field from the existing "model index" (assigned incrementally)." Sounds good to me!!! :-)
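Going back to the lockfile suggestion in Bartek's message above, a minimal sketch of the idea is given here in Python for consistency with the rest of the list, although the real jobs are cron entries. It assumes both jobs can see the same lock path, which is not true of the two-host setup described without some extra step; the function names and the lock file name are hypothetical.

```python
import os
import tempfile


def acquire_lock(path):
    """Create the lockfile atomically; return False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
    os.close(fd)
    return True


def release_lock(path):
    """Remove the lockfile once the job has finished."""
    os.remove(path)


lock = os.path.join(tempfile.mkdtemp(), "snapshot.lock")
if acquire_lock(lock):
    try:
        pass  # the "git pull" step would run here
    finally:
        release_lock(lock)
else:
    print("previous update still running, skipping this run")
```

The second job would skip its rsync while the lockfile is present, giving a delayed snapshot rather than a half-updated one.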
From chapmanb at 50mail.com Tue Nov 17 08:12:44 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 17 Nov 2009 08:12:44 -0500 Subject: [Biopython-dev] Updating http://biopython.open-bio.org/SRC/biopython/ In-Reply-To: <320fb6e00911170416t4e58d4bpf9d7525262b4fec6@mail.gmail.com> References: <320fb6e00911170416t4e58d4bpf9d7525262b4fec6@mail.gmail.com> Message-ID: <20091117131244.GD68691@sobchak.mgh.harvard.edu> Hi Peter; Do we need to keep replicating this now that we've got GitHub? That gives us anonymous checkouts of the code, and on demand tarball generation. It seems like we could save a few cycles and worries by eliminating this. Brad > Back when Biopython used CVS, we had an hourly checkout of > the code published here: > http://www.biopython.org/SRC/biopython/ > http://biopython.open-bio.org/SRC/biopython/ > > I did ask how this was setup (which username etc) but no one got > back to me. In any case, this can now be turned off - along with > making the Biopython CVS read only (OBF support call #857). > > I've managed to get this working again from the github repository, > although the route isn't ideal (and is all running under my username): > > (1) At quarter past the hour, a cron job on dev.open-bio.org does > a "git pull" to update a local copy of the repository to that on github. > This doesn't need any passwords. > > (2) At 25mins past the hour, a cron job on www.biopython.org does > an rsync to get a copy of the repository from dev.open-bio.org (using > an SSH key for access). > > I tried having a single job running on dev.open-bio.org to push the > files to www.biopython.org but the host policies seem to block that. > > It would be simpler to have everything done on www.biopython.org, > which would require git to be installed on that machine. This would > avoid any SSH security problems. Does that seem like a better idea? 
> > Thanks, > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From bugzilla-daemon at portal.open-bio.org Tue Nov 17 08:22:47 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 08:22:47 -0500 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <200911171322.nAHDMlvj016137@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 ------- Comment #3 from TallPaulInJax at yahoo.com 2009-11-17 08:22 EST ------- The offending lines of code are here in PDBIO: http://github.com/biopython/biopython/blob/master/Bio/PDB/PDBIO.py#L129 if model_flag: fp.write("MODEL \n") For now, this should just be changed to: if model_flag: fp.write("MODEL %s\n" %model.id) When the 'reported' model number ID is added, I believe model ID would best be replaced with the new field above. Paul -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Nov 17 09:40:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Nov 2009 14:40:11 +0000 Subject: [Biopython-dev] Updating http://biopython.open-bio.org/SRC/biopython/ In-Reply-To: <20091117131244.GD68691@sobchak.mgh.harvard.edu> References: <320fb6e00911170416t4e58d4bpf9d7525262b4fec6@mail.gmail.com> <20091117131244.GD68691@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00911170640q4e39d1f0p9b8375dd47de0ae1@mail.gmail.com> On Tue, Nov 17, 2009 at 1:12 PM, Brad Chapman wrote: > > Hi Peter; > Do we need to keep replicating this now that we've got GitHub? That > gives us anonymous checkouts of the code, and on demand tarball > generation. 
It seems like we could save a few cycles and worries by > eliminating this. > > Brad It isn't essential, no. On the other hand, lots of bits of documentation point there (e.g. for the Biopython license). Peter From biopython at maubp.freeserve.co.uk Tue Nov 17 09:46:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Nov 2009 14:46:44 +0000 Subject: [Biopython-dev] Updating http://biopython.open-bio.org/SRC/biopython/ In-Reply-To: <8b34ec180911170442v31a4e48do196d7d6ae95b725d@mail.gmail.com> References: <320fb6e00911170416t4e58d4bpf9d7525262b4fec6@mail.gmail.com> <8b34ec180911170442v31a4e48do196d7d6ae95b725d@mail.gmail.com> Message-ID: <320fb6e00911170646x280014dfq1f91a3f7bf6b31c1@mail.gmail.com> On Tue, Nov 17, 2009 at 12:42 PM, Bartek Wilczynski wrote: > On Tue, Nov 17, 2009 at 1:16 PM, Peter wrote: >> Dear all, >> >> Back when Biopython used CVS, we had an hourly checkout of >> the code published here: >> http://www.biopython.org/SRC/biopython/ >> http://biopython.open-bio.org/SRC/biopython/ >> >> I did ask how this was setup (which username etc) but no one got >> back to me. In any case, this can now be turned off - along with >> making the Biopython CVS read only (OBF support call #857). >> >> I've managed to get this working again from the github repository, >> although the route isn't ideal (and is all running under my username): > > Great! >> >> (1) At quarter past the hour, a cron job on dev.open-bio.org does >> a "git pull" to update a local copy of the repository to that on github. >> This doesn't need any passwords. >> >> (2) At 25mins past the hour, a cron job on www.biopython.org does >> an rsync to get a copy of the repository from dev.open-bio.org (using >> an SSH key for access). >> >> I tried having a single job running on dev.open-bio.org to push the >> files to www.biopython.org but the host policies seem to block that. > > that's a pity. 
In case you haven't thought about it, I would only > suggest to use some sort of a lockfile: In case github is very slow > (happens every so often to me) it might occur that the second job will > start before the first one is finished. Leading potentially to a > broken repository for download. If the first job would create a > lockfile before it starts and would delete it after it's done, the > second job could condition the rsync operation on the existence of the > file. This way, we could have delays in syncing, rather than > a potentially broken repo. Of course, if we move to a simpler setup > using only one host, this would not be needed. Note we're not (yet) making a clone-able repository available on www.biopython.org - this is just a simple hourly snapshot of the source code (without the .git folder). A lock file might be possible but seems overly complicated. I guess I could increase the time delay if you are worried about github being slow sometimes. >> It would be simpler to have everything done on www.biopython.org, >> which would require git to be installed on that machine. This would >> avoid any SSH security problems. Does that seem like a better idea? > > indeed, this would be best, and it only requires someone with root > privileges to install a package. That question was really aimed at the OBF admin team (hence my CC'ing this to root-l), in case they have any better ideas. > Also, it would be cool if the biopython user was somehow unlocked. > (Currently no-one seems to have the password...) I'm sure an OBF admin can reset it. Again, this is CC'd to root-l.
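For what it's worth, the lockfile idea Bartek describes could be sketched roughly as below. This is only an illustration in Python, not the actual cron jobs; the function names and the lock path are made up, and the check-then-copy step is not atomic, so it narrows the window rather than closing it.

```python
import os


def update_with_lock(lock_path, update):
    """First job: hold a lock file while the update (e.g. a git pull) runs."""
    open(lock_path, "w").close()  # create the (empty) lock file
    try:
        update()
    finally:
        os.remove(lock_path)  # always release, even if the update fails


def copy_unless_locked(lock_path, copy):
    """Second job: skip this round if the first job is still running."""
    if os.path.exists(lock_path):
        return False  # try again at the next scheduled run
    copy()
    return True
```

A skipped round just means the snapshot is one cycle older, which is the "delay rather than broken copy" trade-off mentioned above.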
Peter From bugzilla-daemon at portal.open-bio.org Tue Nov 17 10:07:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 10:07:55 -0500 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <200911171507.nAHF7sjx019463@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 ------- Comment #4 from barry_finzel at yahoo.com 2009-11-17 10:07 EST ------- Structures coming from the PDB should have sequentially-numbered models - usually starting with one, although the reason I was looking at this was that I would like to extract the models into separate files but still retain information regarding their original source. (So a Structure might contain only MODEL 6, for example). I don't imagine ever having multiple models out of order (e.g. MODEL 4, then 2, then 1, then 8, etc). It would be helpful if the model id on the input record was stored and retrievable. Since code (no doubt) exists with references like struct[0], it might be better to create a new model attribute (model.label?) to store the label as a string, so old indexed references still work. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Nov 17 10:24:51 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 10:24:51 -0500 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <200911171524.nAHFOpvb019880@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 ------- Comment #5 from TallPaulInJax at yahoo.com 2009-11-17 10:24 EST ------- Thanks for the note on whether models can be out of order, skipped, etc!
When the MODEL is written out, I would imagine it should use the model.label or whatever it's named? (I believe it MIGHT be called SerialNo. See here: http://mmcif.pdb.org/dictionaries/pdb-correspondence/pdb2mmcif.html#MODEL Paul -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 18 12:19:36 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 18 Nov 2009 12:19:36 -0500 Subject: [Biopython-dev] [Bug 2495] parse element symbols for ATOM/HETATM records (Bio.PDB.PDBParser) In-Reply-To: Message-ID: <200911181719.nAIHJaXD026880@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2495 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-18 12:19 EST ------- Trunk updated along the lines suggested by Hongbo Zhu. (In reply to comment #1) > IO.save should also write these element types on an output PDB file Leaving bug open to deal with the output as well. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Fri Nov 20 11:11:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 20 Nov 2009 16:11:43 +0000 Subject: [Biopython-dev] Seq object join method Message-ID: <320fb6e00911200811xc2dab0w6b293e02761e2a43@mail.gmail.com> Hello all, Some more code to evaluate, again on a branch in github: http://github.com/peterjc/biopython/commit/c7cd0329061f88e3a8eae0979dd17c54a36ab4e5 This adds a join method to the Seq object, basically an alphabet aware version of the Python string join method. 
Recall that for strings: sep.join([a,b,c]) == a + sep + b + sep + c This leads to a common idiom for concatenating a list of strings, "".join([a,b,c]) == a + "" + b + "" + c == a + b + c That is fine for strings, but not necessarily for Seq objects since even a zero length sequence has an alphabet. Consider this example: >>> from Bio.Seq import Seq >>> from Bio.Alphabet.IUPAC import unambiguous_dna, ambiguous_dna >>> unamb_dna_seq = Seq("ACGT", unambiguous_dna) >>> ambig_dna_seq = Seq("ACRGT", ambiguous_dna) >>> unamb_dna_seq Seq('ACGT', IUPACUnambiguousDNA()) >>> ambig_dna_seq Seq('ACRGT', IUPACAmbiguousDNA()) If we add the ambiguous and unambiguous IUPAC DNA alphabets, we get the ambiguous IUPAC DNA alphabet: >>> unamb_dna_seq + ambig_dna_seq Seq('ACGTACRGT', IUPACAmbiguousDNA()) However, if the default generic alphabet is included, the result is a generic alphabet: >>> unamb_dna_seq + Seq("") + ambig_dna_seq Seq('ACGTACRGT', Alphabet()) Now consider Seq("").join([unamb_dna_seq, ambig_dna_seq]), should it follow the addition behaviour (giving a default alphabet) or "do the sensible thing" and preserve the IUPAC alphabet? As written, Seq("").join(...) is handled as a special case, and the alphabet of the empty string is ignored. To me this is a case of "practicality beats purity", it is much nicer than being forced to do Seq("", ambiguous_dna).join(...) where the empty sequence is given a suitable alphabet. So, what do people think? Peter From bugzilla-daemon at portal.open-bio.org Fri Nov 20 11:15:57 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 20 Nov 2009 11:15:57 -0500 Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string? 
In-Reply-To: Message-ID: <200911201615.nAKGFvuu018944@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2351 ------- Comment #18 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-20 11:15 EST ------- Possible join method for the Seq object outlined here (with code): http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007012.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Fri Nov 20 14:28:42 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 20 Nov 2009 14:28:42 -0500 Subject: [Biopython-dev] Seq object join method In-Reply-To: <320fb6e00911200811xc2dab0w6b293e02761e2a43@mail.gmail.com> References: <320fb6e00911200811xc2dab0w6b293e02761e2a43@mail.gmail.com> Message-ID: <3f6baf360911201128u6b933b6asf8f2dfabbc1f275f@mail.gmail.com> On Fri, Nov 20, 2009 at 11:11 AM, Peter wrote: > > Hello all, > > Some more code to evaluate, again on a branch in github: > http://github.com/peterjc/biopython/commit/c7cd0329061f88e3a8eae0979dd17c54a36ab4e5 > > This adds a join method to the Seq object, basically an alphabet > aware version of the Python string join method. Recall that for > strings: > > sep.join([a,b,c]) == a + sep + b + sep + c > > This leads to a common idiom for concatenating a list of strings, > > "".join([a,b,c]) == a + "" + b + "" + c == a + b + c > > [...] > > Now consider Seq("").join([unamb_dna_seq, ambig_dna_seq]), > should it follow the addition behaviour (giving a default alphabet) > or "do the sensible thing" and preserve the IUPAC alphabet? > > As written, Seq("").join(...) is handled as a special case, and > the alphabet of the empty string is ignored. To me this is a > case of "practicality beats purity", it is much nicer than being > forced to do Seq("", ambiguous_dna).join(...) 
where the empty > sequence is given a suitable alphabet. > > So, what do people think? > > Peter > Thoughts: 1. Why doesn't Alphabet._consensus_alphabet raise a TypeError("Incompatable alphabets") where _check_type_compatibility would fail, at least as an optional argument? Probably because it's a private function. Should it be a public function, with a friendlier interface? 2. This might cause massive compatibility problems now, but would it be better for Seq() to use an "unknown_alphabet" by default instead of "generic"? Then _consensus_alphabet could safely ignore those sequences with unspecified alphabets, and Seq.join wouldn't need that special case. 3. Alternatively, how much code would break if _consensus_alphabet simply treated generic_alphabet as an unspecified sequence, and ignored it when calculating the consensus alphabet? This effect could be limited to just Seq.join by dropping the test that the sequence length is 0, but it might be useful to have the same behavior for addition. Cheers, Eric From sbassi at clubdelarazon.org Sat Nov 21 09:31:44 2009 From: sbassi at clubdelarazon.org (Sebastian Bassi) Date: Sat, 21 Nov 2009 11:31:44 -0300 Subject: [Biopython-dev] Seq object join method In-Reply-To: <320fb6e00911200811xc2dab0w6b293e02761e2a43@mail.gmail.com> References: <320fb6e00911200811xc2dab0w6b293e02761e2a43@mail.gmail.com> Message-ID: <9e2f512b0911210631n429aa3det3ac60412b11e2e70@mail.gmail.com> On Fri, Nov 20, 2009 at 1:11 PM, Peter wrote: > Now consider Seq("").join([unamb_dna_seq, ambig_dna_seq]), > should it follow the addition behaviour (giving a default alphabet) > or "do the sensible thing" and preserve the IUPAC alphabet? .... > So, what do people think? From my perspective, I like consistency, so I think that if you want to preserve the IUPAC alphabet, you should also state the alphabet in the separator sequence.
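To make the two behaviours concrete, here is a toy sketch using plain (string, alphabet) tuples. It is not the Bio.Seq or Bio.Alphabet API: the consensus rule is deliberately simplified (identical alphabets are kept, any mismatch falls back to generic), and the names join, consensus and ignore_empty_sep are invented for illustration.

```python
GENERIC = "generic"


def consensus(alphabets):
    """Toy rule: keep the alphabet if all agree, else fall back to generic."""
    unique = set(alphabets)
    return unique.pop() if len(unique) == 1 else GENERIC


def join(sep, sep_alphabet, seqs, ignore_empty_sep=True):
    """Join (string, alphabet) pairs; returns one (string, alphabet) pair."""
    alphabets = [alphabet for _, alphabet in seqs]
    # The special case under discussion: an empty separator adds no letters,
    # so its alphabet can optionally be left out of the consensus.
    if not (ignore_empty_sep and sep == ""):
        alphabets.append(sep_alphabet)
    return sep.join(text for text, _ in seqs), consensus(alphabets)


seqs = [("ACGT", "dna"), ("ACRGT", "dna")]
special_cased = join("", GENERIC, seqs)                      # keeps "dna"
consistent = join("", GENERIC, seqs, ignore_empty_sep=False)  # generic
explicit = join("", "dna", seqs, ignore_empty_sep=False)      # keeps "dna"
```

With ignore_empty_sep=False the caller has to name the alphabet on the separator to keep it, which is the consistent option; the default shows the special-cased shortcut.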
From biopython at maubp.freeserve.co.uk Mon Nov 23 05:34:31 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 10:34:31 +0000 Subject: [Biopython-dev] Seq object join method In-Reply-To: <3f6baf360911201128u6b933b6asf8f2dfabbc1f275f@mail.gmail.com> References: <320fb6e00911200811xc2dab0w6b293e02761e2a43@mail.gmail.com> <3f6baf360911201128u6b933b6asf8f2dfabbc1f275f@mail.gmail.com> Message-ID: <320fb6e00911230234s3360b563ye51c825ae62decde@mail.gmail.com> On Fri, Nov 20, 2009 at 7:28 PM, Eric Talevich wrote: > > Thoughts: > > 1. Why doesn't Alphabet._consensus_alphabet raise a > TypeError("Incompatable alphabets") where _check_type_compatibility > would fail, at least as an optional argument? Probably because it's a > private function. Should it be a public function, with a friendlier > interface? It is a private function, and right now I don't recall my precise thinking. The assorted private functions in Bio.Alphabet were to extract some commonly repeated actions for reuse (e.g. in the alignment code) while preserving backwards compatibility where possible, and fixing bugs as needed. I agree some of these are candidates for being made public, but this is a lower priority for me. I am also not sure if functions are the best way to do some of these tasks - Alphabet methods may be better. > 2. This might cause massive compatibility problems now, but would it > be better for Seq() to use an "unknown_alphabet" by default instead of > "generic"? Then _consensus_alphabet could safely ignore those > sequences with unspecified alphabets, and Seq.join wouldn't need that > special case. The base class generic alphabet *is* the "unknown alphabet". > 3. Alternately, how much code would break if _consensus_alphabet > simply treated generic_alphabet as an unspecified sequence, and > ignored it when calculating the consensus alphabet? 
This effect could > be limited to just Seq.join by dropping the test that the sequence > length is 0, but it might be useful to have the same behavior for > addition. I don't know specifically what would break, but that seems too permissive to me. The Seq("").join(...) seems like a special case to me as it fits the Python "".join(...) idiom for concatenating a list of strings. Peter From biopython at maubp.freeserve.co.uk Mon Nov 23 05:44:14 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 10:44:14 +0000 Subject: [Biopython-dev] Seq object join method In-Reply-To: <9e2f512b0911210631n429aa3det3ac60412b11e2e70@mail.gmail.com> References: <320fb6e00911200811xc2dab0w6b293e02761e2a43@mail.gmail.com> <9e2f512b0911210631n429aa3det3ac60412b11e2e70@mail.gmail.com> Message-ID: <320fb6e00911230244q53a37b93xbf52a33f3a9e14f7@mail.gmail.com> On Sat, Nov 21, 2009 at 2:31 PM, Sebastian Bassi wrote: > On Fri, Nov 20, 2009 at 1:11 PM, Peter wrote: >> Now consider Seq("").join([unamb_dna_seq, ambig_dna_seq]), >> should it follow the addition behaviour (giving a default alphabet) >> or "do the sensible thing" and preserve the IUPAC alphabet? >> .... >> So, what do people think? > > From my perspective, I like consistency, so I think that if you want > to preserve the IUPAC alphabet, you should state the alphabet also in > the separator sequence. If you have a list of Seq objects with an IUPAC alphabet, then yes, you could concatenate them using: result = Seq("",the_known_IUPAC_alphabet).join(the_list_of_seqs) But what if you are writing a stand alone function taking Seq arguments of unknown alphabet? If you want to preserve the alphabet (and I would), you would be forced to do something nasty like this: result = Seq("",the_list_of_seqs[0].alphabet).join(the_list_of_seqs) or simply (as now) avoid using the join method completely, e.g. 
result = the_list_of_seqs[0] for seq in the_list_of_seqs[1:] : result += seq Neither of these have the clarity of: result = Seq("").join(the_list_of_seqs) To me, part of the issue here is that the use of "".join(list_of_strings) in plain Python has always taken a bit of getting used to. It isn't very intuitive - the old join function in the string module was in some ways more natural. Maybe we need to add a Bio.Seq module join function? e.g. def join(words, sep=None) : ... While the Python string module join had the separator defaulting to the empty string, here we can be explicit that by default there is no separator sequence by default, therefore no extra alphabet to worry about. However, while using a join function lets us avoid the separator alphabet issue, it isn't object orientated, and does not match the Python string object very well. Peter From bugzilla-daemon at portal.open-bio.org Mon Nov 23 06:37:46 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 23 Nov 2009 06:37:46 -0500 Subject: [Biopython-dev] [Bug 2597] Enforce alphabet letters in Seq objects In-Reply-To: Message-ID: <200911231137.nANBbkrO032437@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2597 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-23 06:37 EST ------- As recently noted on the mailing list, making the Seq alphabet check strict could be useful for file format validation with Bio.SeqIO or Bio.AlignIO, since the parse and read functions can be given an alphabet. e.g. 
While this would be allowed: from Bio import SeqIO from Bio.Alphabet.IUPAC import extended_protein from Bio.Alphabet import Gapped from StringIO import StringIO fasta_str = "\n\n\n>ID\nABCDEFGH-IPX\n" record = SeqIO.read(StringIO(fasta_str), "fasta", Gapped(extended_protein, "-")) If the Seq object checked the alphabet letters, this would fail due to the minus sign: >>> record = SeqIO.read(StringIO(fasta_str), "fasta", extended_protein) If the user doesn't care about the precise letters, they can use the default generic alphabet, e.g. >>> record = SeqIO.read(StringIO(fasta_str), "fasta") or, to at least specify this is a protein sequence: >>> from Bio.Alphabet import generic_protein >>> record = SeqIO.read(StringIO(fasta_str), "fasta", generic_protein) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Nov 23 09:43:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 14:43:28 +0000 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? Message-ID: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> Dear all, Is there anyone on the dev mailing list willing to test the SFF support I've been working on for Bio.SeqIO? The code is here, a branch on github: http://github.com/peterjc/biopython/tree/sff-seqio The important files are: * Bio/SeqIO/SffIO.py * Bio/SeqIO/__init__.py (defining the new format) * Bio/SeqIO/_index.py (indexing SFF files) Plus unit test files: * Tests/run_tests.py (to run the doctests) * Tests/test_SeqIO_QualityIO.py * Tests/test_SeqIO_index.py * Tests/test_SeqIO.py * Tests/Roche/* (for unit tests) Sebastian Bassi had a look last month and his feedback has already helped (e.g. 
with error messages): http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006903.html I have been using this code myself in real work, for example editing the trim points in an SFF file to take into account PCR primer sequences, and filtering SFF reads, checking Roche barcodes etc. Thanks, Peter From biopython at maubp.freeserve.co.uk Mon Nov 23 16:49:31 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 21:49:31 +0000 Subject: [Biopython-dev] PubMed E-Utility 2010 DTD changes In-Reply-To: <7B6F170840CA6C4DA63EE0C8A7BB43EC099627A4@NIHCESMLBX15.nih.gov> References: <7B6F170840CA6C4DA63EE0C8A7BB43EC099627A4@NIHCESMLBX15.nih.gov> Message-ID: <320fb6e00911231349i770a89cdrfd4341b3731d1b1c@mail.gmail.com> Hi all, See below - it looks like there are two new DTD files to add to Bio.Entrez. Peter ---------- Forwarded message ---------- From: Date: Mon, Nov 23, 2009 at 8:35 PM Subject: [Utilities-announce] PubMed E-Utility 2010 DTD changes To: utilities-announce at ncbi.nlm.nih.gov PubMed E-Utility Users, We anticipate switching to the updated PubMed 2010 DTDs on December 14, 2009. 2010 DTDs are available from the Entrez DTD page: http://eutils.ncbi.nlm.nih.gov/corehtml/query/DTD/ The DTD changes for the 2010 production year, as noted in the Revision Notes section near the top of each DTD, include: NLM MedlineCitationSet DTD used for MEDLINE/PubMed XML data files: 1. The CommentsCorrections group of elements was reorganized. Through the 2009 data year, the publications cited in CommentsCorrections were defined as elements. For the 2010 DTD, they are defined as valid values to the RefType attribute, and a new attribute value, Cites, was created. Cites will contain PMIDs and source data for items in the bibliography or list of references at the end of an article that is deposited in PubMed Central (PMC). There is no RefType attribute corresponding to Cites for PMIDs and source data of articles in which a paper is cited.
In the implementation for 2010, RefType = "Cites" will contain only PMIDs and source data for citations where an actual PMID for the cited article exists in the NLM Data Creation and Maintenance System (DCMS). It is therefore possible for a citation to be present in the article's list of references and yet the PMID is not included in the Cites list because it is not present in the NLM DCMS. Cites will be present in the baseline files; however, the subsequent update frequency of Cites lists is not yet determined. Again, all Cites data for this initial implementation are coming from articles in PMC. 2. NameID element was added to Author and Investigator elements. NameID is a possibly multiply-occurring, optional element permissible within the Author (personal and collective) and Investigator elements. It is intended as a unique identifier associated with the name. The value in the NameID attribute Source designates the organizational authority that established the unique identifier. There is no target date for implementation of this field; it is a placeholder for now. Additional information is available from the Announcements to NLM Data Licensees 2010 DTD and XML Changes; File Distribution Schedule Changes: http://www.nlm.nih.gov/bsd/licensee/announce/2009.html#d09_17 _______________________________________________ Utilities-announce mailing list http://www.ncbi.nlm.nih.gov/mailman/listinfo/utilities-announce From biopython at maubp.freeserve.co.uk Tue Nov 24 06:30:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Nov 2009 11:30:04 +0000 Subject: [Biopython-dev] Changing Seq equality Message-ID: <320fb6e00911240330hb1ee2b6mbe75b1433e0ecdcc@mail.gmail.com> Dear all, One thing about the Seq object that still annoys me, and is rather confusing for novices, is the equality testing. It would be nice to "fix" this, but it turns out to be quite complicated due to the way Python works. 
Brad and I did start talking about this a few months ago at BOSC2009, but ran out of time. First, a brief aside about hashes (used in dictionaries and sets). In Python immutable objects can be hashed, via the hash function or a custom __hash__ method. An important detail is that if two objects evaluate as equal, they must have the same hash (otherwise dictionaries break and other bad things happen); the reverse is not required, since unequal objects may share a hash. e.g. >>> hash(1) 1 >>> hash(1.0) 1 >>> hash("1") 1977051568 >>> "1"==1 False >>> 1.0==1 True See also: http://mail.python.org/pipermail/python-dev/2002-December/031455.html In Biopython, the Seq object is immutable (read only) and can be used as a dictionary key. However, we don't implement equality or hashes explicitly, thus get the object default behaviour. This means two Seq objects are only equal if they are the same object in memory. The hash is actually the address in memory: >>> from Bio.Seq import Seq >>> s = Seq("ACGT") >>> id(s) 532624 >>> hash(s) 532624 This means that while a Seq can be used as a dictionary key, the test is for object identity - which is of limited use. Now, the MutableSeq has an "alphabet aware" equality defined. Because these are mutable objects, they don't have a hash, and cannot be used as dictionary keys. This means there are no hash related restrictions on the equality rules. Now, what if the Seq object had a similar "alphabet aware" equality? The problem is if we'd like Seq("ACGT") to be equal to Seq("ACGT", generic_dna) then both must have the same hash. Then, if we also want Seq("ACGT") and Seq("ACGT", generic_protein) to be equal, they too must have the same hash. This means Seq("ACGT", generic_dna) and Seq("ACGT",generic_protein) would have the same hash and, if equality is to stay transitive, must also evaluate as equal (!). The natural consequence of this chain of logic is we would then have Seq("ACGT") == Seq("ACGT", generic_dna) == Seq("ACGT",generic_protein) == Seq("ACGT",...).
You reach the same point if we require the string "ACGT" equals Seq("ACGT", some_alphabet). In other words, another option would be to base Seq equality and hashing on the sequence string only (ignoring the alphabet). This would at least be a simple rule to remember (and would mean we could implement less than, greater than etc in the same way) but basically means we'd ignore the alphabet. So, currently in Biopython, we have object identity. We could have string based identity. I've thought about other options but haven't come up with anything that would be self consistent (and could be hashed). If anyone has an alternative idea, please speak up. I don't know what thought process Jose went through, but he wants to use the same equality test in his code: http://lists.open-bio.org/pipermail/biopython/2009-November/005861.html Changing Seq equality like this would make Biopython much nicer to use for basic tasks. For example, my code (and the unit tests) often contains things like if str(seq1)==str(seq2). If we want to make this change, it is quite a break to backwards compatibility. (It also has the downside that a DNA sequence ACGT and a protein sequence ACGT would evaluate as equal - probably not a big issue in practice but counter intuitive). One way to handle this would be to start by adding explicit Seq __eq__ methods etc which preserve the current behaviour (i.e. act like id(seq1)==id(seq2) based on object identity) but issue a deprecation warning. Then for a series of releases people would be encouraged to use str(seq1)==str(seq2) or id(seq1)==id(seq2) as appropriate. Then, after this transition period, we would change the __eq__ methods to adopt the new behaviour. Or, we could have a Bio.Seq module level switch to control the behaviour - initially defaulting to the current system with a deprecation warning? Peter P.S. As a related point, we will need to switch the MutableSeq from using __cmp__ to __eq__ etc for future Python compatibility.
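A minimal sketch of the string based option, using a toy class rather than Bio.Seq.Seq (ToySeq and its attributes are hypothetical, and the alphabets are just placeholder strings). Equality and the hash both look only at the sequence string, so equal objects always share a hash.

```python
class ToySeq:
    """Toy stand-in for a Seq-like object (not the Biopython class)."""

    def __init__(self, data, alphabet=None):
        self._data = data
        self.alphabet = alphabet  # deliberately ignored by __eq__ and __hash__

    def __str__(self):
        return self._data

    def __eq__(self, other):
        # Compare on the sequence string only; plain strings are accepted too.
        if isinstance(other, (ToySeq, str)):
            return str(self) == str(other)
        return NotImplemented

    def __hash__(self):
        # Same hash as the underlying string, so equal objects hash alike.
        return hash(self._data)


dna = ToySeq("ACGT", "generic_dna")
protein = ToySeq("ACGT", "generic_protein")

counts = {dna: 1}
counts[protein] = counts.get(protein, 0) + 1  # lands on the same key as dna
```

Here the DNA and protein objects collapse onto one dictionary key, which is exactly the counter intuitive side effect noted above.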
From jblanca at btc.upv.es Tue Nov 24 07:31:04 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 24 Nov 2009 13:31:04 +0100 Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord In-Reply-To: <320fb6e00911240253g10986fcfj311c8a2adc12afd5@mail.gmail.com> References: <200911241132.55922.jblanca@btc.upv.es> <320fb6e00911240253g10986fcfj311c8a2adc12afd5@mail.gmail.com> Message-ID: <200911241331.05031.jblanca@btc.upv.es> > It is a reasonable change, but ONLY if all the subclasses support > the same __init__ method, which isn't true. For example, the > Bio.Seq.UnknownSeq subclasses Seq and uses a different __init__ > method signature. This means any change would at a minimum > have to include lots of fixes to the UnknownSeq In this case what I do is to create a new __init__ for the inherited class, like: class SeqWithQuality(SeqRecord): '''A wrapper around Biopython's SeqRecord that adds a couple of convenience methods''' def __init__(self, seq, id = "", name = "", description = "", dbxrefs = None, features = None, annotations = None, letter_annotations = None, qual = None): SeqRecord.__init__(self, seq, id=id, name=name, description=description, dbxrefs=dbxrefs, features=features, annotations=annotations, letter_annotations=letter_annotations) if qual is not None: self.qual = qual def _set_qual(self, qual): '''It stores the quality in the letter_annotations['phred_quality']''' self.letter_annotations["phred_quality"] = qual def _get_qual(self): '''It gets the quality from letter_annotations['phred_quality']''' return self.letter_annotations["phred_quality"] qual = property(_get_qual, _set_qual) def __add__(self, seq2): '''It returns a new object with both seq and qual joined ''' #per letter annotations new_seq = self.__class__(name = self.name + '+' + seq2.name, id = self.id + '+' + seq2.id, seq = self.seq + seq2.seq) #the letter annotations, including quality for name, annot in self.letter_annotations.items(): if name in seq2.letter_annotations: 
new_seq.letter_annotations[name] = annot + \ seq2.letter_annotations[name] return new_seq -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Tue Nov 24 08:10:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Nov 2009 13:10:15 +0000 Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord In-Reply-To: <200911241331.05031.jblanca@btc.upv.es> References: <200911241132.55922.jblanca@btc.upv.es> <320fb6e00911240253g10986fcfj311c8a2adc12afd5@mail.gmail.com> <200911241331.05031.jblanca@btc.upv.es> Message-ID: <320fb6e00911240510n35d0ab5cvdfff0b8375723864@mail.gmail.com> On Tue, Nov 24, 2009 at 12:31 PM, Jose Blanca wrote: > >> It is a reasonable change, but ONLY if all the subclasses support >> the same __init__ method, which isn't true. For example, the >> Bio.Seq.UnknownSeq subclasses Seq and uses a different __init__ >> method signature. This means any change would at a minimum >> have to include lots of fixes to the UnknownSeq > > In this case what I do is to create a new __init__ for the inherited class, > like: > > class SeqWithQuality(SeqRecord): > '''A wrapper around Biopython's SeqRecord that adds a couple of > convenience methods''' > def __init__(self, seq, id = "", name = "", > description = "", dbxrefs = None, > features = None, annotations = None, > letter_annotations = None, qual = None): > SeqRecord.__init__(self, seq, id=id, name=name, > description=description, dbxrefs=dbxrefs, > features=features, annotations=annotations, > letter_annotations=letter_annotations) > if qual is not None: > self.qual = qual > >
def _set_qual(self, qual): > '''It stores the quality in the letter_annotations['phred_quality']''' > self.letter_annotations["phred_quality"] = qual > def _get_qual(self): > '''It gets the quality from letter_annotations['phred_quality']''' > return self.letter_annotations["phred_quality"] > qual = property(_get_qual, _set_qual) I can see how adding a property makes accessing the PHRED qualities much easier. > def __add__(self, seq2): > '''It returns a new object with both seq and qual joined ''' > #per letter annotations > new_seq = self.__class__(name = self.name + '+' + seq2.name, > id = self.id + '+' + seq2.id, > seq = self.seq + seq2.seq) > #the letter annotations, including quality > for name, annot in self.letter_annotations.items(): > if name in seq2.letter_annotations: > new_seq.letter_annotations[name] = annot + \ > seq2.letter_annotations[name] > return new_seq This bit is much less clear to me - you completely ignore any features. Was it written before I added the __add__ method to the original SeqRecord (expected to be in Biopython 1.53)? Anyway - it looks like your SeqRecord subclass should work fine as it is (partly because the SeqRecord has relatively few methods you may need to subclass).
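The __init__ signature concern quoted at the top of this message can be reproduced with toy classes. Base, Tagged and Repeated below are hypothetical stand-ins, not Biopython classes; Repeated is only loosely modelled on the idea of a subclass whose constructor takes a length instead of the sequence data.

```python
class Base:
    def __init__(self, data):
        self.data = data

    def upper_fixed(self):
        # Hard-coded class: always returns a Base, whatever the subclass.
        return Base(self.data.upper())

    def upper_flexible(self):
        # Returns the caller's own class, but assumes __init__(data) works.
        return self.__class__(self.data.upper())


class Tagged(Base):
    """Subclass that keeps the Base __init__ signature."""


class Repeated(Base):
    """Subclass with a different __init__ signature (a length, not data)."""

    def __init__(self, length, character="n"):
        Base.__init__(self, character * length)


same_class = Tagged("acgt").upper_flexible()  # a Tagged instance
downgraded = Tagged("acgt").upper_fixed()     # a plain Base instance

try:
    Repeated(3).upper_flexible()  # passes "NNN" where a length is expected
    flexible_failed = False
except TypeError:
    flexible_failed = True
```

So the flexible form is nicer for subclasses that keep the constructor signature, while the hard-coded form is the one that stays safe when a subclass changes it.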
Peter From jblanca at btc.upv.es Tue Nov 24 09:58:39 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 24 Nov 2009 15:58:39 +0100 Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord In-Reply-To: <320fb6e00911240510n35d0ab5cvdfff0b8375723864@mail.gmail.com> References: <200911241132.55922.jblanca@btc.upv.es> <200911241331.05031.jblanca@btc.upv.es> <320fb6e00911240510n35d0ab5cvdfff0b8375723864@mail.gmail.com> Message-ID: <200911241558.39357.jblanca@btc.upv.es> What I mean is this: http://github.com/JoseBlanca/biopython/commit/d4c87365f614de2d69d800dc63d0cc25087d96dc I would like to replace the hard-coded Seq() and SeqRecord() calls with self.__class__. There are places already in Seq in which self.__class__ is used. But this is not the case in all instances. I think that this would be a reasonable change. > > > > def _set_qual(self, qual): > > '''It stores the quality in the > > letter_annotations['phred_quality']''' > > self.letter_annotations["phred_quality"] = qual > > def _get_qual(self): > > '''It gets the quality from letter_annotations['phred_quality']''' > > return self.letter_annotations["phred_quality"] > > qual = property(_get_qual, _set_qual) > > I can see how adding a property makes accessing the PHRED > qualities much easier. Yes, that's just a convenience method. I used to have my own class with this property and now I'm trying to use SeqRecord instead, adding this method to ease the change. > > def __add__(self, seq2): > > '''It returns a new object with both seq and qual joined ''' > > #per letter annotations > > new_seq = self.__class__(name = self.name + '+' + seq2.name, > > id = self.id + '+' + seq2.id, > > seq = self.seq + seq2.seq) > > #the letter annotations, including quality > > for name, annot in self.letter_annotations.items(): > > if name in seq2.letter_annotations: > >
?new_seq.letter_annotations[name] = annot + \ > > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? seq2.letter_annotations[name] > > ? ? ? ?return new_seq > > This bit is much less clear to me - you completely ignore any > features. Was it written before I added the __add__ method > to the original SeqRecord (expected to be in Biopython 1.53)? Yes, it was added much earliear. I'll remove it as soon as the Biopython SeqRecord has one. -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Tue Nov 24 10:52:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Nov 2009 15:52:25 +0000 Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord In-Reply-To: <200911241558.39357.jblanca@btc.upv.es> References: <200911241132.55922.jblanca@btc.upv.es> <200911241331.05031.jblanca@btc.upv.es> <320fb6e00911240510n35d0ab5cvdfff0b8375723864@mail.gmail.com> <200911241558.39357.jblanca@btc.upv.es> Message-ID: <320fb6e00911240752h28683e33l5d5dc938882d2d1a@mail.gmail.com> On Tue, Nov 24, 2009 at 2:58 PM, Jose Blanca wrote: > What I mean is this: > http://github.com/JoseBlanca/biopython/commit/d4c87365f614de2d69d800dc63d0cc25087d96dc > > I would like to change the Seq() and SeqRecord() for self.__class__ > There are places already in Seq in which self.__class__ is used. > But this is not the case in all instances. I think that this would be a > reasonable change. I hadn't realised the add methods also used __class__, I thought it was just __repr__ - good point. Thinking about use-cases, sometimes a subclass will want the methods to return Seq objects, sometimes the same class. The UnknownSeq too sometimes can return another UnknownSeq, but must often return a Seq object. 
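To make the two possible defaults concrete, here is a small self-contained sketch. `BaseSeq`, `ExtendedSeq` and `LazySeq` are hypothetical toy classes, not Biopython code. It shows a method built with `self.__class__` next to one that always returns the base class, and what happens to a subclass whose constructor takes different arguments.

```python
class BaseSeq(object):
    """Hypothetical minimal sequence class (not Bio.Seq.Seq)."""

    def __init__(self, data):
        self.data = data

    def __str__(self):
        return self.data

    def upper_same_class(self):
        # "self.__class__" default: a subclass gets its own type back.
        return self.__class__(self.data.upper())

    def upper_base_class(self):
        # Cautious default: the result is always a plain BaseSeq.
        return BaseSeq(self.data.upper())


class ExtendedSeq(BaseSeq):
    """Adds one method; keeps the base constructor signature."""

    def gc_count(self):
        return sum(1 for letter in self.data.upper() if letter in "GC")


class LazySeq(BaseSeq):
    """Different constructor, loosely like a database-backed sequence."""

    def __init__(self, loader, key):
        BaseSeq.__init__(self, loader(key))


ext = ExtendedSeq("acgt")
lazy = LazySeq(lambda key: "acgt", "some-key")

same = ext.upper_same_class()   # stays an ExtendedSeq
base = ext.upper_base_class()   # drops back to BaseSeq

try:
    # Calls LazySeq("ACGT"), which lacks the required 'key' argument.
    lazy.upper_same_class()
    lazy_error = None
except TypeError as err:
    lazy_error = err
```

With the `self.__class__` default, `ExtendedSeq` needs no overrides but `LazySeq` must override every such method; with the base-class default it is the other way round.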
The BioSQL DBSeq on the other hand always returns a Seq object for all its methods. The fact that the Seq __add__ and __radd__ use __class__ was the cause of a bug in that adding DBSeq objects didn't work.

A hypothetical CircularSeq could return CircularSeq objects for some cases (e.g. upper, lower, and perhaps transcribe, back_transcribe and reverse complement), Seq objects in other cases (e.g. slicing) while in other cases it may depend on the data (e.g. translation).

Essentially, for a Seq subclass you may need to look at each method in turn and decide which is most appropriate.

So which is the most sensible default behaviour from the Seq object? The cautious "return a Seq" approach will be robust (and makes sense for the existing Biopython subclasses, DBSeq and the UnknownSeq), but makes changing this in the subclass harder (as Jose has found).

What does your Seq subclass aim to do? Add one or two general methods to enhance the Seq object - or model something a little different?

If anyone else has written Seq (or SeqRecord) subclasses, it would be very helpful to hear about them. After all, the change Jose is proposing may break your code ;)

>> >    def __add__(self, seq2):
>> >        '''It returns a new object with both seq and qual joined '''
>> >        #per letter annotations
>> >        new_seq = self.__class__(name = self.name + '+' + seq2.name,
>> >                                 id = self.id + '+' + seq2.id,
>> >                                 seq  = self.seq + seq2.seq)
>> >        #the letter annotations, including quality
>> >        for name, annot in self.letter_annotations.items():
>> >            if name in seq2.letter_annotations:
>> >                new_seq.letter_annotations[name] = annot + \
>> >                                         seq2.letter_annotations[name]
>> >        return new_seq
>>
>> This bit is much less clear to me - you completely ignore any
>> features. Was it written before I added the __add__ method
>> to the original SeqRecord (expected to be in Biopython 1.53)?
>
> Yes, it was added much earlier. I'll remove it as soon as the
> Biopython SeqRecord has one.

OK - your code makes more sense now. The Biopython trunk does now have an __add__ method (which I expect to be in Biopython 1.53).

Peter

From jblanca at btc.upv.es Tue Nov 24 11:06:32 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Tue, 24 Nov 2009 17:06:32 +0100
Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord
In-Reply-To: <320fb6e00911240752h28683e33l5d5dc938882d2d1a@mail.gmail.com>
References: <200911241132.55922.jblanca@btc.upv.es> <200911241558.39357.jblanca@btc.upv.es> <320fb6e00911240752h28683e33l5d5dc938882d2d1a@mail.gmail.com>
Message-ID: <200911241706.32940.jblanca@btc.upv.es>

> What does your Seq subclass aim to do? Add one or two general
> methods to enhance the Seq object - or model something a little
> different?

Basically I want to use a Seq and a SeqRecord that behave almost like the Biopython ones. For the SeqRecord I'm adding a qual property, but I can change my code to deal with that. I've also added an __add__ (which I will remove as soon as the Biopython one is stable) and a complement method to SeqRecord, and __eq__ to Seq.

Regards,
Jose Blanca

> If anyone else has written Seq (or SeqRecord) subclasses, it
> would be very helpful to hear about them. After all, the change
> Jose is proposing may break your code ;)
>
> >> >    def __add__(self, seq2):
> >> >        '''It returns a new object with both seq and qual joined '''
> >> >        #per letter annotations
> >> >        new_seq = self.__class__(name = self.name + '+' + seq2.name,
> >> >                                 id = self.id + '+' + seq2.id,
> >> >                                 seq  = self.seq + seq2.seq)
> >> >        #the letter annotations, including quality
> >> >        for name, annot in self.letter_annotations.items():
> >> >            if name in seq2.letter_annotations:
> >> >                new_seq.letter_annotations[name] = annot + \
> >> >                                         seq2.letter_annotations[name]
> >> >        return new_seq
> >>
> >> This bit is much less clear to me - you completely ignore any
> >> features. Was it written before I added the __add__ method
> >> to the original SeqRecord (expected to be in Biopython 1.53)?
> >
> > Yes, it was added much earlier. I'll remove it as soon as the
> > Biopython SeqRecord has one.
>
> OK - your code makes more sense now. The Biopython trunk
> does now have an __add__ method (which I expect to be in
> Biopython 1.53).
>
> Peter

--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk Tue Nov 24 11:17:14 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Nov 2009 16:17:14 +0000
Subject: [Biopython-dev] Subclassing Seq and SeqRecord
In-Reply-To: <200911241706.32940.jblanca@btc.upv.es>
References: <200911241132.55922.jblanca@btc.upv.es> <200911241558.39357.jblanca@btc.upv.es> <320fb6e00911240752h28683e33l5d5dc938882d2d1a@mail.gmail.com> <200911241706.32940.jblanca@btc.upv.es>
Message-ID: <320fb6e00911240817y402a445el853c6e51b00d98ab@mail.gmail.com>

On Tue, Nov 24, 2009 at 4:06 PM, Jose Blanca wrote:
>> What does your Seq subclass aim to do? Add one or two general
>> methods to enhance the Seq object - or model something a little
>> different?
>
> Basically I want to use just a Seq and a SeqRecord almost as the Biopython
> ones. For the SeqRecord I'm adding a qual property, but I can change my
> code to deal with that.

OK, I can see the purpose here.

> Also I've added an __add__ (that I will remove as soon as the biopython
> one is stable) ...

OK.
I consider the SeqRecord __add__ to be stable, but await comments.

> ... and complement method to SeqRecord ...

Interesting. Do you mean complement or reverse_complement? It would have been nice to have had your comments on this earlier thread:
http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006850.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html

> ... and __eq__ to Seq.

Seq object equality is a tricky thing, see the other thread:
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007021.html

Peter

From jblanca at btc.upv.es Tue Nov 24 11:17:55 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Tue, 24 Nov 2009 17:17:55 +0100
Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord
In-Reply-To: <320fb6e00911240752h28683e33l5d5dc938882d2d1a@mail.gmail.com>
References: <200911241132.55922.jblanca@btc.upv.es> <200911241558.39357.jblanca@btc.upv.es> <320fb6e00911240752h28683e33l5d5dc938882d2d1a@mail.gmail.com>
Message-ID: <200911241717.55484.jblanca@btc.upv.es>

On Tuesday 24 November 2009 16:52:25 Peter wrote:
> Thinking about use-cases, sometimes a subclass will want the
> methods to return Seq objects, sometimes the same class.
>
> The UnknownSeq too sometimes can return another UnknownSeq,
> but must often return a Seq object.

I've been thinking about that and I don't think it's a problem. If the subclass wants to return the parent class it can choose to do so. I'm just proposing to change the behaviour of the parent class.

> The BioSQL DBSeq on the other hand always returns a Seq
> object for all its methods. The fact that the Seq __add__ and
> __radd__ use __class__ was the cause of a bug in that adding
> DBSeq objects didn't work.

I hadn't realised that problem. Was that a bug in the BioSQL code that could be fixed, or a design problem related to my proposal?

Regards,

--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk Tue Nov 24 11:30:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Nov 2009 16:30:21 +0000
Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord
In-Reply-To: <200911241717.55484.jblanca@btc.upv.es>
References: <200911241132.55922.jblanca@btc.upv.es> <200911241558.39357.jblanca@btc.upv.es> <320fb6e00911240752h28683e33l5d5dc938882d2d1a@mail.gmail.com> <200911241717.55484.jblanca@btc.upv.es>
Message-ID: <320fb6e00911240830l68269b80ja685f0a2dce5946f@mail.gmail.com>

On Tue, Nov 24, 2009 at 4:17 PM, Jose Blanca wrote:
> On Tuesday 24 November 2009 16:52:25 Peter wrote:
>> Thinking about use-cases, sometimes a subclass will want the
>> methods to return Seq objects, sometimes the same class.
>>
>> The UnknownSeq too sometimes can return another UnknownSeq,
>> but must often return a Seq object.
>
> I'm thinking about that and I don't think it's a problem. If the subclass
> wants to return as the parent class it can choose to do it. I'm just proposing
> to change the behaviour of the parent class.

Yes - but it means any existing subclasses will need updating (fairly easy for those included with Biopython) which could be a big problem for end user scripts (especially if anyone wants to target old and new versions of Biopython).

>> The BioSQL DBSeq on the other hand always returns a Seq
>> object for all its methods. The fact that the Seq __add__ and
>> __radd__ use __class__ was the cause of a bug in that adding
>> DBSeq objects didn't work.
>
> I hadn't realised that problem. Was that a bug of the BioSQL project
> that could be solved or a design problem related to my proposal?
It was just a bug in Biopython's BioSQL wrappers, fixed by adding explicit __add__ and __radd__ methods to the DBSeq class since it couldn't safely use the default methods of the Seq class. Your proposal would require further similar changes to the DBSeq class to override *all* the Seq returning methods to ensure a Seq object is returned and not attempt to create a DBSeq object with the wrong __init__ arguments.

The point is while your proposed change will make some tasks easier (e.g. writing an extended Seq subclass that adds a new method or changes an existing method), it will make other tasks much harder (e.g. the DBSeq class).

Peter

From lpritc at scri.ac.uk Tue Nov 24 11:26:34 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 24 Nov 2009 16:26:34 +0000
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <320fb6e00911240330hb1ee2b6mbe75b1433e0ecdcc@mail.gmail.com>
Message-ID:

Hi,

Without wanting to get too philosophical, an issue to consider in this, in addition to the technical problems outlined by Peter, is what do we *mean* when we ask about equality of two sequences?

As Peter points out, there is something counterintuitive about the peptide "ACGT" somehow being equal to the nucleotide sequence "ACGT", and that is because we know that the things that these sequences represent are not in reality the same thing.

Likewise, two instances of a repeat sequence in a genome are not necessarily the same conceptual item, even though they may have the same nucleotide sequence. Also, two CDS from different sources may have the same conceptual translation, but the identical translations are arguably not the same sequence, and in both these circumstances a test for string equality ignores potentially significant differences between the physical/biological elements they describe. These particular cases would give false positives for equality that could be 'gotchas' for the use in dictionaries that prompted this discussion.
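The dictionary 'gotcha' can be sketched in a few lines. Both classes below are hypothetical toys, not Biopython's Seq: one compares and hashes on the sequence string alone, the other keeps Python's default identity-based behaviour.

```python
class StrEqSeq(object):
    """Toy sequence whose equality and hash ignore the alphabet."""

    def __init__(self, data, alphabet="generic"):
        self.data = data
        self.alphabet = alphabet

    def __str__(self):
        return self.data

    def __eq__(self, other):
        return str(self) == str(other)

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        # Must agree with __eq__, so only the string is hashed.
        return hash(self.data)


class IdentitySeq(object):
    """Toy sequence keeping the default identity-based == and hash."""

    def __init__(self, data, alphabet="generic"):
        self.data = data
        self.alphabet = alphabet

    def __str__(self):
        return self.data


dna = StrEqSeq("ACGT", "dna")
protein = StrEqSeq("ACGT", "protein")
by_string = {dna: "nucleotide record"}
by_string[protein] = "peptide record"  # same key: silently overwrites

dna_id = IdentitySeq("ACGT", "dna")
protein_id = IdentitySeq("ACGT", "protein")
by_identity = {dna_id: "nucleotide record", protein_id: "peptide record"}
```

With string-based equality the peptide entry replaces the nucleotide one; with identity-based equality both entries survive, and an explicit `str(dna_id) == str(protein_id)` is still available when a string comparison is what is wanted.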
If we want to test for string equality of two sequences, we can already do that explicitly and simply with str(s1) == str(s2). Making this the default behaviour for a string doesn't always conform to my own expectations of what 'equality' means for two sequences, because my expectation changes depending on the task in hand.

An alternative reasonable test for equality might be whether the two sequences represent the same sequence, so Seq("M", generic_protein) == Seq("ATG", generic_dna) might return True if we make some potentially dodgy assumptions about reading frames, and consider that they conceptually represent the same thing. I think that it would be a bad default behaviour, and harder to implement than testing string equality, but equally reasonable depending on what you think 'equality' means.

Another, equally reasonable, definition of two sequences being 'equal' is that they share a locus tag or accession. I test on this more frequently than I do on sequence identity, but still think it's a bad idea to make it a default test for sequence equality.

Similarly, if two sequences (e.g. mRNA/cDNA) map to the same location on a genome, you might consider them equal.

There are several equally reasonable and yet non-universal definitions of 'equality' for sequence comparisons, and we currently have the ability to test simply but explicitly for equality on the basis of any of these as we need to at the time. I would prefer to see this requirement for an explicit string comparison kept, and the test for object equality kept as the default, because this never produces a false positive (and I value specificity over sensitivity as a default ;) ).

Cheers,

L.

On 24/11/2009 11:30, "Peter" wrote:

[...]
> The problem is if we'd like Seq("ACGT") to be equal to
> Seq("ACGT", generic_dna) then both must have the
> same hash. Then, if we also want Seq("ACGT") and
> Seq("ACGT", generic_protein) to be equal, they too must
> have the same hash.
> This means Seq("ACGT", generic_dna)
> and Seq("ACGT",generic_protein) would have the same
> hash, and therefore must evaluate as equal (!). The
> natural consequence of this chain of logic is we would
> then have Seq("ACGT") == Seq("ACGT", generic_dna)
> == Seq("ACGT",generic_protein) == Seq("ACGT",...).
> You reach the same point if we require the string
> "ACGT" equals Seq("ACGT", some_alphabet)
>
> i.e. Another option would be to base Seq equality
> and hashing on the sequence string only (ignoring
> the alphabet).
>
> This would at least be a simple rule to remember (and
> would mean we could implement less than, greater than
> etc in the same way) but basically means we'd ignore
> the alphabet.
[...]
> Changing Seq equality like this would make Biopython
> much nicer to use for basic tasks. For example, my
> code (and the unit tests) often contains things like if
> str(seq1)==str(seq2).
>
> If we want to make this change, it is quite a break to
> backwards compatibility. (It also has the downside that
> a DNA sequence ACGT and a protein sequence ACGT
> would evaluate as equal - probably not a big issue in
> practice but counter intuitive).

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk
w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C
tel:+44(0)1382 562731 x2405

From bugzilla-daemon at portal.open-bio.org Tue Nov 24 11:47:00 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 24 Nov 2009 11:47:00 -0500
Subject: [Biopython-dev] [Bug 2927] Problem parsing PSI-BLAST plain text output with NCBStandalone.PSIBlastParser
In-Reply-To:
Message-ID: <200911241647.nAOGl0RE007751@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2927

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|blocker                     |normal

------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-24 11:47 EST -------
This isn't a blocker severity level bug. And we still need at least one example PSI-BLAST plain text output to try and fix this...

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From jblanca at btc.upv.es Wed Nov 25 03:31:53 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Wed, 25 Nov 2009 09:31:53 +0100
Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord
In-Reply-To: <320fb6e00911240830l68269b80ja685f0a2dce5946f@mail.gmail.com>
References: <200911241132.55922.jblanca@btc.upv.es> <200911241717.55484.jblanca@btc.upv.es> <320fb6e00911240830l68269b80ja685f0a2dce5946f@mail.gmail.com>
Message-ID: <200911250931.54003.jblanca@btc.upv.es>

> The point is while your proposed change will make some tasks
> easier (e.g. writing an extended Seq subclass that adds a new
> method or changes an existing method), it will make other tasks
> much harder (e.g. the DBSeq class).

That's a fair point. You're right: in either case some methods would have to be reimplemented. I don't know if the current situation is the most convenient, because the current implementations have mixed behaviour. Some methods, like Seq's __add__, use __class__ and others, like __getitem__, use Seq(). Is there a reason for that?

Regards,

--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From jblanca at btc.upv.es Wed Nov 25 03:45:20 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Wed, 25 Nov 2009 09:45:20 +0100
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To:
References:
Message-ID: <200911250945.20870.jblanca@btc.upv.es>

Hi,

> Without wanting to get too philosophical, an issue to consider in this, in
> addition to the technical problems outlined by Peter, is what do we *mean*
> when we ask about equality of two sequences?
>
> As Peter points out, there is something counterintuitive about the peptide
> "ACGT" somehow being equal to the nucleotide sequence "ACGT", and that is
> because we know that the things that these sequences represent are not in
> reality the same thing.
>
> Likewise, two instances of a repeat sequence in a genome are not
> necessarily the same conceptual item, even though they may have the same
> nucleotide sequence. Also, two CDS from different sources may have the
> same conceptual translation, but the identical translations are arguably
> not the same sequence, and in both these circumstances a test for string
> equality ignores potentially significant differences between the
> physical/biological elements they describe. These particular cases would
> give false positives for equality that could be 'gotchas' for the use in
> dictionaries that prompted this discussion.
>
> If we want to test for string equality of two sequences, we can already do
> that explicitly and simply with str(s1) == str(s2). Making this the
> default behaviour for a string doesn't always conform to my own
> expectations of what 'equality' means for two sequences, because my
> expectation changes depending on the task in hand.
>
> An alternative reasonable test for equality might be whether the two
> sequences represent the same sequence, so Seq("M", generic_protein) ==
> Seq("ATG", generic_dna) might return True if we make some potentially dodgy
> assumptions about reading frames, and consider that they conceptually
> represent the same thing. I think that it would be a bad default behaviour,
> and harder to implement than testing string equality, but equally
> reasonable depending on what you think 'equality' means.
>
> Another, equally reasonable, definition of two sequences being 'equal' is
> that they share a locus tag or accession. I test on this more frequently
> than I do on sequence identity, but still think it's a bad idea to make it
> a default test for sequence equality.
>
> Similarly, if two sequences (e.g. mRNA/cDNA) map to the same location on a
> genome, you might consider them equal.
>
> There are several equally reasonable and yet non-universal definitions of
> 'equality' for sequence comparisons, and we currently have the ability to
> test simply but explicitly for equality on the basis of any of these as we
> need to at the time. I would prefer to see this requirement for an
> explicit string comparison kept, and the test for object equality kept as
> the default, because this never produces a false positive (and I value
> specificity over sensitivity as a default ;) ).

You're right, there are a lot of corner cases that I hadn't considered. I think of a Seq as a str with an alphabet, so I wouldn't care about some things, like the genome location of the Seq. But anyway, I use the __eq__ method as a convenience to avoid writing str(seq1) == str(seq2).

I'm aware that all abstractions leak, but that's not good or bad in itself. The abstraction is a model of reality; as a model it won't be a perfect representation of reality, just a convenient one. The abstraction is suited to a particular use, so its behaviour should be tailored to this use. I would implement this behaviour and document its gotchas. Not implementing the behaviour because the abstraction leaks also rules out the most general case that the abstraction is trying to cover.

> On 24/11/2009 11:30, "Peter" wrote:
>
> [...]
>
> > The problem is if we'd like Seq("ACGT") to be equal to
> > Seq("ACGT", generic_dna) then both must have the
> > same hash. Then, if we also want Seq("ACGT") and
> > Seq("ACGT", generic_protein) to be equal, they too must
> > have the same hash. This means Seq("ACGT", generic_dna)
> > and Seq("ACGT",generic_protein) would have the same
> > hash, and therefore must evaluate as equal (!). The
> > natural consequence of this chain of logic is we would
> > then have Seq("ACGT") == Seq("ACGT", generic_dna)
> > == Seq("ACGT",generic_protein) == Seq("ACGT",...).
> > You reach the same point if we require the string
> > "ACGT" equals Seq("ACGT", some_alphabet)

Oh! I didn't know that! It's great to learn new Python things! I'm being naive here because I only have a shallow understanding of the problem, but here are my two cents.

Would it be possible to generate the hashes and the __eq__ taking into account the base alphabet? For instance DNAAlphabet=0, RNAAlphabet=1 and ProteinAlphabet=2. So to check if two sequences are equal we would do something like:
'ACGT1' == 'ACGT2'

Regards,

--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk Wed Nov 25 05:26:34 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 25 Nov 2009 10:26:34 +0000
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <200911250945.20870.jblanca@btc.upv.es>
References: <200911250945.20870.jblanca@btc.upv.es>
Message-ID: <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>

On Wed, Nov 25, 2009 at 8:45 AM, Jose Blanca wrote:
>> On 24/11/2009 11:30, "Peter" wrote:
>> > The problem is if we'd like Seq("ACGT") to be equal to
>> > Seq("ACGT", generic_dna) then both must have the
>> > same hash. Then, if we also want Seq("ACGT") and
>> > Seq("ACGT", generic_protein) to be equal, they too must
>> > have the same hash. This means Seq("ACGT", generic_dna)
>> > and Seq("ACGT",generic_protein) would have the same
>> > hash, and therefore must evaluate as equal (!). The
>> > natural consequence of this chain of logic is we would
>> > then have Seq("ACGT") == Seq("ACGT", generic_dna)
>> > == Seq("ACGT",generic_protein) == Seq("ACGT",...).
>> > You reach the same point if we require the string
>> > "ACGT" equals Seq("ACGT", some_alphabet)
>
> Oh! I didn't know that! It's great to learn new python things!
> I'm being naive here because I just have a shallow understanding
> of the problem, but here are my two cents.

It took me a while to try and understand this stuff - it's tricky and I'm not 100% sure I have the details perfectly right.

> would it be possible to generate the hashes and the __eq__ taking into account
> the base alphabet. For instance DNAAlphabet=0, RNAAlphabet=1 and
> ProteinAlphabet=2. So to check if two sequences we would do something like:
> 'ACGT1' == 'ACGT2'

I'd wondered about that too - if we treated all DNA alphabets (generic, IUPAC ambiguous etc) as one group, all RNA alphabets as another, and all Protein as a third, then within those groups things are fine. But what about all the other alphabets? In particular the generic (base) default alphabet or the generic single letter alphabet? These are very very commonly used (e.g. parsing a FASTA file without giving a specific alphabet). i.e. It is only a partial solution that doesn't really work :(

Also, there is the issue of comparing a Seq object to a string. It would be very nice to have string "ACGT" == Seq("ACGT", some_alphabet) but that means we would also have to have hash("ACGT") == hash(Seq("ACGT", some_alphabet)), which as noted above would mean Seq comparisons would have to ignore the alphabet. Which is bad :(

Peter

From jblanca at btc.upv.es Wed Nov 25 06:20:53 2009
From: jblanca at btc.upv.es (Jose Blanca)
Date: Wed, 25 Nov 2009 12:20:53 +0100
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
References: <200911250945.20870.jblanca@btc.upv.es> <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
Message-ID: <200911251220.53881.jblanca@btc.upv.es>

> > would it be possible to generate the hashes and the __eq__ taking into
> > account the base alphabet. For instance DNAAlphabet=0, RNAAlphabet=1 and
> > ProteinAlphabet=2. So to check if two sequences we would do something
> > like: 'ACGT1' == 'ACGT2'
>
> I'd wondered about that too - if we treated all DNA alphabets (generic,
> IUPAC ambiguous etc) as one group, all RNA alphabets as another, and
> all Protein as a third, then within those groups things are fine. But what
> about all the other alphabets? In particular the generic (base) default
> alphabet or the generic single letter alphabet? These are very very
> commonly used (e.g. parsing a FASTA file without giving a specific
> alphabet). i.e. It is only a partial solution that doesn't really work :(
>
> Also, there is the issue of comparing a Seq object to a string. It would
> be very nice to have string "ACGT" == Seq("ACGT", some_alphabet)
> but that means we would also have to have hash("ACGT") ==
> hash(Seq("ACGT", some_alphabet)), which as noted above would
> mean Seq comparisons would have to ignore the alphabet. Which
> is bad :(

That's a tricky issue. I think that the desired behaviour should be defined first, and the implementation should follow from it. One possible solution would be to consider the generic alphabet different from the more specific ones and to treat a str as having a generic alphabet.
It would be something like: GenericAlphabet=0, DNAAlphabet=1, RNAAlphabet=2, ProteinAlphabet=3 if str: alphabet=generic else: alphabet=seq.alphabet return str(seq1) + str(alphabet) == str(seq2) + str(alphabet) -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Wed Nov 25 06:22:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 25 Nov 2009 11:22:05 +0000 Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord In-Reply-To: <200911250931.54003.jblanca@btc.upv.es> References: <200911241132.55922.jblanca@btc.upv.es> <200911241717.55484.jblanca@btc.upv.es> <320fb6e00911240830l68269b80ja685f0a2dce5946f@mail.gmail.com> <200911250931.54003.jblanca@btc.upv.es> Message-ID: <320fb6e00911250322j3b10a792r7db9df4b4cb269c0@mail.gmail.com> On Wed, Nov 25, 2009 at 8:31 AM, Jose Blanca wrote: > >> The point is while your proposed change will make some tasks >> easier (e.g. writing an extended Seq subclass that adds a new >> method or changes an existing method), it will make other tasks >> much harder (e.g. the DBSeq class). > > That's a fair point. You're right in either case some methods would > have to be reimplemented. That is unavoidable :( > I don't know if the current situation is the most convenient because > the actual implementations have a mixed behaviour. Some methods > like the Seq's __add__ use __class__ and some others like > __getitem__ use Seq(). Is there a reason for that? Historical accident I think. If you want to pursue this change (using __class__ in the methods), you'll also need to update the BioSQL DBSeq and DBSeqRecord subclasses. Comments from other people who have written Seq (or SeqRecord) subclasses would be very valuable here. 

Peter

From biopython at maubp.freeserve.co.uk Wed Nov 25 06:48:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 25 Nov 2009 11:48:16 +0000
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <200911251220.53881.jblanca@btc.upv.es>
References: <200911250945.20870.jblanca@btc.upv.es>
 <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
 <200911251220.53881.jblanca@btc.upv.es>
Message-ID: <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>

On Wed, Nov 25, 2009 at 11:20 AM, Jose Blanca wrote:
>
> That's a tricky issue. I think that the desired behaviour should be defined
> and after that the implementation should go.
>

Many desired behaviours are mutually contradictory given the way
Python works, and the current Seq/Alphabet objects. One can come
up with many possible desired behaviours, but often they are not
coherent or not technically possible.

> One possible solution would be
> to consider the generic alphabet different than the more specific ones and
> consider the str as having a generic alphabet. It would be something like:
>
> GenericAlphabet=0, DNAAlphabet=1, RNAAlphabet=2, ProteinAlphabet=3
> if str:
>     alphabet=generic
> else:
>     alphabet=seq.alphabet
> return str(seq1) + str(alphabet) == str(seq2) + str(alphabet)

Dividing alphabets into those four groups would imply:

"ACG" == Seq("ACG") == Seq("ACG", generic_nucleotide)
"ACG" != Seq("ACG", generic_rna)
"ACG" != Seq("ACG", generic_dna)
"ACG" != Seq("ACG", generic_protein)
...
Seq("ACG") != Seq("ACG", generic_protein)

This has some non-intuitive behaviour. Also it doesn't take
into account a number of corner cases (which could be better
handled in the existing Seq objects I admit) - things like
secondary structure alphabets (e.g. for proteins: coils, beta
sheet, alpha helix) or reduced alphabets (e.g. for proteins
using Aliphatic/Aromatic/Charged/Tiny/Diverse, or any of
the Murphy (2000) tables).

The whole issue is horribly complicated!
Quoting "Zen of Python":

* If the implementation is hard to explain, it's a bad idea.
* If the implementation is easy to explain, it may be a good idea.

Doing anything complex with alphabets may fall into the "hard
to explain" category. Using object identity or string identity is
at least simple to explain.

Thus far we have just two options, and neither is ideal:
(a) Object identity, following id(seq1)==id(seq2) as now
(b) String identity, following str(seq1)==str(seq2)

We could consider a modified version of the string identity
approach - make seq1==seq2 act as str(seq1)==str(seq2),
but *also* look at the alphabets and if they are incompatible
(using the existing rules used in addition etc) raise a Python
warning. Right now this seems like quite a tempting idea to
explore...

Peter

From chapmanb at 50mail.com Wed Nov 25 07:53:14 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 25 Nov 2009 07:53:14 -0500
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
References: <200911250945.20870.jblanca@btc.upv.es>
 <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
 <200911251220.53881.jblanca@btc.upv.es>
 <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
Message-ID: <20091125125314.GA11038@sobchak.mgh.harvard.edu>

Hi all;
Interesting discussion on the equality issue.

> Dividing alphabets into those four groups would imply:
>
> "ACG" == Seq("ACG") == Seq("ACG", generic_nucleotide)
> "ACG" != Seq("ACG", generic_rna)
> "ACG" != Seq("ACG", generic_dna)
> "ACG" != Seq("ACG", generic_protein)
> ...
> Seq("ACG") != Seq("ACG", generic_protein)
>
> This has some non-intuitive behaviour. Also it doesn't take
> into account a number of corner cases (which could be better
> handled in the existing Seq objects I admit) - things like
> secondary structure alphabets (e.g. for proteins: coils, beta
> sheet, alpha helix) or reduced alphabets (e.g. for proteins
> using Aliphatic/Aromatic/Charged/Tiny/Diverse, or any of
> the Murphy (2000) tables).

Instead of considering the most horrible edge cases, we should think
about the most common use cases and make those easy. Alphabets are a
bit overcomplicated and in practice are probably not being used to
represent these other potential alphabets. I may be simple minded in
my programming, but have never seen the benefit of directly encoding
anything more complicated than DNA, RNA or proteins. The 3 things
I've used alphabets for are:

- Is it DNA, RNA or protein?
- Does a sequence match the alphabet? Checking input files.
- Being careful not to add DNA and protein. In practice, I don't
  really do this very often.

> We could consider a modified version of the string identity
> approach - make seq1==seq2 act as str(seq1)==str(seq2),
> but *also* look at the alphabets and if they are incompatible
> (using the existing rules used in addition etc) raise a Python
> warning. Right now this seems like quite a tempting idea to
> explore...

I like this with Jose's cases for the standard DNA, RNA, protein and
generic alphabets. So provide sequence + alphabet checking for
all of the common cases, and a warning plus just sequence checking
for the edge cases. So if you try and compare a DNA sequence and
your secondary structure alphabet, you will get a mismatch on the
sequences and a warning about incompatible alphabets.

Brad

From biopython at maubp.freeserve.co.uk Wed Nov 25 08:15:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 25 Nov 2009 13:15:25 +0000
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <20091125125314.GA11038@sobchak.mgh.harvard.edu>
References: <200911250945.20870.jblanca@btc.upv.es>
 <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
 <200911251220.53881.jblanca@btc.upv.es>
 <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
 <20091125125314.GA11038@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00911250515w68d808bdrd463d7834ef14985@mail.gmail.com>

On Wed, Nov 25, 2009 at 12:53 PM, Brad Chapman wrote:
> Hi all;
> Interesting discussion on the equality issue.
>
>> Dividing alphabets into those four groups would imply:
>>
>> "ACG" == Seq("ACG") == Seq("ACG", generic_nucleotide)
>> "ACG" != Seq("ACG", generic_rna)
>> "ACG" != Seq("ACG", generic_dna)
>> "ACG" != Seq("ACG", generic_protein)
>> ...
>> Seq("ACG") != Seq("ACG", generic_protein)
>>
>> This has some non-intuitive behaviour. Also it doesn't take
>> into account a number of corner cases (which could be better
>> handled in the existing Seq objects I admit) - things like
>> secondary structure alphabets (e.g. for proteins: coils, beta
>> sheet, alpha helix) or reduced alphabets (e.g. for proteins
>> using Aliphatic/Aromatic/Charged/Tiny/Diverse, or any of
>> the Murphy (2000) tables).
>
> Instead of considering the most horrible edge cases, we should think
> about the most common use cases and make those easy. Alphabets are a
> bit overcomplicated and in practice are probably not being used to
> represent these other potential alphabets. I may be simple minded in
> my programming, but have never seen the benefit of directly encoding
> anything more complicated than DNA, RNA or proteins. The 3 things
> I've used alphabets for are:
>
> - Is it DNA, RNA or protein?
> - Does a sequence match the alphabet? Checking input files.
> - Being careful not to add DNA and protein. In practice, I don't
>   really do this very often.

Me too - but fixing Bug 2597 would really help (either an exception
or a warning would be a big improvement).

>> We could consider a modified version of the string identity
>> approach - make seq1==seq2 act as str(seq1)==str(seq2),
>> but *also* look at the alphabets and if they are incompatible
>> (using the existing rules used in addition etc) raise a Python
>> warning. Right now this seems like quite a tempting idea to
>> explore...
>
> I like this with Jose's cases for the standard DNA, RNA, protein and
> generic alphabets. So provide sequence + alphabet checking for
> all of the common cases, and a warning plus just sequence checking
> for the edge cases. So if you try and compare a DNA sequence and
> your secondary structure alphabet, you will get a mismatch on the
> sequences and a warning about incompatible alphabets.

You seem to be suggesting some hybrid plan here Brad - I don't
quite follow you. Could you clarify (e.g. with some examples)?

In the meantime, I'll work on a patch to do my suggestion of
hashing and comparison based on string comparison, but with
alphabet aware warnings.
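
To make the "string comparison plus alphabet-aware warning" idea concrete,
here is a rough sketch under stated assumptions. ToySeq, its "kind" attribute
and AlphabetMismatchWarning are all hypothetical names invented for this
example; the real patch would use the existing Alphabet objects and their
compatibility rules rather than a simple label comparison.

```python
import warnings


class AlphabetMismatchWarning(UserWarning):
    """Hypothetical warning class, used only in this sketch."""


class ToySeq(object):
    """Hypothetical stand-in for a sequence with an alphabet 'kind'."""

    def __init__(self, data, kind=None):
        self._data = data
        self.kind = kind  # "dna", "rna", "protein", or None for generic

    def __str__(self):
        return self._data

    def __hash__(self):
        # Hash like the plain string, so equal objects hash equally.
        return hash(self._data)

    def __eq__(self, other):
        # Plain strings have no 'kind', so they are treated as generic.
        other_kind = getattr(other, "kind", None)
        if None not in (self.kind, other_kind) and self.kind != other_kind:
            warnings.warn("Incompatible alphabets: %s vs %s"
                          % (self.kind, other_kind),
                          AlphabetMismatchWarning)
        # The result itself depends only on the letters.
        return str(self) == str(other)

    def __ne__(self, other):
        return not self == other


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mixed = ToySeq("ACG", "dna") == ToySeq("ACG", "protein")
    plain = ToySeq("ACG", "dna") == "ACG"
    found = "ACG" in {ToySeq("ACG", "dna"): 1}
```

In this sketch the DNA versus protein comparison still evaluates as equal but
records one warning, while comparison with a plain string and dictionary
lookup by a plain string both succeed silently.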

Peter

From biopython at maubp.freeserve.co.uk Wed Nov 25 09:15:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 25 Nov 2009 14:15:41 +0000
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <320fb6e00911250515w68d808bdrd463d7834ef14985@mail.gmail.com>
References: <200911250945.20870.jblanca@btc.upv.es>
 <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
 <200911251220.53881.jblanca@btc.upv.es>
 <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
 <20091125125314.GA11038@sobchak.mgh.harvard.edu>
 <320fb6e00911250515w68d808bdrd463d7834ef14985@mail.gmail.com>
Message-ID: <320fb6e00911250615v47047e65rdb257ea73cdc940b@mail.gmail.com>

On Wed, Nov 25, 2009 at 1:15 PM, Peter wrote:
>
> In the mean time, I'll work on a patch to do my suggestion of
> hashing and comparison based on string comparison, but with
> alphabet aware warnings.
>

Branch:
http://github.com/peterjc/biopython/tree/seq-comparisons

Commit:
http://github.com/peterjc/biopython/commit/e7859d47a4a1b873b307d5c2db622d335957a6ed

You'll see some basic examples at the top of Bio/Seq.py as module
level docstring doctests, including dictionary and set demonstrations.

As I hope this demonstrates, even this simple rule (Seq comparison
follows strings, but with incompatible alphabets giving warnings) leads
to some "odd" results - but that is just the way Python works (see the
int/float examples in the doctests using dicts and sets).

Note that we may want to do something (on the trunk) about warnings
in doctests (e.g. force them to print to stdout so they can be included
in doctests explicitly). Other than that, all the other unit tests seem fine
(including the BioSQL tests which is important and they use Seq
object subclasses).
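
The int/float behaviour mentioned above can be reproduced with plain Python,
independent of Biopython. The short sketch below is not taken from the
doctests in the branch; it only illustrates the general rule that values
which compare equal must hash equally, so they share a single slot in a dict
or set.

```python
# 1 and 1.0 compare equal, so Python requires them to hash equally.
same_hash = hash(1) == hash(1.0)

# Assigning with the float key updates the entry stored under the int key.
merged = {1: "int"}
merged[1.0] = "float"

# The set keeps only one of the two equal values.
unique = set([1, 1.0, 2])

print(same_hash, merged, len(unique))
```

A Seq that hashes and compares like its string would be expected to behave
the same way alongside a plain str with the same letters, which is presumably
the kind of "odd" result referred to above.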

Peter

From bugzilla-daemon at portal.open-bio.org Wed Nov 25 12:18:15 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 25 Nov 2009 12:18:15 -0500
Subject: [Biopython-dev] [Bug 2954] New: xbbtools and SeqGui still using Bio.Translate and Bio.Transcribe
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2954

           Summary: xbbtools and SeqGui still using Bio.Translate and
                    Bio.Transcribe
           Product: Biopython
           Version: 1.51
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Other
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

Scripts/SeqGui/SeqGui.py is still using Bio.Translate and Bio.Transcribe which
were deprecated in Biopython 1.51. Using Bio.Seq instead should be trivial,
except for back-translation (which could just be removed from the SeqGui tool).

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Wed Nov 25 12:19:14 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 25 Nov 2009 12:19:14 -0500
Subject: [Biopython-dev] [Bug 2954] xbbtools and SeqGui still using Bio.Translate and Bio.Transcribe
In-Reply-To:
Message-ID: <200911251719.nAPHJEsk010282@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2954

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-25 12:19 EST -------
Scripts/xbbtools/xbb_widget.py is also still using Bio.Translate

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From eric.talevich at gmail.com Thu Nov 26 02:14:08 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 26 Nov 2009 02:14:08 -0500
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
References: <200911250945.20870.jblanca@btc.upv.es>
 <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
 <200911251220.53881.jblanca@btc.upv.es>
 <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
Message-ID: <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>

On Wed, Nov 25, 2009 at 6:48 AM, Peter wrote:
> On Wed, Nov 25, 2009 at 11:20 AM, Jose Blanca wrote:
>>
>> That's a tricky issue. I think that the desired behaviour should be defined
>> and after that the implementation should go.
>>
>
> Many desired behaviours are mutually contradictory given the way
> Python works, and the current Seq/Alphabet objects. One can come
> up many possible desired behaviours, but often they are not coherent
> or not technically possible.
>
>> One possible solution would be
>> to consider the generic alphabet different than the more specific ones and
>> consider the str as having a generic alphabet. It would be something like:
>>
>> GenericAlphabet=0, DNAAlphabet=1, RNAAlphabet=2, ProteinAlphabet=3
>> if str:
>>     alphabet=generic
>> else:
>>     alphabet=seq.alphabet
>> return str(seq1) + str(alphabet) == str(seq2) + str(alphabet)
>
> [...]
>
> The whole issue is horribly complicated! Quoting "Zen of Python":
>
> * If the implementation is hard to explain, it's a bad idea.
> * If the implementation is easy to explain, it may be a good idea.
>
> Doing anything complex with alphabets may fall into the "hard
> to explain" category. Using object identity or string identity is
> at least simple to explain.
>
> Thus far we have just two options, and neither is ideal:
> (a) Object identity, following id(seq1)==id(seq2) as now
> (b) String identity, following str(seq1)==str(seq2)

How about (c), string and generic alphabet identity, where
Seq.__hash__ uses the sequence string and some simplification of the
alphabet types like Jose described. Premise: the sequence string and
alphabet are the only arguments the Seq constructor takes, so if two
objects can both be recreated from the same arguments, they should be
equal as far as sets and dictionaries are concerned. To fall back on
string identity, it's easy enough to map str onto a collection of Seq
objects.

def __hash__(self):
    """Same string, same alphabet --> same hash."""
    # If alphabet is a standard type, match the generic alphabet types
    if self.alphabet == generic_nucleotide:
        # hash() takes a single argument, so pass a tuple
        return hash((str(self), Alphabet))
        #OR, to match raw strings: return hash(str(self))
    elif isinstance(self.alphabet, DNAAlphabet):
        return hash((str(self), DNAAlphabet))
    elif isinstance(self.alphabet, RNAAlphabet):
        return hash((str(self), RNAAlphabet))
    elif isinstance(self.alphabet, ProteinAlphabet):
        return hash((str(self), ProteinAlphabet))
    # Other alphabets, maybe user-defined --> require exactly the same type
    else:
        return hash((str(self), self.alphabet.__class__))

Cheers,
Eric

From biopython at maubp.freeserve.co.uk Thu Nov 26 05:41:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Nov 2009 10:41:10 +0000
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>
References: <200911250945.20870.jblanca@btc.upv.es>
 <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
 <200911251220.53881.jblanca@btc.upv.es>
 <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
 <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>
Message-ID: <320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com>

On Thu, Nov 26, 2009 at 7:14 AM, Eric Talevich wrote:
>
> On Wed, Nov 25, 2009 at 6:48 AM, Peter wrote:
>> Doing anything complex with alphabets may fall into the "hard
>> to explain" category. Using object identity or string identity is
>> at least simple to explain.
>>
>> Thus far we have just two options, and neither is ideal:
>> (a) Object identity, following id(seq1)==id(seq2) as now
>> (b) String identity, following str(seq1)==str(seq2)
>
> How about (c), string and generic alphabet identity, where
> Seq.__hash__ uses the sequence string and some simplification of the
> alphabet types like Jose described. Premise: the sequence string and
> alphabet are the only arguments the Seq constructor takes, so if two
> objects can both be recreated from the same arguments, they should be
> equal as far as sets and dictionaries are concerned. To fall back on
> string identity, it's easy enough to map str onto a collection of Seq
> objects.
>
> def __hash__(self):
>     """Same string, same alphabet --> same hash."""
>     # If alphabet is a standard type, match the generic alphabet types
>     if self.alphabet == generic_nucleotide:
>         return hash((str(self), Alphabet))
>         #OR, to match raw strings: return hash(str(self))
>     elif isinstance(self.alphabet, DNAAlphabet):
>         return hash((str(self), DNAAlphabet))
>     elif isinstance(self.alphabet, RNAAlphabet):
>         return hash((str(self), RNAAlphabet))
>     elif isinstance(self.alphabet, ProteinAlphabet):
>         return hash((str(self), ProteinAlphabet))
>     # Other alphabets, maybe user-defined --> require exactly the same type
>     else:
>         return hash((str(self), self.alphabet.__class__))

As an aside, you'd need to get the base alphabet (i.e. remove any
AlphabetEncoder wrappers) to decide if it is RNA/DNA/Protein. There
is a private helper function in Bio.Alphabet for this. I don't think these
AlphabetEncoder objects (like Gapped) were an entirely sensible
design... but it's done now.
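
The unwrapping step described above can be sketched as follows. Every class
and function name here (ToyAlphabet, ToyDNA, ToyWrapper, base_alphabet,
alphabet_group, seq_key) is a hypothetical stand-in written for this example;
none of it is the actual Bio.Alphabet code or its private helper. The sketch
assumes that a wrapper simply holds the wrapped alphabet in an "alphabet"
attribute.

```python
class ToyAlphabet(object):
    """Hypothetical generic alphabet."""


class ToyDNA(ToyAlphabet):
    """Hypothetical DNA alphabet."""


class ToyRNA(ToyAlphabet):
    """Hypothetical RNA alphabet."""


class ToyProtein(ToyAlphabet):
    """Hypothetical protein alphabet."""


class ToyWrapper(object):
    """Hypothetical encoder-style wrapper (think of something like Gapped)."""

    def __init__(self, alphabet):
        self.alphabet = alphabet


def base_alphabet(alphabet):
    """Strip any number of nested wrappers and return the inner alphabet."""
    while isinstance(alphabet, ToyWrapper):
        alphabet = alphabet.alphabet
    return alphabet


def alphabet_group(alphabet):
    """Map an alphabet to its DNA/RNA/protein group, else its exact class."""
    base = base_alphabet(alphabet)
    for group in (ToyDNA, ToyRNA, ToyProtein):
        if isinstance(base, group):
            return group
    return base.__class__


def seq_key(text, alphabet):
    """Key that a proposal-(c) style __hash__ and __eq__ could share."""
    return (text, alphabet_group(alphabet))


wrapped = seq_key("ACG", ToyWrapper(ToyWrapper(ToyDNA())))
plain_dna = seq_key("ACG", ToyDNA())
protein = seq_key("ACG", ToyProtein())
```

Using one key function for both hashing and equality keeps the two
consistent: here the doubly wrapped DNA alphabet and the plain DNA alphabet
give the same key, while the protein alphabet gives a different one.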

This idea (c) has a major drawback for me, in that it appears you
wouldn't support comparing Seq objects to strings. However,
perhaps that is actually a good thing - that could raise a TypeError,
to force the user to do str(my_seq) == "ACG" which is explicit.

As I understood his proposal, in Jose's related idea (which didn't
get assigned a letter yet), "ACG"==Seq("ACG") would hold for
the default generic alphabet, but not for RNA/DNA/Protein.
e.g. "ACG"!=Seq("ACG",generic_dna), which I find very
counter-intuitive.

Peter

From eric.talevich at gmail.com Thu Nov 26 15:13:37 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 26 Nov 2009 15:13:37 -0500
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com>
References: <200911250945.20870.jblanca@btc.upv.es>
 <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
 <200911251220.53881.jblanca@btc.upv.es>
 <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
 <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>
 <320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com>
Message-ID: <3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com>

On Thu, Nov 26, 2009 at 5:41 AM, Peter wrote:
> On Thu, Nov 26, 2009 at 7:14 AM, Eric Talevich wrote:
>>
>> On Wed, Nov 25, 2009 at 6:48 AM, Peter wrote:
>>> Doing anything complex with alphabets may fall into the "hard
>>> to explain" category. Using object identity or string identity is
>>> at least simple to explain.
>>>
>>> Thus far we have just two options, and neither is ideal:
>>> (a) Object identity, following id(seq1)==id(seq2) as now
>>> (b) String identity, following str(seq1)==str(seq2)
>>
>> How about (c), string and generic alphabet identity, where
>> Seq.__hash__ uses the sequence string and some simplification of the
>> alphabets types like Jose described.
>> [...]
>>
>> def __hash__(self):
>>     """Same string, same alphabet --> same hash."""
>>     [...]
>
> [...]
>
> This idea (c) has a major drawback for me, in that it appears you
> wouldn't support comparing Seq objects to strings. However,
> perhaps that is actually a good thing - that could raise a TypeError,
> to force the user to do str(my_seq) == "ACG" which is explicit.
>

I guess this is the basic question: is a Seq a string-type, or a complex
class that contains a string (is-a vs. has-a)? Python will let us be
inconsistent with the type system if we want, but for a class as
fundamental as Seq, I think it should be consistent.

Biopython-dev discussed making Seq inherit from str or basestring
earlier [1], and I think it was decided that while actual inheritance
would be tricky, Seq should mimic that interface as much as possible
(using the alphabet attribute for validation and extra features,
mainly). So we'd treat Seq as a string-like type -- option (b) -- and
let SeqRecord be the complex type that has a sequence, accession
number, location, etc., where object identity is the only valid case
for equality.

In short: +1 for your patch on GitHub; I think the rationale is solid.

-Eric

[1] http://bugzilla.open-bio.org/show_bug.cgi?id=2351#c6

From biopython at maubp.freeserve.co.uk Fri Nov 27 06:39:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 27 Nov 2009 11:39:41 +0000
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com>
References: <200911250945.20870.jblanca@btc.upv.es>
 <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
 <200911251220.53881.jblanca@btc.upv.es>
 <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
 <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>
 <320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com>
 <3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com>
Message-ID: <320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com>

On Thu, Nov 26, 2009 at 8:13 PM, Eric Talevich wrote:
>
> I guess this is the basic question: is a Seq a string-type, or a complex
> class that contains a string (is-a vs. has-a)? Python will let us be
> inconsistent with the type system if we want, but for a class as
> fundamental as Seq, I think it should be consistent.
>
> Biopython-dev discussed making Seq inherit from str or basestring
> earlier [1], and I think it was decided that while actual inheritance
> would be tricky, Seq should mimic that interface as much as possible
> (using the alphabet attribute for validation and extra features,
> mainly). So we'd treat Seq as a string-like type -- option (b) -- and
> let SeqRecord be the complex type that has a sequence, accession
> number, location, etc., where object identity is the only valid case
> for equality.
>
> In short: +1 for your patch on GitHub; I think the rationale is solid.
>
> -Eric
>
> [1] http://bugzilla.open-bio.org/show_bug.cgi?id=2351#c6

Nicely put.

Peter

From bugzilla-daemon at portal.open-bio.org Fri Nov 27 08:10:00 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 27 Nov 2009 08:10:00 -0500
Subject: [Biopython-dev] [Bug 2954] xbbtools and SeqGui still using Bio.Translate and Bio.Transcribe
In-Reply-To:
Message-ID: <200911271310.nARDA0lj005870@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2954

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-27 08:10 EST -------
SeqGui is fixed (also updated wxPython calls as the old way is now deprecated)

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Fri Nov 27 09:50:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 27 Nov 2009 09:50:08 -0500
Subject: [Biopython-dev] [Bug 2954] xbbtools and SeqGui still using Bio.Translate and Bio.Transcribe
In-Reply-To:
Message-ID: <200911271450.nAREo8Ln008920@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2954

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-27 09:50 EST -------
xbbtools updated to avoid Bio.Translate

Marking as fixed.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython at maubp.freeserve.co.uk Fri Nov 27 11:23:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 27 Nov 2009 16:23:52 +0000
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090406220826.GH43636@sobchak.mgh.harvard.edu>
References: <320fb6e00904060625v4a49da2au76159eae18f707eb@mail.gmail.com>
 <20090406220826.GH43636@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00911270823g320c7c24pd0773ae8b72902ee@mail.gmail.com>

Hi all,

Brad has some GFF parsing code he has been working on, which
would be nice to merge into Biopython at some point. See:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005700.html

As we started to discuss earlier this year, we need to think about
what to do with the existing (old) Bio.GFF module. It was written
by Michael Hoffman back in 2002 and accesses MySQL General
Feature Format (GFF) databases created with BioPerl.

I've been looking at the old Bio.GFF code, and there are a lot of
redundant things like its own GenBank/EMBL location parsing,
plus its own location objects and its own Feature objects (rather
than reusing Bio.SeqFeature which should have sufficed).

I want to suggest we deprecate Michael Hoffman's Bio.GFF module
in Biopython 1.53 (I'm hoping we can do this next month, Dec 2009).
Depending on how soon Brad's code is ready to be merged (which
I am assuming could be Biopython 1.54, spring 2010), we can
perhaps accelerate removal of the old module.

How does that sound? If we're all happy on the dev list, we'll still
need to ask on the main list in case anyone is using the old
Bio.GFF code.
Peter From bugzilla-daemon at portal.open-bio.org Mon Nov 30 14:06:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 30 Nov 2009 14:06:12 -0500 Subject: [Biopython-dev] [Bug 2957] New: GenBank Writer Should Write Out Date Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2957 Summary: GenBank Writer Should Write Out Date Product: Biopython Version: 1.52 Platform: PC OS/Version: Windows Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: n.j.loman at bham.ac.uk Hi, Would it be possible to modify the GenBank writer to output the date in the LOCUS header. When reading GenBank files this is already parsed in the correct format into record.annotations, so the following patch would work: diff -u InsdcIO.py InsdcIO-date.py --- InsdcIO.py 2009-11-30 19:54:05.000000000 +0000 +++ InsdcIO-date.py 2009-11-30 19:55:32.000000000 +0000 @@ -278,12 +278,13 @@ assert len(division) == 3 #TODO - date #TODO - mol_type - line = "LOCUS %s %s %s %s %s 01-JAN-1980\n" \ + line = "LOCUS %s %s %s %s %s %s\n" \ % (locus.ljust(16), str(len(record)).rjust(11), units, mol_type.ljust(6), - division) + division, + record.annotations.get('date', '01-JAN-1980')) assert len(line) == 79+1, repr(line) #plus one for new line assert line[12:28].rstrip() == locus, \ I realise you might not work when converting between different record types, but this would suit my needs for the time being. Cheers -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From tallpaulinjax at yahoo.com Sun Nov 1 19:50:31 2009 From: tallpaulinjax at yahoo.com (Paul B) Date: Sun, 1 Nov 2009 11:50:31 -0800 (PST) Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex Message-ID: <882158.31004.qm@web30705.mail.mud.yahoo.com> Hi, ? 
I'm a computer science guy trying to figure out some chemistry logic to
support my thesis, so bear with me! :-) To sum it up, I'm not sure
MMCIFParser is handling ATOM and MODEL records correctly because of this
code in MMCIFParser:

            if fieldname=="HETATM":
                hetatm_flag="H"
            else:
                hetatm_flag=" "

This causes ATOM (and potentially MODEL) records to die as seen in the
exception below (I think!)

My questions are:
1. Am I correct that the current code is insufficient?
2. What additional logic beyond just recognizing whether it's a HETATM,
   ATOM or MODEL record needs to be added?

Thanks!

Paul

Background:
I understand MMCIFlex.py et cetera is commented out in the Windows
setup.py package due to difficulties compiling it. So I re-wrote MMCIFlex
strictly in Python to emulate what I THINK the original MMCIFlex did. My
version processes a .cif file one line at a time (readline()), then passes
tokens back to MMCIF2Dict at each call to get_token(). That seems to work
fine for unit testing of my MMCIFlex and MMCIF2Dict, which I had to
slightly re-write (to ensure it handled how I passed SEMICOLONS lines back
etc).

However, when I try to use this with MMCIFParser against the 2beg .cif
file, which has no HETATM records and, as I understand the definition, no
disordered atoms, I get:

C:\Python25\Lib\site-packages\Bio\PDB\StructureBuilder.py:85: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 0.
  PDBConstructionWarning)
C:\Python25\Lib\site-packages\Bio\PDB\StructureBuilder.py:122: PDBConstructionWarning: WARNING: Residue (' ', 17, ' ') redefined at line 0.
  PDBConstructionWarning)
Traceback (most recent call last):
  File "MMCIFParser.py", line 140, in
    structure=p.get_structure("test", filename)
  File "MMCIFParser.py", line 23, in get_structure
    self._build_structure(structure_id)
  File "MMCIFParser.py", line 88, in _build_structure
    icode)
  File "C:\Python25\lib\site-packages\Bio\PDB\StructureBuilder.py", line 148, in init_residue
    % (resname, field, resseq, icode))
PDBExceptions.PDBConstructionException: Blank altlocs in duplicate residue LEU (' ', 17, ' ')

Basically what I think MIGHT be happening is that MMCIFParser is currently
only handling HETATM records; when some other kind of record comes in
(ATOM, MODEL) it is treated incorrectly. See below.

MMCIFParser.py
    def _build_structure(self, structure_id):
 ....
        fieldname_list=mmcif_dict["_atom_site.group_PDB"]
 ....
        for i in xrange(0, len(atom_id_list)):
     ...
            altloc=alt_list[i]
            if altloc==".":
                altloc=" "
     ...
            fieldname=fieldname_list[i]
            #How are ATOM and MODEL records handled?
            if fieldname=="HETATM":
                hetatm_flag="H"
            else:
                hetatm_flag=" "
            if current_chain_id!=chainid:
                current_chain_id=chainid
                structure_builder.init_chain(current_chain_id)
                current_residue_id=resseq
                icode, int_resseq=self._get_icode(resseq)
                #This is line 87-88 in the real file
                structure_builder.init_residue(resname, hetatm_flag, int_resseq,
                    icode)

Class StructureBuilder:
    ...
    def init_residue(self, resname, field, resseq, icode):
        if field!=" ":
            if field=="H":
                # The hetero field consists of H_ + the residue name (e.g. H_FUC)
                field="H_"+resname
        res_id=(field, resseq, icode)
       ...
        #This line will get executed for any non-HETATM record (ie ATOM Or MODEL)
        #because in MMCIFParser, if it wasn't a HETATM, then it's a ' '
        if field==" ":
            if self.chain.has_id(res_id):
=======>But there are no point mutations in 2beg that I know of. Shouldn't be here!
                # There already is a residue with the id (field, resseq, icode).
                # This only makes sense in the case of a point mutation.
                if __debug__:
                    warnings.warn("WARNING: Residue ('%s', %i, '%s') "
                                  "redefined at line %i."
                                  % (field, resseq, icode, self.line_counter),
                                  PDBConstructionWarning)
                duplicate_residue=self.chain[res_id]
                if duplicate_residue.is_disordered()==2:
                    # The residue in the chain is a DisorderedResidue object.
                    # So just add the last Residue object.
                    if duplicate_residue.disordered_has_id(resname):
                        # The residue was already made
                        self.residue=duplicate_residue
                        duplicate_residue.disordered_select(resname)
                    else:
                        # Make a new residue and add it to the already
                        # present DisorderedResidue
                        new_residue=Residue(res_id, resname, self.segid)
                        duplicate_residue.disordered_add(new_residue)
                        self.residue=duplicate_residue
                        return
                else:
                    # Make a new DisorderedResidue object and put all
                    # the Residue objects with the id (field, resseq, icode) in it.
                    # These residues each should have non-blank altlocs for all their atoms.
                    # If not, the PDB file probably contains an error.
====>               #This is the line throwing the exception, but we shouldn't be here!
                    if not self._is_completely_disordered(duplicate_residue):
                        # if this exception is ignored, a residue will be missing
                        self.residue=None
                        raise PDBConstructionException(\
                            "Blank altlocs in duplicate residue %s ('%s', %i, '%s')" \
                            % (resname, field, resseq, icode))
                    self.chain.detach_child(res_id)
                    new_residue=Residue(res_id, resname, self.segid)
                    disordered_residue=DisorderedResidue(res_id)
                    self.chain.add(disordered_residue)
                    disordered_residue.disordered_add(duplicate_residue)
                    disordered_residue.disordered_add(new_residue)
                    self.residue=disordered_residue
                    return
        residue=Residue(res_id, resname, self.segid)
        self.chain.add(residue)
        self.residue=residue

From biopython at maubp.freeserve.co.uk  Sun Nov  1 21:28:50 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 1 Nov 2009 21:28:50 +0000
Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
In-Reply-To: <882158.31004.qm@web30705.mail.mud.yahoo.com>
References: <882158.31004.qm@web30705.mail.mud.yahoo.com>
Message-ID: <320fb6e00911011328n57880ddmd47677c9b5ce597f@mail.gmail.com>

On Sun, Nov 1, 2009 at 7:50 PM, Paul B wrote:
>
> Hi,
>
> I'm a computer science guy trying to figure out some chemistry logic
> to support my thesis, so bear with me! :-) To sum it up, I'm not sure
> MMCIFParser is handling ATOM and MODEL records correctly
> because of this code in MMCIFParser:
>             if fieldname=="HETATM":
>                 hetatm_flag="H"
>             else:
>                 hetatm_flag=" "
> This causes ATOM (and potentially MODEL) records to die as seen
> in the exception below (I think!)

I'll answer that below.

> My questions are:
> 1. Am I correct that the current code is insufficient?
> 2. What additional logic beyond just recognizing whether it's a
> HETATM, ATOM or MODEL record needs to be added?
>
> Thanks!
>
> Paul
>
>
> Background:
> I understand MMCIFlex.py et cetera is commented out in the
> Windows setup.py package due to difficulties compiling it.

It is commented out (on all platforms) because we don't know how
to get setup.py to detect if flex and the relevant headers are
installed, which we would need to compile the code. I'm not sure
how this would work on Windows with an installer (i.e. what is a
run time dependency versus compile time).

> So I re-wrote MMCIFlex strictly in Python to emulate what

Now that would be very handy (IMO), if you can get it working.
Have you benchmarked it against the flex code?

> I THINK the original MMCIFlex did. My version processes
> a .cif file one line at a time (readline()) then passes tokens
> back to MMCIF2Dict at each call to get_token(). That
> seems to work fine for unit testing of my MMCIFlex and
> MMCIF2Dict which I had to slightly re-write (to ensure it
> handled how I passed SEMICOLONS line back etc).
>
> However when I try and use this with MMCIFParser
> against the 2beg .cif file which has no HETATM records
> and, as I understand the definition, no disordered atoms
> I get:
>
> ...
>
> Basically what I think MIGHT be happening is MMCIFParser
> is currently only handling HETATM records, when some other
> kind of record comes in (ATOM, MODEL) it is treated
> incorrectly. See below.
>
> ...

Have you been able to test the flex code? If not, could you give
me a tiny script using the 2beg cif file which should work?
If that works, then the problem is in your flex replacement code.

Peter

From tallpaulinjax at yahoo.com  Mon Nov  2 13:21:14 2009
From: tallpaulinjax at yahoo.com (Paul B)
Date: Mon, 2 Nov 2009 05:21:14 -0800 (PST)
Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
In-Reply-To: <320fb6e00911011328n57880ddmd47677c9b5ce597f@mail.gmail.com>
Message-ID: <621165.33564.qm@web30706.mail.mud.yahoo.com>

I'll use the conventional response technique in future emails! :-)

Hi Peter,

1. "Did you mean to not CC the list?": Sorry, I replied to your email
   address instead of the CC: address!
2. Peter: "I should be able to run the flex code and your new code side by
   side, for testing and profiling. Not sure when I'll find the time
   exactly, but we'll see. Examples will help as while I know plenty about
   PDB files, I've not used CIF at all": I'd be glad to run the tests
   myself as well and I have the time! :-) But without the flex module
   installed and operational, the only way I can think of is with pickle'd
   .cif dicts.
3. Peter: "P.S. Are you OK with making this contribution under the
   Biopython license?" Absolutely, I'd be glad to contribute to biopython!

This was in response to my followup email to Peter:

"Hi Peter:

Paul: So I re-wrote MMCIFlex strictly in Python to emulate (the lex based
MMCIFlex)

Peter: Now that would be very handy (IMO), if you can get it working. Have
you benchmarked it against the flex code? Have you been able to test the
flex code? If not, could you give me a tiny script using the 2beg cif file
which should work? If that works, then the problem is in your flex
replacement code.

Paul: It already works, but I have no way to benchmark it against the flex
code myself. Perhaps someone could pickle a half dozen PDB .cif files and
send me the resultant files? I can then run a test against each one. I'll
also clean up the code on both the new MMCIFlex.py as well as the changed
MMCIF2Dict.py and send them to you most probably by today. Each will have
a __main__ method for testing."

From tallpaulinjax at yahoo.com  Mon Nov  2 22:03:51 2009
From: tallpaulinjax at yahoo.com (Paul B)
Date: Mon, 2 Nov 2009 14:03:51 -0800 (PST)
Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
Message-ID: <767823.60315.qm@web30702.mail.mud.yahoo.com>

Hi Peter,

I have attached drafts of MMCIFlex.py and MMCIFParser.py. They have
__main__ methods that perform decent testing. On my system, I have
replaced their same-named counterparts in the appropriate folders. Please
note, however, that this version of MMCIFlex.py and MMCIFParser.py must
work together as a pair! So, I don't know how you guys handle that: give
them new names, or replace old files?

I can't test them further right now because I believe MMCIFParser needs
corrections. For example, PDBParser.py calls the following methods in
its StructureBuilder object:

structure_builder.init_structure
structure_builder.set_header
structure_builder.set_line_counter
structure_builder.init_model
structure_builder.init_seg
structure_builder.init_chain
structure_builder.init_residue
structure_builder.init_atom
structure_builder.set_anisou
structure_builder.set_siguij
structure_builder.set_sigatm

However, MMCIFParser only calls:

structure_builder.init_structure
structure_builder.init_model
structure_builder.init_seg
structure_builder.init_chain
structure_builder.init_residue
structure_builder.init_atom
structure_builder.set_anisou

leaving out calls to:

structure_builder.set_header
structure_builder.set_line_counter
structure_builder.set_siguij
structure_builder.set_sigatm

I believe the last two might be important for some people. I don't know
whether the first two are just housekeeping, etc... still checking.
So I am still looking into MMCIFParser, in particular why it's bombing
when creating a structure from 2beg.cif while PDBParser works correctly on
pdb2beg.ent.

Paul

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: MMCIF2Dict.py
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: MMCIFlex.py
URL:

From bugzilla-daemon at portal.open-bio.org  Tue Nov  3 13:20:14 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Nov 2009 08:20:14 -0500
Subject: [Biopython-dev] [Bug 2929] NCBIXML PSI-Blast parser should gather all information from XML blastgpg output
In-Reply-To:
Message-ID: <200911031320.nA3DKErZ024365@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2929

------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2009-11-03 08:20 EST -------
(In reply to comment #3)
> (In reply to comment #2)
> > What specifically is our parser failing to extract from this example PSI
> > BLAST XML file?
>
> (Sorry, I've been away)
> Well, currently the code tries to get several pieces of information from the
> Blast.Record.PSIBlast (brecord):
>
> brecord.converged

There is a CONVERGED line in the XML we should be able to use here.
I don't recall seeing this in pgpblast output from older versions of BLAST.

> brecord.query
> brecord.query_letters

Those work (query and query_letters).
> brecord.rounds
> brecord.rounds.alignments
> brecord.rounds.alignments.title
> brecord.rounds.alignments.hsps

Those also work, not via rounds but as separate BLAST record objects.
See mailing list discussion regarding PSI-BLAST and multiple BLAST queries.

> then in the hsps:
> hsp.identities
> hsp.positives
> hsp.query
> hsp.sbjct
> hsp.match
> hsp.expect
> hsp.query_start
> hsp.query_end
> hsp.sbjct_start
> hsp.sbjct_end

Again, those are all parsed fine.

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From tallpaulinjax at yahoo.com  Tue Nov  3 16:36:08 2009
From: tallpaulinjax at yahoo.com (Paul B)
Date: Tue, 3 Nov 2009 08:36:08 -0800 (PST)
Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
Message-ID: <468854.15801.qm@web30708.mail.mud.yahoo.com>

Hi,

I have found the reason why MMCIFParser is dying. It has no provision for
more than one model, so when a second model comes around with the same
chain and residue the program throws an exception.

I will be joining github to submit the required changes. I haven't used
github before, and this is my first open source project, so please give me
a few days to acclimate.

My mods so far are as follows in MMCIFParser.py (and require the
MMCIFlex.py and MMCIF2Dict.py files I will be submitting via github, and
have submitted to Peter privately.)

Change the __doc__ setting:

#Mod by Paul T. Bathen to reflect MMCIFlex built solely in Python
__doc__="mmCIF parser (implemented solely in Python, no lex/flex/C code needed)"

Insert the following model_list line:

        occupancy_list=mmcif_dict["_atom_site.occupancy"]
        fieldname_list=mmcif_dict["_atom_site.group_PDB"]
        #Added by Paul T. Bathen Nov 2009
        model_list=mmcif_dict["_atom_site.pdbx_PDB_model_num"]
        try:

Make the following changes:

        #Modified by Paul T. Bathen Nov 2009: comment out this line
        #current_model_id=0
        structure_builder=self._structure_builder
        structure_builder.init_structure(structure_id)
        #Modified by Paul T. Bathen Nov 2009: comment out this line
        #structure_builder.init_model(current_model_id)
        structure_builder.init_seg(" ")
        #Added by Paul T. Bathen Nov 2009
        current_model_id = -1

Make the following changes in the for loop:

            #Note by Paul T. Bathen: should this include the HOH and WAT stmts in PDBParser?
            if fieldname=="HETATM":
                hetatm_flag="H"
            else:
                hetatm_flag=" "

            #Added by Paul T. Bathen Nov 2009
            model_id = model_list[i]
            if current_model_id != model_id:
                current_model_id = model_id
                structure_builder.init_model(current_model_id)
            #end of addition

After these changes took place, and with the new MMCIFlex and MMCIF2Dict in
place, I was able to parse and test 2beg.cif and pdb2beg.ent, and both
parsed with the same number of models, chains, and residues.

The only difference is that PDBParser incorrectly states the first model
as 0 when it should be 1: there is an explicit MODEL line in pdb2beg.ent.
So all the models are off by one in 2beg when parsed by PDBParser.py. I can
look into the bug in PDBParser.py and submit it if desired?

Paul

From kellrott at gmail.com  Tue Nov  3 16:46:57 2009
From: kellrott at gmail.com (Kyle Ellrott)
Date: Tue, 3 Nov 2009 08:46:57 -0800
Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence
In-Reply-To: <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com>
References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com>
	<320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com>
Message-ID:

(Moving this thread to Biopython-dev)

I've hacked together some code, and tested it against the bacterial genome
library I had on hand (of course, eukaryotic features will be more
complicated, so I will need to test against them next). Examples of
'exotic' feature locations would be helpful.
I've posted the code below. I'll be moving it into my git fork, and adding
some testing. Any thoughts on where it should go? It seems like it would
work best as a SeqRecord method.

def FeatureIDGuess( feature ):
    id = "N/A"
    try:
        id = feature.qualifiers['locus_tag'][0]
    except KeyError:
        try:
            id = feature.qualifiers['plasmid'][0]
        except KeyError:
            pass
    return id

def FeatureDescGuess( feature ):
    desc = ""
    try:
        desc=feature.qualifiers['product'][0]
    except KeyError:
        pass
    return desc

def ExtractFeatureDNA( record, feature ):
    dna = None
    if len( feature.sub_features ):
        dnaStr = ""
        for subFeat in feature.sub_features:
            if subFeat.location_operator=='join':
                subSeq = ExtractFeatureDNA( record, subFeat )
                dnaStr += subSeq.seq
        dna = Seq( str(dnaStr), IUPAC.unambiguous_dna)
        if ( feature.strand == -1 ):
            dna = dna.reverse_complement()
    else:
        start_pos = feature.location.start.position
        end_pos = feature.location.end.position
        seqStr = record.seq[ start_pos:end_pos ]
        dna = Seq( str(seqStr), IUPAC.unambiguous_dna)
        if ( feature.strand == -1 and feature.location_operator != 'join' ):
            dna = dna.reverse_complement()
    outSeq = SeqRecord( dna, FeatureIDGuess( feature ) , description=FeatureDescGuess( feature ) )
    return outSeq

On Mon, Nov 2, 2009 at 2:30 PM, Peter wrote:

> On Mon, Nov 2, 2009 at 9:31 PM, Kyle Ellrott wrote:
> >>
> >> You missed this thread earlier this month:
> >> http://lists.open-bio.org/pipermail/biopython/2009-October/005695.html
> >>
> >> Are you on the dev mailing list? I was hoping to get a little discussion
> >> going there, before moving over to the discussion list for more general
> >> comment.
> >
> > I didn't need to do it when the original discussion came through, so it got
> > 'filtered' ;-) I guess if multiple people are asking the same question
> > independently, it's probably a timely issue.
> >
> > I'll probably go ahead and pull the SeqRecord fork into my git fork and
> > start playing around with it.
>
> Cool - sorry if the previous email was brusque - I was in the middle
> of dinner preparation and shouldn't have been checking emails.
>
> If you just want to try the sequence extraction for a SeqFeature,
> the code is on the trunk (as noted, as a function in a unit test).
> My SeqRecord github branch is looking at other issues.
>
> Peter
>

From biopython at maubp.freeserve.co.uk  Tue Nov  3 17:09:37 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 3 Nov 2009 17:09:37 +0000
Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence
In-Reply-To:
References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com>
	<320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com>
Message-ID: <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com>

On Tue, Nov 3, 2009 at 4:46 PM, Kyle Ellrott wrote:
> (Moving this thread to Biopython-dev)
>
> I've hacked together some code, and tested it against the bacterial genome
> library I had on hand (of course, eukariotic features will be more
> complicated, so will need to test against them next).  Examples of 'exotic'
> feature location would be helpful.
> I've posted the code below.  I'll be moving it into my git fork, and add
> some testing.  Any thoughts where it should go?  It seems like it would best
> work as a SeqRecord method.

i.e. Option (4) of this list of ideas?
http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006922.html

Peter

P.S.

def FeatureDescGuess( feature ):
    desc = ""
    try:
        desc=feature.qualifiers['product'][0]
    except KeyError:
        pass
    return desc

Could be just:

def FeatureDescGuess( feature ):
    return feature.qualifiers.get('product', [""])[0]

and therefore doesn't really need an entire function.
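
[Editorial sketch] The one-liner in the P.S. above relies on dict.get()
returning the supplied default ([""]) when the key is absent, so the
trailing [0] indexes a list rather than None. The snippet below is a
minimal illustration of that behaviour using plain dicts as stand-ins for
feature.qualifiers; the helper first_qualifier and the sample qualifier
values are hypothetical and not part of Biopython.

```python
def first_qualifier(qualifiers, keys, default=""):
    """Return the first value of the first key in `keys` found in `qualifiers`.

    `qualifiers` is assumed to map qualifier names to lists of strings,
    as in the code quoted above. Hypothetical helper, not a Biopython API.
    """
    for key in keys:
        values = qualifiers.get(key)
        if values:  # skips missing keys and empty lists
            return values[0]
    return default


# Plain dicts standing in for feature.qualifiers (made-up sample values).
with_product = {"product": ["DNA polymerase III"], "note": ["putative"]}
without_product = {"note": ["hypothetical protein"]}

# The one-liner from the P.S.: a missing key falls back to [""], then [0].
one_liner = without_product.get("product", [""])[0]

# A fallback chain over several qualifier names; the order is an assumption.
preferred = ["product", "db_xref", "note"]
desc_a = first_qualifier(with_product, preferred)
desc_b = first_qualifier(without_product, preferred)
```

If more than one qualifier should be tried in turn, a small helper like
this keeps the preference order in one list instead of nested try/except
blocks.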
From biopython at maubp.freeserve.co.uk Tue Nov 3 17:13:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 3 Nov 2009 17:13:25 +0000 Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex In-Reply-To: <468854.15801.qm@web30708.mail.mud.yahoo.com> References: <468854.15801.qm@web30708.mail.mud.yahoo.com> Message-ID: <320fb6e00911030913u11cf6380q2a2bbc07b2b1863f@mail.gmail.com> On Tue, Nov 3, 2009 at 4:36 PM, Paul B wrote: > > Hi, > > I have found the reason why MMCIParser is dying. It has no provision > for more than one model, so when a second model comes around with > the same chain and residue the program throws an exception. Please file a bug report on bugzilla for that. I guess no-one has tried NMR CIF data with the parser before (!). > I will be joining github to submit the required changes. I haven't used > github before, and this is my first open source project so please give > me a few days to acclimate. I you like - great. Otherwise we can manage with patches via an enhancement bug on bugzilla. > My mods so far are as follows in MMCIFParser.py (and require the > MMCIFlex.py and MMCIF2Dict.py files I will be submitting via github, > and have submitted to Peter privately.) Actually, I think that ended up on mailing list: http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006938.html > The only difference is the PDBParser incorrectly states the first model as 0 > when it should be 1: there is an explicit MODEL line in pdb2beg.ent. So all > the models are off by one in 2beg when parsed by PDBParser.py. I can > look into the bug in PDBParser.py and submit it if desired? Are you sure it should it be 1 and not 0? Remember, Python counts from zero. 
Peter From bugzilla-daemon at portal.open-bio.org Tue Nov 3 17:19:08 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 3 Nov 2009 12:19:08 -0500 Subject: [Biopython-dev] [Bug 2731] Adding .upper() and .lower() methods to the Seq object In-Reply-To: Message-ID: <200911031719.nA3HJ84Y001795@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2731 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-03 12:19 EST ------- Created an attachment (id=1389) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1389&action=view) Patch to Bio/Seq.py Compared to the earlier patch, this takes the less invasive approach of only editing Bio/Seq.py (covering both Seq and UnknownSeq, with doctests), but has the downside that it is not easy to deal with gapped alphabets etc nicely. Adding (private) upper/lower methods as outlined in the earlier patch does seem a better plan. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From kellrott at gmail.com Tue Nov 3 17:23:27 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Tue, 3 Nov 2009 09:23:27 -0800 Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence In-Reply-To: <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com> References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com> <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com> <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com> Message-ID: > > > def FeatureDescGuess( feature ): > desc = "" > try: > desc=feature.qualifiers['product'][0] > except KeyError: > pass > return desc > > Could be just: > > def FeatureDescGuess( feature ): > return feature.qualifiers.get('product', [""])[0] > > and therefore doesn't really need an entire function. 
>

That could attempt to get the first element of a None type, if the 'product'
qualifier doesn't exist.
Actually, I wrote it that way so it could be extended. First it would try
'product' and if that didn't exist replace it with something like a
'db_xref' or a 'note' entry. I was hoping to get some input on what people
think would be 'order of importance' of things to try.

Kyle

From bugzilla-daemon at portal.open-bio.org Tue Nov 3 17:34:05 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Nov 2009 12:34:05 -0500
Subject: [Biopython-dev] [Bug 2731] Adding .upper() and .lower() methods to the Seq object
In-Reply-To:
Message-ID: <200911031734.nA3HY52U002483@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2731

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #1192 is|0                           |1
           obsolete|                            |
Attachment #1389 is|0                           |1
           obsolete|                            |

------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-03 12:34 EST -------
Created an attachment (id=1390)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1390&action=view)
Patch to Bio/Seq.py and Bio/Alphabet/__init__.py

Based on attachment 1192 and recent attachment 1389 with doctests.

From tallpaulinjax at yahoo.com Tue Nov 3 17:34:46 2009
From: tallpaulinjax at yahoo.com (Paul B)
Date: Tue, 3 Nov 2009 09:34:46 -0800 (PST)
Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
In-Reply-To: <320fb6e00911030913u11cf6380q2a2bbc07b2b1863f@mail.gmail.com>
Message-ID: <869703.76049.qm@web30705.mail.mud.yahoo.com>

>Are you sure it should be 1 and not 0? Remember, Python counts from zero.

If the MODEL record in the .ent record says MODEL 1, should biopython report it
as 0? In PDBParser, the code is as follows:

        current_model_id=0
        # Flag we have an open model
        model_open=0
        for i in range(0, len(coords_trailer)):

            if(record_type=='ATOM  ' or record_type=='HETATM'):
                # Initialize the Model - there was no explicit MODEL record
                if not model_open:
                    structure_builder.init_model(current_model_id)
                    current_model_id+=1
                    model_open=1

--- On Tue, 11/3/09, Peter wrote:

From: Peter
Subject: Re: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
To: "Paul B"
Cc: biopython-dev at biopython.org
Date: Tuesday, November 3, 2009, 12:13 PM

On Tue, Nov 3, 2009 at 4:36 PM, Paul B wrote:
>
> Hi,
>
> I have found the reason why MMCIFParser is dying. It has no provision
> for more than one model, so when a second model comes around with
> the same chain and residue the program throws an exception.

Please file a bug report on bugzilla for that. I guess no-one has
tried NMR CIF data with the parser before (!).

> I will be joining github to submit the required changes. I haven't used
> github before, and this is my first open source project so please give
> me a few days to acclimate.

If you like - great. Otherwise we can manage with patches via an
enhancement bug on bugzilla.

> My mods so far are as follows in MMCIFParser.py (and require the
> MMCIFlex.py and MMCIF2Dict.py files I will be submitting via github,
> and have submitted to Peter privately.)

Actually, I think that ended up on mailing list:
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006938.html

> The only difference is the PDBParser incorrectly states the first model as 0
> when it should be 1: there is an explicit MODEL line in pdb2beg.ent. So all
> the models are off by one in 2beg when parsed by PDBParser.py. I can
> look into the bug in PDBParser.py and submit it if desired?

Are you sure it should be 1 and not 0? Remember, Python counts from zero.

Peter

From bugzilla-daemon at portal.open-bio.org Tue Nov 3 17:39:46 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Nov 2009 12:39:46 -0500
Subject: [Biopython-dev] [Bug 2731] Adding .upper() and .lower() methods to the Seq object
In-Reply-To:
Message-ID: <200911031739.nA3Hdk8e002675@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2731

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-03 12:39 EST -------
Marking as fixed - updated patch checked in, with additional unit tests in
Tests/test_seq.py

From bugzilla-daemon at portal.open-bio.org Tue Nov 3 17:39:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Nov 2009 12:39:48 -0500
Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string?
In-Reply-To:
Message-ID: <200911031739.nA3HdmDM002689@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2351

Bug 2351 depends on bug 2731, which changed state.

Bug 2731 Summary: Adding .upper() and .lower() methods to the Seq object
http://bugzilla.open-bio.org/show_bug.cgi?id=2731

           What    |Old Value                   |New Value
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

From biopython at maubp.freeserve.co.uk Tue Nov 3 17:41:31 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 3 Nov 2009 17:41:31 +0000
Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence
In-Reply-To:
References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com>
 <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com>
 <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com>
Message-ID: <320fb6e00911030941l5f5748f3ofdfc2144af209a5e@mail.gmail.com>

On Tue, Nov 3, 2009 at 5:23 PM, Kyle Ellrott wrote:
>>
>> def FeatureDescGuess( feature ):
>>   desc = ""
>>   try:
>>       desc=feature.qualifiers['product'][0]
>>   except KeyError:
>>       pass
>>   return desc
>>
>> Could be just:
>>
>> def FeatureDescGuess( feature ):
>>   return feature.qualifiers.get('product', [""])[0]
>>
>> and therefore doesn't really need an entire function.
>
> That could attempt to get the first element of a None type, if the 'product'
> qualifier doesn't exist.

No, because we supply a default value (a list containing the empty string).

> Actually, I wrote it that way so it could be extended. First it would try
> 'product' and if that didn't exist replace it with something like a
> 'db_xref' or a 'note' entry. I was hoping to get some input on what people
> think would be 'order of importance' of things to try.

I might try and follow the NCBI's conventions used in FAA files for each
GenBank file - see the bacteria folder on their FTP site.
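
[Editor's sketch] The extensible lookup discussed in this thread could look roughly like the following. It is only an illustration, not code from either poster: the function name `feature_desc_guess`, the use of a plain dict in place of `feature.qualifiers`, and above all the fallback order (`product`, then `note`, then `db_xref`) are assumptions, since the thread leaves the order of importance open.

```python
def feature_desc_guess(qualifiers, keys=("product", "note", "db_xref")):
    """Return the first value of the first qualifier found, or "" if none.

    `qualifiers` is assumed to be a dict mapping qualifier names to lists
    of strings; the key order is a guess, not an agreed convention.
    """
    for key in keys:
        values = qualifiers.get(key, [])
        if isinstance(values, str):
            # Tolerate a bare string where a list of strings was expected.
            values = [values]
        if values:
            return values[0]
    return ""


print(feature_desc_guess({"product": ["DNA gyrase subunit A"]}))
print(feature_desc_guess({"note": ["hypothetical"], "db_xref": ["GI:1"]}))
print(repr(feature_desc_guess({})))
```

Passing a different `keys` tuple changes the order without touching the function body, which is the extension point the original try/except version was aiming for.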
Peter

From bugzilla-daemon at portal.open-bio.org Tue Nov 3 17:58:37 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Nov 2009 12:58:37 -0500
Subject: [Biopython-dev] [Bug 2943] New: MMCIFParser only handling a single model.
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2943

           Summary: MMCIFParser only handling a single model.
           Product: Biopython
           Version: 1.52
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: TallPaulInJax at yahoo.com

MMCIFParser as-written only handles a single model in a protein. Any protein
that has multiple models with repeating chains and residues will get an
exception since the residue ID will already exist.

Please make the following changes in MMCIFParser.py:

Change the __doc__ setting:
#Optional __DOC__ change if the new MMCIFlex is not used nor the changes
#to MMCIF2Dict based on the new MMCIFlex.
#Mod by Paul T. Bathen to reflect MMCIFlex built solely in Python
__doc__="mmCIF parser (implemented solely in Python, no lex/flex/C code needed)"

Regardless of the DOC changes:
Insert the following model_list line
        occupancy_list=mmcif_dict["_atom_site.occupancy"]
        fieldname_list=mmcif_dict["_atom_site.group_PDB"]
        #Added by Paul T. Bathen Nov 2009
        model_list=mmcif_dict["_atom_site.pdbx_PDB_model_num"]
        try:

Make the following changes:
        #Modified by Paul T. Bathen Nov 2009: comment out this line
        #current_model_id=0
        structure_builder=self._structure_builder
        structure_builder.init_structure(structure_id)
        #Modified by Paul T. Bathen Nov 2009: comment out this line
        #structure_builder.init_model(current_model_id)
        structure_builder.init_seg(" ")
        #Added by Paul T. Bathen Nov 2009
        current_model_id = -1

Make the following changes in the for loop:
            #Note by Paul T. Bathen: should MMCIFParser include
            #the HOH and WAT stmts in PDBParser immediately below?
            #if fieldname=="HETATM":
            #    if resname=="HOH" or resname=="WAT":
            #        hetero_flag="W"
            #    else:
            #        hetero_flag="H"
            if fieldname=="HETATM":
                hetatm_flag="H"
            else:
                hetatm_flag=" "
            #Added by Paul T. Bathen Nov 2009
            model_id = model_list[i]
            if current_model_id != model_id:
                current_model_id = model_id
                structure_builder.init_model(current_model_id)
            #end of addition

After these changes took place, and with the new MMCIFlex and MMCIF2Dict in
place, I was able to parse and test 2beg.cif and pdb2beg.ent and both parsed
with the same number of models, chains, and residues.

Paul

From kellrott at gmail.com Tue Nov 3 18:17:44 2009
From: kellrott at gmail.com (Kyle Ellrott)
Date: Tue, 3 Nov 2009 10:17:44 -0800
Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence
In-Reply-To: <320fb6e00911030941l5f5748f3ofdfc2144af209a5e@mail.gmail.com>
References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com>
 <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com>
 <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com>
 <320fb6e00911030941l5f5748f3ofdfc2144af209a5e@mail.gmail.com>
Message-ID:

>
> >> def FeatureDescGuess( feature ):
> >>   return feature.qualifiers.get('product', [""])[0]
> >>
> >> and therefore doesn't really need an entire function.
> >
> > That could attempt to get the first element of a None type, if the
> > 'product' qualifier doesn't exist.
>
> No, because we supply a default value (a list containing the empty string).
>

Didn't see that.
That's what I get for programming during a colloquium ;-)

Kyle

From biopython at maubp.freeserve.co.uk Tue Nov 3 21:49:31 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 3 Nov 2009 21:49:31 +0000
Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence
In-Reply-To:
References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com>
 <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com>
 <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com>
 <320fb6e00911030941l5f5748f3ofdfc2144af209a5e@mail.gmail.com>
Message-ID: <320fb6e00911031349v1303b7c3s2b1aaaa6fc695ced@mail.gmail.com>

On Tue, Nov 3, 2009 at 6:17 PM, Kyle Ellrott wrote:
>> >> def FeatureDescGuess( feature ):
>> >>   return feature.qualifiers.get('product', [""])[0]
>> >>
>> >> and therefore doesn't really need an entire function.
>> >
>> > That could attempt to get the first element of a None type, if the
>> > 'product'
>> > qualifier doesn't exist.
>>
>> No, because we supply a default value (a list containing the empty
>> string).
>
> Didn't see that. That's what I get for programming during a colloquium ;-)
>

:)

There could be a problem if the SeqFeature qualifiers wasn't a
list, for example a string like "NC_123456" instead of say
["NC_123456"], but the assumption is safe with anything
from the GenBank or EMBL parsers.

Peter

From bugzilla-daemon at portal.open-bio.org Tue Nov 3 21:51:51 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 3 Nov 2009 16:51:51 -0500
Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model.
In-Reply-To:
Message-ID: <200911032151.nA3LppCX010030@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2943

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-03 16:51 EST -------
Could you attach a patch to this bug (using the diff command line tool)?
A short example script parsing one of these problem CIF files would also be
very helpful and could form the basis of a new unit test. If you can suggest a
small (file size) CIF file we could use for this that would be ideal.

Thanks

From biopython at maubp.freeserve.co.uk Tue Nov 3 21:54:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 3 Nov 2009 21:54:15 +0000
Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
In-Reply-To: <869703.76049.qm@web30705.mail.mud.yahoo.com>
References: <320fb6e00911030913u11cf6380q2a2bbc07b2b1863f@mail.gmail.com>
 <869703.76049.qm@web30705.mail.mud.yahoo.com>
Message-ID: <320fb6e00911031354w659f33d2oad838eeffc8ea585@mail.gmail.com>

On Tue, Nov 3, 2009 at 5:34 PM, Paul B wrote:
>>Are you sure it should be 1 and not 0? Remember, Python counts from zero.
>
> If the MODEL record in the .ent record says MODEL 1, should
> biopython report it as 0? In PDBParser, the code is as follows:
> ...

If for the PDB parser Thomas already chose to report the model as 1,
then yes you are right - to be consistent the CIF parser should do
the same as the PDB parser.

Peter

From tallpaulinjax at yahoo.com Tue Nov 3 22:10:50 2009
From: tallpaulinjax at yahoo.com (Paul B)
Date: Tue, 3 Nov 2009 14:10:50 -0800 (PST)
Subject: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
In-Reply-To: <320fb6e00911031354w659f33d2oad838eeffc8ea585@mail.gmail.com>
Message-ID: <342688.27132.qm@web30705.mail.mud.yahoo.com>

Hmmm... no matter whether there is a (explicit first parsed) MODEL record or
not in the PDB file, PDBParser currently records the model id as 0. I can see
that being the choice if there wasn't a MODEL record. But if there IS a MODEL
record of 1, the model id should be forced to 0? What if the MODEL record
exists, and for some reason it is MODEL 2 and that's the first record PDBParser
parses? Should that also have a forced model id of 0? This seems to me to be a
bug instead of a feature. I'd hate to promulgate that error in the MMCIFParser
code as well. Someone could be thinking they are doing a study on model X when
really they are studying X+1, or worse yet X+Y where Y is the offset between
the forced 0 id and the true first model id. And if the models in the file have
skips, ie, MODEL 1 then 3,4,7, and 9... those should be model id's 0 through 4?
I don't know if that can happen, just saying... But if the TRUE model record
information is not trapped (and it's not), how would someone know what true
model they are looking at instead of the forced model id that starts at 0?

Not to rock the boat, but food for thought. I can make MMCIFParser match
whatever is deemed to be the correct thing to do.

Paul

--- On Tue, 11/3/09, Peter wrote:

From: Peter
Subject: Re: [Biopython-dev] Questions on StructureBuilder, MMCIFParser, and MMCIFlex
To: "Paul B"
Cc: "Biopython Development"
Date: Tuesday, November 3, 2009, 4:54 PM

On Tue, Nov 3, 2009 at 5:34 PM, Paul B wrote:
>>Are you sure it should be 1 and not 0? Remember, Python counts from zero.
>
> If the MODEL record in the .ent record says MODEL 1, should
> biopython report it as 0? In PDBParser, the code is as follows:
> ...

If for the PDB parser Thomas already chose to report the model as 1,
then yes you are right - to be consistent the CIF parser should do
the same as the PDB parser.
Peter

From kellrott at gmail.com Tue Nov 3 23:06:34 2009
From: kellrott at gmail.com (Kyle Ellrott)
Date: Tue, 3 Nov 2009 15:06:34 -0800
Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence
In-Reply-To: <320fb6e00911031349v1303b7c3s2b1aaaa6fc695ced@mail.gmail.com>
References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com>
 <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com>
 <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com>
 <320fb6e00911030941l5f5748f3ofdfc2144af209a5e@mail.gmail.com>
 <320fb6e00911031349v1303b7c3s2b1aaaa6fc695ced@mail.gmail.com>
Message-ID:

I've posted a branch on my git (
http://github.com/kellrott/biopython/tree/FeatureExtract ). The
Name/Description guess functions need to be finalized. I wrote a unit test
that extracts CDS feature dna, and then runs translate on the dna and
compares it to the translation stored in the feature. It passes all the
genbank files in the Test directory except for the ones that have 'N' in the
DNA sequence (that causes a translation exception) and one_of.gb (it refers
to sequence outside of the file).
More test ideas would be appreciated.

Kyle

> There could be a problem if the SeqFeature qualifiers wasn't a
> list, for example a string like "NC_123456" instead of say
> ["NC_123456"], but the assumption is safe with anything
> from the GenBank or EMBL parsers.
>
> Peter
>

From biopython at maubp.freeserve.co.uk Tue Nov 3 23:41:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 3 Nov 2009 23:41:57 +0000
Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features
In-Reply-To: <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com>
References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com>
 <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com>
Message-ID: <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com>

On Wed, Oct 28, 2009 at 12:50 PM, Peter wrote:
> On Wed, Oct 28, 2009 at 12:07 PM, Peter wrote:
>> I think this should be part of Biopython proper (with unit tests etc), and
>> would like to discuss where to put it. My ideas include:
>>
>> (1) Method of the SeqFeature object taking the parent sequence (as a
>> string, Seq, ...?) as a required argument. Would return an object of the
>> same type as the parent sequence passed in.
>>
>> (2) Separate function, perhaps in Bio.SeqUtils taking the parent
>> sequence (as a string, Seq, ...?) and a SeqFeature object. Would
>> return an object of the same type as the parent sequence passed in.
>>
>> (3) Method of the Seq object taking a SeqFeature, returning a Seq.
>> [A drawback is Bio.Seq currently does not depend on Bio.SeqFeature]
>>
>> (4) Method of the SeqRecord object taking a SeqFeature. Could
>> return a SeqRecord using annotation from the SeqFeature. Complex.
>>
>> Any other ideas?
>>
>> We could even offer more than one of these approaches, but ideally
>> there should be one obvious way for the end user to do this. My
>> question is, which is most intuitive? I quite like idea (1).
>>
>> In terms of code complexity, I expect (1), (2) and (3) to be about the
>> same. Building a SeqRecord in (4) is trickier.
>
> Actually, thinking about this over lunch, for many of the use cases
> we do want to turn a SeqFeature into a SeqRecord - either for the
> nucleotides, or in some cases their translation.
> And if doing this,
> do something sensible with the SeqFeature annotation (qualifiers)
> seems generally to be useful. This could still be done with approaches
> (1) and (2) as well as (4).

Kyle at least seems to like idea (4), so much so that he has
gone ahead and coded up something:
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006941.html

Certainly there are good reasons for wanting to be able to take
a SeqFeature and the parent sequence (SeqRecord or Seq)
and create a SeqRecord (either plain nucleotides or translated
into protein). e.g. pretty much all non-trivial GenBank to FASTA
conversions. Offering this as a SeqRecord method might be the
best approach, option (4).

However, this is I think on top of the more fundamental step
of just extracting the sequence (without worrying about the
annotation). Here as noted above, I currently favour adding
a method to the SeqFeature, option (1). How about as the
method name get_sequence, extract_sequence or maybe
just extract?

Peter

From biopython at maubp.freeserve.co.uk Tue Nov 3 23:49:33 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 3 Nov 2009 23:49:33 +0000
Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence
In-Reply-To:
References: <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com>
 <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com>
 <320fb6e00911030941l5f5748f3ofdfc2144af209a5e@mail.gmail.com>
 <320fb6e00911031349v1303b7c3s2b1aaaa6fc695ced@mail.gmail.com>
Message-ID: <320fb6e00911031549k5a7e5f7fs1d4c6a6299d20ed6@mail.gmail.com>

On Tue, Nov 3, 2009 at 11:06 PM, Kyle Ellrott wrote:
> I've posted a branch on my git (
> http://github.com/kellrott/biopython/tree/FeatureExtract ). The
> Name/Description guess functions need to be finalized. I wrote a unit test
> that extracts CDS feature dna, and then runs translate on the dna and
> compares it to the translation stored in the feature.
> It passes all the
> genbank files in the Test directory except for the ones that have 'N' in the
> DNA sequence (that causes a translation exception) and one_of.gb (it refers
> to sequence outside of the file).
> More test ideas would be appreciated.

There are several things I would have done differently there. Firstly,
and perhaps most importantly, you shouldn't assume the SeqRecord
is DNA. It could be RNA or protein after all. Reuse the parent
SeqRecord's seq's alphabet.

Perhaps you could comment on this other thread about the more
general problem of how to make getting the sequence (i.e. a Seq
object) for a SeqFeature available in Biopython?
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006958.html

Peter

From kellrott at gmail.com Wed Nov 4 03:37:43 2009
From: kellrott at gmail.com (Kyle Ellrott)
Date: Tue, 3 Nov 2009 19:37:43 -0800
Subject: [Biopython-dev] [Biopython] Using SeqLocation to extract subsequence
In-Reply-To: <320fb6e00911031549k5a7e5f7fs1d4c6a6299d20ed6@mail.gmail.com>
References: <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com>
 <320fb6e00911030909p2e0bc70ci6beaef2187f1168e@mail.gmail.com>
 <320fb6e00911030941l5f5748f3ofdfc2144af209a5e@mail.gmail.com>
 <320fb6e00911031349v1303b7c3s2b1aaaa6fc695ced@mail.gmail.com>
 <320fb6e00911031549k5a7e5f7fs1d4c6a6299d20ed6@mail.gmail.com>
Message-ID:

>
> There are several things I would have done differently there. Firstly,
> and perhaps most importantly, you shouldn't assume the SeqRecord
> is DNA. It could be RNA or protein after all. Reuse the parent
> SeqRecord's seq's alphabet
>

It's an open source rule of thumb, if you want something done quickly, post
broken code and someone will fix it to prove they're smarter than you.
;-)

Kyle

From biopython at maubp.freeserve.co.uk Wed Nov 4 11:37:14 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 4 Nov 2009 11:37:14 +0000
Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features
In-Reply-To: <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com>
References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com>
 <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com>
 <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com>
Message-ID: <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com>

On Tue, Nov 3, 2009 at 11:41 PM, Peter wrote:
>
> Certainly there are good reasons for wanting to be able to take
> a SeqFeature and the parent sequence (SeqRecord or Seq)
> and create a SeqRecord (either plain nucleotides or translated
> into protein). e.g. pretty much all non-trivial GenBank to FASTA
> conversions. Offering this as a SeqRecord method might be the
> best approach, option (4).
>
> However, this is I think on top of the more fundamental step
> of just extracting the sequence (without worrying about the
> annotation). Here as noted above, I currently favour adding
> a method to the SeqFeature, option (1). How about as the
> method name get_sequence, extract_sequence or maybe
> just extract?

Done on a github branch, comments welcome:
http://github.com/peterjc/biopython/tree/seqfeature-extract

If that seems the best way to expose this functionality
(i.e. option (1) from my earlier list of suggestions), I can
commit this to the trunk, and we can move on to the
related idea of how to nicely get SeqRecord
objects for SeqFeatures.
Peter

From biopython at maubp.freeserve.co.uk Wed Nov 4 14:22:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 4 Nov 2009 14:22:45 +0000
Subject: [Biopython-dev] Adding SeqRecord objects
Message-ID: <320fb6e00911040622n1d60c6b1n29511f82f4f6b674@mail.gmail.com>

Hi all,

I'd like to add support for adding SeqRecord objects to the trunk,
i.e. cherry-pick this change from my experimental branch:
http://github.com/peterjc/biopython/commit/6fd5675b1c03dc7eb190d84db1fa19ae744559aa

Plus some docstring/doctest examples and unit tests of course, e.g.
http://github.com/peterjc/biopython/commit/a8405a54406226c6726daea743ea59dc544c5bc0

Any comments?

Peter

From peter at maubp.freeserve.co.uk Wed Nov 4 17:40:40 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Wed, 4 Nov 2009 17:40:40 +0000
Subject: [Biopython-dev] [blast-announce] BLAST 2.2.22 now available
In-Reply-To: <320fb6e00910200456mbac8d28ra1385c102b899c9a@mail.gmail.com>
References: <33559E80-E78D-4CCB-8E8C-79C36E89C007@ncbi.nlm.nih.gov>
 <320fb6e00910200456mbac8d28ra1385c102b899c9a@mail.gmail.com>
Message-ID: <320fb6e00911040940h48f88ch87ad9a22d79b4aa3@mail.gmail.com>

On Tue, Oct 20, 2009 at 11:56 AM, Peter wrote:
> Hi all,
>
> The new NCBI BLAST tools are out now, and I'd only just updated
> my desktop to BLAST 2.2.21 this morning!
>
> It looks like the "old style" blastall etc (which are written in C) are
> much the same, but we will need to add Bio.Blast.Applications
> wrappers for the new "BLAST+" tools (written in C++).

The bulk of that work is done in the main repository now. However,
we still need to go through all the tools and confirm all their arguments
are included. This could be partly automated since the BLAST help
output is nicely formatted...
Peter

From kellrott at gmail.com Wed Nov 4 18:25:14 2009
From: kellrott at gmail.com (Kyle Ellrott)
Date: Wed, 4 Nov 2009 10:25:14 -0800
Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features
In-Reply-To: <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com>
References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com>
 <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com>
 <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com>
 <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com>
Message-ID:

I pulled this branch and tested it on the project I originally needed
the code for. It looks like it's working.

Kyle

> Done on a github branch, comments welcome:
> http://github.com/peterjc/biopython/tree/seqfeature-extract
>
> If that seems the best way to expose this functionality
> (i.e. option (1) from my earlier list of suggestions), I can
> commit this to the trunk, and we can move on to the
> related idea of how to nicely get SeqRecord
> objects for SeqFeatures.
>
> Peter
>

From bugzilla-daemon at portal.open-bio.org Wed Nov 4 18:52:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 4 Nov 2009 13:52:18 -0500
Subject: [Biopython-dev] [Bug 2945] New: update_pdb: shutil.move needs to be indented; try block also?
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2945

           Summary: update_pdb: shutil.move needs to be indented; try block also?
           Product: Biopython
           Version: 1.52
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: TallPaulInJax at yahoo.com

As written, shutil.move will only move the last file if there are any. If there
aren't any, it will raise an exception since old_file and new_file are not
initialized. Finally, any failure to move the file will also raise an exception
(I believe), so a try block should be in place:

Existing code:
        # move the obsolete files to a special folder
        for pdb_code in obsolete:
            if self.flat_tree:
                old_file = self.local_pdb + os.sep + 'pdb%s.ent'%(pdb_code)
                new_file = self.obsolete_pdb + os.sep + 'pdb%s.ent'%(pdb_code)
            else:
                old_file = self.local_pdb + os.sep + pdb_code[1:3] + os.sep + 'pdb%s.ent'%(pdb_code)
                new_file = self.obsolete_pdb + os.sep + pdb_code[1:3] + os.sep + 'pdb%s.ent'%(pdb_code)
        shutil.move(old_file, new_file)

shutil.move needs to be indented one column, and potentially a try/catch phrase
added:
        for pdb_code in obsolete:
            if self.flat_tree:
                old_file = self.local_pdb + os.sep + 'pdb%s.ent'%(pdb_code)
                new_file = self.obsolete_pdb + os.sep + 'pdb%s.ent'%(pdb_code)
            else:
                old_file = self.local_pdb + os.sep + pdb_code[1:3] + os.sep + 'pdb%s.ent'%(pdb_code)
                new_file = self.obsolete_pdb + os.sep + pdb_code[1:3] + os.sep + 'pdb%s.ent'%(pdb_code)
            try:
                shutil.move(old_file, new_file)
            except:
                warnings.warn("Unable to move from old file: \n%s\n to new file: \n%s\n"
                    % (old_file, new_file),
                    RuntimeWarning)

From bugzilla-daemon at portal.open-bio.org Wed Nov 4 21:56:00 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 4 Nov 2009 16:56:00 -0500
Subject: [Biopython-dev] [Bug 2945] update_pdb: shutil.move needs to be indented; try block also?
In-Reply-To:
Message-ID: <200911042156.nA4Lu0NU025735@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2945

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-04 16:56 EST -------
Fixed in git, looks like I missed that in fixing Bug 2867. Thanks

I didn't go for a try/except. Have you had this fail in real use?

Peter

From biopython at maubp.freeserve.co.uk Wed Nov 4 22:05:50 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 4 Nov 2009 22:05:50 +0000
Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features
In-Reply-To:
References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com>
 <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com>
 <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com>
 <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com>
Message-ID: <320fb6e00911041405x2a424d61n2bf617a99d8ffef4@mail.gmail.com>

On Wed, Nov 4, 2009 at 6:25 PM, Kyle Ellrott wrote:
> I pulled this branch and tested it on the project I originally needed
> the code for. It looks like it's working.

Cool. What do you think of this interface? Does a method of the
SeqFeature seem natural to you?
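
[Editor's sketch] Option (1) from the thread, a method on the feature that takes the parent sequence and returns the same type, can be illustrated with plain strings. This is a toy under stated assumptions, not the code on the seqfeature-extract branch: `SimpleFeature` is a hypothetical class, the parent is assumed to be an unambiguous DNA `str`, and joins, fuzzy positions and alphabets are ignored.

```python
# Complement table for unambiguous DNA only (an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


class SimpleFeature:
    """Hypothetical stand-in for a feature with a simple start/end location."""

    def __init__(self, start, end, strand=1):
        self.start = start    # zero-based, inclusive
        self.end = end        # zero-based, exclusive (Python slice style)
        self.strand = strand  # +1 for forward, -1 for reverse

    def extract(self, parent_sequence):
        """Return the feature's subsequence, as the same type as the parent."""
        sub = parent_sequence[self.start:self.end]
        if self.strand == -1:
            # Reverse complement for features on the reverse strand.
            sub = sub.translate(_COMPLEMENT)[::-1]
        return sub


parent = "AAACCCGGGTTT"
print(SimpleFeature(3, 9).extract(parent))             # CCCGGG
print(SimpleFeature(0, 4, strand=-1).extract(parent))  # GTTT
```

Because the feature holds only coordinates, the same object can be applied to any parent sequence, which is the flexibility argued for in this thread.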
Peter

From kellrott at gmail.com Wed Nov 4 22:16:30 2009
From: kellrott at gmail.com (Kyle Ellrott)
Date: Wed, 4 Nov 2009 14:16:30 -0800
Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features
In-Reply-To: <320fb6e00911041405x2a424d61n2bf617a99d8ffef4@mail.gmail.com>
References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com>
 <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com>
 <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com>
 <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com>
 <320fb6e00911041405x2a424d61n2bf617a99d8ffef4@mail.gmail.com>
Message-ID:

I guess it's a question of making it a method of the SeqFeature vs a method
of the SeqRecord. My try put the extract feature in the SeqRecord, because
that's what SeqFeature belongs to. But it actually makes more sense to
have the SeqFeature operate on a Seq. If people want to create features (or
copy them from other SeqRecords) and use them to extract subsequences from
other Seq objects, this format makes it more natural/flexible.

Kyle

On Wed, Nov 4, 2009 at 2:05 PM, Peter wrote:
> On Wed, Nov 4, 2009 at 6:25 PM, Kyle Ellrott wrote:
> > I pulled this branch and tested it on the project I originally needed
> > the code for. It looks like it's working.
>
> Cool. What do you think of this interface? Does a method of the
> SeqFeature seem natural to you?
>
> Peter
>

From bugzilla-daemon at portal.open-bio.org Wed Nov 4 22:18:29 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 4 Nov 2009 17:18:29 -0500
Subject: [Biopython-dev] [Bug 2945] update_pdb: shutil.move needs to be indented; try block also?
In-Reply-To:
Message-ID: <200911042218.nA4MITCf026322@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2945

TallPaulInJax at yahoo.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |TallPaulInJax at yahoo.com

------- Comment #2 from TallPaulInJax at yahoo.com 2009-11-04 17:18 EST -------
My thought process on the try block (and similar): if something or someone
outside of my control can affect the state of my program, I am going to try
to catch that exception. So, when dealing with a file system, who knows what
could happen! After all, we are trying to move obsolete files that may have
been downloaded months or years ago... who KNOWS where they are now, or what
they've been named. We are just calculating their SUPPOSED path and filename.
If we had just done a directory walk and found all files matching a regex,
that might be a different matter.

Anyway, that's the logic! :-)

From bugzilla-daemon at portal.open-bio.org Thu Nov 5 04:34:46 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 4 Nov 2009 23:34:46 -0500
Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model.
In-Reply-To:
Message-ID: <200911050434.nA54YkFs004138@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2943

------- Comment #2 from TallPaulInJax at yahoo.com 2009-11-04 23:34 EST -------
Created an attachment (id=1392)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1392&action=view)
MMCIFParser patch using WinMerge.

I'm not sure if this is in a usable format or not. It was generated by
WinMerge, and I had not used the product before tonight.
From bugzilla-daemon at portal.open-bio.org Thu Nov 5 04:42:22 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 4 Nov 2009 23:42:22 -0500
Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model.
In-Reply-To:
Message-ID: <200911050442.nA54gMZe004307@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2943

------- Comment #3 from TallPaulInJax at yahoo.com 2009-11-04 23:42 EST -------
Important point of discussion:

In PDBParser.py as of this date, NO MATTER whether there is an explicit MODEL
record or not in the .ent file, PDBParser forces the first model id to be 0
and then increments the counter. If the authors of the .ent file chose to
explicitly use MODEL records 2,3,5,7 these would be ignored and instead given
model ids of 0,1,2,3 respectively by PDBParser. There is no attribute in the
object model for the true MODEL number. To me, this is a bug, since a user
who thinks they are studying model X is instead studying model Y.

Currently the code for MMCIFParser.py does NOT follow this logic. Model
numbers in the file are faithfully used as the model id in the object
structure. However, should it be deemed that the PDBParser method is a
feature instead of a bug, then the change is easy enough to make.

From bugzilla-daemon at portal.open-bio.org Thu Nov 5 04:54:42 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 4 Nov 2009 23:54:42 -0500
Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model.
In-Reply-To: Message-ID: <200911050454.nA54sgUk004474@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2943 ------- Comment #4 from TallPaulInJax at yahoo.com 2009-11-04 23:54 EST ------- Note: my testing of MMCIFParser.py depends on the changes I have made to MMCIF2Dict.py which requires the new MMCIFlex.py I wrote completely in python with no lex/flex/C support required. Peter, you should have copies of these? To test MMCIFParser.py, download: ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/mmCIF/be/2beg.cif.gz and unzip it. Run MMCIFParser in IDLE, etc. It will prompt you for a filename. Enter the path and filename to the unzipped 2beg file you just downloaded. 2beg has 10 models (1 thru 10), each with 5 chains, and each chain with 26 residues. You should get: Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Found 26 residues. Note on a different protein: 100d.ent identifies the models, chains, and residues differently than it's counterpart 100d.cif!!! And I thought they came from the same database!? 
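The model-numbering question from comment #3 and the per-chain residue
counts above can be illustrated without Bio.PDB. This is only a sketch: the
(model number, chain id, residue number) tuples and the count_residues
helper are made up for illustration and are not taken from the MMCIFParser
patch.

```python
from collections import OrderedDict


def count_residues(atom_rows):
    """Map model number -> chain id -> number of distinct residue numbers."""
    models = OrderedDict()
    for model_num, chain_id, resseq in atom_rows:
        chains = models.setdefault(model_num, OrderedDict())
        chains.setdefault(chain_id, set()).add(resseq)
    return OrderedDict(
        (model_num, OrderedDict((c, len(r)) for c, r in chains.items()))
        for model_num, chains in models.items()
    )


# Made-up rows, one per atom: the file numbers its models 2 and 5.
rows = [(2, "A", 17), (2, "A", 17), (2, "A", 18), (2, "B", 17),
        (5, "A", 17), (5, "A", 18)]
counts = count_residues(rows)

# Keeping the file's own numbering gives model ids 2 and 5 ...
faithful_ids = list(counts)
# ... whereas counting from zero, as described for PDBParser, gives 0 and 1.
sequential_ids = list(range(len(counts)))

for model_num, chains in counts.items():
    for chain_id, n_residues in chains.items():
        print("Model %s chain %s: found %d residues." % (model_num, chain_id, n_residues))
```

With sequential ids, the mapping back to the file's MODEL numbers is lost
unless it is stored separately, which is the concern raised in comment #3.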
From bugzilla-daemon at portal.open-bio.org Thu Nov 5 05:03:09 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Nov 2009 00:03:09 -0500
Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model.
In-Reply-To:
Message-ID: <200911050503.nA5539T2004675@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2943

------- Comment #5 from TallPaulInJax at yahoo.com 2009-11-05 00:03 EST -------
Created an attachment (id=1393)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1393&action=view)
MMCIF2Dict patch for use with new MMCIFlex

I don't know whether MMCIF2Dict.py has bugs when used with the lex/flex/C
version of MMCIFlex. However, these corrections are necessary for testing the
complete package of MMCIFParser and the MMCIF2Dict and MMCIFlex modules
written without need for lex/flex/C. So I am attaching them here!

From bugzilla-daemon at portal.open-bio.org Thu Nov 5 05:05:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Nov 2009 00:05:59 -0500
Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model.
In-Reply-To:
Message-ID: <200911050505.nA555xpM004734@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2943

------- Comment #6 from TallPaulInJax at yahoo.com 2009-11-05 00:05 EST -------
Created an attachment (id=1394)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1394&action=view)
The complete MMCIF2Dict file modified for the new MMCIFlex.py

If you don't want to mess with the diff file, here is the complete
MMCIF2Dict.py file written to work with the python-only MMCIFlex.py file,
which I will also add as an attachment.

From bugzilla-daemon at portal.open-bio.org Thu Nov 5 05:10:19 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Nov 2009 00:10:19 -0500
Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model.
In-Reply-To:
Message-ID: <200911050510.nA55AJxu004822@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2943

------- Comment #7 from TallPaulInJax at yahoo.com 2009-11-05 00:10 EST -------
Created an attachment (id=1395)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1395&action=view)
The new MMCIFlex.py file written solely in python.

And here is the complete MMCIFlex.py file written solely in python, no need
for lex/flex/c/etc. It has been tested on Windows (as have the other files).
This file should be placed in the mmCIF subfolder of PDB.

Note: I don't know if this will be used as a REPLACEMENT for the old MMCIFlex
or not, so I have no idea what its name should be. The same goes for the
modifications of the MMCIF2Dict.py file: it will only work with this version
of MMCIFlex, not the lex/flex/C version!!!
From bugzilla-daemon at portal.open-bio.org Thu Nov 5 11:12:55 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Nov 2009 06:12:55 -0500
Subject: [Biopython-dev] [Bug 2945] update_pdb: shutil.move needs to be indented; try block also?
In-Reply-To:
Message-ID: <200911051112.nA5BCtAr014626@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2945

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |

------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-05 06:12 EST -------
Fair point. Could you try the updated file in the repository?

From biopython at maubp.freeserve.co.uk Thu Nov 5 12:44:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 5 Nov 2009 12:44:24 +0000
Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features
In-Reply-To:
References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com>
 <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com>
 <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com>
 <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com>
 <320fb6e00911041405x2a424d61n2bf617a99d8ffef4@mail.gmail.com>
Message-ID: <320fb6e00911050444m2a9afbcen5e5a548225fef79c@mail.gmail.com>

On Wed, Nov 4, 2009 at 10:16 PM, Kyle Ellrott wrote:
> I guess it's a question of making it a method of the SeqFeature vs a Method
> of the SeqRecord. My try put the extract feature in the SeqRecord, because
> that's what SeqFeature belongs to. But it actually makes more sense to
> have the SeqFeature operate on a Seq. If people want to create features (or
> copy them from other SeqRecords) and use them to extract subsequences from
> other Seq objects this format makes it more natural/flexible.

:)

You are right that the SeqFeature won't always be used with a
SeqRecord (or at least, the parent SeqRecord). There are possible
examples like a GenBank file with no sequence (just CONTIG
information) where the sequence has been loaded from a FASTA
file. Or, perhaps more likely (once Brad's code is merged), a list
of SeqFeature objects loaded from a GFF file, plus a sequence
loaded from a FASTA file.

Does my current choice of "extract" for the name of the proposed
SeqFeature method seem clear? My other suggestions earlier were
get_sequence or extract_sequence.

http://github.com/peterjc/biopython/tree/seqfeature-extract

Peter

From bugzilla-daemon at portal.open-bio.org Thu Nov 5 15:01:14 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Nov 2009 10:01:14 -0500
Subject: [Biopython-dev] [Bug 2945] update_pdb: shutil.move needs to be indented; try block also?
In-Reply-To:
Message-ID: <200911051501.nA5F1EPU023807@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2945

------- Comment #4 from TallPaulInJax at yahoo.com 2009-11-05 10:01 EST -------
I downloaded the updated PDBList.py from github. Unfortunately, testing fails
for several reasons:

1. The code you wrote includes an 'if' statement:

            if os.path.isfile(old_file) :
                try :
                    shutil.move(old_file, new_file)
                except :
                    warnings.warn("Could not move %s to obsolete folder" \
                                  % pdb_code, RuntimeWarning)

If the file does NOT exist, then no warning is issued, as if the file HAD
existed and HAD been moved. This is a bug: there should be no if statement.
Simply try to move the file within a try/except as I had written it, without
the if statement:

            try :
                shutil.move(old_file, new_file)
            except :
                warnings.warn("Could not move %s to obsolete folder" \
                              % pdb_code, RuntimeWarning)

That way, whether the file does not exist or cannot be moved, the warning is
issued. If you would like to trap the warnings separately, you should write:

            if os.path.isfile(old_file) :
                try :
                    shutil.move(old_file, new_file)
                except :
                    warnings.warn("Could not move %s to obsolete folder" \
                                  % pdb_code, RuntimeWarning)
            else:
                warnings.warn("File %s not found to move to obsolete folder" \
                              % pdb_code, RuntimeWarning)

2. At least on Windows, if the subfolders of the obsolete folder do not
exist, a warning will be issued as well: python will create the 'obsolete'
folder but will not create the subfolders under that. This will occur even if
the files DO exist and COULD be moved: a different kind of error. We just
need to create the subfolders, and warn if we can't.

3. Not a bug, but an enhancement: As a user, I'd rather see the whole
new_file name instead of the pdb_code.

To sum up, the below code will implement all those changes. Whether there are
other errors or not I have not checked. But I did check the above
warnings/errors with testing:

        for pdb_code in obsolete:
            if self.flat_tree:
                old_file = self.local_pdb + os.sep + 'pdb%s.ent'%(pdb_code)
                new_file = self.obsolete_pdb + os.sep + 'pdb%s.ent'%(pdb_code)
                new_path = self.obsolete_pdb #<=====================
            else:
                old_file = self.local_pdb + os.sep + pdb_code[1:3] + os.sep + 'pdb%s.ent'%(pdb_code)
                new_file = self.obsolete_pdb + os.sep + pdb_code[1:3] + os.sep + 'pdb%s.ent'%(pdb_code)
                new_path = self.obsolete_pdb + os.sep + pdb_code[1:3]#<=====

            #If the old file doesn't exist, maybe someone else moved it
            #or deleted it already. Should we issue a warning?
            if os.path.isfile(old_file) :
                try :
                    #os.makedirs fails if the folder is already there
                    if not os.path.isdir(new_path) :
                        os.makedirs(new_path) #<================
                    shutil.move(old_file, new_file)
                except :
                    warnings.warn("Could not move %s to obsolete folder" \
                                  % old_file, RuntimeWarning) #<====old_file
            else: #<===========
                warnings.warn("Could not find file %s to move to obsolete folder" \
                              % old_file, RuntimeWarning) #<====old_file

From bugzilla-daemon at portal.open-bio.org Thu Nov 5 16:13:32 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 5 Nov 2009 11:13:32 -0500
Subject: [Biopython-dev] [Bug 2945] update_pdb: shutil.move needs to be indented; try block also?
In-Reply-To:
Message-ID: <200911051613.nA5GDWqk025397@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2945

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |FIXED

------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-05 11:13 EST -------
(In reply to comment #4)
> I downloaded the updated PDBList.py from github. Unfortunately, testing fails
> for several reasons:
> 1. The code you wrote includes an 'if' statement:
>             if os.path.isfile(old_file) :
>                 try :
>                     shutil.move(old_file, new_file)
>                 except :
>                     warnings.warn("Could not move %s to obsolete folder" \
>                                   % pdb_code, RuntimeWarning)
> If the file does NOT exist, then no warning is issued as if the file HAD
> existed and HAD been moved. This is a bug ...

Why is this a bug? If the file does not exist, there is no point
trying to move it. But OK, a message here could be informative.

> 2. At least on Windows, if the subfolders of the obsolete folder do not exist,
> a warning will be issued as well: python will create the 'obsolete' subfolder
> folder but will not create the subfolders under that. This will occur even if
> the files DO exist and COULD be moved: a different kind of error. We just need
> to create the subfolders, and warn if we can't.

That's a separate bug. Fixed now.

> 3. Not a bug, but an enhancement: As a user, I'd rather see the whole new_file
> name instead of the pdb_code.

Fair enough. I felt having a very long message with both the old
and the new paths was excessive, but at least including the old
path would be a useful compromise.

> To sum up, the below code will implement all those changes. Whether there are
> other errors or not I have not checked. But I did check the above
> warnings/errors with testing:

I've updated the repository along similar lines:
http://github.com/biopython/biopython/blob/master/Bio/PDB/PDBList.py

Marking this bug as fixed.

Peter

From kellrott at gmail.com Thu Nov 5 19:42:03 2009
From: kellrott at gmail.com (Kyle Ellrott)
Date: Thu, 5 Nov 2009 11:42:03 -0800
Subject: [Biopython-dev] Building Gene Ontology support into Biopython
In-Reply-To: <128a885f0910182126i74f7712bo2cb6e7d612532e5@mail.gmail.com>
References: <128a885f0910172222n70e44898y8bb7bb2faf5986d1@mail.gmail.com>
 <20091018163436.GA66322@kunkel>
 <128a885f0910182126i74f7712bo2cb6e7d612532e5@mail.gmail.com>
Message-ID:

Any notes on the progress? I had to get some GO information for a project,
so I checked out your git fork. Looks like a skeleton of the project has
been outlined, but no real code yet. I added some code to the oboparser to
get what I needed, only basic term parsing, and no network support.
Can't speak to its performance, but NetworkX has a very simple install
(easy_install on mac, and part of the standard package set on Fedora).

Kyle

On Sun, Oct 18, 2009 at 7:22 AM, Chris Lasher wrote:
> I'm going to go ahead and make the executive decision to use NetworkX.
> I think BioPerl's Ontology framework has both third-party
> dependency-based (Graph.pm) and non-dependency-based solutions. Maybe
> we can figure out something similar, but NetworkX is such an easy
> dependency to satisfy that I'm going with it.
>
> Looks like this is going to be a busy week.
>
> Chris
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From bugzilla-daemon at portal.open-bio.org Fri Nov 6 15:05:07 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 6 Nov 2009 10:05:07 -0500
Subject: [Biopython-dev] [Bug 2947] New: Bio.HMM calculates wrong viterbi path
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2947

           Summary: Bio.HMM calculates wrong viterbi path
           Product: Biopython
           Version: 1.47
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: BioSQL
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: georg.lipps at fhnw.ch

Hi,

I have tested the Bio.HMM with some simple code (see below). However, the
results of the viterbi path calculation are wrong (apparently they depend
upon the order of the state alphabet). I also do not understand the
number/score/probability returned along with the viterbi path.
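As a cross-check on the expected answer, here is a minimal pure-Python
Viterbi sketch that does not use Bio.HMM. It assumes equal start
probabilities for the two coins and takes its transition and emission values
from the script in this report; the function and variable names are
illustrative only. It works with log probabilities, so the score it returns
is the natural log of the probability of the best path.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, log probability of that path)."""
    # best[s] = (log probability of the best path ending in s, that path)
    best = {}
    for s in states:
        log_p = math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        best[s] = (log_p, [s])
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor that maximises the path score into s.
            score, path = max(
                ((best[p][0] + math.log(trans_p[p][s]), best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            new_best[s] = (score + math.log(emit_p[s][obs]), path + [s])
        best = new_best
    score, path = max(best.values(), key=lambda item: item[0])
    return path, score


# "u" = unfair coin, "f" = fair coin; equal start probabilities are an assumption.
start_p = {"u": 0.5, "f": 0.5}
trans_p = {"u": {"u": 0.95, "f": 0.05}, "f": {"u": 0.05, "f": 0.95}}
emit_p = {"u": {"head": 0.25, "tail": 0.75}, "f": {"head": 0.5, "tail": 0.5}}

path_uf, score_uf = viterbi(["tail"] * 2, ["u", "f"], start_p, trans_p, emit_p)
path_fu, score_fu = viterbi(["tail"] * 2, ["f", "u"], start_p, trans_p, emit_p)
path_six, score_six = viterbi(["tail"] * 6, ["u", "f"], start_p, trans_p, emit_p)
```

Under these assumptions the best path for two tails is u, u whichever way
round the states are listed, with probability 0.5 * 0.75 * 0.95 * 0.75.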
Greetings,
Georg

from Bio.HMM import MarkovModel, Trainer

## definition of the alphabets for hidden states and emissions
class coin:
    def __init__(self):
        self.letters = ["u", "f"] # must be a single letter alphabet or raises an error in viterbi

class outcome:
    def __init__(self):
        self.letters = ["head", "tail"]

coin=coin()
outcome=outcome()

## initialize HMM model
build=MarkovModel.MarkovModelBuilder(coin, outcome)
build.allow_all_transitions()
build.set_equal_probabilities()

## build HMM model with test frequencies
build.set_transition_score("u", "f", 0.05)
build.set_transition_score("f", "u", 0.05)
build.set_transition_score("f", "f", 0.95)
build.set_transition_score("u", "u", 0.95)
build.set_emission_score("f", "tail", 0.5)
build.set_emission_score("f", "head", 0.5)
build.set_emission_score("u", "tail", 0.75)
build.set_emission_score("u", "head", 0.25)
model=build.get_markov_model()
print "Emission probabilites:\n", model.emission_prob
print "Transitions probabilities:\n", model.transition_prob, "\n"

observed_emissions=["tail"]*2
viterbi=model.viterbi(observed_emissions, coin)
seq=viterbi[0]
prob=viterbi[1]
print "============= Calculation of the most probable path"
## does not work for very short observations
## calculated path is dependent upon order in states alphabet
print observed_emissions
print seq
print prob, "\n"

OUTPUT:

Emission probabilites:
{('u', 'head'): 0.25, ('f', 'head'): 0.5, ('f', 'tail'): 0.5, ('u', 'tail'): 0.75}
Transitions probabilities:
{('f', 'u'): 0.050000000000000003, ('u', 'f'): 0.050000000000000003, ('u', 'u'): 0.94999999999999996, ('f', 'f'): 0.94999999999999996}

============= Calculation of the most probable path
['tail', 'tail']
ff
4.46028871308

This is certainly not true, since the most probable path would be uu
(unfair/unfair).

When the sequence of observations is longer, e.g. six, the following results
are obtained:

============= Calculation of the most probable path
['tail', 'tail', 'tail', 'tail', 'tail', 'tail']
uuuuuf
13.1325601923

Which is still not true, as the last coin should still be u (unfair).

Interestingly, when the order of the state alphabet is changed, i.e.:

class coin:
    def __init__(self):
        self.letters = ["f", "u"]

the output is correct.

============= Calculation of the most probable path
['tail', 'tail', 'tail', 'tail', 'tail', 'tail']
uuuuuu
6.09287667828

Thus it appears to me that the viterbi algorithm is not robust enough and is
biased towards the last letter of the state alphabet.

From kellrott at gmail.com Fri Nov 6 17:36:16 2009
From: kellrott at gmail.com (Kyle Ellrott)
Date: Fri, 6 Nov 2009 09:36:16 -0800
Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features
In-Reply-To: <320fb6e00911050444m2a9afbcen5e5a548225fef79c@mail.gmail.com>
References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com>
 <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com>
 <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com>
 <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com>
 <320fb6e00911041405x2a424d61n2bf617a99d8ffef4@mail.gmail.com>
 <320fb6e00911050444m2a9afbcen5e5a548225fef79c@mail.gmail.com>
Message-ID:

It's always hard to judge how obvious something is when you already know
it. I don't think there is any sort of name clash. Is there anything else
you would want to 'extract' with/from a Sequence Feature?...
One of the rules I've heard from API designers is that it's best if written
code almost sounds like a proper sentence. ( From
http://www.youtube.com/watch?v=aAb7hSCtvGw if you haven't seen it )
my_feature_sequence = my_feature.extract( my_sequence ) seems to fit that
rule.

Kyle

> Does my current choice of "extract" for the name of the proposed
> SeqFeature method seem clear?
> My other suggestions earlier were
> get_sequence or extract_sequence.
>
> http://github.com/peterjc/biopython/tree/seqfeature-extract
>
> Peter
>

From biopython at maubp.freeserve.co.uk Fri Nov 6 19:00:08 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 6 Nov 2009 19:00:08 +0000
Subject: [Biopython-dev] Seq object ungap method
Message-ID: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com>

Hi all,

Something we discussed last year was adding an ungap method
to the Seq object. See for example this thread:

http://lists.open-bio.org/pipermail/biopython/2008-September/004515.html

I've (finally) taken the time to actually implement this, and have
posted it on a github branch for comment:

http://github.com/peterjc/biopython/tree/ungap

The code includes a selection of examples in the docstring which
double as doctests. You can read this online here:

http://github.com/peterjc/biopython/commit/13de9f793d13d3c9485f8d7cc42a48b99613d931

Peter

[At some point we may want to move some of the assorted private
functions in Bio.Alphabet into (private) methods of the alphabet
objects or something... we'll see.]

From biopython at maubp.freeserve.co.uk Tue Nov 10 16:23:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 10 Nov 2009 16:23:16 +0000
Subject: [Biopython-dev] Getting nucleotide sequence for GenBank features
In-Reply-To:
References: <320fb6e00910280507l268e0c72ufd1848cb9f62b72d@mail.gmail.com>
 <320fb6e00910280550va76ed14xeacd37df9aca720e@mail.gmail.com>
 <320fb6e00911031541ge1306f6rb387e36656f0603f@mail.gmail.com>
 <320fb6e00911040337q5bad8db2r925e39f4bcd375f7@mail.gmail.com>
 <320fb6e00911041405x2a424d61n2bf617a99d8ffef4@mail.gmail.com>
 <320fb6e00911050444m2a9afbcen5e5a548225fef79c@mail.gmail.com>
Message-ID: <320fb6e00911100823n22821fcg1b162d436f235c7f@mail.gmail.com>

Peter wrote:
>> Does my current choice of "extract" for the name of the proposed
>> SeqFeature method seem clear?
>> My other suggestions earlier were
>> get_sequence or extract_sequence.
>>
>> http://github.com/peterjc/biopython/tree/seqfeature-extract
>>
>> Peter

On Fri, Nov 6, 2009 at 5:36 PM, Kyle Ellrott wrote:
> It's always hard to judge how obvious something is when you already know
> it. I don't think there is any sort of name clash. Is there anything else you
> would want to 'extract' with/from a Sequence Feature?...
> One of the rules I've heard from API designers is that it's best if written
> code almost sounds like a proper sentence. ( From
> http://www.youtube.com/watch?v=aAb7hSCtvGw if you haven't seen it )
> my_feature_sequence = my_feature.extract( my_sequence ) seems to fit that
> rule.
>
> Kyle

So I'm happy with the SeqFeature method name ("extract"), and
so is Kyle, and no one else has commented - which is a shame.

I've just merged this into the main branch to encourage more
testing of this, and will work on covering this in the tutorial at
some point.

If anyone gives this a go, please let us know on the list - and
of course report any issues.

Thanks,

Peter

From eric.talevich at gmail.com Wed Nov 11 05:37:16 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 11 Nov 2009 00:37:16 -0500
Subject: [Biopython-dev] Seq object ungap method
In-Reply-To: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com>
References: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com>
Message-ID: <3f6baf360911102137s2534dc23med16712375db4cb7@mail.gmail.com>

On Fri, Nov 6, 2009 at 2:00 PM, Peter wrote:
> Hi all,
>
> Something we discussed last year was adding an ungap method
> to the Seq object. See for example this thread:
>
> http://lists.open-bio.org/pipermail/biopython/2008-September/004515.html
>
> I've (finally) taken the time to actually implement this, and have
> posted it on a github branch for comment:
>
> http://github.com/peterjc/biopython/tree/ungap
>
>

Neat!

Some trivial comments:

1. There's a typo on line 897 in Seq.py: s/stil/still/

2. Each colon character has a space before it in this function. I've never
seen you use that style before. (Other Biopython code doesn't do that.)

3. In the exception messages, and other places in Biopython, the
concatenated string (or compound expression) is contained in parentheses,
but there's also a backslash at the end of each line. I don't think the
backslash is necessary, since the parens already group the multi-line
expression.

Cheers,
Eric

From biopython at maubp.freeserve.co.uk Wed Nov 11 10:32:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 11 Nov 2009 10:32:19 +0000
Subject: [Biopython-dev] Seq object ungap method
In-Reply-To: <3f6baf360911102137s2534dc23med16712375db4cb7@mail.gmail.com>
References: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com>
 <3f6baf360911102137s2534dc23med16712375db4cb7@mail.gmail.com>
Message-ID: <320fb6e00911110232g3d507c7cg34512a4e9f28c55d@mail.gmail.com>

On Wed, Nov 11, 2009 at 5:37 AM, Eric Talevich wrote:
> On Fri, Nov 6, 2009 at 2:00 PM, Peter
> wrote:
>>
>> Hi all,
>>
>> Something we discussed last year was adding an ungap method
>> to the Seq object. See for example this thread:
>>
>> http://lists.open-bio.org/pipermail/biopython/2008-September/004515.html
>>
>> I've (finally) taken the time to actually implement this, and have
>> posted it on a github branch for comment:
>>
>> http://github.com/peterjc/biopython/tree/ungap
>>
>
> Neat!

So worth checking in then?

> Some trivial comments:
> 1. There's a typo on line 897 in Seq.py: s/stil/still/

Thanks.

> 2. Each colon character has a space before it in this function. I've never
> seen you use that style before. (Other Biopython code doesn't do that.)

I think some other bits of Biopython do this, but I agree, we should
be consistent and removing the spaces matches PEP8.

> 3. In the exception messages, and other places in Biopython, the
> concatenated string (or compound expression) is contained in parentheses,
> but there's also a backslash at the end of each line. I don't think the
> backslash is necessary, since the parens already group the multi-line
> expression.

Again, you are right, and it would perhaps be worth tidying up cases
of that. But not critical.

Peter

From biopython at maubp.freeserve.co.uk Wed Nov 11 12:44:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 11 Nov 2009 12:44:57 +0000
Subject: [Biopython-dev] Seq object ungap method
In-Reply-To: <320fb6e00911110232g3d507c7cg34512a4e9f28c55d@mail.gmail.com>
References: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com>
 <3f6baf360911102137s2534dc23med16712375db4cb7@mail.gmail.com>
 <320fb6e00911110232g3d507c7cg34512a4e9f28c55d@mail.gmail.com>
Message-ID: <320fb6e00911110444q4baa4314ye6cb2fad5931a0a1@mail.gmail.com>

On Wed, Nov 11, 2009 at 10:32 AM, Peter wrote:
> On Wed, Nov 11, 2009 at 5:37 AM, Eric Talevich wrote:
>
>> 2. Each colon character has a space before it in this function. I've never
>> seen you use that style before. (Other Biopython code doesn't do that.)
>
> I think some other bits of Biopython do this, but I agree, we should
> be consistent and removing the spaces matches PEP8.

It turns out lots of bits of Biopython did this for functions, if statements
and class definitions - in many cases this is my own fault, as I seem to
have adopted a personal style here at odds with PEP8.

Anyway, I think I have found and fixed all the cases on the trunk now.
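For reference, Eric's third point in miniature. The file name below is made
up for illustration; the snippet only shows that adjacent string literals
inside parentheses are joined and can span lines without a trailing
backslash.

```python
pdb_file = "pdb1abc.ent"  # made-up file name, for illustration only

# The parentheses already group the expression, so no backslash is needed
# at the end of the first line.
message = ("Could not move %s "
           "to obsolete folder" % pdb_file)
```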
Peter From biopython at maubp.freeserve.co.uk Wed Nov 11 14:24:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 11 Nov 2009 14:24:15 +0000 Subject: [Biopython-dev] Adding SeqRecord objects In-Reply-To: <320fb6e00911040622n1d60c6b1n29511f82f4f6b674@mail.gmail.com> References: <320fb6e00911040622n1d60c6b1n29511f82f4f6b674@mail.gmail.com> Message-ID: <320fb6e00911110624h2453c39ah423f64f53fd33873@mail.gmail.com> On Wed, Nov 4, 2009 at 2:22 PM, Peter wrote: > Hi all, > > I'd like to add support for adding SeqRecord objects to the trunk, i.e. > cherry-pick this change from my experimental branch: > > http://github.com/peterjc/biopython/commit/6fd5675b1c03dc7eb190d84db1fa19ae744559aa > > Plus some docstring/doctest examples and unit tests of course, e.g. > http://github.com/peterjc/biopython/commit/a8405a54406226c6726daea743ea59dc544c5bc0 > > Any comments? Checked in - comments and testing still welcome of course. Peter From biopython at maubp.freeserve.co.uk Wed Nov 11 16:31:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 11 Nov 2009 16:31:32 +0000 Subject: [Biopython-dev] Adding SeqRecord objects In-Reply-To: <320fb6e00911110624h2453c39ah423f64f53fd33873@mail.gmail.com> References: <320fb6e00911040622n1d60c6b1n29511f82f4f6b674@mail.gmail.com> <320fb6e00911110624h2453c39ah423f64f53fd33873@mail.gmail.com> Message-ID: <320fb6e00911110831n3984bekef6ec2a646131234@mail.gmail.com> On Wed, Nov 11, 2009 at 2:24 PM, Peter wrote: > > Checked in - comments and testing still welcome of course. > I've also added a new unit test, test_SeqRecord.py, and while working on this found two unreported bugs with SeqRecord slicing for the SeqFeatures. Firstly negative slice indices didn't work properly, and secondly there was a corner case where the slice stop was right at the end of a feature. Fixed. 
Peter From bugzilla-daemon at portal.open-bio.org Thu Nov 12 04:00:51 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 11 Nov 2009 23:00:51 -0500 Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model. In-Reply-To: Message-ID: <200911120400.nAC40puR016271@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2943 ------- Comment #8 from TallPaulInJax at yahoo.com 2009-11-11 23:00 EST ------- Created an attachment (id=1396) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1396&action=view) MMCIFlex.py all in python w/ability to read from .cif.gz file; also bug fix Hi, I don't believe this file has been added to the git repository, so I updated it to include the ability to parse .cif.gz files as well as .cif files. In addition, I was unaware that .cif files can have semi-colon lines like: ; ASDasdasd ads askdjlkasjdlakjsdlasd asd asdkjasdl;kjalskdjlasjdlasjdlkasjdalsdkj ; so I updated the logic accordingly and tested it. As stated before, this MMCIFlex.py has to be matched with the MMCIF2Dict.py I had slightly re-written to work with it. See other attachments. Paul -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Thu Nov 12 04:05:50 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 11 Nov 2009 23:05:50 -0500 Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model.
In-Reply-To: Message-ID: <200911120405.nAC45oR5016526@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2943 TallPaulInJax at yahoo.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1395 is|0 |1 obsolete| | ------- Comment #9 from TallPaulInJax at yahoo.com 2009-11-11 23:05 EST ------- (From update of attachment 1395) See #1396 instead. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Nov 16 03:20:11 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 15 Nov 2009 22:20:11 -0500 Subject: [Biopython-dev] [Bug 2943] MMCIFParser only handling a single model. In-Reply-To: Message-ID: <200911160320.nAG3KBh6021138@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2943 TallPaulInJax at yahoo.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #1396 is|0 |1 obsolete| | ------- Comment #10 from TallPaulInJax at yahoo.com 2009-11-15 22:20 EST ------- Created an attachment (id=1398) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1398&action=view) MMCIFlex.py all in python. Slight bug fix from previous upload. Oops! Forgot to strip ';' out of the first semi-colon line found! :-) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
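The semicolon-delimited text fields discussed in comments #8 and #10 of Bug 2943 could be handled roughly as follows. This is only a sketch and is not the attached MMCIFlex.py: the helper names `open_cif` and `read_semicolon_field` are hypothetical, and it assumes the usual CIF convention that a multi-line value opens on a line starting with ';' and closes at the next line starting with ';'.

```python
import gzip
import io


def open_cif(path):
    """Open a .cif or .cif.gz file in text mode (hypothetical helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_semicolon_field(first_line, handle):
    """Collect a multi-line CIF text field into a single string.

    first_line is the line that opens the field (it starts with ';').
    handle yields the lines that follow it. The leading ';' of the
    opening line is stripped, the step comment #10 says was missed.
    """
    if not first_line.startswith(";"):
        raise ValueError("text field must open with ';' in column 1")
    parts = [first_line[1:].rstrip("\r\n")]
    for line in handle:
        if line.startswith(";"):
            # A ';' in column 1 closes the field; a ';' elsewhere is data.
            return "\n".join(parts).strip()
        parts.append(line.rstrip("\r\n"))
    raise ValueError("unterminated semicolon text field")


# Made-up input: the second data line contains a ';' that is not in column 1.
sample = io.StringIO("ASDasdasd ads\nasd asdkjasdl;kjalskd\n;\n_next.item 1\n")
value = read_semicolon_field(";\n", sample)
```

After the call, the handle is positioned on the line following the closing ';', so a line-at-a-time tokenizer can carry on from there.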
From peter at maubp.freeserve.co.uk Mon Nov 16 15:04:27 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Mon, 16 Nov 2009 15:04:27 +0000 Subject: [Biopython-dev] [blast-announce] BLAST 2.2.22 now available In-Reply-To: <320fb6e00911040940h48f88ch87ad9a22d79b4aa3@mail.gmail.com> References: <33559E80-E78D-4CCB-8E8C-79C36E89C007@ncbi.nlm.nih.gov> <320fb6e00910200456mbac8d28ra1385c102b899c9a@mail.gmail.com> <320fb6e00911040940h48f88ch87ad9a22d79b4aa3@mail.gmail.com> Message-ID: <320fb6e00911160704l2ba12596t7b6bf4d69511e9af@mail.gmail.com> On Wed, Nov 4, 2009 at 5:40 PM, Peter wrote: > On Tue, Oct 20, 2009 at 11:56 AM, Peter wrote: >> Hi all, >> >> The new NCBI BLAST tools are out now, and I'd only just updated >> my desktop to BLAST 2.2.21 this morning! >> >> It looks like the "old style" blastall etc (which are written in C) are >> much the same, but we will need to add Bio.Blast.Applications >> wrappers for the new "BLAST+" tools (written in C++). > > The bulk of that work is done in the main repository now. > > However, we still need to go through all the tools and confirm > all their arguments are included. This could be partly automated > since the BLAST help output is nicely formatted... Done - with a basic unit test which confirms the list of arguments the NCBI tools report via -help matches those we handle via the wrapper. We still need to clarify how the -soft_masking and -use_index options should work (i.e. do they take an argument or not), but otherwise the wrapper code should be fit for testing (and we can update the tutorial).
Peter From bugzilla-daemon at portal.open-bio.org Mon Nov 16 23:14:00 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 16 Nov 2009 18:14:00 -0500 Subject: [Biopython-dev] [Bug 2948] New: _parse_pdb_header_list: bug in TITLE handling Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2948 Summary: _parse_pdb_header_list: bug in TITLE handling Product: Biopython Version: 1.52 Platform: PC OS/Version: Windows XP Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: TallPaulInJax at yahoo.com parse_pdb_header.py _parse_pdb_header_list Hi, 1. If the TITLE in a PDB begins with a number, the parse_pdb_header_list method is stripping the prefixed number from the title, I believe because the regex written did not expect this. So the TITLE line: TITLE 3D STRUCTURE OF ALZHEIMER'S ABETA(1-42) FIBRILS becomes: " D STRUCTURE OF ALZHEIMER'S ABETA(1-42) FIBRILS" 2. ... or it should, but it doesn't. This is because for some reason the title is converted to lower case. So it actually becomes: " d structure of alzheimer's abeta(1-42) fibrils" This is fixed by changing the line of code: name=_chop_end_codes(tail).lower() to: name=_chop_end_codes(tail) I don't have a solution for problem #1. Frankly, I think the (whole, or most all of the) method should be re-written to use positional stripping, ie, line[X:Y].strip(). Paul -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Nov 16 23:19:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 16 Nov 2009 18:19:15 -0500 Subject: [Biopython-dev] [Bug 2949] New: _parse_pdb_header_list: REVDAT is for oldest entry. 
Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2949 Summary: _parse_pdb_header_list: REVDAT is for oldest entry. Product: Biopython Version: 1.52 Platform: PC OS/Version: Windows XP Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: TallPaulInJax at yahoo.com Hi, I don't know if this is considered a bug or a feature, but currently _parse_pdb_header_list in parse_pdb_header.py is grabbing the least recent date when I believe it should be grabbing the MOST current date. Here is the current code: elif key=="REVDAT": rr=re.search("\d\d-\w\w\w-\d\d",tail) if rr!=None: dict['release_date']=_format_date(_nice_case(rr.group())) And here is the fix, with additional REVDAT components added (can't hurt! :-) ): elif key=="REVDAT": #Modified by Paul T. Bathen to get most recent date instead of oldest date. #Also added additional dict entries if dict['release_date'] == "1909-01-08": #set in init rr=re.search("\d\d-\w\w\w-\d\d",tail) if rr!=None: dict['release_date']=_format_date(_nice_case(rr.group())) dict['mod_number'] = hh[7:10].strip() dict['mod_id'] = hh[23:28].strip() dict['mod_type'] = hh[31:32].strip() Paul -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
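The positional parsing suggested in Bug 2948, together with the "most recent REVDAT" behaviour requested in Bug 2949, might look something like the sketch below. It is not the code in parse_pdb_header.py: the function names are hypothetical, the REVDAT sample lines use made-up dates and a made-up ID, and the column positions are assumptions taken from the fixed-column PDB layout (TITLE text in columns 11-80, REVDAT modification number in columns 8-10, date in columns 14-22).

```python
def parse_title(lines):
    """Join TITLE records by column position.

    Slicing by column keeps a leading digit (as in "3D STRUCTURE ...")
    and leaves the case of the text untouched.
    """
    pieces = [line[10:80].strip() for line in lines if line.startswith("TITLE")]
    return " ".join(piece for piece in pieces if piece)


def latest_revdat(lines):
    """Return (mod_number, date_string) for the highest-numbered REVDAT.

    Returns None if there are no REVDAT records. The date is left as the
    raw DD-MMM-YY string from the file.
    """
    best = None
    for line in lines:
        if not line.startswith("REVDAT"):
            continue
        if line[10:12].strip():
            continue  # columns 11-12 are set on continuation lines
        mod_number = int(line[7:10])
        date_string = line[13:22]
        if best is None or mod_number > best[0]:
            best = (mod_number, date_string)
    return best


# Illustrative header lines; the REVDAT dates and the ID are invented.
sample_header = [
    "TITLE     3D STRUCTURE OF ALZHEIMER'S ABETA(1-42) FIBRILS",
    "REVDAT %3d   %s %s" % (3, "24-FEB-09", "1ABC"),
    "REVDAT %3d   %s %s" % (2, "10-JAN-06", "1ABC"),
    "REVDAT %3d   %s %s" % (1, "22-NOV-05", "1ABC"),
]
title = parse_title(sample_header)
newest = latest_revdat(sample_header)
```

Picking the highest modification number, rather than the first or last record seen, avoids relying on the order in which REVDAT records appear in the file.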
From bugzilla-daemon at portal.open-bio.org Tue Nov 17 05:35:45 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 00:35:45 -0500 Subject: [Biopython-dev] [Bug 2950] New: Bio.PDBIO.save writes MODEL records without model id Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2950 Summary: Bio.PDBIO.save writes MODEL records without model id Product: Biopython Version: Not Applicable Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: barry_finzel at yahoo.com The MODEL record format for PDB files has an integer model identifier (e.g., "MODEL 1") not currently written to output. Files read (Bio.PDB.PDBIO.PDBParser.get_structure) and then immediately written back out have MODEL records lacking any ID, even though a model id is stored. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Nov 17 06:06:02 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 01:06:02 -0500 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <200911170606.nAH662Vc032551@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 ------- Comment #1 from TallPaulInJax at yahoo.com 2009-11-17 01:06 EST ------- Hi Barry, FYI: PDBParser also starts the model numbering at model 0 no matter what the true model number is in the PDB file. I am going to file that as a bug as well. Look at line 106 and 122 here: http://github.com/biopython/biopython/blob/master/Bio/PDB/PDBParser.py#L122 Wanted to let you know! 
This would be REALLY problematic if the actual model numbers in the PDB file skip around, ie, MODEL 2, then MODEL 4, etc. I don't know if that's possible, I'm just a comp sci guy! Do you know? Paul -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Nov 17 06:11:32 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 01:11:32 -0500 Subject: [Biopython-dev] [Bug 2951] New: PDBParser assigns model 0 to first model no matter what... Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2951 Summary: PDBParser assigns model 0 to first model no matter what... Product: Biopython Version: 1.52 Platform: PC OS/Version: Windows XP Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: TallPaulInJax at yahoo.com I'm not sure if this is a bug or a feature, but PDBParser assigns the first model it sees as model 0 then increments that. This means someone thinking they are studying model X is actually studying X+1, and that assumes that authors always use sequential model numbers without skips. If authors CAN skip model numbers, ie, MODEL 2, then MODEL 4, then MODEL 5... then in biopython these would be models 0, 1, and 2 in the structure... yuck. If this needs to be maintained for posterity, I would suggest adding another field to capture the TRUE model number if it exists. See lines 106 and 122 here: http://github.com/biopython/biopython/blob/master/Bio/PDB/PDBParser.py#106 Paul -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Nov 17 10:31:52 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 05:31:52 -0500 Subject: [Biopython-dev] [Bug 2951] PDBParser assigns model 0 to first model no matter what... In-Reply-To: Message-ID: <200911171031.nAHAVqHY010394@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2951 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-17 05:31 EST ------- We'd have to get Thomas' opinion (the original author), but I would say from a programming point of view having the model indices follow Python norms is very useful (i.e. start at 0 and increment). Consider looping operations based on the number of models etc. I would therefore prefer to see the "reported" model number given as a separate (independent) field from the existing "model index" (assigned incrementally). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
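The idea of keeping the incremental, 0-based model index while also recording the model number reported in the file could be sketched as below. This is not Bio.PDB code: the function names and the `reported_serial` key are hypothetical, and the sketch assumes the MODEL serial number sits somewhere in columns 6-14 (the PDB format places it right-justified in columns 11-14).

```python
def index_models(lines):
    """Pair each MODEL record's 0-based index with its reported serial.

    The index follows Python norms (0, 1, 2, ...). The serial is whatever
    integer the file reports, or None if the MODEL record carries no number.
    """
    models = []
    for line in lines:
        if not line.startswith("MODEL"):
            continue
        field = line[5:14].strip()
        serial = int(field) if field else None
        models.append({"index": len(models), "reported_serial": serial})
    return models


def format_model_record(serial):
    """Format a MODEL record with the serial right-justified in columns 11-14."""
    return "MODEL     %4d" % serial


# Made-up input with skipped model numbers and one MODEL record lacking an ID.
sample_lines = [
    "MODEL        2",
    "ENDMDL",
    "MODEL        4",
    "ENDMDL",
    "MODEL ",
    "ENDMDL",
]
models = index_models(sample_lines)
record = format_model_record(6)
```

With both values stored, existing code that indexes from 0 keeps working, while a writer can emit the reported number so a read-then-write round trip preserves the original MODEL records.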
From bugzilla-daemon at portal.open-bio.org Tue Nov 17 10:36:57 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 05:36:57 -0500 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <200911171036.nAHAavZ9010514@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- BugsThisDependsOn| |2951 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-17 05:36 EST ------- In order to write out MODEL records with an ID, we probably need to store the ID on parsing - therefore marking this as depending on Bug 2951 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Nov 17 10:37:12 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 05:37:12 -0500 Subject: [Biopython-dev] [Bug 2951] PDBParser assigns model 0 to first model no matter what... In-Reply-To: Message-ID: <200911171037.nAHAbC1v010530@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2951 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- OtherBugsDependingO| |2950 nThis| | -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython at maubp.freeserve.co.uk Tue Nov 17 12:16:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Nov 2009 12:16:38 +0000 Subject: [Biopython-dev] Updating http://biopython.open-bio.org/SRC/biopython/ Message-ID: <320fb6e00911170416t4e58d4bpf9d7525262b4fec6@mail.gmail.com> Dear all, Back when Biopython used CVS, we had an hourly checkout of the code published here: http://www.biopython.org/SRC/biopython/ http://biopython.open-bio.org/SRC/biopython/ I did ask how this was setup (which username etc) but no one got back to me. In any case, this can now be turned off - along with making the Biopython CVS read only (OBF support call #857). I've managed to get this working again from the github repository, although the route isn't ideal (and is all running under my username): (1) At quarter past the hour, a cron job on dev.open-bio.org does a "git pull" to update a local copy of the repository to that on github. This doesn't need any passwords. (2) At 25mins past the hour, a cron job on www.biopython.org does an rsync to get a copy of the repository from dev.open-bio.org (using an SSH key for access). I tried having a single job running on dev.open-bio.org to push the files to www.biopython.org but the host policies seem to block that. It would be simpler to have everything done on www.biopython.org, which would require git to be installed on that machine. This would avoid any SSH security problems. Does that seem like a better idea? 
Thanks, Peter From bartek at rezolwenta.eu.org Tue Nov 17 12:42:24 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 17 Nov 2009 13:42:24 +0100 Subject: [Biopython-dev] Updating http://biopython.open-bio.org/SRC/biopython/ In-Reply-To: <320fb6e00911170416t4e58d4bpf9d7525262b4fec6@mail.gmail.com> References: <320fb6e00911170416t4e58d4bpf9d7525262b4fec6@mail.gmail.com> Message-ID: <8b34ec180911170442v31a4e48do196d7d6ae95b725d@mail.gmail.com> On Tue, Nov 17, 2009 at 1:16 PM, Peter wrote: > Dear all, > > Back when Biopython used CVS, we had an hourly checkout of > the code published here: > http://www.biopython.org/SRC/biopython/ > http://biopython.open-bio.org/SRC/biopython/ > > I did ask how this was setup (which username etc) but no one got > back to me. In any case, this can now be turned off - along with > making the Biopython CVS read only (OBF support call #857). > > I've managed to get this working again from the github repository, > although the route isn't ideal (and is all running under my username): Great! > > (1) At quarter past the hour, a cron job on dev.open-bio.org does > a "git pull" to update a local copy of the repository to that on github. > This doesn't need any passwords. > > (2) At 25mins past the hour, a cron job on www.biopython.org does > an rsync to get a copy of the repository from dev.open-bio.org (using > an SSH key for access). > > I tried having a single job running on dev.open-bio.org to push the > files to www.biopython.org but the host policies seem to block that. that's a pity. In case you haven't thought about it, I would only suggest to use some sort of a lockfile: In case github is very slow (happens every so often to me) it might occur that the second job will start before the first one is finished. Leading potentially to a broken repository for download. 
If the first job would create a lockfile before it starts and would delete it after it's done, the second job could condition the rsync operation on the existence of the file. This way, we could have delays in syncing, rather than a potentially broken repo. Of course, if we move to a simpler setup using only one host, this would not be needed. > > It would be simpler to have everything done on www.biopython.org, > which would require git to be installed on that machine. This would > avoid any SSH security problems. Does that seem like a better idea? > indeed, this would be best, and it only requires someone with root privileges to install a package. Also, it would be cool if the biopython user was somehow unlocked. (Currently no-one seems to have the password...) I don't have an account on the www.biopython.org server, so I can't help much at this point... cheers Bartek From bugzilla-daemon at portal.open-bio.org Tue Nov 17 13:13:17 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 08:13:17 -0500 Subject: [Biopython-dev] [Bug 2951] PDBParser assigns model 0 to first model no matter what... In-Reply-To: Message-ID: <200911171313.nAHDDHvS015696@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2951 ------- Comment #2 from TallPaulInJax at yahoo.com 2009-11-17 08:13 EST ------- "Consider looping operations based on the number of models etc." Wouldn't that just be: for m in struc.get_list() or to count: len(struc.get_list()) "I would therefore prefer to see the "reported" model number given as a separate (independent) field from the existing "model index" (assigned incrementally)." Sounds good to me!!! :-) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From chapmanb at 50mail.com Tue Nov 17 13:12:44 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 17 Nov 2009 08:12:44 -0500 Subject: [Biopython-dev] Updating http://biopython.open-bio.org/SRC/biopython/ In-Reply-To: <320fb6e00911170416t4e58d4bpf9d7525262b4fec6@mail.gmail.com> References: <320fb6e00911170416t4e58d4bpf9d7525262b4fec6@mail.gmail.com> Message-ID: <20091117131244.GD68691@sobchak.mgh.harvard.edu> Hi Peter; Do we need to keep replicating this now that we've got GitHub? That gives us anonymous checkouts of the code, and on demand tarball generation. It seems like we could save a few cycles and worries by eliminating this. Brad > Back when Biopython used CVS, we had an hourly checkout of > the code published here: > http://www.biopython.org/SRC/biopython/ > http://biopython.open-bio.org/SRC/biopython/ > > I did ask how this was setup (which username etc) but no one got > back to me. In any case, this can now be turned off - along with > making the Biopython CVS read only (OBF support call #857). > > I've managed to get this working again from the github repository, > although the route isn't ideal (and is all running under my username): > > (1) At quarter past the hour, a cron job on dev.open-bio.org does > a "git pull" to update a local copy of the repository to that on github. > This doesn't need any passwords. > > (2) At 25mins past the hour, a cron job on www.biopython.org does > an rsync to get a copy of the repository from dev.open-bio.org (using > an SSH key for access). > > I tried having a single job running on dev.open-bio.org to push the > files to www.biopython.org but the host policies seem to block that. > > It would be simpler to have everything done on www.biopython.org, > which would require git to be installed on that machine. This would > avoid any SSH security problems. Does that seem like a better idea? 
> > Thanks, > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From bugzilla-daemon at portal.open-bio.org Tue Nov 17 13:22:47 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 08:22:47 -0500 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <200911171322.nAHDMlvj016137@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 ------- Comment #3 from TallPaulInJax at yahoo.com 2009-11-17 08:22 EST ------- The offending lines of code are here in PDBIO: http://github.com/biopython/biopython/blob/master/Bio/PDB/PDBIO.py#L129 if model_flag: fp.write("MODEL \n") For now, this should just be changed to: if model_flag: fp.write("MODEL %s\n" %model.id) When the 'reported' model number ID is added, I believe model ID would best be replaced with the new field above. Paul -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Tue Nov 17 14:40:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Nov 2009 14:40:11 +0000 Subject: [Biopython-dev] Updating http://biopython.open-bio.org/SRC/biopython/ In-Reply-To: <20091117131244.GD68691@sobchak.mgh.harvard.edu> References: <320fb6e00911170416t4e58d4bpf9d7525262b4fec6@mail.gmail.com> <20091117131244.GD68691@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00911170640q4e39d1f0p9b8375dd47de0ae1@mail.gmail.com> On Tue, Nov 17, 2009 at 1:12 PM, Brad Chapman wrote: > > Hi Peter; > Do we need to keep replicating this now that we've got GitHub? That > gives us anonymous checkouts of the code, and on demand tarball > generation. 
It seems like we could save a few cycles and worries by > eliminating this. > > Brad It isn't essential, no. On the other hand, lots of bits of documentation point there (e.g. for the Biopython license). Peter From biopython at maubp.freeserve.co.uk Tue Nov 17 14:46:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Nov 2009 14:46:44 +0000 Subject: [Biopython-dev] Updating http://biopython.open-bio.org/SRC/biopython/ In-Reply-To: <8b34ec180911170442v31a4e48do196d7d6ae95b725d@mail.gmail.com> References: <320fb6e00911170416t4e58d4bpf9d7525262b4fec6@mail.gmail.com> <8b34ec180911170442v31a4e48do196d7d6ae95b725d@mail.gmail.com> Message-ID: <320fb6e00911170646x280014dfq1f91a3f7bf6b31c1@mail.gmail.com> On Tue, Nov 17, 2009 at 12:42 PM, Bartek Wilczynski wrote: > On Tue, Nov 17, 2009 at 1:16 PM, Peter wrote: >> Dear all, >> >> Back when Biopython used CVS, we had an hourly checkout of >> the code published here: >> http://www.biopython.org/SRC/biopython/ >> http://biopython.open-bio.org/SRC/biopython/ >> >> I did ask how this was setup (which username etc) but no one got >> back to me. In any case, this can now be turned off - along with >> making the Biopython CVS read only (OBF support call #857). >> >> I've managed to get this working again from the github repository, >> although the route isn't ideal (and is all running under my username): > > Great! >> >> (1) At quarter past the hour, a cron job on dev.open-bio.org does >> a "git pull" to update a local copy of the repository to that on github. >> This doesn't need any passwords. >> >> (2) At 25mins past the hour, a cron job on www.biopython.org does >> an rsync to get a copy of the repository from dev.open-bio.org (using >> an SSH key for access). >> >> I tried having a single job running on dev.open-bio.org to push the >> files to www.biopython.org but the host policies seem to block that. > > that's a pity. 
In case you haven't thought about it, I would only > suggest to use some sort of a lockfile: In case github is very slow > (happens every so often to me) it might occur that the second job will > start before the first one is finished. Leading potentially to a > broken repository for download. If the first job would create a > lockfile before it starts and would delete it after it's done, the > second job could condition the rsync operation on the existence of the > file. This way, we could have delaysin syncing , rather than > potentially broken repo. Of course, if we move to a simpler setup > using only one host, this would not be needed. Note we're not (yet) making a clone-able repository available on www.biopython.org - this is just a simple hourly snapshot of the source code (without the .git folder). A lock file might be possible but seems overly complicated. I guess I could increase the time delay if you are worried about github being slow sometimes. >> It would be simpler to have everything done on www.biopython.org, >> which would require git to be installed on that machine. This would >> avoid any SSH security problems. Does that seem like a better idea? > > indeed, this would be best, and it only requires someone with root > privileges to install a package. That question was really aimed at the OBF admin team (hence my CC'ing this to root-l), in case they have any better ideas. > Also, it would be cool if the biopython user was somehow unlocked. > (Currently no-one seems to have the password...) I'm sure an OBF admin can reset it. Again, this is CC'd to root-l. 
Peter From bugzilla-daemon at portal.open-bio.org Tue Nov 17 15:07:55 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 10:07:55 -0500 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <200911171507.nAHF7sjx019463@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 ------- Comment #4 from barry_finzel at yahoo.com 2009-11-17 10:07 EST ------- Structures coming from the PDB should have sequentially-numbered models - usually starting with one, although the reason I was looking at this was that I would like to extract the models into separate files but still retain information regarding their original source. (So a Structure might contain only MODEL 6, for example). I don't imagine ever having multiple models out of order (e.g., MODEL 4, then 2, then 1, then 8, etc.). It would be helpful if the model id on the input record was stored and retrievable. Since code (no doubt) exists with references like struct[0], it might be better to create a new model attribute (model.label?) to store the label as a string, so old indexed references still work. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Nov 17 15:24:51 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 17 Nov 2009 10:24:51 -0500 Subject: [Biopython-dev] [Bug 2950] Bio.PDBIO.save writes MODEL records without model id In-Reply-To: Message-ID: <200911171524.nAHFOpvb019880@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2950 ------- Comment #5 from TallPaulInJax at yahoo.com 2009-11-17 10:24 EST ------- Thanks for the note on whether models can be out of order, skipped, etc!
When the MODEL is written out, I would imagine it should use the model.label or whatever it's named? (I believe it MIGHT be called SerialNo. See here: http://mmcif.pdb.org/dictionaries/pdb-correspondence/pdb2mmcif.html#MODEL ) Paul -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 18 17:19:36 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 18 Nov 2009 12:19:36 -0500 Subject: [Biopython-dev] [Bug 2495] parse element symbols for ATOM/HETATM records (Bio.PDB.PDBParser) In-Reply-To: Message-ID: <200911181719.nAIHJaXD026880@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2495 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-18 12:19 EST ------- Trunk updated along the lines suggested by Hongbo Zhu. (In reply to comment #1) > IO.save should also write these element types on an output PDB file Leaving bug open to deal with the output as well. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Fri Nov 20 16:11:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 20 Nov 2009 16:11:43 +0000 Subject: [Biopython-dev] Seq object join method Message-ID: <320fb6e00911200811xc2dab0w6b293e02761e2a43@mail.gmail.com> Hello all, Some more code to evaluate, again on a branch in github: http://github.com/peterjc/biopython/commit/c7cd0329061f88e3a8eae0979dd17c54a36ab4e5 This adds a join method to the Seq object, basically an alphabet aware version of the Python string join method.
Recall that for strings: sep.join([a,b,c]) == a + sep + b + sep + c This leads to a common idiom for concatenating a list of strings, "".join([a,b,c]) == a + "" + b + "" + c == a + b + c That is fine for strings, but not necessarily for Seq objects since even a zero length sequence has an alphabet. Consider this example: >>> from Bio.Seq import Seq >>> from Bio.Alphabet.IUPAC import unambiguous_dna, ambiguous_dna >>> unamb_dna_seq = Seq("ACGT", unambiguous_dna) >>> ambig_dna_seq = Seq("ACRGT", ambiguous_dna) >>> unamb_dna_seq Seq('ACGT', IUPACUnambiguousDNA()) >>> ambig_dna_seq Seq('ACRGT', IUPACAmbiguousDNA()) If we add the ambiguous and unambiguous IUPAC DNA alphabets, we get the ambiguous IUPAC DNA alphabet: >>> unamb_dna_seq + ambig_dna_seq Seq('ACGTACRGT', IUPACAmbiguousDNA()) However, if the default generic alphabet is included, the result is a generic alphabet: >>> unamb_dna_seq + Seq("") + ambig_dna_seq Seq('ACGTACRGT', Alphabet()) Now consider Seq("").join([unamb_dna_seq, ambig_dna_seq]), should it follow the addition behaviour (giving a default alphabet) or "do the sensible thing" and preserve the IUPAC alphabet? As written, Seq("").join(...) is handled as a special case, and the alphabet of the empty string is ignored. To me this is a case of "practicality beats purity", it is much nicer than being forced to do Seq("", ambiguous_dna).join(...) where the empty sequence is given a suitable alphabet. So, what do people think? Peter From bugzilla-daemon at portal.open-bio.org Fri Nov 20 16:15:57 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 20 Nov 2009 11:15:57 -0500 Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string, even subclass string? 
In-Reply-To: Message-ID: <200911201615.nAKGFvuu018944@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2351 ------- Comment #18 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-20 11:15 EST ------- Possible join method for the Seq object outlined here (with code): http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007012.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From eric.talevich at gmail.com Fri Nov 20 19:28:42 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 20 Nov 2009 14:28:42 -0500 Subject: [Biopython-dev] Seq object join method In-Reply-To: <320fb6e00911200811xc2dab0w6b293e02761e2a43@mail.gmail.com> References: <320fb6e00911200811xc2dab0w6b293e02761e2a43@mail.gmail.com> Message-ID: <3f6baf360911201128u6b933b6asf8f2dfabbc1f275f@mail.gmail.com> On Fri, Nov 20, 2009 at 11:11 AM, Peter wrote: > > Hello all, > > Some more code to evaluate, again on a branch in github: > http://github.com/peterjc/biopython/commit/c7cd0329061f88e3a8eae0979dd17c54a36ab4e5 > > This adds a join method to the Seq object, basically an alphabet > aware version of the Python string join method. Recall that for > strings: > > sep.join([a,b,c]) == a + sep + b + sep + c > > This leads to a common idiom for concatenating a list of strings, > > "".join([a,b,c]) == a + "" + b + "" + c == a + b + c > > [...] > > Now consider Seq("").join([unamb_dna_seq, ambig_dna_seq]), > should it follow the addition behaviour (giving a default alphabet) > or "do the sensible thing" and preserve the IUPAC alphabet? > > As written, Seq("").join(...) is handled as a special case, and > the alphabet of the empty string is ignored. To me this is a > case of "practicality beats purity", it is much nicer than being > forced to do Seq("", ambiguous_dna).join(...) 
where the empty
> sequence is given a suitable alphabet.
>
> So, what do people think?
>
> Peter
>

Thoughts:

1. Why doesn't Alphabet._consensus_alphabet raise a TypeError("Incompatable alphabets") where _check_type_compatibility would fail, at least as an optional argument? Probably because it's a private function. Should it be a public function, with a friendlier interface?

2. This might cause massive compatibility problems now, but would it be better for Seq() to use an "unknown_alphabet" by default instead of "generic"? Then _consensus_alphabet could safely ignore those sequences with unspecified alphabets, and Seq.join wouldn't need that special case.

3. Alternately, how much code would break if _consensus_alphabet simply treated generic_alphabet as an unspecified sequence, and ignored it when calculating the consensus alphabet? This effect could be limited to just Seq.join by dropping the test that the sequence length is 0, but it might be useful to have the same behavior for addition.

Cheers,
Eric

From sbassi at clubdelarazon.org Sat Nov 21 14:31:44 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Sat, 21 Nov 2009 11:31:44 -0300
Subject: [Biopython-dev] Seq object join method
In-Reply-To: <320fb6e00911200811xc2dab0w6b293e02761e2a43@mail.gmail.com>
References: <320fb6e00911200811xc2dab0w6b293e02761e2a43@mail.gmail.com>
Message-ID: <9e2f512b0911210631n429aa3det3ac60412b11e2e70@mail.gmail.com>

On Fri, Nov 20, 2009 at 1:11 PM, Peter wrote:
> Now consider Seq("").join([unamb_dna_seq, ambig_dna_seq]),
> should it follow the addition behaviour (giving a default alphabet)
> or "do the sensible thing" and preserve the IUPAC alphabet?
....
> So, what do people think?

From my perspective, I like consistency, so I think that if you want to preserve the IUPAC alphabet, you should state the alphabet also in the separator sequence.
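The two positions in this thread (special-casing an empty generic separator versus treating the separator like any other operand) can be compared with a small stand-alone sketch. This is only a toy model under stated assumptions: ToySeq, RANK, consensus and the ignore_empty_generic flag are hypothetical names invented for illustration, alphabets are reduced to plain strings, and none of it is the code on the github branch or Biopython's actual API.

```python
# Toy model only: ToySeq, RANK and consensus are hypothetical names,
# not part of Biopython. Alphabets are plain strings ordered from the
# most specific (0) to the least specific (2).
RANK = {"unambiguous_dna": 0, "ambiguous_dna": 1, "generic": 2}


def consensus(alphabets):
    """Return the least specific alphabet in a non-empty list (toy rule)."""
    return max(alphabets, key=RANK.get)


class ToySeq(object):
    """Minimal stand-in for a sequence that carries an alphabet."""

    def __init__(self, data, alphabet="generic"):
        self.data = data
        self.alphabet = alphabet

    def __add__(self, other):
        # Addition always takes the consensus of both operands.
        return ToySeq(self.data + other.data,
                      consensus([self.alphabet, other.alphabet]))

    def join(self, seqs, ignore_empty_generic=True):
        seqs = list(seqs)
        if not seqs:
            return ToySeq("", self.alphabet)
        alphabets = [s.alphabet for s in seqs]
        # The special case under discussion: an empty separator with the
        # generic alphabet contributes nothing to the consensus.
        skip_sep = (ignore_empty_generic and self.data == ""
                    and self.alphabet == "generic")
        # The separator only appears in the result when there are at
        # least two sequences to join.
        if len(seqs) > 1 and not skip_sep:
            alphabets.append(self.alphabet)
        return ToySeq(self.data.join(s.data for s in seqs),
                      consensus(alphabets))


unamb = ToySeq("ACGT", "unambiguous_dna")
ambig = ToySeq("ACRGT", "ambiguous_dna")

special = ToySeq("").join([unamb, ambig])
strict = ToySeq("").join([unamb, ambig], ignore_empty_generic=False)
explicit = ToySeq("", "ambiguous_dna").join([unamb, ambig])
print(special.alphabet, strict.alphabet, explicit.alphabet)
```

In this toy model the special case and the explicitly typed separator give the same alphabet, while the strict rule falls back to the generic one, which is the trade-off being debated above.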
From biopython at maubp.freeserve.co.uk Mon Nov 23 10:34:31 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 10:34:31 +0000 Subject: [Biopython-dev] Seq object join method In-Reply-To: <3f6baf360911201128u6b933b6asf8f2dfabbc1f275f@mail.gmail.com> References: <320fb6e00911200811xc2dab0w6b293e02761e2a43@mail.gmail.com> <3f6baf360911201128u6b933b6asf8f2dfabbc1f275f@mail.gmail.com> Message-ID: <320fb6e00911230234s3360b563ye51c825ae62decde@mail.gmail.com> On Fri, Nov 20, 2009 at 7:28 PM, Eric Talevich wrote: > > Thoughts: > > 1. Why doesn't Alphabet._consensus_alphabet raise a > TypeError("Incompatable alphabets") where _check_type_compatibility > would fail, at least as an optional argument? Probably because it's a > private function. Should it be a public function, with a friendlier > interface? It is a private function, and right now I don't recall my precise thinking. The assorted private functions in Bio.Alphabet were to extract some commonly repeated actions for reuse (e.g. in the alignment code) while preserving backwards compatibility where possible, and fixing bugs as needed. I agree some of these are candidates for being made public, but this is a lower priority for me. I am also not sure if functions are the best way to do some of these tasks - Alphabet methods may be better. > 2. This might cause massive compatibility problems now, but would it > be better for Seq() to use an "unknown_alphabet" by default instead of > "generic"? Then _consensus_alphabet could safely ignore those > sequences with unspecified alphabets, and Seq.join wouldn't need that > special case. The base class generic alphabet *is* the "unknown alphabet". > 3. Alternately, how much code would break if _consensus_alphabet > simply treated generic_alphabet as an unspecified sequence, and > ignored it when calculating the consensus alphabet? 
This effect could > be limited to just Seq.join by dropping the test that the sequence > length is 0, but it might be useful to have the same behavior for > addition. I don't know specifically what would break, but that seems too permissive to me. The Seq("").join(...) seems like a special case to me as it fits the Python "".join(...) idiom for concatenating a list of strings. Peter From biopython at maubp.freeserve.co.uk Mon Nov 23 10:44:14 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 10:44:14 +0000 Subject: [Biopython-dev] Seq object join method In-Reply-To: <9e2f512b0911210631n429aa3det3ac60412b11e2e70@mail.gmail.com> References: <320fb6e00911200811xc2dab0w6b293e02761e2a43@mail.gmail.com> <9e2f512b0911210631n429aa3det3ac60412b11e2e70@mail.gmail.com> Message-ID: <320fb6e00911230244q53a37b93xbf52a33f3a9e14f7@mail.gmail.com> On Sat, Nov 21, 2009 at 2:31 PM, Sebastian Bassi wrote: > On Fri, Nov 20, 2009 at 1:11 PM, Peter wrote: >> Now consider Seq("").join([unamb_dna_seq, ambig_dna_seq]), >> should it follow the addition behaviour (giving a default alphabet) >> or "do the sensible thing" and preserve the IUPAC alphabet? >> .... >> So, what do people think? > > From my perspective, I like consistency, so I think that if you want > to preserve the IUPAC alphabet, you should state the alphabet also in > the separator sequence. If you have a list of Seq objects with an IUPAC alphabet, then yes, you could concatenate them using: result = Seq("",the_known_IUPAC_alphabet).join(the_list_of_seqs) But what if you are writing a stand alone function taking Seq arguments of unknown alphabet? If you want to preserve the alphabet (and I would), you would be forced to do something nasty like this: result = Seq("",the_list_of_seqs[0].alphabet).join(the_list_of_seqs) or simply (as now) avoid using the join method completely, e.g. 
result = the_list_of_seqs[0] for seq in the_list_of_seqs[1:] : result += seq Neither of these have the clarity of: result = Seq("").join(the_list_of_seqs) To me, part of the issue here is that the use of "".join(list_of_strings) in plain Python has always taken a bit of getting used to. It isn't very intuitive - the old join function in the string module was in some ways more natural. Maybe we need to add a Bio.Seq module join function? e.g. def join(words, sep=None) : ... While the Python string module join had the separator defaulting to the empty string, here we can be explicit that by default there is no separator sequence by default, therefore no extra alphabet to worry about. However, while using a join function lets us avoid the separator alphabet issue, it isn't object orientated, and does not match the Python string object very well. Peter From bugzilla-daemon at portal.open-bio.org Mon Nov 23 11:37:46 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 23 Nov 2009 06:37:46 -0500 Subject: [Biopython-dev] [Bug 2597] Enforce alphabet letters in Seq objects In-Reply-To: Message-ID: <200911231137.nANBbkrO032437@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2597 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-23 06:37 EST ------- As recently noted on the mailing list, making the Seq alphabet check strict could be useful for file format validation with Bio.SeqIO or Bio.AlignIO, since the parse and read functions can be given an alphabet. e.g. 
While this would be allowed: from Bio import SeqIO from Bio.Alphabet.IUPAC import extended_protein from Bio.Alphabet import Gapped from StringIO import StringIO fasta_str = "\n\n\n>ID\nABCDEFGH-IPX\n" record = SeqIO.read(StringIO(fasta_str), "fasta", Gapped(extended_protein, "-")) If the Seq object checked the alphabet letters, this would fail due to the minus sign: >>> record = SeqIO.read(StringIO(fasta_str), "fasta", extended_protein) If the user doesn't care about the precise letters, they can use the default generic alphabet, e.g. >>> record = SeqIO.read(StringIO(fasta_str), "fasta") or, to at least specify this is a protein sequence: >>> from Bio.Alphabet import generic_protein >>> record = SeqIO.read(StringIO(fasta_str), "fasta", generic_protein) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Nov 23 14:43:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 14:43:28 +0000 Subject: [Biopython-dev] Merging Bio.SeqIO SFF support? Message-ID: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com> Dear all, Is there anyone on the dev mailing list willing to test the SFF support I've been working on for Bio.SeqIO? The code is here, a branch on github: http://github.com/peterjc/biopython/tree/sff-seqio The important files are: * Bio/SeqIO/SffIO.py * Bio/SeqIO/__init__.py (defining the new format) * Bio/SeqIO/_index.py (indexing SFF files) Plus unit test files: * Tests/run_tests.py (to run the doctests) * Tests/test_SeqIO_QualityIO.py * Tests/test_SeqIO_index.py * Tests/test_SeqIO.py * Tests/Roche/* (for unit tests) Sebastian Bassi had a look last month and his feedback has already helped (e.g. 
with error messages): http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006903.html I have been using this code myself in real work, for example editing the trim points in an SFF file to take into account PCR primer sequences, and filtering SFF reads, checking Roche barcodes etc. Thanks, Peter From biopython at maubp.freeserve.co.uk Mon Nov 23 21:49:31 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 21:49:31 +0000 Subject: [Biopython-dev] PubMed E-Utility 2010 DTD changes In-Reply-To: <7B6F170840CA6C4DA63EE0C8A7BB43EC099627A4@NIHCESMLBX15.nih.gov> References: <7B6F170840CA6C4DA63EE0C8A7BB43EC099627A4@NIHCESMLBX15.nih.gov> Message-ID: <320fb6e00911231349i770a89cdrfd4341b3731d1b1c@mail.gmail.com> Hi all, See below - it look like there are two new DTD files to add to Bio.Entrez Peter ---------- Forwarded message ---------- From: Date: Mon, Nov 23, 2009 at 8:35 PM Subject: [Utilities-announce] PubMed E-Utility 2010 DTD changes To: utilities-announce at ncbi.nlm.nih.gov PubMed E-Utility Users, We anticipate switching to the updated PubMed 2010 DTDs on December 14, 2009. 2010 DTDs are available from the Entrez DTD page: http://eutils.ncbi.nlm.nih.gov/corehtml/query/DTD/ The DTD changes for the 2010 production year, as noted in the Revision Notes section near the top of each DTD, include: NLM MedlineCitationSet DTD used for MEDLINE/PubMed XML data files: 1. ?The CommentsCorrections group of elements was reorganized. Through the 2009 data year, the publications cited in CommentsCorrections were defined as elements. For the 2010 DTD, they are defined as valid values to the RefType attribute, and a new attribute value, Cites, was created. Cites will contain PMIDs and source data for items in the bibliography or list of references at the end of an article that is deposited in PubMed Central (PMC). There is no RefType attribute corresponding to Cites for PMIDs and source data of articles in which a paper is cited. 
In the implementation for 2010, RefType = "Cites" will contain only PMIDs and source data for citations where an actual PMID for the cited article exists in the NLM Data Creation and Maintenance System (DCMS). It is therefore possible for a citation to be present in the article's list of references and yet the PMID is not included in the Cites list because it is not present in the NLM DCMS. Cites will be present in the baseline files; however, the subsequent update frequency of Cites lists is not yet determined. Again, all Cites data for this initial implementation are coming from articles in PMC. 2. NameID element was added to Author and Investigator elements. NameID is a possibly multiply-occurring, optional element permissible within the Author (personal and collective) and Investigator elements. It is intended as a unique identifier associated with the name. The value in the NameID attribute Source designates the organizational authority that established the unique identifier. There is no target date for implementation of this field; it is a placeholder for now. Additional information is available from the Announcements to NLM Data Licensees 2010 DTD and XML Changes; File Distribution Schedule Changes: http://www.nlm.nih.gov/bsd/licensee/announce/2009.html#d09_17 _______________________________________________ Utilities-announce mailing list http://www.ncbi.nlm.nih.gov/mailman/listinfo/utilities-announce From biopython at maubp.freeserve.co.uk Tue Nov 24 11:30:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Nov 2009 11:30:04 +0000 Subject: [Biopython-dev] Changing Seq equality Message-ID: <320fb6e00911240330hb1ee2b6mbe75b1433e0ecdcc@mail.gmail.com> Dear all, One thing about the Seq object that still annoys me, and is rather confusing for novices, is the equality testing. It would be nice to "fix" this, but it turns out to be quite complicated due to the way Python works. 
Brad and I did start talking about this a few months ago at BOSC2009, but ran out of time.

First, a brief aside about hashes (used in dictionaries and sets). In Python immutable objects can be hashed, via the hash function or a custom __hash__ method. An important detail is that if two objects evaluate as equal, they must have the same hash, and vice versa (otherwise dictionaries break and other bad things happen). e.g.

>>> hash(1)
1
>>> hash(1.0)
1
>>> hash("1")
1977051568
>>> "1"==1
False
>>> 1.0==1
True

See also:
http://mail.python.org/pipermail/python-dev/2002-December/031455.html

In Biopython, the Seq object is immutable (read only) and can be used as a dictionary key. However, we don't implement equality or hashes explicitly, and thus get the object default behaviour. This means two Seq objects are only equal if they are the same object in memory. The hash is actually the address in memory:

>>> from Bio.Seq import Seq
>>> s = Seq("ACGT")
>>> id(s)
532624
>>> hash(s)
532624

This means that while a Seq can be used as a dictionary key, the test is for object identity - which is of limited use.

Now, the MutableSeq has an "alphabet aware" equality defined. Because these are mutable objects, they don't have a hash, and cannot be used as dictionary keys. This means there are no hash related restrictions on the equality rules.

Now, what if the Seq object had a similar "alphabet aware" equality? The problem is if we'd like Seq("ACGT") to be equal to Seq("ACGT", generic_dna) then both must have the same hash. Then, if we also want Seq("ACGT") and Seq("ACGT", generic_protein) to be equal, they too must have the same hash. This means Seq("ACGT", generic_dna) and Seq("ACGT",generic_protein) would have the same hash, and therefore must evaluate as equal (!). The natural consequence of this chain of logic is we would then have Seq("ACGT") == Seq("ACGT", generic_dna) == Seq("ACGT",generic_protein) == Seq("ACGT",...).
You reach the same point if we require the string "ACGT" equals Seq("ACGT", some_alphabet) i.e.

Another option would be to base Seq equality and hashing on the sequence string only (ignoring the alphabet). This would at least be a simple rule to remember (and would mean we could implement less than, greater than etc in the same way) but basically means we'd ignore the alphabet.

So, currently in Biopython, we have object identity. We could have string based identity. I've thought about other options but haven't come up with anything that would be self consistent (and could be hashed). If anyone has an alternative idea, please speak up.

I don't know what thought process Jose went through, but he wants to use the same equality test in his code:
http://lists.open-bio.org/pipermail/biopython/2009-November/005861.html

Changing Seq equality like this would make Biopython much nicer to use for basic tasks. For example, my code (and the unit tests) often contains things like if str(seq1)==str(seq2).

If we want to make this change, it is quite a break to backwards compatibility. (It also has the downside that a DNA sequence ACGT and a protein sequence ACGT would evaluate as equal - probably not a big issue in practice but counter intuitive).

One way to handle this would be to start by adding explicit Seq __eq__ methods etc which preserve the current behaviour (i.e. act like id(seq1)==id(seq2) based on object identity) but issue a deprecation warning. Then for a series of releases people would be encouraged to use str(seq1)==str(seq2) or id(seq1)==id(seq2) as appropriate. Then, after this transition period, we would change the __eq__ methods to adopt the new behaviour.

Or, we could have a Bio.Seq module level switch to control the behaviour - initially defaulting to the current system with a deprecation warning?

Peter

P.S. As a related point, we will need to switch the MutableSeq from using __cmp__ to __eq__ etc for future Python compatibility.
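For readers who want to see what the "string only" option would look like in practice, here is a minimal sketch. It assumes a hypothetical stand-in class (ToySeq is not Biopython's Seq, and the alphabet is just an opaque label here); the only point is that __eq__ and __hash__ are both derived from the sequence string, so equal objects always hash alike and the alphabet is ignored.

```python
class ToySeq(object):
    """Hypothetical stand-in for Seq; equality and hash use the string only."""

    def __init__(self, data, alphabet=None):
        self._data = data
        self.alphabet = alphabet  # kept for reference, ignored by == and hash

    def __str__(self):
        return self._data

    def __eq__(self, other):
        # Compare by sequence string, against another ToySeq or a plain str.
        if isinstance(other, (ToySeq, str)):
            return str(self) == str(other)
        return NotImplemented

    def __ne__(self, other):
        result = self.__eq__(other)
        if result is NotImplemented:
            return result
        return not result

    def __hash__(self):
        # Same hash as the underlying string, so objects that compare
        # equal (including plain strings) land in the same dict slot.
        return hash(str(self))


dna = ToySeq("ACGT", "generic_dna")
protein = ToySeq("ACGT", "generic_protein")

counts = {dna: 1}
counts[protein] = counts.get(protein, 0) + 1  # updates the same entry
print(dna == protein, len(counts), counts[dna], "ACGT" in counts)
```

The downside mentioned above shows up directly: the DNA and protein objects compare equal and collapse into one dictionary entry.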
From jblanca at btc.upv.es Tue Nov 24 12:31:04 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 24 Nov 2009 13:31:04 +0100 Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord In-Reply-To: <320fb6e00911240253g10986fcfj311c8a2adc12afd5@mail.gmail.com> References: <200911241132.55922.jblanca@btc.upv.es> <320fb6e00911240253g10986fcfj311c8a2adc12afd5@mail.gmail.com> Message-ID: <200911241331.05031.jblanca@btc.upv.es> > It is a reasonable change, but ONLY if all the subclasses support > the same __init__ method, which isn't true. For example, the > Bio.Seq.UnknownSeq subclasses Seq and uses a different __init__ > method signature. This means any change would at a minimum > have to include lots of fixes to the UnknownSeq In this case what I do is to create a new __init__ for the inherited class, like: class SeqWithQuality(SeqRecord): '''A wrapper around Biopython's SeqRecord that adds a couple of convenience methods''' def __init__(self, seq, id = "", name = "", description = "", dbxrefs = None, features = None, annotations = None, letter_annotations = None, qual = None): SeqRecord.__init__(self, seq, id=id, name=name, description=description, dbxrefs=dbxrefs, features=features, annotations=annotations, letter_annotations=letter_annotations) if qual is not None: self.qual = qual def _set_qual(self, qual): '''It stores the quality in the letter_annotations['phred_quality']''' self.letter_annotations["phred_quality"] = qual def _get_qual(self): '''It gets the quality from letter_annotations['phred_quality']''' return self.letter_annotations["phred_quality"] qual = property(_get_qual, _set_qual) def __add__(self, seq2): '''It returns a new object with both seq and qual joined ''' #per letter annotations new_seq = self.__class__(name = self.name + '+' + seq2.name, id = self.id + '+' + seq2.id, seq = self.seq + seq2.seq) #the letter annotations, including quality for name, annot in self.letter_annotations.items(): if name in seq2.letter_annotations: 
new_seq.letter_annotations[name] = annot + \ seq2.letter_annotations[name] return new_seq -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Tue Nov 24 13:10:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Nov 2009 13:10:15 +0000 Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord In-Reply-To: <200911241331.05031.jblanca@btc.upv.es> References: <200911241132.55922.jblanca@btc.upv.es> <320fb6e00911240253g10986fcfj311c8a2adc12afd5@mail.gmail.com> <200911241331.05031.jblanca@btc.upv.es> Message-ID: <320fb6e00911240510n35d0ab5cvdfff0b8375723864@mail.gmail.com> On Tue, Nov 24, 2009 at 12:31 PM, Jose Blanca wrote: > >> It is a reasonable change, but ONLY if all the subclasses support >> the same __init__ method, which isn't true. For example, the >> Bio.Seq.UnknownSeq subclasses Seq and uses a different __init__ >> method signature. This means any change would at a minimum >> have to include lots of fixes to the UnknownSeq > > In this case what I do is to create a new __init__ for the inherited class, > like: > > class SeqWithQuality(SeqRecord): > ? ?'''A wrapper around Biopython's SeqRecord that adds a couple of > convenience methods''' > ? ?def __init__(self, seq, id = "", name = "", > ? ? ? ? ? ? ? ? description = "", dbxrefs = None, > ? ? ? ? ? ? ? ? features = None, annotations = None, > ? ? ? ? ? ? ? ? letter_annotations = None, qual = None): > ? ? ? ?SeqRecord.__init__(self, seq, id=id, name=name, > ? ? ? ? ? ? ? ? ? ? ? ? ? description=description, dbxrefs=dbxrefs, > ? ? ? ? ? ? ? ? ? ? ? ? ? features=features, annotations=annotations, > ? ? ? ? ? ? ? ? ? ? ? ? ? letter_annotations=letter_annotations) > ? ? ? ?if qual is not None: > ? ? ? ? ? ?self.qual = qual > > ? 
?def _set_qual(self, qual): > ? ? ? ?'''It stores the quality in the letter_annotations['phred_quality']''' > ? ? ? ?self.letter_annotations["phred_quality"] = qual > ? ?def _get_qual(self): > ? ? ? ?'''It gets the quality from letter_annotations['phred_quality']''' > ? ? ? ?return self.letter_annotations["phred_quality"] > ? ?qual = property(_get_qual, _set_qual) I can see how adding a property makes accessing the PHRED qualities much easier. > ? ?def __add__(self, seq2): > ? ? ? ?'''It returns a new object with both seq and qual joined ''' > ? ? ? ?#per letter annotations > ? ? ? ?new_seq = self.__class__(name = self.name + '+' + seq2.name, > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? id = self.id + '+' + seq2.id, > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? seq ?= self.seq + seq2.seq) > ? ? ? ?#the letter annotations, including quality > ? ? ? ?for name, annot in self.letter_annotations.items(): > ? ? ? ? ? ?if name in seq2.letter_annotations: > ? ? ? ? ? ? ? ?new_seq.letter_annotations[name] = annot + \ > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? seq2.letter_annotations[name] > ? ? ? ?return new_seq This bit is much less clear to me - you completely ignore any features. Was it written before I added the __add__ method to the original SeqRecord (expected to be in Biopython 1.53)? Anyway - it looks like your SeqRecord subclass should work fine as it is (partly because the SeqRecord has relatively few methods you may need to subclass). 
Peter From jblanca at btc.upv.es Tue Nov 24 14:58:39 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 24 Nov 2009 15:58:39 +0100 Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord In-Reply-To: <320fb6e00911240510n35d0ab5cvdfff0b8375723864@mail.gmail.com> References: <200911241132.55922.jblanca@btc.upv.es> <200911241331.05031.jblanca@btc.upv.es> <320fb6e00911240510n35d0ab5cvdfff0b8375723864@mail.gmail.com> Message-ID: <200911241558.39357.jblanca@btc.upv.es> What I mean is this: http://github.com/JoseBlanca/biopython/commit/d4c87365f614de2d69d800dc63d0cc25087d96dc I would like to change the Seq() and SeqRecord() for self.__class__ There are places already in Seq in which self.__class__ is used. But this is not the case in all instances. I think that this would be a reasonable change. > > > > ? ?def _set_qual(self, qual): > > ? ? ? ?'''It stores the quality in the > > letter_annotations['phred_quality']''' > > self.letter_annotations["phred_quality"] = qual > > ? ?def _get_qual(self): > > ? ? ? ?'''It gets the quality from letter_annotations['phred_quality']''' > > ? ? ? ?return self.letter_annotations["phred_quality"] > > ? ?qual = property(_get_qual, _set_qual) > > I can see how adding a property makes accessing the PHRED > qualities much easier. Yes, that's just a convenience method. I used to have my own class with this property and now I'm trying to use SeqRecord instead but adding this method to ease the change. > > ? ?def __add__(self, seq2): > > ? ? ? ?'''It returns a new object with both seq and qual joined ''' > > ? ? ? ?#per letter annotations > > ? ? ? ?new_seq = self.__class__(name = self.name + '+' + seq2.name, > > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? id = self.id + '+' + seq2.id, > > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? seq ?= self.seq + seq2.seq) > > ? ? ? ?#the letter annotations, including quality > > ? ? ? ?for name, annot in self.letter_annotations.items(): > > ? ? ? ? ? ?if name in seq2.letter_annotations: > > ? ? ? ? ? ? ? 
new_seq.letter_annotations[name] = annot + \
> >                                         seq2.letter_annotations[name]
> >        return new_seq
>
> This bit is much less clear to me - you completely ignore any
> features. Was it written before I added the __add__ method
> to the original SeqRecord (expected to be in Biopython 1.53)?

Yes, it was added much earlier. I'll remove it as soon as the Biopython SeqRecord has one.

--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk Tue Nov 24 15:52:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Nov 2009 15:52:25 +0000
Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord
In-Reply-To: <200911241558.39357.jblanca@btc.upv.es>
References: <200911241132.55922.jblanca@btc.upv.es> <200911241331.05031.jblanca@btc.upv.es> <320fb6e00911240510n35d0ab5cvdfff0b8375723864@mail.gmail.com> <200911241558.39357.jblanca@btc.upv.es>
Message-ID: <320fb6e00911240752h28683e33l5d5dc938882d2d1a@mail.gmail.com>

On Tue, Nov 24, 2009 at 2:58 PM, Jose Blanca wrote:
> What I mean is this:
> http://github.com/JoseBlanca/biopython/commit/d4c87365f614de2d69d800dc63d0cc25087d96dc
>
> I would like to change the Seq() and SeqRecord() for self.__class__
> There are places already in Seq in which self.__class__ is used.
> But this is not the case in all instances. I think that this would be a
> reasonable change.

I hadn't realised the add methods also used __class__, I thought it was just __repr__ - good point.

Thinking about use-cases, sometimes a subclass will want the methods to return Seq objects, sometimes the same class.

The UnknownSeq too sometimes can return another UnknownSeq, but must often return a Seq object.
The BioSQL DBSeq on the other hand always returns a Seq object for all its methods. The fact that the Seq __add__ and __radd__ use __class__ was the cause of a bug in that adding DBSeq objects didn't work.

A hypothetical CircularSeq could return CircularSeq objects for some cases (e.g. upper, lower, and perhaps transcribe, back_transcribe and reverse complement), Seq objects in other cases (e.g. slicing) while in other cases it may depend on the data (e.g. translation).

Essentially, for a Seq subclass you may need to look at each method in turn and decide which is most appropriate.

So which is the most sensible default behaviour from the Seq object? The cautious "return a Seq" approach will be robust (and makes sense for the existing Biopython subclasses, DBSeq and the UnknownSeq), but makes changing this in the subclass harder (as Jose has found).

What does your Seq subclass aim to do? Add one or two general methods to enhance the Seq object - or model something a little different?

If anyone else has written Seq (or SeqRecord) subclasses, it would be very helpful to hear about them. After all, the change Jose is proposing may break your code ;)

>> >    def __add__(self, seq2):
>> >        '''It returns a new object with both seq and qual joined '''
>> >        #per letter annotations
>> >        new_seq = self.__class__(name = self.name + '+' + seq2.name,
>> >                                 id = self.id + '+' + seq2.id,
>> >                                 seq  = self.seq + seq2.seq)
>> >        #the letter annotations, including quality
>> >        for name, annot in self.letter_annotations.items():
>> >            if name in seq2.letter_annotations:
>> >                new_seq.letter_annotations[name] = annot + \
>> >                                         seq2.letter_annotations[name]
>> >        return new_seq
>>
>> This bit is much less clear to me - you completely ignore any
>> features.
Was it written before I added the __add__ method >> to the original SeqRecord (expected to be in Biopython 1.53)? > > Yes, it was added much earliear. I'll remove it as soon as the > Biopython SeqRecord has one. OK - your code makes more sense now. The Biopython trunk does now have an __add__ method (which I expect to be in Biopython 1.53). Peter From jblanca at btc.upv.es Tue Nov 24 16:06:32 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 24 Nov 2009 17:06:32 +0100 Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord In-Reply-To: <320fb6e00911240752h28683e33l5d5dc938882d2d1a@mail.gmail.com> References: <200911241132.55922.jblanca@btc.upv.es> <200911241558.39357.jblanca@btc.upv.es> <320fb6e00911240752h28683e33l5d5dc938882d2d1a@mail.gmail.com> Message-ID: <200911241706.32940.jblanca@btc.upv.es> > What does your Seq subclass aim to do? Add one or two general > methods to enhance the Seq object - or model something a little > different? Basically I want to use just a Seq and a SeqRecord almost as the Biopython ones. For the SeqRecord I'm adding a qual property, but I can change my code to deal with that. Also I've added an __add__ (that I will remove as soon as the biopython one is stable) and complement method to SeqRecord and __eq__ to Seq. Regards, Jose Blanca > If anyone else has written Seq (or SeqRecord) subclasses, it > would be very helpful to hear about them. After all, the change > Jose is proposing may break your code ;) > > >> > ? ?def __add__(self, seq2): > >> > ? ? ? ?'''It returns a new object with both seq and qual joined ''' > >> > ? ? ? ?#per letter annotations > >> > ? ? ? ?new_seq = self.__class__(name = self.name + '+' + seq2.name, > >> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? id = self.id + '+' + seq2.id, > >> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? seq ?= self.seq + seq2.seq) > >> > ? ? ? ?#the letter annotations, including quality > >> > ? ? ? ?for name, annot in self.letter_annotations.items(): > >> > ? ? ? ? ? 
if name in seq2.letter_annotations: > >> >                new_seq.letter_annotations[name] = annot + \ > >> >                                         seq2.letter_annotations[name] > >> >        return new_seq > >> > >> This bit is much less clear to me - you completely ignore any > >> features. Was it written before I added the __add__ method > >> to the original SeqRecord (expected to be in Biopython 1.53)? > > > > Yes, it was added much earlier. I'll remove it as soon as the > > Biopython SeqRecord has one. > > OK - your code makes more sense now. The Biopython trunk > does now have an __add__ method (which I expect to be in > Biopython 1.53). > > Peter -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Tue Nov 24 16:17:14 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Nov 2009 16:17:14 +0000 Subject: [Biopython-dev] Subclassing Seq and SeqRecord In-Reply-To: <200911241706.32940.jblanca@btc.upv.es> References: <200911241132.55922.jblanca@btc.upv.es> <200911241558.39357.jblanca@btc.upv.es> <320fb6e00911240752h28683e33l5d5dc938882d2d1a@mail.gmail.com> <200911241706.32940.jblanca@btc.upv.es> Message-ID: <320fb6e00911240817y402a445el853c6e51b00d98ab@mail.gmail.com> On Tue, Nov 24, 2009 at 4:06 PM, Jose Blanca wrote: >> What does your Seq subclass aim to do? Add one or two general >> methods to enhance the Seq object - or model something a little >> different? > > Basically I want to use just a Seq and a SeqRecord almost as the Biopython > ones. For the SeqRecord I'm adding a qual property, but I can change my > code to deal with that. OK, I can see the purpose here. > Also I've added an __add__ (that I will remove as soon as the biopython > one is stable) ... OK.
I consider the SeqRecord __add__ to be stable, but await comments. > ... and complement method to SeqRecord ... Interesting. Do you mean complement or reverse_complement? It would have been nice to have had your comments on this earlier thread: http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006850.html http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html > ... and __eq__ to Seq. Seq object equality is a tricky thing, see the other thread: http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007021.html Peter From jblanca at btc.upv.es Tue Nov 24 16:17:55 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 24 Nov 2009 17:17:55 +0100 Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord In-Reply-To: <320fb6e00911240752h28683e33l5d5dc938882d2d1a@mail.gmail.com> References: <200911241132.55922.jblanca@btc.upv.es> <200911241558.39357.jblanca@btc.upv.es> <320fb6e00911240752h28683e33l5d5dc938882d2d1a@mail.gmail.com> Message-ID: <200911241717.55484.jblanca@btc.upv.es> On Tuesday 24 November 2009 16:52:25 Peter wrote: > Thinking about use-cases, sometimes a subclass will want the > methods to return Seq objects, sometimes the same class. > > The UnknownSeq too sometimes can return another UnknownSeq, > but must often return a Seq object. I'm thinking about that and I don't think it's a problem. If the subclass wants to return the parent class it can choose to do so. I'm just proposing to change the behaviour of the parent class. > The BioSQL DBSeq on the other hand always returns a Seq > object for all its methods. The fact that the Seq __add__ and > __radd__ use __class__ was the cause of a bug in that adding > DBSeq objects didn't work. I hadn't realized that problem. Was that a bug of the BioSQL project that could be solved or a design problem related to my proposal? Regards, -- Jose M.
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Tue Nov 24 16:30:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Nov 2009 16:30:21 +0000 Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord In-Reply-To: <200911241717.55484.jblanca@btc.upv.es> References: <200911241132.55922.jblanca@btc.upv.es> <200911241558.39357.jblanca@btc.upv.es> <320fb6e00911240752h28683e33l5d5dc938882d2d1a@mail.gmail.com> <200911241717.55484.jblanca@btc.upv.es> Message-ID: <320fb6e00911240830l68269b80ja685f0a2dce5946f@mail.gmail.com> On Tue, Nov 24, 2009 at 4:17 PM, Jose Blanca wrote: > On Tuesday 24 November 2009 16:52:25 Peter wrote: >> Thinking about use-cases, sometimes a subclass will want the >> methods to return Seq objects, sometimes the same class. >> >> The UnknownSeq too sometimes can return another UnknownSeq, >> but must often return a Seq object. > > I'm thinking about that and I don't think it's a problem. If the subclass > wants to return as the parent class it can chose to do it. I'm just proposing > to change the behaviour of the parent class. Yes - but it means any existing subclasses will need updating (fairly easy for those included with Biopython) which could be a big problem for end user scripts (especially if anyone wants to target old and new versions of Biopython). >> The BioSQL DBSeq on the other hand always returns a Seq >> object for all its methods. The fact that the Seq __add__ and >> __addr__ use __class__ was the cause of a bug in that adding >> DBSeq objects didn't work. > > I haven't realized that problem. Was that a bug of the BioSQL project > that could be solved or a desing problem related to my proposal? 
It was just a bug in Biopython's BioSQL wrappers, fixed by adding explicit __add__ and __radd__ methods to the DBSeq class since it couldn't safely use the default methods of the Seq class. Your proposal would require further similar changes to the DBSeq class to override *all* the Seq returning methods to ensure a Seq object is returned and not attempt to create a DBSeq object with the wrong __init__ arguments. The point is while your proposed change will make some tasks easier (e.g. writing an extended Seq subclass that adds a new method or changes an existing method), it will make other tasks much harder (e.g. the DBSeq class). Peter From lpritc at scri.ac.uk Tue Nov 24 16:26:34 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 24 Nov 2009 16:26:34 +0000 Subject: [Biopython-dev] Changing Seq equality In-Reply-To: <320fb6e00911240330hb1ee2b6mbe75b1433e0ecdcc@mail.gmail.com> Message-ID: Hi, Without wanting to get too philosophical, an issue to consider in this, in addition to the technical problems outlined by Peter, is what do we *mean* when we ask about equality of two sequences? As Peter points out, there is something counterintuitive about the peptide "ACGT" somehow being equal to the nucleotide sequence "ACGT", and that is because we know that the things that these sequences represent are not in reality the same thing. Likewise, two instances of a repeat sequence in a genome are not necessarily the same conceptual item, even though they may have the same nucleotide sequence. Also, two CDS from different sources may have the same conceptual translation, but the identical translations are arguably not the same sequence, and in both these circumstances a test for string equality ignores potentially significant differences between the physical/biological elements they describe. These particular cases would give false positives for equality that could be 'gotchas' for the use in dictionaries that prompted this discussion.
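The dictionary 'gotcha' mentioned above can be sketched with a toy class. This is only an illustration under stated assumptions: `StrEqSeq` and its `kind` attribute are hypothetical stand-ins, not Biopython's Seq or its alphabets.

```python
class StrEqSeq:
    """Hypothetical toy sequence: == and hash() look at the letters only."""

    def __init__(self, data, kind):
        self.data = data
        self.kind = kind  # e.g. "dna" or "protein"; ignored by == and hash()

    def __eq__(self, other):
        if isinstance(other, StrEqSeq):
            return self.data == other.data
        return NotImplemented

    def __hash__(self):
        # Must agree with __eq__, so it can only use the letters.
        return hash(self.data)


dna = StrEqSeq("ACGT", "dna")
peptide = StrEqSeq("ACGT", "protein")

annotations = {dna: "nucleotide tetramer"}
# The peptide compares equal to the existing key, so this overwrites the
# value stored for the DNA entry instead of adding a second entry.
annotations[peptide] = "Ala-Cys-Gly-Thr peptide"

print(len(annotations))
print(annotations[dna])
```

With the default object-identity comparison the two objects would have stayed separate keys, which is the "no false positives" property argued for in this message.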
If we want to test for string equality of two sequences, we can already do that explicitly and simply with str(s1) == str(s2). Making this the default behaviour for a string doesn't always conform to my own expectations of what 'equality' means for two sequences, because my expectation changes depending on the task in hand. An alternative reasonable test for equality might be whether the two sequences represent the same sequence, so Seq("M", generic_protein) == Seq("ATG", generic_dna) might return True if we make some potentially dodgy assumptions about reading frames, and consider that they conceptually represent the same thing. I think that it would be a bad default behaviour, and harder to implement than testing string equality, but equally reasonable depending on what you think 'equality' means. Another, equally reasonable, definition of two sequences being 'equal' is that they share a locus tag or accession. I test on this more frequently than I do on sequence identity, but still think it's a bad idea to make it a default test for sequence equality. Similarly, if two sequences (e.g. mRNA/cDNA) map to the same location on a genome, you might consider them equal. There are several equally reasonable and yet non-universal definitions of 'equality' for sequence comparisons, and we currently have the ability to test simply but explicitly for equality on the basis of any of these as we need to at the time. I would prefer to see this requirement for an explicit string comparison kept, and the test for object equality kept as the default, because this never produces a false positive (and I value specificity over sensitivity as a default ;) ). Cheers, L. On 24/11/2009 11:30, "Peter" wrote: [...] > The problem is if we'd like Seq("ACGT") to be equal to > Seq("ACGT", generic_dna) then both must have the > same hash. Then, if we also want Seq("ACGT") and > Seq("ACGT", generic_protein) to be equal, they too must > have the same hash.
This means Seq("ACGT", generic_dna) > and Seq("ACGT",generic_protein) would have the same > hash, and therefore must evaluate as equal (!). The > natural consequence of this chain of logic is we would > then have Seq("ACGT") == Seq("ACGT", generic_dna) > == Seq("ACGT",generic_protein) == Seq("ACGT",...). > You reach the same point if we require the string > "ACGT" equals Seq("ACGT", some_alphabet) > > i.e. Another option would be to base Seq equality > and hashing on the sequence string only (ignoring > the alphabet). > > This would at least be a simple rule to remember (and > would mean we could implement less than, greater than > etc in the same way) but basically means we'd ignore > the alphabet. [...] > Changing Seq equality like this would make Biopython > much nicer to use for basic tasks. For example, my > code (and the unit tests) often contains things like if > str(seq1)==str(seq2). > > If we want to make this change, it is quite a break to > backwards compatibility. (It also has the downside that > a DNA sequence ACGT and a protein sequence ACGT > would evaluate as equal - probably not a big issue in > practice but counter intuitive). -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
From bugzilla-daemon at portal.open-bio.org Tue Nov 24 16:47:00 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 24 Nov 2009 11:47:00 -0500 Subject: [Biopython-dev] [Bug 2927] Problem parsing PSI-BLAST plain text output with NCBStandalone.PSIBlastParser In-Reply-To: Message-ID: <200911241647.nAOGl0RE007751@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2927 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|blocker |normal ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-24 11:47 EST ------- This isn't a blocker severity level bug. And we still need at least one example PSI-BLAST plain text output to try and fix this...
From jblanca at btc.upv.es Wed Nov 25 08:31:53 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Wed, 25 Nov 2009 09:31:53 +0100 Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord In-Reply-To: <320fb6e00911240830l68269b80ja685f0a2dce5946f@mail.gmail.com> References: <200911241132.55922.jblanca@btc.upv.es> <200911241717.55484.jblanca@btc.upv.es> <320fb6e00911240830l68269b80ja685f0a2dce5946f@mail.gmail.com> Message-ID: <200911250931.54003.jblanca@btc.upv.es> > The point is while your proposed change will make some tasks > easier (e.g. writing an extended Seq subclass that adds a new > method or changes an existing method), it will make other tasks > much harder (e.g. the DBSeq class). That's a fair point. You're right in either case some methods would have to be reimplemented. I don't know if the current situation is the most convenient because the actual implementations have a mixed behaviour. Some methods like the Seq's __add__ use __class__ and some others like __getitem__ use Seq(). Is there a reason for that? Regards, -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From jblanca at btc.upv.es Wed Nov 25 08:45:20 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Wed, 25 Nov 2009 09:45:20 +0100 Subject: [Biopython-dev] Changing Seq equality In-Reply-To: References: Message-ID: <200911250945.20870.jblanca@btc.upv.es> Hi, > Without wanting to get too philosophical, an issue to consider in this, in > addition to the technical problems outlined by Peter, is what do we *mean* > when we ask about equality of two sequences? 
> > As Peter points out, there is something counterintuitive about the peptide > "ACGT" somehow being equal to the nucleotide sequence "ACGT", and that is > because we know that the things that these sequences represent are not in > reality the same thing. > > Likewise, two instances of a repeat sequence in a genome are not > necessarily the same conceptual item, even though they may have the same > nucleotide sequence. Also, two CDS from different sources may have the > same conceptual translation, but the identical translations are arguably > not the same sequence, and in both these circumstances a test for string > equality ignores potentially significant between the physical/biological > elements they describe. These particular cases would give false positives > for equality that could be 'gotchas' for the use in dictionaries that > prompted this discussion. > > If we want to test for string equality of two sequences, we can already do > that explicitly and simply with str(s1) == str(s2). Making this the > default behaviour for a string doesn't always conform to my own > expectations of what 'equality' means for two sequences, because my > expectation changes depending on the task in hand. > > An alternative reasonable test for equality might be whether the two > sequences represent the same sequence, so Seq("M", generic_protein) == > Seq("ATG", generic_dna) might return True if we make some potentially dodgy > assumptions about reading frames, and consider that they conceptually > represent the same thing. I think that it would a bad default behaviour, > and harder to implement than testing string equality, but equally > reasonable depending on what you think 'equality' means. > > Another, equally reasonable, definition of two sequences being 'equal' is > that they share a locus tag or accession. I test on this more frequently > than I do on sequence identity, but still think it's a bad idea to make it > a default test for sequence equality. 
> > Similarly, if two sequences (e.g. mRNA/cDNA) map to the same location on a > genome, you might consider them equal. > > There are several equally reasonable and yet non-universal definitions of > 'equality' for sequence comparisons, and we currently have the ability to > test simply but explicitly for equality on the basis of any of these as we > need to at the time. I would prefer to see this requirement for an > explicit string comparison kept, and the test for object equality kept as > the default, because this never produces a false positive (and I value > specificity over sensitivity as a default ;) ). You're right, there are a lot of corner cases that I hadn't considered. I think of a Seq as an str with an alphabet so I wouldn't mind about some things, like the genome location of the Seq. But anyway, I use the __eq__ method as a convenience to avoid writing str(seq1) == str(seq2). I'm aware that all abstractions leak, but that's not good or bad in itself. The abstraction is a model of the reality, as a model it won't be a perfect representation of the reality, just a convenient model. The abstraction is suited to a particular use, so its behaviour should be tailored to this use. I would implement this behaviour and document its gotchas. Not implementing the behaviour because the abstraction leaks also prevents the most general case that the abstraction is trying to cover. > On 24/11/2009 11:30, "Peter" wrote: > > [...] > > > The problem is if we'd like Seq("ACGT") to be equal to > > Seq("ACGT", generic_dna) then both must have the > > same hash. Then, if we also want Seq("ACGT") and > > Seq("ACGT", generic_protein) to be equal, they too must > > have the same hash. This means Seq("ACGT", generic_dna) > > and Seq("ACGT",generic_protein) would have the same > > hash, and therefore must evaluate as equal (!).
The > > natural consequence of this chain of logic is we would > > then have Seq("ACGT") == Seq("ACGT", generic_dna) > > == Seq("ACGT",generic_protein) == Seq("ACGT",...). > > You reach the same point if we require the string > > "ACGT" equals Seq("ACGT", some_alphabet) Oh! I didn't know that! It's great to learn new python things! I'm being naive here because I just have a shallow understanding of the problem, but here are my two cents. Would it be possible to generate the hashes and the __eq__ taking into account the base alphabet? For instance DNAAlphabet=0, RNAAlphabet=1 and ProteinAlphabet=2. So to check if two sequences are equal we would do something like: 'ACGT1' == 'ACGT2' Regards, -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Wed Nov 25 10:26:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 25 Nov 2009 10:26:34 +0000 Subject: [Biopython-dev] Changing Seq equality In-Reply-To: <200911250945.20870.jblanca@btc.upv.es> References: <200911250945.20870.jblanca@btc.upv.es> Message-ID: <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com> On Wed, Nov 25, 2009 at 8:45 AM, Jose Blanca wrote: >> On 24/11/2009 11:30, "Peter" wrote: >> > The problem is if we'd like Seq("ACGT") to be equal to >> > Seq("ACGT", generic_dna) then both must have the >> > same hash. Then, if we also want Seq("ACGT") and >> > Seq("ACGT", generic_protein) to be equal, they too must >> > have the same hash. This means Seq("ACGT", generic_dna) >> > and Seq("ACGT",generic_protein) would have the same >> > hash, and therefore must evaluate as equal (!). The >> > natural consequence of this chain of logic is we would >> > then have Seq("ACGT") == Seq("ACGT", generic_dna) >> > == Seq("ACGT",generic_protein) == Seq("ACGT",...).
>> > You reach the same point if we require the string >> > "ACGT" equals Seq("ACGT", some_alphabet) > > Oh! I didn't know that! It's great to learn new python things! > I'm being naive here because I just have a shallow understanding > of the problem, but here are my two cents. It took me a while to try and understand this stuff - it's tricky and I'm not 100% sure I have the details perfectly right. > Would it be possible to generate the hashes and the __eq__ taking into account > the base alphabet? For instance DNAAlphabet=0, RNAAlphabet=1 and > ProteinAlphabet=2. So to check if two sequences are equal we would do something like: > 'ACGT1' == 'ACGT2' I'd wondered about that too - if we treated all DNA alphabets (generic, IUPAC ambiguous etc) as one group, all RNA alphabets as another, and all Protein as a third, then within those groups things are fine. But what about all the other alphabets? In particular the generic (base) default alphabet or the generic single letter alphabet? These are very commonly used (e.g. parsing a FASTA file without giving a specific alphabet). i.e. It is only a partial solution that doesn't really work :( Also, there is the issue of comparing a Seq object to a string. It would be very nice to have string "ACGT" == Seq("ACGT", some_alphabet) but that means we would also have to have hash("ACGT") == hash(Seq("ACGT", some_alphabet)), which as noted above would mean Seq comparisons would have to ignore the alphabet.
Which is bad :( Peter From jblanca at btc.upv.es Wed Nov 25 11:20:53 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Wed, 25 Nov 2009 12:20:53 +0100 Subject: [Biopython-dev] Changing Seq equality In-Reply-To: <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com> References: <200911250945.20870.jblanca@btc.upv.es> <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com> Message-ID: <200911251220.53881.jblanca@btc.upv.es> > > would it be possible to generate the hashes and the __eq__ taking into > > account the base alphabet. For instance DNAAlphabet=0, RNAAlphabet=1 and > > ProteinAlphabet=2. So to check if two sequences we would do something > > like: 'ACGT1' == 'ACGT2' > > I'd wondered about that too - if we treated all DNA alphabets (generic, > IUPAC ambiguous etc) as one group, all RNA alphabets as another, and > all Protein as a third, then within those groups things are fine. But what > about all the other alphabets? In particular the generic (base) default > alphabet or the generic single letter alphabet? These are very very > commonly used (e.g. parsing a FASTA file without giving a specific > alphabet). i.e. It is only a partial solution that doesn't really work :( > > Also, there is the issue of comparing a Seq object to a string. It would > be very nice to have string "ACGT" == Seq("ACGT", some_alphabet) > but that means we would also have to have hash("ACGT") === > hash(Seq("ACGT", some_alphabet), which as noted above would > mean Seq comparisons would have to ignore the alphabet. Which > is bad :( That's a tricky issue. I think that the desired behaviour should be defined and after that the implementation should go. One possible solution would be to consider the generic alphabet different than the more specific ones and consider the str as having a generic alphabet. 
It would be something like: GenericAlphabet=0, DNAAlphabet=1, RNAAlphabet=2, ProteinAlphabet=3 if str: alphabet=generic else: alphabet=seq.alphabet return str(seq1) + str(alphabet) == str(seq2) + str(alphabet) -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Wed Nov 25 11:22:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 25 Nov 2009 11:22:05 +0000 Subject: [Biopython-dev] [Biopython] Subclassing Seq and SeqRecord In-Reply-To: <200911250931.54003.jblanca@btc.upv.es> References: <200911241132.55922.jblanca@btc.upv.es> <200911241717.55484.jblanca@btc.upv.es> <320fb6e00911240830l68269b80ja685f0a2dce5946f@mail.gmail.com> <200911250931.54003.jblanca@btc.upv.es> Message-ID: <320fb6e00911250322j3b10a792r7db9df4b4cb269c0@mail.gmail.com> On Wed, Nov 25, 2009 at 8:31 AM, Jose Blanca wrote: > >> The point is while your proposed change will make some tasks >> easier (e.g. writing an extended Seq subclass that adds a new >> method or changes an existing method), it will make other tasks >> much harder (e.g. the DBSeq class). > > That's a fair point. You're right in either case some methods would > have to be reimplemented. That is unavoidable :( > I don't know if the current situation is the most convenient because > the actual implementations have a mixed behaviour. Some methods > like the Seq's __add__ use __class__ and some others like > __getitem__ use Seq(). Is there a reason for that? Historical accident I think. If you want to pursue this change (using __class__ in the methods), you'll also need to update the BioSQL DBSeq and DBSeqRecord subclasses. Comments from other people who have written Seq (or SeqRecord) subclasses would be very valuable here. 
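The trade-off between methods that build their result with __class__ and methods that hard-code the parent class can be sketched as follows. This is a minimal illustration with hypothetical toy classes (`BaseSeq`, `QualSeq`, `LazySeq`); none of them is Biopython's Seq or BioSQL's DBSeq, and the method names are invented for the example.

```python
class BaseSeq:
    """Hypothetical toy parent class (not Biopython's Seq)."""

    def __init__(self, data):
        self.data = data

    def upper_fixed(self):
        # Hard-coded parent class: subclasses always get a BaseSeq back.
        return BaseSeq(self.data.upper())

    def upper_dynamic(self):
        # Uses the caller's own class, so it assumes the subclass
        # keeps the same __init__ signature as the parent.
        return self.__class__(self.data.upper())


class QualSeq(BaseSeq):
    """Subclass that only adds a method; __init__ is unchanged."""

    def gc_count(self):
        return sum(self.data.upper().count(c) for c in "GC")


class LazySeq(BaseSeq):
    """Subclass with a different __init__, loosely like a database-backed sequence."""

    def __init__(self, loader, key):
        self.loader = loader
        self.key = key

    @property
    def data(self):
        # Fetch the letters on demand instead of storing them.
        return self.loader(self.key)


q = QualSeq("acgt")
print(type(q.upper_dynamic()).__name__)  # subclass is preserved
print(type(q.upper_fixed()).__name__)    # falls back to the parent

lazy = LazySeq({"id1": "acgt"}.get, "id1")
print(lazy.upper_fixed().data)           # works: builds a plain BaseSeq
try:
    lazy.upper_dynamic()                 # calls LazySeq("ACGT"), missing 'key'
except TypeError as err:
    print("upper_dynamic failed:", err)
```

The simple extension class benefits from the __class__ style, while the class with a different constructor only works with the hard-coded style unless it overrides every such method.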
Peter From biopython at maubp.freeserve.co.uk Wed Nov 25 11:48:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 25 Nov 2009 11:48:16 +0000 Subject: [Biopython-dev] Changing Seq equality In-Reply-To: <200911251220.53881.jblanca@btc.upv.es> References: <200911250945.20870.jblanca@btc.upv.es> <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com> <200911251220.53881.jblanca@btc.upv.es> Message-ID: <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com> On Wed, Nov 25, 2009 at 11:20 AM, Jose Blanca wrote: > > That's a tricky issue. I think that the desired behaviour should be defined > and after that the implementation should go. > Many desired behaviours are mutually contradictory given the way Python works, and the current Seq/Alphabet objects. One can come up with many possible desired behaviours, but often they are not coherent or not technically possible. > One possible solution would be > to consider the generic alphabet different than the more specific ones and > consider the str as having a generic alphabet. It would be something like: > > GenericAlphabet=0, DNAAlphabet=1, RNAAlphabet=2, ProteinAlphabet=3 > if str: >    alphabet=generic > else: >    alphabet=seq.alphabet > return str(seq1) + str(alphabet) == str(seq2) + str(alphabet) Dividing alphabets into those four groups would imply: "ACG" == Seq("ACG") == Seq("ACG", generic_nucleotide) "ACG" != Seq("ACG", generic_rna) "ACG" != Seq("ACG", generic_dna) "ACG" != Seq("ACG", generic_protein) ... Seq("ACG") != Seq("ACG", generic_protein) This has some non-intuitive behaviour. Also it doesn't take into account a number of corner cases (which could be better handled in the existing Seq objects I admit) - things like secondary structure alphabets (e.g. for proteins: coils, beta sheet, alpha helix) or reduced alphabets (e.g. for proteins using Aliphatic/Aromatic/Charged/Tiny/Diverse, or any of the Murphy (2000) tables). The whole issue is horribly complicated!
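The four-group rule discussed above can be tried out with a toy class. This is a rough sketch, assuming a hypothetical `GroupedSeq` with integer group codes in place of real alphabet objects, and assuming plain strings are treated as generic; it is not code from Biopython or from the proposal itself.

```python
# Hypothetical group codes standing in for alphabet families.
GENERIC, DNA, RNA, PROTEIN = range(4)


class GroupedSeq:
    """Toy sequence compared on (letters, alphabet group)."""

    def __init__(self, data, group=GENERIC):
        self.data = data
        self.group = group

    def __eq__(self, other):
        if isinstance(other, str):
            other = GroupedSeq(other)  # a plain string counts as generic
        if isinstance(other, GroupedSeq):
            return (self.data, self.group) == (other.data, other.group)
        return NotImplemented

    def __hash__(self):
        # A generic sequence equals the plain string, so it has to hash
        # like that string; the other groups mix the group code in.
        if self.group == GENERIC:
            return hash(self.data)
        return hash((self.data, self.group))


print("ACG" == GroupedSeq("ACG"))                         # generic vs str
print("ACG" == GroupedSeq("ACG", DNA))                    # str vs DNA
print(GroupedSeq("ACG") == GroupedSeq("ACG", PROTEIN))    # generic vs protein
print(GroupedSeq("ACG", DNA) == GroupedSeq("ACG", DNA))   # same group
```

Under this rule a sequence parsed without an explicit alphabet never equals the same letters tagged as DNA, which is one of the non-intuitive results listed above.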
Quoting "Zen of Python": * If the implementation is hard to explain, it's a bad idea. * If the implementation is easy to explain, it may be a good idea. Doing anything complex with alphabets may fall into the "hard to explain" category. Using object identity or string identity is at least simple to explain. Thus far we have just two options, and neither is ideal: (a) Object identity, following id(seq1)==id(seq2) as now (b) String identity, following str(seq1)==str(seq2) We could consider a modified version of the string identity approach - make seq1==seq2 act as str(seq1)==str(seq2), but *also* look at the alphabets and if they are incompatible (using the existing rules used in addition etc) raise a Python warning. Right now this seems like quite a tempting idea to explore... Peter From chapmanb at 50mail.com Wed Nov 25 12:53:14 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 25 Nov 2009 07:53:14 -0500 Subject: [Biopython-dev] Changing Seq equality In-Reply-To: <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com> References: <200911250945.20870.jblanca@btc.upv.es> <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com> <200911251220.53881.jblanca@btc.upv.es> <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com> Message-ID: <20091125125314.GA11038@sobchak.mgh.harvard.edu> Hi all; Interesting discussion on the equality issue. > Dividing alphabets into those four groups would imply: > > "ACG" == Seq("ACG") == Seq("ACG", generic_nucleotide) > "ACG" != Seq("ACG", generic_rna) > "ACG" != Seq("ACG", generic_dna) > "ACG" != Seq("ACG", generic_protein) > ... > Seq("ACG") != Seq("ACG", generic_protein) > > This has some non-intuitive behaviour. Also it doesn't take > into account a number of corner cases (which could be better > handled in the existing Seq objects I admit) - things like > secondary structure alphabets (e.g. for proteins: coils, beta > sheet, alpha helix) or reduced alphabets? (e.g. 
for proteins > using Aliphatic/Aromatic/Charged/Tiny/Diverse, or any of > the Murphy (2000) tables). Instead of considering the most horrible edge cases, we should think about the most common use cases and make those easy. Alphabets are a bit overcomplicated and in practice are probably not being used to represent these other potential alphabets. I may be simple minded in my programming, but have never seen the benefit of directly encoding anything more complicated than DNA, RNA or proteins. The 3 things I've used alphabets for are: - Is it DNA, RNA or protein? - Does a sequence match the alphabet? Checking input files. - Being careful not to add DNA and protein. In practice, I don't really do this very often. > We could consider a modified version of the string identity > approach - make seq1==seq2 act as str(seq1)==str(seq2), > but *also* look at the alphabets and if they are incompatible > (using the existing rules used in addition etc) raise a Python > warning. Right now this seems like quite a tempting idea to > explore... I like this with Jose's cases for the standard DNA, RNA, protein and generic alphabets. So provide sequence + alphabet checking for all of the common cases, and a warning plus just sequence checking for the edge cases. So if you try and compare a DNA sequence and your secondary structure alphabet, you will get a mismatch on the sequences and a warning about incompatible alphabets.
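The "compare as strings, but warn on incompatible alphabets" idea quoted above could look roughly like this. It is a minimal sketch: `WarnSeq`, its `kind` labels and the compatibility rule (generic is compatible with everything, otherwise the kinds must match) are assumptions made for the example, not the rules Biopython uses for addition.

```python
import warnings


class WarnSeq:
    """Toy sequence: equality follows the letters, alphabets only trigger warnings."""

    def __init__(self, data, kind="generic"):
        self.data = data
        self.kind = kind  # hypothetical label: "generic", "dna", "rna", "protein"

    def _compatible(self, other):
        # Assumed rule for this sketch only.
        return "generic" in (self.kind, other.kind) or self.kind == other.kind

    def __eq__(self, other):
        if isinstance(other, WarnSeq):
            if not self._compatible(other):
                warnings.warn(
                    "Incompatible alphabets %s and %s" % (self.kind, other.kind),
                    stacklevel=2,
                )
            return self.data == other.data
        if isinstance(other, str):
            return self.data == other
        return NotImplemented

    def __hash__(self):
        # Consistent with __eq__: based on the letters only.
        return hash(self.data)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    same_letters = WarnSeq("ACGT", "dna") == WarnSeq("ACGT", "protein")
    quiet = WarnSeq("ACGT", "dna") == WarnSeq("ACGT")

print(same_letters, quiet, len(caught))
```

The comparison still returns True for the DNA and protein objects, so the warning is the only signal that the alphabets disagree.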
Brad From biopython at maubp.freeserve.co.uk Wed Nov 25 13:15:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 25 Nov 2009 13:15:25 +0000 Subject: [Biopython-dev] Changing Seq equality In-Reply-To: <20091125125314.GA11038@sobchak.mgh.harvard.edu> References: <200911250945.20870.jblanca@btc.upv.es> <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com> <200911251220.53881.jblanca@btc.upv.es> <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com> <20091125125314.GA11038@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00911250515w68d808bdrd463d7834ef14985@mail.gmail.com> On Wed, Nov 25, 2009 at 12:53 PM, Brad Chapman wrote: > Hi all; > Interesting discussion on the equality issue. > >> Dividing alphabets into those four groups would imply: >> >> "ACG" == Seq("ACG") == Seq("ACG", generic_nucleotide) >> "ACG" != Seq("ACG", generic_rna) >> "ACG" != Seq("ACG", generic_dna) >> "ACG" != Seq("ACG", generic_protein) >> ... >> Seq("ACG") != Seq("ACG", generic_protein) >> >> This has some non-intuitive behaviour. Also it doesn't take >> into account a number of corner cases (which could be better >> handled in the existing Seq objects I admit) - things like >> secondary structure alphabets (e.g. for proteins: coils, beta >> sheet, alpha helix) or reduced alphabets? (e.g. for proteins >> using Aliphatic/Aromatic/Charged/Tiny/Diverse, or any of >> the Murphy (2000) tables). > > Instead of considering the most horrible edge cases, we should think > about the most common use cases and make those easy. Alphabets are a > bit overcomplicated and in practice are probably not being used to > represent these other potential alphabets. I may be simple minded in > my programming, but have never seen the benefit of directly encoding > anything more complicated that DNA, RNA or proteins. The 3 things > I've used alphabets for are: > > - Is it DNA, RNA or protein? > - Does a sequence match the alphabet? Checking input files. 
> - Being careful not to add DNA and protein. In practice, I don't
>   really do this very often.

Me too - but fixing Bug 2597 would really help (either an exception
or a warning would be a big improvement).

>> We could consider a modified version of the string identity
>> approach - make seq1==seq2 act as str(seq1)==str(seq2),
>> but *also* look at the alphabets and if they are incompatible
>> (using the existing rules used in addition etc) raise a Python
>> warning. Right now this seems like quite a tempting idea to
>> explore...
>
> I like this with Jose's cases for the standard DNA, RNA, protein and
> generic alphabets. So provide sequence + alphabet checking for
> all of the common cases, and a warning plus just sequence checking
> for the edge cases. So if you try and compare a DNA sequence and
> your secondary structure alphabet, you will get a mismatch on the
> sequences and a warning about incompatible alphabets.

You seem to be suggesting some hybrid plan here Brad - I don't
quite follow you. Could you clarify (e.g. with some examples)?

In the meantime, I'll work on a patch to do my suggestion of
hashing and comparison based on string comparison, but with
alphabet-aware warnings.

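For anyone following along, here is a rough sketch of what "string
comparison with alphabet-aware warnings" could look like. It is not the
patch itself: the classes below are small stand-ins rather than the real
Bio.Seq / Bio.Alphabet code, and the compatible() rule is an assumption
made up for the illustration.

```python
import warnings


class Alphabet(object):
    """Stand-in for a generic alphabet (hypothetical, not Bio.Alphabet)."""


class DNAAlphabet(Alphabet):
    pass


class ProteinAlphabet(Alphabet):
    pass


def compatible(a, b):
    """Assumed rule: generic matches anything, otherwise the two
    alphabets must be related by subclassing."""
    if type(a) is Alphabet or type(b) is Alphabet:
        return True
    return isinstance(a, type(b)) or isinstance(b, type(a))


class Seq(object):
    """Stand-in Seq: equality and hashing follow the string."""

    def __init__(self, data, alphabet=None):
        self._data = data
        self.alphabet = alphabet if alphabet is not None else Alphabet()

    def __str__(self):
        return self._data

    def __hash__(self):
        # Same hash as the plain string, so both can find the same dict key.
        return hash(str(self))

    def __eq__(self, other):
        other_alpha = getattr(other, "alphabet", None)
        if other_alpha is not None and not compatible(self.alphabet, other_alpha):
            # Warn, but still fall through to the string comparison.
            warnings.warn("Incompatible alphabets %s and %s"
                          % (type(self.alphabet).__name__,
                             type(other_alpha).__name__))
        return str(self) == str(other)

    def __ne__(self, other):
        return not self == other
```

With this, Seq("ACG") == "ACG" is simply True, while comparing a DNA
Seq with a protein Seq of the same letters is also True but triggers a
warning.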
Peter From biopython at maubp.freeserve.co.uk Wed Nov 25 14:15:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 25 Nov 2009 14:15:41 +0000 Subject: [Biopython-dev] Changing Seq equality In-Reply-To: <320fb6e00911250515w68d808bdrd463d7834ef14985@mail.gmail.com> References: <200911250945.20870.jblanca@btc.upv.es> <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com> <200911251220.53881.jblanca@btc.upv.es> <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com> <20091125125314.GA11038@sobchak.mgh.harvard.edu> <320fb6e00911250515w68d808bdrd463d7834ef14985@mail.gmail.com> Message-ID: <320fb6e00911250615v47047e65rdb257ea73cdc940b@mail.gmail.com> On Wed, Nov 25, 2009 at 1:15 PM, Peter wrote: > > In the mean time, I'll work on a patch to do my suggestion of > hashing and comparison based on string comparison, but with > alphabet aware warnings. > Branch: http://github.com/peterjc/biopython/tree/seq-comparisons Commit: http://github.com/peterjc/biopython/commit/e7859d47a4a1b873b307d5c2db622d335957a6ed You'll see some basic examples at the top of Bio/Seq.py as module level docstring doctests, including dictionary and set demonstrations. As I hope this demonstrates, even this simple rule (Seq comparison follows strings, but with incompatible alphabets giving warnings) leads to some "odd" results - but that is just the way Python works (see the int/float examples in the doctests using dicts and sets). Note that we may want to do something (on the trunk) about warnings in doctests (e.g. force them to print to stdout so they can be included in doctests explicitly). Other than that, all the other unit tests seem fine (including the BioSQL tests which is important and they use Seq object subclasses). 
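The int/float oddity is plain Python and can be reproduced without
Biopython. This small snippet is only an illustration and is not copied
from the doctests on the branch:

```python
# 1 and 1.0 have different types but compare equal and hash equal,
# so dicts and sets treat them as the same key.
same_value = (1 == 1.0)
same_hash = (hash(1) == hash(1.0))

d = {1: "int"}
d[1.0] = "float"          # replaces the value, keeps the original key object
first_key = list(d.keys())[0]

s = set([1, 1.0])         # collapses to a single element

print(d)                  # {1: 'float'}
print(len(s))             # 1
```

A Seq that hashes and compares like its string would show the same
kind of behaviour when mixed with plain strings as dictionary keys.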
Peter From bugzilla-daemon at portal.open-bio.org Wed Nov 25 17:18:15 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 25 Nov 2009 12:18:15 -0500 Subject: [Biopython-dev] [Bug 2954] New: xbbtools and SeqGui still using Bio.Translate and Bio.Transcribe Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2954 Summary: xbbtools and SeqGui still using Bio.Translate and Bio.Transcribe Product: Biopython Version: 1.51 Platform: PC OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Other AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk Scripts/SeqGui/SeqGui.py is still using Bio.Translate and Bio.Transcribe which were deprecated in Biopython 1.51. Using Bio.Seq instead should be trivial, except for back-translation (which could just be removed from the SeqGui tool). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Nov 25 17:19:14 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 25 Nov 2009 12:19:14 -0500 Subject: [Biopython-dev] [Bug 2954] xbbtools and SeqGui still using Bio.Translate and Bio.Transcribe In-Reply-To: Message-ID: <200911251719.nAPHJEsk010282@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2954 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-25 12:19 EST ------- Scripts/xbbtools/xbb_widget.py is also still using Bio.Translate -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From eric.talevich at gmail.com Thu Nov 26 07:14:08 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 26 Nov 2009 02:14:08 -0500
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
References: <200911250945.20870.jblanca@btc.upv.es>
    <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
    <200911251220.53881.jblanca@btc.upv.es>
    <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
Message-ID: <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>

On Wed, Nov 25, 2009 at 6:48 AM, Peter wrote:
> On Wed, Nov 25, 2009 at 11:20 AM, Jose Blanca wrote:
>>
>> That's a tricky issue. I think that the desired behaviour should be defined
>> and after that the implementation should go.
>>
>
> Many desired behaviours are mutually contradictory given the way
> Python works, and the current Seq/Alphabet objects. One can come
> up many possible desired behaviours, but often they are not coherent
> or not technically possible.
>
>> One possible solution would be
>> to consider the generic alphabet different than the more specific ones and
>> consider the str as having a generic alphabet. It would be something like:
>>
>> GenericAlphabet=0, DNAAlphabet=1, RNAAlphabet=2, ProteinAlphabet=3
>> if str:
>>     alphabet=generic
>> else:
>>     alphabet=seq.alphabet
>> return str(seq1) + str(alphabet) == str(seq2) + str(alphabet)
>
> [...]
>
> The whole issue is horribly complicated! Quoting "Zen of Python":
>
> * If the implementation is hard to explain, it's a bad idea.
> * If the implementation is easy to explain, it may be a good idea.
>
> Doing anything complex with alphabets may fall into the "hard
> to explain" category. Using object identity or string identity is
> at least simple to explain.
>
> Thus far we have just two options, and neither is ideal:
> (a) Object identity, following id(seq1)==id(seq2) as now
> (b) String identity, following str(seq1)==str(seq2)

How about (c), string and generic alphabet identity, where
Seq.__hash__ uses the sequence string and some simplification of the
alphabet types like Jose described. Premise: the sequence string and
alphabet are the only arguments the Seq constructor takes, so if two
objects can both be recreated from the same arguments, they should be
equal as far as sets and dictionaries are concerned. To fall back on
string identity, it's easy enough to map str onto a collection of Seq
objects.

def __hash__(self):
    """Same string, same alphabet --> same hash."""
    # If alphabet is a standard type, match the generic alphabet types
    if self.alphabet == generic_nucleotide:
        # hash() takes a single argument, so pass one tuple
        return hash((str(self), Alphabet))
        #OR, to match raw strings: return hash(str(self))
    elif isinstance(self.alphabet, DNAAlphabet):
        return hash((str(self), DNAAlphabet))
    elif isinstance(self.alphabet, RNAAlphabet):
        return hash((str(self), RNAAlphabet))
    elif isinstance(self.alphabet, ProteinAlphabet):
        return hash((str(self), ProteinAlphabet))
    # Other alphabets, maybe user-defined --> require exactly the same type
    else:
        return hash((str(self), self.alphabet.__class__))

Cheers,
Eric

From biopython at maubp.freeserve.co.uk Thu Nov 26 10:41:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 26 Nov 2009 10:41:10 +0000
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>
References: <200911250945.20870.jblanca@btc.upv.es>
    <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
    <200911251220.53881.jblanca@btc.upv.es>
    <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
    <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>
Message-ID: <320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com>

On Thu, Nov 26, 2009 at 7:14 AM, Eric Talevich wrote:
>
> On Wed, Nov 25, 2009 at 6:48 AM, Peter wrote:
>> Doing anything complex with alphabets may fall into the "hard
>> to explain" category. Using object identity or string identity is
>> at least simple to explain.
>>
>> Thus far we have just two options, and neither is ideal:
>> (a) Object identity, following id(seq1)==id(seq2) as now
>> (b) String identity, following str(seq1)==str(seq2)
>
> How about (c), string and generic alphabet identity, where
> Seq.__hash__ uses the sequence string and some simplification of the
> alphabet types like Jose described. Premise: the sequence string and
> alphabet are the only arguments the Seq constructor takes, so if two
> objects can both be recreated from the same arguments, they should be
> equal as far as sets and dictionaries are concerned. To fall back on
> string identity, it's easy enough to map str onto a collection of Seq
> objects.
>
> def __hash__(self):
>     """Same string, same alphabet --> same hash."""
>     # If alphabet is a standard type, match the generic alphabet types
>     if self.alphabet == generic_nucleotide:
>         return hash((str(self), Alphabet))
>         #OR, to match raw strings: return hash(str(self))
>     elif isinstance(self.alphabet, DNAAlphabet):
>         return hash((str(self), DNAAlphabet))
>     elif isinstance(self.alphabet, RNAAlphabet):
>         return hash((str(self), RNAAlphabet))
>     elif isinstance(self.alphabet, ProteinAlphabet):
>         return hash((str(self), ProteinAlphabet))
>     # Other alphabets, maybe user-defined --> require exactly the same type
>     else:
>         return hash((str(self), self.alphabet.__class__))

As an aside, you'd need to get the base alphabet (i.e. remove any
AlphabetEncoder wrappers) to decide if it is RNA/DNA/Protein. There
is a private helper function in Bio.Alphabet for this. I don't think
these AlphabetEncoder objects (like Gapped) were an entirely
sensible design... but it's done now.

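To make the "remove any AlphabetEncoder wrappers" step concrete, here
is a minimal sketch. The classes are stand-ins written for this example,
and base_alphabet() and seq_hash() are hypothetical names; this is not
the private helper in Bio.Alphabet, just the general idea of unwrapping
until a non-encoder alphabet is reached. Note it keys on the exact base
class, which is simpler than the isinstance() checks in idea (c).

```python
class Alphabet(object):
    """Stand-in base alphabet (hypothetical)."""


class DNAAlphabet(Alphabet):
    pass


class ProteinAlphabet(Alphabet):
    pass


class AlphabetEncoder(object):
    """Stand-in wrapper that decorates another alphabet."""

    def __init__(self, alphabet):
        self.alphabet = alphabet


class Gapped(AlphabetEncoder):
    def __init__(self, alphabet, gap_char="-"):
        AlphabetEncoder.__init__(self, alphabet)
        self.gap_char = gap_char


def base_alphabet(alphabet):
    """Peel off encoder wrappers until a plain alphabet remains."""
    while isinstance(alphabet, AlphabetEncoder):
        alphabet = alphabet.alphabet
    return alphabet


def seq_hash(sequence, alphabet):
    """Hash on the string plus the class of the unwrapped alphabet."""
    return hash((sequence, type(base_alphabet(alphabet))))
```

Under these assumptions a gapped DNA alphabet and a plain DNA alphabet
give the same hash for the same letters, while DNA and protein do not
share a key.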
This idea (c) has a major drawback for me, in that it appears you
wouldn't support comparing Seq objects to strings. However,
perhaps that is actually a good thing - that could raise a TypeError,
to force the user to do str(my_seq) == "ACG" which is explicit.

As I understood his proposal, in Jose's related idea (which didn't
get assigned a letter yet), "ACG"==Seq("ACG") would hold for
the default generic alphabet, but not for RNA/DNA/Protein.
e.g. "ACG"!=Seq("ACG",generic_dna), which I find very counter
intuitive.

Peter

From eric.talevich at gmail.com Thu Nov 26 20:13:37 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 26 Nov 2009 15:13:37 -0500
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com>
References: <200911250945.20870.jblanca@btc.upv.es>
    <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
    <200911251220.53881.jblanca@btc.upv.es>
    <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
    <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>
    <320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com>
Message-ID: <3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com>

On Thu, Nov 26, 2009 at 5:41 AM, Peter wrote:
> On Thu, Nov 26, 2009 at 7:14 AM, Eric Talevich wrote:
>>
>> On Wed, Nov 25, 2009 at 6:48 AM, Peter wrote:
>>> Doing anything complex with alphabets may fall into the "hard
>>> to explain" category. Using object identity or string identity is
>>> at least simple to explain.
>>>
>>> Thus far we have just two options, and neither is ideal:
>>> (a) Object identity, following id(seq1)==id(seq2) as now
>>> (b) String identity, following str(seq1)==str(seq2)
>>
>> How about (c), string and generic alphabet identity, where
>> Seq.__hash__ uses the sequence string and some simplification of the
>> alphabets types like Jose described.
>> [...]
>>
>> def __hash__(self):
>>     """Same string, same alphabet --> same hash."""
>>     [...]
> > [...] > > This idea (c) has a major drawback for me, in that it appears you > wouldn't support comparing Seq objects to strings. However, > perhaps that is actually a good thing - that could raise a TypeError, > to force the user to do str(my_seq) == "ACG" which is explicit. > I guess this is the basic question: is a Seq a string-type, or complex class that contains a string (is-a vs. has-a)? Python will let us be inconsistent with the type system if want, but for a class as fundamental as Seq, I think it should be consistent. Biopython-dev discussed making Seq inherit from str or basestring earlier [1], and I think it was decided that while actual inheritance would be tricky, Seq should mimic that interface as much as possible (using the alphabet attribute for validation and extra features, mainly). So we'd treat Seq as a string-like type -- option (b) -- and let SeqRecord be the complex type that has a sequence, accession number, location, etc., where object identity is the only valid case for equality. In short: +1 for your patch on GitHub; I think the rationale is solid. 
-Eric [1] http://bugzilla.open-bio.org/show_bug.cgi?id=2351#c6 From biopython at maubp.freeserve.co.uk Fri Nov 27 11:39:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Nov 2009 11:39:41 +0000 Subject: [Biopython-dev] Changing Seq equality In-Reply-To: <3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com> References: <200911250945.20870.jblanca@btc.upv.es> <320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com> <200911251220.53881.jblanca@btc.upv.es> <320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com> <3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com> <320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com> <3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com> Message-ID: <320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com> On Thu, Nov 26, 2009 at 8:13 PM, Eric Talevich wrote: > > I guess this is the basic question: is a Seq a string-type, or complex > class that contains a string (is-a vs. has-a)? Python will let us be > inconsistent with the type system if want, but for a class as > fundamental as Seq, I think it should be consistent. > > Biopython-dev discussed making Seq inherit from str or basestring > earlier [1], and I think it was decided that while actual inheritance > would be tricky, Seq should mimic that interface as much as possible > (using the alphabet attribute for validation and extra features, > mainly). So we'd treat Seq as a string-like type -- option (b) -- and > let SeqRecord be the complex type that has a sequence, accession > number, location, etc., where object identity is the only valid case > for equality. > > In short: +1 for your patch on GitHub; I think the rationale is solid. > > -Eric > > [1] http://bugzilla.open-bio.org/show_bug.cgi?id=2351#c6 Nicely put. 
Peter From bugzilla-daemon at portal.open-bio.org Fri Nov 27 13:10:00 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 27 Nov 2009 08:10:00 -0500 Subject: [Biopython-dev] [Bug 2954] xbbtools and SeqGui still using Bio.Translate and Bio.Transcribe In-Reply-To: Message-ID: <200911271310.nARDA0lj005870@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2954 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-27 08:10 EST ------- SeqGui is fixed (also updated wxPython calls as the old way is now deprecated) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Nov 27 14:50:08 2009 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 27 Nov 2009 09:50:08 -0500 Subject: [Biopython-dev] [Bug 2954] xbbtools and SeqGui still using Bio.Translate and Bio.Transcribe In-Reply-To: Message-ID: <200911271450.nAREo8Ln008920@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2954 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-11-27 09:50 EST ------- xbbtools updated to avoid Bio.Translate Marking as fixed. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From biopython at maubp.freeserve.co.uk Fri Nov 27 16:23:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 27 Nov 2009 16:23:52 +0000
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090406220826.GH43636@sobchak.mgh.harvard.edu>
References: <320fb6e00904060625v4a49da2au76159eae18f707eb@mail.gmail.com>
    <20090406220826.GH43636@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00911270823g320c7c24pd0773ae8b72902ee@mail.gmail.com>

Hi all,

Brad has some GFF parsing code he has been working on, which would
be nice to merge into Biopython at some point. See:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005700.html

As we started to discuss earlier this year, we need to think about
what to do with the existing (old) Bio.GFF module. It was written
by Michael Hoffman back in 2002 and accesses MySQL General Feature
Format (GFF) databases created with BioPerl.

I've been looking at the old Bio.GFF code, and there are a lot of
redundant things like its own GenBank/EMBL location parsing, plus
its own location objects and its own Feature objects (rather than
reusing Bio.SeqFeature which should have sufficed).

I want to suggest we deprecate Michael Hoffman's Bio.GFF module
in Biopython 1.53 (I'm hoping we can do this next month, Dec 2009).
Depending on how soon Brad's code is ready to be merged (which
I am assuming could be Biopython 1.54, spring 2010), we can perhaps
accelerate removal of the old module.

How does that sound? If we're all happy on the dev list, we'll still
need to ask on the main list in case anyone is using the old
Bio.GFF code.

Peter

From bugzilla-daemon at portal.open-bio.org Mon Nov 30 19:06:12 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 30 Nov 2009 14:06:12 -0500
Subject: [Biopython-dev] [Bug 2957] New: GenBank Writer Should Write Out Date
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=2957

           Summary: GenBank Writer Should Write Out Date
           Product: Biopython
           Version: 1.52
          Platform: PC
        OS/Version: Windows
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: n.j.loman at bham.ac.uk

Hi,

Would it be possible to modify the GenBank writer to output the date in the
LOCUS header? When reading GenBank files this is already parsed in the correct
format into record.annotations, so the following patch would work:

diff -u InsdcIO.py InsdcIO-date.py
--- InsdcIO.py  2009-11-30 19:54:05.000000000 +0000
+++ InsdcIO-date.py  2009-11-30 19:55:32.000000000 +0000
@@ -278,12 +278,13 @@
         assert len(division) == 3
         #TODO - date
         #TODO - mol_type
-        line = "LOCUS %s %s %s %s %s 01-JAN-1980\n" \
+        line = "LOCUS %s %s %s %s %s %s\n" \
               % (locus.ljust(16),
                  str(len(record)).rjust(11),
                  units,
                  mol_type.ljust(6),
-                 division)
+                 division,
+                 record.annotations.get('date', '01-JAN-1980'))
         assert len(line) == 79+1, repr(line) #plus one for new line
         assert line[12:28].rstrip() == locus, \

I realise this might not work when converting between different record types,
but this would suit my needs for the time being.

Cheers

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.