From macrozhu at gmail.com Fri Dec 2 04:41:30 2011
From: macrozhu at gmail.com (Hongbo Zhu 朱宏博)
Date: Fri, 2 Dec 2011 10:41:30 +0100
Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ?
Message-ID:

Hi,

I propose to add slicing to class Bio.PDB.Chain by changing the method
Bio.PDB.Chain.__getitem__().

* Why is slicing necessary for Bio.PDB.Chain?
Protein domain definitions are usually presented as the starting and ending
positions of the domain in protein primary structures, e.g. in SCOP or
CATH. Slicing comes in handy when extracting domains from PDB files.

* Why is slicing not available at the moment?
I understand that the majority of Bio.PDB.Entity objects are not lists, and
there is no internal *sequential order* for the child entities in these
objects. For example, in Bio.PDB.Model, the child Chain entities do not
really have a sequential order within the Model, so slicing does not seem to
make sense there. But Bio.PDB.Chain is exceptional: Residue entities in
Bio.PDB.Chain have a sequence order as presented in the primary structure,
and slicing becomes a reasonable operation.

* How to slice a Chain entity?
I think it can be realized by revising the method
Bio.PDB.Chain.__getitem__(). For example:

    def __getitem__(self, id):
        """Return the residue with given id.

        The id of a residue is (hetero flag, sequence identifier, insertion
        code).
        If id is an int, it is translated to (" ", id, " ") by the
        _translate_id method.

        Arguments:
        o id - (string, int, string) or int
        """
        if isinstance(id, slice):
            res_id_list = [r.id for r in self.get_iterator()]
            if id.start is not None:
                start_index = res_id_list.index(self._translate_id(id.start))
            else:
                start_index = 0
            if id.stop is not None:
                stop_index = res_id_list.index(self._translate_id(id.stop))
            else:
                # An open-ended slice such as chain[2:] runs to the end.
                stop_index = len(res_id_list)
            return self.get_list()[start_index:stop_index:id.step]
        else:
            id = self._translate_id(id)
            return Entity.__getitem__(self, id)

From anaryin at gmail.com Fri Dec 2 05:32:27 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 2 Dec 2011 11:32:27 +0100
Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ?
In-Reply-To:
References:
Message-ID:

Hey Hongbo,

Interesting idea, but couldn't it be done already with child_list in a more
or less straightforward manner?

Best,

João

From macrozhu at gmail.com Fri Dec 2 07:43:59 2011
From: macrozhu at gmail.com (Hongbo Zhu 朱宏博)
Date: Fri, 2 Dec 2011 13:43:59 +0100
Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ?
In-Reply-To:
References:
Message-ID:

Hi, Joao,

thanks for the response. When I spoke of slicing Bio.PDB.Chain, I meant to
slice it using residue id, not list index. And these two ways are
fundamentally different.

For instance, not only slicing like this:

    chain.child_list[2:12]  # slice using list index

but also slicing like this:

    chain[2:12]  # slice using residue sequence id, not feasible at the moment
                 # NOTE: this is fundamentally different from
                 # chain.child_list[2:12]

or even:

    chain[(' ', 2, ' ') : (' ', 12, ' ')]  # slice using residue full id,
                                           # even better

Of course one can play with child_list and obtain the same outcome. But I
think it would be very convenient to implement it in the __getitem__()
function.

cheers,
hongbo

--
Hongbo

From p.j.a.cock at googlemail.com Mon Dec 5 05:45:29 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 5 Dec 2011 10:45:29 +0000
Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ?
In-Reply-To:
References:
Message-ID:

2011/12/2 Hongbo Zhu 朱宏博:
> Hi, Joao,
>
> thanks for the response. When I spoke of slicing Bio.PDB.Chain, I meant to
> slice it using residue id, not list index. And these two ways are
> fundamentally different.
>
> For instance, not only slicing like this:
>
> chain.child_list[2:12]  # slice using list index
>
> but also slicing like this:
>
> chain[2:12]  # slice using residue sequence id, not feasible at the moment
>              # NOTE: this is fundamentally different from
>              # chain.child_list[2:12]
> or even:
> chain[(' ', 2, ' ') : (' ', 12, ' ')]  # slice using residue full id,
>                                        # even better
>
> Of course one can play with child_list and obtain the same outcome. But I
> think it would be very convenient to implement it in the __getitem__()
> function.
>
> cheers,
> hongbo

Hi Hongbo,

I agree that defining integer-based slicing of Chain objects sounds like a
good idea.

Could you write a couple of unit tests for the new slicing please (in
file Tests/test_PDB.py)? You can just give code snippets, a patch, or
create a branch on GitHub if you would prefer.

Does it make sense to consider __add__ for the Chain as well?

Peter

From macrozhu at gmail.com Mon Dec 5 06:46:59 2011
From: macrozhu at gmail.com (Hongbo Zhu 朱宏博)
Date: Mon, 5 Dec 2011 12:46:59 +0100
Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ?
In-Reply-To:
References:
Message-ID:

Hi, Peter,

I just realized a special issue concerning slicing Bio.PDB.Chain.
Normally, in python a slice is given by three arguments: start, stop and
step, where the element at position *stop* is not included in the output.
For example,

    mylist[2:40:1]  would return: [ mylist[2], mylist[3], ..., mylist[39] ]

But in CATH and SCOP, sequence segments composing domains are given as start
and end position. And the residue at the end position is also included in
the domain definition. e.g. if a domain is defined to be from residue
(' ', 1, ' ') to residue (' ', 40, ' '), a slicing like this
mychain[(' ', 2, ' '): (' ', 40, ' ')] or mychain[2:40] would not include
residue (' ',40,' '). And it is not certain that
mychain[(' ', 2, ' '): (' ', 41, ' ')] would give the correct outcome,
because the residue after (' ',40,' ') does not necessarily have to be
(' ',41,' ').

Of course we can change the code in __getitem__() such that it includes the
end position. But then it goes against the general python convention of
slicing. So I think an independent function is perhaps needed:

    class Chain(Entity):

        def get_slice(self, start, end, step=None):
            """Return a slice of the chain from start to end (including end position)

            Arguments:
            o start - (string, int, string) or int
            o end - (string, int, string) or int
            o step - None or int
            """
            res_id_list = [r.id for r in self.get_iterator()]
            start_index = res_id_list.index(self._translate_id(start))
            end_index = res_id_list.index(self._translate_id(end))
            # +1 so that the residue at the end position is included
            return self.get_list()[start_index:end_index+1:step]

And for the overload of operator __add__(), is it for the concatenation of
chain segments? I think it is very important (if I chop the sequence into
pieces, I should also be able to glue them back together, right?). But this
implies the function get_slice should return a Chain instance, not just a
list of Residue instances, right?
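To illustrate the end-position issue described above, here is a minimal sketch that works on plain residue id tuples rather than on Bio.PDB objects. The id list and the two helper functions (half_open_by_id, inclusive_by_id) are hypothetical and made up for this illustration; they are not part of Biopython.

```python
# Residue ids as (hetero flag, sequence number, insertion code) tuples.
# Made-up chain: numbering starts at 38, has an insertion (40A)
# and a gap (there is no residue 41).
residue_ids = [
    (" ", 38, " "),
    (" ", 39, " "),
    (" ", 40, " "),
    (" ", 40, "A"),
    (" ", 42, " "),
]


def half_open_by_id(ids, start, stop):
    """Python-style slice: the residue with id `stop` is excluded."""
    return ids[ids.index(start):ids.index(stop)]


def inclusive_by_id(ids, start, end):
    """Domain-style range: the residue with id `end` is included."""
    return ids[ids.index(start):ids.index(end) + 1]


domain_start, domain_end = (" ", 38, " "), (" ", 40, " ")

# Half-open slicing drops the last residue of the domain.
print(half_open_by_id(residue_ids, domain_start, domain_end))

# Inclusive slicing keeps it.
print(inclusive_by_id(residue_ids, domain_start, domain_end))

# "end + 1" is not a safe stop value: residue 41 does not exist here,
# so the id lookup fails with ValueError.
try:
    half_open_by_id(residue_ids, domain_start, (" ", 41, " "))
except ValueError as err:
    print("no such residue:", err)
```

The sketch only shows why an inclusive end is the safer match for SCOP/CATH style boundaries when numbering has gaps or insertion codes; it says nothing about how Bio.PDB should expose this.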
--Hongbo

From p.j.a.cock at googlemail.com Mon Dec 5 07:15:55 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 5 Dec 2011 12:15:55 +0000
Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ?
In-Reply-To:
References:
Message-ID:

On Mon, Dec 5, 2011 at 11:46 AM, Hongbo Zhu 朱宏博 wrote:
> Hi, Peter,
>
> I just realized a special issue concerning slicing Bio.PDB.Chain.
> Normally, in python a slice is given by three arguments: start, stop and
> step, where the element at position *stop* is not included in the output.
> For example,
>
> mylist[2:40:1]  would return: [ mylist[2], mylist[3], ..., mylist[39] ]
>

Yes,

> But in CATH and SCOP, sequence segments composing domains
> are given as start and end position. And the residue at the end
> position is also included in the domain definition.

OK. I'd have to double check what our parsers return (and if
they convert the start/end into C/Python style).

> e.g. if a domain
> is defined to be from residue (' ', 1, ' ') to residue (' ', 40, ' '), a
> slicing like this mychain[(' ', 2, ' '): (' ', 40, ' ')] or mychain[2:40]
> would not include residue (' ',40,' ').

Perhaps I misunderstood - I would not want to allow the syntax
mychain[(' ', 2, ' '): (' ', 40, ' ')] which is unclear, rather only allow
the user to use mychain[2:41] which requires Python counting.

Peter

From macrozhu at gmail.com Mon Dec 5 08:38:09 2011
From: macrozhu at gmail.com (Hongbo Zhu 朱宏博)
Date: Mon, 5 Dec 2011 14:38:09 +0100
Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ?
In-Reply-To:
References:
Message-ID:

> > But in CATH and SCOP, sequence segments composing domains
> > are given as start and end position. And the residue at the end
> > position is also included in the domain definition.
>
> OK. I'd have to double check what our parsers return (and if
> they convert the start/end into C/Python style).
>
> > e.g. if a domain
> > is defined to be from residue (' ', 1, ' ') to residue (' ', 40, ' '), a
> > slicing like this mychain[(' ', 2, ' '): (' ', 40, ' ')] or mychain[2:40]
> > would not include residue (' ',40,' ').
>
> Perhaps I misunderstood - I would not want to allow the syntax
> mychain[(' ', 2, ' '): (' ', 40, ' ')] which is unclear, rather only allow
> the user to use mychain[2:41] which requires Python counting.

But even in mychain[2:41], the 2 and 41 should be residue sequence numbers.
Then it is consistent with the currently accepted syntax mychain[2], where 2
also refers to a sequence number. At the moment, Biopython also accepts
mychain[(' ', 2, ' ')]. So I think mychain[(' ', 2, ' '): (' ', 40, ' ')]
would be just a natural extension of mychain[(' ', 2, ' ')].

According to the source code, mychain[2] is considered an abbreviation of
mychain[(' ', 2, ' ')]. Internally, mychain[2] will be translated to
mychain[(' ', 2, ' ')] by the method Bio.PDB.Chain._translate_id(). So if
mychain[2:40] were allowed, internally it would also first be translated to
mychain[(' ', 2, ' '): (' ', 40, ' ')]. So in my point of view, mychain[2:40]
is just an abbreviation for mychain[(' ', 2, ' '): (' ', 40, ' ')], just like
mychain[2] is a short version of mychain[(' ',2,' ')].

hongbo

--
Hongbo

From p.j.a.cock at googlemail.com Mon Dec 5 08:50:44 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 5 Dec 2011 13:50:44 +0000
Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ?
In-Reply-To:
References:
Message-ID:

On Mon, Dec 5, 2011 at 1:38 PM, Hongbo Zhu 朱宏博 wrote:
>
>> Perhaps I misunderstood - I would not want to allow the syntax
>> mychain[(' ', 2, ' '): (' ', 40, ' ')] which is unclear, rather only allow
>> the user to use mychain[2:41] which requires Python counting.
>
> But even in mychain[2:41], the 2 and 41 should be residue sequence numbers.
> Then it is consistent with the currently accepted syntax mychain[2], where 2
> also refers to a sequence number. At the moment, Biopython also accepts
> mychain[(' ', 2, ' ')]. So I think mychain[(' ', 2, ' '): (' ', 40, ' ')]
> would be just a natural extension of mychain[(' ', 2, ' ')].
>
> According to the source code, mychain[2] is considered an abbreviation of
> mychain[(' ', 2, ' ')]. Internally, mychain[2] will be translated to
> mychain[(' ', 2, ' ')] by the method Bio.PDB.Chain._translate_id(). So if
> mychain[2:40] were allowed, internally it would also first be translated to
> mychain[(' ', 2, ' '): (' ', 40, ' ')]. So in my point of view,
> mychain[2:40] is just an abbreviation for
> mychain[(' ', 2, ' '): (' ', 40, ' ')], just like mychain[2] is a short
> version of mychain[(' ',2,' ')].
>
> hongbo

I've never really liked these strange tuple IDs, which are usually
but not always full of empty values. I understand some of
the corner cases they handle, but they are very complicated.

You cannot assume 2 will map to (' ', 2, ' ') in general - this
is what the _translate_id method handles. Consider the case
where you have sliced the Chain as discussed, since the
first two elements have been removed, that mapping will shift.

We definitely would need a test case covering non-trivial
ID tuples (e.g. using insertion codes), and tests slicing a
previously sliced Chain.

Peter

From macrozhu at gmail.com Mon Dec 5 09:48:25 2011
From: macrozhu at gmail.com (Hongbo Zhu 朱宏博)
Date: Mon, 5 Dec 2011 15:48:25 +0100
Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ?
In-Reply-To:
References:
Message-ID:

On Mon, Dec 5, 2011 at 2:50 PM, Peter Cock wrote:
> I've never really liked these strange tuple IDs, which are usually
> but not always full of empty values. I understand some of
> the corner cases they handle, but they are very complicated.

This seems to be the problem of PDB. I don't know how other packages handle
the issue. In addition, I once proposed to remove the HETERO-flag in the
residue ID.
http://biopython.org/pipermail/biopython-dev/2011-January/008640.html
It is only retained for the backwards compatibility with PDB files before
remediation in 2007. Removing only HETERO-flag does not solve the problem
totally, but to some extent (say, around 50%).
I think one possible solution is to always treat the residue ID as a string
'%4d%s' % (resnum, icode) instead of a tuple composed of resnum plus icode
(if we do not consider the HETERO-flag). This way, Biopython also serves to
remind users that icode is indispensable for uniquely locating a residue,
even if icode is empty. But this would lead to a formidable API change.

> You cannot assume 2 will map to (' ', 2, ' ') in general - this
> is what the _translate_id method handles. Consider the case
> where you have sliced the Chain as discussed, since the
> first two elements have been removed, that mapping will shift.

I am afraid I don't concur. As a matter of fact, the mapping is resolved
internally by comparing the input residue ID to a list of all residue IDs,
so the correct index values are obtained for the slicing:

    res_id_list = [r.id for r in self.get_iterator()]
    start_index = res_id_list.index(self._translate_id(start))
    stop_index = res_id_list.index(self._translate_id(stop))
    return self.get_list()[start_index:stop_index:step]

It is like mychain[2]: this 2 will first be translated to (' ',2,' '), and
this residue ID is then looked up in the dictionary indexed by all residue
IDs to locate the correct residue.

> We definitely would need a test case covering non-trivial
> ID tuples (e.g. using insertion codes), and tests slicing a
> previously sliced Chain.

PDB entry 1h4w is a good example with icode and the sequence of chain A
starts with resnum 16.
    def get_slice(self, start, end, step=None):
        """Return a slice of the chain from start to end (including end position)

        Arguments:
        o start - (string, int, string) or int
        o end - (string, int, string) or int
        o step - None or int
        """
        res_id_list = [r.id for r in self.get_iterator()]
        start_index = res_id_list.index(self._translate_id(start))
        end_index = res_id_list.index(self._translate_id(end))
        return self.get_list()[start_index:end_index+1:step]

    In [66]: mychain.get_slice(182,189)
    Out[66]: [, , , , , , , , , ]

> Peter

--
Hongbo

From p.j.a.cock at googlemail.com Mon Dec 5 10:53:36 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 5 Dec 2011 15:53:36 +0000
Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ?
In-Reply-To:
References:
Message-ID:

On Mon, Dec 5, 2011 at 2:48 PM, Hongbo Zhu 朱宏博 wrote:
>
> On Mon, Dec 5, 2011 at 2:50 PM, Peter Cock wrote:
>>
>> I've never really liked these strange tuple IDs, which are usually
>> but not always full of empty values. I understand some of
>> the corner cases they handle, but they are very complicated.
>
> This seems to be the problem of PDB.

Yes.

> I don't know how other packages handle the issue.
> In addition, I once proposed to remove the HETERO-flag in the residue ID.
> http://biopython.org/pipermail/biopython-dev/2011-January/008640.html
> It is only retained for the backwards compatibility with PDB files before
> remediation in 2007. Removing only HETERO-flag does not solve
> the problem totally, but to some extent (say, around 50%).

Breaking the API without making the ID much easier to use
is a bad idea.

> PDB entry 1h4w is a good example with icode and the sequence of chain A
> starts with resnum 16.

That shows the problem nicely,

>>> from Bio import PDB
>>> structure = PDB.PDBParser().get_structure("1h4w", "1h4w.pdb")
>>> chain = structure[0]['A']
>>> len(chain)
351
>>> chain[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/PDB/Chain.py", line 67, in __getitem__
    return Entity.__getitem__(self, id)
  File "Bio/PDB/Entity.py", line 38, in __getitem__
    return self.child_dict[id]
KeyError: (' ', 0, ' ')

However, you can access the first residue like this:

>>> chain[16]

Likewise,

>>> for index, residue in enumerate(chain):
...     print index, residue
...     assert chain[index] == residue
...
0
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "Bio/PDB/Chain.py", line 67, in __getitem__
    return Entity.__getitem__(self, id)
  File "Bio/PDB/Entity.py", line 38, in __getitem__
    return self.child_dict[id]
KeyError: (' ', 0, ' ')

So as you say, the current implementation does map
an integer index to the middle field of the ID tuple,
rather than the position in the list as I had assumed.
Sadly this means it is incompatible with Pythonic
slicing, so we can't extend __getitem__ to offer that.

Peter

From macrozhu at gmail.com Mon Dec 5 11:22:42 2011
From: macrozhu at gmail.com (Hongbo Zhu 朱宏博)
Date: Mon, 5 Dec 2011 17:22:42 +0100
Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ?
In-Reply-To:
References:
Message-ID:

> So as you say, the current implementation does map
> an integer index to the middle field of the ID tuple,
> rather than the position in the list as I had assumed.
> Sadly this means it is incompatible with Pythonic
> slicing, so we can't extend __getitem__ to offer that.

Interesting! I was thinking of the problem from a different angle: slicing
is just a natural extension of __getitem__, as with a Python list. And I
think the current implementation is a great realization of a Python list for
the special case of a protein chain. But since my proposal has another
conflict with Pythonic slicing, i.e. the ambiguity about the end position of
sequence segments, I prefer to implement the slicing as an independent
function get_slice(start, end), if not in Bio.PDB.Chain, then in my own
code.

Thanks a lot for the helpful discussion!
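The distinction the thread settles on, lookup by residue id versus position in the child list, can be sketched without Biopython. MiniResidue and MiniChain below are hypothetical stand-ins, not Bio.PDB classes; they only mimic child_list, child_dict and _translate_id as described in the messages above, and the residue data is made up. As an assumption of this sketch, get_slice returns a new MiniChain rather than a list, so a slice can itself be indexed and sliced again by id.

```python
from collections import namedtuple

# Hypothetical stand-in for a residue: only the id tuple matters here.
MiniResidue = namedtuple("MiniResidue", ["id", "name"])


class MiniChain:
    """Toy model of a chain: residues in order, plus a lookup by id."""

    def __init__(self, residues):
        self.child_list = list(residues)
        self.child_dict = {res.id: res for res in self.child_list}

    @staticmethod
    def _translate_id(res_id):
        # An int is shorthand for (" ", int, " "), as described in the thread.
        if isinstance(res_id, int):
            return (" ", res_id, " ")
        return res_id

    def __getitem__(self, res_id):
        # Lookup is by residue id, not by position in child_list.
        return self.child_dict[self._translate_id(res_id)]

    def get_slice(self, start, end):
        """Return residues from start to end (both inclusive) as a MiniChain."""
        ids = [res.id for res in self.child_list]
        first = ids.index(self._translate_id(start))
        last = ids.index(self._translate_id(end))
        return MiniChain(self.child_list[first:last + 1])


# Made-up data: numbering starts at 16 and contains an insertion code (18A).
chain = MiniChain([
    MiniResidue((" ", 16, " "), "ILE"),
    MiniResidue((" ", 17, " "), "VAL"),
    MiniResidue((" ", 18, " "), "GLY"),
    MiniResidue((" ", 18, "A"), "GLY"),
    MiniResidue((" ", 19, " "), "TYR"),
])

# Residue number 16 is the first element of the list.
print(chain[16] is chain.child_list[0])

# Inclusive slice by id, ending on a residue with an insertion code.
segment = chain.get_slice(17, (" ", 18, "A"))
print([res.id for res in segment.child_list])

# Ids are unchanged in the slice, so id lookup still works after slicing.
print(segment[18].name)

# There is no residue numbered 0, so id lookup fails.
try:
    chain[0]
except KeyError as err:
    print("KeyError:", err)
```

Because the slice keeps the original residue ids, the mapping from an integer to an id does not shift after slicing; only positions in child_list change.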
--
Hongbo

From redmine at redmine.open-bio.org Mon Dec 5 12:33:55 2011
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 5 Dec 2011 17:33:55 +0000
Subject: [Biopython-dev] [Biopython - Feature #3236] Make Biopython work in PyPy 1.5
References:
Message-ID:

Issue #3236 has been updated by Peter Cock.

I have found and fixed a few handle leaks, which means as of
https://github.com/biopython/biopython/commit/65da2fe99d923c5a69a6bfa2ed3b3375496d4826
there are now no ResourceWarning messages from Python 3.2 using:

    python3 -W all test_SeqIO_index.py

However, despite that, running that test alone still fails on Mac OS X
under PyPy (1.6 and) 1.7 with IOError: [Errno 24] Too many open files: ...

See also https://bugs.pypy.org/issue828 which may possibly be related.

----------------------------------------
Feature #3236: Make Biopython work in PyPy 1.5
https://redmine.open-bio.org/issues/3236

Author: Eric Talevich
Status: In Progress
Priority: Low
Assignee: Biopython Dev Mailing List
Category:
Target version:
URL:

PyPy is now roughly as production-ready as Jython:
http://morepypy.blogspot.com/2011/04/pypy-15-released-catching-up.html

Let's make Biopython work on PyPy 1.5.

To make the pure-Python core of Biopython work, I did this:

* Download and unpack the pre-compiled Linux tarball from pypy.org
* Copy the header file @marshal.h@ from the CPython 2.X installation into the @pypy-c-.../include/@ directory
* pypy setup.py build; pypy setup.py install
* Delete pypy-c-.../site-packages/Bio/cpairwise2*.so

Benchmarking a script that leans heavily on Bio.pairwise2, I see about a 2x
speedup between PyPy 1.5 and CPython 2.6 -- yes, that's with the compiled C
extension @cpairwise2@ in the CPython 2.6 installation.

Numpy isn't available on PyPy yet, and it may be some time before it is.
Observations from @pypy setup.py test@:

* test_BioSQL triggers tons of RuntimeWarnings related to sqlite3 functions
* test_BioSQL_SeqIO fails -- attempts to retrieve P01892 instead of Q29899 (?)
* test_Restriction triggers a TypeError, somehow (also causing test_CAPS to err)
* test_Entrez fails with many noisy errors -- looks related to expat, may be just my installation
* importing @Bio.trie@ fails, probably due to a @marshal.h@ issue with compilation

--
You have received this notification because you have either subscribed to
it, or are involved in it.
To change your notification preferences, please click here and login:
http://redmine.open-bio.org

From mmaddren at soe.ucsc.edu Mon Dec 5 19:29:53 2011
From: mmaddren at soe.ucsc.edu (Morgan Maddren)
Date: Mon, 5 Dec 2011 16:29:53 -0800 (PST)
Subject: [Biopython-dev] GEO Libraries
In-Reply-To: <1777002553.129203.1323131003663.JavaMail.root@mail-01.cse.ucsc.edu>
Message-ID: <504286596.129243.1323131393104.JavaMail.root@mail-01.cse.ucsc.edu>

Hi, I don't know who this may concern, but I noticed on the wiki that there
is interest in GEO SOFT parsers/writers. I work for the UCSC Genome Browser,
and we've already created libraries for working with SOFT files, and would
be happy to share.

-Morgan

From p.j.a.cock at googlemail.com Tue Dec 6 05:20:10 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 6 Dec 2011 10:20:10 +0000
Subject: [Biopython-dev] GEO Libraries
In-Reply-To: <504286596.129243.1323131393104.JavaMail.root@mail-01.cse.ucsc.edu>
References: <1777002553.129203.1323131003663.JavaMail.root@mail-01.cse.ucsc.edu> <504286596.129243.1323131393104.JavaMail.root@mail-01.cse.ucsc.edu>
Message-ID:

On Tue, Dec 6, 2011 at 12:29 AM, Morgan Maddren wrote:
> Hi, I don't know who this may concern, but I noticed on the wiki
> that there is interest in GEO SOFT parsers/writers. I work for
> the UCSC Genome Browser, and we've already created libraries
> for working with SOFT files, and would be happy to share.
>
> -Morgan

In Python? ;)

We have an old and somewhat rudimentary set of GEO SOFT parsers
under Bio.Geo, see:
https://github.com/biopython/biopython/tree/master/Bio/Geo

But not much else. Here R/Bioconductor is much more mature,
not just Sean Davis' GEO parser but the whole microarray
expression set objects, plus of course the statistics side (for
which we may want to use SciPy).

Peter

From p.j.a.cock at googlemail.com Wed Dec 7 07:20:11 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 7 Dec 2011 12:20:11 +0000
Subject: [Biopython-dev] Test failure: write/read simple one-of locations.
Message-ID:

Hi all,

I noticed the following odd buildslave failure yesterday, which was
not reproducible (a retest passed). However, something very similar
happened today (but again it passed when repeated):

Failure on 6 December 2011, under Python 3.2
http://testing.open-bio.org/biopython/builders/Linux%20-%20Python%203.2/builds/291/steps/shell/logs/stdio

ERROR: test_oneof (test_SeqIO_features.FeatureWriting)
Features: write/read simple one-of locations.
...
ValueError: [one-of(0,3,6):21](+) versus [one-of(0,3,6):21](+):
SeqFeature(FeatureLocation(OneOfPosition(0, choices=[ExactPosition(0),
ExactPosition(3), ExactPosition(6)]), ExactPosition(21), strand=1),
type='CDS')
vs:
SeqFeature(FeatureLocation(OneOfPosition(44, choices=[ExactPosition(0),
ExactPosition(3), ExactPosition(6)]), ExactPosition(21), strand=1),
type='CDS')

Failure on 7 December 2011, under Python 3.1
http://testing.open-bio.org/biopython/builders/Linux%20-%20Python%203.1/builds/459/steps/shell/logs/stdio

ERROR: test_oneof (test_SeqIO_features.FeatureWriting)
Features: write/read simple one-of locations.
...
ValueError: [one-of(0,3,6):21](+) versus [one-of(0,3,6):21](+):
SeqFeature(FeatureLocation(OneOfPosition(0, choices=[ExactPosition(0),
ExactPosition(3), ExactPosition(6)]), ExactPosition(21), strand=1),
type='CDS')
vs:
SeqFeature(FeatureLocation(OneOfPosition(180, choices=[ExactPosition(0),
ExactPosition(3), ExactPosition(6)]), ExactPosition(21), strand=1),
type='CDS')

This is the same unit test, on the same machine, but different
revisions, different versions of Python, and also slightly
different output.

This is an old machine, so it could be a memory problem, or
evidence of some subtle race condition.

Has anyone else seen this error?

Peter

From redmine at redmine.open-bio.org Mon Dec 12 07:47:30 2011
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 12 Dec 2011 12:47:30 +0000
Subject: [Biopython-dev] [Biopython - Bug #3315] (New) Bio.SwissProt fails parsing .dat dumps
Message-ID:

Issue #3315 has been reported by Leszek Pryszcz.

----------------------------------------
Bug #3315: Bio.SwissProt fails parsing .dat dumps
https://redmine.open-bio.org/issues/3315

Author: Leszek Pryszcz
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version:
URL:

The SwissProt module fails when parsing the .dat dump of Uniprot_trembl
version 201111. The error is due to corrupted RX lines in the .dat file for
Aureococcus anophagefferens (e.g. C6KIH8_AURAN):

> RX DOI=DOI; 10.1111/j.1529-8817.2010.00841.x;

I have reported the problem. The thing is that it has happened before.
Previously, I reported a similar issue in releases 201010, 201011,
201012...

> RX DOI=10.1098/rspb= .2010.1301;

Would it be possible to alter the error handling in
Bio.SwissProt._read_rx, so that the module warns about a corrupted entry
instead of failing the parser?
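As a rough illustration of the "warn instead of fail" behaviour requested above, here is a minimal standalone sketch. parse_rx_line is a hypothetical helper written for this example and is not the code in Bio.SwissProt._read_rx. The rule that a value containing whitespace is malformed is an assumption of this sketch, and the first sample line uses made-up identifiers.

```python
import warnings


def parse_rx_line(line):
    """Parse one SwissProt-style RX line into (key, value) pairs.

    Hypothetical helper, not the Bio.SwissProt implementation: tokens that
    do not look like KEY=VALUE trigger a warning and are skipped instead of
    raising an exception.
    """
    if not line.startswith("RX"):
        raise ValueError("not an RX line: %r" % line)
    references = []
    for token in line[2:].split(";"):
        token = token.strip()
        if not token:
            continue  # the trailing ';' leaves an empty token
        key, sep, value = token.partition("=")
        key, value = key.strip(), value.strip()
        # Assumption of this sketch: identifiers never contain whitespace.
        malformed = (
            not sep
            or not key
            or not value
            or any(ch.isspace() for ch in value)
        )
        if malformed:
            warnings.warn(
                "Ignoring malformed RX token %r in line %r" % (token, line)
            )
            continue
        references.append((key, value))
    return references


sample_lines = [
    "RX   PubMed=12345678; DOI=10.1000/example;",       # made-up, well formed
    "RX   DOI=DOI; 10.1111/j.1529-8817.2010.00841.x;",  # corrupted (from the report)
    "RX   DOI=10.1098/rspb= .2010.1301;",                # corrupted (from the report)
]
for rx_line in sample_lines:
    print(parse_rx_line(rx_line))
```

A purely structural check like this cannot catch every corruption: in the second sample the token DOI=DOI still looks like a valid pair, and only the orphaned identifier after it is reported.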
---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Mon Dec 12 09:49:37 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 12 Dec 2011 14:49:37 +0000 Subject: [Biopython-dev] [Biopython - Bug #3315] Bio.SwissProt fails parsing .dat dumps References: Message-ID: Issue #3315 has been updated by Leszek Pryszcz. Simple replacing

            assert len(x) == 2, "I don't understand RX line %s" % value
            key, value = x[0], x[1].rstrip(";")
            reference.references.append((key, value))

with

            #assert len(x) == 2, "I don't understand RX line %s" % value
            if len(x) != 2:
              print " Warning: I don't understand RX line %s" % value
            else:
              key, value = x[0], x[1].rstrip(";")
              reference.references.append((key, value))

would do the job.

From redmine at redmine.open-bio.org Mon Dec 12 09:57:21 2011
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 12 Dec 2011 14:57:21 +0000
Subject: [Biopython-dev] [Biopython - Bug #3315] Bio.SwissProt fails parsing .dat dumps
References: Message-ID:

Issue #3315 has been updated by Peter Cock.

The database problem is visible at http://www.uniprot.org/uniprot/C6KIH8.txt where the line is just:

RX DOI=DOI;

You said you'd reported this record (C6KIH8_AURAN) to SwissProt/UniProt, and other problems in the past, so this is a recurrent problem.

Regarding the proposed fix, not really, we need to use the warnings module rather than a print statement. I'm looking at it, but have to download the latest uniprot_trembl.dat first (last month's was fine, and so is uniprot_sprot.dat for this month and last month).
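The warnings-based approach mentioned above could look roughly like the following. This is only a minimal sketch under stated assumptions: `parse_rx_fields` is a hypothetical stand-alone helper, not the actual `Bio.SwissProt._read_rx`, and it only handles the `KEY=VALUE;` style of RX line quoted in this thread.

```python
import warnings


def parse_rx_fields(value):
    """Split the payload of an RX line into (key, value) pairs.

    Hypothetical helper for illustration only. Fields that are not of
    the form KEY=VALUE are reported with a warning and skipped, so one
    corrupted entry does not abort parsing of the whole file.
    """
    pairs = []
    for field in value.split("; "):
        field = field.strip().rstrip(";")
        if not field:
            continue
        parts = field.split("=")
        if len(parts) != 2:
            # Warn instead of asserting, then carry on with the next field.
            warnings.warn("Skipping RX field I don't understand: %r" % field)
            continue
        pairs.append((parts[0], parts[1]))
    return pairs
```

With this sketch a well-formed line such as `PubMed=15060122; DOI=10.1136/jmg.2003.012781;` yields two pairs, while the corrupted `DOI=10.1098/rspb= .2010.1301;` line yields none and emits a single warning. Callers who prefer the old strict behaviour can turn warnings into errors with the standard `warnings.simplefilter("error")`.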

From redmine at redmine.open-bio.org Mon Dec 12 13:36:45 2011
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 12 Dec 2011 18:36:45 +0000
Subject: [Biopython-dev] [Biopython - Bug #3315] (Closed) Bio.SwissProt fails parsing .dat dumps
References: Message-ID:

Issue #3315 has been updated by Peter Cock.

Status changed from New to Closed
% Done changed from 0 to 100

This should be fixed as of: https://github.com/biopython/biopython/commit/8dffb06e18cd725321fcee6f6a7aee9e5c5a146f

Please let us know if you run into any other problems.

From p.j.a.cock at googlemail.com Fri Dec 16 07:36:12 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 16 Dec 2011 12:36:12 +0000
Subject: [Biopython-dev] TogoWS in Biopython?
In-Reply-To: References:

<8762j2iump.fsf@fastmail.fm> Message-ID:

On Wed, Nov 2, 2011 at 1:27 PM, Peter Cock wrote:
> On Wed, Nov 2, 2011 at 12:19 PM, Brad Chapman wrote:
>>
>> Peter;
>>
>>> > Would someone like to review the TogoWS code I have written
>>> > to access the Togo Web Service's REST API please?
>>
>> This looks great and the tests are all passing for me. My only small
>> suggestion would be to avoid hardcoding 'http://togows.dbcls.jp'
>> everywhere. I'd stick this as a top level variable along with the global
>> caches and reference it in the code. This way if they ever get any
>> mirrors we could adjust on the fly.
>>
>> Thanks for getting this in,
>> Brad
>
> Good point regarding the URL.
>
> I've also realised it will need some tweaks for Python 3 (bytes
> versus unicode), or at least to skip the unit tests in the short
> term to avoid hiding real errors on the buildbot.
>
> Peter

That is on the trunk now, further testing and feedback welcome.
https://github.com/biopython/biopython/commit/a4875c62d44c40cebfc11d84654ad9dcb83c81b3

Peter

From Markus.Piotrowski at ruhr-uni-bochum.de Sat Dec 17 17:33:20 2011
From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski)
Date: Sat, 17 Dec 2011 22:33:20 +0000 (UTC)
Subject: [Biopython-dev] Bio.SeqUtils.MeltingTemp Tm_staluc does not accept sequence object
Message-ID:

Dear Biopython developers,

It seems that module Tm_staluc from Bio.SeqUtils does not accept sequence objects as input but 'just' strings. I think that is somewhat unusual for a 'SeqUtil'; shouldn't sequence utilities accept sequence objects?

I also think that the module name 'Tm_staluc' is not self-explanatory. Obviously the name is derived from the corresponding author of the paper which describes this Tm calculation, SantaLucia. The method is better known (and described so in the documentation) as nearest neighbor thermodynamics, so it would better be called something like Tm_nn (nearest neighbor).
However, taking into consideration that more Tm methods will be added in future, they may just be numbered?

From Markus.Piotrowski at ruhr-uni-bochum.de Sun Dec 18 08:50:07 2011
From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski)
Date: Sun, 18 Dec 2011 13:50:07 +0000 (UTC)
Subject: [Biopython-dev] Sequence object allows non-alphabet characters
Message-ID:

Dear Biopython developers,

I wonder why the following code does not throw an exception:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> mySeq = Seq("GATC1234YWSK", IUPAC.unambiguous_dna)
>>> mySeq
Seq('GATC1234YWSK', IUPACUnambiguousDNA())

I expected that trying to generate a sequence object containing non-alphabet characters would either throw an exception/warning or "downgrade" the alphabet, if possible.

Another facet of the same problem is whitespace:

>>> mySeq = Seq("GATC GATC", IUPAC.unambiguous_dna)
>>> mySeq
Seq('GATC GATC', IUPACUnambiguousDNA())
>>> len(mySeq)
9

This is problematic when the sequence length is required (calculating GC content, calculating melting temperature, etc.)

While it could be argued that checking the integrity of the sequence data is related to parsing, I think that the sequence in the sequence object should never contain whitespace, and if an alphabet is assigned it should not contain non-alphabet characters. So this should be handled by the sequence object itself?

From p.j.a.cock at googlemail.com Sun Dec 18 09:03:10 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 18 Dec 2011 14:03:10 +0000
Subject: [Biopython-dev] Sequence object allows non-alphabet characters
In-Reply-To: References: Message-ID:

On Sunday, December 18, 2011, Markus Piotrowski <Markus.Piotrowski at ruhr-uni-bochum.de> wrote:
> Dear Biopython developers,
>
> I wonder why the following code does not throw an exception:
>
> >>> from Bio.Seq import Seq
> >>> from Bio.Alphabet import IUPAC
> >>> mySeq = Seq("GATC1234YWSK", IUPAC.unambiguous_dna)
> >>> mySeq
> Seq('GATC1234YWSK', IUPACUnambiguousDNA())
>
> I expected that trying to generate a sequence object containing non-alphabet
> characters would either throw an exception/warning or "downgrade" the alphabet,
> if possible.
>
> Another facet of the same problem is whitespace:
>
> >>> mySeq = Seq("GATC GATC", IUPAC.unambiguous_dna)
> >>> mySeq
> Seq('GATC GATC', IUPACUnambiguousDNA())
> >>> len(mySeq)
> 9
>
> This is problematic when the sequence length is required (calculating GC
> content, calculating melting temperature, etc.)
>
> While it could be argued that checking the integrity of the sequence data is
> related to parsing, I think that the sequence in the sequence object should
> never contain whitespace, and if an alphabet is assigned it should not contain
> non-alphabet characters. So this should be handled by the sequence object itself?
>

See https://redmine.open-bio.org/issues/2597

To me the obvious approach is to validate this in the Seq object __init__ if and only if the alphabet selected has a letters attribute with the valid characters given. However, this will slow things down and probably break a number of existing scripts. Perhaps a global setting for ignore (current behaviour), warning, or exception? We could make the default the warning for a release or two and then switch to an error.
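
The ignore/warning/exception idea above might be sketched as follows. This is a stand-alone illustration, not Biopython code: `check_letters` and its `on_invalid` argument are hypothetical names, and `letters` stands in for the `letters` attribute of an alphabet (passed as a plain string, or `None` when the alphabet defines no letters).

```python
import warnings


def check_letters(sequence, letters, on_invalid="warn"):
    """Check a sequence against the allowed letters (hypothetical helper).

    on_invalid is one of "ignore", "warn" or "error", mirroring the three
    behaviours discussed in this thread. Nothing is checked when letters
    is None, i.e. when the alphabet does not list its valid characters.
    """
    if on_invalid not in ("ignore", "warn", "error"):
        raise ValueError("on_invalid must be 'ignore', 'warn' or 'error'")
    if letters is None or on_invalid == "ignore":
        return
    invalid = sorted(set(str(sequence)) - set(letters))
    if not invalid:
        return
    message = "Sequence contains characters not in the alphabet: %r" % "".join(invalid)
    if on_invalid == "error":
        raise ValueError(message)
    warnings.warn(message)
```

For example, `check_letters("GATC1234YWSK", "GATC", on_invalid="error")` raises a `ValueError`, while the default setting only warns. Switching the default from "warn" to "error" after a release or two would then be a one-line change.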

Peter

From p.j.a.cock at googlemail.com Sun Dec 18 09:22:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 18 Dec 2011 14:22:54 +0000
Subject: [Biopython-dev] Bio.SeqUtils.MeltingTemp Tm_staluc does not accept sequence object
In-Reply-To: References: Message-ID:

On Sat, Dec 17, 2011 at 10:33 PM, Markus Piotrowski wrote:
> Dear Biopython developers,
>
> It seems that module Tm_staluc from Bio.SeqUtils does not
> accept sequence objects as input but 'just' strings.

Well spotted, but the important information missing from your email was what goes wrong:

>>> from Bio.Seq import Seq
>>> from Bio.SeqUtils.MeltingTemp import Tm_staluc
>>> Tm_staluc(Seq("ACGT"))
Traceback (most recent call last):
...
    i = st.index(p,x)
AttributeError: 'Seq' object has no attribute 'index'
>>> "ACGT".index("G")

That looks like an oversight in the Seq object, which is easily fixed (the unit tests need a bit more work...).

> I think that is somewhat unusual for a 'SeqUtil'; shouldn't sequence utilities
> accept sequence objects?
>
> I also think that the module name 'Tm_staluc' is not self-explanatory. Obviously
> the name is derived from the corresponding author of the paper which describes
> this Tm calculation, SantaLucia. The method is better known (and described so in
> the documentation) as nearest neighbor thermodynamics, so it would better be
> called something like Tm_nn (nearest neighbour).

The docstring could be expanded to explain that. Would you like to suggest an improvement?

> However, taking into consideration that more Tm methods will be added
> in future, they may just be numbered?

Numbered how? Tm_nn1, Tm_nn2, ... seems a bad idea.

Peter

From p.j.a.cock at googlemail.com Sun Dec 18 09:26:32 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 18 Dec 2011 14:26:32 +0000
Subject: [Biopython-dev] Bio.SeqUtils.MeltingTemp Tm_staluc does not accept sequence object
In-Reply-To: References:

Message-ID:

On Sun, Dec 18, 2011 at 2:22 PM, Peter Cock wrote:
> On Sat, Dec 17, 2011 at 10:33 PM, Markus Piotrowski wrote:
>> Dear Biopython developers,
>>
>> It seems that module Tm_staluc from Bio.SeqUtils does not
>> accept sequence objects as input but 'just' strings.
>
> Well spotted, but the important information missing from your
> email was what goes wrong:
>
> >>> from Bio.Seq import Seq
> >>> from Bio.SeqUtils.MeltingTemp import Tm_staluc
> >>> Tm_staluc(Seq("ACGT"))
> Traceback (most recent call last):
> ...
>     i = st.index(p,x)
> AttributeError: 'Seq' object has no attribute 'index'
> >>> "ACGT".index("G")
>
> That looks like an oversight in the Seq object, which
> is easily fixed (the unit tests need a bit more work...).

Actually, I might have been worried about what to do with the existing MutableSeq's index method (which is list/array like, not string like).

Perhaps in the short term a fix to the Tm function is easier.

Peter

From p.j.a.cock at googlemail.com Sun Dec 18 13:01:47 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 18 Dec 2011 18:01:47 +0000
Subject: [Biopython-dev] Bio.SeqUtils.MeltingTemp Tm_staluc does not accept sequence object
In-Reply-To: References:

Message-ID: On Sun, Dec 18, 2011 at 2:26 PM, Peter Cock wrote: >> That looks like an oversight in the Seq object, which >> is easily fixed (the unit tests need a bit more work...). > > Actually, I might have been worried about what to do > with the existing MutableSeq's index method (which > is list/array like not string like). > > Perhaps in the short term a fix to the TM function is > easier. Done: https://github.com/biopython/biopython/commit/cd01acc2cdfb1a3e9e16e5107924168231514842 I noticed that the MeltingTemp module didn't have any unit tests, so turned the existing simple example into a doctest (some stray text in my git commit comment which is a shame): https://github.com/biopython/biopython/commit/a347dce69e0c061280165c33662dd57f49b18f5b Markus, could you test the latest code from github please? Thanks, Peter From Markus.Piotrowski at ruhr-uni-bochum.de Mon Dec 19 07:49:50 2011 From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus) Date: Mon, 19 Dec 2011 12:49:50 +0000 (UTC) Subject: [Biopython-dev] Sequence object allows non-alphabet characters References:

Message-ID: Peter Cock googlemail.com> writes: > > On Sunday, December 18, 2011, Markus Piotrowski < > Markus.Piotrowski ruhr-uni-bochum.de> wrote: > > Dear Biopyhton developers, > > > > I wonder why the following code does not throw an exception: > > > >>>> from Bio.Seq import Seq > >>>> from Bio.Alphabet import IUPAC > >>>> mySeq = Seq("GATC1234YWSK", IUPAC.unambiguous_dna) > >>>> mySeq > > Seq('GATC1234YWSK', IUPACUnambiguousDNA()) > > > > I expected that trying to generate a sequence object containing > non-alphabet > > characters would either throw an exception/warning or "downgrade" the > alphabet, > > if possible. > > > > See https://redmine.open-bio.org/issues/2597 > > To me the obvious approach is to valid this in the Seq object > __init__ if and only if the alphabet selected has a letters > attribute with the valid characters given. However, this will > slow things down and probably break a number of existing > scripts. Perhaps a global setting for ignore (current behaviuor), > warning, or exception? We could make the default the > warning for a release or two and then switch to an error. > > Peter > What about an additional optional option in the sequence object like "validate=True/False" with false as default. This would not break existing code, will not influence speed (if validate=False) but gives the possibility to have the sequence validated against the selected alphabet. In addition, validate=True without an selected alphabet would allow for a basic sequence polishing, like setting uppercase and removing whitespaces and digits (any non-alphabetic characters?). From p.j.a.cock at googlemail.com Mon Dec 19 09:02:25 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 14:02:25 +0000 Subject: [Biopython-dev] Sequence object allows non-alphabet characters In-Reply-To: References:

Message-ID: On Mon, Dec 19, 2011 at 12:49 PM, Markus wrote: > > What about an additional optional option in the sequence object > like "validate=True/False" with false as default. This would not > break existing code, will not influence speed (if validate=False) > but gives the possibility to have the sequence validated against > the selected alphabet. That could work, although there would still be trivial speed impact (the extra if statement), but it shouldn't really hurt. However, most Seq objects are created not directly by the user, but via SeqIO. I suppose that could get another argument for Seq construction... > In addition, validate=True without an selected alphabet would allow > for a basic sequence polishing, like setting uppercase and removing > whitespaces and digits (any non-alphabetic characters?). Things like mixed case are actually useful, and so too are extra symbols. Peter From eric.talevich at gmail.com Mon Dec 19 19:49:28 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 19 Dec 2011 16:49:28 -0800 Subject: [Biopython-dev] Sequence object allows non-alphabet characters In-Reply-To: References:

Message-ID: On Mon, Dec 19, 2011 at 6:02 AM, Peter Cock wrote: > On Mon, Dec 19, 2011 at 12:49 PM, Markus > wrote: > > > > What about an additional optional option in the sequence object > > like "validate=True/False" with false as default. This would not > > break existing code, will not influence speed (if validate=False) > > but gives the possibility to have the sequence validated against > > the selected alphabet. > > That could work, although there would still be trivial speed impact > (the extra if statement), but it shouldn't really hurt. However, most > Seq objects are created not directly by the user, but via SeqIO. > I suppose that could get another argument for Seq construction... > > As another alternative, you could add a method Seq.validate() which must be called separately. Then you'd have a way to trigger validation even after directly setting seq.data or .alphabet. -E From Markus.Piotrowski at ruhr-uni-bochum.de Tue Dec 20 09:48:25 2011 From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski) Date: Tue, 20 Dec 2011 14:48:25 +0000 (UTC) Subject: [Biopython-dev] Sequence object allows non-alphabet characters References:

Message-ID: Eric Talevich gmail.com> writes: > As another alternative, you could add a method Seq.validate() which must be > called separately. Then you'd have a way to trigger validation even after > directly setting seq.data or .alphabet. > > -E > There is a function _verify_alphabet(sequence) in the package Alphabet, which does exactly this. However, the example given in the API documentation doesn't work for me: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq ("MKQHK", IUPAC.protein) >>> _verify_alphabet(my_seq) Traceback (most recent call last): File "", line 1, in _verify_alphabet(my_seq) NameError: name '_verify_alphabet' is not defined >>> from Bio import Alphabet >>> Alphabet._verify_alphabet(my_seq) True Still, I would prefer to have checked the sequence against the choosen alphabet during initialization, maybe as option: Seq(sequence[, alphabet, verify]) Markus From p.j.a.cock at googlemail.com Tue Dec 20 10:27:00 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 20 Dec 2011 15:27:00 +0000 Subject: [Biopython-dev] Sequence object allows non-alphabet characters In-Reply-To: References:

Message-ID: On Tue, Dec 20, 2011 at 2:48 PM, Markus Piotrowski wrote: > Eric Talevich gmail.com> writes: > >> As another alternative, you could add a method Seq.validate() which must be >> called separately. Then you'd have a way to trigger validation even after >> directly setting seq.data or .alphabet. >> >> -E >> > > There is a function _verify_alphabet(sequence) in the package Alphabet, > which does exactly this. That starts with an underscore so it is a private API. > However, the example given in the API documentation doesn't > work for me: > >>>> from Bio.Seq import Seq >>>> from Bio.Alphabet import IUPAC >>>> my_seq = Seq ("MKQHK", IUPAC.protein) >>>> _verify_alphabet(my_seq) > > Traceback (most recent call last): > ?File "", line 1, in > ? ?_verify_alphabet(my_seq) > NameError: name '_verify_alphabet' is not defined You didn't import the function, thus a NameError >>>> from Bio import Alphabet >>>> Alphabet._verify_alphabet(my_seq) > True > > Still, I would prefer to have checked the sequence against the > choosen alphabet during initialization, maybe as option: > Seq(sequence[, alphabet, verify]) Yes, there are certainly advantages to having the alphabet validation happen during Seq object creation. Peter From Markus.Piotrowski at ruhr-uni-bochum.de Wed Dec 21 07:22:55 2011 From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski) Date: Wed, 21 Dec 2011 12:22:55 +0000 (UTC) Subject: [Biopython-dev] =?utf-8?q?Bio=2ESeqUtils=2EMeltingTemp_Tm=5Fstalu?= =?utf-8?q?c_does_not_accept_sequence_object?= References:

Message-ID: Peter Cock googlemail.com> writes: > > On Sun, Dec 18, 2011 at 2:26 PM, Peter Cock googlemail.com> wrote: > >> That looks like an oversight in the Seq object, which > >> is easily fixed (the unit tests need a bit more work...). > > > > Actually, I might have been worried about what to do > > with the existing MutableSeq's index method (which > > is list/array like not string like). > > > > Perhaps in the short term a fix to the TM function is > > easier. > > Done: > https://github.com/biopython/biopython/commit /cd01acc2cdfb1a3e9e16e5107924168231514842 > Markus, could you test the latest code from github please? > > Thanks, > > Peter OK, worked for me: >>> from Bio.Seq import Seq >>> import MeltingTemp #This is the latest code from github >>> mySeq = Seq("ATGGCGCTCGTCCCAGCACC") >>> MeltingTemp.Tm_staluc(mySeq) 61.837438978725686 >>> myString = "ATGGCGCTCGTCCCAGCACC" >>> MeltingTemp.Tm_staluc(myString) 61.837438978725686 I'm still worried about whitespaces, which, in this method, would influence the calculation by three ways: 1. In line 169 (of the latest code) len(s) is used in a calculation. 2. The method itself consists of taking thermodynamic values of neighboring bases. Thus "GC" gives a different value than ("G C"), since in the latter case G and C are not neighbored. 3. Also leading and trailing spaces would give wrong calculations since the function tercorr (lines 59-101) uses startswith and endswith methods to look for the first and last base, respectively. 
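
Point 2 above can be illustrated with a short stand-alone sketch. It is not the Tm_staluc code: `neighbour_pairs` and `strip_whitespace` are hypothetical helpers that only show how an embedded space changes the set of neighbouring pairs a nearest-neighbour method would look up, and how the sequence length changes with it.

```python
def neighbour_pairs(seq):
    """Return all overlapping two-character pairs of seq, in order."""
    s = str(seq).upper()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def strip_whitespace(seq):
    """Remove every whitespace character (spaces, tabs, newlines)."""
    return "".join(str(seq).split())


clean = "ATGGCG"
spaced = "ATG GCG"

# The clean sequence gives the pairs AT, TG, GG, GC, CG.
print(neighbour_pairs(clean))
# With the space, GG is lost and two pairs containing a blank appear.
print(neighbour_pairs(spaced))
# The lengths differ as well, which affects any length-dependent term.
print(len(clean), len(spaced))
# After stripping whitespace both inputs give the same pairs again.
print(neighbour_pairs(strip_whitespace(spaced)) == neighbour_pairs(clean))
```

Stripping with `str.split()` and `"".join()` is used here because it behaves the same on Python 2 and Python 3 and removes leading, trailing and embedded whitespace in one step.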
Example (same sequence as above with space): >>> mySeq = Seq("ATG GCGCTCGTCCCAGCACC") >>> MeltingTemp.Tm_staluc(mySeq) 58.16633757439382 I would suggest to change line 117 to sup = str(s).upper().translate(None, string.whitespace) which requires import of the string module, or simpler: sup = str(s).upper().replace(" ","") and changing line 169 to: ds = ds-0.368*(len(sup)-1)*math.log(saltc/1e3) #instead of len(s) Putting these changes into MeltingTempN: >>> import MeltingTempN >>> mySeq = Seq("ATGGCGCTCGTCCCAGCACC") >>> MeltingTempN.Tm_staluc(mySeq) 61.837438978725686 >>> mySeq = Seq("ATG GCGCTCGTCCCAGCACC") #space in sequence >>> MeltingTempN.Tm_staluc(mySeq) 61.837438978725686 >>> mySeq = Seq(" ATGGCGCTCGTCCCAGCACC") #leading space >>> MeltingTempN.Tm_staluc(mySeq) 61.837438978725686 It is of course a more philosophical question if the method should catch these problems or if the method can rely on a 'proper' sequence (i.e., without blanks). Markus From p.j.a.cock at googlemail.com Wed Dec 21 07:39:02 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 21 Dec 2011 12:39:02 +0000 Subject: [Biopython-dev] Bio.SeqUtils.MeltingTemp Tm_staluc does not accept sequence object In-Reply-To: References:

Message-ID: On Wed, Dec 21, 2011 at 12:22 PM, Markus Piotrowski wrote: > > I'm still worried about whitespaces, which, in this method, would influence the > calculation by three ways: > ... > > It is of course a more philosophical question if the method should catch these > problems or if the method can rely on a 'proper' sequence (i.e., without blanks). I would say it is reasonable for the melting temperature function to assume sensible sequences as input, and leave validation to the user. Peter From redmine at redmine.open-bio.org Thu Dec 22 21:51:01 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 23 Dec 2011 02:51:01 +0000 Subject: [Biopython-dev] [Biopython - Bug #3316] (New) Phylo/NewickIO fails to write non-leaf names with newick output Message-ID: Issue #3316 has been reported by Nicola Segata. ---------------------------------------- Bug #3316: Phylo/NewickIO fails to write non-leaf names with newick output https://redmine.open-bio.org/issues/3316 Author: Nicola Segata Status: New Priority: Normal Assignee: Category: Target version: URL: If I read the following newick tree

(A,B,(C,D)E)F;

and then I write it again in newick format I get

(A:1.00000,B:1.00000,(C:1.00000,D:1.00000):1.00000):1.00000;

which is missing the names of the internal nodes E and F. If I export it in PhyloXML I get the right names:

---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From macrozhu at gmail.com Fri Dec 2 09:41:30 2011 From: macrozhu at gmail.com (=?UTF-8?B?SG9uZ2JvIFpodSDmnLHlro/ljZo=?=) Date: Fri, 2 Dec 2011 10:41:30 +0100 Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ? Message-ID: Hi, I propose to add slicing to class Bio.PDB.Chain by changing function Bio.PDB.Chain.__getitem__(). * Why is slicing necessary for Bio.PDB.Chain? Protein domain definitions are usually presented as the starting and ending positions of the domain in protein primary structures, e.g. in SCOP, or CATH. Slicing comes in handy when extracting domains from PDB files. * Why is slicing not available at the moment? I understand that the majority of Bio.PDB.Entity objects are not lists. And there is not internal *sequential order* for the child entities in these objects. For example, In Bio.PDB.Model, its child Chain entities do not really have a sequential order within Model. Slicing seems not make sense. But Bio.PDB.Chain is exceptional: Residue entities in Bio.PDB.Chain have a sequence order as presented in the primary structure and slicing becomes a reasonable operation. * How to slice a Chain entity? I think it can be realized by revising the function Bio.PDB.Chain.__getitem__(). For example: def __getitem__(self, id): """Return the residue with given id. The id of a residue is (hetero flag, sequence identifier, insertion code). If id is an int, it is translated to (" ", id, " ") by the _translate_id method. 
Arguments: o id - (string, int, string) or int """ if isinstance(id, slice): res_id_list = [r.id for r in self.get_iterator()] if id.start is not None: start_index = res_id_list.index(self._translate_id(id.start)) else: start_index = 0 stop_index = res_id_list.index(self._translate_id(id.stop)) return self.get_list()[start_index:stop_index:id.step] else: id=self._translate_id(id) return Entity.__getitem__(self, id) From anaryin at gmail.com Fri Dec 2 10:32:27 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 2 Dec 2011 11:32:27 +0100 Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ? In-Reply-To: References: Message-ID: Hey Hongbo, Interesting idea, but couldn't it be done already with child_list in a more or less straightforward manner? Best, Jo?o No dia 2 de Dez de 2011 10:43, "Hongbo Zhu ???" escreveu: > Hi, > > I propose to add slicing to class Bio.PDB.Chain by changing function > Bio.PDB.Chain.__getitem__(). > > * Why is slicing necessary for Bio.PDB.Chain? > Protein domain definitions are usually presented as the starting and ending > positions of the domain in protein primary structures, e.g. in SCOP, or > CATH. Slicing comes in handy when extracting domains from PDB files. > > * Why is slicing not available at the moment? > I understand that the majority of Bio.PDB.Entity objects are not lists. And > there is not internal *sequential order* for the child entities in these > objects. For example, In Bio.PDB.Model, its child Chain entities do not > really have a sequential order within Model. Slicing seems not make sense. > But Bio.PDB.Chain is exceptional: Residue entities in Bio.PDB.Chain have a > sequence order as presented in the primary structure and slicing becomes a > reasonable operation. > > * How to slice a Chain entity? > I think it can be realized by revising the > function Bio.PDB.Chain.__getitem__(). For example: > > def __getitem__(self, id): > """Return the residue with given id. 
> > The id of a residue is (hetero flag, sequence identifier, insertion > code). > If id is an int, it is translated to (" ", id, " ") by the > _translate_id > method. > > Arguments: > o id - (string, int, string) or int > """ > if isinstance(id, slice): > res_id_list = [r.id for r in self.get_iterator()] > if id.start is not None: > start_index = > res_id_list.index(self._translate_id(id.start)) > else: > start_index = 0 > stop_index = res_id_list.index(self._translate_id(id.stop)) > return self.get_list()[start_index:stop_index:id.step] > else: > id=self._translate_id(id) > return Entity.__getitem__(self, id) > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From macrozhu at gmail.com Fri Dec 2 12:43:59 2011 From: macrozhu at gmail.com (=?UTF-8?B?SG9uZ2JvIFpodSDmnLHlro/ljZo=?=) Date: Fri, 2 Dec 2011 13:43:59 +0100 Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ? In-Reply-To: References:

Message-ID:

Hi, Joao,

thanks for the response. When I spoke of slicing Bio.PDB.Chain, I meant to slice it using residue id, not list index. And these two ways are fundamentally different.

For instance:

not only slicing like this:
or
    chain.child_list[2:12]  # slice using list index

but also slicing like this:

    chain[2:12]  # slice using residue sequence id, not feasible at the moment
                 # NOTE: this is fundamentally different from chain.child_list[2:12]
or even:
    chain[(' ', 2, ' ') : (' ', 12, ' ')]  # slice using residue full id, even better

Of course one can play with child_list and obtain the same outcome. But I think it would be very convenient to implement it in the __getitem__() function.

cheers,hongbo

2011/12/2 João Rodrigues > Hey Hongbo, > > Interesting idea, but couldn't it be done already with child_list in a > more or less straightforward manner? > > Best, > > João > On 2 Dec 2011 at 10:43, "Hongbo Zhu 朱宏博" > wrote: > >> Hi, >> >> I propose to add slicing to class Bio.PDB.Chain by changing function >> Bio.PDB.Chain.__getitem__(). >> >> * Why is slicing necessary for Bio.PDB.Chain? >> Protein domain definitions are usually presented as the starting and >> ending >> positions of the domain in protein primary structures, e.g. in SCOP, or >> CATH. Slicing comes in handy when extracting domains from PDB files. >> >> * Why is slicing not available at the moment? >> I understand that the majority of Bio.PDB.Entity objects are not lists. >> And >> there is not internal *sequential order* for the child entities in these >> objects. For example, In Bio.PDB.Model, its child Chain entities do not >> really have a sequential order within Model. Slicing seems not make sense. >> But Bio.PDB.Chain is exceptional: Residue entities in Bio.PDB.Chain have a >> sequence order as presented in the primary structure and slicing becomes a >> reasonable operation. >> >> * How to slice a Chain entity?
>> I think it can be realized by revising the >> function Bio.PDB.Chain.__getitem__(). For example: >> >> def __getitem__(self, id): >> """Return the residue with given id. >> >> The id of a residue is (hetero flag, sequence identifier, insertion >> code). >> If id is an int, it is translated to (" ", id, " ") by the >> _translate_id >> method. >> >> Arguments: >> o id - (string, int, string) or int >> """ >> if isinstance(id, slice): >> res_id_list = [r.id for r in self.get_iterator()] >> if id.start is not None: >> start_index = >> res_id_list.index(self._translate_id(id.start)) >> else: >> start_index = 0 >> stop_index = res_id_list.index(self._translate_id(id.stop)) >> return self.get_list()[start_index:stop_index:id.step] >> else: >> id=self._translate_id(id) >> return Entity.__getitem__(self, id) >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > -- Hongbo From p.j.a.cock at googlemail.com Mon Dec 5 10:45:29 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Dec 2011 10:45:29 +0000 Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ? In-Reply-To: References:

Message-ID: 2011/12/2 Hongbo Zhu 朱宏博 : > Hi, Joao, > > thanks for the response. When I spoke of slicing Bio.PDB.Chain, I meant to > slice it using residue id, not list index. And these two ways are > fundamentally different. > > For instance : > > not only slicing like this: > or > chain.child_list[2:12] # slice using list index > > but also slicing like this: > > chain[2:12] # slice using residue sequence id, not feasible at the moment > # NOTE: this is fundamentally different from > chain.child_list[2:12] > or even: > chain[(' ', 2, ' ') : (' ', 12, ' ')] # slice using residue full id, even > better > > Of course one can play with child_list and obtain the same outcome. But I > think it would be very convenient to implement it in the __getitem__() > function. > > cheers,hongbo Hi Hongbo, I agree defining integer-based slicing of Chain objects sounds like a good idea. Could you write a couple of unit tests for the new slicing please (in file Tests/test_PDB.py)? You can just give code snippets, a patch, or create a branch on GitHub if you would prefer. Does it make sense to consider __add__ for the Chain as well? Peter From macrozhu at gmail.com Mon Dec 5 11:46:59 2011 From: macrozhu at gmail.com (=?UTF-8?B?SG9uZ2JvIFpodSDmnLHlro/ljZo=?=) Date: Mon, 5 Dec 2011 12:46:59 +0100 Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ? In-Reply-To: References:

Message-ID:

Hi, Peter,

I just realized a special issue concerning slicing Bio.PDB.Chain. Normally, in Python a slice is given by three arguments: start, stop and step, where the element at position *stop* is not included in the output. For example,

    mylist[2:40:1] would return: [ mylist[2], mylist[3], ..., mylist[39] ]

But in CATH and SCOP, sequence segments composing domains are given as start and end position. And the residue at the end position is also included in the domain definition. E.g. if a domain is defined to be from residue (' ', 1, ' ') to residue (' ', 40, ' '), a slicing like this mychain[(' ', 2, ' '): (' ', 40, ' ')] or mychain[2:40] would not include residue (' ',40,' '). And it is not certain that mychain[(' ', 2, ' '): (' ', 41, ' ')] would give the correct outcome, because the residue after (' ',40,' ') does not necessarily have to be (' ',41,' ').

Of course we can change the code in __getitem__() such that it includes the end position. But then it is against the general Python convention of slicing. So I think an independent function is perhaps needed:

class Chain(Entity):

    def get_slice(self, start, end, step=None):
        """Return a slice of the chain from start to end (including end position)

        Arguments:
        o start - (string, int, string) or int
        o end - (string, int, string) or int
        o step - None or int
        """
        res_id_list = [r.id for r in self.get_iterator()]
        start_index = res_id_list.index(self._translate_id(start))
        # Look up the "end" argument and add 1 so the end residue is included.
        end_index = res_id_list.index(self._translate_id(end))
        return self.get_list()[start_index:end_index + 1:step]

And for the overload of operator __add__(), is it for the concatenation of chain segments? I think it is very important (if I chop the sequence into pieces, I should also be able to glue them back together, right?). But this implies the function get_slice should return a Chain instance, not just a list of Residue instances, right?
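The inclusive, ID-based slicing described above can be tried without Biopython. The following is only a minimal sketch under stated assumptions: ToyChain is a hypothetical stand-in that stores residue IDs in order, not the real Bio.PDB.Chain, and its _translate_id only imitates the int-to-tuple shorthand mentioned in this thread.

```python
# Hypothetical stand-in for Bio.PDB.Chain: only the ordered residue IDs matter here.
class ToyChain:
    def __init__(self, residue_ids):
        # Full IDs are (hetero flag, sequence number, insertion code).
        self.residue_ids = list(residue_ids)

    @staticmethod
    def _translate_id(res_id):
        # Imitates the int -> (" ", id, " ") shorthand described in the thread.
        if isinstance(res_id, int):
            return (" ", res_id, " ")
        return res_id

    def get_slice(self, start, end, step=None):
        """Return the residue IDs from start to end, with end included."""
        start_index = self.residue_ids.index(self._translate_id(start))
        end_index = self.residue_ids.index(self._translate_id(end))
        return self.residue_ids[start_index:end_index + 1:step]


# Residue 41 is absent and 40 is followed by an insertion code, so a
# half-open slice ending at "41" could not express "up to and including 40".
chain = ToyChain([(" ", 38, " "), (" ", 39, " "), (" ", 40, " "),
                  (" ", 40, "A"), (" ", 42, " ")])
domain = chain.get_slice(39, 40)
print(domain)  # [(' ', 39, ' '), (' ', 40, ' ')]
```

Looking up both boundaries by ID and adding 1 to the end index sidesteps the question of which residue follows the end position.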
--Hongbo > > Hi Hongbo, > > I agree defining integer based slicing of Chain objects sounds like a good > idea. > > Could you write a couple of unit tests for the new slicing please (in > file Tests/test_PDB.py)? You can just give code snippets, a patch, or > create a branch on github it you would prefer. > > Does it make sense to consider __add__ for the Chain as well? > > Peter > -- Hongbo From p.j.a.cock at googlemail.com Mon Dec 5 12:15:55 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Dec 2011 12:15:55 +0000 Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ? In-Reply-To: References:

Message-ID: On Mon, Dec 5, 2011 at 11:46 AM, Hongbo Zhu 朱宏博 wrote: > Hi, Peter, > > I just realized a special issue concerning slicing Bio.PDB.Chain. > Normally, in python a slice is given by three arguments: start, stop and > step, where the element at position *stop* is not included in the output. > For example, > > mylist[2:40:1] would return: [ mylist[2],mylist[3], ...., mylist[39] ] > Yes, > But in CATH and SCOP, sequence segments composing domains > are given as start and end position. And the residue at the end > position is also included in the domain definition. OK. I'd have to double check what our parsers return (and if they convert the start/end into C/Python style). > e.g. if a domain > is defined to be from residue (' ', 1, ' ') to residue (' ', 40, ' '), a slicing > like this mychain[(' ', 2, ' '): (' ', 40, ' ')] or mychain[2:40] would not > include residue (' ',40,' '). Perhaps I misunderstood - I would not want to allow the syntax mychain[(' ', 2, ' '): (' ', 40, ' ')] which is unclear, rather only allow the user to use mychain[2:41] which requires Python counting. Peter From macrozhu at gmail.com Mon Dec 5 13:38:09 2011 From: macrozhu at gmail.com (=?UTF-8?B?SG9uZ2JvIFpodSDmnLHlro/ljZo=?=) Date: Mon, 5 Dec 2011 14:38:09 +0100 Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ? In-Reply-To: References:

Message-ID: > But in CATH and SCOP, sequence segments composing domains > > are given as start and end position. And the residue at the end > > position is also included in the domain definition. > > OK. I'd have to double check what our parsers return (and if > they convert the start/end into C/Python style). > > > e.g. if a domain > > is defined to be from residue (' ', 1, ' ') to residue (' ', 40, ' '), a > slicing > > like this mychain[(' ', 2, ' '): (' ', 40, ' ')] or mychain[2:40] would > not > > include residue (' ',40,' '). > > Perhaps I misunderstood - I would not want to allow the syntax > mychain[(' ', 2, ' '): (' ', 40, ' ')] which is unclear, rather only allow > the user to use mychain[2:41] which requires Python counting. > > But even in mychain[2:41], the 2 and 41 should be residue sequence numbers. Then it is consistent with the currently accepted syntax mychain[2], where 2 also refers to a sequence number. At the moment, Biopython also accepts mychain[(' ', 2, ' ')]. So I think mychain[(' ', 2, ' '): (' ', 40, ' ')] would be just a natural extension of mychain[(' ', 2, ' ')]. According to the source code, mychain[2] is considered an abbreviation of mychain[(' ', 2, ' ')]. Internally, mychain[2] will be translated to mychain[(' ', 2, ' ')] by the method Bio.PDB.Chain._translate_id(). So if mychain[2:40] were allowed, internally it would also first be translated to mychain[(' ', 2, ' '): (' ', 40, ' ')]. So from my point of view, mychain[2:40] is just an abbreviation for mychain[(' ', 2, ' '): (' ', 40, ' ')], just like mychain[2] is a short version of mychain[(' ',2,' ')]. hongbo > Peter > -- Hongbo From p.j.a.cock at googlemail.com Mon Dec 5 13:50:44 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Dec 2011 13:50:44 +0000 Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ? In-Reply-To: References:

Message-ID: On Mon, Dec 5, 2011 at 1:38 PM, Hongbo Zhu 朱宏博 wrote: > >> Perhaps I misunderstood - I would not want to allow the syntax >> mychain[(' ', 2, ' '): (' ', 40, ' ')] which is unclear, rather only allow >> the user to use mychain[2:41] which requires Python counting. > > But even in mychain[2:41], the 2 and 41 should be residue sequence number. > Then it is consistent with the current acceptable syntax mychain[2], where 2 > also refers to a sequence number. At the moment, BioPython also > accepts mychain[(' ', 2, ' ')]. So I think mychain[(' ', 2, ' '): (' ', 40, > ' ')] would be just a nature extension of mychain[(' ', 2, ' ')]. > > According to the source code, mychain[2] is considered an abbreviation of > mychain[(' ', 2, ' ')]. Internally, mychain[2] will be translated to > mychain[(' ', 2, ' ')] by function Bio.PDB.Chain.__translate_id(). So if > mychain[2:4] would be allowed, internally it would also > be first translated to mychain[(' ', 2, ' '): (' ', 40, ' ')]. So in my > point of view, mychain[2:4] is just an abbreviation for mychain[(' ', 2, ' > '): (' ', 40, ' ')], just like mychain[2] is a short version of mychain[(' > ',2,' ')]. > > hongbo I've never really liked these strange tuple IDs, which are usually but not always full of empty values. I understand some of the corner cases they handle, but they are very complicated. You cannot assume 2 will map to (' ', 2, ' ') in general - this is what the _translate_id method handles. Consider the case where you have sliced the Chain as discussed, since the first two elements have been removed, that mapping will shift. We definitely would need a test case covering non-trivial ID tuples (e.g. using insertion codes), and tests slicing a previously sliced Chain. Peter From macrozhu at gmail.com Mon Dec 5 14:48:25 2011 From: macrozhu at gmail.com (=?UTF-8?B?SG9uZ2JvIFpodSDmnLHlro/ljZo=?=) Date: Mon, 5 Dec 2011 15:48:25 +0100 Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ?
In-Reply-To: References:

Message-ID: On Mon, Dec 5, 2011 at 2:50 PM, Peter Cock wrote: > On Mon, Dec 5, 2011 at 1:38 PM, Hongbo Zhu ??? wrote: > > > >> Perhaps I misunderstood - I would not want to allow the syntax > >> mychain[(' ', 2, ' '): (' ', 40, ' ')] which is unclear, rather only > allow > >> the user to use mychain[2:41] which requires Python counting. > > > > But even in mychain[2:41], the 2 and 41 should be residue sequence > number. > > Then it is consistent with the current acceptable syntax mychain[2], > where 2 > > also refers to a sequence number. At the moment, BioPython also > > accepts mychain[(' ', 2, ' ')]. So I think mychain[(' ', 2, ' '): (' ', > 40, > > ' ')] would be just a nature extension of mychain[(' ', 2, ' ')]. > > > > According to the source code, mychain[2] is considered an abbreviation of > > mychain[(' ', 2, ' ')]. Internally, mychain[2] will be translated to > > mychain[(' ', 2, ' ')] by function Bio.PDB.Chain.__translate_id(). So if > > mychain[2:4] would be allowed, internally it would also > > be first translated to mychain[(' ', 2, ' '): (' ', 40, ' ')]. So in my > > point of view, mychain[2:4] is just an abbreviation for mychain[(' ', 2, > ' > > '): (' ', 40, ' ')], just like mychain[2] is a short version of > mychain[(' > > ',2,' ')]. > > > > hongbo > > I've never really liked these strange tuple IDs, which are usually > but not always full of empty values. I understand some of > the corner cases they handle, but they are very complicated. > This seems to be the problem of PDB. I don't know how other packages handle the issue. In addition, I once proposed to remove the HETERO-flag in the residue ID. http://biopython.org/pipermail/biopython-dev/2011-January/008640.html It is only retained for the backwards compatibility with PDB files before remediation in 2007. Removing only HETERO-flag does not solve the problem totally, but to some extent (say, around 50%). 
I think one possible solution is to always treat the residue ID as a string '%4d%s' % (resnum, icode) instead of a tuple composed of resnum plus icode (if we do not consider the HETERO-flag). This way, Biopython also serves to remind users that icode is indispensable for uniquely locating a residue even if icode is empty. But this would lead to a formidable API change.

> You cannot assume 2 will map to (' ', 2, ' ') in general - this
> is what the _translate_id method handles. Consider the case
> where you have sliced the Chain as discussed, since the
> first two elements have been removed, that mapping will shift.
>

I am afraid I don't concur. As a matter of fact, the mapping is internally fine-tuned by comparing the input residue ID to a list of all residue IDs, so the correct index values are obtained for the slicing:

        res_id_list = [r.id for r in self.get_iterator()]
        start_index = res_id_list.index(self._translate_id(start))
        stop_index = res_id_list.index(self._translate_id(stop))
        return self.get_list()[start_index:stop_index:step]

It is like mychain[2]: this 2 will first be translated to (' ',2,' ') and this residue ID is searched for in the dictionary indexed by all residue IDs to locate the correct residue.

> We definitely would need a test case covering non-trivial
> ID tuples (e.g. using insertion codes), and tests slicing a
> previously sliced Chain.
>

PDB entry 1h4w is a good example with icode, and the sequence of chain A starts with resnum 16.
def get_slice(self, start, end, step=None): """Return a slice of the chain from start to end (including end position) Arguments: o start - (string, int, string) or int o end - (string, int, string) or int o step - None or int """ res_id_list = [r.id for r in self.get_iterator()] start_index = res_id_list.index(self._translate_id(start)) end_index = res_id_list.index(self._translate_id(end)) return self.get_list()[start_index:end_index+1:step] In [66]: mychain.get_slice(182,189) Out[66]: [, , , , , , , , , ] > Peter > -- Hongbo From p.j.a.cock at googlemail.com Mon Dec 5 15:53:36 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Dec 2011 15:53:36 +0000 Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ? In-Reply-To: References:

Message-ID:

On Mon, Dec 5, 2011 at 2:48 PM, Hongbo Zhu 朱宏博 wrote:
>
> On Mon, Dec 5, 2011 at 2:50 PM, Peter Cock wrote:
>>
>> I've never really liked these strange tuple IDs, which are usually
>> but not always full of empty values. I understand some of
>> the corner cases they handle, but they are very complicated.
>
>
> This seems to be the problem of PDB.

Yes.

> I don't know how other packages handle the issue.
> In addition, I once proposed to remove the HETERO-flag in the residue ID.
> http://biopython.org/pipermail/biopython-dev/2011-January/008640.html
> It is only retained for the backwards compatibility with PDB files before
> remediation in 2007. Removing only HETERO-flag does not solve
> the problem totally, but to some extent (say, around 50%).

Breaking the API without making the ID much easier to use is a bad idea.

> PDB entry 1h4w is a good example with icode and the sequence of chain A
> starts with resnum 16.

That shows the problem nicely,

>>> from Bio import PDB
>>> structure = PDB.PDBParser().get_structure("1h4w", "1h4w.pdb")
>>> chain = structure[0]['A']
>>> len(chain)
351
>>> chain[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/PDB/Chain.py", line 67, in __getitem__
    return Entity.__getitem__(self, id)
  File "Bio/PDB/Entity.py", line 38, in __getitem__
    return self.child_dict[id]
KeyError: (' ', 0, ' ')

However, you can access the first residue like this:

>>> chain[16]

Likewise,

>>> for index, residue in enumerate(chain):
...     print index, residue
...     assert chain[index] == residue
...
0
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "Bio/PDB/Chain.py", line 67, in __getitem__
    return Entity.__getitem__(self, id)
  File "Bio/PDB/Entity.py", line 38, in __getitem__
    return self.child_dict[id]
KeyError: (' ', 0, ' ')

So as you say, the current implementation does map an integer index to the middle field of the ID tuple, rather than the position in the list as I had assumed.
Sadly this means it is incompatible with Pythonic slicing, so we can't extend __getitem__ to offer that. Peter From macrozhu at gmail.com Mon Dec 5 16:22:42 2011 From: macrozhu at gmail.com (=?UTF-8?B?SG9uZ2JvIFpodSDmnLHlro/ljZo=?=) Date: Mon, 5 Dec 2011 17:22:42 +0100 Subject: [Biopython-dev] slicing in Bio.PDB.Chain.__getitem__() ? In-Reply-To: References:

Message-ID: > > > > PDB entry 1h4w is a good example with icode and the sequence of chain A > > starts with resnum 16. > > That shows the problem nicely, > > >>> from Bio import PDB > >>> structure = PDB.PDBParser().get_structure("1h4w", "1h4w.pdb") > >>> chain = structure[0]['A'] > >>> len(chain) > 351 > >>> chain[0] > Traceback (most recent call last): > File "", line 1, in > File "Bio/PDB/Chain.py", line 67, in __getitem__ > return Entity.__getitem__(self, id) > File "Bio/PDB/Entity.py", line 38, in __getitem__ > return self.child_dict[id] > KeyError: (' ', 0, ' ') > > However, you can access the first residue like this: > > >>> chain[16] > > > Likewise, > > >>> for index, residue in enumerate(chain): > ... print index, residue > ... assert chain[index] == residue > ... > 0 > Traceback (most recent call last): > File "", line 3, in > File "Bio/PDB/Chain.py", line 67, in __getitem__ > return Entity.__getitem__(self, id) > File "Bio/PDB/Entity.py", line 38, in __getitem__ > return self.child_dict[id] > KeyError: (' ', 0, ' ') > > So as you say, the current implementation does map > an integer index to the middle field of the ID tuple, > rather than the position in the list as I had assumed. > Sadly this means it is incompatible with Pythonic > slicing, so we can't extend __getitem__ to offer that. > > Interesting! I was thinking of the problem from a different angle: slicing is just a natural extension from __getitem__, like in pythonic list. And I think the current implementation is a great realization of pythonic list in the special case of protein chain. But since my proposal has another conflict with pythonic slicing, i.e., the ambiguity about the ending position of sequence segments, I prefer to implement the slicing as an independent function get_slice(start, end), if not in Bio.PDB.Chain, then in my own code. Thanks a lot for the helpful discussion! 
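The distinction this thread turns on, a residue's sequence number versus its position in the child list, can be reproduced with plain Python containers. This is a sketch under stated assumptions: the data and the lookup helper are hypothetical toys that only imitate the dictionary-keyed access described above, not Bio.PDB code.

```python
# Toy data only: five residues numbered 16..20, as in a chain that starts at 16.
residue_ids = [(" ", number, " ") for number in range(16, 21)]
child_list = ["residue-%d" % number for number in range(16, 21)]
child_dict = dict(zip(residue_ids, child_list))


def lookup(key):
    """Dictionary-style lookup: an int is shorthand for (" ", key, " ")."""
    if isinstance(key, int):
        key = (" ", key, " ")
    return child_dict[key]


first_by_position = child_list[0]   # positional access, like a Python list
first_by_number = lookup(16)        # keyed access by residue sequence number

try:
    lookup(0)
    zero_is_valid_key = True
except KeyError:
    zero_is_valid_key = False       # no residue is numbered 0 in this toy chain

print(first_by_position == first_by_number, zero_is_valid_key)  # True False
```

Because the integer is a key rather than an offset, a half-open slice over such keys has no obvious meaning when numbers are missing or carry insertion codes, which is one argument for a separate get_slice(start, end) method.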
-- Hongbo From redmine at redmine.open-bio.org Mon Dec 5 17:33:55 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 5 Dec 2011 17:33:55 +0000 Subject: [Biopython-dev] [Biopython - Feature #3236] Make Biopython work in PyPy 1.5 References: Message-ID: Issue #3236 has been updated by Peter Cock. I have found and fixed a few handle leaks, which means as of https://github.com/biopython/biopython/commit/65da2fe99d923c5a69a6bfa2ed3b3375496d4826 there are now no ResourceWarning messages from Python 3.2 using: python3 -W all test_SeqIO_index.py However, despite that, running that test alone still fails on Mac OS X under PyPy (1.6 and) 1.7 with IOError: [Errno 24] Too many open files: ... See also https://bugs.pypy.org/issue828 which may possibly be related. ---------------------------------------- Feature #3236: Make Biopython work in PyPy 1.5 https://redmine.open-bio.org/issues/3236 Author: Eric Talevich Status: In Progress Priority: Low Assignee: Biopython Dev Mailing List Category: Target version: URL: PyPy is now roughly as production-ready as Jython: http://morepypy.blogspot.com/2011/04/pypy-15-released-catching-up.html Let's make Biopython work on PyPy 1.5. To make the pure-Python core of Biopython work, I did this: * Download and unpack the pre-compiled Linux tarball from pypy.org * Copy the header file @marshal.h@ from the CPython 2.X installation into the @pypy-c-.../include/@ directory * pypy setup.py build; pypy setup.py install * Delete pypy-c-.../site-packages/Bio/cpairwise2*.so Benchmarking a script that leans heavily on Bio.pairwise2, I see about a 2x speedup between Pypy 1.5 and CPython 2.6 -- yes, that's with the compiled C extension @cpairwise2@ in the CPython 2.6 installation. Numpy isn't available on PyPy yet, and it may be some time before it does. 
Observations from @pypy setup.py test@: * test_BioSQL triggers tons of RuntimeWarnings related to sqlite3 functions * test_BioSQL_SeqIO fails -- attempts to retrieve P01892 instead of Q29899 (?) * test_Restriction triggers a TypeError, somehow (also causing test_CAPS to err) * test_Entrez fails with many noisy errors -- looks related to expat, may be just my installation * importing @Bio.trie@ fails, probably due to a @marshal.h@ issue with compilation -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From mmaddren at soe.ucsc.edu Tue Dec 6 00:29:53 2011 From: mmaddren at soe.ucsc.edu (Morgan Maddren) Date: Mon, 5 Dec 2011 16:29:53 -0800 (PST) Subject: [Biopython-dev] GEO Libraries In-Reply-To: <1777002553.129203.1323131003663.JavaMail.root@mail-01.cse.ucsc.edu> Message-ID: <504286596.129243.1323131393104.JavaMail.root@mail-01.cse.ucsc.edu> Hi, I don't know who this may concern, but I noticed on the wiki that there is interest in GEO SOFT parsers/writers. I work for the UCSC Genome Browser, and we've already created libraries for working with SOFT files, and would be happy to share. -Morgan From p.j.a.cock at googlemail.com Tue Dec 6 10:20:10 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 6 Dec 2011 10:20:10 +0000 Subject: [Biopython-dev] GEO Libraries In-Reply-To: <504286596.129243.1323131393104.JavaMail.root@mail-01.cse.ucsc.edu> References: <1777002553.129203.1323131003663.JavaMail.root@mail-01.cse.ucsc.edu> <504286596.129243.1323131393104.JavaMail.root@mail-01.cse.ucsc.edu> Message-ID: On Tue, Dec 6, 2011 at 12:29 AM, Morgan Maddren wrote: > Hi, I don't know who this may concern, but I noticed on the wiki > that there is interest in GEO SOFT parsers/writers. I work for > the UCSC Genome Browser, and we've already created libraries > for working with SOFT files, and would be happy to share. 
> > -Morgan In Python? ;) We have an old and somewhat rudimentary set of GEO SOFT parsers under Bio.Geo, see: https://github.com/biopython/biopython/tree/master/Bio/Geo But not much else. Here R/Bioconductor is much more mature, not just Sean Davis' GEO parser but the whole microarray expression set objects, plus of course the statistics side (for which we may want to use SciPy). Peter From p.j.a.cock at googlemail.com Wed Dec 7 12:20:11 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 7 Dec 2011 12:20:11 +0000 Subject: [Biopython-dev] Test failure: write/read simple one-of locations. Message-ID: Hi all, I noticed the following odd buildslave failure yesterday, which was not reproducible (a retest passed), however something very similar happened today (but again it passed when repeated): Failure on 6 December 2011, under Python 3.2 http://testing.open-bio.org/biopython/builders/Linux%20-%20Python%203.2/builds/291/steps/shell/logs/stdio ERROR: test_oneof (test_SeqIO_features.FeatureWriting) Features: write/read simple one-of locations. ... ValueError: [one-of(0,3,6):21](+) versus [one-of(0,3,6):21](+): SeqFeature(FeatureLocation(OneOfPosition(0, choices=[ExactPosition(0), ExactPosition(3), ExactPosition(6)]), ExactPosition(21), strand=1), type='CDS') vs: SeqFeature(FeatureLocation(OneOfPosition(44, choices=[ExactPosition(0), ExactPosition(3), ExactPosition(6)]), ExactPosition(21), strand=1), type='CDS') Failure on 7 December 2011, under Python 3.1 http://testing.open-bio.org/biopython/builders/Linux%20-%20Python%203.1/builds/459/steps/shell/logs/stdio ERROR: test_oneof (test_SeqIO_features.FeatureWriting) Features: write/read simple one-of locations. ... 
ValueError: [one-of(0,3,6):21](+) versus [one-of(0,3,6):21](+): SeqFeature(FeatureLocation(OneOfPosition(0, choices=[ExactPosition(0), ExactPosition(3), ExactPosition(6)]), ExactPosition(21), strand=1), type='CDS') vs: SeqFeature(FeatureLocation(OneOfPosition(180, choices=[ExactPosition(0), ExactPosition(3), ExactPosition(6)]), ExactPosition(21), strand=1), type='CDS') This is the same unit test, on the same machine, but different revisions, different versions of Python, and also slightly different output. This is an old machine, so it could be a memory problem, or evidence of some subtle race condition. Has anyone else seen this error? Peter From redmine at redmine.open-bio.org Mon Dec 12 12:47:30 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 12 Dec 2011 12:47:30 +0000 Subject: [Biopython-dev] [Biopython - Bug #3315] (New) Bio.SwissProt fails parsing .dat dumps Message-ID: Issue #3315 has been reported by Leszek Pryszcz. ---------------------------------------- Bug #3315: Bio.SwissProt fails parsing .dat dumps https://redmine.open-bio.org/issues/3315 Author: Leszek Pryszcz Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: SwissProt module fails when parsing .dat dump of Uniprot_trembl vesion 201111. The error is due to corrupted RX lines in .dat for Aureococcus anophagefferens (i.e. C6KIH8_AURAN): > RX DOI=DOI; 10.1111/j.1529-8817.2010.00841.x; I have reported the problem. The thing is, that it happened before. Previously, I have reported similar issue in releases 201010, 201011, 201012... > RX DOI=10.1098/rspb= .2010.1301; Will it be possible to alter error catching mechanisms in Bio.SwissProt._read_rx, so the module warns about corrupted entry instead of failing the parser? 
---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Mon Dec 12 14:49:37 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 12 Dec 2011 14:49:37 +0000 Subject: [Biopython-dev] [Biopython - Bug #3315] Bio.SwissProt fails parsing .dat dumps References: Message-ID: Issue #3315 has been updated by Leszek Pryszcz. Simple replacing

            assert len(x) == 2, "I don't understand RX line %s" % value
            key, value = x[0], x[1].rstrip(";")
            reference.references.append((key, value))

            #assert len(x) == 2, "I don't understand RX line %s" % value
            if len(x) != 2:
              print " Warning: I don't understand RX line %s" % value
            else:
              key, value = x[0], x[1].rstrip(";")
              reference.references.append((key, value))

would do the job. ---------------------------------------- Bug #3315: Bio.SwissProt fails parsing .dat dumps https://redmine.open-bio.org/issues/3315 Author: Leszek Pryszcz Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: SwissProt module fails when parsing .dat dump of Uniprot_trembl version 201111. The error is due to corrupted RX lines in .dat for Aureococcus anophagefferens (i.e. C6KIH8_AURAN): > RX DOI=DOI; 10.1111/j.1529-8817.2010.00841.x; I have reported the problem. The thing is, that it happened before. Previously, I have reported similar issue in releases 201010, 201011, 201012... > RX DOI=10.1098/rspb= .2010.1301; Will it be possible to alter error catching mechanisms in Bio.SwissProt._read_rx, so the module warns about corrupted entry instead of failing the parser? -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Mon Dec 12 14:57:21 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 12 Dec 2011 14:57:21 +0000 Subject: [Biopython-dev] [Biopython - Bug #3315] Bio.SwissProt fails parsing .dat dumps References: Message-ID: Issue #3315 has been updated by Peter Cock. The database problem is visible at http://www.uniprot.org/uniprot/C6KIH8.txt where the line is just: RX DOI=DOI; You said you'd reported this record (C6KIH8_AURAN) to SwissProt/UniProt, and other problems in the past, so this is a recurrent problem. Regarding the proposed fix, not really, we need to use the warnings module rather than a print statement. I'm looking at it, but have to download the latest uniprot_trembl.dat first (last month's was fine, so is uniprot_sprot.dat this month and last month).
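The suggestion to use the warnings module instead of a print statement could look roughly like the following. This is a minimal sketch, assuming a hypothetical standalone helper parse_rx that receives the text after the "RX" tag; it is not the code of Bio.SwissProt._read_rx and not necessarily how the issue was eventually fixed.

```python
import warnings


def parse_rx(value):
    """Split an RX line body into (key, value) pairs, warning on bad tokens.

    Hypothetical helper for illustration only.
    """
    references = []
    for token in value.rstrip(";").split("; "):
        parts = token.strip().split("=")
        if len(parts) != 2:
            # Report the problem but keep parsing the rest of the file.
            warnings.warn("Possibly corrupt RX token %r in line %r" % (token, value))
            continue
        key, ref = parts
        references.append((key, ref.rstrip(";")))
    return references


good = parse_rx("MEDLINE=85132727; PubMed=3972838;")

# Capture warnings so the effect of the corrupt line can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    bad = parse_rx("DOI=DOI; 10.1111/j.1529-8817.2010.00841.x;")

print(good)         # [('MEDLINE', '85132727'), ('PubMed', '3972838')]
print(bad)          # [('DOI', 'DOI')]
print(len(caught))  # 1
```

Unlike a print statement, a warning can be filtered, logged, or turned into an error by the caller, so users parsing a full TrEMBL dump can decide how strict to be.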
From redmine at redmine.open-bio.org Mon Dec 12 18:36:45 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 12 Dec 2011 18:36:45 +0000 Subject: [Biopython-dev] [Biopython - Bug #3315] (Closed) Bio.SwissProt fails parsing .dat dumps References: Message-ID: Issue #3315 has been updated by Peter Cock. Status changed from New to Closed % Done changed from 0 to 100 This should be fixed as of: https://github.com/biopython/biopython/commit/8dffb06e18cd725321fcee6f6a7aee9e5c5a146f Please let us know if you run into any other problems.
From p.j.a.cock at googlemail.com Fri Dec 16 12:36:12 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 16 Dec 2011 12:36:12 +0000 Subject: [Biopython-dev] TogoWS in Biopython? In-Reply-To: References:

<8762j2iump.fsf@fastmail.fm> Message-ID: On Wed, Nov 2, 2011 at 1:27 PM, Peter Cock wrote: > On Wed, Nov 2, 2011 at 12:19 PM, Brad Chapman wrote: >> >> Peter; >> >>> > Would someone like to review the TogoWS code I have written >>> > to access the Togo Web Service's REST API please? >> >> This looks great and the tests are all passing for me. My only small >> suggestion would be to avoid hardcoding 'http://togows.dbcls.jp' >> everywhere. I'd stick this as a top level variable along with the global >> caches and reference it in the code. This way if they ever get any >> mirrors we could adjust on the fly. >> >> Thanks for getting this in, >> Brad > > Good point regarding the URL. > > I've also realised it will need some tweaks for Python 3 (bytes > versus unicode), or at least to skip the unit tests in the short > term to avoid hiding real errors on the buildbot. > > Peter That is on the trunk now; further testing and feedback welcome. https://github.com/biopython/biopython/commit/a4875c62d44c40cebfc11d84654ad9dcb83c81b3 Peter From Markus.Piotrowski at ruhr-uni-bochum.de Sat Dec 17 22:33:20 2011 From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski) Date: Sat, 17 Dec 2011 22:33:20 +0000 (UTC) Subject: [Biopython-dev] Bio.SeqUtils.MeltingTemp Tm_staluc does not accept sequence object Message-ID: Dear Biopython developers, It seems that the function Tm_staluc from Bio.SeqUtils does not accept sequence objects as input but 'just' strings. I think that is somewhat unusual for a 'SeqUtil'; shouldn't sequence utilities accept sequence objects? I also think that the function name 'Tm_staluc' is not self-explanatory. Obviously the name is derived from the corresponding author of the paper which describes this Tm calculation, SantaLucia. The method is better known (and described so in the documentation) as nearest neighbor thermodynamics, so it would be better called something like Tm_nn (nearest neighbor).
However, considering that more Tm methods will be added in future, perhaps they could simply be numbered? From Markus.Piotrowski at ruhr-uni-bochum.de Sun Dec 18 13:50:07 2011 From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski) Date: Sun, 18 Dec 2011 13:50:07 +0000 (UTC) Subject: [Biopython-dev] Sequence object allows non-alphabet characters Message-ID: Dear Biopython developers, I wonder why the following code does not throw an exception: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> mySeq = Seq("GATC1234YWSK", IUPAC.unambiguous_dna) >>> mySeq Seq('GATC1234YWSK', IUPACUnambiguousDNA()) I expected that trying to generate a sequence object containing non-alphabet characters would either throw an exception/warning or "downgrade" the alphabet, if possible. Another facet of the same problem is whitespace: >>> mySeq = Seq("GATC GATC", IUPAC.unambiguous_dna) >>> mySeq Seq('GATC GATC', IUPACUnambiguousDNA()) >>> len(mySeq) 9 This is problematic when the sequence length is required (calculating GC content, calculating melting temperature, etc.) While it could be argued that checking the integrity of the sequence data is related to parsing, I think that the sequence in the sequence object should never contain whitespace, and if an alphabet is assigned it should not contain non-alphabet characters. So should this be handled by the sequence object itself?
From p.j.a.cock at googlemail.com Sun Dec 18 14:03:10 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 18 Dec 2011 14:03:10 +0000 Subject: [Biopython-dev] Sequence object allows non-alphabet characters In-Reply-To: References: Message-ID: On Sunday, December 18, 2011, Markus Piotrowski < Markus.Piotrowski at ruhr-uni-bochum.de> wrote: > Dear Biopyhton developers, > > I wonder why the following code does not throw an exception: > >>>> from Bio.Seq import Seq >>>> from Bio.Alphabet import IUPAC >>>> mySeq = Seq("GATC1234YWSK", IUPAC.unambiguous_dna) >>>> mySeq > Seq('GATC1234YWSK', IUPACUnambiguousDNA()) > > I expected that trying to generate a sequence object containing non-alphabet > characters would either throw an exception/warning or "downgrade" the alphabet, > if possible. > > Another facet of the same problem are whitespaces: > >>>> mySeq = Seq("GATC GATC", IUPAC.unambiguous_dna) >>>> mySeq > Seq('GATC GATC', IUPACUnambiguousDNA()) >>>> len(mySeq) > 9 > > Which is problematic when the sequence length is required (calculating GC > content, calculating melting temperature, etc.) > > While it could be argued that checking the integrity of the sequence data is > related to parsing, I think that the sequence in the sequence object should > never contain whitespaces and if an alphabet is assigned it should not contain > non-alphabet characters. So this should be handled by the sequence object itself? > See https://redmine.open-bio.org/issues/2597 To me the obvious approach is to validate this in the Seq object __init__ if and only if the alphabet selected has a letters attribute with the valid characters given. However, this will slow things down and probably break a number of existing scripts. Perhaps a global setting for ignore (current behaviour), warning, or exception? We could make the default the warning for a release or two and then switch to an error.
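The ignore/warning/exception idea could look roughly like the sketch below. It is plain Python and deliberately does not touch the Seq class: VALIDATION_MODE and check_letters are hypothetical names invented for this illustration, and the letters argument stands in for the letters attribute of an alphabet.

```python
import warnings

# Hypothetical global policy: "ignore" (the current behaviour),
# "warn" or "error".
VALIDATION_MODE = "warn"


def check_letters(sequence, letters, mode=None):
    """Return the sorted characters of sequence that are not in letters.

    Depending on the mode, offending characters are ignored, reported
    with a warning, or turned into a ValueError. If letters is None
    (an alphabet without a letters attribute) nothing is checked.
    """
    if mode is None:
        mode = VALIDATION_MODE
    if mode not in ("ignore", "warn", "error"):
        raise ValueError("Unknown validation mode: %r" % mode)
    if mode == "ignore" or letters is None:
        return []
    bad = sorted(set(sequence) - set(letters))
    if bad:
        message = "Characters not in alphabet: %r" % "".join(bad)
        if mode == "error":
            raise ValueError(message)
        warnings.warn(message)
    return bad


if __name__ == "__main__":
    print(check_letters("GATC1234YWSK", "GATC", mode="ignore"))
    print(check_letters("GATC GATC", "GATC", mode="warn"))
```

Keeping the check in a separate function means the cost is only paid when a mode other than "ignore" is selected, which is the speed concern raised above.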
Peter From p.j.a.cock at googlemail.com Sun Dec 18 14:22:54 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 18 Dec 2011 14:22:54 +0000 Subject: [Biopython-dev] Bio.SeqUtils.MeltingTemp Tm_staluc does not accept sequence object In-Reply-To: References: Message-ID: On Sat, Dec 17, 2011 at 10:33 PM, Markus Piotrowski wrote: > Dear Biophython developers, > > It seems that modul Tm_staluc from Bio.SeqUtils does not > accept sequence objects as input but 'just' strings. Well spotted, but the important information missing from your email was what goes wrong: >>> from Bio.Seq import Seq >>> from Bio.SeqUtils.MeltingTemp import Tm_staluc >>> Tm_staluc(Seq("ACGT")) Traceback (most recent call last): ... i = st.index(p,x) AttributeError: 'Seq' object has no attribute 'index' >>> "ACGT".index("G") That looks like an oversight in the Seq object, which is easily fixed (the unit tests need a bit more work...). > I think that is somewhat unusal for a 'SeqUtil', shouldn't sequence utilities > accept sequence objects? > > I also think that the modul name 'Tm_staluc' is not self-explanatory. Obviously > the name is derived from the corresponding author of the paper which describes > this Tm calculation, SantaLucia. The method is better known (and described so in > the documentation) as nearest neighbor thermodynamics, so it would better be > called something like Tm_nn (nearest neighbour). The docstring could be expanded to explain that. Would you like to suggest an improvement? > However, taking into consideration that more Tm methods will be added > in future, they may just be numbered? Numbered how? Tm_nn1, Tm_nn2, ... seems a bad idea. Peter From p.j.a.cock at googlemail.com Sun Dec 18 14:26:32 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 18 Dec 2011 14:26:32 +0000 Subject: [Biopython-dev] Bio.SeqUtils.MeltingTemp Tm_staluc does not accept sequence object In-Reply-To: References:

Message-ID: On Sun, Dec 18, 2011 at 2:22 PM, Peter Cock wrote: > On Sat, Dec 17, 2011 at 10:33 PM, Markus Piotrowski > wrote: >> Dear Biophython developers, >> >> It seems that modul Tm_staluc from Bio.SeqUtils does not >> accept sequence objects as input but 'just' strings. > > Well spotted, but the important information missing from your > email was what goes wrong: > >>>> from Bio.Seq import Seq >>>> from Bio.SeqUtils.MeltingTemp import Tm_staluc >>>> Tm_staluc(Seq("ACGT")) > Traceback (most recent call last): > ... > i = st.index(p,x) > AttributeError: 'Seq' object has no attribute 'index' >>>> "ACGT".index("G") > > That looks like an oversight in the Seq object, which > is easily fixed (the unit tests need a bit more work...). Actually, I might have been worried about what to do with the existing MutableSeq's index method (which is list/array-like, not string-like). Perhaps in the short term a fix to the Tm function is easier. Peter From p.j.a.cock at googlemail.com Sun Dec 18 18:01:47 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 18 Dec 2011 18:01:47 +0000 Subject: [Biopython-dev] Bio.SeqUtils.MeltingTemp Tm_staluc does not accept sequence object In-Reply-To: References:

Message-ID: On Sun, Dec 18, 2011 at 2:26 PM, Peter Cock wrote: >> That looks like an oversight in the Seq object, which >> is easily fixed (the unit tests need a bit more work...). > > Actually, I might have been worried about what to do > with the existing MutableSeq's index method (which > is list/array like not string like). > > Perhaps in the short term a fix to the TM function is > easier. Done: https://github.com/biopython/biopython/commit/cd01acc2cdfb1a3e9e16e5107924168231514842 I noticed that the MeltingTemp module didn't have any unit tests, so turned the existing simple example into a doctest (some stray text in my git commit comment which is a shame): https://github.com/biopython/biopython/commit/a347dce69e0c061280165c33662dd57f49b18f5b Markus, could you test the latest code from github please? Thanks, Peter From Markus.Piotrowski at ruhr-uni-bochum.de Mon Dec 19 12:49:50 2011 From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus) Date: Mon, 19 Dec 2011 12:49:50 +0000 (UTC) Subject: [Biopython-dev] Sequence object allows non-alphabet characters References:

Message-ID: Peter Cock googlemail.com> writes: > > On Sunday, December 18, 2011, Markus Piotrowski < > Markus.Piotrowski ruhr-uni-bochum.de> wrote: > > Dear Biopyhton developers, > > > > I wonder why the following code does not throw an exception: > > > >>>> from Bio.Seq import Seq > >>>> from Bio.Alphabet import IUPAC > >>>> mySeq = Seq("GATC1234YWSK", IUPAC.unambiguous_dna) > >>>> mySeq > > Seq('GATC1234YWSK', IUPACUnambiguousDNA()) > > > > I expected that trying to generate a sequence object containing > non-alphabet > > characters would either throw an exception/warning or "downgrade" the > alphabet, > > if possible. > > > > See https://redmine.open-bio.org/issues/2597 > > To me the obvious approach is to valid this in the Seq object > __init__ if and only if the alphabet selected has a letters > attribute with the valid characters given. However, this will > slow things down and probably break a number of existing > scripts. Perhaps a global setting for ignore (current behaviuor), > warning, or exception? We could make the default the > warning for a release or two and then switch to an error. > > Peter > What about an additional optional argument to the sequence object, like "validate=True/False", with False as the default? This would not break existing code and would not affect speed (if validate=False), but would make it possible to have the sequence validated against the selected alphabet. In addition, validate=True without a selected alphabet would allow for basic sequence polishing, like converting to uppercase and removing whitespace and digits (any non-alphabetic characters?). From p.j.a.cock at googlemail.com Mon Dec 19 14:02:25 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Dec 2011 14:02:25 +0000 Subject: [Biopython-dev] Sequence object allows non-alphabet characters In-Reply-To: References:

Message-ID: On Mon, Dec 19, 2011 at 12:49 PM, Markus wrote: > > What about an additional optional option in the sequence object > like "validate=True/False" with false as default. This would not > break existing code, will not influence speed (if validate=False) > but gives the possibility to have the sequence validated against > the selected alphabet. That could work, although there would still be trivial speed impact (the extra if statement), but it shouldn't really hurt. However, most Seq objects are created not directly by the user, but via SeqIO. I suppose that could get another argument for Seq construction... > In addition, validate=True without an selected alphabet would allow > for a basic sequence polishing, like setting uppercase and removing > whitespaces and digits (any non-alphabetic characters?). Things like mixed case are actually useful, and so too are extra symbols. Peter From eric.talevich at gmail.com Tue Dec 20 00:49:28 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 19 Dec 2011 16:49:28 -0800 Subject: [Biopython-dev] Sequence object allows non-alphabet characters In-Reply-To: References:

Message-ID: On Mon, Dec 19, 2011 at 6:02 AM, Peter Cock wrote: > On Mon, Dec 19, 2011 at 12:49 PM, Markus > wrote: > > > > What about an additional optional option in the sequence object > > like "validate=True/False" with false as default. This would not > > break existing code, will not influence speed (if validate=False) > > but gives the possibility to have the sequence validated against > > the selected alphabet. > > That could work, although there would still be trivial speed impact > (the extra if statement), but it shouldn't really hurt. However, most > Seq objects are created not directly by the user, but via SeqIO. > I suppose that could get another argument for Seq construction... > > As another alternative, you could add a method Seq.validate() which must be called separately. Then you'd have a way to trigger validation even after directly setting seq.data or .alphabet. -E From Markus.Piotrowski at ruhr-uni-bochum.de Tue Dec 20 14:48:25 2011 From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski) Date: Tue, 20 Dec 2011 14:48:25 +0000 (UTC) Subject: [Biopython-dev] Sequence object allows non-alphabet characters References:

Message-ID: Eric Talevich gmail.com> writes: > As another alternative, you could add a method Seq.validate() which must be > called separately. Then you'd have a way to trigger validation even after > directly setting seq.data or .alphabet. > > -E > There is a function _verify_alphabet(sequence) in the Bio.Alphabet package, which does exactly this. However, the example given in the API documentation doesn't work for me: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq ("MKQHK", IUPAC.protein) >>> _verify_alphabet(my_seq) Traceback (most recent call last): File "", line 1, in _verify_alphabet(my_seq) NameError: name '_verify_alphabet' is not defined >>> from Bio import Alphabet >>> Alphabet._verify_alphabet(my_seq) True Still, I would prefer to have the sequence checked against the chosen alphabet during initialization, maybe as an option: Seq(sequence[, alphabet, verify]) Markus From p.j.a.cock at googlemail.com Tue Dec 20 15:27:00 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 20 Dec 2011 15:27:00 +0000 Subject: [Biopython-dev] Sequence object allows non-alphabet characters In-Reply-To: References:

Message-ID: On Tue, Dec 20, 2011 at 2:48 PM, Markus Piotrowski wrote: > Eric Talevich gmail.com> writes: > >> As another alternative, you could add a method Seq.validate() which must be >> called separately. Then you'd have a way to trigger validation even after >> directly setting seq.data or .alphabet. >> >> -E >> > > There is a function _verify_alphabet(sequence) in the package Alphabet, > which does exactly this. That starts with an underscore, so it is a private API. > However, the example given in the API documentation doesn't > work for me: > >>>> from Bio.Seq import Seq >>>> from Bio.Alphabet import IUPAC >>>> my_seq = Seq ("MKQHK", IUPAC.protein) >>>> _verify_alphabet(my_seq) > > Traceback (most recent call last): > File "", line 1, in > _verify_alphabet(my_seq) > NameError: name '_verify_alphabet' is not defined You didn't import the function, hence the NameError. >>>> from Bio import Alphabet >>>> Alphabet._verify_alphabet(my_seq) > True > > Still, I would prefer to have checked the sequence against the > choosen alphabet during initialization, maybe as option: > Seq(sequence[, alphabet, verify]) Yes, there are certainly advantages to having the alphabet validation happen during Seq object creation. Peter From Markus.Piotrowski at ruhr-uni-bochum.de Wed Dec 21 12:22:55 2011 From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski) Date: Wed, 21 Dec 2011 12:22:55 +0000 (UTC) Subject: [Biopython-dev] Bio.SeqUtils.MeltingTemp Tm_staluc does not accept sequence object References:

Message-ID: Peter Cock googlemail.com> writes: > > On Sun, Dec 18, 2011 at 2:26 PM, Peter Cock googlemail.com> wrote: > >> That looks like an oversight in the Seq object, which > >> is easily fixed (the unit tests need a bit more work...). > > > > Actually, I might have been worried about what to do > > with the existing MutableSeq's index method (which > > is list/array like not string like). > > > > Perhaps in the short term a fix to the TM function is > > easier. > > Done: > https://github.com/biopython/biopython/commit/cd01acc2cdfb1a3e9e16e5107924168231514842 > Markus, could you test the latest code from github please? > > Thanks, > > Peter OK, worked for me: >>> from Bio.Seq import Seq >>> import MeltingTemp #This is the latest code from github >>> mySeq = Seq("ATGGCGCTCGTCCCAGCACC") >>> MeltingTemp.Tm_staluc(mySeq) 61.837438978725686 >>> myString = "ATGGCGCTCGTCCCAGCACC" >>> MeltingTemp.Tm_staluc(myString) 61.837438978725686 I'm still worried about whitespace, which, in this method, would influence the calculation in three ways: 1. In line 169 (of the latest code) len(s) is used in a calculation. 2. The method itself consists of taking thermodynamic values of neighboring bases. Thus "GC" gives a different value than "G C", since in the latter case G and C are not adjacent. 3. Leading and trailing spaces would also give wrong results, since the function tercorr (lines 59-101) uses the startswith and endswith methods to look for the first and last base, respectively.
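The three effects listed above can be reproduced without Biopython. The sketch below is only an illustration: neighbour_pairs and strip_whitespace are hypothetical helpers written for this example and are not part of Bio.SeqUtils.MeltingTemp.

```python
def neighbour_pairs(seq):
    """Return the overlapping two-letter steps of seq, in order."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


def strip_whitespace(seq):
    """Return seq as a plain string with all whitespace removed."""
    return "".join(str(seq).split())


if __name__ == "__main__":
    clean = "ATGGCGCTCGTCCCAGCACC"
    spaced = "ATG GCGCTCGTCCCAGCACC"

    # 1. The length used in the salt correction changes.
    print(len(clean), len(spaced))

    # 2. A space breaks a neighbouring pair into two non-pairs.
    print(neighbour_pairs("GC"))
    print(neighbour_pairs("G C"))

    # 3. A leading space hides the first base from startswith().
    print(" GATC".startswith("G"))

    # Removing whitespace up front avoids all three problems.
    print(strip_whitespace(spaced) == clean)
```

Using str.split() with no arguments removes every kind of whitespace (spaces, tabs, newlines), not just single blanks.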
Example (same sequence as above with space): >>> mySeq = Seq("ATG GCGCTCGTCCCAGCACC") >>> MeltingTemp.Tm_staluc(mySeq) 58.16633757439382 I would suggest changing line 117 to sup = str(s).upper().translate(None, string.whitespace) which requires importing the string module, or more simply: sup = str(s).upper().replace(" ","") and changing line 169 to: ds = ds-0.368*(len(sup)-1)*math.log(saltc/1e3) #instead of len(s) Putting these changes into MeltingTempN: >>> import MeltingTempN >>> mySeq = Seq("ATGGCGCTCGTCCCAGCACC") >>> MeltingTempN.Tm_staluc(mySeq) 61.837438978725686 >>> mySeq = Seq("ATG GCGCTCGTCCCAGCACC") #space in sequence >>> MeltingTempN.Tm_staluc(mySeq) 61.837438978725686 >>> mySeq = Seq(" ATGGCGCTCGTCCCAGCACC") #leading space >>> MeltingTempN.Tm_staluc(mySeq) 61.837438978725686 It is of course a more philosophical question whether the method should catch these problems or whether it can rely on a 'proper' sequence (i.e., without blanks). Markus From p.j.a.cock at googlemail.com Wed Dec 21 12:39:02 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 21 Dec 2011 12:39:02 +0000 Subject: [Biopython-dev] Bio.SeqUtils.MeltingTemp Tm_staluc does not accept sequence object In-Reply-To: References: