[Biopython] slashes in Stockholm format names are not properly parsed
Erick Matsen
matsen at fhcrc.org
Fri Dec 10 01:00:19 UTC 2010
Hello there---
In StockholmIO.py, class StockholmIterator(AlignmentIterator), there is
a subroutine like so:
    def _identifier_split(self, identifier):
        """Returns (name,start,end) string tuple from an identier."""
        if identifier.find("/")!=-1:
            start_end = identifier.split("/",1)[1]
            if start_end.count("-")==1:
                start, end = map(int, start_end.split("-"))
                name = identifier.split("/",1)[0]
                return (name, start, end)
        return (identifier, None, None)
which splits off the start and end coordinates that are attached to the
end of a Stockholm sequence identifier. These identifiers look like:
myseq/4-9
Because it splits on the first slash, this code has a problem when the
sequence name itself contains a slash. Given
my/seq/4-9
it will get split into "my" and "seq/4-9", which is not right.
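To make the failure concrete, here is a minimal sketch that copies the logic above into a standalone function. The name identifier_split_current is made up for illustration and is not part of Biopython. Assuming Python 3, the bad split does more than mis-assign the name: converting "seq/4" to int raises ValueError.

```python
def identifier_split_current(identifier):
    """Standalone copy of the quoted logic (hypothetical name, for illustration)."""
    if identifier.find("/") != -1:
        # Splits on the FIRST slash, so everything after it is treated as "start-end".
        start_end = identifier.split("/", 1)[1]
        if start_end.count("-") == 1:
            start, end = map(int, start_end.split("-"))
            name = identifier.split("/", 1)[0]
            return (name, start, end)
    return (identifier, None, None)


# The ordinary case works as intended.
print(identifier_split_current("myseq/4-9"))

# A slash in the name puts part of the name into the coordinate field.
print("my/seq/4-9".split("/", 1))

try:
    identifier_split_current("my/seq/4-9")
except ValueError as err:
    # int("seq/4") cannot be parsed, so the call fails outright.
    print("ValueError:", err)
```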
An easy start to fixing the issue is to simply replace the above calls
to split with rsplit. A more complete solution may require regex?
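Both ideas can be sketched as standalone functions. This assumes Python 3. The function names and the _COORDS pattern are hypothetical, and neither function is presented as the change Biopython should or did adopt. The rsplit version is the one-word change described above. The regex version additionally accepts a suffix only when it is exactly digits, a dash, and digits.

```python
import re


def identifier_split_rsplit(identifier):
    """Same logic as the original, but splitting on the LAST slash."""
    if "/" in identifier:
        name, start_end = identifier.rsplit("/", 1)
        if start_end.count("-") == 1:
            # Still raises ValueError if the suffix is not numeric, e.g. "a/b-c".
            start, end = map(int, start_end.split("-"))
            return (name, start, end)
    return (identifier, None, None)


# Hypothetical pattern: greedy name, then "/<digits>-<digits>" at the very end.
_COORDS = re.compile(r"(?P<name>.+)/(?P<start>\d+)-(?P<end>\d+)")


def identifier_split_regex(identifier):
    """Return (name, start, end), or (identifier, None, None) if no coordinates."""
    match = _COORDS.fullmatch(identifier)
    if match:
        return (match.group("name"),
                int(match.group("start")),
                int(match.group("end")))
    return (identifier, None, None)


for ident in ("myseq/4-9", "my/seq/4-9", "my/seq", "myseq"):
    print(ident, identifier_split_rsplit(ident), identifier_split_regex(ident))
```

The regex version falls back to returning the whole identifier for inputs such as "a/b-c", where the rsplit version would still fail on the int conversion.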
The definition at
http://sonnhammer.sbc.su.se/Stockholm.html
doesn't state that slashes are illegal in names.
I'm using Biopython 1.55, and I didn't see this mentioned in Bugzilla.
Thank you for the great project.
Erick