[Biopython] slashes in Stockholm format names are not properly parsed
Peter
biopython at maubp.freeserve.co.uk
Fri Dec 10 10:50:31 UTC 2010
On Fri, Dec 10, 2010 at 1:00 AM, Erick Matsen <matsen at fhcrc.org> wrote:
> Hello there---
>
>
> In StockholmIO.py, class StockholmIterator(AlignmentIterator), there is
> a subroutine like so:
>
>     def _identifier_split(self, identifier):
>         """Returns (name,start,end) string tuple from an identier."""
>         if identifier.find("/")!=-1:
>             start_end = identifier.split("/",1)[1]
>             if start_end.count("-")==1:
>                 start, end = map(int, start_end.split("-"))
>                 name = identifier.split("/",1)[0]
>                 return (name, start, end)
>         return (identifier, None, None)
>
> which splits off the start/end tag that is appended to the end
> of the Stockholm sequence identifier. These identifiers look like:
>
> myseq/4-9
>
> Because it uses split, this code has a problem when the sequence
> name itself contains a slash. Given
>
> my/seq/4-9
>
> it will get split into "my" and "seq/4-9", which is not right.
>
> An easy start to fixing the issue is to simply replace the above calls
> to split with rsplit. A more complete solution may require regex?
>
> The definition at
> http://sonnhammer.sbc.su.se/Stockholm.html
> doesn't state that slashes are illegal in names.
>
>
> I'm using 1.55, and I didn't see it mentioned in bugzilla.
>
> Thank you for the great project.
>
> Erick
Hi Erick,
Your suggested change to use rsplit makes sense - I'm
happy to commit that. Do you mind being thanked in the
release notes and list of contributors?
Also, do you have a small real example of a Stockholm file
with sequence identifiers with embedded slashes (for our
test suite) - or is this a hypothetical problem you've identified?
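
For reference, here is a minimal sketch of what the rsplit-based change
discussed above might look like. It is written as a standalone function
(the name identifier_split is hypothetical) rather than as the actual
StockholmIterator method, and the try/except around the integer
conversion is an extra assumption that is not in the quoted code:

```python
def identifier_split(identifier):
    """Return a (name, start, end) tuple from a Stockholm-style identifier.

    Standalone sketch only, not the Biopython method itself.
    """
    if "/" in identifier:
        # rsplit splits on the LAST slash, so any slashes inside the
        # name itself stay part of the name.
        name, start_end = identifier.rsplit("/", 1)
        if start_end.count("-") == 1:
            try:
                start, end = map(int, start_end.split("-"))
            except ValueError:
                # Assumption: a suffix that is not of the form int-int is
                # treated as part of the name rather than raising.
                pass
            else:
                return (name, start, end)
    return (identifier, None, None)


print(identifier_split("myseq/4-9"))
print(identifier_split("my/seq/4-9"))
```

With this version, only the text after the final slash is considered as
a possible start-end suffix.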
Thank you,
Peter