[Biopython] slashes in Stockholm format names are not properly parsed

Peter biopython at maubp.freeserve.co.uk
Fri Dec 10 10:50:31 UTC 2010


On Fri, Dec 10, 2010 at 1:00 AM, Erick Matsen <matsen at fhcrc.org> wrote:
> Hello there---
>
>
> In StockholmIO.py, class StockholmIterator(AlignmentIterator), there is
> a subroutine like so:
>
>    def _identifier_split(self, identifier):
>        """Returns (name,start,end) string tuple from an identifier."""
>        if identifier.find("/")!=-1:
>            start_end = identifier.split("/",1)[1]
>            if start_end.count("-")==1:
>                start, end = map(int, start_end.split("-"))
>                name = identifier.split("/",1)[0]
>                return (name, start, end)
>        return (identifier, None, None)
>
> which splits off the start and end tag that is attached to the end
> of the Stockholm sequence identifier. These identifiers look like:
>
> myseq/4-9
>
> Because it uses split like this, the code has a problem when the seq
> has a slash in the name. Given
>
> my/seq/4-9
>
> it will get split into "my" and "seq/4-9", which is not right.
>
> An easy start to fixing the issue is to simply replace the above calls
> to split with rsplit. A more complete solution may require regex?
>
> The definition at
> http://sonnhammer.sbc.su.se/Stockholm.html
> doesn't state that slashes are illegal in names.
>
>
> I'm using 1.55, and I didn't see it mentioned in bugzilla.
>
> Thank you for the great project.
>
> Erick

Hi Erick,

Your suggested change to use rsplit makes sense - I'm
happy to commit that. Do you mind being thanked in the
release notes and list of contributors?
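
For reference, here is a rough sketch of what the rsplit-based version
might look like. It is written as a standalone function (the name
identifier_split is just for illustration) rather than the method in
StockholmIO.py, and the try/except for a non-numeric suffix is my own
assumption, not something in the current code:

```python
def identifier_split(identifier):
    """Return (name, start, end) from an identifier such as 'myseq/4-9'.

    Illustrative sketch only, not the code in StockholmIO.py.
    """
    if "/" in identifier:
        # Split on the LAST slash, so earlier slashes stay in the name.
        name, start_end = identifier.rsplit("/", 1)
        if start_end.count("-") == 1:
            try:
                start, end = map(int, start_end.split("-"))
            except ValueError:
                # Assumption: a non-numeric suffix means there is no
                # start/end tag, so the identifier is returned unchanged.
                return (identifier, None, None)
            return (name, start, end)
    return (identifier, None, None)


print(identifier_split("myseq/4-9"))
print(identifier_split("my/seq/4-9"))
```

With this, "my/seq/4-9" should split into the name "my/seq" plus the
coordinates 4 and 9, while "myseq/4-9" behaves as before.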

Also, do you have a small real example of a Stockholm file
with sequence identifiers with embedded slashes (for our
test suite) - or is this a hypothetical problem you've identified?

Thank you,

Peter



