[Biopython] slashes in Stockholm format names are not properly parsed

Erick Matsen matsen at fhcrc.org
Fri Dec 10 01:00:19 UTC 2010


Hello there---


In StockholmIO.py, class StockholmIterator(AlignmentIterator), there is
a subroutine like so:

    def _identifier_split(self, identifier):
        """Returns (name,start,end) string tuple from an identier."""
        if identifier.find("/")!=-1:
            start_end = identifier.split("/",1)[1]
            if start_end.count("-")==1:
                start, end = map(int, start_end.split("-"))
                name = identifier.split("/",1)[0]
                return (name, start, end)
        return (identifier, None, None)

which splits off the start and end coordinates that are appended to the
end of a Stockholm sequence identifier. These identifiers look like:

myseq/4-9

Because it splits on the first slash, this code has a problem when the
sequence name itself contains a slash. Given

my/seq/4-9

it will get split into "my" and "seq/4-9", which is not right: the name
should be "my/seq" and the range "4-9".
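
To make the failure concrete, here is a small standalone sketch. The
function identifier_split below is a hypothetical module-level copy of
the method quoted above (with self dropped), not the Biopython code
itself. It shows what the first-slash split produces, and that the
integer conversion then fails on the leftover "seq/4" piece:

```python
def identifier_split(identifier):
    """Standalone copy of the quoted method, for illustration only."""
    if identifier.find("/") != -1:
        start_end = identifier.split("/", 1)[1]
        if start_end.count("-") == 1:
            # For "my/seq/4-9" this tries int("seq/4"), which raises.
            start, end = map(int, start_end.split("-"))
            name = identifier.split("/", 1)[0]
            return (name, start, end)
    return (identifier, None, None)


# Splitting on the first slash separates the name in the wrong place.
first_split = "my/seq/4-9".split("/", 1)

try:
    result = identifier_split("my/seq/4-9")
except ValueError as err:
    result = "ValueError: %s" % err

print(first_split)
print(result)
```

An identifier with a single slash, such as "myseq/4-9", still goes
through this copy without trouble.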

An easy first step toward fixing the issue is to replace the above calls
to split with rsplit. A more complete solution may require a regex?
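
As a rough sketch of both ideas (not a tested patch against
StockholmIO.py), the two hypothetical functions below show the rsplit
change and one possible regex. The names identifier_split_rsplit,
identifier_split_regex and _RANGE_SUFFIX are made up for illustration,
and the regex assumes the suffix is always "/<digits>-<digits>":

```python
import re


def identifier_split_rsplit(identifier):
    """Like the quoted method, but splitting on the LAST slash."""
    if "/" in identifier:
        name, start_end = identifier.rsplit("/", 1)
        if start_end.count("-") == 1:
            start, end = map(int, start_end.split("-"))
            return (name, start, end)
    return (identifier, None, None)


# Assumed suffix shape: a slash, digits, a hyphen, digits, end of string.
_RANGE_SUFFIX = re.compile(r"^(.*)/(\d+)-(\d+)$")


def identifier_split_regex(identifier):
    """Only treat the tail as coordinates if it is strictly numeric."""
    match = _RANGE_SUFFIX.match(identifier)
    if match:
        return (match.group(1), int(match.group(2)), int(match.group(3)))
    return (identifier, None, None)


print(identifier_split_rsplit("my/seq/4-9"))
print(identifier_split_regex("my/seq/4-9"))
```

The regex variant has the side effect that a tail such as "a-b" leaves
the identifier untouched, whereas the rsplit variant (like the original)
would raise ValueError when converting it to integers.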

The definition at
http://sonnhammer.sbc.su.se/Stockholm.html
doesn't state that slashes are illegal in names.


I'm using Biopython 1.55, and I didn't see this mentioned in Bugzilla.

Thank you for the great project.

Erick


