[Biopython-dev] An equivalent to Bioperl's Bio::Seq::LargeSeq?

Danny Yoo dyoo at acoma.Stanford.EDU
Thu Mar 28 14:45:07 EST 2002


Hi everyone,

Has anyone been working on extending Seq.py to read large
sequences?  I didn't see anything like this in Biopython, and I'd be
happy to work on it if anyone's interested.


Here are the bits of code that I have so far:

######
"""IOString.py --- Make files look like Python strings.

Danny Yoo (dyoo at acoma.stanford.edu)


This module implements a class that presents a file as if it were a
string sequence.  This is meant to do the opposite of Python's
StringIO.

But why would anyone want to do something like this, instead of just
doing a simple open(file).read()?  Because some strings may be too
huge to keep in memory at once, but we may still want to manipulate
slices of it as if it were a string.
"""


class IOString:
    def __init__(self, file):
        """Initialize a new IOString from a given file.  If 'file' is
        a string, assume the user is giving us a filename, and
        automatically open the file."""
        if type(file) == type(""): file = open(file)
        self.file = file

    def __getitem__(self, i):
        """Return a single character at index i."""
        n = len(self)
        if i < 0: i = n + i
        ## Still negative after the adjustment means it was out of range.
        if i < 0 or i >= n:
            raise IndexError("IOString index out of range")
        self.file.seek(i)
        return self.file.read(1)

    def __getslice__(self, i, j):
        """Note: __getslice__() is deprecated as of Python 2.0.  Perhaps
        it might be better to push this functionality into __getitem__().
        """
        n = len(self)
        ## Python has already added len(self) to any negative index by
        ## the time we get here, so just clamp both ends to [0, n].
        i = max(i, 0)
        j = min(max(j, 0), n)
        if i >= j: return ''
        self.file.seek(i)
        return self.file.read(j-i)

    def __repr__(self):
        return "IOString(%s)" % repr(self.file)


    ## We'll only print out up to 100 characters.
    MAX_STR_DISPLAYED = 100

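    def __str__(self):
        ## Use <= so a file of exactly MAX_STR_DISPLAYED characters is
        ## shown whole, without a misleading " [...]" marker.
        if len(self) <= self.MAX_STR_DISPLAYED: return self[:]
        else: return self[:self.MAX_STR_DISPLAYED] + " [...]"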


    def __len__(self):
        self.file.seek(0, 2)   ## Seek to the end
        return int(self.file.tell())
######


This code tries to provide a nice read-only string interface for files on
disk:

###
>>> import IOString
>>> s = IOString.IOString("/home/arabidopsis/bacs/CHR1/F10A5.xml")
>>> print s
<?xml version = "1.0"?>
<!DOCTYPE TIGR SYSTEM "tigrxml.dtd">
<TIGR>
	<ASSEMBLY CLONE_ID = "1994" DAT [...]
>>> s[-50:]
'TGGAATTC</ASSEMBLY_SEQUENCE>\n\t</ASSEMBLY>\n</TIGR>\n'
>>> len(s)
318459
###
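
On the __getslice__() note in the listing: here is a rough sketch of how
slicing might be folded into __getitem__() instead.  It is written for
Python 3 (an assumption, since the listing targets Python 2), it reads in
binary mode so that offsets are plain byte positions, and the class name
SliceableFile is made up for illustration; it is not part of IOString.py.

```python
import io


class SliceableFile:
    """Hypothetical variant of IOString: slicing lives in __getitem__."""

    def __init__(self, file):
        # A str is treated as a filename; binary mode keeps offsets exact.
        if isinstance(file, str):
            file = open(file, "rb")
        self.file = file

    def __len__(self):
        self.file.seek(0, io.SEEK_END)  # Seek to the end
        return self.file.tell()

    def __getitem__(self, index):
        n = len(self)
        if isinstance(index, slice):
            # slice.indices() resolves negative and missing bounds and
            # clamps them to [0, n], the same way built-in sequences do.
            start, stop, step = index.indices(n)
            if step != 1:
                raise ValueError("only contiguous slices are supported")
            if start >= stop:
                return b""
            self.file.seek(start)
            return self.file.read(stop - start)
        if index < 0:
            index += n
        if not 0 <= index < n:
            raise IndexError("SliceableFile index out of range")
        self.file.seek(index)
        return self.file.read(1)


# An in-memory file stands in for a large sequence file on disk.
demo = SliceableFile(io.BytesIO(b"ACGTACGTAC"))
print(demo[2:6])
print(demo[-3:])
```

Leaning on slice.indices() means the clamping rules don't have to be
re-derived by hand; extended slices (a step other than 1) are simply
rejected here to keep the sketch short.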

At the moment it doesn't support any of the other string methods, nor does
it yet allow people to search through it with a regular expression without
turning the whole thing back into a string.  However, when I have time, I
can try to make a C extension similar to IOString.py that implements the
read-only buffer interface so that this works nicely with regular
expressions.
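
Short of writing a C extension, one possible stopgap is the standard
library's mmap module, since a memory-mapped file already exposes a
read-only buffer that the re module can search directly.  This is a
different technique from the extension described above, not a
replacement for it, and the sketch below is only that: it uses Python 3
syntax, and the temporary file and its contents are made-up sample data.

```python
import mmap
import re
import tempfile

# Hypothetical sample data: one EcoRI site (GAATTC) buried in filler.
with tempfile.NamedTemporaryFile(suffix=".seq") as tmp:
    tmp.write(b"ACGT" * 1000 + b"GAATTC" + b"TTGA" * 1000)
    tmp.flush()

    # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
    with mmap.mmap(tmp.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # re accepts any bytes-like buffer, so nothing is copied into a
        # Python string first.  The pattern has to be a bytes pattern.
        match = re.search(rb"GAATTC", mapped)
        site_offset = match.start()

        # Slicing and len() behave as they do for bytes.
        tail = mapped[-8:]
        total = len(mapped)

print(site_offset, tail, total)
```

The operating system pages the file in on demand, so only the regions
actually touched by the search or the slices need to be resident.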



