[Biopython-dev] Tweaking the SeqRecord class

Wed Aug 16 22:20:28 UTC 2006

In the spirit of gradual improvements, I had a look at the SeqRecord class.

First of all, are there any comments on my suggestion to add __str__ and
__repr__ methods to the SeqRecord object (bug 2057)?

http://bugzilla.open-bio.org/show_bug.cgi?id=2057
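
As a rough sketch of the idea only: the class name MiniRecord, the default
values, and the output layout below are invented for illustration and are
not taken from the bug report or from the real SeqRecord code.

```python
class MiniRecord(object):
    """Hypothetical stand-in for SeqRecord, for illustration only."""

    def __init__(self, seq, id="<unknown id>", name="<unknown name>",
                 description="<unknown description>"):
        self.seq = seq
        self.id = id
        self.name = name
        self.description = description

    def __repr__(self):
        # Unambiguous, constructor-like form aimed at developers.
        return "%s(seq=%r, id=%r, name=%r, description=%r)" % (
            self.__class__.__name__, self.seq, self.id, self.name,
            self.description)

    def __str__(self):
        # Human-readable, multi-line summary used by print.
        lines = ["ID: %s" % self.id,
                 "Name: %s" % self.name,
                 "Description: %s" % self.description,
                 "Sequence: %s" % self.seq]
        return "\n".join(lines)
```

With something like this, typing a record at the interactive prompt would
show the __repr__ form, while printing it would show the __str__ summary.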

Next, I'd like to check in some basic __doc__ strings for the
SeqRecord class, e.g. something like this:

>>> from Bio.SeqRecord import SeqRecord
>>> print SeqRecord.__doc__
The SeqRecord object is designed to hold a sequence and information about it.

    Main properties:
    id          - Identifier such as a locus tag (string)
    seq         - The sequence itself (Seq object)

    Additional properties:
    name        - Sequence name, e.g. gene name (string)
    description - Additional text (string)
    dbxrefs     - List of database cross references (list of strings)
    features    - Any (sub)features defined (list of SeqFeature objects)
    annotations - Further information (dictionary)

I would also like to add doc strings to id, seq, name, ... themselves.
However, they are currently stored as plain attributes, and Python
offers no way to attach a doc string to an attribute.  See PEP 0224
(attribute doc strings), which was rejected:
http://www.python.org/dev/peps/pep-0224/

However, we could use the Python 2.2 "property" function to implement
these as properties.  The code might be clearer using the Python 2.4
"decorator" syntax, but I don't think we should depend on such a
recent version of Python yet.

Using properties would allow this usage:

>>> print SeqRecord.features.__doc__
Annotations about parts of the sequence (list of SeqFeatures)

It would also mean that these properties show up in dir(SeqRecord) and
help(SeqRecord), which all in all should make the object slightly
easier to use.
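
A minimal sketch of the property approach, without decorators.  The class
name PropertyRecord and the _get_/_set_ helper names are made up for this
example, and only two of the fields are shown; it is not the attached
implementation.

```python
class PropertyRecord(object):
    """Hypothetical cut-down record, for illustration only."""

    def __init__(self, seq, id="<unknown id>"):
        self._seq = seq
        self._id = id

    def _get_id(self):
        return self._id

    def _set_id(self, value):
        self._id = value

    def _get_seq(self):
        return self._seq

    def _set_seq(self, value):
        self._seq = value

    # property(fget, fset, fdel, doc) requires a new-style class,
    # i.e. one derived from object.
    id = property(_get_id, _set_id, None,
                  "Identifier such as a locus tag (string)")
    seq = property(_get_seq, _set_seq, None,
                   "The sequence itself (Seq object)")
```

Here PropertyRecord.id.__doc__ gives the doc string, and both "id" and
"seq" are listed by dir(PropertyRecord), while existing code that reads or
assigns record.id keeps working unchanged.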

Finally, using get/set property functions allows us to postpone
creation of string/list/dict objects for unused properties.  This does
actually seem to bring a slight improvement to the timings for Fasta
file parsing discussed last month.
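
To illustrate the postponed creation, here is a sketch under my own
assumptions: the class name LazyRecord and the use of None as a "not
created yet" marker are choices made for this example and may differ from
the attached file.

```python
class LazyRecord(object):
    """Hypothetical record that creates its containers on first use."""

    def __init__(self, seq, id="<unknown id>"):
        self.seq = seq
        self.id = id
        # None means "not created yet"; no list or dict is built here.
        self._features = None
        self._annotations = None

    def _get_features(self):
        if self._features is None:
            self._features = []  # created only when first requested
        return self._features

    def _set_features(self, value):
        self._features = value

    def _get_annotations(self):
        if self._annotations is None:
            self._annotations = {}  # created only when first requested
        return self._annotations

    def _set_annotations(self, value):
        self._annotations = value

    features = property(_get_features, _set_features, None,
                        "Annotations about parts of the sequence "
                        "(list of SeqFeatures)")
    annotations = property(_get_annotations, _set_annotations, None,
                           "Further information (dictionary)")
```

A parser that only ever sets seq and id never pays for the empty list and
dictionary, which is where any saving would come from.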

If you recall, for the fastest parsers, turning the data into SeqRecord
and Seq objects imposed a fairly large overhead (compared to just
using strings):

http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002407.html

I would be interested to see how those numbers change with the
attached implementation - if you wouldn't mind please Leighton... ;)

I have attached a version of SeqRecord.py which implements the changes
I have described.  The backwards compatibility if statement is a bit
ugly - can we just assume Python 2.2 or later?

Peter
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SeqRecord.py
Type: text/x-script.phyton
Size: 9367 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060816/9e2f173c/attachment-0002.bin>