[BioPython] a sequence set object in biopython?

Peter biopython at maubp.freeserve.co.uk
Thu Nov 13 10:37:32 UTC 2008


On Thu, Nov 13, 2008 at 12:16 AM, Giovanni Marco Dall'Olio wrote:
>
> On Wed, Nov 12, 2008 at 7:36 PM, Peter  wrote:
>> Giovanni Marco Dall'Olio wrote:
>>>> All sensible use cases - but all seem to be covered by a simple python
>>>> list of SeqRecord objects, or in some cases a list of Seq objects
>>>> (e.g. the introns example, as I doube the introns have names).
>>>
>>> Not always.
>>> For example, if I have a set of genes in an organism, sometimes I
>>> would need to access to only some of them, by their id; so, a
>>> __getattribute__ method to make it work as a dictionary could also be
>>> useful.
>>
>> OK, then use a dict of SeqRecords for this, as shown in the tutorial
>> chapter for Bio.SeqIO and the wiki.  We even have a helper function
>> Bio.SeqIO.to_dict() to do this and check for duplicate keys.
>
> I would prefer a SeqRecordSet object with a to_dict method :)

OK, that is a style choice.

BTW, you're using the word "Set" hear rather than "List", which could
be misleading as in python sets have no order, but lists do.

>> If you need an order preserving dictionary, there are examples of this
>> on the net and there is even PEP372 for adding this to python itself:
>> http://www.python.org/dev/peps/pep-0372/
>
>>> The fact is that I think that such an object would be so widely used,
>>> that maybe it would be useful to implement it in biopython.
>>> What I would do, honestly, is to create a GenericSeqRecordSet class
>>> from which to derive Alignment, specifying that in an alignment all
>>> the sequences should have the same lenght. It would not require much
>>> work and it would change the interface.
>>
>> I agree that IF we added some sort of "GenericSeqRecordSet class", it
>> might be sensible for the alignment objects to subclass it -
>> especially if you want it to behave list a python list primarily.
>
> Let's see it from another point of view.
> In biopython, if you want to print a set of sequences in fasta format,
> you have to do the following:
>>>> s1 = SeqRecord(Seq('cacacac'))
>>>> s2 = SeqRecord(Seq('cacacac'))
>>>> seqs = s1, s2
>>>> out = ''
>>>> for seq in seqs:
>>>>     # a "print seq.format('fasta')" statement won't work properly here, because of blank lines
>>>>     out += seq.format('fasta')
>>>> print out

First of all, in my opinion using variable names seq and seqs for
SeqRecord objects rather than Seq objects is confusing.

Secondly, creating SeqRecord objects without an ID is a very bad idea
if you want to output them to a file.

Thirdly, you can have as many blank likes as you like in a FASTA
format file.  Your problem is using "print" in python will append a
new line for display.  For writing to a file, it is important that the
format("fasta") method include the trailing new line.  i.e. for
printing on screen you could do:

for rec in seqs: #seqs is a list of SeqRecord objects
   print rec.format("fasta").rstrip() #removing trailing new line as
print adds one

Or (based on Michiel's email which arrived while I was writing mine)
use the stdout handle:

import sys
from Bio import SeqIO
SeqIO.write(seqs, sys.stdout, "fasta")

> On the other side, printing an alignment in fasta format is a lot simpler:
>>>> al = Alignment(SingleLetterAlphabet)
>>>> al.add_sequence('s1', 'cacaca')
>>>> al.add_sequence('s2, 'cacaca')
>>>> print al.format('fasta')
>
> I work more often with sets of sequences rather than with alignments.
> So, why it is more difficult to print some un-related sequences in a
> certain format, than aligned sequence? I would end up using Alignment
> objects also for sequences that are not aligned.

Out of interest, why do you want to print out records to screen in a
particular file format?  Why not just write them to a file?

> I am also thinking about many format parsers.
>
> Wouldn't it be easier:
>>>> seqs = Bio.SeqIO.parse(filehandler, 'fasta')
>>>> record_dict = seqs.to_dict()
>
> than invoking SeqIO twice?

You don't like this:

from Bio import SeqIO
record_dict = SeqIO.to_dict(SeqIO.parse(handle, format))

Well, I can live with it.  We *could* make the SeqIO.parse function
always return a new object, a SeqRecordIterator which could have a
to_dict() method in addition to the iteration interface - but this is
overly complicated.

>> Note that in python sets are not order preserving.
>>
>>> very tiny little minusculus p.s. if you need help for implement such a
>>> thing or anything else I can volounteer :).
>>
>> That's good to hear :)
>>
>> However, we'd have to establish the need for this new object first -
>> but so far we've only had two people's view so its too early to form a
>> consensus.  I don't see a strong reason for adding yet another object,
>> when the core language provides lists, sets and dict which seem to be
>> enough.
>
> Take for example this code you wrote for me before:
>
>> class SeqRecordList(list) :
>>    """Subclass of the python list, to hold SeqRecord objects only."""
>>    #TODO - Override the list methods to make sure all the items
>>    #are indeed SeqRecord objects
>>
>>    def format(self, format) :
>>        """Returns a string of all the records in a requested file format.
>>
>>        The argument format should be any file format supported by
>>        the Bio.SeqIO.write() function.  This must be a lower case string.
>>        """
>>        from Bio import SeqIO
>>        from StringIO import StringIO
>>        handle = StringIO()
>>        SeqIO.write(self, handle, format)
>>        handle.seek(0)
>>        return handle.read()
>
> It's very useful, but I don't think a python/biopython newbie would be
> able to write it.
> That's why I think it should be included.
> Last year, I was in another laboratory and I didn't have much
> experience with biopython, and I was missing such a kind of object.

A python newbie should first learn about basic python lists, sets, etc.

Peter



More information about the Biopython mailing list