[BioPython] a sequence set object in biopython?

Giovanni Marco Dall'Olio dalloliogm at gmail.com
Wed Nov 12 19:16:44 EST 2008


On Wed, Nov 12, 2008 at 7:36 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Giovanni Marco Dall'Olio wrote:
>>> All sensible use cases - but all seem to be covered by a simple python
>>> list of SeqRecord objects, or in some cases a list of Seq objects
>>> (e.g. the introns example, as I doubt the introns have names).
>>
>> Not always.
>> For example, if I have a set of genes in an organism, sometimes I
>> would need to access only some of them, by their id; so, a
>> __getitem__ method to make it work as a dictionary could also be
>> useful.
>
> OK, then use a dict of SeqRecords for this, as shown in the tutorial
> chapter for Bio.SeqIO and the wiki.  We even have a helper function
> Bio.SeqIO.to_dict() to do this and check for duplicate keys.

I would prefer a SeqRecordSet object with a to_dict method :)
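
For reference, the dictionary behaviour discussed above can be sketched
in a few lines. This is only an illustration, not Biopython's
implementation: Record is a hypothetical stand-in for SeqRecord (so the
example runs without Biopython installed), records_to_dict is a made-up
name, and the duplicate-key check merely imitates what Peter describes
for Bio.SeqIO.to_dict().

```python
from collections import namedtuple

# Hypothetical stand-in for Bio.SeqRecord.SeqRecord: only .id and .seq are used.
Record = namedtuple("Record", ["id", "seq"])


def records_to_dict(records):
    """Index records by their id, refusing duplicate ids."""
    result = {}
    for record in records:
        if record.id in result:
            raise ValueError("Duplicate key %r" % record.id)
        result[record.id] = record
    return result


genes = [Record("geneA", "cacacac"), Record("geneB", "ttgttg")]
by_id = records_to_dict(genes)
print(by_id["geneB"].seq)  # look up a single gene by its id
```

With real SeqRecord objects the same lookup would presumably go through
Bio.SeqIO.to_dict(), as suggested in the quoted reply.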

> If you need an order preserving dictionary, there are examples of this
> on the net and there is even PEP372 for adding this to python itself:
> http://www.python.org/dev/peps/pep-0372/

>> The fact is that I think that such an object would be so widely used,
>> that maybe it would be useful to implement it in biopython.
>> What I would do, honestly, is to create a GenericSeqRecordSet class
>> from which to derive Alignment, specifying that in an alignment all
>> the sequences should have the same length. It would not require much
>> work and it would change the interface.
>
> I agree that IF we added some sort of "GenericSeqRecordSet class", it
> might be sensible for the alignment objects to subclass it -
> especially if you want it to behave like a python list primarily.
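
To make the proposal concrete, here is a rough sketch of the class
relationship being discussed. Everything in it is hypothetical:
GenericSeqRecordSet is the name suggested above, AlignmentSketch is an
invented name (it is not Biopython's Alignment class), and plain strings
stand in for sequences so the example runs without Biopython.

```python
class GenericSeqRecordSet(list):
    """Hypothetical list-like container for sequences (not a Biopython class)."""

    def lengths(self):
        """Return the set of distinct sequence lengths in the container."""
        return {len(seq) for seq in self}


class AlignmentSketch(GenericSeqRecordSet):
    """Hypothetical subclass adding one extra rule: equal lengths."""

    def append(self, seq):
        # Reject a sequence whose length differs from those already stored.
        if self and len(seq) != len(self[0]):
            raise ValueError("all sequences in an alignment must have the same length")
        super().append(seq)


introns = GenericSeqRecordSet(["cacaca", "ttg"])  # mixed lengths are fine here
alignment = AlignmentSketch()
alignment.append("cacaca")
alignment.append("caca--")  # same length, so it is accepted
```

Only append() is guarded in this sketch; a fuller version would also
need to cover extend(), insert() and item assignment.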

Let's see it from another point of view.
In biopython, if you want to print a set of sequences in fasta format,
you have to do the following:
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> s1 = SeqRecord(Seq('cacacac'))
>>> s2 = SeqRecord(Seq('cacacac'))
>>> seqs = s1, s2
>>> out = ''
>>> for seq in seqs:
...     # print(seq.format('fasta')) would add a blank line after each record
...     out += seq.format('fasta')
...
>>> print(out)

On the other hand, printing an alignment in fasta format is a lot simpler:
>>> from Bio.Align.Generic import Alignment
>>> from Bio.Alphabet import SingleLetterAlphabet
>>> al = Alignment(SingleLetterAlphabet())
>>> al.add_sequence('s1', 'cacaca')
>>> al.add_sequence('s2', 'cacaca')
>>> print(al.format('fasta'))

I work more often with sets of sequences rather than with alignments.
So why is it more difficult to print some unrelated sequences in a
given format than aligned sequences? I would end up using Alignment
objects even for sequences that are not aligned.

I am also thinking about many format parsers.

Wouldn't it be easier:
>>> seqs = Bio.SeqIO.parse(filehandler, 'fasta')
>>> record_dict = seqs.to_dict()

than invoking SeqIO twice?



> Note that in python sets are not order preserving.
>
>> very tiny little minuscule p.s. if you need help implementing such a
>> thing or anything else I can volunteer :).
>
> That's good to hear :)
>
> However, we'd have to establish the need for this new object first -
> but so far we've only had two people's view so it's too early to form a
> consensus.  I don't see a strong reason for adding yet another object,
> when the core language provides lists, sets and dict which seem to be
> enough.

Take for example this code you wrote for me before:

> class SeqRecordList(list) :
>    """Subclass of the python list, to hold SeqRecord objects only."""
>    #TODO - Override the list methods to make sure all the items
>    #are indeed SeqRecord objects
>
>    def format(self, format) :
>        """Returns a string of all the records in a requested file format.
>
>        The argument format should be any file format supported by
>        the Bio.SeqIO.write() function.  This must be a lower case string.
>        """
>        from Bio import SeqIO
>        from io import StringIO
>        handle = StringIO()
>        SeqIO.write(self, handle, format)
>        return handle.getvalue()

It's very useful, but I don't think a python/biopython newbie would be
able to write it.
That's why I think it should be included.
Last year, I was in another laboratory and I didn't have much
experience with biopython, and I missed having this kind of object.
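
As an illustration of the extra work a newcomer would face, here is one
way the TODO in the class above might be filled in. It is a sketch under
assumptions, not code from Biopython: Record is a hypothetical stand-in
for SeqRecord so the example runs on its own, item_type is an invented
attribute, and only append() and extend() are guarded.

```python
from collections import namedtuple

# Hypothetical stand-in for Bio.SeqRecord.SeqRecord, so the sketch runs on its own.
Record = namedtuple("Record", ["id", "seq"])


class SeqRecordList(list):
    """List that only accepts items of one record type (sketch, not Biopython code)."""

    item_type = Record  # assumption: this would be SeqRecord in real use

    def __init__(self, records=()):
        super().__init__()
        self.extend(records)

    def _check(self, record):
        # Return the record unchanged if it has the expected type.
        if not isinstance(record, self.item_type):
            raise TypeError("expected %s, got %r" % (self.item_type.__name__, record))
        return record

    def append(self, record):
        super().append(self._check(record))

    def extend(self, records):
        # Validate every item before adding any, so a failure leaves the list unchanged.
        super().extend([self._check(r) for r in records])


records = SeqRecordList([Record("s1", "cacacac"), Record("s2", "cacacac")])
records.append(Record("s3", "ttgttg"))
```

insert(), item assignment and slicing would need the same treatment,
which is part of why a ready-made class would save beginners some
effort.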

> Peter
>

Goodnight!!


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

