[Biopython-dev] Bio.SeqIO

Peter biopython-dev at maubp.freeserve.co.uk
Wed Mar 7 10:43:36 UTC 2007


Michiel de Hoon wrote:
> Note that a dictionary can be created by specifying a list of [key, 
> value] pairs:
> 
>  >>> dict([['a','A'],['b','B'],['c','C']])
> {'a': 'A', 'c': 'C', 'b': 'B'}
> 
> This also works with an iterator:
>  >>> def f(text):
>          for character in text:
>              yield [character, character.upper()]
>  >>> dict(f("abcd"))
> {'a': 'A', 'c': 'C', 'b': 'B', 'd': 'D'}
> 
> Now, if we let SeqRecord inherit from list, we can make it behave as a 
> [record.id, record] list. Normally, this would not be visible to the 
> user, in the sense that a user who doesn't know that SeqRecord inherits 
> from list wouldn't notice that it does.
> 
> The upshot is that we can now create a dictionary like this:
>  >>> d = dict(SeqIO.parse(handle, format))
> without any changes to Bio.SeqIO.

That is clever...
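
To check my understanding, here is a minimal sketch of the trick using a stand-in class. Record and parse below are made-up names for illustration, not the real SeqRecord or SeqIO.parse, and the input format is an assumption:

```python
class Record(list):
    """Hypothetical stand-in for SeqRecord (not the real Biopython class)."""

    def __init__(self, id, seq):
        # The list content is [id, self], so dict() sees a [key, value] pair.
        list.__init__(self, [id, self])
        self.id = id
        self.seq = seq


def parse(lines):
    """Hypothetical parser: yields one Record per 'id sequence' line."""
    for line in lines:
        id, seq = line.split()
        yield Record(id, seq)


# dict() consumes the iterator and unpacks each record as [record.id, record].
d = dict(parse(["a ACGT", "b TTGA"]))
print(d["a"].seq)
```

One visible side effect, at least in this sketch, is that len(record) is always 2 and record[1] is the record itself, so the list behaviour is not entirely hidden from a user who goes looking.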

> Two things get lost here:
> 1) We can't have a key_function to change how to choose the key.
> 2) We're no longer checking if all keys are different. This can be fixed 
> by saving the keys in the parser function and raising an exception if 
> two identical keys are found. This implies though that the same 
> exception is raised in all use cases of SeqIO.parse, which may not be 
> what we want.

Sadly not ideal.  Also, wouldn't this prevent us from making SeqRecord 
inherit from Seq (another interesting idea you proposed in the past)? 
Seq objects, in turn, could behave a little more like a string, or a 
list of letters.
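
For comparison, both of the things lost above (a key function and the duplicate key check) are easy to keep if the dictionary is built by a separate helper rather than by dict() itself. A rough sketch, where records_to_dict, its arguments and the example identifiers are all names I have made up for illustration:

```python
from collections import namedtuple

# Hypothetical minimal record type; only the .id attribute matters here.
Record = namedtuple("Record", ["id", "seq"])


def records_to_dict(records, key_function=None):
    """Build a dict from an iterable of records, rejecting duplicate keys."""
    if key_function is None:
        # Default to the record's id, as in the dict(SeqIO.parse(...)) idea.
        key_function = lambda record: record.id
    result = {}
    for record in records:
        key = key_function(record)
        if key in result:
            raise ValueError("Duplicate key %r" % (key,))
        result[key] = record
    return result


records = [Record("gi|1|alpha", "ACGT"), Record("gi|2|beta", "TTGA")]
by_id = records_to_dict(records)
# A custom key function, here taking the last field of the identifier.
by_name = records_to_dict(records, key_function=lambda r: r.id.split("|")[-1])
```

This keeps the exception out of the parser, so iterating over SeqIO.parse directly would be unaffected.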

It might be nice to be able to slice a SeqRecord and get a new 
SeqRecord with the appropriate sub-sequence... I have been thinking 
about a "RichSeqRecord" subclass of SeqRecord which would support 
per-letter annotation of the sequence (e.g. secondary structure). In 
this situation, when requesting a sub-record, the corresponding subset 
of the secondary structure information should also be extracted.

For example, the Pfam/Stockholm alignment format can hold strings of 
the same length as the sequences, which contain "per sequence per 
character" information like secondary structure.
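
To make the slicing idea concrete, here is a rough sketch of how such a record might behave. RichRecord, the per_letter argument and the example data are all invented for illustration; this is not an existing Biopython class:

```python
class RichRecord:
    """Hypothetical sketch of the "RichSeqRecord" idea (not a Biopython class)."""

    def __init__(self, id, seq, per_letter=None):
        self.id = id
        self.seq = seq
        # Maps an annotation name to a sequence with one entry per letter.
        self.per_letter = dict(per_letter or {})
        for name, values in self.per_letter.items():
            if len(values) != len(seq):
                raise ValueError("Annotation %r has the wrong length" % (name,))

    def __len__(self):
        return len(self.seq)

    def __getitem__(self, index):
        if not isinstance(index, slice):
            # A single position just returns that letter.
            return self.seq[index]
        # A slice returns a new record with every annotation cut to match.
        sub = dict((name, values[index])
                   for name, values in self.per_letter.items())
        return RichRecord(self.id, self.seq[index], sub)


rec = RichRecord("example", "MKVLAAGIV",
                 {"secondary_structure": "CCHHHHEEC"})
sub = rec[2:6]
print(sub.seq, sub.per_letter["secondary_structure"])
```

Because the annotation values only need to support len() and slicing, a plain list would work in place of the string.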

We could also load a PDB file in this way, and provide a list of residue 
objects (including the atom coordinates) in parallel with the sequence.

Peter



