[Biopython-dev] Improving the Alignment object. Was Bio.AlignIO

Peter biopython-dev at maubp.freeserve.co.uk
Wed Jul 25 12:10:43 EDT 2007


Michiel de Hoon wrote:
> Peter wrote:
>> Personally I see an alignment as both an array of characters (i.e. amino 
>> acid residues or nucleotides), and a list of sequences.
>>
>> In the same way that a Numeric or NumPy array lets you iterate over 
>> rows, yet also access individual elements, we could allow iteration of 
>> SeqRecords and also allow access to individual letters.
> 
> How about the following:
> 
> -Iterators iterate for the SeqRecords in the alignment

I agree. And this is trivial to implement without needing the element 
access/slicing support.
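
As a rough illustration of why iteration is the easy part, here is a 
minimal sketch. The Alignment class below is a hypothetical stand-in, not 
the existing Biopython class, and plain (id, sequence) tuples stand in for 
SeqRecord objects so that the example runs on its own.

```python
class Alignment:
    """Hypothetical stand-in for the alignment class (not Biopython's code)."""

    def __init__(self, records):
        # Each record is an (id, sequence string) pair standing in for
        # a SeqRecord.
        self._records = list(records)

    def __iter__(self):
        # Delegate to the underlying list; no index handling is needed.
        return iter(self._records)

    def __len__(self):
        return len(self._records)


align = Alignment([("alpha", "ACGT-"), ("beta", "AC-TT"), ("gamma", "ACGTT")])
ids = [rec_id for rec_id, seq in align]
```

Nothing here depends on how [row, column] indexing ends up being defined, 
so it could be added first and the indexing settled later.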

As to element access, we've been thinking along similar lines :)
It's just that with all the different special cases, there are lots of 
different possible return types!

> -An index of the form [xxx] returns the corresponding SeqRecord
> -An index of the form [xxx:yyy:zzz] returns an Alignment object 
>  containing the SeqRecords in rows [xxx:yyy:zzz]
>  (compare to the current method get_all_seqs()).

I agree. This is essential to make an alignment act like a list of 
SeqRecord objects when only a one-dimensional index is given.

> -An index of the form [xxx,:] returns the Seq object of the SeqRecord at 
> xxx (this is currently done by the get_seq_by_num() method).
> -An index of the form [xxx:yyy:zzz,:] returns a list of Seq objects

I'm not immediately convinced about returning Seq objects here.  I might 
expect indices like [xxx,:] to return a SeqRecord (not a Seq) and 
[xxx:yyy:zzz,:] to return a sub-alignment (not a list of Seq objects).

> -An index of the form [:,www] returns a string containing the characters 
>  at column www (which is currently done by the get_column method)
> -An index of the form [xxx,www] returns a string containing the 
>  character of the sequence in row xxx at column www.

Those look fine - however we might want to return Seq objects rather 
than strings.

> -An index of the form [xxx:yyy:zzz,www] returns a string containing
>  the characters at column www using only the rows xxx:yyy:zzz.

Or a sub-alignment? See later...

> This is more-or-less how Numerical Python arrays work, except that we'll 
> be returning SeqRecord/Seq/string objects depending on the indices.

For comparison, this is what I had been thinking:
* [r,c] means one element is requested, return a single character string
* [r] or [r,:] means one row is requested, return a SeqRecord
* [:,c] means one column is requested, return a string (or Seq object?)
* Otherwise returns a (sub)alignment. Note that [:] or [:,:] would 
return a copy of the alignment.

This would cover slicing of the column index by returning a 
sub-alignment. i.e. indexes of the form [rrr, xxx:yyy:zzz] or 
[rrr:ppp:qqq, xxx:yyy:zzz]

I'm not sure if requests for part of a single row or column like [rrr, 
xxx:yyy:zzz] and [rrr:ppp:qqq, xxx] are best handled by returning 
sub-alignments or as special cases (strings/Seq and Seq/SeqRecord 
respectively?).
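
To make the return types concrete, here is a rough sketch of the four rules 
I listed above, written as a single __getitem__. It takes the "otherwise 
return a sub-alignment" rule literally, so partial rows and partial columns 
come back as small alignments rather than as special cases; that is just 
one possible answer to the open question. Record and Alignment are 
hypothetical stand-ins (Record plays the part of SeqRecord and holds a 
plain string), not existing Biopython code.

```python
class Record:
    """Hypothetical stand-in for SeqRecord: an id plus a sequence string."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq


class Alignment:
    """Hypothetical alignment supporting [row] and [row, column] indexing."""

    def __init__(self, records):
        self._records = list(records)

    def __len__(self):
        return len(self._records)

    def __iter__(self):
        return iter(self._records)

    def __getitem__(self, index):
        # A bare index such as align[r] is treated as align[r, :].
        row, col = index if isinstance(index, tuple) else (index, slice(None))
        whole = slice(None)
        if isinstance(row, int) and isinstance(col, int):
            # [r, c]: one element, returned as a single character string.
            return self._records[row].seq[col]
        if isinstance(row, int) and col == whole:
            # [r] or [r, :]: one complete row, returned as a record.
            return self._records[row]
        if row == whole and isinstance(col, int):
            # [:, c]: one complete column, returned as a string.
            return "".join(rec.seq[col] for rec in self._records)
        # Everything else, including [:] and [:, :], gives a new alignment.
        rows = [self._records[row]] if isinstance(row, int) else self._records[row]
        return Alignment(Record(rec.id, rec.seq[col]) for rec in rows)


align = Alignment([Record("a", "ACGT-"), Record("b", "AC-TT"), Record("c", "ACGTT")])
element = align[1, 2]        # "-"
column = align[:, 4]         # "-TT"
block = align[0:2, 1:3]      # sub-alignment with rows "CG" and "C-"
```

With this version [rrr, xxx:yyy:zzz] gives a one-row alignment and 
[rrr:ppp:qqq, xxx] gives a one-column alignment. Switching either to a 
special case would only mean adding another branch before the final return.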

Peter

