[Biopython-dev] Improving the Alignment object. Was Bio.AlignIO
Peter
biopython-dev at maubp.freeserve.co.uk
Wed Jul 25 16:10:43 UTC 2007
Michiel de Hoon wrote:
> Peter wrote:
>> Personally I see an alignment as both an array of characters (i.e. amino
>> acid residues or nucleotides), and a list of sequences.
>>
>> In the same way that a Numeric or NumPy array lets you iterate over
>> rows, yet also access individual elements, we could allow iteration of
>> SeqRecords and also allow access to individual letters.
>
> How about the following:
>
> -Iterators iterate for the SeqRecords in the alignment
I agree. And this is trivial to implement without needing the element
access/slicing support.
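
A minimal sketch of what row iteration could look like, independent of any
indexing support. The names Record and IterableAlignment are hypothetical
stand-ins for illustration (Record takes the place of SeqRecord); this is
not the existing Alignment class.

```python
from collections import namedtuple

# Hypothetical stand-in for SeqRecord (an id plus a plain-string sequence).
Record = namedtuple("Record", ["id", "seq"])


class IterableAlignment:
    """Illustrative only: an alignment that behaves like a sequence of records."""

    def __init__(self, records):
        self._records = list(records)

    def __iter__(self):
        # Yield the records row by row, in alignment order.
        return iter(self._records)

    def __len__(self):
        # Number of rows (sequences), not the number of columns.
        return len(self._records)


alignment = IterableAlignment([Record("alpha", "ACGT"), Record("beta", "AC-T")])
for record in alignment:
    print(record.id, record.seq)
```
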
As to element access, we've been thinking along similar lines :)
It's just that with all the different special cases, there are lots of
different possible return types!
> -An index of the form [xxx] returns the corresponding SeqRecord
> -An index of the form [xxx:yyy:zzz] returns an Alignment object
> containing the SeqRecords in rows [xxx:yyy:zzz]
> (compare to the current method get_all_seqs()).
I agree. This is essential to make an alignment act like a list of
SeqRecord objects when only a one-dimensional index is given.
> -An index of the form [xxx,:] returns the Seq object of the SeqRecord at
> xxx (this is currently done by the get_seq_by_num() method).
> -An index of the form [xxx:yyy:zzz,:] returns a list of Seq objects
I'm not immediately convinced about returning Seq objects here. I might
expect indices like [xxx,:] to return a SeqRecord (not a Seq) and
[xxx:yyy:zzz,:] to return a sub-alignment (not a list of Seq objects).
> -An index of the form [:,www] returns a string containing the characters
> at column www (which is currently done by the get_column method)
> -An index of the form [xxx,www] returns a string containing the
> character of the sequence in row xxx at column www.
Those look fine - however we might want to return Seq objects rather
than strings.
> -An index of the form [xxx:yyy:zzz,www] returns a string containing
> the characters at column www using only the rows xxx:yyy:zzz.
Or a sub-alignment? See below...
> This is more-or-less how Numerical Python arrays work, except that we'll
> be returning SeqRecord/Seq/string objects depending on the indices.
For comparison, this is what I had been thinking:
* [r,c] means one element is requested, return a single character string
* [r] or [r,:] means one row is requested, return a SeqRecord
* [:,c] means one column is requested, return a string (or Seq object?)
* Otherwise returns a (sub)alignment. Note that [:] or [:,:] would
return a copy of the alignment.
This would cover slicing of the column index by returning a
sub-alignment, i.e. for indexes of the form [rrr, xxx:yyy:zzz] or
[rrr:ppp:qqq, xxx:yyy:zzz].
I'm not sure if requests for part of a single row or column like [rrr,
xxx:yyy:zzz] and [rrr:ppp:qqq, xxx] are best handled by returning
sub-alignments or as special cases (strings/Seq and Seq/SeqRecord
respectively?).
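
To make the four rules above concrete, here is a rough sketch of a
__getitem__ that dispatches on the index type. It assumes the
"otherwise return a sub-alignment" option for partial rows and columns,
which is only one of the choices under discussion. Record and
SketchAlignment are hypothetical names (Record stands in for SeqRecord and
holds a plain string), so this is not a proposal for the actual code.

```python
from collections import namedtuple

# Hypothetical stand-in for SeqRecord (an id plus a plain-string sequence).
Record = namedtuple("Record", ["id", "seq"])

FULL = slice(None)  # what Python passes for a bare ":"


class SketchAlignment:
    """Illustrative only: NumPy-style indexing with the return types above."""

    def __init__(self, records):
        self._records = list(records)

    def __len__(self):
        return len(self._records)

    def __iter__(self):
        return iter(self._records)

    def __getitem__(self, index):
        # Treat [r] as [r, :] so only the two-dimensional case remains.
        if isinstance(index, tuple):
            if len(index) != 2:
                raise IndexError("expected [row] or [row, column]")
            row, col = index
        else:
            row, col = index, FULL

        row_is_int = isinstance(row, int)
        col_is_int = isinstance(col, int)

        if row_is_int and col_is_int:
            # [r, c]: a single character
            return self._records[row].seq[col]
        if row_is_int and col == FULL:
            # [r] or [r, :]: one whole row, returned as a record
            return self._records[row]
        if row == FULL and col_is_int:
            # [:, c]: one whole column, returned as a string
            return "".join(rec.seq[col] for rec in self._records)

        # Everything else, including [:] and [:, :], gives a new sub-alignment.
        rows = [self._records[row]] if row_is_int else self._records[row]
        return SketchAlignment(Record(rec.id, rec.seq[col]) for rec in rows)


aln = SketchAlignment(
    [Record("a", "ACGT"), Record("b", "AC-T"), Record("c", "A-GT")]
)
print(aln[1, 2])                       # single character
print(aln[0])                          # one record
print(aln[:, 1])                       # one column as a string
print([r.seq for r in aln[0:2, 1:3]])  # rows of a sub-alignment
```

Handling [rrr, xxx:yyy:zzz] and [rrr:ppp:qqq, xxx] as special cases instead
would only need two more branches before the final fall-through.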
Peter