[Biopython-dev] Alignment object

Tue Mar 2 09:34:07 EST 2010

On Tue, Mar 2, 2010 at 12:36 PM, Kevin Jacobs wrote:
> On Tue, Mar 2, 2010 at 7:25 AM, Peter wrote:
>> My code does not (yet) attempt to deal with next-gen sequencing
>> alignments, which would require padding all the (short) reads with
>> leading and trailing gaps to ensure all rows of the alignment have
>> the same length. Doing this in a memory efficient way could be
>> done with a PaddedSeq object, or a very different alignment object
>> (hold read and their offsets in memory). I'm not sure what is best,
>> but the bx-python model looks worth understanding to help decide.
>>
>> Perhaps until this is settled, it would be premature to merge my
>> alignment class to the trunk. After all, we may need to tweak the
>> alignment object class heirachy.
>
>
> Hi Peter,
>
> I'm just jumping in here and have not yet read all of the background
> material.  However, I am working with next-gen alignments and am
> curious as to what you have in mind.  At first glance, it sounds like
> you want to access aligned reads in a 'pileup' format (i.e., an object
> model akin to http://samtools.sourceforge.net/pileup.shtml).  Or are
> you thinking of something different entirely?

Probably something different. My general concern boils down to
the fact that the current Alignment model as an enhanced "list of
SeqRecord objects" is potentially limiting.

The alignment code in Biopython (and my branch which is basically
an extension to that) deals with classical multiple sequence alignments
like ClustalW etc. You can think of the alignment as a matrix of letters,
each row is a sequence (e.g. a gene), and there will be some gap
characters for insertions, and padding for leading/trailing commissions.
There may or may not be a consensus sequence too.

With assembles you have a (long) consensus with many (short) reads
aligned to it. In order to hold this as a "matrix" representation, all the
(short) reads would require (lots of) leading/trailing padding. The same
applies when mapping reads to a reference genome.

So, while the current object model may work, all this extra padding
might mean too much of a memory overhead (especially as all the
rows are currently stored as SeqRecord objects). Instead, we might
just store the (short) read sequence, name, and its offset (and
perhaps the strand). We can then reconstruct columns or rows
mimicking the "matrix" interpretation on demand. However, the
API should make it easy to get the unpadded reads and their
offsets too - so the current alignment API might either be extended
or perhaps changed.

Related to this, a "Lite" version of the alignment object might
be useful when there is no annotation requiring using SeqRecord
objects. e.g. For ClustalW, FASTA, PHYLIP alignments all we
need is the sequence and identifiers.

Regarding one of your points, accessing aligned reads (or rows)
from an alignment - currently this is only supported by index
(row number). In most cases the reads (rows) have a unique
identifier/name, and thus one idea I am considering for this
branch is overloading the align[...] syntax further to allow a
record's id to be used as an alternative. i.e. More like a dictionary.

Other ideas for enhancements on this branch including sorting
the rows (with a list like sort method, defaulting to sorting on the
record's id strings), per-column annotation (useful for PFAM
alignments and the match string in pairwise alignments), and
a general annotations dictionary (like we have on SeqRecord
objects).

Peter