[BioPython] Alignment class

Jose Blanca jblanca at btc.upv.es
Fri Feb 8 15:23:34 UTC 2008


Hi:
I've been thinking a little about this alignment problem.

On Thursday 07 February 2008 15:59:46 Peter wrote:
> On Feb 7, 2008 2:15 PM, Jan Kosinski <kosa at genesilico.pl> wrote:
> > Peter wrote:
> > > The whole idea behind the current alignment class is that all the
> > > sequences are the same length (often with gaps).
> >
> > I was always wondering what is the reason that you made the alignment
> > class which requires all sequences have the same length (even if incl.
> > gaps)?
>
> The design of the current alignment class predates my involvement, but
> from the point of view of the code (and the column access in
> particular) it assumes the sequences have the same length.  This
> assumption (with leading/trailing gaps) is also common to all the
> alignment file formats I have worked with.  I like this abstraction as
> you can regard the alignment as an array of characters (using matrix
> notation or what ever).

This kind of alignment is useful, but in my opinion it would be better if the 
sequences could have different lengths and start points.

>
> I can see that the EST alignment case is a little different, in that
> by convention the leading/trailing "gaps" are not shown.  It would be
> possible to write a new EST class which stored the sequences without
> leading/trailing "gap"s, but took into account the start offset, and
> would allow access to the "columns" inserting leading/trailing gaps
> where a given sequence has not started or has already finished.  I
> don't see that this would be any more useful (except perhaps for a
> small memory saving)
>
> In general leading/trailing gaps can mean the limits of a gene, or the
> limit of a domain within a gene, or the limits of a sequenced fragment,
> etc.  Sometimes there really is no character to go there, in other
> cases the sequence concerned does continue but for whatever reason it
> was not included in the alignment.
>
> One possibility (depending on what you want to do with the alignment)
> is to use different characters for internal gaps, leading "gaps" and
> trailing "gaps".
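
A minimal sketch of that suggestion, assuming the alignment rows are plain
Python strings that use "-" as the gap character. The function name and the
"<" and ">" markers are hypothetical choices for illustration; they come from
neither Biopython nor the quoted message.

```python
def mark_terminal_gaps(row, gap="-", leading="<", trailing=">"):
    """Replace leading and trailing gap runs with distinct marker characters.

    Internal gaps are left untouched. The marker characters are an
    assumption made for this sketch, not an established convention.
    """
    core = row.strip(gap)
    if not core:
        # A row made only of gaps has no "inside", so leave it unchanged.
        return row
    n_lead = len(row) - len(row.lstrip(gap))
    n_trail = len(row) - len(row.rstrip(gap))
    return leading * n_lead + core + trailing * n_trail


print(mark_terminal_gaps("--AC-GT---"))
```

Each row keeps its full length, so column access still works as before, but
code that cares about the difference can tell "not sequenced here" apart from
a real internal gap.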

That would be a good solution for the EST case, although it could cause some
memory problems with longer sequences.
Anyway, I felt like experimenting a bit, so I looked at BioPerl for inspiration.
For this problem they use ranges and LocatableSeqs. I don't know if we need a
full-featured BioRange class for this problem; I've coded one, but I haven't
used it yet.
I have coded a draft of a LocatableSeq class and I've done some minimal 
modifications to the newAlignment proposal from bug 1944 
(http://bugzilla.open-bio.org/show_bug.cgi?id=1944). Maybe I should have 
created an Alignment subclass, but I think the most relevant change is the 
new LocatableSeq class.
This is not finished work, but it's mostly working and I would like to know
your opinions.
This is my first attempt to create something in Python. I'm ready to learn from
you, and I will take suggestions and criticism with a smile, so don't be
shy.
I guess I may have broken some style rules; I hope to learn them with
some time and help.
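
For anyone who does not want to open the attachments, here is a rough sketch
of the general idea: each sequence is stored without leading or trailing gaps,
together with its start offset, and column access fills in gaps where a
sequence has not started or has already finished. This is only an
illustration and not the attached code; the method names, the get_column and
alignment_length helpers, and the 0-based offsets are all assumptions made for
the sketch.

```python
class LocatableSeq:
    """Illustrative only: a sequence plus its 0-based start column."""

    def __init__(self, seq, start=0):
        if start < 0:
            raise ValueError("start must be >= 0")
        self.seq = seq
        self.start = start

    @property
    def end(self):
        # One past the last alignment column covered by this sequence.
        return self.start + len(self.seq)

    def char_at(self, column, gap="-"):
        # Outside the covered range the sequence has not started or has
        # already finished, so a gap is reported instead.
        if self.start <= column < self.end:
            return self.seq[column - self.start]
        return gap


def get_column(seqs, column, gap="-"):
    """Return one alignment column as a string, padding with gaps."""
    return "".join(s.char_at(column, gap) for s in seqs)


def alignment_length(seqs):
    """Number of columns needed to cover every sequence."""
    return max((s.end for s in seqs), default=0)


ests = [
    LocatableSeq("ACGTAC", start=0),
    LocatableSeq("GTA", start=2),
    LocatableSeq("TACGG", start=3),
]
for col in range(alignment_length(ests)):
    print(col, get_column(ests, col))
```

Only the real residues are stored, so long ESTs that cover a small part of a
contig do not pay for the padding, while column access still behaves as if
every row had the same length.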

Best regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: newAlignment.py
Type: application/x-python
Size: 22982 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20080208/dc25ab39/attachment-0004.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: LocatableSeq.py
Type: application/x-python
Size: 8195 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20080208/dc25ab39/attachment-0005.bin>

