[Biopython-dev] [BioPython] Alignment add_sequence

Thu Feb 7 09:58:10 UTC 2008

On Thursday 07 February 2008 10:33:49 Peter wrote:
> On Feb 7, 2008 8:25 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> > Hi:
> > I think I can't use Bio.SeqIO.to_alignment() because the
> > sequences have different lengths and start at different
> > positions. It's an EST alignment, not a clustal-like one.
> > I have also looked at your proposal in bug 1944 and I really
> > like it, especially the clever __getitem__ method. But I can't
> > use it because of the different lengths of the sequences.
> > I'm going to add an add_seqRecord method. Now, thanks to you I
> > understand why this is not a good solution. But, at least, it
> > will do for this time.
>
> The whole idea behind the current alignment class is that all the
> sequences are the same length (often with gaps).  I don't think this
> fits with your intended usage - unless you pad each record with
> leading gap characters (according to its start) and then pad the end
> until they are all the same length.  You could write a function to
> take a list of SeqRecords and pad them like this (note the example
> will be easier to read in a mono-spaced font):

I could do this, but I don't like the idea. An initial pad is not the same as
a gap. The whole point of the program I'm working on is to look for SNPs and
indels, and this implementation would confuse the indel search.
I have looked at your proposal for the new Alignment implementation, and the
more I look at it, the more I like the idea of subclassing from list. Maybe
the only problem is that it shouldn't be a list of SeqRecords. A sequence in
an alignment is a SeqRecord located at a given position. Maybe the Alignment
class could take that into account internally. In that case I don't know how
to create a simple API that could deal both with the case of start=0 and with
the more complex case of start <> 0. A possible solution could be to accept
SeqRecords and tuples like (SeqRecord, start) in the constructor.
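
A rough sketch of what such a constructor might look like, only to make the
idea concrete. The class name OffsetAlignment and its column() method are
hypothetical and are not part of Biopython or of the proposal in bug 1944;
plain strings stand in for SeqRecord objects. Positions a sequence does not
cover are reported as None, so they cannot be mistaken for a real gap:

```python
class OffsetAlignment(list):
    """Hypothetical list subclass holding (sequence, start) pairs."""

    def __init__(self, items=()):
        super().__init__(self._normalize(item) for item in items)

    @staticmethod
    def _normalize(item):
        # A bare sequence is shorthand for (sequence, 0).
        if isinstance(item, tuple):
            seq, start = item
            return (seq, start)
        return (item, 0)

    def column(self, pos):
        """Return one entry per row at alignment coordinate pos.

        None marks a row that does not cover pos, which keeps it
        distinct from a gap character ("-") inside a sequence.
        """
        col = []
        for seq, start in self:
            idx = pos - start
            col.append(seq[idx] if 0 <= idx < len(seq) else None)
        return col


aln = OffsetAlignment([
    "AGGCCTGAGGCCCCTTTT",          # consensus, start 0 implied
    ("CGCAGGCCCGAGGCC", -3),
    ("GGCCTGAGGCCCCTT", 1),
    ("CTGAGGCCACTTTTTCGC", 4),
])
print(aln.column(5))
```

With something like this the simple case stays simple (pass the bare
sequence) and only the rows with start <> 0 need the tuple form.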

>
> e.g.
>
> CONSENSUS:       AGGCCTGAGGCCCCTTTT, start 0
> EST1     :    CGCAGGCCCGAGGCC, start -3
> EST2     :        GGCCTGAGGCCCCTT, start 1
> EST3     :           CTGAGGCCACTTTTTCGC, start 4
>
> In this case we want to add (start+3) gaps to each line, where -3 =
> min(starts). This becomes:
>
> ---AGGCCTGAGGCCCCTTTT, start 0
> CGCAGGCCCGAGGCC, start -3
> ----GGCCTGAGGCCCCTT, start 1
> -------CTGAGGCCACTTTTTCGC, start 4
>
> Then work out the maximum length, and pad all the sequences with trailing
> gaps:
>
> ---AGGCCTGAGGCCCCTTTT----
> CGCAGGCCCGAGGCC----------
> ----GGCCTGAGGCCCCTT------
> -------CTGAGGCCACTTTTTCGC
>
> A little bit of work, but now all the sequences are the same length
> and the Biopython alignment class will be happy.
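
For reference, the padding steps described above could be sketched roughly
as follows. This is only an illustration: the function name
pad_to_alignment is made up, nothing like it is built into Biopython, and
plain strings are used instead of SeqRecord objects.

```python
def pad_to_alignment(seqs_with_starts, gap="-"):
    """Pad (sequence, start) pairs so that all rows have equal length."""
    # Shift every row so the left-most sequence begins at column 0.
    min_start = min(start for _, start in seqs_with_starts)
    padded = [gap * (start - min_start) + seq
              for seq, start in seqs_with_starts]
    # Fill the right-hand side up to the longest padded row.
    width = max(len(row) for row in padded)
    return [row.ljust(width, gap) for row in padded]


rows = pad_to_alignment([
    ("AGGCCTGAGGCCCCTTTT", 0),     # consensus
    ("CGCAGGCCCGAGGCC", -3),       # EST1
    ("GGCCTGAGGCCCCTT", 1),        # EST2
    ("CTGAGGCCACTTTTTCGC", 4),     # EST3
])
for row in rows:
    print(row)
```

Note that the leading and trailing pads use the same character as real
gaps, so after this step the two can no longer be told apart.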
>
> As far as I know, there is nothing for this built into Biopython at
> the moment.  Could you tell us what your input file looks like (e.g.
> link to the file format?)
The alignment was originally done by CAP3, but the data is in a MySQL database.
I'm using EST2uni (http://bioinf.comav.upv.es/est2uni/). I have fetched the
information from the database and set up the SeqRecord objects, and now
I'm trying to create the Alignment object.
>
> Peter

Thanks,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)