[BioPython] Alignment add_sequence

Thu Feb 7 09:33:49 UTC 2008

On Feb 7, 2008 8:25 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> Hi:
> I think I can't use Bio.SeqIO.to_alignment() because the
> sequences have different lengths and start at different
> positions. It's an EST alignment, not a clustal-like one.
> I have also looked at your proposal in bug 1944 and I really
> like it, especially the clever __getitem__ method. But I can't
> use it because of the different lengths of the sequences.
> I'm going to add an add_seqRecord method. Now, thanks to you I
> understand why this is not a good solution. But, at least, it
> will do for this time.

The whole idea behind the current alignment class is that all the
sequences are the same length (often with gaps).  I don't think this
fits with your intended usage - unless you pad each record with
leading gap characters (according to its start) and then pad the end
until they are all the same length.  You could write a function to
take a list of SeqRecords and pad them like this (note the example
will be easier to read in a mono-spaced font):

e.g.

CONSENSUS:       AGGCCTGAGGCCCCTTTT, start 0
EST1     :    CGCAGGCCCGAGGCC, start -3
EST2     :        GGCCTGAGGCCCCTT, start 1
EST3     :           CTGAGGCCACTTTTTCGC, start 4

In this case we want to add (start - min(starts)) leading gaps to
each line.  Here min(starts) = -3, so each line gets (start + 3)
gaps.  This becomes:

---AGGCCTGAGGCCCCTTTT, start 0
CGCAGGCCCGAGGCC, start -3
----GGCCTGAGGCCCCTT, start 1
-------CTGAGGCCACTTTTTCGC, start 4

Then work out the maximum length, and pad all the sequences with trailing gaps:

---AGGCCTGAGGCCCCTTTT----
CGCAGGCCCGAGGCC----------
----GGCCTGAGGCCCCTT------
-------CTGAGGCCACTTTTTCGC

A little bit of work, but now all the sequences are the same length
and the Biopython alignment class will be happy.
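
The padding steps above can be sketched in plain Python.  This is only
an illustration, not something that ships with Biopython: the function
name pad_to_same_length and the (sequence, start) tuple layout are
assumptions made for this example.  It works on plain strings; the
padded strings would still need to be wrapped in SeqRecord objects
before being handed to the alignment class.

```python
def pad_to_same_length(seqs_with_starts, gap="-"):
    """Pad (sequence, start) pairs so that all strings have equal length.

    Hypothetical helper for illustration; 'gap' must be a single character.
    """
    if not seqs_with_starts:
        return []
    # The leftmost sequence defines column 0 of the padded alignment.
    min_start = min(start for _, start in seqs_with_starts)
    # Leading gaps: (start - min_start) gap characters per sequence.
    shifted = [gap * (start - min_start) + seq
               for seq, start in seqs_with_starts]
    # Trailing gaps: pad every string up to the longest one.
    max_len = max(len(s) for s in shifted)
    return [s.ljust(max_len, gap) for s in shifted]


# The consensus and three ESTs from the example, with their start offsets.
records = [
    ("AGGCCTGAGGCCCCTTTT", 0),
    ("CGCAGGCCCGAGGCC", -3),
    ("GGCCTGAGGCCCCTT", 1),
    ("CTGAGGCCACTTTTTCGC", 4),
]
padded = pad_to_same_length(records)
for row in padded:
    print(row)
```

Run on the four sequences from the example, this should print the same
four padded lines shown above.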

As far as I know, there is nothing for this built into Biopython at
the moment.  Could you tell us what your input file looks like (e.g.
a link to a description of the file format)?

Peter