[BioPython] Alignment add_sequence
Peter
biopython-dev at maubp.freeserve.co.uk
Thu Feb 7 09:33:49 UTC 2008
On Feb 7, 2008 8:25 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> Hi:
> I think I can't use Bio.SeqIO.to_alignment() because the
> sequences have different lengths and start at different
> positions. It's an EST alignment, not a clustal-like one.
> I have also looked at your proposal in bug 1944 and I really
> like it, especially the clever __getitem__ method. But I can't
> use it because of the different lengths of the sequences.
> I'm going to add an add_seqRecord method. Now, thanks to you I
> understand why this is not a good solution. But, at least, it
> will do for this time.
The whole idea behind the current alignment class is that all the
sequences are the same length (often with gaps). I don't think this
fits with your intended usage - unless you pad each record with
leading gap characters (according to its start) and then pad the end
until they are all the same length. You could write a function to
take a list of SeqRecords and pad them like this (note the example
will be easier to read in a mono-spaced font):
e.g.
CONSENSUS: AGGCCTGAGGCCCCTTTT, start 0
EST1     : CGCAGGCCCGAGGCC, start -3
EST2     : GGCCTGAGGCCCCTT, start 1
EST3     : CTGAGGCCACTTTTTCGC, start 4
In this case we want to add (start - min(starts)) leading gaps to each
line. Here min(starts) = -3, so that is (start + 3) gaps. This becomes:
---AGGCCTGAGGCCCCTTTT, start 0
CGCAGGCCCGAGGCC, start -3
----GGCCTGAGGCCCCTT, start 1
-------CTGAGGCCACTTTTTCGC, start 4
Then work out the maximum length, and pad all the sequences with trailing gaps:
---AGGCCTGAGGCCCCTTTT----
CGCAGGCCCGAGGCC----------
----GGCCTGAGGCCCCTT------
-------CTGAGGCCACTTTTTCGC
A little bit of work, but now all the sequences are the same length
and the Biopython alignment class will be happy.
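The padding steps above can be sketched in plain Python. This is only an illustration, not anything built into Biopython: the function name pad_to_common_length and the (name, sequence, start) tuple layout are assumptions made for the example, and it works on plain strings rather than SeqRecord objects.

```python
def pad_to_common_length(records, gap="-"):
    """Pad sequences with leading and trailing gaps to one common length.

    `records` is a list of (name, sequence, start) tuples. This input
    shape is hypothetical, chosen only for this sketch.
    """
    if not records:
        return []
    # The leftmost start defines column zero of the padded alignment.
    min_start = min(start for _, _, start in records)
    # Leading gaps: shift each sequence right by its offset from min_start.
    shifted = [
        (name, gap * (start - min_start) + seq)
        for name, seq, start in records
    ]
    # Trailing gaps: extend every row to the length of the longest one.
    width = max(len(seq) for _, seq in shifted)
    return [(name, seq.ljust(width, gap)) for name, seq in shifted]


records = [
    ("CONSENSUS", "AGGCCTGAGGCCCCTTTT", 0),
    ("EST1", "CGCAGGCCCGAGGCC", -3),
    ("EST2", "GGCCTGAGGCCCCTT", 1),
    ("EST3", "CTGAGGCCACTTTTTCGC", 4),
]
for name, seq in pad_to_common_length(records):
    print(f"{name:<10}{seq}")
```

The padded strings could then be turned back into SeqRecord objects and added to the alignment one by one.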
As far as I know, there is nothing for this built into Biopython at
the moment. Could you tell us what your input file looks like (e.g.
a link to the file format)?
Peter