[Bioperl-l] Creating a MSA from a set of pairwise alignments with a common reference sequence?

Dan Bolser dan.bolser at gmail.com
Thu Aug 20 15:00:41 UTC 2009


Hi,

Quick version: How do I get a column of Bio::SimpleAlign using
ungapped 'reference' sequence coordinates?



Longer version:

I have a set of pairwise alignments that I would like to process into
a 'multiple sequence alignment' (MSA). All the alignments are short
sequence 'contigs' aligned to a 'reference' sequence, so one sequence
in all the pairwise alignments is constant (making the resulting MSA
unambiguous).

I came up with the following pseudo-code to create a MSA
(Bio::SimpleAlign) from the set of pairwise alignments...

initialise:
    Create an 'empty' Bio::SimpleAlign from the REFERENCE sequence.

for each pairwise alignment:
    Create a Bio::LocatableSeq from the given fragment of the
    REFERENCE sequence (using ungapped REFERENCE coordinates).

    for each gap in the REFERENCE sequence:
        Take the position of the gap (in ungapped REFERENCE
        coordinates) and look up the corresponding column of the MSA
        (in ungapped REFERENCE coordinates).
	
	for each sequence in the column:
	    Check if there is a gap-character at this position.
	
	if any sequence has a non gap-character at this position:
	    Stick a gap in the MSA just before this position.

    Create a Bio::LocatableSeq from the CONTIG sequence (using
    ungapped REFERENCE coordinates) and add it to the
    Bio::SimpleAlign.

done.


I would very much appreciate, 1) feedback on the correctness of the
above algorithm (it could be horribly wrong), and 2) advice on how to
get a column of the alignment using ungapped REFERENCE coordinates?


Sorry if this is a solved problem (where is it solved?). If not, and
if I can get it working, I'll try to write a generic function to merge
two MSAs when they have a reference sequence in common.


For your reference, the pairwise alignments come from the show-aligns
command in the MUMmer sequence alignment package, and have the
following format:

my.reference.fasta my.contigs.multi.fasta

============================================================
-- Alignments between REFERENCE and CONTIG00012

-- BEGIN alignment [ +1 29237 - 45714 | +1 1 - 16441 ]


29237      aataacctctttaag.taatatttttctctggtcccaacttgcgccaat
1          aataa.ctctttaagataatatttttctctggtcccgacttgggccaat
                ^         ^                    ^     ^

29286      ggaaaaaaatcacttattcgataa.ataataagataaatatattttcta
49         ggaaaaaaatcactatttcgataagataataagata.atatattttcaa
                         ^^        ^           ^          ^

29335      aagacccctacataaatatatggtcccattaatattataaattaataat
97         aagacccctatataaatatatggtctcattaatattataaattaataat
                     ^              ^

...


For further reference:

This thread:
http://bioperl.org/pipermail/bioperl-l/2009-July/030643.html

http://www.bioperl.org/wiki/Align_Refactor

http://www.bioperl.org/wiki/Alignment_object



All the best,
Dan.



More information about the Bioperl-l mailing list