[Bioperl-l] Creating a MSA from a set of pairwise alignments with a common reference sequence?

Thu Aug 20 16:42:23 UTC 2009

Hi Dan,

I think you want the Bio::LocatableSeq method "column_from_residue_number".

You might also try combining your pairwise alignments using the profile 
alignment option in ClustalW.

Cheers.
Roy.

Dan Bolser wrote:
> Hi,
> 
> Quick version: How do I get a column of Bio::SimpleAlign using
> ungapped 'reference' sequence coordinates?
> 
> 
> 
> Longer version:
> 
> I have a set of pairwise alignments that I would like to process into
> a 'multiple sequence alignment' (MSA). All the alignments are short
> sequence 'contigs' aligned to a 'reference' sequence, so one sequence
> in all the pairwise alignments is constant (making the resulting MSA
> unambiguous).
> 
> I came up with the following pseudo-code to create a MSA
> (Bio::SimpleAlign) from the set of pairwise alignments...
> 
> initialise:
>     Create an 'empty' Bio::SimpleAlign from the REFERENCE sequence.
> 
> for each pairwise alignment:
>     Create a Bio::LocatableSeq from the given fragment of the
>     REFERENCE sequence (using ungapped REFERENCE coordinates).
> 
>     for each gap in the REFERENCE sequence:
>         Take the position of the gap (in ungapped REFERENCE
>         coordinates) and look up the corresponding column of the MSA
>         (in ungapped REFERENCE coordinates).
> 	
> 	for each sequence in the column:
> 	    Check if there is a gap-character at this position.
> 	
> 	if any sequence has a non gap-character at this position:
> 	    Stick a gap in the MSA just before this position.
> 
>     Create a Bio::LocatableSeq from the CONTIG sequence (using
>     ungapped REFERENCE coordinates) and add it to the
>     Bio::SimpleAlign.
> 
> done.
> 
> 
> I would very much appreciate, 1) feedback on the correctness of the
> above algorithm (it could be horribly wrong), and 2) advice on how to
> get a column of the alignment using ungapped REFERENCE coordinates?
> 
> 
> Sorry if this is a solved problem (where is it solved?). If not, and
> if I can get it working, I'll try to write a generic function to merge
> two MSAs when they have a reference sequence in common.
> 
> 
> For your reference, the pairwise alignments come from the show-aligns
> command in the MUMmer sequence alignment package, and have the
> following format:
> 
> my.reference.fasta my.contigs.multi.fasta
> 
> ============================================================
> -- Alignments between REFERENCE and CONTIG00012
> 
> -- BEGIN alignment [ +1 29237 - 45714 | +1 1 - 16441 ]
> 
> 
> 29237      aataacctctttaag.taatatttttctctggtcccaacttgcgccaat
> 1          aataa.ctctttaagataatatttttctctggtcccgacttgggccaat
>                 ^         ^                    ^     ^
> 
> 29286      ggaaaaaaatcacttattcgataa.ataataagataaatatattttcta
> 49         ggaaaaaaatcactatttcgataagataataagata.atatattttcaa
>                          ^^        ^           ^          ^
> 
> 29335      aagacccctacataaatatatggtcccattaatattataaattaataat
> 97         aagacccctatataaatatatggtctcattaatattataaattaataat
>                      ^              ^
> 
> ...
> 
> 
> For further reference:
> 
> This thread:
> http://bioperl.org/pipermail/bioperl-l/2009-July/030643.html
> 
> http://www.bioperl.org/wiki/Align_Refactor
> 
> http://www.bioperl.org/wiki/Alignment_object
> 
> 
> 
> All the best,
> Dan.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l