[Bioperl-l] Creating a MSA from a set of pairwise alignments with a common reference sequence?
Dan Bolser
dan.bolser at gmail.com
Thu Aug 20 15:00:41 UTC 2009
Hi,
Quick version: How do I get a column of Bio::SimpleAlign using
ungapped 'reference' sequence coordinates?
Longer version:
I have a set of pairwise alignments that I would like to process into
a 'multiple sequence alignment' (MSA). All the alignments are short
sequence 'contigs' aligned to a 'reference' sequence, so one sequence
in all the pairwise alignments is constant (making the resulting MSA
unambiguous).
I came up with the following pseudo-code to create a MSA
(Bio::SimpleAlign) from the set of pairwise alignments...
initialise:
Create an 'empty' Bio::SimpleAlign from the REFERENCE sequence.
for each pairwise alignment:
Create a Bio::LocatableSeq from the given fragment of the
REFERENCE sequence (using ungapped REFERENCE coordinates).
for each gap in the REFERENCE sequence:
Take the position of the gap (in ungapped REFERENCE
coordinates) and look up the corresponding column of the MSA
(in ungapped REFERENCE coordinates).
for each sequence in the column:
Check if there is a gap-character at this position.
if any sequence has a non gap-character at this position:
Stick a gap in the MSA just before this position.
Create a Bio::LocatableSeq from the CONTIG sequence (using
ungapped REFERENCE coordinates) and add it to the
Bio::SimpleAlign.
done.
I would very much appreciate, 1) feedback on the correctness of the
above algorithm (it could be horribly wrong), and 2) advice on how to
get a column of the alignment using ungapped REFERENCE coordinates?
Sorry if this is a solved problem (where is it solved?). If not, and
if I can get it working, I'll try to write a generic function to merge
two MSAs when they have a reference sequence in common.
For your reference, the pairwise alignments come from the show-aligns
command in the MUMmer sequence alignment package, and have the
following format:
my.reference.fasta my.contigs.multi.fasta
============================================================
-- Alignments between REFERENCE and CONTIG00012
-- BEGIN alignment [ +1 29237 - 45714 | +1 1 - 16441 ]
29237 aataacctctttaag.taatatttttctctggtcccaacttgcgccaat
1 aataa.ctctttaagataatatttttctctggtcccgacttgggccaat
^ ^ ^ ^
29286 ggaaaaaaatcacttattcgataa.ataataagataaatatattttcta
49 ggaaaaaaatcactatttcgataagataataagata.atatattttcaa
^^ ^ ^ ^
29335 aagacccctacataaatatatggtcccattaatattataaattaataat
97 aagacccctatataaatatatggtctcattaatattataaattaataat
^ ^
...
For further reference:
This thread:
http://bioperl.org/pipermail/bioperl-l/2009-July/030643.html
http://www.bioperl.org/wiki/Align_Refactor
http://www.bioperl.org/wiki/Alignment_object
All the best,
Dan.
More information about the Bioperl-l
mailing list