[Biopython] SIBsim4 alignment support
Martin Mokrejs
mmokrejs at ribosome.natur.cuni.cz
Tue May 4 08:27:14 EDT 2010
Hi,
I wonder whether anybody has time to write a parser for the
output of:
SIBsim4 -A 4 chr.fasta spliced_mRNA.fasta
SIBsim4 -A 4 chr.fasta spliced_mRNA_rc.fasta
SIBsim4 -A 4 chr_rc.fasta spliced_mRNA.fasta
SIBsim4 -A 4 chr_rc.fasta spliced_mRNA_rc.fasta
The alignment is oriented by "->" or "<-", and the word "(complement)" may
appear in the output. The program reports results in the orientation of the
chromosome, so querying a sense mRNA against a chromosome and getting a match
on the minus strand gives the reverse-complemented mRNA in the output, which
is of course not optimal.
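A minimal sketch of undoing that orientation for a minus-strand hit. The
helper name to_sense and its on_minus_strand flag are hypothetical; how the
flag is derived from "<-" or "(complement)" in the output is left to the
caller.

```python
# Hypothetical helper, not part of SIBsim4 or Biopython.
# Only plain A/C/G/T/N symbols are complemented; other characters pass through.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def to_sense(seq, on_minus_strand):
    """Return seq as-is for plus-strand hits, reverse-complemented otherwise."""
    if on_minus_strand:
        # Complement every base, then reverse the string.
        return seq.translate(_COMPLEMENT)[::-1]
    return seq
```

With Biopython installed, Bio.Seq.Seq(...).reverse_complement() would do the
same job and also handle ambiguity codes.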
You can get it from http://sibsim4.sourceforge.net/ . This is a nice program
to inspect exon/intron boundaries and I would like to get the sequences of
the individual HSPs corresponding to the exons but fixed by the genomic
sequence. SIBsim4 does not print the number of identities/similarities
within each HSP, but that would be the next thing I would do in Python. ;)
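A rough sketch of how identities could be counted from the middle (match)
line of an alignment block. It assumes that "|" marks an identical column,
a space a mismatch and "-" a gap, which is what the output in this message
suggests but which I have not checked against the SIBsim4 sources. The
function name hsp_identity is hypothetical.

```python
def hsp_identity(match_line):
    """Return (identities, aligned_columns) for one match line.

    Assumptions: '|' = identity, ' ' = mismatch, '-' = gap. Any other
    character (for example the '>>>...>>>' intron marker) is ignored.
    The caller must pass only the aligned columns, without the left
    margin or trailing padding, because spaces count as mismatches.
    """
    identities = match_line.count("|")
    aligned = sum(match_line.count(symbol) for symbol in "| -")
    return identities, aligned
```

Summing the two numbers over all match lines that belong to one exon would
give a per-HSP identity count.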
I could probably write the parser myself but would need some time to
learn the structure of the Bio.AlignIO code, and from a quick glance over
Bio/AlignIO/FastaIO.py I am not sure how much time I would need. ;)
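As a starting point independent of Bio.AlignIO, here is a minimal sketch
that parses one exon summary line such as
"44155-44217 (35-97) 100% -> (GT/AG) 24" with a regular expression. The
names Exon, EXON_RE and parse_exon_line are hypothetical, and the pattern is
inferred only from the sample output in this message; the trailing number is
kept as-is without interpreting it.

```python
import re
from collections import namedtuple

# Hypothetical record type; field names are illustrative only.
Exon = namedtuple(
    "Exon",
    "g_start g_end q_start q_end identity arrow left_site right_site trailing",
)

EXON_RE = re.compile(
    r"^\s*(\d+)-(\d+)\s+\((\d+)-(\d+)\)\s+(\d+)%"  # coordinates and percent
    r"(?:\s+(->|<-|==))?"                          # optional orientation mark
    r"(?:\s+\((\w+)/(\w+)\)\s+(\d+))?\s*$"         # optional "(GT/AG) 24"
)


def parse_exon_line(line):
    """Return an Exon for an exon summary line, or None for any other line."""
    m = EXON_RE.match(line)
    if m is None:
        return None
    g_start, g_end, q_start, q_end, identity = (int(x) for x in m.groups()[:5])
    arrow, left_site, right_site, trailing = m.groups()[5:]
    if trailing is not None:
        trailing = int(trailing)
    return Exon(g_start, g_end, q_start, q_end, identity,
                arrow, left_site, right_site, trailing)
```

Lines that are not exon summaries (headers, rulers, sequence rows) simply
return None, so the function can be applied to every line of the output.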
There is some fun if one hits duplicated genes with similar copies
on the chromosome, like in this case:
SIBsim4 -A 4 NT_078297.fasta XM_001473524.fasta
>149234350 Mus musculus chromosome 1 genomic contig, strain C57BL/6J.; LEN=70622195
>gi|149234181|ref|XM_001473524.1| PREDICTED: Mus musculus similar to SP140 nuclear body protein family member (LOC100039794), mRNA; LEN=1916
44155-44217 (35-97) 100% -> (GT/AG) 24
44544-44624 (98-178) 100% -> (GT/AG) 24
49140-49241 (179-280) 100% -> (GT/AG) 24
51030-51059 (281-310) 100% -> (GT/AG) 24
51605-51648 (311-354) 100% -> (GT/AC) 22
51986-52030 (355-399) 100% -> (GT/AG) 22
53987-54091 (400-505) 99% -> (GT/AG) 24
56009-56151 (506-648) 100% -> (GT/AG) 24
59086-59133 (649-696) 100% -> (GT/AG) 24
61331-61372 (697-738) 100% -> (GT/AG) 24
64542-64657 (739-854) 100% -> (GT/AG) 24
65350-65455 (855-960) 100% -> (GT/AG) 24
65743-65820 (961-1038) 100% -> (GT/AG) 24
66011-66154 (1039-1182) 100% -> (GT/AG) 24
67403-68136 (1183-1916) 100%
0 . : . : . : . : . :
44155 CATCCAAAACTAATGATGAACAAGCAGAGGAGATGCTTAGCCTCCCACAA
||||||||||||||||||||||||||||||||||||||||||||||||||
35 CATCCAAAACTAATGATGAACAAGCAGAGGAGATGCTTAGCCTCCCACAA
50 . : . : . : . : . :
44205 TGCAATGGAGGAGGTG...CAGAGGGAAGTAGTTCTTGTGAACAAACGTG
|||||||||||||>>>...>>>||||||||||||||||||||||||||||
85 TGCAATGGAGGAG AGGGAAGTAGTTCTTGTGAACAAACGTG
100 . : . : . : . : . :
44572 TGATGAACAAGAGCCCCAGGATGACCTGCCCTCATCCCTGAGACAAGAAG
||||||||||||||||||||||||||||||||||||||||||||||||||
126 TGATGAACAAGAGCCCCAGGATGACCTGCCCTCATCCCTGAGACAAGAAG
150 . : . : . : . : . :
44622 CAGGTG...CAGGAGCACAGCAACCCACACGTGAAAAGAAGTGTTCCTGT
|||>>>...>>>||||||||||||||||||||||||||||||||||||||
176 CAG GAGCACAGCAACCCACACGTGAAAAGAAGTGTTCCTGT
200 . : . : . : . : . :
49178 GTCATGTGTTCCCCAACATATGTGCCAGAAGACCTGGAAGCAAGGATGGG
||||||||||||||||||||||||||||||||||||||||||||||||||
217 GTCATGTGTTCCCCAACATATGTGCCAGAAGACCTGGAAGCAAGGATGGG
250 . : . : . : . : . :
49228 AAACAGCCAAGGAGGTA...CAGGATGCCTCCCTTTCTCCTTCCATTTCC
||||||||||||||>>>...>>>|||||||||||||||||||||||||||
267 AAACAGCCAAGGAG GATGCCTCCCTTTCTCCTTCCATTTCC
300 . : . : . : . : . :
51057 CCTGTG...GAGACAGGCAGACCATGTCTGAGAGAACAAAGAGCAAAGGA
|||>>>...>>>||||||||||||||||||||||||||||||||||||||
308 CCT ACAGGCAGACCATGTCTGAGAGAACAAAGAGCAAAGGA
350 . : . : . : . : . :
51643 ATGAACCTT...GTCTGGTGTAAGCCCCGCTGGCATGATATGATCCCACT
||||||>>>...>>>|||||||||||||||||||||||||||||||||||
349 ATGAAC TGGTGTAAGCCCCGCTGGCATGATATGATCCCACT
400 . : . : . : . : . :
52021 GATGTGTTCTGTG...CAG GTCTAAGAAGACGCAGAAAAGAAAATGCCA
||||||||||>>>...>>>-||||||||||||||||||||||||||||||
390 GATGTGTTCT CGTCTAAGAAGACGCAGAAAAGAAAATGCCA
[cut]
>149234350 Mus musculus chromosome 1 genomic contig, strain C57BL/6J.; LEN=70622195
>gi|149234181|ref|XM_001473524.1| PREDICTED: Mus musculus similar to SP140 nuclear body protein family member (LOC100039794), mRNA; LEN=1916
167701-167736 (35-70) 97% ==
168083-168168 (96-178) 88% -> (GT/AG) 24
172953-173054 (179-280) 98% -> (GT/AG) 24
181004-181033 (281-310) 100% -> (GT/AG) 24
181579-181622 (311-354) 100% -> (GT/AG) 21
181960-182004 (355-399) 100% -> (GT/AG) 22
183357-183461 (400-505) 98% -> (GT/AG) 24
185375-185517 (506-648) 100% -> (GT/AG) 24
188456-188503 (649-696) 97% -> (GT/AG) 24
190721-190762 (697-738) 100% -> (GT/AG) 24
194630-194745 (739-854) 96% -> (GT/AG) 24
195439-195544 (855-960) 100% -> (GT/AG) 24
195832-195909 (961-1038) 100% -> (GT/AG) 24
196100-196243 (1039-1182) 99% -> (GT/AG) 23
197481-198214 (1183-1916) 97%
0 . : . : . : .
167701 CATCCAAAACGAATGATGAACAAGCAGAGGAGATGC
|||||||||| |||||||||||||||||||||||||
35 CATCCAAAACTAATGATGAACAAGCAGAGGAGATGC
0 . : . : . : . : . :
168083 AGAGGGAAGTAATTCTTGTGAACAAACAAGACAAACAAGACAAGAGCCCC
||||||||||| |||||||||||||||--| | |-|||||||||||
96 AGAGGGAAGTAGTTCTTGTGAACAAAC GTGTGATGA ACAAGAGCCCC
50 . : . : . : . : . :
168133 AGGATGACCTGCCCTCATCCCTGAGACAAGAAGCAGGTG...CAGGAGCA
||||||||||||||||||||||||||||||||||||>>>...>>>|||||
143 AGGATGACCTGCCCTCATCCCTGAGACAAGAAGCAG GAGCA
100 . : . : . : . : . :
172958 CAGCAACCCACACGTGAAAAGAAGTGTTCCTGTGTCATATGTTCCCCAAC
|||||||||||||||||||||||||||||||||||||| |||||||||||
184 CAGCAACCCACACGTGAAAAGAAGTGTTCCTGTGTCATGTGTTCCCCAAC
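Since one run can report several hits like the two above, a parser would
first have to split the output into per-hit chunks. A minimal sketch,
assuming each hit begins with a run of header lines starting with ">" in
the first column and that no other line starts with ">" there (the intron
marker rows are indented in the real output). The name split_hits is
hypothetical.

```python
def split_hits(text):
    """Split SIBsim4-style output into a list of hits (lists of lines).

    A new hit starts at a '>' line that is not directly preceded by
    another '>' line, so the genomic and mRNA headers stay together.
    """
    hits = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        is_header = line.startswith(">")
        follows_header = i > 0 and lines[i - 1].startswith(">")
        if is_header and not follows_header:
            hits.append([])
        if hits:
            # Lines before the first header are ignored.
            hits[-1].append(line)
    return hits
```

Each chunk could then be handed to whatever code parses the exon table and
the alignment blocks of a single hit.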
Any opinions on how to tackle this?
Thanks,
Martin