[Biopython] SIBsim4 alignment support

Martin Mokrejs mmokrejs at ribosome.natur.cuni.cz
Tue May 4 08:27:14 EDT 2010


Hi,
   I wonder whether anybody has time to write a parser for the
output of:
SIBsim4 -A 4 chr.fasta spliced_mRNA.fasta
SIBsim4 -A 4 chr.fasta spliced_mRNA_rc.fasta
SIBsim4 -A 4 chr_rc.fasta spliced_mRNA.fasta
SIBsim4 -A 4 chr_rc.fasta spliced_mRNA_rc.fasta

The alignment is oriented by "->" or "<-", and the word "(complement)" may
appear in the output. The program reports results in the orientation of the
chromosome, so querying a chromosome with a sense mRNA that matches on the
minus strand gives the reverse-complemented mRNA in the output, which is
of course not optimal.
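
To put such a minus-strand hit back into mRNA sense, the aligned query has to
be reverse-complemented again. A minimal sketch in plain Python, assuming an
unambiguous DNA alphabet plus N; the function name is only illustrative and
is not part of SIBsim4 or Biopython:

```python
# Complement table for plain DNA letters (upper and lower case) plus N.
# Any other character (gaps, IUPAC ambiguity codes) is passed through as-is.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string."""
    return sequence.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("GATTACA"))
```

Biopython's Seq objects offer a reverse_complement() method that does the
same job and also handles ambiguity codes.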

You can get it from http://sibsim4.sourceforge.net/ . It is a nice program
for inspecting exon/intron boundaries, and I would like to get the sequences
of the individual HSPs corresponding to the exons, but corrected to match the
genomic sequence. SIBsim4 does not print the number of
identities/similarities within each HSP, but that is the next thing I would
do in Python. ;)
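
Counting identities per HSP could be sketched roughly as follows. This
assumes, going only by the sample output further down, that "|" in the middle
line of each alignment block marks an identical column and that a run of ">"
and "." characters marks an intron. The function name is hypothetical:

```python
import re

# An intron marker looks like ">>>...>>>" in the sample output. It may be
# wrapped across two alignment blocks, so whitespace inside it is tolerated.
# At least one ">" is required, so plain spaces (mismatches) never split.
INTRON = re.compile(r"[>.\s]*>[>.\s]*")


def identities_per_exon(match_lines):
    """Count '|' columns per exon from consecutive match lines.

    match_lines: the middle line of each alignment block, in order.
    Returns one count per segment between intron markers.
    """
    joined = " ".join(line.rstrip("\n") for line in match_lines)
    return [segment.count("|") for segment in INTRON.split(joined)]
```

The leading padding of each match line does not need to be stripped, because
only "|" characters are counted.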

   I could probably go and write the parser myself, but I would need some
time to learn the structure of the Bio.AlignIO code ... and from a quick
glance at Bio/AlignIO/FastaIO.py I am not sure how much time I would need. ;)

   Things get more interesting when one hits duplicated genes with similar
copies on the chromosome, as in this case:

SIBsim4 -A 4 NT_078297.fasta XM_001473524.fasta



>149234350 Mus musculus chromosome 1 genomic contig, strain C57BL/6J.; LEN=70622195
>gi|149234181|ref|XM_001473524.1| PREDICTED: Mus musculus similar to SP140 nuclear body protein family member (LOC100039794), mRNA; LEN=1916

44155-44217  (35-97)   100% -> (GT/AG) 24
44544-44624  (98-178)   100% -> (GT/AG) 24
49140-49241  (179-280)   100% -> (GT/AG) 24
51030-51059  (281-310)   100% -> (GT/AG) 24
51605-51648  (311-354)   100% -> (GT/AC) 22
51986-52030  (355-399)   100% -> (GT/AG) 22
53987-54091  (400-505)   99% -> (GT/AG) 24
56009-56151  (506-648)   100% -> (GT/AG) 24
59086-59133  (649-696)   100% -> (GT/AG) 24
61331-61372  (697-738)   100% -> (GT/AG) 24
64542-64657  (739-854)   100% -> (GT/AG) 24
65350-65455  (855-960)   100% -> (GT/AG) 24
65743-65820  (961-1038)   100% -> (GT/AG) 24
66011-66154  (1039-1182)   100% -> (GT/AG) 24
67403-68136  (1183-1916)   100%
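
The exon summary lines above are regular enough for a regular expression. The
following is only a sketch based on the lines shown in this message: the field
names are made up for illustration, the pattern has not been checked against
the SIBsim4 source, and the trailing number is stored without interpreting it.

```python
import re

# genomic range, (mRNA range), percent identity, then three optional fields:
# a link symbol ("->", "<-" or "=="), a splice signal such as (GT/AG),
# and a trailing integer.
EXON_LINE = re.compile(
    r"^(?P<g_start>\d+)-(?P<g_end>\d+)\s+"
    r"\((?P<m_start>\d+)-(?P<m_end>\d+)\)\s+"
    r"(?P<identity>\d+)%"
    r"(?:\s+(?P<link>->|<-|==))?"
    r"(?:\s+\((?P<splice>[A-Z]{2}/[A-Z]{2})\))?"
    r"(?:\s+(?P<trailing>\d+))?\s*$"
)

INT_FIELDS = ("g_start", "g_end", "m_start", "m_end", "identity", "trailing")


def parse_exon_line(line):
    """Return a dict for one exon summary line, or None if it does not match."""
    match = EXON_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    for key in INT_FIELDS:
        if record[key] is not None:
            record[key] = int(record[key])
    return record


if __name__ == "__main__":
    print(parse_exon_line("44155-44217  (35-97)   100% -> (GT/AG) 24"))
```

Lines that are not exon summaries (headers, alignment rows) simply return
None, so the function can be applied to every line of a hit.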

       0     .    :    .    :    .    :    .    :    .    :
   44155 CATCCAAAACTAATGATGAACAAGCAGAGGAGATGCTTAGCCTCCCACAA
         ||||||||||||||||||||||||||||||||||||||||||||||||||
      35 CATCCAAAACTAATGATGAACAAGCAGAGGAGATGCTTAGCCTCCCACAA

      50     .    :    .    :    .    :    .    :    .    :
   44205 TGCAATGGAGGAGGTG...CAGAGGGAAGTAGTTCTTGTGAACAAACGTG
         |||||||||||||>>>...>>>||||||||||||||||||||||||||||
      85 TGCAATGGAGGAG         AGGGAAGTAGTTCTTGTGAACAAACGTG

     100     .    :    .    :    .    :    .    :    .    :
   44572 TGATGAACAAGAGCCCCAGGATGACCTGCCCTCATCCCTGAGACAAGAAG
         ||||||||||||||||||||||||||||||||||||||||||||||||||
     126 TGATGAACAAGAGCCCCAGGATGACCTGCCCTCATCCCTGAGACAAGAAG

     150     .    :    .    :    .    :    .    :    .    :
   44622 CAGGTG...CAGGAGCACAGCAACCCACACGTGAAAAGAAGTGTTCCTGT
         |||>>>...>>>||||||||||||||||||||||||||||||||||||||
     176 CAG         GAGCACAGCAACCCACACGTGAAAAGAAGTGTTCCTGT

     200     .    :    .    :    .    :    .    :    .    :
   49178 GTCATGTGTTCCCCAACATATGTGCCAGAAGACCTGGAAGCAAGGATGGG
         ||||||||||||||||||||||||||||||||||||||||||||||||||
     217 GTCATGTGTTCCCCAACATATGTGCCAGAAGACCTGGAAGCAAGGATGGG

     250     .    :    .    :    .    :    .    :    .    :
   49228 AAACAGCCAAGGAGGTA...CAGGATGCCTCCCTTTCTCCTTCCATTTCC
         ||||||||||||||>>>...>>>|||||||||||||||||||||||||||
     267 AAACAGCCAAGGAG         GATGCCTCCCTTTCTCCTTCCATTTCC

     300     .    :    .    :    .    :    .    :    .    :
   51057 CCTGTG...GAGACAGGCAGACCATGTCTGAGAGAACAAAGAGCAAAGGA
         |||>>>...>>>||||||||||||||||||||||||||||||||||||||
     308 CCT         ACAGGCAGACCATGTCTGAGAGAACAAAGAGCAAAGGA

     350     .    :    .    :    .    :    .    :    .    :
   51643 ATGAACCTT...GTCTGGTGTAAGCCCCGCTGGCATGATATGATCCCACT
         ||||||>>>...>>>|||||||||||||||||||||||||||||||||||
     349 ATGAAC         TGGTGTAAGCCCCGCTGGCATGATATGATCCCACT

     400     .    :    .    :    .    :    .    :    .    :
   52021 GATGTGTTCTGTG...CAG GTCTAAGAAGACGCAGAAAAGAAAATGCCA
         ||||||||||>>>...>>>-||||||||||||||||||||||||||||||
     390 GATGTGTTCT         CGTCTAAGAAGACGCAGAAAAGAAAATGCCA

[cut]


>149234350 Mus musculus chromosome 1 genomic contig, strain C57BL/6J.; LEN=70622195
>gi|149234181|ref|XM_001473524.1| PREDICTED: Mus musculus similar to SP140 nuclear body protein family member (LOC100039794), mRNA; LEN=1916

167701-167736  (35-70)   97% ==
168083-168168  (96-178)   88% -> (GT/AG) 24
172953-173054  (179-280)   98% -> (GT/AG) 24
181004-181033  (281-310)   100% -> (GT/AG) 24
181579-181622  (311-354)   100% -> (GT/AG) 21
181960-182004  (355-399)   100% -> (GT/AG) 22
183357-183461  (400-505)   98% -> (GT/AG) 24
185375-185517  (506-648)   100% -> (GT/AG) 24
188456-188503  (649-696)   97% -> (GT/AG) 24
190721-190762  (697-738)   100% -> (GT/AG) 24
194630-194745  (739-854)   96% -> (GT/AG) 24
195439-195544  (855-960)   100% -> (GT/AG) 24
195832-195909  (961-1038)   100% -> (GT/AG) 24
196100-196243  (1039-1182)   99% -> (GT/AG) 23
197481-198214  (1183-1916)   97%

       0     .    :    .    :    .    :    .
  167701 CATCCAAAACGAATGATGAACAAGCAGAGGAGATGC
         |||||||||| |||||||||||||||||||||||||
      35 CATCCAAAACTAATGATGAACAAGCAGAGGAGATGC

       0     .    :    .    :    .    :    .    :    .    :
  168083 AGAGGGAAGTAATTCTTGTGAACAAACAAGACAAACAAGACAAGAGCCCC
         ||||||||||| |||||||||||||||--|    |  |-|||||||||||
      96 AGAGGGAAGTAGTTCTTGTGAACAAAC  GTGTGATGA ACAAGAGCCCC

      50     .    :    .    :    .    :    .    :    .    :
  168133 AGGATGACCTGCCCTCATCCCTGAGACAAGAAGCAGGTG...CAGGAGCA
         ||||||||||||||||||||||||||||||||||||>>>...>>>|||||
     143 AGGATGACCTGCCCTCATCCCTGAGACAAGAAGCAG         GAGCA

     100     .    :    .    :    .    :    .    :    .    :
  172958 CAGCAACCCACACGTGAAAAGAAGTGTTCCTGTGTCATATGTTCCCCAAC
         |||||||||||||||||||||||||||||||||||||| |||||||||||
     184 CAGCAACCCACACGTGAAAAGAAGTGTTCCTGTGTCATGTGTTCCCCAAC


Any opinions on how to tackle this?
Thanks,
Martin

