[Biopython] SIBsim4 alignment support

Tue May 4 10:22:52 EDT 2010

Hi Peter,

Peter wrote:
> On Tue, May 4, 2010 at 1:27 PM, Martin Mokrejs
> <mmokrejs at ribosome.natur.cuni.cz>  wrote:
>> Hi,
>>   I wonder whether there is anybody having time to write a parser for the
>> output of:
>> SIBsim4 -A 4 chr.fasta spliced_mRNA.fasta
>> SIBsim4 -A 4 chr.fasta spliced_mRNA_rc.fasta
>> SIBsim4 -A 4 chr_rc.fasta spliced_mRNA.fasta
>> SIBsim4 -A 4 chr_rc.fasta spliced_mRNA_rc.fasta
>>
>> ...
>>
>> You can get it from http://sibsim4.sourceforge.net/ . This is a nice program
>> to inspect exon/intron boundaries and I would like to get the sequences of
>> the individual HSPs corresponding to the exons but fixed by the genomic
>> sequence. SIBsim4 does not print out number of identities/similarities
>> within each HSP but that would be the next I would do in python. ;)
>>
>>   I could probably go and write the parser but would need some time to
>> learn the structure of Bio.AlignIO code ... and from a quick glance over
>> Bio/AlignIO/FastaIO.py I am not sure how much time I would need. ;)
>
> Looking at the FASTA m10 alignment parser is sensible in that it is another
> pairwise alignment format - but it isn't the nicest parser in the world.
>
> How much of the data do you actually care about? Just the pairwise
> alignment (two sequences)? Right now annotation support is limited

Getting just the two sequences, without their coordinates on the chromosome
and in the mRNA, would help but is not "enough" for my _future_ work - see below. ;)

> in the alignment object - but this is something I am working on (but
> not likely to be in the imminent Biopython 1.54 release).
>
> Related to the above, which of the output formats are you planning to
> support? http://sibsim4.sourceforge.net/manpage.html

In brief, "-A 4" gives the full output (the example I gave is not optimal: the
mRNA has no poly(A) tail, so you do not see one reported in the output).
What I want to get is just the sequences corrected using the genome.

So, parsing out just the coordinates could be fine, but if the alignment does
not start at base 1 or does not end at the physical end of the mRNA, I would
like to keep the "crappy" unaligned ends of the mRNA/EST sequence
prepended/appended to the internal region fixed by the genomic sequence.
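
To make that concrete, here is a minimal sketch of the stitching step. It
assumes the exon coordinates have already been parsed from the SIBsim4 output
into tuples; the function name, the tuple layout and the 1-based inclusive
coordinates are my own assumptions for illustration, and only the forward
strand is handled.

```python
def corrected_transcript(mrna, genome, exons):
    """Return the mRNA with its aligned part replaced by genomic sequence.

    exons is a hypothetical list of (g_start, g_end, m_start, m_end) tuples,
    1-based and inclusive, sorted in transcript order, forward strand only.
    """
    if not exons:
        # Nothing aligned, so there is nothing to correct.
        return mrna
    first_m_start = exons[0][2]
    last_m_end = exons[-1][3]
    # Unaligned low-quality ends are kept exactly as they are in the mRNA/EST.
    prefix = mrna[:first_m_start - 1]
    suffix = mrna[last_m_end:]
    # The aligned core is taken from the genome, exon by exon, skipping introns.
    core = "".join(genome[g_start - 1:g_end] for g_start, g_end, _, _ in exons)
    return prefix + core + suffix


# Toy data: two 8 nt exons separated by an 8 nt intron.
genome = "ACGTACGT" + "GTAAAAAG" + "TTGGCCAA"
mrna = "GG" + "ACGAACGT" + "TTGGCCTA" + "CC"  # one mismatch in each exon
exons = [(1, 8, 3, 10), (17, 24, 11, 18)]
fixed = corrected_transcript(mrna, genome, exons)
```

Here the two unaligned bases at each end survive untouched, while both
mismatches inside the exons are replaced by the genomic bases.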

Alternatively, parsing out the sequence of the chromosome while ripping off the

GTA...CAG
>>>...>>>

or

CTG...TAC
<<<...<<<

splice junctions is another way, but again, I would want to prepend/append the
low-quality ends.
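
A rough sketch of that second approach, working on one pair of alignment
lines. The layout is a simplifying assumption on my side, not the documented
SIBsim4 format: I assume the genomic line and the marker line below it are
column-aligned, and that intron columns are flagged in the marker line with
'>', '<' or '.'. The function name is hypothetical.

```python
def strip_introns(genomic_line, marker_line):
    """Drop every genomic column that the marker line flags as intron.

    Assumed layout: intron columns carry '>', '<' or '.' in the marker
    line; any other character (or a missing one) means an exon column.
    """
    # Pad the marker line in case trailing whitespace was trimmed.
    marker_line = marker_line.ljust(len(genomic_line))
    return "".join(
        base
        for base, mark in zip(genomic_line, marker_line)
        if mark not in "<>."
    )


forward = strip_introns("ACGTGTA...CAGTTGG", "||||>>>...>>>||||")
reverse = strip_introns("ACGTCTG...TACTTGG", "    <<<...<<<")
```

The low-quality mRNA ends would still have to be prepended/appended
afterwards, as described above.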

In the future, I would like to use the coordinates of the individual exons on
the chromosome, of the corresponding regions in the transcript, and the
identity values of each HSP shown in the output.

I would also use the actual boundary bases (gt..ag) of each intron, and will
probably go on to calculate the type of the intron with respect to the ORF
(type 0 for introns starting/ending exactly at a codon boundary, type 1 for
those with a 1 nt overhang, type 2 for those with a 2 nt overhang). But it
probably does not make sense to accommodate that in the alignment object. ;-)
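
For what it is worth, the intron type calculation itself would be tiny once
the coordinates are available. A sketch, assuming the ORF start and the last
exon base before the intron are both known in 1-based transcript coordinates
(the function and parameter names are made up for illustration):

```python
def intron_type(cds_start, last_exon_base):
    """Return 0, 1 or 2: the number of codon bases left hanging before the intron.

    Both arguments are 1-based transcript coordinates (an assumption of
    this sketch): cds_start is the first base of the ORF, last_exon_base
    is the last transcript base upstream of the intron.
    """
    if last_exon_base < cds_start:
        raise ValueError("intron lies upstream of the coding region")
    coding_bases_before_intron = last_exon_base - cds_start + 1
    return coding_bases_before_intron % 3


type_between_codons = intron_type(1, 3)   # intron right after a full codon
type_one_overhang = intron_type(1, 4)     # one base of the next codon precedes it
type_two_overhang = intron_type(4, 8)     # ORF starts at base 4, five coding bases
```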

Do you want to work on this project? ;-)

Martin