[Bioperl-l] cigar string in GenericHSP

Jason Stajich jason at cgt.mc.duke.edu
Tue Mar 11 21:59:41 EST 2003


On Tue, 11 Mar 2003, Jason Stajich wrote:

> okay -
>
> This is also possible via:
>
> $hsp->get_aln->cigar_line();
>
> But I am fine adding it to the HSP if it is faster.

Ah, but now that I look closer, the cigar line format in SimpleAlign is
MSA-specific and different, so never mind - glad you got it to fit into HSP.
We may look at generalizing it to fit into HSPI so that other
implementations can make use of it too.  Or perhaps moving it to
SearchUtils.  We'll see.

Good stuff though - thanks for putting it in.

-j

>
> -jason
> On Wed, 12 Mar 2003, Juguang Xiao wrote:
>
> > Hi all,
> >
> > I added one method in Bio::Search::HSP::GenericHSP, named cigar_string.
> > The cigar string issue arises when we try to annotate a genome and store
> > it in an ensembl 9 or later database. I attach the concept of the cigar
> > string at the end of this email.
> >
> > Now you can have a very simple script to get the cigar string from an
> > hsp, which works for all flavors of blast.
> >
> > use Bio::SearchIO;
> >
> > my $factory = Bio::SearchIO->new(-format => 'blast',
> >                                  -file   => 't/data/blast.report');
> > # the hsp is supposed to be a GenericHSP
> > my $hsp = $factory->next_result->next_hit->next_hsp;
> > my $cigar_string = $hsp->cigar_string;
> >
> > Besides this, I also wrote a static method, generate_cigar_string, which
> > builds the cigar string from 2 equal-length sequences; you can use it
> > more directly if you already have the aligned sequences.
> >
> > my $qstr = 'tacgcta--tacgcta--cactg-c';
> > my $hstr = 'tac---tacgt----ctacgca---cc';
> > my $cigar_string =
> >     Bio::Search::HSP::GenericHSP::generate_cigar_string($qstr, $hstr);
> >
> > t/cigarstring.t serves as the test.
> >
> > Suggestions or questions? Thanks
> >
> > Juguang
> >
> > ----------
> > Copied from ensembl doc.
> >
> > Sequence alignment hits were previously stored within the core database
> > as ungapped alignments. This imposed 2 major constraints on alignments:
> >
> > a) alignments for a single hit record would require multiple rows in the
> > database, and
> > b) it was not possible to accurately retrieve the exact original
> > alignment.
> >
> > Therefore, in the new branch sequence alignments are now stored as
> > gapped alignments in the cigar line format (where CIGAR stands for
> > Concise Idiosyncratic Gapped Alignment Report).
> >
> > In the cigar line format alignments are stored as follows:
> >
> > M: Match
> > D: Deletion
> > I: Insertion
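
As a rough illustration of the format listed above, a cigar line can be split into (count, operation) pairs with a small regex loop. This is only a sketch in plain Perl, not BioPerl or Ensembl code: the names parse_cigar and aligned_lengths are made up here, a missing count is assumed to mean 1, and D/I are assumed to mean a gap in the query and a gap in the hit respectively, as in the protein example later in this message.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl method): split a cigar line into
# [count, op] pairs. A missing count is treated as 1.
sub parse_cigar {
    my ($cigar) = @_;
    my @ops;
    while ($cigar =~ /\G(\d*)([MDI])/gc) {
        push @ops, [ length($1) ? $1 : 1, $2 ];
    }
    # Anything left unparsed means the line contained an unknown character.
    die "unexpected character in cigar line '$cigar'\n"
        if (pos($cigar) // 0) != length($cigar);
    return @ops;
}

# Hypothetical helper: number of residues the alignment covers on each side.
# Assumption: D = gap in the query, I = gap in the hit.
sub aligned_lengths {
    my ($cigar) = @_;
    my ($query_len, $hit_len) = (0, 0);
    for my $pair (parse_cigar($cigar)) {
        my ($n, $op) = @$pair;
        $query_len += $n if $op ne 'D';
        $hit_len   += $n if $op ne 'I';
    }
    return ($query_len, $hit_len);
}

my ($qlen, $hlen) = aligned_lengths('7M4D12M2I2MD7M');
print "query residues: $qlen, hit residues: $hlen\n";
```

Under those assumptions this prints "query residues: 30, hit residues: 33", which agrees with the number of non-gap characters on each line of the example alignment.
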
> >
> > An example of an alignment for a hypothetical protein match is shown
> > below:
> >
> >
> > Query:   42 PGPAGLP----GSVGLQGPRGLRGPLP-GPLGPPL...
> >              PG    P    G     GP   R      PLGP
> > Sbjct: 1672 PGTP*TPLVPLGPWVPLGPSSPR--LPSGPLGPTD...
> >
> >
> > This alignment is stored in the protein_align_feature table as the
> > following cigar line:
> >
> > 7M4D12M2I2MD7M
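
The cigar line above can be reproduced from the two aligned strings with a simple run-length pass. The following is a minimal sketch in plain Perl, not the GenericHSP implementation: the function name cigar_from_aligned is hypothetical, '-' is assumed to be the gap character, and the D/I convention (gap in query gives D, gap in hit gives I, a count of 1 is left out) is inferred from the example itself.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): build a cigar line from two
# gapped strings of equal length, with '-' as the gap character.
sub cigar_from_aligned {
    my ($qstr, $hstr) = @_;
    die "aligned strings differ in length\n"
        unless length($qstr) == length($hstr);

    my ($cigar, $prev, $count) = ('', '', 0);
    for my $i (0 .. length($qstr) - 1) {
        my $q = substr($qstr, $i, 1);
        my $h = substr($hstr, $i, 1);
        next if $q eq '-' && $h eq '-';    # skip columns gapped on both sides

        # Assumed convention: gap in query => D, gap in hit => I, else M
        my $op = $q eq '-' ? 'D' : $h eq '-' ? 'I' : 'M';
        if ($op eq $prev) {
            $count++;
        }
        else {
            # Flush the finished run; a count of 1 is written without a number
            $cigar .= ($count > 1 ? $count : '') . $prev if $count;
            ($prev, $count) = ($op, 1);
        }
    }
    $cigar .= ($count > 1 ? $count : '') . $prev if $count;
    return $cigar;
}

my $query = 'PGPAGLP----GSVGLQGPRGLRGPLP-GPLGPPL';
my $sbjct = 'PGTP*TPLVPLGPWVPLGPSSPR--LPSGPLGPTD';
print cigar_from_aligned($query, $sbjct), "\n";
```

With the Query and Sbjct lines from the example (minus the coordinates and the trailing dots) this prints 7M4D12M2I2MD7M. The length check is there because a run-length pass over columns only makes sense when both strings have the same number of columns.
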
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> >
>
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu

