[Bioperl-l] cigar string in GenericHSP

Marc Logghe Marc.Logghe at devgen.com
Wed Mar 12 11:12:13 EST 2003


Hi,
I was only wondering, if you really would like to reproduce the original
blast alignment, is the original homology string included in the cigar
string ? If I recall well, the homology sequence is not included in the aln
object and if you want to  add it explicitely, bioperl complains about the
bad sequence.
Also, I was missing a writer method for blast alignments. I have written my
own , but did not yet try to plug it in Bio::SearchIO::blast. If someone
else would like to give it a try, I'll include the code I currently have
($LINE_LEN is global and should be set to 60):

sub write_result
{
  my ( $self, $hsp ) = @_;
  my $aln;

  my $len = $hsp->length;
  my $l   = length($len);
  my $p   = ' ' x ( $l + 8 );

  my $cnt = 0;
  while ( ( $cnt * 60 ) < $len )
  {
    my ( $chunk, $start, $end );
    ($start) = $hsp->range('query');
    ( $chunk, $start, $end ) =
      $self->_get_chunk( $hsp->query_string, $start, $cnt );
    $aln .= sprintf( "Query: %-${l}s %s %d\n", $start, $chunk, $end );
    $aln .= sprintf( "$p%s\n",
                 substr( $hsp->homology_string, $cnt * $LINE_LEN, $LINE_LEN
) );
    ($start) = $hsp->range('hit');
    ( $chunk, $start, $end ) =
      $self->_get_chunk( $hsp->hit_string, $start, $cnt );
    $aln .= sprintf( "Sbjct: %-${l}s %s %d\n", $start, $chunk, $end );
    $aln .= "\n";
    $cnt++;
  }
  return $aln;
}

sub _get_chunk
{
  my ( undef, $g, $start, $cnt ) = @_;
  my $u = substr( $g, 0, $cnt * $LINE_LEN );
  my $gchunk = my $uchunk = substr( $g, $cnt * $LINE_LEN, $LINE_LEN );
  $uchunk =~ s/-//g;
  $u      =~ s/-//g;
  my $start_in_u = length($u) + $start;
  my $stop_in_u  = $start_in_u + length($uchunk) - 1;
  return ( $gchunk, $start_in_u, $stop_in_u );
}

Regards,
Marc

> -----Original Message-----
> From: Jason Stajich [mailto:jason at cgt.mc.duke.edu]
> Sent: Wednesday, March 12, 2003 3:48 AM
> To: Juguang Xiao
> Cc: Bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] cigar string in GenericHSP
> 
> 
> okay -
> 
> This is also possible via:
> 
> $hsp->get_aln->cigar_line();
> 
> But I am fine adding it to the HSP if it is faster.
> 
> -jason
> On Wed, 12 Mar 2003, Juguang Xiao wrote:
> 
> > Hi all,
> >
> > I added one method in Bio::Search::HSP::GenericHSP, named 
> cigar_string.
> > The Cigar string issue raises when we try to annotate 
> genome and store
> > into ensembl 9 and above database. I attach the concept of 
> cigar string
> > at the end of this email.
> >
> > Now you can have a very simple script to get cigar string from hsp,
> > which works for all favors of blast.
> >
> > my $factory = new Bio::SearchIO( -format => 'blast', -file =>
> > 't/data/blast.report');
> > my $hsp = $factory->next_result->next_hit->next_hsp; # 
> supposed to be
> > GenericHSP
> > my $cigar_string = $hsp->cigar_string;
> >
> > Beside this, I also wrote a static method to 
> generate_cigar_string from
> > 2 equal-length seqence, and you can use it more directly if 
> you have a
> > alignment sequence.
> >
> > my $qstr = 'tacgcta--tacgcta--cactg-c';
> > my $hstr = 'tac---tacgt----ctacgca---cc';
> > my $cigar_string = 
> Bio::Search::HSP::GenericHSP::generate_cigar_string
> > ($qstr, $hstr);
> >
> > t/cigarstring.t is serving to test.
> >
> > Suggestions or questions? Thanks
> >
> > Juguang
> >
> > ----------
> > Copied from ensembl doc.
> >
> > Sequence alignment hits were previously stored within the 
> core database
> > as
> > ungapped alignments. This imposed 2 major constraints on alignments:
> >
> > a) alignments for a single hit record would require 
> multiple rows in the
> > database, and
> > b) it was not possible to accurately retrieve the exact original
> > alignment.
> >
> > Therefore, in the new branch sequence alignments are now stored as
> > ungapped
> > alignments in the cigar line format (where CIGAR stands for Concise
> > Idiosyncratic Gapped Alignment Report).
> >
> > In the cigar line format alignments are sotred as follows:
> >
> > M: Match
> > D: Deletino
> > I: Insertion
> >
> > An example of an alignment for a hypthetical protein match is shown
> > below:
> >
> >
> > Query:   42 PGPAGLP----GSVGLQGPRGLRGPLP-GPLGPPL...
> >              PG    P    G     GP   R      PLGP
> > Sbjct: 1672 PGTP*TPLVPLGPWVPLGPSSPR--LPSGPLGPTD...
> >
> >
> > protein_align_feature table as the following cigar line:
> >
> > 7M4D12M2I2MD7M

> 


More information about the Bioperl-l mailing list