[Bioperl-l] parser for BLAT cDNA-genome alignment in Sim4 format

Jason Stajich jason at bioperl.org
Thu Jan 11 22:08:32 UTC 2007


Is the BLAT SIM4 output qualitatively different from "regular" Sim4  
output?  Doesn't it make sense to just fix the parser to be more  
flexible?  There is also a sim4 parser for query/hit style (rather  
than gene-style features you get from the Bio::Tools:: parser) in  
SearchIO::sim4 - curious if it fails for BLAT-created sim4 output?

-jason

On Jan 11, 2007, at 9:26 AM, Hong Xu wrote:

> Dear all,
>
> I have written a module for parsing BLAT result in Sim4 format. I
> borrowed the code from Bio::Tools::Sim4::Results. Hope this code will
> be useful to other people.
>
> regards,
>
> Hong
>
> ------------- see code below -------------------------
>
> #
> # BioPerl module for Bio::Tools::Sim4::BlatResults
> #
> # Cared for by Hong Xu <hxu.hong at gmail.com>
> # Modified from Bio::Tools::Sim4::BlatResults
> # that was developed by:
> #     Ewan Birney <birney at sanger.ac.uk>
> #     and Hilmar Lapp <hlapp at gmx.net>
> #
> # Copyright Ewan Birney, Hilmar Lapp, Hong Xu
> #
> # You may distribute this module under the same terms as perl itself
>
> # POD documentation - main docs before the code
>
> =head1 NAME
>
> Bio::Tools::Sim4::BlatResults - Results of one BLAT run with Sim4  
> output
>
> =head1 SYNOPSIS
>
>    # to preset the order of EST and genomic file as given on the sim4
>    # command line:
>    my $sim4 = Bio::Tools::Sim4::BlatResults->new(-file =>  
> 'result.sim4',
>                                                  -estfirst => 1);
>    # to let the order be determined automatically (by length  
> comparison):
>    $sim4 = Bio::Tools::Sim4::BlatResults->new( -file =>  
> 'sim4.results' );
>    # filehandle:
>    $sim4 = Bio::Tools::Sim4::BlatResults->new( -fh   => \*INPUT );
>
>    # parse the results
>    while(my $exonset = $sim4->next_exonset()) {
>        # $exonset is-a Bio::SeqFeature::Generic with  
> Bio::Tools::Sim4::Exons
>        # as sub features
>        print "Delimited on sequence ", $exonset->seq_id(),
>              "from ", $exonset->start(), " to ", $exonset->end(),  
> "\n";
>        foreach my $exon ( $exonset->sub_SeqFeature() ) {
> 	  # $exon is-a Bio::SeqFeature::FeaturePair
> 	  print "Exon from ", $exon->start, " to ", $exon->end,
>                 " on strand ", $exon->strand(), "\n";
>           # you can get out what it matched using the est_hit  
> attribute
>           my $homol = $exon->est_hit();
>           print "Matched to sequence ", $homol->seq_id,
>                 " at ", $homol->start," to ", $homol->end, "\n";
>       }
>    }
>
>    # essential if you gave a filename at initialization (otherwise  
> the file
>    # stays open)
>    $sim4->close();
>
> =head1 DESCRIPTION
>
> The sim4 module provides a parser and results object for sim4  
> output from BLAT.
> The sim4 results are specialised types of SeqFeatures, meaning you  
> can add them
> to AnnSeq objects fine, and manipulate them in the "normal"  
> seqfeature manner.
>
> The sim4 Exon objects are Bio::SeqFeature::FeaturePair inherited  
> objects. The
> $esthit = $exon-E<gt>est_hit() is the alignment as a feature on the  
> matching
> object (normally, an EST), in which the start/end points are where  
> the hit
> lies.
>
> To make this module work sensibly you need to run
>
>      blat -out=sim4 genomic.fasta est.fasta exon.sim4
>
> One fiddle here is that there are only two real possibilities to  
> the matching
> criteria: either one sequence needs reversing or not. Because of  
> this, it
> is impossible to tell whether the match is in the forward or  
> reverse strand
> of the genomic DNA. We solve this here by assuming that the genomic  
> DNA is
> always forward. As a consequence, the strand attribute of the  
> matching EST is
> unknown, and the strand attribute of the genomic DNA (i.e., the  
> Exon object)
> will reflect the direction of the hit.
>
> =head1 FEEDBACK
>
> =head2 Mailing Lists
>
> User feedback is an integral part of the evolution of this and other
> Bioperl modules. Send your comments and suggestions preferably to one
> of the Bioperl mailing lists.  Your participation is much appreciated.
>
>   bioperl-l at bioperl.org          - General discussion
>   http://bio.perl.org/MailList.html             - About the mailing  
> lists
>
> =head2 Reporting Bugs
>
> Report bugs to the Bioperl bug tracking system to help us keep track
> the bugs and their resolution.  Bug reports can be submitted via email
> or the web:
>
>   bioperl-bugs at bio.perl.org
>   http://bugzilla.bioperl.org/
>
> =head1 AUTHOR - Ewan Birney, Hilmar Lapp, Hong Xu
>
> Email birney at sanger.ac.uk
>       hlapp at gmx.net (or hilmar.lapp at pharma.novartis.com)
>       hxu.hong at gmail.com
>
> Describe contact details here
>
> =head1 APPENDIX
>
> The rest of the documentation details each of the object methods.
> Internal methods are usually preceded with a _
>
> =cut
>
>
> # Let the code begin...
>
>
> package Bio::Tools::Sim4::BlatResults;
> use vars qw(@ISA);
> use strict;
>
> # Object preamble - inherits from Bio::Root::Object
>
> use File::Basename;
> use Bio::Root::Root;
> use Bio::Tools::AnalysisResult;
> use Bio::Tools::Sim4::Exon;
>
> @ISA = qw(Bio::Tools::AnalysisResult);
>
>
> sub _initialize_state {
>     my($self, at args) = @_;
>
>     # call the inherited method first
>     my $make = $self->SUPER::_initialize_state(@args);
>
>     my ($est_is_first) = $self->_rearrange([qw(ESTFIRST)], @args);
>
>     delete($self->{'_est_is_first'});
>     $self->{'_est_is_first'} = $est_is_first if(defined 
> ($est_is_first));
>     $self->analysis_method("BLAT");
> }
>
> =head2 analysis_method
>
>  Usage     : $sim4->analysis_method();
>  Purpose   : Inherited method. Overridden to ensure that the name  
> matches
>              /blat/i.
>  Returns   : String
>  Argument  : n/a
>
> =cut
>
> #-------------
> sub analysis_method {
> #-------------
>     my ($self, $method) = @_;
>     if($method && ($method !~ /blat/i)) {
>        $self->throw("method $method not supported in " . ref($self));
>     }
>     return $self->SUPER::analysis_method($method);
> }
>
> =head2 parse_next_alignment
>
>  Title   : parse_next_alignment
>  Usage   : @exons = $sim4_result->parse_next_alignment;
>            foreach $exon (@exons) {
>                # do something
>            }
>  Function: Parses the next alignment of the BLAT sim4 result file  
> and returns
>            the found exons as an array of Bio::Tools::Sim4::Exon  
> objects. Call
>            this method repeatedly until an empty array is returned  
> to get the
>            results for all alignments.
>
>            The $exon->seq_id() attribute will be set to the  
> identifier of the
>            respective sequence for both sequences. The length is  
> accessible
>            via the seqlength() attribute of $exon->query() and
>            $exon->est_hit().
>
>            It automatically determines which of the two sequences  
> has been
>            reversed, and adjusts the coordinates for that sequence.  
> It will
>            also detect whether the EST sequence(s) were given as  
> first or as
>            second file to sim4, unless this has been specified at  
> creation
>            time of the object.
>
>  Example :
>  Returns : An array of Bio::Tools::Sim4::Exon objects
>  Args    :
>
>
> =cut
>
> sub parse_next_alignment {
>    my ($self) = @_;
>    my @exons = ();
>    my %seq1props = ();
>    my %seq2props = ();
>    # we refer to the properties of each seq by reference
>    my ($estseq, $genomseq, $to_reverse);
>    my $started = 0;
>    my $hit_direction = 1;
>
>    while(defined($_ = $self->_readline())) {
>
>      #
>      # bascially, each sim4 'hit' starts with seq1...
>      #
>      /^seq1/ && do {
>        if($started) {
> 	       $self->_pushback($_);
> 	       last;
> 	     }
> 	     $started = 1;
>
> 	     # seqname and length of seq 1
> 	     /^seq1\s+=\s+(\S+)\,\s+(\d+)/ ||
> 	       $self->throw("Sim4 parsing error on seq1 [$_] line. Sorry!");
> 	     $seq1props{'seqname'} = $1;
> 	     $seq1props{'length'} = $2;
> 	     next;
>      };
>      /^seq2/ && do {
>        /^seq2\s+=\s+(\S+)\,\s+(\d+)/ ||
> 	       $self->throw("Sim4 parsing error on seq2 [$_] line. Sorry!");
> 	     $seq2props{'seqname'} = $1;
> 	     $seq2props{'length'} = $2;
> 	     next;
>      };
>      /^\(complement\)/ && do {
> 	     $hit_direction = -1;
> 	     next;
>      };
>
>      # this matches
>      # start-end (start-end) pctid%
>      if(/(\d+)-(\d+)\s+\((\d+)-(\d+)\)\s+(\d+)%/) {
>  	     $seq1props{'start'} = $1;
>  	     $seq1props{'end'} = $2;
>  	     $seq2props{'start'} = $3;
>  	     $seq2props{'end'} = $4;
> 	     my $pctid   = $5;
> 	
> 	     if(! defined($estseq)) {
> 	       # for the first time here: need to set the references  
> referring
> 	       # to seq1 and seq2
> 	       if(! exists($self->{'_est_is_first'})) {
> 		       # detect which one is the EST by looking at the lengths,
> 		       # and assume that this holds throughout the entire result
> 		       # file (i.e., when this method is called for the next
> 		       # alignment, this will not be checked again)
> 		       if($seq1props{'length'} > $seq2props{'length'}) {
> 		         $self->{'_est_is_first'} = 0;
> 		       } else {
> 		         $self->{'_est_is_first'} = 1;
> 		       }
> 	       }
> 	       if($self->{'_est_is_first'}) {
> 		       $estseq = \%seq1props;
> 		       $genomseq = \%seq2props;
> 	       } else {
> 		       $estseq = \%seq2props;
> 		       $genomseq = \%seq1props;
> 	       }
> 	     }
> 	     if($hit_direction == -1) {
> 	       # we have to reverse the coordinates of one of both seqs
> 	       my $tmp = $to_reverse->{'start'};
> 	       $to_reverse->{'start'} =
> 		     $to_reverse->{'length'} - $to_reverse->{'end'} + 1;
> 	       $to_reverse->{'end'} = $to_reverse->{'length'} - $tmp + 1;
> 	     }
> 	     # create and initialize the exon object
> 	     my $exon = Bio::Tools::Sim4::Exon->new(
> 					    '-start' => $genomseq->{'start'},
> 					    '-end'   => $genomseq->{'end'},
> 					    '-strand' => $hit_direction);
>        $exon->seq_id($genomseq->{'seqname'});
> 	     # feature1 is supposed to be initialized to a Similarity object,
>        # but we provide a safety net
> 	     if($exon->feature1()->can('seqlength')) {
> 	       $exon->feature1()->seqlength($genomseq->{'length'});
> 	     } else {
> 	       $exon->feature1()->add_tag_value('SeqLength',
> 					 $genomseq->{'length'});
> 	     }
> 	     # create and initialize the feature wrapping the 'hit' (the EST)
> 	     my $fea2 = Bio::SeqFeature::Similarity->new(
>                     '-start' => $estseq->{'start'},
> 					          '-end'   => $estseq->{'end'},
> 					          '-strand' => 0,
> 					          '-primary' => "aligning_EST");
>        $fea2->seq_id($estseq->{'seqname'});
>        $fea2->seqlength($estseq->{'length'});
> 	     # store
> 	     $exon->est_hit($fea2);	
> 	     # general properties
> 	     $exon->source_tag($self->analysis_method());
> 	     $exon->percentage_id($pctid);
> 	     $exon->score($exon->percentage_id());
> 	     # push onto array
> 	     push(@exons, $exon);
> 	     next; # back to while loop
>      }
>    }
>    return @exons;
> }
>
> =head2 next_exonset
>
>  Title   : next_exonset
>  Usage   : $exonset = $sim4_result->parse_next_exonset;
>            print "Exons start at ", $exonset->start(),
>                  "and end at ", $exonset->end(), "\n";
>            foreach $exon ($exonset->sub_SeqFeature()) {
>                # do something
>            }
>  Function: Parses the next alignment of the Sim4 result file and  
> returns the
>            set of exons as a container of features. The container  
> is itself
>            a Bio::SeqFeature::Generic object, with the  
> Bio::Tools::Sim4::Exon
>            objects as sub features. Start, end, and strand of the  
> container
>            will represent the total region covered by the exons of  
> this set.
>
>            See the documentation of parse_next_alignment() for further
>            reference about parsing and how the information is stored.
>
>  Example :
>  Returns : An Bio::SeqFeature::Generic object holding  
> Bio::Tools::Sim4::Exon
>            objects as sub features.
>  Args    :
>
> =cut
>
> sub next_exonset {
>     my $self = shift;
>     my $exonset;
>
>     # get the next array of exons
>     my @exons = $self->parse_next_alignment();
>     return if($#exons < 0);
>     # create the container of exons as a feature object itself,  
> with the
>     # data of the first exon for initialization
>     $exonset = Bio::SeqFeature::Generic->new('-start' => $exons[0]- 
> >start(),
> 					     '-end' => $exons[0]->end(),
> 					     '-strand' => $exons[0]->strand(),
> 					     '-primary' => "ExonSet");
>     $exonset->source_tag($exons[0]->source_tag());
>     $exonset->seq_id($exons[0]->seq_id());
>     # now add all exons as sub features, with enabling EXPANsion of  
> the region
>     # covered in total
>     foreach my $exon (@exons) {
> 	$exonset->add_sub_SeqFeature($exon, 'EXPAND');
>     }
>     return $exonset;
> }
>
> =head2 next_feature
>
>  Title   : next_feature
>  Usage   : while($exonset = $sim4->next_feature()) {
>                   # do something
>            }
>  Function: Does the same as L<next_exonset()>. See there for  
> documentation of
>            the functionality. Call this method repeatedly until  
> FALSE is
>            returned.
>
>            The returned object is actually a SeqFeatureI  
> implementing object.
>            This method is required for classes implementing the
>            SeqAnalysisParserI interface, and is merely an alias for
>            next_exonset() at present.
>
>  Example :
>  Returns : A Bio::SeqFeature::Generic object.
>  Args    :
>
> =cut
>
> sub next_feature {
>     my ($self, at args) = @_;
>     # even though next_exonset doesn't expect any args (and this  
> method
>     # does neither), we pass on args in order to be prepared if  
> this changes
>     # ever
>     return $self->next_exonset(@args);
> }
>
> 1;
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l





More information about the Bioperl-l mailing list