[Bioperl-l] code review on LocatableSeq performance fix.
Mark A. Jensen
maj at fortinbras.us
Wed Aug 26 11:39:40 UTC 2009
I think it's great. column_from_residue_number doesn't have any
secret side effects, and the patch preserves nice integer in, nice integer
out, and input and output both are 1-origin indices as far as I can tell.
I say go for it-
MAJ
----- Original Message -----
From: "George Hartzell" <hartzell at alerce.com>
To: "bioperl-l List" <bioperl-l at bioperl.org>
Sent: Tuesday, August 25, 2009 4:22 PM
Subject: [Bioperl-l] code review on LocatableSeq performance fix.
>
> [For better or worse] I use pairs of locatable seq's to represent
> alignments between cDNAs (spliced mRNA) and genomic sequence.
>
> I end up using column_from_residue_number a lot to map features back
> and forth between the coordinate system.
>
> My sequences tend to be fairly long, and the current implementation of
> column_from_residue_number (which splits the sequences into arrays of
> individual characters) performs very badly on them.
>
> I've included below a small variation on a patch that I've been using
> for a while (when I pulled it up to the current bioperl-live I changed
> a couple of regexps to use $GAP_SYMBOLS and $RESIDUE_SYMBOLS). It
> passes the t/Seq/LocatableSeq.t tests and Works For Me (tm).
>
> Instead of creating whopping big arrays and then looping over them it
> breaks the sequence down into runs of residues/gaps and strides across
> them. It also unwinds the strandedness test and avoids the cute trick
> of using an anonymous sub (which saves a couple of lines in the source
> file but adds *signficant* overhead every time around the loop).
>
> All hail Devel::NYTProf.
>
> Chris et al.'s comments about the mysteries and vagaries of
> Bio::LocatableSeq makes me leary of just committing it.
>
> Anyone want to comment on it?
>
> g.
>
> Index: Bio/LocatableSeq.pm
> ===================================================================
> --- Bio/LocatableSeq.pm (revision 16001)
> +++ Bio/LocatableSeq.pm (working copy)
> @@ -423,27 +423,47 @@
> unless $resnumber =~ /^\d+$/ and $resnumber > 0;
>
> if ($resnumber >= $self->start() and $resnumber <= $self->end()) {
> - my @residues = split //, $self->seq;
> - my $count = $self->start();
> - my $i;
> - my ($start,$end,$inc,$test);
> - my $strand = $self->strand || 0;
> - # the following bit of "magic" allows the main loop logic to be the
> - # same regardless of the strand of the sequence
> - ($start,$end,$inc,$test)= ($strand == -1)?
> - (scalar(@residues-1),0,-1,sub{$i >= $end}) :
> - (0,scalar(@residues-1),1,sub{$i <= $end});
> + my @chunks;
> + my $column_incr;
> + my $current_column;
> + my $current_residue = $self->start - 1;
> + my $seq = $self->seq;
> + my $strand = $self->strand || 0;
>
> - for ($i=$start; $test->(); $i+= $inc) {
> - if ($residues[$i] ne '.' and $residues[$i] ne '-') {
> - $count == $resnumber and last;
> - $count++;
> - }
> - }
> - # $i now holds the index of the column.
> - # The actual column number is this index + 1
> + if ($strand == -1) {
> +# @chunks = reverse $seq =~ m/[^\.\-]+|[\.\-]+/go;
> + @chunks = reverse $seq =~ m/[$RESIDUE_SYMBOLS]+|[$GAP_SYMBOLS]+/go;
> + $column_incr = -1;
> + $current_column = (CORE::length $seq) + 1;
> + }
> + else {
> +# @chunks = $seq =~ m/[^\.\-]+|[\.\-]+/go;
> + @chunks = $seq =~ m/[$RESIDUE_SYMBOLS]+|[$GAP_SYMBOLS]+/go;
> + $column_incr = 1;
> + $current_column = 0;
> + }
>
> - return $i+1;
> + while (my $chunk = shift @chunks) {
> +# if ($chunk =~ m|^[\.\-]|o) {
> + if ($chunk =~ m|^[$GAP_SYMBOLS]|o) {
> + $current_column += $column_incr * CORE::length($chunk);
> + }
> + else {
> + if ($current_residue + CORE::length($chunk) < $resnumber) {
> + $current_column += $column_incr * CORE::length($chunk);
> + $current_residue += CORE::length($chunk);
> + }
> + else {
> + if ($strand == -1) {
> + $current_column -= $resnumber - $current_residue;
> + }
> + else {
> + $current_column += $resnumber - $current_residue;
> + }
> + return $current_column;
> + }
> + }
> + }
> }
>
> $self->throw("Could not find residue number $resnumber");
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
More information about the Bioperl-l
mailing list