[Bioperl-l] code review on LocatableSeq performance fix.
George Hartzell
hartzell at alerce.com
Tue Aug 25 16:22:20 EDT 2009
[For better or worse] I use pairs of locatable seq's to represent
alignments between cDNAs (spliced mRNA) and genomic sequence.
I end up using column_from_residue_number a lot to map features back
and forth between the coordinate system.
My sequences tend to be fairly long, and the current implementation of
column_from_residue_number (which splits the sequences into arrays of
individual characters) performs very badly on them.
I've included below a small variation on a patch that I've been using
for a while (when I pulled it up to the current bioperl-live I changed
a couple of regexps to use $GAP_SYMBOLS and $RESIDUE_SYMBOLS). It
passes the t/Seq/LocatableSeq.t tests and Works For Me (tm).
Instead of creating whopping big arrays and then looping over them it
breaks the sequence down into runs of residues/gaps and strides across
them. It also unwinds the strandedness test and avoids the cute trick
of using an anonymous sub (which saves a couple of lines in the source
file but adds *signficant* overhead every time around the loop).
All hail Devel::NYTProf.
Chris et al.'s comments about the mysteries and vagaries of
Bio::LocatableSeq makes me leary of just committing it.
Anyone want to comment on it?
g.
Index: Bio/LocatableSeq.pm
===================================================================
--- Bio/LocatableSeq.pm (revision 16001)
+++ Bio/LocatableSeq.pm (working copy)
@@ -423,27 +423,47 @@
unless $resnumber =~ /^\d+$/ and $resnumber > 0;
if ($resnumber >= $self->start() and $resnumber <= $self->end()) {
- my @residues = split //, $self->seq;
- my $count = $self->start();
- my $i;
- my ($start,$end,$inc,$test);
- my $strand = $self->strand || 0;
- # the following bit of "magic" allows the main loop logic to be the
- # same regardless of the strand of the sequence
- ($start,$end,$inc,$test)= ($strand == -1)?
- (scalar(@residues-1),0,-1,sub{$i >= $end}) :
- (0,scalar(@residues-1),1,sub{$i <= $end});
+ my @chunks;
+ my $column_incr;
+ my $current_column;
+ my $current_residue = $self->start - 1;
+ my $seq = $self->seq;
+ my $strand = $self->strand || 0;
- for ($i=$start; $test->(); $i+= $inc) {
- if ($residues[$i] ne '.' and $residues[$i] ne '-') {
- $count == $resnumber and last;
- $count++;
- }
- }
- # $i now holds the index of the column.
- # The actual column number is this index + 1
+ if ($strand == -1) {
+# @chunks = reverse $seq =~ m/[^\.\-]+|[\.\-]+/go;
+ @chunks = reverse $seq =~ m/[$RESIDUE_SYMBOLS]+|[$GAP_SYMBOLS]+/go;
+ $column_incr = -1;
+ $current_column = (CORE::length $seq) + 1;
+ }
+ else {
+# @chunks = $seq =~ m/[^\.\-]+|[\.\-]+/go;
+ @chunks = $seq =~ m/[$RESIDUE_SYMBOLS]+|[$GAP_SYMBOLS]+/go;
+ $column_incr = 1;
+ $current_column = 0;
+ }
- return $i+1;
+ while (my $chunk = shift @chunks) {
+# if ($chunk =~ m|^[\.\-]|o) {
+ if ($chunk =~ m|^[$GAP_SYMBOLS]|o) {
+ $current_column += $column_incr * CORE::length($chunk);
+ }
+ else {
+ if ($current_residue + CORE::length($chunk) < $resnumber) {
+ $current_column += $column_incr * CORE::length($chunk);
+ $current_residue += CORE::length($chunk);
+ }
+ else {
+ if ($strand == -1) {
+ $current_column -= $resnumber - $current_residue;
+ }
+ else {
+ $current_column += $resnumber - $current_residue;
+ }
+ return $current_column;
+ }
+ }
+ }
}
$self->throw("Could not find residue number $resnumber");
More information about the Bioperl-l
mailing list