[Bioperl-l] Storing Large Primary Sequences

David Block dblock@gene.pbi.nrc.ca
Wed, 20 Sep 2000 09:44:52 -0600 (CST)


James et al,

Here's the code.  I assume that $self has a 'file' attribute with the path
to the chromosomal sequence file (chr2.clean, in my case).  I assume
further that the file has been massaged so all the lines (except for
possibly the last one) have the same linelength.  I store that linelength
as another attribute of self.

sub loadseq {          #loads chromosomal sequence into memory
    my ($self)=@_;
    my $count=0;
    my @in;

    open IN,$self->file or die "Couldn't open Sequence file: $!";

    if (defined($_=<IN>)) {
        if (/^>/) {$_=<IN>}        # get rid of the fasta header if it
exists
        chomp;
        $self->{linelength}=CORE::length;        #assuming constant line
widths (as in .clean files)
        $in[0]=$_;
    }
    while (<IN>) {
        $count++;
        chomp;
        $in[$count]=$_;
    }

    return (\@in);
}

Here's the retrieval code.  $self->_load_a_seq calls loadseq if it hasn't
been done already.  I use a closure to store the sequence so it's
available to all members of the class.  Therefore I don't have to repeat
loadseq for the life of the executable.


sub getseq {            #quickly retrieves sequence from chromosomal
sequence present in memory
                        #returns sequence in 70 character lines
    my ($self)=@_;
    my $in=$self->_load_a_seq;
    my $startpos=$self->start;
    my $endpos=$self->stop;
    my $linelength=$self->linelength;
    my $length=$endpos-$startpos+1;
    my $row=int($startpos/$linelength);
    my $count=$row * $linelength + 1;

    my $seq = substr($in->[$row],($startpos-$count),$length);
    $length -= $linelength - ($startpos-$count);

    do {
        $row++;
        $seq .= substr($in->[$row],0,$length);
        $length -= $linelength;
    } until ($length <= 0);

    $seq .= "\n";
    $seq=~ s/([^\n]{70})/$1\n/g;
    return $seq;
}