[Bioperl-l] Reading sequences without parsing them

Jason Stajich jason@chg.mc.duke.edu
Mon, 16 Jul 2001 09:41:08 -0400


Amir - Either you don't understand the bioperl objects very well (please
read the documentation available within each module via the perldoc system
or online at bioperl.org->Docs) or I don't understand what you want to do
very well.

Of course we keep a record of the sequence string when reading in sequence
data, that would defeat the purpose of parsing the data in the first place.
Please spend 5 minutes looking at the documentation about Bio::PrimarySeqI
and Bio::SeqIO view the code snippet I have below.

I'd appreciate you spending the time looking at the basic code SYNOPSIS that
is atop every module and then restating your question.

----- Original Message -----
From: "Karger, Amir" <AKarger@curagen.com>
To: "Bioperl Mailing List (E-mail)" <bioperl-l@bioperl.org>
Sent: Monday, July 16, 2001 9:08 AM
Subject: [Bioperl-l] Reading sequences without parsing them


> I want to create a fingerprint for each sequence I read in, so that when
> updated versions of the database come in, I can check to see if the
> fingerprint changed before I bother doing all the work of parsing &
> otherwise analyzing the sequence. Unfortunately, I can't figure out how to
> do that in bioperl, because next_seq automatically parses the sequence.
And
> as far as I can tell, no record is kept of the entire sequence string
(which
> is what I would want to fingerprint using, e.g., Digest::MD5).
>

See the PrimarySeqI::seq method.

Before we go any further, most major sequence databases have version numbers
now so you should not even need to be doing this yourself, can't you verify
the version number instead of doing this or are you getting sequence from
someone who does not do this?

> I guess one reason bioperl does this is that you don't want to read whole
> sequences into memory at a time since they may be very large; you would
> rather work line by line. But it's making things awkward for me. And I
could
> imagine other instances where you'd want to do this. Let's say you want to
> deal only with sequences that have a certain substring in them. Do you
need
> to parse all the extra information in a Rich Seq just to decide whether to
> skip this sequence or not?
>

If you wanted to create a MD5 hash of a sequence string
use Digest::MD5
use Bio::SeqIO;

my $seqin = new Bio::SeqIO(-format => 'genbank', -file=> 'mydata.gb');

while( my $seq = $seqin->next_seq) {
   my  $digest = digest($seq->seq());
    # do something with that digest - are you matching via accession number
or what?
}

> Currently, I can think of two solutions. The first is to call write_seq
for
> each sequence. This is going to be pretty slow, since it has to go through
> the code for writing a sequence (which might take a while for, say,
> swiss-prot). It also kind of loses the advantages of creating a
fingerprint
> in the first place. And we have the problems of bioperl not outputting
> exactly the sequence it input. (So, for example, if bioperl gets upgraded
> and write_seq is changed to be a bit better, all of my fingerprints will
> change.)
>
> The second is to read each sequence in by myself, then create a new
bioperl
> stream for each one. I don't have a sense, without running benchmarks, of
> how slow it's going to be to have the new Bio::SeqIO for each sequence.
> Maybe it won't be that slow.
>
> If my output stream for write_seq is a file, I have to imagine it's going
to
> be reall slow. Bioperl will be opening a new file for each sequence, then
> writing to and closing it, then Digest::MD5 will be opening and reading
each
> file. What a waste! Unfortunately, the other option is IO::String, which
> means I need to upgrade my Perl5.004 (admittedly something I should be
doing
> anyway).
>
> Are there other ways to do this?
>
> Amir Karger
> Curagen Corporation
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>