[Bioperl-l] Reading sequences without parsing them

Karger, Amir AKarger@CuraGen.com
Mon, 16 Jul 2001 09:08:17 -0400


I want to create a fingerprint for each sequence I read in, so that when
updated versions of the database come in, I can check to see if the
fingerprint changed before I bother doing all the work of parsing &
otherwise analyzing the sequence. Unfortunately, I can't figure out how to
do that in bioperl, because next_seq automatically parses the sequence. And
as far as I can tell, no record is kept of the entire sequence string (which
is what I would want to fingerprint using, e.g., Digest::MD5).

I guess one reason bioperl does this is that you don't want to read whole
sequences into memory at a time since they may be very large; you would
rather work line by line. But it's making things awkward for me. And I could
imagine other instances where you'd want to do this. Let's say you want to
deal only with sequences that have a certain substring in them. Do you need
to parse all the extra information in a Rich Seq just to decide whether to
skip this sequence or not?
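
A minimal sketch of that substring pre-filter, done on the raw text before any bioperl object is built. It assumes FASTA-style input, where every record starts with '>' at the beginning of a line; records_with_motif is a hypothetical helper name, not part of bioperl.

```perl
use strict;
use warnings;

# Hypothetical helper: return the raw text of every record whose residues
# contain $motif. Assumes FASTA-style records that start with '>'.
sub records_with_motif {
    my ($fh, $motif) = @_;
    local $/ = "\n>";                 # read one whole record per <$fh>
    my @hits;
    while (my $record = <$fh>) {
        $record =~ s/\n>?\z//;        # drop the separator read with the record
        $record =~ s/^>?/>/;          # make sure the record starts with '>'
        my ($header, @lines) = split /\n/, $record;
        my $residues = join '', @lines;   # a motif may span a line break
        push @hits, "$record\n" if index($residues, $motif) >= 0;
    }
    return @hits;
}
```

Only the records returned here would need to be handed on to a full parser. For formats with a different record terminator (for instance the "//" line that ends a swiss-prot entry), the value assigned to $/ would have to change accordingly.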

Currently, I can think of two solutions. The first is to call write_seq for
each sequence. This is going to be pretty slow, since it has to go through
the code for writing a sequence (which might take a while for, say,
swiss-prot). It also defeats much of the purpose of creating a fingerprint
in the first place, since the sequence has already been parsed by then. And
there is the problem that bioperl does not output exactly the text it read
in. (So, for example, if bioperl gets upgraded and write_seq is changed to
be a bit better, all of my fingerprints will change.)

The second is to read each sequence in by myself, then create a new bioperl
stream for each one. I don't have a sense, without running benchmarks, of
how slow it's going to be to create a new Bio::SeqIO object for each
sequence. Maybe it won't be that slow.
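
A rough sketch of the "read it myself" half of this idea, again assuming FASTA-style records. changed_records and the %seen hash of stored fingerprints are hypothetical names used only for illustration; Digest::MD5's md5_hex is the only library call involved.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: read raw records from $fh and return only those
# whose MD5 fingerprint differs from the one stored in %$seen, keyed by
# the first word of the header line. %$seen is updated in place.
sub changed_records {
    my ($fh, $seen) = @_;
    local $/ = "\n>";                 # one FASTA-style record per read
    my @changed;
    while (my $record = <$fh>) {
        $record =~ s/\n>?\z//;        # drop the separator read with the record
        $record =~ s/^>?/>/;          # make sure the record starts with '>'
        my ($id) = $record =~ /^>(\S+)/;
        next unless defined $id;      # skip anything without an identifier
        my $digest = md5_hex($record);
        next if defined $seen->{$id} && $seen->{$id} eq $digest;
        $seen->{$id} = $digest;       # new or updated record
        push @changed, "$record\n";
    }
    return @changed;
}
```

Only the records returned would then need a Bio::SeqIO stream built around them, so the cost of creating a parser per record is paid only for entries that are new or have changed. The fingerprint covers the raw text exactly as read, so it does not depend on how write_seq formats its output.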

If my output stream for write_seq is a file, I have to imagine it's going to
be really slow. Bioperl will be opening a new file for each sequence, then
writing to and closing it, then Digest::MD5 will be opening and reading each
file. What a waste! Unfortunately, the other option is IO::String, which
means I need to upgrade from Perl 5.004 (admittedly something I should be doing
anyway).
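
For reference, a sketch of digesting written output without a temporary file. It assumes a Perl recent enough to open a filehandle on a scalar reference (5.8 or later), which plays the same role as IO::String here. fingerprint_output is a hypothetical helper, and the writer passed to it stands in for whatever actually prints the sequence.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: run $writer (a code ref that prints to the
# filehandle it is given) against an in-memory buffer and return the
# MD5 hex digest of whatever was written.
sub fingerprint_output {
    my ($writer) = @_;
    my $buffer = '';
    # In-memory filehandle: requires Perl 5.8 or later.
    open my $out, '>', \$buffer or die "cannot open in-memory handle: $!";
    $writer->($out);
    close $out;
    return Digest::MD5->new->add($buffer)->hexdigest;
}
```

In the write_seq approach, the writer would presumably build a Bio::SeqIO output stream on the handle it receives (via the -fh argument) and call write_seq on it, so nothing touches the disk.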

Are there other ways to do this?

Amir Karger
Curagen Corporation