[Bioperl-l] Loading Genbank records into BioSQL db

Thu, 23 May 2002 08:55:14 -0400

Hi Guys, this is a longish story that has more to do with genbank records and
mysql than with bioperl, but I thought it was relevant enough to burden the
group with.

We've done a fresh install of MySQL on a linux box, and we went ahead and
followed the tips in Chris Dagdigians travelogue (thanks Chris, that was cool).
The first order of business was to load all of the primate division data from
genbank (the gbpri*.seq files).  There's 17 files, at 250 megs apiece.  The biosql
db was built using the latest code from CVS, the bioperl install is the 1.0
release version, and the bioperldb code is the current release version.

So, off I go loading data into the db using the load_seqdatabase.pl script.  Over
the course of about 14 hours it gets into the middle of the fifteenth file, gbpri7.seq,
when I get this very entertaining error:

Reading /data/gbpri7.seq
DBD::mysql::st execute failed: MySQL server has gone away at
/usr/lib/perl5/site_perl/5.6.0/Bio/DB/SQL/CommentAdaptor.pm line 149, <GEN0> line 2905433.
DBD::mysql::st execute failed: MySQL server has gone away at
/usr/lib/perl5/site_perl/5.6.0/Bio/DB/SQL/CommentAdaptor.pm line 149, <GEN0> line 2905433.

The loading script stopped at this point.  I had a command line mysql
session open at the same time, but when I tried a query it gave me the
same message about the server going away (then it automatically
re-established the connection and ran my query -- doncha love mysql?).
Whatever killed the loading script didn't fubar the mysql daemon, but
it knocked all of its connections down.  Wierd.

Line 2905433 is the record separator terminating the record for accession
AC092470, a 96100bp BAC whose only redeeming feature, other than a couple
of STS's, is a series of named repeat units.  The unusual feature of this
record is a long comment detailing all of the base changes that have
been made in the finished version (the Consensus changing edits table).
In fact, this is the mother of all comments, detailing 25366 edits, each
one meriting an 80 character line.  (and, yes, if you must know, I obtained
this information by scrolling up through all of these lines while reading
slashdot).  Grep reassures us that the overwhelming majority of the
consensus changing edit tables are less than 100 lines long, and in fact
only one comes close to this one, at 12122 (also in gbpri7.seq).

So what happened?  We set up the mysql install according to Chris
Dagdigian's excellent advice, increasing the max_allowed_packet, the
sort_buffer, and the record_buffer to 1M.  The offending comment
(12122 lines) gets through at 969760bytes, but the mother of all comments
gorks the connection at just over 2 megs.  It's worth noting that
the actual sequence here is something like 96kbytes, while the comment
is larger by a factor of about 20.

There would seem to be two work-arounds.  One would be to truncate the
comment, which isn't important for my purposes anyway, or we could
rejigger the settings.  Just to make sure that we'd nailed the problem,
this is what we did.  Not having the manual close at hand we reset
max_allowed_packet, sort_buffer, and record_buffer to 3M.  It looks to
me like the key here was max_allowed_packet, which is the max size of
the buffer for incoming data.  I reran the load_seqdatabase.pl with
gbpri7.seq, and it loaded OK, but it took more than ten hours (at least
five times longer than I would have expected -- not too surprising
we get a mongo performance hit here).

I guess the moral here is that no matter how well you prepare things,
the data files will continue to rise up, evade your error handling,
and smite you.  (what was it Ewan said about a world of pain?).  This is
not a bug in bioperl, but I wonder what the right policy is toward this
sort of thing?  Or is this a case where no discriminate function will draw
a line between good and bad?

-Keith Allen