[Bioperl-l] Parsing a netblast file

Jason Stajich jason at cgt.duhs.duke.edu
Thu Jul 31 09:36:22 EDT 2003


> Through trial and error I have narrowed down the problem to the negative
> sign in the database details.  Here is the section in question from a
> netblast result file:
>
> Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS,
> or phase 0, 1 or 2 HTGS sequences)
>             1,819,241 sequences; -24,217,474 total letters

This is an integer overflow.  The number of letters in nt is greater than
the largest value (2147483647) that a 32-bit signed integer can represent.

It looks like the length of nt is 8,782,847,770.  It seems to have been
larger than INT_MAX for a while, so I'm surprised they haven't updated their
code.  Do you have the latest version of netblast on your machine?  If you
are running the latest version, a bug report to NCBI is probably a good idea.

Some C code to illustrate what happens:
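#include <stdio.h>
#include <limits.h>

int main (void)
{
  int i = INT_MAX;            /* largest value a signed int can hold */
  unsigned int ui = INT_MAX;  /* the same value stored in an unsigned int */

  printf ("max integer size is %d\n",i);
  printf ("max unsigned int size is %u\n",ui);

  /* Signed overflow is undefined behavior in C; on typical
     two's-complement machines it wraps around to INT_MIN. */
  printf ("max integer+1 size is %d\n",i+1);
  /* Unsigned arithmetic is well defined (it wraps modulo UINT_MAX+1);
     with a 32-bit int this result still fits. */
  printf ("max unsigned integer*2 size is %u\n",ui*2);
  return 0;
}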
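
To connect this to the number in the report: if the letter count is kept in
a 32-bit signed int, only the low 32 bits survive, so the printed value
equals the true count modulo 2^32.  Below is a rough sketch of that
arithmetic.  The function names wrap_to_int32 and unwrap_candidate are made
up for illustration, and I am assuming a 32-bit two's-complement int; this
is not taken from the netblast source.

```c
#include <stdint.h>

/* Hypothetical helper: the value a 32-bit two's-complement int would
   hold if a 64-bit count were squeezed into it. */
static int32_t wrap_to_int32(int64_t n)
{
    /* Conversion to an unsigned type is well defined: it keeps n mod 2^32. */
    uint32_t low = (uint32_t)n;

    /* Reinterpret the low 32 bits as signed without relying on an
       implementation-defined conversion. */
    if (low <= (uint32_t)INT32_MAX)
        return (int32_t)low;
    return (int32_t)((int64_t)low - 4294967296LL);
}

/* Hypothetical helper: one true count that is consistent with a wrapped
   value, assuming the counter wrapped around exactly k times. */
static int64_t unwrap_candidate(int32_t reported, int k)
{
    return (int64_t)reported + (int64_t)k * 4294967296LL;
}
```

For the reported -24,217,474, unwrap_candidate(-24217474, 1) gives
4,270,749,822 and k = 2 gives 8,565,717,118.  The wrapped value alone cannot
tell you which k is right, so it cannot be used to recover the real size.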


>
> I don't know why, but all netblast result files I have looked at show a
> negative value for the total number of letters.  If I remove the '-' sign,
> the blast result file parses just fine with the above script.
>
> Why does a netblast result file have a minus sign for the database size?
> Why won't the parser work if there is a minus sign?
> Is there a way to make the parser work despite the minus sign?
>

We'd just need to tweak the regexp a little bit to handle a leading '-'.
What version of bioperl are you running, so I can provide a patch that is
appropriate for your version?
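
To show the idea (this is not the bioperl parser, which is Perl and uses a
regexp), here is a small C sketch that reads the statistics line and accepts
an optional leading '-' along with the comma grouping.  The function names
parse_grouped_int and parse_db_stats are hypothetical, and the sketch does no
overflow checking of its own.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Parse an optionally signed, comma-grouped integer such as "-24,217,474".
   Returns 1 on success (value in *out, first unread char in *end if end is
   not NULL) and 0 if no digits were found. */
static int parse_grouped_int(const char *s, int64_t *out, const char **end)
{
    int negative = 0;
    int digits = 0;
    int64_t value = 0;

    while (*s == ' ' || *s == '\t')      /* skip leading blanks */
        s++;
    if (*s == '-') {                     /* tolerate the leading minus sign */
        negative = 1;
        s++;
    }
    for (; *s != '\0'; s++) {
        if (isdigit((unsigned char)*s)) {
            value = value * 10 + (*s - '0');
            digits++;
        } else if (*s == ',' && digits > 0) {
            continue;                    /* thousands separator */
        } else {
            break;
        }
    }
    if (digits == 0)
        return 0;
    *out = negative ? -value : value;
    if (end != NULL)
        *end = s;
    return 1;
}

/* Parse a line like
   "1,819,241 sequences; -24,217,474 total letters".
   Returns 1 on success, 0 if the line does not have that shape. */
static int parse_db_stats(const char *line, int64_t *seqs, int64_t *letters)
{
    const char *p;

    if (!parse_grouped_int(line, seqs, &p))
        return 0;
    p = strstr(p, "sequences;");
    if (p == NULL)
        return 0;
    p += strlen("sequences;");
    return parse_grouped_int(p, letters, NULL);
}
```

Called on the line from your file, parse_db_stats fills in 1819241 sequences
and -24217474 letters instead of failing.  The negative total is still wrong,
of course; accepting the sign only keeps the parser from choking on it.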


-jason
--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu

