[BioSQL-l] CRC calculation

Hilmar Lapp hlapp at gnf.org
Tue Aug 23 15:41:33 EDT 2005


On Aug 23, 2005, at 2:24 AM, mark.schreiber at novartis.com wrote:

> Hello -
>
> According the the BioSQL docs the CRC of the Reference table is a
> calculated checksum over the author, location and title fields. It also
> states that it helps ensure uniqueness. I have some questions:
>
> 1) Is the CRC constrained to be unique?

Yes. It is a substitute for a unique constraint over 
(authors,title,location) because their combined field length is too 
long to create an index on (at least in Oracle), and even if it were 
possible it would probably not be a very efficient index.

>  If it is that could be a problem as CRC is not guarenteed to be 
> unique for any String.

Correct, yes. However, I haven't had a single incident where this was a 
problem. Also, I'm using a CRC64, so there's several orders of 
magnitudes more possible values than journal articles accumulated over 
the past centuries :-)

> 2) Is everyone supposed to use the same algorithm or is it dependent on
> the user?

It is somewhat dependent on the user, in the sense that the Biosql 
schema does prescribe which algorithm to use. However, if you use 
bioperl/bioperl-db for loading then the CRCs will be calculated for you 
(and in fact also for sequences if you use the Oracle schema), so if 
you make updates or inserts outside of bioperl as well then you will 
probably want to be sure that you use consistent algorithms.

> 3) If 2 then are you using CRC16 or CRC32

Neither. It's a CRC64.

> 4) If 2 are you adding the independant CRC's of the author, location, 
> and
> title or are you combining them in a String and calculating the CRC for
> the string. If that is the case then what is the form of the string?

I combine them in a string first, substituting a default value for 
undefined properties. The CRC algorithm is the same as the one used for 
computing the SwissProt CRC64 (so if Biojava writes Swissprot then you 
probably have this algorithm in the library already). Here's the 
complete algorithm in perl with preceding documentation. I hope you can 
read perl; if not let me know ;-)

=head2 _crc64

  Title   : _crc64
  Usage   :
  Function: Computes and returns the CRC64 checksum for a given
            reference object.

            The method uses the reference's authors, title, and
            location properties.

  Example :
  Returns : the CRC64 as a string
  Args    : the Bio::Annotation::Reference object for which to compute
            the CRC

=cut

sub _crc64{
     my $self = shift;
     my $obj = shift;

     my $str =
         (defined($obj->authors) ? $obj->authors : "<undef>") .
         (defined($obj->title) ? $obj->title : "<undef>") .
         (defined($obj->location) ? $obj->location : "<undef>");

     return 'CRC-'.$self->crc64($str);

}

=head2 crc64

  Title   : crc64
  Usage   :
  Function: Computes and returns the CRC64 checksum for a given string.

            This is basically ripped out of the bioperl swissprot
            parser. Credits go to whoever contributed it there.

  Example :
  Returns : the CRC64 checksum as a string
  Args    : the string as a scalar for which to obtain the CRC64


=cut

sub crc64{
     my ($self, $str) = @_;
     my $POLY64REVh = 0xd8000000;
     my @CRCTableh;
     my @CRCTablel;

     if (exists($self->{'_CRCtableh'})) {
         @CRCTableh = @{$self->{'_CRCtableh'}};
         @CRCTablel = @{$self->{'_CRCtablel'}};
     } else {
         @CRCTableh = 256;
         @CRCTablel = 256;
         for (my $i=0; $i<256; $i++) {
             my $partl = $i;
             my $parth = 0;
             for (my $j=0; $j<8; $j++) {
                 my $rflag = $partl & 1;
                 $partl >>= 1;
                 $partl |= (1 << 31) if $parth & 1;
                 $parth >>= 1;
                 $parth ^= $POLY64REVh if $rflag;
             }
             $CRCTableh[$i] = $parth;
             $CRCTablel[$i] = $partl;
         }
         $self->{'_CRCtableh'} = \@CRCTableh;
         $self->{'_CRCtablel'} = \@CRCTablel;
     }

     my $crcl = 0;
     my $crch = 0;

     foreach (split '', $str) {
         my $shr = ($crch & 0xFF) << 24;
         my $temp1h = $crch >> 8;
         my $temp1l = ($crcl >> 8) | $shr;
         my $tableindex = ($crcl ^ (unpack "C", $_)) & 0xFF;
         $crch = $temp1h ^ $CRCTableh[$tableindex];
         $crcl = $temp1l ^ $CRCTablel[$tableindex];
     }
     my $crc64 = sprintf("%08X%08X", $crch, $crcl);

     return $crc64;

}




>
> - Mark
>
> Mark Schreiber
> Principal Scientist (Bioinformatics)
>
> Novartis Institute for Tropical Diseases (NITD)
> 10 Biopolis Road
> #05-01 Chromos
> Singapore 138670
> www.nitd.novartis.com
>
> phone +65 6722 2973
> fax  +65 6722 2910
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at open-bio.org
> http://open-bio.org/mailman/listinfo/biosql-l
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------



More information about the BioSQL-l mailing list