[BioSQL-l] load_seqdatabase.pl warnings and errors
Richard Holland
holland at eaglegenomics.com
Wed May 20 07:44:58 EDT 2009
Although unlikely, it is entirely possible for two completely different
references to share the same CRC. Hence the CRC shouldn't really be used
as an indicator of uniqueness, although it is still useful as a hash for
indexing and quick lookup.
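
To make that concrete, here is a rough Python sketch (not BioSQL or
BioPerl code). It assumes a 32-bit checksum via zlib.crc32 purely as a
stand-in; the checksum BioSQL loaders actually compute may differ in
width and algorithm. The names collision_probability and ReferenceIndex
are made up for illustration. The first function is the usual birthday
approximation, which puts a 32-bit checksum at about even odds of a
collision somewhere around 77,000 distinct items. The class shows the
"index on the CRC, but decide identity on the full content" idea.

```python
import math
import zlib


def collision_probability(n_items, bits=32):
    """Birthday approximation: chance that at least two of n_items
    uniformly distributed checksums of the given bit width coincide."""
    space = 2.0 ** bits
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * space))


def crc32_of(text):
    """Stand-in checksum (an assumption, not BioSQL's own routine)."""
    return zlib.crc32(text.encode("utf-8"))


class ReferenceIndex:
    """Hypothetical lookup table: the checksum picks a bucket quickly,
    but two entries only count as the same if their full text matches."""

    def __init__(self, checksum=crc32_of):
        self._checksum = checksum
        self._buckets = {}

    def add(self, text):
        """Return True if text was new, False if an identical entry exists."""
        bucket = self._buckets.setdefault(self._checksum(text), [])
        if text in bucket:
            return False
        bucket.append(text)
        return True
```

With this layout a checksum collision only costs an extra string
comparison within one bucket; it never causes two different references
to be treated as one, and never rejects a new reference as a duplicate.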
cheers,
Richard
On Wed, 2009-05-20 at 12:34 +0100, Peter wrote:
> On Wed, May 20, 2009 at 12:25 PM, michael watson (IAH-C)
> <michael.watson at bbsrc.ac.uk> wrote:
> >
> > We have a winner :)
> >
> > NC_003992, NC_011452, NC_011451, NC_011450 all share
> > at least one reference.
> >
> > Would changing --flatlookup to --lookup change the behaviour
> > so it checks for an existing reference before trying to insert the
> > duplicate?
> >
> > The answer is no :( (see below).
> >
> > I guess this may need some coding then!
>
> My crude idea for a simple ad-hoc solution would be to remove these
> pointless references from the records, before loading them into
> BioSQL.
>
> One way would be to edit the four GenBank files by hand (e.g. to
> remove the reference or make them unique). You might also do this in a
> BioPerl script that loads the records, edits the references, and then
> puts them in the database. Personally I use Python not Perl, so I
> can't tell you how you might do that with BioPerl.
>
> Hilmar may be able to comment from a BioPerl/BioSQL point of view -
> clearly CRC collisions of this nature will happen again in future.
>
> Peter
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l
--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/