[BioSQL-l] Re: [Bioperl-l] Error in loading into biosql database
Hilmar Lapp
hlapp at gnf.org
Tue Mar 25 15:14:04 EST 2003
Siddhartha,
according to the --debug log you sent, and contrary to what you
originally reported, there is no problem with accession# P42655, nor is
there a problem with the species insertion or look-up. Apparently one
of my instructions did fix the original problem.
Instead, the exception that terminates the upload is due to a reference
entry with Medline ID 20489853 in accession# P45954. This reference is
first encountered in accession# Q9UKU7, but without a Medline ID
(Swiss-Prot only has the PubMed ID there). When it is encountered again
in accession# P45954, there is a Medline ID specified. The look-up for
the Medline ID 20489853 fails, triggering an insert, which in turn
fails because the computed CRC must be unique.
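
To make the failure sequence concrete, here is a minimal sketch in
Python against an in-memory SQLite database. The table layout, the
load_reference helper, the CRC string, and the PubMed ID value are all
made up for illustration; this is not the BioSQL schema or the
bioperl-db code, only the look-up-then-insert pattern described above.

```python
import sqlite3

MEDLINE_ID = 20489853   # from the --debug log
PUBMED_ID = 11111111    # hypothetical placeholder, not the real PubMed ID
CRC = "CRC-OF-REFERENCE"  # stands in for the computed CRC


def load_reference(conn, medline_id, pubmed_id, crc):
    """Naive loader: look up by Medline ID only, insert if not found."""
    row = conn.execute(
        "SELECT reference_id FROM reference WHERE medline_id = ?",
        (medline_id,),
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO reference (medline_id, pubmed_id, crc) VALUES (?, ?, ?)",
        (medline_id, pubmed_id, crc),
    )
    return cur.lastrowid


def reproduce_failure():
    """Return the database error message raised by the second load."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reference ("
        " reference_id INTEGER PRIMARY KEY,"
        " medline_id INTEGER,"
        " pubmed_id INTEGER,"
        " crc TEXT NOT NULL UNIQUE)"
    )
    # Q9UKU7: the reference arrives with a PubMed ID but no Medline ID.
    load_reference(conn, None, PUBMED_ID, CRC)
    try:
        # P45954: same reference, now with a Medline ID. Whether the PubMed
        # ID is also present is not stated; it does not affect this loader.
        load_reference(conn, MEDLINE_ID, PUBMED_ID, CRC)
    except sqlite3.IntegrityError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(reproduce_failure())
```

The second call finds nothing by Medline ID, attempts the insert, and is
rejected by the uniqueness constraint on the CRC column.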
This needs to be fixed because the situation is not purely artificial.
The options, as I see them, are:
1) use PubMed instead of Medline if the latter is undef
2) look up references by CRC rather than by dbxref (Medline or PubMed)
3) look up, in this or any other order, by CRC, then if not found by
Medline, then if not found by PubMed, omitting those look-ups where the
key value is undefined.
Option 1) doesn't really solve the problem. Even though in the concrete
case at hand the first occurrence of the reference did come with a
PubMed ID, at the second occurrence you still don't know that you have
to look up by PubMed instead of Medline (which is defined for the
second occurrence).
Option 2) relies on certain assumptions in order to work, namely a) all
instances of a reference are fully populated (wrt authors, location,
title), because otherwise you arrive at different CRC values, and b)
everyone inserting into the database uses the same CRC calculation
algorithm (no problem if you only use bioperl).
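
Assumption a) can be illustrated with a small Python sketch. The
checksum below is zlib.crc32 over the joined fields, which is only a
stand-in and not the calculation bioperl uses; the function name and
the author, title, and location strings are invented for the example.

```python
import zlib


def reference_crc(authors, title, location):
    """Illustrative checksum over the reference fields (not bioperl's)."""
    # Undefined fields contribute an empty string to the checksum input.
    text = "|".join(field or "" for field in (authors, title, location))
    return zlib.crc32(text.encode("utf-8"))


def demo():
    """Return the CRCs of a full, a repeated full, and a partial record."""
    authors = "Doe J., Roe R."
    title = "An example title."
    location = "J. Example 1:1-10(2000)"
    full = reference_crc(authors, title, location)
    again = reference_crc(authors, title, location)
    # Same reference, but this instance lacks the title.
    partial = reference_crc(authors, None, location)
    return full, again, partial


if __name__ == "__main__":
    full, again, partial = demo()
    print(full == again, full == partial)
```

Two fully populated instances agree, while the instance without a title
yields a different value, so a look-up by CRC alone would miss it.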
Option 3) is the most robust (I actually don't quite see when it would
not work), but potentially costly, and creates a headache for
implementation because it violates the definition of alternative keys
(to locate an object it otherwise suffices to locate by any alternative
key whose value is defined, not by all of them).
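
For discussion, option 3) could look roughly like the Python sketch
below, again against an in-memory SQLite table. The schema, the
function names, and the PubMed ID and CRC values are hypothetical; this
is not how bioperl-db is implemented, only the cascade of look-ups with
undefined keys skipped.

```python
import sqlite3

SCHEMA = """
CREATE TABLE reference (
    reference_id INTEGER PRIMARY KEY,
    medline_id   INTEGER UNIQUE,
    pubmed_id    INTEGER UNIQUE,
    crc          TEXT NOT NULL UNIQUE
)
"""


def find_reference(conn, crc=None, medline_id=None, pubmed_id=None):
    """Try each alternative key in turn, skipping undefined values."""
    keys = (("crc", crc), ("medline_id", medline_id), ("pubmed_id", pubmed_id))
    for column, value in keys:
        if value is None:
            continue  # omit look-ups where the key value is undefined
        # Column names are fixed literals above, never user input.
        row = conn.execute(
            f"SELECT reference_id FROM reference WHERE {column} = ?",
            (value,),
        ).fetchone()
        if row is not None:
            return row[0]
    return None


def find_or_insert(conn, crc, medline_id=None, pubmed_id=None):
    """Insert only if no alternative key locates an existing row."""
    found = find_reference(conn, crc, medline_id, pubmed_id)
    if found is not None:
        return found
    cur = conn.execute(
        "INSERT INTO reference (medline_id, pubmed_id, crc) VALUES (?, ?, ?)",
        (medline_id, pubmed_id, crc),
    )
    return cur.lastrowid


def demo():
    """Load the same reference twice with different keys defined."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    # First occurrence: PubMed ID only (placeholder value).
    first = find_or_insert(conn, crc="CRC-A", pubmed_id=11111111)
    # Second occurrence: Medline ID only; located via the CRC instead.
    second = find_or_insert(conn, crc="CRC-A", medline_id=20489853)
    count = conn.execute("SELECT COUNT(*) FROM reference").fetchone()[0]
    return first, second, count


if __name__ == "__main__":
    print(demo())
```

The cost mentioned above shows up as up to three queries per reference
instead of one. The sketch also leaves open whether a located row
should have its missing Medline or PubMed ID filled in.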
Does anyone have opinions, comments, or alternative suggestions?
Siddhartha, in the meantime you can bypass failing entries by supplying
--safe on the command line.
-hilmar
On Tuesday, March 25, 2003, at 02:21 PM, Siddhartha Basu wrote:
> Hi Hilmar,
>
>
> Hilmar Lapp wrote:
>> If you dropped it and re-created it that should have taken care of
>> records erroneously without NCBI taxon ID.
>>
>> To verify you can query before an upload:
>>
>> mysql> SELECT binomial, variant, ncbi_taxon_id FROM taxon WHERE
>> ncbi_taxon_id IS NULL;
>>
>> To confirm for Homo sapiens:
>>
>> mysql> SELECT * FROM taxon WHERE binomial = 'Homo sapiens' AND
>> (ncbi_taxon_id IS NULL OR ncbi_taxon_id != 9606);
>>
>> Neither of the 2 queries should return any rows.
>>
> Done that, no rows returned.
>
>
>> If they don't and you still get this error then look in your input
>> file for the first occurrence of Homo sapiens as species for the
>> sequence. Does it come with NCBI taxon ID?
> Yes it does.
>
>> If yes, look for the second sequence of Homo sapiens. Does it have
>> accession# P42655? If yes (*), truncate the taxon table and create a
>> new input file with only the first Homo sapiens sequence entry (which
>> supposedly has a taxon ID). Try to load the single-entry file.
> Followed the instruction and it's loaded properly.
>
>> After that, check your taxon table. There should be Homo sapiens. If
>> it lacks the NCBI taxon ID (*), the problem is with the parser not
>> parsing the taxon ID out of the input.
> It has the NCBI taxon ID.
>
>
>>
>> (*) If you have to answer 'no' here, there's possibly something weird
>> going on that would need to be fully debugged. You can try to run
>> load_seqdatabase.pl with --debug and send me the output.
> Executed load_seqdatabase.pl with --debug and the output is included in
> the attachment.
>
>
> Siddhartha
>
>
>>
>> -hilmar
>>
> <debuginfo.tar.gz>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------