[BioSQL-l] Re: [Bioperl-l] Error in loading into biosql database

Hilmar Lapp hlapp at gnf.org
Tue Mar 25 15:14:04 EST 2003


Siddhartha,

according to the --debug log you sent, and contrary to what you 
originally reported, there is no problem with accession# P42655, nor 
with the species insertion or look-up. Apparently one of my 
instructions did fix the original problem.

Instead, the exception that terminates the upload is due to a reference 
entry with medline ID 20489853 in accession# P45954. This reference is 
first encountered in accession# Q9UKU7, but without a medline ID 
(swissprot only has the PubMed ID there). When it is encountered again 
in accession# P45954, there is a Medline ID specified. The look-up for 
the Medline ID 20489853 fails, triggering an insert, which in turn 
fails because the computed CRC must be unique.
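
The sequence of events can be reproduced with a minimal sketch. It 
assumes a simplified, hypothetical reference table in an in-memory 
SQLite database (not the actual BioSQL schema or the bioperl-db code), 
and the PubMed ID and CRC values are placeholders, since the real ones 
are not given here:

```python
import sqlite3

# Hypothetical, simplified table: NOT the real BioSQL schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reference ("
    " id INTEGER PRIMARY KEY,"
    " medline_id TEXT,"
    " pubmed_id TEXT,"
    " crc TEXT NOT NULL UNIQUE)"
)


def find_by_medline(medline_id):
    """Look up a reference by Medline ID only; return its id or None."""
    row = conn.execute(
        "SELECT id FROM reference WHERE medline_id = ?", (medline_id,)
    ).fetchone()
    return row[0] if row else None


def insert_reference(medline_id, pubmed_id, crc):
    """Insert a reference row; raises IntegrityError on a duplicate CRC."""
    cur = conn.execute(
        "INSERT INTO reference (medline_id, pubmed_id, crc) VALUES (?, ?, ?)",
        (medline_id, pubmed_id, crc),
    )
    return cur.lastrowid


# First occurrence (as in Q9UKU7): PubMed ID only, no Medline ID.
insert_reference(None, "PMID-PLACEHOLDER", "CRC-PLACEHOLDER")

# Second occurrence (as in P45954): the Medline ID is now present.
# Same citation, hence the same computed CRC.
lookup_result = find_by_medline("20489853")  # not found: stored row has no Medline ID
try:
    insert_reference("20489853", None, "CRC-PLACEHOLDER")
    insert_failed = False
except sqlite3.IntegrityError:
    insert_failed = True  # the unique constraint on crc rejects the insert
```

The look-up misses the existing row because that row was stored 
without a Medline ID, and the insert that follows is rejected by the 
unique constraint on the CRC.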

This needs to be fixed because the situation is not purely artificial. 
The options, as I see them, are:

	1) use PubMed instead of Medline if the latter is undef

	2) look-up references by CRC rather than dbxref (medline or pubmed)

	3) look-up by each key in turn (in this or any other order): first 
by CRC; if not found, by medline; if not found, by pubmed; omitting 
those look-ups where the key value is undefined.

Option 1) doesn't really solve the problem. In the concrete case at 
hand the first occurrence of the reference did come with a PubMed ID, 
but at the second occurrence you still don't know that you now have to 
look-up by PubMed instead of Medline (which is defined for the 
second).

Option 2) relies on certain assumptions in order to work, namely a) all 
instances of a reference are fully populated (wrt authors, location, 
title), because otherwise you arrive at different CRC values, and b) 
everyone inserting into the database uses the same CRC calculation 
algorithm (no problem if you only use bioperl).
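
Assumption a) can be illustrated with a small sketch. It uses 
zlib.crc32 over a joined string as a stand-in checksum; the function 
reference_crc, the field separator, and the sample citation are all 
hypothetical and are not the calculation bioperl performs:

```python
import zlib


def reference_crc(authors, title, location):
    """Stand-in checksum over the citation fields (hypothetical recipe).

    Missing fields are treated as empty strings, so a partially
    populated reference yields a different input string.
    """
    text = "|".join(part or "" for part in (authors, title, location))
    return zlib.crc32(text.encode("utf-8"))


# Made-up citation data, for illustration only.
full = reference_crc("Doe J., Roe R.", "An example title", "J. Example 1:1-10 (2000)")
again = reference_crc("Doe J., Roe R.", "An example title", "J. Example 1:1-10 (2000)")
no_title = reference_crc("Doe J., Roe R.", None, "J. Example 1:1-10 (2000)")
```

Here full and again agree, while no_title differs, so a look-up by CRC 
alone would treat the untitled instance as a new reference.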

Option 3) is the most robust (I actually don't quite see when it would 
not work), but potentially costly, and creates a headache for 
implementation because it violates the definition of alternative keys 
(to locate an object it otherwise suffices to locate by any alternative 
key whose value is defined, not by all of them).
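
A minimal sketch of the look-up order in option 3), assuming an 
in-memory list in place of the database. The class ReferenceStore, its 
method names, and the placeholder values are hypothetical and only 
illustrate the cascade; this is not how bioperl-db is implemented:

```python
class ReferenceStore:
    """Toy in-memory stand-in for the reference table (hypothetical)."""

    def __init__(self):
        self.rows = []

    def _find(self, key, value):
        """Return the first row whose `key` equals `value`, or None."""
        if value is None:
            return None  # omit look-ups where the key value is undefined
        for row in self.rows:
            if row[key] == value:
                return row
        return None

    def find_or_insert(self, crc, medline_id=None, pubmed_id=None):
        """Try CRC, then Medline, then PubMed; insert only if all miss."""
        candidates = (
            ("crc", crc),
            ("medline_id", medline_id),
            ("pubmed_id", pubmed_id),
        )
        for key, value in candidates:
            row = self._find(key, value)
            if row is not None:
                return row
        row = {"crc": crc, "medline_id": medline_id, "pubmed_id": pubmed_id}
        self.rows.append(row)
        return row


store = ReferenceStore()
# First occurrence: PubMed ID only (placeholder values).
first = store.find_or_insert("CRC-PLACEHOLDER", pubmed_id="PMID-PLACEHOLDER")
# Second occurrence: Medline ID only; the CRC look-up finds the stored row.
second = store.find_or_insert("CRC-PLACEHOLDER", medline_id="20489853")
```

In this sketch the second call returns the row stored by the first, so 
no duplicate insert is attempted. The cost mentioned above shows up as 
up to three look-ups per reference instead of one.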

Does anyone have opinions, comments, or alternative suggestions?

Siddhartha, in the meantime you can bypass failing entries by supplying 
--safe on the command line.

	-hilmar

On Tuesday, March 25, 2003, at 02:21  PM, Siddhartha Basu wrote:

> Hi Hilmar,
>
>
> Hilmar Lapp wrote:
>> If you dropped it and re-created it that should have taken care of
>> records erroneously without NCBI taxon ID.
>>
>> To verify you can query before an upload:
>>
>>     mysql> SELECT binomial, variant, ncbi_taxon_id FROM taxon WHERE
>> ncbi_taxon_id IS NULL;
>>
>> To confirm for Homo sapiens:
>>
>>     mysql> SELECT * FROM taxon WHERE binomial = 'Homo sapiens' AND
>>     (ncbi_taxon_id IS NULL OR ncbi_taxon_id != 9606)
>>
>> Neither of the 2 queries should return any rows.
>>
> Done that, no rows returned.
>
>
>> If they don't and you still get this error then look in your input 
>> file
>> for the first occurrence of Homo sapiens as species for the sequence.
>> Does it come with NCBI taxon ID?
> Yes it does.
>
>   if yes,
>> look for the second sequence of Homo sapiens. Does it have accession#
>> P42655? If yes (*), truncate the taxon table and create a new input 
>> file
>> with only the first Homo sapiens sequence entry (which supposedly has 
>> a
>> taxon ID). Try to load the single-entry file.
> Followed the instruction and it's loaded properly.
>
> After that, check your
>> taxon table. There should be Homo sapiens. If it lacks the NCBI taxon 
>> ID
>> (*), the problem is with the parser not parsing the taxon ID out of 
>> the
>> input.
> It has the NCBI taxon ID.
>
>
>>
>> (*) if you have to answer 'no' here, there's possibly something weird
>> going on that would need to be fully debugged. You can try to run
>> load_seqdatabase.pl with --debug and send me the output.
> Executed load_seqdatabase.pl with --debug and the output is included in
> the attachment.
>
>
> Siddhartha
>
>
>>
>>     -hilmar
>>
> <debuginfo.tar.gz>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------


