[BioSQL-l] load_seqdatabase.pl warnings and errors

Peter biopython at maubp.freeserve.co.uk
Wed May 20 10:59:19 UTC 2009


On Wed, May 20, 2009 at 10:52 AM, michael watson (IAH-C)
<michael.watson at bbsrc.ac.uk> wrote:
>
> Hi Guys
>
> Ok, the warnings were due to duplicate sequences - I had downloaded a
> stream using Bio::DB::GenBank and I guess I assumed that would mean only
> unique entries were sent back.  Using "--flatlookup --remove" gets rid
> of the warnings.

Great - easy :)

> Now for NC_003992.gbk...
>
> To answer Hilmar's question:
> ...
> And when I run load_seqdatabase.pl on NC_003992.gbk alone I still get:
>
> perl load_seqdatabase.pl --host localhost --dbname fmd_biosql --format
> genbank --dbuser removed --dbpass removed --flatlookup --remove
> NC_003992.gbk
>
> Loading NC_003992.gbk ...
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values
> were ("","Direct Submission","Submitted (12-AUG-2004) National Center
> for Biotechnology Information, NIH, Bethesda, MD 20894,
> USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs (<NULL>)
> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
> ---------------------------------------------------
> Could not store NC_003992:
> ------------- EXCEPTION  -------------
> MSG: create: object (Bio::Annotation::Reference) failed to insert or to
> be found by unique key
> ...

I would guess the problem is that this rather generic reference in
NC_003992 is repeated exactly in another genome, which would cause
the CRC collision:

CONSRTM   NCBI Genome Project
TITLE     Direct Submission
JOURNAL   Submitted (12-AUG-2004) National Center for Biotechnology
          Information, NIH, Bethesda, MD 20894, USA

See http://www.ncbi.nlm.nih.gov/nuccore/NC_011452
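
To illustrate the guess above, here is a minimal Python sketch. It
assumes the checksum is derived only from the reference's authors, title
and location text, which is an assumption about the loader rather than
something confirmed here. zlib.crc32 is a stand-in for whatever CRC is
really used, and reference_key and the second accession are made-up
names for illustration only.

```python
import zlib

def reference_key(authors, title, location):
    """Stand-in checksum over the reference text only (hypothetical helper)."""
    text = "|".join([authors, title, location])
    return "CRC-%08X" % (zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF)

# The same generic NCBI reference attached to two different records.
# "OTHER_GENOME" is a placeholder, not a real accession.
generic = ("", "Direct Submission",
           "Submitted (12-AUG-2004) National Center for Biotechnology "
           "Information, NIH, Bethesda, MD 20894, USA")
records = {"NC_003992": generic, "OTHER_GENOME": generic}

seen = {}       # checksum -> accession that first used it
clashes = []    # (accession, earlier accession, checksum)
for accession, ref in records.items():
    key = reference_key(*ref)
    if key in seen:
        clashes.append((accession, seen[key], key))
    else:
        seen[key] = accession

for accession, earlier, key in clashes:
    print("%s repeats the reference first seen in %s (%s)"
          % (accession, earlier, key))
```

Because nothing specific to the record goes into the checksum in this
sketch, two identical references produce the same key, and a unique
index on that key would reject the second insert.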

That is, could there be another direct submission by the NCBI on that
date in your collection?  You could search the database for that CRC
and trace it back to a bioentry, or just grep your GenBank files for
"JOURNAL   Submitted (12-AUG-2004) National Center for Biotechnology"
(GenBank files have three spaces after JOURNAL). For example, something
like this SQL statement might be interesting:

SELECT bioentry.accession, reference.title FROM bioentry,
bioentry_reference, reference WHERE
bioentry.bioentry_id=bioentry_reference.bioentry_id AND
bioentry_reference.reference_id=reference.reference_id AND
reference.crc='CRC-E8D3CBBD80002FA1';
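
If grep is awkward because of the column padding, the file search could
be sketched in Python along these lines. This is only a rough sketch:
files_with_reference is a made-up helper name, and the pattern matches
just the first line of the JOURNAL entry, since the rest wraps onto the
next line in a GenBank file.

```python
import re

# GenBank pads the JOURNAL keyword with spaces, so allow any run of
# whitespace between the keyword and the text.
PATTERN = re.compile(
    r"JOURNAL\s+Submitted \(12-AUG-2004\) National Center for Biotechnology"
)

def files_with_reference(paths):
    """Return the paths whose text contains the generic JOURNAL line."""
    hits = []
    for path in paths:
        with open(path, "r", errors="replace") as handle:
            if any(PATTERN.search(line) for line in handle):
                hits.append(path)
    return hits
```

Calling files_with_reference(glob.glob("*.gbk")) (after importing glob)
would list every GenBank file in the current directory carrying that
submission line.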

Peter



