[BioSQL-l] load_seqdatabase fails when loading refseq plant files
Angel Pizarro
angel at mail.med.upenn.edu
Fri Aug 11 14:57:35 EDT 2006
Glad I am not the only one that ran into this problem! Mike, I had
reported this issue a few emails back and have provided the list with an
example file for testing, so it should be resolved soon.
FYI, you are correct that a CRC is computed on load to determine whether
two publication references are in fact the same. This is a feature to
save database space. The expected behaviour is that subsequent entries
with the same reference CRC get an FK to the originating reference
entry, rather than inserting a duplicate row into the reference table.
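That expected find-or-create behaviour can be sketched roughly as below.
This is only an illustration in Python against an in-memory SQLite
table, not bioperl-db's actual Perl code; the cut-down table layout and
the helper name find_or_create_reference are assumptions made for the
example.

```python
import sqlite3

# Simplified stand-in for the BioSQL reference table (assumption: the real
# table has more columns; only the unique crc column matters here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reference ("
    " reference_id INTEGER PRIMARY KEY,"
    " title TEXT,"
    " crc TEXT NOT NULL UNIQUE)"
)


def find_or_create_reference(conn, title, crc):
    """Return the reference_id for crc, inserting a row only if none exists."""
    row = conn.execute(
        "SELECT reference_id FROM reference WHERE crc = ?", (crc,)
    ).fetchone()
    if row is not None:
        # Reuse the originating entry; the caller stores this id as the FK.
        return row[0]
    cur = conn.execute(
        "INSERT INTO reference (title, crc) VALUES (?, ?)", (title, crc)
    )
    return cur.lastrowid


# Two records carrying the same reference CRC end up sharing one row.
first = find_or_create_reference(conn, "Direct Submission", "CRC-6F1453182E2BAC3F")
second = find_or_create_reference(conn, "Direct Submission", "CRC-6F1453182E2BAC3F")
row_count = conn.execute("SELECT COUNT(*) FROM reference").fetchone()[0]
```

Both calls return the same id and the table holds a single row, which is
the outcome the loader is supposed to produce instead of the duplicate
key error.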
FYI #2, the description of the --safe option explicitly states that it
will continue to process records after errors BUT do a rollback at the
end of the run. This lets you gather all of your errors in one shot, as
opposed to fixing a record, restarting, hitting the next error, fixing
that one, and so on.
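The collect-errors-then-roll-back pattern described here looks roughly
like the following. It is a generic Python/SQLite sketch of the idea and
not the script's implementation; the entry table, the load_all helper
and the record names are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (accession TEXT NOT NULL UNIQUE)")
conn.commit()


def load_all(conn, accessions):
    """Try every record, collect the failures, and undo the run if any failed."""
    errors = []
    for acc in accessions:
        try:
            conn.execute("INSERT INTO entry (accession) VALUES (?)", (acc,))
        except sqlite3.IntegrityError as exc:
            # Note the failure and keep going instead of stopping here.
            errors.append((acc, str(exc)))
    if errors:
        conn.rollback()  # nothing from this run is kept
    else:
        conn.commit()
    return errors


# Hypothetical input: rec1 and rec2 each appear twice.
errors = load_all(conn, ["rec1", "rec2", "rec1", "rec3", "rec2"])
remaining = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
```

Every duplicate is reported in a single pass, and because at least one
record failed, the table is left empty at the end of the run.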
If you are impatient and do not care about references, you have three
choices.
1) Drop the unique constraint on reference.crc. This will cause
duplicates in the reference table, and you cannot go back to a unique
CRC without a major SQL data migration routine to fix the FKs and
delete the duplicates.
2) Filter your records so that they do not contain reference
information.
3) Alter load_seqdatabase so that it does not enter reference
information. The references are held in the Bio::AnnotationCollection
object, so this would be:
$seq->annotation()->remove_Annotations('reference');
Inserting the above command somewhere around line 575 of the script
should do the trick. Obviously this means that no reference information
is loaded into the DB at all.
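For choice 2, one way to filter the records is to drop the REFERENCE
blocks from the GenBank flat file before loading. The Python sketch
below relies on the flat-file layout, where top-level keywords start in
the first column and their sub-keywords and continuation lines are
indented; the strip_references helper and the tiny sample record are
invented for illustration and have not been run against the RefSeq
plant files.

```python
def strip_references(lines):
    """Yield GenBank flat-file lines, dropping every REFERENCE block."""
    skipping = False
    for line in lines:
        if line.startswith("REFERENCE"):
            skipping = True   # start of a reference block
        elif skipping and line[:1] not in (" ", "\t", ""):
            skipping = False  # next top-level keyword ends the block
        if not skipping:
            yield line


# Invented, heavily shortened record just to exercise the filter.
record = [
    "LOCUS       EXAMPLE1     10 bp    mRNA    linear   PLN 01-JAN-2000\n",
    "REFERENCE   1  (bases 1 to 10)\n",
    "  AUTHORS   Doe,J.\n",
    "  TITLE     Direct Submission\n",
    "FEATURES             Location/Qualifiers\n",
    "//\n",
]
filtered = list(strip_references(record))
```

For the compressed files, the same generator can be fed from
gzip.open(path, "rt") and its output written to a new file that is then
passed to load_seqdatabase.pl.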
-angel
On Fri, 2006-08-11 at 11:10 -0500, Mike Muratet wrote:
> Hello all
>
> I am using biosql-schema/bioperl-db to load Refseq entries into a biosql
> database. I don't see any version info in the files, but I downloaded
> everything in the last month or so and everything passed all the tests
> when installed. I am using perl 5.8.5, mysql 5.0.22, DBI-1.5.1,
> DBD-mysql-3.006. I was loading the plant files from RefSeq release 18:
>
> load_seqdatabase.pl --dbname biosql
> --lookup --u --namespace plant --format genbank --safe plant*.rna.gbff.gz
>
> and it crashed after about 30K of 60K records:
>
> at /usr/lib/perl5/site_perl/5.8.5/Bio/biosql-schema/sql/bioperl-db/scripts/biosql/load_seqdatabase.pl
> line 633
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values
> were ("","Direct Submission","Submitted (01-JUL-2004) National Center for
> Biotechnology Information, National Institutes of Health, Bethesda 20894,
> United States of America","CRC-6F1453182E2BAC3F","1","786","") FKs
> (<NULL>)
> Duplicate entry 'CRC-6F1453182E2BAC3F' for key 3
> ---------------------------------------------------
> Could not store XM_472403:
> ------------- EXCEPTION -------------
> MSG: create: object (Bio::Annotation::Reference) failed to insert or to be
> found by unique key
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:208
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254
> STACK Bio::DB::Persistent::PersistentObject::store
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272
> STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:219
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:216
> t
>
> I traced the error back through the source and database and found that
> XM_472403 has the same CRC value as XM_473880. I actually got many errors of this type,
> but only the last one crashed the script (in spite of --safe).
>
> Should there be more info included in the CRC field? I am weak when
> it comes to RDBMSs, but looking at the schema, I would guess that the
> CRC field was added to make an otherwise degenerate key unique. Would
> it help to add more fields to the CRC, or another key? The former
> might be done without having to change a lot of code.
>
> Thanks
>
> Mike
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l