[Bioperl-l] Indexing large databases / BioSQL

Wed Apr 2 02:30:06 UTC 2008

On Apr 1, 2008, at 8:31 AM, Bánk Beszteri wrote:
> [...] So next we started to test BioSQL, by trying to load just  
> Swissprot in a MySQL DB first, like:
>
> load_seqdatabase.pl --host mysql.awi.de --dbname biosql2 --dbuser  
> xyz --dbpass abc --driver mysql --namespace uniprot_sprot --format  
> swiss uniprot_sprot.dat
>
> Here we get an error message
>
> ###########################################
>
> Loading /biodb/spinkern/uniprot_sprot.dat ...
> Could not store Q6DAH5:
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: The supplied lineage does not start near 'Erwinia carotovora  
> subsp. atroseptica' (I was supplied 'Erwinia carotovora subsp. |  
> Pectobacterium | Enterobacteriaceae | Enterobacteriales |  
> Gammaproteobacteria | Proteobacteria | Bacteria')
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /biodb/spinkern/bioperl-1.5/ 
> bioperl-1.5.2_102/Bio/Root/Root.pm:359
> STACK: Bio::Species::classification /biodb/spinkern/bioperl-1.5/ 
> bioperl-1.5.2_102/Bio/Species.pm:174
> STACK: Bio::DB::Persistent::PersistentObject::AUTOLOAD /biodb/ 
> spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm: 
> 552
> STACK: Bio::DB::BioSQL::SpeciesAdaptor::populate_from_row /biodb/ 
> spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SpeciesAdaptor.pm:281
> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object / 
> biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/ 
> BasePersistenceAdaptor.pm:1305
> STACK:  
> Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /biodb/ 
> spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/ 
> BasePersistenceAdaptor.pm:973
> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key / 
> biodb/spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/ 
> BasePersistenceAdaptor.pm:852
> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /biodb/ 
> spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/ 
> BasePersistenceAdaptor.pm:182
> STACK: Bio::DB::Persistent::PersistentObject::create /biodb/ 
> spinkern/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm: 
> 244
> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /biodb/ 
> spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/ 
> BasePersistenceAdaptor.pm:169
> STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /biodb/ 
> spinkern/bioperl-db-1.5.2_100/Bio/DB/BioSQL/ 
> BasePersistenceAdaptor.pm:251
> STACK: Bio::DB::Persistent::PersistentObject::store /biodb/spinkern/ 
> bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
> STACK: load_seqdatabase.pl:622
> -----------------------------------------------------------
>
> at load_seqdatabase.pl line 635
>
> ############################################
>
> or similar, depending on whether we use a pre-loaded ncbi taxonomy  
> or not

I recommend always using a pre-loaded NCBI taxonomy unless you know  
there are only a few organisms that are straightforward (for the  
parser, that is).

> , and which Swissprot release we are trying to load. It often seems  
> to come from something like, as here, a subsp. or other special  
> addition to the species line; but alternative genus names and other  
> curious things also appear. It looks like Species.pm tries to  
> validate the species name against the lineage info already there in  
> the BioSQL DB, and in several cases, it finds inconsistencies.

It actually happens upon a successful lookup when the species object  
is populated from the database.

> [...]
> The only workaround we have found so far was to comment out line  
> 174 in Species.pm:
>
> $self->throw("The supplied lineage does not start near '$name' (I  
> was supplied '".join(" | ", @vals)."')");

That should be OK if you work with a pre-loaded taxonomy. It's sort  
of a sanity check that should catch a parser having messed up a  
species. If you use a pre-loaded NCBI taxonomy the results of the  
species parsing don't matter in all details so long as the NCBI  
taxonID is parsed out correctly, and then found in the database.

Note that this is actually a warn() in the main trunk version of  
BioPerl, so you might want to upgrade to that (or change throw() to  
warn() in your version). The affected records still get flagged by  
the warning, but it is no longer an exception.
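
If you would rather patch your installed copy than upgrade, the  
throw()-to-warn() change could be scripted roughly as below. This is  
only a sketch: the function name lineage_throw_to_warn is made up for  
illustration, and it assumes the throw() call begins on a single line  
exactly as quoted above. A .bak copy of the file is kept.

```shell
# Hypothetical helper: downgrade the lineage sanity check in Species.pm
# from throw() to warn(). Assumes the call starts on one line with the
# message text quoted in this thread. The original is saved as FILE.bak.
lineage_throw_to_warn() {
    sed -i.bak \
        's/\$self->throw("The supplied lineage/$self->warn("The supplied lineage/' \
        "$1"
}

# Usage (adjust the path to your own BioPerl installation):
#   lineage_throw_to_warn /path/to/Bio/Species.pm
```

Only the one message is touched, so other throw() calls in the file  
keep raising exceptions.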

>
> After doing so, load_seqdatabase.pl runs for several hours (until  
> it eventually crashes; I haven't found out yet why), but proceeds  
> really slowly.

It should certainly *not* crash. Note also that you can supply --safe  
on the command line, in which case the script will continue with the  
next record if one fails to load for whatever reason.

You will want to adjust the width constraint of dbxref.accession, for  
example to 128 chars. This will also be fixed for BioSQL 1.0.1.
See http://bugzilla.open-bio.org/show_bug.cgi?id=2474
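
For MySQL, widening the column might look like the sketch below. It  
assumes the column is otherwise declared simply NOT NULL; MODIFY  
restates the whole column definition, so check SHOW CREATE TABLE dbxref  
first and repeat any other attributes you see there (BINARY, for  
instance). The script only prints the statement; the commented line  
shows one way to apply it with the host, user and database names from  
the command earlier in this thread.

```shell
# Assumption: MySQL, and the accession column is otherwise plain NOT NULL.
# Compare with the output of "SHOW CREATE TABLE dbxref" before applying.
SQL='ALTER TABLE dbxref MODIFY accession VARCHAR(128) NOT NULL;'

# Print the statement for review.
echo "$SQL"

# To apply it, pipe the statement into the mysql client, e.g.:
#   echo "$SQL" | mysql -h mysql.awi.de -u xyz -p biosql2
```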

> I also found some info on this for Pg and Oracle in the mailing  
> list, but has anyone some approximate numbers for MySQL, how long  
> should a first Swissprot load take?

Possibly around 20 hours according to Erik Rijkers:
See http://lists.open-bio.org/pipermail/bioperl-l/2008-March/027427.html

You can use the --logchunks N option to have it print out performance  
statistics every N records.
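
Put together with --safe, your original command would become something  
like the following. The value 1000 is just an example interval, not a  
recommendation, and the script is written as a dry run that prints the  
command line rather than starting the load.

```shell
# Options from the original command, plus --safe and --logchunks.
# 1000 is an arbitrary example value for N.
set -- --host mysql.awi.de --dbname biosql2 --dbuser xyz --dbpass abc \
       --driver mysql --namespace uniprot_sprot --format swiss \
       --safe --logchunks 1000 uniprot_sprot.dat

# Dry run: print the full command. Remove "echo" to actually run it.
echo load_seqdatabase.pl "$@"
```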

Hope this helps,

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================