From stefan.guenther at charite.de Wed Apr 2 05:35:05 2008
From: stefan.guenther at charite.de (Stefan Guenther)
Date: Wed, 02 Apr 2008 11:35:05 +0200
Subject: [BioSQL-l] swissprot - gene names
Message-ID: <47F35349.4010105@charite.de>

Hi,

I have uploaded the swissprot flatfile (uniprot_sprot.dat) into the
biosql schema using load_seqdatabase.pl. Now I'm searching for the
swissprot gene names in the bioseqdb-tables. I cannot find them in
the bioentry table. Aren't they included?

Stefan

From hlapp at gmx.net Wed Apr 2 11:44:36 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 2 Apr 2008 11:44:36 -0400
Subject: [BioSQL-l] swissprot - gene names
In-Reply-To: <47F35349.4010105@charite.de>
References: <47F35349.4010105@charite.de>
Message-ID:

Are you talking about the gene symbols and names in the GN line?
These will be in bioentry_qualifier_value associations, with the tag
being gene_name (I think; check your term table to be sure). Let me
know if that doesn't work.

-hilmar

On Apr 2, 2008, at 5:35 AM, Stefan Guenther wrote:

> Hi,
>
> I have uploaded the swissprot flatfile (uniprot_sprot.dat) into the
> biosql schema using load_seqdatabase.pl. Now I'm searching for the
> swissprot gene names in the bioseqdb-tables. I cannot find them in
> the bioentry table. Aren't they included?
>
> Stefan
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From ericgibert at yahoo.fr Fri Apr 4 09:43:34 2008
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Fri, 4 Apr 2008 13:43:34 +0000 (GMT)
Subject: [BioSQL-l] left_value and right_value in taxon table
Message-ID: <942508.77770.qm@web26507.mail.ukl.yahoo.com>

Dear all,

I hope that I am not the 100th person asking the following questions:
1) what are left and right values in the taxon table for?
2) How are they computed?

Thank you for your input or link to an explanation page.

Eric

From hlapp at gmx.net Fri Apr 4 18:40:56 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 4 Apr 2008 18:40:56 -0400
Subject: [BioSQL-l] left_value and right_value in taxon table
In-Reply-To: <942508.77770.qm@web26507.mail.ukl.yahoo.com>
References: <942508.77770.qm@web26507.mail.ukl.yahoo.com>
Message-ID: <60C3976C-26BF-4981-9159-207835C9B5D0@gmx.net>

Hi Eric,

On Apr 4, 2008, at 9:43 AM, Eric Gibert wrote:

> Dear all,
>
> I hope that I am not the 100th person asking the following questions:
> 1) what are left and right values in the taxon table for?

they hold the nested set values. Nested sets are an enumeration
algorithm described in Joe Celko's SQL for Smarties books, and Aaron
Mackey gives a good introduction here:

http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html

(This is in the schema DDL file, though obviously should be documented
better. Good candidate for an FAQ, I suppose.)
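
To make the idea concrete, here is a minimal sketch of what nested set values buy you: a whole lineage or subtree comes back from one range comparison, with no recursion. It uses an in-memory SQLite table as a stand-in for the real taxon table. The four-node tree, the `name` column, and the helper names `lineage` and `descendants` are made up for illustration; only `taxon_id`, `parent_taxon_id`, `left_value` and `right_value` are column names taken from this thread.

```python
import sqlite3

# Toy stand-in for the taxon table (assumption: the real schema has more
# columns, and names actually live in a separate taxon_name table).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, parent_taxon_id INTEGER,"
    " name TEXT, left_value INTEGER, right_value INTEGER)"
)
# root(1,8) contains A(2,5) and B(6,7); A contains A1(3,4).
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?, ?, ?, ?)",
    [
        (1, None, "root", 1, 8),
        (2, 1, "A", 2, 5),
        (3, 2, "A1", 3, 4),
        (4, 1, "B", 6, 7),
    ],
)


def lineage(conn, taxon_id):
    """Names from the root down to the node: every interval enclosing it."""
    rows = conn.execute(
        "SELECT anc.name FROM taxon AS node JOIN taxon AS anc"
        " ON anc.left_value <= node.left_value"
        " AND anc.right_value >= node.right_value"
        " WHERE node.taxon_id = ? ORDER BY anc.left_value",
        (taxon_id,),
    ).fetchall()
    return [r[0] for r in rows]


def descendants(conn, taxon_id):
    """Names of all nodes strictly inside the node's interval."""
    rows = conn.execute(
        "SELECT sub.name FROM taxon AS node JOIN taxon AS sub"
        " ON sub.left_value > node.left_value"
        " AND sub.right_value < node.right_value"
        " WHERE node.taxon_id = ? ORDER BY sub.left_value",
        (taxon_id,),
    ).fetchall()
    return [r[0] for r in rows]


print(lineage(conn, 3))      # ancestors of A1, root first
print(descendants(conn, 1))  # everything below root
```

The trade-off is that reads become cheap while any change to the tree shape means renumbering, which is why the values have to be recomputed after updates.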
> 2) How are they computed

load_ncbi_taxonomy.pl recomputes them automatically after each update.
It's a simple recursive depth-first graph traversal algorithm.

-hilmar

From biopython at maubp.freeserve.co.uk Tue Apr 8 11:24:41 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Apr 2008 16:24:41 +0100
Subject: [BioSQL-l] left_value and right_value in taxon table
In-Reply-To: <60C3976C-26BF-4981-9159-207835C9B5D0@gmx.net>
References: <942508.77770.qm@web26507.mail.ukl.yahoo.com> <60C3976C-26BF-4981-9159-207835C9B5D0@gmx.net>
Message-ID: <320fb6e00804080824x2bd92d41p884c8a4a61c04702@mail.gmail.com>

> > Dear all,
> >
> > I hope that I am not the 100th person asking the following questions:
> > 1) what are left and right values in the taxon table for?
> >
>
> they hold the nested set values. Nested sets are an enumeration algorithm
> described in Joe Celko's SQL for Smarties books, and Aaron Mackey gives a
> good introduction here:
>
> http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html
>
> (This is in the schema DDL file, though obviously should be documented
> better. Good candidate for an FAQ, I suppose.)

That link does a good job of explaining the idea.

> > 2) How are they computed
>
> load_ncbi_taxonomy.pl recomputes them automatically after each update. It's
> a simple recursive depth-first graph traversal algorithm.

I have the impression the recomputation is slow, and also moderately
complex. This is fine for a weekly (or even daily) update which runs
the load_ncbi_taxonomy.pl script.

We (Biopython) are interested in incremental updates triggered when a
new sequence is added to the database with a novel taxon id. Eric is
looking at downloading the missing taxon data and updating the
taxon/taxon_name tables "on the fly", transparently to the user.

http://bugzilla.open-bio.org/show_bug.cgi?id=2475 (Biopython bug)

Hilmar, am I right in thinking the following: Suppose when loading a
new sequence into the database with a novel NCBI taxon, we record a
new minimal taxon/taxon_names entry (without the lineage, a single
taxon entry with null left/right entries). If the user then runs
load_ncbi_taxonomy.pl, assuming the NCBI's online database contains
the new taxon, will this update nicely? i.e. When the new sequence is
retrieved from the database, its full lineage will be available.

Thanks

Peter

From aaron.j.mackey at gsk.com Tue Apr 8 11:58:56 2008
From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com)
Date: Tue, 8 Apr 2008 11:58:56 -0400
Subject: [BioSQL-l] left_value and right_value in taxon table
In-Reply-To: <320fb6e00804080824x2bd92d41p884c8a4a61c04702@mail.gmail.com>
Message-ID:

I believe that the first thing the load_ncbi_taxonomy.pl script does is
to wipe out everything already in the table. So your incremental update
strategy (with deferred left/right calculation) won't work.

Depending on the type of update you're making (e.g. you only add one
new terminal taxonomic node, having no children), the incremental
updates are pretty fast, computationally speaking (no tree traversal is
required). I won't be able to recite them off the top of my head, but
Joe Celko's "SQL For Smarties" book has the necessary code. In a
nutshell, it's something like this: if the overall topology of the tree
remains unchanged, you'll need to increment the right/left values of
each node "to the right" of the new node you've inserted by 2, but it's
a tiny bit more complicated than that.

-Aaron

biosql-l-bounces at lists.open-bio.org wrote on 04/08/2008 11:24:41 AM:

> > > Dear all,
> > >
> > > I hope that I am not the 100th person asking the following questions:
> > > 1) what are left and right values in the taxon table for?
> > >
> >
> > they hold the nested set values. Nested sets are an enumeration algorithm
> > described in Joe Celko's SQL for Smarties books, and Aaron Mackey gives a
> > good introduction here:
> >
> > http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html
> >
> > (This is in the schema DDL file, though obviously should be documented
> > better. Good candidate for an FAQ, I suppose.)
>
> That link does a good job of explaining the idea.
>
> > > 2) How are they computed
> >
> > load_ncbi_taxonomy.pl recomputes them automatically after each update. It's
> > a simple recursive depth-first graph traversal algorithm.
>
> I have the impression the recomputation is slow, and also moderately
> complex. This is fine for a weekly (or even daily) update which runs
> the load_ncbi_taxonomy.pl script.
>
> We (Biopython) are interested in incremental updates triggered when a
> new sequence is added to the database with a novel taxon id. Eric is
> looking at downloading the missing taxon data and updating the
> taxon/taxon_name tables "on the fly", transparently to the user.
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2475 (Biopython bug)
>
> Hilmar, am I right in thinking the following: Suppose when loading a
> new sequence into the database with a novel NCBI taxon, we record a
> new minimal taxon/taxon_names entry (without the lineage, a single
> taxon entry with null left/right entries). If the user then runs
> load_ncbi_taxonomy.pl, assuming the NCBI's online database contains
> the new taxon, will this update nicely? i.e. When the new sequence is
> retrieved from the database, its full lineage will be available.
>
> Thanks
>
> Peter

From hlapp at gmx.net Tue Apr 8 19:57:41 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 8 Apr 2008 19:57:41 -0400
Subject: [BioSQL-l] left_value and right_value in taxon table
In-Reply-To:
References:
Message-ID: <5EDD43ED-57DD-4BAC-9A93-3B7DAE67ACB4@gmx.net>

On Apr 8, 2008, at 11:58 AM, aaron.j.mackey at gsk.com wrote:

> I believe that the first thing the load_ncbi_taxonomy.pl script
> does is to
> wipe out everything already in the table.

That may have been true in its beginnings but hasn't been for a long
time :-) It only updates changed nodes, adds new ones, and deletes
retired ones (unless you say --nodelete). The script does recompute
*all* nested set values, though.

> [...]
> depending on the type of update you're making (e.g. you only add
> one new
> terminal taxonomic node, having no children), the incremental
> updates are
> pretty fast, computationally speaking (no tree traversal is
> required). I
> won't be able to recite them off the top of my head, but Joe
> Celko's "SQL
> For Smarties" book has the necessary code. In a nutshell, it's
> something
> like if the overall topology of the tree remains unchanged, you'll
> need to
> increment the right/left values of each node "to the right" of the new
> node you've inserted by 2, but it's a tiny bit more complicated
> than that.

Though you can have very cheap cases indeed, in reality it turns out
that on average you still need to traverse and update at least half of
the nodes, so personally I really doubt you would save any significant
amount of time by not just redoing all of them. And it's not that
time-intensive either; typically it takes about 10-20mins, depending on
CPU etc.
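
For reference, the "increment by 2" leaf insertion described above can be sketched as follows. This is a rough illustration under stated assumptions, not code from load_ncbi_taxonomy.pl or from Celko's book: the function names `add_leaf` and `demo` and the toy tree are hypothetical, SQLite stands in for the real database, and the sketch does not model any uniqueness constraints the real left_value/right_value columns may carry (with those in place, a bulk shift can need extra care about update order).

```python
import sqlite3


def add_leaf(conn, parent_id, new_id):
    """Insert new_id as the last child of parent_id (hypothetical helper)."""
    (parent_right,) = conn.execute(
        "SELECT right_value FROM taxon WHERE taxon_id = ?", (parent_id,)
    ).fetchone()
    # Open a gap of two numbers at the parent's old right_value: the parent,
    # its ancestors and everything "to the right" all move up by 2.
    conn.execute(
        "UPDATE taxon SET right_value = right_value + 2 WHERE right_value >= ?",
        (parent_right,),
    )
    conn.execute(
        "UPDATE taxon SET left_value = left_value + 2 WHERE left_value > ?",
        (parent_right,),
    )
    # The new leaf occupies the gap.
    conn.execute(
        "INSERT INTO taxon (taxon_id, parent_taxon_id, left_value, right_value)"
        " VALUES (?, ?, ?, ?)",
        (new_id, parent_id, parent_right, parent_right + 1),
    )


def demo():
    """Build a four-node toy tree, add one leaf, return rows in tree order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY,"
        " parent_taxon_id INTEGER, left_value INTEGER, right_value INTEGER)"
    )
    conn.executemany(
        "INSERT INTO taxon VALUES (?, ?, ?, ?)",
        [(1, None, 1, 8), (2, 1, 2, 5), (3, 2, 3, 4), (4, 1, 6, 7)],
    )
    add_leaf(conn, parent_id=2, new_id=5)
    return conn.execute(
        "SELECT taxon_id, left_value, right_value FROM taxon ORDER BY left_value"
    ).fetchall()


for row in demo():
    print(row)
```

No traversal is needed, but the two UPDATE statements still touch every row whose interval ends at or after the insertion point, which is the cost being weighed above against simply renumbering everything.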

-hilmar

From hlapp at gmx.net Wed Apr 9 00:31:31 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 9 Apr 2008 00:31:31 -0400
Subject: [BioSQL-l] BioSQL-1.0.0 + bioperl-db_1.5.2_100 + gbrowse question
In-Reply-To: <47FA895F.4080407@embarqmail.com>
References: <47F52F80.7030409@embarqmail.com> <47FA895F.4080407@embarqmail.com>
Message-ID: <1094E323-C648-42AD-BE4B-623978FF7035@gmx.net>

Hi Doug,

thanks for your sleuthing work, it's much appreciated. I'm sure someone
among the GBrowse crowd (I've copied the GBrowse mailing list) can make
the fix according to what you found out, and they can help you out with
the other problems you are having too if you elaborate what the issues
are that you are seeing.

As for the path tables, which ones are you referring to? I'm not sure
the GBrowse BioSQL adapter uses any of them, so even if you do populate
them it wouldn't change much. There is the load_ontology.pl script in
Bioperl-db that can automatically compute the transitive closure for
ontology terms, but as far as I am aware the feature or sequence
loading scripts don't do that for seqfeature_path or bioentry_path,
respectively.

-hilmar

On Apr 7, 2008, at 4:51 PM, doug brown wrote:

> Hi Hilmar,
>
> Thank you for your reply.
>
> After much gnashing of teeth and much more code trawling, it
> seems as if the basic problem I encountered was indeed one of
> configuration. The current coding of
> Bio::DB::Das::BioSQL::BioDatabaseAdaptor uses 'location' as the
> keyword for representing the database host. The GFF adapter used,
> and I guess still uses, 'host' for that purpose. The distributed
> sample file 06.biosql.conf specified 'host' rather than 'location'.
> Here is the correct configuration file section that now works for me:
>
> description = Magnaporthe grisea V5 genbank BioSQL
> db_args = driver mysql
>           dbname M_grisea_genbank_biosql
>           location mycelium.fgl.ncsu.edu
>           user www
>           pass ""
>           namespace genbank
>           version 1
>
> gbrowse starts up OK now. Could you please tell me who would be
> the responsible party so that I can send him/her a note? There are
> also other problems with the file that I am trying to resolve.
> Thank you.
>
> Now I need to figure out how to get gbrowse to display all of my
> features (I finally have a database rich enough to represent them
> but not so complex as to be unwieldy) .....
>
> Oh, I noticed that the various path tables are not populated and
> one piece of documentation said something about "... database
> dependent...". Could you point me to any code samples that populate
> those tables? Or, perhaps, pointers to specific mailing list
> message chains even.
>
> In general, any clues or bread crumbs would be greatly appreciated.
>
> Regards,
> Doug Brown
>
> Hilmar Lapp wrote:
>>
>> Hi Doug,
>>
>> I'm not exactly sure what the problem is but the error you are
>> seeing is raised by code with GBrowse. I'd recommend that you post
>> this to the GBrowse list. I have tried to trace the meaning of the
>> Gbrowse conf parameters, and it seems to me that biodbname is
>> obsolete and should be namespace instead. However, I also can't
>> find where the error is being generated using the error message as
>> guide, so I suspect you are using an outdated version of GBrowse.
>> The folks on the Gbrowse list should be able to tell you more
>> specifics, though.
>>
>> -hilmar
>>
>> On Apr 3, 2008, at 3:26 PM, doug brown wrote:
>>> Hello Hilmar,
>>>
>>> First off, congratulations for achieving the version 1.0
>>> release of BioSQL.
>>>
>>> I am attempting to get BioSQL working with bioperl and gbrowse.
>>> All are, I believe, the most recent versions of the software.
>>> Unfortunately, I am running into problems with getting gbrowse
>>> to access the BioSQL database.
>>>
>>> It is my sincere hope that my problem is a configuration issue.
>>> However, after trying multiple permutations of gbrowse
>>> configuration params and much trawling through the bioperl code,
>>> I am unable to resolve the problem. Could you take a moment and
>>> see if there is an obvious solution to my problem?
>>>
>>> I have long awaited the 1.0 mature release of BioSQL and I am
>>> eager to start using it in place of my ad hoc in-house databases!
>>>
>>> Here is the error from the apache log files:
>>> ------------- EXCEPTION -------------
>>> MSG: error while executing query in
>>> Bio::DB::Das::BioSQL::PartialSeqAdaptor::find_by_query: No
>>> database selected
>>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_query /
>>> Library/Perl/5.8.6/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1248
>>> STACK
>>> Bio::DB::Das::BioSQL::BioDatabaseAdaptor::fetch_Seq_by_accession /
>>> Library/Perl/5.8.6/darwin-thread-multi-2level/Bio/DB/Das/BioSQL/
>>> BioDatabaseAdaptor.pm:120
>>> STACK Bio::DB::Das::BioSQL::get_feature_by_name /Library/Perl/
>>> 5.8.6/darwin-thread-multi-2level/Bio/DB/Das/BioSQL.pm:220
>>> STACK Bio::Graphics::Browser::_feature_get /Library/Perl/5.8.6/
>>> darwin-thread-multi-2level/Bio/Graphics/Browser.pm:1884
>>> STACK Bio::Graphics::Browser::name2segments /Library/Perl/5.8.6/
>>> darwin-thread-multi-2level/Bio/Graphics/Browser.pm:1828
>>> STACK main::lookup_features_from_db /Library/WebServer/CGI-
>>> Executables/gbrowse:1677
>>> STACK main::get_features /Library/WebServer/CGI-Executables/
>>> gbrowse:1577
>>> STACK toplevel /Library/WebServer/CGI-Executables/gbrowse:182
>>>
>>> My configuration file:
>>> [GENERAL]
>>> # based on 06.biosql.conf,v 1.2.6.2.2.1 2006/06/07 20:50:29 which
>>> was obtained
>>> # from the basic gbrowse installation
>>> description = Magnaporthe grisea V5 genbank BioSQL
>>> db_adaptor = Bio::DB::Das::BioSQL
>>> #nb: the dashes are required and were missing in the original
>>> db_args = driver mysql
>>>           -dbname M_grisea_genbank_biosql
>>>           -namespace bioperl
>>>           -biodbname genbank
>>>           -version 1
>>>           -host mycelium.fgl.ncsu.edu
>>>           -user www
>>>           -pass ""
>>>           -port 3306
>>>
>>> plugins = FastaDumper RestrictionAnnotator
>>> # deb 2-apr-08 SequenceDumper can't be found. So, dont use it.
>>> #SequenceDumper
>>>
>>> ... remainder is unchanged ...
>>>
>>> database creation and load:
>>> mysql -udebrown -p -h mycelium.fgl.ncsu.edu
>>> drop database M_grisea_genbank_biosql;
>>> create database M_grisea_genbank_biosql;
>>> grant select on M_grisea_genbank_biosql.* to
>>> www at marray.fgl.ncsu.edu;
>>> use M_grisea_genbank_biosql;
>>>
>>> mysql -udebrown -p -h mycelium.fgl.ncsu.edu
>>> M_grisea_genbank_biosql >> mysql.sql
>>>
>>> C:\Perl\site\bin\load_seqdatabase.bat\load_seqdatabase.bat --dsn
>>> "dbi:mysql:database=M_grisea_genbank_biosql;host=mycelium.fgl.ncsu.e
>>> du" -dbuser debrown --dbpass XXXXXXX --format genbank genbank
>>> \CH476760.gb --namespace genbank
>>>
>>> machine (node) layout:
>>> marray.fgl.ncsu.edu is the web server
>>> mycelium.fgl.ncsu.edu is the database server
>>> dougslaptop.fgl.ncsu.edu is a development system.
>>>
>>>
>>> Thank you for your time,
>>>
>>> Regards,
>>> Doug Brown
>>> --
>>> Doug Brown - Bioinformatics
>>> Fungal Genomics Laboratory
>>> Center for Integrated Fungal Research
>>> North Carolina State University
>>> Campus Box 7251, Raleigh, NC 27695-7251
>>> https://www.fungalgenomics.ncsu.edu/~debrown/
>>> Tel: (919) 513-0394, Fax (919) 513-0024
>>> e-mail: doug_brown at ncsu.edu
>>>

From biopython at maubp.freeserve.co.uk Wed Apr 9 06:02:27 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 9 Apr 2008 11:02:27 +0100
Subject: [BioSQL-l] left_value and right_value in taxon table
In-Reply-To: <5EDD43ED-57DD-4BAC-9A93-3B7DAE67ACB4@gmx.net>
References: <5EDD43ED-57DD-4BAC-9A93-3B7DAE67ACB4@gmx.net>
Message-ID: <320fb6e00804090302i5ab447acx912ce205c5580b4@mail.gmail.com>

On Wed, Apr 9, 2008 at 12:57 AM, Hilmar Lapp wrote:
>
> The [load_ncbi_taxonomy.pl] script does recompute *all*
> nested set values, though...
> Though you can have very cheap cases indeed, in reality it
> turns out that on average you still need to traverse and update
> at least half of the nodes, so personally I really doubt you
> would save any significant amount of time by not just redoing
> all of them. And it's not that time-intensive either; typically it
> takes about 10-20mins, depending on CPU etc.

This does mean that in general, trying to fully update the taxon table
when adding a new sequence with a novel NCBI taxon id would take at
least 10mins (in addition to the drawback of having the Bio* project
reimplement much of the load_ncbi_taxonomy.pl script's logic).

This probably helps explain why when the NCBI taxon ID wasn't already
defined, the old Biopython code would actually create new taxon table
entries for the entire lineage (based on the species lineage names in a
GenBank file) without linking into any existing taxon table entries
which may have matched. Because these new entries were independent of
everything else, their left/right values could be calculated trivially
(starting above the largest existing left/right value). This had the
advantage of recording as much information as possible (without having
to use load_ncbi_taxonomy.pl at all), but left the taxon table full of
redundant entries.

I think that in this case, when trying to load a sequence with a novel
NCBI taxon id, the best solution may be just to add a single minimal
taxon table entry with NULL left/right values (and let the
load_ncbi_taxonomy.pl fill in the lineage later).

Peter

From aaron.j.mackey at gsk.com Wed Apr 9 08:45:43 2008
From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com)
Date: Wed, 9 Apr 2008 08:45:43 -0400
Subject: [BioSQL-l] left_value and right_value in taxon table
In-Reply-To: <5EDD43ED-57DD-4BAC-9A93-3B7DAE67ACB4@gmx.net>
Message-ID:

> On Apr 8, 2008, at 11:58 AM, aaron.j.mackey at gsk.com wrote:
> > I believe that the first thing the load_ncbi_taxonomy.pl script
> > does is to
> > wipe out everything already in the table.
>
> That may have been true in its beginnings but hasn't been for a
> long time :-) It only updates changed nodes, adds new ones, and
> deletes retired ones (unless you say --nodelete). The script does
> recompute *all* nested set values, though.

Ahh right, I remember all that now. It was the wiping out of the
left/right values that I was thinking of.

Thanks,

-Aaron

From mmokrejs at ribosome.natur.cuni.cz Fri Apr 11 20:50:58 2008
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Sat, 12 Apr 2008 02:50:58 +0200
Subject: [BioSQL-l] left_value and right_value in taxon table
Message-ID: <48000772.4060903@ribosome.natur.cuni.cz>

>> On Apr 8, 2008, at 11:58 AM, aaron.j.mackey at gsk.com wrote:
>> > I believe that the first thing the load_ncbi_taxonomy.pl script
>> > does is to
>> > wipe out everything already in the table.
>>
>> That may have been true in its beginnings but hasn't been for a
>> long time :-) It only updates changed nodes, adds new ones, and
>> deletes retired ones (unless you say --nodelete). The script does
>> recompute *all* nested set values, though.
>
> Ahh right, I remember all that now. It was the wiping out of the
> left/right values that I was thinking of.
>
> Thanks,
>
> -Aaron

Hi,

Maybe you have meant the other taxonomy loading script? ;-)

http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/scripts/load_itis_taxonomy.pl

    You can use this script to load the taxonomy data into a fresh
    instance of biosql. Otherwise an already existing ITIS tree will
    be deleted first.

I just don't understand why the very first sentence of the
documentation within the scripts says something about 'update':

    This script loads or updates a biosql schema with phylodb extension
    with the ITIS taxonomy as a phylogenetic trees, one tree for each
    kingdom.

Regards,
Martin

From mmokrejs at ribosome.natur.cuni.cz Fri Apr 11 21:32:14 2008
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Sat, 12 Apr 2008 03:32:14 +0200
Subject: [BioSQL-l] [Bioperl-l] Loading sequences with novel NCBI taxon_id
In-Reply-To:
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
Message-ID: <4800111E.3030802@ribosome.natur.cuni.cz>

Chris Fields wrote:
> The counter to that perspective (using new sequences with old tax info)
> would be to regularly update NCBI taxonomy, particularly in
> circumstances prior to adding new sequences. Hilmar mentioned that once
> tax is loaded it doesn't take as long to update, so you could set up a
> cron job to update regularly.
>
> I remember someone mentioning weekly or monthly updates on the list
> quite a while ago, but I'm unsure how often NCBI updates tax information
> (i.e. with every release, monthly, weekly, etc). I can see instances
> popping up where you used an up-to-date taxonomy but a new sequence
> contains a tax ID not present. I think bioperl-db handles these but I'm
> not sure what other Bio* do.
>

I spent some time benchmarking this and inspecting the mysql log files.
The current load_ncbi_taxonomy.pl script, with a minor modification to
show timestamps, does the following on the initial import into mysql
and then on an update of the database using exactly the same dataset
(it has to walk through all the data anyway):

$ ./load_ncbi_taxonomy.pl --dbname=biosqldb --driver=mysql --host=127.0.01 \
  --port=3306 --directory=/home/mmokrejs/bioinformatics/databases/ncbitax/dump \
  --chunksize=0 --verbose=2 --mycnf=~/.my.cnf
Sat Apr 12 01:58:43 MEST 2008
Loading NCBI taxon database in /home/mmokrejs/bioinformatics/databases/ncbitax/dump:
        ... retrieving all taxon nodes in the database
Sat Apr 12 01:58:43 MEST 2008
        ... reading in taxon nodes from nodes.dmp
Sat Apr 12 01:58:58 MEST 2008
        ... insert / update / delete taxon nodes
        10000/421098 done (in 5 secs, 2000.0 rows/s)
        20000/421098 done (in 4 secs, 2500.0 rows/s)
        ...
        420000/421098 done (in 4 secs, 2500.0 rows/s)
Sat Apr 12 02:02:21 MEST 2008
        ... (committing nodes)
Sat Apr 12 02:02:21 MEST 2008
        ... rebuilding nested set left/right values
        10000 done (in 24 secs, 416.7 rows/s)
        20000 done (in 26 secs, 384.6 rows/s)
        30000 done (in 24 secs, 416.7 rows/s)
        ...
        420004 done (in 23 secs, 434.8 rows/s)
Sat Apr 12 02:19:25 MEST 2008
        ... reading in taxon names from names.dmp
Sat Apr 12 02:19:25 MEST 2008
        ... deleting old taxon names
Sat Apr 12 02:19:25 MEST 2008
        ... inserting new taxon names
        10000 done (in 8 secs, 1250.0 rows/s)
        20000 done (in 8 secs, 1250.0 rows/s)
        ...
        580000 done (in 5 secs, 2000.0 rows/s)
Sat Apr 12 02:24:48 MEST 2008
        ... cleaning up
Sat Apr 12 02:24:49 MEST 2008
Done.
$

I decided to re-import the same data to mimic at least somehow
the future updates, although no record should be UPDATEd,
except zapping left and right values with NULL. :((

$ ./load_ncbi_taxonomy.pl --dbname=biosqldb --driver=mysql --host=127.0.01 --port=3306 --directory=/home/mmokrejs/bioinformatics/databases/ncbitax/dump \
  --chunksize=0 --verbose=2 --mycnf=~/.my.cnf
Sat Apr 12 02:35:20 MEST 2008
Loading NCBI taxon database in /home/mmokrejs/bioinformatics/databases/ncbitax/dump:
        ... retrieving all taxon nodes in the database
Sat Apr 12 02:35:26 MEST 2008
        ... reading in taxon nodes from nodes.dmp
Sat Apr 12 02:35:46 MEST 2008
        ... insert / update / delete taxon nodes
        10000/421098 done (in 0 secs, 10000.0 rows/s)
        20000/421098 done (in 0 secs, 10000.0 rows/s)
        ...
        410000/421098 done (in 0 secs, 10000.0 rows/s)
        420000/421098 done (in 0 secs, 10000.0 rows/s)
Sat Apr 12 02:35:55 MEST 2008
        ... (committing nodes)
Sat Apr 12 02:35:55 MEST 2008
        ... rebuilding nested set left/right values
        10000 done (in 9 secs, 1111.1 rows/s)
        20000 done (in 9 secs, 1111.1 rows/s)
        ...
        410004 done (in 8 secs, 1250.0 rows/s)
        420004 done (in 9 secs, 1111.1 rows/s)
Sat Apr 12 02:41:54 MEST 2008
        ... reading in taxon names from names.dmp
Sat Apr 12 02:41:54 MEST 2008
        ... deleting old taxon names
Sat Apr 12 02:41:55 MEST 2008
        ... inserting new taxon names
        10000 done (in 5 secs, 2000.0 rows/s)
        20000 done (in 5 secs, 2000.0 rows/s)
        ...
        570000 done (in 6 secs, 1666.7 rows/s)
        580000 done (in 5 secs, 2000.0 rows/s)
Sat Apr 12 02:47:27 MEST 2008
        ... cleaning up
Sat Apr 12 02:47:27 MEST 2008
Done.
$ ls -la /var/log/mysql/mysql.log
-rw-rw---- 1 mysql mysql 483443314 Apr 12 03:15 /var/log/mysql/mysql.log
$

The machine is a Pentium4 M laptop, 1.8GHz, 1 GB RAM, running
mysql-5.0.56 with SQL text logging enabled (the slow variant that logs
all SQL commands, as opposed to binary logging). The log was cleared
before the tests. I could provide some bits from the log or upload it
somewhere if anybody else would like to dig into the details.

I believe the recalculation step could be made faster.
See what happens:

31 Query SELECT taxon_id, left_value, right_value FROM taxon WHERE parent_taxon_id = '1' ORDER BY ncbi_taxon_id
31 Query SELECT taxon_id, left_value, right_value FROM taxon WHERE parent_taxon_id = '10239' ORDER BY ncbi_taxon_id
31 Query SELECT taxon_id, left_value, right_value FROM taxon WHERE parent_taxon_id = '12333' ORDER BY ncbi_taxon_id
31 Query SELECT taxon_id, left_value, right_value FROM taxon WHERE parent_taxon_id = '12335' ORDER BY ncbi_taxon_id
31 Query UPDATE taxon SET left_value = NULL, right_value = NULL WHERE left_value = '4'
31 Query UPDATE taxon SET left_value = NULL, right_value = NULL WHERE right_value = '5'
31 Query UPDATE taxon SET left_value = '4', right_value = '5' WHERE taxon_id = '12335'
31 Query SELECT taxon_id, left_value, right_value FROM taxon WHERE parent_taxon_id = '12340' ORDER BY ncbi_taxon_id
31 Query UPDATE taxon SET left_value = NULL, right_value = NULL WHERE left_value = '6'
31 Query UPDATE taxon SET left_value = NULL, right_value = NULL WHERE right_value = '7'
31 Query UPDATE taxon SET left_value = '6', right_value = '7' WHERE taxon_id = '12340'

The columns left_value and right_value are already NULL when the table
is created, so there is no need to write NULL into them again. Avoiding
that would mean writing a wrapper function which mimics update(), but
first does a 'SELECT * FROM', compares the stored values with those to
be written, and includes in the final UPDATE statement only those
columns whose values have changed. We use such a smart wrapper for our
code in python. ;-)

When the left and right columns are to be set to NULL during an update
of an existing database, I think it would be much faster to drop the
columns and re-create them with NULL values.

I think the possibility of creating empty taxon and taxon_name tables
as MyISAM tables, and converting them into InnoDB tables only after all
the imports and updates are done, could be investigated further.
One would probably have to think a bit more about the foreign keys, but
it may be that they would not even be lost during the conversion back
and forth. Actually, that is easy to check. Dump your current taxon and
taxon_name tables (maybe even without the row data, using mysqldump's
--no-data option), run 'ALTER TABLE taxon ... type=MyISAM' followed by
'ALTER TABLE taxon ... type=InnoDB', dump the database structure again
and compare it by diff with the original. But, time for sleep here.

Martin

From hlapp at gmx.net Fri Apr 11 22:48:29 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 11 Apr 2008 22:48:29 -0400
Subject: [BioSQL-l] left_value and right_value in taxon table
In-Reply-To: <48000772.4060903@ribosome.natur.cuni.cz>
References: <48000772.4060903@ribosome.natur.cuni.cz>
Message-ID: <82CF4290-2F0F-4E8D-86C9-B4C59072C74C@gmx.net>

On Apr 11, 2008, at 8:50 PM, Martin MOKREJŠ wrote:

>>> On Apr 8, 2008, at 11:58 AM, aaron.j.mackey at gsk.com wrote:
>>> > I believe that the first thing the load_ncbi_taxonomy.pl script
>>> > does is to
>>> > wipe out everything already in the table.
>>> That may have been true in its beginnings but hasn't been for
>>> a long time :-) It only updates changed nodes, adds new ones, and
>>> deletes retired ones (unless you say --nodelete). The script does
>>> recompute *all* nested set values, though.
>> Ahh right, I remember all that now. It was the wiping out of the
>> left/right values that I was thinking of.
>> Thanks,
>> -Aaron
>
> Hi,
>
> Maybe you have meant the other taxonomy loading script? ;-)
>
> http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/
> trunk/scripts/load_itis_taxonomy.pl

This loads a taxonomy too, but into the PhyloDB tables (because the
ITIS taxonomy consists of multiple hierarchies, not a single one like
NCBI).

>
>
> You can use this script to load the taxonomy data into a fresh
> instance of
> biosql. Otherwise an already existing ITIS tree will be deleted first.
>
> I just don't understand why the very first sentence of the
> documentation within the scripts says something about 'update':
>
> This script loads or updates a biosql schema with phylodb extension
> with the ITIS taxonomy as a phylogenetic trees, one tree for each
> kingdom.

The 'update' here is forward looking :) At this point there won't really be an update as any existing trees within the ITIS namespace are deleted first.

-hilmar

> Regards,
> Martin

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Fri Apr 11 23:23:21 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 11 Apr 2008 23:23:21 -0400
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon_id
In-Reply-To: <4800111E.3030802@ribosome.natur.cuni.cz>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
    <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
    <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
    <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
    <4800111E.3030802@ribosome.natur.cuni.cz>
Message-ID:

On Apr 11, 2008, at 9:32 PM, Martin MOKREJŠ wrote:

> I decided to re-import the same data to mimic at least somehow
> the future updates, although no record should be UPDATEd,
> except zapping left and right values with NULL. :((

Not sure what made you frown here?

> [...]
>
> I believe the recalculation step could be made faster. See what
> happens:
> [...]
> The columns left_value and right_value are already NULL when
> the table is created, so there is no need to write NULL into
> them again.

But that's only true the first time you load. For almost all real databases, all except the first run of the script won't be able to take advantage of that.
> This would mean writing a wrapper function that mimics update()
> but first does a 'SELECT * FROM', compares the stored values with
> those to be written, and includes in the final UPDATE statement
> only those columns whose values have changed. We use such a smart
> wrapper for our code in python. ;-)

What you see is the "optimization" for MySQL. For all other RDBMSs it does both left and right in one update. BTW note that SELECT does not have zero cost: it requires both an index and a table read, only to find on average 50% of the time that you will need to update anyway. So what you gain 50% of the time you lose the other 50% of the time.

> When the left and right columns are to be set to NULL during an
> update of an existing database, I think it would be much faster
> to drop the columns and re-create them with NULL values.

In terms of speed, that may indeed be how MySQL works. In PostgreSQL it would even be transactional (but very slow with concurrent queries), but with most databases you are now outside of a transaction (because it is DDL), which not only leaves the data in an inconsistent state, but will also immediately break any application you run against it because the table structure changed under its feet.

> [...] I think the possibility of creating empty taxon and taxon_name
> tables as MyISAM tables, and converting them into InnoDB tables only
> after all the imports and updates, could be investigated further.

I'm sure there are lots of hacks and tricks that would make this faster for one particular RDBMS, and you are welcome to explore those. But the script is written to deal with several RDBMSs, and it does so in as transactionally safe a way as possible. The assumption is that you are running this against a live database that is being queried concurrently.
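The compare-before-update idea debated above can be sketched roughly as follows. This is only an illustration under stated assumptions: `smart_update` is a hypothetical name, not the wrapper Martin mentions and not anything in load_ncbi_taxonomy.pl, and an in-memory SQLite table stands in for the real taxon table. It also makes the trade-off visible: every call pays for one SELECT, whether or not an UPDATE follows.

```python
import sqlite3


def smart_update(conn, table, key_col, key, new_values):
    """Write only the columns whose stored value differs from new_values.

    Hypothetical helper for illustration. Table and column names are
    interpolated into the SQL text, so they must come from trusted code.
    Returns the list of column names that were actually written.
    """
    cols = list(new_values)
    # One extra read per row: this is the cost that has to be paid up front.
    row = conn.execute(
        f"SELECT {', '.join(cols)} FROM {table} WHERE {key_col} = ?", (key,)
    ).fetchone()
    if row is None:
        return []
    # NULL comes back as None, so NULL -> NULL counts as unchanged.
    changed = [c for c, old in zip(cols, row) if old != new_values[c]]
    if changed:
        assignments = ", ".join(f"{c} = ?" for c in changed)
        conn.execute(
            f"UPDATE {table} SET {assignments} WHERE {key_col} = ?",
            [new_values[c] for c in changed] + [key],
        )
    return changed


# Toy table standing in for taxon (assumption: simplified columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, "
    "left_value INTEGER, right_value INTEGER)"
)
conn.execute("INSERT INTO taxon VALUES (12335, 4, NULL)")

wanted = {"left_value": 4, "right_value": 5}
first = smart_update(conn, "taxon", "taxon_id", 12335, wanted)   # writes one column
second = smart_update(conn, "taxon", "taxon_id", 12335, wanted)  # writes nothing
stored = conn.execute(
    "SELECT left_value, right_value FROM taxon WHERE taxon_id = 12335"
).fetchone()
```

Whether this pays off depends on how often the stored values already match, which is the 50% point made above.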
-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Sat Apr 12 14:10:44 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 12 Apr 2008 14:10:44 -0400
Subject: [BioSQL-l] personal vs list email
Message-ID:

I'm not sure why but I have received several Bioperl or BioSQL-related email inquiries directed to me *personally* over the past few weeks.

I have been responding as I get to them, but I feel that I am doing both the senders and this community a poor service, because sometimes someone else on the list could have responded much faster, and when I respond, others on the list who happen to be interested in the same question don't get to see the answer.

So from now on as a policy I will redirect *every* email that is sent to me personally and asks a question related to one of the projects to the respective mailing list. If you don't want this, please conspicuously say so at the top of your email, and in that case if you do ask a project-related question be prepared to wait and possibly to have to follow up.

As an aside, it's a pretty safe assumption to make that all other core developers, and quite possibly *all* developers are following a similar policy, whether expressly or not. Isn't this somewhere in the FAQ too?

-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Sat Apr 12 16:17:43 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 12 Apr 2008 15:17:43 -0500
Subject: [BioSQL-l] personal vs list email
In-Reply-To:
References:
Message-ID:

On Apr 12, 2008, at 1:10 PM, Hilmar Lapp wrote:

> I'm not sure why but I have received several Bioperl or BioSQL-related
> email inquiries directed to me *personally* over the past few weeks.
>
> I have been responding as I get to them, but I feel that I am doing
> both the senders and this community a poor service, because
> sometimes someone else on the list could have responded much faster,
> and when I respond, others on the list who happen to be interested
> in the same question don't get to see the answer.
>
> So from now on as a policy I will redirect *every* email that is sent
> to me personally and asks a question related to one of the projects
> to the respective mailing list. If you don't want this, please
> conspicuously say so at the top of your email, and in that case if
> you do ask a project-related question be prepared to wait and
> possibly to have to follow up.
>
> As an aside, it's a pretty safe assumption to make that all other
> core developers, and quite possibly *all* developers are following a
> similar policy, whether expressly or not.

I agree; I'm sure several other core devs feel the same way. I always try to forward these to the list if I feel it is more relevant there.

> Isn't this somewhere in the FAQ too?
>
> -hilmar

No, but I've added it to the bioperl FAQ; might be worth checking over and editing.

chris

From aaron.j.mackey at gsk.com Mon Apr 14 09:00:52 2008
From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com)
Date: Mon, 14 Apr 2008 09:00:52 -0400
Subject: [BioSQL-l] [Bioperl-l] personal vs list email
In-Reply-To:
Message-ID:

I try to take it even one step further: I require the person to re-ask their question on the mailing list (and then try to answer it there). This has the added benefit of causing the person to pause a moment to reflect on their question, and (sometimes) to spend a bit more time preparing the question for broader public consumption.
-Aaron

From biopython at maubp.freeserve.co.uk Wed Apr 23 05:04:33 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Apr 2008 10:04:33 +0100
Subject: [BioSQL-l] BioSQL script to update taxon table left/right values
Message-ID: <320fb6e00804230204w5161e6afg143a66a3cbb8aa66@mail.gmail.com>

Dear list,

In addition to loading the NCBI taxonomy, the load_ncbi_taxonomy.pl script also recalculates the left/right values.

Is there a separate BioSQL script which ONLY recalculates the left/right values?

I was asked this by a Biopython user. Possible use-cases include people using a non-NCBI taxonomy.

Peter

From hlapp at gmx.net Wed Apr 23 10:14:19 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 23 Apr 2008 10:14:19 -0400
Subject: [BioSQL-l] BioSQL script to update taxon table left/right values
In-Reply-To: <320fb6e00804230204w5161e6afg143a66a3cbb8aa66@mail.gmail.com>
References: <320fb6e00804230204w5161e6afg143a66a3cbb8aa66@mail.gmail.com>
Message-ID: <57EF37D9-295A-495A-ADFD-E13722DBD642@gmx.net>

No there isn't but it's a good idea. Would you mind posting it as a BioSQL bug/feature request?

-hilmar

On Apr 23, 2008, at 5:04 AM, Peter wrote:

> Dear list,
>
> In addition to loading the NCBI taxonomy, the load_ncbi_taxonomy.pl
> script also recalculates the left/right values.
>
> Is there a separate BioSQL script which ONLY recalculates the
> left/right values?
>
> I was asked this by a Biopython user. Possible use-cases include
> people using a non-NCBI taxonomy.
>
> Peter

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Wed Apr 23 11:41:00 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 23 Apr 2008 16:41:00 +0100
Subject: [BioSQL-l] BioSQL script to update taxon table left/right values
In-Reply-To: <57EF37D9-295A-495A-ADFD-E13722DBD642@gmx.net>
References: <320fb6e00804230204w5161e6afg143a66a3cbb8aa66@mail.gmail.com>
    <57EF37D9-295A-495A-ADFD-E13722DBD642@gmx.net>
Message-ID: <320fb6e00804230841p188b4897q6c7f08dcf1ead552@mail.gmail.com>

On Wed, Apr 23, 2008 at 3:14 PM, Hilmar Lapp wrote:
> No there isn't but it's a good idea. Would you mind posting it as a BioSQL
> bug/feature request?

Sure, I've filed an enhancement request:

Bug 2493 - New script to recalculate left/right values in the taxon table
http://bugzilla.open-bio.org/show_bug.cgi?id=2493

Peter

From mmokrejs at ribosome.natur.cuni.cz Wed Apr 23 11:58:18 2008
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Wed, 23 Apr 2008 17:58:18 +0200
Subject: [BioSQL-l] BioSQL script to update taxon table left/right values
In-Reply-To: <320fb6e00804230841p188b4897q6c7f08dcf1ead552@mail.gmail.com>
References: <320fb6e00804230204w5161e6afg143a66a3cbb8aa66@mail.gmail.com>
    <57EF37D9-295A-495A-ADFD-E13722DBD642@gmx.net>
    <320fb6e00804230841p188b4897q6c7f08dcf1ead552@mail.gmail.com>
Message-ID: <480F5C9A.60007@ribosome.natur.cuni.cz>

I would just propose making the current script more modular and providing a command-line option that would just establish the database handle, update the fields, and close $dbh.

M.
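For readers wondering what a recalculation-only step might look like, here is a minimal sketch of the depth-first numbering described earlier in the thread. It is not the logic of load_ncbi_taxonomy.pl: the function name `rebuild_nested_set`, the simplified table, the toy IDs, and the rule for what counts as a root (no parent, self-parent, or parent missing from the table) are all assumptions made for illustration, and SQLite stands in for the real database.

```python
import sqlite3


def rebuild_nested_set(conn):
    """Renumber left_value/right_value for all rows by depth-first traversal.

    Illustrative only; assumes a simplified taxon table. Returns the
    largest value assigned (twice the number of nodes).
    """
    rows = conn.execute(
        "SELECT taxon_id, parent_taxon_id FROM taxon ORDER BY ncbi_taxon_id"
    ).fetchall()
    ids = {taxon_id for taxon_id, _ in rows}
    children, roots = {}, []
    for taxon_id, parent_id in rows:
        # Assumption: a node is a root if it has no usable parent.
        if parent_id is None or parent_id == taxon_id or parent_id not in ids:
            roots.append(taxon_id)
        else:
            children.setdefault(parent_id, []).append(taxon_id)

    bounds = {}  # taxon_id -> [left, right]
    counter = 0
    # Explicit stack instead of recursion; the flag marks the second visit,
    # which happens after all descendants have been numbered.
    stack = [(root, False) for root in reversed(roots)]
    while stack:
        taxon_id, leaving = stack.pop()
        counter += 1
        if leaving:
            bounds[taxon_id][1] = counter
            continue
        bounds[taxon_id] = [counter, None]
        stack.append((taxon_id, True))
        stack.extend((c, False) for c in reversed(children.get(taxon_id, [])))

    with conn:  # one transaction: readers never see a half-numbered tree
        # Clear first so unique left/right values cannot collide mid-update.
        conn.execute("UPDATE taxon SET left_value = NULL, right_value = NULL")
        conn.executemany(
            "UPDATE taxon SET left_value = ?, right_value = ? WHERE taxon_id = ?",
            [(left, right, t) for t, (left, right) in bounds.items()],
        )
    return counter


# Toy data (made-up IDs): node 1 is the root, 2 and 3 are its children,
# and 4 is a child of 3.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, ncbi_taxon_id INTEGER, "
    "parent_taxon_id INTEGER, left_value INTEGER UNIQUE, right_value INTEGER UNIQUE)"
)
conn.executemany(
    "INSERT INTO taxon (taxon_id, ncbi_taxon_id, parent_taxon_id) VALUES (?, ?, ?)",
    [(1, 1, None), (2, 20, 1), (3, 30, 1), (4, 40, 3)],
)
highest = rebuild_nested_set(conn)
result = conn.execute(
    "SELECT taxon_id, left_value, right_value FROM taxon ORDER BY taxon_id"
).fetchall()
```

With the values in place, the descendants of a node are the rows whose left_value lies strictly between that node's left_value and right_value, which is what makes lineage queries cheap.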
Peter wrote:
> On Wed, Apr 23, 2008 at 3:14 PM, Hilmar Lapp wrote:
>> No there isn't but it's a good idea. Would you mind posting it as a BioSQL
>> bug/feature request?
>
> Sure, I've filed an enhancement request:
>
> Bug 2493 - New script to recalculate left/right values in the taxon table
> http://bugzilla.open-bio.org/show_bug.cgi?id=2493
>
> Peter

From darin.london at duke.edu Tue Apr 29 12:52:54 2008
From: darin.london at duke.edu (darin.london at duke.edu)
Date: Tue, 29 Apr 2008 12:52:54 -0400
Subject: [BioSQL-l] BOSC 2008 Announcement and Call For Submissions
Message-ID: <200804291653.m3TGqsF5020841@tenero.duhs.duke.edu>

BOSC 2008 Call for Abstracts Reminder

The 9th annual Bioinformatics Open Source Conference (BOSC 2008) will take place in Toronto, Ontario, Canada, as one of several Special Interest Group (SIG) meetings occurring in conjunction with the 16th annual Intelligent Systems for Molecular Biology Conference (ISMB 2008).

This is a reminder to submit your proposals for talks to the BOSC submission system before May 11.

Submission Process:

All abstracts must be submitted through our Open Conference Systems site (http://events.open-bio.org/BOSC2008/openconf.php). The form will ask for a small Abstract Text to be pasted into it, and a full paper. The small Abstract Text should be a summary, while the longer abstract should provide more details, including the open-source license requirement details. Full-length abstracts are limited to one page with one inch (2.5 cm) margins on the top, sides, and bottom. The full-length abstract should include the title, authors, and affiliations. We prefer your abstract to be in PDF format, although plain t

Important Dates:

May 11: Abstract submission deadline.
June 2: Notification of accepted talks.
June 4: Early registration discount cut-off.
July 18-19: BOSC 2008!

We hope to see you at BOSC 2008!

Kam Dahlquist and Darin London
BOSC 2008 Co-organizers
Regards, Martin From mmokrejs at ribosome.natur.cuni.cz Sat Apr 12 01:32:14 2008 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Sat, 12 Apr 2008 03:32:14 +0200 Subject: [BioSQL-l] [Bioperl-l] Loading sequences with novel NCBI taxon_id In-Reply-To: References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> Message-ID: <4800111E.3030802@ribosome.natur.cuni.cz> Chris Fields wrote: > The counter to that perspective (using new sequences with old tax info) > would be to regularly update NCBI taxonomy, particularly in > circumstances prior to adding new sequences. Hilmar mentioned that once > tax is loaded it doesn't take as long to update, so you could set up a > cron job to update regularly. > > I remember someone mentioning weekly or monthly updates on the list > quite a while ago, but I'm unsure how often NCBI updates tax information > (i.e. with every release, monthly, weekly, etc). I can see instances > popping up where you used the an up-to-date taxonomy but a new sequence > contains a tax ID not present. I think bioperl-db handles these but I'm > not sure what other Bio* do. > I spent some time benchmarking this and inspecting the mysql log files. The current load_ncbi_taxonomy.pl script with minor modification to show timestamps does this on initial import into mysql and then update of the database using exactly same dataset (but anyway it has to walk through all the data): $ ./load_ncbi_taxonomy.pl --dbname=biosqldb --driver=mysql --host=127.0.01 \ --port=3306 --directory=/home/mmokrejs/bioinformatics/databases/ncbitax/dump \ --chunksize=0 --verbose=2 --mycnf=~/.my.cnf Sat Apr 12 01:58:43 MEST 2008 Loading NCBI taxon database in /home/mmokrejs/bioinformatics/databases/ncbitax/dump: ... 
retrieving all taxon nodes in the database Sat Apr 12 01:58:43 MEST 2008 ... reading in taxon nodes from nodes.dmp Sat Apr 12 01:58:58 MEST 2008 ... insert / update / delete taxon nodes 10000/421098 done (in 5 secs, 2000.0 rows/s) 20000/421098 done (in 4 secs, 2500.0 rows/s) ... 420000/421098 done (in 4 secs, 2500.0 rows/s) Sat Apr 12 02:02:21 MEST 2008 ... (committing nodes) Sat Apr 12 02:02:21 MEST 2008 ... rebuilding nested set left/right values 10000 done (in 24 secs, 416.7 rows/s) 20000 done (in 26 secs, 384.6 rows/s) 30000 done (in 24 secs, 416.7 rows/s) ... 420004 done (in 23 secs, 434.8 rows/s) Sat Apr 12 02:19:25 MEST 2008 ... reading in taxon names from names.dmp Sat Apr 12 02:19:25 MEST 2008 ... deleting old taxon names Sat Apr 12 02:19:25 MEST 2008 ... inserting new taxon names 10000 done (in 8 secs, 1250.0 rows/s) 20000 done (in 8 secs, 1250.0 rows/s) ... 580000 done (in 5 secs, 2000.0 rows/s) Sat Apr 12 02:24:48 MEST 2008 ... cleaning up Sat Apr 12 02:24:49 MEST 2008 Done. $ I decided to re-import the same data to mimic at least somehow the future updates, although no record should be UPDATEd, except zapping left and right values with NULL. :(( $ ./load_ncbi_taxonomy.pl --dbname=biosqldb --driver=mysql --host=127.0.01 --port=3306 --directory=/home/mmokrejs/bioinformatics/databases/ncbitax/dump \ --chunksize=0 --verbose=2 --mycnf=~/.my.cnf Sat Apr 12 02:35:20 MEST 2008 Loading NCBI taxon database in /home/mmokrejs/bioinformatics/databases/ncbitax/dump: ... retrieving all taxon nodes in the database Sat Apr 12 02:35:26 MEST 2008 ... reading in taxon nodes from nodes.dmp Sat Apr 12 02:35:46 MEST 2008 ... insert / update / delete taxon nodes 10000/421098 done (in 0 secs, 10000.0 rows/s) 20000/421098 done (in 0 secs, 10000.0 rows/s) ... 410000/421098 done (in 0 secs, 10000.0 rows/s) 420000/421098 done (in 0 secs, 10000.0 rows/s) Sat Apr 12 02:35:55 MEST 2008 ... (committing nodes) Sat Apr 12 02:35:55 MEST 2008 ... 
rebuilding nested set left/right values 10000 done (in 9 secs, 1111.1 rows/s) 20000 done (in 9 secs, 1111.1 rows/s) ... 410004 done (in 8 secs, 1250.0 rows/s) 420004 done (in 9 secs, 1111.1 rows/s) Sat Apr 12 02:41:54 MEST 2008 ... reading in taxon names from names.dmp Sat Apr 12 02:41:54 MEST 2008 ... deleting old taxon names Sat Apr 12 02:41:55 MEST 2008 ... inserting new taxon names 10000 done (in 5 secs, 2000.0 rows/s) 20000 done (in 5 secs, 2000.0 rows/s) ... 570000 done (in 6 secs, 1666.7 rows/s) 580000 done (in 5 secs, 2000.0 rows/s) Sat Apr 12 02:47:27 MEST 2008 ... cleaning up Sat Apr 12 02:47:27 MEST 2008 Done. $ ls -la /var/log/mysql/mysql.log -rw-rw---- 1 mysql mysql 483443314 Apr 12 03:15 /var/log/mysql/mysql.log $ Pentium4 M laptop, 1.8GHz, 1 GB RAM, mysql-5.0.56 with enabled SQL text logging, the slow version of logging all SQL commands compared to binary logging. The log was cleared before the tests. I could provide some bits from the log or upload it somewhere if anybody else would like to dig into the details. I believe the recalculation step could be made faster. 
See what happens:

31 Query SELECT taxon_id, left_value, right_value FROM taxon WHERE parent_taxon_id = '1' ORDER BY ncbi_taxon_id
31 Query SELECT taxon_id, left_value, right_value FROM taxon WHERE parent_taxon_id = '10239' ORDER BY ncbi_taxon_id
31 Query SELECT taxon_id, left_value, right_value FROM taxon WHERE parent_taxon_id = '12333' ORDER BY ncbi_taxon_id
31 Query SELECT taxon_id, left_value, right_value FROM taxon WHERE parent_taxon_id = '12335' ORDER BY ncbi_taxon_id
31 Query UPDATE taxon SET left_value = NULL, right_value = NULL WHERE left_value = '4'
31 Query UPDATE taxon SET left_value = NULL, right_value = NULL WHERE right_value = '5'
31 Query UPDATE taxon SET left_value = '4', right_value = '5' WHERE taxon_id = '12335'
31 Query SELECT taxon_id, left_value, right_value FROM taxon WHERE parent_taxon_id = '12340' ORDER BY ncbi_taxon_id
31 Query UPDATE taxon SET left_value = NULL, right_value = NULL WHERE left_value = '6'
31 Query UPDATE taxon SET left_value = NULL, right_value = NULL WHERE right_value = '7'
31 Query UPDATE taxon SET left_value = '6', right_value = '7' WHERE taxon_id = '12340'

The columns left_value and right_value have NULL value upon the table is created, so no need to write again NULL into them. This would mean writing a wrapper function which would mimic update() but before doing that it would do 'SELECT * FROM', compare the values with those to be written and include in the final UPDATE statement only those columns for which values have been changed. We use such a smart wrapper for our code in python. ;-)

When the columns for left and right are to be made NULL during update of an existing database, I think it would be much faster to drop the columns and re-create them again with NULL values.

I think it could be investigated more the possibility to create empty taxon and taxon_name tables as MyISAM tables and only after all the import and updates they could be converted into InnoDB tables.
One would have to probably think a bit more of the foreign keys but it might be they would not even be lost during the conversion back and forth. Actually, easy to check. Dump your current taxon and taxon_name tables (maybe even without sql data using --without-data), run 'ALTER TABLE taxon ... type=MyISAM' followed by 'ALTER TABLE taxon ... type=InnoDB' dump again the database structure and compare by diff with the original. But, time for sleep here. Martin From hlapp at gmx.net Sat Apr 12 02:48:29 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 11 Apr 2008 22:48:29 -0400 Subject: [BioSQL-l] left_value and right_value in taxon table In-Reply-To: <48000772.4060903@ribosome.natur.cuni.cz> References: <48000772.4060903@ribosome.natur.cuni.cz> Message-ID: <82CF4290-2F0F-4E8D-86C9-B4C59072C74C@gmx.net> On Apr 11, 2008, at 8:50 PM, Martin MOKREJ? wrote: >>> On Apr 8, 2008, at 11:58 AM, aaron.j.mackey at gsk.com wrote: >>> > I believe that the first thing the load_ncbi_taxonomy.pl script >>> > does is to >>> > wipe out everything already in the table. >>> That may have been in true in its beginnings but hasn't been for >>> a long time :-) It only updates changed nodes, adds new ones, and >>> deletes retired ones (unless you say --nodelete). The script does >>> recompute *all* nested set values, though. >> Ahh right, I remember all that now. It was the wiping out of the >> left/right values that I was thinking of. >> Thanks, >> -Aaron > > Hi, > > Maybe you have meant the other taxonomy loading script? ;-) > > http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/ > trunk/scripts/load_itis_taxonomy.pl This loads a taxonomy too, but into the PhyloDB tables (because the ITIS taxonomy consists of multiple hierarchies, not a single one like NCBI). > > > You can use this script to load the taxonomy data into a fresh > instance of > biosql. Otherwise an already existing ITIS tree will be deleted first. 
>
>
> I just don't understand why the very first sentence of the
> documentation
> within the scripts says something about 'update':
>
>
> This script loads or updates a biosql schema with phylodb extension
> with the ITIS taxonomy as a phylogenetic trees, one tree for each
> kingdom.
>

The 'update' here is forward looking :) At this point there won't really be an update as any existing trees within the ITIS namespace are deleted first.

-hilmar

>
>
> Regards,
> Martin
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Sat Apr 12 03:23:21 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 11 Apr 2008 23:23:21 -0400
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon_id
In-Reply-To: <4800111E.3030802@ribosome.natur.cuni.cz>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> <4800111E.3030802@ribosome.natur.cuni.cz>
Message-ID:

On Apr 11, 2008, at 9:32 PM, Martin MOKREJŠ wrote:

> I decided to re-import the same data to mimic at least somehow
> the future updates, although no record should be UPDATEd,
> except zapping left and right values with NULL. :((

Not sure what made you frown here?

> [...]
>
> I believe the recalculation step could be made faster. See what
> happens:
> [...]
> The columns left_value and right_value have NULL value upon
> the table is created, so no need to write again NULL into
> them.

But that's only true the first time you load. For almost all real databases, all except the first run of the script won't be able to take advantage of that.
> This would mean writing a wrapper function which would
> mimic update() but before doing that it would do 'SELECT * FROM',
> compare the values with those to be written and include in the
> final UPDATE statement only those columns for which values have
> been changed. We use such a smart wrapper for our code in python.
> ;-)

What you see is the "optimization" for MySQL. For all other RDBMSs it does both left and right in one update.

BTW note that SELECT does not have zero cost; it requires both an index and a table read, only to find on average 50% of the time that you will need to update anyway. So what you gain 50% of the time you lose the other 50% of the time.

>
> When the columns for left and right are to be made NULL during
> update of an existing database, I think it would be much faster
> to drop the columns and re-create them again with NULL values.

In terms of speed, that may be how MySQL works indeed. In PostgreSQL it would even be transactional (but very slow with concurrent queries), but with most databases you are now outside of a transaction (because it is DDL), which not only leaves the data in an inconsistent state, but also will immediately break any application you run against it because the table structure changed under its feet.

> [...]
> I think it could be investigated more the possibility to create
> empty taxon and taxon_name tables as MyISAM tables and only after
> all the import and updates they could be converted into InnoDB
> tables.

I'm sure there are lots of hacks and tricks that would make this faster for one particular RDBMS, and you are welcome to explore those. But the script is written to deal with several RDBMSs, and it does so in as transactionally safe a way as possible. The assumption is that you are running this against a live database that is being queried concurrently.
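
The compare-before-UPDATE wrapper discussed in this thread can be sketched roughly as follows. This is only a minimal illustration in Python against SQLite; it is not Martin's actual wrapper and not part of load_ncbi_taxonomy.pl. The function name update_changed and its parameters are hypothetical, and the table and column names are assumed to come from trusted code because they are interpolated into the SQL text.

```python
import sqlite3


def update_changed(conn, table, key_col, key, new_values):
    """Write only the columns whose stored value differs from new_values.

    Hypothetical helper: table, key_col and the keys of new_values are
    interpolated into the SQL, so they must be trusted identifiers.
    Returns the list of column names that were actually updated.
    """
    cols = list(new_values)
    # One extra read per row: this is the SELECT cost mentioned in the thread.
    row = conn.execute(
        f"SELECT {', '.join(cols)} FROM {table} WHERE {key_col} = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    # Keep only the columns whose current value differs from the new one.
    changed = [c for c, old in zip(cols, row) if old != new_values[c]]
    if changed:
        assignments = ", ".join(f"{c} = ?" for c in changed)
        conn.execute(
            f"UPDATE {table} SET {assignments} WHERE {key_col} = ?",
            [new_values[c] for c in changed] + [key],
        )
    return changed
```

For a row that already holds left_value 4 and right_value 5, calling update_changed(conn, "taxon", "taxon_id", 12335, {"left_value": 4, "right_value": 7}) would issue an UPDATE for right_value only; calling it again with the same values would issue no UPDATE at all. Whether the saved writes outweigh the extra reads depends on how many rows really change, which is the trade-off raised above.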
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sat Apr 12 18:10:44 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 12 Apr 2008 14:10:44 -0400 Subject: [BioSQL-l] personal vs list email Message-ID: I'm not sure why but I have received several Bioperl or BioSQL- related email inquiries directed to me *personally* over the past few weeks. I have been responding as I get to them, but I feel that I am doing both the senders and this community a poor service, because sometimes someone else on the list could have responded much faster, and when I respond, others on the list who happen to be interested in the same question don't get to see the answer. So from now on as a policy I will redirect *every* email sent to me personally and that asks a question related to one of the projects to the respective mailing list. If you don't want this, please conspicuously say so at the top of your email, and in that case if you do ask a project-related question be prepared to wait and to possibly needing to follow up. As an aside, it's a pretty safe assumption to make that all other core developers, and quite possibly *all* developers are following a similar policy, whether expressly or not. Isn't this somewhere in the FAQ too? -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Sat Apr 12 20:17:43 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Sat, 12 Apr 2008 15:17:43 -0500 Subject: [BioSQL-l] personal vs list email In-Reply-To: References: Message-ID: On Apr 12, 2008, at 1:10 PM, Hilmar Lapp wrote: > I'm not sure why but I have received several Bioperl or BioSQL- > related email inquiries directed to me *personally* over the past > few weeks. 
>
> I have been responding as I get to them, but I feel that I am doing
> both the senders and this community a poor service, because
> sometimes someone else on the list could have responded much faster,
> and when I respond, others on the list who happen to be interested
> in the same question don't get to see the answer.
>
> So from now on as a policy I will redirect *every* email sent to me
> personally and that asks a question related to one of the projects
> to the respective mailing list. If you don't want this, please
> conspicuously say so at the top of your email, and in that case if
> you do ask a project-related question be prepared to wait and to
> possibly needing to follow up.
>
> As an aside, it's a pretty safe assumption to make that all other
> core developers, and quite possibly *all* developers are following a
> similar policy, whether expressly or not.

I agree; I'm sure several other core devs feel the same way. I always try to forward these to the list if I feel it is more relevant there.

> Isn't this somewhere in the FAQ too?
>
> -hilmar

No, but I've added it to the bioperl FAQ; might be worth checking over and editing.

chris

From aaron.j.mackey at gsk.com Mon Apr 14 13:00:52 2008
From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com)
Date: Mon, 14 Apr 2008 09:00:52 -0400
Subject: [BioSQL-l] [Bioperl-l] personal vs list email
In-Reply-To:
Message-ID:

I try to take it even one step further: I require the person to re-ask their question on the mailing list (and then try to answer it there). This has the added benefit of causing the person to pause a moment to reflect on their question, and (sometimes) to spend a bit more time preparing the question for broader public consumption.
-Aaron From biopython at maubp.freeserve.co.uk Wed Apr 23 09:04:33 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Apr 2008 10:04:33 +0100 Subject: [BioSQL-l] BioSQL script to update taxon table left/right values Message-ID: <320fb6e00804230204w5161e6afg143a66a3cbb8aa66@mail.gmail.com> Dear list, In addition to loading the NCBI taxonomy, the load_ncbi_taxonomy.pl script also recalculates the left/right values. Is there a separate BioSQL script which ONLY recalculates the left/right values? I was asked this by a Biopython user. Possible use-cases include people using a non-NCBI taxonomy. Peter From hlapp at gmx.net Wed Apr 23 14:14:19 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 23 Apr 2008 10:14:19 -0400 Subject: [BioSQL-l] BioSQL script to update taxon table left/right values In-Reply-To: <320fb6e00804230204w5161e6afg143a66a3cbb8aa66@mail.gmail.com> References: <320fb6e00804230204w5161e6afg143a66a3cbb8aa66@mail.gmail.com> Message-ID: <57EF37D9-295A-495A-ADFD-E13722DBD642@gmx.net> No there isn't but it's a good idea. Would you mind posting it as a BioSQL bug/feature request? -hilmar On Apr 23, 2008, at 5:04 AM, Peter wrote: > Dear list, > > In addition to loading the NCBI taxonomy, the load_ncbi_taxonomy.pl > script also recalculates the left/right values. > > Is there a separate BioSQL script which ONLY recalculates the left/ > right values? > > I was asked this by a Biopython user. Possible use-cases include > people using a non-NCBI taxonomy. 
> > Peter > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Wed Apr 23 15:41:00 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Apr 2008 16:41:00 +0100 Subject: [BioSQL-l] BioSQL script to update taxon table left/right values In-Reply-To: <57EF37D9-295A-495A-ADFD-E13722DBD642@gmx.net> References: <320fb6e00804230204w5161e6afg143a66a3cbb8aa66@mail.gmail.com> <57EF37D9-295A-495A-ADFD-E13722DBD642@gmx.net> Message-ID: <320fb6e00804230841p188b4897q6c7f08dcf1ead552@mail.gmail.com> On Wed, Apr 23, 2008 at 3:14 PM, Hilmar Lapp wrote: > No there isn't but it's a good idea. Would you mind posting it as a BioSQL > bug/feature request? Sure, I've filed an enhancement request: Bug 2493 - New script to recalculate left/right values in the taxon table http://bugzilla.open-bio.org/show_bug.cgi?id=2493 Peter From mmokrejs at ribosome.natur.cuni.cz Wed Apr 23 15:58:18 2008 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Wed, 23 Apr 2008 17:58:18 +0200 Subject: [BioSQL-l] BioSQL script to update taxon table left/right values In-Reply-To: <320fb6e00804230841p188b4897q6c7f08dcf1ead552@mail.gmail.com> References: <320fb6e00804230204w5161e6afg143a66a3cbb8aa66@mail.gmail.com> <57EF37D9-295A-495A-ADFD-E13722DBD642@gmx.net> <320fb6e00804230841p188b4897q6c7f08dcf1ead552@mail.gmail.com> Message-ID: <480F5C9A.60007@ribosome.natur.cuni.cz> I would just propose to make the current script more modular and provide a command-line option argument which would just establish the database handler and update the fields and close $dbh. M. 
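
A standalone recalculation of the left/right values, as requested in this thread, could look roughly like the sketch below. It is a minimal illustration in Python against SQLite and not the logic of load_ncbi_taxonomy.pl. It assumes a simplified taxon table in which parent_taxon_id refers to taxon_id and a root has either a NULL parent or itself as parent; the function name rebuild_nested_set is hypothetical.

```python
import sqlite3
from collections import defaultdict


def rebuild_nested_set(conn):
    """Recompute left_value/right_value for all rows of a simplified taxon table.

    Assumption: parent_taxon_id references taxon_id, and a root row has a
    NULL parent or is its own parent. Returns {taxon_id: [left, right]}.
    """
    rows = conn.execute(
        "SELECT taxon_id, parent_taxon_id FROM taxon ORDER BY taxon_id"
    ).fetchall()
    children = defaultdict(list)
    roots = []
    for taxon_id, parent_id in rows:
        if parent_id is None or parent_id == taxon_id:
            roots.append(taxon_id)
        else:
            children[parent_id].append(taxon_id)

    values = {}
    counter = 0
    for root in roots:
        # Iterative depth-first walk; avoids recursion limits on deep trees.
        stack = [(root, False)]
        while stack:
            node, leaving = stack.pop()
            counter += 1
            if leaving:
                values[node][1] = counter      # right value on the way out
                continue
            values[node] = [counter, None]     # left value on the way in
            stack.append((node, True))
            for child in reversed(children[node]):
                stack.append((child, False))

    # Clear and rewrite in one transaction so readers never see a half-built tree.
    with conn:
        conn.execute("UPDATE taxon SET left_value = NULL, right_value = NULL")
        conn.executemany(
            "UPDATE taxon SET left_value = ?, right_value = ? WHERE taxon_id = ?",
            [(left, right, tid) for tid, (left, right) in values.items()],
        )
    return values
```

Once the values are in place, all descendants of a node can be selected with a range condition on left_value between that node's left_value and right_value, which is the point of the nested set encoding. Rows whose parent is missing from the table are not reached by the walk and keep NULL values in this sketch.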
Peter wrote:
> On Wed, Apr 23, 2008 at 3:14 PM, Hilmar Lapp wrote:
>> No there isn't but it's a good idea. Would you mind posting it as a BioSQL
>> bug/feature request?
>
> Sure, I've filed an enhancement request:
>
> Bug 2493 - New script to recalculate left/right values in the taxon table
> http://bugzilla.open-bio.org/show_bug.cgi?id=2493
>
> Peter
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

From darin.london at duke.edu Tue Apr 29 16:52:54 2008
From: darin.london at duke.edu (darin.london at duke.edu)
Date: Tue, 29 Apr 2008 12:52:54 -0400
Subject: [BioSQL-l] BOSC 2008 Announcement and Call For Submissions
Message-ID: <200804291653.m3TGqsF5020841@tenero.duhs.duke.edu>

BOSC 2008 Call for Abstracts Reminder

The 9th annual Bioinformatics Open Source Conference (BOSC 2008) will take place in Toronto, Ontario, Canada, as one of several Special Interest Group (SIG) meetings occurring in conjunction with the 16th annual Intelligent Systems for Molecular Biology Conference (ISMB 2008). This is a reminder to submit your proposals for talks to the BOSC submission system before May 11.

Submission Process:

All abstracts must be submitted through our Open Conference Systems site (http://events.open-bio.org/BOSC2008/openconf.php). The form will ask for a small Abstract Text to be pasted into it, and a full paper. The small Abstract text should be a summary, while the longer abstract should provide more details, including the open-source license requirement details. Full-length abstracts are limited to one page with one inch (2.5 cm) margins on the top, sides, and bottom. The full-length abstract should include the title, authors, and affiliations. We prefer your abstract to be in PDF format, although plain t

Important Dates:

May 11: Abstract submission deadline.
June 2: Notification of accepted talks.
June 4: Early registration discount cut-off.
July 18-19: BOSC 2008!

We hope to see you at BOSC 2008!

Kam Dahlquist and Darin London
BOSC 2008 Co-organizers