[Bioperl-l] some notes on entrezgene parser performance

Stefan Kirov skirov at utk.edu
Mon Jun 13 15:13:19 EDT 2005


This is a brief comparison of locuslink (March download) vs. entrezgene. 
While it is not comprehensive, it gives some indication of how the 
entrezgene parser performs.
The procedure: both the locuslink and entrezgene files are parsed and 
loaded into a relational database (genereg.ornl.gov/gkdb), and then the 
number of records is compared between the locuslink and entrezgene 
tables. The locuslink parser has been used for years and I know it works 
well. So if there is a nasty bug in the entrezgene parser that causes it 
to miss some of the data, I would expect to see a decrease in the number 
of records in entrezgene vs. locuslink (some fluctuation is normal).
Here are two reports, for Gene Ontology and RefSeq; if you are 
interested, please ask and I will send additional reports. I think the 
parser is functioning normally so far. The only significant deviation is 
the GO data for Drosophila, but that is because the entrezgene file 
simply does not contain this data. This is quite a restricted approach, 
so I can't guarantee the parser is not missing something. Please let me 
know if you notice weird behavior.
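
For anyone who wants to run a similar check, the per-organism count 
comparison described above could be sketched roughly as follows. This is 
only an illustration, not the script that produced the reports: the 
table names (ll_go_demo, eg_go_demo), the "organism" column, and the toy 
rows are all hypothetical, and an in-memory SQLite database stands in 
for the real one.

```python
import sqlite3


def count_differences(conn, old_table, new_table):
    """Return {organism: new_count - old_count} for two tables.

    Both tables are assumed (hypothetically) to have an 'organism' column.
    Table names cannot be bound as SQL parameters, so pass trusted names only.
    """
    def counts(table):
        cursor = conn.execute(
            f"SELECT organism, COUNT(*) FROM {table} GROUP BY organism"
        )
        return dict(cursor.fetchall())

    old, new = counts(old_table), counts(new_table)
    # An organism missing from one table counts as zero records there.
    return {
        org: new.get(org, 0) - old.get(org, 0)
        for org in sorted(set(old) | set(new))
    }


def build_demo_db():
    """Create an in-memory database with made-up rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    for table in ("ll_go_demo", "eg_go_demo"):
        conn.execute(
            f"CREATE TABLE {table} (organism TEXT, gene_id INTEGER, go_id TEXT)"
        )
    conn.executemany(
        "INSERT INTO ll_go_demo VALUES (?, ?, ?)",
        [
            ("Danio rerio", 1, "GO:0000001"),
            ("Drosophila melanogaster", 2, "GO:0000002"),
            ("Drosophila melanogaster", 3, "GO:0000003"),
        ],
    )
    conn.executemany(
        "INSERT INTO eg_go_demo VALUES (?, ?, ?)",
        [
            ("Danio rerio", 1, "GO:0000001"),
            ("Danio rerio", 4, "GO:0000004"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    for organism, diff in count_differences(
        conn, "ll_go_demo", "eg_go_demo"
    ).items():
        print(f"For organism {organism}, the new table had {diff} "
              f"records more (less) than the old one")
```

A negative difference means the newer table has fewer records, which is 
the signal the comparison looks for.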
Stefan


REFSEQ report:
Record-count differences relative to locuslink (a negative value means 
fewer records than locuslink).

Organism                        entrez gene  ll_refseq_nm
Apis mellifera                           18            14
Bos taurus                            37113           362
Caenorhabditis elegans                 -799          -383
Canis familiaris                         42            45
Danio rerio                            2208           305
Drosophila melanogaster               -8814          1249
Gallus gallus                            16           192
Homo sapiens                         107382           151
Human immunodeficiency virus 1           -9           -24
Mus musculus                          57427            49
Pan troglodytes                          19            18
Rattus norvegicus                     25821           681
Strongylocentrotus purpuratus           -29            -8
Sus scrofa                               12             3
Takifugu rubripes                      -238            -7
Xenopus tropicalis                    -2612         -2403

GeneOntology report:
Record-count differences relative to locuslink (a negative value means 
fewer records than locuslink).

Organism                        entrez gene         ll_go
Danio rerio                            2208          4291
Drosophila melanogaster               -8814        -44725
Homo sapiens                         107382          6716
Mus musculus                          57427          2972
Rattus norvegicus                     25821          4927


