[Biopython-dev] [Bug 2629] Updated Bio.NaiveBayes to listfns import

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Wed Nov 5 16:51:07 UTC 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2629





------- Comment #11 from bsouthey at gmail.com  2008-11-05 11:51 EST -------
(In reply to comment #10)
> See
> http://coreygoldberg.blogspot.com/2008/07/python-counting-items-in-list.html
> for some timings of this operation. I think Bruce's approach is most suitable,
> except for the dict update method; I would use
>         content_freqs[cval] = content_freqs.get(cval,0)+p_contents
> instead. Depending on the contents of the list, sometimes it runs even faster
> than the implementation in listfns.
> > 
> > Given the possible rounding issues, does doing the rescaling (dividing by the
> > number of elements) at the start make a big time saving (over dividing each
> > total at the end)?  I would feel happier with the division at the end (as done
> > in the listfns code).
> > 
> I think the rescaling at the start is a good thing. If the list contains many
> different objects, rescaling at the end can take a long time. Probably that is
> not the typical use case here, but on the other hand I don't see a good reason
> not to save time here.
> 
> Maybe just my nitpicking, but I think the get_content_freq function will be
> more readable if we use different variable names inside this function.
> 

(In reply to comment #10)
> See
> http://coreygoldberg.blogspot.com/2008/07/python-counting-items-in-list.html
> for some timings of this operation. I think Bruce's approach is most suitable,
> except for the dict update method; I would use
>         content_freqs[cval] = content_freqs.get(cval,0)+p_contents
> instead. Depending on the contents of the list, sometimes it runs even faster
> than the implementation in listfns.

Basically the goal is find the frequency of each class and store it in a
dictionary with the keys being each class and the value being the frequency. So
you could count up all observations in each class (essentially a adding one to
the appropriate class sum) and then divide each count by the total number of
observations - as implemented in the dictget approach.Being more cryptic, we
can avoid the second division by adding one/number of observations instead one
to the appropriate class sum as implemented in get_content_freq.

Thanks for the link, I created a timing code for random lists.

get_content_freq is the one I put in the patch
get_content_freq2 is the modified version
ternary is based the Cory code modified to give frequencies rather than counts
dictget is using a dictionary to count then get the frequencies  
listfns.contents is the Biopython Python version without the C code import.
clistfns.contents is the direct import of Biopython module that uses C code 

My system is running 64-bit Fedora on Linux with Python 2.5.2. The number of
observation is not important (difference is very small), I used 1000000 random
integers and measured just doing it once and repeat the test 5 times with
1000000 executions and get the minimum time ie min(timeit.repeat(5, 1000000)).
Also, this function is not called that much in the NaiveBayes so these are
rather extreme cases. 

Range of ints between one and two:
get_content_freq  once: 1.90734863281e-05  best of 5: 8.11614704132
get_content_freq2 once: 8.10623168945e-06  best of 5: 4.39126110077
ternary file      once: 1.59740447998e-05  best of 5: 9.42879796028
dictget file      once: 1.4066696167e-05  best of 5: 10.468517065
listfns.contents  once: 1.28746032715e-05  best of 5: 7.50778198242
clistfns.contents once: 6.91413879395e-06  best of 5: 2.71360707283


Range of ints between one and ten:
get_content_freq  once: 1.90734863281e-05  best of 5: 7.97784090042
get_content_freq2 once: 7.15255737305e-06  best of 5: 4.21833491325
ternary file      once: 1.69277191162e-05  best of 5: 9.18815684319
dictget file      once: 1.50203704834e-05  best of 5: 10.2242910862
listfns.contents  once: 1.50203704834e-05  best of 5: 7.25569987297
clistfns.contents once: 8.10623168945e-06  best of 5: 2.6411280632

Range of ints between one and one hundred:

get_content_freq  once: 2.00271606445e-05  best of 5: 7.99760317802
get_content_freq2 once: 7.86781311035e-06  best of 5: 4.20446300507
ternary file      once: 1.71661376953e-05  best of 5: 9.26767396927
dictget file      once: 1.4066696167e-05  best of 5: 10.2449028492
listfns.contents  once: 1.4066696167e-05  best of 5: 7.34166693687
clistfns.contents once: 7.15255737305e-06  best of 5: 2.63198709488

So this not dependent on the number of classes. For the most part this numbers
are showing more system overheads than major differences between the actual
approaches. Therefore I would clearly go with Michiel's version.


> > 
> > Given the possible rounding issues, does doing the rescaling (dividing by the
> > number of elements) at the start make a big time saving (over dividing each
> > total at the end)?  I would feel happier with the division at the end (as done
> > in the listfns code).
> > 
> I think the rescaling at the start is a good thing. If the list contains many
> different objects, rescaling at the end can take a long time. Probably that is
> not the typical use case here, but on the other hand I don't see a good reason
> not to save time here.

>From the two case scenario above, the get_content_freq methods result in:
{1: 0.49978999999354606, 2: 0.50020999999354643}
and the others result in:
{1: 0.49979000000000001, 2: 0.50021000000000004}

On my 64-bit linux system the numerical error is small but within the
expectations. It may be worse on a 32-bit system or OS. I really wanted to draw
attention to this because tiny differences can be important (not to mention
people who don't understand enough about numerical precision).

> 
> Maybe just my nitpicking, but I think the get_content_freq function will be
> more readable if we use different variable names inside this function.
> 

Please rename as necessary.

Bruce


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.



More information about the Biopython-dev mailing list