TransWikia.com

Sample log geometric distribution from log probability

Cross Validated Asked on January 14, 2021

I want to sample from the geometric distribution for a very small success rate. The success rate is so small that I represent it by its log. I want the result to also be represented by its log. Is there a numerically stable way to do this?

Eg.

log(success rate) = -2000

log(sample) ~ 2000

One Answer

Let $X$ be the number of independent trials of a Bernoulli$(p)$ variable needed to observe the first success, so that $X$ can have the values $1,2,3,ldots.$ For any such value its cumulative probability function is given by

$$F_p(k) = Pr(Xle k) = 1 - (1-p)^k.$$

This is readily inverted for drawing values of $X.$ That is, given a uniform variable $U$ in $[0,1],$ the value

$$x_p(U) = lceil log_{1-p} U rceil = lceil frac{log(U)}{ log(1-p)}rceiltag{*}$$

has the same distribution as $X.$ (The brackets $lceil rceil$ denote the ceiling, or next greatest integer.)

When $p$ is tiny, $X$ is likely going to be a huge value. This is intuitively clear: when you have little chance of success, you will likely see a very long string of failures before you do eventually succeed. One rigorous demonstration goes like this: pick a huge number $Mgg 0$ that is still very tiny compared to $1/p,$ so that $pM$ is small. The chance that $X$ exceeds $M$ is

$$Pr(X gt M) = 1 - F_p(M) = left(1-p)^{1/p}right)^{pM} approx e^{-pM} approx 1 -pM approx 1.$$

In fact, when computing in double-precision floating point arithmetic, these approximations are equalities when $pM le 2^{-53}approx 10^{-16}.$ In other words, on a log (base 10) scale, you just won't see any values of $log(X)$ that are more than $16$ less than $log(1/p) = -log(p).$

When any number is huge, there's no discernible difference between it and its ceiling (the next highest integer). This permits us to drop the ceiling operation from $(*)$ and take the logarithms of both sides to obtain

$$log(x_p(U)) approx log(-log(U)) - log(-log(1-p)) approx log(-log(U)) + log(p).$$

That's the solution. Just generate uniform variables $U$ in the interval $(0,1)$ and apply this formula. (The error in the last approximation is proportional to $-p/2,$ which is truly tiny.)

This, by the way, is a reverse Gumbel distribution.

As an example, let $p = 10^{-2000}.$ Here is a histogram of a million realizations of $X$ using this solution.

Figure

Notice how in this simulation of size $10^6$ there were no values of $log X$ more than $6$ less than $-log(p)=2000.$ This is as we would expect.


The R code to run this large simulation took well under one second, attesting to its efficiency:

x <- log(-log(runif(1e6))) / log(10) + 2000

The three constants are 2000, the negative log of $p;$ 10, the base of that logarithm; and 1e6, the number of values to generate.

Answered by whuber on January 14, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP