Brian Silverman | 32ed54e | 2018-08-04 23:37:28 -0700 | [diff] [blame^] | 1 | [section:binomial_dist Binomial Distribution] |
| 2 | |
| 3 | ``#include <boost/math/distributions/binomial.hpp>`` |
| 4 | |
| 5 | namespace boost{ namespace math{ |
| 6 | |
| 7 | template <class RealType = double, |
| 8 | class ``__Policy`` = ``__policy_class`` > |
| 9 | class binomial_distribution; |
| 10 | |
| 11 | typedef binomial_distribution<> binomial; |
| 12 | |
| 13 | template <class RealType, class ``__Policy``> |
| 14 | class binomial_distribution |
| 15 | { |
| 16 | public: |
| 17 | typedef RealType value_type; |
| 18 | typedef Policy policy_type; |
| 19 | |
| 20 | static const ``['unspecified-type]`` clopper_pearson_exact_interval; |
| 21 | static const ``['unspecified-type]`` jeffreys_prior_interval; |
| 22 | |
| 23 | // construct: |
| 24 | binomial_distribution(RealType n, RealType p); |
| 25 | |
| 26 | // parameter access:: |
| 27 | RealType success_fraction() const; |
| 28 | RealType trials() const; |
| 29 | |
| 30 | // Bounds on success fraction: |
| 31 | static RealType find_lower_bound_on_p( |
| 32 | RealType trials, |
| 33 | RealType successes, |
| 34 | RealType probability, |
| 35 | ``['unspecified-type]`` method = clopper_pearson_exact_interval); |
| 36 | static RealType find_upper_bound_on_p( |
| 37 | RealType trials, |
| 38 | RealType successes, |
| 39 | RealType probability, |
| 40 | ``['unspecified-type]`` method = clopper_pearson_exact_interval); |
| 41 | |
| 42 | // estimate min/max number of trials: |
| 43 | static RealType find_minimum_number_of_trials( |
| 44 | RealType k, // number of events |
| 45 | RealType p, // success fraction |
| 46 | RealType alpha); // risk level |
| 47 | |
| 48 | static RealType find_maximum_number_of_trials( |
| 49 | RealType k, // number of events |
| 50 | RealType p, // success fraction |
| 51 | RealType alpha); // risk level |
| 52 | }; |
| 53 | |
| 54 | }} // namespaces |
| 55 | |
| 56 | The class type `binomial_distribution` represents a |
| 57 | [@http://mathworld.wolfram.com/BinomialDistribution.html binomial distribution]: |
| 58 | it is used when there are exactly two mutually |
| 59 | exclusive outcomes of a trial. These outcomes are labelled |
| 60 | "success" and "failure". The |
| 61 | __binomial_distrib is used to obtain |
| 62 | the probability of observing k successes in N trials, with the |
| 63 | probability of success on a single trial denoted by p. The |
| 64 | binomial distribution assumes that p is fixed for all trials. |
| 65 | |
| 66 | [note The random variable for the binomial distribution is the number of successes, |
| 67 | (the number of trials is a fixed property of the distribution) |
| 68 | whereas for the negative binomial, |
| 69 | the random variable is the number of trials, for a fixed number of successes.] |
| 70 | |
| 71 | The PDF for the binomial distribution is given by: |
| 72 | |
| 73 | [equation binomial_ref2] |
| 74 | |
| 75 | The following two graphs illustrate how the PDF changes depending |
| 76 | upon the distributions parameters, first we'll keep the success |
| 77 | fraction /p/ fixed at 0.5, and vary the sample size: |
| 78 | |
| 79 | [graph binomial_pdf_1] |
| 80 | |
| 81 | Alternatively, we can keep the sample size fixed at N=20 and |
| 82 | vary the success fraction /p/: |
| 83 | |
| 84 | [graph binomial_pdf_2] |
| 85 | |
| 86 | [discrete_quantile_warning Binomial] |
| 87 | |
| 88 | [h4 Member Functions] |
| 89 | |
| 90 | [h5 Construct] |
| 91 | |
| 92 | binomial_distribution(RealType n, RealType p); |
| 93 | |
| 94 | Constructor: /n/ is the total number of trials, /p/ is the |
| 95 | probability of success of a single trial. |
| 96 | |
| 97 | Requires `0 <= p <= 1`, and `n >= 0`, otherwise calls __domain_error. |
| 98 | |
| 99 | [h5 Accessors] |
| 100 | |
| 101 | RealType success_fraction() const; |
| 102 | |
| 103 | Returns the parameter /p/ from which this distribution was constructed. |
| 104 | |
| 105 | RealType trials() const; |
| 106 | |
| 107 | Returns the parameter /n/ from which this distribution was constructed. |
| 108 | |
| 109 | [h5 Lower Bound on the Success Fraction] |
| 110 | |
| 111 | static RealType find_lower_bound_on_p( |
| 112 | RealType trials, |
| 113 | RealType successes, |
| 114 | RealType alpha, |
| 115 | ``['unspecified-type]`` method = clopper_pearson_exact_interval); |
| 116 | |
| 117 | Returns a lower bound on the success fraction: |
| 118 | |
| 119 | [variablelist |
| 120 | [[trials][The total number of trials conducted.]] |
| 121 | [[successes][The number of successes that occurred.]] |
| 122 | [[alpha][The largest acceptable probability that the true value of |
| 123 | the success fraction is [*less than] the value returned.]] |
| 124 | [[method][An optional parameter that specifies the method to be used |
| 125 | to compute the interval (See below).]] |
| 126 | ] |
| 127 | |
| 128 | For example, if you observe /k/ successes from /n/ trials the |
| 129 | best estimate for the success fraction is simply ['k/n], but if you |
| 130 | want to be 95% sure that the true value is [*greater than] some value, |
| 131 | ['p[sub min]], then: |
| 132 | |
| 133 | p``[sub min]`` = binomial_distribution<RealType>::find_lower_bound_on_p( |
| 134 | n, k, 0.05); |
| 135 | |
| 136 | [link math_toolkit.stat_tut.weg.binom_eg.binom_conf See worked example.] |
| 137 | |
| 138 | There are currently two possible values available for the /method/ |
| 139 | optional parameter: /clopper_pearson_exact_interval/ |
| 140 | or /jeffreys_prior_interval/. These constants are both members of |
| 141 | class template `binomial_distribution`, so usage is for example: |
| 142 | |
| 143 | p = binomial_distribution<RealType>::find_lower_bound_on_p( |
| 144 | n, k, 0.05, binomial_distribution<RealType>::jeffreys_prior_interval); |
| 145 | |
| 146 | The default method if this parameter is not specified is the Clopper Pearson |
| 147 | "exact" interval. This produces an interval that guarantees at least |
| 148 | `100(1-alpha)%` coverage, but which is known to be overly conservative, |
| 149 | sometimes producing intervals with much greater than the requested coverage. |
| 150 | |
| 151 | The alternative calculation method produces a non-informative |
| 152 | Jeffreys Prior interval. It produces `100(1-alpha)%` coverage only |
| 153 | ['in the average case], though is typically very close to the requested |
| 154 | coverage level. It is one of the main methods of calculation recommended |
| 155 | in the review by Brown, Cai and DasGupta. |
| 156 | |
| 157 | Please note that the "textbook" calculation method using |
| 158 | a normal approximation (the Wald interval) is deliberately |
| 159 | not provided: it is known to produce consistently poor results, |
| 160 | even when the sample size is surprisingly large. |
| 161 | Refer to Brown, Cai and DasGupta for a full explanation. Many other methods |
| 162 | of calculation are available, and may be more appropriate for specific |
| 163 | situations. Unfortunately there appears to be no consensus amongst |
| 164 | statisticians as to which is "best": refer to the discussion at the end of |
| 165 | Brown, Cai and DasGupta for examples. |
| 166 | |
| 167 | The two methods provided here were chosen principally because they |
| 168 | can be used for both one and two sided intervals. |
| 169 | See also: |
| 170 | |
| 171 | Lawrence D. Brown, T. Tony Cai and Anirban DasGupta (2001), |
| 172 | Interval Estimation for a Binomial Proportion, |
| 173 | Statistical Science, Vol. 16, No. 2, 101-133. |
| 174 | |
| 175 | T. Tony Cai (2005), |
| 176 | One-sided confidence intervals in discrete distributions, |
| 177 | Journal of Statistical Planning and Inference 131, 63-88. |
| 178 | |
| 179 | Agresti, A. and Coull, B. A. (1998). Approximate is better than |
| 180 | "exact" for interval estimation of binomial proportions. Amer. |
| 181 | Statist. 52 119-126. |
| 182 | |
| 183 | Clopper, C. J. and Pearson, E. S. (1934). The use of confidence |
| 184 | or fiducial limits illustrated in the case of the binomial. |
| 185 | Biometrika 26 404-413. |
| 186 | |
| 187 | [h5 Upper Bound on the Success Fraction] |
| 188 | |
| 189 | static RealType find_upper_bound_on_p( |
| 190 | RealType trials, |
| 191 | RealType successes, |
| 192 | RealType alpha, |
| 193 | ``['unspecified-type]`` method = clopper_pearson_exact_interval); |
| 194 | |
| 195 | Returns an upper bound on the success fraction: |
| 196 | |
| 197 | [variablelist |
| 198 | [[trials][The total number of trials conducted.]] |
| 199 | [[successes][The number of successes that occurred.]] |
| 200 | [[alpha][The largest acceptable probability that the true value of |
| 201 | the success fraction is [*greater than] the value returned.]] |
| 202 | [[method][An optional parameter that specifies the method to be used |
| 203 | to compute the interval. Refer to the documentation for |
| 204 | `find_upper_bound_on_p` above for the meaning of the |
| 205 | method options.]] |
| 206 | ] |
| 207 | |
| 208 | For example, if you observe /k/ successes from /n/ trials the |
| 209 | best estimate for the success fraction is simply ['k/n], but if you |
| 210 | want to be 95% sure that the true value is [*less than] some value, |
| 211 | ['p[sub max]], then: |
| 212 | |
| 213 | p``[sub max]`` = binomial_distribution<RealType>::find_upper_bound_on_p( |
| 214 | n, k, 0.05); |
| 215 | |
| 216 | [link math_toolkit.stat_tut.weg.binom_eg.binom_conf See worked example.] |
| 217 | |
| 218 | [note |
| 219 | In order to obtain a two sided bound on the success fraction, you |
| 220 | call both `find_lower_bound_on_p` *and* `find_upper_bound_on_p` |
| 221 | each with the same arguments. |
| 222 | |
| 223 | If the desired risk level |
| 224 | that the true success fraction lies outside the bounds is [alpha], |
| 225 | then you pass [alpha]/2 to these functions. |
| 226 | |
| 227 | So for example a two sided 95% confidence interval would be obtained |
| 228 | by passing [alpha] = 0.025 to each of the functions. |
| 229 | |
| 230 | [link math_toolkit.stat_tut.weg.binom_eg.binom_conf See worked example.] |
| 231 | ] |
| 232 | |
| 233 | |
| 234 | [h5 Estimating the Number of Trials Required for a Certain Number of Successes] |
| 235 | |
| 236 | static RealType find_minimum_number_of_trials( |
| 237 | RealType k, // number of events |
| 238 | RealType p, // success fraction |
| 239 | RealType alpha); // probability threshold |
| 240 | |
| 241 | This function estimates the minimum number of trials required to ensure that |
| 242 | more than k events is observed with a level of risk /alpha/ that k or |
| 243 | fewer events occur. |
| 244 | |
| 245 | [variablelist |
| 246 | [[k][The number of success observed.]] |
| 247 | [[p][The probability of success for each trial.]] |
| 248 | [[alpha][The maximum acceptable probability that k events or fewer will be observed.]] |
| 249 | ] |
| 250 | |
| 251 | For example: |
| 252 | |
| 253 | binomial_distribution<RealType>::find_number_of_trials(10, 0.5, 0.05); |
| 254 | |
| 255 | Returns the smallest number of trials we must conduct to be 95% sure |
| 256 | of seeing 10 events that occur with frequency one half. |
| 257 | |
| 258 | [h5 Estimating the Maximum Number of Trials to Ensure no more than a Certain Number of Successes] |
| 259 | |
| 260 | static RealType find_maximum_number_of_trials( |
| 261 | RealType k, // number of events |
| 262 | RealType p, // success fraction |
| 263 | RealType alpha); // probability threshold |
| 264 | |
| 265 | This function estimates the maximum number of trials we can conduct |
| 266 | to ensure that k successes or fewer are observed, with a risk /alpha/ |
| 267 | that more than k occur. |
| 268 | |
| 269 | [variablelist |
| 270 | [[k][The number of success observed.]] |
| 271 | [[p][The probability of success for each trial.]] |
| 272 | [[alpha][The maximum acceptable probability that more than k events will be observed.]] |
| 273 | ] |
| 274 | |
| 275 | For example: |
| 276 | |
| 277 | binomial_distribution<RealType>::find_maximum_number_of_trials(0, 1e-6, 0.05); |
| 278 | |
| 279 | Returns the largest number of trials we can conduct and still be 95% certain |
| 280 | of not observing any events that occur with one in a million frequency. |
| 281 | This is typically used in failure analysis. |
| 282 | |
| 283 | [link math_toolkit.stat_tut.weg.binom_eg.binom_size_eg See Worked Example.] |
| 284 | |
| 285 | [h4 Non-member Accessors] |
| 286 | |
| 287 | All the [link math_toolkit.dist_ref.nmp usual non-member accessor functions] |
| 288 | that are generic to all distributions are supported: __usual_accessors. |
| 289 | |
| 290 | The domain for the random variable /k/ is `0 <= k <= N`, otherwise a |
| 291 | __domain_error is returned. |
| 292 | |
| 293 | It's worth taking a moment to define what these accessors actually mean in |
| 294 | the context of this distribution: |
| 295 | |
| 296 | [table Meaning of the non-member accessors |
| 297 | [[Function][Meaning]] |
| 298 | [[__pdf] |
| 299 | [The probability of obtaining [*exactly k successes] from n trials |
| 300 | with success fraction p. For example: |
| 301 | |
| 302 | `pdf(binomial(n, p), k)`]] |
| 303 | [[__cdf] |
| 304 | [The probability of obtaining [*k successes or fewer] from n trials |
| 305 | with success fraction p. For example: |
| 306 | |
| 307 | `cdf(binomial(n, p), k)`]] |
| 308 | [[__ccdf] |
| 309 | [The probability of obtaining [*more than k successes] from n trials |
| 310 | with success fraction p. For example: |
| 311 | |
| 312 | `cdf(complement(binomial(n, p), k))`]] |
| 313 | [[__quantile] |
| 314 | [The [*greatest] number of successes that may be observed from n trials |
| 315 | with success fraction p, at probability P. Note that the value returned |
| 316 | is a real-number, and not an integer. Depending on the use case you may |
| 317 | want to take either the floor or ceiling of the result. For example: |
| 318 | |
| 319 | `quantile(binomial(n, p), P)`]] |
| 320 | [[__quantile_c] |
| 321 | [The [*smallest] number of successes that may be observed from n trials |
| 322 | with success fraction p, at probability P. Note that the value returned |
| 323 | is a real-number, and not an integer. Depending on the use case you may |
| 324 | want to take either the floor or ceiling of the result. For example: |
| 325 | |
| 326 | `quantile(complement(binomial(n, p), P))`]] |
| 327 | ] |
| 328 | |
| 329 | [h4 Examples] |
| 330 | |
| 331 | Various [link math_toolkit.stat_tut.weg.binom_eg worked examples] |
| 332 | are available illustrating the use of the binomial distribution. |
| 333 | |
| 334 | [h4 Accuracy] |
| 335 | |
| 336 | This distribution is implemented using the |
| 337 | incomplete beta functions __ibeta and __ibetac, |
| 338 | please refer to these functions for information on accuracy. |
| 339 | |
| 340 | [h4 Implementation] |
| 341 | |
| 342 | In the following table /p/ is the probability that one trial will |
| 343 | be successful (the success fraction), /n/ is the number of trials, |
| 344 | /k/ is the number of successes, /p/ is the probability and /q = 1-p/. |
| 345 | |
| 346 | [table |
| 347 | [[Function][Implementation Notes]] |
| 348 | [[pdf][Implementation is in terms of __ibeta_derivative: if [sub n]C[sub k ] is the binomial |
| 349 | coefficient of a and b, then we have: |
| 350 | |
| 351 | [equation binomial_ref1] |
| 352 | |
| 353 | Which can be evaluated as `ibeta_derivative(k+1, n-k+1, p) / (n+1)` |
| 354 | |
| 355 | The function __ibeta_derivative is used here, since it has already |
| 356 | been optimised for the lowest possible error - indeed this is really |
| 357 | just a thin wrapper around part of the internals of the incomplete |
| 358 | beta function. |
| 359 | |
| 360 | There are also various special cases: refer to the code for details. |
| 361 | ]] |
| 362 | [[cdf][Using the relation: |
| 363 | |
| 364 | `` |
| 365 | p = I[sub 1-p](n - k, k + 1) |
| 366 | = 1 - I[sub p](k + 1, n - k) |
| 367 | = __ibetac(k + 1, n - k, p)`` |
| 368 | |
| 369 | There are also various special cases: refer to the code for details. |
| 370 | ]] |
| 371 | [[cdf complement][Using the relation: q = __ibeta(k + 1, n - k, p) |
| 372 | |
| 373 | There are also various special cases: refer to the code for details. ]] |
| 374 | [[quantile][Since the cdf is non-linear in variate /k/ none of the inverse |
| 375 | incomplete beta functions can be used here. Instead the quantile |
| 376 | is found numerically using a derivative free method |
| 377 | (__root_finding_TOMS748).]] |
| 378 | [[quantile from the complement][Found numerically as above.]] |
| 379 | [[mean][ `p * n` ]] |
| 380 | [[variance][ `p * n * (1-p)` ]] |
| 381 | [[mode][`floor(p * (n + 1))`]] |
| 382 | [[skewness][`(1 - 2 * p) / sqrt(n * p * (1 - p))`]] |
| 383 | [[kurtosis][`3 - (6 / n) + (1 / (n * p * (1 - p)))`]] |
| 384 | [[kurtosis excess][`(1 - 6 * p * q) / (n * p * q)`]] |
| 385 | [[parameter estimation][The member functions `find_upper_bound_on_p` |
| 386 | `find_lower_bound_on_p` and `find_number_of_trials` are |
| 387 | implemented in terms of the inverse incomplete beta functions |
| 388 | __ibetac_inv, __ibeta_inv, and __ibetac_invb respectively]] |
| 389 | ] |
| 390 | |
| 391 | [h4 References] |
| 392 | |
| 393 | * [@http://mathworld.wolfram.com/BinomialDistribution.html Weisstein, Eric W. "Binomial Distribution." From MathWorld--A Wolfram Web Resource]. |
| 394 | * [@http://en.wikipedia.org/wiki/Beta_distribution Wikipedia binomial distribution]. |
| 395 | * [@http://www.itl.nist.gov/div898/handbook/eda/section3/eda366i.htm NIST Explorary Data Analysis]. |
| 396 | |
| 397 | [endsect] [/section:binomial_dist Binomial] |
| 398 | |
| 399 | [/ binomial.qbk |
| 400 | Copyright 2006 John Maddock and Paul A. Bristow. |
| 401 | Distributed under the Boost Software License, Version 1.0. |
| 402 | (See accompanying file LICENSE_1_0.txt or copy at |
| 403 | http://www.boost.org/LICENSE_1_0.txt). |
| 404 | ] |