Numpy random snippets

Jérémie Decock (www.jdhp.org)

Most comments are taken from the Numpy documentation.

Import directive

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

Tool functions

In [2]:
def plot(data, bins=30):
    """Plot a histogram of the samples and display it."""
    plt.hist(data, bins)
    plt.show()

Discrete distributions

Bernoulli distribution

In [3]:
def bernoulli(p=None, size=1):
    # A Bernoulli variable is a binomial variable with a single trial (n=1).
    return np.random.binomial(n=1, p=p, size=size)
In [4]:
bernoulli(p=0.5, size=100)
Out[4]:
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 1, 1, 0, 1])

Binomial distribution

Samples are drawn from a binomial distribution with specified parameters, $n$ trials and $p$ probability of success where $n \in \mathbb{N}$ and $p$ is in the interval $[0,1]$.

The probability mass function of the binomial distribution is

$$P(N) = \binom{n}{N}p^N(1-p)^{n-N}$$

where $n$ is the number of trials, $p$ is the probability of success, and $N$ is the number of successes.

When estimating the standard error of a proportion in a population from a random sample, the normal distribution works well unless the product $pn \le 5$, where $p$ is the estimated population proportion and $n$ is the number of samples; in that case the binomial distribution is used instead. For example, a sample of 15 people shows 4 who are left-handed and 11 who are right-handed. Then $p = 4/15 \approx 0.27$ and $0.27 \times 15 = 4 \le 5$, so the binomial distribution should be used in this case.
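
A quick arithmetic check of this rule of thumb (a sketch reproducing the numbers of the example above):

n = 15          # sample size
p = 4 / 15      # estimated proportion of left-handed people
print(p * n)    # 4.0 <= 5, so the binomial distribution should be used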

In [5]:
np.random.binomial(n=10, p=0.5, size=100)
Out[5]:
array([6, 9, 5, 6, 6, 4, 6, 5, 5, 5, 6, 6, 6, 3, 4, 3, 7, 7, 7, 5, 7, 7, 5,
       3, 7, 4, 9, 3, 5, 4, 3, 6, 5, 6, 3, 4, 4, 6, 6, 6, 8, 6, 6, 6, 4, 6,
       5, 7, 4, 6, 6, 3, 3, 6, 4, 4, 3, 6, 5, 6, 5, 3, 5, 5, 4, 5, 7, 5, 8,
       5, 4, 4, 2, 4, 5, 8, 8, 5, 3, 8, 4, 6, 6, 5, 7, 4, 6, 8, 6, 7, 7, 6,
       6, 5, 6, 8, 7, 5, 9, 4])
In [6]:
data = np.random.binomial(n=10, p=0.25, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 2.4974
std: 1.36864649928
In [7]:
data = np.random.binomial(n=10, p=0.5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 5.0121
std: 1.58270451759
In [8]:
data = np.random.binomial(n=10, p=0.75, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 7.512
std: 1.3764650377
In [9]:
data = np.random.binomial(n=25, p=0.5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 12.4904
std: 2.48401043476

Hypergeometric distribution

Samples are drawn from a hypergeometric distribution with specified parameters, ngood (ways to make a good selection), nbad (ways to make a bad selection), and nsample = number of items sampled, which is less than or equal to the sum ngood + nbad.

  • ngood : Number of ways to make a good selection. Must be nonnegative.
  • nbad : Number of ways to make a bad selection. Must be nonnegative.
  • nsample : Number of items sampled. Must be at least 1 and at most ngood + nbad.

The probability mass function of the hypergeometric distribution is

$$P(x) = \frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}},$$

where $0 \le x \le m$ and $n+m-N \le x \le n$

for $P(x)$ the probability of $x$ successes, where $m$ = ngood, $N - m$ = nbad (so $N$ = ngood + nbad is the total population size), and $n$ = nsample.

Consider an urn containing black and white marbles: ngood of them are black and nbad of them are white. If you draw nsample marbles without replacement, the hypergeometric distribution describes the distribution of black marbles in the drawn sample.

Note that this distribution is very similar to the binomial distribution, except that in this case, samples are drawn without replacement, whereas in the Binomial case samples are drawn with replacement (or the sample space is infinite). As the sample space becomes large, this distribution approaches the binomial.
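
This convergence can be checked empirically (a sketch; the urn size is illustrative):

# With a large urn, drawing without replacement barely differs from
# drawing with replacement.
h = np.random.hypergeometric(ngood=5000, nbad=5000, nsample=15, size=10000)
b = np.random.binomial(n=15, p=0.5, size=10000)
print("hypergeometric:", h.mean(), h.std())
print("binomial:      ", b.mean(), b.std())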

Suppose you have an urn with 15 white and 15 black marbles. If you pull 15 marbles at random, how likely is it that 12 or more of them are one color?

>>> s = np.random.hypergeometric(15, 15, 15, 100000)
>>> sum(s>=12)/100000. + sum(s<=3)/100000.
#   answer = 0.003 ... pretty unlikely!
In [10]:
np.random.hypergeometric(ngood=15, nbad=15, nsample=15, size=100)
Out[10]:
array([ 8,  6,  9,  8,  8,  5,  8,  8,  9,  8,  7,  7,  5,  8,  8,  7,  8,
        7,  7,  7,  7,  9, 10,  7,  6,  7,  8,  8,  6,  6,  8,  8,  7,  8,
        5,  7,  8,  5,  7,  6,  6,  7,  6,  9,  9,  8,  6,  7,  6,  9, 11,
        8,  8,  7,  9,  6,  8,  6,  6,  8,  7,  7,  6,  9,  8,  8,  6,  5,
        9,  5,  7,  7,  8,  7,  9,  7,  5,  5,  9,  9,  7,  5,  5,  8,  7,
        8,  8,  8,  7,  8,  8,  7,  8,  7,  8,  7,  7, 10,  8,  8])
In [11]:
data = np.random.hypergeometric(ngood=15, nbad=15, nsample=15, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 7.5096
std: 1.39459952675

Poisson distribution

The Poisson distribution is the limit of the binomial distribution for a large number of trials $n$, with $p$ scaled so that the expected count $\lambda = np$ stays fixed:

$$f(k; \lambda)=\frac{\lambda^k e^{-\lambda}}{k!}$$

For events occurring independently at a known average rate, the Poisson distribution $f(k; \lambda)$ describes the probability of $k$ events occurring within an interval in which $\lambda$ events are expected.

Because the output is limited to the range of the C long type, a ValueError is raised when lam is within 10 sigma of the maximum representable value.
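
The binomial limit can be checked empirically (a sketch; the values of lam and n are illustrative):

# binomial(n, lam/n) approaches poisson(lam) as n grows
lam, n = 3.0, 10000
b = np.random.binomial(n=n, p=lam / n, size=100000)
p = np.random.poisson(lam=lam, size=100000)
print("binomial approximation:", b.mean(), b.std())
print("poisson:               ", p.mean(), p.std())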

Lambda=1

In [12]:
np.random.poisson(lam=1, size=100)
Out[12]:
array([0, 1, 1, 1, 1, 0, 2, 1, 0, 0, 1, 1, 2, 4, 2, 1, 0, 1, 2, 0, 3, 1, 0,
       1, 2, 3, 0, 1, 0, 1, 2, 1, 0, 1, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 2, 0, 2, 3, 1, 1, 1, 4, 2, 3, 4, 2, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1,
       1, 2, 1, 4, 1, 2, 4, 0, 1, 1, 1, 0, 1, 0, 1, 1, 3, 0, 2, 1, 2, 1, 0,
       1, 0, 0, 2, 0, 0, 2, 1])
In [13]:
data = np.random.poisson(lam=1, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 0.9985
std: 0.989291539436

Lambda=2

In [14]:
data = np.random.poisson(lam=2, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 1.9987
std: 1.39989224942

Lambda=3

In [15]:
data = np.random.poisson(lam=3, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 2.9921
std: 1.75335038997

Lambda=4

In [16]:
data = np.random.poisson(lam=4, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 3.9682
std: 1.9913283908

Lambda=5

In [17]:
data = np.random.poisson(lam=5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 5.0249
std: 2.2426502157

Geometric distribution

Bernoulli trials are experiments with one of two outcomes: success or failure (an example of such an experiment is flipping a coin). The geometric distribution models the number of trials that must be run in order to achieve success. It is therefore supported on the positive integers, $k = 1, 2, \dots$

The probability mass function of the geometric distribution is

$$f(k) = (1 - p)^{k - 1} p$$

where $p$ is the probability of success of an individual trial.
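
The theoretical mean is $1/p$ and the standard deviation is $\sqrt{1-p}/p$; a quick sketch for $p = 0.5$, matching the sample statistics printed below:

p = 0.5
print("theoretical mean:", 1 / p)               # 2.0
print("theoretical std: ", np.sqrt(1 - p) / p)  # about 1.414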

In [18]:
np.random.geometric(p=0.5, size=100)
Out[18]:
array([1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 2, 2, 3, 2, 4, 1, 2, 2, 1, 1, 1, 1,
       2, 1, 2, 1, 1, 3, 1, 1, 1, 4, 1, 3, 1, 2, 4, 1, 1, 1, 1, 2, 2, 1, 2,
       2, 3, 1, 4, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 2, 1,
       2, 2, 2, 1, 2, 3, 3, 2, 1, 1, 2, 1, 1, 3, 1, 2, 1, 2, 2, 3, 3, 3, 1,
       4, 3, 2, 4, 1, 3, 2, 1])
In [19]:
data = np.random.geometric(p=0.5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 2.0085
std: 1.40158044721

Pascal distribution (negative binomial distribution)

Samples are drawn from a negative binomial distribution with specified parameters, $n$ successes and $p$ probability of success where $n$ is an integer > 0 and $p$ is in the interval $(0, 1]$.

The probability mass function of the negative binomial distribution is

$$P(N;n,p) = \binom{N+n-1}{n-1}p^{n}(1-p)^{N},$$

where $n-1$ is the number of successes, $p$ is the probability of success, and $N+n-1$ is the number of trials. The negative binomial distribution gives the probability of $n-1$ successes and $N$ failures in $N+n-1$ trials, and success on the $(N+n)$th trial.

If one throws a die repeatedly until the third time a "1" appears, then the probability distribution of the number of non-"1"s that appear before the third "1" is a negative binomial distribution.
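
A sketch of this die example ($p = 1/6$ for rolling a "1", $n = 3$ successes):

rolls = np.random.negative_binomial(n=3, p=1/6, size=100000)
print("mean:", rolls.mean())   # theoretical mean is n(1-p)/p = 15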

In [20]:
np.random.negative_binomial(n=1, p=0.1, size=100)
Out[20]:
array([ 0,  4,  2, 11,  0, 11,  9, 24,  4, 24, 49,  1,  5, 11,  4,  1, 18,
        4,  2,  1, 23,  3,  3,  9, 11, 11,  0,  1,  7,  0,  7,  1,  6, 18,
       13,  0, 12,  3,  3,  9,  4,  2,  2, 35, 19, 12,  4, 16,  8, 20, 12,
        3, 21, 24,  3,  1, 11,  6,  1,  4, 18,  8, 52,  9,  8, 45,  5,  3,
       14,  9, 12,  0,  2,  5,  3, 17,  4,  7,  2,  1,  5, 23,  3,  1, 26,
        2,  4, 23,  3, 34, 21,  0,  7,  2, 10,  3,  0, 18,  7,  0])
In [21]:
data = np.random.negative_binomial(n=1, p=0.1, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 9.1721
std: 9.63784631492

Uniform distribution

In [22]:
np.random.choice(range(10), size=100, replace=True, p=None)
Out[22]:
array([0, 6, 4, 5, 1, 1, 7, 2, 5, 2, 0, 4, 7, 4, 2, 6, 4, 4, 3, 7, 4, 8, 7,
       3, 8, 6, 5, 3, 3, 1, 7, 5, 8, 4, 0, 7, 4, 3, 5, 3, 4, 0, 6, 7, 5, 4,
       8, 7, 6, 8, 6, 7, 2, 7, 0, 1, 8, 3, 6, 1, 4, 0, 9, 4, 1, 6, 1, 9, 5,
       4, 6, 4, 3, 2, 7, 2, 0, 5, 2, 9, 4, 6, 4, 9, 4, 1, 0, 9, 2, 8, 6, 3,
       3, 9, 0, 9, 7, 3, 9, 6])
In [23]:
data = np.random.choice(range(25), size=100000, replace=True, p=None)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 12.03434
std: 7.22212993267

Miscellaneous

In [24]:
np.random.choice(range(10), size=10, replace=False, p=None)
Out[24]:
array([9, 7, 6, 2, 8, 1, 4, 0, 5, 3])
In [25]:
np.random.choice([1, 2, 3], size=100, replace=True, p=[0.8, 0.1, 0.1])
Out[25]:
array([1, 1, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 1, 1, 1,
       1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1,
       1, 1, 1, 3, 1, 1, 2, 3, 3, 1, 1, 1, 1, 3, 1, 1, 1, 2, 1, 1, 2, 1, 1,
       3, 1, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 3, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 3])

Continuous distributions

Uniform distribution

In [26]:
np.random.uniform(low=0.0, high=1.0, size=50)
Out[26]:
array([ 0.67104556,  0.83023117,  0.05559095,  0.96168305,  0.3635337 ,
        0.49322418,  0.0811216 ,  0.14233599,  0.27693516,  0.10072654,
        0.61389513,  0.85229222,  0.05049795,  0.18958157,  0.20640898,
        0.3721049 ,  0.49940771,  0.76084304,  0.05013689,  0.24839969,
        0.1949281 ,  0.45049906,  0.21380667,  0.59182599,  0.71946769,
        0.80371736,  0.11106431,  0.683811  ,  0.91139047,  0.49937834,
        0.69347948,  0.93589418,  0.26646693,  0.21876906,  0.50414477,
        0.17072248,  0.87572343,  0.49640995,  0.8671082 ,  0.56918445,
        0.83821021,  0.73590822,  0.75769837,  0.54989057,  0.2142728 ,
        0.83252792,  0.56320058,  0.64748407,  0.35717758,  0.00350291])

Normal distribution

The probability density function of the normal distribution, first derived by De Moivre and later independently by both Gauss and Laplace, is often called the bell curve because of its characteristic shape (see the example below).

The normal distribution occurs often in nature. For example, it describes the commonly occurring distribution of samples influenced by a large number of tiny, random disturbances, each with its own unique distribution.

The probability density for the Gaussian distribution is

$$p(x) = \frac{1}{\sqrt{ 2 \pi \sigma^2 }} e^{ - \frac{ (x - \mu)^2 } {2 \sigma^2} },$$

where $\mu$ is the mean and $\sigma$ the standard deviation. The square of the standard deviation, $\sigma^2$, is called the variance.

The function has its peak at the mean, and its "spread" increases with the standard deviation (the function reaches 0.607 times its maximum at $\mu + \sigma$ and $\mu - \sigma$). This implies that numpy.random.normal is more likely to return samples lying close to the mean than samples far away.
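
The 0.607 factor can be checked directly, since it is $e^{-1/2}$ (a sketch with illustrative $\mu$ and $\sigma$):

mu, sigma = 0.0, 1.0
def gaussian_pdf(x):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
print(gaussian_pdf(mu + sigma) / gaussian_pdf(mu))   # exp(-1/2), about 0.607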

In [27]:
np.random.normal(loc=0.0, scale=1.0, size=50)
Out[27]:
array([ 1.53597863,  0.60434812,  1.90580199, -0.6774977 , -0.22281365,
        0.74455759, -0.25939052, -1.56089066, -0.39445805,  1.48596957,
       -0.48746909,  0.61119874,  1.31091042,  0.25538365,  0.85883798,
       -0.51616033,  2.85633901,  0.15848599,  0.6171736 ,  1.75055873,
       -0.78858517, -0.1689512 ,  0.65946444,  0.88692399, -0.46548475,
       -1.66740128, -0.70305577,  0.25817821,  0.05805816, -2.15632734,
        0.00497013,  0.56189663, -1.58355391, -2.51163019, -0.32110889,
        0.51125056,  1.64208819, -0.99586084, -2.17261491, -0.250682  ,
       -1.41984785,  1.20660704, -1.25455572,  0.05938362, -2.55812072,
        0.95259184,  0.34111079, -0.44878927, -0.03453323,  0.04980198])
In [28]:
data = np.random.normal(loc=0.0, scale=1.0, size=100000)
plot(data, bins=np.arange(-5, 6, 0.2))
print("mean:", data.mean())
print("std:", data.std())
mean: -0.00114518730585
std: 1.00000915657
In [29]:
data = np.random.normal(loc=2.0, scale=1.0, size=100000)
plot(data, bins=np.arange(-5, 6, 0.2))
print("mean:", data.mean())
print("std:", data.std())
mean: 1.99998797739
std: 1.00054912509
In [30]:
data = np.random.normal(loc=0.0, scale=1.5, size=100000)
plot(data, bins=np.arange(-5, 6, 0.2))
print("mean:", data.mean())
print("std:", data.std())
mean: 0.00267796456518
std: 1.49795240415

Log normal distribution

Draw samples from a log-normal distribution with specified mean, standard deviation, and array shape. Note that the mean and standard deviation are not the values for the distribution itself, but of the underlying normal distribution it is derived from.

A variable $x$ has a log-normal distribution if $\ln(x)$ is normally distributed. The probability density function for the log-normal distribution is:

$$p(x) = \frac{1}{\sigma x \sqrt{2\pi}} \exp\left( -\frac{(\ln(x)-\mu)^2}{2\sigma^2} \right)$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the normally distributed logarithm of the variable.

A log-normal distribution results if a random variable is the product of a large number of independent, identically-distributed variables in the same way that a normal distribution results if the variable is the sum of a large number of independent, identically-distributed variables.
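
This product interpretation can be illustrated empirically (a sketch; the factor distribution and sizes are illustrative):

# The product of many i.i.d. positive factors is approximately log-normal,
# because the log of the product is a sum of i.i.d. terms.
factors = np.random.uniform(low=0.5, high=1.5, size=(100000, 50))
product = factors.prod(axis=1)
plot(np.log(product))   # the log of the product looks normal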

In [31]:
np.random.lognormal(mean=0.0, sigma=1.0, size=50)
Out[31]:
array([ 2.30497834,  0.39737179,  0.8307491 ,  0.7546626 ,  0.3010694 ,
        1.74249317,  1.40943629,  1.04864103,  2.29234239,  1.00263511,
        1.15531409,  2.50011619,  0.38898804,  0.82443434,  1.60405612,
        0.79481931,  1.89238041,  0.32286974,  0.48701339,  0.96174448,
        1.63714094,  1.02100535,  1.12076644,  4.55667572,  0.38831428,
        1.04073273,  2.57138772,  1.13975708,  0.14557827,  0.95761604,
        0.24508508,  1.51894579,  0.68814677,  0.72862115,  0.5853182 ,
        0.31928929,  0.82060584,  4.08187396,  3.04202637,  0.27014763,
        2.03570052,  0.80861777,  3.72313357,  0.88755491,  1.00053833,
        0.65831618,  0.86067884,  0.3852054 ,  0.20331136,  0.85778834])
In [32]:
data = np.random.lognormal(mean=0.0, sigma=1.0, size=100000)
plot(data, bins=np.arange(0, 10, 0.2))
print("mean:", data.mean())
print("std:", data.std())
mean: 1.65839186203
std: 2.16220122543
In [33]:
data = np.random.lognormal(mean=2.0, sigma=1.0, size=100000)
plot(data, bins=np.arange(0, 10, 0.2))
print("mean:", data.mean())
print("std:", data.std())
mean: 12.1629628267
std: 15.9159262295
In [34]:
data = np.random.lognormal(mean=0.0, sigma=1.5, size=100000)
plot(data, bins=np.arange(0, 10, 0.2))
print("mean:", data.mean())
print("std:", data.std())
mean: 3.09594123528
std: 8.56184051614

Power distribution

Draws samples in $[0, 1]$ from a power distribution with positive exponent $a - 1$ (with $a > 0$).

Also known as the power function distribution.

The probability density function is

$$P(x; a) = ax^{a-1}, 0 \le x \le 1, a>0.$$

The power function distribution is just the inverse of the Pareto distribution. It may also be seen as a special case of the Beta distribution.

It is used, for example, in modeling the over-reporting of insurance claims.
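
The special case of the Beta distribution noted above is easy to check: the power function distribution coincides with $\mathrm{Beta}(a, 1)$ (a sketch with an illustrative $a$):

a = 2.0
p = np.random.power(a=a, size=100000)
b = np.random.beta(a=a, b=1.0, size=100000)
print("power:", p.mean(), p.std())
print("beta: ", b.mean(), b.std())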

In [35]:
np.random.power(a=1.0, size=50)
Out[35]:
array([ 0.51062578,  0.05690993,  0.76410561,  0.59003196,  0.72203246,
        0.93320166,  0.00492461,  0.69966576,  0.48744854,  0.39799066,
        0.29922829,  0.34109046,  0.59102967,  0.8672385 ,  0.13562029,
        0.88979166,  0.28097275,  0.45082095,  0.82271325,  0.94555905,
        0.00594498,  0.99812782,  0.44598687,  0.90445102,  0.3246682 ,
        0.46605463,  0.07593735,  0.40502395,  0.02084473,  0.46908677,
        0.36810889,  0.80300819,  0.43673139,  0.74710671,  0.85961951,
        0.92180194,  0.22184448,  0.14997902,  0.96094549,  0.99349202,
        0.62451833,  0.8938237 ,  0.81986197,  0.47493491,  0.90482113,
        0.24027929,  0.51136235,  0.7541394 ,  0.80650171,  0.09519044])
In [36]:
data = np.random.power(a=0.25, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())
mean: 0.198087813852
std: 0.265005110639
In [37]:
data = np.random.power(a=0.5, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())
mean: 0.334377587168
std: 0.298238512263
In [38]:
data = np.random.power(a=1.0, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())
mean: 0.501470574665
std: 0.289169936237
In [39]:
data = np.random.power(a=2.0, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())
mean: 0.668239517985
std: 0.235289140128
In [40]:
data = np.random.power(a=5.0, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())
mean: 0.833287682779
std: 0.140451057947

Beta distribution
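
A minimal sketch following the pattern of the other sections (the shape parameters $a$ and $b$ are illustrative; the theoretical mean of $\mathrm{Beta}(a, b)$ is $a/(a+b)$):

data = np.random.beta(a=2.0, b=5.0, size=100000)
plot(data, bins=np.arange(0, 1.02, 0.02))
print("mean:", data.mean())   # theoretical mean: 2/7, about 0.286
print("std:", data.std())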


Exponential distribution

Its probability density function is

$$f\left( x; \frac{1}{\beta} \right) = \frac{1}{\beta} \exp \left( \frac{-x}{\beta} \right)$$

for $x > 0$ and 0 elsewhere.

$\beta$ is the scale parameter, which is the inverse of the rate parameter $\lambda = 1/\beta$. The rate parameter is an alternative, widely used parameterization of the exponential distribution.

The exponential distribution is a continuous analogue of the geometric distribution. It describes many common situations, such as the size of raindrops measured over many rainstorms, or the time between page requests to Wikipedia.
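
Since numpy parameterizes the exponential by the scale $\beta$, sampling at a given rate $\lambda$ means passing scale=1/lambda (a quick sketch with an illustrative rate):

lam = 4.0
data = np.random.exponential(scale=1 / lam, size=100000)
print("mean:", data.mean())   # about 1/lam = 0.25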

In [41]:
np.random.exponential(scale=1.0, size=50)
Out[41]:
array([ 3.35792625,  1.38486544,  2.57643236,  0.28509232,  0.42099   ,
        0.60495311,  0.69808564,  3.09013062,  0.68461339,  2.06229833,
        4.07638531,  0.05099878,  1.16568446,  0.26599367,  0.24458049,
        1.05867685,  2.42134531,  0.31890183,  1.45383122,  0.64828182,
        0.65024198,  0.02609204,  2.09908227,  0.64440071,  1.11937955,
        2.91213523,  0.10479277,  0.26112871,  0.21128686,  1.27727815,
        1.48304424,  0.01325119,  0.46780751,  1.30809298,  0.50083319,
        0.31499235,  0.69115437,  2.42405067,  2.51943501,  0.22953386,
        1.87609158,  0.97354262,  0.14265848,  0.13744649,  2.37175573,
        0.10389841,  0.33706408,  0.83699998,  0.59091668,  0.07418911])
In [42]:
data = np.random.exponential(scale=1.0, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())
mean: 1.00614000302
std: 1.00890213555
In [43]:
data = np.random.exponential(scale=2.0, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())
mean: 1.99267034842
std: 1.99011891871
In [44]:
data = np.random.exponential(scale=5.0, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())
mean: 4.99587096957
std: 4.99891828483
In [45]:
data = np.random.exponential(scale=0.5, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())
mean: 0.499059196485
std: 0.497957621821

Chi-square distribution

When df independent random variables, each with standard normal distributions (mean=0, variance=1), are squared and summed, the resulting distribution is chi-square.

This distribution is often used in hypothesis testing.

The variable obtained by summing the squares of df independent, standard normally distributed random variables:

$$Q = \sum_{i=1}^{\mathtt{df}} X^2_i$$

is chi-square distributed, denoted

$$Q \sim \chi^2_k.$$

The probability density function of the chi-squared distribution is

$$p(x) = \frac{(1/2)^{k/2}}{\Gamma(k/2)} x^{k/2 - 1} e^{-x/2},$$

where $\Gamma$ is the gamma function,

$$\Gamma(x) = \int_0^{\infty} t^{x - 1} e^{-t} dt.$$
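
The sum-of-squares definition can be checked against the generator directly (a sketch with an illustrative df):

df = 5
q = (np.random.normal(size=(100000, df)) ** 2).sum(axis=1)
c = np.random.chisquare(df=df, size=100000)
print("sum of squares:", q.mean(), q.std())
print("chisquare:     ", c.mean(), c.std())
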
In [46]:
np.random.chisquare(df=1.0, size=50)
Out[46]:
array([  2.36823008e-01,   3.61474908e-01,   4.88437216e+00,
         1.15593171e+00,   9.16577658e-01,   9.49865848e-03,
         8.71912761e-01,   3.89333933e-02,   3.51821616e-03,
         2.44098328e+00,   2.85731118e-01,   1.17381525e+00,
         1.66560646e-01,   1.86011474e+00,   1.40426182e+00,
         1.03208495e+00,   1.09306292e-01,   6.78963299e-02,
         4.03890785e-01,   2.06354015e+00,   4.52373861e-02,
         1.40548460e-02,   1.00225810e+00,   5.09964406e-03,
         1.68741792e-02,   6.04837865e-02,   7.35536601e-01,
         2.88910015e-01,   1.03972742e+00,   3.16639883e-01,
         4.75560395e-03,   3.36596264e-01,   1.20964296e+00,
         2.68590876e-01,   1.58262040e-01,   3.03613536e-01,
         1.48203607e-01,   3.34045131e+00,   1.84796561e-02,
         1.74884341e+00,   2.03025196e-02,   5.80664851e-01,
         5.58587235e-01,   4.26915343e-01,   2.66419694e-02,
         1.94552098e-01,   8.89992993e-01,   5.23787752e+00,
         1.82231682e-01,   3.84417490e-01])
In [47]:
data = np.random.chisquare(df=1.0, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 0.987186954192
std: 1.39368542787
In [48]:
data = np.random.chisquare(df=2.0, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 1.96231602321
std: 1.95408830177
In [49]:
data = np.random.chisquare(df=5.0, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())
mean: 5.02470808907
std: 3.1963618343