# Numpy random snippets¶

Jérémie Decock (www.jdhp.org)

Most comments are taken from the Numpy documentation.

## Import directive¶

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt


## Tool functions¶

In [2]:
def plot(data, bins=30):
    plt.hist(data, bins)
    plt.show()


## Discrete distributions¶

### Bernoulli distribution¶

In [3]:
def bernoulli(p=None, size=1):
    return np.random.binomial(n=1, p=p, size=size)

In [4]:
bernoulli(p=0.5, size=100)

Out[4]:
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0,
1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1,
1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
0, 1, 1, 0, 1, 1, 0, 1])

### Binomial distribution¶

Samples are drawn from a binomial distribution with specified parameters, $n$ trials and $p$ probability of success where $n \in \mathbb{N}$ and $p$ is in the interval $[0,1]$.

The probability mass function of the binomial distribution is

$$P(N) = \binom{n}{N}p^N(1-p)^{n-N}$$

where $n$ is the number of trials, $p$ is the probability of success, and $N$ is the number of successes.

When estimating the standard error of a proportion in a population from a random sample, the normal approximation works well unless the product $pn \le 5$, where $p$ is the estimated population proportion and $n$ is the sample size; in that case the binomial distribution is used instead. For example, a sample of 15 people shows 4 who are left handed and 11 who are right handed. Then $p = 4/15 \approx 0.27$ and $0.27 \times 15 = 4 \le 5$, so the binomial distribution should be used in this case.
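This rule of thumb is easy to check in code; a quick sketch of the left-handed example (variable names are illustrative):

```python
import numpy as np

n = 15         # sample size
successes = 4  # left-handed people observed

p_hat = successes / n  # estimated proportion, about 0.27

# The normal approximation is trusted only when p*n > 5;
# here p_hat * n == 4, so the binomial model is preferred.
use_binomial = p_hat * n <= 5

# Standard error of the proportion under the binomial model
se = np.sqrt(p_hat * (1 - p_hat) / n)
```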

In [5]:
np.random.binomial(n=10, p=0.5, size=100)

Out[5]:
array([6, 9, 5, 6, 6, 4, 6, 5, 5, 5, 6, 6, 6, 3, 4, 3, 7, 7, 7, 5, 7, 7, 5,
3, 7, 4, 9, 3, 5, 4, 3, 6, 5, 6, 3, 4, 4, 6, 6, 6, 8, 6, 6, 6, 4, 6,
5, 7, 4, 6, 6, 3, 3, 6, 4, 4, 3, 6, 5, 6, 5, 3, 5, 5, 4, 5, 7, 5, 8,
5, 4, 4, 2, 4, 5, 8, 8, 5, 3, 8, 4, 6, 6, 5, 7, 4, 6, 8, 6, 7, 7, 6,
6, 5, 6, 8, 7, 5, 9, 4])
In [6]:
data = np.random.binomial(n=10, p=0.25, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 2.4974
std: 1.36864649928

In [7]:
data = np.random.binomial(n=10, p=0.5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 5.0121
std: 1.58270451759

In [8]:
data = np.random.binomial(n=10, p=0.75, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 7.512
std: 1.3764650377

In [9]:
data = np.random.binomial(n=25, p=0.5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 12.4904
std: 2.48401043476


### Hypergeometric distribution¶

Samples are drawn from a hypergeometric distribution with specified parameters, ngood (ways to make a good selection), nbad (ways to make a bad selection), and nsample = number of items sampled, which is less than or equal to the sum ngood + nbad.

• ngood : Number of ways to make a good selection. Must be nonnegative.
• nbad : Number of ways to make a bad selection. Must be nonnegative.
• nsample : Number of items sampled. Must be at least 1 and at most ngood + nbad.

The probability mass function of the hypergeometric distribution is

$$P(x) = \frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}},$$

where $\max(0,\, n+m-N) \le x \le \min(n, m)$

for $P(x)$ the probability of $x$ successes, $m$ = ngood, $n$ = nsample, and $N$ = ngood + nbad.

Consider an urn with black and white marbles in it, ngood of them black and nbad are white. If you draw nsample balls without replacement, then the hypergeometric distribution describes the distribution of black balls in the drawn sample.

Note that this distribution is very similar to the binomial distribution, except that in this case, samples are drawn without replacement, whereas in the Binomial case samples are drawn with replacement (or the sample space is infinite). As the sample space becomes large, this distribution approaches the binomial.
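That convergence can be seen by enlarging the urn while keeping the proportion of good marbles fixed (the urn sizes below are arbitrary):

```python
import numpy as np

np.random.seed(42)

nsample, size = 15, 100000

# Small urn: 15 good, 15 bad -- a strong without-replacement effect.
small_urn = np.random.hypergeometric(ngood=15, nbad=15, nsample=nsample, size=size)

# Much larger urn with the same proportion of good marbles.
large_urn = np.random.hypergeometric(ngood=15000, nbad=15000, nsample=nsample, size=size)

# Binomial with the same n and p, i.e. drawing with replacement.
binom = np.random.binomial(n=nsample, p=0.5, size=size)

# All three means agree, but the small urn's spread is visibly narrower.
print(small_urn.std(), large_urn.std(), binom.std())
```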

Suppose you have an urn with 15 white and 15 black marbles. If you pull 15 marbles at random, how likely is it that 12 or more of them are one color?

>>> s = np.random.hypergeometric(15, 15, 15, 100000)
>>> sum(s>=12)/100000. + sum(s<=3)/100000.
#   answer = 0.003 ... pretty unlikely!
In [10]:
np.random.hypergeometric(ngood=15, nbad=15, nsample=15, size=100)

Out[10]:
array([ 8,  6,  9,  8,  8,  5,  8,  8,  9,  8,  7,  7,  5,  8,  8,  7,  8,
7,  7,  7,  7,  9, 10,  7,  6,  7,  8,  8,  6,  6,  8,  8,  7,  8,
5,  7,  8,  5,  7,  6,  6,  7,  6,  9,  9,  8,  6,  7,  6,  9, 11,
8,  8,  7,  9,  6,  8,  6,  6,  8,  7,  7,  6,  9,  8,  8,  6,  5,
9,  5,  7,  7,  8,  7,  9,  7,  5,  5,  9,  9,  7,  5,  5,  8,  7,
8,  8,  8,  7,  8,  8,  7,  8,  7,  8,  7,  7, 10,  8,  8])
In [11]:
data = np.random.hypergeometric(ngood=15, nbad=15, nsample=15, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 7.5096
std: 1.39459952675


### Poisson distribution¶

The Poisson distribution is the limit of the binomial distribution for a large number of trials $n$ and small success probability $p$, with $\lambda = np$ held fixed:

$$f(k; \lambda)=\frac{\lambda^k e^{-\lambda}}{k!}$$

For events occurring independently at an average rate of $\lambda$ per interval, the Poisson distribution $f(k; \lambda)$ describes the probability of $k$ events occurring within one such interval.

Because the output is limited to the range of the C long type, a ValueError is raised when lam is within 10 sigma of the maximum representable value.
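A defining property of the Poisson distribution is that its mean and variance both equal $\lambda$; a quick empirical check:

```python
import numpy as np

np.random.seed(0)

lam = 3.0
data = np.random.poisson(lam=lam, size=100000)

# Mean and variance of a Poisson variable are both equal to lambda.
print(data.mean(), data.var())
```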

#### Lambda=1¶

In [12]:
np.random.poisson(lam=1, size=100)

Out[12]:
array([0, 1, 1, 1, 1, 0, 2, 1, 0, 0, 1, 1, 2, 4, 2, 1, 0, 1, 2, 0, 3, 1, 0,
1, 2, 3, 0, 1, 0, 1, 2, 1, 0, 1, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
0, 2, 0, 2, 3, 1, 1, 1, 4, 2, 3, 4, 2, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1,
1, 2, 1, 4, 1, 2, 4, 0, 1, 1, 1, 0, 1, 0, 1, 1, 3, 0, 2, 1, 2, 1, 0,
1, 0, 0, 2, 0, 0, 2, 1])
In [13]:
data = np.random.poisson(lam=1, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 0.9985
std: 0.989291539436


#### Lambda=2¶

In [14]:
data = np.random.poisson(lam=2, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 1.9987
std: 1.39989224942


#### Lambda=3¶

In [15]:
data = np.random.poisson(lam=3, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 2.9921
std: 1.75335038997


#### Lambda=4¶

In [16]:
data = np.random.poisson(lam=4, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 3.9682
std: 1.9913283908


#### Lambda=5¶

In [17]:
data = np.random.poisson(lam=5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 5.0249
std: 2.2426502157


### Geometric distribution¶

Bernoulli trials are experiments with one of two outcomes: success or failure (an example of such an experiment is flipping a coin). The geometric distribution models the number of trials that must be run in order to achieve success. It is therefore supported on the positive integers, $k = 1, 2, \dots$

The probability mass function of the geometric distribution is

$$f(k) = (1 - p)^{k - 1} p$$

where $p$ is the probability of success of an individual trial.
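The empirical frequencies can be compared against this probability mass function directly (sample size chosen arbitrarily):

```python
import numpy as np

np.random.seed(1)

p = 0.5
data = np.random.geometric(p=p, size=100000)

# Compare empirical frequencies with f(k) = (1 - p)**(k - 1) * p
for k in (1, 2, 3):
    empirical = (data == k).mean()
    theoretical = (1 - p) ** (k - 1) * p
    print(k, empirical, theoretical)
```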

In [18]:
np.random.geometric(p=0.5, size=100)

Out[18]:
array([1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 2, 2, 3, 2, 4, 1, 2, 2, 1, 1, 1, 1,
2, 1, 2, 1, 1, 3, 1, 1, 1, 4, 1, 3, 1, 2, 4, 1, 1, 1, 1, 2, 2, 1, 2,
2, 3, 1, 4, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 2, 1,
2, 2, 2, 1, 2, 3, 3, 2, 1, 1, 2, 1, 1, 3, 1, 2, 1, 2, 2, 3, 3, 3, 1,
4, 3, 2, 4, 1, 3, 2, 1])
In [19]:
data = np.random.geometric(p=0.5, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 2.0085
std: 1.40158044721


### Pascal distribution (negative binomial distribution)¶

Samples are drawn from a negative binomial distribution with specified parameters, $n$ trials and $p$ probability of success where $n$ is an integer > 0 and $p$ is in the interval $[0, 1]$.

The probability mass function of the negative binomial distribution is

$$P(N;n,p) = \binom{N+n-1}{n-1}p^{n}(1-p)^{N},$$

where $n-1$ is the number of successes, $p$ is the probability of success, and $N+n-1$ is the number of trials. The negative binomial distribution gives the probability of $n-1$ successes and $N$ failures in $N+n-1$ trials, and success on the $(N+n)$th trial.

If one throws a die repeatedly until the third time a "1" appears, then the probability distribution of the number of non-"1"s that appear before the third "1" is a negative binomial distribution.
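The die example can be simulated directly and compared with np.random.negative_binomial; both counts of non-"1" rolls should average $n(1-p)/p = 15$ for $n = 3$ and $p = 1/6$:

```python
import numpy as np

np.random.seed(2)

def rolls_before_third_one():
    """Count non-"1" rolls of a fair die until the third "1" appears."""
    failures, ones = 0, 0
    while ones < 3:
        if np.random.randint(1, 7) == 1:  # faces 1..6
            ones += 1
        else:
            failures += 1
    return failures

direct = np.array([rolls_before_third_one() for _ in range(20000)])
drawn = np.random.negative_binomial(n=3, p=1/6, size=20000)

print(direct.mean(), drawn.mean())  # both near 15
```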

In [20]:
np.random.negative_binomial(n=1, p=0.1, size=100)

Out[20]:
array([ 0,  4,  2, 11,  0, 11,  9, 24,  4, 24, 49,  1,  5, 11,  4,  1, 18,
4,  2,  1, 23,  3,  3,  9, 11, 11,  0,  1,  7,  0,  7,  1,  6, 18,
13,  0, 12,  3,  3,  9,  4,  2,  2, 35, 19, 12,  4, 16,  8, 20, 12,
3, 21, 24,  3,  1, 11,  6,  1,  4, 18,  8, 52,  9,  8, 45,  5,  3,
14,  9, 12,  0,  2,  5,  3, 17,  4,  7,  2,  1,  5, 23,  3,  1, 26,
2,  4, 23,  3, 34, 21,  0,  7,  2, 10,  3,  0, 18,  7,  0])
In [21]:
data = np.random.negative_binomial(n=1, p=0.1, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 9.1721
std: 9.63784631492


### Uniform distribution¶

In [22]:
np.random.choice(range(10), size=100, replace=True, p=None)

Out[22]:
array([0, 6, 4, 5, 1, 1, 7, 2, 5, 2, 0, 4, 7, 4, 2, 6, 4, 4, 3, 7, 4, 8, 7,
3, 8, 6, 5, 3, 3, 1, 7, 5, 8, 4, 0, 7, 4, 3, 5, 3, 4, 0, 6, 7, 5, 4,
8, 7, 6, 8, 6, 7, 2, 7, 0, 1, 8, 3, 6, 1, 4, 0, 9, 4, 1, 6, 1, 9, 5,
4, 6, 4, 3, 2, 7, 2, 0, 5, 2, 9, 4, 6, 4, 9, 4, 1, 0, 9, 2, 8, 6, 3,
3, 9, 0, 9, 7, 3, 9, 6])
In [23]:
data = np.random.choice(range(25), size=100000, replace=True, p=None)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 12.03434
std: 7.22212993267


### Miscellaneous¶

In [24]:
np.random.choice(range(10), size=10, replace=False, p=None)

Out[24]:
array([9, 7, 6, 2, 8, 1, 4, 0, 5, 3])
In [25]:
np.random.choice([1, 2, 3], size=100, replace=True, p=[0.8, 0.1, 0.1])

Out[25]:
array([1, 1, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 1, 1, 1,
1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1,
1, 1, 1, 3, 1, 1, 2, 3, 3, 1, 1, 1, 1, 3, 1, 1, 1, 2, 1, 1, 2, 1, 1,
3, 1, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 3, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 3])

## Continuous distributions¶

### Uniform distribution¶

In [26]:
np.random.uniform(low=0.0, high=1.0, size=50)

Out[26]:
array([ 0.67104556,  0.83023117,  0.05559095,  0.96168305,  0.3635337 ,
0.49322418,  0.0811216 ,  0.14233599,  0.27693516,  0.10072654,
0.61389513,  0.85229222,  0.05049795,  0.18958157,  0.20640898,
0.3721049 ,  0.49940771,  0.76084304,  0.05013689,  0.24839969,
0.1949281 ,  0.45049906,  0.21380667,  0.59182599,  0.71946769,
0.80371736,  0.11106431,  0.683811  ,  0.91139047,  0.49937834,
0.69347948,  0.93589418,  0.26646693,  0.21876906,  0.50414477,
0.17072248,  0.87572343,  0.49640995,  0.8671082 ,  0.56918445,
0.83821021,  0.73590822,  0.75769837,  0.54989057,  0.2142728 ,
0.83252792,  0.56320058,  0.64748407,  0.35717758,  0.00350291])

### Normal distribution¶

The probability density function of the normal distribution, first derived by De Moivre and 200 years later by both Gauss and Laplace independently, is often called the bell curve because of its characteristic shape (see the example below).

The normal distribution occurs often in nature. For example, it describes the commonly occurring distribution of samples influenced by a large number of tiny, random disturbances, each with its own unique distribution.

The probability density for the Gaussian distribution is

$$p(x) = \frac{1}{\sqrt{ 2 \pi \sigma^2 }} e^{ - \frac{ (x - \mu)^2 } {2 \sigma^2} },$$

where $\mu$ is the mean and $\sigma$ the standard deviation. The square of the standard deviation, $\sigma^2$, is called the variance.

The function has its peak at the mean, and its "spread" increases with the standard deviation (the function reaches 0.607 times its maximum at $\mu + \sigma$ and $\mu - \sigma$). This implies that numpy.random.normal is more likely to return samples lying close to the mean than those far away.
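The 0.607 figure is just $e^{-1/2}$, the ratio of the density at $\mu \pm \sigma$ to the density at the peak; a short check against the formula above:

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density, written out from the formula above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

mu, sigma = 0.0, 1.0

# Density one standard deviation from the mean, relative to the peak.
ratio = normal_pdf(mu + sigma, mu, sigma) / normal_pdf(mu, mu, sigma)
print(round(ratio, 3))  # exp(-1/2), i.e. ~0.607
```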

In [27]:
np.random.normal(loc=0.0, scale=1.0, size=50)

Out[27]:
array([ 1.53597863,  0.60434812,  1.90580199, -0.6774977 , -0.22281365,
0.74455759, -0.25939052, -1.56089066, -0.39445805,  1.48596957,
-0.48746909,  0.61119874,  1.31091042,  0.25538365,  0.85883798,
-0.51616033,  2.85633901,  0.15848599,  0.6171736 ,  1.75055873,
-0.78858517, -0.1689512 ,  0.65946444,  0.88692399, -0.46548475,
-1.66740128, -0.70305577,  0.25817821,  0.05805816, -2.15632734,
0.00497013,  0.56189663, -1.58355391, -2.51163019, -0.32110889,
0.51125056,  1.64208819, -0.99586084, -2.17261491, -0.250682  ,
-1.41984785,  1.20660704, -1.25455572,  0.05938362, -2.55812072,
0.95259184,  0.34111079, -0.44878927, -0.03453323,  0.04980198])
In [28]:
data = np.random.normal(loc=0.0, scale=1.0, size=100000)
plot(data, bins=np.arange(-5, 6, 0.2))
print("mean:", data.mean())
print("std:", data.std())

mean: -0.00114518730585
std: 1.00000915657

In [29]:
data = np.random.normal(loc=2.0, scale=1.0, size=100000)
plot(data, bins=np.arange(-5, 6, 0.2))
print("mean:", data.mean())
print("std:", data.std())

mean: 1.99998797739
std: 1.00054912509

In [30]:
data = np.random.normal(loc=0.0, scale=1.5, size=100000)
plot(data, bins=np.arange(-5, 6, 0.2))
print("mean:", data.mean())
print("std:", data.std())

mean: 0.00267796456518
std: 1.49795240415


### Log normal distribution¶

Draw samples from a log-normal distribution with specified mean, standard deviation, and array shape. Note that the mean and standard deviation are not the values for the distribution itself, but of the underlying normal distribution it is derived from.

A variable $x$ has a log-normal distribution if $\log(x)$ is normally distributed. The probability density function for the log-normal distribution is:

$$p(x) = \frac{1}{\sigma x \sqrt{2\pi}} e^{-\frac{(\ln(x)-\mu)^2}{2\sigma^2}}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the normally distributed logarithm of the variable.

A log-normal distribution results if a random variable is the product of a large number of independent, identically-distributed variables in the same way that a normal distribution results if the variable is the sum of a large number of independent, identically-distributed variables.
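Note that because `mean` and `sigma` parameterize the underlying normal, the log-normal distribution's own mean is $e^{\mu + \sigma^2/2}$ (about 1.65 for $\mu = 0$, $\sigma = 1$); a quick check:

```python
import numpy as np

np.random.seed(3)

mu, sigma = 0.0, 1.0
data = np.random.lognormal(mean=mu, sigma=sigma, size=1000000)

# The distribution's own mean is exp(mu + sigma**2 / 2), not mu.
expected_mean = np.exp(mu + sigma ** 2 / 2)
print(data.mean(), expected_mean)
```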

In [31]:
np.random.lognormal(mean=0.0, sigma=1.0, size=50)

Out[31]:
array([ 2.30497834,  0.39737179,  0.8307491 ,  0.7546626 ,  0.3010694 ,
1.74249317,  1.40943629,  1.04864103,  2.29234239,  1.00263511,
1.15531409,  2.50011619,  0.38898804,  0.82443434,  1.60405612,
0.79481931,  1.89238041,  0.32286974,  0.48701339,  0.96174448,
1.63714094,  1.02100535,  1.12076644,  4.55667572,  0.38831428,
1.04073273,  2.57138772,  1.13975708,  0.14557827,  0.95761604,
0.24508508,  1.51894579,  0.68814677,  0.72862115,  0.5853182 ,
0.31928929,  0.82060584,  4.08187396,  3.04202637,  0.27014763,
2.03570052,  0.80861777,  3.72313357,  0.88755491,  1.00053833,
0.65831618,  0.86067884,  0.3852054 ,  0.20331136,  0.85778834])
In [32]:
data = np.random.lognormal(mean=0.0, sigma=1.0, size=100000)
plot(data, bins=np.arange(0, 10, 0.2))
print("mean:", data.mean())
print("std:", data.std())

mean: 1.65839186203
std: 2.16220122543

In [33]:
data = np.random.lognormal(mean=2.0, sigma=1.0, size=100000)
plot(data, bins=np.arange(0, 10, 0.2))
print("mean:", data.mean())
print("std:", data.std())

mean: 12.1629628267
std: 15.9159262295

In [34]:
data = np.random.lognormal(mean=0.0, sigma=1.5, size=100000)
plot(data, bins=np.arange(0, 10, 0.2))
print("mean:", data.mean())
print("std:", data.std())

mean: 3.09594123528
std: 8.56184051614


### Power distribution¶

Draws samples in $[0, 1]$ from a power distribution with positive exponent $a - 1$ (with $a > 0$).

Also known as the power function distribution.

The probability density function is

$$P(x; a) = ax^{a-1}, 0 \le x \le 1, a>0.$$

The power function distribution is just the inverse of the Pareto distribution. It may also be seen as a special case of the Beta distribution.

It is used, for example, in modeling the over-reporting of insurance claims.
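The Beta connection can be verified empirically: np.random.power(a) draws from the same distribution as np.random.beta(a, 1), whose mean is $a/(a+1)$ (the value of $a$ below is arbitrary):

```python
import numpy as np

np.random.seed(4)

a, size = 2.0, 100000

power_draws = np.random.power(a=a, size=size)
beta_draws = np.random.beta(a=a, b=1.0, size=size)

# power(a) coincides with Beta(a, 1); both means are near a / (a + 1).
print(power_draws.mean(), beta_draws.mean(), a / (a + 1))
```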

In [35]:
np.random.power(a=1.0, size=50)

Out[35]:
array([ 0.51062578,  0.05690993,  0.76410561,  0.59003196,  0.72203246,
0.93320166,  0.00492461,  0.69966576,  0.48744854,  0.39799066,
0.29922829,  0.34109046,  0.59102967,  0.8672385 ,  0.13562029,
0.88979166,  0.28097275,  0.45082095,  0.82271325,  0.94555905,
0.00594498,  0.99812782,  0.44598687,  0.90445102,  0.3246682 ,
0.46605463,  0.07593735,  0.40502395,  0.02084473,  0.46908677,
0.36810889,  0.80300819,  0.43673139,  0.74710671,  0.85961951,
0.92180194,  0.22184448,  0.14997902,  0.96094549,  0.99349202,
0.62451833,  0.8938237 ,  0.81986197,  0.47493491,  0.90482113,
0.24027929,  0.51136235,  0.7541394 ,  0.80650171,  0.09519044])
In [36]:
data = np.random.power(a=0.25, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())

mean: 0.198087813852
std: 0.265005110639

In [37]:
data = np.random.power(a=0.5, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())

mean: 0.334377587168
std: 0.298238512263

In [38]:
data = np.random.power(a=1.0, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())

mean: 0.501470574665
std: 0.289169936237

In [39]:
data = np.random.power(a=2.0, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())

mean: 0.668239517985
std: 0.235289140128

In [40]:
data = np.random.power(a=5.0, size=100000)
plot(data, bins=np.arange(0, 2, 0.05))
print("mean:", data.mean())
print("std:", data.std())

mean: 0.833287682779
std: 0.140451057947


### Beta distribution¶

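A minimal sketch of drawing beta samples; the shape parameters $a$ and $b$ below are arbitrary, and the Beta$(a, b)$ distribution is supported on $[0, 1]$ with mean $a/(a+b)$:

```python
import numpy as np

np.random.seed(5)

a, b = 2.0, 5.0  # arbitrary shape parameters
data = np.random.beta(a=a, b=b, size=100000)

# Beta(a, b) lives on [0, 1] with mean a / (a + b).
print(data.mean(), a / (a + b))
print(float(data.min()), float(data.max()))
```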



### Exponential distribution¶

Its probability density function is

$$f\left( x; \frac{1}{\beta} \right) = \frac{1}{\beta} \exp \left( \frac{-x}{\beta} \right)$$

for $x > 0$ and 0 elsewhere.

$\beta$ is the scale parameter, which is the inverse of the rate parameter $\lambda = 1/\beta$. The rate parameter is an alternative, widely used parameterization of the exponential distribution.

The exponential distribution is a continuous analogue of the geometric distribution. It describes many common situations, such as the size of raindrops measured over many rainstorms, or the time between page requests to Wikipedia.

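Both the mean and the standard deviation of an exponential variable equal the scale $\beta$; a quick empirical check:

```python
import numpy as np

np.random.seed(6)

scale = 2.0  # beta = 1 / lambda
data = np.random.exponential(scale=scale, size=100000)

# Mean and standard deviation of an exponential variable both equal beta.
print(data.mean(), data.std())
```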

In [41]:
np.random.exponential(scale=1.0, size=50)

Out[41]:
array([ 3.35792625,  1.38486544,  2.57643236,  0.28509232,  0.42099   ,
0.60495311,  0.69808564,  3.09013062,  0.68461339,  2.06229833,
4.07638531,  0.05099878,  1.16568446,  0.26599367,  0.24458049,
1.05867685,  2.42134531,  0.31890183,  1.45383122,  0.64828182,
0.65024198,  0.02609204,  2.09908227,  0.64440071,  1.11937955,
2.91213523,  0.10479277,  0.26112871,  0.21128686,  1.27727815,
1.48304424,  0.01325119,  0.46780751,  1.30809298,  0.50083319,
0.31499235,  0.69115437,  2.42405067,  2.51943501,  0.22953386,
1.87609158,  0.97354262,  0.14265848,  0.13744649,  2.37175573,
0.10389841,  0.33706408,  0.83699998,  0.59091668,  0.07418911])
In [42]:
data = np.random.exponential(scale=1.0, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())

mean: 1.00614000302
std: 1.00890213555

In [43]:
data = np.random.exponential(scale=2.0, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())

mean: 1.99267034842
std: 1.99011891871

In [44]:
data = np.random.exponential(scale=5.0, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())

mean: 4.99587096957
std: 4.99891828483

In [45]:
data = np.random.exponential(scale=0.5, size=100000)
plot(data, bins=np.arange(0, 30, 0.5))
print("mean:", data.mean())
print("std:", data.std())

mean: 0.499059196485
std: 0.497957621821


### Chi-square distribution¶

When df independent random variables, each with standard normal distributions (mean=0, variance=1), are squared and summed, the resulting distribution is chi-square.

This distribution is often used in hypothesis testing.

The variable obtained by summing the squares of df independent, standard normally distributed random variables:

$$Q = \sum_{i=1}^{\mathtt{df}} X^2_i$$

is chi-square distributed, denoted

$$Q \sim \chi^2_k.$$

The probability density function of the chi-squared distribution is

$$p(x) = \frac{(1/2)^{k/2}}{\Gamma(k/2)} x^{k/2 - 1} e^{-x/2},$$

where $\Gamma$ is the gamma function,

$$\Gamma(x) = \int_0^{\infty} t^{x - 1} e^{-t} dt.$$
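The definition above can be checked by summing squared standard normal draws and comparing with np.random.chisquare (df and sample size below are arbitrary):

```python
import numpy as np

np.random.seed(7)

df, size = 3, 100000

# Sum df squared standard normal draws per sample...
normals = np.random.normal(loc=0.0, scale=1.0, size=(size, df))
summed = (normals ** 2).sum(axis=1)

# ...and compare with direct chi-square draws: both means are near df.
chisq = np.random.chisquare(df=df, size=size)
print(summed.mean(), chisq.mean())
```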
In [46]:
np.random.chisquare(df=1.0, size=50)

Out[46]:
array([  2.36823008e-01,   3.61474908e-01,   4.88437216e+00,
1.15593171e+00,   9.16577658e-01,   9.49865848e-03,
8.71912761e-01,   3.89333933e-02,   3.51821616e-03,
2.44098328e+00,   2.85731118e-01,   1.17381525e+00,
1.66560646e-01,   1.86011474e+00,   1.40426182e+00,
1.03208495e+00,   1.09306292e-01,   6.78963299e-02,
4.03890785e-01,   2.06354015e+00,   4.52373861e-02,
1.40548460e-02,   1.00225810e+00,   5.09964406e-03,
1.68741792e-02,   6.04837865e-02,   7.35536601e-01,
2.88910015e-01,   1.03972742e+00,   3.16639883e-01,
4.75560395e-03,   3.36596264e-01,   1.20964296e+00,
2.68590876e-01,   1.58262040e-01,   3.03613536e-01,
1.48203607e-01,   3.34045131e+00,   1.84796561e-02,
1.74884341e+00,   2.03025196e-02,   5.80664851e-01,
5.58587235e-01,   4.26915343e-01,   2.66419694e-02,
1.94552098e-01,   8.89992993e-01,   5.23787752e+00,
1.82231682e-01,   3.84417490e-01])
In [47]:
data = np.random.chisquare(df=1.0, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 0.987186954192
std: 1.39368542787

In [48]:
data = np.random.chisquare(df=2.0, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 1.96231602321
std: 1.95408830177

In [49]:
data = np.random.chisquare(df=5.0, size=10000)
plot(data, bins=range(30))
print("mean:", data.mean())
print("std:", data.std())

mean: 5.02470808907
std: 3.1963618343