Wang Haihua
In this chapter, we will only deal with special distributions that are frequently encountered in practice.
Let us review an example of discrete uniform distribution.
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import scipy.stats

unif_d = sp.stats.randint.rvs(0, 10, size = 1000)
fig, ax = plt.subplots(figsize = (12, 7))
h, bins, patches = ax.hist(unif_d, density = True, bins = 30)
ax.set_title('Discrete Uniform Frequency Distribution', size = 16)
plt.show()
x = np.arange(1, 11)
unif_pmf = sp.stats.randint.pmf(x, 2, 10)
fig, ax = plt.subplots(figsize = (12, 7))
ax.scatter(x, unif_pmf, s = 100, color = 'green', label = 'Low = 2, High = 9')
ax.plot(x,unif_pmf, lw = 3, color = 'k')
ax.legend(fontsize = 18, loc = 'center')
The binomial experiment has 4 properties: a fixed number of trials $n$; each trial has only two outcomes (success or failure); the probability of success $p$ is constant across trials; and the trials are independent of each other.
The PMF of the binomial distribution is
\begin{equation}
P(X=k) = C_n^k p^kq^{n-k}, \quad q = 1 - p
\end{equation}
We use a simple example to explain the PMF.
Every month, a personal banker might meet 50 people enquiring about loans; empirically, 30% of them have a bad credit history. We want to calculate, for each number from 1 to 50, the probability that exactly that many people have a bad credit history.
First, what is the probability that the banker meets exactly $14$ people who have a bad credit history?
import scipy.special

n = 50
k = 14  # what is the prob that exactly 14 ppl she met had bad credit history?
p = .3
b = sp.special.comb(n, k)
f_bino = b * p**k * (1 - p)**(n - k)
print('The prob of meeting {0} persons who have bad credit history is {1:.2f}%.'.format(k, f_bino * 100))
The prob of meeting 14 persons who have bad credit history is 11.89%.
n = 50
p = .3
bad_credit = np.arange(1, 51)
y = sp.stats.binom.pmf(bad_credit, n, p)
fig, ax = plt.subplots(figsize = (10, 6))
ax.plot(bad_credit, y, lw = 3, color = 'r', alpha = .5)
ax.set_ylim([0, .13])
ax.set_title('The Probability That 1 to 50 Persons Have Bad Credit History', size = 16, x = .5, y = 1.02)
ax.set_xlabel('Number of Persons With Bad Credit History', size = 12)
ax.set_ylabel('Probability', size = 12)
plt.show()
If a trader trades $n$ times a month and has probability $p$ of winning each trade, find the probability that he wins at least $k$ trades a month.
We can also ask what the probability is that he wins more than $k$ trades, or between $(k_1, \ k_2)$ trades.
What if the probability of winning changes from 0.1 to 0.8: what is the probability that he wins fewer than 6 trades, assuming he trades 20 times every month?
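To sketch the last question (assuming $n = 20$ trades per month and the cutoff at 6 wins), we can sweep $p$ from 0.1 to 0.8 with sp.stats.binom.cdf:

```python
import numpy as np
import scipy as sp
import scipy.stats

n, k = 20, 6
for p in np.arange(0.1, 0.9, 0.1):
    # P(win < 6) = P(X <= 5), i.e. the CDF evaluated at k - 1
    prob = sp.stats.binom.cdf(k - 1, n, p)
    print('p = %.1f: P(win < %d trades) = %.4f' % (p, k, prob))
```

As $p$ rises, the probability of winning fewer than 6 trades drops sharply.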
When $n\rightarrow\infty$ and $p\rightarrow 0$, the binomial distribution converges to the Poisson distribution, i.e. when $n$ is large and $p$ is small, we can use the Poisson with $\lambda = np$ to approximate the binomial. \begin{equation} P(X=k)=\frac{\lambda^ke^{-\lambda}}{k!} \end{equation}
For an ordinary trader, every trade has a $1/1000$ probability of encountering a 'Black Swan' shock. If a trader makes $20$ trades per month, what is the probability that she will encounter $2$ 'Black Swan' shocks within 5 years?
This problem can be solved with the binomial distribution, formulated as below
\begin{equation} \text{Number of Trades} = 20\times 12\times 5=1200\\ P(x=2) = \binom{1200}{2}\Big(\frac{1}{1000}\Big)^2\Big(\frac{999}{1000}\Big)^{1198} \end{equation}Of course, we can employ Poisson PMF to approximate it, the parameter for Poisson distribution is
\begin{equation} \lambda = np = 1200 \times \frac{1}{1000} = 1.2 \end{equation}that means every 5 years, there is in average 1.2 times of Black Swan shock.
\begin{equation} P(x=2)=\frac{\lambda^ke^{-\lambda}}{k!}=\frac{1.2^2e^{-1.2}}{2!} \end{equation}
sp.special.comb(1200, 2)*(1/1000)**2*(999/1000)**1198
0.21698280952603388
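To see how good the approximation is, a quick side-by-side check ($\lambda = 1.2$, $k = 2$ as derived above):

```python
import scipy as sp
import scipy.stats
import scipy.special

lam, k = 1.2, 2
# exact binomial probability of 2 shocks in 1200 trades
binom_exact = sp.special.comb(1200, 2) * (1/1000)**2 * (999/1000)**1198
# Poisson approximation with lambda = np = 1.2
poisson_approx = sp.stats.poisson.pmf(k, lam)
print('Binomial exact : %.6f' % binom_exact)
print('Poisson approx : %.6f' % poisson_approx)
```

The two values agree to about three decimal places, which is why the Poisson approximation is so handy here.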
A surprisingly high probability of having 2 BS shocks, and one BS shock could possibly wipe out the whole account. Take care, traders; survival is pivotal!
So what's the probability of having more than $k$ times BS shock?
k = 2
lambdaP = 1.2
prob_sf = 1 - sp.stats.poisson.cdf(k, lambdaP)
prob_sf_inv = sp.stats.poisson.cdf(k, lambdaP)  # the complement, P(X <= k)
print('The probability of having more than %1.0f BS shock in 5 years is %3.3f%%.' % (k, prob_sf*100)) # double % can escape formating
The probability of having more than 2 BS shock in 5 years is 12.051%.
Again, we can compute the most important moments.
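For instance, sp.stats.poisson.stats returns the mean, variance, skewness and kurtosis in one call; for a Poisson distribution the mean and variance both equal $\lambda$ (here $\lambda = 1.2$ from the example above):

```python
import scipy as sp
import scipy.stats

lambdaP = 1.2
mean, var, skew, kurt = sp.stats.poisson.stats(lambdaP, moments='mvsk')
print('mean = %.2f, var = %.2f, skew = %.4f, kurt = %.4f'
      % (mean, var, skew, kurt))
```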
The PMF of the geometric distribution (G-D) is $f(k)=p(1-p)^k$, where $k$ is a non-negative integer: $p$ is the probability of success, and $k$ is the number of failures before the first success.
So the G-D answers the question: 'How many times do you have to fail before embracing the first success?'
If each trade has a 1/1000 chance of encountering a BS shock, what is the probability of encountering a BS shock only after trading $k$ times safely?
k = 10
p = 1/1000
geodist = (1 - p)**k * p
geodist
print('The probability of observing exactly %1.0f times of safe trading before a BS shock is %3.3f.' %(k, geodist))
The probability of observing exactly 10 times of safe trading before a BS shock is 0.001.
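Note that SciPy's geometric distribution counts the number of trials up to and including the first success ($k \geq 1$, PMF $(1-p)^{k-1}p$), so $k$ failures before the first success corresponds to geom.pmf(k + 1, p). A quick cross-check:

```python
import scipy as sp
import scipy.stats

p, k = 1/1000, 10
manual = (1 - p)**k * p                   # k failures, then one success
scipy_ver = sp.stats.geom.pmf(k + 1, p)   # SciPy counts the success trial too
print(manual, scipy_ver)
```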
The main difference between the hypergeometric and binomial distributions is that in the former the draws are not independent of each other, i.e. sampling is without replacement.
The formula is \begin{equation} f(x) =\frac{{K\choose x} {M-K \choose N-x}}{{M\choose N}} \end{equation} where $M$ is the population size, $K$ the number of successes in the population, and $N$ the number of draws.
There are 100 candies in an urn: 20 are red, 80 are blue. If we take 5 of them out, what is the probability of having exactly 4 red candies?
Solution:
\begin{equation} \frac{{20\choose4}{80\choose1}}{{100\choose5}} \end{equation}To solve it:
C = sp.special.comb(20, 4)*sp.special.comb(80, 1) /sp.special.comb(100, 5)
print('The prob of having 4 red candies by taking 5 out is %1.6f%%.'% (C*100))
The prob of having 4 red candies by taking 5 out is 0.514826%.
What is the probability that at most $4$ red candies are taken? I.e. the sum of the probabilities of drawing $0$, $1$, $2$, $3$, and $4$ red candies.
hgeo_cdf = sp.stats.hypergeom.cdf(4, 100, 20, 5,loc = 0) # the arg order the same as MATLAB, not recommended
print('The prob of having at most 4 red candies by taking 5 out is %1.6f%%.' %(hgeo_cdf*100))
The prob of having at most 4 red candies by taking 5 out is 99.979407%.
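We can also cross-check the manual combinatorics with sp.stats.hypergeom.pmf, whose argument order is (k, M, n, N), with M the population size, n the number of red candies, and N the number of draws:

```python
import scipy as sp
import scipy.stats
import scipy.special

manual = sp.special.comb(20, 4) * sp.special.comb(80, 1) / sp.special.comb(100, 5)
pmf = sp.stats.hypergeom.pmf(4, 100, 20, 5)
print('%.6f%% vs %.6f%%' % (manual * 100, pmf * 100))
```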
The PDF of the uniform distribution is
\begin{equation} f(x)=\frac{1}{b-a}, \quad a \leq x \leq b \end{equation}And its random variable generator is one of the most frequently used functions in NumPy: np.random.rand()
unif = np.random.rand(10000)
fig, ax = plt.subplots(figsize = (12, 7))
h, bins, patches = ax.hist(unif, density = True, bins = 30)
ax.set_title('Continuous Uniform Frequency Distribution', size = 16)
plt.show()
# pdf(x, loc=0, scale=1)
# cdf(x, loc=0, scale=1)
x = np.linspace(-.2, 1.2, 100)
unif_pdf = sp.stats.uniform.pdf(x)
unif_cdf = sp.stats.uniform.cdf(x)
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (17, 7))
ax[0].plot(x,unif_pdf, lw = 4, label = 'PDF of Continuous U-D')
ax[0].set_xlim([-.1, 1.1])
ax[0].set_ylim([0, 2])
ax[0].legend(fontsize = 16)
ax[1].plot(x,unif_cdf, lw = 4, label = 'CDF of Continuous U-D')
ax[1].set_xlim([-.2, 1.2])
ax[1].set_ylim([0, 2])
ax[1].legend(fontsize = 16)
plt.show()
The most convenient method of creating a normal distribution plot is to use sp.stats.norm.pdf().
The PDF of the univariate normal distribution is
$$ f(x)= \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}} $$
mu = 2
sigma = 1
x = np.arange(-2, 6, 0.1)
norm_pdf = sp.stats.norm.pdf(x, mu, sigma)
norm_cdf = sp.stats.norm.cdf(x, mu, sigma)
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (17, 7))
ax[0].plot(x,norm_pdf, lw = 4, label = 'PDF of Normal Distribution', ls = '--')
ax[0].legend(fontsize = 16, loc = 'lower left', framealpha=0.2)
ax[1].plot(x,norm_cdf, lw = 4, label = 'CDF of Normal Distribution')
ax[1].legend(fontsize = 16,fancybox=True, framealpha=0.5)
plt.show()
The inverse normal CDF is commonly used in calculating $p$-values; the plot below is useful for understanding the idea.
norm_95_r = sp.stats.norm.ppf(.975) # ppf means percent point function, i.e. the inverse CDF
norm_95_l = sp.stats.norm.ppf(.025)
x = np.linspace(-5, 5, 200)
y = sp.stats.norm.pdf(x)
xl = np.linspace(-5, norm_95_l, 100)
yl = sp.stats.norm.pdf(xl)
xr = np.linspace(norm_95_r, 5, 100)
yr = sp.stats.norm.pdf(xr)
fig, ax = plt.subplots(figsize = (17, 7))
ax.plot(x,y, lw = 4, label = 'PDF of Normal Distribution', ls = '-', color = 'orange')
ax.set_ylim([0, .45])
ax.fill_between(x, y, 0, alpha=0.1, color = 'blue')
ax.fill_between(xl,yl, 0, alpha=0.6, color = 'blue')
ax.fill_between(xr,yr, 0, alpha=0.6, color = 'blue')
ax.text(-.2, 0.15, '95%', fontsize = 20)
ax.text(-2.3, 0.015, '2.5%', fontsize = 12, color = 'white')
ax.text(2.01, 0.015, '2.5%', fontsize = 12, color = 'white')
ax.annotate('±%.4f' %norm_95_r, xy = (norm_95_r, 0), xytext = (-.4, .05), weight = 'bold', color = 'r',
arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3', color = 'b'), fontsize = 16)
ax.annotate('±%.4f' %norm_95_r, xy = (norm_95_l, 0), xytext = (-.4, .05), weight = 'bold', color = 'r',
arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3', color = 'b'), fontsize = 16)
ax.set_title('Normal Distribution And 2.5% Shaded Area', size = 20)
plt.show()
The multivariate normal distribution density function is \begin{equation} f_\boldsymbol{X}(x_1,...,x_k)=\frac{1}{\sqrt{(2\pi)^k|\Sigma|}}\exp{\Big(-\frac{(x-\mu)^T\Sigma^{-1}(x-\mu)}{2}\Big)} \end{equation}
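A minimal sketch of evaluating this density with sp.stats.multivariate_normal; the mean vector and covariance matrix below are purely illustrative:

```python
import numpy as np
import scipy as sp
import scipy.stats

mu = np.array([0, 0])                  # illustrative mean vector
Sigma = np.array([[1, .5], [.5, 1]])   # illustrative covariance matrix
mvn = sp.stats.multivariate_normal(mu, Sigma)
print(mvn.pdf([0, 0]))                 # density at the mean
```

At the mean the exponential term equals 1, so the density reduces to $1/\sqrt{(2\pi)^k|\Sigma|}$.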
The PDF of Beta distribution is
\begin{equation} f(x, a, b)=\frac{\Gamma(a+b) x^{a-1}(1-x)^{b-1}}{\Gamma(a) \Gamma(b)} \end{equation}where $0\leq x \leq 1$ and $a>0$, $b>0$, $\Gamma$ is the Gamma function.
The Beta distribution is a natural choice for priors in Bayesian econometrics, since its support is bounded on the unit interval.
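As a quick sketch, we can verify the PDF formula against sp.stats.beta.pdf with illustrative parameters $a = 2$, $b = 5$:

```python
import scipy as sp
import scipy.stats
import scipy.special

a, b, x = 2, 5, 0.3
# evaluate the Gamma-function form of the Beta PDF by hand
manual = (sp.special.gamma(a + b) / (sp.special.gamma(a) * sp.special.gamma(b))
          * x**(a - 1) * (1 - x)**(b - 1))
print(manual, sp.stats.beta.pdf(x, a, b))
```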
In mathematics, the gamma function (represented by $\Gamma$ , the capital letter gamma from the Greek alphabet) is one commonly used extension of the factorial function to complex numbers. The gamma function is defined for all complex numbers except the non-positive integers. For any positive integer n, $$\Gamma (n)=(n-1)!$$
Derived by Daniel Bernoulli, for complex numbers with a positive real part, the gamma function is defined via a convergent improper integral:
$$ \Gamma (z)=\int _{0}^{\infty }x^{z-1}e^{-x}\,dx,\ \qquad \Re (z)>0\ .$$
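We can verify $\Gamma(n) = (n-1)!$ numerically with sp.special.gamma, a quick sketch:

```python
import math
import scipy as sp
import scipy.special

for n in range(1, 6):
    # Gamma(n) should match (n-1)! for positive integers
    print('Gamma(%d) = %g, (%d-1)! = %d'
          % (n, sp.special.gamma(n), n, math.factorial(n - 1)))
```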
The $\chi^2$ distribution is closely connected with the normal distribution: if $z$ has the standard normal distribution, then $z^2$ has the $\chi^2$ distribution with $d.f.=1$. Further, if
\begin{equation} z_1, z_2, ..., z_k \sim i.i.d.\ N(0, 1) \end{equation}then the summation has a $\chi^2$ distribution with $d.f. = k$:
\begin{equation} \sum_{i=1}^k z_i^2 \sim \chi^2(k) \end{equation}
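A simulation sketch of this fact: square $k$ i.i.d. standard normals, sum them, and compare the sample mean with the theoretical $\chi^2$ mean, which equals $k$ (the seed and sample size are arbitrary):

```python
import numpy as np
import scipy as sp
import scipy.stats

rng = np.random.default_rng(0)
k = 3
z = rng.standard_normal((100000, k))
chi2_samples = (z**2).sum(axis=1)   # each row is one chi-square draw
print('sample mean %.3f vs theoretical mean %.3f'
      % (chi2_samples.mean(), sp.stats.chi2.mean(k)))
```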
If $U_1$ has a $\chi^2$ distribution with $\nu_1$ d.f. and $U_2$ has a $\chi^2$ distribution with $\nu_2$ d.f., then
\begin{equation} \frac{U_1/\nu_1}{U_2/\nu_2}\sim F(\nu_1, \nu_2) \end{equation}The $F$ distribution is used for ratios of variances.
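A simulation sketch of the ratio-of-chi-squares construction, compared against sp.stats.f (the degrees of freedom below are illustrative):

```python
import numpy as np
import scipy as sp
import scipy.stats

rng = np.random.default_rng(1)
nu1, nu2 = 5, 10
u1 = rng.chisquare(nu1, 100000)
u2 = rng.chisquare(nu2, 100000)
f_samples = (u1 / nu1) / (u2 / nu2)   # the F-ratio construction
# the mean of F(nu1, nu2) is nu2 / (nu2 - 2) for nu2 > 2
print('sample mean %.3f vs theoretical %.3f'
      % (f_samples.mean(), sp.stats.f.mean(nu1, nu2)))
```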