16. Distributions and Probabilities#

16.1. Outline#

In this lecture we give a quick introduction to data and probability distributions using Python.

!pip install --upgrade yfinance  
Hide code cell output
Requirement already satisfied: yfinance in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (0.2.36)
Requirement already satisfied: pandas>=1.3.0 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from yfinance) (2.0.3)
Requirement already satisfied: numpy>=1.16.5 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from yfinance) (1.24.3)
Requirement already satisfied: requests>=2.31 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from yfinance) (2.31.0)
Requirement already satisfied: multitasking>=0.0.7 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from yfinance) (0.0.11)
Requirement already satisfied: lxml>=4.9.1 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from yfinance) (4.9.3)
Requirement already satisfied: appdirs>=1.4.4 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from yfinance) (1.4.4)
Requirement already satisfied: pytz>=2022.5 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from yfinance) (2023.3.post1)
Requirement already satisfied: frozendict>=2.3.4 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from yfinance) (2.4.0)
Requirement already satisfied: peewee>=3.16.2 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from yfinance) (3.17.1)
Requirement already satisfied: beautifulsoup4>=4.11.1 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from yfinance) (4.12.2)
Requirement already satisfied: html5lib>=1.1 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from yfinance) (1.1)
Requirement already satisfied: soupsieve>1.2 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from beautifulsoup4>=4.11.1->yfinance) (2.4)
Requirement already satisfied: six>=1.9 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from html5lib>=1.1->yfinance) (1.16.0)
Requirement already satisfied: webencodings in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from html5lib>=1.1->yfinance) (0.5.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from pandas>=1.3.0->yfinance) (2.8.2)
Requirement already satisfied: tzdata>=2022.1 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from pandas>=1.3.0->yfinance) (2023.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from requests>=2.31->yfinance) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from requests>=2.31->yfinance) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from requests>=2.31->yfinance) (1.26.16)
Requirement already satisfied: certifi>=2017.4.17 in /usr/share/miniconda3/envs/quantecon/lib/python3.11/site-packages (from requests>=2.31->yfinance) (2023.7.22)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import yfinance as yf
import scipy.stats
import seaborn as sns

16.2. Common distributions#

In this section we recall the definitions of some well-known distributions and explore how to manipulate them with SciPy.

16.2.1. Discrete distributions#

Let’s start with discrete distributions.

A discrete distribution is defined by a set of numbers \(S = \{x_1, \ldots, x_n\}\) and a probability mass function (PMF) on \(S\), which is a function \(p\) from \(S\) to \([0,1]\) with the property

\[ \sum_{i=1}^n p(x_i) = 1 \]

We say that a random variable \(X\) has distribution \(p\) if \(X\) takes value \(x_i\) with probability \(p(x_i)\).

That is,

\[ \mathbb P\{X = x_i\} = p(x_i) \quad \text{for } i= 1, \ldots, n \]

The mean or expected value of a random variable \(X\) with distribution \(p\) is

\[ \mathbb{E}[X] = \sum_{i=1}^n x_i p(x_i) \]

Expectation is also called the first moment of the distribution.

We also refer to this number as the mean of the distribution (represented by) \(p\).

The variance of \(X\) is defined as

\[ \mathbb{V}[X] = \sum_{i=1}^n (x_i - \mathbb{E}[X])^2 p(x_i) \]

Variance is also called the second central moment of the distribution.

The cumulative distribution function (CDF) of \(X\) is defined by

\[ F(x) = \mathbb{P}\{X \leq x\} = \sum_{i=1}^n \mathbb 1\{x_i \leq x\} p(x_i) \]

Here \(\mathbb 1\{ \textrm{statement} \} = 1\) if “statement” is true and zero otherwise.

Hence the second term takes all \(x_i \leq x\) and sums their probabilities.

16.2.1.1. Uniform distribution#

One simple example is the uniform distribution, where \(p(x_i) = 1/n\) for all \(n\).

We can import the uniform distribution on \(S = \{1, \ldots, n\}\) from SciPy like so:

n = 10
u = scipy.stats.randint(1, n+1)

Here’s the mean and variance

u.mean(), u.var()
(5.5, 8.25)

The formula for the mean is \((n+1)/2\), and the formula for the variance is \((n^2 - 1)/12\).

Now let’s evaluate the PMF

u.pmf(1)
0.1
u.pmf(2)
0.1

Here’s a plot of the probability mass function:

fig, ax = plt.subplots()
S = np.arange(1, n+1)
ax.plot(S, u.pmf(S), linestyle='', marker='o', alpha=0.8, ms=4)
ax.vlines(S, 0, u.pmf(S), lw=0.2)
ax.set_xticks(S)
plt.show()
_images/1fe53347620dc49fe839f549e2e6c4a336a06a75ec9980ddaa904b311720d6a4.png

Here’s a plot of the CDF:

fig, ax = plt.subplots()
S = np.arange(1, n+1)
ax.step(S, u.cdf(S))
ax.vlines(S, 0, u.cdf(S), lw=0.2)
ax.set_xticks(S)
plt.show()
_images/e34184dc464c2473fb260a5879b7b97b78f902ee825eed7272d46c743de07bc4.png

The CDF jumps up by \(p(x_i)\) and \(x_i\).

Exercise 16.1

Calculate the mean and variance for this parameterization (i.e., \(n=10\)) directly from the PMF, using the expressions given above.

Check that your answers agree with u.mean() and u.var().

16.2.1.2. Bernoulli distribution#

Another useful (and more interesting) distribution is the Bernoulli distribution

We can import the uniform distribution on \(S = \{1, \ldots, n\}\) from SciPy like so:

n = 10
u = scipy.stats.randint(1, n+1)

Here’s the mean and variance

u.mean(), u.var()
(5.5, 8.25)

The formula for the mean is \((n+1)/2\), and the formula for the variance is \((n^2 - 1)/12\).

Now let’s evaluate the PMF

u.pmf(1)
0.1
u.pmf(2)
0.1

Here’s a plot of the probability mass function:

fig, ax = plt.subplots()
S = np.arange(1, n+1)
ax.plot(S, u.pmf(S), linestyle='', marker='o', alpha=0.8, ms=4)
ax.vlines(S, 0, u.pmf(S), lw=0.2)
ax.set_xticks(S)
plt.show()
_images/1fe53347620dc49fe839f549e2e6c4a336a06a75ec9980ddaa904b311720d6a4.png

Here’s a plot of the CDF:

fig, ax = plt.subplots()
S = np.arange(1, n+1)
ax.step(S, u.cdf(S))
ax.vlines(S, 0, u.cdf(S), lw=0.2)
ax.set_xticks(S)
plt.show()
_images/e34184dc464c2473fb260a5879b7b97b78f902ee825eed7272d46c743de07bc4.png

The CDF jumps up by \(p(x_i)\) and \(x_i\).

Exercise 16.2

Calculate the mean and variance for this parameterization (i.e., \(n=10\)) directly from the PMF, using the expressions given above.

Check that your answers agree with u.mean() and u.var().

16.2.1.3. Binomial distribution#

Another useful (and more interesting) distribution is the binomial distribution on \(S=\{0, \ldots, n\}\), which has PMF

\[ p(i) = \binom{n}{i} \theta^i (1-\theta)^{n-i} \]

Here \(\theta \in [0,1]\) is a parameter.

The interpretation of \(p(i)\) is: the number of successes in \(n\) independent trials with success probability \(\theta\).

(If \(\theta=0.5\), p(i) can be “how many heads in \(n\) flips of a fair coin”)

The mean and variance are

n = 10
θ = 0.5
u = scipy.stats.binom(n, θ)
u.mean(), u.var()
(5.0, 2.5)

The formula for the mean is \(n \theta\) and the formula for the variance is \(n \theta (1-\theta)\).

Here’s the PDF

u.pmf(1)
0.009765625000000002
fig, ax = plt.subplots()
S = np.arange(1, n+1)
ax.plot(S, u.pmf(S), linestyle='', marker='o', alpha=0.8, ms=4)
ax.vlines(S, 0, u.pmf(S), lw=0.2)
ax.set_xticks(S)
plt.show()
_images/686eb55823a463fb91922f7015cb41fd8d6bf399f106a3acc742325cbb30a241.png

Here’s the CDF

fig, ax = plt.subplots()
S = np.arange(1, n+1)
ax.step(S, u.cdf(S))
ax.vlines(S, 0, u.cdf(S), lw=0.2)
ax.set_xticks(S)
plt.show()
_images/136c34fc2845d8da6367981c61ebe50ebae9b3a93857121fe990ef204cfb365e.png

Exercise 16.3

Using u.pmf, check that our definition of the CDF given above calculates the same function as u.cdf.

16.2.1.4. Poisson distribution#

Poisson distribution on \(S = \{0, 1, \ldots\}\) with parameter \(\lambda > 0\) has PMF

\[ p(i) = \frac{\lambda^i}{i!} e^{-\lambda} \]

The interpretation of \(p(i)\) is: the number of events in a fixed time interval, where the events occur at a constant rate \(\lambda\) and independently of each other.

The mean and variance are

λ = 2
u = scipy.stats.poisson(λ)
u.mean(), u.var()
(2.0, 2.0)

The the expectation of Poisson distribution is \(\lambda\) and the variance is also \(\lambda\).

Here’s the PMF

λ = 2
u = scipy.stats.poisson(λ)
u.pmf(1)
0.2706705664732254
fig, ax = plt.subplots()
S = np.arange(1, n+1)
ax.plot(S, u.pmf(S), linestyle='', marker='o', alpha=0.8, ms=4)
ax.vlines(S, 0, u.pmf(S), lw=0.2)
ax.set_xticks(S)
plt.show()
_images/8f30da31ffe9a56330185d86e7af93977d18b6828a34faa1f5d4763290ef776a.png

16.2.2. Continuous distributions#

Continuous distributions are represented by a density function, which is a function \(p\) over \(\mathbb R\) (the set of all numbers) such that \(p(x) \geq 0\) for all \(x\) and

\[ \int_{-\infty}^\infty p(x) dx = 1 \]

We say that random variable \(X\) has distribution \(p\) if

\[ \mathbb P\{a < X < b\} = \int_a^b p(x) dx \]

for all \(a \leq b\).

The definition of the mean and variance of a random variable \(X\) with distribution \(p\) are the same as the discrete case, after replacing the sum with an integral.

For example, the mean of \(X\) is

\[ \mathbb{E}[X] = \int_{-\infty}^\infty x p(x) dx \]

The cumulative distribution function (CDF) of \(X\) is defined by

\[ F(x) = \mathbb P\{X \leq x\} = \int_{-\infty}^x p(x) dx \]

16.2.2.1. Normal distribution#

Perhaps the most famous distribution is the normal distribution, which has density

\[ p(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]

This distribution has two parameters, \(\mu\) and \(\sigma\).

It can be shown that, for this distribution, the mean is \(\mu\) and the variance is \(\sigma^2\).

We can obtain the moments, PDF and CDF of the normal density as follows:

μ, σ = 0.0, 1.0
u = scipy.stats.norm(μ, σ)
u.mean(), u.var()
(0.0, 1.0)

Here’s a plot of the density — the famous “bell-shaped curve”:

μ_vals = [-1, 0, 1]
σ_vals = [0.4, 1, 1.6]
fig, ax = plt.subplots()
x_grid = np.linspace(-4, 4, 200)

for μ, σ in zip(μ_vals, σ_vals):
    u = scipy.stats.norm(μ, σ)
    ax.plot(x_grid, u.pdf(x_grid),
    alpha=0.5, lw=2,
    label=f'$\mu={μ}, \sigma={σ}$')

plt.legend()
plt.show()
_images/841ab727d0d9e255b164fe1fcaebd5f337d96b8670c535dce9618f65298b3d0d.png

Here’s a plot of the CDF:

fig, ax = plt.subplots()
for μ, σ in zip(μ_vals, σ_vals):
    u = scipy.stats.norm(μ, σ)
    ax.plot(x_grid, u.cdf(x_grid),
    alpha=0.5, lw=2,
    label=f'$\mu={μ}, \sigma={σ}$')
    ax.set_ylim(0, 1)
plt.legend()
plt.show()
_images/81f0dc6e63526bc654f0f5cffc0e4d7f218950f6c5446014162aabe178b84c93.png

16.2.2.2. Lognormal distribution#

The lognormal distribution is a distribution on \(\left(0, \infty\right)\) with density

\[ p(x) = \frac{1}{\sigma x \sqrt{2\pi}} \exp \left(- \frac{\left(\log x - \mu\right)^2}{2 \sigma^2} \right) \]

This distribution has two parameters, \(\mu\) and \(\sigma\).

It can be shown that, for this distribution, the mean is \(\exp\left(\mu + \sigma^2/2\right)\) and the variance is \(\left[\exp\left(\sigma^2\right) - 1\right] \exp\left(2\mu + \sigma^2\right)\).

It has a nice interpretation: if \(X\) is lognormally distributed, then \(\log X\) is normally distributed.

It is often used to model variables that are “multiplicative” in nature, such as income or asset prices.

We can obtain the moments, PDF, and CDF of the normal density as follows:

μ, σ = 0.0, 1.0
u = scipy.stats.lognorm(s=σ, scale=np.exp(μ))
u.mean(), u.var()
(1.6487212707001282, 4.670774270471604)
μ_vals = [-1, 0, 1]
σ_vals = [0.25, 0.5, 1]
x_grid = np.linspace(0, 3, 200)

fig, ax = plt.subplots()
for μ, σ in zip(μ_vals, σ_vals):
    u = scipy.stats.lognorm(σ, scale=np.exp(μ))
    ax.plot(x_grid, u.pdf(x_grid),
    alpha=0.5, lw=2,
    label=f'$\mu={μ}, \sigma={σ}$')

plt.legend()
plt.show()
_images/630745989e0834bfc3205ea467474f9ef8b07daeb614932272f0132cae9894d2.png
fig, ax = plt.subplots()
μ = 1
for σ in σ_vals:
    u = scipy.stats.norm(μ, σ)
    ax.plot(x_grid, u.cdf(x_grid),
    alpha=0.5, lw=2,
    label=f'$\mu={μ}, \sigma={σ}$')
    ax.set_ylim(0, 1)
    ax.set_xlim(0, 3)
plt.legend()
plt.show()
_images/902093670fc8a77189685a4f0f4d571e44dc993b1090c6dad2858b26060e22b7.png

16.2.2.3. Exponential distribution#

The exponential distribution is a distribution on \(\left(0, \infty\right)\) with density

\[ p(x) = \lambda \exp \left( - \lambda x \right) \]

This distribution has one parameter, \(\lambda\).

It is related to the Poisson distribution as it describes the distribution of the length of the time interval between two consecutive events in a Poisson process.

It can be shown that, for this distribution, the mean is \(1/\lambda\) and the variance is \(1/\lambda^2\).

We can obtain the moments, PDF, and CDF of the normal density as follows:

λ = 1.0
u = scipy.stats.expon(scale=1/λ)
u.mean(), u.var()
(1.0, 1.0)
fig, ax = plt.subplots()
λ_vals = [0.5, 1, 2]
x_grid = np.linspace(0, 6, 200)

for λ in λ_vals:
    u = scipy.stats.expon(scale=1/λ)
    ax.plot(x_grid, u.pdf(x_grid),
    alpha=0.5, lw=2,
    label=f'$\lambda={λ}$')
plt.legend()
plt.show()
_images/bd793db095b2c69f7039ce7f830a50ed29196a2b6b93da4bf598d9f8fb15caa2.png
fig, ax = plt.subplots()
for λ in λ_vals:
    u = scipy.stats.expon(scale=1/λ)
    ax.plot(x_grid, u.cdf(x_grid),
    alpha=0.5, lw=2,
    label=f'$\lambda={λ}$')
    ax.set_ylim(0, 1)
plt.legend()
plt.show()
_images/bd3bf061622aa5f71ddba90574ff782e5dd74185a947f1279636e1fb8180a8b0.png

16.2.2.4. Beta distribution#

The beta distribution is a distribution on \((0, 1)\) with density

\[ p(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} \]

where \(\Gamma\) is the gamma function.

(The role of the gamma function is just to normalize the density, so that it integrates to one.)

This distribution has two parameters, \(\alpha > 0\) and \(\beta > 0\).

It can be shown that, for this distribution, the mean is \(\alpha / (\alpha + \beta)\) and the variance is \(\alpha \beta / (\alpha + \beta)^2 (\alpha + \beta + 1)\).

We can obtain the moments, PDF, and CDF of the normal density as follows:

α, β = 3.0, 1.0
u = scipy.stats.beta(α, β)
u.mean(), u.var()
(0.75, 0.0375)
α_vals = [0.5, 1, 5, 25, 3]
β_vals = [3, 1, 10, 20, 0.5]
x_grid = np.linspace(0, 1, 200)

fig, ax = plt.subplots()
for α, β in zip(α_vals, β_vals):
    u = scipy.stats.beta(α, β)
    ax.plot(x_grid, u.pdf(x_grid),
    alpha=0.5, lw=2,
    label=fr'$\alpha={α}, \beta={β}$')
plt.legend()
plt.show()
_images/768bb7e9f27c899f4a1e4c3e0f0328993499e65f196c8595edbbb7e9e21e955c.png
fig, ax = plt.subplots()
for α, β in zip(α_vals, β_vals):
    u = scipy.stats.beta(α, β)
    ax.plot(x_grid, u.cdf(x_grid),
    alpha=0.5, lw=2,
    label=fr'$\alpha={α}, \beta={β}$')
    ax.set_ylim(0, 1)
plt.legend()
plt.show()
_images/771adac6e40aa55ef30550e262f86bdb3ff813da58eee60ffff29a6b21060f9b.png

16.2.2.5. Gamma distribution#

The gamma distribution is a distribution on \(\left(0, \infty\right)\) with density

\[ p(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} \exp(-\beta x) \]

This distribution has two parameters, \(\alpha > 0\) and \(\beta > 0\).

It can be shown that, for this distribution, the mean is \(\alpha / \beta\) and the variance is \(\alpha / \beta^2\).

One interpretation is that if \(X\) is gamma distributed and \(\alpha\) is an integer, then \(X\) is the sum of \(\alpha\) independent exponentially distributed random variables with mean \(1/\beta\).

We can obtain the moments, PDF, and CDF of the normal density as follows:

α, β = 3.0, 2.0
u = scipy.stats.gamma(α, scale=1/β)
u.mean(), u.var()
(1.5, 0.75)
α_vals = [1, 3, 5, 10]
β_vals = [3, 5, 3, 3]
x_grid = np.linspace(0, 7, 200)

fig, ax = plt.subplots()
for α, β in zip(α_vals, β_vals):
    u = scipy.stats.gamma(α, scale=1/β)
    ax.plot(x_grid, u.pdf(x_grid),
    alpha=0.5, lw=2,
    label=fr'$\alpha={α}, \beta={β}$')
plt.legend()
plt.show()
_images/efe8bd5e3b1db0785f95538ff9333e2b9b20e7fbd5cbdb7e08defc36f5aea2fb.png
fig, ax = plt.subplots()
for α, β in zip(α_vals, β_vals):
    u = scipy.stats.gamma(α, scale=1/β)
    ax.plot(x_grid, u.cdf(x_grid),
    alpha=0.5, lw=2,
    label=fr'$\alpha={α}, \beta={β}$')
    ax.set_ylim(0, 1)
plt.legend()
plt.show()
_images/3959f6c951517692492143c122bca1f5ada8dfd2fe6c4f7f7c41a55b870b9c31.png

16.3. Observed distributions#

Sometimes we refer to observed data or measurements as “distributions”.

For example, let’s say we observe the income of 10 people over a year:

data = [['Hiroshi', 1200], 
        ['Ako', 1210], 
        ['Emi', 1400],
        ['Daiki', 990],
        ['Chiyo', 1530],
        ['Taka', 1210],
        ['Katsuhiko', 1240],
        ['Daisuke', 1124],
        ['Yoshi', 1330],
        ['Rie', 1340]]

df = pd.DataFrame(data, columns=['name', 'income'])
df
name income
0 Hiroshi 1200
1 Ako 1210
2 Emi 1400
3 Daiki 990
4 Chiyo 1530
5 Taka 1210
6 Katsuhiko 1240
7 Daisuke 1124
8 Yoshi 1330
9 Rie 1340

In this situation, we might refer to the set of their incomes as the “income distribution.”

The terminology is confusing because this set is not a probability distribution — it’s just a collection of numbers.

However, as we will see, there are connections between observed distributions (i.e., sets of numbers like the income distribution above) and probability distributions.

Below we explore some observed distributions.

16.3.1. Summary statistics#

Suppose we have an observed distribution with values \(\{x_1, \ldots, x_n\}\)

The sample mean of this distribution is defined as

\[ \bar x = \frac{1}{n} \sum_{i=1}^n x_i \]

The sample variance is defined as

\[ \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2 \]

For the income distribution given above, we can calculate these numbers via

x = np.asarray(df['income'])
x.mean(), x.var()
(1257.4, 20412.839999999997)

Exercise 16.4

Check that the formulas given above produce the same numbers.

16.3.2. Visualization#

Let’s look at different ways that we can visualize one or more observed distributions.

We will cover

  • histograms

  • kernel density estimates and

  • violin plots

16.3.2.1. Histograms#

We can histogram the income distribution we just constructed as follows

x = df['income']
fig, ax = plt.subplots()
ax.hist(x, bins=5, density=True, histtype='bar')
plt.show()
_images/f3610d61545002338fed44c6dae74445c46249174a7f2e94b185b795b02fab28.png

Let’s look at a distribution from real data.

In particular, we will look at the monthly return on Amazon shares between 2000/1/1 and 2023/1/1.

The monthly return is calculated as the percent change in the share price over each month.

So we will have one observation for each month.

df = yf.download('AMZN', '2000-1-1', '2023-1-1', interval='1mo' )
prices = df['Adj Close']
data = prices.pct_change()[1:] * 100
data.head()
Hide code cell output
[*********************100%%**********************]  1 of 1 completed

Date
2000-02-01     6.679568
2000-03-01    -2.722323
2000-04-01   -17.630592
2000-05-01   -12.457531
2000-06-01   -24.838297
Name: Adj Close, dtype: float64

The first observation is the monthly return (percent change) over January 2000, which was

data[0] 
6.6795679502808625

Let’s turn the return observations into an array and histogram it.

x_amazon = np.asarray(data)
fig, ax = plt.subplots()
ax.hist(x_amazon, bins=20)
plt.show()
_images/1b7dd447d1805b124f3042c34dd89af076908af532365e92958da5db891519a4.png

16.3.2.2. Kernel density estimates#

Kernel density estimate (KDE) is a non-parametric way to estimate and visualize the PDF of a distribution.

KDE will generate a smooth curve that approximates the PDF.

fig, ax = plt.subplots()
sns.kdeplot(x_amazon, ax=ax)
plt.show()
_images/e31e0737e4bc4cd53bb5559a6581097a9587ecec846fa7e0712bc5e57361443d.png

The smoothness of the KDE is dependent on how we choose the bandwidth.

fig, ax = plt.subplots()
sns.kdeplot(x_amazon, ax=ax, bw_adjust=0.1, alpha=0.5, label="bw=0.1")
sns.kdeplot(x_amazon, ax=ax, bw_adjust=0.5, alpha=0.5, label="bw=0.5")
sns.kdeplot(x_amazon, ax=ax, bw_adjust=1, alpha=0.5, label="bw=1")
plt.legend()
plt.show()
_images/64582ca96b9bc3336c4d9e163891c6f934e413d791bf125a7dd624f9e0d0fcf4.png

When we use a larger bandwidth, the KDE is smoother.

A suitable bandwidth is not too smooth (underfitting) or too wiggly (overfitting).

16.3.2.3. Violin plots#

Yet another way to display an observed distribution is via a violin plot.

fig, ax = plt.subplots()
ax.violinplot(x_amazon)
plt.show()
_images/35c2942613b74a4930e961d1f39cd06345213ff948cb92ab3afc6953b5c886b3.png

Violin plots are particularly useful when we want to compare different distributions.

For example, let’s compare the monthly returns on Amazon shares with the monthly return on Apple shares.

df = yf.download('AAPL', '2000-1-1', '2023-1-1', interval='1mo' )
prices = df['Adj Close']
data = prices.pct_change()[1:] * 100
x_apple = np.asarray(data)
Hide code cell output
[*********************100%%**********************]  1 of 1 completed

fig, ax = plt.subplots()
ax.violinplot([x_amazon, x_apple])
plt.show()
_images/cf024efaeea41b3566a9bbe173eaea2d7d7a5cbac00eff50cf8b2e3235a7cdeb.png

16.3.3. Connection to probability distributions#

Let’s discuss the connection between observed distributions and probability distributions.

Sometimes it’s helpful to imagine that an observed distribution is generated by a particular probability distribution.

For example, we might look at the returns from Amazon above and imagine that they were generated by a normal distribution.

Even though this is not true, it might be a helpful way to think about the data.

Here we match a normal distribution to the Amazon monthly returns by setting the sample mean to the mean of the normal distribution and the sample variance equal to the variance.

Then we plot the density and the histogram.

μ = x_amazon.mean()
σ_squared = x_amazon.var()
σ = np.sqrt(σ_squared)
u = scipy.stats.norm(μ, σ)
x_grid = np.linspace(-50, 65, 200)
fig, ax = plt.subplots()
ax.plot(x_grid, u.pdf(x_grid))
ax.hist(x_amazon, density=True, bins=40)
plt.show()
_images/77c11298eda5cbe4a9de9dbff4c2539aa0fdf68ac0c468d20179b693db8f72b8.png

The match between the histogram and the density is not very bad but also not very good.

One reason is that the normal distribution is not really a good fit for this observed data — we will discuss this point again when we talk about heavy tailed distributions.

Of course, if the data really is generated by the normal distribution, then the fit will be better.

Let’s see this in action

  • first we generate random draws from the normal distribution

  • then we histogram them and compare with the density.

μ, σ = 0, 1
u = scipy.stats.norm(μ, σ)
N = 2000  # Number of observations
x_draws = u.rvs(N)
x_grid = np.linspace(-4, 4, 200)
fig, ax = plt.subplots()
ax.plot(x_grid, u.pdf(x_grid))
ax.hist(x_draws, density=True, bins=40)
plt.show()
_images/59899c5359714301abbb4da35cc9f82261abb1647db900658b0d5c1fe641bbab.png

Note that if you keep increasing \(N\), which is the number of observations, the fit will get better and better.

This convergence is a version of the “law of large numbers”, which we will discuss later.