{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e65c8817",
   "metadata": {},
   "source": [
    "# Simple Linear Regression Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a513c791",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a7073a0",
   "metadata": {},
   "source": [
    "The simple regression model estimates the relationship between two variables $ x_i $ and $ y_i $\n",
    "\n",
    "$$\n",
    "y_i = \\alpha + \\beta x_i + \\epsilon_i, i = 1,2,...,N\n",
    "$$\n",
    "\n",
    "where $ \\epsilon_i $ represents the error between the line of best fit and the sample values for $ y_i $ given $ x_i $.\n",
    "\n",
    "Our goal is to choose values for $ \\alpha $ and $ \\beta $ to build a line of “best” fit for some data that is available for variables $ x_i $ and $ y_i $.\n",
    "\n",
    "Let us consider a simple dataset of 10 observations for variables $ x_i $ and $ y_i $:\n",
    "\n",
    "||$ y_i $|$ x_i $|\n",
    "|:-------------------------------:|:-------------------------------:|:-------------------------------:|\n",
    "|1|2000|32|\n",
    "|2|1000|21|\n",
    "|3|1500|24|\n",
    "|4|2500|35|\n",
    "|5|500|10|\n",
    "|6|900|11|\n",
    "|7|1100|22|\n",
    "|8|1500|21|\n",
    "|9|1800|27|\n",
    "|10|250|2|\n",
    "Let us think about $ y_i $ as sales for an ice-cream cart, while $ x_i $ is a variable that records the day’s temperature in Celsius."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1946618e",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "x = [32, 21, 24, 35, 10, 11, 22, 21, 27, 2]\n",
    "y = [2000,1000,1500,2500,500,900,1100,1500,1800, 250]\n",
    "df = pd.DataFrame([x,y]).T\n",
    "df.columns = ['X', 'Y']\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "172b29ce",
   "metadata": {},
   "source": [
    "We can use a scatter plot of the data to see the relationship between $ y_i $ (ice-cream sales in dollars (\\$’s)) and $ x_i $ (degrees Celsius)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "314b7664",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "ax = df.plot(\n",
    "    x='X', \n",
    "    y='Y', \n",
    "    kind='scatter', \n",
    "    ylabel='Ice-cream sales ($\\'s)', \n",
    "    xlabel='Degrees celcius'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5feea426",
   "metadata": {},
   "source": [
    "as you can see the data suggests that more ice-cream is typically sold on hotter days.\n",
    "\n",
    "To build a linear model of the data we need to choose values for $ \\alpha $ and $ \\beta $ that represents a line of “best” fit such that\n",
    "\n",
    "$$\n",
    "\\hat{y_i} = \\hat{\\alpha} + \\hat{\\beta} x_i\n",
    "$$\n",
    "\n",
    "Let’s start with $ \\alpha = 5 $ and $ \\beta = 10 $"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a14fec40",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "α = 5\n",
    "β = 10\n",
    "df['Y_hat'] = α + β * df['X']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9021868d",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)\n",
    "ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6a0a35d",
   "metadata": {},
   "source": [
    "We can see that this model does a poor job of estimating the relationship.\n",
    "\n",
    "We can continue to guess and iterate towards a line of “best” fit by adjusting the parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c33112cc",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "β = 100\n",
    "df['Y_hat'] = α + β * df['X']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5f8c91c",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)\n",
    "ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7e18415",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "β = 65\n",
    "df['Y_hat'] = α + β * df['X']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f7b2c20",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)\n",
    "ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63547ca0",
   "metadata": {},
   "source": [
    "However we need to think about formalizing this guessing process by thinking of this problem as an optimization problem.\n",
    "\n",
    "Let’s consider the error $ \\epsilon_i $ and define the difference between the observed values $ y_i $ and the estimated values $ \\hat{y}_i $ which we will call the residuals\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\hat{e}_i &= y_i - \\hat{y}_i \\\\\n",
    "          &= y_i - \\hat{\\alpha} - \\hat{\\beta} x_i\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9123b8b7",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df['error'] = df['Y_hat'] - df['Y']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69d1a2db",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63e8d7ef",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)\n",
    "ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')\n",
    "plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c6433ed",
   "metadata": {},
   "source": [
    "The Ordinary Least Squares (OLS) method chooses $ \\alpha $ and $ \\beta $ in such a way that **minimizes** the sum of the squared residuals (SSR).\n",
    "\n",
    "$$\n",
    "\\min_{\\alpha,\\beta} \\sum_{i=1}^{N}{\\hat{e}_i^2} = \\min_{\\alpha,\\beta} \\sum_{i=1}^{N}{(y_i - \\alpha - \\beta x_i)^2}\n",
    "$$\n",
    "\n",
    "Let’s call this a cost function\n",
    "\n",
    "$$\n",
    "C = \\sum_{i=1}^{N}{(y_i - \\alpha - \\beta x_i)^2}\n",
    "$$\n",
    "\n",
    "that we would like to minimize with parameters $ \\alpha $ and $ \\beta $."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ed686bb",
   "metadata": {},
   "source": [
    "## How does error change with respect to $ \\alpha $ and $ \\beta $\n",
    "\n",
    "Let us first look at how the total error changes with respect to $ \\beta $ (holding the intercept $ \\alpha $ constant)\n",
    "\n",
    "We know from [the next section](#slr-optimal-values) the optimal values for $ \\alpha $ and $ \\beta $  are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1e02f2e",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "β_optimal = 64.38\n",
    "α_optimal = -14.72"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f76a3cf",
   "metadata": {},
   "source": [
    "We can then calculate the error for a range of $ \\beta $ values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b27ad27f",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "errors = {}\n",
    "for β in np.arange(20,100,0.5):\n",
    "    errors[β] = abs((α_optimal + β * df['X']) - df['Y']).sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "abb307ee",
   "metadata": {},
   "source": [
    "Plotting the error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "168db300",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "ax = pd.Series(errors).plot(xlabel='β', ylabel='error')\n",
    "plt.axvline(β_optimal, color='r');"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff1b1ada",
   "metadata": {},
   "source": [
    "Now let us vary $ \\alpha $ (holding $ \\beta $ constant)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "820eaac0",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "errors = {}\n",
    "for α in np.arange(-500,500,5):\n",
    "    errors[α] = abs((α + β_optimal * df['X']) - df['Y']).sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c6955f1",
   "metadata": {},
   "source": [
    "Plotting the error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4e436fa",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "ax = pd.Series(errors).plot(xlabel='α', ylabel='error')\n",
    "plt.axvline(α_optimal, color='r');"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b45833fd",
   "metadata": {},
   "source": [
    "\n",
    "<a id='slr-optimal-values'></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49f8a513",
   "metadata": {},
   "source": [
    "## Calculating optimal values\n",
    "\n",
    "Now let us use calculus to solve the optimization problem and compute the optimal values for $ \\alpha $ and $ \\beta $ to find the ordinary least squares solution.\n",
    "\n",
    "First taking the partial derivative with respect to $ \\alpha $\n",
    "\n",
    "$$\n",
    "\\frac{\\partial C}{\\partial \\alpha}[\\sum_{i=1}^{N}{(y_i - \\alpha - \\beta x_i)^2}]\n",
    "$$\n",
    "\n",
    "and setting it equal to $ 0 $\n",
    "\n",
    "$$\n",
    "0 = \\sum_{i=1}^{N}{-2(y_i - \\alpha - \\beta x_i)}\n",
    "$$\n",
    "\n",
    "we can remove the constant $ -2 $ from the summation by dividing both sides by $ -2 $\n",
    "\n",
    "$$\n",
    "0 = \\sum_{i=1}^{N}{(y_i - \\alpha - \\beta x_i)}\n",
    "$$\n",
    "\n",
    "Now we can split this equation up into the components\n",
    "\n",
    "$$\n",
    "0 = \\sum_{i=1}^{N}{y_i} - \\sum_{i=1}^{N}{\\alpha} - \\beta \\sum_{i=1}^{N}{x_i}\n",
    "$$\n",
    "\n",
    "The middle term is a straight forward sum from $ i=1,...N $ by a constant $ \\alpha $\n",
    "\n",
    "$$\n",
    "0 = \\sum_{i=1}^{N}{y_i} - N*\\alpha - \\beta \\sum_{i=1}^{N}{x_i}\n",
    "$$\n",
    "\n",
    "and rearranging terms\n",
    "\n",
    "$$\n",
    "\\alpha = \\frac{\\sum_{i=1}^{N}{y_i} - \\beta \\sum_{i=1}^{N}{x_i}}{N}\n",
    "$$\n",
    "\n",
    "We observe that both fractions resolve to the means $ \\bar{y_i} $ and $ \\bar{x_i} $\n",
    "\n",
    "\n",
    "<a id='equation-eq-optimal-alpha'></a>\n",
    "$$\n",
    "\\alpha = \\bar{y_i} - \\beta\\bar{x_i} \\tag{45.1}\n",
    "$$\n",
    "\n",
    "Now let’s take the partial derivative of the cost function $ C $ with respect to $ \\beta $\n",
    "\n",
    "$$\n",
    "\\frac{\\partial C}{\\partial \\beta}[\\sum_{i=1}^{N}{(y_i - \\alpha - \\beta x_i)^2}]\n",
    "$$\n",
    "\n",
    "and setting it equal to $ 0 $\n",
    "\n",
    "$$\n",
    "0 = \\sum_{i=1}^{N}{-2 x_i (y_i - \\alpha - \\beta x_i)}\n",
    "$$\n",
    "\n",
    "we can again take the constant outside of the summation and divide both sides by $ -2 $\n",
    "\n",
    "$$\n",
    "0 = \\sum_{i=1}^{N}{x_i (y_i - \\alpha - \\beta x_i)}\n",
    "$$\n",
    "\n",
    "which becomes\n",
    "\n",
    "$$\n",
    "0 = \\sum_{i=1}^{N}{(x_i y_i - \\alpha x_i - \\beta x_i^2)}\n",
    "$$\n",
    "\n",
    "now substituting for $ \\alpha $\n",
    "\n",
    "$$\n",
    "0 = \\sum_{i=1}^{N}{(x_i y_i - (\\bar{y_i} - \\beta \\bar{x_i}) x_i - \\beta x_i^2)}\n",
    "$$\n",
    "\n",
    "and rearranging terms\n",
    "\n",
    "$$\n",
    "0 = \\sum_{i=1}^{N}{(x_i y_i - \\bar{y_i} x_i - \\beta \\bar{x_i} x_i - \\beta x_i^2)}\n",
    "$$\n",
    "\n",
    "This can be split into two summations\n",
    "\n",
    "$$\n",
    "0 = \\sum_{i=1}^{N}(x_i y_i - \\bar{y_i} x_i) + \\beta \\sum_{i=1}^{N}(\\bar{x_i} x_i - x_i^2)\n",
    "$$\n",
    "\n",
    "and solving for $ \\beta $ yields\n",
    "\n",
    "\n",
    "<a id='equation-eq-optimal-beta'></a>\n",
    "$$\n",
    "\\beta = \\frac{\\sum_{i=1}^{N}(x_i y_i - \\bar{y_i} x_i)}{\\sum_{i=1}^{N}(x_i^2 - \\bar{x_i} x_i)} \\tag{45.2}\n",
    "$$\n",
    "\n",
    "We can now use [(45.1)](#equation-eq-optimal-alpha) and [(45.2)](#equation-eq-optimal-beta) to calculate the optimal values for $ \\alpha $ and $ \\beta $\n",
    "\n",
    "Calculating $ \\beta $"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b789a07c",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df = df[['X','Y']].copy()  # Original Data\n",
    "\n",
    "# Calculate the sample means\n",
    "x_bar = df['X'].mean()\n",
    "y_bar = df['Y'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5dad8190",
   "metadata": {},
   "source": [
    "Now computing across the 10 observations and then summing the numerator and denominator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be207fda",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "# Compute the Sums\n",
    "df['num'] = df['X'] * df['Y'] - y_bar * df['X']\n",
    "df['den'] = pow(df['X'],2) - x_bar * df['X']\n",
    "β = df['num'].sum() / df['den'].sum()\n",
    "print(β)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af63a7de",
   "metadata": {},
   "source": [
    "Calculating $ \\alpha $"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2bd714de",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "α = y_bar - β * x_bar\n",
    "print(α)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b06749c",
   "metadata": {},
   "source": [
    "Now we can plot the OLS solution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ddd3fb0",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df['Y_hat'] = α + β * df['X']\n",
    "df['error'] = df['Y_hat'] - df['Y']\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "ax = df.plot(x='X',y='Y', kind='scatter', ax=ax)\n",
    "ax = df.plot(x='X',y='Y_hat', kind='line', ax=ax, color='g')\n",
    "plt.vlines(df['X'], df['Y_hat'], df['Y'], color='r');"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "beaef9f5",
   "metadata": {},
   "source": [
    "## Exercise 45.1\n",
    "\n",
    "Now that you know the equations that solve the simple linear regression model using OLS you can now run your own regressions to build a model between $ y $ and $ x $.\n",
    "\n",
    "Let’s consider two economic variables GDP per capita and Life Expectancy.\n",
    "\n",
    "1. What do you think their relationship would be?  \n",
    "1. Gather some data [from our world in data](https://ourworldindata.org)  \n",
    "1. Use `pandas` to import the `csv` formatted data and plot a few different countries of interest  \n",
    "1. Use [(45.1)](#equation-eq-optimal-alpha) and [(45.2)](#equation-eq-optimal-beta) to compute optimal values for  $ \\alpha $ and $ \\beta $  \n",
    "1. Plot the line of best fit found using OLS  \n",
    "1. Interpret the coefficients and write a summary sentence of the relationship between GDP per capita and Life Expectancy  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90d35d18",
   "metadata": {},
   "source": [
    "## Solution to[ Exercise 45.1](https://intro.quantecon.org/#slr-ex1)\n",
    "\n",
    "**Q2:** Gather some data [from our world in data](https://ourworldindata.org)\n",
    "\n",
    "You can download <a href=https://github.com/QuantEcon/lecture-python-intro/raw/main/lectures/_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv download>a copy of the data here</a> if you get stuck\n",
    "\n",
    "**Q3:** Use `pandas` to import the `csv` formatted data and plot a few different countries of interest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e778cc4",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "data_url = \"https://github.com/QuantEcon/lecture-python-intro/raw/main/lectures/_static/lecture_specific/simple_linear_regression/life-expectancy-vs-gdp-per-capita.csv\"\n",
    "df = pd.read_csv(data_url, nrows=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "000ace03",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f366f4c",
   "metadata": {},
   "source": [
    "You can see that the data downloaded from Our World in Data has provided a global set of countries with the GDP per capita and Life Expectancy Data.\n",
    "\n",
    "It is often a good idea to at first import a few lines of data from a csv to understand its structure so that you can then choose the columns that you want to read into your DataFrame.\n",
    "\n",
    "You can observe that there are a bunch of columns we won’t need to import such as `Continent`\n",
    "\n",
    "So let’s built a list of the columns we want to import"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e64075e6",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "cols = ['Code', 'Year', 'Life expectancy at birth (historical)', 'GDP per capita']\n",
    "df = pd.read_csv(data_url, usecols=cols)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5bf83640",
   "metadata": {},
   "source": [
    "Sometimes it can be useful to rename your columns to make it easier to work with in the DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15e70a2f",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df.columns = [\"cntry\", \"year\", \"life_expectancy\", \"gdppc\"]\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36af1849",
   "metadata": {},
   "source": [
    "We can see there are `NaN` values which represents missing data so let us go ahead and drop those"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6461e89",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df.dropna(inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4787f83b",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28ebdda1",
   "metadata": {},
   "source": [
    "We have now dropped the number of rows in our DataFrame from 62156 to 12445 removing a lot of empty data relationships.\n",
    "\n",
    "Now we have a dataset containing life expectancy and GDP per capita for a range of years.\n",
    "\n",
    "It is always a good idea to spend a bit of time understanding what data you actually have.\n",
    "\n",
    "For example, you may want to explore this data to see if there is consistent reporting for all countries across years\n",
    "\n",
    "Let’s first look at the Life Expectancy Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8199032",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "le_years = df[['cntry', 'year', 'life_expectancy']].set_index(['cntry', 'year']).unstack()['life_expectancy']\n",
    "le_years"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cad69531",
   "metadata": {},
   "source": [
    "As you can see there are a lot of countries where data is not available for the Year 1543!\n",
    "\n",
    "Which country does report this data?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34df3119",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "le_years[~le_years[1543].isna()]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30faafe2",
   "metadata": {},
   "source": [
    "You can see that Great Britain (GBR) is the only one available\n",
    "\n",
    "You can also take a closer look at the time series to find that it is also non-continuous, even for GBR."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "949f94fa",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "le_years.loc['GBR'].plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2fcc6076",
   "metadata": {},
   "source": [
    "In fact we can use pandas to quickly check how many countries are captured in each year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e07b5918",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "le_years.stack().unstack(level=0).count(axis=1).plot(xlabel=\"Year\", ylabel=\"Number of countries\");"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bde0df04",
   "metadata": {},
   "source": [
    "So it is clear that if you are doing cross-sectional comparisons then more recent data will include a wider set of countries\n",
    "\n",
    "Now let us consider the most recent year in the dataset 2018"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ecf1e1aa",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df = df[df.year == 2018].reset_index(drop=True).copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6d920bc0",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df.plot(x='gdppc', y='life_expectancy', kind='scatter',  xlabel=\"GDP per capita\", ylabel=\"Life expectancy (years)\",);"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1510bf4d",
   "metadata": {},
   "source": [
    "This data shows a couple of interesting relationships.\n",
    "\n",
    "1. there are a number of countries with similar GDP per capita levels but a wide range in Life Expectancy  \n",
    "1. there appears to be a positive relationship between GDP per capita and life expectancy. Countries with higher GDP per capita tend to have higher life expectancy outcomes  \n",
    "\n",
    "\n",
    "Even though OLS is solving linear equations – one option we have is to transform the variables, such as through a log transform, and then use OLS to estimate the transformed variables.\n",
    "\n",
    "By specifying `logx` you can plot the GDP per Capita data on a log scale"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "832e37c2",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df.plot(x='gdppc', y='life_expectancy', kind='scatter',  xlabel=\"GDP per capita\", ylabel=\"Life expectancy (years)\", logx=True);"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64aaaa47",
   "metadata": {},
   "source": [
    "As you can see from this transformation – a linear model fits the shape of the data more closely."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e0ee642",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df['log_gdppc'] = df['gdppc'].apply(np.log10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa951567",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31360c8e",
   "metadata": {},
   "source": [
    "**Q4:** Use [(45.1)](#equation-eq-optimal-alpha) and [(45.2)](#equation-eq-optimal-beta) to compute optimal values for  $ \\alpha $ and $ \\beta $"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1dce64d",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "data = df[['log_gdppc', 'life_expectancy']].copy()  # Get Data from DataFrame\n",
    "\n",
    "# Calculate the sample means\n",
    "x_bar = data['log_gdppc'].mean()\n",
    "y_bar = data['life_expectancy'].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1fd3437d",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65be188b",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "# Compute the Sums\n",
    "data['num'] = data['log_gdppc'] * data['life_expectancy'] - y_bar * data['log_gdppc']\n",
    "data['den'] = pow(data['log_gdppc'],2) - x_bar * data['log_gdppc']\n",
    "β = data['num'].sum() / data['den'].sum()\n",
    "print(β)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7efbd8c",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "α = y_bar - β * x_bar\n",
    "print(α)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4bab096f",
   "metadata": {},
   "source": [
    "**Q5:** Plot the line of best fit found using OLS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ff27930",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "data['life_expectancy_hat'] = α + β * df['log_gdppc']\n",
    "data['error'] = data['life_expectancy_hat'] - data['life_expectancy']\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "data.plot(x='log_gdppc',y='life_expectancy', kind='scatter', ax=ax)\n",
    "data.plot(x='log_gdppc',y='life_expectancy_hat', kind='line', ax=ax, color='g')\n",
    "plt.vlines(data['log_gdppc'], data['life_expectancy_hat'], data['life_expectancy'], color='r')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ebf4f93",
   "metadata": {},
   "source": [
    "## Exercise 45.2\n",
    "\n",
    "Minimizing the sum of squares is not the **only** way to generate the line of best fit.\n",
    "\n",
    "For example, we could also consider minimizing the sum of the **absolute values**, that would give less weight to outliers.\n",
    "\n",
    "Solve for $ \\alpha $ and $ \\beta $ using the least absolute values"
   ]
  }
 ],
 "metadata": {
  "date": 1745476283.2456524,
  "filename": "simple_linear_regression.md",
  "kernelspec": {
   "display_name": "Python",
   "language": "python3",
   "name": "python3"
  },
  "title": "Simple Linear Regression Model"
 },
 "nbformat": 4,
 "nbformat_minor": 5
}