2025-04-01
We’ve learned how to test hypotheses about:
What if we hypothesize a relationship between variables? What if a relationship between two variables depends on a third?
\[ y=\beta_0 + \beta_1 x + \epsilon \\ \epsilon \sim \mathcal{N}(0,\sigma^2) \]
(pronounced y equals beta nought plus beta one x plus epsilon)
We observe sample data and use it to estimate the preceding equation. The estimated regression equation is
\[ \hat{y}=b_0+b_1x \]
Notice that there’s no epsilon. We assume that the mean of the epsilon values is zero. (And we’ll learn how to check that later.)
\(x\): explanatory variable, predictor, input, covariate
\(y\): response, output, outcome
The term epsilon or \(\epsilon\) in the idealized regression equation refers to the divergence of each point from what the model predicts. It’s the pink line in our initial pictures. In the estimated regression model above, the divergence of a single observed point from its predicted value is called a residual.
Think back to the first picture (reproduced below). The residuals are the key to assessing the model.
The outcome of interest is the length of a possum’s head.
What if we only have an intercept?
Call:
lm(formula = head_l ~ 1, data = possum)
Residuals:
Min 1Q Median 3Q Max
-10.1029 -1.9279 0.1971 2.1221 10.4971
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 92.6029 0.3504 264.3 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.573 on 103 degrees of freedom
One Sample t-test
data: possum$head_l
t = 264.28, df = 103, p-value < 0.00000000000000022
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
91.90796 93.29781
sample estimates:
mean of x
92.60288
The intercept is just the mean! The t-statistic is the same.
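The equivalence can be checked on any numeric vector. The following is a minimal sketch in base R that uses simulated data as a stand-in for head length (the vector `y` and its parameters are assumptions, not the possum data):

```r
# Hypothetical stand-in for a numeric outcome such as head length
set.seed(1)
y <- rnorm(50, mean = 90, sd = 3)

fit0 <- lm(y ~ 1)   # intercept-only model
tt   <- t.test(y)   # one-sample t-test against a true mean of 0

intercept <- unname(coef(fit0)[1])
t_lm      <- summary(fit0)$coefficients[1, "t value"]
t_ttest   <- unname(tt$statistic)

# The intercept matches the sample mean, and the two t statistics match
print(c(intercept = intercept, mean = mean(y)))
print(c(t_lm = t_lm, t_ttest = t_ttest))
```

Both procedures divide the sample mean by its standard error, which is why the numbers agree.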
Don’t forget about \(\epsilon\). Plotting the residuals against the predictions helps us understand what’s going on.
Don’t forget about \(\epsilon\)! Let’s also check the relationship between residuals and the data.
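A residual check of this kind might look like the sketch below. The data are simulated (the names `x` and `y` and the generating values are assumptions chosen only for illustration), and the plotting choices are one option among many:

```r
# Simulated data; the generating line and noise level are illustrative assumptions
set.seed(2)
x <- runif(60, min = 75, max = 97)
y <- 41 + 0.59 * x + rnorm(60, sd = 2.5)

fit  <- lm(y ~ x)
res  <- resid(fit)    # observed minus predicted
pred <- fitted(fit)   # predicted values

# With an intercept in the model, least squares residuals average to zero
# and are uncorrelated with the predictions
mean_res     <- mean(res)
cor_res_pred <- cor(res, pred)
print(c(mean_res = mean_res, cor_res_pred = cor_res_pred))

# Residuals against predictions, then residuals against the predictor
plot(pred, res, xlab = "Predicted", ylab = "Residual")
abline(h = 0, lty = 2)
plot(x, res, xlab = "x", ylab = "Residual")
abline(h = 0, lty = 2)
```

A visible pattern in either plot, such as curvature or a funnel shape, suggests that the linear model is missing something.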
The textbook gives the regression equation for a model predicting a brushtail possum’s head length given its overall length as follows:
\[ \hat{y}=41+0.59x \]
In other words, the predicted head length is 41 mm plus 0.59 times the total length.
Correlation is a number in the interval \([-1,1]\) describing the strength of the linear association between two variables. The most common measure of correlation (and the only one our textbook bothers to mention) is Pearson’s correlation coefficient, \(r\).
\[r = \frac{1}{n-1}\sum_{i=1}^{n}\frac{x_i-\bar{x}}{s_x}\frac{y_i-\bar{y}}{s_y}\]
\(r=-1\) means that two variables are perfectly negatively correlated.
\(r=0\) means that two variables are completely uncorrelated.
\(r=1\) means that two variables are perfectly positively correlated.
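The formula for \(r\) can be evaluated directly and compared with R’s built-in `cor()`. This is a small sketch on simulated data (the vectors `x` and `y` are assumptions):

```r
# Simulated pair of variables with a positive linear association
set.seed(3)
x <- rnorm(40)
y <- 2 * x + rnorm(40)
n <- length(x)

# Sum of products of standardized values, divided by n - 1
r_manual  <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)
r_builtin <- cor(x, y)   # Pearson is the default method

print(c(r_manual = r_manual, r_builtin = r_builtin))
```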
How did we arrive at the estimates for slope and intercept? We used a time-honored technique called least squares, which has been around for about two hundred years. It consists of minimizing the sum of squares of the residuals, which is often abbreviated as SSR, SSE, or RSS. To use this technique, we have to make some assumptions and can then use two equations to find the slope and intercept.
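\[R^2 = 1 - \frac{SSR}{SST} \\ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2} \]
Here SSR is the sum of squared residuals and SST is the total sum of squares of \(y\) around its mean.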
\(R^2\) is called the coefficient of determination. It quantifies the proportion of the variance in \(y\) that the model explains.
If there’s only one predictor variable, \(R^2 = r^2\), so \(R = |r|\).
Keep in mind that, in R output, \(R\) always refers to the square root of the multiple coefficient of determination, which is reported as Multiple R-squared.
\(r\) is in the interval \([-1,1]\), but \(R^2\) is in the interval \([0,1]\) because it represents the proportion of variability in the data that is explained by the model. The adjusted \(R^2\) accounts for a problem with \(R^2\) that we will discuss later. In the case of simple linear regression, the two barely differ.
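As a check, \(R^2\) can be computed as one minus the ratio of the sum of squared residuals to the total sum of squares of \(y\) around its mean, then compared with what `summary()` reports. The sketch below uses simulated data, and the names `ssr` and `sst` are labels chosen here:

```r
# Simulated data for a one-predictor model
set.seed(4)
x <- rnorm(80, mean = 87, sd = 4)
y <- 43 + 0.57 * x + rnorm(80, sd = 2.6)

fit <- lm(y ~ x)

ssr <- sum(resid(fit)^2)       # sum of squared residuals
sst <- sum((y - mean(y))^2)    # total sum of squares around the mean

r2_manual  <- 1 - ssr / sst
r2_summary <- summary(fit)$r.squared
r2_from_r  <- cor(x, y)^2      # with one predictor, R^2 equals r^2
r2_adj     <- summary(fit)$adj.r.squared

print(c(manual = r2_manual, summary = r2_summary,
        r_squared = r2_from_r, adjusted = r2_adj))
```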
The best way to fix residual issues is to improve model fit to the data. We can also use robust standard errors.
With two variables, least squares has closed-form equations for the slope and intercept.
\[b_1=\frac{s_y}{s_x}r\]
\[b_0=\bar{y}-b_1\bar{x}\]
Slope: how much \(y\) grows or shrinks for a one-unit increase in \(x\)
Intercept: how large \(y\) would be if \(x\) were 0 (only works if \(x\) can be zero)
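The two equations can be applied by hand and compared with the coefficients from `lm()`. A minimal sketch on simulated data (all names and generating values are assumptions):

```r
# Simulated data for illustration
set.seed(5)
x <- rnorm(100, mean = 87, sd = 4)
y <- 43 + 0.57 * x + rnorm(100, sd = 2.6)

# Slope from the ratio of standard deviations times the correlation
b1 <- sd(y) / sd(x) * cor(x, y)
# Intercept chosen so the line passes through (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1))
print(coef(fit))
```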
For linear regression, using categorical variables as \(y\) is fraught (we’ll address this later), but we often use them as \(x\). The textbook gives an example of sales of Mario Kart game cartridges, where the variable cond takes on the categories new and used. The following frame is a regression of total price (total_pr) on condition (cond).
Call:
lm(formula = mariokart$total_pr ~ mariokart$cond)
Residuals:
Min 1Q Median 3Q Max
-18.168 -7.771 -3.148 1.857 279.362
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 53.771 3.329 16.153 <0.0000000000000002 ***
mariokart$condused -6.623 4.343 -1.525 0.13
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.57 on 141 degrees of freedom
Multiple R-squared: 0.01622, Adjusted R-squared: 0.009244
F-statistic: 2.325 on 1 and 141 DF, p-value: 0.1296
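With a two-level categorical predictor, the intercept is the mean of the reference level and the slope is the difference between the two group means. The sketch below illustrates this on simulated prices (the variables `cond` and `price` and their values are assumptions, not the mariokart data):

```r
# Simulated two-group data; R picks "new" as the reference level
# because factor levels are ordered alphabetically by default
set.seed(6)
cond  <- factor(rep(c("new", "used"), times = c(30, 40)))
price <- ifelse(cond == "new", 54, 47) + rnorm(70, sd = 5)

fit   <- lm(price ~ cond)
means <- tapply(price, cond, mean)

intercept  <- coef(fit)[["(Intercept)"]]
slope_used <- coef(fit)[["condused"]]
mean_new   <- unname(means["new"])
mean_used  <- unname(means["used"])

# Intercept = mean of new; slope = mean of used minus mean of new
print(c(intercept = intercept, mean_new = mean_new))
print(c(slope_used = slope_used, difference = mean_used - mean_new))
```

Read this way, the coefficient labelled `mariokart$condused` in the output above is the estimated difference in mean total price between used and new cartridges.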
Points far from the horizontal center have high leverage, in the sense that they can pull the regression line up or down more forcefully than can outliers near the horizontal center.
High leverage points that actually exercise this leverage and pull the regression line out of position are called influential points.
It can be dangerous to remove these points from analysis for reasons explored in a book called The Black Swan by Nassim Nicholas Taleb.
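Base R provides `hatvalues()` for leverage and `cooks.distance()` as one common measure of influence. The following sketch plants a single point far from the horizontal center and off the trend in simulated data (the data frame `d` and its values are assumptions):

```r
# 30 ordinary points plus one point far to the right and well below the trend
set.seed(7)
x <- c(rnorm(30, mean = 10, sd = 1), 25)
y <- c(2 + 0.5 * x[1:30] + rnorm(30, sd = 0.5), 2)
d <- data.frame(x = x, y = y)

fit_all  <- lm(y ~ x, data = d)
fit_drop <- lm(y ~ x, data = d[-31, ])   # refit without the planted point

lev  <- hatvalues(fit_all)        # leverage of each observation
cook <- cooks.distance(fit_all)   # influence of each observation

top_leverage  <- unname(which.max(lev))
top_influence <- unname(which.max(cook))
slope_all     <- coef(fit_all)[["x"]]
slope_drop    <- coef(fit_drop)[["x"]]

print(c(top_leverage = top_leverage, top_influence = top_influence))
print(c(slope_all = slope_all, slope_drop = slope_drop))
```

Comparing the two slopes shows how far one point can move the line. The refit is a diagnostic here, not a recommendation to delete the point.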
Let’s examine the models we generated previously.
Call:
lm(formula = head_l ~ total_l, data = possum)
Residuals:
Min 1Q Median 3Q Max
-7.1877 -1.5340 -0.3345 1.2788 7.3968
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.70979 5.17281 8.257 0.000000000000565704 ***
total_l 0.57290 0.05933 9.657 0.000000000000000468 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.595 on 102 degrees of freedom
Multiple R-squared: 0.4776, Adjusted R-squared: 0.4725
F-statistic: 93.26 on 1 and 102 DF, p-value: 0.0000000000000004681
There are five columns in the coefficients table: the term name, the estimate, its standard error, the t value, and the p-value, Pr(>|t|).
The coefficients table tells us all we need to know to conduct a hypothesis test concerning an estimate, where the hypothesis is typically whether the true value of the coefficient is zero. If it is zero, the input variable associated with the coefficient is not linearly related to the output or response variable.
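The t value and p-value columns can be rebuilt from the estimate and its standard error, which makes the test explicit. A sketch on simulated data (names and values are assumptions):

```r
# Simulated data for illustration
set.seed(8)
x <- rnorm(100, mean = 87, sd = 4)
y <- 43 + 0.57 * x + rnorm(100, sd = 2.6)

fit <- lm(y ~ x)
tab <- summary(fit)$coefficients   # the coefficients table as a matrix

# t value: estimate divided by its standard error (null value is zero)
t_manual <- tab[, "Estimate"] / tab[, "Std. Error"]
# Two-sided p-value from the t distribution with residual degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

print(cbind(t_manual, t_table = tab[, "t value"]))
print(cbind(p_manual, p_table = tab[, "Pr(>|t|)"]))
```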
By default, R gives a ninety-five percent confidence interval for each coefficient. Notice that the interval for condition: used includes zero, so we cannot rule out that the true coefficient is zero.
\[b_i \pm t^*_{df} \times SE_{b_i}\]
For the mariokart model, \(t^*_{df}\) can be found for the ninety-five percent confidence interval by
and \(b_i\) and \(SE_{b_i}\) are given in the coefficients table.
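One way to carry out this calculation is with `qt()`, using the estimate, standard error, and degrees of freedom printed in the mariokart output above. This is a sketch of the arithmetic, not necessarily the code used on the original slide:

```r
# Values copied from the coefficients table for mariokart$condused
b  <- -6.623   # estimate
se <- 4.343    # standard error
df <- 141      # residual degrees of freedom

# Critical value that leaves 2.5 percent in each tail
t_star <- qt(0.975, df = df)

# b_i plus or minus t* times SE
ci <- b + c(-1, 1) * t_star * se
print(t_star)
print(ci)
```

The lower limit is negative and the upper limit is positive, so the interval includes zero. When the fitted model object is available, `confint()` returns the same kind of interval for every coefficient.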
END
This slideshow was produced using quarto
Fonts are Roboto Condensed Bold, JetBrains Mono Nerd Font, and STIX2
