Stats:
Linear Regression

Nathan TeBlunthuis

2025-04-01

Linear Regression

Finally, the heart of the course

Why linear regression?

We’ve learned how to test hypotheses about:

  • Values of a single variable (i.e., \(\hat{\mu}\), \(\hat{\sigma}\), skewness).
  • Differences between two variables (i.e., \(T\), \(z\), \(\chi^2\) tests).
  • Differences in sets of variables (i.e., \(\chi^2\) tests, ANOVA).

What if we hypothesize a relationship between variables? What if a relationship between two variables depends on a third?

Linear regression formalized.

\[ y=\beta_0 + \beta_1 x + \epsilon \\ \epsilon \sim \mathcal{N}(0,\sigma^2) \]

(pronounced y equals beta nought plus beta one x plus epsilon)

What we actually see

We see sample data and use it to estimate the preceding equation. The estimated regression equation is

\[ \hat{y}=b_0+b_1x \]

Notice that there’s no epsilon. We assume that the mean of the epsilon values is zero. (And we’ll learn how to check that later.)
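That assumption is easy to see in practice. Here is a minimal sketch with made-up toy data showing that the residuals of any model fit by lm() with an intercept average to (numerically) zero:

```r
# Hypothetical toy data; any line fit by lm() with an intercept
# produces residuals whose mean is numerically zero
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
m.toy <- lm(y ~ x)
stopifnot(abs(mean(residuals(m.toy))) < 1e-10)
```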

Names of the variables

\(x\): explanatory variable, predictor, input, covariate

\(y\): response, output, outcome

Remember epsilon? Now a new term, residual

The term epsilon, \(\epsilon\), in the idealized regression equation refers to the divergence of each point from what the model predicts. It’s the pink line in our initial pictures. In an estimated regression model, the divergence of a single point is called a residual.

Assessing a model with residuals

Think back to the first picture (reproduced below). The residuals are the key to assessing the model.

Example from textbook: brushtail possums

library(ggplot2)
library(ggfortify) # useful for plotting residuals
theme_set(theme_minimal())
load("~/class/data/possum.rda")

The outcome of interest is the length of a possum’s head.

Building a model step-by-step

What if we only have an intercept?

m <- lm(head_l ~ 1, possum)
print(summary(m))

Call:
lm(formula = head_l ~ 1, data = possum)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.1029  -1.9279   0.1971   2.1221  10.4971 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  92.6029     0.3504   264.3 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.573 on 103 degrees of freedom

An intercept-only model is a t-test

print(t.test(possum$head_l))

    One Sample t-test

data:  possum$head_l
t = 264.28, df = 103, p-value < 0.00000000000000022
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 91.90796 93.29781
sample estimates:
mean of x 
 92.60288 

The intercept is just the mean! The t-statistic is the same.
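This equivalence is easy to verify on any vector. A small sketch with made-up numbers:

```r
# Hypothetical data: the intercept-only estimate equals the sample mean
x <- c(88, 92, 95, 91, 97)
m.int <- lm(x ~ 1)
stopifnot(all.equal(unname(coef(m.int)), mean(x)))
```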

Checking residuals

Don’t forget about \(\epsilon\). Looking at the relationship between the residuals and the predictions helps us understand what’s going on.

mean.resid <- mean(m$residuals)
p1 <- ggplot(m, aes(.fitted, .resid)) + # .fitted and .resid are on the model object
    geom_point() +
    geom_hline(yintercept = mean.resid)

print(p1)
  • The model always predicts the mean.
  • The mean of the residuals is 0.

Comparing residuals to data

Don’t forget about \(\epsilon\)! Let’s also check the relationship between residuals and the data.

p2 <- ggplot(m, aes(possum$head_l, .resid)) +
    geom_point() +
    geom_smooth()
print(p2)
  • The residual is just the data minus the mean aka “demeaned” data.
  • This means the model explains 0 variance.
  • There’s no relationship between a constant (i.e., intercept) and data.

Analyzing a relationship between two variables

The textbook gives the regression equation for a model predicting a brushtail possum’s head length given its total length as follows:

\[ \hat{y}=41+0.59x \]

In other words, the predicted head length is 41 mm plus 0.59 times the total length.

Using R to find the equation

m2 <- lm(head_l ~ total_l, possum)
print(m2)

Call:
lm(formula = head_l ~ total_l, data = possum)

Coefficients:
(Intercept)      total_l  
    42.7098       0.5729  

Convenient way to examine residuals

autoplot(m2)

Correlation

Correlation is a number in the interval \([-1,1]\) describing the strength of the linear association between two variables. The most common measure of correlation (and the only one our textbook bothers to mention) is Pearson’s correlation coefficient, \(r\).

\[r = \frac{1}{n-1}\sum_{i=1}^{n}\frac{x_i-\bar{x}}{s_x}\frac{y_i-\bar{y}}{s_y}\]
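We can check this formula against R’s built-in cor() function. A sketch with made-up toy vectors:

```r
# Pearson's r computed from the definition, compared with cor()
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
n <- length(x)
# sd() uses the n - 1 denominator, matching the formula above
r.manual <- sum((x - mean(x)) / sd(x) * (y - mean(y)) / sd(y)) / (n - 1)
stopifnot(all.equal(r.manual, cor(x, y)))
```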

Values of \(r\)

\(r=-1\) means that two variables are perfectly negatively correlated.

\(r=0\) means that two variables are completely uncorrelated.

\(r=1\) means that two variables are perfectly positively correlated.

Least squares

How did we arrive at the estimates for slope and intercept? We used a time-honored technique called least squares, which has been around for about two hundred years. It consists of minimizing the sum of squares of the residuals, which are often abbreviated as SSR, SSE, or RSS. To use this technique, we have to make some assumptions and can then use two equations to find the slope and intercept.

We can quantify how well the model fits the data.

\[ R^2 = 1 - \frac{SSR}{SS_{total}} \\ R^2 = 1 - \frac{\sum{\epsilon_i^2}}{\sum{(y_i-\bar{y})^2}} \]

It’s called the coefficient of determination. It quantifies how much variance the model explains.

If there’s only one predictor variable, \(R^2 = r^2\).

Keep in mind that the multiple correlation coefficient \(R\) is the square root of the multiple coefficient of determination, \(R^2\).
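A sketch with made-up toy data, checking the definition of \(R^2\) against the r.squared value that summary() reports:

```r
# R^2 = 1 - SSE/SST, verified against summary() (hypothetical data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
m.fit <- lm(y ~ x)
sse <- sum(residuals(m.fit)^2)        # sum of squared residuals
sst <- sum((y - mean(y))^2)           # total sum of squares
stopifnot(all.equal(1 - sse / sst, summary(m.fit)$r.squared))
# With one predictor, R^2 is the square of Pearson's r
stopifnot(all.equal(summary(m.fit)$r.squared, cor(x, y)^2))
```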

The most common measure of fit

\(r\) is in the interval \([-1,1]\), but \(R^2\) is in the interval \([0,1]\), so \(R^2\), the multiple coefficient of determination, represents the proportion of variability in the data that the model explains. The adjusted \(R^2\) accounts for a problem with \(R^2\) that we will discuss later; in the case of simple linear regression there is almost no difference.

Assumptions

  • linearity: the data should follow a linear trend—there are advanced regression methods for non-linear relationships
  • normality of residuals: The residuals are approximately normally distributed (evaluated with a QQ plot).
  • constant variability: The residuals don’t follow a pattern (the most common being a right-facing trumpet or funnel)
  • independent random observations: Least squares is usually not applied to seasonal data, for example, because its structure is better modeled as a time series

The best way to fix residual issues is to improve model fit to the data. We can also use robust standard errors.

Least squares equations

With 2 variables, we have analytic equations for linear regression.

\[b_1=\frac{s_y}{s_x}r\]

\[b_0=\bar{y}-b_1\bar{x}\]

Calculating linear regression

(b1 <- sd(possum$head_l)/sd(possum$total_l)*cor(possum$head_l, possum$total_l))
[1] 0.5729013
(b0 <- mean(possum$head_l) - b1*mean(possum$total_l))
[1] 42.70979

Interpreting slope and intercept

Slope: how much \(y\) grows or shrinks for a one-unit increase in \(x\)

Intercept: how large \(y\) would be if \(x\) were 0 (only works if \(x\) can be zero)

Example of \(R^2\) for brushtail possum regression

print(summary(m)[['r.squared']])
[1] 0
print(summary(m2)[['r.squared']])
[1] 0.4776105

Categorical variables with two categories

For linear regression, using categorical variables as \(y\) is fraught (we’ll address this later), but we often use them as \(x\). The textbook gives an example of sales of Mario Kart game cartridges, where the variable cond takes on the categories new and used. The following frame is a regression of total price (total_pr) on condition (cond).

Categorical variable example

load("mariokart.rda")
(m <- lm(mariokart$total_pr ~ mariokart$cond))

Call:
lm(formula = mariokart$total_pr ~ mariokart$cond)

Coefficients:
       (Intercept)  mariokart$condused  
            53.771              -6.623  

Summary of the preceding model

summary(m)

Call:
lm(formula = mariokart$total_pr ~ mariokart$cond)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.168  -7.771  -3.148   1.857 279.362 

Coefficients:
                   Estimate Std. Error t value            Pr(>|t|)    
(Intercept)          53.771      3.329  16.153 <0.0000000000000002 ***
mariokart$condused   -6.623      4.343  -1.525                0.13    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.57 on 141 degrees of freedom
Multiple R-squared:  0.01622,   Adjusted R-squared:  0.009244 
F-statistic: 2.325 on 1 and 141 DF,  p-value: 0.1296

Leverage

Points far from the horizontal center have high leverage, in the sense that they can pull the regression line up or down more forcefully than can outliers near the horizontal center.
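One way to see leverage directly is through hat values. This sketch uses made-up data with one extreme \(x\) value:

```r
# Hat values measure leverage; the point with the extreme x value
# (far from the horizontal center) has the largest one
x <- c(1, 2, 3, 4, 10)
y <- c(2, 3, 4, 5, 11)
m.lev <- lm(y ~ x)
h <- hatvalues(m.lev)
stopifnot(which.max(h) == 5)  # the x = 10 point
```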

Influential points

A subset of high leverage points are those that actually exercise this high leverage and do pull the regression line out of position.

It can be dangerous to remove these points from analysis for reasons explored in a book called The Black Swan by Nassim Nicholas Taleb.

Software

Let’s examine the models we generated previously.

summary(m2)

Call:
lm(formula = head_l ~ total_l, data = possum)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.1877 -1.5340 -0.3345  1.2788  7.3968 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) 42.70979    5.17281   8.257 0.000000000000565704 ***
total_l      0.57290    0.05933   9.657 0.000000000000000468 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.595 on 102 degrees of freedom
Multiple R-squared:  0.4776,    Adjusted R-squared:  0.4725 
F-statistic: 93.26 on 1 and 102 DF,  p-value: 0.0000000000000004681

The coefficients table

There are five columns in the coefficients table

  • Estimate: the coefficient estimate (\(b_0, b_1\)) of the parameter (\(\beta_0,\beta_1\))
  • Std. Error: the standard error of the coefficient (the standard deviation scaled by sample size)
  • t value: the ratio of the estimate to its standard error
  • Pr(>|t|): the p-value of the t-statistic
  • Significance codes (unlabeled): category the p-value falls into

Hypothesis tests

The coefficients table tells us all we need to know to conduct a hypothesis test concerning an estimate, where the hypothesis is typically whether the true value of the coefficient is zero. If it is zero, the input variable associated with the coefficient is not linearly related to the output or response variable.
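To see where those columns come from, here is a sketch with made-up toy data reproducing the t value and p-value by hand:

```r
# Reproduce the coefficients-table t value and p-value (hypothetical data)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 3, 5, 4, 6, 7)
m.ht <- lm(y ~ x)
ct <- summary(m.ht)$coefficients
# t value is the ratio of the estimate to its standard error
t.manual <- ct["x", "Estimate"] / ct["x", "Std. Error"]
# two-sided p-value from the t distribution with residual df
p.manual <- 2 * pt(-abs(t.manual), df = m.ht$df.residual)
stopifnot(all.equal(t.manual, ct["x", "t value"]),
          all.equal(p.manual, ct["x", "Pr(>|t|)"]))
```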

Confidence intervals

By default, confint gives a ninety-five percent confidence interval for every coefficient. (In the mariokart model, the interval for the used condition would include zero, highlighting the unreliability of that estimate.)

confint(m2)
                 2.5 %     97.5 %
(Intercept) 32.4495415 52.9700448
total_l      0.4552298  0.6905728

Computing confidence intervals by hand

\[b_i \pm t^*_{df} \times SE_{b_i}\]

For the mariokart model, \(t^*_{df}\) can be found for the ninety-five percent confidence interval by

qt(0.975,141)
[1] 1.976931

and \(b_i\) and \(SE_{b_i}\) are given in the coefficients table.
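Putting the pieces together on made-up toy data, this sketch matches the output of confint():

```r
# b_i ± t*_df × SE reproduces confint() (hypothetical data)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 3, 5, 4, 6, 7)
m.ci <- lm(y ~ x)
ct <- summary(m.ci)$coefficients
tstar <- qt(0.975, df = m.ci$df.residual)
ci.manual <- ct["x", "Estimate"] + c(-1, 1) * tstar * ct["x", "Std. Error"]
stopifnot(all.equal(ci.manual, unname(confint(m.ci)["x", ])))
```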

END

References

Diez, David, Mine Çetinkaya-Rundel, and Christopher D. Barr. 2019. OpenIntro Statistics, Fourth Edition. Self-published. https://openintro.org/os.
Mendenhall, William, and Terry Sincich. 2012. A Second Course in Statistics, Regression Analysis, Seventh Edition. Boston, MA, USA: Prentice Hall.

Colophon

This slideshow was produced using quarto

Fonts are Roboto Condensed Bold, JetBrains Mono Nerd Font, and STIX2