Stats:
Multiple Linear Regression;
Logistic Regression

Nathan TeBlunthuis

2025-04-15

Preliminaries

What kind of relationships can the correlation coefficient quantify?

Anscombe’s quartet. Four datasets with the same mean, variance, correlation coefficient (0.816), and regression line.

Model selection

Homework problem 3 asked you to choose the best predictors in a model of diamond prices. Different people got different answers. Most found caret, but some found cut and depth while others found color and clarity. Why would you disagree?

I hope today’s lecture will help you understand why. The short answer is that regression coefficients are sensitive to the order you add them to a model. A model that includes everything might have several good predictors but if they all correlated, they won’t all have large coefficients.

Multiple Linear Regression

Everything we’ve done so far considers relationships between up to 2 variables at once.

But there can be many things that might all affect an outcome. How can we statistically analyze such more complex situations?

The answer is simple. We add them.

\[\hat{y}=b_0+b_1x_1+b_2x_2+\cdots+b_nx_n\]

Because we’re using a linear equation, that is, the equation of a line to model the data, there’s no reason we can’t add terms to the equation. These terms are additive, meaning that we add each term and each term has a coefficient. So now, our estimate of \(y\), which we call \(\hat{y}\), looks like this for \(n\) terms.

How do we interpret \(b_1\) and \(b_2\)?

The only complicated thing about multiple regression is the interdependence of the coefficients.

  • One intuition: As marginal associations.
  • Another intuition: As controls/adjustments.

Building a model step-by-step

Code
load("~/class/data/possum.rda")
m <- lm(possum, formula = total_l ~ 1)
ggplot(possum, aes(y = total_l, x = 1)) +
    geom_point() +
    geom_point(aes(y=mean(total_l), x=1), color='orange') + 
    geom_abline(
        intercept = m$coefficients[["(Intercept)"]],
        slope = 0,
        color='blue'
    )

Intercept-only model

Adding a categorical variable

Code
load("~/class/data/possum.rda")
m2 <- lm(possum, formula = total_l ~ sex)
ggplot(possum, aes(y = total_l, x = sex)) +
  geom_point() +
  geom_point(aes(y = total_l, x = sex),
             color = "orange",
             data = possum |> group_by(sex) |> summarise(total_l = mean(total_l)),
             shape = 15,
             size = 3) +
  geom_abline(
    intercept = m2$coefficients[["(Intercept)"]] - m2$coefficients[["sexm"]],
    slope = m2$coefficients[["sexm"]],
    color = "blue"
  )

The regression line goes through the means of both variables.

Adding a second categorical variable

Code
possum |> group_by(sex, pop) |> select(total_l) |> summarize(mean=mean(total_l))
# A tibble: 4 × 3
# Groups:   sex [2]
  sex   pop    mean
  <fct> <fct> <dbl>
1 f     Vic    88.3
2 f     other  87.4
3 m     Vic    86.5
4 m     other  86.5

Code
load("~/class/data/possum.rda")
m3 <- lm(possum, formula = total_l ~ sex + pop)

p <- ggplot(possum, aes(y = total_l, x = sex)) +
    geom_point() +
    geom_point(aes(y=total_l, x=sex), color='orange', data = possum |> group_by(sex,pop) |> summarise(total_l= mean(total_l)),shape=15,size=3) +
    facet_wrap(pop ~ .) + 
    geom_abline(
        aes(intercept = m3$coefficients[["(Intercept)"]] - m3$coefficients[["sexm"]] + m3$coefficients[["popother"]] * (pop == 'other'),
        slope = m3$coefficients[["sexm"]]),
        linetype='dashed'
    )
print(p)

Model with 2 categorical variables.

Now we’re in 3 dimensions, which is difficult to visualize.

With 2 categorical variables the regression line goes through the mean of each cell in the table. With 3 it does its best, but is constrained because the slope for pop is the same for both sexes.

Letting each sex have a different slope for pop.

Interaction terms
Variables in regression models that are the product of two variables.

\[\hat{y}=b_0+b_1x_1+b_2x_2+b_3x_1x_2\]

m4 <- lm(possum, formula = total_l ~ sex + pop + sex:pop)

You almost always want to include the first-order terms if you include an interaction. Shorthand in R:

    m4 <- lm(possum, formula = total_l ~ sex*pop)

Code
load("~/class/data/possum.rda")
p <- ggplot(possum, aes(y = total_l, x = sex)) +
    geom_point() +
    geom_point(aes(y=total_l, x=sex), color='orange', data = possum |> group_by(sex,pop) |> summarise(total_l= mean(total_l)),shape=15,size=3) +
    geom_abline(
        aes(intercept = m4$coefficients[["(Intercept)"]] - m4$coefficients[["sexm"]] + m4$coefficients[["popother"]] * (pop == 'other') - m4$coefficients["sexm:popother"]*(pop=='other'),
        slope = m4$coefficients[["sexm"]] + m4$coefficients["sexm:popother"]*(pop == 'other')),
        linetype='dashed'
    ) +
    facet_wrap(pop ~ .) 
print(p)

The regression line with interaction categorical variables goes through the mean of each subgroup.

Linear regression models conditional means

Even with continuous data, the regression line does its best to through the mean value of the data at a given level of each predictor. Adding a new predictor makes this harder and linear regression has to balance the trade-offs.

You can see this is why adding \(x_2\) to the model changes \(b_1\).

Adding the interaction term changes the coefficients in more complex ways. We’ll cover how to interpret interaction terms later.

knitreg(list(m,m2,m3,m4))
Statistical models
  Model 1 Model 2 Model 3 Model 4
(Intercept) 87.09*** 87.91*** 88.09*** 88.33***
  (0.42) (0.65) (0.76) (0.88)
sexm   -1.40 -1.31 -1.81
    (0.85) (0.87) (1.27)
popother     -0.42 -0.96
      (0.86) (1.32)
sexm:popother       0.95
        (1.75)
R2 0.00 0.03 0.03 0.03
Adj. R2 0.00 0.02 0.01 0.00
Num. obs. 104 104 104 104
***p < 0.001; **p < 0.01; *p < 0.05

Collinearity

table(possum$site, possum$pop)
   
    Vic other
  1  33     0
  2  13     0
  3   0     7
  4   0     7
  5   0    13
  6   0    13
  7   0    18

What if we include pop and site in the model?

m5 <- lm(possum, formula = total_l ~ sex + pop + as.factor(site))
knitreg(list(m3,m5))
Statistical models
  Model 1 Model 2
(Intercept) 88.09*** 89.99***
  (0.76) (0.63)
sexm -1.31 -0.66
  (0.87) (0.67)
popother -0.42 -3.78***
  (0.86) (0.97)
as.factor(site)2   -7.82***
    (1.06)
as.factor(site)3   2.24
    (1.44)
as.factor(site)4   6.51***
    (1.44)
as.factor(site)5   1.07
    (1.18)
as.factor(site)6   -1.21
    (1.17)
R2 0.03 0.48
Adj. R2 0.01 0.44
Num. obs. 104 104
***p < 0.001; **p < 0.01; *p < 0.05

Collinearity

Including correlated variables in a model introduces collinearity.

If \(x1\) and \(x2\) are collinear variance in \(y\) explained by \(x1\) is often also explained by \(x2\). So \(b1\) and \(b2\) are related.

Perfectly collinear variables create numerical problems estimation. Generally a bad idea to include perfectly collinear variables.

Other than that extreme case, collinearity is usually fine, especially in large samples.

Coefficients quantify marginal associations

The coefficient in a two-variable regression quantifies the direct association (aka “effect”). For categorical data this is the difference in means.

In mulivariate regression, the coefficient \(b_j\) quantifies the marginal association / effect. The model says a one-unit change in \(x_j\) would change the outcome by \(b_j\) if we held the other \(x\) constant.

Hypotheses about marginal associations

H1_0: Vic’s offspring are not longer than the average given their sex or site.

H1_A: Vic’s offspring are longer than average given their sex or site.

Predictors and control variables

H2_0: Vic’s genes do not cause his offspring to be longer than average.

H2_A: Vic’s genes cause his offspring to be longer than average.

We can’t do an experiment, but maybe we can argue that H1_A implies H2_A.

Bias-Variance Trade-off

Adding more variables to a model tends to reduce bias, but this comes at the cost of variance. You have less data and more degrees of freedom, so parameter estimates are less precise.

“”

Adjusted R-squared

m <- with(mtcars, lm(mpg ~ hp+wt))
knitreg(m)
Statistical models
  Model 1
(Intercept) 37.23***
  (1.60)
hp -0.03**
  (0.01)
wt -3.88***
  (0.63)
R2 0.83
Adj. R2 0.81
Num. obs. 32
***p < 0.001; **p < 0.01; *p < 0.05

The output looks a bit different now. First, there are 32 residuals, so the individual residuals are not listed. Instead, you see summary statistics for the residuals.

Next, look at the coefficients table. There are three rows now, for the intercept, for hp, and for wt. Notice that all three have significance codes at the end of the row. Normally, you shouldn’t be concerned about the significance code for the intercept, but the other two are interesting. The code for hp is two stars, meaning that it is less than 0.01, while the code for wt is three stars, meaning that it is less than 0.001. The \(p\)-value of 1.12e-06, which is abbreviated scientific notation, means to take 1.12 and shift the decimal point six places to the left, giving 0.00000112 as the decimal equivalent.

The residual standard error is 2.593 on 29 degrees of freedom. As mentioned before, the residual standard error (often abbreviated \(R\!S\!E\)) is the square root of the expression formed by the ratio of the residual sum of squares (often abbreviated \(R\!S\!S\)) divided by the degrees of freedom. Expressing this as a formula is as follows.

\[R\!S\!E=\sqrt{\frac{R\!S\!S}{\text{df}}}\]

\(R\!S\!S\) can be found in R for the above model as follows.

(rss <- sum(residuals(m)^2))
[1] 195.0478

Similarly, \(R\!S\!E\) can be found in R for the above model as follows.

(rse <- sqrt( sum(residuals(m)^2) / m$df.residual ))
[1] 2.593412

The mathematical formulas for these values are as follows.

\[\begin{align*} R\!S\!S&=\sum_{i=1}^{n}(y_i-\hat{y}_{i})^2 \\ &=S\!S_{yy}-\hat{\beta}_1S\!S_{xy} \end{align*}\]

The term \(S\!S_{xy}\) in the preceding formula is only relevant for simple linear regression with one predictor, where \(S\!S_{xy}= \sum(x_i-\overline{x})(y_i-\overline{y})\). A formula for \(\hat{\beta}_1\) in that case would be the following.

\[\hat{\beta}_1=\frac{S\!S_{xy}}{S\!S_{xx}}\]

\(S\!S_{xx}\) in that case would be the sum of squares of deviations of the predictors from their mean: \(\sum(x_i-\overline{x})^2\).

\[R\!S\!E=\sqrt{\frac{\sum_{i=1}^{n}(y_i-\hat{y}_{i})^2}{n-(k+1)}} \]

where \(k\) is the number of parameters being estimated. For example, in the above model wt and hp are the two parameters being estimated.

The Multiple R-squared is 82 percent and the Adjusted R-squared is 81 percent. This is a good sign because the Adjusted R-squared is adjusted for the case where you have included too many variables on the right hand side of the linear model formula. If it’s similar to Multiple R-squared, that means you probably have not included too many variables.

\[ R^2=\frac{S\!S_{yy}-R\!S\!S}{S\!S_{yy}} =1-\frac{R\!S\!S}{S\!S_{yy}} \]

\[ R^2_a=1-\left(\frac{n-1}{n-(k+1)}\right)(1-R^2) \]

The \(F\)-statistic is important now, because of its interpretation. The \(F\)-statistic tells you that at least one of the variables is significant, taken in combination with the others. The \(t\)-statistics only give the individual contribution of the variables, so it’s possible to have a significant \(t\)-statistic without a significant \(F\)-statistic. The first thing to check in regression output is the \(F\)-statistic. If it’s too small, i.e., has a large \(p\)-value, try a different model.

\[F_c=\frac{(S\!S_{yy}-R\!S\!S)/k}{R\!S\!S/[n-(k+1)]} =\frac{R^2/k}{(1-R^2)/[n-(k+1)]}\]

The \(c\) subscript after \(F\) simply signifies that it is the computed value of \(F\), the one you see in the regression table. This is to contrast it with \(F_\alpha\), the value to which you compare it in making the hypothesis that one or more variables in the model are significant. You don’t need to compute \(F_\alpha\) since the \(p\)-value associated with \(F_c\) gives you enough information to reject or fail to reject the null hypothesis that none of the variables in the model are significant. If you want to calculate it anyway, you can say the following in R:

qf(0.05,2,29,lower.tail=FALSE)
[1] 3.327654

The preceding calculation assumes you choose 0.05 as the alpha level and that there are \(k=2\) parameters estimated in the model and that \(n-(k+1)=29\). These are the numerator and denominator degrees of freedom.

knitreg(lm(mpg ~ ., data=mtcars))
Statistical models
  Model 1
(Intercept) 12.30
  (18.72)
cyl -0.11
  (1.05)
disp 0.01
  (0.02)
hp -0.02
  (0.02)
drat 0.79
  (1.64)
wt -3.72
  (1.89)
qsec 0.82
  (0.73)
vs 0.32
  (2.10)
am 2.52
  (2.06)
gear 0.66
  (1.49)
carb -0.20
  (0.83)
R2 0.87
Adj. R2 0.81
Num. obs. 32
***p < 0.001; **p < 0.01; *p < 0.05

Notes on model selection

Collider bias

  • When adding a variable to a regression model causes bias in causal inference.
  • E.g., Everyone in my dating pool is either nice or handsome.

“DAG showing a collider. Credit Tim bates (Wikipedia)”

If they weren’t nice or handsome I wouldn’t date them!

Synergy between models and measurement

“book cover”
  • Designing a good model depends on contextual knowledge.
  • Models are tools for creating knowledge.
  • The best way to evaluate the model is if it is useful it in practice
    • Out-of-sample predictions; experiments.
  • Models are usually limited by data collection / measurement.

Occam’s Razor

There is a heuristic called Occam’s Razor, named after William of Occam (inventer, not popularizer).

If two explanations have the same explanatory power, you should prefer the simpler one. In this context, simpler means fewer variables.

More on interaction terms

Interpreting an interaction

We saw how interaction terms help the model fit the data when relationship (slope) between \(x_1\) and \(y\) depends on \(x_2\).

Code
data(mtcars)
m1 <- lm(mpg ~ wt + cyl, mtcars)
m2 <- lm(mpg ~ wt * cyl, mtcars)
knitreg(list(m1, m2))
Statistical models
  Model 1 Model 2
(Intercept) 39.69*** 54.31***
  (1.71) (6.13)
wt -3.19*** -8.66***
  (0.76) (2.32)
cyl -1.51** -3.80***
  (0.41) (1.01)
wt:cyl   0.81*
    (0.33)
R2 0.83 0.86
Adj. R2 0.82 0.85
Num. obs. 32 32
***p < 0.001; **p < 0.01; *p < 0.05
  • Adding the interaction term changes everything.
  • What is the relationship between weight and gas mileage?

pacman::p_load(ggeffects)
plot(ggeffect(m2, c("wt", "cyl")))

Marginal effects plot for the relationship between weight and gas mileage depending on cylinders

Interaction with continuous variable

Code
data(mtcars)
m3 <- lm(mpg ~ wt + drat, mtcars)
m4 <- lm(mpg ~ wt * drat, mtcars)
knitreg(list(m3,m4))
Statistical models
  Model 1 Model 2
(Intercept) 30.29*** 5.55
  (7.32) (12.63)
wt -4.78*** 3.88
  (0.80) (3.80)
drat 1.44 8.49*
  (1.46) (3.32)
wt:drat   -2.54*
    (1.09)
R2 0.76 0.80
Adj. R2 0.74 0.78
Num. obs. 32 32
***p < 0.001; **p < 0.01; *p < 0.05
plot(ggeffect(m4, c("wt", "drat")))

Marginal effects plot for the relationship between weight and mileage depending on rear axle ratio.

Hypotheses and models with interactions

  • \(H1_0\): There is no relationship between weight and gas mileage, given gear axle ratio.
  • \(H2_0\): There is no relationship between gear-axel-ratio and gas mileage, given weight.
  • \(H3_0\): The relationship between weight and gas mileage is no different for cars of different gear axle ratios.

Nonlinear relationships

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth(method = "gam") +
  geom_smooth(method = "lm", color = "orange") +
  ylim(0, 20000)

Plot of diamonds data with linear and gam smoother

Nonlinear relationships (polynomial models)

library(splines)
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth(method = "gam") +
  geom_smooth(method = "lm", color = "orange", formula = y ~ poly(x, 4)) +
  ylim(0, 20000)

Plot of diamonds data with polynomial and gam smoother

Interpreting polynomial models

m <- lm(diamonds, formula=price ~ poly(carat, 4))
plot(ggeffect(m, "carat[1:5,by=0.1]"))

Marginal effects plot for polyomial model of carat.

Nonlinear relationships (spline models)

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth(method = "gam") +
  geom_smooth(method = "lm", color = "orange", formula = y ~ bs(x, df=15))

Plot of diamonds data with polynomial and gam smoother

Interpreting spline models

m <- lm(diamonds, formula=price ~ bs(carat, df=154))
plot(ggeffect(m, "carat[1:5,by=0.1]"))

Marginal effects plot for polyomial model of carat.

Ordered factors

When you conduct a linear regression, you are generally looking for the linear effects of input variables on output variables. For example, consider the diamonds data frame that is automatically installed with ggplot. The price output variable may be influenced by a number of input variables, such as cut, carat, and color. It happens that R interprets two of these as ordered factors and does something interesting with them. Let’s conduct a regression of price on cut.

summary(lm(price~cut,data=diamonds))

Call:
lm(formula = price ~ cut, data = diamonds)

Residuals:
   Min     1Q Median     3Q    Max 
 -4258  -2741  -1494   1360  15348 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  4062.24      25.40 159.923 < 0.0000000000000002 ***
cut.L        -362.73      68.04  -5.331       0.000000098033 ***
cut.Q        -225.58      60.65  -3.719               0.0002 ***
cut.C        -699.50      52.78 -13.253 < 0.0000000000000002 ***
cut^4        -280.36      42.56  -6.588       0.000000000045 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3964 on 53935 degrees of freedom
Multiple R-squared:  0.01286,   Adjusted R-squared:  0.01279 
F-statistic: 175.7 on 4 and 53935 DF,  p-value: < 0.00000000000000022

Note that the output includes four independent variables instead of just cut. The first one, cut.L, represents cut. The L stands for linear and estimates the linear effect of cut on price. The other three are completely independent of the linear effect of cut on price. The next, cut.Q, estimates the quadratic effect of cut on price. The third, cut.C, estimates the cubic effect of cut on price. The fourth, cut^4 estimates the effect of a fourth degree polynomial contrast of cut on price. Together, all four of these terms are orthogonal polynomial contrasts. They are chosen by R to be independent of each other since a mixture could reveal spurious effects. Why did R stop at four such contrasts? Let’s examine cut further.

table(diamonds$cut)

     Fair      Good Very Good   Premium     Ideal 
     1610      4906     12082     13791     21551 

You can see that cut has five levels. R automatically chooses \(\text{level}-1\) polynomial contrasts when presented with an ordered factor. How can you make R stop doing this if you don’t care about the nonlinear effects? You can present the factor as unordered without altering the factor as it is stored. Then your regression output will list one term for each level of the factor, estimating the effect of that level on the output variable.

diamonds$cut <- factor(diamonds$cut,ordered=FALSE)
summary(lm(price~cut,data=diamonds))

Call:
lm(formula = price ~ cut, data = diamonds)

Residuals:
   Min     1Q Median     3Q    Max 
 -4258  -2741  -1494   1360  15348 

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   4358.76      98.79  44.122 < 0.0000000000000002 ***
cutGood       -429.89     113.85  -3.776             0.000160 ***
cutVery Good  -377.00     105.16  -3.585             0.000338 ***
cutPremium     225.50     104.40   2.160             0.030772 *  
cutIdeal      -901.22     102.41  -8.800 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3964 on 53935 degrees of freedom
Multiple R-squared:  0.01286,   Adjusted R-squared:  0.01279 
F-statistic: 175.7 on 4 and 53935 DF,  p-value: < 0.00000000000000022

By the way, you can also say it in one line as:

summary(lm(price~factor(cut,ordered=F),data=diamonds))

Call:
lm(formula = price ~ factor(cut, ordered = F), data = diamonds)

Residuals:
   Min     1Q Median     3Q    Max 
 -4258  -2741  -1494   1360  15348 

Coefficients:
                                  Estimate Std. Error t value
(Intercept)                        4358.76      98.79  44.122
factor(cut, ordered = F)Good       -429.89     113.85  -3.776
factor(cut, ordered = F)Very Good  -377.00     105.16  -3.585
factor(cut, ordered = F)Premium     225.50     104.40   2.160
factor(cut, ordered = F)Ideal      -901.22     102.41  -8.800
                                              Pr(>|t|)    
(Intercept)                       < 0.0000000000000002 ***
factor(cut, ordered = F)Good                  0.000160 ***
factor(cut, ordered = F)Very Good             0.000338 ***
factor(cut, ordered = F)Premium               0.030772 *  
factor(cut, ordered = F)Ideal     < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3964 on 53935 degrees of freedom
Multiple R-squared:  0.01286,   Adjusted R-squared:  0.01279 
F-statistic: 175.7 on 4 and 53935 DF,  p-value: < 0.00000000000000022

Notice that generating the orthogonal polynomial contrasts does not alter the linear model in any way. It’s just extra information. Both models produce the same \(R^2\) and the same \(F\)-statistic and the same \(p\)-value of the \(F\)-statistic.

It is VERY important to note that there is an assumption in the generation of these orthogonal polynomial contrasts. They assume that the differences between the levels of the ordered factor are evenly spaced. If the levels are not evenly spaced, the information provided will be misleading. Take for instance the world’s top three female sprinters. I read an article claiming that the difference between the top two (Richardson and Jackson at the time of this writing) was much smaller than the difference between the second and third. There are many statistical tools that use ranks, such as 1, 2, and 3, as ordered factors. Here is a case where that would be misleading.

How can you use this information? In a basic course like this, the information is not particularly useful and we will not pursue it. If you go on in statistics or data science, you will soon encounter situations where nonlinear effects matter a great deal.

For example, this often arises in age-period-cohort analysis, where you want to separate the effects of people’s ages, usually divided into evenly spaced levels, and other numerical factors such as income, also usually divided into evenly spaced levels, and the effects of significant periods, such as the economic collapse of 2008 or the pandemic beginning in 2019, and finally the effects of being a member of a cohort or identifiable group. This kind of analysis is often conducted by epidemiologists, people with very large groups of customers, policymakers, and others concerned with large groups of people, or any “objects” with large numbers of attributes.

Logistic Regression

Logistic Regression Intro

  • Linear regression with a binary outcome is biased.
  • You can still use it (we’ll discuss).
  • Logistic regression is unbiased.
  • Logistic regression is a classification model.

The logistic function

Note that zero and one happen to be the boundaries of a probability measure. Hence, you can use the logistic function to map arbitrary numbers to a probability.

Logistic regression example

load("~/class/data/resume.rda")
names(resume)
 [1] "job_ad_id"              "job_city"               "job_industry"          
 [4] "job_type"               "job_fed_contractor"     "job_equal_opp_employer"
 [7] "job_ownership"          "job_req_any"            "job_req_communication" 
[10] "job_req_education"      "job_req_min_experience" "job_req_computer"      
[13] "job_req_organization"   "job_req_school"         "received_callback"     
[16] "firstname"              "race"                   "gender"                
[19] "years_college"          "college_degree"         "honors"                
[22] "worked_during_school"   "years_experience"       "computer_skills"       
[25] "special_skills"         "volunteer"              "military"              
[28] "employment_holes"       "has_email_address"      "resume_quality"        

Logistic regression example

with(resume,table(race,received_callback))
       received_callback
race       0    1
  black 2278  157
  white 2200  235
with(resume,table(gender,received_callback))
      received_callback
gender    0    1
     f 3437  309
     m 1041   83
with(resume,table(honors,received_callback))
      received_callback
honors    0    1
     0 4263  350
     1  215   42

Logistic regression example

m <- glm(received_callback ~ honors + race + gender,
  data = resume,
  family = "binomial"
  )
m |> knitreg()
<table class="texreg" style="margin: 10px auto;border-collapse: collapse;border-spacing: 0px;caption-side: bottom;color: #000000;border-top: 2px solid #000000;">
<caption>Statistical models</caption>
<thead>
<tr>
<th style="padding-left: 5px;padding-right: 5px;">&nbsp;</th>
<th style="padding-left: 5px;padding-right: 5px;">Model 1</th>
</tr>
</thead>
<tbody>
<tr style="border-top: 1px solid #000000;">
<td style="padding-left: 5px;padding-right: 5px;">(Intercept)</td>
<td style="padding-left: 5px;padding-right: 5px;">-2.71<sup>&#42;&#42;&#42;</sup></td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
<td style="padding-left: 5px;padding-right: 5px;">(0.09)</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">honors</td>
<td style="padding-left: 5px;padding-right: 5px;">0.86<sup>&#42;&#42;&#42;</sup></td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
<td style="padding-left: 5px;padding-right: 5px;">(0.18)</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">racewhite</td>
<td style="padding-left: 5px;padding-right: 5px;">0.44<sup>&#42;&#42;&#42;</sup></td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
<td style="padding-left: 5px;padding-right: 5px;">(0.11)</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">genderm</td>
<td style="padding-left: 5px;padding-right: 5px;">-0.11</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
<td style="padding-left: 5px;padding-right: 5px;">(0.13)</td>
</tr>
<tr style="border-top: 1px solid #000000;">
<td style="padding-left: 5px;padding-right: 5px;">AIC</td>
<td style="padding-left: 5px;padding-right: 5px;">2697.15</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">BIC</td>
<td style="padding-left: 5px;padding-right: 5px;">2723.11</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">Log Likelihood</td>
<td style="padding-left: 5px;padding-right: 5px;">-1344.57</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">Deviance</td>
<td style="padding-left: 5px;padding-right: 5px;">2689.15</td>
</tr>
<tr style="border-bottom: 2px solid #000000;">
<td style="padding-left: 5px;padding-right: 5px;">Num. obs.</td>
<td style="padding-left: 5px;padding-right: 5px;">4870</td>
</tr>
</tbody>
<tfoot>
<tr>
<td style="font-size: 0.8em;" colspan="2"><sup>&#42;&#42;&#42;</sup>p &lt; 0.001; <sup>&#42;&#42;</sup>p &lt; 0.01; <sup>&#42;</sup>p &lt; 0.05</td>
</tr>
</tfoot>
</table>

Interpreting logistic regression

m |>
  ggeffect(terms="honors") |>
  plot()

Marginal effects plots translates the coefficient into probabilities.

Interpreting logistic regression: Odds ratios

#| code-fold: true
#| fig.cap: "Example logistic regression model"
m <- glm(received_callback ~ honors + race + gender,
  data = resume,
  family = "binomial"
  )
m |> knitreg()
Odds
Ratio of probabilities. If \(P(X)=0.2\), the odds of \(X\) are \(\frac{0.2}{0.8}=\frac{1}{4}\). (1:4 Odds).
Odds ratio
How the odds change with a logistic regresion covariate. \(\beta_X=\frac{P(Y|X = X_0 + 1)/P( \neg Y | X_0 + 1)}{P(Y|X = X_0)/P(\neg Y | X = X_0)}\)

Model selection for GLMs

Maximum Likelihood

  • Linear regression models are fit via “least squares”.

  • This is equivalent to maximizing the likelihood of a normal model.

Likelihood
Probability of the model given the data.

For logistic regression:

\[\hat{L} = \prod_{k;y_k=1}^N{y^*_i}\prod_{k:y_k=0}{(1-y^*{i})}\]

Akaike Information Criterion (AIC)

One easy way to compare these models is by comparing the values of AIC, the Akaike Information Criterion.

\[AIC = 2k - 2ln(\hat{L})\]

This measures the loss of information, so a lower AIC value corresponds to a more predictive model.

Relative likelihood
the likelihood that the alternative model minimizes the information loss.

\[e^{\text{AIC}_{\text{min}}-\text{AIC}_{\text{alternative}}/2}\]

Bayesian Information Criterion (BIC)

  • BIC is an alternative to AIC that says additional parameters pose a greater risk of overfitting when you have more data.

\[BIC = k ln(n) - 2 ln(\hat{L})\]

Keep in mind that the AIC/BIC is only a tool to compare models, not an absolute measure. There is no such thing as an absolutely good AIC/BIC value. In the above cases, the relative likelihood for the two models with higher AIC values is vanishingly small. The model using honors loses far less information than do the other two.

Another approach is to calculate AICc, Delta_AICc, and AICcWt using R.

m1 <- glm(received_callback ~ honors,data=resume,family="binomial")
m2 <- glm(received_callback ~ race,data=resume,family="binomial")
m3 <- glm(received_callback ~ gender,data=resume,family="binomial")
knitreg(list(m1, m2, m3))
Statistical models
  Model 1 Model 2 Model 3
(Intercept) -2.50*** -2.67*** -2.41***
  (0.06) (0.08) (0.06)
honors 0.87***    
  (0.18)    
racewhite   0.44***  
    (0.11)  
genderm     -0.12
      (0.13)
AIC 2710.72 2713.94 2730.03
BIC 2723.70 2726.92 2743.01
Log Likelihood -1353.36 -1354.97 -1363.02
Deviance 2706.72 2709.94 2726.03
Num. obs. 4870 4870 4870
***p < 0.001; **p < 0.01; *p < 0.05

pacman::p_load(AICcmodavg)
models <- list(m1,m2,m3)
model_names <- c("honors", "race", "gender")
aictab(cand.set = models, modnames = model_names)

Model selection based on AICc:

   K    AICc Delta_AICc AICcWt Cum.Wt       LL

honors 2 2710.73 0.00 0.83 0.83 -1353.36 race 2 2713.94 3.21 0.17 1.00 -1354.97 gender 2 2730.03 19.31 0.00 1.00 -1363.02

A common rule, discredited by Anderson (2008) is that, if Delta_AICc is greater than 2, the model should be discarded. But it is not the case that any model can be judged as good or bad based on AIC. Instead, the AIC tells us only relative information about the models: which is better from an information loss standpoint. The AICcWt or Akaike weight, is the probability that the given model is the best of the models in the table from an information loss standpoint. Here there is a clear winner. The difference between 83 percent and seventeen percent is striking.

Penalized likelihood for model selection

  • AIC / BIC try to balance the trade-off between parsimony and predictive performance after models have been specified and fit.

  • An alternative is to let likelihood maximization select which variables to include.

  • This is called “regularization” or “penalized likelihood”.

  • There are several penalities you might choose, popular ones are “LASSO” and “Ridge Regression”.

LASSO aka L1-Norm Penalization

\[\min_{ \beta_0, \beta } \biggl\{ \sum_{i=1}^N \bigl(y_i - \beta_0 - x_i^\intercal \beta\bigr)^2 \biggr\}\] subject to \[\sum_{j=1}^p |\beta_j| \leq t. \]

\(t\) is a “free parameter”/ hyperparameter controlling the penalty. Specified by the modeler, not learned from the data.

Ridge Regression aka L2-Norm Penalization

\[\min_{ \beta_0, \beta } \biggl\{ \sum_{i=1}^N \bigl(y_i - \beta_0 - x_i^\intercal \beta\bigr)^2 \biggr\}\] subject to \[\sum_{j=1}^p \beta_j^2 \leq t. \]

\(\lambda\) is the hyperparameter controlling the amount of penalty.

LASSO vs Ridge Regression

By Nicoguaro - Own work, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=58258966

By Beos123 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=75712577
  • L1/LASSO can set coefficients to 0. It tends to do so.

  • L2/Ridge regression does not set coefficients to 0.

  • Not exclusive, you can use both called “elastic net”.

Choosing hyperparameters using tidymodels

Datacamp shows a different way, using tidymodels in one of their tutorials.

The bank wants to divide customers into those likely to buy and those unlikely to buy some banking product.

They would like to divide the customers into these two groups using logistic regression, with a cutoff point of fifty-fifty.

If there’s better than a fifty-fifty chance, they will send a salesperson but if there’s less than a fifty-fifty chance, they won’t send a salesperson.

pacman::p_load(tidymodels)

#. Read the dataset and convert the target variable to a factor
bank_df <- read_csv2("~/class/data/bank-full.csv")
bank_df$y = as.factor(bank_df$y)

#. Plot job occupation against the target variable
ggplot(bank_df, aes(job, fill = y)) +
    geom_bar() +
    coord_flip()

Train / validation / test split

You train the model on one set of data, validate it on another, then test it on another, previously unseen set. The next thing in this example is training and testing.

Split data into train and test

set.seed(421)
split <- initial_split(bank_df, prop = 0.8, strata = y)
train <- split |> 
         training()
test <- split |>
        testing()

Train a logistic regression model

m <- logistic_reg(mixture = 0, penalty = 0.1) |>
  set_engine("glmnet") |>
  set_mode("classification") |>
  fit(y ~ ., data = train)

Model summary

m |>  extract_fit_engine() |>
  tidy() |>
  rename(penalty = lambda) |>   # <- for consistent naming
  filter(term != "(Intercept)")
# A tibble: 4,200 × 5
   term   step estimate penalty dev.ratio
   <chr> <dbl>    <dbl>   <dbl>     <dbl>
 1 age       1 7.24e-40   126.   1.70e-13
 2 age       2 6.22e- 6   115.   1.12e- 3
 3 age       3 6.83e- 6   105.   1.23e- 3
 4 age       4 7.49e- 6    95.4  1.35e- 3
 5 age       5 8.22e- 6    86.9  1.49e- 3
 6 age       6 9.02e- 6    79.2  1.63e- 3
 7 age       7 9.89e- 6    72.2  1.79e- 3
 8 age       8 1.09e- 5    65.7  1.96e- 3
 9 age       9 1.19e- 5    59.9  2.15e- 3
10 age      10 1.31e- 5    54.6  2.36e- 3
# ℹ 4,190 more rows

tidy(m)
# A tibble: 43 × 3
   term              estimate penalty
   <chr>                <dbl>   <dbl>
 1 (Intercept)      -2.30         0.1
 2 age               0.000638     0.1
 3 jobblue-collar   -0.124        0.1
 4 jobentrepreneur  -0.104        0.1
 5 jobhousemaid     -0.118        0.1
 6 jobmanagement     0.0318       0.1
 7 jobretired        0.291        0.1
 8 jobself-employed -0.0104       0.1
 9 jobservices      -0.0592       0.1
10 jobstudent        0.389        0.1
# ℹ 33 more rows

# . Class Predictions
pred_class <- predict(m,
  new_data = test,
  type = "class"
  )

Class Probabilities

pred_proba <- predict(m,
                      new_data = test,
                      type = "prob")
results <- test |>
           select(y) |>
           bind_cols(pred_class, pred_proba)

accuracy(results, truth = y, estimate = .pred_class)

A tibble: 1 × 3

.metric .estimator .estimate 1 accuracy binary 0.893

Hyperparameter tuning

How can we choose

There are aspects of this approach, called hyperparameters, that influence the quality of the model. It can be tedious to adjust these aspects, called penalty and mixture, so here’s a technique for doing it automatically. You’ll learn about this and similar techniques if you take a more advanced course like 310D, Intro to Data Science.

#. Define the logistic regression model with penalty and mixture hyperparameters
log_reg <- logistic_reg(mixture = tune(), penalty = tune(), engine = "glmnet")

# . Define the grid search for the hyperparameters
grid <- grid_regular(mixture(), penalty(), levels = c(mixture = 4, penalty = 3))

grid
# A tibble: 12 × 2
   mixture      penalty
     <dbl>        <dbl>
 1   0     0.0000000001
 2   0.333 0.0000000001
 3   0.667 0.0000000001
 4   1     0.0000000001
 5   0     0.00001     
 6   0.333 0.00001     
 7   0.667 0.00001     
 8   1     0.00001     
 9   0     1           
10   0.333 1           
11   0.667 1           
12   1     1           

#. Define the workflow for the model
log_reg_wf <- workflow() |>
  add_model(log_reg) |>
  add_formula(y ~ .)

#. Define the resampling method for the grid search
folds <- vfold_cv(train, v = 5)

#. Tune the hyperparameters using the grid search
log_reg_tuned <- tune_grid(
  log_reg_wf,
  resamples = folds,
  grid = grid,
  control = control_grid(save_pred = TRUE)
)

select_best(log_reg_tuned, metric = "roc_auc")
# A tibble: 1 × 3
       penalty mixture .config              
         <dbl>   <dbl> <chr>                
1 0.0000000001       0 Preprocessor1_Model01

#. Fit the model using the optimal hyperparameters
log_reg_final <- logistic_reg(penalty = 0.0000000001, mixture = 0) |>
                 set_engine("glmnet") |>
                 set_mode("classification") |>
                 fit(y~., data = train)

#. Evaluate the model performance on the testing set
pred_class <- predict(log_reg_final,
                      new_data = test,
                      type = "class")
results <- test |>
  select(y) |>
  bind_cols(pred_class, pred_proba)

#. Create confusion matrix
conf_mat(results, truth = y,
         estimate = .pred_class)
          Truth
Prediction   no  yes
       no  7838  738
       yes  147  320
precision(results, truth = y,
          estimate = .pred_class)
# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 precision binary         0.914
recall(results, truth = y,
          estimate = .pred_class)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 recall  binary         0.982

Evaluation metrics

Following are two tables from James et al. (2021) that you can use to evaluate a classification model.

Another view is provided at Wikipedia in the following picture

coeff <- tidy(log_reg_final) |>
  arrange(desc(abs(estimate))) |>
  filter(abs(estimate) > 0.5)
coeff
# A tibble: 10 × 3
   term            estimate      penalty
   <chr>              <dbl>        <dbl>
 1 (Intercept)       -2.59  0.0000000001
 2 poutcomesuccess    2.08  0.0000000001
 3 monthmar           1.62  0.0000000001
 4 monthoct           1.08  0.0000000001
 5 monthsep           1.03  0.0000000001
 6 contactunknown    -1.01  0.0000000001
 7 monthdec           0.861 0.0000000001
 8 monthjan          -0.820 0.0000000001
 9 housingyes        -0.550 0.0000000001
10 monthnov          -0.517 0.0000000001

ggplot(coeff, aes(x = term, y = estimate, fill = term)) + geom_col() + coord_flip()

References

Anderson, David. 2008. Model Based Inference in the Life Sciences: A Primer on Evidence. New York, NY: Springer.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning, 2nd Edition. Springer New York.

Colophon

This slideshow was produced using quarto

Fonts are Roboto Condensed Bold, JetBrains Mono Nerd Font, and STIX2