2025-04-15
Anscombe’s quartet. Four datasets with the same mean, variance, correlation coefficient (0.816), and regression line.
Homework problem 3 asked you to choose the best predictors in a model of diamond prices. Different people got different answers. Most found caret, but some found cut and depth while others found color and clarity. Why would you disagree?
I hope today’s lecture will help you understand why. The short answer is that regression coefficients are sensitive to the order you add them to a model. A model that includes everything might have several good predictors but if they all correlated, they won’t all have large coefficients.
Everything we’ve done so far considers relationships between up to 2 variables at once.
But there can be many things that might all affect an outcome. How can we statistically analyze such more complex situations?
The answer is simple. We add them.
\[\hat{y}=b_0+b_1x_1+b_2x_2+\cdots+b_nx_n\]
Because we’re using a linear equation, that is, the equation of a line to model the data, there’s no reason we can’t add terms to the equation. These terms are additive, meaning that we add each term and each term has a coefficient. So now, our estimate of \(y\), which we call \(\hat{y}\), looks like this for \(n\) terms.
The only complicated thing about multiple regression is the interdependence of the coefficients.
Intercept-only model
load("~/class/data/possum.rda")
m2 <- lm(possum, formula = total_l ~ sex)
ggplot(possum, aes(y = total_l, x = sex)) +
geom_point() +
geom_point(aes(y = total_l, x = sex),
color = "orange",
data = possum |> group_by(sex) |> summarise(total_l = mean(total_l)),
shape = 15,
size = 3) +
geom_abline(
intercept = m2$coefficients[["(Intercept)"]] - m2$coefficients[["sexm"]],
slope = m2$coefficients[["sexm"]],
color = "blue"
)The regression line goes through the means of both variables.
load("~/class/data/possum.rda")
m3 <- lm(possum, formula = total_l ~ sex + pop)
p <- ggplot(possum, aes(y = total_l, x = sex)) +
geom_point() +
geom_point(aes(y=total_l, x=sex), color='orange', data = possum |> group_by(sex,pop) |> summarise(total_l= mean(total_l)),shape=15,size=3) +
facet_wrap(pop ~ .) +
geom_abline(
aes(intercept = m3$coefficients[["(Intercept)"]] - m3$coefficients[["sexm"]] + m3$coefficients[["popother"]] * (pop == 'other'),
slope = m3$coefficients[["sexm"]]),
linetype='dashed'
)
print(p)Model with 2 categorical variables.
Now we’re in 3 dimensions, which is difficult to visualize.
With 2 categorical variables the regression line goes through the mean of each cell in the table. With 3 it does its best, but is constrained because the slope for pop is the same for both sexes.
\[\hat{y}=b_0+b_1x_1+b_2x_2+b_3x_1x_2\]
load("~/class/data/possum.rda")
p <- ggplot(possum, aes(y = total_l, x = sex)) +
geom_point() +
geom_point(aes(y=total_l, x=sex), color='orange', data = possum |> group_by(sex,pop) |> summarise(total_l= mean(total_l)),shape=15,size=3) +
geom_abline(
aes(intercept = m4$coefficients[["(Intercept)"]] - m4$coefficients[["sexm"]] + m4$coefficients[["popother"]] * (pop == 'other') - m4$coefficients["sexm:popother"]*(pop=='other'),
slope = m4$coefficients[["sexm"]] + m4$coefficients["sexm:popother"]*(pop == 'other')),
linetype='dashed'
) +
facet_wrap(pop ~ .)
print(p)The regression line with interaction categorical variables goes through the mean of each subgroup.
Even with continuous data, the regression line does its best to through the mean value of the data at a given level of each predictor. Adding a new predictor makes this harder and linear regression has to balance the trade-offs.
You can see this is why adding \(x_2\) to the model changes \(b_1\).
Adding the interaction term changes the coefficients in more complex ways. We’ll cover how to interpret interaction terms later.
| Model 1 | Model 2 | Model 3 | Model 4 | |
|---|---|---|---|---|
| (Intercept) | 87.09*** | 87.91*** | 88.09*** | 88.33*** |
| (0.42) | (0.65) | (0.76) | (0.88) | |
| sexm | -1.40 | -1.31 | -1.81 | |
| (0.85) | (0.87) | (1.27) | ||
| popother | -0.42 | -0.96 | ||
| (0.86) | (1.32) | |||
| sexm:popother | 0.95 | |||
| (1.75) | ||||
| R2 | 0.00 | 0.03 | 0.03 | 0.03 |
| Adj. R2 | 0.00 | 0.02 | 0.01 | 0.00 |
| Num. obs. | 104 | 104 | 104 | 104 |
| ***p < 0.001; **p < 0.01; *p < 0.05 | ||||
What if we include pop and site in the model?
| Model 1 | Model 2 | |
|---|---|---|
| (Intercept) | 88.09*** | 89.99*** |
| (0.76) | (0.63) | |
| sexm | -1.31 | -0.66 |
| (0.87) | (0.67) | |
| popother | -0.42 | -3.78*** |
| (0.86) | (0.97) | |
| as.factor(site)2 | -7.82*** | |
| (1.06) | ||
| as.factor(site)3 | 2.24 | |
| (1.44) | ||
| as.factor(site)4 | 6.51*** | |
| (1.44) | ||
| as.factor(site)5 | 1.07 | |
| (1.18) | ||
| as.factor(site)6 | -1.21 | |
| (1.17) | ||
| R2 | 0.03 | 0.48 |
| Adj. R2 | 0.01 | 0.44 |
| Num. obs. | 104 | 104 |
| ***p < 0.001; **p < 0.01; *p < 0.05 | ||
Including correlated variables in a model introduces collinearity.
If \(x1\) and \(x2\) are collinear variance in \(y\) explained by \(x1\) is often also explained by \(x2\). So \(b1\) and \(b2\) are related.
Perfectly collinear variables create numerical problems estimation. Generally a bad idea to include perfectly collinear variables.
Other than that extreme case, collinearity is usually fine, especially in large samples.
The coefficient in a two-variable regression quantifies the direct association (aka “effect”). For categorical data this is the difference in means.
In mulivariate regression, the coefficient \(b_j\) quantifies the marginal association / effect. The model says a one-unit change in \(x_j\) would change the outcome by \(b_j\) if we held the other \(x\) constant.
H1_0: Vic’s offspring are not longer than the average given their sex or site.
H1_A: Vic’s offspring are longer than average given their sex or site.
H2_0: Vic’s genes do not cause his offspring to be longer than average.
H2_A: Vic’s genes cause his offspring to be longer than average.
We can’t do an experiment, but maybe we can argue that H1_A implies H2_A.
Adding more variables to a model tends to reduce bias, but this comes at the cost of variance. You have less data and more degrees of freedom, so parameter estimates are less precise.
“”
| Model 1 | |
|---|---|
| (Intercept) | 37.23*** |
| (1.60) | |
| hp | -0.03** |
| (0.01) | |
| wt | -3.88*** |
| (0.63) | |
| R2 | 0.83 |
| Adj. R2 | 0.81 |
| Num. obs. | 32 |
| ***p < 0.001; **p < 0.01; *p < 0.05 | |
The output looks a bit different now. First, there are 32 residuals, so the individual residuals are not listed. Instead, you see summary statistics for the residuals.
Next, look at the coefficients table. There are three rows now, for the intercept, for hp, and for wt. Notice that all three have significance codes at the end of the row. Normally, you shouldn’t be concerned about the significance code for the intercept, but the other two are interesting. The code for hp is two stars, meaning that it is less than 0.01, while the code for wt is three stars, meaning that it is less than 0.001. The \(p\)-value of 1.12e-06, which is abbreviated scientific notation, means to take 1.12 and shift the decimal point six places to the left, giving 0.00000112 as the decimal equivalent.
The residual standard error is 2.593 on 29 degrees of freedom. As mentioned before, the residual standard error (often abbreviated \(R\!S\!E\)) is the square root of the expression formed by the ratio of the residual sum of squares (often abbreviated \(R\!S\!S\)) divided by the degrees of freedom. Expressing this as a formula is as follows.
\[R\!S\!E=\sqrt{\frac{R\!S\!S}{\text{df}}}\]
\(R\!S\!S\) can be found in R for the above model as follows.
Similarly, \(R\!S\!E\) can be found in R for the above model as follows.
The mathematical formulas for these values are as follows.
\[\begin{align*} R\!S\!S&=\sum_{i=1}^{n}(y_i-\hat{y}_{i})^2 \\ &=S\!S_{yy}-\hat{\beta}_1S\!S_{xy} \end{align*}\]
The term \(S\!S_{xy}\) in the preceding formula is only relevant for simple linear regression with one predictor, where \(S\!S_{xy}= \sum(x_i-\overline{x})(y_i-\overline{y})\). A formula for \(\hat{\beta}_1\) in that case would be the following.
\[\hat{\beta}_1=\frac{S\!S_{xy}}{S\!S_{xx}}\]
\(S\!S_{xx}\) in that case would be the sum of squares of deviations of the predictors from their mean: \(\sum(x_i-\overline{x})^2\).
\[R\!S\!E=\sqrt{\frac{\sum_{i=1}^{n}(y_i-\hat{y}_{i})^2}{n-(k+1)}} \]
where \(k\) is the number of parameters being estimated. For example, in the above model wt and hp are the two parameters being estimated.
The Multiple R-squared is 82 percent and the Adjusted R-squared is 81 percent. This is a good sign because the Adjusted R-squared is adjusted for the case where you have included too many variables on the right hand side of the linear model formula. If it’s similar to Multiple R-squared, that means you probably have not included too many variables.
\[ R^2=\frac{S\!S_{yy}-R\!S\!S}{S\!S_{yy}} =1-\frac{R\!S\!S}{S\!S_{yy}} \]
\[ R^2_a=1-\left(\frac{n-1}{n-(k+1)}\right)(1-R^2) \]
The \(F\)-statistic is important now, because of its interpretation. The \(F\)-statistic tells you that at least one of the variables is significant, taken in combination with the others. The \(t\)-statistics only give the individual contribution of the variables, so it’s possible to have a significant \(t\)-statistic without a significant \(F\)-statistic. The first thing to check in regression output is the \(F\)-statistic. If it’s too small, i.e., has a large \(p\)-value, try a different model.
\[F_c=\frac{(S\!S_{yy}-R\!S\!S)/k}{R\!S\!S/[n-(k+1)]} =\frac{R^2/k}{(1-R^2)/[n-(k+1)]}\]
The \(c\) subscript after \(F\) simply signifies that it is the computed value of \(F\), the one you see in the regression table. This is to contrast it with \(F_\alpha\), the value to which you compare it in making the hypothesis that one or more variables in the model are significant. You don’t need to compute \(F_\alpha\) since the \(p\)-value associated with \(F_c\) gives you enough information to reject or fail to reject the null hypothesis that none of the variables in the model are significant. If you want to calculate it anyway, you can say the following in R:
The preceding calculation assumes you choose 0.05 as the alpha level and that there are \(k=2\) parameters estimated in the model and that \(n-(k+1)=29\). These are the numerator and denominator degrees of freedom.
| Model 1 | |
|---|---|
| (Intercept) | 12.30 |
| (18.72) | |
| cyl | -0.11 |
| (1.05) | |
| disp | 0.01 |
| (0.02) | |
| hp | -0.02 |
| (0.02) | |
| drat | 0.79 |
| (1.64) | |
| wt | -3.72 |
| (1.89) | |
| qsec | 0.82 |
| (0.73) | |
| vs | 0.32 |
| (2.10) | |
| am | 2.52 |
| (2.06) | |
| gear | 0.66 |
| (1.49) | |
| carb | -0.20 |
| (0.83) | |
| R2 | 0.87 |
| Adj. R2 | 0.81 |
| Num. obs. | 32 |
| ***p < 0.001; **p < 0.01; *p < 0.05 | |

If they weren’t nice or handsome I wouldn’t date them!

There is a heuristic called Occam’s Razor, named after William of Occam (inventer, not popularizer).
If two explanations have the same explanatory power, you should prefer the simpler one. In this context, simpler means fewer variables.
We saw how interaction terms help the model fit the data when relationship (slope) between \(x_1\) and \(y\) depends on \(x_2\).
| Model 1 | Model 2 | |
|---|---|---|
| (Intercept) | 39.69*** | 54.31*** |
| (1.71) | (6.13) | |
| wt | -3.19*** | -8.66*** |
| (0.76) | (2.32) | |
| cyl | -1.51** | -3.80*** |
| (0.41) | (1.01) | |
| wt:cyl | 0.81* | |
| (0.33) | ||
| R2 | 0.83 | 0.86 |
| Adj. R2 | 0.82 | 0.85 |
| Num. obs. | 32 | 32 |
| ***p < 0.001; **p < 0.01; *p < 0.05 | ||
Marginal effects plot for the relationship between weight and gas mileage depending on cylinders
| Model 1 | Model 2 | |
|---|---|---|
| (Intercept) | 30.29*** | 5.55 |
| (7.32) | (12.63) | |
| wt | -4.78*** | 3.88 |
| (0.80) | (3.80) | |
| drat | 1.44 | 8.49* |
| (1.46) | (3.32) | |
| wt:drat | -2.54* | |
| (1.09) | ||
| R2 | 0.76 | 0.80 |
| Adj. R2 | 0.74 | 0.78 |
| Num. obs. | 32 | 32 |
| ***p < 0.001; **p < 0.01; *p < 0.05 | ||
Plot of diamonds data with linear and gam smoother
Plot of diamonds data with polynomial and gam smoother
Marginal effects plot for polyomial model of carat.
Plot of diamonds data with polynomial and gam smoother
Marginal effects plot for polyomial model of carat.
When you conduct a linear regression, you are generally looking for the linear effects of input variables on output variables. For example, consider the diamonds data frame that is automatically installed with ggplot. The price output variable may be influenced by a number of input variables, such as cut, carat, and color. It happens that R interprets two of these as ordered factors and does something interesting with them. Let’s conduct a regression of price on cut.
Call:
lm(formula = price ~ cut, data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-4258 -2741 -1494 1360 15348
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4062.24 25.40 159.923 < 0.0000000000000002 ***
cut.L -362.73 68.04 -5.331 0.000000098033 ***
cut.Q -225.58 60.65 -3.719 0.0002 ***
cut.C -699.50 52.78 -13.253 < 0.0000000000000002 ***
cut^4 -280.36 42.56 -6.588 0.000000000045 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3964 on 53935 degrees of freedom
Multiple R-squared: 0.01286, Adjusted R-squared: 0.01279
F-statistic: 175.7 on 4 and 53935 DF, p-value: < 0.00000000000000022
Note that the output includes four independent variables instead of just cut. The first one, cut.L, represents cut. The L stands for linear and estimates the linear effect of cut on price. The other three are completely independent of the linear effect of cut on price. The next, cut.Q, estimates the quadratic effect of cut on price. The third, cut.C, estimates the cubic effect of cut on price. The fourth, cut^4 estimates the effect of a fourth degree polynomial contrast of cut on price. Together, all four of these terms are orthogonal polynomial contrasts. They are chosen by R to be independent of each other since a mixture could reveal spurious effects. Why did R stop at four such contrasts? Let’s examine cut further.
You can see that cut has five levels. R automatically chooses \(\text{level}-1\) polynomial contrasts when presented with an ordered factor. How can you make R stop doing this if you don’t care about the nonlinear effects? You can present the factor as unordered without altering the factor as it is stored. Then your regression output will list one term for each level of the factor, estimating the effect of that level on the output variable.
Call:
lm(formula = price ~ cut, data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-4258 -2741 -1494 1360 15348
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4358.76 98.79 44.122 < 0.0000000000000002 ***
cutGood -429.89 113.85 -3.776 0.000160 ***
cutVery Good -377.00 105.16 -3.585 0.000338 ***
cutPremium 225.50 104.40 2.160 0.030772 *
cutIdeal -901.22 102.41 -8.800 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3964 on 53935 degrees of freedom
Multiple R-squared: 0.01286, Adjusted R-squared: 0.01279
F-statistic: 175.7 on 4 and 53935 DF, p-value: < 0.00000000000000022
By the way, you can also say it in one line as:
Call:
lm(formula = price ~ factor(cut, ordered = F), data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-4258 -2741 -1494 1360 15348
Coefficients:
Estimate Std. Error t value
(Intercept) 4358.76 98.79 44.122
factor(cut, ordered = F)Good -429.89 113.85 -3.776
factor(cut, ordered = F)Very Good -377.00 105.16 -3.585
factor(cut, ordered = F)Premium 225.50 104.40 2.160
factor(cut, ordered = F)Ideal -901.22 102.41 -8.800
Pr(>|t|)
(Intercept) < 0.0000000000000002 ***
factor(cut, ordered = F)Good 0.000160 ***
factor(cut, ordered = F)Very Good 0.000338 ***
factor(cut, ordered = F)Premium 0.030772 *
factor(cut, ordered = F)Ideal < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3964 on 53935 degrees of freedom
Multiple R-squared: 0.01286, Adjusted R-squared: 0.01279
F-statistic: 175.7 on 4 and 53935 DF, p-value: < 0.00000000000000022
Notice that generating the orthogonal polynomial contrasts does not alter the linear model in any way. It’s just extra information. Both models produce the same \(R^2\) and the same \(F\)-statistic and the same \(p\)-value of the \(F\)-statistic.
It is VERY important to note that there is an assumption in the generation of these orthogonal polynomial contrasts. They assume that the differences between the levels of the ordered factor are evenly spaced. If the levels are not evenly spaced, the information provided will be misleading. Take for instance the world’s top three female sprinters. I read an article claiming that the difference between the top two (Richardson and Jackson at the time of this writing) was much smaller than the difference between the second and third. There are many statistical tools that use ranks, such as 1, 2, and 3, as ordered factors. Here is a case where that would be misleading.
How can you use this information? In a basic course like this, the information is not particularly useful and we will not pursue it. If you go on in statistics or data science, you will soon encounter situations where nonlinear effects matter a great deal.
For example, this often arises in age-period-cohort analysis, where you want to separate the effects of people’s ages, usually divided into evenly spaced levels, and other numerical factors such as income, also usually divided into evenly spaced levels, and the effects of significant periods, such as the economic collapse of 2008 or the pandemic beginning in 2019, and finally the effects of being a member of a cohort or identifiable group. This kind of analysis is often conducted by epidemiologists, people with very large groups of customers, policymakers, and others concerned with large groups of people, or any “objects” with large numbers of attributes.
Note that zero and one happen to be the boundaries of a probability measure. Hence, you can use the logistic function to map arbitrary numbers to a probability.
[1] "job_ad_id" "job_city" "job_industry"
[4] "job_type" "job_fed_contractor" "job_equal_opp_employer"
[7] "job_ownership" "job_req_any" "job_req_communication"
[10] "job_req_education" "job_req_min_experience" "job_req_computer"
[13] "job_req_organization" "job_req_school" "received_callback"
[16] "firstname" "race" "gender"
[19] "years_college" "college_degree" "honors"
[22] "worked_during_school" "years_experience" "computer_skills"
[25] "special_skills" "volunteer" "military"
[28] "employment_holes" "has_email_address" "resume_quality"
m <- glm(received_callback ~ honors + race + gender,
data = resume,
family = "binomial"
)
m |> knitreg()<table class="texreg" style="margin: 10px auto;border-collapse: collapse;border-spacing: 0px;caption-side: bottom;color: #000000;border-top: 2px solid #000000;">
<caption>Statistical models</caption>
<thead>
<tr>
<th style="padding-left: 5px;padding-right: 5px;"> </th>
<th style="padding-left: 5px;padding-right: 5px;">Model 1</th>
</tr>
</thead>
<tbody>
<tr style="border-top: 1px solid #000000;">
<td style="padding-left: 5px;padding-right: 5px;">(Intercept)</td>
<td style="padding-left: 5px;padding-right: 5px;">-2.71<sup>***</sup></td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;"> </td>
<td style="padding-left: 5px;padding-right: 5px;">(0.09)</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">honors</td>
<td style="padding-left: 5px;padding-right: 5px;">0.86<sup>***</sup></td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;"> </td>
<td style="padding-left: 5px;padding-right: 5px;">(0.18)</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">racewhite</td>
<td style="padding-left: 5px;padding-right: 5px;">0.44<sup>***</sup></td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;"> </td>
<td style="padding-left: 5px;padding-right: 5px;">(0.11)</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">genderm</td>
<td style="padding-left: 5px;padding-right: 5px;">-0.11</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;"> </td>
<td style="padding-left: 5px;padding-right: 5px;">(0.13)</td>
</tr>
<tr style="border-top: 1px solid #000000;">
<td style="padding-left: 5px;padding-right: 5px;">AIC</td>
<td style="padding-left: 5px;padding-right: 5px;">2697.15</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">BIC</td>
<td style="padding-left: 5px;padding-right: 5px;">2723.11</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">Log Likelihood</td>
<td style="padding-left: 5px;padding-right: 5px;">-1344.57</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">Deviance</td>
<td style="padding-left: 5px;padding-right: 5px;">2689.15</td>
</tr>
<tr style="border-bottom: 2px solid #000000;">
<td style="padding-left: 5px;padding-right: 5px;">Num. obs.</td>
<td style="padding-left: 5px;padding-right: 5px;">4870</td>
</tr>
</tbody>
<tfoot>
<tr>
<td style="font-size: 0.8em;" colspan="2"><sup>***</sup>p < 0.001; <sup>**</sup>p < 0.01; <sup>*</sup>p < 0.05</td>
</tr>
</tfoot>
</table>
Marginal effects plots translates the coefficient into probabilities.
#| code-fold: true
#| fig.cap: "Example logistic regression model"
m <- glm(received_callback ~ honors + race + gender,
data = resume,
family = "binomial"
)
m |> knitreg()Linear regression models are fit via “least squares”.
This is equivalent to maximizing the likelihood of a normal model.
For logistic regression:
\[\hat{L} = \prod_{k;y_k=1}^N{y^*_i}\prod_{k:y_k=0}{(1-y^*{i})}\]
One easy way to compare these models is by comparing the values of AIC, the Akaike Information Criterion.
\[AIC = 2k - 2ln(\hat{L})\]
This measures the loss of information, so a lower AIC value corresponds to a more predictive model.
\[e^{\text{AIC}_{\text{min}}-\text{AIC}_{\text{alternative}}/2}\]
\[BIC = k ln(n) - 2 ln(\hat{L})\]
Keep in mind that the AIC/BIC is only a tool to compare models, not an absolute measure. There is no such thing as an absolutely good AIC/BIC value. In the above cases, the relative likelihood for the two models with higher AIC values is vanishingly small. The model using honors loses far less information than do the other two.
Another approach is to calculate AICc, Delta_AICc, and AICcWt using R.
m1 <- glm(received_callback ~ honors,data=resume,family="binomial")
m2 <- glm(received_callback ~ race,data=resume,family="binomial")
m3 <- glm(received_callback ~ gender,data=resume,family="binomial")
knitreg(list(m1, m2, m3))| Model 1 | Model 2 | Model 3 | |
|---|---|---|---|
| (Intercept) | -2.50*** | -2.67*** | -2.41*** |
| (0.06) | (0.08) | (0.06) | |
| honors | 0.87*** | ||
| (0.18) | |||
| racewhite | 0.44*** | ||
| (0.11) | |||
| genderm | -0.12 | ||
| (0.13) | |||
| AIC | 2710.72 | 2713.94 | 2730.03 |
| BIC | 2723.70 | 2726.92 | 2743.01 |
| Log Likelihood | -1353.36 | -1354.97 | -1363.02 |
| Deviance | 2706.72 | 2709.94 | 2726.03 |
| Num. obs. | 4870 | 4870 | 4870 |
| ***p < 0.001; **p < 0.01; *p < 0.05 | |||
pacman::p_load(AICcmodavg)
models <- list(m1,m2,m3)
model_names <- c("honors", "race", "gender")
aictab(cand.set = models, modnames = model_names)Model selection based on AICc:
K AICc Delta_AICc AICcWt Cum.Wt LL
honors 2 2710.73 0.00 0.83 0.83 -1353.36 race 2 2713.94 3.21 0.17 1.00 -1354.97 gender 2 2730.03 19.31 0.00 1.00 -1363.02
A common rule, discredited by Anderson (2008) is that, if Delta_AICc is greater than 2, the model should be discarded. But it is not the case that any model can be judged as good or bad based on AIC. Instead, the AIC tells us only relative information about the models: which is better from an information loss standpoint. The AICcWt or Akaike weight, is the probability that the given model is the best of the models in the table from an information loss standpoint. Here there is a clear winner. The difference between 83 percent and seventeen percent is striking.
AIC / BIC try to balance the trade-off between parsimony and predictive performance after models have been specified and fit.
An alternative is to let likelihood maximization select which variables to include.
This is called “regularization” or “penalized likelihood”.
There are several penalities you might choose, popular ones are “LASSO” and “Ridge Regression”.
\[\min_{ \beta_0, \beta } \biggl\{ \sum_{i=1}^N \bigl(y_i - \beta_0 - x_i^\intercal \beta\bigr)^2 \biggr\}\] subject to \[\sum_{j=1}^p |\beta_j| \leq t. \]
\(t\) is a “free parameter”/ hyperparameter controlling the penalty. Specified by the modeler, not learned from the data.
\[\min_{ \beta_0, \beta } \biggl\{ \sum_{i=1}^N \bigl(y_i - \beta_0 - x_i^\intercal \beta\bigr)^2 \biggr\}\] subject to \[\sum_{j=1}^p \beta_j^2 \leq t. \]
\(\lambda\) is the hyperparameter controlling the amount of penalty.
L1/LASSO can set coefficients to 0. It tends to do so.
L2/Ridge regression does not set coefficients to 0.
Not exclusive, you can use both called “elastic net”.
tidymodelsDatacamp shows a different way, using tidymodels in one of their tutorials.
The bank wants to divide customers into those likely to buy and those unlikely to buy some banking product.
They would like to divide the customers into these two groups using logistic regression, with a cutoff point of fifty-fifty.
If there’s better than a fifty-fifty chance, they will send a salesperson but if there’s less than a fifty-fifty chance, they won’t send a salesperson.
You train the model on one set of data, validate it on another, then test it on another, previously unseen set. The next thing in this example is training and testing.
m |> extract_fit_engine() |>
tidy() |>
rename(penalty = lambda) |> # <- for consistent naming
filter(term != "(Intercept)")# A tibble: 4,200 × 5
term step estimate penalty dev.ratio
<chr> <dbl> <dbl> <dbl> <dbl>
1 age 1 7.24e-40 126. 1.70e-13
2 age 2 6.22e- 6 115. 1.12e- 3
3 age 3 6.83e- 6 105. 1.23e- 3
4 age 4 7.49e- 6 95.4 1.35e- 3
5 age 5 8.22e- 6 86.9 1.49e- 3
6 age 6 9.02e- 6 79.2 1.63e- 3
7 age 7 9.89e- 6 72.2 1.79e- 3
8 age 8 1.09e- 5 65.7 1.96e- 3
9 age 9 1.19e- 5 59.9 2.15e- 3
10 age 10 1.31e- 5 54.6 2.36e- 3
# ℹ 4,190 more rows
# A tibble: 43 × 3
term estimate penalty
<chr> <dbl> <dbl>
1 (Intercept) -2.30 0.1
2 age 0.000638 0.1
3 jobblue-collar -0.124 0.1
4 jobentrepreneur -0.104 0.1
5 jobhousemaid -0.118 0.1
6 jobmanagement 0.0318 0.1
7 jobretired 0.291 0.1
8 jobself-employed -0.0104 0.1
9 jobservices -0.0592 0.1
10 jobstudent 0.389 0.1
# ℹ 33 more rows
.metric .estimator .estimate
How can we choose
There are aspects of this approach, called hyperparameters, that influence the quality of the model. It can be tedious to adjust these aspects, called penalty and mixture, so here’s a technique for doing it automatically. You’ll learn about this and similar techniques if you take a more advanced course like 310D, Intro to Data Science.
# . Define the grid search for the hyperparameters
grid <- grid_regular(mixture(), penalty(), levels = c(mixture = 4, penalty = 3))
grid# A tibble: 12 × 2
mixture penalty
<dbl> <dbl>
1 0 0.0000000001
2 0.333 0.0000000001
3 0.667 0.0000000001
4 1 0.0000000001
5 0 0.00001
6 0.333 0.00001
7 0.667 0.00001
8 1 0.00001
9 0 1
10 0.333 1
11 0.667 1
12 1 1
Truth
Prediction no yes
no 7838 738
yes 147 320
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision binary 0.914
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 recall binary 0.982
Following are two tables from James et al. (2021) that you can use to evaluate a classification model.
Another view is provided at Wikipedia in the following picture
# A tibble: 10 × 3
term estimate penalty
<chr> <dbl> <dbl>
1 (Intercept) -2.59 0.0000000001
2 poutcomesuccess 2.08 0.0000000001
3 monthmar 1.62 0.0000000001
4 monthoct 1.08 0.0000000001
5 monthsep 1.03 0.0000000001
6 contactunknown -1.01 0.0000000001
7 monthdec 0.861 0.0000000001
8 monthjan -0.820 0.0000000001
9 housingyes -0.550 0.0000000001
10 monthnov -0.517 0.0000000001
ggplot(coeff, aes(x = term, y = estimate, fill = term)) + geom_col() + coord_flip()
This slideshow was produced using quarto
Fonts are Roboto Condensed Bold, JetBrains Mono Nerd Font, and STIX2
