Robust Regression: Dealing with violated assumptions
2025-04-17
Linear Regression Assumptions
- Linearity: the data should follow a linear trend (there are advanced regression methods for nonlinear relationships).
- Normality of residuals: the residuals are approximately normally distributed (evaluated with a QQ plot).
- Constant variability: the residuals don't follow a pattern, the most common being a right-facing trumpet or funnel.
- Independent random observations: least squares usually shouldn't be applied to seasonal data, for example, because its structure can be modeled as a time series.
The best way to fix residual issues is to improve the model's fit to the data. We can also use robust standard errors.
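These diagnostics are straightforward to check in R. A minimal sketch, using the built-in mtcars data as an assumed stand-in:

```r
# Fit a simple model and inspect the standard diagnostic plots.
fit <- lm(mpg ~ wt + hp, data = mtcars)

par(mfrow = c(1, 2))
plot(fit, which = 1)  # residuals vs. fitted: linearity, constant variance
plot(fit, which = 2)  # normal QQ plot of residuals: normality
```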
Dealing with Assumption Violations
- Option 1: Accept limitations. No analysis is perfect.
- Option 2: Improve the model to address limitations.
- Tradeoff: Improving models increases complexity and hurts interpretability.
- Goal: An analysis that is robust to limitations.
Key question: Bias or Variance?
If an assumption violation leads to bias, it can be more important to address.
If the concern is variance, a large effect relative to the sample size goes a long way.
Nonlinear relationships
*Plot of diamonds data with linear and GAM smoothers*
Nonlinear relationships (polynomial models)
*Plot of diamonds data with polynomial and GAM smoothers*
The polynomial fits the data a bit better, but interpretation becomes complicated, and out-of-sample predictions are poor.
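A rough sketch of how this comparison might be drawn with ggplot2 (the carat-vs-price mapping and the cubic degree are assumptions, not the original code):

```r
library(ggplot2)

# Overlay a cubic polynomial fit and a GAM smoother on the diamonds data.
ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(aes(color = "polynomial"), method = "lm",
              formula = y ~ poly(x, 3), se = FALSE) +
  geom_smooth(aes(color = "GAM"), method = "gam",
              formula = y ~ s(x), se = FALSE)
```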
Special case: Linear probability model
Statistical models

|                  | Model 1 (logistic) | Model 2 (linear probability) |
|:-----------------|:-------------------|:-----------------------------|
| (Intercept)      | -2.74***           | 0.06***                      |
|                  | (0.10)             | (0.01)                       |
| honors           | 0.76***            | 0.08***                      |
|                  | (0.18)             | (0.02)                       |
| years_experience | 0.03***            | 0.00***                      |
|                  | (0.01)             | (0.00)                       |
| genderm          | -0.09              | -0.01                        |
|                  | (0.13)             | (0.01)                       |
| AIC              | 2702.30            |                              |
| BIC              | 2728.26            |                              |
| Log Likelihood   | -1347.15           |                              |
| Deviance         | 2694.30            |                              |
| Num. obs.        | 4870               | 4870                         |
| R²               |                    | 0.01                         |
| Adj. R²          |                    | 0.01                         |

***p < 0.001; **p < 0.01; *p < 0.05
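A sketch of how the two columns might be fit (Model 1 appears to be a logistic regression and Model 2 a linear probability model; the outcome and data-frame names are hypothetical, inferred from the coefficient labels):

```r
# Hypothetical names: `promoted` (binary outcome) and `salaries` (data).
logit_fit <- glm(promoted ~ honors + years_experience + gender,
                 data = salaries, family = binomial)
lpm_fit   <- lm(promoted ~ honors + years_experience + gender,
                data = salaries)
```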
Assumption violations all over!
But we still get essentially the same results.
So why logistic regression?
- The linear probability model can make impossible predictions.
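Continuing the hypothetical fits above, this is easy to check: linear probability model fitted values are not constrained to [0, 1], while logistic fitted probabilities always are.

```r
range(fitted(lpm_fit))    # can fall below 0 or above 1
range(fitted(logit_fit))  # always strictly within (0, 1)
```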
Nonconstant Variance (Heteroskedasticity)
```
# A tibble: 2 × 2
  cond     var
1 new     55.3
2 used  1072.
```
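A grouped summary like this could be produced as follows (assuming the mariokart data from the openintro package, with sale price in total_pr):

```r
library(dplyr)
library(openintro)

# Variance of sale price by condition (new vs. used).
mariokart |>
  group_by(cond) |>
  summarize(var = var(total_pr))
```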
Sandwich to the rescue
- Robust standard errors, also known as “sandwich” estimators, adjust the standard errors to account for nonconstant variance.
```
$Standard

t test of coefficients:

                   Estimate Std. Error t value             Pr(>|t|)    
(Intercept)         53.7707     3.3289 16.1528 <0.0000000000000002 ***
mariokart$condused  -6.6226     4.3434 -1.5248              0.1296    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

$Sandwich

t test of coefficients:

                   Estimate Std. Error t value              Pr(>|t|)    
(Intercept)        53.77068    0.95973 56.0269 < 0.0000000000000002 ***
mariokart$condused -6.62258    3.67853 -1.8003               0.07395 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
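Output in this form can be produced with the sandwich and lmtest packages. A minimal sketch (vcovHC's default HC3 estimator is an assumption; the original may use a different type):

```r
library(sandwich)
library(lmtest)

# Classical vs. heteroskedasticity-consistent ("sandwich") standard errors.
fit <- lm(total_pr ~ cond, data = mariokart)
list(
  Standard = coeftest(fit),
  Sandwich = coeftest(fit, vcov = vcovHC(fit))
)
```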
Another example
Diamonds are clustered by cut and clarity, so observations within the same cluster are not independent.
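One option in this situation is cluster-robust (sandwich) standard errors. A sketch using sandwich::vcovCL, where the price ~ carat model is an assumed example:

```r
library(sandwich)
library(lmtest)

# Cluster-robust standard errors, clustering on cut and clarity.
fit <- lm(price ~ carat, data = ggplot2::diamonds)
coeftest(fit, vcov = vcovCL(fit, cluster = ~ cut + clarity))
```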
Parting thoughts
- One reason linear regression is popular is that these robust standard errors work.
- Dealing with heteroskedasticity in other models requires different techniques.
- Note that constant variance is an assumption about the standard errors, not about model bias.
- Violations of the “linearity” assumption (that the model fits the data) cause bias.