Robust Regression: Dealing with violated assumptions

Nathan TeBlunthuis

2025-04-17

Linear Regression Assumptions

  • Linearity: the data should follow a linear trend (there are advanced regression methods for non-linear relationships).
  • Normality of residuals: the residuals are approximately normally distributed (evaluated with a QQ plot).
  • Constant variability: the residuals don’t follow a pattern (the most common being a right-opening trumpet or funnel).
  • Independent random observations: we usually don’t apply least squares to seasonal data, for example, because its structure is better modeled as a time series.

The best way to fix residual issues is to improve model fit to the data. We can also use robust standard errors.
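As a minimal sketch (in Python with numpy, though this deck's examples are in R), here is one way to fit a least-squares line on simulated data that satisfies the assumptions and then eyeball the constant-variance assumption by comparing residual spread across the fitted range. The data and the half-split check are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data that satisfies the assumptions:
# linear trend, independent normal errors, constant spread
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 200)

# Least-squares fit
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Crude constant-variance check: compare residual spread in the
# lower and upper halves of the fitted values
order = np.argsort(X @ beta)
lo, hi = resid[order[:100]], resid[order[100:]]
print(beta, np.var(lo), np.var(hi))
```

In practice you would look at a residuals-vs-fitted plot and a QQ plot rather than a two-group split; the split just makes the "no pattern" idea concrete.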


Dealing with Assumption Violations

Option 1
Accept limitations. No analysis is perfect.
Option 2
Improve model to address limitations.
Tradeoff
Improving models increases complexity and hurts interpretability.
Goal
An analysis that is robust to limitations.

Key question: Bias or Variance?

If an assumption violation leads to bias, it is usually more important to address.

If the concern is only extra variance, a large effect size relative to the sample size goes a long way.

Linearity

Nonlinear relationships

Figure: diamonds data with linear and GAM smoothers.

Nonlinear relationships (polynomial models)

Figure: diamonds data with polynomial and GAM smoothers.

The polynomial fits the data a bit better, but interpretation becomes complicated, and out-of-sample predictions are poor.
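To see why out-of-sample predictions from polynomials are poor, here is a hedged sketch in Python/numpy (the deck's figures use R and the diamonds data; this uses a simulated curved relationship instead): a cubic tracks the data well inside the observed range but goes badly wrong when extrapolated beyond it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated curved-but-monotone relationship (a stand-in for price vs. carat)
x = rng.uniform(0.2, 2.0, 300)
y = np.exp(1.5 * x) + rng.normal(0, 1, 300)

# A cubic polynomial fits well inside the observed range...
coef = np.polyfit(x, y, deg=3)
in_sample_mse = np.mean((np.polyval(coef, x) - y) ** 2)

# ...but extrapolating beyond it misses badly
x_new = 4.0
pred = np.polyval(coef, x_new)
truth = np.exp(1.5 * x_new)
print(in_sample_mse, pred, truth)
```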

Special case: Linear probability model

<table class="texreg" style="margin: 10px auto;border-collapse: collapse;border-spacing: 0px;caption-side: bottom;color: #000000;border-top: 2px solid #000000;">
<caption>Logistic regression (Model 1) and linear probability model (Model 2)</caption>
<thead>
<tr>
<th style="padding-left: 5px;padding-right: 5px;">&nbsp;</th>
<th style="padding-left: 5px;padding-right: 5px;">Model 1</th>
<th style="padding-left: 5px;padding-right: 5px;">Model 2</th>
</tr>
</thead>
<tbody>
<tr style="border-top: 1px solid #000000;">
<td style="padding-left: 5px;padding-right: 5px;">(Intercept)</td>
<td style="padding-left: 5px;padding-right: 5px;">-2.74<sup>&#42;&#42;&#42;</sup></td>
<td style="padding-left: 5px;padding-right: 5px;">0.06<sup>&#42;&#42;&#42;</sup></td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
<td style="padding-left: 5px;padding-right: 5px;">(0.10)</td>
<td style="padding-left: 5px;padding-right: 5px;">(0.01)</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">honors</td>
<td style="padding-left: 5px;padding-right: 5px;">0.76<sup>&#42;&#42;&#42;</sup></td>
<td style="padding-left: 5px;padding-right: 5px;">0.08<sup>&#42;&#42;&#42;</sup></td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
<td style="padding-left: 5px;padding-right: 5px;">(0.18)</td>
<td style="padding-left: 5px;padding-right: 5px;">(0.02)</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">years_experience</td>
<td style="padding-left: 5px;padding-right: 5px;">0.03<sup>&#42;&#42;&#42;</sup></td>
<td style="padding-left: 5px;padding-right: 5px;">0.00<sup>&#42;&#42;&#42;</sup></td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
<td style="padding-left: 5px;padding-right: 5px;">(0.01)</td>
<td style="padding-left: 5px;padding-right: 5px;">(0.00)</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">genderm</td>
<td style="padding-left: 5px;padding-right: 5px;">-0.09</td>
<td style="padding-left: 5px;padding-right: 5px;">-0.01</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
<td style="padding-left: 5px;padding-right: 5px;">(0.13)</td>
<td style="padding-left: 5px;padding-right: 5px;">(0.01)</td>
</tr>
<tr style="border-top: 1px solid #000000;">
<td style="padding-left: 5px;padding-right: 5px;">AIC</td>
<td style="padding-left: 5px;padding-right: 5px;">2702.30</td>
<td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">BIC</td>
<td style="padding-left: 5px;padding-right: 5px;">2728.26</td>
<td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">Log Likelihood</td>
<td style="padding-left: 5px;padding-right: 5px;">-1347.15</td>
<td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">Deviance</td>
<td style="padding-left: 5px;padding-right: 5px;">2694.30</td>
<td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">Num. obs.</td>
<td style="padding-left: 5px;padding-right: 5px;">4870</td>
<td style="padding-left: 5px;padding-right: 5px;">4870</td>
</tr>
<tr>
<td style="padding-left: 5px;padding-right: 5px;">R<sup>2</sup></td>
<td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
<td style="padding-left: 5px;padding-right: 5px;">0.01</td>
</tr>
<tr style="border-bottom: 2px solid #000000;">
<td style="padding-left: 5px;padding-right: 5px;">Adj. R<sup>2</sup></td>
<td style="padding-left: 5px;padding-right: 5px;">&nbsp;</td>
<td style="padding-left: 5px;padding-right: 5px;">0.01</td>
</tr>
</tbody>
<tfoot>
<tr>
<td style="font-size: 0.8em;" colspan="3"><sup>&#42;&#42;&#42;</sup>p &lt; 0.001; <sup>&#42;&#42;</sup>p &lt; 0.01; <sup>&#42;</sup>p &lt; 0.05</td>
</tr>
</tfoot>
</table>

Assumption violations all over!

But we still get essentially the same results.

So why logistic regression?

  • The linear probability model can make impossible predictions. For example, the model above produces a fitted “probability” greater than 1 for one observation:

       1 
1.127822 
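A small simulation (Python/numpy; the deck's model is fit in R, and the data here are invented) shows how this happens: OLS fits a straight line to a 0/1 outcome whose true probability saturates near 1, so fitted values at the extremes escape the [0, 1] interval.

```python
import numpy as np

rng = np.random.default_rng(2)

# Binary outcome whose probability rises with x and saturates near 1
x = rng.uniform(0, 30, 1000)
p = 1 / (1 + np.exp(-(x - 10) / 3))      # true logistic probabilities
y = (rng.random(1000) < p).astype(float)  # observed 0/1 outcomes

# Linear probability model: OLS on the 0/1 outcome
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The fitted "probability" at a large x exceeds 1
pred = beta[0] + beta[1] * 30
print(pred)
```

Logistic regression avoids this by modeling the log-odds, which maps any linear predictor back into (0, 1).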

Nonconstant Variance (Heteroskedasticity)

Example

# A tibble: 2 × 2
  cond    var
1 new    55.3
2 used 1072.
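The flavor of this comparison in Python/numpy (the prices below are made up to mimic the mariokart example, not the real data): the sample variance of price within the used group dwarfs that of the new group.

```python
import numpy as np

# Toy auction prices: new-game prices cluster tightly,
# used-game prices are all over the place (illustrative values only)
new = np.array([54.0, 56.5, 53.2, 55.9, 54.8])
used = np.array([22.0, 51.0, 75.0, 40.0, 118.0])

# Sample variances (ddof=1), analogous to grouped var() in R
print(np.var(new, ddof=1), np.var(used, ddof=1))
```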

Sandwich to the rescue

  • Robust (“sandwich”) standard error estimators adjust the standard errors to account for nonconstant variance.
$Standard

t test of coefficients:

                   Estimate Std. Error t value            Pr(>|t|)    
(Intercept)         53.7707     3.3289 16.1528 <0.0000000000000002 ***
mariokart$condused  -6.6226     4.3434 -1.5248              0.1296    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


$Sandwich

t test of coefficients:

                   Estimate Std. Error t value             Pr(>|t|)    
(Intercept)        53.77068    0.95973 56.0269 < 0.0000000000000002 ***
mariokart$condused -6.62258    3.67853 -1.8003              0.07395 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
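The sandwich formula itself is easy to sketch. Assuming the simple HC0 variant (the R output above may use a finite-sample-corrected variant such as HC3, so its numbers would differ), the robust covariance is (X'X)^-1 X' diag(e^2) X (X'X)^-1: the "bread" around the "meat". A Python/numpy sketch on simulated heteroskedastic data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated heteroskedastic data: error spread grows sharply with x
n = 2000
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1 + 0.1 * x**2, n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta

# Classical OLS standard errors assume constant error variance
XtX_inv = np.linalg.inv(X.T @ X)
s2 = e @ e / (n - X.shape[1])
se_classical = np.sqrt(np.diag(s2 * XtX_inv))

# HC0 sandwich: "bread" (X'X)^-1 around "meat" X' diag(e^2) X
meat = X.T @ (e[:, None] ** 2 * X)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(se_classical, se_robust)
```

Because the error variance here grows with x, the robust standard error for the slope comes out larger than the classical one, matching the pattern in the mariokart output above where the sandwich estimator changed the standard errors without touching the coefficients.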

Another example

Diamonds are clustered by cut and clarity.

Parting thoughts

  • One reason linear regression is popular is that these robust standard errors work well.
  • Dealing with heteroskedasticity in other models requires different techniques.
  • Note that constant variance is an assumption about the standard errors, not about model bias.
  • Violations of the “linearity” assumption (that the model fits the data) do cause bias.