Episode 42 — Apply linear regression well: assumptions, diagnostics, ridge, LASSO, elastic net

In this episode, we move from the idea of explaining models to one of the most foundational predictive tools you will meet, which is linear regression, and we focus on what it means to use it well rather than just run it and accept the output. Linear regression is popular because it is conceptually simple, fast, and often surprisingly effective, but beginners can get burned because the model can look confident even when its assumptions are badly violated. When that happens, the numbers you get may still look neat and precise, yet the conclusions can be unreliable or even backwards. The goal here is to build a strong mental model of what linear regression is trying to do, what it quietly assumes about the world, and how you check whether those assumptions are reasonable for your data. We also connect that to regularization methods like ridge, LASSO, and elastic net, because these are practical tools for making regression more stable when real-world data is messy.

Before we continue, a quick note: this audio course is a companion to our course companion books. The first book is about the exam and provides detailed guidance on how best to pass it. The second book is a Kindle-only eBook that contains 1,000 flashcards that can be used on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.

At its core, linear regression tries to describe a relationship between one or more input variables and a continuous output by fitting a straight-line pattern, even when there are many inputs. The straight line idea does not mean the world is literally straight; it means we are approximating the output as a weighted sum of inputs plus an intercept term that captures a baseline. Those weights are learned from data, and the training process typically tries to minimize the average size of prediction errors, often by minimizing squared error so that larger mistakes count more. The simplicity of this setup is a strength because it is easy to interpret and quick to fit, but the simplicity is also a risk because it can hide important complexity, like non-linear effects or interactions between variables. A beginner-friendly way to think about it is that the model is trying to find the best flat surface through a cloud of points in a multi-dimensional space, and the quality of that surface depends heavily on whether the data behaves in a way that supports that kind of approximation.
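To make the weighted-sum idea concrete, here is a minimal numpy sketch, with synthetic data invented purely for illustration: we generate points from a known weighted sum plus noise, then recover the weights and intercept by minimizing squared error.

```python
import numpy as np

# Synthetic data: two inputs, a known weighted sum, an intercept, and noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.5 + rng.normal(scale=0.1, size=100)

# Append a column of ones so the intercept is learned like any other weight,
# then solve for the coefficients that minimize the sum of squared errors.
Xb = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
intercept, w = coef[0], coef[1:]
```

Because the noise here is small and well behaved, the recovered weights land close to the ones used to generate the data; with messier real-world data the gap widens, which is where diagnostics matter.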

The first major assumption to understand is the idea of linearity, which means the model is assuming the expected output changes in a linear way as inputs change, holding other inputs constant. This does not mean every point lies on a perfect line, because noise is allowed, but it does mean the average pattern should be reasonably well captured by a linear trend. When linearity is not true, the model can systematically miss in certain regions, like always underpredicting at high values and overpredicting at low values. Beginners often try to fix this by adding more data or by trusting that the model will somehow adapt, but linear regression cannot magically become non-linear without changing the features you give it. This is why feature design and careful diagnostics matter, because you need to notice when the relationship is curved, piecewise, or dependent on interactions. If you miss that, your model might still produce a decent overall error, but the errors will have structure, and structured errors are a signal that the model form is mismatched to the data-generating process.
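A small sketch of the "structured errors" point, again with invented data: the true relationship is quadratic, so a plain linear fit misses badly, but adding a squared feature restores linearity in the parameters, which is all linear regression actually requires.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = x**2 + rng.normal(scale=0.1, size=200)   # truly quadratic relationship

def fit_mse(F, y):
    # Fit by least squares (with intercept) and return mean squared error.
    Fb = np.column_stack([np.ones(len(F)), F])
    coef, *_ = np.linalg.lstsq(Fb, y, rcond=None)
    resid = y - Fb @ coef
    return np.mean(resid**2)

mse_linear = fit_mse(x.reshape(-1, 1), y)              # misses the curve
mse_quadratic = fit_mse(np.column_stack([x, x**2]), y) # added feature fixes it
```

The model form never changed; only the features did, which is exactly why feature design is the lever for handling non-linearity.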

Another assumption is that the errors, meaning the differences between predictions and true values, are independent of each other. Independence is easier to violate than beginners expect, especially when data is collected over time, over geographic space, or in repeated measurements from the same source. If errors are correlated, the model may still predict okay, but many of the standard confidence measures become misleading because they rely on the idea that each error is a fresh, unrelated draw. A classic example is time series, where yesterday’s error tends to relate to today’s error because the environment evolves smoothly rather than randomly resetting each row. When independence fails, you can end up with a model that seems statistically convincing even though it is simply tracking a trend or repeating a bias that persists across the dataset. The practical lesson is that you should think about how the data was produced, not just what columns you see, because dependence often comes from sampling methods and data collection processes rather than from the model itself.
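One quick, library-free way to probe independence is the lag-1 autocorrelation of residuals: near zero when errors are fresh draws, clearly positive when they carry over from row to row. This sketch simulates AR(1)-style errors, invented here to mimic time-series data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
t = np.arange(n)

# AR(1)-style errors: each error carries over part of the previous one,
# mimicking an environment that evolves smoothly rather than resetting.
e = np.zeros(n)
for i in range(1, n):
    e[i] = 0.8 * e[i - 1] + rng.normal(scale=0.5)
y = 1.0 + 0.05 * t + e

# Fit a linear trend, then check the residuals for serial correlation.
Xb = np.column_stack([np.ones(n), t])
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
resid = y - Xb @ coef
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]  # near 0 if independent
```

Here lag1 comes out strongly positive, which is the signature of correlated errors even though the trend fit itself looks fine.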

A closely related assumption is homoscedasticity, which is a fancy way of saying the variability of errors should be roughly constant across the range of predicted values. If the model is much more wrong for large values than small values, or if error spread grows as the target grows, then the model is violating this assumption. This matters because the model will be pulled toward fitting regions with large variance in a way that can distort the fit, and it can also make your uncertainty estimates unreliable. In real data, heteroscedasticity is common, such as predicting income, sales, or response times, where large values naturally come with larger spread. The beginner move is to not panic when you see this, but to recognize it as a signal to rethink transformations, features, or even the evaluation method. Sometimes the model can still be useful for prediction, but you should be cautious about using it to draw precise conclusions about how much each input matters.
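A crude but instructive funnel check, on simulated data where error spread grows with the target level: if absolute residuals trend upward with the fitted values, homoscedasticity is in doubt.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=400)
# Error spread grows with the level of the target, as with income or sales.
y = 3.0 * x + rng.normal(scale=0.3 * x)

Xb = np.column_stack([np.ones(len(x)), x])
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
fitted = Xb @ coef
resid = y - fitted

# Do absolute residuals trend with the fitted values? Near zero is healthy;
# clearly positive suggests a funnel shape, i.e. heteroscedasticity.
spread_corr = np.corrcoef(fitted, np.abs(resid))[0, 1]
```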

A fourth assumption people often mention is normality of errors, and it is important to handle this one with nuance. Linear regression does not require errors to be perfectly normal in order to make predictions, but normality matters if you are relying on certain statistical tests or confidence intervals that assume a bell-shaped error distribution. Beginners sometimes treat normality as a strict pass or fail rule, but in practice you look for big departures that suggest the model is missing structure, like heavy tails that indicate frequent extreme errors or a skew that indicates systematic bias. If errors are not normal because of a few outliers, that suggests the need for outlier analysis and possibly robust approaches, rather than a blind transformation. If errors are not normal because the relationship is non-linear, that suggests feature changes or a different model family. The key is to treat normality checks as part of the diagnostic story, not as a ritual box-check.
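As a rough numpy-only tail check, assuming no particular statistics library, excess kurtosis distinguishes heavy-tailed residuals from roughly normal ones; the heavy-tailed sample here is simulated from a Student-t distribution purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
resid_heavy = rng.standard_t(3, size=2000)   # heavy-tailed "residuals"
resid_normal = rng.normal(size=2000)         # well-behaved comparison

def excess_kurtosis(r):
    # Standardize, then compare the fourth moment to the normal baseline of 3.
    z = (r - r.mean()) / r.std()
    return np.mean(z**4) - 3.0               # ~0 for a normal distribution

k_heavy = excess_kurtosis(resid_heavy)       # strongly positive: frequent extremes
k_normal = excess_kurtosis(resid_normal)     # near zero
```

Strongly positive excess kurtosis is a cue to investigate outliers or robust fitting, not a reason to apply a transformation blindly.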

Diagnostics are the tools you use to see whether these assumptions are reasonable, and the most beginner-friendly diagnostic is looking at residuals, which are the errors plotted against something meaningful. When residuals are randomly scattered around zero with no clear pattern, that is a good sign that the model is capturing the main structure. When residuals show a curve, a funnel shape, clusters, or repeating waves, that pattern is a clue about what the model is missing. Another diagnostic idea is leverage, which describes points that have unusual input values compared to the rest of the data, meaning they can strongly influence the fitted line even if their output is not extreme. High leverage points can make a model appear to fit well because the model bends toward them, but that can harm generalization. Outliers in the target direction can also distort the fit, especially because squared error penalizes large deviations heavily, which can cause the model to chase extreme points. A strong regression practice is not to automatically delete unusual points, but to understand whether they represent real cases you need to model, errors in data, or special conditions that should be handled separately.
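Leverage can be computed directly from the design matrix via the hat matrix, H = X (XᵀX)⁻¹ Xᵀ, whose diagonal measures how much each point can pull the fit toward itself. A sketch on invented data with one deliberately extreme input point:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 2))
X[0] = [8.0, 8.0]                            # one point far from the rest

Xb = np.column_stack([np.ones(len(X)), X])
# Hat matrix H = X (X'X)^{-1} X'; its diagonal is each point's leverage,
# and the diagonal sums to the number of fitted parameters.
H = Xb @ np.linalg.inv(Xb.T @ Xb) @ Xb.T
leverage = np.diag(H)

# A common rule of thumb: flag points with leverage above 2 * p / n.
p, n = Xb.shape[1], Xb.shape[0]
flagged = np.where(leverage > 2 * p / n)[0]
```

Note the extreme point is flagged because of its unusual inputs alone; its target value never entered the calculation, which is exactly the distinction between leverage and outliers.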

Multicollinearity is another common challenge that can make beginners think linear regression is broken when it is actually telling them something about their features. Multicollinearity means two or more inputs are strongly correlated, so the model has trouble deciding how to split credit between them. The model might still predict well, because it only needs the combination, but the individual coefficients can become unstable and can swing dramatically when you add or remove a feature or a few data points. This is a major reason why interpretability can be tricky, because you might think a coefficient reflects a clear relationship, but it can be an artifact of redundant features. In practical terms, multicollinearity shows up as large changes in coefficients across similar fits, or coefficients that have surprising signs compared to intuition. Rather than treating that as a failure, you should see it as a prompt to simplify features, combine related variables, or use regularization to stabilize the solution.
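A standard way to quantify multicollinearity is the variance inflation factor: regress each feature on the others and compute 1 / (1 - R²). This numpy sketch, with two nearly duplicate synthetic features, shows the idea.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)      # nearly a copy of x1
x3 = rng.normal(size=n)                      # independent feature
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    # Regress feature j on the remaining features; VIF_j = 1 / (1 - R^2).
    # Values above roughly 5-10 are a common signal of troublesome collinearity.
    others = np.delete(X, j, axis=1)
    Xb = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(Xb, X[:, j], rcond=None)
    resid = X[:, j] - Xb @ coef
    r2 = 1 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]  # huge for x1, x2; ~1 for x3
```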

This is where regularization enters as a practical tool, and it is helpful to view it as adding a preference for simpler models while still fitting the data. Regularization works by adding a penalty term to the training objective, meaning the model is not only punished for prediction error but also punished for having large coefficients. The result is a tradeoff: you accept a little more training error to get a model that is less sensitive to noise and less likely to overfit. This is especially valuable when you have many features, when features are correlated, or when the dataset is not huge compared to the number of inputs. Regularization does not magically create new information, but it helps the model avoid extreme coefficient values that arise from trying to fit random quirks in the training data. In an exam mindset, you should remember that regularization is a bias-variance tradeoff tool: it increases bias slightly in exchange for reducing variance and improving stability.

Ridge regression is the regularized version that uses a penalty on the squared size of coefficients, which tends to shrink coefficients smoothly toward zero without usually making them exactly zero. This means ridge is great when you believe many features contribute a little and you want stability in the presence of multicollinearity. Ridge can keep correlated features in the model together, sharing weight between them, rather than picking one and discarding the other. A beginner-friendly intuition is that ridge discourages the model from putting all its trust in one feature unless the data strongly demands it. The coefficients become smaller in magnitude, and predictions can become more reliable on new data, especially when the original least squares solution was unstable. Ridge is not a feature selector in the strict sense, because it does not usually eliminate features, but it can reduce the effective impact of weak predictors. When interpretability matters, ridge can actually help because it reduces wild coefficient swings, but you still need to remember that correlated features can blur the meaning of individual weights.
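Ridge has a closed-form solution, w = (XᵀX + αI)⁻¹ Xᵀy, which makes the shrinkage easy to see directly. In this sketch, two nearly duplicate synthetic features give least squares an erratic credit split, while ridge shares the weight smoothly.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)     # almost a duplicate of x1
y = x1 + x2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])

def ridge_fit(X, y, alpha):
    # Closed-form ridge on centered data, so the intercept stays unpenalized.
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)

w_ols = ridge_fit(X, y, alpha=0.0)    # plain least squares: unstable split
w_ridge = ridge_fit(X, y, alpha=10.0) # penalized: smaller, more even weights
```

The ridge coefficients are smaller in overall size and nearly equal to each other, reflecting how the penalty refuses to let one redundant feature grab all the credit.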

LASSO is another regularization method, but it uses a penalty based on the absolute size of coefficients rather than the squared size, and that small change creates a big behavioral difference. LASSO tends to push some coefficients to exactly zero, which means it can perform automatic feature selection. This can be extremely useful when you have many features and you believe only a smaller subset truly matters for prediction. The benefit is a simpler model that is easier to communicate, and often easier to deploy, because you only need the selected features. The tradeoff is that when features are highly correlated, LASSO can behave a bit like it is forced to choose, picking one feature and dropping another even if both carry similar information. That choice can be unstable when data changes slightly, and it can lead to explanations that sound more definitive than they should. A smart beginner stance is to value LASSO for simplification but to be cautious about treating selected features as the only true causes, especially when the dataset contains redundancy.
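The exact-zero behavior comes from the soft-thresholding step inside the LASSO optimization. Here is a deliberately minimal coordinate-descent sketch (not a production solver) on synthetic data where only two of six features matter:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 200, 6
X = rng.normal(size=(n, p))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])  # only two features matter
y = X @ true_w + rng.normal(scale=0.5, size=n)

def lasso_cd(X, y, alpha, n_iter=200):
    # Coordinate descent for (1/2n) * ||y - Xw||^2 + alpha * ||w||_1,
    # assuming roughly standardized features.
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]          # residual excluding feature j
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            # Soft-thresholding: weak correlations are snapped to exactly zero.
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / z
    return w

w = lasso_cd(X, y, alpha=0.2)   # irrelevant features end up exactly zero
```

Note the surviving coefficients are also shrunk a little toward zero; that is the bias part of the bias-variance bargain mentioned earlier.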

Elastic net combines ideas from ridge and LASSO, and you can think of it as giving the model two kinds of pressure at once. The ridge-like part promotes stability and smooth shrinkage, while the LASSO-like part promotes sparsity, meaning some coefficients can become zero. This combination often performs well when you have many features that are correlated in groups, which is common in real data where multiple measurements reflect the same underlying factor. Elastic net can select groups of correlated features together rather than arbitrarily picking one, which can be more stable and more realistic in some domains. The key idea is that elastic net gives you a continuum between ridge and LASSO behavior, so you can tune the balance based on your goals. If you care most about prediction stability, you lean more ridge-like, and if you care most about simplifying the feature set, you lean more LASSO-like. In exam terms, it is important to recognize when each regularization method is appropriate based on feature count, correlation structure, and the need for interpretability.
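The continuum can be written as a single objective with a mixing parameter (here called l1_ratio, following common convention): (1/2n)·||y − Xw||² + α·(l1_ratio·||w||₁ + (1 − l1_ratio)/2·||w||²). A small coordinate-descent sketch on synthetic grouped features, invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200
g = rng.normal(size=n)                        # one shared underlying factor
# Three near-copies of the same factor, plus one pure-noise feature.
X = np.column_stack([g + 0.02 * rng.normal(size=n) for _ in range(3)]
                    + [rng.normal(size=n)])
y = 2.0 * g + rng.normal(scale=0.5, size=n)

def enet_cd(X, y, alpha, l1_ratio, n_iter=300):
    # Coordinate descent for (1/2n)||y - Xw||^2
    #   + alpha * (l1_ratio * ||w||_1 + (1 - l1_ratio)/2 * ||w||^2).
    # l1_ratio=1 recovers LASSO; l1_ratio=0 recovers ridge.
    n, p = X.shape
    w = np.zeros(p)
    l1, l2 = alpha * l1_ratio, alpha * (1 - l1_ratio)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            w[j] = np.sign(rho) * max(abs(rho) - l1, 0.0) / (z + l2)
    return w

w_enet = enet_cd(X, y, alpha=0.1, l1_ratio=0.5)  # keeps the correlated group
w_lasso = enet_cd(X, y, alpha=0.1, l1_ratio=1.0)  # pure LASSO, for comparison
```

With the mixed penalty, the three correlated features all stay in the model with similar weights, while the noise feature stays near zero; that group-keeping behavior is exactly what the ridge-like pressure buys.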

All regularization methods introduce tuning choices, and while we are not doing step-by-step procedures here, you should understand what tuning is conceptually. The penalty strength controls how strongly coefficients are pushed toward zero, and choosing it is essentially choosing how much complexity you are willing to tolerate. If the penalty is too weak, you do not get the stability benefits, and if it is too strong, you underfit and lose meaningful signal. A beginner mistake is to pick a penalty that makes the model look clean and simple without checking whether it still predicts well on data it did not train on. Another mistake is to treat the tuned penalty as a fixed truth, rather than as a choice that depends on the context, the costs of errors, and how noisy the data is. It is also important to remember that regularization interacts with scaling, because penalties depend on coefficient size, and coefficient size depends on the units of your inputs. If one feature is measured in huge numbers and another in tiny decimals, the penalty can distort the model unless features are put on comparable scales, which is why scaling is part of responsible regression practice.
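The scaling interaction is easy to demonstrate. In this sketch (synthetic data, closed-form ridge for illustration), two features carry the same effective signal but live in wildly different units; the same penalty strength treats them very differently until the features are standardized.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 200
x_small = rng.normal(size=n)                  # feature in tiny units
x_big = 1000.0 * rng.normal(size=n)           # same kind of signal, huge units
# Both contribute an effect of size 2 to y (0.002 * 1000 = 2).
y = 2.0 * x_small + 0.002 * x_big + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, alpha):
    # Closed-form ridge on centered data.
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)

X_raw = np.column_stack([x_small, x_big])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

w_raw = ridge_fit(X_raw, y, alpha=100.0)   # small-unit feature shrunk hard,
                                           # huge-unit feature barely touched
w_std = ridge_fit(X_std, y, alpha=100.0)   # comparable shrinkage for both
```

On raw features the penalty distorts the credit split purely because of units; on standardized features the two coefficients come out nearly equal, which is why scaling is part of responsible regularization practice.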

Connecting this back to the broader goal of applying linear regression well, the most important habit is to treat the model as a hypothesis about the data, not as an automatic answer machine. You start with a simple model because it is interpretable and easy to diagnose, and then you test whether it matches the patterns you see through residuals, stability checks, and thoughtful evaluation. If the assumptions look reasonable, linear regression can be an excellent baseline and sometimes the best final choice, especially when you need transparency. If the assumptions look shaky, you do not have to abandon the approach immediately, but you do need to adjust expectations and possibly use regularization to improve robustness. Ridge, LASSO, and elastic net are not magic fixes, but they are powerful ways to make regression behave better when features are many, correlated, or noisy. The deeper lesson is that good regression is a balance of understanding, checking, and communicating limits, and that mindset is exactly what this certification expects you to demonstrate when you choose and justify a model for a real problem.
