Episode 16 — Handle missing data properly: MCAR, MAR, NMAR, and imputation implications

In this episode, we’re going to treat missing data as a first-class topic rather than an annoying detail you hope disappears, because missingness can quietly change the meaning of your dataset and the trustworthiness of every conclusion you draw from it. Beginners often think missing values are just empty cells you can delete or fill in, but the reason values are missing often matters more than the missing values themselves. If data is missing for random reasons, it may mainly reduce precision by shrinking your sample, but if data is missing for systematic reasons, it can bias your estimates and mislead your models. That is why the three missingness patterns, Missing Completely at Random (M C A R), Missing at Random (M A R), and Not Missing at Random (N M A R), matter so much. They are not just categories for a textbook; they are labels for how the missingness process relates to the data and to what you are trying to learn. Our goal is to make these patterns feel intuitive, then connect them to common handling strategies, especially imputation, so you understand what you gain and what you risk when you choose a method.

Before we continue, a quick note: this audio course is a companion to our course companion books. The first book is about the exam and provides detailed guidance on how best to pass it. The second book is a Kindle-only eBook that contains 1,000 flashcards that can be used on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.

A good way to start is to recognize that missingness is information, even before you decide how to handle it. If a value is missing because a sensor failed under certain conditions, that missingness tells you something about those conditions. If a survey answer is missing because some participants chose not to answer a sensitive question, that missingness may correlate with the very thing you are trying to study. In data and A I work, missingness can arise from collection systems, user behavior, data transfer issues, and filtering rules, and those sources are rarely perfectly random. This is why simply deleting missing values can sometimes be safe and sometimes be disastrous. When you delete rows, you are changing the population your analysis represents, and if missingness is related to the outcome or to important predictors, you can distort the story. Beginners often focus on how many values are missing, but the more important question is why they are missing. If the exam asks what you should consider first, the strongest answer usually involves understanding the missingness mechanism rather than immediately choosing an imputation technique.
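If you want to see what "asking why values are missing" looks like in practice, here is a minimal sketch. The records, field names, and the platform variable are all hypothetical; the point is simply that profiling missingness rates against an observed variable is the first diagnostic, and a sharp difference between groups already makes a purely random mechanism doubtful.

```python
# Sketch: before choosing a fix, profile *where* values are missing.
# The records and field names here are hypothetical illustrations.
records = [
    {"platform": "mobile", "age": 34, "income": None},
    {"platform": "mobile", "age": 28, "income": None},
    {"platform": "desktop", "age": 45, "income": 72000},
    {"platform": "desktop", "age": 51, "income": 88000},
    {"platform": "mobile", "age": 39, "income": 51000},
]

# Missing counts per field.
fields = records[0].keys()
missing = {f: sum(1 for r in records if r[f] is None) for f in fields}

# Cross-tabulate missingness of one field against an observed variable:
# if the rate differs sharply by platform, MCAR is already doubtful.
rate_by_platform = {}
for p in {r["platform"] for r in records}:
    group = [r for r in records if r["platform"] == p]
    rate_by_platform[p] = sum(r["income"] is None for r in group) / len(group)
```

In this toy data, income is never missing on desktop but missing for two of three mobile rows, which is exactly the kind of pattern that should make you question a random-missingness assumption.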

Missing Completely at Random (M C A R) is the simplest category, and it is the one beginners often assume by default even when it is not justified. M C A R means that the probability a value is missing does not depend on any data values, observed or unobserved. In plain language, missingness is like a random coin flip that knocks out some entries, independent of what those entries were and independent of anything else in the dataset. If missingness truly behaves this way, then the cases with missing values are not systematically different from cases without missing values, so deleting missing rows may mainly reduce your sample size without strongly biasing estimates. The key idea is that M C A R protects you from bias because the missingness process is unrelated to the substance of the data. The problem is that true M C A R is rare in real systems, because missingness often occurs due to conditions that correlate with behavior, time, device type, or value magnitude. On an exam, if a scenario suggests missingness happens randomly due to a transient glitch unrelated to the data, M C A R might be plausible. If the scenario hints at patterns, M C A R is often the wrong assumption.
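To make the MCAR claim concrete, here is a small simulation under an assumed Gaussian column: values are knocked out by a coin flip that ignores the data entirely, and the mean of what survives stays close to the true mean. The only real cost is the smaller sample.

```python
import random

random.seed(0)
# Simulate a numeric column, then knock out roughly 30% of values
# completely at random -- a coin flip unrelated to the value itself,
# which is the MCAR mechanism.
true_values = [random.gauss(50, 10) for _ in range(10_000)]
observed = [v for v in true_values if random.random() > 0.3]

true_mean = sum(true_values) / len(true_values)
observed_mean = sum(observed) / len(observed)

# Under MCAR, deletion shrinks the sample but leaves the mean
# nearly unbiased: the surviving rows are a fair subsample.
bias = observed_mean - true_mean
```

The bias here stays tiny, which is the whole point of MCAR; the next two mechanisms will not be so forgiving.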

Missing at Random (M A R) is the category that sounds like it means random, but it actually means something more subtle and more realistic. M A R means the probability a value is missing can depend on observed data, but not on the missing value itself once you account for the observed data. In plain language, missingness can be explained by variables you can see, like the device type, the region, the time of day, or other features, even though the value you want is missing. For example, suppose a certain field is more likely to be missing for mobile users because the mobile form does not collect it reliably. The missingness depends on the observed variable of platform, not directly on what the missing value would have been. Under M A R, you can often reduce bias by modeling missingness using the observed variables, and many imputation strategies assume something like M A R because it makes the problem tractable. Beginners often think M A R means you can ignore missingness, but you cannot, because missingness is still systematic and can change your dataset’s composition. The difference is that under M A R, you have observed clues that can help you correct for it. Exam questions that mention missingness tied to observable groups are often describing M A R.
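Here is the mobile-form example from above as a simulation, with assumed numbers: income differs by platform, and the field goes missing far more often on mobile. Averaging only the complete cases is biased because mobile users are underrepresented, but conditioning on the observed platform variable recovers a nearly unbiased estimate, which is exactly the leverage MAR gives you.

```python
import random

random.seed(1)
# Hypothetical MAR setup: desktop users in this simulation earn more,
# and the income field is dropped 60% of the time on mobile but only
# 5% on desktop. Missingness depends on the *observed* platform,
# not on the income value itself.
rows = []
for _ in range(10_000):
    platform = random.choice(["mobile", "desktop"])
    income = random.gauss(40_000 if platform == "mobile" else 70_000, 5_000)
    miss_rate = 0.6 if platform == "mobile" else 0.05
    observed = None if random.random() < miss_rate else income
    rows.append((platform, income, observed))

true_mean = sum(r[1] for r in rows) / len(rows)

# Naive complete-case mean is biased high: mobile is underrepresented.
complete = [r[2] for r in rows if r[2] is not None]
naive_mean = sum(complete) / len(complete)

# Conditioning on the observed variable corrects it: average the
# within-platform means, weighted by each platform's share of all rows.
est = 0.0
for p in ("mobile", "desktop"):
    group = [r for r in rows if r[0] == p]
    obs = [r[2] for r in group if r[2] is not None]
    est += (len(group) / len(rows)) * (sum(obs) / len(obs))
```

The naive mean lands thousands of dollars too high, while the platform-adjusted estimate lands within sampling noise of the truth.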

Not Missing at Random (N M A R) is the most challenging category and the one most likely to cause hidden bias if you treat it like a simpler case. N M A R means the probability a value is missing depends on the value itself, even after accounting for observed variables. In plain language, the missingness is directly related to what you do not get to see, which creates a self-reinforcing blind spot. A classic example is when people with higher incomes are less likely to report income, so the missingness depends on income itself. In a system context, a sensor might fail more often when readings are extremely high, meaning the most extreme values are selectively missing. Under N M A R, standard imputation methods can be very misleading because they fill in missing values using patterns from observed data, but observed data is missing the very extremes that define the missingness. Beginners sometimes assume that if you impute, you are fixing the problem, but under N M A R, naive imputation can disguise bias and make results look more certain than they really are. On the exam, if missingness is plausibly connected to the magnitude or nature of the missing value, you should suspect N M A R. The correct response is often caution and recognition that the missingness mechanism must be addressed, not just filled over.

These categories matter because handling strategies that are reasonable under one mechanism can be harmful under another. If you assume M C A R and delete missing rows, but the true mechanism is M A R or N M A R, you can bias your dataset by selectively removing certain groups or value ranges. If you assume M A R and use an imputation model that relies on observed variables, but the true mechanism is N M A R, you might impute values that look plausible but systematically miss the extremes. The exam often tests this idea by presenting a missingness story and asking what risk is introduced by a particular handling method. The right approach is to match the method to a plausible mechanism and to acknowledge uncertainty when the mechanism cannot be validated fully. In practice, you rarely know the true mechanism perfectly, but you can often tell when M C A R is unlikely, which is already a major improvement over default assumptions. The deeper skill is to treat missing data handling as part of the modeling problem, not a preprocessing chore. Once you do that, you start asking better questions about how data enters your system.

Now let’s talk about imputation, because imputation is the family of techniques most learners encounter first. Imputation means filling in missing values with estimated values so you can use methods that require complete data or so you can reduce information loss from deletion. The simplest forms of imputation include replacing missing values with a mean, median, or most common category. These are easy to implement, but the exam focus is usually on implications rather than on mechanics. Simple imputation can shrink variability artificially because you are inserting repeated or central values, which can make your dataset look more consistent than it truly is. That, in turn, can lead models to be overconfident, because the data has been smoothed. Simple imputation can also distort relationships between variables, because it ignores the fact that missing values may be related to other features. Beginners often like simple imputation because it is quick, but quick solutions can create subtle bias. Exam questions often include mean imputation as a tempting answer and then test whether you understand that it can bias variance and correlations.
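The variance-shrinking effect of mean imputation is easy to demonstrate. In this sketch (missingness is completely at random here, to isolate the variance effect), every imputed cell sits exactly at the center, so the filled column's spread drops by roughly the square root of the observed fraction.

```python
import random
from statistics import pstdev

random.seed(3)
values = [random.gauss(0, 1) for _ in range(10_000)]
# Knock out ~40% of entries completely at random for the demonstration.
data = [v if random.random() > 0.4 else None for v in values]

observed = [v for v in data if v is not None]
mean = sum(observed) / len(observed)
filled = [v if v is not None else mean for v in data]

# Every imputed cell is exactly the center value, so the filled column
# is visibly tighter than the observed data: standard deviation shrinks
# by roughly sqrt(observed fraction).
sd_observed = pstdev(observed)
sd_filled = pstdev(filled)
```

That artificially tight spread is what makes downstream models and intervals overconfident: the data looks more consistent than it ever was.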

More sophisticated imputation uses other variables to predict missing values, which is generally more plausible under M A R because missingness can be explained by observed data. If a field is missing more often for certain devices, you might impute using device type and other available signals. This can preserve relationships better than simple imputation because the filled values depend on context rather than being the same number for everyone. The danger is that imputation is still a model, and like any model it can be wrong, especially if the missingness mechanism is complex or if the predictors used for imputation do not capture the true structure. Another danger is that imputation can create a false sense of certainty, because once values are filled, they look like real observations even though they are estimates. A mature approach is to remember that imputed values should not be treated as ground truth, and uncertainty from imputation should be acknowledged. In exam terms, you should recognize that model-based imputation can reduce bias under M A R compared to deletion, but it can still mislead under N M A R. The key is understanding what assumption the imputation is making about how missingness relates to the data.
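A minimal version of "using other variables to predict missing values" is group-wise imputation, shown here with an assumed device-type scenario. The per-group means computed from observed cases act as a very simple imputation model, which is more defensible under MAR than one global fill value because the filled values depend on context.

```python
import random

random.seed(4)
# Hypothetical setup: a metric runs higher on desktop, and it is most
# often missing on mobile. Imputing from the observed device type (a
# MAR-style assumption) keeps the filled values context-dependent.
rows = []
for _ in range(5_000):
    device = random.choice(["mobile", "desktop"])
    metric = random.gauss(10 if device == "mobile" else 30, 2)
    miss = random.random() < (0.5 if device == "mobile" else 0.1)
    rows.append({"device": device, "metric": None if miss else metric})

# Per-group means from observed cases serve as the "model".
group_mean = {}
for d in ("mobile", "desktop"):
    obs = [r["metric"] for r in rows
           if r["device"] == d and r["metric"] is not None]
    group_mean[d] = sum(obs) / len(obs)

# Fill each gap with its own group's mean, not a single global value.
for r in rows:
    if r["metric"] is None:
        r["metric"] = group_mean[r["device"]]
```

This preserves the mobile/desktop difference that a global mean would have blurred; but notice it is still a model, and it would fail silently if missingness depended on the metric's value itself rather than on the device.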

Multiple imputation is a concept that often appears in discussions of missing data because it addresses the false certainty problem by creating multiple plausible filled datasets rather than one single filled version. The idea is that instead of pretending you know the missing values, you generate several reasonable estimates based on a model, analyze each filled dataset, and then combine results in a way that reflects variability due to missingness. You do not need to implement this for the exam, but you should understand why it exists. It exists because a single imputation treats estimated values as fixed and can underestimate uncertainty, leading to overly narrow confidence intervals and inflated significance. Multiple imputation is a structured way to propagate missingness uncertainty into your conclusions. For exam scenarios where the focus is on interpreting results, recognizing that single imputation can understate uncertainty is often enough. If the question is about why a model seems too confident after imputation, the correct reasoning often involves lost variability and hidden uncertainty. Thinking about imputation as adding guesses rather than adding truth keeps you cautious in the right way.
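Here is a deliberately crude sketch of the multiple-imputation idea, using random draws from the observed values as a stand-in for a real imputation model. The point is not the draw mechanism but the structure: several plausible completed datasets, one estimate per dataset, and a between-imputation spread that a single fill would have hidden entirely.

```python
import random
from statistics import mean, pvariance

random.seed(5)
values = [random.gauss(50, 10) for _ in range(2_000)]
data = [v if random.random() > 0.3 else None for v in values]
observed = [v for v in data if v is not None]

# Toy multiple imputation: instead of one fixed fill, draw each missing
# cell from the observed values (a crude hot-deck stand-in for a real
# imputation model), repeat m times, and watch the estimate vary.
m = 20
estimates = []
for _ in range(m):
    filled = [v if v is not None else random.choice(observed) for v in data]
    estimates.append(mean(filled))

pooled = mean(estimates)
# Nonzero spread across fills = uncertainty due to missingness, which a
# single imputed dataset would silently discard.
between = pvariance(estimates)
```

Real multiple imputation uses a proper model for the draws and formal pooling rules, but the exam-level insight is visible here: the estimate genuinely moves from one plausible fill to the next, and honest conclusions should reflect that.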

Another major decision point is whether to delete cases with missing values, and if so, how and why. Deleting missing values can be reasonable when missingness is truly small and plausibly M C A R, because the main effect is reduced sample size rather than systematic distortion. Deletion can also be used when a variable is missing so often that imputing it would add more guesswork than signal. The risk is that deletion can introduce bias under M A R and N M A R, because you might be deleting a non-random subset of cases. For example, if certain groups are more likely to have missing entries, deletion disproportionately removes those groups, and your dataset becomes less representative. This can cause your model to perform poorly in real-world deployment for the very groups that were deleted. Beginners often think deletion is the safest because it avoids making things up, but deleting can be a form of making things up too, because you are pretending the remaining data represents the population. Deletion is not neutral; it is a choice that reshapes your sample. On the exam, if you see missingness tied to observable groups, deletion should make you cautious unless you also account for those groups.

Weighting and indicator features are also strategies used in missing data situations, and understanding the high-level idea helps you reason about exam questions that mention them. Weighting can adjust estimates when missingness differs across groups, especially under M A R, by giving more influence to underrepresented cases, though it depends on having enough observed data for those groups. Adding an indicator for missingness means creating a separate feature that flags whether a value was missing, which can allow a model to learn patterns associated with missingness itself. This can be useful because missingness can carry real information, like a device failing to report a metric under certain conditions. The caution is that using missingness indicators can also encode problematic relationships if missingness reflects a sensitive or confounded process, and it can lead to models that rely on the fact of missingness rather than the underlying phenomenon. For exam purposes, the main takeaway is that missingness can be modeled, not just imputed away, and that missingness indicators can capture informative patterns under M A R-like settings. The decision depends on whether missingness is informative and whether you want the model to use that information. This reinforces the idea that missing data is part of the signal story, not just a hole in the spreadsheet.
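The indicator-feature idea fits in a few lines. In this sketch the field name is hypothetical: the flag records the fact of missingness as its own feature, so a model can learn from it, while the numeric column still gets a fill value so it remains usable.

```python
# Sketch: turn the *fact* of missingness into a feature of its own, so a
# downstream model can use it, while still filling the numeric column.
rows = [
    {"latency_ms": 120.0},
    {"latency_ms": None},   # e.g. the device failed to report the metric
    {"latency_ms": 95.0},
    {"latency_ms": None},
]

observed = [r["latency_ms"] for r in rows if r["latency_ms"] is not None]
fallback = sum(observed) / len(observed)

for r in rows:
    # Flag first, then fill, so the flag reflects the original state.
    r["latency_missing"] = 1 if r["latency_ms"] is None else 0
    if r["latency_ms"] is None:
        r["latency_ms"] = fallback
```

The order matters: set the flag before filling, or the information you were trying to preserve is destroyed by your own preprocessing.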

A key misunderstanding to avoid is the belief that imputation always improves model performance simply because it increases the amount of usable data. Imputation can increase the amount of usable data, but if it introduces bias or destroys variance, it can harm generalization. A model trained on heavily imputed data can learn overly smooth relationships that do not exist in reality, and then it will perform poorly when faced with real variation. Imputation can also leak information if the imputation process uses data from the future or from the target in inappropriate ways, which can make performance look artificially strong during evaluation. Even without going into technical configuration, you should understand the principle that imputation must be done in a way that respects the separation between learning and evaluation. The exam may phrase this as a model doing too well on validation data after imputation, suggesting data leakage or improperly handled missingness. Another common effect is that imputation can change class boundaries in subtle ways, especially when missingness is uneven across classes. If positives tend to have more missingness, naive imputation might erase that signal and reduce separability. The safe mindset is that imputation is a modeling decision with consequences, not a free upgrade.
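The "separation between learning and evaluation" principle can be sketched in a few lines: compute the fill statistic from the training split only, then apply that same statistic unchanged to the validation split. Recomputing it on validation data, or on the full dataset before splitting, is the leakage pattern to avoid.

```python
import random

random.seed(6)
# Leakage-safe imputation sketch: the fill value is learned from the
# training split alone, then reused unchanged on the validation split.
values = [random.gauss(0, 1) for _ in range(1_000)]
data = [v if random.random() > 0.2 else None for v in values]

split = 800
train, valid = data[:split], data[split:]

# "Fit" on train: compute the fill statistic from training rows only.
train_obs = [v for v in train if v is not None]
fill = sum(train_obs) / len(train_obs)

# Apply to both splits with the *same* statistic. Recomputing the mean
# on validation data (or on the full dataset before splitting) would
# leak information across the evaluation boundary.
train_filled = [v if v is not None else fill for v in train]
valid_filled = [v if v is not None else fill for v in valid]
```

This mirrors the fit/transform discipline that imputation utilities in common libraries are built around: estimate on training data, then transform everything else with those frozen estimates.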

You also need to connect missing data handling to interpretability, because filling values changes what your dataset means. If you replace missing values with a central value, you are not just fixing a problem; you are asserting that the missing values are typical, which may be false. If you impute based on other variables, you are asserting that missing values follow the same relationships as observed ones, which is plausible under M A R but questionable under N M A R. If you treat missingness as a category in a categorical feature, you are asserting that missingness is a meaningful state, which can be true if missingness reflects a system behavior like not applicable or not collected. These are all semantic choices, and they influence what a model learns. Beginners often think data preparation is purely technical, but it is also about making meaning consistent. On an exam, if you see a feature where missing might mean something like unknown, not measured, or not applicable, you should consider that treating missing as its own category might preserve meaning better than forcing a guess. The central idea is that missingness has causes, and handling choices should match causes rather than ignoring them.
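For the "missing as a meaningful state" option, the code is almost trivial, which is part of its appeal. In this sketch (the connection-type values are hypothetical), the missing state becomes an explicit category rather than a guessed value, so the model sees "not collected" as information instead of a fabricated observation.

```python
# Sketch: when "missing" plausibly means "not applicable" or "not
# collected", encode it as an explicit category instead of forcing a
# guessed value into the column.
raw = ["wifi", None, "ethernet", None, "wifi"]
encoded = [v if v is not None else "missing" for v in raw]

# The category set now carries the missing state as first-class
# information that a model can condition on.
categories = sorted(set(encoded))
```

Whether this is the right choice is a semantic question, exactly as the paragraph above says: it is appropriate when missingness reflects a real system state, and misleading when it is just a data-quality accident you are promoting to a category.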

To close, handling missing data properly means you start by understanding the missingness mechanism, because mechanism determines what choices are safe and what choices are risky. You learned that M C A R describes missingness unrelated to any data values, which can make deletion less biased though still less precise. You learned that M A R describes missingness related to observed variables, which often makes model-based imputation or adjustment strategies more defensible because you can use observed clues. You learned that N M A R describes missingness related to the missing values themselves, which makes naive deletion and naive imputation dangerous because the missingness hides exactly the values you would need to model correctly. You also learned that imputation is not a neutral fix, because it can shrink variance, distort relationships, and understate uncertainty if treated as truth. Finally, you connected missing data to representativeness and bias, recognizing that missingness can change the population your analysis describes even when it looks like a small technical detail. When you can explain missingness patterns in plain language and choose handling strategies that respect those patterns, you are showing the kind of careful reasoning the CompTIA DataAI exam is designed to measure.
