Episode 27 — Spot granularity traps, aggregation bias, and Simpson’s paradox early
In this episode, we’re going to learn how data can tell a believable story that is still wrong, simply because we summarized it at the wrong level or mixed together groups that should not be blended. Beginners often assume that more summarization makes data easier to understand, but summarization can also erase the very patterns you need to see. Granularity refers to the level of detail in your data, like whether you are looking at individual events, daily totals, or monthly averages. Aggregation is the act of combining many detailed records into a summary, like turning thousands of transactions into one number per customer. The trap is that once you aggregate, you can introduce bias and you can accidentally reverse relationships, which is where Simpson’s paradox comes in. The goal here is to help you notice early when the level of detail you are using is shaping your conclusions more than the underlying reality.
Before we continue, a quick note: this audio course is a companion to our course companion books. The first book is about the exam and provides detailed information on how best to pass it. The second book is a Kindle-only eBook that contains 1,000 flashcards that can be used on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.
Granularity matters because data is generated at a certain natural level, and when you change that level, you change what questions you are able to answer. If you have clickstream data, the natural unit might be a click, but you might aggregate it to a session, a user, or a day. If you have medical measurements, the natural unit might be a visit, but you might aggregate it to a patient or a hospital. Each choice is valid for certain questions and invalid for others, and the danger is assuming one level works for everything. At high granularity, you see variability and rare events, but the data can be noisy. At low granularity, you see smooth patterns, but you might hide important differences and create misleading averages. Spotting granularity traps early is about asking what the prediction target and decision context require, then checking whether your current level of detail matches that requirement.
One classic granularity trap happens when you aggregate away the timing that matters. For example, suppose you want to predict whether a customer will churn, and you create a feature that is average weekly usage over the last three months. That average might hide a sharp decline in the last two weeks, which could be the strongest warning sign. The model then learns a bland version of behavior instead of the meaningful change. Another example is aggregating network events into a daily count, which might hide that all the events happened in a short burst that signals an incident. When you summarize too early, you can remove patterns like spikes, trends, or sequences that carry predictive information. This is not an argument against aggregation; it is a reminder that aggregation is a modeling choice that changes what signals remain.
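To see how an average can erase the timing that matters, here is a minimal sketch with made-up weekly usage numbers for one hypothetical customer. The lifetime average looks healthy while a recent-window feature exposes the decline:

```python
# Hypothetical weekly usage over 12 weeks: steady early on,
# then a sharp drop in the final two weeks.
usage = [10, 11, 10, 12, 11, 10, 11, 10, 12, 11, 2, 1]

lifetime_avg = sum(usage) / len(usage)   # smooths the drop away
recent_avg = sum(usage[-2:]) / 2         # captures the warning sign
trend = recent_avg - lifetime_avg        # negative means recent decline

print(f"lifetime average: {lifetime_avg:.2f}")  # 9.25, looks healthy
print(f"recent average:   {recent_avg:.2f}")    # 1.50, clear churn signal
```

Both numbers are computed from the same raw data; the choice of window is the modeling decision that determines whether the decline survives aggregation.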
Aggregation bias is the idea that relationships observed in aggregated data can differ from relationships observed in detailed data, and the reason is that aggregation changes the weighting of observations. When you average across customers, a customer with many transactions and a customer with one transaction might be treated equally if you summarize per customer. When you average across transactions, heavy users might dominate because they generate more rows. Neither weighting is automatically correct; it depends on the question. If your decision is at the customer level, you may want each customer to count once. If your decision is at the transaction level, you may want each transaction to count once. Problems arise when the weighting implied by your aggregation does not match the decision you want to make, because then the model learns a distorted picture of importance. Aggregation bias often feels like a mystery until you realize it is usually just a hidden choice about who or what gets counted more.
Another form of aggregation bias comes from mixing together groups that behave differently and then reporting a single overall relationship as if it applies to everyone. For example, you might compute a single conversion rate across all traffic sources and conclude that a new campaign improved conversions, but the improvement might be true only for one source and negative for another. When you combine groups, the overall rate becomes a weighted average, and those weights depend on group sizes, not on what is fair or what is meaningful. If group sizes change over time, the overall rate can change even if each group’s rate stays the same. This is why aggregated metrics can swing dramatically during periods of user mix shifts, product changes, or seasonality. If you are not careful, you might chase a moving overall number while missing the stable group-level patterns underneath.
Simpson’s paradox is the most dramatic example of these issues because it describes a situation where a trend appears in one direction when you look at the combined data, but the trend reverses when you look within each group. This is not a trick of math; it is a consequence of how weighted averages work when groups have different sizes and different baseline levels. A simple way to imagine it is to think of two groups, where one group has generally higher outcomes than the other, and the distribution of cases between groups differs across the conditions you are comparing. When you combine the groups, the condition with more weight from the high-outcome group can look better overall, even if within each group it is worse. Beginners often assume that if a conclusion changes when you split the data, one view must be wrong, but Simpson’s paradox teaches that both views can be mathematically correct and still misleading for decision-making. The real question becomes which view matches the causal structure and the decision context.
To make this idea feel concrete without getting stuck in arithmetic, consider a scenario where you compare two teaching methods across beginner and advanced students. Advanced students tend to score higher no matter what method is used, simply because they start stronger. If one method was used more often with advanced students, it could look better overall, even if within both beginner and advanced groups it was actually slightly worse. The overall metric is being pulled by the mix of students rather than by the method effect itself. This same pattern appears in business data, healthcare outcomes, hiring metrics, and model evaluation, because group composition often changes. The key lesson is that aggregated comparisons can be dominated by who is in the sample, not just what treatment or condition you are comparing. Simpson’s paradox is a reminder to ask what groups exist and whether group membership is acting like a hidden driver.
A practical skill is learning what kinds of grouping variables are likely to cause Simpson-like reversals, and these are often variables that influence the outcome strongly and are unevenly distributed. In many datasets, geography, customer segment, device type, channel, and time period can act as strong groupers. If one segment has naturally higher conversion and one segment has naturally lower conversion, a shift in segment mix can change the overall conversion rate even if nothing improved. If one device type has inherently different behavior, mixing device types can hide or reverse relationships. Time is a particularly common grouping factor because systems change, marketing strategies shift, and user populations evolve. Spotting the possibility of Simpson’s paradox early means being suspicious of any strong overall relationship that might be explained by a change in group composition. It also means being willing to split the data by plausible groupers to check whether the relationship holds within groups.
Granularity traps also appear when you join data from different sources that are recorded at different levels, because mismatched levels can create misleading duplicates or unintended amplification. For example, if you have one row per customer in a demographics table and many rows per customer in a transactions table, joining them can replicate the same demographic values across many transaction rows. That replication is not inherently wrong, but it changes how you count and summarize, and it can create the illusion of more independent information than you truly have. If you then compute a correlation at the transaction level, you may be inadvertently weighting customers with more transactions more heavily. In evaluation, you might also leak customer identity patterns into the model if you are not careful about how you split the data. Thinking about granularity at join time is one of the most important beginner habits because many modeling pipelines create these issues silently.
Another way aggregation can mislead is through what is sometimes called the ecological fallacy, where you infer individual behavior from group-level averages. If a city has a high average income and a high average health outcome, you cannot conclude that higher-income individuals in that city are the ones with better outcomes without looking at individual-level data. Group-level relationships can be real at the group level but not true at the individual level, because groups differ on many dimensions simultaneously. In machine learning, this matters because you might train a model on aggregated data and then apply it to individual decisions, assuming the relationship holds. If the relationship is only true in aggregate, the model’s predictions can be systematically wrong for individuals. This is another form of granularity trap: using a level of data that does not match the level of decision.
A strong defense against these problems is to make the unit of analysis explicit, meaning you clearly state what each row represents and what decision the model will support. If the model makes decisions per customer, then you should be careful about using transaction-level rows without aggregation, or at least be explicit about how transaction behavior maps to customer outcomes. If the model makes decisions per event, then you should not collapse everything into customer-level averages and assume you still have the event-level signal. Once the unit is clear, you can choose aggregation strategies that preserve the information that matters, such as keeping recent-window features instead of lifetime averages, or computing both overall levels and recent changes. You can also plan group-aware evaluation, ensuring that groups that might cause Simpson’s paradox are represented in both training and testing. The core idea is that good modeling starts with alignment between data grain and decision grain.
By the end of this topic, you should be able to recognize that granularity and aggregation are not just data preparation steps, but choices that can change the conclusions you draw. Granularity traps occur when you summarize away the very patterns you need or when you operate at a level that does not match the decision you want to support. Aggregation bias occurs when summarization changes weighting or hides group differences, making overall relationships look stronger or weaker than they truly are. Simpson’s paradox is the vivid reminder that overall trends can reverse when you examine groups, especially when group membership strongly influences outcomes and group proportions differ across conditions. The way to avoid these landmines is not to avoid aggregation, but to be deliberate: define your unit of analysis, check relationships within key groups, and stay aware that averages can tell stories shaped by mix rather than by cause. When you build this awareness early, your models become more trustworthy because they are grounded in the right level of reality.