Episode 24 — Run EDA with intent: distributions, skew, kurtosis, and feature type checks
In this episode, we’re going to build a mindset for exploratory data analysis that feels purposeful instead of random, because beginners often open a dataset, start plotting things, and end up with a pile of charts that do not answer any clear question. Exploratory data analysis, usually shortened to E D A, is the stage where you learn what your data actually is before you decide what you want to do with it. The key word is intent: you are not just looking, you are checking specific truths about the data that will affect every decision that comes after. You want to know what kinds of values exist, what normal looks like, what weird looks like, and what might quietly break a model. Distributions, skew, kurtosis, and feature type checks sound like technical terms, but they are really ways of asking simple questions about shape, balance, extremes, and meaning. If you learn to do this well, you will prevent many mistakes that otherwise show up later as confusing training results or unreliable predictions.
Before we continue, a quick note: this audio course accompanies our two companion books. The first book covers the exam and offers detailed guidance on how best to pass it. The second is a Kindle-only eBook containing 1,000 flashcards you can use on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.
A good first habit is to start by asking what each feature is supposed to represent, because numbers are not always the kind of numbers you think they are. Some columns are measurements, like temperature or response time, where differences and averages make sense. Some columns are identifiers, like customer IDs or transaction IDs, which are numbers only for convenience and have no meaningful numeric distance. Some columns represent categories, like product type or region, even if they are encoded as numbers such as 1, 2, and 3. This is why feature type checks matter: you are verifying whether each column is continuous, discrete, categorical, ordinal, boolean, time-based, or text-based, and that decision changes how you should summarize and visualize it. When you skip this step, you can accidentally compute averages of IDs or treat a category code as if it had a numeric relationship that is not real. Intentful E D A begins with meaning, not math.
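If you were doing this audit in Python with pandas (one common choice, not the only one), a minimal sketch might look like the following; the column names, values, and the idea of comparing distinct counts to length are all illustrative assumptions, not a fixed recipe.

```python
import pandas as pd

# Hypothetical toy dataset; column names are illustrative assumptions.
df = pd.DataFrame({
    "customer_id": [1001, 1002, 1003, 1004],    # identifier, numeric only for convenience
    "region_code": [1, 2, 1, 3],                # category encoded as a number
    "response_ms": [120.5, 98.2, 310.0, 87.1],  # a true continuous measurement
})

# A quick type audit: dtype alone is not enough, so also look at how many
# distinct values each column has relative to its length.
audit = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_unique": df.nunique(),
    "unique_ratio": df.nunique() / len(df),
})
print(audit)

# An ID column tends to have a unique_ratio near 1.0, while a category
# code stored as an integer shows a small set of repeated values.
```

The point of the sketch is that `customer_id` and `region_code` look identical to the machine (both integers), and only the distinct-value pattern plus your knowledge of meaning separates them.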
Once you understand what a feature is, you can start thinking about distributions, which describe how values are spread across their range. A distribution tells you whether most values cluster around a typical center, whether they are evenly spread, or whether they clump into distinct groups. Even without advanced statistics, it helps to imagine piling up all the values and asking what the pile looks like: is it one hump, two humps, or a long tail? This matters because many modeling techniques behave differently depending on whether features are roughly bell-shaped, heavily skewed, or full of outliers. It also matters for choosing transformations, deciding how to handle missing values, and interpreting what the model learns. When you look at a distribution, you are not judging whether it is good or bad; you are learning what kind of world the data came from.
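The “pile up the values” picture is literally what a histogram computes. As a hedged sketch, here is one way to contrast a one-hump shape with a two-hump mixture using NumPy; the distributions and sample sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two illustrative shapes: one hump vs. two humps (a mixture of groups).
one_hump = rng.normal(loc=50, scale=5, size=1000)
two_humps = np.concatenate([
    rng.normal(loc=30, scale=3, size=500),
    rng.normal(loc=70, scale=3, size=500),
])

# "Pile up the values": counts per bin describe the shape of the pile.
counts_one, _ = np.histogram(one_hump, bins=20)
counts_two, _ = np.histogram(two_humps, bins=20)

# The single hump concentrates its tallest bins near the middle of the
# range; the mixture has two peaks with a near-empty valley between them.
```

Even without plotting, scanning the bin counts tells you whether the feature is one world or a blend of two.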
Skew is one of the first distribution concepts that becomes practical immediately, because it tells you whether the distribution leans to one side. A right-skewed distribution has a long tail to the right, meaning there are a small number of unusually large values compared to the typical value. A left-skewed distribution has a long tail to the left, meaning there are a small number of unusually small values relative to the typical value. Many real-world features are right-skewed, such as income, file sizes, response times, and counts of events, because there is a natural lower bound near zero but no strict upper bound. Skew matters because averages become less representative when a small number of large values pull the mean away from what most observations look like. It also matters because models can become overly influenced by extreme values if you do not understand that they are part of the natural shape.
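You can see the mean being pulled away from the typical value with a few lines of pandas. In this hedged sketch, a lognormal sample stands in for a right-skewed feature like response times; the parameters are arbitrary illustrations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# A lognormal is a classic right-skewed shape: bounded near zero below,
# no strict upper bound, and occasional very large values.
response_time = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

skewness = response_time.skew()    # positive for a right-leaning tail
mean_val = response_time.mean()
median_val = response_time.median()

# The long right tail drags the mean above the median, so the mean
# overstates what a "typical" observation looks like.
print(f"skew={skewness:.2f}, mean={mean_val:.1f}, median={median_val:.1f}")
```

The gap between mean and median is the practical face of skew: when they disagree, ask which one your summary or your model actually needs.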
Kurtosis is a concept that often gets treated as mysterious, but a beginner-friendly view is that it relates to how much of the distribution lives in the tails and how sharp the peak is. High kurtosis suggests more extreme tail behavior, meaning more values far from the center than you would expect under a gentle bell-shaped pattern. Low kurtosis suggests lighter tails and fewer extreme outliers, although the exact interpretation can vary depending on the convention being used. The reason kurtosis matters is not because you need to calculate it by hand, but because it alerts you to whether rare extreme values are part of the data’s nature. If your feature occasionally produces huge spikes, a model might chase those spikes or become unstable if they are not handled thoughtfully. Seeing tail heaviness early helps you decide whether to cap values, transform them, or treat the extremes as a separate phenomenon worth investigating. Intentful E D A uses kurtosis-like thinking as an early warning signal for extremes.
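The convention point is worth seeing concretely. In this sketch, pandas reports excess kurtosis (the Fisher convention), so a roughly normal sample lands near zero rather than near three; the heavy-tailed sample here is a Student’s t draw, chosen only as an illustration of fat tails.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
normal_like = pd.Series(rng.normal(size=10_000))
heavy_tailed = pd.Series(rng.standard_t(df=3, size=10_000))  # fat tails

# pandas' .kurt() reports *excess* kurtosis (Fisher convention):
# a bell-shaped sample sits near 0, not near 3.
k_normal = normal_like.kurt()
k_heavy = heavy_tailed.kurt()

# The heavy-tailed sample puts far more mass in its tails, which shows
# up as a much larger excess kurtosis: an early warning about extremes.
```

You rarely need the exact number; what matters is noticing when one feature’s tail measure dwarfs the others, because that feature is the one that can destabilize a model.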
Another key part of E D A with intent is checking for feature types that are easy to misread, especially ordinal versus categorical features. An ordinal feature has an order that matters, like low, medium, high, or a rating from one to five where higher means more. A categorical feature is a set of labels where order is not meaningful, like city names or device types. If you treat ordinal data as purely categorical, you might throw away useful order information. If you treat categorical data as ordinal, you might invent relationships that do not exist, such as assuming category 4 is twice category 2 just because the code is bigger. These mistakes can be subtle because the dataset might store both as integers, but the meaning is entirely different. A careful feature type check looks for columns that are numeric but behave like categories, such as having only a handful of repeated values, or having values that correspond to codes rather than measurements.
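One way to make the ordinal-versus-categorical decision explicit in pandas is an ordered categorical type. The severity column and its levels below are hypothetical; the point of the sketch is that declaring the order preserves rank information without inventing numeric distance.

```python
import pandas as pd

# Hypothetical severity column stored as strings; the order matters.
severity = pd.Series(["low", "high", "medium", "low", "high"])

# Declaring the ordering preserves the ordinal information instead of
# treating the labels as unrelated categories.
ordered_type = pd.CategoricalDtype(
    categories=["low", "medium", "high"], ordered=True
)
severity_ord = severity.astype(ordered_type)

# Order-aware operations now work: comparisons and ordinal codes.
is_serious = severity_ord > "low"
codes = severity_ord.cat.codes  # 0, 2, 1, 0, 2 — rank, not magnitude

# By contrast, a region column should stay unordered: code 4 is not
# "twice" code 2, it is just a different label.
```

The codes carry rank only; nothing in this encoding claims that “high” is three times “low,” which is exactly the relationship you want to avoid inventing.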
Distributions also help you detect data quality issues that are not obvious from missing value counts alone. A feature might have no missing entries, but still be effectively useless if almost every value is the same. This is sometimes called near-zero variance, and it can happen when a sensor is stuck, a field is defaulted, or the feature is not populated meaningfully. You can spot this by looking at the distribution and noticing that almost all values pile into one bin. Another problem is when a feature has only a small number of values, not because it is categorical, but because it was rounded, truncated, or recorded in a limited format. That can create stair-step patterns that may confuse a model or hide true relationships. When you examine distributions with intent, you are looking for these signs of measurement and recording behavior, not just the mathematical shape.
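A near-zero-variance check can be as simple as asking what fraction of values the most common value accounts for. The 0.95 threshold in this sketch is an assumption for illustration, not a standard cutoff.

```python
import pandas as pd

# A column with no missing values can still be nearly useless if one
# value dominates, e.g. a stuck sensor or a defaulted field.
stuck_flag = pd.Series([0] * 98 + [1] * 2)

top_fraction = stuck_flag.value_counts(normalize=True).iloc[0]
n_distinct = stuck_flag.nunique()

# Illustrative rule of thumb (the 0.95 threshold is an assumption):
# flag columns where a single value overwhelmingly dominates.
near_zero_variance = top_fraction >= 0.95 and n_distinct <= 2
```

A flagged column is not automatically dropped; the point is to investigate why one value dominates before deciding whether the feature carries any signal.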
When you analyze skew and tails, it also helps to connect those shapes to likely causes in the data collection process. A long right tail might represent rare high-usage customers, rare large files, or rare long delays during outages. A heavy tail might represent a mixture of normal operations and occasional incidents, meaning the data actually contains multiple regimes. If you treat that mixture as one simple distribution, you might build a model that fits neither regime well. Intentful E D A asks whether the shape suggests multiple populations, such as two peaks that could indicate two different user groups or two operating modes. If you suspect multiple groups, you can investigate by slicing the distribution by a known category, like region or device type, to see whether those groups explain the multi-peak behavior. This is not about overcomplicating; it is about realizing that a dataset is often a blend of different realities.
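Slicing by a known category is a one-liner once the grouping column exists. This sketch fabricates a two-regime latency column purely for illustration; the column names and regime parameters are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# A two-peaked latency distribution that is really a mixture of two
# regimes: normal operation vs. incident periods (names are illustrative).
df = pd.DataFrame({
    "latency_ms": np.concatenate([
        rng.normal(100, 10, size=900),   # normal operations
        rng.normal(500, 50, size=100),   # occasional incidents
    ]),
    "mode": ["normal"] * 900 + ["incident"] * 100,
})

# Slicing the distribution by the known category explains the two peaks:
# each regime has its own center and spread.
by_mode = df.groupby("mode")["latency_ms"].agg(["mean", "std", "count"])
print(by_mode)
```

If the per-group summaries look like two different distributions, that is evidence the “one weird shape” was really two ordinary ones blended together.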
A frequent beginner mistake is to assume that summary statistics like mean and standard deviation fully describe a feature, but those numbers can hide important shape information. Two distributions can have the same mean and spread but look completely different, such as one being symmetric and another being strongly skewed with outliers. Skew and kurtosis provide extra information beyond the mean and variance, and even without computing them exactly, you can reason about them by looking at the distribution. Another trap is to use the mean as the “typical” value when the distribution is skewed, because then the mean can represent a value that few observations actually have. In those cases, median and percentiles often match intuition better because they describe what a typical observation looks like. Intentful E D A is about matching the summary to the shape rather than applying one summary everywhere.
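You can demonstrate the “same summary, different shape” trap directly: rescale a skewed sample so it matches a symmetric one in mean and spread, then compare their skew. The distributions below are arbitrary illustrations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
symmetric = pd.Series(rng.normal(loc=10, scale=5, size=20_000))
skewed = pd.Series(rng.lognormal(mean=0, sigma=1, size=20_000))

# Standardize the skewed sample to the same mean and spread as the
# symmetric one: near-identical summary statistics, very different shapes.
skewed = (skewed - skewed.mean()) / skewed.std() * 5 + 10

same_mean = abs(symmetric.mean() - skewed.mean()) < 0.2
same_std = abs(symmetric.std() - skewed.std()) < 0.2
different_shape = skewed.skew() - symmetric.skew() > 1.0
```

Mean and standard deviation agree almost exactly here, yet one sample is symmetric and the other has a long right tail; only a shape-aware summary like skew tells them apart.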
Feature type checks also include verifying the role of time features, because time can appear as a timestamp, a duration, or a derived unit like day-of-week. A timestamp is not just a number; it is a coordinate on a timeline, and treating it as a raw numeric feature can create misleading relationships, especially if the dataset spans periods with changes in behavior. A duration feature, such as time since signup, is different because it directly describes an elapsed amount that can be meaningful in modeling. Derived time features like hour-of-day can capture seasonality, but they are cyclical, meaning after 23 comes 0, and that cycle matters for interpretation. These distinctions belong in E D A because they determine whether a feature should be treated as continuous, cyclical, or categorical. When beginners ignore these details, they often build models that learn time position rather than time behavior, which can make performance look good in testing but fail in the future.
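The cyclical point about hour-of-day has a common (though not mandatory) sketch: sine/cosine encoding, which places the 24 hours on a circle so 23 and 0 become neighbors again.

```python
import numpy as np
import pandas as pd

# Hour-of-day is cyclical: hour 23 and hour 0 are neighbors, but as raw
# numbers they look 23 apart. One common sketch is sine/cosine encoding.
hours = pd.Series(range(24))
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

def circle_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    a = np.array([np.sin(2 * np.pi * h1 / 24), np.cos(2 * np.pi * h1 / 24)])
    b = np.array([np.sin(2 * np.pi * h2 / 24), np.cos(2 * np.pi * h2 / 24)])
    return float(np.linalg.norm(a - b))

d_wrap = circle_distance(23, 0)   # adjacent hours across midnight: small
d_far = circle_distance(0, 12)    # opposite sides of the day: large
```

On the raw number line, 23:00 and 00:00 are maximally far apart; on the circle, they are one step apart, which matches how the underlying behavior actually repeats.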
Another intent-driven E D A habit is to check how distributions change when you compare the feature against the target or against a key grouping variable. For example, if the target is whether a transaction is fraudulent, you might compare the distribution of transaction amount for fraudulent versus non-fraudulent cases. You are not trying to prove a causal story; you are trying to see whether the feature behaves differently across outcomes, which hints at predictive usefulness and possible issues. Skew and kurtosis can differ across groups, and those differences can signal that a single transformation may not fit both groups equally well. This also helps you spot leakage-like patterns, where a feature seems almost perfectly separated by the target, which is suspicious if the feature would not realistically be known at prediction time. Even at the E D A stage, you can protect yourself by noticing patterns that look too perfect.
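Comparing a feature’s distribution across outcomes is again a grouped summary. This sketch fabricates a fraud-style dataset purely for illustration; the class sizes and amount distributions are assumptions, and nothing here claims a causal story.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
# Illustrative fraud-style data: fraudulent amounts drawn from a
# different, larger-valued distribution than legitimate ones.
df = pd.DataFrame({
    "amount": np.concatenate([
        rng.lognormal(3.0, 0.5, size=950),   # legitimate
        rng.lognormal(5.0, 0.5, size=50),    # fraudulent
    ]),
    "is_fraud": [0] * 950 + [1] * 50,
})

# Compare the feature's distribution across outcomes: per-group medians
# and skew hint at predictive usefulness and at shape differences.
per_group = df.groupby("is_fraud")["amount"].agg(["median", "skew"])
print(per_group)
```

If a grouped view like this showed near-perfect separation, that is exactly the moment to ask whether the feature would realistically be known at prediction time.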
As you build confidence, you can treat E D A as a set of questions you ask repeatedly, rather than a set of plots you make once. What is the feature type, and does the encoding match the meaning? What is the distribution shape, and does it have skew or heavy tails? Are there outliers, and do they look like errors or like rare but real events? Does the feature have enough variation to be useful? Does the feature behave differently across important groups, and do those differences make sense in context? These questions help you avoid the trap of treating E D A as a checklist and instead treat it as an investigation. The purpose is not to make the data look pretty; it is to understand the data well enough to make honest modeling choices later.
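Those repeated questions can even be packaged as a small reusable check. This is a hedged sketch, not a complete profiler: the categorical threshold, the 1.5-IQR outlier fence, and the near-constant cutoff are all illustrative assumptions you would tune to your own data.

```python
import pandas as pd

def eda_questions(s: pd.Series) -> dict:
    """Ask the repeated E D A questions of one feature (illustrative sketch)."""
    out = {
        "dtype": str(s.dtype),
        "n_unique": int(s.nunique()),
        "looks_categorical": s.nunique() <= 10,  # heuristic threshold
    }
    if pd.api.types.is_numeric_dtype(s):
        q1, q3 = s.quantile([0.25, 0.75])
        iqr = q3 - q1
        out["skew"] = float(s.skew())
        # 1.5 * IQR fences: a common (assumed) convention for flagging outliers.
        out["n_outliers"] = int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())
        out["near_constant"] = bool(s.value_counts(normalize=True).iloc[0] > 0.95)
    return out

report = eda_questions(pd.Series([1, 2, 2, 3, 3, 3, 100]))
```

A function like this does not replace the investigation; it just guarantees you ask the same questions of every feature instead of only the ones that happen to catch your eye.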
By the end of this topic, you should feel that distributions, skew, kurtosis, and feature type checks are not separate technical chores, but connected parts of one careful habit: learning what your data is really saying. Distributions show you the shape of the world your data came from, including whether it is smooth, clumpy, skewed, or heavy-tailed. Skew tells you whether typical values and average values differ in important ways, which affects summaries and modeling sensitivity. Kurtosis-style thinking alerts you to extremes and tail behavior, which affects stability, robustness, and whether you need transformations. Feature type checks keep you from making meaning mistakes that math alone cannot fix, like averaging IDs or inventing order in categories. When you run E D A with intent, you are building a foundation for trustworthy modeling decisions, because you are learning the data’s structure before you ask a model to learn it for you.