Episode 29 — Encode categorical variables correctly: one-hot, ordinal, target, and hashing

In this episode, we’re going to tackle a topic that sounds simple but causes a surprising number of modeling failures: turning categories into numbers in a way that keeps their meaning intact. Categorical variables are features like country, device type, payment method, or error code, where the value is a label rather than a measurement. Models typically work with numbers, so you have to encode categories into a numeric form, and the encoding choice quietly tells the model what relationships exist. If you pick the wrong encoding, the model can learn patterns that are not real, like assuming one category is greater than another just because it has a higher code. If you pick the right encoding, the model can learn differences between categories without inventing order or distance that does not exist. One-hot, ordinal, target, and hashing encodings are common approaches, and each one fits a different situation. The goal is to help you understand what each encoding communicates to the model, what assumptions it makes, and how to avoid the most common traps.

Before we continue, a quick note: this audio course is a companion to our course companion books. The first book covers the exam and explains in detail how best to pass it. The second book is a Kindle-only eBook that contains 1,000 flashcards that can be used on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.

The first step is to separate nominal categories from ordinal categories, because this distinction drives most encoding decisions. Nominal categories have no natural order, like browser type or region, and the only truth is that labels are different from each other. Ordinal categories have a meaningful order, like low, medium, high, or a satisfaction rating from one to five where higher has a consistent meaning. If you treat a nominal variable like it is ordinal by assigning numbers such as 1, 2, 3 and letting the model interpret those as a scale, you are injecting a fake ordering. The model might then assume that category 3 is closer to category 2 than category 1, which is not meaningful for most labels. If you treat an ordinal variable as purely nominal, you might throw away the useful fact that there is an ordered progression. So before you encode anything, you want to ask what kind of category you have and whether its order is real and stable.

One-hot encoding is often the safest default for nominal categories because it avoids inventing order and instead represents each category as its own separate indicator. The basic idea is that if a feature can take values like red, green, and blue, you create separate numeric features that indicate whether the value is red, whether it is green, and whether it is blue. This tells the model that these are distinct buckets, not points on a line. One-hot encoding works well when the number of categories is not too large and when you want the model to learn separate effects for each category. It also makes interpretation more straightforward in many cases, because you can reason about how each category’s indicator influences the prediction. The main cost is that it can create many features, especially if you have high-cardinality categories like thousands of ZIP codes or thousands of product IDs, which can lead to sparsity and overfitting if you are not careful.
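To make the idea concrete, here is a minimal sketch of one-hot encoding in plain Python. The vocabulary and row values are made-up examples, and the helper name `one_hot` is illustrative rather than from any library:

```python
# Minimal one-hot encoding sketch (illustrative, not a library API).
# The vocabulary is the set of categories observed in training data.

def one_hot(value, vocabulary):
    """Return an indicator vector: 1 in the slot for `value`, 0 elsewhere."""
    return [1 if value == v else 0 for v in vocabulary]

colors = ["red", "green", "blue"]          # vocabulary learned from training data
rows = ["green", "red", "blue", "green"]

encoded = [one_hot(r, colors) for r in rows]
print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Notice that each category gets its own slot, so no ordering or distance between red, green, and blue is implied; the cost is one new column per category.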

A common beginner mistake with one-hot encoding is to forget that categories can appear in new data that were not present in training data. If the model only knows about the categories it saw before, a new category value can become an unknown case that needs a consistent handling strategy. Another mistake is to one-hot encode an identifier, like a unique customer ID, which creates a nearly perfect fingerprint for each row and encourages the model to memorize instead of generalize. Even when an ID seems helpful, it often creates performance that collapses when you encounter new IDs in deployment. One-hot encoding also creates a subtle redundancy if you include an indicator for every category plus an intercept term, because the indicators sum to one; this produces perfect multicollinearity in linear models, sometimes called the dummy variable trap. Many implementations handle this automatically by dropping one indicator, but the conceptual lesson is that encoding is not only about transforming data, it is about controlling what kinds of patterns the model is allowed to learn.
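One consistent strategy for unseen categories is to reserve a shared "other" slot that every unknown value falls into. This is a sketch of that idea, assuming a small browser-type feature; the names `vocab` and `encode_with_other` are hypothetical:

```python
# Sketch of one unseen-category strategy: a trailing "other" indicator slot.
# Any value not in the training vocabulary lands there instead of crashing
# or silently getting an all-zero row. Names here are illustrative.

def encode_with_other(value, vocab):
    """One-hot over vocab, plus a final slot for unknown values."""
    vec = [1 if value == v else 0 for v in vocab]
    vec.append(0 if value in vocab else 1)   # shared "other" bucket
    return vec

vocab = ["chrome", "firefox", "safari"]      # seen during training
print(encode_with_other("firefox", vocab))   # [0, 1, 0, 0]
print(encode_with_other("brave", vocab))     # [0, 0, 0, 1]  <- new at prediction time
```

The design choice here is that all unknown categories share one effect, which is cautious but consistent; the alternative of refusing unknown values is safer only when new categories genuinely indicate a data problem.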

Ordinal encoding is designed for categories with a real order, and it represents that order directly as numeric values. If you have categories like small, medium, large, you can map them to 1, 2, 3 in a way that reflects the progression. This can be powerful because it lets models use the order signal efficiently without needing separate indicators for each level. It can also produce cleaner behavior when the relationship is monotonic, meaning the outcome generally increases or decreases consistently across the ordered levels. However, ordinal encoding carries an assumption that the steps between levels are evenly spaced or at least comparable, which may not always be true. For example, the gap in meaning between poor and fair might not match the gap between good and excellent, even though both are one step apart. The key is that ordinal encoding communicates order and distance, so you should use it only when both are reasonably defensible.
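A simple sketch of ordinal encoding makes the key point visible: the mapping is an explicit statement of the order you believe in, written down by hand, not whatever order the labels happen to sort in. The mapping and helper name here are illustrative:

```python
# Ordinal encoding sketch: an explicit, hand-chosen mapping that encodes
# a real-world progression (small < medium < large).

SIZE_ORDER = {"small": 1, "medium": 2, "large": 3}

def encode_ordinal(values, order):
    """Replace each ordered label with its numeric rank."""
    return [order[v] for v in values]

print(encode_ordinal(["medium", "small", "large"], SIZE_ORDER))  # [2, 1, 3]
```

Writing the mapping explicitly, rather than letting a tool assign codes alphabetically, is what keeps the encoded numbers aligned with the real ordering.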

Another beginner pitfall is to use ordinal encoding simply because categories are labeled with numbers already, such as severity codes 1 through 5, without checking whether the codes truly represent equal steps. Sometimes a code is just a label, and higher numbers do not mean “more,” they just mean “different.” Even when codes are ordered, the relationship between levels and the outcome might not be linear, meaning the jump from level 1 to 2 might matter more than the jump from 4 to 5. In those cases, one-hot encoding can still be a better choice because it allows each level to have its own effect. The takeaway is that ordinal encoding is not automatically better for ordered categories; it is better when the order matters in a consistent way and when the model can benefit from treating the category like a scale. When you use it intentionally, it can simplify the problem and reduce dimensionality, but when you use it casually, it can inject incorrect structure.

Target encoding is a more advanced idea that can be extremely useful for high-cardinality nominal categories, but it comes with serious risks if you do it incorrectly. The basic idea is to replace a category with a numeric value derived from the target, such as the average outcome for that category. For example, if you are predicting churn, you might encode each region by the historical churn rate in that region. This can capture useful information without creating thousands of one-hot columns, and it can help models learn patterns from rare categories by mapping them into a shared numeric space. The danger is that target encoding can easily leak information if you compute those averages using the full dataset, including the rows you are trying to predict. Then the encoding is partly based on the answer, and the model performance becomes artificially inflated. Target encoding can also overfit to noise when categories have few examples, because a category with one or two observations can have an extreme average that is not reliable.

To use target encoding safely, you need the mindset that encodings must be learned without peeking at the target for the same row you will later predict. That typically means computing encodings using only training data and doing it in a way that preserves the separation between training and evaluation. It also means using smoothing, which is a way of pulling category averages toward the overall average when you have limited data, so rare categories do not get extreme values that are mostly noise. Even without getting into formulas, you can understand smoothing as a fairness principle: categories with lots of evidence can have confident estimates, and categories with little evidence should be treated cautiously. Target encoding is powerful because it can capture category signal compactly, but it is dangerous because it can quietly turn your model into a memorizer. For beginners, the most important lesson is not the mechanics, but the risk: if your encoding uses the target improperly, you are building leakage into the dataset.
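The smoothing idea above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the smoothing weight `m`, the toy region and churn values, and the helper names are all assumptions for the example. The encoding is fit on training rows only, and unseen categories fall back to the global mean:

```python
# Smoothed target encoding sketch, fit on training data only.
# `m` is a smoothing weight (an assumed hyperparameter): the larger it is,
# the more each category's estimate is pulled toward the global mean.

def fit_target_encoding(categories, targets, m=10.0):
    global_mean = sum(targets) / len(targets)
    sums, counts = {}, {}
    for c, y in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    # Categories with many examples stay near their own mean;
    # rare categories are shrunk toward the global mean.
    encoding = {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}
    return encoding, global_mean

def transform(categories, encoding, global_mean):
    """Unseen categories fall back to the global mean."""
    return [encoding.get(c, global_mean) for c in categories]

train_region = ["north", "north", "south", "south", "south", "east"]
train_churn  = [1, 0, 0, 0, 1, 1]   # binary target, training rows only

enc, gm = fit_target_encoding(train_region, train_churn, m=2.0)
print(transform(["north", "east", "west"], enc, gm))
# "east" has one example, so its estimate sits between its raw rate and
# the global mean; "west" was never seen and gets the global mean.
```

In real pipelines you would go further and use out-of-fold encoding within the training set itself, so no row's encoding is computed from its own target, but the fit/transform separation shown here is the core discipline.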

Hashing encoding is another approach designed to handle high-cardinality categories, and it solves a different problem than target encoding. Instead of using the target to create a numeric value, hashing uses a deterministic function to map category labels into a fixed number of bins. The result is a vector where categories land in buckets based on their hash values, and you control the number of buckets in advance. This is useful when you have huge vocabularies or many unique identifiers and you need a memory-efficient representation that can handle categories you have never seen before. The tradeoff is collisions, meaning different categories can map to the same bucket and become indistinguishable to the model. Collisions introduce noise, but if you choose enough buckets, the collisions become less frequent, and the method can still work well. Hashing is especially attractive when the category space is large and dynamic, because it avoids needing a stored lookup table of all categories.
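Here is a small sketch of the hashing idea, using a stable hash from the standard library so that bucket assignment is deterministic across runs (Python's built-in `hash` is randomized between processes, which would scramble the encoding). The bucket count and labels are made-up for illustration:

```python
# Feature hashing sketch: map any label into a fixed number of buckets
# with a deterministic hash, so no stored vocabulary is needed.

import hashlib

def hash_bucket(label, n_buckets):
    """Deterministically map a category label to one of n_buckets."""
    digest = hashlib.md5(label.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(label, n_buckets):
    """Indicator vector with a 1 in the label's hashed bucket."""
    vec = [0] * n_buckets
    vec[hash_bucket(label, n_buckets)] = 1
    return vec

# Any label, including ones never seen before, gets a bucket.
for label in ["product_123", "product_999", "brand_new_sku"]:
    print(label, "->", hash_bucket(label, n_buckets=8))
```

With only eight buckets and thousands of products, many labels would collide; in practice you choose the bucket count large enough that collisions are rare relative to how much signal each category carries.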

A beginner-friendly way to think about hashing is to imagine you have a fixed set of labeled drawers, and each category label gets assigned to a drawer by a rule you do not control in detail. If two labels end up in the same drawer, the model sees them as blended, which can blur meaning. The benefit is that you never run out of drawers, and you can handle new labels without redesigning the encoding. The risk is that the blending can hide important differences, especially if a rare but important category collides with a common category and gets drowned out. Hashing also makes interpretation harder because you cannot easily say which original labels correspond to which encoded bucket without additional tracking. So hashing is a practical engineering compromise: it gives you scalability and flexibility at the cost of some clarity and some collision noise. Understanding that tradeoff helps you choose it for the right situations rather than by accident.

These encoding choices also interact with model families and with your goals around interpretability, stability, and fairness. One-hot encoding is often easier to interpret and works well for many models, but it can explode feature count. Ordinal encoding is compact and can work well when order is real and meaningful, but it can inject fake distances when misused. Target encoding can be very predictive with high-cardinality categories, but it demands strict discipline to avoid leakage and overfitting. Hashing is scalable and handles unseen categories naturally, but it introduces collisions and can reduce interpretability. A key beginner habit is to treat encoding as part of the model design, not just a preprocessing step, because it determines what patterns are representable. If your encoding forces categories into an artificial numeric relationship, you are shaping the hypothesis space of the model in a way that can be harmful.

By the end of this topic, you should feel confident that encoding is about protecting meaning, not just turning words into numbers. One-hot encoding is a strong default for nominal categories because it avoids inventing order and lets the model learn separate effects. Ordinal encoding is appropriate when categories have a genuine, stable order and when treating them like a scale matches the real-world interpretation. Target encoding is a compact way to capture category-to-outcome relationships for high-cardinality features, but it must be done with strict separation and smoothing to avoid leakage and noise-driven overfitting. Hashing encoding provides a fixed-size, scalable representation that handles unseen categories, but it introduces collisions and reduces transparency. When you choose encodings intentionally, you help the model learn true structure rather than artifacts, and you set yourself up for evaluation results that actually reflect real-world performance.
