How to Handle Categorical Data with Multiple Levels in EDA

In exploratory data analysis (EDA), categorical variables with many levels (distinct categories) need particular care: you have to understand how each level is distributed and how the variable interacts with other features. High-cardinality variables can be awkward to summarize, plot, and encode, but several techniques make them manageable.

Here’s a step-by-step guide on how to handle categorical data with multiple levels in EDA:

1. Understand the Categorical Variable

Before diving into analysis, get a clear understanding of the categorical variable. A categorical variable is one that takes on a limited, fixed number of possible values (e.g., “red,” “blue,” “green”). A variable with many levels, such as a city name or a product code, can have hundreds or even thousands of distinct categories, which complicates analysis.

Start by identifying the variable and counting its categories. In Pandas, df['column_name'].nunique() returns the number of distinct categories, and df['column_name'].value_counts() shows how often each one occurs.

python
df['column_name'].value_counts()

This will show how many times each category appears, which is essential for understanding the dataset’s distribution.
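When a DataFrame has several categorical columns, it can help to rank them by cardinality before inspecting each one. The sketch below is one possible way to do that; the helper name cardinality_summary and the toy data are illustrative assumptions, not part of Pandas.

```python
import pandas as pd

def cardinality_summary(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Summarize each listed column: level count, most frequent level, its share, missing rows."""
    rows = []
    for col in columns:
        counts = df[col].value_counts()  # missing values are excluded by default
        rows.append({
            "column": col,
            "n_levels": df[col].nunique(),
            "top_level": counts.index[0] if len(counts) else None,
            "top_share": counts.iloc[0] / counts.sum() if len(counts) else float("nan"),
            "n_missing": int(df[col].isna().sum()),
        })
    return pd.DataFrame(rows).set_index("column").sort_values("n_levels", ascending=False)

# Toy data (hypothetical) to show the output shape
demo = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris", "Nice", None, "Paris"],
    "color": ["red", "blue", "red", "red", "blue", "red"],
    "income": [10, 20, 30, 40, 50, 60],
})
summary = cardinality_summary(demo, ["city", "color"])
print(summary)
```

Columns at the top of this table are the ones most likely to need the grouping and encoding steps described in the following sections.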

2. Frequency Distribution

Once you know the unique categories, check their frequency distribution. You can create bar plots to visualize the distribution of categories. Categories with very low frequencies clutter plots and yield unstable per-category statistics, so identifying them early helps you decide whether to merge them into a smaller set of categories.

python
import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x='column_name', data=df)
plt.xticks(rotation=90)  # Rotate labels if there are many categories
plt.show()
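With hundreds of levels, a count plot of every category becomes unreadable. A cumulative-share table is one way to judge how many levels are worth plotting individually. The following is a minimal sketch; the function name frequency_table, the 80% cutoff, and the sample data are assumptions made for illustration.

```python
import pandas as pd

def frequency_table(series: pd.Series) -> pd.DataFrame:
    """Counts, shares and cumulative shares per category, most frequent first."""
    counts = series.value_counts()
    table = pd.DataFrame({"count": counts, "share": counts / counts.sum()})
    table["cum_share"] = table["share"].cumsum()
    return table

demo = pd.Series(["a"] * 6 + ["b"] * 3 + ["c"])
freq = frequency_table(demo)
print(freq)

# Number of most frequent levels needed to cover at least 80% of the rows
n_levels_80 = int((freq["cum_share"] < 0.8).sum() + 1)
print(n_levels_80)
```

Plotting only the first n_levels_80 rows of the table keeps the chart legible while still covering most of the data.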

3. Combine Rare Categories

In many cases, especially with categorical data that has a large number of levels, certain categories may appear very infrequently. These rare categories often don’t carry much useful information and can introduce noise into your analysis or models. One strategy is to group these rare categories into an “Other” category.

For example, if you have a “City” variable with hundreds of cities, but some cities appear only once or twice, you can group these into an “Other” category to reduce the complexity.

python
threshold = 10  # Number of occurrences below which a category will be labeled as 'Other'
value_counts = df['column_name'].value_counts()
other_categories = value_counts[value_counts < threshold].index
df['column_name'] = df['column_name'].replace(other_categories, 'Other')
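An absolute count threshold depends on the size of the dataset, so a relative threshold is sometimes easier to reuse. The sketch below shows that variant under stated assumptions: the helper name group_rare, the 10% cutoff, and the city data are all hypothetical.

```python
import pandas as pd

def group_rare(series: pd.Series, min_share: float = 0.05, other_label: str = "Other") -> pd.Series:
    """Return a copy where levels with a share below min_share are replaced by other_label."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # where() keeps values for which the condition is True and replaces the rest
    return series.where(~series.isin(rare), other_label)

cities = pd.Series(["Paris"] * 10 + ["Lyon"] * 8 + ["Nice", "Lille"])
grouped = group_rare(cities, min_share=0.10)
print(grouped.value_counts())
```

The function returns a new Series instead of overwriting the column, so the original levels remain available for comparison.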

4. Create Dummy Variables (One-Hot Encoding)

For machine learning models, categorical data often needs to be transformed into numerical values. One of the most common techniques for this is one-hot encoding, which creates a new binary column for each category.

For example, if you have a variable “Color” with values ['red', 'blue', 'green'], one-hot encoding will create three new columns: “Color_red,” “Color_blue,” and “Color_green,” with binary values indicating the presence or absence of each color.

python
import pandas as pd
df = pd.get_dummies(df, columns=['column_name'], drop_first=True)

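drop_first=True drops the column for the first category (the first in sorted order, e.g., “Color_blue” in the example above), because its value is implied when all remaining columns are 0. This avoids perfect multicollinearity in linear models. Keep in mind that one-hot encoding adds one column per level, so a variable with many levels is usually worth reducing (for example by grouping rare categories) before encoding.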
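To make the effect of drop_first concrete, here is a small sketch using the “Color” example from above. The four-row DataFrame is made up for illustration.

```python
import pandas as pd

colors = pd.DataFrame({"Color": ["red", "blue", "green", "blue"]})

# One column per level
full = pd.get_dummies(colors, columns=["Color"])
# The first level in sorted order ("blue") is dropped
reduced = pd.get_dummies(colors, columns=["Color"], drop_first=True)

print(list(full.columns))     # ['Color_blue', 'Color_green', 'Color_red']
print(list(reduced.columns))  # ['Color_green', 'Color_red']
```

In the reduced frame, a row where both remaining columns are false represents the dropped level, “blue”.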

5. Label Encoding for Ordinal Data

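In cases where your categorical variable has an inherent order (e.g., “Low,” “Medium,” “High”), you can encode each category as an integer that reflects its rank. Unlike one-hot encoding, this keeps the ordinal relationship intact, but only if you specify the order yourself: scikit-learn's LabelEncoder assigns codes in alphabetical order (“High” = 0, “Low” = 1, “Medium” = 2) and is intended for target labels rather than features.

python
import pandas as pd

# Spell out the order explicitly: Low=0, Medium=1, High=2
order = ['Low', 'Medium', 'High']
df['encoded_column'] = pd.Categorical(df['column_name'], categories=order, ordered=True).codes

This approach is particularly useful when your categorical data represents ordered categories. Values that are not listed in order, including missing values, receive the code -1.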
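Integer codes only reflect the intended ranking when the order is stated explicitly. The sketch below, built on a made-up four-value Series, contrasts the default alphabetical category order in Pandas with an explicit one.

```python
import pandas as pd

levels = pd.Series(["Low", "High", "Medium", "Low"])

# Default category order is alphabetical: High=0, Low=1, Medium=2
alphabetical = levels.astype("category").cat.codes

# Explicit order: Low=0, Medium=1, High=2
ordered_dtype = pd.CategoricalDtype(categories=["Low", "Medium", "High"], ordered=True)
ordered = levels.astype(ordered_dtype).cat.codes

print(alphabetical.tolist())  # [1, 0, 2, 1]
print(ordered.tolist())       # [0, 2, 1, 0]
```

Only the second encoding preserves Low < Medium < High, which is what a model that treats the codes as numbers would rely on.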

6. Visualizing Categorical Data

Understanding how categorical data interacts with other variables in your dataset can provide insights into relationships and trends. For visualizing categorical variables with multiple levels, consider the following:

a) Box Plots / Violin Plots

You can use box plots or violin plots to see how a numerical variable is distributed within each category. For example, to compare “Income” (a numerical variable) across “City” (a categorical variable), you can draw one box per city.

python
sns.boxplot(x='column_name', y='numerical_column', data=df)
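A box plot with one box per level gets crowded quickly. One option is to keep only the most frequent levels and sort them by median so that differences are easier to read. The helper below is a sketch of that idea; its name, the choice of median, and the sample data are assumptions.

```python
import pandas as pd

def top_levels_by_median(df: pd.DataFrame, cat_col: str, num_col: str, n: int = 10) -> list:
    """Return the n most frequent levels of cat_col, sorted by median of num_col (descending)."""
    top = df[cat_col].value_counts().nlargest(n).index
    medians = df[df[cat_col].isin(top)].groupby(cat_col)[num_col].median()
    return medians.sort_values(ascending=False).index.tolist()

demo = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C", "C", "C", "D"],
    "income": [10, 20, 30, 50, 70, 5, 6, 7, 100],
})
order = top_levels_by_median(demo, "city", "income", n=3)
print(order)  # ['B', 'A', 'C']
```

The resulting list can be passed to the order parameter of sns.boxplot, together with a DataFrame filtered to those levels.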

b) Bar Plots

Bar plots can show an aggregate of a numerical variable for each category. By default, sns.barplot plots the mean of each category with an error bar; pass a different function through the estimator parameter to show the sum, median, or another aggregate.

python
sns.barplot(x='column_name', y='numerical_column', data=df)

c) Stacked Bar Charts

If you want to show how different categories in one variable are distributed across categories in another variable, a stacked bar chart can be useful.

python
pd.crosstab(df['column_name'], df['other_column']).plot(kind='bar', stacked=True)
plt.show()
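When the categories differ a lot in size, raw counts in a stacked bar chart mostly show which category is largest. Row proportions can make the composition easier to compare. A minimal sketch with invented segment and channel columns:

```python
import pandas as pd

demo = pd.DataFrame({
    "segment": ["x", "x", "x", "y", "y", "y", "y", "y"],
    "channel": ["web", "web", "store", "web", "store", "store", "store", "store"],
})

# normalize="index" turns each row into proportions that sum to 1
shares = pd.crosstab(demo["segment"], demo["channel"], normalize="index")
print(shares)
```

Calling shares.plot(kind='bar', stacked=True) on this table draws bars of equal height, one per segment, split by channel share.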

7. Handling Missing Data in Categorical Variables

Handling missing data is another important step. For categorical data, missing values might occur if a category has not been recorded or is incomplete. There are several strategies to handle missing categorical data:

Fill with the mode (most frequent category): This is a simple strategy that fills missing values with the most common category.
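```python
mode_value = df['column_name'].mode()[0]
# Assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent Pandas versions
df['column_name'] = df['column_name'].fillna(mode_value)
```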
Use placeholder categories: You can also introduce a new category like “Unknown” or “Missing” to indicate that the data was unavailable.
```python
df['column_name'] = df['column_name'].fillna('Unknown')
```
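Imputation: More advanced techniques like imputation using a model (e.g., KNN imputer) can be used, though this is more common for numerical data. scikit-learn's KNNImputer, for instance, works on numeric arrays, so categories have to be encoded first.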
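One practical detail with the placeholder strategy: a column stored with the Pandas category dtype rejects values that are not registered categories, so the placeholder has to be added first. The helper below is a sketch of how this might be handled; the name fill_missing_level and the sample Series are assumptions.

```python
import pandas as pd

def fill_missing_level(series: pd.Series, label: str = "Unknown") -> pd.Series:
    """Return a copy with missing values replaced by label, for object or category dtype."""
    if isinstance(series.dtype, pd.CategoricalDtype):
        # A category column only accepts values that are registered as categories
        if label not in series.cat.categories:
            series = series.cat.add_categories([label])
    return series.fillna(label)

raw = pd.Series(["a", None, "b", "a"], dtype="category")
filled = fill_missing_level(raw)
print(filled.value_counts())
```

The input Series is left unchanged, which makes it easy to compare distributions before and after filling.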

8. Chi-Square Test for Categorical Variables

If you are analyzing relationships between two categorical variables, the Chi-square test of independence can help assess whether two variables are related or not. This test is useful when you want to know whether the distributions of categorical variables are independent of each other.

python
import pandas as pd
from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(df['category1'], df['category2'])
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"p-value: {p}")

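A low p-value (typically less than 0.05) is evidence that the two variables are associated, although it says nothing about how strong the association is. With many levels, some cells of the contingency table can have very small expected counts (a common rule of thumb asks for at least 5 per cell), which makes the test unreliable; grouping rare categories first helps.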
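Before relying on the p-value, it is worth checking the expected counts that chi2_contingency returns, since sparse tables are common when a variable has many levels. The sketch below uses invented data and the widely quoted "expected count of at least 5" rule of thumb, which is a convention rather than a strict requirement.

```python
import pandas as pd
from scipy.stats import chi2_contingency

demo = pd.DataFrame({
    "category1": ["a"] * 30 + ["b"] * 30 + ["c"] * 2,
    "category2": (["u"] * 20 + ["v"] * 10) + (["u"] * 10 + ["v"] * 20) + ["u", "v"],
})
table = pd.crosstab(demo["category1"], demo["category2"])
chi2, p, dof, expected = chi2_contingency(table)

# Share of cells whose expected count is below 5
low_expected_share = (expected < 5).mean()
print(f"p-value: {p:.4f}, dof: {dof}, cells with expected < 5: {low_expected_share:.0%}")
```

Here the rare level "c" produces the two sparse cells; merging it into another group would remove them.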

9. Correlation with Target Variable

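If you want to know how a categorical variable relates to a target variable (e.g., a binary outcome like “Survived” or “Not Survived”), measure the strength of their association. Cramér’s V suits nominal categorical variables with any number of levels; it ranges from 0 (no association) to 1 (perfect association). Point-biserial correlation applies when one variable is binary and the other is numerical, so it is not appropriate for a multi-level categorical feature.

python
import pandas as pd
from scipy.stats.contingency import association

# Cramér's V between a categorical feature and a binary target (requires SciPy 1.7+)
contingency_table = pd.crosstab(df['categorical_column'], df['binary_target'])
cramers_v = association(contingency_table.to_numpy(), method='cramer')
print(f"Cramér's V: {cramers_v:.3f}")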
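A single coefficient does not show which levels drive the relationship. For a target coded as 0/1, a per-category rate table is a simple complement. The following sketch assumes such a coding and uses made-up data.

```python
import pandas as pd

demo = pd.DataFrame({
    "categorical_column": ["a", "a", "a", "a", "b", "b", "c", "c", "c", "c"],
    "binary_target": [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 target is the positive rate; the count shows how many rows back it up
rates = (
    demo.groupby("categorical_column")["binary_target"]
    .agg(rate="mean", n="count")
    .sort_values("rate", ascending=False)
)
print(rates)
```

Rates based on only a handful of rows (such as level "b" here) should be read with caution, which is another reason to group rare categories first.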

Conclusion

Handling categorical data with multiple levels in EDA requires understanding the distribution, managing rare categories, and transforming the data for modeling purposes. Techniques like one-hot encoding, label encoding, visualizations, and dealing with missing data help in preparing categorical features for machine learning models while extracting valuable insights during analysis.
