How to Perform a Chi-Square Test for Categorical Data in EDA

In exploratory data analysis (EDA), performing a Chi-Square test for categorical data helps to assess whether there is a significant relationship between two categorical variables. This statistical test is particularly useful when you’re dealing with contingency tables, where you have frequencies or counts of data points across categories.

Here’s a step-by-step guide on how to perform a Chi-Square test during EDA:

1. Understand the Chi-Square Test

The Chi-Square test of independence is used to determine whether two categorical variables are independent or associated with each other. It compares the observed frequencies (the actual data) to the expected frequencies (the frequencies we would expect if the variables were independent).

2. State the Hypotheses

Null Hypothesis (H₀): There is no association between the two variables; they are independent.
Alternative Hypothesis (H₁): There is an association between the two variables; they are dependent.

3. Prepare Your Data

For a Chi-Square test, your data should be in the form of a contingency table, where each cell represents a count of occurrences for each combination of the two categorical variables.

Example:

If you are examining the relationship between gender and whether a person likes a product (yes or no), your contingency table might look like this:

Gender	Likes Product	Doesn’t Like Product
Male	50	30
Female	40	60
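
If your data arrives as one row per respondent rather than as ready-made counts, pandas can tabulate it for you. The following is a minimal sketch under stated assumptions: the `responses` DataFrame, its column names, and the "Yes"/"No" labels are hypothetical, and the rows are generated only to reproduce the counts in the table above.

```python
import pandas as pd

# Hypothetical raw data: one row per respondent, rebuilt from the example counts.
counts = {
    ("Male", "Yes"): 50,
    ("Male", "No"): 30,
    ("Female", "Yes"): 40,
    ("Female", "No"): 60,
}
rows = [
    {"Gender": gender, "Likes Product": answer}
    for (gender, answer), n in counts.items()
    for _ in range(n)
]
responses = pd.DataFrame(rows)

# pd.crosstab counts how often each (Gender, Likes Product) pair occurs.
contingency_table = pd.crosstab(responses["Gender"], responses["Likes Product"])
print(contingency_table)
```

Note that `pd.crosstab` sorts the row and column labels, so the printed layout may be ordered differently from the table above even though the counts are the same.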

4. Calculate the Expected Frequencies

The expected frequency for each cell in a contingency table is calculated using the formula:

$$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$

Where:

Row Total is the sum of the observations in that row.
Column Total is the sum of the observations in that column.
Grand Total is the total number of observations.

For example, in the above table:

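The Male row total is 50 + 30 = 80, the Likes Product column total is 50 + 40 = 90, and the grand total is 180. The expected frequency for the Male, Likes Product cell is therefore:

$$E = \frac{80 \times 90}{180} = 40$$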

Repeat this for each cell in the table.
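
Rather than repeating the calculation by hand, the whole table of expected frequencies can be computed at once. This is an illustrative sketch using NumPy; the variable names are assumptions made for this example, and the observed counts are those of the example table.

```python
import numpy as np

# Observed counts from the example table.
# Rows: Male, Female. Columns: Likes Product, Doesn't Like Product.
observed = np.array([[50, 30],
                     [40, 60]])

row_totals = observed.sum(axis=1)   # one total per row
col_totals = observed.sum(axis=0)   # one total per column
grand_total = observed.sum()        # total number of observations

# E[i, j] = row_totals[i] * col_totals[j] / grand_total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

For the example table this gives 40 for both cells in the Male row and 50 for both cells in the Female row.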

5. Perform the Chi-Square Test Calculation

The Chi-Square statistic (χ²) is calculated using the formula:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where:

$O$ is the observed frequency.
$E$ is the expected frequency.

For each cell, subtract the expected frequency from the observed frequency, square the result, and divide it by the expected frequency. Then, sum all of these values across all cells in the table.
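
As a worked sketch of this step, the snippet below applies the formula to the example table. The expected counts (40, 40, 50, 50) come from applying the formula in step 4 to every cell; the variable names are assumptions made for this illustration.

```python
import numpy as np

# Observed counts and the expected counts derived in step 4.
observed = np.array([[50, 30],
                     [40, 60]])
expected = np.array([[40.0, 40.0],
                     [50.0, 50.0]])

# Contribution of each cell: (O - E)^2 / E
cell_contributions = (observed - expected) ** 2 / expected

# The statistic is the sum of all cell contributions.
chi2_statistic = cell_contributions.sum()

print(cell_contributions)
print(chi2_statistic)
```

The four cells contribute 2.5, 2.5, 2.0, and 2.0, so the statistic for the example is 9.0.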

6. Determine the Degrees of Freedom

The degrees of freedom (df) for the Chi-Square test is calculated using the formula:

$$df = (r - 1) \times (c - 1)$$

Where:

$r$ is the number of rows in the contingency table.
$c$ is the number of columns.

For our example with a 2×2 table, the degrees of freedom would be:

$$df = (2 - 1) \times (2 - 1) = 1$$

7. Find the Critical Value and Compare

Using the Chi-Square distribution table, find the critical value for the given significance level (usually 0.05) and the degrees of freedom (df). For df = 1 and a significance level of 0.05, the critical value is approximately 3.841.
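
If you do not have a printed table at hand, the critical value can also be looked up programmatically. The sketch below uses `scipy.stats.chi2.ppf`, the inverse of the cumulative distribution function; the variable names are assumptions for this example.

```python
from scipy.stats import chi2

n_rows, n_cols = 2, 2   # shape of the example contingency table
alpha = 0.05            # significance level

# Degrees of freedom: (r - 1) * (c - 1)
dof = (n_rows - 1) * (n_cols - 1)

# The critical value leaves probability alpha in the upper tail.
critical_value = chi2.ppf(1 - alpha, dof)
print(dof, round(critical_value, 3))
```

For the 2×2 example this reproduces the table value of about 3.841.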

8. Make a Decision

If the calculated Chi-Square statistic (χ²) is greater than the critical value, reject the null hypothesis. This means that there is a significant association between the two categorical variables.
If the calculated χ² is less than or equal to the critical value, fail to reject the null hypothesis. This means the data do not provide sufficient evidence of an association between the variables; it does not prove that they are independent.
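
The decision rule can be sketched in a few lines. Applying the formula from step 5 to the example table gives χ² = 2.5 + 2.5 + 2.0 + 2.0 = 9.0, which is hard-coded below; the variable names are assumptions for this illustration. The snippet also computes the p-value with `scipy.stats.chi2.sf`, the upper-tail probability, which leads to the same decision.

```python
from scipy.stats import chi2

chi2_statistic = 9.0   # uncorrected statistic for the example table
dof = 1                # degrees of freedom for a 2x2 table
alpha = 0.05           # significance level

critical_value = chi2.ppf(1 - alpha, dof)

# Probability of seeing a statistic at least this large if H0 is true.
p_value = chi2.sf(chi2_statistic, dof)

reject_null = chi2_statistic > critical_value
print(reject_null, p_value)
```

Because 9.0 exceeds 3.841, the null hypothesis is rejected for the example data. The p-value of roughly 0.0027 is below 0.05, which is the same conclusion expressed on the probability scale.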

9. Using Python (pandas and SciPy) for the Chi-Square Test

In practice, you can use the Python libraries pandas and SciPy to perform a Chi-Square test in a few lines.

Example Code:

python
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts from the example table (rows: gender, columns: response)
contingency_table = pd.DataFrame(
    {'Likes Product': [50, 40],
     "Doesn't Like Product": [30, 60]},
    index=pd.Index(['Male', 'Female'], name='Gender')
)

# If your data has one row per respondent instead of counts,
# build the table with pd.crosstab(df['Gender'], df['Likes Product']).

# Perform the Chi-Square test
# (SciPy applies Yates' continuity correction to 2x2 tables by default)
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Output results
print(f"Chi-Square Statistic: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of Freedom: {dof}")
print(f"Expected Frequencies:\n{expected}")

# Check the p-value
if p < 0.05:
    print("Reject the null hypothesis. The data suggest the variables are associated.")
else:
    print("Fail to reject the null hypothesis. No significant association was found.")

10. Interpret the Results

If the p-value is less than 0.05, it indicates that the null hypothesis can be rejected. This means that there is a statistically significant association between the two categorical variables.
If the p-value is greater than or equal to 0.05, you fail to reject the null hypothesis: the data do not show a statistically significant association. This is not proof that the variables are independent.
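
One detail is worth knowing when you compare a hand calculation with SciPy's output. For tables with one degree of freedom, such as the 2×2 example, `chi2_contingency` applies Yates' continuity correction by default, so its statistic is somewhat smaller than the plain sum of (O − E)² / E. The sketch below shows both variants for the example table; the variable names are assumptions for this illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the example table.
observed = np.array([[50, 30],
                     [40, 60]])

# Default: continuity correction is applied because dof == 1.
corrected = chi2_contingency(observed)

# correction=False reproduces the plain sum of (O - E)^2 / E.
uncorrected = chi2_contingency(observed, correction=False)

# Index 0 is the statistic, index 1 is the p-value.
print(corrected[0], corrected[1])
print(uncorrected[0], uncorrected[1])
```

For the example, the corrected statistic is 8.1225 and the uncorrected one is 9.0. Both p-values are well below 0.05, so the conclusion is the same either way, but with smaller samples the choice can matter.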

11. Visualizing the Results

While the Chi-Square test gives a p-value, it can sometimes be helpful to visualize the contingency table to get a sense of the data before conducting the test. You can use a heatmap or bar plot to represent the observed and expected frequencies.

Here’s a simple example using a heatmap in Seaborn:

python
import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap of the contingency table (contingency_table is the table built in step 9)
sns.heatmap(contingency_table, annot=True, cmap='coolwarm', fmt='g')
plt.show()
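
To compare observed and expected frequencies directly, the two tables can be drawn side by side. This is a self-contained sketch rather than a prescribed layout: the variable names, figure size, and panel titles are assumptions, and the counts are those of the example table.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.stats import chi2_contingency

# Observed counts from the example table.
observed = pd.DataFrame(
    {"Likes Product": [50, 40], "Doesn't Like Product": [30, 60]},
    index=pd.Index(["Male", "Female"], name="Gender"),
)

# chi2_contingency returns the expected counts as its fourth value.
expected_values = chi2_contingency(observed)[3]
expected = pd.DataFrame(
    expected_values, index=observed.index, columns=observed.columns
)

# Draw the two tables next to each other with the same color map.
fig, (ax_obs, ax_exp) = plt.subplots(1, 2, figsize=(10, 4))
sns.heatmap(observed, annot=True, fmt="g", cmap="coolwarm", ax=ax_obs)
ax_obs.set_title("Observed")
sns.heatmap(expected, annot=True, fmt="g", cmap="coolwarm", ax=ax_exp)
ax_exp.set_title("Expected")
fig.tight_layout()
plt.show()
```

Cells where the two panels differ most are the ones that contribute most to the Chi-Square statistic.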

Conclusion

The Chi-Square test for categorical data is a powerful tool in EDA to assess the relationship between two categorical variables. It helps in understanding patterns and dependencies that can guide further analysis and decision-making. By following the steps outlined above, you can effectively perform a Chi-Square test and interpret the results to draw meaningful conclusions from your data.
