How to Use the Chi-Square Test for Categorical Data in EDA

The Chi-Square test is a statistical method commonly used in exploratory data analysis (EDA) to assess the relationship between two categorical variables. It helps determine whether there is a significant association or dependency between these variables, making it an essential tool when dealing with categorical data.

Understanding the Chi-Square Test

The Chi-Square test of independence starts from the null hypothesis that there is no association between the two variables. If the observed data differ from what we would expect under this null hypothesis by more than chance alone would plausibly explain, we reject the null hypothesis and conclude that the variables are associated.

Key Concepts:

Observed frequencies: The actual data or counts from your dataset.
Expected frequencies: The frequencies you would expect to see if there were no relationship between the two variables.
Chi-Square Statistic (χ²): A single number that summarizes how far the observed frequencies deviate from the expected frequencies. Larger values indicate stronger evidence against the null hypothesis of no association.

Steps to Perform the Chi-Square Test in EDA

1. Set Up the Contingency Table

A contingency table (or cross-tabulation) is a matrix that shows the frequency distribution of the variables. For example, if you’re analyzing the relationship between “Gender” and “Purchase Decision” (Yes/No), the table might look like this:

Gender	Yes	No	Total
Male	50	30	80
Female	40	60	100
Total	90	90	180

Each cell contains the count of observations that fall into the corresponding category of both variables.
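In practice the table is usually built from row-level data rather than typed in by hand. The sketch below assumes a hypothetical DataFrame named `customers` with one row per person, reconstructed here from the counts in the table above, and uses `pandas.crosstab` to tabulate it.

```python
import pandas as pd

# Hypothetical raw data: one row per customer, rebuilt from the counts in the table above.
records = (
    [("Male", "Yes")] * 50 + [("Male", "No")] * 30
    + [("Female", "Yes")] * 40 + [("Female", "No")] * 60
)
customers = pd.DataFrame(records, columns=["Gender", "Purchase"])

# Cross-tabulate the two categorical columns; margins=True adds the row and column totals.
table = pd.crosstab(
    customers["Gender"], customers["Purchase"],
    margins=True, margins_name="Total",
)
print(table)
```

Note that `crosstab` sorts the category labels, so the rows and columns may appear in a different order than in the hand-written table. Drop the "Total" margins before passing the table to a test function.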

2. Calculate Expected Frequencies

The expected frequency for each cell is calculated using the formula:

$$E_{ij} = \frac{\text{Row Total}_i \times \text{Column Total}_j}{\text{Grand Total}}$$

For example, the expected frequency for “Male & Yes” would be:

$$E_{11} = \frac{80 \times 90}{180} = 40$$

This step is repeated for all cells in the table.
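Repeating the formula for every cell amounts to an outer product of the row and column totals. A minimal sketch with NumPy, using the Gender and Purchase Decision counts from the table above (the variable names are illustrative):

```python
import numpy as np

# Observed counts (rows: Male, Female; columns: Yes, No), without the Total margins.
observed = np.array([[50, 30],
                     [40, 60]])

row_totals = observed.sum(axis=1)   # one total per row
col_totals = observed.sum(axis=0)   # one total per column
grand_total = observed.sum()

# E_ij = (row total i * column total j) / grand total, for all cells at once.
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

For this table the expected counts are 40 in both cells of the Male row and 50 in both cells of the Female row.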

3. Compute the Chi-Square Statistic

Now that we have both the observed and expected frequencies, we can compute the Chi-Square statistic using the following formula:

$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Where:

$O_{ij}$ = Observed frequency
$E_{ij}$ = Expected frequency

In this formula, the difference between the observed and expected values is squared and divided by the expected value for each cell, then summed across all cells.
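As a sketch of the calculation, the per-cell contributions for the Gender and Purchase Decision table can be computed and summed directly. The expected counts below are the ones obtained from the row and column totals of that table.

```python
import numpy as np

observed = np.array([[50, 30],
                     [40, 60]], dtype=float)
expected = np.array([[40, 40],
                     [50, 50]], dtype=float)

# Each cell contributes (O - E)^2 / E; the statistic is the sum over all cells.
contributions = (observed - expected) ** 2 / expected
chi_square = contributions.sum()

print(contributions)
print("Chi-Square statistic:", chi_square)
```

Printing the individual contributions is useful in EDA because it shows which cells drive the statistic. Here the Male row contributes 2.5 per cell and the Female row 2.0 per cell, for a total of 9.0.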

4. Determine the Degrees of Freedom

The degrees of freedom (df) are determined by the formula:

$$df = (r - 1) \times (c - 1)$$

Where:

$r$ = Number of rows in the contingency table
$c$ = Number of columns in the contingency table

For a 2×2 table like the one above, df = (2-1) * (2-1) = 1.
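The same rule can be read off the shape of the table. The helper below is a hypothetical convenience function, not part of any library, and it assumes the table is passed without its Total row and column.

```python
import numpy as np

def degrees_of_freedom(table):
    """Return (rows - 1) * (columns - 1) for a contingency table without margins."""
    rows, cols = np.asarray(table).shape
    return (rows - 1) * (cols - 1)

print(degrees_of_freedom([[50, 30], [40, 60]]))            # 2x2 table
print(degrees_of_freedom([[30, 20], [40, 10], [20, 30]]))  # 3x2 table
```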

5. Find the Critical Value

To determine whether the Chi-Square statistic is significant, compare it to a critical value from the Chi-Square distribution table, based on your chosen significance level (α), typically 0.05, and the degrees of freedom. If the computed Chi-Square statistic exceeds the critical value, the result is significant.
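Instead of a printed table, the critical value can be looked up with SciPy's Chi-Square distribution. This sketch uses `scipy.stats.chi2.ppf`, the inverse of the cumulative distribution function, evaluated at 1 − α.

```python
from scipy.stats import chi2

alpha = 0.05

# The critical value is the (1 - alpha) quantile of the Chi-Square distribution.
critical_values = {dof: chi2.ppf(1 - alpha, dof) for dof in (1, 2)}

for dof, value in critical_values.items():
    print(f"df = {dof}: critical value = {value:.3f}")
```

At α = 0.05 this gives roughly 3.841 for df = 1 and 5.991 for df = 2.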

6. Make a Decision

If the Chi-Square statistic is greater than the critical value: Reject the null hypothesis. The data show a statistically significant association between the variables.
If the Chi-Square statistic is less than or equal to the critical value: Fail to reject the null hypothesis. The data do not provide enough evidence of an association, which is not the same as proving that the variables are independent.
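An equivalent way to make the decision is to convert the statistic into a p-value and compare it with α. The sketch below uses the Gender and Purchase Decision table, for which the formula above gives χ² = 9.0 with df = 1, and checks that both routes agree.

```python
from scipy.stats import chi2

statistic = 9.0   # hand-computed for the 2x2 table, without continuity correction
dof = 1
alpha = 0.05

critical_value = chi2.ppf(1 - alpha, dof)
p_value = chi2.sf(statistic, dof)   # survival function: P(X >= statistic)

reject_by_critical_value = statistic > critical_value
reject_by_p_value = p_value < alpha

print("Critical value:", round(critical_value, 3))
print("P-value:", round(p_value, 4))
print("Reject null hypothesis:", reject_by_critical_value, reject_by_p_value)
```

The two rules always lead to the same decision. The p-value has the advantage of showing how strong the evidence is rather than only which side of the threshold it falls on.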

Example: Analyzing a Real-World Scenario

Imagine you are analyzing whether “Age Group” and “Preference for a Product” are related. You might have the following data:

Age Group	Likes Product	Dislikes Product	Total
Under 30	30	20	50
30-50	40	10	50
Over 50	20	30	50
Total	90	60	150

Step 1: Contingency Table

We already have the table ready, so now we move on to the next steps.

Step 2: Calculate Expected Frequencies

For “Under 30 & Likes Product”:

$$E_{11} = \frac{50 \times 90}{150} = 30$$

Repeat for all other cells in the table.

Step 3: Compute the Chi-Square Statistic

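After calculating the expected frequencies, apply the formula to compute the Chi-Square statistic. Because every age group has 50 respondents, each row has expected counts of 30 (Likes) and 20 (Dislikes). The “Under 30” row matches its expected counts exactly and contributes 0, while the “30-50” and “Over 50” rows each contribute 10²/30 + 10²/20 ≈ 8.33, giving χ² ≈ 16.67.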

Step 4: Degrees of Freedom

Since we have a 3×2 table (3 age groups, 2 product preferences), the degrees of freedom would be:

$$df = (3 - 1) \times (2 - 1) = 2$$

Step 5: Find the Critical Value

Using the Chi-Square distribution table, look up the critical value for df = 2 at a significance level of 0.05. The critical value is 5.99.

Step 6: Make a Decision

If the calculated Chi-Square statistic exceeds 5.99, we reject the null hypothesis and conclude that “Age Group” and “Preference for a Product” are significantly related. If it does not, we do not have enough evidence to claim a relationship. For this table the statistic works out to about 16.67, well above 5.99, so we reject the null hypothesis.
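The six steps for the Age Group example can be checked by hand in a few lines. This is a sketch using NumPy for the arithmetic and SciPy only for the critical value. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts (rows: Under 30, 30-50, Over 50; columns: Likes, Dislikes).
observed = np.array([[30, 20],
                     [40, 10],
                     [20, 30]])

# Step 2: expected counts from the row and column totals.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Step 3: Chi-Square statistic.
statistic = ((observed - expected) ** 2 / expected).sum()

# Step 4: degrees of freedom.
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Steps 5 and 6: critical value at alpha = 0.05 and the decision.
critical_value = chi2.ppf(0.95, dof)
reject = statistic > critical_value

print("Statistic:", round(statistic, 2))
print("Degrees of freedom:", dof)
print("Critical value:", round(critical_value, 2))
print("Reject null hypothesis:", reject)
```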

Using Python for the Chi-Square Test

In practice, the Chi-Square test can be easily conducted using Python libraries like SciPy.

```python
import pandas as pd
import scipy.stats as stats

# Example contingency table (rows: age groups, columns: product preference)
data = {'Likes Product': [30, 40, 20],
        'Dislikes Product': [20, 10, 30]}

df = pd.DataFrame(data, index=['Under 30', '30-50', 'Over 50'])

# Perform Chi-Square test of independence
chi2, p, dof, expected = stats.chi2_contingency(df)

# Results
print("Chi-Square Statistic:", chi2)
print("P-Value:", p)
print("Degrees of Freedom:", dof)
print("Expected Frequencies:\n", expected)

# Decision based on p-value
if p < 0.05:
    print("Reject the null hypothesis: Significant relationship.")
else:
    print("Fail to reject the null hypothesis: No significant relationship detected.")
```
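One detail worth knowing when comparing SciPy's output with a hand calculation: for tables with one degree of freedom (2×2), `chi2_contingency` applies Yates' continuity correction by default, so its statistic is smaller than the one given by the plain formula. The sketch below runs the Gender and Purchase Decision table both ways. Passing `correction=False` reproduces the uncorrected value.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Gender x Purchase Decision counts (rows: Male, Female; columns: Yes, No).
observed = np.array([[50, 30],
                     [40, 60]])

# Default behaviour: Yates' continuity correction is applied because dof == 1.
chi2_yates, p_yates, dof, _ = chi2_contingency(observed)

# correction=False gives the plain sum of (O - E)^2 / E.
chi2_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)

print("With correction:   ", chi2_yates, p_yates)
print("Without correction:", chi2_plain, p_plain)
```

For tables larger than 2×2, such as the Age Group example, the correction is not applied and the two calls give the same result.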

Considerations When Using the Chi-Square Test

Sample Size: The Chi-Square test relies on a large-sample approximation, so it is most reliable when the sample size is large. A common rule of thumb is that every cell should have an expected frequency of at least 5; when expected frequencies fall below that, the p-value may be inaccurate. In such cases, consider using Fisher’s Exact Test, which is suitable for smaller datasets.
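Independence: The Chi-Square test assumes that the observations are independent of each other, with each subject counted in exactly one cell. If the data are paired (e.g., before and after measurements on the same subjects), the Chi-Square test is not appropriate; McNemar’s test is designed for paired categorical data.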
Data Type: The test only applies to categorical data. For continuous data, other statistical tests like ANOVA or t-tests should be used.
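The sample-size consideration can be checked programmatically before trusting the result. The sketch below uses a small 2×2 table invented purely for illustration, inspects the expected counts returned by `chi2_contingency`, and falls back to `scipy.stats.fisher_exact` when any of them is below 5. Note that SciPy's `fisher_exact` accepts only 2×2 tables, so this fallback is an assumption that holds for this example, not a general recipe.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical small 2x2 table, invented for illustration only.
small_table = np.array([[3, 7],
                        [8, 2]])

# chi2_contingency returns the expected counts as its fourth value.
_, chi2_p, _, expected = chi2_contingency(small_table)

if (expected < 5).any():
    # At least one expected count is below 5: use Fisher's Exact Test instead.
    _, p_value = fisher_exact(small_table)
    method = "fisher"
else:
    p_value = chi2_p
    method = "chi2"

print("Expected counts:\n", expected)
print("Method used:", method)
print("P-value:", p_value)
```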

Conclusion

The Chi-Square test is a powerful tool for detecting relationships between categorical variables in EDA. By following a clear set of steps (setting up a contingency table, calculating expected values, computing the statistic, and comparing it to a critical value or p-value threshold), you can determine if there is a significant association between the variables. In practice, tools like Python’s SciPy library make it easier to perform the test and interpret the results efficiently. However, it’s important to ensure that the assumptions of the test are met to avoid misleading conclusions.
