How to Use Python’s Scipy and Statsmodels for EDA

Exploratory Data Analysis (EDA) is a critical step in the data science process that helps uncover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. Python offers powerful libraries like Scipy and Statsmodels that simplify and enhance EDA with advanced statistical tools and tests. This article dives into how to effectively use Scipy and Statsmodels for thorough EDA.

Understanding Scipy and Statsmodels in EDA

Scipy is a robust Python library for scientific and technical computing, offering modules for optimization, integration, interpolation, eigenvalue problems, algebraic equations, and statistics. Within EDA, Scipy provides a rich set of statistical functions and tests to analyze data distributions, relationships, and significance.

Statsmodels complements Scipy by focusing on statistical modeling, hypothesis testing, and data exploration. It provides extensive capabilities for regression analysis, time series analysis, and statistical tests, with detailed output that is highly interpretable.

Step 1: Preparing Your Data

Before diving into statistical analysis with Scipy and Statsmodels, data must be clean and well-organized.

python
import pandas as pd

# Load your dataset
data = pd.read_csv('your_dataset.csv')

# Basic data cleaning
data = data.dropna()  # Drop missing values for simplicity
print(data.head())
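
Dropping every row with any missing value can discard far more data than intended. As a minimal sketch, assuming a hypothetical toy frame in place of your dataset (the column names below are illustrative only), you can count missing values per column first and then drop rows based only on the columns the analysis needs:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for your dataset
df = pd.DataFrame({
    "value_col": [1.2, np.nan, 3.4, 4.1, 5.0],
    "group_col": ["a", "b", None, "a", "b"],
    "notes": [None, None, "x", None, None],
})

# Count missing values per column before deciding what to drop
missing_counts = df.isna().sum()
print(missing_counts)

# Drop rows only when the columns needed for the analysis are missing
cleaned = df.dropna(subset=["value_col", "group_col"])
print(len(df), "rows before,", len(cleaned), "rows after")
```

Here a mostly empty "notes" column would cause a plain dropna() to remove every row, while restricting the check to the two analysis columns keeps the usable ones.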

Step 2: Descriptive Statistics Using Scipy and Pandas

Start with basic descriptive statistics to summarize your data.

python
print(data.describe())  # Summary statistics using pandas

from scipy import stats

# Calculate skewness and kurtosis using Scipy
skewness = stats.skew(data['column_name'])
kurtosis = stats.kurtosis(data['column_name'])
print(f"Skewness: {skewness}, Kurtosis: {kurtosis}")

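Skewness indicates asymmetry in the data distribution: positive values mean a longer right tail, negative values a longer left tail.
Kurtosis measures the “tailedness” of the distribution. By default, stats.kurtosis returns excess kurtosis (Fisher's definition), so a normal distribution scores about 0; large positive values point to heavy tails and possible outliers.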
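
Kurtosis is reported under two conventions, which is a common source of confusion when comparing numbers across tools. The sketch below, using a synthetic normal sample with an arbitrarily chosen seed, shows that stats.kurtosis defaults to the Fisher (excess) definition and that passing fisher=False switches to the Pearson definition, which is larger by exactly 3:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
sample = rng.normal(size=10_000)

# Fisher definition (default): a normal distribution gives roughly 0
excess = stats.kurtosis(sample)
# Pearson definition: a normal distribution gives roughly 3
pearson = stats.kurtosis(sample, fisher=False)

print(f"excess={excess:.3f}, pearson={pearson:.3f}")
```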

Step 3: Distribution Analysis

Analyzing the distribution of your variables helps check assumptions like normality.

python
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import normaltest

sns.histplot(data['column_name'], kde=True)
plt.show()

# Test for normality
stat, p = normaltest(data['column_name'])
print(f"Normality Test Statistic={stat:.3f}, p-value={p:.3f}")

if p > 0.05:
    print("No significant departure from normality detected.")
else:
    print("Data departs significantly from a normal distribution.")

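D’Agostino and Pearson’s K-squared test (normaltest) in Scipy combines skewness and kurtosis to assess whether data deviates from a normal distribution. A p-value above 0.05 means the test found no evidence against normality, not that normality has been proven.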
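
A single normality test is easier to trust when a second check agrees with it. As a hedged sketch on synthetic samples (not your data), the code below runs the Shapiro-Wilk test via stats.shapiro and uses stats.probplot to get the correlation r of the Q-Q points, which sits close to 1 when the sample follows a normal distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed, chosen arbitrarily
normal_sample = rng.normal(loc=5.0, scale=2.0, size=300)
skewed_sample = rng.exponential(scale=2.0, size=300)

results = {}
for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    w_stat, w_p = stats.shapiro(sample)
    # probplot returns the Q-Q points and a least-squares fit (slope, intercept, r)
    _, (slope, intercept, r) = stats.probplot(sample, dist="norm")
    results[name] = (w_p, r)
    print(f"{name}: Shapiro W={w_stat:.3f}, p={w_p:.3g}, Q-Q r={r:.3f}")
```

Passing plot=plt to stats.probplot draws the Q-Q plot itself if you also want the visual check.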

Step 4: Correlation Analysis

Identifying relationships between variables is essential.

python
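# Using Pandas correlation matrix (numeric columns only;
# pandas >= 2.0 raises an error on non-numeric columns otherwise)
correlation_matrix = data.corr(numeric_only=True)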
print(correlation_matrix)

# Hypothesis testing of correlation using Scipy
from scipy.stats import pearsonr

corr, p_value = pearsonr(data['col1'], data['col2'])
print(f"Pearson correlation: {corr:.3f}, p-value: {p_value:.3f}")

The correlation coefficient ranges from -1 to 1 and indicates the strength and direction of the linear relationship.
The p-value tests the null hypothesis that the two variables are uncorrelated.
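
Pearson's coefficient only captures linear association, so it can understate a relationship that is monotonic but curved. The sketch below, on a constructed example where y is the exponential of x, compares it with Spearman's rank correlation from stats.spearmanr:

```python
import numpy as np
from scipy import stats

x = np.linspace(0, 5, 50)
y = np.exp(x)  # monotonic but strongly non-linear in x

pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)

print(f"Pearson r: {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Because the ranks of x and y agree perfectly, Spearman's rho reaches 1 here while Pearson's r stays noticeably lower.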

Step 5: Statistical Tests for Group Differences with Scipy

For comparing groups or categories, Scipy offers t-tests, ANOVA, and non-parametric tests.

Example: Independent t-test between two groups

python
group1 = data[data['group_col'] == 'Group1']['value_col']
group2 = data[data['group_col'] == 'Group2']['value_col']

t_stat, p_val = stats.ttest_ind(group1, group2)
print(f"T-test statistic: {t_stat:.3f}, p-value: {p_val:.3f}")

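If the p-value is below 0.05, you can reject the null hypothesis of equal group means at the 5% level. Note that ttest_ind assumes equal variances by default; pass equal_var=False to run Welch's t-test when the group variances differ.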
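
The other tests mentioned at the start of this step follow the same calling pattern. A minimal sketch on synthetic groups (the means, spreads, and sizes are made up for illustration) showing Welch's t-test, the non-parametric Mann-Whitney U test, and one-way ANOVA across three groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # fixed seed, chosen arbitrarily
g1 = rng.normal(loc=10.0, scale=1.0, size=40)
g2 = rng.normal(loc=13.0, scale=3.0, size=60)  # different mean and spread
g3 = rng.normal(loc=10.2, scale=1.5, size=50)

# Welch's t-test: does not assume equal variances
welch_t, welch_p = stats.ttest_ind(g1, g2, equal_var=False)

# Mann-Whitney U: rank-based alternative that does not assume normality
u_stat, u_p = stats.mannwhitneyu(g1, g2, alternative="two-sided")

# One-way ANOVA across three groups (itself assumes similar variances,
# so treat this call as an illustration of the API only)
f_stat, anova_p = stats.f_oneway(g1, g2, g3)

print(f"Welch t={welch_t:.3f}, p={welch_p:.3g}")
print(f"Mann-Whitney U={u_stat:.1f}, p={u_p:.3g}")
print(f"ANOVA F={f_stat:.3f}, p={anova_p:.3g}")
```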

Step 6: Regression Analysis with Statsmodels

Regression modeling provides insight into the relationship between a dependent variable and each predictor while controlling for the other predictors.

python
import statsmodels.api as sm

X = data[['independent_var1', 'independent_var2']]
y = data['dependent_var']

X = sm.add_constant(X)  # Adds intercept term
model = sm.OLS(y, X).fit()
print(model.summary())

The summary() output from Statsmodels gives detailed regression coefficients, R-squared, confidence intervals, and diagnostic statistics for model evaluation.

Step 7: Residual Analysis and Diagnostics

Validate your regression model by examining residuals.

python
import matplotlib.pyplot as plt
import seaborn as sns

residuals = model.resid

# Plot residuals distribution
sns.histplot(residuals, kde=True)
plt.title('Residuals Distribution')
plt.show()

# Plot residuals vs fitted values
fitted = model.fittedvalues
plt.scatter(fitted, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.show()

Checking residuals helps detect heteroscedasticity, non-linearity, or outliers affecting the model.

Step 8: Time Series EDA with Statsmodels

For time series data, Statsmodels offers rich EDA functions like autocorrelation and stationarity tests.

python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller

series = data['time_series_col']

plot_acf(series)
plot_pacf(series)
plt.show()

# Augmented Dickey-Fuller test for stationarity
adf_result = adfuller(series)
print(f"ADF Statistic: {adf_result[0]:.3f}")
print(f"p-value: {adf_result[1]:.3f}")

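# The ADF null hypothesis is that the series has a unit root
if adf_result[1] < 0.05:
    print("Unit-root null rejected: the series appears stationary.")
else:
    print("Cannot reject the unit-root null: the series may be non-stationary.")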

Conclusion

Scipy and Statsmodels together provide a powerful toolkit for Exploratory Data Analysis, combining descriptive statistics, hypothesis testing, regression modeling, and diagnostics. Scipy excels in statistical testing and distributions, while Statsmodels shines with its detailed regression outputs and time series analysis.

Integrating these libraries into your EDA workflow helps build a solid foundation for accurate modeling and insights, improving your data-driven decision-making.
