The Palos Publishing Company

How to Use Python’s Scipy and Statsmodels for EDA

Exploratory Data Analysis (EDA) is a critical step in the data science process that helps uncover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. Python offers powerful libraries like Scipy and Statsmodels that simplify and enhance EDA with advanced statistical tools and tests. This article dives into how to effectively use Scipy and Statsmodels for thorough EDA.


Understanding Scipy and Statsmodels in EDA

Scipy is a robust Python library for scientific and technical computing, offering modules for optimization, integration, interpolation, eigenvalue problems, algebraic equations, and statistics. Within EDA, Scipy provides a rich set of statistical functions and tests to analyze data distributions, relationships, and significance.

Statsmodels complements Scipy by focusing on statistical modeling, hypothesis testing, and data exploration. It provides extensive capabilities for regression analysis, time series analysis, and statistical tests, with detailed output that is highly interpretable.


Step 1: Preparing Your Data

Before diving into statistical analysis with Scipy and Statsmodels, data must be clean and well-organized.

python
import pandas as pd

# Load your dataset
data = pd.read_csv('your_dataset.csv')

# Basic data cleaning
data = data.dropna()  # Drop missing values for simplicity
print(data.head())
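Dropping every incomplete row can discard more data than expected, so it helps to count missing values first. The sketch below uses a small made-up DataFrame in place of `your_dataset.csv`; the values are hypothetical and only the column names `value_col` and `group_col` are borrowed from the later steps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for your_dataset.csv
data = pd.DataFrame({
    "value_col": [1.2, np.nan, 3.4, 4.1, np.nan],
    "group_col": ["Group1", "Group2", None, "Group1", "Group2"],
})

# Count missing values per column before deciding how to handle them
missing_per_column = data.isna().sum()
print(missing_per_column)

# Drop incomplete rows and report how much data was lost
cleaned = data.dropna()
rows_dropped = len(data) - len(cleaned)
print(f"Rows dropped: {rows_dropped} of {len(data)}")
```

If a large share of rows disappears, imputing values or dropping only selected columns may be a better choice than a blanket `dropna()`.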

Step 2: Descriptive Statistics Using Scipy and Pandas

Start with basic descriptive statistics to summarize your data.

python
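from scipy import stats

print(data.describe())  # Summary statistics using pandas

# Calculate skewness and kurtosis using Scipy
skewness = stats.skew(data['column_name'])
kurtosis = stats.kurtosis(data['column_name'])  # Excess kurtosis (normal distribution = 0)
print(f"Skewness: {skewness}, Kurtosis: {kurtosis}")

  • Skewness indicates asymmetry in the data distribution: positive values mean a longer right tail, negative values a longer left tail.

  • Kurtosis measures the “tailedness” of the distribution; high values point to heavy tails and possible outliers. By default, stats.kurtosis returns excess kurtosis, so a normal distribution scores about 0.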
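To make these two measures concrete, here is a minimal sketch comparing a symmetric sample with a right-skewed one. The samples are synthetic (seeded NumPy draws), so the variable names and values are illustrative assumptions rather than anything from a real dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
symmetric = rng.normal(size=5000)          # synthetic, roughly symmetric
right_skewed = rng.exponential(size=5000)  # synthetic, long right tail

for name, sample in [("normal", symmetric), ("exponential", right_skewed)]:
    skew = stats.skew(sample)
    excess_kurt = stats.kurtosis(sample)                 # Fisher definition
    pearson_kurt = stats.kurtosis(sample, fisher=False)  # Pearson definition
    print(f"{name}: skew={skew:.2f}, "
          f"excess kurtosis={excess_kurt:.2f}, Pearson kurtosis={pearson_kurt:.2f}")
```

The two kurtosis definitions differ by exactly 3, so it is worth checking which one a report or library uses before comparing numbers.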


Step 3: Distribution Analysis

Analyzing the distribution of your variables helps check assumptions like normality.

python
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import normaltest

sns.histplot(data['column_name'], kde=True)
plt.show()

# Test for normality
stat, p = normaltest(data['column_name'])
print(f"Normality Test Statistic={stat:.3f}, p-value={p:.3f}")
if p > 0.05:
    print("No significant departure from normality detected.")
else:
    print("Data deviates significantly from a normal distribution.")

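Scipy's normaltest implements the D’Agostino-Pearson K-squared test, which combines skewness and kurtosis to assess whether the data deviate from a normal distribution. A p-value above 0.05 means the test found no evidence against normality; it does not prove the data are normal.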
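Because any single normality test can miss some kinds of departure, it can be useful to cross-check with a second test and a Q-Q summary. The following is a sketch on synthetic samples (the data and the `results` dictionary are assumptions for illustration), using `scipy.stats.shapiro` and `scipy.stats.probplot` alongside `normaltest`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
samples = {
    "normal": rng.normal(loc=10, scale=2, size=300),       # synthetic
    "lognormal": rng.lognormal(mean=0, sigma=1, size=300),  # synthetic, skewed
}

results = {}
for name, sample in samples.items():
    _, k2_p = stats.normaltest(sample)  # D'Agostino-Pearson K-squared
    _, sw_p = stats.shapiro(sample)     # Shapiro-Wilk
    # Without a plot argument, probplot returns
    # ((theoretical quantiles, ordered values), (slope, intercept, r))
    _, (_, _, qq_r) = stats.probplot(sample, dist="norm")
    results[name] = {"k2_p": k2_p, "shapiro_p": sw_p, "qq_r": qq_r}
    print(f"{name}: K2 p={k2_p:.3g}, Shapiro p={sw_p:.3g}, Q-Q r={qq_r:.3f}")
```

A Q-Q correlation close to 1 suggests the sample quantiles line up with normal quantiles; the skewed sample should score visibly lower.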


Step 4: Correlation Analysis

Identifying relationships between variables is essential.

python
# Using Pandas correlation matrix (numeric columns only)
correlation_matrix = data.corr(numeric_only=True)
print(correlation_matrix)

# Hypothesis testing of correlation using Scipy
from scipy.stats import pearsonr

corr, p_value = pearsonr(data['col1'], data['col2'])
print(f"Pearson correlation: {corr:.3f}, p-value: {p_value:.3f}")

  • The correlation coefficient indicates the strength and direction of a linear relationship, ranging from -1 to 1.

  • The p-value tests whether the correlation differs significantly from zero.
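Pearson's coefficient only captures linear association, so a rank-based measure can be a useful companion during EDA. Here is a small sketch on synthetic data with a monotonic but non-linear relationship; the variables `x` and `y` are made up for illustration, and `scipy.stats.spearmanr` is swapped in as the rank-based alternative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=200)                  # synthetic predictor
y = np.exp(x) + rng.normal(scale=0.5, size=200)  # monotonic, non-linear response

pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)

print(f"Pearson r:    {pearson_r:.3f} (p={pearson_p:.3g})")
print(f"Spearman rho: {spearman_rho:.3f} (p={spearman_p:.3g})")
```

A Spearman value well above the Pearson value is a hint that the relationship is monotonic but curved, which may call for a transformation before linear modeling.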


Step 5: Statistical Tests for Group Differences with Scipy

For comparing groups or categories, Scipy offers t-tests, ANOVA, and non-parametric tests.

Example: Independent t-test between two groups

python
from scipy import stats

group1 = data[data['group_col'] == 'Group1']['value_col']
group2 = data[data['group_col'] == 'Group2']['value_col']

# Student's t-test (assumes equal variances by default)
t_stat, p_val = stats.ttest_ind(group1, group2)
print(f"T-test statistic: {t_stat:.3f}, p-value: {p_val:.3f}")

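If the p-value is below 0.05, you can reject the null hypothesis of equal group means at the 5% significance level. Note that ttest_ind assumes equal variances by default; pass equal_var=False to run Welch's t-test when the group variances differ.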
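The ANOVA and non-parametric options mentioned above follow the same calling pattern. The sketch below runs them on three synthetic groups (the group names, means, and spreads are assumptions chosen for illustration) and adds Levene's test as one way to check the equal-variance assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(loc=50, scale=5, size=80)   # synthetic
group_b = rng.normal(loc=60, scale=12, size=60)  # higher mean, wider spread
group_c = rng.normal(loc=50, scale=5, size=70)   # synthetic

# Levene's test: are the variances of the two groups comparable?
_, levene_p = stats.levene(group_a, group_b)

# Welch's t-test does not assume equal variances
welch_t, welch_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U: rank-based alternative for two groups
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# One-way ANOVA across all three groups
f_stat, anova_p = stats.f_oneway(group_a, group_b, group_c)

print(f"Levene p={levene_p:.3g}")
print(f"Welch t={welch_t:.3f}, p={welch_p:.3g}")
print(f"Mann-Whitney U={u_stat:.1f}, p={u_p:.3g}")
print(f"ANOVA F={f_stat:.3f}, p={anova_p:.3g}")
```

A significant ANOVA result only says that at least one group mean differs; identifying which one needs a follow-up comparison.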


Step 6: Regression Analysis with Statsmodels

Regression modeling estimates how each predictor relates to the outcome while controlling for the other predictors.

python
import statsmodels.api as sm

X = data[['independent_var1', 'independent_var2']]
y = data['dependent_var']

X = sm.add_constant(X)  # Adds intercept term
model = sm.OLS(y, X).fit()
print(model.summary())

The summary() output from Statsmodels gives detailed regression coefficients, R-squared, confidence intervals, and diagnostic statistics for model evaluation.


Step 7: Residual Analysis and Diagnostics

Validate your regression model by examining residuals.

python
import matplotlib.pyplot as plt
import seaborn as sns

residuals = model.resid

# Plot residuals distribution
sns.histplot(residuals, kde=True)
plt.title('Residuals Distribution')
plt.show()

# Plot residuals vs fitted values
fitted = model.fittedvalues
plt.scatter(fitted, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.show()

Checking residuals helps detect heteroscedasticity, non-linearity, or outliers affecting the model.


Step 8: Time Series EDA with Statsmodels

For time series data, Statsmodels offers rich EDA functions like autocorrelation and stationarity tests.

python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller

series = data['time_series_col'].dropna()  # adfuller cannot handle missing values

plot_acf(series)
plot_pacf(series)
plt.show()

# Augmented Dickey-Fuller test (null hypothesis: the series has a unit root)
adf_result = adfuller(series)
print(f"ADF Statistic: {adf_result[0]:.3f}")
print(f"p-value: {adf_result[1]:.3f}")
if adf_result[1] < 0.05:
    print("Unit-root null rejected: the series appears stationary.")
else:
    print("Unit-root null not rejected: the series may be non-stationary.")

Conclusion

Scipy and Statsmodels together provide a powerful toolkit for Exploratory Data Analysis, combining descriptive statistics, hypothesis testing, regression modeling, and diagnostics. Scipy excels in statistical testing and distributions, while Statsmodels shines with its detailed regression outputs and time series analysis.

Integrating these libraries into your EDA workflow helps build a solid foundation for accurate modeling and insights, improving your data-driven decision-making.

