Exploratory Data Analysis (EDA) is a fundamental process in data science that helps uncover patterns, detect anomalies, test hypotheses, and check assumptions through visual and quantitative techniques. While linear relationships between variables are relatively straightforward to identify, detecting non-linear relationships requires more nuanced approaches. Identifying these relationships is crucial for building accurate predictive models, as many real-world phenomena do not follow linear patterns. Here’s how you can effectively detect non-linear relationships between variables using EDA.
Understanding Non-Linear Relationships
In a non-linear relationship, a one-unit change in the independent variable does not produce a constant change in the dependent variable; the size (or even the direction) of the effect depends on where you are in the variable’s range. These relationships can take various forms such as exponential, logarithmic, quadratic, cubic, or sinusoidal. Recognizing these patterns can help in choosing the appropriate model or transformation technique.
1. Visual Inspection with Scatter Plots
Scatter plots are among the most powerful tools in EDA for detecting non-linear relationships.
- Usage: Plot each pair of numerical variables against each other.
- Identification: Look for curves, clusters, or wavy patterns instead of straight-line trends.
- Enhancement: Use color coding for categorical variables to detect hidden relationships.
Example: A scatter plot showing a parabolic curve suggests a quadratic relationship. Similarly, an S-curve could indicate a sigmoid-type function.
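As a minimal sketch of this kind of inspection, the snippet below draws a scatter plot of synthetic data with a quadratic trend and colors the points by a made-up category. The data, the variable names (x, y, group), and the output file name are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data (an assumption for illustration): y depends on x quadratically.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 300)
y = x**2 + rng.normal(0, 0.5, 300)
group = rng.integers(0, 2, 300)  # hypothetical binary category

fig, ax = plt.subplots(figsize=(6, 4))
# One scatter call per category so each gets its own color and legend entry.
for value in (0, 1):
    mask = group == value
    ax.scatter(x[mask], y[mask], s=12, alpha=0.7, label=f"group {value}")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("scatter_quadratic.png", dpi=100)
```

In the saved image the points should trace a U-shape rather than a straight band, which is the visual cue described above.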
2. Pair Plots (Scatterplot Matrix)
A pair plot offers a grid of scatter plots for each pairwise combination of features in a dataset.
- Benefit: Allows simultaneous visualization of multiple variable relationships.
- Detection: Curved or complex scatter patterns in these plots are telltale signs of non-linearity.
Tools: Seaborn’s pairplot() in Python is highly effective for generating pair plots.
3. Using Correlation Measures Beyond Pearson
Pearson correlation measures only the strength of a linear relationship, so it can understate or miss a strong non-linear one. Alternative measures help fill that gap.
- Spearman’s Rank Correlation: Detects monotonic relationships, whether linear or not.
- Kendall’s Tau: A rank-based measure that is also suited to monotonic relationships, linear or not.
- Maximal Information Coefficient (MIC): Part of the MINE statistics, MIC can detect both linear and non-linear associations, including non-monotonic ones that rank-based measures miss.
These methods provide numeric evidence of association that can support visual findings.
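To see how these measures differ in practice, the sketch below computes Pearson, Spearman, and Kendall coefficients with SciPy on two synthetic, noise-free relationships (both are assumptions chosen for illustration). MIC is left out because it needs a separate third-party package.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

# Evenly spaced x values, symmetric around zero (synthetic, for illustration).
x = np.linspace(-3, 3, 201)
relationships = {
    "exponential (monotonic)": np.exp(x),
    "quadratic (non-monotonic)": x**2,
}

results = {}
for name, y in relationships.items():
    # Each SciPy function returns (statistic, p-value); keep the statistic only.
    results[name] = {
        "pearson": pearsonr(x, y)[0],
        "spearman": spearmanr(x, y)[0],
        "kendall": kendalltau(x, y)[0],
    }
    print(name, {k: round(float(v), 3) for k, v in results[name].items()})
```

On the exponential curve the rank-based coefficients come out at 1 while Pearson stays noticeably lower. On the symmetric parabola all three sit near zero, which is why a visual check is still needed for non-monotonic patterns.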
4. Residual Plots
Residual plots show the difference between observed and predicted values in a model, often a linear regression.
- Detection: If the residuals display a pattern (such as a curve), it suggests that a linear model is not sufficient.
- Interpretation: A random scatter around zero is consistent with an adequate fit; systematic structure such as a curve or U-shape suggests non-linearity.
Residual plots are particularly useful after fitting a basic model to check its adequacy.
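The following sketch fits a straight line to synthetic quadratic data (an assumption for illustration) with NumPy and computes the residuals that a residual plot would show. As a rough numeric stand-in for eyeballing the plot, it also measures how much of the residual variance a quadratic in x can explain.

```python
import numpy as np

# Synthetic data (an assumption): the true relationship is quadratic.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 200)
y = 0.5 * x**2 - 2 * x + rng.normal(0, 1.0, x.size)

# Fit a straight line and compute its residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# If the linear model were adequate, a quadratic in x would explain
# almost none of the residual variance.
coeffs = np.polyfit(x, residuals, deg=2)
leftover = residuals - np.polyval(coeffs, x)
explained = 1 - np.var(leftover) / np.var(residuals)
print(f"share of residual variance explained by curvature: {explained:.2f}")
```

Plotting residuals against x for this data would show the U-shape directly; the printed share is only a convenience check, not a formal test.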
5. Box Plots for Categorical vs. Continuous Variables
Box plots help identify the relationship between a categorical and a continuous variable.
- Usage: Group data by categories and observe distribution changes.
- Detection: For ordered categories, non-linear relationships might manifest as medians that do not rise or fall steadily from one category to the next, or as varying spreads.
This is especially useful for detecting non-linear trends in grouped or segmented data.
6. Heatmaps of Correlation Matrices (Using Rank-Based Correlations)
While traditional correlation matrices rely on Pearson coefficients, using Spearman or Kendall in a heatmap format can highlight monotonic associations that are not linear. Pairs whose rank-based coefficient is clearly larger in magnitude than their Pearson coefficient are good candidates.
- Visualization: Color gradients represent strength and direction of relationships.
- Benefit: Helps prioritize which variable pairs warrant deeper non-linear analysis.
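One way to prioritize pairs is to compute both matrices and look at where they disagree. The sketch below does this with pandas on a synthetic DataFrame; the column names and the data-generating formulas are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(7)
n = 500
x = rng.uniform(0, 5, n)
df = pd.DataFrame({
    "x": x,
    "exp_x": np.exp(x) + rng.normal(0, 1, n),  # monotonic but strongly curved
    "lin_x": 3 * x + rng.normal(0, 1, n),      # roughly linear
    "noise": rng.normal(0, 1, n),              # unrelated to x
})

pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Positive entries mark pairs where the rank-based measure is stronger.
gap = (spearman.abs() - pearson.abs()).round(2)
print(gap)
# Either matrix can be drawn as a heatmap, for example with
# seaborn.heatmap(spearman, annot=True, vmin=-1, vmax=1).
```

This comparison only flags monotonic curvature. A non-monotonic pair, such as a U-shape, can score near zero on both measures and would not stand out here.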
7. Loess/LOWESS Curves in Scatter Plots
Locally Weighted Scatterplot Smoothing (LOWESS or LOESS) overlays a smooth curve over scatter plot data to show the local relationship between variables.
- Usage: Apply LOWESS to reveal underlying trends that are not obvious from raw points.
- Detection: Deviations from a straight line imply non-linear behavior.
Tools like Seaborn’s regplot() or lmplot() with lowess=True help generate such curves.
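Seaborn’s lowess option relies on statsmodels, whose lowess function can also be called directly. The sketch below smooths synthetic sine-shaped data (an assumption for illustration) and measures how far the smooth curve strays from the best straight line; the frac value of 0.3 is an arbitrary choice, not a recommendation.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic data (an assumption): one full sine wave plus noise.
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 2 * np.pi, 300))
y = np.sin(x) + rng.normal(0, 0.3, x.size)

# lowess takes y first, then x, and returns an (n, 2) array sorted by x:
# column 0 holds x, column 1 holds the smoothed y.
smoothed = lowess(y, x, frac=0.3)
x_s, y_s = smoothed[:, 0], smoothed[:, 1]

# Compare the smooth curve with an ordinary least-squares line.
slope, intercept = np.polyfit(x, y, deg=1)
max_gap = np.max(np.abs(y_s - (slope * x_s + intercept)))
print(f"largest gap between LOWESS curve and straight line: {max_gap:.2f}")
```

A gap that is large relative to the noise level is a numeric hint of the same deviation you would see when overlaying the curve on the scatter plot.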
8. Polynomial and Spline Fitting
Fit polynomial curves of varying degrees to the data.
- Procedure: Try quadratic, cubic, or higher-order polynomials to see if the fit improves.
- Spline Regression: Breaks the range of the predictor into intervals and fits piecewise polynomials that join smoothly at the interval boundaries (knots), useful for complex non-linear structures.
Plotting these fitted curves over raw data reveals how well the model explains variation.
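The procedure can be sketched with NumPy by fitting polynomials of increasing degree and comparing R² values. The data below is synthetic with a cubic trend (an assumption for illustration), and r_squared is a helper defined here, not a library function.

```python
import numpy as np

# Synthetic data (an assumption): the true relationship is cubic.
rng = np.random.default_rng(5)
x = np.linspace(-2, 2, 150)
y = x**3 - x + rng.normal(0, 0.3, x.size)

def r_squared(degree):
    """In-sample R^2 of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, deg=degree)
    resid = y - np.polyval(coeffs, x)
    return 1 - resid.var() / y.var()

scores = {d: r_squared(d) for d in (1, 2, 3, 4)}
for degree, score in scores.items():
    print(f"degree {degree}: R^2 = {score:.3f}")
```

In-sample R² never decreases as the degree grows, so the useful signal is where the gain levels off. Here the jump should come at degree 3, with little added by degree 4.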
9. Decision Trees for Pattern Recognition
Decision trees naturally capture non-linear relationships.
- Usage: Fit a decision tree regressor or classifier and evaluate feature splits.
- Interpretation: The hierarchical decision structure shows variable interactions and thresholds not apparent in linear models.
Though not strictly part of traditional EDA, shallow decision trees can be used as exploratory tools.
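As a small illustration, the sketch below fits a depth-2 regression tree with scikit-learn to synthetic data containing a step change (an assumption for illustration) and prints the learned splits. The feature names x0 and x1 are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data (an assumption): the target jumps when feature 0 crosses 6,
# and feature 1 has no effect.
rng = np.random.default_rng(11)
X = rng.uniform(0, 10, size=(400, 2))
y = np.where(X[:, 0] > 6, 5.0, 1.0) + rng.normal(0, 0.3, 400)

# A shallow tree keeps the printed structure readable.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["x0", "x1"]))

# The root node holds the single most informative split.
root_feature = tree.tree_.feature[0]
root_threshold = tree.tree_.threshold[0]
```

The root split should land on x0 at a threshold close to 6, a pattern that a straight-line fit would smear into a gentle slope.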
10. Dimensionality Reduction Techniques
Techniques like t-SNE (t-distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) help in identifying complex relationships in high-dimensional data.
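- Application: Useful when relationships are hard to visualize due to many features.
- Detection: Curved, branching, or clustered structure in the low-dimensional plot suggests hidden non-linear associations, though the layout depends on the method’s settings and is best treated as a hint to investigate rather than as proof.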
These are particularly valuable when dealing with large, multi-feature datasets.
11. Transformations to Reveal Hidden Patterns
Sometimes transforming variables helps uncover non-linear relationships.
- Common Transformations:
  - Logarithmic
  - Square root
  - Reciprocal
  - Box-Cox or Yeo-Johnson transformations
- Post-transformation scatter plots or correlation analysis can indicate if the transformation linearizes the relationship.
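The check in the last bullet can be sketched as a before-and-after comparison of Pearson correlation. The data below is synthetic with exponential growth and multiplicative noise (an assumption for illustration). Note that the logarithm and Box-Cox need strictly positive values, while Yeo-Johnson also accepts zero and negative values.

```python
import numpy as np

# Synthetic data (an assumption): exponential growth with multiplicative
# noise, which keeps every y value positive.
rng = np.random.default_rng(21)
x = rng.uniform(0, 5, 300)
y = 2.0 * np.exp(0.8 * x) * rng.lognormal(0, 0.2, 300)

# Pearson correlation before and after taking the logarithm of y.
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(x, np.log(y))[0, 1]
print(f"Pearson r, raw y: {r_raw:.3f}")
print(f"Pearson r, log y: {r_log:.3f}")
```

A clear rise in the coefficient after the transformation is consistent with an exponential relationship that the log has straightened out.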
12. Interaction Effects and Feature Engineering
Creating interaction terms or polynomial features can help reveal non-linear effects.
- Example: Instead of just using X, include X^2 or X*Z (interaction with another variable).
- Analysis: Check whether these engineered features show stronger relationships with the target.
EDA on these engineered features often brings out patterns that were previously invisible.
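A minimal sketch of this analysis with pandas follows. The target is synthetic and deliberately built from X squared and the X*Z product (an assumption for illustration), and the column names X_sq and X_times_Z are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(13)
n = 2000
df = pd.DataFrame({"X": rng.normal(0, 1, n), "Z": rng.normal(0, 1, n)})
# The target depends on X only through X^2 and the X*Z interaction.
df["target"] = df["X"] ** 2 + 2 * df["X"] * df["Z"] + rng.normal(0, 0.5, n)

# Engineered features.
df["X_sq"] = df["X"] ** 2
df["X_times_Z"] = df["X"] * df["Z"]

# Pearson correlation of every feature with the target.
corr_with_target = df.corr()["target"].drop("target").round(2)
print(corr_with_target)
```

The raw X and Z columns should correlate only weakly with the target, while the squared and interaction columns correlate much more strongly, even though they are built from the same inputs.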
Conclusion
Detecting non-linear relationships during EDA is essential for selecting appropriate modeling techniques and avoiding biased inferences. Scatter plots and correlation measures are the starting point; combining them with residual analysis, non-linear fits, and dimensionality reduction makes the exploration more robust. By methodically applying these techniques, data scientists can gain deeper insights and construct models that more accurately capture the complexity of real-world data.