How to Use Python Libraries Like Seaborn and Matplotlib for EDA

Exploratory Data Analysis (EDA) is an essential step in any data analysis pipeline. It involves summarizing the main characteristics of a dataset, often with visual methods. Python offers several libraries for this purpose, but two of the most popular ones are Seaborn and Matplotlib. These libraries provide versatile tools to explore datasets, visualize distributions, and uncover hidden patterns or outliers in the data. In this article, we will discuss how to use these libraries for EDA and highlight some effective techniques.

1. Introduction to Seaborn and Matplotlib

Matplotlib is the foundational plotting library in Python, offering comprehensive control over the creation of static, animated, and interactive plots. However, its syntax can sometimes be a bit verbose, especially when working with complex visualizations.

On the other hand, Seaborn is built on top of Matplotlib and provides a high-level interface for creating attractive and informative statistical graphics. It simplifies the process of creating complex visualizations and comes with default styles that make plots look professional right away.

Both libraries suit EDA well: a few lines of code are enough to plot a variable, spot something unexpected, and move on to the next question.

2. Getting Started with Seaborn and Matplotlib

Before we dive into how to use these libraries for EDA, let’s first import them and load a sample dataset to work with.

python
import seaborn as sns
import matplotlib.pyplot as plt

# Load a built-in dataset from Seaborn
data = sns.load_dataset('tips')

In this example, we’ve loaded the tips dataset, which contains information about restaurant bills, tips, day of the week, and time (lunch/dinner). This dataset is great for illustrating basic EDA techniques.
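
Before plotting, it can help to check what the loaded frame actually contains. The sketch below is one possible way to do that with pandas. The `overview` helper and the handful of made-up rows are illustrative assumptions, not part of Seaborn or the tips dataset.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Made-up rows that only mimic the column layout of the tips dataset.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 48.27],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, None],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "day": ["Sun", "Sun", "Sun", "Sat", "Sat", "Thur"],
})

summary = overview(sample)
print(summary)
print(sample.describe())  # count, mean, std, and quartiles of the numeric columns
```

Calling `overview(data)` on the frame loaded above would summarize the real dataset in the same way.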

3. Visualizing Univariate Distributions

The first step in EDA is to understand the distribution of individual variables. You can use both Matplotlib and Seaborn for this.

Using Seaborn for Univariate Distribution:

python
# Visualize the distribution of the 'total_bill' column
sns.histplot(data['total_bill'], kde=True, color='blue')
plt.title('Distribution of Total Bill')
plt.show()

Here, sns.histplot() creates a histogram and adds a Kernel Density Estimate (KDE) curve. The KDE provides a smoother estimate of the distribution.

Using Matplotlib for Univariate Distribution:

python
# Visualize the distribution of the 'total_bill' column using Matplotlib
plt.hist(data['total_bill'], bins=20, color='blue', alpha=0.7)
plt.title('Distribution of Total Bill')
plt.xlabel('Total Bill')
plt.ylabel('Frequency')
plt.show()

Matplotlib requires a bit more setup, but it offers great flexibility in customization. Both methods give us insights into the distribution of the total_bill column.

4. Visualizing Bivariate Relationships

Once you understand the individual distributions, it’s time to explore relationships between two variables. Both libraries offer powerful tools for this purpose.

Using Seaborn for Bivariate Analysis:

python
# Scatterplot between total_bill and tip
sns.scatterplot(data=data, x='total_bill', y='tip', hue='sex', palette='coolwarm')
plt.title('Total Bill vs. Tip')
plt.show()

Seaborn’s scatterplot() function is well suited to visualizing the relationship between two continuous variables. The hue parameter colors the points by a third variable (in this case, sex), so you can see whether the relationship differs between groups.

Using Matplotlib for Bivariate Analysis:

python
# Scatterplot using Matplotlib
plt.scatter(data['total_bill'], data['tip'], c=data['sex'].map({'Male': 'blue', 'Female': 'red'}), alpha=0.7)
plt.title('Total Bill vs. Tip')
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.show()

Matplotlib’s scatter() has no hue parameter, so you map each category to a color yourself, as the c argument does here. That takes more code than Seaborn but gives you full control over which color each category gets.

5. Pairwise Relationships and Correlation

One of the key tasks in EDA is understanding how multiple variables interact with each other. For this, Seaborn provides pairplot(), which visualizes pairwise relationships between all numeric variables in the dataset.

Pairplot with Seaborn:

python
# Pairwise relationships between variables
sns.pairplot(data, hue='sex', diag_kind='kde', palette='coolwarm')
plt.show()

This generates a scatter plot for each pair of numeric variables, with KDE plots on the diagonal showing the distribution of each variable. The hue parameter colors the points by sex, which makes it easy to see whether the relationships differ between groups.

Correlation Heatmap with Seaborn:

To understand the correlation between variables, you can visualize the correlation matrix using a heatmap.

python
# Correlation heatmap (tips has categorical columns, so keep only the numeric ones)
corr = data.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

This heatmap shows the pairwise correlations between all numeric variables. The annot=True parameter annotates the cells with the correlation values, while fmt='.2f' formats them to two decimal places.
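
Because a correlation matrix is symmetric, the upper triangle repeats the lower one. One common refinement is to hide it with a boolean mask. The sketch below assumes synthetic columns named after the numeric columns of the tips dataset; the values are generated, not real.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic numeric columns standing in for the numeric part of the dataset.
rng = np.random.default_rng(0)
total_bill = rng.gamma(4.0, 5.0, size=100)
frame = pd.DataFrame({
    "total_bill": total_bill,
    "tip": 0.15 * total_bill + rng.normal(0, 1, size=100),
    "size": rng.integers(1, 7, size=100),
})

corr = frame.corr()

# True marks the cells strictly above the diagonal; heatmap() leaves those blank.
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
            cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation Heatmap (lower triangle)")
```

Fixing `vmin=-1` and `vmax=1` pins the color scale to the full correlation range, so the same color means the same strength across different datasets.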

6. Categorical Data Visualizations

In addition to continuous data, EDA often involves categorical data. Seaborn provides several plotting functions for categorical data, including countplot(), boxplot(), and violinplot().

Countplot with Seaborn:

python
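# Count plot for the 'day' column
# (seaborn 0.13+ expects hue whenever palette is set; on older versions, drop hue and legend)
sns.countplot(data=data, x='day', hue='day', palette='pastel', legend=False)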
plt.title('Count of Tips by Day')
plt.show()

The countplot() function draws one bar per category, with the bar height showing how many rows fall into it (here, the four days covered by the dataset: Thur, Fri, Sat, and Sun). It’s helpful for spotting imbalanced categories.
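
The numbers behind a count plot can also be read directly with pandas. This is a small sketch on a hand-written list of days, used here only as a stand-in for `data['day']`.

```python
import pandas as pd

# Hand-written values standing in for data['day'].
days = pd.Series(["Sat", "Sun", "Sat", "Thur", "Fri", "Sat", "Sun"], name="day")

# Absolute counts per category, largest first (what the bar heights show).
counts = days.value_counts()

# The same counts as a share of all rows.
shares = days.value_counts(normalize=True).round(2)

print(counts)
print(shares)
```

Shares are often easier to compare than raw counts when two datasets differ in size.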

Boxplot with Seaborn:

python
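# Boxplot for total_bill by time of day
# (seaborn 0.13+ expects hue whenever palette is set; on older versions, drop hue and legend)
sns.boxplot(data=data, x='time', y='total_bill', hue='time', palette='coolwarm', legend=False)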
plt.title('Total Bill by Time')
plt.show()

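A boxplot summarizes the distribution of a numeric variable within each category. The box spans the first to third quartile with a line at the median, and points beyond the whiskers (by default, more than 1.5 times the interquartile range from the box) are drawn individually as potential outliers.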
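
The points a boxplot draws beyond its whiskers follow the 1.5 × IQR rule, which can be reproduced in a few lines of pandas. The `iqr_outliers` helper and the sample values below are hypothetical and serve only as an illustration.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the values lying more than `whis` * IQR outside the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]


# Hand-written values standing in for data['total_bill'].
bills = pd.Series([10.0, 12.0, 14.0, 15.0, 16.0, 18.0, 20.0, 50.0])

outliers = iqr_outliers(bills)
print(outliers.tolist())  # [50.0]: Q1 = 13.5, Q3 = 18.5, so the fences are 6.0 and 26.0
```

Listing the flagged rows this way makes it easier to decide whether they are data-entry errors or genuine extreme values.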

Violinplot with Seaborn:

python
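# Violin plot for total_bill by time of day
# (seaborn 0.13+ expects hue whenever palette is set; on older versions, drop hue and legend)
sns.violinplot(data=data, x='time', y='total_bill', hue='time', palette='coolwarm', legend=False)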
plt.title('Total Bill Distribution by Time')
plt.show()

A violin plot combines a boxplot with a KDE: the width of each violin shows the estimated density of total_bill at that value, while the inner box marks the median and quartiles.

7. Customizing Plots for Better Understanding

Seaborn and Matplotlib offer many customization options that can help make your EDA process more insightful.

Adding Titles, Labels, and Legends:

Both libraries provide methods for adding titles, axis labels, and legends to your plots.

python
# Customizing the scatter plot with Matplotlib
# One scatter call per group, so each group gets its own legend entry
for sex, color in {'Male': 'blue', 'Female': 'red'}.items():
    subset = data[data['sex'] == sex]
    plt.scatter(subset['total_bill'], subset['tip'], color=color, alpha=0.7, label=sex)
plt.title('Total Bill vs. Tip', fontsize=14)
plt.xlabel('Total Bill', fontsize=12)
plt.ylabel('Tip', fontsize=12)
plt.legend(title='Sex', loc='upper left')
plt.show()

Modifying Axis Scales:

Sometimes you need to change an axis scale. A logarithmic scale spreads out small values and compresses large ones, which helps when a variable is right-skewed or spans several orders of magnitude.

python
# Applying a logarithmic scale to the total_bill axis
plt.scatter(data['total_bill'], data['tip'], alpha=0.7)
plt.xscale('log')
plt.title('Total Bill (Log Scale) vs. Tip')
plt.xlabel('Total Bill (Log Scale)')
plt.ylabel('Tip')
plt.show()
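
To judge whether a log scale actually helps, the same scatter plot can be drawn with a linear and a logarithmic x-axis next to each other. This is a rough sketch on synthetic, strictly positive values (a log axis cannot show zero or negative numbers); the variable names are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, strictly positive values standing in for total_bill and tip.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=3.0, sigma=0.8, size=200)
y = 0.15 * x + rng.normal(0, 0.5, size=200)

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

ax_lin.scatter(x, y, alpha=0.7)
ax_lin.set_title("Linear x-axis")

ax_log.scatter(x, y, alpha=0.7)
ax_log.set_xscale("log")  # only this panel is rescaled
ax_log.set_title("Log x-axis")

for ax in (ax_lin, ax_log):
    ax.set_xlabel("Total Bill")
ax_lin.set_ylabel("Tip")

fig.tight_layout()
```

Setting the scale on the Axes object, rather than through `plt.xscale`, changes only that panel. Call `plt.show()` to display the figure.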

8. Conclusion

Using Seaborn and Matplotlib for EDA is an effective way to understand and visualize the key characteristics of your dataset. While Seaborn simplifies the process and provides aesthetically pleasing visuals, Matplotlib offers granular control and flexibility. By combining both libraries, you can quickly uncover insights, detect outliers, explore relationships, and ultimately prepare your data for further analysis or modeling.

In practice, EDA is iterative: you plot, transform, and re-examine the data repeatedly, and each pass refines your understanding of the dataset.
