How to Use EDA to Study the Relationship Between Air Quality and Public Health

Exploratory Data Analysis (EDA) is a critical step in understanding complex relationships in data, such as the link between air quality and public health. By applying EDA techniques, researchers can uncover patterns, detect anomalies, and form hypotheses about how air pollutants impact health outcomes. This article walks through that process step by step, from collecting data to reporting findings.

Understanding the Context: Air Quality and Public Health

Air quality affects respiratory health, cardiovascular conditions, and overall well-being. Common pollutants like PM2.5, PM10, nitrogen dioxide (NO₂), sulfur dioxide (SO₂), ozone (O₃), and carbon monoxide (CO) have been linked to increased hospital admissions, chronic diseases, and mortality rates. Studying this relationship requires integrating environmental data with public health records and applying thorough data analysis.

Step 1: Collecting Relevant Data

To perform EDA, the first step is gathering quality data from reliable sources:

  • Air Quality Data: Obtain pollutant concentrations from government environmental agencies, air monitoring stations, or satellite data. This data typically includes daily or hourly readings of PM2.5, PM10, NO₂, SO₂, O₃, and CO.

  • Public Health Data: Collect health outcome data such as hospital admissions, emergency visits, disease incidence rates (e.g., asthma, COPD), and mortality records. Data may be sourced from hospitals, public health departments, or health surveys.

  • Demographic and Geographic Data: Incorporate population density, age distribution, socioeconomic status, and geographical location so that potential confounding variables can be accounted for when comparing areas or groups.

Step 2: Data Cleaning and Preparation

Data from different sources often come with inconsistencies and missing values. Clean and prepare your data by:

  • Handling missing values through imputation or removal.

  • Standardizing pollutant measurements and health outcome formats.

  • Aligning data by date and location to enable meaningful comparisons.

  • Checking whether extreme values are measurement errors or true phenomena before deciding to remove them.
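
The cleaning steps above can be sketched with pandas. This is a minimal illustration, not a prescribed pipeline: the data are synthetic, and the column names (`timestamp`, `pm25`, `date`, `asthma_admissions`), the hourly and daily resolutions, and the two-hour interpolation limit are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic hourly PM2.5 readings for one station (illustrative values only).
rng = np.random.default_rng(0)
hours = pd.date_range("2023-01-01", periods=24 * 7, freq="h")
air = pd.DataFrame({"timestamp": hours, "pm25": rng.gamma(4.0, 5.0, size=len(hours))})
air.loc[[5, 6, 50], "pm25"] = np.nan  # simulate short sensor gaps

# Synthetic daily asthma admissions for the same area.
days = pd.date_range("2023-01-01", periods=7, freq="D")
health = pd.DataFrame({"date": days, "asthma_admissions": rng.poisson(12, size=len(days))})

# Impute short gaps (at most 2 consecutive hours) by time-based interpolation.
air = air.set_index("timestamp")
air["pm25"] = air["pm25"].interpolate(method="time", limit=2)

# Standardize the resolution: aggregate hourly readings to daily means.
daily_air = air["pm25"].resample("D").mean().rename("pm25_mean").reset_index()
daily_air = daily_air.rename(columns={"timestamp": "date"})

# Align by date; an inner join keeps only days present in both sources.
merged = daily_air.merge(health, on="date", how="inner")
print(merged)
```

With several monitoring stations, the merge key would typically include a location column as well as the date.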

Step 3: Initial Data Exploration

Begin EDA with descriptive statistics to get a sense of the dataset’s properties:

  • Calculate mean, median, range, and standard deviation for air pollutant concentrations.

  • Summarize health outcome frequencies and distributions.

  • Use boxplots and histograms to visualize the spread and skewness of key variables.
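
As a rough sketch of these summaries, the snippet below computes the listed statistics plus skewness for a synthetic dataset. The column names and the distributions used to generate the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic daily data (illustrative only): two pollutants and one health outcome.
rng = np.random.default_rng(1)
n = 365
df = pd.DataFrame({
    "pm25": rng.gamma(3.0, 6.0, size=n),
    "no2": rng.gamma(5.0, 4.0, size=n),
    "asthma_admissions": rng.poisson(10, size=n),
})

# One row per variable: central tendency, spread, and extremes.
summary = df.agg(["mean", "median", "std", "min", "max"]).T
summary["range"] = summary["max"] - summary["min"]

# Skewness hints at whether a histogram will show a long right tail,
# which is common for pollutant concentrations.
summary["skew"] = df.skew()
print(summary.round(2))
```

A mean well above the median, together with positive skew, suggests that a log scale or rank-based methods may be worth considering in later steps.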

Step 4: Visualizing Relationships

Visual tools help reveal patterns between air quality and health:

  • Scatter Plots: Plot pollutant levels against health metrics (e.g., PM2.5 vs. asthma hospitalizations) to identify potential linear or nonlinear relationships.

  • Time Series Plots: Examine trends and seasonality in pollutant concentrations alongside health outcomes over time.

  • Heatmaps: Visualize correlations between multiple pollutants and different health indicators simultaneously.

  • Geospatial Maps: Map pollutant concentrations and health outcome incidence geographically to detect hotspots.
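
Two of the plots listed above, a scatter plot and a time series plot, might be drawn with matplotlib as follows. The data are synthetic, and the column names, the 7-day smoothing window, and the output file name `air_quality_eda.png` are assumptions chosen for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to view with plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily data (illustrative only).
rng = np.random.default_rng(6)
n = 180
dates = pd.date_range("2023-01-01", periods=n, freq="D")
pm25 = rng.gamma(3.0, 6.0, size=n)
df = pd.DataFrame(
    {"pm25": pm25, "asthma_admissions": rng.poisson(6 + 0.2 * pm25)},
    index=dates,
)

fig, (ax_scatter, ax_ts) = plt.subplots(1, 2, figsize=(11, 4))

# Scatter plot: pollutant level against the health metric.
ax_scatter.scatter(df["pm25"], df["asthma_admissions"], alpha=0.5)
ax_scatter.set_xlabel("PM2.5 (µg/m³)")
ax_scatter.set_ylabel("Asthma admissions per day")
ax_scatter.set_title("Pollutant vs. health outcome")

# Time series: 7-day rolling means on two y-axes, since the units differ.
weekly = df.rolling(7).mean()
ax_ts.plot(weekly.index, weekly["pm25"], color="tab:blue")
ax_ts.set_ylabel("PM2.5 (µg/m³)", color="tab:blue")
ax_adm = ax_ts.twinx()
ax_adm.plot(weekly.index, weekly["asthma_admissions"], color="tab:red")
ax_adm.set_ylabel("Admissions", color="tab:red")
ax_ts.set_title("7-day rolling means")

fig.tight_layout()
fig.savefig("air_quality_eda.png", dpi=120)
```

Heatmaps and geospatial maps follow the same pattern but usually rely on additional libraries, which are left out here to keep the example small.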

Step 5: Correlation Analysis

Quantify the strength and direction of relationships using correlation coefficients:

  • Pearson correlation measures linear association for continuous variables.

  • Spearman or Kendall rank correlations are useful for monotonic but non-linear relationships and for ordinal data.

  • Interpret correlations cautiously, considering lag effects (e.g., air pollution exposure may affect health with a delay).
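
The difference between Pearson and Spearman coefficients can be seen on a small synthetic example. The data below are generated so that admissions rise with the square of PM2.5; that relationship and the column names are assumptions for illustration, not findings.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustrative only): admissions grow non-linearly with PM2.5.
rng = np.random.default_rng(2)
n = 200
pm25 = rng.gamma(3.0, 6.0, size=n)
admissions = rng.poisson(5 + 0.02 * pm25**2)
df = pd.DataFrame({"pm25": pm25, "asthma_admissions": admissions})

# Pearson captures linear association; Spearman works on ranks, so it
# captures any monotonic relationship.
pearson = df.corr(method="pearson").loc["pm25", "asthma_admissions"]
spearman = df.corr(method="spearman").loc["pm25", "asthma_admissions"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Kendall's tau is available through `method="kendall"`; in pandas this may require SciPy to be installed.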

Step 6: Investigating Lag Effects and Temporal Patterns

Health impacts may not be immediate. Analyze time-lagged relationships by:

  • Shifting pollutant data by days or weeks so that each health observation is paired with pollutant readings from an earlier date.

  • Using cross-correlation functions to find optimal lag periods.

  • Plotting lagged scatter plots or computing lagged correlations to understand delayed effects.
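
A lagged correlation scan can be sketched as below. The series are synthetic and built so that admissions respond to PM2.5 from two days earlier; the two-day delay, the 0 to 7 day search window, and all names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (illustrative only).
rng = np.random.default_rng(3)
n = 400
dates = pd.date_range("2022-01-01", periods=n, freq="D")
pm25 = pd.Series(rng.gamma(3.0, 6.0, size=n), index=dates)

# By construction, admissions depend on PM2.5 from 2 days earlier.
true_lag = 2
admissions = 8 + 0.3 * pm25.shift(true_lag) + rng.normal(0, 2, size=n)

# pm25.shift(lag) moves each reading `lag` days later in time, so the value
# on day t is the exposure from day t - lag. Correlate it with day-t health.
lagged_corr = {lag: pm25.shift(lag).corr(admissions) for lag in range(0, 8)}
best_lag = max(lagged_corr, key=lagged_corr.get)

for lag, r in lagged_corr.items():
    print(f"lag {lag} days: r = {r:+.3f}")
print("Strongest correlation at lag:", best_lag)
```

On real data, pollutant series are usually autocorrelated, so neighboring lags tend to show similar correlations and the peak is less sharp than in this example.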

Step 7: Grouped and Stratified Analysis

Break down data into subgroups to uncover more detailed insights:

  • Stratify by age groups, gender, or socioeconomic status to see if effects vary.

  • Compare urban vs. rural areas for differences in pollutant impact.

  • Analyze seasonal variation by comparing data from different months or seasons.
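
A stratified comparison might look like the following sketch. The data are synthetic, and the group labels, the per-group slopes, and the `admission_rate` column are hypothetical values invented to show the mechanics of grouping.

```python
import numpy as np
import pandas as pd

# Synthetic records (illustrative only).
rng = np.random.default_rng(4)
n = 600
df = pd.DataFrame({
    "pm25": rng.gamma(3.0, 6.0, size=n),
    "age_group": rng.choice(["0-17", "18-64", "65+"], size=n),
    "area": rng.choice(["urban", "rural"], size=n),
})

# Hypothetical effect sizes: the 65+ group is given the strongest response.
slope = df["age_group"].map({"0-17": 0.2, "18-64": 0.1, "65+": 0.4})
df["admission_rate"] = 5 + slope * df["pm25"] + rng.normal(0, 2, size=n)

# Correlation between PM2.5 and the outcome within each age group.
by_group = (
    df.groupby("age_group")[["pm25", "admission_rate"]]
    .apply(lambda g: g["pm25"].corr(g["admission_rate"]))
    .rename("pearson_r")
)
print(by_group.round(3))

# Mean outcome and sample size for each area and age-group combination.
table = df.groupby(["area", "age_group"])["admission_rate"].agg(["mean", "count"])
print(table.round(2))
```

Reporting the count next to each statistic helps avoid over-reading differences that come from small subgroups.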

Step 8: Identifying Outliers and Anomalies

Outliers can indicate data errors or unusual events like pollution spikes:

  • Use boxplots or z-scores to detect extreme pollutant readings.

  • Investigate health outcome anomalies for potential outbreaks or reporting issues.

  • Decide whether to exclude outliers or analyze them separately.
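
Both detection rules mentioned above can be sketched in a few lines. The readings are synthetic with two injected spikes; the threshold of 3 standard deviations and the 1.5 × IQR fences are common conventions assumed here, not requirements.

```python
import numpy as np
import pandas as pd

# Synthetic daily PM2.5 readings with two injected spikes (illustrative only).
rng = np.random.default_rng(5)
pm25 = pd.Series(np.clip(rng.normal(20, 4, size=300), 0, None))
pm25.iloc[[40, 180]] = [95.0, 120.0]

# Z-score rule: flag readings more than 3 standard deviations from the mean.
z = (pm25 - pm25.mean()) / pm25.std()
z_flags = z.abs() > 3

# IQR rule (the boxplot whisker rule): flag readings beyond 1.5 * IQR
# from the first or third quartile.
q1, q3 = pm25.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_flags = (pm25 < q1 - 1.5 * iqr) | (pm25 > q3 + 1.5 * iqr)

print("Flagged by z-score:", list(pm25.index[z_flags]))
print("Flagged by IQR:    ", list(pm25.index[iqr_flags]))
```

Flagged readings are candidates for review rather than automatic removal, since a real pollution episode would look the same as a sensor fault in this check.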

Step 9: Forming Hypotheses and Next Steps

EDA does not confirm causation but helps generate hypotheses:

  • Identify which pollutants most strongly correlate with specific health outcomes.

  • Determine if certain population groups are more vulnerable.

  • Suggest time periods or locations for targeted intervention or further study.

Step 10: Reporting and Visual Communication

Present your findings clearly using:

  • Dashboards that update with new data.

  • Interactive visualizations to explore relationships dynamically.

  • Clear summary tables and concise narratives emphasizing key insights.


Using EDA to study air quality and public health provides a foundation for deeper statistical modeling and policy formulation. It allows researchers and policymakers to prioritize pollutants and vulnerable groups, ultimately guiding efforts to improve air quality and protect public health.
