How to Apply EDA for Studying the Relationship Between Public Health and Environmental Factors

Exploratory Data Analysis (EDA) is a critical process in understanding how public health outcomes are influenced by environmental factors. By applying EDA techniques, researchers and analysts can uncover patterns, anomalies, and relationships that inform public health strategies and policy-making. This article outlines how to apply EDA effectively in studying the relationship between public health and environmental factors, covering data sourcing, preprocessing, visualization, and interpretation.

Understanding the Scope

The goal of EDA in this context is to identify trends, correlations, and potential causal links between environmental variables (such as air quality, water pollution, temperature, and green space availability) and public health indicators (such as respiratory illness rates, mortality rates, or hospital admissions).

Step 1: Data Collection

1.1 Sourcing Public Health Data

Reliable sources of public health data include:

World Health Organization (WHO)
Centers for Disease Control and Prevention (CDC)
Local government health departments
Hospital records and health surveys

Health metrics might include:

Disease prevalence (e.g., asthma, cardiovascular diseases)
Mortality and morbidity rates
Life expectancy
Hospital admission records
Immunization rates

1.2 Sourcing Environmental Data

Key sources of environmental data are:

Environmental Protection Agency (EPA)
NASA Earth Observatory
World Bank Open Data
NOAA for climate data

Environmental variables to collect:

Air Quality Index (AQI)
Water quality indicators (e.g., lead concentration)
Temperature and humidity
Urban green space data
Noise levels
PM2.5 and PM10 levels

Step 2: Data Integration and Cleaning

2.1 Temporal and Spatial Alignment

Combine datasets by aligning temporal (time-based) and spatial (location-based) dimensions. This may involve:

Aggregating data to the same time unit (e.g., monthly averages)
Mapping health data to geographic units (e.g., zip codes, counties)
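
The alignment described above can be sketched with pandas. This is a minimal illustration on synthetic data, not a prescribed pipeline: the column names (county, month, pm25, asthma_admissions) and all values are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical daily PM2.5 readings for two counties (synthetic values).
days = pd.date_range("2023-01-01", "2023-03-31", freq="D")
rng = np.random.default_rng(0)
air = pd.DataFrame({
    "date": np.tile(days, 2),
    "county": np.repeat(["A", "B"], len(days)),
    "pm25": rng.normal(12, 3, size=2 * len(days)),
})

# Hypothetical monthly asthma admissions, keyed by county and "YYYY-MM" month.
health = pd.DataFrame({
    "month": ["2023-01", "2023-02", "2023-03"] * 2,
    "county": ["A"] * 3 + ["B"] * 3,
    "asthma_admissions": [40, 35, 30, 55, 50, 48],
})

# Temporal alignment: collapse daily readings to monthly means per county.
air["month"] = air["date"].dt.strftime("%Y-%m")
air_monthly = air.groupby(["county", "month"], as_index=False)["pm25"].mean()

# Spatial and temporal join on the shared keys. validate="one_to_one" raises
# an error if either table has duplicate (county, month) rows.
merged = health.merge(
    air_monthly, on=["county", "month"], how="inner", validate="one_to_one"
)
print(merged)
```

An inner join silently drops county-months that appear in only one table, so comparing the row counts before and after the merge is a quick check on how much data the alignment discards.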

2.2 Handling Missing and Inconsistent Data

EDA requires clean datasets. Address the following:

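Missing values: Impute them (e.g., with the median, or by interpolation for time series) or drop the affected rows when they are few.
Outliers: Identify through box plots or z-scores and decide whether to keep or exclude them; extreme pollution readings may be real events rather than errors.
Scaling: Standardize (z-score) or min-max normalize variables measured in different units so they can be compared (e.g., pollution levels and temperature).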
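
A minimal sketch of these three cleaning steps, assuming a small synthetic table with hypothetical columns pm25 and asthma_rate. Median imputation and the 1.5 x IQR rule (the rule box plots use for their whiskers) are only one reasonable set of choices.

```python
import numpy as np
import pandas as pd

# Synthetic records; the NaN values and the 250.0 reading are planted on purpose.
df = pd.DataFrame({
    "pm25": [10.0, 12.0, np.nan, 11.0, 13.0, 250.0, 9.0, 12.5],
    "asthma_rate": [4.1, 4.5, 4.3, np.nan, 4.8, 5.0, 3.9, 4.6],
})

# Missing values: fill each column with its own median.
imputed = df.fillna(df.median(numeric_only=True))

# Outliers: flag PM2.5 values outside 1.5 * IQR of the quartiles.
q1, q3 = imputed["pm25"].quantile([0.25, 0.75])
iqr = q3 - q1
imputed["pm25_outlier"] = ~imputed["pm25"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Scaling: z-scores computed on the rows that were not flagged.
clean = imputed.loc[~imputed["pm25_outlier"], ["pm25", "asthma_rate"]]
zscores = (clean - clean.mean()) / clean.std()

print(imputed)
print(zscores.round(2))
```

Flagging outliers in a separate column, rather than deleting them, keeps the decision reversible while the analysis is still exploratory.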

Step 3: Univariate and Bivariate Analysis

3.1 Univariate Analysis

Analyze each variable independently to understand distributions and summary statistics.

Histograms and density plots for continuous variables (e.g., temperature)
Bar plots for categorical variables (e.g., type of illness)
Summary statistics: Mean, median, mode, variance
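
As a rough sketch, the numbers behind these plots can be computed directly with pandas and NumPy. The data below is synthetic and the column names temperature_c and illness_type are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "temperature_c": rng.normal(18, 6, size=200),  # synthetic continuous variable
    "illness_type": rng.choice(["asthma", "copd", "bronchitis"], size=200),
})

# Continuous variable: summary statistics, plus skewness as a shape check.
summary = df["temperature_c"].agg(["mean", "median", "var", "skew"])

# Histogram counts without plotting; df["temperature_c"].plot.hist() would
# draw the same bins if matplotlib is installed.
counts, bin_edges = np.histogram(df["temperature_c"], bins=10)

# Categorical variable: frequency table (the data behind a bar plot) and mode.
freq = df["illness_type"].value_counts()
mode = df["illness_type"].mode().iloc[0]

print(summary.round(2))
print(freq)
```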

3.2 Bivariate Analysis

Explore the relationship between pairs of variables:

Scatter plots to observe correlations (e.g., between AQI and asthma incidence)
Box plots to compare distributions (e.g., disease rate across pollution categories)
Correlation matrix to detect linear relationships across variables
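
The sketch below illustrates the correlation matrix and the grouped comparison on synthetic data in which asthma incidence is constructed to rise with AQI. The variable names and the AQI band cut-offs (50, 100, 200) are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 120
aqi = rng.uniform(20, 180, size=n)
df = pd.DataFrame({
    "aqi": aqi,
    # Synthetic outcome built to increase with AQI, plus noise.
    "asthma_incidence": 2.0 + 0.03 * aqi + rng.normal(0, 0.8, size=n),
    "humidity": rng.uniform(30, 90, size=n),
})

# Pearson captures linear association; Spearman is rank-based and
# less sensitive to outliers and monotone non-linear relationships.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Box-plot style comparison: bin AQI into categories, then compare
# the distribution of the outcome within each category.
df["aqi_band"] = pd.cut(
    df["aqi"], bins=[0, 50, 100, 200], labels=["good", "moderate", "unhealthy"]
)
by_band = df.groupby("aqi_band", observed=True)["asthma_incidence"].describe()

print(pearson.round(2))
print(by_band.round(2))
```

Comparing the Pearson and Spearman matrices side by side is a cheap way to spot pairs where a few extreme readings drive the linear correlation.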

Step 4: Geospatial Visualization

Mapping health and environmental data provides context-sensitive insights.

Choropleth maps: Visualize variable intensity across regions (e.g., disease rates by county)
Heatmaps: Show density and clustering (e.g., pollution hotspots)
Overlay maps: Combine multiple layers (e.g., pollution levels with hospital locations)

Use GIS tools such as QGIS, or Python libraries like Folium and GeoPandas, to create these maps.

Step 5: Time Series Analysis

Studying changes over time is critical, especially when evaluating the impact of environmental policies or seasonal effects.

Line plots: Track variable trends over time (e.g., monthly AQI vs. respiratory hospital admissions)
Rolling averages: Smooth fluctuations to identify trends
Lag analysis: Examine delayed effects (e.g., pollution in one month affecting hospital admissions the next)
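
Rolling averages and lag analysis can be sketched as follows. The series is synthetic: admissions are deliberately constructed to follow the previous month's PM2.5, so the one-month lag found here is a property of the example and not an empirical result.

```python
import numpy as np
import pandas as pd

months = pd.date_range("2019-01-01", periods=60, freq="MS")
month_num = months.month.to_numpy()
rng = np.random.default_rng(3)

# Synthetic PM2.5 with a winter peak (January highest).
pm25 = 12 + 5 * np.cos(2 * np.pi * (month_num - 1) / 12) + rng.normal(0, 1, 60)
ts = pd.DataFrame({"pm25": pm25}, index=months)

# Admissions built to respond to the previous month's PM2.5.
ts["admissions"] = 20 + 2.0 * ts["pm25"].shift(1) + rng.normal(0, 1.5, 60)

# Rolling average: a 3-month window smooths month-to-month noise.
ts["pm25_roll3"] = ts["pm25"].rolling(window=3).mean()

# Lag analysis: correlate admissions with PM2.5 shifted back 0 to 3 months.
lag_corr = pd.Series(
    {lag: ts["admissions"].corr(ts["pm25"].shift(lag)) for lag in range(4)},
    name="corr_with_lagged_pm25",
)
best_lag = lag_corr.idxmax()

print(lag_corr.round(2))
print("strongest lag (months):", best_lag)
```

Strongly seasonal series correlate with themselves at many lags, so a high lagged correlation in real data may reflect shared seasonality rather than a delayed effect.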

Step 6: Multivariate Analysis

To understand complex relationships:

Multiple regression: Assess the impact of several environmental factors on a health outcome at once
Principal Component Analysis (PCA): Reduce dimensionality for easier visualization
Cluster analysis: Group similar regions or time periods based on multiple factors

This step helps to build predictive models and identify high-risk regions or vulnerable populations.
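
A minimal sketch of the first two techniques using only NumPy: ordinary least squares for the regression and a singular value decomposition for PCA. The predictors, the outcome, and the coefficients used to generate it are all synthetic assumptions, and cluster analysis is not shown.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
pm25 = rng.normal(12, 4, n)
temperature = rng.normal(18, 6, n)
green_space = rng.uniform(0, 40, n)  # percent of area, synthetic

# Synthetic outcome with known coefficients so the fit can be checked:
# admissions depend on PM2.5 and green space, but not on temperature.
admissions = 10 + 1.5 * pm25 - 0.2 * green_space + rng.normal(0, 2, n)

# Multiple regression via ordinary least squares (intercept column first).
X = np.column_stack([np.ones(n), pm25, temperature, green_space])
coef, *_ = np.linalg.lstsq(X, admissions, rcond=None)

# PCA on the standardized predictors via SVD.
Z = X[:, 1:]
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
_, s, vt = np.linalg.svd(Z, full_matrices=False)
explained_ratio = s**2 / np.sum(s**2)   # share of variance per component
scores = Z @ vt.T[:, :2]                # data projected on the first two components

print("intercept, pm25, temperature, green_space:", coef.round(2))
print("explained variance ratio:", explained_ratio.round(2))
```

In practice a statistics library would also report standard errors and confidence intervals for the coefficients, which this bare least-squares fit does not provide.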

Step 7: Hypothesis Testing

Use statistical tests to validate observed relationships:

T-tests: Compare means between two groups (e.g., urban vs. rural disease rates)
Chi-square tests: Assess relationships between categorical variables (e.g., illness type and location)
ANOVA: Compare means across multiple groups (e.g., comparing disease rates across pollution levels)

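These tests indicate whether observed patterns are unlikely to have arisen by chance alone. When many tests are run during exploration, correct for multiple comparisons, and remember that statistical significance does not by itself imply a large or causal effect.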
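
The three tests can be sketched with scipy.stats, assuming SciPy is installed. All groups and the contingency table below are synthetic, and Welch's variant of the t-test is used here as a choice that does not assume equal variances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# T-test: synthetic disease rates for urban and rural areas.
urban = rng.normal(6.0, 1.0, 80)
rural = rng.normal(5.0, 1.0, 80)
t_stat, t_p = stats.ttest_ind(urban, rural, equal_var=False)  # Welch's t-test

# Chi-square test of independence on a contingency table of counts:
# rows are locations, columns are illness types (made-up counts).
table = np.array([[30, 20, 10],
                  [20, 25, 15]])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

# One-way ANOVA: disease rates across three pollution levels.
low = rng.normal(4.5, 1.0, 50)
medium = rng.normal(5.0, 1.0, 50)
high = rng.normal(6.0, 1.0, 50)
f_stat, anova_p = stats.f_oneway(low, medium, high)

print(f"t-test p-value:     {t_p:.4g}")
print(f"chi-square p-value: {chi_p:.4g} (dof={dof})")
print(f"ANOVA p-value:      {anova_p:.4g}")
```

A significant ANOVA result only says that at least one group mean differs; a post-hoc comparison is needed to find out which.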

Step 8: Identifying Causal Relationships

EDA is primarily exploratory, but it sets the stage for causal inference.

Temporal precedence: Check that the exposure precedes the outcome in time-based data (e.g., pollution peaks followed by illness spikes)
Natural experiments: Analyze the impact of events like policy changes or environmental disasters
Instrumental variables: Use external variables to isolate causality (e.g., wind patterns affecting pollution exposure)

While EDA cannot prove causality, it highlights candidate relationships for more rigorous analysis using statistical modeling and causal inference methods.

Step 9: Communicating Insights

Effective communication is key to translating EDA into actionable public health strategies.

Dashboards: Use interactive tools like Tableau, Power BI, or Dash to present findings
Infographics and reports: Summarize findings with visuals and clear narratives
Storytelling: Frame insights around real-world implications (e.g., how reducing pollution could lower hospital admissions)

Clear communication helps policymakers and stakeholders make informed decisions.

Example Case Study: Air Pollution and Respiratory Health

Dataset:

AQI levels from EPA for a 5-year period
Monthly asthma-related hospital admissions from city health departments
Demographic data for affected areas

EDA Process:

Aligned pollution and health data monthly per city
Found a positive correlation (0.67) between PM2.5 and asthma admissions
Box plots showed higher asthma rates in cities with AQI above 100
Time series analysis revealed seasonal spikes during winter
Regression confirmed PM2.5 as a significant predictor of asthma admissions

Insight:

Reducing PM2.5 could lower asthma admissions, especially during winter months. Because the finding is correlational, it supports targeted air quality interventions as a hypothesis worth testing rather than as a proven effect.

Conclusion

Applying EDA to study the relationship between public health and environmental factors offers a powerful way to generate insights from data. By systematically collecting, cleaning, visualizing, and analyzing data, analysts can uncover critical relationships that guide effective health interventions. Though EDA is exploratory, it lays the groundwork for deeper analysis and informed policy-making, ultimately contributing to improved public health outcomes.
