Prompt Workflows to Summarize Data Processing Jobs
1. High-Level Summary of a Data Processing Job
Prompt:
“Summarize the data processing job executed on [insert system/tool: e.g., Apache Spark, AWS Glue, Dataflow]. Include the job purpose, input and output datasets, key transformations, and final results.”
Output Focus:
- Job name and ID
- Objective (e.g., aggregating user logs, ETL for a data warehouse)
- Data sources and destinations
- Main processing steps (joins, filters, aggregations)
- Execution time and outcome
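A minimal sketch of how this prompt might be filled in programmatically, assuming the job metadata is already available as a Python dict; the field values and the `call_llm` reference are hypothetical placeholders for your own scheduler metadata and LLM client.

```python
# Sketch: fill the high-level summary prompt from job metadata.
# The `job` values are hypothetical; adapt them to your scheduler's metadata.

SUMMARY_PROMPT = (
    "Summarize the data processing job executed on {system}. "
    "Include the job purpose, input and output datasets, key transformations, "
    "and final results.\n\n"
    "Job name: {name}\nJob ID: {job_id}\nInputs: {inputs}\nOutputs: {outputs}\n"
    "Transformations: {transformations}\nRuntime: {runtime}\nStatus: {status}"
)

job = {
    "system": "Apache Spark",
    "name": "daily_user_log_aggregation",
    "job_id": "spark-20240501-0042",
    "inputs": "s3://raw/user_logs/2024-05-01/",
    "outputs": "warehouse.user_activity_daily",
    "transformations": "filter bots, join with user dim, aggregate by user_id/day",
    "runtime": "18m 42s",
    "status": "SUCCEEDED",
}

prompt = SUMMARY_PROMPT.format(**job)
print(prompt)  # hand `prompt` to whichever LLM client you use, e.g. call_llm(prompt)
```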
2. Summarize Job Performance Metrics
Prompt:
“Provide a summary of the performance metrics for the data processing job [Job ID] that ran on [Date]. Include total runtime, data processed, number of stages/tasks, and any failures or retries.”
Output Focus:
- Total data processed (e.g., GBs, rows)
- Execution time
- Number of failed and retried tasks
- Bottlenecks or skewed stages (if any)
- Resource utilization (CPU, memory)
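One way to ground this prompt is to aggregate task-level records into the requested figures before pasting them into the template. The sketch below uses made-up records; the field names are illustrative, not any specific engine's schema.

```python
# Sketch: aggregate task-level records into the metrics the prompt asks for.
# Field names and values are illustrative; map them to your engine's monitoring output.

tasks = [
    {"stage": 1, "status": "SUCCESS", "duration_s": 120, "rows": 2_000_000, "attempt": 1},
    {"stage": 1, "status": "SUCCESS", "duration_s": 130, "rows": 2_100_000, "attempt": 1},
    {"stage": 2, "status": "FAILED",  "duration_s": 45,  "rows": 0,         "attempt": 1},
    {"stage": 2, "status": "SUCCESS", "duration_s": 310, "rows": 5_400_000, "attempt": 2},
]

total_rows = sum(t["rows"] for t in tasks if t["status"] == "SUCCESS")
failed     = sum(1 for t in tasks if t["status"] == "FAILED")
retried    = sum(1 for t in tasks if t["attempt"] > 1)
longest    = max(t["duration_s"] for t in tasks)  # rough skew signal; prefer scheduler timestamps if available

print(f"Rows processed: {total_rows:,}")
print(f"Failed tasks: {failed}, retried tasks: {retried}")
print(f"Longest task: {longest}s (inspect this stage for skew)")
```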
3. Summarize Job Logs for Errors and Warnings
Prompt:
“Extract and summarize all errors and warnings from the log file of the data processing job [Job ID]. Group similar messages and highlight the most frequent and critical issues.”
Output Focus:
- List of unique error/warning messages
- Frequency count
- Timestamps of occurrences
- Possible root causes
- Recommended fixes
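A minimal sketch of the grouping step, assuming plain-text logs in a conventional `timestamp LEVEL message` layout; the regex and the rule of collapsing numbers so similar messages group together are assumptions to adapt to your log format.

```python
# Sketch: group ERROR/WARN log lines by a normalized message and count them.
# The log layout and the normalization rule are assumptions; adjust the regex as needed.
import re
from collections import Counter

log_lines = [
    "2024-05-01 02:13:01 ERROR Task 17 failed: FileNotFoundException s3://raw/part-0017",
    "2024-05-01 02:13:05 ERROR Task 23 failed: FileNotFoundException s3://raw/part-0023",
    "2024-05-01 02:14:11 WARN  Shuffle spill exceeded 512 MB on executor 4",
]

pattern = re.compile(r"^\S+ \S+ (ERROR|WARN)\s+(.*)$")
counts = Counter()
for line in log_lines:
    match = pattern.match(line)
    if not match:
        continue
    level, message = match.groups()
    normalized = re.sub(r"\d+", "<N>", message)  # collapse IDs so similar messages group
    counts[(level, normalized)] += 1

for (level, message), n in counts.most_common():
    print(f"{n:>3}x [{level}] {message}")
```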
4. Summarize Data Quality Checks Post-Processing
Prompt:
“Summarize the results of the data quality checks conducted after the processing job [Job Name]. Include record counts, null value checks, duplicates, and constraint violations.”
Output Focus:
- Total records in output
- Number of nulls or missing values
- Duplicate records detected
- Validation rule results (e.g., schema conformance)
- Data anomalies and thresholds breached
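One way to produce the raw numbers this prompt needs, shown as a pandas sketch against an illustrative output table; the column names and the `amount >= 0` constraint are assumptions standing in for your own validation rules.

```python
# Sketch: compute basic data quality metrics for an output table with pandas.
# Column names and the non-negative amount constraint are illustrative assumptions.
import pandas as pd

output = pd.DataFrame({
    "order_id": [1, 2, 2, 4, 5],
    "user_id":  [10, 11, 11, None, 14],
    "amount":   [9.99, 25.00, 25.00, -3.50, 12.75],
})

report = {
    "total_records": len(output),
    "nulls_per_column": output.isna().sum().to_dict(),
    "duplicate_records": int(output.duplicated().sum()),
    "amount_constraint_violations": int((output["amount"] < 0).sum()),
}

for check, result in report.items():
    print(f"{check}: {result}")
```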
5. Summarize Data Lineage for the Job
Prompt:
“Summarize the data lineage for the processing job [Job Name], from source to final output. Include all intermediate steps, transformations, and any dependencies.”
Output Focus:
- Source datasets and systems
- Processing path and intermediate stages
- Transformations applied (filters, enrichments, etc.)
- Final output format and destination
- Upstream/downstream job dependencies
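A small sketch of how the lineage could be captured before summarization: an ordered list of (source, transformation, target) edges rendered as a readable chain. The dataset names and steps are hypothetical.

```python
# Sketch: render a job's lineage as an ordered chain of transformation steps.
# Dataset and step names are hypothetical.

lineage = [
    ("s3://raw/events/",    "filter invalid records",    "stg.events_clean"),
    ("stg.events_clean",    "join with dim.users",       "stg.events_enriched"),
    ("stg.events_enriched", "aggregate by user_id, day", "warehouse.user_activity"),
]

for source, transformation, target in lineage:
    print(f"{source}  --[{transformation}]-->  {target}")
```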
6. Summarize Job Configuration and Parameters
Prompt:
“Summarize the configuration settings and runtime parameters used in the data processing job [Job ID], including memory, parallelism, retry limits, and any custom configurations.”
Output Focus:
- Executor/memory configuration
- Number of partitions or threads
- Job retry/failure policies
- Custom config values (e.g., broadcast joins, spill thresholds)
- Environment details (e.g., cluster, region)
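If the job runs on Spark, the effective configuration can be pulled from the active session and folded into this prompt. The sketch below assumes PySpark is installed and a local session is acceptable for illustration; the keys of interest are just a sample.

```python
# Sketch: collect the effective Spark configuration for the summary prompt.
# Assumes PySpark is installed; the selected keys are an illustrative sample.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("config-summary").getOrCreate()

keys_of_interest = (
    "spark.executor.memory",
    "spark.executor.cores",
    "spark.sql.shuffle.partitions",
    "spark.task.maxFailures",
)

conf = dict(spark.sparkContext.getConf().getAll())
for key in keys_of_interest:
    print(f"{key} = {conf.get(key, '<not set>')}")

spark.stop()
```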
7. Summarize Historical Trends Across Jobs
Prompt:
“Summarize performance trends for the past 7 days for the data processing job [Job Name]. Highlight changes in duration, data volume, failure rate, and resource consumption.”
Output Focus:
- Daily runtime and data volume
- Success vs. failure rates
- Performance regressions/improvements
- Resource usage trends
- Anomalies or outliers
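A sketch of the trend aggregation that could feed this prompt, assuming a run-history table with one row per execution; the column names and values are assumptions.

```python
# Sketch: summarize several days of run history into daily trend figures.
# The run-history columns and values are illustrative assumptions.
import pandas as pd

runs = pd.DataFrame({
    "run_date":   pd.to_datetime(["2024-05-01", "2024-05-02", "2024-05-02", "2024-05-03"]),
    "duration_m": [18.4, 19.1, 22.7, 35.9],
    "rows":       [5_100_000, 5_200_000, 5_150_000, 5_400_000],
    "succeeded":  [True, False, True, True],
})

daily = runs.groupby("run_date").agg(
    avg_duration_m=("duration_m", "mean"),
    total_rows=("rows", "sum"),
    success_rate=("succeeded", "mean"),
)
print(daily)
print("Slowest day:", daily["avg_duration_m"].idxmax().date())
```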
8. Summarize Job Scalability and Efficiency
Prompt:
“Summarize the scalability performance of the job [Job Name] when running with different input sizes. Comment on efficiency in terms of throughput and cost/resource consumption.”
Output Focus:
- Input size vs. execution time
- Throughput (records/sec or GB/min)
- CPU/memory cost trends
- Ideal scaling threshold
- Observed inefficiencies
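As a rough way to quantify scaling before asking for a summary, throughput can be computed at each input size and compared against the smallest run; the measurements below are made up for illustration.

```python
# Sketch: compute throughput and relative scaling efficiency across input sizes.
# The (input GB, runtime minutes) measurements are made-up illustrations.

measurements = [(10, 4.0), (50, 18.0), (200, 95.0)]

baseline_gb, baseline_min = measurements[0]
baseline_throughput = baseline_gb / baseline_min

for input_gb, runtime_min in measurements:
    throughput = input_gb / runtime_min            # GB per minute
    efficiency = throughput / baseline_throughput  # 1.0 = scales as well as the smallest run
    print(f"{input_gb:>4} GB -> {runtime_min:5.1f} min, "
          f"{throughput:.2f} GB/min, efficiency {efficiency:.2f}")
```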
9. Summarize Multiple Jobs in a Workflow
Prompt:
“Summarize all jobs in the data pipeline executed on [Date], including dependencies, start/end times, success status, and handoffs between stages.”
Output Focus:
- List of jobs and their sequence
- Execution timelines and durations
- Inter-job dependencies and data handoffs
- Success/failure status
- Bottlenecks in the pipeline
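A sketch of how per-job run records could be ordered and checked for gaps between handoffs; the field names and timestamps are assumptions standing in for whatever your orchestrator exposes.

```python
# Sketch: order pipeline jobs by start time and report durations and handoff gaps.
# Field names and timestamps are assumptions; map them to your orchestrator's run metadata.
from datetime import datetime

jobs = [
    {"name": "extract_raw",    "start": datetime.fromisoformat("2024-05-01T01:00"),
     "end": datetime.fromisoformat("2024-05-01T01:20"), "status": "success"},
    {"name": "transform",      "start": datetime.fromisoformat("2024-05-01T01:25"),
     "end": datetime.fromisoformat("2024-05-01T02:10"), "status": "success"},
    {"name": "load_warehouse", "start": datetime.fromisoformat("2024-05-01T02:40"),
     "end": datetime.fromisoformat("2024-05-01T02:55"), "status": "failed"},
]

jobs.sort(key=lambda j: j["start"])
previous_end = None
for job in jobs:
    duration_min = (job["end"] - job["start"]).total_seconds() / 60
    line = f"{job['name']}: {duration_min:.0f} min, {job['status']}"
    if previous_end is not None:
        gap_min = (job["start"] - previous_end).total_seconds() / 60
        line += f" (waited {gap_min:.0f} min after previous job)"
    print(line)
    previous_end = job["end"]
```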
10. Summarize Security and Access Logs for the Job
Prompt:
“Summarize the access and security logs associated with the data processing job [Job ID], highlighting who triggered the job, data access patterns, and any unauthorized attempts.”
Output Focus:
- User/service triggering the job
- Resources accessed (files, tables, APIs)
- Role-based access logs
- Unauthorized access attempts
- Compliance/logging flags triggered
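A minimal sketch of the filtering step, assuming access events are already available as structured records; the field names and the "denied" outcome label are assumptions to map onto your own audit log schema (e.g., CloudTrail, Ranger, or a custom audit table).

```python
# Sketch: summarize who triggered the job and flag denied access attempts.
# Field names and values are assumptions; adapt them to your audit log schema.
from collections import Counter

access_events = [
    {"principal": "svc-etl",      "action": "job:start",    "resource": "job/daily_agg",           "outcome": "allowed"},
    {"principal": "svc-etl",      "action": "s3:GetObject", "resource": "raw/user_logs",           "outcome": "allowed"},
    {"principal": "analyst-jane", "action": "table:read",   "resource": "warehouse.user_activity", "outcome": "denied"},
]

triggered_by = [e["principal"] for e in access_events if e["action"] == "job:start"]
denied = [e for e in access_events if e["outcome"] == "denied"]
resources = Counter(e["resource"] for e in access_events if e["outcome"] == "allowed")

print("Triggered by:", ", ".join(triggered_by) or "unknown")
print("Resources accessed:", dict(resources))
print("Unauthorized attempts:")
for e in denied:
    print(f"  {e['principal']} -> {e['action']} on {e['resource']}")
```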
These prompt workflows are applicable across multiple data platforms and can be adapted for tools like Apache Spark, Airflow, Databricks, AWS Glue, and GCP Dataflow. They help streamline job documentation, monitoring, auditing, and reporting tasks.