How Similarweb Cut Spark Job Runtime by 92% with DataFlint's Agentic Spark Copilot
When Similarweb moved a critical Spark job from a Databricks notebook in development to EMR staging, the runtime exploded from 50 minutes to 3 hours. Using DataFlint's Agentic Spark Copilot, an AI agent with direct access to production runtime context, the team identified the root cause in minutes, not days, and discovered a hidden platform difference that no amount of log diving would have revealed. The fix? A single configuration change brought the runtime down to 20 minutes.
The Problem: The Same Job, Wildly Different Performance
Similarweb's data engineering team develops Spark jobs in Databricks notebooks and deploys them to EMR for staging and production. When one critical job moved from dev to staging, what should have been a straightforward deployment turned into a performance challenge.
Using the DataFlint Dashboard for company-wide observability and cost monitoring, the team could track all Spark jobs across their fleet:

The Dashboard gives teams a unified view of all Spark jobs across every platform (Databricks, EMR, and more), showing run counts, costs, duration trends, and savings potential at a glance. This cross-platform visibility is critical for teams like Similarweb running Spark in multiple environments. One-click investigation lets engineers drill into any job to understand exactly what's happening without chasing logs across systems.
On Databricks, the job ran smoothly in 50 minutes:

The job processed 22.31 TiB of input and produced 3.73 TiB of output in just under 51 minutes.
On EMR, the same job was running for 3 hours:

DataFlint immediately flagged a Partition Skew alert showing a 21x skew ratio. The same logic was now running roughly 4x slower, and the team had no idea why.
Finding the Root Cause with DataFlint's Job Debugger & Optimizer
Using DataFlint's open-source Job Debugger & Optimizer, the team spotted the issue through real-time query visualization with heat maps.

The EMR run partitioned data into only 1,000 partitions, each averaging 7.88 GiB, which spilled to disk far more than the original Databricks run had.
Comparing to the Databricks run revealed the difference:

Databricks used 2,085 partitions averaging just 1.33 GiB, more than double the partition count, which kept spill manageable. Smaller partitions also compress more efficiently, so less data is shuffled across the network. What would have taken hours of manual investigation was visible in seconds.
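A back-of-the-envelope check, using only the rounded figures above (exact totals will differ), shows how much more data the EMR run shuffled despite identical logic:

```python
GIB = 1024**3
TIB = 1024**4

# Rounded figures reported for the two runs (partition count x average size)
emr_total = 1_000 * 7.88 * GIB    # EMR:        1,000 partitions x 7.88 GiB
dbx_total = 2_085 * 1.33 * GIB    # Databricks: 2,085 partitions x 1.33 GiB

print(f"EMR shuffle        ~ {emr_total / TIB:.2f} TiB")  # ~7.70 TiB
print(f"Databricks shuffle ~ {dbx_total / TIB:.2f} TiB")  # ~2.71 TiB
print(f"ratio              ~ {emr_total / dbx_total:.1f}x")
```

On these numbers the EMR run shuffles close to three times as many bytes, consistent with the compression effect just described.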
First Fix Attempt: The Agentic Spark Copilot in Action
With the root cause identified, the team turned to DataFlint's Agentic Spark Copilot. Unlike generic AI coding assistants that only see source code, DataFlint's agentic copilot autonomously connects to production runtime data, including Spark physical plans, task-level metrics, shuffle statistics, and cluster configuration, via MCP (Model Context Protocol). It pulled all relevant context from the EMR execution and offered the following fix:

The recommendation: increase the shuffle partition count to 20,000 so each partition would average ~100MB, reducing memory pressure. The team confidently implemented the fix.
But the job still ran for 3 hours.
Digging Deeper: The Hidden EMR Bottleneck
On a second inspection of DataFlint's query visualization, the real issue became clear:

The average source file size was ~773MB. Because EMR keeps the vanilla Spark default of 128MB input partitions (spark.sql.files.maxPartitionBytes), Spark split each large file into multiple read tasks, creating a cascade: 167K tasks × 20,000 shuffle partitions = 3.36 billion shuffle read files.
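The cascade arises because each shuffle map task writes one block per reduce partition, so the shuffle read count is the product of the two. Using the rounded 167K figure (the exact task count would push the product to the reported 3.36 billion):

```python
map_tasks = 167_000          # ~167K read/map tasks on EMR (rounded)
reduce_partitions = 20_000   # shuffle partition count after the first fix

# One shuffle block per (map task, reduce partition) pair.
shuffle_blocks = map_tasks * reduce_partitions
print(f"{shuffle_blocks:,} shuffle read blocks")  # 3,340,000,000
```

At billions of blocks, per-block shuffle open/read overhead alone can dominate the stage time.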
The task duration histogram revealed severe skew:

Why didn't this happen on Databricks?

Looking at the Spark web UI, the team saw that Databricks automatically optimizes through stage retries, progressively reducing task counts from 87,010 down to 3,621. This platform-specific optimization doesn't exist in vanilla EMR.
The Real Fix: Platform-Aware Tuning via the Agentic Spark Copilot
Armed with this insight, surfaced autonomously by the agentic copilot through its production-context awareness, the solution became clear: increase the threshold for how Spark reads files into partitions.
The fix: Set spark.sql.files.maxPartitionBytes to 1GB.
This configuration change means Spark opens one task per ~773MB file instead of splitting each one into half a dozen 128MB partitions (the vanilla Spark default that EMR inherits). The result: dramatically reduced shuffle read overhead and improved sort performance.
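Since spark.sql.files.maxPartitionBytes is a runtime SQL conf, this is again a one-line change; a minimal sketch, with the app name and builder boilerplate as illustrative placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example-etl-job")  # hypothetical app name
    # Let each ~773MB source file become a single read task instead of
    # ~6 splits at the 128MB vanilla-Spark default.
    .config("spark.sql.files.maxPartitionBytes", "1g")
    .getOrCreate()
)
```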
Final result on EMR: 20 minutes, actually faster than the original Databricks run.

The job now completes with zero spills and zero task failures.
Key Takeaways: Why You Need an Agentic Spark Copilot
This case illustrates a critical lesson: Spark performance isn't about code that compiles; it's about production context. The factors that determine whether your job runs in 50 minutes or 3 hours live outside your codebase: data volume and file sizes, cluster configuration, platform-specific optimizations, Spark configuration defaults, executor memory settings, shuffle behavior, and partition strategies. Code that works perfectly in development can fail spectacularly in production when any of these variables change.
Generic AI tools like ChatGPT and GitHub Copilot are blind to these factors because they only see code. An Agentic Spark Copilot bridges the gap by autonomously connecting to production runtime data and reasoning about platform-specific behaviors that are invisible from source code alone.
Without an agentic copilot, this migration could have consumed days of engineering time: endless Spark Web UI refreshing, manual log correlation, and trial-and-error configuration changes with no visibility into platform differences.
With DataFlint's Agentic Spark Copilot and production-aware observability:
- Real-time query visualization pinpointed partition skew immediately
- The Agentic Spark Copilot provided context-aware fix suggestions grounded in actual runtime data
- Root cause analysis took minutes instead of hours with no log diving required
- The team gained deep understanding of Databricks vs. EMR behavioral differences that will inform future migrations
- The copilot's agentic workflow (observe, diagnose, suggest, iterate) mirrors how a senior data engineer would debug, but in seconds
Ready to try the Agentic Spark Copilot? Get started with DataFlint and optimize your Spark jobs in minutes, not days.
