How Similarweb Cut Spark Job Runtime by 92% with DataFlint's Agentic Spark Copilot
When Similarweb moved a critical Spark job from a Databricks notebook in development to EMR staging, the runtime exploded from 50 minutes to 3 hours. Using DataFlint's Agentic Spark Copilot, an AI agent with direct access to production runtime context, the team identified the root cause in minutes, not days, and discovered a hidden platform difference that no amount of log diving would have revealed. The fix? A single configuration change brought the runtime down to 20 minutes.
The Problem: The Same Job, Wildly Different Performance
Similarweb's data engineering team develops Spark jobs in Databricks notebooks and deploys them to EMR for staging and production. When one critical job moved from dev to staging, what should have been a straightforward deployment turned into a performance challenge.
Using the DataFlint Dashboard for company-wide observability and cost monitoring, the team could track all Spark jobs across their fleet:

The Dashboard gives teams a unified view of all Spark jobs across every platform (Databricks, EMR, and more), showing run counts, costs, duration trends, and savings potential at a glance. This cross-platform visibility is critical for teams like Similarweb running Spark in multiple environments. One-click investigation lets engineers drill into any job to understand exactly what's happening without chasing logs across systems.
On Databricks, the job ran smoothly in 50 minutes:

The job processed 22.31 TiB of input and produced 3.73 TiB of output in just under 51 minutes.
On EMR, the same job was running for 3 hours:

DataFlint immediately flagged a Partition Skew alert showing a 21x skew ratio. The same logic was now running roughly 4x slower, and the team had no idea why.
Finding the Root Cause with DataFlint's Job Debugger & Optimizer
Using DataFlint's open-source Job Debugger & Optimizer, the team spotted the issue through real-time query visualization with heat maps.

The EMR run partitioned data into only 1,000 partitions, each averaging 7.88 GiB, which spilled to disk far more than the original Databricks run had.
Comparing to the Databricks run revealed the difference:

Databricks used 2,085 partitions averaging just 1.33 GiB, more than double the partition count, which kept spill manageable. Smaller partitions also compress more efficiently, so less data is shuffled across the network. What would have taken hours of manual investigation was visible in seconds.
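A back-of-the-envelope check, using only the rounded figures above (exact totals will differ), shows how much more data the EMR run shuffled despite identical logic:

```python
GIB = 1024**3
TIB = 1024**4

# Rounded figures reported for the two runs (partition count x average size)
emr_total = 1_000 * 7.88 * GIB    # EMR:        1,000 partitions x 7.88 GiB
dbx_total = 2_085 * 1.33 * GIB    # Databricks: 2,085 partitions x 1.33 GiB

print(f"EMR shuffle        ~ {emr_total / TIB:.2f} TiB")  # ~7.70 TiB
print(f"Databricks shuffle ~ {dbx_total / TIB:.2f} TiB")  # ~2.71 TiB
print(f"ratio              ~ {emr_total / dbx_total:.1f}x")
```

On these numbers the EMR run shuffles close to three times as many bytes, consistent with the compression effect just described.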
First Fix Attempt: The Agentic Spark Copilot in Action
With the root cause identified, the team turned to DataFlint's Agentic Spark Copilot. Unlike generic AI coding assistants that only see source code, DataFlint's agentic copilot autonomously connects to production runtime data, including Spark physical plans, task-level metrics, shuffle statistics, and cluster configuration, via MCP (Model Context Protocol). It pulled all relevant context from the EMR execution and offered the following fix:

The recommendation: increase the shuffle partition count to 20,000 so each partition would average ~100MB, reducing memory pressure. The team confidently implemented the fix.
But the job still ran for 3 hours.
Digging Deeper: The Hidden EMR Bottleneck
On a second inspection of DataFlint's query visualization, the real issue became clear:

The average source file size was ~773MB. Because EMR keeps the vanilla Spark default of 128MB input partitions (spark.sql.files.maxPartitionBytes), Spark split each large file into multiple read tasks, creating a cascade: 167K tasks × 20,000 shuffle partitions = 3.36 billion shuffle read files.
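The cascade arises because each shuffle map task writes one block per reduce partition, so the shuffle read count is the product of the two. Using the rounded 167K figure (the exact task count would push the product to the reported 3.36 billion):

```python
map_tasks = 167_000          # ~167K read/map tasks on EMR (rounded)
reduce_partitions = 20_000   # shuffle partition count after the first fix

# One shuffle block per (map task, reduce partition) pair.
shuffle_blocks = map_tasks * reduce_partitions
print(f"{shuffle_blocks:,} shuffle read blocks")  # 3,340,000,000
```

At billions of blocks, per-block shuffle open/read overhead alone can dominate the stage time.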
The task duration histogram revealed severe skew:

Why didn't this happen on Databricks?

Looking at the Spark web UI, the team saw that Databricks automatically optimizes through stage retries, progressively reducing task counts from 87,010 down to 3,621. This platform-specific optimization doesn't exist in vanilla EMR.
The Real Fix: Platform-Aware Tuning via the Agentic Spark Copilot
Armed with this insight, surfaced autonomously by the agentic copilot through its production-context awareness, the solution became clear: increase the threshold for how Spark reads files into partitions.
The fix: Set spark.sql.files.maxPartitionBytes to 1GB.
This configuration change means Spark opens one task per ~773MB file instead of splitting each one into half a dozen 128MB partitions (the vanilla Spark default that EMR inherits). The result: dramatically reduced shuffle read overhead and improved sort performance.
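Since spark.sql.files.maxPartitionBytes is a runtime SQL conf, this is again a one-line change; a minimal sketch, with the app name and builder boilerplate as illustrative placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example-etl-job")  # hypothetical app name
    # Let each ~773MB source file become a single read task instead of
    # ~6 splits at the 128MB vanilla-Spark default.
    .config("spark.sql.files.maxPartitionBytes", "1g")
    .getOrCreate()
)
```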
Final result on EMR: 20 minutes, actually faster than the original Databricks run.

The job now completes with zero spills and zero task failures.
Key Takeaways: Why You Need an Agentic Spark Copilot
This case illustrates a critical lesson: Spark performance isn't about code that compiles; it's about production context. The factors that determine whether your job runs in 50 minutes or 3 hours live outside your codebase: data volume and file sizes, cluster configuration, platform-specific optimizations, Spark configuration defaults, executor memory settings, shuffle behavior, and partition strategies. Code that works perfectly in development can fail spectacularly in production when any of these variables change.
Generic AI tools like ChatGPT and GitHub Copilot are blind to these factors because they only see code. An Agentic Spark Copilot bridges the gap by autonomously connecting to production runtime data and reasoning about platform-specific behaviors that are invisible from source code alone.
Without an agentic copilot, this migration could have consumed days of engineering time: endless Spark Web UI refreshing, manual log correlation, and trial-and-error configuration changes with no visibility into platform differences.
With DataFlint's Agentic Spark Copilot and production-aware observability:
- Real-time query visualization pinpointed partition skew immediately
- The Agentic Spark Copilot provided context-aware fix suggestions grounded in actual runtime data
- Root cause analysis took minutes instead of hours with no log diving required
- The team gained deep understanding of Databricks vs. EMR behavioral differences that will inform future migrations
- The copilot's agentic workflow (observe, diagnose, suggest, iterate) mirrors how a senior data engineer would debug, but in seconds
Ready to try the Agentic Spark Copilot? Get started with DataFlint and optimize your Spark jobs in minutes, not days.
