
SimilarWeb Case Study: How AI-Powered Spark Tuning Achieved 90x Faster Performance and 160x Cost Reduction

Daniel Aronovich
Co-Founder & CTO
Meni Shmueli
Co-Founder & CEO

SimilarWeb, a leading digital market intelligence company that processes massive amounts of web data to provide insights into website traffic, user behavior, and competitive intelligence, was facing a critical Spark job failure. One of their most important data pipelines was failing after 22 hours on 200 machines. Traditional debugging and generic AI tools couldn't help. The code was correct, but Spark was only using 83 of their 800 available cores due to subtle optimizer behaviors that were invisible without production context.

Using DataFlint's IDE Extension - AI Copilot, which connects to actual runtime data, we identified the root cause in minutes. The fix? Just 4 lines of code and configuration changes. The result: the job now runs in 15 minutes on 20 machines (90X faster and 160X cheaper) and completes reliably every time.

Here's the full story of how DataFlint helped SimilarWeb achieve this massive Spark performance optimization and Databricks cost reduction.

Performance improvements: 22 hours to 15 minutes, 200 machines to 20 machines

The Challenge SimilarWeb Faced

SimilarWeb processes massive amounts of HTML, images, and web assets every single day to power their digital market intelligence platform. As the web grows more dynamic and complex, their data pipeline monitoring requirements grow with it. The volume of content they ingest has skyrocketed, page structures continue to evolve, and their extraction logic is more involved than ever.

This naturally creates significant big data performance tuning challenges. Even with robust distributed systems in place, one critical Spark job was pushing their infrastructure to its limits.

The symptoms were frustrating:

  • 22.2 hours of runtime before failing
  • 200 md-fleet.xlarge machines consumed
  • No completion - the job never finished successfully

SimilarWeb's team turned to AI models to help diagnose the issue, but conventional AI assistants couldn't provide meaningful hints. The code was straightforward and correct. The problem wasn't the code at all; it was rooted in the production context: the nature of their data, the file sizes, and how data was distributed at scale.

The Pain - A Simple-Looking Spark Pipeline That Failed After 22 Hours

SimilarWeb's compute pipeline is built on Apache Spark running on Databricks, which they use to parallelize loading favicon images, parsing them, extracting information, and writing the results.

The code itself is relatively simple:

Spark code snippet showing the pipeline with repartition, UDF, filter, and write operations

They calculate a repartition value based on input size, repartition to distribute work evenly, apply a PySpark UDF to fetch favicon images for each domain, and mark results as success or failed. These are standard Spark code optimization practices; nothing here suggests the job should fail or run for 22 hours.
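To make the rest of the story concrete, here is a minimal sketch of that pattern. The names, paths, and favicon-fetching logic below are illustrative assumptions, not SimilarWeb's actual code:

```python
# Minimal sketch of the pipeline pattern described above (illustrative names and paths).
import requests
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
domains = spark.read.parquet("s3://bucket/domains/")  # hypothetical input location

# Repartition based on input size so the heavy per-domain work spreads across the cluster
target_partitions = 480  # in practice derived from the input size
domains = domains.repartition(target_partitions)

@F.udf(returnType=StringType())
def fetch_favicon(domain):
    # Heavy, network-bound work: download the favicon for a single domain
    try:
        resp = requests.get(f"https://{domain}/favicon.ico", timeout=5)
        return resp.content.hex() if resp.ok else None
    except Exception:
        return None

result = (domains
          .withColumn("favicon", fetch_favicon(F.col("domain")))
          .withColumn("status",
                      F.when(F.col("favicon").isNotNull(), "success").otherwise("failed")))

# Successes and failures are written separately; the filter on the UDF result is
# what later interacts badly with Spark's optimizer.
result.filter(F.col("status") == "success").write.mode("overwrite").parquet("s3://bucket/favicons/")
result.filter(F.col("status") == "failed").write.mode("overwrite").parquet("s3://bucket/failed/")
```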

Despite following Apache Spark optimization best practices, the job:

  • Ran for 22.2 hours
  • Consumed 200 md-fleet.xlarge machines
  • Still failed to finish

The challenge was rooted in the production context: the nature of the data, the file sizes, and the way the data was distributed at scale. This was the missing piece that none of the Spark monitoring tools already in place could reveal.

Why Existing AI Tools Couldn't Help

Standard AI coding agents only see code. They cannot see:

  • File layouts
  • Data and task distribution
  • Spark DAG optimization opportunities
  • Cluster size and utilization
  • Spark shuffle optimization issues

The issue required awareness of the live environment the code ran in, not just the code itself. Traditional big data monitoring tools and even advanced AI assistants were blind to the real problem.

This is exactly where DataFlint fits in.

How DataFlint Solved It - AI for Data Engineering With Production Context

DataFlint's IDE Extension - AI Copilot allows development environments (Cursor, VS Code, or IntelliJ) to connect via Model Context Protocol (MCP) to the actual production context of the workload. This is what makes DataFlint different from generic AI tools: it has visibility into the real production environment, not just the code.

DataFlint's production-aware AI can see:

  • Real Spark physical plans for query optimization
  • Real-time pipeline monitoring data from the DataFlint Dashboard
  • Actual input data characteristics and distribution
  • Spark cluster monitoring metrics and resource utilization
  • Stage-level performance bottlenecks identified by the Job Debugger & Optimizer

And because of that, DataFlint can propose changes that are correct for the real environment, not just theoretically correct code: it understands Spark executor tuning, memory optimization, and the actual runtime behavior of your jobs based on compressed production logs.

After applying DataFlint fixes - improved performance metrics

The fixes? Just 2 lines in the code, right in the IDE via DataFlint's AI Copilot, and 2 lines in the cluster configuration:

Code changes highlighted by DataFlint Copilot: asNondeterministic() and cache()

👉 See how DataFlint can help optimize your Spark jobs with production-aware AI assistance.

Root Cause Explanation - What DataFlint Discovered

Using DataFlint's Job Debugger & Optimizer to analyze the production run, the first insight DataFlint highlighted was shocking:

SimilarWeb's cluster had 800 available cores, but Spark was only launching 83 tasks.

This resulted in:

  • Only 10% utilization
  • Very long task durations
  • Increased memory pressure on each task
  • Frequent retries that took ages, because each long-running task had to start over

But how is this possible? They clearly repartitioned in the code. They expected thousands of parallel tasks.

The real reason: Spark skipped their manual repartition.

Two issues caused this:

1. A filter on UDF results triggered predicate pushdown

Spark cannot estimate the cost of a UDF. When it sees a filter on top of a UDF, Spark optimizes by pushing the filter down as early as possible.

Because of this optimization, Spark applied the filter before the repartition. This effectively bypassed the manual repartition logic entirely.
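You can observe this kind of plan change yourself with Spark's explain(). Continuing the hypothetical sketch from earlier:

```python
# Print the plans Spark will actually run for the success branch (illustrative).
success = result.filter(F.col("status") == "success")
success.explain(True)  # extended output: parsed, analyzed, optimized and physical plans

# In the problematic plan, the Filter that evaluates the UDF sits below the Exchange
# added by repartition(), i.e. the heavy UDF work runs on the original input splits
# before any redistribution happens.
```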

DataFlint query visualization showing problematic execution plan with Filter before Repartition

2. The input data had exactly 83 files

With the filter bypassing the repartition, Spark defaulted to using the number of input files as the number of tasks.

83 files = 83 tasks.

No matter how many cores were available or how large the cluster was, Spark would not create more tasks. This was the core bottleneck.
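A quick way to spot this kind of input-driven limit on your own data (the path below is, again, hypothetical):

```python
# How many tasks will the scan stage get? Roughly one per input split.
domains = spark.read.parquet("s3://bucket/domains/")  # hypothetical path
print(domains.rdd.getNumPartitions())  # partition count the heavy stage is stuck with
print(len(domains.inputFiles()))       # number of underlying files (83 in this case)
```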

DataFlint stage analysis showing only 83 tasks with 3.7 hour median duration

The DataFlint AI Copilot Fix: Use asNondeterministic()

DataFlint's IDE Extension suggested marking the UDF as non-deterministic, which instructs Spark not to apply the filter pushdown optimization and not to run the UDF twice.

This preserves the manual repartition step and restores the intended level of parallelism.
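In code, the change is a one-liner. Here is a sketch using the hypothetical names from the earlier example (the real change was applied in SimilarWeb's codebase via the Copilot):

```python
# The same UDF as before, now marked non-deterministic.
@F.udf(returnType=StringType())
def fetch_favicon(domain):
    ...  # heavy, network-bound favicon download as before

# Catalyst may no longer duplicate the call or push the filter on its result below
# the repartition, so the manual repartition (and its parallelism) is preserved.
fetch_favicon = fetch_favicon.asNondeterministic()
```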

This single Spark SQL optimization change was responsible for a 20X improvement:

  • 2X for not running the same heavy UDF twice (the usual use case for asNondeterministic)
  • 10X by parallelizing the job across all available cores

Corrected query plan with Repartition early, showing 480 partitions and healthy execution

Additional Spark Performance Tuning Optimizations Found by DataFlint

1. Multiple Actions and Missing Caching

SimilarWeb's code writes failed downloads separately for logging purposes. However, the data wasn't cached, meaning the logic (including the UDF) ran twice.

With real production insight from DataFlint's Dashboard observability platform, DataFlint identified that the output data was small and easily cacheable. By caching before the writes, the pipeline avoids recomputing the extraction logic.

This Spark memory optimization delivered another 2X improvement in both cost and duration.
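A sketch of what that looks like, again with the hypothetical names from the earlier example:

```python
# Cache the computed results once; both writes below reuse the cached rows instead
# of re-running the UDF.
result = (domains
          .withColumn("favicon", fetch_favicon(F.col("domain")))
          .withColumn("status",
                      F.when(F.col("favicon").isNotNull(), "success").otherwise("failed"))
          .cache())

result.filter(F.col("status") == "success").write.mode("overwrite").parquet("s3://bucket/favicons/")
result.filter(F.col("status") == "failed").write.mode("overwrite").parquet("s3://bucket/failed/")
result.unpersist()  # release the cached data once both writes have finished
```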

2. Machine Type Optimization for Databricks Cost Optimization

SimilarWeb's initial configuration used many small, general-purpose (m-type) machines. But this workload is:

  • CPU-heavy
  • Not particularly memory-heavy

DataFlint resource analysis showing 96.92% idle cores and massive underutilization

A large fleet of small machines also increases shuffle overhead and the risk of spot-instance failures.

Switching to fewer, larger, compute-optimized machines (cd-fleet.12xlarge) provided:

  • Another 2X cost reduction - more cores for less money
  • More stable execution - fewer spot failures

After optimization: 14.8 minutes, 0% idle cores, full utilization - efficient Spark execution

Summary - AI for Big Data Performance Optimization Has Arrived

This job was effectively impossible to optimize using traditional methods. The bottleneck wasn't in the code, but in the production environment surrounding it. Without visibility into data distribution, file structure, and Spark's physical execution plan, classical debugging approaches and standard AI assistants were blind.

By giving AI access to the production context through DataFlint's modern data observability platform, combining the Job Debugger for detailed analysis, the Dashboard for fleet-wide visibility, and the IDE Extension for AI-powered fixes, DataFlint enabled Spark cost optimization and performance improvements that were previously inaccessible.

The outcome:

  • 90X faster duration
  • 160X cheaper Databricks costs
  • A job that now finishes reliably

AI-assisted data pipeline optimization for massive-scale workloads is rapidly evolving, and tools like DataFlint are making complex engines like Apache Spark dramatically easier to optimize and maintain. Whether you're struggling with Spark join optimization, garbage collection tuning, or simply trying to reduce Databricks costs, production-aware AI is changing what's possible.

👉 Ready to achieve similar results? Get started with DataFlint and see the difference for yourself.