Apache Spark is constantly evolving, and keeping your data pipelines up to date is essential for performance, security, and access to new features. OpenHands can help you migrate Spark applications across versions, frameworks, and cloud platforms.

Overview

OpenHands assists with Spark migrations in several ways:
  • Version upgrades: Migrate from Spark 2.x to 3.x, or between 3.x versions
  • API modernization: Update deprecated APIs to current best practices
  • Framework migrations: Convert from MapReduce, Hive, or other frameworks to Spark
  • Cloud migrations: Move Spark workloads between cloud providers or to cloud-native services

Migration Scenarios

Spark Version Upgrades

Upgrading Spark versions often requires code changes due to API deprecations and behavioral differences.

Spark 2.x to 3.x Migration:
Migrate my Spark 2.4 application to Spark 3.5:

1. Identify all deprecated API usages in src/main/scala/
2. Update DataFrame operations that changed behavior
3. Migrate any remaining SQLContext/HiveContext usage to SparkSession.builder() patterns
4. Update date/time handling for the new defaults
5. Check and update all UDF registrations
6. Update the build.sbt dependencies

List all changes made with before/after comparisons.
Common migration areas:
| Spark 2.x | Spark 3.x | Action Required |
| --- | --- | --- |
| SQLContext | SparkSession | Replace with SparkSession |
| HiveContext | SparkSession with Hive support | Update initialization |
| Dataset.unionAll() | Dataset.union() | Rename method calls |
| DataFrame.explode() | functions.explode() | Use SQL functions |
| Legacy date parsing | Proleptic Gregorian calendar | Review date handling |
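
For orientation, the sketch below shows what several of these changes look like together in code; the class name, file paths, and the legacy time-parser setting are illustrative rather than taken from any particular project.

// Illustrative Spark 2.x -> 3.x changes (hypothetical class name and paths)
import org.apache.spark.sql.SparkSession

object Spark3MigrationSketch {
  def main(args: Array[String]): Unit = {
    // SQLContext/HiveContext initialization is replaced by a single SparkSession
    val spark = SparkSession.builder()
      .appName("migrated-job")
      .enableHiveSupport()                 // replaces HiveContext
      .getOrCreate()

    val current  = spark.read.parquet("data/current")   // hypothetical inputs
    val backfill = spark.read.parquet("data/backfill")

    // unionAll() was deprecated in favor of union()
    val combined = current.union(backfill)

    // If date parsing differences surface during testing, the legacy parser
    // can be re-enabled temporarily while the data is reviewed:
    spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

    combined.write.mode("overwrite").parquet("data/combined")
    spark.stop()
  }
}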

Migration from Other Big Data Frameworks

MapReduce to Spark:
Convert our MapReduce jobs to Spark:

1. Analyze the MapReduce job in src/mapreduce/WordCount.java
2. Identify the mapper and reducer logic
3. Convert to equivalent Spark transformations
4. Preserve the same input/output formats
5. Create a test that compares outputs from both versions
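
As a rough sketch of step 3, the mapper/reducer logic of a classic word count collapses into a few Spark transformations. The class name and input/output paths below are hypothetical and not tied to the WordCount.java referenced in the prompt.

// Hypothetical Spark equivalent of a classic WordCount MapReduce job
import org.apache.spark.sql.SparkSession

object WordCountSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("word-count").getOrCreate()
    val sc = spark.sparkContext

    // Mapper logic: tokenize each line into (word, 1) pairs
    // Reducer logic: sum the counts per word
    val counts = sc.textFile("input/")          // hypothetical input path
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Keep a text output comparable to the MapReduce job's part files
    counts.map { case (word, n) => s"$word\t$n" }.saveAsTextFile("output/")
    spark.stop()
  }
}
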
Hive to Spark SQL:
Migrate our Hive ETL pipeline to Spark:

1. Convert the Hive scripts in etl/hive/ to Spark SQL
2. Replace Hive UDFs with Spark SQL functions where possible
3. For custom UDFs, create Spark UDF equivalents
4. Maintain compatibility with our existing Hive metastore
5. Benchmark the performance difference
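
A minimal sketch of the target shape, assuming the Spark cluster can reach the existing Hive metastore; the database, table, and query below are hypothetical stand-ins for the scripts in etl/hive/.

// Running a converted HiveQL query through Spark SQL while still resolving
// tables from the existing Hive metastore (names are hypothetical).
import org.apache.spark.sql.SparkSession

object HiveEtlSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-etl")
      .enableHiveSupport()   // reuse the existing Hive metastore
      .getOrCreate()

    val daily = spark.sql(
      """SELECT dt, count(*) AS events
        |FROM warehouse.raw_events
        |GROUP BY dt""".stripMargin)

    daily.write.mode("overwrite").saveAsTable("warehouse.daily_events")
    spark.stop()
  }
}
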
Pig to Spark:
Convert our Pig Latin scripts to PySpark:

1. Analyze the Pig scripts in pipelines/pig/
2. Map Pig operations to equivalent Spark transformations
3. Convert Pig UDFs to Python functions
4. Preserve the data flow and dependencies
5. Document any behavioral differences
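
For step 2, Pig's table-like operations map directly onto DataFrame transformations. The sketch below is in Scala for consistency with the other examples (the same API exists in PySpark); the operations, columns, and paths are hypothetical.

// Illustrative mapping of common Pig operations to Spark transformations
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object PigToSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pig-migration").getOrCreate()

    // LOAD ... USING PigStorage(',')  ->  spark.read with an explicit format
    val sales = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/sales.csv")

    // FILTER ... BY amount > 0        ->  DataFrame.filter
    // GROUP ... BY region / SUM(...)  ->  groupBy + agg
    val byRegion = sales
      .filter("amount > 0")
      .groupBy("region")
      .agg(sum("amount").alias("total_amount"))

    // STORE ... INTO '...'            ->  DataFrame.write
    byRegion.write.mode("overwrite").parquet("data/sales_by_region")
    spark.stop()
  }
}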

Cloud Platform Migrations

On-premises to Cloud:
Migrate our Spark application to AWS EMR:

1. Update file paths from HDFS to S3
2. Configure AWS credentials handling
3. Update cluster configuration for EMR
4. Modify logging to use CloudWatch
5. Create EMR step definitions for our jobs
6. Update the CI/CD pipeline for EMR deployment
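
The core code change for step 1 is usually just the storage URI. The sketch below uses a hypothetical bucket and assumes credentials come from the cluster's instance profile rather than from application code.

// Storage-path change: HDFS paths become S3 URIs (bucket name is hypothetical)
import org.apache.spark.sql.SparkSession

object EmrPathsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("emr-job").getOrCreate()

    // Before: spark.read.parquet("hdfs:///data/events/")
    val events = spark.read.parquet("s3://my-data-bucket/events/")

    events.filter("event_type = 'purchase'")
      .write.mode("overwrite")
      .parquet("s3://my-data-bucket/curated/purchases/")

    spark.stop()
  }
}
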
Between Cloud Providers:
Migrate our Spark application from EMR to Databricks:

1. Inventory all EMR-specific configurations
2. Map S3 paths to DBFS equivalents
3. Convert EMR bootstrap scripts to init scripts
4. Update IAM roles to Databricks service principals
5. Migrate Step Functions orchestration to Databricks Workflows
6. Create a parallel testing strategy

Code Transformation

API Updates

OpenHands can automatically update deprecated APIs:
Update all deprecated Spark APIs in our codebase:

1. Scan for deprecated method usages
2. Replace with recommended alternatives
3. Update import statements as needed
4. Add comments noting the changes for review
5. Run the test suite to verify no regressions
Common API updates:
// Before (Spark 2.x)
val df = spark.read.format("json").load(path)
df.registerTempTable("temp")

// After (Spark 3.x)
val df = spark.read.format("json").load(path)
df.createOrReplaceTempView("temp")

Performance Optimization

Improve performance during migration:
Optimize our Spark jobs during the migration:

1. Replace `collect()` with `take()` or `foreach()` where appropriate
2. Convert repeated DataFrame operations to use caching
3. Optimize shuffle operations with appropriate partitioning
4. Reorder operations so filters and projections (narrow transformations) run before wide, shuffle-heavy ones where possible
5. Update broadcast join hints for large dimension tables
6. Profile before and after with Spark UI metrics
Key optimization patterns:
| Anti-pattern | Optimization | Impact |
| --- | --- | --- |
| Multiple count() calls | Cache and count once | Reduces recomputation |
| Small file output | Coalesce before write | Fewer files, faster reads |
| Skewed joins | Salting or broadcast | Eliminates stragglers |
| UDFs for simple ops | Built-in functions | Catalyst optimization |
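
A sketch combining three of these patterns: caching a reused DataFrame, broadcasting a small dimension table, and coalescing before write. Table, column, and path names are hypothetical.

// Illustrative optimization patterns (hypothetical data)
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object OptimizationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("optimized-job").getOrCreate()

    val orders     = spark.read.parquet("data/orders")      // hypothetical inputs
    val dimRegions = spark.read.parquet("data/dim_regions") // small dimension table

    // Cache a DataFrame that is used more than once instead of recomputing it
    val recent = orders.filter("order_date >= '2024-01-01'").cache()
    println(s"recent orders: ${recent.count()}")

    // Broadcast the small dimension table to avoid a shuffle join
    val enriched = recent.join(broadcast(dimRegions), Seq("region_id"))

    // Coalesce before writing to avoid producing many small output files
    enriched.coalesce(8).write.mode("overwrite").parquet("data/enriched_orders")

    spark.stop()
  }
}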

Best Practices Application

Apply modern Spark best practices:
Refactor our Spark application to follow best practices:

1. Replace RDD operations with DataFrame/Dataset where possible
2. Use Spark SQL functions instead of UDFs
3. Implement proper error handling with try-catch
4. Add schema validation for input data
5. Implement idempotent writes for recovery
6. Add structured logging for debugging
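
A sketch of two of these practices, enforcing an explicit input schema and using built-in functions instead of a simple UDF; column and path names are hypothetical.

// Schema validation on read and Catalyst-friendly built-in functions
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

object BestPracticesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("best-practices").getOrCreate()

    // Declare the expected schema instead of relying on inference
    val schema = StructType(Seq(
      StructField("user_id", StringType, nullable = false),
      StructField("country", StringType, nullable = true),
      StructField("created_at", TimestampType, nullable = true)
    ))

    val users = spark.read.schema(schema).json("data/users")

    // Prefer built-in functions (optimizable by Catalyst) over an equivalent UDF
    val normalized = users.withColumn("country", upper(col("country")))

    normalized.write.mode("overwrite").parquet("data/users_normalized")
    spark.stop()
  }
}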

Testing and Validation

Job Testing

Create comprehensive tests for migrated jobs:
Create a test suite for our migrated Spark jobs:

1. Unit tests for transformation logic using local SparkSession
2. Integration tests with sample data files
3. Schema validation tests for input and output
4. Property-based tests for key business logic
5. Test fixtures that work with both Spark versions
Example test structure:
import org.scalatest.funsuite.AnyFunSuite

class MigrationTest extends AnyFunSuite with SparkSessionTestWrapper {
  test("transformed output matches expected schema") {
    val input = spark.read.json("src/test/resources/input.json")
    val result = MyTransformations.process(input)
    
    assert(result.schema === expectedSchema)
  }
  
  test("business logic produces same results as legacy") {
    val input = loadTestData()
    val newResult = NewPipeline.run(input)
    val legacyResult = loadLegacyOutput()
    
    assertDataFrameEquals(newResult, legacyResult)
  }
}
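
The example assumes a SparkSessionTestWrapper helper and an assertDataFrameEquals assertion provided elsewhere (for instance by a DataFrame testing library or a small in-house utility). A minimal local-mode version of the wrapper might look like this:

import org.apache.spark.sql.SparkSession

// Hypothetical helper trait: provides a lazily created local SparkSession for tests
trait SparkSessionTestWrapper {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("migration-tests")
    .config("spark.sql.shuffle.partitions", "2") // keep shuffles small for fast tests
    .getOrCreate()
}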

Performance Benchmarking

Compare performance between versions:
Create performance benchmarks for our migration:

1. Set up identical test datasets of 1GB, 10GB, and 100GB
2. Measure job completion time for both versions
3. Compare resource utilization (CPU, memory, shuffle)
4. Track stage-level metrics from Spark UI
5. Generate a comparison report with recommendations
Benchmark metrics to track:
  • Job duration (wall clock time)
  • Shuffle read/write bytes
  • Peak executor memory
  • Task distribution (min/max/median)
  • Garbage collection time
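
One way to capture several of these metrics programmatically, rather than reading them off the Spark UI, is a custom SparkListener; the sketch below aggregates a few task-level metrics and is illustrative rather than exhaustive.

import java.util.concurrent.atomic.LongAdder
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Aggregates a few task-level metrics for a before/after comparison
class BenchmarkListener extends SparkListener {
  val shuffleReadBytes  = new LongAdder
  val shuffleWriteBytes = new LongAdder
  val gcTimeMs          = new LongAdder

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      shuffleReadBytes.add(m.shuffleReadMetrics.totalBytesRead)
      shuffleWriteBytes.add(m.shuffleWriteMetrics.bytesWritten)
      gcTimeMs.add(m.jvmGCTime)
    }
  }
}

// Usage: spark.sparkContext.addSparkListener(new BenchmarkListener)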

Data Validation

Ensure data correctness after migration:
Validate that our migrated pipeline produces correct output:

1. Run both pipelines on the same input dataset
2. Compare row counts between outputs
3. Perform checksum comparison on key columns
4. Validate aggregations match exactly
5. Check for NULL handling differences
6. Generate a data quality report
Validation approaches:
# Compare outputs as multisets of rows (order-independent, duplicates respected)
old_df = spark.read.parquet("output/v2/")
new_df = spark.read.parquet("output/v3/")

missing = old_df.exceptAll(new_df).count()  # rows in the v2 output but not in v3
extra = new_df.exceptAll(old_df).count()    # rows in the v3 output but not in v2
assert missing == 0 and extra == 0, f"{missing} missing and {extra} unexpected rows"

Examples

Complete Spark 2 to 3 Migration

Migrate our Spark 2.4 ETL pipeline to Spark 3.5:

Project structure:
- src/main/scala/etl/
  - ExtractJob.scala
  - TransformJob.scala
  - LoadJob.scala
- src/main/resources/
  - application.conf

Requirements:
1. Update all deprecated APIs
2. Migrate from legacy date parsing
3. Update to new Catalog API for Hive tables
4. Preserve all business logic exactly
5. Update build.sbt with new dependencies
6. Create a test suite comparing old and new outputs
7. Document all breaking changes found
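
For requirement 5, the build.sbt change is typically a version bump; the versions below are indicative only (Spark 3.5 builds against Scala 2.12 or 2.13), so pin them to what your cluster actually runs.

// Indicative build.sbt changes for a Spark 2.4 -> 3.5 upgrade
scalaVersion := "2.12.18"   // Spark 3.5 supports Scala 2.12 and 2.13

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.5.1" % "provided",
  "org.scalatest"    %% "scalatest"  % "3.2.18" % Test
)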

Hive to Spark SQL Migration

Convert our Hive data warehouse queries to Spark SQL:

Hive scripts to migrate:
- daily_aggregation.hql
- customer_segments.hql
- revenue_report.hql

Requirements:
1. Convert HiveQL to Spark SQL
2. Replace Hive UDFs with Spark equivalents
3. Optimize for Spark execution
4. Maintain Hive metastore compatibility
5. Create performance comparison benchmarks

EMR to Databricks Migration

Migrate our EMR Spark pipeline to Databricks:

Current setup:
- EMR 6.x with Spark 3.1
- S3 for data storage
- Step Functions for orchestration
- CloudWatch for monitoring

Target:
- Databricks on AWS
- Unity Catalog for data governance
- Databricks Workflows for orchestration
- Built-in Databricks monitoring

Deliverables:
1. Converted notebook or job definitions
2. Updated storage configurations
3. Workflow definitions
4. IAM/service principal mappings
5. Migration runbook