Apache Spark is constantly evolving, and keeping your data pipelines up to date is essential for performance, security, and access to new features. OpenHands can help you migrate Spark applications across versions, frameworks, and cloud platforms.

Overview

OpenHands assists with Spark migrations in several ways:
  • Version upgrades: Migrate from Spark 2.x to 3.x, or between 3.x versions
  • API modernization: Update deprecated APIs to current best practices
  • Framework migrations: Convert from MapReduce, Hive, or other frameworks to Spark
  • Cloud migrations: Move Spark workloads between cloud providers or to cloud-native services

Migration Scenarios

Spark Version Upgrades

Upgrading Spark versions often requires code changes due to API deprecations and behavioral differences.

Spark 2.x to 3.x Migration:
Migrate my Spark 2.4 application to Spark 3.5:

1. Identify all deprecated API usages in src/main/scala/
2. Update DataFrame operations that changed behavior
3. Migrate any remaining SQLContext/HiveContext usage to SparkSession.builder() patterns
4. Update date/time handling for the new defaults
5. Check and update all UDF registrations
6. Update the build.sbt dependencies

List all changes made with before/after comparisons.
Common migration areas:
| Spark 2.x | Spark 3.x | Action Required |
| --- | --- | --- |
| SQLContext | SparkSession | Replace with SparkSession |
| HiveContext | SparkSession with Hive support | Update initialization |
| Dataset.unionAll() | Dataset.union() | Rename method calls |
| DataFrame.explode() | functions.explode() | Use SQL functions |
| Legacy date parsing | Proleptic Gregorian calendar | Review date handling |
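
For orientation, the sketch below shows what several of these changes look like together in code; the class name, file paths, and the legacy time-parser setting are illustrative rather than taken from any particular project.

// Illustrative Spark 2.x -> 3.x changes (hypothetical class name and paths)
import org.apache.spark.sql.SparkSession

object Spark3MigrationSketch {
  def main(args: Array[String]): Unit = {
    // SQLContext/HiveContext initialization is replaced by a single SparkSession
    val spark = SparkSession.builder()
      .appName("migrated-job")
      .enableHiveSupport()                 // replaces HiveContext
      .getOrCreate()

    val current  = spark.read.parquet("data/current")   // hypothetical inputs
    val backfill = spark.read.parquet("data/backfill")

    // unionAll() was deprecated in favor of union()
    val combined = current.union(backfill)

    // If date parsing differences surface during testing, the legacy parser
    // can be re-enabled temporarily while the data is reviewed:
    spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

    combined.write.mode("overwrite").parquet("data/combined")
    spark.stop()
  }
}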

Migration from Other Big Data Frameworks

MapReduce to Spark:
Convert our MapReduce jobs to Spark:

1. Analyze the MapReduce job in src/mapreduce/WordCount.java
2. Identify the mapper and reducer logic
3. Convert to equivalent Spark transformations
4. Preserve the same input/output formats
5. Create a test that compares outputs from both versions
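
As a rough sketch of step 3, the mapper/reducer logic of a classic word count collapses into a few Spark transformations. The class name and input/output paths below are hypothetical and not tied to the WordCount.java referenced in the prompt.

// Hypothetical Spark equivalent of a classic WordCount MapReduce job
import org.apache.spark.sql.SparkSession

object WordCountSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("word-count").getOrCreate()
    val sc = spark.sparkContext

    // Mapper logic: tokenize each line into (word, 1) pairs
    // Reducer logic: sum the counts per word
    val counts = sc.textFile("input/")          // hypothetical input path
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Keep a text output comparable to the MapReduce job's part files
    counts.map { case (word, n) => s"$word\t$n" }.saveAsTextFile("output/")
    spark.stop()
  }
}
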
Hive to Spark SQL:
Migrate our Hive ETL pipeline to Spark:

1. Convert the Hive scripts in etl/hive/ to Spark SQL
2. Replace Hive UDFs with Spark SQL functions where possible
3. For custom UDFs, create Spark UDF equivalents
4. Maintain compatibility with our existing Hive metastore
5. Benchmark the performance difference
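
A minimal sketch of the target shape, assuming the Spark cluster can reach the existing Hive metastore; the database, table, and query below are hypothetical stand-ins for the scripts in etl/hive/.

// Running a converted HiveQL query through Spark SQL while still resolving
// tables from the existing Hive metastore (names are hypothetical).
import org.apache.spark.sql.SparkSession

object HiveEtlSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-etl")
      .enableHiveSupport()   // reuse the existing Hive metastore
      .getOrCreate()

    val daily = spark.sql(
      """SELECT dt, count(*) AS events
        |FROM warehouse.raw_events
        |GROUP BY dt""".stripMargin)

    daily.write.mode("overwrite").saveAsTable("warehouse.daily_events")
    spark.stop()
  }
}
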
Pig to Spark:
Convert our Pig Latin scripts to PySpark:

1. Analyze the Pig scripts in pipelines/pig/
2. Map Pig operations to equivalent Spark transformations
3. Convert Pig UDFs to Python functions
4. Preserve the data flow and dependencies
5. Document any behavioral differences
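
For step 2, Pig's table-like operations map directly onto DataFrame transformations. The sketch below is in Scala for consistency with the other examples (the same API exists in PySpark); the operations, columns, and paths are hypothetical.

// Illustrative mapping of common Pig operations to Spark transformations
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object PigToSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pig-migration").getOrCreate()

    // LOAD ... USING PigStorage(',')  ->  spark.read with an explicit format
    val sales = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/sales.csv")

    // FILTER ... BY amount > 0        ->  DataFrame.filter
    // GROUP ... BY region / SUM(...)  ->  groupBy + agg
    val byRegion = sales
      .filter("amount > 0")
      .groupBy("region")
      .agg(sum("amount").alias("total_amount"))

    // STORE ... INTO '...'            ->  DataFrame.write
    byRegion.write.mode("overwrite").parquet("data/sales_by_region")
    spark.stop()
  }
}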

Cloud Platform Migrations

On-premises to Cloud:
Migrate our Spark application to AWS EMR:

1. Update file paths from HDFS to S3
2. Configure AWS credentials handling
3. Update cluster configuration for EMR
4. Modify logging to use CloudWatch
5. Create EMR step definitions for our jobs
6. Update the CI/CD pipeline for EMR deployment
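
The core code change for step 1 is usually just the storage URI. The sketch below uses a hypothetical bucket and assumes credentials come from the cluster's instance profile rather than from application code.

// Storage-path change: HDFS paths become S3 URIs (bucket name is hypothetical)
import org.apache.spark.sql.SparkSession

object EmrPathsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("emr-job").getOrCreate()

    // Before: spark.read.parquet("hdfs:///data/events/")
    val events = spark.read.parquet("s3://my-data-bucket/events/")

    events.filter("event_type = 'purchase'")
      .write.mode("overwrite")
      .parquet("s3://my-data-bucket/curated/purchases/")

    spark.stop()
  }
}
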
Between Cloud Providers:
Migrate our Spark application from EMR to Databricks:

1. Inventory all EMR-specific configurations
2. Map S3 paths to DBFS equivalents
3. Convert EMR bootstrap scripts to init scripts
4. Update IAM roles to Databricks service principals
5. Migrate Step Functions orchestration to Databricks Workflows
6. Create a parallel testing strategy

Code Transformation

API Updates

OpenHands can automatically update deprecated APIs:
Update all deprecated Spark APIs in our codebase:

1. Scan for deprecated method usages
2. Replace with recommended alternatives
3. Update import statements as needed
4. Add comments noting the changes for review
5. Run the test suite to verify no regressions
Common API updates:
// Before (Spark 2.x)
val df = spark.read.format("json").load(path)
df.registerTempTable("temp")

// After (Spark 3.x)
val df = spark.read.format("json").load(path)
df.createOrReplaceTempView("temp")

Performance Optimization

Improve performance during migration:
Optimize our Spark jobs during the migration:

1. Replace `collect()` with `take()` or `foreach()` where appropriate
2. Convert repeated DataFrame operations to use caching
3. Optimize shuffle operations with appropriate partitioning
4. Reorder operations so filters and projections (narrow transformations) run before wide, shuffle-heavy ones where possible
5. Update broadcast join hints for large dimension tables
6. Profile before and after with Spark UI metrics
Key optimization patterns:
| Anti-pattern | Optimization | Impact |
| --- | --- | --- |
| Multiple count() calls | Cache and count once | Reduces recomputation |
| Small file output | Coalesce before write | Fewer files, faster reads |
| Skewed joins | Salting or broadcast | Eliminates stragglers |
| UDFs for simple ops | Built-in functions | Catalyst optimization |
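
A sketch combining three of these patterns: caching a reused DataFrame, broadcasting a small dimension table, and coalescing before write. Table, column, and path names are hypothetical.

// Illustrative optimization patterns (hypothetical data)
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object OptimizationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("optimized-job").getOrCreate()

    val orders     = spark.read.parquet("data/orders")      // hypothetical inputs
    val dimRegions = spark.read.parquet("data/dim_regions") // small dimension table

    // Cache a DataFrame that is used more than once instead of recomputing it
    val recent = orders.filter("order_date >= '2024-01-01'").cache()
    println(s"recent orders: ${recent.count()}")

    // Broadcast the small dimension table to avoid a shuffle join
    val enriched = recent.join(broadcast(dimRegions), Seq("region_id"))

    // Coalesce before writing to avoid producing many small output files
    enriched.coalesce(8).write.mode("overwrite").parquet("data/enriched_orders")

    spark.stop()
  }
}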

Best Practices Application

Apply modern Spark best practices:
Refactor our Spark application to follow best practices:

1. Replace RDD operations with DataFrame/Dataset where possible
2. Use Spark SQL functions instead of UDFs
3. Implement proper error handling with try-catch
4. Add schema validation for input data
5. Implement idempotent writes for recovery
6. Add structured logging for debugging
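
A sketch of two of these practices, enforcing an explicit input schema and using built-in functions instead of a simple UDF; column and path names are hypothetical.

// Schema validation on read and Catalyst-friendly built-in functions
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

object BestPracticesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("best-practices").getOrCreate()

    // Declare the expected schema instead of relying on inference
    val schema = StructType(Seq(
      StructField("user_id", StringType, nullable = false),
      StructField("country", StringType, nullable = true),
      StructField("created_at", TimestampType, nullable = true)
    ))

    val users = spark.read.schema(schema).json("data/users")

    // Prefer built-in functions (optimizable by Catalyst) over an equivalent UDF
    val normalized = users.withColumn("country", upper(col("country")))

    normalized.write.mode("overwrite").parquet("data/users_normalized")
    spark.stop()
  }
}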

Testing and Validation

Job Testing

Create comprehensive tests for migrated jobs:
Create a test suite for our migrated Spark jobs:

1. Unit tests for transformation logic using local SparkSession
2. Integration tests with sample data files
3. Schema validation tests for input and output
4. Property-based tests for key business logic
5. Test fixtures that work with both Spark versions
Example test structure:
import org.scalatest.funsuite.AnyFunSuite

class MigrationTest extends AnyFunSuite with SparkSessionTestWrapper {
  test("transformed output matches expected schema") {
    val input = spark.read.json("src/test/resources/input.json")
    val result = MyTransformations.process(input)
    
    assert(result.schema === expectedSchema)
  }
  
  test("business logic produces same results as legacy") {
    val input = loadTestData()
    val newResult = NewPipeline.run(input)
    val legacyResult = loadLegacyOutput()
    
    assertDataFrameEquals(newResult, legacyResult)
  }
}
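
The example assumes a SparkSessionTestWrapper helper and an assertDataFrameEquals assertion provided elsewhere (for instance by a DataFrame testing library or a small in-house utility). A minimal local-mode version of the wrapper might look like this:

import org.apache.spark.sql.SparkSession

// Hypothetical helper trait: provides a lazily created local SparkSession for tests
trait SparkSessionTestWrapper {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("migration-tests")
    .config("spark.sql.shuffle.partitions", "2") // keep shuffles small for fast tests
    .getOrCreate()
}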

Performance Benchmarking

Compare performance between versions:
Create performance benchmarks for our migration:

1. Set up identical test datasets of 1GB, 10GB, and 100GB
2. Measure job completion time for both versions
3. Compare resource utilization (CPU, memory, shuffle)
4. Track stage-level metrics from Spark UI
5. Generate a comparison report with recommendations
Benchmark metrics to track:
  • Job duration (wall clock time)
  • Shuffle read/write bytes
  • Peak executor memory
  • Task distribution (min/max/median)
  • Garbage collection time
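
One way to capture several of these metrics programmatically, rather than reading them off the Spark UI, is a custom SparkListener; the sketch below aggregates a few task-level metrics and is illustrative rather than exhaustive.

import java.util.concurrent.atomic.LongAdder
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Aggregates a few task-level metrics for a before/after comparison
class BenchmarkListener extends SparkListener {
  val shuffleReadBytes  = new LongAdder
  val shuffleWriteBytes = new LongAdder
  val gcTimeMs          = new LongAdder

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      shuffleReadBytes.add(m.shuffleReadMetrics.totalBytesRead)
      shuffleWriteBytes.add(m.shuffleWriteMetrics.bytesWritten)
      gcTimeMs.add(m.jvmGCTime)
    }
  }
}

// Usage: spark.sparkContext.addSparkListener(new BenchmarkListener)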

Data Validation

Ensure data correctness after migration:
Validate that our migrated pipeline produces correct output:

1. Run both pipelines on the same input dataset
2. Compare row counts between outputs
3. Perform checksum comparison on key columns
4. Validate aggregations match exactly
5. Check for NULL handling differences
6. Generate a data quality report
Validation approaches:
# Compare outputs as multisets of rows (order-independent, duplicates respected)
old_df = spark.read.parquet("output/v2/")
new_df = spark.read.parquet("output/v3/")

missing = old_df.exceptAll(new_df).count()  # rows in the v2 output but not in v3
extra = new_df.exceptAll(old_df).count()    # rows in the v3 output but not in v2
assert missing == 0 and extra == 0, f"{missing} missing and {extra} unexpected rows"

Examples

Complete Spark 2 to 3 Migration

Migrate our Spark 2.4 ETL pipeline to Spark 3.5:

Project structure:
- src/main/scala/etl/
  - ExtractJob.scala
  - TransformJob.scala
  - LoadJob.scala
- src/main/resources/
  - application.conf

Requirements:
1. Update all deprecated APIs
2. Migrate from legacy date parsing
3. Update to new Catalog API for Hive tables
4. Preserve all business logic exactly
5. Update build.sbt with new dependencies
6. Create a test suite comparing old and new outputs
7. Document all breaking changes found
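
For requirement 5, the build.sbt change is typically a version bump; the versions below are indicative only (Spark 3.5 builds against Scala 2.12 or 2.13), so pin them to what your cluster actually runs.

// Indicative build.sbt changes for a Spark 2.4 -> 3.5 upgrade
scalaVersion := "2.12.18"   // Spark 3.5 supports Scala 2.12 and 2.13

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.5.1" % "provided",
  "org.scalatest"    %% "scalatest"  % "3.2.18" % Test
)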

Hive to Spark SQL Migration

Convert our Hive data warehouse queries to Spark SQL:

Hive scripts to migrate:
- daily_aggregation.hql
- customer_segments.hql
- revenue_report.hql

Requirements:
1. Convert HiveQL to Spark SQL
2. Replace Hive UDFs with Spark equivalents
3. Optimize for Spark execution
4. Maintain Hive metastore compatibility
5. Create performance comparison benchmarks

EMR to Databricks Migration

Migrate our EMR Spark pipeline to Databricks:

Current setup:
- EMR 6.x with Spark 3.1
- S3 for data storage
- Step Functions for orchestration
- CloudWatch for monitoring

Target:
- Databricks on AWS
- Unity Catalog for data governance
- Databricks Workflows for orchestration
- Built-in Databricks monitoring

Deliverables:
1. Converted notebook or job definitions
2. Updated storage configurations
3. Workflow definitions
4. IAM/service principal mappings
5. Migration runbook