Spark Developer / Engineer
Job Title: Spark Developer / Engineer (2 positions)
Location: US Remote (working hours aligned to PST)
Duration: 6-12 Months
Workflows are powered by offline batch jobs written in Scalding, a MapReduce-based framework. To enhance scalability and performance, these jobs are being migrated from Scalding to Apache Spark.
Key Responsibilities:
Understanding the Existing Scalding Codebase
o Analyze the current Scalding-based data pipelines.
o Document existing business logic and transformations.
Migrating the Logic to Spark
o Convert existing Scalding jobs into Spark (PySpark/Scala) while ensuring optimized performance (see the sketch after this list).
o Refactor data transformations and aggregations in Spark.
o Optimize Spark jobs for efficiency and scalability.
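For illustration only, a minimal before/after sketch of the kind of conversion involved (the job names, class names, and paths below are hypothetical; real pipelines will be more involved):

    // A typical Scalding job: read text, count words, write TSV.
    import com.twitter.scalding._

    class WordCountJob(args: Args) extends Job(args) {
      TypedPipe.from(TextLine(args("input")))
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1L))
        .sumByKey
        .write(TypedTsv[(String, Long)](args("output")))
    }

    // An equivalent Spark (Scala) job using the Dataset API.
    import org.apache.spark.sql.SparkSession

    object WordCountSparkJob {
      def main(cli: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("WordCount").getOrCreate()
        import spark.implicits._
        spark.read.textFile(cli(0))
          .flatMap(_.split("\\s+"))
          .groupByKey(identity)
          .count()                               // yields (word, count) pairs
          .write.option("sep", "\t").csv(cli(1))
        spark.stop()
      }
    }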
Ensuring Data Parity & Validation
o Develop data parity tests to compare outputs between the Scalding and Spark implementations (sketch below).
o Identify and resolve any discrepancies between the two versions.
o Work with stakeholders to validate correctness.
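One possible shape for such a parity check, assuming both jobs emit comparable tab-separated output (the paths below are hypothetical):

    import org.apache.spark.sql.SparkSession

    object ParityCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("ParityCheck").getOrCreate()

        // Read both outputs with identical parsing options.
        val scaldingOut = spark.read.option("sep", "\t").csv("/data/scalding/out")
        val sparkOut    = spark.read.option("sep", "\t").csv("/data/spark/out")

        // exceptAll (Spark 2.4+) preserves duplicates, so row multiplicity
        // is compared as well, not just distinct row sets.
        val missing = scaldingOut.exceptAll(sparkOut).count() // rows lost in migration
        val extra   = sparkOut.exceptAll(scaldingOut).count() // rows introduced by it

        require(missing == 0 && extra == 0,
          s"Parity mismatch: $missing rows missing, $extra rows extra")
        spark.stop()
      }
    }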
Writing Unit Tests & Improving Code Quality
o Implement robust unit and integration tests for Spark jobs (example after this list).
o Ensure code meets engineering best practices (modular, reusable, and well-documented).
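As a sketch of the expected testing style, assuming ScalaTest (an assumption, not a stated requirement) and a transformation exercised against a local SparkSession; the suite and test below are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.scalatest.funsuite.AnyFunSuite

    class AggregationSpec extends AnyFunSuite {
      // local[2]: two local threads, so no cluster is needed in CI.
      private val spark = SparkSession.builder
        .master("local[2]")
        .appName("spark-unit-test")
        .getOrCreate()
      import spark.implicits._

      test("sums counts per key") {
        val input  = Seq(("a", 1L), ("a", 2L), ("b", 5L)).toDS()
        val result = input
          .groupByKey(_._1)
          .mapValues(_._2)
          .reduceGroups(_ + _)
          .collect()
          .toMap
        assert(result == Map("a" -> 3L, "b" -> 5L))
      }
    }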
Required Qualifications:
- Experience in big data processing with Apache Spark (PySpark or Scala).
- Strong experience with data migration from legacy systems to Spark.
- Proficiency in Scalding and MapReduce frameworks.
- Experience with Hadoop, Hive, and distributed data processing.
- Hands-on experience in writing unit tests for Spark pipelines.
- Strong SQL and data validation experience.
- Proficiency in Python and Scala.
- Knowledge of CI/CD pipelines for data jobs.
- Familiarity with the Apache Airflow orchestration tool.