Responsibilities
• Monitor and maintain production data pipelines to ensure 99.9% uptime and optimal performance
• Implement comprehensive logging, alerting, and monitoring systems using application monitoring tools
• Perform regular health checks covering performance, job execution times, and resource utilization to identify and resolve bottlenecks proactively
• Manage incident response procedures for pipeline failures, including root cause analysis, resolution, and post-incident reviews
• Establish and maintain disaster recovery procedures and backup strategies for critical data assets within the Databricks environment
• Conduct regular performance tuning of Spark jobs and Databricks cluster configurations to optimize cost and execution efficiency
• Maintain comprehensive documentation for operational procedures, runbooks, and troubleshooting guides
• Coordinate scheduled maintenance windows and system upgrades with minimal business i...