SRE Product Support Manager
Overview
As a Site Reliability Engineer (SRE) at SysMind, you will play a key role in maintaining the performance, scalability, and availability of enterprise-grade applications and systems. You’ll work at the intersection of software engineering and operations, building automation, improving observability, and ensuring the resilience of production environments.
This position is ideal for engineers who are passionate about system optimization, root-cause analysis, and continuous improvement. You’ll collaborate with cross-functional teams to diagnose complex issues, enhance deployment pipelines, and design fail-safe recovery mechanisms that keep critical systems running without interruption.
Roles & Responsibilities
- Ensure the availability, scalability, and performance of production systems across multiple environments.
- Design, build, and maintain monitoring, alerting, and incident response systems using tools such as Prometheus, Grafana, and the ELK Stack.
- Implement infrastructure automation and configuration management using Ansible, Terraform, or similar frameworks.
- Collaborate with development teams to establish SLOs, SLIs, and SLAs, embedding reliability into software design.
- Conduct root-cause analysis for production incidents, and document the fixes and preventive actions.
- Drive continuous improvement initiatives to enhance resilience, reduce downtime, and automate repetitive tasks.
- Optimize CI/CD pipelines for reliable and secure application deployments.
If you thrive on building stable, scalable, and high-performing systems and want to be at the heart of enterprise transformation, fill out the form below to apply for this position.
