Site Reliability Engineer (SRE)

Mumbai | Remote (India) • Full-time • Senior • Cloud/DevOps

SLIs/SLOs, incident response, reliability engineering.

About the Role

You'll ensure the reliability, availability, and performance of our production systems by designing SLOs, implementing monitoring strategies, and leading incident response.

What You'll Do

Define and implement SLIs, SLOs, and error budgets

Design and maintain comprehensive monitoring and alerting systems

Lead incident response and post-mortem processes

Implement automation to reduce toil and improve reliability

Conduct capacity planning and performance optimization

Build tools for deployment, monitoring, and troubleshooting

Collaborate with engineering teams on reliability requirements

What You'll Bring

5+ years of SRE or production operations experience

Strong programming skills (Python, Go, or similar)

Deep understanding of distributed systems and microservices

Experience with observability tools (Prometheus, Jaeger, ELK)

Knowledge of incident management and on-call practices

Understanding of chaos engineering and reliability testing

Experience with cloud platforms and auto-scaling

Nice to Have

Experience with large-scale distributed systems

Knowledge of performance engineering and optimization

Familiarity with machine learning for operations (AIOps)

Understanding of disaster recovery and business continuity

Why Join StackBinary™?

Flexible working hours

Remote-friendly culture

Learning & development budget

High-ownership projects

Pragmatic engineering culture

Work with cutting-edge tech

Ready to Apply?

Join our team of builders who love shipping quality software.

Questions about this role? Email careers@stackbinary.io
