Responsibilities

Design, build, and maintain scalable, reliable, and secure infrastructure across AWS (including Elastic Beanstalk) and Azure.
Develop and manage CI/CD pipelines using Azure DevOps, GitHub Actions, or similar tools to ensure smooth, automated deployments.
Operate, monitor, and troubleshoot Kubernetes clusters (EKS, AKS, or self-managed) to ensure system stability and uptime.
Implement comprehensive observability solutions using Prometheus, Grafana, Loki, and Alertmanager.
Automate infrastructure provisioning and configuration using Terraform, Helm, CloudFormation, and/or Ansible.
Define, measure, and improve system reliability through SLOs, SLIs, and SLAs.
Enhance system resilience and incident response through proactive monitoring and capacity planning.
Manage secrets, access control, and security policies to maintain a robust and compliant infrastructure.
Participate in on-call rotations, respond to incidents, and drive root cause analysis and post-incident reviews.
Collaborate closely with development teams to embed reliability and scalability best practices throughout the software lifecycle.

Requirements

5+ years of experience in a Site Reliability, DevOps, or Cloud Engineering role.
Strong hands-on experience with AWS (EC2, VPC, IAM, CloudWatch, Elastic Beanstalk, RDS, S3) and familiarity with Azure services.
Proven experience deploying and managing containerized applications using Kubernetes (EKS/AKS) and Docker.
Skilled in CI/CD pipeline development and multi-cloud workflows (for example, Azure DevOps and GitHub Actions).
Solid understanding of observability tools such as Prometheus, Grafana, Loki, and Alertmanager.
Proficiency in infrastructure-as-code tools such as Terraform, CloudFormation, or similar.
Scripting skills in Bash, Python, or PowerShell.
Strong grasp of networking, Linux systems, and cloud security best practices.
Excellent problem-solving skills with a focus on performance, scalability, and reliability.

Senior Site Reliability Engineer