Responsibilities
- Design, build, and maintain scalable, reliable, and secure infrastructure across AWS (including Elastic Beanstalk) and Azure.
- Develop and manage CI/CD pipelines using Azure DevOps, GitHub Actions, or similar tools to ensure smooth and automated deployments.
- Operate, monitor, and troubleshoot Kubernetes clusters (EKS, AKS, or self-managed) to ensure system stability and uptime.
- Implement comprehensive observability solutions using Prometheus, Grafana, Loki, and Alertmanager.
- Automate infrastructure provisioning and configuration using Terraform, Helm, CloudFormation, or Ansible.
- Define, measure, and improve system reliability through SLOs, SLIs, and SLAs.
- Enhance system resilience and incident response through proactive monitoring and capacity planning.
- Manage secrets, access control, and security policies to maintain a robust and compliant infrastructure.
- Participate in on-call rotations, respond to incidents, and drive root cause analysis and post-incident reviews.
- Collaborate closely with development teams to embed reliability and scalability best practices throughout the software lifecycle.
Requirements
- 5+ years of experience in a Site Reliability, DevOps, or Cloud Engineering role.
- Strong hands-on experience with AWS (EC2, VPC, IAM, CloudWatch, Elastic Beanstalk, RDS, S3) and familiarity with Azure services.
- Proven experience deploying and managing containerized applications using Kubernetes (EKS/AKS) and Docker.
- Skill in building CI/CD pipelines and multi-cloud deployment workflows (Azure DevOps, GitHub Actions, or similar tools).
- Solid understanding of observability tools such as Prometheus, Grafana, Loki, and Alertmanager.
- Proficiency with infrastructure-as-code tools such as Terraform or CloudFormation.
- Scripting skills in Bash, Python, or PowerShell.
- Strong grasp of networking, Linux systems, and cloud security best practices.
- Excellent problem-solving skills with a focus on performance, scalability, and reliability.