Senior Site / Jakarta (Menteng)

Responsibilities

  • Design, build, and maintain scalable, reliable, and secure infrastructure across AWS (including Elastic Beanstalk) and Azure.
  • Develop and manage CI/CD pipelines using Azure DevOps, GitHub Actions, or similar tools to ensure smooth and automated deployments.
  • Operate, monitor, and troubleshoot Kubernetes clusters (EKS, AKS, or self-managed) to ensure system stability and uptime.
  • Implement comprehensive observability solutions using Prometheus, Grafana, Loki, and Alertmanager.
  • Automate infrastructure provisioning and configuration using Terraform, Helm, CloudFormation, and/or Ansible.
  • Define, measure, and improve system reliability through SLOs, SLIs, and SLAs.
  • Enhance system resilience and incident response through proactive monitoring and capacity planning.
  • Manage secrets, access control, and security policies to maintain a robust and compliant infrastructure.
  • Participate in on-call rotations, respond to incidents, and drive root cause analysis and post-incident reviews.
  • Collaborate closely with development teams to embed reliability and scalability best practices throughout the software lifecycle.

Requirements

  • 5+ years of experience in a Site Reliability, DevOps, or Cloud Engineering role.
  • Strong hands-on experience with AWS (EC2, VPC, IAM, CloudWatch, Elastic Beanstalk, RDS, S3) and familiarity with Azure services.
  • Proven experience deploying and managing containerized applications using Kubernetes (EKS/AKS) and Docker.
  • Skilled in CI/CD pipeline development and multi-cloud workflows (Azure DevOps, GitHub Actions, etc.).
  • Solid understanding of observability tools such as Prometheus, Grafana, Loki, and Alertmanager.
  • Proficiency in infrastructure-as-code tools such as Terraform or CloudFormation.
  • Scripting skills in Bash, Python, or PowerShell.
  • Strong grasp of networking, Linux systems, and cloud security best practices.
  • Excellent problem-solving skills with a focus on performance, scalability, and reliability.