Site Reliability / Jakarta

Responsibilities:

● Perform all responsibilities of L1 Operational Support, including alert monitoring, triage, and initial response

● Own end-to-end incident management for production issues, from detection through resolution

● Execute configuration and infrastructure changes to resolve recurring operational issues using GitOps practices

● Apply fixes in production environments safely and efficiently via version-controlled repositories and CI/CD pipelines

● Lead incident coordination during outages, including communication, timeline tracking, and stakeholder updates

● Conduct root cause analysis and participate in post-incident reviews

● Maintain, improve, and expand incident runbooks, SOPs, and operational documentation

Qualifications:

● Strong understanding of Linux, web applications, and containerized environments

● Hands-on experience with Kubernetes (Deployments, HPA, PVCs, Pods) and GitOps tools such as Argo CD

● Solid knowledge of AWS services (EKS, RDS, EC2, S3, IAM, VPC, CloudWatch)

● Proficiency in using observability stacks (Prometheus, Grafana, OpenTelemetry) for triage and validation

● Experience managing infrastructure through GitHub with proper change control

● Proven ability to troubleshoot under pressure and make safe, effective production changes

● Excellent documentation and communication skills for incident reporting and cross-team collaboration

● Ability to follow and execute incident runbooks under pressure

Benefits:

● Permanent position

● Performance bonus

● Eid allowance

● In-house development program