What you will do:

● Perform all responsibilities of L1 Operational Support, including alert monitoring, triage, and initial response

● Own end-to-end incident management for production issues, from detection through resolution

● Execute configuration and infrastructure changes to resolve recurring operational issues using GitOps practices

● Apply fixes in production environments safely and efficiently via version-controlled repositories and CI/CD pipelines

● Lead incident coordination during outages, including communication, timeline tracking, and stakeholder updates

● Conduct root cause analysis and participate in post-incident reviews

● Maintain, improve, and expand incident runbooks, SOPs, and operational documentation

Qualifications:

● Strong understanding of Linux, web applications, and containerized environments

● Hands-on experience with Kubernetes (Deployments, HPA, PVCs, Pods) and GitOps tools like ArgoCD

● Solid knowledge of AWS services (EKS, RDS, EC2, S3, IAM, VPC, CloudWatch)

● Proficiency with observability stacks (Prometheus, Grafana, OpenTelemetry) for triage and validation

● Experience managing infrastructure through GitHub with proper change control

● Proven ability to troubleshoot under pressure and make safe, effective production changes

● Excellent documentation and communication skills for incident reporting and cross-team collaboration

● Ability to follow and execute incident runbooks under pressure

Preferred Qualifications (Nice-to-Have):

● Experience with Infrastructure-as-Code tools (e.g., Terraform)

● Prior on-call or incident command experience

● Basic scripting skills (Bash, Python) for automation

● Understanding of how to correlate logs, metrics, and traces in observability platforms

Benefits:

Site Reliability Engineer