What you will do:
● Perform all responsibilities of L1 Operational Support, including alert monitoring,
triage, and initial response
● Own end-to-end incident management for production issues, from detection through
resolution
● Execute configuration and infrastructure changes to resolve recurring operational
issues using GitOps practices
● Apply fixes in production environments safely and efficiently via version-controlled
repositories and CI/CD pipelines
● Lead incident coordination during outages, including communication, timeline
tracking, and stakeholder updates
● Conduct root cause analysis and participate in post-incident reviews
● Maintain, improve, and expand incident runbooks, SOPs, and operational
documentation
Qualifications:
● Strong understanding of Linux, web applications, and containerized environments
● Hands-on experience with Kubernetes (Deployments, HPA, PVCs, Pods) and GitOps
tools like ArgoCD
● Solid knowledge of AWS services (EKS, RDS, EC2, S3, IAM, VPC, CloudWatch)
● Proficiency in using observability stacks (Prometheus, Grafana, OpenTelemetry) for
triage and validation
● Experience managing infrastructure through GitHub with proper change control
● Proven ability to troubleshoot under pressure and make safe, effective production
changes
● Excellent documentation and communication skills for incident reporting and
cross-team collaboration
● Ability to follow and execute incident runbooks under pressure
Preferred Qualifications (Nice-to-Have):
● Experience with Infrastructure-as-Code tools (e.g., Terraform)
● Prior on-call or incident command experience
● Basic scripting skills (Bash, Python) for automation
● Understanding of how to correlate logs, metrics, and traces in observability platforms
Benefits:
- Performance bonus
- Career Development
- Eid allowance
- Development