Responsibilities:
● Perform all responsibilities of L1 Operational Support, including alert monitoring, triage, and initial response
● Own end-to-end incident management for production issues, from detection through resolution
● Execute configuration and infrastructure changes to resolve recurring operational issues using GitOps practices
● Apply fixes in production environments safely and efficiently via version-controlled repositories and CI/CD pipelines
● Lead incident coordination during outages, including communication, timeline tracking, and stakeholder updates
● Conduct root cause analysis and participate in post-incident reviews
● Maintain, improve, and expand incident runbooks, SOPs, and operational documentation
Qualifications:
● Strong understanding of Linux, web applications, and containerized environments
● Hands-on experience with Kubernetes (Deployments, HPA, PVCs, Pods) and GitOps tools like ArgoCD
● Solid knowledge of AWS services (EKS, RDS, EC2, S3, IAM, VPC, CloudWatch)
● Proficiency in using observability stacks (Prometheus, Grafana, OpenTelemetry) for triage and validation
● Experience managing infrastructure through GitHub with proper change control
● Proven ability to troubleshoot under pressure and make safe, effective production changes
● Excellent documentation and communication skills for incident reporting and cross-team collaboration
● Ability to follow and execute incident runbooks under pressure
Benefits:
● Permanent employment
● Performance bonus
● Eid allowance
● In-house development program