Site Reliability / Yogyakarta

POSITION SUMMARY:

AccelByte is building a 24x7 operations team for AAA multiplayer video games. For this position, we need a driven Site Reliability Engineer who can actively participate in day-to-day combat by maintaining high reliability of our services and driving prioritization of fixes for what may be broken today. The role also calls for someone able to envision, design, and implement processes and technologies that improve our ability to identify, isolate, correlate, and mitigate service-impacting problems in the system. The Site Reliability Engineer must also know enough coding to automate routine tasks in gathering, correlating, organizing, and presenting service metrics, in addition to performing detailed, in-depth root cause analysis.

ESSENTIAL FUNCTIONS/RESPONSIBILITIES:

The Site Reliability Engineer (SRE) is accountable for the following functions and responsibilities:

  • Design, implement, and maintain infrastructure for applications
  • Build and run service deployment using K8s and other CNCF projects
  • Provide a secure, highly scalable, and cost-effective cloud platform
  • Construct and build effective systems to monitor the health of our system/applications, and to handle outages
  • Solve problems occurring in all our environments and create solutions to prevent them from happening again
  • Produce automation and innovative tools to assist the product development teams and to deliver operational excellence
  • Create and maintain infrastructure-related documentation and SRE runbooks
  • Collaborate with other stakeholders to provide cost-effective, operationally excellent, and performance-efficient infrastructure solutions to improve our products
  • Identify technology, process gaps, and opportunities for improvement
  • Liaise, communicate, and work directly with our client

QUALIFICATIONS/EXPERIENCE REQUIRED:

  • 2+ years of Linux administration experience
  • Degree in Computer Science or equivalent experience
  • Prior experience helping design, manage, and run large-scale applications in the cloud
  • Experience with monitoring systems and strategies (System Admin)
  • Solid performance and troubleshooting skills
  • Solid foundation in distributed systems
  • Robust knowledge of and experience with at least one cloud provider (AWS or GCP preferred)
  • Experience with containerization principles and frameworks such as Docker, Container, Kubernetes, etc.
  • Proven track record with infrastructure as code (Terraform is a must), configuration management, and package managers (e.g., Helm charts)
  • Proven experience with automation, CI/CD, and GitOps tools such as Jenkins, GitLab, GitHub, Flux, and/or ArgoCD
  • Experience with monitoring and alerting tools such as Prometheus, Grafana, ELK/EFK, Splunk, Datadog, OpsGenie, PagerDuty, etc.
  • Experience within a greenfield environment, building infrastructure from scratch
  • Software development and scripting experience with Bash, Python, and/or Golang
  • Ability to work with clients on tight deadlines and fluid requirements
  • Good communication skills (escalating issues, explaining incidents)
  • Fluent in English, both spoken and written
  • Willing to work in shifts (24/7)

QUALIFICATIONS/EXPERIENCE PREFERRED:

  • Contributions to open source projects and participation in technical communities
  • Experience working for or with AAA game studios
  • JVM tuning and troubleshooting
  • Experience with web services
  • Experience in Networking, Security, or Storage
  • Experience managing SQL and NoSQL databases
  • Familiarity with Perforce version control