About the Role
We are looking for an experienced Site Reliability Engineer to design, implement, and manage our cloud-based infrastructure on Google Cloud Platform (GCP) from the ground up. The ideal candidate will ensure our systems are highly available, reliable, scalable, and efficient while collaborating closely with software engineers to deliver robust services.
Responsibilities
- Design, implement, and manage scalable cloud infrastructure on Google Cloud Platform (GCP).
- Monitor and manage system health using the Grafana Stack, and handle error tracking and resolution via Sentry.
- Perform system troubleshooting and problem-solving across platform and application domains.
- Develop and maintain automation tools and CI/CD pipelines to enhance deployment efficiency.
- Collaborate with software engineering teams to ensure service reliability and high availability.
- Recommend improvements in architecture and processes to enhance system performance and maintainability.
- Evaluate new technologies and make pragmatic decisions to deliver maximum business value.
- Conduct performance tuning, load balancing, and automation for improved system stability and scalability.
Requirements
- Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent practical experience).
- Minimum 5 years of experience as a Site Reliability Engineer (SRE) or in a similar production cloud environment role (preferably GCP).
- Hands-on experience with Grafana Stack (Grafana, Loki, Prometheus, Tempo) and Sentry.
- Strong understanding of Software Engineering principles and CI/CD practices.
- Proven ability in system troubleshooting, incident response, and performance optimization.
- Excellent analytical, problem-solving, and decision-making skills, especially under pressure.
- Self-motivated, with the ability to manage multiple priorities in a fast-paced environment.
- Strong communication and interpersonal skills, with a collaborative mindset.