Responsibilities
- Champion and implement a culture of SRE to maintain a high-quality platform infrastructure
- Champion and implement application and infrastructure monitoring and alerting to prevent client-impacting issues, ensuring the system availability, performance, and scalability needed to meet SLOs and SLAs
- Optimize application performance at scale
- Define and support continuous integration and continuous deployment (CI/CD) pipelines; manage bare-metal hardware, cloud instances, containers, sandboxes, networking, VPNs, storage, databases, caches, websites, monitoring, logging, backups, ETL, security of web services, etc.
- Dive deep into technology and stay on the forefront of the latest tools, technologies, and strategies; help evaluate, prototype, and integrate them into work processes
- Perform with broad independence and deliver on project milestones and tasks on schedule while communicating progress regularly
- Build strong relationships with SRE team members and software engineering teams to hold each other accountable for quality expectations
- Evangelize best practices, eliminate bottlenecks, and improve processes
- Maintain monitoring, logging, and backup systems
- Respond quickly to breakage and security incidents
- Author documentation and guides for infrastructure and tooling
- Collaborate with developer teams to ensure timely delivery to the Sandbox and Production environments
- Write automated tests to verify code correctness and performance
- Implement best practices in security and data protection
Requirements
- Bachelor's or Master's degree in Computer Science, or equivalent work experience
- 5+ years demonstrating hands-on technical leadership and business impact in combining software skills with systems to solve complex automation and reliability challenges
- 5+ years working with various cloud providers, containerization technologies, automated deployment frameworks, orchestration frameworks, monitoring, logging, alerting, system internals, networking, databases, distributed systems, and service-oriented architecture
- 5+ years instrumenting proactive alerting and monitoring systems (e.g., Splunk, Grafana, New Relic, Datadog, VictoriaMetrics)
- Strong experience with automation tools such as Rundeck and Ansible
- 3+ years of experience writing software in a modern programming language or framework such as C#/.NET, Java, JavaScript, or React
- 3+ years of experience with open-source CI/CD tools such as Jenkins, GitLab, and GitHub Actions
- Proven track record of implementing load, stress, performance, and reliability testing standards at scale to improve service, platform, and infrastructure resiliency
- Strong experience with the ELK stack and familiarity with microservices architectures