Qualifications:
- 3–5 years of experience as a Performance Engineer or Site Reliability Engineer (SRE) focused on capacity and performance.
- Bachelor's degree in Computer Science, Information Systems, or a related field.
- Proficient in monitoring and observability tools (Prometheus, Grafana, ELK Stack, Datadog, CloudWatch, or Google Cloud Monitoring).
- Strong analytical skills; proficient in SQL and Excel/Google Sheets.
- Skilled in Python (Pandas, Matplotlib) or another scripting language such as Bash for automation and analysis.
- Solid understanding of system performance metrics, databases, and networking.
- Hands-on experience with Azure Cloud, including auto-scaling, pricing models, and FinOps practices.
- Knowledge of statistical modeling or machine learning for performance forecasting is a plus.
- Familiarity with performance testing tools such as JMeter or Gatling is an advantage.

Responsibilities:
- Ensure IT infrastructure and resources have sufficient capacity to meet current and future business needs efficiently.
- Develop and implement capacity plans for hardware, software, and network resources.
- Monitor and analyze system performance and utilization data.
- Build forecasting models to predict future capacity needs.
- Identify IT resource requirements and recommend optimization strategies.
- Collaborate with Architecture and Development teams to model performance impacts of new features.
- Maintain dashboards and reports on capacity and performance.
- Provide cost optimization recommendations (e.g., rightsizing, reserved instances).