Senior Site Reliability Engineer (SRE)
Join a Leading Enterprise Technology Management Platform
Are you passionate about building highly scalable, secure, and reliable systems that enable businesses to automate and streamline key processes? We are a dynamic and innovative company, offering a SaaS solution that automates and orchestrates enterprise IT operations such as onboarding, offboarding, audit readiness, and more. Our platform reduces reliance on manual tasks and allows enterprises to optimize their existing infrastructure.
We’re seeking an experienced Senior Site Reliability Engineer (SRE) to join our growing team. If you thrive in a DevSecOps environment and enjoy tackling the challenges of large-scale, distributed microservices, this is the opportunity for you!
What You’ll Do:
- Ensure System Performance: Gather and analyze platform and application metrics to continuously improve performance and resolve faults.
- Collaborate with Engineering: Partner with top-tier engineering teams to enhance service reliability through rigorous testing and automation.
- Automation and Scalability: Build sustainable systems and services through automation, enabling efficient completion of key projects.
- System Uptime: Balance speed and reliability by managing well-defined service level objectives (SLOs) for minimal downtime and maximum availability.
- Security and Compliance: Implement practices to ensure systems align with security, compliance, and availability commitments.
- System Upgrades: Plan and execute system upgrades to keep our services current and resilient.
- Mentorship: Mentor and train engineers across the organization, fostering a culture of continuous improvement.
- On-Call Rotation: Participate in the on-call rotation to support system reliability.
What You’ll Bring:
- Kubernetes Expertise: Extensive experience managing production Kubernetes clusters, including deployment, scaling, and troubleshooting. Experience with Amazon EKS is a plus.
- Configuration Management: Proficiency with tools like Ansible, Helm, and Kustomize for automating system configuration and Kubernetes application deployment.
- Monitoring Tools: Familiarity with Prometheus, Grafana, or similar tools to monitor system health and optimize platform performance.
- AWS Cloud Services: Deep knowledge of AWS, including EC2, S3, IAM, and VPC, for building and managing scalable infrastructure.
- Infrastructure as Code (IaC): Experience using Terraform to automate cloud resource provisioning, ensuring efficient and repeatable infrastructure deployment.
- Queuing Systems: Experience with RabbitMQ, Kafka, or managed services like Amazon MQ for reliable communication between distributed systems.
- Database Management: Proficiency with MySQL and Amazon RDS for managing high-performance and highly available databases in the cloud.
- Networking & Security: Strong understanding of network design and security best practices to protect systems and ensure compliance.
- High-Uptime Environments: Proven experience in maintaining high availability with strategies for fault tolerance and disaster recovery.
- Cross-functional Collaboration: Ability to work effectively with multiple teams to achieve project goals in a fast-paced environment.
- Programming Skills: Experience with Python, Go, or JavaScript, and with modern CI/CD pipelines, to automate testing, deployment, and monitoring.
- Problem Solving: A proactive mindset for identifying and resolving performance issues and improving system processes.