Full-Time Senior Site Reliability Engineer
Articul8 is hiring a remote, full-time Senior Site Reliability Engineer. This is an experienced-level position open to remote applicants worldwide. Read the complete job description before applying.
Job Details
About Us
Articul8 AI delivers cutting-edge SaaS products that transform how businesses operate. Our platform empowers organizations to leverage the power of artificial intelligence in a reliable, scalable, and secure environment.
Position Overview
We are seeking an experienced Site Reliability Engineer (SRE) to join our team and help ensure the reliability, performance, and scalability of our GenAI SaaS platform. As an SRE, you will bridge the gap between development and operations, implementing automation and best practices to maintain our service reliability objectives while supporting rapid innovation.
Key Responsibilities
- Architect and maintain scalable, highly available infrastructure for our GenAI platform.
- Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.
- Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.
- Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.
- Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.
- Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.
- Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.
- Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads.
- Implement and enforce security best practices across all systems and environments.
- Create and maintain comprehensive documentation to foster a culture of shared knowledge.
Qualifications Required
- Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
- 5+ years of experience in DevOps, SRE, or similar roles
- Strong experience with cloud platforms (AWS, GCP, or Azure)
- Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)
- Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)
- Solid background in containerization technologies (Docker, Kubernetes)
- Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)
- Strong understanding of CI/CD pipelines and automation
- Exceptional troubleshooting and problem-solving skills
Qualifications Preferred
- Experience supporting AI/ML systems in production
- Knowledge of GPU infrastructure management and optimization