Full-Time Site Reliability Engineer / Team Lead
Omilia is hiring a remote, full-time Site Reliability Engineer / Team Lead. The career level for this opening is Manager, and the role accepts remote applicants based in Argentina. Read the complete job description before applying.
Omilia
Job Details
We are looking for an experienced Site Reliability Engineer/Team Lead to manage and coordinate a team of reliability engineers.
Your primary responsibility will be ensuring the high availability, security, and reliability of our cloud platform.
You will oversee incident resolution, change management, and team coordination, as well as training and developing team members.
Requirements
Change, Incident and Problem Management:
- Oversee the resolution of complex incidents.
- Coordinate with SREs to ensure timely incident resolution.
- Optimize technical processes to improve change quality.
- Track and report on incident resolution metrics.
- Ensure that SLOs & SLIs are defined and maintained.
Customer Escalation Handling:
- Handle escalated incidents and ensure timely resolution.
- Ensure customer satisfaction with the resolution process.
Team Coordination:
- Assign tasks and tickets to team members.
- Ensure proper documentation of incidents and resolutions.
- Provide guidance and support to SREs.
- Advise platform team members and other stakeholders on Omilia best practices.
Quality Assurance:
- Review and ensure the quality of incident responses and solutions.
- Conduct regular audits of incident reports and resolutions.
Training and Development:
- Develop training programs for new and existing team members.
- Conduct knowledge-sharing sessions.
Platform Security, High Availability and Reliability:
- Drive the design and development of the SRE infrastructure (including the Dev, Staging, and PreProd environments) and of maintenance tools covering the full system development lifecycle.
- Disaster recovery and backup strategy for data integrity and business continuity.
- Provisioning, configuration, and scaling for efficiency and consistency.
- Continuous improvement of the Omilia Cloud Platform.
Experience Required:
- 5-7 years of experience in SRE or related roles.
- Experience in large-scale system architecture and automation.
- Experience in a leadership role is an advantage.
- Bachelor's degree in Computer Science, Engineering, or a related field.
Must-have: AWS, Azure, Kubernetes, Docker, Terraform, Ansible
Nice-to-have: Python, Git, Shell scripting, Linux, SQL