Full-Time Tech Lead, Site Reliability Engineer
DITTO is hiring a remote, full-time Tech Lead, Site Reliability Engineer. This is an experienced-level role, open to remote applicants based in APAC. Read the complete job description before applying.
Job Details
About Ditto: Ditto is redefining how data moves at the edge. Our mission is to make it seamless for developers to build resilient, real-time applications, regardless of network conditions.
About the role: Ditto is at an inflection point. As we scale to meet the growing demands of our enterprise customers, we need experienced SRE Leads to drive and mature our Site Reliability Engineering practice. This is a unique opportunity to play a leading role in shaping enterprise-grade reliability, observability, and incident management, ensuring Ditto's systems meet the high standards our customers expect.
As a Lead SRE, you will:
- Line manage your regional squad of SREs, providing leadership and setting the standard for enterprise-ready reliability
- Develop a high-performing team through mentoring, coaching, and creating growth opportunities for engineers
- Engage with incident management and escalations, ensuring your squad continually improves its incident response and actively owns follow-ups
- Architect enterprise-grade observability solutions across complex distributed systems
- Actively lead and manage SRE initiatives, co-ordinating across teams where needed
- Guide the implementation of SLIs, SLOs, and SLAs that align with business objectives
- Establish best practices for documentation, runbooks, and knowledge sharing across engineering
- Play an active role in on-call, and manage your squad’s rotation
What you'll need:
- 7+ years of experience in Site Reliability Engineering or similar DevOps roles with a focus on system reliability and incident management
- 3+ years of experience leading and mentoring technical teams
- Strong experience with modern monitoring stacks including Prometheus, Grafana, and Datadog
- Proficiency in at least one systems programming language, such as Go, Rust, C, or Java
- Experience with Infrastructure as Code tools, like Terraform and Helm
- Hands-on experience architecting applications for Kubernetes, and managing Kubernetes infrastructure
- Experience with AWS and at least one other major cloud service provider (GCP, Azure)
- Excellent communication skills
- Experience maintaining on-call rotations and incident response procedures
- A high degree of agency, taking ownership of problems and identifying initiatives and improvements
- Proven project management skills and the ability to balance competing priorities and interrupts
- Understanding of security best practices in cloud environments
Nice to have:
- Experience directly line managing SREs
- Experience building or operating multi-tenant, multi-cloud SaaS/DBaaS platforms
- Familiarity with edge computing or mesh networking
- Experience instrumenting advanced observability practices (tracing, profiling) in distributed systems
- Experience working with globally distributed teams across EMEA and APAC regions