Full-Time Sr. Cloud Site Reliability Engineer
Serve Robotics is hiring a remote, full-time Sr. Cloud Site Reliability Engineer. The career level for this opening is Senior Manager, and the role accepts US-based remote applicants. Read the complete job description before applying.
Serve Robotics
Job Details
At Serve Robotics, we’re reimagining how things move in cities. Our personable sidewalk robot is our vision for the future. It’s designed to take deliveries away from congested streets, make deliveries available to more people, and benefit local businesses. The Serve fleet has been delighting merchants, customers, and pedestrians along the way in Los Angeles while doing commercial deliveries.
We’re looking for talented individuals who will grow robotic deliveries from surprising novelty to efficient ubiquity. We are tech industry veterans in software, hardware, and design who are pooling our skills to build the future we want to live in. We are solving real-world problems leveraging robotics, machine learning and computer vision, among other disciplines, with a mindful eye towards the end-to-end user experience. Our team is agile, diverse, and driven. We believe that the best way to solve complicated dynamic problems is collaboratively and respectfully.
This is a senior-level, individual contributor position. You will balance hands-on responsibilities (building and maintaining critical SRE tooling and processes) with technical leadership: guiding architecture decisions, mentoring others in SRE practices, and steering strategic initiatives to enhance system resiliency and availability.
You’ll collaborate across engineering, product, and operations teams to ensure our systems meet strict uptime and performance goals, all while aligning with overarching business objectives.
Responsibilities
- Instrumentation & Monitoring: Develop and refine monitoring and observability tools (metrics, logs, traces) to validate system availability and performance. Implement best practices for instrumentation using tools like Prometheus, Grafana, Datadog, or equivalent.
- Reliability Engineering: Collaborate with development teams to design and implement solutions for higher availability in the cloud. Lead the definition and management of Service Level Indicators (SLIs) and Service Level Objectives (SLOs), ensuring alignment with business goals. Perform capacity planning, load testing, and performance tuning to ensure systems can handle projected traffic and workloads.
- Incident Response & Prevention: Own the incident response process, including on-call rotation, alerts, and root cause analysis. Proactively identify reliability risks and propose mitigations to reduce system downtime. Conduct and facilitate postmortems to capture learnings, drive improvements, and prevent recurrence of issues.
- Align System Health with Business Metrics: Map system availability metrics to direct business value, ensuring stakeholders understand how reliability impacts overall company objectives. Create reporting dashboards that connect reliability data with KPIs and business goals.
- Technical Leadership & Mentorship: Serve as an in-house SRE expert, advising teams on reliability-oriented designs, coding practices, and testing methodologies. Mentor junior and mid-level engineers, fostering a culture of continuous learning, automation, and operational excellence.
- Collaboration & Education: Work closely with engineering, product, and operations teams to advocate for SRE best practices. Conduct training sessions and share knowledge to build a culture of reliability throughout the organization.