Full-Time Senior Reliability Engineer
Hive.co is hiring a remote Full-Time Senior Reliability Engineer. The career level for this opening is listed as Senior Manager, and the role accepts remote applicants based in Canada. Read the complete job description before applying.
Hive.co
Job Details
Hive is a fast-growing SaaS company. Our Engineering Team builds and maintains systems empowering customers. We ship MVPs, deploy multiple times daily, and iterate based on feedback. We handle high-volume data, integrations (Ticketmaster, Eventbrite), billions of customer data points, and 200 million emails/SMS monthly. Our tech stack includes Python, React, Redis, MongoDB, SQL, Elasticsearch, ClickHouse, and AWS services.
We seek a Senior Reliability Engineer to join our Reliability Team. This role bridges infrastructure, operations, and application engineering to ensure scalable, performant, secure, and cost-effective services.
What you'll do:
- Champion system observability improvements.
- Drive SLO adoption and improvement.
- Enhance application performance.
- Tackle complex technical challenges.
- Partner with development teams for scalable solutions.
- Lead security and compliance initiatives.
- Craft and refine developer tools.
- Develop and implement cost optimization strategies for cloud infrastructure.
- Collaborate with DevOps to maintain deployment pipelines.
- Contribute to incident management.
What we're looking for:
- 7+ years of software engineering experience, 5+ years in reliability/infrastructure/platform engineering.
- 3+ years of AWS experience, proven ability to build monitoring, alerting, and observability.
- Track record of implementing/improving SLOs and uptime KPIs.
- Expert knowledge of Linux, Docker, distributed systems.
- Solid programming skills (Python, Go).
- Strong security best practices, data-driven approach to stability.
- Excellent communication skills.
Bonus points:
- Scaling complex AWS environments.
- Creating developer platforms, CI/CD pipelines.
- Cloud cost optimization.
- Establishing/improving incident management processes.