Contractor Senior Site Reliability Engineer
Masabi is hiring a remote Contractor Senior Site Reliability Engineer. This is an experienced-level role open to remote applicants based in Argentina. Read the complete job description before applying.
Masabi
Job Details
We're looking for a Senior Site Reliability Engineer to join Masabi and be at the forefront of ensuring our platform's reliability, performance, and security. In this role, you'll be pivotal to scaling and modernising our platform while keeping it available and secure.
Responsibilities:
- Automation and Scalability: Drive automation to reduce operational overhead and human error. Build CI/CD pipelines, develop Infrastructure as Code (IaC) using tools like Terraform and CloudFormation, and design scalable systems to handle high traffic while optimising resource utilisation.
- Continuous Improvement: Refine processes, tools, and workflows to enhance system reliability, scalability, and efficiency. Plan capacity to anticipate future needs and support high-performance systems.
- Security and Compliance: Ensure infrastructure meets organisational security standards and supports compliance frameworks like SOC 2 and PCI DSS.
- Monitoring and Reliability: Maintain real-time monitoring systems aligned with SLIs and SLOs, ensuring uptime and performance meet or exceed SLAs. Set up proactive alerting mechanisms to address issues before they escalate.
- Cost Optimisation: Monitor and optimise cloud infrastructure costs through autoscaling, rightsizing, and architectural reviews to balance cost-effectiveness with reliability.
- Disaster Recovery and Redundancy: Implement failover strategies, disaster recovery plans, and redundancy to ensure system resilience under all conditions.
- Incident Management: Respond to production incidents, minimise downtime, and restore availability. Perform root cause analysis, implement preventive measures, and contribute to post-incident reviews to share lessons learned.
- Collaboration and Mentorship: Partner with developers to design reliable, maintainable systems. Coach teams on best practices for reliability, scalability, and observability, fostering a culture of ownership.
- Documentation and Knowledge Sharing: Maintain detailed documentation for infrastructure, incident response, and workflows. Develop playbooks and runbooks to ensure seamless knowledge transfer.
Key Tools and Technologies:
- Monitoring: Grafana, Prometheus, CloudWatch, Pingdom, Kibana
- CI/CD: GitLab CI, Rundeck
- IaC: Terraform, CloudFormation
- Cloud Platforms: AWS
About You:
- Significant experience in SRE or related roles, with a proven track record in building and maintaining reliable systems
- Expertise in AWS Cloud technologies
- Hands-on experience with Terraform and Grafana, along with strong knowledge of security principles and networking components
- Experience in building pipelines and robust CI/CD infrastructure
- A collaborative team player who approaches projects with an open mind and prioritises security
Desirable:
- Familiarity with PCI DSS v4 compliance requirements
- AWS Cloud certification
- Experience with container orchestration