Full-Time Site Reliability Engineer (SRE) Lead - Azure & SaaS
Xplor is hiring a remote, full-time Site Reliability Engineer (SRE) Lead - Azure & SaaS. This is an experienced-level position open to remote applicants based in Toronto, Canada. Read the complete job description before applying.
Xplor
Job Details
We are looking for a seasoned Site Reliability Engineer to help evolve and support our Azure-based SaaS platform, ideally with exposure to integrated payments systems. You will focus on building scalable infrastructure, optimizing secure CI/CD pipelines, and enabling full observability and automation in a fast-paced, cloud-native environment.
Essential Duties and Responsibilities
- Design and maintain secure, scalable CI/CD pipelines, incorporating tools such as SonarCloud for code quality and security scanning
- Build resilient, automated cloud infrastructure on Azure (with limited exposure to AWS as needed)
- Optimize platform performance, reliability, and cost-efficiency across distributed systems and cloud workloads
- Contribute to architecture and automation strategies for PCI-compliant, integrated payments services
- Lead incident response efforts and implement automation to reduce recurrence of production issues
- Implement and maintain observability across the platform using Coralogix, OpenTelemetry, Azure Monitor, and related tools
- Write and maintain Infrastructure as Code using Terraform, Ansible, or equivalent tools
- Eliminate complexity and manual operations through thoughtful automation and platform tooling
- Collaborate across engineering teams to embed reliability, scalability, and security into the development lifecycle
- Participate in on-call rotations for production support
Relevant Technologies
- Languages: Python, Bash, PowerShell, Java, C#
- Cloud Platforms: Microsoft Azure (primary), AWS (secondary)
- CI/CD & DevSecOps Tools: Azure DevOps, GitHub Actions, Bitbucket, Bamboo, SonarCloud, Snyk
- Infrastructure as Code: Terraform, Ansible, Spacelift
- Observability & Monitoring: Coralogix, OpenTelemetry, Azure App Insights, CloudWatch, APM tools
- Architecture: Kubernetes, Docker, microservices, serverless (Azure Functions)
Requirements
- 5+ years of experience in Site Reliability Engineering, DevOps, or Cloud Engineering roles
- Hands-on experience supporting Azure-native platforms at scale (AKS, App Services, Azure Functions, etc.)
- Proven track record in designing and optimizing secure CI/CD pipelines
- Experience supporting SaaS platforms in a cloud-native environment
- Strong scripting and automation skills (PowerShell, Bash, or Python)
- Expertise in system monitoring, alerting, and observability frameworks
- Experience with incident response, root cause analysis, and operational readiness best practices
- Working knowledge of version control systems and git workflows
- Excellent collaboration and communication skills in cross-functional Agile teams