Full-Time Lead Site Reliability Engineer

Hitachi Solutions is hiring a remote Full-Time Lead Site Reliability Engineer. The career level for this opening is Expert, and it accepts USA-based applicants working remotely. Read the complete job description before applying.

Hitachi Solutions

Job Title

Lead Site Reliability Engineer

Posted

5 months ago on 4th July 2025

Job Type

Full-Time

Career Level

Expert

Locations Accepted

USA

Salary

$142,500 - $198,750 per year

Job Details

This is a full-time role for an expert in systems design with considerable skill and expertise in large-scale software development in an Azure development environment.

Designs and implements Continuous Integration/Continuous Deployment (CI/CD) tooling using GitHub Actions / Azure DevOps, and related technologies. This includes defining and implementing:

build and test pipelines for containerized architectures
infrastructure as code (IaC) for the stateful deployment of environments
Role-Based Access Control (RBAC)
linting and other code quality controls
GitOps and Kubernetes pipelines
managing SaaS deployment APIs

Individuals in this role will assist in the design, engineering, development, planning, and administration of Azure Kubernetes Service (AKS) clusters for a set of critical business applications.

This role will work closely with application, engineering, security, and operations teams to engineer and build Kubernetes and Azure PaaS and IaaS solutions within an agile, modern, enterprise-grade operating model.

Qualified applicants will have a demonstrated capability to learn new concepts quickly, and/or have robust domain expertise.

Key Responsibilities:

Responsible for availability, latency, performance, efficiency, monitoring/observability, emergency response, and capacity planning; setting and maintaining SLOs, SLIs, and error budgets; and creating dashboards.
Analyze, troubleshoot, and resolve operational challenges contributing to defined SLOs.
Manage site stability, performance, reliability, and maintain uptime for production environments.
Develop a fully automated, multi-environment observability stack based on the existing system, and extend it to predict capacity needs from usage patterns.
Strive for automation to reduce toil and increase development velocity.
Perform application-specific production support, incident management, change management, problem management, RCAs, and service restoration as needed.
Identify changes to the product architecture from a reliability, performance, and availability perspective, using a data-driven approach.
Analyze and address complex technical challenges and issues that arise during the software development & run lifecycle.
Debug, troubleshoot, and resolve technical problems efficiently.
Create and maintain technical documentation, including design specifications, user guides, runbooks, and best-practice guidelines.
Actively look for opportunities to improve the availability and performance of the system by applying lessons learned from monitoring and observation.
Collaborate with software development teams on release management, help shape the future roadmap, and establish strong operational readiness across teams.
Participate in Agile ceremonies, such as sprint planning, stand-up meetings, and retrospectives.
Collaborate with product managers, designers, and other engineers to ensure alignment and efficient project execution.
Share your expertise and mentor engineers, helping them grow and develop their skills.
Foster a culture of continuous learning and improvement within the team.
Stay up to date with the latest technologies, tools, and developments in cloud computing.
Proactively learn and adapt to new technologies to drive innovation.
Collaborate with customers to understand their needs, gather feedback, and provide technical support and guidance as needed.
Contribute to incident root cause analysis, service restoration, and serve as an incident commander during outage events.

Qualifications:

Strong background as an SRE supporting a 24x7 highly available production environment for a SaaS or cloud service provider.
Solid experience with monitoring/APM/observability tools (Datadog, Application Insights, Prometheus, Grafana, etc.).
Strong background with Azure resources such as Key Vault, Data Factory, Azure Databricks, and Storage Accounts.
Experience implementing observability plans around logs, metrics, and traces.
Experience in an agile development team developing software.
Experience implementing and exercising best practices for CI/CD.
Experience with cloud infrastructure environments, preferably Azure, and infrastructure as code (Terraform, Bicep, ARM).
Ability to design, develop, and maintain infrastructure using popular IaC tools and technologies such as Terraform, Helm, and others.
Strong experience with containerization technology and/or Kubernetes.
Experience with release automation, system administration, and configuration management.
Experience with programming languages (Python, Go, etc.).
Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts.
Strong interpersonal and teaming skills - ability to set and enforce process and influence engineers who are not direct reports.
Strong analytical and programming skills (Python, Go, etc.).
(Bonus) Experience with MLflow and other MLOps pipeline technology.

Skills

Azure, CI/CD, Kubernetes, Site Reliability Engineering, Terraform

FAQs

What is the last date for applying to the job?

The deadline to apply for Full-Time Lead Site Reliability Engineer at Hitachi Solutions is 3rd August 2025. We consider jobs older than one month to have expired.

Which countries are accepted for this remote job?

This job accepts applicants from the USA.
