Full-Time Lead Site Reliability Engineer

Hitachi Solutions is hiring a remote Full-Time Lead Site Reliability Engineer. The career level for this opening is Expert, and the role accepts Greenville-based applicants working remotely. Read the complete job description before applying.

Hitachi Solutions

Job Title

Lead Site Reliability Engineer

Job Type

Full-Time

Career Level

Expert

Locations Accepted

Greenville

Salary

$142,500 - $198,750 per year

Job Details

This is a full-time role in our product organization for an expert in systems design with considerable skill and expertise in large-scale software development.

Key Responsibilities:

  • Designs and implements CI/CD tooling using GitHub Actions, Azure DevOps, and related technologies.
  • Defines and implements build and test pipelines for containerized architectures, infrastructure as code (IaC) for the stateful deployment of environments, Role-Based Access Control (RBAC), linting and other code quality controls, and GitOps and Kubernetes pipelines, and manages SaaS deployment APIs.
  • Assists in the design, engineering, development, planning, and administration of Azure Kubernetes Service (AKS) clusters for a set of critical business applications.
  • Works closely with application, engineering, security, and operations teams to engineer and build Kubernetes and Azure PaaS and IaaS solutions within an agile, modern, enterprise-grade operating model.
  • Responsible for availability, latency, performance, efficiency, monitoring/observability, emergency response, capacity planning, setting and maintaining SLOs, SLIs, and error budgets, and creating dashboards.
  • Analyzes, troubleshoots, and resolves operational challenges contributing to defined SLOs.
  • Manages site stability, performance, reliability, and maintains uptime for production environments.
  • Develops a fully automated multi-environment observability stack based on the existing system and extends it to predict capacity needs from usage patterns.
  • Strives for automation to reduce toil and increase development velocity.
  • Performs application-specific production support, incident management, change management, problem management, RCAs, and service restoration as needed.
  • Identifies changes for the product architecture from the reliability, performance and availability perspective with a data-driven approach.
  • Analyzes and addresses complex technical challenges and issues that arise during the software development & run lifecycle.
  • Debugs, troubleshoots, and resolves technical problems efficiently.
  • Creates and maintains technical documentation, including design specifications, user guides, run books and best practice guidelines.
  • Actively looks for opportunities to improve the availability and performance of the system by applying lessons learned from monitoring and observation.
  • Collaborates with software development teams on release management and on shaping the future roadmap, and establishes strong operational readiness across teams.
  • Participates in Agile ceremonies, such as sprint planning, stand-up meetings, and retrospectives.
  • Collaborates with product managers, designers, and other engineers to ensure alignment and efficient project execution.
  • Shares expertise and mentors engineers, helping them grow and develop their skills.
  • Fosters a culture of continuous learning and improvement within the team.
  • Stays up to date with the latest technologies, tools, and developments in cloud computing.
  • Proactively learns and adapts to new technologies to drive innovation.
  • Collaborates with customers to understand their needs, gathers feedback, and provides technical support and guidance as needed.
  • Triages incoming Web Support escalation requests and routes them to the applicable internal teams.
  • Contributes to incident root cause analysis, service restoration, and serves as an incident commander during outage events.

Qualifications:

  • Strong background as an SRE supporting a 24x7 highly available production environment for a SaaS or cloud service provider.
  • Solid experience with monitoring/APM/observability tools (Datadog, Application Insights, Prometheus, Grafana, etc.).
  • Strong background with Azure resources such as Key Vault, Data Factory, Azure Databricks, and Storage Accounts.
  • Experience implementing observability plans around logs, metrics, and traces.
  • Experience in an agile development team developing software.
  • Experience implementing and exercising best practices for CI/CD.
  • Experience with cloud infrastructure environments, preferably Azure, and infrastructure as code (Terraform, Bicep, ARM).
  • Experience designing, developing, and maintaining infrastructure using popular IaC tools and technologies such as Terraform and Helm.
  • Strong experience with containerization technology and/or Kubernetes.
  • Experience with release automation, system administration, and configuration management.
  • Experience with programming languages (Python, Go, etc.).
  • Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts.
  • Strong interpersonal and teamwork skills, including the ability to set and enforce process and to influence engineers who are not direct reports.
  • Strong analytical and programming skills (Python, Go, etc.).

(Bonus) Experience with MLflow and other MLOps pipeline technologies.

FAQs

What is the last date for applying to the job?

The deadline to apply for Full-Time Lead Site Reliability Engineer at Hitachi Solutions is April 18, 2025. We consider jobs older than one month to have expired.

Which countries are accepted for this remote job?

This job accepts applicants based in Greenville.
