Full-Time Senior Site Reliability Engineer
DK Crown Holdings Inc. is hiring a remote Full-Time Senior Site Reliability Engineer. This is an Experienced-level role open to remote applicants based in the USA. Read the complete job description before applying.
DK Crown Holdings Inc.
Job Details
As a Senior Site Reliability Engineer, you'll build and scale the critical infrastructure behind every product. In this role, you'll take on complex challenges across global data centers, multiple cloud platforms, and on-premise systems, designing automation-first solutions that elevate performance and eliminate operational friction.
- Drive stability and scalability across our global compute platform spanning numerous data centers, multiple public clouds, and on-premise environments.
- Implement automation for self-healing, fault-tolerant infrastructure using declarative configurations and event-driven workflows.
- Develop internal tools to eliminate repetitive tasks.
- Establish critical performance and reliability metrics for infrastructure platform components.
- Ensure the highest level of uptime and Quality of Service (QoS) for internal customers through operational excellence.
- Support technical growth by sharing knowledge, participating in design discussions, and contributing to a collaborative team culture.
- Participate in an on-call rotation, incident reviews, root cause identification, and Root Cause Analysis (RCA) reporting.
Qualifications
- Bachelor's degree in Computer Science or relevant education, experience, and training.
- At least 4 years of experience managing distributed cloud environments such as GCP, AWS, vSphere, and Nutanix, along with platform automation at scale.
- Deep expertise in container orchestration with Kubernetes, including the ability to design, scale, and troubleshoot complex workloads.
- Strong experience developing software for automation and infrastructure tooling in languages such as Go and Python.
- Kubernetes administration experience, including installation, configuration, and troubleshooting.
- Working knowledge of networking and Linux-based systems, including container runtimes such as Docker and containerd, packet-level debugging, and kernel troubleshooting.
- Experience with Infrastructure as Code (IaC) and configuration management tools, such as Terraform, Chef, and Pulumi, to ensure scalable and repeatable infrastructure provisioning.
- Creative problem-solving skills and excellent communication.