Full-Time DevOps Engineer
Arize AI is hiring a remote Full-Time DevOps Engineer. This is an Experienced-level role open to remote applicants based in the USA. Read the complete job description before applying.
Arize AI
Job Details
The Opportunity
AI is rapidly changing the world. From processing job applications and credit decisions to making content recommendations and helping researchers analyze genetic markers at scale, many aspects of our daily lives are touched by machine learning systems in some way.
Arize is the leading machine learning observability platform, helping ML teams discover issues, diagnose problems, and improve the results of machine learning models. In short: we are here to build world-class software that helps make AI work better.
The Team
Our On-Prem engineering team is responsible for the deployment of Arize in customer environments. In addition to working with customers to define infrastructure requirements, the team designs and develops software and tooling that enable the management of these systems at scale. The On-Prem team has built expertise in Kubernetes and cloud deployment on GCP, Azure, and AWS, as well as in the networking and security aspects of on-premises deployments. The team is dynamic and relies on a few talented individuals with a high degree of autonomy and initiative.
What You’ll Do
- Work hands-on with the infrastructure that supports our distributed & highly scalable services in both SaaS and on-prem offerings
- Gather requirements from customers and adapt manifests and software to support new environments
- Use and augment monitoring tools to observe platform health and ensure performance and reliability
- Interact with the product team to test new features and package new on-prem releases
- Automate and optimize the release pipeline to make it as frictionless as possible
- Exhibit continuous curiosity for emerging technology that could solve our challenges
What We’re Looking For
- 1-2+ years of experience in site reliability engineering, DevOps, and system administration
- CS (preferred) or other technical degree, or equivalent practical experience
- Experience working with DevOps tools such as Kubernetes, Terraform, Ansible, Puppet, and Chef
- Proficiency with scripting languages such as Python and Bash
- Experience managing cloud infrastructure in AWS, GCP, and/or Azure
- Expertise in Linux administration, configuration, and networking protocols
Bonus Points, But Not Required
- Experience with on-prem deployment architectures
- Experience running a 24x7 SaaS platform with defined SLIs, SLOs, and SLAs
- Familiarity with operating machine learning & AI applications
Technologies You’ll Work With
- Kubernetes
- Postgres
- Messaging systems
- Go, Java, Python
- Bazel
- AWS, GCP