Full-Time Site Reliability Engineer (SRE/DevOps)
Arista Networks is hiring a remote, full-time Site Reliability Engineer (SRE/DevOps). The role is at the Experienced career level and is open to remote applicants based in Bengaluru, India. Read the complete job description before applying.
Arista Networks
Job Details
Who You'll Work With
Arista Networks is seeking a skilled professional for the Engineering Productivity (EngProd) team to maintain and support growing infrastructure and users. The ideal candidate is adaptable, versatile, and eager to learn new technologies.
As a software engineer, you will collaborate to design, build, and administer secure, scalable, and fault-tolerant tools/infrastructure in a hybrid cloud environment.
In the EngProd group, you'll work with other engineers to design, build, scale, and operate systems used by Arista's product development teams.
Systems include industry-standard tools like Ansible, Artifactory, Gerrit, Jenkins, Kubernetes, Grafana, Spinnaker, MySQL, Elasticsearch, Google Cloud, Varnish, Perforce, and third-party storage appliances, along with internally developed automation for CI/CD, testing, analysis, and visualization.
What You'll Do
- Build, deploy (safely and incrementally), and operate critical production systems, focusing on scalability, reliability, observability, performance, and security.
- Monitor, support, and enhance the developer experience across services.
- Develop automation to streamline the operation of production systems.
- Proactively monitor, respond to, and improve alerts, and set up automated alert handling.
- Create and maintain incident response runbooks.
- Build and deploy new systems with scalability, reliability, and observability as core requirements.
- Troubleshoot platform/infrastructural issues and aid Arista software engineers in their diagnostics.
- Engage with third-party vendor support.
- Deploy systems in a staged manner.
- Document post-mortems and develop solutions to prevent incident recurrence.
- Plan and communicate maintenance windows for production systems.
- Collaborate with Arista's product development teams to identify and resolve infrastructural bottlenecks and limitations affecting their workflows.
- Adopt best practices for building secure, scalable, and fault-tolerant systems.
- Implement solutions to scale systems and enhance fault-tolerance and performance for improved system availability.
- Study open-source software (OSS) systems to improve issue triage and resolution.