Full-Time Site Reliability Engineer
Procurement Sciences is hiring a remote, full-time Site Reliability Engineer. This is an experienced-level role open to remote applicants based in the USA. Read the complete job description before applying.
Procurement Sciences
Job Details
Procurement Sciences AI (PSci.AI) is at the forefront of generative artificial intelligence, transforming the government contracting sector. As a venture-backed B2B SaaS company, we are dedicated to revolutionizing federal, state, and local business development with disruptive AI capabilities.
Job Title: Site Reliability Engineer (SRE)
Location: Washington, DC metro area; Salt Lake City, UT; or Remote
Job Description
- Identify and resolve system and application issues through in-depth root cause analysis.
- Design, develop, and implement comprehensive automated testing.
- Build and maintain robust observability and monitoring solutions.
- Define and monitor service level indicators (SLIs), service level objectives (SLOs), and service level agreements (SLAs).
- Collaborate with developers and operations staff to enhance system reliability.
- Develop and continuously improve monitoring and alerting systems.
- Lead and implement best practices for incident management, disaster recovery, and business continuity.
- Manage high-impact incident response, facilitate post-mortem analyses, and drive remediation.
- Plan for capacity upgrades and scaling.
- Automate operational tasks and infrastructure management using Infrastructure as Code (IaC) tools.
- Ensure all systems and processes comply with security, privacy, and regulatory requirements.
- Continually assess and drive improvements in system architecture.
Technical Requirements
- Proficient in Kubernetes, Helm, and troubleshooting in secure and regulated environments.
- Deep experience with observability and monitoring tools.
- Hands-on expertise with major public cloud providers: Azure, Azure Gov, AWS, AWS GovCloud, and Google Cloud Platform (GCP).
- Strong grasp of microservices architecture, cloud-native technologies, Postgres, and AI/ML systems.
- Expertise in automated testing frameworks and practices.
- Proficiency in tracking and analyzing reliability metrics (SLIs, SLAs, SLOs).
- Strong programming skills in TypeScript and Python.
- Solid scripting abilities in Bash, PowerShell, or similar languages.
- Demonstrated experience with Infrastructure as Code (IaC) tools.
- Understanding of core networking principles, with advanced troubleshooting skills.
- Effective communicator.
Preferred Qualifications
- Experience in the GovCon sector and/or holding a security clearance.
- Familiarity with GitOps principles and tools; experience with FluxCD is a plus.
- Proven experience in designing, building, and maintaining CI/CD pipelines.
- Experience managing reliability in multi-cloud or hybrid cloud environments.
- Knowledge of security and compliance standards.
- Previous success operating in dynamic, high-growth SaaS companies.
- Demonstrated expertise in operationalizing new development workloads.