Full-Time Data Scientist – AI & Agentic Applications & Benchmarking
CloudBees is hiring a remote, full-time Data Scientist – AI & Agentic Applications & Benchmarking. The role is at the Experienced career level and is open to USA-based applicants working remotely.
Job Details
CloudBees enables enterprises to deliver scalable, compliant, and secure software. It allows developers to bring and execute their code anywhere, providing greater flexibility and freedom through fast, self-serve, and secure workflows. CloudBees supports organizations at every step of their DevSecOps journey, whether they are running Jenkins on-premise or transitioning software delivery to the cloud.
About the Role
CloudBees is seeking a startup-savvy Data Scientist to help define, measure, and evangelize the impact of Agentic Applications across our platform. You'll work closely with engineers and product teams to prototype and measure AI and agentic experiences, using evals, telemetry, and AI benchmarks to help the company drive the conversation in the market and with customers.
Key Responsibilities
- Partner with our platform team to develop and prototype telemetry, eval frameworks, and benchmarks for emerging agentic systems (see the sketch after this list).
- Partner with product and engineering teams to measure AI outcomes and usage across customers and teams.
- Help define KPIs and success metrics for AI and LLM-driven features and workflows.
- Use Python notebooks to explore data, visualize insights, and test hypotheses rapidly, then share the results.
- Tell the story behind the numbers: write internal documentation, performance summaries, and thought leadership around outcomes.
- Enable engineering teams to instrument, log, and evaluate agent performance effectively.
- Stay up to date with evolving metrics and evaluation techniques in the LLM and agentic AI ecosystem.
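To make the eval and telemetry work above concrete, here is a minimal sketch of a harness that scores agent outputs against simple pass/fail criteria and emits one JSONL telemetry record per case. All names (EvalCase, run_eval, fake_agent, the example cases) are hypothetical illustrations, not part of any CloudBees system.

```python
import json
import time
from dataclasses import dataclass

# Hypothetical eval case: a prompt plus a simple pass/fail criterion.
@dataclass
class EvalCase:
    name: str
    prompt: str
    expected_substring: str

def fake_agent(prompt: str) -> str:
    """Stand-in for a real agentic workflow; swap in an actual agent call."""
    return f"Echo: {prompt}"

def run_eval(cases, agent, log_path="eval_telemetry.jsonl"):
    """Run each case, score it, and append one telemetry record per case."""
    passed = 0
    with open(log_path, "a") as log:
        for case in cases:
            start = time.perf_counter()
            output = agent(case.prompt)
            latency_s = time.perf_counter() - start
            ok = case.expected_substring in output
            passed += ok
            log.write(json.dumps({
                "case": case.name,
                "passed": ok,
                "latency_s": round(latency_s, 4),
                "output_chars": len(output),
            }) + "\n")
    return passed / len(cases)

if __name__ == "__main__":
    cases = [
        EvalCase("greeting", "Say hello", "hello"),
        EvalCase("echo", "Repeat: deploy", "deploy"),
    ]
    print(f"pass rate: {run_eval(cases, fake_agent):.0%}")
```

In practice the pass/fail check would be replaced by richer scoring (LLM-as-judge, rubric grading), but the shape of the loop, score plus per-case telemetry, stays the same.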
Required:
- 3+ years of experience in data science or ML analytics roles, ideally in startup or high-growth environments.
- Proficiency in Python, including building and sharing analysis via Jupyter notebooks.
- Experience with evals, telemetry, and A/B testing for user-facing ML systems (a minimal A/B sketch follows this list).
- Experience with AI/ML tools such as MLflow, Hugging Face, or other model/LLM tooling.
- Ability to partner with technical teams to define meaningful metrics and benchmarks.
- Clear communication skills: capable of writing about outcomes, sharing learnings, and influencing stakeholders.
- Comfort working in fast-paced, ambiguous environments where speed and clarity matter.
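As an illustration of the A/B-testing experience above, here is a minimal, stdlib-only sketch of a two-sided two-proportion z-test comparing task success rates of two agent variants. The counts are invented and the helper name is hypothetical.

```python
import math

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided two-proportion z-test on the success rates of variants A and B."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented counts: variant B succeeds on more tasks than variant A.
z, p = two_proportion_ztest(success_a=180, n_a=400, success_b=220, n_b=400)
print(f"z = {z:.2f}, p = {p:.4f}")  # compare p against a preset alpha
```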
Preferred (not required):
- Experience with agentic or LLM-based applications (e.g., evaluating AI copilots, autonomous workflows).
- Familiarity with tools like LangSmith, OpenInference, or custom eval stacks.
- Background in developer tools, DevOps, or platform engineering environments.