Research Scientist, Benchmarks & Evaluations
Posted on May 23, 2026
Company Overview
We are building Protege to solve the biggest unmet need in AI — getting access to the right training data. The Protege platform facilitates the secure, efficient, and privacy-centric exchange of AI training data.
Solving AI’s data problem is a generational opportunity. Protege is backed by world-class investors and is already powering partnerships with ambitious teams in AI.
We are a lean, fast-moving, high-trust team of builders who are obsessed with velocity and impact. Our culture is built for people who thrive on ambiguity, own outcomes, and want to shape the future of data and AI.
DataLab
DataLab is Protege’s research arm — a team of research scientists tackling fundamental challenges about data for AI. We bridge the gap between research theory and data deployment, publishing on questions such as how to quality-control large-scale corpora and how to build evaluation datasets that reflect the real world.
We value scientific rigor and impact, and our team is built for people who thrive on ambiguity and ownership.
The Role
Benchmarks decide what AI gets built. Many current evaluations are contaminated, gameable, synthetic, or measure capabilities that don’t transfer to real tasks. Protege is hiring a Research Scientist to lead the design of benchmarks and evaluations that frontier labs, enterprises, and policymakers can trust.
You will design tasks that meaningfully separate models, validate those tasks against human baselines, pressure-test for contamination and elicitation gaps, and publish work that shapes Protege’s evaluation datasets.
What you'll do
- Design tasks and benchmarks that distinguish capability levels across frontier models, including agentic, reasoning-heavy, and domain-specific (healthcare, finance, scientific) settings.
- Validate evaluations rigorously: run human baselines, analyze inter-rater reliability, study how elicitation and scaffolding shift results, and quantify signal versus noise.
- Develop the “science of evals” at Protege, including item response theory, contamination analysis, predictive validity studies, and statistical frameworks for comparing models with appropriate uncertainty.
- Run evaluations on current frontier models, sometimes in collaboration with partners at AI labs, enterprises, and government.
- Publish research that establishes Protege as a standard-setter for evaluation data.
- Translate findings into product, working closely with data and engineering teams to turn research into evaluation datasets customers can deploy.
- Partner with outsourced annotation vendors: own the statistical machinery that determines which annotators to trust on which tasks, and translate it into trustworthiness scores for customers.
What we're looking for
- Degree in a quantitative field such as econometrics, quantitative finance, computer science, engineering, statistics, mathematics, or another applied research discipline (PhD preferred, or MS/BS plus equivalent industry experience).
- Hands-on experience evaluating LLMs, agents, or other ML systems, including prompting, scaffolding, and tooling for large-scale evals.
- Experience with annotator quality and inter-rater reliability: designing labeling protocols, computing agreement statistics, and reasoning about annotator bias and calibration.
- Excellent scientific writing and communication skills, with the ability to synthesize technical findings into actionable narratives for labs, customers, and policymakers.
- A bias toward velocity: know which pipelines need to be production-grade versus scrappy, and deliver reliable results quickly.
Bonus
- Experience with RL evaluation techniques — reward modeling, off-policy evaluation, evals for RLHF/RLAIF or agentic RL pipelines.
- Ability to navigate new customer architectures, data systems, and requirements quickly.
- Experience with latent-variable models of annotator skill (Dawid–Skene, MACE, IRT-style approaches) or running large expert-annotator panels in regulated domains.
- Track record of published benchmarks or evaluation papers adopted by the field.
Protege Values
Pass the Loved Ones’ Test
Always Find a Way
Go Fast and Grow Fast
Practice Kindness and Candor
Deliver Together
Own the Outcome. Hone the Craft.
How to Apply
Please apply using the application page for this job. Include your CV and a brief cover letter outlining relevant research and evaluation experience.
Application Link
Apply at: https://jobs.ashbyhq.com/protege/11f7d4de-81d5-44c1-9aa4-b21e2e2015fc/application