Research Scientist, Benchmarks & Evaluations

Protege
AI Training company
Remote
Senior
Data & AI

About the role

Lead the design, validation, and execution of benchmarks and evaluations for frontier AI models, producing rigorous, publishable evaluation datasets and translating findings into product.

Key Responsibilities

  • Design benchmarks that distinguish capability levels across frontier and domain-specific models.
  • Validate evaluations via human baselines, inter-rater reliability, and contamination analysis.
  • Develop statistical frameworks (item response theory (IRT), contamination analysis, predictive validity) for model comparison.
  • Run evaluations on frontier models and collaborate with partners and vendors.
  • Publish research and translate results into deployable evaluation datasets.

Requirements

  • Advanced quantitative degree (PhD preferred) or equivalent experience.
  • Hands-on experience evaluating LLMs, agents, or ML systems and running evals at scale.
  • Experience with annotator quality, inter-rater reliability, and labeling protocols.
  • Strong scientific writing and communication skills.

Tech stack

LLMs
