About the role
Lead the design, validation, and execution of benchmarks and evaluations for frontier AI models, producing rigorous, publishable evaluation datasets and translating findings into product.
Key Responsibilities
- Design benchmarks distinguishing capability levels across frontier and domain-specific models.
- Validate evaluations via human baselines, inter-rater reliability, and contamination analysis.
- Develop statistical frameworks (IRT, contamination, predictive validity) for model comparison.
- Run evaluations on frontier models and collaborate with partners and vendors.
- Publish research and translate results into deployable evaluation datasets.
Requirements
- Advanced quantitative degree (PhD preferred) or equivalent experience.
- Hands-on experience evaluating LLMs, agents, or ML systems and running evals at scale.
- Experience with annotator quality, inter-rater reliability, and labeling protocols.
- Strong scientific writing and communication skills.
Tech stack
LLMs
Level: Senior