Research Scientist, Benchmarks & Evaluations

Protege (AI Training company)

Remote, Senior

Data & AI

About the role

Lead the design and execution of benchmarks and evaluations for frontier AI models.

• Lead the design, validation, and execution of benchmarks and evaluations for frontier AI models, producing rigorous, publishable evaluation datasets and translating findings into product.

Key Responsibilities

• Design benchmarks distinguishing capability levels across frontier and domain-specific models.
• Validate evaluations via human baselines, inter-rater reliability, and contamination analysis.
• Develop statistical frameworks (IRT, contamination, predictive validity) for model comparison.
• Run evaluations on frontier models and collaborate with partners and vendors.
• Publish research and translate results into deployable evaluation datasets.

Requirements

• Advanced quantitative degree (PhD preferred) or equivalent experience.
• Hands-on experience evaluating LLMs, agents, or ML systems and running evals at scale.
• Experience with annotator quality, inter-rater reliability, and labeling protocols.
• Strong scientific writing and communication skills.

Tech stack

LLMs
