About the role
Lead the evaluation team to create high-signal datasets and tools for coding agents.
Lead the group responsible for creating high-signal evaluation datasets for coding agents and building the tools engineers use to write and run them.
Key Responsibilities
- Set the eval roadmap end-to-end: what we measure, why it matters, and how signals turn into shipping and training decisions.
- Lead and grow a high-impact team of engineers and researchers building eval datasets and developer-friendly tools to write and run evals.
- Guide the next generation of CursorBench so it continues to reflect real developer workflows at Cursor, and expand it with new evals that measure other properties developers value.
- Define crisp online quality signals and turn regressions into robust guardrails.
- Integrate evals into the decision-making cadence for launches, deploys, and model training loops.
Requirements
- You've led engineering teams shipping production systems and have strong people leadership and coaching skills.
- You can align research, product, data, and infrastructure on what "good" means, and turn that into durable metrics, processes, and release/training rituals.
- You have good taste and strong opinions on model and agent behaviors, and you stay up to date on emerging research and industry trends.
- You have strong data acumen and can collaborate effectively with data scientists and researchers.
Level: Manager