Research Engineer – Evals

Firecrawl

AI & company

San Francisco, United States

Mid

Data & AI

About the role

Build evaluation systems to assess the effectiveness of Firecrawl.

• You'll build the evaluation systems that tell us whether Firecrawl actually works.
• This role involves designing metrics, building pipelines, generating datasets, and owning the feedback loop from output quality back to model and product decisions.

Key responsibilities

• Build the eval stack from scratch, defining metrics, building pipelines, curating datasets, and integrating evals into CI/CD.
• Design benchmarks that reflect reality across millions of websites, including edge cases.
• Own LLM-as-judge pipelines and human review tooling.
• Close the loop with models and RL by turning quality measurements into reward signals.
• Run fast experiments and communicate findings clearly.

Requirements

• 3+ years in ML engineering, applied AI, or data quality with production systems.
• Experience building eval infrastructure, pipelines, and datasets.
• Understanding of what "good" means for unstructured web data.

Tech stack

Python, MLflow, Kubeflow, SageMaker, Vertex AI, ONNX, JAX, Pandas, NumPy, Airflow, dbt, Snowflake, Databricks, Fivetran, Dagster, Prefect, Apache Kafka, Apache Flink, PostgreSQL, MySQL, MongoDB, Redis, Elasticsearch, DynamoDB, Cassandra, SQLite, Oracle, SQL Server, Neo4j, CockroachDB

Match insights

Tech: Python, MLflow, Kubeflow, SageMaker, Vertex AI

Level: Mid
