Research Engineer – Evals

Firecrawl
AI & company
San Francisco, United States
Mid
Data & AI

About the role

Build evaluation systems to assess the effectiveness of Firecrawl.

  • You'll build the evaluation systems that tell us whether Firecrawl actually works.
  • This role involves designing metrics, building pipelines, generating datasets, and owning the feedback loop from output quality back to model and product decisions.

Key Responsibilities

  • Build the eval stack from scratch, defining metrics, building pipelines, curating datasets, and integrating evals into CI/CD.
  • Design benchmarks that reflect reality across millions of websites, including edge cases.
  • Own LLM-as-judge pipelines and human review tooling.
  • Close the loop with models and RL by turning quality measurements into reward signals.
  • Run fast experiments and communicate findings clearly.

Requirements

  • 3+ years in ML engineering, applied AI, or data quality with production systems.
  • Experience building eval infrastructure, pipelines, and datasets.
  • Understanding of what "good" means for unstructured web data.

Tech stack

Python, MLflow, Kubeflow, SageMaker, Vertex AI, ONNX, JAX, Pandas, NumPy, Airflow, dbt, Snowflake, Databricks, Fivetran, Dagster, Prefect, Apache Kafka, Apache Flink, PostgreSQL, MySQL, MongoDB, Redis, Elasticsearch, DynamoDB, Cassandra, SQLite, Oracle, SQL Server, Neo4j, CockroachDB
