About the role
Build evaluation systems to assess the effectiveness of Firecrawl.
- You'll build the evaluation systems that tell us whether Firecrawl actually works.
- This role involves designing metrics, building pipelines, generating datasets, and owning the feedback loop from output quality back to model and product decisions.
Key Responsibilities
- Build the eval stack from scratch, defining metrics, building pipelines, curating datasets, and integrating evals into CI/CD.
- Design benchmarks that reflect reality across millions of websites, including edge cases.
- Own LLM-as-judge pipelines and human review tooling.
- Close the loop with models and RL by turning quality measurements into reward signals.
- Run fast experiments and communicate findings clearly.
Requirements
- 3+ years in ML engineering, applied AI, or data quality with production systems.
- Experience building eval infrastructure, pipelines, and datasets.
- Understanding of what "good" means for unstructured web data.
Tech stack
Python, MLflow, Kubeflow, SageMaker, Vertex AI, ONNX, JAX, Pandas, NumPy, Airflow, dbt, Snowflake, Databricks, Fivetran, Dagster, Prefect, Apache Kafka, Apache Flink, PostgreSQL, MySQL, MongoDB, Redis, Elasticsearch, DynamoDB, Cassandra, SQLite, Oracle, SQL Server, Neo4j, CockroachDB
Match insights
Tech: Python, MLflow, Kubeflow, SageMaker, Vertex AI
Level: Mid