Model-based Evaluations in Langfuse
Model-based evaluations (LLM-as-a-judge) are a powerful tool to automate the evaluation of LLM applications integrated with Langfuse. With model-based evaluations, an LLM scores a specific session, trace, or LLM call in Langfuse on criteria such as correctness, toxicity, or hallucinations.
There are two ways to run model-based evaluations in Langfuse:
Via Langfuse UI (beta)
- Hobby: Public Beta
- Pro: Public Beta
- Team: Public Beta
- Self Hosted: Not Available
Store an API key
To use evals, you have to bring your own LLM API keys. To do so, navigate to the settings page and insert your API key. We store them encrypted on our servers.
Create an eval template
First, we need to specify the evaluation template:
- Select the model and its parameters.
- Customise one of the Langfuse managed prompt templates or write your own.
- We use function calling to extract the evaluation output. Specify the descriptions for the function parameters `score` and `reasoning`. This is how you can direct the LLM to score on a specific range and provide specific reasoning for the score.
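To illustrate how parameter descriptions can steer the judge, here is a minimal sketch of an OpenAI-style function definition with described `score` and `reasoning` parameters, plus a helper that parses the returned arguments. The function name `record_evaluation`, the helper names, and the exact schema layout are assumptions for illustration; the source does not show the schema Langfuse uses internally.

```python
import json
from typing import Tuple


def build_eval_function(score_description: str, reasoning_description: str) -> dict:
    """Build an OpenAI-style tool definition (hypothetical layout, not Langfuse's own)."""
    return {
        "type": "function",
        "function": {
            "name": "record_evaluation",  # hypothetical name
            "description": "Record the result of the evaluation.",
            "parameters": {
                "type": "object",
                "properties": {
                    # The descriptions are where the score range and the
                    # expected style of reasoning are communicated to the LLM.
                    "score": {"type": "number", "description": score_description},
                    "reasoning": {"type": "string", "description": reasoning_description},
                },
                "required": ["score", "reasoning"],
            },
        },
    }


def parse_eval_arguments(arguments_json: str) -> Tuple[float, str]:
    """Parse the JSON arguments string of a function call into (score, reasoning)."""
    args = json.loads(arguments_json)
    return float(args["score"]), str(args["reasoning"])


tool = build_eval_function(
    score_description="Toxicity between 0 (not toxic) and 1 (very toxic).",
    reasoning_description="One sentence explaining the score.",
)
score, reasoning = parse_eval_arguments('{"score": 0.1, "reasoning": "Polite answer."}')
```

Tightening the `score` description (for example, naming the range and what each end means) is the main lever for getting consistent values back.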
Create an eval config
Second, we need to specify on which traces Langfuse should run the template we created above.
- Select the evaluation template to run.
- Specify the name of the `scores` which will be created as a result of the evaluation.
- Filter which newly ingested traces should be evaluated. (Coming soon: select existing traces)
- Specify how Langfuse should fill the variables in the template. Langfuse can extract data from `trace`, `generations`, `spans`, or `event` objects which belong to a trace. You can choose to take `Input`, `Output` or `metadata` from each of these objects. For `generations`, `spans`, or `events`, you also have to specify the name of the object. We will always take the latest object that matches the name.
- Reduce the sampling to not run evals on each trace. This helps to save LLM API cost.
- Add a delay to the evaluation execution. This ensures that all data has arrived at the Langfuse servers before the evaluation is executed.
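The variable mapping and sampling rules in the list above can be sketched as follows. This is an illustration only, not Langfuse's implementation: the observation dictionaries, their keys (`name`, `start_time`, `input`, `output`, `metadata`), and the helper names are all assumptions.

```python
import random
from datetime import datetime
from typing import Any, Dict, List, Optional

Observation = Dict[str, Any]  # hypothetical shape of a generation, span, or event


def latest_by_name(observations: List[Observation], name: str) -> Optional[Observation]:
    """Return the most recent observation with the given name, or None if there is none."""
    matches = [o for o in observations if o["name"] == name]
    if not matches:
        return None
    return max(matches, key=lambda o: o["start_time"])


def fill_variable(observations: List[Observation], name: str, field: str) -> Any:
    """Take `input`, `output`, or `metadata` from the latest matching observation."""
    obs = latest_by_name(observations, name)
    return None if obs is None else obs.get(field)


def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    """Decide whether to evaluate a trace; a rate of 0.25 evaluates roughly a quarter."""
    return rng.random() < sampling_rate


observations = [
    {"name": "summarize", "start_time": datetime(2024, 1, 1, 12, 0), "output": "draft"},
    {"name": "summarize", "start_time": datetime(2024, 1, 1, 12, 5), "output": "final"},
]
template_value = fill_variable(observations, "summarize", "output")
```

In this sketch `template_value` is taken from the later of the two `summarize` observations, mirroring the "latest object that matches the name" rule.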
See the progress
Once the configuration is saved, Langfuse will start running the evals on the traces that match the filter. You can see the progress on the config page or the log table.
See scores
Upon receiving new traces, navigate to the trace detail view to see the associated scores.
Via External Evaluation Pipeline
- Hobby: Full
- Pro: Full
- Team: Full
- Self Hosted: Full
You can run your own model-based evals on data in Langfuse by fetching traces from Langfuse (e.g. via the Python SDK) and then adding evaluation results as `scores` back to the traces in Langfuse. This gives you full flexibility to run various eval libraries on your production data and discover which work well for your use case.
The example notebook is a good template to get started with building your own evaluation pipeline.
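As a rough outline of such a pipeline, the sketch below separates the three steps: fetch traces, evaluate each one, and write the result back as a score. To stay independent of any SDK version, `fetch_traces` and `add_score` are hypothetical callables that you would implement with the Langfuse SDK calls for your version; the trace dictionary keys and the keyword-based `keyword_judge` (standing in for a real LLM judge) are also assumptions.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

Trace = Dict[str, Any]  # hypothetical shape: {"id": ..., "input": ..., "output": ...}


def run_eval_pipeline(
    fetch_traces: Callable[[], Iterable[Trace]],
    evaluate: Callable[[Any, Any], Tuple[float, str]],
    add_score: Callable[..., None],
    score_name: str,
) -> int:
    """Evaluate every fetched trace and attach the result as a score.

    Returns the number of traces that were scored.
    """
    scored = 0
    for trace in fetch_traces():
        value, comment = evaluate(trace.get("input"), trace.get("output"))
        add_score(trace_id=trace["id"], name=score_name, value=value, comment=comment)
        scored += 1
    return scored


def keyword_judge(trace_input: Any, trace_output: Any) -> Tuple[float, str]:
    """Toy stand-in for an LLM judge: flags outputs containing a blocklisted word."""
    flagged = "idiot" in str(trace_output or "").lower()
    if flagged:
        return 1.0, "Output contains a blocklisted word."
    return 0.0, "No blocklisted word found."


# In-memory stand-ins so the sketch runs without network access.
fake_traces = [
    {"id": "t1", "input": "Hi", "output": "Hello, how can I help?"},
    {"id": "t2", "input": "Hi", "output": "Go away, idiot."},
]
collected = []
count = run_eval_pipeline(
    fetch_traces=lambda: fake_traces,
    evaluate=keyword_judge,
    add_score=lambda **score: collected.append(score),
    score_name="toxicity",
)
```

Keeping the evaluator behind a plain function makes it easy to swap in different eval libraries and compare their scores on the same traces.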