- Application performance: The total time taken to return a response to a user request
- Accuracy: How often a given LLM returns an accurate answer
Basic application performance
Guardrails consist of a guard and a series of validators that the guard uses to validate LLM responses. A guard itself generally runs in under 10ms, and a correctly configured validator should add only around 100ms of additional latency. The largest latency and performance cost comes from your choice of LLM. It's important to capture metrics around LLM usage and assess how different LLMs handle your workloads in terms of both performance and result accuracy. Guardrails AI's LiteLLM support makes it easy to swap LLMs with minor changes to your guard calls.

Performance tips
Here are a few tips to get the best performance out of your Guardrails-enabled applications. Use async guards for the best performance: the AsyncGuard class lets you make concurrent calls to multiple LLMs and process response chunks as they arrive. For more information, see Async stream-validate LLM responses.
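The concurrency pattern that AsyncGuard enables can be sketched with plain asyncio. The LLM call and validator below are hypothetical stubs standing in for the real library, not the actual Guardrails API; the point is the fan-out-and-validate shape, assuming each guarded call is independently awaitable.

```python
import asyncio

# Hypothetical stand-in for a real async LLM call (e.g. via LiteLLM).
async def call_llm(model: str, prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return f"{model} response to: {prompt}"

# Trivial validator stub; a real validator would check content, not length.
def validate(response: str) -> bool:
    return len(response) > 0

async def guarded_call(model: str, prompt: str) -> str:
    response = await call_llm(model, prompt)
    if not validate(response):
        raise ValueError(f"validation failed for {model}")
    return response

async def main() -> list[str]:
    # Fan out to multiple LLMs concurrently instead of awaiting each in turn.
    return await asyncio.gather(
        guarded_call("model-a", "hello"),
        guarded_call("model-b", "hello"),
    )

results = asyncio.run(main())
```

With sequential awaits the total latency would be the sum of the two calls; with `asyncio.gather` it is roughly the maximum of the two.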
Use a remote server for heavy workloads. Compute-intensive workloads, such as validators that run ML models, work best with dedicated memory and CPU. For example, a guard that uses a single Machine Learning (ML) model for validation can run in milliseconds on a GPU-equipped machine but may take tens of seconds on a standard CPU. Guardrailing orchestration itself, however, performs better on general-purpose compute.
To account for this, offload performance-critical validation work by:
- Using Guardrails Server to run certain guard executions on a dedicated server
- Using remote validation inference to configure validators to call a REST API for inference results instead of running models locally
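With remote validation inference, the local validator only has to package its input and POST it to an inference endpoint. The sketch below shows that request-building step with stdlib `json`; the endpoint URL and field names are illustrative assumptions, not the actual schema of the Guardrails inference server.

```python
import json

def build_inference_request(text: str, validator: str) -> bytes:
    """Build a JSON body for a hypothetical remote validator endpoint.

    The field names ("validator", "inputs") are illustrative; check your
    inference server's actual request schema before use.
    """
    payload = {"validator": validator, "inputs": [text]}
    return json.dumps(payload).encode("utf-8")

body = build_inference_request("some LLM output", "toxic-language")
# The request itself would be an HTTP POST to the remote endpoint, e.g.
# urllib.request.Request("https://validator.example.com/infer", data=body)
```

Keeping only this serialization step on the app server is what lets the heavy model inference run on GPU-backed hardware elsewhere.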
The OnFailAction.REASK and OnFailAction.FIX_REASK actions ask the LLM to correct its output, with OnFailAction.FIX_REASK also re-validating the revised output. In general, re-validation works best with a small, purpose-built LLM fine-tuned to your use case.
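The reask flow can be sketched as a validate-and-retry loop. The LLM and validator below are stubs invented for illustration (the stub "fixes" its output once it sees an error message); the real actions wrap this logic inside the guard.

```python
# Stub LLM: pretends to correct itself once the prompt includes an error.
def llm(prompt: str) -> str:
    return "valid output" if "error:" in prompt else "invalid output"

# Stub validator: returns (ok, error_message).
def validate(output: str) -> tuple[bool, str]:
    ok = output == "valid output"
    return ok, "" if ok else "output did not match the expected format"

def reask(prompt: str, max_reasks: int = 1) -> str:
    output = llm(prompt)
    for _ in range(max_reasks):
        ok, error = validate(output)
        if ok:
            return output
        # REASK: feed the validation error back to the LLM. FIX_REASK
        # additionally re-validates the revised output, which this loop
        # does on the next iteration (or in the final check below).
        output = llm(f"{prompt}\nerror: {error}\nPlease correct your answer.")
    ok, _ = validate(output)
    if not ok:
        raise ValueError("validation still failed after reasks")
    return output

result = reask("original prompt")
```

Each reask costs an extra LLM round trip, which is why a small, fast model fine-tuned for the correction step keeps this loop cheap.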