Streaming
In production systems with user interaction, streaming output from LLMs greatly improves the user experience. Streaming lets you build real-time systems that minimize the time to first token (TTFT), rather than waiting for the entire response to be completed before progressing.
Guardrails natively supports validation of streaming output, with both synchronous and asynchronous approaches.
To stream with a Guard, set the 'stream' parameter to 'True'.
With streaming, not only do chunks from the LLM arrive as they are generated, but validation results can stream in real time as well.
To do this, validators specify a chunk strategy. By default, validators wait until they have accumulated a sentence’s worth of content from the LLM before running validation. Once they’ve run validation, they emit that result in real time.
In practice, this means that you do not have to wait until the LLM has finished outputting tokens to access validation results, which helps you create smoother and faster user experiences. It also means that validation can run on individual sentences rather than on the entire accumulated response, which reduces cost for validators that require expensive inference.
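The sentence-level chunking described above can be illustrated with a small, library-independent sketch. This is not Guardrails' implementation: the function name validate_stream, the regex used to detect sentence boundaries, and the toy validator are all assumptions made for illustration. It only shows the general idea of buffering LLM chunks until a sentence is complete, validating that sentence alone, and emitting the result immediately.

```python
import re
from typing import Callable, Iterable, Iterator, Tuple

# Assumed sentence boundary: '.', '!' or '?' followed by whitespace.
# A real chunk strategy may be more sophisticated.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def validate_stream(
    chunks: Iterable[str],
    validator: Callable[[str], bool],
) -> Iterator[Tuple[str, bool]]:
    """Yield (sentence, passed) pairs as soon as each sentence is complete."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence, validator(sentence)
        # The last part may be an unfinished sentence; keep buffering it.
        buffer = parts[-1]
    # Validate whatever remains once the stream ends.
    if buffer.strip():
        yield buffer, validator(buffer)


# Hypothetical validator: fails any sentence containing the word "bad".
def no_bad_words(sentence: str) -> bool:
    return "bad" not in sentence


llm_chunks = ["Hello wor", "ld. This is ba", "d. Fine"]
results = list(validate_stream(llm_chunks, no_bad_words))
```

Because the validator only ever sees one sentence at a time, an expensive check is never re-run on text that was already validated.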
To access these validation results, use the error_spans_in_output helper method on Guard. It returns an up-to-date list of all ranges of text in the output so far that have failed validation.
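To make the idea of error spans concrete, here is a minimal standalone sketch of how failed ranges could be tracked as validated text accumulates. The ErrorSpan and SpanTracker classes, their fields, and their methods are hypothetical stand-ins for illustration, not the Guardrails types; in real use, the list comes from the error_spans_in_output helper on Guard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ErrorSpan:
    """Hypothetical record of a failed range: [start, end) offsets plus a reason."""
    start: int
    end: int
    reason: str


class SpanTracker:
    """Accumulates streamed text and remembers which ranges failed validation."""

    def __init__(self) -> None:
        self.output = ""
        self._spans: List[ErrorSpan] = []

    def add(self, text: str, passed: bool, reason: str = "") -> None:
        # Offsets are measured against the full output accumulated so far.
        start = len(self.output)
        self.output += text
        if not passed:
            self._spans.append(ErrorSpan(start, len(self.output), reason))

    def error_spans(self) -> List[ErrorSpan]:
        # Return a copy so callers cannot mutate internal state.
        return list(self._spans)


tracker = SpanTracker()
tracker.add("Hello world. ", passed=True)
tracker.add("This is bad. ", passed=False, reason="contains 'bad'")
tracker.add("Fine", passed=True)

spans = tracker.error_spans()
failed_text = [tracker.output[s.start:s.end] for s in spans]
```

Since the spans are offsets into the accumulated output, a UI can call the helper after each streamed chunk and highlight or redact the failing ranges while the rest of the response continues to arrive.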