In production systems with user interaction, streaming output from LLMs greatly improves the user experience. Streaming lets you build real-time systems that minimize time to first token (TTFT) instead of waiting for the entire response to be generated before processing it. Guardrails natively supports validation of streaming output, with both synchronous and asynchronous approaches.

Basic streaming

Streaming with a Guard can be enabled by setting the stream parameter to True:
from rich import print
import guardrails as gd
from guardrails.hub import CompetitorCheck

prompt = "Tell me about the Apple iPhone"
guard = gd.Guard().use(CompetitorCheck, ["Apple"])

fragment_generator = guard(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about LLM streaming APIs."},
    ],
    max_tokens=1024,
    temperature=0,
    stream=True,
)

for op in fragment_generator:
    print(op)
With streaming, not only do chunks from the LLM arrive as they are generated, but validation results can stream in real time as well. To do this, validators specify a chunk strategy. By default, validators wait until they have accumulated a sentence's worth of content from the LLM before running validation, and then emit the result in real time. In practice, this means you do not have to wait until the LLM has finished outputting tokens to access validation results, which helps you create smoother and faster user experiences. It also means that validation can run on individual sentences instead of the entire accumulated response, which helps save on costs for validators that require expensive inference.

To access these validation results, use the error_spans_in_output helper method on Guard. It returns an up-to-date list of all ranges of text in the output so far that have failed validation.
error_spans = guard.error_spans_in_output()
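For example, here is a minimal sketch of checking error spans while consuming the stream. The start, end, and reason attributes on each span are assumptions for illustration; consult your Guardrails version for the exact fields:
from guardrails import Guard
from guardrails.hub import CompetitorCheck

guard = Guard().use(CompetitorCheck, ["Apple"])

fragment_generator = guard(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me about the Apple iPhone"}],
    stream=True,
)

for op in fragment_generator:
    print(op)
    # Ranges of streamed text that have failed validation so far.
    for span in guard.error_spans_in_output():
        # start/end/reason are assumed field names, shown for illustration
        print(f"Failed span [{span.start}:{span.end}]: {span.reason}")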

Async streaming

In cases where many concurrent network calls are happening (many LLM calls!), it may be beneficial to use an asynchronous LLM client. Guardrails also natively supports asynchronous streaming calls.
guard = gd.Guard()

fragment_generator = await guard(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about the streaming API of guardrails."},
    ],
    max_tokens=1024,
    temperature=0,
    stream=True,
)

async for op in fragment_generator:
    print(op)
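The await call above must run inside an async context, such as a notebook cell or a coroutine driven by an event loop. Here is a minimal sketch of wrapping the same pattern in a coroutine, assuming the guard call is awaitable exactly as shown above:
import asyncio
import guardrails as gd

async def main():
    guard = gd.Guard()
    fragment_generator = await guard(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": "Tell me about the streaming API of guardrails."},
        ],
        max_tokens=1024,
        temperature=0,
        stream=True,
    )
    # Consume validated chunks as they arrive.
    async for op in fragment_generator:
        print(op)

asyncio.run(main())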

Using validators with streaming

To stream validated output, you only need to pass the stream=True flag to the guard call. This returns a generator that yields validation results as they are processed; each result exposes the validated output, as shown below.
from guardrails import Guard
import os

# Set your OpenAI API key
# os.environ['OPENAI_API_KEY'] = ""

guard = Guard()
stream_chunk_generator = guard(
    messages=[{"role": "user", "content": "How many moons does Jupiter have?"}],
    model="gpt-3.5-turbo",
    stream=True
)

# Print the validated output as it is processed
for chunk in stream_chunk_generator:
    print(f"{chunk.validated_output}")
Using validators with streaming works the same way. Note that not all on_fail actions are supported with streaming. First, install the validator from the Guardrails Hub:
guardrails hub install hub://guardrails/profanity_free
from guardrails import Guard
from guardrails.hub import ProfanityFree
import os

# Set your OpenAI API key
# os.environ['OPENAI_API_KEY'] = ""

guard = Guard().use(ProfanityFree())
stream_chunk_generator = guard(
    messages=[{"role": "user", "content": "How many moons does Jupiter have?"}],
    model="gpt-3.5-turbo",
    stream=True
)

# Print the validated output as it is processed
for chunk in stream_chunk_generator:
    print(f"{chunk.validated_output}")