
Async Stream-validate LLM responses

Asynchronous behavior is generally useful in LLM applications: it allows multiple, long-running LLM requests to execute at once. Adding streaming on top lets us run non-blocking, iterative validations over each stream as chunks arrive. This document shows how to implement this behavior with the Guardrails framework.

Note: learn more about streaming here.

# Few imports and global variables
from rich import print
import guardrails as gd
import litellm
from IPython.display import clear_output
import time

Setup

Install the necessary validators from the Guardrails Hub via your CLI.

!guardrails hub install hub://guardrails/competitor_check
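These examples also assume the core Python packages imported above are already installed. If they are not, a pip command along these lines (package names inferred from the imports, so adjust to your environment) should cover them:

!pip install guardrails-ai litellm rich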

Create the Guard object for async streaming

from guardrails.hub import CompetitorCheck

prompt = "Tell me about the Apple iPhone"

guard = gd.AsyncGuard().use(CompetitorCheck, ["Apple"])
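You can also control what happens when CompetitorCheck fails by passing one of the standard Guardrails on_fail policies when attaching the validator. The variant below is only an illustration (the name strict_guard is ours; the examples that follow keep the guard defined above):

# Illustrative variant: raise an exception as soon as a competitor is mentioned.
# Other standard on_fail options include "fix", "filter", "refrain", and "noop".
strict_guard = gd.AsyncGuard().use(CompetitorCheck, ["Apple"], on_fail="exception")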

Example 1: No async streaming

By default, the stream parameter is set to False. We will use LiteLLM to make our LLM calls.

# Wrap the litellm OpenAI API call with the `guard` object
raw_llm_output, validated_output, *rest = await guard(
    litellm.acompletion,
    model="gpt-3.5-turbo",
    prompt=prompt,
    max_tokens=1024,
    temperature=0.3,
)
# Let's see the logs
print(guard.history.last.tree)
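Beyond the tree view, you can inspect the unpacked return values and the call history directly. A small sketch, assuming the standard ValidationOutcome and history fields:

# Sketch: inspect the outcome of the call above.
print(raw_llm_output)             # the text exactly as the LLM returned it
print(validated_output)           # the text after CompetitorCheck ran over it
print(guard.history.last.status)  # overall status of the last call, e.g. "pass"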

Example 2: Async Streaming

Set the stream parameter to True.

# Wrap the litellm OpenAI API call with the `guard` object
fragment_generator = await guard(
    litellm.acompletion,
    model="gpt-3.5-turbo",
    prompt=prompt,
    max_tokens=1024,
    temperature=0,
    stream=True,
)


async for op in fragment_generator:
    clear_output(wait=True)
    print(op)
    time.sleep(0.5)

# Let's see the logs
print(guard.history.last.tree)
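Each item yielded by the generator is a validation outcome for the latest chunk. If you would rather collect the validated text than print every fragment, a minimal sketch (assuming each outcome exposes the newly validated chunk via validated_output) could replace the loop above:

# Sketch: accumulate validated chunks into a single string as they arrive.
validated_text = ""
async for op in fragment_generator:
    if op.validated_output:
        validated_text += op.validated_output
print(validated_text)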

As you can see, the outputs of the two examples match. The difference is in when they arrive: with async streaming enabled, each chunk is returned as soon as it is received and validated by Guardrails, whereas in the non-streaming example the output is returned only after the API has processed the entire request.
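Finally, because the guard itself is asynchronous, it composes with the usual asyncio tooling. The sketch below is not part of the original example; it reuses the guard and litellm setup from above and simply runs two guarded, non-streaming requests concurrently with asyncio.gather.

import asyncio

# Sketch: issue two guarded requests at once and wait for both to finish.
async def ask(question: str):
    return await guard(
        litellm.acompletion,
        model="gpt-3.5-turbo",
        prompt=question,
        max_tokens=1024,
        temperature=0.3,
    )

results = await asyncio.gather(
    ask("Tell me about the Apple iPhone"),
    ask("Tell me about the history of smartphones"),
)
for outcome in results:
    print(outcome.validated_output)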