Async Stream-validate LLM responses
Asynchronous behavior is generally useful in LLM applications: it allows multiple long-running LLM requests to execute at once. Adding streaming on top lets us perform non-blocking, iterative validation over each stream as chunks arrive. This document explores how to implement this behavior using the Guardrails framework.
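To see why async matters here, consider several long-running requests issued concurrently. The sketch below uses a stand-in coroutine (`fake_llm_request` is hypothetical, simulating latency with `asyncio.sleep`) rather than a real LLM call, but the `asyncio.gather` pattern is the same one an async LLM client benefits from:

```python
import asyncio
import time

# Stand-in for a long-running LLM request (hypothetical; a real app
# would await an async client call such as litellm.acompletion).
async def fake_llm_request(prompt: str) -> str:
    await asyncio.sleep(0.5)  # simulate network + generation latency
    return f"response to: {prompt}"

async def main() -> list[str]:
    start = time.perf_counter()
    # The three requests run concurrently, so total wall time is
    # roughly 0.5s rather than the ~1.5s of three sequential calls.
    results = await asyncio.gather(
        fake_llm_request("a"),
        fake_llm_request("b"),
        fake_llm_request("c"),
    )
    elapsed = time.perf_counter() - start
    assert elapsed < 1.5  # faster than running sequentially
    return results

responses = asyncio.run(main())
```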
Note: learn more about streaming here.
# A few imports and global variables
from rich import print
import guardrails as gd
import litellm
from IPython.display import clear_output
import time
Setup
Install the necessary validators from Guardrails hub in your CLI.
!guardrails hub install hub://guardrails/competitor_check
Create the Guard object
Async Streaming
from guardrails.hub import CompetitorCheck
prompt = "Tell me about the Apple iPhone"
guard = gd.AsyncGuard().use(CompetitorCheck, ["Apple"])
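The `CompetitorCheck` validator flags mentions of the competitor names you configure (here, "Apple"). As a rough, hypothetical simplification of the kind of check it performs (the real validator from the Guardrails hub is more robust and handles validation outcomes and fixes):

```python
# Hypothetical simplification of a competitor-name check; the real
# CompetitorCheck validator from the Guardrails hub is more robust.
def contains_competitor(text: str, competitors: list[str]) -> bool:
    lowered = text.lower()
    return any(name.lower() in lowered for name in competitors)

flagged = contains_competitor("The Apple iPhone is popular.", ["Apple"])  # True
clean = contains_competitor("Android phones vary widely.", ["Apple"])     # False
```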
Example 1: No async streaming
By default, the `stream` parameter is set to `False`.
We will use LiteLLM to make our LLM calls.
# Wrap the litellm OpenAI API call with the `guard` object
raw_llm_output, validated_output, *rest = await guard(
litellm.acompletion,
model="gpt-3.5-turbo",
prompt=prompt,
max_tokens=1024,
temperature=0.3,
)
# Let's see the logs
print(guard.history.last.tree)
Example 2: Async Streaming
Set the `stream` parameter to `True`.
# Wrap the litellm OpenAI API call with the `guard` object
fragment_generator = await guard(
litellm.acompletion,
model="gpt-3.5-turbo",
prompt=prompt,
max_tokens=1024,
temperature=0,
stream=True,
)
async for op in fragment_generator:
    clear_output(wait=True)
    print(op)
    time.sleep(0.5)
# Let's see the logs
print(guard.history.last.tree)
As you can see, the outputs in both examples match. The difference is in when they arrive: with async streaming enabled, each chunk is returned as soon as it is received and validated by Guardrails, whereas in the non-streaming example the output is returned only after the API has processed the entire request.
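The chunk-by-chunk pattern above can be sketched with stand-in coroutines (`fake_llm_stream` and `validate_chunk` are hypothetical names; a real pipeline would stream from the LLM client and run hub validators inside the guard):

```python
import asyncio

async def fake_llm_stream():
    # Stand-in for a streaming LLM response arriving chunk by chunk.
    for chunk in ["Tell", " me", " about", " phones"]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield chunk

def validate_chunk(chunk: str) -> str:
    # Stand-in validation: a real guard would run its validators here
    # and raise or fix on failure.
    if "Apple" in chunk:
        raise ValueError("competitor mentioned")
    return chunk

async def main() -> str:
    validated = []
    # Each chunk is validated as soon as it arrives, without waiting
    # for the full response to finish generating.
    async for chunk in fake_llm_stream():
        validated.append(validate_chunk(chunk))
    return "".join(validated)

result = asyncio.run(main())
```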