Leverage LiteLLM in Guardrails to Validate Any LLM's Output

Safeer Mohiuddin

June 20, 2024

Using Large Language Models (LLMs) effectively in your applications comes with a few key obstacles. One of the biggest is writing new code every time you need to swap out LLMs.

That's why we're happy to announce that LiteLLM and Guardrails have teamed up to help developers create AI-driven applications that work across a large suite of LLMs without requiring extensive code changes. Now, your apps can leverage the most appropriate LLM for a given task while also verifying and increasing the quality of its output.

LiteLLM: Bringing consistency to LLM calls

Ever since OpenAI broke open the AI floodgates, we've seen an explosion in LLMs. This variety is great, as different LLMs trained on different data sets will have different strengths. (As an example, see our recent benchmarking of which LLMs best handle structured data.)

However, with this variety comes increased complexity. There's no standard for LLM APIs, which means each LLM defines its own input and output format. As a result, for every LLM you call, you need to write a proxy layer that handles that specific LLM's request/response formats.
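As a rough sketch of the problem (model names are illustrative, and both calls assume the corresponding API keys are set as environment variables), compare how the OpenAI and Anthropic Python SDKs shape the same request and where each one puts the answer text:

from openai import OpenAI
import anthropic

# OpenAI: chat completions take a `messages` list and return `choices`.
openai_client = OpenAI()
openai_resp = openai_client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Suggest a name for my cat."}],
)
print(openai_resp.choices[0].message.content)

# Anthropic: the Messages API requires `max_tokens` and returns content blocks.
anthropic_client = anthropic.Anthropic()
anthropic_resp = anthropic_client.messages.create(
    model="claude-2.1",
    max_tokens=100,
    messages=[{"role": "user", "content": "Suggest a name for my cat."}],
)
print(anthropic_resp.content[0].text)

Two providers, two request shapes, two response shapes - and that's before you add retries, streaming, or error handling.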

This complicates writing applications that call multiple LLMs - for example, in apps that use prompt chaining. Prompt chaining breaks a task into subtasks, with each subtask handed off to one or more LLMs, and the outputs used to build up a final answer. As an example, you may use one LLM to extract text from a PDF document (because that LLM is better at it, faster, cheaper, etc.), and then use another LLM to answer questions about the extracted data.

You may also want to switch out different LLMs over time for cost or performance reasons. Or, you may improve the output you supply to a user by querying different LLMs and using a quorum algorithm to determine which ones give the most accurate response to particular questions.

This is where LiteLLM comes in. LiteLLM is an open-source library that provides a proxy layer over 100 different LLMs. You can call different LLMs' completion, embedding, and image generation endpoints by changing only a few parameters (usually just the API key and model name). No matter the LLM, LiteLLM will return the results in a consistent format. For example, a text response will always be available at ['choices'][0]['message']['content'].
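For example, here's a minimal sketch (assuming your OPENAI_API_KEY and ANTHROPIC_API_KEY environment variables are set) of swapping providers by changing nothing but the model name:

import litellm

question = [{"role": "user", "content": "Suggest a name for my cat."}]

# The call shape is the same for every provider; only the model name changes.
openai_resp = litellm.completion(model="gpt-4", messages=question)
anthropic_resp = litellm.completion(model="claude-2", messages=question)

# Regardless of provider, the answer text lives at the same path.
print(openai_resp['choices'][0]['message']['content'])
print(anthropic_resp['choices'][0]['message']['content'])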

You can also leverage LiteLLM to track your spend, view your activity, and load balance calls across multiple AI projects using either its OpenAI Proxy Server or its LiteLLM Python SDK.
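As a small illustration of the SDK side (a sketch - the estimate depends on the pricing data bundled with your LiteLLM version), you can estimate the spend for a single call directly from its response object:

import litellm

response = litellm.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Suggest a name for my cat."}],
)

# LiteLLM ships pricing metadata for the models it supports,
# so you can estimate per-call spend without extra bookkeeping.
cost = litellm.completion_cost(completion_response=response)
print(f"This call cost roughly ${cost:.6f}")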

Guardrails AI: Validating output across LLMs

Another challenge with LLMs is validating the output you receive. Achieving high-quality output from an LLM is usually a combination of several factors:

  • Using the right LLM for the task
  • Structuring the request in the right way for that particular LLM
  • Supplying enough additional context in the prompt to supplement the LLM's training data, which may be years out of date, with current facts

But how do you know whether the response an LLM sends back is correct or not? Usually, this involves a human reviewing and validating its output. That's not a scalable approach for most real-world AI app scenarios. And it's not a viable approach when building out a network of AI agents that query multiple LLMs automatically.

Guardrails defines an API and specification format that enables you to enforce automatic validation of LLM output. Using our Pydantic integration, our RAIL file format, or one of our available programmatic APIs, you can define any number of guards that inspect and approve or reject the output from an LLM, along with any number of programmatic responses to failed validations, including re-asking with an enhanced prompt.

Guardrails supports a large and growing number of pre-built LLM validators via Guardrails Hub. You can also create your own custom validators and upload them to the Hub for others to use.
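For instance, here's a minimal sketch (assuming the guardrails/valid_length validator from the Hub is installed, as shown in the walkthrough below) of validating text you already have, without calling an LLM at all:

from guardrails import Guard
from guardrails.hub import ValidLength

# A guard that rejects any output outside the 1-10 character range.
guard = Guard.from_string(
    validators=[ValidLength(min=1, max=10, on_fail="exception")]
)

# parse() runs the guard's validators against an existing string.
outcome = guard.parse("Whiskers")
print(outcome.validation_passed)  # True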

The benefits of using LiteLLM and Guardrails

Using Guardrails with any LLM makes for a powerful combination, as it means you can leverage the LLM that's the best fit for a given task without compromising on output quality. By taking advantage of the LiteLLM/Guardrails integration, you can call different LLMs - OpenAI, Anthropic, HuggingFace, Mistral, Cohere, AWS Bedrock, etc. - for different tasks while validating the results with a single, consistent syntax.

This approach also increases the reusability of your validation logic. Instead of validating output from three different models in three different ways, you can configure a validator once in Guardrails and use it across your company's various AI projects. This can help accelerate AI output quality across your organization.

Guardrails enhances the responses from LiteLLM by enabling you to control the response format across LLMs. If you ask four different LLMs a question, all four may return their answer in slightly different free-text formats. Using Guardrails, you can specify and enforce a specific output format (e.g., JSON, XML), and have Guardrails refine and reask the prompt if the first response fails to satisfy your specs.

How to use LiteLLM and Guardrails 

Let's see how this works in action. Traditionally, in Guardrails, to call an LLM, you have to do two things:

  • Configure a guard composed of validators that check the output from an LLM
  • Call the LLM with the Guardrails wrapper, which will inspect the output and implement re-ask or correction logic

First, install Guardrails and use the CLI to install the validator we'll use.

pip install guardrails-ai;
guardrails hub install hub://guardrails/valid_length;

Now, we can configure guards programmatically in code. Let's build a simple guard that ensures an LLM generates a cat name with the correct number of characters:

from guardrails import Guard
from guardrails.hub import ValidLength
import openai

guard = Guard.from_string(
  validators=[ValidLength(min=1, max=10, on_fail="exception")]
)

response = guard(
  llm_api=openai.chat.completions.create,
  prompt="Suggest a name for my cat that is between 1 and 10 characters long.",
  model="gpt-4",
  max_tokens=1024,
  temperature=0.5,
)

print(response)

(Note: This example assumes you've installed Guardrails and configured your OpenAI API key as an environment variable. If you haven't, check out one of our prior blog posts - it'll walk you through the process.)
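On macOS or Linux, setting the key typically looks like this (replace the placeholder with your own key):

export OPENAI_API_KEY="sk-..."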

This will result in a response like the following from OpenAI after it's been parsed and verified by Guardrails:

ValidationOutcome(
    raw_llm_output='"Whiskers"',
    validated_output='"Whiskers"',
    reask=None,
    validation_passed=True,
    error=None
)

Guardrails can do this because it integrates directly with OpenAI as an AI platform. But you can instead take advantage of every LLM supported by LiteLLM by making a few simple modifications to how you invoke the guard.

Before changing code, install LiteLLM if you don't already have it on your box:

pip install litellm

Then, change your code as follows:

from guardrails import Guard
from guardrails.hub import ValidLength
import openai
import litellm

guard = Guard.from_string(
  validators=[ValidLength(min=1, max=10, on_fail="reask")]
)

validated_response = guard(
    litellm.completion,
    model="gpt-4",
    max_tokens=500,
    msg_history=[{"role": "user", "content": "Suggest a name for my cat that is between 1 and 10 characters long."}]
)

print(validated_response)

To switch to using LiteLLM, you only need to import the litellm module, specify litellm.completion as the target operation, and format your messages according to LiteLLM's input format.

If you run this, you'll get the same output you did before. (OpenAI really seems to fancy “Whiskers” as a cat name.) If you want to get a second opinion, you can change a few parameters to call another model. For example, if you have Ollama installed with the llama2 model pulled, you can query it with:

validated_response = guard(
    litellm.completion,
    model="ollama/llama2",
    max_tokens=500,
    api_base="http://localhost:11434",
    msg_history=[{"role": "user", "content": "Suggest a name for my cat that is between 1 and 10 characters long."}],
)

You can ask the same question of Anthropic with the following (assuming you've set the ANTHROPIC_API_KEY environment variable and have available credits):

validated_response = guard(
    litellm.completion,
    model="claude-2",
    msg_history=[{"role": "user", "content": "Suggest a name for my cat that is between 1 and 10 characters long."}],
)

Note that, while the format LiteLLM returns to Guardrails will be the same, each LLM may decide to answer differently. For example, when we called Anthropic above, it sent its answer back in a format that actually violated our constraints:

ValidationOutcome(
    raw_llm_output=' Here are some short cat name ideas:\n\n- Kitty\n- Felix \n- Boots\n- Luna\n- Tiger\n- Oreo\n- Smokey\n- Shadow\n- Mittens\n- Fluffy',
    validated_output=' Here are ',
    reask=FieldReAsk(
        incorrect_value=' Here are some short cat name ideas:\n\n- Kitty\n- Felix \n- Boots\n- Luna\n- Tiger\n- Oreo\n- Smokey\n- Shadow\n- Mittens\n- Fluffy',
        fail_results=[
            FailResult(
                outcome='fail',
                metadata=None,
                error_message='Value has length greater than 10. Please return a shorter output, that is shorter than 10 characters.',
                fix_value=' Here are '
            )
        ],
        path=None
    ),
    validation_passed=True,
    error=None
)

Whereas OpenAI returned just a single cat name, Anthropic gave us a bunch of verbiage along with a full list of names. You can control this somewhat with better prompt engineering. You can also use Guardrails to ensure a consistent response across LLMs by specifying that the answer must come back as structured text (e.g., JSON) in a specific format. For an example, see our blog on generating synthetic structured data.
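As a sketch of what that can look like (assuming Guard.from_pydantic is available in your Guardrails version; the CatName model below is purely illustrative), you can ask every LLM for the same JSON shape and let Guardrails parse and validate it:

from pydantic import BaseModel, Field
from guardrails import Guard
import litellm

class CatName(BaseModel):
    name: str = Field(description="A cat name between 1 and 10 characters")

# Build the guard from the Pydantic model so the output must match its schema.
guard = Guard.from_pydantic(output_class=CatName)

validated_response = guard(
    litellm.completion,
    model="gpt-4",
    msg_history=[{
        "role": "user",
        "content": 'Suggest a name for my cat. Respond only with JSON in the form {"name": "..."}.',
    }],
)

# validated_output is parsed into the structure defined above,
# regardless of which model produced the raw text.
print(validated_response.validated_output)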

Conclusion

Previously, using numerous LLMs in concert required a lot of coding, testing, and re-testing to get the prompts and output formats just right. Using LiteLLM and Guardrails, you can simplify the process, reducing the code you must write to get a consistent, validated response from a wide range of AI models.

Tags:

integrations
