| Developed by | Guardrails AI |
|---|---|
| Date of development | Mar 2026 |
| Validator type | Moderation |
| License | Apache 2 |
| Input/Output | Output |
This validator detects toxic language in LLM-generated text using an LLM as the detection backbone (via LiteLLM). It is a clean, LLM-based alternative to the model-based ToxicLanguage validator, which relies on the Detoxify toxic-bert model.
Instead of downloading and running a local classification model, this validator sends text to an LLM that evaluates it across seven toxicity categories.
The validator supports two validation modes: `sentence`, which evaluates each sentence individually, and `full`, which evaluates the entire text at once. In sentence mode, the `fix` action drops the toxic sentences and returns the remaining text as the `fix_value`.

Dependencies:
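Under the hood, an LLM-backed check of this kind typically builds a classification prompt and sends it through LiteLLM. The sketch below is illustrative only: the prompt wording, the category names, and the response format are assumptions for demonstration, not the validator's actual implementation.

```python
# Illustrative sketch only: the prompt text, category list, and response
# handling below are assumptions, not the validator's real implementation.

# Hypothetical category list; the validator's actual seven categories may differ.
CATEGORIES = [
    "toxicity", "severe_toxicity", "obscene", "threat",
    "insult", "identity_attack", "sexual_explicit",
]

def build_toxicity_prompt(text: str) -> str:
    """Build a prompt asking the LLM to score the text per category."""
    return (
        "Rate the following text from 0.0 to 1.0 for each category: "
        + ", ".join(CATEGORIES)
        + ". Respond with a JSON object mapping category to score.\n\n"
        + f"Text: {text}"
    )

# The actual request would go through LiteLLM, roughly like:
# import json, litellm
# response = litellm.completion(
#     model="anthropic/claude-haiku-4-5-20251001",
#     messages=[{"role": "user", "content": build_toxicity_prompt("some text")}],
# )
# scores = json.loads(response.choices[0].message.content)
```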
Foundation model access keys:
- `ANTHROPIC_API_KEY` (required for the default Claude Haiku model)
- `OPENAI_API_KEY` (required for OpenAI models)

Installation:

```bash
guardrails hub install hub://guardrails/toxic_language_llm
```
In this example, we apply the validator to a string output generated by an LLM.
```python
# Import Guard and Validator
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Use with default settings (sentence mode, threshold 0.5, Claude Haiku)
guard = Guard().use(ToxicLanguageLLM)

guard.validate("The weather is beautiful today.")  # Validator passes
guard.validate("You are a terrible person.")  # Validator fails
```
```python
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Strict full-text validation with a lower threshold
guard = Guard().use(
    ToxicLanguageLLM,
    threshold=0.3,
    validation_method="full",
    on_fail="exception",
)

guard.validate("The project is going well.")  # Validator passes
```
```python
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Use OpenAI model instead of the default Claude Haiku
guard = Guard().use(
    ToxicLanguageLLM,
    model="openai/gpt-4o-mini",
    on_fail="fix",
)

result = guard.validate("Clean sentence. Toxic sentence here.")
# result.validated_output contains only the clean sentences
```
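The `fix` behavior in sentence mode can be illustrated with a small standalone sketch: split the text into sentences, score each one, and keep only the sentences below the threshold. The `score_toxicity` stub below stands in for the real LLM call; it is an assumption for demonstration, not part of the validator.

```python
# Standalone sketch of sentence-mode "fix". score_toxicity is a stub
# standing in for the real LLM call.
import re

def score_toxicity(sentence: str) -> float:
    # Stub: pretend anything containing "toxic" scores high.
    return 0.9 if "toxic" in sentence.lower() else 0.1

def fix_toxic_sentences(text: str, threshold: float = 0.5) -> str:
    # Split on sentence-ending punctuation, score each sentence,
    # and rejoin only the ones below the threshold.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    clean = [s for s in sentences if score_toxicity(s) < threshold]
    return " ".join(clean)

print(fix_toxic_sentences("Clean sentence. Toxic sentence here."))
# Prints: Clean sentence.
```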
In this example, we apply the validator to a string field of a JSON output generated by an LLM.
```python
# Import Guard and Validator
from pydantic import BaseModel, Field
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Initialize Validator
val = ToxicLanguageLLM(threshold=0.5, validation_method="sentence")

# Create Pydantic BaseModel
class ChatResponse(BaseModel):
    user_name: str
    message: str = Field(validators=[val])

# Create a Guard to check for valid Pydantic output
guard = Guard.from_pydantic(output_class=ChatResponse)

# Run LLM output generating JSON through guard
guard.parse("""
{
    "user_name": "Alice",
    "message": "Hello, how are you today?"
}
""")
```
`__init__(self, threshold=0.5, validation_method="sentence", model=None, on_fail="noop")`
Initializes a new instance of the ToxicLanguageLLM class.
Parameters
- `threshold` (float): Confidence score threshold for toxicity classification. Scores at or above this value are flagged as toxic. Defaults to `0.5`.
- `validation_method` (str): Either `"sentence"` to evaluate individual sentences or `"full"` to evaluate the entire text. Defaults to `"sentence"`.
- `model` (str, optional): LiteLLM model identifier to use for toxicity detection. Defaults to the latest Claude Haiku model (`anthropic/claude-haiku-4-5-20251001`).
- `on_fail` (str, Callable): The policy to enact when a validator fails. If a string, must be one of `reask`, `fix`, `filter`, `refrain`, `noop`, `exception`, or `fix_reask`. Otherwise, must be a function that is called when the validator fails.

`validate(self, value, metadata) -> ValidationResult`
Validates the given value for toxic language using the configured LLM, relying on the metadata provided to customize the validation process. This method is automatically invoked by guard.parse(...) or guard.validate(...).
Note:
- This method should not be called directly by the user. Instead, invoke `guard.parse(...)` or `guard.validate(...)`, where this method will be called internally for each associated Validator.
- When invoking `guard.parse(...)`, ensure to pass the appropriate metadata dictionary that includes keys and values required by this validator. If `guard` is associated with multiple validators, combine all necessary metadata into a single dictionary.

Parameters
- `value` (Any): The input text to validate.
- `metadata` (dict): A dictionary containing metadata. This validator does not require any specific metadata keys.
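The threshold comparison described above ("scores at or above this value are flagged as toxic") amounts to a simple inclusive check; a minimal sketch:

```python
def is_toxic(score: float, threshold: float = 0.5) -> bool:
    # "At or above" means the comparison is inclusive (>=), so a score
    # exactly equal to the threshold is flagged.
    return score >= threshold

print(is_toxic(0.5))   # True: equal to the threshold counts as toxic
print(is_toxic(0.49))  # False: strictly below the threshold passes
```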