| Developed by | Guardrails AI |
|---|---|
| Date of development | Mar 2026 |
| Validator type | Moderation |
| License | Apache 2 |
| Input/Output | Output |
This validator detects toxic language in LLM-generated text using an LLM as the detection backbone (via LiteLLM). It is a clean, LLM-based alternative to the model-based ToxicLanguage validator, which relies on the Detoxify toxic-bert model.
Instead of downloading and running a local classification model, this validator sends text to an LLM that evaluates it across seven toxicity categories.
The validator supports two validation modes: `sentence`, which evaluates each sentence individually, and `full`, which evaluates the entire text at once. In sentence mode, the `fix` action drops the toxic sentences and returns the remaining text as the `fix_value`.

Dependencies:
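Under the hood, an LLM-backed check of this kind typically builds a classification prompt and sends it through LiteLLM. The sketch below is illustrative only: the prompt wording, the category names, and the response format are assumptions for demonstration, not the validator's actual implementation.

```python
# Illustrative sketch only: the prompt text, category list, and response
# handling below are assumptions, not the validator's real implementation.

# Hypothetical category list; the validator's actual seven categories may differ.
CATEGORIES = [
    "toxicity", "severe_toxicity", "obscene", "threat",
    "insult", "identity_attack", "sexual_explicit",
]

def build_toxicity_prompt(text: str) -> str:
    """Build a prompt asking the LLM to score the text per category."""
    return (
        "Rate the following text from 0.0 to 1.0 for each category: "
        + ", ".join(CATEGORIES)
        + ". Respond with a JSON object mapping category to score.\n\n"
        + f"Text: {text}"
    )

# The actual request would go through LiteLLM, roughly like:
# import json, litellm
# response = litellm.completion(
#     model="anthropic/claude-haiku-4-5-20251001",
#     messages=[{"role": "user", "content": build_toxicity_prompt("some text")}],
# )
# scores = json.loads(response.choices[0].message.content)
```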
Foundation model access keys:
- `ANTHROPIC_API_KEY` (required for the default Claude Haiku model)
- `OPENAI_API_KEY` (required for OpenAI models)

Installation:

```bash
guardrails hub install hub://guardrails/toxic_language_llm
```
In this example, we apply the validator to a string output generated by an LLM.
```python
# Import Guard and Validator
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Use with default settings (sentence mode, threshold 0.5, Claude Haiku)
guard = Guard().use(ToxicLanguageLLM)

guard.validate("The weather is beautiful today.")  # Validator passes
guard.validate("You are a terrible person.")  # Validator fails
```
```python
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Strict full-text validation with a lower threshold
guard = Guard().use(
    ToxicLanguageLLM,
    threshold=0.3,
    validation_method="full",
    on_fail="exception",
)

guard.validate("The project is going well.")  # Validator passes
```
```python
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Use OpenAI model instead of the default Claude Haiku
guard = Guard().use(
    ToxicLanguageLLM,
    model="openai/gpt-4o-mini",
    on_fail="fix",
)

result = guard.validate("Clean sentence. Toxic sentence here.")
# result.validated_output contains only the clean sentences
```
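The `fix` behavior in sentence mode can be illustrated with a small standalone sketch: split the text into sentences, score each one, and keep only the sentences below the threshold. The `score_toxicity` stub below stands in for the real LLM call; it is an assumption for demonstration, not part of the validator.

```python
# Standalone sketch of sentence-mode "fix". score_toxicity is a stub
# standing in for the real LLM call.
import re

def score_toxicity(sentence: str) -> float:
    # Stub: pretend anything containing "toxic" scores high.
    return 0.9 if "toxic" in sentence.lower() else 0.1

def fix_toxic_sentences(text: str, threshold: float = 0.5) -> str:
    # Split on sentence-ending punctuation, score each sentence,
    # and rejoin only the ones below the threshold.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    clean = [s for s in sentences if score_toxicity(s) < threshold]
    return " ".join(clean)

print(fix_toxic_sentences("Clean sentence. Toxic sentence here."))
# Prints: Clean sentence.
```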
In this example, we apply the validator to a string field of a JSON output generated by an LLM.
```python
# Import Guard and Validator
from pydantic import BaseModel, Field
from guardrails.hub import ToxicLanguageLLM
from guardrails import Guard

# Initialize Validator
val = ToxicLanguageLLM(threshold=0.5, validation_method="sentence")

# Create Pydantic BaseModel
class ChatResponse(BaseModel):
    user_name: str
    message: str = Field(validators=[val])

# Create a Guard to check for valid Pydantic output
guard = Guard.from_pydantic(output_class=ChatResponse)

# Run LLM output generating JSON through guard
guard.parse("""
{
    "user_name": "Alice",
    "message": "Hello, how are you today?"
}
""")
```
`__init__(self, threshold=0.5, validation_method="sentence", model=None, on_fail="noop")`
Initializes a new instance of the ToxicLanguageLLM class.
Parameters
- `threshold` (float): Confidence score threshold for toxicity classification. Scores at or above this value are flagged as toxic. Defaults to `0.5`.
- `validation_method` (str): Either `"sentence"` to evaluate individual sentences or `"full"` to evaluate the entire text. Defaults to `"sentence"`.
- `model` (str, optional): LiteLLM model identifier to use for toxicity detection. Defaults to the latest Claude Haiku model (`anthropic/claude-haiku-4-5-20251001`).
- `on_fail` (str, Callable): The policy to enact when a validator fails. If a string, must be one of `reask`, `fix`, `filter`, `refrain`, `noop`, `exception`, or `fix_reask`. Otherwise, must be a function that is called when the validator fails.

`validate(self, value, metadata) -> ValidationResult`
Validates the given value for toxic language using the configured LLM, relying on the metadata provided to customize the validation process. This method is automatically invoked by guard.parse(...) or guard.validate(...).
Note:
- This method should not be called directly by the user. Instead, invoke `guard.parse(...)` or `guard.validate(...)`, where this method will be called internally for each associated Validator.
- When invoking `guard.parse(...)`, ensure to pass the appropriate metadata dictionary that includes keys and values required by this validator. If `guard` is associated with multiple validators, combine all necessary metadata into a single dictionary.

Parameters
- `value` (Any): The input text to validate.
- `metadata` (dict): A dictionary containing metadata. This validator does not require any specific metadata keys.
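The threshold comparison described above ("scores at or above this value are flagged as toxic") amounts to a simple inclusive check; a minimal sketch:

```python
def is_toxic(score: float, threshold: float = 0.5) -> bool:
    # "At or above" means the comparison is inclusive (>=), so a score
    # exactly equal to the threshold is flagged.
    return score >= threshold

print(is_toxic(0.5))   # True: equal to the threshold counts as toxic
print(is_toxic(0.49))  # False: strictly below the threshold passes
```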