
toxic_language

guardrails hub install hub://guardrails/toxic_language --quiet
    Installing hub://guardrails/toxic_language...
✅Successfully installed guardrails/toxic_language!


Check whether an LLM-generated response contains toxic language

Using the ToxicLanguage validator

This is a simple walkthrough of the ToxicLanguage validator, which checks whether an LLM-generated response contains toxic language. It uses the pre-trained multi-label model unitary/unbiased-toxic-roberta from Hugging Face to score the generated text for toxicity, and it supports both full-text-level and sentence-level validation.
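
At a glance, the validator's configuration looks like this. The snippet below is a minimal sketch using the parameters shown in this walkthrough (validation_method, threshold, and on_fail); exact defaults depend on the installed version of the validator.

from guardrails.hub import ToxicLanguage

# Sentence-level validation (the default), with an explicit threshold
toxic_sentence_level = ToxicLanguage(
    validation_method="sentence",
    threshold=0.25,
    on_fail="fix",
)

# Full-text validation, scoring the entire response at once
toxic_full_text = ToxicLanguage(
    validation_method="full",
    on_fail="fix",
)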

# Import the guardrails package
# and the ToxicLanguage validator
# from Guardrails Hub
import guardrails as gd
from guardrails.hub import ToxicLanguage
from rich import print
# Create a Guard object with this validator
# Here, we'll use the default validation method of "sentence"
# and the default threshold of 0.25
guard = gd.Guard().use(
    ToxicLanguage(on_fail="fix")
)
# Parse the raw response
raw_response = """
What a lovely day it has been to roam around the city. I love it here!
I hate how pathetic and needy you can be sometimes.
"""
raw_llm_output, validated_output, *rest = guard.parse(
    llm_output=raw_response,
)

# Print the output
print(validated_output)
What a lovely day it has been to roam around the city.
I love it here!

As you can see, the sentence "I hate how pathetic and needy you can be sometimes." has been detected as toxic and removed from the validated output.
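
Instead of tuple-unpacking the return value of guard.parse, you can keep the whole outcome object and inspect it. The sketch below assumes the outcome exposes validated_output and validation_passed fields, as in recent Guardrails releases; check your version if the attribute names differ.

# Keep the full outcome object rather than unpacking it
outcome = guard.parse(llm_output=raw_response)

print(outcome.validated_output)   # the text with the toxic sentence removed
print(outcome.validation_passed)  # whether the (possibly fixed) output passed validation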

# Parse the raw response
raw_response = """
I am a Generative AI model that is trained on a large corpus of text.
I am shocked by how disgusting and vile you are.
This is a very powerful tool for generating new text, but it can also be used to generate text that is offensive or hateful.
"""
raw_llm_output, validated_output, *rest = guard.parse(
    llm_output=raw_response,
)

# Print the output
print(validated_output)
I am a Generative AI model that is trained on a large corpus of text.
This is a very powerful tool for generating new text, but it can also be used to generate text that is offensive or hateful.

Similarly, the sentence "I am shocked by how disgusting and vile you are." has been detected as toxic and removed from the validated output.
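
If you prefer a hard failure instead of silently removing toxic sentences, the validator can be configured with on_fail="exception", so that a failing response raises an error you can catch. A minimal sketch (the exact exception type may vary across Guardrails versions, so a broad except is used here):

strict_guard = gd.Guard().use(
    ToxicLanguage(on_fail="exception")
)

try:
    strict_guard.parse(
        llm_output="I hate how pathetic and needy you can be sometimes."
    )
except Exception as e:
    print(f"Validation failed: {e}")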

# Test with validation method 'full'
full_guard = gd.Guard().use(
    ToxicLanguage(validation_method="full", on_fail="fix")
)

# Parse the raw response
raw_response = "Stop being such a dumb piece of shit. Why can't you comprehend this?"
raw_llm_output, validated_output, *rest = full_guard.parse(
    llm_output=raw_response,
)

# Print the output
print(validated_output)
Why can't you comprehend this?

Here, validation runs over the entire text at once. Toxic language was detected in the response, so the "fix" action removes the toxic portion and only the non-toxic remainder is returned in the validated output, as shown above.
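
For contrast, a response with no toxic language should pass through the full-text guard unchanged; a quick check, reusing full_guard from above:

# Non-toxic text should come back unchanged
clean_response = "Thanks for the question! Let me walk you through the steps."
raw_llm_output, validated_output, *rest = full_guard.parse(
    llm_output=clean_response,
)
print(validated_output)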