
toxic_language

guardrails hub install hub://guardrails/toxic_language --quiet
    Installing hub://guardrails/toxic_language...
✅Successfully installed guardrails/toxic_language!


Check whether an LLM-generated response contains toxic language

Using the ToxicLanguage validator

This is a simple walkthrough of the ToxicLanguage validator. This validator checks whether an LLM-generated response contains toxic language. It uses the pre-trained, multi-label unitary/unbiased-toxic-roberta model from Hugging Face to check whether the generated text is toxic. It supports both full-text-level and sentence-level validation.

# Import the guardrails package
# and the ToxicLanguage validator
# from Guardrails Hub
import guardrails as gd
from guardrails.hub import ToxicLanguage
from rich import print

# Create a Guard object with this validator
# Here, we'll use the default validation method of "sentence"
# and the default threshold of 0.25
guard = gd.Guard().use(
    ToxicLanguage(on_fail="fix")
)
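
The Guard above relies on the validator's defaults. If you want to set the behavior explicitly, the same arguments can be passed by name; the sketch below is illustrative and simply spells out the documented defaults.

# A minimal sketch of configuring ToxicLanguage explicitly
# (the values shown are just the documented defaults).
explicit_guard = gd.Guard().use(
    ToxicLanguage(
        threshold=0.25,                # toxicity score above which text fails
        validation_method="sentence",  # or "full" to validate the whole text at once
        on_fail="fix",                 # remove failing sentences instead of raising
    )
)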
# Parse the raw response
raw_response = """
What a lovely day it has been to roam around the city. I love it here!
I hate how pathetic and needy you can be sometimes.
"""
raw_llm_output, validated_output, *rest = guard.parse(
    llm_output=raw_response,
)

# Print the output
print(validated_output)

What a lovely day it has been to roam around the city.
I love it here!

As you can see, the sentence "I hate how pathetic and needy you can be sometimes." has been detected as toxic and hence removed from the validated output.
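
If you prefer not to unpack the return value positionally, you can also inspect the outcome object that guard.parse returns. A minimal sketch, assuming the returned ValidationOutcome exposes validated_output and validation_passed attributes:

# A minimal sketch of inspecting the outcome object directly
# (assumes guard.parse returns a ValidationOutcome with these fields).
outcome = guard.parse(llm_output=raw_response)
print(outcome.validation_passed)  # whether the (possibly fixed) output passed validation
print(outcome.validated_output)   # the text with toxic sentences removed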

# Parse the raw response
raw_response = """
I am a Generative AI model that is trained on a large corpus of text.
I am shocked by how disgusting and vile you are.
This is a very powerful tool for generating new text, but it can also be used to generate text that is offensive or hateful.
"""
raw_llm_output, validated_output, *rest = guard.parse(
    llm_output=raw_response,
)

# Print the output
print(validated_output)

I am a Generative AI model that is trained on a large corpus of text.
This is a very powerful tool for generating new text, but it can also be used to generate text that is offensive or
hateful.

Similarly, here the sentence "I am shocked by how disgusting and vile you are." has been detected as toxic and hence removed from the validated output.
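
The sensitivity of the check can be tuned with the threshold argument: lowering it below the default of 0.25 makes the validator stricter, raising it makes it more permissive. A minimal sketch with an illustrative (hypothetical) value:

# A minimal sketch of a stricter sentence-level guard
# (the threshold value is illustrative, not a recommendation).
strict_guard = gd.Guard().use(
    ToxicLanguage(threshold=0.1, validation_method="sentence", on_fail="fix")
)
strict_outcome = strict_guard.parse(llm_output=raw_response)
print(strict_outcome.validated_output)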

# Test with validation method 'full'
full_guard = gd.Guard().use(
    ToxicLanguage(validation_method="full", on_fail="fix")
)

# Parse the raw response
raw_response = "Stop being such a dumb piece of shit. Why can't you comprehend this?"
raw_llm_output, validated_output, *rest = full_guard.parse(
    llm_output=raw_response,
)

# Print the output
print(validated_output)

Here, validation runs on the entire text at once. Since toxic language was detected in the text as a whole, the full output fails validation and the validated output is empty.
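
If silently dropping the output is not the behavior you want, the validator can be configured to raise instead. A minimal sketch, assuming that on_fail="exception" surfaces the failure as guardrails' ValidationError:

# A minimal sketch of failing loudly instead of fixing.
# Assumes the failure is raised as guardrails.errors.ValidationError.
from guardrails.errors import ValidationError

exception_guard = gd.Guard().use(
    ToxicLanguage(validation_method="full", on_fail="exception")
)

try:
    exception_guard.parse(llm_output=raw_response)
except ValidationError as e:
    print(f"Validation failed: {e}")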