guardrails hub install hub://guardrails/valid_length --quiet
guardrails hub install hub://guardrails/two_words --quiet
guardrails hub install hub://guardrails/valid_range --quiet
Installing hub://guardrails/valid_length...
✅Successfully installed guardrails/valid_length!
Installing hub://guardrails/two_words...
✅Successfully installed guardrails/two_words!
Installing hub://guardrails/valid_range...
✅Successfully installed guardrails/valid_range!
Generating Structured Synthetic Data
In this example, we'll generate structured dummy data for a pandas dataframe.
We assume that:
- We don't need any external libraries that are not already installed in the environment.
- We are able to execute the code in the environment.
Objective
We want to generate structured synthetic data in which each column has a specific data type that every row respects. The data should also satisfy the following constraints:
- There should be exactly 10 rows in the dataset.
- Each user should have a first name and a last name.
- The number of orders associated with each user should be between 0 and 50.
- Each user should have a most recent order date.
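The constraints above can be written down as plain Python checks, independent of any LLM. The sketch below is illustrative only: `check_rows` and the `last_order_date` field name are hypothetical names chosen for this example, and the validation in this tutorial is performed by Guardrails validators, not by this function.

```python
from datetime import date


def check_rows(rows, expected_rows=10, min_orders=0, max_orders=50):
    """Return a list of human-readable constraint violations (empty if none)."""
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        # A first name and a last name: exactly two whitespace-separated words.
        if len(str(row.get("user_name", "")).split()) != 2:
            problems.append(f"row {i}: user_name is not two words")
        # The order count must be an integer in the allowed range.
        n = row.get("num_orders")
        if not isinstance(n, int) or not min_orders <= n <= max_orders:
            problems.append(f"row {i}: num_orders out of range")
        # Hypothetical field name; the value must parse as an ISO date (YYYY-MM-DD).
        try:
            date.fromisoformat(str(row.get("last_order_date")))
        except ValueError:
            problems.append(f"row {i}: last_order_date is not a valid date")
    return problems


good = [{"user_name": "John Doe", "num_orders": 12, "last_order_date": "2024-01-31"}] * 10
bad = good[:9] + [{"user_name": "Madonna", "num_orders": 75, "last_order_date": "n/a"}]
print(check_rows(good))
print(check_rows(bad))
```

Returning a list of messages rather than raising on the first failure makes it possible to report every problem in a dataset at once.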
Step 1: Generating Pydantic RAIL Spec
from pydantic import BaseModel, Field
from guardrails.hub import ValidLength, TwoWords, ValidRange
from datetime import date
from typing import List

prompt = """
Generate a dataset of fake user orders. Each row of the dataset should be valid.
${gr.complete_xml_suffix}
"""

class Order(BaseModel):
    user_id: str = Field(description="The user's id.")
    user_name: str = Field(
        description="The user's first name and last name",
        validators=[TwoWords(on_fail="noop")]
    )
    num_orders: int = Field(
        description="The number of orders the user has placed",
        validators=[ValidRange(0, 50, on_fail="noop")]
    )

class Orders(BaseModel):
    user_orders: List[Order] = Field(
        description="Generate a list of user, and how many orders they have placed in the past.",
        validators=[ValidLength(10, 10, on_fail="noop")]
    )
Step 2: Create a Guard object with the RAIL Spec
We create a gd.Guard object that will check, validate and correct the generated data. This object:
- Enforces the quality criteria specified in the RAIL spec (e.g. exactly 10 rows, two-word user names, order counts between 0 and 50).
- Takes corrective action when the quality criteria are not met (e.g. re-asking the LLM). Which action is taken depends on each validator's on_fail setting; the validators in Step 1 use on_fail="noop".
- Compiles the schema and type info from the RAIL spec and adds it to the prompt.
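To make the "validate, then re-ask" idea concrete, here is a minimal sketch of such a loop in plain Python. It is not how Guardrails implements re-asking internally; `validate`, `generate_with_reask`, and `fake_llm` are hypothetical names, and the stand-in model is scripted to fail once so the loop has something to correct.

```python
import json


def validate(data):
    """Return a list of error messages for the parsed LLM output."""
    rows = data.get("user_orders", [])
    errors = []
    if len(rows) != 10:
        errors.append(f"user_orders must contain exactly 10 items, got {len(rows)}")
    return errors


def generate_with_reask(llm, prompt, max_reasks=1):
    """Call llm(prompt), validate the JSON reply, and re-ask on failure."""
    current_prompt = prompt
    for attempt in range(max_reasks + 1):
        data = json.loads(llm(current_prompt))
        errors = validate(data)
        if not errors:
            return data, attempt
        # Feed the failures back to the model in the next prompt.
        current_prompt = prompt + "\nFix these problems: " + "; ".join(errors)
    return None, max_reasks


def fake_llm(prompt):
    """Stand-in for a real model: returns 9 rows first, 10 rows once re-asked."""
    n = 10 if "Fix these problems" in prompt else 9
    rows = [
        {"user_id": f"U{i:03d}", "user_name": "John Doe", "num_orders": i}
        for i in range(1, n + 1)
    ]
    return json.dumps({"user_orders": rows})


data, reasks_used = generate_with_reask(fake_llm, "Generate a dataset of fake user orders.")
print(len(data["user_orders"]), reasks_used)
```

Capping the number of re-asks keeps the loop from calling the model indefinitely when the output never passes validation.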
import guardrails as gd
from rich import print
guard = gd.Guard.for_pydantic(output_class=Orders)
Step 3: Wrap the LLM API call with Guard
# Add your OPENAI_API_KEY as an environment variable if it's not already set
# import os
# os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
res = guard(
    model="gpt-4o",
    messages=[{"role":"user", "content": prompt}],
    max_tokens=2048,
    temperature=0
)
/Users/dtam/dev/guardrails/guardrails/validator_service/__init__.py:85: UserWarning: Could not obtain an event loop. Falling back to synchronous validation.
warnings.warn(
The Guard object compiles the output schema and adds it to the prompt. We can see the final prompt below:
print(guard.history.last.iterations.last.inputs.messages[0]["content"])
Generate a dataset of fake user orders. Each row of the dataset should be valid.
Given below is XML that describes the information to extract from this document and the tags to extract it into.
<output>
<list description="Generate a list of user, and how many orders they have placed in the past."
format="guardrails/valid_length: 10 10" name="user_orders" required="true">
<object format="guardrails/valid_length: 10 10" required="true">
<string description="The user's id." name="user_id" required="true"></string>
<string description="The user's first name and last name" format="guardrails/two_words" name="user_name"
required="true"></string>
<integer description="The number of orders the user has placed" format="guardrails/valid_range: 0 50"
name="num_orders" required="true"></integer>
</object>
</list>
</output>
ONLY return a valid JSON object (no other text is necessary), where the key of the field in JSON is the `name`
attribute of the corresponding XML, and the value is of the type specified by the corresponding XML's tag. The JSON
MUST conform to the XML format, including any types and format requests e.g. requests for lists, objects and
specific types. Be correct and concise. If you are unsure anywhere, enter `null`.
Here are examples of simple (XML, JSON) pairs that show the expected behavior:
- `<string name='foo' format='two-words lower-case' />` => `{'foo': 'example one'}`
- `<list name='bar'><string format='upper-case' /></list>` => `{"bar": ['STRING ONE', 'STRING TWO', etc.]}`
- `<object name='baz'><string name="foo" format="capitalize two-words" /><integer name="index" format="1-indexed"
/></object>` => `{'baz': {'foo': 'Some String', 'index': 1}}`
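The mapping from field definitions to the XML block in the prompt above can be illustrated with the standard library. This is only a sketch of the idea, not Guardrails' actual schema compiler: `FIELDS` and `build_output_xml` are hypothetical, the field specs are written by hand to mirror the Order model, and the output is not guaranteed to match the compiled prompt character for character.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written field specs mirroring the Order model.
FIELDS = [
    ("string", "user_id", "The user's id.", None),
    ("string", "user_name", "The user's first name and last name", "guardrails/two_words"),
    ("integer", "num_orders", "The number of orders the user has placed",
     "guardrails/valid_range: 0 50"),
]


def build_output_xml(fields):
    """Render field specs as an <output> element similar to the prompt above."""
    output = ET.Element("output")
    items = ET.SubElement(output, "list", name="user_orders", required="true")
    row = ET.SubElement(items, "object", required="true")
    for tag, name, description, fmt in fields:
        attrs = {"description": description, "name": name, "required": "true"}
        if fmt is not None:
            # Validators show up as a "format" attribute in the prompt.
            attrs["format"] = fmt
        ET.SubElement(row, tag, attrs)
    ET.indent(output)  # pretty-print; available in Python 3.9+
    return ET.tostring(output, encoding="unicode")


xml_text = build_output_xml(FIELDS)
print(xml_text)
```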
print(res.raw_llm_output)
res.validated_output
```json
{
"user_orders": [
{
"user_id": "U001",
"user_name": "John Doe",
"num_orders": 12
},
{
"user_id": "U002",
"user_name": "Jane Smith",
"num_orders": 8
},
{
"user_id": "U003",
"user_name": "Alice Johnson",
"num_orders": 25
},
{
"user_id": "U004",
"user_name": "Bob Brown",
"num_orders": 15
},
{
"user_id": "U005",
"user_name": "Charlie Davis",
"num_orders": 30
},
{
"user_id": "U006",
"user_name": "Emily White",
"num_orders": 5
},
{
"user_id": "U007",
"user_name": "Frank Harris",
"num_orders": 20
},
{
"user_id": "U008",
"user_name": "Grace Lee",
"num_orders": 18
},
{
"user_id": "U009",
"user_name": "Henry Clark",
"num_orders": 22
},
{
"user_id": "U010",
"user_name": "Ivy Walker",
"num_orders": 10
}
]
}
```
```
{'user_orders': [{'user_id': 'U001', 'user_name': 'John Doe', 'num_orders': 12},
  {'user_id': 'U002', 'user_name': 'Jane Smith', 'num_orders': 8},
  {'user_id': 'U003', 'user_name': 'Alice Johnson', 'num_orders': 25},
  {'user_id': 'U004', 'user_name': 'Bob Brown', 'num_orders': 15},
  {'user_id': 'U005', 'user_name': 'Charlie Davis', 'num_orders': 30},
  {'user_id': 'U006', 'user_name': 'Emily White', 'num_orders': 5},
  {'user_id': 'U007', 'user_name': 'Frank Harris', 'num_orders': 20},
  {'user_id': 'U008', 'user_name': 'Grace Lee', 'num_orders': 18},
  {'user_id': 'U009', 'user_name': 'Henry Clark', 'num_orders': 22},
  {'user_id': 'U010', 'user_name': 'Ivy Walker', 'num_orders': 10}]}
```
Running the cell above returns:
- The raw LLM text output as a single string.
- A dictionary whose user_orders key contains a list of dictionaries, where each dictionary represents a row in the dataframe.
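The two return values differ in form: the raw output is a string wrapped in a Markdown code fence, while the validated output is already a Python dictionary. The sketch below shows, under simplifying assumptions, how such a fenced string could be parsed by hand and how the resulting rows can be regrouped into columns. `parse_fenced_json` and `to_columns` are hypothetical helpers, and the two-row sample string stands in for the real model output.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker (three backticks)


def parse_fenced_json(text):
    """Strip an optional Markdown code fence and parse the remaining JSON."""
    body = text.strip()
    if body.startswith(FENCE):
        # Drop the opening fence line (the fence plus an optional language tag).
        body = body.split("\n", 1)[1]
    if body.endswith(FENCE):
        body = body[: -len(FENCE)]
    return json.loads(body)


def to_columns(rows):
    """Turn a list of row dicts into a dict of column lists."""
    return {key: [row[key] for row in rows] for key in rows[0]}


raw = (
    FENCE + "json\n"
    '{"user_orders": ['
    '{"user_id": "U001", "user_name": "John Doe", "num_orders": 12}, '
    '{"user_id": "U002", "user_name": "Jane Smith", "num_orders": 8}]}\n'
    + FENCE
)
rows = parse_fenced_json(raw)["user_orders"]
print(to_columns(rows))
```

In practice the manual parsing is unnecessary, since res.validated_output is already parsed. The pandas DataFrame constructor accepts a list of row dictionaries directly, so the list under user_orders can be passed to it as-is.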
print(guard.history.last.tree)
Logs
└── ╭────────────────────────────────────────────────── Step 0 ───────────────────────────────────────────────────╮
│ ╭─────────────────────────────────────────────── Messages ────────────────────────────────────────────────╮ │
│ │ ┏━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │ │
│ │ ┃ Role ┃ Content ┃ │ │
│ │ ┡━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ │
│ │ │ user │ │ │ │
│ │ │ │ Generate a dataset of fake user orders. Each row of the dataset should be valid. │ │ │
│ │ │ │ │ │ │
│ │ │ │ │ │ │
│ │ │ │ Given below is XML that describes the information to extract from this document and the tags │ │ │
│ │ │ │ to extract it into. │ │ │
│ │ │ │ │ │ │
│ │ │ │ <output> │ │ │
│ │ │ │ <list description="Generate a list of user, and how many orders they have placed in the │ │ │
│ │ │ │ past." format="guardrails/valid_length: 10 10" name="user_orders" required="true"> │ │ │
│ │ │ │ <object format="guardrails/valid_length: 10 10" required="true"> │ │ │
│ │ │ │ <string description="The user's id." name="user_id" required="true"></string> │ │ │
│ │ │ │ <string description="The user's first name and last name" │ │ │
│ │ │ │ format="guardrails/two_words" name="user_name" required="true"></string> │ │ │
│ │ │ │ <integer description="The number of orders the user has placed" │ │ │
│ │ │ │ format="guardrails/valid_range: 0 50" name="num_orders" required="true"></integer> │ │ │
│ │ │ │ </object> │ │ │
│ │ │ │ </list> │ │ │
│ │ │ │ </output> │ │ │
│ │ │ │ │ │ │
│ │ │ │ ONLY return a valid JSON object (no other text is necessary), where the key of the field in │ │ │
│ │ │ │ JSON is the `name` attribute of the corresponding XML, and the value is of the type │ │ │
│ │ │ │ specified by the corresponding XML's tag. The JSON MUST conform to the XML format, including │ │ │
│ │ │ │ any types and format requests e.g. requests for lists, objects and specific types. Be │ │ │
│ │ │ │ correct and concise. If you are unsure anywhere, enter `null`. │ │ │
│ │ │ │ │ │ │
│ │ │ │ Here are examples of simple (XML, JSON) pairs that show the expected behavior: │ │ │
│ │ │ │ - `<string name='foo' format='two-words lower-case' />` => `{'foo': 'example one'}` │ │ │
│ │ │ │ - `<list name='bar'><string format='upper-case' /></list>` => `{"bar": ['STRING ONE', │ │ │
│ │ │ │ 'STRING TWO', etc.]}` │ │ │
│ │ │ │ - `<object name='baz'><string name="foo" format="capitalize two-words" /><integer │ │ │
│ │ │ │ name="index" format="1-indexed" /></object>` => `{'baz': {'foo': 'Some String', 'index': │ │ │
│ │ │ │ 1}}` │ │ │
│ │ │ │ │ │ │
│ │ │ │ │ │ │
│ │ └──────┴──────────────────────────────────────────────────────────────────────────────────────────────┘ │ │
│ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ ╭──────────────────────────────────────────── Raw LLM Output ─────────────────────────────────────────────╮ │
│ │ ```json │ │
│ │ { │ │
│ │ "user_orders": [ │ │
│ │ { │ │
│ │ "user_id": "U001", │ │
│ │ "user_name": "John Doe", │ │
│ │ "num_orders": 12 │ │
│ │ }, │ │
│ │ { │ │
│ │ "user_id": "U002", │ │
│ │ "user_name": "Jane Smith", │ │
│ │ "num_orders": 8 │ │
│ │ }, │ │
│ │ { │ │
│ │ "user_id": "U003", │ │
│ │ "user_name": "Alice Johnson", │ │
│ │ "num_orders": 25 │ │
│ │ }, │ │
│ │ { │ │
│ │ "user_id": "U004", │ │
│ │ "user_name": "Bob Brown", │ │
│ │ "num_orders": 15 │ │
│ │ }, │ │
│ │ { │ │
│ │ "user_id": "U005", │ │
│ │ "user_name": "Charlie Davis", │ │
│ │ "num_orders": 30 │ │
│ │ }, │ │
│ │ { │ │
│ │ "user_id": "U006", │ │
│ │ "user_name": "Emily White", │ │
│ │ "num_orders": 5 │ │
│ │ }, │ │
│ │ { │ │
│ │ "user_id": "U007", │ │
│ │ "user_name": "Frank Harris", │ │
│ │ "num_orders": 20 │ │
│ │ }, │ │
│ │ { │ │
│ │ "user_id": "U008", │ │
│ │ "user_name": "Grace Lee", │ │
│ │ "num_orders": 18 │ │
│ │ }, │ │
│ │ { │ │
│ │ "user_id": "U009", │ │
│ │ "user_name": "Henry Clark", │ │
│ │ "num_orders": 22 │ │
│ │ }, │ │
│ │ { │ │
│ │ "user_id": "U010", │ │
│ │ "user_name": "Ivy Walker", │ │
│ │ "num_orders": 10 │ │
│ │ } │ │
│ │ ] │ │
│ │ } │ │
│ │ ``` │ │
│ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
│ ╭─────────────────────────────────────────── Validated Output ────────────────────────────────────────────╮ │
│ │ { │ │
│ │ 'user_orders': [ │ │
│ │ {'user_id': 'U001', 'user_name': 'John Doe', 'num_orders': 12}, │ │
│ │ {'user_id': 'U002', 'user_name': 'Jane Smith', 'num_orders': 8}, │ │
│ │ {'user_id': 'U003', 'user_name': 'Alice Johnson', 'num_orders': 25}, │ │
│ │ {'user_id': 'U004', 'user_name': 'Bob Brown', 'num_orders': 15}, │ │
│ │ {'user_id': 'U005', 'user_name': 'Charlie Davis', 'num_orders': 30}, │ │
│ │ {'user_id': 'U006', 'user_name': 'Emily White', 'num_orders': 5}, │ │
│ │ {'user_id': 'U007', 'user_name': 'Frank Harris', 'num_orders': 20}, │ │
│ │ {'user_id': 'U008', 'user_name': 'Grace Lee', 'num_orders': 18}, │ │
│ │ {'user_id': 'U009', 'user_name': 'Henry Clark', 'num_orders': 22}, │ │
│ │ {'user_id': 'U010', 'user_name': 'Ivy Walker', 'num_orders': 10} │ │
│ │ ] │ │
│ │ } │ │
│ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯