guardrails hub install hub://guardrails/valid_length --quiet
guardrails hub install hub://guardrails/two_words --quiet
guardrails hub install hub://guardrails/valid_range --quiet
    Installing hub://guardrails/valid_length...
✅Successfully installed guardrails/valid_length!


Installing hub://guardrails/two_words...
✅Successfully installed guardrails/two_words!


Installing hub://guardrails/valid_range...
✅Successfully installed guardrails/valid_range!


Generating Structured Synthetic Data


In this example, we'll generate structured dummy data for a pandas dataframe.

We assume that:

  1. We don't need any external libraries that are not already installed in the environment.
  2. We are able to execute the code in the environment.

Objective

We want to generate structured synthetic data, where each column has a specific data type. All rows in the dataset must respect the column data types. The data should also satisfy the following constraints:

  1. There should be exactly 10 rows in the dataset.
  2. Each user should have a first name and a last name.
  3. The number of orders associated with each user should be between 0 and 50.
  4. Each user should have a most recent order date.
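As a rough illustration of what these constraints mean in practice, the sketch below checks constraints 1 to 3 in plain Python, without Guardrails. The function name `check_dataset` and the column names `user_name` and `num_orders` are assumptions made for this example (the column names match the models defined in Step 1), and the 0 to 50 range is treated as inclusive. Constraint 4 is left out because the models in Step 1 do not define a date field.

```python
from typing import Any, Dict, List


def check_dataset(rows: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable constraint violations (empty if none)."""
    problems: List[str] = []

    # Constraint 1: exactly 10 rows.
    if len(rows) != 10:
        problems.append(f"expected 10 rows, got {len(rows)}")

    for i, row in enumerate(rows):
        # Constraint 2: a first and a last name, i.e. two whitespace-separated words.
        name = row.get("user_name")
        if not isinstance(name, str) or len(name.split()) != 2:
            problems.append(f"row {i}: user_name {name!r} is not two words")

        # Constraint 3: an integer order count between 0 and 50 (inclusive here).
        n = row.get("num_orders")
        if not isinstance(n, int) or isinstance(n, bool) or not 0 <= n <= 50:
            problems.append(f"row {i}: num_orders {n!r} is outside 0..50")

    return problems


# Hypothetical sample data: ten valid rows, then a copy whose last row is invalid.
good = [
    {"user_id": f"u{i:03d}", "user_name": "Jane Smith", "num_orders": i}
    for i in range(10)
]
bad = good[:9] + [{"user_id": "u010", "user_name": "Cher", "num_orders": 75}]

print(check_dataset(good))
print(check_dataset(bad))
```

The validators installed from the hub play the same role inside Guardrails; this hand-written check is only meant to make the target conditions concrete.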

Step 1: Generate a Pydantic RAIL spec

from pydantic import BaseModel, Field
from guardrails.hub import ValidLength, TwoWords, ValidRange
from datetime import date
from typing import List

prompt = """
Generate a dataset of fake user orders. Each row of the dataset should be valid.

${gr.complete_xml_suffix}
"""

class Order(BaseModel):
    user_id: str = Field(description="The user's id.")
    user_name: str = Field(
        description="The user's first name and last name",
        validators=[TwoWords()]
    )
    num_orders: int = Field(
        description="The number of orders the user has placed",
        validators=[ValidRange(0, 50)]
    )

class Orders(BaseModel):
    user_orders: List[Order] = Field(
        description="Generate a list of user, and how many orders they have placed in the past.",
        validators=[ValidLength(10, 10, on_fail="noop")]
    )

Step 2: Create a Guard object with the RAIL Spec

We create a gd.Guard object that will check, validate and correct the generated data. This object:

  1. Enforces the quality criteria specified in the RAIL spec (i.e. the validators attached to each field).
  2. Takes corrective action when the quality criteria are not met (e.g. re-asking the LLM).
  3. Compiles the schema and type info from the RAIL spec and adds it to the prompt.

import guardrails as gd

from rich import print
guard = gd.Guard.from_pydantic(output_class=Orders)
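
The corrective action mentioned in item 2 can be pictured as a validate-and-retry loop. The following is a minimal sketch of that idea in plain Python, not the Guardrails implementation: `generate_with_reask`, `fake_llm` and `two_words` are hypothetical names introduced here, and the fake model simply returns a fixed answer depending on whether it sees error feedback.

```python
from typing import Callable, List, Optional, Tuple


def generate_with_reask(
    call_llm: Callable[[str], str],
    validate: Callable[[str], List[str]],
    prompt: str,
    max_reasks: int = 1,
) -> Tuple[Optional[str], int]:
    """Call the model, validate, and re-prompt with the errors until valid.

    Returns (valid_output_or_None, number_of_calls_made).
    """
    current_prompt = prompt
    calls = 0
    for _ in range(max_reasks + 1):
        output = call_llm(current_prompt)
        calls += 1
        errors = validate(output)
        if not errors:
            return output, calls
        # Feed the failures back so the next attempt can correct them.
        current_prompt = (
            f"{prompt}\n\nYour previous answer was:\n{output}\n"
            f"It failed these checks: {'; '.join(errors)}\nPlease fix it."
        )
    return None, calls


# Hypothetical stand-ins for a real model and a real validator.
def fake_llm(p: str) -> str:
    return "Jane Smith" if "failed these checks" in p else "Jane"


def two_words(text: str) -> List[str]:
    return [] if len(text.split()) == 2 else ["expected exactly two words"]


result, calls = generate_with_reask(fake_llm, two_words, "Give me a full name.")
print(result, calls)
```

Capping the number of retries keeps cost bounded when a model keeps failing the same check; the cap used here (one re-ask) is an arbitrary choice for the example.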

Step 3: Wrap the LLM API call with Guard

# Add your OPENAI_API_KEY as an environment variable if it's not already set
# import os
# os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

res = guard(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=2048,
    temperature=0
)

The Guard object compiles the output schema and adds it to the prompt. We can see the final prompt below:

print(guard.history.last.iterations.last.inputs.msg_history[0]["content"])

Generate a dataset of fake user orders. Each row of the dataset should be valid.


Given below is XML that describes the information to extract from this document and the tags to extract it into.

<output>
<list description="Generate a list of user, and how many orders they have placed in the past."
format="guardrails/valid_length: 10 10" name="user_orders" required="true">
<object format="guardrails/valid_length: 10 10" required="true">
<string description="The user's id." name="user_id" required="true"></string>
<string description="The user's first name and last name" format="guardrails/two_words" name="user_name"
required="true"></string>
<integer description="The number of orders the user has placed" format="guardrails/valid_range: 0 50"
name="num_orders" required="true"></integer>
</object>
</list>
</output>

ONLY return a valid JSON object (no other text is necessary), where the key of the field in JSON is the `name`
attribute of the corresponding XML, and the value is of the type specified by the corresponding XML's tag. The JSON
MUST conform to the XML format, including any types and format requests e.g. requests for lists, objects and
specific types. Be correct and concise. If you are unsure anywhere, enter `null`.

Here are examples of simple (XML, JSON) pairs that show the expected behavior:
- `<string name='foo' format='two-words lower-case' />` => `{'foo': 'example one'}`
- `<list name='bar'><string format='upper-case' /></list>` => `{"bar": ['STRING ONE', 'STRING TWO', etc.]}`
- `<object name='baz'><string name="foo" format="capitalize two-words" /><integer name="index" format="1-indexed"
/></object>` => `{'baz': {'foo': 'Some String', 'index': 1}}`


print(res.raw_llm_output)
res.validated_output
```json
{
"user_orders": [
{
"user_id": "u001",
"user_name": "John Doe",
"num_orders": 15
},
{
"user_id": "u002",
"user_name": "Jane Smith",
"num_orders": 22
},
{
"user_id": "u003",
"user_name": "Alice Johnson",
"num_orders": 8
},
{
"user_id": "u004",
"user_name": "Bob Brown",
"num_orders": 30
},
{
"user_id": "u005",
"user_name": "Charlie Davis",
"num_orders": 12
},
{
"user_id": "u006",
"user_name": "Diana Evans",
"num_orders": 25
},
{
"user_id": "u007",
"user_name": "Eve Foster",
"num_orders": 18
},
{
"user_id": "u008",
"user_name": "Frank Green",
"num_orders": 10
},
{
"user_id": "u009",
"user_name": "Grace Hill",
"num_orders": 5
},
{
"user_id": "u010",
"user_name": "Henry King",
"num_orders": 20
}
]
}
```

{'user_orders': [{'user_id': 'u001',
'user_name': 'John Doe',
'num_orders': 15},
{'user_id': 'u002', 'user_name': 'Jane Smith', 'num_orders': 22},
{'user_id': 'u003', 'user_name': 'Alice Johnson', 'num_orders': 8},
{'user_id': 'u004', 'user_name': 'Bob Brown', 'num_orders': 30},
{'user_id': 'u005', 'user_name': 'Charlie Davis', 'num_orders': 12},
{'user_id': 'u006', 'user_name': 'Diana Evans', 'num_orders': 25},
{'user_id': 'u007', 'user_name': 'Eve Foster', 'num_orders': 18},
{'user_id': 'u008', 'user_name': 'Frank Green', 'num_orders': 10},
{'user_id': 'u009', 'user_name': 'Grace Hill', 'num_orders': 5},
{'user_id': 'u010', 'user_name': 'Henry King', 'num_orders': 20}]}

Running the cell above returns:

  1. The raw LLM text output as a single string.
  2. A dictionary whose user_orders key contains a list of dictionaries, where each dictionary represents a row in the dataframe.
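
The two return values are related: the raw string is a JSON document wrapped in a Markdown code fence, and the dictionary is its parsed form. The sketch below shows one way to recover the dictionary from such a string by hand, using only the standard library. It illustrates the relationship and is not a description of how Guardrails parses output; `parse_fenced_json` is a hypothetical helper, and the sample string contains only the first row.

```python
import json
import re
from typing import Any

# Matches an optional Markdown code fence (with or without a "json" tag)
# wrapped around the payload.
_FENCE_RE = re.compile(r"^\s*```(?:json)?\s*\n(.*?)\n\s*```\s*$", re.DOTALL)


def parse_fenced_json(raw: str) -> Any:
    """Strip a surrounding code fence, if present, and parse the rest as JSON."""
    match = _FENCE_RE.match(raw)
    payload = match.group(1) if match else raw
    return json.loads(payload)


raw = '```json\n{"user_orders": [{"user_id": "u001", "user_name": "John Doe", "num_orders": 15}]}\n```'
data = parse_fenced_json(raw)
print(data["user_orders"][0]["user_name"])
```

In the run shown above this step is unnecessary, because res.validated_output already holds the parsed and validated dictionary.
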

To inspect the full log of the call, print the history tree:

print(guard.history.last.tree)

The printed log contains a single step, Step 0, with four panels:

  1. Prompt ("No prompt"; the instructions appear under Message History instead).
  2. Message History, holding one user message with the compiled prompt shown above.
  3. Raw LLM Output, the fenced JSON string shown above.
  4. Validated Output, the user_orders dictionary shown above.
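
Since the stated goal is data for a pandas dataframe, a last step is to turn the list of row dictionaries into columns. The sketch below does this with the standard library only, so it runs without pandas; `rows_to_columns` is a hypothetical helper, and the sample holds the first three rows of the validated output shown earlier.

```python
from typing import Any, Dict, List


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot a list of row dicts into a dict of column lists (first-seen key order)."""
    columns: Dict[str, List[Any]] = {}
    # First pass: collect every column name in the order it is first seen.
    for row in rows:
        for key in row:
            columns.setdefault(key, [])
    # Second pass: fill each column, using None where a row lacks a key.
    for row in rows:
        for key, values in columns.items():
            values.append(row.get(key))
    return columns


# First three rows of the validated output shown earlier.
validated_output = {
    "user_orders": [
        {"user_id": "u001", "user_name": "John Doe", "num_orders": 15},
        {"user_id": "u002", "user_name": "Jane Smith", "num_orders": 22},
        {"user_id": "u003", "user_name": "Alice Johnson", "num_orders": 8},
    ]
}

columns = rows_to_columns(validated_output["user_orders"])
print(columns["num_orders"])

# With pandas installed, the same rows can be passed straight to the constructor:
#   import pandas as pd
#   df = pd.DataFrame(validated_output["user_orders"])
```

Passing the list of dictionaries directly to pd.DataFrame is usually the simpler route; the manual pivot is shown only to make the row-to-column mapping explicit.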