Generating Guaranteed JSON from open source models with constrained decoding

Joseph Catrambone

July 9, 2024

Categories:

release

“A magician turns into a store.”

Human language is messy and ambiguous. The proliferation of large language models in recent years comes in no small part from a need to bridge the vague and imprecise sensory world that humans occupy and the well-defined, rigidly structured world of computing. Yet for all their power and successes, LLMs still struggle with generating structured data.

Problem Description:

Imagine you're submitting an expense report after your latest adventure. Your OCR tool might be able to transcribe the text, but there is infinite variability in the structure of printed receipts, and even correctly transcribing the information is no guarantee that the required details will be available.

Consider the following receipt:

Albert's Emporium for the Distinguished Adventurer
Cloak of Shadows 1x $150
Boots of Water Walking 2x $100
Potion Bandolier $50
Total: $300
Thank you for shopping with us.

And the following output format:

{
  "raw_text": str,
  "total": float,
  "items": [
    {
      "name": str,
      "quantity": int,
      "price": float
    }
  ]
}

We can attempt to prompt-engineer our way out of the problem:

“Please convert this receipt to JSON. The outer JSON object should have three fields: raw_text, total, and items. Items should be a list of objects with name, quantity, and price.”

Which yields some success:

In this example, the items list contains three objects:

[
  {
    name: "Cloak of Shadows",
    quantity: 1,
    price: 150,
  },
  {
    name: "Boots of Water Walking",
    quantity: 2,
    price: 100,
  },
  {
    name: "Potion Bandolier",
    quantity: 1,
    price: 50,
  },
];

The total field should be the sum of the prices of all items in the list.

We're missing our 'total' and 'raw_text' fields, our prices are integers, and there's text before and after our JSON. This is a noble effort by the model, but what if our data structure were more complex? More nested? Sure, we can tweak our prompt and try to coax the outputs by hand, but why bother when we already have a specification of our output format? What if we wanted to ensure that the model could generate only valid JSON, and nothing else?

Enter “Constrained Generation”:

When a language model performs a single inference step, we get a probability distribution over the output tokens. By tracking where we are in the generation process, we know which tokens are and are not valid at a given step. We can force the model's hand by simply setting the output probability of every invalid token to zero.
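As a rough illustration (not Guardrails' actual implementation), that masking step might look like the sketch below, where valid_token_ids comes from whatever machinery tracks our position in the JSON structure:

import torch

def constrain_logits(logits: torch.Tensor, valid_token_ids: list[int]) -> torch.Tensor:
    # Copy the logits, but make every token outside the valid set impossible to select.
    masked = torch.full_like(logits, float("-inf"))
    masked[valid_token_ids] = logits[valid_token_ids]
    return masked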

HuggingFace has written a fantastic blog post ( https://huggingface.co/blog/constrained-beam-search ) about their solution to the problem of constrained generation, which in the broadest use case has numerous considerations and nuanced technical details. By concerning ourselves with a much smaller subset of the problem (finite, bounded JSON with pre-set keys, a far more restricted language on the Chomsky hierarchy than the general case) we can greatly simplify our use case into one that allows us to perform greedy decoding.
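To make the idea concrete, a simplified constrained greedy decoding loop might look something like the following. Here model and tokenizer are a local HuggingFace causal LM and its tokenizer, and next_valid_tokens is a hypothetical callback that, given the text generated so far, returns the token ids allowed next. This is a sketch of the general technique, not the code that Guardrails or JSONFormer ship:

import torch

@torch.no_grad()
def constrained_greedy_decode(model, tokenizer, prompt, next_valid_tokens, max_new_tokens=256):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generated = []
    for _ in range(max_new_tokens):
        # Distribution over the next token, given everything generated so far.
        logits = model(input_ids).logits[0, -1]
        allowed = next_valid_tokens(tokenizer.decode(generated))
        if not allowed:  # the schema is complete; nothing more may be generated
            break
        masked = torch.full_like(logits, float("-inf"))
        masked[list(allowed)] = logits[list(allowed)]
        next_id = int(masked.argmax())  # greedy: take the most likely *valid* token
        generated.append(next_id)
        input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=-1)
    return tokenizer.decode(generated)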

How Does Guardrails Help?

If you have an existing guard and a local language model, this is as simple as providing your desired schema and specifying an output formatter.

import json
from guardrails import Guard
from pydantic import BaseModel
from transformers import pipeline

# Pydantic models describing the structure we want back from the model.
class LineItem(BaseModel):
    quantity: int
    price: float
    name: str

class Receipt(BaseModel):
    total: float
    raw_text: str
    items: list[LineItem]

r = """Albert's Emporium for the Distinguished Adventurer\nCloak of Shadows 1x $150\nBoots of Water Walking 2x $100\nPotion Bandolier $50\nTotal: $300\nThank you for shopping with us."""

# The output formatter constrains generation so the result matches the Receipt schema.
g = Guard.from_pydantic(Receipt, output_formatter="jsonformer")

# Any local HuggingFace text-generation pipeline will do.
pipe = pipeline("text-generation", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")

out = g(pipe, prompt=f"Please convert this receipt: {r}")

print(json.dumps(out.validated_output, indent=2))

Yields:

{
  "total": 300.0,
  "raw_text": "Albert's Emporium for the Distinguished Adventurer...",
  "items": [
    {
      "quantity": 1,
      "price": 150.0,
      "name": "Cloak of Shadows"
    },
    {
      "quantity": 2,
      "price": 100.0,
      "name": "Boots of Water Walking"
    },
    {
      "quantity": 1,
      "price": 50.0,
      "name": "Potion Bandolier"
    }
  ]
}

Note that in addition to capturing the structure of the outermost object, we also get our prices as floating-point values. One extra argument (the output formatter) and we guarantee our output is in the format we expect.

Shortcomings and Limitations:

Targeting local HuggingFace models rather than arbitrary models is a decision made out of practicality: remote inference comes with a whole other host of problems. Pun intended. One easy-to-miss tripping hazard is tokenizer variation across remote models. Some of the most popular LLM providers allow for biasing the outputs, but require that the bias map be keyed by token id rather than by token text. (For example: {42: 100, 1024: -10, 4000: 50} rather than {"]": 100, ",": 100, "foo": 50}.) This isn't a problem as long as you know which tokenizer the remote model uses, but it's an easy detail to overlook when dealing with multiple providers with multiple different tokenizers; a small sketch of the required translation follows below.

A more challenging issue is speed. If we attempt to constrain or bias the logits of a remote model, we unavoidably incur the round-trip latency for every generated token: we send the set of valid tokens, let the remote model report which token it selected, update our constraint state, and send the next set of valid tokens. There are ways to mitigate this slightly, such as keeping connections open and skipping inference steps when only a single valid token is possible, but these are ultimately balms rather than cures.
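The sketch referenced above: converting a human-readable, string-keyed bias map into the id-keyed map a provider expects. The model name here is just an example; the point is that the resulting ids depend entirely on which tokenizer the remote provider actually uses.

from transformers import AutoTokenizer

# Must match the tokenizer used by the remote model, or the biases land on the wrong tokens.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

string_biases = {"]": 100, ",": 100, "foo": 50}

logit_bias = {}
for text, bias in string_biases.items():
    for token_id in tokenizer.encode(text, add_special_tokens=False):
        logit_bias[token_id] = bias

print(logit_bias)  # id-keyed map; the keys differ from tokenizer to tokenizer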

All is not lost. While constrained decoding might be suited only for local models, we still have options: function calling and prompt MacGyvering are well-established alternatives. There's even a leaderboard with the best performers ( https://gorilla.cs.berkeley.edu/leaderboard.html ), and we already support both of these options in Guardrails.

Parting Words:

Guardrails provides a seamless interface for generating structured data from unstructured text, both locally and remotely. You can read more in our documentation on Generating Structured Data ( https://www.guardrailsai.com/docs/examples/generate_structured_data ). We would like to thank JSONFormer ( https://github.com/1rgs/jsonformer ) for providing a means of doing JSON generation with models that don't support function calling while we investigated options for our offering.

