Token-Oriented Object Notation (TOON) for Python – a compact, human-readable data format that reduces LLM token usage by 30-60% compared to JSON. Perfect for passing structured data to Large Language Models efficiently.

ron-42/py-toon

py-toon

License: MIT | TOON spec v1.3

Token-Oriented Object Notation (TOON) is a compact, human-readable serialization format designed for passing structured data to Large Language Models with significantly reduced token usage. It's intended primarily for LLM input; generating TOON as model output is possible but needs more explicit prompting.

TOON's sweet spot is uniform arrays of objects – multiple fields per row, same structure across items. It borrows YAML's indentation-based structure for nested objects and CSV's tabular format for uniform data rows, then optimizes both for token efficiency in LLM contexts. For deeply nested or non-uniform data, JSON may be more efficient.

💡 Tip: Think of TOON as a translation layer: use JSON programmatically, convert to TOON for LLM input.



Why TOON?

LLMs are becoming cheaper and more accessible, and larger context windows invite larger data inputs. Tokens still cost money, though, and standard JSON is verbose and token-expensive:

{
  "users": [
    { "id": 1, "name": "Alice", "role": "admin" },
    { "id": 2, "name": "Bob", "role": "user" }
  ]
}

TOON conveys the same information with fewer tokens:

users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user

Why create a new format? Because existing alternatives don't fit:

  • JSON: Too verbose for tabular data
  • CSV: No nested structures
  • YAML: Better than JSON, but still repeats keys
  • Protocol Buffers/MessagePack: Binary formats that are not human-readable (Protocol Buffers also requires schema definitions)

TOON bridges these gaps with a text format optimized for token efficiency and LLM-friendly guardrails.


Key Features

  • 💸 Token-efficient: typically 30–60% fewer tokens than JSON
  • 🤿 LLM-friendly guardrails: explicit lengths and fields enable validation
  • 🍱 Minimal syntax: removes redundant punctuation (braces, brackets, most quotes)
  • 📐 Indentation-based structure: like YAML, uses whitespace instead of braces
  • 🧺 Tabular arrays: declare keys once, stream data as rows

Benchmarks

Token Efficiency

⭐ GitHub Repositories       ██████████████░░░░░░░░░░░    8,745 tokens
                             vs JSON (-42.3%)           15,145
                             vs JSON compact (-23.7%)   11,455
                             vs YAML (-33.4%)           13,129
                             vs XML (-48.8%)            17,095

📈 Daily Analytics           ██████████░░░░░░░░░░░░░░░    4,507 tokens
                             vs JSON (-58.9%)           10,977
                             vs JSON compact (-35.7%)    7,013
                             vs YAML (-48.8%)            8,810
                             vs XML (-65.7%)            13,128

🛒 E-Commerce Order          ████████████████░░░░░░░░░      166 tokens
                             vs JSON (-35.4%)              257
                             vs JSON compact (-2.9%)       171
                             vs YAML (-15.7%)              197
                             vs XML (-38.7%)               271

─────────────────────────────────────────────────────────────────────
Total                        ██████████████░░░░░░░░░░░   13,418 tokens
                             vs JSON (-49.1%)           26,379
                             vs JSON compact (-28.0%)   18,639
                             vs YAML (-39.4%)           22,136
                             vs XML (-56.0%)            30,494

Note: Token savings are measured against formatted JSON (2-space indentation). Measured with gpt-tokenizer using o200k_base encoding (GPT-5 tokenizer). Actual savings vary by model and tokenizer.
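The percentages in the table follow directly from the token counts. As a quick check, the sketch below recomputes the "vs JSON" figures from the numbers above; `savings_percent` is a helper name made up for this illustration and is not part of py-toon.

```python
# Token counts copied from the benchmark table above: (TOON, formatted JSON).
BENCHMARKS = {
    "GitHub Repositories": (8_745, 15_145),
    "Daily Analytics": (4_507, 10_977),
    "E-Commerce Order": (166, 257),
}


def savings_percent(toon_tokens: int, baseline_tokens: int) -> float:
    """Return how many percent fewer tokens TOON uses than the baseline."""
    return round((1 - toon_tokens / baseline_tokens) * 100, 1)


for name, (toon, json_tokens) in BENCHMARKS.items():
    print(f"{name}: -{savings_percent(toon, json_tokens)}%")

# The "Total" row sums the raw counts; it is not an average of the percentages.
total_toon = sum(toon for toon, _ in BENCHMARKS.values())
total_json = sum(json_tokens for _, json_tokens in BENCHMARKS.values())
print(f"Total: -{savings_percent(total_toon, total_json)}%")
```

Because the total is computed from summed counts, the two large datasets dominate it and the small E-Commerce Order barely moves the result.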


Installation & Quick Start

# Clone the repository
git clone https://github.com/ron-42/py-toon.git
cd py-toon

# Install locally
pip install -e .

# Or install directly from GitHub
pip install git+https://github.com/ron-42/py-toon.git

Example Usage

from toon_format import encode, decode

# Encode Python data to TOON
data = {
    "users": [
        {"id": 1, "name": "Alice", "role": "admin"},
        {"id": 2, "name": "Bob", "role": "user"}
    ]
}

toon_string = encode(data)
print(toon_string)
# users[2]{id,name,role}:
#   1,Alice,admin
#   2,Bob,user

# Decode TOON back to Python
original_data = decode(toon_string)
print(original_data)
# {'users': [{'id': 1, 'name': 'Alice', 'role': 'admin'}, ...]}

Format Overview

Objects

Simple objects with primitive values:

encode({
    "id": 123,
    "name": "Ada",
    "active": True
})
# id: 123
# name: Ada
# active: true

Nested objects:

encode({
    "user": {
        "id": 123,
        "name": "Ada"
    }
})
# user:
#   id: 123
#   name: Ada
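To make the indentation rule concrete, here is a minimal standalone sketch of how nested objects could be rendered. It is not py-toon's implementation: `format_primitive` and `encode_object` are hypothetical helpers, and string quoting is deliberately left out.

```python
def format_primitive(value):
    """Render a primitive the way the examples above show it."""
    if value is True:
        return "true"
    if value is False:
        return "false"
    if value is None:
        return "null"
    return str(value)  # assumption: no quoting needed for this value


def encode_object(obj, indent=2, depth=0):
    """Render a dict of primitives and nested dicts as indented lines."""
    pad = " " * (indent * depth)
    lines = []
    for key, value in obj.items():
        if isinstance(value, dict):
            # A nested object gets a bare "key:" line, then its fields one level deeper.
            lines.append(f"{pad}{key}:")
            lines.extend(encode_object(value, indent, depth + 1))
        else:
            lines.append(f"{pad}{key}: {format_primitive(value)}")
    return lines


print("\n".join(encode_object({"user": {"id": 123, "name": "Ada"}, "active": True})))
# user:
#   id: 123
#   name: Ada
# active: true
```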

Arrays

Primitive Arrays (Inline)

encode({"tags": ["admin", "ops", "dev"]})
# tags[3]: admin,ops,dev

Arrays of Objects (Tabular)

When all objects share the same primitive fields, TOON uses an efficient tabular format:

encode({
    "items": [
        {"sku": "A1", "qty": 2, "price": 9.99},
        {"sku": "B2", "qty": 1, "price": 14.5}
    ]
})
# items[2]{sku,qty,price}:
#   A1,2,9.99
#   B2,1,14.5
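The tabular condition (same fields, primitive values only) can be sketched as a small check followed by a row renderer. This is an illustration under simplifying assumptions, not the library's code: `is_tabular` and `encode_table` are hypothetical names, and quoting, booleans, and null are not handled.

```python
PRIMITIVES = (str, int, float, bool, type(None))


def is_tabular(items):
    """True if every item is a non-empty dict with the same keys and primitive values."""
    if not items or not all(isinstance(item, dict) and item for item in items):
        return False
    fields = set(items[0])
    return all(
        set(item) == fields and all(isinstance(v, PRIMITIVES) for v in item.values())
        for item in items
    )


def encode_table(key, items):
    """Render a tabular array: one header with length and fields, then one row per item."""
    fields = list(items[0])  # field order is taken from the first item
    field_list = ",".join(fields)
    header = f"{key}[{len(items)}]{{{field_list}}}:"
    # str() is enough for the numbers and plain strings used here.
    rows = ["  " + ",".join(str(item[f]) for f in fields) for item in items]
    return "\n".join([header, *rows])


items = [
    {"sku": "A1", "qty": 2, "price": 9.99},
    {"sku": "B2", "qty": 1, "price": 14.5},
]
if is_tabular(items):
    print(encode_table("items", items))
```

The keys are written once in the header, which is where the savings over JSON come from: each additional row costs only its values and delimiters.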

Mixed and Non-Uniform Arrays

Arrays that don't meet the tabular requirements use list format:

items[3]:
  - 1
  - a: 1
  - text

When objects appear in list format, the first field is placed on the hyphen line:

items[2]:
  - id: 1
    name: First
  - id: 2
    name: Second
    extra: true
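The list layout above can be reproduced with a few lines of Python. The sketch below is only an illustration: `encode_list` and `format_scalar` are hypothetical helpers, and it assumes items are either primitives or flat, non-empty dicts of primitives that need no quoting.

```python
def format_scalar(value):
    """Render a primitive; booleans and None use their lowercase TOON spellings."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if value is None:
        return "null"
    return str(value)


def encode_list(key, items, indent=2):
    """Render a non-tabular array in list format, one "- " entry per item."""
    pad = " " * indent
    lines = [f"{key}[{len(items)}]:"]
    for item in items:
        if isinstance(item, dict):
            pairs = [f"{k}: {format_scalar(v)}" for k, v in item.items()]
            # The first field shares the hyphen line.
            lines.append(f"{pad}- {pairs[0]}")
            # Remaining fields line up under the first key, two columns past the hyphen.
            lines.extend(f"{pad}  {pair}" for pair in pairs[1:])
        else:
            lines.append(f"{pad}- {format_scalar(item)}")
    return "\n".join(lines)


print(encode_list("items", [1, {"a": 1}, "text"]))
print(encode_list("items", [
    {"id": 1, "name": "First"},
    {"id": 2, "name": "Second", "extra": True},
]))
```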

Arrays of Arrays

encode({"pairs": [[1, 2], [3, 4]]})
# pairs[2]:
#   - [2]: 1,2
#   - [2]: 3,4

Empty Arrays and Objects

encode({"items": []})  # items[0]:
encode([])              # [0]:
encode({})              # (empty output)
encode({"config": {}})  # config:

Quoting Rules

TOON quotes strings only when necessary to maximize token efficiency:

  • Inner spaces are allowed; leading or trailing spaces force quotes
  • Unicode and emoji are safe unquoted
  • Quotes and control characters are escaped with backslash

String values are quoted when:

  • Empty string: ""
  • Leading or trailing spaces: " padded ", " "
  • Contains active delimiter, colon, quote, backslash, or control chars: "a,b", "a:b", "say \"hi\""
  • Looks like boolean/number/null: "true", "42", "null"
  • Starts with "- " (list-like): "- item"
  • Looks like structural token: "[5]", "{key}"

Object keys are unquoted if they match the identifier pattern (start with letter or underscore, followed by letters, digits, underscores, or dots). All other keys must be quoted.
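The quoting rules above translate into a short predicate. The following is a rough sketch, not the library's logic: `needs_quotes` and `key_needs_quotes` are hypothetical names, and the regular expressions for "looks like a number", "looks like a structural token", and ASCII-only identifiers are assumptions made for this example.

```python
import re

# Keys: a letter or underscore, then letters, digits, underscores, or dots (ASCII assumed).
KEY_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*")
# Assumed shape of a number: optional sign, digits, optional fraction and exponent.
NUMBER_PATTERN = re.compile(r"-?\d+(\.\d+)?([eE][+-]?\d+)?")
# Assumed shape of a structural token such as "[5]" or "{key}".
STRUCTURAL_PATTERN = re.compile(r"\[.*\]|\{.*\}")


def needs_quotes(value: str, delimiter: str = ",") -> bool:
    """Return True if a string value must be quoted under the rules listed above."""
    if value == "" or value != value.strip():
        return True  # empty, or leading/trailing whitespace
    if value in ("true", "false", "null") or NUMBER_PATTERN.fullmatch(value):
        return True  # would otherwise be read back as a boolean, null, or number
    if value.startswith("- ") or STRUCTURAL_PATTERN.fullmatch(value):
        return True  # would otherwise look like a list item or a header
    if any(ch in value for ch in (delimiter, ":", '"', "\\")):
        return True  # active delimiter, colon, quote, or backslash
    return any(ord(ch) < 0x20 for ch in value)  # control characters


def key_needs_quotes(key: str) -> bool:
    """Return True if an object key does not match the identifier pattern."""
    return KEY_PATTERN.fullmatch(key) is None


for sample in ["hello world", "a,b", "42", "- item", ""]:
    print(repr(sample), needs_quotes(sample))
```

Note that only the active delimiter forces quotes in this sketch: with `delimiter="|"`, the value `a,b` would stay unquoted while `a|b` would not.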


API

encode(value, indent=2, delimiter=',', length_marker=False)

Converts any JSON-serializable value to TOON format.

Parameters:

  • value – Any JSON-serializable value (object, array, primitive, or nested structure)
  • indent – Number of spaces per indentation level (default: 2)
  • delimiter – Delimiter for array values: ',', '\t', or '|' (default: ',')
  • length_marker – Add # prefix to array lengths, e.g., items[#3] (default: False)

Returns: A TOON-formatted string

Example:

from toon_format import encode

items = [
    {"sku": "A1", "qty": 2, "price": 9.99},
    {"sku": "B2", "qty": 1, "price": 14.5}
]

# Default (comma delimiter)
print(encode({"items": items}))

# Tab delimiter (often more token-efficient)
print(encode({"items": items}, delimiter='\t'))

# With length marker
print(encode({"items": items}, length_marker=True))

decode(input_str, indent=2, strict=True)

Converts a TOON-formatted string back to Python values.

Parameters:

  • input_str – A TOON-formatted string to parse
  • indent – Expected number of spaces per indentation level (default: 2)
  • strict – Enable strict validation (default: True)

Returns: A Python value (dict, list, or primitive)

Example:

from toon_format import decode

toon = """
items[2]{sku,qty,price}:
  A1,2,9.99
  B2,1,14.5
"""

data = decode(toon)
# {'items': [{'sku': 'A1', 'qty': 2, 'price': 9.99}, ...]}

Strict Mode:

By default, the decoder validates input strictly:

  • Invalid escape sequences raise an error
  • Missing colons and malformed headers raise a syntax error
  • Array length mismatches raise an error when the declared length doesn't match the actual row count
  • Delimiter mismatches raise an error when row delimiters don't match the header

Use strict=False for lenient parsing.
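To show what the length and header checks amount to, here is a simplified standalone validator for a single comma-delimited table. It is a sketch only: `check_table` is a hypothetical function, `ValueError` stands in for whatever exception type the library actually raises, and the naive comma split ignores quoted values.

```python
import re

# Matches a root-level tabular header such as "items[2]{sku,qty,price}:".
HEADER = re.compile(r"(?P<key>\w+)\[(?P<length>\d+)\]\{(?P<fields>[^}]*)\}:")


def check_table(toon: str) -> int:
    """Validate one tabular block and return its row count."""
    lines = [line for line in toon.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty input")
    match = HEADER.fullmatch(lines[0])
    if match is None:
        raise ValueError(f"malformed header: {lines[0]!r}")
    declared = int(match["length"])
    fields = match["fields"].split(",")
    rows = lines[1:]
    if len(rows) != declared:
        raise ValueError(f"header declares {declared} rows, found {len(rows)}")
    for row in rows:
        # Naive split: assumes no quoted values that contain commas.
        if len(row.strip().split(",")) != len(fields):
            raise ValueError(f"row {row.strip()!r} does not have {len(fields)} values")
    return declared


toon = """
items[2]{sku,qty,price}:
  A1,2,9.99
  B2,1,14.5
"""
print(check_table(toon))  # 2

try:
    check_table(toon.replace("[2]", "[3]"))
except ValueError as error:
    print(f"rejected: {error}")
```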


Using TOON in LLM Prompts

TOON works best when you show the format instead of describing it. The structure is self-documenting – models parse it naturally once they see the pattern.

Sending TOON to LLMs (Input)

Wrap your encoded data in a fenced code block (label it ```toon for clarity). The indentation and headers are usually enough – models treat it like familiar YAML or CSV. The explicit length markers ([N]) and field headers ({field1,field2}) help the model track structure.

Generating TOON from LLMs (Output)

For output, be more explicit. When you want the model to generate TOON:

  • Show the expected header (users[N]{id,name,role}:). The model fills rows instead of repeating keys, reducing generation errors.
  • State the rules: 2-space indent, no trailing spaces, [N] matches row count.

Example prompt:

Data is in TOON format (2-space indent, arrays show length and fields).

```toon
users[3]{id,name,role,lastLogin}:
  1,Alice,admin,2025-01-15T10:30:00Z
  2,Bob,user,2025-01-14T15:22:00Z
  3,Charlie,user,2025-01-13T09:45:00Z
```

Task: Return only users with role "user" as TOON. Use the same header. Set [N] to match the row count. Output only the code block.

💡 Tip: For large uniform tables, use encode(data, delimiter='\t') and tell the model "fields are tab-separated." Tabs often tokenize better than commas.
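Assembling such a prompt in code is mostly string formatting. The helper below is a minimal sketch: `build_prompt` is a hypothetical name, and the TOON text is passed in as a plain string (in practice it would come from `encode`).

```python
FENCE = "`" * 3  # built at runtime so this snippet can itself sit inside a fenced block


def build_prompt(toon_text: str, task: str) -> str:
    """Wrap TOON data in a labelled fenced block and append the task."""
    return "\n".join([
        "Data is in TOON format (2-space indent, arrays show length and fields).",
        "",
        f"{FENCE}toon",
        toon_text.strip("\n"),
        FENCE,
        "",
        f"Task: {task}",
    ])


toon_text = """
users[3]{id,name,role,lastLogin}:
  1,Alice,admin,2025-01-15T10:30:00Z
  2,Bob,user,2025-01-14T15:22:00Z
  3,Charlie,user,2025-01-13T09:45:00Z
"""
print(build_prompt(
    toon_text,
    'Return only users with role "user" as TOON. Use the same header. '
    "Set [N] to match the row count. Output only the code block.",
))
```

Only leading and trailing newlines are stripped from the data, so the indentation of the rows is preserved exactly as encoded.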


Notes and Limitations

  • Format familiarity and structure matter as much as token count. TOON's tabular format requires arrays of objects with identical keys and primitive values only. When this doesn't hold (mixed types, non-uniform objects, nested structures), TOON switches to list format where JSON can be more efficient at scale.

    • TOON excels at: Uniform arrays of objects (same fields, primitive values), especially large datasets with consistent structure
    • JSON is better for: Non-uniform data, deeply nested structures, and objects with varying field sets
  • Token counts vary by tokenizer and model. Benchmarks use a GPT-style tokenizer; actual savings will differ with other models.

  • TOON is designed for LLM input where human readability and token efficiency matter. It's not a drop-in replacement for JSON in APIs or storage.


Full Specification

For precise formatting rules and implementation details, see the full specification (currently v1.3).

The conformance tests provide language-agnostic test fixtures that validate implementations across any language.


Other Implementations


License

MIT License © 2025 Ronak


Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

When implementing features, please follow the TOON specification to ensure compatibility across implementations.


Acknowledgments

TOON format specification and reference implementation by Johann Schopplich and contributors.

This Python implementation follows the official TOON specification v1.3.
