Context Window Size, Token Budgets, and Truncation

Manage context window sizes, token budgets, and truncation strategies for predictable AI inference resource allocation.

How the Pieces Fit
Context Window Size - Managing Total Token Capacity
Token Budgets - Predictable Resource Allocation
Fill Modes and Multi-Page Budgeting
Truncation - Intelligent Content Reduction
Advanced Truncation Control
Semantic Compression - Legend-Backed Prompt Reduction
Practical Implementation Patterns
Error Prevention and Handling
Token Counting and Optimization
Production Considerations

TPipe's token management addresses a basic constraint of AI models: the context limit. These features prevent runtime failures, keep token usage within budget, and make behavior predictable when inputs are large.

How the Pieces Fit

TPipe handles token pressure in layers:

Context window size sets the total space the model can use.
Token budgets divide that space between user input, output, reasoning, and context.
Fill modes decide how lorebook and page-keyed context are selected before truncation happens.

If you only need the pipe to stay under the model limit, autoTruncateContext() is enough. If you need predictable allocation or page-by-page control, use TokenBudgetSettings. If you want lorebook entries to be selected first, turn on fill mode as well.

Context Window Size - Managing Total Token Capacity

The Problem

AI models have hard limits on total tokens they can process (input + output combined). Exceed this limit and your API call fails. Different models have different limits (8K, 32K, 128K, etc.).

The Solution

pipe.setContextWindowSize(32000)  // Set total token budget
    .setMaxTokens(8000)          // Reserve tokens for output

What this does: Establishes a token budget that prevents API failures by ensuring total token usage stays within model limits.

How it works: TPipe calculates available input space as contextWindowSize - maxTokens - systemPromptTokens, then automatically manages content to fit this constraint.

Use setContextWindowSize() when you already know the model limit you want to work within. Use setMaxTokens() when you want to reserve output room so the model does not spend the whole window on input.
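
The available-input formula above can be checked with plain arithmetic. The sketch below is standalone Kotlin rather than TPipe code: the helper name availableInputTokens and the 2,000-token system prompt are assumptions made for illustration.

```kotlin
// Hypothetical helper (not part of TPipe): mirrors the formula
// contextWindowSize - maxTokens - systemPromptTokens described above.
fun availableInputTokens(contextWindowSize: Int, maxTokens: Int, systemPromptTokens: Int): Int {
    val available = contextWindowSize - maxTokens - systemPromptTokens
    // A negative value would mean the output reservation and system prompt
    // alone overflow the window, so clamp to zero.
    return available.coerceAtLeast(0)
}

fun main() {
    // 32,000 window, 8,000 reserved for output, assumed 2,000-token system prompt
    println(availableInputTokens(32000, 8000, 2000)) // prints 22000
}
```

Clamping to zero is a choice made for this sketch; a real pipe may prefer to fail loudly when the reservation cannot be satisfied.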

Token Budgets - Predictable Resource Allocation

The Problem

Without explicit budgeting, you can’t predict:

Whether your input will fit
How much space is left for AI output
If reasoning models have enough “thinking” space
When truncation will occur

The Solution - TokenBudgetSettings

val budget = TokenBudgetSettings(
    userPromptSize = 12000,
    maxTokens = 20000,
    reasoningBudget = 8000,
    subtractReasoningFromInput = false,
    contextWindowSize = 32000,
    allowUserPromptTruncation = true,
    preserveJsonInUserPrompt = true,
    compressUserPrompt = false,
    truncateContextWindowAsString = false,
    preserveTextMatches = true,
    multiPageBudgetStrategy = MultiPageBudgetStrategy.DYNAMIC_SIZE_FILL,
    pageWeights = mapOf("critical" to 2.0, "normal" to 1.0),
    reserveEmptyPageBudget = true
)

What this does:

userPromptSize reserves a fixed amount of room for user input. Set it to null when you want TPipe to allocate only what the current prompt actually needs.
maxTokens reserves output space for the model’s answer.
reasoningBudget reserves thinking space for reasoning models. By default it is taken from maxTokens; set subtractReasoningFromInput = true if you want that room carved out of the input side instead.
contextWindowSize sets the total budget for the whole request.
allowUserPromptTruncation decides whether oversized user input is trimmed or rejected.
preserveJsonInUserPrompt tries to keep JSON structure intact when user text has to be shortened.
compressUserPrompt turns on semantic compression before truncation for natural-language prompts.
truncateContextWindowAsString switches truncation to a single string pass instead of entry-by-entry truncation.
preserveTextMatches keeps matching context elements and conversation history near the front of the line when space is tight.
multiPageBudgetStrategy decides how MiniBank page budgets are shared.
pageWeights only matter when the strategy uses weights.
reserveEmptyPageBudget decides whether empty pages still get a share of the budget.

Good rule of thumb: if you are trying to tune one request, start with contextWindowSize, maxTokens, and userPromptSize. If you are trying to tune a whole pipeline, add reasoningBudget, preserveTextMatches, and a multi-page strategy.

Token allocation calculation:

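Total: 32,000 tokens
- System prompt: ~2,000 tokens (calculated)
- Output budget: 20,000 tokens
    of which reasoning budget: 8,000 tokens (taken from output, not from the total)
    effective output: 12,000 tokens (20,000 - 8,000)
= Available for user input: 10,000 tokens (32,000 - 2,000 - 20,000)

Under this arithmetic, the userPromptSize = 12000 reservation in the example above is larger than the 10,000 tokens left for input. Lower userPromptSize or maxTokens if the full reservation has to fit.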
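
To see how subtractReasoningFromInput shifts the numbers, here is a minimal standalone sketch of the allocation arithmetic. It is one reading of the parameter descriptions in this guide, not TPipe's implementation; the Allocation class and allocate function are hypothetical names.

```kotlin
// Hypothetical model of the allocation arithmetic; not TPipe's implementation.
data class Allocation(val effectiveOutput: Int, val availableInput: Int)

fun allocate(
    contextWindowSize: Int,
    systemPromptTokens: Int,
    maxTokens: Int,
    reasoningBudget: Int,
    subtractReasoningFromInput: Boolean
): Allocation {
    val inputBeforeReasoning = contextWindowSize - systemPromptTokens - maxTokens
    return if (subtractReasoningFromInput) {
        // Reasoning room is carved out of the input side; output keeps all of maxTokens.
        Allocation(effectiveOutput = maxTokens, availableInput = inputBeforeReasoning - reasoningBudget)
    } else {
        // Default: reasoning room comes out of maxTokens; input is unaffected.
        Allocation(effectiveOutput = maxTokens - reasoningBudget, availableInput = inputBeforeReasoning)
    }
}

fun main() {
    println(allocate(32000, 2000, 20000, 8000, subtractReasoningFromInput = false))
    println(allocate(32000, 2000, 20000, 8000, subtractReasoningFromInput = true))
}
```

With the default setting the sketch yields 12,000 tokens of effective output and 10,000 for input; with the flag set it yields 20,000 and 2,000 respectively.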

Fill Modes and Multi-Page Budgeting

Fill mode is about selection, not just trimming.

When fillMode = true or enableLoreBookFillMode() is active, TPipe first gives lorebook selection the full budget and fills it with the most relevant entries. Only after that does it split the remaining room between context elements and conversation history.

When fillAndSplitMode = true, TPipe reserves half of the top-level budget for lorebook selection and half for everything else. If lorebook uses less than its half, the leftover room is handed back to the rest of the context instead of going unused.

Use fill mode when lorebook entries are the most important part of the prompt. Use fill-and-split mode when lorebook should be prioritized, but you still want the rest of the context to keep a guaranteed share.

pipe.autoTruncateContext(fillAndSplitMode = true)
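
The fill-and-split rule (half for lorebook, leftover handed back) can be sketched in a few lines. This is an illustration of the description above under the assumption that the split is an exact half; fillAndSplit is a hypothetical name, not a TPipe function.

```kotlin
// Hypothetical sketch of the fill-and-split rule; not TPipe's code.
// Returns (tokens used by lorebook, tokens left for the rest of the context).
fun fillAndSplit(totalBudget: Int, lorebookTokensNeeded: Int): Pair<Int, Int> {
    val lorebookShare = totalBudget / 2
    // Lorebook may use up to its half, never more.
    val lorebookUsed = minOf(lorebookTokensNeeded, lorebookShare)
    // Anything lorebook leaves unused is handed back to the rest of the context.
    val restOfContext = totalBudget - lorebookUsed
    return lorebookUsed to restOfContext
}

fun main() {
    println(fillAndSplit(10000, 3000)) // lorebook under its half: prints (3000, 7000)
    println(fillAndSplit(10000, 8000)) // lorebook capped at its half: prints (5000, 5000)
}
```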

For page-keyed context, TokenBudgetSettings.multiPageBudgetStrategy controls how the budget is spread across pages:

DYNAMIC_SIZE_FILL gives smaller pages a chance to survive first and redistributes unused room as it goes.
DYNAMIC_FILL starts with a priority-style pass and then reclaims unused budget.
EQUAL_SPLIT divides the budget evenly across pages.
WEIGHTED_SPLIT uses pageWeights to favor more important pages.
PRIORITY_FILL walks pages in order and fills them one by one.

DYNAMIC_SIZE_FILL is the default for TokenBudgetSettings, while enableDynamicFill() switches the pipe to the older dynamic-fill behavior explicitly.
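
As a rough illustration of the two simplest strategies, the sketch below divides a page budget evenly and by weight. It is standalone Kotlin with hypothetical function names; treating a page without a weight as 1.0 is an assumption of the sketch, not documented TPipe behavior.

```kotlin
// Hypothetical sketches of EQUAL_SPLIT and WEIGHTED_SPLIT; not TPipe's code.
fun equalSplit(budget: Int, pages: List<String>): Map<String, Int> =
    pages.associateWith { budget / pages.size }

// Assumption: a page with no entry in `weights` counts as weight 1.0.
fun weightedSplit(budget: Int, pages: List<String>, weights: Map<String, Double>): Map<String, Int> {
    val totalWeight = pages.sumOf { weights[it] ?: 1.0 }
    return pages.associateWith { page ->
        (budget * (weights[page] ?: 1.0) / totalWeight).toInt()
    }
}

fun main() {
    val weights = mapOf("critical" to 2.0, "normal" to 1.0)
    println(equalSplit(9000, listOf("critical", "normal")))             // 4500 each
    println(weightedSplit(9000, listOf("critical", "normal"), weights)) // 6000 and 3000
}
```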

pipe.autoTruncateContext(fillMode = true)

pipe.setTokenBudget(TokenBudgetSettings(
    contextWindowSize = 32000,
    maxTokens = 4000,
    multiPageBudgetStrategy = MultiPageBudgetStrategy.DYNAMIC_SIZE_FILL
))

pipe.setTokenBudget(TokenBudgetSettings(
    contextWindowSize = 32000,
    maxTokens = 4000,
    multiPageBudgetStrategy = MultiPageBudgetStrategy.WEIGHTED_SPLIT,
    pageWeights = mapOf("critical" to 2.0, "normal" to 1.0)
))

Dynamic vs Explicit User Prompt Allocation

Understanding User Prompt Size Behavior

The userPromptSize parameter in TokenBudgetSettings controls how TPipe allocates space for user input, and it behaves differently depending on whether you provide an explicit value or leave it as null.

Explicit User Prompt Allocation

val budget = TokenBudgetSettings(
    userPromptSize = 12000,           // Explicit allocation
    contextWindowSize = 32000,
    maxTokens = 8000
)

How it works:

TPipe reserves exactly 12,000 tokens for user input
If user input exceeds this limit, behavior depends on allowUserPromptTruncation
Context space is calculated as: contextWindowSize - systemPrompt - userPromptSize - maxTokens
Predictable and deterministic allocation
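
The explicit-allocation rules above can be sketched as two small functions: one for the context-space formula and one for the truncate-or-reject decision. Both names are hypothetical and the 2,000-token system prompt is an assumed value; this is not TPipe's code.

```kotlin
// Hypothetical sketch of explicit allocation; not TPipe's code.
fun explicitContextSpace(
    contextWindowSize: Int,
    systemPromptTokens: Int,
    userPromptSize: Int,
    maxTokens: Int
): Int = contextWindowSize - systemPromptTokens - userPromptSize - maxTokens

// Returns how many prompt tokens are kept, or throws when truncation is not allowed.
fun enforceUserPromptSize(actualTokens: Int, userPromptSize: Int, allowUserPromptTruncation: Boolean): Int {
    if (actualTokens <= userPromptSize) return actualTokens
    check(allowUserPromptTruncation) {
        "User prompt needs $actualTokens tokens but only $userPromptSize are reserved"
    }
    return userPromptSize
}

fun main() {
    println(explicitContextSpace(32000, 2000, 12000, 8000)) // prints 10000
    println(enforceUserPromptSize(15000, 12000, allowUserPromptTruncation = true)) // prints 12000
}
```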

Use when:

You need predictable token allocation
You want to enforce strict input size limits
You’re building applications with known input patterns
You want to prevent unexpectedly large inputs from consuming context space

Dynamic User Prompt Allocation

val budget = TokenBudgetSettings(
    userPromptSize = null,            // Dynamic allocation - TPipe calculates automatically
    contextWindowSize = 32000,
    maxTokens = 8000
)

How it works:

Automatic Calculation: TPipe counts the actual tokens in your user prompt and allocates exactly that amount
Space Optimization: Remaining space after system prompt, user prompt, and output allocation goes to context
Overflow Handling: If the calculated user prompt size would exceed available space, TPipe automatically reduces it
Cleanup: After processing, userPromptSize is reset to null to prevent issues in subsequent calls

Detailed Dynamic Allocation Process:

1. Calculate actual user prompt tokens: 8,500 tokens
2. Check available space: 32,000 - 2,000 (system) - 8,000 (output) = 22,000 tokens
3. Allocate user prompt space: 8,500 tokens
4. Remaining context space: 22,000 - 8,500 = 13,500 tokens

If the user prompt were larger (e.g., 25,000 tokens):
1. Calculate user prompt tokens: 25,000 tokens  
2. Available space: 22,000 tokens (insufficient!)
3. Reduce user prompt allocation: 22,000 tokens (fits available space)
4. Remaining context space: 0 tokens
5. User prompt gets truncated to fit the reduced allocation
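
Both walkthroughs above follow the same rule: allocate what the prompt needs, capped at the space left after the system prompt and output. A minimal sketch, assuming that rule and using hypothetical names (DynamicAllocation, allocateDynamically):

```kotlin
// Hypothetical sketch of the dynamic allocation steps; not TPipe's code.
data class DynamicAllocation(val userPromptTokens: Int, val contextTokens: Int, val truncated: Boolean)

fun allocateDynamically(
    actualPromptTokens: Int,
    contextWindowSize: Int,
    systemPromptTokens: Int,
    maxTokens: Int
): DynamicAllocation {
    // Space left once the system prompt and output reservation are accounted for.
    val available = (contextWindowSize - systemPromptTokens - maxTokens).coerceAtLeast(0)
    // The prompt gets what it needs, capped at the available space.
    val userPromptTokens = minOf(actualPromptTokens, available)
    return DynamicAllocation(
        userPromptTokens = userPromptTokens,
        contextTokens = available - userPromptTokens,
        truncated = actualPromptTokens > available
    )
}

fun main() {
    println(allocateDynamically(8500, 32000, 2000, 8000))  // 8,500 for the prompt, 13,500 for context
    println(allocateDynamically(25000, 32000, 2000, 8000)) // 22,000 for the prompt, 0 for context, truncated
}
```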

Use when:

Input sizes vary significantly between requests
You want to maximize context space utilization
You prefer automatic space optimization over strict limits
You’re building flexible applications that handle diverse input types

Semantic Compression - Legend-Backed Prompt Reduction

TPipe can reduce prompt token usage before truncation by semantically compressing natural-language text. The compressor removes function words, common filler phrases, and repeated proper nouns while leaving quoted spans untouched and returning a legend that maps the short codes back to their original values.

val compression = pipe.compressPrompt("""
    Alice Johnson and Alice Johnson are going to review the launch proposal in order to help the team.
    "Quoted text stays untouched."
""".trimIndent())

val promptForLLM = if(compression.legend.isNotEmpty())
{
    "${compression.legend}\n\n${compression.compressedText}"
}
else
{
    compression.compressedText
}

When to use it

Natural-language prompts that repeat names, roles, or filler phrases
Long system prompts that can be safely rewritten as plain English
Prompt budgets where you want to preserve meaning before falling back to truncation

When not to use it

JSON, XML, code blocks, schemas, or other machine-readable payloads
Prompts whose wording must be preserved exactly as written (only quoted spans are left untouched)

Behavior: TokenBudgetSettings.compressUserPrompt triggers the same compressor in the user-prompt budget path. If the compressed result still exceeds the budget, TPipe continues through the existing truncation or failure logic.
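
To make the idea of legend-backed compression concrete, here is a deliberately small toy version. It is not TPipe's compressor: the filler-word list is tiny, names are supplied by the caller rather than detected, every occurrence of a name is replaced, and all identifiers (toyCompress, ToyCompression, toyFillerWords) are hypothetical.

```kotlin
// Toy illustration only; TPipe's real compressor is not shown in this guide.
data class ToyCompression(val legend: String, val compressedText: String)

val toyFillerWords = setOf("a", "an", "and", "the", "to", "of", "in", "is", "are")

fun toyCompress(text: String, knownNames: List<String>): ToyCompression {
    val quotedSpan = Regex("\"[^\"]*\"")
    val codes = knownNames.mapIndexed { i, name -> name to "N${i + 1}" }
    val usedCodes = linkedSetOf<String>()

    // Compresses text that sits outside quotes: names become codes, filler words are dropped.
    fun compressPlain(segment: String): String {
        var s = segment
        for ((name, code) in codes) {
            if (name in s) {
                usedCodes += "$code=$name"
                s = s.replace(name, code)
            }
        }
        return s.split(Regex("\\s+"))
            .filter { it.isNotEmpty() && it.lowercase() !in toyFillerWords }
            .joinToString(" ")
    }

    val pieces = mutableListOf<String>()
    var cursor = 0
    for (match in quotedSpan.findAll(text)) {
        pieces += compressPlain(text.substring(cursor, match.range.first))
        pieces += match.value // quoted text is copied verbatim
        cursor = match.range.last + 1
    }
    pieces += compressPlain(text.substring(cursor))

    return ToyCompression(
        legend = usedCodes.joinToString("; "),
        compressedText = pieces.filter { it.isNotEmpty() }.joinToString(" ")
    )
}
```

The legend is returned separately so the caller can prepend it to the compressed text, in the same way the compressPrompt example above does.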

Overflow Handling in Dynamic Allocation

When dynamic allocation runs out of space, TPipe reduces the user prompt allocation in stages:

val budget = TokenBudgetSettings(
    userPromptSize = null,                    // Dynamic allocation
    allowUserPromptTruncation = true,         // Enable overflow handling
    contextWindowSize = 32000,
    maxTokens = 8000
)

Overflow scenarios:

User prompt + system prompt + output > context window: User prompt size is automatically reduced
Binary content pushes total over limit: User prompt space is further reduced to accommodate binary data
Insufficient space even after reduction: Exception thrown if allowUserPromptTruncation is false

Example overflow handling:

Context window: 32,000 tokens
System prompt: 2,000 tokens  
Output budget: 8,000 tokens
User prompt (actual): 25,000 tokens
Binary content: 3,000 tokens

Step 1: Available space = 32,000 - 2,000 - 8,000 = 22,000 tokens
Step 2: User prompt exceeds available space (25,000 > 22,000)
Step 3: Reduce user prompt to fit: 22,000 tokens
Step 4: Account for binary content: 22,000 - 3,000 = 19,000 tokens
Step 5: Final user prompt allocation: 19,000 tokens
Step 6: User prompt truncated to fit 19,000 token budget
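
The six steps above reduce to two caps applied in order. The sketch below is one way to express them in standalone Kotlin; finalUserPromptAllocation is a hypothetical name, and the behavior for prompts that already fit is an assumption.

```kotlin
// Hypothetical sketch of the overflow steps; not TPipe's code.
fun finalUserPromptAllocation(
    actualPromptTokens: Int,
    contextWindowSize: Int,
    systemPromptTokens: Int,
    maxTokens: Int,
    binaryTokens: Int
): Int {
    // Steps 1-3: cap the prompt at the space left after system prompt and output.
    val available = contextWindowSize - systemPromptTokens - maxTokens
    val capped = minOf(actualPromptTokens, available)
    // Steps 4-5: binary content shares that space, so shrink the prompt further if needed.
    val roomBesideBinary = available - binaryTokens
    return minOf(capped, roomBesideBinary).coerceAtLeast(0)
}

fun main() {
    println(finalUserPromptAllocation(25000, 32000, 2000, 8000, 3000)) // prints 19000
    println(finalUserPromptAllocation(8500, 32000, 2000, 8000, 3000))  // prints 8500
}
```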

Choosing Between Dynamic and Explicit Allocation

Choose Explicit Allocation when:

Building production systems requiring predictable behavior
Implementing strict content policies or size limits
Working with known, consistent input patterns
Guaranteeing minimum context space availability

Choose Dynamic Allocation when:

Handling variable input sizes (user-generated content, document processing)
Maximizing token utilization efficiency
Building flexible, adaptive applications
Preferring automatic optimization over manual tuning

Hybrid approach:

// Start with dynamic allocation for flexibility
val flexibleBudget = TokenBudgetSettings(
    userPromptSize = null,                // Dynamic
    allowUserPromptTruncation = true,     // Handle overflow gracefully
    contextWindowSize = 32000
)

// Switch to explicit allocation when patterns emerge
val optimizedBudget = TokenBudgetSettings(
    userPromptSize = 15000,               // Based on observed patterns
    allowUserPromptTruncation = false,    // Strict enforcement
    contextWindowSize = 32000
)

Truncation - Intelligent Content Reduction

The Problem

When content exceeds available token space, you need deterministic behavior instead of random failures. Different use cases require different truncation strategies.

Truncation Methods

pipe.setContextWindowSettings(ContextWindowSettings.TruncateTop)    // Remove oldest content
pipe.setContextWindowSettings(ContextWindowSettings.TruncateBottom) // Remove newest content

TruncateTop:

Use case: Conversational AI, ongoing interactions
Behavior: Preserves recent context, removes historical content
Why: Recent context is more relevant for continuing conversations

TruncateBottom:

Use case: Document analysis, reasoning tasks
Behavior: Preserves initial context, removes recent additions
Why: Initial context often contains critical instructions or document structure
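
The difference between the two methods is easiest to see on a small list ordered oldest to newest. The following standalone sketch uses a hypothetical Entry type with made-up token counts; it illustrates the behavior described above and is not TPipe's truncation code.

```kotlin
// Hypothetical sketch; `Entry` and its token counts are illustrative, not TPipe types.
data class Entry(val text: String, val tokens: Int)

// TruncateTop: drop the oldest entries (front of the list) until the rest fits.
fun truncateTop(entries: List<Entry>, budget: Int): List<Entry> {
    var kept = entries
    while (kept.isNotEmpty() && kept.sumOf { it.tokens } > budget) kept = kept.drop(1)
    return kept
}

// TruncateBottom: drop the newest entries (end of the list) until the rest fits.
fun truncateBottom(entries: List<Entry>, budget: Int): List<Entry> {
    var kept = entries
    while (kept.isNotEmpty() && kept.sumOf { it.tokens } > budget) kept = kept.dropLast(1)
    return kept
}

fun main() {
    val history = listOf(Entry("oldest", 400), Entry("middle", 300), Entry("newest", 200))
    println(truncateTop(history, 500).map { it.text })    // prints [middle, newest]
    println(truncateBottom(history, 500).map { it.text }) // prints [oldest]
}
```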

For tokenizer tuning and lower-level truncation settings, see Token Counting, Truncation, and Tokenizer Tuning. That guide covers the parameters that control how tokens are estimated, split, and counted.

Automatic Truncation

pipe.truncateModuleContext()

What this does: Enables automatic context truncation when token limits are approached, preventing API failures.

How it works:

Calculates total token usage (system prompt + user input + context)
If the total exceeds the available space, truncates context according to the selected method
Preserves system prompt and user input (unless explicitly allowed to truncate)

Advanced Truncation Control

User Input Truncation

val budget = TokenBudgetSettings(
    allowUserPromptTruncation = true,  // Enable user input truncation
    preserveJsonInUserPrompt = true    // Preserve JSON structures
)

allowUserPromptTruncation:

false: Throw exception if user input exceeds userPromptSize
true: Automatically truncate user input to fit budget

preserveJsonInUserPrompt: Attempts to preserve JSON structure integrity during truncation.

truncateContextWindowAsString: Truncates the combined context as one string instead of trimming individual context pieces. Use this when you want a more aggressive, less structured cut.

Max Token Overflow - Intentional Output Constraint

pipe.enableMaxTokenOverflow()

What this does: Treats max token limits as intentional constraints rather than error conditions.

Use cases:

Constraining long outputs: Force models to be concise by setting low max tokens
Controlling reasoning: Limit reasoning tokens to prevent excessive “thinking”
Response length management: Ensure responses fit within UI constraints
Cost control: Cap token usage for budget management

Normal behavior: Throws error if model hits max token limit (treats as failure)
With overflow enabled: Accepts truncated output as valid result (treats as intentional constraint)
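
The contrast between the two behaviors can be modeled as a single decision. This sketch is standalone Kotlin with hypothetical names (ModelOutput, hitMaxTokens, acceptOutput); how TPipe detects that the limit was hit is not covered here.

```kotlin
// Hypothetical sketch; `ModelOutput` and `hitMaxTokens` are illustrative names.
data class ModelOutput(val text: String, val hitMaxTokens: Boolean)

fun acceptOutput(output: ModelOutput, maxTokenOverflowEnabled: Boolean): String {
    // Without overflow enabled, stopping at the limit is treated as a failure.
    if (output.hitMaxTokens && !maxTokenOverflowEnabled) {
        throw IllegalStateException("Model stopped at the max token limit")
    }
    // With overflow enabled, the partial text is a valid result.
    return output.text
}
```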

Example - Forcing concise responses:

pipe.setMaxTokens(100)           // Very short response limit
    .enableMaxTokenOverflow()    // Accept truncated responses
// Result: Model forced to be extremely concise, partial responses accepted

Practical Implementation Patterns

High-Context Creative Writing

val writerPipe = BedrockPipe()
    .setContextWindowSize(100000)     // Large context for story continuity
    .setMaxTokens(20000)              // Long-form output capability
    .setContextWindowSettings(ContextWindowSettings.TruncateTop)
    .truncateModuleContext()

val budget = TokenBudgetSettings(
    userPromptSize = 15000,
    maxTokens = 20000,
    reasoningBudget = 5000,
    allowUserPromptTruncation = true,
    truncationMethod = ContextWindowSettings.TruncateTop
)

Purpose: Maintains story continuity while allowing long outputs. Recent story context takes priority over distant history.

Analytical Processing

val analysisPipe = BedrockPipe()
    .setContextWindowSize(50000)
    .setMaxTokens(15000)
    .setContextWindowSettings(ContextWindowSettings.TruncateBottom)
    .truncateModuleContext()

val budget = TokenBudgetSettings(
    userPromptSize = 8000,
    maxTokens = 15000,
    reasoningBudget = 10000,          // Large reasoning allocation
    allowUserPromptTruncation = false, // Preserve analytical input integrity
    truncationMethod = ContextWindowSettings.TruncateBottom
)

Purpose: Preserves initial instructions and document structure. Large reasoning budget for complex analysis. Strict input preservation.

Dynamic Allocation for Variable Input Processing

// Flexible document processing with dynamic user prompt allocation
val documentProcessor = BedrockPipe()
    .setContextWindowSize(128000)
    .setMaxTokens(8000)
    .truncateModuleContext()

val dynamicBudget = TokenBudgetSettings(
    userPromptSize = null,                    // Dynamic allocation based on actual content
    maxTokens = 8000,
    reasoningBudget = 4000,
    allowUserPromptTruncation = true,         // Handle overflow gracefully
    contextWindowSize = 128000,
    multiPageBudgetStrategy = MultiPageBudgetStrategy.DYNAMIC_FILL
)

// Usage with varying input sizes
val shortQuery = "Summarize this document"           // ~5 tokens -> 5 tokens allocated
val longQuery = "Analyze this document in detail..." // ~500 tokens -> 500 tokens allocated
val massiveQuery = "..." // 50,000 tokens -> Automatically reduced to fit available space

Purpose: Optimal space utilization for applications with highly variable input sizes. Automatically adapts to content while maximizing context space.

Explicit Allocation for Predictable Systems

// Production system with strict input controls
val productionPipe = BedrockPipe()
    .setContextWindowSize(32000)
    .setMaxTokens(4000)
    .truncateModuleContext()

val explicitBudget = TokenBudgetSettings(
    userPromptSize = 8000,                    // Fixed allocation - predictable behavior
    maxTokens = 4000,
    reasoningBudget = 2000,
    allowUserPromptTruncation = false,        // Strict enforcement - fail if exceeded
    contextWindowSize = 32000,
    multiPageBudgetStrategy = MultiPageBudgetStrategy.EQUAL_SPLIT
)

// Guaranteed behavior: exactly 8000 tokens for user input and 18000 for context,
// assuming a ~2000-token system prompt (32000 - 2000 - 8000 - 4000 = 18000)
// Throws exception if user input exceeds 8000 tokens

Purpose: Predictable, deterministic behavior for production systems. Guarantees minimum context space and enforces input size policies.

Hybrid Approach - Adaptive Allocation

// Start with dynamic allocation, switch to explicit based on patterns
class AdaptiveTokenManager {
    private val observedSizes = mutableListOf<Int>()
    
    fun createBudget(isProduction: Boolean): TokenBudgetSettings {
        return if (isProduction && observedSizes.isNotEmpty()) {
            // Use observed patterns for explicit allocation
            val maxObserved = observedSizes.maxOrNull() ?: 0
            val safeAllocation = (maxObserved * 1.2).toInt() // 20% buffer over the largest observed prompt
            
            TokenBudgetSettings(
                userPromptSize = safeAllocation,          // Based on observed patterns
                allowUserPromptTruncation = false,        // Strict in production
                contextWindowSize = 32000
            )
        } else {
            // Dynamic allocation for development/learning
            TokenBudgetSettings(
                userPromptSize = null,                    // Learn from actual usage
                allowUserPromptTruncation = true,         // Flexible during learning
                contextWindowSize = 32000
            )
        }
    }
    
    fun recordUsage(actualTokens: Int) {
        observedSizes.add(actualTokens)
        if (observedSizes.size > 100) {
            observedSizes.removeAt(0) // Keep recent history
        }
    }
}

Purpose: Combines the flexibility of dynamic allocation during development with the predictability of explicit allocation in production.

Error Prevention and Handling

Token Budget Validation

try {
    pipe.setTokenBudget(budget)
} catch (e: Exception) {
    // Budget validation failed - constraints impossible to satisfy
}

Common validation failures:

System prompt exceeds total context window
maxTokens exceeds available space after system prompt
reasoningBudget exceeds maxTokens
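
If you want to catch these cases before calling setTokenBudget, a pre-check along the following lines is one option. It is a standalone sketch that mirrors the three failures listed above; validateBudget is a hypothetical helper and TPipe's own validation may check more than this.

```kotlin
// Hypothetical checks mirroring the failures listed above; not TPipe's validator.
fun validateBudget(
    contextWindowSize: Int,
    systemPromptTokens: Int,
    maxTokens: Int,
    reasoningBudget: Int
): List<String> {
    val problems = mutableListOf<String>()
    if (systemPromptTokens > contextWindowSize) {
        problems += "System prompt exceeds total context window"
    }
    if (maxTokens > contextWindowSize - systemPromptTokens) {
        problems += "maxTokens exceeds available space after system prompt"
    }
    if (reasoningBudget > maxTokens) {
        problems += "reasoningBudget exceeds maxTokens"
    }
    return problems
}

fun main() {
    println(validateBudget(32000, 2000, 20000, 8000)) // prints []
    println(validateBudget(32000, 2000, 4000, 6000))  // prints [reasoningBudget exceeds maxTokens]
}
```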

Runtime Token Management

try {
    val response = runBlocking { pipe.generateText(largeInput) }
} catch (e: Exception) {
    if (e.message?.contains("Context window size is too small") == true) {
        // Input + context exceeds available space
        // Solution: Enable truncation or increase context window
    }
}

Token Counting and Optimization

Automatic Token Calculation

TPipe automatically counts tokens for:

System prompts
User input
Context data
Binary content (images, documents)
Model reasoning output

Optimization Strategies

// Conservative sizing (leave 10-15% buffer)
val modelLimit = 32000
val safeLimit = (modelLimit * 0.85).toInt()
pipe.setContextWindowSize(safeLimit)

// Reasoning model optimization
reasoningBudget = maxTokens / 2  // 50% allocation for reasoning

// Context-heavy applications
userPromptSize = (contextWindowSize * 0.6).toInt()  // 60% for rich context

Production Considerations

Model-Specific Configuration

// DeepSeek (high reasoning capability)
reasoningBudget = (maxTokens * 0.6).toInt()  // 60% for reasoning

// Claude (balanced performance)
allowUserPromptTruncation = true   // Flexible input handling

// GPT-OSS (analytical tasks)
allowUserPromptTruncation = false  // Preserve input integrity
truncationMethod = ContextWindowSettings.TruncateBottom

Monitoring and Debugging

// Check actual token usage
val tokenCount = pipe.countAllTokens(content)
val settings = pipe.getTruncationSettings()

Size-Based Priority Allocation - DYNAMIC_SIZE_FILL Strategy

// Same strategy described above, shown here with a page-keyed pipe
val sizePriorityPipe = BedrockPipe()
    .setPageKey("gameplayData, userPreferences, storyContent")
    .setTokenBudget(TokenBudgetSettings(
        contextWindowSize = 32000,
        maxTokens = 4000,
        multiPageBudgetStrategy = MultiPageBudgetStrategy.DYNAMIC_SIZE_FILL
    ))
    .enableLoreBookFillMode()  // Optional: enable fill mode for prioritized lorebook selection

// Alternative method chaining
pipe.enableDynamicSizeFill()

This is the same size-aware dynamic fill strategy described above. It is useful when smaller pages carry state you do not want to lose and larger pages can be trimmed first.
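
One plausible way to implement a size-aware fill is to visit pages from smallest to largest and give each a fair share of whatever budget remains. The sketch below takes that approach as an assumption; TPipe's actual DYNAMIC_SIZE_FILL algorithm may differ, and dynamicSizeFill is a hypothetical name.

```kotlin
// Hypothetical sketch of a size-aware fill; TPipe's actual algorithm may differ.
// `pageSizes` maps a page key to the tokens that page would need untrimmed.
fun dynamicSizeFill(budget: Int, pageSizes: Map<String, Int>): Map<String, Int> {
    val result = mutableMapOf<String, Int>()
    var remaining = budget
    // Smallest pages first, so they are the most likely to fit whole.
    val ordered = pageSizes.entries.sortedBy { it.value }
    ordered.forEachIndexed { index, (page, size) ->
        val pagesLeft = ordered.size - index
        val fairShare = remaining / pagesLeft
        val granted = minOf(size, fairShare)
        result[page] = granted
        // Whatever this page did not use stays in `remaining` for the larger pages.
        remaining -= granted
    }
    return result
}

fun main() {
    val sizes = mapOf("gameplayData" to 2000, "userPreferences" to 500, "storyContent" to 12000)
    println(dynamicSizeFill(9000, sizes))
}
```

With a 9,000-token budget, the two small pages are kept whole (500 and 2,000 tokens) and the large page is trimmed to the remaining 6,500.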

Intentional Output Constraints

// Accept partial responses as design choice
pipe.setMaxTokens(2000)
    .enableMaxTokenOverflow()

These token management features solve the core challenge of working with token-limited AI models: ensuring predictable behavior, preventing runtime failures, and optimizing resource utilization for different use cases. The key is matching your truncation strategy and token allocation to your specific application requirements.

Next Steps

Now that you understand token management, learn about advanced token handling:

→ Token Counting, Truncation, and Tokenizer Tuning - Advanced token handling
