- [Overview](#overview)
Dictionary Object API
Table of Contents
Overview
The Dictionary object provides token counting and text truncation capabilities for TPipe, using a dictionary-based approach to approximate LLM tokenization behavior for context window management.
object Dictionary
Public Properties
words
val words: List<String> by lazy
Dictionary word list loaded from /Words.txt resource file. Used for token counting and text matching operations. Lazily initialized and cached for performance. Falls back to empty list if resource loading fails.
Public Functions
Token Counting
countTokens(text: String, settings: TruncationSettings): Int
Counts tokens using TruncationSettings configuration object.
Behavior: Wrapper method that extracts all tokenization parameters from TruncationSettings and calls the full countTokens method. Provides convenient interface for consistent token counting across TPipe.
countTokens(text: String, countSubWordsInFirstWord: Boolean = true, favorWholeWords: Boolean = true, countOnlyFirstWordFound: Boolean = false, splitForNonWordChar: Boolean = true, alwaysSplitIfWholeWordExists: Boolean = false, countSubWordsIfSplit: Boolean = false, nonWordSplitCount: Int = 4): Int
Advanced token counting with comprehensive configuration options.
Behavior: Complex tokenization algorithm with multiple strategies:
Word Processing:
- Splits text by spaces and processes each word individually
- Ensures minimum 1 token per non-empty word
First Word Handling:
countSubWordsInFirstWord = true: Uses overlapping matches to count all subwords in first wordcountSubWordsInFirstWord = false: Treats first word like other words
Dictionary Matching:
favorWholeWords = true: Prioritizes complete word matches over subword fragmentsfavorWholeWords = false: Uses longest available matches regardless of word boundaries
Match Limiting:
countOnlyFirstWordFound = true: Stops after finding first dictionary match per wordcountOnlyFirstWordFound = false: Continues processing remainder of word
Splitting Logic:
splitForNonWordChar = true: Splits at non-letter characters and processes remainderalwaysSplitIfWholeWordExists = true: Forces splitting even when whole word matches foundcountSubWordsIfSplit = true: Continues dictionary matching after splitcountSubWordsIfSplit = false: Uses character-based counting after split
Fallback Counting:
nonWordSplitCount: Characters per token when no dictionary matches found- Applied when no matches exist or after splitting with
countSubWordsIfSplit = false
Text Truncation
truncate(text: String, windowSize: Int, multiplyWindowSizeBy: Int = 1000, truncateSettings: ContextWindowSettings, ...tokenization parameters...): String
Truncates text to fit within token budget using specified strategy.
Behavior:
Token Budget Calculation:
- Final limit =
windowSize * multiplyWindowSizeBy - Default multiplier of 1000 allows readable input values
Truncation Strategies:
TruncateTop: Removes words from beginning, preserves endTruncateBottom: Removes words from end, preserves beginningTruncateMiddle: Removes middle section, preserves both beginning and end
Word-Level Processing:
- Splits text by spaces and processes complete words
- Never truncates within individual words
- Uses same token counting logic for consistency
Early Return:
- Returns original text if already within token limit
- Avoids unnecessary processing when truncation not needed
truncateWithSettings(content: String, tokenBudget: Int, truncationMethod: ContextWindowSettings, settings: TruncationSettings): String
Convenience method for truncation using TruncationSettings object.
Behavior: Extracts all parameters from TruncationSettings and calls full truncate method. Provides consistent interface matching token counting approach.
| overlapTokens | Int | Tokens overlapping between chunks. Default 0 (no overlap). When > 0, the last N tokens of chunk N are repeated as the first N tokens of chunk N+1 to preserve context continuity. |
| preserveWordBoundary | Boolean | If true, a word that alone fits within maxTokens will start a new chunk rather than being forced into the current chunk over budget. Default true. |
Returns: List of text chunks, each containing at most maxTokens tokens. Empty chunks are omitted.
chunkByTokensWithSettings(text: String, maxTokens: Int, settings: TruncationSettings, overlapTokens: Int = 0, preserveWordBoundary: Boolean = true): List<String>
Convenience wrapper for chunkByTokens() that accepts a TruncationSettings object instead of individual parameters.
Behavior: Extracts all tokenization parameters from TruncationSettings and delegates to chunkByTokens(). Pairs well with Pipe.getTruncationSettings().
Parameters:
text: Input text to split into token-bounded chunksmaxTokens: Maximum token count per chunksettings:TruncationSettingscontaining all tokenization parametersoverlapTokens: Tokens overlapping between chunks (default 0)preserveWordBoundary: Preserve word boundaries (defaulttrue)
Returns: List of text chunks with token budget applied.
List Truncation
truncate(messages: List<String>, windowSize: Int, multiplyWindowSizeBy: Int = 1000, truncateSettings: ContextWindowSettings, ...tokenization parameters...): List<String>
Truncates list of strings by removing entire elements to fit token budget.
Behavior:
Element-Level Processing:
- Removes complete list elements rather than truncating individual strings
- Useful for chat histories where complete messages should be preserved or removed
Token Calculation:
- Sums token counts across all list elements
- Uses same tokenization parameters as text truncation
Truncation Strategies:
TruncateTop: Removes elements from beginning of listTruncateBottom: Removes elements from end of listTruncateMiddle: Removes elements from middle, preserving start and end elements
Preservation Logic:
- Maintains chronological order of remaining elements
- Ensures no partial message truncation
- Returns original list if already within token budget
Key Behaviors
Dictionary-Based Tokenization
Uses loaded word dictionary to approximate LLM tokenization behavior. Provides configurable strategies to match different LLM tokenizer characteristics without requiring actual tokenizer access.
Consistent Token Counting
All truncation methods use identical token counting logic to ensure truncated content fits within specified budgets. Token counts are deterministic and repeatable.
Word Boundary Preservation
Text truncation operates at word boundaries to maintain readability and semantic coherence. Never splits individual words during truncation process.
Flexible Configuration
Extensive parameterization allows fine-tuning for different LLM types and use cases. Settings can be optimized for specific tokenizer behaviors or content types.
Performance Optimization
Lazy loading of dictionary and early return conditions minimize computational overhead. Caching and efficient algorithms support high-frequency usage.
Fallback Mechanisms
Graceful handling of edge cases including empty inputs, missing dictionary resources, and no-match scenarios. Always provides reasonable token estimates even with incomplete data.
Next Steps
- Lorebook API - Continue into knowledge-base management.