Dictionary and Multi-Language System Specification
Feature Overview
Feature Name: Dictionary and Multi-Language System
Priority: P1 (High)
Status: Architecture Finalized (2026-01-04)
Target Version: v1.2.0
Summary
This specification defines the architecture for managing dictionaries, handling multiple languages, and supporting dynamic language switching in CleverKeys. It unifies static dictionaries, user-defined words, and neural prediction models into a cohesive "Language Pack" system.
Motivation
To support a global user base, CleverKeys must seamlessly handle multiple languages. The current system relies on assets baked into the APK. A more flexible system is needed to allow users to install, manage, and switch between languages without requiring app updates.
1. System Architecture
1.1 The Language Pack Concept
A "Language Pack" is a self-contained unit that provides support for a specific language locale (e.g., en, fr, es-rMX).
Components of a Language Pack:
- 1. Main Dictionary (.bin): Trie-based vocabulary with frequencies and accent mappings.
  Path: dictionaries/{lang}_enhanced.bin
- 2. Unigram Frequency List: Top 1000 words for language detection.
  Path: dictionaries/{lang}_unigrams.bin
- 3. Neural Models (.onnx): Models for swipe prediction (Encoder + Decoder).
  Path: models/swipe_encoder_{lang}.onnx & models/swipe_decoder_{lang}.onnx
  Note: Initial release reuses the English model for all Latin-script languages.
- 4. Keyboard Layouts (.xml): Standard layouts for the language (e.g., QWERTY, AZERTY).
  Path: layouts/{lang}_.xml
- 5. Metadata (.json): Pack version, name, author, license attribution.
1.2 Dictionary Layers
The prediction engine utilizes a tiered dictionary system to resolve word candidates.
Layer 1: Main Dictionary (Read-Only)
Source: Language Pack asset.
Content: tens of thousands of common words with frequency data.
Management: Loaded via MainDictionarySource.
Layer 2: User Dictionary (Read-Write)
Source: Android System UserDictionary content provider.
Content: Words learned by other apps or added globally by the user.
Management: Loaded via UserDictionarySource.
Layer 3: Custom Dictionary (Read-Write)
Source: App-internal SharedPreferences (user_dictionary).
Content: Words explicitly added by the user within CleverKeys or learned from typing.
Management: Loaded via CustomDictionarySource.
Layer 4: Disabled Words (Read-Write)
Source: App-internal SharedPreferences.
Content: Words from the Main Dictionary that the user has explicitly blocked (e.g., offensive words or annoying auto-corrects).
Management: Loaded via DisabledDictionarySource.
Resolution Logic:
- 1. Candidates are gathered from Layers 1, 2, and 3.
- 2. Any candidate present in Layer 4 is filtered out.
- 3. Candidates are scored based on frequency, source priority (Custom > User > Main), and neural probability.
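The following Kotlin sketch illustrates this three-step pass. The candidate types, boost values, and fallback neural probability are assumptions for illustration, not the actual CleverKeys pipeline.

// Illustrative resolution pass; Layer, Candidate, the boost values, and the
// fallback probability are assumptions, not the production classes.
enum class Layer { CUSTOM, USER, MAIN }   // ordered by source priority

data class Candidate(val word: String, val frequencyRank: Int, val layer: Layer)

fun resolveCandidates(
    main: List<Candidate>,
    user: List<Candidate>,
    custom: List<Candidate>,
    disabled: Set<String>,             // Layer 4: blocked words
    neuralProb: Map<String, Float>     // per-word probability from the swipe model
): List<String> =
    (custom + user + main)                                // 1. gather Layers 1-3
        .filter { it.word !in disabled }                  // 2. remove Layer-4 words
        .groupBy { it.word }
        .map { (word, entries) ->
            val best = entries.minByOrNull { it.layer }!! // Custom > User > Main
            val rankScore = 1.0f - best.frequencyRank / 255f
            val boost = when (best.layer) {
                Layer.CUSTOM -> 1.2f
                Layer.USER -> 1.1f
                Layer.MAIN -> 1.0f
            }
            word to (neuralProb[word] ?: 0.1f) * rankScore * boost   // 3. combined score
        }
        .sortedByDescending { it.second }
        .map { it.first }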
2. Accent Handling Architecture
2.1 The Core Constraint
The neural swipe model has a 26-letter vocabulary (a-z only). It cannot distinguish café from cafe: both produce an identical swipe trajectory.
2.2 Normalization Strategy
One-way normalization at dictionary build time:
- 1. For each canonical word (e.g., café), generate the normalized form (cafe)
- 2. The dictionary stores the canonical form as the display word
- 3. Lookup/matching uses the normalized form as the key
- 4. NN output (cafe) maps to the canonical display (café)
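A minimal sketch of this normalization step, using Java's Unicode NFD decomposition followed by stripping combining marks; it mirrors the AccentNormalizer role described in Section 9.2, though the production implementation may differ.

import java.text.Normalizer

// Decompose to NFD, then drop combining marks so "café" → "cafe".
fun normalizeAccents(word: String): String =
    Normalizer.normalize(word.lowercase(), Normalizer.Form.NFD)
        .replace(Regex("\\p{Mn}+"), "")

fun main() {
    println(normalizeAccents("café"))   // cafe
    println(normalizeAccents("schön"))  // schon
}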
2.3 Data Structures
// Accent mapping: normalized → list of canonical forms
// Example: "cafe" → ["café"], "schon" → ["schon", "schön"]
data class AccentMapping(
    val normalized: String,
    val canonicalForms: List<String>,
    val frequencies: List<Int>   // Parallel to canonicalForms
)
2.4 Prefix Index Strategy
The prefix index is built on normalized words:
- • User swipes "café" → trajectory matches the "cafe" pattern
- • Prefix lookup: caf → finds normalized candidates
- • Each candidate maps to its accented canonical form
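As a sketch, the NormalizedPrefixIndex planned in Phase 1 can be viewed as a sorted map from normalized words to their canonical forms; the API below is illustrative only.

// Illustrative in-memory view: normalized word → canonical (accented) display forms.
class NormalizedPrefixIndex(entries: Map<String, List<String>>) {
    private val byNormalized = entries
    private val sortedKeys = entries.keys.sorted()

    // All canonical forms whose normalized spelling starts with the prefix.
    fun lookup(prefix: String): List<String> {
        val start = sortedKeys.binarySearch(prefix).let { if (it < 0) -it - 1 else it }
        return sortedKeys.asSequence()
            .drop(start)
            .takeWhile { it.startsWith(prefix) }
            .flatMap { byNormalized[it].orEmpty().asSequence() }
            .toList()
    }
}

// lookup("caf") surfaces the accented canonical forms stored for "cafe", "cafes", …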
2.5 Touch Typing vs Swipe Typing
| Mode | Input | Lookup Key | Display |
|------|-------|------------|---------|
| Swipe | trajectory | normalized | canonical (accented) |
| Touch | typed chars | as-typed or normalized | canonical if match found |
3. Binary Dictionary Format (v2)
3.1 Format Overview
Move from HashMap to Trie-based format for:
- • O(L) prefix lookups (L = key length)
- • Memory efficiency for shared prefixes
- • Natural prefix search without separate index
3.2 Binary Structure
┌────────────────────────────────────────┐
│ HEADER (32 bytes) │
│ - Magic: "CKDICT" (6 bytes) │
│ - Version: 2 (2 bytes) │
│ - Language: "es" (4 bytes) │
│ - Word Count (4 bytes) │
│ - Trie Offset (4 bytes) │
│ - Metadata Offset (4 bytes) │
│ - Accent Map Offset (4 bytes) │
│ - Reserved (4 bytes) │
├────────────────────────────────────────┤
│ TRIE DATA BLOCK │
│ - Compact trie of NORMALIZED words │
│ - Terminal nodes store word_id │
├────────────────────────────────────────┤
│ WORD METADATA BLOCK │
│ - Array indexed by word_id: │
│ - Canonical string (UTF-8, varint) │
│ - Frequency rank (UInt8, 0-255) │
├────────────────────────────────────────┤
│ ACCENT MAP BLOCK (optional) │
│ - normalized_word → [canonical_ids] │
│ - Only for words with accent variants │
└────────────────────────────────────────┘
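A sketch of parsing the 32-byte header with java.nio; the byte order and zero-padding of the language field are assumptions, not part of the format spec above.

import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.charset.StandardCharsets

data class DictHeader(
    val language: String,
    val wordCount: Int,
    val trieOffset: Int,
    val metadataOffset: Int,
    val accentMapOffset: Int
)

fun readHeader(buf: ByteBuffer): DictHeader {
    buf.order(ByteOrder.LITTLE_ENDIAN)                       // assumed byte order
    val magic = ByteArray(6).also { buf.get(it) }
    require(String(magic, StandardCharsets.US_ASCII) == "CKDICT") { "Not a CKDICT file" }
    val version = buf.short.toInt()
    require(version == 2) { "Unsupported dictionary version: $version" }
    val langBytes = ByteArray(4).also { buf.get(it) }
    val language = String(langBytes, StandardCharsets.US_ASCII).trimEnd('\u0000')
    val header = DictHeader(
        language = language,
        wordCount = buf.int,
        trieOffset = buf.int,
        metadataOffset = buf.int,
        accentMapOffset = buf.int
    )
    buf.position(buf.position() + 4)                         // skip 4 reserved bytes
    return header
}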
3.3 Frequency Ranking
Store rank (0-255) instead of raw frequency:
- • Rank 0 = most frequent word
- • Rank 255 = least frequent
- • Log-scaled quantization preserves relative ordering
- • Saves 3 bytes per word vs 32-bit frequency
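A sketch of the log-scaled quantization; the corpus extremes would be computed by the build script, and the clamping behavior here is an assumption.

import kotlin.math.ln

// Map a raw corpus frequency onto an 8-bit rank: 0 = most frequent, 255 = least.
fun frequencyToRank(freq: Double, minFreq: Double, maxFreq: Double): Int {
    if (freq >= maxFreq) return 0
    if (freq <= minFreq) return 255
    // Position of freq on a log scale between the corpus extremes (0.0 .. 1.0)
    val t = (ln(freq) - ln(minFreq)) / (ln(maxFreq) - ln(minFreq))
    return ((1.0 - t) * 255).toInt().coerceIn(0, 255)
}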
3.4 Build Pipeline
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ AOSP Word List │────▶│ Frequency │────▶│ Binary Dict │
│ (CC BY 4.0) │ │ Enrichment │ │ Generator │
└─────────────────┘ │ (wordfreq) │ └─────────────────┘
└─────────────────┘ │
▼
{lang}_enhanced.bin
4. Multi-Language Switching
4.1 Manual Switching
Trigger: User taps the "Globe" key or long-presses Spacebar.
Action: Cycles through the list of Active Languages enabled in Settings.
Outcome:
The MainDictionarySource is hot-swapped to the new language.
The MultiLanguageManager loads the corresponding ONNX models.
The keyboard layout is updated (if a specific layout is linked to the language).
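A trivial sketch of the globe-key cycle; the class name is illustrative, and the active-language ordering would come from Settings.

// Cycles through the languages the user enabled in Settings.
class LanguageCycler(private val activeLanguages: List<String>) {
    private var index = 0

    fun next(): String {
        index = (index + 1) % activeLanguages.size
        return activeLanguages[index]
    }
}

// On globe-key tap: the returned code drives the dictionary hot-swap,
// ONNX model load, and layout update described above.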
4.2 Auto-Switching (Polyglot Mode)
4.2.1 Detection Algorithm
Word-based unigram frequency model (not character patterns):
- 1. Each language pack ships top 1000 unigrams
- 2. Maintain sliding window of last 5 committed words
- 3. Score each word against active language unigram lists
- 4. Track running score per language (exponentially decaying average)
4.2.2 Switching Logic
Conservative threshold to prevent jitter:
- • Switch only when the candidate language's score exceeds 2× the current language's score
- • Require 2-3 consecutive words favoring the new language
- • Brief UI feedback on spacebar: EN → ES
data class LanguageScore(
    val language: String,
    var score: Float = 0f,
    var consecutiveHits: Int = 0
)

fun shouldSwitch(primary: LanguageScore, candidate: LanguageScore): Boolean {
    return candidate.score > primary.score * 2.0f &&
        candidate.consecutiveHits >= 2
}
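A minimal sketch tying 4.2.1 and 4.2.2 together: a per-language running score updated on each committed word, then fed into shouldSwitch(). The decay factor and hit weight are assumptions to be tuned; the unigram sets would come from the {lang}_unigrams.bin files.

// Assumes the LanguageScore class and shouldSwitch() from the block above.
class UnigramLanguageScorer(
    private val unigrams: Map<String, Set<String>>,  // lang → top-1000 unigram set
    private val decay: Float = 0.7f                  // assumed per-word decay
) {
    private val scores = unigrams.keys.associateWith { LanguageScore(it) }

    fun onWordCommitted(word: String) {
        val w = word.lowercase()
        for ((lang, state) in scores) {
            val hit = w in unigrams.getValue(lang)
            state.score = state.score * decay + if (hit) 1f else 0f
            state.consecutiveHits = if (hit) state.consecutiveHits + 1 else 0
        }
    }

    // True when the detector recommends switching from `current` to `candidate`.
    fun suggestSwitch(current: String, candidate: String): Boolean {
        val cur = scores[current] ?: return false
        val cand = scores[candidate] ?: return false
        return shouldSwitch(cur, cand)
    }
}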
5. Dual-Dictionary Mode (Secondary Language)
5.1 Use Case
Bilingual typing (e.g., English + Spanish) on single QWERTY layout without manual switching.
5.2 Architecture
┌─────────────────────────────────────────────────────┐
│ SuggestionRanker │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Primary │ │ Secondary │ │
│ │ Dictionary │ │ Dictionary │ │
│ │ (English) │ │ (Spanish) │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────────────────────────────────┐ │
│ │ Unified Scoring Pipeline │ │
│ │ score = nn_conf × dict_rank × lang_ctx │ │
│ └────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
5.3 Scoring Formula
fun calculateUnifiedScore(
    word: String,
    nnConfidence: Float,    // From ONNX model
    dictionaryRank: Int,    // 0-255, lower = more common
    languageContext: Float, // 0.0-1.0, from detector
    isPrimaryLang: Boolean
): Float {
    val rankScore = 1.0f - (dictionaryRank / 255f)
    val langMultiplier = if (isPrimaryLang) 1.0f else languageContext
    val secondaryPenalty = if (isPrimaryLang) 1.0f else 0.9f // Configurable
    return nnConfidence * rankScore * langMultiplier * secondaryPenalty
}
5.4 Deduplication
When a word exists in both dictionaries (e.g., "son"):
- • Present only the entry with higher final score
- • Language context score naturally favors the contextually appropriate language
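A sketch of this deduplication step, assuming suggestions arrive already scored by the formula in 5.3; the type below is illustrative.

data class ScoredSuggestion(val word: String, val language: String, val score: Float)

// When both dictionaries propose the same word (e.g. "son"), keep the higher score.
fun deduplicate(suggestions: List<ScoredSuggestion>): List<ScoredSuggestion> =
    suggestions
        .groupBy { it.word }
        .map { (_, entries) -> entries.maxByOrNull { it.score }!! }
        .sortedByDescending { it.score }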
6. Dictionary Sources and Licensing
6.1 Source Strategy
| Source | Use | License | Notes |
|--------|-----|---------|-------|
| AOSP Dictionaries | Word lists | CC BY 4.0 | 200+ languages |
| wordfreq | Frequency data | Apache 2.0 (code), CC BY-SA 4.0 (data) | Snapshot through 2021 |
| FrequencyWords | Alt frequency | MIT (code), CC BY-SA 4.0 (data) | From OpenSubtitles |
6.2 Licensing Compliance
CC BY-SA 4.0 for dictionary assets is acceptable:
- • CleverKeys code remains Apache 2.0
- • Dictionary .bin files are separate assets under CC BY-SA 4.0
- • Proper attribution in LICENSES.md and pack metadata
- • Attribution visible in Settings → About → Licenses
6.3 Build Pipeline
# scripts/build_dictionary.py
import wordfreq  # frequency data (Apache 2.0 code, CC BY-SA 4.0 data)

# load_aosp_wordlist, normalize_accents, build_compact_trie, write_binary_dict,
# and DEFAULT_LOW_FREQ are defined elsewhere in the script.

def build_language_pack(lang: str):
    # 1. Load AOSP word list
    words = load_aosp_wordlist(f"aosp/{lang}.txt")

    # 2. Enrich with frequency data
    for word in words:
        freq = wordfreq.word_frequency(word.text, lang)
        word.frequency = freq or DEFAULT_LOW_FREQ

    # 3. Normalize for accent mapping (normalized form → canonical entries)
    normalized = {}
    for word in words:
        norm = normalize_accents(word.text)
        normalized.setdefault(norm, []).append(word)

    # 4. Build trie on normalized keys
    trie = build_compact_trie(normalized.keys())

    # 5. Generate binary format
    write_binary_dict(lang, trie, words, normalized)
7. Implementation Plan
Phase 1: Foundation (v1.2.0)
- □ Implement accent normalization in WordPredictor
- □ Create NormalizedPrefixIndex that maps normalized → canonical
- □ Update BinaryDictionaryLoader for v2 format
- □ Build script: scripts/build_dictionary.py
Phase 2: Multi-Dictionary (v1.2.1)
- □ Implement SuggestionRanker for unified scoring
- □ Add secondary dictionary slot in DictionaryManager
- □ UI: Settings → Languages → Secondary Language import
Phase 3: Language Detection (v1.2.2)
- □ Implement word-based UnigramLanguageDetector
- □ Ship unigram lists for bundled languages
- □ Auto-switching with configurable sensitivity
Phase 4: Language Packs (v1.3.0)
- □ Language Pack ZIP format spec
- □ Download manager for on-demand languages
- □ UI: Language Store in Settings
8. Import and Management Workflows
8.1 Installing New Languages
Mechanism: "Language Store" in Settings.
Source:
Bundled: Common languages included in APK assets.
Downloadable: Hosted on GitHub Releases or a dedicated CDN.
Process:
1. User selects language.
2. App downloads ZIP bundle.
3. Files are extracted to app-private storage (files/languages/{lang}/).
4. DictionaryManager registers the new language availability.
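A sketch of step 3, unpacking a downloaded pack into files/languages/{lang}/. Checksum verification and error handling are omitted; the real download manager will need both.

import java.io.File
import java.util.zip.ZipInputStream

fun installLanguagePack(zipFile: File, filesDir: File, lang: String) {
    val targetDir = File(filesDir, "languages/$lang").apply { mkdirs() }
    ZipInputStream(zipFile.inputStream().buffered()).use { zip ->
        generateSequence { zip.nextEntry }.forEach { entry ->
            val out = File(targetDir, entry.name)
            // Guard against path traversal ("zip slip") in untrusted archives
            require(out.canonicalPath.startsWith(targetDir.canonicalPath)) { "Bad entry: ${entry.name}" }
            if (entry.isDirectory) {
                out.mkdirs()
            } else {
                out.parentFile?.mkdirs()
                out.outputStream().use { zip.copyTo(it) }
            }
        }
    }
    // Step 4: DictionaryManager would register the new language here.
}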
8.2 Dictionary Import/Export
Scope: Custom Dictionary (Layer 3) and Disabled Words (Layer 4).
Format: JSON.
Import Logic:
Read JSON.
For each word: Check against existing Custom Dictionary.
If new, add to SharedPreferences.
Crucial: Do NOT write to Android System UserDictionary during bulk import to avoid pollution and permission issues.
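A sketch of the import pass using Android's bundled org.json. The JSON schema in the comment is an assumption; the firm constraint from the spec is that bulk import writes only to the app's own storage, never the system UserDictionary.

import org.json.JSONObject

// Assumed schema:
// { "version": 1,
//   "custom":   ["CleverKeys", "ONNX", ...],
//   "disabled": ["teh", ...] }
fun importDictionaryJson(
    json: String,
    customWords: MutableSet<String>,    // backed by SharedPreferences (Layer 3)
    disabledWords: MutableSet<String>   // backed by SharedPreferences (Layer 4)
) {
    val root = JSONObject(json)
    root.optJSONArray("custom")?.let { arr ->
        for (i in 0 until arr.length()) {
            val word = arr.getString(i)
            if (word !in customWords) customWords.add(word)   // add only new words
        }
    }
    root.optJSONArray("disabled")?.let { arr ->
        for (i in 0 until arr.length()) disabledWords.add(arr.getString(i))
    }
    // Caller persists both sets back to SharedPreferences in a single commit.
}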
8.3 Custom Word Management
UI: DictionaryManagerActivity (Jetpack Compose).
Tabs:
Custom: View/Edit/Delete words in Layer 3.
User: View words in Layer 2 (Read-only or System Intent to edit).
Disabled: View/Re-enable words in Layer 4.
Secondary: View/manage secondary language dictionary.
Interaction: Swipe-to-delete, Undo support, Search filter.
9. Data Structures
9.1 Core Classes
// Representation of a word in the aggregation pipeline
data class DictionaryWord(
    val canonical: String,    // Display form with accents
    val normalized: String,   // Lookup key without accents
    val frequencyRank: Int,   // 0-255, lower = more common
    val source: WordSource,   // MAIN, USER, CUSTOM, SECONDARY
    var enabled: Boolean = true
)

enum class WordSource {
    MAIN,      // Primary language pack
    SECONDARY, // Secondary language pack
    USER,      // Android UserDictionary
    CUSTOM     // App SharedPreferences
}

// Language detection state
data class LanguageState(
    val primary: String,
    val secondary: String?,
    val detectedContext: String,
    val confidence: Float
)
9.2 Key Classes
DictionaryManager: Singleton. Holds references to active WordPredictor instances. Handles language lifecycle.
MultiLanguageManager: Manages ONNX sessions. Handles auto-detection logic.
SuggestionRanker: Merges candidates from multiple dictionaries with unified scoring.
UnigramLanguageDetector: Word-based language detection using frequency lists.
AccentNormalizer: Unicode normalization (NFD) + accent stripping.
BackupRestoreManager: Handles JSON serialization for Import/Export.
10. Performance Considerations
Binary Dictionaries: Trie-based .bin format for O(L) lookups (<5ms for any word).
Memory-Mapped I/O: Use MappedByteBuffer for large dictionaries to avoid heap pressure.
Async Loading: Language switching happens on Dispatchers.IO.
Lazy Initialization: Secondary dictionary loaded only when feature enabled.
Memory Management: Inactive language models (ONNX sessions) unloaded after 60s timeout.
Unigram Cache: Language detection unigram lists kept in memory (~100KB per language).
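A sketch of the memory-mapped load path; a MappedByteBuffer keeps the trie off the Java heap and remains valid after the channel is closed.

import java.io.RandomAccessFile
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

// Map a dictionary .bin read-only; the OS pages it in on demand.
fun mapDictionary(path: String): MappedByteBuffer =
    RandomAccessFile(path, "r").use { raf ->
        raf.channel.map(FileChannel.MapMode.READ_ONLY, 0, raf.length())
    }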
11. Future Roadmap
v1.3: Cloud sync for Custom Dictionaries.
v1.4: Language-specific neural models (fine-tuned on each language).
v1.5: User-generated Language Packs tool.
v2.0: Multi-script support (Cyrillic, Arabic, CJK).