Languages v1.1.84

Dictionary & Languages

Multi-language support and language packs

Dictionary and Multi-Language System

Overview

The dictionary system manages word lookup, frequency ranking, and multi-language support through a tiered architecture. It combines static dictionaries, user-defined words, and neural prediction models into a "Language Pack" system with automatic language detection for bilingual typing.

Key Files

| File | Class/Function | Purpose |
|------|----------------|---------|
| src/main/kotlin/tribixbite/cleverkeys/OptimizedVocabularyImpl.kt | OptimizedVocabulary | Dictionary lookup and filtering |
| src/main/kotlin/tribixbite/cleverkeys/data/LanguageDetector.kt | LanguageDetector | Word-based language detection |
| src/main/kotlin/tribixbite/cleverkeys/WordPredictor.kt | WordPredictor | Unified prediction pipeline |
| assets/dictionaries/{lang}_enhanced.bin | Binary dictionaries | Trie-based word storage |

Architecture

Language Pack Structure

Each language pack is a self-contained unit:

Language Pack ({lang})
├── dictionaries/{lang}_enhanced.bin    # Trie-based vocabulary
├── dictionaries/{lang}_unigrams.bin    # Top 1000 words for detection
├── models/swipe_encoder_{lang}.onnx    # Neural encoder
├── models/swipe_decoder_{lang}.onnx    # Neural decoder
├── layouts/{lang}_*.xml                # Keyboard layouts
└── metadata.json                       # Version, license info

Dictionary Layers

┌─────────────────────────────────────────────────────────────┐
│                     SuggestionRanker                         │
│  Merges candidates from all layers with unified scoring      │
└─────────────────────────────────────────────────────────────┘
                            │
       ┌────────────────────┼────────────────────┐
       │                    │                    │
       ▼                    ▼                    ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Layer 1    │    │   Layer 2    │    │   Layer 3    │
│ Main Dict    │    │ User Dict    │    │ Custom Dict  │
│ (Read-Only)  │    │ (System)     │    │ (App-Local)  │
│              │    │              │    │              │
│ Language     │    │ Android      │    │ SharedPrefs  │
│ Pack Asset   │    │UserDictionary│    │ user_dict    │
└──────────────┘    └──────────────┘    └──────────────┘
       │                    │                    │
       └────────────────────┼────────────────────┘
                            │
                            ▼
                   ┌──────────────┐
                   │   Layer 4    │
                   │ Disabled     │
                   │ Words Filter │
                   └──────────────┘

Resolution Logic:

Implementation Details

Accent Handling

The neural model uses a 26-letter vocabulary (a-z only). Accented words are handled through normalization:

// Accent mapping: normalized → canonical forms
data class AccentMapping(
    val normalized: String,         // "cafe"
    val canonicalForms: List<String>, // ["café"]
    val frequencies: List<Int>
)

// Lookup flow:
// 1. User swipes "café" → trajectory matches "cafe" pattern
// 2. Prefix lookup: "caf" → finds normalized candidates
// 3. Each candidate maps to its accented canonical form
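
The normalization step can be sketched with the JDK's `java.text.Normalizer`. This is a minimal illustration, not the project's AccentNormalizer: the function names `normalizeAccents` and `buildAccentIndex` are hypothetical, and lowercasing before normalization is an assumption.

```kotlin
import java.text.Normalizer

// Matches Unicode nonspacing marks such as U+0301 (combining acute accent).
val COMBINING_MARKS = Regex("\\p{Mn}+")

// Hypothetical helper: maps a canonical word to its a-z lookup key.
fun normalizeAccents(word: String): String {
    // NFD splits "é" into "e" followed by a combining mark.
    val decomposed = Normalizer.normalize(word.lowercase(), Normalizer.Form.NFD)
    // Dropping the marks leaves only the base letters.
    return COMBINING_MARKS.replace(decomposed, "")
}

// Hypothetical helper: groups canonical forms under their normalized key,
// which is the shape AccentMapping describes (normalized -> canonical forms).
fun buildAccentIndex(words: List<String>): Map<String, List<String>> =
    words.groupBy(::normalizeAccents)
```

With this sketch, `normalizeAccents("café")` yields `"cafe"`. Letters with no canonical decomposition, such as "ø" or "ß", pass through unchanged and would need an explicit mapping to fit a 26-letter vocabulary.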

Binary Dictionary Format (v2)

┌────────────────────────────────────────┐
│ HEADER (32 bytes)                      │
│  - Magic: "CKDICT" (6 bytes)           │
│  - Version: 2 (2 bytes)                │
│  - Language: "es" (4 bytes)            │
│  - Word Count (4 bytes)                │
│  - Trie Offset (4 bytes)               │
│  - Metadata Offset (4 bytes)           │
│  - Accent Map Offset (4 bytes)         │
├────────────────────────────────────────┤
│ TRIE DATA BLOCK                        │
│  - Compact trie of NORMALIZED words    │
│  - Terminal nodes store word_id        │
├────────────────────────────────────────┤
│ WORD METADATA BLOCK                    │
│  - Array indexed by word_id:           │
│    - Canonical string (UTF-8, varint)  │
│    - Frequency rank (UInt8, 0-255)     │
├────────────────────────────────────────┤
│ ACCENT MAP BLOCK (optional)            │
│  - normalized_word → [canonical_ids]   │
└────────────────────────────────────────┘
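
A header reader for this layout might look like the sketch below. Several details are assumptions rather than part of the spec: the byte order (little-endian here), NUL padding of the language code, and the `DictHeader` and `parseHeader` names. The listed fields add up to 28 bytes, so the sketch treats the last 4 bytes of the 32-byte header as unspecified and ignores them.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Illustrative type; field names mirror the header diagram.
data class DictHeader(
    val version: Int,
    val language: String,
    val wordCount: Int,
    val trieOffset: Int,
    val metadataOffset: Int,
    val accentMapOffset: Int
)

const val HEADER_SIZE = 32

fun parseHeader(bytes: ByteArray): DictHeader {
    require(bytes.size >= HEADER_SIZE) { "Header needs $HEADER_SIZE bytes, got ${bytes.size}" }
    // ASSUMPTION: little-endian; the format description does not state byte order.
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)

    val magic = ByteArray(6).also { buf.get(it) }
    require(String(magic, Charsets.US_ASCII) == "CKDICT") { "Bad magic" }

    // Read the 2-byte version as an unsigned value.
    val version = buf.short.toInt() and 0xFFFF

    // ASSUMPTION: the 4-byte language field is NUL-padded ASCII.
    val langBytes = ByteArray(4).also { buf.get(it) }
    val language = String(langBytes, Charsets.US_ASCII).trimEnd('\u0000')

    // Arguments are evaluated in order, so the reads follow the on-disk layout.
    return DictHeader(
        version = version,
        language = language,
        wordCount = buf.int,
        trieOffset = buf.int,
        metadataOffset = buf.int,
        accentMapOffset = buf.int
        // Bytes 28..31 are not described in the diagram and are left unread.
    )
}
```

Validating the magic and version before touching the offsets lets a loader reject an incompatible file early instead of reading garbage from the trie block.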

Frequency Ranking:

Language Detection

Word-based unigram frequency model:

data class LanguageScore(
    val language: String,
    var score: Float = 0f,
    var consecutiveHits: Int = 0
)

// Detection algorithm:
// 1. Each language pack ships top 1000 unigrams
// 2. Maintain sliding window of last 5 committed words
// 3. Score each word against active language unigram lists
// 4. Track running score per language (exponentially decaying)

fun shouldSwitch(primary: LanguageScore, candidate: LanguageScore): Boolean {
    // Conservative threshold to prevent jitter
    return candidate.score > primary.score * 2.0f &&
           candidate.consecutiveHits >= 2
}
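
One way to combine the sliding window with an exponentially decaying score is sketched below. The class name `SlidingWindowDetector`, the decay factor of 0.8, and the rule that a unigram hit contributes a weight of 1.0 are all assumptions made for illustration; the actual detector may weight words differently.

```kotlin
// Hypothetical detector: keeps the last N committed words and scores each
// language by its unigram hits, weighting recent words more heavily.
class SlidingWindowDetector(
    private val unigrams: Map<String, Set<String>>, // language -> frequent words
    private val windowSize: Int = 5,
    private val decay: Float = 0.8f                 // ASSUMPTION: not from the spec
) {
    private val window = ArrayDeque<String>()

    fun onWordCommitted(word: String) {
        window.addLast(word.lowercase())
        // Drop the oldest word once the window is full.
        if (window.size > windowSize) window.removeFirst()
    }

    fun score(lang: String): Float {
        val words = unigrams[lang] ?: return 0f
        var total = 0f
        var weight = 1f
        // Walk from newest to oldest; each step back multiplies the weight by decay.
        for (w in window.asReversed()) {
            if (w in words) total += weight
            weight *= decay
        }
        return total
    }

    // Number of most recent words, counted back from the newest, that all hit.
    fun consecutiveHits(lang: String): Int {
        val words = unigrams[lang] ?: return 0
        return window.asReversed().takeWhile { it in words }.size
    }

    fun recentWords(): List<String> = window.toList()
}
```

The values from `score` and `consecutiveHits` could populate a `LanguageScore` for each language before calling `shouldSwitch`. Because the score is recomputed from the window, a single stray foreign word fades out after five commits and never reaches the two-hit requirement on its own.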

Dual-Dictionary Mode

For bilingual typing (e.g., English + Spanish) without manual switching:

fun calculateUnifiedScore(
    word: String,
    nnConfidence: Float,      // From ONNX model
    dictionaryRank: Int,      // 0-255, lower = more common
    languageContext: Float,   // 0.0-1.0, from detector
    isPrimaryLang: Boolean
): Float {
    val rankScore = 1.0f - (dictionaryRank / 255f)
    val langMultiplier = if (isPrimaryLang) 1.0f else languageContext
    val secondaryPenalty = if (isPrimaryLang) 1.0f else 0.9f

    return nnConfidence * rankScore * langMultiplier * secondaryPenalty
}

Deduplication: When a word exists in both dictionaries, only the entry with the higher final score is presented.
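
The deduplication rule can be sketched as a merge over already-scored candidates. `ScoredCandidate` and `mergeCandidates` are hypothetical names, and the tie-breaking behaviour (the primary entry wins on equal scores) is an assumption of this sketch.

```kotlin
// Hypothetical candidate type carrying the unified score for one word.
data class ScoredCandidate(val word: String, val language: String, val score: Float)

fun mergeCandidates(
    primary: List<ScoredCandidate>,
    secondary: List<ScoredCandidate>
): List<ScoredCandidate> =
    (primary + secondary)
        .groupBy { it.word }
        // Keep the best-scoring entry per word. maxByOrNull returns the first
        // maximum, so the primary entry wins ties because it is listed first.
        .map { (_, entries) -> entries.maxByOrNull { it.score }!! }
        .sortedByDescending { it.score }
```

For example, if "no" scores 0.60 in the English dictionary and 0.45 in the Spanish one, only the English entry appears in the merged suggestion list.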

Manual Language Switching

// Triggered by Globe key or long-press Spacebar
fun switchLanguage(newLang: String) {
    // 1. Hot-swap MainDictionarySource
    dictionaryManager.loadMainDictionary(newLang)

    // 2. Load corresponding ONNX models
    multiLanguageManager.loadModels(newLang)

    // 3. Update keyboard layout if linked
    keyboardManager.switchLayout(newLang)
}

Data Structures

data class DictionaryWord(
    val canonical: String,      // Display form with accents
    val normalized: String,     // Lookup key without accents
    val frequencyRank: Int,     // 0-255, lower = more common
    val source: WordSource,     // MAIN, USER, CUSTOM, SECONDARY
    var enabled: Boolean = true
)

enum class WordSource {
    MAIN,       // Primary language pack
    SECONDARY,  // Secondary language pack
    USER,       // Android UserDictionary
    CUSTOM      // App SharedPreferences
}

data class LanguageState(
    val primary: String,
    val secondary: String?,
    val detectedContext: String,
    val confidence: Float
)

Key Classes

| Class | Purpose |
|-------|---------|
| DictionaryManager | Singleton holding active WordPredictor instances, handles language lifecycle |
| MultiLanguageManager | Manages ONNX sessions, handles auto-detection logic |
| SuggestionRanker | Merges candidates from multiple dictionaries with unified scoring |
| UnigramLanguageDetector | Word-based language detection using frequency lists |
| AccentNormalizer | Unicode normalization (NFD) + accent stripping |

Performance Considerations

| Aspect | Strategy |
|--------|----------|
| Trie lookups | O(L) where L = key length, <5ms |
| Memory mapping | MappedByteBuffer for large dictionaries |
| Async loading | Language switching on Dispatchers.IO |
| Lazy init | Secondary dictionary loaded only when enabled |
| Memory cleanup | Inactive ONNX sessions unloaded after 60s |
| Unigram cache | ~100KB per language kept in memory |
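
The memory-mapping strategy can be illustrated with the standard `FileChannel.map` call. This sketch assumes the dictionary is available as a regular file on disk; `mapDictionary` is a hypothetical helper, and loading directly from packaged assets would need a different entry point.

```kotlin
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.Path
import java.nio.file.StandardOpenOption

// Hypothetical helper: maps a dictionary file read-only without copying it
// onto the heap, so pages are loaded only when they are touched.
fun mapDictionary(path: Path): MappedByteBuffer =
    FileChannel.open(path, StandardOpenOption.READ).use { channel ->
        // The mapping remains valid after the channel is closed.
        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
    }
```

Trie traversal can then read nodes from the returned buffer by absolute offset, which keeps resident memory proportional to the parts of the dictionary actually visited.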