# LM Studio Optimal Settings for OpenCode

This guide provides battle-tested configurations for running Qwen3-Coder and GPT-OSS-20B with OpenCode via LM Studio.

## Quick Start

1. Copy the configuration from `lmstudio-config-example.json` to your `opencode.json`
2. Download your models in LM Studio
3. Start the LM Studio server on port 1234
4. Launch OpenCode and use `/models` to select your local model

---

## Model-Specific Settings

### **Qwen3-Coder-30B** (Recommended Primary Model)

Best for: Precise tool calling, code generation, debugging

```json
{
  "limit": {
    "context": 24000,
    "output": 4000
  },
  "options": {
    "temperature": 0.1,
    "topP": 0.8,
    "minP": 0.01,
    "repetitionPenalty": 1.05
  }
}
```

**Why these settings:**

- **Temperature 0.1**: Keeps tool calls as deterministic and reliable as possible. Use 0.2-0.3 for more creative exploration.
- **Top-P 0.8**: Constrains token diversity appropriately for coding tasks
- **Min-P 0.01**: Lower than llama.cpp default (0.1) for better tool use
- **Repetition Penalty 1.05**: Prevents infinite loops during multi-step tool calls
- **Context 24000**: Handles large codebases without frequent compaction
- **Output 4000**: Sufficient for most code generation tasks

### **GPT-OSS-20B** (Alternative/Backup Model)

Best for: General coding, conversation, when you need higher creativity

```json
{
  "limit": {
    "context": 16000,
    "output": 4000
  },
  "options": {
    "temperature": 0.4,
    "topP": 0.9,
    "minP": 0.05,
    "repetitionPenalty": 1.05
  }
}
```

**Why these settings:**

- **Temperature 0.4**: Higher than Qwen3 due to different architecture - still reliable for tools
- **Top-P 0.9**: Allows more token diversity than the Qwen3 setting. Both models use a MoE (Mixture of Experts) architecture, so this is a tuning choice rather than an architectural requirement
- **Min-P 0.05**: Slightly higher for better creative balance
- **Repetition Penalty 1.05**: Same as Qwen3 for loop prevention
- **Context 16000**: Sufficient for most tasks, adjust based on VRAM

---

## LM Studio Application Settings

### **GPU Acceleration (Critical)**

In LM Studio Settings → Hardware:

1. **GPU Offload Layers**: Set to the maximum your GPU can handle
   - RTX 4060 8GB: 36 layers
   - RTX 4070 12GB: 40 layers
   - RTX 4090 24GB: All layers
   - Mac M1/M2/M3: All layers (MLX preferred)
2. **Keep Model in VRAM**: ✅ Enable
3. **Offload KV Cache to GPU**: ✅ Enable (up to 4x speedup on compatible hardware)

### **Context Settings**

- **Context Length**: Match or exceed your config (24000 for Qwen3, 16000 for GPT-OSS)
- **Batch Size**: 512 (default) or higher if VRAM allows
- **Threads**: Set to CPU cores minus 2 (e.g., 14 threads for a 16-core CPU)

### **Speculative Decoding** (Advanced)

For 30B+ models, enable speculative decoding:

- **Draft Model**: Use a small 1-3B model from the same family
- **Speedup**: 1.5x-3x without quality loss

---

## OpenCode Integration

### Full Configuration Example

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "lmstudio": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "LM Studio (Local)",
      "options": {
        "baseURL": "http://127.0.0.1:1234/v1"
      },
      "models": {
        "qwen3-coder-30b": {
          "name": "Qwen3-Coder-30B (Local)",
          "tools": true,
          "limit": {
            "context": 24000,
            "output": 4000
          },
          "options": {
            "temperature": 0.1,
            "topP": 0.8,
            "minP": 0.01,
            "repetitionPenalty": 1.05
          }
        },
        "gpt-oss-20b": {
          "name": "GPT-OSS-20B (Local)",
          "tools": true,
          "limit": {
            "context": 16000,
            "output": 4000
          },
          "options": {
            "temperature": 0.4,
            "topP": 0.9,
            "minP": 0.05,
            "repetitionPenalty": 1.05
          }
        }
      }
    }
  },
  "model": "lmstudio/qwen3-coder-30b",
  "agents": {
    "build": {
      "mode": "primary",
      "description": "Main build agent"
    }
  }
}
```

### Switching Models

Use the `/models` command in OpenCode to switch between your configured models without restarting.

---

## Troubleshooting

### Tool Calls Not Working

1. **Increase context window** in LM Studio to at least 16k-32k
2. **Verify temperature** is set correctly (0.1 for Qwen3, 0.4 for GPT-OSS)
3. **Check repetition penalty** is set to 1.05
4. **Restart LM Studio server** after changing settings

### Slow Performance

1. **Maximize GPU layers** - check LM Studio logs for "offloaded X/Y layers"
2. **Enable KV cache offload** in GPU settings
3. **Reduce context length** if hitting VRAM limits
4. **Try speculative decoding** with a draft model

### Out of Memory

1. **Reduce context length**: 16000 → 12000 → 8000
2. **Reduce GPU layers**: Start at 50% and increase
3. **Switch to smaller quantization**: Q6 → Q5 → Q4
4. **Close other applications** using VRAM

### Model Repeating Itself

1. **Increase repetition penalty**: 1.05 → 1.10 → 1.15
2. **Lower temperature slightly**: 0.1 → 0.05
3. **Check min-P setting**: Should be 0.01-0.05

---

## Hardware Recommendations

### Minimum Specs (Qwen3-Coder-30B)

- **GPU**: 12GB VRAM (RTX 4070, RTX 3080 12GB)
- **RAM**: 16GB system RAM
- **Quantization**: Q4_K_M or Q5_K_M

### Recommended Specs (Qwen3-Coder-30B)

- **GPU**: 16-24GB VRAM (RTX 4080, RTX 4090)
- **RAM**: 32GB system RAM
- **Quantization**: Q6_K or Q8

### Minimum Specs (GPT-OSS-20B)

- **GPU**: 8GB VRAM (RTX 4060)
- **RAM**: 16GB system RAM
- **Quantization**: Q4_K_M

### Mac Users

- **MLX versions strongly recommended** over GGUF
- Significantly faster on Apple Silicon
- Use LM Studio's MLX support or native MLX inference
- M1/M2/M3 with 16GB+ unified memory works well

---

## Performance Expectations

### Qwen3-Coder-30B (Q5_K_M on RTX 4080)

- **Tokens/second**: 15-25 t/s
- **Context loading**: 2-3 seconds
- **Tool call reliability**: 95%+

### GPT-OSS-20B (Q5_K_M on RTX 4060)

- **Tokens/second**: 20-30 t/s
- **Context loading**: 1-2 seconds
- **Tool call reliability**: 90%+

---

## Settings Comparison Table

| Setting | Qwen3-Coder | GPT-OSS-20B | Reasoning |
|---------|-------------|-------------|-----------|
| **Temperature** | 0.1 | 0.4 | Qwen3 needs lower for tool calling |
| **Top-P** | 0.8 | 0.9 | GPT-OSS is tuned for more diversity |
| **Min-P** | 0.01 | 0.05 | Lower for deterministic tool use |
| **Repetition Penalty** | 1.05 | 1.05 | Prevents loops in both |
| **Context** | 24000 | 16000 | Qwen3 handles larger contexts better |
| **Output** | 4000 | 4000 | Standard for code generation |

---

## When to Adjust Settings

### For More Creativity

- Increase temperature: 0.1 → 0.3 (Qwen3) or 0.4 → 0.6 (GPT-OSS)
- Increase top-P: 0.8 → 0.9 or 0.9 → 0.95

### For More Precision

- Decrease temperature: 0.1 → 0.05 (careful: may reduce quality)
- Decrease top-P: 0.8 → 0.7

### For Handling Repetition

- Increase repetition penalty: 1.05 → 1.10
- Add frequency penalty: 0 → 0.3

---

## Notes

- These settings are optimized for **tool calling reliability** with OpenCode
- Raw performance benchmarks show Ollama may be faster, but **its tool calling is unreliable**
- LM Studio's proper parameter handling makes it the recommended choice for OpenCode
- Settings can be adjusted per use case, but these defaults work for most coding tasks
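
The per-model values in the Settings Comparison Table appear twice in this guide: once per model section and again in the Full Configuration Example. One way to avoid the two drifting apart is to keep the values in a single table and render the `provider` block from it. The following is a minimal Python sketch under that assumption. The names `MODEL_SETTINGS`, `SAMPLING_KEYS`, `build_model_entry`, and `build_config` are hypothetical helpers introduced for illustration and are not part of OpenCode or LM Studio. The `agents` block from the full example is left out for brevity.

```python
import json

# Values copied from the Settings Comparison Table in this guide.
MODEL_SETTINGS = {
    "qwen3-coder-30b": {
        "name": "Qwen3-Coder-30B (Local)",
        "context": 24000,
        "output": 4000,
        "temperature": 0.1,
        "topP": 0.8,
        "minP": 0.01,
        "repetitionPenalty": 1.05,
    },
    "gpt-oss-20b": {
        "name": "GPT-OSS-20B (Local)",
        "context": 16000,
        "output": 4000,
        "temperature": 0.4,
        "topP": 0.9,
        "minP": 0.05,
        "repetitionPenalty": 1.05,
    },
}

# Keys that belong under "options" in each model entry.
SAMPLING_KEYS = ("temperature", "topP", "minP", "repetitionPenalty")


def build_model_entry(settings):
    """Turn one flat settings row into the nested shape used in this guide."""
    return {
        "name": settings["name"],
        "tools": True,
        "limit": {"context": settings["context"], "output": settings["output"]},
        "options": {key: settings[key] for key in SAMPLING_KEYS},
    }


def build_config(default_model="qwen3-coder-30b",
                 base_url="http://127.0.0.1:1234/v1"):
    """Assemble the provider section shown in the Full Configuration Example."""
    if default_model not in MODEL_SETTINGS:
        raise ValueError(f"unknown model: {default_model}")
    return {
        "$schema": "https://opencode.ai/config.json",
        "provider": {
            "lmstudio": {
                "npm": "@ai-sdk/openai-compatible",
                "name": "LM Studio (Local)",
                "options": {"baseURL": base_url},
                "models": {
                    model_id: build_model_entry(row)
                    for model_id, row in MODEL_SETTINGS.items()
                },
            }
        },
        "model": f"lmstudio/{default_model}",
    }


if __name__ == "__main__":
    # Print the JSON so it can be reviewed before being saved as opencode.json.
    print(json.dumps(build_config(), indent=2))
```

Running the script prints the generated JSON for review. Calling `build_config(default_model="gpt-oss-20b")` makes GPT-OSS-20B the default model instead, and changing a value in `MODEL_SETTINGS` updates every place it is used.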