LM-STUDIO-SETUP.md 7.2 KB

LM Studio Optimal Settings for OpenCode

This guide provides battle-tested configurations for running Qwen3-Coder and GPT-OSS-20B with OpenCode via LM Studio.

Quick Start

  1. Copy the configuration from lmstudio-config-example.json to your opencode.json
  2. Download your models in LM Studio
  3. Start the LM Studio server on port 1234
  4. Launch OpenCode and use /models to select your local model

Model-Specific Settings

Qwen3-Coder-30B (Recommended Primary Model)

Best for: Precise tool calling, code generation, debugging

{
  "limit": {
    "context": 24000,
    "output": 4000
  },
  "options": {
    "temperature": 0.1,
    "topP": 0.8,
    "minP": 0.01,
    "repetitionPenalty": 1.05
  }
}

Why these settings:

  • Temperature 0.1: Maximizes deterministic tool calling reliability. Use 0.2-0.3 for more creative exploration.
  • Top-P 0.8: Constrains token diversity appropriately for coding tasks
  • Min-P 0.01: Lower than llama.cpp default (0.1) for better tool use
  • Repetition Penalty 1.05: Prevents infinite loops during multi-step tool calls
  • Context 24000: Handles large codebases without frequent compaction
  • Output 4000: Sufficient for most code generation tasks

GPT-OSS-20B (Alternative/Backup Model)

Best for: General coding, conversation, when you need higher creativity

{
  "limit": {
    "context": 16000,
    "output": 4000
  },
  "options": {
    "temperature": 0.4,
    "topP": 0.9,
    "minP": 0.05,
    "repetitionPenalty": 1.05
  }
}

Why these settings:

  • Temperature 0.4: Higher than Qwen3 due to different architecture - still reliable for tools
  • Top-P 0.9: More diversity for MoE (Mixture of Experts) architecture
  • Min-P 0.05: Slightly higher for better creative balance
  • Repetition Penalty 1.05: Same as Qwen3 for loop prevention
  • Context 16000: Sufficient for most tasks, adjust based on VRAM

LM Studio Application Settings

GPU Acceleration (Critical)

In LM Studio Settings → Hardware:

  1. GPU Offload Layers: Set to MAXIMUM your GPU can handle

    • RTX 4060 8GB: 36 layers
    • RTX 4070 12GB: 40 layers
    • RTX 4090 24GB: All layers
    • Mac M1/M2/M3: All layers (MLX preferred)
  2. Keep Model in VRAM: ✅ Enable

  3. Offload KV Cache to GPU: ✅ Enable (4x speedup on compatible hardware)

Context Settings

  • Context Length: Match or exceed your config (24000 for Qwen3, 16000 for GPT-OSS)
  • Batch Size: 512 (default) or higher if VRAM allows
  • Threads: Set to CPU cores - 2 (e.g., 14 threads for 16-core CPU)

Speculative Decoding (Advanced)

For 30B+ models, enable speculative decoding:

  • Draft Model: Use a small 1-3B model from the same family
  • Speedup: 1.5x-3x without quality loss

OpenCode Integration

Full Configuration Example

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "lmstudio": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "LM Studio (Local)",
      "options": {
        "baseURL": "http://127.0.0.1:1234/v1"
      },
      "models": {
        "qwen3-coder-30b": {
          "name": "Qwen3-Coder-30B (Local)",
          "tools": true,
          "limit": {
            "context": 24000,
            "output": 4000
          },
          "options": {
            "temperature": 0.1,
            "topP": 0.8,
            "minP": 0.01,
            "repetitionPenalty": 1.05
          }
        },
        "gpt-oss-20b": {
          "name": "GPT-OSS-20B (Local)",
          "tools": true,
          "limit": {
            "context": 16000,
            "output": 4000
          },
          "options": {
            "temperature": 0.4,
            "topP": 0.9,
            "minP": 0.05,
            "repetitionPenalty": 1.05
          }
        }
      }
    }
  },
  "model": "lmstudio/qwen3-coder-30b",
  "agents": {
    "build": {
      "mode": "primary",
      "description": "Main build agent"
    }
  }
}

Switching Models

Use the /models command in OpenCode to switch between your configured models without restarting.


Troubleshooting

Tool Calls Not Working

  1. Increase context window in LM Studio to 16k-32k minimum
  2. Verify temperature is set correctly (0.1 for Qwen3, 0.4 for GPT-OSS)
  3. Check repetition penalty is set to 1.05
  4. Restart LM Studio server after changing settings

Slow Performance

  1. Maximize GPU layers - check LM Studio logs for "offloaded X/Y layers"
  2. Enable KV cache offload in GPU settings
  3. Reduce context length if hitting VRAM limits
  4. Try speculative decoding with a draft model

Out of Memory

  1. Reduce context length: 16000 → 12000 → 8000
  2. Reduce GPU layers: Start at 50% and increase
  3. Switch to smaller quantization: Q6 → Q5 → Q4
  4. Close other applications using VRAM

Model Repeating Itself

  1. Increase repetition penalty: 1.05 → 1.10 → 1.15
  2. Lower temperature slightly: 0.1 → 0.05
  3. Check min-P setting: Should be 0.01-0.05

Hardware Recommendations

Minimum Specs (Qwen3-Coder-30B)

  • GPU: 12GB VRAM (RTX 4070, RTX 3080 12GB)
  • RAM: 16GB system RAM
  • Quantization: Q4_K_M or Q5_K_M

Recommended Specs (Qwen3-Coder-30B)

  • GPU: 16-24GB VRAM (RTX 4080, RTX 4090)
  • RAM: 32GB system RAM
  • Quantization: Q6_K or Q8

Minimum Specs (GPT-OSS-20B)

  • GPU: 8GB VRAM (RTX 4060)
  • RAM: 16GB system RAM
  • Quantization: Q4_K_M

Mac Users

  • MLX versions strongly recommended over GGUF
  • Significantly faster on Apple Silicon
  • Use LM Studio's MLX support or native MLX inference
  • M1/M2/M3 with 16GB+ unified memory works well

Performance Expectations

Qwen3-Coder-30B (Q5_K_M on RTX 4080)

  • Tokens/second: 15-25 t/s
  • Context loading: 2-3 seconds
  • Tool call reliability: 95%+

GPT-OSS-20B (Q5_K_M on RTX 4060)

  • Tokens/second: 20-30 t/s
  • Context loading: 1-2 seconds
  • Tool call reliability: 90%+

Settings Comparison Table

Setting Qwen3-Coder GPT-OSS-20B Reasoning
Temperature 0.1 0.4 Qwen3 needs lower for tool calling
Top-P 0.8 0.9 MoE models benefit from more diversity
Min-P 0.01 0.05 Lower for deterministic tool use
Repetition Penalty 1.05 1.05 Prevents loops in both
Context 24000 16000 Qwen3 handles larger contexts better
Output 4000 4000 Standard for code generation

When to Adjust Settings

For More Creativity

  • Increase temperature: 0.1 → 0.3 (Qwen3) or 0.4 → 0.6 (GPT-OSS)
  • Increase top-P: 0.8 → 0.9 or 0.9 → 0.95

For More Precision

  • Decrease temperature: 0.1 → 0.05 (careful: may reduce quality)
  • Decrease top-P: 0.8 → 0.7

For Handling Repetition

  • Increase repetition penalty: 1.05 → 1.10
  • Add frequency penalty: 0 → 0.3

Notes

  • These settings are optimized for tool calling reliability with OpenCode
  • Raw performance benchmarks show Ollama may be faster, but tool calling is unreliable
  • LM Studio's proper parameter handling makes it the recommended choice for OpenCode
  • Settings can be adjusted per-use-case, but these defaults work for 90% of coding tasks