Social | Source: Reddit LocalLLaMA | June 9, 2026 | Importance: 4/5

Omi Med STT v1: Fine-Tuned Parakeet 0.6B for Medical ASR Released with Open Weights and Local Runtime

English summary

The founder of Omi Health has released Omi Med STT v1, an open-weight (CC-BY-4.0) fine-tune of NVIDIA Parakeet TDT 0.6B v2 specialized for medical speech. It ships with a local runtime that auto-selects a backend: MLX on Apple Silicon, NeMo on CUDA, and GGUF on CPU. On a held-out benchmark of 1,513 medical clips (7.18 hours), the model reaches a medical word error rate (M-WER) of 2.37% and an overall WER of 8.30% while running at 145× realtime on an A10, significantly outperforming the base model and most open local ASR options. Among the local models compared, it trails only VibeVoice-ASR 9B on M-WER and beats that model on overall WER and speed. Cloud medical transcription services such as ElevenLabs Scribe v2 (M-WER 1.39%) and AssemblyAI (1.81%) still score lower M-WER, but Omi Med STT v1 comes close while keeping the structural latency advantage of on-device processing. Training used 127 hours of audio (71% real, 29% synthetic), and the benchmark set was confirmed to have zero overlap with the training data. The main weakness is drug-name accuracy (4.75% drug WER), which is targeted for improvement in v2.
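
To make the headline metrics concrete, here is a minimal Python sketch of standard word-level WER and of what 145× realtime means for the 7.18-hour benchmark. It computes plain WER only; how the post defines M-WER and drug WER is not stated in this summary, and the function names and the example sentence are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def processing_seconds(audio_hours: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to transcribe audio at a given realtime multiple."""
    return audio_hours * 3600 / realtime_factor


# Illustrative example (not from the benchmark): a split drug name counts as
# one substitution plus one insertion against a five-word reference.
example_wer = word_error_rate(
    "patient takes metformin twice daily",
    "patient takes met forming twice daily",
)
benchmark_seconds = processing_seconds(7.18, 145)
```

At 145× realtime, the full 7.18-hour benchmark works out to roughly 178 seconds of compute, about three minutes.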

Key points

Open-weight medical ASR model (Omi Med STT v1) released under CC-BY-4.0, fine-tuned from NVIDIA Parakeet TDT 0.6B v2.
The local runtime installs via pip and auto-selects a backend: MLX on Apple Silicon Macs, NeMo on CUDA, and GGUF on CPU.
Benchmark on 1,513 held-out medical clips (7.18 hours): M-WER 2.37%, overall WER 8.30%, drug WER 4.75%, and 145× realtime speed on an A10.
Outperforms the base model and most local open models; comes close to cloud services such as AssemblyAI (M-WER 1.81%) and ElevenLabs Scribe v2 (1.39%) on M-WER while adding the privacy of on-device processing.
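
The backend auto-selection described above could look roughly like the following sketch. This is not the Omi runtime's actual logic, which the post summary does not show; `pick_backend`, `detect_backend`, and the `nvidia-smi` check are hypothetical, and only the mapping (MLX on Apple Silicon, NeMo on CUDA, GGUF on CPU) comes from the source.

```python
import platform
import shutil


def pick_backend(system: str, machine: str, has_cuda: bool) -> str:
    """Map a host description to one of the three backends named in the post."""
    if system == "Darwin" and machine == "arm64":
        return "mlx"    # Apple Silicon
    if has_cuda:
        return "nemo"   # NVIDIA GPU
    return "gguf"       # CPU fallback


def detect_backend() -> str:
    """Describe the current host using only the standard library."""
    # Heuristic assumption: an nvidia-smi binary on PATH suggests a CUDA host.
    has_cuda = shutil.which("nvidia-smi") is not None
    return pick_backend(platform.system(), platform.machine(), has_cuda)


selected = detect_backend()
```

Keeping the decision in a pure function (`pick_backend`) separate from host probing makes the mapping easy to test without access to each kind of hardware.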
