Omi Med STT v1: Fine-Tuned Parakeet 0.6B for Medical ASR Released with Open Weights and Local Runtime
English summary
The founder of Omi Health has released Omi Med STT v1, an open-weight (CC-BY-4.0) fine-tune of NVIDIA Parakeet TDT 0.6B v2 specialized for medical speech, together with a local runtime that auto-selects a backend (MLX on Apple Silicon, NeMo on CUDA, GGUF on CPU). On a held-out benchmark of 1,513 medical clips (7.18 hours), it reaches a medical word error rate (M-WER) of 2.37% and an overall WER of 8.30% while running at 145× realtime on an A10, ahead of the base model and most open local ASR options. Among the open local models compared, it trails only VibeVoice-ASR 9B on M-WER and beats it on WER and speed. Cloud medical transcription services still score a lower M-WER (ElevenLabs Scribe v2 at 1.39%, AssemblyAI at 1.81%), but Omi Med STT v1 comes close while keeping the structural latency advantage of on-device processing. Training used 127 hours of audio (71% real, 29% synthetic), and the benchmark set was confirmed to have zero overlap with the training data. The main weakness is drug-name accuracy (4.75% drug WER), which is targeted for improvement in v2.
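To make the headline metrics concrete, here is a minimal sketch of how a word error rate and a realtime speed figure are typically computed. It uses the standard definition of WER (word-level edit distance divided by reference length); the summary does not say how M-WER or drug WER are scored, so those are not reproduced here. The function names and the lowercase, whitespace-based normalization are assumptions for illustration, not the benchmark's actual scoring code.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Standard WER: edit distance over the number of reference words.

    Assumption: simple lowercase + whitespace tokenization; real benchmarks
    usually apply a more careful text normalizer first.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def processing_seconds(audio_hours: float, realtime_factor: float) -> float:
    """Wall-clock seconds needed to transcribe audio at a given realtime factor."""
    return audio_hours * 3600 / realtime_factor


if __name__ == "__main__":
    # A split drug name counts as one substitution plus one insertion.
    print(wer("patient takes metformin daily",
              "patient takes met forming daily"))
    # The full 7.18-hour benchmark at 145x realtime.
    print(round(processing_seconds(7.18, 145), 1))
```

At 145× realtime, the whole 7.18-hour benchmark corresponds to roughly three minutes of compute on the A10.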
Key points
Open-weight medical ASR model (Omi Med STT v1) released under CC-BY-4.0, fine-tuned from NVIDIA Parakeet TDT 0.6B v2.
Local runtime installs via pip and runs on Apple Silicon Macs (MLX), CUDA GPUs (NeMo), and CPU (GGUF), selecting the backend automatically.
Benchmark on 7.18 hours of held-out medical audio: M-WER 2.37%, overall WER 8.30%, drug WER 4.75%, at 145× realtime on an A10.
Outperforms the base model and most open local models; its M-WER (2.37%) approaches that of cloud services such as AssemblyAI (1.81%) and ElevenLabs Scribe v2 (1.39%), with the added privacy of on-device processing.
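The automatic backend selection mentioned in the key points could work roughly as sketched below. This is not the runtime's actual code: the function names, the returned labels, and the priority order (Apple Silicon first, then CUDA, then CPU) are assumptions based only on the mapping given in the summary. The CUDA check relies on PyTorch's `torch.cuda.is_available()` and falls back to CPU if PyTorch is not installed.

```python
import importlib.util
import platform


def select_backend(system: str, machine: str, has_cuda: bool) -> str:
    """Map a platform description to a backend label (hypothetical labels)."""
    if system == "Darwin" and machine == "arm64":
        return "mlx"   # Apple Silicon
    if has_cuda:
        return "nemo"  # NVIDIA GPU with CUDA
    return "gguf"      # CPU fallback


def cuda_available() -> bool:
    """Best-effort CUDA check; False when PyTorch is missing or fails to import."""
    if importlib.util.find_spec("torch") is None:
        return False
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def detect_backend() -> str:
    """Inspect the current machine and pick a backend."""
    return select_backend(platform.system(), platform.machine(), cuda_available())


if __name__ == "__main__":
    print(detect_backend())
```

Keeping `select_backend` free of side effects makes the decision logic easy to test without the corresponding hardware; only `detect_backend` touches the real environment.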