Papers | Source: arXiv | June 12, 2026 | Importance: 4/5

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

English summary

The paper introduces SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, comprising 31 datasets across 7 task types. An evaluation of 31 embedding models shows that large instruction-tuned multilingual models perform best, while existing Slovak-specific NLU models transfer poorly to embedding tasks. The authors build e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by trimming the vocabulary of Multilingual E5 models and then fine-tuning them. Despite size reductions of up to 62%, these open-source models perform competitively with proprietary APIs and are suitable for local deployment in semantic search and RAG. The benchmark, models, datasets, and code are released openly, offering a replicable path for other under-resourced languages.
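
The summary names vocabulary trimming but does not describe the procedure. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below keeps the embedding rows for special tokens and for tokens that occur in a target-language corpus, then renumbers the surviving tokens. The function name `trim_vocabulary`, the `min_count` threshold, and the toy vocabulary and vectors are all assumptions made for this example.

```python
from collections import Counter


def trim_vocabulary(vocab, embeddings, corpus_tokens, special_tokens, min_count=1):
    """Drop embedding rows for tokens that the target-language corpus never uses.

    vocab:          dict mapping token -> row index in `embeddings`
    embeddings:     list of embedding rows (each a list of floats)
    corpus_tokens:  iterable of tokens observed in the target-language corpus
    special_tokens: set of tokens that are always kept (padding, unknown, ...)
    Returns (new_vocab, new_embeddings) with contiguous ids starting at 0.
    """
    counts = Counter(corpus_tokens)
    # Walk the old vocabulary in id order so the relative order is preserved.
    kept = [
        token
        for token, _ in sorted(vocab.items(), key=lambda item: item[1])
        if token in special_tokens or counts[token] >= min_count
    ]
    new_vocab = {token: new_id for new_id, token in enumerate(kept)}
    new_embeddings = [embeddings[vocab[token]] for token in kept]
    return new_vocab, new_embeddings


# Hypothetical toy data: an 8-token multilingual vocabulary with 4-dim vectors.
vocab = {
    "<pad>": 0, "<unk>": 1, "dobrý": 2, "deň": 3,
    "hello": 4, "bonjour": 5, "svet": 6, "你好": 7,
}
embeddings = [[float(i)] * 4 for i in range(len(vocab))]
corpus_tokens = ["dobrý", "deň", "svet", "dobrý"]  # stand-in for a Slovak corpus

new_vocab, new_embeddings = trim_vocabulary(
    vocab, embeddings, corpus_tokens, special_tokens={"<pad>", "<unk>"}
)
reduction = 1 - len(new_embeddings) / len(embeddings)
print(new_vocab)
print(f"embedding rows removed: {reduction:.1%}")
```

In a real model the tokenizer would also have to be rebuilt so that it emits the new ids. Because the token embedding matrix is a large share of a small multilingual encoder, removing unused rows can shrink the model substantially, which is consistent with the size reductions reported above.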

Key points

First comprehensive Slovak embedding benchmark: 31 datasets covering 7 task types, nearly 4× the depth of coverage in previous multilingual benchmarks.
Large instruction-tuned multilingual models outperform Slovak-specific NLU models on embedding tasks; the existing Slovak models transfer poorly.
e5-sk-small (45M parameters) and e5-sk-large (365M parameters) were built by vocabulary trimming and fine-tuning, reducing parameter count by up to 62%.
The open-source models are competitive with proprietary APIs, enabling local deployment for semantic search and RAG applications.
All resources (benchmark, models, datasets, code) are publicly released to support similar efforts for other under-resourced languages.
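
To make the semantic search use case concrete, here is a minimal sketch of the retrieval step that an embedding model feeds: passages are ranked by cosine similarity to a query vector. No model is loaded; the three-dimensional vectors are hand-made stand-ins for embeddings, and the names `cosine_similarity` and `rank_passages` are assumptions for this example rather than part of the released code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs, top_k=3):
    """Return up to top_k (passage_index, score) pairs, highest similarity first."""
    scored = [
        (cosine_similarity(query_vec, vec), index)
        for index, vec in enumerate(passage_vecs)
    ]
    # Sort by descending score; break ties by the lower passage index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(index, score) for score, index in scored[:top_k]]


# Hypothetical vectors standing in for the output of an embedding model.
passage_vecs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.8, 0.6, 0.0]

results = rank_passages(query_vec, passage_vecs, top_k=2)
for index, score in results:
    print(f"passage {index}: similarity {score:.2f}")
```

In a RAG pipeline the top-ranked passages would then be passed to a generator as context. A small local embedding model keeps this whole step on the user's own hardware, which is the deployment scenario the key points describe.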
