SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation
English summary
The paper introduces SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, comprising 31 datasets across 7 task types. An evaluation of 31 embedding models shows that large instruction-tuned multilingual models perform best, while existing Slovak-specific NLU models transfer poorly to embedding tasks. The authors develop e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by vocabulary trimming and fine-tuning Multilingual E5 models. Despite size reductions of up to 62%, these open-source models perform competitively with proprietary APIs and are suitable for local deployment in semantic search and RAG. The benchmark, models, datasets, and code are released openly, offering a replicable path for other under-resourced languages.
Key points
First comprehensive Slovak embedding benchmark with 31 datasets covering 7 task types, nearly 4× the coverage depth of existing multilingual benchmarks.
Large instruction-tuned multilingual models outperform Slovak-specific NLU models on embedding tasks; transfer of existing Slovak models is poor.
Developed e5-sk-small (45M) and e5-sk-large (365M) via vocabulary trimming and fine-tuning, reducing parameters by up to 62%.
Open-source models match proprietary API performance, enabling local deployment for semantic search and RAG applications.
All resources (benchmark, models, code) are publicly released to facilitate similar efforts for under-resourced languages.
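To make the vocabulary trimming idea in the key points concrete, here is a minimal sketch in Python. It is not the authors' procedure, which this summary does not describe in detail. It assumes only the general technique: keep the embedding rows for tokens that actually occur in a target-language corpus, plus the special tokens, and drop the rest. The function name `trim_vocabulary`, the toy vocabulary, and the toy corpus are all hypothetical.

```python
import numpy as np


def trim_vocabulary(embeddings, vocab, corpus_tokens, always_keep=()):
    """Keep only the embedding rows of tokens seen in the target-language corpus.

    Returns the trimmed embedding matrix and the new token -> id mapping.
    """
    seen = set(corpus_tokens) | set(always_keep)
    # Walk the vocabulary in original id order so kept tokens keep their relative order.
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    kept = [token for token, _ in ordered if token in seen]
    old_ids = [vocab[token] for token in kept]
    new_vocab = {token: new_id for new_id, token in enumerate(kept)}
    return embeddings[old_ids], new_vocab


# Toy multilingual vocabulary (hypothetical tokens) with 4-dimensional embeddings.
vocab = {
    "<pad>": 0, "<unk>": 1, "dobrý": 2, "hello": 3,
    "deň": 4, "你好": 5, "svet": 6, "bonjour": 7,
}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 4))

# Tokens observed in a hypothetical tokenized Slovak corpus.
corpus_tokens = ["dobrý", "deň", "svet", "dobrý"]

trimmed, new_vocab = trim_vocabulary(
    embeddings, vocab, corpus_tokens, always_keep=("<pad>", "<unk>")
)
reduction = 1 - trimmed.shape[0] / embeddings.shape[0]
print(new_vocab)
print(f"embedding rows: {embeddings.shape[0]} -> {trimmed.shape[0]} ({reduction:.1%} removed)")
```

In a multilingual encoder the token embedding matrix accounts for a large share of the parameters, which is why dropping unused rows can shrink a model substantially while leaving the transformer layers untouched. A real implementation would also have to rebuild the tokenizer so that its ids match the new mapping. That step is omitted here.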