Tutorial: Implementing Microsoft SkillOpt for SearchQA Prompt Optimization with Baseline Comparison and Skill Evolution Analysis
English summary
This tutorial walks through a complete implementation of Microsoft SkillOpt’s instrumented prompt optimization pipeline. It sets up an environment with OpenAI-compatible model access, using GPT-4o as the optimizer and GPT-4o-mini as the target model. A baseline is first evaluated on the SearchQA validation set; the optimization loop then runs through rollout, reflection, aggregation, selection, slow update, and meta-skill stages. Training is visualized with accuracy curves, the edit-budget schedule, and cumulative token usage. Finally, the evolved best skill is evaluated on the unseen split, where it shows a hard-match accuracy lift over the seed baseline.
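The baseline-versus-optimized comparison described above comes down to scoring two sets of predictions with the same metric. The following is a minimal sketch of that comparison, assuming "hard match" means an exact string match after light normalization (lowercasing, stripping punctuation and articles). That definition, the helper names, and the toy strings are illustrative assumptions and not SkillOpt's actual scorer or results.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def hard_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def accuracy(predictions: list[str], references: list[list[str]]) -> float:
    """Fraction of predictions that hard-match one of their references."""
    if not predictions:
        return 0.0
    hits = sum(hard_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)


# Toy data, invented for illustration only (not taken from SearchQA).
references = [["Paris"], ["Mount Everest", "Everest"], ["1969"]]
baseline_preds = ["paris", "K2", "In 1969"]
optimized_preds = ["Paris.", "the Everest", "1969"]

baseline_acc = accuracy(baseline_preds, references)
optimized_acc = accuracy(optimized_preds, references)
lift = optimized_acc - baseline_acc
print(f"baseline={baseline_acc:.3f} optimized={optimized_acc:.3f} lift={lift:+.3f}")
```

Scoring both prompts with one function on the same examples keeps the lift attributable to the prompt rather than to a change in the metric.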
Key points
The tutorial delivers a full, runnable Colab notebook for SkillOpt’s prompt optimization on the SearchQA task, from environment setup to final evaluation.
It demonstrates using the stronger GPT-4o as the optimizer and the weaker GPT-4o-mini as the target agent, with sample limits that keep cost under control.
Baseline performance is measured before training, and the optimization process includes detailed visualization of accuracy, edit budget, and token usage across steps.
Skill evolution is inspected via snapshots, textual diffs, generated patches, reflection analyses, and meta-skill artifacts, revealing how the prompt improves.
The final optimized prompt (best_skill.md) is compared against the baseline on the validation set, showing a hard-match accuracy lift and providing a deployable artifact.
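The textual diffs mentioned in the key points can be approximated with Python's standard difflib module. This is a hedged sketch rather than SkillOpt's own inspection tooling: the snapshot name skill_step_0.md and both skill texts are hypothetical, and only best_skill.md is a file name taken from the tutorial.

```python
import difflib


def skill_diff(old_skill: str, new_skill: str,
               old_name: str = "skill_step_0.md",  # hypothetical snapshot name
               new_name: str = "best_skill.md") -> list[str]:
    """Return a unified diff between two skill prompt texts as a list of lines."""
    return list(difflib.unified_diff(
        old_skill.splitlines(),
        new_skill.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",  # inputs carry no trailing newlines, so headers should not either
    ))


# Hypothetical seed and evolved prompts, invented for illustration.
seed_skill = "Answer the question.\nBe concise."
evolved_skill = (
    "Answer the question.\n"
    "Be concise.\n"
    "Return only the shortest answer span."
)

diff_lines = skill_diff(seed_skill, evolved_skill)
print("\n".join(diff_lines))
```

Lines prefixed with + are instructions the optimizer added and lines prefixed with - are ones it removed, which makes it easy to see which edits coincide with an accuracy change.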