Cleo is an open-source text-to-SQL model built by finetuning Qwen3.5-2B-Base, designed to encapsulate full analyst behavior within a 2B-parameter model. The system uses the same structured harness for training, evaluation, and inference, implementing a gather-repair-answer contract that feeds live execution evidence back into the search over candidate queries. Key design choices include co-optimization of the model contract, SQL safety layer, dialect handling, timeouts, and clarification behavior. The model, harness, and datasets are fully open-source on GitHub and Hugging Face. The project demonstrates how tightly coupling training and inference in a single harness can enable small models to handle complex SQL generation and interactive debugging.
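To make the gather-repair-answer idea concrete, here is a minimal sketch of a loop that executes each candidate query and feeds the result or error back to the proposer. It is not Cleo's harness: `is_safe`, `execute_with_evidence`, `gather_repair_answer`, and the stub proposer are hypothetical names, SQLite stands in for whatever dialects the project supports, and timeouts and clarification are left out.

```python
import sqlite3

READ_ONLY_PREFIXES = ("select", "with")


def is_safe(sql: str) -> bool:
    """Toy safety layer (assumption): allow one read-only statement only."""
    stripped = sql.strip().rstrip(";").strip()
    if ";" in stripped:  # conservative: rejects any embedded semicolon
        return False
    return stripped.lower().startswith(READ_ONLY_PREFIXES)


def execute_with_evidence(conn, sql, max_rows=5):
    """Run a candidate and return (ok, evidence) for the next proposal."""
    if not is_safe(sql):
        return False, "rejected by safety layer"
    try:
        return True, conn.execute(sql).fetchmany(max_rows)
    except sqlite3.Error as exc:
        return False, f"error: {exc}"


def gather_repair_answer(conn, propose, max_attempts=3):
    """propose(history) -> SQL; history holds (sql, ok, evidence) tuples."""
    history = []
    for _ in range(max_attempts):
        sql = propose(history)
        ok, evidence = execute_with_evidence(conn, sql)
        history.append((sql, ok, evidence))
        if ok:
            return sql, evidence, history
    return None, None, history


# Hypothetical demo data and a stub standing in for the model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 32.5)])


def stub_propose(history):
    # First attempt uses a wrong column name; the second repairs it.
    if not history:
        return "SELECT SUM(total) FROM orders"
    return "SELECT SUM(amount) FROM orders"


final_sql, result, trace = gather_repair_answer(conn, stub_propose)
```

In this sketch the first candidate fails with a database error, and that error is the evidence the proposer sees before producing the repaired query.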
Reddit user /u/summerday10 released FeynRL, an open-source framework designed to make reinforcement learning post-training for large language models, vision-language models, and agents fully transparent and modifiable. The framework exposes the entire training loop (data loading, rollout generation, reward computation, loss construction, optimization, and evaluation) so researchers can develop new algorithms without fighting hidden systems. It currently includes examples for supervised fine-tuning, DPO, and RL-style training and supports single-GPU, multi-GPU, and cluster setups. The project was motivated by the belief that open weights alone are insufficient; open training codebases that keep algorithms explicit and systems separate are necessary for advancing open ML/AI research.
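As an illustration of what an explicit training loop looks like when every stage is visible, the sketch below runs a toy policy-gradient (REINFORCE) update on a two-action problem in plain Python. It does not use FeynRL's API; all function names, the reward rule, and the hyperparameters are assumptions chosen only to show the stages named above as separate, editable steps.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_prompts(n):
    """Data loading (toy): prompts are just indices."""
    return list(range(n))


def rollout(logits, rng):
    """Rollout generation: sample one action from the current policy."""
    probs = softmax(logits)
    action = rng.choices(list(range(len(probs))), weights=probs)[0]
    return action, probs


def reward_fn(action):
    """Reward computation (toy assumption): action 1 is the correct one."""
    return 1.0 if action == 1 else 0.0


def policy_gradient(action, probs, advantage):
    """Loss construction: gradient of -advantage * log pi(action) w.r.t. logits."""
    return [-advantage * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]


def train(steps=200, batch=16, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        samples = [rollout(logits, rng) for _ in load_prompts(batch)]
        rewards = [reward_fn(a) for a, _ in samples]
        baseline = sum(rewards) / len(rewards)  # batch-mean baseline
        grads = [0.0, 0.0]
        for (action, probs), r in zip(samples, rewards):
            g = policy_gradient(action, probs, r - baseline)
            grads = [gi + gj / batch for gi, gj in zip(grads, g)]
        # Optimization: plain gradient descent on the logits.
        logits = [w - lr * g for w, g in zip(logits, grads)]
    return softmax(logits)


# Evaluation: probability mass the trained policy puts on each action.
final_probs = train()
```

Because each stage is an ordinary function, swapping the reward rule or the loss is a local change, which is the kind of modifiability the post argues for.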
Pyrecall is a new open-source tool built to address the lack of practical tooling for continual learning research. It snapshots skill scores before and after fine-tuning, flags performance regressions, and supports rolling back LoRA adapters by name. The tool runs fully locally, is released under the MIT license at v0.1.0, and can be installed via pip. The developer is seeking community feedback on the benchmark design.
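The before/after comparison described above can be sketched in a few lines. This is not Pyrecall's API: `flag_regressions`, `AdapterRegistry`, the tolerance value, and the skill scores are all hypothetical and only illustrate the snapshot, flag, and rollback-by-name workflow.

```python
def flag_regressions(before, after, tolerance=0.02):
    """Return skills whose score dropped by more than `tolerance`."""
    regressions = {}
    for skill, old in before.items():
        new = after.get(skill)
        if new is not None and old - new > tolerance:
            regressions[skill] = round(old - new, 4)
    return regressions


class AdapterRegistry:
    """Toy store of named adapters with rollback by name (assumption)."""

    def __init__(self):
        self._adapters = {}
        self.active = None

    def register(self, name, weights):
        self._adapters[name] = weights
        self.active = name

    def rollback(self, name):
        if name not in self._adapters:
            raise KeyError(f"unknown adapter: {name}")
        self.active = name
        return self._adapters[name]


# Made-up skill scores before and after a fine-tuning run.
before = {"math": 0.71, "code": 0.64, "summarization": 0.80}
after = {"math": 0.55, "code": 0.66, "summarization": 0.79}

registry = AdapterRegistry()
registry.register("base-adapter", {"rank": 8})
registry.register("finetune-v2", {"rank": 8})

regressions = flag_regressions(before, after)
if regressions:
    # A flagged drop triggers a rollback to the earlier named adapter.
    registry.rollback("base-adapter")
```

Here only the drop in `math` exceeds the tolerance, so the registry falls back to `base-adapter`.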
The post asks whether quantization-aware training (QAT) is tied to one specific quantization method, such as Google's own scheme for Gemma-4, or whether alternative quantizations such as Unsloth's are also valid. Unsloth's quantizations of Gemma-4-QAT reportedly produce results closer to those of the QAT fine-tuned models. The author questions whether this closeness is beneficial or whether it undermines the purpose of QAT, which is to emulate a particular inference-time quantization. The discussion highlights a potential trade-off between accuracy preservation and adherence to the original quantization scheme.
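For context, QAT generally works by inserting a quantize-dequantize ("fake quantization") step into the forward pass so that the weights learn to tolerate one particular rounding grid. The sketch below shows that round trip for two schemes, per-tensor and per-group scaling, and compares their reconstruction error on the same weights. The schemes, bit width, group size, and weight values are assumptions for illustration; the post does not say which scheme Google or Unsloth use.

```python
def fake_quantize(weights, bits=4):
    """Symmetric quantize-dequantize round trip with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed integers
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dequantized = []
    for w in weights:
        q = max(-qmax - 1, min(qmax, round(w / scale)))  # clamp to int range
        dequantized.append(q * scale)
    return dequantized, scale


def fake_quantize_grouped(weights, bits=4, group_size=2):
    """Same round trip, but each group of weights gets its own scale."""
    dequantized = []
    for start in range(0, len(weights), group_size):
        group, _ = fake_quantize(weights[start:start + group_size], bits)
        dequantized.extend(group)
    return dequantized


def mean_abs_error(original, reconstructed):
    return sum(abs(a - b) for a, b in zip(original, reconstructed)) / len(original)


# Made-up weights: two large values followed by two small ones.
weights = [0.7, -0.32, 0.06, 0.02]

per_tensor, _ = fake_quantize(weights)
per_group = fake_quantize_grouped(weights)

err_tensor = mean_abs_error(weights, per_tensor)
err_group = mean_abs_error(weights, per_group)
```

The two schemes map the same weights onto different grids, which is why a model trained against one grid is not guaranteed to behave the same under another, even when the second has lower raw reconstruction error.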
Niels from Hugging Face announces the addition of on-policy distillation (OPD) to PapersWithCode as a key term. OPD is a post-training technique used in models like Qwen 3.6, GLM-5.1, and DeepSeek-V4. The method involves injecting hint tokens to discourage specific errors during rollouts without generating new rollouts. A whiteboard explanation by Sasha Rush is linked, and the post invites feedback on other methods to add.
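The summary does not spell out the training signal. In the commonly described form of on-policy distillation, the student samples its own tokens and a teacher scores them, with the per-token gap `log p_student - log p_teacher` serving as a single-sample reverse-KL penalty. The sketch below computes that quantity on made-up logits; the function names and numbers are assumptions, and the hint-token mechanism mentioned above is not modeled.

```python
import math


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def per_token_reverse_kl(student_logits, teacher_logits, sampled_tokens):
    """Per position: log p_student(token) - log p_teacher(token).

    The tokens are the ones the student sampled, so averaging this over
    samples estimates KL(student || teacher) at each position.
    """
    penalties = []
    for s, t, tok in zip(student_logits, teacher_logits, sampled_tokens):
        penalties.append(log_softmax(s)[tok] - log_softmax(t)[tok])
    return penalties


# Made-up logits over a 3-token vocabulary for a 2-token rollout.
student_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
teacher_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
sampled_tokens = [0, 1]  # tokens the student actually generated

penalties = per_token_reverse_kl(student_logits, teacher_logits, sampled_tokens)
```

At the first position the two models agree, so the penalty is zero; at the second the student chose a token the teacher considers unlikely, so the penalty is positive and training would push the student away from it.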