Adaptive Tokenization via Temporal Redundancy Masking and Latent Inpainting
Abstract
The paper proposes a parameter-free adaptive token allocation method for video tokenization that exploits temporal redundancy in the latent space of a frozen continuous video tokenizer. It drops spatial positions whose temporal-L1 difference between adjacent frames falls below a fixed threshold, so the compression rate adapts to the content of each clip. A lightweight Latent Inpainting Transformer (LIT) with factorised spatial-temporal attention reconstructs the dropped tokens. Inference requires only a single encoder pass and one LIT forward pass, with no auxiliary routing network. On the TokenBench and DAVIS benchmarks, the method delivers competitive reconstruction fidelity while running 31× faster than ElasticTok-CV and 2× faster than InfoTok at inference.
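The threshold-based dropping rule described in the abstract can be illustrated with a minimal sketch. The names `temporal_l1_keep_mask` and `compression_rate` are hypothetical, and several details are assumptions this summary does not confirm: the first frame is always kept in full, each frame is compared with the one immediately before it, and the L1 distance is averaged over channels.

```python
from typing import List

# latents[t][n][c]: frame t, spatial position n, channel c
Latent = List[List[List[float]]]


def temporal_l1_keep_mask(latents: Latent, threshold: float) -> List[List[bool]]:
    """Return keep[t][n]: True if the token at frame t, position n is kept.

    Assumptions (not confirmed by the summary): frame 0 is kept in full,
    differences are taken against the previous frame, and the L1 distance
    is averaged over channels.
    """
    keep = [[True] * len(latents[0])]  # assumed: first frame kept in full
    for t in range(1, len(latents)):
        row = []
        for prev, cur in zip(latents[t - 1], latents[t]):
            l1 = sum(abs(a - b) for a, b in zip(cur, prev)) / len(cur)
            row.append(l1 >= threshold)  # below threshold: redundant, so drop
        keep.append(row)
    return keep


def compression_rate(keep: List[List[bool]]) -> float:
    """Fraction of tokens dropped; it follows from the content of the clip."""
    total = sum(len(row) for row in keep)
    kept = sum(sum(row) for row in keep)
    return 1.0 - kept / total


# Toy clip: 3 frames, 2 positions, 2 channels.
# Position 0 is static; position 1 changes in every frame.
latents = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [2.0, 0.0]],
]
keep = temporal_l1_keep_mask(latents, threshold=0.5)
rate = compression_rate(keep)
```

In this toy clip the static position is dropped in frames 1 and 2 while the moving position is kept throughout, so a clip with more static content would yield a higher compression rate under the same threshold.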
Key Points
Operates in the latent space of a frozen continuous video tokenizer; the tokenizer is not retrained and no auxiliary routing network is needed.
Drops spatial positions whose temporal-L1 difference between adjacent frames falls below a fixed threshold, so the compression rate is driven by content.
A lightweight Latent Inpainting Transformer (LIT) reconstructs the dropped positions using factorised spatial-temporal attention.
On TokenBench and DAVIS, inference is 31× faster than ElasticTok-CV and 2× faster than InfoTok, with competitive reconstruction quality.
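Factorised spatial-temporal attention, which the key points attribute to LIT, can be sketched generically as a spatial pass within each frame followed by a temporal pass at each position. This is not LIT's actual layer design, which the summary does not specify: the sketch assumes a single head with identity query, key, and value projections, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections (a real block would learn them).
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(dim)])
    return out


def factorised_attention(x: List[List[Vector]]) -> List[List[Vector]]:
    """x[t][n] is the token at frame t, spatial position n."""
    # Spatial pass: each frame attends over its own N positions.
    x = [self_attention(frame) for frame in x]
    # Temporal pass: each position attends over its own T frames.
    num_frames, num_pos = len(x), len(x[0])
    columns = [
        self_attention([x[t][n] for t in range(num_frames)])
        for n in range(num_pos)
    ]
    return [[columns[n][t] for n in range(num_pos)] for t in range(num_frames)]


def pairwise_scores(num_frames: int, num_pos: int) -> Tuple[int, int]:
    """Attention-score counts: (factorised, joint over all T*N tokens)."""
    factorised = num_frames * num_pos**2 + num_pos * num_frames**2
    joint = (num_frames * num_pos) ** 2
    return factorised, joint


# Toy input: 2 frames, 3 positions, 2 channels, all tokens identical.
x = [[[1.0, 2.0] for _ in range(3)] for _ in range(2)]
out = factorised_attention(x)
counts = pairwise_scores(8, 256)
```

For an illustrative grid of 8 frames by 256 positions, `pairwise_scores` counts 540,672 attention scores for the factorised form against 4,194,304 for joint attention over all tokens. The grid size is an example and is not taken from the paper.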