Adaptive Tokenization via Temporal Redundancy Masking and Latent Inpainting

Abstract

The paper proposes a parameter-free adaptive token allocation method for video tokenization that exploits temporal redundancy in the latent space of a frozen continuous video tokenizer. It drops spatial positions whose temporal L1 difference between adjacent frames falls below a fixed threshold, so the compression rate is driven by the content of each video. A lightweight Latent Inpainting Transformer (LIT) with factorised spatial-temporal attention reconstructs the dropped tokens. Inference requires only a single encoder pass and one LIT forward pass, with no auxiliary routing network. On the TokenBench and DAVIS benchmarks, the method delivers competitive reconstruction fidelity with a 31× inference speedup over ElasticTok-CV and 2× over InfoTok.
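The thresholding step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `temporal_keep_mask`, the `(T, H, W, C)` latent layout, averaging the absolute difference over channels, always keeping the first frame, and the threshold value are all assumptions made for illustration.

```python
import numpy as np

def temporal_keep_mask(latents, threshold):
    """Return a boolean (T, H, W) mask; True means the position is kept.

    latents: array of shape (T, H, W, C) from a frozen tokenizer (assumed layout).
    threshold: fixed cut-off on the per-position temporal L1 difference (hypothetical value).
    """
    keep = np.ones(latents.shape[:3], dtype=bool)  # assumption: first frame is always kept
    # Mean absolute (L1) difference to the previous frame, per spatial position.
    diff = np.abs(latents[1:] - latents[:-1]).mean(axis=-1)
    keep[1:] = diff >= threshold  # drop positions that barely changed
    return keep

rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 8, 8, 16))
latents[2] = latents[1]  # simulate a fully static frame
keep = temporal_keep_mask(latents, threshold=0.05)
compression_rate = keep.size / keep.sum()  # varies with how much of the clip is static
```

Because the mask depends only on the latents and one scalar threshold, static clips are compressed heavily and fast-moving clips lightly, without any learned routing.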

Key Points

Operates in the latent space of a frozen continuous video tokenizer; the tokenizer is not retrained and no auxiliary routing network is needed.

Drops tokens adaptively by applying a fixed threshold to the temporal L1 difference between adjacent frames, so the compression rate follows the content.

Latent Inpainting Transformer (LIT) reconstructs dropped positions with factorised spatial-temporal attention.

31× inference speedup over ElasticTok-CV and 2× over InfoTok on standard benchmarks, maintaining competitive reconstruction quality.
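
To make "factorised spatial-temporal attention" concrete, the sketch below runs attention over the spatial positions of each frame and then over time at each position, instead of over all tokens jointly. It is only an illustration under stated assumptions, not the LIT architecture: it uses a single head with identity projections (no learned weights), a zero vector in place of a learned mask token, and hypothetical names `attend` and `factorised_block`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Single-head self-attention over the second-to-last axis.

    Identity query/key/value projections are an assumption made for brevity.
    """
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def factorised_block(tokens):
    """tokens: (T, H, W, C). Spatial attention per frame, then temporal attention per position."""
    T, H, W, C = tokens.shape
    x = tokens.reshape(T, H * W, C)
    x = x + attend(x)         # spatial: each frame attends over its H*W positions
    x = np.swapaxes(x, 0, 1)  # (H*W, T, C)
    x = x + attend(x)         # temporal: each position attends over its T frames
    return np.swapaxes(x, 0, 1).reshape(T, H, W, C)

rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 8, 8, 16))
keep = rng.random((4, 8, 8)) > 0.5                  # stand-in for a keep mask
masked = np.where(keep[..., None], latents, 0.0)    # zeros stand in for a mask token
out = factorised_block(masked)
restored = np.where(keep[..., None], latents, out)  # kept positions pass through unchanged
```

With T frames of H×W positions, the factorised form computes on the order of T·(HW)² + HW·T² attention scores rather than (T·HW)² for joint attention, which is the usual motivation for this design.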