EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Core arguments:

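With token prediction as the final goal, feature prediction can be seen as an extra constraint. It limits the expressive power of the draft model, which makes it hard for the model to benefit from more training data.
- Drop the feature prediction loss and keep only the token prediction loss.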
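To make the change in the objective concrete, here is a minimal sketch of the two losses. It is an illustration only: the function names, the absolute-error form of the feature loss, and `feature_weight` are assumptions, not details taken from the paper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_loss(draft_logits, target_token):
    # Cross-entropy of the draft distribution against the target token id.
    return -math.log(softmax(draft_logits)[target_token])

def feature_loss(draft_feature, target_feature):
    # Mean absolute error, used here as a stand-in regression loss (assumption).
    diffs = [abs(d - t) for d, t in zip(draft_feature, target_feature)]
    return sum(diffs) / len(diffs)

def constrained_loss(draft_logits, target_token, draft_feature, target_feature,
                     feature_weight=1.0):
    # Objective with the extra feature constraint.
    return (token_loss(draft_logits, target_token)
            + feature_weight * feature_loss(draft_feature, target_feature))

def token_only_loss(draft_logits, target_token):
    # Objective after dropping the feature term: the draft feature is free
    # to take any value as long as it leads to the right token.
    return token_loss(draft_logits, target_token)
```

With the feature term gone, nothing in the objective pulls the draft model's hidden output toward the target model's feature, which is what the next point is about.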
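Once the feature prediction loss is removed, the draft model's output no longer follows the same distribution as the verify model's output (the feature), so feeding the draft model's output back in as input at step 2 is no longer reliable.
- During training, add the feature the model produces at each inference step to the input of the next step, similar to autoregression.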
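The feedback loop described above can be sketched roughly as follows. `draft_step` and `mean_pool` are hypothetical stand-ins for the draft model, and features are plain lists of floats; in real training a token loss would be computed on every step's output.

```python
def rollout(draft_step, target_features, num_steps):
    # draft_step: hypothetical callable that maps a list of feature vectors
    #             (the current context) to the draft model's next output.
    # target_features: verify-model features for the known prefix.
    context = list(target_features)  # step 1 sees only verify-model features
    outputs = []
    for _ in range(num_steps):
        out = draft_step(context)
        outputs.append(out)
        context.append(out)          # later steps also see the draft's own outputs
    return outputs

def mean_pool(context):
    # Toy "draft model": element-wise mean of all vectors in the context.
    dim = len(context[0])
    return [sum(vec[i] for vec in context) / len(context) for i in range(dim)]

# Two draft steps on a two-token prefix.
steps = rollout(mean_pool, [[0.0, 2.0], [2.0, 0.0]], num_steps=2)
```

The point of the sketch is only the data flow: from step 2 on, the input mixes verify-model features with the draft model's own outputs, so training sees the same kind of input it will meet at inference time.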
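Once the feature prediction loss is removed, features from intermediate layers can also be used as input to the draft model.
- In that case, the draft model's output should follow the same distribution as the result of fusing the multi-layer features.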
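One way to picture the fusion is concatenating the per-layer features and projecting them back to the hidden size with a linear layer. The sketch below assumes exactly that; the notes do not spell out the fusion operator, and `fuse_features`, `weight`, and `bias` are illustrative names.

```python
def fuse_features(layer_features, weight, bias):
    # layer_features: k feature vectors of length d, one per chosen layer.
    # weight: d rows, each of length k * d (a linear projection; assumption).
    # bias: vector of length d.
    concat = [x for feat in layer_features for x in feat]
    return [sum(w * x for w, x in zip(row, concat)) + b
            for row, b in zip(weight, bias)]

# Two layers with hidden size 2; this weight simply averages the two layers.
low, high = [1.0, 2.0], [3.0, 4.0]
weight = [[0.5, 0.0, 0.5, 0.0],
          [0.0, 0.5, 0.0, 0.5]]
fused = fuse_features([low, high], weight, bias=[0.0, 0.0])
```

The fused vector has the same size as a single-layer feature, so it can take the place of the top-layer feature as draft-model input.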

Under the scaling law, lowering the difficulty of the task ends up limiting the model's expressive power.
That is a real slap in the face.