FLM - Flow Language Model

2026-05-30T04:00:00+00:00

This note covers the key concepts of Flow Language Model (FLM) and Flow Map Language Model (FMLM).


I. Problem Definition

给定词表（token set）$V$，$\lvert V\rvert$ 表示其中 token 的种类数。记长度为 $L$ 的语句样本（即一个 sentence）为 $y = (y^l)_{l=1}^L \in V^L$.

目标是在使用尽量少的计算和推理步数的前提下，预测在$V^L$上的分布 $p(y)$，从而生成一个有意义的样本 $y_{\text{sample}}$.
\[\overbrace{\underset{v \in V}{\underline{\text{Mary}}} \text{ is a girl.}}^{y \in V^L}\]
考虑的指标：
- 生成质量
- 推理步数
- 计算开销

II. Existing Solutions

Autoregressive language models

遵循链式法则，而非 Markov chain，因为每个 token 的预测都依赖于之前所有 token 的上下文.

整体分布可写为：
\[p(y) = p(y^1)p(y^2 \mid y^1) p(y^3 \mid y^1,y^2) \cdots p(y^L \mid y^{<L})\]

计算顺序是逐 token 进行从左到右的, 一共需要 $L$ 步
每一步都需要从 $V$ 中选择一个 token
计算开销可记为 $O(\lvert V\rvert \cdot L)$
推理步数为 $O(L)$

Discrete diffusion language models

思路是通过并行计算多个 token 来加速，通常只有当推理步数小于 $L$ 时，才会有明显加速。

在DDLM中，我们定义句子级别的转移密度：

\[p_{t \mid s}(y_t \mid y_s), \quad y_s, y_t \in V^L\] \[\text{加噪方式} \left\{ \begin{array}{l c c c c c c c} & t = 0 & \rightarrow & t = 0.3 & \rightarrow & t = 0.7 & \rightarrow & t = 1 \\[1em] \text{masking: } & \text{Mary is a girl.} & \rightarrow & \text{Mary [m] a girl.} & \rightarrow & \text{Mary [m] a [m]} & \rightarrow & \text{[m] [m] [m] [m]} \\[0.5em] & \rlap{\color{gray}{\text{([m]的信息为0，但知道哪些token被破坏)}}} & & & & & & \\[1.5em] \text{uniform randomization: } & \text{Mary is a girl.} & \rightarrow & \text{Mary is a \textbf{boy}.} & \rightarrow & \text{Mary \textbf{angry} a \textbf{boy}} & \rightarrow & \text{\textbf{dog} \textbf{hello} a \textbf{boy}} \\[0.5em] & \rlap{\color{gray}{\text{(不可以推断哪些token被破坏以及原词是什么)}}} & & & & & & \\ \end{array} \right.\]

去噪：

如果直接在 $V^L$ 空间上计算 $P_{t\mid s}(y_t \mid y_s)$，可写作：

\[P_{t\mid s}(y_t \mid y_s) = \prod_{l=1}^{L} P_{t\mid s}(y_t^{l} \mid y_t^{<l}, y_s)\]

Under this premise, each update of $y$ is a mapping $y_{s} \in V^L \mapsto y_{t} \in V^L$.

计算顺序是所有token同时并发进行的
而每一步都需要为$y_{t}$中每一个位置选择一个$V$中的token
推理步数为$O(N)$, 为了起到加速作用，$N$应该是一个远小于 $L$ 的常数.
计算开销可记为 $O(\lvert V\rvert^L)$

因此通常是不可行的，DDLM通常采用 factorized approximation 方法：

\[\widehat{P}_{t\mid s}(y_t \mid y_s) := P_{t\mid s}^1(y_t^1 \mid y_s) P_{t\mid s}^2(y_t^2 \mid y_s) \cdots P_{t\mid s}^L(y_t^L \mid y_s)\]

$P_{t \mid s}^{i}$的角标$i$代表概率计算与token所处位置有关。

这意味着：

计算量可降到 $O(L \times \lvert V\rvert)$, 但会丢失大量上下文相关性，导致生成质量下降
此外每个 token 的预测近似彼此独立, 这点只有在 $t \rightarrow s$ 时才会成立
更新步幅小会导致生成时推理步数 $N$ 往往还是很大

III. Continuous Representation of Language

这一步是为了完成 $y \in V^L \leftrightarrow$ $x \in \mathcal{X}$ 的映射, 其中 $\mathcal{X}$ 是一个连续空间，通常取高维实数空间$\mathbb{R}^{L \times \lvert V\rvert}$.

往往我们采用一个连续的编码器（continuous embedding）$f$ 和一个解码器 $g$ 来完成这个映射：

编码器：$f : y \mapsto x$
解码器：$g : x \mapsto y$, 满足 $g(f(y)) = y$.

这样，原来的分布 $p(y)$ 可以由$p(x) = p(y = g(x))$诱导出分布 $p(x)$，并通过 $y = g(x)$ 还原。

推理和解码过程可重写为：

\[\hat{x} \sim p(x), \quad \hat{y} = g(\hat{x})\]

即先在连续空间中采样，再映射回离散序列。

对于编码方式 $f$，常见选择包括：

Learned Embedding: 每个Token的位置$i$随机初始化一个低维向量$\theta_i$, $\theta$作为模型的一部分参数进行实时更新.

这个方法的表达能力强，但需要 careful regularization, 不然可能会导致所有$\theta_i$趋向于相同的值, 或者 $\lvert\theta\rvert \to \infty$.

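Pretrained Embedding: Easy to implement. It reuses the embedding layer of a pretrained language model such as T5 (as in ELF) or BERT, inherits its prior knowledge, and is frozen during training.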
One-hot Encoding: 直接把每个 token 映射为一个独热向量（one-hot vector）：

\[f: y \mapsto (\text{onehot}(y^1), \ldots, \text{onehot}(y^L))^\top, \quad V^L \to \mathbb{R}^{L \times \lvert V\rvert}\] \[g: x \mapsto (\arg\max(x^1), \ldots, \arg\max(x^L)), \quad \mathbb{R}^{L \times \lvert V\rvert} \to V^L\]

这种表示不需要正则化, 适合 cross-entropy训练, 信息无损但是维度很高, 并且语义相似性不容易直接表达。

IV. Flow Language Model

核心问题变为：在嵌入空间中学习连续数据分布 $p(x)$。

1. 插值

目标是建立噪声分布$p_0$到数据分布$p_1$的桥梁，定义随时间演化的概率路径。

线性插值：

\[I_t := (1-t)x_0 + tx_1, \quad I_t \sim p_t\]

其中：

$x_0 \sim p_0 = \mathcal{N}(0, I)$ 表示噪声
$x_1 \sim p_1$ 表示目标样本
$t \in [0,1]$ 表示时间变量
$I_t$ 表示训练时刻 $t$ 的中间状态, 以示区分我们额外定义的 $x_t$ 变量表示推理时刻 $t$ 的中间状态

这种方式的速度场容易学习，路径是一条直线，训练通常高效且较稳定。

带噪声的随机插值：

\[I_t = (1-t)x_0 + tx_1 + \sqrt{2t(1-t)}z, \quad z \sim \mathcal{N}(0, I_d)\]

三角插值：

\[I_t = \cos\left(\frac{\pi}{2}t\right)x_0 + \sin\left(\frac{\pi}{2}t\right)x_1\]

这个能保证中间状态分布的方差不变，即“变速生成”。

另外也可以考虑编码-解码式路径，例如：

\[I_t = (1-t)x_0 + tx_1 + \sin^2(\pi t)\,z, \quad z \sim \mathcal{N}(0, I_d)\]

直观上：

$t: 0 \to 0.5$ 更像编码 / 破坏过程，编码到纯噪声的latent space
$t \to 1$ 更像从 latent space 解码回目标空间

2. 概率流与向量场

这一部分的目标是从产生的概率路径中学习一个确定性的 ODE / SDE。有点类似world model从轨迹中观察得出物理动力学知识的思想。

若采用线性插值，我们引入速度场 $b_t(x_t)$用于描述在时刻 $t$ 下的$x_t$更新方向, 定义为：

\[b_t(x_t) = \dot{x_t}\qquad x_0 \sim p_0, t \in [0,1]\]

整个模型的flow的动力学可以用一个ODE或SDE来描述：

ODE： $\frac{d}{dt} x_t = b_t(x_t)$
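SDE: $dx_t = b_t(x_t)\,dt + \sqrt{2\epsilon}\,dW_t$, where $W_t$ is standard Brownian motion and $\epsilon$ is the noise intensity; the added stochastic jitter is intended to improve robustness and generalization. (For the SDE to keep the same marginals $p_t$ as the ODE, the drift generally also needs a score correction $\epsilon \nabla \log p_t(x_t)$.)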

从初始分布 $p_0$ 出发，只需要逐步求解 ODE / SDE 就可以得到最终分布 $p_1$ 的样本，而不需要每一步重新随机采样。

3. Fitting & MSE Loss

我们对符号稍作总结：

初始噪声：$x_0$，目标样本：$x_1$，时间变量：$t \in [0,1]$
训练中间状态：$I_t := (1-t)x_0 + tx_1$，训练时真实速度场为$\dot{I_t} = x_1-x_0$
推理时中间状态$x_t := x_0 + \int_0^t b_s(x_s) \, ds$，$b_t(x_t)$为模型预测的速度场，决定每一步更新方向。

因此只要能够预测或计算 $b_t(x)$，整个 ODE / SDE 的采样过程就被确定下来。

针对某一时刻 $t$，可以把所有轨迹经过该点时的速度做条件平均，从而得到：

\[b_t(x) = \mathbb{E}[\dot{I_t} \mid I_t = x] = \mathbb{E}[x_1 - x_0 \mid I_t = x]\]

Here $I_1 = x_1$ is the endpoint of a training trajectory that passes through $x$ at time $t$, and the expectation is taken over all pairs $(x_0, x_1)$ whose interpolant satisfies $I_t = x$. The identity holds for every $x$ in the support of $p_t$.

这也意味着，问题可以被转化为一个回归问题。

定义 MSE 损失：

\[\mathcal{L}_{\text{MSE}}(\hat{b}) := \int_0^1 \mathbb{E}\left[\left\vert \hat{b}_t(I_t) - \dot{I_t} \right\vert^2\right] dt\]

实际实现时，积分通常会离散化为网格：

\[0 = t_0 < t_1 < \cdots < t_N = 1\]

此时，最优的速度场满足：

\[b = \arg\min_{\hat{b}} \mathcal{L}_{\text{MSE}}(\hat{b})\]

直接预测速度场的问题

这是一个利用flow进行语言生成的独有问题，如果直接预测：

\[b = x_1 - x_0\]

那么由于 $x_0 \in \mathbb{R}^{L \times \lvert V\rvert}$ 处于高维离散表示空间中，而且高斯噪声$x_0$一般是满秩的，导致对于每个输入输出网络的$b_i \in \mathbb{R}^{\lvert V\rvert}$, 神经网络$\phi: \mathbb{R}^{L \times \lvert V\rvert} \mapsto \mathbb{R}^{L \times d} \mapsto \mathbb{R}^{L \times \lvert V\rvert}$的特征维度（hidden state）$d$ 不足以表达 $b_i$ ($d \ll \lvert V\rvert$). 因此用$b$作为训练对象时一定会出现欠拟合问题。（实验验证，ELF C.1, Figure 10）

一个常见观点是：虽然

\[x \in \mathbb{R}^{L \times \lvert V\rvert}\]

但自然语言样本并不会均匀铺满这个空间，而更可能集中探索在某个低维流形上，因此可以通过更紧凑的表示去学习它。

真实的速度场 $\dot{I_t}$ 是一个常数，等于 $x_1 - x_0$，与 $t$ 无关：

\[\dot{I_t} = x_1 - x_0,\]

预测速度场改写为对终点的预测：

\[\hat{b}_t(I_t) = \frac{\hat{x}_1 - x_t}{1-t}\]

where $\hat{x}_1$ is the predicted endpoint, $\hat{x}_1(x_t) = \mathbb{E}[x_1 \mid I_{t} = x_t]$.

于是原来的 MSE 目标可改写为：

\[\mathcal{L}_{\text{MSE}}(\hat{b}) = \int_0^1 \mathbb{E}\left[\left\vert \hat{b}_t(I_t) - \dot{I_t} \right\vert^2\right] dt = \int_0^1 \mathbb{E}\left[\left\vert \frac{\hat{x}_1 - x_t}{1-t} - (x_1 - x_0) \right\vert^2\right] dt \\ = \int_0^1 \mathbb{E}\left[\left\vert \frac{\hat{x}_1 - x_{1} + ((1-t)x_{0} + tx_{1} - x_t)}{1-t} \right\vert^2\right] dt \\ = \int_0^1 \frac{1}{(1-t)^2}\mathbb{E}\left[(\hat{x}_1 - x_{1})^2\right] dt\]

此时训练也可以理解为直接逼近最终目标样本。但是需要注意这里有一个$\frac{1}{(1-t)^2}$使得整个积分成为了一个瑕积分，导致当 $t \to 1$ 时，损失函数的值会趋向于无穷大，这可能会引起训练不稳定。

应对措施

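FMLM: drops the $\frac{1}{(1-t)^2}$ factor and uses the approximate objective $\int_0^1 \mathbb{E}\left[(\hat{x}_1 - x_{1})^2\right] dt$. This is not strictly justified; in practice it works better combined with CE, while keeping $t$ relatively small.
ELF:
1. Instead of sampling $t \in [0,1]$ uniformly, sample it from a logit-normal distribution: $t' \sim \mathcal{N}(P_{mean}, P_{std}^2),\ t = \sigma(t')$, where $\sigma$ is the sigmoid.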
2. 直接在 $t$ 接近 $1$ 的位置单独设置一个解码分支，这个位置更适合采用交叉熵CE
3. 由于最优预测 $\hat{x}_1$ 一定满足$ \hat{x}_1 - x_1 = O(1 - t) $, 相当于在这里引入了一个隐式的Lipschitz约束，当然我们可以考虑引入一个显式的Lipschitz正则项来强化这个约束，这个可以后面尝试一下。

4. Decoding & CE Loss

在连续空间中做拟合时，MSE Loss 是最自然的训练目标。但当数据最终要转换成离散表示时，我们可以引入分类指标交叉熵（Cross-Entropy, CE）作为更适配的训练目标。

FLM 的 CE 目标

FLM 采用 one-hot 编码，即目标 $x_1 \in \mathbb{R}^{L \times \lvert V\rvert}$ 中，每一个分量$x_{1}^l$只有一个元素为 1，其余为 0。

FLM 定义了交叉熵目标：

$\mathcal{L}_{CE}(\hat{x_1}) := \int_0^1 \mathbb{E}_{I_t \sim p_t}\left[-\sum_{l=1}^L \log \hat{x_1}(I_t)^l \cdot x_1^l\right] dt$ 其中 $\log \hat{x_1}(I_t)^l \cdot x_1^l = \log \left( \text{模型预测正确 Token } c \text{ 的概率} \right)$

关键性质：

$\mathcal{L}_{CE}$ 可分解为不可约的条件熵 + 从真实后验到模型分布的 KL 散度之和，因此最小化 CE 本质上在做一个$\hat{x}_1$ 对于$x_1$的分布匹配。
If the CE gap satisfies $\Delta_D(\hat{D}) := \mathcal{L}_{CE}(\hat{D}) - \mathcal{L}_{CE}(D) \leq \varepsilon$, then for any early stopping time $\xi \in (0,1)$:

\[W_2^2(\hat{p}_{1-\xi}, p_{1-\xi}) \leq C\varepsilon\]

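where $C > 0$ depends on $\xi$ and the model's Lipschitz constant, and stopping at $1-\xi$ avoids the $(1-t)^{-1}$ singularity in the conversion. In short, a smaller CE gap guarantees a tighter upper bound on the Wasserstein distance between the generated and the true distribution.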

与离散扩散的联系：最优去噪器隐式地学习了 factorized posterior $p^l_{1\mid t}(x_1^l \mid x_t)$，这与离散扩散模型通过 tokenwise CE 学习 $p^l_{1\mid t}$ 的方式一致。但关键区别在于：离散模型用 $p^l_{1\mid t}$ 做 ancestral sampling（需要完整联合概率密度，受因子化误差影响）；连续模型用 $p^l_{1\mid t}$ 推断精确速度场，从而可以被精确蒸馏为 few-step 生成器。

总而言之FLM 的 CE 是沿整个 flow 轨迹的 per-step 目标，即对所有 $t \in [0,1]$ 都施加 token-level 的交叉熵监督。

ELF 的 CE 目标

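ELF applies CE only at the final time step $t = 1$, rather than as a per-step objective along the entire trajectory. This is the core design difference between ELF and FLM (and other continuous DLMs).

At $t = 1$, ELF treats denoising as a continuous-to-discrete decoding step. The same network $\text{net}_\theta$ switches to a "decode" mode at $t = 1$ (controlled by a binary mode token), so no separate decoder is needed; it can be viewed as a decoder that shares weights with the denoiser.
As $t \to 1$, $z_t \to x$ (close to the clean embedding), so the network input becomes too easy: it is almost the answer itself. ELF therefore introduces a token-level corruption process at $t = 1$ that replaces the clean embedding with a corrupted version $\tilde{z}$ to create a meaningful training signal.
CE loss: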

\[\mathcal{L}_{CE} = \mathbb{E}_{\tilde{z}}\left[\text{CrossEnt}(W\mathbf{x}_\theta(\tilde{z}), s)\right]\]

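where:

$\mathbf{x}_\theta(\tilde{z})$ is the network's prediction (the clean embedding) given $\tilde{z}$
$W$ is a learnable "unembedding" matrix that maps embeddings back to logits
$s$ is the ground-truth token (the original discrete token)

At inference: $W\mathbf{x}_\theta(z_t)$ is computed only at the final step $t = 1$, and the argmax of the logits gives the discrete tokens.

Why ELF does not apply CE at intermediate denoising steps: ELF's denoising trajectory runs entirely in an unrestricted continuous embedding space, and per-step token-level supervision would tie the trajectory too tightly to vocabulary-level predictions, limiting the flexibility of the flow dynamics. Discretizing only at $t = 1$ gives the flow maximal freedom for $t < 1$.

The core disagreement: FLM holds that since the optimal denoiser naturally lies on the simplex, CE is the best-suited training objective and should be used along the entire trajectory; ELF holds that per-step CE restricts the flow dynamics, so discretization should happen only at the final step while intermediate steps evolve freely in continuous space.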

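V. Flow Map Language Models

1. Introducing the Flow Map

Sampling from the ODE still requires numerically integrating the velocity field, which prevents accurate generation in only a few steps. To avoid this, we introduce the flow map.

It is defined as the solution operator of $\dot{x}_t = b_t(x_t), \quad x_0 \sim p_0, \quad t \in [0,1]$:

\[X_{s,t}(x_s) = x_t, \quad \forall s,t \in [0,1].\]

That is, it is the transformation between any two times.

It can also be rewritten as:

\[X_{s,t}(x_s) = x_s + (t-s)v_{s,t}(x_s),\]

where $v_{s,t}$ is called the average velocity from $s$ to $t$, or the "mean flow".

Given an initial sample $\hat{x}_0$ and a flow map, we can take a discrete grid $0 = t_0 < \cdots < t_N = 1$ and compute successively:

\[\hat{x}_{t_{i+1}} = X_{t_i, t_{i+1}}(\hat{x}_{t_i})\]

to obtain a sample. In the extreme case $N = 1$, generation takes a single step: $\hat{x}_1 = X_{0,1}(x_0)$. In general, a larger $N$ gives better quality.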
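
The grid iteration can be sketched with a flow map that is known in closed form. The example assumes a one-dimensional Gaussian setting, $x_0 \sim \mathcal{N}(0,1)$ and $x_1 \sim \mathcal{N}(m, \sigma^2)$ drawn independently, for which $I_t \sim \mathcal{N}(tm, (1-t)^2 + t^2\sigma^2)$ and the flow map is the affine map between these Gaussians. This toy setting and the names `flow_map` and `sample` are illustrative assumptions, not part of FMLM.

```python
import math

M, SIGMA = 3.0, 0.5   # mean and std of the toy data distribution (assumptions)

def mean_std(t):
    """Mean and std of I_t = (1 - t) x0 + t x1 for independent Gaussians."""
    return t * M, math.sqrt((1.0 - t) ** 2 + (t * SIGMA) ** 2)

def flow_map(s, t, x):
    """X_{s,t}(x): affine map carrying N(mu_s, std_s^2) to N(mu_t, std_t^2)."""
    mu_s, std_s = mean_std(s)
    mu_t, std_t = mean_std(t)
    return mu_t + (std_t / std_s) * (x - mu_s)

def sample(x0, n_steps):
    """Apply X_{t_i, t_{i+1}} along the uniform grid t_i = i / n_steps."""
    x = x0
    for i in range(n_steps):
        x = flow_map(i / n_steps, (i + 1) / n_steps, x)
    return x

x0 = 0.8
one_step = sample(x0, 1)     # X_{0,1}(x0)
eight_steps = sample(x0, 8)  # composition of eight shorter maps
```

Because this exact map satisfies the semigroup property, one step and eight steps agree up to rounding. A learned map only approximates that property, which is why more steps usually help.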

2. Conditions a Valid Flow Map Must Satisfy

1. Identity condition:

\[X_{s,s}(x) = x, \quad \forall x \in \mathbb{R}^{L \times \lvert V\rvert}, \forall s \in [0,1].\]

This condition is trivial; we fold it into the next two conditions as their boundary condition.

2. Instantaneous velocity condition (Lagrangian equation):

\[\partial_t X_{s,t}(x) = b_t(X_{s, t}(x)) \quad X_{s,s}(x) = x,\quad \forall x \in \mathbb{R}^{L \times \lvert V\rvert}, \forall s,t \in [0,1].\]

3. Endpoint invariance condition (Eulerian equation): $\partial_s X_{s,t}(x) + b_s(x) \cdot \nabla X_{s,t}(x) = 0, \quad X_{t,t}(x) = x,\quad \forall x \in \mathbb{R}^{L \times \lvert V\rvert}, \forall s,t \in [0,1].$

4. Semigroup condition:

\[X_{u,t}(X_{s,u}(x)) = X_{s,t}(x), \quad \forall x \in \mathbb{R}^{L \times \lvert V\rvert}, \forall s,u,t \in [0,1].\]

Note that conditions 2 to 4 are essentially equivalent and describe the same flow. In the limit $s \to t$, condition 2 reduces to the local tangent condition: $\lim_{s \to t} \partial_t X_{s,t}(x) = b_t(x)$. Condition 3 follows from requiring that the endpoint at time $t$ does not change as the starting point moves along the trajectory: $\frac{d}{ds} X_{s,t}(x_s) = 0$.

Combining the tangent condition with conditions 2, 3, and 4 in turn gives three equivalent objectives:

Lagrangian map distillation (LMD) loss: $\mathcal{L}_{\text{LMD}}(\hat{v}) = \underbrace{\int_0^1 \int_0^t \mathbb{E} \left\vert\partial_t \hat{X}_{s,t}(I_s) - \text{sg}\left(\hat{b}_t\left(\hat{X}_{s,t}(I_s)\right)\right)\right\vert^2 ds dt}_{\text{Lagrangian condition}} + \underbrace{\int_0^1 \mathbb{E} \vert\hat{v}_{t,t}(I_t) - \hat{b}_t(I_t)\vert^2 dt}_{\text{tangent condition}}$

Eulerian map distillation (EMD) loss: $\mathcal{L}_{\text{EMD}}(\hat{v}) = \underbrace{\int_0^1 \int_0^t \mathbb{E} \left\vert\partial_s \hat{X}_{s,t}(I_s) + \text{sg}\left(\hat{b}_s(I_s) \cdot \nabla \hat{X}_{s,t}(I_s)\right)\right\vert^2 ds dt}_{\text{Eulerian condition}} + \underbrace{\int_0^1 \mathbb{E} \vert\hat{v}_{t,t}(I_t) - \hat{b}_t(I_t)\vert^2 dt}_{\text{tangent condition}}$

Progressive map distillation (PMD) loss: $\mathcal{L}_{\text{PMD}}(\hat{v}) = \underbrace{\int_0^1 \int_0^t \int_s^t \mathbb{E} \left\vert\hat{X}_{s,t}(I_s) - \text{sg}\left(\hat{X}_{u,t}\left(\hat{X}_{s,u}(I_s)\right)\right)\right\vert^2 du ds dt}_{\text{semigroup condition}} + \underbrace{\int_0^1 \mathbb{E} \vert\hat{v}_{t,t}(I_t) - \hat{b}_t(I_t)\vert^2 dt}_{\text{tangent condition}}$

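For self-distillation, it suffices to replace $\hat{b}_t(I_t)$ in the last term of each loss with $\dot{I_t}$.

FMLM uses the last form:

\[\begin{aligned} \mathcal{L}_{\mathrm{MSE}}(\hat v)=& \int_0^1 \int_0^t \int_s^t \mathbb{E}_{I_s} \left[ \left\| \hat X_{s,t}(I_s) - \mathrm{sg}\!\left( \hat X_{u,t} \left( \hat X_{s,u}(I_s) \right) \right) \right\|^2 \right] \,du\,ds\,dt \\ &+ \int_0^1 \mathbb{E}_{I_t} \left[ \left\| \hat v_{t,t}(I_t) - \hat b_t(I_t) \right\|^2 \right] \,dt . \end{aligned}\]

The first term is the semigroup (composition) constraint and the second is the instantaneous velocity constraint.

Here sg denotes stop-gradient (no backpropagation through its argument). The first term can be expanded further using $X_{s,t}(x_s) = x_s + (t-s)v_{s,t}(x_s)$, and in the second term the pretrained $\hat{b}_t$ can be replaced by $\dot{I_t}$.

$X_{s, t}$ itself is driven by Gaussian noise, so regressing it directly runs into the same problem as regressing $b_t$ directly. To unify Flow Matching and Flow Map in probability space while exploiting the simplex constraint of one-hot encoding, FMLM introduces a new optimization target:

\[\delta_{s,t}(x_s) := x_s + (1-s)v_{s,t}(x_s).\]

It can be pictured as a matrix $T \in \mathbb{R}^{N \times N}$:

The row and column indices correspond to the times $(s,t)$
Flow Matching fills in the diagonal entries $T_{tt}$ ($\hat{b}_t = \hat{v}_{t,t}$)
The flow map fills in the upper (or lower) triangular region

An additional benefit of this representation is that MSE regression is no longer needed: CE (cross-entropy) can be used directly for classification, whereas MSE on a classification target tends to be numerically unstable.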

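3. Properties of $\delta_{s,t}$

The $\delta_{s,t}$ chosen by FMLM has the following properties:

Property 1. Direct correspondence with flow matching (the flow map is recovered in closed form):

\[X_{s,t}(x) = \frac{1-t}{1-s}x + \frac{t-s}{1-s}\delta_{s,t}(x)\]

Property 2. It lives on the simplex, so its effective dimension is low:

\[\delta_{s,t}(x)^\ell \in \Delta^{\lvert V\rvert-1}, \quad \ell = 1,2,\cdots,L.\]

where $\Delta^{\lvert V\rvert-1}$ is the probability simplex: all entries are non-negative and sum to 1.

Property 3. On the diagonal it coincides with the standard denoiser (infilling function):

\[\delta_{s,s}(x) = \mathbb{E}[x_1 \mid I_s = x] = \hat{x}_1(x), \quad \forall x \in \mathbb{R}^{L \times \lvert V\rvert}, \forall s \in [0,1].\]

Property 4. It satisfies a composition rule (semigroup condition):

\[\delta_{s,t}(x) = \gamma \, \delta_{s,u}(x) + (1-\gamma) \, \delta_{u,t}(X_{s,u}(x)),\]

where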

\[\gamma = \frac{(1-t)(u-s)}{(1-u)(t-s)}.\]
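
Property 4 follows from Property 1 combined with the semigroup condition on $X$, and it can be verified numerically. The sketch below assumes a one-dimensional Gaussian toy flow map ($x_0 \sim \mathcal{N}(0,1)$, $x_1 \sim \mathcal{N}(M, \mathrm{SIGMA}^2)$, independent), which is an illustrative choice and not taken from FMLM, and recovers $\delta_{s,t}$ by inverting Property 1.

```python
import math

M, SIGMA = 3.0, 0.5   # toy data mean and std (assumptions)

def flow_map(s, t, x):
    """Closed-form X_{s,t} for the Gaussian toy problem described above."""
    std = lambda r: math.sqrt((1.0 - r) ** 2 + (r * SIGMA) ** 2)
    return t * M + (std(t) / std(s)) * (x - s * M)

def delta(s, t, x):
    """Invert Property 1: delta = ((1 - s) X_{s,t}(x) - (1 - t) x) / (t - s), for s < t."""
    return ((1.0 - s) * flow_map(s, t, x) - (1.0 - t) * x) / (t - s)

s, u, t, x = 0.1, 0.4, 0.8, 0.3
gamma = (1.0 - t) * (u - s) / ((1.0 - u) * (t - s))

lhs = delta(s, t, x)
rhs = gamma * delta(s, u, x) + (1.0 - gamma) * delta(u, t, flow_map(s, u, x))
```

Both sides agree up to floating-point error, and $\gamma$ lies strictly between 0 and 1 for $s < u < t < 1$, so the right-hand side is a convex combination.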

The loss used to train the flow map is

\[\mathcal{L}_{\mathrm{KL}}(\delta) := \mathbb{E}_{t,s,u} \mathbb{E}_{x_0,x_1} \left[ \sum_{\ell=1}^{L} \mathrm{KL}\!\left( \bar{\delta}_{s,t}^{\,\ell} \;\middle\|\; \delta_{s,t}^{\,\ell}(I_s) \right) \right] + \mathbb{E}_t \mathbb{E}_{x_0,x_1} \left[ \sum_{\ell=1}^{L} \mathrm{KL}\!\left( \hat{x}_1^{\,\ell}(I_t) \;\middle\|\; \delta_{t,t}^{\,\ell}(I_t) \right) \right].\]

where

\[\bar{\delta}_{s,t} = \mathrm{sg}\left( \gamma \delta_{s,u}(I_s) + (1-\gamma) \delta_{u,t}(X_{s,u}(I_s)) \right).\]

$\bar{\delta}_{s,t}$ is the target given by the semigroup condition, and $\mathbb{E}_{t,s,u}$ is taken over a distribution with full support on $\{0 \leq s \leq u \leq t \leq 1\}$ (every such triple can be sampled).

$\hat{x}_1$ is predicted by Flow Matching and acts as the teacher; the student $\delta_{s,t}$ is trained to fit the teacher's $\hat{x}_1$. Replacing $\hat{x}_1$ with the original $x_{1}$ from the dataset instead allows training from scratch (self-distillation).

With $\hat{x}_1$: the diagonal is anchored to the existing FLM
With $x_1$: training from scratch

4. Discrete Flow Maps?

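FMLM points out that in the original discrete state space, the evolution of the full probability distribution is well defined in theory, $\dot{p_t} = Q_t p_t$, but it cannot be computed because the dimensions of $Q_t \in \mathbb{R}^{\lvert V\rvert^L \times \lvert V\rvert^L}$ and $p_t \in \Delta^{\lvert V\rvert^L - 1}$ explode. The sample-level deterministic flow map that we actually need provably does not exist in general (it cannot fit every target distribution). Single-step deterministic generation directly in the purely discrete space is therefore a dead end.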

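VI. Algorithm Implementation

1. Time Reparameterization $\tau(t)$

In FLM and FMLM training and sampling, unlike typical continuous modalities such as images, sampling $t \in [0,1]$ uniformly is inefficient. The reason is that in the high-dimensional one-hot space, token denoising is far from uniform in time: most of the change in certainty is concentrated in a very narrow window as $t \to 1$.

To address this, FLM introduces the time reparameterization $\tau(t)$, defined as:

\[\tau(t) = \frac{P_e(0) - P_e(t)}{P_e(0)} = 1 - \frac{\lvert V\rvert}{\lvert V\rvert - 1} P_e(t)\]

The symbols are:

$\lvert V\rvert$: vocabulary size.
$P_e(t)$: decoding error rate, defined as:

\[P_e(t) := \frac{1}{L} \sum_{l=1}^L P(g_l(x_t) \neq g_l(x_1))\]

$P_e(t)$ measures the fraction of tokens that would be decoded incorrectly if the flow were stopped at time $t$ and the noisy state $x_t$ decoded directly; here $g(x)$ is the argmax operation.

At the endpoints:

$P_e(0) = 1 - \frac{1}{\lvert V\rvert}$ (pure Gaussian noise: the error rate is maximal)
$P_e(1) = 0$ (clean data: the error rate is 0)

According to Figure 9, $P_e(t)$ drops sharply as $t \to 1$, so most tokens become determined in this stage.

A short derivation: since $g(x) = \arg\max(x)$ and the interpolant at time $t$ is $x_t = (1 - t)x_0 + tx_1$, suppose the correct token is at index $j$ (so $x_1^j = 1$) and it is confused with index $i$ (so $x_1^i = 0$). Then $x_t^i > x_t^j$, that is, $(1-t)x_0^i + tx_1^i > (1-t)x_0^j + tx_1^j$, which gives $x_0^i - x_0^j > \frac{t}{1 - t}$. Since $\frac{t}{1 - t}$ increases monotonically in $t$ and diverges as $t \to 1$, a larger noise gap is needed to cause a decoding error at larger $t$, so the decoding error rate falls off sharply.

In the pseudocode, "Sample $t$ via $\tau(t) \sim \mathcal{U}[0,1]$" means: first sample $\tau$ uniformly, then map it back to the actual time $t$ through the inverse function.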
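
A rough Monte Carlo sketch of $P_e(t)$ and $\tau(t)$ for a single position, assuming one-hot targets drawn uniformly from a small vocabulary and the linear interpolant. The vocabulary size, sample count, and function names are illustrative assumptions.

```python
import random

def decode_error_rate(t, vocab_size, n_samples=4000, seed=0):
    """Estimate P_e(t): how often argmax(x_t) differs from the true token."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_samples):
        target = rng.randrange(vocab_size)
        # x_t = (1 - t) * noise + t * onehot(target), built one coordinate at a time
        x_t = [(1.0 - t) * rng.gauss(0.0, 1.0) + (t if k == target else 0.0)
               for k in range(vocab_size)]
        predicted = max(range(vocab_size), key=x_t.__getitem__)
        errors += int(predicted != target)
    return errors / n_samples

def tau(t, vocab_size):
    """tau(t) = 1 - |V| / (|V| - 1) * P_e(t), using the Monte Carlo estimate."""
    return 1.0 - vocab_size / (vocab_size - 1) * decode_error_rate(t, vocab_size)

V = 8
tau_values = {t: tau(t, V) for t in (0.0, 0.2, 0.5, 0.8, 1.0)}
```

Sampling $t$ through $\tau$ would additionally require inverting these estimates, for example by interpolating a monotone table of $(t, \tau(t))$ pairs; that step is left out here.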

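2. Sampling for the Off-Diagonal Loss

In FMLM distillation training, the off-diagonal loss teaches the student (the two-time denoiser $\hat{\delta}_{s,t}$) the semigroup property: jumping directly from the start $s$ to the end $t$ must give the same result as jumping from $s$ to an intermediate time $u$ and then from $u$ to $t$.

To make this property generalize across all time spans, the paper samples the time triple $(s, u, t)$ on the reparameterized axis $\tau$ with a continuous, symmetric scheme:

$h \sim \mathcal{U}[0, 1]$: draw a span $h$ uniformly; it is the time distance the model must cross in this training step.
$\tau(s) \sim \mathcal{U}[0, 1 - h]$, $\tau(t) = \tau(s) + h$
$\tau(u) = \frac{\tau(s) + \tau(t)}{2}$: the intermediate time $u$ is fixed at the midpoint between start and end.

Shortcut models, by contrast, fix the time spans and the training start and end points in advance. FMLM's sampling covers every span from very short to very long, so the model sees all span lengths during training and learns the semigroup property better.

3. Boundary Sampling for Few-Step Generation

With a fixed probability $p$ ($p = \frac{1}{32}$ in FMLM), the boundary pair $(s,t) = (0,1)$ is sampled directly, to guarantee enough training signal for that full-span jump.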
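
The triplet sampling and the boundary case can be sketched together. The code works directly on the reparameterized axis $\tau$. How the intermediate time is chosen for the boundary pair is not specified here, so using the midpoint $0.5$ in that case is an assumption, as are the function and parameter names.

```python
import random

def sample_triplet(rng, boundary_prob=1.0 / 32.0):
    """Return (tau_s, tau_u, tau_t) with tau_s <= tau_u <= tau_t in [0, 1]."""
    if rng.random() < boundary_prob:
        return 0.0, 0.5, 1.0                 # boundary pair (s, t) = (0, 1)
    h = rng.random()                         # span h ~ U[0, 1)
    tau_s = rng.uniform(0.0, 1.0 - h)        # start chosen so that the span fits
    tau_t = min(1.0, tau_s + h)              # guard against float rounding
    tau_u = 0.5 * (tau_s + tau_t)            # midpoint
    return tau_s, tau_u, tau_t

rng = random.Random(0)
triplets = [sample_triplet(rng) for _ in range(5000)]
boundary_fraction = sum(tr == (0.0, 0.5, 1.0) for tr in triplets) / len(triplets)
```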

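4. Guidance at Generation Time

Autoguidance, for unconditional generation

$\hat{b}_t^{\text{(guided)}} = \hat{b}_t^{\text{(weak)}} + \eta (\hat{b}_t - \hat{b}_t^{\text{(weak)}})$
- $\hat{b}_t$ is the velocity predicted by the main (strong) model.
- $\hat{b}_t^{\text{(weak)}}$ is the velocity predicted by a "weak model" (for example one trained for fewer steps, with a smaller network, or degraded by high dropout).
- $(\hat{b}_t - \hat{b}_t^{\text{(weak)}})$ is "good" minus "bad" and points toward higher quality.
- $\eta > 1$ is a scalar that controls the strength.

In continuous space the result is still a valid direction; pushing hard in the discrete logit space (very large $\eta$), however, drives some token probabilities to extremes and breaks the overall coherence of the sentence.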
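
The guidance rule is a plain extrapolation from the weak prediction through the main one. A minimal sketch, with velocities stored as flat lists and toy values chosen only for illustration:

```python
def autoguide(b_main, b_weak, eta):
    """Guided velocity: b_weak + eta * (b_main - b_weak), elementwise."""
    return [w + eta * (m - w) for m, w in zip(b_main, b_weak)]

b_main = [1.0, -0.5, 0.25]   # velocity from the main model (toy values)
b_weak = [0.5, -0.5, 0.75]   # velocity from the weak model (toy values)

guided = autoguide(b_main, b_weak, eta=2.0)
```

With $\eta = 1$ the rule returns the main model's velocity unchanged; $\eta > 1$ moves past it, away from the weak model.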

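Reward-guided generation, for conditional generation that controls text attributes

This uses the idea of Flow Map Trajectory Guidance (FMTG): $\mathbf{x}_{t_{n+1}} = X_{t_n, t_{n+1}}(\mathbf{x}_{t_n}) + \lambda \nabla_{\mathbf{x}_{t_n}} r(X_{t_n, 1}(\mathbf{x}_{t_n}))$

Next step = the planned small step + $\lambda \times$ a correction along the reward gradient.
$r(X_{t_n, 1}(\mathbf{x}_{t_n}))$ means that the model looks ahead to its currently predicted endpoint, a reward model scores it, and the gradient is backpropagated to adjust the direction of the current step.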

VII. Pseudocode for the Algorithms

Algorithm 5 FLM training

Require: Dataset $\mathcal{D}$, reparameterization $\tau(t)$, lr $\eta$
Initialize: Denoiser $\hat{D}$
repeat
    $\mathbf{x}_1 \leftarrow f(\mathbf{y}), \mathbf{y} \sim \mathcal{D}; \quad \mathbf{x}_0 \sim \mathsf{N}(0, I)$
    Sample $t$ via $\tau(t) \sim \mathsf{U}[0, 1]$
    $I_t \leftarrow (1-t)\mathbf{x}_0 + t\mathbf{x}_1$
    $\hat{\mathbf{x}}_1 \leftarrow \hat{D}_t(I_t)$
    Update $\hat{D}$: $\mathcal{L}_{\text{CE}} = - \sum_l (\mathbf{x}_1^l)^\top \log \hat{\mathbf{x}}_1^l$
until converged

Algorithm 6 FLM sampling

Require: Trained $\hat{D}$, $\tau(t)$, steps $N$
$\mathbf{x}_0 \sim \mathsf{N}(0, I)$
$t_n \leftarrow \tau^{-1}(n/N)$ for $n=0, \ldots, N$
for $n=0$ to $N-1$ do
    $\hat{\mathbf{x}}_1 \leftarrow \hat{D}_{t_n}(\mathbf{x}_{t_n})$
    $\hat{b}_n \leftarrow (\hat{\mathbf{x}}_1 - \mathbf{x}_{t_n})/(1 - t_n)$
    $\mathbf{x}_{t_{n+1}} \leftarrow \mathbf{x}_{t_n} + (t_{n+1} - t_n)\hat{b}_n$
end for
Return $g(\mathbf{x}_{t_N})$
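
A runnable sketch of Algorithm 6 under strong simplifying assumptions: each position is an independent, uniformly distributed one-hot token, so the exact denoiser is available in closed form as $\hat{D}_t(x)^l = \mathrm{softmax}\big(t\,x^l/(1-t)^2\big)$ (this follows from Bayes' rule for $I_t = (1-t)x_0 + t x_1$ with Gaussian $x_0$). A trained network would replace `denoiser`; the uniform grid $t_n = n/N$ stands in for the $\tau$-based grid, and all names are illustrative.

```python
import math
import random

def denoiser(t, row):
    """Closed-form E[x1 | I_t = row] for a uniform one-hot prior (toy assumption)."""
    scale = t / (1.0 - t) ** 2
    m = max(row)
    exps = [math.exp(scale * (v - m)) for v in row]   # numerically stable softmax
    z = sum(exps)
    return [e / z for e in exps]

def flm_sample(length, vocab_size, n_steps, seed=0):
    """Euler sampler: returns decoded token ids and the final continuous state."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)] for _ in range(length)]
    for n in range(n_steps):
        t, t_next = n / n_steps, (n + 1) / n_steps
        for l in range(length):
            x1_hat = denoiser(t, x[l])
            # Euler step with b_hat = (x1_hat - x_t) / (1 - t)
            x[l] = [v + (t_next - t) * (p - v) / (1.0 - t)
                    for v, p in zip(x[l], x1_hat)]
    tokens = [max(range(vocab_size), key=row.__getitem__) for row in x]
    return tokens, x

tokens, final_state = flm_sample(length=4, vocab_size=6, n_steps=8)
```

The last Euler step lands on the denoiser output itself, so each row of `final_state` is a probability vector and the argmax decoding is well defined.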

Algorithm 7 FMLM training (distillation)

Require: Dataset $\mathcal{D}$, trained $\hat{D}$, $\tau(t)$, lr $\eta$
Initialize: Two-time denoiser $\hat{\delta}$
repeat
    $\mathbf{x}_1 \leftarrow f(\mathbf{y}), \mathbf{y} \sim \mathcal{D}; \quad \mathbf{x}_0 \sim \mathsf{N}(0, I)$
    Diagonal (anchors to $\hat{D}$):
    $\tau(s) \sim \mathsf{U}[0, 1]; \quad I_s \leftarrow (1-s)\mathbf{x}_0 + s\mathbf{x}_1$
    $\mathcal{L}_{\text{diag}} \leftarrow - \sum_l \hat{D}(I_s)^l \cdot \log \hat{\delta}_{s,s}^l(I_s)$
    Off-diagonal (semigroup):
    $h \sim \mathsf{U}[0, 1]; \quad \tau(s) \sim \mathsf{U}[0, 1-h]$
    $\tau(t) \leftarrow \tau(s) + h; \quad \tau(u) \leftarrow \frac{\tau(s)+\tau(t)}{2}$
    $\gamma \leftarrow \frac{(1-t)(u-s)}{(1-u)(t-s)}$
    $\hat{X}_{s,u} \leftarrow \frac{1-u}{1-s}I_s + \frac{u-s}{1-s}\hat{\delta}_{s,u}(I_s)$
    $\bar{\delta} \leftarrow \text{sg}(\gamma\hat{\delta}_{s,u}(I_s) + (1-\gamma)\hat{\delta}_{u,t}(\hat{X}_{s,u}))$
    $\mathcal{L}_{\text{off}} \leftarrow - \sum_l \bar{\delta}^l \cdot \log \hat{\delta}_{s,t}^l(I_s)$
    Update $\hat{\delta}$: $\mathcal{L}_{\text{diag}} + \mathcal{L}_{\text{off}}$
until converged

Algorithm 8 FMLM sampling

Require: Trained $\hat{\delta}$, $\tau(t)$, steps $N$
$\mathbf{x}_0 \sim \mathsf{N}(0, I)$
$t_n \leftarrow \tau^{-1}(n/N)$ for $n=0, \ldots, N$
for $n=0$ to $N-1$ do
    $\mathbf{x}_{t_{n+1}} \leftarrow \frac{1-t_{n+1}}{1-t_n}\mathbf{x}_{t_n} + \frac{t_{n+1}-t_n}{1-t_n}\hat{\delta}_{t_n, t_{n+1}}(\mathbf{x}_{t_n})$
end for
Return $g(\mathbf{x}_{t_N})$

I. Problem Definition

Given a vocabulary (token set) $V$, let $\lvert V\rvert$ denote the number of token types. A sentence sample of length $L$ is denoted as $y = (y^l)_{l=1}^L \in V^L$.

The goal is to predict the distribution $p(y)$ over $V^L$ using as few computational and inference steps as possible, to generate a meaningful sample $y_{\text{sample}}$.

\[\overbrace{\underset{v \in V}{\underline{\text{Mary}}} \text{ is a girl.}}^{y \in V^L}\]

Metrics considered:

Generation quality
Inference steps
Computational overhead

II. Existing Solutions

Autoregressive language models

These follow the chain rule rather than a Markov chain, as the prediction of each token depends on the context of all previous tokens.

The joint distribution can be written as:

\[p(y) = p(y^1)p(y^2 \mid y^1) p(y^3 \mid y^1,y^2) \cdots p(y^L \mid y^{<L})\]

The computation is performed token-by-token from left to right, requiring $L$ steps in total.

Each step requires selecting a token from $V$.

The computational overhead is $O(\lvert V\rvert \cdot L)$.

Inference steps are $O(L)$.

Discrete diffusion language models

The idea is to accelerate generation by computing multiple tokens in parallel. Generally, significant speedup is only achieved when the number of inference steps is less than $L$.

In DDLMs, we define the sentence-level transition density:

\[p_{t \mid s}(y_t \mid y_s), \quad y_s, y_t \in V^L\] \[\text{Corruption process} \left\{ \begin{array}{l c c c c c c c} & t = 0 & \rightarrow & t = 0.3 & \rightarrow & t = 0.7 & \rightarrow & t = 1 \\[1em] \text{masking: } & \text{Mary is a girl.} & \rightarrow & \text{Mary [m] a girl.} & \rightarrow & \text{Mary [m] a [m]} & \rightarrow & \text{[m] [m] [m] [m]} \\[0.5em] & \rlap{\color{gray}{\text{([m] has zero info, but corrupted positions are known)}}} & & & & & & \\[1.5em] \text{uniform randomization: } & \text{Mary is a girl.} & \rightarrow & \text{Mary is a \textbf{boy}.} & \rightarrow & \text{Mary \textbf{angry} a \textbf{boy}} & \rightarrow & \text{\textbf{dog} \textbf{hello} a \textbf{boy}} \\[0.5em] & \rlap{\color{gray}{\text{(Cannot infer corrupted positions nor original tokens)}}} & & & & & & \\ \end{array} \right.\]

Denoising:

If we directly compute $P_{t\mid s}(y_t \mid y_s)$ in the $V^L$ space, it can be written as:

\[P_{t\mid s}(y_t \mid y_s) = \prod_{l=1}^{L} P_{t\mid s}(y_t^{l} \mid y_t^{<l}, y_s)\]

Under this premise, updating $y$ at each step is essentially performing a mapping $y_{s} \in V^L \mapsto y_{t} \in V^L$.

The computation is performed concurrently for all tokens.
Each step requires selecting a token from $V$ for every position in $y_t$.
Inference steps are $O(N)$. For acceleration, $N$ should be a constant much smaller than $L$.
The computational overhead is $O(\lvert V\rvert^L)$.

Therefore, this is generally intractable. DDLMs typically adopt a factorized approximation:

\[\widehat{P}_{t\mid s}(y_t \mid y_s) := P_{t\mid s}^1(y_t^1 \mid y_s) P_{t\mid s}^2(y_t^2 \mid y_s) \cdots P_{t\mid s}^L(y_t^L \mid y_s)\]

The superscript $i$ in $P_{t \mid s}^{i}$ indicates that the probability computation is position-dependent.

This means:

The computational overhead drops to $O(L \times \lvert V\rvert)$, but it loses a significant amount of contextual correlation, leading to a degradation in generation quality.
Furthermore, the prediction of each token is assumed to be approximately independent, which only holds true when $t \rightarrow s$.
The small update step sizes mean that the number of inference steps $N$ during generation remains quite large.

III. Continuous Representation of Language

This step aims to establish the mapping $y \in V^L \leftrightarrow x \in \mathcal{X}$, where $\mathcal{X}$ is a continuous space, typically a high-dimensional real coordinate space $\mathbb{R}^{L \times \lvert V\rvert}$.

Usually, we use a continuous encoder (embedding) $f$ and a decoder $g$ to perform this mapping:

Encoder: $f : y \mapsto x$
Decoder: $g : x \mapsto y$, such that $g(f(y)) = y$.

Thus, the distribution $p(y)$ over sequences induces a distribution $p(x)$ on $\mathcal{X}$ as the pushforward of $p(y)$ under $f$ (that is, $x = f(y)$ with $y \sim p(y)$), and $y$ is recovered via $y = g(x)$.

The inference and decoding process can be rewritten as:

\[\hat{x} \sim p(x), \quad \hat{y} = g(\hat{x})\]

That is, sample in the continuous space first, then map back to the discrete sequence.

Common choices for the encoding method $f$ include:

Learned Embedding: A low-dimensional vector $\theta_i$ is randomly initialized for each token index $i$, and $\theta$ is updated in real-time as part of the model parameters.

This method has strong expressive power but requires careful regularization; otherwise, all $\theta_i$ might collapse to the same value, or $\lvert\theta\rvert \to \infty$.

Pretrained Embedding: Easy to implement. It utilizes the embedding layers of pre-trained language models like T5 (as in ELF) or BERT, inheriting prior knowledge, and is frozen during training.
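One-hot Encoding: Maps each token directly to a one-hot vector:

\[f: y \mapsto (\text{onehot}(y^1), \ldots, \text{onehot}(y^L))^\top, \quad V^L \to \mathbb{R}^{L \times \lvert V\rvert}\] \[g: x \mapsto (\arg\max(x^1), \ldots, \arg\max(x^L)), \quad \mathbb{R}^{L \times \lvert V\rvert} \to V^L\]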

This representation requires no regularization and is well-suited for cross-entropy training. It is lossless in information but highly dimensional, and semantic similarity is not easily expressed directly.
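
A minimal sketch of the one-hot encoder $f$ and the argmax decoder $g$ in plain Python. The toy vocabulary and the perturbation `noisy` are illustrative assumptions; the point is only that $g(f(y)) = y$ and that argmax decoding tolerates small perturbations.

```python
def onehot(index, vocab_size):
    """One-hot row of length vocab_size with 1.0 at `index`."""
    return [1.0 if k == index else 0.0 for k in range(vocab_size)]

def encode(y, vocab_size):
    """f: token ids (length L) -> L x |V| matrix stored as a list of rows."""
    return [onehot(tok, vocab_size) for tok in y]

def decode(x):
    """g: L x |V| matrix -> token ids via a row-wise argmax."""
    return [max(range(len(row)), key=row.__getitem__) for row in x]

vocab = ["Mary", "is", "a", "girl", "boy"]  # toy vocabulary (assumption)
y = [0, 1, 2, 3]                            # token ids for "Mary is a girl"
x = encode(y, len(vocab))

# Small perturbation (assumption): decoding survives as long as the
# correct coordinate remains the largest in its row.
noisy = [[v + 0.3 * ((i + k) % 2) for k, v in enumerate(row)]
         for i, row in enumerate(x)]
```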

IV. Flow Language Model

The core problem now becomes: learning the continuous data distribution $p(x)$ in the embedding space.

1. Interpolation

The goal is to bridge the noise distribution $p_0$ and the data distribution $p_1$, defining a probability path that evolves over time.

Linear Interpolation:

\[I_t := (1-t)x_0 + tx_1, \quad I_t \sim p_t\]

Where:

$x_0 \sim p_0 = \mathcal{N}(0, I)$ represents the noise.
$x_1 \sim p_1$ represents the target sample.
$t \in [0,1]$ is the time variable.
$I_t$ denotes the intermediate state at time $t$ during training, distinguished from the variable $x_t$ which denotes the intermediate state at time $t$ during inference.

The velocity field of this approach is easy to learn, the path is a straight line, and training is generally efficient and stable.

Stochastic Interpolation with Noise:

\[I_t = (1-t)x_0 + tx_1 + \sqrt{2t(1-t)}z, \quad z \sim \mathcal{N}(0, I_d)\]

Trigonometric Interpolation:

\[I_t = \cos\left(\frac{\pi}{2}t\right)x_0 + \sin\left(\frac{\pi}{2}t\right)x_1\]

This maintains a constant variance for the intermediate state distributions, yielding “variable-speed generation.”
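
To see why the variance is preserved, assume (an added assumption, not stated in the text) that $x_0$ and $x_1$ are independent, zero-mean, and both have identity covariance. Then:

\[\operatorname{Cov}(I_t) = \cos^2\left(\tfrac{\pi}{2}t\right)\operatorname{Cov}(x_0) + \sin^2\left(\tfrac{\pi}{2}t\right)\operatorname{Cov}(x_1) = \left(\cos^2\left(\tfrac{\pi}{2}t\right) + \sin^2\left(\tfrac{\pi}{2}t\right)\right) I = I\]

For data whose covariance differs from the identity (one-hot targets, for instance), the variance is no longer exactly constant along the path.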

Additionally, encoder-decoder-style paths can be considered, for example:

\[I_t = (1-t)x_0 + tx_1 + \sin^2(\pi t)\,z, \quad z \sim \mathcal{N}(0, I_d)\]

Intuitively:

$t: 0 \to 0.5$ acts more like an encoding/corruption process, mapping into a pure-noise latent space.
$t \to 1$ acts more like decoding from the latent space back to the target space.

2. Probability Flow and Vector Fields

The goal here is to learn a deterministic ODE/SDE from the generated probability paths. This is somewhat similar to the idea of world models inferring physical dynamics by observing trajectories.

If linear interpolation is adopted, we introduce a velocity field $b_t(x_t)$ to describe the update direction of $x_t$ at time $t$, defined as:

\[b_t(x_t) = \dot{x_t}\qquad x_0 \sim p_0, t \in [0,1]\]

The flow dynamics of the entire model can be described by an ODE or an SDE:

ODE: $\frac{d}{dt} x_t = b_t(x_t)$
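SDE: $dx_t = b_t(x_t)\,dt + \sqrt{2\epsilon}\,dW_t$, where $W_t$ is standard Brownian motion and $\epsilon$ is the noise intensity; the added stochastic jitter is intended to improve the model's robustness and generalization. (For the SDE to keep the same marginals $p_t$ as the ODE, the drift generally also needs a score correction $\epsilon \nabla \log p_t(x_t)$.)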

Starting from the initial distribution $p_0$, we just need to progressively solve the ODE/SDE to obtain samples from the final distribution $p_1$, without needing to randomly resample at every step.

3. Fitting & MSE Loss

Let’s briefly summarize the notation:

Initial noise: $x_0$, target sample: $x_1$, time variable: $t \in [0,1]$.
Intermediate state during training: $I_t := (1-t)x_0 + tx_1$, the true ground-truth velocity field during training is $\dot{I_t} = x_1-x_0$.
Intermediate state during inference: $x_t := x_0 + \int_0^t b_s(x_s) \, ds$, where $b_t(x_t)$ is the predicted velocity field, determining the update direction at each step.

Thus, as long as $b_t(x)$ can be predicted or computed, the entire ODE/SDE sampling process is determined.

For a specific time $t$, we can take the conditional expectation of the velocities of all trajectories passing through that point, yielding:

\[b_t(x) = \mathbb{E}[\dot{I_t} \mid I_t = x] = \mathbb{E}[x_1 - x_0 \mid I_t = x]\]

Here $I_1 = x_1$ is the endpoint of a training trajectory that passes through $x$ at time $t$, and the expectation is taken over all pairs $(x_0, x_1)$ whose interpolant satisfies $I_t = x$. The identity holds for every $x$ in the support of $p_t$.

This also means the problem can be formulated as a regression problem.

Defining the MSE Loss:

\[\mathcal{L}_{\text{MSE}}(\hat{b}) := \int_0^1 \mathbb{E}\left[\left\vert \hat{b}_t(I_t) - \dot{I_t} \right\vert^2\right] dt\]

In practical implementation, the integral is typically discretized into a grid:

\[0 = t_0 < t_1 < \cdots < t_N = 1\]

Then, the optimal velocity field satisfies:

\[b = \arg\min_{\hat{b}} \mathcal{L}_{\text{MSE}}(\hat{b})\]
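
As a sanity check on the regression view, the sketch below estimates $\mathcal{L}_{\text{MSE}}$ by Monte Carlo in one dimension. It assumes a degenerate data distribution in which every sample equals a constant `c` (a hypothetical toy case), for which the optimal field has the closed form $b_t(x) = (c - x)/(1-t)$. The time `t` is drawn from $[0, 0.99)$ to stay away from the singularity at $t = 1$.

```python
import random

def mc_mse(b_hat, c, n_samples=2000, seed=0):
    """Monte Carlo estimate of E |b_hat(t, I_t) - (x1 - x0)|^2 with x1 = c."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = 0.99 * rng.random()          # t ~ U[0, 0.99)
        x0 = rng.gauss(0.0, 1.0)         # noise sample
        i_t = (1.0 - t) * x0 + t * c     # linear interpolant
        target = c - x0                  # true velocity dI_t/dt
        total += (b_hat(t, i_t) - target) ** 2
    return total / n_samples

c = 2.0
optimal = lambda t, x: (c - x) / (1.0 - t)   # closed form for point-mass data
zero = lambda t, x: 0.0                      # a deliberately poor baseline

loss_optimal = mc_mse(optimal, c)
loss_zero = mc_mse(zero, c)
```

The closed-form field reaches a loss of essentially zero, while the zero field is left with roughly $\mathbb{E}[(c - x_0)^2] = c^2 + 1$.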

Issues with Directly Predicting the Velocity Field

This is a unique problem when using flow for language generation. If we directly predict:

\[b = x_1 - x_0\]

Since $x_0 \in \mathbb{R}^{L \times \lvert V\rvert}$ lies in a high-dimensional discrete representation space, and the Gaussian noise $x_0$ is generally full-rank, the neural network $\phi: \mathbb{R}^{L \times \lvert V\rvert} \mapsto \mathbb{R}^{L \times d} \mapsto \mathbb{R}^{L \times \lvert V\rvert}$ has a hidden state dimension $d$ that is insufficient to express each $b_i \in \mathbb{R}^{\lvert V\rvert}$ output from the network ($d \ll \lvert V\rvert$). Thus, using $b$ directly as the training target inevitably leads to underfitting. (Experimentally verified in ELF C.1, Figure 10).

A common view is that although $x \in \mathbb{R}^{L \times \lvert V\rvert}$, natural language samples do not uniformly cover this space. They are more likely concentrated on a low-dimensional manifold, so they can be learned through a more compact representation.

The true velocity field $\dot{I_t}$ is a constant equal to $x_1 - x_0$, independent of $t$:

\[\dot{I_t} = x_1 - x_0,\]

The predicted velocity field is rewritten as a prediction of the endpoint:

\[\hat{b}_t(I_t) = \frac{\hat{x}_1 - x_t}{1-t}\]

Where $\hat{x}_1$ is the predicted final outcome, $\hat{x}_1(x_t) = \mathbb{E}[x_1 \mid I_{t} = x_t]$.

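Thus, with $x_t = I_t$ during training, the original MSE objective can be rewritten as:

\[\begin{aligned} \mathcal{L}_{\text{MSE}}(\hat{b}) &= \int_0^1 \mathbb{E}\left[\left\vert \hat{b}_t(I_t) - \dot{I_t} \right\vert^2\right] dt = \int_0^1 \mathbb{E}\left[\left\vert \frac{\hat{x}_1 - x_t}{1-t} - (x_1 - x_0) \right\vert^2\right] dt \\ &= \int_0^1 \mathbb{E}\left[\left\vert \frac{\hat{x}_1 - x_{1} + ((1-t)x_{0} + tx_{1} - x_t)}{1-t} \right\vert^2\right] dt \\ &= \int_0^1 \frac{1}{(1-t)^2}\mathbb{E}\left[\left\vert \hat{x}_1 - x_{1} \right\vert^2\right] dt \end{aligned}\]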

At this point, training can also be understood as directly approximating the final target sample. However, note that the term $\frac{1}{(1-t)^2}$ turns the integral into an improper integral, causing the loss function value to approach infinity as $t \to 1$, which can lead to training instability.
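
The identity behind this rewrite can be checked numerically. The sketch below (scalar case, arbitrary illustrative values) converts an endpoint prediction into a velocity and confirms that the velocity error equals the endpoint error divided by $1 - t$, which is where the $\frac{1}{(1-t)^2}$ weight comes from.

```python
import math

def velocity_from_endpoint(x1_hat, x_t, t):
    """b_hat = (x1_hat - x_t) / (1 - t), valid for t < 1."""
    return (x1_hat - x_t) / (1.0 - t)

x0, x1 = -0.7, 1.0          # noise and target (illustrative values)
x1_hat = 0.8                # an imperfect endpoint prediction (assumption)

for t in (0.1, 0.5, 0.9, 0.99):
    i_t = (1.0 - t) * x0 + t * x1
    velocity_error = velocity_from_endpoint(x1_hat, i_t, t) - (x1 - x0)
    endpoint_error = (x1_hat - x1) / (1.0 - t)
    assert math.isclose(velocity_error, endpoint_error, rel_tol=1e-9, abs_tol=1e-9)

# Squared-error weight at each t: it grows without bound as t -> 1.
weights = {t: 1.0 / (1.0 - t) ** 2 for t in (0.5, 0.9, 0.99)}
```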

Countermeasures

FMLM: Drops the $\frac{1}{(1-t)^2}$ factor and uses the approximate objective $\int_0^1 \mathbb{E}\left[(\hat{x}_1 - x_{1})^2\right] dt$. This is not strictly justified; in practice it works better in conjunction with Cross-Entropy (CE), while keeping $t$ relatively small.
ELF:
1. Instead of sampling $t \in [0,1]$ uniformly, it samples from a logit-normal distribution: $t' \sim \mathcal{N}(P_{mean}, P_{std}^2),\ t = \sigma(t')$, where $\sigma$ is the sigmoid.
2. A separate decoding branch is set up directly at a position where $t$ is close to $1$, which is much more suitable for applying CE loss.
3. Since the optimal prediction $\hat{x}_1$ must satisfy $\hat{x}_1 - x_1 = O(1 - t)$, this is equivalent to introducing an implicit Lipschitz constraint. Naturally, we can consider introducing an explicit Lipschitz regularization term to enforce this constraint, which could be explored later.
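
A minimal sketch of the time sampling in item 1: draw a Gaussian variable and squash it through a sigmoid. The values `p_mean = 0.0` and `p_std = 1.0` are placeholders chosen for illustration, not the settings used by ELF.

```python
import math
import random

def sample_time(p_mean, p_std, rng):
    """Draw t = sigmoid(t'), with t' ~ N(p_mean, p_std^2); t lies in (0, 1)."""
    t_prime = rng.gauss(p_mean, p_std)
    return 1.0 / (1.0 + math.exp(-t_prime))

rng = random.Random(0)
times = [sample_time(0.0, 1.0, rng) for _ in range(10000)]
mean_time = sum(times) / len(times)
```

Raising `p_mean` shifts the samples toward $t = 1$, and lowering `p_std` concentrates them around $\sigma(P_{mean})$.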

4. Decoding & CE Loss

When fitting in a continuous space, MSE Loss is the most natural training objective. However, when the data ultimately needs to be converted back into a discrete representation, we can introduce Cross-Entropy (CE) as a more suitable training objective.

FLM’s CE Objective

FLM uses one-hot encoding, meaning that in the target $x_1 \in \mathbb{R}^{L \times \lvert V\rvert}$, every component $x_{1}^l$ has only one element equal to 1, and the rest are 0.

FLM defines the cross-entropy objective as:

\[\mathcal{L}_{CE}(\hat{x_1}) := \int_0^1 \mathbb{E}_{I_t \sim p_t}\left[-\sum_{l=1}^L \log \hat{x_1}(I_t)^l \cdot x_1^l\right] dt\]

Where

\[\log \hat{x_1}(I_t)^l \cdot x_1^l = \log \left( \text{probability of the model predicting the correct Token } c \right)\]
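For a single sample at a fixed $t$, the integrand reduces to a sum of negative log-probabilities of the correct tokens. The sketch below illustrates this with made-up probabilities; `pred_probs` stands in for $\hat{x}_1(I_t)$ and is not produced by a real model.

```python
import math

def ce_with_one_hot(pred_probs, one_hot_targets):
    """-sum_l sum_v x_1^l[v] * log(x1_hat^l[v]), skipping zero entries of the target."""
    total = 0.0
    for probs, target in zip(pred_probs, one_hot_targets):
        total -= sum(x * math.log(p) for p, x in zip(probs, target) if x > 0)
    return total

def ce_with_token_ids(pred_probs, token_ids):
    """Equivalent form: -sum_l log(probability assigned to the correct token)."""
    return -sum(math.log(probs[c]) for probs, c in zip(pred_probs, token_ids))

# Hypothetical predictions for L = 2 positions over a vocabulary of size 3.
pred_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
one_hot_targets = [[1, 0, 0], [0, 1, 0]]
token_ids = [0, 1]

loss_a = ce_with_one_hot(pred_probs, one_hot_targets)
loss_b = ce_with_token_ids(pred_probs, token_ids)
print(loss_a, loss_b)  # both equal -log(0.7) - log(0.8)
```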

Key Properties:

$\mathcal{L}_{CE}$ can be decomposed into an irreducible conditional entropy plus the sum of KL divergences from the true posterior to the model distribution. Therefore, minimizing CE is essentially performing distribution matching of $\hat{x}_1$ to $x_1$.
If the CE error $\Delta_D(\hat{D}) := \mathcal{L}_{CE}(\hat{D}) - \mathcal{L}_{CE}(D) \leq \varepsilon$, then for any early stopping time $\xi \in (0,1)$:

\[W_2^2(\hat{p}_{1-\xi}, p_{1-\xi}) \leq C\varepsilon\]

Where $C > 0$ depends on $\xi$ and the model’s Lipschitz constant. Stopping at $1-\xi$ avoids the $(1-t)^{-1}$ singularity during conversion. Simply put, a lower CE training loss guarantees a tighter bound on the distance between the generated distribution and the true one.

Connection to Discrete Diffusion: The optimal denoiser implicitly learns the factorized posterior $p^l_{1\mid t}(x_1^l \mid x_t)$, which aligns with how discrete diffusion models learn $p^l_{1\mid t}$ via tokenwise CE. The key difference is: discrete models use $p^l_{1\mid t}$ for ancestral sampling (which requires the full joint probability density and is affected by factorization errors); continuous models use $p^l_{1\mid t}$ to infer exact velocity fields, which can then be precisely distilled into few-step generators.

In summary, FLM’s CE is a per-step objective along the entire flow trajectory, meaning token-level cross-entropy supervision is applied for all $t \in [0,1]$.

ELF’s CE Objective

ELF’s CE is only applied at the final time step $t = 1$, rather than as a per-step objective along the entire trajectory. This is the core design difference between ELF and FLM (or other continuous DLMs).

At $t = 1$, ELF naturally treats the denoising process as a continuous-to-discrete decoding step. The same network $\text{net}_\theta$ switches to a “decode” mode at $t = 1$ (controlled via a binary mode token), eliminating the need for a separate decoder—it can be understood as a decoder sharing weights with the denoiser.
As $t \to 1$, $z_t \to x$ (approaching the clean embedding), at which point the network input becomes too trivial (essentially the answer itself). Therefore, ELF introduces a token-level corruption process at $t = 1$, replacing the clean embedding with a corrupted version $\tilde{z}$ to create a meaningful training signal.
CE Loss:

\[\mathcal{L}_{CE} = \mathbb{E}_{\tilde{z}}\left[\text{CrossEnt}(W\mathbf{x}_\theta(\tilde{z}), s)\right]\]

Where:

$\mathbf{x}_\theta(\tilde{z})$ is the network’s prediction of the clean embedding given the corrupted input $\tilde{z}$.
$W$ is a learnable “unembedding” matrix that maps the embedding back to logits.
$s$ is the ground-truth token (the original discrete token).

During inference: $W\mathbf{x}_\theta(z_t)$ is computed only at the final step $t = 1$, followed by an argmax over the logits to obtain the discrete token.
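The unembedding, CE loss, and argmax decoding can be sketched for a single token position as below. The matrix `W`, the predicted embedding `x_pred`, and all helper names are toy values assumed for illustration; they are not taken from ELF's implementation.

```python
import math

def unembed(W, x):
    """Map an embedding x to logits W x (W has one row per vocabulary entry)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def cross_entropy_from_logits(logits, s):
    """CrossEnt(logits, s) = logsumexp(logits) - logits[s], computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[s]

def decode(logits):
    """Argmax over the logits gives the discrete token id."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy setup: vocabulary of 3 tokens, embedding dimension 2.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x_pred = [0.2, 1.5]            # stands in for the network output
logits = unembed(W, x_pred)

print(decode(logits))                        # 1
print(cross_entropy_from_logits(logits, 1))  # loss if the true token s is 1
```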

Reasoning for why ELF avoids applying CE to intermediate denoising steps: ELF’s denoising trajectory operates entirely within an unrestricted continuous embedding space. Per-step token-level supervision would overly bind the trajectory to vocabulary-level predictions, restricting the flexibility of the flow dynamics. Performing discretization exclusively at $t = 1$ grants the flow maximum degrees of freedom when $t < 1$.

Core divergence: FLM posits that since the optimal denoiser naturally lies on the simplex, CE is the most suitable training objective and should be used along the entire trajectory. Conversely, ELF argues that per-step CE restricts the flexibility of flow dynamics, and thus discretization should only occur at the final step, leaving intermediate steps to evolve freely in the continuous space.

V. Flow Map Language Models

1. Introducing the Flow Map

Sampling from a flow requires numerically integrating the ODE, and this integration becomes inaccurate when only a few steps are used. To avoid computing the integral, we introduce the flow map.

Defined as the solution operator for $\dot{x}_t = b_t(x_t), \quad x_0 \sim p_0, \quad t \in [0,1]$:

\[X_{s,t}(x_s) = x_t, \quad \forall s,t \in [0,1].\]

That is, the transformation function between any two arbitrary moments in time.

The equation can also be rewritten as:

\[X_{s,t}(x_s) = x_s + (t-s)v_{s,t}(x_s),\]

Where $v_{s,t}$ is called the average velocity from $s$ to $t$, or “mean flow”.

Given a sample $\hat{x}_0$, with the flow map available, we can discretize a grid $0 = t_0 < \cdots < t_N = 1$ and sequentially compute:

\[\hat{x}_{t_{i+1}} = X_{t_i, t_{i+1}}(\hat{x}_{t_i})\]

to obtain the sample. In the extreme case $N = 1$, we achieve single-step generation $\hat{x}_1 = X_{0,1}(x_0)$. Generally, a larger $N$ yields better performance.
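The grid-based sampler can be sketched on a toy flow whose exact flow map is known. The assumption here, which is not from FMLM, is a scalar flow between $x_0 \sim \mathcal{N}(0,1)$ and an independent $x_1 \sim \mathcal{N}(0,\sigma^2)$, for which $X_{s,t}(x) = x \cdot \mathrm{std}(t)/\mathrm{std}(s)$.

```python
import math

SIGMA = 0.5  # assumed std of the toy target distribution

def std(t):
    """Std of I_t = (1 - t) x_0 + t x_1 for the toy Gaussian pair."""
    return math.sqrt((1 - t) ** 2 + (t * SIGMA) ** 2)

def flow_map(s, t, x):
    """Exact X_{s,t} of the toy flow (a learned model would replace this)."""
    return x * std(t) / std(s)

def sample(x0, n_steps):
    """Apply X_{t_i, t_{i+1}} along the uniform grid 0 = t_0 < ... < t_N = 1."""
    x = x0
    for i in range(n_steps):
        t_i, t_next = i / n_steps, (i + 1) / n_steps
        x = flow_map(t_i, t_next, x)
    return x

one_step = sample(1.3, n_steps=1)
eight_steps = sample(1.3, n_steps=8)
print(one_step, eight_steps)
```

Because the toy map is exact, one step and eight steps agree up to rounding. With a learned flow map, the per-step approximation error is what makes the choice of $N$ matter.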

2. Conditions Required to Learn a Reasonable Flow Map

1. Identity Condition:

\[X_{s,s}(x) = x, \quad \forall x \in \mathbb{R}^{L \times \lvert V\rvert}, \forall s \in [0,1].\]

This condition is trivial on its own, so we fold it into the next two conditions as the initial condition of each equation.

2. Instantaneous Velocity Condition (Lagrangian Equation):

\[\partial_t X_{s,t}(x) = b_t(X_{s, t}(x)) \quad X_{s,s}(x) = x,\quad \forall x \in \mathbb{R}^{L \times \lvert V\rvert}, \forall s,t \in [0,1].\]

3. Endpoint Invariance Condition (Eulerian Equation):

\[\partial_s X_{s,t}(x) + b_s(x) \cdot \nabla X_{s,t}(x) = 0, \quad X_{t,t}(x) = x,\quad \forall x \in \mathbb{R}^{L \times \lvert V\rvert}, \forall s,t \in [0,1].\]

4. Semi-group Condition:

\[X_{u,t}(X_{s,u}(x)) = X_{s,t}(x), \quad \forall x \in \mathbb{R}^{L \times \lvert V\rvert}, \forall s,u,t \in [0,1].\]

It is worth adding that the latter three are fundamentally equivalent, describing properties of the exact same flow. The second one degenerates into a local tangent condition when taking the limit $s \to t$:

\[\lim_{s \to t} \partial_t X_{s,t}(x) = b_t(x)\]

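The third one is derived from the fact that, along a trajectory $x_s$, the endpoint at time $t$ does not change with the starting time $s$:

\[\frac{d}{ds} X_{s,t}(x_s) = 0\]

Expanding the total derivative with $\dot{x}_s = b_s(x_s)$ gives the Eulerian equation.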
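The Lagrangian and Eulerian conditions can be checked by finite differences on a toy flow with a closed-form flow map. The sketch assumes a scalar Gaussian flow (not part of FMLM) with $X_{s,t}(x) = x \cdot \mathrm{std}(t)/\mathrm{std}(s)$ and $b_t(x) = x \cdot \mathrm{std}'(t)/\mathrm{std}(t)$.

```python
import math

SIGMA = 0.5  # assumed std of the toy target distribution

def std(t):
    return math.sqrt((1 - t) ** 2 + (t * SIGMA) ** 2)

def d_std(t):
    """Analytic derivative of std(t) with respect to t."""
    return (-(1 - t) + t * SIGMA ** 2) / std(t)

def b(t, x):
    """Velocity field of the toy flow."""
    return x * d_std(t) / std(t)

def X(s, t, x):
    """Exact flow map of the toy flow."""
    return x * std(t) / std(s)

def central_diff(f, z, eps=1e-5):
    return (f(z + eps) - f(z - eps)) / (2 * eps)

s, t, x = 0.3, 0.8, 1.7

# Lagrangian: d/dt X_{s,t}(x) - b_t(X_{s,t}(x)) should vanish.
lagrangian = central_diff(lambda tt: X(s, tt, x), t) - b(t, X(s, t, x))

# Eulerian: d/ds X_{s,t}(x) + b_s(x) * dX/dx should vanish.
eulerian = (central_diff(lambda ss: X(ss, t, x), s)
            + b(s, x) * central_diff(lambda xx: X(s, t, xx), x))

print(lagrangian, eulerian)  # both residuals are close to zero
```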

By combining the tangent property with 2, 3, and 4 respectively, we can obtain three equivalent constraint forms:

Lagrangian map distillation (LMD) loss:

\[\mathcal{L}_{\text{LMD}}(\hat{v}) = \underbrace{\int_0^1 \int_0^t \mathbb{E} \left\vert\partial_t \hat{X}_{s,t}(I_s) - \text{sg}\left(\hat{b}_t\left(\hat{X}_{s,t}(I_s)\right)\right)\right\vert^2 ds dt}_{\text{Corresponds to: Lagrangian condition}} + \underbrace{\int_0^1 \mathbb{E} \vert\hat{v}_{t,t}(I_t) - \hat{b}_t(I_t)\vert^2 dt}_{\text{Corresponds to: Tangent condition}}\]

Eulerian map distillation (EMD) loss:

\[\mathcal{L}_{\text{EMD}}(\hat{v}) = \underbrace{\int_0^1 \int_0^t \mathbb{E} \left\vert\partial_s \hat{X}_{s,t}(I_s) + \text{sg}\left(\hat{b}_s(I_s) \cdot \nabla \hat{X}_{s,t}(I_s)\right)\right\vert^2 ds dt}_{\text{Corresponds to: Eulerian condition}} + \underbrace{\int_0^1 \mathbb{E} \vert\hat{v}_{t,t}(I_t) - \hat{b}_t(I_t)\vert^2 dt}_{\text{Corresponds to: Tangent condition}}\]

Progressive map distillation (PMD) loss:

\[\mathcal{L}_{\text{PMD}}(\hat{v}) = \underbrace{\int_0^1 \int_0^t \int_s^t \mathbb{E} \left\vert\hat{X}_{s,t}(I_s) - \text{sg}\left(\hat{X}_{u,t}\left(\hat{X}_{s,u}(I_s)\right)\right)\right\vert^2 du ds dt}_{\text{Corresponds to: Semi-group condition}} + \underbrace{\int_0^1 \mathbb{E} \vert\hat{v}_{t,t}(I_t) - \hat{b}_t(I_t)\vert^2 dt}_{\text{Corresponds to: Tangent condition}}\]

For self-distillation, we just need to replace $\hat{b}_t(I_t)$ in the last term of each equation with $\dot{I_t}$.

FMLM adopts the last constraint form (PMD): the first term is the associativity constraint (semi-group), and the second is the instantaneous velocity constraint.

Here, sg stands for stop-gradient (no backpropagation). The first term can be further expanded using $X_{s,t}(x_s) = x_s + (t-s)v_{s,t}(x_s)$, and in the second term the pre-trained $\hat{b}_t(I_t)$ can be replaced with $\dot{I_t}$.

$X_{s, t}$ itself is derived from Gaussian Noise. Directly constraining it would face the same issues as directly computing $b_t$. To unify Flow Matching and Flow Map in the probability space while utilizing the simplex constraint brought by One-Hot encoding, FMLM introduces a new optimization target:

\[\delta_{s,t}(x_s) := x_s + (1-s)v_{s,t}(x_s).\]

This can be conceptualized as a matrix $T \in \mathbb{R}^{N \times N}$:

The horizontal and vertical axes correspond to time $(s,t)$.
Flow Matching is equivalent to filling in the diagonal values $T_{tt}$ (i.e., $\hat{b}_t = \hat{v}_{t,t}$).
The Flow Map is equivalent to filling in the off-diagonal entries, i.e., the upper or lower triangular region of the matrix.

An additional benefit of this representation is that we no longer need to use MSE for numerical regression; instead, we can directly use CE (Cross Entropy) for classification. Using MSE for classification can lead to numerical instability.

3. Properties of $\delta_{s,t}$

The $\delta_{s,t}$ chosen by FMLM has the following properties:

Property 1. Quick correspondence with flow matching:

\[X_{s,t}(x) = \frac{1-t}{1-s}x + \frac{t-s}{1-s}\delta_{s,t}(x)\]

Property 2. Resides on a simplex, low dimensionality:

\[\delta_{s,t}(x)^\ell \in \Delta^{\lvert V\rvert-1}, \quad \ell = 1,2,\cdots,L.\]

Where $\Delta^{\lvert V\rvert-1}$ satisfies that all values are non-negative and sum to 1.

Property 3. Diagonal matches the target denoiser / imputation function:

\[\delta_{s,s}(x) = \mathbb{E}_{I_s \sim p_s}[x_1 \mid I_s = x] = \hat{x}_1(x), \quad \forall x \in \mathbb{R}^{L \times \lvert V\rvert}, \forall s \in [0,1].\]

Property 4. Satisfies associativity (Semi-group condition):

\[\delta_{s,t}(x) = \gamma \, \delta_{s,u}(x) + (1-\gamma) \, \delta_{u,t}(X_{s,u}(x)),\]

Where

\[\gamma = \frac{(1-t)(u-s)}{(1-u)(t-s)}.\]
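Properties 1 and 4 are algebraic consequences of the definition of $\delta_{s,t}$ and the semi-group property of $X_{s,t}$, so they can be verified numerically on any exact flow map. The sketch below uses an assumed scalar Gaussian toy flow; it only checks the algebra and does not satisfy the simplex property, which relies on one-hot targets.

```python
import math

SIGMA = 0.5  # assumed std of the toy target distribution

def std(t):
    return math.sqrt((1 - t) ** 2 + (t * SIGMA) ** 2)

def X(s, t, x):
    """Exact flow map of the toy flow."""
    return x * std(t) / std(s)

def delta(s, t, x):
    """delta_{s,t}(x) = x + (1 - s) * v_{s,t}(x), with v the average velocity."""
    v = (X(s, t, x) - x) / (t - s)
    return x + (1 - s) * v

def gamma(s, u, t):
    return (1 - t) * (u - s) / ((1 - u) * (t - s))

s, u, t, x = 0.1, 0.4, 0.9, 1.2

# Property 1: recover X_{s,t} from delta_{s,t}.
recon = (1 - t) / (1 - s) * x + (t - s) / (1 - s) * delta(s, t, x)

# Property 4: two-hop combination through the relay time u.
g = gamma(s, u, t)
two_hop = g * delta(s, u, x) + (1 - g) * delta(u, t, X(s, u, x))

print(recon - X(s, t, x), two_hop - delta(s, t, x))  # both differences are near zero
```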

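The loss function used during Flow Map training combines a diagonal term, which fits $\hat{\delta}_{t,t}$ to $\hat{x}_1$, with an off-diagonal term, which fits $\hat{\delta}_{s,t}$ to the semi-group target: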

\[\bar{\delta}_{s,t} = \mathrm{sg}\left( \gamma \delta_{s,u}(I_s) + (1-\gamma) \delta_{u,t}(X_{s,u}(I_s)) \right).\]

$\bar{\delta}_{s,t}$ is the target value for the semi-group condition. The expectation $\mathbb{E}_{t,s,u}$ is taken over triplets sampled from a fully supported distribution over $\{0 \leq s \leq u \leq t \leq 1\}$ (meaning all combinations can be sampled).

$\hat{x}_1$ is predicted using Flow Matching, acting as a teacher model, so the student model $\delta_{s,t}$ just fits the teacher’s $\hat{x}_1$. Of course, if $\hat{x}_1$ is replaced by the original ground-truth $x_1$ from the dataset, it can also be trained directly from scratch (self-distillation).

Using $\hat{x}_1$: Constrain the diagonal using the original FLM.
Using $x_1$: Train from scratch.

4. Discrete Flow Maps?

FMLM points out that in the original discrete state space, although the evolution of the entire probability distribution theoretically exists deterministically as $\dot{p_t} = Q_t p_t$, it cannot be computed due to the dimensional explosion of $Q_t \in \mathbb{R}^{\lvert V\rvert^L \times \lvert V\rvert^L}$ and $p_t \in \Delta^{\lvert V\rvert^L - 1}$. Furthermore, the sample-level deterministic flow map we actually need provably does not exist in general (it cannot fit all target distributions). Therefore, directly attempting single-step deterministic generation in a purely discrete space is a dead end.

VI. Algorithm Implementation

1. Time Reparameterization $\tau(t)$

In the training and sampling of FLM and FMLM, unlike typical continuous modalities such as images, uniform sampling over $t \in [0,1]$ is inefficient. The reason is that in the high-dimensional One-hot space, the denoising process of tokens is not uniform—most of the determinism is concentrated and changes drastically within a very narrow window as $t \to 1$.

To address this, FLM introduces time reparameterization $\tau(t)$, defined as follows:

\[\tau(t) = \frac{P_e(0) - P_e(t)}{P_e(0)} = 1 - \frac{\lvert V\rvert}{\lvert V\rvert - 1} P_e(t)\]

Where the symbols denote:

$\lvert V\rvert$: Vocabulary Size.
$P_e(t)$: Decoding Error Rate, defined as:

\[P_e(t) := \frac{1}{L} \sum_{l=1}^L P(g_l(x_t) \neq g_l(x_1))\]

$P_e(t)$ reflects the average proportion of tokens that would be decoded incorrectly if the flow is forcefully terminated at time $t$ and the current noisy state $x_t$ is decoded. Here, $g(x)$ acts as an argmax operation.

At the endpoints:

$P_e(0) = 1 - \frac{1}{\lvert V\rvert}$ (maximum error rate under pure Gaussian noise)
$P_e(1) = 0$ (0 error rate for clean data)

According to Figure 9, $P_e(t)$ drops sharply as $t \to 1$, indicating that the determinism of most tokens is concentrated in this stage.

We can deduce this simply: Since $g(x) = \text{argmax}(x)$, the interpolation at time $t$ is $x_t = (1 - t)x_0 + tx_1$. Assuming the correct output token is at index $j$, but it is confused with the token at index $i$, then we must have $x_t^i > x_t^j$. This means $(1-t)x_0^i + tx_1^i > (1-t)x_0^j + tx_1^j$, which simplifies to $x_0^i - x_0^j > \frac{t}{1 - t}$. Because $\frac{t}{1 - t}$ monotonically increases with $t$, a much larger noise difference is required to cause a decoding error when $t$ is large. Therefore, the decoding error rate drops drastically.
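The argument above can be illustrated with a small Monte Carlo estimate of $P_e(t)$ for a single position. The sketch assumes i.i.d. standard Gaussian noise for $x_0$ and a tiny vocabulary; the function name and parameters are made up for illustration.

```python
import random

def decoding_error_rate(t, vocab_size, n_trials=5000, seed=0):
    """Monte Carlo estimate of P_e(t) for one token position.

    Assumes x_0 has i.i.d. standard Gaussian entries and that the correct
    token sits at index 0, so x_1 is the one-hot vector e_0.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_trials):
        x_t = [(1 - t) * rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
        x_t[0] += t  # add t * x_1, which is nonzero only at index 0
        decoded = max(range(vocab_size), key=lambda i: x_t[i])  # g(x) = argmax
        if decoded != 0:
            errors += 1
    return errors / n_trials

rates = {t: decoding_error_rate(t, vocab_size=8) for t in (0.0, 0.5, 0.9, 1.0)}
for t, rate in rates.items():
    print(t, rate)
```

At $t = 0$ the estimate should be close to $1 - \frac{1}{\lvert V\rvert}$, and it falls to zero at $t = 1$. Reusing the same seed for every $t$ makes the estimates monotone, since the same noise must clear a larger threshold $\frac{t}{1-t}$.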

In the pseudocode, “Sample $t$ via $\tau(t) \sim \mathcal{U}[0,1]$” means: first uniformly sample $\tau$, and then map it back to the actual time $t$ via the inverse function.
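Sampling $t$ through $\tau$ amounts to inverse-transform sampling. The sketch below assumes a toy closed form for $\tau$, valid for $\lvert V\rvert = 2$ with standard Gaussian noise, where $P_e(t) = \frac{1}{2}\mathrm{erfc}\left(\frac{t}{2(1-t)}\right)$ and hence $\tau(t) = \mathrm{erf}\left(\frac{t}{2(1-t)}\right)$. In practice $\tau$ would be estimated empirically, and the bisection-based inverse is an illustrative choice rather than the paper's procedure.

```python
import math
import random

def tau(t):
    """Toy tau(t) for |V| = 2 with standard Gaussian noise (assumed closed form)."""
    if t >= 1.0:
        return 1.0
    return math.erf(t / (2.0 * (1.0 - t)))

def tau_inverse(target, iters=60):
    """Invert the monotone function tau on [0, 1] by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tau(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sample_t(rng):
    """Draw tau ~ U[0, 1], then map it back to the actual time t."""
    return tau_inverse(rng.random())

rng = random.Random(0)
ts = [sample_t(rng) for _ in range(1000)]
print(min(ts), max(ts))
```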

2. Sampling for Off-Diagonal Loss Computation

In the distillation training of FMLM, the core purpose of the off-diagonal loss is to make the student model (the two-point denoiser $\hat{\delta}_{s,t}$) learn the semi-group property (“time-space travel”): this requires the model’s output for directly jumping from start point $s$ to endpoint $t$ to perfectly match the continuous mapping result of jumping from $s$ to relay point $u$, and then from $u$ to $t$.

To ensure the model generalizes this property across any time span, the paper employs a continuous and symmetric sampling strategy for the time triplet $(s, u, t)$ on the reparameterized time axis $\tau$. The specific steps are:

$h \sim \mathcal{U}[0, 1]$: randomly draw a time step size $h$ from a uniform distribution, representing the time distance the model is required to cross in the current training step.
$\tau(s) \sim \mathcal{U}[0, 1 - h]$, $\tau(t) = \tau(s) + h$: draw the start point so that the end point stays within $[0, 1]$.
$\tau(u) = \frac{\tau(s) + \tau(t)}{2}$: the relay point $u$ is placed exactly midway between the start and end points on the $\tau$ axis.

Compared to shortcut models which pre-specify fixed time spans, training start points, and end points, FMLM’s sampling method can cover all time spans from extremely short to extremely long, ensuring that the model encounters samples of various spans during training to better learn the semi-group property.
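The triplet sampling steps can be sketched directly on the $\tau$ axis. Mapping the sampled values back to actual times $(s, u, t)$ requires the inverse of $\tau$ and is omitted here; the function name is hypothetical.

```python
import random

def sample_tau_triplet(rng):
    """Return (tau_s, tau_u, tau_t) with the relay point at the midpoint."""
    h = rng.random()                    # span h ~ U[0, 1]
    tau_s = rng.uniform(0.0, 1.0 - h)   # start ~ U[0, 1 - h]
    tau_t = tau_s + h                   # end point, at most 1
    tau_u = 0.5 * (tau_s + tau_t)       # relay point in the middle
    return tau_s, tau_u, tau_t

rng = random.Random(0)
triplets = [sample_tau_triplet(rng) for _ in range(1000)]
print(triplets[0])
```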

3. Boundary Sampling for Few-Step Generation

With a fixed probability $p$ ($p = \frac{1}{32}$ in FMLM), the boundary pair $(s,t) = (0,1)$ is sampled directly to ensure sufficient training signals.
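One way to mix the boundary pair into the regular sampling is shown below. This is a sketch under assumptions: the non-boundary branch reuses the span-based sampling from the previous subsection, and the function name is hypothetical.

```python
import random

P_BOUNDARY = 1 / 32  # probability of sampling the boundary pair, as stated above

def sample_pair(rng, p_boundary=P_BOUNDARY):
    """Return (s, t); with probability p_boundary return the boundary pair (0, 1)."""
    if rng.random() < p_boundary:
        return 0.0, 1.0
    h = rng.random()                 # otherwise draw a random span
    s = rng.uniform(0.0, 1.0 - h)    # and a start point that keeps t within [0, 1]
    return s, s + h

rng = random.Random(0)
pairs = [sample_pair(rng) for _ in range(32000)]
boundary_fraction = sum(1 for p in pairs if p == (0.0, 1.0)) / len(pairs)
print(boundary_fraction)  # close to 1/32
```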

4. Generation Guidance

Autoguidance, used for unconditional generation

$\hat{b}_t^{\text{(guided)}} = \hat{b}_t^{\text{(weak)}} + \eta (\hat{b}_t - \hat{b}_t^{\text{(weak)}})$
- $\hat{b}_t$ is the velocity direction predicted by the primary, high-quality model.
- $\hat{b}_t^{\text{(weak)}}$ is the direction predicted by a “weak model” (e.g., a model with fewer training steps, smaller network size, or “dumbed down” by high dropout).
- $(\hat{b}_t - \hat{b}_t^{\text{(weak)}})$ represents “good” minus “bad”, pointing towards a higher-quality direction.
- $\eta > 1$ is a scalar controlling the guidance strength.

In a continuous space, this result is still a valid direction; however, “hard pushing” ($\eta$ is very large) in a discrete Logit space causes the probabilities of certain tokens to become extremely polarized, destroying the overall coherence of the sentence.
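The guided velocity is a simple extrapolation away from the weak model's prediction. A minimal sketch with made-up velocity vectors:

```python
def autoguide(b_main, b_weak, eta):
    """b_guided = b_weak + eta * (b_main - b_weak), applied elementwise."""
    return [w + eta * (m - w) for m, w in zip(b_main, b_weak)]

b_main = [1.0, 0.0]   # hypothetical prediction of the primary model
b_weak = [0.5, 0.5]   # hypothetical prediction of the weak model

print(autoguide(b_main, b_weak, eta=1.0))  # [1.0, 0.0], i.e. no guidance
print(autoguide(b_main, b_weak, eta=2.0))  # [1.5, -0.5], pushed past the main model
```

With $\eta = 1$ the primary model is recovered, and $\eta > 1$ moves further along the "good minus bad" direction.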

Reward-guided generation, used for conditional generation controlling text attributes

This leverages the idea of Flow Map Trajectory Guidance (FMTG): $\mathbf{x}_{t_{n+1}} = X_{t_n, t_{n+1}}(\mathbf{x}_{t_n}) + \lambda \nabla_{\mathbf{x}_{t_n}} r(X_{t_n, 1}(\mathbf{x}_{t_n}))$

How to step forward = taking a small normal scheduled step + $\lambda \times$ reward gradient correction.
$r(X_{t_n, 1}(\mathbf{x}_{t_n}))$ indicates that the model can foresee its currently predicted endpoint, allowing a reward model to evaluate it, compute the gradient, and backpropagate it to fine-tune the direction of the current small step.
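A single guided step can be sketched on a scalar toy flow. Everything here is assumed for illustration: the closed-form flow map, the quadratic reward, and the finite-difference gradient, which stands in for backpropagating through a learned flow map and reward model.

```python
import math

SIGMA = 0.5  # assumed std of the toy target distribution

def std(t):
    return math.sqrt((1 - t) ** 2 + (t * SIGMA) ** 2)

def X(s, t, x):
    """Exact flow map of the toy flow (a learned model in practice)."""
    return x * std(t) / std(s)

def reward(y, target=1.0):
    """Hypothetical reward: highest when the predicted endpoint hits the target."""
    return -(y - target) ** 2

def guided_step(x, t_n, t_next, lam, eps=1e-5):
    """x_{n+1} = X_{t_n,t_next}(x) + lam * d/dx r(X_{t_n,1}(x))."""
    def lookahead_reward(z):
        return reward(X(t_n, 1.0, z))  # evaluate the reward at the foreseen endpoint
    grad = (lookahead_reward(x + eps) - lookahead_reward(x - eps)) / (2 * eps)
    return X(t_n, t_next, x) + lam * grad

plain = guided_step(0.5, 0.0, 0.5, lam=0.0)
guided = guided_step(0.5, 0.0, 0.5, lam=0.1)
print(plain, guided)
```

Here the foreseen endpoint lies below the target, so the reward gradient is positive and the guided step lands above the plain step.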

VII. Pseudocode for Related Algorithms

Algorithm 5 FLM training

Algorithm 6 FLM sampling

Algorithm 7 FMLM training (distillation)

Algorithm 8 FMLM sampling
