> | Resource | Info |
> | :---        |:--- |
> | Paper       | https://arxiv.org/abs/2506.01347 |
> | Code & Data | https://github.com/TianHongZXY/RLVR-Decomposed |
> | Public      | arXiv |
> | Date        | 2025.07.17 |


# Summary Overview

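The authors study how positive and negative samples each affect reinforcement learning, decomposing training into Positive and Negative Sample Reinforcement (PSR and NSR). They find that training on negative samples alone improves Pass@k, and that as k grows it matches or exceeds PPO and GRPO.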

![image.png](/static/img/b54a3c32a227edf155404bd1a5e4c48f.image.png)

**Contributions:**
1. We decompose RLVR into two components, PSR and NSR, and investigate their distinct impacts on model behavior and generalization measured by a range of $\mathrm{Pass}@k$ metrics.
2. We empirically demonstrate the surprising effectiveness of NSR-only training and use gradient analysis to show that NSR refines the model's prior by suppressing incorrect reasoning steps and preserving plausible alternatives.
3. We propose Weighted-REINFORCE, a simple modification to the RL objective that upweights NSR, yielding consistent gains across complex reasoning benchmarks including MATH, AIME 2025, and AMC 23.

# Main Content

The RLVR objective optimizes the expected reward-weighted likelihood:

$$
\begin{align}
\mathcal{L}_{\text{RLVR}}(\theta) &= -\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}} \left[ \sum_{\boldsymbol{y}} r(\boldsymbol{x}, \boldsymbol{y}) \cdot \pi_\theta(\boldsymbol{y}|\boldsymbol{x}) \right], \quad r(\boldsymbol{x}, \boldsymbol{y}) \in \{-1, +1\} \\
&= \underbrace{-\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}} \left[ \sum_{\boldsymbol{y}: r(\boldsymbol{x}, \boldsymbol{y})=1} \pi_\theta(\boldsymbol{y}|\boldsymbol{x}) \right]}_{\mathcal{L}_{\text{PSR}}(\theta)} \; \underbrace{- \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}} \left[ \sum_{\boldsymbol{y}: r(\boldsymbol{x}, \boldsymbol{y})=-1} -\pi_\theta(\boldsymbol{y}|\boldsymbol{x}) \right]}_{\mathcal{L}_{\text{NSR}}(\theta)},
\end{align}
$$

where we define two sub-objectives representing each learning paradigm:

$$
\mathcal{L}_{\text{PSR}}(\theta) = -\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}} \left[ \sum_{\boldsymbol{y}: r(\boldsymbol{x}, \boldsymbol{y})=1} \pi_\theta(\boldsymbol{y}|\boldsymbol{x}) \right]
$$

$$
\mathcal{L}_{\text{NSR}}(\theta) = -\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}} \left[ \sum_{\boldsymbol{y}: r(\boldsymbol{x}, \boldsymbol{y})=-1} -\pi_\theta(\boldsymbol{y}|\boldsymbol{x}) \right].
$$
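As a sanity check on the decomposition, the sketch below evaluates the three losses exactly for a single prompt whose candidate responses can be enumerated. The toy logits, the reward labels, and the helper names `softmax` and `decomposed_losses` are illustrative assumptions, not part of the paper's code.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution pi_theta(y|x)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decomposed_losses(probs, rewards):
    """Return (L_PSR, L_NSR, L_RLVR) for one prompt with enumerated responses.

    probs[i] is pi_theta(y_i|x); rewards[i] is r(x, y_i) in {-1, +1}.
    """
    l_psr = -sum(p for p, r in zip(probs, rewards) if r == 1)
    l_nsr = -sum(-p for p, r in zip(probs, rewards) if r == -1)
    l_rlvr = -sum(r * p for p, r in zip(probs, rewards))
    return l_psr, l_nsr, l_rlvr

# Hypothetical toy policy over four candidate responses to one prompt.
probs = softmax([2.0, 1.0, 0.5, 0.0])
rewards = [+1, -1, -1, +1]
l_psr, l_nsr, l_rlvr = decomposed_losses(probs, rewards)
```

Here `l_psr + l_nsr` equals `l_rlvr`: minimizing `l_psr` raises the probability mass on correct responses, while minimizing `l_nsr` lowers the mass on incorrect ones. In practice the sum over all responses is replaced by sampled rollouts.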

**Positive and Negative Sample Reinforcement for LLM Reasoning**

**Compared algorithms.** We compare the performance of PSR and NSR with commonly used RL algorithms, including PPO and GRPO. PSR and NSR are implemented by selectively updating the policy model using only correct or incorrect responses, respectively. As a result, PSR and NSR are trained on fewer samples per batch than standard RL algorithms that use both correct and incorrect responses.

**Training setups.** For the training set, we use MATH, which contains 7,500 problems. We train the models using the verl framework. The prompt batch size is 1,024, with 8 rollouts generated per prompt. The sampling temperature during training is set to 1.0, and the maximum context length is set to 4,096 and 32,768 tokens for $\texttt{Qwen2.5-Math-7B}$ and $\texttt{Qwen3-4B}$, respectively. We update the model with a mini-batch size of 256 and a learning rate of 1e-6.

$\mathrm{Pass}@k$ is defined as the fraction of problems for which at least one correct response is produced in $k$ independent trials. However, directly computing $\mathrm{Pass}@k$ using only $k$ samples per example often suffers from high variance. We therefore use the unbiased estimator, which generates $n$ samples per problem ($n\geq k$), counts the number of correct responses $c$, and computes an unbiased estimate of $\mathrm{Pass}@k$ as:

$$
\text{Pass}@k = \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]
$$

Notably, varying $k$ provides insights into different aspects of model behaviors. $\mathrm{Pass}@1$ approximates greedy decoding accuracy, reflecting how confidently the model can produce a correct response in a single attempt, essentially reflecting *exploitation*. In contrast, $\mathrm{Pass}@k$ with large $k$ evaluates the model's ability to generate diverse correct responses across multiple attempts, capturing its *exploration* ability and reasoning boundary.
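The unbiased estimator is short enough to write out directly. The following is a minimal sketch using only the standard library; the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average the per-problem estimate over a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, with `n = 4` samples of which `c = 1` is correct, `pass_at_k(4, 1, 2)` is `1 - 3/6 = 0.5`.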

![image.png](/static/img/853a4c0661809f528c8507c119a9aa30.image.png)

![image.png](/static/img/1ae101958a50946051ecc103a1338b91.image.png)

**NSR outperforms or stays comparable to the base model at a large $k$ value.** At larger decoding budgets (e.g., $\mathrm{Pass}@256$), recent work shows that RL-trained models often lose their advantage, and in some cases underperform the base model. This trend is generally observed in our experiments with PPO, GRPO, and PSR, especially in Figure 2.

**PSR improves accuracy at the cost of diversity.** This behavior indicates that PSR overly concentrates probability mass on early correct responses, leading to overconfidence and a collapsed output distribution, and ultimately limiting the model's ability to generate diverse correct responses when more test-time compute is allowed.


![image.png](/static/img/61e48ad720730e52f3ee294fcead12a5.image.png)

**Token-Level Gradient Dynamics of PSR and NSR**

![image.png](/static/img/d1a91d86ac1276862ea706a4dd718e7f.image.png)

**Balancing Positive and Negative Reinforcement**

![image.png](/static/img/33673f8f5c91eebf0882988d9ccac54e.image.png)

# 🤖

:::info{title=" "}
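1. Novelty and distinctive contributions of the paper:
   - **Innovation 1: Negative Sample Reinforcement (NSR) and an analysis of its effectiveness**  
     By decomposing the reward signal in reinforcement learning into Positive Sample Reinforcement (PSR) and Negative Sample Reinforcement (NSR), the paper shows that NSR plays an important role in improving the reasoning ability of language models. In particular, training with NSR alone markedly improves performance on reasoning tasks while preserving generation diversity, and in some cases surpasses mainstream RL algorithms such as PPO and GRPO. This finding challenges the conventional view that positive rewards are the main driver of performance gains.
     
   - **Innovation 2: Gradient analysis reveals how NSR works**  
     Through gradient analysis, the paper explains how NSR optimizes the model's output by suppressing incorrect generations and redistributing probability mass. This makes effective use of the model's prior knowledge while avoiding overfitting and loss of generation diversity.

   - **Innovation 3: The Weighted-REINFORCE algorithm**  
     The paper proposes a simple but effective weighted RL algorithm (Weighted-REINFORCE) that down-weights the positive-sample reward to balance the strengths of PSR and NSR, and it performs well on several reasoning benchmarks. The algorithm offers a new perspective on designing efficient RL objectives.

   - **Key takeaways**  
     - NSR can indirectly improve reasoning ability without directly reinforcing correct samples.
     - Positive rewards can over-concentrate the output distribution and harm generation diversity.
     - Adjusting the reward weights allows a balance between accuracy and diversity.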

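2. Issues in the paper and suggested improvements:
   - **Issue 1: Limited validation on a broader range of models**  
     The experiments focus on Qwen2.5-Math-7B and Qwen3-4B, both of which are strong at mathematical reasoning. Priors and reasoning ability can differ substantially across models, and the applicability of NSR and Weighted-REINFORCE to other models (such as the GPT series or other open-source models) has not been verified.
     - **Suggestion**: Extend the experiments to models of different architectures and scales to assess how general NSR and Weighted-REINFORCE are.

   - **Issue 2: Stability over long training runs**  
     The paper notes that prolonged NSR training can degrade model performance, which points to a stability limitation of NSR.
     - **Suggestion**: Study ways to adjust the NSR and PSR weights dynamically, for example using more NSR early in training and gradually increasing the PSR weight later, to keep training stable.

   - **Issue 3: Limited study of more complex reward signals**  
     The paper mainly studies binary rewards (+1/-1), whereas real tasks may involve richer signals such as continuous values or multi-dimensional feedback.
     - **Suggestion**: Evaluate NSR and Weighted-REINFORCE under richer reward signals and design new objectives suited to those settings.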

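3. New ideas and research directions suggested by the paper's content and results:
   - **Idea 1: RL with dynamic weight adjustment**  
     Design a mechanism that adjusts the PSR and NSR weights dynamically according to the training stage or task type, in order to improve training outcomes.

   - **Idea 2: Applying NSR to tasks in other domains**  
     Apply NSR to other reasoning tasks (such as code generation, scientific question answering, or open-domain QA) and verify its applicability and effect on each.

   - **Idea 3: RL objectives for complex reward signals**  
     For complex reward signals (such as continuous values or multi-dimensional feedback), design new RL objectives that combine NSR's probability-redistribution behavior with the fine-grained information in the reward.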

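4. Research plans for the new directions:
   - **Direction 1: RL with dynamic weight adjustment**
     - **Method**: Design a mechanism that adjusts the PSR and NSR weights during training according to the model's accuracy and generation diversity. The weight changes can be driven by monitoring the model's predictive entropy and the fraction of correct samples.
     - **Steps**:
       1. Implement the dynamic weighting algorithm and compare it with fixed-weight Weighted-REINFORCE.
       2. Run experiments on mathematical reasoning tasks to evaluate the effect of dynamic weighting.
       3. Analyze how the weight changes affect model performance and check whether they improve training stability.
     - **Expected outcome**: Dynamic weighting balances accuracy and diversity during training, improving performance while keeping training stable.

   - **Direction 2: Applying NSR to tasks in other domains**
     - **Method**: Apply NSR to reasoning tasks in different domains (such as code generation and scientific question answering) and compare it with mainstream RL algorithms (such as PPO and GRPO).
     - **Steps**:
       1. Select several cross-domain task datasets (such as CodeXGLUE and SciQ).
       2. Train with NSR alone on each task and compare against PSR, PPO, and other methods.
       3. Analyze how task characteristics affect NSR's effectiveness and map out where it applies.
     - **Expected outcome**: Evidence for how general NSR is across tasks, and identification of tasks where it has a particular advantage.

   - **Direction 3: RL objectives for complex reward signals**
     - **Method**: Design new objectives that combine NSR's probability-redistribution behavior with complex reward signals, for example by handling multi-dimensional rewards through weighted averaging or regularization.
     - **Steps**:
       1. Define the form of the complex reward signal (such as continuous values or multi-dimensional feedback).
       2. Design the new objective and implement the corresponding training algorithm.
       3. Run experiments on task datasets with complex reward signals to evaluate the new objective.
       4. Compare with conventional RL algorithms and analyze the advantages in complex-reward settings.
     - **Expected outcome**: The new objective handles complex reward signals effectively while retaining the benefits of NSR, improving performance on practical tasks.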

:::

# Others

[arXiv-2025] The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

> | Resource | Info |
> | :---        |:--- |
> | Paper       | https://arxiv.org/abs/2505.15778 |
> | Code & Data | https://github.com/eric-ai-lab/Soft-Thinking |
> | Public      | arXiv |
> | Date        | 2025.07.11 |


# Main Content

The authors propose Soft Thinking, a training-free method for reasoning in a continuous concept space.

![image.png](/static/img/1a8c1fc95adf850d59e78fe2ba730f3f.image.png)

A fundamental limitation of standard CoT reasoning is its inherently unidirectional and sequential nature: at each step, the model samples a single token, committing to one specific branch of the reasoning path.

In this work, we propose a new perspective: instead of constraining LLMs to reason within the discrete, sequential space of language tokens, we aim to enable LLMs to reason with soft, abstract concepts, which encompass more general and fine-grained semantics and retain information about multiple possible paths. This retains the original distribution of the next step. At each step, we construct a new embedding from a concept token by probability-weighting all token embeddings, which forms the continuous concept token. This approach allows the model to represent and process abstract concepts, endowing each output token with more nuanced and fine-grained semantics, and enabling the processing of multiple paths conceptually.

We also introduce a *Cold Stop* mechanism to further boost efficiency and address the challenge of generation collapse (e.g., repetition) caused by out-of-distribution (OOD) inputs, where certain *concept tokens* may be unseen during training. Specifically, *Cold Stop* monitors the entropy of the model's output distribution at each step and terminates the reasoning process early when the model demonstrates high confidence (i.e., low entropy) over several consecutive steps. This mechanism prevents unnecessary computation and mitigates the risk of model collapse when dealing with OOD inputs, ensuring more robust and efficient reasoning.

**Two major advances:**
1. by operating in the continuous concept space formed as a convex combination of all token embeddings, the model can capture and manipulate abstract concepts and detailed semantic information;
2. because each concept token keeps a probability distribution from all possible next tokens, the model can implicitly and efficiently explore multiple reasoning paths in parallel, rather than being limited to a single trajectory.

> In language models with fewer than 7B parameters, the input embedding layer and the output language model head are typically weight-tied, enabling continuous-space reasoning by aligning the input and output spaces after extensive training. In contrast, for models with more than 7B parameters, these components are typically decoupled, meaning that the hidden states and input embeddings reside in different spaces. Directly using hidden states as input embeddings leads to significant representational mismatch, which is difficult to bridge even with extensive retraining.

![image.png](/static/img/08628d5a4e6941efb554e27b1984c820.image.png)

## *Soft Thinking:* Reasoning in a Continuous Concept Space

**Definition 1** *(Concept Token)*. At any intermediate thinking step, let $p\in\Delta^{|V|-1}$ be the LLM-produced probability distribution over the vocabulary. We call this probability vector a *concept token*, denoted by

$$
ct:=p
$$

Unlike a traditional step that collapses the distribution to a single token id, the concept token preserves the full distribution of every possible next step.

**Definition 2** *(Continuous Concept Space)*. Let $E\in\mathbb{R}^{|V|\times d}$ be the embedding matrix and $e(k)=E[k]$ the embedding of the $k$-th vocabulary item. The continuous concept space is the convex combination of all embedding vectors.

$$
\mathcal{C} \;=\;
\Bigl\{\,\sum\nolimits_{k=1}^{|V|}\alpha_k\,e(k)\;:\;
\alpha\in\Delta^{|V|-1}\Bigr\}\subset\mathbb{R}^d
$$

i.e. the set of all probability-weighted mixtures of token embeddings. Note that this is different from the usual semantic space, which is modeled as a $d$-dimensional real vector space.

**Reasoning Process.** *Soft Thinking* only replaces the intermediate thinking steps of standard CoT. At each step of soft thinking, the model generates a *concept token*. Then, in the next step, the *concept token* $ct$ is fed back into the LLM through its embedding:

$$
\tilde{e}_{\mathrm{next}} = \sum_{k=1}^{|V|} ct[k]\,e(k) = \sum_{k=1}^{|V|} p[k]\,e(k)\in\mathcal{C}
$$

When the most probable token for a certain concept token is the end-of-thinking, the intermediate reasoning process stops, and the model switches to generating the output. All output stage tokens $y_j$ are sampled in the usual discrete manner; only the intermediate thinking phase flows through the continuous concept space defined above.
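The injection step is a probability-weighted sum of embedding rows. The sketch below shows it with plain Python lists; the function name `concept_embedding` and the toy 3-token, 2-dimensional embedding matrix are assumptions made for illustration, not the authors' implementation.

```python
def concept_embedding(p, E):
    """Map a concept token p (length |V|) to sum_k p[k] * E[k] (length d).

    E is the |V| x d embedding matrix given as a list of rows.
    """
    if len(p) != len(E):
        raise ValueError("p must have one entry per vocabulary item")
    d = len(E[0])
    return [sum(p[k] * E[k][j] for k in range(len(p))) for j in range(d)]

# Hypothetical vocabulary of 3 tokens with 2-dimensional embeddings.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
one_hot = concept_embedding([0.0, 1.0, 0.0], E)   # equals the row of token 1
mixed = concept_embedding([0.5, 0.5, 0.0], E)     # midpoint of tokens 0 and 1
```

A one-hot concept token recovers ordinary discrete decoding, which is why standard CoT can be seen as a special case of this step.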

**Why *Soft Thinking* Helps.** Using *concept tokens* allows the model to avoid making hard decisions too early. Instead of selecting a single token at each step, the model keeps the full probability distribution over the vocabulary. This gives it the flexibility to explore different reasoning paths, especially when it is unsure. By working in this *continuous concept space*, the model can represent more abstract concepts that don't map cleanly to a single word. These abstract concepts can later evolve into more concrete thoughts as reasoning continues. This flexibility helps the model think more clearly, avoid early mistakes, and better handle complex multi-step problems.

***Cold Stop*** While *concept tokens* enable more abstract reasoning, feeding in continuous *concept tokens* during inference places the model in an out-of-distribution (OOD) regime. This can lead to model collapse if the reasoning process continues for too long without correction. To mitigate this, we propose a *Cold Stop* mechanism that dynamically stops intermediate reasoning when the model becomes overconfident. At each step, we compute the entropy of the *concept token*:

$$
H(p) = -\sum_{k=1}^{|V|} p[k] \, \log p[k]
$$

Since *Soft Thinking* preserves the entire probability distribution at each step, the entropy serves as a natural signal of uncertainty, which is often used in LLMs to evaluate the quality of generation. Low entropy (the "cold" state, by analogy with physics) indicates that the model is confident in its prediction and can thus conclude soon. Given an entropy threshold $\tau$ and a required number of consecutive confident steps $k$, we apply the following rule:

- If $H(p) < \tau$, increment a low-entropy step counter; otherwise, reset the counter.
- When the counter reaches $k$, we insert an end-of-thinking token $\langle/\mathrm{think}\rangle$ to conclude reasoning and begin final answer generation.

This strategy avoids unnecessary computation and prevents model collapse under OOD conditions, while preserving the benefits of soft thinking through an entropy-based confidence measure.
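The counter rule can be sketched as a small loop over per-step distributions. This is a minimal illustration under stated assumptions: the names `entropy` and `cold_stop_step` are hypothetical, the distributions are passed in as a precomputed list rather than produced by a model, and natural-log entropy is used.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats; zero-probability entries contribute 0."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def cold_stop_step(distributions, tau, k):
    """Return the 0-based step at which Cold Stop fires, or None if it never does.

    distributions: one concept token (probability vector) per thinking step.
    tau: entropy threshold; k: required number of consecutive low-entropy steps.
    """
    counter = 0
    for step, p in enumerate(distributions):
        if entropy(p) < tau:
            counter += 1      # another confident step in a row
        else:
            counter = 0       # an uncertain step resets the streak
        if counter >= k:
            return step       # here an end-of-thinking token would be inserted
    return None
```

Resetting the counter on any high-entropy step means a single confident prediction in the middle of an uncertain stretch does not end the reasoning early.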

![image.png](/static/img/eb793c3a298444901b05e39cdd5b6d08.image.png)

![image.png](/static/img/5096f5475e4abb7fda506457e4f61eb6.image.png)

![image.png](/static/img/7b7f7c319e366d0d9aebf51dbd34e6bd.image.png)

# 🤖

:::info{title=" "}
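### 1. Novelty and distinctive contributions of the paper:

- **Innovation 1: The "Soft Thinking" method**  
  The paper proposes a new reasoning framework, Soft Thinking, that moves beyond the limitation of conventional Chain-of-Thought (CoT) reasoning, which relies only on discrete language tokens. Soft Thinking reasons in a continuous concept space using probability-weighted concept embeddings (concept tokens), letting the model explore several reasoning paths at once and thereby improving accuracy and efficiency. The method needs no additional training and works directly on existing language model architectures.

- **Innovation 2: The Cold Stop mechanism**  
  To address the generation collapse that reasoning in a continuous concept space can cause, the paper designs Cold Stop, which monitors the entropy of the model's output and terminates reasoning dynamically. This avoids excessive computation while the model is operating out of distribution and improves reasoning efficiency.

- **Innovation 3: Clear gains in accuracy and efficiency**  
  Experiments show that Soft Thinking improves Pass@1 accuracy on math and coding tasks (by up to 2.48 points) and shortens generations (by up to 22.4%). The generated reasoning steps are also more concise and easier to interpret, which illustrates the potential of reasoning in a continuous concept space.

- **Innovation 4: No additional training required**  
  Soft Thinking is built entirely on existing model architectures and achieves continuous reasoning through probability-weighted embeddings, without extra training or model changes. This lightweight design makes it convenient to apply in practice.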

---

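### 2. Issues in the paper and suggested improvements:

- **Issue 1: Distribution shift from discrete to continuous space**  
  Soft Thinking places the model in an out-of-distribution (OOD) reasoning regime, which can make it unstable or cause generation collapse. Cold Stop partly mitigates this, but it does not fundamentally solve the model's adaptation to the continuous concept space.

  **Suggestions:**  
  - Add a pre-training task on concept embeddings so that the model learns representations in the continuous concept space.
  - Design a hybrid approach that combines discrete and continuous reasoning and switches modes dynamically to suit different tasks.

- **Issue 2: No experiments on more complex tasks**  
  The paper validates the method mainly on math and coding tasks and does not cover more complex multimodal or language-understanding tasks.

  **Suggestions:**  
  - Broaden the experiments to multimodal tasks (such as visual question answering) or language generation tasks.
  - Design more challenging reasoning tasks to evaluate Soft Thinking on highly complex problems.

- **Issue 3: Parameter sensitivity of Cold Stop**  
  Cold Stop depends on the entropy threshold and the number of consecutive low-entropy steps, and the right choice of these parameters may vary across tasks.

  **Suggestions:**  
  - Introduce dynamic parameter adjustment that tunes the entropy threshold in real time according to task complexity and the model's reasoning state.
  - Optimize the Cold Stop parameters with reinforcement learning so that they adapt to different reasoning scenarios.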

---

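### 3. New ideas and research directions suggested by the paper's content and results:

#### Idea 1: **Extending continuous-concept-space reasoning to multimodal tasks**
Explore Soft Thinking on multimodal tasks, such as reasoning over images and text together, using a joint concept space for cross-modal reasoning.

#### Idea 2: **Dynamic switching between reasoning modes**
Design a reasoning framework that switches between discrete and continuous reasoning according to task needs, to improve both performance and efficiency.

#### Idea 3: **Task-specific fine-tuning for the continuous concept space**
Develop a task-specific fine-tuning method that helps the model adapt to reasoning in the continuous concept space and improves its performance on complex tasks.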

---

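### 4. Research plans for the new directions:

#### Direction 1: Extending continuous-concept-space reasoning to multimodal tasks

**Method:**
- **Goal:** Apply Soft Thinking to multimodal tasks (such as visual question answering or image generation) to test its cross-modal reasoning ability.
- **Steps:**
  1. Build a multimodal dataset with joint image and text representations.
  2. Introduce image embeddings at inference time and combine them with text embeddings to form a joint continuous concept space.
  3. Design experiments comparing Soft Thinking with conventional multimodal reasoning methods.
- **Expected outcome:**
  - Evidence that Soft Thinking is effective on multimodal tasks.
  - A unified cross-modal reasoning framework that improves accuracy and efficiency on multimodal tasks.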

---

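#### Direction 2: Dynamic switching between reasoning modes

**Method:**
- **Goal:** Design a reasoning framework that switches between discrete and continuous reasoning according to task needs.
- **Steps:**
  1. Introduce a task-complexity metric to assess whether a task suits continuous reasoning.
  2. Monitor the model's entropy and generation state during reasoning and adjust the reasoning mode dynamically.
  3. Optimize the switching policy with reinforcement learning.
- **Expected outcome:**
  - A flexible reasoning framework that adapts to different task needs.
  - Clear gains in reasoning accuracy and efficiency on complex tasks.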

---

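#### Direction 3: Task-specific fine-tuning for the continuous concept space

**Method:**
- **Goal:** Develop a task-specific fine-tuning method that helps the model adapt to reasoning in the continuous concept space.
- **Steps:**
  1. Add a concept-embedding fine-tuning task to the training stage.
  2. Fine-tune the model on task datasets to improve its adaptation to the continuous concept space.
  3. Evaluate the fine-tuned model on complex tasks.
- **Expected outcome:**
  - Better model performance on complex tasks.
  - An efficient fine-tuning method that improves the model's adaptability and stability.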

---

The analysis and plans above aim to explore the potential of Soft Thinking further and to offer clear, feasible directions for future research.
:::

# Others

[arXiv-2025] Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space

> | Resource | Info |
> | :---        |:--- |
> | Paper       | http://arxiv.org/abs/2505.18454 |
> | Code & Data | https://github.com/Yueeeeeeee/HRPO |
> | Public      | arXiv |
> | Date        | 2025.07.10 |


# Main Content

The authors propose HRPO, an RL algorithm for latent reasoning.

![image.png](/static/img/7c127266e3a353179c8c99b0cffbc4ef.image.png)

![image.png](/static/img/4bfdb1be350e100e2677d4b74e6938ee.image.png)

We first describe our notation and settings for hybrid latent reasoning. For input query $x=[x_1,x_2,\cdots,x_t]$ and its corresponding token embeddings $E=[e_1,e_2,\cdots,e_t]$, we describe the raw hidden states from the LLM output at step $t$ with $\hat{h}_t$, namely:

$$
\hat{H}=[\hat{h}_1,\hat{h}_2,\cdots,\hat{h}_t]=\texttt{Transformer}(E)
$$

in which $\texttt{Transformer}$ denotes the transformer model (i.e., decoder layers), $\hat{H}$ represents the final-layer hidden states produced by the $\texttt{Transformer}$. With the LM head ($\texttt{Head}$), the next output token $\hat{x}_{t+1}$ can be sampled from the output distribution over the vocabulary via:

$$
\hat{x}_{t+1}\sim\texttt{softmax}(\texttt{Head}(\hat{h}_t))
$$

However, hidden states often lie outside the model's token embedding manifold, which degrades generation quality when fed directly. To avoid this, we project $\hat{h}_t$ back into the embedding space to ensure the inputs conform to the model's learned distribution. Specifically, we use the output probabilities $p_{t+1}$ to compute a weighted interpolation over the vocabulary:

$$
h_{t+1}=W_e^T\frac{p_{t+1}}{||p_{t+1}||},\text{with}\;p_{t+1}=\texttt{softmax}(\frac{\texttt{Head}(\hat{h}_t)}{\tau})
$$

in which $\tau$ is the temperature and $W_e$ denotes the embedding matrix of the LLM. In other words, we compute the next input embedding as a weighted sum of all token embeddings, with weights given by $p_{t+1}$. In addition, $p_{t+1}$ is normalized to preserve the scale and variance of the output vector. This sampling-free mapping ensures differentiability and aligns the projected embedding with the model's native input space, thus leading to improved training dynamics.
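The projection can be sketched in a few lines. This is an illustrative sketch, not the HRPO code: the function names are hypothetical, `logits` stands in for $\texttt{Head}(\hat{h}_t)$, `W_e` is given as a list of $|V|$ rows, and $||p_{t+1}||$ is assumed to be the Euclidean (L2) norm.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over the vocabulary."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def project_to_embedding(logits, W_e, tau=1.0):
    """Compute h_{t+1} = W_e^T (p / ||p||) with p = softmax(logits / tau).

    The norm is assumed to be L2; W_e is |V| x d as a list of rows.
    """
    p = softmax(logits, tau)
    norm = math.sqrt(sum(q * q for q in p))
    d = len(W_e[0])
    return [sum((p[k] / norm) * W_e[k][j] for k in range(len(p)))
            for j in range(d)]
```

With a low temperature the distribution is nearly one-hot and the result is close to a single embedding row, while a flat distribution mixes every row, which is the source of the noise from irrelevant tokens.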

While interpolated embeddings preserve semantic continuity, directly feeding $h_{t+1}$ as the next token input removes stochasticity and injects noise from irrelevant tokens, causing degraded generation within RL rollouts. As such, we design a hybrid approach for latent reasoning by gradually imposing hidden state representations into the sampled token embeddings with a gating mechanism. Drawing on gated recurrence models, we formulate the gating mechanism as:
$$
\begin{align}
r_t &= \sigma(W_a \hat{e}_{t-1} + b_a), \\
i_t &= \sigma(W_x \hat{e}_{t-1} + b_x), \\
a_t &= \exp(-c \cdot \texttt{softplus}(\Lambda) \odot r_t), \\
e_{t+1} &= \begin{cases}
a_t \odot \hat{e}_{t+1} + \sqrt{1 - a_t^2} \odot (i_t \odot h_{t+1}) & t \in \text{think}, \\
\hat{e}_{t+1} & t \notin \text{think},
\end{cases}
\end{align}
$$

$e_{t+1}$ is the resulting hybrid input for the next step, $\hat{e}_{t+1}$ denotes the embedding of the sampled discrete token $\hat{x}_{t+1}$, and $h_{t+1}$ is the projected hidden state. The gates $r_t$ and $i_t$ use the sigmoid function $\sigma$ to control the blending, $a_t$ scales $\hat{e}_{t+1}$, $c$ is a fixed scaling constant, and $\Lambda$ is a learnable vector. Note that hybrid reasoning only applies during the reasoning phase (i.e., $t\in\texttt{think}$), while the final answer is still generated via standard autoregressive decoding. By initializing $a_t\to1$, the inputs first draw predominantly from the sampled token embeddings, thereby effectively preserving the LLM's generative capabilities. As training progresses, the value range of $a_t$ converges to an optimal range and thus incorporates informative features from both hidden representations and sampled tokens.
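A small sketch of the gate may help show why $a_t \to 1$ leaves the sampled embedding unchanged. Everything here is illustrative: the function names, the list-based matrices, and the toy values are assumptions, and `e_gate` denotes whichever embedding the gates read (written $\hat{e}_{t-1}$ in the equations above).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def hybrid_input(e_gate, e_hat_next, h_next, W_a, b_a, W_x, b_x, Lam, c,
                 thinking=True):
    """Blend the sampled-token embedding with the projected hidden state."""
    if not thinking:
        return list(e_hat_next)          # outside the think phase: plain embedding
    r = [sigmoid(z + b) for z, b in zip(matvec(W_a, e_gate), b_a)]
    i = [sigmoid(z + b) for z, b in zip(matvec(W_x, e_gate), b_x)]
    a = [math.exp(-c * softplus(l) * rj) for l, rj in zip(Lam, r)]
    # a lies in (0, 1], so 1 - a^2 is non-negative.
    return [aj * ej + math.sqrt(1.0 - aj * aj) * (ij * hj)
            for aj, ej, ij, hj in zip(a, e_hat_next, i, h_next)]
```

Because the two mixing coefficients are $a_t$ and $\sqrt{1-a_t^2}$, their squares sum to one, so a very negative $\Lambda$ (where $\texttt{softplus}(\Lambda)\approx 0$) gives $a_t\approx 1$ and passes $\hat{e}_{t+1}$ through almost untouched.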

$$
\begin{aligned}
\nabla_\theta \mathcal{J}_{\text{HRPO}}(\theta) &= \mathbb{E}_{x \sim \mathcal{D}, \{(y_i, H_i)\}_{i=1}^g \sim \pi_\theta(\cdot|x)} \\
&\left[ \frac{1}{g} \sum_{i=1}^{g} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \nabla_\theta \log \pi_\theta(y_{i,t}|x, y_{i,<t}, H_{i,<t}) \hat{A}_{i,t} \right] - \beta \nabla_\theta \mathbb{D}_{KL}[\pi_\theta \| \pi_{\text{ref}}]
\end{aligned}
$$

As such, our HRPO implementation remains lightweight, strictly on-policy, and could be seamlessly combined with further RL optimizations.

**Exp 1:** open-domain & multi-hop knowledge-intensive question answering (Knowledge)

![image.png](/static/img/d7f0fc1292d94148e0453a2f27ea4ef0.image.png)

**Exp 2:** science, technology, engineering or mathematics (STEM) benchmarks.

![image.png](/static/img/77ad4ff113fd9e46e7bba3abff559637.image.png)

**Different Strategies for Latent Reasoning.** We compare different strategies to compute latent representations. Specifically, we use three methods to integrate hidden states into RL and train the 1.5B Qwen model on the MATH dataset. These variants are: (1) hidden states, which use the final layer hidden states as the next input; (2) interpolation, which employs interpolated embeddings; and (3) HRPO, our hybrid latent reasoning. We visualize the exponential moving average (EMA) of rewards along with the GRPO baseline. Due to the mismatch between hidden states and embeddings, using hidden states degrades generation and yields nonsensical rollouts with zero reward. Although interpolation performs similarly to HRPO for the first few hundred steps, the rewards eventually collapse and only slowly recover, likely because interpolation introduces excessive noise. We also provide a direct comparison between HRPO and latent reasoning methods. Overall, our approach achieves superior training dynamics with faster convergence while maintaining stability comparable to GRPO, highlighting the efficacy of our hybrid design choice in HRPO.

# 🤖

:::info{title=" "}
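1. Novelty and distinctive contributions of the paper:
   - **Innovation 1: Combining hybrid latent reasoning with reinforcement learning**  
     The paper proposes a framework called Hybrid Reasoning Policy Optimization (HRPO), which uses reinforcement learning (RL) to gradually build latent reasoning ability into large language models (LLMs). Compared with conventional training on discrete reasoning paths, HRPO mixes continuous hidden states with discrete embeddings, which reduces reliance on annotated chain-of-thought (CoT) data while preserving the LLM's generative ability.
   - **Innovation 2: A learnable gating mechanism**  
     The paper designs a gating mechanism that dynamically adjusts the weights of the discrete embeddings and the latent reasoning features. By introducing hidden-state information gradually, it preserves the model's generative ability while fusing continuous features effectively.
   - **Innovation 3: Unsupervised reward optimization**  
     HRPO uses a simple outcome-based reward function, needs no elaborate CoT annotations, and optimizes reasoning performance directly with RL. This lowers training cost while improving performance on knowledge-intensive and reasoning-intensive tasks.
   - **Innovation 4: Cross-lingual reasoning ability**  
     The paper shows HRPO's potential for cross-lingual reasoning: it can naturally combine multilingual information during reasoning and generalizes better as a result.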

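2. Issues in the paper and suggested improvements:
   - **Issue 1: Limited treatment of the mapping between hidden states and the embedding space**  
     The paper notes that projecting hidden states directly into the embedding space can introduce noise and degrade generation quality. An interpolation mechanism is provided, but a mismatch between hidden states and the embedding space may remain.  
     **Suggestion**: Introduce a more sophisticated projection mechanism, for example one based on contrastive learning, to keep hidden states semantically consistent with the embedding space.
   - **Issue 2: The reward function is too simple**  
     The paper uses a single reward based on answer correctness, which may not fully capture the quality of the reasoning process.  
     **Suggestion**: Design a multi-dimensional reward that accounts for the coherence and complexity of the reasoning and for generation quality.
   - **Issue 3: Limited analysis of multilingual reasoning**  
     The paper mentions HRPO's cross-lingual reasoning ability but does not analyze its performance on multilingual tasks in depth.  
     **Suggestion**: Add cross-lingual experiments and analyze how HRPO handles semantic transfer and context integration across languages.
   - **Issue 4: Training efficiency and resource requirements are not evaluated in detail**  
     The paper does not detail the training efficiency and resource needs of HRPO at different model scales.  
     **Suggestion**: Report training time and resource consumption against model scale to guide practical use.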

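3. New ideas and research directions suggested by the paper's content and results:
   - **Idea 1: An RL-based framework for dynamic task adaptation**  
     Design a framework that adjusts the ratio between hidden states and embeddings according to task needs, so the model adapts more efficiently to different task types (such as knowledge QA, logical reasoning, and mathematical computation).
   - **Idea 2: Optimizing cross-lingual latent reasoning**  
     Explore how RL can further strengthen HRPO's cross-lingual reasoning, and study how hidden states transfer and are shared across languages.
   - **Idea 3: Combining hidden states with external knowledge bases**  
     Combine HRPO's latent reasoning with external knowledge bases (such as semantic graphs and structured databases) to improve knowledge acquisition and complex reasoning.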

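4. Research plans for the new directions:
   - **Direction 1: An RL-based framework for dynamic task adaptation**
     - **Method**: Design a dynamic gating mechanism that uses RL to adjust the ratio between hidden states and embeddings in real time, so the model automatically optimizes its reasoning for the task type (such as knowledge QA or logical reasoning).
     - **Steps**:
       1. Collect diverse task datasets (such as knowledge QA, logical reasoning, and mathematical computation).
       2. Design the dynamic gating mechanism and define a feature vector for task type.
       3. Optimize the gating parameters with RL and train the model to adapt across tasks.
       4. Test the model in a multi-task setting and analyze adaptation efficiency and reasoning quality.
     - **Expected outcome**: A reasoning framework that adapts dynamically to task needs and improves performance in multi-task settings.

   - **Direction 2: Optimizing cross-lingual latent reasoning**
     - **Method**: Use RL to optimize how hidden states transfer and are shared across languages, and explore the cross-lingual generalization of latent reasoning.
     - **Steps**:
       1. Collect reasoning tasks from multilingual datasets (such as English, Chinese, and French).
       2. Design a cross-lingual hidden-state sharing mechanism and define transfer matrices between languages.
       3. Train the model with the HRPO framework and optimize hidden-state behavior on multilingual tasks.
       4. Test the model's reasoning on cross-lingual tasks and analyze semantic consistency and generalization.
     - **Expected outcome**: A framework for optimizing cross-lingual latent reasoning that strengthens reasoning in multilingual settings.

   - **Direction 3: Combining hidden states with external knowledge bases**
     - **Method**: Combine HRPO's latent reasoning with external knowledge bases (such as semantic graphs and structured databases) to improve knowledge acquisition and complex reasoning.
     - **Steps**:
       1. Build a knowledge base containing semantic graphs and structured databases.
       2. Design a mechanism for hidden states to interact with the knowledge base, and define retrieval and reasoning rules.
       3. Train the model with the HRPO framework and optimize the integration of hidden states with the knowledge base.
       4. Test the model on knowledge-intensive tasks and analyze knowledge-acquisition efficiency and reasoning quality.
     - **Expected outcome**: A reasoning framework that combines hidden states with external knowledge bases and improves performance on knowledge-intensive tasks.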

:::

# Others

[arXiv-2025] Hybrid Latent Reasoning via Reinforcement Learning

> | Resource | Info |
> | :---        |:--- |
> | Paper       | https://arxiv.org/abs/2502.12134 |
> | Code & Data | https://github.com/xuyige/SoftCoT |
> | Public      | ACL |
> | Date        | 2025.07.04 |


# Summary Overview

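Chain-of-thought prompting lets an LLM solve complex reasoning tasks by generating intermediate reasoning steps. However, most existing methods focus on hard-token decoding, which confines reasoning to the discrete vocabulary space. Although some recent work explores reasoning in continuous space, it usually requires full fine-tuning and can suffer from catastrophic forgetting. To address this, the authors use a lightweight assistant model with frozen weights to generate instance-specific soft thought tokens, which a trained projection module then maps into the LLM's representation space.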

![image.png](/static/img/b87130c78c1cd242df2a933d4cbc6e06.image.png)

# Main Content

To facilitate reasoning in a continuous space, we use soft thought tokens (i.e., the last-layer hidden states from the small assistant model before mapping to the vocabulary space) instead of discrete tokens.

Given an input question $\mathcal{Q}$, the framework produces a sequence of reasoning steps $\mathcal{R}$ and the final answer $\mathcal{A}$.

SoftCoT consists of three key components: the soft thought token generation module, the projection module, and the CoT reasoning module.

![image.png](/static/img/6e64664c415145a6662ab53e66e1a89b.image.png)

**Coconut is not applicable to larger language models:** We modify and run the official implementation of Coconut, adapting it to LLaMA-3.1-8B-Instruct. Our findings indicate that Coconut exhibits performance degradation following supervised fine-tuning with the language modeling objective, which can be attributed to the catastrophic forgetting phenomenon. This observation aligns with findings from prior studies.


# 🤖

:::info{title=" "}
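1. Novelty and distinctive contributions of the paper:
   - **Innovations**:
     - Proposes the Soft Chain-of-Thought (SoftCoT) framework, in which an assistant model generates soft thought tokens in continuous space, overcoming conventional CoT's reliance on discrete hard tokens.
     - Uses a projection module to map the assistant model's soft tokens into the main language model's representation space, avoiding full fine-tuning of the main model and thereby mitigating catastrophic forgetting.
     - Uses parameter-efficient fine-tuning that optimizes only the projection module, preserving the main model's pre-trained knowledge while improving reasoning performance.
   - **Distinctive features**:
     - Generates task-specific soft tokens with the assistant model without modifying the main language model, adapting dynamically to different reasoning tasks and improving generalization.
     - Combines the strengths of continuous-space reasoning and chain-of-thought, reducing redundant computation during reasoning and improving efficiency.
     - Covers five reasoning benchmarks and two mainstream language model architectures, supporting the method's robustness and applicability.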

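2. Issues in the paper and suggested improvements:
   - **Issues**:
     - **No validation on larger models**: The experiments use only language models of roughly 7-8B parameters and do not verify SoftCoT on larger models (such as GPT-4 or PaLM).
     - **Soft-token quality depends on the assistant model**: The quality of the assistant model's soft tokens may directly affect reasoning performance, and the paper does not examine in detail how to improve the assistant model or the soft tokens.
     - **Limited dataset coverage**: The paper covers mathematical, commonsense, and symbolic reasoning, but not multimodal reasoning (such as vision-language tasks) or more complex reasoning scenarios.
   - **Suggestions**:
     - Run experiments on larger language models to verify SoftCoT's scalability and analyze how performance changes with model size.
     - Explore ways to improve the assistant model, for example multi-task fine-tuning or knowledge distillation, to raise the quality of its soft tokens.
     - Extend the experiments to multimodal reasoning, verify SoftCoT on vision-language tasks, and develop a projection module suited to multiple modalities.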

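3. New ideas and research directions suggested by the paper's content and results:
   - **Idea 1: Multi-modal SoftCoT**
     - Extend SoftCoT to multimodal tasks such as vision-language reasoning by introducing a visual assistant model that generates vision-related soft tokens and combining them with the language model's soft tokens.
   - **Idea 2: Dynamic Soft Token Generation**
     - Develop a method that adjusts the number and content of soft tokens dynamically according to the complexity and context of the input task, to further improve reasoning efficiency.
   - **Idea 3: Knowledge-enhanced SoftCoT**
     - Integrate external knowledge bases or knowledge graphs so that the assistant model generates knowledge-enhanced soft tokens, improving performance on commonsense reasoning and domain-specific tasks.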

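4. Research plans for the new directions:
   - **Direction 1: Multi-modal SoftCoT**
     - **Method**:
       - Design a visual assistant model (such as CLIP or BLIP) that generates vision-related soft tokens.
       - Develop a multimodal projection module that fuses visual and language soft tokens and maps them into the main language model's representation space.
       - Run experiments on multimodal reasoning datasets (such as VQA or OKVQA) to evaluate the method.
     - **Steps**:
       1. Build the visual assistant model and generate visual soft tokens.
       2. Design the multimodal projection module to fuse visual and language soft tokens.
       3. Train and test on multimodal reasoning tasks and compare with existing methods.
     - **Expected outcome**:
       - A SoftCoT framework suited to multimodal tasks that clearly improves multimodal reasoning performance.
       - Evidence that visual and language soft tokens work together to improve reasoning efficiency and accuracy.

   - **Direction 2: Dynamic Soft Token Generation**
     - **Method**:
       - Introduce a dynamic generation module that adjusts the number and content of soft tokens according to the complexity and context of the input task.
       - Optimize the module's parameters so that it generates the best soft tokens for each task.
     - **Steps**:
       1. Design the dynamic generation module to produce soft tokens from task-specific information.
       2. Run experiments on several reasoning tasks and analyze the module's effect.
       3. Compare with methods that use a fixed number of soft tokens to verify the benefit of dynamic generation.
     - **Expected outcome**:
       - A more flexible way to generate soft chain-of-thought that reduces redundant computation and improves reasoning efficiency.
       - Further gains on complex tasks from adjusting the number of soft tokens dynamically.

   - **Direction 3: Knowledge-enhanced SoftCoT**
     - **Method**:
       - Integrate external knowledge bases or knowledge graphs so that the assistant model generates knowledge-enhanced soft tokens.
       - Optimize the projection module so that it fuses knowledge tokens with the language model's representations effectively.
     - **Steps**:
       1. Build a knowledge-enhanced assistant model that generates knowledge-related soft tokens.
       2. Design a knowledge-fusion projection module that combines knowledge tokens with language tokens.
       3. Run experiments on commonsense reasoning and domain-specific tasks to evaluate the effect of knowledge enhancement.
     - **Expected outcome**:
       - A SoftCoT framework that incorporates external knowledge and clearly improves performance on commonsense reasoning and domain-specific tasks.
       - Evidence that knowledge-enhanced soft tokens help with complex reasoning and improve generalization.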

:::

# Others

[ACL-2025] SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs

> | Resource | Info |
> | :---        |:--- |
> | Paper       | https://arxiv.org/abs/2310.12931 |
> | Code & Data | https://github.com/eureka-research/Eureka |
> | Public      | ICLR |
> | Date        | 2025.06.30 |


# Summary Overview

Large Language Models (LLMs) have excelled as high-level semantic planners for sequential decision-making tasks. However, harnessing them to learn complex low-level manipulation tasks, such as dexterous pen spinning, remains an open problem. We bridge this fundamental gap and present EUREKA, a human-level reward design algorithm powered by LLMs.

![image.png](/static/img/3139fa547109fccf92d2c316284fdba5.image.png)

EUREKA (Evolution-driven Universal REward Kit for Agent) is a novel reward design algorithm powered by coding LLMs, with the following contributions:
1. Achieves human-level performance on reward design
2. Solves dexterous manipulation tasks that were previously not feasible by manual reward engineering
3. Enables a new *gradient-free* in-context learning approach to reinforcement learning from human feedback (RLHF)

# Main Content

**Problem Setting and Definitions**

The goal of reward design is to return a shaped reward function for a ground-truth reward function that may be difficult to optimize directly (e.g., sparse rewards); this ground-truth reward function may only be accessed via queries by the designer.



Definition 2.1. (Reward Design Problem (Singh et al., 2010))
A reward design problem (RDP) is a tuple
$P = \langle M, \mathcal{R}, \pi_M, F \rangle$
where $M = (S, A, T)$ is the world model with state space $S$, action space $A$, and transition function $T$. $\mathcal{R}$ is the space of reward functions. $\mathcal{A}_M(\cdot) : \mathcal{R} \to \Pi$ is a learning algorithm that outputs a policy $\pi : S \to \Delta(A)$ that optimizes reward $R \in \mathcal{R}$ in the resulting Markov Decision Process (MDP), $(M, R)$. $F : \Pi \to \mathbb{R}$ is the fitness function that produces a scalar evaluation of any policy, which may only be accessed via policy queries (i.e., evaluate the policy using the ground truth reward function).

In an RDP, the goal is to output a reward function $R \in \mathcal{R}$ such that the policy
$\pi := \mathcal{A}_M(R)$ that optimizes $R$ achieves the highest fitness score $F(\pi)$.

**Reward Generation Problem.** In our problem setting, every component within an RDP is specified via code. Then, given a string $l$ that specifies the task, the objective of the reward generation problem is to output a reward function code $R$ such that $F(\mathcal{A}_M(R))$ is maximized.
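The objective $\max_R F(\mathcal{A}_M(R))$ can be illustrated with a deliberately tiny toy. This is not EUREKA's algorithm: the one-step "environment", the greedy stand-in for $\mathcal{A}_M$, the sparse stand-in for $F$, and the hand-written candidate rewards (which EUREKA would instead obtain from an LLM) are all assumptions made to show the selection structure only.

```python
def learn_policy(reward_fn, actions):
    """Stand-in for A_M: the 'policy' is the single action that maximizes R."""
    return max(actions, key=reward_fn)

def fitness(action, target=3):
    """Stand-in for F: sparse ground-truth score, queried only through a policy."""
    return 1.0 if action == target else 0.0

def best_reward(candidates, actions):
    """Return (index, fitness) of the candidate R with the highest F(A_M(R))."""
    scored = [(fitness(learn_policy(R, actions)), idx)
              for idx, R in enumerate(candidates)]
    best_score, best_idx = max(scored, key=lambda s: s[0])
    return best_idx, best_score

actions = list(range(6))
candidates = [
    lambda a: a,             # favors large actions, so the policy picks 5
    lambda a: -abs(a - 3),   # shaped distance to the target, so the policy picks 3
    lambda a: -a,            # favors small actions, so the policy picks 0
]
best_idx, best_score = best_reward(candidates, actions)
```

The point of the toy is that candidate rewards are never compared with each other directly: each is judged only by the fitness of the policy it induces, which is why a dense shaped reward can win even though the ground-truth signal is sparse.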

![image.png](/static/img/68b35886254036ee256eed9523a3e6c6.image.png)

EUREKA outperforms human rewards.

![image.png](/static/img/7ed1e273caa3ffec90b4f8762ca1bee0.image.png)

EUREKA consistently improves over time.

![image.png](/static/img/032f1234abc3b7a6ce83c92c4eaa5ddf.image.png)

# 🤖

:::info{title=" "}
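1. Novelty and distinctive contributions of the paper:
   - **The EUREKA framework**: EUREKA is a new reward design algorithm built on coding large language models (LLMs) and evolutionary search. It needs no task-specific prompts or reward templates: using the environment code as context, it generates executable reward functions zero-shot. This greatly reduces the manual trial and error involved in reward design.
   - **Human-level reward generation**: EUREKA performs well across 29 RL environments covering 10 different robot morphologies. On 83% of the tasks its rewards outperform expert-designed rewards, with an average improvement of 52%.
   - **Ability to solve complex tasks**: EUREKA achieves rapid pen spinning with a Shadow Hand for the first time, showing that it adapts to high-dimensional, complex tasks.
   - **Reward reflection**: Through reward reflection, EUREKA improves the reward function dynamically based on training feedback. This provides a fine-grained optimization signal for reward design.
   - **Support for RL from human feedback**: EUREKA can incorporate textual feedback from humans to generate reward functions that better match human intent, offering a gradient-free approach to RLHF.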

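2. Issues in the paper and suggested improvements:
   - **Dependence on a suitable fitness function**: EUREKA needs a well-defined task fitness function, which can be hard to specify for open-ended tasks. One improvement is to use vision-language models (VLMs) to analyze task videos and automatically produce a fitness function or a proxy for human feedback.
   - **Dependence on simulation**: The study is conducted mainly in simulation and lacks broad validation on real robot tasks. It should be extended to more real-world tasks, together with more efficient Sim2Real methods.
   - **No exploration of multi-task learning**: EUREKA currently generates a reward function for one task at a time and does not address reward generation in multi-task settings. Future work could study how to share reward design knowledge across tasks.
   - **Dependence on LLM capability**: Although the experiments show that EUREKA still does reasonably well with the weaker GPT-3.5, its performance remains bounded by the LLM. Future work could explore more efficient coding models or domain-specialized models.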

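3. New ideas and research directions suggested by the paper's content and results:
   - **Idea 1**: Develop a task-description and reward-generation framework based on vision-language models that takes task videos directly as input and generates reward functions automatically.
   - **Idea 2**: Study multi-task reward generation and explore how shared knowledge can improve the efficiency and quality of reward generation across tasks.
   - **Idea 3**: Extend EUREKA to real robot environments, use Sim2Real techniques to verify its applicability on real tasks, and examine its robustness to changes in physical parameters.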

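4. Research plans for the new directions:
   - **Plan 1: Task description and reward generation based on vision-language models**
     - **Goal**: Develop a framework that uses a vision-language model to produce task descriptions and reward functions from task videos.
     - **Method**:
       1. Use a pre-trained vision-language model (such as BLIP-2 or MiniGPT-4) to generate natural-language task descriptions from task videos.
       2. Combine the generated task description with the environment code and feed both to EUREKA to generate reward functions.
       3. Evaluate the generated rewards in standard RL environments and compare them with hand-designed rewards.
     - **Expected outcome**:
       1. The generated reward functions improve the learning efficiency of RL algorithms.
       2. Evidence of the potential of vision-language models for task understanding and reward generation.

   - **Plan 2: Multi-task reward generation**
     - **Goal**: Study how shared knowledge can improve the efficiency and quality of reward generation across tasks.
     - **Method**:
       1. Build environments for several related tasks (for example, tasks for robots of different morphologies).
       2. Add a cross-task reward generation module to EUREKA that exploits task similarity to share reward design knowledge.
       3. Evaluate how much the shared reward generation mechanism improves performance on each task.
     - **Expected outcome**:
       1. An efficient multi-task reward generation algorithm.
       2. Evidence that reward sharing applies to complex tasks.

   - **Plan 3: Extension to real robot environments**
     - **Goal**: Extend EUREKA to real robot tasks and verify its applicability in practice.
     - **Method**:
       1. Choose a representative real robot task (for example, robotic-arm grasping or mobile-robot navigation).
       2. Use Sim2Real techniques to apply EUREKA-generated reward functions to a real robot.
       3. Verify experimentally the robustness and performance of the reward functions in the real environment.
     - **Expected outcome**:
       1. A performance evaluation of EUREKA on real robot tasks.
       2. Reward design methods that improve Sim2Real transfer.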

:::

# Others