Selection, which chooses the child node with the highest priority using the UCB1 formula:
$\mathrm{UCB1} = \frac{w_i}{n_i} + c\sqrt{\frac{\ln N}{n_i}}$
where $w_i$ is the total reward of node $i$, $n_i$ is its visit count, $N$ is the parent node's visit count, and $c$ balances exploration and exploitation.
Expansion adds new child nodes to the tree.
Simulation performs random rollouts from the new nodes to estimate their value.
Backpropagation updates node statistics.
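For concreteness, the following is a minimal, generic sketch of these four phases in Python, with UCB1 used during selection exactly as defined above; the legal_actions, transition, and rollout callables are assumed placeholders that a concrete task would have to supply.

```python
import math
import random

class Node:
    """One search-tree node; `reward` and `visits` play the roles of w_i and n_i."""
    def __init__(self, state, untried_actions, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.untried = list(untried_actions)
        self.visits = 0       # n_i
        self.reward = 0.0     # w_i

    def ucb1(self, c=1.41):
        # UCB1 = w_i / n_i + c * sqrt(ln N / n_i), with N the parent's visit count
        return (self.reward / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root_state, legal_actions, transition, rollout, iterations=1000, c=1.41):
    """legal_actions(state) -> iterable of actions; transition(state, a) -> next state;
    rollout(state) -> scalar reward of a random playout. All three are task-specific
    callables supplied by the caller."""
    root = Node(root_state, legal_actions(root_state))
    for _ in range(iterations):
        node = root
        # 1. Selection: descend via the child with the highest UCB1 score.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ch.ucb1(c))
        # 2. Expansion: add a child node for one untried action.
        if node.untried:
            action = node.untried.pop(random.randrange(len(node.untried)))
            child_state = transition(node.state, action)
            child = Node(child_state, legal_actions(child_state), parent=node)
            node.children.append(child)
            node = child
        # 3. Simulation: random rollout from the new node to estimate its value.
        value = rollout(node.state)
        # 4. Backpropagation: update visit counts and rewards back to the root.
        while node is not None:
            node.visits += 1
            node.reward += value
            node = node.parent
    # Return the most-visited child of the root as the chosen move/step.
    return max(root.children, key=lambda ch: ch.visits)
```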
MCTS has been widely used in tasks such as optimizing strategies in board games like Go and in robotic path planning, where it helps robots navigate dynamic environments effectively.
Analysis of the Features of Reasoning LLMs
Output Behavior Perspective
Slow-thinking models engage in a latent generative process, particularly noticeable during the prediction of subsequent tokens.
Verification and Check Structure:
Analysis of OpenAI's o1 and o3 models indicates that their reasoning frameworks incorporate both macro-level actions for long-term strategic planning and micro-level actions, including "Wait", "Hold on", "Alternatively", and "Let's pause". These micro actions facilitate meticulous verification and iterative checking processes, ensuring precision in task execution.
Studies suggest that constructing Slow-thinking CoT datasets with a focus on hard samples leads to better generalization in fields like medicine and mathematics.
In comparison to simple CoT, Slow-thinking Supervised Fine-Tuning (SFT) data exhibits remarkable sample efficiency, often delivering comparable results with just 1/100th of the sample size.
Training LLMs for slow-thinking, as characterized by the LongCoT approach, results in relatively uniform gradient norms across different layers. In contrast, fast-thinking, exemplified by the simplified CoT method, generates larger gradient magnitudes in the earlier layers, along with significant variability in gradient norms across layers.
Core Method
Structure Search
As shown in Table 1, we classify the actions in prior work into four categories:
Reasoning Steps as Nodes: Actions represent intermediate reasoning steps or decisions, such as selecting rules, applying transformations, or generating subquestions.
Token-level Decisions: Actions involve generating tokens or sequences (e.g., the next word, phrase, or code snippet).
Task-specific Structures: Actions are domain-specific, such as moving blocks in blocksworld, constructing geometry in geometry problem-solving, or modifying workflows in task planning.
Self-correction and Exploration: Actions focus on revisiting, refining, or backtracking to improve previous reasoning steps.
Additionally, as illustrated in Table 1, we classify the reward design into five categories:
Outcome-based Rewards: Rewards focus on the correctness or validity of the final outcome or solution, including the validation of reasoning paths or task success.
Stepwise Evaluations: Rewards are assigned at intermediate steps based on the quality of each step or its contribution toward the final outcome.
Self-evaluation Mechanisms: Rewards rely on the model’s own confidence or self-assessment (e.g., likelihood, next-word probability, or confidence scores); a minimal sketch of such a score is given after this list.
Domain-specific Criteria: Rewards are tailored to specific tasks, such as symmetry and complexity in geometry or alignment with human preferences in text generation.
Iterative Preference Learning: Rewards are derived from comparing multiple solutions or reasoning paths, guiding learning dynamically.
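To ground the self-evaluation category above, a minimal sketch is shown below, assuming a Hugging Face causal LM (GPT-2 as a stand-in) and a hypothetical helper name self_eval_reward: the reward is simply the model's own average token log-probability for a candidate reasoning step given the preceding context.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def self_eval_reward(context: str, step: str) -> float:
    """Average log-probability the model assigns to `step` given `context`;
    higher values indicate greater self-assessed confidence in the step."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    step_ids = tok(step, return_tensors="pt").input_ids
    full_ids = torch.cat([ctx_ids, step_ids], dim=1)
    with torch.no_grad():
        logits = model(full_ids).logits                    # (1, T, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # position j predicts token j+1
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, ctx_ids.shape[1] - 1:].mean().item()  # only the step's tokens

# Example usage (hypothetical prompt):
# self_eval_reward("Q: 12 * 2 = ?\nA:", " 12 * 2 = 24")
```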
Reward Modeling
Outcome supervision emphasizes the correctness of the final answer, providing feedback at a coarse level of granularity; the resulting model is referred to as the Outcome Reward Model (ORM).
Process supervision provides step-by-step labels for the solution trajectory, evaluating the quality of each reasoning step. The resulting model is known as the Process Reward Model (PRM).
PRM offers significant advantages in complex reasoning tasks for several key reasons. First, it provides fine-grained, step-wise supervision, allowing for the identification of specific errors within a solution path. This feature is especially valuable for RL and automated error correction. Second, PRM closely mirrors human reasoning behavior, which relies on accurate intermediate steps to reach correct conclusions. Unlike ORM, PRM avoids situations where incorrect reasoning can still lead to a correct final answer, thus ensuring more robust and interpretable reasoning.
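The difference in granularity can be made concrete with a small sketch; the helpers orm_label and prm_labels and the local-arithmetic checker below are hypothetical stand-ins, not the labeling scheme of any particular surveyed work.

```python
from typing import Callable, List

def orm_label(final_answer: str, gold_answer: str) -> float:
    """Outcome supervision: a single coarse label for the whole trajectory."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

def prm_labels(steps: List[str], step_is_valid: Callable[[List[str], str], bool]) -> List[float]:
    """Process supervision: one label per reasoning step, judged in context by
    `step_is_valid` (an assumed verifier: a human annotator or automatic checker)."""
    return [1.0 if step_is_valid(steps[:i], s) else 0.0 for i, s in enumerate(steps)]

# Toy trajectory: the second step contains an arithmetic slip, yet the final answer is right.
steps = ["4 * 6 = 24", "24 / 2 = 14", "14 + 10 = 24"]
checker = lambda ctx, s: eval(s.split("=")[0]) == float(s.split("=")[1])  # checks local arithmetic only

print(orm_label("24", "24"))          # 1.0 -> ORM rewards the flawed trajectory in full
print(prm_labels(steps, checker))     # [1.0, 0.0, 1.0] -> PRM exposes the erroneous step
```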
Nevertheless, PRMs face several challenges:
Lack of Explanations: Current PRMs often generate scores for reasoning steps without sufficient explanations, limiting interpretability and hindering their usefulness in refining reasoning at test time.
Bias in Training Data: Data collection methods, such as MCTS, tend to introduce distributional biases, assigning disproportionately higher scores to the majority of questions. As a result, PRMs struggle to effectively identify erroneous reasoning steps.
Early-Step Bias: PRMs show lower accuracy in predicting rewards for earlier reasoning steps compared to those closer to the final answer. This issue stems from the increased randomness and uncertainty associated with the initial steps in the reasoning process.
Self Improvement
Macro Action
Reinforcement Fine-Tuning
Reinforcement Fine-Tuning (RFT) is an innovative technique recently introduced by OpenAI, designed to enable developers and engineers to fine-tune existing models for specific domains or complex tasks.
Key advantages include:
Simplified Training Pipeline: RL supervision streamlines data construction and training processes, eliminating the need for complex stepwise search mechanisms.
Enhanced Scalability: Online RL training facilitates efficient scaling on large datasets, particularly for complex reasoning tasks.
Emergent Properties: DeepSeek-R1 demonstrates unique emergent capabilities, such as Long-CoT reasoning, which are difficult to achieve through SFT alone.
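To make the appeal of outcome-reward RL concrete, the following is a schematic toy, not OpenAI's RFT pipeline or DeepSeek-R1's exact recipe: a softmax policy over a few candidate answers is improved with a plain REINFORCE update driven only by a programmatic grader, i.e., a verifiable outcome reward that requires neither stepwise search nor per-step labels.

```python
import numpy as np

rng = np.random.default_rng(0)
candidates = ["12", "24", "42"]        # toy answer space standing in for model outputs
logits = np.zeros(len(candidates))     # "policy parameters"

def grader(answer: str) -> float:
    """Programmatic verifier: reward 1 for the correct answer, 0 otherwise."""
    return 1.0 if answer == "24" else 0.0

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr, baseline = 0.5, 0.0
for _ in range(200):
    probs = softmax(logits)
    a = rng.choice(len(candidates), p=probs)     # sample an answer from the policy
    r = grader(candidates[a])                    # verifiable outcome reward
    baseline = 0.9 * baseline + 0.1 * r          # moving-average baseline reduces variance
    grad = -probs
    grad[a] += 1.0                               # d log pi(a) / d logits
    logits += lr * (r - baseline) * grad         # REINFORCE update

print(softmax(logits))   # probability mass should now concentrate on "24"
```

In practice the policy is the LLM itself and the grader a task-specific verifier or reward model, but the training signal has the same outcome-level shape.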
RFT faces the following challenges:
Unclear Mechanism behind Reasoning: The underlying mechanisms driving the reasoning improvements in DeepSeek-R1 remain poorly understood. For example, while DeepSeek-R1 exhibits emergent properties (e.g., “Emergent Length Increasing”, “Aha moments”), some studies suggest that capabilities like Long-CoT might already exist in the base model, rather than solely emerging from RL training. Furthermore, performance gains observed in smaller models (e.g., Qwen-Math2B/7B) occur without noticeable “Aha moments”, complicating causal interpretations.
Reward Model Saturation: Many existing RL algorithms face reward model saturation, typically manifested as exploration collapse after around 100 training steps. Although DeepSeek-R1 alleviates this issue through specialized reward formatting, methods like ReFT and Satori propose alternating sampling and SFT distillation to combat reward hacking and exploration collapse.
Unstable Long-CoT Generation: Long reasoning chains generated by RFT are prone to instability, including context overflow, failure to return final answers, and sensitivity to reward shaping. For instance, some methods introduce cosine reward functions that inadvertently degrade performance as iterations increase, while O1-Prune uses post-hoc length pruning techniques (via RL/SFT) to stabilize outputs.
Future directions for RFT may include: