Chain-of-Thought, Tree-of-Thought, and Reasoning Model Advances
Reasoning capabilities have become a key differentiator in the LLM landscape.
Evolution of Reasoning Approaches
Chain-of-Thought (CoT)
Introduced in 2022, CoT revolutionized reasoning:
Standard Prompt:
Q: John has 5 apples...
A: 12 apples
Chain-of-Thought:
Q: John has 5 apples...
Let's think step by step:
1. John starts with 5 apples
2. He gives 3 to Mary → 5 - 3 = 2
3. Mary gives him 7 → 2 + 7 = 9
4. He loses 4 → 9 - 4 = 5
5. He buys 7 more → 5 + 7 = 12
A: 12 apples
Chain-of-Thought Variants
| Variant | Description | Use Case |
|---|---|---|
| Zero-shot CoT | “Let’s think step by step” | Quick wins |
| Few-shot CoT | Exemplars in prompt | Complex reasoning |
| Self-consistency | Multiple paths, vote | Accuracy critical |
| Auto-CoT | Automated exemplar selection | Scale |
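Self-consistency from the table above can be sketched in a few lines: sample several independent reasoning paths, extract each final answer, and take the majority vote. Here `sample_fn` is a hypothetical stand-in for a stochastic LLM call; the toy stand-in below replays canned answers so the sketch runs on its own.

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n=5):
    """Sample n independent reasoning paths and return the majority-vote answer.

    sample_fn is a stand-in for a stochastic LLM call that returns the
    final answer extracted from one chain-of-thought sample.
    """
    answers = [sample_fn(prompt) for _ in range(n)]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

# Toy stand-in: a "model" that answers correctly 3 times out of 5.
_samples = iter(["12", "9", "12", "12", "5"])
answer = self_consistency(lambda p: next(_samples), "Q: John has 5 apples...")
```

The vote filters out reasoning paths that wandered off track, which is why it helps most when individual samples are noisy but better than chance.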
Tree-of-Thought (ToT)
Expands CoT into search:
             Problem
           /    |    \
      Step1  Step1'  Step1''
      /  \      |      /  \
   Good  Bad  Good  Good  Bad
     |          |      |
    ...        ...    ...
Best for: Complex problems with branching decisions
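The tree above amounts to a search over partial reasoning chains. A minimal sketch, assuming an `expand` function that proposes candidate next thoughts and a `score` heuristic (in practice, an LLM self-evaluation) — both names are illustrative, not from any library:

```python
import heapq

def tree_of_thought(expand, score, root, beam=2, depth=3):
    """Beam-search sketch of Tree-of-Thought.

    expand(state) -> candidate next thoughts (the branches in the tree)
    score(state)  -> heuristic value of a partial chain
    Keeps the `beam` best partial chains at each depth, pruning the "Bad"
    branches rather than committing to a single chain as CoT does.
    """
    frontier = [root]
    for _ in range(depth):
        children = [state + [thought] for state in frontier
                    for thought in expand(state)]
        if not children:
            break
        frontier = heapq.nlargest(beam, children, key=score)
    return max(frontier, key=score)

# Toy demo: binary branching, score = sum of the chain so far.
best = tree_of_thought(lambda s: [0, 1], sum, [], beam=2, depth=3)
```

With `beam=1` this degrades to greedy chain-of-thought; widening the beam trades compute for a better chance of recovering from an early misstep.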
Current Reasoning Models
Model Comparison on Reasoning Tasks
| Model | MATH | GPQA | ARC-Challenge | Strategy |
|---|---|---|---|---|
| GPT-4 | 42.5% | 35.7% | 86.4% | Standard |
| GPT-4 + CoT | 52.1% | 42.3% | 89.2% | Best-of-N |
| Claude 3.5 | 48.2% | 38.9% | 88.1% | Constitutional |
| o1 (OpenAI) | 74.6% | 53.6% | 96.1% | Test-time compute |
| DeepSeek-R1 | 71.3% | 51.8% | 91.2% | RL training |
The Test-Time Compute Revolution
OpenAI’s o1 demonstrated that thinking matters as much as training:
Test-Time Compute: "Think before answering"
↓
Model thinks for N tokens internally
↓
More thinking = better reasoning
↓
Tradeoff: Latency vs accuracy
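The loop above can be sketched as a token-budgeted thinking phase: the model keeps extending an internal trace until the budget is spent, then an answer is extracted. `think_step` and `extract` are hypothetical stand-ins for model calls; the knob that matters is `max_tokens`, which is exactly the latency-vs-accuracy tradeoff.

```python
def answer_with_budget(think_step, extract, max_tokens=1024, chunk=128):
    """Sketch of a test-time compute loop (illustrative API, not OpenAI's).

    The model 'thinks' in chunks of tokens until the budget is exhausted,
    then the final answer is extracted from the accumulated trace. A larger
    max_tokens means more internal reasoning at the cost of latency.
    """
    trace, spent = [], 0
    while spent < max_tokens:
        trace.append(think_step(trace))  # extend the internal reasoning trace
        spent += chunk
    return extract(trace)

# Toy demo: each "thought" is just its step index; the answer is the last one.
result = answer_with_budget(lambda trace: len(trace), lambda trace: trace[-1],
                            max_tokens=512, chunk=128)
```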
Techniques for Better Reasoning
Prompting Strategies
- Explicit Steps: “First… Then… Finally…”
- Role Assignment: “Think like a mathematician…”
- Confidence Calibration: “Be sure before answering”
- Error Detection: “Check your work…”
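The four strategies above compose naturally into a single prompt. A minimal sketch (the function name and wording are illustrative, not a standard API):

```python
def build_reasoning_prompt(question, role=None, steps=True, check=True):
    """Assemble a prompt combining the strategies listed above."""
    parts = []
    if role:
        parts.append(f"Think like a {role}.")          # role assignment
    parts.append(question)
    if steps:
        parts.append("First outline the steps, then work through each one. "
                     "Finally, state the answer.")      # explicit steps
    if check:
        parts.append("Check your work before answering.")  # error detection
    return "\n".join(parts)

prompt = build_reasoning_prompt("Q: John has 5 apples...", role="mathematician")
```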
Verification Techniques
| Method | How It Works | Accuracy Gain |
|---|---|---|
| Self-consistency | Multiple generations, vote | +5-10% |
| Process reward | Judge intermediate steps | +8-15% |
| Outcome reward | Judge final answer | +3-7% |
| Ensemble | Combine multiple models | +5-12% |
Verification Model Architecture
┌──────────────┐
│ Question │
└──────┬───────┘
↓
┌──────────────┐
│ Reasoner │──→ Reasoning trace
│ (Primary) │
└──────┬───────┘
↓
┌──────────────┐
│ Verifier │──→ Correctness score
│ (Separate) │
└──────┬───────┘
↓
Accept or Retry
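The accept-or-retry loop in the diagram can be sketched as follows. `reason` and `verify` are hypothetical stand-ins for the two separate models; the toy demo replays a low-scoring trace followed by a passing one.

```python
def verify_and_retry(reason, verify, question, threshold=0.8, max_tries=3):
    """Accept-or-retry loop matching the architecture above.

    reason(question)        -> reasoning trace (the primary 'reasoner')
    verify(question, trace) -> correctness score in [0, 1] (the 'verifier')
    Returns the first trace whose score clears the threshold; if none does
    within max_tries, returns the best-scoring trace seen.
    """
    best_trace, best_score = None, -1.0
    for _ in range(max_tries):
        trace = reason(question)
        score = verify(question, trace)
        if score >= threshold:
            return trace, score          # accept
        if score > best_score:
            best_trace, best_score = trace, score
    return best_trace, best_score        # fall back to the best attempt

# Toy demo: first attempt scores 0.3, the retry scores 0.9 and is accepted.
_scores = {"trace-a": 0.3, "trace-b": 0.9}
_gen = iter(["trace-a", "trace-b"])
trace, score = verify_and_retry(lambda q: next(_gen),
                                lambda q, t: _scores[t], "Q: ...")
```

Keeping the verifier separate from the reasoner reduces the chance that both share the same blind spot, which is the rationale for the two-model split in the diagram.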
Practical Applications
Code Generation
Reasoning models excel at:
- Understanding complex requirements
- Debugging multi-file interactions
- Proposing architectural improvements
- Writing comprehensive tests
Mathematical Reasoning
OpenAI o1 achieved:
- 74.6% on MATH (vs 42.5% GPT-4)
- 83.3% on AIME (math competition)
- 93.2% on Putnam Math Competition
Scientific Reasoning
Emerging applications:
- Literature review synthesis
- Hypothesis generation
- Experimental design
- Data analysis
Limitations and Challenges
| Limitation | Current State | Research Direction |
|---|---|---|
| Hallucination in steps | Partial solution | Better verification |
| Compute cost | High for o1 | Efficiency research |
| Long chains | Degradation | Hierarchical reasoning |
| Domain expertise | Variable | Domain fine-tuning |