
Commit 4e25299

Merge pull request #149 from graphcore-research/pom-2025-05
Pom 2025 05
2 parents fee5a85 + 5eebc9f commit 4e25299

16 files changed (+333, -1 lines)

_posts/papers-of-the-month/2025-04/2025-05-07-motion-prompting-mamba-reasoning-modeling-rewards.md

Lines changed: 1 addition & 1 deletion
@@ -34,4 +34,4 @@ when compared to transformers with chains-of-thought.
 
 ---
 
-{% include paper-summaries.md %}
+{% include paper-summaries.md %}
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
---
title: "May Papers: Parallel scaling, Evolving code, Understanding LLM reasoning"
header:
  teaser: /assets/images/posts/2025-05/potm/twitter_card.png
  image: /assets/images/posts/2025-05/potm/twitter_card.png
  og_image: /assets/images/posts/2025-05/potm/twitter_card.png

date: 2025-06-02T01:00:00-00:00
potm_year: 2025
potm_month: 5

layout: paper-summaries-layout
category: "papers-of-the-month"
toc: true
toc_sticky: true
toc_label: "Papers"
toc_icon: "book"
author.twitter: "GCResearchTeam"
---

Hurtling past the NeurIPS submission deadline into the summer months, we switch from huddling around server rooms to keep warm to babysitting experiments whilst basking in the sun. We've had a bumper month of papers to sift through, and once again we offer summaries of a few of our favourites.

First, [Parallel Scaling Laws for Language Models](#parallel-scaling-laws-for-language-models) proposes a novel method of scaling language-model compute, inspired by classifier-free guidance, that finetunes a model to run multiple forward passes with different learned vector prefixes. We also looked into [AlphaEvolve](#alphaevolve-a-coding-agent-for-scientific-and-algorithmic-discovery), an evolutionary coding agent from Google DeepMind that generates and refines prompts for Gemini to advance the state of the art in algorithm design.

Since it has been a particularly exciting month for contributions on LLM reasoning, we picked two papers to dive into more deeply. In [Soft Thinking](#soft-thinking-unlocking-the-reasoning-potential-of-llms-in-continuous-concept-space) the authors attempt to improve on prior work by using continuous token embeddings rather than discrete tokens during the reasoning phase of text generation. Finally, in [Spurious Rewards](#spurious-rewards-rethinking-training-signals-in-rlvr) the authors find that even rewarding random answers can improve reasoning ability, potentially forcing us to reconsider how we understand post-training techniques for improving the use of test-time compute.

*We hope you enjoy this month's papers as much as we did! If you have thoughts or questions, please reach out to us at [@GCResearchTeam](https://x.com/GCResearchTeam).*

---

{% include paper-summaries.md %}
Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
---
title: "AlphaEvolve: A coding agent for scientific and algorithmic discovery"
paper_authors: "Emilien Dupont, et al."
orgs: "Google DeepMind"
paper_link: "https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/AlphaEvolve.pdf"
tags:
  - AGI
  - Evolutionary Algorithms
  - RAG
potm_year: 2025
potm_month: 5
paper_order: 2 # Editor will decide
image_dir: "/assets/images/posts/2025-05/potm/AlphaEvolve/"
review_author:
  name: "Robert Hu"
  link: "https://scholar.google.com/citations?user=SaxR4ugAAAAJ&hl=en"
hidden: true
---

AlphaEvolve evolves (no pun intended) the seminal [FunSearch](https://www.nature.com/articles/s41586-023-06924-6) method introduced in late 2023. Powered by a frontier model rather than a smaller LLM, it leverages evolutionary algorithms to successively prompt Gemini to find novel solutions with respect to a multi-objective target. The results are quite compelling, improving upon various mathematical results relating to matrix multiplication and even reducing Google Cloud's costs by 0.7%, a staggering saving considering the scale of Google.

### The key idea

AlphaEvolve's workflow begins with an initial program and an automated evaluation function. Marked code blocks are evolved through LLM-proposed diffs, evaluated for quality, and selectively retained in a program database. Prompts are assembled from high-performing ancestors and injected with stochastic formatting, context, and feedback.

The use of powerful Gemini 2.0 models (Flash for speed, Pro for breakthroughs) ensures a mix of exploration and high-quality suggestions. The evolution loop is fully asynchronous and can evaluate expensive programs in parallel. AlphaEvolve can optimise for multiple metrics and adapt across abstraction levels: from evolving functions to full search algorithms.

By exploiting the capacity of LLMs and engineering a feedback loop, it can effectively help Gemini act as an optimiser over arbitrary formats for very vaguely specified tasks, as long as the problem can be expressed in some form of syntax paired with an evaluation criterion.

<img src="{{ page.image_dir | append: 'Fig1.png' | relative_url }}" alt="Overall feedback loop for AlphaEvolve">
<figcaption>Figure 1. AlphaEvolve feedback loop design.</figcaption>

### Their method

AlphaEvolve works by evolving full programs through an LLM-driven, feedback-grounded loop. Candidate solutions are generated as diffs from existing programs, scored via an automated evaluation function, and stored in a database to seed further generations. It supports evolving entire files across multiple languages and can simultaneously optimise for multiple metrics.
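
To make the loop concrete, here is a minimal, hypothetical sketch of the evolve-evaluate-store cycle described above (single-objective and synchronous, unlike the real system, which is asynchronous and multi-objective). The helper callables (`build_prompt`, `llm_propose_diff`, `apply_diff`) and the flat list used as a program database are our own illustrative stand-ins, not AlphaEvolve's actual API.

```python
import random
from typing import Callable, List, Tuple

def evolve(
    initial_program: str,
    evaluate: Callable[[str], float],           # automated evaluation function
    build_prompt: Callable[[list, list], str],  # assembles prompt from ancestors + inspirations
    llm_propose_diff: Callable[[str], str],     # e.g. a call to Gemini Flash/Pro
    apply_diff: Callable[[str, str], str],      # patches the marked code blocks
    iterations: int = 1000,
) -> Tuple[str, float]:
    """Illustrative AlphaEvolve-style loop: propose a diff, evaluate, retain."""
    # Program database: every candidate kept with its score.
    database: List[Tuple[str, float]] = [(initial_program, evaluate(initial_program))]

    for _ in range(iterations):
        # Seed the prompt with high-performing ancestors plus random "inspirations".
        parents = sorted(database, key=lambda p: p[1], reverse=True)[:3]
        inspirations = random.sample(database, k=min(2, len(database)))

        # Ask the LLM for a diff against the best parent, apply it, score it, keep it.
        prompt = build_prompt(parents, inspirations)
        child = apply_diff(parents[0][0], llm_propose_diff(prompt))
        database.append((child, evaluate(child)))

    return max(database, key=lambda p: p[1])
```

The real system additionally co-evolves the meta-prompts themselves and runs many such loops in parallel against a shared program database.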

To understand what makes AlphaEvolve effective, the authors performed ablation studies on two tasks: matrix multiplication and the kissing number problem. The key findings:

1. **No evolution**: Removing the evolutionary loop (i.e., re-prompting the initial program) significantly degrades performance.

2. **No context**: Stripping rich context from prompts leads to worse solutions, confirming that prompt engineering matters.

3. **Small LLMs only**: Using only lightweight models reduces result quality. Strong base models like Gemini Pro make a difference.

4. **No full-file evolution**: Restricting changes to a single function (vs. whole files) limits AlphaEvolve's power and flexibility.

5. **No meta-prompt evolution**: Removing the co-evolution of prompts results in slower progress, showing that prompt-quality co-evolution is a key driver.

Together, these ablations show that AlphaEvolve's strength comes from multiple interacting components, especially full-code evolution, high-quality LLMs, and contextual prompting.

<img src="{{ page.image_dir | append: 'Fig2.png' | relative_url }}" alt="AlphaEvolve methodological gains">
<figcaption>Figure 2. AlphaEvolve ablation study results.</figcaption>

### Results

AlphaEvolve discovered a new algorithm to multiply 4×4 complex matrices using 48 scalar multiplications, beating Strassen's 49 from 1969. It improved tensor decomposition methods for matrix multiplication, set new state-of-the-art results on the Erdős minimum overlap and kissing number problems, and evolved efficient scheduling heuristics that saved 0.7% of Google's data center compute.

The authors further demonstrate that AlphaEvolve is also able to optimise assembly-level code, tuning kernels for Gemini's attention layer to yield a 23% performance improvement and a 1% decrease in wall time.

### Takeaways

One observation here is that frontier models can evidently be used as "general-purpose" optimisers, assuming a well-engineered feedback loop. This can likely be generalised to a product built around an arbitrary frontier model, and may add another avenue for the agentic-LLM community to explore.
Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
---
title: "Parallel Scaling Laws for Language Models"
paper_authors: "Mouxiang Chen et al."
orgs: "Zhejiang University, Qwen Team, Alibaba Group"
paper_link: "https://arxiv.org/abs/2505.10475"
tags:
  - LLMs
  - efficient-inference
  - scaling-laws
  - fine-tuning
potm_year: 2025
potm_month: 5
paper_order: 1 # Editor will decide
image_dir: "/assets/images/posts/2025-05/potm/parallel-scaling-laws/"
review_author:
  name: "Tom Pollak"
  link: "https://www.linkedin.com/in/tompollak/"
hidden: false
---

### The key idea

Researchers at Qwen introduce a new dimension of scaling: parallel forward passes. Their method, PARSCALE, runs $P$ parallel forward passes of the model, each with a different learned prefix. They find that running $P$ parallel passes is equivalent to scaling the model parameters by $O(\log P)$.

<br>
<img src="{{ page.image_dir | append: 'three-scaling-approaches.png' | relative_url }}">
<figcaption>Three scaling approaches: parameter scaling, inference-time scaling, and parallel computation.</figcaption>

### Background

The approach comes from a practical inference bottleneck: for large models, single-batch inference can be memory-bound, especially on resource-constrained edge devices. Rather than increasing model size or generating more reasoning steps, PARSCALE aims to scale a new axis, parallel computation, keeping model size approximately constant while improving performance.

Inspired by techniques like Classifier-Free Guidance (CFG), PARSCALE hypothesizes:

> Scaling parallel computation (while maintaining the nearly constant parameters) enhances the model's capability, with similar effects as scaling parameters.

### Methodology

PARSCALE executes $P$ forward passes in parallel, each conditioned on a unique learned prefix (implemented via prefix tuning). The outputs of the different streams are combined using a learned aggregation MLP.

Unlike inference-time tricks (e.g., beam search or self-consistency), PARSCALE learns the aggregation during _training_, leading to more effective use of parallel compute. Conceptually this is similar to ensembling, but with almost complete parameter sharing between the members.
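A minimal, illustrative sketch of this forward pass (our own simplification under assumptions, not the released implementation): each of the $P$ streams prepends a different learned prefix to the shared backbone's input, and a small learned MLP turns the $P$ sets of outputs into per-token aggregation weights. `base_model` is assumed to map input embeddings directly to logits; real prefix tuning would inject the prefixes into each attention layer's KV cache rather than the input sequence.

```python
import torch
import torch.nn as nn

class ParScaleSketch(nn.Module):
    """Illustrative sketch of PARSCALE-style parallel streams (not the authors' code)."""

    def __init__(self, base_model, d_model: int, vocab_size: int,
                 P: int = 8, prefix_len: int = 48):
        super().__init__()
        self.base_model = base_model   # shared backbone: embeddings in -> logits out (assumed)
        self.P = P
        # One learned prefix per stream (prefix tuning).
        self.prefixes = nn.Parameter(0.02 * torch.randn(P, prefix_len, d_model))
        # Small MLP mapping each stream's outputs to an aggregation weight.
        self.aggregator = nn.Sequential(
            nn.Linear(vocab_size, d_model), nn.SiLU(), nn.Linear(d_model, 1)
        )

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq, d_model)
        batch, seq, _ = token_embeds.shape
        stream_logits = []
        for p in range(self.P):  # in practice the P streams run as one larger batch
            prefix = self.prefixes[p].unsqueeze(0).expand(batch, -1, -1)
            inputs = torch.cat([prefix, token_embeds], dim=1)
            logits = self.base_model(inputs)           # (batch, prefix_len + seq, vocab)
            stream_logits.append(logits[:, -seq:, :])  # keep positions for the real tokens
        stacked = torch.stack(stream_logits, dim=1)    # (batch, P, seq, vocab)

        # Learned aggregation: per-token softmax over streams, weighted sum of outputs.
        weights = torch.softmax(self.aggregator(stacked), dim=1)  # (batch, P, seq, 1)
        return (weights * stacked).sum(dim=1)                     # (batch, seq, vocab)
```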

#### Training Strategy

To reduce training costs, they propose a two-stage approach:

- Stage 1: Standard pre-training (1T tokens)
- Stage 2: Add PARSCALE (20B tokens, 2% overhead)

This dramatically reduces the cost of parallel-scaling training (which requires $P$ forward passes per step), since it is only applied to the final 20B tokens rather than the full 1T.

### Results

<img src="{{ page.image_dir | append: 'parscale-loss-contours.png' | relative_url }}">
<figcaption>PARSCALE scaling-law results: loss contours over model parameters and parallel streams $P$.</figcaption>

#### Coding Tasks (Stack-V2-Python)

| Model Params | P | HumanEval+ (%) |
|--------------|---|----------------|
| 1.6B         | 1 | 33.9           |
| 1.6B         | 8 | 39.1           |
| 4.4B         | 1 | 39.2           |

#### General Tasks (Pile)

| Model Params | P | Avg Score (%) |
|--------------|---|---------------|
| 1.6B         | 1 | 53.1          |
| 1.6B         | 8 | 55.7          |
| 2.8B         | 1 | 55.2          |

For a 1.6B model, scaling to $P=8$ parallel streams achieves performance comparable with a 4.4B model on coding tasks. These efficiency gains are most pronounced at small batch sizes ($\leq 8$), where inference is memory-bound. This makes PARSCALE most suitable for edge deployment scenarios. Compared with parameter scaling to equivalent performance, the authors report:

- 22x less memory increase.
- 6x less latency increase.
- 8x larger KV cache (growing linearly with $P$).

#### Dynamic Parallel Scaling

PARSCALE remains effective with the main parameters frozen, for different values of $P$. This enables dynamic parallel scaling: switching $P$ to adapt model capability dynamically during inference.
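As a rough illustration of this frozen-backbone setup (our own sketch, reusing the hypothetical `ParScaleSketch` wrapper from above): only the prefixes and the aggregation head receive gradients, so different values of $P$ can be trained cheaply against the same frozen model.

```python
import torch

def configure_dynamic_parscale(model: "ParScaleSketch", lr: float = 1e-4):
    """Freeze the shared backbone; train only the PARSCALE-specific parameters
    (learned prefixes + aggregation MLP). Illustrative sketch only."""
    for param in model.base_model.parameters():
        param.requires_grad_(False)
    trainable = [model.prefixes, *model.aggregator.parameters()]
    return torch.optim.AdamW(trainable, lr=lr)
```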

### Takeaways

PARSCALE provides a new axis along which to boost model capability, particularly in resource-constrained single-batch inference. However, the KV cache grows linearly with the number of parallel streams ($P$), so effectiveness may diminish beyond $P=8$ (the largest tested configuration). It is an open question whether the $O(\log P)$ scaling holds for $P \gg 8$.
Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
---
title: "Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space"
paper_authors: "Zhen Zhang et al."
orgs: "University of California, Purdue University, LMSYS Org, Microsoft"
paper_link: "https://arxiv.org/abs/2505.15778"
tags:
  - LLMs
  - test-time-compute
  - reasoning
  - efficient-inference # Use https://graphcore-research.github.io/tags/ as reference
potm_year: 2025
potm_month: 5
paper_order: 3 # Editor will decide
image_dir: "/assets/images/posts/2025-05/potm/soft-thinking/"
review_author:
  name: "Luke Hudlass-Galley"
  link: "https://www.linkedin.com/in/lukehudlassgalley/"
hidden: true
---

### The key idea

Conventional reasoning models generate long reasoning traces, which are typically constrained to the expressivity of the model's vocabulary. This discretisation of reasoning makes it hard to explore multiple paths or ideas, and limits the model's ability to "think" about abstract concepts, as it is always forced to express its thoughts in natural language.

Recent works such as [Coconut](https://arxiv.org/abs/2412.06769) and [Latent Reasoning](https://arxiv.org/abs/2502.05171) have looked into *latent* thoughts, in which the thoughts are not necessarily discrete tokens but rather continuous vectors in some latent space. However, training these methods is non-trivial and scaling to larger models can be very challenging.

In *Soft Thinking*, the authors propose a training-free approach to latent reasoning, in which the "concept tokens" are a probability-weighted mixture of the token embeddings.

<img src="{{ page.image_dir | append: 'soft-thinking-schematic.png' | relative_url }}" alt="Visualisation of the Soft Thinking method.">
<figcaption>Figure 2: Soft Thinking replaces discrete tokens with soft, abstract concept tokens, enabling reasoning in continuous concept space.</figcaption>

### Their method

Typically, reasoning models employ standard LLM inference for generating their reasoning traces: each forward pass $i$ produces a probability distribution over the vocabulary, from which a token $t_i$ is sampled. This token is then embedded using the embedding matrix $\mathbf{E}$ and injected into the model's input.

Mathematically, this can be expressed as
<div>
$$
e_{i+1} = \mathbf{E}[t_i]
$$
</div>
such that
<div>
$$
t_i \sim p_i = \mathrm{LLM}(e_1, \cdots, e_{i})
$$
</div>
where $p_i$ is the probability distribution for the $i$th forward pass, and $\mathrm{LLM}$ is the model.

The sampling operation of LLM inference discretises the model's output, limiting its expressivity. In contrast, Soft Thinking proposes taking a probability-weighted mixture of the input token embeddings, forming a so-called *concept token*. This means the next input embedding can be expressed as
<div>
$$
e_{i+1} = \sum_{k=1}^{|V|} p_i[k] \cdot \mathbf{E}[k]
$$
</div>

This approach means that the input embedding layer and output head do not need to be weight-tied, which can cause issues for other continuous reasoning approaches such as [Coconut](https://arxiv.org/abs/2412.06769).

As the model no longer injects conventional tokens as part of its reasoning trace, over time it drifts into an out-of-distribution regime. To mitigate this, the authors suggest a cold stop mechanism: the entropy of the concept token is measured, and if it falls below a threshold $\tau$ for some number of consecutive steps, a `</think>` token is injected into the sequence to terminate the reasoning trace and commence answer generation. This prevents the model from becoming overconfident, and provides a simple stopping condition for exiting latent-thought generation.
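The generation step is simple enough to sketch. Below is a minimal, illustrative implementation of a single Soft Thinking step with the cold-stop entropy check; `model` is assumed to map input embeddings to next-token logits, and all names are our own rather than the paper's code.

```python
import torch

@torch.no_grad()
def soft_thinking_step(model, embed_matrix, input_embeds, tau=0.1):
    """One Soft Thinking step (illustrative sketch).

    model:        callable, (1, seq, d_model) embeddings -> (1, seq, vocab) logits (assumed)
    embed_matrix: (vocab, d_model) input embedding matrix E
    input_embeds: (1, seq, d_model) embeddings of the sequence so far
    Returns the next concept-token embedding and a low-entropy flag.
    """
    logits = model(input_embeds)[:, -1, :]        # next-token logits, (1, vocab)
    probs = torch.softmax(logits, dim=-1)

    # Concept token: probability-weighted mixture of token embeddings,
    # e_{i+1} = sum_k p_i[k] * E[k], instead of sampling a discrete token.
    concept_embed = probs @ embed_matrix          # (1, d_model)

    # Cold-stop signal: low entropy means the model is confident; after several
    # consecutive low-entropy steps the caller injects `</think>`.
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
    return concept_embed.unsqueeze(1), bool(entropy.item() < tau)
```

The caller appends the returned embedding to `input_embeds`, counts consecutive low-entropy steps, and once the count exceeds a patience threshold appends the embedding of `</think>` and switches back to ordinary discrete decoding for the answer.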

### Results

The authors examine Soft Thinking over a number of mathematical and coding tasks, on three different models: QwQ-32B, DeepSeek-R1-Distill-Qwen-32B, and DeepSeek-R1-Distill-Llama-70B. They find that across all models and tasks there is an improvement in task performance, and very often a reduction in sequence length, indicating that Soft Thinking packs richer conceptual content into each token.

<img src="{{ page.image_dir | append: 'results-table-1.png' | relative_url }}" alt="Results table 1.">
<figcaption>Table 1: Comparison of Soft Thinking and various baseline methods on accuracy and generation length across mathematical datasets. Best results are highlighted in bold.</figcaption>

<img src="{{ page.image_dir | append: 'results-table-2.png' | relative_url }}" alt="Results table 2.">
<figcaption>Table 2: Comparison of Soft Thinking and various baseline methods on accuracy and generation length across coding datasets. Best results are highlighted in bold.</figcaption>

One concern surrounding latent reasoning is the difficulty of interpreting the reasoning trace. While [another recent paper](https://arxiv.org/abs/2505.13775) questions how faithful reasoning traces are to the underlying reasoning anyway, the Soft Thinking authors are still able to generate legible traces, simply by reading off the highest-probability (discrete) token after each forward pass.

<img src="{{ page.image_dir | append: 'probability-distribution.png' | relative_url }}" alt="Probability distribution over a complete reasoning trace.">
<figcaption>Figure 4: An example illustrating the probability distribution of our proposed Soft Thinking method. At each step, top-$k$ token candidates and their probabilities are shown. Red boxes indicate the selected tokens that form the final generated sequence for readability and interpretability.</figcaption>

### Takeaways

Soft Thinking offers a viable way to imbue pre-trained reasoning models with latent reasoning capabilities that permit abstract concepts in a continuous space, without requiring any additional fine-tuning. As the results demonstrate, this offers the opportunity for greater task performance with shorter sequence lengths. While this work doesn't investigate how we can train models to best use the concept space, it does indicate that research in this direction is likely to yield promising results.
