<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Neurocoder Tales: Compedium]]></title><description><![CDATA[Each post is a review of several papers, offering a comprehensive view of an important topic.]]></description><link>https://hungleai.substack.com/s/compedium</link><image><url>https://substackcdn.com/image/fetch/$s_!BcOO!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd01a0a6d-2805-4bf0-8d97-0154361f2f3e_1024x1024.png</url><title>Neurocoder Tales: Compedium</title><link>https://hungleai.substack.com/s/compedium</link></image><generator>Substack</generator><lastBuildDate>Tue, 05 May 2026 20:49:55 GMT</lastBuildDate><atom:link href="https://hungleai.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Hung Le]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[hungleai@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[hungleai@substack.com]]></itunes:email><itunes:name><![CDATA[Hung Le]]></itunes:name></itunes:owner><itunes:author><![CDATA[Hung Le]]></itunes:author><googleplay:owner><![CDATA[hungleai@substack.com]]></googleplay:owner><googleplay:email><![CDATA[hungleai@substack.com]]></googleplay:email><googleplay:author><![CDATA[Hung Le]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Agree or Disagree: A Review of Multi-Agent Debate ]]></title><description><![CDATA[Recent advances in Multi-agent Debate (MAD)]]></description><link>https://hungleai.substack.com/p/agree-or-disagree-a-review-of-multi</link><guid 
isPermaLink="false">https://hungleai.substack.com/p/agree-or-disagree-a-review-of-multi</guid><dc:creator><![CDATA[Hung Le]]></dc:creator><pubDate>Tue, 21 Apr 2026 08:06:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7s8w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bb0e77f-7c9f-4016-b8d4-a2bee3526379_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Agents are everywhere. Every week, a new framework promises to solve complex problems by orchestrating multiple LLMs into pipelines, graphs, and hierarchies of  sophistication. The agentic ecosystem has exploded, and with it, a dizzying array of architectures for making LLMs work together. But beneath the engineering complexity, a surprisingly fundamental question has gone underexplored: &#129504; <em>what actually happens when multiple LLMs communicate with each other, and does it make them smarter?</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7s8w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bb0e77f-7c9f-4016-b8d4-a2bee3526379_1024x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7s8w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bb0e77f-7c9f-4016-b8d4-a2bee3526379_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!7s8w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bb0e77f-7c9f-4016-b8d4-a2bee3526379_1024x1024.webp 848w, 
https://substackcdn.com/image/fetch/$s_!7s8w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bb0e77f-7c9f-4016-b8d4-a2bee3526379_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!7s8w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bb0e77f-7c9f-4016-b8d4-a2bee3526379_1024x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7s8w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bb0e77f-7c9f-4016-b8d4-a2bee3526379_1024x1024.webp" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2bb0e77f-7c9f-4016-b8d4-a2bee3526379_1024x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Generated image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Generated image" title="Generated image" srcset="https://substackcdn.com/image/fetch/$s_!7s8w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bb0e77f-7c9f-4016-b8d4-a2bee3526379_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!7s8w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bb0e77f-7c9f-4016-b8d4-a2bee3526379_1024x1024.webp 848w, 
https://substackcdn.com/image/fetch/$s_!7s8w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bb0e77f-7c9f-4016-b8d4-a2bee3526379_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!7s8w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bb0e77f-7c9f-4016-b8d4-a2bee3526379_1024x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Multi-agent Debate. 
Source: Sora.</figcaption></figure></div><h4>Table of Contents</h4><ul><li><p><a href="https://hungleai.substack.com/i/194030888/introduction">Introduction</a></p></li><li><p><a href="https://hungleai.substack.com/i/194030888/chapter-1-broadcasting-the-messages-with-society-of-minds">Chapter 1: Broadcasting the Messages with Society of Minds</a></p></li><li><p><a href="https://hungleai.substack.com/i/194030888/chapter-2-giving-agents-identities-the-judge-and-the-tit-for-tat">Chapter 2: Giving Agents Identities &#8212; The Judge and the Tit-for-Tat</a></p></li><li><p><a href="https://hungleai.substack.com/i/194030888/chapter-3-agents-that-know-what-they-dont-know">Chapter 3: Agents That Know What They Don&#8217;t Know</a></p></li><li><p><a href="https://hungleai.substack.com/i/194030888/chapter-4-the-memory-problem-and-mad-with-memory-masking">Chapter 4: The Memory Problem and MAD with Memory Masking</a></p></li><li><p><a href="https://hungleai.substack.com/i/194030888/chapter-5-the-signal-you-were-ignoring-diversity-aware-retention">Chapter 5: The Signal You Were Ignoring: Diversity-Aware 
Retention</a></p></li></ul><div><hr></div><h2>Introduction</h2><p>One research thread has been quietly and carefully examining the communication protocols of multi-agent LLMs since 2023. Rather than building elaborate tool-use pipelines or role-playing simulations, it zooms in on the simplest possible form of multi-agent interaction: debate. Multiple LLMs receive the same task and no tools; they simply argue with each other until they converge on an answer. This research direction aims to answer two questions:</p><p>&#129504; <em>Does structured disagreement between LLMs improve reasoning quality? And if so, how should agents communicate to make that happen?</em></p><p>These questions turn out to be much harder than they look. And like most good ideas in deep learning, the story isn&#8217;t one clean paper. Instead, it&#8217;s a progression of insights, each one exposing a new failure mode of the previous approach. This post traces that progression from the ground up.</p><div><hr></div><h2>Chapter 1: Broadcasting the Messages with Society of Minds</h2><p>The idea begins with a philosophical intuition borrowed from Marvin Minsky&#8217;s 1988 book <em>The Society of Mind</em>: intelligence emerges not from a single powerful process, but from the interaction of many simpler ones. Du et al. [1] translated this directly into a &#128073;<strong>Society of Minds</strong> prompting framework for LLM agents.</p><p>The setup is remarkably simple. 
Given a query <em>q</em>:</p><ul><li><p><em>N</em> instances of an LLM each independently answer a question</p></li><li><p>Each agent reads all other agents&#8217; responses and updates its own answer</p></li><li><p>This repeats for several rounds until a consensus is reached</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HFp6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07c5fa4b-e323-4cbb-b971-d3c6dc934069_1060x293.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HFp6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07c5fa4b-e323-4cbb-b971-d3c6dc934069_1060x293.png 424w, https://substackcdn.com/image/fetch/$s_!HFp6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07c5fa4b-e323-4cbb-b971-d3c6dc934069_1060x293.png 848w, https://substackcdn.com/image/fetch/$s_!HFp6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07c5fa4b-e323-4cbb-b971-d3c6dc934069_1060x293.png 1272w, https://substackcdn.com/image/fetch/$s_!HFp6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07c5fa4b-e323-4cbb-b971-d3c6dc934069_1060x293.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HFp6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07c5fa4b-e323-4cbb-b971-d3c6dc934069_1060x293.png" width="1060" height="293" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07c5fa4b-e323-4cbb-b971-d3c6dc934069_1060x293.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:293,&quot;width&quot;:1060,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:112495,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/194030888?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07c5fa4b-e323-4cbb-b971-d3c6dc934069_1060x293.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HFp6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07c5fa4b-e323-4cbb-b971-d3c6dc934069_1060x293.png 424w, https://substackcdn.com/image/fetch/$s_!HFp6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07c5fa4b-e323-4cbb-b971-d3c6dc934069_1060x293.png 848w, https://substackcdn.com/image/fetch/$s_!HFp6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07c5fa4b-e323-4cbb-b971-d3c6dc934069_1060x293.png 1272w, https://substackcdn.com/image/fetch/$s_!HFp6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07c5fa4b-e323-4cbb-b971-d3c6dc934069_1060x293.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Examples of debating rounds and how the agents converge on the final answer. Source: [1].</figcaption></figure></div><p>The debate prompt itself is simple, something like: <em>&#8220;Based on the opinions of other agents, can you give an updated response?&#8221;</em> In practice, the debate runs for a fixed number of rounds <em>R</em>. 
Regardless of whether full consensus is reached, the final answer is obtained in one of three ways: taking the first agent's response in the last round, majority voting across all agents, or using an LLM as a judge to select among the final responses.</p><p>Formally, at round <em>r</em>, agent <em>i</em> generates:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;g_{r,i} \\sim \\pi_\\theta(\\cdot \\mid q, \\mathcal{G}_{r-1})&quot;,&quot;id&quot;:&quot;VGEHEMFVDK&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>G&#7523;&#8331;&#8321; = {g&#7523;&#8331;&#8321;,&#8321;, ..., g&#7523;&#8331;&#8321;,&#8345;}</em> is the full set of responses from all agents in the previous round, so every agent conditions on every other agent&#8217;s latest answer. The resulting communication topology is fully connected:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gnoj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93f86b-9dd5-4fe8-a480-a4fcbe77c98a_565x531.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gnoj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93f86b-9dd5-4fe8-a480-a4fcbe77c98a_565x531.png 424w, https://substackcdn.com/image/fetch/$s_!gnoj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93f86b-9dd5-4fe8-a480-a4fcbe77c98a_565x531.png 848w, https://substackcdn.com/image/fetch/$s_!gnoj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93f86b-9dd5-4fe8-a480-a4fcbe77c98a_565x531.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gnoj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93f86b-9dd5-4fe8-a480-a4fcbe77c98a_565x531.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gnoj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93f86b-9dd5-4fe8-a480-a4fcbe77c98a_565x531.png" width="287" height="269.729203539823" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc93f86b-9dd5-4fe8-a480-a4fcbe77c98a_565x531.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:531,&quot;width&quot;:565,&quot;resizeWidth&quot;:287,&quot;bytes&quot;:78080,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/194030888?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93f86b-9dd5-4fe8-a480-a4fcbe77c98a_565x531.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gnoj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93f86b-9dd5-4fe8-a480-a4fcbe77c98a_565x531.png 424w, https://substackcdn.com/image/fetch/$s_!gnoj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93f86b-9dd5-4fe8-a480-a4fcbe77c98a_565x531.png 848w, 
https://substackcdn.com/image/fetch/$s_!gnoj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93f86b-9dd5-4fe8-a480-a4fcbe77c98a_565x531.png 1272w, https://substackcdn.com/image/fetch/$s_!gnoj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc93f86b-9dd5-4fe8-a480-a4fcbe77c98a_565x531.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Communication network in the Society of Minds MAD (5 agents). 
Source: [X]</figcaption></figure></div><p>Despite its simplicity, the empirical results were striking for their time. On arithmetic, factual QA, chess move validity, and other benchmarks, multi-agent debate consistently outperformed single-agent baselines, chain-of-thought, and reflection-based methods. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lGzq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a6d1eb9-2229-46db-9916-94e35628f82c_426x291.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lGzq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a6d1eb9-2229-46db-9916-94e35628f82c_426x291.png 424w, https://substackcdn.com/image/fetch/$s_!lGzq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a6d1eb9-2229-46db-9916-94e35628f82c_426x291.png 848w, https://substackcdn.com/image/fetch/$s_!lGzq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a6d1eb9-2229-46db-9916-94e35628f82c_426x291.png 1272w, https://substackcdn.com/image/fetch/$s_!lGzq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a6d1eb9-2229-46db-9916-94e35628f82c_426x291.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lGzq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a6d1eb9-2229-46db-9916-94e35628f82c_426x291.png" width="426" height="291" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a6d1eb9-2229-46db-9916-94e35628f82c_426x291.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:291,&quot;width&quot;:426,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26810,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/194030888?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a6d1eb9-2229-46db-9916-94e35628f82c_426x291.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lGzq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a6d1eb9-2229-46db-9916-94e35628f82c_426x291.png 424w, https://substackcdn.com/image/fetch/$s_!lGzq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a6d1eb9-2229-46db-9916-94e35628f82c_426x291.png 848w, https://substackcdn.com/image/fetch/$s_!lGzq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a6d1eb9-2229-46db-9916-94e35628f82c_426x291.png 1272w, https://substackcdn.com/image/fetch/$s_!lGzq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a6d1eb9-2229-46db-9916-94e35628f82c_426x291.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>More interestingly, debate helped even when all agents started wrong. 
Agents would sometimes arrive at the correct answer through mutual critique, even if no individual agent held it initially.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XPEu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff642032b-5730-49f0-9838-8212628ae05f_1074x795.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XPEu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff642032b-5730-49f0-9838-8212628ae05f_1074x795.png 424w, https://substackcdn.com/image/fetch/$s_!XPEu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff642032b-5730-49f0-9838-8212628ae05f_1074x795.png 848w, https://substackcdn.com/image/fetch/$s_!XPEu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff642032b-5730-49f0-9838-8212628ae05f_1074x795.png 1272w, https://substackcdn.com/image/fetch/$s_!XPEu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff642032b-5730-49f0-9838-8212628ae05f_1074x795.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XPEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff642032b-5730-49f0-9838-8212628ae05f_1074x795.png" width="1074" height="795" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f642032b-5730-49f0-9838-8212628ae05f_1074x795.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1074,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:336598,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/194030888?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff642032b-5730-49f0-9838-8212628ae05f_1074x795.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XPEu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff642032b-5730-49f0-9838-8212628ae05f_1074x795.png 424w, https://substackcdn.com/image/fetch/$s_!XPEu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff642032b-5730-49f0-9838-8212628ae05f_1074x795.png 848w, https://substackcdn.com/image/fetch/$s_!XPEu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff642032b-5730-49f0-9838-8212628ae05f_1074x795.png 1272w, https://substackcdn.com/image/fetch/$s_!XPEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff642032b-5730-49f0-9838-8212628ae05f_1074x795.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Both agents initially answer incorrectly, yet ChatGPT leverages Bard's wrong response to arrive at the correct answer. Source: [1].</figcaption></figure></div><blockquote><p>&#128064; Society of Mind&#8217;s key finding: broadcasting peer responses and iterating creates a self-correction pressure that single-agent reflection simply cannot replicate.</p></blockquote><p>However, the approach has a clear flaw. When agents were confidently wrong and agreed with each other, the debate failed to correct the error: models would affirm the incorrect consensus rather than challenge it. The paper even noted that LMs do not express their uncertainty when generating responses, and that correcting this would likely improve performance substantially. 
That observation planted the seed for several methods that followed.</p><p>&#10060; Society of Mind&#8217;s core problem: it broadcasts everything to everyone with no structure, no role differentiation, and no signal about which message to trust.</p><p>A simple mitigation is to simplify the communication topology by using a sparse connection graph (&#128073;<strong>Sparse MAD</strong> [6]). This not only reduces communication costs but also potentially improves performance when the dense connection creates too much noise.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m9cU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5313ac03-751e-4dcd-bf2b-3e132737a5da_818x967.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m9cU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5313ac03-751e-4dcd-bf2b-3e132737a5da_818x967.png 424w, https://substackcdn.com/image/fetch/$s_!m9cU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5313ac03-751e-4dcd-bf2b-3e132737a5da_818x967.png 848w, https://substackcdn.com/image/fetch/$s_!m9cU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5313ac03-751e-4dcd-bf2b-3e132737a5da_818x967.png 1272w, https://substackcdn.com/image/fetch/$s_!m9cU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5313ac03-751e-4dcd-bf2b-3e132737a5da_818x967.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!m9cU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5313ac03-751e-4dcd-bf2b-3e132737a5da_818x967.png" width="818" height="967" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5313ac03-751e-4dcd-bf2b-3e132737a5da_818x967.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:967,&quot;width&quot;:818,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:154434,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/194030888?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5313ac03-751e-4dcd-bf2b-3e132737a5da_818x967.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!m9cU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5313ac03-751e-4dcd-bf2b-3e132737a5da_818x967.png 424w, https://substackcdn.com/image/fetch/$s_!m9cU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5313ac03-751e-4dcd-bf2b-3e132737a5da_818x967.png 848w, https://substackcdn.com/image/fetch/$s_!m9cU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5313ac03-751e-4dcd-bf2b-3e132737a5da_818x967.png 1272w, https://substackcdn.com/image/fetch/$s_!m9cU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5313ac03-751e-4dcd-bf2b-3e132737a5da_818x967.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">Example of Sparse MAD: From dense to neighbor-connected topology. Source: [6].</figcaption></figure></div><p>&#10060; Yet, a critical limitation remains: static topology treats all messages equally, failing to distinguish critical information from redundant noise, risking both useful context being discarded and irrelevant signals being propagated. 
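</p><p>To make the cost argument concrete, here is a minimal sketch contrasting a dense topology with a neighbor-connected one. It assumes a ring in which each agent reads only its two adjacent peers; the ring layout and the helper names <code>dense_peers</code>, <code>ring_peers</code>, and <code>messages_per_round</code> are illustrative assumptions, not the construction from [6].</p>

```python
def dense_peers(n: int) -> dict[int, list[int]]:
    """Dense topology: every agent reads every other agent's response."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def ring_peers(n: int) -> dict[int, list[int]]:
    """Assumed sparse topology: each agent reads only its two ring neighbors."""
    return {i: sorted({(i - 1) % n, (i + 1) % n} - {i}) for i in range(n)}


def messages_per_round(peers: dict[int, list[int]]) -> int:
    """Count the peer responses placed into agent contexts in one round."""
    return sum(len(sources) for sources in peers.values())


if __name__ == "__main__":
    n_agents = 6
    print("dense:", messages_per_round(dense_peers(n_agents)))
    print("ring: ", messages_per_round(ring_peers(n_agents)))
```

<p>With six agents, the dense graph feeds 30 peer responses into agent contexts per round, while the ring feeds 12. Either way, who reads whom is fixed by agent index and never by message content, which is the limitation described above.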
</p><p>This raises an important research question: &#129504; <em>How can we meaningfully cut the communication cost?</em> We will come back to this question in Chapters 4 and 5.</p><div><hr></div><h2>Chapter 2: Giving Agents Identities &#8212; The Judge and the Tit-for-Tat</h2><p>Du et al. showed that debate helps. But their agents were identical clones: the same LLM, the same prompt, no differentiation. Every agent played the same role in the same way. &#129504; <em>What if debate works better when agents are structurally forced to be different?</em></p><p>Liang et al. [2] introduced a structurally different MAD formulation motivated by a specific failure mode they called the <em>Degeneration-of-Thought (DoT)</em> problem. The authors observe that when a model is asked to self-reflect, it tends to converge almost immediately to whatever it initially said. Once an LLM has established confidence in its solution, it becomes effectively incapable of generating genuinely novel thinking in subsequent iterations, even when the initial answer is wrong. Self-reflection, despite its appeal, has a stubborn ceiling.</p><blockquote><p>&#128064; DoT was originally identified as a problem of a single self-reflecting agent. It can also happen in MAD if all agents are treated equally.</p></blockquote><p>The fix is structural: rather than allowing agents to freely update their opinions, Liang et al. enforce a &#8220;tit for tat&#8221; state in which agents are specifically instructed to challenge each other&#8217;s positions. 
This is explicitly stated in a meta prompt for all agents:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!boyg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6baf2ca-15b7-4d4b-8a3a-1c8f2c063674_539x231.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!boyg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6baf2ca-15b7-4d4b-8a3a-1c8f2c063674_539x231.png 424w, https://substackcdn.com/image/fetch/$s_!boyg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6baf2ca-15b7-4d4b-8a3a-1c8f2c063674_539x231.png 848w, https://substackcdn.com/image/fetch/$s_!boyg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6baf2ca-15b7-4d4b-8a3a-1c8f2c063674_539x231.png 1272w, https://substackcdn.com/image/fetch/$s_!boyg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6baf2ca-15b7-4d4b-8a3a-1c8f2c063674_539x231.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!boyg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6baf2ca-15b7-4d4b-8a3a-1c8f2c063674_539x231.png" width="431" height="184.71428571428572" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6baf2ca-15b7-4d4b-8a3a-1c8f2c063674_539x231.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:231,&quot;width&quot;:539,&quot;resizeWidth&quot;:431,&quot;bytes&quot;:53363,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/194030888?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6baf2ca-15b7-4d4b-8a3a-1c8f2c063674_539x231.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!boyg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6baf2ca-15b7-4d4b-8a3a-1c8f2c063674_539x231.png 424w, https://substackcdn.com/image/fetch/$s_!boyg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6baf2ca-15b7-4d4b-8a3a-1c8f2c063674_539x231.png 848w, https://substackcdn.com/image/fetch/$s_!boyg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6baf2ca-15b7-4d4b-8a3a-1c8f2c063674_539x231.png 1272w, https://substackcdn.com/image/fetch/$s_!boyg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6baf2ca-15b7-4d4b-8a3a-1c8f2c063674_539x231.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Then, two debater roles are constructed to take affirmative and negative sides of the argument. 
To create these roles for the agent, the paper uses prompting techniques:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wVqN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f546a0a-bc3e-4ed3-a1ff-a9948aa86fc7_562x311.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wVqN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f546a0a-bc3e-4ed3-a1ff-a9948aa86fc7_562x311.png 424w, https://substackcdn.com/image/fetch/$s_!wVqN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f546a0a-bc3e-4ed3-a1ff-a9948aa86fc7_562x311.png 848w, https://substackcdn.com/image/fetch/$s_!wVqN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f546a0a-bc3e-4ed3-a1ff-a9948aa86fc7_562x311.png 1272w, https://substackcdn.com/image/fetch/$s_!wVqN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f546a0a-bc3e-4ed3-a1ff-a9948aa86fc7_562x311.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wVqN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f546a0a-bc3e-4ed3-a1ff-a9948aa86fc7_562x311.png" width="394" height="218.03202846975088" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f546a0a-bc3e-4ed3-a1ff-a9948aa86fc7_562x311.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:311,&quot;width&quot;:562,&quot;resizeWidth&quot;:394,&quot;bytes&quot;:62174,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/194030888?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f546a0a-bc3e-4ed3-a1ff-a9948aa86fc7_562x311.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wVqN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f546a0a-bc3e-4ed3-a1ff-a9948aa86fc7_562x311.png 424w, https://substackcdn.com/image/fetch/$s_!wVqN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f546a0a-bc3e-4ed3-a1ff-a9948aa86fc7_562x311.png 848w, https://substackcdn.com/image/fetch/$s_!wVqN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f546a0a-bc3e-4ed3-a1ff-a9948aa86fc7_562x311.png 1272w, https://substackcdn.com/image/fetch/$s_!wVqN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f546a0a-bc3e-4ed3-a1ff-a9948aa86fc7_562x311.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Finally, a separate judge agent observes the full debate and determines when sufficient disagreement has been expressed to call a conclusion. The judge serves two functions. 
First, it acts as a moderator, preventing premature convergence by enforcing continued argumentation when the debate has not yet been sufficiently resolved. Second, it functions as an adaptive stopping criterion: once the debate has sufficiently explored the disagreement space, the judge extracts the final answer rather than waiting for a fixed number of rounds.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3rTv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7945bc96-51ec-449c-8af4-5cda1d05e5d1_956x773.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3rTv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7945bc96-51ec-449c-8af4-5cda1d05e5d1_956x773.png 424w, https://substackcdn.com/image/fetch/$s_!3rTv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7945bc96-51ec-449c-8af4-5cda1d05e5d1_956x773.png 848w, https://substackcdn.com/image/fetch/$s_!3rTv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7945bc96-51ec-449c-8af4-5cda1d05e5d1_956x773.png 1272w, https://substackcdn.com/image/fetch/$s_!3rTv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7945bc96-51ec-449c-8af4-5cda1d05e5d1_956x773.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3rTv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7945bc96-51ec-449c-8af4-5cda1d05e5d1_956x773.png" width="956" height="773" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7945bc96-51ec-449c-8af4-5cda1d05e5d1_956x773.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:773,&quot;width&quot;:956,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:229703,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/194030888?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7945bc96-51ec-449c-8af4-5cda1d05e5d1_956x773.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3rTv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7945bc96-51ec-449c-8af4-5cda1d05e5d1_956x773.png 424w, https://substackcdn.com/image/fetch/$s_!3rTv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7945bc96-51ec-449c-8af4-5cda1d05e5d1_956x773.png 848w, https://substackcdn.com/image/fetch/$s_!3rTv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7945bc96-51ec-449c-8af4-5cda1d05e5d1_956x773.png 1272w, https://substackcdn.com/image/fetch/$s_!3rTv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7945bc96-51ec-449c-8af4-5cda1d05e5d1_956x773.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">MAD with asymmetric agent roles. Source: [2].</figcaption></figure></div><p>This is a significant architectural departure from Society of Mind:</p><p>&#10060; Society of Mind: symmetric agents, same prompt, same role, same update rule</p><p>&#10004;&#65039; &#128073;<strong>MAD with judge-and-debater</strong>: asymmetric roles, adversarial framing, adaptive termination</p><p>The results on counter-intuitive arithmetic and commonsense machine translation showed that the tit-for-tat structure genuinely helped prevent premature convergence. In one illustrative example, agents initially made the classic mistake of averaging Alice&#8217;s uphill and downhill speeds to compute her average speed. 
The adversarial framing forced one agent to challenge this, eventually surfacing the correct method of dividing total distance by total time.</p><blockquote><p>&#128064; Role assignment changes the <em>pressure structure</em> of debate. When agents are forced to maintain opposing stances, they produce reasoning paths they wouldn&#8217;t have generated under free updating.</p></blockquote><p>One weakness is that the judge&#8217;s quality matters enormously. When a weaker LLM is used as the judge while stronger models debate, performance degrades because the judge cannot reliably adjudicate between sophisticated arguments it can&#8217;t fully evaluate. </p><p>&#10060; The debate&#8217;s ceiling is set by the judge&#8217;s capacity.</p><div><hr></div><h2>Chapter 3: Agents That Know What They Don&#8217;t Know</h2><p>By this point, the field had two strong intuitions. First, debate helps because it creates multiple non-overlapping reasoning paths. Second, the debate&#8217;s outcome depends heavily on which agents get the most influence. In Society of Mind, all agents are treated equally. In the judge-and-debater framework, influence is determined by assigned role. But neither approach directly addresses the natural question: &#129504; <em>What if some agents genuinely know more about this particular question than others?</em></p><p>Lin and Hooi [3] built their method, &#128073;<strong>ConfMAD</strong>, around this question. The core insight is that LLMs do have internal uncertainty signals [4], but they don&#8217;t communicate them by default. Confidence expression, the explicit verbal or numerical communication of how certain an agent is about its answer, can reshape the influence dynamics of debate in a principled way.</p><p>ConfMAD adds confidence expression throughout the debate process. At round <em>r = 0</em>, each debater independently generates a structured triple: (Reason, Answer, Confidence Score). 
In subsequent rounds, each debater reads the full debate history, which now includes the confidence scores of all prior responses, and conditions its update accordingly. A high-confidence agent saying &#8220;42, confidence 95&#8221; should exert more pull on the next round&#8217;s responses than a hedging agent saying &#8220;42, confidence 40.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8mj8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F530f46f2-547c-4c1e-a23b-f6da6c26fa9b_740x290.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8mj8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F530f46f2-547c-4c1e-a23b-f6da6c26fa9b_740x290.png 424w, https://substackcdn.com/image/fetch/$s_!8mj8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F530f46f2-547c-4c1e-a23b-f6da6c26fa9b_740x290.png 848w, https://substackcdn.com/image/fetch/$s_!8mj8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F530f46f2-547c-4c1e-a23b-f6da6c26fa9b_740x290.png 1272w, https://substackcdn.com/image/fetch/$s_!8mj8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F530f46f2-547c-4c1e-a23b-f6da6c26fa9b_740x290.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8mj8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F530f46f2-547c-4c1e-a23b-f6da6c26fa9b_740x290.png" width="562" height="220.24324324324326" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/530f46f2-547c-4c1e-a23b-f6da6c26fa9b_740x290.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:290,&quot;width&quot;:740,&quot;resizeWidth&quot;:562,&quot;bytes&quot;:57469,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/194030888?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F530f46f2-547c-4c1e-a23b-f6da6c26fa9b_740x290.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8mj8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F530f46f2-547c-4c1e-a23b-f6da6c26fa9b_740x290.png 424w, https://substackcdn.com/image/fetch/$s_!8mj8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F530f46f2-547c-4c1e-a23b-f6da6c26fa9b_740x290.png 848w, https://substackcdn.com/image/fetch/$s_!8mj8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F530f46f2-547c-4c1e-a23b-f6da6c26fa9b_740x290.png 1272w, https://substackcdn.com/image/fetch/$s_!8mj8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F530f46f2-547c-4c1e-a23b-f6da6c26fa9b_740x290.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Confidence or uncertainty scores are incorporated into the prompt. 
Source: [3]</figcaption></figure></div><p></p><p>The design reveals a subtle tension: miscalibrated confidence scores can actively damage debate quality. If an agent is confidently wrong, it dominates the update and pulls other agents toward the error. If an agent is genuinely uncertain but labels itself as confident, the same harm occurs. The authors therefore pair confidence expression with a calibration step, applying temperature scaling to align the model&#8217;s expressed confidence with its empirical accuracy.</p><p>The confidence dynamics interact with debate in two pathological ways that ConfMAD must guard against:</p><ul><li><p>Stubbornness: a high-confidence agent refuses to update even when peer responses present compelling evidence for a different answer</p></li><li><p>Premature convergence: agents with low confidence collapse immediately to the majority answer before exploring alternative reasoning paths</p></li></ul><p>The calibrated confidence scores help mitigate both: uncertain agents are encouraged to explore longer; confident agents are given appropriate weight without completely dominating.</p><p>However, two key limitations remain:</p><p>&#10060; Confidence calibration is never perfect, risking bias in the debate process</p><p>&#10060; There is no mechanism to preserve the diversity of agent reasoning throughout the debate </p><p>More importantly, adding uncertainty scores to the prompt lengthens the context, which can add noise and computational cost. One promising direction to reduce communication cost is to filter messages from low-confidence agents. However, using a fixed threshold (e.g., uncertainty &lt; 0.5) is fragile since it risks discarding valuable information due to the inherent unreliability of uncertainty estimation. This remains an open problem without a satisfying solution.</p><div><hr></div><h2>Chapter 4: The Memory Problem and MAD with Memory Masking</h2><p>Society of Mind broadcasts everything. 
ConfMAD broadcasts everything plus confidence scores. But as the number of agents and rounds grows, &#8220;everything&#8221; becomes an increasingly noisy and redundant context. &#129504; <em>If past responses are treated as memories that agents condition on, what happens when those memories are wrong?</em></p><p>Tian et al. [5] provide both a diagnosis and a theoretical grounding for this problem. Their key observation: MAD agents are vulnerable to erroneous memories. When an incorrect response from round <em>r&#8722;1</em> persists in the context, it corrupts generation in round <em>r</em>. And because context accumulates across rounds, early errors have compounding effects. The poison spreads forward through the debate history.</p><p>The theoretical insight is crucial: the performance of MAD is bounded above by the quality of the memories it conditions on. Erroneous memories don&#8217;t just fail to help; they actively degrade the agent that reads them.</p><p>&#128073;<strong>MAD-M2</strong> (MAD with Memory Masking) addresses this by inserting an evaluation-and-masking step between each debate round. After round r generates responses G&#7523;, agents evaluate each response in G&#7523; before allowing it to become a memory for round r+1. Two masking strategies are proposed:</p><p><strong>Subjective Masking.</strong> Each agent uses the LLM itself as an evaluator, labeling each peer response as &#8220;YES&#8221; (keep), &#8220;NO&#8221; (discard), or &#8220;NOT SURE&#8221; (treated as YES or NO depending on a strictness hyperparameter). This is reasoning-aware: the agent reads the content and makes a judgment about its correctness.</p><p><strong>Objective Masking.</strong> This strategy uses perplexity as a proxy for reliability (similar to confidence). High perplexity implies the model was uncertain when generating that response, making it more likely to contain errors. Only the response with the lowest perplexity is retained. 
This avoids the computational cost of LLM-based evaluation but relies on the assumption that perplexity tracks correctness, which holds reasonably for strong models but breaks down for weaker ones.</p><blockquote><p>&#128064; An interesting interaction effect emerges: the best masking strategy depends on the base model&#8217;s capability. For weaker models (Qwen2.5-7B), subjective masking outperforms objective masking: the model can reason about peer correctness even if its own generations are imperfect. For stronger models (QwQ-32B) on hard benchmarks like AIME, objective perplexity-based masking wins decisively, because stronger models&#8217; internal probability estimates are better calibrated.</p></blockquote><p>The three-step pipeline can be summarized as:</p><ol><li><p>Generate responses <em>G&#7523;</em> from all <em>N</em> agents</p></li><li><p>Evaluate and mask: produce <em>M &#8712; {0,1}&#7482;</em> and retain <em>&#119982;&#7523; = {g&#7523;,&#7522; | M&#7522; = 1}</em></p></li><li><p>Condition round <em>r+1</em> generation on <em>&#119982;&#7523;</em> rather than the full <em>G&#7523;</em></p></li></ol><p>Despite looking promising, both masking strategies suffer from limitations:</p><p>&#10060; Subjective masking adds significant token overhead: each agent must evaluate all peer responses before generating its own. The evaluation is also not always correct, so important messages can be lost. There is a circularity here too: an agent that already knew which answers were incorrect would not need the debate. </p><p>&#10060; Objective masking is computationally cheap but assumes perplexity &#8776; correctness, so it inherits ConfMAD&#8217;s problem: uncertainty estimates are never perfectly reliable.</p><div><hr></div><h2>Chapter 5: The Signal You Were Ignoring: Diversity-Aware Retention</h2><p>Each method so far has addressed a different dimension of the debate communication problem. Society of Mind establishes the baseline: peer communication helps. The judge-and-debater structure enforces role-based disagreement. 
ConfMAD adds confidence as a weight on peer influence. MAD-M2 filters by a correctness signal (subjective or objective). But all of them share a blind spot.</p><p>&#129504; <em>What if the most important property of a retained response isn&#8217;t its confidence or its estimated correctness but how much new information it adds relative to what other agents already believe?</em></p><p>&#128073;<strong>Diversity-Aware Retention</strong> (<strong>DAR</strong> [7]) proposes this reframing. DAR argues that the bottleneck in debate isn&#8217;t the quality of individual responses. It&#8217;s the information content of what gets broadcast. Think about it:</p><ul><li><p>A high-confidence response that echoes what everyone already thinks? <strong>Adds nothing.</strong></p></li><li><p>A lower-confidence response that takes a completely different reasoning path? <strong>Adds everything.</strong></p></li></ul><blockquote><p>&#128064; So instead of asking <em>&#8220;how confident is this response?&#8221;</em>, DAR asks <em>&#8220;how different is this response from what everyone else already said?&#8221; </em></p></blockquote><p>This principle can be implemented using a filtering module <em>F</em>, an LLM prompted to retain disagreeing messages. 
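</p><p>As a rough illustration of the retention idea, the sketch below swaps the LLM-based filter for a much simpler heuristic: it keeps one response per distinct final answer, so echoes are dropped and dissenting answers survive. The <code>Response</code> type, the <code>retain_diverse</code> function, and the toy inputs are all hypothetical; this is not DAR&#8217;s prompt or algorithm.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    agent: str
    answer: str
    reasoning: str


def retain_diverse(responses: list[Response]) -> list[Response]:
    """Keep the first response seen for each distinct final answer.

    Heuristic stand-in for an LLM filter: a response that repeats an
    already retained answer is treated as an echo and dropped.
    """
    kept: dict[str, Response] = {}
    for resp in responses:
        key = resp.answer.strip().lower()  # normalize before comparing
        kept.setdefault(key, resp)  # only the first occurrence is stored
    return list(kept.values())


if __name__ == "__main__":
    # Made-up round outputs, for illustration only.
    round_outputs = [
        Response("A", "42", "first reasoning path"),
        Response("B", "42", "same path as A"),
        Response("C", "17", "a different reasoning path"),
    ]
    for resp in retain_diverse(round_outputs):
        print(resp.agent, resp.answer)
```

<p>In this toy round, agents A and B give the same answer, so only A and the dissenting C are passed on. An LLM-based filter could also weigh differences in the reasoning itself, which exact answer matching cannot.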
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Oaf0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f93953b-1196-4171-86f2-618869d9cbf5_1900x1019.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Oaf0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f93953b-1196-4171-86f2-618869d9cbf5_1900x1019.png 424w, https://substackcdn.com/image/fetch/$s_!Oaf0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f93953b-1196-4171-86f2-618869d9cbf5_1900x1019.png 848w, https://substackcdn.com/image/fetch/$s_!Oaf0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f93953b-1196-4171-86f2-618869d9cbf5_1900x1019.png 1272w, https://substackcdn.com/image/fetch/$s_!Oaf0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f93953b-1196-4171-86f2-618869d9cbf5_1900x1019.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Oaf0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f93953b-1196-4171-86f2-618869d9cbf5_1900x1019.png" width="1456" height="781" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f93953b-1196-4171-86f2-618869d9cbf5_1900x1019.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:781,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:388730,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/194030888?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f93953b-1196-4171-86f2-618869d9cbf5_1900x1019.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Oaf0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f93953b-1196-4171-86f2-618869d9cbf5_1900x1019.png 424w, https://substackcdn.com/image/fetch/$s_!Oaf0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f93953b-1196-4171-86f2-618869d9cbf5_1900x1019.png 848w, https://substackcdn.com/image/fetch/$s_!Oaf0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f93953b-1196-4171-86f2-618869d9cbf5_1900x1019.png 1272w, https://substackcdn.com/image/fetch/$s_!Oaf0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f93953b-1196-4171-86f2-618869d9cbf5_1900x1019.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">DAR illustrative example. Source: [7].</figcaption></figure></div><p>Here&#8217;s how DAR works in practice:</p><ul><li><p>&#127919; <strong>Diversity-first selection</strong>: at each round, DAR picks the responses that disagree the most with each other <em>and</em> with the current majority vote. Not the most confident. Not the most fluent. The most <em>informationally distinct</em>.</p></li><li><p>&#128204; <strong>Index-based retention</strong>: retained messages are preserved exactly as written, no summarization, no paraphrasing. What the agent said is what other agents read. 
Authentic disagreement stays authentic.</p></li><li><p>&#128202; <strong>Soft uncertainty signals</strong>: each peer response comes with an uncertainty score (average negative log-likelihood over answer tokens), passed directly into the prompt. Agents decide how much to trust each peer themselves &#8212; no hard threshold, no costly tuning.</p></li></ul><p>Some advantages of DAR compared to MAD-M2:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lhh8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34dcde4b-daa3-41d0-b8a3-36e37235c94b_1440x1102.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lhh8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34dcde4b-daa3-41d0-b8a3-36e37235c94b_1440x1102.png 424w, https://substackcdn.com/image/fetch/$s_!lhh8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34dcde4b-daa3-41d0-b8a3-36e37235c94b_1440x1102.png 848w, https://substackcdn.com/image/fetch/$s_!lhh8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34dcde4b-daa3-41d0-b8a3-36e37235c94b_1440x1102.png 1272w, https://substackcdn.com/image/fetch/$s_!lhh8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34dcde4b-daa3-41d0-b8a3-36e37235c94b_1440x1102.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lhh8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34dcde4b-daa3-41d0-b8a3-36e37235c94b_1440x1102.png" 
width="1440" height="1102" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/34dcde4b-daa3-41d0-b8a3-36e37235c94b_1440x1102.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1102,&quot;width&quot;:1440,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:188775,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/194030888?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34dcde4b-daa3-41d0-b8a3-36e37235c94b_1440x1102.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lhh8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34dcde4b-daa3-41d0-b8a3-36e37235c94b_1440x1102.png 424w, https://substackcdn.com/image/fetch/$s_!lhh8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34dcde4b-daa3-41d0-b8a3-36e37235c94b_1440x1102.png 848w, https://substackcdn.com/image/fetch/$s_!lhh8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34dcde4b-daa3-41d0-b8a3-36e37235c94b_1440x1102.png 1272w, https://substackcdn.com/image/fetch/$s_!lhh8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34dcde4b-daa3-41d0-b8a3-36e37235c94b_1440x1102.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is also reflected in the empirical results:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_uUA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287a1e3c-a1c9-4441-b814-36ac5f84e396_2873x781.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_uUA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287a1e3c-a1c9-4441-b814-36ac5f84e396_2873x781.png 424w, 
https://substackcdn.com/image/fetch/$s_!_uUA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287a1e3c-a1c9-4441-b814-36ac5f84e396_2873x781.png 848w, https://substackcdn.com/image/fetch/$s_!_uUA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287a1e3c-a1c9-4441-b814-36ac5f84e396_2873x781.png 1272w, https://substackcdn.com/image/fetch/$s_!_uUA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287a1e3c-a1c9-4441-b814-36ac5f84e396_2873x781.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_uUA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287a1e3c-a1c9-4441-b814-36ac5f84e396_2873x781.png" width="1456" height="396" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/287a1e3c-a1c9-4441-b814-36ac5f84e396_2873x781.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:396,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:399895,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/194030888?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287a1e3c-a1c9-4441-b814-36ac5f84e396_2873x781.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!_uUA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287a1e3c-a1c9-4441-b814-36ac5f84e396_2873x781.png 424w, https://substackcdn.com/image/fetch/$s_!_uUA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287a1e3c-a1c9-4441-b814-36ac5f84e396_2873x781.png 848w, https://substackcdn.com/image/fetch/$s_!_uUA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287a1e3c-a1c9-4441-b814-36ac5f84e396_2873x781.png 1272w, https://substackcdn.com/image/fetch/$s_!_uUA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287a1e3c-a1c9-4441-b814-36ac5f84e396_2873x781.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>&#10060; DAR&#8217;s diversity-first selection is not without risk. Retaining the responses that disagree most assumes divergence is always informative, but a minority view can simply be wrong. This exposes a fundamental tension: with too much consensus, debate collapses into an echo chamber; with too much divergence, agents get misled by confidently incorrect outliers. The right balance between convergence and divergence remains an open question in MAD research.</p><div class="poll-embed" data-attrs="{&quot;id&quot;:498100}" data-component-name="PollToDOM"></div><div><hr></div><h2>References</h2><p>[1] Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., and Mordatch, I. Improving Factuality and Reasoning in Language Models through Multiagent Debate. ICML 2024.</p><p>[2] Liang, T., He, Z., Jiao, W., et al. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. EMNLP 2024.</p><p>[3] Lin, Z. and Hooi, B. Enhancing Multi-Agent Debate System Performance via Confidence Expression. EMNLP 2025.</p><p>[4] Nguyen, M., Gupta, S., and Le, H. Probabilities Are All You Need: A Probability-Only Approach to Uncertainty Estimation in Large Language Models. AAAI 2026.</p><p>[5] Tian, H., Feng, X., Zhao, Z., et al. Multi-Agent Debate with Memory Masking. ICLR 2026.</p><p>[6] Li, Y., Du, Y., Zhang, J., Hou, L., Grabowski, P., Li, Y., and Ie, E. Improving Multi-Agent Debate with Sparse Communication Topology. Findings of EMNLP 2024, pp. 7281-7294.</p><p>[7] Nguyen, M., Nguyen, A., Nguyen, D., Venkatesh, S., and Le, H. 
Hear Both Sides: Efficient Multi-Agent Debate via Diversity-Aware Message Retention. arXiv:2603.20640 (2026).</p>]]></content:encoded></item><item><title><![CDATA[Improving LLM Reasoning with RL Post-Training]]></title><description><![CDATA[Surveying New Frontiers in Reinforcement Learning for Language Models (Part 3)]]></description><link>https://hungleai.substack.com/p/improving-llm-reasoning-with-post</link><guid isPermaLink="false">https://hungleai.substack.com/p/improving-llm-reasoning-with-post</guid><dc:creator><![CDATA[Hung Le]]></dc:creator><pubDate>Mon, 08 Dec 2025 19:06:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ki1j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a08bc26-1a5d-449a-ae98-27e4a0a516be_1536x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Large language models are getting better at reasoning, not because we made them bigger, but because we finally learned how to <em>teach</em> them after pre-training, a.k.a. post-training. 
Continuing our series on <a href="https://hungleai.substack.com/p/think-before-you-speak-reinforcement">RL for LLM reasoning</a>, today&#8217;s blog reviews recent papers that boost LLM reasoning capability via post-training with RL. If you care about strengthening a model&#8217;s <em>intrinsic</em> reasoning capabilities rather than bolting on expensive test-time scaling or multi-sample decoding, this overview highlights the methods that genuinely transform the model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ki1j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a08bc26-1a5d-449a-ae98-27e4a0a516be_1536x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ki1j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a08bc26-1a5d-449a-ae98-27e4a0a516be_1536x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!ki1j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a08bc26-1a5d-449a-ae98-27e4a0a516be_1536x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!ki1j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a08bc26-1a5d-449a-ae98-27e4a0a516be_1536x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!ki1j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a08bc26-1a5d-449a-ae98-27e4a0a516be_1536x1024.webp 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ki1j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a08bc26-1a5d-449a-ae98-27e4a0a516be_1536x1024.webp" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a08bc26-1a5d-449a-ae98-27e4a0a516be_1536x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Generated image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Generated image" title="Generated image" srcset="https://substackcdn.com/image/fetch/$s_!ki1j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a08bc26-1a5d-449a-ae98-27e4a0a516be_1536x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!ki1j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a08bc26-1a5d-449a-ae98-27e4a0a516be_1536x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!ki1j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a08bc26-1a5d-449a-ae98-27e4a0a516be_1536x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!ki1j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a08bc26-1a5d-449a-ae98-27e4a0a516be_1536x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4><strong>Table of Contents</strong></h4><ul><li><p><a href="https://hungleai.substack.com/i/165059779/introduction">Introduction</a></p><ul><li><p><a href="https://hungleai.substack.com/i/165059779/why-rl-post-training-matters-for-reasoning">Why RL Post-Training Matters for Reasoning?</a></p></li><li><p><a href="https://hungleai.substack.com/i/165059779/from-test-time-to-post-training-whats-the-shift">From Test-Time to Post-Training: What&#8217;s the Shift?</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/165059779/deepseek-r-the-tipping-point">DeepSeek-R1: The Tipping Point</a></p><ul><li><p><a href="https://hungleai.substack.com/i/165059779/formalizing-the-post-training-framework">Formalizing the Post-training Framework</a></p></li><li><p><a href="https://hungleai.substack.com/i/165059779/the-rl-algorithm-group-based-policy-optimization-grpo">The RL Algorithm: Group Relative Policy Optimization (GRPO)</a></p></li><li><p><a href="https://hungleai.substack.com/i/165059779/on-the-reward-choice">On the Reward Choice</a></p></li><li><p><a href="https://hungleai.substack.com/i/165059779/other-interesting-insights">Other Interesting Insights</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/165059779/beyond-grpo-for-a-better-rl-post-training">Beyond GRPO: For A Better RL Post-Training</a></p><ul><li><p><a href="https://hungleai.substack.com/i/165059779/alternative-rl-algorithms">Alternative RL Algorithms</a></p></li><li><p><a href="https://hungleai.substack.com/i/165059779/reward-shaping">Reward Shaping</a></p></li><li><p><a 
href="https://hungleai.substack.com/i/165059779/optimizing-the-training-pipeline">Optimizing the Training Pipeline</a></p></li></ul></li></ul><div><hr></div><h2>Introduction</h2><p>Reasoning remains one of the few capabilities where LLMs still fall short. When the conversation turns to boosting reasoning in LLMs, the default reaction is to reach for <strong><a href="https://hungleai.substack.com/p/reason-on-the-fly-how-rl-boosts-llm">test-time tricks</a></strong>: self-consistency, search, multi-step decoding, or heavy ensembles. These methods are undeniably useful, but they all share a fundamental limitation: they are external scaffolds layered on top of an unchanged model.</p><h4>Why RL Post-Training Matters for Reasoning?</h4><p>Improving reasoning is ultimately about shaping the model&#8217;s internal search process, including how it attends, decomposes, checks, and revises its predictions. Supervised fine-tuning alone gives you patterns but rarely instills the principles needed for multi-step, error-sensitive reasoning. RL provides a way to push the model to explore diverse behaviors until it figures out the principles of reasoning on its own. When done right, the model starts solving tasks cleanly without large decoding budgets.</p><p>Over time, the model learns to prioritize reasoning strategies that generalize across tasks, rather than memorizing superficial patterns from the training data. This is why even relatively small models can outperform larger counterparts on multi-step reasoning benchmarks: the improvement comes from better <em>internal reasoning processes</em>, not from bigger parameter counts or expensive test-time tricks. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J519!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12e4787-f0c9-43d2-bacc-aed6e5e242c0_580x344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J519!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12e4787-f0c9-43d2-bacc-aed6e5e242c0_580x344.png 424w, https://substackcdn.com/image/fetch/$s_!J519!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12e4787-f0c9-43d2-bacc-aed6e5e242c0_580x344.png 848w, https://substackcdn.com/image/fetch/$s_!J519!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12e4787-f0c9-43d2-bacc-aed6e5e242c0_580x344.png 1272w, https://substackcdn.com/image/fetch/$s_!J519!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12e4787-f0c9-43d2-bacc-aed6e5e242c0_580x344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!J519!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12e4787-f0c9-43d2-bacc-aed6e5e242c0_580x344.png" width="580" height="344" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b12e4787-f0c9-43d2-bacc-aed6e5e242c0_580x344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:344,&quot;width&quot;:580,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!J519!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12e4787-f0c9-43d2-bacc-aed6e5e242c0_580x344.png 424w, https://substackcdn.com/image/fetch/$s_!J519!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12e4787-f0c9-43d2-bacc-aed6e5e242c0_580x344.png 848w, https://substackcdn.com/image/fetch/$s_!J519!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12e4787-f0c9-43d2-bacc-aed6e5e242c0_580x344.png 1272w, https://substackcdn.com/image/fetch/$s_!J519!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb12e4787-f0c9-43d2-bacc-aed6e5e242c0_580x344.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Open-source LLMs as small as 32B can, with RL post-training, outperform enterprise LLMs. Source: [1].</figcaption></figure></div><h4>From Test-Time to Post-Training: What&#8217;s the Shift?</h4><p>Test-time approaches rely on external scaffolds: generating multiple trajectories, searching and voting for the best answer, or sampling repeatedly to increase reliability. The model itself doesn&#8217;t change, so inference is expensive, and reasoning traces remain fragile because they depend on which samples are drawn.</p><p>Post-training RL takes a fundamentally different approach. By using reward-driven model updates, it reshapes the model&#8217;s internal heuristics, allowing it to plan, check, and revise its own reasoning based on the training rewards. As summarized in the figure below, post-training RL trades upfront training cost for cheaper, more reliable inference, cleaner reasoning traces, and stronger generalization. 
This shift marks the difference between forcing a model to reason through brute-force search and actually teaching it to reason internally.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZDeZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d834d-25f8-40f6-b3b7-dc52aebd9f6a_1143x814.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZDeZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d834d-25f8-40f6-b3b7-dc52aebd9f6a_1143x814.png 424w, https://substackcdn.com/image/fetch/$s_!ZDeZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d834d-25f8-40f6-b3b7-dc52aebd9f6a_1143x814.png 848w, https://substackcdn.com/image/fetch/$s_!ZDeZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d834d-25f8-40f6-b3b7-dc52aebd9f6a_1143x814.png 1272w, https://substackcdn.com/image/fetch/$s_!ZDeZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d834d-25f8-40f6-b3b7-dc52aebd9f6a_1143x814.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZDeZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d834d-25f8-40f6-b3b7-dc52aebd9f6a_1143x814.png" width="1143" height="814" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c62d834d-25f8-40f6-b3b7-dc52aebd9f6a_1143x814.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1143,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:462200,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d834d-25f8-40f6-b3b7-dc52aebd9f6a_1143x814.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZDeZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d834d-25f8-40f6-b3b7-dc52aebd9f6a_1143x814.png 424w, https://substackcdn.com/image/fetch/$s_!ZDeZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d834d-25f8-40f6-b3b7-dc52aebd9f6a_1143x814.png 848w, https://substackcdn.com/image/fetch/$s_!ZDeZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d834d-25f8-40f6-b3b7-dc52aebd9f6a_1143x814.png 1272w, https://substackcdn.com/image/fetch/$s_!ZDeZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc62d834d-25f8-40f6-b3b7-dc52aebd9f6a_1143x814.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Test-time RL vs Post-training RL. </figcaption></figure></div><p>It&#8217;s also important to understand how post-training RL differs from the other two major forms of post-training: <strong>supervised fine-tuning (SFT)</strong> and <strong>preference training</strong>.<br>SFT teaches the model to <em>imitate</em> reasoning patterns, but it requires ground-truth reasoning traces and can only learn what&#8217;s shown in the ground-truth data. There&#8217;s no pressure to explore, self-correct, or discover better reasoning paths. 
Preference training (like <a href="https://hungleai.substack.com/p/aligning-large-language-models-with">DPO</a> or <a href="https://hungleai.substack.com/p/many-hands-make-light-work-leveraging">MRPO</a>) adds a ranking signal, nudging the model toward preferred outputs, but it still operates passively: the model isn&#8217;t encouraged to search, plan, or discover new strategies, only to reshape its distribution around examples humans liked. </p><p>RL is different. RL gives the model room to experiment, follow multi-step trajectories, encounter failure, and adjust its internal heuristics based on cumulative reward. That exploration loop is exactly what makes reasoning skills intrinsic rather than decorative. Put simply:</p><p>&#10060; Test-time scaling searches for the best outputs</p><p>&#10060; SFT teaches outputs</p><p>&#10060; Preference training teaches preferences over outputs</p><p>&#10004;&#65039; RL teaches the process that produces those outputs</p><p>And that&#8217;s why post-training RL is emerging as the most reliable way to move beyond surface-level reasoning gains. It turns reasoning from something we approximate at inference into something the model genuinely knows how to do. With that foundation in place, we can now study representative works on RL post-training, also called Reinforcement Fine-Tuning (RFT). </p><div><hr></div><h2>DeepSeek-R1: The Tipping Point</h2><p>DeepSeek began as an ambitious open-source initiative pushing the boundaries of efficient LLM training. Unlike many labs chasing scale at any cost, DeepSeek focused on improving reasoning quality without gigantic models or massive inference budgets. That philosophy culminated in &#128073; <strong>DeepSeek-R1</strong> [1], their breakthrough reasoning model.</p><p>R1 is built on a deceptively simple idea: an outcome reward is all you need for RFT. 
It raises a research question: &#129504; <em>If a task has a verifiable final answer, can we train the model directly with policy gradients and a verifiable reward?</em> For domains like math, logical inference, and structured problem solving, where correctness is binary, this setup is ideal. Instead of collecting huge preference datasets or relying on imitation or complicated <a href="https://hungleai.substack.com/i/163970650/process-reward-models-defining-whats-good-reasoning">process reward models</a>, the model explores, fails, receives a clean reward, and gradually internalizes the principles of stepwise reasoning.</p><p>The DeepSeek team not only proved that this is doable, but they also released the model openly, documented almost the entire training pipeline, showed how to make verifiable-reward RL work at scale, and did so at a shockingly low cost. Training was cheap, the resulting API was 27&#215; cheaper than OpenAI&#8217;s reasoning models, and on many benchmarks, performance was effectively equivalent. 
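</p><p>To make the verifiable, binary reward discussed above more concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the model wraps its final answer in <code>\boxed{...}</code>; the helper names <code>extract_final_answer</code> and <code>verifiable_reward</code> are hypothetical and are not taken from the DeepSeek-R1 paper.</p><pre>
```python
import re


def extract_final_answer(output):
    # Assumed convention: the final answer is wrapped as \boxed{...}.
    # If several boxed spans appear, the last one is treated as final.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", output)
    if not matches:
        return None
    return matches[-1].strip()


def verifiable_reward(output, ground_truth):
    # Binary outcome reward: 1.0 when the extracted answer matches the
    # ground truth exactly, 0.0 otherwise (including a missing answer).
    answer = extract_final_answer(output)
    if answer is None:
        return 0.0
    return 1.0 if answer == ground_truth.strip() else 0.0


if __name__ == "__main__":
    print(verifiable_reward("Adding the terms gives \\boxed{42}.", "42"))
    print(verifiable_reward("I am not sure about the answer.", "42"))
```
</pre><p>No human judgment enters this loop: the only supervision is the ground-truth answer that ships with the dataset, which is what keeps a reward of this kind cheap to compute.</p><p>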
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kOVb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96669a0-c22a-4792-8412-692b8e1c18c8_640x424.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kOVb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96669a0-c22a-4792-8412-692b8e1c18c8_640x424.png 424w, https://substackcdn.com/image/fetch/$s_!kOVb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96669a0-c22a-4792-8412-692b8e1c18c8_640x424.png 848w, https://substackcdn.com/image/fetch/$s_!kOVb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96669a0-c22a-4792-8412-692b8e1c18c8_640x424.png 1272w, https://substackcdn.com/image/fetch/$s_!kOVb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96669a0-c22a-4792-8412-692b8e1c18c8_640x424.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kOVb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96669a0-c22a-4792-8412-692b8e1c18c8_640x424.png" width="640" height="424" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b96669a0-c22a-4792-8412-692b8e1c18c8_640x424.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:424,&quot;width&quot;:640,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kOVb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96669a0-c22a-4792-8412-692b8e1c18c8_640x424.png 424w, https://substackcdn.com/image/fetch/$s_!kOVb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96669a0-c22a-4792-8412-692b8e1c18c8_640x424.png 848w, https://substackcdn.com/image/fetch/$s_!kOVb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96669a0-c22a-4792-8412-692b8e1c18c8_640x424.png 1272w, https://substackcdn.com/image/fetch/$s_!kOVb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb96669a0-c22a-4792-8412-692b8e1c18c8_640x424.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Training cost of Deepseek. <a href="https://www.reddit.com/r/singularity/comments/1id60qi/big_misconceptions_of_training_costs_for_deepseek/">Source</a></figcaption></figure></div><p>The moment constituted a powerful slap in the face to the prevailing view that expensive GPUs (promoted by NVIDIA) and high-budget models (like those from OpenAI) were indispensable for advanced AI:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V8-D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F008080e6-0a59-47c2-9d3c-e1d6ecab5518_2033x645.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!V8-D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F008080e6-0a59-47c2-9d3c-e1d6ecab5518_2033x645.png 
424w, https://substackcdn.com/image/fetch/$s_!V8-D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F008080e6-0a59-47c2-9d3c-e1d6ecab5518_2033x645.png 848w, https://substackcdn.com/image/fetch/$s_!V8-D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F008080e6-0a59-47c2-9d3c-e1d6ecab5518_2033x645.png 1272w, https://substackcdn.com/image/fetch/$s_!V8-D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F008080e6-0a59-47c2-9d3c-e1d6ecab5518_2033x645.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!V8-D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F008080e6-0a59-47c2-9d3c-e1d6ecab5518_2033x645.png" width="1456" height="462" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/008080e6-0a59-47c2-9d3c-e1d6ecab5518_2033x645.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:462,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:820493,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F008080e6-0a59-47c2-9d3c-e1d6ecab5518_2033x645.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!V8-D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F008080e6-0a59-47c2-9d3c-e1d6ecab5518_2033x645.png 424w, https://substackcdn.com/image/fetch/$s_!V8-D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F008080e6-0a59-47c2-9d3c-e1d6ecab5518_2033x645.png 848w, https://substackcdn.com/image/fetch/$s_!V8-D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F008080e6-0a59-47c2-9d3c-e1d6ecab5518_2033x645.png 1272w, https://substackcdn.com/image/fetch/$s_!V8-D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F008080e6-0a59-47c2-9d3c-e1d6ecab5518_2033x645.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>&#128064; R1 demonstrated something the field had suspected but never proven at this scale: great reasoning doesn&#8217;t require trillion-parameter models. Rather, it requires the right RL formulation.</p></blockquote><p></p><h4>Formalizing the Post-training Framework</h4><p>DeepSeek-R1 is an RL post-training framework that helps LLMs acquire real reasoning ability purely through reinforcement learning, without needing human-annotated reasoning traces.</p><p>In the paper [1], the authors present two versions of DeepSeek-R1. In the original &#8220;R1-Zero&#8221; variant, the authors start from a base LLM (DeepSeek-V3-Base) and apply a pure RL pipeline with a verifiable final-answer reward (for math, code, and STEM tasks), without any supervised fine-tuning (SFT). 
Over thousands of training updates, the model begins to exhibit emergent reasoning behaviors: self-verification, chain-of-thought generation, reflection, and dynamic strategy adaptation.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rLv8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45dc0249-2a8d-42a6-9df2-1ffaaeac3116_580x228.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rLv8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45dc0249-2a8d-42a6-9df2-1ffaaeac3116_580x228.png 424w, https://substackcdn.com/image/fetch/$s_!rLv8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45dc0249-2a8d-42a6-9df2-1ffaaeac3116_580x228.png 848w, https://substackcdn.com/image/fetch/$s_!rLv8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45dc0249-2a8d-42a6-9df2-1ffaaeac3116_580x228.png 1272w, https://substackcdn.com/image/fetch/$s_!rLv8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45dc0249-2a8d-42a6-9df2-1ffaaeac3116_580x228.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rLv8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45dc0249-2a8d-42a6-9df2-1ffaaeac3116_580x228.png" width="580" height="228" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45dc0249-2a8d-42a6-9df2-1ffaaeac3116_580x228.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:228,&quot;width&quot;:580,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42266,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45dc0249-2a8d-42a6-9df2-1ffaaeac3116_580x228.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rLv8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45dc0249-2a8d-42a6-9df2-1ffaaeac3116_580x228.png 424w, https://substackcdn.com/image/fetch/$s_!rLv8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45dc0249-2a8d-42a6-9df2-1ffaaeac3116_580x228.png 848w, https://substackcdn.com/image/fetch/$s_!rLv8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45dc0249-2a8d-42a6-9df2-1ffaaeac3116_580x228.png 1272w, https://substackcdn.com/image/fetch/$s_!rLv8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45dc0249-2a8d-42a6-9df2-1ffaaeac3116_580x228.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">DeepSeek-R1 Zero workflow.</figcaption></figure></div><p>The final, polished DeepSeek-R1 extends this with a multi-stage pipeline: a cold-start using a small curated CoT 
dataset to improve readability, followed by RL, then rejection-sampling to build fresh SFT data, then another RL pass (including domain-general prompts). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rykT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a1e164-56ec-49fb-abac-86eb93a49d3a_1548x784.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rykT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a1e164-56ec-49fb-abac-86eb93a49d3a_1548x784.png 424w, https://substackcdn.com/image/fetch/$s_!rykT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a1e164-56ec-49fb-abac-86eb93a49d3a_1548x784.png 848w, https://substackcdn.com/image/fetch/$s_!rykT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a1e164-56ec-49fb-abac-86eb93a49d3a_1548x784.png 1272w, https://substackcdn.com/image/fetch/$s_!rykT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a1e164-56ec-49fb-abac-86eb93a49d3a_1548x784.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rykT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a1e164-56ec-49fb-abac-86eb93a49d3a_1548x784.png" width="1456" height="737" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70a1e164-56ec-49fb-abac-86eb93a49d3a_1548x784.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:737,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rykT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a1e164-56ec-49fb-abac-86eb93a49d3a_1548x784.png 424w, https://substackcdn.com/image/fetch/$s_!rykT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a1e164-56ec-49fb-abac-86eb93a49d3a_1548x784.png 848w, https://substackcdn.com/image/fetch/$s_!rykT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a1e164-56ec-49fb-abac-86eb93a49d3a_1548x784.png 1272w, https://substackcdn.com/image/fetch/$s_!rykT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70a1e164-56ec-49fb-abac-86eb93a49d3a_1548x784.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">DeepSeek-R1 Full workflow. <a href="https://www.vellum.ai/blog/the-training-of-deepseek-r1-and-ways-to-use-it">Source</a></figcaption></figure></div><p>A key step is what happens after the first RL run. DeepSeek takes the improved RL policy and uses it to generate many candidate reasoning traces for new problems. Instead of directly using the raw samples, they apply rejection sampling:</p><ul><li><p>Keep only trajectories with correct final answers</p></li><li><p>Optionally rank by readability, coherence, or internal consistency</p></li><li><p>Discard the rest completely</p></li></ul><p>This gives you a large dataset of high-quality, self-generated reasoning demonstrations. It&#8217;s like mining the model&#8217;s own best moments and discarding the noise. </p><blockquote><p>&#128064; Crucially, this dataset is much cleaner than the original SFT data: it has no human variation, follows a fully standardized structure, and is perfectly aligned with the target format introduced during the cold start. 
</p></blockquote><p>Then, DeepSeek performs an SFT pass on the new data. This step accomplishes two things:</p><ol><li><p>It distills the stable reasoning behaviors discovered during RL into a single, deterministic forward pass.</p></li><li><p>It smooths out the variance of RL sampling, making the model&#8217;s outputs cleaner and more consistent at inference.</p></li></ol><p>In the end, the key component is the pure-RL block. To understand it, we begin by framing LLM reasoning as an RL problem. An LLM acts as a conditional policy <em>&#960;<sub>&#952;</sub></em> over text, and each reasoning trajectory can be viewed within the standard RL tuple <em>(S, A, R)</em>:</p><ul><li><p><strong>State </strong><em><strong>s<sub>t</sub></strong></em>: the prompt <em>q</em> plus the partial generation up to step <em>t</em>:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\ns_t = (q, x_{1:t-1})\n&quot;,&quot;id&quot;:&quot;FEAOFQJEOR&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p><strong>Action </strong><em><strong>a<sub>t</sub></strong></em>: the next token <em>x<sub>t</sub></em> drawn from the policy:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;a_t = x_t \\sim \\pi_\\theta(\\cdot \\mid s_t)&quot;,&quot;id&quot;:&quot;PMJGNCIRYS&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p><strong>Reward </strong><em><strong>r</strong></em>: a <em>final</em> sequence-level reward given after the full output is completed:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;r = R(x_{1:T}, q, a^*)&quot;,&quot;id&quot;:&quot;YTDABVZXZR&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>a*</em> is the ground-truth answer, and <em>R</em> is the reward function that measures how accurate the output <em>x<sub>1:T</sub></em> is compared to the true 
answer <em>a*</em>. Usually, a simple exact-match function, paired with a parser that extracts the final answer from the output, is used. Because the reward is computed from the ground-truth answer given by the dataset, it is:</p><p>&#10004;&#65039; Verifiable</p><p>&#10004;&#65039; Not requiring human annotations</p><p>This simple reward mechanism is crucial to DeepSeek-R1&#8217;s success, establishing it as an affordable alternative model. Yet, one question remains: &#129504; <em>If the method is so straightforward, why has no one successfully implemented it before?</em></p><p>The answer lies in RL itself, which demands specialized techniques to work effectively in the LLM context:</p><ul><li><p>RL is Inherently Hard to Train: Standard RL algorithms are notoriously sample-inefficient and often exhibit high variance. Training requires meticulous tuning and robust techniques to stabilize the learning process.</p></li><li><p>Need for a Strong Base Model: RL algorithms typically cannot &#8220;bootstrap&#8221; a successful model from scratch. They require a &#8220;strong enough&#8221; base model to provide a solid foundation of language understanding and coherence. The base model used for DeepSeek-R1 (DeepSeek-V3), which is itself a very large model, provides the strength needed for RL fine-tuning to work effectively.</p></li><li><p>Sparse Reward Problem: The task of aligning an LLM often results in a sparse reward signal. The model only receives feedback on the quality of its final response, making it difficult for the RL agent to determine which specific intermediate steps or tokens contributed to the success (or failure) of the output.</p></li><li><p>Necessity for a Novel RL Algorithm: Due to the issues of high variance and sparse rewards, the authors could not simply apply a naive or off-the-shelf RL technique. 
They had to adopt a new RL algorithm, named <strong>GRPO</strong>, specifically designed to be stable, scalable, and effective when applied to the vast parameter space of a powerful base LLM in a practical setting.</p></li></ul><h4>The RL Algorithm: Group Relative Policy Optimization (GRPO)</h4><p>DeepSeek-R1&#8217;s RL algorithm is based on a well-known trust-region algorithm, PPO [2]. As a reminder, the original PPO objective is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{maximize}\\  E_t[\\min(w_t(\\theta) \\hat{A}_t, \\text{clip}(w_t(\\theta), 1-\\epsilon, 1+\\epsilon) \\hat{A}_t)]\n&quot;,&quot;id&quot;:&quot;PCUXGYJUYN&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>w<sub>t</sub></em> is the importance sampling ratio for action <em>a<sub>t</sub></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w_t(\\theta) = \\frac{\\pi_\\theta(a_t \\mid s_t)}{\\pi_{\\theta_{\\text{old}}}(a_t \\mid s_t)}&quot;,&quot;id&quot;:&quot;AUHGWPOKXB&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, DeepSeek-R1 modifies PPO with a group-based policy optimization scheme suitable for text space, which results in a new group-based advantage function. Specifically, for each query <em>q</em>, the model samples a <strong>group</strong> of <em>G</em> candidate outputs:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\{ o_1, \\dots, o_G \\} \\sim \\pi_{\\theta_{\\text{old}}}(\\cdot \\mid q)&quot;,&quot;id&quot;:&quot;GRIVEXNLJH&quot;}" data-component-name="LatexBlockToDOM"></div><p>Each output <em>o<sub>i</sub></em> is a whole sequence of tokens and receives a scalar reward <em>r<sub>i</sub></em>. 
Instead of fitting a value function and using its estimates to calculate the advantage, DeepSeek computes advantages from each output&#8217;s reward relative to the other outputs in its group:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Cm4R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3265a65-125b-4c95-bd34-6d98150d8e4c_1188x270.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Cm4R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3265a65-125b-4c95-bd34-6d98150d8e4c_1188x270.png 424w, https://substackcdn.com/image/fetch/$s_!Cm4R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3265a65-125b-4c95-bd34-6d98150d8e4c_1188x270.png 848w, https://substackcdn.com/image/fetch/$s_!Cm4R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3265a65-125b-4c95-bd34-6d98150d8e4c_1188x270.png 1272w, https://substackcdn.com/image/fetch/$s_!Cm4R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3265a65-125b-4c95-bd34-6d98150d8e4c_1188x270.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Cm4R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3265a65-125b-4c95-bd34-6d98150d8e4c_1188x270.png" width="390" height="88.63636363636364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3265a65-125b-4c95-bd34-6d98150d8e4c_1188x270.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:270,&quot;width&quot;:1188,&quot;resizeWidth&quot;:390,&quot;bytes&quot;:42429,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3265a65-125b-4c95-bd34-6d98150d8e4c_1188x270.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Cm4R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3265a65-125b-4c95-bd34-6d98150d8e4c_1188x270.png 424w, https://substackcdn.com/image/fetch/$s_!Cm4R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3265a65-125b-4c95-bd34-6d98150d8e4c_1188x270.png 848w, https://substackcdn.com/image/fetch/$s_!Cm4R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3265a65-125b-4c95-bd34-6d98150d8e4c_1188x270.png 1272w, https://substackcdn.com/image/fetch/$s_!Cm4R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3265a65-125b-4c95-bd34-6d98150d8e4c_1188x270.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This yields a stable, variance-reduced advantage signal without a critic. 
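</p><p>As a rough illustration of the critic-free advantage, the sketch below standardizes rewards within one group, which is the form commonly given for GRPO: each reward minus the group mean, divided by the group standard deviation. The helper name <em>group_advantages</em>, the small <em>eps</em> constant, and the use of the population standard deviation are assumptions made for this example, not details taken from the paper.</p>

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of G sampled outputs.

    Hypothetical helper for illustration; eps is an assumed constant that
    avoids division by zero when all rewards in the group are identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# G = 4 outputs for one query, scored 1 if the final answer is correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
print(advantages)  # correct outputs get positive values, wrong ones negative

# If every output gets the same reward, all advantages are zero,
# so that query contributes no learning signal.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```

<p>Because the baseline is just the group mean, no value network has to be trained or stored.</p><p>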
The optimization objective is a PPO-style clipped surrogate, coupled with a KL constraint:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OYw9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a2be43-f23a-40d0-a86c-bf6e5fb4a77f_1826x399.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OYw9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a2be43-f23a-40d0-a86c-bf6e5fb4a77f_1826x399.png 424w, https://substackcdn.com/image/fetch/$s_!OYw9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a2be43-f23a-40d0-a86c-bf6e5fb4a77f_1826x399.png 848w, https://substackcdn.com/image/fetch/$s_!OYw9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a2be43-f23a-40d0-a86c-bf6e5fb4a77f_1826x399.png 1272w, https://substackcdn.com/image/fetch/$s_!OYw9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a2be43-f23a-40d0-a86c-bf6e5fb4a77f_1826x399.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OYw9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a2be43-f23a-40d0-a86c-bf6e5fb4a77f_1826x399.png" width="1456" height="318" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23a2be43-f23a-40d0-a86c-bf6e5fb4a77f_1826x399.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:318,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:110571,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a2be43-f23a-40d0-a86c-bf6e5fb4a77f_1826x399.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OYw9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a2be43-f23a-40d0-a86c-bf6e5fb4a77f_1826x399.png 424w, https://substackcdn.com/image/fetch/$s_!OYw9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a2be43-f23a-40d0-a86c-bf6e5fb4a77f_1826x399.png 848w, https://substackcdn.com/image/fetch/$s_!OYw9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a2be43-f23a-40d0-a86c-bf6e5fb4a77f_1826x399.png 1272w, https://substackcdn.com/image/fetch/$s_!OYw9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23a2be43-f23a-40d0-a86c-bf6e5fb4a77f_1826x399.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here is the summary of GRPO&#8217;s features:</p><ul><li><p>It operates on a whole sentence level, not a step/token level.</p></li><li><p>It avoids learning value networks and lets the 
reward signal directly shape reasoning behaviors through group-based advantage estimates.</p></li><li><p>The objective function is averaged over <em>G</em> outputs, reducing variance in the gradients.</p></li><li><p>The PPO clip forces each update to be small, increasing stability.</p></li><li><p>The KL term further keeps the new policy close to the base model, preserving general pre-trained performance.</p></li></ul><h4>On the Reward Choice</h4><p>As discussed earlier, the reward is simple and verifiable against the ground truth. However, to support reward calculation, the LLM&#8217;s output must follow a fixed format so that the final answer can be extracted later. If the model rambles, changes styles mid-generation, or forgets to delimit its conclusion, the verifier can&#8217;t score it, even if the reasoning is mostly correct. Therefore, DeepSeek-R1 uses the following prompt to enforce the output format:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DusP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9406c378-d5ca-46f3-95bd-6ce8c1733444_2263x563.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DusP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9406c378-d5ca-46f3-95bd-6ce8c1733444_2263x563.png 424w, https://substackcdn.com/image/fetch/$s_!DusP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9406c378-d5ca-46f3-95bd-6ce8c1733444_2263x563.png 848w, 
https://substackcdn.com/image/fetch/$s_!DusP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9406c378-d5ca-46f3-95bd-6ce8c1733444_2263x563.png 1272w, https://substackcdn.com/image/fetch/$s_!DusP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9406c378-d5ca-46f3-95bd-6ce8c1733444_2263x563.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DusP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9406c378-d5ca-46f3-95bd-6ce8c1733444_2263x563.png" width="1456" height="362" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9406c378-d5ca-46f3-95bd-6ce8c1733444_2263x563.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:362,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:194031,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9406c378-d5ca-46f3-95bd-6ce8c1733444_2263x563.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DusP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9406c378-d5ca-46f3-95bd-6ce8c1733444_2263x563.png 424w, 
https://substackcdn.com/image/fetch/$s_!DusP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9406c378-d5ca-46f3-95bd-6ce8c1733444_2263x563.png 848w, https://substackcdn.com/image/fetch/$s_!DusP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9406c378-d5ca-46f3-95bd-6ce8c1733444_2263x563.png 1272w, https://substackcdn.com/image/fetch/$s_!DusP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9406c378-d5ca-46f3-95bd-6ce8c1733444_2263x563.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Source: [1].</figcaption></figure></div><p>Once the output is parseable, the reward becomes a clean deterministic function:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;r = \\mathbf{1}\\!\\left[a = a^\\ast\\right],\n&quot;,&quot;id&quot;:&quot;WLFKRSOBMM&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>a</em> is the answer extracted from between the &lt;answer&gt; and &lt;/answer&gt; tags and <em>a*</em> is the true answer.</p><h4>Other Interesting Insights</h4><p>One of the most surprising findings is that you don&#8217;t need any chain-of-thought supervision to unlock multi-step reasoning. The model invents a reasoning style because it is the only reliable way to consistently maximize outcome reward. This is a major shift in thinking:</p><p>Reliable Answer &#8658; Reliable Internal Search</p><p>The model rediscovers analysis &#8594; checking &#8594; correction, not because we teach it, but because the outcome reward penalizes messy one-shot answers, which fail to generalize across the many questions seen during training. 
As the results show, pure RL post-training is already competitive with other baselines:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!keym!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93de561-142d-48c2-a683-70e66687af3d_1273x314.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!keym!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93de561-142d-48c2-a683-70e66687af3d_1273x314.png 424w, https://substackcdn.com/image/fetch/$s_!keym!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93de561-142d-48c2-a683-70e66687af3d_1273x314.png 848w, https://substackcdn.com/image/fetch/$s_!keym!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93de561-142d-48c2-a683-70e66687af3d_1273x314.png 1272w, https://substackcdn.com/image/fetch/$s_!keym!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93de561-142d-48c2-a683-70e66687af3d_1273x314.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!keym!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93de561-142d-48c2-a683-70e66687af3d_1273x314.png" width="1273" height="314" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a93de561-142d-48c2-a683-70e66687af3d_1273x314.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:314,&quot;width&quot;:1273,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:76899,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93de561-142d-48c2-a683-70e66687af3d_1273x314.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!keym!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93de561-142d-48c2-a683-70e66687af3d_1273x314.png 424w, https://substackcdn.com/image/fetch/$s_!keym!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93de561-142d-48c2-a683-70e66687af3d_1273x314.png 848w, https://substackcdn.com/image/fetch/$s_!keym!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93de561-142d-48c2-a683-70e66687af3d_1273x314.png 1272w, https://substackcdn.com/image/fetch/$s_!keym!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa93de561-142d-48c2-a683-70e66687af3d_1273x314.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Second, the paper also indicates the benefit of later SFT or distillation that compresses RL behaviors. 
After the first RL run, DeepSeek rejects low-quality samples and distills the remaining high-quality traces back into the model. You might expect a noticeable drop in performance because SFT is usually seen as &#8220;blurry&#8221; compared to RL. Instead, the distilled model actually becomes more stable and often more accurate.</p><p>Why? RL discovers a diverse set of good reasoning strategies, but SFT selects only the clean, high-confidence trajectories and forces the policy to reproduce them deterministically. The result is a smoother, more reliable reasoning model without the sampling volatility of pure RL. </p><blockquote><p>&#128064; RL finds the behavior; SFT perfects it.</p></blockquote><p>This approach of distillation from RL-trained big models to smaller ones is very effective:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HtoK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67dbe20-96e5-4954-b5a9-74795a468752_1259x265.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HtoK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67dbe20-96e5-4954-b5a9-74795a468752_1259x265.png 424w, https://substackcdn.com/image/fetch/$s_!HtoK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67dbe20-96e5-4954-b5a9-74795a468752_1259x265.png 848w, https://substackcdn.com/image/fetch/$s_!HtoK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67dbe20-96e5-4954-b5a9-74795a468752_1259x265.png 1272w, 
https://substackcdn.com/image/fetch/$s_!HtoK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67dbe20-96e5-4954-b5a9-74795a468752_1259x265.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HtoK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67dbe20-96e5-4954-b5a9-74795a468752_1259x265.png" width="1259" height="265" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d67dbe20-96e5-4954-b5a9-74795a468752_1259x265.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:265,&quot;width&quot;:1259,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63981,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67dbe20-96e5-4954-b5a9-74795a468752_1259x265.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HtoK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67dbe20-96e5-4954-b5a9-74795a468752_1259x265.png 424w, https://substackcdn.com/image/fetch/$s_!HtoK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67dbe20-96e5-4954-b5a9-74795a468752_1259x265.png 848w, 
https://substackcdn.com/image/fetch/$s_!HtoK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67dbe20-96e5-4954-b5a9-74795a468752_1259x265.png 1272w, https://substackcdn.com/image/fetch/$s_!HtoK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67dbe20-96e5-4954-b5a9-74795a468752_1259x265.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Finally, one of the most striking aspects of DeepSeek-R1 is how much <em>emergent structure</em> arises even though the reward is extremely simple. There is no preference model, no human-written quality annotations, no chain-of-thought supervision baked into the reward. Yet the model discovers behaviors that look almost like they were explicitly trained for: the &#8220;aha moment&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FJVv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec2a5844-cace-406c-b02c-23844135a887_1737x972.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FJVv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec2a5844-cace-406c-b02c-23844135a887_1737x972.png 424w, https://substackcdn.com/image/fetch/$s_!FJVv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec2a5844-cace-406c-b02c-23844135a887_1737x972.png 848w, 
https://substackcdn.com/image/fetch/$s_!FJVv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec2a5844-cace-406c-b02c-23844135a887_1737x972.png 1272w, https://substackcdn.com/image/fetch/$s_!FJVv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec2a5844-cace-406c-b02c-23844135a887_1737x972.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FJVv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec2a5844-cace-406c-b02c-23844135a887_1737x972.png" width="1456" height="815" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec2a5844-cace-406c-b02c-23844135a887_1737x972.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:208827,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec2a5844-cace-406c-b02c-23844135a887_1737x972.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FJVv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec2a5844-cace-406c-b02c-23844135a887_1737x972.png 424w, 
https://substackcdn.com/image/fetch/$s_!FJVv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec2a5844-cace-406c-b02c-23844135a887_1737x972.png 848w, https://substackcdn.com/image/fetch/$s_!FJVv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec2a5844-cace-406c-b02c-23844135a887_1737x972.png 1272w, https://substackcdn.com/image/fetch/$s_!FJVv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec2a5844-cace-406c-b02c-23844135a887_1737x972.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; Although this behavior is interesting and never appears in the training data, subsequent papers (e.g., [3]) show that the base LLM already possessed this capability, though it surfaced with lower probability before fine-tuning.</p></blockquote><div><hr></div><h2>Beyond GRPO: Toward Better RL Post-Training</h2><p>In the previous section, we saw how GRPO laid the foundation for post-training RL: by estimating advantages across sampled reasoning trajectories, GRPO reduces variance compared to na&#239;ve policy gradients and reliably nudges the model toward higher-reward outputs. However, it remains rudimentary in several ways:</p><ul><li><p><strong>Residual variance in long reasoning chains:</strong> As tasks grow in complexity, the gradient signal from sampled trajectories can remain noisy, making learning slow or unstable.</p></li><li><p><strong>Vulnerability to rare but correct trajectories:</strong> GRPO updates can inadvertently reduce the probability of correct but infrequent outputs, especially under sparse rewards.</p></li><li><p><strong>Scalability bottlenecks:</strong> When applied to very large models or long, multi-step reasoning, GRPO can become computationally expensive and less stable.</p></li></ul><h4>Alternative RL Algorithms</h4><p>To improve on GRPO, many researchers have proposed alternative algorithms for RL post-training. For example, Decoupled Clip and Dynamic Sampling Policy Optimization (&#128073;<strong>DAPO</strong> [4]) is a set of algorithmic and engineering fixes that make RL practical and reproducible at scale.</p><p><strong>Trick 0: Remove KL term</strong></p><p>Standard GRPO uses a KL penalty to constrain policy divergence. 
The authors exclude this term because long-CoT reasoning requires significant distributional shifts from the initial model, rendering the restriction unnecessary.</p><p><strong>Trick 1: Clip-Higher</strong></p><p>Recall that the clip in GRPO constrains each policy update, which stabilizes training. However, because the bound scales with the old probability, the same clip treats tokens unevenly: high-probability &#8220;exploitation&#8221; tokens can grow freely, but low-probability &#8220;exploration&#8221; tokens struggle. For example, with &#949; = 0.2 and a positive advantage:</p><ul><li><p>Token A: &#960;<sub>&#952;old</sub>=0.9 &#8594; upper bound 0.9&#8901;1.2=1.08</p></li><li><p>Token B: &#960;<sub>&#952;old</sub>=0.01 &#8594; upper bound 0.01&#8901;1.2=0.012</p></li></ul><p>Here, token A&#8217;s bound already exceeds 1, so it is effectively unconstrained, while the rare token B can gain at most 0.002. This limits exploration and slows scaling. DAPO fixes this with the Clip-Higher strategy, which raises the upper clipping bound so that low-probability tokens have more room to grow, enabling more diverse sampling and helping the model discover better reasoning trajectories. 
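</p><p>The two-token example above can be checked numerically. The sketch below is a minimal illustration, assuming a per-token PPO-style clipped term with separate lower and upper bounds; the function names and the value 0.28 for the raised upper bound are assumptions chosen for the example.</p>

```python
def max_rewarded_prob(p_old, eps_high):
    """Highest new probability that still earns extra objective (capped at 1)."""
    return min(1.0, p_old * (1.0 + eps_high))


def clipped_term(ratio, advantage, eps_low, eps_high):
    """Per-token clipped surrogate with asymmetric bounds (illustrative)."""
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped_ratio * advantage)


# Symmetric clip (eps = 0.2): the rare token can reach only about 0.012.
print(max_rewarded_prob(0.9, 0.2), max_rewarded_prob(0.01, 0.2))

# Raised upper bound (assumed 0.28): the rare token can reach about 0.0128.
print(max_rewarded_prob(0.9, 0.28), max_rewarded_prob(0.01, 0.28))

# A ratio of 1.25 with positive advantage is cut off at 1.2 by the symmetric
# clip, but passes through unchanged once the upper bound is raised.
symmetric = clipped_term(1.25, 1.0, eps_low=0.2, eps_high=0.2)
clip_higher = clipped_term(1.25, 1.0, eps_low=0.2, eps_high=0.28)
print(symmetric, clip_higher)
```

<p>Keeping the lower bound unchanged in the second call mirrors the idea of decoupling the two thresholds: only the room for increasing a token&#8217;s probability is widened.</p><p>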
In practice, the idea is implemented by simply modifying the clip part:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IIhe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7588ba-f1b7-499d-b838-21cf4fed3280_918x155.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IIhe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7588ba-f1b7-499d-b838-21cf4fed3280_918x155.png 424w, https://substackcdn.com/image/fetch/$s_!IIhe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7588ba-f1b7-499d-b838-21cf4fed3280_918x155.png 848w, https://substackcdn.com/image/fetch/$s_!IIhe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7588ba-f1b7-499d-b838-21cf4fed3280_918x155.png 1272w, https://substackcdn.com/image/fetch/$s_!IIhe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7588ba-f1b7-499d-b838-21cf4fed3280_918x155.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IIhe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7588ba-f1b7-499d-b838-21cf4fed3280_918x155.png" width="512" height="86.44880174291939" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd7588ba-f1b7-499d-b838-21cf4fed3280_918x155.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:155,&quot;width&quot;:918,&quot;resizeWidth&quot;:512,&quot;bytes&quot;:22920,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7588ba-f1b7-499d-b838-21cf4fed3280_918x155.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IIhe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7588ba-f1b7-499d-b838-21cf4fed3280_918x155.png 424w, https://substackcdn.com/image/fetch/$s_!IIhe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7588ba-f1b7-499d-b838-21cf4fed3280_918x155.png 848w, https://substackcdn.com/image/fetch/$s_!IIhe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7588ba-f1b7-499d-b838-21cf4fed3280_918x155.png 1272w, https://substackcdn.com/image/fetch/$s_!IIhe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd7588ba-f1b7-499d-b838-21cf4fed3280_918x155.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>By using different hyperparameters for low and high clipping thresholds, the authors increase the value of <em>&#949;<sub>high</sub></em> to create space for the increase of low-probability 
tokens, while keeping <em>&#949;<sub>low</sub></em> small, because increasing it would push low-probability tokens toward zero and collapse the effective sampling space.</p><p><strong>Trick 2: Token-Level Policy Gradient Loss</strong></p><p>The original GRPO setup reduces the loss at the sample level: it first averages the per-token losses within each response, then averages across responses, so every response carries the same weight regardless of its length. In implementation, it looks like this:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VPW2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8c3d1b-f29a-4c95-9650-5281de9ae067_1340x201.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VPW2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8c3d1b-f29a-4c95-9650-5281de9ae067_1340x201.png 424w, https://substackcdn.com/image/fetch/$s_!VPW2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8c3d1b-f29a-4c95-9650-5281de9ae067_1340x201.png 848w, https://substackcdn.com/image/fetch/$s_!VPW2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8c3d1b-f29a-4c95-9650-5281de9ae067_1340x201.png 1272w, https://substackcdn.com/image/fetch/$s_!VPW2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8c3d1b-f29a-4c95-9650-5281de9ae067_1340x201.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!VPW2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8c3d1b-f29a-4c95-9650-5281de9ae067_1340x201.png" width="1340" height="201" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed8c3d1b-f29a-4c95-9650-5281de9ae067_1340x201.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:201,&quot;width&quot;:1340,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42513,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8c3d1b-f29a-4c95-9650-5281de9ae067_1340x201.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VPW2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8c3d1b-f29a-4c95-9650-5281de9ae067_1340x201.png 424w, https://substackcdn.com/image/fetch/$s_!VPW2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8c3d1b-f29a-4c95-9650-5281de9ae067_1340x201.png 848w, https://substackcdn.com/image/fetch/$s_!VPW2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8c3d1b-f29a-4c95-9650-5281de9ae067_1340x201.png 1272w, https://substackcdn.com/image/fetch/$s_!VPW2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8c3d1b-f29a-4c95-9650-5281de9ae067_1340x201.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Because each token&#8217;s contribution is normalized by the length of its response, tokens inside long responses effectively count less. This creates two issues:</p><ol><li><p>Good long answers are undertrained, as the model can&#8217;t fully absorb the reasoning patterns carried by their key tokens.</p></li><li><p>Bad long answers are under-penalized: gibberish and repetition aren&#8217;t punished strongly enough, so entropy and response length drift upward.</p></li></ol><p>To fix this, DAPO switches to a token-level policy gradient loss: instead of normalizing each response by its own length, it normalizes by the total number of tokens in the group, so every token contributes equally to the update. The loss is:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v_aC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0130fa8-75c0-48bb-ab7f-808f6f6c11f1_1234x185.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v_aC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0130fa8-75c0-48bb-ab7f-808f6f6c11f1_1234x185.png 424w, https://substackcdn.com/image/fetch/$s_!v_aC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0130fa8-75c0-48bb-ab7f-808f6f6c11f1_1234x185.png 848w, https://substackcdn.com/image/fetch/$s_!v_aC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0130fa8-75c0-48bb-ab7f-808f6f6c11f1_1234x185.png 1272w, 
https://substackcdn.com/image/fetch/$s_!v_aC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0130fa8-75c0-48bb-ab7f-808f6f6c11f1_1234x185.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v_aC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0130fa8-75c0-48bb-ab7f-808f6f6c11f1_1234x185.png" width="1234" height="185" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0130fa8-75c0-48bb-ab7f-808f6f6c11f1_1234x185.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:185,&quot;width&quot;:1234,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:35700,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0130fa8-75c0-48bb-ab7f-808f6f6c11f1_1234x185.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v_aC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0130fa8-75c0-48bb-ab7f-808f6f6c11f1_1234x185.png 424w, https://substackcdn.com/image/fetch/$s_!v_aC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0130fa8-75c0-48bb-ab7f-808f6f6c11f1_1234x185.png 848w, 
https://substackcdn.com/image/fetch/$s_!v_aC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0130fa8-75c0-48bb-ab7f-808f6f6c11f1_1234x185.png 1272w, https://substackcdn.com/image/fetch/$s_!v_aC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0130fa8-75c0-48bb-ab7f-808f6f6c11f1_1234x185.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Under this formulation, longer sequences naturally exert more influence on the gradient than shorter ones. More importantly, useful token patterns are reinforced and harmful ones penalized to the same degree, whether they appear early or deep inside the reasoning trace.</p><p><strong>Trick 3: Dynamic Sampling</strong></p><p>A subtle issue in GRPO-style algorithms is that when all of a prompt&#8217;s outputs are correct (accuracy of 1), every sample in that group receives the same reward. The group-relative advantage then collapses to zero, so the prompt contributes no gradient. The batch effectively shrinks, which means noisier gradient estimates and degraded sample efficiency.</p><p>Worse, as training progresses, the number of &#8220;all-correct&#8221; prompts steadily increases, so the effective number of learning prompts per batch keeps shrinking. DAPO addresses this by oversampling and filtering out prompts whose accuracy is exactly 1 or 0. Only prompts with meaningful gradient signals are retained, ensuring every batch contains a stable number of informative prompts. 
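</p><p>The filtering step can be sketched in a few lines of Python. This is a minimal illustration assuming binary rewards. The names <em>has_signal</em>, <em>dynamic_sample</em>, and the toy reward table are hypothetical stand-ins, not part of the DAPO code base.</p>

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def has_signal(rewards: Sequence[float]) -> bool:
    """True when the group's rewards are not all identical.

    With binary rewards this means the accuracy is neither 0 nor 1.
    """
    return len(set(rewards)) > 1


def dynamic_sample(prompts: Iterable[str],
                   rollout: Callable[[str], List[float]],
                   batch_size: int) -> List[Tuple[str, List[float]]]:
    """Keep drawing prompts until batch_size informative groups are collected.

    Returns fewer groups if the prompt stream runs out first.
    """
    batch: List[Tuple[str, List[float]]] = []
    for prompt in prompts:
        if len(batch) == batch_size:
            break
        rewards = rollout(prompt)
        if has_signal(rewards):
            batch.append((prompt, rewards))
    return batch


# Toy stand-in for sampling responses and scoring them (hypothetical data).
TOY_REWARDS = {
    "easy": [1.0, 1.0, 1.0, 1.0],
    "hard": [0.0, 0.0, 0.0, 0.0],
    "mixed_a": [1.0, 0.0, 1.0, 0.0],
    "mixed_b": [0.0, 0.0, 0.0, 1.0],
}

if __name__ == "__main__":
    stream = ["easy", "mixed_a", "hard", "mixed_b", "easy"]
    batch = dynamic_sample(stream, lambda p: TOY_REWARDS[p], batch_size=2)
    print([prompt for prompt, _ in batch])
```

<p>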
Before each update, the system dynamically samples until the batch is fully populated with prompts that provide non-zero gradient contribution, keeping training stable even as accuracy improves.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9d-q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F412fb8f7-6399-4956-b2cd-804bc6735db0_1248x235.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9d-q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F412fb8f7-6399-4956-b2cd-804bc6735db0_1248x235.png 424w, https://substackcdn.com/image/fetch/$s_!9d-q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F412fb8f7-6399-4956-b2cd-804bc6735db0_1248x235.png 848w, https://substackcdn.com/image/fetch/$s_!9d-q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F412fb8f7-6399-4956-b2cd-804bc6735db0_1248x235.png 1272w, https://substackcdn.com/image/fetch/$s_!9d-q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F412fb8f7-6399-4956-b2cd-804bc6735db0_1248x235.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9d-q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F412fb8f7-6399-4956-b2cd-804bc6735db0_1248x235.png" width="1248" height="235" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/412fb8f7-6399-4956-b2cd-804bc6735db0_1248x235.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:235,&quot;width&quot;:1248,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:49134,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F412fb8f7-6399-4956-b2cd-804bc6735db0_1248x235.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9d-q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F412fb8f7-6399-4956-b2cd-804bc6735db0_1248x235.png 424w, https://substackcdn.com/image/fetch/$s_!9d-q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F412fb8f7-6399-4956-b2cd-804bc6735db0_1248x235.png 848w, https://substackcdn.com/image/fetch/$s_!9d-q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F412fb8f7-6399-4956-b2cd-804bc6735db0_1248x235.png 1272w, https://substackcdn.com/image/fetch/$s_!9d-q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F412fb8f7-6399-4956-b2cd-804bc6735db0_1248x235.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Trick 4: Overlong Reward Shaping</strong></p><p>Most RL pipelines cap the maximum generation length and simply truncate responses that run past it. 
The default practice is to assign a punitive reward to these truncated samples. But in long-CoT reasoning, this turns out to be harmful: a perfectly good chain of thought may get cut off just because it&#8217;s long, and the model is punished for the wrong reason. </p><p>DAPO first proposes masking the loss of truncated outputs (Overlong Filtering), which makes training significantly more stable and improves performance on benchmarks like AIME. But filtering alone isn&#8217;t enough: long responses can still waste tokens and reduce efficiency. So DAPO introduces a Soft Overlong Punishment, a gentle, length-aware shaping term:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VQdy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7c1e08-6d15-4dd8-9859-e6b2a53bc518_887x168.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VQdy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7c1e08-6d15-4dd8-9859-e6b2a53bc518_887x168.png 424w, https://substackcdn.com/image/fetch/$s_!VQdy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7c1e08-6d15-4dd8-9859-e6b2a53bc518_887x168.png 848w, https://substackcdn.com/image/fetch/$s_!VQdy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7c1e08-6d15-4dd8-9859-e6b2a53bc518_887x168.png 1272w, https://substackcdn.com/image/fetch/$s_!VQdy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7c1e08-6d15-4dd8-9859-e6b2a53bc518_887x168.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!VQdy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7c1e08-6d15-4dd8-9859-e6b2a53bc518_887x168.png" width="887" height="168" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa7c1e08-6d15-4dd8-9859-e6b2a53bc518_887x168.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:168,&quot;width&quot;:887,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26988,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7c1e08-6d15-4dd8-9859-e6b2a53bc518_887x168.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VQdy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7c1e08-6d15-4dd8-9859-e6b2a53bc518_887x168.png 424w, https://substackcdn.com/image/fetch/$s_!VQdy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7c1e08-6d15-4dd8-9859-e6b2a53bc518_887x168.png 848w, https://substackcdn.com/image/fetch/$s_!VQdy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7c1e08-6d15-4dd8-9859-e6b2a53bc518_887x168.png 1272w, https://substackcdn.com/image/fetch/$s_!VQdy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa7c1e08-6d15-4dd8-9859-e6b2a53bc518_887x168.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a></figure></div><p>This function grades the penalty smoothly once the response length enters the final <em>L<sub>cache</sub></em> tokens before <em>L<sub>max</sub></em>, and applies the full &#8722;1 penalty only when the length exceeds <em>L<sub>max</sub></em>.</p><p>Performance improves incrementally as these tricks are applied one by one:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tn8b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba1f2e7d-47c5-479e-a4a0-81c2c1cee0c7_765x398.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Tn8b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba1f2e7d-47c5-479e-a4a0-81c2c1cee0c7_765x398.png 424w, https://substackcdn.com/image/fetch/$s_!Tn8b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba1f2e7d-47c5-479e-a4a0-81c2c1cee0c7_765x398.png 848w, https://substackcdn.com/image/fetch/$s_!Tn8b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba1f2e7d-47c5-479e-a4a0-81c2c1cee0c7_765x398.png 1272w, https://substackcdn.com/image/fetch/$s_!Tn8b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba1f2e7d-47c5-479e-a4a0-81c2c1cee0c7_765x398.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Tn8b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba1f2e7d-47c5-479e-a4a0-81c2c1cee0c7_765x398.png" width="536" 
height="278.8601307189542" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ba1f2e7d-47c5-479e-a4a0-81c2c1cee0c7_765x398.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:398,&quot;width&quot;:765,&quot;resizeWidth&quot;:536,&quot;bytes&quot;:68597,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba1f2e7d-47c5-479e-a4a0-81c2c1cee0c7_765x398.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Tn8b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba1f2e7d-47c5-479e-a4a0-81c2c1cee0c7_765x398.png 424w, https://substackcdn.com/image/fetch/$s_!Tn8b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba1f2e7d-47c5-479e-a4a0-81c2c1cee0c7_765x398.png 848w, https://substackcdn.com/image/fetch/$s_!Tn8b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba1f2e7d-47c5-479e-a4a0-81c2c1cee0c7_765x398.png 1272w, https://substackcdn.com/image/fetch/$s_!Tn8b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba1f2e7d-47c5-479e-a4a0-81c2c1cee0c7_765x398.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div>
<p>Concurrently, other researchers have investigated further changes to the GRPO algorithm. The &#128073;<strong>Dr. GRPO</strong> paper [5] identifies two limitations of GRPO, both caused by normalization: dividing the loss by response length, and dividing the advantage by the per-question standard deviation of returns. 
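</p><p>To make the response-length division concrete, the sketch below compares the weight a single token receives under per-response averaging with the weight it receives when all tokens in the group are pooled, as in the token-level loss of Trick 2. The function names and the toy lengths are assumptions for illustration only.</p>

```python
from typing import List


def sample_level_token_weights(lengths: List[int]) -> List[float]:
    """Weight of one token in each response when the loss is averaged within
    a response first, then across the G responses: 1 / (G * length)."""
    group_size = len(lengths)
    return [1.0 / (group_size * length) for length in lengths]


def pooled_token_weights(lengths: List[int]) -> List[float]:
    """Weight of one token when all tokens in the group are pooled:
    1 / (total number of tokens), identical for every response."""
    total = sum(lengths)
    return [1.0 / total for _ in lengths]


if __name__ == "__main__":
    lengths = [10, 1000]  # one short and one long response (toy values)
    print("per-response averaging:", sample_level_token_weights(lengths))
    print("pooled over the group: ", pooled_token_weights(lengths))
```

<p>With lengths 10 and 1000, a token in the short response weighs 100 times more than a token in the long one under per-response averaging, while pooling gives both the same weight.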
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y_9Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee03690-c327-49a7-9aa7-7e39fd2ba901_1302x302.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y_9Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee03690-c327-49a7-9aa7-7e39fd2ba901_1302x302.png 424w, https://substackcdn.com/image/fetch/$s_!y_9Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee03690-c327-49a7-9aa7-7e39fd2ba901_1302x302.png 848w, https://substackcdn.com/image/fetch/$s_!y_9Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee03690-c327-49a7-9aa7-7e39fd2ba901_1302x302.png 1272w, https://substackcdn.com/image/fetch/$s_!y_9Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee03690-c327-49a7-9aa7-7e39fd2ba901_1302x302.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y_9Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee03690-c327-49a7-9aa7-7e39fd2ba901_1302x302.png" width="1302" height="302" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fee03690-c327-49a7-9aa7-7e39fd2ba901_1302x302.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:302,&quot;width&quot;:1302,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:73740,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee03690-c327-49a7-9aa7-7e39fd2ba901_1302x302.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!y_9Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee03690-c327-49a7-9aa7-7e39fd2ba901_1302x302.png 424w, https://substackcdn.com/image/fetch/$s_!y_9Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee03690-c327-49a7-9aa7-7e39fd2ba901_1302x302.png 848w, https://substackcdn.com/image/fetch/$s_!y_9Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee03690-c327-49a7-9aa7-7e39fd2ba901_1302x302.png 1272w, https://substackcdn.com/image/fetch/$s_!y_9Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee03690-c327-49a7-9aa7-7e39fd2ba901_1302x302.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Both seem harmless at first glance, but together they create two systematic biases:</p><p>&#10060; <strong>Length bias: </strong>If a response is correct, shorter outputs get larger 
gradients because you&#8217;re dividing a positive advantage by fewer tokens. If a response is incorrect, longer outputs get <em>less</em> penalty because they&#8217;re cushioned by a bigger length divisor. Over time, this pushes the policy toward long incorrect chains and short correct ones. Simple fix:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gMcz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b949a33-43ec-4824-abe6-d50fdc34ccad_1452x397.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gMcz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b949a33-43ec-4824-abe6-d50fdc34ccad_1452x397.png 424w, https://substackcdn.com/image/fetch/$s_!gMcz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b949a33-43ec-4824-abe6-d50fdc34ccad_1452x397.png 848w, https://substackcdn.com/image/fetch/$s_!gMcz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b949a33-43ec-4824-abe6-d50fdc34ccad_1452x397.png 1272w, https://substackcdn.com/image/fetch/$s_!gMcz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b949a33-43ec-4824-abe6-d50fdc34ccad_1452x397.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gMcz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b949a33-43ec-4824-abe6-d50fdc34ccad_1452x397.png" width="1452" height="397" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4b949a33-43ec-4824-abe6-d50fdc34ccad_1452x397.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:397,&quot;width&quot;:1452,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115402,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b949a33-43ec-4824-abe6-d50fdc34ccad_1452x397.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gMcz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b949a33-43ec-4824-abe6-d50fdc34ccad_1452x397.png 424w, https://substackcdn.com/image/fetch/$s_!gMcz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b949a33-43ec-4824-abe6-d50fdc34ccad_1452x397.png 848w, https://substackcdn.com/image/fetch/$s_!gMcz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b949a33-43ec-4824-abe6-d50fdc34ccad_1452x397.png 1272w, https://substackcdn.com/image/fetch/$s_!gMcz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b949a33-43ec-4824-abe6-d50fdc34ccad_1452x397.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div>
<blockquote><p>&#128064; This is similar to the token-level policy loss trick in the DAPO paper.</p></blockquote><p>&#10060; <strong>Difficulty bias: </strong>Normalizing by the per-question standard deviation means that questions with low reward variance, where nearly all responses are correct (or nearly all wrong), get disproportionate weight, because their advantages are divided by a small number. Very easy or very hard questions then dominate updates, while medium-difficulty ones, where reasoning actually matters, get suppressed. 
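</p><p>A toy calculation shows the effect. The sketch below assumes binary rewards, a group of 8 responses, and the population standard deviation (implementations may differ on that choice). The function names are hypothetical.</p>

```python
import statistics
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage with std normalization (population std assumed)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def unnormalized_advantages(rewards: List[float]) -> List[float]:
    """Same centering, but without dividing by the std."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


if __name__ == "__main__":
    nearly_solved = [1.0] * 7 + [0.0]      # 7 of 8 responses correct
    half_solved = [1.0] * 4 + [0.0] * 4    # 4 of 8 responses correct
    for name, rewards in [("nearly solved", nearly_solved),
                          ("half solved", half_solved)]:
        scale = 1.0 / statistics.pstdev(rewards)
        print(f"{name}: std scaling factor = {scale:.2f}")
```

<p>In this toy case the std division scales the nearly solved question&#8217;s advantages by about 3.0, versus 2.0 for the half-solved one, so the easier question is amplified more.</p><p>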
This can be simply fixed by removing the std denominator:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sKUP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e095d9-b3a3-4100-ba6a-525225c7aa01_1213x117.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sKUP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e095d9-b3a3-4100-ba6a-525225c7aa01_1213x117.png 424w, https://substackcdn.com/image/fetch/$s_!sKUP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e095d9-b3a3-4100-ba6a-525225c7aa01_1213x117.png 848w, https://substackcdn.com/image/fetch/$s_!sKUP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e095d9-b3a3-4100-ba6a-525225c7aa01_1213x117.png 1272w, https://substackcdn.com/image/fetch/$s_!sKUP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e095d9-b3a3-4100-ba6a-525225c7aa01_1213x117.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sKUP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e095d9-b3a3-4100-ba6a-525225c7aa01_1213x117.png" width="452" height="43.59769167353669" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2e095d9-b3a3-4100-ba6a-525225c7aa01_1213x117.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:117,&quot;width&quot;:1213,&quot;resizeWidth&quot;:452,&quot;bytes&quot;:28008,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e095d9-b3a3-4100-ba6a-525225c7aa01_1213x117.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sKUP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e095d9-b3a3-4100-ba6a-525225c7aa01_1213x117.png 424w, https://substackcdn.com/image/fetch/$s_!sKUP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e095d9-b3a3-4100-ba6a-525225c7aa01_1213x117.png 848w, https://substackcdn.com/image/fetch/$s_!sKUP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e095d9-b3a3-4100-ba6a-525225c7aa01_1213x117.png 1272w, https://substackcdn.com/image/fetch/$s_!sKUP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e095d9-b3a3-4100-ba6a-525225c7aa01_1213x117.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>These tricks accelerate the progress of RL post-training:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!ba4U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632bc337-5629-4b9f-8e2f-41fbf0af2e7a_848x650.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ba4U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632bc337-5629-4b9f-8e2f-41fbf0af2e7a_848x650.png 424w, https://substackcdn.com/image/fetch/$s_!ba4U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632bc337-5629-4b9f-8e2f-41fbf0af2e7a_848x650.png 848w, https://substackcdn.com/image/fetch/$s_!ba4U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632bc337-5629-4b9f-8e2f-41fbf0af2e7a_848x650.png 1272w, https://substackcdn.com/image/fetch/$s_!ba4U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632bc337-5629-4b9f-8e2f-41fbf0af2e7a_848x650.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ba4U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632bc337-5629-4b9f-8e2f-41fbf0af2e7a_848x650.png" width="468" height="358.72641509433964" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/632bc337-5629-4b9f-8e2f-41fbf0af2e7a_848x650.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:650,&quot;width&quot;:848,&quot;resizeWidth&quot;:468,&quot;bytes&quot;:95276,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632bc337-5629-4b9f-8e2f-41fbf0af2e7a_848x650.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ba4U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632bc337-5629-4b9f-8e2f-41fbf0af2e7a_848x650.png 424w, https://substackcdn.com/image/fetch/$s_!ba4U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632bc337-5629-4b9f-8e2f-41fbf0af2e7a_848x650.png 848w, https://substackcdn.com/image/fetch/$s_!ba4U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632bc337-5629-4b9f-8e2f-41fbf0af2e7a_848x650.png 1272w, https://substackcdn.com/image/fetch/$s_!ba4U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632bc337-5629-4b9f-8e2f-41fbf0af2e7a_848x650.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In addition to the above simple tricks, other papers look for fundamental changes to the RL algorithm. One of the big limitations of pure on-policy RL for reasoning (GRPO/PPO-style) is that the model can only explore what it can generate. If the base model can&#8217;t produce high-quality chains of thought, then RL mostly just amplifies its existing bias patterns instead of teaching it genuinely new reasoning skills.</p><p>The paper, Learning to Reason under Off-Policy Guidance (&#128073;<strong>LUFFY</strong> [6]), introduces  a simple but surprisingly effective idea: use off-policy guidance such as reasoning traces from a much stronger model (e.g., DeepSeek R1), and fold them directly into the RL pipeline so the learner can &#8220;see&#8221; good reasoning even before it can generate any. 
But unlike na&#239;ve imitation, the goal here is not to copy the teacher; it&#8217;s to blend demonstrations into GRPO in a way that preserves exploration and lets the student ultimately outgrow the off-policy samples.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!reTo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ff3fd3-26b6-40c6-a972-e34110ef148b_946x416.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!reTo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ff3fd3-26b6-40c6-a972-e34110ef148b_946x416.png 424w, https://substackcdn.com/image/fetch/$s_!reTo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ff3fd3-26b6-40c6-a972-e34110ef148b_946x416.png 848w, https://substackcdn.com/image/fetch/$s_!reTo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ff3fd3-26b6-40c6-a972-e34110ef148b_946x416.png 1272w, https://substackcdn.com/image/fetch/$s_!reTo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ff3fd3-26b6-40c6-a972-e34110ef148b_946x416.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!reTo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ff3fd3-26b6-40c6-a972-e34110ef148b_946x416.png" width="946" height="416" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/81ff3fd3-26b6-40c6-a972-e34110ef148b_946x416.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:416,&quot;width&quot;:946,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66724,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ff3fd3-26b6-40c6-a972-e34110ef148b_946x416.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!reTo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ff3fd3-26b6-40c6-a972-e34110ef148b_946x416.png 424w, https://substackcdn.com/image/fetch/$s_!reTo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ff3fd3-26b6-40c6-a972-e34110ef148b_946x416.png 848w, https://substackcdn.com/image/fetch/$s_!reTo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ff3fd3-26b6-40c6-a972-e34110ef148b_946x416.png 1272w, https://substackcdn.com/image/fetch/$s_!reTo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ff3fd3-26b6-40c6-a972-e34110ef148b_946x416.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Trajectories from an off-policy model are used for training the current model. Source: [6]</figcaption></figure></div><p>Formally, let <em>G<sub>on</sub></em> and <em>G<sub>off</sub></em> denote on-policy and off-policy trajectory groups, sampled from <em>&#960;<sub>&#952;old</sub></em> and  <em>&#960;<sub>&#981;</sub></em>, respectively. 
LUFFY computes advantages using the union of both:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EP3m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faebeb4be-89a0-4b74-a825-9f1fcda9622f_572x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EP3m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faebeb4be-89a0-4b74-a825-9f1fcda9622f_572x120.png 424w, https://substackcdn.com/image/fetch/$s_!EP3m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faebeb4be-89a0-4b74-a825-9f1fcda9622f_572x120.png 848w, https://substackcdn.com/image/fetch/$s_!EP3m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faebeb4be-89a0-4b74-a825-9f1fcda9622f_572x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EP3m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faebeb4be-89a0-4b74-a825-9f1fcda9622f_572x120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EP3m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faebeb4be-89a0-4b74-a825-9f1fcda9622f_572x120.png" width="406" height="85.17482517482517" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aebeb4be-89a0-4b74-a825-9f1fcda9622f_572x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:120,&quot;width&quot;:572,&quot;resizeWidth&quot;:406,&quot;bytes&quot;:20262,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faebeb4be-89a0-4b74-a825-9f1fcda9622f_572x120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EP3m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faebeb4be-89a0-4b74-a825-9f1fcda9622f_572x120.png 424w, https://substackcdn.com/image/fetch/$s_!EP3m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faebeb4be-89a0-4b74-a825-9f1fcda9622f_572x120.png 848w, https://substackcdn.com/image/fetch/$s_!EP3m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faebeb4be-89a0-4b74-a825-9f1fcda9622f_572x120.png 1272w, https://substackcdn.com/image/fetch/$s_!EP3m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faebeb4be-89a0-4b74-a825-9f1fcda9622f_572x120.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This ensures that high-quality off-policy trajectories receive larger advantages early in training (when on-policy rollouts are still weak), accelerating learning without overriding on-policy 
exploration as the policy improves. Then, two PPO-style objectives, one for on-policy and one for off-policy training, are used to update the current policy. Here, the off-policy term uses its own importance sampling ratio:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\hat{r}_{j,t} = \\frac{\\pi_\\theta(\\tau_{j,t})}{\\pi_\\phi(\\tau_{j,t})}&quot;,&quot;id&quot;:&quot;ECGHENBMOD&quot;}" data-component-name="LatexBlockToDOM"></div><p>Together, these give a mixed objective:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ijBO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f91d04d-8542-4d1e-961e-1d324a1b78e2_1378x246.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ijBO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f91d04d-8542-4d1e-961e-1d324a1b78e2_1378x246.png 424w, https://substackcdn.com/image/fetch/$s_!ijBO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f91d04d-8542-4d1e-961e-1d324a1b78e2_1378x246.png 848w, https://substackcdn.com/image/fetch/$s_!ijBO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f91d04d-8542-4d1e-961e-1d324a1b78e2_1378x246.png 1272w, https://substackcdn.com/image/fetch/$s_!ijBO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f91d04d-8542-4d1e-961e-1d324a1b78e2_1378x246.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ijBO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f91d04d-8542-4d1e-961e-1d324a1b78e2_1378x246.png" width="1378" height="246" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f91d04d-8542-4d1e-961e-1d324a1b78e2_1378x246.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:246,&quot;width&quot;:1378,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42675,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f91d04d-8542-4d1e-961e-1d324a1b78e2_1378x246.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ijBO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f91d04d-8542-4d1e-961e-1d324a1b78e2_1378x246.png 424w, https://substackcdn.com/image/fetch/$s_!ijBO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f91d04d-8542-4d1e-961e-1d324a1b78e2_1378x246.png 848w, https://substackcdn.com/image/fetch/$s_!ijBO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f91d04d-8542-4d1e-961e-1d324a1b78e2_1378x246.png 1272w, https://substackcdn.com/image/fetch/$s_!ijBO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f91d04d-8542-4d1e-961e-1d324a1b78e2_1378x246.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>Z</em> is the total number of tokens, serving as a normalization term. </p><blockquote><p>&#128064; Because <em>&#960;<sub>&#952;</sub></em> is much closer to <em>&#960;<sub>&#952;old</sub></em> than to <em>&#960;<sub>&#981;</sub></em>, the off-policy ratio gradually becomes smaller, naturally tempering gradients from the off-policy data.</p></blockquote><p>Now, one problem arises. Naively mixing off-policy data leads to rapid entropy collapse: the model overcommits to high-probability actions that coincide with off-policy tokens, eliminating the exploratory behaviour required for multi-step reasoning. To counter this, LUFFY applies a shaping transformation <em>f(r)</em> to the off-policy importance ratio and removes the clip function, yielding the shaped off-policy gradient:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_\\theta J_{\\text{SHAPING-OFF}} =\n  \\mathbb{E}_{\\tau \\sim \\pi_\\phi}\n  \\left[\n    f'(\\pi_\\theta)\n    \\frac{\\pi_\\theta}{\\pi_\\phi}\n    \\nabla_\\theta \\log \\pi_\\theta\n    \\cdot \\hat{A}_j\n  \\right],&quot;,&quot;id&quot;:&quot;VAMQSJMXTN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Deriving the gradient contribution for each candidate off-policy token <em>&#964;&#8242;<sub>t</sub></em> at time <em>t</em> gives:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!miKd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce63816d-fa24-46ff-8f2b-a36a4d2267b9_948x187.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!miKd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce63816d-fa24-46ff-8f2b-a36a4d2267b9_948x187.png 424w, https://substackcdn.com/image/fetch/$s_!miKd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce63816d-fa24-46ff-8f2b-a36a4d2267b9_948x187.png 848w, https://substackcdn.com/image/fetch/$s_!miKd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce63816d-fa24-46ff-8f2b-a36a4d2267b9_948x187.png 1272w, https://substackcdn.com/image/fetch/$s_!miKd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce63816d-fa24-46ff-8f2b-a36a4d2267b9_948x187.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!miKd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce63816d-fa24-46ff-8f2b-a36a4d2267b9_948x187.png" width="508" height="100.20675105485232" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce63816d-fa24-46ff-8f2b-a36a4d2267b9_948x187.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:187,&quot;width&quot;:948,&quot;resizeWidth&quot;:508,&quot;bytes&quot;:44813,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce63816d-fa24-46ff-8f2b-a36a4d2267b9_948x187.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!miKd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce63816d-fa24-46ff-8f2b-a36a4d2267b9_948x187.png 424w, https://substackcdn.com/image/fetch/$s_!miKd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce63816d-fa24-46ff-8f2b-a36a4d2267b9_948x187.png 848w, https://substackcdn.com/image/fetch/$s_!miKd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce63816d-fa24-46ff-8f2b-a36a4d2267b9_948x187.png 1272w, https://substackcdn.com/image/fetch/$s_!miKd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce63816d-fa24-46ff-8f2b-a36a4d2267b9_948x187.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Without the shaping function, i.e., <em>f(r) = r</em> and <em>f&#8242;(r) = 1</em>, the gradient magnitude is bounded by <em>&#960;<sub>&#952;</sub>(1&#8722;&#960;<sub>&#952;</sub>)</em>. This term is tiny when:</p><ul><li><p><em>&#960;<sub>&#952;</sub></em> is near 0 &#8594; the model thinks the token is impossible</p></li><li><p><em>&#960;<sub>&#952;</sub></em> is near 1 &#8594; the model is already confident</p></li></ul><p>This is a problem because the model struggles to get gradient signals from new reasoning moves, which often start with low-probability tokens. Thus, the authors propose a shaping function of the form:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; f(\\hat{r}) = \\frac{\\hat{r}}{\\hat{r} + \\gamma},&quot;,&quot;id&quot;:&quot;ZRVRTZHPMB&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>&#947;</em> is set to 0.1. 
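</p><p>As a rough numerical sketch (not the authors' implementation), the snippet below evaluates this shaping function and its derivative, which follows from the quotient rule as <em>f&#8242;(r) = &#947; / (r + &#947;)<sup>2</sup></em>. It treats the ratio <em>r</em> as a plain number in (0, 1]; that simplification and the helper names <em>shaping</em> and <em>shaping_grad</em> are assumptions made for illustration.</p>
<pre>
```python
GAMMA = 0.1  # value of gamma quoted in the text


def shaping(r, gamma=GAMMA):
    """Shaping function f(r) = r / (r + gamma)."""
    return r / (r + gamma)


def shaping_grad(r, gamma=GAMMA):
    """Derivative f'(r) = gamma / (r + gamma)^2, from the quotient rule."""
    return gamma / (r + gamma) ** 2


if __name__ == "__main__":
    # Assumption for this sketch: r is a plain number in (0, 1].
    # Without shaping, f(r) = r and the extra weight f'(r) is always 1.
    for r in (0.01, 0.1, 0.5, 0.9, 1.0):
        print(f"r={r:.2f}  f(r)={shaping(r):.3f}  f'(r)={shaping_grad(r):.3f}")
```
</pre>
<p>The derivative is written in closed form rather than obtained through automatic differentiation, which keeps the sketch free of dependencies.</p>
<p>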
This has two key effects:</p><ul><li><p>When <em>&#960;<sub>&#952;</sub></em> is small, <em>f&#8242;</em> is large &#8658; boosts gradients</p></li><li><p>When <em>&#960;<sub>&#952;</sub></em> is large, <em>f&#8242;</em> shrinks &#8658; dampens near-certain tokens</p></li></ul><p>This improves exploration and avoids entropy collapse:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PpQU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c69654-7b39-4d44-94cd-24995def6543_1306x421.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PpQU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c69654-7b39-4d44-94cd-24995def6543_1306x421.png 424w, https://substackcdn.com/image/fetch/$s_!PpQU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c69654-7b39-4d44-94cd-24995def6543_1306x421.png 848w, https://substackcdn.com/image/fetch/$s_!PpQU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c69654-7b39-4d44-94cd-24995def6543_1306x421.png 1272w, https://substackcdn.com/image/fetch/$s_!PpQU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c69654-7b39-4d44-94cd-24995def6543_1306x421.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PpQU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c69654-7b39-4d44-94cd-24995def6543_1306x421.png" width="1306" height="421" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3c69654-7b39-4d44-94cd-24995def6543_1306x421.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:421,&quot;width&quot;:1306,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:169344,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c69654-7b39-4d44-94cd-24995def6543_1306x421.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PpQU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c69654-7b39-4d44-94cd-24995def6543_1306x421.png 424w, https://substackcdn.com/image/fetch/$s_!PpQU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c69654-7b39-4d44-94cd-24995def6543_1306x421.png 848w, https://substackcdn.com/image/fetch/$s_!PpQU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c69654-7b39-4d44-94cd-24995def6543_1306x421.png 1272w, https://substackcdn.com/image/fetch/$s_!PpQU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3c69654-7b39-4d44-94cd-24995def6543_1306x421.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>and better performance overall:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QXx9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6333a727-f377-4af8-984f-823fb5433604_1561x757.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QXx9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6333a727-f377-4af8-984f-823fb5433604_1561x757.png 424w, 
https://substackcdn.com/image/fetch/$s_!QXx9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6333a727-f377-4af8-984f-823fb5433604_1561x757.png 848w, https://substackcdn.com/image/fetch/$s_!QXx9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6333a727-f377-4af8-984f-823fb5433604_1561x757.png 1272w, https://substackcdn.com/image/fetch/$s_!QXx9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6333a727-f377-4af8-984f-823fb5433604_1561x757.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QXx9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6333a727-f377-4af8-984f-823fb5433604_1561x757.png" width="1456" height="706" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6333a727-f377-4af8-984f-823fb5433604_1561x757.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:706,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:124326,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6333a727-f377-4af8-984f-823fb5433604_1561x757.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!QXx9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6333a727-f377-4af8-984f-823fb5433604_1561x757.png 424w, https://substackcdn.com/image/fetch/$s_!QXx9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6333a727-f377-4af8-984f-823fb5433604_1561x757.png 848w, https://substackcdn.com/image/fetch/$s_!QXx9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6333a727-f377-4af8-984f-823fb5433604_1561x757.png 1272w, https://substackcdn.com/image/fetch/$s_!QXx9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6333a727-f377-4af8-984f-823fb5433604_1561x757.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h4>Reward Shaping</h4><p>Improving the RL algorithm is only part of the story. The other part is how we tell the model what success looks like. Reward design shapes the entire search landscape that the model explores. A good reward doesn&#8217;t just score outputs; it pulls the model toward the kinds of internal reasoning moves we want it to adopt. And unlike test-time scaling, reward shaping changes the model&#8217;s internal dynamics, not just its decoding strategy. The right reward can turn a passive pattern-matcher into an active problem-solver.</p><p>One important element in determining reasoning quality is the CoT length: give models more thinking time, and they often reason better. But when you fine-tune a model on long-form CoT traces and then optimize it with RL, the model doesn&#8217;t merely preserve the long patterns; it tends to <em>amplify</em> them. Both Llama-3.1-8B and Qwen-2.5-Math-7B quickly push their CoTs longer and longer until they hit the context limit. Once trajectories exceed the max length, the model gets a negative or zero reward, not because the reasoning is bad, but because the sequence literally doesn&#8217;t fit. </p><blockquote><p>&#128064; This is similar to the problem addressed by Trick 4 in the DAPO paper.</p></blockquote><p>To fix this, the authors propose a length-based reward shaping scheme named &#128073;<strong>Cosine Reward </strong>[7]. 
It introduces a very clean idea: make the reward sensitive to the CoT length.</p><p>Instead of the classic &#8220;1 for correct, 0 for incorrect&#8221; sparse reward, they use a piecewise cosine reward that encodes three simple principles:</p><ol><li><p><strong>Correct &gt; wrong:</strong> Correct answers always get a higher reward than wrong ones.</p></li><li><p><strong>Shorter &gt; longer (if correct):</strong> Among correct solutions, shorter CoTs are better.</p></li><li><p><strong>Longer &gt; shorter (if wrong):</strong> If the model is wrong, encourage it to think longer next time.</p></li></ol><p>To implement this, they propose a reward formula:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u3ex!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c7a712-c988-4b7c-8f19-83e9eb210dd9_932x616.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u3ex!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c7a712-c988-4b7c-8f19-83e9eb210dd9_932x616.png 424w, https://substackcdn.com/image/fetch/$s_!u3ex!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c7a712-c988-4b7c-8f19-83e9eb210dd9_932x616.png 848w, https://substackcdn.com/image/fetch/$s_!u3ex!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c7a712-c988-4b7c-8f19-83e9eb210dd9_932x616.png 1272w, https://substackcdn.com/image/fetch/$s_!u3ex!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c7a712-c988-4b7c-8f19-83e9eb210dd9_932x616.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u3ex!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c7a712-c988-4b7c-8f19-83e9eb210dd9_932x616.png" width="526" height="347.656652360515" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7c7a712-c988-4b7c-8f19-83e9eb210dd9_932x616.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:616,&quot;width&quot;:932,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:107654,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c7a712-c988-4b7c-8f19-83e9eb210dd9_932x616.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!u3ex!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c7a712-c988-4b7c-8f19-83e9eb210dd9_932x616.png 424w, https://substackcdn.com/image/fetch/$s_!u3ex!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c7a712-c988-4b7c-8f19-83e9eb210dd9_932x616.png 848w, https://substackcdn.com/image/fetch/$s_!u3ex!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c7a712-c988-4b7c-8f19-83e9eb210dd9_932x616.png 1272w, 
https://substackcdn.com/image/fetch/$s_!u3ex!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7c7a712-c988-4b7c-8f19-83e9eb210dd9_932x616.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>where:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gNYB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabb78c92-68dc-4565-abfc-45977fc927f9_1446x153.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gNYB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabb78c92-68dc-4565-abfc-45977fc927f9_1446x153.png 424w, https://substackcdn.com/image/fetch/$s_!gNYB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabb78c92-68dc-4565-abfc-45977fc927f9_1446x153.png 848w, https://substackcdn.com/image/fetch/$s_!gNYB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabb78c92-68dc-4565-abfc-45977fc927f9_1446x153.png 1272w, https://substackcdn.com/image/fetch/$s_!gNYB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabb78c92-68dc-4565-abfc-45977fc927f9_1446x153.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gNYB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabb78c92-68dc-4565-abfc-45977fc927f9_1446x153.png" width="1446" height="153" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/abb78c92-68dc-4565-abfc-45977fc927f9_1446x153.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:153,&quot;width&quot;:1446,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29517,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabb78c92-68dc-4565-abfc-45977fc927f9_1446x153.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gNYB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabb78c92-68dc-4565-abfc-45977fc927f9_1446x153.png 424w, https://substackcdn.com/image/fetch/$s_!gNYB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabb78c92-68dc-4565-abfc-45977fc927f9_1446x153.png 848w, https://substackcdn.com/image/fetch/$s_!gNYB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabb78c92-68dc-4565-abfc-45977fc927f9_1446x153.png 1272w, https://substackcdn.com/image/fetch/$s_!gNYB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabb78c92-68dc-4565-abfc-45977fc927f9_1446x153.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This function makes sure that as <em>L<sub>gen</sub></em> increases to  <em>L<sub>max</sub></em>, the reward smoothly changes between the two reward hyperparameters. 
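</p><p>To make the shape concrete, here is a minimal Python sketch of a length-aware cosine reward. It is an illustration rather than the authors&#8217; implementation: the function names and the five reward values are assumptions picked only to satisfy the three principles above, and the exact formula is the one shown in the figure.</p>
<pre>
```python
import math

# Hypothetical reward endpoints, chosen only for illustration.
R_CORRECT_SHORT = 2.0   # correct answer, length near 0
R_CORRECT_LONG = 1.0    # correct answer, length near max_len
R_WRONG_SHORT = -10.0   # wrong answer, length near 0
R_WRONG_LONG = 0.0      # wrong answer, length near max_len
R_EXCEED = -10.0        # response reached the length limit


def cosine_interp(length, max_len, r_start, r_end):
    """Move smoothly from r_start (length 0) to r_end (length max_len)."""
    phase = math.pi * length / max_len
    return r_end + 0.5 * (r_start - r_end) * (1.0 + math.cos(phase))


def cosine_reward(is_correct, gen_len, max_len):
    """Length-aware reward for a single response (illustrative sketch)."""
    if gen_len >= max_len:
        # The response did not fit, so it gets the fixed penalty.
        return R_EXCEED
    if is_correct:
        # Correct: reward shrinks as the CoT gets longer.
        return cosine_interp(gen_len, max_len, R_CORRECT_SHORT, R_CORRECT_LONG)
    # Wrong: reward grows as the CoT gets longer.
    return cosine_interp(gen_len, max_len, R_WRONG_SHORT, R_WRONG_LONG)
```
</pre>
<p>With these placeholder values and a 1,000-token limit, a correct 100-token answer scores higher than a correct 900-token one, while the ordering flips for wrong answers.</p><p>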
Depending on correctness, the reward either decreases with length (correct answers) or increases with length (incorrect answers):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MsgE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d13ceb0-95ff-4656-9ce9-c104bf0cb5d1_950x405.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MsgE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d13ceb0-95ff-4656-9ce9-c104bf0cb5d1_950x405.png 424w, https://substackcdn.com/image/fetch/$s_!MsgE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d13ceb0-95ff-4656-9ce9-c104bf0cb5d1_950x405.png 848w, https://substackcdn.com/image/fetch/$s_!MsgE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d13ceb0-95ff-4656-9ce9-c104bf0cb5d1_950x405.png 1272w, https://substackcdn.com/image/fetch/$s_!MsgE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d13ceb0-95ff-4656-9ce9-c104bf0cb5d1_950x405.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MsgE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d13ceb0-95ff-4656-9ce9-c104bf0cb5d1_950x405.png" width="950" height="405" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d13ceb0-95ff-4656-9ce9-c104bf0cb5d1_950x405.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:405,&quot;width&quot;:950,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:68437,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d13ceb0-95ff-4656-9ce9-c104bf0cb5d1_950x405.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MsgE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d13ceb0-95ff-4656-9ce9-c104bf0cb5d1_950x405.png 424w, https://substackcdn.com/image/fetch/$s_!MsgE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d13ceb0-95ff-4656-9ce9-c104bf0cb5d1_950x405.png 848w, https://substackcdn.com/image/fetch/$s_!MsgE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d13ceb0-95ff-4656-9ce9-c104bf0cb5d1_950x405.png 1272w, https://substackcdn.com/image/fetch/$s_!MsgE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d13ceb0-95ff-4656-9ce9-c104bf0cb5d1_950x405.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The addition of Cosine Reward to RL training helps stabilize training as the number of training iterations increases:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RPaR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F962cf9a3-ed87-4023-b34d-7b8d7caaf832_917x399.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RPaR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F962cf9a3-ed87-4023-b34d-7b8d7caaf832_917x399.png 424w, 
https://substackcdn.com/image/fetch/$s_!RPaR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F962cf9a3-ed87-4023-b34d-7b8d7caaf832_917x399.png 848w, https://substackcdn.com/image/fetch/$s_!RPaR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F962cf9a3-ed87-4023-b34d-7b8d7caaf832_917x399.png 1272w, https://substackcdn.com/image/fetch/$s_!RPaR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F962cf9a3-ed87-4023-b34d-7b8d7caaf832_917x399.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RPaR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F962cf9a3-ed87-4023-b34d-7b8d7caaf832_917x399.png" width="917" height="399" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/962cf9a3-ed87-4023-b34d-7b8d7caaf832_917x399.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:399,&quot;width&quot;:917,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80205,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F962cf9a3-ed87-4023-b34d-7b8d7caaf832_917x399.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RPaR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F962cf9a3-ed87-4023-b34d-7b8d7caaf832_917x399.png 424w, 
https://substackcdn.com/image/fetch/$s_!RPaR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F962cf9a3-ed87-4023-b34d-7b8d7caaf832_917x399.png 848w, https://substackcdn.com/image/fetch/$s_!RPaR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F962cf9a3-ed87-4023-b34d-7b8d7caaf832_917x399.png 1272w, https://substackcdn.com/image/fetch/$s_!RPaR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F962cf9a3-ed87-4023-b34d-7b8d7caaf832_917x399.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>&#128064; As we will see later, the performance drop that occurs when response length suddenly jumps to the maximum is one of the &#8220;training collapses&#8221; that often occur in RL post-training. </p></blockquote><p>However, a length-based reward like Cosine Reward can be problematic. Because longer responses are encouraged for incorrect answers, the model exhibited reward-hacking behavior, inflating CoT length on difficult problems through repetition rather than genuine reasoning. </p><p>To address this, the authors introduce an N-gram repetition penalty: apply token-level penalties directly at the positions where repetition occurs. By recording the N-grams seen so far at every step, they can detect a repeated N-gram and add a penalty at that position (as a negative reward, for instance). </p><p>Together, these reward shaping techniques significantly improve reasoning performance on math datasets:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eWbf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb95288-5037-4643-b1fe-4a4ef2db00a1_987x380.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eWbf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb95288-5037-4643-b1fe-4a4ef2db00a1_987x380.png 424w, https://substackcdn.com/image/fetch/$s_!eWbf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb95288-5037-4643-b1fe-4a4ef2db00a1_987x380.png 848w, 
https://substackcdn.com/image/fetch/$s_!eWbf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb95288-5037-4643-b1fe-4a4ef2db00a1_987x380.png 1272w, https://substackcdn.com/image/fetch/$s_!eWbf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb95288-5037-4643-b1fe-4a4ef2db00a1_987x380.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eWbf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb95288-5037-4643-b1fe-4a4ef2db00a1_987x380.png" width="987" height="380" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6eb95288-5037-4643-b1fe-4a4ef2db00a1_987x380.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:987,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:79354,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb95288-5037-4643-b1fe-4a4ef2db00a1_987x380.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eWbf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb95288-5037-4643-b1fe-4a4ef2db00a1_987x380.png 424w, https://substackcdn.com/image/fetch/$s_!eWbf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb95288-5037-4643-b1fe-4a4ef2db00a1_987x380.png 848w, 
https://substackcdn.com/image/fetch/$s_!eWbf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb95288-5037-4643-b1fe-4a4ef2db00a1_987x380.png 1272w, https://substackcdn.com/image/fetch/$s_!eWbf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb95288-5037-4643-b1fe-4a4ef2db00a1_987x380.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As we can see, length-aware reward shaping helps stabilize emergent CoT scaling by nudging models toward efficient reasoning 
trajectories. However, even sophisticated length-based rewards face a key limitation: they assume the model is already capable of producing coherent, partially correct reasoning. This assumption breaks down for weaker models such as tiny LLMs (&#8804;1B parameters), where reasoning is fragile and outcome rewards are extremely sparse. This leads to a natural question: &#129504; <em>How do we guide RL training when the model fails to produce any correct trajectories for long periods of time?</em></p><p>This is where the &#128073;<strong>Memory-R+</strong> [8] paper contributes a major new idea: memory-augmented intrinsic rewards. The Memory-R+ method proposes a shift in perspective: instead of relying solely on external rewards (correct/incorrect), the model should also learn from its own past reasoning, much like humans rely on episodic memory.</p><p>The approach introduces two episodic memory banks:</p><ul><li><p><strong>Success Memory (M<sub>s</sub>):</strong> stores reasoning traces that led to the correct answer</p></li><li><p><strong>Failure Memory (M<sub>f</sub>):</strong> stores reasoning traces that produced incorrect results</p></li></ul><p>These memories are stored as embeddings in a shared representation space using a Sentence Transformer encoder. The key idea is simple but powerful:</p><ul><li><p>Reward the model when its reasoning resembles past successes</p></li><li><p>Reward the model when it avoids past failures</p></li></ul><p>This creates a dense, performance-driven intrinsic reward signal that tiny models desperately need. 
Given a new query, the memory system:</p><ol><li><p>Looks up similar past queries in both success and failure memories using k-NN in embedding space.</p></li><li><p>Retrieves the associated reasoning traces (not just final answers).</p></li><li><p>Evaluates the new response based on:</p><ul><li><p>similarity to successful reasoning (exploit)</p></li><li><p>dissimilarity to failed reasoning (explore)</p></li></ul></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!si0V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0072b47-3a4c-44aa-b1a7-2849d1754569_1584x877.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!si0V!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0072b47-3a4c-44aa-b1a7-2849d1754569_1584x877.png 424w, https://substackcdn.com/image/fetch/$s_!si0V!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0072b47-3a4c-44aa-b1a7-2849d1754569_1584x877.png 848w, https://substackcdn.com/image/fetch/$s_!si0V!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0072b47-3a4c-44aa-b1a7-2849d1754569_1584x877.png 1272w, https://substackcdn.com/image/fetch/$s_!si0V!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0072b47-3a4c-44aa-b1a7-2849d1754569_1584x877.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!si0V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0072b47-3a4c-44aa-b1a7-2849d1754569_1584x877.png" width="1456" height="806" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0072b47-3a4c-44aa-b1a7-2849d1754569_1584x877.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:806,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:284486,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0072b47-3a4c-44aa-b1a7-2849d1754569_1584x877.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!si0V!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0072b47-3a4c-44aa-b1a7-2849d1754569_1584x877.png 424w, https://substackcdn.com/image/fetch/$s_!si0V!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0072b47-3a4c-44aa-b1a7-2849d1754569_1584x877.png 848w, https://substackcdn.com/image/fetch/$s_!si0V!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0072b47-3a4c-44aa-b1a7-2849d1754569_1584x877.png 1272w, https://substackcdn.com/image/fetch/$s_!si0V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0072b47-3a4c-44aa-b1a7-2849d1754569_1584x877.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">How Memory-R+ works. Source: [8].</figcaption></figure></div><p><strong>Exploit Reward</strong></p><p>The success memory provides a set of retrieved successful reasoning traces <em>B</em>. 
Their embeddings are averaged to form a centroid:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;c(M_s, q) = \\frac{1}{|B(M_s, q)|} \\sum_{a_j \\in B(M_s, q)} a_j\n&quot;,&quot;id&quot;:&quot;XZVMALXUOZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>A generated response <em>a</em> earns a higher reward when its embedding is closer to this centroid:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;R_{\\text{exploit}}(q, a) = -\\left\\| a - c(M_s, q) \\right\\|_2\n&quot;,&quot;id&quot;:&quot;FKEUYHDWCU&quot;}" data-component-name="LatexBlockToDOM"></div><p>This encourages generalizable reasoning patterns, not rote memorization.<br>It captures structure like &#8220;identify quantities &#8594; set up equation &#8594; solve&#8221;, even if the surface text differs.</p><p><strong>Explore Reward</strong></p><p>Failure memory provides a set of incorrect reasoning traces. The authors measure novelty as one minus the highest cosine similarity (CS) between the response and any retrieved failure:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;R_{\\text{explore}}(q, a) = \n1 - \\max_{a_j \\in B(M_f, q)} \\mathrm{CS}(a, a_j)&quot;,&quot;id&quot;:&quot;XIEIKNCPWH&quot;}" data-component-name="LatexBlockToDOM"></div><p>If the model repeats a failed reasoning pattern, its reward goes down. If it proposes a novel direction, the reward goes up. 
This creates a natural curriculum:</p><ul><li><p>Early: encourages broad exploration (most reasoning is wrong).</p></li><li><p>Later: focuses on fine-grained distinctions between near-correct and correct reasoning.</p></li></ul><p>Finally, each intrinsic reward is min-max normalized over a sliding window, and the two normalized rewards are combined with weights <em>&#946;<sub>s</sub></em> and <em>&#946;<sub>e</sub></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;R_{\\text{mem}} = \n\\beta_s \\, \\hat{R}_{\\text{exploit}} \n+ \n\\beta_e \\, \\hat{R}_{\\text{explore}}&quot;,&quot;id&quot;:&quot;ZDPKWWFXYM&quot;}" data-component-name="LatexBlockToDOM"></div><p>Memory-R+ improves results across many tiny LLMs, including one with Thinking Mode (Qwen3-0.6B):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C5qV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8e3b4-7c15-466b-8818-85f4bb2215c9_846x241.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C5qV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8e3b4-7c15-466b-8818-85f4bb2215c9_846x241.png 424w, https://substackcdn.com/image/fetch/$s_!C5qV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8e3b4-7c15-466b-8818-85f4bb2215c9_846x241.png 848w, https://substackcdn.com/image/fetch/$s_!C5qV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8e3b4-7c15-466b-8818-85f4bb2215c9_846x241.png 1272w, 
https://substackcdn.com/image/fetch/$s_!C5qV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8e3b4-7c15-466b-8818-85f4bb2215c9_846x241.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C5qV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8e3b4-7c15-466b-8818-85f4bb2215c9_846x241.png" width="846" height="241" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92d8e3b4-7c15-466b-8818-85f4bb2215c9_846x241.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:241,&quot;width&quot;:846,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45439,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8e3b4-7c15-466b-8818-85f4bb2215c9_846x241.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C5qV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8e3b4-7c15-466b-8818-85f4bb2215c9_846x241.png 424w, https://substackcdn.com/image/fetch/$s_!C5qV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8e3b4-7c15-466b-8818-85f4bb2215c9_846x241.png 848w, https://substackcdn.com/image/fetch/$s_!C5qV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8e3b4-7c15-466b-8818-85f4bb2215c9_846x241.png 1272w, 
https://substackcdn.com/image/fetch/$s_!C5qV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d8e3b4-7c15-466b-8818-85f4bb2215c9_846x241.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Training Instability and Collapse in LLMs</strong></p><p>The paper also analyzes the training collapse issues with RL post-training. LLMs often over-optimize simpler rewards, such as format rewards, at the expense of correctness, a phenomenon known as <strong>reward mode collapse</strong>. 
Models without intrinsic rewards focus on easy metrics, while Memory-R+ balances exploitation (aligning with successful reasoning) and exploration (avoiding past failures), preventing collapse.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B-nJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff44f6bb5-ddfd-410a-a9d3-ce3350c09147_1375x479.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B-nJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff44f6bb5-ddfd-410a-a9d3-ce3350c09147_1375x479.png 424w, https://substackcdn.com/image/fetch/$s_!B-nJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff44f6bb5-ddfd-410a-a9d3-ce3350c09147_1375x479.png 848w, https://substackcdn.com/image/fetch/$s_!B-nJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff44f6bb5-ddfd-410a-a9d3-ce3350c09147_1375x479.png 1272w, https://substackcdn.com/image/fetch/$s_!B-nJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff44f6bb5-ddfd-410a-a9d3-ce3350c09147_1375x479.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B-nJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff44f6bb5-ddfd-410a-a9d3-ce3350c09147_1375x479.png" width="1375" height="479" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f44f6bb5-ddfd-410a-a9d3-ce3350c09147_1375x479.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:479,&quot;width&quot;:1375,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:203644,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff44f6bb5-ddfd-410a-a9d3-ce3350c09147_1375x479.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B-nJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff44f6bb5-ddfd-410a-a9d3-ce3350c09147_1375x479.png 424w, https://substackcdn.com/image/fetch/$s_!B-nJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff44f6bb5-ddfd-410a-a9d3-ce3350c09147_1375x479.png 848w, https://substackcdn.com/image/fetch/$s_!B-nJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff44f6bb5-ddfd-410a-a9d3-ce3350c09147_1375x479.png 1272w, https://substackcdn.com/image/fetch/$s_!B-nJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff44f6bb5-ddfd-410a-a9d3-ce3350c09147_1375x479.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Response length collapse</strong> is another issue: models either under-generate (too short) or over-generate (too long) sequences, producing meaningless outputs. Length-based rewards like Cosine can worsen this. 
Memory-R+ stabilizes training by providing dense memory-based feedback, ensuring reasonable response lengths while improving correctness.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eagn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb6b2f2b-4500-4af4-8c97-38235dd2776a_1388x571.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eagn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb6b2f2b-4500-4af4-8c97-38235dd2776a_1388x571.png 424w, https://substackcdn.com/image/fetch/$s_!eagn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb6b2f2b-4500-4af4-8c97-38235dd2776a_1388x571.png 848w, https://substackcdn.com/image/fetch/$s_!eagn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb6b2f2b-4500-4af4-8c97-38235dd2776a_1388x571.png 1272w, https://substackcdn.com/image/fetch/$s_!eagn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb6b2f2b-4500-4af4-8c97-38235dd2776a_1388x571.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eagn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb6b2f2b-4500-4af4-8c97-38235dd2776a_1388x571.png" width="1388" height="571" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db6b2f2b-4500-4af4-8c97-38235dd2776a_1388x571.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:571,&quot;width&quot;:1388,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:224027,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb6b2f2b-4500-4af4-8c97-38235dd2776a_1388x571.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eagn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb6b2f2b-4500-4af4-8c97-38235dd2776a_1388x571.png 424w, https://substackcdn.com/image/fetch/$s_!eagn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb6b2f2b-4500-4af4-8c97-38235dd2776a_1388x571.png 848w, https://substackcdn.com/image/fetch/$s_!eagn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb6b2f2b-4500-4af4-8c97-38235dd2776a_1388x571.png 1272w, https://substackcdn.com/image/fetch/$s_!eagn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb6b2f2b-4500-4af4-8c97-38235dd2776a_1388x571.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>While Memory-R+ effectively stabilizes tiny LLM training through episodic memory and intrinsic rewards, it primarily operates at the response level, assigning rewards based on entire reasoning chains or key intermediate patterns. However, recent research has highlighted the benefits of dense process rewards that provide feedback at each reasoning step, rather than only at the outcome. </p><blockquote><p>&#128064; Token- or step-level feedback allows the model to understand which intermediate decisions are correct, improving training efficiency and reasoning fidelity. </p></blockquote><p>Despite their advantages, dense rewards are rarely deployed at scale in RL for LLMs. 
The challenges are threefold:</p><p>&#10060; Defining process rewards: Assigning meaningful credit to intermediate steps is non-trivial, as some seemingly &#8220;incorrect&#8221; steps may contribute indirectly to a correct final answer.</p><p>&#10060; Scalability of online updates: Updating dense reward models (process reward models, PRMs) online requires frequent retraining on step-level labels, which is costly and often infeasible.</p><p>&#10060; Extra modeling cost: Conventional dense reward methods require separate reward models trained with expensive annotations, adding significant overhead.</p><p>The &#128073;<strong>PRIME</strong> [9] framework (Process Reinforcement through Implicit Rewards) offers an elegant solution. Instead of requiring step-level labels, PRIME uses implicit process reward modeling to derive token-level dense rewards from standard outcome labels. Essentially, a single reward model infers a dense reward for each step and is itself updated online with policy rollouts. This approach:</p><p>&#10004;&#65039; Provides fine-grained, step-level feedback for improved credit assignment.</p><p>&#10004;&#65039; Reduces reward sparsity without additional annotation cost.</p><p>&#10004;&#65039; Mitigates reward hacking by updating the reward model along with the policy, maintaining alignment between the model and its reward signal.</p><p>The Implicit PRM is trained only with outcome-level labels but can produce token-level dense rewards. 
To be specific, the reward model <em>&#960;<sub>&#981;</sub></em> is trained with the outcome reward:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LGhm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07f35432-de8b-4e96-9e4c-97bef80795ea_729x125.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LGhm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07f35432-de8b-4e96-9e4c-97bef80795ea_729x125.png 424w, https://substackcdn.com/image/fetch/$s_!LGhm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07f35432-de8b-4e96-9e4c-97bef80795ea_729x125.png 848w, https://substackcdn.com/image/fetch/$s_!LGhm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07f35432-de8b-4e96-9e4c-97bef80795ea_729x125.png 1272w, https://substackcdn.com/image/fetch/$s_!LGhm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07f35432-de8b-4e96-9e4c-97bef80795ea_729x125.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LGhm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07f35432-de8b-4e96-9e4c-97bef80795ea_729x125.png" width="250" height="42.86694101508916" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07f35432-de8b-4e96-9e4c-97bef80795ea_729x125.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:125,&quot;width&quot;:729,&quot;resizeWidth&quot;:250,&quot;bytes&quot;:24346,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07f35432-de8b-4e96-9e4c-97bef80795ea_729x125.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LGhm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07f35432-de8b-4e96-9e4c-97bef80795ea_729x125.png 424w, https://substackcdn.com/image/fetch/$s_!LGhm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07f35432-de8b-4e96-9e4c-97bef80795ea_729x125.png 848w, https://substackcdn.com/image/fetch/$s_!LGhm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07f35432-de8b-4e96-9e4c-97bef80795ea_729x125.png 1272w, https://substackcdn.com/image/fetch/$s_!LGhm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07f35432-de8b-4e96-9e4c-97bef80795ea_729x125.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>We can train it using cross-entropy loss (after applying the sigmoid function to normalize the reward) based on whether the output is correct or not, such that sequences with 
higher outcome rewards should be assigned a higher reward.</p><p>Then, to generate dense rewards during RL post-training, the reward at step <em>t</em> (i.e., for token <em>y<sub>t</sub></em>) is:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_jkG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf957616-bf70-4c1e-928c-0800710afd2a_649x167.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_jkG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf957616-bf70-4c1e-928c-0800710afd2a_649x167.png 424w, https://substackcdn.com/image/fetch/$s_!_jkG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf957616-bf70-4c1e-928c-0800710afd2a_649x167.png 848w, https://substackcdn.com/image/fetch/$s_!_jkG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf957616-bf70-4c1e-928c-0800710afd2a_649x167.png 1272w, https://substackcdn.com/image/fetch/$s_!_jkG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf957616-bf70-4c1e-928c-0800710afd2a_649x167.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_jkG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf957616-bf70-4c1e-928c-0800710afd2a_649x167.png" width="359" height="92.37750385208012" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af957616-bf70-4c1e-928c-0800710afd2a_649x167.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:167,&quot;width&quot;:649,&quot;resizeWidth&quot;:359,&quot;bytes&quot;:23846,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf957616-bf70-4c1e-928c-0800710afd2a_649x167.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_jkG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf957616-bf70-4c1e-928c-0800710afd2a_649x167.png 424w, https://substackcdn.com/image/fetch/$s_!_jkG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf957616-bf70-4c1e-928c-0800710afd2a_649x167.png 848w, https://substackcdn.com/image/fetch/$s_!_jkG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf957616-bf70-4c1e-928c-0800710afd2a_649x167.png 1272w, https://substackcdn.com/image/fetch/$s_!_jkG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf957616-bf70-4c1e-928c-0800710afd2a_649x167.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where:</p><ul><li><p><em>&#960;<sub>&#981;&#8203;</sub></em> = the Implicit PRM (reward model)</p></li><li><p><em>&#960;<sub>ref</sub></em> = reference model (often the initial SFT or base 
LM)</p></li><li><p><em>&#946;</em> = scaling factor</p></li></ul><blockquote><p>&#128064; The reward measures how much more likely the PRM is to generate this token compared to the reference model, which implicitly reflects intermediate correctness or quality at that token.</p></blockquote><p>Because every step now has its own reward, the RL algorithm (not necessarily GRPO) needs a step-level advantage function. Here, the authors propose a leave-one-out (LOO) baseline, resulting in the following advantage for each step <em>t</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2DaI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e094de-5c3f-45f8-9f5e-d8c25bd9c74a_1477x246.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2DaI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e094de-5c3f-45f8-9f5e-d8c25bd9c74a_1477x246.png 424w, https://substackcdn.com/image/fetch/$s_!2DaI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e094de-5c3f-45f8-9f5e-d8c25bd9c74a_1477x246.png 848w, https://substackcdn.com/image/fetch/$s_!2DaI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e094de-5c3f-45f8-9f5e-d8c25bd9c74a_1477x246.png 1272w, https://substackcdn.com/image/fetch/$s_!2DaI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e094de-5c3f-45f8-9f5e-d8c25bd9c74a_1477x246.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!2DaI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e094de-5c3f-45f8-9f5e-d8c25bd9c74a_1477x246.png" width="1456" height="243" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64e094de-5c3f-45f8-9f5e-d8c25bd9c74a_1477x246.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:243,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43763,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e094de-5c3f-45f8-9f5e-d8c25bd9c74a_1477x246.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2DaI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e094de-5c3f-45f8-9f5e-d8c25bd9c74a_1477x246.png 424w, https://substackcdn.com/image/fetch/$s_!2DaI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e094de-5c3f-45f8-9f5e-d8c25bd9c74a_1477x246.png 848w, https://substackcdn.com/image/fetch/$s_!2DaI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e094de-5c3f-45f8-9f5e-d8c25bd9c74a_1477x246.png 1272w, https://substackcdn.com/image/fetch/$s_!2DaI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64e094de-5c3f-45f8-9f5e-d8c25bd9c74a_1477x246.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The advantage consists of the implicit process reward and outcome reward components. With a step-based advantage, they can use the standard step-level PPO as the RL objective:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z1w8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d86af31-0f63-4822-9934-bd42c575aaed_1389x164.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z1w8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d86af31-0f63-4822-9934-bd42c575aaed_1389x164.png 424w, https://substackcdn.com/image/fetch/$s_!z1w8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d86af31-0f63-4822-9934-bd42c575aaed_1389x164.png 848w, https://substackcdn.com/image/fetch/$s_!z1w8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d86af31-0f63-4822-9934-bd42c575aaed_1389x164.png 1272w, https://substackcdn.com/image/fetch/$s_!z1w8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d86af31-0f63-4822-9934-bd42c575aaed_1389x164.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z1w8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d86af31-0f63-4822-9934-bd42c575aaed_1389x164.png" width="1389" height="164" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d86af31-0f63-4822-9934-bd42c575aaed_1389x164.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:164,&quot;width&quot;:1389,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39479,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d86af31-0f63-4822-9934-bd42c575aaed_1389x164.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z1w8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d86af31-0f63-4822-9934-bd42c575aaed_1389x164.png 424w, https://substackcdn.com/image/fetch/$s_!z1w8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d86af31-0f63-4822-9934-bd42c575aaed_1389x164.png 848w, https://substackcdn.com/image/fetch/$s_!z1w8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d86af31-0f63-4822-9934-bd42c575aaed_1389x164.png 1272w, https://substackcdn.com/image/fetch/$s_!z1w8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d86af31-0f63-4822-9934-bd42c575aaed_1389x164.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>&#10060; However, training an additional reward model and using step-level PPO can be computationally expensive. 
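</p><p>To make the implicit process reward and the leave-one-out baseline more concrete, here is a minimal Python sketch. It is a simplified illustration, not the paper's implementation: the function names, the discount <em>gamma</em>, the default value of <em>beta</em>, and the choice to baseline the process rewards with the mean per-response reward of the other samples are all assumptions made for this example. The exact advantage is the one given in the equation above.</p>

```python
def implicit_process_rewards(logp_prm, logp_ref, beta=0.05):
    """Per-token reward: beta * (log pi_phi - log pi_ref).

    logp_prm / logp_ref are the log-probabilities that the implicit PRM and
    the reference model assign to each generated token (hypothetical inputs).
    """
    return [beta * (p - q) for p, q in zip(logp_prm, logp_ref)]


def leave_one_out_baseline(values):
    """For each of the K samples, return the mean of the other K-1 values."""
    k = len(values)
    if k in (0, 1):
        raise ValueError("a leave-one-out baseline needs at least two samples")
    total = sum(values)
    return [(total - v) / (k - 1) for v in values]


def step_advantages(process_rewards, outcome_rewards, gamma=1.0):
    """Combine process and outcome rewards into per-token advantages.

    process_rewards: list of K lists, one list of token rewards per response.
    outcome_rewards: list of K final (e.g. correctness) rewards.
    Assumption: the process baseline for response i is the LOO mean of the
    other responses' average token reward.
    """
    mean_process = [sum(r) / len(r) for r in process_rewards]
    process_baseline = leave_one_out_baseline(mean_process)
    outcome_baseline = leave_one_out_baseline(outcome_rewards)

    advantages = []
    for i, rewards in enumerate(process_rewards):
        outcome_adv = outcome_rewards[i] - outcome_baseline[i]
        running = 0.0
        returns = []
        # Walk backwards to accumulate the discounted reward-to-go.
        for r in reversed(rewards):
            running = (r - process_baseline[i]) + gamma * running
            returns.append(running + outcome_adv)
        advantages.append(returns[::-1])
    return advantages


# Two sampled responses to the same prompt, two tokens each (toy numbers).
rewards = [
    implicit_process_rewards([-1.0, -2.0], [-2.0, -2.0], beta=0.5),
    implicit_process_rewards([-1.0, -1.0], [-1.0, -3.0], beta=0.5),
]
advantages = step_advantages(rewards, outcome_rewards=[1.0, 0.0])
print(advantages)
```

<p>In this toy example, the first response is correct and the second is not, so every token of the first response gets a positive advantage and every token of the second a negative one. The process rewards only modulate how much credit each individual token receives.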
</p><p></p><h4>Optimizing the Training Pipeline</h4><p>So far, most advances in post&#8209;training RL for reasoning optimize either which algorithm to use or how to shape the reward. But there is a third lever: how we structure the training itself, e.g., via curriculum learning. <strong>Curriculum learning</strong> is inspired by human education: instead of exposing a model to the hardest tasks from the start, we begin with easier tasks and progressively increase difficulty. The key task is therefore to estimate the difficulty of each training sample. </p><p>In the &#128073;<strong>AdaRFT</strong> [10] paper, the authors propose to use precomputed difficulty scores for each problem, which can come from human annotations, empirical success rates, or a separate difficulty-estimation model. In their experiments, they use an external LLM as the difficulty estimator. The challenge is then to choose the right estimator, because not all models are suitable for difficulty estimation:</p><ul><li><p>Too strong (e.g., OpenAI o1, DeepSeek-R1): These models solve most problems on the first attempt, leaving little variance across problems. As a result, easy and hard problems cannot be distinguished effectively.</p></li><li><p>Too weak (e.g., LLaMA 3 1B): These models fail on nearly every problem, producing insufficient signal to guide curriculum adaptation.</p></li></ul><p>Thus, they select Qwen 2.5 MATH 7B as the difficulty estimator because it exhibits balanced problem-solving ability: it succeeds on moderately difficult problems but struggles with the most challenging ones. 
Then, for each problem <em>i</em>, the difficulty score <em>d<sub>i</sub></em>&#8203; is defined as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oTse!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1796b849-cbef-42c5-80d8-9b2e6deb1095_1293x143.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oTse!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1796b849-cbef-42c5-80d8-9b2e6deb1095_1293x143.png 424w, https://substackcdn.com/image/fetch/$s_!oTse!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1796b849-cbef-42c5-80d8-9b2e6deb1095_1293x143.png 848w, https://substackcdn.com/image/fetch/$s_!oTse!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1796b849-cbef-42c5-80d8-9b2e6deb1095_1293x143.png 1272w, https://substackcdn.com/image/fetch/$s_!oTse!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1796b849-cbef-42c5-80d8-9b2e6deb1095_1293x143.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oTse!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1796b849-cbef-42c5-80d8-9b2e6deb1095_1293x143.png" width="532" height="58.83681361175561" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1796b849-cbef-42c5-80d8-9b2e6deb1095_1293x143.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:143,&quot;width&quot;:1293,&quot;resizeWidth&quot;:532,&quot;bytes&quot;:27981,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1796b849-cbef-42c5-80d8-9b2e6deb1095_1293x143.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oTse!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1796b849-cbef-42c5-80d8-9b2e6deb1095_1293x143.png 424w, https://substackcdn.com/image/fetch/$s_!oTse!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1796b849-cbef-42c5-80d8-9b2e6deb1095_1293x143.png 848w, https://substackcdn.com/image/fetch/$s_!oTse!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1796b849-cbef-42c5-80d8-9b2e6deb1095_1293x143.png 1272w, https://substackcdn.com/image/fetch/$s_!oTse!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1796b849-cbef-42c5-80d8-9b2e6deb1095_1293x143.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This represents the empirical average accuracy of the estimator on problem <em>i. 
</em>Given these difficulty scores, the goal of the curriculum is to feed the model samples whose difficulty matches its current ability (the target difficulty): not too easy and not too hard. Training starts from an initial target difficulty; as the LLM improves, the target increases, and if performance drops, it decreases. At each step, the model is fine-tuned on the problems closest to the current target difficulty, which keeps progression steady. In RL post-training, it works as follows:</p><ol><li><p>Dynamic Curriculum Sampling</p><ul><li><p>Compute the absolute difference between each problem&#8217;s difficulty and the current target difficulty.</p></li><li><p>Select a batch of problems that are closest to the target, keeping tasks neither too easy nor too hard.</p></li></ul></li><li><p>Policy Update</p><ul><li><p>The policy model generates responses for the selected batch.</p></li><li><p>Compute rewards based on correctness and update the policy using a reinforcement learning algorithm (e.g., PPO, GRPO, REINFORCE++).</p></li></ul></li><li><p>Target Difficulty Update</p><ul><li><p>The average reward over the batch determines whether to increase or decrease the target difficulty.</p></li><li><p>A tanh function keeps the update smooth, and the target difficulty is clipped to the valid range.</p></li></ul></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oBCC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31648bbb-d6a1-4422-8e57-02341cb1d205_946x83.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oBCC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31648bbb-d6a1-4422-8e57-02341cb1d205_946x83.png 424w, 
https://substackcdn.com/image/fetch/$s_!oBCC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31648bbb-d6a1-4422-8e57-02341cb1d205_946x83.png 848w, https://substackcdn.com/image/fetch/$s_!oBCC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31648bbb-d6a1-4422-8e57-02341cb1d205_946x83.png 1272w, https://substackcdn.com/image/fetch/$s_!oBCC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31648bbb-d6a1-4422-8e57-02341cb1d205_946x83.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oBCC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31648bbb-d6a1-4422-8e57-02341cb1d205_946x83.png" width="604" height="52.99365750528541" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31648bbb-d6a1-4422-8e57-02341cb1d205_946x83.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:83,&quot;width&quot;:946,&quot;resizeWidth&quot;:604,&quot;bytes&quot;:19087,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31648bbb-d6a1-4422-8e57-02341cb1d205_946x83.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oBCC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31648bbb-d6a1-4422-8e57-02341cb1d205_946x83.png 
424w, https://substackcdn.com/image/fetch/$s_!oBCC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31648bbb-d6a1-4422-8e57-02341cb1d205_946x83.png 848w, https://substackcdn.com/image/fetch/$s_!oBCC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31648bbb-d6a1-4422-8e57-02341cb1d205_946x83.png 1272w, https://substackcdn.com/image/fetch/$s_!oBCC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31648bbb-d6a1-4422-8e57-02341cb1d205_946x83.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064;  This algorithm introduces many hyperparameters, which can be problematic for tuning.</p></blockquote><p>That said, the results have improved significantly:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xzmR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab63b4e8-9a5a-48bc-83f0-f3249eb73371_1794x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xzmR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab63b4e8-9a5a-48bc-83f0-f3249eb73371_1794x600.png 424w, https://substackcdn.com/image/fetch/$s_!xzmR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab63b4e8-9a5a-48bc-83f0-f3249eb73371_1794x600.png 848w, 
https://substackcdn.com/image/fetch/$s_!xzmR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab63b4e8-9a5a-48bc-83f0-f3249eb73371_1794x600.png 1272w, https://substackcdn.com/image/fetch/$s_!xzmR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab63b4e8-9a5a-48bc-83f0-f3249eb73371_1794x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xzmR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab63b4e8-9a5a-48bc-83f0-f3249eb73371_1794x600.png" width="1456" height="487" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab63b4e8-9a5a-48bc-83f0-f3249eb73371_1794x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:487,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:199135,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab63b4e8-9a5a-48bc-83f0-f3249eb73371_1794x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xzmR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab63b4e8-9a5a-48bc-83f0-f3249eb73371_1794x600.png 424w, 
https://substackcdn.com/image/fetch/$s_!xzmR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab63b4e8-9a5a-48bc-83f0-f3249eb73371_1794x600.png 848w, https://substackcdn.com/image/fetch/$s_!xzmR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab63b4e8-9a5a-48bc-83f0-f3249eb73371_1794x600.png 1272w, https://substackcdn.com/image/fetch/$s_!xzmR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab63b4e8-9a5a-48bc-83f0-f3249eb73371_1794x600.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>In addition to difficulty estimation via external models, there can be a simpler way to define sample difficulty: rely on the length of the question. In the &#128073;<strong>FASTCURL </strong>[11] paper, the researchers analyze the DEEPSEEK-R1-DISTILL-QWEN-1.5B model and discover that longer input prompts generally produce longer output responses:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!36pu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F982cd922-93ba-46fd-a5ee-0feeb0f30093_575x380.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!36pu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F982cd922-93ba-46fd-a5ee-0feeb0f30093_575x380.png 424w, https://substackcdn.com/image/fetch/$s_!36pu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F982cd922-93ba-46fd-a5ee-0feeb0f30093_575x380.png 848w, https://substackcdn.com/image/fetch/$s_!36pu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F982cd922-93ba-46fd-a5ee-0feeb0f30093_575x380.png 1272w, https://substackcdn.com/image/fetch/$s_!36pu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F982cd922-93ba-46fd-a5ee-0feeb0f30093_575x380.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!36pu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F982cd922-93ba-46fd-a5ee-0feeb0f30093_575x380.png" width="447" height="295.4086956521739" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/982cd922-93ba-46fd-a5ee-0feeb0f30093_575x380.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:575,&quot;resizeWidth&quot;:447,&quot;bytes&quot;:41852,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F982cd922-93ba-46fd-a5ee-0feeb0f30093_575x380.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!36pu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F982cd922-93ba-46fd-a5ee-0feeb0f30093_575x380.png 424w, https://substackcdn.com/image/fetch/$s_!36pu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F982cd922-93ba-46fd-a5ee-0feeb0f30093_575x380.png 848w, https://substackcdn.com/image/fetch/$s_!36pu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F982cd922-93ba-46fd-a5ee-0feeb0f30093_575x380.png 1272w, https://substackcdn.com/image/fetch/$s_!36pu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F982cd922-93ba-46fd-a5ee-0feeb0f30093_575x380.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Output-Input length correlation. Source: [13].</figcaption></figure></div><p>This observation motivated a simple but effective hypothesis: For complex reasoning tasks, the complexity of a problem correlates with the length of the solution the model must generate. 
Using this principle, the paper divides the training dataset into three subsets based on average input prompt length:</p><ul><li><p><strong>L1</strong>: Short CoT reasoning problems (simpler tasks,  easy)</p></li><li><p><strong>L2</strong>: The original dataset (baseline tasks, medium)</p></li><li><p><strong>L3</strong>: Long CoT reasoning problems (more complex tasks, hard)</p></li></ul><p>Below are the statistics of the datasets:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X20c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7900efa5-024f-45de-90b1-ede9d3ace327_604x189.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X20c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7900efa5-024f-45de-90b1-ede9d3ace327_604x189.png 424w, https://substackcdn.com/image/fetch/$s_!X20c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7900efa5-024f-45de-90b1-ede9d3ace327_604x189.png 848w, https://substackcdn.com/image/fetch/$s_!X20c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7900efa5-024f-45de-90b1-ede9d3ace327_604x189.png 1272w, https://substackcdn.com/image/fetch/$s_!X20c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7900efa5-024f-45de-90b1-ede9d3ace327_604x189.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!X20c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7900efa5-024f-45de-90b1-ede9d3ace327_604x189.png" width="604" height="189" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7900efa5-024f-45de-90b1-ede9d3ace327_604x189.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:189,&quot;width&quot;:604,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:21642,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7900efa5-024f-45de-90b1-ede9d3ace327_604x189.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X20c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7900efa5-024f-45de-90b1-ede9d3ace327_604x189.png 424w, https://substackcdn.com/image/fetch/$s_!X20c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7900efa5-024f-45de-90b1-ede9d3ace327_604x189.png 848w, https://substackcdn.com/image/fetch/$s_!X20c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7900efa5-024f-45de-90b1-ede9d3ace327_604x189.png 1272w, https://substackcdn.com/image/fetch/$s_!X20c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7900efa5-024f-45de-90b1-ede9d3ace327_604x189.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a></figure></div><p>Therefore, the paper proposes to split training into stages, each pairing a context length with a dataset of matching length (difficulty). The authors test several stage configurations; one that works well uses five training stages:</p><ul><li><p>Context length: 8K, 16K, 24K, 16K, 16K </p></li><li><p>Datasets: L1, L2, L3, L2, L2</p></li></ul><p>The intuition is to first progress from easy to hard (L1 to L3), then return to medium difficulty (L2) until convergence. No theoretical justification is given, but empirically this schedule works best:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X3GW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c404467-08ec-4355-9f84-2d1f1ee65c74_1054x631.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X3GW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c404467-08ec-4355-9f84-2d1f1ee65c74_1054x631.png 424w, https://substackcdn.com/image/fetch/$s_!X3GW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c404467-08ec-4355-9f84-2d1f1ee65c74_1054x631.png 848w, https://substackcdn.com/image/fetch/$s_!X3GW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c404467-08ec-4355-9f84-2d1f1ee65c74_1054x631.png 1272w, https://substackcdn.com/image/fetch/$s_!X3GW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c404467-08ec-4355-9f84-2d1f1ee65c74_1054x631.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!X3GW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c404467-08ec-4355-9f84-2d1f1ee65c74_1054x631.png" width="1054" height="631" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c404467-08ec-4355-9f84-2d1f1ee65c74_1054x631.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:631,&quot;width&quot;:1054,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:134024,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c404467-08ec-4355-9f84-2d1f1ee65c74_1054x631.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X3GW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c404467-08ec-4355-9f84-2d1f1ee65c74_1054x631.png 424w, https://substackcdn.com/image/fetch/$s_!X3GW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c404467-08ec-4355-9f84-2d1f1ee65c74_1054x631.png 848w, https://substackcdn.com/image/fetch/$s_!X3GW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c404467-08ec-4355-9f84-2d1f1ee65c74_1054x631.png 1272w, https://substackcdn.com/image/fetch/$s_!X3GW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c404467-08ec-4355-9f84-2d1f1ee65c74_1054x631.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">FastCuRL performance over training steps. Source: [13]</figcaption></figure></div><p>Despite these good results, two major obstacles limit LLM performance in the standard curriculum setting:</p><p>&#10060; Difficulty Shift: In education science, there is a Difficulty Shift phenomenon, i.e., the learner&#8217;s (here, the model&#8217;s) perception of a problem&#8217;s difficulty changes dynamically as it learns. 
A problem considered &#8220;hard&#8221; initially may become &#8220;easy&#8221; later, making a static difficulty score, as in AdaRFT or FastCuRL, obsolete.</p><p>Motivated by this observation, the paper &#128073;<strong>ADCL+EGSR </strong>[12] optimizes the data progression through dynamic data assignment. It makes two contributions: Adaptive Difficulty Curriculum Learning (ADCL), which improves the curriculum, and Expert-Guided Self-Reformulation (EGSR), which provides training guidance. First, instead of using a static task ordering, ADCL periodically re-estimates the difficulty of upcoming batches based on the model&#8217;s current state:</p><ol><li><p>Initially, difficulty scores <em>&#948;<sub>0</sub></em> are assigned to all samples in the dataset <em>D</em> using the base model parameters <em>&#952;<sub>0</sub></em>.</p></li><li><p>The dataset is sorted by these scores and partitioned into sequential batches <em>B<sub>1</sub>,B<sub>2</sub>,...,B<sub>K</sub></em>.</p></li><li><p>After training on batch <em>B<sub>k</sub></em>, the model parameters are updated to <em>&#952;<sub>k</sub></em>. 
</p></li><li><p>ADCL re-evaluates difficulty scores <em>&#948;<sub>k+1</sub></em> for the next batch&#8217;s samples and re-sorts the batch internally according to the new difficulty estimation before proceeding to the next iteration</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QRs2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517f6681-3d7e-4f62-906e-a3dca9127a9a_1448x237.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QRs2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517f6681-3d7e-4f62-906e-a3dca9127a9a_1448x237.png 424w, https://substackcdn.com/image/fetch/$s_!QRs2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517f6681-3d7e-4f62-906e-a3dca9127a9a_1448x237.png 848w, https://substackcdn.com/image/fetch/$s_!QRs2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517f6681-3d7e-4f62-906e-a3dca9127a9a_1448x237.png 1272w, https://substackcdn.com/image/fetch/$s_!QRs2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517f6681-3d7e-4f62-906e-a3dca9127a9a_1448x237.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QRs2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517f6681-3d7e-4f62-906e-a3dca9127a9a_1448x237.png" width="1448" height="237" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/517f6681-3d7e-4f62-906e-a3dca9127a9a_1448x237.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:237,&quot;width&quot;:1448,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:71754,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517f6681-3d7e-4f62-906e-a3dca9127a9a_1448x237.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QRs2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517f6681-3d7e-4f62-906e-a3dca9127a9a_1448x237.png 424w, https://substackcdn.com/image/fetch/$s_!QRs2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517f6681-3d7e-4f62-906e-a3dca9127a9a_1448x237.png 848w, https://substackcdn.com/image/fetch/$s_!QRs2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517f6681-3d7e-4f62-906e-a3dca9127a9a_1448x237.png 1272w, https://substackcdn.com/image/fetch/$s_!QRs2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517f6681-3d7e-4f62-906e-a3dca9127a9a_1448x237.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Beyond curriculum training, the paper also addresses a limitation of current RL post-training:</p><p>&#10060; On-Policy Exploration Limits: Standard RL post-training methods rely solely on 
self-generated trajectories. While this can improve problem-solving for tasks the model can occasionally solve, it struggles when the model&#8217;s current knowledge is insufficient to produce non-zero reward outputs. This &#8220;zero-reward&#8221; scenario halts learning because the gradient updates vanish, i.e., there is no training signal.</p><p>To address this issue, they propose incorporating trajectories generated by an expert policy, an idea very similar to the aforementioned LUFFY paper. However, they do not follow the standard way of defining importance sampling for on- and off-policy data, shown here:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KuH3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492934ca-a761-44cd-8914-7f70b9b156e5_709x205.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KuH3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492934ca-a761-44cd-8914-7f70b9b156e5_709x205.png 424w, https://substackcdn.com/image/fetch/$s_!KuH3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492934ca-a761-44cd-8914-7f70b9b156e5_709x205.png 848w, https://substackcdn.com/image/fetch/$s_!KuH3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492934ca-a761-44cd-8914-7f70b9b156e5_709x205.png 1272w, https://substackcdn.com/image/fetch/$s_!KuH3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492934ca-a761-44cd-8914-7f70b9b156e5_709x205.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!KuH3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492934ca-a761-44cd-8914-7f70b9b156e5_709x205.png" width="401" height="115.94499294781382" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/492934ca-a761-44cd-8914-7f70b9b156e5_709x205.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:205,&quot;width&quot;:709,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:36866,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492934ca-a761-44cd-8914-7f70b9b156e5_709x205.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KuH3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492934ca-a761-44cd-8914-7f70b9b156e5_709x205.png 424w, https://substackcdn.com/image/fetch/$s_!KuH3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492934ca-a761-44cd-8914-7f70b9b156e5_709x205.png 848w, https://substackcdn.com/image/fetch/$s_!KuH3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492934ca-a761-44cd-8914-7f70b9b156e5_709x205.png 1272w, https://substackcdn.com/image/fetch/$s_!KuH3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F492934ca-a761-44cd-8914-7f70b9b156e5_709x205.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The paper indicates that directly using an expert policy <em>&#960;<sub>&#981;</sub></em>&#8203; is often impractical. The expert policy may be inaccessible or rely on incompatible tokenization, making probability ratio calculations infeasible. Even when available, the distributional mismatch between <em>&#960;<sub>&#981;</sub></em> and the current policy <em>&#960;<sub>&#952;old</sub></em>&#8203; can result in extreme importance weights, leading to unstable updates. So they instead use expert demonstrations to guide the current policy in generating improved trajectories:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pw3P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2d431bb-bbbd-497b-babb-a57d61d645ae_289x79.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pw3P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2d431bb-bbbd-497b-babb-a57d61d645ae_289x79.png 424w, https://substackcdn.com/image/fetch/$s_!Pw3P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2d431bb-bbbd-497b-babb-a57d61d645ae_289x79.png 848w, https://substackcdn.com/image/fetch/$s_!Pw3P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2d431bb-bbbd-497b-babb-a57d61d645ae_289x79.png 1272w, https://substackcdn.com/image/fetch/$s_!Pw3P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2d431bb-bbbd-497b-babb-a57d61d645ae_289x79.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Pw3P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2d431bb-bbbd-497b-babb-a57d61d645ae_289x79.png" width="217" height="59.31833910034602" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2d431bb-bbbd-497b-babb-a57d61d645ae_289x79.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:79,&quot;width&quot;:289,&quot;resizeWidth&quot;:217,&quot;bytes&quot;:5997,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2d431bb-bbbd-497b-babb-a57d61d645ae_289x79.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pw3P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2d431bb-bbbd-497b-babb-a57d61d645ae_289x79.png 424w, https://substackcdn.com/image/fetch/$s_!Pw3P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2d431bb-bbbd-497b-babb-a57d61d645ae_289x79.png 848w, https://substackcdn.com/image/fetch/$s_!Pw3P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2d431bb-bbbd-497b-babb-a57d61d645ae_289x79.png 1272w, https://substackcdn.com/image/fetch/$s_!Pw3P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2d431bb-bbbd-497b-babb-a57d61d645ae_289x79.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a></figure></div><p>Here, the expert demonstration <em>g</em> is used in the prompt. This leads to a different training objective:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V9ru!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80659f-bd50-4524-830e-fc3d49e157d8_2079x846.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!V9ru!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80659f-bd50-4524-830e-fc3d49e157d8_2079x846.png 424w, https://substackcdn.com/image/fetch/$s_!V9ru!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80659f-bd50-4524-830e-fc3d49e157d8_2079x846.png 848w, https://substackcdn.com/image/fetch/$s_!V9ru!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80659f-bd50-4524-830e-fc3d49e157d8_2079x846.png 1272w, https://substackcdn.com/image/fetch/$s_!V9ru!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80659f-bd50-4524-830e-fc3d49e157d8_2079x846.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!V9ru!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80659f-bd50-4524-830e-fc3d49e157d8_2079x846.png" width="1456" height="592" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f80659f-bd50-4524-830e-fc3d49e157d8_2079x846.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:592,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:250687,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80659f-bd50-4524-830e-fc3d49e157d8_2079x846.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!V9ru!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80659f-bd50-4524-830e-fc3d49e157d8_2079x846.png 424w, https://substackcdn.com/image/fetch/$s_!V9ru!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80659f-bd50-4524-830e-fc3d49e157d8_2079x846.png 848w, https://substackcdn.com/image/fetch/$s_!V9ru!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80659f-bd50-4524-830e-fc3d49e157d8_2079x846.png 1272w, https://substackcdn.com/image/fetch/$s_!V9ru!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80659f-bd50-4524-830e-fc3d49e157d8_2079x846.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Here, the offline trajectories are generated by <em>&#960;<sub>&#952;old</sub></em>, so the importance sampling ratio for offline trajectories no longer differs significantly from that for online trajectories. As expected, the advantages also mix the online and offline returns. 
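</p><p>To make the zero-reward problem and the effect of mixing in guided trajectories concrete, here is a minimal sketch. It assumes GRPO-style group-normalized advantages (reward minus the group mean, divided by the group standard deviation), which this post does not spell out for EGSR; the function name <code>group_advantages</code>, the group sizes, and the reward values are hypothetical and chosen only for illustration.</p><pre><code class="language-python">from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps).

    Assumed GRPO-style normalization, used here only for illustration.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four on-policy rollouts that all fail: every reward is zero.
on_policy_rewards = [0.0, 0.0, 0.0, 0.0]
adv_on = group_advantages(on_policy_rewards)

# The same group plus one demonstration-guided rollout that succeeds.
mixed_rewards = on_policy_rewards + [1.0]
adv_mixed = group_advantages(mixed_rewards)

print("on-policy only:", adv_on)
print("with guided rollout:", adv_mixed)
</code></pre><p>In the all-zero group every advantage is exactly zero, so the policy gradient carries no signal. Adding a single guided trajectory with a positive reward shifts the group mean, so the successful rollout gets a positive advantage and the failed ones get negative advantages, which restores a usable training signal.</p><p>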
The performance of the final model, ADCL+EGSR, is better than the standard RL post-training:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZGSa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21916543-54bc-4bca-ad75-a3f8ccad57fa_1159x527.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZGSa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21916543-54bc-4bca-ad75-a3f8ccad57fa_1159x527.png 424w, https://substackcdn.com/image/fetch/$s_!ZGSa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21916543-54bc-4bca-ad75-a3f8ccad57fa_1159x527.png 848w, https://substackcdn.com/image/fetch/$s_!ZGSa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21916543-54bc-4bca-ad75-a3f8ccad57fa_1159x527.png 1272w, https://substackcdn.com/image/fetch/$s_!ZGSa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21916543-54bc-4bca-ad75-a3f8ccad57fa_1159x527.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZGSa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21916543-54bc-4bca-ad75-a3f8ccad57fa_1159x527.png" width="1159" height="527" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21916543-54bc-4bca-ad75-a3f8ccad57fa_1159x527.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:527,&quot;width&quot;:1159,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:111931,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/165059779?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21916543-54bc-4bca-ad75-a3f8ccad57fa_1159x527.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZGSa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21916543-54bc-4bca-ad75-a3f8ccad57fa_1159x527.png 424w, https://substackcdn.com/image/fetch/$s_!ZGSa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21916543-54bc-4bca-ad75-a3f8ccad57fa_1159x527.png 848w, https://substackcdn.com/image/fetch/$s_!ZGSa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21916543-54bc-4bca-ad75-a3f8ccad57fa_1159x527.png 1272w, https://substackcdn.com/image/fetch/$s_!ZGSa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21916543-54bc-4bca-ad75-a3f8ccad57fa_1159x527.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>&#10060; A major limitation of this approach is that ADCL must frequently recompute the difficulty score of every sample, which slows down training.</p><p>&#129504; <em>The question of how to build an efficient adaptive RL curriculum during LLM post-training remains open and calls for continued research [13].</em></p><div><hr></div><h2>References</h2><p>[1] Guo, D., Yang, D., Zhang, H. <em>et al.</em> DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. <em>Nature</em> <strong>645</strong>, 633&#8211;638 (2025). https://doi.org/10.1038/s41586-025-09422-z</p><p>[2] Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 
&#8220;Proximal policy optimization algorithms.&#8221; <em>arXiv preprint arXiv:1707.06347</em> (2017).</p><p>[3] There May Not be Aha Moment in R1-Zero-like Training &#8212; A Pilot Study. https://sail.sea.com/blog/articles/62</p><p>[4] Yu, Q., Zhang, Z., Zhu, R. et al. (2025). DAPO: An open-source LLM reinforcement learning system at scale. <em>NeurIPS 2025</em>.</p><p>[5] Liu, Zichen, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. &#8220;Understanding r1-zero-like training: A critical perspective.&#8221; <em>COLM 2025</em>.</p><p>[6] Yan et al. (2025). Learning to Reason under Off-Policy Guidance. <em>NeurIPS 2025.</em></p><p>[7] Yang, S., Tong, Y., Niu, X., Neubig, G., &amp; Yue, X. (2025). <em>Demystifying Long Chain-of-Thought Reasoning</em>. In <em>Proceedings of the 42nd International Conference on Machine Learning (ICML 2025).</em></p><p>[8] Hung Le, Van Dai Do, Dung Nguyen, and Svetha Venkatesh. <em>Reasoning Under 1 Billion: Memory-Augmented Reinforcement Learning for Large Language Models.</em> Transactions on Machine Learning Research (TMLR), 2025.</p><p>[9] Cui, Ganqu, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li et al. &#8220;Process reinforcement through implicit rewards.&#8221; <em>arXiv preprint arXiv:2502.01456</em> (2025).</p><p>[10] Shi, Taiwei, Yiyang Wu, Linxin Song, Tianyi Zhou, and Jieyu Zhao. &#8220;Efficient reinforcement finetuning via adaptive curriculum learning.&#8221; <em>arXiv preprint arXiv:2504.05520</em> (2025).</p><p>[11] Song, M., Zheng, M., Li, Z., Yang, W., &amp; Luo, X. FastCuRL: Curriculum Reinforcement Learning with Stage-wise Context Scaling for Efficient Training R1-like Reasoning Models. <em>Findings of the Association for Computational Linguistics: EMNLP 2025</em>.</p><p>[12] Zhang, E., Yan, X., Lin, W., Zhang, T., &amp; Lu, Q. 
<em>Learning Like Humans: Advancing LLM Reasoning Capabilities via Adaptive Difficulty Curriculum Learning and Expert-Guided Self-Reformulation</em>.<em> EMNLP 2025</em>.</p><p>[13] Do, Dai, Manh Nguyen, Svetha Venkatesh, and Hung Le. &#8220;SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models.&#8221; <em>arXiv preprint arXiv:2508.05015</em> (2025).</p>]]></content:encoded></item><item><title><![CDATA[Reason on the Fly: How RL Boosts LLM Reasoning On the Spot]]></title><description><![CDATA[Surveying New Frontiers in Reinforcement Learning for Language Models (Part 2)]]></description><link>https://hungleai.substack.com/p/reason-on-the-fly-how-rl-boosts-llm</link><guid isPermaLink="false">https://hungleai.substack.com/p/reason-on-the-fly-how-rl-boosts-llm</guid><dc:creator><![CDATA[Hung Le]]></dc:creator><pubDate>Tue, 03 Jun 2025 08:06:10 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!YoMR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e9cb474-13b5-4219-9528-987b4b4eba8b_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In our last <a href="https://hungleai.substack.com/p/think-before-you-speak-reinforcement">post</a>, we warmed up with why reinforcement learning (RL) is a powerful paradigm for building smarter AI reasoners. Today, we zoom in on an exciting approach: using RL at inference time to improve large language model (LLM) reasoning on the spot. In particular, we explore ways to inject real-time reasoning into static LLMs. Let's break down how RL can transform a frozen LLM into a more dynamic, reasoned thinker at runtime.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YoMR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e9cb474-13b5-4219-9528-987b4b4eba8b_1024x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YoMR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e9cb474-13b5-4219-9528-987b4b4eba8b_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!YoMR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e9cb474-13b5-4219-9528-987b4b4eba8b_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!YoMR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e9cb474-13b5-4219-9528-987b4b4eba8b_1024x1024.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!YoMR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e9cb474-13b5-4219-9528-987b4b4eba8b_1024x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YoMR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e9cb474-13b5-4219-9528-987b4b4eba8b_1024x1024.webp" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e9cb474-13b5-4219-9528-987b4b4eba8b_1024x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Generated image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Generated image" title="Generated image" srcset="https://substackcdn.com/image/fetch/$s_!YoMR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e9cb474-13b5-4219-9528-987b4b4eba8b_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!YoMR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e9cb474-13b5-4219-9528-987b4b4eba8b_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!YoMR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e9cb474-13b5-4219-9528-987b4b4eba8b_1024x1024.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!YoMR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e9cb474-13b5-4219-9528-987b4b4eba8b_1024x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Reason on the Fly: How RL Boosts LLM Reasoning On the Spot. Source: OpenAI&#8217;s Sora. 
</figcaption></figure></div><h4>Table of Contents</h4><ul><li><p><a href="https://hungleai.substack.com/i/163970650/why-test-time-matters">Why Test-Time Matters</a></p></li><li><p><a href="https://hungleai.substack.com/i/163970650/the-first-breakthrough-test-time-compute">The First Breakthrough: Test-Time Compute</a></p></li><li><p><a href="https://hungleai.substack.com/i/163970650/scaling-up-test-time-efforts">Scaling Up Test-Time Efforts</a></p></li><li><p><a href="https://hungleai.substack.com/i/163970650/process-reward-models-defining-whats-good-reasoning">Process Reward Models: Defining What&#8217;s "Good Reasoning"</a></p></li><li><p><a href="https://hungleai.substack.com/i/163970650/search-and-planning-with-rl">Search and Planning with RL</a></p></li><li><p><a href="https://hungleai.substack.com/i/163970650/conclusion-and-whats-next">Conclusion &amp; What&#8217;s Next</a></p></li></ul><div><hr></div><h2>Why Test-Time Matters</h2><p>A long-standing principle in machine learning is the clean separation between training and inference. 
Traditional pipelines front-load all the optimization into training, assuming that, once deployed, models will perform inference in a fixed, feed-forward manner. The model learns once and acts passively forever after.</p><p>In the context of LLMs, a dominant strategy to improve performance has been simple: <strong>scale-up training &#11014;&#65039;</strong>. Larger models, more data, and longer context windows have powered much of the progress to date. And this strategy works until it doesn&#8217;t. As we push toward ever-larger scales, we begin to see diminishing returns: </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dzfo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d3cb2-e019-4a73-a9fd-9cb5fdb87e09_1315x776.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dzfo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d3cb2-e019-4a73-a9fd-9cb5fdb87e09_1315x776.png 424w, https://substackcdn.com/image/fetch/$s_!Dzfo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d3cb2-e019-4a73-a9fd-9cb5fdb87e09_1315x776.png 848w, https://substackcdn.com/image/fetch/$s_!Dzfo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d3cb2-e019-4a73-a9fd-9cb5fdb87e09_1315x776.png 1272w, https://substackcdn.com/image/fetch/$s_!Dzfo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d3cb2-e019-4a73-a9fd-9cb5fdb87e09_1315x776.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Dzfo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d3cb2-e019-4a73-a9fd-9cb5fdb87e09_1315x776.png" width="685" height="404.22813688212926" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c47d3cb2-e019-4a73-a9fd-9cb5fdb87e09_1315x776.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:776,&quot;width&quot;:1315,&quot;resizeWidth&quot;:685,&quot;bytes&quot;:60843,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d3cb2-e019-4a73-a9fd-9cb5fdb87e09_1315x776.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Dzfo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d3cb2-e019-4a73-a9fd-9cb5fdb87e09_1315x776.png 424w, https://substackcdn.com/image/fetch/$s_!Dzfo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d3cb2-e019-4a73-a9fd-9cb5fdb87e09_1315x776.png 848w, https://substackcdn.com/image/fetch/$s_!Dzfo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d3cb2-e019-4a73-a9fd-9cb5fdb87e09_1315x776.png 1272w, https://substackcdn.com/image/fetch/$s_!Dzfo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc47d3cb2-e019-4a73-a9fd-9cb5fdb87e09_1315x776.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">More computing resources are required, but the performance gain is getting smaller. Source: [1]</figcaption></figure></div><p>More parameters don&#8217;t necessarily lead to better reasoning. And more data doesn&#8217;t guarantee generalization, especially to tasks that are unfamiliar or structurally complex. Why? 
Because we&#8217;re running up against a fundamental bottleneck: the data itself.</p><p>As <a href="https://www.reuters.com/technology/artificial-intelligence/ai-with-reasoning-power-will-be-less-predictable-ilya-sutskever-says-2024-12-14/">Ilya Sutskever</a> famously put it, <em>&#8220;We have but one internet.&#8221;</em> That single, finite corpus has already been scraped, filtered, and optimized for training. There is no untapped &#8220;second internet&#8221; to unlock new capabilities. This saturation of training-time gains prompts a natural question: &#129504; <em>What else can we optimize?</em></p><p>A new direction is taking shape: test-time compute. Rather than relying solely on what was learned during pretraining, this paradigm focuses on what can be actively computed during inference. The insight is simple but powerful: not all reasoning can (or should) happen in a single forward pass. Some problems demand more thought. It pays to think longer, search deeper, and adapt on the fly while solving the task.</p><p>This idea isn&#8217;t new in the broader AI landscape. Fields like robotics, planning, and control have long leveraged test-time optimization to dynamically adjust actions, internal states, or policies based on the current situation. In these domains, the agent doesn&#8217;t freeze after training. It continues to optimize in real time, guided by the specifics of the environment or task at hand. &#129504; <em>So why hasn&#8217;t this philosophy fully carried over to LLMs?</em></p><p>Because LLMs are typically treated as static predictors. Once trained, they produce outputs via greedy decoding or sampling, which is efficient but fixed. This works fine for fluent language generation, but it falls short on problems that require multi-step reasoning. Without a mechanism to adapt at test time, LLMs tend to default to surface-level heuristics, even when the task demands deeper computation. 
This is where test-time reinforcement learning comes in, as a promising bridge between passive prediction and active, deliberative reasoning.</p><div><hr></div><h2>The First Breakthrough: Test-Time Compute</h2><p>As pretraining-driven gains begin to plateau, the field has turned to <em>post-training</em> methods to push LLM capabilities further. Techniques like <a href="https://hungleai.substack.com/p/aligning-large-language-models-with">alignment through RLHF</a> and instruction tuning have become standard tools to refine behavior after pretraining. But another powerful and increasingly relevant direction is emerging: <strong>test-time compute (TTC)</strong>.</p><p>Instead of relying solely on what was baked in during training, TTC explores what a model can compute on the fly, at the moment of inference. What if reasoning didn&#8217;t stop when training ended? What if, during deployment, models could think longer, adapt dynamically, and search more strategically to solve hard problems?</p><p>This shift in mindset is already underway. 
OpenAI&#8217;s o1 models mark a pivot toward deeper inference-time reasoning: </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lh1f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c701674-7e25-4469-9f8d-de470026f439_867x364.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Lh1f!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c701674-7e25-4469-9f8d-de470026f439_867x364.gif 424w, https://substackcdn.com/image/fetch/$s_!Lh1f!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c701674-7e25-4469-9f8d-de470026f439_867x364.gif 848w, https://substackcdn.com/image/fetch/$s_!Lh1f!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c701674-7e25-4469-9f8d-de470026f439_867x364.gif 1272w, https://substackcdn.com/image/fetch/$s_!Lh1f!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c701674-7e25-4469-9f8d-de470026f439_867x364.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Lh1f!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c701674-7e25-4469-9f8d-de470026f439_867x364.gif" width="867" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c701674-7e25-4469-9f8d-de470026f439_867x364.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:867,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:&quot;0_sgIW941zM9_11CKk.gif&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="0_sgIW941zM9_11CKk.gif" srcset="https://substackcdn.com/image/fetch/$s_!Lh1f!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c701674-7e25-4469-9f8d-de470026f439_867x364.gif 424w, https://substackcdn.com/image/fetch/$s_!Lh1f!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c701674-7e25-4469-9f8d-de470026f439_867x364.gif 848w, https://substackcdn.com/image/fetch/$s_!Lh1f!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c701674-7e25-4469-9f8d-de470026f439_867x364.gif 1272w, https://substackcdn.com/image/fetch/$s_!Lh1f!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c701674-7e25-4469-9f8d-de470026f439_867x364.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">o1-series models &#8220;think&#8221; longer and slower. <a href="https://felo.ai/blog/how-to-use-free-openai-o1-reasoning-model/">Source</a>. </figcaption></figure></div><p></p><blockquote><p>&#128064; While the exact techniques remain undisclosed, OpenAI has stated that o1 is <em>&#8220;trained with reinforcement learning to perform complex reasoning&#8221;</em> [2]. RL clearly plays a central role. 
</p></blockquote><p>The result is amazing:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WPRT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d08ad02-a45e-43b1-8ed9-8a4345d6f089_2048x796.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WPRT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d08ad02-a45e-43b1-8ed9-8a4345d6f089_2048x796.png 424w, https://substackcdn.com/image/fetch/$s_!WPRT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d08ad02-a45e-43b1-8ed9-8a4345d6f089_2048x796.png 848w, https://substackcdn.com/image/fetch/$s_!WPRT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d08ad02-a45e-43b1-8ed9-8a4345d6f089_2048x796.png 1272w, https://substackcdn.com/image/fetch/$s_!WPRT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d08ad02-a45e-43b1-8ed9-8a4345d6f089_2048x796.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WPRT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d08ad02-a45e-43b1-8ed9-8a4345d6f089_2048x796.png" width="1456" height="566" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d08ad02-a45e-43b1-8ed9-8a4345d6f089_2048x796.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:566,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WPRT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d08ad02-a45e-43b1-8ed9-8a4345d6f089_2048x796.png 424w, https://substackcdn.com/image/fetch/$s_!WPRT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d08ad02-a45e-43b1-8ed9-8a4345d6f089_2048x796.png 848w, https://substackcdn.com/image/fetch/$s_!WPRT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d08ad02-a45e-43b1-8ed9-8a4345d6f089_2048x796.png 1272w, https://substackcdn.com/image/fetch/$s_!WPRT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d08ad02-a45e-43b1-8ed9-8a4345d6f089_2048x796.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">o1 models achieve remarkable results on reasoning benchmarks. Source: [2].</figcaption></figure></div><ul><li><p>On Codeforces, o1 ranks in the 89th percentile&#8212;an unprecedented level of programming ability for a language model.</p></li><li><p>On AIME, a prestigious U.S. 
math competition, o1 ranks among the top 500 students.</p></li><li><p>On GPQA, a benchmark of PhD-level science questions, o1 surpasses human experts in physics, chemistry, and biology.</p></li></ul><p>The results reveal a compelling insight: allocating more computational resources at test time by allowing the model to perform multiple reasoning passes, search over alternatives, or optimize responses can yield improvements comparable to, or even surpassing, those from scaling training-time compute:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mgpD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bd16752-d47d-4c01-b05b-54789c27c0ec_1980x1113.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mgpD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bd16752-d47d-4c01-b05b-54789c27c0ec_1980x1113.png 424w, https://substackcdn.com/image/fetch/$s_!mgpD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bd16752-d47d-4c01-b05b-54789c27c0ec_1980x1113.png 848w, https://substackcdn.com/image/fetch/$s_!mgpD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bd16752-d47d-4c01-b05b-54789c27c0ec_1980x1113.png 1272w, https://substackcdn.com/image/fetch/$s_!mgpD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bd16752-d47d-4c01-b05b-54789c27c0ec_1980x1113.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!mgpD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bd16752-d47d-4c01-b05b-54789c27c0ec_1980x1113.png" width="1456" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1bd16752-d47d-4c01-b05b-54789c27c0ec_1980x1113.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mgpD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bd16752-d47d-4c01-b05b-54789c27c0ec_1980x1113.png 424w, https://substackcdn.com/image/fetch/$s_!mgpD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bd16752-d47d-4c01-b05b-54789c27c0ec_1980x1113.png 848w, https://substackcdn.com/image/fetch/$s_!mgpD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bd16752-d47d-4c01-b05b-54789c27c0ec_1980x1113.png 1272w, https://substackcdn.com/image/fetch/$s_!mgpD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bd16752-d47d-4c01-b05b-54789c27c0ec_1980x1113.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Performance scaling with train-time and test-time compute. Source: [2].</figcaption></figure></div><blockquote><p>&#128064; While the compute costs may appear similar in magnitude, they reflect fundamentally different trade-offs. Training-time compute is a one-time, centralized investment that doesn&#8217;t directly affect the end-user experience. In contrast, test-time compute occurs during inference and consumes resources every time the model is used, placing a computational burden on the user or deployment system.</p></blockquote><p>These achievements aren&#8217;t just about language. They reflect <em>System 2 reasoning</em>: deliberate, multi-step, and strategic. 
Unlike System 1, which governs fast, automatic responses like language completion or basic recall, System 2 engages when tasks demand controlled, sequential thinking, such as solving a math proof, writing code, or evaluating competing hypotheses. A summary of System 1 and System 2 properties is given below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OttT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac6bece-fd7a-440e-97a8-e57170d52420_1100x619.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OttT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac6bece-fd7a-440e-97a8-e57170d52420_1100x619.png 424w, https://substackcdn.com/image/fetch/$s_!OttT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac6bece-fd7a-440e-97a8-e57170d52420_1100x619.png 848w, https://substackcdn.com/image/fetch/$s_!OttT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac6bece-fd7a-440e-97a8-e57170d52420_1100x619.png 1272w, https://substackcdn.com/image/fetch/$s_!OttT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac6bece-fd7a-440e-97a8-e57170d52420_1100x619.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OttT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac6bece-fd7a-440e-97a8-e57170d52420_1100x619.png" width="1100" height="619" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ac6bece-fd7a-440e-97a8-e57170d52420_1100x619.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:619,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OttT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac6bece-fd7a-440e-97a8-e57170d52420_1100x619.png 424w, https://substackcdn.com/image/fetch/$s_!OttT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac6bece-fd7a-440e-97a8-e57170d52420_1100x619.png 848w, https://substackcdn.com/image/fetch/$s_!OttT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac6bece-fd7a-440e-97a8-e57170d52420_1100x619.png 1272w, https://substackcdn.com/image/fetch/$s_!OttT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ac6bece-fd7a-440e-97a8-e57170d52420_1100x619.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">System 1 vs System 2 Overview. <a href="https://medium.com/@cch.chichieh/understanding-reasoning-models-test-time-compute-insights-from-deepseek-r1-d30783070827">Source</a>. </figcaption></figure></div><p>As we can see, System 2 is the more powerful mode when a task calls for deliberate reasoning. However, this deeper reasoning comes at a cost. High TTC is expensive, making such models suitable only for a narrow slice of high-stakes or compute-rich applications.</p><p>This raises two pressing research questions:</p><ol><li><p>&#129504; <em>How can we equip LLMs with the ability to scale their test-time reasoning?</em><br>More forward passes? Search over reasoning chains? External tools? Many strategies exist, but reinforcement learning (RL) offers a particularly principled framework.</p></li><li><p>&#129504; <em>More importantly, how can we make TTC efficient?</em> Inference-time RL can guide the search process efficiently. 
Rather than relying on brute-force reasoning, models can learn when and how to allocate extra thinking time: prioritizing hard problems, rethinking weak outputs, or dynamically adjusting their decoding strategy.</p></li></ol><p>We can organize the approaches to scaling test-time compute by identifying which stages of the LLM inference pipeline offer opportunities for enhancement:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1D04!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4692dd7-a0d1-4fd4-92a0-e4ea2a20c658_1026x526.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1D04!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4692dd7-a0d1-4fd4-92a0-e4ea2a20c658_1026x526.png 424w, https://substackcdn.com/image/fetch/$s_!1D04!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4692dd7-a0d1-4fd4-92a0-e4ea2a20c658_1026x526.png 848w, https://substackcdn.com/image/fetch/$s_!1D04!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4692dd7-a0d1-4fd4-92a0-e4ea2a20c658_1026x526.png 1272w, https://substackcdn.com/image/fetch/$s_!1D04!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4692dd7-a0d1-4fd4-92a0-e4ea2a20c658_1026x526.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1D04!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4692dd7-a0d1-4fd4-92a0-e4ea2a20c658_1026x526.png" width="1026" height="526" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f4692dd7-a0d1-4fd4-92a0-e4ea2a20c658_1026x526.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:526,&quot;width&quot;:1026,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:62830,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4692dd7-a0d1-4fd4-92a0-e4ea2a20c658_1026x526.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1D04!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4692dd7-a0d1-4fd4-92a0-e4ea2a20c658_1026x526.png 424w, https://substackcdn.com/image/fetch/$s_!1D04!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4692dd7-a0d1-4fd4-92a0-e4ea2a20c658_1026x526.png 848w, https://substackcdn.com/image/fetch/$s_!1D04!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4692dd7-a0d1-4fd4-92a0-e4ea2a20c658_1026x526.png 1272w, https://substackcdn.com/image/fetch/$s_!1D04!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4692dd7-a0d1-4fd4-92a0-e4ea2a20c658_1026x526.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>One way to increase test-time compute is through prompting: encouraging the model to &#8220;think longer&#8221; by crafting prompts that trigger deeper reasoning. Beyond prompting, more structured approaches involve fine-tuning the model&#8217;s internal weights or modifying its outputs dynamically. In the following sections, we&#8217;ll investigate the growing ecosystem of test-time reinforcement learning methods, with a focus on the latter strategies, where RL plays a central role in intervening in the output generation process.</p><div><hr></div><h2>Scaling Up Test-Time Efforts</h2><p>Scaling test-time compute can be as simple as prompting the model to reflect more deeply, or as sophisticated as dynamically fine-tuning internal representations and outputs. 
The more sophisticated the approach, the more it engages System 2-style reasoning, which is deliberate, strategic, and multi-step:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tgtc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F629d6aa6-a840-4eb7-b818-fef6c7e6ee4f_1199x553.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Tgtc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F629d6aa6-a840-4eb7-b818-fef6c7e6ee4f_1199x553.png 424w, https://substackcdn.com/image/fetch/$s_!Tgtc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F629d6aa6-a840-4eb7-b818-fef6c7e6ee4f_1199x553.png 848w, https://substackcdn.com/image/fetch/$s_!Tgtc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F629d6aa6-a840-4eb7-b818-fef6c7e6ee4f_1199x553.png 1272w, https://substackcdn.com/image/fetch/$s_!Tgtc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F629d6aa6-a840-4eb7-b818-fef6c7e6ee4f_1199x553.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Tgtc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F629d6aa6-a840-4eb7-b818-fef6c7e6ee4f_1199x553.png" width="1199" height="553" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/629d6aa6-a840-4eb7-b818-fef6c7e6ee4f_1199x553.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:553,&quot;width&quot;:1199,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:287415,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Tgtc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F629d6aa6-a840-4eb7-b818-fef6c7e6ee4f_1199x553.png 424w, https://substackcdn.com/image/fetch/$s_!Tgtc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F629d6aa6-a840-4eb7-b818-fef6c7e6ee4f_1199x553.png 848w, https://substackcdn.com/image/fetch/$s_!Tgtc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F629d6aa6-a840-4eb7-b818-fef6c7e6ee4f_1199x553.png 1272w, https://substackcdn.com/image/fetch/$s_!Tgtc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F629d6aa6-a840-4eb7-b818-fef6c7e6ee4f_1199x553.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">From System 1 to System 2 reasoning techniques. RL is mostly used in the structured, more complex, search-based approaches. Source: [3]</figcaption></figure></div><p>One early attempt to enhance LLM reasoning through structured test-time compute is &#128073;<strong>Tree of Thoughts (ToT)</strong>. 
Building on Chain-of-Thought prompting, ToT treats reasoning as an explicit search process through a space of intermediate &#8220;thoughts.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XKkp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3c5dfb-b963-40cd-9346-f0c9d4a6d63f_1355x679.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XKkp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3c5dfb-b963-40cd-9346-f0c9d4a6d63f_1355x679.png 424w, https://substackcdn.com/image/fetch/$s_!XKkp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3c5dfb-b963-40cd-9346-f0c9d4a6d63f_1355x679.png 848w, https://substackcdn.com/image/fetch/$s_!XKkp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3c5dfb-b963-40cd-9346-f0c9d4a6d63f_1355x679.png 1272w, https://substackcdn.com/image/fetch/$s_!XKkp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3c5dfb-b963-40cd-9346-f0c9d4a6d63f_1355x679.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XKkp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3c5dfb-b963-40cd-9346-f0c9d4a6d63f_1355x679.png" width="1355" height="679" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc3c5dfb-b963-40cd-9346-f0c9d4a6d63f_1355x679.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:679,&quot;width&quot;:1355,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XKkp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3c5dfb-b963-40cd-9346-f0c9d4a6d63f_1355x679.png 424w, https://substackcdn.com/image/fetch/$s_!XKkp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3c5dfb-b963-40cd-9346-f0c9d4a6d63f_1355x679.png 848w, https://substackcdn.com/image/fetch/$s_!XKkp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3c5dfb-b963-40cd-9346-f0c9d4a6d63f_1355x679.png 1272w, https://substackcdn.com/image/fetch/$s_!XKkp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3c5dfb-b963-40cd-9346-f0c9d4a6d63f_1355x679.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">ToT vs other prompting strategies. Source: [4]</figcaption></figure></div><p>Rather than committing to a single reasoning path, ToT asks LLMs to generate multiple candidate thoughts at each step, evaluates them using scoring or voting prompts, and searches through the resulting tree (e.g., via BFS or DFS) to find high-quality solutions.</p><p>The core steps are:</p><ol><li><p>Define what a &#8220;thought&#8221; means for the task using prompting (e.g., via few-shot examples). </p></li><li><p>Generate candidate thoughts using LLMs and prompts.</p></li><li><p>Evaluate their quality using LLMs and prompts.</p></li><li><p>Search for the most promising reasoning path using tree-based search algorithms like BFS or DFS.</p></li></ol><p><strong>Example: The 24 Game</strong></p><p>Consider the classic 24 Game: you're given a set of numbers, and the challenge is to apply arithmetic operations to reach the target number 24. 
The task requires reasoning about which operations to apply, in what order, and how intermediate results interact.</p><ol><li><p>A thought corresponds to one or more steps of applying operations to the input numbers 4, 9, 10, 13. For example, adding the first two numbers gives 4+9=13, leaving 10, 13, 13.</p></li><li><p>Generate several &#8220;thoughts&#8221; using the Propose Prompt (see the figure below, box (a)). This yields several candidate thoughts, e.g., 4+9=13 or 10-4=6.</p></li><li><p>Evaluate each candidate by prompting another LLM with the Value Prompt (see the figure below, box (b)). The LLM decides whether the thought is worth continuing.</p></li><li><p>Run the tree search guided by the LLM evaluations, with each node being a candidate thought. If the LLM judges a node impossible to continue, the search does not expand it. In this case, the &#8220;4+9=13&#8221; node is judged impossible to continue, so the search algorithm goes no further along that branch. Instead, it chooses &#8220;10-4=6&#8221; to continue the generation and search process (this branch can indeed reach 24, e.g., via 13-9=4 and then 6&#215;4=24). 
</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fkdb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85df053e-e48a-44de-af21-c5aa74011046_1149x356.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fkdb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85df053e-e48a-44de-af21-c5aa74011046_1149x356.png 424w, https://substackcdn.com/image/fetch/$s_!Fkdb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85df053e-e48a-44de-af21-c5aa74011046_1149x356.png 848w, https://substackcdn.com/image/fetch/$s_!Fkdb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85df053e-e48a-44de-af21-c5aa74011046_1149x356.png 1272w, https://substackcdn.com/image/fetch/$s_!Fkdb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85df053e-e48a-44de-af21-c5aa74011046_1149x356.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fkdb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85df053e-e48a-44de-af21-c5aa74011046_1149x356.png" width="1149" height="356" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85df053e-e48a-44de-af21-c5aa74011046_1149x356.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:356,&quot;width&quot;:1149,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fkdb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85df053e-e48a-44de-af21-c5aa74011046_1149x356.png 424w, https://substackcdn.com/image/fetch/$s_!Fkdb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85df053e-e48a-44de-af21-c5aa74011046_1149x356.png 848w, https://substackcdn.com/image/fetch/$s_!Fkdb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85df053e-e48a-44de-af21-c5aa74011046_1149x356.png 1272w, https://substackcdn.com/image/fetch/$s_!Fkdb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85df053e-e48a-44de-af21-c5aa74011046_1149x356.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>From this example, we can construct a general framework for test-time scaling of LLMs as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UjR1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b43097-81cf-4ae5-a991-7936cd1ba078_1215x767.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UjR1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b43097-81cf-4ae5-a991-7936cd1ba078_1215x767.png 424w, https://substackcdn.com/image/fetch/$s_!UjR1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b43097-81cf-4ae5-a991-7936cd1ba078_1215x767.png 848w, 
https://substackcdn.com/image/fetch/$s_!UjR1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b43097-81cf-4ae5-a991-7936cd1ba078_1215x767.png 1272w, https://substackcdn.com/image/fetch/$s_!UjR1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b43097-81cf-4ae5-a991-7936cd1ba078_1215x767.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UjR1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b43097-81cf-4ae5-a991-7936cd1ba078_1215x767.png" width="1215" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7b43097-81cf-4ae5-a991-7936cd1ba078_1215x767.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1215,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:162836,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b43097-81cf-4ae5-a991-7936cd1ba078_1215x767.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UjR1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b43097-81cf-4ae5-a991-7936cd1ba078_1215x767.png 424w, 
https://substackcdn.com/image/fetch/$s_!UjR1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b43097-81cf-4ae5-a991-7936cd1ba078_1215x767.png 848w, https://substackcdn.com/image/fetch/$s_!UjR1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b43097-81cf-4ae5-a991-7936cd1ba078_1215x767.png 1272w, https://substackcdn.com/image/fetch/$s_!UjR1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b43097-81cf-4ae5-a991-7936cd1ba078_1215x767.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A general framework for test-time scaling. Source: [5]</figcaption></figure></div><p>The framework has three main components:</p><ul><li><p><strong>Verification</strong>: Evaluates the quality or correctness of generated outputs.</p></li><li><p><strong>Generation / Search</strong>: Produces reasoning candidates through sampling or exploration.</p></li><li><p><strong>Improvement with Feedback</strong>: Refines the model or its outputs using signals from verification or external supervision.</p></li></ul><p>As we shall see, most recent methods aim to enhance one or more of these components. The last one will be discussed in my <a href="https://open.substack.com/pub/hungleai/p/improving-llm-reasoning-with-post?r=3an4d1&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">next blog post</a>. Now, let&#8217;s explore the first component in the next section.</p><div><hr></div><h2>Process Reward Models: Defining What&#8217;s &#8220;Good Reasoning&#8221;</h2><p>For many reasoning-intensive tasks, such as math word problems, code generation, or symbolic logic, the final answer is verifiable. That is, given a question and a proposed solution, we can automatically check whether the answer is correct. This verifiability makes it possible to supervise models during training by rewarding correct outputs, even if we don&#8217;t label every intermediate step.</p><p>However, this introduces a subtle challenge: not all paths to the correct answer are equally good. A model might stumble into the right solution by chance, through memorization, or via logically inconsistent steps. 
If we only reward the final answer, we risk reinforcing spurious or incoherent reasoning.</p><p>This realization has spurred interest in <strong>Process Reward Models (PRMs)</strong>, which aim to evaluate and guide the reasoning trajectories of LLMs, rather than just their final outputs.</p><h4>From Outcome-Based to Process-Based Evaluation</h4><p>Traditional reinforcement learning approaches often focus on outcome-based rewards, evaluating only the correctness of the final answer. However, this approach ignores the reasoning path taken to arrive at that answer, and such paths vary widely in quality. PRMs address this gap by assigning rewards to intermediate reasoning steps, promoting coherent and logical progression throughout the problem-solving process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AAkl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20c7022-cf46-451a-b346-0ddc9c658e31_1003x345.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AAkl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20c7022-cf46-451a-b346-0ddc9c658e31_1003x345.png 424w, https://substackcdn.com/image/fetch/$s_!AAkl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20c7022-cf46-451a-b346-0ddc9c658e31_1003x345.png 848w, https://substackcdn.com/image/fetch/$s_!AAkl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20c7022-cf46-451a-b346-0ddc9c658e31_1003x345.png 1272w, 
https://substackcdn.com/image/fetch/$s_!AAkl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20c7022-cf46-451a-b346-0ddc9c658e31_1003x345.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AAkl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20c7022-cf46-451a-b346-0ddc9c658e31_1003x345.png" width="1003" height="345" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c20c7022-cf46-451a-b346-0ddc9c658e31_1003x345.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:345,&quot;width&quot;:1003,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45509,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20c7022-cf46-451a-b346-0ddc9c658e31_1003x345.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AAkl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20c7022-cf46-451a-b346-0ddc9c658e31_1003x345.png 424w, https://substackcdn.com/image/fetch/$s_!AAkl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20c7022-cf46-451a-b346-0ddc9c658e31_1003x345.png 848w, 
https://substackcdn.com/image/fetch/$s_!AAkl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20c7022-cf46-451a-b346-0ddc9c658e31_1003x345.png 1272w, https://substackcdn.com/image/fetch/$s_!AAkl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc20c7022-cf46-451a-b346-0ddc9c658e31_1003x345.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>While verifying a final answer can often be automated (e.g., is the math answer correct?), supervising the reasoning process itself is 
a much harder problem. There&#8217;s no straightforward way to automatically label whether an intermediate step is valid or logically sound. An early paper named &#128073;&#8220;<strong>Let&#8217;s Verify Step by Step</strong>&#8221; tackles this head-on by manually labeling the reasoning process using human annotators [6]. The researchers asked labelers to evaluate the correctness of each step in the solution paths generated by LLMs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S16l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45156b7e-7760-4648-8239-374603fc74a2_1473x592.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S16l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45156b7e-7760-4648-8239-374603fc74a2_1473x592.png 424w, https://substackcdn.com/image/fetch/$s_!S16l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45156b7e-7760-4648-8239-374603fc74a2_1473x592.png 848w, https://substackcdn.com/image/fetch/$s_!S16l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45156b7e-7760-4648-8239-374603fc74a2_1473x592.png 1272w, https://substackcdn.com/image/fetch/$s_!S16l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45156b7e-7760-4648-8239-374603fc74a2_1473x592.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!S16l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45156b7e-7760-4648-8239-374603fc74a2_1473x592.png" width="1456" height="585" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45156b7e-7760-4648-8239-374603fc74a2_1473x592.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:585,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:193173,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45156b7e-7760-4648-8239-374603fc74a2_1473x592.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S16l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45156b7e-7760-4648-8239-374603fc74a2_1473x592.png 424w, https://substackcdn.com/image/fetch/$s_!S16l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45156b7e-7760-4648-8239-374603fc74a2_1473x592.png 848w, https://substackcdn.com/image/fetch/$s_!S16l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45156b7e-7760-4648-8239-374603fc74a2_1473x592.png 1272w, https://substackcdn.com/image/fetch/$s_!S16l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45156b7e-7760-4648-8239-374603fc74a2_1473x592.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Manual labeling of the reasoning step quality. Source: [6]</figcaption></figure></div><p>This resulted in <strong>PRM800K</strong>, which contains 800K step-level labels annotated across 75K model-generated solutions spanning 12K problems. Each step is labeled for correctness, enabling fine-grained reward modeling for reasoning.</p><p>The researchers then train reward models in two regimes:</p><ul><li><p>Large-scale: They use the labeled dataset to fine-tune GPT-4, aiming to build high-quality outcome and process reward models (ORMs and PRMs). 
While this setup pushed performance to the state of the art, differences in training data made direct comparisons between ORMs and PRMs tricky.</p></li><li><p>Small-scale: To isolate the effect of supervision and enable controlled ablations, they trained smaller models from scratch. In this regime the large model supplies the labels used to train the small models, which lets them probe how process supervision affects reasoning quality under more constrained conditions.</p></li></ul><p>Here, the paper focuses on training reliable reward models rather than running a full RL loop. They evaluate the reward models with the <em>best-of-N</em> approach:</p><ol><li><p>Generate <em>N</em> candidate solutions for each problem.</p></li><li><p>Use the reward model to score the solutions and rank them.</p></li></ol><blockquote><p>&#128064; For a PRM, the solution-level score is a reduction over the step-level scores: either their minimum or their product.</p></blockquote><ol start="3"><li><p>Pick the top-ranked one.</p></li><li><p>Check if the final answer is correct.</p></li></ol><p>The evaluation metric is the fraction of problems where the top-ranked solution is correct. The results reveal that the PRM can detect the wrong steps in reasoning solutions. 
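</p><p>The best-of-N ranking described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the names solution_score, best_of_n, and toy_prm are hypothetical, and the per-step scores are made-up numbers standing in for the output of a trained PRM.</p>

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical per-step PRM scores for two candidate solutions (made-up numbers).
TOY_STEP_SCORES: Dict[str, List[float]] = {
    "solution_a": [0.9, 0.9, 0.5],       # short, with one shaky step
    "solution_b": [0.6, 0.6, 0.6, 0.6],  # longer, uniformly mediocre
}


def solution_score(step_scores: Sequence[float], reduction: str = "min") -> float:
    """Collapse step-level scores into one solution-level score."""
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    if reduction == "min":
        return min(step_scores)
    if reduction == "prod":
        return math.prod(step_scores)
    raise ValueError(f"unknown reduction: {reduction}")


def best_of_n(candidates: Sequence[str],
              score_steps: Callable[[str], Sequence[float]],
              reduction: str = "min") -> str:
    """Return the candidate with the highest reduced score."""
    return max(candidates, key=lambda c: solution_score(score_steps(c), reduction))


def toy_prm(solution_id: str) -> List[float]:
    """Stand-in for a trained PRM: looks up fixed scores instead of running a model."""
    return TOY_STEP_SCORES[solution_id]


if __name__ == "__main__":
    ids = list(TOY_STEP_SCORES)
    print("min reduction picks:", best_of_n(ids, toy_prm, "min"))
    print("prod reduction picks:", best_of_n(ids, toy_prm, "prod"))
```

<p>On these toy scores the two reductions disagree: the minimum prefers solution_b (0.6 versus 0.5), while the product prefers solution_a (0.405 versus roughly 0.13), because the product penalizes solutions with more steps.</p><p>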
Compared to other methods, such as ORM or Majority Voting, PRM scales much better as <em>N</em> increases:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!psY_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577cd568-bfc0-419c-9984-896cf0e301e4_2081x974.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!psY_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577cd568-bfc0-419c-9984-896cf0e301e4_2081x974.png 424w, https://substackcdn.com/image/fetch/$s_!psY_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577cd568-bfc0-419c-9984-896cf0e301e4_2081x974.png 848w, https://substackcdn.com/image/fetch/$s_!psY_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577cd568-bfc0-419c-9984-896cf0e301e4_2081x974.png 1272w, https://substackcdn.com/image/fetch/$s_!psY_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577cd568-bfc0-419c-9984-896cf0e301e4_2081x974.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!psY_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577cd568-bfc0-419c-9984-896cf0e301e4_2081x974.png" width="1456" height="681" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/577cd568-bfc0-419c-9984-896cf0e301e4_2081x974.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:681,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:644786,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577cd568-bfc0-419c-9984-896cf0e301e4_2081x974.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!psY_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577cd568-bfc0-419c-9984-896cf0e301e4_2081x974.png 424w, https://substackcdn.com/image/fetch/$s_!psY_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577cd568-bfc0-419c-9984-896cf0e301e4_2081x974.png 848w, https://substackcdn.com/image/fetch/$s_!psY_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577cd568-bfc0-419c-9984-896cf0e301e4_2081x974.png 1272w, https://substackcdn.com/image/fetch/$s_!psY_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F577cd568-bfc0-419c-9984-896cf0e301e4_2081x974.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>&#10060; One major bottleneck for training PRMs is the need for high-quality, step-level annotations, especially in domains like math, where reasoning is complex and annotators need domain expertise.</p><p>&#10060; Human labeling at this level of granularity is expensive and time-consuming, making it difficult to scale PRM training for practical use.</p><p>Therefore, recent papers propose automatic process supervision frameworks that reduce reliance on manual labeling. For example, in the &#128073;<strong>MATH-SHEPHERD</strong> paper [7], the researchers define a step's quality by how likely it is to lead to the correct final answer. 
Intuitively, here's how it works:</p><ul><li><p>Start with a math problem and its ground-truth final answer.</p></li><li><p>From an intermediate step in a candidate solution, decode multiple possible future reasoning paths using a fine-tuned LLM.</p></li><li><p>Check how often these paths arrive at the correct final answer.</p></li><li><p>The more reliably a step leads to correct answers, the higher its correctness score.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7Xpu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f55781-5ef3-49fc-879a-359d84de63bb_947x388.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7Xpu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f55781-5ef3-49fc-879a-359d84de63bb_947x388.png 424w, https://substackcdn.com/image/fetch/$s_!7Xpu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f55781-5ef3-49fc-879a-359d84de63bb_947x388.png 848w, https://substackcdn.com/image/fetch/$s_!7Xpu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f55781-5ef3-49fc-879a-359d84de63bb_947x388.png 1272w, https://substackcdn.com/image/fetch/$s_!7Xpu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f55781-5ef3-49fc-879a-359d84de63bb_947x388.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!7Xpu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f55781-5ef3-49fc-879a-359d84de63bb_947x388.png" width="947" height="388" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5f55781-5ef3-49fc-879a-359d84de63bb_947x388.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:388,&quot;width&quot;:947,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7Xpu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f55781-5ef3-49fc-879a-359d84de63bb_947x388.png 424w, https://substackcdn.com/image/fetch/$s_!7Xpu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f55781-5ef3-49fc-879a-359d84de63bb_947x388.png 848w, https://substackcdn.com/image/fetch/$s_!7Xpu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f55781-5ef3-49fc-879a-359d84de63bb_947x388.png 1272w, https://substackcdn.com/image/fetch/$s_!7Xpu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f55781-5ef3-49fc-879a-359d84de63bb_947x388.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Automatically assigning rewards to the reasoning steps. Source: [7]</figcaption></figure></div><p>In particular, given a reasoning step <em>s<sub>i</sub></em>, they generate a set of <em>N</em> completions (i.e., possible continuations of reasoning) using a fine-tuned LLM:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\left\\{ (s_{i+1,j}, \\dots, s_{K_j,j}, a_j) \\right\\}_{j=1}^{N}\n&quot;,&quot;id&quot;:&quot;EYESQRMSFX&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>a<sub>j</sub></em> is the final answer reached in the <em>j</em>-th continuation. 
The set of all decoded answers is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;A = \\{ a_j \\}_{j=1}^N\n&quot;,&quot;id&quot;:&quot;WTSYHJEKJV&quot;}" data-component-name="LatexBlockToDOM"></div><p>To score the quality <em>y<sub>s<sub>i</sub></sub></em> of step <em>s<sub>i</sub></em>, the authors consider two estimation strategies:</p><ul><li><p>Hard Estimation (HE): A step is labeled "good" (reward = 1) if at least one of its completions reaches the correct answer <em>a</em>&#8727;:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y_{s_i}^{\\text{HE}} =\n\\begin{cases}\n1 &amp; \\text{if } \\exists a_j \\in A \\text{ such that } a_j = a^* \\\\\n0 &amp; \\text{otherwise}\n\\end{cases}&quot;,&quot;id&quot;:&quot;SFTEAEWMED&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p>Soft Estimation (SE): The step score is the fraction of completions that arrive at the correct answer:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y_{s_i}^{\\text{SE}} = \\frac{1}{N} \\sum_{j=1}^{N} \\mathbb{I}(a_j = a^*)\n&quot;,&quot;id&quot;:&quot;KOXZOVYEWL&quot;}" data-component-name="LatexBlockToDOM"></div><h4>How to Train a Process Reward Model (PRM)</h4><p>Formally, a PRM maps a problem <em>P</em> and a reasoning step <em>S</em> to a (positive) reward <em>R</em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{PRM}: P \\times S \\rightarrow \\mathbb{R}_{+}\n&quot;,&quot;id&quot;:&quot;ZRCOWDZNLN&quot;}" data-component-name="LatexBlockToDOM"></div><p>The training objective treats step-level scoring as a binary classification problem. For a solution with <em>K</em> reasoning steps <em>{s<sub>1</sub>,s<sub>2</sub>,&#8230;,s<sub>K</sub>}</em>, the PRM learns to assign scores <em>r<sub>s<sub>i</sub></sub>&#8712;[0,1]</em> indicating the quality of each step: a score above 0.5 counts as good, anything else as bad. 
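</p><p>The targets for this classifier come from the two estimators defined earlier. As a minimal sketch, assuming final answers are compared by exact string match (a real pipeline would likely normalize answers first), the hard and soft labels for one step can be computed from its rollout answers as follows. The function names hard_estimate and soft_estimate are hypothetical, and the rollout answers are made-up.</p>

```python
from typing import Sequence


def hard_estimate(answers: Sequence[str], gold: str) -> int:
    """HE: 1 if at least one rollout reaches the gold answer, else 0."""
    return int(any(a == gold for a in answers))


def soft_estimate(answers: Sequence[str], gold: str) -> float:
    """SE: fraction of rollouts that reach the gold answer."""
    if not answers:
        raise ValueError("need at least one rollout")
    return sum(a == gold for a in answers) / len(answers)


if __name__ == "__main__":
    # Made-up final answers from N = 4 continuations of the same step.
    rollout_answers = ["42", "41", "42", "7"]
    print("HE label:", hard_estimate(rollout_answers, "42"))
    print("SE label:", soft_estimate(rollout_answers, "42"))
```

<p>With four rollouts of which two reach the gold answer, the hard label is 1 and the soft label is 0.5.</p><p>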
The binary cross-entropy loss is summed over all steps:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\text{PRM}} = -\\sum_{i=1}^K \\left[ y_{s_i} \\log r_{s_i} + (1 - y_{s_i}) \\log (1 - r_{s_i}) \\right]\n&quot;,&quot;id&quot;:&quot;PECOLUVRSO&quot;}" data-component-name="LatexBlockToDOM"></div><p>where:</p><ul><li><p><em>y<sub>s<sub>i</sub></sub>&#8712;{0,1}</em> is the target label indicating whether step <em>s<sub>i</sub></em> is good (under Soft Estimation the target is a fraction in [0,1] instead),</p></li><li><p><em>r<sub>s<sub>i</sub></sub></em> is the PRM's predicted score (typically after applying a sigmoid).</p></li></ul><p>The training results show that Soft Estimation is better than Hard Estimation:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ngOr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d31e8-0420-4b76-80cb-17b401d98097_613x479.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ngOr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d31e8-0420-4b76-80cb-17b401d98097_613x479.png 424w, https://substackcdn.com/image/fetch/$s_!ngOr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d31e8-0420-4b76-80cb-17b401d98097_613x479.png 848w, https://substackcdn.com/image/fetch/$s_!ngOr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d31e8-0420-4b76-80cb-17b401d98097_613x479.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ngOr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d31e8-0420-4b76-80cb-17b401d98097_613x479.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ngOr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d31e8-0420-4b76-80cb-17b401d98097_613x479.png" width="377" height="294.58890701468187" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/712d31e8-0420-4b76-80cb-17b401d98097_613x479.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:479,&quot;width&quot;:613,&quot;resizeWidth&quot;:377,&quot;bytes&quot;:57952,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d31e8-0420-4b76-80cb-17b401d98097_613x479.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ngOr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d31e8-0420-4b76-80cb-17b401d98097_613x479.png 424w, https://substackcdn.com/image/fetch/$s_!ngOr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d31e8-0420-4b76-80cb-17b401d98097_613x479.png 848w, 
https://substackcdn.com/image/fetch/$s_!ngOr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d31e8-0420-4b76-80cb-17b401d98097_613x479.png 1272w, https://substackcdn.com/image/fetch/$s_!ngOr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F712d31e8-0420-4b76-80cb-17b401d98097_613x479.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Once we&#8217;ve trained a PRM to score each reasoning step, we can take the minimum score across all steps in a solution as the ranking 
score in the <em>best-of-N</em> approach. The authors go further by integrating self-consistency with reward models:</p><ol><li><p>First, generate multiple candidate solutions.</p></li><li><p>Group these solutions by their predicted final answer.</p></li><li><p>Within each group, sum their PRM scores.</p></li><li><p>Finally, select the answer whose group has the highest total, so the winner is not just the most frequent answer but the one supported by high reward scores.</p></li></ol><p>Formally, the final answer chosen under this combined scoring strategy is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;a_{\\text{sc+rm}} = \\arg\\max_{a} \\sum_{i=1}^{N} \\mathbb{I}(a_i = a) \\cdot \\text{PRM}(p, S_i)\n&quot;,&quot;id&quot;:&quot;KOADZPWAYX&quot;}" data-component-name="LatexBlockToDOM"></div><p>More importantly, combining LLM fine-tuning with the test-time strategy above (SHEPHERD) yields impressive results:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Atuq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07eb74c8-f77c-4a61-8413-19f8073cd2d8_1693x679.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Atuq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07eb74c8-f77c-4a61-8413-19f8073cd2d8_1693x679.png 424w, https://substackcdn.com/image/fetch/$s_!Atuq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07eb74c8-f77c-4a61-8413-19f8073cd2d8_1693x679.png 848w, 
https://substackcdn.com/image/fetch/$s_!Atuq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07eb74c8-f77c-4a61-8413-19f8073cd2d8_1693x679.png 1272w, https://substackcdn.com/image/fetch/$s_!Atuq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07eb74c8-f77c-4a61-8413-19f8073cd2d8_1693x679.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Atuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07eb74c8-f77c-4a61-8413-19f8073cd2d8_1693x679.png" width="1456" height="584" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07eb74c8-f77c-4a61-8413-19f8073cd2d8_1693x679.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:584,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:154236,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07eb74c8-f77c-4a61-8413-19f8073cd2d8_1693x679.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Atuq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07eb74c8-f77c-4a61-8413-19f8073cd2d8_1693x679.png 424w, 
https://substackcdn.com/image/fetch/$s_!Atuq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07eb74c8-f77c-4a61-8413-19f8073cd2d8_1693x679.png 848w, https://substackcdn.com/image/fetch/$s_!Atuq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07eb74c8-f77c-4a61-8413-19f8073cd2d8_1693x679.png 1272w, https://substackcdn.com/image/fetch/$s_!Atuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07eb74c8-f77c-4a61-8413-19f8073cd2d8_1693x679.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>However, treating PRM training as a classification problem, i.e., labeling each step <em>s<sub>i</sub></em> as either "correct" or "incorrect", has clear limitations:</p><p>&#10060; It overlooks the structure of the trajectory, including the dependencies among steps, the cascading effect of earlier errors, and the varying importance of each step in solving the overall task.</p><p>&#10060; Each state is treated in isolation, leading to reward assignments that are often misaligned with the actual contribution of a step toward reaching a correct final answer.</p><p>To address this, recent work introduces a new perspective: &#128073;<strong>Process Q-value Modeling (PQM) [8]</strong>. Instead of predicting correctness as a binary label, PQM frames the reasoning process as a Markov Decision Process (MDP) and aims to estimate the Q-value, a score that reflects how likely a particular action (a reasoning step) taken from a given state is to lead to a correct final answer.</p><p>In this framework:</p><ul><li><p>The <strong>state</strong> <em>s<sub>h</sub></em>=(x,a<sub>1:h&#8722;1</sub>) represents the prompt <em>x</em> and prior steps <em>a<sub>1:h&#8722;1</sub></em>.</p></li><li><p>The <strong>action</strong> <em>a<sub>h</sub></em> is the generated reasoning at step <em>h</em>.</p></li><li><p>The <strong>policy</strong> <em>&#960;(a<sub>h</sub>&#8739;s<sub>h</sub>)</em> represents the LLM.</p></li><li><p>The <strong>Q-value</strong> <em>Q(s<sub>h</sub>, a<sub>h</sub>)</em> estimates how promising this step is in achieving the correct final answer.</p></li></ul><p>Here, the Q-value implicitly defines a process reward: good steps raise the probability of success, and bad ones diminish it. This leads to a <strong>comparative training objective</strong>, where the PRM learns to rank better trajectories (or sub-trajectories) higher than weaker ones. 
Crucially, this also recovers classification-based PRMs as a special case when only terminal correctness is available and steps are independent.</p><p>The paper offers several theoretical results:</p><ul><li><p>The Q-value for a reasoning step can be defined as the <em><a href="https://en.wikipedia.org/wiki/Logit">inverse sigmoid</a></em> of the probability that a trajectory leads to a correct final answer, under a given policy.</p></li><li><p>Because the inverse sigmoid is monotone, the Q-value tracks the likelihood that a partial action sequence will result in a correct answer, making it a natural fit for evaluating intermediate reasoning steps.</p></li><li><p>Under deterministic settings, the advantage function (a temporal difference of values) can be treated as a reward function, by potential-based reward shaping.</p></li><li><p>Among correct steps, later ones have higher Q-values; among incorrect steps, later ones have lower Q-values.</p></li><li><p>The Q-value of the first correct step is greater than that of the initial state, which is in turn greater than that of the first incorrect step.</p></li><li><p><strong>Final finding:</strong> Sorted from lowest to highest, the Q-values run from the last incorrect step up to the first incorrect step, then the initial state, then the correct steps in order of occurrence. This establishes a clear separation between correct and incorrect reasoning steps.</p></li></ul><p>A central goal in training PQM with rankings is not just to distinguish between correct and incorrect actions but to reflect the magnitude of the difference between them. While the classical <a href="https://hturner.github.io/PlackettLuce/articles/Overview.html">Plackett-Luce (PL)</a> model provides a way to capture ranking relationships, it falls short in this setting, where Q-value gaps between correct and incorrect steps can be highly uneven and consequential. 
Therefore, the authors propose a Q-margin-based comparative loss that is sensitive to Q-value differences, capturing not only <em>which step is better</em> but also <em>how much better</em>.</p><p>Based on the theoretical findings above, the optimal Q-values induce a strict ordering:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q^*_{w_{|W|}} < \\cdots < Q^*_{w_2} < Q^*_{w_1} \\ll Q^*_0 < Q^*_{c_1} < Q^*_{c_2} < \\cdots < Q^*_{c_{|C|}}\n&quot;,&quot;id&quot;:&quot;QKBODXBJTV&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>W</em> and <em>C</em> denote the sets of wrong and correct step indices, respectively. To enforce this structure during training, the authors define the loss:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\text{theorem}} = -\\frac{1}{H} \\left[\n\\sum_{t=2}^{|W|} \\log \\frac{\\exp(Q_{w_t})}{\\sum_{q=1}^{t} \\exp(Q_{w_q})} +\n\\sum_{t=0}^{|C|} \\log \\frac{\\exp(Q_{c_t})}{\\sum_{q=0}^{t} \\exp(Q_{c_q}) + \\sum_{w \\in W} \\exp(Q_w + \\zeta)}\n\\right]&quot;,&quot;id&quot;:&quot;DSUAEGSFPR&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, &#950; is a margin hyperparameter that magnifies the penalty for incorrect steps by adding a positive offset to their Q-values before exponentiation. The two terms work as follows:</p><ul><li><p>The first term models the relative ranking among incorrect steps <em>W</em>. A PL-style ranking penalizes cases where a worse incorrect step receives a higher Q-value than a better one. Starting the sum at <em>t=2</em> means each step is compared against all higher-ranked (i.e., better, smaller-index) incorrect steps. For example, if the model wrongly estimates Q<sub>w2</sub>&gt;Q<sub>w1</sub>, the loss is greater, so minimizing it pushes the estimates back toward the correct order.</p></li><li><p>The second term compares each correct step in <em>C</em> with all incorrect steps and with the &#8220;less correct&#8221; (earlier) steps. 
The margin &#950; encourages a clear separation between correct and incorrect actions, not just in rank but in actual Q-value magnitude.</p></li></ul><p>While the full loss <em>L<sub>theorem</sub></em> captures nuanced structure, it assumes accurate annotation of incorrect steps, which is rarely guaranteed in practice. Most datasets label all steps after the first mistake as "wrong," even if some are less severe. This motivates a simplified yet robust loss:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L} = -\\frac{1}{|C|} \\sum_{t=0}^{|C|} \\log \\frac{\\exp(Q_{c_t})}{\\sum_{q=0}^{t} \\exp(Q_{c_q}) + \\sum_{w \\in W} \\exp(Q_w + \\zeta)}\n&quot;,&quot;id&quot;:&quot;POSQPNJOAR&quot;}" data-component-name="LatexBlockToDOM"></div><p>This loss drops the internal ranking among wrong steps (the first term in <em>L<sub>theorem</sub></em>) and concentrates on learning a clear separation margin between correct and incorrect actions.</p><p>The paper reports better performance than other PRM training losses, such as binary cross-entropy or MSE:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X9DU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79431725-2f9c-49a7-af2f-33d44cd48ebd_1235x705.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X9DU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79431725-2f9c-49a7-af2f-33d44cd48ebd_1235x705.png 424w, 
https://substackcdn.com/image/fetch/$s_!X9DU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79431725-2f9c-49a7-af2f-33d44cd48ebd_1235x705.png 848w, https://substackcdn.com/image/fetch/$s_!X9DU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79431725-2f9c-49a7-af2f-33d44cd48ebd_1235x705.png 1272w, https://substackcdn.com/image/fetch/$s_!X9DU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79431725-2f9c-49a7-af2f-33d44cd48ebd_1235x705.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X9DU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79431725-2f9c-49a7-af2f-33d44cd48ebd_1235x705.png" width="1235" height="705" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79431725-2f9c-49a7-af2f-33d44cd48ebd_1235x705.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:705,&quot;width&quot;:1235,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:189873,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79431725-2f9c-49a7-af2f-33d44cd48ebd_1235x705.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!X9DU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79431725-2f9c-49a7-af2f-33d44cd48ebd_1235x705.png 424w, https://substackcdn.com/image/fetch/$s_!X9DU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79431725-2f9c-49a7-af2f-33d44cd48ebd_1235x705.png 848w, https://substackcdn.com/image/fetch/$s_!X9DU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79431725-2f9c-49a7-af2f-33d44cd48ebd_1235x705.png 1272w, https://substackcdn.com/image/fetch/$s_!X9DU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79431725-2f9c-49a7-af2f-33d44cd48ebd_1235x705.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>From a different perspective, other researchers argue that intermediate, or process, rewards should act as fine-grained supervision, guiding the model at each step in a way that supports the long-term objective of arriving at the right outcome. This perspective differs from the conventional approach, which often assumes process rewards should strictly reflect the mathematical correctness or semantic relevance of each step. In practice, large language models may need to take detours, generating seemingly trivial or redundant steps, before they find a successful reasoning trajectory. Penalizing such steps too early or too harshly can prematurely collapse promising search paths. Therefore, the paper proposes rewarding <strong>progress</strong>, which is estimated by the advantage function rather than the Q-value [9]. The authors name the method &#128073;<strong>Process Advantage Verifiers (PAV)</strong>.</p><p>To see why the Q-value may not be an ideal estimate of progress, imagine running a beam search to explore different reasoning paths. Each path (or trace) is a sequence of steps, and we want to keep the most promising ones in the beam. A natural approach might be to retain those with the highest Q-values. But Q-values inherently entangle two things: the value of the action and the value of the state it came from. That creates a problem. 
If we compare actions from different states purely based on their absolute Q-values, we risk keeping steps that decrease our likelihood of success, simply because they come from "stronger" states:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o7b7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7369fa84-e290-43d7-b4c1-7afa6c38b85f_1310x539.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o7b7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7369fa84-e290-43d7-b4c1-7afa6c38b85f_1310x539.png 424w, https://substackcdn.com/image/fetch/$s_!o7b7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7369fa84-e290-43d7-b4c1-7afa6c38b85f_1310x539.png 848w, https://substackcdn.com/image/fetch/$s_!o7b7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7369fa84-e290-43d7-b4c1-7afa6c38b85f_1310x539.png 1272w, https://substackcdn.com/image/fetch/$s_!o7b7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7369fa84-e290-43d7-b4c1-7afa6c38b85f_1310x539.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o7b7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7369fa84-e290-43d7-b4c1-7afa6c38b85f_1310x539.png" width="1310" height="539" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7369fa84-e290-43d7-b4c1-7afa6c38b85f_1310x539.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:539,&quot;width&quot;:1310,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:152105,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7369fa84-e290-43d7-b4c1-7afa6c38b85f_1310x539.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o7b7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7369fa84-e290-43d7-b4c1-7afa6c38b85f_1310x539.png 424w, https://substackcdn.com/image/fetch/$s_!o7b7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7369fa84-e290-43d7-b4c1-7afa6c38b85f_1310x539.png 848w, https://substackcdn.com/image/fetch/$s_!o7b7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7369fa84-e290-43d7-b4c1-7afa6c38b85f_1310x539.png 1272w, https://substackcdn.com/image/fetch/$s_!o7b7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7369fa84-e290-43d7-b4c1-7afa6c38b85f_1310x539.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Rather than looking at raw Q-values, we should focus on the change a step induces, that is, the progress it makes toward success. This is captured by the advantage function:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;A_{\\pi}(s_t, a_t) = Q_{\\pi}(s_t, a_t) - V_{\\pi}(s_t) = Q_{\\pi}(s_t, a_t) - Q_{\\pi}(s_{t-1}, a_{t-1})\n&quot;,&quot;id&quot;:&quot;IRECAKCYDO&quot;}" data-component-name="LatexBlockToDOM"></div><p>The second equality holds because transitions are deterministic (the next state is just the previous state with the new step appended) and the reward arrives only at the end, so <em>V<sub>&#960;</sub>(s<sub>t</sub>)</em> equals <em>Q<sub>&#960;</sub>(s<sub>t&#8722;1</sub>, a<sub>t&#8722;1</sub>)</em>. The advantage tells us whether a step improved or hurt our chances of solving the problem. A positive advantage means the step made meaningful progress; a negative one means it set us back. 
By supervising the model with these advantage values, we reward real progress, not just proximity to good outcomes.</p><p>Furthermore, the advantage doesn&#8217;t need to be computed under the same policy that we&#8217;re training. As the paper argues, we may want to compute these progress signals under a stronger <strong>prover policy</strong> <em>&#956;,</em> different from our current base policy <em>&#960;</em>. This prover acts like an expert verifier, assessing whether a step truly advances reasoning under its own more accurate value estimates.</p><p>The authors then propose components of the loss to train the policy:</p><ul><li><p>Outcome reward (of course) of the base policy:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0_Nv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5d6377-f425-4865-b573-2bfa8bc88ea8_757x69.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0_Nv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5d6377-f425-4865-b573-2bfa8bc88ea8_757x69.png 424w, https://substackcdn.com/image/fetch/$s_!0_Nv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5d6377-f425-4865-b573-2bfa8bc88ea8_757x69.png 848w, https://substackcdn.com/image/fetch/$s_!0_Nv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5d6377-f425-4865-b573-2bfa8bc88ea8_757x69.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0_Nv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5d6377-f425-4865-b573-2bfa8bc88ea8_757x69.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0_Nv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5d6377-f425-4865-b573-2bfa8bc88ea8_757x69.png" width="757" height="69" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e5d6377-f425-4865-b573-2bfa8bc88ea8_757x69.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:69,&quot;width&quot;:757,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:11817,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5d6377-f425-4865-b573-2bfa8bc88ea8_757x69.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0_Nv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5d6377-f425-4865-b573-2bfa8bc88ea8_757x69.png 424w, https://substackcdn.com/image/fetch/$s_!0_Nv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5d6377-f425-4865-b573-2bfa8bc88ea8_757x69.png 848w, https://substackcdn.com/image/fetch/$s_!0_Nv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5d6377-f425-4865-b573-2bfa8bc88ea8_757x69.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0_Nv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e5d6377-f425-4865-b573-2bfa8bc88ea8_757x69.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><ul><li><p>The advantage of the prover policy:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DYg6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3450736-0865-4b3b-8f41-fab00942f2aa_776x101.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DYg6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3450736-0865-4b3b-8f41-fab00942f2aa_776x101.png 424w, https://substackcdn.com/image/fetch/$s_!DYg6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3450736-0865-4b3b-8f41-fab00942f2aa_776x101.png 848w, https://substackcdn.com/image/fetch/$s_!DYg6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3450736-0865-4b3b-8f41-fab00942f2aa_776x101.png 1272w, https://substackcdn.com/image/fetch/$s_!DYg6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3450736-0865-4b3b-8f41-fab00942f2aa_776x101.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DYg6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3450736-0865-4b3b-8f41-fab00942f2aa_776x101.png" width="776" height="101" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3450736-0865-4b3b-8f41-fab00942f2aa_776x101.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:101,&quot;width&quot;:776,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14252,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3450736-0865-4b3b-8f41-fab00942f2aa_776x101.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DYg6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3450736-0865-4b3b-8f41-fab00942f2aa_776x101.png 424w, https://substackcdn.com/image/fetch/$s_!DYg6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3450736-0865-4b3b-8f41-fab00942f2aa_776x101.png 848w, https://substackcdn.com/image/fetch/$s_!DYg6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3450736-0865-4b3b-8f41-fab00942f2aa_776x101.png 1272w, https://substackcdn.com/image/fetch/$s_!DYg6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3450736-0865-4b3b-8f41-fab00942f2aa_776x101.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Using policy gradient methods, the gradient for updating the policy becomes:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!1V7_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fc59ffa-fc29-4ce7-8e08-d1201336d95c_880x132.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1V7_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fc59ffa-fc29-4ce7-8e08-d1201336d95c_880x132.png 424w, https://substackcdn.com/image/fetch/$s_!1V7_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fc59ffa-fc29-4ce7-8e08-d1201336d95c_880x132.png 848w, https://substackcdn.com/image/fetch/$s_!1V7_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fc59ffa-fc29-4ce7-8e08-d1201336d95c_880x132.png 1272w, https://substackcdn.com/image/fetch/$s_!1V7_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fc59ffa-fc29-4ce7-8e08-d1201336d95c_880x132.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1V7_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fc59ffa-fc29-4ce7-8e08-d1201336d95c_880x132.png" width="880" height="132" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7fc59ffa-fc29-4ce7-8e08-d1201336d95c_880x132.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:132,&quot;width&quot;:880,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19213,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fc59ffa-fc29-4ce7-8e08-d1201336d95c_880x132.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1V7_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fc59ffa-fc29-4ce7-8e08-d1201336d95c_880x132.png 424w, https://substackcdn.com/image/fetch/$s_!1V7_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fc59ffa-fc29-4ce7-8e08-d1201336d95c_880x132.png 848w, https://substackcdn.com/image/fetch/$s_!1V7_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fc59ffa-fc29-4ce7-8e08-d1201336d95c_880x132.png 1272w, https://substackcdn.com/image/fetch/$s_!1V7_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fc59ffa-fc29-4ce7-8e08-d1201336d95c_880x132.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Now, the remaining question is: &#129504; <em>How Should We Choose the Prover Policy &#956;?</em></p><ul><li><p>Don&#8217;t set <em>&#956;=&#960;</em> <strong>(</strong>base 
policy):</p><ul><li><p>This reduces to standard outcome-based training.</p></li><li><p>The advantage <em>A<sub>&#960;</sub></em> adds no extra supervision. This is especially harmful if <em>&#960;</em> is weak, since both <em>Q<sub>&#960;</sub></em> and <em>A<sub>&#960;</sub></em> will be close to zero.</p></li></ul></li><li><p>Don&#8217;t use a very weak <em>&#956;</em>:</p><ul><li><p>Just like a weak base policy, a weak prover produces flat or noisy advantage signals.</p></li><li><p>Beam search and training signals become unreliable.</p></li></ul></li><li><p>Don&#8217;t use an overly strong prover <em>&#956;</em>:</p><ul><li><p>A strong prover can succeed regardless of trivial or irrelevant steps (e.g., restating the question).</p></li><li><p>As a result, <em>Q<sub>&#956;</sub></em> stays constant across such steps, leading to <em>A<sub>&#956;</sub>&#8776;0</em>.</p></li><li><p>The process reward then gives no signal to separate useful steps from superficial ones, and training fails to improve reasoning.</p></li></ul></li></ul><p>To illustrate these points, the authors run a synthetic experiment (a didactic setup):</p><ul><li><p><strong>Goal</strong>: Train a policy <em>&#960;</em> to generate a sequence that contains a hidden sub-sequence <em>y&#8727;</em> from a vocabulary <em>V={1,2,&#8230;,15}</em>.</p></li><li><p><strong>Reward</strong>: Sparse and terminal: the agent receives a reward of 1 only if <em>y&#8727;</em> appears in the output <em>y</em>, and 0 otherwise.</p></li><li><p><strong>Prover policy </strong><em><strong>&#956;</strong></em>: A procedural policy controlled by a scalar <em>&#947;&gt;0</em>; as <em>&#947;</em> increases, <em>&#956;</em> becomes more effective, reaching perfect accuracy as <em>&#947;&#8594;&#8734;</em>.</p></li><li><p><strong>Assumption</strong>: The setup uses oracle access to the true <em>Q<sub>&#960;</sub></em> and <em>A<sub>&#956;</sub></em> to avoid approximation errors from learned models.</p></li></ul><p>The results support both points: the advantage <em>A</em> works better than <em>Q</em>, and an intermediate <em>&#947;</em> works best.</p>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ixJq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb685d558-0a33-4bf9-8f7b-e7c9b9900644_1327x510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ixJq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb685d558-0a33-4bf9-8f7b-e7c9b9900644_1327x510.png 424w, https://substackcdn.com/image/fetch/$s_!ixJq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb685d558-0a33-4bf9-8f7b-e7c9b9900644_1327x510.png 848w, https://substackcdn.com/image/fetch/$s_!ixJq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb685d558-0a33-4bf9-8f7b-e7c9b9900644_1327x510.png 1272w, https://substackcdn.com/image/fetch/$s_!ixJq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb685d558-0a33-4bf9-8f7b-e7c9b9900644_1327x510.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ixJq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb685d558-0a33-4bf9-8f7b-e7c9b9900644_1327x510.png" width="1327" height="510" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b685d558-0a33-4bf9-8f7b-e7c9b9900644_1327x510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:510,&quot;width&quot;:1327,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:177186,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb685d558-0a33-4bf9-8f7b-e7c9b9900644_1327x510.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ixJq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb685d558-0a33-4bf9-8f7b-e7c9b9900644_1327x510.png 424w, https://substackcdn.com/image/fetch/$s_!ixJq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb685d558-0a33-4bf9-8f7b-e7c9b9900644_1327x510.png 848w, https://substackcdn.com/image/fetch/$s_!ixJq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb685d558-0a33-4bf9-8f7b-e7c9b9900644_1327x510.png 1272w, https://substackcdn.com/image/fetch/$s_!ixJq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb685d558-0a33-4bf9-8f7b-e7c9b9900644_1327x510.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The results on the MATH dataset confirm the points:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3BxB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6663875-e42d-4442-a488-de09ab710fe3_1997x929.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3BxB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6663875-e42d-4442-a488-de09ab710fe3_1997x929.png 424w, 
https://substackcdn.com/image/fetch/$s_!3BxB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6663875-e42d-4442-a488-de09ab710fe3_1997x929.png 848w, https://substackcdn.com/image/fetch/$s_!3BxB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6663875-e42d-4442-a488-de09ab710fe3_1997x929.png 1272w, https://substackcdn.com/image/fetch/$s_!3BxB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6663875-e42d-4442-a488-de09ab710fe3_1997x929.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3BxB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6663875-e42d-4442-a488-de09ab710fe3_1997x929.png" width="1456" height="677" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b6663875-e42d-4442-a488-de09ab710fe3_1997x929.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:677,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:373378,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6663875-e42d-4442-a488-de09ab710fe3_1997x929.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!3BxB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6663875-e42d-4442-a488-de09ab710fe3_1997x929.png 424w, https://substackcdn.com/image/fetch/$s_!3BxB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6663875-e42d-4442-a488-de09ab710fe3_1997x929.png 848w, https://substackcdn.com/image/fetch/$s_!3BxB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6663875-e42d-4442-a488-de09ab710fe3_1997x929.png 1272w, https://substackcdn.com/image/fetch/$s_!3BxB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6663875-e42d-4442-a488-de09ab710fe3_1997x929.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Search and Planning with RL</h2><p>The second component of test-time scaling is the generation and search process. RL helps LLMs actively explore the space of reasoning steps to discover new, high-reward trajectories. When paired with search, RL becomes a powerful tool, not just for improving final answers, but for shaping how the model thinks.</p><p>Imagine solving a math problem. You might try several intermediate calculations, discard a few, refine your strategy, and finally reach a solution. This iterative process of proposing, testing, and revising steps is precisely what we want LLMs to emulate. Pure next-token prediction is too myopic for this, while search gives the model the "lookahead" it needs.</p><h4>TTC Scaling Using Classic Search Algorithms</h4><p>Early approaches, such as ToT [4], use straightforward brute-force search algorithms such as <a href="https://en.wikipedia.org/wiki/Breadth-first_search">Breadth-First Search (BFS)</a> and <a href="https://en.wikipedia.org/wiki/Depth-first_search">Depth-First Search (DFS)</a>. 
For example, a simple search framework looks like this:</p><ul><li><p><strong>Step Sampling:</strong> At each step, use the question and prior reasoning to sample <em>B</em> next steps via an LLM.</p></li><li><p><strong>Reasoning Tree:</strong></p><ul><li><p>Organize reasoning as a tree:</p><ul><li><p>Root: question</p></li><li><p>Node: current generated text</p></li><li><p>Edge: reasoning step</p></li><li><p>Leaf: final answer</p></li></ul></li><li><p>A search algorithm selects a node to expand.</p></li></ul></li><li><p><strong>Expansion:</strong> Add the selected node's step to the prompt and sample new steps.</p></li><li><p><strong>Stopping:</strong> Halt when an answer is found or a step/computation limit is reached.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MXwG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a6f433-8695-442b-95b3-0aed4ac9bcf0_2077x964.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MXwG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a6f433-8695-442b-95b3-0aed4ac9bcf0_2077x964.png 424w, https://substackcdn.com/image/fetch/$s_!MXwG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a6f433-8695-442b-95b3-0aed4ac9bcf0_2077x964.png 848w, https://substackcdn.com/image/fetch/$s_!MXwG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a6f433-8695-442b-95b3-0aed4ac9bcf0_2077x964.png 1272w, 
https://substackcdn.com/image/fetch/$s_!MXwG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a6f433-8695-442b-95b3-0aed4ac9bcf0_2077x964.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MXwG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a6f433-8695-442b-95b3-0aed4ac9bcf0_2077x964.png" width="1456" height="676" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/75a6f433-8695-442b-95b3-0aed4ac9bcf0_2077x964.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:676,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:461854,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a6f433-8695-442b-95b3-0aed4ac9bcf0_2077x964.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MXwG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a6f433-8695-442b-95b3-0aed4ac9bcf0_2077x964.png 424w, https://substackcdn.com/image/fetch/$s_!MXwG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a6f433-8695-442b-95b3-0aed4ac9bcf0_2077x964.png 848w, 
https://substackcdn.com/image/fetch/$s_!MXwG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a6f433-8695-442b-95b3-0aed4ac9bcf0_2077x964.png 1272w, https://substackcdn.com/image/fetch/$s_!MXwG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75a6f433-8695-442b-95b3-0aed4ac9bcf0_2077x964.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Search in action. Source: [10].</figcaption></figure></div><p>The key here is the search algorithm. 
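</p><p>The framework listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of any cited paper: <code>tree_search</code>, <code>propose_steps</code>, <code>score</code>, and <code>is_answer</code> are hypothetical names, and a toy arithmetic puzzle stands in for the LLM and its scorer. The only part that differs between search algorithms is how the next node is selected; here it is best-first by score.</p>
<pre><code>
```python
import heapq
from itertools import count


def tree_search(question, propose_steps, score, is_answer, max_expansions=50):
    """Best-first search over partial reasoning chains (lists of steps)."""
    tie = count()  # tie-breaker so the heap never has to compare step lists
    frontier = [(0.0, next(tie), [])]  # root: the question with no steps yet
    for _ in range(max_expansions):  # stopping: computation limit
        if not frontier:
            break
        _, _, steps = heapq.heappop(frontier)  # selection: most promising node
        if is_answer(question, steps):  # stopping: a leaf with a final answer
            return steps
        for step in propose_steps(question, steps):  # expansion: sample next steps
            child = steps + [step]
            # heapq is a min-heap, so store the negated score
            heapq.heappush(frontier, (-score(question, child), next(tie), child))
    return None


# Toy stand-ins (hypothetical): reach a target number starting from 1.
def value(steps):
    x = 1
    for s in steps:
        x = x + 1 if s == "+1" else x * 2
    return x


def propose_steps(question, steps):
    return ["+1", "*2"]  # an LLM would sample B candidate steps here


def score(question, steps):
    # Closer to the target is better; shorter chains are slightly preferred.
    return -abs(question - value(steps)) - 0.1 * len(steps)


def is_answer(question, steps):
    return value(steps) == question


found = tree_search(10, propose_steps, score, is_answer)
print(found, value(found))
```
</code></pre>
<p>Swapping in a different <code>score</code> function changes which node is expanded next without touching the rest of the loop, which is where the methods below differ.</p><p>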
&#128073;<strong>MindStar (M*)</strong> [10] proposes an inference-time search procedure for LLMs based on Levin Tree Search, a best-first tree search algorithm. A best-first search does not expand every node, so it needs a criterion for choosing which node (reasoning step) to expand next. Here, the authors score each reasoning step with the aforementioned PRM.</p><p>Formally, given a current node <em>n<sub>d</sub></em>&#8203; and a candidate next step <em>e<sub>d</sub></em>&#8203;, the PRM outputs a reward:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P(n_d, e_d) = r_d \\in [0, 1]\n&quot;,&quot;id&quot;:&quot;UYDWBXXINK&quot;}" data-component-name="LatexBlockToDOM"></div><p>A high <em>r<sub>d</sub></em>&#8203; means the step is likely valid and consistent with prior reasoning. Once the child nodes are scored with the PRM, we must decide which node to expand next. Two strategies are examined:</p><ul><li><p>Beam search selects the next step with the highest reward and appends it to the current node:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;e^*_d = \\arg\\max_{e_i \\in \\{e^1_d, \\dots, e^N_d\\}} P(n_d, e_i)\nn_{d+1} = [n_d \\oplus e^*_d]&quot;,&quot;id&quot;:&quot;QKGICKCLKG&quot;}" data-component-name="LatexBlockToDOM"></div><p></p></li><li><p>Levin Tree Search (LevinTS) improves on beam search by balancing cost and probability:</p><ul><li><p><strong>Cost:</strong> the cost of the path (the smaller the better)</p></li><li><p><strong>Probability:</strong> reflects how likely a node leads to a solution (the higher the better)</p></li></ul></li></ul><p>The paper defines the cost function as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f(n) = e^{\\tau \\cdot i_{tok}}\n&quot;,&quot;id&quot;:&quot;BYAWAYTAGR&quot;}" data-component-name="LatexBlockToDOM"></div><p>where:</p><ul><li><p><em>i<sub>tok</sub></em> = 
number of tokens in <em>n</em></p></li><li><p><em>&#964;</em> = temperature (controls search aggressiveness)</p></li></ul><p>The probability <em>&#960;(n)</em> is defined recursively as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\pi(n) := \\pi(n') \\cdot \\frac{e^{P(n', e')}}{\\sum_{i=1}^{N} e^{P(n', e_i)}}\n&quot;,&quot;id&quot;:&quot;FJACWJDDTW&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>n</em> has parent <em>n&#8242;</em> connected by an edge <em>e&#8242;.</em></p><p>LevinTS expands nodes in order of increasing <em>f(n)/&#960;(n)</em>, which bounds the search effort needed to reach a target node:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;|\\bar{N}(\\text{LevinTS}, N_g)| \\leq \\min_{n \\in N_g} \\frac{f(n)}{\\pi(n)}\n&quot;,&quot;id&quot;:&quot;GFPBJFYXVB&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>N<sub>g</sub></em> is the set of target nodes.</p><p>These search algorithms improve LLM performance significantly:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OpVV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c31b9d8-160c-4344-861e-a6c217ae422a_1256x988.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OpVV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c31b9d8-160c-4344-861e-a6c217ae422a_1256x988.png 424w, https://substackcdn.com/image/fetch/$s_!OpVV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c31b9d8-160c-4344-861e-a6c217ae422a_1256x988.png 848w, 
https://substackcdn.com/image/fetch/$s_!OpVV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c31b9d8-160c-4344-861e-a6c217ae422a_1256x988.png 1272w, https://substackcdn.com/image/fetch/$s_!OpVV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c31b9d8-160c-4344-861e-a6c217ae422a_1256x988.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OpVV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c31b9d8-160c-4344-861e-a6c217ae422a_1256x988.png" width="1256" height="988" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c31b9d8-160c-4344-861e-a6c217ae422a_1256x988.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:988,&quot;width&quot;:1256,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:228997,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c31b9d8-160c-4344-861e-a6c217ae422a_1256x988.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OpVV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c31b9d8-160c-4344-861e-a6c217ae422a_1256x988.png 424w, 
https://substackcdn.com/image/fetch/$s_!OpVV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c31b9d8-160c-4344-861e-a6c217ae422a_1256x988.png 848w, https://substackcdn.com/image/fetch/$s_!OpVV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c31b9d8-160c-4344-861e-a6c217ae422a_1256x988.png 1272w, https://substackcdn.com/image/fetch/$s_!OpVV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c31b9d8-160c-4344-861e-a6c217ae422a_1256x988.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Reasoning sometimes involves complicated planning, which can be tested with planning benchmarks such as maze navigation. Motivated by this, the &#128073;<strong>LLM-A*</strong> paper [11] aims to evaluate and improve LLMs on path planning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xTvx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efbc70f-71c2-4aec-9363-34f8a304356a_1104x483.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xTvx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efbc70f-71c2-4aec-9363-34f8a304356a_1104x483.png 424w, https://substackcdn.com/image/fetch/$s_!xTvx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efbc70f-71c2-4aec-9363-34f8a304356a_1104x483.png 848w, https://substackcdn.com/image/fetch/$s_!xTvx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efbc70f-71c2-4aec-9363-34f8a304356a_1104x483.png 1272w, https://substackcdn.com/image/fetch/$s_!xTvx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efbc70f-71c2-4aec-9363-34f8a304356a_1104x483.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xTvx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efbc70f-71c2-4aec-9363-34f8a304356a_1104x483.png" width="1104" height="483" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0efbc70f-71c2-4aec-9363-34f8a304356a_1104x483.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:483,&quot;width&quot;:1104,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:184321,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efbc70f-71c2-4aec-9363-34f8a304356a_1104x483.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xTvx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efbc70f-71c2-4aec-9363-34f8a304356a_1104x483.png 424w, https://substackcdn.com/image/fetch/$s_!xTvx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efbc70f-71c2-4aec-9363-34f8a304356a_1104x483.png 848w, https://substackcdn.com/image/fetch/$s_!xTvx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efbc70f-71c2-4aec-9363-34f8a304356a_1104x483.png 1272w, https://substackcdn.com/image/fetch/$s_!xTvx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efbc70f-71c2-4aec-9363-34f8a304356a_1104x483.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Example of a path-finding problem. Source: [11].</figcaption></figure></div><p>The gist of LLM-A*: a hybrid algorithm that combines the exact path-finding of A* with the high-level reasoning capabilities of an LLM. Alongside A*, the LLM acts as a <em>planning guide</em>, suggesting waypoints (intermediate targets) based on its understanding of the start, goal, and obstacles.</p><ul><li><p>The algorithm begins like A*: define the start and goal, initialize the lists OPEN = {s<sub>0</sub>} and CLOSED = &#8709;, and use a cost function <code>f(s) = g(s) + h(s)</code>. Here, A* employs a heuristic function <em>h(s)</em> to estimate the cost from a node <em>s</em> to the goal, and a cost function <em>g(s)</em> to track the exact cost from the start to <em>s. 
</em></p></li><li><p>Before the search begins, the LLM is prompted with the environment information and asked to generate a target list <code>T</code>: a series of waypoints that form a plausible path to the goal.</p></li><li><p>As A* proceeds, it modifies its <code>f(s)</code> score to bias the search toward the current LLM-suggested waypoint <code>t</code>:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f(s) = g(s) + h(s) + cost(t, s)\n&quot;,&quot;id&quot;:&quot;HPGPIXQFFW&quot;}" data-component-name="LatexBlockToDOM"></div><blockquote><p>&#128064; The <em>cost(t,s)</em> term encourages A* to prioritize paths that pass through the LLM&#8217;s waypoints.</p></blockquote><ul><li><p>Once the current target <code>t</code> is reached, it is replaced by the next waypoint in the sequence, so that global LLM knowledge continuously steers the local search.</p></li><li><p>Crucially, the algorithm also verifies that every waypoint in <code>T</code> lies outside obstacle zones and that <code>T</code> contains the correct start and goal nodes, correcting LLM hallucinations.</p></li></ul><p>LLM-A* uses several prompting techniques to extract better target paths from the language model:</p><ul><li><p><strong>Few-Shot Learning:</strong> Shows the LLM five demonstration examples of correct paths before prompting for a new one, increasing reliability without requiring training.</p></li><li><p><strong>Chain-of-Thought (CoT):</strong> Encourages the LLM to reason step-by-step rather than jumping to the answer. 
This is particularly helpful in more complex environments where multi-step logic is essential.</p></li><li><p><strong>Recursive Path Evaluation (RePE):</strong> Breaks the planning into three LLM-driven stages:</p><ol><li><p>Analyze the environment,</p></li><li><p>Generate a step,</p></li><li><p>Evaluate that step.</p></li></ol><p>This recursion mimics approaches like <a href="https://www.promptingguide.ai/techniques/react">ReAct</a> and Self-Reflection but focuses entirely on internal reasoning, without environmental feedback.</p></li></ul><p>Finally, the results show that LLM-A* significantly outperforms the LLM alone on the benchmark.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tKPH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ef581d-142c-4359-a569-2863c8f06037_1511x515.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tKPH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ef581d-142c-4359-a569-2863c8f06037_1511x515.png 424w, https://substackcdn.com/image/fetch/$s_!tKPH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ef581d-142c-4359-a569-2863c8f06037_1511x515.png 848w, https://substackcdn.com/image/fetch/$s_!tKPH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ef581d-142c-4359-a569-2863c8f06037_1511x515.png 1272w, https://substackcdn.com/image/fetch/$s_!tKPH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ef581d-142c-4359-a569-2863c8f06037_1511x515.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tKPH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ef581d-142c-4359-a569-2863c8f06037_1511x515.png" width="1456" height="496" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72ef581d-142c-4359-a569-2863c8f06037_1511x515.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:496,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:113934,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ef581d-142c-4359-a569-2863c8f06037_1511x515.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tKPH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ef581d-142c-4359-a569-2863c8f06037_1511x515.png 424w, https://substackcdn.com/image/fetch/$s_!tKPH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ef581d-142c-4359-a569-2863c8f06037_1511x515.png 848w, https://substackcdn.com/image/fetch/$s_!tKPH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ef581d-142c-4359-a569-2863c8f06037_1511x515.png 1272w, 
https://substackcdn.com/image/fetch/$s_!tKPH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ef581d-142c-4359-a569-2863c8f06037_1511x515.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>&#10060; One big issue with relying on classic search algorithms is the choice and design of the cost function <em>f()</em>. 
Prior approaches often hinge on carefully engineered utility functions (e.g., number of tokens for text and distance for path planning), which introduce practical limitations.</p><p>An elegant alternative is to frame multi-step reasoning as an MDP. Here, each reasoning trajectory maps to a path through a decision space, where the state is the input prompt plus previous steps, the action is the next proposed token, and the Q-value can be used as the cost function. The paper &#128073;<strong>Q*</strong> leverages this structure by adopting a best-first decoding strategy, inspired by A* search [12].</p><blockquote><p>&#128064; Unlike the aforementioned papers such as PQM, which model an action as a reasoning step consisting of multiple tokens, Q* treats each action as a single token.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X7Jm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7169b567-58b5-4efe-b0ee-e578338f067c_1235x609.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X7Jm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7169b567-58b5-4efe-b0ee-e578338f067c_1235x609.png 424w, https://substackcdn.com/image/fetch/$s_!X7Jm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7169b567-58b5-4efe-b0ee-e578338f067c_1235x609.png 848w, https://substackcdn.com/image/fetch/$s_!X7Jm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7169b567-58b5-4efe-b0ee-e578338f067c_1235x609.png 1272w, 
https://substackcdn.com/image/fetch/$s_!X7Jm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7169b567-58b5-4efe-b0ee-e578338f067c_1235x609.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X7Jm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7169b567-58b5-4efe-b0ee-e578338f067c_1235x609.png" width="1235" height="609" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7169b567-58b5-4efe-b0ee-e578338f067c_1235x609.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:609,&quot;width&quot;:1235,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:146451,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7169b567-58b5-4efe-b0ee-e578338f067c_1235x609.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X7Jm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7169b567-58b5-4efe-b0ee-e578338f067c_1235x609.png 424w, https://substackcdn.com/image/fetch/$s_!X7Jm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7169b567-58b5-4efe-b0ee-e578338f067c_1235x609.png 848w, 
https://substackcdn.com/image/fetch/$s_!X7Jm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7169b567-58b5-4efe-b0ee-e578338f067c_1235x609.png 1272w, https://substackcdn.com/image/fetch/$s_!X7Jm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7169b567-58b5-4efe-b0ee-e578338f067c_1235x609.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Q* framework. 
Source: [12]</figcaption></figure></div><p></p><p>Specifically, Q* casts reasoning as a heuristic search problem. Each partial reasoning path <em>s<sub>t</sub></em>&#8203; is scored using an A*-style utility function:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f(s_t) = g(s_t) + \\lambda h(s_t)\n&quot;,&quot;id&quot;:&quot;GKYHRCFMTQ&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p><em>g(s<sub>t</sub>&#8203;)</em>: Aggregated reward along the path so far.</p></li><li><p><em>h(s<sub>t</sub>)</em>: Heuristic estimate of the utility-to-go.</p></li><li><p><em>&#955;</em>: A scalar balancing the two terms.</p></li></ul><p>To compute <em>g(s<sub>t</sub>)</em>, Q* applies an aggregation function to the values of a reward function <em>R<sub>P</sub></em>&#8203; that reflects task-specific preferences (e.g., correctness, coherence, confidence):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;g(s_t) = \\text{Agg}(R_P(s_1), \\ldots, R_P(s_t))\n&quot;,&quot;id&quot;:&quot;KAMZOGEUQV&quot;}" data-component-name="LatexBlockToDOM"></div><p>For the heuristic <em>h(s<sub>t</sub>)</em>, Q* uses the optimal Q-value, maximized over the candidate next actions:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f(s_t) = g(s_t) + \\lambda \\max_{a_t \\in \\text{top-}K(\\pi_\\theta(\\cdot|s_t))} Q^*(s_t, a_t)\n&quot;,&quot;id&quot;:&quot;EMAUDXLLJZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Rather than exploring all possible continuations, Q* limits candidates to the top-K tokens proposed by the LLM.</p><p>The key challenge is estimating the optimal Q-value <em>Q*(s<sub>t</sub>,a<sub>t</sub>)</em> for a frozen LLM policy <em>&#960;<sub>&#952;</sub></em>. 
To do this, Q* learns a proxy Q-value function <em>Q&#770;</em> from offline trajectories:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{Q} = \\arg\\min_Q \\frac{1}{NMT} \\sum_{i=1}^{N} \\sum_{j=1}^{M} \\sum_{a_t \\in a_{ij}} (Q(s_t, a_t) - \\hat{y}(s_t, a_t))^2\n&quot;,&quot;id&quot;:&quot;ZXEJCHYDXQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where the double sum runs over the <em>N</em>&#215;<em>M</em> offline trajectories <em>a<sub>ij</sub></em> and <em>T</em> is the number of actions in a trajectory. <em>y&#770;&#8203;(s<sub>t</sub>&#8203;,a<sub>t</sub>&#8203;)</em> is a label approximating the true Q-value, computed using one of the following strategies:</p><ol><li><p>Using Fitted Q-Iteration, labels are constructed recursively:</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{y}^\\ell(s_t, a_t) = \n\\begin{cases}\nR(s_t, a_t) &amp; \\text{if } t = T \\\\\nR(s_t, a_t) + \\gamma \\max_{a_{t+1} \\in \\text{top-}K} \\hat{Q}^{\\ell-1}(s_{t+1}, a_{t+1}) &amp; \\text{otherwise}\n\\end{cases}&quot;,&quot;id&quot;:&quot;ZHVIKYDXEH&quot;}" data-component-name="LatexBlockToDOM"></div><p>This process alternates between updating labels and training <em>Q&#770;</em> for <em>L</em> iterations.</p><ol start="2"><li><p>Learning from Rollouts: run random or MCTS rollouts from <em>(s<sub>t</sub>,a<sub>t</sub>)</em> and assign labels based on the best future trajectory:</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{y}(s_t, a_t) = R(s_t, a_t) + \\max_{\\tau \\sim \\mathcal{P}} \\sum_{t' = t+1}^{T} \\gamma^{T - t'} R(s'_{t'}, a'_{t'})\n&quot;,&quot;id&quot;:&quot;PCXYHPWQOQ&quot;}" data-component-name="LatexBlockToDOM"></div><ol start="3"><li><p>Using a Stronger LLM: if a stronger LLM <em>&#960;<sub>&#952;*</sub></em>&#8203; is available (e.g., GPT-4), complete the reasoning with it to estimate the future reward:</p></li></ol><div class="latex-rendered" 
data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{y}(s_t, a_t) = R(s_t, a_t) + \\sum_{t' = t+1}^{T} \\gamma^{T - t'} R(s^*_{t'}, a^*_{t'})\n&quot;,&quot;id&quot;:&quot;MYYHJXQFBH&quot;}" data-component-name="LatexBlockToDOM"></div><p>where:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s_{t'}^* = [s_t; a_t; a_{t+1}^*; \\ldots; a_{t'-1}^*]\n\n&quot;,&quot;id&quot;:&quot;BBWAKMQWKT&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;a_{t'}^* \\sim \\pi_{\\theta^*}(\\cdot \\mid s_{t'}^*)\n&quot;,&quot;id&quot;:&quot;KDIFMZOEUH&quot;}" data-component-name="LatexBlockToDOM"></div><p>&#10060; While classic search-inspired frameworks like A* provide structured guidance for stepwise reasoning, they often rely on static heuristics and limited lookahead. </p><h4>Beyond Search with Planning</h4><p>To explore deeper decision spaces and balance exploration with exploitation, recent approaches turn to <a href="https://hungleai.substack.com/i/161521744/about-reinforcement-learning-rl">Monte Carlo Tree Search (MCTS)</a>&#8212;a planning algorithm that enables LLMs to simulate multiple reasoning trajectories and refine their outputs through iterative rollouts. This shift empowers LLMs to deliberate more thoroughly, especially on complex tasks with long-horizon dependencies.</p><p>Although classical MCTS depends on access to an explicit environment model to simulate forward trajectories, this assumption is often unnecessary in language-based tasks. A key insight exploited by &#128073;<strong>Language Agent Tree Search (LATS)</strong> [13] is that for most language model (LM) environments&#8212;such as web navigation, multi-step reasoning, or tool use&#8212;we can <em>reconstruct any prior state</em> simply by restoring the full-text history. 
This removes one of the main barriers to applying MCTS in natural language settings and opens the door to powerful planning capabilities within LMs.</p><p>LATS adapts MCTS to language agents by interpreting each node in the tree as a structured tuple:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s = [x, a_{1:i}, o_{1:i}]&quot;,&quot;id&quot;:&quot;NYFIUAAMZX&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>x</em> is the original task input, <em>a<sub>1:i</sub></em> is the action sequence taken so far (including reasoning traces), and <em>o<sub>1:i</sub></em> are the corresponding observations or feedback. The search tree is built incrementally using six operations:</p><ol><li><p><strong>Selection</strong>: Starting from the root, LATS uses the UCT criterion to traverse down the tree until it reaches a leaf node.</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{UCT}(s) = V(s) + w \\sqrt{\\frac{\\ln N(p)}{N(s)}}\n&quot;,&quot;id&quot;:&quot;DSDCXUJYAL&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p><em>V(s)</em> is the estimated value of node <em>s</em></p></li><li><p><em>N(s)</em> is the number of visits to node <em>s</em></p></li><li><p><em>N(p)</em> is the number of visits to <em>s</em>'s parent <em>p</em></p></li><li><p><em>w</em> is the exploration weight</p></li></ul><ol start="2"><li><p><strong>Expansion</strong>: At the selected node, the LM samples <em>n</em> next-step actions:</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;a_t^{(1)}, \\ldots, a_t^{(n)} \\sim p_\\theta(\\cdot|s)&quot;,&quot;id&quot;:&quot;CJIXXZOCVD&quot;}" data-component-name="LatexBlockToDOM"></div><ol start="3"><li><p><strong>Evaluation</strong>: Each new node is scored with a novel value function:</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V(s) = \\lambda \\cdot \\text{LM}(s) 
+ (1 - \\lambda) \\cdot \\text{SC}(s)\n&quot;,&quot;id&quot;:&quot;RZJHQQCHWZ&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p><em>LM(s)</em> is a scalar score predicted by the language model, prompted to end its reasoning trace with a self-assessed rating,</p></li><li><p><em>SC(s)</em> is a self-consistency score, aggregating agreement across multiple LM samples from the same node.</p></li></ul><ol start="4"><li><p><strong>Simulation</strong>: LATS recursively samples and simulates actions from the current node until a terminal state is reached. The LM executes this by generating the remainder of the trajectory.</p></li><li><p><strong>Backpropagation</strong>: Upon reaching a terminal state and obtaining a reward <em>r</em>, LATS updates all ancestor nodes along the path s<sub>0</sub>&#8203;&#8594;s<sub>1</sub>&#8203;&#8594;&#8943;&#8594;s<sub>&#8467;</sub>:</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V(s_i) = \\frac{V(s_i) \\cdot (N(s_i) - 1) + r}{N(s_i)}\n&quot;,&quot;id&quot;:&quot;KOMOQVEVVV&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, the visitation count is updated as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;N(s_i) \\leftarrow N(s_i) + 1,&quot;,&quot;id&quot;:&quot;FHGIHCKLHU&quot;}" data-component-name="LatexBlockToDOM"></div><ol start="6"><li><p><strong>Reflection</strong>: When a trajectory fails, the LM is prompted to reflect on its reasoning and generate a natural language critique. This is stored and injected as additional in-context information in future rollouts, providing &#8220;semantic gradient&#8221; signals that improve decision-making without explicit gradient descent. The reflection prompt begins as follows:</p></li></ol><div class="pullquote"><p>You are an advanced reasoning agent that can improve based on self-reflection. 
You will be given a previous reasoning trial in which you were given access to a shopping website and a specific type of item to buy. You were given access to relevant context and an item to  &#8230;</p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O-ze!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a89bc9-e17d-4ecd-8597-0042500d130e_1387x482.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O-ze!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a89bc9-e17d-4ecd-8597-0042500d130e_1387x482.png 424w, https://substackcdn.com/image/fetch/$s_!O-ze!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a89bc9-e17d-4ecd-8597-0042500d130e_1387x482.png 848w, https://substackcdn.com/image/fetch/$s_!O-ze!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a89bc9-e17d-4ecd-8597-0042500d130e_1387x482.png 1272w, https://substackcdn.com/image/fetch/$s_!O-ze!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a89bc9-e17d-4ecd-8597-0042500d130e_1387x482.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O-ze!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a89bc9-e17d-4ecd-8597-0042500d130e_1387x482.png" width="1387" height="482" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72a89bc9-e17d-4ecd-8597-0042500d130e_1387x482.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:482,&quot;width&quot;:1387,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85602,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a89bc9-e17d-4ecd-8597-0042500d130e_1387x482.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O-ze!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a89bc9-e17d-4ecd-8597-0042500d130e_1387x482.png 424w, https://substackcdn.com/image/fetch/$s_!O-ze!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a89bc9-e17d-4ecd-8597-0042500d130e_1387x482.png 848w, https://substackcdn.com/image/fetch/$s_!O-ze!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a89bc9-e17d-4ecd-8597-0042500d130e_1387x482.png 1272w, https://substackcdn.com/image/fetch/$s_!O-ze!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72a89bc9-e17d-4ecd-8597-0042500d130e_1387x482.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">MCTS in LATS. 
Source: [13].</figcaption></figure></div><p>As the results show, LATS improves performance on tasks that require reasoning and tool use (internet search):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2Zyx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0e690d-4240-44d3-94fc-c0803458cd55_1289x396.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2Zyx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0e690d-4240-44d3-94fc-c0803458cd55_1289x396.png 424w, https://substackcdn.com/image/fetch/$s_!2Zyx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0e690d-4240-44d3-94fc-c0803458cd55_1289x396.png 848w, https://substackcdn.com/image/fetch/$s_!2Zyx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0e690d-4240-44d3-94fc-c0803458cd55_1289x396.png 1272w, https://substackcdn.com/image/fetch/$s_!2Zyx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0e690d-4240-44d3-94fc-c0803458cd55_1289x396.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2Zyx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0e690d-4240-44d3-94fc-c0803458cd55_1289x396.png" width="1289" height="396" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a0e690d-4240-44d3-94fc-c0803458cd55_1289x396.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:396,&quot;width&quot;:1289,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:242205,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0e690d-4240-44d3-94fc-c0803458cd55_1289x396.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2Zyx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0e690d-4240-44d3-94fc-c0803458cd55_1289x396.png 424w, https://substackcdn.com/image/fetch/$s_!2Zyx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0e690d-4240-44d3-94fc-c0803458cd55_1289x396.png 848w, https://substackcdn.com/image/fetch/$s_!2Zyx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0e690d-4240-44d3-94fc-c0803458cd55_1289x396.png 1272w, https://substackcdn.com/image/fetch/$s_!2Zyx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a0e690d-4240-44d3-94fc-c0803458cd55_1289x396.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>As we&#8217;ve seen, many recent papers propose integrating search and planning with RL concepts to improve LLM reasoning, ranging from beam search guided by reward models to complex tree-based planning algorithms. However, this raises an important question: &#129504; <em>Is more complexity always better?</em> </p><p>A recent empirical study titled &#128073;<em><strong>Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters</strong></em> ran comprehensive evaluations that address this question [14]. 
</p><p>Core Idea: Test-Time Compute as a Resource</p><ul><li><p>Instead of scaling parameters, we scale how we use compute at test time (e.g., number of samples, depth of revisions, search strategy).</p></li><li><p>Given a prompt and a fixed compute budget <em>N</em>, different compute strategies yield different accuracies.</p></li><li><p>Some questions benefit more from refinement (e.g., iterative revisions), while others require broader exploration (e.g., sampling diverse solutions in parallel).</p></li></ul><p>Formally, we define <code>Target(&#952;, N, q)</code> as the distribution over outputs induced by a strategy <code>&#952;</code> with budget <code>N</code> on prompt <code>q</code>.</p><ul><li><p>The goal is to find the optimal search strategy:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\theta^*_{q, y^*(q)}(N) = \\arg\\max_\\theta \\mathbb{E}_{y \\sim \\text{Target}(\\theta, N, q)}[\\mathbf{1}_{y = y^*(q)}]\n&quot;,&quot;id&quot;:&quot;FNZYMDHGTU&quot;}" data-component-name="LatexBlockToDOM"></div><p>A strategy has two main components:</p><ol><li><p><strong>Proposal Distribution</strong>: how candidate answers are generated (e.g., revision vs. 
independent sampling).</p></li><li><p><strong>Verifier</strong>: how candidates are judged (e.g., majority vote, best-of-N, or scoring models like ORMs or PRMs).</p></li></ol><p>The paper studies several options for strategies:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H6Qg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73692c1d-43c3-43a2-baa2-099e24590ba2_1961x961.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H6Qg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73692c1d-43c3-43a2-baa2-099e24590ba2_1961x961.png 424w, https://substackcdn.com/image/fetch/$s_!H6Qg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73692c1d-43c3-43a2-baa2-099e24590ba2_1961x961.png 848w, https://substackcdn.com/image/fetch/$s_!H6Qg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73692c1d-43c3-43a2-baa2-099e24590ba2_1961x961.png 1272w, https://substackcdn.com/image/fetch/$s_!H6Qg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73692c1d-43c3-43a2-baa2-099e24590ba2_1961x961.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H6Qg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73692c1d-43c3-43a2-baa2-099e24590ba2_1961x961.png" width="1456" height="714" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73692c1d-43c3-43a2-baa2-099e24590ba2_1961x961.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:714,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:250797,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73692c1d-43c3-43a2-baa2-099e24590ba2_1961x961.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H6Qg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73692c1d-43c3-43a2-baa2-099e24590ba2_1961x961.png 424w, https://substackcdn.com/image/fetch/$s_!H6Qg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73692c1d-43c3-43a2-baa2-099e24590ba2_1961x961.png 848w, https://substackcdn.com/image/fetch/$s_!H6Qg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73692c1d-43c3-43a2-baa2-099e24590ba2_1961x961.png 1272w, https://substackcdn.com/image/fetch/$s_!H6Qg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73692c1d-43c3-43a2-baa2-099e24590ba2_1961x961.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Common search strategies. Lookahead search may include A* and MCTS. Source: [14].</figcaption></figure></div><blockquote><p>&#128064; The intuition is that, for easy problems, simple search may suffice, yet for hard problems, it is better to sample diverse reasoning paths up front.</p></blockquote><p>This requires estimating question difficulty to guide how compute should be allocated. &#129504; <em>How can problem difficulty be estimated?</em> </p><ul><li><p>Two ways:</p><ul><li><p>Oracle difficulty: uses ground-truth labels (ideal but impractical).</p></li><li><p>Model-predicted difficulty: uses verifier scores to estimate difficulty without access to answers.</p></li></ul></li></ul><p>We often assume that more sophisticated search means better performance. But the empirical results say: not always. 
In particular,</p><ul><li><p>Beam search wins early: At low-generation budgets, beam search significantly outperforms best-of-N. But this advantage vanishes as the budget increases.</p></li><li><p>Lookahead search underperforms: Despite being more powerful, it performs worse due to:</p><ul><li><p>Costly rollouts.</p></li><li><p>Over-optimization of the verifier (PRM), leading to repetitive or trivial outputs.</p></li></ul></li><li><p>Diminishing returns from search are real: As shown in some failure cases, stronger search just amplifies bad verifier signals, generating:</p><ul><li><p>Repetitive steps.</p></li><li><p>Overly short or spurious solutions.</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q7vD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d70551b-149b-4d9d-8f68-8358710b670c_1177x428.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q7vD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d70551b-149b-4d9d-8f68-8358710b670c_1177x428.png 424w, https://substackcdn.com/image/fetch/$s_!Q7vD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d70551b-149b-4d9d-8f68-8358710b670c_1177x428.png 848w, https://substackcdn.com/image/fetch/$s_!Q7vD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d70551b-149b-4d9d-8f68-8358710b670c_1177x428.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Q7vD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d70551b-149b-4d9d-8f68-8358710b670c_1177x428.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q7vD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d70551b-149b-4d9d-8f68-8358710b670c_1177x428.png" width="1177" height="428" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d70551b-149b-4d9d-8f68-8358710b670c_1177x428.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:428,&quot;width&quot;:1177,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q7vD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d70551b-149b-4d9d-8f68-8358710b670c_1177x428.png 424w, https://substackcdn.com/image/fetch/$s_!Q7vD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d70551b-149b-4d9d-8f68-8358710b670c_1177x428.png 848w, https://substackcdn.com/image/fetch/$s_!Q7vD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d70551b-149b-4d9d-8f68-8358710b670c_1177x428.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Q7vD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d70551b-149b-4d9d-8f68-8358710b670c_1177x428.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Using difficulty-conditioned compute allocation&#8212;i.e., scaling search per question difficulty&#8212;improves efficiency:</p><ul><li><p>With oracle difficulty bins, the compute-optimal strategy nearly matches the best-of-64 using only 16 generations.</p></li><li><p>Even with model-predicted bins, the gains largely hold.</p></li></ul><p>Last but not least, 
the paper also explores improving the model&#8217;s <em>proposal distribution</em> by teaching it to revise its answers. This technique can be combined with parallel search approaches like best-of-N:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MPJ3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a03dbe-9525-40a1-9448-f4a3850beeff_2073x817.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MPJ3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a03dbe-9525-40a1-9448-f4a3850beeff_2073x817.png 424w, https://substackcdn.com/image/fetch/$s_!MPJ3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a03dbe-9525-40a1-9448-f4a3850beeff_2073x817.png 848w, https://substackcdn.com/image/fetch/$s_!MPJ3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a03dbe-9525-40a1-9448-f4a3850beeff_2073x817.png 1272w, https://substackcdn.com/image/fetch/$s_!MPJ3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a03dbe-9525-40a1-9448-f4a3850beeff_2073x817.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MPJ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a03dbe-9525-40a1-9448-f4a3850beeff_2073x817.png" width="1456" height="574" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/34a03dbe-9525-40a1-9448-f4a3850beeff_2073x817.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:574,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:245180,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a03dbe-9525-40a1-9448-f4a3850beeff_2073x817.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MPJ3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a03dbe-9525-40a1-9448-f4a3850beeff_2073x817.png 424w, https://substackcdn.com/image/fetch/$s_!MPJ3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a03dbe-9525-40a1-9448-f4a3850beeff_2073x817.png 848w, https://substackcdn.com/image/fetch/$s_!MPJ3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a03dbe-9525-40a1-9448-f4a3850beeff_2073x817.png 1272w, https://substackcdn.com/image/fetch/$s_!MPJ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34a03dbe-9525-40a1-9448-f4a3850beeff_2073x817.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Sequential revision approaches. Source: [14].</figcaption></figure></div><p>Unlike naive self-correction (which often fails), fine-tuning a model to iteratively refine previous answers yields clear improvements. The authors therefore apply supervised fine-tuning (SFT) on trajectories composed of a sequence of incorrect answers followed by a correct one. This trains the model to identify and correct mistakes in context rather than discard prior attempts and restart.</p><p>At inference time, the fine-tuned model generates sequential revisions, each conditioned on the latest few attempts (up to four). 
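</p><p>The inference loop described above can be sketched as follows. This is only an illustrative sketch, not the paper&#8217;s implementation: the function name <code>sequential_revisions</code>, the <code>generate</code> callable (standing in for a call to the fine-tuned model), and the <code>context_window</code> parameter are all hypothetical names introduced here.</p>

```python
from typing import Callable, List


def sequential_revisions(
    question: str,
    generate: Callable[[str, List[str]], str],
    num_revisions: int,
    context_window: int = 4,
) -> List[str]:
    """Produce a chain of answers, each conditioned on the most recent attempts.

    `generate(question, recent_attempts)` is a hypothetical stand-in for the
    fine-tuned revision model; it returns the next attempt as text.
    """
    attempts: List[str] = []
    for _ in range(num_revisions):
        # Keep only the latest few attempts in context (up to four by default).
        recent = attempts[-context_window:]
        attempts.append(generate(question, recent))
    return attempts
```

<p>The returned list holds every attempt in order, so a verifier or majority vote can still choose among them afterwards, which is how sequential revision can be combined with parallel sampling.</p><p>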
Empirically, performance improves with each revision step, showing the model effectively learns to self-correct over time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YSnY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84da4b9-75dc-4e81-88b0-553c18b1059c_2053x709.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YSnY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84da4b9-75dc-4e81-88b0-553c18b1059c_2053x709.png 424w, https://substackcdn.com/image/fetch/$s_!YSnY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84da4b9-75dc-4e81-88b0-553c18b1059c_2053x709.png 848w, https://substackcdn.com/image/fetch/$s_!YSnY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84da4b9-75dc-4e81-88b0-553c18b1059c_2053x709.png 1272w, https://substackcdn.com/image/fetch/$s_!YSnY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84da4b9-75dc-4e81-88b0-553c18b1059c_2053x709.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YSnY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84da4b9-75dc-4e81-88b0-553c18b1059c_2053x709.png" width="1456" height="503" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c84da4b9-75dc-4e81-88b0-553c18b1059c_2053x709.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:503,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:160277,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84da4b9-75dc-4e81-88b0-553c18b1059c_2053x709.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YSnY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84da4b9-75dc-4e81-88b0-553c18b1059c_2053x709.png 424w, https://substackcdn.com/image/fetch/$s_!YSnY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84da4b9-75dc-4e81-88b0-553c18b1059c_2053x709.png 848w, https://substackcdn.com/image/fetch/$s_!YSnY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84da4b9-75dc-4e81-88b0-553c18b1059c_2053x709.png 1272w, https://substackcdn.com/image/fetch/$s_!YSnY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84da4b9-75dc-4e81-88b0-553c18b1059c_2053x709.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p><strong>Key insights:</strong> They find a tradeoff between sequential revisions and parallel sampling at test time. The optimal ratio depends on both the compute budget and the question&#8217;s difficulty:</p><ul><li><p>Easier questions are best served by purely sequential computation.</p></li><li><p>Harder questions benefit from a balanced mix.</p></li></ul><p>By adjusting this ratio per question, they can outperform standard best-of-N baselines using up to 4&#215; less compute.</p><p>Notice that all of these papers assume that the search process will eventually lead to a correct answer. 
In other words, they assume that LLMs already know the right answer, but just need the right trigger to reach it.</p><p>This is exactly the hypothesis behind &#128073;<strong>AlphaMath</strong>:</p><blockquote><p>&#128064; Pre-trained LLMs already contain rich mathematical knowledge. The challenge is not learning from scratch, but <em>activating</em> this latent reasoning through better prompts and smarter search.</p></blockquote><p>The paper follows prior works in modeling text generation as an MDP, where the LLM acts as a policy: At step <em>t</em>, state = partial solution <em>s<sub>t</sub></em>, action = next step <em>a<sub>t</sub>,</em> and<br>the transition is deterministic:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s_{t+1} = \\text{Cat}(s_t, a_t), \\quad \\pi_\\theta(a_t \\mid s_t) = \\text{LLM}(a_t \\mid s_t)\n&quot;,&quot;id&quot;:&quot;AKWFMRMWOD&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Here, to find the optimal reasoning path, AlphaMath combines:</p><ul><li><p>A Monte Carlo Tree Search (MCTS) to balance exploration and exploitation, which is typical as in prior work:</p><p></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eKe6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9fe9a-bbb2-4126-aec1-8583c65bae8c_1381x550.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eKe6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9fe9a-bbb2-4126-aec1-8583c65bae8c_1381x550.png 424w, 
https://substackcdn.com/image/fetch/$s_!eKe6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9fe9a-bbb2-4126-aec1-8583c65bae8c_1381x550.png 848w, https://substackcdn.com/image/fetch/$s_!eKe6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9fe9a-bbb2-4126-aec1-8583c65bae8c_1381x550.png 1272w, https://substackcdn.com/image/fetch/$s_!eKe6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9fe9a-bbb2-4126-aec1-8583c65bae8c_1381x550.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eKe6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9fe9a-bbb2-4126-aec1-8583c65bae8c_1381x550.png" width="1381" height="550" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6eb9fe9a-bbb2-4126-aec1-8583c65bae8c_1381x550.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:550,&quot;width&quot;:1381,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:202313,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9fe9a-bbb2-4126-aec1-8583c65bae8c_1381x550.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!eKe6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9fe9a-bbb2-4126-aec1-8583c65bae8c_1381x550.png 424w, https://substackcdn.com/image/fetch/$s_!eKe6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9fe9a-bbb2-4126-aec1-8583c65bae8c_1381x550.png 848w, https://substackcdn.com/image/fetch/$s_!eKe6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9fe9a-bbb2-4126-aec1-8583c65bae8c_1381x550.png 1272w, https://substackcdn.com/image/fetch/$s_!eKe6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eb9fe9a-bbb2-4126-aec1-8583c65bae8c_1381x550.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>The novel bit is that a lightweight value model was added to the same LLM to judge intermediate reasoning quality</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RU6N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63566bc4-db93-4b9a-ab5c-761e347d0fdf_602x504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RU6N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63566bc4-db93-4b9a-ab5c-761e347d0fdf_602x504.png 424w, https://substackcdn.com/image/fetch/$s_!RU6N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63566bc4-db93-4b9a-ab5c-761e347d0fdf_602x504.png 848w, https://substackcdn.com/image/fetch/$s_!RU6N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63566bc4-db93-4b9a-ab5c-761e347d0fdf_602x504.png 1272w, https://substackcdn.com/image/fetch/$s_!RU6N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63566bc4-db93-4b9a-ab5c-761e347d0fdf_602x504.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!RU6N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63566bc4-db93-4b9a-ab5c-761e347d0fdf_602x504.png" width="422" height="353.30232558139534" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63566bc4-db93-4b9a-ab5c-761e347d0fdf_602x504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:504,&quot;width&quot;:602,&quot;resizeWidth&quot;:422,&quot;bytes&quot;:59336,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63566bc4-db93-4b9a-ab5c-761e347d0fdf_602x504.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RU6N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63566bc4-db93-4b9a-ab5c-761e347d0fdf_602x504.png 424w, https://substackcdn.com/image/fetch/$s_!RU6N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63566bc4-db93-4b9a-ab5c-761e347d0fdf_602x504.png 848w, https://substackcdn.com/image/fetch/$s_!RU6N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63566bc4-db93-4b9a-ab5c-761e347d0fdf_602x504.png 1272w, https://substackcdn.com/image/fetch/$s_!RU6N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63566bc4-db93-4b9a-ab5c-761e347d0fdf_602x504.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Instead of relying on the value estimated in MCTS, the paper proposes a value model <em>V<sub>&#981;</sub>(s)</em> to estimate how likely a partial solution is to lead to a correct final answer.</p><blockquote><p>&#128064; A value model can generalize to new states, which is useful for bootstrapping the value estimation of MCTS.</p></blockquote><p>Training this model relies on Monte Carlo evaluation from multiple rollouts:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V_e(s_t) = \\frac{1}{N} \\sum_{i=1}^{N} r(a^{(i)}_{\\geq t}, s^{(i)}_{> t} \\mid 
s_t)\n&quot;,&quot;id&quot;:&quot;OCSKFWYMBK&quot;}" data-component-name="LatexBlockToDOM"></div><p>Then, we can optimize the value model via regression:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{V_\\phi}(s) = \\left(V_\\phi(s) - V_e(s)\\right)^2\n&quot;,&quot;id&quot;:&quot;CUEKLMGRZI&quot;}" data-component-name="LatexBlockToDOM"></div><p>Given the value model, the MCTS phase uses both policy and value models to guide simulations:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SRJb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc73b949-2d91-4e07-8189-b5089f4754c1_1536x680.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SRJb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc73b949-2d91-4e07-8189-b5089f4754c1_1536x680.png 424w, https://substackcdn.com/image/fetch/$s_!SRJb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc73b949-2d91-4e07-8189-b5089f4754c1_1536x680.png 848w, https://substackcdn.com/image/fetch/$s_!SRJb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc73b949-2d91-4e07-8189-b5089f4754c1_1536x680.png 1272w, https://substackcdn.com/image/fetch/$s_!SRJb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc73b949-2d91-4e07-8189-b5089f4754c1_1536x680.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SRJb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc73b949-2d91-4e07-8189-b5089f4754c1_1536x680.png" width="1456" height="645" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc73b949-2d91-4e07-8189-b5089f4754c1_1536x680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:645,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:261425,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc73b949-2d91-4e07-8189-b5089f4754c1_1536x680.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SRJb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc73b949-2d91-4e07-8189-b5089f4754c1_1536x680.png 424w, https://substackcdn.com/image/fetch/$s_!SRJb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc73b949-2d91-4e07-8189-b5089f4754c1_1536x680.png 848w, https://substackcdn.com/image/fetch/$s_!SRJb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc73b949-2d91-4e07-8189-b5089f4754c1_1536x680.png 1272w, https://substackcdn.com/image/fetch/$s_!SRJb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc73b949-2d91-4e07-8189-b5089f4754c1_1536x680.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Summary of the four MCTS steps in AlphaMath. Source: [15].</figcaption></figure></div><p>Here, some steps introduce novelty. 
First, action selection uses a PUCT-style rule:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;a_t = \\arg\\max_a \\left[ \\hat{Q}(s_t, a) + c_{\\text{puct}} \\cdot \\pi_\\theta(a \\mid s_t) \\cdot \\frac{\\sqrt{N_{\\text{parent}(a)}}}{1 + N(s_t, a)} \\right]\n&quot;,&quot;id&quot;:&quot;BSAMMVFFLL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Each term here has a clear role:</p><ul><li><p><em>Q&#770;(s<sub>t</sub>, a)</em>: the average return (value) that MCTS has obtained so far when taking action <em>a</em> at state <em>s<sub>t</sub></em>, based on past simulations. It encourages picking the best-known option so far (exploitation).</p></li><li><p><em>&#960;<sub>&#952;</sub>(a&#8739;s<sub>t</sub>)</em>: the likelihood of taking action <em>a</em> under the current policy. It biases the search toward actions the model thinks are promising, which is a key LLM-specific addition, and it reduces the number of simulations needed by starting from smart guesses (prior guidance).</p></li><li><p>The counting term favors actions that haven&#8217;t been visited much.</p><ul><li><p>If <em>N(s<sub>t</sub>,a)</em> is low, the denominator is small, so the whole term becomes large &#8594; we explore new actions.</p></li><li><p>As <em>N(s<sub>t</sub>,a)</em> increases (more visits), the term shrinks, reducing exploration pressure.</p></li></ul></li></ul><p>Second, the evaluation step mixes the observed reward with the value model&#8217;s estimate:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{V}(s_t)^{(i)} = (1 - \\lambda) \\cdot V_\\phi(s_t) + \\lambda \\cdot r\\left( a^{(i)}_{t' \\geq t},\\, s^{(i)}_{t' > t} \\mid s_t \\right)\n\n&quot;,&quot;id&quot;:&quot;FZBPVYHQZF&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>i</em> denotes the <em>i</em>-th simulation and <em>&#955;</em> sets the weight between the two terms. 
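</p><p>To make the two rules above concrete, here is a small worked example. The numbers are made up for illustration and are not taken from [15]: assume <em>c<sub>puct</sub> = 1.5</em>, a parent visit count of 16 (so its square root is 4), and two candidate actions <em>a<sub>A</sub></em> and <em>a<sub>B</sub></em> (hypothetical names).</p><ul><li><p><em>a<sub>A</sub></em> with <em>Q&#770; = 0.6</em>, <em>&#960;<sub>&#952;</sub> = 0.2</em>, <em>N = 9</em>: score = 0.6 + 1.5 &#215; 0.2 &#215; 4 / (1 + 9) = 0.6 + 0.12 = 0.72.</p></li><li><p><em>a<sub>B</sub></em> with <em>Q&#770; = 0.4</em>, <em>&#960;<sub>&#952;</sub> = 0.5</em>, <em>N = 1</em>: score = 0.4 + 1.5 &#215; 0.5 &#215; 4 / (1 + 1) = 0.4 + 1.5 = 1.9.</p></li></ul><p>Under these assumed values the rule selects <em>a<sub>B</sub></em> even though its average return is lower, because a strong prior and a low visit count inflate the exploration term. For the evaluation step, taking <em>&#955; = 0.5</em>, <em>V<sub>&#981;</sub>(s<sub>t</sub>) = 0.4</em>, and an observed reward of 1 (again, illustrative values only) gives <em>V&#770; = 0.5 &#215; 0.4 + 0.5 &#215; 1 = 0.7</em>.</p><p>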
Given the current value estimate, the backpropagation step updates the state-action value as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\nN(s, a) &amp;\\leftarrow N(s, a) + 1 \\\\\n\\hat{Q}(s, a) &amp;\\leftarrow \\frac{1}{N(s, a)} \\sum_{j=1}^{i} \\mathbb{I}_{s, a \\rightarrow s_t} \\, \\hat{V}(s_t^{(j)})\n\\end{align}&quot;,&quot;id&quot;:&quot;UVWYWWDALW&quot;}" data-component-name="LatexBlockToDOM"></div><p>After running <em>N</em> MCTS simulations, we get a search tree with state-action values <em>Q(s, a)</em> stored at each node. Since the transition is deterministic and non-terminal steps are assumed to carry zero reward, for non-terminal nodes we have:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q(s_t, a_t) = r(s_t, a_t) + V(s_{t+1}) = V(s_{t+1})\n&quot;,&quot;id&quot;:&quot;WQWYJSQPZW&quot;}" data-component-name="LatexBlockToDOM"></div><p>This means the MCTS value of the next state can be estimated as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{V}(s_{t+1}) = \\hat{Q}(s_t, a_t)\n&quot;,&quot;id&quot;:&quot;QDGIRKNEPZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>This &#8220;ground-truth&#8221; value will be used to train the value model.</p><p>Finally, although MCTS works, it&#8217;s slow. To speed up inference at deployment, AlphaMath introduces Step-level Beam Search (SBS):</p><ul><li><p>At each step, generate <em>B<sub>2</sub></em> next-step candidates with the LLM.</p></li><li><p>Evaluate each with <em>V<sub>&#981;</sub></em> and keep the top <em>B<sub>1</sub></em>.</p></li></ul><blockquote><p>&#128064; No simulation, no tree: just fast reasoning with value guidance. MCTS is just the tool for training the value model. 
When <em>B<sub>1</sub>=1</em>, SBS becomes a <strong>fast MCTS approximation</strong> for real-time use.</p></blockquote><p>The results show that SBS is much faster while keeping competitive performance:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7qkB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8d3556-aaa7-4c7d-b49b-f3920147a429_1290x911.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7qkB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8d3556-aaa7-4c7d-b49b-f3920147a429_1290x911.png 424w, https://substackcdn.com/image/fetch/$s_!7qkB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8d3556-aaa7-4c7d-b49b-f3920147a429_1290x911.png 848w, https://substackcdn.com/image/fetch/$s_!7qkB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8d3556-aaa7-4c7d-b49b-f3920147a429_1290x911.png 1272w, https://substackcdn.com/image/fetch/$s_!7qkB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8d3556-aaa7-4c7d-b49b-f3920147a429_1290x911.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7qkB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8d3556-aaa7-4c7d-b49b-f3920147a429_1290x911.png" width="470" height="331.91472868217056" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad8d3556-aaa7-4c7d-b49b-f3920147a429_1290x911.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:911,&quot;width&quot;:1290,&quot;resizeWidth&quot;:470,&quot;bytes&quot;:221546,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/163970650?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8d3556-aaa7-4c7d-b49b-f3920147a429_1290x911.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7qkB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8d3556-aaa7-4c7d-b49b-f3920147a429_1290x911.png 424w, https://substackcdn.com/image/fetch/$s_!7qkB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8d3556-aaa7-4c7d-b49b-f3920147a429_1290x911.png 848w, https://substackcdn.com/image/fetch/$s_!7qkB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8d3556-aaa7-4c7d-b49b-f3920147a429_1290x911.png 1272w, https://substackcdn.com/image/fetch/$s_!7qkB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad8d3556-aaa7-4c7d-b49b-f3920147a429_1290x911.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>Conclusion &amp; What&#8217;s Next</h2><p>This post covered test-time reinforcement learning methods that guide large language models without further gradient updates. By treating reasoning as a sequential decision process, we looked at techniques such as:</p><ul><li><p>Classic Tree Search</p></li><li><p>Monte Carlo Tree Search</p></li><li><p>Process Reward Modeling</p></li></ul><p>But even with advanced test-time methods like MCTS and beam search, the model&#8217;s reasoning is still guided from the outside, with final answers as the only source of supervision. These methods help with step selection, but they don&#8217;t change the model&#8217;s internal reasoning ability. &#129504; <em>So what if we want to go further? 
What if we aim to improve the LLM&#8217;s internal reasoning competence, not just steer it at test time?</em></p><p>To do that, we need more than search: we need to update the model&#8217;s weights. This is where reinforcement learning during post-training comes in: by assigning reward signals not only to the final answer, but also to the quality and structure of intermediate reasoning steps, we can directly fine-tune LLMs to reason better from the inside out.</p><p>&#128284; <a href="https://open.substack.com/pub/hungleai/p/improving-llm-reasoning-with-post?r=3an4d1&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">Next up: </a>We&#8217;ll study how reinforcement learning frameworks can be adapted to fine-tune LLMs for better reasoning, using novel rewards, RL algorithms, and more. Stay tuned. Reasoning is just getting smarter. &#129513;&#128640;</p><div><hr></div><h2>References</h2><p>[1] Achiam, Josh, et al. "Gpt-4 technical report." <em>arXiv preprint arXiv:2303.08774</em> (2023).</p><p>[2] Jaech, Aaron, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar et al. "Openai o1 system card." <em>arXiv preprint arXiv:2412.16720</em> (2024).</p><p>[3] Ji, Yixin, Juntao Li, Hai Ye, Kaixin Wu, Jia Xu, Linjian Mo, and Min Zhang. "Test-time Computing: from System-1 Thinking to System-2 Thinking." <em>arXiv preprint arXiv:2501.02497</em> (2025).</p><p>[4] Yao, Shunyu, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. "Tree of thoughts: Deliberate problem solving with large language models." <em>Advances in Neural Information Processing Systems</em> 36 (2024).</p><p>[5] Kambhampati, Subbarao, Kaya Stechly, and Karthik Valmeekam. "(How) Do reasoning models reason?" <em>Annals of the New York Academy of Sciences</em> (2025).</p><p>[6] Lightman, Hunter, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. "Let's verify step by step." 
In <em>The Twelfth International Conference on Learning Representations</em>. 2023.</p><p>[7] Wang, Peiyi, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations." In <em>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</em>, pp. 9426-9439. 2024.</p><p>[8] Li, Wendi, and Yixuan Li. "Process reward model with q-value rankings." In <em>The Thirteenth International Conference on Learning Representations</em>. 2025.</p><p>[9] Setlur, Amrith, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. "Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning." In <em>The Thirteenth International Conference on Learning Representations</em>. 2025.</p><p>[10] Kang, Jikun, Xin Zhe Li, Xi Chen, Amirreza Kazemi, Qianyi Sun, Boxing Chen, Dong Li et al. "Mindstar: Enhancing math reasoning in pre-trained llms at inference time." <em>arXiv preprint arXiv:2405.16265</em> (2024).</p><p>[11] Meng, Silin, Yiwei Wang, Cheng-Fu Yang, Nanyun Peng, and Kai-Wei Chang. "LLM-A*: Large Language Model Enhanced Incremental Heuristic Search on Path Planning." In <em>Findings of the Association for Computational Linguistics: EMNLP 2024</em>, pp. 1087-1102. 2024.</p><p>[12] Wang, Chaojie, Yanchen Deng, Zhiyi Lyu, Liang Zeng, Jujie He, Shuicheng Yan, and Bo An. "Q*: Improving multi-step reasoning for llms with deliberative planning." <em>arXiv preprint arXiv:2406.14283</em> (2024).</p><p>[13] Zhou, Andy, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. "Language agent tree search unifies reasoning, acting, and planning in language models." In <em>Proceedings of the 41st International Conference on Machine Learning</em>, pp. 62138-62160. 2024.</p><p>[14] Snell, Charlie, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 
"Scaling llm test-time compute optimally can be more effective than scaling model parameters." <em>arXiv preprint arXiv:2408.03314</em> (2024).</p><p>[15] Chen, Guoxin, Minpeng Liao, Chengxi Li, and Kai Fan. "AlphaMath Almost Zero: Process Supervision without Process." In <em>The Thirty-eighth Annual Conference on Neural Information Processing Systems</em>. 2024.</p>]]></content:encoded></item><item><title><![CDATA[The Best of Time-Series Forecasting (Part II): Advancements in Time Series Modeling Through Large Language Models]]></title><description><![CDATA[A comprehensive collection of leading LLM papers on time series forecasting]]></description><link>https://hungleai.substack.com/p/the-best-of-time-series-forecasting-a98</link><guid isPermaLink="false">https://hungleai.substack.com/p/the-best-of-time-series-forecasting-a98</guid><dc:creator><![CDATA[Hung Le]]></dc:creator><pubDate>Tue, 08 Apr 2025 23:42:51 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!iplR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F707880cb-ad3a-4b1a-a65b-f17376a6ac82_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://hungleai.substack.com/p/the-best-of-time-series-forecasting">Part 1</a>&nbsp;of my blog looked at how time-series forecasting has evolved&#8212;from traditional models like ARIMA to deep learning methods like Transformers. These approaches brought big improvements, especially in handling complex and long-range patterns. However, they also have limits, especially when it comes to adapting to new data or working well across very different domains.</p><p>Now, a new wave of models is entering the scene: Large Language Models (LLMs). These models were originally built for language tasks&#8212;like chatbots, summarizing text, and answering questions. But recently, researchers have started using LLMs for time-series forecasting, too.</p><p>In this post, we&#8217;ll explore: </p><p>&#10004; How LLMs are being adapted to handle time-series data<br>&#10004; Some recent research and early results<br>&#10004; Key challenges and open questions</p><p>LLMs won&#8217;t replace every forecasting model, but they&#8217;re opening up new ideas about how we can approach time-series problems. 
Let&#8217;s take a look at where this is all heading.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iplR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F707880cb-ad3a-4b1a-a65b-f17376a6ac82_1024x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iplR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F707880cb-ad3a-4b1a-a65b-f17376a6ac82_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!iplR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F707880cb-ad3a-4b1a-a65b-f17376a6ac82_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!iplR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F707880cb-ad3a-4b1a-a65b-f17376a6ac82_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!iplR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F707880cb-ad3a-4b1a-a65b-f17376a6ac82_1024x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iplR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F707880cb-ad3a-4b1a-a65b-f17376a6ac82_1024x1024.jpeg" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/707880cb-ad3a-4b1a-a65b-f17376a6ac82_1024x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:231476,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F707880cb-ad3a-4b1a-a65b-f17376a6ac82_1024x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iplR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F707880cb-ad3a-4b1a-a65b-f17376a6ac82_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!iplR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F707880cb-ad3a-4b1a-a65b-f17376a6ac82_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!iplR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F707880cb-ad3a-4b1a-a65b-f17376a6ac82_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!iplR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F707880cb-ad3a-4b1a-a65b-f17376a6ac82_1024x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">LLM vs Time-Series Forecasting. Source: Copilot. </figcaption></figure></div><h4>Table of Contents</h4><ul><li><p><a href="https://hungleai.substack.com/i/158621521/why-do-we-need-large-language-models">Why Do We Need Large Language Models?</a></p></li><li><p><a href="https://hungleai.substack.com/i/158621521/direct-application-of-pre-trained-llms">Direct Application of Pre-trained LLMs</a></p></li><li><p><a href="https://hungleai.substack.com/i/158621521/designing-and-fine-tuning-llms-for-time-series-forecasting">Designing and Fine-tuning LLMs for Time Series Forecasting</a></p></li><li><p><a href="https://hungleai.substack.com/i/158621521/building-foundational-models-for-time-series-a-new-era">Building Foundational Models for Time Series: A New Era?</a></p></li><li><p><a href="https://hungleai.substack.com/i/158621521/conclusion">Conclusion</a></p></li></ul><div><hr></div><h2>Why Do We Need Large Language Models?</h2><p>Time-series forecasting has been a long-standing challenge in data science, underpinning critical applications like financial market prediction, weather forecasting, and supply chain management. For the last 30 years, statistical models such as ARIMA and ETS have been extensively used for time-series tasks, followed by deep learning methods like LSTMs and Transformers. So, why are researchers now turning to Large Language Models (LLMs) for time-series forecasting?</p><h4>Pattern Recognition Capability</h4><p>LLMs excel at recognizing complex patterns across diverse data sources, including both structured numerical sequences and unstructured text. Their self-attention mechanism allows them to capture long-range dependencies and underlying trends in time-series numbers. 
For example, big LLMs like ChatGPT can find the number patterns easily:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iWh8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b13c6f-2784-43ed-863e-48a47243c46e_952x662.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iWh8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b13c6f-2784-43ed-863e-48a47243c46e_952x662.png 424w, https://substackcdn.com/image/fetch/$s_!iWh8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b13c6f-2784-43ed-863e-48a47243c46e_952x662.png 848w, https://substackcdn.com/image/fetch/$s_!iWh8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b13c6f-2784-43ed-863e-48a47243c46e_952x662.png 1272w, https://substackcdn.com/image/fetch/$s_!iWh8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b13c6f-2784-43ed-863e-48a47243c46e_952x662.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iWh8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b13c6f-2784-43ed-863e-48a47243c46e_952x662.png" width="952" height="662" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1b13c6f-2784-43ed-863e-48a47243c46e_952x662.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:952,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:35738,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b13c6f-2784-43ed-863e-48a47243c46e_952x662.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iWh8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b13c6f-2784-43ed-863e-48a47243c46e_952x662.png 424w, https://substackcdn.com/image/fetch/$s_!iWh8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b13c6f-2784-43ed-863e-48a47243c46e_952x662.png 848w, https://substackcdn.com/image/fetch/$s_!iWh8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b13c6f-2784-43ed-863e-48a47243c46e_952x662.png 1272w, https://substackcdn.com/image/fetch/$s_!iWh8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1b13c6f-2784-43ed-863e-48a47243c46e_952x662.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4><strong>Understanding Context Beyond Numbers</strong></h4><p>Additionally, LLMs can leverage their pre-trained knowledge to identify contextual signals&#8212;such as economic shifts, seasonal effects, or market sentiment&#8212;helping to enhance time-series forecasting beyond traditional statistical and deep learning approaches.</p><h4><strong>Long-term Forecasting</strong></h4><p>While traditional time-series models learn patterns from past data, they often struggle with long-term dependencies. LLMs, especially those equipped with long-term memory architectures, can store and recall historical insights more effectively, improving their ability to recognize seasonality, anomalies, and evolving trends over time. 
</p><blockquote><p>&#128064; That said, since LLMs are primarily designed for text data, applying them to time-series data is not straightforward. In the following sections, we will explore recent works that address this challenge.</p></blockquote><div><hr></div><h2>Direct Application of Pre-trained LLMs</h2><p>A key approach is converting numerical time-series data into text, representing each value as a string, i.e., <strong>prompt-based forecasting</strong>. Because the sequence order is maintained in string form, LLMs can still capture temporal dependencies. Here, the choice of tokenization is crucial, as it determines how the model processes and learns patterns from the data. </p><p>Several notable examples illustrate the direct prompting approach. &#128073;<strong>PromptCast </strong>[1] represents a pioneering effort in this direction, converting numerical time series into natural language prompts using predefined templates. This framework frames the forecasting task as a sentence-to-sentence generation problem, where the LLM is prompted with historical data described in natural language and asked to generate a sentence representing the forecast.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QMCB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e007a0c-00e1-4456-8c2d-165c85f15272_1170x512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QMCB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e007a0c-00e1-4456-8c2d-165c85f15272_1170x512.png 424w, 
https://substackcdn.com/image/fetch/$s_!QMCB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e007a0c-00e1-4456-8c2d-165c85f15272_1170x512.png 848w, https://substackcdn.com/image/fetch/$s_!QMCB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e007a0c-00e1-4456-8c2d-165c85f15272_1170x512.png 1272w, https://substackcdn.com/image/fetch/$s_!QMCB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e007a0c-00e1-4456-8c2d-165c85f15272_1170x512.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QMCB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e007a0c-00e1-4456-8c2d-165c85f15272_1170x512.png" width="1170" height="512" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e007a0c-00e1-4456-8c2d-165c85f15272_1170x512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:512,&quot;width&quot;:1170,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:131286,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e007a0c-00e1-4456-8c2d-165c85f15272_1170x512.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!QMCB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e007a0c-00e1-4456-8c2d-165c85f15272_1170x512.png 424w, https://substackcdn.com/image/fetch/$s_!QMCB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e007a0c-00e1-4456-8c2d-165c85f15272_1170x512.png 848w, https://substackcdn.com/image/fetch/$s_!QMCB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e007a0c-00e1-4456-8c2d-165c85f15272_1170x512.png 1272w, https://substackcdn.com/image/fetch/$s_!QMCB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e007a0c-00e1-4456-8c2d-165c85f15272_1170x512.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">PromptCast Framework: time-series data is directly used as strings in the prompt. Source: [1]</figcaption></figure></div><p>The paper introduces simple templates to translate time series to prompts. For example, for the ECL dataset:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xaT8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43123f5e-bf4d-40ab-985f-a8c148466c41_1971x254.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xaT8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43123f5e-bf4d-40ab-985f-a8c148466c41_1971x254.png 424w, https://substackcdn.com/image/fetch/$s_!xaT8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43123f5e-bf4d-40ab-985f-a8c148466c41_1971x254.png 848w, https://substackcdn.com/image/fetch/$s_!xaT8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43123f5e-bf4d-40ab-985f-a8c148466c41_1971x254.png 1272w, https://substackcdn.com/image/fetch/$s_!xaT8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43123f5e-bf4d-40ab-985f-a8c148466c41_1971x254.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!xaT8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43123f5e-bf4d-40ab-985f-a8c148466c41_1971x254.png" width="1456" height="188" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43123f5e-bf4d-40ab-985f-a8c148466c41_1971x254.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:188,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:93019,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43123f5e-bf4d-40ab-985f-a8c148466c41_1971x254.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xaT8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43123f5e-bf4d-40ab-985f-a8c148466c41_1971x254.png 424w, https://substackcdn.com/image/fetch/$s_!xaT8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43123f5e-bf4d-40ab-985f-a8c148466c41_1971x254.png 848w, https://substackcdn.com/image/fetch/$s_!xaT8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43123f5e-bf4d-40ab-985f-a8c148466c41_1971x254.png 1272w, https://substackcdn.com/image/fetch/$s_!xaT8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43123f5e-bf4d-40ab-985f-a8c148466c41_1971x254.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Despite its simplicity, the approach achieves reasonable results compared with deep learning methods such as Autoformer and FEDformer:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wixa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16592558-fc50-4aec-a394-aa4b74b2657b_2737x565.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wixa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16592558-fc50-4aec-a394-aa4b74b2657b_2737x565.png 424w, https://substackcdn.com/image/fetch/$s_!Wixa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16592558-fc50-4aec-a394-aa4b74b2657b_2737x565.png 848w, https://substackcdn.com/image/fetch/$s_!Wixa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16592558-fc50-4aec-a394-aa4b74b2657b_2737x565.png 1272w, https://substackcdn.com/image/fetch/$s_!Wixa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16592558-fc50-4aec-a394-aa4b74b2657b_2737x565.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Wixa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16592558-fc50-4aec-a394-aa4b74b2657b_2737x565.png" width="2737" height="565" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16592558-fc50-4aec-a394-aa4b74b2657b_2737x565.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:565,&quot;width&quot;:2737,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:310772,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F532d7222-b32c-4e6f-9e7d-caec659f1ada_2737x565.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Wixa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16592558-fc50-4aec-a394-aa4b74b2657b_2737x565.png 424w, https://substackcdn.com/image/fetch/$s_!Wixa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16592558-fc50-4aec-a394-aa4b74b2657b_2737x565.png 848w, https://substackcdn.com/image/fetch/$s_!Wixa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16592558-fc50-4aec-a394-aa4b74b2657b_2737x565.png 1272w, https://substackcdn.com/image/fetch/$s_!Wixa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16592558-fc50-4aec-a394-aa4b74b2657b_2737x565.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; The paper only uses small, older language models such as BigBird and Pegasus. The results may be even better with stronger, more modern LLMs such as ChatGPT. 
</p></blockquote><p>Unlike PromptCast, which focuses on prompt engineering, the &#128073;<strong>LLMTime</strong> [2] paper demonstrates that LLMs can be used directly as forecasters without the need for additional text or prompt engineering, <em>provided that the numerical values are carefully preprocessed</em>.</p><p>For example, the choice of tokenization (with space or not) matters:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y4Hf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff437fb9b-af2e-43e5-8315-9e496166b965_2239x546.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y4Hf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff437fb9b-af2e-43e5-8315-9e496166b965_2239x546.png 424w, https://substackcdn.com/image/fetch/$s_!y4Hf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff437fb9b-af2e-43e5-8315-9e496166b965_2239x546.png 848w, https://substackcdn.com/image/fetch/$s_!y4Hf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff437fb9b-af2e-43e5-8315-9e496166b965_2239x546.png 1272w, https://substackcdn.com/image/fetch/$s_!y4Hf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff437fb9b-af2e-43e5-8315-9e496166b965_2239x546.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y4Hf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff437fb9b-af2e-43e5-8315-9e496166b965_2239x546.png" width="1456" height="355" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f437fb9b-af2e-43e5-8315-9e496166b965_2239x546.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:355,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:300819,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff437fb9b-af2e-43e5-8315-9e496166b965_2239x546.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!y4Hf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff437fb9b-af2e-43e5-8315-9e496166b965_2239x546.png 424w, https://substackcdn.com/image/fetch/$s_!y4Hf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff437fb9b-af2e-43e5-8315-9e496166b965_2239x546.png 848w, https://substackcdn.com/image/fetch/$s_!y4Hf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff437fb9b-af2e-43e5-8315-9e496166b965_2239x546.png 1272w, https://substackcdn.com/image/fetch/$s_!y4Hf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff437fb9b-af2e-43e5-8315-9e496166b965_2239x546.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Tokenization with space between digits yields better forecasting. 
Source: [2]</figcaption></figure></div><p>In particular, the paper proposes:</p><ul><li><p>Digits are spaced out so that each digit is tokenized separately.</p></li><li><p>Commas mark time steps.</p></li><li><p>Decimal points are removed to save context (e.g., by multiplying the original value by 100 and dropping any remaining fraction, so 0.123 becomes 12).</p></li><li><p>Example: <strong>0.123, 1.23, 12.3, 123.0 &#8594; "1 2 , 1 2 3 , 1 2 3 0 , 1 2 3 0 0"</strong></p></li></ul><p>Other tricks include:</p><ul><li><p><strong>Rescaling</strong>: The method rescales time series values so that the &#945;-percentile of the rescaled data is 1, preventing extreme values from dominating while ensuring the LLM sees some examples where digit lengths change. For example, suppose a time series has values <code>[10, 50, 200, 1000]</code> and we set &#945; = 0.75. If the 75th percentile value is 200, we scale all values by 1/200, so the rescaled series becomes <code>[0.05, 0.25, 1, 5]</code>.</p></li><li><p><strong>Sampling</strong>: To forecast, generate multiple samples (e.g., 20) from the LLM and use their statistics (e.g., median or quantiles) to create a point or probabilistic estimate.</p></li></ul><p><strong>Likelihood Estimation</strong>: To sample well, the paper refines how the probability of the LLM&#8217;s output is computed. Because each number is generated digit by digit, the LLM acts like a hierarchical softmax distribution over numerical values. 
A number <em>u</em> of <em>n</em> digits has probability:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\np(u) = p(u_n | u_{n-1}, \\dots, u_0) p(u_{n-1} | u_{n-2}, \\dots, u_0) \\dots p(u_0)\n\n&quot;,&quot;id&quot;:&quot;ZQGQMXCBUM&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nmbd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbae2da-bfd2-4a65-9c3f-ddfbd3924b72_1207x580.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nmbd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbae2da-bfd2-4a65-9c3f-ddfbd3924b72_1207x580.png 424w, https://substackcdn.com/image/fetch/$s_!Nmbd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbae2da-bfd2-4a65-9c3f-ddfbd3924b72_1207x580.png 848w, https://substackcdn.com/image/fetch/$s_!Nmbd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbae2da-bfd2-4a65-9c3f-ddfbd3924b72_1207x580.png 1272w, https://substackcdn.com/image/fetch/$s_!Nmbd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbae2da-bfd2-4a65-9c3f-ddfbd3924b72_1207x580.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Nmbd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbae2da-bfd2-4a65-9c3f-ddfbd3924b72_1207x580.png" width="483" height="232.09610604805303" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2fbae2da-bfd2-4a65-9c3f-ddfbd3924b72_1207x580.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:580,&quot;width&quot;:1207,&quot;resizeWidth&quot;:483,&quot;bytes&quot;:104831,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbae2da-bfd2-4a65-9c3f-ddfbd3924b72_1207x580.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Nmbd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbae2da-bfd2-4a65-9c3f-ddfbd3924b72_1207x580.png 424w, https://substackcdn.com/image/fetch/$s_!Nmbd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbae2da-bfd2-4a65-9c3f-ddfbd3924b72_1207x580.png 848w, https://substackcdn.com/image/fetch/$s_!Nmbd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbae2da-bfd2-4a65-9c3f-ddfbd3924b72_1207x580.png 1272w, https://substackcdn.com/image/fetch/$s_!Nmbd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fbae2da-bfd2-4a65-9c3f-ddfbd3924b72_1207x580.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><p>To estimate the probability of a &#8220;continuous&#8221; value <em>x</em>, the paper assigns a uniform probability within the bin that <em>x</em> falls into. 
For example, if <em>x=0.5371</em> falls into bin <em>k (0.537-0.538)</em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\log p(x) = \\log p_k + n \\log B\n&quot;,&quot;id&quot;:&quot;XPFZTVZAXK&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>p<sub>k</sub></em> is the probability the LLM assigns to bin <em>k</em>, <em>B</em> is the number base (e.g., <em>B=10</em>), and <em>n</em> is the number of digits after the radix point (here <em>n=3</em>). Each bin has width <em>B&#8315;&#8319;</em>, so the density is the bin probability divided by that width. </p><p>The experimental results show that LLMTime is competitive with standard and deep forecasting methods:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!soeE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b0e7c5-0c57-425a-b82c-fb714ba290e1_1866x714.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!soeE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b0e7c5-0c57-425a-b82c-fb714ba290e1_1866x714.png 424w, https://substackcdn.com/image/fetch/$s_!soeE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b0e7c5-0c57-425a-b82c-fb714ba290e1_1866x714.png 848w, https://substackcdn.com/image/fetch/$s_!soeE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b0e7c5-0c57-425a-b82c-fb714ba290e1_1866x714.png 1272w, https://substackcdn.com/image/fetch/$s_!soeE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b0e7c5-0c57-425a-b82c-fb714ba290e1_1866x714.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!soeE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b0e7c5-0c57-425a-b82c-fb714ba290e1_1866x714.png" width="1456" height="557" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0b0e7c5-0c57-425a-b82c-fb714ba290e1_1866x714.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:557,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:164078,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b0e7c5-0c57-425a-b82c-fb714ba290e1_1866x714.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!soeE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b0e7c5-0c57-425a-b82c-fb714ba290e1_1866x714.png 424w, https://substackcdn.com/image/fetch/$s_!soeE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b0e7c5-0c57-425a-b82c-fb714ba290e1_1866x714.png 848w, https://substackcdn.com/image/fetch/$s_!soeE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b0e7c5-0c57-425a-b82c-fb714ba290e1_1866x714.png 1272w, https://substackcdn.com/image/fetch/$s_!soeE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b0e7c5-0c57-425a-b82c-fb714ba290e1_1866x714.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Forecasting error: lower is better. Source: [2].</figcaption></figure></div><p>LLMs are showing promise in time-series forecasting, but basic prompting limits them. 
A more recent work, &#128073;<strong>LSTPrompt</strong>, introduces a sophisticated Chain-of-Thought prompt to break down the forecasting task, improving accuracy.</p><p>Some of its prompting techniques:</p><ul><li><p>Add a task description</p></li><li><p>Explain the long-term/short-term properties of the data</p></li><li><p>Include the &#8220;TimeBreath&#8221; prompt trick </p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ToWl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3231697-9c4f-4fee-91d3-973894493d40_1607x862.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ToWl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3231697-9c4f-4fee-91d3-973894493d40_1607x862.png 424w, https://substackcdn.com/image/fetch/$s_!ToWl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3231697-9c4f-4fee-91d3-973894493d40_1607x862.png 848w, https://substackcdn.com/image/fetch/$s_!ToWl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3231697-9c4f-4fee-91d3-973894493d40_1607x862.png 1272w, https://substackcdn.com/image/fetch/$s_!ToWl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3231697-9c4f-4fee-91d3-973894493d40_1607x862.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ToWl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3231697-9c4f-4fee-91d3-973894493d40_1607x862.png" width="1456" height="781" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3231697-9c4f-4fee-91d3-973894493d40_1607x862.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:781,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:314414,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3231697-9c4f-4fee-91d3-973894493d40_1607x862.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ToWl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3231697-9c4f-4fee-91d3-973894493d40_1607x862.png 424w, https://substackcdn.com/image/fetch/$s_!ToWl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3231697-9c4f-4fee-91d3-973894493d40_1607x862.png 848w, https://substackcdn.com/image/fetch/$s_!ToWl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3231697-9c4f-4fee-91d3-973894493d40_1607x862.png 1272w, https://substackcdn.com/image/fetch/$s_!ToWl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3231697-9c4f-4fee-91d3-973894493d40_1607x862.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Prompts help improve time-series forecasting. Source: [3].</figcaption></figure></div><p>Despite promising results, directly using LLMs &#8220;as-is&#8221; for time-series forecasting has key limitations:</p><p><strong>&#10060; No temporal inductive bias</strong>: LLMs lack built-in mechanisms for capturing time dependencies.</p><p><strong>&#10060; Tokenization artifacts</strong>: Discretization may reduce numerical precision.</p><p><strong>&#10060; High computational cost</strong>: More expensive than traditional models.</p><p><strong>&#10060; Limited extrapolation</strong>: Struggles with long-term forecasts.</p><p><strong>&#10060; Lack of domain constraints</strong>: Misses explicit trend/seasonality modeling.</p><p><strong>&#10060; High cost</strong>: Good results often require strong LLMs such as GPT-4, which are expensive and not open-source.</p><p>In the next section, we will see more complicated solutions that address these concerns.</p>
<div><hr></div><h2>Designing and Fine-tuning LLMs for Time Series Forecasting</h2><p>To overcome these limitations, research has focused on <strong>repurposing and fine-tuning</strong> LLMs for time-series forecasting. This includes adapting their architecture and further training pre-trained models on time-series data.</p><p>One core technique is aligning the modalities of time series and language so that LLMs become effective on time-series data. The most direct way is to fine-tune only a small part of the LLM (for instance, its early layers) on time-series forecasting tasks, aligning the model's encoding side with the time-series domain. For example, <strong>GPT4TS</strong> [4] fine-tunes only the positional embeddings and layer normalization parameters of the GPT model:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OLLL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef94d7aa-07d8-48b0-a0b2-6c734b436903_1101x640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OLLL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef94d7aa-07d8-48b0-a0b2-6c734b436903_1101x640.png 424w, https://substackcdn.com/image/fetch/$s_!OLLL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef94d7aa-07d8-48b0-a0b2-6c734b436903_1101x640.png 848w, https://substackcdn.com/image/fetch/$s_!OLLL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef94d7aa-07d8-48b0-a0b2-6c734b436903_1101x640.png 1272w, 
https://substackcdn.com/image/fetch/$s_!OLLL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef94d7aa-07d8-48b0-a0b2-6c734b436903_1101x640.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OLLL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef94d7aa-07d8-48b0-a0b2-6c734b436903_1101x640.png" width="1101" height="640" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef94d7aa-07d8-48b0-a0b2-6c734b436903_1101x640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:1101,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:131358,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef94d7aa-07d8-48b0-a0b2-6c734b436903_1101x640.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OLLL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef94d7aa-07d8-48b0-a0b2-6c734b436903_1101x640.png 424w, https://substackcdn.com/image/fetch/$s_!OLLL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef94d7aa-07d8-48b0-a0b2-6c734b436903_1101x640.png 848w, 
https://substackcdn.com/image/fetch/$s_!OLLL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef94d7aa-07d8-48b0-a0b2-6c734b436903_1101x640.png 1272w, https://substackcdn.com/image/fetch/$s_!OLLL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef94d7aa-07d8-48b0-a0b2-6c734b436903_1101x640.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Finetune LLMs for time-series. 
Source: [4].</figcaption></figure></div><p>Similarly, &#128073;<strong>LLM4TS [7]</strong> develops an LLM framework for time-series forecasting. Compared to GPT4TS, it invests more engineering effort in the timestamp and time-series embeddings that form the input tokens for the LLM. For example, the paper embeds the information at each time scale separately and then pools the results to obtain the temporal representation <em>e<sub>temp</sub></em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SycA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3408c0f9-4086-4999-bb94-0c310be700a1_1510x567.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SycA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3408c0f9-4086-4999-bb94-0c310be700a1_1510x567.png 424w, https://substackcdn.com/image/fetch/$s_!SycA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3408c0f9-4086-4999-bb94-0c310be700a1_1510x567.png 848w, https://substackcdn.com/image/fetch/$s_!SycA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3408c0f9-4086-4999-bb94-0c310be700a1_1510x567.png 1272w, https://substackcdn.com/image/fetch/$s_!SycA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3408c0f9-4086-4999-bb94-0c310be700a1_1510x567.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SycA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3408c0f9-4086-4999-bb94-0c310be700a1_1510x567.png" width="1456" height="547" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3408c0f9-4086-4999-bb94-0c310be700a1_1510x567.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:547,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:75938,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3408c0f9-4086-4999-bb94-0c310be700a1_1510x567.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SycA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3408c0f9-4086-4999-bb94-0c310be700a1_1510x567.png 424w, https://substackcdn.com/image/fetch/$s_!SycA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3408c0f9-4086-4999-bb94-0c310be700a1_1510x567.png 848w, https://substackcdn.com/image/fetch/$s_!SycA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3408c0f9-4086-4999-bb94-0c310be700a1_1510x567.png 1272w, https://substackcdn.com/image/fetch/$s_!SycA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3408c0f9-4086-4999-bb94-0c310be700a1_1510x567.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Timestamp embedding for LLMs. 
Source: [7].</figcaption></figure></div><p>Then, the final embedding is computed as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vu8V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bfcff8-ae82-4e55-8fdd-dfccadce2f92_781x124.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vu8V!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bfcff8-ae82-4e55-8fdd-dfccadce2f92_781x124.png 424w, https://substackcdn.com/image/fetch/$s_!vu8V!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bfcff8-ae82-4e55-8fdd-dfccadce2f92_781x124.png 848w, https://substackcdn.com/image/fetch/$s_!vu8V!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bfcff8-ae82-4e55-8fdd-dfccadce2f92_781x124.png 1272w, https://substackcdn.com/image/fetch/$s_!vu8V!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bfcff8-ae82-4e55-8fdd-dfccadce2f92_781x124.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vu8V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bfcff8-ae82-4e55-8fdd-dfccadce2f92_781x124.png" width="272" height="43.18565941101152" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55bfcff8-ae82-4e55-8fdd-dfccadce2f92_781x124.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:124,&quot;width&quot;:781,&quot;resizeWidth&quot;:272,&quot;bytes&quot;:12842,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bfcff8-ae82-4e55-8fdd-dfccadce2f92_781x124.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vu8V!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bfcff8-ae82-4e55-8fdd-dfccadce2f92_781x124.png 424w, https://substackcdn.com/image/fetch/$s_!vu8V!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bfcff8-ae82-4e55-8fdd-dfccadce2f92_781x124.png 848w, https://substackcdn.com/image/fetch/$s_!vu8V!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bfcff8-ae82-4e55-8fdd-dfccadce2f92_781x124.png 1272w, https://substackcdn.com/image/fetch/$s_!vu8V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55bfcff8-ae82-4e55-8fdd-dfccadce2f92_781x124.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>e<sub>token</sub> </em>is the time-series patch representation and <em>e<sub>pos</sub> </em>is the positional encoding. 
</p><p> The paper also introduces a two-step training approach:</p><ol><li><p><strong>Alignment training</strong>, where the model learns to align representations through a next-token prediction task, similar to the pretraining phase in LLMs;</p></li><li><p><strong>Forecasting fine-tuning</strong>, where the model is further trained on downstream time series forecasting tasks to specialize in prediction.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KORd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6807ce-71ac-40a3-b58d-e6c36fea09f6_1950x903.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KORd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6807ce-71ac-40a3-b58d-e6c36fea09f6_1950x903.png 424w, https://substackcdn.com/image/fetch/$s_!KORd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6807ce-71ac-40a3-b58d-e6c36fea09f6_1950x903.png 848w, https://substackcdn.com/image/fetch/$s_!KORd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6807ce-71ac-40a3-b58d-e6c36fea09f6_1950x903.png 1272w, https://substackcdn.com/image/fetch/$s_!KORd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6807ce-71ac-40a3-b58d-e6c36fea09f6_1950x903.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!KORd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6807ce-71ac-40a3-b58d-e6c36fea09f6_1950x903.png" width="1456" height="674" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f6807ce-71ac-40a3-b58d-e6c36fea09f6_1950x903.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:674,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:319606,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6807ce-71ac-40a3-b58d-e6c36fea09f6_1950x903.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KORd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6807ce-71ac-40a3-b58d-e6c36fea09f6_1950x903.png 424w, https://substackcdn.com/image/fetch/$s_!KORd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6807ce-71ac-40a3-b58d-e6c36fea09f6_1950x903.png 848w, https://substackcdn.com/image/fetch/$s_!KORd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6807ce-71ac-40a3-b58d-e6c36fea09f6_1950x903.png 1272w, https://substackcdn.com/image/fetch/$s_!KORd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6807ce-71ac-40a3-b58d-e6c36fea09f6_1950x903.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">2-stage training in LLM4TS. Source: [7].</figcaption></figure></div><blockquote><p>&#128064; Although these methods aim to align LLM text representations with time series data, they largely stop at simply retraining the encoding layers without deeper adaptation.</p></blockquote><p>Digging deeper, &#128073;<strong>Time-LLM</strong> [5] exemplifies a reprogramming framework where the input time series is transformed into text prototype representations and natural language prompts are used to guide the LLM's reasoning process. 
This approach keeps the underlying LLM intact and trains a separate reprogramming layer to translate the observed time series into a language-based representation, as illustrated in panel (c) of the figure below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!62P9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff152f2b6-55d0-4777-b708-24ef0df9a1e8_2389x607.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!62P9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff152f2b6-55d0-4777-b708-24ef0df9a1e8_2389x607.png 424w, https://substackcdn.com/image/fetch/$s_!62P9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff152f2b6-55d0-4777-b708-24ef0df9a1e8_2389x607.png 848w, https://substackcdn.com/image/fetch/$s_!62P9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff152f2b6-55d0-4777-b708-24ef0df9a1e8_2389x607.png 1272w, https://substackcdn.com/image/fetch/$s_!62P9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff152f2b6-55d0-4777-b708-24ef0df9a1e8_2389x607.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!62P9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff152f2b6-55d0-4777-b708-24ef0df9a1e8_2389x607.png" width="1456" height="370" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f152f2b6-55d0-4777-b708-24ef0df9a1e8_2389x607.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:370,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:245120,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff152f2b6-55d0-4777-b708-24ef0df9a1e8_2389x607.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!62P9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff152f2b6-55d0-4777-b708-24ef0df9a1e8_2389x607.png 424w, https://substackcdn.com/image/fetch/$s_!62P9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff152f2b6-55d0-4777-b708-24ef0df9a1e8_2389x607.png 848w, https://substackcdn.com/image/fetch/$s_!62P9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff152f2b6-55d0-4777-b708-24ef0df9a1e8_2389x607.png 1272w, https://substackcdn.com/image/fetch/$s_!62P9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff152f2b6-55d0-4777-b708-24ef0df9a1e8_2389x607.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Different ways to adapt LLMs to time-series tasks. Source: [5].</figcaption></figure></div><blockquote><p>&#128064; Essentially, the reprogramming framework trains only the model's input and output layers while keeping its core frozen for efficiency.</p></blockquote><p>Here, the paper proposes patch reprogramming, which aligns time series with the LLM&#8217;s embedding space by transforming a subsequence of the time series (a patch) into a text-aligned token representation. To do so, they start from the pretrained embeddings of the text tokens:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;E \\in \\mathbb{R}^{V \\times D}&quot;,&quot;id&quot;:&quot;CYDQCWEMTF&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>V</em> is the vocabulary size and <em>D</em> is the embedding dimension. 
Because mapping patches directly onto the full vocabulary is inefficient, they project the original embedding space to a much smaller prototype space <em>E&#8217;</em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;E' \\in \\mathbb{R}^{V' \\times D}, \\quad V' \\ll V.&quot;,&quot;id&quot;:&quot;RRQHUIYZQH&quot;}" data-component-name="LatexBlockToDOM"></div><p>These prototypes encode phrases like <em>"short up"</em> or <em>"steady down"</em>, keeping the representations within the LLM&#8217;s text space while making them more specific. Then, they integrate the prototypes into the time-series representation via <strong>cross-attention</strong>. For each attention head <em>k</em>, the query <em>Q</em> is constructed from the time-series patch <em>X<sub>P</sub></em>, while the key and value are generated from the prototype embedding <em>E&#8217;</em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q_k^{(i)} = \\hat{X}_P^{(i)} W_Q^k, \\quad K_k^{(i)} = E' W_K^k, \\quad V_k^{(i)} = E' W_V^k,&quot;,&quot;id&quot;:&quot;TZUEVLTYVU&quot;}" data-component-name="LatexBlockToDOM"></div><p>The attention mechanism then computes the alignment between patch <em>i</em> and the prototypes:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Z_k^{(i)} = \\text{SOFTMAX} \\left( \\frac{Q_k^{(i)} K_k^{(i) \\top}}{\\sqrt{d_k}} \\right) V_k^{(i)}.&quot;,&quot;id&quot;:&quot;EVZVXXSRGL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Aggregating across heads produces the reprogrammed representation, which is then linearly projected to match the hidden dimensions of the backbone model:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;O^{(i)} \\in \\mathbb{R}^{P \\times D}.&quot;,&quot;id&quot;:&quot;KKFCCYLEFE&quot;}" data-component-name="LatexBlockToDOM"></div><p>Given the reprogrammed patches, the paper considers two ways to feed them to the LLM:</p><ul><li><p><strong>Patch-as-Prefix: </strong>Treat the patch as input text followed by a simple prompt 
to trigger the LLM to predict time series values in natural language. However, this method faces two challenges:</p><ol><li><p>LLMs struggle with high-precision numerals without external tools, making long-horizon forecasting difficult.</p></li><li><p>Different pre-training corpora and tokenization strategies lead to inconsistencies in representing numeric values (e.g., <code>['0', '.', '6', '1']</code> vs. <code>['0', '.', '61']</code>).</p></li></ol></li><li><p><strong>Prompt-as-Prefix</strong> circumvents these limitations by placing a structured prompt before the patches:</p><ol><li><p>The instruction prompt combines three components: the input context, a task instruction, and input statistics such as trends and lags.</p></li><li><p>The instruction precedes the patch, and only the output segment corresponding to the patch is extracted for regression, as in standard forecasting. A projection layer is needed to align this output with the forecast.</p></li></ol></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dwMS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4514acb9-1bab-40c4-8e72-b707e515b7e6_1350x449.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dwMS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4514acb9-1bab-40c4-8e72-b707e515b7e6_1350x449.png 424w, https://substackcdn.com/image/fetch/$s_!dwMS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4514acb9-1bab-40c4-8e72-b707e515b7e6_1350x449.png 848w, 
https://substackcdn.com/image/fetch/$s_!dwMS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4514acb9-1bab-40c4-8e72-b707e515b7e6_1350x449.png 1272w, https://substackcdn.com/image/fetch/$s_!dwMS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4514acb9-1bab-40c4-8e72-b707e515b7e6_1350x449.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dwMS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4514acb9-1bab-40c4-8e72-b707e515b7e6_1350x449.png" width="1350" height="449" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4514acb9-1bab-40c4-8e72-b707e515b7e6_1350x449.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:449,&quot;width&quot;:1350,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:139227,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4514acb9-1bab-40c4-8e72-b707e515b7e6_1350x449.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dwMS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4514acb9-1bab-40c4-8e72-b707e515b7e6_1350x449.png 424w, 
https://substackcdn.com/image/fetch/$s_!dwMS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4514acb9-1bab-40c4-8e72-b707e515b7e6_1350x449.png 848w, https://substackcdn.com/image/fetch/$s_!dwMS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4514acb9-1bab-40c4-8e72-b707e515b7e6_1350x449.png 1272w, https://substackcdn.com/image/fetch/$s_!dwMS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4514acb9-1bab-40c4-8e72-b707e515b7e6_1350x449.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Patch Reprogramming and 2 Prompting Structures. Source: [5]</figcaption></figure></div><p>With these architecture changes, the performance on long-term forecasting tasks is impressive:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jQKY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436b1e63-5903-4d9d-b8c3-fdc4d6b6e1db_2108x647.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jQKY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436b1e63-5903-4d9d-b8c3-fdc4d6b6e1db_2108x647.png 424w, https://substackcdn.com/image/fetch/$s_!jQKY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436b1e63-5903-4d9d-b8c3-fdc4d6b6e1db_2108x647.png 848w, https://substackcdn.com/image/fetch/$s_!jQKY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436b1e63-5903-4d9d-b8c3-fdc4d6b6e1db_2108x647.png 1272w, https://substackcdn.com/image/fetch/$s_!jQKY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436b1e63-5903-4d9d-b8c3-fdc4d6b6e1db_2108x647.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jQKY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436b1e63-5903-4d9d-b8c3-fdc4d6b6e1db_2108x647.png" width="1456" height="447" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/436b1e63-5903-4d9d-b8c3-fdc4d6b6e1db_2108x647.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:447,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:206056,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436b1e63-5903-4d9d-b8c3-fdc4d6b6e1db_2108x647.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jQKY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436b1e63-5903-4d9d-b8c3-fdc4d6b6e1db_2108x647.png 424w, https://substackcdn.com/image/fetch/$s_!jQKY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436b1e63-5903-4d9d-b8c3-fdc4d6b6e1db_2108x647.png 848w, https://substackcdn.com/image/fetch/$s_!jQKY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436b1e63-5903-4d9d-b8c3-fdc4d6b6e1db_2108x647.png 1272w, https://substackcdn.com/image/fetch/$s_!jQKY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F436b1e63-5903-4d9d-b8c3-fdc4d6b6e1db_2108x647.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Given these results, LLMs appear to have huge potential for boosting forecasting accuracy. However, a recent study [6] reveals that it is not that easy. The research question is: &#128073;<em><strong>Are Language Models Actually Useful for Time Series Forecasting? </strong></em></p><p>The findings show that:</p><ul><li><p>Despite the hype, large language models (LLMs) don&#8217;t help with time series forecasting; in fact, simpler models without LLMs often perform better. 
</p></li><li><p>LLMs add extra cost without improving accuracy, and they neither capture time-based patterns well nor help in low-data situations.</p></li></ul><p>To understand why, let&#8217;s look at the experimental setting:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XhIK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44738f53-01ae-4d9e-babd-e5a05c5d6a79_2087x719.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XhIK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44738f53-01ae-4d9e-babd-e5a05c5d6a79_2087x719.png 424w, https://substackcdn.com/image/fetch/$s_!XhIK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44738f53-01ae-4d9e-babd-e5a05c5d6a79_2087x719.png 848w, https://substackcdn.com/image/fetch/$s_!XhIK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44738f53-01ae-4d9e-babd-e5a05c5d6a79_2087x719.png 1272w, https://substackcdn.com/image/fetch/$s_!XhIK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44738f53-01ae-4d9e-babd-e5a05c5d6a79_2087x719.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XhIK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44738f53-01ae-4d9e-babd-e5a05c5d6a79_2087x719.png" width="2087" height="719" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44738f53-01ae-4d9e-babd-e5a05c5d6a79_2087x719.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:719,&quot;width&quot;:2087,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:280755,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F524cf436-4817-4ded-96ea-b3db4d998c49_2087x719.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XhIK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44738f53-01ae-4d9e-babd-e5a05c5d6a79_2087x719.png 424w, https://substackcdn.com/image/fetch/$s_!XhIK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44738f53-01ae-4d9e-babd-e5a05c5d6a79_2087x719.png 848w, https://substackcdn.com/image/fetch/$s_!XhIK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44738f53-01ae-4d9e-babd-e5a05c5d6a79_2087x719.png 1272w, https://substackcdn.com/image/fetch/$s_!XhIK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44738f53-01ae-4d9e-babd-e5a05c5d6a79_2087x719.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">LLM-powered configuration for time-series forecasting</figcaption></figure></div><p>Here, the paper considers the standard architecture for using LLMs in forecasting tasks, which may involve a pre-trained LLM and other components:</p><ul><li><p>Input normalization</p></li><li><p>Finetuned encoding and projection layers</p></li><li><p>Alignment training</p></li></ul><p>To confirm the contribution of the LLM component, the paper removes the LLM module or replaces it with simpler components. 
It turns out that without the LLM module, the performance tends to be better:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RLwr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d811bf-4335-4cde-b2ae-33c974af0cc7_1583x883.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RLwr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d811bf-4335-4cde-b2ae-33c974af0cc7_1583x883.png 424w, https://substackcdn.com/image/fetch/$s_!RLwr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d811bf-4335-4cde-b2ae-33c974af0cc7_1583x883.png 848w, https://substackcdn.com/image/fetch/$s_!RLwr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d811bf-4335-4cde-b2ae-33c974af0cc7_1583x883.png 1272w, https://substackcdn.com/image/fetch/$s_!RLwr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d811bf-4335-4cde-b2ae-33c974af0cc7_1583x883.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RLwr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d811bf-4335-4cde-b2ae-33c974af0cc7_1583x883.png" width="541" height="301.71153846153845" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4d811bf-4335-4cde-b2ae-33c974af0cc7_1583x883.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:541,&quot;bytes&quot;:228676,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d811bf-4335-4cde-b2ae-33c974af0cc7_1583x883.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RLwr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d811bf-4335-4cde-b2ae-33c974af0cc7_1583x883.png 424w, https://substackcdn.com/image/fetch/$s_!RLwr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d811bf-4335-4cde-b2ae-33c974af0cc7_1583x883.png 848w, https://substackcdn.com/image/fetch/$s_!RLwr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d811bf-4335-4cde-b2ae-33c974af0cc7_1583x883.png 1272w, https://substackcdn.com/image/fetch/$s_!RLwr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d811bf-4335-4cde-b2ae-33c974af0cc7_1583x883.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>&#128064; The results suggest that the improvements reported in previous papers are mainly attributable to normalization and encoding-layer finetuning. </p></blockquote><p>More results suggest that the pre-training knowledge stored in LLMs may not be useful for forecasting tasks. As shown below, even without pre-trained weights, the model can still achieve the best performance. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TAMS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e172e54-76fb-4798-b586-4723b0d4c08c_1862x783.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TAMS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e172e54-76fb-4798-b586-4723b0d4c08c_1862x783.png 424w, https://substackcdn.com/image/fetch/$s_!TAMS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e172e54-76fb-4798-b586-4723b0d4c08c_1862x783.png 848w, https://substackcdn.com/image/fetch/$s_!TAMS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e172e54-76fb-4798-b586-4723b0d4c08c_1862x783.png 1272w, https://substackcdn.com/image/fetch/$s_!TAMS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e172e54-76fb-4798-b586-4723b0d4c08c_1862x783.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TAMS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e172e54-76fb-4798-b586-4723b0d4c08c_1862x783.png" width="571" height="240.00824175824175" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e172e54-76fb-4798-b586-4723b0d4c08c_1862x783.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:612,&quot;width&quot;:1456,&quot;resizeWidth&quot;:571,&quot;bytes&quot;:306160,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e172e54-76fb-4798-b586-4723b0d4c08c_1862x783.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TAMS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e172e54-76fb-4798-b586-4723b0d4c08c_1862x783.png 424w, https://substackcdn.com/image/fetch/$s_!TAMS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e172e54-76fb-4798-b586-4723b0d4c08c_1862x783.png 848w, https://substackcdn.com/image/fetch/$s_!TAMS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e172e54-76fb-4798-b586-4723b0d4c08c_1862x783.png 1272w, https://substackcdn.com/image/fetch/$s_!TAMS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e172e54-76fb-4798-b586-4723b0d4c08c_1862x783.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>&#128064; Although the result is interesting, it was only tested on GPT-2, a weak LLM. There is no guarantee that it holds for other LLMs. 
</p></blockquote><p>If LLMs do not help, &#129504;<em>where does the performance come from?</em> The paper suggests that simple techniques used in current LLM forecasters are enough to build strong forecasters:</p><ul><li><p>Patching (channel independent)</p></li><li><p>One-layer attention </p></li><li><p>Linear projections</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Oimd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F929ae982-faad-43e4-8940-e6e91ab59aa8_479x577.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Oimd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F929ae982-faad-43e4-8940-e6e91ab59aa8_479x577.png 424w, https://substackcdn.com/image/fetch/$s_!Oimd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F929ae982-faad-43e4-8940-e6e91ab59aa8_479x577.png 848w, https://substackcdn.com/image/fetch/$s_!Oimd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F929ae982-faad-43e4-8940-e6e91ab59aa8_479x577.png 1272w, https://substackcdn.com/image/fetch/$s_!Oimd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F929ae982-faad-43e4-8940-e6e91ab59aa8_479x577.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Oimd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F929ae982-faad-43e4-8940-e6e91ab59aa8_479x577.png" width="317" height="381.8559498956159" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/929ae982-faad-43e4-8940-e6e91ab59aa8_479x577.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:577,&quot;width&quot;:479,&quot;resizeWidth&quot;:317,&quot;bytes&quot;:53633,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F929ae982-faad-43e4-8940-e6e91ab59aa8_479x577.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Oimd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F929ae982-faad-43e4-8940-e6e91ab59aa8_479x577.png 424w, https://substackcdn.com/image/fetch/$s_!Oimd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F929ae982-faad-43e4-8940-e6e91ab59aa8_479x577.png 848w, https://substackcdn.com/image/fetch/$s_!Oimd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F929ae982-faad-43e4-8940-e6e91ab59aa8_479x577.png 1272w, https://substackcdn.com/image/fetch/$s_!Oimd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F929ae982-faad-43e4-8940-e6e91ab59aa8_479x577.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">PAttn: Patching and attention work best. Source: [6]</figcaption></figure></div><p>If this finding sounds grim for the future of LLMs in time series, there&#8217;s good news: a recent study shows that when used the right way, LLMs can significantly boost forecasting performance. It&#8217;s not that LLMs are useless&#8212;it&#8217;s that how we use them makes all the difference.</p><p>Concretely, most existing LLM-based time series models miss a key ingredient: <strong>autoregression</strong>&#8212;the core of how both LLMs and traditional forecasters make predictions. 
A new approach, &#128073;<strong>AutoTimes </strong>[8], brings this back by using LLMs in their natural autoregressive mode, enabling flexible, multi-step forecasts without needing to retrain for different input/output lengths.</p><p>AutoTimes leverages existing components and ideas:</p><ul><li><p>Patching to encode a segment of a time series as a token representation</p></li><li><p>Timestamp representation modeling to improve token representation</p></li><li><p>A lightweight embedding of time series as LLM tokens that adds only 0.1% of parameters, keeping the LLM frozen and training efficient</p></li></ul><p>The main contribution is that it trains the LLM autoregressively on the next-token prediction task. The predicted tokens are then projected back to the time-series space for forecasting:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Cbvp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0874cc7a-16a6-411f-8315-b8caa1a484b9_1327x776.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Cbvp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0874cc7a-16a6-411f-8315-b8caa1a484b9_1327x776.png 424w, https://substackcdn.com/image/fetch/$s_!Cbvp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0874cc7a-16a6-411f-8315-b8caa1a484b9_1327x776.png 848w, https://substackcdn.com/image/fetch/$s_!Cbvp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0874cc7a-16a6-411f-8315-b8caa1a484b9_1327x776.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Cbvp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0874cc7a-16a6-411f-8315-b8caa1a484b9_1327x776.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Cbvp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0874cc7a-16a6-411f-8315-b8caa1a484b9_1327x776.png" width="1327" height="776" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0874cc7a-16a6-411f-8315-b8caa1a484b9_1327x776.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:776,&quot;width&quot;:1327,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:197640,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0874cc7a-16a6-411f-8315-b8caa1a484b9_1327x776.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Cbvp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0874cc7a-16a6-411f-8315-b8caa1a484b9_1327x776.png 424w, https://substackcdn.com/image/fetch/$s_!Cbvp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0874cc7a-16a6-411f-8315-b8caa1a484b9_1327x776.png 848w, 
https://substackcdn.com/image/fetch/$s_!Cbvp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0874cc7a-16a6-411f-8315-b8caa1a484b9_1327x776.png 1272w, https://substackcdn.com/image/fetch/$s_!Cbvp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0874cc7a-16a6-411f-8315-b8caa1a484b9_1327x776.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Next token prediction in AutoTimes. 
Source: [8].</figcaption></figure></div><p>Under this approach, the results look promising:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!80vw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3450e00-1806-48b3-b2c0-1a5e007fc2ab_2067x573.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!80vw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3450e00-1806-48b3-b2c0-1a5e007fc2ab_2067x573.png 424w, https://substackcdn.com/image/fetch/$s_!80vw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3450e00-1806-48b3-b2c0-1a5e007fc2ab_2067x573.png 848w, https://substackcdn.com/image/fetch/$s_!80vw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3450e00-1806-48b3-b2c0-1a5e007fc2ab_2067x573.png 1272w, https://substackcdn.com/image/fetch/$s_!80vw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3450e00-1806-48b3-b2c0-1a5e007fc2ab_2067x573.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!80vw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3450e00-1806-48b3-b2c0-1a5e007fc2ab_2067x573.png" width="1456" height="404" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3450e00-1806-48b3-b2c0-1a5e007fc2ab_2067x573.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:404,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149298,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3450e00-1806-48b3-b2c0-1a5e007fc2ab_2067x573.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!80vw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3450e00-1806-48b3-b2c0-1a5e007fc2ab_2067x573.png 424w, https://substackcdn.com/image/fetch/$s_!80vw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3450e00-1806-48b3-b2c0-1a5e007fc2ab_2067x573.png 848w, https://substackcdn.com/image/fetch/$s_!80vw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3450e00-1806-48b3-b2c0-1a5e007fc2ab_2067x573.png 1272w, https://substackcdn.com/image/fetch/$s_!80vw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3450e00-1806-48b3-b2c0-1a5e007fc2ab_2067x573.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>More importantly, the paper shows that the LLM module is genuinely useful, as indicated by the ablation study:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Uls8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c448be-3c94-4f41-a731-56978658f5f3_2065x615.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Uls8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c448be-3c94-4f41-a731-56978658f5f3_2065x615.png 424w, 
https://substackcdn.com/image/fetch/$s_!Uls8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c448be-3c94-4f41-a731-56978658f5f3_2065x615.png 848w, https://substackcdn.com/image/fetch/$s_!Uls8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c448be-3c94-4f41-a731-56978658f5f3_2065x615.png 1272w, https://substackcdn.com/image/fetch/$s_!Uls8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c448be-3c94-4f41-a731-56978658f5f3_2065x615.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Uls8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c448be-3c94-4f41-a731-56978658f5f3_2065x615.png" width="1456" height="434" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87c448be-3c94-4f41-a731-56978658f5f3_2065x615.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:434,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:138238,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c448be-3c94-4f41-a731-56978658f5f3_2065x615.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Uls8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c448be-3c94-4f41-a731-56978658f5f3_2065x615.png 424w, https://substackcdn.com/image/fetch/$s_!Uls8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c448be-3c94-4f41-a731-56978658f5f3_2065x615.png 848w, https://substackcdn.com/image/fetch/$s_!Uls8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c448be-3c94-4f41-a731-56978658f5f3_2065x615.png 1272w, https://substackcdn.com/image/fetch/$s_!Uls8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c448be-3c94-4f41-a731-56978658f5f3_2065x615.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>These results suggest that, with the right method design, the potential of LLMs in time series forecasting can be unlocked. With approaches like AutoTimes, LLMs go beyond being just expensive add-ons: they become efficient and generalizable forecasters.</p><div><hr></div><h2>Building Foundational Models for Time Series: A New Era?</h2><p>Instead of just adapting existing LLMs, researchers are now exploring <strong>specialized foundational models</strong> trained directly on vast time series data.</p><ul><li><p>Why Specialized Models? Training large models from scratch on time series could help them learn temporal dependencies, just as language models learn linguistic structure from text.</p></li><li><p>Challenges &amp; Breakthroughs: Time series data is highly variable and non-stationary, making dataset creation and training complex, but models like TimeGPT and Chronos show it's possible. Let&#8217;s discover how these specialized models are reshaping time series forecasting.</p></li></ul><p>&#128073;<strong>TimeGPT</strong> [9], introduced by its authors as the first time-series foundation model, employs a Transformer architecture to learn time-series representations. It processes a sliding window of historical values, enriched with local positional encodings, and passes this through a deep encoder-decoder stack with residual connections and layer normalization. At the end, a linear head maps the decoder&#8217;s output to a forecasting window, predicting what comes next. 
The training objective is to estimate the conditional distribution of the next <em>h</em> target values <em>y</em>, given the observed history and the exogenous covariates <em>x</em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P\\left( y_{t+1:t+h} \\mid y_{0:t},\\ x_{0:t+h} \\right) = f_\\theta(y_{0:t},\\ x_{0:t+h})\n&quot;,&quot;id&quot;:&quot;QJVEFYYFGJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>The paper applies a Transformer-based architecture to time-series datasets:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bN6k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea50f7a-da10-4c5b-9f09-f3d558f99a6f_1119x406.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bN6k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea50f7a-da10-4c5b-9f09-f3d558f99a6f_1119x406.png 424w, https://substackcdn.com/image/fetch/$s_!bN6k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea50f7a-da10-4c5b-9f09-f3d558f99a6f_1119x406.png 848w, https://substackcdn.com/image/fetch/$s_!bN6k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea50f7a-da10-4c5b-9f09-f3d558f99a6f_1119x406.png 1272w, https://substackcdn.com/image/fetch/$s_!bN6k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea50f7a-da10-4c5b-9f09-f3d558f99a6f_1119x406.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bN6k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea50f7a-da10-4c5b-9f09-f3d558f99a6f_1119x406.png" width="1119" height="406" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ea50f7a-da10-4c5b-9f09-f3d558f99a6f_1119x406.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:406,&quot;width&quot;:1119,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:267735,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea50f7a-da10-4c5b-9f09-f3d558f99a6f_1119x406.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bN6k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea50f7a-da10-4c5b-9f09-f3d558f99a6f_1119x406.png 424w, https://substackcdn.com/image/fetch/$s_!bN6k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea50f7a-da10-4c5b-9f09-f3d558f99a6f_1119x406.png 848w, https://substackcdn.com/image/fetch/$s_!bN6k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea50f7a-da10-4c5b-9f09-f3d558f99a6f_1119x406.png 1272w, https://substackcdn.com/image/fetch/$s_!bN6k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ea50f7a-da10-4c5b-9f09-f3d558f99a6f_1119x406.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">TimeGPT Architecture. Source: [9].</figcaption></figure></div><blockquote><p>&#128064; TimeGPT is not based on an existing language model. While it shares the &#8220;foundation model&#8221; spirit&#8212;training on vast data at scale&#8212;its architecture is customized for numerical sequences, not text. </p></blockquote><p>To train TimeGPT, researchers compiled what&#8217;s likely the largest public collection of time series ever used, with over <strong>100 billion data points</strong>. This dataset spans a rich mix of domains: finance, economics, healthcare, demographics, weather, web traffic, transport, IoT sensors, and more. 
These series bring with them a wide spectrum of behaviors&#8212;multiple seasonalities, varying cycle lengths, non-linear trends, noise, and sudden anomalies.</p><blockquote><p>&#128064; Instead of overly sanitizing the data, the creators kept most of it raw, simply standardizing formats and filling in missing values. This decision is meant to expose the model to real-world messiness, a critical factor in achieving robustness and generalization.</p></blockquote><p>Training TimeGPT required <strong>multiple days on a cluster of NVIDIA A10G GPUs</strong>. Extensive tuning was conducted to optimize batch sizes, learning rates, and other hyperparameters. Consistent with earlier large-scale training studies, larger batch sizes and smaller learning rates helped stabilize training. </p><p>One standout feature of TimeGPT is its support for probabilistic forecasting&#8212;not just predicting what&#8217;s most likely to happen, but quantifying uncertainty around those predictions. This is achieved using <strong>conformal prediction</strong>, a model-agnostic, non-parametric method that doesn&#8217;t assume a specific distribution. During inference, TimeGPT performs rolling forecasts on the most recent data and uses the resulting errors to calibrate its prediction intervals, making it a more trustworthy tool for decision-making under uncertainty.</p><p>In the same spirit as TimeGPT, &#128073;<strong>MOIRAI</strong> [10] prepares a large-scale time-series dataset and trains a Transformer-based model on it. However, MOIRAI provides more sophisticated encoding and decoding pipelines.  
First, it proposes a patch-based masked forecasting framework: it divides each time series into <strong>non-overlapping patches</strong> and learns to forecast the masked future patches using a Transformer encoder.</p><ul><li><p>Each patch captures a contiguous chunk of the series, and multiple patch sizes are used to adapt to both high- and low-frequency signals.</p></li><li><p>During training, future patches are replaced with a special <code>[mask]</code> token (a learnable embedding), signaling the model to predict them.</p></li></ul><p>Most models assume a fixed number of variates. MOIRAI aims to work with <strong>any number of variates</strong>, even if unseen during training. To this end, it flattens a multivariate time series into a single long sequence and applies attention to it. To preserve variate identity and ensure robustness to variate permutation:</p><ul><li><p>It uses variate encodings (not positional encodings).</p></li><li><p>It applies binary attention biases that help the Transformer:</p><ul><li><p>Distinguish between intra- and inter-variate interactions.</p></li><li><p>Remain equivariant to variate order and invariant to variate IDs.</p></li></ul></li></ul><p>In particular, letting <em>R</em> denote the rotary positional encoding matrix and <em>u(1)</em> and <em>u(2)</em> be learnable scalars, the attention score between the <em>i</em>-th token of variate <em>m</em> and the <em>j</em>-th token of variate <em>n</em> is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{equation}\nE_{ij,mn} = (\\mathbf{W}_Q x_{i,m})^\\top \\mathbf{R}_{i-j} (\\mathbf{W}_K x_{j,n}) \n+ u^{(1)} \\cdot \\mathbf{1}_{\\{m=n\\}} + u^{(2)} \\cdot \\mathbf{1}_{\\{m \\ne n\\}}\n\\end{equation}&quot;,&quot;id&quot;:&quot;RDFZHUVJQN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, the two scalars implement the binary attention bias: they shift the attention score depending on whether the variate indices <em>m</em> and <em>n</em> match:</p><ul><li><p><em>u(1)</em> enhances attention between elements of the 
<strong>same variate</strong>.</p></li><li><p><em>u(2)</em> modulates <strong>cross-variate</strong> attention.</p></li></ul><p>They are learned to best fit the training data. The final attention weights are:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;A_{ij,mn} = \\frac{\\exp(E_{ij,mn})}{\\sum\\limits_{k,o} \\exp(E_{ik,mo})}\n&quot;,&quot;id&quot;:&quot;SFYDLLTIOI&quot;}" data-component-name="LatexBlockToDOM"></div><p>MOIRAI adopts several modern LLM techniques to stabilize and improve the Transformer encoder:</p><ul><li><p>RMSNorm replaces LayerNorm</p></li><li><p>SwiGLU activation used in FFNs</p></li><li><p>Query-key normalization for attention</p></li><li><p>Pre-normalization layout for better gradient flow</p></li><li><p>Biases are removed for simplicity and regularization</p></li></ul><p>Finally, instead of predicting a single value, MOIRAI outputs parameters of a <strong>mixture of parametric distributions</strong> &#8212; allowing flexible and accurate uncertainty modeling. In particular, the output reads:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p(Y_{t:t+h} \\mid \\hat{\\phi}) = \\sum_{i=1}^{c} w_i \\cdot p_i(Y_{t:t+h} \\mid \\hat{\\phi}_i)\n&quot;,&quot;id&quot;:&quot;DDWPKRNKMO&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, <em>Y<sub>t:t+h</sub> </em>is the time-series patch. Distribution choices:</p><ul><li><p>Student&#8217;s t &#8211; for general robustness</p></li><li><p>Negative Binomial &#8211; for count-based data</p></li><li><p>Log-normal &#8211; for right-skewed distributions (e.g., prices)</p></li><li><p>Low-variance Normal &#8211; for high-confidence predictions</p></li></ul><p>This setup supports both sampling and likelihood-based training with minimal overhead.  
The loss function represents the negative log-likelihood:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\text{NLL}} = - \\log \\left( \\sum_{i=1}^{c} w_i \\cdot p_i(Y_{t:t+h}^{\\text{true}} \\mid \\hat{\\phi}_i) \\right)\n&quot;,&quot;id&quot;:&quot;WXTXOVGRDF&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>The whole pipeline is given below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z_KE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd600b82e-d1fe-44a8-96ef-a035fd54e1b0_1219x506.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z_KE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd600b82e-d1fe-44a8-96ef-a035fd54e1b0_1219x506.png 424w, https://substackcdn.com/image/fetch/$s_!z_KE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd600b82e-d1fe-44a8-96ef-a035fd54e1b0_1219x506.png 848w, https://substackcdn.com/image/fetch/$s_!z_KE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd600b82e-d1fe-44a8-96ef-a035fd54e1b0_1219x506.png 1272w, https://substackcdn.com/image/fetch/$s_!z_KE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd600b82e-d1fe-44a8-96ef-a035fd54e1b0_1219x506.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z_KE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd600b82e-d1fe-44a8-96ef-a035fd54e1b0_1219x506.png" 
width="1219" height="506" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d600b82e-d1fe-44a8-96ef-a035fd54e1b0_1219x506.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:506,&quot;width&quot;:1219,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:97860,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd600b82e-d1fe-44a8-96ef-a035fd54e1b0_1219x506.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z_KE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd600b82e-d1fe-44a8-96ef-a035fd54e1b0_1219x506.png 424w, https://substackcdn.com/image/fetch/$s_!z_KE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd600b82e-d1fe-44a8-96ef-a035fd54e1b0_1219x506.png 848w, https://substackcdn.com/image/fetch/$s_!z_KE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd600b82e-d1fe-44a8-96ef-a035fd54e1b0_1219x506.png 1272w, https://substackcdn.com/image/fetch/$s_!z_KE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd600b82e-d1fe-44a8-96ef-a035fd54e1b0_1219x506.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">MOIRAI pipeline. Source: [10].</figcaption></figure></div><p>In terms of performance, MOIRAI can outperform some deep learning forecasters. This is notable because, unlike the other baselines, the foundation model is not trained on each dataset individually. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nky3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21e5ab52-1422-4968-a246-e1cc093f2996_1277x796.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nky3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21e5ab52-1422-4968-a246-e1cc093f2996_1277x796.png 424w, https://substackcdn.com/image/fetch/$s_!nky3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21e5ab52-1422-4968-a246-e1cc093f2996_1277x796.png 848w, https://substackcdn.com/image/fetch/$s_!nky3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21e5ab52-1422-4968-a246-e1cc093f2996_1277x796.png 1272w, https://substackcdn.com/image/fetch/$s_!nky3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21e5ab52-1422-4968-a246-e1cc093f2996_1277x796.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nky3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21e5ab52-1422-4968-a246-e1cc093f2996_1277x796.png" width="1277" height="796" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21e5ab52-1422-4968-a246-e1cc093f2996_1277x796.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:1277,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120285,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21e5ab52-1422-4968-a246-e1cc093f2996_1277x796.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nky3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21e5ab52-1422-4968-a246-e1cc093f2996_1277x796.png 424w, https://substackcdn.com/image/fetch/$s_!nky3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21e5ab52-1422-4968-a246-e1cc093f2996_1277x796.png 848w, https://substackcdn.com/image/fetch/$s_!nky3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21e5ab52-1422-4968-a246-e1cc093f2996_1277x796.png 1272w, https://substackcdn.com/image/fetch/$s_!nky3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21e5ab52-1422-4968-a246-e1cc093f2996_1277x796.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Taking a different approach, &#128073;<strong>Chronos</strong> [11] repurposes standard language model architectures, such as T5 or GPT-2, to handle time series data with minimal changes, by converting continuous time series values into discrete tokens. These tokenized sequences are then treated as a &#8220;language of time series&#8221; that LMs can ingest and model.</p><p>To make time series digestible for LLMs:</p><ul><li><p><strong>Scaling:</strong> Each time series is normalized via <strong>mean scaling</strong>, i.e., divided by the mean absolute value of its historical context, which preserves zero values (e.g., "zero sales" days).</p></li><li><p><strong>Quantization:</strong> The scaled values are mapped to discrete bins (e.g., 1024 total). 
They use <strong>uniform binning</strong> (equal bin widths) to generalize better across unseen data distributions.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KWx5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe796b55a-66e7-4a09-9182-d830f340521d_997x283.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KWx5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe796b55a-66e7-4a09-9182-d830f340521d_997x283.png 424w, https://substackcdn.com/image/fetch/$s_!KWx5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe796b55a-66e7-4a09-9182-d830f340521d_997x283.png 848w, https://substackcdn.com/image/fetch/$s_!KWx5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe796b55a-66e7-4a09-9182-d830f340521d_997x283.png 1272w, https://substackcdn.com/image/fetch/$s_!KWx5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe796b55a-66e7-4a09-9182-d830f340521d_997x283.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KWx5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe796b55a-66e7-4a09-9182-d830f340521d_997x283.png" width="570" height="161.79538615847542" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e796b55a-66e7-4a09-9182-d830f340521d_997x283.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:283,&quot;width&quot;:997,&quot;resizeWidth&quot;:570,&quot;bytes&quot;:36451,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe796b55a-66e7-4a09-9182-d830f340521d_997x283.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KWx5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe796b55a-66e7-4a09-9182-d830f340521d_997x283.png 424w, https://substackcdn.com/image/fetch/$s_!KWx5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe796b55a-66e7-4a09-9182-d830f340521d_997x283.png 848w, https://substackcdn.com/image/fetch/$s_!KWx5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe796b55a-66e7-4a09-9182-d830f340521d_997x283.png 1272w, https://substackcdn.com/image/fetch/$s_!KWx5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe796b55a-66e7-4a09-9182-d830f340521d_997x283.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; This converts a real-valued time series into a discrete sequence:<br><code>[x1, ..., xC] &#8594; [token1, ..., tokenC]. 
</code>This is similar to the LLM fine-tuning approaches mentioned above. </p></blockquote><p>Then, a language model processes the tokenized data:</p><ul><li><p>Off-the-shelf LMs (like T5 or GPT-2) are used without architecture changes.</p></li><li><p>Only modification: adjusting the vocabulary size to match the quantized token space.</p></li><li><p>No time or frequency features are used (e.g., no day-of-week encoding), which surprisingly doesn&#8217;t degrade performance.</p></li></ul><p>The training objective is the standard <strong>cross-entropy loss</strong> over the token vocabulary:</p><ul><li><p>The model predicts the next token <em>z<sub>C+h+1</sub></em> given the previous tokens <em>z<sub>1:C+h</sub></em>.</p></li><li><p>Training proceeds as in standard language modeling.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0RZV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc71e78d-c951-4d88-aae6-b5e87fc9e17c_1508x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0RZV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc71e78d-c951-4d88-aae6-b5e87fc9e17c_1508x630.png 424w, https://substackcdn.com/image/fetch/$s_!0RZV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc71e78d-c951-4d88-aae6-b5e87fc9e17c_1508x630.png 848w, https://substackcdn.com/image/fetch/$s_!0RZV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc71e78d-c951-4d88-aae6-b5e87fc9e17c_1508x630.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0RZV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc71e78d-c951-4d88-aae6-b5e87fc9e17c_1508x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0RZV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc71e78d-c951-4d88-aae6-b5e87fc9e17c_1508x630.png" width="1456" height="608" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc71e78d-c951-4d88-aae6-b5e87fc9e17c_1508x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:608,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:137643,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc71e78d-c951-4d88-aae6-b5e87fc9e17c_1508x630.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0RZV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc71e78d-c951-4d88-aae6-b5e87fc9e17c_1508x630.png 424w, https://substackcdn.com/image/fetch/$s_!0RZV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc71e78d-c951-4d88-aae6-b5e87fc9e17c_1508x630.png 848w, 
https://substackcdn.com/image/fetch/$s_!0RZV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc71e78d-c951-4d88-aae6-b5e87fc9e17c_1508x630.png 1272w, https://substackcdn.com/image/fetch/$s_!0RZV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc71e78d-c951-4d88-aae6-b5e87fc9e17c_1508x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Chronos Framework. 
Source: [11].</figcaption></figure></div><p>During inference, Chronos samples tokens autoregressively:</p><ul><li><p>The sampled tokens are dequantized and unscaled back to real numbers.</p></li><li><p>Multiple samples are drawn to get probabilistic forecasts, forming a distribution over possible futures.</p></li></ul><p>Chronos tackles the data scarcity problem with two strategies:</p><ol><li><p><strong>TSMixup</strong> &#8211; Combines time series from different datasets via convex combinations.</p></li><li><p><strong>KernelSynth</strong> &#8211; Uses Gaussian processes to generate synthetic time series with random kernel compositions.</p></li></ol><p>These data augmentation techniques improve generalization and robustness, especially for zero-shot settings. </p><p>Chronos was trained and evaluated across <strong>55 datasets</strong> and:</p><ul><li><p>Outperformed traditional time series models and specialized deep nets.</p></li><li><p>Achieved state-of-the-art zero-shot performance (i.e., good generalization without fine-tuning), e.g., better than Moirai.</p></li><li><p>Performed competitively even with modest model sizes, making it computationally efficient.</p></li></ul><p>The benchmark mainly involves two metrics: </p><ul><li><p>WQL (weighted quantile loss) evaluates the accuracy of probabilistic forecasts (i.e., when your model predicts a distribution or multiple quantiles, not just a single value).</p></li><li><p>MASE (mean absolute scaled error) is a scale-independent metric used to evaluate point forecasts. 
It compares your forecast to a naive baseline (like the previous value or seasonal naive).</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r6WC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8249bc9-5a9c-42ae-8da4-7a212f8692e4_1110x481.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r6WC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8249bc9-5a9c-42ae-8da4-7a212f8692e4_1110x481.png 424w, https://substackcdn.com/image/fetch/$s_!r6WC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8249bc9-5a9c-42ae-8da4-7a212f8692e4_1110x481.png 848w, https://substackcdn.com/image/fetch/$s_!r6WC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8249bc9-5a9c-42ae-8da4-7a212f8692e4_1110x481.png 1272w, https://substackcdn.com/image/fetch/$s_!r6WC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8249bc9-5a9c-42ae-8da4-7a212f8692e4_1110x481.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r6WC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8249bc9-5a9c-42ae-8da4-7a212f8692e4_1110x481.png" width="1110" height="481" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8249bc9-5a9c-42ae-8da4-7a212f8692e4_1110x481.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:481,&quot;width&quot;:1110,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:121107,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07efa05b-3e30-468c-87d2-9f878564a6aa_1110x481.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r6WC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8249bc9-5a9c-42ae-8da4-7a212f8692e4_1110x481.png 424w, https://substackcdn.com/image/fetch/$s_!r6WC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8249bc9-5a9c-42ae-8da4-7a212f8692e4_1110x481.png 848w, https://substackcdn.com/image/fetch/$s_!r6WC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8249bc9-5a9c-42ae-8da4-7a212f8692e4_1110x481.png 1272w, https://substackcdn.com/image/fetch/$s_!r6WC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8249bc9-5a9c-42ae-8da4-7a212f8692e4_1110x481.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>While models like Chronos and Moirai have made notable progress towards universal forecasting, they face limitations:</p><p><strong>&#10060; </strong>They rely on moderate, fixed context lengths or handcrafted heuristics that limit their flexibility and scalability. </p><p><strong>&#10060; </strong>They are trained as dense models, which is computationally expensive and requires large GPUs. </p><p>To address these issues, researchers recently introduced &#128073;<strong>TIME-MoE </strong>[12], a scalable, sparse, and general-purpose time series foundation model built to mirror the success of LLMs and vision transformers in their respective domains. 
In particular, the model features:</p><ul><li><p>A decoder-only Transformer with a Mixture-of-Experts (MoE) backbone.</p></li><li><p>Auto-regressive operation, enabling support for <em>any</em> forecasting horizon.</p></li><li><p>Long-context handling (up to 4096 tokens), which is critical for long-term temporal dependencies.</p></li><li><p>Sparse activation: only a small subset of expert networks is activated per token, making the model highly efficient.</p></li><li><p>Scalability: sparsity allows the model to grow (up to 2.4 billion parameters) without a proportional increase in compute cost.</p></li></ul><p>Below are the overall architecture and the details of each processing step. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K0kw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b5f2c5-52b1-4f7a-8c60-a8c17439b64a_1668x1085.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K0kw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b5f2c5-52b1-4f7a-8c60-a8c17439b64a_1668x1085.png 424w, https://substackcdn.com/image/fetch/$s_!K0kw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b5f2c5-52b1-4f7a-8c60-a8c17439b64a_1668x1085.png 848w, https://substackcdn.com/image/fetch/$s_!K0kw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b5f2c5-52b1-4f7a-8c60-a8c17439b64a_1668x1085.png 1272w, 
https://substackcdn.com/image/fetch/$s_!K0kw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b5f2c5-52b1-4f7a-8c60-a8c17439b64a_1668x1085.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K0kw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b5f2c5-52b1-4f7a-8c60-a8c17439b64a_1668x1085.png" width="1456" height="947" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1b5f2c5-52b1-4f7a-8c60-a8c17439b64a_1668x1085.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:947,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:336888,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b5f2c5-52b1-4f7a-8c60-a8c17439b64a_1668x1085.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K0kw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b5f2c5-52b1-4f7a-8c60-a8c17439b64a_1668x1085.png 424w, https://substackcdn.com/image/fetch/$s_!K0kw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b5f2c5-52b1-4f7a-8c60-a8c17439b64a_1668x1085.png 848w, 
https://substackcdn.com/image/fetch/$s_!K0kw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b5f2c5-52b1-4f7a-8c60-a8c17439b64a_1668x1085.png 1272w, https://substackcdn.com/image/fetch/$s_!K0kw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b5f2c5-52b1-4f7a-8c60-a8c17439b64a_1668x1085.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Time-MOE architecture. 
Source: [12].</figcaption></figure></div><p><strong>&#9312;&#9313;</strong> The first step is to embed time series data into tokens. Unlike mainstream approaches that use patch-based tokenization, TIME-MoE uses point-wise tokenization, i.e., each time step is projected to a high-dimensional embedding:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;h^0_t = \\text{SwiGLU}(x_t) = \\text{Swish}(W x_t) \\otimes (V x_t)\n&quot;,&quot;id&quot;:&quot;TNSABLNLRW&quot;}" data-component-name="LatexBlockToDOM"></div><p>where:</p><ul><li><p><em>W</em> and <em>V</em> are learnable projection weights</p></li><li><p><em>&#8855;</em> denotes element-wise product</p></li><li><p><em>h<sup>0</sup><sub>t</sub></em> is the embedding fed to the first transformer layer</p></li></ul><p><strong>&#9314; </strong>TIME-MoE builds on the decoder-only transformer, with common tricks for time series:</p><ul><li><p>RMSNorm for stable training</p></li><li><p>Rotary Positional Embedding (RoPE) to better generalize to long-range sequences</p></li><li><p>Bias-free layers (except in self-attention) for better extrapolation</p></li></ul><p><strong>The twist:</strong> each transformer block replaces the standard feedforward layer with a <strong>Mixture-of-Experts</strong> module. Here's what happens at layer <em>l</em>, step by step:</p><p>i. Self-Attention + Residual:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;u^l_t = \\text{SA}(\\text{RMSNorm}(h^{l-1}_t)) + h^{l-1}_t\n&quot;,&quot;id&quot;:&quot;RGONKPVDLU&quot;}" data-component-name="LatexBlockToDOM"></div><p>ii. Normalization:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\bar{u}^l_t = \\text{RMSNorm}(u^l_t)\n&quot;,&quot;id&quot;:&quot;CKOMZFSLUQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>iii. 
Mixture-of-Experts Activation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;h^l_t = \\text{Mixture}(\\bar{u}^l_t) + u^l_t\n&quot;,&quot;id&quot;:&quot;LTQUAKVXTR&quot;}" data-component-name="LatexBlockToDOM"></div><p>In this step, the Mixture layer routes each token to a subset of <code>K</code> out of <code>N</code> possible experts, plus a shared expert to capture global knowledge:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Mixture}(\\bar{u}^l_t) = g_{N+1, t} \\cdot \\text{FFN}_{N+1}(\\bar{u}^l_t) + \\sum_{i=1}^{N} g_{i,t} \\cdot \\text{FFN}_i(\\bar{u}^l_t)\n&quot;,&quot;id&quot;:&quot;LRCQDIIVTW&quot;}" data-component-name="LatexBlockToDOM"></div><p>where the gating weights <em>g<sub>i,t</sub></em> are derived from the router scores <em>s<sub>i,t</sub></em>. For each specialized expert, the gate keeps the router score only if it ranks among the top <code>K</code> and is zero otherwise:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;g_{i,t} = \n\\begin{cases}\ns_{i,t}, &amp; \\text{if } s_{i,t} \\in \\text{Top}_K(\\{s_{j,t} \\mid 1 \\le j \\le N\\}) \\\\\n0, &amp; \\text{otherwise}\n\\end{cases}&quot;,&quot;id&quot;:&quot;NPSLAOZNQD&quot;}" data-component-name="LatexBlockToDOM"></div><p>where the router score is computed as a softmax over the <code>N</code> specialized experts:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s_{i,t} = \\text{Softmax}_i(W^l_i \\bar{u}^l_t)\n&quot;,&quot;id&quot;:&quot;KAYHBSUAFX&quot;}" data-component-name="LatexBlockToDOM"></div><p>On the other hand, the gate for the shared expert is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;g_{N+1, t} = \\text{Sigmoid}(W^l_{N+1} \\bar{u}^l_t)\n&quot;,&quot;id&quot;:&quot;XJUTRBMUGO&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>&#9315;&#9316;</strong> Another contribution of the approach is that instead of predicting at a single horizon, TIME-MoE predicts at multiple 
resolutions simultaneously. Each projection head targets a different forecast length, enabling the model to:</p><ul><li><p>Generalize across short and long horizons</p></li><li><p>Learn a richer latent structure of the future</p></li><li><p>Ensemble multi-scale predictions for better robustness</p></li></ul><p>The total loss averages over all resolutions, using the <strong>Huber loss</strong> (with threshold <em>&#948;</em>) for each prediction:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\text{ar}}(x_t, \\hat{x}_t) =\n\\begin{cases}\n\\frac{1}{2}(x_t - \\hat{x}_t)^2, &amp; \\text{if } |x_t - \\hat{x}_t| \\le \\delta \\\\\n\\delta \\cdot (|x_t - \\hat{x}_t| - \\frac{1}{2} \\delta), &amp; \\text{otherwise}\n\\end{cases}&quot;,&quot;id&quot;:&quot;QTQLZGEFVA&quot;}" data-component-name="LatexBlockToDOM"></div><p>Last but not least, MoE models suffer from a known issue: <strong>routing collapse</strong>, where only a few experts get chosen repeatedly. To mitigate this, TIME-MoE adds an <strong>auxiliary expert balance loss</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\text{aux}} = \\sum_{i=1}^{N} f_i r_i\n&quot;,&quot;id&quot;:&quot;IDQFZZILEN&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>f<sub>i</sub></em> is the fraction of tokens routed to expert <em>i</em> (normalized by <em>K</em>, since each token selects <em>K</em> experts):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f_i = \\frac{1}{KT} \\sum_{t=1}^{T} \\mathbb{I}(\\text{Expert } i \\text{ is selected at time } t)\n&quot;,&quot;id&quot;:&quot;DUAFIPYAFX&quot;}" data-component-name="LatexBlockToDOM"></div><p>and <em>r<sub>i</sub></em> is the average router probability for expert <em>i</em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;r_i = \\frac{1}{T} \\sum_{t=1}^{T} s_{i,t}\n&quot;,&quot;id&quot;:&quot;TQKNHAVWVK&quot;}" 
data-component-name="LatexBlockToDOM"></div><blockquote><p>&#128064; Minimizing the balance loss encourages uniform expert assignment. The frequency vector <em>f</em> and the router-probability vector <em>r</em> are positively correlated (i.e., similarly ordered), so <strong><a href="https://brilliant.org/wiki/chebyshev-inequality/">Chebyshev&#8217;s sum inequality</a></strong> applies: since both vectors sum to 1, the loss is bounded below by <em>1/N</em>, and the bound is attained when experts are used uniformly.</p></blockquote><p>The results are impressive. TIME-MoE is among the state-of-the-art models in time-series forecasting when compared with other foundation models:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XpG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb41a4a38-a6a3-4345-abf2-3bad8e86c46a_1851x1078.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XpG_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb41a4a38-a6a3-4345-abf2-3bad8e86c46a_1851x1078.png 424w, https://substackcdn.com/image/fetch/$s_!XpG_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb41a4a38-a6a3-4345-abf2-3bad8e86c46a_1851x1078.png 848w, https://substackcdn.com/image/fetch/$s_!XpG_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb41a4a38-a6a3-4345-abf2-3bad8e86c46a_1851x1078.png 1272w, https://substackcdn.com/image/fetch/$s_!XpG_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb41a4a38-a6a3-4345-abf2-3bad8e86c46a_1851x1078.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!XpG_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb41a4a38-a6a3-4345-abf2-3bad8e86c46a_1851x1078.png" width="1456" height="848" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b41a4a38-a6a3-4345-abf2-3bad8e86c46a_1851x1078.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:848,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:489565,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb41a4a38-a6a3-4345-abf2-3bad8e86c46a_1851x1078.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XpG_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb41a4a38-a6a3-4345-abf2-3bad8e86c46a_1851x1078.png 424w, https://substackcdn.com/image/fetch/$s_!XpG_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb41a4a38-a6a3-4345-abf2-3bad8e86c46a_1851x1078.png 848w, https://substackcdn.com/image/fetch/$s_!XpG_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb41a4a38-a6a3-4345-abf2-3bad8e86c46a_1851x1078.png 1272w, https://substackcdn.com/image/fetch/$s_!XpG_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb41a4a38-a6a3-4345-abf2-3bad8e86c46a_1851x1078.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>as well as other specialized methods:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mg8i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f407742-0fef-4db6-b846-5e646014d6f8_1931x1116.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!mg8i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f407742-0fef-4db6-b846-5e646014d6f8_1931x1116.png 424w, https://substackcdn.com/image/fetch/$s_!mg8i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f407742-0fef-4db6-b846-5e646014d6f8_1931x1116.png 848w, https://substackcdn.com/image/fetch/$s_!mg8i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f407742-0fef-4db6-b846-5e646014d6f8_1931x1116.png 1272w, https://substackcdn.com/image/fetch/$s_!mg8i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f407742-0fef-4db6-b846-5e646014d6f8_1931x1116.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mg8i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f407742-0fef-4db6-b846-5e646014d6f8_1931x1116.png" width="1456" height="841" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f407742-0fef-4db6-b846-5e646014d6f8_1931x1116.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:841,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:494935,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/158621521?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f407742-0fef-4db6-b846-5e646014d6f8_1931x1116.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mg8i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f407742-0fef-4db6-b846-5e646014d6f8_1931x1116.png 424w, https://substackcdn.com/image/fetch/$s_!mg8i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f407742-0fef-4db6-b846-5e646014d6f8_1931x1116.png 848w, https://substackcdn.com/image/fetch/$s_!mg8i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f407742-0fef-4db6-b846-5e646014d6f8_1931x1116.png 1272w, https://substackcdn.com/image/fetch/$s_!mg8i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f407742-0fef-4db6-b846-5e646014d6f8_1931x1116.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Conclusion</h2><p>LLMs aren&#8217;t the final answer to time-series forecasting, but they&#8217;ve changed the conversation. What started as a tool for language tasks is now being reimagined for sequences of all kinds. Researchers are finding creative ways to adapt pre-trained LLMs to time-series problems, fine-tune them for specific domains, or even build new foundational models trained purely on temporal data. Some of these early results are promising. </p><p>The takeaway? LLMs are giving us a new lens on forecasting problems. They open up ideas that weren&#8217;t obvious before&#8212;ideas about scale, transferability, and reasoning over time. That doesn&#8217;t mean we throw out everything we&#8217;ve learned from classical and deep learning models. It just means we now have another tool to work with&#8212;and a new frontier to explore.</p><div><hr></div><h2>Reference</h2><p>[1] Xue, Hao, and Flora D. Salim. "Promptcast: A new prompt-based learning paradigm for time series forecasting." <em>IEEE Transactions on Knowledge and Data Engineering</em> 36, no. 11 (2023): 6851-6864.</p><p>[2] Gruver, Nate, Marc Finzi, Shikai Qiu, and Andrew G. Wilson. "Large language models are zero-shot time series forecasters." <em>Advances in Neural Information Processing Systems</em> 36 (2023): 19622-19635.</p><p>[3] Liu, Haoxin, Zhiyuan Zhao, Jindong Wang, Harshavardhan Kamarthi, and B. Aditya Prakash. "LSTPrompt: Large Language Models as Zero-Shot Time Series Forecasters by Long-Short-Term Prompting." In <em>Findings of the Association for Computational Linguistics ACL 2024</em>, pp. 7832-7840. 
2024.</p><p>[4] Zhou, Tian, Peisong Niu, Liang Sun, and Rong Jin. "One fits all: Power general time series analysis by pretrained lm." <em>Advances in Neural Information Processing Systems</em> 36 (2023): 43322-43355.</p><p>[5] Jin, Ming, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y. Zhang, Xiaoming Shi, Pin-Yu Chen et al. "Time-LLM: Time Series Forecasting by Reprogramming Large Language Models." In <em>The Twelfth International Conference on Learning Representations</em>, 2024.</p><p>[6] Tan, Mingtian, Mike Merrill, Vinayak Gupta, Tim Althoff, and Tom Hartvigsen. "Are language models actually useful for time series forecasting?" <em>Advances in Neural Information Processing Systems</em> 37 (2024): 60162-60191.</p><p>[7] Chang, Ching, Wei-Yao Wang, Wen-Chih Peng, and Tien-Fu Chen. "Llm4ts: Aligning pre-trained llms as data-efficient time-series forecasters." <em>arXiv preprint arXiv:2308.08469</em> (2023).</p><p>[8] Liu, Yong, Guo Qin, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. "Autotimes: Autoregressive time series forecasters via large language models." <em>Advances in Neural Information Processing Systems</em> 37 (2024): 122154-122184.</p><p>[9] Garza, Azul, Cristian Challu, and Max Mergenthaler-Canseco. "TimeGPT-1." <em>arXiv preprint arXiv:2310.03589</em> (2023).</p><p>[10] Woo, Gerald, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. "Unified Training of Universal Time Series Forecasting Transformers." In <em>International Conference on Machine Learning</em>, pp. 53140-53164. PMLR, 2024.</p><p>[11] Ansari, Abdul Fatir, Lorenzo Stella, Ali Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur et al. "Chronos: Learning the Language of Time Series." <em>Transactions on Machine Learning Research</em>, 2024.</p><p>[12] Shi, X., Wang, S., Nie, Y., Li, D., Ye, Z., Wen, Q., &amp; Jin, M. (2025). Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts. 
In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.</p>]]></content:encoded></item><item><title><![CDATA[The Best of Time-Series Forecasting (Part I): From Seasonal Patterns to Transformer Models ]]></title><description><![CDATA[A collection of notable papers (excluding LLMs) on time-series forecasting, state-of-the-art, and beyond]]></description><link>https://hungleai.substack.com/p/the-best-of-time-series-forecasting</link><guid isPermaLink="false">https://hungleai.substack.com/p/the-best-of-time-series-forecasting</guid><dc:creator><![CDATA[Hung Le]]></dc:creator><pubDate>Mon, 10 Mar 2025 09:57:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!BZ1l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ef3465-c7c6-4870-a084-1806cf03be78_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>From finance to healthcare, energy, and climate science, time-series forecasting is a 
cornerstone of critical decision-making:</p><ul><li><p><strong>Finance:</strong> Accurate market predictions drive billion-dollar trades and risk management.</p></li><li><p><strong>Healthcare:</strong> Monitoring patient vitals helps detect early warning signs, enabling life-saving interventions.</p></li><li><p><strong>Energy:</strong> Power grid demand forecasting helps prevent blackouts and optimize renewable energy integration.</p></li><li><p><strong>Climate Science:</strong> Weather and climate modeling help mitigate the impact of extreme events.</p></li></ul><p>AI has already revolutionized many fields, outperforming humans in complex games, generating realistic images, and crafting coherent text. However, when it comes to time-series forecasting, AI still struggles to keep up.</p><p>In this post, we&#8217;ll explore the evolution of time-series forecasting:<br>&#10004; From <strong>classic statistical methods</strong> like ARIMA and exponential smoothing,<br>&#10004; To <strong>deep learning breakthroughs</strong> with Transformers,<br>&#10004; To the <strong>non-attention</strong> methods that challenge the dominance of Transformers.</p><p>We&#8217;ll dive into the latest research, uncover the biggest challenges, and explore future directions for AI forecasting. 
Because in a world that never stops moving, seeing what&#8217;s coming next is more valuable than ever.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BZ1l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ef3465-c7c6-4870-a084-1806cf03be78_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BZ1l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ef3465-c7c6-4870-a084-1806cf03be78_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!BZ1l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ef3465-c7c6-4870-a084-1806cf03be78_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!BZ1l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ef3465-c7c6-4870-a084-1806cf03be78_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!BZ1l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ef3465-c7c6-4870-a084-1806cf03be78_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BZ1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ef3465-c7c6-4870-a084-1806cf03be78_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23ef3465-c7c6-4870-a084-1806cf03be78_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Thumbnail Image Transformer trying to trace time-series data like stock sequence, black and white cartoon style&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Thumbnail Image Transformer trying to trace time-series data like stock sequence, black and white cartoon style" title="Thumbnail Image Transformer trying to trace time-series data like stock sequence, black and white cartoon style" srcset="https://substackcdn.com/image/fetch/$s_!BZ1l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ef3465-c7c6-4870-a084-1806cf03be78_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!BZ1l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ef3465-c7c6-4870-a084-1806cf03be78_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!BZ1l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ef3465-c7c6-4870-a084-1806cf03be78_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!BZ1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ef3465-c7c6-4870-a084-1806cf03be78_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Transformers can forecast. Source: Copilot.</figcaption></figure></div>
<h4>Table of Contents</h4><ul><li><p><a href="https://hungleai.substack.com/i/155725604/why-is-time-series-forecasting-so-hard">Why Is Time-series Forecasting So Hard?</a></p><ul><li><p><a href="https://hungleai.substack.com/i/155725604/time-series-are-unlike-others">Time Series Are Unlike Others</a></p></li><li><p><a href="https://hungleai.substack.com/i/155725604/the-last-stronghold-of-ai">The Last Stronghold of AI?</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/155725604/classical-time-series-models">Classical Time-Series Models</a></p><ul><li><p><a href="https://hungleai.substack.com/i/155725604/arima-autoregressive-integrated-moving-average">ARIMA (AutoRegressive Integrated Moving Average)</a></p></li><li><p><a href="https://hungleai.substack.com/i/155725604/exponential-smoothing-ets-models">Exponential Smoothing (ETS)</a></p></li><li><p><a href="https://hungleai.substack.com/i/155725604/seasonal-decomposition-and-trend-analysis">Seasonal Decomposition and Trend Analysis</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/155725604/deep-learning-for-time-series-forecasting">Deep Learning for Time-Series Forecasting</a></p><ul><li><p><a href="https://hungleai.substack.com/i/155725604/modern-formulation-of-time-series-forecasting">Modern Formulation of Time-Series Forecasting</a></p></li><li><p><a href="https://hungleai.substack.com/i/155725604/transformer-based-time-series-models">Transformer-Based Time-Series Models</a></p></li><li><p><a href="https://hungleai.substack.com/i/155725604/when-attention-is-not-all-you-need-for-time-series">When Attention is Not All You Need 
for Time Series</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/155725604/conclusion">Conclusion</a></p></li></ul><div><hr></div><h2>Why Is Time-series Forecasting So Hard?</h2><p>It's a question that echoes through research labs and boardrooms alike, a testament to the persistent challenge of predicting the future from the flow of time. The difficulty stems not only from a lack of data but also from the very nature of temporal sequences:</p><ul><li><p><strong>Dynamic and evolving patterns:</strong> Unlike static data, the patterns in a time series shift over time.</p></li><li><p><strong>External disruptions:</strong> Economic shocks, pandemics, and supply chain issues create sudden, unpredictable changes.</p></li><li><p><strong>Long-range dependencies:</strong> Small fluctuations in early data points can have cascading effects much later.</p></li><li><p><strong>Noisy and sparse data:</strong> Real-world time-series datasets are often incomplete or heavily affected by anomalies.</p></li></ul><h4>Time Series Are Unlike Others</h4><p>Beyond these issues, time-series data differ from other data types in their temporal structure. When analyzing a time series, it's crucial to understand two of its properties: trend and seasonality.</p><p><strong>Trend:</strong></p><ul><li><p>A trend represents the long-term movement or direction of a time series. It indicates whether the data generally increases, decreases, or remains constant over an extended period.</p></li><li><p>Trends can be linear (straight line) or non-linear (curved).</p></li></ul><p><strong>Seasonality:</strong></p><ul><li><p>Seasonality refers to recurring, predictable patterns that occur regularly within a time series. 
These patterns are typically influenced by seasonal factors like time of year, day of the week, or time of day.</p></li><li><p>Seasonal patterns repeat with a fixed and known period.</p></li></ul><p>Accurately modeling these properties remains a significant challenge and an active area of research.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h57Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6cc5171-ab12-4b24-9792-66f153d7740c_3840x2160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h57Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6cc5171-ab12-4b24-9792-66f153d7740c_3840x2160.png 424w, https://substackcdn.com/image/fetch/$s_!h57Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6cc5171-ab12-4b24-9792-66f153d7740c_3840x2160.png 848w, https://substackcdn.com/image/fetch/$s_!h57Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6cc5171-ab12-4b24-9792-66f153d7740c_3840x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!h57Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6cc5171-ab12-4b24-9792-66f153d7740c_3840x2160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h57Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6cc5171-ab12-4b24-9792-66f153d7740c_3840x2160.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d6cc5171-ab12-4b24-9792-66f153d7740c_3840x2160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;A time series graph showing the previous data differenced to remove the increasing trend&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A time series graph showing the previous data differenced to remove the increasing trend" title="A time series graph showing the previous data differenced to remove the increasing trend" srcset="https://substackcdn.com/image/fetch/$s_!h57Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6cc5171-ab12-4b24-9792-66f153d7740c_3840x2160.png 424w, https://substackcdn.com/image/fetch/$s_!h57Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6cc5171-ab12-4b24-9792-66f153d7740c_3840x2160.png 848w, https://substackcdn.com/image/fetch/$s_!h57Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6cc5171-ab12-4b24-9792-66f153d7740c_3840x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!h57Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6cc5171-ab12-4b24-9792-66f153d7740c_3840x2160.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Example of time-series data. <a href="https://www.ibm.com/think/topics/arima-model#:~:text=ARIMA%20stands%20for%20Autoregressive%20Integrated,to%20forecasting%20time%20series%20data.">Source</a>. </figcaption></figure></div><h4>The Last Stronghold of AI?</h4><p>Amidst the rapid advancements in artificial intelligence, a crucial domain continues to resist its transformative touch. 
Despite the impressive capabilities demonstrated by large language models (LLMs) across various fields, a particular challenge persists, hinting at a deeper complexity within the nature of prediction.</p><ul><li><p>While LLMs have made huge strides in code generation, document understanding, and even creative writing, time-series forecasting remains one of the last frontiers AI has yet to conquer.</p></li><li><p>The question is: &#129504; <em>Can AI truly learn to predict the time-series future?</em></p></li></ul><p>To examine this empirically, let's analyze recent state-of-the-art (SOTA) time-series forecasting results.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pUcz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a3b01d-169c-475f-91af-3402eed14017_1141x704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pUcz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a3b01d-169c-475f-91af-3402eed14017_1141x704.png 424w, https://substackcdn.com/image/fetch/$s_!pUcz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a3b01d-169c-475f-91af-3402eed14017_1141x704.png 848w, https://substackcdn.com/image/fetch/$s_!pUcz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a3b01d-169c-475f-91af-3402eed14017_1141x704.png 1272w, https://substackcdn.com/image/fetch/$s_!pUcz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a3b01d-169c-475f-91af-3402eed14017_1141x704.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!pUcz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a3b01d-169c-475f-91af-3402eed14017_1141x704.png" width="1141" height="704" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7a3b01d-169c-475f-91af-3402eed14017_1141x704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:704,&quot;width&quot;:1141,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148971,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a3b01d-169c-475f-91af-3402eed14017_1141x704.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pUcz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a3b01d-169c-475f-91af-3402eed14017_1141x704.png 424w, https://substackcdn.com/image/fetch/$s_!pUcz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a3b01d-169c-475f-91af-3402eed14017_1141x704.png 848w, https://substackcdn.com/image/fetch/$s_!pUcz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a3b01d-169c-475f-91af-3402eed14017_1141x704.png 1272w, https://substackcdn.com/image/fetch/$s_!pUcz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a3b01d-169c-475f-91af-3402eed14017_1141x704.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><strong>Forecasting Error of Recent Models. Source: [1]. Blue: ground truth, orange: predicted. </strong></figcaption></figure></div><p>The results reveal a <strong>significant</strong> <strong>challenge</strong>: &#128073;while capturing broad trends is achievable, accurately predicting the nuanced fluctuations of time-series data remains a hurdle. 
This inability to capture detail can lead to significant financial losses in applications like stock market prediction.</p><div><hr></div><h2>Classical Time-Series Models</h2><p>Before diving into deep learning and Transformer-based forecasting models, it&#8217;s essential to understand the traditional statistical methods that have been widely used for decades. These methods remain competitive in many applications, especially when data is limited or interpretability is crucial.</p><p>Classical methods consider a time series as a sequence of observations indexed by time, assuming that past values contain sufficient information to model future values. Mathematically, a time series can be defined as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{y} = \\{ y_1, y_2, \\dots, y_T \\}\\ \\text{where}\\  y_t\\in \\mathbb{R}\n&quot;,&quot;id&quot;:&quot;VBOUMHLYWX&quot;}" data-component-name="LatexBlockToDOM"></div><p>Classical time-series approaches aim to find a precise mathematical model of <em>y<sub>t</sub></em>, representing it as a closed-form function of past data.</p><h4>ARIMA: AutoRegressive Integrated Moving Average</h4><p>ARIMA is a powerful model for <strong>univariate</strong> time-series forecasting that captures patterns in past values and forecast errors. Three parameters define it:</p><ul><li><p><em>p</em> (Autoregression - AR): Number of past values used for forecasting.</p></li><li><p><em>d</em> (Differencing - I): Number of times the series is differenced to remove trends.</p></li><li><p><em>q</em> (Moving Average - MA): Number of past forecast errors included in the model.</p></li></ul><p>The model is often written as &#128073;<strong>ARIMA(</strong><em><strong>p, d, q</strong></em><strong>) </strong>[2]. 
We will go through each step of the method.</p><p><strong>Step 1: Differencing for Stationarity</strong></p><p>Most real-world time series are non-stationary, meaning their statistical properties change over time. To make the series stationary, we apply <em><strong>d</strong></em> levels of differencing. For example, <em>d=1:</em></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y'_t = y_t - y_{t-1}\n&quot;,&quot;id&quot;:&quot;KNSAYACZDW&quot;}" data-component-name="LatexBlockToDOM"></div><p>For second-order differencing (<em>d=2</em>):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y''_t = y'_t - y'_{t-1} = (y_t - y_{t-1}) - (y_{t-1} - y_{t-2})\n&quot;,&quot;id&quot;:&quot;LDVYLJOVJD&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Step 2: Autoregressive (AR) </strong></p><p>The autoregressive part models the relationship between a time step <em>y<sub>t</sub></em> and its previous values:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y_t = \\phi_1 y_{t-1} + \\phi_2 y_{t-2} + \\dots + \\phi_p y_{t-p} + \\epsilon_t\n&quot;,&quot;id&quot;:&quot;SRFQTBCXNN&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>&#981;<sub>i</sub></em> are the AR coefficients and <em>&#1013;<sub>t</sub></em>&#8203; is a white noise error term. We aim to learn <em>&#981;<sub>i</sub>.</em></p><p><strong>Step 3: Moving Average (MA) </strong></p><p>Instead of relying only on past values, the MA component incorporates past forecast errors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y_t = \\mu + \\theta_1 \\epsilon_{t-1} + \\theta_2 \\epsilon_{t-2} + \\dots + \\theta_q \\epsilon_{t-q} + \\epsilon_t\n&quot;,&quot;id&quot;:&quot;JJLELAWYIT&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>&#952;<sub>i</sub></em> are the MA coefficients and need to be learned. 
The errors <em>&#1013;</em> are not observed directly; they are estimated as the residuals between the observed values and the model&#8217;s one-step-ahead predictions.</p><blockquote><p>&#128064; The name "Moving Average" is a bit misleading here, because this component is a weighted sum of past forecast errors, not a moving average of the original signal. </p></blockquote><p><strong>Step 4: Combining Everything &#8211; The ARIMA Model</strong></p><p>The final ARIMA equation combines the three components above into a single relationship between the time series and the ARIMA parameters:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\Phi(B) (1 - B)^d y_t = \\Theta(B) \\epsilon_t\n&quot;,&quot;id&quot;:&quot;CFOPZIVTAD&quot;}" data-component-name="LatexBlockToDOM"></div><p>where:</p><ul><li><p><em>B</em> is the backshift operator, <em>By<sub>t</sub>=y<sub>t&#8722;1</sub></em></p></li><li><p><em>&#934;(B) = 1 &#8722; &#981;<sub>1</sub>B &#8722; &#8230; &#8722; &#981;<sub>p</sub>B<sup>p</sup></em> is the AR polynomial</p></li><li><p><em>&#920;(B) = 1 + &#952;<sub>1</sub>B + &#8230; + &#952;<sub>q</sub>B<sup>q</sup></em> is the MA polynomial</p></li></ul><p>There are several ways to estimate these parameters (maximum likelihood is a common choice), and the procedure is built into standard time-series libraries. For example, in Python with statsmodels:</p><pre><code>import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt

# Sample Time Series (replace with your data)
data = [50, 52, 55, 58, 61, 63, 65, 68, 70, 72]
time_series = pd.Series(data)

# Fit ARIMA(1, 1, 1) model (adjust p, d, q as needed)
model = ARIMA(time_series, order=(1, 1, 1))
model_fit = model.fit()

# Forecast next 3 values
forecast = model_fit.get_forecast(steps=3)
forecast_values = forecast.predicted_mean

# Plot results
plt.plot(time_series, label='Original')
# predicted_mean is a Series indexed by the forecast steps, so plot it against its own index
plt.plot(forecast_values.index, forecast_values, color='red', label='Forecast')
plt.legend()
plt.show()

print(forecast_values)  # forecasts for the next 3 steps</code></pre><h4>Exponential Smoothing (ETS Models)</h4><p>Exponential Smoothing methods forecast future values by giving exponentially <strong>decreasing weights</strong> to past observations, so recent observations influence the forecast the most. Extensions of the basic method add explicit trend and seasonal components. </p><p>The simplest variant, simple exponential smoothing, is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{y}_{t+1} = \\alpha y_t + (1 - \\alpha) \\hat{y}_t\n&quot;,&quot;id&quot;:&quot;BRMPJVSAJQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>&#945;</em> (between 0 and 1) is the smoothing parameter and <em>y&#770;<sub>t</sub></em> is the <strong>forecasted value</strong> for step <em>t</em>. </p><p><strong>&#128073;Holt&#8217;s Linear Trend Model </strong>[3]</p><p>Holt&#8217;s Linear Trend model forecasts time series data that exhibits both a level and a linear trend. It extends simple exponential smoothing to capture the direction and magnitude of changes over time.</p><p>These core equations define the model:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;l_t = \\alpha y_t + (1 - \\alpha) (l_{t-1} + b_{t-1})\n&quot;,&quot;id&quot;:&quot;KLLSXELZLG&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;b_t = \\beta (l_t - l_{t-1}) + (1 - \\beta) b_{t-1}\n&quot;,&quot;id&quot;:&quot;TGKNAIQMFH&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{y}_{t+h} = l_t + h b_t\n&quot;,&quot;id&quot;:&quot;DOKASIRSHU&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Role of </strong><em><strong>l<sub>t</sub></strong></em><strong>:</strong> The level component, denoted as <em>l<sub>t</sub></em>, represents the smoothed average of the time series at time <em>t</em>. 
It's essentially our estimate of the "current" value, stripped of short-term fluctuations.</p><p><strong>Role of </strong><em><strong>b<sub>t</sub></strong></em><strong>:</strong> The trend component, denoted as <em>b<sub>t</sub></em>, represents the estimated slope of the time series at time <em>t, </em>embedded in the term <em>l<sub>t</sub>-l<sub>t-1</sub></em>. It captures the rate of change (increase or decrease) in the data. </p><p>The final forecast equation  combines the level and trend to project future values of <em>h</em> steps ahead. This model separates the time series into its underlying level and trend components. This allows for more accurate forecasting of data with linear trends. By adjusting the smoothing parameters <em>&#945;</em> and <em>&#946;</em>, we can control the model's responsiveness to recent changes and fine-tune its performance.</p><p>As shown in the table below [10], it's surprising that simple methods like ARIMA remain competitive with complex deep learning models, explaining their continued widespread use in industrial applications.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VqUX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0df908-9be2-4ffe-be07-677304090886_2085x522.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VqUX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0df908-9be2-4ffe-be07-677304090886_2085x522.png 424w, https://substackcdn.com/image/fetch/$s_!VqUX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0df908-9be2-4ffe-be07-677304090886_2085x522.png 848w, 
https://substackcdn.com/image/fetch/$s_!VqUX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0df908-9be2-4ffe-be07-677304090886_2085x522.png 1272w, https://substackcdn.com/image/fetch/$s_!VqUX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0df908-9be2-4ffe-be07-677304090886_2085x522.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VqUX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0df908-9be2-4ffe-be07-677304090886_2085x522.png" width="1456" height="365" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf0df908-9be2-4ffe-be07-677304090886_2085x522.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:365,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:187833,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0df908-9be2-4ffe-be07-677304090886_2085x522.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VqUX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0df908-9be2-4ffe-be07-677304090886_2085x522.png 424w, 
https://substackcdn.com/image/fetch/$s_!VqUX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0df908-9be2-4ffe-be07-677304090886_2085x522.png 848w, https://substackcdn.com/image/fetch/$s_!VqUX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0df908-9be2-4ffe-be07-677304090886_2085x522.png 1272w, https://substackcdn.com/image/fetch/$s_!VqUX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf0df908-9be2-4ffe-be07-677304090886_2085x522.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Yet, classical methods still have several inherent limitations:</p><p>&#10060; <strong>Stationarity Requirement:</strong> ARIMA models primarily work with stationary time series. If the data is not stationary (i.e., it has trends or seasonality), it needs to be transformed through differencing, which can sometimes lead to loss of information.</p><p>&#10060; <strong>Linearity Assumption: </strong>ARIMA and ETS models assume that the relationships within the time series are linear. They may not effectively capture non-linear patterns or complex dependencies.</p><p>&#10060; <strong>Parameter Selection: </strong>Determining the optimal values for these models' hyperparameters (e.g., <em>p, d, q, &#945;, &#946;</em>) can be challenging. It often involves a trial-and-error process and requires expertise in time series analysis.</p><p>&#10060; <strong>Limited to Univariate Data: </strong>These models are designed for univariate time series, meaning they can only model a single variable. 
They cannot directly handle multiple related time series.</p><h4>Seasonal Decomposition and Trend Analysis</h4><p>Classical seasonal decomposition breaks a time series into three components:</p><ol><li><p><strong>Trend (</strong><em><strong>T<sub>t&#8203;</sub></strong></em><strong>)</strong>: Long-term pattern.</p></li><li><p><strong>Seasonality (</strong><em><strong>S<sub>t</sub></strong></em><strong>&#8203;)</strong>: Repeating periodic variations.</p></li><li><p><strong>Residual (</strong><em><strong>R<sub>t</sub>&#8203;</strong></em><strong>)</strong>: Unexplained random variations or fluctuations.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mSOo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdad7ed13-7649-4ca7-ab13-ed513f2d1626_1021x510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mSOo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdad7ed13-7649-4ca7-ab13-ed513f2d1626_1021x510.png 424w, https://substackcdn.com/image/fetch/$s_!mSOo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdad7ed13-7649-4ca7-ab13-ed513f2d1626_1021x510.png 848w, https://substackcdn.com/image/fetch/$s_!mSOo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdad7ed13-7649-4ca7-ab13-ed513f2d1626_1021x510.png 1272w, https://substackcdn.com/image/fetch/$s_!mSOo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdad7ed13-7649-4ca7-ab13-ed513f2d1626_1021x510.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!mSOo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdad7ed13-7649-4ca7-ab13-ed513f2d1626_1021x510.png" width="1021" height="510" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dad7ed13-7649-4ca7-ab13-ed513f2d1626_1021x510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:510,&quot;width&quot;:1021,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Time Series Analysis: Understanding Seasonality and Cyclicality | by Aaweg  I | Medium&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Time Series Analysis: Understanding Seasonality and Cyclicality | by Aaweg  I | Medium" title="Time Series Analysis: Understanding Seasonality and Cyclicality | by Aaweg  I | Medium" srcset="https://substackcdn.com/image/fetch/$s_!mSOo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdad7ed13-7649-4ca7-ab13-ed513f2d1626_1021x510.png 424w, https://substackcdn.com/image/fetch/$s_!mSOo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdad7ed13-7649-4ca7-ab13-ed513f2d1626_1021x510.png 848w, https://substackcdn.com/image/fetch/$s_!mSOo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdad7ed13-7649-4ca7-ab13-ed513f2d1626_1021x510.png 1272w, 
https://substackcdn.com/image/fetch/$s_!mSOo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdad7ed13-7649-4ca7-ab13-ed513f2d1626_1021x510.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Trend and Seasonal in a time series. <a href="https://ogre51.medium.com/in-time-series-forecasting-what-do-you-think-is-the-difference-between-seasonality-and-cyclicity-f4e8d9523d24">Source</a>. 
</figcaption></figure></div><p>If we assume an additive model for the decomposition, each time step can be written as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y_t = T_t + S_t + R_t\n&quot;,&quot;id&quot;:&quot;GFAYMWNZWR&quot;}" data-component-name="LatexBlockToDOM"></div><p>Additive models assume that the size of the seasonal and residual fluctuations does not depend on the level of the series. When the fluctuations grow with the level, a multiplicative model is more appropriate:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y_t = T_t \\times S_t \\times R_t\n&quot;,&quot;id&quot;:&quot;XQLDGJAXMG&quot;}" data-component-name="LatexBlockToDOM"></div><p>The next task is to determine each component in the model. For example, we can estimate the trend using a centered moving average:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;T_t = \\frac{1}{m} \\sum_{i=-k}^{k} y_{t+i}\n&quot;,&quot;id&quot;:&quot;BAKRVJEFOV&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>m</em> = 2<em>k</em>+1 is the window length, typically chosen to match the seasonal period. Then, we can remove the trend and estimate seasonality from the detrended series. In practice, the detrended values at the same position in each seasonal cycle are averaged; otherwise the residual computed in the last step would be identically zero:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;S_t = y_t - T_t\n&quot;,&quot;id&quot;:&quot;DMOJECLNSZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Or for multiplicative decomposition:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;S_t = y_t /T_t\n&quot;,&quot;id&quot;:&quot;FSVWOXDNUZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Finally, for the additive model, we can compute the <strong>residual</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;R_t = y_t - (T_t + S_t)\n&quot;,&quot;id&quot;:&quot;PLYSWRZGTX&quot;}" data-component-name="LatexBlockToDOM"></div><p>Seasonality is a fundamental characteristic of many time-series datasets, where patterns repeat at regular intervals. 
Classical time-series analysis leverages autocorrelation techniques to detect and quantify seasonal effects. Autocorrelation measures the linear relationship between a time series and its lagged versions, helping identify repeating cycles and dependencies over time.</p><p>The classic <strong>autocorrelation function (ACF)</strong> for a lag <em>k</em> in a time-series <em>y<sub>t</sub></em> of length <em>N</em> is defined as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;r_k = \\frac{\\sum_{t=1}^{N-k} (y_t - \\bar{y})(y_{t+k} - \\bar{y})}{\\sum_{t=1}^{N} (y_t - \\bar{y})^2}\n&quot;,&quot;id&quot;:&quot;DHYAOLLVYI&quot;}" data-component-name="LatexBlockToDOM"></div><p>where y&#772;  is the mean of the time series. A significant peak in the ACF at lag <em>k=s</em> suggests a <strong>seasonal pattern</strong> of period <em>s</em>.</p><p>As we will see later, these basic analyses will be adopted in the modern deep learning approach and play a crucial role in ensuring good forecasting performance. </p><div><hr></div><h2>Deep Learning for Time-Series Forecasting</h2><p>Traditional time-series models like ARIMA, ETS, and seasonal decomposition assume that the underlying process is <strong>stationary</strong> or can be transformed into a stationary form. These models rely on linear dependencies and handcrafted features, making them effective for simple and well-structured time series. However, as real-world applications grow in complexity, these assumptions often break down.</p><p>Modern time-series forecasting embraces a more <strong>data-driven</strong> and <strong>high-dimensional</strong> approach, leveraging machine learning and deep learning techniques. Instead of relying on predefined structures, modern models learn complex dependencies directly from data, making them more adaptable to dynamic, non-linear patterns. 
This shift also accommodates <strong>multivariate time series</strong>, where multiple interdependent variables evolve together over time.</p><h4>Modern Formulation of Time-Series Forecasting</h4><p>A time series is a sequence of time steps, where each step consists of <em>D</em> correlated variables (i.e., a multivariate time series):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\mathbf{X} = \\{\\mathbf{x}_1, \\mathbf{x}_2, \\dots, \\mathbf{x}_{T_{\\text{total}}} \\}\n\n\\ \\text{where} \\ \\mathbf{x}_t \\in \\mathbb{R}^{D}.\n&quot;,&quot;id&quot;:&quot;SPZSWWJCBU&quot;}" data-component-name="LatexBlockToDOM"></div><blockquote><p>&#128064; There are other specific kinds of time series. For example, univariate time series: <em>D=1, </em>or multi-modal time series with <em>M </em>as the number of modalities:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{X} = \\{ \\mathbf{X}^{(1)}, \\mathbf{X}^{(2)}, \\dots, \\mathbf{X}^{(M)} \\}\n&quot;,&quot;id&quot;:&quot;CFLVWFMTXN&quot;}" data-component-name="LatexBlockToDOM"></div></blockquote><p>Theoretically, we can consider all the past steps to forecast future values of the time series. However, it is hard and inconvenient to learn from long sequences of variable length. Thus, we construct training samples by <strong>sliding a window of fixed size</strong> over the time series. For each valid index <em>t</em>, we define the <em>i</em>-th data sample:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{X}_{\\text{t}}^{(i)} = \\{\\mathbf{x}_{t-L+1}, \\dots, \\mathbf{x}_t\\}, \\quad \\mathbf{X}_{\\text{t}}^{(i)} \\in \\mathbb{R}^{L \\times D}\n&quot;,&quot;id&quot;:&quot;KJGTJSMLBN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Time-series forecasting now involves learning a function that maps a historical window of <em>L</em> time steps to a future window of <em>T</em> time steps. 
Specifically, our goal is to predict the next <em>T </em>steps given the past <em>L</em> steps:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{X}_{\\text{future}}^{(i)} = \\{\\mathbf{x}_{t+1}, \\dots, \\mathbf{x}_{t+T}\\}, \\quad \\mathbf{X}_{\\text{future}}^{(i)} \\in \\mathbb{R}^{T \\times D}\n&quot;,&quot;id&quot;:&quot;MSHWEFUDEW&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p><strong>Objective:</strong> We want to learn a function <em>f<sub>&#952;</sub></em>&#8203; that predicts future values given past input time series:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{X}_{\\text{future}} = f_{\\theta}(\\mathbf{X}_{\\text{t}})\n&quot;,&quot;id&quot;:&quot;IFSCMVTGCK&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><h4>Transformer-Based Time-Series Models</h4><p>It is natural for deep learning to reshape time-series forecasting, but the recurrent paradigm&#8212;RNNs, LSTMs&#8212;simply doesn't cut it for complex, long-horizon problems. Long-term dependencies? Computational overhead? These are well-documented issues. That's why the Transformer, a model born in the NLP domain, has become a serious contender. Its self-attention mechanism provides a powerful tool for capturing those elusive long-range dependencies. </p><p>Anyone who's attempted to apply vanilla Transformers to a time series will attest to the inherent scalability issues. Beyond that, a fundamental mismatch exists: Transformers were built for the discrete, data-rich world of text, while time-series data is continuous and often comparatively scarce. 
To overcome this, the community has rallied, producing a wave of specialized Transformer architectures tailored to the nuances of time-series data.</p><p><strong>&#128073;PatchTST: Simple Extension of Transformer to Time Series </strong>[6]</p><p><strong>PatchTST </strong>directly applies the Transformer to time-series data with two modifications:</p><ol><li><p><strong>Segmentation into Patches</strong>: The method segments time-series data into subseries-level patches, which serve as input tokens to the Transformer model. This patching mechanism preserves local semantic information within each segment, enabling the model to capture fine-grained details that may otherwise be missed when each time step is treated as a separate token.</p></li><li><p><strong>Channel-Independence</strong>: Each input channel contains a single univariate time series, and the channels are processed independently. This setup allows all channels to share the same embedding and Transformer weights, promoting computational efficiency and reducing the complexity of the model. 
By minimizing the number of parameters that need to be learned, this approach ensures that the model remains scalable across different time-series datasets.</p></li></ol><p>The ideas can be summarized in the diagram below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2_HJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd9b049-948b-4b2b-8fdd-70e3fcdafe5f_1217x881.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2_HJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd9b049-948b-4b2b-8fdd-70e3fcdafe5f_1217x881.png 424w, https://substackcdn.com/image/fetch/$s_!2_HJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd9b049-948b-4b2b-8fdd-70e3fcdafe5f_1217x881.png 848w, https://substackcdn.com/image/fetch/$s_!2_HJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd9b049-948b-4b2b-8fdd-70e3fcdafe5f_1217x881.png 1272w, https://substackcdn.com/image/fetch/$s_!2_HJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd9b049-948b-4b2b-8fdd-70e3fcdafe5f_1217x881.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2_HJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd9b049-948b-4b2b-8fdd-70e3fcdafe5f_1217x881.png" width="1217" height="881" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/efd9b049-948b-4b2b-8fdd-70e3fcdafe5f_1217x881.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:881,&quot;width&quot;:1217,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:179210,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd9b049-948b-4b2b-8fdd-70e3fcdafe5f_1217x881.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2_HJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd9b049-948b-4b2b-8fdd-70e3fcdafe5f_1217x881.png 424w, https://substackcdn.com/image/fetch/$s_!2_HJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd9b049-948b-4b2b-8fdd-70e3fcdafe5f_1217x881.png 848w, https://substackcdn.com/image/fetch/$s_!2_HJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd9b049-948b-4b2b-8fdd-70e3fcdafe5f_1217x881.png 1272w, https://substackcdn.com/image/fetch/$s_!2_HJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd9b049-948b-4b2b-8fdd-70e3fcdafe5f_1217x881.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">PatchTST architecture. Source: [6].</figcaption></figure></div><p></p><p><strong>&#128073;Autoformer: Decomposition-Based Transformer for Long-Term Forecasting </strong>[4]</p><p>Inspired by classical methods, Autoformer introduces <strong>series decomposition</strong> into the Transformer architecture to separate <strong>trend</strong> and <strong>seasonality</strong>, improving efficiency and interpretability. 
Instead of relying on direct self-attention over raw time-series data, it models a time series <em>X</em> as the sum of a trend part <em>X<sub>t</sub></em> and a seasonal part <em>X<sub>s</sub></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{X} = \\mathcal{X}_t + \\mathcal{X}_s \n&quot;,&quot;id&quot;:&quot;WFTQHQVWQE&quot;}" data-component-name="LatexBlockToDOM"></div><p>where:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RxLB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F380655ed-ed56-4bbb-9151-68bd153e8749_429x97.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RxLB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F380655ed-ed56-4bbb-9151-68bd153e8749_429x97.png 424w, https://substackcdn.com/image/fetch/$s_!RxLB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F380655ed-ed56-4bbb-9151-68bd153e8749_429x97.png 848w, https://substackcdn.com/image/fetch/$s_!RxLB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F380655ed-ed56-4bbb-9151-68bd153e8749_429x97.png 1272w, https://substackcdn.com/image/fetch/$s_!RxLB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F380655ed-ed56-4bbb-9151-68bd153e8749_429x97.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RxLB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F380655ed-ed56-4bbb-9151-68bd153e8749_429x97.png" width="295" height="66.7016317016317" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/380655ed-ed56-4bbb-9151-68bd153e8749_429x97.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:97,&quot;width&quot;:429,&quot;resizeWidth&quot;:295,&quot;bytes&quot;:19482,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2ca084-c472-4fa8-b394-36a904c2a8aa_429x97.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RxLB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F380655ed-ed56-4bbb-9151-68bd153e8749_429x97.png 424w, https://substackcdn.com/image/fetch/$s_!RxLB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F380655ed-ed56-4bbb-9151-68bd153e8749_429x97.png 848w, https://substackcdn.com/image/fetch/$s_!RxLB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F380655ed-ed56-4bbb-9151-68bd153e8749_429x97.png 1272w, https://substackcdn.com/image/fetch/$s_!RxLB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F380655ed-ed56-4bbb-9151-68bd153e8749_429x97.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This decomposition forms a basic computation block named SeriesDecomp. 
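</p><p>The idea behind SeriesDecomp can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the trend is taken to be a moving average over an odd-sized window, the edges are padded by repeating the first and last values so the output keeps the input length, and the seasonal part is whatever remains. The names <em>series_decomp</em> and <em>kernel_size</em> are hypothetical.</p><pre>
```python
def series_decomp(x, kernel_size=5):
    """Split a 1-D series into (trend, seasonal) with a moving average.

    Assumptions (illustrative only): kernel_size is odd, and the ends are
    padded by repeating the edge values so the trend has the input length.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size is assumed to be odd")
    half = kernel_size // 2
    padded = [x[0]] * half + list(x) + [x[-1]] * half
    # Trend: average of each window of kernel_size consecutive values.
    trend = [
        sum(padded[i:i + kernel_size]) / kernel_size
        for i in range(len(x))
    ]
    # Seasonal: the residual once the trend is removed.
    seasonal = [xi - ti for xi, ti in zip(x, trend)]
    return trend, seasonal


# Toy input: a ramp with an alternating +1 / -1 wiggle on top.
series = [float(t) + (1.0 if t % 2 == 0 else -1.0) for t in range(12)]
trend, seasonal = series_decomp(series, kernel_size=5)

# Adding the two parts back recovers the input up to floating-point rounding.
reconstructed = [t + s for t, s in zip(trend, seasonal)]
```
</pre><p>On this toy input the trend is a smoother ramp and the seasonal part carries most of the remaining oscillation. A larger <em>kernel_size</em> gives a smoother trend at the cost of stronger edge effects from the padding.</p><p>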
Another important block of Autoformer is the Auto-Correlation Block, which is designed to efficiently capture period-based dependencies in time-series data. Unlike traditional self-attention mechanisms that rely on pairwise dot-product comparisons, Autoformer leverages autocorrelation analysis to identify repeating patterns in the data and aggregate information across time-delayed sub-series. The Auto-Correlation function is defined as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;R_{X X} (\\tau) = \\lim_{L \\to \\infty} \\frac{1}{L} \\sum_{t=1}^{L} X_t X_{t-\\tau}\n&quot;,&quot;id&quot;:&quot;XNXUBNZDWY&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>R<sub>XX</sub>(&#964;)</em> measures the similarity between the original series and its copy lagged by <em>&#964;</em> steps. Peaks in <em>R<sub>XX</sub>(&#964;)</em> indicate potential periodic structures within the data. Autoformer selects the <strong>top-k most significant periods</strong> by computing:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tau_1, \\dots, \\tau_k = \\underset{\\tau \\in \\{1, \\dots, L\\}}{\\arg\\text{Topk}}\\, R_{Q,K} (\\tau)\n&quot;,&quot;id&quot;:&quot;JGSFOUHRIY&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>R<sub>Q,K</sub>(&#964;)</em> is the autocorrelation function computed between the query <em>Q</em> and key <em>K</em> corresponding to <em>X</em>. The number of selected periods, <em>k</em>, is defined as <em>k = &#8970;c &#215; log L&#8971;</em>, where <em>c</em> is a hyperparameter. 
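</p><p>The period selection described above can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: it correlates a single finite series with itself (so <em>Q</em> and <em>K</em> are the same sequence), wraps around at the boundary, and uses the direct <em>O(L<sup>2</sup>)</em> sum for readability. The function names are hypothetical.</p><pre>
```python
import math


def autocorrelation(x):
    """Circular autocorrelation R(tau) for tau = 0 .. L-1 of a finite series."""
    L = len(x)
    return [
        sum(x[t] * x[(t - tau) % L] for t in range(L)) / L
        for tau in range(L)
    ]


def top_k_lags(x, c=1.0):
    """Return the k = floor(c * log L) non-zero lags with the largest R(tau)."""
    L = len(x)
    k = max(1, math.floor(c * math.log(L)))
    r = autocorrelation(x)
    # Lag 0 is skipped: a series always correlates perfectly with itself.
    lags = sorted(range(1, L), key=lambda tau: r[tau], reverse=True)
    return lags[:k]


# A series that repeats every 4 steps, three times over (L = 12).
series = [0.0, 1.0, 0.0, -1.0] * 3
lags = top_k_lags(series, c=1.0)
# lags is [4, 8]: floor(log 12) = 2 lags, both multiples of the true period.
```
</pre><p>For this toy series, which repeats every 4 steps, the two selected lags are 4 and 8, both multiples of the true period. 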
</p><p>Once the top-k periodic dependencies are identified, Autoformer aligns and aggregates similar sub-series by shifting the time-series values according to the detected period lags:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Auto-Correlation}(Q, K, V) = \\sum_{i=1}^{k} \\text{Roll}(V, \\tau_i) \\hat{R}_{Q,K} (\\tau_i)\n&quot;,&quot;id&quot;:&quot;LFXHWIQSBI&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p><strong>Roll(V,</strong><em>&#964;</em><strong>) </strong>shifts the value sequence  by delay <em>&#964;</em>, ensuring elements that are shifted beyond the first position are re-introduced at the last position</p></li><li><p><strong>R&#770;</strong> represents the softmax-normalized autocorrelation scores acting as attention weights.</p></li></ul><blockquote><p>&#128064; The process output is expected to amplify signals by effectively leveraging seasonal patterns.</p></blockquote><p>Given the SeriesDecomp and Auto-Correlation Block, Autoformer stacks these blocks together, forming Encoder-Decoder architectures as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WDdq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4aef0db-3377-4242-99ae-6b7af652bc61_1282x556.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WDdq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4aef0db-3377-4242-99ae-6b7af652bc61_1282x556.png 424w, https://substackcdn.com/image/fetch/$s_!WDdq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4aef0db-3377-4242-99ae-6b7af652bc61_1282x556.png 848w, 
https://substackcdn.com/image/fetch/$s_!WDdq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4aef0db-3377-4242-99ae-6b7af652bc61_1282x556.png 1272w, https://substackcdn.com/image/fetch/$s_!WDdq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4aef0db-3377-4242-99ae-6b7af652bc61_1282x556.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WDdq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4aef0db-3377-4242-99ae-6b7af652bc61_1282x556.png" width="1282" height="556" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4aef0db-3377-4242-99ae-6b7af652bc61_1282x556.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:556,&quot;width&quot;:1282,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:182989,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4aef0db-3377-4242-99ae-6b7af652bc61_1282x556.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WDdq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4aef0db-3377-4242-99ae-6b7af652bc61_1282x556.png 424w, 
https://substackcdn.com/image/fetch/$s_!WDdq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4aef0db-3377-4242-99ae-6b7af652bc61_1282x556.png 848w, https://substackcdn.com/image/fetch/$s_!WDdq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4aef0db-3377-4242-99ae-6b7af652bc61_1282x556.png 1272w, https://substackcdn.com/image/fetch/$s_!WDdq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4aef0db-3377-4242-99ae-6b7af652bc61_1282x556.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Autoformer architecture. The Encoder focuses on seasonality representations, while the Decoder decomposes and reconstructs the final time-series prediction. Source: [4].</figcaption></figure></div><p></p><p><strong>&#128073;FEDformer: Frequency Domain Transformer for Time-Series Forecasting [5]</strong></p><p>Vanilla Transformers struggle to capture global temporal patterns in time series and incur substantial computational costs. FEDformer addresses these challenges by integrating frequency-domain analysis into a computation framework similar to Autoformer&#8217;s. </p><p>The first improvement is the Decomposition block, where FEDformer introduces a Mixture of Experts for Seasonal-Trend Decomposition. In this approach, the trend is computed as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;X_{\\text{trend}} = \\text{Softmax}(L(x)) \\cdot F(x)\n&quot;,&quot;id&quot;:&quot;AHDFBYWXYW&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>F(&#183;)</em> applies a collection of average pooling filters, and <em>L(x)</em> calculates the weights used to combine the resulting trend features.</p><p>The second improvement is the Frequency-Enhanced Block. 
Because time-series data frequently contain cyclic patterns that can be difficult to model directly, the paper proposes using the Discrete Fourier Transform (DFT) to decompose a signal into frequency components, making the dominant periodic patterns easier to identify.</p><p>Given a time-series sequence <em>x<sub>n</sub></em> with <em>n=0,1,&#8230;,N&#8722;1</em>, the DFT transforms it into the frequency domain:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;X_l = \\sum_{n=0}^{N-1} x_n e^{-i\\omega ln}\n&quot;,&quot;id&quot;:&quot;QSUEOGMVWW&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>i</em> is the imaginary unit, and <em>X<sub>l</sub></em> represents the frequency components of the sequence. The inverse DFT reconstructs the original signal:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_n = \\sum_{l=0}^{L-1} X_l e^{i\\omega ln}\n&quot;,&quot;id&quot;:&quot;LJUCUMCEDN&quot;}" data-component-name="LatexBlockToDOM"></div><blockquote><p>&#128064; A strength of DFT-based frequency analysis is that, with the <a href="https://en.wikipedia.org/wiki/Fast_Fourier_transform">Fast Fourier Transform (FFT)</a>, the computation can be reduced from <em>O(N<sup>2</sup>)</em> to <em>O(N log N)</em>, and when only a subset of frequencies is selected, it can be further reduced to <em>O(N)</em>.</p></blockquote><p>Given these operators, the Frequency-Enhanced Block (FEB-f) applies FFT to transform inputs into the frequency domain, selects dominant modes, and applies a learned transformation before returning to the time domain to highlight the underlying dominant frequencies. 
The processing steps include:</p><ol><li><p><strong>Linear Projection:</strong> Transform the time-series input <em>x &#8712; R<sup>N&#215;D</sup></em> using a learnable weight matrix <em>w</em>:</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;q = x \\cdot w\n&quot;,&quot;id&quot;:&quot;GIRMCYUSTM&quot;}" data-component-name="LatexBlockToDOM"></div><ol start="2"><li><p><strong>Fourier Transform &amp; Mode Selection:</strong> Convert <em>q</em> to the frequency domain and retain only <em>M</em> dominant modes:</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q = \\mathcal{F}(q), \\quad Q' = \\text{Select}(Q)\n&quot;,&quot;id&quot;:&quot;NOMWJSYVHO&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, Select(&#183;) is the function that keeps <em>M</em> of the frequency values <em>X<sub>l</sub></em>. </p><ol start="3"><li><p><strong>Frequency Domain Processing:</strong> Apply a learnable transformation <em>R</em> (a parameterized kernel) to enhance the signal:</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Y = Q' \\odot R\n&quot;,&quot;id&quot;:&quot;MKATZCIXHK&quot;}" data-component-name="LatexBlockToDOM"></div><p>where &#8857; denotes element-wise multiplication.</p><ol start="4"><li><p><strong>Inverse Fourier Transform:</strong> Zero-pad <em>Y</em> and apply the inverse Fourier transform to return to the time domain:</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{FEB-f}(q) = \\mathcal{F}^{-1}(\\text{Padding}(Y))\n&quot;,&quot;id&quot;:&quot;NQYIJRSURI&quot;}" data-component-name="LatexBlockToDOM"></div><p>The third novel module is Frequency-Enhanced Attention (FEA-f), which aims to capture the relationship between frequency-enhanced signals (trend and seasonal). 
Given queries <em>q</em>, keys <em>k</em>, and values <em>v</em>, the computation steps are:</p><ol><li><p><strong>Fourier Transform &amp; Mode Selection: </strong>As in Step 2 of FEB-f, but applied to each of <em>q</em>, <em>k</em>, and <em>v</em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q' = \\text{Select}(\\mathcal{F}(q)), \\quad K' = \\text{Select}(\\mathcal{F}(k)), \\quad V' = \\text{Select}(\\mathcal{F}(v))\n&quot;,&quot;id&quot;:&quot;BNFBFZBCUJ&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p><strong>Frequency-Space Attention Calculation:</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Y = \\sigma(Q' K'^T) V'\n&quot;,&quot;id&quot;:&quot;OKBCEDFXEG&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>&#963;</em> is an activation function (e.g., softmax or tanh).</p></li><li><p><strong>Inverse Transform to Time Domain:</strong></p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{FEA-f}(q, k, v) = \\mathcal{F}^{-1}(\\text{Padding}(Y))\n&quot;,&quot;id&quot;:&quot;ELNAKSFAUP&quot;}" data-component-name="LatexBlockToDOM"></div><blockquote><p>&#128064; The paper also proposes wavelet counterparts of FEB and FEA, which replace the Fourier Transform with the <a href="https://en.wikipedia.org/wiki/Wavelet_transform">Wavelet Transform</a>, yielding the FEB-w and FEA-w blocks, respectively.</p></blockquote><p>The FEDformer framework resembles Autoformer&#8217;s, with the new blocks swapped in:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DRcW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac294042-19ff-4ce1-968b-dbae50d7aa1f_1792x582.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!DRcW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac294042-19ff-4ce1-968b-dbae50d7aa1f_1792x582.png 424w, https://substackcdn.com/image/fetch/$s_!DRcW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac294042-19ff-4ce1-968b-dbae50d7aa1f_1792x582.png 848w, https://substackcdn.com/image/fetch/$s_!DRcW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac294042-19ff-4ce1-968b-dbae50d7aa1f_1792x582.png 1272w, https://substackcdn.com/image/fetch/$s_!DRcW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac294042-19ff-4ce1-968b-dbae50d7aa1f_1792x582.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DRcW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac294042-19ff-4ce1-968b-dbae50d7aa1f_1792x582.png" width="1456" height="473" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac294042-19ff-4ce1-968b-dbae50d7aa1f_1792x582.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:473,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:150448,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac294042-19ff-4ce1-968b-dbae50d7aa1f_1792x582.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DRcW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac294042-19ff-4ce1-968b-dbae50d7aa1f_1792x582.png 424w, https://substackcdn.com/image/fetch/$s_!DRcW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac294042-19ff-4ce1-968b-dbae50d7aa1f_1792x582.png 848w, https://substackcdn.com/image/fetch/$s_!DRcW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac294042-19ff-4ce1-968b-dbae50d7aa1f_1792x582.png 1272w, https://substackcdn.com/image/fetch/$s_!DRcW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac294042-19ff-4ce1-968b-dbae50d7aa1f_1792x582.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">FEDformer architecture. Source: [5].</figcaption></figure></div><h4>When Attention is Not All You Need for Time Series</h4><p>We've explored the allure of Transformers in time-series forecasting, marveling at their ability to capture intricate dependencies. However, a crucial question remains: &#129504; <em>Does the complexity of Transformers always translate to superior performance?</em> A growing body of recent research suggests that the answer may well be 'no.'</p><p>While Transformers excel at capturing long-range dependencies through self-attention, they come with inherent drawbacks. These include:</p><ul><li><p><strong>Computational Cost:</strong> The quadratic complexity of self-attention <em>(O(n<sup>2</sup>))</em> can be prohibitive for long time series, demanding significant computational resources and time.</p></li><li><p><strong>Data Hunger:</strong> Transformers, with their vast number of parameters, typically require massive datasets to train effectively. In many real-world time-series scenarios, such extensive data may not be readily available.</p></li><li><p><strong>Overfitting Risk:</strong> The flexibility of Transformers can lead to overfitting, particularly when dealing with noisy or short time series.</p></li></ul><p><strong>&#128073;DLinear: A Surprisingly Simple Challenger to Transformer Models in Time-Series Forecasting [7]</strong></p><p>DLinear applies basic forecasting principles in a deliberately simple way, and in specific settings it outperforms some Transformer-based models. 
Concretely, each channel of the time-series input sequence is treated as a whole and provided as input to a simple model (e.g., linear transformation):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gAd1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf45b8a-2393-4b16-999c-df7eb260b99f_925x524.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gAd1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf45b8a-2393-4b16-999c-df7eb260b99f_925x524.png 424w, https://substackcdn.com/image/fetch/$s_!gAd1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf45b8a-2393-4b16-999c-df7eb260b99f_925x524.png 848w, https://substackcdn.com/image/fetch/$s_!gAd1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf45b8a-2393-4b16-999c-df7eb260b99f_925x524.png 1272w, https://substackcdn.com/image/fetch/$s_!gAd1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf45b8a-2393-4b16-999c-df7eb260b99f_925x524.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gAd1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf45b8a-2393-4b16-999c-df7eb260b99f_925x524.png" width="486" height="275.31243243243244" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/daf45b8a-2393-4b16-999c-df7eb260b99f_925x524.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:524,&quot;width&quot;:925,&quot;resizeWidth&quot;:486,&quot;bytes&quot;:126416,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf45b8a-2393-4b16-999c-df7eb260b99f_925x524.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gAd1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf45b8a-2393-4b16-999c-df7eb260b99f_925x524.png 424w, https://substackcdn.com/image/fetch/$s_!gAd1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf45b8a-2393-4b16-999c-df7eb260b99f_925x524.png 848w, https://substackcdn.com/image/fetch/$s_!gAd1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf45b8a-2393-4b16-999c-df7eb260b99f_925x524.png 1272w, https://substackcdn.com/image/fetch/$s_!gAd1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf45b8a-2393-4b16-999c-df7eb260b99f_925x524.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Linear prediction for time series. Source: [7].</figcaption></figure></div><p> DLinear also proposes a decomposition strategy with linear layers, borrowed from models like Autoformer and FEDformer. Here&#8217;s how it works:</p><ol><li><p><strong>Decomposition</strong>: DLinear first decomposes the raw input time series into two components&#8212;<strong>trend</strong> and <strong>remainder (seasonal)</strong>&#8212;using a moving average kernel. This decomposition helps the model handle complex time-series data in a way that emphasizes the long-term trend and seasonal fluctuations separately.</p></li><li><p><strong>Two Linear Layers</strong>: After decomposition, DLinear applies two separate one-layer linear models to each component (trend and seasonal). 
By doing so, it captures both the long-term trend and the short-term seasonality in a very efficient manner.</p></li><li><p><strong>Combining the Components</strong>: Finally, the outputs of the trend and seasonal models are summed up to produce the final forecast.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nFi8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe748b425-1645-4801-8efc-95b05cec1d8a_966x304.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nFi8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe748b425-1645-4801-8efc-95b05cec1d8a_966x304.png 424w, https://substackcdn.com/image/fetch/$s_!nFi8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe748b425-1645-4801-8efc-95b05cec1d8a_966x304.png 848w, https://substackcdn.com/image/fetch/$s_!nFi8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe748b425-1645-4801-8efc-95b05cec1d8a_966x304.png 1272w, https://substackcdn.com/image/fetch/$s_!nFi8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe748b425-1645-4801-8efc-95b05cec1d8a_966x304.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nFi8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe748b425-1645-4801-8efc-95b05cec1d8a_966x304.png" width="966" height="304" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e748b425-1645-4801-8efc-95b05cec1d8a_966x304.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:304,&quot;width&quot;:966,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94319,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe748b425-1645-4801-8efc-95b05cec1d8a_966x304.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nFi8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe748b425-1645-4801-8efc-95b05cec1d8a_966x304.png 424w, https://substackcdn.com/image/fetch/$s_!nFi8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe748b425-1645-4801-8efc-95b05cec1d8a_966x304.png 848w, https://substackcdn.com/image/fetch/$s_!nFi8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe748b425-1645-4801-8efc-95b05cec1d8a_966x304.png 1272w, https://substackcdn.com/image/fetch/$s_!nFi8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe748b425-1645-4801-8efc-95b05cec1d8a_966x304.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">DLinear architecture. Source: [7]</figcaption></figure></div><blockquote><p>&#128064; The model also has a variant, <strong>NLinear</strong>, designed to handle dataset distribution shifts. 
NLinear subtracts the last value of the input sequence from every time step, passes the result through a linear layer, and then adds the subtracted value back to the prediction.</p></blockquote><p>The experiments show that DLinear and its variants are competitive with far more complex Transformer-based methods:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4CsC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1fb253b-1af4-4f10-bb98-ef0b8b1e4cf5_1501x224.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4CsC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1fb253b-1af4-4f10-bb98-ef0b8b1e4cf5_1501x224.png 424w, https://substackcdn.com/image/fetch/$s_!4CsC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1fb253b-1af4-4f10-bb98-ef0b8b1e4cf5_1501x224.png 848w, https://substackcdn.com/image/fetch/$s_!4CsC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1fb253b-1af4-4f10-bb98-ef0b8b1e4cf5_1501x224.png 1272w, https://substackcdn.com/image/fetch/$s_!4CsC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1fb253b-1af4-4f10-bb98-ef0b8b1e4cf5_1501x224.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4CsC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1fb253b-1af4-4f10-bb98-ef0b8b1e4cf5_1501x224.png" width="1456" height="217" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1fb253b-1af4-4f10-bb98-ef0b8b1e4cf5_1501x224.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:217,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65938,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1fb253b-1af4-4f10-bb98-ef0b8b1e4cf5_1501x224.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4CsC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1fb253b-1af4-4f10-bb98-ef0b8b1e4cf5_1501x224.png 424w, https://substackcdn.com/image/fetch/$s_!4CsC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1fb253b-1af4-4f10-bb98-ef0b8b1e4cf5_1501x224.png 848w, https://substackcdn.com/image/fetch/$s_!4CsC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1fb253b-1af4-4f10-bb98-ef0b8b1e4cf5_1501x224.png 1272w, https://substackcdn.com/image/fetch/$s_!4CsC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1fb253b-1af4-4f10-bb98-ef0b8b1e4cf5_1501x224.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>&#10060; One obvious limitation of DLinear is that it treats each channel independently, so it ignores cross-variate correlations that may be beneficial for 
forecasting.</p><p><strong>&#128073;iTransformer: A Step Beyond DLinear</strong> [8]</p><p>This work addresses two limitations of Transformers in time series forecasting:</p><p>&#10060; They falter on multivariate time series: compressing the values of all variates at a timestamp into a single token loses crucial inter-variate correlations.</p><p>&#10060; Their attention mechanisms also struggle with temporal dependencies.</p><p>iTransformer offers a solution by inverting the approach. It treats the whole series of each variate as its own token, allowing the attention mechanism to directly capture relationships between variates. This variate-centric view enables the model to effectively learn complex dependencies, leading to more accurate and robust time series forecasts. By shifting the focus from timestamps to variates, iTransformer unlocks a powerful new way to understand and predict time series data.</p><p>To this end, iTransformer uses DLinear-style linear layers to process temporal steps and the Transformer&#8217;s attention to model channel relationships:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Di0c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7dc5357-891d-49c7-976d-a02fdb9cdc56_1256x530.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Di0c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7dc5357-891d-49c7-976d-a02fdb9cdc56_1256x530.png 424w, https://substackcdn.com/image/fetch/$s_!Di0c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7dc5357-891d-49c7-976d-a02fdb9cdc56_1256x530.png 848w, 
https://substackcdn.com/image/fetch/$s_!Di0c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7dc5357-891d-49c7-976d-a02fdb9cdc56_1256x530.png 1272w, https://substackcdn.com/image/fetch/$s_!Di0c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7dc5357-891d-49c7-976d-a02fdb9cdc56_1256x530.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Di0c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7dc5357-891d-49c7-976d-a02fdb9cdc56_1256x530.png" width="1256" height="530" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7dc5357-891d-49c7-976d-a02fdb9cdc56_1256x530.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:530,&quot;width&quot;:1256,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:189218,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7dc5357-891d-49c7-976d-a02fdb9cdc56_1256x530.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Di0c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7dc5357-891d-49c7-976d-a02fdb9cdc56_1256x530.png 424w, 
https://substackcdn.com/image/fetch/$s_!Di0c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7dc5357-891d-49c7-976d-a02fdb9cdc56_1256x530.png 848w, https://substackcdn.com/image/fetch/$s_!Di0c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7dc5357-891d-49c7-976d-a02fdb9cdc56_1256x530.png 1272w, https://substackcdn.com/image/fetch/$s_!Di0c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7dc5357-891d-49c7-976d-a02fdb9cdc56_1256x530.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">iTransformer architecture. Source: [8].</figcaption></figure></div><p><strong>&#128073;TimeMixer: A Multiscale Approach for Time Series Forecasting [9]</strong></p><p>The TimeMixer method is designed to leverage the multiscale nature of time series data, where different scales exhibit unique properties. By modeling these time-series-specific properties, it achieves good results without any attention mechanism.</p><p>TimeMixer first downsamples the input time series into a set of multiscale series <em>X</em> using average pooling:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;X = \\{x_0, x_1, \\dots, x_M\\}, \\quad x_m \\in \\mathbb{R}^{\\left\\lfloor \\frac{P}{2^m} \\right\\rfloor \\times C}, \\quad m \\in \\{0, 1, \\dots, M\\}\n&quot;,&quot;id&quot;:&quot;CCFFQTVVNC&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>where <em>P</em> is the length of the input series, <em>C</em> is the number of variates, <em>x<sub>0</sub></em> is the finest scale (the original input), and <em>x<sub>M</sub></em> is the coarsest scale.</p><p>In TimeMixer, the past information is processed using stacked Past-Decomposable-Mixing (PDM) blocks. The core idea behind PDM is to separate seasonal and trend components at each scale and mix them separately across scales.</p><ul><li><p>Decomposition (<em>l</em> indexes the PDM block and <em>m</em> the scale):</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\ns_l^m, t_l^m = \\text{SeriesDecomp}(x_l^m), \\quad \\forall m \\in \\{0, 1, \\dots, M\\}\n&quot;,&quot;id&quot;:&quot;XEYJADTGYY&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p>Seasonal Mixing: The paper uses a bottom-up approach that feeds finer-scale information into the coarser scales, enhancing the modeling of coarser scales and emphasizing the importance of detailed information for seasonal prediction. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xn3Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c55b3b9-3815-454d-8228-0b5789997ea8_570x475.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xn3Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c55b3b9-3815-454d-8228-0b5789997ea8_570x475.png 424w, https://substackcdn.com/image/fetch/$s_!Xn3Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c55b3b9-3815-454d-8228-0b5789997ea8_570x475.png 848w, https://substackcdn.com/image/fetch/$s_!Xn3Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c55b3b9-3815-454d-8228-0b5789997ea8_570x475.png 1272w, https://substackcdn.com/image/fetch/$s_!Xn3Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c55b3b9-3815-454d-8228-0b5789997ea8_570x475.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xn3Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c55b3b9-3815-454d-8228-0b5789997ea8_570x475.png" width="374" height="311.6666666666667" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c55b3b9-3815-454d-8228-0b5789997ea8_570x475.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:475,&quot;width&quot;:570,&quot;resizeWidth&quot;:374,&quot;bytes&quot;:75948,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c55b3b9-3815-454d-8228-0b5789997ea8_570x475.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xn3Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c55b3b9-3815-454d-8228-0b5789997ea8_570x475.png 424w, https://substackcdn.com/image/fetch/$s_!Xn3Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c55b3b9-3815-454d-8228-0b5789997ea8_570x475.png 848w, https://substackcdn.com/image/fetch/$s_!Xn3Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c55b3b9-3815-454d-8228-0b5789997ea8_570x475.png 1272w, https://substackcdn.com/image/fetch/$s_!Xn3Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c55b3b9-3815-454d-8228-0b5789997ea8_570x475.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p></p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{for } m: 1 \\to M, \\quad s_l^m = s_l^m + \\text{Bottom-Up-Mixing}(s_l^{m-1})&quot;,&quot;id&quot;:&quot;FBNQDCGLDV&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p>Trend Mixing: In contrast to seasonal components, detailed variations in trend data can introduce noise when capturing broader trends. Coarser-scale time series offer clearer macro-level information than finer scales. 
Thus, they use a top-down mixing approach to leverage macro-level insights from coarser scales to guide trend modeling at finer scales.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x1mB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b703a1-daa1-4c7d-b141-f2fe8a5a7304_590x439.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x1mB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b703a1-daa1-4c7d-b141-f2fe8a5a7304_590x439.png 424w, https://substackcdn.com/image/fetch/$s_!x1mB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b703a1-daa1-4c7d-b141-f2fe8a5a7304_590x439.png 848w, https://substackcdn.com/image/fetch/$s_!x1mB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b703a1-daa1-4c7d-b141-f2fe8a5a7304_590x439.png 1272w, https://substackcdn.com/image/fetch/$s_!x1mB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b703a1-daa1-4c7d-b141-f2fe8a5a7304_590x439.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x1mB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b703a1-daa1-4c7d-b141-f2fe8a5a7304_590x439.png" width="422" height="313.99661016949153" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0b703a1-daa1-4c7d-b141-f2fe8a5a7304_590x439.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:439,&quot;width&quot;:590,&quot;resizeWidth&quot;:422,&quot;bytes&quot;:85348,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b703a1-daa1-4c7d-b141-f2fe8a5a7304_590x439.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x1mB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b703a1-daa1-4c7d-b141-f2fe8a5a7304_590x439.png 424w, https://substackcdn.com/image/fetch/$s_!x1mB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b703a1-daa1-4c7d-b141-f2fe8a5a7304_590x439.png 848w, https://substackcdn.com/image/fetch/$s_!x1mB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b703a1-daa1-4c7d-b141-f2fe8a5a7304_590x439.png 1272w, https://substackcdn.com/image/fetch/$s_!x1mB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0b703a1-daa1-4c7d-b141-f2fe8a5a7304_590x439.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{for } m: (M-1) \\to 0, \\quad t_l^m = t_l^m + \\text{Top-Down-Mixing}(t_l^{m+1})\n&quot;,&quot;id&quot;:&quot;SSBYJVVEJU&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p>Seasonal and Trend Mix:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;X_l = X_{l-1} + \\text{FeedForward} \\left( S\\text{-Mix} \\{s_l^m\\}_{m=0}^M + T\\text{-Mix} \\{t_l^m\\}_{m=0}^M \\right)\n&quot;,&quot;id&quot;:&quot;BGFWOXXLSV&quot;}" data-component-name="LatexBlockToDOM"></div><ul><li><p>The final prediction combines the predictions made at all scales:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!sCsM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa54f33d-3497-4e11-8467-9614868f6e04_1078x162.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sCsM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa54f33d-3497-4e11-8467-9614868f6e04_1078x162.png 424w, https://substackcdn.com/image/fetch/$s_!sCsM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa54f33d-3497-4e11-8467-9614868f6e04_1078x162.png 848w, https://substackcdn.com/image/fetch/$s_!sCsM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa54f33d-3497-4e11-8467-9614868f6e04_1078x162.png 1272w, https://substackcdn.com/image/fetch/$s_!sCsM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa54f33d-3497-4e11-8467-9614868f6e04_1078x162.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sCsM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa54f33d-3497-4e11-8467-9614868f6e04_1078x162.png" width="550" height="82.65306122448979" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa54f33d-3497-4e11-8467-9614868f6e04_1078x162.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:162,&quot;width&quot;:1078,&quot;resizeWidth&quot;:550,&quot;bytes&quot;:21506,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa54f33d-3497-4e11-8467-9614868f6e04_1078x162.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sCsM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa54f33d-3497-4e11-8467-9614868f6e04_1078x162.png 424w, https://substackcdn.com/image/fetch/$s_!sCsM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa54f33d-3497-4e11-8467-9614868f6e04_1078x162.png 848w, https://substackcdn.com/image/fetch/$s_!sCsM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa54f33d-3497-4e11-8467-9614868f6e04_1078x162.png 1272w, https://substackcdn.com/image/fetch/$s_!sCsM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa54f33d-3497-4e11-8467-9614868f6e04_1078x162.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!kcv9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc06a7ce-4aa0-4349-8a88-2be2bd482747_1832x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kcv9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc06a7ce-4aa0-4349-8a88-2be2bd482747_1832x570.png 424w, https://substackcdn.com/image/fetch/$s_!kcv9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc06a7ce-4aa0-4349-8a88-2be2bd482747_1832x570.png 848w, https://substackcdn.com/image/fetch/$s_!kcv9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc06a7ce-4aa0-4349-8a88-2be2bd482747_1832x570.png 1272w, https://substackcdn.com/image/fetch/$s_!kcv9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc06a7ce-4aa0-4349-8a88-2be2bd482747_1832x570.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kcv9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc06a7ce-4aa0-4349-8a88-2be2bd482747_1832x570.png" width="1456" height="453" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc06a7ce-4aa0-4349-8a88-2be2bd482747_1832x570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:453,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:242556,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc06a7ce-4aa0-4349-8a88-2be2bd482747_1832x570.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kcv9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc06a7ce-4aa0-4349-8a88-2be2bd482747_1832x570.png 424w, https://substackcdn.com/image/fetch/$s_!kcv9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc06a7ce-4aa0-4349-8a88-2be2bd482747_1832x570.png 848w, https://substackcdn.com/image/fetch/$s_!kcv9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc06a7ce-4aa0-4349-8a88-2be2bd482747_1832x570.png 1272w, https://substackcdn.com/image/fetch/$s_!kcv9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc06a7ce-4aa0-4349-8a88-2be2bd482747_1832x570.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">TimeMixer architecture. Source: [9].</figcaption></figure></div><p><strong>&#128073;CycleNet: Leveraging Explicit Periodic Modeling for Long-Term Time Series Forecasting [11]</strong></p><p>CycleNet improves long-term time series forecasting by <strong>explicitly modeling periodic patterns</strong> present in the data. The core of CycleNet lies in its <strong>Residual Cycle Forecasting (RCF)</strong> technique, which decomposes the time series into periodic components and residuals. The residuals, representing the fluctuations that cannot be explained by periodic patterns, are then predicted by a backbone model and added back to the periodic components to complete the forecast. 
This approach offers a clear distinction between cyclic and non-cyclic components.</p><p>Given a time series with <em>D</em> channels and a cycle length <em>W</em>, the goal is to model the inherent periodicity within each channel. This is achieved through learnable recurrent cycles <em>Q</em> &#8712; &#8477;<sup><em>W</em>&#215;<em>D</em></sup>, initialized to zero and trained to represent the cyclic components of the sequence. Some key points:</p><ul><li><p>Each channel has its own cycle (one column of <em>Q</em>), and that cycle is shared globally across all time windows of the channel.</p></li><li><p>To model the periodic patterns, the recurrent cycles are replicated cyclically to match the length of the input sequence.</p></li></ul><p>For an input window <em>x<sub>t-L+1:t</sub></em> of length <em>L</em> and a forecast horizon <em>H</em>, the corresponding cyclic components <em>c<sub>t-L+1:t</sub></em> (past) and <em>c<sub>t+1:t+H</sub></em> (future) are generated by shifting and repeating the recurrent cycles <em>Q</em> as follows:</p><ul><li><p><strong>Shift</strong> <em>Q</em> by <code>t mod W</code> positions to get <em>Q(t)</em>, where <code>t mod W</code> represents the relative position of the current sequence within <em>Q</em>.</p></li><li><p><strong>Repeat</strong> <em>Q(t)</em> <code>&#8970;L/W&#8971;</code> times and concatenate the first <code>L mod W</code> elements of <em>Q(t)</em>.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G_7m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd776ab04-46df-4f96-aacf-32efd0450d6a_1314x313.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G_7m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd776ab04-46df-4f96-aacf-32efd0450d6a_1314x313.png 424w, 
https://substackcdn.com/image/fetch/$s_!G_7m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd776ab04-46df-4f96-aacf-32efd0450d6a_1314x313.png 848w, https://substackcdn.com/image/fetch/$s_!G_7m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd776ab04-46df-4f96-aacf-32efd0450d6a_1314x313.png 1272w, https://substackcdn.com/image/fetch/$s_!G_7m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd776ab04-46df-4f96-aacf-32efd0450d6a_1314x313.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G_7m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd776ab04-46df-4f96-aacf-32efd0450d6a_1314x313.png" width="1314" height="313" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d776ab04-46df-4f96-aacf-32efd0450d6a_1314x313.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:313,&quot;width&quot;:1314,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85823,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd776ab04-46df-4f96-aacf-32efd0450d6a_1314x313.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!G_7m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd776ab04-46df-4f96-aacf-32efd0450d6a_1314x313.png 424w, https://substackcdn.com/image/fetch/$s_!G_7m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd776ab04-46df-4f96-aacf-32efd0450d6a_1314x313.png 848w, https://substackcdn.com/image/fetch/$s_!G_7m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd776ab04-46df-4f96-aacf-32efd0450d6a_1314x313.png 1272w, https://substackcdn.com/image/fetch/$s_!G_7m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd776ab04-46df-4f96-aacf-32efd0450d6a_1314x313.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Cycle generation for past data (left) and future data (right). Source: [11].</figcaption></figure></div><p>Given the cycles, the prediction proceeds in three steps:</p><ul><li><p>Subtract the past cycle from the original time series to get the past residual.</p></li><li><p>Predict the future residual from the past residual.</p></li><li><p>Reconstruct the future time series by adding the future cycle to the predicted future residual.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JmjU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8695f0e-a353-4815-adf6-223a40c55738_1288x608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JmjU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8695f0e-a353-4815-adf6-223a40c55738_1288x608.png 424w, https://substackcdn.com/image/fetch/$s_!JmjU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8695f0e-a353-4815-adf6-223a40c55738_1288x608.png 848w, https://substackcdn.com/image/fetch/$s_!JmjU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8695f0e-a353-4815-adf6-223a40c55738_1288x608.png 1272w, https://substackcdn.com/image/fetch/$s_!JmjU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8695f0e-a353-4815-adf6-223a40c55738_1288x608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JmjU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8695f0e-a353-4815-adf6-223a40c55738_1288x608.png" width="1288" height="608" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8695f0e-a353-4815-adf6-223a40c55738_1288x608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:608,&quot;width&quot;:1288,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:172481,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8695f0e-a353-4815-adf6-223a40c55738_1288x608.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JmjU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8695f0e-a353-4815-adf6-223a40c55738_1288x608.png 424w, https://substackcdn.com/image/fetch/$s_!JmjU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8695f0e-a353-4815-adf6-223a40c55738_1288x608.png 848w, https://substackcdn.com/image/fetch/$s_!JmjU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8695f0e-a353-4815-adf6-223a40c55738_1288x608.png 1272w, https://substackcdn.com/image/fetch/$s_!JmjU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8695f0e-a353-4815-adf6-223a40c55738_1288x608.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">CycleNet prediction with <em>D</em> = 3. Source: [11].</figcaption></figure></div><p>Despite its simplicity, the method performs very well on standard time series benchmarks:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p2Bo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fd32b97-a973-4681-ab8d-17f39c5874e0_1368x392.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p2Bo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fd32b97-a973-4681-ab8d-17f39c5874e0_1368x392.png 424w, https://substackcdn.com/image/fetch/$s_!p2Bo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fd32b97-a973-4681-ab8d-17f39c5874e0_1368x392.png 848w, https://substackcdn.com/image/fetch/$s_!p2Bo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fd32b97-a973-4681-ab8d-17f39c5874e0_1368x392.png 1272w, https://substackcdn.com/image/fetch/$s_!p2Bo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fd32b97-a973-4681-ab8d-17f39c5874e0_1368x392.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p2Bo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fd32b97-a973-4681-ab8d-17f39c5874e0_1368x392.png" width="1368" height="392" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0fd32b97-a973-4681-ab8d-17f39c5874e0_1368x392.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:392,&quot;width&quot;:1368,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:186186,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fd32b97-a973-4681-ab8d-17f39c5874e0_1368x392.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p2Bo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fd32b97-a973-4681-ab8d-17f39c5874e0_1368x392.png 424w, https://substackcdn.com/image/fetch/$s_!p2Bo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fd32b97-a973-4681-ab8d-17f39c5874e0_1368x392.png 848w, https://substackcdn.com/image/fetch/$s_!p2Bo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fd32b97-a973-4681-ab8d-17f39c5874e0_1368x392.png 1272w, https://substackcdn.com/image/fetch/$s_!p2Bo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fd32b97-a973-4681-ab8d-17f39c5874e0_1368x392.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>That said, the method has inherent limitations:</p><p>&#10060; It assumes that the time series contains cyclic patterns with a constant cycle length, and the cycle length <em>W</em> must be defined in advance.</p><p>&#10060; Like DLinear, it does not model cross-channel relationships.</p><p><strong>&#128073;SOFTS: A Simple, More Efficient Time-Series Forecaster [12]</strong></p><p>SOFTS offers a fresh take on time-series forecasting, ditching heavy attention mechanisms in favor of a lightweight MLP-based design. Instead of treating each channel independently (losing useful correlations) or relying on expensive attention layers, SOFTS finds a sweet spot: it first extracts a global "core" representation of the time series and then fuses this core back into individual channels. 
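</p><p>The core-then-fuse idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the SOFTS implementation: a per-dimension mean stands in for the learned aggregation, the learned projections are omitted, and the function names are hypothetical.</p><pre><code class="language-python">def extract_core(channels):
    # Mean-pool each hidden dimension over all channels.
    # Assumption: a plain mean replaces the learned aggregation.
    count = len(channels)
    return [sum(values) / count for values in zip(*channels)]


def fuse(channels, core):
    # Append the shared core to every channel representation.
    return [channel + core for channel in channels]


channels = [
    [1.0, 0.0],
    [3.0, 2.0],
    [5.0, 4.0],
]  # three channels, each with a two-dimensional representation
core = extract_core(channels)
fused = fuse(channels, core)
print(core)      # [3.0, 2.0]
print(fused[0])  # [1.0, 0.0, 3.0, 2.0]
</code></pre><p>Every fused row now carries both its own channel's features and the shared core, so channels exchange information through the core rather than through pairwise comparisons.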
</p><p>Concretely, there are four steps:</p><ol><li><p><strong>Embedding the Time Series:</strong> Each channel is projected into a hidden space of dimension <em>d</em> using a linear embedding function.</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;S_0 = \\text{Embedding}(X), \\quad S_0 \\in \\mathbb{R}^{C \\times d}\n&quot;,&quot;id&quot;:&quot;OVZFUFUHJD&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, <em>C</em> is the number of channels.</p><ol start="2"><li><p><strong>Channel Interaction via STAR:</strong> The STAR (STar Aggregate-Redistribute) module refines the channel representations over multiple layers. Instead of computing direct pairwise interactions between channels, STAR aggregates the per-channel representations <em>s<sub>1</sub></em>, &#8230;, <em>s<sub>C</sub></em> into a single core representation <em>o</em>:</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;o = f(s_1, s_2, \\dots, s_C)\n&quot;,&quot;id&quot;:&quot;IOEPIPPGZT&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_VVQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b4afd9-dac9-4ce9-8376-4c7b06b2c468_633x414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_VVQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b4afd9-dac9-4ce9-8376-4c7b06b2c468_633x414.png 424w, https://substackcdn.com/image/fetch/$s_!_VVQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b4afd9-dac9-4ce9-8376-4c7b06b2c468_633x414.png 848w, 
https://substackcdn.com/image/fetch/$s_!_VVQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b4afd9-dac9-4ce9-8376-4c7b06b2c468_633x414.png 1272w, https://substackcdn.com/image/fetch/$s_!_VVQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b4afd9-dac9-4ce9-8376-4c7b06b2c468_633x414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_VVQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b4afd9-dac9-4ce9-8376-4c7b06b2c468_633x414.png" width="531" height="347.28909952606637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65b4afd9-dac9-4ce9-8376-4c7b06b2c468_633x414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:414,&quot;width&quot;:633,&quot;resizeWidth&quot;:531,&quot;bytes&quot;:33512,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b4afd9-dac9-4ce9-8376-4c7b06b2c468_633x414.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_VVQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b4afd9-dac9-4ce9-8376-4c7b06b2c468_633x414.png 424w, 
https://substackcdn.com/image/fetch/$s_!_VVQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b4afd9-dac9-4ce9-8376-4c7b06b2c468_633x414.png 848w, https://substackcdn.com/image/fetch/$s_!_VVQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b4afd9-dac9-4ce9-8376-4c7b06b2c468_633x414.png 1272w, https://substackcdn.com/image/fetch/$s_!_VVQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65b4afd9-dac9-4ce9-8376-4c7b06b2c468_633x414.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">STAR's idea is to model channel interaction through a centralized core representation. Source: [12].</figcaption></figure></div><p>At layer <em>i</em>, the method computes the function <em>f</em> as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;o_i = \\text{StochPool}(\\text{MLP}_1(S_{i-1}))\n&quot;,&quot;id&quot;:&quot;PZNKAPFTWU&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, stochastic pooling (StochPool) aggregates the representations of the <em>C</em> series by randomly selecting values, effectively blending the characteristics of mean and max pooling.</p><ol start="3"><li><p><strong>Fusing Core and Channel Representations:</strong></p></li></ol><p>The core representation is repeated and concatenated with each channel representation.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;F_i = \\text{Repeat\\_Concat}(S_{i-1}, o_i)\n&quot;,&quot;id&quot;:&quot;DTBJEAOTBX&quot;}" data-component-name="LatexBlockToDOM"></div><p>The fused representation is then projected back to the hidden space with a residual connection:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;S_i = \\text{MLP}_2(F_i) + S_{i-1}\n&quot;,&quot;id&quot;:&quot;RZUZGUHZSO&quot;}" data-component-name="LatexBlockToDOM"></div><ol start="4"><li><p><strong>Prediction Layer:</strong></p></li></ol><p>A linear layer maps the final representation <em>S<sub>N</sub></em>, the output of the last of the <em>N</em> layers, to the predictions over the horizon <em>H</em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{Y} = \\text{Linear}(S_N), \\quad \\hat{Y} \\in \\mathbb{R}^{C \\times H}\n&quot;,&quot;id&quot;:&quot;MNVWJJMHER&quot;}" data-component-name="LatexBlockToDOM"></div><p>The diagram below summarizes the steps:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Yxyy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf53b620-e470-4b62-a1ca-b28e57eec082_1462x494.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yxyy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf53b620-e470-4b62-a1ca-b28e57eec082_1462x494.png 424w, https://substackcdn.com/image/fetch/$s_!Yxyy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf53b620-e470-4b62-a1ca-b28e57eec082_1462x494.png 848w, https://substackcdn.com/image/fetch/$s_!Yxyy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf53b620-e470-4b62-a1ca-b28e57eec082_1462x494.png 1272w, https://substackcdn.com/image/fetch/$s_!Yxyy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf53b620-e470-4b62-a1ca-b28e57eec082_1462x494.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yxyy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf53b620-e470-4b62-a1ca-b28e57eec082_1462x494.png" width="1456" height="492" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf53b620-e470-4b62-a1ca-b28e57eec082_1462x494.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:492,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:106943,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf53b620-e470-4b62-a1ca-b28e57eec082_1462x494.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yxyy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf53b620-e470-4b62-a1ca-b28e57eec082_1462x494.png 424w, https://substackcdn.com/image/fetch/$s_!Yxyy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf53b620-e470-4b62-a1ca-b28e57eec082_1462x494.png 848w, https://substackcdn.com/image/fetch/$s_!Yxyy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf53b620-e470-4b62-a1ca-b28e57eec082_1462x494.png 1272w, https://substackcdn.com/image/fetch/$s_!Yxyy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf53b620-e470-4b62-a1ca-b28e57eec082_1462x494.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Series-cOre Fused Time Series forecaster (SOFTS). Source: [12].</figcaption></figure></div><p>The results compare favorably with prior work, including Transformer-based methods:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eJks!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eb98a9a-2558-4ddb-a893-4f27da031663_1486x528.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eJks!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eb98a9a-2558-4ddb-a893-4f27da031663_1486x528.png 424w, https://substackcdn.com/image/fetch/$s_!eJks!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eb98a9a-2558-4ddb-a893-4f27da031663_1486x528.png 848w, https://substackcdn.com/image/fetch/$s_!eJks!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eb98a9a-2558-4ddb-a893-4f27da031663_1486x528.png 1272w, https://substackcdn.com/image/fetch/$s_!eJks!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eb98a9a-2558-4ddb-a893-4f27da031663_1486x528.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eJks!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eb98a9a-2558-4ddb-a893-4f27da031663_1486x528.png" width="1456" height="517" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8eb98a9a-2558-4ddb-a893-4f27da031663_1486x528.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:517,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:237506,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://hungleai.substack.com/i/155725604?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eb98a9a-2558-4ddb-a893-4f27da031663_1486x528.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eJks!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eb98a9a-2558-4ddb-a893-4f27da031663_1486x528.png 424w, https://substackcdn.com/image/fetch/$s_!eJks!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eb98a9a-2558-4ddb-a893-4f27da031663_1486x528.png 848w, https://substackcdn.com/image/fetch/$s_!eJks!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eb98a9a-2558-4ddb-a893-4f27da031663_1486x528.png 1272w, https://substackcdn.com/image/fetch/$s_!eJks!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eb98a9a-2558-4ddb-a893-4f27da031663_1486x528.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>Conclusion</h2><p>Time-series forecasting remains one of the most challenging problems in AI, demanding models that can capture complex temporal dependencies, handle non-stationarity, and generalize across diverse patterns. Unlike standard machine learning tasks, time-series data is sequential, highly dynamic, and often influenced by external factors, making it resistant to one-size-fits-all solutions.</p><p><strong>The Road Ahead: </strong>The increasing demand for long-range forecasting, adaptation to distribution shifts, and better uncertainty quantification calls for innovations beyond current architectures. 
One promising direction is the integration of Large Language Models (LLMs) into time-series forecasting, leveraging their pretrained knowledge and ability to process complex dependencies and adapt to diverse forecasting tasks.</p><p>&#128293; <strong>Coming Next:</strong> In my <a href="https://hungleai.substack.com/p/the-best-of-time-series-forecasting-a98">next post</a>, I&#8217;ll explore how LLMs can be adapted as a new solution for time-series forecasting. Stay tuned! &#128640;</p><div><hr></div><h2>References</h2><p>[1] Shi, Xiaoming, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, and Ming Jin. "Time-MoE: Billion-scale time series foundation models with mixture of experts." <em>ICLR, 2025</em>.</p><p>[2] Shumway, Robert H., and David S. Stoffer. "ARIMA models." <em>Time series analysis and its applications: with R examples</em> (2017): 75-163.</p><p>[3] Holt, C. C. (1957). <em>Forecasting seasonals and trends by exponentially weighted averages</em> (O.N.R. Memorandum No. 52). Carnegie Institute of Technology, Pittsburgh USA.</p><p>[4] Wu, Haixu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. "Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting." <em>Advances in neural information processing systems</em> 34 (2021): 22419-22430.</p><p>[5] Zhou, Tian, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. "FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting." In <em>International conference on machine learning</em>, pp. 27268-27286. PMLR, 2022.</p><p>[6] Nie, Yuqi, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. "A time series is worth 64 words: Long-term forecasting with transformers." <em>arXiv preprint arXiv:2211.14730</em> (2022).</p><p>[7] Zeng, Ailing, Muxi Chen, Lei Zhang, and Qiang Xu. "Are transformers effective for time series forecasting?" In <em>Proceedings of the AAAI conference on artificial intelligence</em>, vol. 
37, no. 9, pp. 11121-11128. 2023.</p><p>[8] Liu, Yong, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. "iTransformer: Inverted Transformers Are Effective for Time Series Forecasting." In <em>The Twelfth International Conference on Learning Representations</em>.</p><p>[9] Wang, Shiyu, Haixu Wu, Xiaoming Shi, Tengge Hu, Huakun Luo, Lintao Ma, James Y. Zhang, and Jun Zhou. "TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting." In <em>The Twelfth International Conference on Learning Representations</em>.</p><p>[10] Qiu, Xiangfei, Jilin Hu, Lekui Zhou, Xingjian Wu, Junyang Du, Buang Zhang, Chenjuan Guo et al. "TFB: Towards comprehensive and fair benchmarking of time series forecasting methods." <em>arXiv preprint arXiv:2403.20150</em> (2024).</p><p>[11] Lin, Shengsheng, Weiwei Lin, Xinyi Hu, Wentai Wu, Ruichao Mo, and Haocheng Zhong. "CycleNet: Enhancing time series forecasting through modeling periodic patterns." <em>Advances in Neural Information Processing Systems</em> 37 (2024): 106315-106345.</p><p>[12] Han, Lu, Xu-Yang Chen, Han-Jia Ye, and De-Chuan Zhan. "SOFTS: Efficient Multivariate Time Series Forecasting with Series-Core Fusion." In <em>The Thirty-eighth Annual Conference on Neural Information Processing Systems</em>.</p>]]></content:encoded></item><item><title><![CDATA[Memory-Augmented Large Language Models]]></title><description><![CDATA[Why and How Memory Matters for LLMs?]]></description><link>https://hungleai.substack.com/p/memory-augmented-large-language-models</link><guid isPermaLink="false">https://hungleai.substack.com/p/memory-augmented-large-language-models</guid><dc:creator><![CDATA[Hung Le]]></dc:creator><pubDate>Mon, 02 Dec 2024 23:06:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48b7ecda-27b9-4e58-9dfc-a4e29499857e_489x387.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h4>Table of Contents</h4><ul><li><p><a href="https://hungleai.substack.com/i/151351282/what-are-memory-augmented-neural-networks">What are Memory-Augmented Neural Networks (MANNs)?</a></p><ul><li><p><a href="https://hungleai.substack.com/i/151351282/a-brief-history-of-manns">A Brief History of MANNs</a></p></li><li><p><a href="https://hungleai.substack.com/i/151351282/whats-holding-back-manns">What's Holding Back MANNs?</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/151351282/the-rise-of-memory-in-the-llm-era">The Rise of Memory in the LLM Era</a></p><ul><li><p><a href="https://hungleai.substack.com/i/151351282/why-memory-craves-llms">Why Memory Craves LLMs?</a></p></li><li><p><a href="https://hungleai.substack.com/i/151351282/why-llms-thrive-with-memory">Why LLMs Thrive with Memory?</a></p></li><li><p><a 
href="https://hungleai.substack.com/i/151351282/memory-augmented-large-language-models-ma-llm">Memory-Augmented Large Language Models (MA-LLM)</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/151351282/working-memory">Working Memory</a></p><ul><li><p><a href="https://hungleai.substack.com/i/151351282/string-memory-enabling-llms-to-simulate-universal-turing-machines">String Memory: Enabling LLMs to Simulate Universal Turing Machines</a></p></li><li><p><a href="https://hungleai.substack.com/i/151351282/tensor-memory-long-term-storage-and-generalization-power">Neural Memory: Long-term Storage and Generalization Power</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/151351282/episodic-memory">Episodic Memory </a></p><ul><li><p><a href="https://hungleai.substack.com/i/151351282/rapid-knowledge-integration-with-differentiable-memory">Rapid Knowledge Integration with Differentiable Memory</a></p></li><li><p><a href="https://hungleai.substack.com/i/151351282/prompt-optimization-with-nearest-neighbor-memory">Prompt Optimization with Nearest Neighbor Memory</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/151351282/the-future-of-memory">The Future of Memory</a></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZKyd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0beac246-d477-45cb-9ceb-409cdd4173ab_1024x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZKyd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0beac246-d477-45cb-9ceb-409cdd4173ab_1024x1024.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!ZKyd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0beac246-d477-45cb-9ceb-409cdd4173ab_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ZKyd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0beac246-d477-45cb-9ceb-409cdd4173ab_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ZKyd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0beac246-d477-45cb-9ceb-409cdd4173ab_1024x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZKyd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0beac246-d477-45cb-9ceb-409cdd4173ab_1024x1024.jpeg" width="466" height="466" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0beac246-d477-45cb-9ceb-409cdd4173ab_1024x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:466,&quot;bytes&quot;:225008,&quot;alt&quot;:&quot;Memory-augmented LLMs&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Memory-augmented LLMs" title="Memory-augmented LLMs" srcset="https://substackcdn.com/image/fetch/$s_!ZKyd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0beac246-d477-45cb-9ceb-409cdd4173ab_1024x1024.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!ZKyd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0beac246-d477-45cb-9ceb-409cdd4173ab_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ZKyd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0beac246-d477-45cb-9ceb-409cdd4173ab_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ZKyd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0beac246-d477-45cb-9ceb-409cdd4173ab_1024x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Large Language Models with Memory. Source: Generated by DALL&#183;E 3. </figcaption></figure></div><div><hr></div><h2>What are Memory-Augmented Neural Networks?</h2><p>Memory is the essence of intelligence. Thanks to memory, humans can recognize objects, recall events, plan, explain, and reason. It allows us to learn continuously, adapt to new environments, and apply past knowledge to unfamiliar situations. For AI, memory could be equally transformative. In neural networks, memory enables more than just storing patterns&#8212;it provides a way to connect past experiences to current tasks, to adapt across contexts, and to hold knowledge over time [1].</p><h4>A Brief History of MANNs</h4><p>Integrating memory into neural networks is not a new concept&#8212;it dates back to early models like recurrent neural networks (RNNs) and the Hopfield network, which introduced the idea of internal state retention. 
In these architectures, memory is embedded within the hidden states of the network, allowing it to process sequences and retain a limited context. For example, a classic RNN updates its memory as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;h_t = \\tanh(W_h h_{t-1} + W_x x_t + b)\n&quot;,&quot;id&quot;:&quot;SOTCZOZYIH&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>h<sub>t</sub></em> is the hidden state, or memory, at step <em>t</em>. Here, <em>x<sub>t</sub></em> is the current input, which is encoded into the current memory, and <em>h<sub>t-1</sub></em> is the previous memory, which summarizes the earlier inputs <em>x<sub>1</sub>, x<sub>2</sub>, &#8230;, x<sub>t-1</sub></em>.</p><p>Long Short-Term Memory (LSTM) networks developed this idea further, adding gating mechanisms that control memory flow and help capture long-range dependencies. However, these internal states are still limited in how effectively they can store and retrieve large amounts of past information.</p><blockquote><p>&#128064; The vector memory <em>h<sub>t</sub></em> scales poorly as its dimension grows: <em>W<sub>h</sub></em> is a square matrix over that dimension, so its parameter count grows quadratically, which means more parameters, a larger memory footprint, and slower learning and computation. Vector-based memory is therefore usually low in capacity. </p></blockquote><p>It was only with the development of memory-augmented neural networks (MANNs), such as Neural Turing Machines (NTM, [2]) and Differentiable Neural Computers (DNC, [3]), that memory became a specialized and external component of the model. Unlike RNNs, which are constrained by the fixed size of their hidden states, NTMs and DNCs use a matrix memory that lets the network store much larger amounts of information explicitly. This matrix memory functions like a writable memory bank made up of multiple slots, where information can be stored, retrieved, and updated independently, providing a dedicated structure for past knowledge. 
The role of memory in these models became clearer: it was not just a transient internal state but long-term storage that could be accessed flexibly. Given the previous input tokens, a controller (a role later played by the LLM) is trained to read from and write to the memory and to predict the next tokens:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lCYT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c80eaed-4639-4f2f-a104-a6bc51eec7c2_458x302.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lCYT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c80eaed-4639-4f2f-a104-a6bc51eec7c2_458x302.gif 424w, https://substackcdn.com/image/fetch/$s_!lCYT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c80eaed-4639-4f2f-a104-a6bc51eec7c2_458x302.gif 848w, https://substackcdn.com/image/fetch/$s_!lCYT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c80eaed-4639-4f2f-a104-a6bc51eec7c2_458x302.gif 1272w, https://substackcdn.com/image/fetch/$s_!lCYT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c80eaed-4639-4f2f-a104-a6bc51eec7c2_458x302.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lCYT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c80eaed-4639-4f2f-a104-a6bc51eec7c2_458x302.gif" width="458" height="302" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c80eaed-4639-4f2f-a104-a6bc51eec7c2_458x302.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:302,&quot;width&quot;:458,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:124267,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lCYT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c80eaed-4639-4f2f-a104-a6bc51eec7c2_458x302.gif 424w, https://substackcdn.com/image/fetch/$s_!lCYT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c80eaed-4639-4f2f-a104-a6bc51eec7c2_458x302.gif 848w, https://substackcdn.com/image/fetch/$s_!lCYT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c80eaed-4639-4f2f-a104-a6bc51eec7c2_458x302.gif 1272w, https://substackcdn.com/image/fetch/$s_!lCYT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c80eaed-4639-4f2f-a104-a6bc51eec7c2_458x302.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Memory-Controller Framework. </figcaption></figure></div><p>In MANNs, the matrix memory <em>M</em> can be updated recursively, just like the hidden state of an RNN:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;M_{t}=f\\left(M_{t-1},x_{t}\\right)&quot;,&quot;id&quot;:&quot;BNUQSLJQPA&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>f </em>is a general update function whose form depends on the memory architecture and design. </p><p>MANNs with matrix memory are now a promising direction for building neural networks with human-like memory capabilities. In these systems, memory is external to the network and accessed via learnable controllers, enabling the model to decide what to write to and read from memory based on task needs. 
A complete memory model can capture both past data and the relationships between data, resulting in a deliberate reasoning system that can simulate high-order reasoning [4].</p><p>Thus, the MANN approach supports complex reasoning, long-term dependencies, and variable-length sequences, bringing neural networks closer to the adaptive and context-aware memory found in human cognition.</p><h4>What's Holding Back MANNs?</h4><p>While powerful, MANNs face key obstacles. First, their complex memory operations introduce high computational costs, making them slow and challenging to scale to large datasets. Let&#8217;s have a look at the write operator of the DNC memory to understand why:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;M_{t}=M_{t-1}\\odot\\left(\\textbf{1}-w\\left(M_{t-1},x_{t}\\right)\\otimes e\\left(M_{t-1},x_{t}\\right)\\right)+w\\left(M_{t-1},x_{t}\\right)\\otimes v\\left(M_{t-1},x_{t}\\right)&quot;,&quot;id&quot;:&quot;YDNAFUPQJY&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>w, e, v</em> are neural network functions (producing the write weighting, erase vector, and write vector), and &#8857; and &#8855; are the Hadamard (element-wise) and outer products, respectively. Every step evaluates all three networks and rewrites the entire memory matrix, so the cost of a single write grows with the memory size. </p><p>MANNs also require a sophisticated controller to decide when and how to read from or write to memory. Training this controller is difficult because there are no step-level labels to supervise memory access decisions. Without ground-truth reasoning labels, the controller must learn these strategies from scratch, a challenging process akin to meta-learning that can make training unstable and slow.</p><p>Finally, truly memory-demanding tasks are rare, and little real-world data naturally requires such complex memory operations. 
This scarcity of relevant training data limits opportunities to unlock MANNs' full potential, holding them back from broader applications.</p><blockquote><p>&#128064; The limitations outlined above explain why, despite numerous Memory-Augmented Transformer designs, few are practical or capable of scaling to billion-parameter LLMs.</p></blockquote><div><hr></div><h2>The Rise of Memory in the LLM Era</h2><p>Things changed when LLMs emerged. In the evolving landscape of AI, memory and LLMs form a symbiotic relationship, each enhancing the capabilities of the other. Together they are far more powerful than either is alone, which may bring AI a step closer to Artificial General Intelligence (AGI).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oclY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28860be5-28e8-403b-9e22-d3297fccce8d_382x305.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oclY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28860be5-28e8-403b-9e22-d3297fccce8d_382x305.png 424w, https://substackcdn.com/image/fetch/$s_!oclY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28860be5-28e8-403b-9e22-d3297fccce8d_382x305.png 848w, https://substackcdn.com/image/fetch/$s_!oclY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28860be5-28e8-403b-9e22-d3297fccce8d_382x305.png 1272w, 
https://substackcdn.com/image/fetch/$s_!oclY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28860be5-28e8-403b-9e22-d3297fccce8d_382x305.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oclY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28860be5-28e8-403b-9e22-d3297fccce8d_382x305.png" width="382" height="305" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28860be5-28e8-403b-9e22-d3297fccce8d_382x305.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:305,&quot;width&quot;:382,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:170329,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oclY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28860be5-28e8-403b-9e22-d3297fccce8d_382x305.png 424w, https://substackcdn.com/image/fetch/$s_!oclY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28860be5-28e8-403b-9e22-d3297fccce8d_382x305.png 848w, https://substackcdn.com/image/fetch/$s_!oclY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28860be5-28e8-403b-9e22-d3297fccce8d_382x305.png 1272w, 
https://substackcdn.com/image/fetch/$s_!oclY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28860be5-28e8-403b-9e22-d3297fccce8d_382x305.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">LLM-Memory Symbiosis. <a href="https://imgflip.com/memegenerator/Epic-Handshake">Source</a>.</figcaption></figure></div><h4>Why Memory Craves LLMs?</h4><p>By leveraging vast amounts of pre-trained knowledge, LLMs can significantly reduce the reliance on task-specific data, making training memory models for memory-intensive tasks more feasible. 
Moreover, LLMs are sophisticated enough to act as controllers themselves, managing memory access more intuitively and efficiently. This removes the need to train a separate, complex controller from scratch, streamlining the training process.</p><p>LLMs with memory can also be faster and more efficient. With memory integrated into the model architecture, an LLM can store and access information quickly, reducing computational overhead. This makes such models more scalable and better suited to large datasets and complex reasoning tasks.</p><p>Ultimately, LLMs provide a natural and efficient way to integrate memory into AI systems. Their inherent capacity for learning and adaptation allows them to seamlessly incorporate memory mechanisms, enhancing their ability to process long-term dependencies and generalize across tasks.</p><h4>Why LLMs Thrive with Memory?</h4><p>LLMs are powerful tools, but they often struggle with tasks that require long-term context or reasoning over multiple steps. This is where memory comes into play. By integrating memory mechanisms, LLMs can:</p><ul><li><p><strong>Handle Long-Term Dependencies:</strong> With memory, LLMs can remember and use information from earlier parts of a text or conversation, improving their ability to generate coherent and contextually relevant responses.</p></li><li><p><strong>Facilitate Complex Reasoning:</strong> Memory allows LLMs to store intermediate results and refer back to them as needed, enabling more sophisticated reasoning processes.</p></li><li><p><strong>Enhance Creativity and Originality:</strong> LLMs can generate more creative and original content by accessing a vast knowledge base. 
Memory enables them to combine ideas from different sources and generate novel outputs.</p></li></ul><p>By addressing these limitations and harnessing the power of memory, LLMs can become even more versatile and capable, opening up new possibilities for AI applications.</p><p>Theoretically, we cannot prove that a pre-trained LLM is computationally universal, meaning that it can solve any computable problem or simulate a universal Turing Machine. This limitation arises because LLMs are trained on finite data, and their convergence after training cannot be guaranteed to meet the conditions of universality. However, an LLM equipped with external memory can achieve computational universality under reasonable assumptions [5]. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v772!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3199199-88d5-4968-a5b4-429789b242d3_1024x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v772!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3199199-88d5-4968-a5b4-429789b242d3_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!v772!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3199199-88d5-4968-a5b4-429789b242d3_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!v772!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3199199-88d5-4968-a5b4-429789b242d3_1024x1024.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!v772!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3199199-88d5-4968-a5b4-429789b242d3_1024x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v772!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3199199-88d5-4968-a5b4-429789b242d3_1024x1024.jpeg" width="388" height="388" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3199199-88d5-4968-a5b4-429789b242d3_1024x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:388,&quot;bytes&quot;:302213,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v772!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3199199-88d5-4968-a5b4-429789b242d3_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!v772!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3199199-88d5-4968-a5b4-429789b242d3_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!v772!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3199199-88d5-4968-a5b4-429789b242d3_1024x1024.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!v772!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3199199-88d5-4968-a5b4-429789b242d3_1024x1024.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Hypothetical UTM. Source: DALLE-3. </figcaption></figure></div><p>Now, you may ask &#129504; <em>what is a universal Turing Machine?</em> A Turing Machine (TM) is a theoretical computing model that, given unbounded memory, can carry out any computation that an algorithm can describe. However, each individual TM solves only one specific problem. 
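</p><p>To make the definition concrete, here is a minimal sketch of a single-purpose TM in Python. Everything in it is an illustrative assumption rather than something taken from the cited papers: the helper <code>run_tm</code>, the rule table <code>FLIP</code>, the state names <code>scan</code> and <code>done</code>, the blank symbol <code>_</code>, and the <code>max_steps</code> cap.</p><pre><code>from collections import defaultdict

def run_tm(rules, tape, state, halt_state, blank="_", max_steps=1000):
    """Run a one-tape Turing Machine and return the final tape contents."""
    # The tape is unbounded: cells that were never written read as blank.
    cells = defaultdict(lambda: blank, enumerate(tape))
    head = 0
    for _ in range(max_steps):  # step cap so a faulty table cannot loop forever
        if state == halt_state:
            break
        # Look up (state, symbol), then write, move the head, change state.
        write, move, state = rules[(state, cells[head])]
        cells[head] = write
        head += move
    return "".join(cells[k] for k in sorted(cells)).strip(blank)

# Hypothetical single-purpose machine: flip every bit, halt at the first blank.
FLIP = {
    ("scan", "0"): ("1", +1, "scan"),
    ("scan", "1"): ("0", +1, "scan"),
    ("scan", "_"): ("_", +1, "done"),
}

print(run_tm(FLIP, "1011", "scan", "done"))  # prints 0100</code></pre><p>A different <code>rules</code> table gives a different machine, but each table is still dedicated to one problem. 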
A Universal Turing Machine (UTM) can simulate any TM, provided that the TM's description is supplied as input. Therefore, a single UTM can, in theory, solve all computable problems. Today's general-purpose computers realize the UTM idea: their memory stores both programs and data. </p><blockquote><p>&#128064; A UTM can be simulated approximately using neural network architectures that are trained end-to-end [6].</p></blockquote><p>The guarantee that an LLM with external memory can compute anything computable is significant, as it suggests that external memory can enhance LLM performance on complex tasks.</p><h4>Memory-Augmented Large Language Models (MA-LLM)</h4><p>Previous sections show that memory is vital in transforming LLMs from powerful tools into truly adaptive and intelligent systems. &#129504; <em>But what kind of memory is suitable for LLMs?</em> Just as the human brain relies on both working memory and episodic memory to navigate the world, LLMs benefit from these complementary forms of memory to handle diverse and complex challenges.</p><p><strong>&#128073;Working memory</strong> in LLMs is a short-term storage system that holds information relevant to the current task. Like a notepad for a writer, it is cleared and reset when the task concludes, making room for new data without interference from prior activities. This keeps LLMs focused and efficient, letting them adapt quickly to new tasks without being overwhelmed by the past.</p><p>In contrast, &#128073;<strong>episodic memory</strong> provides a longer-term perspective, capturing and storing knowledge that spans multiple tasks and events. It acts as a journal, retaining experiences, decisions, and outcomes, which can be revisited to improve understanding and performance over time. 
Episodic memory allows LLMs to learn from prior interactions and carry forward context, fostering continuity and depth in tasks requiring cumulative reasoning or personalized responses.</p><div><hr></div><h2>Working Memory</h2><p>The bounded input length of LLMs, such as the typical 4096-token limit, restricts their capacity to handle complex, multi-step reasoning tasks. Augmenting LLMs with external read-write memory offers a promising solution by extending their computational abilities and enabling them to simulate algorithms beyond the scope of finite automata. This external memory acts as a <strong>working memory</strong>, dynamically storing intermediate computations, sub-problems, or task-specific data that can be retrieved and updated during a reasoning process. </p><p>Two key operators of working memory are:</p><ul><li><p><strong>Read Operator</strong>: Retrieves relevant information based on the current input and stored entries.</p></li><li><p><strong>Write Operator</strong>: Updates the memory with new information generated during the task for future reference.</p><p></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UEns!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf253fe-d6ab-4118-aadf-eb0fc7fcf67f_486x568.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UEns!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf253fe-d6ab-4118-aadf-eb0fc7fcf67f_486x568.gif 424w, https://substackcdn.com/image/fetch/$s_!UEns!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf253fe-d6ab-4118-aadf-eb0fc7fcf67f_486x568.gif 848w, 
https://substackcdn.com/image/fetch/$s_!UEns!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf253fe-d6ab-4118-aadf-eb0fc7fcf67f_486x568.gif 1272w, https://substackcdn.com/image/fetch/$s_!UEns!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf253fe-d6ab-4118-aadf-eb0fc7fcf67f_486x568.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UEns!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf253fe-d6ab-4118-aadf-eb0fc7fcf67f_486x568.gif" width="314" height="366.9794238683128" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9bf253fe-d6ab-4118-aadf-eb0fc7fcf67f_486x568.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:568,&quot;width&quot;:486,&quot;resizeWidth&quot;:314,&quot;bytes&quot;:1682507,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UEns!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf253fe-d6ab-4118-aadf-eb0fc7fcf67f_486x568.gif 424w, https://substackcdn.com/image/fetch/$s_!UEns!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf253fe-d6ab-4118-aadf-eb0fc7fcf67f_486x568.gif 848w, 
https://substackcdn.com/image/fetch/$s_!UEns!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf253fe-d6ab-4118-aadf-eb0fc7fcf67f_486x568.gif 1272w, https://substackcdn.com/image/fetch/$s_!UEns!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bf253fe-d6ab-4118-aadf-eb0fc7fcf67f_486x568.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">A general working memory mechanism.</figcaption></figure></div><blockquote><p>&#128064;The working memory is often 
associative, in that it reads data for the LLM based on the similarity between the current input and what is stored in the memory. Hence, it is also referred to as &#128073;<strong>associative memory</strong>. </p></blockquote><h4>String Memory: Enabling LLMs to Simulate Universal Turing Machines</h4><p>A natural question arises: &#129504; <em>What gets stored in the memory?</em> One straightforward implementation is <strong>text</strong>: the memory can simply store raw text generated by the LLM or provided by users. This simple design is sufficient to achieve universal computational capability, i.e., it can simulate a universal Turing Machine [5]. In the paper, the authors chose <em>U<sub>15,2</sub></em>, a well-known small Universal Turing Machine, as the simulation target. The <em>U<sub>15,2</sub></em> Turing Machine, with a memory tape of infinitely many slots, can be described as follows:</p><ol><li><p><strong>States (</strong><em><strong>Q</strong></em><strong>):</strong> A finite set of states the TM can be in. In this case, <em>U<sub>15,2</sub></em> has 15 states, denoted as: <em>{A,B,C,&#8230;,N,O}</em>. </p></li><li><p><strong>Tape Alphabet (</strong><em><strong>&#931;</strong></em><strong>):</strong> A finite set of symbols that can be written on the tape. <em>U<sub>15,2</sub></em> has <em>&#931;={0,1}</em>.</p></li><li><p><strong>Blank Symbol (</strong><em><strong>b</strong></em><strong>):</strong> A special symbol used to represent empty tape cells. Here, <em>b=0</em>.</p></li><li><p><strong>Start State (</strong><em><strong>q<sub>0</sub></strong></em><strong>):</strong> The initial state of the TM. Here, <em>q<sub>0</sub>=A</em>.</p></li><li><p><strong>Halting States (</strong><em><strong>T</strong></em><strong>):</strong> A set of (state, symbol) pairs that, when reached, halt the TM. Here, <em>T = {(J, 1)}</em>. 
</p></li><li><p><strong>Transition Function (f:</strong><em><strong>Q&#215;&#931;&#8594;&#931;&#215;{&#8722;1,+1}&#215;Q</strong></em><strong>):</strong> A function that, given a current state and the symbol under the tape head, determines:</p><ul><li><p>The symbol to write to the tape.</p></li><li><p>The direction to move the tape head (left -1 or right +1).</p></li><li><p>The next state to transition to.</p></li></ul></li></ol><p>The transition function or program of <em>U<sub>15,2</sub></em> can be represented as a lookup table:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PJDs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec69349-f2e9-45db-8a8f-3fc0552b444b_864x222.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PJDs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec69349-f2e9-45db-8a8f-3fc0552b444b_864x222.png 424w, https://substackcdn.com/image/fetch/$s_!PJDs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec69349-f2e9-45db-8a8f-3fc0552b444b_864x222.png 848w, https://substackcdn.com/image/fetch/$s_!PJDs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec69349-f2e9-45db-8a8f-3fc0552b444b_864x222.png 1272w, https://substackcdn.com/image/fetch/$s_!PJDs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec69349-f2e9-45db-8a8f-3fc0552b444b_864x222.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!PJDs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec69349-f2e9-45db-8a8f-3fc0552b444b_864x222.png" width="864" height="222" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ec69349-f2e9-45db-8a8f-3fc0552b444b_864x222.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:222,&quot;width&quot;:864,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36120,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PJDs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec69349-f2e9-45db-8a8f-3fc0552b444b_864x222.png 424w, https://substackcdn.com/image/fetch/$s_!PJDs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec69349-f2e9-45db-8a8f-3fc0552b444b_864x222.png 848w, https://substackcdn.com/image/fetch/$s_!PJDs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec69349-f2e9-45db-8a8f-3fc0552b444b_864x222.png 1272w, https://substackcdn.com/image/fetch/$s_!PJDs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec69349-f2e9-45db-8a8f-3fc0552b444b_864x222.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>A Turing Machine starts in its initial state, reads the symbol under its tape head, and then, based on the 
current state and symbol, writes a new symbol, moves the head left or right, and transitions to a new state. This process repeats until the machine reaches a halting state.</p><p>Returning to the paper's contribution, the authors aim to simulate <em>U<sub>15,2</sub></em> using an LLM augmented with a string-based memory. </p><blockquote><p>&#128064;The LLM is like a CPU that can access the working memory (RAM) to fetch data and instructions. </p></blockquote><p>The memory functions as a dictionary, mapping keys (variable names or addresses) to values: <strong>MEMORY[variable name] = "value"</strong>. Unlike physical RAM, keys are strings for seamless interaction with the LLM, while values can be either strings or integers. Following the UTM principle, the string values can also be instructions. For example, <strong>MEMORY['op'] </strong>represents the current instruction, e.g., <strong>'halt'</strong>. Other important memory variables are:</p><ul><li><p><strong>MEMORY['i'] </strong>represents the current location of the Turing machine head. For example, <strong>MEMORY['i'] = 4 </strong>means the head is at the 4th slot. </p></li><li><p><strong>MEMORY[number] </strong>represents the symbol stored at the number-th slot, where the key is a string representing a number. For example, <strong>MEMORY['4'] = '0' </strong>indicates that the symbol stored in the 4th slot is '0'. </p></li></ul><p>With that in mind, the LLM system executes in 3 steps:</p><ol><li><p><strong>Read:</strong> Retrieve and construct the next input prompt from the memory.</p></li><li><p><strong>Compute:</strong> Execute the prompt (instruction) using the LLM.</p></li><li><p><strong>Write:</strong> Parse the LLM's output to extract variable assignments, store them back in memory, and move to step 1. 
</p></li></ol><blockquote><p>&#128064;Notably, this approach requires no additional training or weight modification of the LLM, relying purely on prompt engineering and memory management for universal computation. The authors test this approach on Flan-U-PaLM 540B.</p></blockquote><p>To guarantee that computational universality stems from the LLM and its memory rather than from the surrounding glue code, the Read and Write interactions between them are limited to finite-state operations, such as simple regular expression parsing. This restriction matters for verifying that LLM+String Memory is truly computationally universal: without it, the system could simply query another computer for answers during computation. Now let&#8217;s look at each step in detail. </p><h5>Read </h5><p>The rules for processing the memory are defined and executed by simple Python programs. For example, Read takes a string and replaces each variable reference that appears in it (e.g., <strong>@[variable_name]</strong>) with the value stored in the memory under that variable name:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WW59!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61060c73-1df5-4bdd-a8f9-ad6111e3c011_884x413.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WW59!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61060c73-1df5-4bdd-a8f9-ad6111e3c011_884x413.png 424w, https://substackcdn.com/image/fetch/$s_!WW59!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61060c73-1df5-4bdd-a8f9-ad6111e3c011_884x413.png 848w, 
https://substackcdn.com/image/fetch/$s_!WW59!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61060c73-1df5-4bdd-a8f9-ad6111e3c011_884x413.png 1272w, https://substackcdn.com/image/fetch/$s_!WW59!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61060c73-1df5-4bdd-a8f9-ad6111e3c011_884x413.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WW59!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61060c73-1df5-4bdd-a8f9-ad6111e3c011_884x413.png" width="884" height="413" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61060c73-1df5-4bdd-a8f9-ad6111e3c011_884x413.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:413,&quot;width&quot;:884,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:93225,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WW59!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61060c73-1df5-4bdd-a8f9-ad6111e3c011_884x413.png 424w, https://substackcdn.com/image/fetch/$s_!WW59!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61060c73-1df5-4bdd-a8f9-ad6111e3c011_884x413.png 848w, 
https://substackcdn.com/image/fetch/$s_!WW59!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61060c73-1df5-4bdd-a8f9-ad6111e3c011_884x413.png 1272w, https://substackcdn.com/image/fetch/$s_!WW59!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61060c73-1df5-4bdd-a8f9-ad6111e3c011_884x413.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">A Python program to get variable values from the memory. During Read, char is specified as @. 
During Write, char is %. </figcaption></figure></div><p>Given this function, any occurrence of the pattern <strong>char[]</strong> in a string can easily be replaced by the corresponding memory content. </p><h5>Compute</h5><p>In this step, the LLM carries out the transition function of <em>U<sub>15,2</sub></em> described above. It is given a string that encodes the rule for the current state as a small program, which reads like:</p><div class="pullquote"><p>if the current head value on the tape is 0, the state will become B, write 0 to the current head, and shift the head right &#8230;</p></div><p>In effect, we ask the LLM to do if-else generation conditioned on the head value, using an instruction prompt. &#129504; <em>How should the prompt be designed?</em> The authors propose few-shot instructions that familiarize the LLM with the output syntax and the if-else behavior. Some examples read:</p><pre><code>result = " op="%[B]" %[i]="0" i+=1 "
if 0==1 then result = " op="%[A]" %[i]="1" i+=1 "
$result
" op="%[B]" %[i]="0" i+=1 "</code></pre><pre><code>result = " op="%[B]" %[i]="0" i+=1 "
if 1==1 then result = " op="%[A]" %[i]="1" i+=1 "
$result
" op="%[A]" %[i]="1" i+=1 "</code></pre><p>All few-shot examples are stored in the memory variable <strong>boot</strong>. In other words, <strong>MEMORY['boot'] = </strong><code>"result = " op="%[B]" %[i]="0" i+=1 &#8230;"</code>.</p><p>Each transition rule corresponds to a state instruction prompt prefixed with the few-shot examples. Each prompt is stored in <strong>MEMORY </strong>as a variable named after its state: A, B, C, &#8230;</p><pre><code>A = """@[boot]result = " op="%[B]" %[i]="0" i+=1 "
if @[@[i]]==1 then result = " op="%[A]" %[i]="1" i+=1 "
$result
"""
B = """@[boot]result = " op="%[C]" %[i]="1" i+=1 "
if @[@[i]]==1 then result = " op="%[A]" %[i]="1" i+=1 "
$result
"""
...</code></pre><p>For example, if the current state is A, <strong>MEMORY['op'] = </strong></p><p><code>"""@[boot]result = " op="%[B]" %[i]="0" i+=1 "
if @[@[i]]==1 then result = " op="%[A]" %[i]="1" i+=1 "
$result
"""</code></p><p>and we give that prompt to the LLM after Read has loaded the head value <code>@[@[i]]</code> from the memory. For example, if the head value is 1, we expect the LLM to generate: <code>op="%[A]" %[i]="1" i+=1</code></p><h5>Write</h5><p>Given the LLM&#8217;s raw output, we first substitute each variable reference (<strong>%[variable_name]</strong>) with its value stored in the memory, producing a post-processed output string. Then, the authors use a simple Python program to update the memory values according to the assignments specified in that string. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CPzO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ca80f9b-11cf-48bc-b5e9-97997a72ca9c_1176x509.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CPzO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ca80f9b-11cf-48bc-b5e9-97997a72ca9c_1176x509.png 424w, https://substackcdn.com/image/fetch/$s_!CPzO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ca80f9b-11cf-48bc-b5e9-97997a72ca9c_1176x509.png 848w, https://substackcdn.com/image/fetch/$s_!CPzO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ca80f9b-11cf-48bc-b5e9-97997a72ca9c_1176x509.png 1272w, https://substackcdn.com/image/fetch/$s_!CPzO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ca80f9b-11cf-48bc-b5e9-97997a72ca9c_1176x509.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!CPzO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ca80f9b-11cf-48bc-b5e9-97997a72ca9c_1176x509.png" width="1176" height="509" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ca80f9b-11cf-48bc-b5e9-97997a72ca9c_1176x509.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:509,&quot;width&quot;:1176,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:111891,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CPzO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ca80f9b-11cf-48bc-b5e9-97997a72ca9c_1176x509.png 424w, https://substackcdn.com/image/fetch/$s_!CPzO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ca80f9b-11cf-48bc-b5e9-97997a72ca9c_1176x509.png 848w, https://substackcdn.com/image/fetch/$s_!CPzO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ca80f9b-11cf-48bc-b5e9-97997a72ca9c_1176x509.png 1272w, https://substackcdn.com/image/fetch/$s_!CPzO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ca80f9b-11cf-48bc-b5e9-97997a72ca9c_1176x509.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Update variables stored in the MEMORY. &#8220;label&#8221; can be &#8220;op&#8221;, &#8220;i&#8221;, or any variable name. </figcaption></figure></div><p>Putting it together, the whole system simulating <em>U<sub>15,2</sub></em> looks like this:</p><pre><code># Step 1: Initialize memory with predefined variables and values
Initialize MEMORY with:
    'boot', 'i', 'A', 'B', ... (values are defined earlier)

# Step 2: Set the initial operation
MEMORY['op'] = MEMORY['A']  # Start with operation defined in 'A'

# Step 3: Main execution loop
While True:
    op = MEMORY['op']  # Get the current operation
    
    # Check if the process should stop
    If op == 'halt':
        Exit loop  # End the program

    # Perform core actions
    Read  # Fetch required data
    Compute  # Perform calculations or operations
    Write  # Update memory 
</code></pre><p>Despite being theoretically powerful, this kind of string memory faces practical difficulties:</p><p>&#10060; The system does not solve the task directly. Rather, it simulates a UTM and then relies on a TM program to solve the task. In theory, it can solve any task, but it still requires finding a TM program suited to the task, and it is unclear how to find that program.</p><p>&#10060; Working at the string or text level is slow because the LLM must generate text through a sequential sampling process.</p><h4>Tensor Memory: Long-term Storage and Generalization Power</h4><p>In practice, neural networks operate on tensors. Therefore, it is convenient to store vector and tensor representations in the working memory, enabling LLMs to communicate with one another at the representation level rather than the text level. For example, Wang et al. (2024) proposed storing the LLM&#8217;s attention keys and values in external memory:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bRXA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9549f1b5-a242-466d-9f52-7a2c80d8f2b8_669x358.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bRXA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9549f1b5-a242-466d-9f52-7a2c80d8f2b8_669x358.png 424w, https://substackcdn.com/image/fetch/$s_!bRXA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9549f1b5-a242-466d-9f52-7a2c80d8f2b8_669x358.png 848w, 
https://substackcdn.com/image/fetch/$s_!bRXA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9549f1b5-a242-466d-9f52-7a2c80d8f2b8_669x358.png 1272w, https://substackcdn.com/image/fetch/$s_!bRXA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9549f1b5-a242-466d-9f52-7a2c80d8f2b8_669x358.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bRXA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9549f1b5-a242-466d-9f52-7a2c80d8f2b8_669x358.png" width="669" height="358" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9549f1b5-a242-466d-9f52-7a2c80d8f2b8_669x358.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:358,&quot;width&quot;:669,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:75153,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bRXA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9549f1b5-a242-466d-9f52-7a2c80d8f2b8_669x358.png 424w, https://substackcdn.com/image/fetch/$s_!bRXA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9549f1b5-a242-466d-9f52-7a2c80d8f2b8_669x358.png 848w, 
https://substackcdn.com/image/fetch/$s_!bRXA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9549f1b5-a242-466d-9f52-7a2c80d8f2b8_669x358.png 1272w, https://substackcdn.com/image/fetch/$s_!bRXA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9549f1b5-a242-466d-9f52-7a2c80d8f2b8_669x358.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Tensor memory stores keys and values. Source: [7]. 
</figcaption></figure></div><p>In particular, the input sequence is split into fixed-length (<em>context_size</em>) segments. Each segment is processed by a frozen LLM, and its attention key-value pairs are cached in memory. For the current input, query vectors retrieve relevant memory content, which is fused with the local context by a separate trainable network (&#128073;<strong>SideNet</strong>) to make predictions.</p><p>SideNet, implemented as a Transformer, reuses the backbone LLM's embedding layer and frozen language modeling head:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nwST!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bfdddf-de50-43cf-a65c-8ef7b255463d_740x149.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nwST!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bfdddf-de50-43cf-a65c-8ef7b255463d_740x149.png 424w, https://substackcdn.com/image/fetch/$s_!nwST!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bfdddf-de50-43cf-a65c-8ef7b255463d_740x149.png 848w, https://substackcdn.com/image/fetch/$s_!nwST!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bfdddf-de50-43cf-a65c-8ef7b255463d_740x149.png 1272w, https://substackcdn.com/image/fetch/$s_!nwST!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bfdddf-de50-43cf-a65c-8ef7b255463d_740x149.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!nwST!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bfdddf-de50-43cf-a65c-8ef7b255463d_740x149.png" width="476" height="95.84324324324324" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6bfdddf-de50-43cf-a65c-8ef7b255463d_740x149.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:149,&quot;width&quot;:740,&quot;resizeWidth&quot;:476,&quot;bytes&quot;:36752,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nwST!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bfdddf-de50-43cf-a65c-8ef7b255463d_740x149.png 424w, https://substackcdn.com/image/fetch/$s_!nwST!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bfdddf-de50-43cf-a65c-8ef7b255463d_740x149.png 848w, https://substackcdn.com/image/fetch/$s_!nwST!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bfdddf-de50-43cf-a65c-8ef7b255463d_740x149.png 1272w, https://substackcdn.com/image/fetch/$s_!nwST!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bfdddf-de50-43cf-a65c-8ef7b255463d_740x149.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The hidden state <em>H</em> is then used to generate <em>Q</em>, <em>K</em>, and <em>V</em> for standard attention:</p>
<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iFF3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff107d56e-43ad-4cae-b944-d3a546ebeb50_281x76.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iFF3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff107d56e-43ad-4cae-b944-d3a546ebeb50_281x76.png 424w, https://substackcdn.com/image/fetch/$s_!iFF3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff107d56e-43ad-4cae-b944-d3a546ebeb50_281x76.png 848w, https://substackcdn.com/image/fetch/$s_!iFF3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff107d56e-43ad-4cae-b944-d3a546ebeb50_281x76.png 1272w, https://substackcdn.com/image/fetch/$s_!iFF3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff107d56e-43ad-4cae-b944-d3a546ebeb50_281x76.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iFF3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff107d56e-43ad-4cae-b944-d3a546ebeb50_281x76.png" width="187" height="50.57651245551602" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f107d56e-43ad-4cae-b944-d3a546ebeb50_281x76.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:76,&quot;width&quot;:281,&quot;resizeWidth&quot;:187,&quot;bytes&quot;:9009,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iFF3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff107d56e-43ad-4cae-b944-d3a546ebeb50_281x76.png 424w, https://substackcdn.com/image/fetch/$s_!iFF3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff107d56e-43ad-4cae-b944-d3a546ebeb50_281x76.png 848w, https://substackcdn.com/image/fetch/$s_!iFF3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff107d56e-43ad-4cae-b944-d3a546ebeb50_281x76.png 1272w, https://substackcdn.com/image/fetch/$s_!iFF3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff107d56e-43ad-4cae-b944-d3a546ebeb50_281x76.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The twist is that this attention is augmented with memory contents through dedicated memory retrieval and fusion steps.</p><p><strong>Token-to-Chunk Memory Retrieval</strong> simplifies and accelerates memory operations by grouping tokens into fixed-size chunks. 
Instead of retrieving token-level key-value pairs, the system retrieves chunk-level pairs using mean-pooled vectors for efficient matching, then flattens the retrieved chunks back into token-level pairs for processing. </p><blockquote><p>&#128064; This approach reduces retrieval complexity, enhances accuracy, and allows adjustable granularity based on task needs, such as broader context for in-context learning.</p></blockquote><p>Given the current token&#8217;s query, we retrieve the top <em>K/context_size</em> chunks. Flattening these chunks yields <em>K</em> key-value pairs that serve as the memory context. Stacking the retrieved pairs over all input tokens gives the tensor form:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!daob!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cf5bff-13b4-4947-8229-5432a650e80f_572x79.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!daob!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cf5bff-13b4-4947-8229-5432a650e80f_572x79.png 424w, https://substackcdn.com/image/fetch/$s_!daob!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cf5bff-13b4-4947-8229-5432a650e80f_572x79.png 848w, https://substackcdn.com/image/fetch/$s_!daob!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cf5bff-13b4-4947-8229-5432a650e80f_572x79.png 1272w, https://substackcdn.com/image/fetch/$s_!daob!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cf5bff-13b4-4947-8229-5432a650e80f_572x79.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!daob!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cf5bff-13b4-4947-8229-5432a650e80f_572x79.png" width="246" height="33.97552447552447" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50cf5bff-13b4-4947-8229-5432a650e80f_572x79.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:79,&quot;width&quot;:572,&quot;resizeWidth&quot;:246,&quot;bytes&quot;:13765,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!daob!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cf5bff-13b4-4947-8229-5432a650e80f_572x79.png 424w, https://substackcdn.com/image/fetch/$s_!daob!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cf5bff-13b4-4947-8229-5432a650e80f_572x79.png 848w, https://substackcdn.com/image/fetch/$s_!daob!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cf5bff-13b4-4947-8229-5432a650e80f_572x79.png 1272w, https://substackcdn.com/image/fetch/$s_!daob!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50cf5bff-13b4-4947-8229-5432a650e80f_572x79.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Memory Fusion</strong> combines local and retrieved memory contexts through a 
joint-attention mechanism in a specialized memory-augmented layer. Each token can attend to the tokens stored in the memory:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aAN-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f5a79d-8e5f-4db4-ba19-d4a1b18fd76b_433x77.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aAN-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f5a79d-8e5f-4db4-ba19-d4a1b18fd76b_433x77.png 424w, https://substackcdn.com/image/fetch/$s_!aAN-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f5a79d-8e5f-4db4-ba19-d4a1b18fd76b_433x77.png 848w, https://substackcdn.com/image/fetch/$s_!aAN-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f5a79d-8e5f-4db4-ba19-d4a1b18fd76b_433x77.png 1272w, https://substackcdn.com/image/fetch/$s_!aAN-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f5a79d-8e5f-4db4-ba19-d4a1b18fd76b_433x77.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aAN-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f5a79d-8e5f-4db4-ba19-d4a1b18fd76b_433x77.png" width="307" height="54.59353348729792" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e4f5a79d-8e5f-4db4-ba19-d4a1b18fd76b_433x77.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:77,&quot;width&quot;:433,&quot;resizeWidth&quot;:307,&quot;bytes&quot;:13824,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aAN-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f5a79d-8e5f-4db4-ba19-d4a1b18fd76b_433x77.png 424w, https://substackcdn.com/image/fetch/$s_!aAN-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f5a79d-8e5f-4db4-ba19-d4a1b18fd76b_433x77.png 848w, https://substackcdn.com/image/fetch/$s_!aAN-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f5a79d-8e5f-4db4-ba19-d4a1b18fd76b_433x77.png 1272w, https://substackcdn.com/image/fetch/$s_!aAN-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4f5a79d-8e5f-4db4-ba19-d4a1b18fd76b_433x77.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Then, we combine the memory attention with SideNet attention to produce the final output at layer <em>l</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!jgGZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501d3c8c-24e4-4395-b468-f4e5229c8153_793x64.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jgGZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501d3c8c-24e4-4395-b468-f4e5229c8153_793x64.png 424w, https://substackcdn.com/image/fetch/$s_!jgGZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501d3c8c-24e4-4395-b468-f4e5229c8153_793x64.png 848w, https://substackcdn.com/image/fetch/$s_!jgGZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501d3c8c-24e4-4395-b468-f4e5229c8153_793x64.png 1272w, https://substackcdn.com/image/fetch/$s_!jgGZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501d3c8c-24e4-4395-b468-f4e5229c8153_793x64.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jgGZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501d3c8c-24e4-4395-b468-f4e5229c8153_793x64.png" width="402" height="32.44388398486759" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/501d3c8c-24e4-4395-b468-f4e5229c8153_793x64.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:64,&quot;width&quot;:793,&quot;resizeWidth&quot;:402,&quot;bytes&quot;:13705,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jgGZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501d3c8c-24e4-4395-b468-f4e5229c8153_793x64.png 424w, https://substackcdn.com/image/fetch/$s_!jgGZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501d3c8c-24e4-4395-b468-f4e5229c8153_793x64.png 848w, https://substackcdn.com/image/fetch/$s_!jgGZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501d3c8c-24e4-4395-b468-f4e5229c8153_793x64.png 1272w, https://substackcdn.com/image/fetch/$s_!jgGZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F501d3c8c-24e4-4395-b468-f4e5229c8153_793x64.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; During memory-augmented adaptation, only the SideNet's parameters are updated, while the backbone's pre-trained knowledge remains fixed. 
This streamlined method converges quickly because it reuses the knowledge already captured by the backbone model (405M GPT-2).</p></blockquote><p>&#10060; The frozen LLM offers both advantages and drawbacks. While it keeps the system lightweight, it limits adaptability. For instance, if the pre-trained LLM has shortcomings, the quality of memory content may suffer, reducing the effectiveness of the augmentation.</p><p>&#10060; Chunk-level retrieval, while extending the memory span, cannot provide precise access to specific tokens, which limits its utility in applications requiring detailed reflection on past inputs, such as code generation or fine-grained question answering.</p><p>&#10060; The SideNet is trained purely to combine the memory content with its current input. This process relies on KNN retrieval, does not learn how to read from or write to the memory, and offers no mechanism to ensure generalization.</p><p>To address these issues, we can consider a more flexible design in which the memory is differentiable:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FW69!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930739f-d308-455b-91c6-dcbefa97dc39_522x480.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FW69!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930739f-d308-455b-91c6-dcbefa97dc39_522x480.png 424w, https://substackcdn.com/image/fetch/$s_!FW69!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930739f-d308-455b-91c6-dcbefa97dc39_522x480.png 848w, 
https://substackcdn.com/image/fetch/$s_!FW69!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930739f-d308-455b-91c6-dcbefa97dc39_522x480.png 1272w, https://substackcdn.com/image/fetch/$s_!FW69!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930739f-d308-455b-91c6-dcbefa97dc39_522x480.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FW69!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930739f-d308-455b-91c6-dcbefa97dc39_522x480.png" width="442" height="406.4367816091954" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b930739f-d308-455b-91c6-dcbefa97dc39_522x480.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:480,&quot;width&quot;:522,&quot;resizeWidth&quot;:442,&quot;bytes&quot;:41471,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FW69!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930739f-d308-455b-91c6-dcbefa97dc39_522x480.png 424w, https://substackcdn.com/image/fetch/$s_!FW69!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930739f-d308-455b-91c6-dcbefa97dc39_522x480.png 848w, 
https://substackcdn.com/image/fetch/$s_!FW69!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930739f-d308-455b-91c6-dcbefa97dc39_522x480.png 1272w, https://substackcdn.com/image/fetch/$s_!FW69!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb930739f-d308-455b-91c6-dcbefa97dc39_522x480.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Memory storage of past attention key-value pairs can be differentiable. 
Source: [8]</figcaption></figure></div><p>The proposed mechanism, termed &#128073;<strong>Infini-attention</strong>, also stores <em>Q</em>, <em>K</em>, and <em>V</em> representations of the LLM. However, unlike Wang et al. (2024), the authors propose a compressive memory that is updated recurrently using Linear Attention [9]:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WYFK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4b45e65-f253-4400-9319-d1cf31ccb0c1_588x78.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WYFK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4b45e65-f253-4400-9319-d1cf31ccb0c1_588x78.png 424w, https://substackcdn.com/image/fetch/$s_!WYFK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4b45e65-f253-4400-9319-d1cf31ccb0c1_588x78.png 848w, https://substackcdn.com/image/fetch/$s_!WYFK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4b45e65-f253-4400-9319-d1cf31ccb0c1_588x78.png 1272w, https://substackcdn.com/image/fetch/$s_!WYFK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4b45e65-f253-4400-9319-d1cf31ccb0c1_588x78.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WYFK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4b45e65-f253-4400-9319-d1cf31ccb0c1_588x78.png" width="490" height="65" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f4b45e65-f253-4400-9319-d1cf31ccb0c1_588x78.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:78,&quot;width&quot;:588,&quot;resizeWidth&quot;:490,&quot;bytes&quot;:11431,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WYFK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4b45e65-f253-4400-9319-d1cf31ccb0c1_588x78.png 424w, https://substackcdn.com/image/fetch/$s_!WYFK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4b45e65-f253-4400-9319-d1cf31ccb0c1_588x78.png 848w, https://substackcdn.com/image/fetch/$s_!WYFK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4b45e65-f253-4400-9319-d1cf31ccb0c1_588x78.png 1272w, https://substackcdn.com/image/fetch/$s_!WYFK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4b45e65-f253-4400-9319-d1cf31ccb0c1_588x78.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>or Linear Attention combined with the Delta rule:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uQHz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80144581-1a81-4ed5-8c49-5502a10f0eae_453x78.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uQHz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80144581-1a81-4ed5-8c49-5502a10f0eae_453x78.png 424w, https://substackcdn.com/image/fetch/$s_!uQHz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80144581-1a81-4ed5-8c49-5502a10f0eae_453x78.png 848w, https://substackcdn.com/image/fetch/$s_!uQHz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80144581-1a81-4ed5-8c49-5502a10f0eae_453x78.png 1272w, https://substackcdn.com/image/fetch/$s_!uQHz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80144581-1a81-4ed5-8c49-5502a10f0eae_453x78.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uQHz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80144581-1a81-4ed5-8c49-5502a10f0eae_453x78.png" width="363" height="62.503311258278146" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80144581-1a81-4ed5-8c49-5502a10f0eae_453x78.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:78,&quot;width&quot;:453,&quot;resizeWidth&quot;:363,&quot;bytes&quot;:11906,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!uQHz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80144581-1a81-4ed5-8c49-5502a10f0eae_453x78.png 424w, https://substackcdn.com/image/fetch/$s_!uQHz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80144581-1a81-4ed5-8c49-5502a10f0eae_453x78.png 848w, https://substackcdn.com/image/fetch/$s_!uQHz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80144581-1a81-4ed5-8c49-5502a10f0eae_453x78.png 1272w, https://substackcdn.com/image/fetch/$s_!uQHz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80144581-1a81-4ed5-8c49-5502a10f0eae_453x78.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Then, the memory read performs a normalized matrix multiplication:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qU0a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5305c7de-ccbf-40d9-a4c5-a14b51b20d4e_236x73.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qU0a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5305c7de-ccbf-40d9-a4c5-a14b51b20d4e_236x73.png 424w, https://substackcdn.com/image/fetch/$s_!qU0a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5305c7de-ccbf-40d9-a4c5-a14b51b20d4e_236x73.png 848w, 
https://substackcdn.com/image/fetch/$s_!qU0a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5305c7de-ccbf-40d9-a4c5-a14b51b20d4e_236x73.png 1272w, https://substackcdn.com/image/fetch/$s_!qU0a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5305c7de-ccbf-40d9-a4c5-a14b51b20d4e_236x73.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qU0a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5305c7de-ccbf-40d9-a4c5-a14b51b20d4e_236x73.png" width="178" height="55.059322033898304" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5305c7de-ccbf-40d9-a4c5-a14b51b20d4e_236x73.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:73,&quot;width&quot;:236,&quot;resizeWidth&quot;:178,&quot;bytes&quot;:9056,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qU0a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5305c7de-ccbf-40d9-a4c5-a14b51b20d4e_236x73.png 424w, https://substackcdn.com/image/fetch/$s_!qU0a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5305c7de-ccbf-40d9-a4c5-a14b51b20d4e_236x73.png 848w, 
https://substackcdn.com/image/fetch/$s_!qU0a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5305c7de-ccbf-40d9-a4c5-a14b51b20d4e_236x73.png 1272w, https://substackcdn.com/image/fetch/$s_!qU0a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5305c7de-ccbf-40d9-a4c5-a14b51b20d4e_236x73.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Like Wang et al., (2024), this is interpolated with LLM&#8217;s original attention <em>A<sub>dot</sub></em>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FdJG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405657a1-ff2d-42e3-b2ee-db4fa008452e_577x43.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FdJG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405657a1-ff2d-42e3-b2ee-db4fa008452e_577x43.png 424w, https://substackcdn.com/image/fetch/$s_!FdJG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405657a1-ff2d-42e3-b2ee-db4fa008452e_577x43.png 848w, https://substackcdn.com/image/fetch/$s_!FdJG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405657a1-ff2d-42e3-b2ee-db4fa008452e_577x43.png 1272w, https://substackcdn.com/image/fetch/$s_!FdJG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405657a1-ff2d-42e3-b2ee-db4fa008452e_577x43.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!FdJG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405657a1-ff2d-42e3-b2ee-db4fa008452e_577x43.png" width="423" height="31.523396880415945" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/405657a1-ff2d-42e3-b2ee-db4fa008452e_577x43.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:43,&quot;width&quot;:577,&quot;resizeWidth&quot;:423,&quot;bytes&quot;:13765,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FdJG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405657a1-ff2d-42e3-b2ee-db4fa008452e_577x43.png 424w, https://substackcdn.com/image/fetch/$s_!FdJG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405657a1-ff2d-42e3-b2ee-db4fa008452e_577x43.png 848w, https://substackcdn.com/image/fetch/$s_!FdJG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405657a1-ff2d-42e3-b2ee-db4fa008452e_577x43.png 1272w, https://substackcdn.com/image/fetch/$s_!FdJG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405657a1-ff2d-42e3-b2ee-db4fa008452e_577x43.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; Each Infini-attention layer is trained via backpropagation through time using 
gradients of compressive memory states, similar to RNN training. This is expensive, yet allows better adaptation to downstream tasks that require finetuning the LLM.</p></blockquote><p>A different, more classical approach to MA-LLMs adopts the Controller-Memory framework of NTM and DNC. In this MA-LLM architecture, the memory contains multiple slots. The LLM Controller reads token representations from the slots and writes them to the slots. To save computation costs, the memory can be applied to the final layer of the LLM&#8217;s Transformer Encoder and Decoder as in Pointer-Augmented Neural Memory (&#128073;<strong>PANM</strong>) [10]:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B-11!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe709e928-3601-4f02-80cd-a749ef8ba38b_697x361.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B-11!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe709e928-3601-4f02-80cd-a749ef8ba38b_697x361.png 424w, https://substackcdn.com/image/fetch/$s_!B-11!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe709e928-3601-4f02-80cd-a749ef8ba38b_697x361.png 848w, https://substackcdn.com/image/fetch/$s_!B-11!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe709e928-3601-4f02-80cd-a749ef8ba38b_697x361.png 1272w, https://substackcdn.com/image/fetch/$s_!B-11!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe709e928-3601-4f02-80cd-a749ef8ba38b_697x361.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!B-11!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe709e928-3601-4f02-80cd-a749ef8ba38b_697x361.png" width="533" height="276.05882352941177" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e709e928-3601-4f02-80cd-a749ef8ba38b_697x361.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:361,&quot;width&quot;:697,&quot;resizeWidth&quot;:533,&quot;bytes&quot;:35700,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B-11!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe709e928-3601-4f02-80cd-a749ef8ba38b_697x361.png 424w, https://substackcdn.com/image/fetch/$s_!B-11!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe709e928-3601-4f02-80cd-a749ef8ba38b_697x361.png 848w, https://substackcdn.com/image/fetch/$s_!B-11!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe709e928-3601-4f02-80cd-a749ef8ba38b_697x361.png 1272w, https://substackcdn.com/image/fetch/$s_!B-11!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe709e928-3601-4f02-80cd-a749ef8ba38b_697x361.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">PANM Design: The Memory and Controller are on top of the LLM&#8217;s encoder-decoder to utilize the LLM&#8217;s representations while reducing computing costs. Source: [10].</figcaption></figure></div><p>Unlike prior works that aim to use memory as long-term storage, PANM focuses on leveraging memory to empower LLM with length extrapolation capabilities. 
As such, PANM introduces a pointer-based mechanism with two principles:</p><ol><li><p><strong>Explicit Pointers as Physical Addresses:</strong> Incremental binary addresses replace softmax-based attention, ensuring scalable and predictable memory access for long sequences.</p></li><li><p><strong>Decoupled Pointer Operations:</strong> Pointer manipulation is separated from input data, enabling abstract operations like copying or sorting independently of specific values.</p></li></ol><p>Inspired by slot-based RAM, this design aims for reliable storage and retrieval regardless of sequence length. The memory consists of:</p><ol><li><p><strong>An address bank (A)</strong>&#8212;a collection of binary memory addresses.</p></li><li><p><strong>A Pointer Unit (PU)</strong>&#8212;a module that manipulates pointers to access memory efficiently.</p></li></ol><blockquote><p>&#128064; Applying PANM to Llama-2 7B allows the LLM to generalize to sequences 100 times longer than those seen in the training data. More details on PANM can be found in our previous <a href="https://hungleai.substack.com/p/extending-neural-networks-to-new">blog post</a>.</p></blockquote><div><hr></div><h2>Episodic Memory</h2><p>In deep learning and reinforcement learning, episodic memory refers to a memory mechanism that stores specific experiences or "episodes" for future reference. Unlike working memory, episodic memory can persist longer, across tasks. These memories typically capture key events or knowledge in a task as associative key-value pairs. For example:</p><ul><li><p>Eiffel Tower &#8212; Paris </p></li><li><p>Friend&#8217;s DOB &#8212; 29/11/1995</p></li><li><p>9 AM yesterday &#8212; played soccer </p></li></ul><p>A defining characteristic of episodic memory, distinguishing it from associative memories like semantic memory, is its ability to update or rewire associations rapidly. 
This matters for LLMs: although a vast amount of knowledge is embedded in their weights, that knowledge behaves more like semantic memory and is slow to update. This highlights the need for an additional storage mechanism that can rapidly adapt to new data or knowledge.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B--k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6882e4fb-f9dd-4097-b61c-22b9169cccbd_698x352.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B--k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6882e4fb-f9dd-4097-b61c-22b9169cccbd_698x352.png 424w, https://substackcdn.com/image/fetch/$s_!B--k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6882e4fb-f9dd-4097-b61c-22b9169cccbd_698x352.png 848w, https://substackcdn.com/image/fetch/$s_!B--k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6882e4fb-f9dd-4097-b61c-22b9169cccbd_698x352.png 1272w, https://substackcdn.com/image/fetch/$s_!B--k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6882e4fb-f9dd-4097-b61c-22b9169cccbd_698x352.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B--k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6882e4fb-f9dd-4097-b61c-22b9169cccbd_698x352.png" width="486" height="245.0888252148997" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6882e4fb-f9dd-4097-b61c-22b9169cccbd_698x352.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:352,&quot;width&quot;:698,&quot;resizeWidth&quot;:486,&quot;bytes&quot;:77148,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B--k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6882e4fb-f9dd-4097-b61c-22b9169cccbd_698x352.png 424w, https://substackcdn.com/image/fetch/$s_!B--k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6882e4fb-f9dd-4097-b61c-22b9169cccbd_698x352.png 848w, https://substackcdn.com/image/fetch/$s_!B--k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6882e4fb-f9dd-4097-b61c-22b9169cccbd_698x352.png 1272w, https://substackcdn.com/image/fetch/$s_!B--k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6882e4fb-f9dd-4097-b61c-22b9169cccbd_698x352.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">LLMs require external mechanisms to incorporate new knowledge, demonstrate a nuanced understanding of it, and generate content that reflects this updated information. <a href="https://rome.baulab.info/">Source</a>. </figcaption></figure></div><h4>Rapid Knowledge Integration with Differentiable Memory</h4><p>In recent work, Das et al., (2024) propose an MA-LLM named &#128073;<strong>Larimar</strong> that mirrors the hippocampus-neocortex interaction, where the memory rapidly captures factual updates as episodic memory, while the LLM encodes long-term patterns as semantic memory [11]. The episodic memory module serves as a global repository for storing the latest factual updates or edits, conditioning the LLM decoder to reflect this information. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YV3-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219fd7fe-8af0-41e4-8077-b8fcd632e90a_1017x350.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YV3-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219fd7fe-8af0-41e4-8077-b8fcd632e90a_1017x350.png 424w, https://substackcdn.com/image/fetch/$s_!YV3-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219fd7fe-8af0-41e4-8077-b8fcd632e90a_1017x350.png 848w, https://substackcdn.com/image/fetch/$s_!YV3-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219fd7fe-8af0-41e4-8077-b8fcd632e90a_1017x350.png 1272w, https://substackcdn.com/image/fetch/$s_!YV3-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219fd7fe-8af0-41e4-8077-b8fcd632e90a_1017x350.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YV3-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219fd7fe-8af0-41e4-8077-b8fcd632e90a_1017x350.png" width="1017" height="350" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/219fd7fe-8af0-41e4-8077-b8fcd632e90a_1017x350.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:350,&quot;width&quot;:1017,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:105468,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YV3-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219fd7fe-8af0-41e4-8077-b8fcd632e90a_1017x350.png 424w, https://substackcdn.com/image/fetch/$s_!YV3-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219fd7fe-8af0-41e4-8077-b8fcd632e90a_1017x350.png 848w, https://substackcdn.com/image/fetch/$s_!YV3-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219fd7fe-8af0-41e4-8077-b8fcd632e90a_1017x350.png 1272w, https://substackcdn.com/image/fetch/$s_!YV3-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219fd7fe-8af0-41e4-8077-b8fcd632e90a_1017x350.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Larimar writes the Encoder&#8217;s output to the memory and reads the memory for computing the Decoder&#8217;s input. Source: [11]. 
</figcaption></figure></div><p>Given a set of encodings <em>Z<sub>i</sub></em> computed by the Encoder, representing the <em>i</em>-th piece of knowledge we want to add to the memory <em>M</em>, and assuming that we can find the address (key) <em>W</em> that specifies where to write to the memory, the authors propose the following memory update rule:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XCZ7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87718318-bd32-45c1-a3a9-bbc076f5bfc6_499x105.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XCZ7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87718318-bd32-45c1-a3a9-bbc076f5bfc6_499x105.png 424w, https://substackcdn.com/image/fetch/$s_!XCZ7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87718318-bd32-45c1-a3a9-bbc076f5bfc6_499x105.png 848w, https://substackcdn.com/image/fetch/$s_!XCZ7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87718318-bd32-45c1-a3a9-bbc076f5bfc6_499x105.png 1272w, https://substackcdn.com/image/fetch/$s_!XCZ7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87718318-bd32-45c1-a3a9-bbc076f5bfc6_499x105.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XCZ7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87718318-bd32-45c1-a3a9-bbc076f5bfc6_499x105.png" width="365" height="76.80360721442885" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87718318-bd32-45c1-a3a9-bbc076f5bfc6_499x105.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:105,&quot;width&quot;:499,&quot;resizeWidth&quot;:365,&quot;bytes&quot;:14667,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XCZ7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87718318-bd32-45c1-a3a9-bbc076f5bfc6_499x105.png 424w, https://substackcdn.com/image/fetch/$s_!XCZ7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87718318-bd32-45c1-a3a9-bbc076f5bfc6_499x105.png 848w, https://substackcdn.com/image/fetch/$s_!XCZ7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87718318-bd32-45c1-a3a9-bbc076f5bfc6_499x105.png 1272w, https://substackcdn.com/image/fetch/$s_!XCZ7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87718318-bd32-45c1-a3a9-bbc076f5bfc6_499x105.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This rule resembles the Linear Attention + Delta Rule mentioned above. Furthermore, the memory update rule ensures important theoretical properties. 
Assuming  <em>M<sub>i-1</sub></em> is the least-squares solution for <em>Z<sub>0:i-1</sub></em>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LdKW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d9ab5f-74ee-45aa-b3bd-e1dec03427b0_442x86.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LdKW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d9ab5f-74ee-45aa-b3bd-e1dec03427b0_442x86.png 424w, https://substackcdn.com/image/fetch/$s_!LdKW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d9ab5f-74ee-45aa-b3bd-e1dec03427b0_442x86.png 848w, https://substackcdn.com/image/fetch/$s_!LdKW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d9ab5f-74ee-45aa-b3bd-e1dec03427b0_442x86.png 1272w, https://substackcdn.com/image/fetch/$s_!LdKW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d9ab5f-74ee-45aa-b3bd-e1dec03427b0_442x86.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LdKW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d9ab5f-74ee-45aa-b3bd-e1dec03427b0_442x86.png" width="360" height="70.04524886877829" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1d9ab5f-74ee-45aa-b3bd-e1dec03427b0_442x86.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:86,&quot;width&quot;:442,&quot;resizeWidth&quot;:360,&quot;bytes&quot;:10730,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LdKW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d9ab5f-74ee-45aa-b3bd-e1dec03427b0_442x86.png 424w, https://substackcdn.com/image/fetch/$s_!LdKW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d9ab5f-74ee-45aa-b3bd-e1dec03427b0_442x86.png 848w, https://substackcdn.com/image/fetch/$s_!LdKW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d9ab5f-74ee-45aa-b3bd-e1dec03427b0_442x86.png 1272w, https://substackcdn.com/image/fetch/$s_!LdKW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d9ab5f-74ee-45aa-b3bd-e1dec03427b0_442x86.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ul><li><p>Case 1: We want to remember <em>Z<sub>i</sub>, so we set &#945;<sub>i</sub>=1:</em></p></li></ul><div class="pullquote"><p>then M<sub>i</sub> is the least-squares solution for Z<sub>0:i</sub>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!wUez!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639529f7-f2ee-4b91-8ce1-a5191984a30c_442x86.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wUez!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639529f7-f2ee-4b91-8ce1-a5191984a30c_442x86.png 424w, https://substackcdn.com/image/fetch/$s_!wUez!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639529f7-f2ee-4b91-8ce1-a5191984a30c_442x86.png 848w, https://substackcdn.com/image/fetch/$s_!wUez!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639529f7-f2ee-4b91-8ce1-a5191984a30c_442x86.png 1272w, https://substackcdn.com/image/fetch/$s_!wUez!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639529f7-f2ee-4b91-8ce1-a5191984a30c_442x86.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wUez!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639529f7-f2ee-4b91-8ce1-a5191984a30c_442x86.png" width="350" height="68.09954751131222" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/639529f7-f2ee-4b91-8ce1-a5191984a30c_442x86.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:86,&quot;width&quot;:442,&quot;resizeWidth&quot;:350,&quot;bytes&quot;:12008,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wUez!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639529f7-f2ee-4b91-8ce1-a5191984a30c_442x86.png 424w, https://substackcdn.com/image/fetch/$s_!wUez!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639529f7-f2ee-4b91-8ce1-a5191984a30c_442x86.png 848w, https://substackcdn.com/image/fetch/$s_!wUez!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639529f7-f2ee-4b91-8ce1-a5191984a30c_442x86.png 1272w, https://substackcdn.com/image/fetch/$s_!wUez!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639529f7-f2ee-4b91-8ce1-a5191984a30c_442x86.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></div><p></p><ul><li><p>Case 2: We want to forget <em>Z<sub>i</sub>, which was previously stored at step i<sub>forget</sub>&lt;i, so we set &#945;<sub>i</sub>=-1:</em></p></li></ul><div class="pullquote"><p>then M<sub>i</sub> is the least-squares solution for Z<sub>0:i-1</sub> with Z<sub>i<sub>forget</sub></sub> removed from the data:</p><div class="captioned-image-container"><figure><a 
class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e7CS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1aa4f8-c56e-4b19-9011-8ebff0137023_512x88.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e7CS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1aa4f8-c56e-4b19-9011-8ebff0137023_512x88.png 424w, https://substackcdn.com/image/fetch/$s_!e7CS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1aa4f8-c56e-4b19-9011-8ebff0137023_512x88.png 848w, https://substackcdn.com/image/fetch/$s_!e7CS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1aa4f8-c56e-4b19-9011-8ebff0137023_512x88.png 1272w, https://substackcdn.com/image/fetch/$s_!e7CS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1aa4f8-c56e-4b19-9011-8ebff0137023_512x88.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e7CS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1aa4f8-c56e-4b19-9011-8ebff0137023_512x88.png" width="372" height="63.9375" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c1aa4f8-c56e-4b19-9011-8ebff0137023_512x88.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:88,&quot;width&quot;:512,&quot;resizeWidth&quot;:372,&quot;bytes&quot;:12686,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e7CS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1aa4f8-c56e-4b19-9011-8ebff0137023_512x88.png 424w, https://substackcdn.com/image/fetch/$s_!e7CS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1aa4f8-c56e-4b19-9011-8ebff0137023_512x88.png 848w, https://substackcdn.com/image/fetch/$s_!e7CS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1aa4f8-c56e-4b19-9011-8ebff0137023_512x88.png 1272w, https://substackcdn.com/image/fetch/$s_!e7CS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1aa4f8-c56e-4b19-9011-8ebff0137023_512x88.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></div><blockquote><p>&#128064; Intuitively, the least-squares solution yields a good memory because it is computed to minimize the reconstruction error, i.e., it lets us optimally reconstruct past data <em>Z<sub>j</sub></em>.</p></blockquote><p>So far, we have assumed <em>W<sub>i</sub></em> is given. &#129504; <em>How do we compute W<sub>i</sub>? 
</em>The authors follow a prior work [12] in determining optimal values for <em>W<sub>i</sub>:</em></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;W_i=Z_iM_{i-1}^+&quot;,&quot;id&quot;:&quot;GQTXIXOVMM&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>X<sup>+</sup></em> denotes the pseudo-inverse of <em>X</em>. Intuitively, the idea is still to minimize the reconstruction error.  Given the memory, we can read or generate new content as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V8N9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb295b-ca34-4bbb-bf97-e900d0db6f78_509x368.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!V8N9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb295b-ca34-4bbb-bf97-e900d0db6f78_509x368.png 424w, https://substackcdn.com/image/fetch/$s_!V8N9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb295b-ca34-4bbb-bf97-e900d0db6f78_509x368.png 848w, https://substackcdn.com/image/fetch/$s_!V8N9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb295b-ca34-4bbb-bf97-e900d0db6f78_509x368.png 1272w, https://substackcdn.com/image/fetch/$s_!V8N9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb295b-ca34-4bbb-bf97-e900d0db6f78_509x368.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!V8N9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb295b-ca34-4bbb-bf97-e900d0db6f78_509x368.png" width="421" height="304.37721021611003" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01bb295b-ca34-4bbb-bf97-e900d0db6f78_509x368.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:368,&quot;width&quot;:509,&quot;resizeWidth&quot;:421,&quot;bytes&quot;:70970,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!V8N9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb295b-ca34-4bbb-bf97-e900d0db6f78_509x368.png 424w, https://substackcdn.com/image/fetch/$s_!V8N9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb295b-ca34-4bbb-bf97-e900d0db6f78_509x368.png 848w, https://substackcdn.com/image/fetch/$s_!V8N9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb295b-ca34-4bbb-bf97-e900d0db6f78_509x368.png 1272w, https://substackcdn.com/image/fetch/$s_!V8N9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01bb295b-ca34-4bbb-bf97-e900d0db6f78_509x368.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Pseudo-inverse memory functions. </figcaption></figure></div><p>Despite the sound motivation behind its memory read/write design, Larimar has several drawbacks:</p><p>&#10060; The approach assumes no inherent order between episodes, neglecting temporal dependencies between knowledge across episodes. This limitation is evident in the memory update rule, where altering the order of <em>Z</em> does not affect the resulting memory values.</p><p>&#10060; Estimating memory and addresses with pseudo-inverse operations can become computationally slow when the number of memory accesses is high. 
</p><p>&#10060; Training memory operations using backpropagation can be slow and requires many training samples.</p><p>To address the first limitation, Ko et al. (2024) propose modeling the temporal dependencies between facts. This is critical for QA tasks that require advanced reasoning:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cayK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9718309e-48fb-4c20-b663-91f6998b874f_623x186.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cayK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9718309e-48fb-4c20-b663-91f6998b874f_623x186.png 424w, https://substackcdn.com/image/fetch/$s_!cayK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9718309e-48fb-4c20-b663-91f6998b874f_623x186.png 848w, https://substackcdn.com/image/fetch/$s_!cayK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9718309e-48fb-4c20-b663-91f6998b874f_623x186.png 1272w, https://substackcdn.com/image/fetch/$s_!cayK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9718309e-48fb-4c20-b663-91f6998b874f_623x186.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cayK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9718309e-48fb-4c20-b663-91f6998b874f_623x186.png" width="623" height="186" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9718309e-48fb-4c20-b663-91f6998b874f_623x186.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:186,&quot;width&quot;:623,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42631,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cayK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9718309e-48fb-4c20-b663-91f6998b874f_623x186.png 424w, https://substackcdn.com/image/fetch/$s_!cayK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9718309e-48fb-4c20-b663-91f6998b874f_623x186.png 848w, https://substackcdn.com/image/fetch/$s_!cayK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9718309e-48fb-4c20-b663-91f6998b874f_623x186.png 1272w, https://substackcdn.com/image/fetch/$s_!cayK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9718309e-48fb-4c20-b663-91f6998b874f_623x186.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">BabiLong QA benchmark: the facts are temporally related. 
Source: [13].</figcaption></figure></div><p>The proposed method, dubbed &#128073;<strong>MemReasoner</strong>, leverages Larimar&#8217;s memory operations:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UTtF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd24d04d-bdc2-49db-9860-5ede122b3bfd_981x326.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UTtF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd24d04d-bdc2-49db-9860-5ede122b3bfd_981x326.png 424w, https://substackcdn.com/image/fetch/$s_!UTtF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd24d04d-bdc2-49db-9860-5ede122b3bfd_981x326.png 848w, https://substackcdn.com/image/fetch/$s_!UTtF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd24d04d-bdc2-49db-9860-5ede122b3bfd_981x326.png 1272w, https://substackcdn.com/image/fetch/$s_!UTtF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd24d04d-bdc2-49db-9860-5ede122b3bfd_981x326.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UTtF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd24d04d-bdc2-49db-9860-5ede122b3bfd_981x326.png" width="981" height="326" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd24d04d-bdc2-49db-9860-5ede122b3bfd_981x326.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:326,&quot;width&quot;:981,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94759,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UTtF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd24d04d-bdc2-49db-9860-5ede122b3bfd_981x326.png 424w, https://substackcdn.com/image/fetch/$s_!UTtF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd24d04d-bdc2-49db-9860-5ede122b3bfd_981x326.png 848w, https://substackcdn.com/image/fetch/$s_!UTtF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd24d04d-bdc2-49db-9860-5ede122b3bfd_981x326.png 1272w, https://substackcdn.com/image/fetch/$s_!UTtF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd24d04d-bdc2-49db-9860-5ede122b3bfd_981x326.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">MemReasoner utilizes episodic memory for reasoning. Source: [13].</figcaption></figure></div><p>Furthermore, it introduces two new components:</p><ul><li><p>Temporal Encoding: Positional encodings are computed for each line of context using sine and cosine functions. Additionally, learnable encodings are explored using a bidirectional GRU, whose outputs over the input sequence serve as ordered context embeddings. These embeddings are then written to memory using Larimar&#8217;s write operation.</p></li><li><p>Multi-step Reasoning with Query Rewriting: Multi-step reasoning tasks involve iterative &#8220;hops&#8221; between facts until the solution is reached. 
At each hop, <em>z<sub>q</sub></em> is processed by a simple linear transformation to align with memory content:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{z}_q = W_q z_q&quot;,&quot;id&quot;:&quot;VFULSIQWUL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Like Larimar, the memory readout is computed as:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z_r = (\\hat{z}_q M^+ + \\eta)M\n&quot;,&quot;id&quot;:&quot;NNNLUOPGMS&quot;}" data-component-name="LatexBlockToDOM"></div><p>The query is updated iteratively:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z_q \\gets z_q + \\alpha \\cdot z_r\n&quot;,&quot;id&quot;:&quot;YYAYHHGRKP&quot;}" data-component-name="LatexBlockToDOM"></div><p>The updated query is then used to obtain a new read value <em>z&#8217;<sub>r</sub>.</em> This repeats until convergence, i.e., <em>||z<sub>r</sub></em>-<em>z&#8217;<sub>r</sub>||&lt;&#964;</em>, or until a maximum number of iterations is reached.</p><blockquote><p>&#128064; Query rewriting is crucial when finding the relevant fact requires multiple reads from the memory. For example, suppose the original query is about A, and we want to know about E. If the facts A-B, B-C, C-D, and D-E are stored in the memory, we would need 4 reading &#8220;hops&#8221;.</p></blockquote><h4>Prompt Optimization with Nearest Neighbor Memory</h4><p>An efficient way to enhance LLM performance is to refine the prompting process. 
Episodic memory can significantly improve prompt optimization, benefiting the LLM's outputs as effectively as augmenting the model itself.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_YC2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30da1b30-02d9-4532-bc20-030145a54f09_611x232.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_YC2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30da1b30-02d9-4532-bc20-030145a54f09_611x232.gif 424w, https://substackcdn.com/image/fetch/$s_!_YC2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30da1b30-02d9-4532-bc20-030145a54f09_611x232.gif 848w, https://substackcdn.com/image/fetch/$s_!_YC2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30da1b30-02d9-4532-bc20-030145a54f09_611x232.gif 1272w, https://substackcdn.com/image/fetch/$s_!_YC2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30da1b30-02d9-4532-bc20-030145a54f09_611x232.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_YC2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30da1b30-02d9-4532-bc20-030145a54f09_611x232.gif" width="611" height="232" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30da1b30-02d9-4532-bc20-030145a54f09_611x232.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:232,&quot;width&quot;:611,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:134299,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_YC2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30da1b30-02d9-4532-bc20-030145a54f09_611x232.gif 424w, https://substackcdn.com/image/fetch/$s_!_YC2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30da1b30-02d9-4532-bc20-030145a54f09_611x232.gif 848w, https://substackcdn.com/image/fetch/$s_!_YC2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30da1b30-02d9-4532-bc20-030145a54f09_611x232.gif 1272w, https://substackcdn.com/image/fetch/$s_!_YC2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30da1b30-02d9-4532-bc20-030145a54f09_611x232.gif 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Episodic Memory for improving the prompting process rather than the LLM itself. </figcaption></figure></div><p>By treating each training instance as an episode, the memory archives the experience of a seen prompt, which consists of input data, in-context learning (ICL) examples, and corresponding performance. 
During testing, we can refer to past experiences stored in the memory to construct the prompt that potentially maximizes test performance.</p><p>Based on this principle, Do et al. (2024) propose an episodic memory for optimizing the ordering of in-context examples, called &#128073;<strong>POEM</strong> [14]. In their paper, the memory <em>M</em> is a set of tuples:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ocvH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9bb137-dacf-426d-864f-2ec17d10a7be_900x79.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ocvH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9bb137-dacf-426d-864f-2ec17d10a7be_900x79.png 424w, https://substackcdn.com/image/fetch/$s_!ocvH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9bb137-dacf-426d-864f-2ec17d10a7be_900x79.png 848w, https://substackcdn.com/image/fetch/$s_!ocvH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9bb137-dacf-426d-864f-2ec17d10a7be_900x79.png 1272w, https://substackcdn.com/image/fetch/$s_!ocvH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9bb137-dacf-426d-864f-2ec17d10a7be_900x79.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ocvH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9bb137-dacf-426d-864f-2ec17d10a7be_900x79.png" width="424" height="37.217777777777776" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a9bb137-dacf-426d-864f-2ec17d10a7be_900x79.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:79,&quot;width&quot;:900,&quot;resizeWidth&quot;:424,&quot;bytes&quot;:16965,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ocvH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9bb137-dacf-426d-864f-2ec17d10a7be_900x79.png 424w, https://substackcdn.com/image/fetch/$s_!ocvH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9bb137-dacf-426d-864f-2ec17d10a7be_900x79.png 848w, https://substackcdn.com/image/fetch/$s_!ocvH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9bb137-dacf-426d-864f-2ec17d10a7be_900x79.png 1272w, https://substackcdn.com/image/fetch/$s_!ocvH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9bb137-dacf-426d-864f-2ec17d10a7be_900x79.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>s, a,</em> and <em>r</em> are state, action, and reward, inspired by reinforcement learning notations and <em>L</em> is the size of the memory. 
In the context of LLMs, they mean:</p><ul><li><p>State <em>s</em>: the input data, i.e., the question we want the LLM to answer</p></li><li><p>Action <em>a</em>: the ordering of the in-context examples, assuming the examples themselves have already been selected</p></li><li><p>Reward <em>r</em>: the accuracy of the LLM&#8217;s output when prompted with <em>(s, a)</em></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hJ97!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ad36d-550b-4b90-aa57-1a99653ce0a4_552x287.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hJ97!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ad36d-550b-4b90-aa57-1a99653ce0a4_552x287.png 424w, https://substackcdn.com/image/fetch/$s_!hJ97!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ad36d-550b-4b90-aa57-1a99653ce0a4_552x287.png 848w, https://substackcdn.com/image/fetch/$s_!hJ97!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ad36d-550b-4b90-aa57-1a99653ce0a4_552x287.png 1272w, https://substackcdn.com/image/fetch/$s_!hJ97!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ad36d-550b-4b90-aa57-1a99653ce0a4_552x287.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hJ97!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ad36d-550b-4b90-aa57-1a99653ce0a4_552x287.png" width="422" height="219.40942028985506" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/223ad36d-550b-4b90-aa57-1a99653ce0a4_552x287.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:287,&quot;width&quot;:552,&quot;resizeWidth&quot;:422,&quot;bytes&quot;:83190,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hJ97!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ad36d-550b-4b90-aa57-1a99653ce0a4_552x287.png 424w, https://substackcdn.com/image/fetch/$s_!hJ97!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ad36d-550b-4b90-aa57-1a99653ce0a4_552x287.png 848w, https://substackcdn.com/image/fetch/$s_!hJ97!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ad36d-550b-4b90-aa57-1a99653ce0a4_552x287.png 1272w, https://substackcdn.com/image/fetch/$s_!hJ97!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ad36d-550b-4b90-aa57-1a99653ce0a4_552x287.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A Prompt with in-context examples. <a href="https://thaihungle.github.io/talks/AJCAI24_Tutorial.pdf">Source</a>. 
</figcaption></figure></div><p>The paper models these components as follows:</p><ul><li><p>State representation: <a href="https://huggingface.co/sentence-transformers">SentenceTransformer</a> is used to encode the input to a state vector</p></li><li><p>Example selection: The set of examples is simply selected using a nearest neighbor search on a given database. We do not need to optimize this process.</p></li><li><p>Action encoding: The authors propose a clever way to allow generalization by representing the arrangement of in-context examples as a sequence of similarity ranks rather than their actual content. This rank-based representation captures relationships between examples, reducing overfitting and allowing the system to adapt better to new queries.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NfNd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10bc0343-a9d0-4d06-988e-db162d9d28f6_798x280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NfNd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10bc0343-a9d0-4d06-988e-db162d9d28f6_798x280.png 424w, https://substackcdn.com/image/fetch/$s_!NfNd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10bc0343-a9d0-4d06-988e-db162d9d28f6_798x280.png 848w, https://substackcdn.com/image/fetch/$s_!NfNd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10bc0343-a9d0-4d06-988e-db162d9d28f6_798x280.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NfNd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10bc0343-a9d0-4d06-988e-db162d9d28f6_798x280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NfNd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10bc0343-a9d0-4d06-988e-db162d9d28f6_798x280.png" width="588" height="206.31578947368422" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/10bc0343-a9d0-4d06-988e-db162d9d28f6_798x280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:280,&quot;width&quot;:798,&quot;resizeWidth&quot;:588,&quot;bytes&quot;:29603,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NfNd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10bc0343-a9d0-4d06-988e-db162d9d28f6_798x280.png 424w, https://substackcdn.com/image/fetch/$s_!NfNd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10bc0343-a9d0-4d06-988e-db162d9d28f6_798x280.png 848w, https://substackcdn.com/image/fetch/$s_!NfNd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10bc0343-a9d0-4d06-988e-db162d9d28f6_798x280.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NfNd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10bc0343-a9d0-4d06-988e-db162d9d28f6_798x280.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Action Encoding example: The action <em>a = (m, 3, . . . , 1)</em> represents a specific permutation of the in-context examples. Here, the first example is the farthest from the test query, the second is the third closest, and so on, with the <em>m</em>-th position being the closest to the test query. Each action is thus a vector with <em>m</em> elements.</figcaption></figure></div><blockquote><p>&#128064; This action encoding scheme yields a discrete action space, which is convenient for later memory reading. </p></blockquote><ul><li><p>Reward design: The reward depends on the task:</p><ul><li><p>Exact Match: A reward is given if the LLM&#8217;s output exactly matches the ground-truth answer:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zt7f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b07667-4b50-4bf7-aa4a-8c7fb4128740_264x63.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zt7f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b07667-4b50-4bf7-aa4a-8c7fb4128740_264x63.png 424w, https://substackcdn.com/image/fetch/$s_!Zt7f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b07667-4b50-4bf7-aa4a-8c7fb4128740_264x63.png 848w, 
https://substackcdn.com/image/fetch/$s_!Zt7f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b07667-4b50-4bf7-aa4a-8c7fb4128740_264x63.png 1272w, https://substackcdn.com/image/fetch/$s_!Zt7f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b07667-4b50-4bf7-aa4a-8c7fb4128740_264x63.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zt7f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b07667-4b50-4bf7-aa4a-8c7fb4128740_264x63.png" width="242" height="57.75" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25b07667-4b50-4bf7-aa4a-8c7fb4128740_264x63.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:63,&quot;width&quot;:264,&quot;resizeWidth&quot;:242,&quot;bytes&quot;:7235,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zt7f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b07667-4b50-4bf7-aa4a-8c7fb4128740_264x63.png 424w, https://substackcdn.com/image/fetch/$s_!Zt7f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b07667-4b50-4bf7-aa4a-8c7fb4128740_264x63.png 848w, 
https://substackcdn.com/image/fetch/$s_!Zt7f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b07667-4b50-4bf7-aa4a-8c7fb4128740_264x63.png 1272w, https://substackcdn.com/image/fetch/$s_!Zt7f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b07667-4b50-4bf7-aa4a-8c7fb4128740_264x63.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ul><li><p>Classification: The reward is the difference between the log probability of the correct class and the largest log probability of the other classes:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ah9f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d20bce1-75fd-411c-b24e-a606efba2cca_523x66.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ah9f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d20bce1-75fd-411c-b24e-a606efba2cca_523x66.png 424w, https://substackcdn.com/image/fetch/$s_!Ah9f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d20bce1-75fd-411c-b24e-a606efba2cca_523x66.png 848w, https://substackcdn.com/image/fetch/$s_!Ah9f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d20bce1-75fd-411c-b24e-a606efba2cca_523x66.png 1272w, https://substackcdn.com/image/fetch/$s_!Ah9f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d20bce1-75fd-411c-b24e-a606efba2cca_523x66.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ah9f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d20bce1-75fd-411c-b24e-a606efba2cca_523x66.png" width="523" height="66" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d20bce1-75fd-411c-b24e-a606efba2cca_523x66.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:66,&quot;width&quot;:523,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8853,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ah9f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d20bce1-75fd-411c-b24e-a606efba2cca_523x66.png 424w, https://substackcdn.com/image/fetch/$s_!Ah9f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d20bce1-75fd-411c-b24e-a606efba2cca_523x66.png 848w, https://substackcdn.com/image/fetch/$s_!Ah9f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d20bce1-75fd-411c-b24e-a606efba2cca_523x66.png 1272w, https://substackcdn.com/image/fetch/$s_!Ah9f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d20bce1-75fd-411c-b24e-a606efba2cca_523x66.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p>Generation: The reward is the difference between the log probability of the 
ground-truth sequence and the largest log probability of other sequences:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Idtc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c02cdcd-2e2e-4efe-b326-09177eaac2fc_369x154.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Idtc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c02cdcd-2e2e-4efe-b326-09177eaac2fc_369x154.png 424w, https://substackcdn.com/image/fetch/$s_!Idtc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c02cdcd-2e2e-4efe-b326-09177eaac2fc_369x154.png 848w, https://substackcdn.com/image/fetch/$s_!Idtc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c02cdcd-2e2e-4efe-b326-09177eaac2fc_369x154.png 1272w, https://substackcdn.com/image/fetch/$s_!Idtc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c02cdcd-2e2e-4efe-b326-09177eaac2fc_369x154.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Idtc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c02cdcd-2e2e-4efe-b326-09177eaac2fc_369x154.png" width="369" height="154" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c02cdcd-2e2e-4efe-b326-09177eaac2fc_369x154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:154,&quot;width&quot;:369,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14102,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Idtc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c02cdcd-2e2e-4efe-b326-09177eaac2fc_369x154.png 424w, https://substackcdn.com/image/fetch/$s_!Idtc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c02cdcd-2e2e-4efe-b326-09177eaac2fc_369x154.png 848w, https://substackcdn.com/image/fetch/$s_!Idtc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c02cdcd-2e2e-4efe-b326-09177eaac2fc_369x154.png 1272w, https://substackcdn.com/image/fetch/$s_!Idtc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c02cdcd-2e2e-4efe-b326-09177eaac2fc_369x154.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li></ul></li></ul><p>While building the memory, we sample a state (an input) from the training data, select its in-context examples, and explore possible actions (orderings of those examples) to construct the complete prompt. The prompt is then fed to the LLM to generate outputs and collect the rewards. 
Given the tuple (state, action, reward), POEM defines two memory operations:</p><p><strong>Memory writing:</strong> there are two scenarios:</p><ul><li><p>If the state-action pair is new to the memory, we simply insert the tuple. If the memory is full, the oldest tuple is removed.</p></li><li><p>If the state-action pair already exists in memory, we update its stored reward to the current reward if the latter is higher. This maintains an optimistic estimate of the reward for the state-action pair. </p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s4k5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d46781d-29dc-4ef2-9e24-4df81773ddef_649x109.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s4k5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d46781d-29dc-4ef2-9e24-4df81773ddef_649x109.png 424w, https://substackcdn.com/image/fetch/$s_!s4k5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d46781d-29dc-4ef2-9e24-4df81773ddef_649x109.png 848w, https://substackcdn.com/image/fetch/$s_!s4k5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d46781d-29dc-4ef2-9e24-4df81773ddef_649x109.png 1272w, https://substackcdn.com/image/fetch/$s_!s4k5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d46781d-29dc-4ef2-9e24-4df81773ddef_649x109.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!s4k5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d46781d-29dc-4ef2-9e24-4df81773ddef_649x109.png" width="397" height="66.6764252696456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d46781d-29dc-4ef2-9e24-4df81773ddef_649x109.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:109,&quot;width&quot;:649,&quot;resizeWidth&quot;:397,&quot;bytes&quot;:20124,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s4k5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d46781d-29dc-4ef2-9e24-4df81773ddef_649x109.png 424w, https://substackcdn.com/image/fetch/$s_!s4k5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d46781d-29dc-4ef2-9e24-4df81773ddef_649x109.png 848w, https://substackcdn.com/image/fetch/$s_!s4k5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d46781d-29dc-4ef2-9e24-4df81773ddef_649x109.png 1272w, https://substackcdn.com/image/fetch/$s_!s4k5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d46781d-29dc-4ef2-9e24-4df81773ddef_649x109.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Memory reading: </strong>We aim to estimate the value for taking action <em>a</em> for a new 
testing input <em>s<sub>t</sub> </em>using nearest neighbor estimation with the state as the query:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y6KW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F949bcfe5-c6c3-43d3-9ee4-24fa5da233cf_811x110.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y6KW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F949bcfe5-c6c3-43d3-9ee4-24fa5da233cf_811x110.png 424w, https://substackcdn.com/image/fetch/$s_!y6KW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F949bcfe5-c6c3-43d3-9ee4-24fa5da233cf_811x110.png 848w, https://substackcdn.com/image/fetch/$s_!y6KW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F949bcfe5-c6c3-43d3-9ee4-24fa5da233cf_811x110.png 1272w, https://substackcdn.com/image/fetch/$s_!y6KW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F949bcfe5-c6c3-43d3-9ee4-24fa5da233cf_811x110.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y6KW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F949bcfe5-c6c3-43d3-9ee4-24fa5da233cf_811x110.png" width="486" height="65.9186189889026" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/949bcfe5-c6c3-43d3-9ee4-24fa5da233cf_811x110.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:110,&quot;width&quot;:811,&quot;resizeWidth&quot;:486,&quot;bytes&quot;:30320,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!y6KW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F949bcfe5-c6c3-43d3-9ee4-24fa5da233cf_811x110.png 424w, https://substackcdn.com/image/fetch/$s_!y6KW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F949bcfe5-c6c3-43d3-9ee4-24fa5da233cf_811x110.png 848w, https://substackcdn.com/image/fetch/$s_!y6KW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F949bcfe5-c6c3-43d3-9ee4-24fa5da233cf_811x110.png 1272w, https://substackcdn.com/image/fetch/$s_!y6KW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F949bcfe5-c6c3-43d3-9ee4-24fa5da233cf_811x110.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>s<sub>i</sub>, i = 1, ..., k</em> are the <em>k</em> states with the highest similarity to the testing state <em>s<sub>t</sub>.</em> <em>CS</em> is the Cosine Similarity measurement. 
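</p><p>To make the write and read rules concrete, here is a minimal Python sketch rather than the authors&#8217; implementation. The class and method names (<code>EpisodicMemory</code>, <code>write</code>, <code>value</code>, <code>best_action</code>) are hypothetical, states are plain lists standing in for SentenceTransformer embeddings, and the value estimate is assumed to be a similarity-weighted average of the rewards of the <em>k</em> nearest stored states, with negative similarities clipped to zero. See the paper for the exact estimator.</p><pre><code>import math
from collections import OrderedDict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


class EpisodicMemory:
    """Hypothetical POEM-style memory; assumes a positive capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Maps (state, action) to the best reward seen, in insertion order.
        self.store = OrderedDict()

    def write(self, state, action, reward):
        key = (tuple(state), tuple(action))
        if key in self.store:
            # Existing pair: keep the higher (optimistic) reward.
            self.store[key] = max(self.store[key], reward)
            return
        if len(self.store) == self.capacity:
            # Memory is full: drop the oldest tuple.
            self.store.popitem(last=False)
        self.store[key] = reward

    def value(self, query, action, k=2):
        # Assumed estimator: similarity-weighted mean reward over the k stored
        # states closest to the query that were tried with this action.
        scored = [
            (max(0.0, cosine(query, s)), r)
            for (s, a), r in self.store.items()
            if a == tuple(action)
        ]
        top = sorted(scored, reverse=True)[:k]
        total = sum(sim for sim, _ in top)
        return sum(sim * r for sim, r in top) / total if total else 0.0

    def best_action(self, query, k=2):
        # Pick the stored action with the highest estimated value
        # (assumes the memory is not empty).
        actions = {a for _, a in self.store}
        return max(actions, key=lambda a: self.value(query, a, k))


memory = EpisodicMemory(capacity=3)
memory.write([1.0, 0.0], (2, 1), 0.2)
memory.write([1.0, 0.0], (1, 2), 0.9)
memory.write([0.0, 1.0], (2, 1), 0.8)
memory.write([1.0, 0.0], (2, 1), 0.5)  # existing pair: 0.2 is raised to 0.5
chosen = memory.best_action([0.9, 0.1])
</code></pre><p>In this toy run, the query is closest to the state <code>[1.0, 0.0]</code>, so <code>chosen</code> is the ordering <code>(1, 2)</code>: its stored reward of 0.9 beats the weighted estimate for <code>(2, 1)</code>. Actions are stored as tuples of ranks, which is what keeps the action space discrete and lets rewards from different states be pooled per action.</p><p>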
The action that has the highest value estimation will be deemed optimal for the input <em>s<sub>t</sub>:</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HThK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe24532e-9768-4bf2-bd64-eccff98c70bd_285x52.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HThK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe24532e-9768-4bf2-bd64-eccff98c70bd_285x52.png 424w, https://substackcdn.com/image/fetch/$s_!HThK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe24532e-9768-4bf2-bd64-eccff98c70bd_285x52.png 848w, https://substackcdn.com/image/fetch/$s_!HThK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe24532e-9768-4bf2-bd64-eccff98c70bd_285x52.png 1272w, https://substackcdn.com/image/fetch/$s_!HThK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe24532e-9768-4bf2-bd64-eccff98c70bd_285x52.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HThK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe24532e-9768-4bf2-bd64-eccff98c70bd_285x52.png" width="239" height="43.60701754385965" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be24532e-9768-4bf2-bd64-eccff98c70bd_285x52.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:52,&quot;width&quot;:285,&quot;resizeWidth&quot;:239,&quot;bytes&quot;:6806,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HThK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe24532e-9768-4bf2-bd64-eccff98c70bd_285x52.png 424w, https://substackcdn.com/image/fetch/$s_!HThK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe24532e-9768-4bf2-bd64-eccff98c70bd_285x52.png 848w, https://substackcdn.com/image/fetch/$s_!HThK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe24532e-9768-4bf2-bd64-eccff98c70bd_285x52.png 1272w, https://substackcdn.com/image/fetch/$s_!HThK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe24532e-9768-4bf2-bd64-eccff98c70bd_285x52.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; The proposed approach is fast in both training and inference. The authors show that it works for an array of LLMs, from RoBERTa-large to Llama2-7B, 13B, and 70B. 
</p></blockquote><div><hr></div><h2>The Future of Memory</h2><p>As LLMs continue to scale, integrating efficient memory mechanisms becomes critical for handling dynamic knowledge updates, long-term dependencies, and computational efficiency. Traditional Transformer-based architectures struggle to manage memory because self-attention scales quadratically with sequence length, which becomes computationally prohibitive for long sequences. The future lies in <strong>fast, scalable memory mechanisms</strong> that exhibit linear complexity, parallelizability, and the capacity to manage long-term knowledge without degrading performance. Below, we highlight promising approaches shaping this evolution.</p><h4><strong>1. State Space Models (SSMs)</strong></h4><p>State Space Models represent a paradigm shift in sequence modeling. By combining linear state-space equations with deep learning, SSMs can process sequences with complexity linear in sequence length. They inherently model long-range dependencies by operating in continuous time, making them well-suited for extending memory capabilities in LLMs. Their attention-free nature allows scalable memory management while retaining interpretability and robustness. Large SSMs such as Mamba have shown potential as an alternative to Transformer-based LLMs (see more in this <a href="https://hungleai.substack.com/p/the-mamba-effect-state-space-models">blog post</a>).</p><h4><strong>2. Linear Attention</strong></h4><p>Linear attention mechanisms reformulate traditional attention calculations to scale linearly with sequence length [9]. By approximating or reweighting the attention matrix using kernel methods or other simplifications, linear attention reduces the computational overhead without sacrificing the model&#8217;s ability to capture contextual dependencies. This method is inherently parallelizable and faster than standard self-attention on long sequences, making it a practical choice for tasks requiring both speed and scalability. 
Linear attention also retains compatibility with existing architectures, providing an efficient path to extend memory capabilities.</p><h4><strong>3. xLSTM</strong></h4><p><a href="https://hungleai.substack.com/p/xlstm-vs-lstm-how-the-new-lstm-scale">xLSTM</a> builds on traditional recurrent architectures, optimizing them for long-term memory retention and scalability [15]. Unlike vanilla LSTMs, xLSTM leverages architectural modifications to avoid vanishing gradients, allowing it to maintain information over extended sequences. It achieves near-linear complexity by avoiding redundant computations and parallelizing operations, making it a viable alternative for tasks requiring deep contextual understanding.</p><h4><strong>4. Stable Hadamard Memory Framework (SHM)</strong></h4><p>The Stable Hadamard Memory (SHM) framework employs the Hadamard product for memory updates and calibration, offering a robust and scalable memory solution [16]. Its core advantages are minimizing dependencies between time steps, stabilizing gradient flows, and preventing learning issues such as vanishing or exploding gradients. SHM is particularly adept at long-term reasoning, offering a linear-complexity approach that is attention-free and inherently parallelizable. 
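</p><p>As a rough illustration of a Hadamard-product memory update, the NumPy sketch below assumes a recurrence in which the previous memory is multiplied element-wise by a calibration matrix and a new update matrix is then added. The function names, shapes, and random inputs are illustrative assumptions, not the SHM implementation, and how the calibration and update are computed from the input is left out entirely.</p><pre><code>import numpy as np


def hadamard_step(memory, calibration, update):
    """One assumed step: element-wise calibration of the old memory plus new content."""
    return calibration * memory + update


def run_memory(calibrations, updates):
    """Apply the step over a sequence; both inputs have shape (T, d, d)."""
    memory = np.zeros_like(updates[0])
    for calibration, update in zip(calibrations, updates):
        memory = hadamard_step(memory, calibration, update)
    return memory


rng = np.random.default_rng(0)
T, d = 5, 4
updates = rng.normal(size=(T, d, d))

# Calibration fixed to ones: the memory is the running sum of all updates.
summed = run_memory(np.ones((T, d, d)), updates)

# Calibration fixed to 0.5: older updates are halved at every step.
decayed = run_memory(np.full((T, d, d), 0.5), updates)
</code></pre><p>With the calibration fixed to ones, the memory simply accumulates every update, while values below one decay older content element by element, which hints at how calibration can trade retention against stability. 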
</p><p>Interestingly, these recent advancements converge under a unified <strong>Hadamard Memory Framework</strong> [16], where the matrix memory update process can be expressed in a linear formulation as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5E5A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc19f0c36-dd6b-4c3e-994f-2625397d695c_697x209.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5E5A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc19f0c36-dd6b-4c3e-994f-2625397d695c_697x209.png 424w, https://substackcdn.com/image/fetch/$s_!5E5A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc19f0c36-dd6b-4c3e-994f-2625397d695c_697x209.png 848w, https://substackcdn.com/image/fetch/$s_!5E5A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc19f0c36-dd6b-4c3e-994f-2625397d695c_697x209.png 1272w, https://substackcdn.com/image/fetch/$s_!5E5A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc19f0c36-dd6b-4c3e-994f-2625397d695c_697x209.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5E5A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc19f0c36-dd6b-4c3e-994f-2625397d695c_697x209.png" width="371" height="111.2467718794835" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c19f0c36-dd6b-4c3e-994f-2625397d695c_697x209.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:209,&quot;width&quot;:697,&quot;resizeWidth&quot;:371,&quot;bytes&quot;:20534,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5E5A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc19f0c36-dd6b-4c3e-994f-2625397d695c_697x209.png 424w, https://substackcdn.com/image/fetch/$s_!5E5A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc19f0c36-dd6b-4c3e-994f-2625397d695c_697x209.png 848w, https://substackcdn.com/image/fetch/$s_!5E5A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc19f0c36-dd6b-4c3e-994f-2625397d695c_697x209.png 1272w, https://substackcdn.com/image/fetch/$s_!5E5A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc19f0c36-dd6b-4c3e-994f-2625397d695c_697x209.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>For example, </p><ul><li><p>SSM: <em>M, C,</em> and <em>U</em> are vectors</p></li><li><p>Linear Attention: <em>C<sub>t</sub>=1</em></p></li><li><p>xLSTM: <em>C<sub>t</sub></em> is a scalar</p></li></ul><p>Viewing these advancements through a unified lens opens opportunities for comprehensive theoretical analysis and technical enhancements, offering valuable 
insights into the behavior and potential of modern attention-free LLMs. We will dive deep into this model in a separate <a href="https://open.substack.com/pub/hungleai/p/stable-hadamard-memory-the-unified?r=3an4d1&amp;utm_campaign=post&amp;utm_medium=web&amp;showWelcomeOnShare=true">blog post</a>. Stay tuned!</p><div><hr></div><h2>References</h2><p>[1] Le, Hung. "Memory and attention in deep learning." <em>arXiv preprint arXiv:2107.01390</em> (2021).</p><p>[2] Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural Turing Machines." <em>arXiv preprint arXiv:1410.5401</em> (2014).</p><p>[3] Graves, Alex, Greg Wayne, Malcolm Reynolds, et al. "Hybrid computing using a neural network with dynamic external memory." <em>Nature</em> <strong>538</strong>, 471&#8211;476 (2016).</p><p>[4] Le, Hung, Truyen Tran, and Svetha Venkatesh. "Self-attentive associative memory." In <em>International Conference on Machine Learning</em>, pp. 5682&#8211;5691. PMLR, 2020.</p><p>[5] Schuurmans, Dale. "Memory augmented large language models are computationally universal." <em>arXiv preprint arXiv:2301.04589</em> (2023).</p><p>[6] Le, Hung, Truyen Tran, and Svetha Venkatesh. "Neural Stored-program Memory." In <em>International Conference on Learning Representations</em>, 2020.</p><p>[7] Wang, Weizhi, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. "Augmenting language models with long-term memory." <em>Advances in Neural Information Processing Systems</em> 36 (2024).</p><p>[8] Munkhdalai, Tsendsuren, Manaal Faruqui, and Siddharth Gopal. "Leave no context behind: Efficient infinite context transformers with infini-attention." <em>arXiv preprint arXiv:2404.07143</em> (2024).</p><p>[9] Katharopoulos, Angelos, Apoorv Vyas, Nikolaos Pappas, and Fran&#231;ois Fleuret. "Transformers are RNNs: Fast autoregressive transformers with linear attention." In <em>International Conference on Machine Learning</em>, pp. 5156&#8211;5165. PMLR, 2020.</p><p>[10] Le, Hung, Dung Nguyen, Kien Do, Svetha Venkatesh, and Truyen Tran. 
"Plug, Play, and Generalize: Length Extrapolation with Pointer-Augmented Neural Memory." <em>Transactions on Machine Learning Research</em>, 2024.</p><p>[11] Das, Payel, Subhajit Chaudhury, Elliot Nelson, Igor Melnyk, Sarath Swaminathan, Sihui Dai, Aur&#233;lie Lozano et al. "Larimar: Large Language Models with Episodic Memory Control." <em>ICML</em>, 2024.</p><p>[12] Pham, Kha, Hung Le, Man Ngo, Truyen Tran, Bao Ho, and Svetha Venkatesh. "Generative pseudo-inverse memory." In <em>International Conference on Learning Representations</em>, 2021.</p><p>[13] Ko, Ching-Yun, Sihui Dai, Payel Das, Georgios Kollias, Subhajit Chaudhury, and Aur&#233;lie Lozano. "MemReasoner: A Memory-augmented LLM Architecture for Multi-hop Reasoning." In <em>The First Workshop on System-2 Reasoning at Scale, NeurIPS'24</em>. 2024.</p><p>[14] Do, Dai, Quan Tran, Svetha Venkatesh, and Hung Le. "Large Language Model Prompting with Episodic Memory." <em>ECAI</em>, 2024.</p><p>[15] Beck, Maximilian, Korbinian P&#246;ppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, G&#252;nter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. "xLSTM: Extended Long Short-Term Memory." <em>NeurIPS</em>, 2024.</p><p>[16] Le, Hung, Kien Do, Dung Nguyen, Sunil Gupta, and Svetha Venkatesh. "Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning." <em>arXiv preprint arXiv:2410.10132</em> (2024).</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://hungleai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>I hope you enjoy the article. Stay tuned for the newest and exclusive content by subscribing to <strong>Neurocoder Tales</strong>! Disclaimer: While every effort is made to provide accurate and unbiased information, errors may occur. 
Let me know if you catch any errors.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Uncertainty, Confidence, and Hallucination in Large Language Models]]></title><description><![CDATA[How to Spot When Your Large Language Model is Misleading You]]></description><link>https://hungleai.substack.com/p/uncertainty-confidence-and-hallucination</link><guid isPermaLink="false">https://hungleai.substack.com/p/uncertainty-confidence-and-hallucination</guid><dc:creator><![CDATA[Hung Le]]></dc:creator><pubDate>Sun, 21 Jul 2024 22:08:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a318c9-a65d-4957-bc8d-86592ba5f066_1297x603.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h4>Table of Contents</h4><ul><li><p><a href="https://hungleai.substack.com/i/146529280/llm-is-just-making-stuff-up">LLM Is Just Making Stuff Up</a></p></li><li><p><a href="https://hungleai.substack.com/i/146529280/detecting-deception-tools-and-methods-for-identifying-llm-falsehoods">Detecting Deception: Tools and Methods for Identifying LLM Falsehoods</a></p></li><li><p><a href="https://hungleai.substack.com/i/146529280/score-based-approaches-for-uncertainty-estimation-in-llms">Score-based Approaches for Uncertainty Estimation in LLMs</a></p><ul><li><p><a href="https://hungleai.substack.com/i/146529280/heuristic-uncertainty-as-a-clue">Heuristic Uncertainty as a Clue</a></p></li><li><p><a href="https://hungleai.substack.com/i/146529280/quantifying-uncertainty-with-information-theory">Quantifying Uncertainty with Information 
Theory</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/146529280/model-based-hallucination-detection">Model-based Hallucination Detection </a></p><ul><li><p><a href="https://hungleai.substack.com/i/146529280/llms-as-evaluators">LLM as Evaluators</a></p></li><li><p><a href="https://hungleai.substack.com/i/146529280/simple-conformal-predictor">Simple Conformal Predictors</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/146529280/final-thoughts-the-future-of-llm-hallucination-detection">Final Thoughts: The Future of LLM Hallucination Detection</a></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zni7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9ddb8-5e33-4383-970e-87f9d27d8daf_1024x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zni7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9ddb8-5e33-4383-970e-87f9d27d8daf_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zni7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9ddb8-5e33-4383-970e-87f9d27d8daf_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zni7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9ddb8-5e33-4383-970e-87f9d27d8daf_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!zni7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9ddb8-5e33-4383-970e-87f9d27d8daf_1024x1024.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!zni7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9ddb8-5e33-4383-970e-87f9d27d8daf_1024x1024.jpeg" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ed9ddb8-5e33-4383-970e-87f9d27d8daf_1024x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:296167,&quot;alt&quot;:&quot;How to Spot When Your Large Language Model is Misleading You&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="How to Spot When Your Large Language Model is Misleading You" title="How to Spot When Your Large Language Model is Misleading You" srcset="https://substackcdn.com/image/fetch/$s_!zni7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9ddb8-5e33-4383-970e-87f9d27d8daf_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zni7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9ddb8-5e33-4383-970e-87f9d27d8daf_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zni7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9ddb8-5e33-4383-970e-87f9d27d8daf_1024x1024.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!zni7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ed9ddb8-5e33-4383-970e-87f9d27d8daf_1024x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">How to Spot When Your Large Language Model is Misleading You? Source: generated by DALL-E3. 
</figcaption></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://hungleai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Many read my posts, but only 3% subscribe. If you find my writing helpful, please subscribe&#8212;it&#8217;s free! Your support motivates me to keep creating high-quality and exclusive content. Thank you!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><div><hr></div><h2>LLM Is Just Making Stuff Up</h2><p>Ever have a conversation with a large language model that sounds super confident, spitting out facts that seem...well, a little fishy? &#128031; You're not alone. One of the biggest challenges in working with Large Language Models (LLMs) is verifying the correctness of their output. Despite their advanced capabilities, LLMs can sometimes generate information that appears accurate but is fabricated. This phenomenon, known as &#128073; <strong>hallucination</strong>, can lead to misinformation and erode trust in AI systems.</p><p>Hallucination in AI is not a new phenomenon. Deep learning models, in general, are notorious for their over-confidence in predictions. For instance, in classification tasks, these models can assign a very high probability to a label prediction, even when the prediction is incorrect [1].  
Deep learning models can be misleading in how powerful they truly are.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gWrB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770379e6-a878-4b1c-b065-919a820e0802_1209x260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gWrB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770379e6-a878-4b1c-b065-919a820e0802_1209x260.png 424w, https://substackcdn.com/image/fetch/$s_!gWrB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770379e6-a878-4b1c-b065-919a820e0802_1209x260.png 848w, https://substackcdn.com/image/fetch/$s_!gWrB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770379e6-a878-4b1c-b065-919a820e0802_1209x260.png 1272w, https://substackcdn.com/image/fetch/$s_!gWrB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770379e6-a878-4b1c-b065-919a820e0802_1209x260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gWrB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770379e6-a878-4b1c-b065-919a820e0802_1209x260.png" width="1209" height="260" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/770379e6-a878-4b1c-b065-919a820e0802_1209x260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:260,&quot;width&quot;:1209,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148143,&quot;alt&quot;:&quot;Over-confidence in deep models. &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Over-confidence in deep models. " title="Over-confidence in deep models. " srcset="https://substackcdn.com/image/fetch/$s_!gWrB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770379e6-a878-4b1c-b065-919a820e0802_1209x260.png 424w, https://substackcdn.com/image/fetch/$s_!gWrB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770379e6-a878-4b1c-b065-919a820e0802_1209x260.png 848w, https://substackcdn.com/image/fetch/$s_!gWrB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770379e6-a878-4b1c-b065-919a820e0802_1209x260.png 1272w, https://substackcdn.com/image/fetch/$s_!gWrB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770379e6-a878-4b1c-b065-919a820e0802_1209x260.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Over-confidence in deep models assigns a high probability of wrong prediction.  
</figcaption></figure></div><p></p><p>In the context of AI-generated text, large language models (LLMs) can produce content that appears real and coherent, yet is wrong or irrelevant. Recent papers [2] categorize LLM hallucinations into two main types:</p><ul><li><p><strong>Factuality Hallucination:</strong> This is like the LLM making stuff up entirely. It might sound convincing, but the information is just plain wrong. Think of it like telling you the capital of France is New York City (&#128073; <strong>factual inconsistency</strong>, i.e., simply wrong) or that the Roman Empire was the first civilization to discover Antarctica (&#128073; <strong>factual fabrication</strong>, i.e., no evidence).</p></li><li><p><strong>Faithfulness Hallucination:</strong> This happens when the LLM strays from your topic or instructions. It might weave a good story but doesn't answer your question or follow the original idea. Imagine asking for a recipe and getting a poem about kitchens instead (&#128073; <strong>instruction inconsistency</strong>). Another example is an LLM that summarizes an input document but adds claims the document does not support or even contradicts (&#128073; <strong>context inconsistency</strong>), or that produces a mathematical derivation whose steps do not follow from one another (&#128073; <strong>logical inconsistency</strong>).</p></li></ul><blockquote><p>&#128064; The dangerous thing is that the generated text looks really smooth and confident, which makes it hard to know whether the content is hallucinated.</p></blockquote><p>Imagine the catastrophic consequences of deploying LLMs in real-world applications without addressing their hallucination issues. For example, an LLM might mistakenly diagnose a benign condition when the symptoms indicate a serious illness, putting a patient's life at risk. 
The real-world impact of these errors is evident: a factual error in a promotional demo of Bard, Google's early LLM chatbot, was followed by a drop of roughly $100 billion in the market value of its parent company, Alphabet.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F-KM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc0a670-68d4-44d5-a880-c870e222ac51_900x515.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F-KM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc0a670-68d4-44d5-a880-c870e222ac51_900x515.jpeg 424w, https://substackcdn.com/image/fetch/$s_!F-KM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc0a670-68d4-44d5-a880-c870e222ac51_900x515.jpeg 848w, https://substackcdn.com/image/fetch/$s_!F-KM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc0a670-68d4-44d5-a880-c870e222ac51_900x515.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!F-KM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc0a670-68d4-44d5-a880-c870e222ac51_900x515.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F-KM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc0a670-68d4-44d5-a880-c870e222ac51_900x515.jpeg" width="900" height="515" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dcc0a670-68d4-44d5-a880-c870e222ac51_900x515.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:515,&quot;width&quot;:900,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Google's AI Chatbot Spreads Misinfo: A $140 Billion L &#8212; The Latch&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Google's AI Chatbot Spreads Misinfo: A $140 Billion L &#8212; The Latch" title="Google's AI Chatbot Spreads Misinfo: A $140 Billion L &#8212; The Latch" srcset="https://substackcdn.com/image/fetch/$s_!F-KM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc0a670-68d4-44d5-a880-c870e222ac51_900x515.jpeg 424w, https://substackcdn.com/image/fetch/$s_!F-KM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc0a670-68d4-44d5-a880-c870e222ac51_900x515.jpeg 848w, https://substackcdn.com/image/fetch/$s_!F-KM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc0a670-68d4-44d5-a880-c870e222ac51_900x515.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!F-KM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcc0a670-68d4-44d5-a880-c870e222ac51_900x515.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Hallucinations cost Google a lot of money. <a href="https://thelatch.com.au/google-ai-chatbot/">Source</a>. </figcaption></figure></div><p>A reliable AI system should be upfront about its limitations. Imagine asking a friend a question, and they spout out an answer with zero hesitation, even if they're not entirely sure. Not ideal, right? The same goes for AI. The best AI systems are those that can signal their uncertainty when they're unsure. We don't want an AI that's either arrogantly confident about everything or so timid it never takes a guess. Hallucination or not, the key issue is identifying when an LLM&#8217;s output is unreliable. 
Now, the real questions are:</p><p>&#129504; <em>Can we detect when LLMs are generating misleading content?</em> <em>Or even better, can we mitigate hallucination, or dehallucinate LLMs&#8217; output? </em></p><p>Today's focus is on the first question. We'll explore the second question in the next post.</p><div><hr></div><h2>Detecting Deception: Tools and Methods for Identifying LLM Falsehoods</h2><p>We all know that feeling &#8211; you ask a language model a question, and it answers with booming confidence. So, &#129504; <em>how can we tell if an LLM is just making stuff up?</em> There are two key approaches to detecting LLM "deception":</p><ul><li><p><strong>Score-based Methods: </strong>One way to sniff out an LLM's fib is to look at how uncertain it is about its answer. Imagine a friend who gives you an answer with a shrug and a mumbled "maybe." That might raise a red flag, right? Similarly, LLMs that express high uncertainty about their output are more likely to be unreliable. By measuring this uncertainty as a score, we can get a sense of how trustworthy the information might be. Several approaches follow this idea: some are inspired by heuristic methods for uncertainty estimation in deep learning, while others rely more on theoretical principles.</p></li><li><p><strong>Calling in the Backup: </strong>Another approach involves using external models called <a href="https://en.wikipedia.org/wiki/Conformal_prediction#:~:text=Conformal%20prediction%20(CP)%20is%20a,assuming%20exchangeability%20of%20the%20data.">conformal predictors</a>. Think of it as using AI to check AI. These models analyze the LLM's output and predict whether it is real or fabricated (hallucinated). There are two approaches to conformal prediction:</p><ol><li><p><strong>LLM Evaluator:</strong> This method utilizes another LLM (or potentially the same one) to evaluate the generated text itself. 
This approach bypasses the need for handcrafted features but potentially introduces additional complexity.</p></li><li><p><strong>Simple Conformal Predictor:</strong> This approach leverages well-established methods like linear or logistic regression. However, it relies heavily on extracting informative features from the LLM's output.</p></li></ol></li></ul><p>The effectiveness of these methods depends on what information we have access to about the LLM itself. For example, detection tends to be more accurate if we can peek "inside" the LLM and see its internal workings (like a white box). However, if the LLM is a black box (we can't see its inner workings), we might need to query it multiple times and compare the answers to get a clearer picture. Crucially, all methods assume that LLMs have some awareness of their uncertainty or confidence levels, meaning they have a rough idea of how accurate their outputs are. Without this self-awareness, estimating uncertainty or predicting correctness solely by observing the LLMs is impossible.</p><blockquote><p> &#128064; Fortunately, recent evidence suggests that this assumption is reasonable: LLMs are largely aware of what they know or don&#8217;t know. All we need to do is find good ways to extract or trigger this information.</p></blockquote><p>To sum up, every method comes down to crafting a score that measures the uncertainty or confidence of the LLM. For methods that do not use external conformal predictors, the score is a scalar that can be computed in different ways. The score can then be calibrated on a training dataset to find a proper threshold for the detection decision. For conformal prediction approaches, the score can be extracted from the predictor&#8217;s output logits (or the probabilities derived from them). We also need a training dataset to train the predictor. Sometimes, if the resulting probability is already reliable, we can just use a default threshold of 0.5 without calibration. 
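</p><p>The calibration step above can be sketched in a few lines of NumPy. This is only one simple way to do it: the sketch assumes that higher scores mean less trustworthy outputs and picks the cut-off that maximizes accuracy on a small labeled set. The function names <code>calibrate_threshold</code> and <code>detect</code>, the scores, and the labels are all hypothetical.</p><pre>
```python
import numpy as np

def calibrate_threshold(scores, is_hallucination):
    """Pick the score cut-off that maximizes accuracy on a labeled set.

    scores: uncertainty scores, higher means less trustworthy.
    is_hallucination: 1 if the output was judged wrong, else 0.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_hallucination, dtype=bool)
    best_threshold, best_accuracy = None, -1.0
    for t in np.unique(scores):  # every observed score is a candidate
        predicted = scores >= t  # flag outputs at or above the cut-off
        accuracy = float(np.mean(predicted == labels))
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = float(t), accuracy
    return best_threshold, best_accuracy

# Hypothetical calibration set: six outputs with made-up scores and labels.
train_scores = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
train_labels = [0, 0, 0, 1, 1, 1]
threshold, accuracy = calibrate_threshold(train_scores, train_labels)

def detect(score, threshold=threshold):
    """Return True when an output should be flagged as unreliable."""
    return score >= threshold
```
</pre><p>Accuracy is used here only because it is easy to read. On imbalanced data, a metric such as F1 or a fixed false-alarm rate may be a better target for the same search.</p><p>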
The general framework for detection is depicted below:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jVJv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a318c9-a65d-4957-bc8d-86592ba5f066_1297x603.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jVJv!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a318c9-a65d-4957-bc8d-86592ba5f066_1297x603.gif 424w, https://substackcdn.com/image/fetch/$s_!jVJv!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a318c9-a65d-4957-bc8d-86592ba5f066_1297x603.gif 848w, https://substackcdn.com/image/fetch/$s_!jVJv!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a318c9-a65d-4957-bc8d-86592ba5f066_1297x603.gif 1272w, https://substackcdn.com/image/fetch/$s_!jVJv!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a318c9-a65d-4957-bc8d-86592ba5f066_1297x603.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jVJv!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a318c9-a65d-4957-bc8d-86592ba5f066_1297x603.gif" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11a318c9-a65d-4957-bc8d-86592ba5f066_1297x603.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6226743,&quot;alt&quot;:&quot;LLM Hallucination Detector 
Framework.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="LLM Hallucination Detector Framework." title="LLM Hallucination Detector Framework." srcset="https://substackcdn.com/image/fetch/$s_!jVJv!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a318c9-a65d-4957-bc8d-86592ba5f066_1297x603.gif 424w, https://substackcdn.com/image/fetch/$s_!jVJv!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a318c9-a65d-4957-bc8d-86592ba5f066_1297x603.gif 848w, https://substackcdn.com/image/fetch/$s_!jVJv!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a318c9-a65d-4957-bc8d-86592ba5f066_1297x603.gif 1272w, https://substackcdn.com/image/fetch/$s_!jVJv!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a318c9-a65d-4957-bc8d-86592ba5f066_1297x603.gif 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">LLM Hallucination Detector Framework.</figcaption></figure></div><div><hr></div><h2>Score-based Approaches for Uncertainty Estimation in LLMs</h2><h4><strong>Heuristic Uncertainty as a Clue</strong></h4><p>Just like other powerful AI models, LLMs have a built-in tool to estimate how likely their answers are to be correct. This tool is embedded in the final layer, known as the softmax layer, which calculates the probability of each token in the vocabulary appearing at the current timestep. 
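</p><p>To make this concrete, here is a minimal sketch of how a softmax turns final-layer logits into next-token probabilities. The four-word vocabulary and the logit values are invented for illustration; a real LLM does this over tens of thousands of tokens.</p><pre><code>import numpy as np

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical vocabulary and logits for a single timestep.
vocab = ["Paris", "London", "Rome", "banana"]
logits = [4.0, 2.0, 1.0, -3.0]

probs = softmax(logits)
greedy_token = vocab[int(np.argmax(probs))]
print(dict(zip(vocab, probs.round(3))), greedy_token)
</code></pre><p>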
Typically, the token with the highest probability is the one you see in the LLMs' output.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tYWH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca291ede-8c3d-4b56-9389-a60c5d4bff46_506x402.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tYWH!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca291ede-8c3d-4b56-9389-a60c5d4bff46_506x402.gif 424w, https://substackcdn.com/image/fetch/$s_!tYWH!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca291ede-8c3d-4b56-9389-a60c5d4bff46_506x402.gif 848w, https://substackcdn.com/image/fetch/$s_!tYWH!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca291ede-8c3d-4b56-9389-a60c5d4bff46_506x402.gif 1272w, https://substackcdn.com/image/fetch/$s_!tYWH!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca291ede-8c3d-4b56-9389-a60c5d4bff46_506x402.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tYWH!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca291ede-8c3d-4b56-9389-a60c5d4bff46_506x402.gif" width="404" height="320.9644268774704" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca291ede-8c3d-4b56-9389-a60c5d4bff46_506x402.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:402,&quot;width&quot;:506,&quot;resizeWidth&quot;:404,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tYWH!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca291ede-8c3d-4b56-9389-a60c5d4bff46_506x402.gif 424w, https://substackcdn.com/image/fetch/$s_!tYWH!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca291ede-8c3d-4b56-9389-a60c5d4bff46_506x402.gif 848w, https://substackcdn.com/image/fetch/$s_!tYWH!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca291ede-8c3d-4b56-9389-a60c5d4bff46_506x402.gif 1272w, https://substackcdn.com/image/fetch/$s_!tYWH!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca291ede-8c3d-4b56-9389-a60c5d4bff46_506x402.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">LLMs assign probability to the next possible tokens. <a href="https://wikidocs.net/178448">Source</a>.</figcaption></figure></div><p>Unfortunately, as discussed earlier, the built-in probability is not reliable. It does not reflect the reasonable confidence LLMs should have. For example, an LLM might assign a high probability to a factually incorrect answer, simply because the answer aligns with the patterns it has observed in its training data. Furthermore, the built-in probability only applies to individual tokens, not the overall coherence or accuracy of the entire response. This limitation hinders our ability to gauge the trustworthiness of a complete sentence. Ideally, we need methods to assess the probability of the entire content being factually sound and logically consistent. </p><p>Fortunately, the field of deep learning offers established methodologies for uncertainty estimation. As Huang et al. 
(2023) highlight in their comprehensive survey, three key approaches can be directly adapted from this literature to quantify the uncertainty associated with LLM responses [2].</p><p>&#128073;<strong> Probability Aggregation:</strong> This approach combines individual token-level probabilities (often in the form of log probabilities) to estimate a single probability score for the entire sentence and response. </p><blockquote><p> &#128064; This approach is simple and economical, only requiring one forward pass of LLM to get the log probs. However, it needs access to softmax layer information, which may be unavailable for black box LLM services. </p></blockquote><p>For example, max and average aggregation can be used to estimate the uncertainty of a sentence <em>i</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2Dop!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdec81c8c-7ef5-431d-b32c-975b38846eee_555x228.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2Dop!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdec81c8c-7ef5-431d-b32c-975b38846eee_555x228.png 424w, https://substackcdn.com/image/fetch/$s_!2Dop!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdec81c8c-7ef5-431d-b32c-975b38846eee_555x228.png 848w, https://substackcdn.com/image/fetch/$s_!2Dop!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdec81c8c-7ef5-431d-b32c-975b38846eee_555x228.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2Dop!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdec81c8c-7ef5-431d-b32c-975b38846eee_555x228.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2Dop!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdec81c8c-7ef5-431d-b32c-975b38846eee_555x228.png" width="305" height="125.29729729729729" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dec81c8c-7ef5-431d-b32c-975b38846eee_555x228.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:228,&quot;width&quot;:555,&quot;resizeWidth&quot;:305,&quot;bytes&quot;:21623,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2Dop!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdec81c8c-7ef5-431d-b32c-975b38846eee_555x228.png 424w, https://substackcdn.com/image/fetch/$s_!2Dop!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdec81c8c-7ef5-431d-b32c-975b38846eee_555x228.png 848w, https://substackcdn.com/image/fetch/$s_!2Dop!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdec81c8c-7ef5-431d-b32c-975b38846eee_555x228.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2Dop!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdec81c8c-7ef5-431d-b32c-975b38846eee_555x228.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where &#119901;<sub>&#119894;&#119895;</sub> is the probability of the token at position &#119895; in sentence <em>i</em>.</p><blockquote><p> &#128064; Averaging the negative log probs gives the log of the &#128073;<strong>Perplexity</strong>, so average aggregation ranks sentences the same way perplexity does. </p></blockquote><p>Since LLMs predict probabilities for all possible tokens at each step, they produce a full distribution <em>p(x<sub>j</sub>)</em> for the <em>j</em>-th token. Hence we can use entropy, a well-established measure of uncertainty, <em>H(X<sub>j</sub>) = -&#931; p(x<sub>j</sub>) log p(x<sub>j</sub>)</em> (summed over the vocabulary), to estimate sentence-level uncertainty as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k61-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d08e56b-cbaa-4bc0-9048-6f5550bc3f01_408x217.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k61-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d08e56b-cbaa-4bc0-9048-6f5550bc3f01_408x217.png 424w, https://substackcdn.com/image/fetch/$s_!k61-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d08e56b-cbaa-4bc0-9048-6f5550bc3f01_408x217.png 848w, 
https://substackcdn.com/image/fetch/$s_!k61-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d08e56b-cbaa-4bc0-9048-6f5550bc3f01_408x217.png 1272w, https://substackcdn.com/image/fetch/$s_!k61-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d08e56b-cbaa-4bc0-9048-6f5550bc3f01_408x217.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!k61-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d08e56b-cbaa-4bc0-9048-6f5550bc3f01_408x217.png" width="222" height="118.07352941176471" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d08e56b-cbaa-4bc0-9048-6f5550bc3f01_408x217.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:217,&quot;width&quot;:408,&quot;resizeWidth&quot;:222,&quot;bytes&quot;:16759,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!k61-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d08e56b-cbaa-4bc0-9048-6f5550bc3f01_408x217.png 424w, https://substackcdn.com/image/fetch/$s_!k61-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d08e56b-cbaa-4bc0-9048-6f5550bc3f01_408x217.png 848w, 
https://substackcdn.com/image/fetch/$s_!k61-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d08e56b-cbaa-4bc0-9048-6f5550bc3f01_408x217.png 1272w, https://substackcdn.com/image/fetch/$s_!k61-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d08e56b-cbaa-4bc0-9048-6f5550bc3f01_408x217.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>&#128073;<strong> Uncertainty through Voting:</strong> This trick involves generating multiple responses from the LLM for the same prompt. We then analyze the variance (how different or inconsistent the responses are from each other). The idea is that if the LLM keeps spitting out similar answers, it suggests a more consistent and potentially reliable thought process. The more the responses veer off course, the higher the uncertainty. Diving deeper, we can derive 2 metrics: (1) variation ratio (VR) and (2) variation ratio for original prediction (VRO). 
In particular, if we sample <em>T</em> sentence responses from the LLM and can measure the difference between 2 responses <em>p<sub>i</sub></em> and <em>p<sub>j</sub></em> via function <em>dist()</em>, we have: </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VEOR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323a8898-51d6-4075-ab21-24e3da3947fd_1246x541.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VEOR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323a8898-51d6-4075-ab21-24e3da3947fd_1246x541.png 424w, https://substackcdn.com/image/fetch/$s_!VEOR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323a8898-51d6-4075-ab21-24e3da3947fd_1246x541.png 848w, https://substackcdn.com/image/fetch/$s_!VEOR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323a8898-51d6-4075-ab21-24e3da3947fd_1246x541.png 1272w, https://substackcdn.com/image/fetch/$s_!VEOR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323a8898-51d6-4075-ab21-24e3da3947fd_1246x541.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VEOR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323a8898-51d6-4075-ab21-24e3da3947fd_1246x541.png" width="504" height="218.8314606741573" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/323a8898-51d6-4075-ab21-24e3da3947fd_1246x541.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:541,&quot;width&quot;:1246,&quot;resizeWidth&quot;:504,&quot;bytes&quot;:108696,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VEOR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323a8898-51d6-4075-ab21-24e3da3947fd_1246x541.png 424w, https://substackcdn.com/image/fetch/$s_!VEOR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323a8898-51d6-4075-ab21-24e3da3947fd_1246x541.png 848w, https://substackcdn.com/image/fetch/$s_!VEOR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323a8898-51d6-4075-ab21-24e3da3947fd_1246x541.png 1272w, https://substackcdn.com/image/fetch/$s_!VEOR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323a8898-51d6-4075-ab21-24e3da3947fd_1246x541.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The main difference between the VRO and VR formulas is that VRO only considers the variance between the original response and any additional generated responses (assigning a weight of 1 to the original response and 0 to the others). 
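</p><p>The sketch below shows one plausible reading of the two metrics: VR as the average distance over all pairs of sampled responses, and VRO as the average distance between the original response and each sample. The exact normalization in [2] may differ, and the function names and the Jaccard word-set distance used as <em>dist()</em> are assumptions made only to keep the example self-contained.</p><pre><code>from itertools import combinations

def jaccard_distance(a, b):
    """Stand-in dist(): 1 - word-set overlap. 0 means identical word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a.union(set_b)
    if not union:
        return 0.0
    return 1.0 - len(set_a.intersection(set_b)) / len(union)

def variation_ratio(samples, dist=jaccard_distance):
    """VR-style score: average distance over all pairs of sampled responses."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def variation_ratio_original(original, samples, dist=jaccard_distance):
    """VRO-style score: average distance from the original response to each sample."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(dist(original, s) for s in samples) / len(samples)

# Toy responses, made up for illustration.
original = "Paris is the capital of France"
samples = [
    "Paris is the capital of France",
    "The capital of France is Paris",
    "Lyon is the capital of France",
]
vr = variation_ratio(samples)
vro = variation_ratio_original(original, samples)
print(round(vr, 3), round(vro, 3))
</code></pre><p>Note that this stand-in distance ignores word order, so the first two samples count as identical; a stronger <em>dist()</em> changes the numbers but not the structure of the computation.</p><p>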
Here, the distance function can be the BLEU score which captures lexical matching:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TUI8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b42664-5813-4edb-8bfb-f0818d516cd1_1459x344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TUI8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b42664-5813-4edb-8bfb-f0818d516cd1_1459x344.png 424w, https://substackcdn.com/image/fetch/$s_!TUI8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b42664-5813-4edb-8bfb-f0818d516cd1_1459x344.png 848w, https://substackcdn.com/image/fetch/$s_!TUI8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b42664-5813-4edb-8bfb-f0818d516cd1_1459x344.png 1272w, https://substackcdn.com/image/fetch/$s_!TUI8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b42664-5813-4edb-8bfb-f0818d516cd1_1459x344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TUI8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b42664-5813-4edb-8bfb-f0818d516cd1_1459x344.png" width="1456" height="343" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25b42664-5813-4edb-8bfb-f0818d516cd1_1459x344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:343,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66735,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TUI8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b42664-5813-4edb-8bfb-f0818d516cd1_1459x344.png 424w, https://substackcdn.com/image/fetch/$s_!TUI8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b42664-5813-4edb-8bfb-f0818d516cd1_1459x344.png 848w, https://substackcdn.com/image/fetch/$s_!TUI8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b42664-5813-4edb-8bfb-f0818d516cd1_1459x344.png 1272w, https://substackcdn.com/image/fetch/$s_!TUI8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b42664-5813-4edb-8bfb-f0818d516cd1_1459x344.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">BLEU score to measure uncertainty. Source: [2].</figcaption></figure></div><p>We can also use BERT as the function to capture semantic similarity/difference between responses. 
SelfCheckGPT paper [3] proposes a similar uncertainty formula using BERT to measure the uncertainty of response <em>r<sub>i</sub></em>: </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NVJW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2777a31a-5a0a-4769-ac4d-a8fcd496bb04_696x147.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NVJW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2777a31a-5a0a-4769-ac4d-a8fcd496bb04_696x147.png 424w, https://substackcdn.com/image/fetch/$s_!NVJW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2777a31a-5a0a-4769-ac4d-a8fcd496bb04_696x147.png 848w, https://substackcdn.com/image/fetch/$s_!NVJW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2777a31a-5a0a-4769-ac4d-a8fcd496bb04_696x147.png 1272w, https://substackcdn.com/image/fetch/$s_!NVJW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2777a31a-5a0a-4769-ac4d-a8fcd496bb04_696x147.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NVJW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2777a31a-5a0a-4769-ac4d-a8fcd496bb04_696x147.png" width="384" height="81.10344827586206" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2777a31a-5a0a-4769-ac4d-a8fcd496bb04_696x147.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:147,&quot;width&quot;:696,&quot;resizeWidth&quot;:384,&quot;bytes&quot;:16233,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NVJW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2777a31a-5a0a-4769-ac4d-a8fcd496bb04_696x147.png 424w, https://substackcdn.com/image/fetch/$s_!NVJW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2777a31a-5a0a-4769-ac4d-a8fcd496bb04_696x147.png 848w, https://substackcdn.com/image/fetch/$s_!NVJW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2777a31a-5a0a-4769-ac4d-a8fcd496bb04_696x147.png 1272w, https://substackcdn.com/image/fetch/$s_!NVJW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2777a31a-5a0a-4769-ac4d-a8fcd496bb04_696x147.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here, for each of the <em>N</em> sampling iterations, we sample several sentences, select the one most similar to the original response in terms of BERT score, and then average over the <em>N</em> iterations. </p><blockquote><p> &#128064; To sample different outputs from LLMs, we need access to the temperature hyperparameter <em>t</em>. 
<em>t=0</em> makes generation deterministic, so repeated calls return the same output. <em>t&gt;0</em> introduces more stochasticity into the generation process. Other than that, the voting approach works well with black box LLMs. </p></blockquote><p>&#128073;<strong> Uncertainty through Perturbation: </strong>One fascinating aspect of LLMs is their inherent randomness during text generation. Like a chain reaction, a tiny change in one predicted word can ripple through the entire sequence, potentially leading to completely different meanings. This sensitivity to small changes can be measured as follows:</p><ol><li><p>Choose a token and replace (perturb) it with alternative tokens (those with the top-<em>k</em> highest probabilities). This leads to several responses. </p></li><li><p>Compute the variance over these responses as the uncertainty score, as in the voting mechanism above. </p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AxaV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fdae74-8d6f-4239-b2ed-fe222692105c_1479x328.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AxaV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fdae74-8d6f-4239-b2ed-fe222692105c_1479x328.png 424w, https://substackcdn.com/image/fetch/$s_!AxaV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fdae74-8d6f-4239-b2ed-fe222692105c_1479x328.png 848w, 
https://substackcdn.com/image/fetch/$s_!AxaV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fdae74-8d6f-4239-b2ed-fe222692105c_1479x328.png 1272w, https://substackcdn.com/image/fetch/$s_!AxaV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fdae74-8d6f-4239-b2ed-fe222692105c_1479x328.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AxaV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fdae74-8d6f-4239-b2ed-fe222692105c_1479x328.png" width="1456" height="323" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94fdae74-8d6f-4239-b2ed-fe222692105c_1479x328.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:323,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:92069,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AxaV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fdae74-8d6f-4239-b2ed-fe222692105c_1479x328.png 424w, https://substackcdn.com/image/fetch/$s_!AxaV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fdae74-8d6f-4239-b2ed-fe222692105c_1479x328.png 848w, 
https://substackcdn.com/image/fetch/$s_!AxaV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fdae74-8d6f-4239-b2ed-fe222692105c_1479x328.png 1272w, https://substackcdn.com/image/fetch/$s_!AxaV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94fdae74-8d6f-4239-b2ed-fe222692105c_1479x328.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Uncertainty through Perturbation. Source: [2].</figcaption></figure></div><p><em>&#129504; Which token should we mess with? </em>The authors propose 3 options:</p><ul><li><p><strong>Most Uncertain Spot:</strong> The position in the generated sequence where the LLM seems least sure about which token to pick next (highest entropy, Max).</p></li><li><p><strong>Most Confident Spot:</strong> The opposite of the first option: the position where the LLM seems most certain about the token it chose (lowest entropy, Min).</p></li><li><p><strong>Biggest Shift:</strong> The position where the LLM's confidence changes the most compared to the previous token (maximum change in entropy, MaxDiff).</p></li></ul><blockquote><p> &#128064; This approach requires intervening in the LLM's generation process, and is thus more suitable for the white box setting. 
</p></blockquote><p>The study [2] found that voting over multiple responses is the most effective way to gauge uncertainty, followed by perturbing the text and measuring the resulting changes, and lastly aggregating the probabilities the LLM assigns to each token.</p><blockquote><p>&#129514; Voting <em>&gt;</em> Perturbation <em>&gt;</em> Probability Aggregation</p></blockquote><p></p><h4>Quantifying Uncertainty with Information Theory</h4><p>Continuing the line of reasoning that multiple samples help hallucination detection, we can look deeper into the hidden states of the LLM instead of only probing it from the outside. Concretely, Chen et al. (2024) sample multiple responses, each yielding hidden states and a feature vector, which provides richer information about the LLM's confidence in its responses. The authors propose an uncertainty metric based on the eigenvalues of the covariance matrix of these feature vectors, &#128073; <strong>EigenScore [7]</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZQDN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f670cb-fadc-41ae-afe3-8d9093d4e823_1017x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZQDN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f670cb-fadc-41ae-afe3-8d9093d4e823_1017x360.png 424w, https://substackcdn.com/image/fetch/$s_!ZQDN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f670cb-fadc-41ae-afe3-8d9093d4e823_1017x360.png 848w, 
https://substackcdn.com/image/fetch/$s_!ZQDN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f670cb-fadc-41ae-afe3-8d9093d4e823_1017x360.png 1272w, https://substackcdn.com/image/fetch/$s_!ZQDN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f670cb-fadc-41ae-afe3-8d9093d4e823_1017x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZQDN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f670cb-fadc-41ae-afe3-8d9093d4e823_1017x360.png" width="1017" height="360" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25f670cb-fadc-41ae-afe3-8d9093d4e823_1017x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:360,&quot;width&quot;:1017,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:113007,&quot;alt&quot;:&quot;Eigen Framework&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Eigen Framework" title="Eigen Framework" srcset="https://substackcdn.com/image/fetch/$s_!ZQDN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f670cb-fadc-41ae-afe3-8d9093d4e823_1017x360.png 424w, https://substackcdn.com/image/fetch/$s_!ZQDN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f670cb-fadc-41ae-afe3-8d9093d4e823_1017x360.png 848w, 
https://substackcdn.com/image/fetch/$s_!ZQDN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f670cb-fadc-41ae-afe3-8d9093d4e823_1017x360.png 1272w, https://substackcdn.com/image/fetch/$s_!ZQDN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25f670cb-fadc-41ae-afe3-8d9093d4e823_1017x360.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">EigenScore Framework. 
Source [7]</figcaption></figure></div><p>In particular, given <em>K</em> hidden-state vectors stacked into a matrix <em>Z</em>, they compute the covariance matrix:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZCk1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58bf80ab-f5dd-456a-b6f1-610b12e0f053_350x84.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZCk1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58bf80ab-f5dd-456a-b6f1-610b12e0f053_350x84.png 424w, https://substackcdn.com/image/fetch/$s_!ZCk1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58bf80ab-f5dd-456a-b6f1-610b12e0f053_350x84.png 848w, https://substackcdn.com/image/fetch/$s_!ZCk1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58bf80ab-f5dd-456a-b6f1-610b12e0f053_350x84.png 1272w, https://substackcdn.com/image/fetch/$s_!ZCk1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58bf80ab-f5dd-456a-b6f1-610b12e0f053_350x84.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZCk1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58bf80ab-f5dd-456a-b6f1-610b12e0f053_350x84.png" width="176" height="42.24" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58bf80ab-f5dd-456a-b6f1-610b12e0f053_350x84.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:84,&quot;width&quot;:350,&quot;resizeWidth&quot;:176,&quot;bytes&quot;:6826,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZCk1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58bf80ab-f5dd-456a-b6f1-610b12e0f053_350x84.png 424w, https://substackcdn.com/image/fetch/$s_!ZCk1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58bf80ab-f5dd-456a-b6f1-610b12e0f053_350x84.png 848w, https://substackcdn.com/image/fetch/$s_!ZCk1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58bf80ab-f5dd-456a-b6f1-610b12e0f053_350x84.png 1272w, https://substackcdn.com/image/fetch/$s_!ZCk1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58bf80ab-f5dd-456a-b6f1-610b12e0f053_350x84.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>J<sub>d</sub> = I<sub>d</sub> &#8722; (1/d) 1<sub>d</sub> 1<sub>d</sub><sup>&#8868;</sup></em> is the centering matrix and <em>1<sub>d</sub> &#8712; R<sup>d</sup></em> is the all-one column vector. 
Then, EigenScore can be calculated as the logarithm of the determinant (LogDet) of the covariance matrix:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9WCm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01f223c8-541a-4f48-ad0c-0c3cd0337a06_999x358.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9WCm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01f223c8-541a-4f48-ad0c-0c3cd0337a06_999x358.png 424w, https://substackcdn.com/image/fetch/$s_!9WCm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01f223c8-541a-4f48-ad0c-0c3cd0337a06_999x358.png 848w, https://substackcdn.com/image/fetch/$s_!9WCm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01f223c8-541a-4f48-ad0c-0c3cd0337a06_999x358.png 1272w, https://substackcdn.com/image/fetch/$s_!9WCm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01f223c8-541a-4f48-ad0c-0c3cd0337a06_999x358.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9WCm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01f223c8-541a-4f48-ad0c-0c3cd0337a06_999x358.png" width="416" height="149.07707707707706" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01f223c8-541a-4f48-ad0c-0c3cd0337a06_999x358.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:358,&quot;width&quot;:999,&quot;resizeWidth&quot;:416,&quot;bytes&quot;:93179,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9WCm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01f223c8-541a-4f48-ad0c-0c3cd0337a06_999x358.png 424w, https://substackcdn.com/image/fetch/$s_!9WCm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01f223c8-541a-4f48-ad0c-0c3cd0337a06_999x358.png 848w, https://substackcdn.com/image/fetch/$s_!9WCm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01f223c8-541a-4f48-ad0c-0c3cd0337a06_999x358.png 1272w, https://substackcdn.com/image/fetch/$s_!9WCm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01f223c8-541a-4f48-ad0c-0c3cd0337a06_999x358.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p> &#128064; The EigenScore corresponds, up to constants, to <a href="https://en.wikipedia.org/wiki/Differential_entropy">the differential entropy</a> of the sentence embeddings, assuming they follow a Gaussian distribution. Hence, it is reasonable to use it as a measure of uncertainty. 
</p></blockquote><p>The authors also suggest clipping extreme feature values when computing EigenScore to reduce overconfident estimates:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YJLV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0733c9b3-f3ab-4ac8-b823-6691acb2d14d_522x150.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YJLV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0733c9b3-f3ab-4ac8-b823-6691acb2d14d_522x150.png 424w, https://substackcdn.com/image/fetch/$s_!YJLV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0733c9b3-f3ab-4ac8-b823-6691acb2d14d_522x150.png 848w, https://substackcdn.com/image/fetch/$s_!YJLV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0733c9b3-f3ab-4ac8-b823-6691acb2d14d_522x150.png 1272w, https://substackcdn.com/image/fetch/$s_!YJLV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0733c9b3-f3ab-4ac8-b823-6691acb2d14d_522x150.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YJLV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0733c9b3-f3ab-4ac8-b823-6691acb2d14d_522x150.png" width="332" height="95.40229885057471" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0733c9b3-f3ab-4ac8-b823-6691acb2d14d_522x150.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:150,&quot;width&quot;:522,&quot;resizeWidth&quot;:332,&quot;bytes&quot;:19241,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YJLV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0733c9b3-f3ab-4ac8-b823-6691acb2d14d_522x150.png 424w, https://substackcdn.com/image/fetch/$s_!YJLV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0733c9b3-f3ab-4ac8-b823-6691acb2d14d_522x150.png 848w, https://substackcdn.com/image/fetch/$s_!YJLV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0733c9b3-f3ab-4ac8-b823-6691acb2d14d_522x150.png 1272w, https://substackcdn.com/image/fetch/$s_!YJLV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0733c9b3-f3ab-4ac8-b823-6691acb2d14d_522x150.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>h<sub>min</sub></em> and <em>h<sub>max</sub></em> are hyperparameters that can be tuned or calibrated. </p><p>More principled research [4] sheds light on LLM hallucinations by revisiting the basics of uncertainty in machine learning. 
Turns out, there are two big types:</p><ul><li><p><strong>Epistemic Uncertainty:</strong> When the LLM just doesn't know enough (think facts or grammar rules). This can happen because it hasn't seen enough training data or just isn't powerful enough yet.</p></li><li><p><strong>Aleatoric Uncertainty:</strong> When the question itself is ambiguous. Imagine there are multiple right answers, making it a guessing game even for the smartest LLM. This kind of uncertainty is common in LLM settings because there are often many reasonable ways to respond. </p></li></ul><p>So, the lower the epistemic uncertainty, the more likely the LLM's answer is on point. Since aleatoric uncertainty is not the model's fault and cannot be reduced, it is important to tell the two sources of uncertainty apart. </p><p>&#10060; The problem with heuristic approaches is that they only measure the LLM's total uncertainty and cannot separate out the ambiguity inherent in the problem itself (aleatoric uncertainty). This can be misleading. For example, a perfect predictor might show high uncertainty that is purely aleatoric, while a bad one might show high uncertainty that is purely epistemic. Both would appear equally uncertain under heuristic methods.</p><p>Therefore, the authors propose to focus on identifying instances where the epistemic uncertainty alone is large, which would suggest that the response is likely hallucinated. To this end, they propose &#128073; <strong>estimating epistemic uncertainty via an iterative prompting procedure.</strong></p><p>Here's the trick: first, they ask the model to respond to a query. Then, they ask for another response to the query plus the first response. After that, they request a third response to the query and the first two responses, and so on. If the LLM keeps changing its response across iterations, it suggests a lack of confidence in its knowledge. 
In contrast, if the LLM consistently provides answers that are insensitive to the previous responses concatenated into the prompt, it indicates a stronger grasp of the topic.</p><blockquote><p>&#128064;  In other words, for a model that truly knows the answer, the responses should be independent of one another. This means that the joint distribution of these responses, for a fixed query, must be a product distribution.</p></blockquote><p>To illustrate the point, the authors track the probability the LLM assigns to the correct answer under the iterative prompting procedure:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vaIW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a7651-7cbd-4b61-bb1d-5de8db8a346a_1358x412.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vaIW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a7651-7cbd-4b61-bb1d-5de8db8a346a_1358x412.png 424w, https://substackcdn.com/image/fetch/$s_!vaIW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a7651-7cbd-4b61-bb1d-5de8db8a346a_1358x412.png 848w, https://substackcdn.com/image/fetch/$s_!vaIW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a7651-7cbd-4b61-bb1d-5de8db8a346a_1358x412.png 1272w, https://substackcdn.com/image/fetch/$s_!vaIW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a7651-7cbd-4b61-bb1d-5de8db8a346a_1358x412.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!vaIW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a7651-7cbd-4b61-bb1d-5de8db8a346a_1358x412.png" width="1358" height="412" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e5a7651-7cbd-4b61-bb1d-5de8db8a346a_1358x412.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:1358,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:132868,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vaIW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a7651-7cbd-4b61-bb1d-5de8db8a346a_1358x412.png 424w, https://substackcdn.com/image/fetch/$s_!vaIW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a7651-7cbd-4b61-bb1d-5de8db8a346a_1358x412.png 848w, https://substackcdn.com/image/fetch/$s_!vaIW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a7651-7cbd-4b61-bb1d-5de8db8a346a_1358x412.png 1272w, https://substackcdn.com/image/fetch/$s_!vaIW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e5a7651-7cbd-4b61-bb1d-5de8db8a346a_1358x412.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Left: The question is easy and the LLM knows the answer. Its confidence stays high as the number of iterations increases. Middle: The question is hard and the LLM is not confident. Its confidence drops to zero quickly with more iterations. Right: When there are multiple right answers (high aleatoric uncertainty), confidence also drops, but more slowly. </figcaption></figure></div><blockquote><p>&#128064; Why? Intuitively, if the question was seen during training, the attention key and query weights of the LLM are tuned to give the question higher attention scores than other sentences. 
Thus, the question will be attended to the most regardless of the context length, and the LLM will always have a chance to look at the question and give the right answer. On the contrary, if the question is novel, the weights cannot help, and as the context gets longer, the attention can land anywhere, leading the LLM to attend to the wrong input when answering. </p></blockquote><p>In short, the iterative prompting procedure gives us a hint about the uncertain behavior of the LLM. With this motivation in place, we can now derive a robust uncertainty score. Formally, given a query <em>x &#8712; X</em> and possible responses <em>Y<sub>1</sub>, . . . , Y<sub>t</sub>,</em> a family of prompts <em>F = {F<sub>t</sub> : X &#8594; X | t &#8712; N}</em> is defined with the prompt function <em>F<sub>t</sub>(x, Y<sub>1</sub>, . . . , Y<sub>t</sub>)</em> as:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lhRR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6531befb-af2f-4acf-b254-03355652fce1_1145x336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lhRR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6531befb-af2f-4acf-b254-03355652fce1_1145x336.png 424w, https://substackcdn.com/image/fetch/$s_!lhRR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6531befb-af2f-4acf-b254-03355652fce1_1145x336.png 848w, https://substackcdn.com/image/fetch/$s_!lhRR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6531befb-af2f-4acf-b254-03355652fce1_1145x336.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lhRR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6531befb-af2f-4acf-b254-03355652fce1_1145x336.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lhRR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6531befb-af2f-4acf-b254-03355652fce1_1145x336.png" width="1145" height="336" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6531befb-af2f-4acf-b254-03355652fce1_1145x336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:336,&quot;width&quot;:1145,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74398,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lhRR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6531befb-af2f-4acf-b254-03355652fce1_1145x336.png 424w, https://substackcdn.com/image/fetch/$s_!lhRR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6531befb-af2f-4acf-b254-03355652fce1_1145x336.png 848w, https://substackcdn.com/image/fetch/$s_!lhRR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6531befb-af2f-4acf-b254-03355652fce1_1145x336.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lhRR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6531befb-af2f-4acf-b254-03355652fce1_1145x336.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Then, we can model the distribution of the sequence of responses given the query x: </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!gTs_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaded81-71b3-4fc8-9b22-a352cc89b155_1553x175.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gTs_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaded81-71b3-4fc8-9b22-a352cc89b155_1553x175.png 424w, https://substackcdn.com/image/fetch/$s_!gTs_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaded81-71b3-4fc8-9b22-a352cc89b155_1553x175.png 848w, https://substackcdn.com/image/fetch/$s_!gTs_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaded81-71b3-4fc8-9b22-a352cc89b155_1553x175.png 1272w, https://substackcdn.com/image/fetch/$s_!gTs_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaded81-71b3-4fc8-9b22-a352cc89b155_1553x175.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gTs_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaded81-71b3-4fc8-9b22-a352cc89b155_1553x175.png" width="1456" height="164" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfaded81-71b3-4fc8-9b22-a352cc89b155_1553x175.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:164,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65754,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gTs_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaded81-71b3-4fc8-9b22-a352cc89b155_1553x175.png 424w, https://substackcdn.com/image/fetch/$s_!gTs_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaded81-71b3-4fc8-9b22-a352cc89b155_1553x175.png 848w, https://substackcdn.com/image/fetch/$s_!gTs_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaded81-71b3-4fc8-9b22-a352cc89b155_1553x175.png 1272w, https://substackcdn.com/image/fetch/$s_!gTs_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaded81-71b3-4fc8-9b22-a352cc89b155_1553x175.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; The chain rule holds only approximately, because the prompt function <em>F<sub>t</sub></em> is used to combine the random variables. Hence, it is a pseudo-joint distribution. 
</p></blockquote><p>Given this formulation of the joint distribution, it is intuitive to say that the LLM's responses <em>Y<sub>1</sub>,&#8230;,Y<sub>n</sub>|x</em> are wrong if the probability the LLM assigns to <em>Y<sub>1</sub>,&#8230;,Y<sub>n</sub>|x</em> differs from the ground-truth probability of <em>Y<sub>1</sub>,&#8230;,Y<sub>n</sub>|x</em>. Thus, a natural measure of the truthfulness of the LLM&#8217;s output is the KL divergence between the LLM&#8217;s joint distribution and the ground-truth joint distribution. Yet, we don&#8217;t know the ground-truth distribution. Fortunately, we can replace the KL divergence with an estimable lower bound, the mutual information of the LLM&#8217;s pseudo-joint distribution:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rG6z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4ba284c-dcd7-4794-9260-c3b1a9fcc37a_609x188.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rG6z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4ba284c-dcd7-4794-9260-c3b1a9fcc37a_609x188.png 424w, https://substackcdn.com/image/fetch/$s_!rG6z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4ba284c-dcd7-4794-9260-c3b1a9fcc37a_609x188.png 848w, https://substackcdn.com/image/fetch/$s_!rG6z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4ba284c-dcd7-4794-9260-c3b1a9fcc37a_609x188.png 1272w, https://substackcdn.com/image/fetch/$s_!rG6z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4ba284c-dcd7-4794-9260-c3b1a9fcc37a_609x188.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!rG6z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4ba284c-dcd7-4794-9260-c3b1a9fcc37a_609x188.png" width="447" height="137.99014778325125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4ba284c-dcd7-4794-9260-c3b1a9fcc37a_609x188.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:188,&quot;width&quot;:609,&quot;resizeWidth&quot;:447,&quot;bytes&quot;:25912,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rG6z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4ba284c-dcd7-4794-9260-c3b1a9fcc37a_609x188.png 424w, https://substackcdn.com/image/fetch/$s_!rG6z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4ba284c-dcd7-4794-9260-c3b1a9fcc37a_609x188.png 848w, https://substackcdn.com/image/fetch/$s_!rG6z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4ba284c-dcd7-4794-9260-c3b1a9fcc37a_609x188.png 1272w, https://substackcdn.com/image/fetch/$s_!rG6z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4ba284c-dcd7-4794-9260-c3b1a9fcc37a_609x188.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Computing the exact <a href="https://en.wikipedia.org/wiki/Mutual_information">mutual 
information</a> requires evaluating <em>Q</em> over its entire support, which can be infinite. Therefore, the authors propose estimating the term with a sampling-based approximation. In particular,</p><ol><li><p>Sample a sequence of responses <em>X<sub>1</sub>, . . . , X<sub>k</sub></em> from the LLM</p></li><li><p>Construct the set of indices of unique elements <em>S = { i &#8712; [k] : X<sub>i</sub> &#8800; X<sub>j</sub> &#8704;j &lt; i }</em></p></li><li><p>Construct empirical distributions: for all <em>i &#8712; S</em>:</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hlj3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1d67ccd-5311-4275-aad4-904564ef76e2_1018x269.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hlj3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1d67ccd-5311-4275-aad4-904564ef76e2_1018x269.png 424w, https://substackcdn.com/image/fetch/$s_!Hlj3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1d67ccd-5311-4275-aad4-904564ef76e2_1018x269.png 848w, https://substackcdn.com/image/fetch/$s_!Hlj3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1d67ccd-5311-4275-aad4-904564ef76e2_1018x269.png 1272w, https://substackcdn.com/image/fetch/$s_!Hlj3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1d67ccd-5311-4275-aad4-904564ef76e2_1018x269.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Hlj3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1d67ccd-5311-4275-aad4-904564ef76e2_1018x269.png" width="618" height="163.30255402750493" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1d67ccd-5311-4275-aad4-904564ef76e2_1018x269.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:269,&quot;width&quot;:1018,&quot;resizeWidth&quot;:618,&quot;bytes&quot;:75097,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hlj3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1d67ccd-5311-4275-aad4-904564ef76e2_1018x269.png 424w, https://substackcdn.com/image/fetch/$s_!Hlj3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1d67ccd-5311-4275-aad4-904564ef76e2_1018x269.png 848w, https://substackcdn.com/image/fetch/$s_!Hlj3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1d67ccd-5311-4275-aad4-904564ef76e2_1018x269.png 1272w, https://substackcdn.com/image/fetch/$s_!Hlj3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1d67ccd-5311-4275-aad4-904564ef76e2_1018x269.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ol start="4"><li><p>Finally, compute the estimated mutual information:</p></li></ol><div 
class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OqME!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fb93fc9-27e6-4592-92a6-3574b9ffd088_831x184.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OqME!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fb93fc9-27e6-4592-92a6-3574b9ffd088_831x184.png 424w, https://substackcdn.com/image/fetch/$s_!OqME!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fb93fc9-27e6-4592-92a6-3574b9ffd088_831x184.png 848w, https://substackcdn.com/image/fetch/$s_!OqME!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fb93fc9-27e6-4592-92a6-3574b9ffd088_831x184.png 1272w, https://substackcdn.com/image/fetch/$s_!OqME!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fb93fc9-27e6-4592-92a6-3574b9ffd088_831x184.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OqME!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fb93fc9-27e6-4592-92a6-3574b9ffd088_831x184.png" width="376" height="83.25391095066185" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9fb93fc9-27e6-4592-92a6-3574b9ffd088_831x184.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:184,&quot;width&quot;:831,&quot;resizeWidth&quot;:376,&quot;bytes&quot;:35642,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OqME!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fb93fc9-27e6-4592-92a6-3574b9ffd088_831x184.png 424w, https://substackcdn.com/image/fetch/$s_!OqME!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fb93fc9-27e6-4592-92a6-3574b9ffd088_831x184.png 848w, https://substackcdn.com/image/fetch/$s_!OqME!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fb93fc9-27e6-4592-92a6-3574b9ffd088_831x184.png 1272w, https://substackcdn.com/image/fetch/$s_!OqME!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fb93fc9-27e6-4592-92a6-3574b9ffd088_831x184.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>&#120574;</em> and <em>k</em> are hyperparameters. </p><div><hr></div><h2>Model-based Hallucination Detection </h2><h4>LLMs as Evaluators</h4><p>It seems like a chicken-and-egg problem to use LLMs to detect LLMs&#8217; falsehood &#8211; <em>&#129504; how can an LLM identify a lie if it chose to generate the lie in the first place? 
</em>Interestingly, early research has shown that they can do it [5]. It is reasonable since humans also exhibit this behavior. We often make mistakes and only realize them upon reflection. Similarly, when interacting with language models, they can acknowledge and correct their errors when shown their mistakes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vtC-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de85fb4-adb6-446b-ac86-8596cc22dd1d_982x387.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vtC-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de85fb4-adb6-446b-ac86-8596cc22dd1d_982x387.png 424w, https://substackcdn.com/image/fetch/$s_!vtC-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de85fb4-adb6-446b-ac86-8596cc22dd1d_982x387.png 848w, https://substackcdn.com/image/fetch/$s_!vtC-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de85fb4-adb6-446b-ac86-8596cc22dd1d_982x387.png 1272w, https://substackcdn.com/image/fetch/$s_!vtC-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de85fb4-adb6-446b-ac86-8596cc22dd1d_982x387.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vtC-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de85fb4-adb6-446b-ac86-8596cc22dd1d_982x387.png" width="982" height="387" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7de85fb4-adb6-446b-ac86-8596cc22dd1d_982x387.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:387,&quot;width&quot;:982,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:27109,&quot;alt&quot;:&quot;Examples of how an LLM can recognize its error after reflection. &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Examples of how an LLM can recognize its error after reflection. " title="Examples of how an LLM can recognize its error after reflection. " srcset="https://substackcdn.com/image/fetch/$s_!vtC-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de85fb4-adb6-446b-ac86-8596cc22dd1d_982x387.png 424w, https://substackcdn.com/image/fetch/$s_!vtC-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de85fb4-adb6-446b-ac86-8596cc22dd1d_982x387.png 848w, https://substackcdn.com/image/fetch/$s_!vtC-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de85fb4-adb6-446b-ac86-8596cc22dd1d_982x387.png 1272w, https://substackcdn.com/image/fetch/$s_!vtC-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de85fb4-adb6-446b-ac86-8596cc22dd1d_982x387.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Examples of how an LLM can recognize its error after reflection. </figcaption></figure></div><p>Validation becomes more reliable when a stronger LLM verifies the outputs of a weaker one. This approach, commonly used by the open-source community to benchmark LLM improvements, uses a larger or more advanced model to assess the accuracy of a smaller or less-developed model. 
The detection framework is very simple:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lI4s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b70ee0-d9e9-48c8-b18e-7164381e76ea_1645x272.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lI4s!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b70ee0-d9e9-48c8-b18e-7164381e76ea_1645x272.gif 424w, https://substackcdn.com/image/fetch/$s_!lI4s!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b70ee0-d9e9-48c8-b18e-7164381e76ea_1645x272.gif 848w, https://substackcdn.com/image/fetch/$s_!lI4s!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b70ee0-d9e9-48c8-b18e-7164381e76ea_1645x272.gif 1272w, https://substackcdn.com/image/fetch/$s_!lI4s!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b70ee0-d9e9-48c8-b18e-7164381e76ea_1645x272.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lI4s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b70ee0-d9e9-48c8-b18e-7164381e76ea_1645x272.gif" width="1456" height="241" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92b70ee0-d9e9-48c8-b18e-7164381e76ea_1645x272.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:241,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:173871,&quot;alt&quot;:&quot;LLMs as Evaluators Framework.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="LLMs as Evaluators Framework." title="LLMs as Evaluators Framework." srcset="https://substackcdn.com/image/fetch/$s_!lI4s!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b70ee0-d9e9-48c8-b18e-7164381e76ea_1645x272.gif 424w, https://substackcdn.com/image/fetch/$s_!lI4s!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b70ee0-d9e9-48c8-b18e-7164381e76ea_1645x272.gif 848w, https://substackcdn.com/image/fetch/$s_!lI4s!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b70ee0-d9e9-48c8-b18e-7164381e76ea_1645x272.gif 1272w, https://substackcdn.com/image/fetch/$s_!lI4s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b70ee0-d9e9-48c8-b18e-7164381e76ea_1645x272.gif 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">LLMs as Evaluators Framework. We ask an LLM (stronger) to evaluate the truthfulness of the Main (weaker) LLM&#8217;s response. 
The evaluation prompt can be as simple as &#8220;Is this response correct?&#8221;</figcaption></figure></div><blockquote><p>&#128064; Simple evaluation prompts do not always work well, especially when the LLM Evaluator is not stronger than the main LLM. </p></blockquote><p>Improving the accuracy of the LLM evaluator requires special methods [5]. One property the authors found is that LLMs are well calibrated on multiple-choice and true/false questions. Put simply, the probabilities they assign to the options are somewhat reliable.</p><blockquote><p>&#128064; This property is more evident when the multiple-choice question has a suitable format. For example, adding a &#8220;None of the above&#8221; choice may reduce the quality of calibration. We may also need to tune the temperature <em>t</em> to obtain good probabilities. </p></blockquote><p>Thus, they propose a simple trick that uses &#128073; <strong>LLM Prompting without finetuning</strong> to make the evaluation more accurate:</p><ol><li><p>Present the response to the LLM Evaluator and ask if the response is True or False. </p></li><li><p>Measure the probability <em>P(&#8220;True&#8221;)</em> that the LLM Evaluator assigns to the token &#8220;True&#8221;. 
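</p><p>The two steps above can be sketched as follows. This is a minimal illustration, not the authors&#8217; implementation: the prompt template, the function names, and the assumption that the evaluator exposes log-probabilities for its next token as a dictionary are all hypothetical, and renormalizing over the two answer tokens is an optional choice that the text does not specify.</p>

```python
import math


def build_eval_prompt(question: str, proposed_answer: str) -> str:
    """Format a True/False self-evaluation prompt (hypothetical template)."""
    return (
        f"Question: {question}\n"
        f"Proposed answer: {proposed_answer}\n"
        "Is the proposed answer True or False?\n"
        "The proposed answer is:"
    )


def p_true(token_logprobs: dict[str, float]) -> float:
    """Return P("True") renormalized over the tokens "True" and "False".

    token_logprobs maps candidate next tokens to log-probabilities
    (assumed format; adapt it to whatever your evaluator returns).
    """
    neg_inf = float("-inf")
    lp_true = token_logprobs.get("True", neg_inf)
    lp_false = token_logprobs.get("False", neg_inf)
    if lp_true == neg_inf and lp_false == neg_inf:
        raise ValueError("neither 'True' nor 'False' has a log-probability")
    # Subtract the max before exponentiating for numerical stability.
    top = max(lp_true, lp_false)
    w_true = math.exp(lp_true - top)
    w_false = math.exp(lp_false - top)
    return w_true / (w_true + w_false)
```

<p>With log-probabilities corresponding to 0.6 for &#8220;True&#8221; and 0.2 for &#8220;False&#8221;, <code>p_true</code> returns about 0.75. A response could then be flagged as a likely hallucination when this score falls below a threshold tuned on held-out data. 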
</p></li></ol><p>An example to illustrate the evaluation prompt:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hnJm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff626fcaa-95c5-4fc8-b827-0080a2d9f1da_1245x279.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hnJm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff626fcaa-95c5-4fc8-b827-0080a2d9f1da_1245x279.png 424w, https://substackcdn.com/image/fetch/$s_!hnJm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff626fcaa-95c5-4fc8-b827-0080a2d9f1da_1245x279.png 848w, https://substackcdn.com/image/fetch/$s_!hnJm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff626fcaa-95c5-4fc8-b827-0080a2d9f1da_1245x279.png 1272w, https://substackcdn.com/image/fetch/$s_!hnJm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff626fcaa-95c5-4fc8-b827-0080a2d9f1da_1245x279.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hnJm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff626fcaa-95c5-4fc8-b827-0080a2d9f1da_1245x279.png" width="1245" height="279" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f626fcaa-95c5-4fc8-b827-0080a2d9f1da_1245x279.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:279,&quot;width&quot;:1245,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:79923,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hnJm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff626fcaa-95c5-4fc8-b827-0080a2d9f1da_1245x279.png 424w, https://substackcdn.com/image/fetch/$s_!hnJm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff626fcaa-95c5-4fc8-b827-0080a2d9f1da_1245x279.png 848w, https://substackcdn.com/image/fetch/$s_!hnJm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff626fcaa-95c5-4fc8-b827-0080a2d9f1da_1245x279.png 1272w, https://substackcdn.com/image/fetch/$s_!hnJm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff626fcaa-95c5-4fc8-b827-0080a2d9f1da_1245x279.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>To enhance accuracy, the authors suggest presenting the model with additional  examples for comparison. 
For example, we can generate a total of 5 responses, and then ask the model to assess the validity of one of them&#8212;the original response of LLM.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x2vU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d41e34-f3e9-4de8-9615-701426ecc7b1_1440x489.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x2vU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d41e34-f3e9-4de8-9615-701426ecc7b1_1440x489.png 424w, https://substackcdn.com/image/fetch/$s_!x2vU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d41e34-f3e9-4de8-9615-701426ecc7b1_1440x489.png 848w, https://substackcdn.com/image/fetch/$s_!x2vU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d41e34-f3e9-4de8-9615-701426ecc7b1_1440x489.png 1272w, https://substackcdn.com/image/fetch/$s_!x2vU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d41e34-f3e9-4de8-9615-701426ecc7b1_1440x489.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x2vU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d41e34-f3e9-4de8-9615-701426ecc7b1_1440x489.png" width="1440" height="489" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42d41e34-f3e9-4de8-9615-701426ecc7b1_1440x489.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:489,&quot;width&quot;:1440,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:158857,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x2vU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d41e34-f3e9-4de8-9615-701426ecc7b1_1440x489.png 424w, https://substackcdn.com/image/fetch/$s_!x2vU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d41e34-f3e9-4de8-9615-701426ecc7b1_1440x489.png 848w, https://substackcdn.com/image/fetch/$s_!x2vU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d41e34-f3e9-4de8-9615-701426ecc7b1_1440x489.png 1272w, https://substackcdn.com/image/fetch/$s_!x2vU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d41e34-f3e9-4de8-9615-701426ecc7b1_1440x489.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>&#128064; The result can be further improved with <a href="https://www.promptingguide.ai/techniques/fewshot">few-shot prompting techniques</a>. In short, we can say that  Comparison (Few-shot) <em>&gt;</em> Comparison <em>&gt;</em> One Proposed Answer.</p></blockquote><p>In addition to prompting, the authors also propose &#128073; <strong>finetuning LLMs for the detection</strong> task. Concretely, they train LLMs to predict whether they know the answer to any given free-form question, i.e., estimating <em>P(IK) (&#8220;I know&#8221;),</em> using 2 approaches:</p><ul><li><p><strong>Value Head Integration:</strong> This approach introduces an additional "head" to the LLM architecture. This head is specifically trained to predict P(IK) as a logit value.  A key advantage of this method lies in its flexibility. 
We can probe the value head at any point during text generation, allowing for dynamic uncertainty assessment.</p></li><li><p><strong>Natural Language Prompt-based Training:</strong> This approach trains the LLM to respond to the prompt: "With what confidence could you answer this question?" The model's target output is a human-readable percentage value (0% - 100%) reflecting its estimated confidence in answering the question. This method offers a more intuitive interpretation of uncertainty for users.</p></li></ul><p>&#10060; Unfortunately, the Natural Language Prompt-based Training approach fails, so the authors only pursue the Value Head Integration approach. </p><p>Training the value head requires labeled data. Similar to training other conformal predictors, we need data in a binary classification format:</p><ul><li><p><em>X</em>: the input and response from the LLM</p></li><li><p><em>Y</em>: whether the response is correct or not.</p></li></ul><p>In practice, they generated 30 response samples per input question. For example, if 20 of the 30 samples are deemed correct, the question contributes 20 positive-label data points to the training set, indicating the model "knew" the answer, and the remaining 10 incorrect samples contribute 10 negative-label data points. The LLM is finetuned so that the value head's output matches these ground-truth labels. </p><p>The results indicate that finetuning generally helps the model distinguish between correct and incorrect responses. In the in-distribution setting across datasets, the LLM's predicted <em>P(IK)</em> aligns somewhat with the ground truth. 
However, when generalized to a different dataset (from TriviaQA to Mixed-Arithmetic), this differentiation becomes less clear.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nnZX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcb91ee-2d90-4316-b9c6-693bd517c980_1530x1011.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nnZX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcb91ee-2d90-4316-b9c6-693bd517c980_1530x1011.png 424w, https://substackcdn.com/image/fetch/$s_!nnZX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcb91ee-2d90-4316-b9c6-693bd517c980_1530x1011.png 848w, https://substackcdn.com/image/fetch/$s_!nnZX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcb91ee-2d90-4316-b9c6-693bd517c980_1530x1011.png 1272w, https://substackcdn.com/image/fetch/$s_!nnZX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcb91ee-2d90-4316-b9c6-693bd517c980_1530x1011.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nnZX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcb91ee-2d90-4316-b9c6-693bd517c980_1530x1011.png" width="1456" height="962" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/abcb91ee-2d90-4316-b9c6-693bd517c980_1530x1011.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:962,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:177651,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nnZX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcb91ee-2d90-4316-b9c6-693bd517c980_1530x1011.png 424w, https://substackcdn.com/image/fetch/$s_!nnZX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcb91ee-2d90-4316-b9c6-693bd517c980_1530x1011.png 848w, https://substackcdn.com/image/fetch/$s_!nnZX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcb91ee-2d90-4316-b9c6-693bd517c980_1530x1011.png 1272w, https://substackcdn.com/image/fetch/$s_!nnZX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabcb91ee-2d90-4316-b9c6-693bd517c980_1530x1011.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Despite the promising initial results, the approach has significant limitations:</p><p>&#10060; <strong>High Detection Cost:</strong> Detection is expensive because the detector is itself an LLM. These models require substantial compute and energy, which raises both hardware and operational costs.</p><p>&#10060; <strong>Textual Responses Alone Are Insufficient:</strong> Relying only on the textual response to judge truthfulness is inadequate. The text alone cannot reliably reveal whether the information is correct, because LLMs are very good at making fabricated content look plausible. </p><p></p><h4>Simple Conformal Predictor</h4><p>The key to catching hallucinations might lie within the LLMs themselves. By peering deeper into their internal workings, we could extract valuable clues about their current state and what they "believe" to be true. 
This richer information would significantly boost the accuracy of hallucination detection. Think of it this way: with a clearer picture of the LLM's thought process, we wouldn't need such a complex detector. Even a simpler classifier could do the job if we have the right features to analyze. The workflow becomes:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wl5i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d05e92-0743-4fc3-b8e5-a8dfa311e007_1156x307.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wl5i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d05e92-0743-4fc3-b8e5-a8dfa311e007_1156x307.gif 424w, https://substackcdn.com/image/fetch/$s_!wl5i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d05e92-0743-4fc3-b8e5-a8dfa311e007_1156x307.gif 848w, https://substackcdn.com/image/fetch/$s_!wl5i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d05e92-0743-4fc3-b8e5-a8dfa311e007_1156x307.gif 1272w, https://substackcdn.com/image/fetch/$s_!wl5i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d05e92-0743-4fc3-b8e5-a8dfa311e007_1156x307.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wl5i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d05e92-0743-4fc3-b8e5-a8dfa311e007_1156x307.gif" width="1156" height="307" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6d05e92-0743-4fc3-b8e5-a8dfa311e007_1156x307.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:307,&quot;width&quot;:1156,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:709181,&quot;alt&quot;:&quot;Conformal prediction using LLM features. &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Conformal prediction using LLM features. " title="Conformal prediction using LLM features. " srcset="https://substackcdn.com/image/fetch/$s_!wl5i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d05e92-0743-4fc3-b8e5-a8dfa311e007_1156x307.gif 424w, https://substackcdn.com/image/fetch/$s_!wl5i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d05e92-0743-4fc3-b8e5-a8dfa311e007_1156x307.gif 848w, https://substackcdn.com/image/fetch/$s_!wl5i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d05e92-0743-4fc3-b8e5-a8dfa311e007_1156x307.gif 1272w, https://substackcdn.com/image/fetch/$s_!wl5i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6d05e92-0743-4fc3-b8e5-a8dfa311e007_1156x307.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Conformal prediction using LLM features. </figcaption></figure></div><p></p><p>This idea is attractive because a simple conformal predictor, such as a feed-forward neural network, can perform the detection. However, the choice of features raises several questions:</p><p>&#129300; Is it easy and cost-effective to extract these features?</p><p>&#129300; Are these features informative, and do they generalize well to new prompts and different LLMs?</p><p>Now, the main question is:<em> &#129504; Which features should we extract? </em>One candidate is &#128073; <strong>the internal states of the LLMs</strong>. Recent work reports that the internal states of LLMs are a reliable source of information for judging the truthfulness of the final response [6,7].  
</p><blockquote><p>&#128064; Note that the features must be the internal states, not the response text or the response embedding vector. </p></blockquote><p>As quoted in their paper:</p><div class="pullquote"><p>We hypothesize that the truth or falsehood of a statement should be represented by, and therefore extractable from, the LLM&#8217;s internal state. </p><p>Source: [6]</p></div><p>Ok, let&#8217;s use the internal states, which are represented by the activations of the LLM&#8217;s hidden layers. <em>&#129504; Which layers should we use?</em> Intuitively, the last hidden layer seems like a good candidate &#8211; it should theoretically hold all the processed information. But there&#8217;s a catch: this layer is primarily focused on predicting the next token in the sequence, not necessarily retaining long-term context. Conversely, layers closer to the input are better at extracting basic features from the data, but might not capture the bigger picture. </p><p>To find out, the authors use several hidden layers of the LLM as features. For each chosen layer, the feature vector is either the average of the hidden states across token timesteps or the last token&#8217;s hidden state. 
The results reveal that the middle layers perform best:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vTt2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0691bb95-5ad2-43bf-9169-fe1935c5f3cf_1047x311.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vTt2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0691bb95-5ad2-43bf-9169-fe1935c5f3cf_1047x311.png 424w, https://substackcdn.com/image/fetch/$s_!vTt2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0691bb95-5ad2-43bf-9169-fe1935c5f3cf_1047x311.png 848w, https://substackcdn.com/image/fetch/$s_!vTt2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0691bb95-5ad2-43bf-9169-fe1935c5f3cf_1047x311.png 1272w, https://substackcdn.com/image/fetch/$s_!vTt2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0691bb95-5ad2-43bf-9169-fe1935c5f3cf_1047x311.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vTt2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0691bb95-5ad2-43bf-9169-fe1935c5f3cf_1047x311.png" width="1047" height="311" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0691bb95-5ad2-43bf-9169-fe1935c5f3cf_1047x311.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:311,&quot;width&quot;:1047,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:154215,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vTt2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0691bb95-5ad2-43bf-9169-fe1935c5f3cf_1047x311.png 424w, https://substackcdn.com/image/fetch/$s_!vTt2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0691bb95-5ad2-43bf-9169-fe1935c5f3cf_1047x311.png 848w, https://substackcdn.com/image/fetch/$s_!vTt2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0691bb95-5ad2-43bf-9169-fe1935c5f3cf_1047x311.png 1272w, https://substackcdn.com/image/fetch/$s_!vTt2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0691bb95-5ad2-43bf-9169-fe1935c5f3cf_1047x311.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Detection performance across hidden layers. Source: [6]. </figcaption></figure></div><p>Recently, a more detailed investigation into the hidden states of LLMs aims to find out if these internal states can signal the risk of hallucination based on the given queries [8]. The goal is to see if we can reliably estimate this risk even before the LLM generates a response, i.e.,  &#128073; <strong>self-awareness</strong>. </p><p>Self-awareness is the ability in humans that causes us to hesitate before responding to queries or making decisions in situations where we recognize our lack of knowledge (we know what we don&#8217;t know). The authors want to verify that ability in LLMs by studying LLMs&#8217; internal states. 
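</p><p>To make the two ideas above concrete (pooling one layer's hidden states into a feature vector, and reading the query's last-token state before any response is generated), here is a minimal Python sketch. It assumes the hidden states of one layer are already available as a <em>(num_tokens, hidden_dim)</em> array; the function names, the toy numbers, and the linear probe are hypothetical illustrations, not the estimators used in [6] or [8].</p>

```python
import numpy as np


def pool_hidden_states(hidden, mode="last"):
    """Turn one layer's hidden states into a single feature vector.

    hidden: array-like of shape (num_tokens, hidden_dim) for one chosen layer.
    mode:   "mean" averages over token timesteps; "last" keeps the last token.
    """
    hidden = np.asarray(hidden, dtype=float)
    if hidden.ndim != 2 or hidden.shape[0] == 0:
        raise ValueError("expected a non-empty (num_tokens, hidden_dim) array")
    if mode == "mean":
        return hidden.mean(axis=0)
    if mode == "last":
        return hidden[-1]
    raise ValueError("mode must be 'mean' or 'last'")


def probe_risk(features, weights, bias):
    """Hypothetical linear probe: squash a weighted sum into a (0, 1) score."""
    logit = float(np.dot(features, weights) + bias)
    return 1.0 / (1.0 + np.exp(-logit))


# Toy stand-in for the hidden states of a 3-token query (hidden_dim = 2).
query_states = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]

mean_features = pool_hidden_states(query_states, mode="mean")
last_features = pool_hidden_states(query_states, mode="last")

# Made-up probe parameters; a real probe would be fit on labeled queries.
risk = probe_risk(last_features, weights=np.array([0.5, -0.5]), bias=0.0)
```

<p>In practice the probe parameters would be learned from labeled examples; the sketch only shows where the features come from and how a lightweight estimator would consume them.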
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SLp1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70649580-edf2-496d-9b97-5a121e00940c_735x516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SLp1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70649580-edf2-496d-9b97-5a121e00940c_735x516.png 424w, https://substackcdn.com/image/fetch/$s_!SLp1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70649580-edf2-496d-9b97-5a121e00940c_735x516.png 848w, https://substackcdn.com/image/fetch/$s_!SLp1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70649580-edf2-496d-9b97-5a121e00940c_735x516.png 1272w, https://substackcdn.com/image/fetch/$s_!SLp1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70649580-edf2-496d-9b97-5a121e00940c_735x516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SLp1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70649580-edf2-496d-9b97-5a121e00940c_735x516.png" width="500" height="351.0204081632653" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70649580-edf2-496d-9b97-5a121e00940c_735x516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:516,&quot;width&quot;:735,&quot;resizeWidth&quot;:500,&quot;bytes&quot;:140367,&quot;alt&quot;:&quot;Self-awareness in 
LLMs. &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Self-awareness in LLMs. " title="Self-awareness in LLMs. " srcset="https://substackcdn.com/image/fetch/$s_!SLp1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70649580-edf2-496d-9b97-5a121e00940c_735x516.png 424w, https://substackcdn.com/image/fetch/$s_!SLp1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70649580-edf2-496d-9b97-5a121e00940c_735x516.png 848w, https://substackcdn.com/image/fetch/$s_!SLp1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70649580-edf2-496d-9b97-5a121e00940c_735x516.png 1272w, https://substackcdn.com/image/fetch/$s_!SLp1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70649580-edf2-496d-9b97-5a121e00940c_735x516.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 
2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Self-awareness in LLMs. The hidden representation space is generally clustered according to known and unknown inputs. Source: [8]</figcaption></figure></div><p></p><p>Concretely, they use the internal state at the last token of the query, denoted as <em>x<sub>q</sub></em>. The conformal predictor, or estimator, is a variant of the multilayer perceptron (MLP) adapted from Llama&#8217;s MLP. 
The estimator is mathematically formulated to predict the hallucination risk <em>H</em> as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AZgM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d008da-e58c-4032-a04d-504312cc6e81_804x167.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AZgM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d008da-e58c-4032-a04d-504312cc6e81_804x167.png 424w, https://substackcdn.com/image/fetch/$s_!AZgM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d008da-e58c-4032-a04d-504312cc6e81_804x167.png 848w, https://substackcdn.com/image/fetch/$s_!AZgM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d008da-e58c-4032-a04d-504312cc6e81_804x167.png 1272w, https://substackcdn.com/image/fetch/$s_!AZgM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d008da-e58c-4032-a04d-504312cc6e81_804x167.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AZgM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d008da-e58c-4032-a04d-504312cc6e81_804x167.png" width="410" height="85.16169154228855" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01d008da-e58c-4032-a04d-504312cc6e81_804x167.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:167,&quot;width&quot;:804,&quot;resizeWidth&quot;:410,&quot;bytes&quot;:30259,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AZgM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d008da-e58c-4032-a04d-504312cc6e81_804x167.png 424w, https://substackcdn.com/image/fetch/$s_!AZgM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d008da-e58c-4032-a04d-504312cc6e81_804x167.png 848w, https://substackcdn.com/image/fetch/$s_!AZgM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d008da-e58c-4032-a04d-504312cc6e81_804x167.png 1272w, https://substackcdn.com/image/fetch/$s_!AZgM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d008da-e58c-4032-a04d-504312cc6e81_804x167.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>They also prepare a dataset containing both known and unknown queries for the LLMs. The LLMs are expected to be uncertain about the unknown queries, which they have never encountered before. 
They train the estimator on the dataset and compare it with simple baselines such as Perplexity and Prompting to illustrate the point that internal states are really good indicators of uncertainty for unknown queries.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6xYu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9164b04-c8c1-4c1a-8e7e-143b59e6ead5_826x468.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6xYu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9164b04-c8c1-4c1a-8e7e-143b59e6ead5_826x468.png 424w, https://substackcdn.com/image/fetch/$s_!6xYu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9164b04-c8c1-4c1a-8e7e-143b59e6ead5_826x468.png 848w, https://substackcdn.com/image/fetch/$s_!6xYu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9164b04-c8c1-4c1a-8e7e-143b59e6ead5_826x468.png 1272w, https://substackcdn.com/image/fetch/$s_!6xYu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9164b04-c8c1-4c1a-8e7e-143b59e6ead5_826x468.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6xYu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9164b04-c8c1-4c1a-8e7e-143b59e6ead5_826x468.png" width="826" height="468" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c9164b04-c8c1-4c1a-8e7e-143b59e6ead5_826x468.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:468,&quot;width&quot;:826,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:163775,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6xYu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9164b04-c8c1-4c1a-8e7e-143b59e6ead5_826x468.png 424w, https://substackcdn.com/image/fetch/$s_!6xYu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9164b04-c8c1-4c1a-8e7e-143b59e6ead5_826x468.png 848w, https://substackcdn.com/image/fetch/$s_!6xYu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9164b04-c8c1-4c1a-8e7e-143b59e6ead5_826x468.png 1272w, https://substackcdn.com/image/fetch/$s_!6xYu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9164b04-c8c1-4c1a-8e7e-143b59e6ead5_826x468.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">F1 scores across many datasets. Source: [8]</figcaption></figure></div><blockquote><p>&#129514; Internal-State Conformal Predictor <em>&gt;</em> PPL <em>&gt;</em> ICL Prompt <em>&gt;</em> Zero-shot Prompt. </p></blockquote><p>Hidden states, while powerful tools within Large Language Models (LLMs), come with inherent limitations:</p><p><strong>&#10060; Architecture Dependence: </strong>Extracting hidden states is intrinsically tied to specific LLM architectures and models. This creates a roadblock when transferring the extraction process across different LLMs. Each LLM architecture might require unique approaches to access and interpret its hidden states.</p><p><strong>&#10060; Sensitivity and Generalizability: </strong>Hidden states are demonstrably sensitive to the input data and the specific LLM they are extracted from. This sensitivity poses a significant challenge to generalizability. 
For instance, conformal predictors trained on hidden states from one dataset might not perform well when applied to hidden states derived from a different dataset or a different LLM.</p><p>In a promising new direction, researchers have proposed an alternative approach that bypasses hidden states altogether. This method, &#128073; <strong>Lookback Lens</strong>, extracts features directly from the attention scores computed by the LLM. Attention is attractive because it reveals how much the LLM relies on the given context when generating text, and it is easier for humans to interpret than most other internal signals. That makes it a useful tool for catching and fixing made-up information (hallucinations) in the generated text. </p><p>When a Transformer-based LLM generates a token, it attends both to the context tokens and to its own newly generated tokens. The authors aggregate the attention scores at attention head <em>h</em> and layer <em>l</em> separately for these two groups of tokens: </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x-Op!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41dfb4f2-cbcd-4f73-be60-5d2b3bab26bf_480x270.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x-Op!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41dfb4f2-cbcd-4f73-be60-5d2b3bab26bf_480x270.png 424w, https://substackcdn.com/image/fetch/$s_!x-Op!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41dfb4f2-cbcd-4f73-be60-5d2b3bab26bf_480x270.png 848w, 
https://substackcdn.com/image/fetch/$s_!x-Op!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41dfb4f2-cbcd-4f73-be60-5d2b3bab26bf_480x270.png 1272w, https://substackcdn.com/image/fetch/$s_!x-Op!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41dfb4f2-cbcd-4f73-be60-5d2b3bab26bf_480x270.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x-Op!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41dfb4f2-cbcd-4f73-be60-5d2b3bab26bf_480x270.png" width="290" height="163.125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41dfb4f2-cbcd-4f73-be60-5d2b3bab26bf_480x270.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:270,&quot;width&quot;:480,&quot;resizeWidth&quot;:290,&quot;bytes&quot;:28625,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x-Op!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41dfb4f2-cbcd-4f73-be60-5d2b3bab26bf_480x270.png 424w, https://substackcdn.com/image/fetch/$s_!x-Op!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41dfb4f2-cbcd-4f73-be60-5d2b3bab26bf_480x270.png 848w, 
https://substackcdn.com/image/fetch/$s_!x-Op!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41dfb4f2-cbcd-4f73-be60-5d2b3bab26bf_480x270.png 1272w, https://substackcdn.com/image/fetch/$s_!x-Op!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41dfb4f2-cbcd-4f73-be60-5d2b3bab26bf_480x270.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here <em>N</em> is the number of tokens in the context and <em>t</em> is the timestep of newly generated tokens. Hence, the lookback ratio <em>LR<sub>t</sub><sup>l,h</sup></em> for head <em>h</em> in layer <em>l</em> at time step <em>t</em> is:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vloL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87faeb1-7009-49ea-9def-64bef645c97e_800x210.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vloL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87faeb1-7009-49ea-9def-64bef645c97e_800x210.png 424w, https://substackcdn.com/image/fetch/$s_!vloL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87faeb1-7009-49ea-9def-64bef645c97e_800x210.png 848w, https://substackcdn.com/image/fetch/$s_!vloL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87faeb1-7009-49ea-9def-64bef645c97e_800x210.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vloL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87faeb1-7009-49ea-9def-64bef645c97e_800x210.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vloL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87faeb1-7009-49ea-9def-64bef645c97e_800x210.png" width="352" height="92.4" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d87faeb1-7009-49ea-9def-64bef645c97e_800x210.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:210,&quot;width&quot;:800,&quot;resizeWidth&quot;:352,&quot;bytes&quot;:29706,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vloL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87faeb1-7009-49ea-9def-64bef645c97e_800x210.png 424w, https://substackcdn.com/image/fetch/$s_!vloL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87faeb1-7009-49ea-9def-64bef645c97e_800x210.png 848w, https://substackcdn.com/image/fetch/$s_!vloL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87faeb1-7009-49ea-9def-64bef645c97e_800x210.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vloL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87faeb1-7009-49ea-9def-64bef645c97e_800x210.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; Intuitively, if the LLM focuses more on the context (the ratio is higher), its output tends to be more reliable and less likely to be hallucinated. </p></blockquote><p>The lookback ratios from different layers, heads, and timesteps can then be combined into a single feature vector for a span of generated text <em>Y={y<sub>t</sub>, y<sub>t+1</sub>, ..., y<sub>t+T&#8722;1</sub>}</em>. Given this feature vector, a simple classifier (logistic regression) predicts whether the span is factual or hallucinated. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nGt9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff41d819a-28a8-461d-9548-50b03d4a8901_1496x397.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nGt9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff41d819a-28a8-461d-9548-50b03d4a8901_1496x397.png 424w, https://substackcdn.com/image/fetch/$s_!nGt9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff41d819a-28a8-461d-9548-50b03d4a8901_1496x397.png 848w, https://substackcdn.com/image/fetch/$s_!nGt9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff41d819a-28a8-461d-9548-50b03d4a8901_1496x397.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nGt9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff41d819a-28a8-461d-9548-50b03d4a8901_1496x397.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nGt9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff41d819a-28a8-461d-9548-50b03d4a8901_1496x397.png" width="1456" height="386" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f41d819a-28a8-461d-9548-50b03d4a8901_1496x397.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:386,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:104680,&quot;alt&quot;:&quot;Lookback Lens architecture&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Lookback Lens architecture" title="Lookback Lens architecture" srcset="https://substackcdn.com/image/fetch/$s_!nGt9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff41d819a-28a8-461d-9548-50b03d4a8901_1496x397.png 424w, https://substackcdn.com/image/fetch/$s_!nGt9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff41d819a-28a8-461d-9548-50b03d4a8901_1496x397.png 848w, https://substackcdn.com/image/fetch/$s_!nGt9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff41d819a-28a8-461d-9548-50b03d4a8901_1496x397.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nGt9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff41d819a-28a8-461d-9548-50b03d4a8901_1496x397.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Lookback Lens architecture. Source: [9].</figcaption></figure></div><p>The experimental results are promising, at least for summarization tasks, where attending to the context is crucial. 
</p><blockquote><p>&#129514; Lookback Lens <em>&gt;</em> Hidden States <em>&gt;</em> Prompt </p></blockquote><p>The results also reveal that Lookback Lens might not always fit the training data perfectly, but it consistently performs better on out-of-domain tasks, i.e., tasks that differ from the one it was trained on. Lookback Lens relies on attention maps (lookback ratio features), which appear to be more robust than hidden states, and this makes it adaptable and useful for a wider range of problems.</p><p>Using the lookback ratio as a score, we can measure the factuality or confidence of different generated candidates:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lt9K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5896ae0d-016a-4df4-901c-499f5db912e5_1392x714.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Lt9K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5896ae0d-016a-4df4-901c-499f5db912e5_1392x714.png 424w, https://substackcdn.com/image/fetch/$s_!Lt9K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5896ae0d-016a-4df4-901c-499f5db912e5_1392x714.png 848w, https://substackcdn.com/image/fetch/$s_!Lt9K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5896ae0d-016a-4df4-901c-499f5db912e5_1392x714.png 1272w, https://substackcdn.com/image/fetch/$s_!Lt9K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5896ae0d-016a-4df4-901c-499f5db912e5_1392x714.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Lt9K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5896ae0d-016a-4df4-901c-499f5db912e5_1392x714.png" width="1392" height="714" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5896ae0d-016a-4df4-901c-499f5db912e5_1392x714.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:714,&quot;width&quot;:1392,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:215133,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Lt9K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5896ae0d-016a-4df4-901c-499f5db912e5_1392x714.png 424w, https://substackcdn.com/image/fetch/$s_!Lt9K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5896ae0d-016a-4df4-901c-499f5db912e5_1392x714.png 848w, https://substackcdn.com/image/fetch/$s_!Lt9K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5896ae0d-016a-4df4-901c-499f5db912e5_1392x714.png 1272w, https://substackcdn.com/image/fetch/$s_!Lt9K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5896ae0d-016a-4df4-901c-499f5db912e5_1392x714.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The lookback ratio score prefers factual statements over made-up content. Source: [9]. </figcaption></figure></div><div><hr></div><h2>Final Thoughts: The Future of LLM Hallucination Detection</h2><p>While the methods explored here offer promising avenues for detecting LLM hallucinations, there's still much room for exploration. 
Future research directions include:</p><ul><li><p><strong>Improved uncertainty estimation for LLMs:</strong> Refining techniques for LLMs to better quantify their uncertainty about generated content.</p></li><li><p><strong>Novel methods for leveraging internal LLM states:</strong> Exploring techniques to analyze internal LLM representations to glean deeper insights into the generation process and identify potential hallucinations.</p></li><li><p><strong>Integration with factual knowledge bases:</strong> Developing frameworks that seamlessly integrate LLM outputs with external knowledge sources to verify factual consistency and enhance detection accuracy.</p></li><li><p><strong>Benchmarking and interpretability:</strong> Establishing standardized benchmarks for evaluating hallucination detection methods and fostering interpretable models that provide clear explanations for their decisions.</p></li></ul><p>By addressing these challenges, we can move towards a future where LLMs are reliable partners in human endeavors, offering creative and informative outputs while minimizing the risk of misleading information. This will be crucial for fostering trust and wider adoption of LLM technology across various domains.</p><div><hr></div><h2>References</h2><p>[1] Guo, Chuan, et al. "On calibration of modern neural networks." <em>International Conference on Machine Learning</em>. PMLR, 2017.</p><p>[2] Huang, Lei, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen et al. "A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions." <em>arXiv preprint arXiv:2311.05232</em> (2023).</p><p>[3] Manakul, Potsawee, Adian Liusie, and Mark JF Gales. "SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models." <em>arXiv preprint arXiv:2303.08896</em> (2023).</p><p>[4] Yadkori, Yasin Abbasi, Ilja Kuzborskij, Andr&#225;s Gy&#246;rgy, and Csaba Szepesv&#225;ri. 
"To Believe or Not to Believe Your LLM." <em>arXiv preprint arXiv:2406.02543</em> (2024).</p><p>[5] Kadavath, Saurav, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer et al. "Language models (mostly) know what they know." <em>arXiv preprint arXiv:2207.05221</em> (2022).</p><p>[6] Azaria, Amos, and Tom Mitchell. "The internal state of an LLM knows when it's lying." <em>arXiv preprint arXiv:2304.13734</em> (2023).</p><p>[7] Chen, Chao, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. "INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection." In <em>The Twelfth International Conference on Learning Representations</em> (2024).</p><p>[8] Ji, Ziwei, Delong Chen, Etsuko Ishii, Samuel Cahyawijaya, Yejin Bang, Bryan Wilie, and Pascale Fung. "LLM Internal States Reveal Hallucination Risk Faced With a Query." <em>arXiv preprint arXiv:2407.03282</em> (2024).</p><p>[9] Chuang, Yung-Sung, Linlu Qiu, Cheng-Yu Hsieh, Ranjay Krishna, Yoon Kim, and James Glass. "Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps." <em>arXiv preprint arXiv:2407.07071</em> (2024).</p>]]></content:encoded></item><item><title><![CDATA[Curious Agents Saga: Part 3 ]]></title><description><![CDATA[Beyond Surprise: Direct and Causal Exploration in Deep Reinforcement Learning]]></description><link>https://hungleai.substack.com/p/curious-agents-saga-part-3</link><guid isPermaLink="false">https://hungleai.substack.com/p/curious-agents-saga-part-3</guid><dc:creator><![CDATA[Hung Le]]></dc:creator><pubDate>Thu, 27 Jun 2024 00:59:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22d485c5-48c4-45bc-b32e-3edec853e5ef_1030x996.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h4>Table of Contents</h4><ul><li><p><a href="https://hungleai.substack.com/i/141333215/reflection-on-intrinsic-motivation">Reflection on Intrinsic Motivation</a></p></li><li><p><a href="https://hungleai.substack.com/i/141333215/direct-exploration">Direct Exploration</a></p><ul><li><p><a href="https://hungleai.substack.com/i/141333215/replay-memory-focused-on-exploration">Replay Memory Focused on Exploration</a></p></li><li><p><a href="https://hungleai.substack.com/i/141333215/performance-based-replay-memory">Performance-based Replay Memory</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/141333215/causal-exploration">Causal Exploration</a></p><ul><li><p><a href="https://hungleai.substack.com/i/141333215/what-is-causality">What is Causality?</a></p></li><li><p><a href="https://hungleai.substack.com/i/141333215/dependency-test">Dependency Test 
</a></p></li><li><p><a href="https://hungleai.substack.com/i/141333215/potential-outcome">Potential Outcome</a> </p></li><li><p><a href="https://hungleai.substack.com/i/141333215/structural-causal-model">Structural Causal Model</a></p></li></ul></li></ul><div><hr></div><h2>Reflection on Intrinsic Motivation</h2><p>In the <a href="https://hungleai.substack.com/p/curious-agents-saga-part-2">previous article</a>, we reviewed an essential exploration framework called <em>intrinsic motivation</em>, which is widely used in deep RL due to its scalability. Within the framework, surprise and novelty are the two signals that drive exploration. For <strong>surprise</strong>, memory is often implicit, hidden within a dynamics model that memorizes observed data to improve its predictions. This type of memory tends to be long-term, semantic, and slow to update, akin to a careful archivist meticulously preserving information.</p><p>On the other hand, <strong>novelty</strong> takes a more straightforward approach to memory. Here memory is explicit, taking the form of a slot-based matrix, a nearest-neighbor estimator, or a simple counter. This memory is typically short-term, instance-based, and highly adaptive to environmental changes, acting more like a dynamic and responsive agent ready to adjust to new inputs swiftly.</p><p>Both approaches start from memory and use surprise or novelty mechanisms to create intrinsic rewards that guide the exploration of the RL agent. 
Although the framework is convenient and easy to use, two major issues hinder its ability to explore effectively:</p><ol><li><p><strong>Detachment:</strong> the agent loses track of interesting areas to explore.</p></li><li><p><strong>Derailment:</strong> the exploration mechanism prevents the agent from returning to and utilizing previously visited states.</p></li></ol><p>&#129504; <em>But what alternatives could overcome these issues?</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gz19!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb8d3e4-b525-4fa2-9844-0fd47b497680_704x612.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gz19!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb8d3e4-b525-4fa2-9844-0fd47b497680_704x612.png 424w, https://substackcdn.com/image/fetch/$s_!Gz19!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb8d3e4-b525-4fa2-9844-0fd47b497680_704x612.png 848w, https://substackcdn.com/image/fetch/$s_!Gz19!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb8d3e4-b525-4fa2-9844-0fd47b497680_704x612.png 1272w, https://substackcdn.com/image/fetch/$s_!Gz19!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb8d3e4-b525-4fa2-9844-0fd47b497680_704x612.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gz19!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb8d3e4-b525-4fa2-9844-0fd47b497680_704x612.png" 
width="480" height="417.27272727272725" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4eb8d3e4-b525-4fa2-9844-0fd47b497680_704x612.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:612,&quot;width&quot;:704,&quot;resizeWidth&quot;:480,&quot;bytes&quot;:70068,&quot;alt&quot;:&quot;Direct Exploration vs Intrinsic Motivation&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Direct Exploration vs Intrinsic Motivation" title="Direct Exploration vs Intrinsic Motivation" srcset="https://substackcdn.com/image/fetch/$s_!Gz19!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb8d3e4-b525-4fa2-9844-0fd47b497680_704x612.png 424w, https://substackcdn.com/image/fetch/$s_!Gz19!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb8d3e4-b525-4fa2-9844-0fd47b497680_704x612.png 848w, https://substackcdn.com/image/fetch/$s_!Gz19!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb8d3e4-b525-4fa2-9844-0fd47b497680_704x612.png 1272w, https://substackcdn.com/image/fetch/$s_!Gz19!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4eb8d3e4-b525-4fa2-9844-0fd47b497680_704x612.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>Direct Exploration</h2><p>In this section, we look at a form of exploration that is more direct and mechanical. While it requires additional complexity, it proves to be quite effective. 
&#129504; <em>What is direct exploration?</em> In short, it is a systematic approach in which an agent returns to previously discovered states and explores new actions and areas from there, which improves efficiency and effectiveness in sparse-reward environments.</p><p>Under this scheme, the role of memory is simplified to:</p><ol><li><p>Storing past states.</p></li><li><p>Retrieving states to explore.</p></li><li><p>Replaying past experiences.</p></li></ol><p>We will refer to this memory as "replay memory" to distinguish it from the semantic and episodic memory concepts.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HJBC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6838cb26-37fc-476b-ab61-94ad02fa51ca_682x261.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HJBC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6838cb26-37fc-476b-ab61-94ad02fa51ca_682x261.gif 424w, https://substackcdn.com/image/fetch/$s_!HJBC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6838cb26-37fc-476b-ab61-94ad02fa51ca_682x261.gif 848w, https://substackcdn.com/image/fetch/$s_!HJBC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6838cb26-37fc-476b-ab61-94ad02fa51ca_682x261.gif 1272w, https://substackcdn.com/image/fetch/$s_!HJBC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6838cb26-37fc-476b-ab61-94ad02fa51ca_682x261.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!HJBC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6838cb26-37fc-476b-ab61-94ad02fa51ca_682x261.gif" width="682" height="261" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6838cb26-37fc-476b-ab61-94ad02fa51ca_682x261.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:261,&quot;width&quot;:682,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:152278,&quot;alt&quot;:&quot;Direct Exploration Mechanism. &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Direct Exploration Mechanism. " title="Direct Exploration Mechanism. " srcset="https://substackcdn.com/image/fetch/$s_!HJBC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6838cb26-37fc-476b-ab61-94ad02fa51ca_682x261.gif 424w, https://substackcdn.com/image/fetch/$s_!HJBC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6838cb26-37fc-476b-ab61-94ad02fa51ca_682x261.gif 848w, https://substackcdn.com/image/fetch/$s_!HJBC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6838cb26-37fc-476b-ab61-94ad02fa51ca_682x261.gif 1272w, https://substackcdn.com/image/fetch/$s_!HJBC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6838cb26-37fc-476b-ab61-94ad02fa51ca_682x261.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Direct Exploration Mechanism. 
</figcaption></figure></div><p>The effectiveness of the direct exploration framework relies heavily on the sampling strategy, which generally falls into one of two categories: exploration-focused or performance-driven.</p><h4>Replay Memory Focused on Exploration</h4><p>One of the pioneering works that paved the way for the direct exploration framework is &#128073;<strong>Go-Explore</strong> [1], which was among the first methods to achieve superhuman scores on hard-exploration Atari games such as Montezuma's Revenge.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iEbQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b106846-1200-4216-8cbf-eae536dd0a09_1238x798.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iEbQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b106846-1200-4216-8cbf-eae536dd0a09_1238x798.png 424w, https://substackcdn.com/image/fetch/$s_!iEbQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b106846-1200-4216-8cbf-eae536dd0a09_1238x798.png 848w, https://substackcdn.com/image/fetch/$s_!iEbQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b106846-1200-4216-8cbf-eae536dd0a09_1238x798.png 1272w, https://substackcdn.com/image/fetch/$s_!iEbQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b106846-1200-4216-8cbf-eae536dd0a09_1238x798.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!iEbQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b106846-1200-4216-8cbf-eae536dd0a09_1238x798.png" width="1238" height="798" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b106846-1200-4216-8cbf-eae536dd0a09_1238x798.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:798,&quot;width&quot;:1238,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120777,&quot;alt&quot;:&quot;Montezuma's Revenge Performance. Humans cannot achieve more than a 600k score. &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Montezuma's Revenge Performance. Humans cannot achieve more than a 600k score. " title="Montezuma's Revenge Performance. Humans cannot achieve more than a 600k score. " srcset="https://substackcdn.com/image/fetch/$s_!iEbQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b106846-1200-4216-8cbf-eae536dd0a09_1238x798.png 424w, https://substackcdn.com/image/fetch/$s_!iEbQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b106846-1200-4216-8cbf-eae536dd0a09_1238x798.png 848w, https://substackcdn.com/image/fetch/$s_!iEbQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b106846-1200-4216-8cbf-eae536dd0a09_1238x798.png 1272w, https://substackcdn.com/image/fetch/$s_!iEbQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b106846-1200-4216-8cbf-eae536dd0a09_1238x798.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Montezuma's Revenge Performance. Humans cannot achieve more than a 600k score. Source: [1]. </figcaption></figure></div><p>According to the paper, memory and simulators offer effective solutions to the challenges of detachment and derailment mentioned earlier. Detachment, where an algorithm loses track of interesting areas to explore, can be mitigated by using memory to group similar states into cells. This approach is akin to a <a href="https://hungleai.substack.com/i/140993309/state-counting">hash count mechanism</a>. Each state is mapped to a cell, and each cell has a score that determines its sampling probability.</p><p>Derailment, on the other hand, occurs when exploratory mechanisms prevent the algorithm from returning to previously visited states. It can be tackled with a simulator, although this only works in environments that can be reset to a saved state, such as Atari emulators. The algorithm samples a cell from memory and resets the simulator to that cell's stored state. 
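</p><p>To make this concrete, here is a minimal sketch of the select, restore, and explore cycle on a toy corridor. It is an illustration only: the <code>Corridor</code> environment, its <code>snapshot</code> and <code>restore</code> methods (standing in for an emulator's save-state feature), the identity state-to-cell mapping, and the 1/sqrt(visits + 1) score are all assumptions made for this example rather than details taken from [1].</p>
<pre>
```python
import random


class Corridor:
    """Toy deterministic environment: walk along positions 0..length."""

    def __init__(self, length=30):
        self.length = length
        self.pos = 0

    def snapshot(self):
        # Stand-in for saving an emulator state (hypothetical API).
        return self.pos

    def restore(self, saved):
        # Stand-in for resetting the emulator to a saved state (hypothetical API).
        self.pos = saved

    def step(self, action):
        # action is -1 (left) or +1 (right); positions are clipped to the corridor.
        self.pos = min(self.length, max(0, self.pos + action))
        return self.pos


def to_cell(state):
    # Hypothetical state-to-cell mapping: each position is its own cell.
    return state


def go_explore_phase1(env, iterations=300, explore_steps=5, seed=0):
    rng = random.Random(seed)
    # Archive: cell -> saved state and number of times the cell was chosen.
    archive = {to_cell(env.pos): {"state": env.snapshot(), "visits": 0}}
    for _ in range(iterations):
        cells = list(archive)
        # Assumed count-based score: rarely chosen cells are more likely to be picked.
        weights = [1.0 / (archive[c]["visits"] + 1) ** 0.5 for c in cells]
        cell = rng.choices(cells, weights=weights, k=1)[0]
        archive[cell]["visits"] += 1
        # "Go": restore the simulator to the chosen cell's saved state.
        env.restore(archive[cell]["state"])
        # "Explore": take a few random actions and archive any new cells.
        for _ in range(explore_steps):
            state = env.step(rng.choice([-1, 1]))
            new_cell = to_cell(state)
            if new_cell not in archive:
                archive[new_cell] = {"state": env.snapshot(), "visits": 0}
    return archive


archive = go_explore_phase1(Corridor())
print(len(archive), "cells discovered; farthest position:", max(archive))
```
</pre>
<p>Because the archive keeps a saved state for every discovered cell, rarely chosen cells at the frontier keep being selected, and the agent can push further along the corridor without having to relearn how to get there.</p><p>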
Throughout the exploration process, the memory is continually updated with new cells, ensuring that the agent remains on track and fully utilizes past experiences.</p><p>The basic workflow extends direct exploration as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A6JM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac6623f9-d9ff-46d0-99a7-c40b84f2120a_955x302.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A6JM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac6623f9-d9ff-46d0-99a7-c40b84f2120a_955x302.gif 424w, https://substackcdn.com/image/fetch/$s_!A6JM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac6623f9-d9ff-46d0-99a7-c40b84f2120a_955x302.gif 848w, https://substackcdn.com/image/fetch/$s_!A6JM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac6623f9-d9ff-46d0-99a7-c40b84f2120a_955x302.gif 1272w, https://substackcdn.com/image/fetch/$s_!A6JM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac6623f9-d9ff-46d0-99a7-c40b84f2120a_955x302.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!A6JM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac6623f9-d9ff-46d0-99a7-c40b84f2120a_955x302.gif" width="955" height="302" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac6623f9-d9ff-46d0-99a7-c40b84f2120a_955x302.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:302,&quot;width&quot;:955,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:225774,&quot;alt&quot;:&quot;Go-Explore Exploration schema (phase 1). &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Go-Explore Exploration schema (phase 1). " title="Go-Explore Exploration schema (phase 1). " srcset="https://substackcdn.com/image/fetch/$s_!A6JM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac6623f9-d9ff-46d0-99a7-c40b84f2120a_955x302.gif 424w, https://substackcdn.com/image/fetch/$s_!A6JM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac6623f9-d9ff-46d0-99a7-c40b84f2120a_955x302.gif 848w, https://substackcdn.com/image/fetch/$s_!A6JM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac6623f9-d9ff-46d0-99a7-c40b84f2120a_955x302.gif 1272w, https://substackcdn.com/image/fetch/$s_!A6JM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac6623f9-d9ff-46d0-99a7-c40b84f2120a_955x302.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Go-Explore Exploration schema (phase 1). </figcaption></figure></div><p>Here, the replay memory stores three crucial elements: the state, the cell corresponding to the state, and the cell's score. The higher a cell's score, the more likely the cell is to be sampled as the starting point for exploration.</p><p>The cell is a compressed representation of the state, which makes storage efficient and the mapping robust to noise. The paper uses a simple engineering trick for the state-to-cell mapping: frames are downscaled, with the downscaling parameters adapted automatically for robustness. It groups recent frames into cells under candidate parameter values and selects the values that optimize a specialized objective. This objective encourages frames to be distributed as uniformly as possible across cells, preventing the issues caused by overly non-uniform distributions. 
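</p><p>As an illustration, a downscaling-based cell function might look like the sketch below. The name <code>frame_to_cell</code>, the block-averaging scheme, and the fixed values of <code>width</code>, <code>height</code>, and <code>levels</code> are assumptions for this example; the paper adapts such parameters automatically instead of fixing them.</p>
<pre>
```python
def frame_to_cell(frame, width=8, height=8, levels=8):
    """Map a grayscale frame (list of rows, values 0..255) to a hashable cell."""
    bh = len(frame) // height      # rows per block
    bw = len(frame[0]) // width    # columns per block
    cell = []
    for i in range(height):
        for j in range(width):
            # Collect the pixels of block (i, j) and average them.
            block = [frame[i * bh + a][j * bw + b]
                     for a in range(bh) for b in range(bw)]
            mean = sum(block) / len(block)
            # Reduce pixel depth: quantise the mean into `levels` bins.
            cell.append(int(mean * levels / 256))
    return tuple(cell)


# Three toy 64x64 frames: a dark one, a slightly noisy copy, and a bright one.
dark = [[100] * 64 for _ in range(64)]
dark_noisy = [[v + 3 if (r + c) % 2 == 0 else v for c, v in enumerate(row)]
              for r, row in enumerate(dark)]
bright = [[200] * 64 for _ in range(64)]

print(frame_to_cell(dark) == frame_to_cell(dark_noisy))  # noise is absorbed
print(frame_to_cell(dark) == frame_to_cell(bright))      # distinct frames differ
```
</pre>
<p>Coarser settings merge more frames into the same cell, while finer settings split them apart, which is exactly the trade-off the objective is meant to balance.</p><p>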
By avoiding both excessive and insufficient aggregation of frames, the approach keeps exploration balanced and tractable.</p><p>At each step, a cell is chosen with probability proportional to its <a href="https://hungleai.substack.com/i/140993309/state-counting">count-based</a> selection weight, which favors rarely visited cells:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3SON!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf85dd38-985d-4383-95b3-75336314fb3f_342x161.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3SON!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf85dd38-985d-4383-95b3-75336314fb3f_342x161.png 424w, https://substackcdn.com/image/fetch/$s_!3SON!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf85dd38-985d-4383-95b3-75336314fb3f_342x161.png 848w, https://substackcdn.com/image/fetch/$s_!3SON!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf85dd38-985d-4383-95b3-75336314fb3f_342x161.png 1272w, https://substackcdn.com/image/fetch/$s_!3SON!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf85dd38-985d-4383-95b3-75336314fb3f_342x161.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3SON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf85dd38-985d-4383-95b3-75336314fb3f_342x161.png" width="208" height="97.91812865497076" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf85dd38-985d-4383-95b3-75336314fb3f_342x161.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:161,&quot;width&quot;:342,&quot;resizeWidth&quot;:208,&quot;bytes&quot;:3805,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3SON!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf85dd38-985d-4383-95b3-75336314fb3f_342x161.png 424w, https://substackcdn.com/image/fetch/$s_!3SON!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf85dd38-985d-4383-95b3-75336314fb3f_342x161.png 848w, https://substackcdn.com/image/fetch/$s_!3SON!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf85dd38-985d-4383-95b3-75336314fb3f_342x161.png 1272w, https://substackcdn.com/image/fetch/$s_!3SON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf85dd38-985d-4383-95b3-75336314fb3f_342x161.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The authors also incorporate domain knowledge by considering the number of horizontal neighbors to the cell present in the memory (h) and assigning a location bonus (k) for each cell (e.g., give a bonus at the location of the key):</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!K99q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc407469a-26d7-44de-8695-62865c2a6659_324x135.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K99q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc407469a-26d7-44de-8695-62865c2a6659_324x135.png 424w, https://substackcdn.com/image/fetch/$s_!K99q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc407469a-26d7-44de-8695-62865c2a6659_324x135.png 848w, https://substackcdn.com/image/fetch/$s_!K99q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc407469a-26d7-44de-8695-62865c2a6659_324x135.png 1272w, https://substackcdn.com/image/fetch/$s_!K99q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc407469a-26d7-44de-8695-62865c2a6659_324x135.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K99q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc407469a-26d7-44de-8695-62865c2a6659_324x135.png" width="230" height="95.83333333333333" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c407469a-26d7-44de-8695-62865c2a6659_324x135.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:135,&quot;width&quot;:324,&quot;resizeWidth&quot;:230,&quot;bytes&quot;:3727,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K99q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc407469a-26d7-44de-8695-62865c2a6659_324x135.png 424w, https://substackcdn.com/image/fetch/$s_!K99q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc407469a-26d7-44de-8695-62865c2a6659_324x135.png 848w, https://substackcdn.com/image/fetch/$s_!K99q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc407469a-26d7-44de-8695-62865c2a6659_324x135.png 1272w, https://substackcdn.com/image/fetch/$s_!K99q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc407469a-26d7-44de-8695-62865c2a6659_324x135.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In addition to the exploration phase, the authors propose a robustification phase (phase 2). The key idea here is to train from demonstrations: the backward algorithm positions the agent near the end of the trajectory. 
Then it runs PPO until the agent's performance matches that of the demonstration, after which the starting point is moved progressively earlier along the trajectory until the agent starts from the beginning.</p><blockquote><p>&#10060; One key limitation of Go-Explore: cell design is not obvious and requires detailed knowledge of the observation space, the dynamics of the environment, and the subsequent task.</p></blockquote><p>To address this limitation, &#128073;<strong>Latent Go-Explore</strong> [2] proposes a direct exploration mechanism that does not require hand-designed cells. It learns a latent representation of states and samples goal states from that latent space. The representation is learned simultaneously with exploration, so the agent's view of the environment is refined as new states are discovered.</p><p>A standout feature of Latent Go-Explore is its goal sampling method, which relies on a non-parametric density model of the latent space. This lets the agent focus on sparsely visited regions, improving exploration efficiency. By replacing simulator resets with goal-based exploration, Latent Go-Explore no longer depends on a resettable environment. 
The workflow consists of 5 steps:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HtnA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc1529c-a09a-4137-b5bf-8d1df1b55423_1137x336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HtnA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc1529c-a09a-4137-b5bf-8d1df1b55423_1137x336.png 424w, https://substackcdn.com/image/fetch/$s_!HtnA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc1529c-a09a-4137-b5bf-8d1df1b55423_1137x336.png 848w, https://substackcdn.com/image/fetch/$s_!HtnA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc1529c-a09a-4137-b5bf-8d1df1b55423_1137x336.png 1272w, https://substackcdn.com/image/fetch/$s_!HtnA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc1529c-a09a-4137-b5bf-8d1df1b55423_1137x336.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HtnA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc1529c-a09a-4137-b5bf-8d1df1b55423_1137x336.png" width="1137" height="336" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5fc1529c-a09a-4137-b5bf-8d1df1b55423_1137x336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:336,&quot;width&quot;:1137,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:145758,&quot;alt&quot;:&quot;Latent Go-Explore framework. Source: [2]&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Latent Go-Explore framework. Source: [2]" title="Latent Go-Explore framework. Source: [2]" srcset="https://substackcdn.com/image/fetch/$s_!HtnA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc1529c-a09a-4137-b5bf-8d1df1b55423_1137x336.png 424w, https://substackcdn.com/image/fetch/$s_!HtnA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc1529c-a09a-4137-b5bf-8d1df1b55423_1137x336.png 848w, https://substackcdn.com/image/fetch/$s_!HtnA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc1529c-a09a-4137-b5bf-8d1df1b55423_1137x336.png 1272w, https://substackcdn.com/image/fetch/$s_!HtnA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc1529c-a09a-4137-b5bf-8d1df1b55423_1137x336.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Latent Go-Explore framework. Source: [2]</figcaption></figure></div><p>Step 2, where the latent representations are learned, is one of the most important. It draws on <a href="https://hungleai.substack.com/i/140993309/predictive-surprise">ICM&#8217;s Inverse Dynamics and Forward Dynamics</a>, which predict the action taken between consecutive states and the next state given an action, respectively, and on the Vector Quantized Variational Autoencoder (VQ-VAE [3]), which learns to compress and reconstruct observations. These techniques are used to form a compact representation of the states.</p><p>In Step 3, density estimation is used to sample goals strategically. To ensure meaningful exploration, goals are selected at the edges of yet unexplored areas. These goals must also be reachable, meaning they have been visited previously. 
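</p><p>One simple way to picture this is a k-nearest-neighbour density estimate over the latents visited so far: a latent whose k-th nearest neighbour is far away lies in a sparsely visited region, so it makes a good frontier goal that is still reachable. The sketch below only illustrates that idea; the function name <code>rank_by_density</code>, the choice of <code>k</code>, and the toy 2D latents are hypothetical and not taken from [2].</p>
<pre>
```python
import math


def rank_by_density(latents, k=3):
    """Return indices of visited latents ordered from sparsest to densest."""
    kth_distance = []
    for i, z in enumerate(latents):
        dists = sorted(math.dist(z, other)
                       for j, other in enumerate(latents) if j != i)
        # Distance to the k-th nearest neighbour: large means low density.
        kth_distance.append(dists[k - 1])
    return sorted(range(len(latents)), key=lambda i: -kth_distance[i])


# Toy latents: a dense cluster near the origin plus one frontier point.
visited = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (0.1, 0.1), (0.05, 0.05), (2.0, 2.0)]
order = rank_by_density(visited)
print("sparsest visited latent:", visited[order[0]])
```
</pre>
<p>Since every candidate comes from the set of visited latents, each goal is reachable by construction, and the ordering puts frontier points first.</p><p>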
Additionally, a particle-based entropy estimator is used to calculate the density score and rank. Given the rank <em>R</em>, the sampling probability is:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3psf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93de6efc-f492-4d97-b53e-53e689edc647_528x143.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3psf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93de6efc-f492-4d97-b53e-53e689edc647_528x143.png 424w, https://substackcdn.com/image/fetch/$s_!3psf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93de6efc-f492-4d97-b53e-53e689edc647_528x143.png 848w, https://substackcdn.com/image/fetch/$s_!3psf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93de6efc-f492-4d97-b53e-53e689edc647_528x143.png 1272w, https://substackcdn.com/image/fetch/$s_!3psf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93de6efc-f492-4d97-b53e-53e689edc647_528x143.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3psf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93de6efc-f492-4d97-b53e-53e689edc647_528x143.png" width="320" height="86.66666666666667" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93de6efc-f492-4d97-b53e-53e689edc647_528x143.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:143,&quot;width&quot;:528,&quot;resizeWidth&quot;:320,&quot;bytes&quot;:12793,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3psf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93de6efc-f492-4d97-b53e-53e689edc647_528x143.png 424w, https://substackcdn.com/image/fetch/$s_!3psf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93de6efc-f492-4d97-b53e-53e689edc647_528x143.png 848w, https://substackcdn.com/image/fetch/$s_!3psf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93de6efc-f492-4d97-b53e-53e689edc647_528x143.png 1272w, https://substackcdn.com/image/fetch/$s_!3psf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93de6efc-f492-4d97-b53e-53e689edc647_528x143.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where the sampling distribution uses geometric law on the rank. <em>p</em> is a hyperparameter. The higher the rank (more dense), the less novel the sample is. </p><p>In Step 4, once the goal is defined, the agent employs goal-conditioned policy training to navigate toward it. 
This means the policy network is trained to produce actions conditioned explicitly on the specified goal state. The direct exploration framework then no longer requires a resettable simulator and can be applied in a much wider range of settings:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!63-A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6849ae2-de38-4bfa-8de8-263b52d0956e_1021x428.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!63-A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6849ae2-de38-4bfa-8de8-263b52d0956e_1021x428.gif 424w, https://substackcdn.com/image/fetch/$s_!63-A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6849ae2-de38-4bfa-8de8-263b52d0956e_1021x428.gif 848w, https://substackcdn.com/image/fetch/$s_!63-A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6849ae2-de38-4bfa-8de8-263b52d0956e_1021x428.gif 1272w, https://substackcdn.com/image/fetch/$s_!63-A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6849ae2-de38-4bfa-8de8-263b52d0956e_1021x428.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!63-A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6849ae2-de38-4bfa-8de8-263b52d0956e_1021x428.gif" width="1021" height="428" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6849ae2-de38-4bfa-8de8-263b52d0956e_1021x428.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:428,&quot;width&quot;:1021,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:207135,&quot;alt&quot;:&quot;Goal-directed policy can replace simulators in the GO phase.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Goal-directed policy can replace simulators in the GO phase." title="Goal-directed policy can replace simulators in the GO phase." srcset="https://substackcdn.com/image/fetch/$s_!63-A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6849ae2-de38-4bfa-8de8-263b52d0956e_1021x428.gif 424w, https://substackcdn.com/image/fetch/$s_!63-A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6849ae2-de38-4bfa-8de8-263b52d0956e_1021x428.gif 848w, https://substackcdn.com/image/fetch/$s_!63-A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6849ae2-de38-4bfa-8de8-263b52d0956e_1021x428.gif 1272w, https://substackcdn.com/image/fetch/$s_!63-A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6849ae2-de38-4bfa-8de8-263b52d0956e_1021x428.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Goal-directed policy can replace simulators in the GO phase.</figcaption></figure></div><p>&#129504; <em>How to train a goal-directed policy? </em>The basic idea is to train a policy to maximize the reach-goal rewards (achiever) and this policy is often jointly trained with another exploration policy (explorer) whose goal is to maximize the exploration rewards such as intrinsic motivation rewards. To reduce the number of interactions with real-world environments, a world model can be used to generate trajectories for the training of the two policies. 
This general goal-conditioned RL (GCRL) framework is illustrated below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2lqR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb52a014-6aae-42b2-a330-4bf480ed7086_672x594.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2lqR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb52a014-6aae-42b2-a330-4bf480ed7086_672x594.png 424w, https://substackcdn.com/image/fetch/$s_!2lqR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb52a014-6aae-42b2-a330-4bf480ed7086_672x594.png 848w, https://substackcdn.com/image/fetch/$s_!2lqR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb52a014-6aae-42b2-a330-4bf480ed7086_672x594.png 1272w, https://substackcdn.com/image/fetch/$s_!2lqR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb52a014-6aae-42b2-a330-4bf480ed7086_672x594.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2lqR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb52a014-6aae-42b2-a330-4bf480ed7086_672x594.png" width="672" height="594" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb52a014-6aae-42b2-a330-4bf480ed7086_672x594.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:594,&quot;width&quot;:672,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65879,&quot;alt&quot;:&quot;GCRL: Achiever Explorer Framework. Goal conditioned Reinforcement Learning&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="GCRL: Achiever Explorer Framework. Goal conditioned Reinforcement Learning" title="GCRL: Achiever Explorer Framework. Goal conditioned Reinforcement Learning" srcset="https://substackcdn.com/image/fetch/$s_!2lqR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb52a014-6aae-42b2-a330-4bf480ed7086_672x594.png 424w, https://substackcdn.com/image/fetch/$s_!2lqR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb52a014-6aae-42b2-a330-4bf480ed7086_672x594.png 848w, https://substackcdn.com/image/fetch/$s_!2lqR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb52a014-6aae-42b2-a330-4bf480ed7086_672x594.png 1272w, https://substackcdn.com/image/fetch/$s_!2lqR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb52a014-6aae-42b2-a330-4bf480ed7086_672x594.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">GCRL: Achiever Explorer Framework. Source: [4].</figcaption></figure></div><p>Given the GCRL framework, the remaining effort is on defining the goal that guides the Go-phase and Explore-phase. Instead of selecting goals at the frontier of previously explored states, as advocated in Latent Go-Explore, we can directly search for goals that promise the highest exploration rewards throughout the exploration phase [5]. 
In particular, the paper formulates a novel objective:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YpYj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e05144f-2496-4e11-852a-f59a7c93eb47_661x222.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YpYj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e05144f-2496-4e11-852a-f59a7c93eb47_661x222.png 424w, https://substackcdn.com/image/fetch/$s_!YpYj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e05144f-2496-4e11-852a-f59a7c93eb47_661x222.png 848w, https://substackcdn.com/image/fetch/$s_!YpYj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e05144f-2496-4e11-852a-f59a7c93eb47_661x222.png 1272w, https://substackcdn.com/image/fetch/$s_!YpYj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e05144f-2496-4e11-852a-f59a7c93eb47_661x222.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YpYj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e05144f-2496-4e11-852a-f59a7c93eb47_661x222.png" width="559" height="187.7428139183056" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e05144f-2496-4e11-852a-f59a7c93eb47_661x222.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:222,&quot;width&quot;:661,&quot;resizeWidth&quot;:559,&quot;bytes&quot;:20527,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YpYj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e05144f-2496-4e11-852a-f59a7c93eb47_661x222.png 424w, https://substackcdn.com/image/fetch/$s_!YpYj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e05144f-2496-4e11-852a-f59a7c93eb47_661x222.png 848w, https://substackcdn.com/image/fetch/$s_!YpYj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e05144f-2496-4e11-852a-f59a7c93eb47_661x222.png 1272w, https://substackcdn.com/image/fetch/$s_!YpYj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e05144f-2496-4e11-852a-f59a7c93eb47_661x222.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here, we search for the goal <em>g </em>that maximizes the expected exploration values from reached goal <em>s<sub>T</sub>. 
</em>The authors simplify the formulation by using approximations:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xqtr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e638690-b5c3-4617-968e-8bda711d036d_704x186.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xqtr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e638690-b5c3-4617-968e-8bda711d036d_704x186.png 424w, https://substackcdn.com/image/fetch/$s_!Xqtr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e638690-b5c3-4617-968e-8bda711d036d_704x186.png 848w, https://substackcdn.com/image/fetch/$s_!Xqtr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e638690-b5c3-4617-968e-8bda711d036d_704x186.png 1272w, https://substackcdn.com/image/fetch/$s_!Xqtr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e638690-b5c3-4617-968e-8bda711d036d_704x186.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xqtr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e638690-b5c3-4617-968e-8bda711d036d_704x186.png" width="704" height="186" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e638690-b5c3-4617-968e-8bda711d036d_704x186.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:186,&quot;width&quot;:704,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42299,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xqtr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e638690-b5c3-4617-968e-8bda711d036d_704x186.png 424w, https://substackcdn.com/image/fetch/$s_!Xqtr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e638690-b5c3-4617-968e-8bda711d036d_704x186.png 848w, https://substackcdn.com/image/fetch/$s_!Xqtr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e638690-b5c3-4617-968e-8bda711d036d_704x186.png 1272w, https://substackcdn.com/image/fetch/$s_!Xqtr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e638690-b5c3-4617-968e-8bda711d036d_704x186.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7CQ5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7512c0e-a4dd-4bfa-a88c-6768f201ce8c_795x165.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7CQ5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7512c0e-a4dd-4bfa-a88c-6768f201ce8c_795x165.png 424w, https://substackcdn.com/image/fetch/$s_!7CQ5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7512c0e-a4dd-4bfa-a88c-6768f201ce8c_795x165.png 848w, https://substackcdn.com/image/fetch/$s_!7CQ5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7512c0e-a4dd-4bfa-a88c-6768f201ce8c_795x165.png 1272w, https://substackcdn.com/image/fetch/$s_!7CQ5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7512c0e-a4dd-4bfa-a88c-6768f201ce8c_795x165.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7CQ5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7512c0e-a4dd-4bfa-a88c-6768f201ce8c_795x165.png" width="795" height="165" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7512c0e-a4dd-4bfa-a88c-6768f201ce8c_795x165.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:165,&quot;width&quot;:795,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26246,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!7CQ5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7512c0e-a4dd-4bfa-a88c-6768f201ce8c_795x165.png 424w, https://substackcdn.com/image/fetch/$s_!7CQ5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7512c0e-a4dd-4bfa-a88c-6768f201ce8c_795x165.png 848w, https://substackcdn.com/image/fetch/$s_!7CQ5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7512c0e-a4dd-4bfa-a88c-6768f201ce8c_795x165.png 1272w, https://substackcdn.com/image/fetch/$s_!7CQ5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7512c0e-a4dd-4bfa-a88c-6768f201ce8c_795x165.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The optimization process starts by randomly sampling <em>N</em> goal candidates (<em>g<sub>k</sub></em>) from an initial distribution. Each candidate is evaluated using the world model, which yields an approximate exploration value <em>V<sub>k</sub></em>. 
After scoring each goal candidate, a Gaussian distribution is fitted based on the following rule:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dnpc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b44f7d0-b7b2-4051-9f24-5c4f3cc4c613_339x142.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dnpc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b44f7d0-b7b2-4051-9f24-5c4f3cc4c613_339x142.png 424w, https://substackcdn.com/image/fetch/$s_!dnpc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b44f7d0-b7b2-4051-9f24-5c4f3cc4c613_339x142.png 848w, https://substackcdn.com/image/fetch/$s_!dnpc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b44f7d0-b7b2-4051-9f24-5c4f3cc4c613_339x142.png 1272w, https://substackcdn.com/image/fetch/$s_!dnpc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b44f7d0-b7b2-4051-9f24-5c4f3cc4c613_339x142.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dnpc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b44f7d0-b7b2-4051-9f24-5c4f3cc4c613_339x142.png" width="339" height="142" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b44f7d0-b7b2-4051-9f24-5c4f3cc4c613_339x142.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:142,&quot;width&quot;:339,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:16452,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dnpc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b44f7d0-b7b2-4051-9f24-5c4f3cc4c613_339x142.png 424w, https://substackcdn.com/image/fetch/$s_!dnpc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b44f7d0-b7b2-4051-9f24-5c4f3cc4c613_339x142.png 848w, https://substackcdn.com/image/fetch/$s_!dnpc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b44f7d0-b7b2-4051-9f24-5c4f3cc4c613_339x142.png 1272w, https://substackcdn.com/image/fetch/$s_!dnpc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b44f7d0-b7b2-4051-9f24-5c4f3cc4c613_339x142.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>which then will be used for the next sampling iterations. 
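</p><p>The sample, score, and refit loop described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: <em>toy_exploration_value</em> is a hypothetical stand-in for the world-model rollout and exploration value used in the paper, the initial goal distribution is assumed to be a standard normal, and the exponentially weighted refit is one common choice, not necessarily the exact rule shown in the figure above.</p>

```python
import numpy as np


def toy_exploration_value(goals):
    # Hypothetical stand-in for "roll out in the world model, then score the
    # exploration value": goals closer to the point (2, -1) score higher.
    target = np.array([2.0, -1.0])
    return -np.sum((goals - target) ** 2, axis=1)


def plan_exploratory_goal(value_fn, dim=2, n_candidates=256, n_iters=10,
                          temperature=1.0, seed=0):
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)  # assumed initial goal distribution: standard normal
    std = np.ones(dim)
    for _ in range(n_iters):
        # Sample N goal candidates g_k from the current Gaussian.
        goals = mean + std * rng.standard_normal((n_candidates, dim))
        # Score each candidate with its approximate exploration value V_k.
        values = value_fn(goals)
        # Convert scores to normalized weights (shifted by the max for stability).
        weights = np.exp((values - values.max()) / temperature)
        weights /= weights.sum()
        # Refit the Gaussian to the weighted candidates for the next iteration.
        mean = weights @ goals
        std = np.sqrt(weights @ (goals - mean) ** 2) + 1e-6
    return mean


best_goal = plan_exploratory_goal(toy_exploration_value)
print(np.round(best_goal, 2))
```

<p>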
The method is called            &#128073;<strong>Planning Exploratory Goals (PEG)</strong> and has shown promising results, evidenced by a vast range of exploration and higher goal-reach success rates:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6uUZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe66aacfa-6a3d-4048-8bd9-0a7715c4386a_800x431.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6uUZ!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe66aacfa-6a3d-4048-8bd9-0a7715c4386a_800x431.gif 424w, https://substackcdn.com/image/fetch/$s_!6uUZ!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe66aacfa-6a3d-4048-8bd9-0a7715c4386a_800x431.gif 848w, https://substackcdn.com/image/fetch/$s_!6uUZ!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe66aacfa-6a3d-4048-8bd9-0a7715c4386a_800x431.gif 1272w, https://substackcdn.com/image/fetch/$s_!6uUZ!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe66aacfa-6a3d-4048-8bd9-0a7715c4386a_800x431.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6uUZ!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe66aacfa-6a3d-4048-8bd9-0a7715c4386a_800x431.gif" width="800" height="431" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e66aacfa-6a3d-4048-8bd9-0a7715c4386a_800x431.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:431,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;PEG Performance in Mujoco, maze exploration, and robotic hand environments. &quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="PEG Performance in Mujoco, maze exploration, and robotic hand environments. " title="PEG Performance in Mujoco, maze exploration, and robotic hand environments. " srcset="https://substackcdn.com/image/fetch/$s_!6uUZ!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe66aacfa-6a3d-4048-8bd9-0a7715c4386a_800x431.gif 424w, https://substackcdn.com/image/fetch/$s_!6uUZ!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe66aacfa-6a3d-4048-8bd9-0a7715c4386a_800x431.gif 848w, https://substackcdn.com/image/fetch/$s_!6uUZ!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe66aacfa-6a3d-4048-8bd9-0a7715c4386a_800x431.gif 1272w, https://substackcdn.com/image/fetch/$s_!6uUZ!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe66aacfa-6a3d-4048-8bd9-0a7715c4386a_800x431.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">PEG Performance in MuJoCo, maze exploration, and robotic hand environments. <a href="https://github.com/penn-pal-lab/peg">Source</a></figcaption></figure></div><p></p><h4>Performance-based Replay Memory</h4><p>The strategy for sampling goals can either encourage exploration, as discussed in previous papers, or directly target task performance. In the latter case, the aim is to sample states that lead to higher task returns. </p><p>One early and straightforward concept is &#128073; <strong>Self-imitation Learning</strong> [6]. This idea centers on leveraging past successful experiences to improve the agent's learning process. By imitating its own past good decisions, the agent ensures that effective strategies are reinforced and repeated. 
A key component of this approach is memory, which acts as a replay buffer to store the agent's previous experiences. This replay buffer allows the agent to revisit and learn from past interactions, thereby making more informed decisions over time.</p><p>In practice, the agent selectively learns to imitate state-action pairs from the replay buffer when the return from a past episode exceeds its current value estimate. This performance-based sampling prioritizes high-reward strategies, ensuring that the agent focuses on actions that have previously yielded better results. Specifically, if the past return is greater than the agent&#8217;s value estimate (R &gt; V<sub>&#952;</sub>), the agent is trained to replicate the same action in similar future states. This strategy steers the agent toward high-performing actions. </p><blockquote><p>&#128064; It is important to note that performance-based sampling in this context is used to select data for training the policy, unlike direct exploration techniques that sample goals to initiate exploration. Despite this difference, the overall effect remains quite similar.</p></blockquote><p>Recent works advocate for hybrid solutions that combine performance and novelty in goal-based exploration, leveraging replay memory to enhance the agent's ability to discover novel states. The principle is that by editing or augmenting trajectories stored from past experiences, the agent can generate new paths that lead to previously unvisited areas, thus initiating better exploration. </p><p>In the &#128073;<strong>Diverse Trajectory-conditioned Self-Imitation Learning (DTSIL)</strong> paper [7], this is achieved through a sequence-to-sequence model equipped with an attention mechanism, which effectively &#8216;translates&#8217; a demonstration trajectory into a sequence of actions, forming a new trajectory. 
If the ending of the newly generated trajectory differs significantly from previous ones, it is inserted into the memory; otherwise, it replaces the stored trajectory with a similar ending only if it offers a higher return.</p><p>The agent samples these trajectories using a <a href="https://hungleai.substack.com/i/140993309/state-counting">count-based</a> novelty approach to ensure a diverse exploration of the state space. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E_bx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8413edb3-2f6f-4ac2-957e-3dacc5ed593d_995x529.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E_bx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8413edb3-2f6f-4ac2-957e-3dacc5ed593d_995x529.png 424w, https://substackcdn.com/image/fetch/$s_!E_bx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8413edb3-2f6f-4ac2-957e-3dacc5ed593d_995x529.png 848w, https://substackcdn.com/image/fetch/$s_!E_bx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8413edb3-2f6f-4ac2-957e-3dacc5ed593d_995x529.png 1272w, https://substackcdn.com/image/fetch/$s_!E_bx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8413edb3-2f6f-4ac2-957e-3dacc5ed593d_995x529.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E_bx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8413edb3-2f6f-4ac2-957e-3dacc5ed593d_995x529.png" width="995" 
height="529" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8413edb3-2f6f-4ac2-957e-3dacc5ed593d_995x529.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:529,&quot;width&quot;:995,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:214229,&quot;alt&quot;:&quot;Diverse Trajectory-conditioned Self-Imitation Learning&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Diverse Trajectory-conditioned Self-Imitation Learning" title="Diverse Trajectory-conditioned Self-Imitation Learning" srcset="https://substackcdn.com/image/fetch/$s_!E_bx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8413edb3-2f6f-4ac2-957e-3dacc5ed593d_995x529.png 424w, https://substackcdn.com/image/fetch/$s_!E_bx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8413edb3-2f6f-4ac2-957e-3dacc5ed593d_995x529.png 848w, https://substackcdn.com/image/fetch/$s_!E_bx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8413edb3-2f6f-4ac2-957e-3dacc5ed593d_995x529.png 1272w, https://substackcdn.com/image/fetch/$s_!E_bx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8413edb3-2f6f-4ac2-957e-3dacc5ed593d_995x529.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">DTSIL Framework. Source: [7]. </figcaption></figure></div><p>To optimize its learning, the agent is trained with a trajectory-conditioned policy network <em>&#960;<sub>&#952;</sub>(a<sub>t</sub>|e<sub>&#8804;t</sub>,o<sub>t</sub>, g)</em>, which allows it to flexibly imitate any given trajectory <em>g</em> sampled from the buffer. The network is given a demonstration trajectory <em>g</em> and recursively predicts the next actions to imitate <em>g</em>. To facilitate the imitation process, a reward of 0.1 is assigned if the agent's state closely resembles the demonstrated one. 
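</p><p>A minimal sketch of this imitation reward, assuming states are small tuples of floats and that "closely resembles" means matching the next demonstration state within a tolerance. The function name imitation_reward, the tol parameter, and the strict next-state matching are illustrative assumptions, not necessarily the paper's exact rule.</p>

```python
import math

IMITATION_REWARD = 0.1  # value quoted in the text above


def imitation_reward(agent_state, demo_states, next_idx, tol=1e-6):
    """Return (reward, updated_next_idx) for one step of following a demonstration.

    demo_states is the demonstration g as a list of state embeddings, and
    next_idx points at the demonstration state the agent should reach next.
    All names here are hypothetical and chosen for illustration only.
    """
    if next_idx == len(demo_states):
        # Demonstration already completed: no imitation reward is given.
        return 0.0, next_idx
    target = demo_states[next_idx]
    # "Closely resembles" is modelled as element-wise closeness within tol.
    matched = len(agent_state) == len(target) and all(
        math.isclose(x, y, abs_tol=tol) for x, y in zip(agent_state, target)
    )
    if matched:
        return IMITATION_REWARD, next_idx + 1
    return 0.0, next_idx


demo = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print(imitation_reward((1.0, 0.0), demo, 1))
```

<p>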
Further imitation encouragement is applied by introducing the action imitation loss:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Wxi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F857a97cc-ef24-4722-b268-36c37f46e2de_802x68.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5Wxi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F857a97cc-ef24-4722-b268-36c37f46e2de_802x68.png 424w, https://substackcdn.com/image/fetch/$s_!5Wxi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F857a97cc-ef24-4722-b268-36c37f46e2de_802x68.png 848w, https://substackcdn.com/image/fetch/$s_!5Wxi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F857a97cc-ef24-4722-b268-36c37f46e2de_802x68.png 1272w, https://substackcdn.com/image/fetch/$s_!5Wxi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F857a97cc-ef24-4722-b268-36c37f46e2de_802x68.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5Wxi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F857a97cc-ef24-4722-b268-36c37f46e2de_802x68.png" width="564" height="47.82044887780549" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/857a97cc-ef24-4722-b268-36c37f46e2de_802x68.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:68,&quot;width&quot;:802,&quot;resizeWidth&quot;:564,&quot;bytes&quot;:6716,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5Wxi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F857a97cc-ef24-4722-b268-36c37f46e2de_802x68.png 424w, https://substackcdn.com/image/fetch/$s_!5Wxi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F857a97cc-ef24-4722-b268-36c37f46e2de_802x68.png 848w, https://substackcdn.com/image/fetch/$s_!5Wxi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F857a97cc-ef24-4722-b268-36c37f46e2de_802x68.png 1272w, https://substackcdn.com/image/fetch/$s_!5Wxi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F857a97cc-ef24-4722-b268-36c37f46e2de_802x68.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>These imitation strategies ensure that the agent can replicate effective paths while still adapting to new scenarios due to changes in current observations.</p><p>After completing the last (non-terminal) state in the demonstration, the agent is encouraged to perform random exploration by assigning a reward of zero. 
This encourages the agent to deviate from known paths and explore new possibilities, thus enhancing its overall learning and adaptation capabilities.</p><div><hr></div><h2>Causal Exploration</h2><p>Imagine an RL agent tasked with picking up a key to open a door in a large room. Unfortunately, the agent spends most of its time wandering around the door but can't open it. If the agent understood that it needed the key to open the door, it wouldn't waste time near the door before obtaining the key. This scenario illustrates that better exploration requires an understanding of cause and effect, which can significantly improve the efficiency of the exploration process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZAdn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22d485c5-48c4-45bc-b32e-3edec853e5ef_1030x996.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZAdn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22d485c5-48c4-45bc-b32e-3edec853e5ef_1030x996.png 424w, https://substackcdn.com/image/fetch/$s_!ZAdn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22d485c5-48c4-45bc-b32e-3edec853e5ef_1030x996.png 848w, https://substackcdn.com/image/fetch/$s_!ZAdn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22d485c5-48c4-45bc-b32e-3edec853e5ef_1030x996.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ZAdn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22d485c5-48c4-45bc-b32e-3edec853e5ef_1030x996.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZAdn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22d485c5-48c4-45bc-b32e-3edec853e5ef_1030x996.png" width="438" height="423.54174757281555" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22d485c5-48c4-45bc-b32e-3edec853e5ef_1030x996.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:996,&quot;width&quot;:1030,&quot;resizeWidth&quot;:438,&quot;bytes&quot;:1711442,&quot;alt&quot;:&quot;An agent finds a key to open the door. Source: generated by DALL&#183;E 3. &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="An agent finds a key to open the door. Source: generated by DALL&#183;E 3. " title="An agent finds a key to open the door. Source: generated by DALL&#183;E 3. 
" srcset="https://substackcdn.com/image/fetch/$s_!ZAdn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22d485c5-48c4-45bc-b32e-3edec853e5ef_1030x996.png 424w, https://substackcdn.com/image/fetch/$s_!ZAdn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22d485c5-48c4-45bc-b32e-3edec853e5ef_1030x996.png 848w, https://substackcdn.com/image/fetch/$s_!ZAdn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22d485c5-48c4-45bc-b32e-3edec853e5ef_1030x996.png 1272w, https://substackcdn.com/image/fetch/$s_!ZAdn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22d485c5-48c4-45bc-b32e-3edec853e5ef_1030x996.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An agent finds a key to open the door. Source: generated by DALL&#183;E 3. </figcaption></figure></div><p></p><h4>What is Causality?</h4><p>Causality concerns the relationship between cause and effect, and how we can identify and infer such relationships. In the context of reinforcement learning, certain actions lead to high or low rewards, such as picking up a key versus not picking it up. The state, a measurable context variable, influences both the action and the reward; for instance, the agent's current viewport, as the state, determines which actions the agent can take and which rewards it can receive. Moreover, there are confounding variables (U): unknown factors that affect both the action and the reward, such as the location of a door that lies outside the viewport. 
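</p><p>To see why a confounder like U matters, here is a toy simulation, offered as a sketch under made-up numbers rather than anything taken from the cited papers. A hidden binary U influences both the action and the reward, the true effect of the action on the reward is fixed at 1.0, and the state S is left out for brevity. The function name mean_reward_gap and all probabilities are assumptions for illustration.</p>

```python
import random


def mean_reward_gap(n=20000, intervene=False, seed=0):
    """Estimate E[R | A=1] - E[R | A=0] in a toy model with a hidden confounder U."""
    rng = random.Random(seed)
    totals = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for _ in range(n):
        u = rng.choice([0, 1])          # hidden U, e.g. whether the door is nearby
        if intervene:
            a = rng.choice([0, 1])      # action set independently of U
        else:
            p = 0.9 if u else 0.1       # behaviour is influenced by U
            a = rng.choices([0, 1], weights=[1 - p, p])[0]
        r = 1.0 * a + 2.0 * u           # true effect of A on R is exactly 1.0
        totals[a] += r
        counts[a] += 1
    return totals[1] / counts[1] - totals[0] / counts[0]


print("observed gap:  ", round(mean_reward_gap(intervene=False), 2))
print("intervened gap:", round(mean_reward_gap(intervene=True), 2))
```

<p>In this toy model the observed gap overstates the action's effect (about 2.6 in expectation) because U pushes the action and the reward in the same direction, while choosing the action independently of U recovers a gap close to the true value of 1.0.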
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5cXS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfbc033-c11b-4f9c-9ace-629f3bdc6af9_382x304.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5cXS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfbc033-c11b-4f9c-9ace-629f3bdc6af9_382x304.gif 424w, https://substackcdn.com/image/fetch/$s_!5cXS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfbc033-c11b-4f9c-9ace-629f3bdc6af9_382x304.gif 848w, https://substackcdn.com/image/fetch/$s_!5cXS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfbc033-c11b-4f9c-9ace-629f3bdc6af9_382x304.gif 1272w, https://substackcdn.com/image/fetch/$s_!5cXS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfbc033-c11b-4f9c-9ace-629f3bdc6af9_382x304.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5cXS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfbc033-c11b-4f9c-9ace-629f3bdc6af9_382x304.gif" width="382" height="304" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbfbc033-c11b-4f9c-9ace-629f3bdc6af9_382x304.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:304,&quot;width&quot;:382,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80331,&quot;alt&quot;:&quot;Causal relationship State (S), 
Action (A), Reward (R), and Unknown (U). &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Causal relationship State (S), Action (A), Reward (R), and Unknown (U). " title="Causal relationship State (S), Action (A), Reward (R), and Unknown (U). " srcset="https://substackcdn.com/image/fetch/$s_!5cXS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfbc033-c11b-4f9c-9ace-629f3bdc6af9_382x304.gif 424w, https://substackcdn.com/image/fetch/$s_!5cXS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfbc033-c11b-4f9c-9ace-629f3bdc6af9_382x304.gif 848w, https://substackcdn.com/image/fetch/$s_!5cXS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfbc033-c11b-4f9c-9ace-629f3bdc6af9_382x304.gif 1272w, https://substackcdn.com/image/fetch/$s_!5cXS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfbc033-c11b-4f9c-9ace-629f3bdc6af9_382x304.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Causal relationships between State (S), Action (A), Reward (R), and Unknown (U). </figcaption></figure></div><p>The <a href="https://en.wikipedia.org/wiki/Causality_(book)">Structural Causal Model (SCM)</a> framework, introduced by Judea Pearl, provides a structured approach to modeling causal relationships, using causal graphs to represent the relationships between variables both visually and mathematically. 
Formally, the relationship forms a causal graph <em>G={V,E} </em>where each vertex or node V<sub>i</sub> is defined by:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Pgm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01b4291e-6dfe-4ff9-994d-d800809ccc73_839x74.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Pgm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01b4291e-6dfe-4ff9-994d-d800809ccc73_839x74.png 424w, https://substackcdn.com/image/fetch/$s_!4Pgm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01b4291e-6dfe-4ff9-994d-d800809ccc73_839x74.png 848w, https://substackcdn.com/image/fetch/$s_!4Pgm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01b4291e-6dfe-4ff9-994d-d800809ccc73_839x74.png 1272w, https://substackcdn.com/image/fetch/$s_!4Pgm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01b4291e-6dfe-4ff9-994d-d800809ccc73_839x74.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Pgm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01b4291e-6dfe-4ff9-994d-d800809ccc73_839x74.png" width="452" height="39.86650774731824" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01b4291e-6dfe-4ff9-994d-d800809ccc73_839x74.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:74,&quot;width&quot;:839,&quot;resizeWidth&quot;:452,&quot;bytes&quot;:8233,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Pgm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01b4291e-6dfe-4ff9-994d-d800809ccc73_839x74.png 424w, https://substackcdn.com/image/fetch/$s_!4Pgm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01b4291e-6dfe-4ff9-994d-d800809ccc73_839x74.png 848w, https://substackcdn.com/image/fetch/$s_!4Pgm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01b4291e-6dfe-4ff9-994d-d800809ccc73_839x74.png 1272w, https://substackcdn.com/image/fetch/$s_!4Pgm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01b4291e-6dfe-4ff9-994d-d800809ccc73_839x74.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where PA denotes the set of parent nodes. Understanding these relationships is crucial for the agent to make more informed decisions and optimize its actions for better outcomes. However, the SCM is usually unknown to the agent. 
As a result, the agent may attempt to open the door before searching for the key. This calls for causal discovery: recovering the SCM from observational data. Concretely, given a set of nodes, we need to find:</p><ul><li><p>A model of the unknown variables <em>U<sub>i</sub></em></p></li><li><p>The edges between the nodes, parameterized by a structural parameter <em>&#951;</em> (e.g., an adjacency matrix)</p></li><li><p>The relation functions <em>f<sub>i</sub></em>, parameterized by a functional parameter <em>&#948;</em> (e.g., a neural network)</p></li></ul><p>Directly searching for the SCM typically does not scale, because causal graphs in RL are complex. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oxme!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206335e2-bea5-4aad-8ed7-982a8eece629_883x761.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oxme!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206335e2-bea5-4aad-8ed7-982a8eece629_883x761.png 424w, https://substackcdn.com/image/fetch/$s_!oxme!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206335e2-bea5-4aad-8ed7-982a8eece629_883x761.png 848w, https://substackcdn.com/image/fetch/$s_!oxme!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206335e2-bea5-4aad-8ed7-982a8eece629_883x761.png 1272w, https://substackcdn.com/image/fetch/$s_!oxme!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206335e2-bea5-4aad-8ed7-982a8eece629_883x761.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oxme!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206335e2-bea5-4aad-8ed7-982a8eece629_883x761.png" width="883" height="761" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/206335e2-bea5-4aad-8ed7-982a8eece629_883x761.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:761,&quot;width&quot;:883,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:104680,&quot;alt&quot;:&quot;Even a simple door-key environment can have super complex causal graphs. &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Even a simple door-key environment can have super complex causal graphs. " title="Even a simple door-key environment can have super complex causal graphs. 
" srcset="https://substackcdn.com/image/fetch/$s_!oxme!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206335e2-bea5-4aad-8ed7-982a8eece629_883x761.png 424w, https://substackcdn.com/image/fetch/$s_!oxme!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206335e2-bea5-4aad-8ed7-982a8eece629_883x761.png 848w, https://substackcdn.com/image/fetch/$s_!oxme!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206335e2-bea5-4aad-8ed7-982a8eece629_883x761.png 1272w, https://substackcdn.com/image/fetch/$s_!oxme!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F206335e2-bea5-4aad-8ed7-982a8eece629_883x761.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Even a simple door-key environment can have a highly complex causal graph. </figcaption></figure></div><blockquote><p>&#128064; It is critical to find efficient alternative or approximate methods to discover causal relationships and leverage them to the agent's exploratory advantage.</p></blockquote><p></p><h4>Dependency Test</h4><p>One way to simplify causal discovery is to remove the unknown variable <em>U</em> from the graph and focus on a one-step transition graph, as proposed in [8]:</p><ul><li><p>The state S is decomposed into N components (N entities)</p></li><li><p>Nodes:<em> V = {S<sub>1</sub>, . . . , S<sub>N</sub> , A, S<sub>1</sub>&#8217; , . . . 
, S<sub>N</sub>&#8217;}</em></p></li><li><p>Edges:&nbsp;</p><ul><li><p>From <em>S</em> to <em>S&#8217;</em>: dynamics function</p></li><li><p>From <em>S</em> to <em>A</em>: policy function</p></li><li><p>From <em>A</em> to <em>S&#8217;</em>: action influence</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gUN7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aab8b4-dab4-4b99-8b1a-ed6ae7b1145c_923x372.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gUN7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aab8b4-dab4-4b99-8b1a-ed6ae7b1145c_923x372.png 424w, https://substackcdn.com/image/fetch/$s_!gUN7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aab8b4-dab4-4b99-8b1a-ed6ae7b1145c_923x372.png 848w, https://substackcdn.com/image/fetch/$s_!gUN7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aab8b4-dab4-4b99-8b1a-ed6ae7b1145c_923x372.png 1272w, https://substackcdn.com/image/fetch/$s_!gUN7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aab8b4-dab4-4b99-8b1a-ed6ae7b1145c_923x372.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gUN7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aab8b4-dab4-4b99-8b1a-ed6ae7b1145c_923x372.png" width="923" height="372" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23aab8b4-dab4-4b99-8b1a-ed6ae7b1145c_923x372.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:372,&quot;width&quot;:923,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Causal Action Influence Framework. &quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Causal Action Influence Framework. " title="Causal Action Influence Framework. " srcset="https://substackcdn.com/image/fetch/$s_!gUN7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aab8b4-dab4-4b99-8b1a-ed6ae7b1145c_923x372.png 424w, https://substackcdn.com/image/fetch/$s_!gUN7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aab8b4-dab4-4b99-8b1a-ed6ae7b1145c_923x372.png 848w, https://substackcdn.com/image/fetch/$s_!gUN7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aab8b4-dab4-4b99-8b1a-ed6ae7b1145c_923x372.png 1272w, https://substackcdn.com/image/fetch/$s_!gUN7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aab8b4-dab4-4b99-8b1a-ed6ae7b1145c_923x372.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Causal Action Influence Framework. Source: [8]</figcaption></figure></div><p>Under the assumption of local causal mechanisms, the influence of an action tends to be localized and sparse. Specifically, for a given situation <em>S=s</em>, a particular influence within the local causal graph <em>G<sub>S=s</sub></em> may or may not exist. For instance, when the robot arm is far from the object, the action (<em>A</em>) primarily affects the state of the robot arm (<em>S<sub>1</sub></em>), rather than the state of the object (<em>S<sub>2</sub></em>). 
This context dependence means that the presence or absence of an influence has to be assessed state by state, not once for the whole environment.</p><p>Under this assumption, causal discovery reduces to determining whether the agent is in control of the state of entity <em>S<sup>j</sup></em> given the current state <em>s</em>. This is the idea behind the &#128073; <strong>Causal Action Influence</strong> framework [8]. Control can be verified through a dependence test, specifically using Conditional Mutual Information (CMI). In particular, we measure the influence at state <em>s</em> as:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Enwb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4835aaa4-2ce2-4357-9b92-8f54b82e0d97_726x332.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Enwb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4835aaa4-2ce2-4357-9b92-8f54b82e0d97_726x332.png 424w, https://substackcdn.com/image/fetch/$s_!Enwb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4835aaa4-2ce2-4357-9b92-8f54b82e0d97_726x332.png 848w, https://substackcdn.com/image/fetch/$s_!Enwb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4835aaa4-2ce2-4357-9b92-8f54b82e0d97_726x332.png 1272w, https://substackcdn.com/image/fetch/$s_!Enwb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4835aaa4-2ce2-4357-9b92-8f54b82e0d97_726x332.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Enwb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4835aaa4-2ce2-4357-9b92-8f54b82e0d97_726x332.png" width="726" height="332" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4835aaa4-2ce2-4357-9b92-8f54b82e0d97_726x332.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:332,&quot;width&quot;:726,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Enwb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4835aaa4-2ce2-4357-9b92-8f54b82e0d97_726x332.png 424w, https://substackcdn.com/image/fetch/$s_!Enwb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4835aaa4-2ce2-4357-9b92-8f54b82e0d97_726x332.png 848w, https://substackcdn.com/image/fetch/$s_!Enwb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4835aaa4-2ce2-4357-9b92-8f54b82e0d97_726x332.png 1272w, https://substackcdn.com/image/fetch/$s_!Enwb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4835aaa4-2ce2-4357-9b92-8f54b82e0d97_726x332.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This quantity can be approximated as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7Oh-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e6bd7a-1f60-4ce6-a1f1-750fa2e9a8a5_562x151.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7Oh-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e6bd7a-1f60-4ce6-a1f1-750fa2e9a8a5_562x151.png 424w, 
https://substackcdn.com/image/fetch/$s_!7Oh-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e6bd7a-1f60-4ce6-a1f1-750fa2e9a8a5_562x151.png 848w, https://substackcdn.com/image/fetch/$s_!7Oh-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e6bd7a-1f60-4ce6-a1f1-750fa2e9a8a5_562x151.png 1272w, https://substackcdn.com/image/fetch/$s_!7Oh-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e6bd7a-1f60-4ce6-a1f1-750fa2e9a8a5_562x151.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7Oh-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e6bd7a-1f60-4ce6-a1f1-750fa2e9a8a5_562x151.png" width="562" height="151" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17e6bd7a-1f60-4ce6-a1f1-750fa2e9a8a5_562x151.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:151,&quot;width&quot;:562,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:24609,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7Oh-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e6bd7a-1f60-4ce6-a1f1-750fa2e9a8a5_562x151.png 424w, 
https://substackcdn.com/image/fetch/$s_!7Oh-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e6bd7a-1f60-4ce6-a1f1-750fa2e9a8a5_562x151.png 848w, https://substackcdn.com/image/fetch/$s_!7Oh-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e6bd7a-1f60-4ce6-a1f1-750fa2e9a8a5_562x151.png 1272w, https://substackcdn.com/image/fetch/$s_!7Oh-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e6bd7a-1f60-4ce6-a1f1-750fa2e9a8a5_562x151.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Neural networks can be used to estimate <em>&#952;</em>. In practice, since we are primarily concerned with the influence of the state <em>s</em> on a single final goal <em>S<sup>j</sup></em>, we use this influence score to characterize the state. The score can then serve as an <a href="https://hungleai.substack.com/p/curious-agents-saga-part-2">intrinsic reward</a>, encouraging the agent to visit highly influential states.</p><h4>Potential Outcome</h4><p>In causal discovery, we generally need to address the question: &#129504; <em>When is an agent&#8217;s action &#8220;A = a&#8221; the cause of an outcome &#8220;B = b&#8221;?</em> Besides a dependence test, we can answer this question with the potential outcome framework, which contrasts the actual outcome with a hypothetical scenario where &#8220;A = a&#8221; did not occur. This approach requires knowledge of the normative or counterfactual world to make accurate assessments.</p><p>One paper in this line of work studies effect disentanglement on the transition sub-graph, focusing on &#128073; <strong>controllable effects</strong> [9]. 
It examines how the next state <em>S&#8217;</em> depends on both the action <em>A</em> (which encodes the controllable effect) and the previous state <em>S</em> (which encodes the non-action or dynamic effect). The central question is: &#129504; <em>is the next state primarily caused by the action taken, or is it merely coincidental?</em> Here, <em>S</em> acts as a known confounder, a common scenario in stochastic or multi-agent environments.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Shpo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40690501-bf69-4fe6-9d53-27f1db518442_648x292.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Shpo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40690501-bf69-4fe6-9d53-27f1db518442_648x292.png 424w, https://substackcdn.com/image/fetch/$s_!Shpo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40690501-bf69-4fe6-9d53-27f1db518442_648x292.png 848w, https://substackcdn.com/image/fetch/$s_!Shpo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40690501-bf69-4fe6-9d53-27f1db518442_648x292.png 1272w, https://substackcdn.com/image/fetch/$s_!Shpo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40690501-bf69-4fe6-9d53-27f1db518442_648x292.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Shpo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40690501-bf69-4fe6-9d53-27f1db518442_648x292.png" width="648" height="292" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40690501-bf69-4fe6-9d53-27f1db518442_648x292.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:292,&quot;width&quot;:648,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Effect disentanglement framework.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Effect disentanglement framework." title="Effect disentanglement framework." srcset="https://substackcdn.com/image/fetch/$s_!Shpo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40690501-bf69-4fe6-9d53-27f1db518442_648x292.png 424w, https://substackcdn.com/image/fetch/$s_!Shpo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40690501-bf69-4fe6-9d53-27f1db518442_648x292.png 848w, https://substackcdn.com/image/fetch/$s_!Shpo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40690501-bf69-4fe6-9d53-27f1db518442_648x292.png 1272w, https://substackcdn.com/image/fetch/$s_!Shpo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40690501-bf69-4fe6-9d53-27f1db518442_648x292.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Effect disentanglement framework. Source: [9].</figcaption></figure></div><p>What we observe is the total effect <em>e<sub>t</sub>(s,a)=s&#8217;-s</em>, i.e., the change from <em>s</em> to <em>s&#8217;</em>. 
To measure the controllable effect <em>e<sub>c</sub>(s, a)</em>, the authors adopt the potential outcome framework, proposing the following estimations:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vaq7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe4a49b-bf87-4fde-81a7-127160c52c9b_384x54.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vaq7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe4a49b-bf87-4fde-81a7-127160c52c9b_384x54.png 424w, https://substackcdn.com/image/fetch/$s_!Vaq7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe4a49b-bf87-4fde-81a7-127160c52c9b_384x54.png 848w, https://substackcdn.com/image/fetch/$s_!Vaq7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe4a49b-bf87-4fde-81a7-127160c52c9b_384x54.png 1272w, https://substackcdn.com/image/fetch/$s_!Vaq7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe4a49b-bf87-4fde-81a7-127160c52c9b_384x54.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vaq7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe4a49b-bf87-4fde-81a7-127160c52c9b_384x54.png" width="384" height="54" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/efe4a49b-bf87-4fde-81a7-127160c52c9b_384x54.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:54,&quot;width&quot;:384,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vaq7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe4a49b-bf87-4fde-81a7-127160c52c9b_384x54.png 424w, https://substackcdn.com/image/fetch/$s_!Vaq7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe4a49b-bf87-4fde-81a7-127160c52c9b_384x54.png 848w, https://substackcdn.com/image/fetch/$s_!Vaq7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe4a49b-bf87-4fde-81a7-127160c52c9b_384x54.png 1272w, https://substackcdn.com/image/fetch/$s_!Vaq7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefe4a49b-bf87-4fde-81a7-127160c52c9b_384x54.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8Ig0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf20897-1f60-44b0-badd-f9d5ab955484_479x79.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8Ig0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf20897-1f60-44b0-badd-f9d5ab955484_479x79.png 424w, https://substackcdn.com/image/fetch/$s_!8Ig0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf20897-1f60-44b0-badd-f9d5ab955484_479x79.png 848w, https://substackcdn.com/image/fetch/$s_!8Ig0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf20897-1f60-44b0-badd-f9d5ab955484_479x79.png 1272w, https://substackcdn.com/image/fetch/$s_!8Ig0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf20897-1f60-44b0-badd-f9d5ab955484_479x79.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8Ig0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf20897-1f60-44b0-badd-f9d5ab955484_479x79.png" width="479" height="79" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eaf20897-1f60-44b0-badd-f9d5ab955484_479x79.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:79,&quot;width&quot;:479,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!8Ig0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf20897-1f60-44b0-badd-f9d5ab955484_479x79.png 424w, https://substackcdn.com/image/fetch/$s_!8Ig0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf20897-1f60-44b0-badd-f9d5ab955484_479x79.png 848w, https://substackcdn.com/image/fetch/$s_!8Ig0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf20897-1f60-44b0-badd-f9d5ab955484_479x79.png 1272w, https://substackcdn.com/image/fetch/$s_!8Ig0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf20897-1f60-44b0-badd-f9d5ab955484_479x79.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here, the controllable effect is estimated as the difference between the outcome of taking action <em>a</em> and the outcome of taking another, &#8220;normal&#8221; action <em>&#227;</em>. 
To learn <em>e<sub>t</sub></em>, a neural network is trained to minimize the MSE loss:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CNLh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3f55b22-ae74-4543-9ced-7d58051ec3b8_339x68.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CNLh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3f55b22-ae74-4543-9ced-7d58051ec3b8_339x68.png 424w, https://substackcdn.com/image/fetch/$s_!CNLh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3f55b22-ae74-4543-9ced-7d58051ec3b8_339x68.png 848w, https://substackcdn.com/image/fetch/$s_!CNLh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3f55b22-ae74-4543-9ced-7d58051ec3b8_339x68.png 1272w, https://substackcdn.com/image/fetch/$s_!CNLh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3f55b22-ae74-4543-9ced-7d58051ec3b8_339x68.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CNLh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3f55b22-ae74-4543-9ced-7d58051ec3b8_339x68.png" width="339" height="68" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3f55b22-ae74-4543-9ced-7d58051ec3b8_339x68.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:68,&quot;width&quot;:339,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CNLh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3f55b22-ae74-4543-9ced-7d58051ec3b8_339x68.png 424w, https://substackcdn.com/image/fetch/$s_!CNLh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3f55b22-ae74-4543-9ced-7d58051ec3b8_339x68.png 848w, https://substackcdn.com/image/fetch/$s_!CNLh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3f55b22-ae74-4543-9ced-7d58051ec3b8_339x68.png 1272w, https://substackcdn.com/image/fetch/$s_!CNLh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3f55b22-ae74-4543-9ced-7d58051ec3b8_339x68.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Since the neural network lets us compute <em>e<sub>c</sub>(s, a)</em>, we can learn the distribution of <em>e<sub>c</sub></em> and sample an <em>e<sub>c</sub></em> as a goal, as in the goal-conditioned RL (GCRL) framework presented earlier.</p><h4>Structural Causal Model</h4><p>Only recently have structural causal models (SCMs) been used to aid RL exploration. 
To keep causal discovery tractable, early works [10] are limited to causal graphs over variables or objects rather than the full transition graph. Such graphs are especially useful in environments containing multiple objects, where understanding the causal relations between objects helps exploration. For instance, consider an agent that combines wood and stone to craft an axe: the agent needs to figure out that &#8220;axe&#8221; has causal edges to &#8220;wood&#8221; and &#8220;stone&#8221;. This knowledge enables the agent to look for wood and stone, rather than other items, when tasked with crafting an axe.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rAGP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d89c2c-5336-4aec-865f-03465bd3300a_920x563.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rAGP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d89c2c-5336-4aec-865f-03465bd3300a_920x563.png 424w, https://substackcdn.com/image/fetch/$s_!rAGP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d89c2c-5336-4aec-865f-03465bd3300a_920x563.png 848w, https://substackcdn.com/image/fetch/$s_!rAGP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d89c2c-5336-4aec-865f-03465bd3300a_920x563.png 1272w, https://substackcdn.com/image/fetch/$s_!rAGP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d89c2c-5336-4aec-865f-03465bd3300a_920x563.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!rAGP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d89c2c-5336-4aec-865f-03465bd3300a_920x563.png" width="534" height="326.78478260869565" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8d89c2c-5336-4aec-865f-03465bd3300a_920x563.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:563,&quot;width&quot;:920,&quot;resizeWidth&quot;:534,&quot;bytes&quot;:198794,&quot;alt&quot;:&quot;Causal graphs between key objects in the environment. &quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Causal graphs between key objects in the environment. " title="Causal graphs between key objects in the environment. 
" srcset="https://substackcdn.com/image/fetch/$s_!rAGP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d89c2c-5336-4aec-865f-03465bd3300a_920x563.png 424w, https://substackcdn.com/image/fetch/$s_!rAGP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d89c2c-5336-4aec-865f-03465bd3300a_920x563.png 848w, https://substackcdn.com/image/fetch/$s_!rAGP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d89c2c-5336-4aec-865f-03465bd3300a_920x563.png 1272w, https://substackcdn.com/image/fetch/$s_!rAGP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8d89c2c-5336-4aec-865f-03465bd3300a_920x563.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Causal graphs between key objects in the environment. Source: [10].</figcaption></figure></div><blockquote><p>&#10060; This approach faces a clear limitation: it assumes the complete set of objects or environmental variables is known, meaning the environment can be fully factorized. </p></blockquote><p>In response, a recent paper introduced &#128073; <strong>Variable-Agnostic Causal Exploration for Reinforcement Learning (VACERL) [11]</strong>, a framework that incorporates causal relationships to drive exploration without requiring the environmental causal variables to be specified. The method uses attention mechanisms to automatically identify crucial observation-action steps associated with key variables. It then constructs a causal graph linking these steps, guiding the agent towards observation-action pairs with greater causal influence on task completion. Finally, the causal graph can be used to generate intrinsic rewards for the intrinsic motivation framework or goals for the GCRL framework. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1HuZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c18de13-1f18-4592-9d7d-ff20f3e7c587_2467x832.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1HuZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c18de13-1f18-4592-9d7d-ff20f3e7c587_2467x832.png 424w, https://substackcdn.com/image/fetch/$s_!1HuZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c18de13-1f18-4592-9d7d-ff20f3e7c587_2467x832.png 848w, https://substackcdn.com/image/fetch/$s_!1HuZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c18de13-1f18-4592-9d7d-ff20f3e7c587_2467x832.png 1272w, https://substackcdn.com/image/fetch/$s_!1HuZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c18de13-1f18-4592-9d7d-ff20f3e7c587_2467x832.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1HuZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c18de13-1f18-4592-9d7d-ff20f3e7c587_2467x832.png" width="1456" height="491" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c18de13-1f18-4592-9d7d-ff20f3e7c587_2467x832.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:491,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:295000,&quot;alt&quot;:&quot;Variable-Agnostic Causal 
Exploration for Reinforcement Learning (VACERL)&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Variable-Agnostic Causal Exploration for Reinforcement Learning (VACERL)" title="Variable-Agnostic Causal Exploration for Reinforcement Learning (VACERL)" srcset="https://substackcdn.com/image/fetch/$s_!1HuZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c18de13-1f18-4592-9d7d-ff20f3e7c587_2467x832.png 424w, https://substackcdn.com/image/fetch/$s_!1HuZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c18de13-1f18-4592-9d7d-ff20f3e7c587_2467x832.png 848w, https://substackcdn.com/image/fetch/$s_!1HuZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c18de13-1f18-4592-9d7d-ff20f3e7c587_2467x832.png 1272w, https://substackcdn.com/image/fetch/$s_!1HuZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c18de13-1f18-4592-9d7d-ff20f3e7c587_2467x832.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 
8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The three phases of the VACERL framework. Source: [11]. </figcaption></figure></div><p>Since the environment is not assumed to be factorized, a naive approach would treat every state or state-action pair in the trajectories as an environmental variable (EV). However, this would result in an overly complex causal graph. The crucial first phase, therefore, is to detect key observation-action pairs and treat this limited set of important pairs as the EVs, thereby reducing the variable space. To this end, the authors train a Transformer to predict the final goal state. 
The Transformer (TF) takes past observations from the trajectory <em>k</em> as input and aims to output the goal observation (step <em>T</em>):</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3sWB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd130c4d7-1b98-4192-b36f-335c866c896c_1051x116.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3sWB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd130c4d7-1b98-4192-b36f-335c866c896c_1051x116.png 424w, https://substackcdn.com/image/fetch/$s_!3sWB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd130c4d7-1b98-4192-b36f-335c866c896c_1051x116.png 848w, https://substackcdn.com/image/fetch/$s_!3sWB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd130c4d7-1b98-4192-b36f-335c866c896c_1051x116.png 1272w, https://substackcdn.com/image/fetch/$s_!3sWB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd130c4d7-1b98-4192-b36f-335c866c896c_1051x116.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3sWB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd130c4d7-1b98-4192-b36f-335c866c896c_1051x116.png" width="552" height="60.924833491912466" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d130c4d7-1b98-4192-b36f-335c866c896c_1051x116.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:116,&quot;width&quot;:1051,&quot;resizeWidth&quot;:552,&quot;bytes&quot;:25390,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3sWB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd130c4d7-1b98-4192-b36f-335c866c896c_1051x116.png 424w, https://substackcdn.com/image/fetch/$s_!3sWB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd130c4d7-1b98-4192-b36f-335c866c896c_1051x116.png 848w, https://substackcdn.com/image/fetch/$s_!3sWB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd130c4d7-1b98-4192-b36f-335c866c896c_1051x116.png 1272w, https://substackcdn.com/image/fetch/$s_!3sWB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd130c4d7-1b98-4192-b36f-335c866c896c_1051x116.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The objective is to minimize the MSE error:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!68-I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e8d32f-a33c-487a-a933-2e662bad13bb_1294x106.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!68-I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e8d32f-a33c-487a-a933-2e662bad13bb_1294x106.png 424w, https://substackcdn.com/image/fetch/$s_!68-I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e8d32f-a33c-487a-a933-2e662bad13bb_1294x106.png 848w, https://substackcdn.com/image/fetch/$s_!68-I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e8d32f-a33c-487a-a933-2e662bad13bb_1294x106.png 1272w, https://substackcdn.com/image/fetch/$s_!68-I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e8d32f-a33c-487a-a933-2e662bad13bb_1294x106.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!68-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e8d32f-a33c-487a-a933-2e662bad13bb_1294x106.png" width="608" height="49.805255023183925" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7e8d32f-a33c-487a-a933-2e662bad13bb_1294x106.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:106,&quot;width&quot;:1294,&quot;resizeWidth&quot;:608,&quot;bytes&quot;:31311,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!68-I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e8d32f-a33c-487a-a933-2e662bad13bb_1294x106.png 424w, https://substackcdn.com/image/fetch/$s_!68-I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e8d32f-a33c-487a-a933-2e662bad13bb_1294x106.png 848w, https://substackcdn.com/image/fetch/$s_!68-I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e8d32f-a33c-487a-a933-2e662bad13bb_1294x106.png 1272w, https://substackcdn.com/image/fetch/$s_!68-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e8d32f-a33c-487a-a933-2e662bad13bb_1294x106.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The authors then use the attention weights to filter out state-action pairs that are not important to the goal (those with low attention scores). Only <em>M</em> important pairs are stored in the &#8220;crucial observation-action steps&#8221; set (<em>S<sub>COAS</sub></em>). </p><p>Next, in the second phase, the causal relationships among the <em>M</em> steps are identified using a two-phase structural causal model (SCM) discovery algorithm [12]. The main idea is to optimize the functional parameter <em>&#948;</em> and the structural parameter <em>&#951;</em> of the SCM through an iterative process that alternates between fixing one parameter and updating the other. Starting from random initialization, both parameters are trained using data induced from the set <em>S<sub>COAS</sub></em>. Here, the approach is based on the principle that a &#8220;cause&#8221; step must come before its &#8220;effect&#8221; step. 
Consequently, the function <em>f<sub>&#948;</sub></em> is represented by a neural network tasked with predicting the step at timestep <em>t</em> using the sequence of steps leading up to it. The training objective is to minimize the MSE error between the prediction and the true data:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!juLp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfb130d-93b1-4b1f-837c-c7a6511532ea_929x104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!juLp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfb130d-93b1-4b1f-837c-c7a6511532ea_929x104.png 424w, https://substackcdn.com/image/fetch/$s_!juLp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfb130d-93b1-4b1f-837c-c7a6511532ea_929x104.png 848w, https://substackcdn.com/image/fetch/$s_!juLp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfb130d-93b1-4b1f-837c-c7a6511532ea_929x104.png 1272w, https://substackcdn.com/image/fetch/$s_!juLp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfb130d-93b1-4b1f-837c-c7a6511532ea_929x104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!juLp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfb130d-93b1-4b1f-837c-c7a6511532ea_929x104.png" width="516" height="57.76533907427341" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2dfb130d-93b1-4b1f-837c-c7a6511532ea_929x104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:104,&quot;width&quot;:929,&quot;resizeWidth&quot;:516,&quot;bytes&quot;:20109,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!juLp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfb130d-93b1-4b1f-837c-c7a6511532ea_929x104.png 424w, https://substackcdn.com/image/fetch/$s_!juLp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfb130d-93b1-4b1f-837c-c7a6511532ea_929x104.png 848w, https://substackcdn.com/image/fetch/$s_!juLp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfb130d-93b1-4b1f-837c-c7a6511532ea_929x104.png 1272w, https://substackcdn.com/image/fetch/$s_!juLp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dfb130d-93b1-4b1f-837c-c7a6511532ea_929x104.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>To find <em>&#951; </em>representing causal influence from one node in the causal graph to another, the authors fix <em>&#948;</em> and optimize <em>&#951;</em> by updating the causal influence from a node <em>X<sub>j</sub></em>&#8203; to <em>X<sub>i</sub></em> as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2" 
target="_blank" href="https://substackcdn.com/image/fetch/$s_!O82k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3680bac1-fe34-418d-9e02-b37dc9352ca4_1058x94.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O82k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3680bac1-fe34-418d-9e02-b37dc9352ca4_1058x94.png 424w, https://substackcdn.com/image/fetch/$s_!O82k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3680bac1-fe34-418d-9e02-b37dc9352ca4_1058x94.png 848w, https://substackcdn.com/image/fetch/$s_!O82k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3680bac1-fe34-418d-9e02-b37dc9352ca4_1058x94.png 1272w, https://substackcdn.com/image/fetch/$s_!O82k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3680bac1-fe34-418d-9e02-b37dc9352ca4_1058x94.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O82k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3680bac1-fe34-418d-9e02-b37dc9352ca4_1058x94.png" width="1058" height="94" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3680bac1-fe34-418d-9e02-b37dc9352ca4_1058x94.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:94,&quot;width&quot;:1058,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23641,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O82k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3680bac1-fe34-418d-9e02-b37dc9352ca4_1058x94.png 424w, https://substackcdn.com/image/fetch/$s_!O82k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3680bac1-fe34-418d-9e02-b37dc9352ca4_1058x94.png 848w, https://substackcdn.com/image/fetch/$s_!O82k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3680bac1-fe34-418d-9e02-b37dc9352ca4_1058x94.png 1272w, https://substackcdn.com/image/fetch/$s_!O82k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3680bac1-fe34-418d-9e02-b37dc9352ca4_1058x94.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vqqy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecc6030-10ed-4413-af01-67f75bb794c4_1168x261.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vqqy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecc6030-10ed-4413-af01-67f75bb794c4_1168x261.png 424w, https://substackcdn.com/image/fetch/$s_!Vqqy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecc6030-10ed-4413-af01-67f75bb794c4_1168x261.png 848w, https://substackcdn.com/image/fetch/$s_!Vqqy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecc6030-10ed-4413-af01-67f75bb794c4_1168x261.png 1272w, https://substackcdn.com/image/fetch/$s_!Vqqy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecc6030-10ed-4413-af01-67f75bb794c4_1168x261.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vqqy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecc6030-10ed-4413-af01-67f75bb794c4_1168x261.png" width="608" height="135.86301369863014" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ecc6030-10ed-4413-af01-67f75bb794c4_1168x261.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:261,&quot;width&quot;:1168,&quot;resizeWidth&quot;:608,&quot;bytes&quot;:30050,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Vqqy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecc6030-10ed-4413-af01-67f75bb794c4_1168x261.png 424w, https://substackcdn.com/image/fetch/$s_!Vqqy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecc6030-10ed-4413-af01-67f75bb794c4_1168x261.png 848w, https://substackcdn.com/image/fetch/$s_!Vqqy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecc6030-10ed-4413-af01-67f75bb794c4_1168x261.png 1272w, https://substackcdn.com/image/fetch/$s_!Vqqy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ecc6030-10ed-4413-af01-67f75bb794c4_1168x261.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here, <em>&#981;<sub>causal</sub></em> is the causal threshold. Iteratively updating the SCM&#8217;s parameters until convergence yields a causal graph for the third phase. Given the graph, we can generate intrinsic rewards or set goals to aid exploration. </p><blockquote><p>&#10060; One limitation of the approach lies in Phase 1, where the goal state must be known to train the Transformer. This dependence on successful trajectories means that substantial exploration is needed up front, which can be inefficient because many environment steps are required to collect them.</p></blockquote><div><hr></div><h2>References</h2><p>[1] Ecoffet, Adrien, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. "First return, then explore."&nbsp;<em>Nature</em>&nbsp;590, no. 7847 (2021): 580-586.</p><p>[2] Quentin Gallou&#233;dec and Emmanuel Dellandr&#233;a. 2023. Cell-free latent go-explore. In International Conference on Machine Learning. 
PMLR, 10571&#8211;10586.</p><p>[3] Van Den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." <em>Advances in neural information processing systems</em> 30 (2017).</p><p>[4] Russell Mendonca, Oleh Rybkin, Kostas Daniilidis, Danijar Hafner, and Deepak Pathak. Discovering and achieving goals with world models. In Advances in Neural Information Processing Systems, 2021.</p><p>[5] Hu, Edward S., Richard Chang, Oleh Rybkin, and Dinesh Jayaraman. "Planning Goals for Exploration." In <em>The Eleventh International Conference on Learning Representations</em>. 2022.</p><p>[6] Oh, Junhyuk, Yijie Guo, Satinder Singh, and Honglak Lee. "Self-imitation learning." In&nbsp;<em>International conference on machine learning</em>, pp. 3878-3887. PMLR, 2018.</p><p>[7] Yijie Guo, Jongwook Choi, Marcin Moczulski, Shengyu Feng, Samy Bengio, Mohammad Norouzi, and Honglak Lee. 2020. Memory-based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020), 4333&#8211;4345.</p><p>[8] Seitzer, Maximilian, Bernhard Sch&#246;lkopf, and Georg Martius. "Causal influence detection for improving efficiency in reinforcement learning." <em>Advances in Neural Information Processing Systems</em> 34 (2021).</p><p>[9] Oriol Corcoll, Raul Vicente. Disentangling Controlled Effects for Hierarchical Reinforcement Learning, CLeaR 2022</p><p>[10] Hu, X., Zhang, R., Tang, K., Guo, J., Yi, Q., Chen, R., ... &amp; Chen, Y. (2022). 
Causality-driven hierarchical structure discovery for reinforcement learning.&nbsp;Advances in Neural Information Processing Systems,&nbsp;35, 20064-20076.</p><p>[11] Nguyen H, Le H and Venkatesh S (2024), <em>"Variable-Agnostic Causal Exploration for Reinforcement Learning"</em>, In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD).</p><p>[12] Ke, N.R., Bilaniuk, O., Goyal, A., Bauer, S., Larochelle, H., Sch&#246;lkopf, B., Mozer, M.C., Pal, C., Bengio, Y.: Learning neural causal models from unknown interventions. arXiv preprint arXiv:1910.01075 (2019).</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://hungleai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><strong>I hope you enjoy the article. Stay tuned for the newest and exclusive content by subscribing to Neurocoder Tales! </strong><em>Disclaimer:</em> <em>While every effort is made to provide accurate and unbiased information, errors may occur. 
Let me know if you catch any error.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Human-Aligned Large Language Models ]]></title><description><![CDATA[About recent LLM alignment finetuning techniques such as RLHF, DPO, KTO, IPO and SPIN]]></description><link>https://hungleai.substack.com/p/aligning-large-language-models-with</link><guid isPermaLink="false">https://hungleai.substack.com/p/aligning-large-language-models-with</guid><dc:creator><![CDATA[Hung Le]]></dc:creator><pubDate>Mon, 19 Feb 2024 02:22:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dbb8329-2260-488b-92a8-0591e8c08efc_445x300.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Table of Contents</h2><ul><li><p><a href="https://hungleai.substack.com/i/141411934/a-bit-of-context">A Bit of Context</a></p></li><li><p><a href="https://hungleai.substack.com/i/141411934/reinforced-finetuning-using-reward-model">Reinforced Finetuning using Reward Model</a></p><ul><li><p><a href="https://hungleai.substack.com/i/141411934/reward-learning">Reward Learning</a></p></li><li><p><a href="https://hungleai.substack.com/i/141411934/policy-optimization">Policy Optimization</a></p></li><li><p><a href="https://hungleai.substack.com/i/141411934/ai-feedback">AI Feedback</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/141411934/preference-optimization-without-rl">Preference Optimization without RL</a></p><ul><li><p><a href="https://hungleai.substack.com/i/141411934/direct-preference-optimization-dpo">Direct 
Preference Optimization (DPO)</a></p></li><li><p><a href="https://hungleai.substack.com/i/141411934/alternative-preference-optimization">Alternative Preference Optimization</a></p></li><li><p><a href="https://hungleai.substack.com/i/141411934/self-play-preference-optimization">Self-Play Preference Optimization</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/141411934/finetuning-with-rating-feedback">Finetuning with Rating Feedback</a></p></li></ul><div><hr></div><h2>A Bit of Context</h2><p>In alignment training, the goal is to ensure that the outputs generated by machine learning models align closely with human preferences, values, and intentions. This line of research traces back to the classic era of machine learning, when preference models were studied to support various tasks in classification and reinforcement learning [1,2]. It was not until 2017 that researchers scaled up the approach, employing reinforcement learning algorithms such as <a href="https://arxiv.org/abs/1707.06347">PPO</a> to align models with human preferences [3]. As Large Language Models (LLMs) have become more prevalent, alignment training has emerged as a crucial factor in their success, especially in the case of <a href="https://chat.openai.com/">ChatGPT</a>. The LLM finetuning framework generally consists of two steps:</p><ol><li><p>Supervised finetuning: the LLM is finetuned on data with ground-truth labels for the task.</p></li><li><p>Alignment finetuning: the LLM is finetuned on human feedback data, often in the form of preferences such as comparison feedback. </p></li></ol><p>&#129504; <em>Why is alignment necessary instead of continuing supervised fine-tuning?</em></p><ul><li><p>With limited training data, excessive fine-tuning leads to overfitting.</p></li><li><p>Obtaining more supervised data requires costly labeling of ground-truth answers.</p></li><li><p>Supervised fine-tuning of an LLM is expensive and does not prioritize generating the outputs users want. 
</p></li></ul><p>&#128073; This article focuses on step 2: <strong>LLM alignment</strong>. A common scheme for alignment training is to use feedback-based labels as training signals to reduce the labeling cost. Another consideration is that the finetuned model should not deviate too much from the base model, ensuring that the final model retains the properties learned from extensive pretraining and supervised tuning data.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q6E4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728fcb39-05de-43f4-9d5f-793ccb78081a_2496x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q6E4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728fcb39-05de-43f4-9d5f-793ccb78081a_2496x540.png 424w, https://substackcdn.com/image/fetch/$s_!Q6E4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728fcb39-05de-43f4-9d5f-793ccb78081a_2496x540.png 848w, https://substackcdn.com/image/fetch/$s_!Q6E4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728fcb39-05de-43f4-9d5f-793ccb78081a_2496x540.png 1272w, https://substackcdn.com/image/fetch/$s_!Q6E4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728fcb39-05de-43f4-9d5f-793ccb78081a_2496x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q6E4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728fcb39-05de-43f4-9d5f-793ccb78081a_2496x540.png" 
width="1456" height="315" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/728fcb39-05de-43f4-9d5f-793ccb78081a_2496x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:315,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q6E4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728fcb39-05de-43f4-9d5f-793ccb78081a_2496x540.png 424w, https://substackcdn.com/image/fetch/$s_!Q6E4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728fcb39-05de-43f4-9d5f-793ccb78081a_2496x540.png 848w, https://substackcdn.com/image/fetch/$s_!Q6E4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728fcb39-05de-43f4-9d5f-793ccb78081a_2496x540.png 1272w, https://substackcdn.com/image/fetch/$s_!Q6E4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728fcb39-05de-43f4-9d5f-793ccb78081a_2496x540.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Two principles for alignment training.</figcaption></figure></div><div><hr></div><h2>Reinforced Finetuning using Reward Model</h2><p>Originating in settings where human feedback is used to optimize an RL policy, the approach, named Reinforcement learning with human feedback (RLHF, [3]), involves two phases: (1) reward learning 
and (2) policy optimization, as depicted in the schematic diagram below:  </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p4_-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dbb8329-2260-488b-92a8-0591e8c08efc_445x300.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p4_-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dbb8329-2260-488b-92a8-0591e8c08efc_445x300.gif 424w, https://substackcdn.com/image/fetch/$s_!p4_-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dbb8329-2260-488b-92a8-0591e8c08efc_445x300.gif 848w, https://substackcdn.com/image/fetch/$s_!p4_-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dbb8329-2260-488b-92a8-0591e8c08efc_445x300.gif 1272w, https://substackcdn.com/image/fetch/$s_!p4_-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dbb8329-2260-488b-92a8-0591e8c08efc_445x300.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p4_-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dbb8329-2260-488b-92a8-0591e8c08efc_445x300.gif" width="445" height="300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6dbb8329-2260-488b-92a8-0591e8c08efc_445x300.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:445,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:143709,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p4_-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dbb8329-2260-488b-92a8-0591e8c08efc_445x300.gif 424w, https://substackcdn.com/image/fetch/$s_!p4_-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dbb8329-2260-488b-92a8-0591e8c08efc_445x300.gif 848w, https://substackcdn.com/image/fetch/$s_!p4_-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dbb8329-2260-488b-92a8-0591e8c08efc_445x300.gif 1272w, https://substackcdn.com/image/fetch/$s_!p4_-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dbb8329-2260-488b-92a8-0591e8c08efc_445x300.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Reinforcement learning with human feedback (RLHF). Once trained, the reward model provides the predicted reward used to train the RL agent.</figcaption></figure></div><p>To apply this to LLMs in practice, we need to collect a dataset <em>D</em> of training data from humans. Here, the Reinforcement Learning (RL) agent is the LLM itself, operating within a one-step episode environment: the LLM executes a single action that generates a complete output response [4].</p><h4>Reward Learning</h4><p>Collecting preference data is one of the most efficient ways to get data from humans. In particular, preference data is generated by sampling two responses from a policy (normally a pretrained LLM) and presenting them to a rater (normally a human), who indicates which one is preferred. 
</p><p>To model human preference, the authors in [3,4] choose the <a href="https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model">Bradley-Terry model</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Cquw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7936661c-c260-4efc-a725-ff8065633de3_632x88.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Cquw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7936661c-c260-4efc-a725-ff8065633de3_632x88.png 424w, https://substackcdn.com/image/fetch/$s_!Cquw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7936661c-c260-4efc-a725-ff8065633de3_632x88.png 848w, https://substackcdn.com/image/fetch/$s_!Cquw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7936661c-c260-4efc-a725-ff8065633de3_632x88.png 1272w, https://substackcdn.com/image/fetch/$s_!Cquw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7936661c-c260-4efc-a725-ff8065633de3_632x88.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Cquw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7936661c-c260-4efc-a725-ff8065633de3_632x88.png" width="546" height="76.0253164556962" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7936661c-c260-4efc-a725-ff8065633de3_632x88.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:88,&quot;width&quot;:632,&quot;resizeWidth&quot;:546,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Cquw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7936661c-c260-4efc-a725-ff8065633de3_632x88.png 424w, https://substackcdn.com/image/fetch/$s_!Cquw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7936661c-c260-4efc-a725-ff8065633de3_632x88.png 848w, https://substackcdn.com/image/fetch/$s_!Cquw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7936661c-c260-4efc-a725-ff8065633de3_632x88.png 1272w, https://substackcdn.com/image/fetch/$s_!Cquw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7936661c-c260-4efc-a725-ff8065633de3_632x88.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>p*(y<sub>1</sub>&#8827;y<sub>2</sub>|x) </em>denotes the true (human) probability <em>y<sub>1</sub></em> is preferred to <em>y<sub>2</sub> </em>given input <em>x</em>. Here, <em>r* </em>is the true reward function assigning a scalar score to indicate the output <em>y</em>'s suitability to input <em>x</em>. 
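</p><p>As a minimal sketch, assuming the standard form of the Bradley-Terry model, the preference probability is a softmax over the two rewards, which reduces to a sigmoid of their difference. The function name <code>bradley_terry_prob</code> and the scalar reward inputs are illustrative assumptions, not code from [3,4]:</p><pre><code class="language-python">import math


def bradley_terry_prob(reward_1, reward_2):
    """Probability that response 1 is preferred to response 2 (Bradley-Terry)."""
    # exp(r1) / (exp(r1) + exp(r2)) is algebraically equal to sigmoid(r1 - r2).
    return 1.0 / (1.0 + math.exp(-(reward_1 - reward_2)))


# Hypothetical reward scores for two responses to the same input x.
print(bradley_terry_prob(2.0, 1.0))  # above 0.5: response 1 is favoured
print(bradley_terry_prob(1.0, 1.0))  # 0.5: equal rewards, no preference
</code></pre><p>Only the reward difference matters: adding the same constant to both rewards leaves the preference probability unchanged.</p><p>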
In practice, we can only estimate the true reward by training a reward model <em>r<sub>&#952;</sub> </em>to minimize the negative log-likelihood of <em>p*(y<sub>1</sub>&#8827;y<sub>2</sub>|x)</em> according to the preference data <em>D </em>as follows,</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kgnx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e105d8f-4832-4979-b8db-f66fada79b7a_689x95.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kgnx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e105d8f-4832-4979-b8db-f66fada79b7a_689x95.png 424w, https://substackcdn.com/image/fetch/$s_!Kgnx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e105d8f-4832-4979-b8db-f66fada79b7a_689x95.png 848w, https://substackcdn.com/image/fetch/$s_!Kgnx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e105d8f-4832-4979-b8db-f66fada79b7a_689x95.png 1272w, https://substackcdn.com/image/fetch/$s_!Kgnx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e105d8f-4832-4979-b8db-f66fada79b7a_689x95.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kgnx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e105d8f-4832-4979-b8db-f66fada79b7a_689x95.png" width="689" height="95" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e105d8f-4832-4979-b8db-f66fada79b7a_689x95.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:95,&quot;width&quot;:689,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:13146,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kgnx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e105d8f-4832-4979-b8db-f66fada79b7a_689x95.png 424w, https://substackcdn.com/image/fetch/$s_!Kgnx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e105d8f-4832-4979-b8db-f66fada79b7a_689x95.png 848w, https://substackcdn.com/image/fetch/$s_!Kgnx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e105d8f-4832-4979-b8db-f66fada79b7a_689x95.png 1272w, https://substackcdn.com/image/fetch/$s_!Kgnx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e105d8f-4832-4979-b8db-f66fada79b7a_689x95.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>y<sub>w</sub></em> and <em>y<sub>l</sub></em> are the preferred and dispreferred responses, respectively. With <em>&#963;</em> as the <a href="https://en.wikipedia.org/wiki/Sigmoid_function">sigmoid</a> function, the loss is simply the negative expected log-likelihood under the Bradley-Terry preference distribution. 
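</p><p>A minimal sketch of this loss for a single prompt, under the assumption that the reward model has already scored <em>K</em> responses and that the labeler's ranking is given as a list ordered from most to least preferred. The helper names and the plain-Python scalars are illustrative, not the implementation in [4]:</p><pre><code class="language-python">import math
from itertools import combinations


def log_sigmoid(z):
    # log(sigmoid(z)) written out directly; adequate for a small sketch.
    return -math.log(1.0 + math.exp(-z))


def reward_model_loss(ranked_rewards):
    """Pairwise loss for one prompt (needs at least two responses).

    ranked_rewards: reward-model scores of K responses, ordered from the
    most preferred to the least preferred response (hypothetical input).
    """
    # combinations() keeps list order, so each pair is (r_w, r_l).
    pairs = list(combinations(ranked_rewards, 2))  # K choose 2 pairs
    total = sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs)
    return -total / len(pairs)


# Scores that agree with the human ranking give a lower loss
# than scores that contradict it.
print(reward_model_loss([2.0, 1.0, 0.0]))
print(reward_model_loss([0.0, 1.0, 2.0]))
</code></pre><p>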
Here, <em>(K, 2)</em> denotes the binomial coefficient "<em>K</em> choose 2": the number of response pairs that can be formed from the <em>K</em> generated responses, all of which are presented to humans for rating. The reward learning process is summarized below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Sth5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad34bf19-e1fe-4521-aa74-070530a1face_2160x1472.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Sth5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad34bf19-e1fe-4521-aa74-070530a1face_2160x1472.png 424w, https://substackcdn.com/image/fetch/$s_!Sth5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad34bf19-e1fe-4521-aa74-070530a1face_2160x1472.png 848w, https://substackcdn.com/image/fetch/$s_!Sth5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad34bf19-e1fe-4521-aa74-070530a1face_2160x1472.png 1272w, https://substackcdn.com/image/fetch/$s_!Sth5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad34bf19-e1fe-4521-aa74-070530a1face_2160x1472.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Sth5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad34bf19-e1fe-4521-aa74-070530a1face_2160x1472.png" width="508" height="346.1098901098901" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad34bf19-e1fe-4521-aa74-070530a1face_2160x1472.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:992,&quot;width&quot;:1456,&quot;resizeWidth&quot;:508,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Sth5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad34bf19-e1fe-4521-aa74-070530a1face_2160x1472.png 424w, https://substackcdn.com/image/fetch/$s_!Sth5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad34bf19-e1fe-4521-aa74-070530a1face_2160x1472.png 848w, https://substackcdn.com/image/fetch/$s_!Sth5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad34bf19-e1fe-4521-aa74-070530a1face_2160x1472.png 1272w, https://substackcdn.com/image/fetch/$s_!Sth5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad34bf19-e1fe-4521-aa74-070530a1face_2160x1472.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Reward learning in RLHF. A human provides training signals by comparing the quality of proposed outputs. Image taken from [4].</figcaption></figure></div><h4>Policy Optimization</h4><p>Given the learned reward, any policy gradient method can be used to update the LLM&#8217;s parameters to maximize the expected reward. The authors in [4] selected PPO as the optimization algorithm. They augmented it with regularization terms that keep updates from deviating too far from the base model, the supervised fine-tuned (SFT) model. The regularization applies a per-token <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL divergence</a> penalty against the base model, aiming to mitigate over-optimization of the reward model. 
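</p><p>One common way to realize such a penalty, sketched here under stated assumptions: each generated token receives a reward of minus <em>&#946;</em> times the log-probability gap between the policy and the SFT model, and the reward-model score is added at the final token. The function name, the placement of the score on the last token, and the value of <code>beta</code> are illustrative choices rather than details confirmed by [4]:</p><pre><code class="language-python">def kl_shaped_rewards(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    """Per-token rewards for one sampled, non-empty response.

    rm_score: scalar score from the reward model for the whole response.
    policy_logprobs / sft_logprobs: log-probabilities that the current policy
    and the frozen SFT model assign to each sampled token.
    beta: penalty strength (hypothetical default).
    """
    # log pi(token) - log pi_sft(token) is a per-token sample of the KL term.
    rewards = [-beta * (lp - lq) for lp, lq in zip(policy_logprobs, sft_logprobs)]
    # The sequence-level reward is credited to the last token.
    rewards[-1] += rm_score
    return rewards


# The policy has drifted on the second token, so that token is penalized.
print(kl_shaped_rewards(1.0, [-1.0, -0.5, -2.0], [-1.0, -1.5, -2.0], beta=0.1))
</code></pre><p>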
In addition, the authors explore mixing pretraining gradients into the PPO gradients by adding another regularization term: the log-likelihood of the optimized policy on pretraining data. The final RL objective is:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kBT8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9eddb4c-3327-4288-9c53-d64b6044289d_3076x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kBT8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9eddb4c-3327-4288-9c53-d64b6044289d_3076x816.png 424w, https://substackcdn.com/image/fetch/$s_!kBT8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9eddb4c-3327-4288-9c53-d64b6044289d_3076x816.png 848w, https://substackcdn.com/image/fetch/$s_!kBT8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9eddb4c-3327-4288-9c53-d64b6044289d_3076x816.png 1272w, https://substackcdn.com/image/fetch/$s_!kBT8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9eddb4c-3327-4288-9c53-d64b6044289d_3076x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kBT8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9eddb4c-3327-4288-9c53-d64b6044289d_3076x816.png" width="1456" height="386" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9eddb4c-3327-4288-9c53-d64b6044289d_3076x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:386,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kBT8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9eddb4c-3327-4288-9c53-d64b6044289d_3076x816.png 424w, https://substackcdn.com/image/fetch/$s_!kBT8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9eddb4c-3327-4288-9c53-d64b6044289d_3076x816.png 848w, https://substackcdn.com/image/fetch/$s_!kBT8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9eddb4c-3327-4288-9c53-d64b6044289d_3076x816.png 1272w, https://substackcdn.com/image/fetch/$s_!kBT8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9eddb4c-3327-4288-9c53-d64b6044289d_3076x816.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Although this training scheme is a cornerstone of ChatGPT's success, it has several limitations:</p><p>&#10060; Training can be unstable and requires significant computational resources.</p><p>&#10060; It involves two cumbersome phases: reward model training and reinforcement learning fine-tuning.</p><p>&#128073; Coordinating these phases can be challenging due to varied hyperparameters and human feedback dynamics.</p><h4>AI Feedback</h4><p>Gathering high-quality human preference labels for RLHF is time-consuming and expensive. RLAIF [9,10] offers a promising alternative, using a powerful LLM to generate preferences instead. 
The picture below summarizes the two pipelines:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KCd9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e328fc5-7978-4bee-9aa9-8273555e1580_1184x607.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KCd9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e328fc5-7978-4bee-9aa9-8273555e1580_1184x607.png 424w, https://substackcdn.com/image/fetch/$s_!KCd9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e328fc5-7978-4bee-9aa9-8273555e1580_1184x607.png 848w, https://substackcdn.com/image/fetch/$s_!KCd9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e328fc5-7978-4bee-9aa9-8273555e1580_1184x607.png 1272w, https://substackcdn.com/image/fetch/$s_!KCd9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e328fc5-7978-4bee-9aa9-8273555e1580_1184x607.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KCd9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e328fc5-7978-4bee-9aa9-8273555e1580_1184x607.png" width="1184" height="607" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e328fc5-7978-4bee-9aa9-8273555e1580_1184x607.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:607,&quot;width&quot;:1184,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:209417,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KCd9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e328fc5-7978-4bee-9aa9-8273555e1580_1184x607.png 424w, https://substackcdn.com/image/fetch/$s_!KCd9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e328fc5-7978-4bee-9aa9-8273555e1580_1184x607.png 848w, https://substackcdn.com/image/fetch/$s_!KCd9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e328fc5-7978-4bee-9aa9-8273555e1580_1184x607.png 1272w, https://substackcdn.com/image/fetch/$s_!KCd9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e328fc5-7978-4bee-9aa9-8273555e1580_1184x607.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">RLAIF (above) vs RLHF (below). The difference is the source of preference feedback: the former uses an LLM to evaluate preferences, while the latter uses humans. Image taken from [10].</figcaption></figure></div><p>The off-the-shelf LLM is a general-purpose model that has not been fine-tuned for the downstream task. There are two ways to prompt the off-the-shelf LLM:</p><ol><li><p>Given a piece of text and two candidate responses, the LLM is asked to indicate which response is preferred.</p></li><li><p>The LLM is directly prompted for reward scores during RL, bypassing the step of distilling LLM preference labels into a reward model.</p></li></ol><blockquote><p>&#128064; There are several tricks to ensure the quality of generated preference labels. For example, to reduce position bias in preference labeling, the authors in [10] conduct two inferences for each pair of candidates. 
In the second inference, the order of candidate presentation to the LLM is reversed. The results from both inferences are averaged to obtain the final preference distribution.</p></blockquote><div><hr></div><h2>Preference Optimization without RL</h2><p>A recent breakthrough [5] in LLM alignment reveals the possibility of directly optimizing the LLM to align with preference data, eliminating the necessity for reward learning &#128073; <strong>DPO</strong>. This paradigm shift is illustrated below:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VZsp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c40a20e-04c2-45ae-8bcc-0f5904c7c875_1019x226.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VZsp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c40a20e-04c2-45ae-8bcc-0f5904c7c875_1019x226.png 424w, https://substackcdn.com/image/fetch/$s_!VZsp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c40a20e-04c2-45ae-8bcc-0f5904c7c875_1019x226.png 848w, https://substackcdn.com/image/fetch/$s_!VZsp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c40a20e-04c2-45ae-8bcc-0f5904c7c875_1019x226.png 1272w, https://substackcdn.com/image/fetch/$s_!VZsp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c40a20e-04c2-45ae-8bcc-0f5904c7c875_1019x226.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!VZsp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c40a20e-04c2-45ae-8bcc-0f5904c7c875_1019x226.png" width="1019" height="226" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c40a20e-04c2-45ae-8bcc-0f5904c7c875_1019x226.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:226,&quot;width&quot;:1019,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86722,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VZsp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c40a20e-04c2-45ae-8bcc-0f5904c7c875_1019x226.png 424w, https://substackcdn.com/image/fetch/$s_!VZsp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c40a20e-04c2-45ae-8bcc-0f5904c7c875_1019x226.png 848w, https://substackcdn.com/image/fetch/$s_!VZsp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c40a20e-04c2-45ae-8bcc-0f5904c7c875_1019x226.png 1272w, https://substackcdn.com/image/fetch/$s_!VZsp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c40a20e-04c2-45ae-8bcc-0f5904c7c875_1019x226.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">RLHF vs DPO. 
Image taken from [5].</figcaption></figure></div><h4>Direct Preference Optimization (DPO)</h4><p>The key idea is that, for a given reward function <em>r</em>, the main RLHF optimization objective, which combines a reward term and a KL divergence term, has a closed-form solution for the optimal policy:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EHNl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14f489fe-b417-45f9-814e-6989bec569fc_3612x1132.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EHNl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14f489fe-b417-45f9-814e-6989bec569fc_3612x1132.png 424w, https://substackcdn.com/image/fetch/$s_!EHNl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14f489fe-b417-45f9-814e-6989bec569fc_3612x1132.png 848w, https://substackcdn.com/image/fetch/$s_!EHNl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14f489fe-b417-45f9-814e-6989bec569fc_3612x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!EHNl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14f489fe-b417-45f9-814e-6989bec569fc_3612x1132.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EHNl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14f489fe-b417-45f9-814e-6989bec569fc_3612x1132.png" width="1456" height="456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14f489fe-b417-45f9-814e-6989bec569fc_3612x1132.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EHNl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14f489fe-b417-45f9-814e-6989bec569fc_3612x1132.png 424w, https://substackcdn.com/image/fetch/$s_!EHNl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14f489fe-b417-45f9-814e-6989bec569fc_3612x1132.png 848w, https://substackcdn.com/image/fetch/$s_!EHNl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14f489fe-b417-45f9-814e-6989bec569fc_3612x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!EHNl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14f489fe-b417-45f9-814e-6989bec569fc_3612x1132.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The problem with this closed-form solution is that <em>Z(x)</em> is intractable to compute because it requires summing over all possible responses <em>y</em>, a space that is effectively infinite. However, the form of the solution suggests a clever change of variables that bypasses the computation of <em>Z</em>. 
To see that, first, we write <em>r</em> as a function of the policy:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M_W8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa224a1-c231-4bd1-bbe2-8ecba6add5dc_479x95.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M_W8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa224a1-c231-4bd1-bbe2-8ecba6add5dc_479x95.png 424w, https://substackcdn.com/image/fetch/$s_!M_W8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa224a1-c231-4bd1-bbe2-8ecba6add5dc_479x95.png 848w, https://substackcdn.com/image/fetch/$s_!M_W8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa224a1-c231-4bd1-bbe2-8ecba6add5dc_479x95.png 1272w, https://substackcdn.com/image/fetch/$s_!M_W8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa224a1-c231-4bd1-bbe2-8ecba6add5dc_479x95.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M_W8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa224a1-c231-4bd1-bbe2-8ecba6add5dc_479x95.png" width="421" height="83.49686847599165" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/faa224a1-c231-4bd1-bbe2-8ecba6add5dc_479x95.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:95,&quot;width&quot;:479,&quot;resizeWidth&quot;:421,&quot;bytes&quot;:16916,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M_W8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa224a1-c231-4bd1-bbe2-8ecba6add5dc_479x95.png 424w, https://substackcdn.com/image/fetch/$s_!M_W8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa224a1-c231-4bd1-bbe2-8ecba6add5dc_479x95.png 848w, https://substackcdn.com/image/fetch/$s_!M_W8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa224a1-c231-4bd1-bbe2-8ecba6add5dc_479x95.png 1272w, https://substackcdn.com/image/fetch/$s_!M_W8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa224a1-c231-4bd1-bbe2-8ecba6add5dc_479x95.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Plugging this into the Bradley-Terry preference model, we can cancel out <em>Z</em> and make the preference modeling tractable:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!pWzZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57553f1f-30c8-4096-8f5b-14ee92f6f50a_737x99.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pWzZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57553f1f-30c8-4096-8f5b-14ee92f6f50a_737x99.png 424w, https://substackcdn.com/image/fetch/$s_!pWzZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57553f1f-30c8-4096-8f5b-14ee92f6f50a_737x99.png 848w, https://substackcdn.com/image/fetch/$s_!pWzZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57553f1f-30c8-4096-8f5b-14ee92f6f50a_737x99.png 1272w, https://substackcdn.com/image/fetch/$s_!pWzZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57553f1f-30c8-4096-8f5b-14ee92f6f50a_737x99.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pWzZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57553f1f-30c8-4096-8f5b-14ee92f6f50a_737x99.png" width="602" height="80.86567164179104" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57553f1f-30c8-4096-8f5b-14ee92f6f50a_737x99.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:99,&quot;width&quot;:737,&quot;resizeWidth&quot;:602,&quot;bytes&quot;:24581,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pWzZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57553f1f-30c8-4096-8f5b-14ee92f6f50a_737x99.png 424w, https://substackcdn.com/image/fetch/$s_!pWzZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57553f1f-30c8-4096-8f5b-14ee92f6f50a_737x99.png 848w, https://substackcdn.com/image/fetch/$s_!pWzZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57553f1f-30c8-4096-8f5b-14ee92f6f50a_737x99.png 1272w, https://substackcdn.com/image/fetch/$s_!pWzZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57553f1f-30c8-4096-8f5b-14ee92f6f50a_737x99.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The negative log-likelihood loss to fit the policy <em>&#960;</em> to a given preference dataset <em>D</em> becomes:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!xNYt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6600e527-f420-4ff7-9c0f-3a87969e85b4_965x92.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xNYt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6600e527-f420-4ff7-9c0f-3a87969e85b4_965x92.png 424w, https://substackcdn.com/image/fetch/$s_!xNYt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6600e527-f420-4ff7-9c0f-3a87969e85b4_965x92.png 848w, https://substackcdn.com/image/fetch/$s_!xNYt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6600e527-f420-4ff7-9c0f-3a87969e85b4_965x92.png 1272w, https://substackcdn.com/image/fetch/$s_!xNYt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6600e527-f420-4ff7-9c0f-3a87969e85b4_965x92.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xNYt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6600e527-f420-4ff7-9c0f-3a87969e85b4_965x92.png" width="965" height="92" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6600e527-f420-4ff7-9c0f-3a87969e85b4_965x92.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:92,&quot;width&quot;:965,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:30422,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xNYt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6600e527-f420-4ff7-9c0f-3a87969e85b4_965x92.png 424w, https://substackcdn.com/image/fetch/$s_!xNYt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6600e527-f420-4ff7-9c0f-3a87969e85b4_965x92.png 848w, https://substackcdn.com/image/fetch/$s_!xNYt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6600e527-f420-4ff7-9c0f-3a87969e85b4_965x92.png 1272w, https://substackcdn.com/image/fetch/$s_!xNYt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6600e527-f420-4ff7-9c0f-3a87969e85b4_965x92.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>&#129504; <em>What does the DPO update do?</em> If we inspect deeper, the gradient update of DPO reads:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Xtoa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61959267-df18-490a-801c-b5488a3b0dfe_4864x876.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xtoa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61959267-df18-490a-801c-b5488a3b0dfe_4864x876.png 424w, https://substackcdn.com/image/fetch/$s_!Xtoa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61959267-df18-490a-801c-b5488a3b0dfe_4864x876.png 848w, https://substackcdn.com/image/fetch/$s_!Xtoa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61959267-df18-490a-801c-b5488a3b0dfe_4864x876.png 1272w, https://substackcdn.com/image/fetch/$s_!Xtoa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61959267-df18-490a-801c-b5488a3b0dfe_4864x876.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xtoa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61959267-df18-490a-801c-b5488a3b0dfe_4864x876.png" width="1456" height="262" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61959267-df18-490a-801c-b5488a3b0dfe_4864x876.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:262,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xtoa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61959267-df18-490a-801c-b5488a3b0dfe_4864x876.png 424w, https://substackcdn.com/image/fetch/$s_!Xtoa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61959267-df18-490a-801c-b5488a3b0dfe_4864x876.png 848w, https://substackcdn.com/image/fetch/$s_!Xtoa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61959267-df18-490a-801c-b5488a3b0dfe_4864x876.png 1272w, https://substackcdn.com/image/fetch/$s_!Xtoa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61959267-df18-490a-801c-b5488a3b0dfe_4864x876.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="pullquote"><p>Intuitively, the gradient of the loss function L<sub>DPO</sub> increases the likelihood of the preferred completions y<sub>w</sub> and decreases the likelihood of dispreferred completions y<sub>l</sub>. 
Importantly, the examples are weighed by how much higher the implicit reward model rates the dispreferred completions, scaled by &#946;, i.e, how incorrectly the implicit reward model orders the completions, accounting for the strength of the KL constraint.</p><p>&#8212;Text from [5]&#8212;</p></div><p>Implementing the DPO loss takes only a few lines of PyTorch-style pseudo-code:</p><pre><code>from torch.nn.functional import logsigmoid
# log-ratio of chosen vs. rejected completions under the policy and the frozen reference model
pi_logratios = policy_chosen_logps - policy_rejected_logps
ref_logratios = reference_chosen_logps - reference_rejected_logps
# difference of implicit rewards (before scaling by beta), then the per-example DPO loss
logits = pi_logratios - ref_logratios
loss = -logsigmoid(beta * logits)</code></pre><p>Here, the required inputs are the log-probabilities of the chosen and rejected completions under the policy and the reference model, given data <em>(y<sub>w</sub>, y<sub>l</sub>, x)</em>. The code is adapted from <a href="https://github.com/huggingface/trl/blob/685209716915b010098b7e18b05fbad42af864d3/trl/trainer/dpo_trainer.py">this repository</a>.</p><p>&#10060; A drawback of DPO is that it can rapidly overfit the preference data, even with KL regularization. This limitation is highlighted in [7], and it stems from the sigmoid function in the Bradley-Terry model: a preference probability of 1 corresponds to an infinite reward gap. Concretely, if <em>p*(y &#8827; y&#8242; ) = 1</em>, i.e., <em>y</em> is always preferred to <em>y&#8242;</em>, the DPO solution requires that <em>&#960;*(y&#8242;)/&#960;*(y)=0</em>, i.e., <em>&#960;*(y&#8242;)=0</em>, regardless of the value of <em>&#946;</em>. As preferences become more deterministic, which is common in empirical datasets because data is limited, the KL regularization quickly loses its strength, leading to a poor estimate of <em>&#960;*(y&#8242;)</em>.</p><h4>Alternative Preference Optimization</h4><p>An alternative approach to alignment training without RL or a reward model uses a different objective function [6], which maximizes the margin between the log-likelihoods of the preferred and dispreferred outputs:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7Ys_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400cd986-db0e-4197-83f9-d5dceeebb63f_2880x520.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!7Ys_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400cd986-db0e-4197-83f9-d5dceeebb63f_2880x520.png 424w, https://substackcdn.com/image/fetch/$s_!7Ys_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400cd986-db0e-4197-83f9-d5dceeebb63f_2880x520.png 848w, https://substackcdn.com/image/fetch/$s_!7Ys_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400cd986-db0e-4197-83f9-d5dceeebb63f_2880x520.png 1272w, https://substackcdn.com/image/fetch/$s_!7Ys_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400cd986-db0e-4197-83f9-d5dceeebb63f_2880x520.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7Ys_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400cd986-db0e-4197-83f9-d5dceeebb63f_2880x520.png" width="1456" height="263" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/400cd986-db0e-4197-83f9-d5dceeebb63f_2880x520.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:263,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!7Ys_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400cd986-db0e-4197-83f9-d5dceeebb63f_2880x520.png 424w, https://substackcdn.com/image/fetch/$s_!7Ys_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400cd986-db0e-4197-83f9-d5dceeebb63f_2880x520.png 848w, https://substackcdn.com/image/fetch/$s_!7Ys_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400cd986-db0e-4197-83f9-d5dceeebb63f_2880x520.png 1272w, https://substackcdn.com/image/fetch/$s_!7Ys_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F400cd986-db0e-4197-83f9-d5dceeebb63f_2880x520.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The method, known as Sequence-Likelihood Calibration (&#128073;<strong>SLiC</strong>, [6]), adds a regularization term that maximizes the likelihood of the optimized model on reference outputs. This term serves a similar purpose to the KL term: keeping the optimized model close to the reference one.</p><p>Implementing the SLiC margin loss changes only one line of the DPO pseudo-code:</p><pre><code>loss = relu(1 - beta * logits)</code></pre><p>Here the margin &#120575; is expressed through beta, i.e., &#120575;=1/beta, up to a constant rescaling of the loss.</p><blockquote><p>&#128064; SLiC needs samples <em>y<sub>ref</sub></em> from the reference policy instead of computing a KL divergence term, which may be a limitation because sampling is often slow. 
</p></blockquote><p>A more comprehensive perspective on preference learning, in which RLHF and DPO are special cases, introduces a &#936;-preference optimization (&#128073; <strong>&#936;-PO [7]</strong>) objective: </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OHk8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd237032-f5c3-4462-8557-e71f43490842_3504x1008.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OHk8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd237032-f5c3-4462-8557-e71f43490842_3504x1008.png 424w, https://substackcdn.com/image/fetch/$s_!OHk8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd237032-f5c3-4462-8557-e71f43490842_3504x1008.png 848w, https://substackcdn.com/image/fetch/$s_!OHk8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd237032-f5c3-4462-8557-e71f43490842_3504x1008.png 1272w, https://substackcdn.com/image/fetch/$s_!OHk8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd237032-f5c3-4462-8557-e71f43490842_3504x1008.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OHk8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd237032-f5c3-4462-8557-e71f43490842_3504x1008.png" width="624" height="179.57142857142858" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd237032-f5c3-4462-8557-e71f43490842_3504x1008.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:419,&quot;width&quot;:1456,&quot;resizeWidth&quot;:624,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OHk8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd237032-f5c3-4462-8557-e71f43490842_3504x1008.png 424w, https://substackcdn.com/image/fetch/$s_!OHk8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd237032-f5c3-4462-8557-e71f43490842_3504x1008.png 848w, https://substackcdn.com/image/fetch/$s_!OHk8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd237032-f5c3-4462-8557-e71f43490842_3504x1008.png 1272w, https://substackcdn.com/image/fetch/$s_!OHk8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd237032-f5c3-4462-8557-e71f43490842_3504x1008.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Depending on the choice of &#936;, we can recover the objective of RLHF coupled with the Bradley-Terry preference model. 
In particular, with <em>&#936;(q) = log(q/(1 &#8722; q))  </em>and <em>p*(y &#8827; y &#8242;) = &#963;(r(y) &#8722; r(y &#8242; ))</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HG9G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9231b6b-6b45-4882-87ae-133f13681f72_3260x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HG9G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9231b6b-6b45-4882-87ae-133f13681f72_3260x700.png 424w, https://substackcdn.com/image/fetch/$s_!HG9G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9231b6b-6b45-4882-87ae-133f13681f72_3260x700.png 848w, https://substackcdn.com/image/fetch/$s_!HG9G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9231b6b-6b45-4882-87ae-133f13681f72_3260x700.png 1272w, https://substackcdn.com/image/fetch/$s_!HG9G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9231b6b-6b45-4882-87ae-133f13681f72_3260x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HG9G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9231b6b-6b45-4882-87ae-133f13681f72_3260x700.png" width="476" height="102.32692307692308" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9231b6b-6b45-4882-87ae-133f13681f72_3260x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:313,&quot;width&quot;:1456,&quot;resizeWidth&quot;:476,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HG9G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9231b6b-6b45-4882-87ae-133f13681f72_3260x700.png 424w, https://substackcdn.com/image/fetch/$s_!HG9G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9231b6b-6b45-4882-87ae-133f13681f72_3260x700.png 848w, https://substackcdn.com/image/fetch/$s_!HG9G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9231b6b-6b45-4882-87ae-133f13681f72_3260x700.png 1272w, https://substackcdn.com/image/fetch/$s_!HG9G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9231b6b-6b45-4882-87ae-133f13681f72_3260x700.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here and from now on, we omit <em>x</em> to simplify the notation. As we can see, the RLHF objective is equivalent to the &#936;-PO objective up to a constant offset. 
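</p><p>As a quick numerical sanity check of this equivalence, here is a minimal Python sketch on a made-up toy problem. The reward values, the distribution <em>mu</em> over the comparison response <em>y&#8242;</em>, and all function names are illustrative assumptions, not taken from [7]. It checks that, with the logit choice of &#936; and Bradley-Terry preferences, the expected &#936; term for each response differs from its reward by the same constant.</p><pre><code>import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def psi_logit(q):
    # Psi(q) = log(q / (1 - q)), the choice of Psi discussed above
    return math.log(q / (1.0 - q))

# Hypothetical toy setup: three candidate responses with made-up rewards,
# and a made-up distribution mu over the comparison response y'.
rewards = [1.5, 0.2, -0.7]
mu = [0.5, 0.3, 0.2]

def expected_psi(y):
    # Expectation over y' drawn from mu of Psi(p*(y preferred to y')),
    # with Bradley-Terry preferences p* = sigmoid(r(y) - r(y'))
    return sum(m * psi_logit(sigmoid(rewards[y] - r)) for m, r in zip(mu, rewards))

# Expected reward of the comparison response; it does not depend on y
baseline = sum(m * r for m, r in zip(mu, rewards))

# r(y) minus the expected Psi term should be the same constant for every y
offsets = [rewards[y] - expected_psi(y) for y in range(len(rewards))]
print(offsets, baseline)</code></pre><p>If the sketch is right, every entry of the printed list matches the baseline up to floating-point error. That shared value is the constant term, which does not depend on the policy being optimized.</p><p>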
Using a similar derivation as in DPO, we can find the closed-form solution for the &#936;-PO objective as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uCB8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3f1109-3355-4a42-bf58-9353865adff1_3148x780.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uCB8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3f1109-3355-4a42-bf58-9353865adff1_3148x780.png 424w, https://substackcdn.com/image/fetch/$s_!uCB8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3f1109-3355-4a42-bf58-9353865adff1_3148x780.png 848w, https://substackcdn.com/image/fetch/$s_!uCB8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3f1109-3355-4a42-bf58-9353865adff1_3148x780.png 1272w, https://substackcdn.com/image/fetch/$s_!uCB8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3f1109-3355-4a42-bf58-9353865adff1_3148x780.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uCB8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3f1109-3355-4a42-bf58-9353865adff1_3148x780.png" width="410" height="101.65521978021978" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e3f1109-3355-4a42-bf58-9353865adff1_3148x780.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:361,&quot;width&quot;:1456,&quot;resizeWidth&quot;:410,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uCB8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3f1109-3355-4a42-bf58-9353865adff1_3148x780.png 424w, https://substackcdn.com/image/fetch/$s_!uCB8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3f1109-3355-4a42-bf58-9353865adff1_3148x780.png 848w, https://substackcdn.com/image/fetch/$s_!uCB8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3f1109-3355-4a42-bf58-9353865adff1_3148x780.png 1272w, https://substackcdn.com/image/fetch/$s_!uCB8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3f1109-3355-4a42-bf58-9353865adff1_3148x780.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Although the derivation of &#936;-PO looks similar to DPO&#8217;s, the former offers a more generalized way to optimize the objective, which involves &#936;.  Here, the closed-form solution cannot be computed exactly either (<em>&#8733; </em>means proportional to). 
However, we can get around this technical issue with a division trick:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3oIE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa935e2de-b42c-480b-9c6c-da9c15560daf_3680x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3oIE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa935e2de-b42c-480b-9c6c-da9c15560daf_3680x1000.png 424w, https://substackcdn.com/image/fetch/$s_!3oIE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa935e2de-b42c-480b-9c6c-da9c15560daf_3680x1000.png 848w, https://substackcdn.com/image/fetch/$s_!3oIE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa935e2de-b42c-480b-9c6c-da9c15560daf_3680x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!3oIE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa935e2de-b42c-480b-9c6c-da9c15560daf_3680x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3oIE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa935e2de-b42c-480b-9c6c-da9c15560daf_3680x1000.png" width="546" height="148.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a935e2de-b42c-480b-9c6c-da9c15560daf_3680x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:396,&quot;width&quot;:1456,&quot;resizeWidth&quot;:546,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3oIE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa935e2de-b42c-480b-9c6c-da9c15560daf_3680x1000.png 424w, https://substackcdn.com/image/fetch/$s_!3oIE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa935e2de-b42c-480b-9c6c-da9c15560daf_3680x1000.png 848w, https://substackcdn.com/image/fetch/$s_!3oIE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa935e2de-b42c-480b-9c6c-da9c15560daf_3680x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!3oIE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa935e2de-b42c-480b-9c6c-da9c15560daf_3680x1000.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>After some algebraic manipulation, we end up solving the equation:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!kt6E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc223157f-98df-492e-b409-fae26039899f_2712x1176.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kt6E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc223157f-98df-492e-b409-fae26039899f_2712x1176.png 424w, https://substackcdn.com/image/fetch/$s_!kt6E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc223157f-98df-492e-b409-fae26039899f_2712x1176.png 848w, https://substackcdn.com/image/fetch/$s_!kt6E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc223157f-98df-492e-b409-fae26039899f_2712x1176.png 1272w, https://substackcdn.com/image/fetch/$s_!kt6E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc223157f-98df-492e-b409-fae26039899f_2712x1176.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kt6E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc223157f-98df-492e-b409-fae26039899f_2712x1176.png" width="380" height="164.68406593406593" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c223157f-98df-492e-b409-fae26039899f_2712x1176.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:631,&quot;width&quot;:1456,&quot;resizeWidth&quot;:380,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kt6E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc223157f-98df-492e-b409-fae26039899f_2712x1176.png 424w, https://substackcdn.com/image/fetch/$s_!kt6E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc223157f-98df-492e-b409-fae26039899f_2712x1176.png 848w, https://substackcdn.com/image/fetch/$s_!kt6E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc223157f-98df-492e-b409-fae26039899f_2712x1176.png 1272w, https://substackcdn.com/image/fetch/$s_!kt6E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc223157f-98df-492e-b409-fae26039899f_2712x1176.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; In DPO, the objective is equivalent to maximizing <em>&#963;(h(y,y&#8217;)), </em>which is quite different from the &#936;-PO equation above.</p></blockquote><p>The authors in [7] propose setting &#936; to the identity function, a variant named &#128073;<strong>IPO</strong>, which simplifies the equation to solve:</p><div class="captioned-image-container"><figure><a 
class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6yR2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F753c4b2b-2e4b-4824-b963-204558fe7450_3652x972.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6yR2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F753c4b2b-2e4b-4824-b963-204558fe7450_3652x972.png 424w, https://substackcdn.com/image/fetch/$s_!6yR2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F753c4b2b-2e4b-4824-b963-204558fe7450_3652x972.png 848w, https://substackcdn.com/image/fetch/$s_!6yR2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F753c4b2b-2e4b-4824-b963-204558fe7450_3652x972.png 1272w, https://substackcdn.com/image/fetch/$s_!6yR2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F753c4b2b-2e4b-4824-b963-204558fe7450_3652x972.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6yR2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F753c4b2b-2e4b-4824-b963-204558fe7450_3652x972.png" width="492" height="131.1098901098901" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/753c4b2b-2e4b-4824-b963-204558fe7450_3652x972.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:388,&quot;width&quot;:1456,&quot;resizeWidth&quot;:492,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6yR2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F753c4b2b-2e4b-4824-b963-204558fe7450_3652x972.png 424w, https://substackcdn.com/image/fetch/$s_!6yR2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F753c4b2b-2e4b-4824-b963-204558fe7450_3652x972.png 848w, https://substackcdn.com/image/fetch/$s_!6yR2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F753c4b2b-2e4b-4824-b963-204558fe7450_3652x972.png 1272w, https://substackcdn.com/image/fetch/$s_!6yR2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F753c4b2b-2e4b-4824-b963-204558fe7450_3652x972.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This root-finding problem leads to the following mean-square error loss:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!NN-9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e459c58-7fa3-466c-9e81-1c940c6eac3d_1216x178.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NN-9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e459c58-7fa3-466c-9e81-1c940c6eac3d_1216x178.png 424w, https://substackcdn.com/image/fetch/$s_!NN-9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e459c58-7fa3-466c-9e81-1c940c6eac3d_1216x178.png 848w, https://substackcdn.com/image/fetch/$s_!NN-9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e459c58-7fa3-466c-9e81-1c940c6eac3d_1216x178.png 1272w, https://substackcdn.com/image/fetch/$s_!NN-9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e459c58-7fa3-466c-9e81-1c940c6eac3d_1216x178.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NN-9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e459c58-7fa3-466c-9e81-1c940c6eac3d_1216x178.png" width="588" height="86.07236842105263" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e459c58-7fa3-466c-9e81-1c940c6eac3d_1216x178.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:178,&quot;width&quot;:1216,&quot;resizeWidth&quot;:588,&quot;bytes&quot;:33267,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NN-9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e459c58-7fa3-466c-9e81-1c940c6eac3d_1216x178.png 424w, https://substackcdn.com/image/fetch/$s_!NN-9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e459c58-7fa3-466c-9e81-1c940c6eac3d_1216x178.png 848w, https://substackcdn.com/image/fetch/$s_!NN-9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e459c58-7fa3-466c-9e81-1c940c6eac3d_1216x178.png 1272w, https://substackcdn.com/image/fetch/$s_!NN-9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e459c58-7fa3-466c-9e81-1c940c6eac3d_1216x178.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This loss can be proved to be equivalent to:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DYGj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87253d5f-39f8-4914-9dcc-403ba6b02f26_578x105.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DYGj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87253d5f-39f8-4914-9dcc-403ba6b02f26_578x105.png 424w, https://substackcdn.com/image/fetch/$s_!DYGj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87253d5f-39f8-4914-9dcc-403ba6b02f26_578x105.png 848w, https://substackcdn.com/image/fetch/$s_!DYGj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87253d5f-39f8-4914-9dcc-403ba6b02f26_578x105.png 1272w, https://substackcdn.com/image/fetch/$s_!DYGj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87253d5f-39f8-4914-9dcc-403ba6b02f26_578x105.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DYGj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87253d5f-39f8-4914-9dcc-403ba6b02f26_578x105.png" width="498" height="90.46712802768167" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87253d5f-39f8-4914-9dcc-403ba6b02f26_578x105.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:105,&quot;width&quot;:578,&quot;resizeWidth&quot;:498,&quot;bytes&quot;:16583,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!DYGj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87253d5f-39f8-4914-9dcc-403ba6b02f26_578x105.png 424w, https://substackcdn.com/image/fetch/$s_!DYGj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87253d5f-39f8-4914-9dcc-403ba6b02f26_578x105.png 848w, https://substackcdn.com/image/fetch/$s_!DYGj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87253d5f-39f8-4914-9dcc-403ba6b02f26_578x105.png 1272w, https://substackcdn.com/image/fetch/$s_!DYGj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87253d5f-39f8-4914-9dcc-403ba6b02f26_578x105.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>I</em> is drawn from a Bernoulli distribution with mean <em>p*(y &#8827; y&#8217;). </em>This gives a simple sample-based estimate from pairs of <em>y</em> and <em>y&#8217;</em>: if <em>y</em> is preferred to <em>y&#8217;</em>, then <em>I(y,y&#8217;)=1; </em>otherwise, <em>I(y,y&#8217;)=0. 
</em>This leads to the IPO loss for each pair of samples <em>y<sub>w</sub>, y<sub>l</sub> ~D: </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EcUV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4c5d72-30ce-4d8a-bf61-b3da350d977c_509x142.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EcUV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4c5d72-30ce-4d8a-bf61-b3da350d977c_509x142.png 424w, https://substackcdn.com/image/fetch/$s_!EcUV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4c5d72-30ce-4d8a-bf61-b3da350d977c_509x142.png 848w, https://substackcdn.com/image/fetch/$s_!EcUV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4c5d72-30ce-4d8a-bf61-b3da350d977c_509x142.png 1272w, https://substackcdn.com/image/fetch/$s_!EcUV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4c5d72-30ce-4d8a-bf61-b3da350d977c_509x142.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EcUV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4c5d72-30ce-4d8a-bf61-b3da350d977c_509x142.png" width="411" height="114.66011787819254" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c4c5d72-30ce-4d8a-bf61-b3da350d977c_509x142.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:142,&quot;width&quot;:509,&quot;resizeWidth&quot;:411,&quot;bytes&quot;:17368,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EcUV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4c5d72-30ce-4d8a-bf61-b3da350d977c_509x142.png 424w, https://substackcdn.com/image/fetch/$s_!EcUV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4c5d72-30ce-4d8a-bf61-b3da350d977c_509x142.png 848w, https://substackcdn.com/image/fetch/$s_!EcUV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4c5d72-30ce-4d8a-bf61-b3da350d977c_509x142.png 1272w, https://substackcdn.com/image/fetch/$s_!EcUV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c4c5d72-30ce-4d8a-bf61-b3da350d977c_509x142.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; The coefficient of 1/2 is present because each <em>(y<sub>w</sub>, y<sub>l</sub>)</em> pair contributes two instances of the original loss: <em>(y, y&#8217;) = (y<sub>w</sub>, y<sub>l</sub>)</em> with <em>I=1</em>, and <em>(y, y&#8217;) = (y<sub>l</sub>, y<sub>w</sub>)</em> with <em>I=0</em>. Because <em>h</em> is antisymmetric, i.e. <em>h(y&#8217;, y) = &#8722;h(y, y&#8217;)</em>, while <em>I</em> is not, averaging the two terms and completing the square results in the 
appearance of 1/2.</p></blockquote><p>According to the authors, IPO is less prone to overfitting the preference data than DPO because:</p><div class="pullquote"><p>In other words IPO, unlike DPO, always regularizes its solution towards &#960;<sub>ref</sub> by controlling the gap between the log-likelihood ratios log(&#960;(y<sub>w</sub>)/&#960;(y<sub>l</sub>)) and log(&#960;<sub>ref</sub>(y<sub>w</sub>)/&#960;<sub>ref</sub>(y<sub>l</sub>)), thus avoiding the over-fitting to the preference dataset.</p><p>&#8212;Text from [7]&#8212;</p></div><p>&#10060; Despite its theoretical rigor, IPO seems to perform worse than DPO on practical benchmarks, as showcased in <a href="https://huggingface.co/blog/pref-tuning">this study</a>. Given the DPO code, implementing IPO is straightforward, since only the log-sigmoid loss needs to be replaced with a squared error:</p><pre><code># Assumes logits is the policy log-ratio minus the reference log-ratio for (chosen, rejected), as in typical DPO code.
loss = (logits - 1 / (2 * beta)) ** 2</code></pre><h4>Self-Play Preference Optimization</h4><p>Since LLMs are powerful models, they can be used to generate <a href="https://huggingface.co/datasets/wangrongsheng/comparison_gpt4_data_en">preference feedback</a> for preference finetuning. In practice, researchers often use strong LLMs like GPT-4 to generate preference datasets, where outputs for a given input from larger models are typically considered preferred over those from smaller ones. &#129504; <em>Can we incorporate this concept into the direct preference optimization process to minimize the reliance on human preference data?</em></p><p>A recent paper (&#128073;<strong>SPIN</strong>) answers YES to this question by proposing a self-play alignment training scheme that uses the output of the optimized LLM to construct preference data [8]. In the paper, the authors consider two LLMs and view one (the optimized LLM) as the main player <em>f<sub>t+1</sub> </em>and the other (an old version of the optimized LLM) as the opponent player <em>&#952;<sub>t</sub></em>. Here, the timestep <em>t</em> denotes the optimization step. 
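</p><p>One way to picture the data side of this setup, as a sketch rather than the paper&#8217;s code: each prompt&#8217;s target response is treated as preferred, and a response sampled from the old model (the opponent) is treated as dispreferred. The function names, the dictionary keys, and the toy generator below are hypothetical and only stand in for a real sampling call.</p><pre><code>def build_selfplay_pairs(prompts, references, generate_with_old_model):
    """Pair each target response with a response sampled from the opponent."""
    pairs = []
    for prompt, reference in zip(prompts, references):
        pairs.append({
            "prompt": prompt,
            "chosen": reference,                          # target data, i.e. p_data
            "rejected": generate_with_old_model(prompt),  # sample from the old model
        })
    return pairs

# Hypothetical stand-in for sampling from the previous-iteration model.
def toy_old_model(prompt):
    return prompt.upper()

pairs = build_selfplay_pairs(["hi there"], ["Hello!"], toy_old_model)
print(pairs)</code></pre><p>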
Then the objective function is designed such that the primary player <em>f<sub>t+1</sub></em> maximizes the expected gap between the target data distribution <em>p<sub>data</sub></em> and the opponent player's distribution <em>p<sub>&#952;t</sub></em> :</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eweK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e4802f-df67-4051-9b22-78c990495bf3_4968x1124.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eweK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e4802f-df67-4051-9b22-78c990495bf3_4968x1124.png 424w, https://substackcdn.com/image/fetch/$s_!eweK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e4802f-df67-4051-9b22-78c990495bf3_4968x1124.png 848w, https://substackcdn.com/image/fetch/$s_!eweK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e4802f-df67-4051-9b22-78c990495bf3_4968x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!eweK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e4802f-df67-4051-9b22-78c990495bf3_4968x1124.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eweK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e4802f-df67-4051-9b22-78c990495bf3_4968x1124.png" width="1456" height="329" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28e4802f-df67-4051-9b22-78c990495bf3_4968x1124.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:329,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eweK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e4802f-df67-4051-9b22-78c990495bf3_4968x1124.png 424w, https://substackcdn.com/image/fetch/$s_!eweK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e4802f-df67-4051-9b22-78c990495bf3_4968x1124.png 848w, https://substackcdn.com/image/fetch/$s_!eweK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e4802f-df67-4051-9b22-78c990495bf3_4968x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!eweK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28e4802f-df67-4051-9b22-78c990495bf3_4968x1124.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In essence, the main player aims to differentiate between the true data and the generated data from the opponent player. On the other hand, the opponent player&#8217;s goal is to generate outputs that are assigned high probability by the main player. 
In other words, it tries to fool the main player, so the opponent player&#8217;s objective is to maximize:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-4NH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79456c9a-d918-485d-8cb1-ad2d8845dbee_678x90.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-4NH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79456c9a-d918-485d-8cb1-ad2d8845dbee_678x90.png 424w, https://substackcdn.com/image/fetch/$s_!-4NH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79456c9a-d918-485d-8cb1-ad2d8845dbee_678x90.png 848w, https://substackcdn.com/image/fetch/$s_!-4NH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79456c9a-d918-485d-8cb1-ad2d8845dbee_678x90.png 1272w, https://substackcdn.com/image/fetch/$s_!-4NH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79456c9a-d918-485d-8cb1-ad2d8845dbee_678x90.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-4NH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79456c9a-d918-485d-8cb1-ad2d8845dbee_678x90.png" width="374" height="49.64601769911504" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79456c9a-d918-485d-8cb1-ad2d8845dbee_678x90.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:90,&quot;width&quot;:678,&quot;resizeWidth&quot;:374,&quot;bytes&quot;:17953,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-4NH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79456c9a-d918-485d-8cb1-ad2d8845dbee_678x90.png 424w, https://substackcdn.com/image/fetch/$s_!-4NH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79456c9a-d918-485d-8cb1-ad2d8845dbee_678x90.png 848w, https://substackcdn.com/image/fetch/$s_!-4NH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79456c9a-d918-485d-8cb1-ad2d8845dbee_678x90.png 1272w, https://substackcdn.com/image/fetch/$s_!-4NH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79456c9a-d918-485d-8cb1-ad2d8845dbee_678x90.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Adding the usual KL-divergence term, which keeps the updated opponent model from drifting too far from its previous version, we arrive at the following problem for finding the optimal opponent player:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!g5cW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2740af5-655b-4ac0-afaa-e9c93625a632_4188x892.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g5cW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2740af5-655b-4ac0-afaa-e9c93625a632_4188x892.png 424w, https://substackcdn.com/image/fetch/$s_!g5cW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2740af5-655b-4ac0-afaa-e9c93625a632_4188x892.png 848w, https://substackcdn.com/image/fetch/$s_!g5cW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2740af5-655b-4ac0-afaa-e9c93625a632_4188x892.png 1272w, https://substackcdn.com/image/fetch/$s_!g5cW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2740af5-655b-4ac0-afaa-e9c93625a632_4188x892.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g5cW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2740af5-655b-4ac0-afaa-e9c93625a632_4188x892.png" width="1456" height="310" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2740af5-655b-4ac0-afaa-e9c93625a632_4188x892.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:310,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g5cW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2740af5-655b-4ac0-afaa-e9c93625a632_4188x892.png 424w, https://substackcdn.com/image/fetch/$s_!g5cW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2740af5-655b-4ac0-afaa-e9c93625a632_4188x892.png 848w, https://substackcdn.com/image/fetch/$s_!g5cW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2740af5-655b-4ac0-afaa-e9c93625a632_4188x892.png 1272w, https://substackcdn.com/image/fetch/$s_!g5cW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2740af5-655b-4ac0-afaa-e9c93625a632_4188x892.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; The self-play concept is very similar to <a href="https://arxiv.org/abs/1406.2661">GAN</a>. 
</p></blockquote><p>Since we can model the optimal opponent player with an LLM with parameters <em>&#952;<sub>t+1</sub></em>, we obtain the following relationship between the opponent and the main player:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HcXA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cbfe4fc-a912-4638-95c8-af25593b957f_2568x712.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HcXA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cbfe4fc-a912-4638-95c8-af25593b957f_2568x712.png 424w, https://substackcdn.com/image/fetch/$s_!HcXA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cbfe4fc-a912-4638-95c8-af25593b957f_2568x712.png 848w, https://substackcdn.com/image/fetch/$s_!HcXA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cbfe4fc-a912-4638-95c8-af25593b957f_2568x712.png 1272w, https://substackcdn.com/image/fetch/$s_!HcXA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cbfe4fc-a912-4638-95c8-af25593b957f_2568x712.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HcXA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cbfe4fc-a912-4638-95c8-af25593b957f_2568x712.png" width="522" height="144.84065934065933" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5cbfe4fc-a912-4638-95c8-af25593b957f_2568x712.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:404,&quot;width&quot;:1456,&quot;resizeWidth&quot;:522,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HcXA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cbfe4fc-a912-4638-95c8-af25593b957f_2568x712.png 424w, https://substackcdn.com/image/fetch/$s_!HcXA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cbfe4fc-a912-4638-95c8-af25593b957f_2568x712.png 848w, https://substackcdn.com/image/fetch/$s_!HcXA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cbfe4fc-a912-4638-95c8-af25593b957f_2568x712.png 1272w, https://substackcdn.com/image/fetch/$s_!HcXA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cbfe4fc-a912-4638-95c8-af25593b957f_2568x712.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>As the name implies, in self-play, the main and opponent players are the same model, from different versions. 
The relationship defined above suggests a self-play update procedure as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!72Am!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe991b3be-18c3-4b40-a65f-260ec22a93c2_1174x201.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!72Am!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe991b3be-18c3-4b40-a65f-260ec22a93c2_1174x201.gif 424w, https://substackcdn.com/image/fetch/$s_!72Am!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe991b3be-18c3-4b40-a65f-260ec22a93c2_1174x201.gif 848w, https://substackcdn.com/image/fetch/$s_!72Am!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe991b3be-18c3-4b40-a65f-260ec22a93c2_1174x201.gif 1272w, https://substackcdn.com/image/fetch/$s_!72Am!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe991b3be-18c3-4b40-a65f-260ec22a93c2_1174x201.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!72Am!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe991b3be-18c3-4b40-a65f-260ec22a93c2_1174x201.gif" width="1174" height="201" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e991b3be-18c3-4b40-a65f-260ec22a93c2_1174x201.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:201,&quot;width&quot;:1174,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:129046,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!72Am!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe991b3be-18c3-4b40-a65f-260ec22a93c2_1174x201.gif 424w, https://substackcdn.com/image/fetch/$s_!72Am!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe991b3be-18c3-4b40-a65f-260ec22a93c2_1174x201.gif 848w, https://substackcdn.com/image/fetch/$s_!72Am!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe991b3be-18c3-4b40-a65f-260ec22a93c2_1174x201.gif 1272w, https://substackcdn.com/image/fetch/$s_!72Am!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe991b3be-18c3-4b40-a65f-260ec22a93c2_1174x201.gif 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Given this relationship, 
to find <em>&#952; </em>for both players, we just need to optimize the main player objective, which is equivalent to minimizing the following loss:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-O6-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de97c3d-1061-46d4-80ea-7014726ff039_1227x116.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-O6-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de97c3d-1061-46d4-80ea-7014726ff039_1227x116.png 424w, https://substackcdn.com/image/fetch/$s_!-O6-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de97c3d-1061-46d4-80ea-7014726ff039_1227x116.png 848w, https://substackcdn.com/image/fetch/$s_!-O6-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de97c3d-1061-46d4-80ea-7014726ff039_1227x116.png 1272w, https://substackcdn.com/image/fetch/$s_!-O6-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de97c3d-1061-46d4-80ea-7014726ff039_1227x116.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-O6-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de97c3d-1061-46d4-80ea-7014726ff039_1227x116.png" width="1227" height="116" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7de97c3d-1061-46d4-80ea-7014726ff039_1227x116.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:116,&quot;width&quot;:1227,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48815,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-O6-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de97c3d-1061-46d4-80ea-7014726ff039_1227x116.png 424w, https://substackcdn.com/image/fetch/$s_!-O6-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de97c3d-1061-46d4-80ea-7014726ff039_1227x116.png 848w, https://substackcdn.com/image/fetch/$s_!-O6-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de97c3d-1061-46d4-80ea-7014726ff039_1227x116.png 1272w, https://substackcdn.com/image/fetch/$s_!-O6-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de97c3d-1061-46d4-80ea-7014726ff039_1227x116.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This leads to the SPIN algorithm: </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zPxJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2152646-c036-4330-b0ea-8280c9e56949_1143x342.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zPxJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2152646-c036-4330-b0ea-8280c9e56949_1143x342.png 424w, https://substackcdn.com/image/fetch/$s_!zPxJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2152646-c036-4330-b0ea-8280c9e56949_1143x342.png 848w, https://substackcdn.com/image/fetch/$s_!zPxJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2152646-c036-4330-b0ea-8280c9e56949_1143x342.png 1272w, https://substackcdn.com/image/fetch/$s_!zPxJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2152646-c036-4330-b0ea-8280c9e56949_1143x342.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zPxJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2152646-c036-4330-b0ea-8280c9e56949_1143x342.png" width="1143" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2152646-c036-4330-b0ea-8280c9e56949_1143x342.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:1143,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:97543,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!zPxJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2152646-c036-4330-b0ea-8280c9e56949_1143x342.png 424w, https://substackcdn.com/image/fetch/$s_!zPxJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2152646-c036-4330-b0ea-8280c9e56949_1143x342.png 848w, https://substackcdn.com/image/fetch/$s_!zPxJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2152646-c036-4330-b0ea-8280c9e56949_1143x342.png 1272w, https://substackcdn.com/image/fetch/$s_!zPxJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2152646-c036-4330-b0ea-8280c9e56949_1143x342.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><blockquote><p>&#128064; The SPIN loss looks similar to DPO&#8217;s loss. Here are the differences:</p><ul><li><p>SPIN does not require preference labels. It pairs each ground-truth response from the SFT data with a response generated by the previous (weaker) model and assumes the ground-truth response is preferred. </p></li><li><p>DPO requires that, at the instance level, <em>y<sub>w</sub></em> is better than <em>y<sub>l</sub></em>. In contrast, SPIN only requires that, at the distribution level, the target <em>p<sub>data</sub></em> is distinguishable from the weak LLM <em>p<sub>&#952;</sub></em> before it becomes strong.</p></li></ul></blockquote><p>&#10060; SPIN must generate new data at every iteration, which can be slow and expensive. Although it eliminates the need for preference data, it still requires a supervised finetuning dataset, i.e., the ground truth <em>y</em> for each <em>x</em>.</p><div><hr></div><h2>Finetuning with Rating Feedback</h2><p>As demonstrated in earlier sections, the Bradley-Terry model has been extensively used in alignment training of LLMs. However, it is not the only preference model available. 
Another option is to utilize a different preference model, namely, the <a href="https://en.wikipedia.org/wiki/Prospect_theory">human value function</a> proposed by Kahneman and Tversky:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OMwR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ed593d-2bde-47f8-b61b-9fa27a09fe4e_5260x1352.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OMwR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ed593d-2bde-47f8-b61b-9fa27a09fe4e_5260x1352.png 424w, https://substackcdn.com/image/fetch/$s_!OMwR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ed593d-2bde-47f8-b61b-9fa27a09fe4e_5260x1352.png 848w, https://substackcdn.com/image/fetch/$s_!OMwR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ed593d-2bde-47f8-b61b-9fa27a09fe4e_5260x1352.png 1272w, https://substackcdn.com/image/fetch/$s_!OMwR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ed593d-2bde-47f8-b61b-9fa27a09fe4e_5260x1352.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OMwR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ed593d-2bde-47f8-b61b-9fa27a09fe4e_5260x1352.png" width="1456" height="374" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09ed593d-2bde-47f8-b61b-9fa27a09fe4e_5260x1352.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:374,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OMwR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ed593d-2bde-47f8-b61b-9fa27a09fe4e_5260x1352.png 424w, https://substackcdn.com/image/fetch/$s_!OMwR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ed593d-2bde-47f8-b61b-9fa27a09fe4e_5260x1352.png 848w, https://substackcdn.com/image/fetch/$s_!OMwR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ed593d-2bde-47f8-b61b-9fa27a09fe4e_5260x1352.png 1272w, https://substackcdn.com/image/fetch/$s_!OMwR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09ed593d-2bde-47f8-b61b-9fa27a09fe4e_5260x1352.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The idea originates from the concept of loss aversion in behavioral economics, which observes that individuals tend to perceive losses as more significant than equivalent gains. It revolves around the notion that people assess their utility based on "gains" and "losses" relative to a specific reference point. This reference point varies among individuals and is relative to their unique circumstances. </p><p>Interestingly, previous approaches that utilize rewards (implicit or explicit) to model the human value function can be seen as variations of <em>h</em>, assuming that <em>z<sub>ref</sub></em> represents the reward for the dispreferred output. 
The figure below summarizes the idea:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Mi4e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d422-7b07-4db4-a2ed-9fbd89d2e70c_6108x1840.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Mi4e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d422-7b07-4db4-a2ed-9fbd89d2e70c_6108x1840.png 424w, https://substackcdn.com/image/fetch/$s_!Mi4e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d422-7b07-4db4-a2ed-9fbd89d2e70c_6108x1840.png 848w, https://substackcdn.com/image/fetch/$s_!Mi4e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d422-7b07-4db4-a2ed-9fbd89d2e70c_6108x1840.png 1272w, https://substackcdn.com/image/fetch/$s_!Mi4e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d422-7b07-4db4-a2ed-9fbd89d2e70c_6108x1840.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Mi4e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d422-7b07-4db4-a2ed-9fbd89d2e70c_6108x1840.png" width="1456" height="439" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1a1d422-7b07-4db4-a2ed-9fbd89d2e70c_6108x1840.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:439,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Mi4e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d422-7b07-4db4-a2ed-9fbd89d2e70c_6108x1840.png 424w, https://substackcdn.com/image/fetch/$s_!Mi4e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d422-7b07-4db4-a2ed-9fbd89d2e70c_6108x1840.png 848w, https://substackcdn.com/image/fetch/$s_!Mi4e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d422-7b07-4db4-a2ed-9fbd89d2e70c_6108x1840.png 1272w, https://substackcdn.com/image/fetch/$s_!Mi4e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1a1d422-7b07-4db4-a2ed-9fbd89d2e70c_6108x1840.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Alignment loss from the perspective of the human value function. Images taken from [11].</figcaption></figure></div><p>All of these losses share key characteristics of the human value function:</p><div class="pullquote"><p>1. the existence of a reference point that is added or subtracted to get the relative gain or loss</p><p> 2. convexity of the value function in relative losses and concavity in gains (i.e., diminishing sensitivity the further you are from the reference point) </p><p>3. loss-aversion (a greater rate of change in utility in the loss regime)</p><p>&#8212;Text from [11]&#8212;</p></div><p>Given these observations, it seems more natural to use Kahneman and Tversky's human value function directly, rather than other alignment losses, because it may fit preference data collected from real humans more easily. 
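</p><p>The three properties quoted above can be checked numerically. The following is a minimal sketch of the Kahneman-Tversky value function in its classic power-law form, not the loss used in [11]. The function name <code>kt_value</code> is hypothetical, and the defaults <code>alpha = 0.88</code> and <code>lam = 2.25</code> (the median estimates reported by Tversky and Kahneman in 1992) are assumptions chosen for illustration:</p><pre><code>def kt_value(z, z_ref=0.0, alpha=0.88, lam=2.25):
    """Classic power-law Kahneman-Tversky value of outcome z.

    alpha and lam are assumed illustrative defaults, not values from [11].
    """
    diff = z - z_ref                  # property 1: measure relative to a reference point
    gain = max(0.0, diff) ** alpha    # property 2: diminishing sensitivity in gains (alpha below 1)
    loss = max(0.0, -diff) ** alpha   # property 2: diminishing sensitivity in losses
    return gain - lam * loss          # property 3: losses are weighted lam times more


if __name__ == "__main__":
    # Print the value of a few outcomes around the reference point z_ref = 0.
    for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(z, round(kt_value(z), 3))
</code></pre><p>With these assumed defaults, a one-unit loss is weighted 2.25 times as heavily as a one-unit gain, and each additional unit of gain adds less value than the one before it.</p><p>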
&#129504; <em>How can we employ the Kahneman and Tversky</em> <em>model for optimizing preferences in LLMs?</em></p><p>In [11], the authors propose one way to implement the human value function in aligning LLMs and call the method Kahneman-Tversky Optimization (&#128073; <strong>KTO</strong>). The approach suggests eliminating the use of pairs of labeled outputs for training. Instead, we only use a single output and compare it with an estimated z<sub>ref</sub>. The authors also suggest using a sigmoid <em>h</em> instead of an exponential one for ease of optimization, resulting in the modified value function:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Su9P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76160c3d-f009-45c2-b9f4-2e3f154fee43_2380x680.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Su9P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76160c3d-f009-45c2-b9f4-2e3f154fee43_2380x680.png 424w, https://substackcdn.com/image/fetch/$s_!Su9P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76160c3d-f009-45c2-b9f4-2e3f154fee43_2380x680.png 848w, https://substackcdn.com/image/fetch/$s_!Su9P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76160c3d-f009-45c2-b9f4-2e3f154fee43_2380x680.png 1272w, https://substackcdn.com/image/fetch/$s_!Su9P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76160c3d-f009-45c2-b9f4-2e3f154fee43_2380x680.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Su9P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76160c3d-f009-45c2-b9f4-2e3f154fee43_2380x680.png" width="588" height="168" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/76160c3d-f009-45c2-b9f4-2e3f154fee43_2380x680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:416,&quot;width&quot;:1456,&quot;resizeWidth&quot;:588,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Su9P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76160c3d-f009-45c2-b9f4-2e3f154fee43_2380x680.png 424w, https://substackcdn.com/image/fetch/$s_!Su9P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76160c3d-f009-45c2-b9f4-2e3f154fee43_2380x680.png 848w, https://substackcdn.com/image/fetch/$s_!Su9P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76160c3d-f009-45c2-b9f4-2e3f154fee43_2380x680.png 1272w, https://substackcdn.com/image/fetch/$s_!Su9P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76160c3d-f009-45c2-b9f4-2e3f154fee43_2380x680.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>To simulate the two forms of the value for the gain and loss regime, the final function looks like this:</p><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P0uJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2cd3e7-d0c7-4250-a439-4d4ed168d15e_3856x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P0uJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2cd3e7-d0c7-4250-a439-4d4ed168d15e_3856x1048.png 424w, https://substackcdn.com/image/fetch/$s_!P0uJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2cd3e7-d0c7-4250-a439-4d4ed168d15e_3856x1048.png 848w, https://substackcdn.com/image/fetch/$s_!P0uJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2cd3e7-d0c7-4250-a439-4d4ed168d15e_3856x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!P0uJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2cd3e7-d0c7-4250-a439-4d4ed168d15e_3856x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P0uJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2cd3e7-d0c7-4250-a439-4d4ed168d15e_3856x1048.png" width="1456" height="396" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d2cd3e7-d0c7-4250-a439-4d4ed168d15e_3856x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:396,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P0uJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2cd3e7-d0c7-4250-a439-4d4ed168d15e_3856x1048.png 424w, https://substackcdn.com/image/fetch/$s_!P0uJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2cd3e7-d0c7-4250-a439-4d4ed168d15e_3856x1048.png 848w, https://substackcdn.com/image/fetch/$s_!P0uJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2cd3e7-d0c7-4250-a439-4d4ed168d15e_3856x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!P0uJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2cd3e7-d0c7-4250-a439-4d4ed168d15e_3856x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here, we only need a label for one output <em>y</em>, and the labeler aims to answer the question: &#129504; <em>is y desirable? </em>In practice, this can be implemented as a like or dislike button to rate an output from the model. 
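</p><p>To make the two regimes concrete, here is a minimal Python sketch of the sigmoid-based value function. It is only an illustration under stated assumptions: the name <em>kto_value</em>, its arguments, and the toy numbers are hypothetical, the input is assumed to be the log-ratio between the policy and the reference model for the rated output, and any weighting of the two regimes is left out.</p><pre><code>import math

def sigmoid(t):
    # Logistic function 1 / (1 + exp(-t))
    return 1.0 / (1.0 + math.exp(-t))

def kto_value(log_ratio, z_ref, desirable):
    """Sigmoid-shaped value of one rated output relative to the reference point.

    log_ratio: log pi_theta(y|x) - log pi_ref(y|x) for the output y (assumed input)
    z_ref: the estimated reference point
    desirable: True for a "like", False for a "dislike"
    """
    if desirable:
        return sigmoid(log_ratio - z_ref)  # gain regime
    return sigmoid(z_ref - log_ratio)      # loss regime

# Toy numbers, purely illustrative
liked = kto_value(log_ratio=0.8, z_ref=0.1, desirable=True)
disliked = kto_value(log_ratio=0.8, z_ref=0.1, desirable=False)
</code></pre><p>In this sketch, a liked output is valued above 0.5 only when its log-ratio exceeds z<sub>ref</sub>, and the same output marked as disliked receives the complementary value.</p><p>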
Given the form of the value function, we aim to minimize the following loss:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yMhO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d835d27-eb33-4132-8c43-f48dd3c0cb2d_3508x968.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yMhO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d835d27-eb33-4132-8c43-f48dd3c0cb2d_3508x968.png 424w, https://substackcdn.com/image/fetch/$s_!yMhO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d835d27-eb33-4132-8c43-f48dd3c0cb2d_3508x968.png 848w, https://substackcdn.com/image/fetch/$s_!yMhO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d835d27-eb33-4132-8c43-f48dd3c0cb2d_3508x968.png 1272w, https://substackcdn.com/image/fetch/$s_!yMhO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d835d27-eb33-4132-8c43-f48dd3c0cb2d_3508x968.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yMhO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d835d27-eb33-4132-8c43-f48dd3c0cb2d_3508x968.png" width="1456" height="402" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d835d27-eb33-4132-8c43-f48dd3c0cb2d_3508x968.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:402,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yMhO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d835d27-eb33-4132-8c43-f48dd3c0cb2d_3508x968.png 424w, https://substackcdn.com/image/fetch/$s_!yMhO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d835d27-eb33-4132-8c43-f48dd3c0cb2d_3508x968.png 848w, https://substackcdn.com/image/fetch/$s_!yMhO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d835d27-eb33-4132-8c43-f48dd3c0cb2d_3508x968.png 1272w, https://substackcdn.com/image/fetch/$s_!yMhO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d835d27-eb33-4132-8c43-f48dd3c0cb2d_3508x968.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In practice, when implementing KTO, we need to estimate the KL  term as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hRHf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e1903fc-213f-4579-be50-901ed06a755f_646x92.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hRHf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e1903fc-213f-4579-be50-901ed06a755f_646x92.png 424w, https://substackcdn.com/image/fetch/$s_!hRHf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e1903fc-213f-4579-be50-901ed06a755f_646x92.png 848w, 
https://substackcdn.com/image/fetch/$s_!hRHf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e1903fc-213f-4579-be50-901ed06a755f_646x92.png 1272w, https://substackcdn.com/image/fetch/$s_!hRHf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e1903fc-213f-4579-be50-901ed06a755f_646x92.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hRHf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e1903fc-213f-4579-be50-901ed06a755f_646x92.png" width="436" height="62.092879256965944" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e1903fc-213f-4579-be50-901ed06a755f_646x92.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:92,&quot;width&quot;:646,&quot;resizeWidth&quot;:436,&quot;bytes&quot;:21470,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hRHf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e1903fc-213f-4579-be50-901ed06a755f_646x92.png 424w, https://substackcdn.com/image/fetch/$s_!hRHf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e1903fc-213f-4579-be50-901ed06a755f_646x92.png 848w, 
https://substackcdn.com/image/fetch/$s_!hRHf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e1903fc-213f-4579-be50-901ed06a755f_646x92.png 1272w, https://substackcdn.com/image/fetch/$s_!hRHf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e1903fc-213f-4579-be50-901ed06a755f_646x92.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>m</em> is the batch size and <em>z</em> is an unrelated output (used to create a clear gap between the reference and the value). For convenience, if <em>y</em> is the desirable (chosen) output, <em>z</em> can be taken as the rejected one, and vice versa. The max operator ensures that the KL estimate stays greater than or equal to 0. Note that in practice, we reuse the preference data as rating data, with <em>y<sub>w</sub></em> as <em>y<sub>desirable</sub></em> and <em>y<sub>l</sub></em> as <em>y<sub>undesirable</sub></em>, and thus we can compute 2 KTO losses per (<em>y<sub>w</sub>,y<sub>l</sub>,x</em>). Concretely, we can implement the KTO losses as follows:</p><pre><code>from torch import sigmoid
# KL estimates over the batch of log-prob tensors; clamp(min=0) is the max operator
chosen_KL = (policy_chosen_logps - reference_chosen_logps).mean().clamp(min=0)
rejected_KL = (policy_rejected_logps - reference_rejected_logps).mean().clamp(min=0)
# Per-example log-ratios between the policy and the reference model
chosen_logratios = policy_chosen_logps - reference_chosen_logps
rejected_logratios = policy_rejected_logps - reference_rejected_logps
# As in [11], lambda_d and lambda_u weight the two losses and beta scales the log-ratios
chosen_loss = lambda_d * (1 - sigmoid(beta * (chosen_logratios - rejected_KL)))
rejected_loss = lambda_u * (1 - sigmoid(beta * (chosen_KL - rejected_logratios)))</code></pre><p>A major benefit of KTO is label efficiency. Instead of generating two outputs and asking humans for a comparison, as DPO does, KTO generates only one output and asks for a rating:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zYAG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c6f07b-e449-4c45-9a70-c77f5c8b345e_1052x460.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zYAG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c6f07b-e449-4c45-9a70-c77f5c8b345e_1052x460.png 424w, https://substackcdn.com/image/fetch/$s_!zYAG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c6f07b-e449-4c45-9a70-c77f5c8b345e_1052x460.png 848w, https://substackcdn.com/image/fetch/$s_!zYAG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c6f07b-e449-4c45-9a70-c77f5c8b345e_1052x460.png 1272w, https://substackcdn.com/image/fetch/$s_!zYAG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c6f07b-e449-4c45-9a70-c77f5c8b345e_1052x460.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zYAG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c6f07b-e449-4c45-9a70-c77f5c8b345e_1052x460.png" width="1052" height="460" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08c6f07b-e449-4c45-9a70-c77f5c8b345e_1052x460.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:460,&quot;width&quot;:1052,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:189543,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zYAG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c6f07b-e449-4c45-9a70-c77f5c8b345e_1052x460.png 424w, https://substackcdn.com/image/fetch/$s_!zYAG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c6f07b-e449-4c45-9a70-c77f5c8b345e_1052x460.png 848w, https://substackcdn.com/image/fetch/$s_!zYAG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c6f07b-e449-4c45-9a70-c77f5c8b345e_1052x460.png 1272w, https://substackcdn.com/image/fetch/$s_!zYAG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c6f07b-e449-4c45-9a70-c77f5c8b345e_1052x460.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">DPO vs KTO: Difference in labeling process. Images taken from [11].</figcaption></figure></div><blockquote><p>&#128064; As a result, with KTO, we can collect more data and labels from humans, which generally improves the performance of the aligned LLMs.</p></blockquote><p>&#10060; KTO necessitates KL estimation, which demands a large batch size <em>m</em> for stable results. Consequently, this increases computational and memory requirements, making it less suitable for low-resource machines.</p><div><hr></div><h2>References</h2><p>[1] Huang, Tzu-Kuo, Ruby C. Weng, Chih-Jen Lin, and Greg Ridgeway. "Generalized Bradley-Terry Models and Multi-Class Probability Estimates." <em>Journal of Machine Learning Research</em> 7, no. 1 (2006).</p><p>[2] Schoenauer, Marc, Riad Akrour, Michele Sebag, and Jean-christophe Souplet. "Programming by feedback." In <em>International Conference on Machine Learning</em>, pp. 1503-1511. 
PMLR, 2014.</p><p>[3] Christiano, Paul F., Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. "Deep reinforcement learning from human preferences." <em>Advances in neural information processing systems</em> 30 (2017).</p><p>[4] Ouyang, Long, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang et al. "Training language models to follow instructions with human feedback." <em>Advances in Neural Information Processing Systems</em> 35 (2022): 27730-27744.</p><p>[5] Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. "Direct preference optimization: Your language model is secretly a reward model." <em>arXiv preprint arXiv:2305.18290</em> (2023).</p><p>[6] Zhao, Yao, Misha Khalman, Rishabh Joshi, Shashi Narayan, Mohammad Saleh, and Peter J. Liu. "Calibrating sequence likelihood improves conditional language generation." <em>arXiv preprint arXiv:2210.00045</em> (2022).</p><p>[7] Azar, Mohammad Gheshlaghi, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and R&#233;mi Munos. "A general theoretical paradigm to understand learning from human preferences." <em>arXiv preprint arXiv:2310.12036</em> (2023).</p><p>[8] Chen, Zixiang, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. "Self-play fine-tuning converts weak language models to strong language models." <em>arXiv preprint arXiv:2401.01335</em> (2024).</p><p>[9] Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen et al. "Constitutional ai: Harmlessness from ai feedback." <em>arXiv preprint arXiv:2212.08073</em> (2022).</p><p>[10] Lee, Harrison, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. "Rlaif: Scaling reinforcement learning from human feedback with ai feedback." 
<em>arXiv preprint arXiv:2309.00267</em> (2023).</p><p>[11] Ethayarajh, Kawin, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. "KTO: Model Alignment as Prospect Theoretic Optimization." <em>arXiv preprint arXiv:2402.01306</em> (2024).</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://hungleai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><strong>I hope you enjoy the article. Stay tuned for the newest and exclusive content by subscribing to Neurocoder Tales! </strong><em>Disclaimer:</em> <em>While every effort is made to provide accurate and unbiased information, errors may occur. Let me know if you catch any error.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Curious Agents Saga: Part 2 ]]></title><description><![CDATA[Novelty or Surprise: How to Make Your Deep Reinforcement Learning Agents Curious?]]></description><link>https://hungleai.substack.com/p/curious-agents-saga-part-2</link><guid isPermaLink="false">https://hungleai.substack.com/p/curious-agents-saga-part-2</guid><dc:creator><![CDATA[Hung Le]]></dc:creator><pubDate>Sat, 03 Feb 2024 12:48:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!et-B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdcd701e-7fc9-452b-84df-53b891faebad_640x480.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h4>Table of 
Contents</h4><ul><li><p><a href="https://hungleai.substack.com/i/140993309/challenges-in-exploration">Challenges in Exploration</a></p></li><li><p><a href="https://hungleai.substack.com/i/140993309/novelty">Novelty</a></p><ul><li><p><a href="https://hungleai.substack.com/i/140993309/state-counting">State Counting</a></p></li><li><p><a href="https://hungleai.substack.com/i/140993309/change-counting">Change Counting</a></p></li><li><p><a href="https://hungleai.substack.com/i/140993309/novelty-through-reachability">Novelty through Reachability</a></p></li><li><p><a href="https://hungleai.substack.com/i/140993309/novelty-via-reconstruction">Novelty via Reconstruction</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/140993309/surprise">Surprise</a></p><ul><li><p><a href="https://hungleai.substack.com/i/140993309/predictive-surprise">Predictive Surprise</a></p></li><li><p><a href="https://hungleai.substack.com/i/140993309/bayesian-surprise">Bayesian Surprise</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/140993309/hybrid-approaches">Hybrid Approaches</a></p><ul><li><p><a href="https://hungleai.substack.com/i/140993309/surprise-novelty">Surprise + Novelty</a></p></li><li><p><a href="https://hungleai.substack.com/i/140993309/novelty-of-surprise">Novelty of Surprise</a></p></li></ul></li></ul><div><hr></div><h2>Challenges in Exploration</h2><p>In the <a href="https://hungleai.substack.com/p/curious-agents-saga-part-1">preceding post</a>, we studied classical exploration strategies that are primarily grounded in theoretical justifications under na&#239;ve assumptions and are therefore challenging to scale to complex scenarios. 
&#129504; <em>Why is scaling a big problem?</em> Practical environments often involve huge continuous state and action spaces, as exemplified below:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!et-B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdcd701e-7fc9-452b-84df-53b891faebad_640x480.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!et-B!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdcd701e-7fc9-452b-84df-53b891faebad_640x480.gif 424w, https://substackcdn.com/image/fetch/$s_!et-B!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdcd701e-7fc9-452b-84df-53b891faebad_640x480.gif 848w, https://substackcdn.com/image/fetch/$s_!et-B!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdcd701e-7fc9-452b-84df-53b891faebad_640x480.gif 1272w, https://substackcdn.com/image/fetch/$s_!et-B!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdcd701e-7fc9-452b-84df-53b891faebad_640x480.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!et-B!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdcd701e-7fc9-452b-84df-53b891faebad_640x480.gif" width="270" height="202.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cdcd701e-7fc9-452b-84df-53b891faebad_640x480.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:480,&quot;width&quot;:640,&quot;resizeWidth&quot;:270,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Reinforcement Learning: Playing Doom with PyTorch&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Reinforcement Learning: Playing Doom with PyTorch" title="Reinforcement Learning: Playing Doom with PyTorch" srcset="https://substackcdn.com/image/fetch/$s_!et-B!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdcd701e-7fc9-452b-84df-53b891faebad_640x480.gif 424w, https://substackcdn.com/image/fetch/$s_!et-B!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdcd701e-7fc9-452b-84df-53b891faebad_640x480.gif 848w, https://substackcdn.com/image/fetch/$s_!et-B!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdcd701e-7fc9-452b-84df-53b891faebad_640x480.gif 1272w, https://substackcdn.com/image/fetch/$s_!et-B!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdcd701e-7fc9-452b-84df-53b891faebad_640x480.gif 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Doom environment: continuous high-dimensional state space (<a href="https://brandonmorris.dev/2018/10/09/dql-vizdoom/">source</a>). 
</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7t3v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33e608a1-3763-49bd-978b-7e44f5835f0a_500x500.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7t3v!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33e608a1-3763-49bd-978b-7e44f5835f0a_500x500.gif 424w, https://substackcdn.com/image/fetch/$s_!7t3v!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33e608a1-3763-49bd-978b-7e44f5835f0a_500x500.gif 848w, https://substackcdn.com/image/fetch/$s_!7t3v!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33e608a1-3763-49bd-978b-7e44f5835f0a_500x500.gif 1272w, https://substackcdn.com/image/fetch/$s_!7t3v!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33e608a1-3763-49bd-978b-7e44f5835f0a_500x500.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7t3v!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33e608a1-3763-49bd-978b-7e44f5835f0a_500x500.gif" width="272" height="272" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33e608a1-3763-49bd-978b-7e44f5835f0a_500x500.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:500,&quot;resizeWidth&quot;:272,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Installing MuJoCo to Work With OpenAI Gym 
Environments&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Installing MuJoCo to Work With OpenAI Gym Environments" title="Installing MuJoCo to Work With OpenAI Gym Environments" srcset="https://substackcdn.com/image/fetch/$s_!7t3v!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33e608a1-3763-49bd-978b-7e44f5835f0a_500x500.gif 424w, https://substackcdn.com/image/fetch/$s_!7t3v!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33e608a1-3763-49bd-978b-7e44f5835f0a_500x500.gif 848w, https://substackcdn.com/image/fetch/$s_!7t3v!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33e608a1-3763-49bd-978b-7e44f5835f0a_500x500.gif 1272w, https://substackcdn.com/image/fetch/$s_!7t3v!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33e608a1-3763-49bd-978b-7e44f5835f0a_500x500.gif 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">MuJoCo environment: continuous action space (<a href="https://neptune.ai/blog/installing-mujoco-to-work-with-openai-gym-environments">source</a>).</figcaption></figure></div><p>Classical approaches either cannot be implemented in these settings or lose their theoretical guarantees there. Furthermore, the primary exploration strategies in deep reinforcement learning struggle to keep up with the growing complexity of these expanding search spaces.</p><p>Interestingly, biological agents cope very well with exploring vast continuous state and action spaces. For example, animals can travel long distances until they find food or water. Humans can navigate to an address in an unfamiliar city. Biological agents are unlikely to be solving for the <a href="https://hungleai.substack.com/i/140853184/upper-confidence-bound-ucb">UCB using Hoeffding's inequality</a>. &#129504; <em>What motivates these agents to explore? 
</em>Before addressing this question, it helps to review the common framework of intrinsic motivation that drives agents to explore.</p><p><a href="https://hungleai.substack.com/i/140853184/primal-exploration-in-deep-reinforcement-learning">Primal approaches</a> add an entropy loss or inject noise into the policy/value parameters, with the limitation that the level of exploration is not explicitly conditioned on fine-grained factors such as states or actions. One typical remedy is to use <em>intrinsic reward bonuses</em>, which assign higher internal rewards to state-action pairs that need more exploration and lower rewards to those that need less. The final reward for the agent is a weighted sum of the intrinsic reward and the external (environment) reward. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zjyP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0524098f-58a6-46e2-bfcb-c35cf749b64c_571x217.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zjyP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0524098f-58a6-46e2-bfcb-c35cf749b64c_571x217.png 424w, https://substackcdn.com/image/fetch/$s_!zjyP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0524098f-58a6-46e2-bfcb-c35cf749b64c_571x217.png 848w, https://substackcdn.com/image/fetch/$s_!zjyP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0524098f-58a6-46e2-bfcb-c35cf749b64c_571x217.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zjyP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0524098f-58a6-46e2-bfcb-c35cf749b64c_571x217.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zjyP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0524098f-58a6-46e2-bfcb-c35cf749b64c_571x217.png" width="571" height="217" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0524098f-58a6-46e2-bfcb-c35cf749b64c_571x217.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:217,&quot;width&quot;:571,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64688,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zjyP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0524098f-58a6-46e2-bfcb-c35cf749b64c_571x217.png 424w, https://substackcdn.com/image/fetch/$s_!zjyP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0524098f-58a6-46e2-bfcb-c35cf749b64c_571x217.png 848w, https://substackcdn.com/image/fetch/$s_!zjyP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0524098f-58a6-46e2-bfcb-c35cf749b64c_571x217.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zjyP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0524098f-58a6-46e2-bfcb-c35cf749b64c_571x217.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Reward bonus framework.</figcaption></figure></div><p></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3ZxX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269807cb-8cc3-4b1b-b2cc-74aad8ff0516_508x226.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3ZxX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269807cb-8cc3-4b1b-b2cc-74aad8ff0516_508x226.gif 424w, https://substackcdn.com/image/fetch/$s_!3ZxX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269807cb-8cc3-4b1b-b2cc-74aad8ff0516_508x226.gif 848w, https://substackcdn.com/image/fetch/$s_!3ZxX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269807cb-8cc3-4b1b-b2cc-74aad8ff0516_508x226.gif 1272w, https://substackcdn.com/image/fetch/$s_!3ZxX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269807cb-8cc3-4b1b-b2cc-74aad8ff0516_508x226.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3ZxX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269807cb-8cc3-4b1b-b2cc-74aad8ff0516_508x226.gif" width="614" height="273.15748031496065" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/269807cb-8cc3-4b1b-b2cc-74aad8ff0516_508x226.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:226,&quot;width&quot;:508,&quot;resizeWidth&quot;:614,&quot;bytes&quot;:17838,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3ZxX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269807cb-8cc3-4b1b-b2cc-74aad8ff0516_508x226.gif 424w, https://substackcdn.com/image/fetch/$s_!3ZxX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269807cb-8cc3-4b1b-b2cc-74aad8ff0516_508x226.gif 848w, https://substackcdn.com/image/fetch/$s_!3ZxX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269807cb-8cc3-4b1b-b2cc-74aad8ff0516_508x226.gif 1272w, https://substackcdn.com/image/fetch/$s_!3ZxX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F269807cb-8cc3-4b1b-b2cc-74aad8ff0516_508x226.gif 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Exploration via intrinsic reward bonus.</figcaption></figure></div><p>In addition to bonus rewards, there are other methods to integrate intrinsic motivation into RL. For example, if we can identify the states that should be explored, we can use them as the goals in goal-conditioned RL approaches. 
We can also directly bias the policy towards interesting states and actions by modifying the policy function or the action sampling process. Finally, in replay-based value learning, we can put more weight on experiences that contain promising states and actions to encourage more visits to them. Among these options, the reward bonus is still the most common because of its simplicity. Therefore, we will focus on this technique in this article. </p><div><hr></div><h2>Novelty</h2><p>Humans want to explore novel places, make new friends, and buy new stuff. Being motivated by new things is inherent to humans. &#129504; <em>How can we translate this intrinsic motivation to RL agents? </em>Tracking the number of occurrences of a state, <em>N(s)</em>, provides a novelty indicator: the more often a state has occurred, the less novel it is. This leads to an intrinsic reward resembling the <a href="https://hungleai.substack.com/p/curious-agents-saga-part-1#%C2%A7upper-confidence-bound-ucb">UCB</a> strategy: <em>r<sub>i</sub>(s,a)=N(s)<sup>-0.5</sup></em>, where <em>N(s)</em> counts the number of times <em>s</em> has appeared. Unfortunately, relying on empirical counts in continuous state spaces is impractical: the exact same state is rarely visited twice, so <em>N(s)=0</em> most of the time. Estimating <em>N(s)</em> properly demands additional effort, and we call such an estimate a pseudo-count.</p><h4>State Counting</h4><p>Bellemare et al. (2016) propose to use a density model over states to estimate their occurrences, i.e., &#128073;<strong>density-based counting</strong> [1]. Let <em>&#961;(x)=&#961;(s=x|s<sub>1:n</sub>)</em> be the density of the state <em>x</em> given <em>s<sub>1:n</sub></em>, and <em>&#961;&#8217;(x)=&#961;(s=x|s<sub>1:n</sub>x)</em> the density of the same state <em>x</em> after observing one more occurrence of <em>x</em> following <em>s<sub>1:n</sub></em>. 
Assume there exist a &#8220;pseudo-count&#8221; <em>N&#770;&#8239;(x)</em> of <em>x</em> and a pseudo-count total <em>n&#770;</em>, both given the previous states <em>s<sub>1:n</sub></em>. Requiring the model's densities to be consistent with these counts before and after one more occurrence of <em>x</em>, the following relationships hold:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f2TS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86eb3e69-9977-4377-8312-14e12979366b_567x117.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f2TS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86eb3e69-9977-4377-8312-14e12979366b_567x117.png 424w, https://substackcdn.com/image/fetch/$s_!f2TS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86eb3e69-9977-4377-8312-14e12979366b_567x117.png 848w, https://substackcdn.com/image/fetch/$s_!f2TS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86eb3e69-9977-4377-8312-14e12979366b_567x117.png 1272w, https://substackcdn.com/image/fetch/$s_!f2TS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86eb3e69-9977-4377-8312-14e12979366b_567x117.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f2TS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86eb3e69-9977-4377-8312-14e12979366b_567x117.png" width="567" height="117" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/86eb3e69-9977-4377-8312-14e12979366b_567x117.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:117,&quot;width&quot;:567,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!f2TS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86eb3e69-9977-4377-8312-14e12979366b_567x117.png 424w, https://substackcdn.com/image/fetch/$s_!f2TS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86eb3e69-9977-4377-8312-14e12979366b_567x117.png 848w, https://substackcdn.com/image/fetch/$s_!f2TS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86eb3e69-9977-4377-8312-14e12979366b_567x117.png 1272w, https://substackcdn.com/image/fetch/$s_!f2TS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86eb3e69-9977-4377-8312-14e12979366b_567x117.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>which yields (by substituting the value of <em>n&#770;</em> derived from the left equation into the right one):</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!9Azs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d055631-e41a-4d80-bbb8-407a718d0917_496x99.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Azs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d055631-e41a-4d80-bbb8-407a718d0917_496x99.png 424w, https://substackcdn.com/image/fetch/$s_!9Azs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d055631-e41a-4d80-bbb8-407a718d0917_496x99.png 848w, https://substackcdn.com/image/fetch/$s_!9Azs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d055631-e41a-4d80-bbb8-407a718d0917_496x99.png 1272w, https://substackcdn.com/image/fetch/$s_!9Azs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d055631-e41a-4d80-bbb8-407a718d0917_496x99.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9Azs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d055631-e41a-4d80-bbb8-407a718d0917_496x99.png" width="496" height="99" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d055631-e41a-4d80-bbb8-407a718d0917_496x99.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:99,&quot;width&quot;:496,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:20026,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9Azs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d055631-e41a-4d80-bbb8-407a718d0917_496x99.png 424w, https://substackcdn.com/image/fetch/$s_!9Azs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d055631-e41a-4d80-bbb8-407a718d0917_496x99.png 848w, https://substackcdn.com/image/fetch/$s_!9Azs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d055631-e41a-4d80-bbb8-407a718d0917_496x99.png 1272w, https://substackcdn.com/image/fetch/$s_!9Azs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d055631-e41a-4d80-bbb8-407a718d0917_496x99.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064;  In practice, in a huge state space, <em>&#961;&#8217;<sub>n</sub>(x)&#8776;0, </em>and thus we can rewrite the pseudo-count:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!_J3n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb413655b-80cd-46a8-85e2-43c8eeb4a240_461x105.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_J3n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb413655b-80cd-46a8-85e2-43c8eeb4a240_461x105.png 424w, https://substackcdn.com/image/fetch/$s_!_J3n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb413655b-80cd-46a8-85e2-43c8eeb4a240_461x105.png 848w, https://substackcdn.com/image/fetch/$s_!_J3n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb413655b-80cd-46a8-85e2-43c8eeb4a240_461x105.png 1272w, https://substackcdn.com/image/fetch/$s_!_J3n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb413655b-80cd-46a8-85e2-43c8eeb4a240_461x105.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_J3n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb413655b-80cd-46a8-85e2-43c8eeb4a240_461x105.png" width="335" height="76.30151843817788" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b413655b-80cd-46a8-85e2-43c8eeb4a240_461x105.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:105,&quot;width&quot;:461,&quot;resizeWidth&quot;:335,&quot;bytes&quot;:14710,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_J3n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb413655b-80cd-46a8-85e2-43c8eeb4a240_461x105.png 424w, https://substackcdn.com/image/fetch/$s_!_J3n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb413655b-80cd-46a8-85e2-43c8eeb4a240_461x105.png 848w, https://substackcdn.com/image/fetch/$s_!_J3n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb413655b-80cd-46a8-85e2-43c8eeb4a240_461x105.png 1272w, https://substackcdn.com/image/fetch/$s_!_J3n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb413655b-80cd-46a8-85e2-43c8eeb4a240_461x105.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where PG means predictive gain, which is computed as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1yfL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546ab707-3a56-4eb9-90b2-7e98cf6982f5_568x70.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1yfL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546ab707-3a56-4eb9-90b2-7e98cf6982f5_568x70.png 424w, https://substackcdn.com/image/fetch/$s_!1yfL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546ab707-3a56-4eb9-90b2-7e98cf6982f5_568x70.png 848w, https://substackcdn.com/image/fetch/$s_!1yfL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546ab707-3a56-4eb9-90b2-7e98cf6982f5_568x70.png 1272w, https://substackcdn.com/image/fetch/$s_!1yfL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546ab707-3a56-4eb9-90b2-7e98cf6982f5_568x70.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1yfL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546ab707-3a56-4eb9-90b2-7e98cf6982f5_568x70.png" width="430" height="52.99295774647887" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/546ab707-3a56-4eb9-90b2-7e98cf6982f5_568x70.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:70,&quot;width&quot;:568,&quot;resizeWidth&quot;:430,&quot;bytes&quot;:16585,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!1yfL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546ab707-3a56-4eb9-90b2-7e98cf6982f5_568x70.png 424w, https://substackcdn.com/image/fetch/$s_!1yfL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546ab707-3a56-4eb9-90b2-7e98cf6982f5_568x70.png 848w, https://substackcdn.com/image/fetch/$s_!1yfL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546ab707-3a56-4eb9-90b2-7e98cf6982f5_568x70.png 1272w, https://substackcdn.com/image/fetch/$s_!1yfL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F546ab707-3a56-4eb9-90b2-7e98cf6982f5_568x70.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This formulation closely resembles the <a href="https://hungleai.substack.com/i/140853184/information-gain">information gain</a>, which measures how much a posterior distribution differs from its prior. </p></blockquote><p>Now, the final task is to estimate <em>&#961;(x)</em>, which can be done with any density model, such as <a href="http://proceedings.mlr.press/v32/bellemare14.html">CTS</a>. 
After we get the pseudo-count, we can derive the intrinsic reward as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bbta!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd79fdcfa-a774-4067-b00c-d4c0014758d4_340x48.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bbta!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd79fdcfa-a774-4067-b00c-d4c0014758d4_340x48.png 424w, https://substackcdn.com/image/fetch/$s_!Bbta!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd79fdcfa-a774-4067-b00c-d4c0014758d4_340x48.png 848w, https://substackcdn.com/image/fetch/$s_!Bbta!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd79fdcfa-a774-4067-b00c-d4c0014758d4_340x48.png 1272w, https://substackcdn.com/image/fetch/$s_!Bbta!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd79fdcfa-a774-4067-b00c-d4c0014758d4_340x48.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bbta!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd79fdcfa-a774-4067-b00c-d4c0014758d4_340x48.png" width="340" height="48" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d79fdcfa-a774-4067-b00c-d4c0014758d4_340x48.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:48,&quot;width&quot;:340,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8458,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bbta!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd79fdcfa-a774-4067-b00c-d4c0014758d4_340x48.png 424w, https://substackcdn.com/image/fetch/$s_!Bbta!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd79fdcfa-a774-4067-b00c-d4c0014758d4_340x48.png 848w, https://substackcdn.com/image/fetch/$s_!Bbta!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd79fdcfa-a774-4067-b00c-d4c0014758d4_340x48.png 1272w, https://substackcdn.com/image/fetch/$s_!Bbta!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd79fdcfa-a774-4067-b00c-d4c0014758d4_340x48.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here, the 0.01 term is to avoid dividing by zero when <em>N&#770;&#8239;(x)=0. 
</em>Using this reward, the authors improve <a href="https://www.nature.com/articles/nature14236">DQN</a> performance on various hard Atari games:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TZRJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56419452-9f37-4834-b983-7ccb4868a11b_940x212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TZRJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56419452-9f37-4834-b983-7ccb4868a11b_940x212.png 424w, https://substackcdn.com/image/fetch/$s_!TZRJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56419452-9f37-4834-b983-7ccb4868a11b_940x212.png 848w, https://substackcdn.com/image/fetch/$s_!TZRJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56419452-9f37-4834-b983-7ccb4868a11b_940x212.png 1272w, https://substackcdn.com/image/fetch/$s_!TZRJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56419452-9f37-4834-b983-7ccb4868a11b_940x212.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TZRJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56419452-9f37-4834-b983-7ccb4868a11b_940x212.png" width="940" height="212" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56419452-9f37-4834-b983-7ccb4868a11b_940x212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:212,&quot;width&quot;:940,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:124415,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TZRJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56419452-9f37-4834-b983-7ccb4868a11b_940x212.png 424w, https://substackcdn.com/image/fetch/$s_!TZRJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56419452-9f37-4834-b983-7ccb4868a11b_940x212.png 848w, https://substackcdn.com/image/fetch/$s_!TZRJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56419452-9f37-4834-b983-7ccb4868a11b_940x212.png 1272w, https://substackcdn.com/image/fetch/$s_!TZRJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56419452-9f37-4834-b983-7ccb4868a11b_940x212.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Exploration with pseudo-count (green line) on hard Atari games. Image taken from [1]</figcaption></figure></div><p>If counting each exact state is challenging, why not partition the continuous state space into manageable blocks? This is the approach taken by Tang et al. 
(2017), who use &#128073;<strong>hash count</strong> to address the counting issue [2]. With a hash function &#120601; that maps each state to a discrete code, we can count occurrences of the code instead of the state. &#129504; <em>How should we choose a good &#120601;?</em></p><div class="pullquote"><p>One important choice we can make regards the granularity of the discretization: we would like for &#8220;distant&#8221; states to be counted separately while &#8220;similar&#8221; states are merged.</p><p>&#8212;Text taken from [2]&#8212;</p></div><p>The paper picks <a href="https://en.wikipedia.org/wiki/SimHash#:~:text=In%20computer%20science%2C%20SimHash%20is,was%20created%20by%20Moses%20Charikar.">SimHash</a> as the mapping function, which is very simple to implement: </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!03TB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bbc5398-b786-46c3-89c2-c24316fb5eb3_347x48.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!03TB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bbc5398-b786-46c3-89c2-c24316fb5eb3_347x48.png 424w, https://substackcdn.com/image/fetch/$s_!03TB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bbc5398-b786-46c3-89c2-c24316fb5eb3_347x48.png 848w, https://substackcdn.com/image/fetch/$s_!03TB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bbc5398-b786-46c3-89c2-c24316fb5eb3_347x48.png 1272w, 
https://substackcdn.com/image/fetch/$s_!03TB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bbc5398-b786-46c3-89c2-c24316fb5eb3_347x48.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!03TB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bbc5398-b786-46c3-89c2-c24316fb5eb3_347x48.png" width="347" height="48" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7bbc5398-b786-46c3-89c2-c24316fb5eb3_347x48.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:48,&quot;width&quot;:347,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8430,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!03TB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bbc5398-b786-46c3-89c2-c24316fb5eb3_347x48.png 424w, https://substackcdn.com/image/fetch/$s_!03TB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bbc5398-b786-46c3-89c2-c24316fb5eb3_347x48.png 848w, https://substackcdn.com/image/fetch/$s_!03TB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bbc5398-b786-46c3-89c2-c24316fb5eb3_347x48.png 1272w, 
https://substackcdn.com/image/fetch/$s_!03TB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bbc5398-b786-46c3-89c2-c24316fb5eb3_347x48.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>sgn</em> is the sign function, <em>A</em> is a <em>k &#215; d</em> matrix with i.i.d. entries drawn from a standard Gaussian distribution, and <em>g</em> is a transformation function that maps the state space to a <em>d</em>-dimensional representation space. We can control the granularity of the discretization through <em>k</em>: a higher <em>k</em> leads to fewer collisions, so more states are counted separately.</p><p>Further, Tang et al. (2017) propose an extension of the approach for high-dimensional state spaces. In particular, a good <em>g</em> is learned with an autoencoder trained to reconstruct its input:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9lfd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa144796e-295a-44c0-8e5b-a6bd41d5cd25_1013x343.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9lfd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa144796e-295a-44c0-8e5b-a6bd41d5cd25_1013x343.png 424w, https://substackcdn.com/image/fetch/$s_!9lfd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa144796e-295a-44c0-8e5b-a6bd41d5cd25_1013x343.png 848w, 
https://substackcdn.com/image/fetch/$s_!9lfd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa144796e-295a-44c0-8e5b-a6bd41d5cd25_1013x343.png 1272w, https://substackcdn.com/image/fetch/$s_!9lfd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa144796e-295a-44c0-8e5b-a6bd41d5cd25_1013x343.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9lfd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa144796e-295a-44c0-8e5b-a6bd41d5cd25_1013x343.png" width="1013" height="343" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a144796e-295a-44c0-8e5b-a6bd41d5cd25_1013x343.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:343,&quot;width&quot;:1013,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82900,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9lfd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa144796e-295a-44c0-8e5b-a6bd41d5cd25_1013x343.png 424w, https://substackcdn.com/image/fetch/$s_!9lfd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa144796e-295a-44c0-8e5b-a6bd41d5cd25_1013x343.png 848w, 
https://substackcdn.com/image/fetch/$s_!9lfd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa144796e-295a-44c0-8e5b-a6bd41d5cd25_1013x343.png 1272w, https://substackcdn.com/image/fetch/$s_!9lfd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa144796e-295a-44c0-8e5b-a6bd41d5cd25_1013x343.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Autoencoder architecture for training the representation <em>g</em>. 
Image taken from [2]</figcaption></figure></div><p>The network is trained to reconstruct the original state input <em>s</em>, and the hidden representation <em>b(s)</em> is rounded to obtain <em>g(s)=round(b(s))</em>. In addition to the reconstruction loss, the authors add a regularization term during training that pushes the hidden representation toward binary values, which keeps the bits of the binary code from flipping throughout the agent's lifetime. The final loss is defined as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IHCq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06aa021d-675c-4cef-b7ad-176a669d6053_3924x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IHCq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06aa021d-675c-4cef-b7ad-176a669d6053_3924x948.png 424w, https://substackcdn.com/image/fetch/$s_!IHCq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06aa021d-675c-4cef-b7ad-176a669d6053_3924x948.png 848w, https://substackcdn.com/image/fetch/$s_!IHCq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06aa021d-675c-4cef-b7ad-176a669d6053_3924x948.png 1272w, https://substackcdn.com/image/fetch/$s_!IHCq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06aa021d-675c-4cef-b7ad-176a669d6053_3924x948.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!IHCq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06aa021d-675c-4cef-b7ad-176a669d6053_3924x948.png" width="1456" height="352" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/06aa021d-675c-4cef-b7ad-176a669d6053_3924x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:352,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IHCq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06aa021d-675c-4cef-b7ad-176a669d6053_3924x948.png 424w, https://substackcdn.com/image/fetch/$s_!IHCq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06aa021d-675c-4cef-b7ad-176a669d6053_3924x948.png 848w, https://substackcdn.com/image/fetch/$s_!IHCq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06aa021d-675c-4cef-b7ad-176a669d6053_3924x948.png 1272w, https://substackcdn.com/image/fetch/$s_!IHCq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06aa021d-675c-4cef-b7ad-176a669d6053_3924x948.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; Extending these counting approaches to count state-action pairs is straightforward. 
To measure the novelty of state-action combinations, one can concatenate the action representation with the state representation.</p></blockquote><h4>Change Counting</h4><p>Focusing solely on novel states or actions is wasteful when they have little or no impact on the outcome. In navigation tasks with diverse situations, such as those in <a href="https://minigrid.farama.org/index.html">Minigrid</a>, many actions accomplish nothing:</p><div class="pullquote"><p>For instance, moving around, bumping into walls, or trying to open locked doors without keys all result in no change and thus will be of low interest. </p><p>&#8212;Text taken from [3]&#8212;</p></div><p>To encourage the agent to explore novel state-action pairs meaningfully, we can measure the changes that actions cause and prioritize those that are novel. In particular, we can define <em>c(s,s&#8217;)</em> as the environment change caused by a transition <em>(s, a, s&#8217;)</em> and use count-based methods to estimate its novelty. 
The proposed method, &#128073;<strong>C-BET</strong>, combines state count and change count, resulting in the intrinsic reward:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jtle!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11517549-ee2a-445e-959f-50835fcfb794_444x57.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jtle!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11517549-ee2a-445e-959f-50835fcfb794_444x57.png 424w, https://substackcdn.com/image/fetch/$s_!Jtle!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11517549-ee2a-445e-959f-50835fcfb794_444x57.png 848w, https://substackcdn.com/image/fetch/$s_!Jtle!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11517549-ee2a-445e-959f-50835fcfb794_444x57.png 1272w, https://substackcdn.com/image/fetch/$s_!Jtle!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11517549-ee2a-445e-959f-50835fcfb794_444x57.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jtle!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11517549-ee2a-445e-959f-50835fcfb794_444x57.png" width="444" height="57" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11517549-ee2a-445e-959f-50835fcfb794_444x57.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:57,&quot;width&quot;:444,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10871,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jtle!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11517549-ee2a-445e-959f-50835fcfb794_444x57.png 424w, https://substackcdn.com/image/fetch/$s_!Jtle!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11517549-ee2a-445e-959f-50835fcfb794_444x57.png 848w, https://substackcdn.com/image/fetch/$s_!Jtle!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11517549-ee2a-445e-959f-50835fcfb794_444x57.png 1272w, https://substackcdn.com/image/fetch/$s_!Jtle!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11517549-ee2a-445e-959f-50835fcfb794_444x57.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The authors recognize the effectiveness of counting both quantities rather than solely focusing on counting the states or using the norm of the change to represent novelty, as illustrated in the visualization:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!JBEP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66c2a2-b37d-4b37-850b-74d5aa5dcc12_1177x529.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JBEP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66c2a2-b37d-4b37-850b-74d5aa5dcc12_1177x529.png 424w, https://substackcdn.com/image/fetch/$s_!JBEP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66c2a2-b37d-4b37-850b-74d5aa5dcc12_1177x529.png 848w, https://substackcdn.com/image/fetch/$s_!JBEP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66c2a2-b37d-4b37-850b-74d5aa5dcc12_1177x529.png 1272w, https://substackcdn.com/image/fetch/$s_!JBEP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66c2a2-b37d-4b37-850b-74d5aa5dcc12_1177x529.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JBEP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66c2a2-b37d-4b37-850b-74d5aa5dcc12_1177x529.png" width="1177" height="529" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e66c2a2-b37d-4b37-850b-74d5aa5dcc12_1177x529.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:529,&quot;width&quot;:1177,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238271,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JBEP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66c2a2-b37d-4b37-850b-74d5aa5dcc12_1177x529.png 424w, https://substackcdn.com/image/fetch/$s_!JBEP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66c2a2-b37d-4b37-850b-74d5aa5dcc12_1177x529.png 848w, https://substackcdn.com/image/fetch/$s_!JBEP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66c2a2-b37d-4b37-850b-74d5aa5dcc12_1177x529.png 1272w, https://substackcdn.com/image/fetch/$s_!JBEP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66c2a2-b37d-4b37-850b-74d5aa5dcc12_1177x529.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Change count (last row) vs. norm of change (middle row) vs. state count (top row). Change count is less drawn to meaningless activities. Image taken from [3].</figcaption></figure></div><blockquote><p>&#10060; An inherent limitation of count-based methods is the approximation error between the pseudo-count and the true count. Counting is particularly sensitive to the choice of representation or density model, especially when two states share similar representations but should be counted separately. </p></blockquote><h4>Novelty through Reachability</h4><p>Another way to overcome this limitation of count-based methods is to define novelty through a different criterion. In [4], the authors provide an interesting intuition: novel observations are those that demand effort to reach, typically beyond the already explored areas of the environment. 
They measure effort in environment steps, estimated by a neural network that predicts the number of steps between two observations. To capture the explored areas of the environment, they use an episodic memory that starts empty at the beginning of each episode, and they call their method &#128073;<strong>Episodic Curiosity (EC)</strong>. Observations encountered during an episode are then added to this memory. </p><p>By estimating the number of steps between a given observation and every observation stored in the memory, we can gauge its novelty. For example, if even the closest stored observation is estimated to be more than <em>k</em> steps away, we can regard the given observation as novel. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JMXG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a37c5-6622-4e2e-a6b8-3d405509328f_550x327.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JMXG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a37c5-6622-4e2e-a6b8-3d405509328f_550x327.png 424w, https://substackcdn.com/image/fetch/$s_!JMXG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a37c5-6622-4e2e-a6b8-3d405509328f_550x327.png 848w, https://substackcdn.com/image/fetch/$s_!JMXG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a37c5-6622-4e2e-a6b8-3d405509328f_550x327.png 1272w, 
https://substackcdn.com/image/fetch/$s_!JMXG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a37c5-6622-4e2e-a6b8-3d405509328f_550x327.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JMXG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a37c5-6622-4e2e-a6b8-3d405509328f_550x327.png" width="550" height="327" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/583a37c5-6622-4e2e-a6b8-3d405509328f_550x327.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:327,&quot;width&quot;:550,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:68930,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JMXG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a37c5-6622-4e2e-a6b8-3d405509328f_550x327.png 424w, https://substackcdn.com/image/fetch/$s_!JMXG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a37c5-6622-4e2e-a6b8-3d405509328f_550x327.png 848w, https://substackcdn.com/image/fetch/$s_!JMXG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a37c5-6622-4e2e-a6b8-3d405509328f_550x327.png 1272w, 
https://substackcdn.com/image/fetch/$s_!JMXG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a37c5-6622-4e2e-a6b8-3d405509328f_550x327.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The concept of novelty through reachability. An observation is novel if it takes more than <em>k</em> steps to reach from every observation in the memory. Image taken from [4]</figcaption></figure></div><p>To implement this idea, we need to train a model that predicts the number of steps between two observations. 
To simplify, we can fix the threshold <em>k</em> in advance and train a model to classify whether two observations are separated by more or fewer than <em>k</em> steps. This model, called the reachability network, takes two observations as input and outputs a logit score for binary classification; it is trained on data collected as the agent interacts with the environment. After training, the reachability network estimates the novelty of the current observation given the episodic memory <em>M</em>, and this estimate is used to compute the intrinsic reward. The training and inference procedures are depicted below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b18f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3405fd65-013c-42fb-8329-987936f97624_689x613.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b18f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3405fd65-013c-42fb-8329-987936f97624_689x613.gif 424w, https://substackcdn.com/image/fetch/$s_!b18f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3405fd65-013c-42fb-8329-987936f97624_689x613.gif 848w, https://substackcdn.com/image/fetch/$s_!b18f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3405fd65-013c-42fb-8329-987936f97624_689x613.gif 1272w, https://substackcdn.com/image/fetch/$s_!b18f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3405fd65-013c-42fb-8329-987936f97624_689x613.gif 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b18f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3405fd65-013c-42fb-8329-987936f97624_689x613.gif" width="689" height="613" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3405fd65-013c-42fb-8329-987936f97624_689x613.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:613,&quot;width&quot;:689,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:479737,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b18f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3405fd65-013c-42fb-8329-987936f97624_689x613.gif 424w, https://substackcdn.com/image/fetch/$s_!b18f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3405fd65-013c-42fb-8329-987936f97624_689x613.gif 848w, https://substackcdn.com/image/fetch/$s_!b18f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3405fd65-013c-42fb-8329-987936f97624_689x613.gif 1272w, https://substackcdn.com/image/fetch/$s_!b18f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3405fd65-013c-42fb-8329-987936f97624_689x613.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Training and inference using the reachability network. The network predicts whether 2 observations are reachable by producing a score <em>c</em> (closer to 0 means not reachable and closer to 1 means reachable). Using a function <em>F</em> to aggregate the reachability scores between the current observation and those in the memory leads to the intrinsic reward. <em>F</em> can be max or 90-th percentile. Images taken from [4].</figcaption></figure></div><h4>Novelty via Reconstruction</h4><p>Memory plays a crucial role in identifying novel events amidst routine occurrences, and various modeling approaches can be employed to capture this functionality. 
As discussed earlier, memory can be as simple as a counter or a buffer that retains a list of observations; both are forms of non-parametric memory. In this section, we turn to parametric memory models, particularly autoencoder neural networks. </p><p>Theoretically, an overparameterized autoencoder trained to reconstruct its input is equivalent to an associative memory [14]. We can therefore train an autoencoder to reconstruct the state and use its reconstruction error as an indicator of novelty, i.e., &#128073;<strong>autoencoder novelty</strong> [15]. A larger error signals a more novel state: the autoencoder has not seen such states often enough to learn to reconstruct them well. </p><p>It turns out that even with random targets, reconstruction can still contribute to exploration. In &#128073;<strong>Random Network Distillation</strong> (<strong>RND</strong>, [16]), the intrinsic reward is defined through the task of predicting the output of a fixed (target) network whose weights are random, as shown in the figure below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RRYv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbbb3d9a-4240-4352-a75f-402348217b49_467x487.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RRYv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbbb3d9a-4240-4352-a75f-402348217b49_467x487.gif 424w, 
https://substackcdn.com/image/fetch/$s_!RRYv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbbb3d9a-4240-4352-a75f-402348217b49_467x487.gif 848w, https://substackcdn.com/image/fetch/$s_!RRYv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbbb3d9a-4240-4352-a75f-402348217b49_467x487.gif 1272w, https://substackcdn.com/image/fetch/$s_!RRYv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbbb3d9a-4240-4352-a75f-402348217b49_467x487.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RRYv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbbb3d9a-4240-4352-a75f-402348217b49_467x487.gif" width="467" height="487" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fbbb3d9a-4240-4352-a75f-402348217b49_467x487.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:487,&quot;width&quot;:467,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:163689,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RRYv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbbb3d9a-4240-4352-a75f-402348217b49_467x487.gif 424w, 
https://substackcdn.com/image/fetch/$s_!RRYv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbbb3d9a-4240-4352-a75f-402348217b49_467x487.gif 848w, https://substackcdn.com/image/fetch/$s_!RRYv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbbb3d9a-4240-4352-a75f-402348217b49_467x487.gif 1272w, https://substackcdn.com/image/fetch/$s_!RRYv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbbb3d9a-4240-4352-a75f-402348217b49_467x487.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">RND: given the state as input, one neural network (green) is trained to predict the output of the other network (blue). The prediction error is the intrinsic reward. </figcaption></figure></div><p>The intrinsic reward is still a reconstruction error. This time, however, the target being reconstructed is no longer the original input but a transformed version of it: the output of the fixed random network. </p><blockquote><p>&#128064; As you can see, novelty can be modeled and implemented for RL exploration in various ways. However, novelty is not the only source of motivation for agent curiosity. We will see more in the upcoming sections. </p></blockquote><div><hr></div><h2>Surprise</h2><p>Psychologically, surprise emerges when there is a discrepancy between expectations and the observed or experienced reality [5]. We humans rely heavily on surprise to guide our actions through life. Think back to the last time you encountered something unexpected, like an unpleasant odor or a road accident; the instinct is to immediately investigate what is happening. Inspired by this behavior, researchers have made numerous attempts to model surprise in RL agents, aiming to enhance their exploration of the environment. &#129504; <em>What is the mechanism for comparing an expectation with actuality to model surprise?</em></p><h4>Predictive Surprise</h4><p>One approach is to build a model of the environment that predicts the next state given the current state and action. This kind of model, also known as a &#128073;<strong>forward dynamics or world model</strong> [6], provides a mechanism for estimating what the agent expects to observe. In [7], the authors propose to use a neural network <em>f</em> that takes a representation of the current state and the action as input and predicts the next state. 
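</p><p>To make this concrete, here is a minimal sketch of a forward dynamics model and the surprise it produces. It is an illustration under simplifying assumptions, not the implementation from [7]: the model is linear rather than a neural network, it works on a raw scalar state instead of a learned representation, and <code>LinearForwardModel</code>, <code>toy_env_step</code> and the learning rate are hypothetical names and values chosen for the example.</p>

```python
import random


class LinearForwardModel:
    """Toy forward dynamics model: predicts next state from (state, action)."""

    def __init__(self, lr=0.05):
        self.w_state = 0.0
        self.w_action = 0.0
        self.bias = 0.0
        self.lr = lr  # learning rate (assumed value for this sketch)

    def predict(self, state, action):
        return self.w_state * state + self.w_action * action + self.bias

    def surprise(self, state, action, next_state):
        """Absolute prediction error, used here as the surprise signal."""
        return abs(self.predict(state, action) - next_state)

    def update(self, state, action, next_state):
        """One SGD step on half the squared prediction error."""
        err = self.predict(state, action) - next_state
        self.w_state -= self.lr * err * state
        self.w_action -= self.lr * err * action
        self.bias -= self.lr * err


def toy_env_step(state, action):
    """Hypothetical deterministic dynamics: the action shifts the state."""
    return state + action


rng = random.Random(0)
model = LinearForwardModel()
# A fixed transition used to measure surprise before and after training.
probe = (0.5, 1.0, toy_env_step(0.5, 1.0))

surprise_before = model.surprise(*probe)
for _ in range(2000):
    s = rng.uniform(-1.0, 1.0)
    a = rng.choice([-1.0, 1.0])
    model.update(s, a, toy_env_step(s, a))
surprise_after = model.surprise(*probe)
```

<p>In this toy setting the dynamics are deterministic and learnable, so the surprise on the probe transition shrinks as the model trains. That is the behaviour one would hope for when prediction error serves as a curiosity signal.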
</p><blockquote><p>&#128064; A subtle yet pivotal distinction between dynamics prediction and state reconstruction (as in autoencoder novelty) is that the former predicts the next state. Predicting the future is inherently harder than reconstructing the current observation.</p></blockquote><p>The representation is learned without supervision through a state reconstruction task: it is the hidden state of an autoencoder. The network <em>f</em>, fed with the autoencoder&#8217;s hidden state, is trained to minimize the prediction error, defined as the norm of the difference between the predicted state and the true state. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cycl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ef17ce-6457-442b-9376-1cf31cb137e8_1082x345.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cycl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ef17ce-6457-442b-9376-1cf31cb137e8_1082x345.gif 424w, https://substackcdn.com/image/fetch/$s_!cycl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ef17ce-6457-442b-9376-1cf31cb137e8_1082x345.gif 848w, https://substackcdn.com/image/fetch/$s_!cycl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ef17ce-6457-442b-9376-1cf31cb137e8_1082x345.gif 1272w, https://substackcdn.com/image/fetch/$s_!cycl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ef17ce-6457-442b-9376-1cf31cb137e8_1082x345.gif 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cycl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ef17ce-6457-442b-9376-1cf31cb137e8_1082x345.gif" width="1082" height="345" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3ef17ce-6457-442b-9376-1cf31cb137e8_1082x345.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:345,&quot;width&quot;:1082,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149920,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cycl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ef17ce-6457-442b-9376-1cf31cb137e8_1082x345.gif 424w, https://substackcdn.com/image/fetch/$s_!cycl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ef17ce-6457-442b-9376-1cf31cb137e8_1082x345.gif 848w, https://substackcdn.com/image/fetch/$s_!cycl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ef17ce-6457-442b-9376-1cf31cb137e8_1082x345.gif 1272w, https://substackcdn.com/image/fetch/$s_!cycl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3ef17ce-6457-442b-9376-1cf31cb137e8_1082x345.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Surprise as dynamics model prediction error. Image taken from [7].</figcaption></figure></div><p>Here, the intrinsic reward is simply the prediction error itself. It grows when the model struggles to predict, that is, when it is surprised by the current observation. </p><p>Because predicting raw observations is very hard, one may learn the representation with methods other than an autoencoder. &#128073;<strong>Intrinsic Curiosity Module (ICM)</strong> [8] advocates for a state feature space that leaves out factors the agent cannot control and that do not affect it, since such factors give the agent no incentive to learn them. 
To learn state embeddings that capture the controllable part of the environment, ICM employs an inverse dynamics model <em>g</em> that predicts the action from two consecutive state representations &#120601;.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aNMN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca5c2daf-9699-447b-a418-0f0b06e2b3d7_335x67.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aNMN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca5c2daf-9699-447b-a418-0f0b06e2b3d7_335x67.png 424w, https://substackcdn.com/image/fetch/$s_!aNMN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca5c2daf-9699-447b-a418-0f0b06e2b3d7_335x67.png 848w, https://substackcdn.com/image/fetch/$s_!aNMN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca5c2daf-9699-447b-a418-0f0b06e2b3d7_335x67.png 1272w, https://substackcdn.com/image/fetch/$s_!aNMN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca5c2daf-9699-447b-a418-0f0b06e2b3d7_335x67.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aNMN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca5c2daf-9699-447b-a418-0f0b06e2b3d7_335x67.png" width="335" height="67" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca5c2daf-9699-447b-a418-0f0b06e2b3d7_335x67.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:67,&quot;width&quot;:335,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5381,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aNMN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca5c2daf-9699-447b-a418-0f0b06e2b3d7_335x67.png 424w, https://substackcdn.com/image/fetch/$s_!aNMN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca5c2daf-9699-447b-a418-0f0b06e2b3d7_335x67.png 848w, https://substackcdn.com/image/fetch/$s_!aNMN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca5c2daf-9699-447b-a418-0f0b06e2b3d7_335x67.png 1272w, https://substackcdn.com/image/fetch/$s_!aNMN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca5c2daf-9699-447b-a418-0f0b06e2b3d7_335x67.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>To make predictions accurately, the representations must be meaningful and action-oriented.  
Combined with forward dynamics prediction, ICM proposes the intrinsic reward as </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gm8M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b5a3a7b-dd42-46e5-b115-8ad533478d30_2620x392.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gm8M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b5a3a7b-dd42-46e5-b115-8ad533478d30_2620x392.png 424w, https://substackcdn.com/image/fetch/$s_!gm8M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b5a3a7b-dd42-46e5-b115-8ad533478d30_2620x392.png 848w, https://substackcdn.com/image/fetch/$s_!gm8M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b5a3a7b-dd42-46e5-b115-8ad533478d30_2620x392.png 1272w, https://substackcdn.com/image/fetch/$s_!gm8M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b5a3a7b-dd42-46e5-b115-8ad533478d30_2620x392.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gm8M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b5a3a7b-dd42-46e5-b115-8ad533478d30_2620x392.png" width="515" height="77.10851648351648" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b5a3a7b-dd42-46e5-b115-8ad533478d30_2620x392.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:218,&quot;width&quot;:1456,&quot;resizeWidth&quot;:515,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gm8M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b5a3a7b-dd42-46e5-b115-8ad533478d30_2620x392.png 424w, https://substackcdn.com/image/fetch/$s_!gm8M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b5a3a7b-dd42-46e5-b115-8ad533478d30_2620x392.png 848w, https://substackcdn.com/image/fetch/$s_!gm8M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b5a3a7b-dd42-46e5-b115-8ad533478d30_2620x392.png 1272w, https://substackcdn.com/image/fetch/$s_!gm8M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b5a3a7b-dd42-46e5-b115-8ad533478d30_2620x392.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>An alternative method with forward dynamics involves using the variance of the prediction rather than the error [10]. 
This requires multiple prediction models trained to minimize the forward dynamics prediction errors, and we use the empirical variance (&#128073;<strong>disagreement</strong>) of their predictions as the intrinsic reward.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zuj1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c0adf6c-5eb1-4ec3-ab8b-f995ded76bfb_1062x428.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zuj1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c0adf6c-5eb1-4ec3-ab8b-f995ded76bfb_1062x428.gif 424w, https://substackcdn.com/image/fetch/$s_!zuj1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c0adf6c-5eb1-4ec3-ab8b-f995ded76bfb_1062x428.gif 848w, https://substackcdn.com/image/fetch/$s_!zuj1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c0adf6c-5eb1-4ec3-ab8b-f995ded76bfb_1062x428.gif 1272w, https://substackcdn.com/image/fetch/$s_!zuj1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c0adf6c-5eb1-4ec3-ab8b-f995ded76bfb_1062x428.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zuj1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c0adf6c-5eb1-4ec3-ab8b-f995ded76bfb_1062x428.gif" width="1062" height="428" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c0adf6c-5eb1-4ec3-ab8b-f995ded76bfb_1062x428.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:428,&quot;width&quot;:1062,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:211726,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zuj1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c0adf6c-5eb1-4ec3-ab8b-f995ded76bfb_1062x428.gif 424w, https://substackcdn.com/image/fetch/$s_!zuj1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c0adf6c-5eb1-4ec3-ab8b-f995ded76bfb_1062x428.gif 848w, https://substackcdn.com/image/fetch/$s_!zuj1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c0adf6c-5eb1-4ec3-ab8b-f995ded76bfb_1062x428.gif 1272w, https://substackcdn.com/image/fetch/$s_!zuj1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c0adf6c-5eb1-4ec3-ab8b-f995ded76bfb_1062x428.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Disagreement-based intrinsic reward. Multiple dynamics models make predictions, and the variance of these predictions is the intrinsic reward. Image taken from [10].</figcaption></figure></div><p> </p><blockquote><p>&#10060; Forward dynamics error is not always an effective driver of exploration, especially when the world model is poor and predicts wrongly almost everywhere. </p><p>&#10060; Moreover, predictive surprise becomes misleading in environments with inherent noise. Take the hypothetical case of Noisy-TV [4], where an agent can switch on a TV channel displaying random content. The agent, unable to predict the randomness, consistently experiences high prediction error, i.e., high surprise. Yet engaging with the noisy TV is useless and diverts the agent from its primary task. 
Example of Noisy-TV is illustrated below:</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hnw5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c98f399-64f5-4a25-9bcc-78bb7b76bb61_632x476.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hnw5!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c98f399-64f5-4a25-9bcc-78bb7b76bb61_632x476.gif 424w, https://substackcdn.com/image/fetch/$s_!Hnw5!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c98f399-64f5-4a25-9bcc-78bb7b76bb61_632x476.gif 848w, https://substackcdn.com/image/fetch/$s_!Hnw5!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c98f399-64f5-4a25-9bcc-78bb7b76bb61_632x476.gif 1272w, https://substackcdn.com/image/fetch/$s_!Hnw5!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c98f399-64f5-4a25-9bcc-78bb7b76bb61_632x476.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hnw5!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c98f399-64f5-4a25-9bcc-78bb7b76bb61_632x476.gif" width="374" height="281.6835443037975" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c98f399-64f5-4a25-9bcc-78bb7b76bb61_632x476.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:476,&quot;width&quot;:632,&quot;resizeWidth&quot;:374,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;10. 
RND(Exploration by Random Network Distillation)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="10. RND(Exploration by Random Network Distillation)" title="10. RND(Exploration by Random Network Distillation)" srcset="https://substackcdn.com/image/fetch/$s_!Hnw5!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c98f399-64f5-4a25-9bcc-78bb7b76bb61_632x476.gif 424w, https://substackcdn.com/image/fetch/$s_!Hnw5!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c98f399-64f5-4a25-9bcc-78bb7b76bb61_632x476.gif 848w, https://substackcdn.com/image/fetch/$s_!Hnw5!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c98f399-64f5-4a25-9bcc-78bb7b76bb61_632x476.gif 1272w, https://substackcdn.com/image/fetch/$s_!Hnw5!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c98f399-64f5-4a25-9bcc-78bb7b76bb61_632x476.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Noisy-TV: a TV showing random noise distracts the RL agent from its main task because it keeps producing high surprise (<a href="https://openai.com/research/reinforcement-learning-with-prediction-based-rewards">source</a>). </figcaption></figure></div><p>In [9], the authors propose using &#128073;<strong>learning progress </strong>as intrinsic motivation, which partly addresses this issue. Learning progress is estimated by comparing the mean prediction error of the model over the current moving window with its mean error over an earlier window. The two windows are offset by &#120591; steps. 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S9rT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3565d3d8-baec-4219-a461-ca8f7fe52217_326x51.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S9rT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3565d3d8-baec-4219-a461-ca8f7fe52217_326x51.png 424w, https://substackcdn.com/image/fetch/$s_!S9rT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3565d3d8-baec-4219-a461-ca8f7fe52217_326x51.png 848w, https://substackcdn.com/image/fetch/$s_!S9rT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3565d3d8-baec-4219-a461-ca8f7fe52217_326x51.png 1272w, https://substackcdn.com/image/fetch/$s_!S9rT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3565d3d8-baec-4219-a461-ca8f7fe52217_326x51.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S9rT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3565d3d8-baec-4219-a461-ca8f7fe52217_326x51.png" width="326" height="51" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3565d3d8-baec-4219-a461-ca8f7fe52217_326x51.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:51,&quot;width&quot;:326,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4966,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S9rT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3565d3d8-baec-4219-a461-ca8f7fe52217_326x51.png 424w, https://substackcdn.com/image/fetch/$s_!S9rT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3565d3d8-baec-4219-a461-ca8f7fe52217_326x51.png 848w, https://substackcdn.com/image/fetch/$s_!S9rT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3565d3d8-baec-4219-a461-ca8f7fe52217_326x51.png 1272w, https://substackcdn.com/image/fetch/$s_!S9rT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3565d3d8-baec-4219-a461-ca8f7fe52217_326x51.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>k</em> is the window size. Intuitively, we want to reward observations and actions that increase learning progress.</p><blockquote><p>&#128064; Learning progress is immune to the noisy-TV problem because the prediction error from watching noise stays consistently high. 
If the agent keeps watching the TV, the average error does not change over time, so there is no learning progress. However, estimating learning progress accurately is difficult, and monitoring it takes many steps, which hurts sample efficiency. </p></blockquote><h4>Bayesian Surprise</h4><p>Surprise can also be interpreted from a Bayesian perspective. Inspired by classical exploration methods such as <a href="https://hungleai.substack.com/i/140853184/information-gain">information gain</a>, the agent benefits from actions that reduce its uncertainty about the dynamics, formalized as maximizing the cumulative reduction in entropy [11]:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!APgx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f42cf2-2434-4b76-8dec-64c319c74fe0_3632x512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!APgx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f42cf2-2434-4b76-8dec-64c319c74fe0_3632x512.png 424w, https://substackcdn.com/image/fetch/$s_!APgx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f42cf2-2434-4b76-8dec-64c319c74fe0_3632x512.png 848w, https://substackcdn.com/image/fetch/$s_!APgx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f42cf2-2434-4b76-8dec-64c319c74fe0_3632x512.png 1272w, https://substackcdn.com/image/fetch/$s_!APgx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f42cf2-2434-4b76-8dec-64c319c74fe0_3632x512.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!APgx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f42cf2-2434-4b76-8dec-64c319c74fe0_3632x512.png" width="632" height="88.98351648351648" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77f42cf2-2434-4b76-8dec-64c319c74fe0_3632x512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:205,&quot;width&quot;:1456,&quot;resizeWidth&quot;:632,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!APgx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f42cf2-2434-4b76-8dec-64c319c74fe0_3632x512.png 424w, https://substackcdn.com/image/fetch/$s_!APgx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f42cf2-2434-4b76-8dec-64c319c74fe0_3632x512.png 848w, https://substackcdn.com/image/fetch/$s_!APgx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f42cf2-2434-4b76-8dec-64c319c74fe0_3632x512.png 1272w, https://substackcdn.com/image/fetch/$s_!APgx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77f42cf2-2434-4b76-8dec-64c319c74fe0_3632x512.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The reduction of entropy per time step, also known as mutual information 
<em>I(&#120495;;S<sub>t+1</sub>|&#958;<sub>t</sub>,a<sub>t</sub>)</em>, is computed explicitly as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rj22!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8719bdd3-657e-4bee-9f1a-d7aefcc17fd7_978x78.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rj22!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8719bdd3-657e-4bee-9f1a-d7aefcc17fd7_978x78.png 424w, https://substackcdn.com/image/fetch/$s_!Rj22!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8719bdd3-657e-4bee-9f1a-d7aefcc17fd7_978x78.png 848w, https://substackcdn.com/image/fetch/$s_!Rj22!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8719bdd3-657e-4bee-9f1a-d7aefcc17fd7_978x78.png 1272w, https://substackcdn.com/image/fetch/$s_!Rj22!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8719bdd3-657e-4bee-9f1a-d7aefcc17fd7_978x78.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rj22!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8719bdd3-657e-4bee-9f1a-d7aefcc17fd7_978x78.png" width="978" height="78" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8719bdd3-657e-4bee-9f1a-d7aefcc17fd7_978x78.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:78,&quot;width&quot;:978,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26309,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Rj22!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8719bdd3-657e-4bee-9f1a-d7aefcc17fd7_978x78.png 424w, https://substackcdn.com/image/fetch/$s_!Rj22!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8719bdd3-657e-4bee-9f1a-d7aefcc17fd7_978x78.png 848w, https://substackcdn.com/image/fetch/$s_!Rj22!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8719bdd3-657e-4bee-9f1a-d7aefcc17fd7_978x78.png 1272w, https://substackcdn.com/image/fetch/$s_!Rj22!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8719bdd3-657e-4bee-9f1a-d7aefcc17fd7_978x78.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here <em>&#952;</em> denotes the parameters of the dynamics model <em>&#120495;</em>. 
Because we are interested in finding intrinsic reward for a given timestep, we can define:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z05T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e41ad0f-9d39-41a3-a252-937f4142b36d_412x54.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z05T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e41ad0f-9d39-41a3-a252-937f4142b36d_412x54.png 424w, https://substackcdn.com/image/fetch/$s_!z05T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e41ad0f-9d39-41a3-a252-937f4142b36d_412x54.png 848w, https://substackcdn.com/image/fetch/$s_!z05T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e41ad0f-9d39-41a3-a252-937f4142b36d_412x54.png 1272w, https://substackcdn.com/image/fetch/$s_!z05T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e41ad0f-9d39-41a3-a252-937f4142b36d_412x54.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z05T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e41ad0f-9d39-41a3-a252-937f4142b36d_412x54.png" width="412" height="54" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e41ad0f-9d39-41a3-a252-937f4142b36d_412x54.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:54,&quot;width&quot;:412,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7053,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z05T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e41ad0f-9d39-41a3-a252-937f4142b36d_412x54.png 424w, https://substackcdn.com/image/fetch/$s_!z05T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e41ad0f-9d39-41a3-a252-937f4142b36d_412x54.png 848w, https://substackcdn.com/image/fetch/$s_!z05T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e41ad0f-9d39-41a3-a252-937f4142b36d_412x54.png 1272w, https://substackcdn.com/image/fetch/$s_!z05T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e41ad0f-9d39-41a3-a252-937f4142b36d_412x54.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Unfortunately, the KL involves computing the posterior <em>p(&#952;|s<sub>t+1</sub>)</em>, which is generally <a href="https://thaihungle.github.io/lectures/VI/1-Introduction.pdf">intractable</a>. Hence, the authors in [11] turn to <a href="https://thaihungle.github.io/lectures/2022-15-09-VI">variational inference</a> for approximating the posterior. 
Specifically, they introduce a variational distribution <em>q(&#952;; &#120601;)</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1dtt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb79c0d-6956-4ae2-b962-7d8e5bab38ad_3080x572.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1dtt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb79c0d-6956-4ae2-b962-7d8e5bab38ad_3080x572.png 424w, https://substackcdn.com/image/fetch/$s_!1dtt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb79c0d-6956-4ae2-b962-7d8e5bab38ad_3080x572.png 848w, https://substackcdn.com/image/fetch/$s_!1dtt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb79c0d-6956-4ae2-b962-7d8e5bab38ad_3080x572.png 1272w, https://substackcdn.com/image/fetch/$s_!1dtt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb79c0d-6956-4ae2-b962-7d8e5bab38ad_3080x572.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1dtt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb79c0d-6956-4ae2-b962-7d8e5bab38ad_3080x572.png" width="636" height="117.93956043956044" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8bb79c0d-6956-4ae2-b962-7d8e5bab38ad_3080x572.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:270,&quot;width&quot;:1456,&quot;resizeWidth&quot;:636,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1dtt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb79c0d-6956-4ae2-b962-7d8e5bab38ad_3080x572.png 424w, https://substackcdn.com/image/fetch/$s_!1dtt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb79c0d-6956-4ae2-b962-7d8e5bab38ad_3080x572.png 848w, https://substackcdn.com/image/fetch/$s_!1dtt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb79c0d-6956-4ae2-b962-7d8e5bab38ad_3080x572.png 1272w, https://substackcdn.com/image/fetch/$s_!1dtt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb79c0d-6956-4ae2-b962-7d8e5bab38ad_3080x572.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>&#120601;</em> is the variational parameter. This is equivalent to parameterizing the dynamics model as a <a href="https://link.springer.com/book/10.1007/978-1-4612-0745-0">Bayesian neural network</a> (BNN) whose weight distribution is a fully factorized Gaussian. 
The network can be trained by the following minimization:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZpzQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9167e869-0001-49f0-adf5-d9d75134b544_759x108.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZpzQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9167e869-0001-49f0-adf5-d9d75134b544_759x108.png 424w, https://substackcdn.com/image/fetch/$s_!ZpzQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9167e869-0001-49f0-adf5-d9d75134b544_759x108.png 848w, https://substackcdn.com/image/fetch/$s_!ZpzQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9167e869-0001-49f0-adf5-d9d75134b544_759x108.png 1272w, https://substackcdn.com/image/fetch/$s_!ZpzQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9167e869-0001-49f0-adf5-d9d75134b544_759x108.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZpzQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9167e869-0001-49f0-adf5-d9d75134b544_759x108.png" width="652" height="92.77470355731225" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9167e869-0001-49f0-adf5-d9d75134b544_759x108.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:108,&quot;width&quot;:759,&quot;resizeWidth&quot;:652,&quot;bytes&quot;:28238,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZpzQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9167e869-0001-49f0-adf5-d9d75134b544_759x108.png 424w, https://substackcdn.com/image/fetch/$s_!ZpzQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9167e869-0001-49f0-adf5-d9d75134b544_759x108.png 848w, https://substackcdn.com/image/fetch/$s_!ZpzQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9167e869-0001-49f0-adf5-d9d75134b544_759x108.png 1272w, https://substackcdn.com/image/fetch/$s_!ZpzQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9167e869-0001-49f0-adf5-d9d75134b544_759x108.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Finally, we can use the trained  variational distribution to compute the intrinsic reward or &#128073;<strong>Bayesian information gain</strong>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!0XLA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7030067-fbe2-40f8-a77a-7965955b8ec4_320x56.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0XLA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7030067-fbe2-40f8-a77a-7965955b8ec4_320x56.png 424w, https://substackcdn.com/image/fetch/$s_!0XLA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7030067-fbe2-40f8-a77a-7965955b8ec4_320x56.png 848w, https://substackcdn.com/image/fetch/$s_!0XLA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7030067-fbe2-40f8-a77a-7965955b8ec4_320x56.png 1272w, https://substackcdn.com/image/fetch/$s_!0XLA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7030067-fbe2-40f8-a77a-7965955b8ec4_320x56.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0XLA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7030067-fbe2-40f8-a77a-7965955b8ec4_320x56.png" width="320" height="56" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7030067-fbe2-40f8-a77a-7965955b8ec4_320x56.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:56,&quot;width&quot;:320,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5787,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0XLA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7030067-fbe2-40f8-a77a-7965955b8ec4_320x56.png 424w, https://substackcdn.com/image/fetch/$s_!0XLA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7030067-fbe2-40f8-a77a-7965955b8ec4_320x56.png 848w, https://substackcdn.com/image/fetch/$s_!0XLA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7030067-fbe2-40f8-a77a-7965955b8ec4_320x56.png 1272w, https://substackcdn.com/image/fetch/$s_!0XLA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7030067-fbe2-40f8-a77a-7965955b8ec4_320x56.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Training a BNN is complicated, so Achiam and Sastry (2017) propose a different Bayesian view on surprise [12]. 
First, they formulate the objective of the RL agent as jointly maximizing expected return and surprise:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X14Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63675ba-3afb-463e-9465-7be8722e3c6f_420x55.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X14Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63675ba-3afb-463e-9465-7be8722e3c6f_420x55.png 424w, https://substackcdn.com/image/fetch/$s_!X14Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63675ba-3afb-463e-9465-7be8722e3c6f_420x55.png 848w, https://substackcdn.com/image/fetch/$s_!X14Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63675ba-3afb-463e-9465-7be8722e3c6f_420x55.png 1272w, https://substackcdn.com/image/fetch/$s_!X14Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63675ba-3afb-463e-9465-7be8722e3c6f_420x55.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X14Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63675ba-3afb-463e-9465-7be8722e3c6f_420x55.png" width="420" height="55" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d63675ba-3afb-463e-9465-7be8722e3c6f_420x55.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:55,&quot;width&quot;:420,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:11279,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X14Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63675ba-3afb-463e-9465-7be8722e3c6f_420x55.png 424w, https://substackcdn.com/image/fetch/$s_!X14Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63675ba-3afb-463e-9465-7be8722e3c6f_420x55.png 848w, https://substackcdn.com/image/fetch/$s_!X14Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63675ba-3afb-463e-9465-7be8722e3c6f_420x55.png 1272w, https://substackcdn.com/image/fetch/$s_!X14Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63675ba-3afb-463e-9465-7be8722e3c6f_420x55.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>P</em> is the true dynamics model and <em>P<sub>&#120601;</sub></em> is the learned dynamics model. It is natural to model surprise as the expected divergence between the true dynamics and what the learned model predicts. 
The objective can be translated to maximizing the bonus reward per step:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JCs6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12cb7e1-384f-4b5e-95c8-5016db34a1a4_610x38.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JCs6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12cb7e1-384f-4b5e-95c8-5016db34a1a4_610x38.png 424w, https://substackcdn.com/image/fetch/$s_!JCs6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12cb7e1-384f-4b5e-95c8-5016db34a1a4_610x38.png 848w, https://substackcdn.com/image/fetch/$s_!JCs6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12cb7e1-384f-4b5e-95c8-5016db34a1a4_610x38.png 1272w, https://substackcdn.com/image/fetch/$s_!JCs6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12cb7e1-384f-4b5e-95c8-5016db34a1a4_610x38.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JCs6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12cb7e1-384f-4b5e-95c8-5016db34a1a4_610x38.png" width="610" height="38" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c12cb7e1-384f-4b5e-95c8-5016db34a1a4_610x38.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:38,&quot;width&quot;:610,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10878,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JCs6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12cb7e1-384f-4b5e-95c8-5016db34a1a4_610x38.png 424w, https://substackcdn.com/image/fetch/$s_!JCs6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12cb7e1-384f-4b5e-95c8-5016db34a1a4_610x38.png 848w, https://substackcdn.com/image/fetch/$s_!JCs6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12cb7e1-384f-4b5e-95c8-5016db34a1a4_610x38.png 1272w, https://substackcdn.com/image/fetch/$s_!JCs6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12cb7e1-384f-4b5e-95c8-5016db34a1a4_610x38.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In practice, we do not know <em>P</em>. 
Therefore, the authors propose an approximation:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lN_2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4544c119-8ec7-4fb6-8383-98bd631ece04_429x41.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lN_2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4544c119-8ec7-4fb6-8383-98bd631ece04_429x41.png 424w, https://substackcdn.com/image/fetch/$s_!lN_2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4544c119-8ec7-4fb6-8383-98bd631ece04_429x41.png 848w, https://substackcdn.com/image/fetch/$s_!lN_2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4544c119-8ec7-4fb6-8383-98bd631ece04_429x41.png 1272w, https://substackcdn.com/image/fetch/$s_!lN_2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4544c119-8ec7-4fb6-8383-98bd631ece04_429x41.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lN_2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4544c119-8ec7-4fb6-8383-98bd631ece04_429x41.png" width="429" height="41" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4544c119-8ec7-4fb6-8383-98bd631ece04_429x41.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:41,&quot;width&quot;:429,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7479,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lN_2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4544c119-8ec7-4fb6-8383-98bd631ece04_429x41.png 424w, https://substackcdn.com/image/fetch/$s_!lN_2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4544c119-8ec7-4fb6-8383-98bd631ece04_429x41.png 848w, https://substackcdn.com/image/fetch/$s_!lN_2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4544c119-8ec7-4fb6-8383-98bd631ece04_429x41.png 1272w, https://substackcdn.com/image/fetch/$s_!lN_2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4544c119-8ec7-4fb6-8383-98bd631ece04_429x41.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; This bonus term is similar to the prediction error, except that it measures the error in log probability rather than as the norm of the difference between the prediction and reality. 
</p></blockquote><p>Another proposed approximation is:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nZ-4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c577e0e-fbb4-41be-af6e-943e4068781e_624x39.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nZ-4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c577e0e-fbb4-41be-af6e-943e4068781e_624x39.png 424w, https://substackcdn.com/image/fetch/$s_!nZ-4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c577e0e-fbb4-41be-af6e-943e4068781e_624x39.png 848w, https://substackcdn.com/image/fetch/$s_!nZ-4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c577e0e-fbb4-41be-af6e-943e4068781e_624x39.png 1272w, https://substackcdn.com/image/fetch/$s_!nZ-4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c577e0e-fbb4-41be-af6e-943e4068781e_624x39.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nZ-4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c577e0e-fbb4-41be-af6e-943e4068781e_624x39.png" width="624" height="39" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c577e0e-fbb4-41be-af6e-943e4068781e_624x39.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:39,&quot;width&quot;:624,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10124,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nZ-4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c577e0e-fbb4-41be-af6e-943e4068781e_624x39.png 424w, https://substackcdn.com/image/fetch/$s_!nZ-4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c577e0e-fbb4-41be-af6e-943e4068781e_624x39.png 848w, https://substackcdn.com/image/fetch/$s_!nZ-4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c577e0e-fbb4-41be-af6e-943e4068781e_624x39.png 1272w, https://substackcdn.com/image/fetch/$s_!nZ-4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c577e0e-fbb4-41be-af6e-943e4068781e_624x39.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here, the intrinsic reward is essentially the learning progress expressed in log probability (&#128073;<strong>Bayesian learning progress</strong>). 
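</p><p>As a rough illustration, the two bonuses can be sketched in Python. This is a minimal sketch, not the authors' implementation: it assumes the first approximation is the negative log-likelihood of the observed next state under the current model, that the second is the log-likelihood gain of the updated model over an older snapshot, and that the dynamics model outputs a diagonal Gaussian. The function names and the numbers are hypothetical.</p><pre>
```python
import numpy as np


def gaussian_log_prob(x, mean, std):
    """Log density of a diagonal Gaussian, summed over state dimensions."""
    x, mean, std = (np.asarray(v, dtype=float) for v in (x, mean, std))
    z = (x - mean) / std
    return float(np.sum(-0.5 * z ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)))


def surprisal_bonus(next_state, pred_mean, pred_std):
    """Assumed form of the first bonus: negative log-likelihood of s'."""
    return -gaussian_log_prob(next_state, pred_mean, pred_std)


def learning_progress_bonus(next_state, new_pred, old_pred):
    """Assumed form of the second bonus: log-likelihood gain after an update."""
    return (gaussian_log_prob(next_state, *new_pred)
            - gaussian_log_prob(next_state, *old_pred))


# Hypothetical predictions for one transition (s, a, s').
next_state = np.array([1.0, 0.0])
old_pred = (np.array([0.0, 0.0]), np.array([1.0, 1.0]))  # (mean, std) before the update
new_pred = (np.array([0.8, 0.0]), np.array([1.0, 1.0]))  # (mean, std) after the update

print(surprisal_bonus(next_state, *new_pred))
print(learning_progress_bonus(next_state, new_pred, old_pred))
```
</pre><p>In this toy case the update moves the predicted mean toward the observed state, so the learning-progress bonus is positive. A model that has stopped improving on a transition would receive a bonus near zero, even if its surprisal on that transition stayed high.</p><p>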
To train the dynamics model <em>P</em>&#120601;, the authors solve the constrained optimization: </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-tYR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33d757-e421-4bd3-b4d5-dc5b75db61d3_3160x624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-tYR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33d757-e421-4bd3-b4d5-dc5b75db61d3_3160x624.png 424w, https://substackcdn.com/image/fetch/$s_!-tYR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33d757-e421-4bd3-b4d5-dc5b75db61d3_3160x624.png 848w, https://substackcdn.com/image/fetch/$s_!-tYR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33d757-e421-4bd3-b4d5-dc5b75db61d3_3160x624.png 1272w, https://substackcdn.com/image/fetch/$s_!-tYR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33d757-e421-4bd3-b4d5-dc5b75db61d3_3160x624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-tYR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33d757-e421-4bd3-b4d5-dc5b75db61d3_3160x624.png" width="1456" height="288" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c33d757-e421-4bd3-b4d5-dc5b75db61d3_3160x624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:288,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-tYR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33d757-e421-4bd3-b4d5-dc5b75db61d3_3160x624.png 424w, https://substackcdn.com/image/fetch/$s_!-tYR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33d757-e421-4bd3-b4d5-dc5b75db61d3_3160x624.png 848w, https://substackcdn.com/image/fetch/$s_!-tYR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33d757-e421-4bd3-b4d5-dc5b75db61d3_3160x624.png 1272w, https://substackcdn.com/image/fetch/$s_!-tYR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33d757-e421-4bd3-b4d5-dc5b75db61d3_3160x624.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; By introducing the KL constraint, the posterior model is prevented from diverging too far from the prior, thereby preventing the generation of unstable intrinsic rewards.</p></blockquote><p>The Bayesian surprises studied above are defined by a specific dynamics model. 
&#129504; <em>What happens if we consider a distribution of models?</em> In [13], the authors formulate Bayesian surprise as Information Gain given a policy <em>&#960;</em> as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fT_m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d23a80d-0f9e-496c-bf82-8ae387153d72_845x208.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fT_m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d23a80d-0f9e-496c-bf82-8ae387153d72_845x208.gif 424w, https://substackcdn.com/image/fetch/$s_!fT_m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d23a80d-0f9e-496c-bf82-8ae387153d72_845x208.gif 848w, https://substackcdn.com/image/fetch/$s_!fT_m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d23a80d-0f9e-496c-bf82-8ae387153d72_845x208.gif 1272w, https://substackcdn.com/image/fetch/$s_!fT_m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d23a80d-0f9e-496c-bf82-8ae387153d72_845x208.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fT_m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d23a80d-0f9e-496c-bf82-8ae387153d72_845x208.gif" width="845" height="208" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d23a80d-0f9e-496c-bf82-8ae387153d72_845x208.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:208,&quot;width&quot;:845,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:158449,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fT_m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d23a80d-0f9e-496c-bf82-8ae387153d72_845x208.gif 424w, https://substackcdn.com/image/fetch/$s_!fT_m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d23a80d-0f9e-496c-bf82-8ae387153d72_845x208.gif 848w, https://substackcdn.com/image/fetch/$s_!fT_m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d23a80d-0f9e-496c-bf82-8ae387153d72_845x208.gif 1272w, https://substackcdn.com/image/fetch/$s_!fT_m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d23a80d-0f9e-496c-bf82-8ae387153d72_845x208.gif 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>P(T)</em> is the transition distribution of the environment and <em>P(T|</em>&#120601;<em>) </em>is the transition distribution according to the dynamics model. 
The term <em>u(s,a), </em>which  models some form of &#128073;<strong>Bayesian disagreement</strong>, turns out to be the <a href="https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence">Jensen-Shannon Divergence</a> of a set of learned dynamics from a transition dynamics <em>t</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PDql!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43e5a569-6820-445e-913a-5eba8a62bc7e_451x117.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PDql!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43e5a569-6820-445e-913a-5eba8a62bc7e_451x117.png 424w, https://substackcdn.com/image/fetch/$s_!PDql!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43e5a569-6820-445e-913a-5eba8a62bc7e_451x117.png 848w, https://substackcdn.com/image/fetch/$s_!PDql!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43e5a569-6820-445e-913a-5eba8a62bc7e_451x117.png 1272w, https://substackcdn.com/image/fetch/$s_!PDql!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43e5a569-6820-445e-913a-5eba8a62bc7e_451x117.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PDql!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43e5a569-6820-445e-913a-5eba8a62bc7e_451x117.png" width="451" height="117" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43e5a569-6820-445e-913a-5eba8a62bc7e_451x117.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:117,&quot;width&quot;:451,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26411,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PDql!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43e5a569-6820-445e-913a-5eba8a62bc7e_451x117.png 424w, https://substackcdn.com/image/fetch/$s_!PDql!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43e5a569-6820-445e-913a-5eba8a62bc7e_451x117.png 848w, https://substackcdn.com/image/fetch/$s_!PDql!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43e5a569-6820-445e-913a-5eba8a62bc7e_451x117.png 1272w, https://substackcdn.com/image/fetch/$s_!PDql!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43e5a569-6820-445e-913a-5eba8a62bc7e_451x117.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>P(S|s,a,t)</em> is the dynamics model learned from a transition dynamics <em>t</em>.  Here, we used the property that Jensen-Shannon divergence is the entropy of the average minus the average of the entropies. 
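</p><p>The "entropy of the average minus the average of the entropies" identity can be checked numerically. The sketch below is only an illustration under simplifying assumptions that are not in the original: the next state is discrete, and a handful of hypothetical predictive distributions are weighted uniformly. The helper names are made up for this example.</p><pre>
```python
import numpy as np


def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p != 0]  # treat 0 * log 0 as 0
    return float(-np.sum(nonzero * np.log(nonzero)))


def ensemble_jsd(predictions):
    """Entropy of the average prediction minus the average entropy."""
    predictions = np.asarray(predictions, dtype=float)
    mixture = predictions.mean(axis=0)  # uniform weights over models
    mean_entropy = float(np.mean([entropy(p) for p in predictions]))
    return entropy(mixture) - mean_entropy


# Two hypothetical models that agree, and two that fully disagree.
agree = [[0.9, 0.1, 0.0], [0.9, 0.1, 0.0]]
disagree = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(ensemble_jsd(agree))     # close to zero: no disagreement
print(ensemble_jsd(disagree))  # close to log(2): maximal for two models
```
</pre><p>The divergence vanishes when every model predicts the same distribution and grows as the predictions spread apart, which is why it can serve as a disagreement signal.</p><p>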
In practice, this JSD can be approximated by employing <em>N</em> dynamics models:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!18tq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac9d8c69-e89c-4b2f-8240-2b6e9de54373_402x179.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!18tq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac9d8c69-e89c-4b2f-8240-2b6e9de54373_402x179.png 424w, https://substackcdn.com/image/fetch/$s_!18tq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac9d8c69-e89c-4b2f-8240-2b6e9de54373_402x179.png 848w, https://substackcdn.com/image/fetch/$s_!18tq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac9d8c69-e89c-4b2f-8240-2b6e9de54373_402x179.png 1272w, https://substackcdn.com/image/fetch/$s_!18tq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac9d8c69-e89c-4b2f-8240-2b6e9de54373_402x179.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!18tq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac9d8c69-e89c-4b2f-8240-2b6e9de54373_402x179.png" width="402" height="179" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac9d8c69-e89c-4b2f-8240-2b6e9de54373_402x179.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:179,&quot;width&quot;:402,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22919,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!18tq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac9d8c69-e89c-4b2f-8240-2b6e9de54373_402x179.png 424w, https://substackcdn.com/image/fetch/$s_!18tq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac9d8c69-e89c-4b2f-8240-2b6e9de54373_402x179.png 848w, https://substackcdn.com/image/fetch/$s_!18tq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac9d8c69-e89c-4b2f-8240-2b6e9de54373_402x179.png 1272w, https://substackcdn.com/image/fetch/$s_!18tq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac9d8c69-e89c-4b2f-8240-2b6e9de54373_402x179.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>When each <em>P</em> is parameterized by a Gaussian distribution <em>&#120029;<sub>i</sub>(&#181;<sub>i</sub>,&#931;<sub>i</sub>)</em>, we need another layer of approximation to compute <em>u(s,a)</em>. 
Particularly, the authors propose to replace the Shannon entropy with <a href="https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy">R&#233;nyi entropy</a> and use the corresponding Jensen-R&#233;nyi Divergence (JRD):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e_sQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91728337-bc7c-4aa2-8ab7-f7af30551283_538x457.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e_sQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91728337-bc7c-4aa2-8ab7-f7af30551283_538x457.png 424w, https://substackcdn.com/image/fetch/$s_!e_sQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91728337-bc7c-4aa2-8ab7-f7af30551283_538x457.png 848w, https://substackcdn.com/image/fetch/$s_!e_sQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91728337-bc7c-4aa2-8ab7-f7af30551283_538x457.png 1272w, https://substackcdn.com/image/fetch/$s_!e_sQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91728337-bc7c-4aa2-8ab7-f7af30551283_538x457.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e_sQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91728337-bc7c-4aa2-8ab7-f7af30551283_538x457.png" width="538" height="457" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/91728337-bc7c-4aa2-8ab7-f7af30551283_538x457.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:457,&quot;width&quot;:538,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60582,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e_sQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91728337-bc7c-4aa2-8ab7-f7af30551283_538x457.png 424w, https://substackcdn.com/image/fetch/$s_!e_sQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91728337-bc7c-4aa2-8ab7-f7af30551283_538x457.png 848w, https://substackcdn.com/image/fetch/$s_!e_sQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91728337-bc7c-4aa2-8ab7-f7af30551283_538x457.png 1272w, https://substackcdn.com/image/fetch/$s_!e_sQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91728337-bc7c-4aa2-8ab7-f7af30551283_538x457.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><blockquote><p>&#10060; While these Bayesian views are theoretically rigorous, they require either intricate computation of the posterior over dynamics models or the maintenance of multiple models throughout training. They also rely on many assumptions and approximations that may not hold in practice.</p></blockquote><div><hr></div><h2>Hybrid Approaches</h2><p>It didn't take long for researchers to recognize the potential of marrying surprise and novelty. Combining the two is a potent strategy for enhancing the exploration capabilities of reinforcement learning agents. 
</p><h4>Surprise + Novelty</h4><p>The &#128073;<strong>Never Give Up</strong> agent (NGU), introduced in [17], cleverly combines existing surprise and novelty components from the literature:</p><ol><li><p>State representation learning via inverse dynamics [8]</p></li><li><p>Life-long novelty module using RND [16]</p></li><li><p>Episodic novelty using episodic memory inspired by EC [4]. Note that the implementation of the memory in NGU is new. </p></li></ol><p>This combination results in a very powerful exploration architecture, as depicted below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5R-B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea1e9aa-a238-45df-8984-9f21098c257e_915x361.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5R-B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea1e9aa-a238-45df-8984-9f21098c257e_915x361.gif 424w, https://substackcdn.com/image/fetch/$s_!5R-B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea1e9aa-a238-45df-8984-9f21098c257e_915x361.gif 848w, https://substackcdn.com/image/fetch/$s_!5R-B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea1e9aa-a238-45df-8984-9f21098c257e_915x361.gif 1272w, https://substackcdn.com/image/fetch/$s_!5R-B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea1e9aa-a238-45df-8984-9f21098c257e_915x361.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!5R-B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea1e9aa-a238-45df-8984-9f21098c257e_915x361.gif" width="915" height="361" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ea1e9aa-a238-45df-8984-9f21098c257e_915x361.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:361,&quot;width&quot;:915,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:283217,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5R-B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea1e9aa-a238-45df-8984-9f21098c257e_915x361.gif 424w, https://substackcdn.com/image/fetch/$s_!5R-B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea1e9aa-a238-45df-8984-9f21098c257e_915x361.gif 848w, https://substackcdn.com/image/fetch/$s_!5R-B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea1e9aa-a238-45df-8984-9f21098c257e_915x361.gif 1272w, https://substackcdn.com/image/fetch/$s_!5R-B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea1e9aa-a238-45df-8984-9f21098c257e_915x361.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Never Give Up: The dynamics model <em>f</em> is employed to produce the representations for the novelty modules. Two types of novelty are combined to produce the final intrinsic reward. Image taken from [17].</figcaption></figure></div><p>The most interesting component is the episodic novelty, which encourages the exploration of novel states within an episode simply via nearest-neighbor matching. As a result, the agent is discouraged from revisiting the same state within an episode. This concept differs from lifelong novelty, as explained by the authors:</p><div class="pullquote"><p>The episodic intrinsic reward promotes the agent to visit as many different states as possible within a single episode. 
This means that the notion of novelty ignores inter-episode interactions: a state that has been visited thousands of times gives the same intrinsic reward as a completely new state as long as they are equally novel given the history of the current episode.</p><p>A life-long (or inter-episodic) novelty module provides a long-term novelty signal to statefully control the amount of exploration across episodes</p><p>&#8212;Text taken from [17]&#8212;</p></div><p>The authors implement the episodic intrinsic reward as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!thNW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e86b89c-6739-4177-902e-14b20161f2a4_857x145.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!thNW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e86b89c-6739-4177-902e-14b20161f2a4_857x145.png 424w, https://substackcdn.com/image/fetch/$s_!thNW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e86b89c-6739-4177-902e-14b20161f2a4_857x145.png 848w, https://substackcdn.com/image/fetch/$s_!thNW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e86b89c-6739-4177-902e-14b20161f2a4_857x145.png 1272w, https://substackcdn.com/image/fetch/$s_!thNW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e86b89c-6739-4177-902e-14b20161f2a4_857x145.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!thNW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e86b89c-6739-4177-902e-14b20161f2a4_857x145.png" width="574" height="97.11785297549592" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e86b89c-6739-4177-902e-14b20161f2a4_857x145.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:145,&quot;width&quot;:857,&quot;resizeWidth&quot;:574,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!thNW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e86b89c-6739-4177-902e-14b20161f2a4_857x145.png 424w, https://substackcdn.com/image/fetch/$s_!thNW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e86b89c-6739-4177-902e-14b20161f2a4_857x145.png 848w, https://substackcdn.com/image/fetch/$s_!thNW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e86b89c-6739-4177-902e-14b20161f2a4_857x145.png 1272w, https://substackcdn.com/image/fetch/$s_!thNW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e86b89c-6739-4177-902e-14b20161f2a4_857x145.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>N<sub>k</sub></em> is the set of <em>k</em> nearest neighbors of the current state representation, retrieved from the episodic memory and used to compute the weighted sum of similarities that forms the reward. 
The closer the current state is to its neighbors, the higher the similarity and thus the smaller the reward. <em>K</em> is a function measuring similarity, and <em>c</em> is a small hyperparameter that guards against division by zero.</p><p>The final intrinsic reward is:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UEDE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b7157-3250-45bd-89a3-5e35039b43df_4208x704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UEDE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b7157-3250-45bd-89a3-5e35039b43df_4208x704.png 424w, https://substackcdn.com/image/fetch/$s_!UEDE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b7157-3250-45bd-89a3-5e35039b43df_4208x704.png 848w, https://substackcdn.com/image/fetch/$s_!UEDE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b7157-3250-45bd-89a3-5e35039b43df_4208x704.png 1272w, https://substackcdn.com/image/fetch/$s_!UEDE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b7157-3250-45bd-89a3-5e35039b43df_4208x704.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UEDE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b7157-3250-45bd-89a3-5e35039b43df_4208x704.png" width="1456" height="244" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d54b7157-3250-45bd-89a3-5e35039b43df_4208x704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UEDE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b7157-3250-45bd-89a3-5e35039b43df_4208x704.png 424w, https://substackcdn.com/image/fetch/$s_!UEDE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b7157-3250-45bd-89a3-5e35039b43df_4208x704.png 848w, https://substackcdn.com/image/fetch/$s_!UEDE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b7157-3250-45bd-89a3-5e35039b43df_4208x704.png 1272w, https://substackcdn.com/image/fetch/$s_!UEDE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b7157-3250-45bd-89a3-5e35039b43df_4208x704.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>L</em> is a hyperparameter and <em>&#963;<sub>e</sub></em> and <em>&#181;<sub>e</sub></em> are running standard deviation and mean for <em>RND(x)</em>, contributing to normalizing the RND&#8217;s rewards. 
By combining episodic and lifelong novelty, the NGU agent quickly learns to avoid revisiting the same state within a single episode and gradually moves away from states that have been encountered frequently across multiple episodes.</p><p>Later, the authors proposed an upgraded version of NGU, called &#128073;<strong>Agent57 [18]</strong>, with several improvements:</p><ol><li><p>The value function is split into two separate value functions, one for extrinsic and one for intrinsic rewards. </p></li><li><p>A population of policies (and value functions) is trained, each characterized by a distinct pair of exploration parameters:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YyBP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe801d0ba-9f8e-4a77-9155-e41c45ef69de_397x101.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YyBP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe801d0ba-9f8e-4a77-9155-e41c45ef69de_397x101.png 424w, https://substackcdn.com/image/fetch/$s_!YyBP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe801d0ba-9f8e-4a77-9155-e41c45ef69de_397x101.png 848w, https://substackcdn.com/image/fetch/$s_!YyBP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe801d0ba-9f8e-4a77-9155-e41c45ef69de_397x101.png 1272w, https://substackcdn.com/image/fetch/$s_!YyBP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe801d0ba-9f8e-4a77-9155-e41c45ef69de_397x101.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!YyBP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe801d0ba-9f8e-4a77-9155-e41c45ef69de_397x101.png" width="235" height="59.785894206549116" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e801d0ba-9f8e-4a77-9155-e41c45ef69de_397x101.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:101,&quot;width&quot;:397,&quot;resizeWidth&quot;:235,&quot;bytes&quot;:12419,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YyBP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe801d0ba-9f8e-4a77-9155-e41c45ef69de_397x101.png 424w, https://substackcdn.com/image/fetch/$s_!YyBP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe801d0ba-9f8e-4a77-9155-e41c45ef69de_397x101.png 848w, https://substackcdn.com/image/fetch/$s_!YyBP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe801d0ba-9f8e-4a77-9155-e41c45ef69de_397x101.png 1272w, https://substackcdn.com/image/fetch/$s_!YyBP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe801d0ba-9f8e-4a77-9155-e41c45ef69de_397x101.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>N</em> is the size of the population. 
&#120574;<sub>j</sub> is the discount factor and <em>&#946;<sub>j</sub></em> is the intrinsic reward coefficient of the <em>j</em>-th policy. Which pair to use is determined adaptively during training by a <a href="https://arxiv.org/pdf/0805.3415.pdf">sliding-window UCB bandit algorithm</a> (the meta-controller): the policy corresponding to index <em>j</em> is trained and evaluated to compute a performance metric that the meta-controller maximizes. </p></li></ol><blockquote><p>&#128064; Agent57 is the first method to outperform the human benchmark on all 57 Atari games, as reported in the following figure:</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KUdT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4ee348-96de-4192-8e58-1d7032efe9b9_878x592.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KUdT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4ee348-96de-4192-8e58-1d7032efe9b9_878x592.png 424w, https://substackcdn.com/image/fetch/$s_!KUdT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4ee348-96de-4192-8e58-1d7032efe9b9_878x592.png 848w, https://substackcdn.com/image/fetch/$s_!KUdT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4ee348-96de-4192-8e58-1d7032efe9b9_878x592.png 1272w, https://substackcdn.com/image/fetch/$s_!KUdT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4ee348-96de-4192-8e58-1d7032efe9b9_878x592.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!KUdT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4ee348-96de-4192-8e58-1d7032efe9b9_878x592.png" width="590" height="397.8132118451025" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf4ee348-96de-4192-8e58-1d7032efe9b9_878x592.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:592,&quot;width&quot;:878,&quot;resizeWidth&quot;:590,&quot;bytes&quot;:78194,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KUdT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4ee348-96de-4192-8e58-1d7032efe9b9_878x592.png 424w, https://substackcdn.com/image/fetch/$s_!KUdT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4ee348-96de-4192-8e58-1d7032efe9b9_878x592.png 848w, https://substackcdn.com/image/fetch/$s_!KUdT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4ee348-96de-4192-8e58-1d7032efe9b9_878x592.png 1272w, https://substackcdn.com/image/fetch/$s_!KUdT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf4ee348-96de-4192-8e58-1d7032efe9b9_878x592.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Agent57 outperforms humans in 57 Atari games after 1e10 training frames. Image taken from [18].</figcaption></figure></div><p>This hybrid approach incorporates separate modules for surprise and novelty. &#129504; <em>Are there alternative methods to blend the two concepts?</em></p><h4>Novelty of Surprise</h4><p>Recently, Le et al. (2024) proposed a new approach that combines surprise and novelty in a single exploration architecture [15]. The authors start from the observation that the norm of the prediction error (the surprise norm) is an unreliable incentive on its own, because a high surprise norm may not correspond to a meaningful event (e.g., as in the Noisy-TV problem). 
They propose a new metric, namely the novelty of the surprise or &#128073;<strong>surprise novelty</strong>. To identify surprise novelty, the agent needs to compare the current surprise with surprises in past encounters and use the novelty of the surprise as the intrinsic reward rather than its norm.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ftb5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01f9f6-b82e-47c2-881c-024d6b18929c_905x345.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ftb5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01f9f6-b82e-47c2-881c-024d6b18929c_905x345.png 424w, https://substackcdn.com/image/fetch/$s_!Ftb5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01f9f6-b82e-47c2-881c-024d6b18929c_905x345.png 848w, https://substackcdn.com/image/fetch/$s_!Ftb5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01f9f6-b82e-47c2-881c-024d6b18929c_905x345.png 1272w, https://substackcdn.com/image/fetch/$s_!Ftb5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01f9f6-b82e-47c2-881c-024d6b18929c_905x345.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ftb5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01f9f6-b82e-47c2-881c-024d6b18929c_905x345.png" width="602" height="229.49171270718233" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a01f9f6-b82e-47c2-881c-024d6b18929c_905x345.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:345,&quot;width&quot;:905,&quot;resizeWidth&quot;:602,&quot;bytes&quot;:59290,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ftb5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01f9f6-b82e-47c2-881c-024d6b18929c_905x345.png 424w, https://substackcdn.com/image/fetch/$s_!Ftb5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01f9f6-b82e-47c2-881c-024d6b18929c_905x345.png 848w, https://substackcdn.com/image/fetch/$s_!Ftb5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01f9f6-b82e-47c2-881c-024d6b18929c_905x345.png 1272w, https://substackcdn.com/image/fetch/$s_!Ftb5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01f9f6-b82e-47c2-881c-024d6b18929c_905x345.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Surprise norm vs. surprise novelty in Montezuma&#8217;s Revenge. Interesting states are better reflected by the surprise novelty. Images taken from [15].</figcaption></figure></div><p>Based on that principle, they apply novelty-based exploration in surprise space rather than state space. 
This requires a surprise generator, such as a dynamics model, to produce the surprise vector <em>u</em>, i.e., the difference between the prediction and the actual outcome. Then, inter- and intra-episode novelty scores are estimated by a memory system called Surprise Memory (SM), which consists of an autoencoder network <em>W</em> and an episodic memory <em>M</em>, respectively. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dQ1G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee73c68-0fd7-45b4-92d4-732b1ed9abc1_945x315.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dQ1G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee73c68-0fd7-45b4-92d4-732b1ed9abc1_945x315.png 424w, https://substackcdn.com/image/fetch/$s_!dQ1G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee73c68-0fd7-45b4-92d4-732b1ed9abc1_945x315.png 848w, https://substackcdn.com/image/fetch/$s_!dQ1G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee73c68-0fd7-45b4-92d4-732b1ed9abc1_945x315.png 1272w, https://substackcdn.com/image/fetch/$s_!dQ1G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee73c68-0fd7-45b4-92d4-732b1ed9abc1_945x315.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dQ1G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee73c68-0fd7-45b4-92d4-732b1ed9abc1_945x315.png" width="606" height="202" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fee73c68-0fd7-45b4-92d4-732b1ed9abc1_945x315.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:315,&quot;width&quot;:945,&quot;resizeWidth&quot;:606,&quot;bytes&quot;:47620,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dQ1G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee73c68-0fd7-45b4-92d4-732b1ed9abc1_945x315.png 424w, https://substackcdn.com/image/fetch/$s_!dQ1G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee73c68-0fd7-45b4-92d4-732b1ed9abc1_945x315.png 848w, https://substackcdn.com/image/fetch/$s_!dQ1G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee73c68-0fd7-45b4-92d4-732b1ed9abc1_945x315.png 1272w, https://substackcdn.com/image/fetch/$s_!dQ1G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee73c68-0fd7-45b4-92d4-732b1ed9abc1_945x315.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Surprise novelty is estimated through surprise memory architecture (SM). Image taken from [15]. </figcaption></figure></div><p>This design is similar to NGU with lifelong and episodic novelty modules. 
However, the implementation introduces some new ideas:</p><ol><li><p>Episodic memory retrieval is trainable through an attention mechanism, rather than relying on fixed nearest-neighbor retrieval, which allows flexible adaptation to specific tasks and environments.</p></li><li><p>The inter-episode novelty is estimated by an autoencoder rather than RND. </p></li><li><p>The novelty operates on surprise space rather than observation space. </p></li></ol><p>The workflow of the two memory modules is summarized below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8wGe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce43257a-6fba-437d-8432-2bb62026f4f7_537x270.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8wGe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce43257a-6fba-437d-8432-2bb62026f4f7_537x270.png 424w, https://substackcdn.com/image/fetch/$s_!8wGe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce43257a-6fba-437d-8432-2bb62026f4f7_537x270.png 848w, https://substackcdn.com/image/fetch/$s_!8wGe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce43257a-6fba-437d-8432-2bb62026f4f7_537x270.png 1272w, https://substackcdn.com/image/fetch/$s_!8wGe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce43257a-6fba-437d-8432-2bb62026f4f7_537x270.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!8wGe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce43257a-6fba-437d-8432-2bb62026f4f7_537x270.png" width="537" height="270" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce43257a-6fba-437d-8432-2bb62026f4f7_537x270.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:270,&quot;width&quot;:537,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80971,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8wGe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce43257a-6fba-437d-8432-2bb62026f4f7_537x270.png 424w, https://substackcdn.com/image/fetch/$s_!8wGe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce43257a-6fba-437d-8432-2bb62026f4f7_537x270.png 848w, https://substackcdn.com/image/fetch/$s_!8wGe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce43257a-6fba-437d-8432-2bb62026f4f7_537x270.png 1272w, https://substackcdn.com/image/fetch/$s_!8wGe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce43257a-6fba-437d-8432-2bb62026f4f7_537x270.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The gist of the algorithm is that the episodic surprise <em>u<sup>e</sup></em> is concatenated with the surprise <em>u</em> of the current observation, and the autoencoder <em>W</em> tries to reconstruct the concatenation to estimate its novelty. </p><blockquote><p>&#128064; The technique is very effective in a low-sample regime, where agents are trained on Atari games for only 50 million frames. 
</p></blockquote><p>&#129504; <em>Beyond novelty and surprise, what other strategies can enhance exploration in Reinforcement Learning?</em> Discover the answer in the third part of my post series.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://hungleai.substack.com/p/curious-agents-saga-part-3&quot;,&quot;text&quot;:&quot;Part 3&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://hungleai.substack.com/p/curious-agents-saga-part-3"><span>Part 3</span></a></p><div><hr></div><h2>References</h2><p>[1] Bellemare, Marc, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. "Unifying count-based exploration and intrinsic motivation." <em>Advances in neural information processing systems</em> 29 (2016).</p><p>[2] Tang, Haoran, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. "#Exploration: A study of count-based exploration for deep reinforcement learning." <em>Advances in neural information processing systems</em> 30 (2017).</p><p>[3] Parisi, Simone, Victoria Dean, Deepak Pathak, and Abhinav Gupta. "Interesting object, curious agent: Learning task-agnostic exploration." <em>Advances in Neural Information Processing Systems</em> 34 (2021): 20516-20530.</p><p>[4] Savinov, Nikolay, Anton Raichuk, Rapha&#235;l Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain Gelly. "Episodic curiosity through reachability." <em>arXiv preprint arXiv:1810.02274</em> (2018).</p><p>[5] Ekman, P., and Davidson, R. J. (eds.). (1994). <em>The Nature of Emotion: Fundamental Questions</em>. Oxford: Oxford University Press.</p><p>[6] Schmidhuber, J&#252;rgen. "A possibility for implementing curiosity and boredom in model-building neural controllers." In <em>Proc. of the international conference on simulation of adaptive behavior: From animals to animats</em>, pp. 222-227. 
1991.</p><p>[7] Stadie, Levine, Abbeel: Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models. In NIPS 2015. </p><p>[8] Pathak, Deepak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. "Curiosity-driven exploration by self-supervised prediction." In <em>International conference on machine learning</em>, pp. 2778-2787. PMLR, 2017.</p><p>[9] Pierre-Yves Oudeyer &amp; Frederic Kaplan. &#8220;How can we define intrinsic motivation?&#8221; Conf. on Epigenetic Robotics, 2008.</p><p>[10] Deepak Pathak, et al. &#8220;Self-Supervised Exploration via Disagreement.&#8221; In ICML 2019.</p><p>[11] Houthooft, Rein, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. "Vime: Variational information maximizing exploration." <em>Advances in neural information processing systems</em> 29 (2016).</p><p>[12] Joshua Achiam and Shankar Sastry. 2017. Surprise-based intrinsic motivation for deep reinforcement learning. arXiv preprint arXiv:1703.01732 (2017).</p><p>[13] Pranav Shyam, Wojciech Jaskowski, and Faustino Gomez. 2019. Model-Based Active Exploration. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. 5779&#8211;5788.</p><p>[14] Adityanarayanan Radhakrishnan, Mikhail Belkin, and Caroline Uhler. 2020. Overparameterized neural networks implement associative memory. Proceedings of the National Academy of Sciences 117, 44 (2020), 27162&#8211;27170.</p><p>[15] Le, Hung, Kien Do, Dung Nguyen, and Svetha Venkatesh. "Beyond Surprise: Improving Exploration Through Surprise Novelty." In AAMAS, 2024.</p><p>[16] Burda, Yuri, Harrison Edwards, Amos Storkey, and Oleg Klimov. "Exploration by random network distillation." In <em>International Conference on Learning Representations</em>. 2018.</p><p>[17] Badia, Adri&#224; Puigdom&#232;nech, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman et al. 
"Never give up: Learning directed exploration strategies." <em>arXiv preprint arXiv:2002.06038</em> (2020).</p><p>[18] Badia, Adri&#224; Puigdom&#232;nech, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan Daniel Guo, and Charles Blundell. "Agent57: Outperforming the atari human benchmark." In <em>International conference on machine learning</em>, pp. 507-517. PMLR, 2020.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://hungleai.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><strong>I hope you enjoy the article. Stay tuned for the newest and exclusive content by subscribing to Neurocoder Tales! </strong><em>Disclaimer:</em> <em>While every effort is made to provide accurate and unbiased information, errors may occur. 
Let me know if you catch any error.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Curious Agents Saga: Part 1 ]]></title><description><![CDATA[Legacy of Exploration in Reinforcement Learning]]></description><link>https://hungleai.substack.com/p/curious-agents-saga-part-1</link><guid isPermaLink="false">https://hungleai.substack.com/p/curious-agents-saga-part-1</guid><dc:creator><![CDATA[Hung Le]]></dc:creator><pubDate>Wed, 24 Jan 2024 09:30:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785b0f32-025d-4401-a508-e17642de06d4_1024x747.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h4>Table of Contents</h4><ul><li><p><a href="https://hungleai.substack.com/i/140853184/exploration-in-reinforcement-learning">Exploration in Reinforcement Learning</a></p></li><li><p><a href="https://hungleai.substack.com/i/140853184/classic-explorations">Classic Explorations</a></p><ul><li><p><a href="https://hungleai.substack.com/i/140853184/%CE%B5-greedy">&#949;-greedy</a></p></li><li><p><a href="https://hungleai.substack.com/i/140853184/upper-confidence-bound-ucb">Upper Confidence Bound (UCB)</a></p></li><li><p><a href="https://hungleai.substack.com/i/140853184/thompson-sampling">Thompson Sampling</a></p></li><li><p><a href="https://hungleai.substack.com/i/140853184/information-gain">Information Gain</a></p></li></ul></li><li><p><a href="https://hungleai.substack.com/i/140853184/primal-exploration-in-deep-reinforcement-learning">Primal Exploration in Deep Reinforcement Learning</a></p><ul><li><p><a 
href="https://hungleai.substack.com/i/140853184/entropy-maximization">Entropy Maximization</a></p></li><li><p><a href="https://hungleai.substack.com/i/140853184/noisy-networks">Noisy Networks</a></p></li></ul></li></ul><div><hr></div><h2>Exploration in Reinforcement Learning </h2><p>In reinforcement learning (RL), an agent interacts with the environment, taking actions <em>a</em>,  receiving a reward <em>r</em>, and moving to a new state <em>s</em>. The agent is tasked with maximizing the accumulated rewards or returns <em>R</em> over time by finding optimal actions (policy). </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L5vd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827043b9-3459-4339-8a9f-5268c08cb3fc_508x226.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L5vd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827043b9-3459-4339-8a9f-5268c08cb3fc_508x226.gif 424w, https://substackcdn.com/image/fetch/$s_!L5vd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827043b9-3459-4339-8a9f-5268c08cb3fc_508x226.gif 848w, https://substackcdn.com/image/fetch/$s_!L5vd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827043b9-3459-4339-8a9f-5268c08cb3fc_508x226.gif 1272w, https://substackcdn.com/image/fetch/$s_!L5vd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827043b9-3459-4339-8a9f-5268c08cb3fc_508x226.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!L5vd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827043b9-3459-4339-8a9f-5268c08cb3fc_508x226.gif" width="468" height="208.20472440944883" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/827043b9-3459-4339-8a9f-5268c08cb3fc_508x226.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:226,&quot;width&quot;:508,&quot;resizeWidth&quot;:468,&quot;bytes&quot;:11714,&quot;alt&quot;:&quot;Reinforcement Learning Loop&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Reinforcement Learning Loop" title="Reinforcement Learning Loop" srcset="https://substackcdn.com/image/fetch/$s_!L5vd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827043b9-3459-4339-8a9f-5268c08cb3fc_508x226.gif 424w, https://substackcdn.com/image/fetch/$s_!L5vd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827043b9-3459-4339-8a9f-5268c08cb3fc_508x226.gif 848w, https://substackcdn.com/image/fetch/$s_!L5vd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827043b9-3459-4339-8a9f-5268c08cb3fc_508x226.gif 1272w, https://substackcdn.com/image/fetch/$s_!L5vd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827043b9-3459-4339-8a9f-5268c08cb3fc_508x226.gif 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption 
class="image-caption">Reinforcement Learning Loop</figcaption></figure></div><p>To learn, the agent experiments with various actions and observes their consequences. It starts by exploring the environment with random actions and then determines which actions bring good rewards and which bring bad ones. Here, we will not focus on how the agent learns from its experiences in the environment. Rather, we are curious about the exploration strategies that trigger the learning. It comes as no surprise that not all explorations are equal. Consider the illustration below: purely random exploration (left figure), as often used in <a href="https://www.gatsby.ucl.ac.uk/~dayan/papers/cjch.pdf">Q-learning</a>, is inefficient, requiring countless interactions for the agent (yellow circle) to reach the goal (red diamond). Conversely, a strategic sweep of all possible areas (right figure) can save the agent significant time and effort in locating the goal. <em>&#129504; What makes a good exploration? </em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ti5d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22928642-e68c-431e-a2c5-02ed8b7b1f03_350x180.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ti5d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22928642-e68c-431e-a2c5-02ed8b7b1f03_350x180.gif 424w, https://substackcdn.com/image/fetch/$s_!Ti5d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22928642-e68c-431e-a2c5-02ed8b7b1f03_350x180.gif 848w, 
https://substackcdn.com/image/fetch/$s_!Ti5d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22928642-e68c-431e-a2c5-02ed8b7b1f03_350x180.gif 1272w, https://substackcdn.com/image/fetch/$s_!Ti5d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22928642-e68c-431e-a2c5-02ed8b7b1f03_350x180.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ti5d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22928642-e68c-431e-a2c5-02ed8b7b1f03_350x180.gif" width="350" height="180" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22928642-e68c-431e-a2c5-02ed8b7b1f03_350x180.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:180,&quot;width&quot;:350,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:13656,&quot;alt&quot;:&quot;Good Exploration in Reinforcement Learning&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Good Exploration in Reinforcement Learning" title="Good Exploration in Reinforcement Learning" srcset="https://substackcdn.com/image/fetch/$s_!Ti5d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22928642-e68c-431e-a2c5-02ed8b7b1f03_350x180.gif 424w, https://substackcdn.com/image/fetch/$s_!Ti5d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22928642-e68c-431e-a2c5-02ed8b7b1f03_350x180.gif 848w, 
https://substackcdn.com/image/fetch/$s_!Ti5d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22928642-e68c-431e-a2c5-02ed8b7b1f03_350x180.gif 1272w, https://substackcdn.com/image/fetch/$s_!Ti5d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22928642-e68c-431e-a2c5-02ed8b7b1f03_350x180.gif 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">Good exploration matters!</figcaption></figure></div><h2>Classic Explorations</h2><p>These approaches were developed before the deep learning era. They primarily focused on the multi-armed bandit problem, a special case of reinforcement learning in which the state can be ignored. Many of these approaches come with theoretical guarantees under certain assumptions that are not easily extended to complicated environments with high-dimensional state spaces. </p><h4>&#949;-greedy</h4><p>&#949;-greedy is the simplest exploration strategy that works (in theory). Unfortunately, it relies heavily on pure randomness and on biased estimates of the action values <em>Q</em>, and is thus sample-inefficient in practice. It is like flipping a coin when making decisions: we usually go with what we assume is best, but sometimes we take a random chance to explore other options. This is one example of an optimistic strategy.  
Formally, it reads:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zx58!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51273981-af3e-4dd4-8523-797d03bf3072_858x95.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zx58!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51273981-af3e-4dd4-8523-797d03bf3072_858x95.png 424w, https://substackcdn.com/image/fetch/$s_!Zx58!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51273981-af3e-4dd4-8523-797d03bf3072_858x95.png 848w, https://substackcdn.com/image/fetch/$s_!Zx58!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51273981-af3e-4dd4-8523-797d03bf3072_858x95.png 1272w, https://substackcdn.com/image/fetch/$s_!Zx58!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51273981-af3e-4dd4-8523-797d03bf3072_858x95.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zx58!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51273981-af3e-4dd4-8523-797d03bf3072_858x95.png" width="566" height="62.66899766899767" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51273981-af3e-4dd4-8523-797d03bf3072_858x95.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:95,&quot;width&quot;:858,&quot;resizeWidth&quot;:566,&quot;bytes&quot;:37444,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zx58!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51273981-af3e-4dd4-8523-797d03bf3072_858x95.png 424w, https://substackcdn.com/image/fetch/$s_!Zx58!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51273981-af3e-4dd4-8523-797d03bf3072_858x95.png 848w, https://substackcdn.com/image/fetch/$s_!Zx58!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51273981-af3e-4dd4-8523-797d03bf3072_858x95.png 1272w, https://substackcdn.com/image/fetch/$s_!Zx58!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51273981-af3e-4dd4-8523-797d03bf3072_858x95.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">&#949;-greedy implementation. Image taken from [1]</figcaption></figure></div><p>where <em>Q<sub>  </sub></em>is the estimated value for taking action <em>a. 
</em>Here, we use the action value (ignoring the state), but the same rule can be applied to the state-action value in general RL. Despite its simplicity, &#949;-greedy is still widely used in modern RL methods such as <a href="https://www.nature.com/articles/nature14236">DQN</a> and often works well in dense-reward environments. However, in sparse-reward environments, &#949;-greedy struggles to explore, as evidenced in <a href="https://en.wikipedia.org/wiki/Montezuma%27s_Revenge_(video_game)">Montezuma&#8217;s Revenge</a>. </p><blockquote><p>&#128064; It is not surprising that this strategy might struggle in real-world scenarios: being overly optimistic when your estimate is imprecise can be risky. It may lead to getting stuck in a local optimum and missing the global one with the highest returns.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ctEH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ee8eef-b7f7-4fcd-baf1-d386c5ccded3_1962x1320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ctEH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ee8eef-b7f7-4fcd-baf1-d386c5ccded3_1962x1320.png 424w, https://substackcdn.com/image/fetch/$s_!ctEH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ee8eef-b7f7-4fcd-baf1-d386c5ccded3_1962x1320.png 848w, https://substackcdn.com/image/fetch/$s_!ctEH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ee8eef-b7f7-4fcd-baf1-d386c5ccded3_1962x1320.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ctEH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ee8eef-b7f7-4fcd-baf1-d386c5ccded3_1962x1320.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ctEH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ee8eef-b7f7-4fcd-baf1-d386c5ccded3_1962x1320.png" width="468" height="315" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61ee8eef-b7f7-4fcd-baf1-d386c5ccded3_1962x1320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:980,&quot;width&quot;:1456,&quot;resizeWidth&quot;:468,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Benchmarking Montezuma Revenge&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Benchmarking Montezuma Revenge" title="Benchmarking Montezuma Revenge" srcset="https://substackcdn.com/image/fetch/$s_!ctEH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ee8eef-b7f7-4fcd-baf1-d386c5ccded3_1962x1320.png 424w, https://substackcdn.com/image/fetch/$s_!ctEH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ee8eef-b7f7-4fcd-baf1-d386c5ccded3_1962x1320.png 848w, https://substackcdn.com/image/fetch/$s_!ctEH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ee8eef-b7f7-4fcd-baf1-d386c5ccded3_1962x1320.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ctEH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61ee8eef-b7f7-4fcd-baf1-d386c5ccded3_1962x1320.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Benchmarking &#949;-greedy (red line) and other exploration methods on Montezuma&#8217;s Revenge [2]</figcaption></figure></div><h4>Upper Confidence Bound (UCB)</h4><p>One way to address the problem of over-optimism is to consider the uncertainty of the estimate. 
Intuitively, we do not want to miss an action with a currently low estimated value and high uncertainty, as it may possess a higher value. The following formula (UCB strategy) reflects this intuition:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lxT8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e0921-f8be-4f66-af7e-784816c08624_2840x880.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lxT8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e0921-f8be-4f66-af7e-784816c08624_2840x880.png 424w, https://substackcdn.com/image/fetch/$s_!lxT8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e0921-f8be-4f66-af7e-784816c08624_2840x880.png 848w, https://substackcdn.com/image/fetch/$s_!lxT8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e0921-f8be-4f66-af7e-784816c08624_2840x880.png 1272w, https://substackcdn.com/image/fetch/$s_!lxT8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e0921-f8be-4f66-af7e-784816c08624_2840x880.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lxT8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e0921-f8be-4f66-af7e-784816c08624_2840x880.png" width="494" height="153.01785714285714" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c76e0921-f8be-4f66-af7e-784816c08624_2840x880.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:451,&quot;width&quot;:1456,&quot;resizeWidth&quot;:494,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lxT8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e0921-f8be-4f66-af7e-784816c08624_2840x880.png 424w, https://substackcdn.com/image/fetch/$s_!lxT8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e0921-f8be-4f66-af7e-784816c08624_2840x880.png 848w, https://substackcdn.com/image/fetch/$s_!lxT8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e0921-f8be-4f66-af7e-784816c08624_2840x880.png 1272w, https://substackcdn.com/image/fetch/$s_!lxT8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc76e0921-f8be-4f66-af7e-784816c08624_2840x880.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>U<sub>t</sub></em> is the estimated uncertainty of taking action <em>a </em>at timestep <em>t</em>. 
<em>U<sub>t</sub></em> is also known as the upper confidence for the true action value <em>Q<sup>*</sup></em> if we can guarantee that </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c1f8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc210d53c-bb0a-4e5f-94ee-adfa3ed1ca39_1024x335.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c1f8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc210d53c-bb0a-4e5f-94ee-adfa3ed1ca39_1024x335.gif 424w, https://substackcdn.com/image/fetch/$s_!c1f8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc210d53c-bb0a-4e5f-94ee-adfa3ed1ca39_1024x335.gif 848w, https://substackcdn.com/image/fetch/$s_!c1f8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc210d53c-bb0a-4e5f-94ee-adfa3ed1ca39_1024x335.gif 1272w, https://substackcdn.com/image/fetch/$s_!c1f8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc210d53c-bb0a-4e5f-94ee-adfa3ed1ca39_1024x335.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c1f8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc210d53c-bb0a-4e5f-94ee-adfa3ed1ca39_1024x335.gif" width="452" height="147.87109375" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c210d53c-bb0a-4e5f-94ee-adfa3ed1ca39_1024x335.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:335,&quot;width&quot;:1024,&quot;resizeWidth&quot;:452,&quot;bytes&quot;:6285,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c1f8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc210d53c-bb0a-4e5f-94ee-adfa3ed1ca39_1024x335.gif 424w, https://substackcdn.com/image/fetch/$s_!c1f8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc210d53c-bb0a-4e5f-94ee-adfa3ed1ca39_1024x335.gif 848w, https://substackcdn.com/image/fetch/$s_!c1f8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc210d53c-bb0a-4e5f-94ee-adfa3ed1ca39_1024x335.gif 1272w, https://substackcdn.com/image/fetch/$s_!c1f8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc210d53c-bb0a-4e5f-94ee-adfa3ed1ca39_1024x335.gif 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>If the inequality holds, the error of taking action <em>a</em> according to UCB will be reduced. It can be proven that this error, measured as regret, grows only logarithmically over time, compared with the linear regret of &#949;-greedy; we will not go into the details here. Rather, we focus on: <em>&#129504; How can we ensure the inequality? 
</em></p><p>To find <em>U<sub>t</sub></em> , we use <a href="https://en.wikipedia.org/wiki/Hoeffding%27s_inequality">Hoeffding&#8217;s Inequality</a>. By treating <em>X</em> as the return <em>R(a)</em> and estimating <em>Q(a)</em> as the sample mean, we have:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XjwF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2caad07-6452-4cf9-bf87-6995cfe703c6_4240x752.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XjwF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2caad07-6452-4cf9-bf87-6995cfe703c6_4240x752.png 424w, https://substackcdn.com/image/fetch/$s_!XjwF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2caad07-6452-4cf9-bf87-6995cfe703c6_4240x752.png 848w, https://substackcdn.com/image/fetch/$s_!XjwF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2caad07-6452-4cf9-bf87-6995cfe703c6_4240x752.png 1272w, https://substackcdn.com/image/fetch/$s_!XjwF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2caad07-6452-4cf9-bf87-6995cfe703c6_4240x752.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XjwF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2caad07-6452-4cf9-bf87-6995cfe703c6_4240x752.png" width="560" height="99.23076923076923" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2caad07-6452-4cf9-bf87-6995cfe703c6_4240x752.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:258,&quot;width&quot;:1456,&quot;resizeWidth&quot;:560,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XjwF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2caad07-6452-4cf9-bf87-6995cfe703c6_4240x752.png 424w, https://substackcdn.com/image/fetch/$s_!XjwF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2caad07-6452-4cf9-bf87-6995cfe703c6_4240x752.png 848w, https://substackcdn.com/image/fetch/$s_!XjwF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2caad07-6452-4cf9-bf87-6995cfe703c6_4240x752.png 1272w, https://substackcdn.com/image/fetch/$s_!XjwF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2caad07-6452-4cf9-bf87-6995cfe703c6_4240x752.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In simple terms, to guarantee that the true <em>Q*</em> lies within a tight confidence bound, the right-hand side (RHS) should be kept small. For a fixed RHS, this implies that the uncertainty (<em>U</em>) is inversely proportional to the square root of the number of trials (<em>N</em>). 
In essence, as we gather more trials (N increases), the uncertainty is allowed to decrease, leading to a reduction in the bound. If we denote the probability bound (RHS) as <em>p</em>, we can solve for <em>U</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OlV1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21019105-0242-4297-8cd5-537a7ca886d8_2200x1168.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OlV1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21019105-0242-4297-8cd5-537a7ca886d8_2200x1168.png 424w, https://substackcdn.com/image/fetch/$s_!OlV1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21019105-0242-4297-8cd5-537a7ca886d8_2200x1168.png 848w, https://substackcdn.com/image/fetch/$s_!OlV1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21019105-0242-4297-8cd5-537a7ca886d8_2200x1168.png 1272w, https://substackcdn.com/image/fetch/$s_!OlV1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21019105-0242-4297-8cd5-537a7ca886d8_2200x1168.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OlV1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21019105-0242-4297-8cd5-537a7ca886d8_2200x1168.png" width="264" height="140.15934065934067" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21019105-0242-4297-8cd5-537a7ca886d8_2200x1168.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:773,&quot;width&quot;:1456,&quot;resizeWidth&quot;:264,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OlV1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21019105-0242-4297-8cd5-537a7ca886d8_2200x1168.png 424w, https://substackcdn.com/image/fetch/$s_!OlV1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21019105-0242-4297-8cd5-537a7ca886d8_2200x1168.png 848w, https://substackcdn.com/image/fetch/$s_!OlV1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21019105-0242-4297-8cd5-537a7ca886d8_2200x1168.png 1272w, https://substackcdn.com/image/fetch/$s_!OlV1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21019105-0242-4297-8cd5-537a7ca886d8_2200x1168.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In practice, we can reparameterize <em>p</em> as <em>t<sup>-c</sup></em> and tune <em>c</em> as a hyperparameter. 
This leads to the classic UCB strategy:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3fYB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd558aa88-6154-4add-a961-857451b7c098_3280x1168.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3fYB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd558aa88-6154-4add-a961-857451b7c098_3280x1168.png 424w, https://substackcdn.com/image/fetch/$s_!3fYB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd558aa88-6154-4add-a961-857451b7c098_3280x1168.png 848w, https://substackcdn.com/image/fetch/$s_!3fYB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd558aa88-6154-4add-a961-857451b7c098_3280x1168.png 1272w, https://substackcdn.com/image/fetch/$s_!3fYB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd558aa88-6154-4add-a961-857451b7c098_3280x1168.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3fYB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd558aa88-6154-4add-a961-857451b7c098_3280x1168.png" width="504" height="179.30769230769232" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d558aa88-6154-4add-a961-857451b7c098_3280x1168.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:518,&quot;width&quot;:1456,&quot;resizeWidth&quot;:504,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3fYB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd558aa88-6154-4add-a961-857451b7c098_3280x1168.png 424w, https://substackcdn.com/image/fetch/$s_!3fYB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd558aa88-6154-4add-a961-857451b7c098_3280x1168.png 848w, https://substackcdn.com/image/fetch/$s_!3fYB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd558aa88-6154-4add-a961-857451b7c098_3280x1168.png 1272w, https://substackcdn.com/image/fetch/$s_!3fYB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd558aa88-6154-4add-a961-857451b7c098_3280x1168.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; While UCB does not involve explicit randomness, the action selection is not confined to a strictly greedy approach based solely on Q estimation. 
Implicitly, with a large value of the exploration-exploitation trade-off parameter <em>c</em>, the chosen action is more likely to deviate from the greedy action, leading to increased exploration.</p></blockquote><p></p><h4>Thompson Sampling</h4><p>When additional assumptions about the reward distribution are available, we can calculate the probability of each action being optimal. Consequently, an action can be chosen based on the probability that it is optimal (probability matching strategy). Formally, the probability of choosing action <em>a</em> is: </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hnLx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bd9cdd-aacf-4f1a-9ffa-296a61d18b18_3520x840.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hnLx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bd9cdd-aacf-4f1a-9ffa-296a61d18b18_3520x840.png 424w, https://substackcdn.com/image/fetch/$s_!hnLx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bd9cdd-aacf-4f1a-9ffa-296a61d18b18_3520x840.png 848w, https://substackcdn.com/image/fetch/$s_!hnLx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bd9cdd-aacf-4f1a-9ffa-296a61d18b18_3520x840.png 1272w, https://substackcdn.com/image/fetch/$s_!hnLx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bd9cdd-aacf-4f1a-9ffa-296a61d18b18_3520x840.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!hnLx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bd9cdd-aacf-4f1a-9ffa-296a61d18b18_3520x840.png" width="434" height="103.4326923076923" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0bd9cdd-aacf-4f1a-9ffa-296a61d18b18_3520x840.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:347,&quot;width&quot;:1456,&quot;resizeWidth&quot;:434,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hnLx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bd9cdd-aacf-4f1a-9ffa-296a61d18b18_3520x840.png 424w, https://substackcdn.com/image/fetch/$s_!hnLx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bd9cdd-aacf-4f1a-9ffa-296a61d18b18_3520x840.png 848w, https://substackcdn.com/image/fetch/$s_!hnLx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bd9cdd-aacf-4f1a-9ffa-296a61d18b18_3520x840.png 1272w, https://substackcdn.com/image/fetch/$s_!hnLx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0bd9cdd-aacf-4f1a-9ffa-296a61d18b18_3520x840.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Thompson sampling is one way to implement the probability matching strategy, and can be understood in a 
Bayesian setting as follows:</p><ol><li><p>Assume the reward follows a distribution <em>p(r|a, &#952;)</em>, where <em>&#952;</em> is a parameter with prior <em>p(&#952;)</em>.</p></li><li><p>Given the set of past observations <em>D<sub>t</sub></em>, made up of pairs <em>{(a<sub>i</sub>, r<sub>i</sub>) | i = 1, 2, ..., t}</em>, we update the posterior using Bayes&#8217; rule:</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rrNt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d8e753-8003-43ca-94bb-307ed10812fc_2960x784.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rrNt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d8e753-8003-43ca-94bb-307ed10812fc_2960x784.png 424w, https://substackcdn.com/image/fetch/$s_!rrNt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d8e753-8003-43ca-94bb-307ed10812fc_2960x784.png 848w, https://substackcdn.com/image/fetch/$s_!rrNt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d8e753-8003-43ca-94bb-307ed10812fc_2960x784.png 1272w, https://substackcdn.com/image/fetch/$s_!rrNt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d8e753-8003-43ca-94bb-307ed10812fc_2960x784.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rrNt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d8e753-8003-43ca-94bb-307ed10812fc_2960x784.png" width="248" height="65.74725274725274" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57d8e753-8003-43ca-94bb-307ed10812fc_2960x784.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:386,&quot;width&quot;:1456,&quot;resizeWidth&quot;:248,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rrNt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d8e753-8003-43ca-94bb-307ed10812fc_2960x784.png 424w, https://substackcdn.com/image/fetch/$s_!rrNt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d8e753-8003-43ca-94bb-307ed10812fc_2960x784.png 848w, https://substackcdn.com/image/fetch/$s_!rrNt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d8e753-8003-43ca-94bb-307ed10812fc_2960x784.png 1272w, https://substackcdn.com/image/fetch/$s_!rrNt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57d8e753-8003-43ca-94bb-307ed10812fc_2960x784.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ol start="3"><li><p>Given the posterior, we can estimate the action value</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!HKnE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febcc95a8-a091-4541-8ba5-8841a560bd51_3760x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HKnE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febcc95a8-a091-4541-8ba5-8841a560bd51_3760x540.png 424w, https://substackcdn.com/image/fetch/$s_!HKnE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febcc95a8-a091-4541-8ba5-8841a560bd51_3760x540.png 848w, https://substackcdn.com/image/fetch/$s_!HKnE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febcc95a8-a091-4541-8ba5-8841a560bd51_3760x540.png 1272w, https://substackcdn.com/image/fetch/$s_!HKnE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febcc95a8-a091-4541-8ba5-8841a560bd51_3760x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HKnE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febcc95a8-a091-4541-8ba5-8841a560bd51_3760x540.png" width="342" height="49.092032967032964" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebcc95a8-a091-4541-8ba5-8841a560bd51_3760x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:209,&quot;width&quot;:1456,&quot;resizeWidth&quot;:342,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HKnE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febcc95a8-a091-4541-8ba5-8841a560bd51_3760x540.png 424w, https://substackcdn.com/image/fetch/$s_!HKnE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febcc95a8-a091-4541-8ba5-8841a560bd51_3760x540.png 848w, https://substackcdn.com/image/fetch/$s_!HKnE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febcc95a8-a091-4541-8ba5-8841a560bd51_3760x540.png 1272w, https://substackcdn.com/image/fetch/$s_!HKnE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febcc95a8-a091-4541-8ba5-8841a560bd51_3760x540.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ol start="4"><li><p>Then, we can compute the probability of choosing action <em>a</em></p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!F_7l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d8d12a-2c4a-40ed-8e18-ae2bc45d2f8c_4880x2276.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F_7l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d8d12a-2c4a-40ed-8e18-ae2bc45d2f8c_4880x2276.png 424w, https://substackcdn.com/image/fetch/$s_!F_7l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d8d12a-2c4a-40ed-8e18-ae2bc45d2f8c_4880x2276.png 848w, https://substackcdn.com/image/fetch/$s_!F_7l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d8d12a-2c4a-40ed-8e18-ae2bc45d2f8c_4880x2276.png 1272w, https://substackcdn.com/image/fetch/$s_!F_7l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d8d12a-2c4a-40ed-8e18-ae2bc45d2f8c_4880x2276.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F_7l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d8d12a-2c4a-40ed-8e18-ae2bc45d2f8c_4880x2276.png" width="446" height="207.9903846153846" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0d8d12a-2c4a-40ed-8e18-ae2bc45d2f8c_4880x2276.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:679,&quot;width&quot;:1456,&quot;resizeWidth&quot;:446,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F_7l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d8d12a-2c4a-40ed-8e18-ae2bc45d2f8c_4880x2276.png 424w, https://substackcdn.com/image/fetch/$s_!F_7l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d8d12a-2c4a-40ed-8e18-ae2bc45d2f8c_4880x2276.png 848w, https://substackcdn.com/image/fetch/$s_!F_7l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d8d12a-2c4a-40ed-8e18-ae2bc45d2f8c_4880x2276.png 1272w, https://substackcdn.com/image/fetch/$s_!F_7l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d8d12a-2c4a-40ed-8e18-ae2bc45d2f8c_4880x2276.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In practice, there is no need to compute the integral explicitly; it suffices to sample one <em>&#952;</em> from the posterior and select the action <em>a</em> with the highest estimated action value under that <em>&#952;</em>. Because the sampled <em>&#952;</em> is random, each action is then chosen with exactly the probability that it is optimal under the posterior, which is the probability matching strategy. 
The whole process is summarized in the following algorithm, extended to the contextual multi-armed bandit with additional context input <em>x</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nLqS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ab2b7-f33b-4abf-8ad9-4f3e0dd61aee_790x316.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nLqS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ab2b7-f33b-4abf-8ad9-4f3e0dd61aee_790x316.png 424w, https://substackcdn.com/image/fetch/$s_!nLqS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ab2b7-f33b-4abf-8ad9-4f3e0dd61aee_790x316.png 848w, https://substackcdn.com/image/fetch/$s_!nLqS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ab2b7-f33b-4abf-8ad9-4f3e0dd61aee_790x316.png 1272w, https://substackcdn.com/image/fetch/$s_!nLqS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ab2b7-f33b-4abf-8ad9-4f3e0dd61aee_790x316.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nLqS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ab2b7-f33b-4abf-8ad9-4f3e0dd61aee_790x316.png" width="790" height="316" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/223ab2b7-f33b-4abf-8ad9-4f3e0dd61aee_790x316.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:316,&quot;width&quot;:790,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57776,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nLqS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ab2b7-f33b-4abf-8ad9-4f3e0dd61aee_790x316.png 424w, https://substackcdn.com/image/fetch/$s_!nLqS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ab2b7-f33b-4abf-8ad9-4f3e0dd61aee_790x316.png 848w, https://substackcdn.com/image/fetch/$s_!nLqS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ab2b7-f33b-4abf-8ad9-4f3e0dd61aee_790x316.png 1272w, https://substackcdn.com/image/fetch/$s_!nLqS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F223ab2b7-f33b-4abf-8ad9-4f3e0dd61aee_790x316.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Thompson sampling algorithm taken from [3]</figcaption></figure></div><h4>Information Gain</h4><p>Information Gain (IG) measures the reduction in uncertainty (measured as <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy <em>H</em></a>) about a latent variable, typically the model parameter <em>&#952;</em>, after seeing an observation (e.g., reward <em>r</em>) caused by some action <em>a</em>. A big drop in entropy means the observation makes the model less uncertain, so the action <em>a</em> that led to it gave us valuable information. 
Formally, we have:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LDyo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32200284-170f-4a4c-80ea-0ac81d6d5d60_4240x992.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LDyo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32200284-170f-4a4c-80ea-0ac81d6d5d60_4240x992.png 424w, https://substackcdn.com/image/fetch/$s_!LDyo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32200284-170f-4a4c-80ea-0ac81d6d5d60_4240x992.png 848w, https://substackcdn.com/image/fetch/$s_!LDyo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32200284-170f-4a4c-80ea-0ac81d6d5d60_4240x992.png 1272w, https://substackcdn.com/image/fetch/$s_!LDyo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32200284-170f-4a4c-80ea-0ac81d6d5d60_4240x992.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LDyo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32200284-170f-4a4c-80ea-0ac81d6d5d60_4240x992.png" width="536" height="125.53296703296704" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32200284-170f-4a4c-80ea-0ac81d6d5d60_4240x992.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:341,&quot;width&quot;:1456,&quot;resizeWidth&quot;:536,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LDyo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32200284-170f-4a4c-80ea-0ac81d6d5d60_4240x992.png 424w, https://substackcdn.com/image/fetch/$s_!LDyo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32200284-170f-4a4c-80ea-0ac81d6d5d60_4240x992.png 848w, https://substackcdn.com/image/fetch/$s_!LDyo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32200284-170f-4a4c-80ea-0ac81d6d5d60_4240x992.png 1272w, https://substackcdn.com/image/fetch/$s_!LDyo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32200284-170f-4a4c-80ea-0ac81d6d5d60_4240x992.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In essence, our goal is to strike a balance between minimizing expected regret in the current period and acquiring new information about the observation model. 
The intuition suggests this objective:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gail!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e4e60d-b1da-4bf7-bf59-fd58e71d708a_1024x376.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gail!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e4e60d-b1da-4bf7-bf59-fd58e71d708a_1024x376.gif 424w, https://substackcdn.com/image/fetch/$s_!gail!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e4e60d-b1da-4bf7-bf59-fd58e71d708a_1024x376.gif 848w, https://substackcdn.com/image/fetch/$s_!gail!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e4e60d-b1da-4bf7-bf59-fd58e71d708a_1024x376.gif 1272w, https://substackcdn.com/image/fetch/$s_!gail!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e4e60d-b1da-4bf7-bf59-fd58e71d708a_1024x376.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gail!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e4e60d-b1da-4bf7-bf59-fd58e71d708a_1024x376.gif" width="520" height="190.9375" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/47e4e60d-b1da-4bf7-bf59-fd58e71d708a_1024x376.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:376,&quot;width&quot;:1024,&quot;resizeWidth&quot;:520,&quot;bytes&quot;:9591,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gail!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e4e60d-b1da-4bf7-bf59-fd58e71d708a_1024x376.gif 424w, https://substackcdn.com/image/fetch/$s_!gail!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e4e60d-b1da-4bf7-bf59-fd58e71d708a_1024x376.gif 848w, https://substackcdn.com/image/fetch/$s_!gail!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e4e60d-b1da-4bf7-bf59-fd58e71d708a_1024x376.gif 1272w, https://substackcdn.com/image/fetch/$s_!gail!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47e4e60d-b1da-4bf7-bf59-fd58e71d708a_1024x376.gif 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Overall, this approach demonstrates better results compared to Thompson Sampling and UCB, in terms of minimizing cumulative regret, as shown in the figure below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!e080!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785b0f32-025d-4401-a508-e17642de06d4_1024x747.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e080!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785b0f32-025d-4401-a508-e17642de06d4_1024x747.gif 424w, https://substackcdn.com/image/fetch/$s_!e080!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785b0f32-025d-4401-a508-e17642de06d4_1024x747.gif 848w, https://substackcdn.com/image/fetch/$s_!e080!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785b0f32-025d-4401-a508-e17642de06d4_1024x747.gif 1272w, https://substackcdn.com/image/fetch/$s_!e080!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785b0f32-025d-4401-a508-e17642de06d4_1024x747.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e080!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785b0f32-025d-4401-a508-e17642de06d4_1024x747.gif" width="442" height="322.435546875" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/785b0f32-025d-4401-a508-e17642de06d4_1024x747.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:747,&quot;width&quot;:1024,&quot;resizeWidth&quot;:442,&quot;bytes&quot;:111937,&quot;alt&quot;:&quot;Information Gain, Bayes UCB, Thompson 
Sampling&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Information Gain, Bayes UCB, Thompson Sampling" title="Information Gain, Bayes UCB, Thompson Sampling" srcset="https://substackcdn.com/image/fetch/$s_!e080!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785b0f32-025d-4401-a508-e17642de06d4_1024x747.gif 424w, https://substackcdn.com/image/fetch/$s_!e080!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785b0f32-025d-4401-a508-e17642de06d4_1024x747.gif 848w, https://substackcdn.com/image/fetch/$s_!e080!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785b0f32-025d-4401-a508-e17642de06d4_1024x747.gif 1272w, https://substackcdn.com/image/fetch/$s_!e080!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785b0f32-025d-4401-a508-e17642de06d4_1024x747.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Performance of IG (IDS) and other classic approaches in a multi-armed bandit task. Image taken from [4].</figcaption></figure></div><p></p><blockquote><p>&#128064; While classical exploration approaches have rigorous theoretical foundations, they are not without limitations:</p></blockquote><p>&#10060; Scalability Issues: Most are designed specifically for bandit problems, so they are hard to apply to large-scale or high-dimensional problems (e.g., Atari games), where their computational demands can become impractical.</p><p>&#10060; Assumption Sensitivity: These methods rely heavily on specific assumptions about reward distributions or system dynamics, which limits their adaptability when those assumptions do not hold.</p><p>&#10060; Vulnerability to Uncertainty: They may struggle in dynamic environments with complex reward structures or frequent changes, leading to suboptimal performance.</p><div><hr></div><h2>Primal Exploration in Deep Reinforcement Learning</h2><h4>Entropy Maximization</h4><p>In the era of deep learning, neural networks serve as function approximators, including for parameterizing value and policy functions in RL. 
While &#949;-greedy is a simple exploration strategy for value-based deep RL methods like DQN, it is less straightforward to apply to policy gradient methods. To address this, an entropy term is added to the objective function to penalize overly deterministic policies. Maximizing this entropy bonus alongside the policy gradient objective keeps the action distribution spread out, which encourages diverse exploration and helps avoid settling early on suboptimal actions.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HhJW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde39e03-37d9-46f7-8e2e-3f3769b75c73_3800x952.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HhJW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde39e03-37d9-46f7-8e2e-3f3769b75c73_3800x952.png 424w, https://substackcdn.com/image/fetch/$s_!HhJW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde39e03-37d9-46f7-8e2e-3f3769b75c73_3800x952.png 848w, https://substackcdn.com/image/fetch/$s_!HhJW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde39e03-37d9-46f7-8e2e-3f3769b75c73_3800x952.png 1272w, https://substackcdn.com/image/fetch/$s_!HhJW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde39e03-37d9-46f7-8e2e-3f3769b75c73_3800x952.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HhJW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde39e03-37d9-46f7-8e2e-3f3769b75c73_3800x952.png" width="566" 
height="141.88873626373626" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bde39e03-37d9-46f7-8e2e-3f3769b75c73_3800x952.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:365,&quot;width&quot;:1456,&quot;resizeWidth&quot;:566,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HhJW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde39e03-37d9-46f7-8e2e-3f3769b75c73_3800x952.png 424w, https://substackcdn.com/image/fetch/$s_!HhJW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde39e03-37d9-46f7-8e2e-3f3769b75c73_3800x952.png 848w, https://substackcdn.com/image/fetch/$s_!HhJW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde39e03-37d9-46f7-8e2e-3f3769b75c73_3800x952.png 1272w, https://substackcdn.com/image/fetch/$s_!HhJW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde39e03-37d9-46f7-8e2e-3f3769b75c73_3800x952.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><blockquote><p>&#128064; Key papers leveraging this loss, such as <a href="https://arxiv.org/pdf/1602.01783.pdf">A3C</a> and <a href="https://arxiv.org/pdf/1707.06347.pdf">PPO</a>, assert enhanced exploration by discouraging premature convergence to suboptimal deterministic policies. 
Nevertheless, adding an extra loss is a double-edged sword: it enhances exploration but may also impede the optimization of other losses, especially the main objective. Moreover, the entropy loss does not adapt the level of exploration to the task. This is limiting because, intuitively, certain tasks, particularly those with sparse rewards, may require more exploration than others.</p></blockquote><p></p><h4>Noisy Networks</h4><p>Another way to inject randomness into the policy is to add noise to the weights of the neural network [5]. For example, a noisy feedforward layer of a neural network is depicted below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eHRR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc4701d-8ee0-4154-aa36-5f0470b659c8_610x541.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eHRR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc4701d-8ee0-4154-aa36-5f0470b659c8_610x541.png 424w, https://substackcdn.com/image/fetch/$s_!eHRR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc4701d-8ee0-4154-aa36-5f0470b659c8_610x541.png 848w, https://substackcdn.com/image/fetch/$s_!eHRR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc4701d-8ee0-4154-aa36-5f0470b659c8_610x541.png 1272w, https://substackcdn.com/image/fetch/$s_!eHRR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc4701d-8ee0-4154-aa36-5f0470b659c8_610x541.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!eHRR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc4701d-8ee0-4154-aa36-5f0470b659c8_610x541.png" width="404" height="358.3016393442623" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5fc4701d-8ee0-4154-aa36-5f0470b659c8_610x541.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:541,&quot;width&quot;:610,&quot;resizeWidth&quot;:404,&quot;bytes&quot;:54931,&quot;alt&quot;:&quot;Noisy Network Layer&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Noisy Network Layer" title="Noisy Network Layer" srcset="https://substackcdn.com/image/fetch/$s_!eHRR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc4701d-8ee0-4154-aa36-5f0470b659c8_610x541.png 424w, https://substackcdn.com/image/fetch/$s_!eHRR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc4701d-8ee0-4154-aa36-5f0470b659c8_610x541.png 848w, https://substackcdn.com/image/fetch/$s_!eHRR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc4701d-8ee0-4154-aa36-5f0470b659c8_610x541.png 1272w, https://substackcdn.com/image/fetch/$s_!eHRR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5fc4701d-8ee0-4154-aa36-5f0470b659c8_610x541.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">An example of a noisy linear layer. Here <em>w</em> is the weight matrix and <em>b</em> is the bias vector. The parameters <em>&#181;<sup>w</sup></em>, <em>&#181;<sup>b</sup></em>, <em>&#963;<sup>w</sup></em> and <em>&#963;<sup>b</sup></em> are the learnable parameters of the network, whereas <em>&#949;<sup>w</sup></em> and <em>&#949;<sup>b</sup></em> are noise variables. Image taken from [5].</figcaption></figure></div><p>The noise variables can be sampled independently from a zero-mean Gaussian distribution, so that there is one noise variable per weight and per bias entry. 
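</p><p>To make the independent-noise variant concrete, here is a minimal sketch in plain Python. It assumes the layer computes <em>y</em> = (<em>&#181;<sup>w</sup></em> + <em>&#963;<sup>w</sup></em> &#8857; <em>&#949;<sup>w</sup></em>)<em>x</em> + <em>&#181;<sup>b</sup></em> + <em>&#963;<sup>b</sup></em> &#8857; <em>&#949;<sup>b</sup></em>, which is how I read the figure from [5], and it assumes unit-variance Gaussian noise. The function name <code>noisy_linear</code> and all numeric values are hypothetical, chosen only for illustration; a real agent would learn the parameters by gradient descent in a deep learning framework.</p>

```python
import random


def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, rng):
    """Forward pass of one noisy linear layer with independent noise.

    mu_w and sigma_w are lists of rows (out_dim x in_dim); mu_b and sigma_b
    have length out_dim. A fresh zero-mean Gaussian sample (unit variance is
    an assumption of this sketch) is drawn for every weight and every bias.
    """
    y = []
    for row_mu, row_sigma, b_mu, b_sigma in zip(mu_w, sigma_w, mu_b, sigma_b):
        # Start from the perturbed bias: mu_b + sigma_b * eps_b.
        total = b_mu + b_sigma * rng.gauss(0.0, 1.0)
        for w_mu, w_sigma, x_i in zip(row_mu, row_sigma, x):
            # Add (mu_w + sigma_w * eps_w) * x for each input feature.
            total += (w_mu + w_sigma * rng.gauss(0.0, 1.0)) * x_i
        y.append(total)
    return y


# Hypothetical toy parameters, for illustration only.
rng = random.Random(0)
x = [1.0, 2.0]
mu_w = [[0.5, -0.2], [0.1, 0.3]]
mu_b = [0.0, 1.0]
zero_w = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]
small_w = [[0.1, 0.1], [0.1, 0.1]]
small_b = [0.1, 0.1]

# sigma = 0: the noise has no effect.
print(noisy_linear(x, mu_w, zero_w, mu_b, zero_b, rng))
# sigma != 0: repeated calls on the same input differ.
print(noisy_linear(x, mu_w, small_w, mu_b, small_b, rng))
print(noisy_linear(x, mu_w, small_w, mu_b, small_b, rng))
```

<p>With every <em>&#963;</em> set to zero the layer behaves like an ordinary linear layer; with non-zero <em>&#963;</em>, the same input yields a different output on each call, which is what injects randomness into the resulting policy or value estimates.</p><p>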
Alternatively,  they can be factorized into the product of two noises, each sampled from a zero-mean Gaussian: </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!toIO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc093d8bc-8fd4-471b-aae0-f65a85410ab6_224x51.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!toIO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc093d8bc-8fd4-471b-aae0-f65a85410ab6_224x51.png 424w, https://substackcdn.com/image/fetch/$s_!toIO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc093d8bc-8fd4-471b-aae0-f65a85410ab6_224x51.png 848w, https://substackcdn.com/image/fetch/$s_!toIO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc093d8bc-8fd4-471b-aae0-f65a85410ab6_224x51.png 1272w, https://substackcdn.com/image/fetch/$s_!toIO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc093d8bc-8fd4-471b-aae0-f65a85410ab6_224x51.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!toIO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc093d8bc-8fd4-471b-aae0-f65a85410ab6_224x51.png" width="224" height="51" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c093d8bc-8fd4-471b-aae0-f65a85410ab6_224x51.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:51,&quot;width&quot;:224,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5300,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!toIO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc093d8bc-8fd4-471b-aae0-f65a85410ab6_224x51.png 424w, https://substackcdn.com/image/fetch/$s_!toIO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc093d8bc-8fd4-471b-aae0-f65a85410ab6_224x51.png 848w, https://substackcdn.com/image/fetch/$s_!toIO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc093d8bc-8fd4-471b-aae0-f65a85410ab6_224x51.png 1272w, https://substackcdn.com/image/fetch/$s_!toIO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc093d8bc-8fd4-471b-aae0-f65a85410ab6_224x51.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>where <em>f</em> is a real-valued function that scales the original noises. Throughout training, noise samples are drawn and added to the weights for both forward and backward propagation. Interestingly, a similar concept was introduced around the same period in a concurrent study [6]. 
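</p><p>As a rough sketch of the factorized variant, the snippet below builds the weight and bias noise from one noise vector over the inputs and one over the outputs. The specific choice <em>f</em>(<em>x</em>) = sgn(<em>x</em>)&#8730;|<em>x</em>| is the one I recall from [5] and should be treated as an assumption here, as should the unit variance and the reuse of the output-side noise for the bias; the helper name <code>factorized_noise</code> is hypothetical.</p>

```python
import math
import random


def f(x):
    # Assumed scaling function: f(x) = sgn(x) * sqrt(|x|).
    return math.copysign(math.sqrt(abs(x)), x)


def factorized_noise(in_dim, out_dim, rng):
    """Build eps_w (out_dim x in_dim) and eps_b (out_dim) from in_dim + out_dim samples."""
    eps_in = [f(rng.gauss(0.0, 1.0)) for _ in range(in_dim)]
    eps_out = [f(rng.gauss(0.0, 1.0)) for _ in range(out_dim)]
    # Weight noise: product of the scaled output-side and input-side noises.
    eps_w = [[e_out * e_in for e_in in eps_in] for e_out in eps_out]
    # Bias noise: the scaled output-side noise (assumption of this sketch).
    eps_b = list(eps_out)
    return eps_w, eps_b


eps_w, eps_b = factorized_noise(in_dim=3, out_dim=2, rng=random.Random(0))
print(len(eps_w), len(eps_w[0]), len(eps_b))
```

<p>For a layer with 3 inputs and 2 outputs, this draws 3 + 2 = 5 Gaussian samples instead of the 3 &#215; 2 + 2 = 8 needed when every weight and bias gets its own noise variable.</p><p>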
Both [5] and [6] were accepted at ICLR 2018.</p><p>Noisy layers can replace feedforward layers in the value network (DQN) and the policy network (A3C). Because the parameters of the noisy layer are trainable, the network can adjust its exploration level to suit the task. For example, tasks that need little exploration should end up with a small noise scale <em>&#963;</em>, while tasks requiring more exploration should exhibit the opposite, as depicted below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KdS_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f343aa-4f5d-4ce2-94e7-09e4ff675875_1167x308.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KdS_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f343aa-4f5d-4ce2-94e7-09e4ff675875_1167x308.png 424w, https://substackcdn.com/image/fetch/$s_!KdS_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f343aa-4f5d-4ce2-94e7-09e4ff675875_1167x308.png 848w, https://substackcdn.com/image/fetch/$s_!KdS_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f343aa-4f5d-4ce2-94e7-09e4ff675875_1167x308.png 1272w, https://substackcdn.com/image/fetch/$s_!KdS_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f343aa-4f5d-4ce2-94e7-09e4ff675875_1167x308.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!KdS_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f343aa-4f5d-4ce2-94e7-09e4ff675875_1167x308.png" width="1167" height="308" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6f343aa-4f5d-4ce2-94e7-09e4ff675875_1167x308.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:308,&quot;width&quot;:1167,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:151804,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KdS_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f343aa-4f5d-4ce2-94e7-09e4ff675875_1167x308.png 424w, https://substackcdn.com/image/fetch/$s_!KdS_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f343aa-4f5d-4ce2-94e7-09e4ff675875_1167x308.png 848w, https://substackcdn.com/image/fetch/$s_!KdS_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f343aa-4f5d-4ce2-94e7-09e4ff675875_1167x308.png 1272w, https://substackcdn.com/image/fetch/$s_!KdS_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6f343aa-4f5d-4ce2-94e7-09e4ff675875_1167x308.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The magnitude of the noisy layer&#8217;s parameter <em>&#963;</em> in different layers and tasks. Image taken from [5].</figcaption></figure></div><blockquote><p>&#128064; Although Noisy Networks can vary the degree of exploration across tasks, adapting exploration at the state level remains out of reach. States with higher uncertainty may require more exploration, while others may not.</p></blockquote><p>&#129504; <em>How do we know which states should be explored?</em> Answering this question requires more sophisticated approaches, which will be covered in part 2 of my article. 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://hungleai.substack.com/publish/post/140993309&quot;,&quot;text&quot;:&quot;Part 2&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://hungleai.substack.com/publish/post/140993309"><span>Part 2</span></a></p><div><hr></div><h2>References</h2><p>[1] Sutton, Richard S., and Andrew G. Barto. <em>Reinforcement learning: An introduction</em>. MIT Press, 2018.</p><p>[2] Ta&#239;ga, Adrien Ali, William Fedus, Marlos C. Machado, Aaron Courville, and Marc G. Bellemare. "Benchmarking bonus-based exploration methods on the arcade learning environment." <em>arXiv preprint arXiv:1908.02388</em> (2019).</p><p>[3] Chapelle, Olivier, and Lihong Li. "An empirical evaluation of Thompson sampling." <em>Advances in Neural Information Processing Systems</em> 24 (2011).</p><p>[4] Russo, Daniel, and Benjamin Van Roy. "Learning to optimize via information-directed sampling." <em>Advances in Neural Information Processing Systems</em> 27 (2014).</p><p>[5] Fortunato, Meire, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. "Noisy Networks for Exploration." <em>arXiv preprint arXiv:1706.10295</em> (2017).</p><p>[6] Plappert, Matthias, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. "Parameter space noise for exploration." 
<em>arXiv preprint arXiv:1706.01905</em> (2017).</p>]]></content:encoded></item></channel></rss>