<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Reinforcement Learning - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Reinforcement Learning - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Tue, 19 May 2026 22:45:17 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/reinforcement-learning/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Build an Adaptive Tic-Tac-Toe AI with Reinforcement Learning in JavaScript ]]>
                </title>
                <description>
                    <![CDATA[ Reinforcement learning (RL) is one of the most powerful paradigms in artificial intelligence. Unlike supervised learning where you train models on labeled datasets, RL agents learn through direct interaction with their environment, receiving rewards ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-an-adaptive-tic-tac-toe-ai-with-reinforcement-learning-in-javascript/</link>
                <guid isPermaLink="false">68e57cd7b148e87f05670d05</guid>
                
                    <category>
                        <![CDATA[ JavaScript ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Reinforcement Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mayur Vekariya ]]>
                </dc:creator>
                <pubDate>Tue, 07 Oct 2025 20:49:27 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759870150966/f65a07a6-123b-45e2-a3f2-bc099638825a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Reinforcement learning (RL) is one of the most powerful paradigms in artificial intelligence. Unlike supervised learning where you train models on labeled datasets, RL agents learn through direct interaction with their environment, receiving rewards or penalties for their actions.</p>
<p>In this tutorial, you will build a Tic-Tac-Toe AI that learns optimal strategies through Q-learning, a foundational RL algorithm. You will implement adaptive difficulty levels, visualize the learning process in real-time, and explore advanced optimization techniques.</p>
<p>By the end of this tutorial, you’ll have a production-ready web application that demonstrates practical RL concepts – all running directly in the browser with vanilla JavaScript.</p>
<h2 id="heading-what-youll-learn">What You’ll Learn</h2>
<p>In this tutorial, you’ll learn:</p>
<ul>
<li><p>Core reinforcement learning concepts including Q-learning, exploration vs exploitation, and reward shaping.</p>
</li>
<li><p>How to implement a complete Q-learning algorithm with state management.</p>
</li>
<li><p>Advanced techniques like epsilon decay and experience replay.</p>
</li>
<li><p>How to build an interactive game with HTML5 Canvas and responsive controls.</p>
</li>
<li><p>Performance optimization for real-time AI decision-making.</p>
</li>
<li><p>Visualization techniques to understand the AI's learning process.</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To get the most out of this tutorial, you should have:</p>
<ul>
<li><p>Solid understanding of JavaScript (ES6+ syntax, classes, array methods).</p>
</li>
<li><p>Familiarity with HTML5 Canvas API for graphics rendering.</p>
</li>
<li><p>Basic knowledge of algorithms and data structures.</p>
</li>
<li><p>Understanding of asynchronous JavaScript (Promises, async/await).</p>
</li>
</ul>
<p>You don’t need any prior machine learning experience, as I’ll explain all RL concepts from scratch.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-why-use-reinforcement-learning-for-game-ai">Why Use Reinforcement Learning for Game AI?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-understand-q-learning-the-foundation">How to Understand Q-Learning: The Foundation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-project-architecture-overview">Project Architecture Overview</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-the-html-interface-with-tailwind-css">How to Build the HTML Interface with Tailwind CSS</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-implement-the-q-learning-algorithm">How to Implement the Q-Learning Algorithm</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-understand-the-enhanced-features">How to Understand the Enhanced Features</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-test-your-implementation">How to Test Your Implementation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-advanced-optimizations-and-extensions">Advanced Optimizations and Extensions</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-common-pitfalls-and-solutions">Common Pitfalls and Solutions</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-extend-this-to-other-games">How to Extend This to Other Games</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-why-use-reinforcement-learning-for-game-ai">Why Use Reinforcement Learning for Game AI?</h2>
<p>Games provide an ideal environment for learning RL because they have:</p>
<ol>
<li><p><strong>Clear state representations</strong> – The game board at any moment</p>
</li>
<li><p><strong>Discrete action spaces</strong> – A finite set of valid moves</p>
</li>
<li><p><strong>Unambiguous feedback</strong> – Every game ends in a win, loss, or draw</p>
</li>
<li><p><strong>Deterministic rules</strong> – Consistent behavior across games</p>
</li>
</ol>
<p>Traditional game AI uses techniques like minimax with alpha-beta pruning. These approaches are effective, but they depend on search logic and, for larger games, evaluation heuristics that you write by hand. RL agents, by contrast, discover strategies through experience, much as humans learn through practice.</p>
<p>Tic-Tac-Toe serves as an excellent starting point because:</p>
<ul>
<li><p>The state space is manageable (5,478 unique positions)</p>
</li>
<li><p>Games are short, allowing rapid iteration</p>
</li>
<li><p>Perfect play is achievable, providing a clear success metric</p>
</li>
<li><p>The concepts scale to more complex games</p>
</li>
</ul>
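<p>If you want to check the state-space figure yourself, a short enumeration is enough. The sketch below is not part of the tutorial's <code>game.js</code>; the helper names are made up for illustration. It assumes X always moves first and that play stops as soon as someone wins or the board fills up, then counts every distinct board that can be reached.</p>

```javascript
// All eight winning lines on a 3x3 board, as index triples.
const WIN_LINES = [
  [0, 1, 2], [3, 4, 5], [6, 7, 8], // rows
  [0, 3, 6], [1, 4, 7], [2, 5, 8], // columns
  [0, 4, 8], [2, 4, 6],            // diagonals
];

// Returns true if either player has completed a line.
function hasWinner(board) {
  return WIN_LINES.some(
    ([a, b, c]) => board[a] !== "-" && board[a] === board[b] && board[b] === board[c]
  );
}

// Counts every board reachable from the empty board when X moves first
// and play stops as soon as someone wins or the board is full.
function countReachablePositions() {
  const seen = new Set();
  const stack = ["---------"];
  while (stack.length > 0) {
    const board = stack.pop();
    if (seen.has(board)) continue;
    seen.add(board);
    if (hasWinner(board) || !board.includes("-")) continue; // terminal position
    // X is to move when both players have placed the same number of marks.
    const xCount = board.split("X").length - 1;
    const oCount = board.split("O").length - 1;
    const player = xCount === oCount ? "X" : "O";
    for (let i = 0; i < 9; i++) {
      if (board[i] === "-") {
        stack.push(board.slice(0, i) + player + board.slice(i + 1));
      }
    }
  }
  return seen.size;
}

console.log(countReachablePositions()); // 5478
```

<p>The count includes the empty board and all terminal positions. A table with a few thousand rows fits comfortably in browser memory, which is why a plain lookup table works for this game.</p>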
<h2 id="heading-how-to-understand-q-learning-the-foundation">How to Understand Q-Learning: The Foundation</h2>
<p><a target="_blank" href="https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/">Q-learning</a> is a model-free, value-based RL algorithm. Let me break down what that means:</p>
<ul>
<li><p><strong>Model-free</strong> means that the agent doesn't build or consult a model of how the game responds to its moves. It learns purely from the outcomes it experiences.</p>
</li>
<li><p><strong>Value-based</strong> means that the agent learns the "value" of each action in each state, then chooses the action with the highest value.</p>
</li>
</ul>
<h3 id="heading-core-components">Core Components</h3>
<p>There are a few key components you’ll need to understand before building this game.</p>
<p>First, we have <strong>state (s)</strong>, which here is the current game board configuration. We represent this as a 9-character string (for example, <code>"XO-X-----"</code> where <code>-</code> represents empty cells).</p>
<p>Next, we have <strong>action (a)</strong>, which is a move the AI can make. We represent this as an index from 0-8 corresponding to board positions.</p>
<p>Then there’s <strong>reward (r)</strong>, the numerical feedback from the environment:</p>
<ul>
<li><p><code>+1</code> for winning</p>
</li>
<li><p><code>-1</code> for losing</p>
</li>
<li><p><code>0</code> for draws or ongoing games</p>
</li>
</ul>
<p>We also have the <strong>Q-table</strong>, a lookup table storing Q(s,a) – the expected cumulative reward for taking action <code>a</code> in state <code>s</code>.</p>
<p>And finally, there’s <strong>policy</strong>, the strategy for choosing actions. We use an epsilon-greedy policy that balances exploration and exploitation.</p>
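<p>To make these components concrete, here is a minimal sketch of how they could map onto plain JavaScript values. The function and variable names (<code>qTable</code>, <code>getQValues</code>, <code>availableActions</code>, <code>rewardFor</code>) are assumptions made for illustration and may differ from the classes built later in the tutorial. The sketch stores the Q-table as a <code>Map</code> from the 9-character state string to an array of nine Q-values.</p>

```javascript
// Hypothetical helper names; the 9-character state string and the 0-8
// action indices follow the representation described above.
const qTable = new Map(); // state string -> array of nine Q-values

// Returns the nine Q-values for a state, creating a zero-filled row on first visit.
function getQValues(state) {
  if (!qTable.has(state)) qTable.set(state, new Array(9).fill(0));
  return qTable.get(state);
}

// Actions are the indices of the empty cells.
function availableActions(state) {
  const actions = [];
  for (let i = 0; i < 9; i++) {
    if (state[i] === "-") actions.push(i);
  }
  return actions;
}

// Reward from the AI's point of view.
function rewardFor(outcome) {
  if (outcome === "win") return 1;
  if (outcome === "loss") return -1;
  return 0; // draw or game still in progress
}

const state = "XO-X-----";
console.log(availableActions(state)); // [2, 4, 5, 6, 7, 8]
console.log(getQValues(state)[4]);    // 0 (nothing learned yet)
```

<p>Creating rows lazily means the table only ever holds states the agent has actually visited.</p>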
<h3 id="heading-the-q-learning-update-rule">The Q-Learning Update Rule</h3>
<p>The heart of Q-learning is this update formula:</p>
<pre><code class="lang-bash">Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]
</code></pre>
<p>Where:</p>
<ul>
<li><p><code>α</code> (alpha) = Learning rate (0 to 1) – how much to update the Q-value</p>
</li>
<li><p><code>γ</code> (gamma) = Discount factor (0 to 1) – how much to value future rewards</p>
</li>
<li><p><code>s'</code> = Next state after taking action <code>a</code></p>
</li>
<li><p><code>max Q(s',a')</code> = Highest Q-value available in the next state</p>
</li>
</ul>
<p>This formula implements <strong>temporal difference learning</strong>. It nudges the current estimate of Q(s,a) toward a better target: the reward actually received plus the discounted best Q-value available in the next state. The learning rate α controls how large each nudge is.</p>
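<p>A single update can be worked through in a few lines of code. The function below is a sketch with a hypothetical name and signature (<code>qUpdate</code>), not the tutorial's final implementation. It assumes the caller passes an empty array of next-state values when the game has ended, so the future term drops to 0.</p>

```javascript
// One temporal-difference update for a single (state, action) pair.
// nextQValues holds the Q-values of the moves still available in s';
// pass an empty array when s' is terminal so the future term is 0.
function qUpdate(currentQ, reward, nextQValues, alpha, gamma) {
  const bestNext = nextQValues.length > 0 ? Math.max(...nextQValues) : 0;
  const tdTarget = reward + gamma * bestNext; // r + γ max Q(s',a')
  return currentQ + alpha * (tdTarget - currentQ);
}

// Q(s,a) = 0.2, r = 0, best next value 0.5, alpha = 0.1, gamma = 0.9
// target = 0 + 0.9 * 0.5 = 0.45, so Q moves 10% of the way from 0.2 to 0.45.
const updated = qUpdate(0.2, 0, [0.1, 0.5, -0.3], 0.1, 0.9);
console.log(updated.toFixed(3)); // "0.225"
```

<p>With a learning rate of 0.1, a winning move that starts at 0 only rises to 0.1 after one game, which is why the agent needs many games before its estimates settle.</p>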
<h3 id="heading-how-exploration-vs-exploitation-works">How Exploration vs Exploitation Works</h3>
<p>A critical challenge in reinforcement learning is the "exploration vs. exploitation" trade-off. To understand why this is difficult, imagine choosing a place for dinner.</p>
<ul>
<li><p><strong>Exploitation:</strong> You could go to your favorite restaurant. You know the food is good, and you're almost guaranteed a satisfying meal. This is a safe, reliable choice that maximizes your immediate reward based on past experience.</p>
</li>
<li><p><strong>Exploration:</strong> You could try a new, unknown restaurant. It might be a disaster, or you might discover a new favorite that’s even better than your old one. This is a risky choice that provides no immediate guarantee, but it's the only way to gather new information and potentially find a better long-term strategy.</p>
</li>
</ul>
<p>The same dilemma applies to our AI. If it only exploits its current knowledge, it might get stuck using a mediocre strategy, never discovering the brilliant moves that lead to a guaranteed win. If it only explores by making random moves, it will never learn to use the good strategies it finds and will play poorly.</p>
<p>The key is to balance the two: explore enough to find optimal strategies, but exploit that knowledge to win games.</p>
<p>To achieve this balance, we use an <strong>epsilon-greedy (ϵ) strategy</strong>. It’s a simple but powerful way to manage this trade-off:</p>
<ol>
<li><p>We choose a small value for epsilon (ϵ), for example, 0.1 (which represents a 10% probability).</p>
</li>
<li><p>Before the AI makes a move, it generates a random number between 0 and 1.</p>
</li>
<li><p><strong>If the random number is less than ϵ (the 10% chance):</strong> The AI ignores its strategy and chooses a random available move. This is <strong>exploration</strong>.</p>
</li>
<li><p><strong>If the random number is greater than or equal to ϵ (the 90% chance):</strong> The AI chooses the best-known move from its Q-table. This is <strong>exploitation</strong>.</p>
</li>
</ol>
<p>This ensures the AI primarily plays to win but still dedicates a small fraction of its moves to trying new things. We will also implement <strong>epsilon decay</strong> – starting with a higher ϵ value to encourage exploration when the AI is inexperienced, and gradually lowering it as the AI learns and becomes more confident in its strategy.</p>
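<p>The selection steps and the decay idea can be sketched as two small functions. The names, the injectable <code>rng</code> parameter, the decay rate of 0.995, and the floor of 0.01 are all assumptions chosen for illustration; the tutorial's own values may differ.</p>

```javascript
// Picks an action index with an epsilon-greedy rule.
// qValues: Q-values for the current state, indexed by board position.
// actions: indices of the empty cells.
// rng: source of random numbers in [0, 1), injectable so the function can be tested.
function chooseAction(qValues, actions, epsilon, rng = Math.random) {
  if (rng() < epsilon) {
    // Explore: pick a uniformly random available move.
    return actions[Math.floor(rng() * actions.length)];
  }
  // Exploit: pick the available move with the highest Q-value.
  let best = actions[0];
  for (const a of actions) {
    if (qValues[a] > qValues[best]) best = a;
  }
  return best;
}

// Multiplicative decay with a floor, intended to be applied once per training game.
function decayEpsilon(epsilon, decayRate = 0.995, minEpsilon = 0.01) {
  return Math.max(minEpsilon, epsilon * decayRate);
}

const q = [0, 0, 0.4, 0, 0.9, 0.1, 0, 0, 0];
console.log(chooseAction(q, [2, 4, 5], 0)); // 4 (epsilon 0 always exploits)
```

<p>Keeping a small floor on ϵ during training means the agent never stops exploring entirely, while setting ϵ to 0 at play time makes it use only what it has learned.</p>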
<h2 id="heading-project-architecture-overview">Project Architecture Overview</h2>
<p>Before you start coding, here's the structure of the application you’ll build:</p>
<pre><code class="lang-bash">tic-tac-toe-ai/
├── index.html          <span class="hljs-comment"># Game interface with Tailwind CSS</span>
└── game.js             <span class="hljs-comment"># Complete game logic and AI</span>
</code></pre>
<p>You will organize your code into two main classes in <code>game.js</code>:</p>
<ol>
<li><p><strong>QLearning</strong>: Implements the Q-learning algorithm.</p>
</li>
<li><p><strong>TicTacToe</strong>: Manages game state and rendering.</p>
</li>
</ol>
<h2 id="heading-how-to-build-the-html-interface-with-tailwind-css">How to Build the HTML Interface with Tailwind CSS</h2>
<p>Create an <code>index.html</code> file that loads Tailwind CSS from its CDN:</p>
<pre><code class="lang-xml"><span class="hljs-meta">&lt;!DOCTYPE <span class="hljs-meta-keyword">html</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">html</span> <span class="hljs-attr">lang</span>=<span class="hljs-string">"en"</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">head</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">meta</span> <span class="hljs-attr">charset</span>=<span class="hljs-string">"UTF-8"</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">meta</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"viewport"</span> <span class="hljs-attr">content</span>=<span class="hljs-string">"width=device-width, initial-scale=1.0"</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">title</span>&gt;</span>Tic-Tac-Toe AI with Q-Learning<span class="hljs-tag">&lt;/<span class="hljs-name">title</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">script</span> <span class="hljs-attr">src</span>=<span class="hljs-string">"https://cdn.tailwindcss.com"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">script</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">head</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">body</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"bg-gradient-to-br from-purple-600 to-purple-900 min-h-screen flex items-center justify-center p-4"</span>&gt;</span>

  <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"bg-white rounded-3xl shadow-2xl p-8 max-w-5xl w-full"</span>&gt;</span>
    <span class="hljs-comment">&lt;!-- Header --&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-center mb-8"</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">h1</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-4xl font-bold text-gray-800 mb-2"</span>&gt;</span>🎮 Tic-Tac-Toe AI<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">p</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-gray-600 text-lg"</span>&gt;</span>Watch the AI learn through reinforcement learning<span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>

    <span class="hljs-comment">&lt;!-- Training Indicator --&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"trainingIndicator"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"hidden bg-yellow-100 border-l-4 border-yellow-500 text-yellow-700 p-4 mb-6 rounded"</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">p</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"font-semibold"</span>&gt;</span>🤖 AI is training... <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"trainingProgress"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>

    <span class="hljs-comment">&lt;!-- Main Game Area --&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"grid md:grid-cols-2 gap-8"</span>&gt;</span>

      <span class="hljs-comment">&lt;!-- Canvas Section --&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"flex flex-col items-center"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">canvas</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"gameCanvas"</span> <span class="hljs-attr">width</span>=<span class="hljs-string">"400"</span> <span class="hljs-attr">height</span>=<span class="hljs-string">"400"</span> 
                <span class="hljs-attr">class</span>=<span class="hljs-string">"border-4 border-purple-500 rounded-xl shadow-lg cursor-pointer hover:scale-[1.02] transition-transform"</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">canvas</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"gameStatus"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"mt-4 text-xl font-bold text-gray-700 min-h-[30px]"</span>&gt;</span>
          Your turn! (X)
        <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>

      <span class="hljs-comment">&lt;!-- Controls Section --&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"space-y-6"</span>&gt;</span>

        <span class="hljs-comment">&lt;!-- Game Controls --&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"bg-gray-50 rounded-xl p-6"</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">h3</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-xl font-bold text-gray-800 mb-4"</span>&gt;</span>Game Controls<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"space-y-3"</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">onclick</span>=<span class="hljs-string">"game.reset()"</span> 
                    <span class="hljs-attr">class</span>=<span class="hljs-string">"w-full bg-purple-600 hover:bg-purple-700 text-white font-semibold py-3 px-6 rounded-lg transition-all hover:-translate-y-0.5 shadow-md hover:shadow-lg"</span>&gt;</span>
              New Game
            <span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">onclick</span>=<span class="hljs-string">"game.startTraining()"</span> 
                    <span class="hljs-attr">class</span>=<span class="hljs-string">"w-full bg-green-600 hover:bg-green-700 text-white font-semibold py-3 px-6 rounded-lg transition-all hover:-translate-y-0.5 shadow-md hover:shadow-lg"</span>&gt;</span>
              Train AI (1000 games)
            <span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">onclick</span>=<span class="hljs-string">"game.resetAI()"</span> 
                    <span class="hljs-attr">class</span>=<span class="hljs-string">"w-full bg-red-600 hover:bg-red-700 text-white font-semibold py-3 px-6 rounded-lg transition-all hover:-translate-y-0.5 shadow-md hover:shadow-lg"</span>&gt;</span>
              Reset AI Memory
            <span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
          <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>

        <span class="hljs-comment">&lt;!-- Difficulty Selector --&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"bg-gray-50 rounded-xl p-6"</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">h3</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-xl font-bold text-gray-800 mb-4"</span>&gt;</span>Difficulty Level<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"grid grid-cols-3 gap-2"</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">onclick</span>=<span class="hljs-string">"game.setDifficulty('beginner')"</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"diffBeginner"</span>
                    <span class="hljs-attr">class</span>=<span class="hljs-string">"py-2 px-4 rounded-lg font-semibold text-sm transition-all bg-green-100 text-green-700 hover:bg-green-200"</span>&gt;</span>
              🌱 Beginner
            <span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">onclick</span>=<span class="hljs-string">"game.setDifficulty('intermediate')"</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"diffIntermediate"</span>
                    <span class="hljs-attr">class</span>=<span class="hljs-string">"py-2 px-4 rounded-lg font-semibold text-sm transition-all bg-white text-gray-700 hover:bg-gray-100 border-2 border-purple-500"</span>&gt;</span>
              🎯 Medium
            <span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">button</span> <span class="hljs-attr">onclick</span>=<span class="hljs-string">"game.setDifficulty('expert')"</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"diffExpert"</span>
                    <span class="hljs-attr">class</span>=<span class="hljs-string">"py-2 px-4 rounded-lg font-semibold text-sm transition-all bg-white text-gray-700 hover:bg-gray-100"</span>&gt;</span>
              🔥 Expert
            <span class="hljs-tag">&lt;/<span class="hljs-name">button</span>&gt;</span>
          <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>

        <span class="hljs-comment">&lt;!-- AI Parameters --&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"bg-gray-50 rounded-xl p-6"</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">h3</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-xl font-bold text-gray-800 mb-4"</span>&gt;</span>AI Parameters<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>

          <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"space-y-4"</span>&gt;</span>
            <span class="hljs-comment">&lt;!-- Learning Rate --&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">div</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"flex justify-between items-center mb-2"</span>&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">label</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-sm font-medium text-gray-700 flex items-center gap-1"</span>&gt;</span>
                  Learning Rate (α)
                  <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"group relative"</span>&gt;</span>
                    <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"cursor-help text-purple-500"</span>&gt;</span>ⓘ<span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span>
                    <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"invisible group-hover:visible absolute left-0 top-6 w-64 bg-gray-900 text-white text-xs rounded-lg p-3 z-10 shadow-xl"</span>&gt;</span>
                      Controls how quickly the AI updates its knowledge. Higher values = faster learning but less stability. Recommended: 0.1-0.3
                    <span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span>
                  <span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span>
                <span class="hljs-tag">&lt;/<span class="hljs-name">label</span>&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"learningRateValue"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-sm font-bold text-purple-600"</span>&gt;</span>0.1<span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span>
              <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">input</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"range"</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"learningRate"</span> <span class="hljs-attr">min</span>=<span class="hljs-string">"0.01"</span> <span class="hljs-attr">max</span>=<span class="hljs-string">"0.5"</span> <span class="hljs-attr">step</span>=<span class="hljs-string">"0.01"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"0.1"</span>
                     <span class="hljs-attr">class</span>=<span class="hljs-string">"w-full h-2 bg-gray-200 rounded-lg appearance-none cursor-pointer"</span>&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>

            <span class="hljs-comment">&lt;!-- Discount Factor --&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">div</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"flex justify-between items-center mb-2"</span>&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">label</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-sm font-medium text-gray-700 flex items-center gap-1"</span>&gt;</span>
                  Discount Factor (γ)
                  <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"group relative"</span>&gt;</span>
                    <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"cursor-help text-purple-500"</span>&gt;</span>ⓘ<span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span>
                    <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"invisible group-hover:visible absolute left-0 top-6 w-64 bg-gray-900 text-white text-xs rounded-lg p-3 z-10 shadow-xl"</span>&gt;</span>
                      Determines how much the AI values future rewards vs immediate rewards. Higher = more long-term thinking. Recommended: 0.85-0.95
                    <span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span>
                  <span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span>
                <span class="hljs-tag">&lt;/<span class="hljs-name">label</span>&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"discountFactorValue"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-sm font-bold text-purple-600"</span>&gt;</span>0.9<span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span>
              <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">input</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"range"</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"discountFactor"</span> <span class="hljs-attr">min</span>=<span class="hljs-string">"0.5"</span> <span class="hljs-attr">max</span>=<span class="hljs-string">"0.99"</span> <span class="hljs-attr">step</span>=<span class="hljs-string">"0.01"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"0.9"</span>
                     <span class="hljs-attr">class</span>=<span class="hljs-string">"w-full h-2 bg-gray-200 rounded-lg appearance-none cursor-pointer"</span>&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>

            <span class="hljs-comment">&lt;!-- Exploration Rate --&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">div</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"flex justify-between items-center mb-2"</span>&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">label</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-sm font-medium text-gray-700 flex items-center gap-1"</span>&gt;</span>
                  Exploration Rate (ε)
                  <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"group relative"</span>&gt;</span>
                    <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"cursor-help text-purple-500"</span>&gt;</span>ⓘ<span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span>
                    <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"invisible group-hover:visible absolute left-0 top-6 w-64 bg-gray-900 text-white text-xs rounded-lg p-3 z-10 shadow-xl"</span>&gt;</span>
                      Chance the AI tries random moves vs using learned strategy. Higher = more experimentation. Set to 0.01 for best play after training.
                    <span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span>
                  <span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span>
                <span class="hljs-tag">&lt;/<span class="hljs-name">label</span>&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">span</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"explorationRateValue"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-sm font-bold text-purple-600"</span>&gt;</span>0.1<span class="hljs-tag">&lt;/<span class="hljs-name">span</span>&gt;</span>
              <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">input</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"range"</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"explorationRate"</span> <span class="hljs-attr">min</span>=<span class="hljs-string">"0"</span> <span class="hljs-attr">max</span>=<span class="hljs-string">"0.5"</span> <span class="hljs-attr">step</span>=<span class="hljs-string">"0.01"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"0.1"</span>
                     <span class="hljs-attr">class</span>=<span class="hljs-string">"w-full h-2 bg-gray-200 rounded-lg appearance-none cursor-pointer"</span>&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
          <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>

        <span class="hljs-comment">&lt;!-- Statistics --&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"bg-gray-50 rounded-xl p-6"</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">h3</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-xl font-bold text-gray-800 mb-4"</span>&gt;</span>Statistics<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
          <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"grid grid-cols-3 gap-3"</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"bg-white rounded-lg p-3 text-center shadow-sm"</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-xs text-gray-600 mb-1"</span>&gt;</span>Games<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"gamesPlayed"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-2xl font-bold text-gray-800"</span>&gt;</span>0<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"bg-white rounded-lg p-3 text-center shadow-sm"</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-xs text-gray-600 mb-1"</span>&gt;</span>AI Wins<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"aiWins"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-2xl font-bold text-green-600"</span>&gt;</span>0<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"bg-white rounded-lg p-3 text-center shadow-sm"</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-xs text-gray-600 mb-1"</span>&gt;</span>You Win<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"playerWins"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-2xl font-bold text-red-600"</span>&gt;</span>0<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"bg-white rounded-lg p-3 text-center shadow-sm"</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-xs text-gray-600 mb-1"</span>&gt;</span>Draws<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"draws"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-2xl font-bold text-gray-600"</span>&gt;</span>0<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"bg-white rounded-lg p-3 text-center shadow-sm"</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-xs text-gray-600 mb-1"</span>&gt;</span>States<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"statesLearned"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-2xl font-bold text-purple-600"</span>&gt;</span>0<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"bg-white rounded-lg p-3 text-center shadow-sm"</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-xs text-gray-600 mb-1"</span>&gt;</span>Win Rate<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
              <span class="hljs-tag">&lt;<span class="hljs-name">div</span> <span class="hljs-attr">id</span>=<span class="hljs-string">"winRate"</span> <span class="hljs-attr">class</span>=<span class="hljs-string">"text-2xl font-bold text-blue-600"</span>&gt;</span>0%<span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
          <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>

      <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>
  <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span>

  <span class="hljs-tag">&lt;<span class="hljs-name">script</span> <span class="hljs-attr">src</span>=<span class="hljs-string">"game.js"</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-name">script</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">body</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">html</span>&gt;</span>
</code></pre>
<p>This HTML structure creates a responsive, modern interface using Tailwind CSS utility classes. The layout uses a two-column grid on medium screens and larger, with the game canvas on the left and all controls on the right. The training indicator starts hidden and only appears during AI training sessions.</p>
<p>All interactive elements (buttons, sliders) use <code>onclick</code> handlers and <code>oninput</code> events to communicate with the JavaScript game logic. The tooltip system uses CSS group hover states to show explanatory text when users hover over the info icons, helping them understand each parameter without cluttering the interface.</p>
<p>Here's a closer look at the key parts of the markup:</p>
<ul>
<li><p><strong>Header Section</strong>: Displays the game title and subtitle to introduce users to the application.</p>
</li>
<li><p><strong>Training Indicator</strong>: A yellow banner that appears only during AI training sessions, showing progress updates every 50 games. This provides visual feedback so users know the training is in progress.</p>
</li>
<li><p><strong>Canvas Section</strong>: Contains the HTML5 Canvas element where the game board is drawn. The canvas is 400x400 pixels and styled with Tailwind classes for borders and hover effects. Below it is a status message that updates based on game state.</p>
</li>
<li><p><strong>Game Controls</strong>: Three primary buttons that let users start a new game, train the AI through 1000 self-play games, or completely reset the AI's memory (clearing the Q-table).</p>
</li>
<li><p><strong>Difficulty Selector</strong>: Three buttons for choosing AI difficulty. Beginner mode makes the AI play randomly 70% of the time, Intermediate uses Q-learning, and Expert implements perfect minimax play.</p>
</li>
<li><p><strong>AI Parameters</strong>: Three range sliders with tooltips that let users adjust the core reinforcement learning hyperparameters in real-time. The tooltips appear on hover and explain what each parameter does.</p>
</li>
<li><p><strong>Statistics Panel</strong>: A grid of six cards displaying real-time metrics including games played, wins/losses/draws, learned states, and AI win rate percentage.</p>
</li>
</ul>
<p>The buttons' <code>onclick</code> handlers call methods on the <code>game</code> object defined in <code>game.js</code>, while the sliders' <code>oninput</code> handlers are attached in JavaScript by the <code>initControls</code> method.</p>
<h2 id="heading-how-to-implement-the-q-learning-algorithm">How to Implement the Q-Learning Algorithm</h2>
<p>Now, let's bring the theory to life. Create a <code>game.js</code> file. We will build this file step-by-step, but if you get stuck at any point or want to see the complete code for reference, you can find the final version <a target="_blank" href="https://github.com/mayur9210/tic-tac-toe-ai/blob/main/game.js">on <strong>GitHub</strong> here</a>.</p>
<p>Our code will be structured into two main classes: <code>QLearning</code>, which will handle the AI's "brain" and learning logic, and <code>TicTacToe</code>, which will manage the game state, rendering, and user interaction.</p>
<h3 id="heading-the-qlearning-class-the-ais-brain">The <code>QLearning</code> Class: The AI's Brain</h3>
<p>This class will contain all the logic for the <a target="_blank" href="https://github.com/mayur9210/tic-tac-toe-ai/blob/main/game.js">reinforcement learning agent</a>. Let's build it piece by piece.</p>
<h4 id="heading-1-constructor-and-q-table-management">1. Constructor and Q-Table Management</h4>
<p>First, let's set up the <code>constructor</code> and a method to access our Q-table. The Q-table will be a JavaScript <code>Map</code> keyed by the board state, a nine-character string such as <code>'---------'</code> for an empty board. Each value is an array of nine Q-values, one per cell, and <code>Map</code> gives us fast lookups by string key.</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// In game.js</span>

<span class="hljs-comment">// Q-Learning Agent with localStorage support</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">QLearning</span> </span>{
  <span class="hljs-keyword">constructor</span>(lr = 0.1, gamma = 0.9, epsilon = 0.1) {
    <span class="hljs-built_in">this</span>.q = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Map</span>(); <span class="hljs-comment">// Stores Q-values: { state =&gt; [q_action_0, q_action_1, ...] }</span>
    <span class="hljs-built_in">this</span>.lr = lr; <span class="hljs-comment">// Learning Rate (α)</span>
    <span class="hljs-built_in">this</span>.gamma = gamma; <span class="hljs-comment">// Discount Factor (γ)</span>
    <span class="hljs-built_in">this</span>.epsilon = epsilon; <span class="hljs-comment">// Exploration Rate (ε)</span>
    <span class="hljs-built_in">this</span>.difficulty = <span class="hljs-string">'intermediate'</span>;
  }

  getQ(state) {
    <span class="hljs-keyword">if</span> (!<span class="hljs-built_in">this</span>.q.has(state)) {
      <span class="hljs-built_in">this</span>.q.set(state, <span class="hljs-built_in">Array</span>(<span class="hljs-number">9</span>).fill(<span class="hljs-number">0</span>));
    }
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.q.get(state);
  }
</code></pre>
<ul>
<li><p>The <code>constructor</code> initializes our three key hyperparameters (α, γ, ε) and the Q-table itself.</p>
</li>
<li><p><code>getQ(state)</code> is a crucial helper function. It safely retrieves the array of Q-values for a given board state. If the AI has never seen this state before, it creates a new entry in the map with an array of nine zeros, representing an initial Q-value of 0 for each possible move.</p>
</li>
</ul>
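<p>To make the lazy initialization concrete, here is a minimal sketch that mirrors <code>getQ</code> outside the class. The standalone <code>qTable</code> variable and the sample board position are assumptions made for illustration. They are not part of the article's <code>game.js</code>.</p>

```javascript
// Minimal stand-in for the Q-table: board string -> array of nine Q-values.
const qTable = new Map();

// Same lazy-initialization idea as getQ(): create the row on first access.
function getQ(state) {
  if (!qTable.has(state)) {
    qTable.set(state, Array(9).fill(0));
  }
  return qTable.get(state);
}

// Hypothetical position: X in the center, every other cell empty.
const state = '----X----';
const row = getQ(state); // first lookup creates nine zeros
row[0] = 0.25;           // the array is stored by reference, so this edits the table

console.log(qTable.size);    // 1
console.log(getQ(state)[0]); // 0.25
```

<p>Because <code>getQ</code> returns the stored array itself rather than a copy, any code that writes to the returned array is updating the Q-table directly.</p>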
<h4 id="heading-2-choosing-an-action-the-epsilon-greedy-strategy">2. Choosing an Action (The Epsilon-Greedy Strategy)</h4>
<p>Next, we'll implement the <code>getAction</code> method. This is where the AI decides which move to make, incorporating our difficulty levels and the epsilon-greedy strategy.</p>
<pre><code class="lang-javascript">  getAction(state, available) {
    <span class="hljs-comment">// Difficulty-based behavior</span>
    <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.difficulty === <span class="hljs-string">'beginner'</span>) {
      <span class="hljs-comment">// 70% random moves for beginner</span>
      <span class="hljs-keyword">if</span> (<span class="hljs-built_in">Math</span>.random() &lt; <span class="hljs-number">0.7</span>) {
        <span class="hljs-keyword">return</span> available[~~(<span class="hljs-built_in">Math</span>.random() * available.length)];
      }
    } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.difficulty === <span class="hljs-string">'expert'</span>) {
      <span class="hljs-comment">// Use minimax for perfect play</span>
      <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.getMinimaxAction(state, available);
    }

    <span class="hljs-comment">// Intermediate: epsilon-greedy</span>
    <span class="hljs-keyword">if</span> (<span class="hljs-built_in">Math</span>.random() &lt; <span class="hljs-built_in">this</span>.epsilon) {
      <span class="hljs-keyword">return</span> available[~~(<span class="hljs-built_in">Math</span>.random() * available.length)];
    }
    <span class="hljs-keyword">const</span> q = <span class="hljs-built_in">this</span>.getQ(state);
    <span class="hljs-keyword">return</span> available.reduce(<span class="hljs-function">(<span class="hljs-params">best, a</span>) =&gt;</span> q[a] &gt; q[best] ? a : best, available[<span class="hljs-number">0</span>]);
  }
</code></pre>
<ul>
<li><p>The logic first checks the difficulty. 'Beginner' picks a random move 70% of the time and otherwise falls through to the Q-learning logic below, while 'Expert' defers to a separate, perfect-play algorithm.</p>
</li>
<li><p>For the 'Intermediate' level, it implements the epsilon-greedy logic. With probability ε, it explores (chooses a random available move). Otherwise, it exploits (chooses the available move with the highest Q-value for the current state).</p>
</li>
</ul>
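<p>The explore/exploit split can be sketched as a standalone function. The <code>rng</code> parameter is an assumption added here so the outcome can be forced in either direction, and the Q-values are made-up numbers. The article's <code>getAction</code> calls <code>Math.random()</code> directly.</p>

```javascript
// Epsilon-greedy selection as a standalone function.
// `rng` is a hypothetical parameter added for this sketch so results are repeatable.
function chooseAction(qValues, available, epsilon, rng = Math.random) {
  if (rng() < epsilon) {
    // Explore: pick any legal move uniformly at random.
    return available[Math.floor(rng() * available.length)];
  }
  // Exploit: pick the legal move with the highest Q-value (ties keep the earlier move).
  return available.reduce((best, a) => (qValues[a] > qValues[best] ? a : best), available[0]);
}

// Made-up Q-values for one board state; only cells 1, 4, and 7 are still empty.
const qValues = [0, 0.2, 0, 0, 0.7, 0, 0, -0.3, 0];
const available = [1, 4, 7];

// rng always returns 0.99, so with epsilon = 0.1 the function exploits.
console.log(chooseAction(qValues, available, 0.1, () => 0.99)); // 4

// rng always returns 0, so the function explores and picks the first available move.
console.log(chooseAction(qValues, available, 0.1, () => 0)); // 1
```

<p>Note that only moves listed in <code>available</code> are ever compared, so an occupied cell can never be chosen even if its stored Q-value happens to be high.</p>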
<h4 id="heading-3-the-learning-rule">3. The Learning Rule</h4>
<p>The <code>update</code> method is the heart of the algorithm. It's the direct implementation of the Q-learning formula we discussed earlier.</p>
<p><em>Q(s, a) ← Q(s, a) + α [r + γ max<sub>a'</sub> Q(s', a') − Q(s, a)]</em></p>
<pre><code class="lang-javascript">  update(s, a, r, s2, available2) {
    <span class="hljs-keyword">const</span> q = <span class="hljs-built_in">this</span>.getQ(s);
    <span class="hljs-keyword">const</span> maxQ2 = available2.length ? <span class="hljs-built_in">Math</span>.max(...available2.map(<span class="hljs-function"><span class="hljs-params">a_prime</span> =&gt;</span> <span class="hljs-built_in">this</span>.getQ(s2)[a_prime])) : <span class="hljs-number">0</span>;
    q[a] += <span class="hljs-built_in">this</span>.lr * (r + <span class="hljs-built_in">this</span>.gamma * maxQ2 - q[a]);
  }
</code></pre>
<ul>
<li><p><code>maxQ2</code> calculates the <code>max Q(s',a')</code> part of the formula: the highest Q-value among the moves available in the next state <code>s2</code>. When no moves remain because the game is over, it is 0.</p>
</li>
<li><p>The final line is a direct translation of the formula, updating the value of the action just taken based on the reward and future potential.</p>
</li>
</ul>
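<p>A quick worked example may help to see how the numbers move. The sketch below pulls the update into a pure function and runs it on two made-up cases. The reward of +1 for a win and the starting Q-values are assumptions for illustration, not values taken from the article's training loop.</p>

```javascript
// One Q-learning update, written as a pure function of plain numbers.
function qUpdate(oldQ, reward, maxNextQ, lr, gamma) {
  return oldQ + lr * (reward + gamma * maxNextQ - oldQ);
}

const lr = 0.1;    // learning rate (α)
const gamma = 0.9; // discount factor (γ)

// Case 1: the move ends the game with a win (assumed reward +1), so there is no next state.
const afterWin = qUpdate(0, 1, 0, lr, gamma);

// Case 2: no immediate reward, but the best follow-up move is currently valued at 1.
const afterPromising = qUpdate(0.5, 0, 1, lr, gamma);

console.log(afterWin.toFixed(2));       // "0.10"
console.log(afterPromising.toFixed(2)); // "0.54"
```

<p>In both cases the Q-value moves only 10% of the way toward its target (<code>r + γ · maxQ2</code>), which is why the agent needs many games before its estimates settle.</p>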
<h4 id="heading-4-minimax-for-expert-mode">4. Minimax for Expert Mode</h4>
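<p>For our 'Expert' level, we'll implement the minimax algorithm, a classic recursive algorithm from game theory that guarantees perfect play by searching every possible continuation of the game. The AI plays 'O' and is the maximizing player, while the human plays 'X' and is the minimizing player. Subtracting <code>depth</code> from the score makes the AI prefer faster wins and slower losses.</p>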
<pre><code class="lang-javascript">  getMinimaxAction(state, available) {
    <span class="hljs-keyword">let</span> bestScore = -<span class="hljs-literal">Infinity</span>;
    <span class="hljs-keyword">let</span> bestMove = available[<span class="hljs-number">0</span>];

    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> move <span class="hljs-keyword">of</span> available) {
      <span class="hljs-keyword">const</span> newState = state.substring(<span class="hljs-number">0</span>, move) + <span class="hljs-string">'O'</span> + state.substring(move + <span class="hljs-number">1</span>);
      <span class="hljs-keyword">const</span> score = <span class="hljs-built_in">this</span>.minimax(newState, <span class="hljs-number">0</span>, <span class="hljs-literal">false</span>);
      <span class="hljs-keyword">if</span> (score &gt; bestScore) {
        bestScore = score;
        bestMove = move;
      }
    }
    <span class="hljs-keyword">return</span> bestMove;
  }

  minimax(state, depth, isMaximizing) {
    <span class="hljs-keyword">const</span> winner = <span class="hljs-built_in">this</span>.checkWinnerStatic(state);
    <span class="hljs-keyword">if</span> (winner === <span class="hljs-string">'O'</span>) <span class="hljs-keyword">return</span> <span class="hljs-number">10</span> - depth;
    <span class="hljs-keyword">if</span> (winner === <span class="hljs-string">'X'</span>) <span class="hljs-keyword">return</span> depth - <span class="hljs-number">10</span>;
    <span class="hljs-keyword">if</span> (winner === <span class="hljs-string">'draw'</span>) <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;

    <span class="hljs-keyword">const</span> available = [...state].map(<span class="hljs-function">(<span class="hljs-params">c, i</span>) =&gt;</span> c === <span class="hljs-string">'-'</span> ? i : <span class="hljs-literal">null</span>).filter(<span class="hljs-function"><span class="hljs-params">x</span> =&gt;</span> x !== <span class="hljs-literal">null</span>);

    <span class="hljs-keyword">if</span> (isMaximizing) {
      <span class="hljs-keyword">let</span> best = -<span class="hljs-literal">Infinity</span>;
      <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> move <span class="hljs-keyword">of</span> available) {
        <span class="hljs-keyword">const</span> newState = state.substring(<span class="hljs-number">0</span>, move) + <span class="hljs-string">'O'</span> + state.substring(move + <span class="hljs-number">1</span>);
        best = <span class="hljs-built_in">Math</span>.max(best, <span class="hljs-built_in">this</span>.minimax(newState, depth + <span class="hljs-number">1</span>, <span class="hljs-literal">false</span>));
      }
      <span class="hljs-keyword">return</span> best;
    } <span class="hljs-keyword">else</span> {
      <span class="hljs-keyword">let</span> best = <span class="hljs-literal">Infinity</span>;
      <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> move <span class="hljs-keyword">of</span> available) {
        <span class="hljs-keyword">const</span> newState = state.substring(<span class="hljs-number">0</span>, move) + <span class="hljs-string">'X'</span> + state.substring(move + <span class="hljs-number">1</span>);
        best = <span class="hljs-built_in">Math</span>.min(best, <span class="hljs-built_in">this</span>.minimax(newState, depth + <span class="hljs-number">1</span>, <span class="hljs-literal">true</span>));
      }
      <span class="hljs-keyword">return</span> best;
    }
  }

  checkWinnerStatic(state) {
    <span class="hljs-keyword">const</span> patterns = [[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>,<span class="hljs-number">2</span>],[<span class="hljs-number">3</span>,<span class="hljs-number">4</span>,<span class="hljs-number">5</span>],[<span class="hljs-number">6</span>,<span class="hljs-number">7</span>,<span class="hljs-number">8</span>],[<span class="hljs-number">0</span>,<span class="hljs-number">3</span>,<span class="hljs-number">6</span>],[<span class="hljs-number">1</span>,<span class="hljs-number">4</span>,<span class="hljs-number">7</span>],[<span class="hljs-number">2</span>,<span class="hljs-number">5</span>,<span class="hljs-number">8</span>],[<span class="hljs-number">0</span>,<span class="hljs-number">4</span>,<span class="hljs-number">8</span>],[<span class="hljs-number">2</span>,<span class="hljs-number">4</span>,<span class="hljs-number">6</span>]];
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> p <span class="hljs-keyword">of</span> patterns) {
      <span class="hljs-keyword">if</span> (state[p[<span class="hljs-number">0</span>]] !== <span class="hljs-string">'-'</span> &amp;&amp; state[p[<span class="hljs-number">0</span>]] === state[p[<span class="hljs-number">1</span>]] &amp;&amp; state[p[<span class="hljs-number">1</span>]] === state[p[<span class="hljs-number">2</span>]]) {
        <span class="hljs-keyword">return</span> state[p[<span class="hljs-number">0</span>]];
      }
    }
    <span class="hljs-keyword">return</span> state.includes(<span class="hljs-string">'-'</span>) ? <span class="hljs-literal">null</span> : <span class="hljs-string">'draw'</span>;
  }
</code></pre>
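<p>To see the search in action, here is a compact standalone sketch of the same idea with two test positions. The function names (<code>winnerOf</code>, <code>place</code>, <code>emptyCells</code>, <code>bestMoveForO</code>) are hypothetical helpers introduced for this example. The scoring and the assumption that the AI plays 'O' follow the class methods above.</p>

```javascript
// Standalone version of the minimax logic above, so the sketch runs on its own.
const LINES = [[0,1,2],[3,4,5],[6,7,8],[0,3,6],[1,4,7],[2,5,8],[0,4,8],[2,4,6]];

// Returns 'X', 'O', 'draw', or null if the game is still in progress.
function winnerOf(state) {
  for (const [a, b, c] of LINES) {
    if (state[a] !== '-' && state[a] === state[b] && state[b] === state[c]) return state[a];
  }
  return state.includes('-') ? null : 'draw';
}

// Returns a new board string with `symbol` written at `index`.
function place(state, index, symbol) {
  return state.substring(0, index) + symbol + state.substring(index + 1);
}

// Returns the indices of all empty cells.
function emptyCells(state) {
  return [...state].flatMap((c, i) => (c === '-' ? [i] : []));
}

// Score from O's point of view: quicker wins score higher, slower losses score less badly.
function minimax(state, depth, isMaximizing) {
  const winner = winnerOf(state);
  if (winner === 'O') return 10 - depth;
  if (winner === 'X') return depth - 10;
  if (winner === 'draw') return 0;

  const scores = emptyCells(state).map(move =>
    minimax(place(state, move, isMaximizing ? 'O' : 'X'), depth + 1, !isMaximizing)
  );
  return isMaximizing ? Math.max(...scores) : Math.min(...scores);
}

// Tries every empty cell for O and keeps the one with the best minimax score.
function bestMoveForO(state) {
  let bestScore = -Infinity;
  let bestMove = null;
  for (const move of emptyCells(state)) {
    const score = minimax(place(state, move, 'O'), 0, false);
    if (score > bestScore) {
      bestScore = score;
      bestMove = move;
    }
  }
  return bestMove;
}

console.log(bestMoveForO('OO-XX----')); // 2: completes the top row for O
console.log(bestMoveForO('XX-O-----')); // 2: blocks X's top row
```

<p>In the first position O can also block X at cell 5, but the immediate win scores 10 and beats every other option. In the second, every move except the block lets X win on the next turn, so the block is the only move that avoids the worst score.</p>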
<h4 id="heading-5-helper-and-persistence-methods">5. Helper and Persistence Methods</h4>
<p>Finally, let's add methods for decaying epsilon, resetting the AI's memory, and saving the Q-table to (and loading it from) <code>localStorage</code>.</p>
<pre><code class="lang-javascript">  decay() {
    <span class="hljs-built_in">this</span>.epsilon = <span class="hljs-built_in">Math</span>.max(<span class="hljs-number">0.01</span>, <span class="hljs-built_in">this</span>.epsilon * <span class="hljs-number">0.995</span>);
  }

  reset() {
    <span class="hljs-built_in">this</span>.q.clear();
    <span class="hljs-built_in">this</span>.epsilon = <span class="hljs-number">0.1</span>;
  }

  save() {
    <span class="hljs-keyword">const</span> data = {
      <span class="hljs-attr">q</span>: <span class="hljs-built_in">Array</span>.from(<span class="hljs-built_in">this</span>.q.entries()),
      <span class="hljs-attr">lr</span>: <span class="hljs-built_in">this</span>.lr,
      <span class="hljs-attr">gamma</span>: <span class="hljs-built_in">this</span>.gamma,
      <span class="hljs-attr">epsilon</span>: <span class="hljs-built_in">this</span>.epsilon,
      <span class="hljs-attr">difficulty</span>: <span class="hljs-built_in">this</span>.difficulty
    };
    <span class="hljs-built_in">localStorage</span>.setItem(<span class="hljs-string">'tictactoe_ai'</span>, <span class="hljs-built_in">JSON</span>.stringify(data));
  }

  load() {
    <span class="hljs-keyword">const</span> saved = <span class="hljs-built_in">localStorage</span>.getItem(<span class="hljs-string">'tictactoe_ai'</span>);
    <span class="hljs-keyword">if</span> (!saved) <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;

    <span class="hljs-keyword">try</span> {
      <span class="hljs-keyword">const</span> data = <span class="hljs-built_in">JSON</span>.parse(saved);
      <span class="hljs-built_in">this</span>.q = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Map</span>(data.q);
      <span class="hljs-built_in">this</span>.lr = data.lr;
      <span class="hljs-built_in">this</span>.gamma = data.gamma;
      <span class="hljs-built_in">this</span>.epsilon = data.epsilon;
      <span class="hljs-built_in">this</span>.difficulty = data.difficulty || <span class="hljs-string">'intermediate'</span>;
      <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
    } <span class="hljs-keyword">catch</span> (e) {
      <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'Failed to load AI state:'</span>, e);
      <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
    }
  }

  clearStorage() {
    <span class="hljs-built_in">localStorage</span>.removeItem(<span class="hljs-string">'tictactoe_ai'</span>);
  }
}
</code></pre>
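<p>Two details of these helpers are easy to miss: why <code>save()</code> converts the <code>Map</code> to an array before serializing, and how quickly <code>decay()</code> reaches its floor. The sketch below illustrates both with made-up Q-values. It skips <code>localStorage</code> so it can run outside a browser.</p>

```javascript
// A Map passed straight to JSON.stringify serializes as "{}",
// so save() converts it to an array of [key, value] pairs first.
const qTable = new Map([
  ['---------', [0, 0, 0, 0, 0.5, 0, 0, 0, 0]],
  ['----O---X', [0.1, 0, 0, 0, 0, 0, 0, 0, 0]],
]);

const json = JSON.stringify({ q: Array.from(qTable.entries()), epsilon: 0.1 });
const restored = new Map(JSON.parse(json).q); // same shape load() rebuilds

console.log(JSON.stringify(qTable));       // {}
console.log(restored.get('---------')[4]); // 0.5

// Count how many decay() calls it takes for epsilon to fall from 0.1 to the 0.01 floor.
let epsilon = 0.1;
let steps = 0;
while (epsilon > 0.01) {
  epsilon = Math.max(0.01, epsilon * 0.995);
  steps++;
}
console.log(steps); // 460
```

<p>With a starting value of 0.1, exploration bottoms out after 460 calls to <code>decay()</code>. After that, the agent still makes a random move about 1% of the time.</p>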
<h3 id="heading-the-tictactoe-class-managing-the-game">The <code>TicTacToe</code> Class: Managing the Game</h3>
<p>Now that we have our AI "brain," we need to build the game around it. This class will handle rendering the board, processing user clicks, managing game flow, and asking the AI for a move on its turn.</p>
<h4 id="heading-1-constructor-and-control-initialization">1. Constructor and Control Initialization</h4>
<p>The constructor sets up the game's initial state, gets a reference to the HTML canvas, and wires up event listeners for user input.</p>
<pre><code class="lang-javascript"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">TicTacToe</span> </span>{
  <span class="hljs-keyword">constructor</span>() {
    <span class="hljs-built_in">this</span>.board = <span class="hljs-string">'---------'</span>;
    <span class="hljs-built_in">this</span>.ai = <span class="hljs-keyword">new</span> QLearning();
    <span class="hljs-built_in">this</span>.stats = { <span class="hljs-attr">played</span>: <span class="hljs-number">0</span>, <span class="hljs-attr">aiWins</span>: <span class="hljs-number">0</span>, <span class="hljs-attr">playerWins</span>: <span class="hljs-number">0</span>, <span class="hljs-attr">draws</span>: <span class="hljs-number">0</span> };
    <span class="hljs-built_in">this</span>.training = <span class="hljs-literal">false</span>;
    <span class="hljs-built_in">this</span>.gameOver = <span class="hljs-literal">false</span>;

    <span class="hljs-built_in">this</span>.canvas = <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'gameCanvas'</span>);
    <span class="hljs-built_in">this</span>.ctx = <span class="hljs-built_in">this</span>.canvas.getContext(<span class="hljs-string">'2d'</span>);
    <span class="hljs-built_in">this</span>.cellSize = <span class="hljs-number">133.33</span>;

    <span class="hljs-built_in">this</span>.canvas.onclick = <span class="hljs-function"><span class="hljs-params">e</span> =&gt;</span> <span class="hljs-built_in">this</span>.handleClick(e);
    <span class="hljs-built_in">this</span>.initControls();
    <span class="hljs-built_in">this</span>.loadState();
    <span class="hljs-built_in">this</span>.draw();
  }

  initControls() {
    [<span class="hljs-string">'learningRate'</span>, <span class="hljs-string">'discountFactor'</span>, <span class="hljs-string">'explorationRate'</span>].forEach(<span class="hljs-function"><span class="hljs-params">id</span> =&gt;</span> {
      <span class="hljs-keyword">const</span> el = <span class="hljs-built_in">document</span>.getElementById(id);
      el.oninput = <span class="hljs-function"><span class="hljs-params">e</span> =&gt;</span> {
        <span class="hljs-keyword">const</span> val = <span class="hljs-built_in">parseFloat</span>(e.target.value);
        <span class="hljs-built_in">document</span>.getElementById(id + <span class="hljs-string">'Value'</span>).textContent = val.toFixed(<span class="hljs-number">2</span>);
        <span class="hljs-keyword">if</span> (id === <span class="hljs-string">'learningRate'</span>) <span class="hljs-built_in">this</span>.ai.lr = val;
        <span class="hljs-keyword">if</span> (id === <span class="hljs-string">'discountFactor'</span>) <span class="hljs-built_in">this</span>.ai.gamma = val;
        <span class="hljs-keyword">if</span> (id === <span class="hljs-string">'explorationRate'</span>) <span class="hljs-built_in">this</span>.ai.epsilon = val;
        <span class="hljs-built_in">this</span>.saveState();
      };
    });
  }
</code></pre>
<p><code>initControls</code> connects our HTML sliders to the AI's parameters, allowing for real-time adjustments.</p>
<h4 id="heading-2-difficulty-and-ui-methods">2. Difficulty and UI Methods</h4>
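<p>The <code>setDifficulty</code> method stores the chosen level on the AI, restyles the three difficulty buttons so the active one is highlighted, shows a matching status message, and saves the new setting.</p>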
<pre><code class="lang-javascript">  setDifficulty(level) {
    <span class="hljs-built_in">this</span>.ai.difficulty = level;

    <span class="hljs-comment">// Update button styles</span>
    [<span class="hljs-string">'beginner'</span>, <span class="hljs-string">'intermediate'</span>, <span class="hljs-string">'expert'</span>].forEach(<span class="hljs-function"><span class="hljs-params">diff</span> =&gt;</span> {
      <span class="hljs-keyword">const</span> btn = <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">`diff<span class="hljs-subst">${diff.charAt(<span class="hljs-number">0</span>).toUpperCase() + diff.slice(<span class="hljs-number">1</span>)}</span>`</span>);
      <span class="hljs-keyword">if</span> (diff === level) {
        btn.className = <span class="hljs-string">'py-2 px-4 rounded-lg font-semibold text-sm transition-all bg-purple-600 text-white border-2 border-purple-600'</span>;
      } <span class="hljs-keyword">else</span> {
        btn.className = <span class="hljs-string">'py-2 px-4 rounded-lg font-semibold text-sm transition-all bg-white text-gray-700 hover:bg-gray-100'</span>;
      }
    });

    <span class="hljs-keyword">if</span> (level === <span class="hljs-string">'beginner'</span>) <span class="hljs-built_in">this</span>.setStatus(<span class="hljs-string">'🌱 Beginner mode: AI makes more mistakes'</span>);
    <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (level === <span class="hljs-string">'intermediate'</span>) <span class="hljs-built_in">this</span>.setStatus(<span class="hljs-string">'🎯 Medium mode: Balanced AI using Q-learning'</span>);
    <span class="hljs-keyword">else</span> <span class="hljs-built_in">this</span>.setStatus(<span class="hljs-string">'🔥 Expert mode: Perfect AI using minimax algorithm'</span>);

    <span class="hljs-built_in">this</span>.saveState();
  }
</code></pre>
<h4 id="heading-3-drawing-and-rendering">3. Drawing and Rendering</h4>
<p>These methods use the HTML5 Canvas API to visually represent the game state.</p>
<pre><code class="lang-javascript">  draw() {
    <span class="hljs-keyword">const</span> { ctx, canvas, cellSize } = <span class="hljs-built_in">this</span>;
    ctx.fillStyle = <span class="hljs-string">'#fff'</span>;
    ctx.fillRect(<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, canvas.width, canvas.height);

    ctx.strokeStyle = <span class="hljs-string">'#8b5cf6'</span>;
    ctx.lineWidth = <span class="hljs-number">4</span>;
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">let</span> i = <span class="hljs-number">1</span>; i &lt; <span class="hljs-number">3</span>; i++) {
      ctx.beginPath();
      ctx.moveTo(i * cellSize, <span class="hljs-number">0</span>);
      ctx.lineTo(i * cellSize, canvas.height);
      ctx.stroke();
      ctx.beginPath();
      ctx.moveTo(<span class="hljs-number">0</span>, i * cellSize);
      ctx.lineTo(canvas.width, i * cellSize);
      ctx.stroke();
    }

    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">let</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">9</span>; i++) {
      <span class="hljs-keyword">const</span> symbol = <span class="hljs-built_in">this</span>.board[i];
      <span class="hljs-keyword">if</span> (symbol === <span class="hljs-string">'-'</span>) <span class="hljs-keyword">continue</span>;

      <span class="hljs-keyword">const</span> x = (i % <span class="hljs-number">3</span>) * cellSize + cellSize / <span class="hljs-number">2</span>;
      <span class="hljs-keyword">const</span> y = ~~(i / <span class="hljs-number">3</span>) * cellSize + cellSize / <span class="hljs-number">2</span>;

      ctx.strokeStyle = symbol === <span class="hljs-string">'X'</span> ? <span class="hljs-string">'#ef4444'</span> : <span class="hljs-string">'#10b981'</span>;
      ctx.lineWidth = <span class="hljs-number">8</span>;
      ctx.lineCap = <span class="hljs-string">'round'</span>;

      <span class="hljs-keyword">if</span> (symbol === <span class="hljs-string">'X'</span>) {
        <span class="hljs-keyword">const</span> s = cellSize * <span class="hljs-number">0.3</span>;
        ctx.beginPath();
        ctx.moveTo(x - s, y - s);
        ctx.lineTo(x + s, y + s);
        ctx.stroke();
        ctx.beginPath();
        ctx.moveTo(x + s, y - s);
        ctx.lineTo(x - s, y + s);
        ctx.stroke();
      } <span class="hljs-keyword">else</span> {
        ctx.beginPath();
        ctx.arc(x, y, cellSize * <span class="hljs-number">0.3</span>, <span class="hljs-number">0</span>, <span class="hljs-built_in">Math</span>.PI * <span class="hljs-number">2</span>);
        ctx.stroke();
      }
    }

    <span class="hljs-keyword">const</span> winner = <span class="hljs-built_in">this</span>.checkWinner();
    <span class="hljs-keyword">if</span> (winner?.line) <span class="hljs-built_in">this</span>.drawWinLine(winner.line);
  }

  drawWinLine(line) {
    <span class="hljs-keyword">const</span> [a, , c] = line;
    <span class="hljs-keyword">const</span> startX = (a % <span class="hljs-number">3</span>) * <span class="hljs-built_in">this</span>.cellSize + <span class="hljs-built_in">this</span>.cellSize / <span class="hljs-number">2</span>;
    <span class="hljs-keyword">const</span> startY = ~~(a / <span class="hljs-number">3</span>) * <span class="hljs-built_in">this</span>.cellSize + <span class="hljs-built_in">this</span>.cellSize / <span class="hljs-number">2</span>;
    <span class="hljs-keyword">const</span> endX = (c % <span class="hljs-number">3</span>) * <span class="hljs-built_in">this</span>.cellSize + <span class="hljs-built_in">this</span>.cellSize / <span class="hljs-number">2</span>;
    <span class="hljs-keyword">const</span> endY = ~~(c / <span class="hljs-number">3</span>) * <span class="hljs-built_in">this</span>.cellSize + <span class="hljs-built_in">this</span>.cellSize / <span class="hljs-number">2</span>;

    <span class="hljs-built_in">this</span>.ctx.strokeStyle = <span class="hljs-string">'#fbbf24'</span>;
    <span class="hljs-built_in">this</span>.ctx.lineWidth = <span class="hljs-number">6</span>;
    <span class="hljs-built_in">this</span>.ctx.beginPath();
    <span class="hljs-built_in">this</span>.ctx.moveTo(startX, startY);
    <span class="hljs-built_in">this</span>.ctx.lineTo(endX, endY);
    <span class="hljs-built_in">this</span>.ctx.stroke();
  }
</code></pre>
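<p>The index arithmetic shared by <code>draw</code> and <code>drawWinLine</code> can be isolated into a small pure function, which makes it easier to check by hand. The sketch below is illustrative only: <code>cellCenter</code> is a hypothetical helper name that does not appear in the article's <code>game.js</code>.</p>

```javascript
// Hypothetical helper (not part of the article's game.js): maps a board
// index 0..8 to the pixel coordinates of the centre of its cell.
function cellCenter(index, cellSize) {
  const col = index % 3;             // column 0..2
  const row = Math.floor(index / 3); // row 0..2, same result as ~~(index / 3) here
  return {
    x: col * cellSize + cellSize / 2,
    y: row * cellSize + cellSize / 2,
  };
}

// Example: on a 300px board each cell is 100px wide.
const center = cellCenter(5, 100); // index 5 is row 1, column 2
console.log(center); // { x: 250, y: 150 }
```

<p>The same mapping, run in reverse, is what <code>handleClick</code> relies on when it turns a click position into a row and column.</p>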
<h4 id="heading-4-player-interaction-and-the-game-loop">4. Player Interaction and the Game Loop</h4>
<p>This is the core interactive logic. <code>handleClick</code> translates a click into a board position, <code>move</code> updates the state, and <code>aiMove</code> gets an action from the <code>QLearning</code> class and executes it.</p>
<pre><code class="lang-javascript">  handleClick(e) {
    <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.gameOver || <span class="hljs-built_in">this</span>.training) <span class="hljs-keyword">return</span>;

    <span class="hljs-keyword">const</span> rect = <span class="hljs-built_in">this</span>.canvas.getBoundingClientRect();
    <span class="hljs-keyword">const</span> col = ~~((e.clientX - rect.left) / <span class="hljs-built_in">this</span>.cellSize);
    <span class="hljs-keyword">const</span> row = ~~((e.clientY - rect.top) / <span class="hljs-built_in">this</span>.cellSize);
    <span class="hljs-keyword">const</span> idx = row * <span class="hljs-number">3</span> + col;

    <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.board[idx] === <span class="hljs-string">'-'</span>) {
      <span class="hljs-built_in">this</span>.move(idx, <span class="hljs-string">'X'</span>);
      <span class="hljs-keyword">if</span> (!<span class="hljs-built_in">this</span>.gameOver) <span class="hljs-built_in">setTimeout</span>(<span class="hljs-function">() =&gt;</span> <span class="hljs-built_in">this</span>.aiMove(), <span class="hljs-number">300</span>);
    }
  }

  move(idx, player) {
    <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.board[idx] !== <span class="hljs-string">'-'</span> || <span class="hljs-built_in">this</span>.gameOver) <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
    <span class="hljs-built_in">this</span>.board = <span class="hljs-built_in">this</span>.board.substring(<span class="hljs-number">0</span>, idx) + player + <span class="hljs-built_in">this</span>.board.substring(idx + <span class="hljs-number">1</span>);
    <span class="hljs-built_in">this</span>.draw();
    <span class="hljs-built_in">this</span>.checkGameOver();
    <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
  }

  aiMove() {
    <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.gameOver) <span class="hljs-keyword">return</span>;

    <span class="hljs-keyword">const</span> state = <span class="hljs-built_in">this</span>.board;
    <span class="hljs-keyword">const</span> available = <span class="hljs-built_in">this</span>.getAvailable();
    <span class="hljs-keyword">const</span> action = <span class="hljs-built_in">this</span>.ai.getAction(state, available);

    <span class="hljs-built_in">this</span>.move(action, <span class="hljs-string">'O'</span>);

    <span class="hljs-keyword">const</span> winner = <span class="hljs-built_in">this</span>.checkWinner();
    <span class="hljs-keyword">const</span> reward = winner?.winner === <span class="hljs-string">'O'</span> ? <span class="hljs-number">1</span> : winner?.winner === <span class="hljs-string">'X'</span> ? <span class="hljs-number">-1</span> : <span class="hljs-number">0</span>;
    <span class="hljs-built_in">this</span>.ai.update(state, action, reward, <span class="hljs-built_in">this</span>.board, <span class="hljs-built_in">this</span>.getAvailable());
  }
</code></pre>
<p>After the AI moves, it immediately calls <code>this.ai.update()</code> to learn from the result of its action.</p>
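<p>To make that learning step concrete, here is a minimal standalone sketch of the standard tabular Q-learning update. It assumes the Q-table is a <code>Map</code> from board string to an array of nine Q-values, which is inferred from how this section uses <code>this.ai.q</code>. The function name <code>qUpdate</code> and its parameter list are hypothetical, and the article's own <code>update</code> method may differ in its details.</p>

```javascript
// Standalone sketch of the tabular Q-learning update. The table layout
// (a Map from board string to nine Q-values) is an assumption, and
// qUpdate is a hypothetical name used only for this illustration.
function qUpdate(q, state, action, reward, nextState, nextAvailable, lr, gamma) {
  if (!q.has(state)) q.set(state, new Array(9).fill(0));
  const nextValues = q.get(nextState) ?? new Array(9).fill(0);

  // Best value reachable from the next state; 0 when no moves remain.
  const maxNext = nextAvailable.length > 0
    ? Math.max(...nextAvailable.map(a => nextValues[a]))
    : 0;

  const row = q.get(state);
  // Q(s, a) += lr * (reward + gamma * maxNext - Q(s, a))
  row[action] += lr * (reward + gamma * maxNext - row[action]);
  return row[action];
}

const q = new Map();
// O completes the middle row (reward 1) starting from an empty table, lr = 0.1:
const updated = qUpdate(q, 'XX-OO-X--', 5, 1, 'XX-OOOX--', [], 0.1, 0.9);
console.log(updated); // 0.1
```

<p>With a learning rate of 0.1, a single win only moves the estimate a tenth of the way toward the reward, which is why the AI needs many games before its play visibly improves.</p>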
<h4 id="heading-5-the-rules-engine">5. The Rules Engine</h4>
<p>These helpers determine the game's state: available moves, winner, and game over conditions.</p>
<pre><code class="lang-javascript">  getAvailable() {
    <span class="hljs-keyword">return</span> [...this.board].map(<span class="hljs-function">(<span class="hljs-params">c, i</span>) =&gt;</span> c === <span class="hljs-string">'-'</span> ? i : <span class="hljs-literal">null</span>).filter(<span class="hljs-function"><span class="hljs-params">x</span> =&gt;</span> x !== <span class="hljs-literal">null</span>);
  }

  checkWinner() {
    <span class="hljs-keyword">const</span> patterns = [[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>,<span class="hljs-number">2</span>],[<span class="hljs-number">3</span>,<span class="hljs-number">4</span>,<span class="hljs-number">5</span>],[<span class="hljs-number">6</span>,<span class="hljs-number">7</span>,<span class="hljs-number">8</span>],[<span class="hljs-number">0</span>,<span class="hljs-number">3</span>,<span class="hljs-number">6</span>],[<span class="hljs-number">1</span>,<span class="hljs-number">4</span>,<span class="hljs-number">7</span>],[<span class="hljs-number">2</span>,<span class="hljs-number">5</span>,<span class="hljs-number">8</span>],[<span class="hljs-number">0</span>,<span class="hljs-number">4</span>,<span class="hljs-number">8</span>],[<span class="hljs-number">2</span>,<span class="hljs-number">4</span>,<span class="hljs-number">6</span>]];
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> p <span class="hljs-keyword">of</span> patterns) {
      <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.board[p[<span class="hljs-number">0</span>]] !== <span class="hljs-string">'-'</span> &amp;&amp; 
          <span class="hljs-built_in">this</span>.board[p[<span class="hljs-number">0</span>]] === <span class="hljs-built_in">this</span>.board[p[<span class="hljs-number">1</span>]] &amp;&amp; 
          <span class="hljs-built_in">this</span>.board[p[<span class="hljs-number">1</span>]] === <span class="hljs-built_in">this</span>.board[p[<span class="hljs-number">2</span>]]) {
        <span class="hljs-keyword">return</span> { <span class="hljs-attr">winner</span>: <span class="hljs-built_in">this</span>.board[p[<span class="hljs-number">0</span>]], <span class="hljs-attr">line</span>: p };
      }
    }
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.board.includes(<span class="hljs-string">'-'</span>) ? <span class="hljs-literal">null</span> : { <span class="hljs-attr">winner</span>: <span class="hljs-string">'draw'</span>, <span class="hljs-attr">line</span>: <span class="hljs-literal">null</span> };
  }

  checkGameOver() {
    <span class="hljs-keyword">const</span> result = <span class="hljs-built_in">this</span>.checkWinner();
    <span class="hljs-keyword">if</span> (!result) <span class="hljs-keyword">return</span>;

    <span class="hljs-built_in">this</span>.gameOver = <span class="hljs-literal">true</span>;
    <span class="hljs-built_in">this</span>.stats.played++;

    <span class="hljs-keyword">if</span> (result.winner === <span class="hljs-string">'X'</span>) {
      <span class="hljs-built_in">this</span>.stats.playerWins++;
      <span class="hljs-keyword">if</span> (!<span class="hljs-built_in">this</span>.training) <span class="hljs-built_in">this</span>.setStatus(<span class="hljs-string">'🎉 You win!'</span>);
    } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (result.winner === <span class="hljs-string">'O'</span>) {
      <span class="hljs-built_in">this</span>.stats.aiWins++;
      <span class="hljs-keyword">if</span> (!<span class="hljs-built_in">this</span>.training) <span class="hljs-built_in">this</span>.setStatus(<span class="hljs-string">'🤖 AI wins!'</span>);
    } <span class="hljs-keyword">else</span> {
      <span class="hljs-built_in">this</span>.stats.draws++;
      <span class="hljs-keyword">if</span> (!<span class="hljs-built_in">this</span>.training) <span class="hljs-built_in">this</span>.setStatus(<span class="hljs-string">'🤝 Draw!'</span>);
    }

    <span class="hljs-keyword">if</span> (!<span class="hljs-built_in">this</span>.training) {
      <span class="hljs-built_in">this</span>.updateStats();
      <span class="hljs-built_in">this</span>.saveState();
    }
  }
</code></pre>
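<p>Because the board is just a nine-character string, the rules can also be exercised outside the class. The following sketch restates the same logic as pure functions so that the three possible results (a win, a draw, or a game still in progress) are easy to see. The names <code>findWinner</code> and <code>availableMoves</code> are hypothetical and are used only for this illustration.</p>

```javascript
// Pure-function restatement of the rules, for illustration only.
const WIN_LINES = [
  [0, 1, 2], [3, 4, 5], [6, 7, 8], // rows
  [0, 3, 6], [1, 4, 7], [2, 5, 8], // columns
  [0, 4, 8], [2, 4, 6],            // diagonals
];

function findWinner(board) {
  for (const line of WIN_LINES) {
    const marks = line.map(i => board[i]).join('');
    if (marks === 'XXX' || marks === 'OOO') {
      return { winner: marks[0], line };
    }
  }
  // No line completed: a full board is a draw, otherwise play continues.
  return board.includes('-') ? null : { winner: 'draw', line: null };
}

function availableMoves(board) {
  return [...board].flatMap((cell, i) => (cell === '-' ? [i] : []));
}

console.log(findWinner('XXXOO----')); // { winner: 'X', line: [ 0, 1, 2 ] }
console.log(findWinner('XOXXOOOXX')); // { winner: 'draw', line: null }
console.log(findWinner('X-O------')); // null
console.log(availableMoves('X-O------')); // [ 1, 3, 4, 5, 6, 7, 8 ]
```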
<h4 id="heading-6-ui-and-statistics-updates">6. UI and Statistics Updates</h4>
<p>These methods connect the internal game state to the HTML elements, displaying status messages and statistics.</p>
<pre><code class="lang-javascript">  setStatus(msg) {
    <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'gameStatus'</span>).textContent = msg;
  }

  updateStats() {
    <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'gamesPlayed'</span>).textContent = <span class="hljs-built_in">this</span>.stats.played;
    <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'aiWins'</span>).textContent = <span class="hljs-built_in">this</span>.stats.aiWins;
    <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'playerWins'</span>).textContent = <span class="hljs-built_in">this</span>.stats.playerWins;
    <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'draws'</span>).textContent = <span class="hljs-built_in">this</span>.stats.draws;
    <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'statesLearned'</span>).textContent = <span class="hljs-built_in">this</span>.ai.q.size;

    <span class="hljs-keyword">const</span> winRate = <span class="hljs-built_in">this</span>.stats.played ? (<span class="hljs-built_in">this</span>.stats.aiWins / <span class="hljs-built_in">this</span>.stats.played * <span class="hljs-number">100</span>).toFixed(<span class="hljs-number">1</span>) : <span class="hljs-number">0</span>;
    <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'winRate'</span>).textContent = <span class="hljs-string">`<span class="hljs-subst">${winRate}</span>%`</span>;
  }
</code></pre>
<h4 id="heading-7-game-and-ai-management">7. Game and AI Management</h4>
<p>These methods are wired to the control buttons for resetting the game or the AI's memory.</p>
<pre><code class="lang-javascript">  reset() {
    <span class="hljs-built_in">this</span>.board = <span class="hljs-string">'---------'</span>;
    <span class="hljs-built_in">this</span>.gameOver = <span class="hljs-literal">false</span>;
    <span class="hljs-built_in">this</span>.draw();
    <span class="hljs-built_in">this</span>.setStatus(<span class="hljs-string">'Your turn! (X)'</span>);
  }

  resetAI() {
    <span class="hljs-keyword">if</span> (confirm(<span class="hljs-string">'Reset AI memory? All progress will be lost.'</span>)) {
      <span class="hljs-built_in">this</span>.ai.reset();
      <span class="hljs-built_in">this</span>.ai.clearStorage();
      <span class="hljs-built_in">this</span>.stats = { <span class="hljs-attr">played</span>: <span class="hljs-number">0</span>, <span class="hljs-attr">aiWins</span>: <span class="hljs-number">0</span>, <span class="hljs-attr">playerWins</span>: <span class="hljs-number">0</span>, <span class="hljs-attr">draws</span>: <span class="hljs-number">0</span> };
      <span class="hljs-built_in">this</span>.updateStats();
      <span class="hljs-built_in">this</span>.reset();
      <span class="hljs-built_in">this</span>.setStatus(<span class="hljs-string">'AI memory reset!'</span>);
      <span class="hljs-built_in">localStorage</span>.removeItem(<span class="hljs-string">'tictactoe_stats'</span>);
    }
  }
</code></pre>
<h4 id="heading-8-the-self-play-training-loop">8. The Self-Play Training Loop</h4>
<p>This is the logic for the "Train AI" button, allowing the AI to learn rapidly by playing against itself.</p>
<pre><code class="lang-javascript">  <span class="hljs-keyword">async</span> startTraining() {
    <span class="hljs-built_in">this</span>.training = <span class="hljs-literal">true</span>;
    <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'trainingIndicator'</span>).classList.remove(<span class="hljs-string">'hidden'</span>);

    <span class="hljs-keyword">const</span> originalEpsilon = <span class="hljs-built_in">this</span>.ai.epsilon;
    <span class="hljs-built_in">this</span>.ai.epsilon = <span class="hljs-number">0.3</span>; <span class="hljs-comment">// Higher exploration during training</span>

    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">let</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">1000</span>; i++) {
      <span class="hljs-keyword">await</span> <span class="hljs-built_in">this</span>.trainGame();
      <span class="hljs-built_in">this</span>.ai.decay();
      <span class="hljs-keyword">if</span> (i % <span class="hljs-number">50</span> === <span class="hljs-number">0</span>) {
        <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'trainingProgress'</span>).textContent = <span class="hljs-string">`<span class="hljs-subst">${i + <span class="hljs-number">1</span>}</span>/1000`</span>;
        <span class="hljs-keyword">await</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Promise</span>(<span class="hljs-function"><span class="hljs-params">r</span> =&gt;</span> <span class="hljs-built_in">setTimeout</span>(r, <span class="hljs-number">0</span>)); <span class="hljs-comment">// Allow UI to update</span>
      }
    }

    <span class="hljs-built_in">this</span>.ai.epsilon = originalEpsilon;
    <span class="hljs-built_in">this</span>.training = <span class="hljs-literal">false</span>;
    <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'trainingIndicator'</span>).classList.add(<span class="hljs-string">'hidden'</span>);
    <span class="hljs-built_in">this</span>.updateStats();
    <span class="hljs-built_in">this</span>.reset();
    <span class="hljs-built_in">this</span>.setStatus(<span class="hljs-string">'Training complete!'</span>);
    <span class="hljs-built_in">this</span>.saveState();
  }

  <span class="hljs-keyword">async</span> trainGame() {
    <span class="hljs-built_in">this</span>.board = <span class="hljs-string">'---------'</span>;
    <span class="hljs-built_in">this</span>.gameOver = <span class="hljs-literal">false</span>;
    <span class="hljs-keyword">const</span> moves = [];

    <span class="hljs-keyword">while</span> (!<span class="hljs-built_in">this</span>.gameOver &amp;&amp; <span class="hljs-built_in">this</span>.getAvailable().length &gt; <span class="hljs-number">0</span>) {
      <span class="hljs-keyword">const</span> state = <span class="hljs-built_in">this</span>.board;
      <span class="hljs-keyword">const</span> available = <span class="hljs-built_in">this</span>.getAvailable();
      <span class="hljs-comment">// Alternate players (X and O) are both the AI</span>
      <span class="hljs-keyword">const</span> player = moves.length % <span class="hljs-number">2</span> === <span class="hljs-number">0</span> ? <span class="hljs-string">'X'</span> : <span class="hljs-string">'O'</span>; 
      <span class="hljs-keyword">const</span> action = <span class="hljs-built_in">this</span>.ai.getAction(state, available);

      moves.push({ state, action, player });
      <span class="hljs-built_in">this</span>.move(action, player);
    }

    <span class="hljs-keyword">const</span> winner = <span class="hljs-built_in">this</span>.checkWinner();
    <span class="hljs-comment">// Assign rewards after the game is over</span>
    moves.forEach(<span class="hljs-function"><span class="hljs-params">m</span> =&gt;</span> {
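      <span class="hljs-comment">// Win = +1, loss = -1, draw = 0 (matching the rewards used in aiMove)</span>
      <span class="hljs-keyword">const</span> reward = winner?.winner === m.player ? <span class="hljs-number">1</span> : (winner?.winner === <span class="hljs-string">'X'</span> || winner?.winner === <span class="hljs-string">'O'</span>) ? <span class="hljs-number">-1</span> : <span class="hljs-number">0</span>;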
      <span class="hljs-built_in">this</span>.ai.update(m.state, m.action, reward, <span class="hljs-built_in">this</span>.board, []);
    });
  }
</code></pre>
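<p>The reward assignment at the end of a self-play game is easier to reason about as a pure function. The sketch below assumes the convention of +1 for a win, -1 for a loss, and 0 for a draw. <code>assignRewards</code> is a hypothetical name, and <code>result</code> stands for the <code>winner</code> field returned by <code>checkWinner</code>.</p>

```javascript
// Sketch only: computes one reward per recorded move once the game is over.
// Assumed convention: win = +1, loss = -1, draw = 0.
function assignRewards(moves, result) {
  return moves.map(m => {
    if (result === m.player) return 1; // this move's player won
    if (result === 'draw') return 0;   // nobody won
    return -1;                         // the other player won
  });
}

const moves = [
  { player: 'X', action: 4 },
  { player: 'O', action: 0 },
  { player: 'X', action: 8 },
];

console.log(assignRewards(moves, 'X'));    // [ 1, -1, 1 ]
console.log(assignRewards(moves, 'draw')); // [ 0, 0, 0 ]
```

<p>Since both sides share one Q-table during self-play, every game produces training signal for winning and losing moves alike.</p>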
<h4 id="heading-9-state-persistence">9. State Persistence</h4>
<p>These methods orchestrate saving and loading the game state and AI's memory to <code>localStorage</code>.</p>
<pre><code class="lang-javascript">  saveState() {
    <span class="hljs-built_in">this</span>.ai.save();
    <span class="hljs-built_in">localStorage</span>.setItem(<span class="hljs-string">'tictactoe_stats'</span>, <span class="hljs-built_in">JSON</span>.stringify(<span class="hljs-built_in">this</span>.stats));
  }

  loadState() {
    <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.ai.load()) {
      <span class="hljs-keyword">const</span> savedStats = <span class="hljs-built_in">localStorage</span>.getItem(<span class="hljs-string">'tictactoe_stats'</span>);
      <span class="hljs-keyword">if</span> (savedStats) {
        <span class="hljs-built_in">this</span>.stats = <span class="hljs-built_in">JSON</span>.parse(savedStats);
      }
      <span class="hljs-built_in">this</span>.updateStats();
      <span class="hljs-built_in">this</span>.setDifficulty(<span class="hljs-built_in">this</span>.ai.difficulty);

      <span class="hljs-comment">// Update sliders to reflect loaded AI state</span>
      <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'learningRate'</span>).value = <span class="hljs-built_in">this</span>.ai.lr;
      <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'learningRateValue'</span>).textContent = <span class="hljs-built_in">this</span>.ai.lr.toFixed(<span class="hljs-number">2</span>);
      <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'discountFactor'</span>).value = <span class="hljs-built_in">this</span>.ai.gamma;
      <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'discountFactorValue'</span>).textContent = <span class="hljs-built_in">this</span>.ai.gamma.toFixed(<span class="hljs-number">2</span>);
      <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'explorationRate'</span>).value = <span class="hljs-built_in">this</span>.ai.epsilon;
      <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">'explorationRateValue'</span>).textContent = <span class="hljs-built_in">this</span>.ai.epsilon.toFixed(<span class="hljs-number">2</span>);

      <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'✓ Loaded AI state from localStorage'</span>);
    }
  }
}
</code></pre>
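<p>One detail worth knowing when persisting a Q-table: a <code>Map</code> cannot be handed to <code>JSON.stringify</code> directly, because it serializes as an empty object. The sketch below shows one common way to round-trip a <code>Map</code> through a string store. It is not necessarily how the article's <code>save</code> and <code>load</code> methods work. The function names and the storage key are hypothetical, and an in-memory object stands in for <code>localStorage</code> so the example also runs outside a browser.</p>

```javascript
// Sketch only: assumes the Q-table is a Map from board string to an array
// of Q-values. serializeQTable and deserializeQTable are hypothetical names.
function serializeQTable(q) {
  // JSON.stringify(new Map()) yields "{}", so store the entries as an array.
  return JSON.stringify([...q.entries()]);
}

function deserializeQTable(json) {
  return new Map(JSON.parse(json));
}

// Minimal in-memory stand-in for localStorage (same method names).
function createMemoryStorage() {
  const data = new Map();
  return {
    setItem: (key, value) => { data.set(key, String(value)); },
    getItem: (key) => (data.has(key) ? data.get(key) : null),
    removeItem: (key) => { data.delete(key); },
  };
}

const storage = createMemoryStorage();
const table = new Map([['X---O----', [0, 0.5, 0, 0, 0, 0, 0, 0, 0.25]]]);
storage.setItem('demo_q_table', serializeQTable(table)); // hypothetical key

const restored = deserializeQTable(storage.getItem('demo_q_table'));
console.log(restored.get('X---O----')[1]); // 0.5
```

<p>In the browser, passing the real <code>localStorage</code> object in place of the stand-in should behave the same way, since it exposes the same three methods.</p>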
<h4 id="heading-10-initializing-the-game">10. Initializing the Game</h4>
<p>Finally, add this snippet at the end of <code>game.js</code> to create an instance of the game once the HTML document is loaded.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">let</span> game;
<span class="hljs-built_in">window</span>.addEventListener(<span class="hljs-string">'DOMContentLoaded'</span>, <span class="hljs-function">() =&gt;</span> {
  game = <span class="hljs-keyword">new</span> TicTacToe();
});
</code></pre>
<p>This completes our implementation! You now have a fully functional <code>game.js</code> file. If you encountered any issues or want to double-check your work, you can compare your code against the complete source file available on GitHub: <a target="_blank" href="https://github.com/mayur9210/tic-tac-toe-ai/blob/main/game.js">https://github.com/mayur9210/tic-tac-toe-ai/blob/main/game.js</a>.</p>
<h2 id="heading-how-to-understand-the-enhanced-features">How to Understand the Enhanced Features</h2>
<p>Beyond the core Q-learning logic, this implementation includes several enhanced features to create a complete, user-friendly, and educational application. Let's explore what they are and how they work.</p>
<h3 id="heading-1-adaptive-difficulty-levels">1. Adaptive Difficulty Levels</h3>
<p>The game supports three distinct difficulty modes to cater to different players:</p>
<ul>
<li><p><strong>Beginner (🌱):</strong> This mode is designed for new players. The AI makes random moves 70% of the time, providing a high chance for the player to win and learn the game's rules.</p>
</li>
<li><p><strong>Intermediate (🎯):</strong> This is the standard mode where the AI uses the Q-learning algorithm with an epsilon-greedy strategy. It presents a challenging but fair opponent that improves over time.</p>
</li>
<li><p><strong>Expert (🔥):</strong> This mode switches from reinforcement learning to the classic <strong>minimax algorithm</strong>. This algorithm plays a perfect game, meaning it is impossible to beat (the best a player can achieve is a draw). This serves as a benchmark for optimal play.</p>
</li>
</ul>
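<p>To illustrate why the Expert mode cannot be beaten, here is a compact, self-contained minimax sketch for the same nine-character board encoding. It is a plain version without pruning or depth preference, written for this illustration. All function names are hypothetical, and the article's own minimax code may be organized differently.</p>

```javascript
// Sketch of plain minimax for tic-tac-toe; O is the maximizing player.
const LINES = [
  [0, 1, 2], [3, 4, 5], [6, 7, 8],
  [0, 3, 6], [1, 4, 7], [2, 5, 8],
  [0, 4, 8], [2, 4, 6],
];

function winnerOf(board) {
  for (const line of LINES) {
    const marks = line.map(i => board[i]).join('');
    if (marks === 'XXX') return 'X';
    if (marks === 'OOO') return 'O';
  }
  return board.includes('-') ? null : 'draw';
}

function place(board, idx, player) {
  return board.slice(0, idx) + player + board.slice(idx + 1);
}

// Score from O's point of view: +1 if O wins, -1 if X wins, 0 for a draw.
function minimax(board, player) {
  const result = winnerOf(board);
  if (result === 'O') return 1;
  if (result === 'X') return -1;
  if (result === 'draw') return 0;

  const next = player === 'O' ? 'X' : 'O';
  const scores = [...board].flatMap((cell, i) =>
    cell === '-' ? [minimax(place(board, i, player), next)] : []);
  return player === 'O' ? Math.max(...scores) : Math.min(...scores);
}

// Returns the first empty cell with the best minimax score for O.
function bestMoveForO(board) {
  let best = null;
  let bestScore = -Infinity;
  [...board].forEach((cell, i) => {
    if (cell !== '-') return;
    const score = minimax(place(board, i, 'O'), 'X');
    if (score > bestScore) {
      bestScore = score;
      best = i;
    }
  });
  return best;
}

console.log(bestMoveForO('XX--O----')); // 2: O must block the top row
console.log(bestMoveForO('-OOXX---X')); // 0: O completes the top row
```

<p>Because the search explores every continuation to the end of the game, O never picks a move that allows a forced loss. The cost is that the full game tree is revisited on every call, which is acceptable for a 3×3 board but would not scale to larger games.</p>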
<h3 id="heading-2-other-enhanced-features">2. Other Enhanced Features</h3>
<p>In addition to the difficulty levels, the application includes:</p>
<ul>
<li><p><strong>Real-time AI parameter tuning:</strong> The sliders in the UI allow you to adjust the Learning Rate (α), Discount Factor (γ), and Exploration Rate (ε) on the fly. This lets you directly observe how different hyperparameters affect the AI's learning speed and performance.</p>
</li>
<li><p><strong>Persistence with localStorage:</strong> The AI automatically saves its Q-table and your game statistics to the browser's local storage. When you close the tab and come back later, the AI will remember everything it has learned.</p>
</li>
<li><p><strong>Dedicated self-play training mode:</strong> The "Train AI" button allows the AI to play 1,000 games against itself in a matter of seconds. This rapidly populates the Q-table and is far more efficient than learning from just human-played games.</p>
</li>
</ul>
<h2 id="heading-putting-it-all-together-a-guided-test-run">Putting It All Together: A Guided Test Run</h2>
<p>Once you have the HTML (<code>index.html</code>) and JavaScript (<code>game.js</code>) files in the same directory, open the HTML file in a web browser to test all the features. The page should look like the image below.</p>
<p>I have also <a target="_blank" href="https://mayur9210.github.io/tic-tac-toe-ai/">hosted this file on GitHub Pages</a> if you want to see how it works.</p>
<p>Now that you have the application running, let's walk through how to test the features and witness the AI's learning process firsthand. This interactive testing is the most rewarding part, as you'll see the abstract concepts come to life.</p>
<h3 id="heading-step-1-challenge-the-untrained-ai">Step 1: Challenge the Untrained AI</h3>
<p>When you first load the game, the AI is a blank slate: its Q-table is empty. Make sure the difficulty is set to <strong>🌱 Beginner</strong> and play a game against it. You'll likely find it very easy to beat, because it has no experience and makes random, nonsensical moves. Notice that the "States Learned" count in the statistics panel is very low.</p>
<h3 id="heading-step-2-train-the-ai">Step 2: Train the AI</h3>
<p>Now for the magic. Click the <strong>"Train AI (1000 games)"</strong> button. You'll see the yellow training indicator appear with a progress counter. In these few seconds, the AI is playing 1,000 games against itself, rapidly learning from its wins, losses, and draws. For every move in every game, it updates its Q-table, reinforcing good strategies and penalizing bad ones.</p>
<h3 id="heading-step-3-challenge-the-trained-ai">Step 3: Challenge the Trained AI</h3>
<p>Once training is complete, play another game on <strong>🎯 Medium</strong> difficulty. The difference should be dramatic. The AI will now play strategically, blocking your wins and setting up its own. It is no longer a pushover. Check the statistics panel again: you'll see the "States Learned" count has jumped significantly, representing all the new board positions it now understands.</p>
<h3 id="heading-step-4-experiment-with-the-controls">Step 4: Experiment with the Controls</h3>
<p>Now that you have a trained AI, experiment with the other features:</p>
<ul>
<li><p><strong>Switch to 🔥 Expert:</strong> Play against the minimax algorithm. Notice that you can't win. This demonstrates the power of a perfect-play algorithm.</p>
</li>
<li><p><strong>Tweak the parameters:</strong> Set the Exploration Rate (ε) slider to 0. The AI will become completely deterministic, always picking the move with the highest Q-value. Set it to 0.5, and watch it become more erratic and experimental again.</p>
</li>
<li><p><strong>Reset the AI:</strong> Click the "Reset AI Memory" button. This will wipe its Q-table. If you play against it now, you'll find it's back to its original, untrained state. This confirms that its "intelligence" was stored in the Q-table you just erased.</p>
</li>
</ul>
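<p>The effect of the Exploration Rate slider can be seen in a few lines of code. The sketch below is a generic epsilon-greedy selector, not the article's <code>getAction</code> method. The name <code>epsilonGreedy</code> and the injectable <code>rng</code> parameter are assumptions made so that the behaviour can be checked deterministically.</p>

```javascript
// Generic epsilon-greedy selection (illustrative sketch).
// qValues: array of nine Q-values for the current state.
// available: indices of the empty cells.
// rng: random source returning a number in [0, 1), injectable for testing.
function epsilonGreedy(qValues, available, epsilon, rng = Math.random) {
  if (epsilon > rng()) {
    // Explore: pick any available move uniformly at random.
    return available[Math.floor(rng() * available.length)];
  }
  // Exploit: pick the available move with the highest Q-value.
  return available.reduce(
    (best, a) => (qValues[a] > qValues[best] ? a : best),
    available[0]
  );
}

const qValues = [0, 0.2, 0, 0, 0.9, 0, 0, 0, 0.1];

// With epsilon = 0 the explore branch can never run, so the choice is
// always the highest-valued available move.
console.log(epsilonGreedy(qValues, [1, 4, 8], 0)); // 4
```

<p>With ε set to 0.5, roughly half of the calls take the explore branch instead, which is the erratic behaviour you see in the game.</p>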
<h3 id="heading-verifying-the-implementation-with-automated-tests">Verifying the Implementation with Automated Tests</h3>
<p>While playing the game gives you a good feel for the AI's behavior, automated tests are crucial for programmatically confirming that the underlying code is correct. This is different from the manual testing you just performed. Here, we are writing code to check our code.</p>
<p>The following test suite validates the three most critical features: difficulty switching, data persistence with <code>localStorage</code>, and the infallibility of the expert minimax AI. To run the tests, paste the code into your browser's developer console while the game is open, then call <code>runTests()</code>. Note that <code>console.assert</code> prints its message only when the condition is false, so a clean run produces no assertion output.</p>
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">runTests</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'🧪 Running enhanced tests...'</span>);

  <span class="hljs-comment">// Test 1: Difficulty switching</span>
  <span class="hljs-keyword">const</span> g1 = <span class="hljs-keyword">new</span> TicTacToe();
  g1.setDifficulty(<span class="hljs-string">'beginner'</span>);
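  <span class="hljs-comment">// console.assert logs its message only when the condition is false</span>
  <span class="hljs-built_in">console</span>.assert(g1.ai.difficulty === <span class="hljs-string">'beginner'</span>, <span class="hljs-string">'✗ Difficulty switching failed'</span>);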

  <span class="hljs-comment">// Test 2: localStorage persistence</span>
  <span class="hljs-keyword">const</span> g2 = <span class="hljs-keyword">new</span> TicTacToe();
  g2.ai.q.set(<span class="hljs-string">'test-state'</span>, [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>, <span class="hljs-number">8</span>, <span class="hljs-number">9</span>]);
  g2.saveState();
  <span class="hljs-keyword">const</span> g3 = <span class="hljs-keyword">new</span> TicTacToe();
  <span class="hljs-built_in">console</span>.assert(g3.ai.q.has(<span class="hljs-string">'test-state'</span>), <span class="hljs-string">'✗ localStorage persistence failed'</span>);

  <span class="hljs-comment">// Test 3: Minimax never loses</span>
  <span class="hljs-keyword">const</span> g4 = <span class="hljs-keyword">new</span> TicTacToe();
  g4.setDifficulty(<span class="hljs-string">'expert'</span>);
  <span class="hljs-keyword">let</span> expertLosses = <span class="hljs-number">0</span>;
  <span class="hljs-keyword">for</span> (<span class="hljs-keyword">let</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">100</span>; i++) {
    g4.reset();
    <span class="hljs-keyword">while</span> (!g4.gameOver) {
      <span class="hljs-keyword">const</span> available = g4.getAvailable();
      <span class="hljs-keyword">const</span> move = available[~~(<span class="hljs-built_in">Math</span>.random() * available.length)];
      g4.move(move, <span class="hljs-string">'X'</span>);
      <span class="hljs-keyword">if</span> (!g4.gameOver) g4.aiMove();
    }
    <span class="hljs-keyword">const</span> winner = g4.checkWinner();
    <span class="hljs-keyword">if</span> (winner?.winner === <span class="hljs-string">'X'</span>) expertLosses++;
  }
  <span class="hljs-built_in">console</span>.assert(expertLosses === <span class="hljs-number">0</span>, <span class="hljs-string">'✗ Expert AI lost at least one game'</span>);

  <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'✅ Test run complete. Any failed assertions are listed above.'</span>);
}
</code></pre>
<p>How these tests work:</p>
<ol>
<li><p><strong>Difficulty switching:</strong> The first test creates a game instance, sets the difficulty, and asserts that the AI's internal property was updated correctly.</p>
</li>
<li><p><strong>Persistence:</strong> The second test simulates saving the AI's state. It adds a dummy entry to the Q-table, saves it, creates a <em>new</em> game instance (simulating a page reload), and asserts that the new instance successfully loaded the saved data.</p>
</li>
<li><p><strong>Expert mode correctness:</strong> The third and most rigorous test plays 100 games against the expert AI using random moves for the player. It then asserts that the expert AI never lost a single game. Random play cannot prove correctness, but 100 games without a loss is strong evidence that the minimax implementation is sound.</p>
</li>
</ol>
<p>You can run these tests in your browser's console after loading the game, as shown in the screenshot below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759790825366/aedc84b7-5399-4067-bf2c-b0b488192c62.png" alt="Running tests" class="image--center mx-auto" width="1454" height="924" loading="lazy"></p>
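<p>Note that <code>console.assert()</code> only prints its message when the condition is false, and it never stops execution. If you want a summary line that reflects what actually happened, one option is a small helper that records each result. The sketch below is a hypothetical addition, and the names <code>createChecker</code>, <code>check</code>, and <code>summary</code> are not part of the game code:</p>

```javascript
// Hypothetical helper (not part of the game code): records each check
// so the final summary reflects the real results.
function createChecker(log = console.log) {
  const failures = [];
  return {
    // Logs a pass/fail line and remembers the label of every failed check.
    check(condition, label) {
      if (!condition) failures.push(label);
      log(`${condition ? '✓' : '✗'} ${label}`);
    },
    // Builds the summary from the recorded failures.
    summary() {
      return failures.length === 0
        ? '✅ All tests passed!'
        : `❌ ${failures.length} failed: ${failures.join(', ')}`;
    },
  };
}

const t = createChecker();
t.check(1 + 1 === 2, 'Arithmetic works');
console.log(t.summary());
```

<p>Inside <code>runTests()</code>, each <code>console.assert(...)</code> call could then become a <code>t.check(...)</code> call, with <code>t.summary()</code> logged at the end.</p>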
<h2 id="heading-advanced-optimizations-and-extensions">Advanced Optimizations and Extensions</h2>
<p>Now that you have the complete implementation, here are ways to extend it further:</p>
<h3 id="heading-how-to-implement-symmetry-reduction">How to Implement Symmetry Reduction</h3>
<p>You can reduce the state space by recognizing equivalent board positions:</p>
<pre><code class="lang-javascript">getCanonicalState(s) {
  <span class="hljs-keyword">const</span> transforms = [
    s, <span class="hljs-built_in">this</span>.rot90(s), <span class="hljs-built_in">this</span>.rot180(s), <span class="hljs-built_in">this</span>.rot270(s),
    <span class="hljs-built_in">this</span>.flip(s), <span class="hljs-built_in">this</span>.flip(<span class="hljs-built_in">this</span>.rot90(s)), 
    <span class="hljs-built_in">this</span>.flip(<span class="hljs-built_in">this</span>.rot180(s)), <span class="hljs-built_in">this</span>.flip(<span class="hljs-built_in">this</span>.rot270(s))
  ];
  <span class="hljs-keyword">return</span> transforms.sort()[<span class="hljs-number">0</span>];
}

rot90(s) {
  <span class="hljs-keyword">const</span> b = s.split(<span class="hljs-string">''</span>);
  <span class="hljs-keyword">return</span> [b[<span class="hljs-number">6</span>],b[<span class="hljs-number">3</span>],b[<span class="hljs-number">0</span>],b[<span class="hljs-number">7</span>],b[<span class="hljs-number">4</span>],b[<span class="hljs-number">1</span>],b[<span class="hljs-number">8</span>],b[<span class="hljs-number">5</span>],b[<span class="hljs-number">2</span>]].join(<span class="hljs-string">''</span>);
}

rot180(s) {
  <span class="hljs-keyword">return</span> s.split(<span class="hljs-string">''</span>).reverse().join(<span class="hljs-string">''</span>);
}

rot270(s) {
  <span class="hljs-keyword">const</span> b = s.split(<span class="hljs-string">''</span>);
  <span class="hljs-keyword">return</span> [b[<span class="hljs-number">2</span>],b[<span class="hljs-number">5</span>],b[<span class="hljs-number">8</span>],b[<span class="hljs-number">1</span>],b[<span class="hljs-number">4</span>],b[<span class="hljs-number">7</span>],b[<span class="hljs-number">0</span>],b[<span class="hljs-number">3</span>],b[<span class="hljs-number">6</span>]].join(<span class="hljs-string">''</span>);
}

flip(s) {
  <span class="hljs-keyword">const</span> b = s.split(<span class="hljs-string">''</span>);
  <span class="hljs-keyword">return</span> [b[<span class="hljs-number">2</span>],b[<span class="hljs-number">1</span>],b[<span class="hljs-number">0</span>],b[<span class="hljs-number">5</span>],b[<span class="hljs-number">4</span>],b[<span class="hljs-number">3</span>],b[<span class="hljs-number">8</span>],b[<span class="hljs-number">7</span>],b[<span class="hljs-number">6</span>]].join(<span class="hljs-string">''</span>);
}
</code></pre>
<p>This symmetry reduction technique speeds up AI learning by recognizing equivalent board positions.</p>
<p><strong>How it works:</strong></p>
<ul>
<li><p><strong>getCanonicalState()</strong>: Generates all 8 symmetric versions of a board state (4 rotations + 4 flipped versions) and returns the lexicographically smallest one as the standard representation</p>
</li>
<li><p><strong>rot90()</strong>: Rotates board 90° clockwise by remapping position indices</p>
</li>
<li><p><strong>rot180()</strong>: Rotates 180° by reversing the board array</p>
</li>
<li><p><strong>rot270()</strong>: Rotates 270° clockwise (or 90° counterclockwise)</p>
</li>
<li><p><strong>flip()</strong>: Mirrors the board horizontally</p>
</li>
</ul>
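<p><strong>Why this matters:</strong> By storing only canonical states in the Q-table, the AI reduces unique positions from ~5,500 to fewer than 800. That is roughly a <strong>7x smaller</strong> table (not a full 8x, because positions that are themselves symmetric have fewer than 8 distinct variants), so each training game covers a much larger share of it.</p>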
<p><strong>Example:</strong> These boards are considered identical:</p>
<pre><code class="lang-bash">X--          ---               --X
---    =     ---       =       ---
---          --X               ---
(original)   (180° rotation)   (horizontal flip)
</code></pre>
<p>All three map to the same canonical state, so the AI only needs to learn one instead of three.</p>
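<p>Modify <code>getQ()</code> to look up the canonical state. Remember to remap action indices with the same transform: a Q-value stored for a cell of the canonical board refers to a different cell of the original board whenever the board was rotated or flipped. With both in place, the AI treats rotated and flipped positions as equivalent and needs far fewer games to fill in its Q-table.</p>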
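<p>Here is a minimal sketch of that remapping, assuming the board is a 9-character string with <code>-</code> for empty cells. The permutations are copied from <code>rot90()</code>, <code>rot180()</code>, <code>rot270()</code>, and <code>flip()</code> above. The names <code>SYMMETRIES</code>, <code>canonicalize</code>, <code>toCanonicalAction</code>, and <code>toOriginalAction</code> are hypothetical and not part of the original code:</p>

```javascript
// Each symmetry is a permutation p, applied as out[i] = s[p[i]].
const IDENTITY = [0, 1, 2, 3, 4, 5, 6, 7, 8];
const ROT90 = [6, 3, 0, 7, 4, 1, 8, 5, 2];
const ROT180 = [8, 7, 6, 5, 4, 3, 2, 1, 0];
const ROT270 = [2, 5, 8, 1, 4, 7, 0, 3, 6];
const FLIP = [2, 1, 0, 5, 4, 3, 8, 7, 6];

// compose(outer, inner) is the permutation for outer(inner(s)).
const compose = (outer, inner) => outer.map((i) => inner[i]);

const SYMMETRIES = [
  IDENTITY, ROT90, ROT180, ROT270,
  FLIP, compose(FLIP, ROT90), compose(FLIP, ROT180), compose(FLIP, ROT270),
];

const applyPerm = (s, p) => p.map((i) => s[i]).join('');

// Returns the lexicographically smallest variant and the permutation that produced it.
function canonicalize(s) {
  let best = { state: s, perm: IDENTITY };
  for (const perm of SYMMETRIES) {
    const state = applyPerm(s, perm);
    if (best.state > state) best = { state, perm };
  }
  return best;
}

// A move on cell `action` of the original board lands on this cell of the canonical board.
const toCanonicalAction = (action, perm) => perm.indexOf(action);

// A move chosen on the canonical board maps back to this cell of the original board.
const toOriginalAction = (canonicalAction, perm) => perm[canonicalAction];

const { state, perm } = canonicalize('X--------');
console.log(state, toCanonicalAction(0, perm)); // prints "--------X 8"
```

<p>When reading or updating the Q-table, you would use <code>state</code> as the key and <code>toCanonicalAction()</code> for the index. When the AI picks a move from the canonical Q-values, <code>toOriginalAction()</code> converts it back to a cell on the real board.</p>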
<h3 id="heading-how-to-add-export-and-import-functionality">How to Add Export and Import Functionality</h3>
<p>You can also let users share trained AI models:</p>
<pre><code class="lang-javascript">exportAI() {
  <span class="hljs-keyword">const</span> data = {
    <span class="hljs-attr">q</span>: <span class="hljs-built_in">Array</span>.from(<span class="hljs-built_in">this</span>.ai.q.entries()),
    <span class="hljs-attr">stats</span>: <span class="hljs-built_in">this</span>.stats,
    <span class="hljs-attr">difficulty</span>: <span class="hljs-built_in">this</span>.ai.difficulty,
    <span class="hljs-attr">timestamp</span>: <span class="hljs-built_in">Date</span>.now()
  };

  <span class="hljs-keyword">const</span> blob = <span class="hljs-keyword">new</span> Blob([<span class="hljs-built_in">JSON</span>.stringify(data)], { <span class="hljs-attr">type</span>: <span class="hljs-string">'application/json'</span> });
  <span class="hljs-keyword">const</span> url = URL.createObjectURL(blob);
  <span class="hljs-keyword">const</span> a = <span class="hljs-built_in">document</span>.createElement(<span class="hljs-string">'a'</span>);
  a.href = url;
  a.download = <span class="hljs-string">`tictactoe-ai-<span class="hljs-subst">${<span class="hljs-built_in">Date</span>.now()}</span>.json`</span>;
  a.click();
  URL.revokeObjectURL(url);
}

importAI(file) {
  <span class="hljs-keyword">const</span> reader = <span class="hljs-keyword">new</span> FileReader();
  reader.onload = <span class="hljs-function">(<span class="hljs-params">e</span>) =&gt;</span> {
    <span class="hljs-keyword">try</span> {
      <span class="hljs-keyword">const</span> data = <span class="hljs-built_in">JSON</span>.parse(e.target.result);
      <span class="hljs-built_in">this</span>.ai.q = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Map</span>(data.q);
      <span class="hljs-built_in">this</span>.stats = data.stats;
      <span class="hljs-built_in">this</span>.ai.difficulty = data.difficulty;
      <span class="hljs-built_in">this</span>.updateStats();
      <span class="hljs-built_in">this</span>.setStatus(<span class="hljs-string">'✓ AI imported successfully!'</span>);
    } <span class="hljs-keyword">catch</span> (err) {
      <span class="hljs-built_in">this</span>.setStatus(<span class="hljs-string">'✗ Import failed: Invalid file'</span>);
    }
  };
  reader.readAsText(file);
}
</code></pre>
<p>These methods enable sharing trained AI models between users. The <code>exportAI()</code> method packages the complete AI state (Q-table, statistics, difficulty, and timestamp) into a JSON object, creates a Blob from the JSON string, generates a temporary download URL, programmatically creates and clicks a download link, then cleans up the URL. The filename includes a timestamp for version tracking.</p>
<p>The <code>importAI()</code> method uses FileReader to asynchronously read an uploaded JSON file, parses it, reconstructs the Map from the array of entries, restores all game state, and updates the display. Error handling catches invalid JSON or corrupted files.</p>
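<p>One gap worth noting: a file can be valid JSON and still have the wrong shape, and <code>JSON.parse()</code> will not complain about that. A shape check before applying the data is one way to guard against it. The sketch below is a hypothetical helper, and it assumes each Q-table entry is a <code>[stateString, arrayOfNineNumbers]</code> pair, which matches the entries used in the test suite above:</p>

```javascript
// Hypothetical helper: validates one [state, qValues] entry of the exported Q-table.
// Assumes nine Q-values per state, one for each board cell.
function isValidEntry(entry) {
  if (!Array.isArray(entry) || entry.length !== 2) return false;
  const [state, values] = entry;
  if (typeof state !== 'string' || !Array.isArray(values)) return false;
  if (values.length !== 9) return false;
  return values.every((v) => Number.isFinite(v));
}

// Hypothetical helper: checks the overall shape of a parsed export file.
function isValidExport(data) {
  if (data === null || typeof data !== 'object') return false;
  if (!Array.isArray(data.q) || !data.q.every(isValidEntry)) return false;
  if (data.stats === null || typeof data.stats !== 'object') return false;
  return typeof data.difficulty === 'string';
}

const sample = {
  q: [['X--------', [0, 0, 0, 0, 0.5, 0, 0, 0, 0]]],
  stats: {},
  difficulty: 'expert',
};
console.log(isValidExport(sample)); // prints true
console.log(isValidExport({ q: 'not an array' })); // prints false
```

<p>In <code>importAI()</code>, you could call <code>isValidExport(data)</code> right after parsing and throw an error when it returns <code>false</code>, so the existing <code>catch</code> block reports the failure before any game state is overwritten.</p>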
<h3 id="heading-how-to-add-q-value-heatmap-visualization">How to Add Q-Value Heatmap Visualization</h3>
<p>Here’s how you can visualize the AI's decision-making:</p>
<pre><code class="lang-javascript">drawQValueHeatmap() {
  <span class="hljs-keyword">const</span> state = <span class="hljs-built_in">this</span>.board;
  <span class="hljs-keyword">const</span> qValues = <span class="hljs-built_in">this</span>.ai.getQ(state);
  <span class="hljs-keyword">const</span> available = <span class="hljs-built_in">this</span>.getAvailable();

  <span class="hljs-keyword">if</span> (available.length === <span class="hljs-number">0</span>) <span class="hljs-keyword">return</span>;

  <span class="hljs-keyword">const</span> maxQ = <span class="hljs-built_in">Math</span>.max(...available.map(<span class="hljs-function"><span class="hljs-params">i</span> =&gt;</span> qValues[i]));
  <span class="hljs-keyword">const</span> minQ = <span class="hljs-built_in">Math</span>.min(...available.map(<span class="hljs-function"><span class="hljs-params">i</span> =&gt;</span> qValues[i]));
  <span class="hljs-keyword">const</span> range = maxQ - minQ || <span class="hljs-number">1</span>;

  <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> i <span class="hljs-keyword">of</span> available) {
    <span class="hljs-comment">// Reset per cell: the text below is drawn at full opacity</span>
    <span class="hljs-built_in">this</span>.ctx.globalAlpha = <span class="hljs-number">0.3</span>;
    <span class="hljs-keyword">const</span> normalized = (qValues[i] - minQ) / range;
    <span class="hljs-keyword">const</span> row = ~~(i / <span class="hljs-number">3</span>);
    <span class="hljs-keyword">const</span> col = i % <span class="hljs-number">3</span>;

    <span class="hljs-comment">// Green for high Q-values, red for low</span>
    <span class="hljs-keyword">const</span> hue = normalized * <span class="hljs-number">120</span>;
    <span class="hljs-built_in">this</span>.ctx.fillStyle = <span class="hljs-string">`hsl(<span class="hljs-subst">${hue}</span>, 70%, 50%)`</span>;
    <span class="hljs-built_in">this</span>.ctx.fillRect(
      col * <span class="hljs-built_in">this</span>.cellSize + <span class="hljs-number">5</span>,
      row * <span class="hljs-built_in">this</span>.cellSize + <span class="hljs-number">5</span>,
      <span class="hljs-built_in">this</span>.cellSize - <span class="hljs-number">10</span>,
      <span class="hljs-built_in">this</span>.cellSize - <span class="hljs-number">10</span>
    );

    <span class="hljs-comment">// Draw Q-value</span>
    <span class="hljs-built_in">this</span>.ctx.globalAlpha = <span class="hljs-number">1</span>;
    <span class="hljs-built_in">this</span>.ctx.fillStyle = <span class="hljs-string">'#000'</span>;
    <span class="hljs-built_in">this</span>.ctx.font = <span class="hljs-string">'14px monospace'</span>;
    <span class="hljs-built_in">this</span>.ctx.fillText(
      qValues[i].toFixed(<span class="hljs-number">2</span>),
      col * <span class="hljs-built_in">this</span>.cellSize + <span class="hljs-number">10</span>,
      row * <span class="hljs-built_in">this</span>.cellSize + <span class="hljs-number">25</span>
    );
  }
  <span class="hljs-built_in">this</span>.ctx.globalAlpha = <span class="hljs-number">1</span>;
}
</code></pre>
<p>This visualization method creates a color-coded heatmap showing the AI's confidence in each available move.</p>
<p>It first retrieves Q-values for the current state and finds the min/max values among available positions to normalize the data. For each empty cell, it calculates a normalized score (0 to 1), converts it to a hue value (0° red for low values, 120° green for high values) using HSL color space, and fills the cell with a semi-transparent colored rectangle. It then overlays the actual Q-value as text for precise inspection.</p>
<p>This gives you instant visual feedback about which moves the AI considers most promising. Green cells are good moves, red cells are poor moves.</p>
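<p>To see the normalization step in isolation, here is the same arithmetic as a standalone function with no canvas involved. The function name <code>qValuesToHues</code> and the sample Q-values are illustrative assumptions:</p>

```javascript
// Maps each available cell's Q-value to a hue between 0 (red) and 120 (green).
// `qValues` is assumed to be an array indexed by cell; `available` lists the empty cells.
function qValuesToHues(qValues, available) {
  const values = available.map((i) => qValues[i]);
  const minQ = Math.min(...values);
  const maxQ = Math.max(...values);
  const range = maxQ - minQ || 1; // avoids dividing by zero when all values match
  return new Map(available.map((i) => [i, ((qValues[i] - minQ) / range) * 120]));
}

const hues = qValuesToHues([0.5, 0, -0.5, 0, 0, 0, 0, 0, 0], [0, 1, 2]);
console.log([...hues]); // prints [ [ 0, 120 ], [ 1, 60 ], [ 2, 0 ] ]
```

<p>One consequence of the <code>|| 1</code> fallback: when every available move has the same Q-value, such as in a state the AI has never visited, all cells normalize to 0 and are drawn red. The colors are relative to the current position, so a red cell means "worst of the available options" rather than "bad move" in an absolute sense.</p>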
<h2 id="heading-common-pitfalls-and-solutions">Common Pitfalls and Solutions</h2>
<h3 id="heading-issue-1-ai-does-not-improve">Issue 1: AI Does Not Improve</h3>
<ul>
<li><p><strong>Cause</strong>: The learning rate is too low or there hasn't been enough training.</p>
</li>
<li><p><strong>Solution</strong>: Increase the learning rate to between 0.2 and 0.3, and train for more than 2000 games.</p>
</li>
</ul>
<h3 id="heading-issue-2-ai-makes-random-moves">Issue 2: AI Makes Random Moves</h3>
<ul>
<li><p><strong>Cause</strong>: The exploration rate is too high after training.</p>
</li>
<li><p><strong>Solution</strong>: Reduce the exploration rate to 0.01 once training is complete.</p>
</li>
</ul>
<h3 id="heading-issue-3-slow-performance">Issue 3: Slow Performance</h3>
<ul>
<li><p><strong>Cause</strong>: The state representation or Q-table lookup is inefficient.</p>
</li>
<li><p><strong>Solution</strong>: Use a Map instead of objects and implement state caching.</p>
</li>
</ul>
<h3 id="heading-issue-4-ai-overfits-to-one-strategy">Issue 4: AI Overfits to One Strategy</h3>
<ul>
<li><p><strong>Cause</strong>: There isn't enough exploration during training.</p>
</li>
<li><p><strong>Solution</strong>: Begin with a high exploration rate (ε=0.5) and gradually decrease it.</p>
</li>
</ul>
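<p>As a rough sketch of the gradual decrease described in Issue 4, you could compute the exploration rate from the number of games played. The function name <code>decayedEpsilon</code> and the decay factor of 0.995 are assumptions for illustration. The start and floor values come from the suggestions above:</p>

```javascript
// Exponential decay from a starting exploration rate toward a floor.
// The decay factor is an illustrative assumption, not a value from the game code.
function decayedEpsilon(episode, { start = 0.5, floor = 0.01, decay = 0.995 } = {}) {
  return Math.max(floor, start * Math.pow(decay, episode));
}

// Print the exploration rate at a few points during training.
for (const episode of [0, 100, 500, 2000]) {
  console.log(episode, decayedEpsilon(episode).toFixed(3));
}
```

<p>With these defaults, the rate reaches the 0.01 floor after roughly 780 games. A decay factor closer to 1 keeps the AI exploring for longer, which can help if it settles on one strategy too early.</p>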
<h2 id="heading-how-to-extend-this-to-other-games">How to Extend This to Other Games</h2>
<p>This framework adapts to other games:</p>
<ul>
<li><p><strong>Connect Four</strong>: 42-character state, 7 actions (columns)</p>
</li>
<li><p><strong>Blackjack</strong>: State includes hand values and dealer card</p>
</li>
<li><p><strong>Snake</strong>: The state space is too large for a lookup table, so it requires function approximation</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You have built a complete reinforcement learning system in JavaScript. This project demonstrates:</p>
<ul>
<li><p>Core RL concepts with practical implementation</p>
</li>
<li><p>Clean, maintainable code architecture</p>
</li>
<li><p>Real-time training and visualization</p>
</li>
<li><p>Advanced techniques like epsilon decay and self-play</p>
</li>
<li><p>Three difficulty levels from beginner to expert</p>
</li>
<li><p>Data persistence with localStorage</p>
</li>
<li><p>Interactive tooltips for learning</p>
</li>
</ul>
<p>The Q-learning foundation you have implemented is the basis for more advanced techniques like Deep Q-Networks (DQN), which are used in modern game AI.</p>
<h2 id="heading-next-steps">Next Steps</h2>
<p>Here are some ways to continue learning:</p>
<ol>
<li><p>Add more difficulty levels with custom parameters</p>
</li>
<li><p>Implement state persistence with IndexedDB for larger Q-tables</p>
</li>
<li><p>Create multiplayer mode with AI observation</p>
</li>
<li><p>Build a neural network version with TensorFlow.js</p>
</li>
<li><p>Extend to Connect Four or Chess endgames</p>
</li>
</ol>
<h3 id="heading-resources-for-further-learning">Resources for Further Learning</h3>
<ul>
<li><p><a target="_blank" href="http://incompleteideas.net/book/the-book.html">Reinforcement Learning: An Introduction</a> by Sutton and Barto (free online textbook)</p>
</li>
<li><p><a target="_blank" href="https://spinningup.openai.com/">OpenAI Spinning Up</a> – comprehensive RL resource</p>
</li>
<li><p><a target="_blank" href="https://sites.google.com/view/deep-rl-bootcamp/">Deep RL Bootcamp</a> – Berkeley video lectures</p>
</li>
<li><p><a target="_blank" href="https://stable-baselines3.readthedocs.io/">Stable-Baselines3 Documentation</a> – production RL implementations</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Use Gymnasium for Reinforcement Learning ]]>
                </title>
                <description>
                    <![CDATA[ Embark on an exciting journey to learn the fundamentals of reinforcement learning and its implementation using Gymnasium, the open-source Python library previously known as OpenAI Gym. We just published a full course on the freeCodeCamp.org YouTube c... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/use-openai-gymnasium-for-reinforcement-learning/</link>
                <guid isPermaLink="false">66b206d6b7ebc564bd87e357</guid>
                
                    <category>
                        <![CDATA[ Reinforcement Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Tue, 21 Mar 2023 14:17:07 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/03/gymnasium.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Embark on an exciting journey to learn the fundamentals of reinforcement learning and its implementation using Gymnasium, the open-source Python library previously known as OpenAI Gym.</p>
<p>We just published a full course on the freeCodeCamp.org YouTube channel that will teach you the basics of reinforcement learning using Gymnasium.</p>
<p>Mustafa Esoofally created this course. He is an experienced machine learning engineer and course creator.</p>
<p>Gymnasium is an open source Python library maintained by the Farama Foundation. It provides a standard API for communication between learning algorithms and environments, along with a rich collection of pre-built environments that comply with that API. This comprehensive video course is designed to help you understand reinforcement learning, a branch of machine learning that focuses on intelligent agents taking actions in an environment to maximize cumulative rewards.</p>
<h3 id="heading-course-contents">Course Contents</h3>
<p>This video course is carefully structured to provide you with a complete understanding of reinforcement learning, from basics to advanced topics:</p>
<ol>
<li><strong>Introduction</strong><br>Get an overview of the course, its objectives, and the topics we will cover.</li>
<li><strong>Reinforcement Learning Basics (Agent and Environment)</strong><br>Learn about the fundamental concepts of reinforcement learning, including agents, environments, and their interactions.</li>
<li><strong>Introduction to Gymnasium</strong><br>Discover the power of Gymnasium and how it can help you develop and test reinforcement learning algorithms.</li>
<li><strong>Blackjack Rules and Implementation in Gymnasium</strong><br>Dive into the classic card game of Blackjack and learn how to implement it using Gymnasium.</li>
<li><strong>Solving Blackjack</strong><br>Explore the process of solving Blackjack using reinforcement learning techniques.</li>
<li><strong>Install and Import Libraries</strong><br>Learn how to set up your Python environment and import the necessary libraries for reinforcement learning.</li>
<li><strong>Observing the Environment</strong><br>Understand how to monitor and interact with the environment during reinforcement learning tasks.</li>
<li><strong>Executing an Action in the Environment</strong><br>Master the process of performing actions in the environment and receiving feedback.</li>
<li><strong>Understand and Implement Epsilon-greedy Strategy to Solve Blackjack</strong><br>Learn the epsilon-greedy strategy, an essential technique for solving Blackjack with reinforcement learning.</li>
<li><strong>Understand the Q-values</strong><br>Explore the concept of Q-values and how they are used in reinforcement learning algorithms.</li>
<li><strong>Training the Agent to Play Blackjack</strong><br>Learn the process of training a reinforcement learning agent to play Blackjack effectively.</li>
<li><strong>Visualize the Training of Agent Playing Blackjack</strong><br>Discover how to visualize and analyze the training process of a reinforcement learning agent.</li>
<li><strong>Summary of Solving Blackjack</strong><br>Review the key concepts and techniques learned while solving Blackjack.</li>
<li><strong>Solving CartPole Using Deep Q-Networks (DQN)</strong><br>Learn how to solve the classic CartPole problem using Deep Q-Networks, a popular reinforcement learning technique.</li>
<li><strong>Summary of Solving CartPole</strong><br>Recap the essential elements of solving CartPole using reinforcement learning.</li>
<li><strong>Advanced Topics and Introduction to Multi-Agent Reinforcement Learning using PettingZoo</strong><br>Delve into advanced reinforcement learning topics, including multi-agent reinforcement learning and the use of the PettingZoo library.</li>
</ol>
<p>With this comprehensive video course, you'll be well-equipped to tackle reinforcement learning challenges using the powerful Gymnasium library. </p>
<p>Watch the full course on <a target="_blank" href="https://youtu.be/vufTSJbzKGU">the freeCodeCamp.org YouTube channel</a> (3-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/vufTSJbzKGU" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Train an AI to Play a Snake Game Using Python ]]>
                </title>
                <description>
                    <![CDATA[ Why waste time playing video games when you can train an AI to do it for you? Ok, maybe playing yourself is more fun but training an AI can be more educational. We just published a course on the freeCodeCamp.org YouTube channel that will teach you th... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/train-an-ai-to-play-a-snake-game-using-python/</link>
                <guid isPermaLink="false">66b206ae260b867a4064ba06</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Reinforcement Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Mon, 25 Apr 2022 16:48:40 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/06/aisnake2.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Why waste time playing video games when you can train an AI to do it for you? OK, maybe playing yourself is more fun, but training an AI can be more educational.</p>
<p>We just published a course on the freeCodeCamp.org YouTube channel that will teach you the basics of reinforcement learning by showing you how to teach an AI to play a snake game.</p>
<p>Reinforcement learning is a type of machine learning that enables an agent to learn in an environment by trial and error using feedback from its own actions and experiences.</p>
<p>First you will create the game using Python and Pygame. Then you will create and train a neural network using PyTorch that can play the game better than most humans.</p>
<p>Patrick Loeber, also known as Python Engineer, created this course. He has created many popular courses related to Python and machine learning. </p>
<p>The girl and snake art for this course were created by <a target="_blank" href="http://rachel.likespizza.com/">Rachel Likes Pizza</a>.</p>
<p>Here is what you will do in this four-part course:</p>
<ul>
<li>Learn the basics of Reinforcement Learning and Deep Q Learning</li>
<li>Set up the environment and implement a snake game</li>
<li>Implement an agent to control the game</li>
<li>Create and train a neural network to play the game</li>
</ul>
<p>Watch the full course below or on the freeCodeCamp.org YouTube channel (2-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/L8ypSXwyBds" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Intro to Advanced Actor-Critic Methods: Reinforcement Learning Course ]]>
                </title>
                <description>
                    <![CDATA[ Actor-Critic Methods are very useful reinforcement learning techniques. Actor-critic methods are most useful for applications in robotics as they allow software to output continuous, rather than discrete actions. This enables control of electric moto... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/intro-to-advanced-actor-critic-methods-reinforcement-learning-course/</link>
                <guid isPermaLink="false">66b203a027569435a9255ad2</guid>
                
                    <category>
                        <![CDATA[ Reinforcement Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Fri, 30 Jul 2021 22:20:15 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/07/activecritic.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Actor-Critic Methods are very useful reinforcement learning techniques.</p>
<p>Actor-critic methods are most useful for applications in robotics as they allow software to output continuous, rather than discrete actions. This enables control of electric motors to actuate movement in robotic systems, at the expense of increased computational complexity.</p>
<p>We just released a comprehensive course on Actor-Critic methods on the freeCodeCamp.org YouTube channel.</p>
<p>Dr. Tabor developed this course. He is a physicist and former semiconductor engineer who is now a data scientist.</p>
<p>The basic idea behind actor-critic methods is that there are two deep neural networks. The actor network approximates the agent’s policy: a probability distribution that tells us the probability of selecting a (continuous) action given some state of the environment. The critic network approximates the value function: the agent’s estimate of future rewards that follow the current state. These two networks interact to shift the policy towards more profitable states, where profitability is determined by interacting with the environment.</p>
<p>This requires no prior knowledge of how our environment works, or any input regarding rules of the game. All we have to do is let the algorithm interact with the environment and watch as it learns. </p>
<p>This course also incorporates some useful innovations from deep Q-learning, such as the use of experience replay buffers and target networks. These increase the stability and robustness of the learned policies, so that our agents are able to learn effective policies for navigating the OpenAI Gym environments.</p>
<p>Here are the algorithms covered in this course:</p>
<ul>
<li>Actor Critic</li>
<li>Deep Deterministic Policy Gradients (DDPG)</li>
<li>Twin Delayed Deep Deterministic Policy Gradients (TD3)</li>
<li>Proximal Policy Optimization (PPO)</li>
<li>Soft Actor Critic (SAC)</li>
<li>Asynchronous Advantage Actor Critic (A3C)</li>
</ul>
<p>Watch the full course below or on <a target="_blank" href="https://youtu.be/K2qjAixgLqk">the freeCodeCamp.org YouTube channel</a> (6-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/K2qjAixgLqk" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How I planned my meals with Reinforcement Learning on a budget ]]>
                </title>
                <description>
                    <![CDATA[ By Sterling Osborne, PhD Researcher Following my recent article on applying Reinforcement Learning to real life problems, I decided to demonstrate this with a small example. The aim is to create an algorithm that can find a suitable choice of food pr... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-i-planned-my-meals-with-reinforcement-learning-on-a-budget-a82aac906ada/</link>
                <guid isPermaLink="false">66c34e03f41767c3c96bacca</guid>
                
                    <category>
                        <![CDATA[ budget ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Reinforcement Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 16 Apr 2019 16:00:10 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*DJoo_O-eNQAnYrc4blWzAg.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Sterling Osborne, PhD Researcher</p>
<p>Following <a target="_blank" href="https://medium.freecodecamp.org/how-to-apply-reinforcement-learning-to-real-life-planning-problems-90f8fa3dc0c5">my recent article on applying Reinforcement Learning to real life problems</a>, I decided to demonstrate this with a small example. The aim is to create an algorithm that can find a suitable choice of food products to fit within a budget and meet my personal preferences.</p>
<p>I have also posted the description, data and code kernel to Kaggle and this can be found <a target="_blank" href="https://www.kaggle.com/osbornep/reinforcement-learning-for-meal-planning-in-python/notebook">here</a>.</p>
<p>Please let me know if you have any questions or suggestions.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/LZJGp50r3YXHLlCWaMEoqYBhRwvpxdjsdgwa" alt="Image" width="800" height="533" loading="lazy">
<em>Photo: Pixabay</em></p>
<h3 id="heading-aim">Aim</h3>
<p>When food shopping, there are many different products for the same ingredient to choose from in supermarkets. Some are less expensive, others are of higher quality. I would like to create a model that, for the required ingredients, can select the optimal products required to make a meal that is both:</p>
<ol>
<li>Within my budget</li>
<li>In line with my personal preferences</li>
</ol>
<p>To do this, I will first build a very simple model that can recommend the products that are below my budget before introducing my preferences.</p>
<p>The reason we use a model is that we could, in theory, scale the problem to consider more and more ingredients and products, to the point where it is beyond any mental calculation.</p>
<h3 id="heading-method">Method</h3>
<p>To achieve this, I will be building a simple reinforcement learning model and I’ll use Monte Carlo learning to find the optimal combination of products.</p>
<p>First, let us formally define the parts of our model as a Markov Decision Process:</p>
<ul>
<li>We have a finite number of ingredients required to make any meal, and these are considered to be our <strong>States</strong></li>
<li>There are a finite number of possible products for each ingredient, and these are therefore the <strong>Actions of each state</strong></li>
<li>Our preferences become the <strong>Individual Rewards</strong> for selecting each product; we will cover this in more detail later</li>
</ul>
<p>Monte Carlo learning combines the quality of each step towards reaching an end goal and requires that, in order to assess the quality of any step, we must wait and see the outcome of the whole combination. This process is repeated over and over again in episodes with many different products until it finds the selection that appears to lead to a positive outcome repeatedly. This is the reinforcement learning process, where our environment is simulated based on the knowledge about costs and preferences we obtained.</p>
<p>Monte Carlo is often avoided due to the time required to go through the whole process before being able to learn. However, in our problem it is required because our final check of whether the selected combination of products is good or bad is to add up the real cost of those selected and check whether this is below or above our budget. Furthermore, at least at this stage, we will not be considering more than a few ingredients, so the time taken is not significant.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/5XabGzV2o9PFKK7nUoP-QHSShPx6a2ur1XdM" alt="Image" width="710" height="273" loading="lazy">
<em><a target="_blank" href="https://www.tractica.com/artificial-intelligence/reinforcement-learning-and-its-implications-for-enterprise-artificial-intelligence/">https://www.tractica.com/artificial-intelligence/reinforcement-learning-and-its-implications-for-enterprise-artificial-intelligence/</a></em></p>
<h3 id="heading-sample-data">Sample Data</h3>
<p>For this demonstration, I have created some sample data for a meal where we have 4 ingredients and 9 products, as shown in the diagram below.</p>
<p>We need to select one product for each ingredient in the meal.</p>
<p>This means we have 2 x 2 x 2 x 3 = 24 possible selections of products for the 4 ingredients.</p>
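<p>As a quick check of that count, here is a minimal Python sketch that enumerates the selections. The product labels (a1, a2, ..., d3) match those used throughout this article; the dictionary name <code>products</code> is only an illustrative choice and is not taken from the original kernel.</p>

```python
from itertools import product

# Product labels per ingredient (2, 2, 2 and 3 options respectively).
products = {
    "a": ["a1", "a2"],
    "b": ["b1", "b2"],
    "c": ["c1", "c2"],
    "d": ["d1", "d2", "d3"],
}

# One product per ingredient: the Cartesian product of the option lists.
selections = list(product(*products.values()))
print(len(selections))  # 24
```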
<p>I have also included the real cost for each product and V_0.</p>
<p>V_0 is simply the initial quality of each product to meet our requirements and we set this to 0 for each.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/eCjXwnetr8IA787ykWpZq4k6OdbHEroKoA5M" alt="Image" width="800" height="446" loading="lazy">
<em>Diagram showing the possible product choices for each ingredient</em></p>
<p>First, we import the required packages and data.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/IJ5NNxRwWJQ8QcwXNicYIDoxXZjcEGzOdzRz" alt="Image" width="800" height="396" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/YvGNf8QiGazYIDSMQhdmNWKU8TnfviITZHrd" alt="Image" width="394" height="310" loading="lazy"></p>
<h3 id="heading-applying-the-model-in-theory">Applying the Model in Theory</h3>
<p>For now, I will not introduce any individual rewards for the products. Instead, I will simply focus on whether the combination of products selected is below our budget or not. This outcome is defined as the <strong>Terminal Reward</strong> of our problem.</p>
<p>For example, say we have a budget of £30 and consider the choice:</p>
<p>a1→b1→c1→d1</p>
<p>The real cost of this selection is:</p>
<p>£10+£8+£3+£8 = £29 &lt; £30</p>
<p>And therefore, our terminal reward is:</p>
<p>R_T=+1</p>
<p>Whereas, for the choice:</p>
<p>a2→b2→c2→d1</p>
<p>the real cost of this selection is:</p>
<p>£6+£11+£7+£8 = £32 &gt; £30</p>
<p>And therefore, our terminal reward is:</p>
<p>R_T=−1</p>
<p>For now, we are simply telling our model whether the choice is good or bad and will observe what this does to the results.</p>
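<p>The two worked examples can be reproduced with a short Python sketch. The costs are the ones quoted in the examples; the function name <code>terminal_reward</code> and the decision to treat a total exactly equal to the budget as acceptable are assumptions made for illustration.</p>

```python
# Costs (in pounds) taken from the two worked examples above.
cost = {"a1": 10, "b1": 8, "c1": 3, "d1": 8, "a2": 6, "b2": 11, "c2": 7}

def terminal_reward(selection, budget):
    """Return +1 if the selection is within budget, otherwise -1."""
    total_cost = sum(cost[p] for p in selection)
    # Assumption: a total exactly equal to the budget counts as acceptable.
    return 1 if total_cost <= budget else -1

print(terminal_reward(["a1", "b1", "c1", "d1"], budget=30))  # 29 <= 30, so +1
print(terminal_reward(["a2", "b2", "c2", "d1"], budget=30))  # 32 > 30, so -1
```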
<h3 id="heading-model-learning">Model Learning</h3>
<p>So how does our model actually learn? In short, we get our model to try out lots of combinations of products and at the end of each tell it whether its choice was good or bad. Over time, it will recognise that some products generally lead to getting a good outcome while others do not.</p>
<p>What we end up creating are values for how good each product is, denoted V(a). We have already introduced the initial V(a) for each product, but how do we go from these initial values to actually being able to make a decision?</p>
<p>For this, we need an <strong>Update Rule</strong>. This tells the model, each time it has presented its choice of products and we have told it whether its selection is good or bad, how to add this to our initial values.</p>
<p>Our update rule is as follows:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/tBfSiFmawCqLJM2rP0VpSrWRa7btGNT4iOiw" alt="Image" width="247" height="41" loading="lazy"></p>
<p>This may look unusual at first but in words we are simply updating the value of any action, V(a), by an amount that is either a little more if the outcome was good or a little less if the outcome was bad.</p>
<p>G is the <strong>Return</strong> and is simply the total reward obtained. Currently in our example, this is simply the terminal reward (+1 or -1 accordingly). We will come back to this later when we include individual product rewards.</p>
<p>Alpha, α, is the <strong>Learning Rate</strong>. We will demonstrate how this affects the results later, but for now the simple explanation is: “The learning rate determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing, while a factor of 1 makes the agent consider only the most recent information.” (<a target="_blank" href="https://en.wikipedia.org/wiki/Q-learning">https://en.wikipedia.org/wiki/Q-learning</a>)</p>
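<p>To make the update concrete, here is a minimal Python sketch. It assumes the rule has the standard constant-α Monte Carlo form V(a) ← V(a) + α(G − V(a)), which matches the description in words above; the function name <code>update_value</code> is hypothetical.</p>

```python
def update_value(v, g, alpha):
    """Move the current value v a fraction alpha of the way towards the return g."""
    return v + alpha * (g - v)

v0 = 0.0
v1 = update_value(v0, g=1, alpha=0.5)   # good outcome: 0.0 -> 0.5
v2 = update_value(v1, g=-1, alpha=0.5)  # bad outcome: 0.5 -> -0.25
print(v1, v2)
```

<p>With α = 0 the value never changes, and with α = 1 it jumps straight to the latest return, which is the behaviour the quoted definition describes.</p>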
<h3 id="heading-small-demo-of-updating-values">Small Demo of Updating Values</h3>
<p>So how do we actually use this with our model?</p>
<p>Let us start with a table that has each product and its initial V_0(a):</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/dKkesqok94JNwSZHLDBzEdKnmyKP5NmqwERQ" alt="Image" width="140" height="311" loading="lazy"></p>
<p>We now pick a random selection of products; each combination is known as an <strong>episode</strong>. We also set α=0.5 for now, just for simplicity in the calculations.</p>
<p>For example:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/rAPEHqEY3U0XJekgOwusn6dYbxpQgHQ4ghhl" alt="Image" width="314" height="305" loading="lazy"></p>
<p>Therefore, all actions that lead to this positive outcome are updated as well to produce the following table with V1(a):</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/GSmEiJ8IZIYsbsUaieXz4jBQfjeL3wzch1HC" alt="Image" width="191" height="313" loading="lazy"></p>
<p>So let us try again by picking another random episode:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/r7aeYiWCHv9zR0d9e5E2QbbKaN17BjllHdz-" alt="Image" width="397" height="534" loading="lazy"></p>
<p>Therefore, we can add V2(a) to our table:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/E3Y525fxV7LX8vcdTvqVNyaGT0GAQ8vkLb4I" alt="Image" width="263" height="320" loading="lazy"></p>
<h3 id="heading-action-selection">Action Selection</h3>
<p>You may have noticed in the demo, I have simply randomly selected the products in each episode. We could do this, but using a completely random selection process may mean that some actions are not selected often enough to know whether they are good or bad.</p>
<p>Similarly, if we went another way and decided to select the products greedily, i.e. the ones that currently have the best value, we may miss one that is in fact better but was never given a chance. For example, if we chose the best actions from V2(a) we would get a2, b1, c1 and d2 or d3, which both provide a positive terminal reward. Therefore, if we used a purely greedy selection process, we would never consider any other products, as these continue to provide a positive outcome.</p>
<p>Instead, we implement <strong>epsilon-greedy</strong> action selection where we randomly select products with probability ϵ, and greedily select products with probability 1−ϵ where:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/w7prUkGhYtx6PfE16dfGTFC2QTZG0DhbEWRy" alt="Image" width="92" height="36" loading="lazy"></p>
<p>This means that we are going to reach the optimal choice of products quickly, as we continue to test whether the ‘good’ products are in fact optimal. But it also leaves room for us to explore other products occasionally, just to make sure they aren’t as good as our current choice.</p>
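<p>A minimal Python sketch of epsilon-greedy selection for a single ingredient might look like the following. The function name <code>epsilon_greedy</code>, the random tie-breaking and the example values are assumptions for illustration rather than the code from the original kernel.</p>

```python
import random

def epsilon_greedy(values, epsilon, rng=random):
    """Pick a product for one ingredient from a {product: V(a)} mapping."""
    if rng.random() < epsilon:
        # Explore: choose any product uniformly at random.
        return rng.choice(list(values))
    # Exploit: choose a product with the highest current value,
    # breaking ties at random so equally valued products all get a chance.
    best = max(values.values())
    return rng.choice([p for p, v in values.items() if v == best])

d_values = {"d1": -0.25, "d2": 0.5, "d3": 0.5}  # illustrative values only
print(epsilon_greedy(d_values, epsilon=0.2))
```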
<h3 id="heading-building-and-applying-our-model">Building and Applying our Model</h3>
<p>We are now ready to build a simple model as shown in the MCModelv1 function below.</p>
<p>Although this seems complex, I have done nothing more than apply the methods previously discussed in such a way that we can vary the inputs and still obtain results. Admittedly, this was my first attempt at doing this and so my coding may not be perfectly written but should be sufficient for our requirements.</p>
<p>To calculate the terminal reward, we currently use the following condition to check if the total cost is less or more than our budget:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/3Wqi4imz2FQDmGOWAj96pNV3lhw6PcEv-BwO" alt="Image" width="800" height="121" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Sh2bBPQoanKimfAFPNLATlzHIRVZeMsapqVI" alt="Image" width="653" height="339" loading="lazy"></p>
<p><strong>The full code for the model is too large to fit here nicely, but can be found at the linked <a target="_blank" href="https://www.kaggle.com/osbornep/reinforcement-learning-for-meal-planning-in-python/notebook">Kaggle</a> page.</strong></p>
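<p>For readers who want the overall shape without opening the kernel, here is a heavily simplified Python sketch of the loop described so far. It is not the MCModelv1 function: the name <code>run_monte_carlo</code>, its parameters and its data layout are hypothetical, the update assumes the constant-α form V(a) ← V(a) + α(G − V(a)), the cost of d3 is inferred from the £25 total quoted later in this article, and the cost of d2 is a placeholder assumption.</p>

```python
import random

# Costs in pounds. d3 is inferred from a total quoted later in the article
# and d2 is a placeholder (an assumption); the rest come from the examples.
COSTS = {
    "a": {"a1": 10, "a2": 6},
    "b": {"b1": 8, "b2": 11},
    "c": {"c1": 3, "c2": 7},
    "d": {"d1": 8, "d2": 5, "d3": 1},
}

def run_monte_carlo(budget, alpha, epsilon, episodes, seed=0):
    """Return the learned V(a) for every product after the given episodes."""
    rng = random.Random(seed)
    values = {ing: {p: 0.0 for p in opts} for ing, opts in COSTS.items()}
    for _ in range(episodes):
        # 1. Build one episode: pick a product for every ingredient.
        episode = {}
        for ing, vals in values.items():
            if rng.random() < epsilon:
                episode[ing] = rng.choice(list(vals))
            else:
                best = max(vals.values())
                episode[ing] = rng.choice([p for p, v in vals.items() if v == best])
        # 2. Only once the episode is complete can we score it.
        total = sum(COSTS[ing][p] for ing, p in episode.items())
        g = 1 if total <= budget else -1
        # 3. Update every product that took part in the episode.
        for ing, p in episode.items():
            values[ing][p] += alpha * (g - values[ing][p])
    return values

values = run_monte_carlo(budget=30, alpha=0.5, epsilon=0.5, episodes=100)
greedy = {ing: max(vals, key=vals.get) for ing, vals in values.items()}
print(greedy)
```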
<h4 id="heading-we-now-run-our-model-with-some-sample-variables">We now run our model with some sample variables:</h4>
<p><img src="https://cdn-media-1.freecodecamp.org/images/wsawgTD72Mfaf2GqTzg9-2EPXgWHV7Jcx2Cq" alt="Image" width="680" height="356" loading="lazy"></p>
<p>In our function, we have 6 outputs from the model:</p>
<ul>
<li>Mdl[0]: Returns the Sum of all V(a) for each episode</li>
<li>Mdl[1]: Returns the Sum of V(a) for the cheapest products, possible to define due to the simplicity of our sample data</li>
<li>Mdl[2]: Returns the Sum of V(a) for the non-cheapest products</li>
<li>Mdl[3]: Returns the optimal actions of the final episode</li>
<li>Mdl[4]: Returns the data table with the final V(a) added for each product</li>
<li>Mdl[5]: Shows the optimal action at each episode</li>
</ul>
<p>There is a lot to take away from these, so let us go through each and establish what we can learn to improve our model.</p>
<h4 id="heading-optimal-actions-of-final-episode">Optimal actions of final episode</h4>
<p>First, let’s see what the model suggests we should select. In this run it suggests actions, or products, that have a total cost below budget which is good.</p>
<p>However, there is still more that we can check to help us understand what is going on.</p>
<p>First, we can plot the total V for all actions, and we see that this is converging, which is ideal. We want our model to converge so that as we try more episodes we are ‘zoning-in’ on the optimal choice of products. The reason the output converges is that we are reducing the amount it learns each time by a factor of α, in this case 0.5. We will show later what happens if we vary this or don’t apply this at all.</p>
<p>We have also plotted the sum of V for the products we know are cheapest, based on being able to assess the small sample size, and the others separately. Again, both are converging positively although the cheaper products appear to have slightly higher values.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/4WTvyb0pa3be9kJmsvP8PQsIGSce2EdbQPLj" alt="Image" width="680" height="492" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/UFFckpQP0O2XLeE49eoTeaiVWmKY-fScgvJ0" alt="Image" width="677" height="763" loading="lazy"></p>
<h4 id="heading-so-why-is-this-happening-and-why-did-the-model-suggest-the-actions-it-did">So why is this happening and why did the model suggest the actions it did?</h4>
<p>To understand that, we need to dissect the suggestions made by the model at each episode and how this relates to our return.</p>
<p>Below, we have taken the optimal action for each state. We can see that the suggested actions do vary greatly between episodes and the model appears to decide which it wants to suggest very quickly.</p>
<p>Therefore, I have plotted the total cost of the suggested actions at each episode and we can see the actions vary initially then smooth out and the resulting total cost is below our budget. This helps us understand what is going on greatly.</p>
<p>So far, all we have told the model is to provide a selection that is below budget, and it has. It has simply found an answer that is below the budget as required.</p>
<p>So what is the next step? Before I introduce rewards I want to demonstrate what happens if I vary some of the parameters and what we can do if we decide to change what we want our model to suggest.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/A5ywpBkRB2P3G-8WCwYu1-HQ2lAuON565bS3" alt="Image" width="681" height="325" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/LrC00RigDKHw90YQMaAnKVxogZJ9urKzrtP8" alt="Image" width="297" height="851" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/pMFzDzpPZRDDzpSpUuEb7mUbhLCKHD93kldf" alt="Image" width="637" height="763" loading="lazy"></p>
<h3 id="heading-effect-of-changing-parameters-and-how-to-change-models-aim">Effect of Changing Parameters and How to Change Model’s Aim</h3>
<p>We have a few parameters that can be changed:</p>
<ol>
<li>The Budget</li>
<li>Our learning rate, α</li>
<li>Our action selection parameter, ϵ</li>
</ol>
<h4 id="heading-varying-budget">Varying Budget</h4>
<p>First, let us observe what happens if we make our budget either impossibly low or high.</p>
<p>A budget that is too small means we only ever obtain a negative reward, which forces our V to converge negatively, whereas a budget that is too high will cause our V to converge positively as all actions are continually positive.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/ClL2XcjSAPq4hzJWOLLygAyYiI4QS7-hmmYi" alt="Image" width="636" height="712" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/ePz3LMNDjQN1FQGHdcYHEF65gA9A1MOu7Ywh" alt="Image" width="638" height="721" loading="lazy"></p>
<p>The latter resembles what we had in our first run: many of the episodes lead to positive outcomes, so many combinations of products are possible and there is little distinction between the cheapest products and the rest.</p>
<p>If instead we consider a budget that is reasonably low given the prices of the products, we can see a trend where the cheapest products look to be converging positively and the more expensive products converging negatively. However, the smoothness of these is far from ideal, both appear to be oscillating greatly between each episode.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/9vYicIrKqHGM8unbhIyICWNeKLD7FRe0zKbm" alt="Image" width="639" height="610" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/sboascUNlUZ1xKs64W5x4g7Jd0tVOK4NYG3l" alt="Image" width="362" height="460" loading="lazy"></p>
<p>So what can we do to reduce the ‘spikiness’ of the outputs? This leads us onto our next parameter, alpha.</p>
<h3 id="heading-varying-alpha">Varying Alpha</h3>
<h4 id="heading-a-good-explanation-of-what-is-going-on-with-our-output-due-to-alpha-is-described-by-stack-overflow-user-vishalthebeast">A good explanation of how alpha affects our output is given by Stack Overflow user VishalTheBeast:</h4>
<blockquote>
<p>“Learning rate tells the magnitude of step that is taken towards the solution.</p>
<p>It should not be too big a number as it may continuously oscillate around the minima and it should not be too small of a number else it will take a lot of time and iterations to reach the minima.</p>
<p>The reason why decay is advised in learning rate is because initially when we are at a totally random point in solution space we need to take big leaps towards the solution and later when we come close to it, we make small jumps and hence small improvements to finally reach the minima.</p>
<p>Analogy can be made as: in the game of golf when the ball is far away from the hole, the player hits it very hard to get as close as possible to the hole. Later when he reaches the flagged area, he choses a different stick to get accurate short shot.</p>
<p>So it’s not that he won’t be able to put the ball in the hole without choosing the short shot stick, he may send the ball ahead of the target two or three times. But it would be best if he plays optimally and uses the right amount of power to reach the hole. Same is for decayed learning rate.” — <a target="_blank" href="https://stackoverflow.com/questions/33011825/learning-rate-of-a-q-learning-agent">source</a></p>
</blockquote>
<p>To better demonstrate the effect of varying our alpha, I will be using an animated plot created using Plot.ly.</p>
<p>I have written a more detailed guide on how to do this <a target="_blank" href="https://towardsdatascience.com/creating-interactive-animation-for-parameter-optimisation-using-plot-ly-8136b2997db">here</a>.</p>
<p>In our first animation, we vary alpha between 1 and 0.1. This enables us to see that as we reduce alpha our output smooths somewhat, but it is still pretty rough.</p>
<p>However, even though the results are smoothing out, they are no longer converging in 100 episodes and, furthermore, the output seems to alternate between each alpha. This is due to a combination of small alphas requiring more episodes to learn and our action selection parameter epsilon being 0.5. Essentially, the output is still being decided by randomness half of the time, so our results are not converging within the 100 episode frame.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/aB38O-aTjeYWBdtMd9NRPAmkFpfCKfzR0qPB" alt="Image" width="638" height="644" loading="lazy"></p>
<p>Running this through our animated plots produces something similar to the following:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/wVLIs9ttJH3P27B50Xf9rCYg0x4YxU6otsy5" alt="Image" width="600" height="288" loading="lazy"></p>
<h3 id="heading-varying-epsilon">Varying Epsilon</h3>
<p>With the previous results in mind, we now fix alpha to be 0.05 and vary epsilon between 1 and 0 to show the effect of moving from completely random action selection to completely greedy action selection.</p>
<p>The graphs below show three snapshots from varying epsilon, but the animated version can be viewed in the <a target="_blank" href="https://www.kaggle.com/osbornep/reinforcement-learning-for-meal-planning-in-python/notebook">Kaggle</a> kernel.</p>
<p>We see that having a high epsilon creates very sporadic results. Therefore we should select something reasonably small like 0.2. Although having epsilon equal to 0 looks good because of how smooth the curve is, as we mentioned earlier, this may lead us to a choice very quickly that may not be the best. We want some randomness so the model can explore other actions if needed.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/MSioaunlsQvp2AADkQjHB0R6X6yAfTEAIn0Q" alt="Image" width="700" height="450" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Rnmcx-e31oLcKZA9-M4fyHHs5MOdTPoQXmdW" alt="Image" width="700" height="450" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/R8FGYGkQjh55aapbnQs3TS1dBsC7iU7uKLaX" alt="Image" width="700" height="450" loading="lazy"></p>
<h3 id="heading-increasing-the-number-of-episodes">Increasing the Number of Episodes</h3>
<p>Lastly, we can increase the number of episodes. I refrained from doing this sooner because we were running 10 models in a loop to output our animated graphs and this would have caused the time taken to run the model to explode.</p>
<p>We noted that a low alpha would require more episodes to learn so we can run our model for 1000 episodes.</p>
<p>However, we still notice that the output is oscillating, but, as mentioned before, this is due to our aim being simply to recommend a combination that is below budget. What this shows is that the model can’t find the single best combination when there are many that fit below our budget.</p>
<p>Therefore, what happens if we change our aim slightly so that we can use the model to find the cheapest combination of products?</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/2h7mS1jrLjv3KD47T77ARG-2tTscftXWGJUI" alt="Image" width="638" height="594" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/9xMMzsxigFz4Zx3n-womG45Q4qzWsYtvEBx4" alt="Image" width="364" height="445" loading="lazy"></p>
<h3 id="heading-changing-our-models-aim-to-find-the-cheapest-combination-of-products">Changing our Model’s Aim to Find the Cheapest Combination of Products</h3>
<p>The aim of this is to more clearly separate the cheapest products from the rest, and it nearly always provides us with the cheapest combination of products.</p>
<p>To do this, all we need to do is adapt our model slightly to provide a terminal reward that is relative to how far below or above budget the combination in the episode is.</p>
<p>This can be done by changing the calculation for return to:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/xW4GsM4rWI0XRPjxKYPn7dFmg5nz8DtuLeBM" alt="Image" width="638" height="114" loading="lazy"></p>
<p>We now see that the separation between the cheapest products and the others is emphasised.</p>
<p>This really demonstrates the flexibility of reinforcement learning and how easy it can be to adapt the model based on your aims.</p>
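<p>One way such a relative reward could be written is sketched below in Python. The scaling by the budget and the name <code>relative_terminal_reward</code> are assumptions for illustration; the exact expression used in the article is the one in the image above.</p>

```python
def relative_terminal_reward(total_cost, budget):
    """Positive when under budget, negative when over, scaled by the budget."""
    return (budget - total_cost) / budget

# Cheaper selections earn a larger reward; over-budget ones go negative.
for total in (20, 29, 32):
    print(total, round(relative_terminal_reward(total, budget=30), 3))
```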
<p><img src="https://cdn-media-1.freecodecamp.org/images/u8R2pcQCWZhl2tCIZ3nu2cFF90oMWhISSXJy" alt="Image" width="637" height="761" loading="lazy"></p>
<h3 id="heading-introducing-preferences">Introducing Preferences</h3>
<p>So far, we have not included any personal preferences towards products. If we wanted to include this, we can simply introduce rewards for each product whilst still having a terminal reward that encourages the model to be below budget.</p>
<p>This can be done by changing the calculation for return to:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/8S-ifGXjN3WYCyYpGgC2LYU2EJukmVB5XUo2" alt="Image" width="627" height="110" loading="lazy"></p>
<p>So why is our return calculation now like this?</p>
<p>Well firstly, we still want our combination to be below budget, so we provide the positive and negative rewards for being below and above budget respectively.</p>
<p>Next, we want to account for the reward of each product. For our purposes, we define the rewards to be a value between 0 and 1. MC return is formally calculated using the following:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1vJAlBcCYSLMbG41EbcNi5TkW191DHCXR2hz" alt="Image" width="141" height="59" loading="lazy"></p>
<p>γ is the discount factor and this tells us how much we value later steps compared to earlier steps. In our case, all actions are equally as important to reaching the desired outcome of being below budget so we set γ=1.</p>
<p>However, to ensure that we reach the primary goal of being below budget, we take the average of the sum of the rewards for each action so that this will always be less than 1 or -1 respectively.</p>
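<p>As a rough Python sketch of this idea, the return can be built from the ±1 terminal reward plus the mean of the individual product rewards. The function name <code>mc_return</code>, the example rewards and the exact way the two parts are combined are assumptions; the article's own expression is the one shown in the image above.</p>

```python
def mc_return(total_cost, budget, rewards):
    """Terminal reward of +1 or -1 plus the mean of the per-product rewards."""
    terminal = 1 if total_cost <= budget else -1
    # With gamma = 1 the individual rewards are simply summed; dividing by
    # the number of products keeps that part between 0 and 1, so an
    # over-budget selection can never earn a positive return.
    return terminal + sum(rewards) / len(rewards)

rewards = [0.8, 0.6, 0.5, 0.4]  # illustrative preferences, each in [0, 1]
print(mc_return(25, budget=30, rewards=rewards))  # under budget
print(mc_return(32, budget=30, rewards=rewards))  # over budget
```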
<p>Again, the full model can be found in the <a target="_blank" href="https://www.kaggle.com/osbornep/reinforcement-learning-for-meal-planning-in-python/notebook">Kaggle</a> kernel but is too large to include here.</p>
<h3 id="heading-introducing-preferences-using-rewards">Introducing Preferences using Rewards</h3>
<p>Say we decided we wanted products a1 and b2; we could add a reward to each. Let us see what happens if we do this in the output and graphs below. We have changed our budget slightly, as a1 and b2 add up to £21, which means there is no way to select two more products that would put it below a budget of £23.</p>
<p>Applying a very high reward forces the model to pick a1 and b2 then work around to find products that will put it under our budget.</p>
<p>I have kept in the comparison between the cheapest products and the rest to show that the model is no longer favouring the cheapest products. Instead we get the output a1, b2, c1 and d3, which has a total cost of £25. This is both below our budget and includes our preferred products.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/4CBCAhqhP1HzYgZgYyK4ugbTUHHdCHl0uEg0" alt="Image" width="636" height="768" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Szh6HaR-2PnC7tACxh9OI4B3z5YMUpEo-vyP" alt="Image" width="350" height="471" loading="lazy"></p>
<p>Let’s try one more reward signal. This time, I give some reward to each but want it to provide the best combination from my rewards that still keeps us below budget.</p>
<p>We have the following rewards:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/H-EJrGUSdqRcbj5QpAQ29VxP8YbGiGF8Jvyy" alt="Image" width="110" height="234" loading="lazy"></p>
<p>Running this model a few times shows that it would:</p>
<ul>
<li>Often select a1, as this has a much higher reward</li>
<li>Always pick c1, as the rewards are the same but it is cheaper</li>
<li>Have a hard time selecting between b1 and b2, as the rewards are 0.5 and 0.6 but the costs are £8 and £11 respectively</li>
<li>Typically select d3, as it is significantly cheaper than d1 even though its reward is slightly lower</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/f7kOKIZRdMFsAY5DgTUEK9xb5JOkpMfKjn5n" alt="Image" width="639" height="762" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1e31yEJBL6uufYC8-ByEst2QyrZMCx-gFggI" alt="Image" width="374" height="464" loading="lazy"></p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>We have managed to build a Monte Carlo Reinforcement Learning model to:</p>
<ol>
<li>recommend products below a budget,</li>
<li>recommend the cheapest products, and</li>
<li>recommend the best products based on a preference that is still below a budget.</li>
</ol>
<p>Along the way, we have demonstrated the effect of changing parameters in reinforcement learning and how understanding these enables us to reach a desired result.</p>
<p>There is much more that we could do. In my mind, the end goal would be to apply this to a real recipe and real products from a supermarket, where the increased number of ingredients and products needs to be accounted for.</p>
<p>I created this sample data and problem to better my understanding of Reinforcement Learning and hope that you find it useful.</p>
<p>Thanks for reading!</p>
<p>Sterling Osborne</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to use AI to play Sonic the Hedgehog. It’s NEAT! ]]>
                </title>
                <description>
                    <![CDATA[ By Vedant Gupta Generation after generation, humans have adapted to become more fit with our surroundings. We started off as primates living in a world of eat or be eaten. Eventually we evolved into who we are today, reflecting modern society. Throug... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-ai-to-play-sonic-the-hedgehog-its-neat-9d862a2aef98/</link>
                <guid isPermaLink="false">66c35585a6c3eebadae8d2d9</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ gaming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Reinforcement Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 02 Apr 2019 16:28:11 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*RYknGhjxRw8arZlI-_ib4A.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Vedant Gupta</p>
<p>Generation after generation, humans have adapted to become more fit with our surroundings. We started off as primates living in a world of eat or be eaten. Eventually we evolved into who we are today, reflecting modern society. Through the process of evolution we become smarter. We are able to work better with our environment and accomplish what we need to.</p>
<p>The concept of learning through evolution can also be applied to Artificial Intelligence. We can train AIs to perform certain tasks using NEAT, NeuroEvolution of Augmenting Topologies. Simply put, NEAT is an algorithm which takes a batch of AIs (genomes) attempting to accomplish a given task. The top performing AIs “breed” to create the next generation. This process continues until we have a generation which is capable of completing what it needs to.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/kB3DscigQ-nDtQhS5em32jfdsFdAKp236CXt" alt="Image" width="600" height="337" loading="lazy">
<em>Clip of AI playing STH</em></p>
<p>NEAT is amazing because it eliminates the need for pre-existing data required to train our AIs. Using the power of NEAT and OpenAI’s Gym Retro I trained an AI to play Sonic the Hedgehog for the SEGA Genesis. Let’s learn how!</p>
<h3 id="heading-a-neat-neural-network-python-implementation">A NEAT Neural Network (Python Implementation)</h3>
<h4 id="heading-github-repository">GitHub Repository</h4>
<p><a target="_blank" href="https://github.com/Vedant-Gupta523/sonicNEAT"><strong>Vedant-Gupta523/sonicNEAT</strong></a><br><a target="_blank" href="https://github.com/Vedant-Gupta523/sonicNEAT">_Contribute to Vedant-Gupta523/sonicNEAT development by creating an account on GitHub._github.com</a></p>
<p><strong>Note:</strong> All of the code in this article and the repo above is a slightly modified version of Lucas Thompson's Sonic AI Bot Using Open-AI and NEAT <a target="_blank" href="https://www.freecodecamp.org/news/how-to-use-ai-to-play-sonic-the-hedgehog-its-neat-9d862a2aef98/Sonic%20AI%20Bot%20Using%20Open-AI%20and%20NEAT%20Tutorial">YouTube tutorials</a> and <a target="_blank" href="https://gitlab.com/lucasrthompson/Sonic-Bot-In-OpenAI-and-NEAT">code</a>.</p>
<h4 id="heading-understanding-openai-gym">Understanding OpenAI Gym</h4>
<p>If you are not already familiar with OpenAI Gym, look through the terms below. They are used frequently throughout the article.</p>
<p><strong>agent —</strong> The AI player. In this case it will be Sonic.</p>
<p><strong>environment —</strong> The complete surroundings of the agent. The game environment.</p>
<p><strong>action —</strong> Something the agent has the option of doing (i.e. move left, move right, jump, do nothing).</p>
<p><strong>step —</strong> Performing 1 action.</p>
<p><strong>state —</strong> A frame of the environment. The current situation the AI is in.</p>
<p><strong>observation —</strong> What the AI observes from the environment.</p>
<p><strong>fitness —</strong> How well our AI is performing.</p>
<p><strong>done —</strong> When the AI has completed its task or can’t continue any further.</p>
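<p>The terms above fit together in a single loop. The sketch below uses a made-up one-dimensional environment (<code>ToyEnv</code> is hypothetical and is not part of Gym Retro) that only mimics the <code>reset</code> and <code>step</code> calls of a Gym-style environment, with a random agent standing in for the AI:</p>

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a Gym-style environment (not Gym Retro)."""

    def __init__(self, goal=5):
        self.goal = goal
        self.x = 0

    def reset(self):
        # Put the agent back at the start and return the first observation.
        self.x = 0
        return self.x

    def step(self, action):
        # action: +1 moves right, -1 moves left, 0 does nothing.
        old_x = self.x
        self.x = max(0, self.x + action)
        reward = float(self.x - old_x)   # progress to the right is rewarded
        done = self.x >= self.goal       # the task is complete at the goal
        info = {"x": self.x}             # extra game variables
        return self.x, reward, done, info


random.seed(0)
env = ToyEnv()
ob = env.reset()
fitness = 0.0
done = False
steps = 0
while not done and steps < 1000:
    action = random.choice([-1, 0, 1])       # the agent picks an action
    ob, rew, done, info = env.step(action)   # performing it is one step
    fitness += rew                           # fitness accumulates the rewards
    steps += 1
print(steps, fitness, done)
```

<p>Here the observation is just a number. In the Sonic environment it is a frame of the game, but the loop has the same shape.</p>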
<h4 id="heading-installing-dependencies">Installing Dependencies</h4>
<p>Below are GitHub links for OpenAI’s Gym Retro and the neat-python library, each with installation instructions.</p>
<p><strong>OpenAI</strong>: <a target="_blank" href="https://github.com/openai/retro">https://github.com/openai/retro</a></p>
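<p><strong>NEAT</strong>: <a target="_blank" href="https://github.com/CodeReclaimers/neat-python">https://github.com/CodeReclaimers/neat-python</a></p>
<p><strong>Pip install</strong> the remaining libraries, such as OpenCV (imported as cv2 and installed with <code>pip install opencv-python</code>) and numpy. The pickle module ships with Python and needs no installation.</p>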
<h4 id="heading-import-libraries-and-set-environment">Import libraries and set environment</h4>
<p>To start, we need to import all of the modules we will use:</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> retro
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> cv2
<span class="hljs-keyword">import</span> neat
<span class="hljs-keyword">import</span> pickle
</code></pre>
<p>We will also define our environment, consisting of the game and the state:</p>
<pre><code class="lang-py">env = retro.make(game = <span class="hljs-string">"SonicTheHedgehog-Genesis"</span>, state = <span class="hljs-string">"GreenHillZone.Act1"</span>)
</code></pre>
<p>In order to train an AI to play Sonic the Hedgehog, you will need the game’s ROM (game file). The simplest way to get it is by purchasing the game on <a target="_blank" href="https://store.steampowered.com/app/71113/Sonic_The_Hedgehog/">Steam</a> for $5. You could also find free downloads of the ROM online; however, that is illegal, so don’t do it.</p>
<p>In the OpenAI repository at <strong>retro/retro/data/stable/</strong> you will find a folder for Sonic the Hedgehog Genesis. Place the game’s ROM here and make sure it is called rom.md. This folder also contains .state files. You can choose one and set the state parameter equal to it. I chose GreenHillZone Act 1 since it is the very first level of the game.</p>
<h4 id="heading-understanding-datajson-and-scenariojson">Understanding data.json and scenario.json</h4>
<p>In the Sonic the Hedgehog folder you will have these two files:</p>
<p><strong>data.json</strong></p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"info"</span>: {
    <span class="hljs-attr">"act"</span>: {
      <span class="hljs-attr">"address"</span>: <span class="hljs-number">16776721</span>,
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"|u1"</span>
    },
    <span class="hljs-attr">"level_end_bonus"</span>: {
      <span class="hljs-attr">"address"</span>: <span class="hljs-number">16775126</span>,
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"|u1"</span>
    },
    <span class="hljs-attr">"lives"</span>: {
      <span class="hljs-attr">"address"</span>: <span class="hljs-number">16776722</span>,
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"|u1"</span>
    },
    <span class="hljs-attr">"rings"</span>: {
      <span class="hljs-attr">"address"</span>: <span class="hljs-number">16776736</span>,
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"&gt;u2"</span>
    },
    <span class="hljs-attr">"score"</span>: {
      <span class="hljs-attr">"address"</span>: <span class="hljs-number">16776742</span>,
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"&gt;u4"</span>
    },
    <span class="hljs-attr">"screen_x"</span>: {
      <span class="hljs-attr">"address"</span>: <span class="hljs-number">16774912</span>,
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"&gt;u2"</span>
    },
    <span class="hljs-attr">"screen_x_end"</span>: {
      <span class="hljs-attr">"address"</span>: <span class="hljs-number">16774954</span>,
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"&gt;u2"</span>
    },
    <span class="hljs-attr">"screen_y"</span>: {
      <span class="hljs-attr">"address"</span>: <span class="hljs-number">16774916</span>,
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"&gt;u2"</span>
    },
    <span class="hljs-attr">"x"</span>: {
      <span class="hljs-attr">"address"</span>: <span class="hljs-number">16764936</span>,
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"&gt;i2"</span>
    },
    <span class="hljs-attr">"y"</span>: {
      <span class="hljs-attr">"address"</span>: <span class="hljs-number">16764940</span>,
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"&gt;u2"</span>
    },
    <span class="hljs-attr">"zone"</span>: {
      <span class="hljs-attr">"address"</span>: <span class="hljs-number">16776720</span>,
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"|u1"</span>
    }
  }
}
</code></pre>
<p><strong>scenario.json</strong></p>
<pre><code class="lang-py">{
  <span class="hljs-string">"done"</span>: {
    <span class="hljs-string">"variables"</span>: {
      <span class="hljs-string">"lives"</span>: {
        <span class="hljs-string">"op"</span>: <span class="hljs-string">"zero"</span>
      }
    }
  },
  <span class="hljs-string">"reward"</span>: {
    <span class="hljs-string">"variables"</span>: {
      <span class="hljs-string">"x"</span>: {
        <span class="hljs-string">"reward"</span>: <span class="hljs-number">10.0</span>
      }
    }
  }
}
</code></pre>
<p>Both these files contain important information pertaining to the game and its training.</p>
<p>As it sounds, the data.json file contains information/data on different game specific variables (i.e. Sonic’s x-position, number of lives he has, etc.).</p>
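<p>Each entry in data.json pairs a memory address with a type string. Those strings read like NumPy-style type codes: a byte-order character, then <code>u</code> (unsigned) or <code>i</code> (signed), then the size in bytes. Assuming that reading, the sketch below decodes a few made-up byte values with Python's <code>struct</code> module. The <code>STRUCT_FORMATS</code> mapping and the sample bytes are illustrative and are not taken from Gym Retro itself:</p>

```python
import struct

# Assumed mapping from the type strings in data.json to struct formats.
# ">" = big-endian, "|" = byte order not applicable (single byte),
# "u" = unsigned integer, "i" = signed integer, trailing digit = size in bytes.
STRUCT_FORMATS = {
    "|u1": "B",
    ">u2": ">H",
    ">u4": ">I",
    ">i2": ">h",
}


def decode(raw_bytes, type_string):
    """Decode raw memory bytes according to a data.json type string."""
    (value,) = struct.unpack(STRUCT_FORMATS[type_string], raw_bytes)
    return value


print(decode(b"\x03", "|u1"))       # a one-byte value, the type used for lives
print(decode(b"\x01\x2c", ">u2"))   # a two-byte unsigned value, the type used for rings
print(decode(b"\xff\xf6", ">i2"))   # a two-byte signed value, the type used for x
print(hex(16776722))                # the "lives" address written in hexadecimal
```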
<p>The scenario.json file allows us to perform actions in sync with the values of the data variables. For example we can reward Sonic 10.0 every time his x-position increases. We could also set our done condition to true when Sonic’s lives hit 0.</p>
<h4 id="heading-understanding-neat-feedforward-configuration">Understanding NEAT feedforward configuration</h4>
<p>The config-feedforward file can be found in my GitHub repository linked above. It acts like a settings menu to set up our training. To point out a few simple settings:</p>
<pre><code class="lang-py">fitness_threshold     = <span class="hljs-number">10000</span> <span class="hljs-comment"># How fit we want Sonic to become</span>
pop_size              = <span class="hljs-number">20</span> <span class="hljs-comment"># How many Sonics per generation</span>
num_inputs            = <span class="hljs-number">1120</span> <span class="hljs-comment"># Number of inputs into our model</span>
num_outputs           = <span class="hljs-number">12</span> <span class="hljs-comment"># 12 buttons on Genesis controller</span>
</code></pre>
<p>There are tons of settings you can experiment with to see how they affect your AI’s training! To learn more about NEAT and the different settings in the feedforward configuration, I would highly recommend reading the documentation <a target="_blank" href="https://neat-python.readthedocs.io/en/latest/">here</a>.</p>
<h4 id="heading-putting-it-all-together-creating-the-training-file">Putting it all together: Creating the Training File</h4>
<p><strong>Setting up configuration</strong></p>
<p>Our feedforward configuration is defined and stored in the variable config.</p>
<pre><code class="lang-py">config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction, neat.DefaultSpeciesSet, neat.DefaultStagnation, <span class="hljs-string">'config-feedforward'</span>)
</code></pre>
<p><strong>Creating a function to evaluate each genome</strong></p>
<p>We start by creating the function, eval_genomes, which will evaluate our genomes (a genome could be compared to 1 Sonic in a population of Sonics). For each genome we reset the environment and sample a random action.</p>
<pre><code class="lang-py"><span class="hljs-keyword">for</span> genome_id, genome <span class="hljs-keyword">in</span> genomes:
        ob = env.reset()
        ac = env.action_space.sample()
</code></pre>
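<p>We will also record the dimensions of the game environment’s observation: its length, its width and its number of color channels. We divide the length and width by 8 to shrink the input our network has to process.</p>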
<pre><code class="lang-py">inx, iny, inc = env.observation_space.shape
inx = int(inx/<span class="hljs-number">8</span>)
iny = int(iny/<span class="hljs-number">8</span>)
</code></pre>
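<p>This downscaling is where the <code>num_inputs = 1120</code> setting in the configuration comes from. A quick check, assuming the Genesis observation has the shape (224, 320, 3); that shape is my assumption here and can be confirmed by printing <code>env.observation_space.shape</code>:</p>

```python
# Assumed observation shape for the Genesis: 224 rows, 320 columns, 3 colour channels.
obs_shape = (224, 320, 3)

inx, iny, inc = obs_shape
inx = int(inx / 8)
iny = int(iny / 8)

# One grayscale value per downscaled pixel is fed to the network.
num_inputs = inx * iny
print(inx, iny, num_inputs)
```

<p>If you change the scaling factor, <code>num_inputs</code> in the configuration file has to change with it.</p>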
<p>We create a <a target="_blank" href="https://searchenterpriseai.techtarget.com/definition/recurrent-neural-networks">recurrent neural network</a> (RNN) using the NEAT library and input the genome and our chosen configuration.</p>
<pre><code class="lang-py">net = neat.nn.recurrent.RecurrentNetwork.create(genome, config)
</code></pre>
<p>Finally, we define a few variables: current_max_fitness (the highest fitness in the current population), fitness_current (the current fitness of the genome), frame (the frame count), counter (to count the number of steps our agent takes), xpos (the x-position of Sonic), and done (whether or not we have reached our fitness goal).</p>
<pre><code class="lang-py">current_max_fitness = <span class="hljs-number">0</span>
fitness_current = <span class="hljs-number">0</span>
frame = <span class="hljs-number">0</span>
counter = <span class="hljs-number">0</span>
xpos = <span class="hljs-number">0</span>
done = <span class="hljs-literal">False</span>
</code></pre>
<p>While we have not reached our done requirement, we need to run the environment, increment our frame counter, and shape our observation to mimic that of the game (still for each genome).</p>
<pre><code class="lang-py">env.render()
frame += <span class="hljs-number">1</span>
ob = cv2.resize(ob, (inx, iny))
ob = cv2.cvtColor(ob, cv2.COLOR_BGR2GRAY)
ob = np.reshape(ob, (inx,iny))
</code></pre>
<p>We will take our observation and put it in a one-dimensional array, so that our RNN can understand it. We receive our output by feeding this array to our RNN.</p>
<pre><code class="lang-py">imgarray = []
imgarray = np.ndarray.flatten(ob)
nnOutput = net.activate(imgarray)
</code></pre>
<p>Using the output from the RNN our AI takes a step. From this step we can extract fresh information: a new observation, a reward, whether or not we have reached our done requirement, and information on variables in our data.json (info).</p>
<pre><code class="lang-py">ob, rew, done, info = env.step(nnOutput)
</code></pre>
<p>At this point we need to evaluate our genome’s fitness and whether or not it has met the done requirement.</p>
<p>We look at our “x” variable from data.json and check if it has surpassed the length of the level. If it has, we will increase our fitness by our fitness threshold signifying we are done.</p>
<pre><code class="lang-py">xpos = info[<span class="hljs-string">'x'</span>]

<span class="hljs-keyword">if</span> xpos &gt;= <span class="hljs-number">10000</span>:
        fitness_current += <span class="hljs-number">10000</span>
        done = <span class="hljs-literal">True</span>
</code></pre>
<p>Otherwise, we will increase our current fitness by the reward we earned from performing the step. We also check if we have a new highest fitness and adjust the value of our current_max_fitness accordingly.</p>
<pre><code class="lang-py">fitness_current += rew

<span class="hljs-keyword">if</span> fitness_current &gt; current_max_fitness:
        current_max_fitness = fitness_current
        counter = <span class="hljs-number">0</span>
<span class="hljs-keyword">else</span>:
        counter += <span class="hljs-number">1</span>
</code></pre>
<p>Lastly, we check if we are done or if our genome has gone 250 steps in a row without improving on its best fitness. If so, we end the run and print information on the genome which was simulated. Otherwise we keep looping until one of the two requirements has been satisfied.</p>
<pre><code class="lang-py"><span class="hljs-keyword">if</span> done <span class="hljs-keyword">or</span> counter == <span class="hljs-number">250</span>:
        done = <span class="hljs-literal">True</span>
        print(genome_id, fitness_current)

genome.fitness = fitness_current
</code></pre>
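<p>The fragments above all run inside one loop per genome. The stopping rule can be isolated as a small function. In this sketch, <code>run_until_stalled</code> and the list of rewards are hypothetical stand-ins for the real game loop and the rewards returned by <code>env.step</code>:</p>

```python
def run_until_stalled(rewards, patience=250):
    """Accumulate rewards until fitness has not improved for `patience` steps.

    `rewards` stands in for the per-step reward returned by env.step().
    Returns (fitness, steps_taken).
    """
    current_max_fitness = 0
    fitness_current = 0
    counter = 0
    steps = 0
    for rew in rewards:
        steps += 1
        fitness_current += rew
        if fitness_current > current_max_fitness:
            # New best: remember it and restart the patience counter.
            current_max_fitness = fitness_current
            counter = 0
        else:
            counter += 1
        if counter == patience:
            break
    return fitness_current, steps


# Sonic moves forward for 3 steps, then stands still.
rewards = [10, 10, 10] + [0] * 1000
print(run_until_stalled(rewards, patience=250))
```

<p>A genome that gets stuck is cut off 250 steps after its last improvement, so training time is not wasted on a Sonic that is standing still.</p>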
<p><strong>Defining the population, printing training stats, and more</strong></p>
<p>The absolute last thing we need to do is define our population, print out statistics from our training, save checkpoints (in case you want to pause and resume training), and pickle our winning genome.</p>
<pre><code class="lang-py">p = neat.Population(config)

p.add_reporter(neat.StdOutReporter(<span class="hljs-literal">True</span>))
stats = neat.StatisticsReporter()
p.add_reporter(stats)
p.add_reporter(neat.Checkpointer(<span class="hljs-number">1</span>))

winner = p.run(eval_genomes)

<span class="hljs-keyword">with</span> open(<span class="hljs-string">'winner.pkl'</span>, <span class="hljs-string">'wb'</span>) <span class="hljs-keyword">as</span> output:
    pickle.dump(winner, output, <span class="hljs-number">1</span>)
</code></pre>
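<p>Once training finishes, the pickled winner can be loaded back later. Here is a minimal sketch of the round trip, using a plain dictionary as a stand-in for the winning genome (the real object is a NEAT genome, and the values below are made up):</p>

```python
import os
import pickle
import tempfile

# Stand-in for the winning genome; in training this is the object returned by p.run().
winner = {"genome_id": 42, "fitness": 10000.0}

# Write to a temporary folder so the sketch does not touch your project files.
path = os.path.join(tempfile.mkdtemp(), "winner.pkl")

with open(path, "wb") as output:
    pickle.dump(winner, output, 1)

with open(path, "rb") as saved:
    loaded = pickle.load(saved)

print(loaded == winner)
```

<p>With the real genome, the loaded object can be passed to <code>neat.nn.recurrent.RecurrentNetwork.create</code> together with the same config, exactly as in the evaluation function, to rebuild the network for playback.</p>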
<p>All that’s left is the matter of running the program and watching Sonic slowly learn how to beat the level!</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/caPrOTLL9OmL9C2V3BLMmUYtx1g0ckxZF1wu" alt="Image" width="600" height="337" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/FsO5NOjcc5S9cQiDjO56TbQLlOQIzFfdwmwc" alt="Image" width="600" height="337" loading="lazy">
<em>Earlier generation vs Later generation</em></p>
<p><strong>To see all of the code put together check out the Training.py file in my GitHub repository.</strong></p>
<h4 id="heading-bonus-parallel-training">Bonus: Parallel Training</h4>
<p>If you have a multi-core CPU you can run multiple training simulations at once, which can considerably increase the rate at which you train your AI! Although I will not go through the specifics on how to do this in this article, I highly suggest you check the <strong>sonicTraning.py</strong> implementation in my GitHub repository.</p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>That’s all there is to it! With a few adjustments, this framework is applicable to any game for the NES, SNES, SEGA Genesis, and more. If you have any questions or you just want to say hello, feel free to email me at vedantgupta523[at]gmail[dot]com.</p>
<p>Also, be sure to check out Lucas Thompson's Sonic AI Bot Using Open-AI and NEAT <a target="_blank" href="https://www.freecodecamp.org/news/how-to-use-ai-to-play-sonic-the-hedgehog-its-neat-9d862a2aef98/Sonic%20AI%20Bot%20Using%20Open-AI%20and%20NEAT%20Tutorial">YouTube tutorials</a> and <a target="_blank" href="https://gitlab.com/lucasrthompson/Sonic-Bot-In-OpenAI-and-NEAT">code</a> to see what originally inspired this article.</p>
<h3 id="heading-key-takeaways">Key Takeaways</h3>
<ol>
<li><strong>Neuroevolution of Augmenting Topologies (NEAT)</strong> is an algorithm used to train AI to perform certain tasks. It is modeled after genetic evolution.</li>
<li><strong>NEAT</strong> eliminates the need for pre-existing data when training AI.</li>
<li><strong>OpenAI</strong>’s Gym Retro and <strong>NEAT</strong> can be combined in Python to train an AI to play a wide range of retro games.</li>
</ol>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to apply Reinforcement Learning to real life planning problems ]]>
                </title>
                <description>
                    <![CDATA[ By Sterling Osborne, PhD Researcher Recently, I have published some examples where I have created Reinforcement Learning models for some real life problems. For example, using Reinforcement Learning for Meal Planning based on a Set Budget and Persona... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-apply-reinforcement-learning-to-real-life-planning-problems-90f8fa3dc0c5/</link>
                <guid isPermaLink="false">66c34ef639769b84d9fe96d7</guid>
                
                    <category>
                        <![CDATA[ beginner ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Reinforcement Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 12 Mar 2019 21:35:16 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*fFnWJvxZ1SITxjduEkIRmg.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Sterling Osborne, PhD Researcher</p>
<p>Recently, I have published some examples where I have created Reinforcement Learning models for some real life problems. For example, using <a target="_blank" href="https://towardsdatascience.com/reinforcement-learning-for-meal-planning-based-on-meeting-a-set-budget-and-personal-preferences-9624a520cce4">Reinforcement Learning for Meal Planning based on a Set Budget and Personal Preferences</a>.</p>
<p>Reinforcement Learning can be used in this way for a variety of planning problems including travel plans, budget planning and business strategy. The two advantages of using RL are that it takes into account the probability of outcomes and allows us to control parts of the environment. Therefore, I decided to write a simple example so others may consider how they could start using it to solve some of their day-to-day or work problems.</p>
<h4 id="heading-what-is-reinforcement-learning">What is Reinforcement Learning?</h4>
<p>Reinforcement Learning (RL) is the process of testing which actions are best for each state of an environment, essentially by trial and error. The model starts with a random policy, and each time an action is taken a value (known as a reward) is fed back to the model. This continues until an end goal is reached, e.g. you win or lose the game, at which point that run (or episode) ends and the game resets.</p>
<p>As the model goes through more and more episodes, it begins to learn which actions are more likely to lead us to a positive outcome. Therefore it finds the best actions in any given state, known as the optimal policy.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/iRylN6xnM2dhQWN23vLuFMUiidpIPavDEQYE" alt="Image" width="710" height="273" loading="lazy">
<em>Reinforcement Learning General Process</em></p>
<p>Many of the RL applications online train models on a game or virtual environment where the model is able to interact with the environment repeatedly. For example, you let the model play a simulation of tic-tac-toe over and over so that it observes success and failure of trying different moves.</p>
<p>In real life, it is likely we do not have access to train our model in this way. For example, a recommendation system in online shopping needs a person’s feedback to tell us whether it has succeeded or not, and this is limited in its availability based on how many users interact with the shopping site.</p>
<p>Instead, we may have sample data that shows shopping trends over a time period that we can use to create estimated probabilities. Using these, we can create what is known as a Partially Observable Markov Decision Process (POMDP) as a way to generalise the underlying probability distribution.</p>
<h4 id="heading-partially-observed-markov-decision-processes-pomdps">Partially Observable Markov Decision Processes (POMDPs)</h4>
<p>Markov Decision Processes (MDPs) provide a framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. The key feature of MDPs is that they follow the Markov Property; all future states are independent of the past given the present. In other words, the probability of moving into the next state is only dependent on the current state.</p>
<p>POMDPs work similarly, except that they are a generalisation of MDPs. In short, this means the model cannot simply interact with the environment but is instead given a set probability distribution based on what we have observed. More info can be found <a target="_blank" href="http://www.pomdp.org/tutorial/">here</a>. We could use value iteration methods on our POMDP, but instead I’ve decided to use Monte Carlo Learning in this example.</p>
<h3 id="heading-example-environment">Example Environment</h3>
<p>Imagine you are back at school (or perhaps still are) and are in a classroom. The teacher has a strict policy on paper waste and requires that any pieces of scrap paper <strong>must</strong> be passed to him at the front of the classroom, where he will place the waste into the bin (trash can).</p>
<p>However, some students in the class care little for the teacher’s rules and would rather save themselves the trouble of passing the paper round the classroom. Instead, these troublesome individuals may choose to throw the scrap paper into the bin from a distance. Now this angers the teacher and those that do this are punished.</p>
<p>This introduces a very basic action-reward concept, and we have an example classroom environment as shown in the following diagram.</p>
<p>Our aim is to find the best instructions for each person so that the paper reaches the teacher and is placed into the bin by him, rather than being thrown into the bin by a student.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/IHXBDyRJ7C6BKFoubXntHGPyZaWwqgFKN4ly" alt="Image" width="800" height="596" loading="lazy"></p>
<h4 id="heading-states-and-actions">States and Actions</h4>
<p>In our environment, each person can be considered a <strong>state</strong> and they have a variety of <strong>actions</strong> they can take with the scrap paper. They may choose to pass it to an adjacent classmate, hold onto it or, in some cases, throw it into the bin. We can therefore map our environment to a more standard grid layout as shown below.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/AQHonaFnjleHXYU5YvrZw0nTjGAgaorMsobB" alt="Image" width="800" height="627" loading="lazy"></p>
<p>This is purposefully designed so that each person, or state, has four actions: up, down, left or right, and each will have a varied ‘real life’ outcome based on who took the action. An action that puts the person into a wall (including the black block in the middle) indicates that the person holds onto the paper. In some cases this outcome is duplicated across actions, but that is not an issue in our example.</p>
<p>For example, person A’s actions result in:</p>
<ul>
<li>Up = Throw into bin</li>
<li>Down = Hold onto paper</li>
<li>Left = Pass to person B</li>
<li>Right = Hold onto paper</li>
</ul>
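<p>One way to hold this in code is a lookup from action to outcome for each state. The sketch below covers person A only, using the outcomes listed above; the names <code>PERSON_A_OUTCOMES</code> and <code>ACTIONS</code> are illustrative:</p>

```python
# Outcomes of each action for person A, as listed above.
PERSON_A_OUTCOMES = {
    "up": "throw into bin",
    "down": "hold onto paper",
    "left": "pass to person B",
    "right": "hold onto paper",
}

ACTIONS = ("up", "down", "left", "right")

# Two different actions can lead to the same real-life outcome.
holding_actions = [a for a in ACTIONS if PERSON_A_OUTCOMES[a] == "hold onto paper"]
print(holding_actions)
```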
<h3 id="heading-probabilistic-environment">Probabilistic Environment</h3>
<p>For now, the decision maker that partly controls the environment is us. We will tell each person which action they should take. This is known as the <strong>policy</strong>.</p>
<p>The first challenge I faced in my learning was understanding that the environment is likely probabilistic and what this means. In a probabilistic environment, when we instruct a state to take an action under our policy, there is a probability associated with whether the instruction is successfully followed. In other words, if we tell person A to pass the paper to person B, they can decide not to follow the instructed action in our policy and instead throw the scrap paper into the bin.</p>
<p>Another example is if we are recommending online shopping products there is no guarantee that the person will view each one.</p>
<h4 id="heading-observed-transitional-probabilities">Observed Transitional Probabilities</h4>
<p>To find the observed transitional probabilities, we need to collect some sample data about how the environment acts. Before we collect information, we first introduce an initial policy. To start the process, I have randomly chosen one that looks as though it would lead to a positive outcome.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/XAfDzZytX9vrhxlhjhoCkezoOYRqOPJM0NUt" alt="Image" width="800" height="638" loading="lazy"></p>
<p>Now we observe the actions each person takes given this policy. In other words, say we sat at the back of the classroom and simply observed the class and observed the following results for person A:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/-bldySGAig2zQknZl4Q6Ou6KroYpw62nhd7Y" alt="Image" width="294" height="290" loading="lazy">
<em>Person A’s Observed Actions</em></p>
<p>We see that a paper passed through this person 20 times; 6 times they kept hold of it, 8 times they passed it to person B and another 6 times they threw it in the trash. This means that under our initial policy, the probabilities of this person keeping hold of the paper and of throwing it in the trash are each 6/20 = 0.3, and the probability of passing it to person B is 8/20 = 0.4. We can observe the rest of the class to collect the following sample data:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/rLRsXRFIHwBZ956XoOmHNeNfqtSWP1Jjf36-" alt="Image" width="800" height="202" loading="lazy">
<em>Observed Real Life Outcome</em></p>
<p>Likewise, we then calculate the probabilities to be the following matrix and we could use this to simulate experience. The accuracy of this model will depend greatly on whether the probabilities are true representations of the whole environment. In other words, we need to make sure we have a sample that is large and rich enough in data.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/06wX9bX-2PCrssyTlgpObEvZAh9CPfeAO2I1" alt="Image" width="800" height="198" loading="lazy">
<em>Observed Transition Probability Function</em></p>
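<p>The step from observed counts to probabilities, and from probabilities to simulated experience, can be sketched for person A as follows. The counts are the ones quoted above; the outcome labels and the fixed random seed are only for illustration:</p>

```python
import random

# Observed counts for person A under the initial policy (20 papers in total).
observed_counts = {"hold": 6, "pass to B": 8, "throw in bin": 6}

total = sum(observed_counts.values())
probabilities = {outcome: count / total for outcome, count in observed_counts.items()}
print(probabilities)

# Simulate one piece of paper reaching person A, using the estimated probabilities.
rng = random.Random(0)
outcomes = list(probabilities)
weights = [probabilities[o] for o in outcomes]
simulated = rng.choices(outcomes, weights=weights, k=1)[0]
print(simulated)
```

<p>With only 20 observations per person these estimates are rough, which is why the size and richness of the sample matters.</p>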
<h3 id="heading-multi-armed-bandits-episodes-rewards-return-and-discount-rate">Multi-Armed Bandits, Episodes, Rewards, Return and Discount Rate</h3>
<p>So we have our transition probabilities estimated from the sample data under a POMDP. The next step, before we introduce any models, is to introduce rewards. So far, we have only discussed the outcome of the final step; either the paper gets placed in the bin by the teacher and nets a positive reward or gets thrown by A or M and nets a negative reward. This final reward that ends the episode is known as the <strong>Terminal Reward</strong>.</p>
<p>But there is also a third outcome that is less than ideal: the paper continually gets passed around and never reaches the bin (or takes far longer to do so than we would like). Therefore, in summary, we have three final outcomes:</p>
<ul>
<li>Paper gets placed in bin by teacher and nets a positive terminal reward</li>
<li>Paper gets thrown in bin by a student and nets a negative terminal reward</li>
<li>Paper gets continually passed around room or gets stuck on students for a longer period of time than we would like</li>
</ul>
<p>To avoid the paper being thrown in the bin we provide this with a large, negative reward, say -1, and because the teacher is pleased with it being placed in the bin this nets a large positive reward, +1. To avoid the outcome where it continually gets passed around the room, we set the reward for all other actions to be a small, negative value, say -0.04.</p>
<p>If we set this as a positive or null number then the model may let the paper go round and round as it would be better to gain small positives than risk getting close to the negative outcome. This number is also very small because an episode collects only a single terminal reward but could take many steps to end, and we need to ensure that, if the paper is placed in the bin, the positive outcome is not cancelled out.</p>
<p>Please note: the rewards are always relative to one another and I have chosen arbitrary figures, but these can be changed if the results are not as desired.</p>
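<p>To see how these figures interact, the sketch below adds up the rewards of an episode without any discounting. The constant and function names are illustrative; the values are the ones chosen above:</p>

```python
STEP_REWARD = -0.04        # every non-terminal action
TEACHER_BIN_REWARD = 1.0   # paper placed in the bin by the teacher
THROWN_REWARD = -1.0       # paper thrown in the bin by a student


def total_reward(n_steps_before_end, terminal_reward):
    """Undiscounted sum of rewards for an episode."""
    return n_steps_before_end * STEP_REWARD + terminal_reward


# How the positive outcome erodes as the paper takes more steps to arrive.
for n in (0, 5, 25, 30):
    print(n, round(total_reward(n, TEACHER_BIN_REWARD), 2))
```

<p>With a step reward of -0.04, the positive terminal reward is fully cancelled out after 25 steps, so a larger classroom might call for an even smaller step penalty.</p>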
<p>Although we have inadvertently discussed episodes in the example, we have yet to formally define an episode. <strong>An episode is simply the sequence of actions each paper goes through in the classroom until it reaches the bin, which is the terminal state and ends the episode</strong>. In other examples, such as playing tic-tac-toe, this would be the end of a game where you win or lose.</p>
<p>The paper could in theory start at any state, which is why we need enough episodes to ensure that every state and action is tested enough that our outcome is not being driven by invalid results. However, on the flip side, the more episodes we introduce the longer the computation time will be and, depending on the scale of the environment, we may not have an unlimited amount of resources to do this.</p>
<p>This is known as the <strong>Multi-Armed Bandit problem</strong>; with finite time (or other resources), we need to ensure that we test each state-action pair enough that the actions selected in our policy are, in fact, the optimal ones. In other words, we need to validate that actions that have led us to good outcomes in the past did so not by sheer luck but because they are in fact the correct choice, and likewise for the actions that appear poor. In our example this may seem simple with how few states we have, but imagine if we increased the scale and how this becomes more and more of an issue.</p>
<p>The overall goal of our RL model is to select the actions that maximise the expected cumulative rewards, known as the return. In other words, the <strong>Return</strong> is simply the total reward obtained for the episode. A simple way to calculate this would be to add up all the rewards, including the terminal reward, in each episode.</p>
<p>A more rigorous approach is to consider the first steps to be more important than later ones in the episode by applying a discount factor, gamma, in the following formula:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/N9nsFXRGlZtH0oKXLmdrBEIeN8fbumhvOIob" alt="Image" width="185" height="28" loading="lazy"></p>
<p>In other words, we sum all the rewards but weigh down later steps by a factor of gamma to the power of how many steps it took to reach them.</p>
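<p>As a minimal sketch of the discounted return described above, the function below weights the k-th reward of an episode by gamma to the power k. The reward values are hypothetical and chosen only for illustration; they are not the figures used in this article.</p>

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the k-th reward (k = 0, 1, 2, ...) by gamma**k."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


# Hypothetical episode: two passes with a small step cost, then a terminal reward.
episode_rewards = [-0.05, -0.05, 1.0]

print(discounted_return(episode_rewards, gamma=1.0))  # gamma = 1 is the plain sum
print(discounted_return(episode_rewards, gamma=0.5))  # later rewards count for less
```

<p>With gamma equal to 1 this reduces to the simple sum of rewards, while smaller values of gamma shrink the contribution of the terminal reward the later it arrives.</p>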
<p>If we think about our example, using a discounted return becomes even clearer to imagine as the teacher will reward (or punish accordingly) anyone who was involved in the episode but would scale this based on how far they are from the final outcome.</p>
<p>For example, if the paper passed from A to B to M, who threw it in the bin, M should be punished most, then B for passing it to M, and lastly person A, who is still involved in the final outcome but less so than M or B. This also emphasises that the longer it takes (in number of steps) to get from a state to the bin, the less that state will be rewarded or punished by the terminal outcome, although it will accumulate negative rewards for taking more steps.</p>
<h3 id="heading-applying-a-model-to-our-example">Applying a Model to our Example</h3>
<p>As our example environment is small, we can apply each and show some of the calculations performed manually and illustrate the impact of changing parameters.</p>
<p>For any algorithm, we first need to initialise the state value function, V(s), and have decided to set each of these to 0 as shown below.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/7zOfCiGWw6rsZ1teuxam57csTdFrYQtGrSqm" alt="Image" width="800" height="629" loading="lazy"></p>
<p>Next, we let the model simulate experience on the environment based on our observed probability distribution. The model starts a piece of paper in random states and the outcomes of each action under our policy are based on our observed probabilities. So for example, say we have the first three simulated episodes to be the following:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/b7jWB92ntwLYkqvYVuwAX59qd2nWRWMzGfPO" alt="Image" width="800" height="628" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/713X0b-2rPNLGnpEknBRVnVqgSwhj74fRPrd" alt="Image" width="800" height="628" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/EHE-LpztkPVMWIpKW2vJ8tpbVPP4xoHmfPRO" alt="Image" width="800" height="628" loading="lazy"></p>
<p>With these episodes we can calculate our first few updates to our state value function using each of the three models given. For now, we pick arbitrary alpha and gamma values of 0.5 to make our hand calculations simpler. We will show later the impact these parameters have on the results.</p>
<p>First, we apply temporal difference 0, TD(0), the simplest of our models. The first three value updates are as follows:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/PcrATwyOJ0aodrgmnjjQ1dUgfcaNso-YiXF2" alt="Image" width="337" height="271" loading="lazy"></p>
<p>So how have these been calculated? Well because our example is small we can show the calculations by hand.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/tvkHihHu-5zyh8WehUYJcpwkHKWb8MN3gIgk" alt="Image" width="339" height="148" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/yslW70VU1RLb3o7djaPCHPdw4kMiY1BBXibE" alt="Image" width="384" height="149" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/jcB4CtCcI7WNSTmFdt9aWlXXohBbZgjAqrQP" alt="Image" width="418" height="128" loading="lazy"></p>
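<p>The same kind of hand calculation can be sketched in code. The snippet below applies the standard TD(0) update, V(s) = V(s) + alpha * (r + gamma * V(s') - V(s)), along a single episode. The state names, the path and the reward values are hypothetical and are not taken from the figures above; only alpha = 0.5 and gamma = 0.5 match the text.</p>

```python
def td0_update(values, state, reward, next_state, alpha, gamma):
    """Apply one TD(0) update to values[state] in place and return the new value."""
    target = reward + gamma * values[next_state]
    values[state] += alpha * (target - values[state])
    return values[state]


# Hypothetical episode A -> B -> G -> bin; names and rewards are illustrative only.
values = {"A": 0.0, "B": 0.0, "G": 0.0, "bin": 0.0}
episode = [("A", -0.05, "B"), ("B", -0.05, "G"), ("G", 1.0, "bin")]

for state, reward, next_state in episode:
    td0_update(values, state, reward, next_state, alpha=0.5, gamma=0.5)

print(values)
```

<p>After this first pass only the state next to the bin has picked up the terminal reward. The earlier states have only seen their own step cost, which is why the terminal rewards need several episodes to spread to the rest of the states.</p>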
<p>So what can we observe at this early stage? Firstly, using TD(0) appears unfair to some states, for example person D, who, at this stage, has gained nothing from the paper reaching the bin two out of three times. Their update has only been affected by the value of the next state, but this emphasises how the positive and negative rewards propagate outwards from the corner to the other states.</p>
<p>As we take more episodes the positive and negative terminal rewards will spread out further and further across all states. This is shown roughly in the diagram below, where we can see that the two episodes that resulted in a positive result impact the value of states Teacher and G, whereas the single negative episode has punished person M.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/utemVff2wrMbdDOJo5PJbFSNhrhW7fPGgrj4" alt="Image" width="800" height="629" loading="lazy"></p>
<p>To show this, we can try more episodes. If we repeat the same three paths already given we produce the following state value function:</p>
<p><strong>(Please note, we have repeated these three episodes for simplicity in this example but the actual model would have episodes where the outcomes are based on the observed transition probability function.)</strong></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/RxDrPSqVzJqwwxEnQD-KXMtr1iXcPLWuhq9x" alt="Image" width="465" height="284" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/OMdszq1WLV91uHDHGPDAfySfuK35uoi8ieyu" alt="Image" width="800" height="635" loading="lazy"></p>
<p>The diagram above shows the terminal rewards propagating outwards from the top right corner to the states. From this, we may decide to update our policy, as it is clear that the negative terminal reward passes through person M and so B and C are impacted negatively. Based on V27, we may therefore update our policy by selecting, for each state, the neighbouring state with the best value, as shown in the figure below.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/BcBG9YpaPUC7JVm9Z6hvJBZVCaCnyUKbHoSe" alt="Image" width="800" height="638" loading="lazy"></p>
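<p>One way to sketch this policy update in code is to pick, for each state, the reachable next state with the highest value. The values and the neighbour lists below are hypothetical placeholders and do not reproduce V27 or the classroom layout from the figures.</p>

```python
def greedy_policy(values, neighbours):
    """For each state, pick the reachable next state with the highest value."""
    return {
        state: max(options, key=lambda nxt: values[nxt])
        for state, options in neighbours.items()
    }


# Hypothetical state values and reachable next states (illustrative only).
values = {"A": 0.0, "B": -0.1, "G": 0.4, "M": -0.5, "Teacher": 0.6}
neighbours = {"B": ["A", "M", "G"], "G": ["B", "Teacher"]}

print(greedy_policy(values, neighbours))
```

<p>Note that a state which has never been visited keeps its initial value of 0, so it can look better than a visited state with a negative value.</p>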
<p>There are two causes for concern in this example. The first is that person A’s best action is to throw it into the bin and net a negative reward. This is because none of the episodes have visited this person, and it emphasises the multi-armed bandit problem. Even in this small example, with very few states, many episodes are required to visit them all, and we need to ensure this is done.</p>
<p>The reason this action is better for this person is because neither of the terminal states have a value but rather the positive and negative outcomes are in the terminal rewards. We could then, if our situation required it, initialise V0 with figures for the terminal states based on the outcomes.</p>
<p>Secondly, the state value of person M is flipping back and forth between -0.03 and -0.51 (approx.) after the episodes and we need to address why this is happening. This is caused by our learning rate, alpha. For now, we have only introduced our parameters (the learning rate alpha and discount rate gamma) but have not explained in detail how they will impact results.</p>
<p>A large learning rate may cause the results to oscillate, but conversely it should not be so small that it takes forever to converge. This is shown further in the figure below, which plots the total V(s) for every episode: although there is a general increasing trend, the total oscillates back and forth between episodes. Another good explanation for learning rate is as follows:</p>
<p>“In the game of golf when the ball is far away from the hole, the player hits it very hard to get as close as possible to the hole. Later when he reaches the flagged area, he chooses a different stick to get accurate short shot.</p>
<p>So it’s not that he won’t be able to put the ball in the hole without choosing the short shot stick, he may send the ball ahead of the target two or three times. But it would be best if he plays optimally and uses the right amount of power to reach the hole.”</p>
<p><a target="_blank" href="https://stackoverflow.com/questions/33011825/learning-rate-of-a-q-learning-agent"><strong>Learning rate of a Q learning agent</strong></a> (stackoverflow.com)</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/picYFGSsrMSt1g88c09UBjtYQyA0vbsUXyjy" alt="Image" width="800" height="453" loading="lazy">
<em>Episode</em></p>
<p>There are some complex methods for establishing the optimal learning rate for a problem but, as with any machine learning algorithm, if the environment is simple enough you can iterate over different values until convergence is reached. This is a simple form of hyperparameter tuning. In a <a target="_blank" href="https://towardsdatascience.com/creating-interactive-animation-for-parameter-optimisation-using-plot-ly-8136b2997db">recent RL project</a>, I demonstrated the impact of reducing alpha using an animated visual and this is shown below. This demonstrates the oscillation when alpha is large and how this becomes smoothed as alpha is reduced.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/BR2fyHlcGvxCmCXSl1k-2QTjidKbM5n4S6tn" alt="Image" width="600" height="288" loading="lazy"></p>
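<p>The effect of alpha on oscillation can also be reproduced in a toy setting. The sketch below is not the classroom model: it updates a single value towards targets that alternate between +1 and -1 (a stand-in for a state that alternately sees good and bad outcomes) and measures how far the estimate still swings once it has settled. The function name and the targets are assumptions made for this illustration.</p>

```python
def steady_state_swing(alpha, steps=200):
    """Update one value towards targets alternating between +1 and -1.

    Returns the gap between the last two values, i.e. how far the estimate
    still swings once the updates have settled into a repeating pattern.
    """
    value = 0.0
    history = []
    for step in range(steps):
        target = 1.0 if step % 2 == 0 else -1.0
        value += alpha * (target - value)
        history.append(value)
    return abs(history[-1] - history[-2])


for alpha in (0.5, 0.2, 0.05):
    print(alpha, round(steady_state_swing(alpha), 3))
```

<p>The swing shrinks as alpha is reduced, at the cost of each update moving the estimate less, which mirrors the trade-off between oscillation and speed of learning.</p>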
<p>Likewise, our discount rate must be a number between 0 and 1; it is often taken to be close to 0.9. The discount factor tells us how important rewards in the future are: a large number indicates that they will be considered important, whereas moving this towards 0 will make the model consider future steps less and less.</p>
<p>With both of these in mind, we can change both alpha from 0.5 to 0.2 and gamma from 0.5 to 0.9 and we achieve the following results:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/seJwoVJfEbz2ckQLq9nZXrMbLMDhQvj5eYgY" alt="Image" width="465" height="284" loading="lazy"></p>
<p>Because our learning rate is now much smaller, the model takes longer to learn and the values are generally smaller. This is most noticeable for the Teacher, which is clearly the best state. However, in exchange for the increased computation time, our value for M is no longer oscillating to the degree it was before. We can now see this in the diagram below for the sum of V(s) following our updated parameters. Although it is not perfectly smooth, the total V(s) increases at a much smoother rate than before and appears to converge as we would like, but requires approximately 75 episodes to do so.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/BSscIv0WdzkFPYZN9wd97CvJoWdxMT5loprq" alt="Image" width="800" height="453" loading="lazy"></p>
<h3 id="heading-changing-the-goal-outcome">Changing the Goal Outcome</h3>
<p>Another crucial advantage of RL that we haven’t mentioned in too much detail is that we have some control over the environment. Currently, the rewards are based on what we decided would be best to get the model to reach the positive outcome in as few steps as possible.</p>
<p>However, say the teacher changed and the new one didn’t mind the students throwing the paper in the bin so long as it reached it. Then we can change our negative reward around this and the optimal policy will change.</p>
<p>This is particularly useful for business solutions. For example, say you are planning a strategy and know that certain transitions are less desired than others, then this can be taken into account and changed at will.</p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>We have now created a simple Reinforcement Learning model from observed data. There are many things that could be improved or taken further, including using a more complex model, but this should be a good introduction for those that wish to try and apply to their own real-life problems.</p>
<p>I hope you enjoyed reading this article. If you have any questions, please feel free to comment below.</p>
<p>Thanks</p>
<p>Sterling</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ An introduction to Q-Learning: reinforcement learning ]]>
                </title>
                <description>
                    <![CDATA[ By ADL This article is the second part of my “Deep reinforcement learning” series. The complete series shall be available both on Medium and in videos on my YouTube channel. In the first part of the series we learnt the basics of reinforcement learni... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/</link>
                <guid isPermaLink="false">66c3444e4f1fc448a3678fa5</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Reinforcement Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 03 Sep 2018 21:31:39 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/0*DX9ZRnzwmh2FImV-" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By ADL</p>
<p>This article is the second part of my “Deep reinforcement learning” series. The complete series shall be available both on <a target="_blank" href="https://medium.com/@alamba093">Medium</a> and in videos on <a target="_blank" href="https://www.youtube.com/channel/UCRkxhh51YKqpn2gaUI3MXjg">my YouTube channel</a>.</p>
<p>In the <a target="_blank" href="https://medium.freecodecamp.org/a-brief-introduction-to-reinforcement-learning-7799af5840db">first part of the series</a> we learnt the <strong>basics of reinforcement learning</strong>.</p>
<p>Q-learning is a value-based learning algorithm in reinforcement learning. In this article, we learn about Q-Learning and its details:</p>
<ul>
<li>What is Q-Learning?</li>
<li>Mathematics behind Q-Learning</li>
<li>Implementation using Python</li>
</ul>
<h3 id="heading-q-learning-a-simplistic-overview">Q-Learning — a simplistic overview</h3>
<p>Let’s say that a <strong>robot</strong> has to cross a <strong>maze</strong> and reach the end point. There are <strong>mines</strong>, and the robot can only move one tile at a time. If the robot steps onto a mine, the robot is dead. The robot has to reach the end point in the shortest time possible.</p>
<p>The scoring/reward system is as below:</p>
<ol>
<li>The robot loses 1 point at each step. This is done so that the robot takes the shortest path and reaches the goal as fast as possible.</li>
<li>If the robot steps on a mine, the point loss is 100 and the game ends.</li>
<li>If the robot gets power ⚡️, it gains 1 point.</li>
<li>If the robot reaches the end goal, the robot gets 100 points.</li>
</ol>
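<p>The scoring rules above could be written down as a small lookup, sketched below. The tile labels are hypothetical names chosen for this sketch, and adding the per-step penalty on top of the tile's own reward is an assumption about how the rules combine.</p>

```python
# Tile names are hypothetical labels chosen for this sketch.
STEP_PENALTY = -1
TILE_REWARDS = {"empty": 0, "power": 1, "mine": -100, "end": 100}
TERMINAL_TILES = {"mine", "end"}


def step_reward(tile):
    """Return (reward, episode_over) for moving onto the given tile.

    Assumption: the -1 step penalty is added to the tile's own reward.
    """
    return STEP_PENALTY + TILE_REWARDS[tile], tile in TERMINAL_TILES


for tile in ("empty", "power", "mine", "end"):
    print(tile, step_reward(tile))
```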
<p>Now, the obvious question is: <strong>How do we train a robot to reach the end goal with the shortest path without stepping on a mine?</strong></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/3JXI06jyHegMS1Yx8rhIq64gkYwSTM7ZhD25" alt="Image" width="345" height="300" loading="lazy"></p>
<p>So, how do we solve this?</p>
<h3 id="heading-introducing-the-q-table">Introducing the Q-Table</h3>
<p>Q-Table is just a fancy name for a simple lookup table where we store the maximum expected future reward for each action at each state. Basically, this table will guide us to the best action at each state.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/CcNuUwGnpHhRKkERqJJ6xl7N2W8jcl1yVdE8" alt="Image" width="345" height="291" loading="lazy"></p>
<p>There are four possible actions at each non-edge tile. When the robot is at a state, it can move up, down, right, or left.</p>
<p>So, let’s model this environment in our Q-Table.</p>
<p>In the Q-Table, the columns are the actions and the rows are the states.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/AjVvggEquHgsnMN8i4N35AMfx53vZtELEL-l" alt="Image" width="315" height="296" loading="lazy"></p>
<p>Each Q-table score will be the maximum expected future reward that the robot will get if it takes that action at that state. This is an iterative process, as we need to improve the Q-Table at each iteration.</p>
<p>But the questions are:</p>
<ul>
<li>How do we calculate the values of the Q-table?</li>
<li>Are the values available or predefined?</li>
</ul>
<p>To learn each value of the Q-table, we use the <strong>Q-Learning algorithm.</strong></p>
<h3 id="heading-mathematics-the-q-learning-algorithm">Mathematics: the Q-Learning algorithm</h3>
<h4 id="heading-q-function">Q-function</h4>
<p>The <strong>Q-function</strong> uses the Bellman equation and takes two inputs: state (<strong>s</strong>) and action (<strong>a</strong>).</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/s39aVodqNAKMTcwuMFlyPSy76kzAmU5idMzk" alt="Image" width="552" height="232" loading="lazy"></p>
<p>Using the above function, we get the values of <strong>Q</strong> for the cells in the table.</p>
<p>When we start, all the values in the Q-table are zeros.</p>
<p>There is an iterative process of updating the values. As we start to explore the environment<strong>,</strong> the Q-function gives us better and better approximations by continuously updating the Q-values in the table.</p>
<p>Now, let’s understand how the updating takes place.</p>
<h3 id="heading-introducing-the-q-learning-algorithm-process">Introducing the Q-learning algorithm process</h3>
<p><img src="https://cdn-media-1.freecodecamp.org/images/oQPHTmuB6tz7CVy3L05K1NlBmS6L8MUkgOud" alt="Image" width="800" height="450" loading="lazy"></p>
<p>Each of the colored boxes is one step. Let’s understand each of these steps in detail.</p>
<h4 id="heading-step-1-initialize-the-q-table"><strong>Step 1: initialize the Q-Table</strong></h4>
<p>We will first build a Q-table. There are n columns, where n = number of actions. There are m rows, where m = number of states. We will initialise the values at 0.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/TQ9Wy3guJHUecTf0YA5AuQgB9yVIohgLXKIn" alt="Image" width="322" height="308" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/gWnhK5oLqjcQkSzuuT8WgMVOGdCEp68Xvt6F" alt="Image" width="345" height="300" loading="lazy"></p>
<p>In our robot example, we have four actions (a=4) and five states (s=5). So we will build a table with four columns and five rows.</p>
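<p>A minimal sketch of this initialisation in plain Python, with one row per state and one column per action, is shown below. The helper name make_q_table is an assumption made for this example.</p>

```python
ACTIONS = ["up", "down", "left", "right"]
NUM_STATES = 5


def make_q_table(num_states, num_actions):
    """Return a num_states x num_actions table of zeros (rows: states, columns: actions)."""
    return [[0.0 for _ in range(num_actions)] for _ in range(num_states)]


q_table = make_q_table(NUM_STATES, len(ACTIONS))
for row in q_table:
    print(row)
```

<p>Each row is built as its own list, so updating the value for one state does not silently change another.</p>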
<h4 id="heading-steps-2-and-3-choose-and-perform-an-action"><strong>Steps 2 and 3: choose and perform an action</strong></h4>
<p>This combination of steps is repeated for an indefinite amount of time. This means that it runs until we stop the training, or until the training loop stops as defined in the code.</p>
<p>We will choose an action (a) in the state (s) based on the Q-Table. But, as mentioned earlier, when the episode initially starts, every Q-value is 0.</p>
<p>So now the concept of exploration and exploitation trade-off comes into play. <a target="_blank" href="https://medium.freecodecamp.org/a-brief-introduction-to-reinforcement-learning-7799af5840db">This article has more details</a>.</p>
<p>We’ll use something called the <strong>epsilon greedy strategy</strong>.</p>
<p>In the beginning, the epsilon rate will be high. The robot will explore the environment and randomly choose actions. The logic behind this is that the robot does not know anything about the environment.</p>
<p>As the robot explores the environment, the epsilon rate decreases and the robot starts to exploit the environment.</p>
<p>During the process of exploration, the robot progressively becomes more confident in estimating the Q-values.</p>
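<p>The epsilon greedy strategy can be sketched as follows. With probability epsilon the robot picks a random action (exploration), and otherwise it picks the action with the highest Q-value for the current state (exploitation). The decay schedule and its parameters are assumptions chosen for illustration, not values from this article.</p>

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))   # explore: random action index
    best = max(q_row)
    return q_row.index(best)               # exploit: first action with the best value


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Decay epsilon exponentially from start towards a floor of end."""
    return max(end, start * (decay ** episode))


q_row = [0.0, 5.0, 1.0, 2.0]               # hypothetical Q-values for one state
print(choose_action(q_row, epsilon=decayed_epsilon(0)))     # mostly random early on
print(choose_action(q_row, epsilon=decayed_epsilon(1000)))  # mostly greedy later
```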
<p><strong>For the robot example, there are four actions to choose from</strong>: up, down, left, and right. We are starting the training now — our robot knows nothing about the environment. So the robot chooses a random action, say right.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/k0IARc6DzE3NBl2ugpWkzwLkR9N4HRkpSpjw" alt="Image" width="644" height="311" loading="lazy"></p>
<p>We can now update the Q-values for being at the start and moving right using the Bellman equation.</p>
<h4 id="heading-steps-4-and-5-evaluate"><strong>Steps 4 and 5: evaluate</strong></h4>
<p>Now we have taken an action and observed an outcome and reward. We need to update the function Q(s,a).</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/TnN7ys7VGKoDszzv3WDnr5H8txOj3KKQ0G8o" alt="Image" width="598" height="299" loading="lazy"></p>
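<p>As a sketch, the standard Q-learning update can be written as a single function: the new Q(s,a) is the old value plus the learning rate times the difference between the target (the reward plus the discounted best Q-value of the next state) and the old value. The alpha and gamma values, and the reward used in the example call, are assumptions for illustration. Terminal states are not treated specially here.</p>

```python
def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Move Q(state, action) towards reward + gamma * max over a' of Q(next_state, a')."""
    best_next = max(q_table[next_state])
    td_error = reward + gamma * best_next - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return q_table[state][action]


# Five states and four actions, all initialised to zero.
q_table = [[0.0] * 4 for _ in range(5)]

# Hypothetical transition: from state 0, action index 3 ("right") earns a reward of 1.
q_update(q_table, state=0, action=3, reward=1.0, next_state=1, alpha=0.1, gamma=0.9)
print(q_table[0])
```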
<p>In the case of the robot game, to reiterate, the scoring/reward structure is:</p>
<ul>
<li><strong>power</strong> = +1</li>
<li><strong>mine</strong> = -100</li>
<li><strong>end</strong> = +100</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/EpQDzt7lCbmFyMVUzNGaPam3WCYNuD1-hVxu" alt="Image" width="611" height="123" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/xQtpQAhBocPC46-f0GRHDOK3ybrz4ZasaDo4" alt="Image" width="643" height="306" loading="lazy"></p>
<p>We will repeat this again and again until the learning is stopped. In this way the Q-Table will be updated.</p>
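<p>Putting the steps together, the loop below is a minimal sketch of tabular Q-learning. To keep it short it uses a hypothetical one-dimensional corridor of five tiles with only two actions (left and right) instead of the maze above, with a step penalty of 1 and 100 points for reaching the goal. The environment, the hyperparameters and the epsilon schedule are all assumptions made for this example.</p>

```python
import random

N_STATES, GOAL = 5, 4          # corridor tiles 0..4, goal at the right end
ACTIONS = (-1, +1)             # index 0 = left, index 1 = right


def step(state, action_index):
    """Move one tile, staying inside the corridor; return (next_state, reward, done)."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    if next_state == GOAL:
        return next_state, 100 - 1, True   # goal reward minus the step penalty
    return next_state, -1, False


def train(episodes=300, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]        # step 1: initialise the Q-table
    for episode in range(episodes):
        epsilon = max(0.05, 0.99 ** episode)         # explore less over time
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:               # steps 2 and 3: choose and act
                action = rng.randrange(2)
            else:
                action = q[state].index(max(q[state]))
            next_state, reward, done = step(state, action)
            future = 0.0 if done else max(q[next_state])
            # steps 4 and 5: measure the reward and update Q(s, a)
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
    return q


q = train()
policy = ["left" if row[0] > row[1] else "right" for row in q[:GOAL]]
print(policy)
```

<p>Once training has finished, reading off the best action in each row of the table gives the learned policy.</p>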
<h3 id="heading-python-implementation-of-q-learning">Python implementation of Q-Learning</h3>
<p>The concept and code implementation are <a target="_blank" href="https://www.youtube.com/watch?v=yefGGgz20tY">explained in my video</a>.</p>
<p>Subscribe to my YouTube channel for more AI videos: <a target="_blank" href="https://goo.gl/u72j6u"><strong>ADL</strong></a>.</p>
<h3 id="heading-at-lastlet-us-recap">At last…let us recap</h3>
<ul>
<li>Q-Learning is a value-based reinforcement learning algorithm which is used to find the optimal action-selection policy using a Q function.</li>
<li>Our goal is to maximize the value function Q.</li>
<li>The Q table helps us to find the best action for each state.</li>
<li>It helps to maximize the expected reward by selecting the best of all possible actions.</li>
<li>Q(state, action) returns the expected future reward of that action at that state.</li>
<li>This function can be estimated using Q-Learning, which iteratively updates Q(s,a) using the <strong>Bellman equation.</strong></li>
<li>Initially we explore the environment and update the Q-Table. When the Q-Table is ready, the agent will start to exploit the environment and start taking better actions.</li>
</ul>
<p><strong>Next time we’ll work on a deep Q-learning example</strong>.</p>
<p>Until then, enjoy AI.</p>
<p><strong>Important</strong>: As stated earlier, this article is the second part of my “Deep Reinforcement Learning” series. The complete series shall be available both in articles on <a target="_blank" href="https://medium.com/@alamba093">Medium</a> and in videos on <a target="_blank" href="https://www.youtube.com/channel/UCRkxhh51YKqpn2gaUI3MXjg">my YouTube channel</a>.</p>
<p>If you liked my article, <strong>please click the clap button</strong> to help me stay motivated to write articles. Please follow me on <strong>Medium</strong> and other social media:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Dxy5hJfhxEP5eWOBqW6QOqH0QgjIU04PD6rQ" alt="Image" width="358" height="87" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/d8UR8YDfmLtfDokKlQb32-prgyUUEWt3-glP" alt="Image" width="355" height="89" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/qPgqeEBS0ugejKsKGGHD3KpoyYyGEHytVENe" alt="Image" width="359" height="90" loading="lazy"></p>
<p>If you have any questions, please let me know in a comment below or on <a target="_blank" href="https://twitter.com/I_AM_ADL"><strong>Twitter</strong></a>.</p>
<p>Subscribe to <a target="_blank" href="https://goo.gl/u72j6u">my YouTube channel</a> for more tech videos.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ A brief introduction to reinforcement learning ]]>
                </title>
                <description>
                    <![CDATA[ By ADL Reinforcement Learning is an aspect of Machine learning where an agent learns to behave in an environment, by performing certain actions and observing the rewards/results which it get from those actions. With the advancements in Robotics Arm M... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/a-brief-introduction-to-reinforcement-learning-7799af5840db/</link>
                <guid isPermaLink="false">66c3422d93db2451bd4413e5</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Reinforcement Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 27 Aug 2018 21:17:00 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/0*7i8JA5t1Nx3HlK4E" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By ADL</p>
<p>Reinforcement learning is an area of machine learning where an agent learns to behave in an environment by performing certain actions and observing the rewards/results which it gets from those actions.</p>
<p>With the advancements in robotic arm manipulation, Google DeepMind’s AlphaGo beating a professional Go player, and recently the OpenAI team’s bot beating a professional Dota 2 player, the field of reinforcement learning has really exploded in recent years.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*EM8x5jAL-SeUUG7b4anCQg.gif" alt="Image" width="295" height="352" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*rvGVriKT_aLeLKvAP16S0A.gif" alt="Image" width="768" height="384" loading="lazy">
<em>Examples</em></p>
<p>In this article, we’ll discuss:</p>
<ul>
<li>What reinforcement learning is and its nitty-gritty like rewards, tasks, etc</li>
<li>3 categorizations of reinforcement learning</li>
</ul>
<h4 id="heading-what-is-reinforcement-learning">What is Reinforcement Learning?</h4>
<p>Let’s start the explanation with an example — say there is a small baby who starts learning how to walk.</p>
<p>Let’s divide this example into two parts:</p>
<h4 id="heading-1-baby-starts-walking-and-successfully-reaches-the-couch">1. <strong>Baby starts walking and successfully reaches the couch</strong></h4>
<p>Since the couch is the end goal, the baby and the parents are happy.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*sDMJA6qzlo59o7iivh6U6Q.jpeg" alt="Image" width="800" height="450" loading="lazy"></p>
<p>So, the baby is happy and receives appreciation from her parents. It’s positive — the baby feels good <em>(Positive Reward +n).</em></p>
<h4 id="heading-2-baby-starts-walking-and-falls-due-to-some-obstacle-in-between-and-gets-bruised">2. <strong>Baby starts walking and falls due to some obstacle in between and gets bruised.</strong></h4>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*i_999FG_Y-DnlCtpEKb5Vw.jpeg" alt="Image" width="800" height="450" loading="lazy"></p>
<p>Ouch! The baby gets hurt and is in pain. It’s negative — the baby cries <em>(Negative Reward -n).</em></p>
<p>That’s how we humans learn: by trial and error. Reinforcement learning is conceptually the same, but is a computational approach to learning from actions.</p>
<h3 id="heading-reinforcement-learning">Reinforcement Learning</h3>
<p>Let’s suppose, as an example, that our reinforcement learning agent is learning to play Mario. The reinforcement learning process can be modeled as an iterative loop that works as below:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*vz3AN1mBUR2cr_jEG8s7Mg.png" alt="Image" width="800" height="376" loading="lazy"></p>
<ul>
<li>The RL Agent receives <strong>state S</strong>⁰ from the <strong>environment</strong> i.e. Mario</li>
<li>Based on that <strong>state S⁰,</strong> the RL agent takes an <strong>action A</strong>⁰, say — our RL agent moves right. Initially, this is random.</li>
<li>Now, the environment is in a new state <strong>S¹</strong> (new frame from Mario or the game engine)</li>
<li>Environment gives some <strong>reward R</strong>¹ to the RL agent. It probably gives a +1 because the agent is not dead yet.</li>
</ul>
<p>This RL loop continues until we are dead or we reach our destination, and it continuously outputs a sequence of <strong>state, action and reward.</strong></p>
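<p>The loop above can be sketched in a few lines of code. The environment here is a hypothetical stand-in rather than an actual Mario game: it gives +1 for every step survived and ends the episode after five steps. The class, the method names reset and step, and the action names are all assumptions made for this sketch.</p>

```python
import random


class ToyEnvironment:
    """Hypothetical stand-in for a game: survive 5 steps to finish the episode."""

    def reset(self):
        self.t = 0
        return self.t                      # initial state S0

    def step(self, action):
        self.t += 1
        reward = 1                         # +1 for still being alive
        done = self.t >= 5
        return self.t, reward, done        # new state, reward, episode finished?


def run_episode(env, policy):
    """Run one episode and return the list of (state, action, reward) tuples."""
    trajectory = []
    state, done = env.reset(), False
    while not done:
        action = policy(state)             # the agent picks an action for this state
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


def random_policy(state):
    """Initially the agent acts at random."""
    return random.choice(["left", "right", "jump"])


trajectory = run_episode(ToyEnvironment(), random_policy)
print(len(trajectory), sum(reward for _, _, reward in trajectory))
```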
<p>The basic aim of our RL agent is to maximize the reward.</p>
<h3 id="heading-reward-maximization">Reward Maximization</h3>
<p>The RL agent basically works on a hypothesis of reward maximization. <strong>That’s why a reinforcement learning agent should take the best possible action in order to maximize the reward.</strong></p>
<p>The cumulative reward at each time step with the respective action is written as:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*up3hsG1ToqndcnmdA8tbRw.png" alt="Image" width="525" height="224" loading="lazy"></p>
<p>However, simply summing up all the rewards in this way does not work well in practice.</p>
<p>Let us understand this, in detail:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*l8wl4hZvZAiLU56hT9vLlg.png" alt="Image" width="472" height="388" loading="lazy"></p>
<p>Let us say our RL agent (Robotic mouse) is in a maze which contains <strong>cheese, electricity shocks, and cats</strong>. The goal is to eat the maximum amount of cheese before being eaten by the cat or getting an electricity shock.</p>
<p>It seems obvious to eat the cheese near us rather than the cheese close to the cat or the electricity shock, because the closer we are to the electricity shock or the cat, the greater the danger of dying. As a result, the reward near the cat or the electricity shock, even if it is bigger (more cheese), will be discounted. This is done because of the uncertainty factor.</p>
<p>It makes sense, right?</p>
<h4 id="heading-discounting-of-rewards-works-like-this"><strong>Discounting of rewards works like this:</strong></h4>
<p>We define a discount rate called <strong>gamma</strong>. It should be between 0 and 1. The larger the gamma, the smaller the discount and vice versa.</p>
<p>So, our cumulative expected (discounted) rewards are:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*ef-5D-aBUShEnvMjiCujNw.png" alt="Image" width="800" height="450" loading="lazy">
<em>Cumulative expected rewards</em></p>
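<p>To see how gamma changes what the agent prefers, the sketch below compares a small reward received immediately with a larger reward received four steps later. The reward sequences and their names are hypothetical values chosen for illustration.</p>

```python
def discounted_sum(rewards, gamma):
    """Return the sum of gamma**k * rewards[k] for k = 0, 1, 2, ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


near_cheese = [1, 0, 0, 0, 0]      # small reward straight away
far_cheese = [0, 0, 0, 0, 3]       # larger reward four steps later

for gamma in (0.9, 0.5):
    print(gamma, discounted_sum(near_cheese, gamma), discounted_sum(far_cheese, gamma))
```

<p>With a gamma of 0.9 the larger, later reward is still worth more, while with a gamma of 0.5 it is discounted so heavily that the nearby cheese wins.</p>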
<h3 id="heading-tasks-and-their-types-in-reinforcement-learning">Tasks and their types in reinforcement learning</h3>
<p>A <strong>task</strong> is a single instance of a reinforcement learning problem. We basically have two types of tasks: <strong>continuous and episodic.</strong></p>
<h4 id="heading-continuous-tasks">Continuous tasks</h4>
<p><strong>These are the types of tasks that continue forever.</strong> For instance, an RL agent that does automated Forex/Stock trading.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0*Rpz3cfDnays7p4-e" alt="Image" width="800" height="599" loading="lazy">
<em>Photo by <a href="https://unsplash.com/@chrisliverani?utm_source=medium&amp;utm_medium=referral" rel="noopener" target="_blank">Chris Liverani</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral" rel="noopener" target="_blank">Unsplash</a></em></p>
<p>In this case, the agent has to learn how to choose the best actions while simultaneously interacting with the environment. There is no starting point and no end state.</p>
<p><strong>The RL agent has to keep running until we decide to manually stop it.</strong></p>
<h4 id="heading-episodic-task">Episodic task</h4>
<p>In this case, we have a starting point and an ending point <strong>called the terminal state. This creates an episode</strong>: a list of States (S), Actions (A), Rewards (R).</p>
<p>For example, playing a game of <em>Counter-Strike</em>, where we shoot our opponents or get killed by them. Either we shoot all of them and complete the episode, or we are killed. So, there are only two ways an episode can end.</p>
<h3 id="heading-exploration-and-exploitation-trade-off">Exploration and exploitation trade off</h3>
<p>The exploration and exploitation trade-off is an important concept in reinforcement learning. Exploration is about finding more information about an environment, whereas exploitation is about using already known information to maximize the rewards.</p>
<p><strong>Real Life Example:</strong> Say you go to the same restaurant every day. You are basically <strong>exploiting.</strong> On the other hand, if you search for a new restaurant every time you go out, then it’s <strong>exploration</strong>. Exploration matters because it can uncover future rewards that are higher than the nearby ones.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*R9hA8rKx52oByN5Xa7Aqng.png" alt="Image" width="522" height="429" loading="lazy"></p>
<p>In the above game, our robotic mouse can have a good amount of small cheese (+0.5 each). But at the top of the maze there is a big sum of cheese (+100). So, if we only focus on the nearest reward, our robotic mouse will never reach the big sum of cheese — it will just exploit.</p>
<p>But if the robotic mouse does a little bit of exploration, it can find the big reward i.e. the big cheese.</p>
<p>This is the basic concept of the <strong>exploration and exploitation trade-off.</strong></p>
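<p>One common way to balance the two is an epsilon-greedy rule, which this article does not cover in detail. The sketch below is only an illustration under that assumption: with probability <code>epsilon</code> the agent explores by picking a random action, otherwise it exploits the action with the best estimated reward. The names <code>epsilon_greedy</code> and <code>estimated_rewards</code> are hypothetical.</p>

```python
import random


def epsilon_greedy(estimated_rewards, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else the best known."""
    if rng.random() < epsilon:
        # Explore: try any action, even one that currently looks bad.
        return rng.randrange(len(estimated_rewards))
    # Exploit: take the action with the highest estimated reward so far.
    return max(range(len(estimated_rewards)), key=lambda a: estimated_rewards[a])


estimates = [0.5, 0.2, 0.0]  # hypothetical reward estimates for three actions
rng = random.Random(0)       # seeded so that repeated runs behave the same
choices = [epsilon_greedy(estimates, 0.1, rng) for _ in range(1000)]
print("share of greedy choices:", choices.count(0) / len(choices))
```

<p>Setting <code>epsilon</code> to 0 gives a mouse that only ever eats the nearest cheese, and setting it to 1 gives a mouse that wanders at random.</p>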
<h3 id="heading-approaches-to-reinforcement-learning">Approaches to Reinforcement Learning</h3>
<p>Let us now understand the approaches to solving reinforcement learning problems. Basically there are 3 approaches, but we will only take 2 major approaches in this article:</p>
<h4 id="heading-1-policy-based-approach">1. Policy-based approach</h4>
<p>In policy-based reinforcement learning, we have a policy which we need to optimize. The policy basically defines how the agent behaves:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*0eMOC89KDSeJAPxEpOZi5Q.png" alt="Image" width="715" height="485" loading="lazy"></p>
<p>We learn a policy function which helps us in mapping each state to the best action.</p>
<p>Getting deep into policies, we further divide policies into two types:</p>
<ul>
<li><strong>Deterministic</strong>: a policy that, at a given state (s), always returns the same action (a). <strong>In other words, it is a fixed mapping S = s ➡ A = a.</strong></li>
<li><strong>Stochastic</strong>: a policy that gives a probability distribution over the different actions<strong>, i.e. Stochastic Policy ➡ p( A = a | S = s )</strong></li>
</ul>
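<p>The two kinds of policy can be sketched as plain lookup tables. The state names, action names and probabilities below are invented for illustration only.</p>

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"near_cheese": "forward", "near_cat": "back"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "near_cheese": {"forward": 0.9, "back": 0.1},
    "near_cat": {"forward": 0.2, "back": 0.8},
}


def act_deterministic(state):
    """Always returns the same action for the same state."""
    return deterministic_policy[state]


def act_stochastic(state, rng=random):
    """Samples an action according to p(A = a | S = s)."""
    distribution = stochastic_policy[state]
    actions = list(distribution)
    weights = list(distribution.values())
    return rng.choices(actions, weights=weights, k=1)[0]


print(act_deterministic("near_cat"))  # back
print(act_stochastic("near_cat"))     # usually "back", sometimes "forward"
```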
<h4 id="heading-2-value-based">2. Value Based</h4>
<p>In value-based RL, the goal of the agent is to optimize the value function <em>V(s)</em> which is defined as a function that tells us the maximum expected future reward the agent shall get at each state.</p>
<p>The value of each state is the total amount of the reward an RL agent can expect to collect over the future, from a particular state.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0*kvtRAhBZO-h77Iw1." alt="Image" width="692" height="133" loading="lazy"></p>
<p>The agent will use the above value function to select which state to choose at each step. The agent will always take the state with the biggest value.</p>
<p>In the example below, we see that at each step we take the biggest value to achieve our goal: 1 <strong>➡</strong> 3 <strong>➡</strong> 4 <strong>➡ 6</strong> and so on…</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*96F7YC253a5-mXNPVUTCSg.png" alt="Image" width="643" height="416" loading="lazy">
<em>Maze</em></p>
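<p>As a rough sketch of this "always move to the best-valued state" rule, assume a tiny maze whose state values and connections are already known. The state names, values and layout below are hypothetical and are not the ones in the figure.</p>

```python
# Hypothetical state values: a higher value means a better state to be in.
values = {"A": -3, "B": -2, "C": -4, "D": -1, "goal": 0}

# Hypothetical maze layout: which states can be reached from which.
neighbors = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["goal"], "goal": []}


def greedy_path(start):
    """Follow the neighbor with the biggest value until no move is left."""
    path = [start]
    state = start
    while neighbors[state]:
        state = max(neighbors[state], key=lambda s: values[s])
        path.append(state)
    return path


print(greedy_path("A"))  # ['A', 'B', 'D', 'goal']
```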
<h3 id="heading-the-game-of-pong-an-intuitive-case-study">The game of Pong — An Intuitive case study</h3>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*6D27X-9bipEPrgHrrjwIRA.gif" alt="Image" width="382" height="206" loading="lazy"></p>
<p>Let us take a real-life example: playing Pong. This case study will just introduce you to the intuition of <strong>how reinforcement learning works</strong>. We will not get into details in this example, but in the next article we will certainly dig deeper.</p>
<p>Suppose we teach our RL agent to play the game of Pong.</p>
<p>Basically, we feed the game frames (new states) into the RL algorithm and let the algorithm decide whether to go up or down. This network is called a <strong>policy network,</strong> which we will discuss in our next article.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*nGQ4cQneWpgbUpl7aREGwg.jpeg" alt="Image" width="800" height="450" loading="lazy"></p>
<p>The method used to train this algorithm is called the <strong>policy gradient</strong>. We feed in random frames from the game engine, and the algorithm produces a random output, which earns a reward that is fed back to the algorithm/network. This is an <strong>iterative process.</strong></p>
<p>We will discuss <strong>policy gradients</strong> in greater detail in the next article.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*-SwnWvR-VhZRhX-a9ruF6Q.png" alt="Image" width="481" height="234" loading="lazy">
<em>Environment = Game Engine and Agent = RL Agent</em></p>
<p>In the context of the game, the scoreboard acts as a reward or feedback to the agent. Whenever the agent scores +1, it understands that the action it took at that state was good enough.</p>
<p>Now we will train the agent to play Pong. To start, we feed a bunch of game frames <strong>(states)</strong> into the network/algorithm and let the algorithm decide the action. The initial actions of the agent will obviously be bad, but our agent can sometimes be lucky enough to score a point by chance. Thanks to this lucky random event, it receives a reward, and this helps the agent understand that the series of actions was good enough to fetch a reward.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*cdq5CaGCJCU6ePiXS9GbYg.png" alt="Image" width="516" height="121" loading="lazy">
<em>Results during the training</em></p>
<p>So, in the future, the agent is more likely to take the actions that fetch a reward than those that do not. Intuitively, the RL agent is learning to play the game.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*roRyfK2mmV1E_MsN0cRzcg.gif" alt="Image" width="389" height="256" loading="lazy">
<em>Source: OLEGIF.com</em></p>
<h4 id="heading-limitations">Limitations</h4>
<p>During the training of the agent, when the agent loses an episode, the algorithm will discard, or lower the likelihood of taking, every action in the series that made up this episode.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*H6wuWYx1wlGRTNfiFGWvhA.png" alt="Image" width="523" height="122" loading="lazy">
<em>Red demarcation shows all the actions taken in a losing episode</em></p>
<p>But if the agent was performing <strong>well</strong> from the start of the episode and lost the game only because of the last 2 actions, it does not make sense to discard all the actions. Rather, it makes sense to remove just the last 2 actions, which resulted in the loss.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*ZSPXbb8q_2zZiVEQdbDX9A.png" alt="Image" width="516" height="125" loading="lazy">
<em>Green demarcation shows all the actions which were correct, and red demarcation shows the actions which should be removed.</em></p>
<p>This is called the <strong>Credit Assignment Problem.</strong> This problem arises because of a <strong>sparse reward setting.</strong> That is, instead of getting a reward at every step, we get the reward at the end of the episode. So, it’s on the agent to learn which actions were correct and which action actually led to losing the game.</p>
<p>So, due to this sparse reward setting in RL, the algorithm is very sample-inefficient. This means that a huge number of training examples has to be fed in to train the agent. And in practice, sparse reward settings fail in many circumstances due to the complexity of the environment.</p>
<p>So, there is something called <strong>reward shaping</strong> which is used to solve this. But again, reward shaping has its own limitation: we need to design a custom reward function for every game.</p>
<h4 id="heading-closing-note">Closing Note</h4>
<p>Today, reinforcement learning is an exciting field of study. Major developments have been made in the field, of which deep reinforcement learning is one.</p>
<p>We will cover deep reinforcement learning in our upcoming articles. This article covers a lot of concepts. Please take your time to understand the basic concepts of reinforcement learning.</p>
<p>But I would like to mention that reinforcement learning is not a secret black box. Whatever advancements we are seeing today in the field of reinforcement learning are a result of bright minds working day and night on specific applications.</p>
<p>Next time we’ll work on a Q-learning agent and also cover some more basic stuff in reinforcement learning.</p>
<p>Until then, enjoy AI…</p>
<blockquote>
<p><strong>Important</strong>: This article is the 1st part of the Deep Reinforcement Learning series. The complete series will be available both in text form on <a target="_blank" href="https://medium.com/@alamba093">Medium</a> and in video form on <a target="_blank" href="https://www.youtube.com/channel/UCRkxhh51YKqpn2gaUI3MXjg">my channel on YouTube</a>.</p>
</blockquote>
<p>For a deeper and more intuitive understanding of reinforcement learning, I would recommend that you watch the video below:</p>
<p>Subscribe to my YouTube channel for more AI videos: <a target="_blank" href="https://goo.gl/u72j6u"><strong>ADL</strong></a>.</p>
<p>If you have any questions, please let me know in a comment below or on <a target="_blank" href="https://twitter.com/I_AM_ADL"><strong>Twitter</strong></a>. Subscribe to my YouTube channel for more tech videos: <a target="_blank" href="https://goo.gl/u72j6u"><strong>ADL</strong></a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed… ]]>
                </title>
                <description>
                    <![CDATA[ By Thomas Simonini This article is part of Deep Reinforcement Learning Course with Tensorflow. Check the syllabus here. In our last article about Deep Q Learning with Tensorflow, we implemented an agent that learns to play a simple version of Do... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682/</link>
                <guid isPermaLink="false">66c357e5d372f14b49bdcbab</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Reinforcement Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 06 Jul 2018 00:10:13 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*idlcWBCQGKJ2rMjKPwAKiQ.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Thomas Simonini</p>
<blockquote>
<p>This article is part of Deep Reinforcement Learning Course with Tensorflow. Check the syllabus <a target="_blank" href="https://simoninithomas.github.io/Deep_reinforcement_learning_Course/">here</a>.</p>
</blockquote>
<p>In our last article about <a target="_blank" href="https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8">Deep Q Learning with Tensorflow</a>, we implemented an agent that learns to play a simple version of Doom. In the video version, <a target="_blank" href="https://www.youtube.com/watch?v=gCJyVX98KJ4">we trained a DQN agent that plays Space invaders</a>.</p>
<p>However, during the training, we saw that there was a lot of variability.</p>
<p>Deep Q-Learning was introduced in 2013. Since then, a lot of improvements have been made. So, today we’ll see four strategies that dramatically improve the training and the results of our DQN agents:</p>
<ul>
<li>fixed Q-targets</li>
<li>double DQNs</li>
<li>dueling DQN (aka DDQN)</li>
<li>Prioritized Experience Replay (aka PER)</li>
</ul>
<p>We’ll implement an agent that learns to play the Doom Deadly Corridor scenario. Our AI must navigate towards the fundamental goal (the vest), and make sure it survives at the same time by killing enemies.</p>
<h3 id="heading-fixed-q-targets">Fixed Q-targets</h3>
<h4 id="heading-theory">Theory</h4>
<p>We saw in the Deep Q Learning article that, when we want to calculate the TD error (aka the loss), we calculate the difference between the TD target (Q_target) and the current Q value (estimation of Q).</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*Zplt-1wTWu_7BGmZCBFjbQ.png" alt="Image" width="800" height="234" loading="lazy"></p>
<p>But <strong>we don’t have any idea of the real TD target.</strong> We need to estimate it. Using the Bellman equation, we saw that the TD target is just the reward of taking that action at that state plus the discounted highest Q value for the next state.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*KsQ46R8zyTQlKGv91xi6ww.png" alt="Image" width="800" height="295" loading="lazy"></p>
<p>However, the problem is that we are using the same parameters (weights) for estimating the target <strong>and</strong> the Q value. As a consequence, there is a strong correlation between the TD target and the parameters (w) we are changing.</p>
<p>This means that at every step of training, <strong>our Q values shift but the target value shifts too.</strong> So, we’re getting closer to our target, but the target is also moving. It’s like chasing a moving target! This leads to big oscillations in training.</p>
<p>Imagine you are a cowboy (the Q estimation) and you want to catch a cow (the Q-target): you must get closer (reduce the error).</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*BCsZHA3cO3zsQySkRuWPEw.png" alt="Image" width="500" height="281" loading="lazy"></p>
<p>At each time step, you’re trying to approach the cow, which also moves at each time step (because you use the same parameters).</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*aKuCo_MvnoCa148m3U9YXg.png" alt="Image" width="500" height="281" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*T5MwyKNbDmG9Vb_fQg1t-w.png" alt="Image" width="500" height="281" loading="lazy"></p>
<p>This leads to a very strange path of chasing (a big oscillation in training).</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*Kt6H_kh_rfSu7EkN9bU0oA.png" alt="Image" width="500" height="281" loading="lazy"></p>
<p>Instead, we can use the idea of fixed Q-targets introduced by DeepMind:</p>
<ul>
<li>Using a separate network with fixed parameters (let’s call them w-) for estimating the TD target.</li>
<li>Every tau steps, we copy the parameters from our DQN network to update the target network.</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*D9i0I2EO7LKL2aAb2HLfTg.png" alt="Image" width="800" height="268" loading="lazy"></p>
<p>Thanks to this procedure, we’ll have more stable learning because the target function stays fixed for a while.</p>
<h4 id="heading-implementation">Implementation</h4>
<p>Implementing fixed q-targets is pretty straightforward:</p>
<ul>
<li><p>First, we create two networks (<code>DQNetwork</code>, <code>TargetNetwork</code>)</p>
</li>
<li><p>Then, we create a function that will take our <code>DQNetwork</code> parameters and copy them to our <code>TargetNetwork</code></p>
</li>
<li><p>Finally, during the training, we calculate the TD target using our target network. We update the target network with the <code>DQNetwork</code> parameters every <code>tau</code> steps (<code>tau</code> is a hyperparameter that we define).</p>
</li>
</ul>
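<p>The three steps above can be sketched without any deep learning library. In this illustration the "network" is just a table of weights, which is an assumption made to keep the example short; the notebook uses real Tensorflow networks. The names <code>TinyQNetwork</code>, <code>update_target_network</code> and <code>train_step</code>, as well as the transitions, are hypothetical.</p>

```python
import copy


class TinyQNetwork:
    """Stand-in for a neural network: one weight per (state, action) pair."""

    def __init__(self, n_states, n_actions):
        self.weights = [[0.0] * n_actions for _ in range(n_states)]

    def q_values(self, state):
        return self.weights[state]


def update_target_network(dq_network, target_network):
    """Copy the DQNetwork parameters into the TargetNetwork."""
    target_network.weights = copy.deepcopy(dq_network.weights)


def train_step(dq_network, target_network, transition, gamma=0.9, lr=0.5):
    state, action, reward, next_state, done = transition
    # The TD target is computed with the frozen target network (w-).
    next_q = 0.0 if done else max(target_network.q_values(next_state))
    td_target = reward + gamma * next_q
    td_error = td_target - dq_network.q_values(state)[action]
    # Only the DQNetwork is updated at every step.
    dq_network.weights[state][action] += lr * td_error
    return td_error


dq_network = TinyQNetwork(n_states=2, n_actions=2)
target_network = TinyQNetwork(n_states=2, n_actions=2)
tau = 3  # hyperparameter: how many steps pass between target updates

# Hypothetical transitions: (state, action, reward, next_state, done)
transitions = [(0, 1, 1.0, 1, False), (1, 0, 2.0, 0, True)] * 3
for step, transition in enumerate(transitions, start=1):
    train_step(dq_network, target_network, transition)
    if step % tau == 0:
        update_target_network(dq_network, target_network)

print(dq_network.weights)
print(target_network.weights)
```

<p>Between two copies the target network does not move, so the TD target for a given transition stays the same even though the <code>DQNetwork</code> weights keep changing.</p>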
<h3 id="heading-double-dqns">Double DQNs</h3>
<h4 id="heading-theory-1">Theory</h4>
<p>Double DQNs, or Double Learning, were introduced <a target="_blank" href="https://papers.nips.cc/paper/3964-double-q-learning">by Hado van Hasselt</a>. This method <strong>handles the problem of the overestimation of Q-values.</strong></p>
<p>To understand this problem, remember how we calculate the TD Target:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*KsQ46R8zyTQlKGv91xi6ww.png" alt="Image" width="800" height="295" loading="lazy"></p>
<p>When calculating the TD target, we face a simple problem: how can we be sure that <strong>the best action for the next state is the action with the highest Q-value?</strong></p>
<p>We know that the accuracy of Q values depends on which actions we tried <strong>and</strong> which neighboring states we explored.</p>
<p>As a consequence, at the beginning of the training we don’t have enough information about the best action to take. Therefore, taking the maximum Q value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly <strong>given a higher Q value than the optimal action, learning becomes difficult.</strong></p>
<p>The solution is: when we compute the Q target, we use two networks to decouple the action selection from the target Q value generation. We:</p>
<ul>
<li>use our DQN network to select what is the best action to take for the next state (the action with the highest Q value).</li>
<li>use our target network to calculate the target Q value of taking that action at the next state.</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*g5l4q162gDRZAAsFWtX7Nw.png" alt="Image" width="800" height="262" loading="lazy"></p>
<p>Therefore, Double DQN helps us reduce the overestimation of q values and, as a consequence, helps us train faster and have more stable learning.</p>
<h4 id="heading-implementation-1">Implementation</h4>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*oyGR6gJ4WyqeKOfq0Cd8iQ.png" alt="Image" width="800" height="29" loading="lazy"></p>
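<p>As a hedged illustration of the decoupling described in the theory, the sketch below computes the target for a single transition from two lists of Q values. The function names and the numbers are hypothetical, and a plain DQN target is included only for comparison.</p>

```python
def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


def double_dqn_target(reward, gamma, done, online_next_q, target_next_q):
    """Select the action with the DQN network, evaluate it with the target network."""
    if done:
        return reward
    best_action = argmax(online_next_q)                 # action selection
    return reward + gamma * target_next_q[best_action]  # value estimation


def vanilla_dqn_target(reward, gamma, done, target_next_q):
    """Plain DQN target: the same network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(target_next_q)


# Hypothetical Q values for the next state, one entry per action.
online_next_q = [1.0, 2.0, 0.5]
target_next_q = [1.5, 1.0, 3.0]

print(double_dqn_target(1.0, 0.5, False, online_next_q, target_next_q))  # 1.5
print(vanilla_dqn_target(1.0, 0.5, False, target_next_q))                # 2.5
```

<p>In this made-up case the target network holds one inflated value (3.0). The plain target picks it up through the max, whereas the double target ignores it because the DQN network prefers a different action.</p>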
<h3 id="heading-dueling-dqn-aka-ddqn">Dueling DQN (aka DDQN)</h3>
<h4 id="heading-theory-2">Theory</h4>
<p>Remember that a Q-value, Q(s,a), corresponds <strong>to how good it is to be at that state and to take that action at that state.</strong></p>
<p>So we can decompose Q(s,a) as the sum of:</p>
<ul>
<li><strong>V(s)</strong>: the value of being at that state</li>
<li><strong>A(s,a)</strong>: the advantage of taking that action at that state (how much better it is to take this action versus all other possible actions at that state).</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*yPtkPCxjXP2TbK8VlUuXtA.png" alt="Image" width="800" height="106" loading="lazy"></p>
<p>With DDQN, we want to separate the estimator of these two elements, using two new streams:</p>
<ul>
<li>one that estimates the <strong>state value V(s)</strong></li>
<li>one that estimates the <strong>advantage for each action A(s,a)</strong></li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*FkHqwA2eSGixdS-3dvVoMA.png" alt="Image" width="800" height="283" loading="lazy"></p>
<p>And then we combine these two streams <strong>through a special aggregation layer to get an estimate of Q(s,a).</strong></p>
<p>Wait? <strong>But why do we need to calculate these two elements separately if then we combine them?</strong></p>
<p>By decoupling the estimation, intuitively our DDQN can learn which states are (or are not) valuable <strong>without</strong> having to learn the effect of each action at each state (since it’s also calculating V(s)).</p>
<p>With our normal DQN, we need to calculate the value of each action at that state. <strong>But what’s the point if the value of the state is bad?</strong> What’s the point of calculating the value of every action at a state when all these actions lead to death?</p>
<p>As a consequence, by decoupling we’re able to calculate V(s). This is particularly <strong>useful for states whose actions do not affect the environment in a relevant way.</strong> In this case, it’s unnecessary to calculate the value of each action. For instance, moving right or left only matters if there is a risk of collision. And, in most states, the choice of the action has no effect on what happens.</p>
<p>It will be clearer if we take the example in the paper <a target="_blank" href="https://arxiv.org/pdf/1511.06581.pdf">Dueling Network Architectures for Deep Reinforcement Learning</a>.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0*qor_kPiSwiWt8uQF" alt="Image" width="800" height="479" loading="lazy"></p>
<p>We see that the value stream pays attention (the orange blur) to the road, and in particular to the horizon where the cars are spawned. It also pays attention to the score.</p>
<p>On the other hand, the advantage stream in the first frame on the right does not pay much attention to the road, because there are no cars in front (so the action choice is practically irrelevant). But, in the second frame it pays attention, as there is a car immediately in front of it, and making a choice of action is crucial and very relevant.</p>
<p>Concerning the aggregation layer, we want to generate the q values for each action at that state. We might be tempted to combine the streams as follows:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0*ue6KTm1dRQ0A6sM4" alt="Image" width="800" height="96" loading="lazy"></p>
<p>But if we do that, we’ll fall into the <strong>issue of identifiability</strong>: given Q(s,a), we’re unable to recover A(s,a) and V(s), because adding a constant to V(s) and subtracting it from every A(s,a) gives exactly the same Q(s,a).</p>
<p>Not being able to recover V(s) and A(s,a) from Q(s,a) is a problem for our backpropagation. To avoid this problem, we can force our advantage function estimator to have 0 advantage at the chosen action.</p>
<p>To do that, we subtract the average advantage over all possible actions at that state.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0*kt9_Z41qxgiI0CDl" alt="Image" width="800" height="161" loading="lazy"></p>
<p>Therefore, this architecture helps us accelerate the training. We can calculate the value of a state without calculating the Q(s,a) for each action at that state. And it can help us find much more reliable Q values for each action by decoupling the estimation between two streams.</p>
<h4 id="heading-implementation-2">Implementation</h4>
<p>The only thing to do is to modify the DQN architecture by adding these new streams:</p>
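<p>The notebook builds these streams as Tensorflow layers. As a library-free sketch of just the aggregation step, assume the two streams have already produced a state value and a list of advantages. The function name <code>dueling_q_values</code> and the numbers are hypothetical.</p>

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s,a) into Q(s,a), subtracting the mean advantage."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (a - mean_advantage) for a in advantages]


# Hypothetical outputs of the value stream and the advantage stream.
print(dueling_q_values(2.0, [1.0, 0.0, -1.0]))   # [3.0, 2.0, 1.0]

# Shifting every advantage by the same constant no longer changes the result.
print(dueling_q_values(2.0, [11.0, 10.0, 9.0]))  # [3.0, 2.0, 1.0]
```

<p>The second call shows why the average is subtracted: a constant offset in the advantage stream cannot leak into the Q values, so V(s) keeps a well-defined meaning.</p>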
<h3 id="heading-prioritized-experience-replay">Prioritized Experience Replay</h3>
<h4 id="heading-theory-3">Theory</h4>
<p>Prioritized Experience Replay (PER) was introduced in 2015 by <a target="_blank" href="https://arxiv.org/search?searchtype=author&amp;query=Schaul%2C+T">Tom Schaul</a>. The idea is that some experiences may be more important than others for our training, but might occur less frequently.</p>
<p>Because we sample the batch uniformly (selecting the experiences randomly) these rich experiences that occur rarely have practically no chance to be selected.</p>
<p>That’s why, with PER, we try to change the sampling distribution by using a criterion to define the priority of each tuple of experience.</p>
<p>We want to prioritize <strong>experiences where there is a big difference between our prediction and the TD target, since it means that we have a lot to learn from them.</strong></p>
<p>We use the magnitude (absolute value) of our TD error:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0*0qPwzal3qBIP0eFb" alt="Image" width="800" height="327" loading="lazy"></p>
<p>And we <strong>store that priority with each experience in the replay buffer.</strong></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0*iKTTN92E7wwnlh-E" alt="Image" width="800" height="172" loading="lazy"></p>
<p>But we can’t just do greedy prioritization, because it will lead to always training on the same experiences (those that have a big priority), and thus to over-fitting.</p>
<p>So we introduce stochastic prioritization, <strong>which generates the probability of being chosen for a replay.</strong></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0*iCkLY7L3R3mWEh_O" alt="Image" width="800" height="479" loading="lazy"></p>
<p>As a consequence, during each time step, we will get a batch of samples with this probability distribution and train our network on it.</p>
<p>But we still have a problem here. Remember that with normal Experience Replay, we use a stochastic update rule. As a consequence, the <strong>way we sample the experiences must match the underlying distribution they came from.</strong></p>
<p>With normal Experience Replay, we select our experiences from a uniform distribution. Simply put, we select our experiences randomly. There is no bias, because each experience has the same chance of being taken, so we can update our weights normally.</p>
<p><strong>But</strong>, because we use priority sampling, purely random sampling is abandoned. As a consequence, we introduce a bias toward high-priority samples (which have more chances of being selected).</p>
<p>And, if we update our weights normally, we run the risk of over-fitting. Samples that have high priority are likely to be used for training many times in comparison with low-priority experiences (= bias). As a consequence, we’ll update our weights with only a small portion of experiences that we consider to be really interesting.</p>
<p>To correct this bias, we use importance sampling weights (IS) that will adjust the updating by reducing the weights of the often seen samples.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0*Lf3KBrOdyBYcOVqB" alt="Image" width="800" height="408" loading="lazy"></p>
<p>The weights corresponding to high-priority samples have very little adjustment (because the network will see these experiences many times), whereas those corresponding to low-priority samples will have a full update.</p>
<p>The role of <strong>b</strong> is to control how much these importance sampling weights affect learning. In practice, the b parameter is annealed up to 1 over the duration of training, because these weights are more important <strong>at the end of learning, when our Q values begin to converge.</strong> The unbiased nature of updates is most important near convergence, as explained in this <a target="_blank" href="http://pemami4911.github.io/paper-summaries/deep-rl/2016/01/26/prioritizing-experience-replay.html">article</a>.</p>
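<p>The formulas above are only shown as images, so the sketch below follows the forms given in the Prioritized Experience Replay paper: the sampling probability is P(i) = p_i^a / sum_k p_k^a and the importance sampling weight is w_i = (1/N * 1/P(i))^b, normalised by the largest weight. The function names, the default values of <code>a</code> and <code>b</code>, and the small constant <code>epsilon</code> (which keeps every priority above zero) are assumptions made for illustration.</p>

```python
def sampling_probabilities(td_errors, a=0.6, epsilon=0.01):
    """Turn TD errors into sampling probabilities (a = 0 gives uniform sampling)."""
    priorities = [(abs(e) + epsilon) ** a for e in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probabilities, b=0.4):
    """Importance sampling weights, scaled so that the largest weight is 1."""
    n = len(probabilities)
    weights = [(1.0 / (n * p)) ** b for p in probabilities]
    max_weight = max(weights)
    return [w / max_weight for w in weights]


td_errors = [2.0, 0.5, 0.1]  # hypothetical TD errors of three stored experiences
probabilities = sampling_probabilities(td_errors)
weights = importance_weights(probabilities)

for error, p, w in zip(td_errors, probabilities, weights):
    print(f"|TD error|={error:.1f}  P(i)={p:.3f}  IS weight={w:.3f}")
```

<p>The experience with the largest TD error is sampled most often but gets the smallest weight, which is the correction described above.</p>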
<h4 id="heading-implementation-3">Implementation</h4>
<p>This time, the implementation will be a little bit fancier.</p>
<p>First of all, we can’t just implement PER by sorting the experience replay buffer according to the priorities. This would not be efficient at all, due to <strong>O(n log n) for insertion and O(n) for sampling.</strong></p>
<p>As explained in <a target="_blank" href="https://jaromiru.com/2016/11/07/lets-make-a-dqn-double-learning-and-prioritized-experience-replay/">this really good article</a>, we need to use another data structure instead of a sorted array: an unsorted <strong>sumtree.</strong></p>
<p>A sumtree is a binary tree, that is, a tree in which each node has at most two children. The leaves (deepest nodes) contain the priority values, each parent node stores the sum of its children, and a data array that points to the leaves contains the experiences.</p>
<p>Updating the tree and sampling will be really efficient (O(log n)).</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*Go9DNr7YY-wMGdIQ7HQduQ.png" alt="Image" width="800" height="298" loading="lazy"></p>
<p>Then, we create a memory object that will contain our sumtree and data.</p>
<p>Next, to sample a minibatch of size k, the range [0, total_priority] will be divided into k ranges. A value is uniformly sampled from each range.</p>
<p>Finally, the transitions (experiences) that correspond to each of these sampled values are retrieved from the sumtree.</p>
<p>It will be much clearer when we dive into the complete details in the notebook.</p>
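<p>In the meantime, here is a compact sketch of the idea, written independently of the notebook, so the class layout and the names <code>SumTree</code>, <code>add</code>, <code>update</code>, <code>get</code> and <code>sample_batch</code> are assumptions. The tree is stored in a flat list: leaves hold priorities, every parent holds the sum of its two children, and the root holds the total priority.</p>

```python
import random


class SumTree:
    """Binary tree in a flat list; leaves hold priorities, parents hold sums."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)  # internal nodes, then leaves
        self.data = [None] * capacity           # experiences, one per leaf
        self.write = 0                          # next data slot to overwrite

    @property
    def total_priority(self):
        return self.tree[0]

    def add(self, priority, experience):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = experience
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity  # overwrite oldest when full

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        node = leaf
        while node != 0:  # propagate the change up to the root: O(log n)
            node = (node - 1) // 2
            self.tree[node] += change

    def get(self, value):
        """Find the leaf whose cumulative priority range contains value."""
        node = 0
        while 2 * node + 1 < len(self.tree):  # stop when node is a leaf
            left = 2 * node + 1
            if value <= self.tree[left]:
                node = left
            else:
                value -= self.tree[left]
                node = left + 1
        return node, self.tree[node], self.data[node - self.capacity + 1]


def sample_batch(tree, k, rng=random):
    """Split [0, total_priority] into k ranges and sample one value in each."""
    segment = tree.total_priority / k
    batch = []
    for i in range(k):
        value = rng.uniform(segment * i, segment * (i + 1))
        batch.append(tree.get(value))
    return batch


tree = SumTree(capacity=4)
for priority, experience in [(1.0, "e1"), (2.0, "e2"), (3.0, "e3"), (4.0, "e4")]:
    tree.add(priority, experience)

print(tree.total_priority)  # 10.0
print(tree.get(2.5))        # (4, 2.0, 'e2')
print(sample_batch(tree, k=2))
```

<p>Each call to <code>get</code> returns the leaf index, its priority and the stored experience. The leaf index is what you would pass back to <code>update</code> once the new TD error of that experience is known.</p>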
<h3 id="heading-doom-deathmatch-agent">Doom Deathmatch agent</h3>
<p>This agent is a Dueling Double Deep Q-Learning agent with PER and fixed Q-targets.</p>
<blockquote>
<p>We made a video tutorial of the implementation:</p>
<p>The notebook is <a target="_blank" href="https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Dueling%20Double%20DQN%20with%20PER%20and%20fixed-q%20targets/Dueling%20Deep%20Q%20Learning%20with%20Doom%20(%2B%20double%20DQNs%20and%20Prioritized%20Experience%20Replay).ipynb">here</a></p>
</blockquote>
<p>That’s all! You’ve just created a smarter agent that learns to play Doom. Awesome! Remember that if you want to have an agent with really good performance, <strong>you need many more GPU hours (about two days of training)!</strong></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*pN5raRODUzEQOLw0egyXYg.gif" alt="Image" width="637" height="469" loading="lazy"></p>
<p><strong>However, with only 2–3 hours of training on CPU</strong> (yes CPU), our agent understood that they needed to kill enemies before being able to move forward. If they move forward without killing enemies, they will be killed before getting the vest.</p>
<p>Don’t forget to implement each part of the code by yourself. It’s really important to try to modify the code I gave you. Try to add epochs, change the architecture, add fixed Q-values, change the learning rate, use a harder environment…and so on. Experiment, have fun!</p>
<p>Remember that this was a big article, so be sure to really understand why we use these new strategies, how they work, and the advantages of using them.</p>
<p>In the next article, we’ll learn about an awesome hybrid method between value-based and policy-based reinforcement learning algorithms. <strong>This is a baseline for state-of-the-art algorithms</strong>: Advantage Actor Critic (A2C). You’ll implement an agent that learns to play Outrun!</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*0M5OiOwKemAwkObBy1K6VQ.gif" alt="Image" width="600" height="338" loading="lazy"></p>
<p>If you have any thoughts, comments, questions, feel free to comment below or send me an email: hello@simoninithomas.com, or tweet me <a target="_blank" href="https://twitter.com/ThomasSimonini">@ThomasSimonini</a>.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*_yN1FzvEFDmlObiYsstIzg.png" alt="Image" width="500" height="77" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*mD-f5VN1SWYvhrZAbvSu_w.png" alt="Image" width="500" height="77" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*PqiptT-Cdi8uwosxuFn2DQ.png" alt="Image" width="500" height="77" loading="lazy"></p>
<p>Keep learning, stay awesome!</p>
<h4 id="heading-deep-reinforcement-learning-course-with-tensorflow">Deep Reinforcement Learning Course with TensorFlow</h4>
<p><a target="_blank" href="https://simoninithomas.github.io/Deep_reinforcement_learning_Course/">Syllabus</a></p>
<p><a target="_blank" href="https://www.youtube.com/channel/UC8XuSf1eD9AF8x8J19ha5og?view_as=subscriber">Video version</a></p>
<p>Part 1: <a target="_blank" href="https://medium.com/p/4339519de419/edit">An introduction to Reinforcement Learning</a></p>
<p>Part 2: <a target="_blank" href="https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe">Diving deeper into Reinforcement Learning with Q-Learning</a></p>
<p>Part 3: <a target="_blank" href="https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8">An introduction to Deep Q-Learning: let’s play Doom</a></p>
<p>Part 3+: <a target="_blank" href="https://medium.freecodecamp.org/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682">Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets</a></p>
<p>Part 4: <a target="_blank" href="https://medium.freecodecamp.org/an-introduction-to-policy-gradients-with-cartpole-and-doom-495b5ef2207f">An introduction to Policy Gradients with Doom and Cartpole</a></p>
<p>Part 5: <a target="_blank" href="https://medium.freecodecamp.org/an-intro-to-advantage-actor-critic-methods-lets-play-sonic-the-hedgehog-86d6240171d">An intro to Advantage Actor Critic methods: let’s play Sonic the Hedgehog!</a></p>
<p>Part 6: <a target="_blank" href="https://towardsdatascience.com/proximal-policy-optimization-ppo-with-sonic-the-hedgehog-2-and-3-c9c21dbed5e">Proximal Policy Optimization (PPO) with Sonic the Hedgehog 2 and 3</a></p>
<p>Part 7: <a target="_blank" href="https://towardsdatascience.com/curiosity-driven-learning-made-easy-part-i-d3e5a2263359">Curiosity-Driven Learning made easy Part I</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ In need of evolution: game theory and AI ]]>
                </title>
                <description>
                    <![CDATA[ By Elena Nisioti Artificial Intelligence (AI) is full of questions that cannot be answered and answers that cannot be assigned to the correct questions. In the past, it paid for its persistence to wrong practices with periods of stagnation, known as ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/game-theory-and-ai-where-it-all-started-and-where-it-should-all-stop-82f7bd53a3b4/</link>
                <guid isPermaLink="false">66c34b2ea7aea9fc97bdfb29</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Game Theory ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Reinforcement Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Sat, 12 May 2018 21:07:00 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*ddulq7MNPa7NPjVf5xCuEw.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Elena Nisioti</p>
<p>Artificial Intelligence (AI) is full of questions that cannot be answered and answers that cannot be assigned to the correct questions. In the past, it paid for persisting in wrong practices with periods of stagnation, known as AI winters. The calendar of AI, however, has just reached spring, and the applications are flourishing.</p>
<p>Yet, there is a branch of AI that has long been neglected. We are talking about reinforcement learning, which has recently exhibited impressive results in games, as with <a target="_blank" href="https://www.deepmind.com/research/highlighted-research/alphago">AlphaGo</a> and <a target="_blank" href="https://www.deepmind.com/publications/playing-atari-with-deep-reinforcement-learning">Atari</a>. But let’s be honest: these were not reinforcement learning wins. What got deeper in these cases was the deep neural networks, and not our understanding of reinforcement learning, which maintains the depth it achieved decades ago.</p>
<p>Even worse is the case of reinforcement learning when applied to real life problems. If training a robot to balance on a rope sounds hard, try training a team of robots to win a football game, or a team of drones to monitor a moving target.</p>
<p>Before we lose the branch, or even worse the tree, we must sharpen our understanding of these applications. Game theory is the most common approach to studying teams of players that share a common goal. It can lend us tools to guide learning algorithms in these settings.</p>
<p>But let’s see why the common approach is not a common sense approach.</p>
<blockquote>
<p>To kill an error is as good a service as, and sometimes even better than, the establishing of a new truth or fact. <em>— Charles Darwin</em></p>
</blockquote>
<p>First, let’s dirty our hands with some terminology and basics of these areas.</p>
<h3 id="heading-game-theory">Game theory</h3>
<h4 id="heading-some-useful-terms"><strong>Some useful terms</strong></h4>
<ul>
<li><strong>Game:</strong> as in the popular understanding of games, any setting where players take actions and the outcome depends on those actions.</li>
<li><strong>Player:</strong> a strategic decision-maker within a game.</li>
<li><strong>Strategy:</strong> a complete plan of actions a player will take, given the set of circumstances that might arise within the game.</li>
<li><strong>Payoff:</strong> the gain a player receives from arriving at a particular outcome of a game.</li>
<li><strong>Equilibrium:</strong> the point in a game where both players have made their decisions and an outcome is reached.</li>
<li><strong>Nash equilibrium:</strong> an equilibrium in which no player can gain by changing their own strategy if the strategies of the other players remain unchanged.</li>
<li><strong>Dominant strategy:</strong> occurs when one strategy is better than another strategy for one player, no matter how that player’s opponents may play.</li>
</ul>
<h4 id="heading-prisoners-dilemma"><strong>Prisoner’s dilemma</strong></h4>
<p>This is probably the most famous game in the literature. The figure below presents its payoff matrix. Now, a payoff matrix is worth a thousand words. It is sufficient, to an experienced eye, to provide all the information necessary to describe a game. But let’s be a bit less laconic.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/3C9rRuM6lrhidno7Ihm5g82g2CLkNPSMrD0R" alt="Image" width="271" height="221" loading="lazy">
<em>Prisoner’s dilemma payoff matrix</em></p>
<p>The police arrest two criminals, criminal A and criminal B. Although quite notorious, the criminals cannot be imprisoned for the crime under investigation due to lack of evidence. But they can be held for lesser charges.</p>
<p>The length of their imprisonment will depend on what they will say in the interrogation room, which gives rise to the game. Each criminal (player) is given the chance to either stay silent or snitch on the other criminal (player). The payoff matrix depicts how many years each player will be imprisoned depending on the outcome. For example, if player A stays silent and player B snitches on them, player A will serve 3 years (-3) and player B will serve none (0).</p>
<p>If you review the payoff matrix carefully, you will find that the logical action for a player is to betray the other person or, in game-theoretic terms, that betraying is the dominant strategy. This leads to the Nash equilibrium of the game, where each player has a payoff of -2.</p>
<p>Does something feel odd? Yes, or at least it should. If both players somehow agreed to remain silent, they would both get a higher reward of -1. The prisoner’s dilemma is an example of a game where rationality leads to a worse result than cooperation would.</p>
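<p>The reasoning above can be checked mechanically. The sketch below encodes the payoffs quoted in the text (the entry where A snitches and B stays silent is filled in by symmetry, which is an assumption) and searches for a dominant strategy and for pure-strategy Nash equilibria. The function names are hypothetical and chosen for this example.</p>

```python
from itertools import product

ACTIONS = ("silent", "snitch")

# PAYOFFS[(action_a, action_b)] = (payoff_a, payoff_b), in years of prison (negative).
# The ("snitch", "silent") entry is assumed by symmetry with the example in the text.
PAYOFFS = {
    ("silent", "silent"): (-1, -1),
    ("silent", "snitch"): (-3, 0),
    ("snitch", "silent"): (0, -3),
    ("snitch", "snitch"): (-2, -2),
}


def payoff(player, own, other):
    """Payoff of `player` (0 for A, 1 for B) when playing `own` against `other`."""
    profile = (own, other) if player == 0 else (other, own)
    return PAYOFFS[profile][player]


def dominant_strategy(player):
    """Return an action that beats every alternative whatever the opponent does, or None."""
    for candidate in ACTIONS:
        alternatives = [a for a in ACTIONS if a != candidate]
        if all(payoff(player, candidate, opp) > payoff(player, alt, opp)
               for alt in alternatives for opp in ACTIONS):
            return candidate
    return None


def nash_equilibria():
    """Pure-strategy profiles where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        a_stays = all(payoff(0, a, b) >= payoff(0, alt, b) for alt in ACTIONS)
        b_stays = all(payoff(1, b, a) >= payoff(1, alt, a) for alt in ACTIONS)
        if a_stays and b_stays:
            equilibria.append((a, b))
    return equilibria


print(dominant_strategy(0), dominant_strategy(1))
print(nash_equilibria())
```

<p>Under these payoffs, snitching is dominant for both players, and mutual snitching is the only pure-strategy equilibrium, even though mutual silence would leave both better off.</p>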
<h4 id="heading-some-historical-remarks"><strong>Some historical remarks</strong></h4>
<p>Game theory originated in economics, but is today an interdisciplinary area of study. Its father, John von Neumann (you will notice that Johns have serious career prospects in this area), was the first to give a strict formulation to the common notion of a game. He restricted his studies to games of two players, as they were easier to analyze.</p>
<p>He then co-authored a book with Oskar Morgenstern, which laid the foundations for expected utility theory and shaped the course of game theory. Around that time, John Nash introduced the concept of Nash equilibria, which helps describe the outcome of a game.</p>
<h3 id="heading-reinforcement-learning">Reinforcement learning</h3>
<p>It did not take long to realize how vast the applications of game theory can be. From games to biology, philosophy and, wait for it, artificial intelligence. Game theory is nowadays closely related to settings where multiple players learn through reinforcement, an area called multi-agent reinforcement learning. Examples of applications in this case are teams of robots, where each player has to learn how to behave in favor of its team.</p>
<h4 id="heading-some-useful-terms-1"><strong>Some useful terms</strong></h4>
<ul>
<li><strong>Agent:</strong> equivalent to a player.</li>
<li><strong>Reward:</strong> equivalent to a payoff.</li>
<li><strong>State:</strong> all the information necessary to describe the situation an agent is in.</li>
<li><strong>Action:</strong> equivalent of a move in a game.</li>
<li><strong>Policy:</strong> similar to a strategy, it defines the action an agent will take when in particular states.</li>
<li><strong>Environment:</strong> everything the agent interacts with during learning.</li>
</ul>
<h4 id="heading-applications">Applications</h4>
<p>Imagine the following scenario: a team of drones is unleashed into a forest in order to predict and locate fires early enough for the firefighters to respond. The drones are autonomous and must explore the forest, learn which conditions are likely to cause fire, and cooperate with each other, so that they cover wide areas of the forest using little battery and communication.</p>
<p>This application belongs to the area of environmental monitoring, where AI can lend its predictive skills to human intervention. In a technological world that is becoming increasingly complex and a physical world under threat, we can paraphrase <a target="_blank" href="https://www.brainyquote.com/quotes/rudyard_kipling_118509">Kipling’s quote</a> to “Man could not be everywhere, and therefore he made drones.”</p>
<p>Decentralized architectures are another interesting application field. Technologies like the <a target="_blank" href="https://en.wikipedia.org/wiki/Internet_of_things">Internet of Things</a> and Blockchain create immense networks. Information and processing are distributed across different physical entities, a trait that has been acknowledged to offer privacy, efficiency and democratization.</p>
<p>Regardless of whether you want to use sensors to minimize energy consumption in the households of a country, or replace the banking system, decentralized is the new sexy.</p>
<p>Making these networks smart, however, is challenging, as most of the AI algorithms we are proud of are data- and computation-hungry. Reinforcement learning algorithms can be employed for efficient data processing and rendering the network adaptive to changes in its environment. In this case, it is interesting, and to the benefit of overall efficiency, to study how the individual algorithms will cooperate.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/RcvtMhf4sDKByN00Jncb2s9aMCNZ5cB39fmv" alt="Image" width="800" height="600" loading="lazy">
<em>Deep or collective learning? AI research has based its harvest on increasingly deeper networks, but it could be that the answers to challenging problems come from collective knowledge, not deep-rooted individuals. Did we miss the forest?</em></p>
<h3 id="heading-not-just-a-game">Not just a game</h3>
<p>Translating AI problems to simple games like the prisoner’s dilemma is tempting. This is a usual practice when testing new techniques, as it offers a computationally cheap and intuitive testbed. Nevertheless, it is important not to ignore the effect that the practical characteristics of the problem, such as noise, delays, and finite memory, have on the algorithm.</p>
<p>Perhaps the most misleading assumption in AI research is that of representing interaction with iterated static games. For example, an algorithm can apply the prisoner’s dilemma game every time it wants to make a decision, a formulation that assumes that the agent has not learned, or changed, along the way. But what about the effect learning will have on the behavior of the agent? Won’t interaction with others affect its strategy?</p>
<p>Research in this area has focused on the <a target="_blank" href="https://en.wikipedia.org/wiki/The_Evolution_of_Cooperation">evolution of cooperation</a>, and Robert Axelrod has studied the optimal strategies that arise in the iterated version of the prisoner’s dilemma. The <a target="_blank" href="https://en.wikipedia.org/wiki/The_Evolution_of_Cooperation#Axelrod%27s_tournaments">tournaments</a> that Axelrod organized revealed that strategies that adapt with time and interaction, even ones as simple as Tit-for-Tat, are very effective. The AI community has <a target="_blank" href="https://arxiv.org/abs/1803.00162">recently</a> investigated learning under the <strong>sequential prisoner’s dilemma</strong>, but research in this area is still at an early stage.</p>
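<p>To make the iterated setting concrete, here is a small sketch of a repeated prisoner’s dilemma between two fixed strategies. It is not a reproduction of Axelrod’s tournaments: the payoffs reuse the matrix from earlier in this article, and the strategy functions and the <code>play</code> helper are names invented for this example.</p>

```python
# Payoffs (years of prison, negative) reused from the prisoner's dilemma matrix above.
PAYOFFS = {
    ("silent", "silent"): (-1, -1),
    ("silent", "snitch"): (-3, 0),
    ("snitch", "silent"): (0, -3),
    ("snitch", "snitch"): (-2, -2),
}


def tit_for_tat(own_history, opponent_history):
    """Cooperate first, then copy whatever the opponent did in the previous round."""
    return "silent" if not opponent_history else opponent_history[-1]


def always_snitch(own_history, opponent_history):
    """Betray in every round, regardless of history."""
    return "snitch"


def play(strategy_a, strategy_b, rounds=10):
    """Play the game repeatedly and return the total payoff of each player."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_a, history_b)
        move_b = strategy_b(history_b, history_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b


print(play(tit_for_tat, tit_for_tat))      # two adaptive players
print(play(tit_for_tat, always_snitch))    # adaptive player against a constant betrayer
print(play(always_snitch, always_snitch))  # two constant betrayers
```

<p>Two Tit-for-Tat players settle into mutual silence, while a Tit-for-Tat player facing a constant betrayer is exploited only in the first round and then retaliates. This is the kind of history-dependent behavior that a single static game cannot capture.</p>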
<p>What differentiates multi-agent from single-agent learning is the increased complexity. Training one deep neural network is already enough of a pain, while adding new networks, as parts of the agents, makes the problem exponentially harder.</p>
<p>One less obvious but more important concern is the lack of theoretical guarantees for this kind of problem. Single-agent reinforcement learning is a well-understood area, as Richard Bellman and Christopher Watkins have offered the algorithms and proofs necessary to learn. In the multi-agent case, however, the proofs lose their validity.</p>
<p>Just to illustrate some of the mind-puzzling difficulties that arise: an agent executes a learning algorithm to learn how to react optimally to its environment. In our case, the environment includes the other agents, which also execute the learning algorithm. Thus, the algorithm has to consider the effect of its action before it acts.</p>
<h3 id="heading-the-early-concerns"><strong>The early concerns</strong></h3>
<p>The concerns start where game theory started: in economics. Let’s begin with some assumptions made when studying a system under classical game theory.</p>
<p><strong>Rationality:</strong> generally in game theory, and in order to derive Nash equilibria, perfect rationality is assumed. This roughly means that agents always act for their own sake.</p>
<p><strong>Complete information:</strong> each agent knows everything about the game, including the rules, what the other players know, and what their strategies are.</p>
<p><strong>Common knowledge:</strong> there is common knowledge of a fact <strong>p</strong> in a group of agents when: all the agents know <strong>p</strong>, they all know that all agents know <strong>p</strong>, they all know that they all know that all agents know <strong>p</strong>, and so on <strong>ad infinitum</strong>. There are interesting puzzles, like the <a target="_blank" href="http://mesosyn.com/mental1-2.html">blue-eyed islanders</a>, that describe the effect common knowledge has on a problem.</p>
<p>In 1986, Kenneth Arrow expressed his reservations towards classical game theory.</p>
<blockquote>
<p>In <a target="_blank" href="http://dieoff.org/_Economics/RationalityOfSelfAndOthersArrow.pdf">this paper</a>, I want to disentangle some of the senses in which the hypothesis of rationality is used in economic theory. In particular, I want to stress that rationality is not a property of the individual alone, although it is usually presented that way. Rather, it gathers not only its force but also its very meaning from the social context in which it is embedded. It is most plausible under very ideal conditions. When these conditions cease to hold, the rationality assumptions become strained and possibly even self-contradictory.</p>
</blockquote>
<p>If you find that Arrow is a bit harsh with classical game theory, how rational would you say your last purchases have been? Or, how much consciousness and effort did you put into your meal today?</p>
<p>But Arrow is not so much worried about the assumption of rationality. He is worried about the implications of it. For an agent to be rational, you need to provide them with all the information necessary to make their decisions. This calls for omniscient players, which is bad in two ways: first, it creates impractical requirements for information storing and processing of players. Second, game theory is no longer a <strong>game theory</strong>, as you can replace all players by a central ruler (and where is the fun in that?).</p>
<p>The value of information in this view is another point of interest. We have already discussed that possessing all the information is infeasible. But what about assuming players with limited knowledge? Would that help?</p>
<p>You may ask anyone involved in this area, but it suffices to say that optimization under uncertainty is tough. Yes, there still are the good old Nash equilibria. The problem is that there can be infinitely many of them, and game theory does not provide you with arguments to choose among them. So, even if you reach one, you shouldn’t make such a big deal of it.</p>
<h3 id="heading-reinforcement-learning-concerns"><strong>Reinforcement learning concerns</strong></h3>
<p>By this point you should suspect that AI applications are much more complicated than the examples classical game theory concerns itself with. Just to mention a few obstacles on the path of applying the Nash equilibrium approach in a robotic application: imagine being the captain of a team of robots playing football in RoboCup. How fast, strong, and intelligent are your players and your opponents? What strategies does the opponent team use? How should you reward your players? Is a goal the only reason for congratulating, or will applauding a good pass also improve the team’s behavior? Clearly, just being familiar with the rules of football will not win you the game.</p>
<p>If game theory has been raising debates for decades, if it has been founded on unrealistic assumptions and, for realistic tasks, if it offers complicated and little-understood solutions, why are we still going for it? Well, plainly enough, it’s the only thing we’ve got when it comes to group reasoning. If we actually understood how groups interact and cooperate to achieve their goals, psychology and politics would be much clearer.</p>
<p>Researchers in the area of multi-agent reinforcement learning either completely omit a discussion on the theoretical properties of their algorithms (and nevertheless often exhibit good results) or traditionally study the existence of Nash equilibria. The latter approach seems, to the eyes of a young researcher in the field, like a struggle to prove, under severe, unrealistic assumptions, the theoretical existence of solutions that, being infinite and of questionable value, will never be leveraged in practice.</p>
<h3 id="heading-evolutionary-game-theory"><strong>Evolutionary game theory</strong></h3>
<p>The inception of evolutionary game theory is not recent, yet its far-reaching applications in the area of AI took a long time to be acknowledged. Originating in biology, it was introduced in 1973 by John Maynard Smith and George R. Price as an alternative to classical game theory. The alterations are so profound that we can talk about a whole new approach.</p>
<p>The subject of reasoning is no longer the player itself, but the population of players. Thus, probabilistic strategies are defined as the percentage of players that make a choice, not the probability of one player choosing an action as in classical game theory. This removes the necessity for rational, omniscient agents, as strategies evolve as patterns of behavior. The evolution process resembles Darwinian theory. Players reproduce following the principles of survival of the fittest and random mutations, and can be elegantly described by a set of differential equations, termed the <strong>replicator dynamics</strong>.</p>
<p>We can see the three important parts of this system in the illustration below. A population represents the team of agents, and is characterized by a mixture of strategies. The game rules determine the payoffs of the population, which can also be seen as the fitness values of an evolutionary algorithm. Finally, the replicator rules describe how the population will evolve based on the fitness values and the mathematical properties of the evolution process.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/pcZhSDCQhuD1w4AMlmVxdNV-M0cymDbheIM8" alt="Image" width="800" height="600" loading="lazy">
<em>Image credit: By HowieKor [<a target="_blank" href="https://creativecommons.org/licenses/by-sa/3.0">CC BY-SA 3.0</a>], from Wikimedia Commons</em></p>
<p>The notion and pursuit of Nash equilibria is replaced by <strong>evolutionarily stable strategies</strong>. A strategy can bear this characterization if it is immune to an invasion by a population of agents that follow another strategy, provided that the invading population is small. Thus, the behavior of the team can be studied with well-understood tools from the stability of dynamical systems, such as <a target="_blank" href="https://en.wikipedia.org/wiki/Lyapunov_stability">Lyapunov stability</a>.</p>
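<p>For pure strategies in a symmetric two-player game, the usual Maynard Smith conditions for evolutionary stability can be checked directly from the payoff matrix. The sketch below is a simplified test that only considers pure invading strategies; the <code>is_ess</code> name and the reuse of the prisoner’s dilemma payoffs are choices made for this example.</p>

```python
def is_ess(strategy, payoff_matrix):
    """Check whether a pure strategy resists invasion by every other pure strategy.

    payoff_matrix[i][j] is the payoff of strategy i against strategy j.
    The strategy is stable against an invader t when either it does strictly
    better against itself than t does, or they tie there and the strategy
    does strictly better against t than t does against itself.
    """
    s = strategy
    for t in range(len(payoff_matrix)):
        if t == s:
            continue
        strictly_better = payoff_matrix[s][s] > payoff_matrix[t][s]
        tie_broken = (payoff_matrix[s][s] == payoff_matrix[t][s]
                      and payoff_matrix[s][t] > payoff_matrix[t][t])
        if not (strictly_better or tie_broken):
            return False
    return True


# Index 0 = silent, index 1 = snitch, with the payoffs used earlier in the article.
PRISONERS_DILEMMA = [
    [-1, -3],
    [0, -2],
]

print(is_ess(0, PRISONERS_DILEMMA))  # can a silent population be invaded?
print(is_ess(1, PRISONERS_DILEMMA))  # can a snitching population be invaded?
```

<p>With these payoffs, a population of silent players can be invaded by a few snitching players, while a population of snitching players cannot be invaded by silent ones.</p>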
<blockquote>
<p>The attainment of equilibrium requires a disequilibrium process. What does rational behavior mean in the presence of disequilibrium? Do individuals speculate on the equilibrating process? If they do, can the disequilibrium be regarded as, in some sense, a higher-order equilibrium process?</p>
</blockquote>
<p>In the above passage, Arrow seems to be struggling to pinpoint the dynamic properties of a game. Could evolutionary game theory be an answer to his questions?</p>
<p>Quite recently, famous reinforcement learning algorithms, such as <a target="_blank" href="https://link.springer.com/chapter/10.1007/978-3-540-39857-8_38">Q-learning</a>, were studied under this new approach and significant conclusions were drawn. How this new tool is used ultimately depends on the application.</p>
<p>We can follow the forward approach, to derive the dynamic model of a learning algorithm. Or the inverse, where we start from some desired dynamic properties and engineer a learning algorithm that exhibits them.</p>
<p>We can use the replicator dynamics descriptively, to visualize convergence. Or prescriptively, to tune the algorithm in order to converge to optimal solutions. The latter can immensely reduce the complexity entailed in training deep networks for tough tasks that we face today, by removing the need for blind tuning.</p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>It’s not hard to trace when and why the paths of game theory and AI became convoluted. What’s harder, however, is to overlook the restrictions AI, and in particular multi-agent reinforcement learning, has to face when following classical game theoretic approaches.</p>
<p>Evolutionary game theory sounds promising, offering both theoretical tools and practical advantages, but we won’t really know until we try it. In this case, evolution will not arise naturally, but out of a conscious struggle of the research community for improvement. But isn’t that the essence of evolution?</p>
<p>It takes some effort to deviate from where inertia is pushing you, but reinforcement learning, despite general successes in AI, is in serious need of a lift.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ An introduction to Reinforcement Learning ]]>
                </title>
                <description>
                    <![CDATA[ By Thomas Simonini Reinforcement learning is an important type of Machine Learning where an agent learns how to behave in an environment by performing actions and seeing the results. In recent years, we’ve seen a lot of improvements in this fascinating... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/an-introduction-to-reinforcement-learning-4339519de419/</link>
                <guid isPermaLink="false">66c344570bafa8455505c67b</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Reinforcement Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Sat, 31 Mar 2018 06:16:59 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*0gd5LIk1e7RWF3HygxgH-g.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Thomas Simonini</p>
<p>Reinforcement learning is an important type of Machine Learning where an agent learns how to behave in an environment by performing actions and seeing the results.</p>
<p>In recent years, we’ve seen a lot of improvements in this fascinating area of research. Examples include <a target="_blank" href="https://deepmind.com/research/dqn/">DeepMind and the Deep Q learning architecture</a> in 2014, <a target="_blank" href="https://deepmind.com/research/alphago/">beating the champion of the game of Go with AlphaGo</a> in 2016, <a target="_blank" href="https://blog.openai.com/openai-baselines-ppo/">OpenAI and the PPO</a> in 2017, amongst others.</p>
<p>In this series of articles, we will focus on learning the different architectures used today to solve Reinforcement Learning problems. These will include Q-learning, Deep Q-learning, Policy Gradients, Actor Critic, and PPO.</p>
<p>In this first article, you’ll learn:</p>
<ul>
<li>What Reinforcement Learning is, and how rewards are the central idea</li>
<li>The three approaches of Reinforcement Learning</li>
<li>What the “Deep” in Deep Reinforcement Learning means</li>
</ul>
<p>It’s really important to master these elements before diving into implementing Deep Reinforcement Learning agents.</p>
<p>The idea behind Reinforcement Learning is that an agent will learn from the environment by interacting with it and receiving rewards for performing actions.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*zySSJwywQGerKSbjHBtkyg.png" alt="Image" width="600" height="350" loading="lazy"></p>
<p>Learning from interaction with the environment comes from our natural experiences. Imagine you’re a child in a living room. You see a fireplace, and you approach it.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*aQuWM51KnoGIUGTGNzoRIw.png" alt="Image" width="600" height="350" loading="lazy"></p>
<p>It’s warm, it’s positive, you feel good <em>(Positive Reward +1).</em> You understand that fire is a positive thing.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*5shp6Uzu7XT41vrOJ7-3gw.png" alt="Image" width="600" height="350" loading="lazy"></p>
<p>But then you try to touch the fire. Ouch! It burns your hand <em>(Negative reward -1)</em>. You’ve just understood that fire is positive when you are a sufficient distance away, because it produces warmth. But get too close to it and you will be burned.</p>
<p>That’s how humans learn, through interaction. Reinforcement Learning is just a computational approach to learning from action.</p>
<h3 id="heading-the-reinforcement-learning-process">The Reinforcement Learning Process</h3>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*aKYFRoEmmKkybqJOvLt2JQ.png" alt="Image" width="800" height="364" loading="lazy"></p>
<p>Let’s imagine an agent learning to play Super Mario Bros as a working example. The Reinforcement Learning (RL) process can be modeled as a loop that works like this:</p>
<ul>
<li>Our Agent receives <strong>state S0</strong> from the <strong>Environment</strong> (In our case we receive the first frame of our game (state) from Super Mario Bros (environment))</li>
<li>Based on that <strong>state S0,</strong> the agent takes an <strong>action A0</strong> (our agent will move right)</li>
<li>Environment transitions to a <strong>new</strong> <strong>state S1</strong> (new frame)</li>
<li>Environment gives some <strong>reward R1</strong> to the agent (not dead: +1)</li>
</ul>
<p>This RL loop outputs a sequence of <strong>state, action and reward.</strong></p>
<p>The goal of the agent is to maximize the expected cumulative reward.</p>
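<p>The loop above can be sketched in a few lines of code. The environment here is a made-up corridor rather than Super Mario Bros, and the <code>CorridorEnv</code> class, its <code>reset</code> and <code>step</code> methods, and the <code>run_episode</code> helper are all hypothetical names chosen for this illustration.</p>

```python
class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor until the end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start a new episode and return the initial state S0."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action and return (new state, reward, done)."""
        move = 1 if action == "right" else -1
        self.position = max(0, self.position + move)
        done = self.position >= self.length
        reward = 1  # mirrors the "not dead: +1" reward from the loop above
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run the agent-environment loop and record (state, action, reward) tuples."""
    trajectory = []
    state = env.reset()                               # the agent receives S0
    for _ in range(max_steps):
        action = policy(state)                        # the agent picks A_t from S_t
        next_state, reward, done = env.step(action)   # the environment returns S_t+1 and R_t+1
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


def always_right(state):
    """A fixed policy that ignores the state and always moves right."""
    return "right"


episode = run_episode(CorridorEnv(length=3), always_right)
print(episode)
print(sum(reward for _, _, reward in episode))
```

<p>The recorded list is exactly the sequence of states, actions, and rewards the loop produces, and the final sum is the cumulative reward the agent is trying to maximize. A learning agent would replace the fixed policy with one that improves from these rewards.</p>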
<h4 id="heading-the-central-idea-of-the-reward-hypothesis">The central idea of the Reward Hypothesis</h4>
<p>Why is the goal of the agent to maximize the expected cumulative reward?</p>
<p>Well, Reinforcement Learning is based on the idea of the reward hypothesis. All goals can be described by the maximization of the expected cumulative reward.</p>
<p><strong>That’s why in Reinforcement Learning, to have the best behavior, we need to maximize the expected cumulative reward.</strong></p>
<p>The cumulative reward at each time step t can be written as:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0*ylz4lplMffGQR_g3." alt="Image" width="524" height="51" loading="lazy"></p>
<p>Which is equivalent to:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*AFAuM1Y8zmso4yB5mOApZA.png" alt="Image" width="525" height="224" loading="lazy">
<em>Thanks to <a target="_blank" href="https://twitter.com/pierrelux">Pierre-Luc Bacon</a> for the correction</em></p>
<p>However, in reality, we can’t just add the rewards like that. The rewards that come sooner (at the beginning of the game) are more likely to happen, since they are more predictable than long-term future rewards.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*tciNrjN6pW60-h0PiQRiXg.png" alt="Image" width="500" height="502" loading="lazy"></p>
<p>Let’s say your agent is this small mouse and your opponent is the cat. Your goal is to eat the maximum amount of cheese before being eaten by the cat.</p>
<p>As we can see in the diagram, we are more likely to eat the cheese near us than the cheese close to the cat (the closer we are to the cat, the more dangerous it is).</p>
<p>As a consequence, the reward near the cat, even if it is bigger (more cheese), will be discounted. We’re not really sure we’ll be able to eat it.</p>
<p>To discount the rewards, we proceed like this:</p>
<p>We define a discount rate called gamma. It must be between 0 and 1.</p>
<ul>
<li>The larger the gamma, the smaller the discount. This means the learning agent cares more about the long term reward.</li>
<li>On the other hand, the smaller the gamma, the bigger the discount. This means our agent cares more about the short term reward (the nearest cheese).</li>
</ul>
<p>Our discounted expected cumulative reward is:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*zrzRTXt8rtWF5fX__kZ-yQ.png" alt="Image" width="800" height="281" loading="lazy">
<em>Thanks to <a target="_blank" href="https://twitter.com/pierrelux">Pierre-Luc Bacon</a> for the correction</em></p>
<p>Simply put, each reward is discounted by gamma raised to the power of the time step. As the time step increases, the cat gets closer to us, so the future reward becomes less and less likely to happen.</p>
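<p>To make the discounting concrete, here is a small Python sketch. The function name <code>discounted_return</code> and the reward list are made up for illustration. The list stands for the rewards collected from the current time step onward.</p>

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the k-th future reward by gamma ** k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


# Hypothetical rewards: three small pieces of cheese, then a big one near the cat.
rewards = [1, 1, 1, 10]

print(discounted_return(rewards, 1.0))  # 13.0 (no discount)
print(discounted_return(rewards, 0.5))  # 3.0  (the far-away +10 only counts as 1.25)
```

<p>With a gamma close to 1 the big, distant reward dominates. With a smaller gamma the nearby cheese matters most.</p>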
<h3 id="heading-episodic-or-continuing-tasks">Episodic or Continuing tasks</h3>
<p>A task is an instance of a Reinforcement Learning problem. We can have two types of tasks: episodic and continuing.</p>
<h4 id="heading-episodic-task"><strong>Episodic task</strong></h4>
<p>In this case, we have a starting point and an ending point <strong>(a terminal state). This creates an episode</strong>: a list of States, Actions, Rewards, and New States.</p>
<p>For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario and ends when you’re killed or when you reach the end of the level.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*PPs51sGAtRKJft0iUCw6VA.png" alt="Image" width="256" height="240" loading="lazy">
<em>Beginning of a new episode</em></p>
<h4 id="heading-continuous-tasks">Continuing tasks</h4>
<p><strong>These are tasks that continue forever (no terminal state).</strong> In this case, the agent has to learn how to choose the best actions while simultaneously interacting with the environment.</p>
<p>For instance, consider an agent that does automated stock trading. For this task, there is no starting point or terminal state. <strong>The agent keeps running until we decide to stop it.</strong></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*5T_Ta3QauHUEMUCzev6Wyw.jpeg" alt="Image" width="800" height="481" loading="lazy"></p>
<h3 id="heading-monte-carlo-vs-td-learning-methods">Monte Carlo vs TD Learning methods</h3>
<p>We have two ways of learning:</p>
<ul>
<li>Collecting the rewards <strong>at the end of the episode</strong> and then calculating the <strong>maximum expected future reward</strong>: <em>Monte Carlo Approach</em></li>
<li>Estimating <strong>the rewards at each step</strong>: <em>Temporal Difference Learning</em></li>
</ul>
<h4 id="heading-monte-carlo">Monte Carlo</h4>
<p>When the episode ends (the agent reaches a “terminal state”), <strong>the agent looks at the total cumulative reward to see how well it did.</strong> In the Monte Carlo approach, rewards are only <strong>received at the end of the game.</strong></p>
<p>Then, we start a new game with the added knowledge. <strong>The agent makes better decisions with each iteration.</strong></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*RLLzQl4YadpbhPlxpa5f6A.png" alt="Image" width="800" height="192" loading="lazy"></p>
<p>Let’s take an example:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*tciNrjN6pW60-h0PiQRiXg.png" alt="Image" width="500" height="502" loading="lazy"></p>
<p>If we take the maze environment:</p>
<ul>
<li>We always start at the same starting point.</li>
<li>We terminate the episode if the cat eats us or if we move more than 20 steps.</li>
<li>At the end of the episode, we have a list of States, Actions, Rewards, and New States.</li>
<li>The agent will sum the total rewards Gt (to see how well it did).</li>
<li>It will then update V(st) based on the formula above.</li>
<li>Then start a new game with this new knowledge.</li>
</ul>
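<p>The steps above can be sketched in Python as follows. This assumes the formula is the usual constant-step update, V(St) ← V(St) + α [Gt − V(St)]. The function name, the state labels, and the reward numbers are all hypothetical.</p>

```python
def monte_carlo_update(values, episode, alpha=0.1, gamma=1.0):
    """Update V(s) for every state visited in one finished episode.

    episode is a list of (state, reward) pairs, where reward is the
    reward received after acting in that state.
    """
    g = 0.0
    # Walk backwards so that g holds the return G_t from each step onward.
    for state, reward in reversed(episode):
        g = reward + gamma * g
        old = values.get(state, 0.0)  # unseen states start at 0
        values[state] = old + alpha * (g - old)
    return values


values = {}
episode = [("start", 1), ("corridor", 1), ("cheese", 3)]
monte_carlo_update(values, episode, alpha=0.5)
print(values)
# {'cheese': 1.5, 'corridor': 2.0, 'start': 2.5}
```

<p>Note that nothing is updated until the whole episode is available, which is exactly what separates Monte Carlo from the step-by-step approach.</p>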
<p>By running more and more episodes, <strong>the agent will learn to play better and better.</strong></p>
<h4 id="heading-temporal-difference-learning-learning-at-each-time-step">Temporal Difference Learning: learning at each time step</h4>
<p>TD Learning, on the other hand, will not wait until the end of the episode to update <strong>the maximum expected future reward estimation: it will update its value estimate V for each non-terminal state St it encounters during that experience.</strong></p>
<p>This method is called TD(0) or <strong>one step TD (update the value function after any individual step).</strong></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*LLfj11fivpkKZkwQ8uPi3A.png" alt="Image" width="800" height="294" loading="lazy"></p>
<p>TD methods <strong>only wait until the next time step to update the value estimates.</strong> At time t+1 they immediately <strong>form a TD target using the observed reward Rt+1 and the current estimate V(St+1).</strong></p>
<p>The TD target is itself an estimate: you update the previous estimate V(St) <strong>by moving it towards a one-step target.</strong></p>
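<p>Here is a minimal Python sketch of a one-step TD update, assuming the standard TD(0) rule V(St) ← V(St) + α [Rt+1 + γ V(St+1) − V(St)]. The function name <code>td0_update</code>, the state labels, and the numbers are illustrative only.</p>

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9, done=False):
    """Move V(state) towards the one-step TD target and return the new value."""
    # A terminal next state has no future, so its value counts as 0.
    next_value = 0.0 if done else values.get(next_state, 0.0)
    td_target = reward + gamma * next_value
    old = values.get(state, 0.0)
    values[state] = old + alpha * (td_target - old)
    return values[state]


values = {"A": 0.0, "B": 2.0}
# The agent moved from A to B and received a reward of 1.
print(td0_update(values, "A", 1, "B", alpha=0.5, gamma=0.5))  # 1.0
```

<p>The update uses only the observed reward and the current estimate for the next state, so it can run after every single step.</p>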
<h3 id="heading-explorationexploitation-trade-off">Exploration/Exploitation trade off</h3>
<p>Before looking at the different strategies to solve Reinforcement Learning problems, we must cover one more very important topic: the exploration/exploitation trade-off.</p>
<ul>
<li>Exploration is finding more information about the environment.</li>
<li>Exploitation is exploiting known information to maximize the reward.</li>
</ul>
<p>Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, we can fall into a common trap.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*APLmZ8CVgu0oY3sQBVYIuw.png" alt="Image" width="488" height="490" loading="lazy"></p>
<p>In this game, our mouse can have an infinite amount of small cheese (+1 each). But at the top of the maze there is a gigantic sum of cheese (+1000).</p>
<p>However, if we only focus on reward, our agent will never reach the gigantic sum of cheese. Instead, it will only exploit the nearest source of rewards, even if this source is small (exploitation).</p>
<p>But if our agent does a little bit of exploration, it can find the big reward.</p>
<p>This is what we call the exploration/exploitation trade off. We must define a rule that helps to handle this trade-off. We’ll see in future articles different ways to handle it.</p>
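<p>As one possible example of such a rule, here is a sketch of the widely used epsilon-greedy strategy. This is an assumption for illustration, not necessarily the rule used later in this series. The function name and the value estimates are hypothetical.</p>

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if rng.random() < epsilon:
        # Explore: try any action, even one that currently looks bad.
        return rng.randrange(len(action_values))
    # Exploit: pick the index of the action with the highest estimated value.
    return max(range(len(action_values)), key=lambda a: action_values[a])


# Hypothetical estimates for three actions.
estimates = [1.0, 5.0, 2.0]
print(epsilon_greedy(estimates, epsilon=0.0))  # 1 (always exploits)
print(epsilon_greedy(estimates, epsilon=0.1))  # usually 1, occasionally a random action
```

<p>A small epsilon keeps the mouse mostly on the cheese it already knows about, while still leaving a chance to stumble on the big reward.</p>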
<h3 id="heading-three-approaches-to-reinforcement-learning">Three approaches to Reinforcement Learning</h3>
<p>Now that we’ve defined the main elements of Reinforcement Learning, let’s move on to the three approaches to solving a Reinforcement Learning problem. These are value-based, policy-based, and model-based.</p>
<h4 id="heading-value-based">Value Based</h4>
<p>In value-based RL, the goal is to optimize the value function <em>V(s)</em>.</p>
<p>The value function is a function that tells us the maximum expected future reward the agent will get at each state.</p>
<p>The value of each state is the total amount of the reward an agent can expect to accumulate over the future, starting at that state.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0*kvtRAhBZO-h77Iw1." alt="Image" width="692" height="133" loading="lazy"></p>
<p>The agent will use this value function to select which state to choose at each step. The agent takes the state with the biggest value.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*2_JRk-4O523bcOcSy1u31g.png" alt="Image" width="748" height="426" loading="lazy"></p>
<p>In the maze example, at each step we will take the biggest value: -7, then -6, then -5 (and so on) to attain the goal.</p>
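<p>The idea of following the biggest value can be sketched like this in Python. The state names, the neighbor map, and the values for the start, goal, and dead end are invented for the example. Only the -7, -6, -5 progression comes from the maze above.</p>

```python
def follow_values(values, neighbors, start, goal, max_steps=20):
    """Greedily step to the reachable state with the biggest value."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if state == goal:
            break
        # Among the states we can reach, take the one with the highest value.
        state = max(neighbors[state], key=lambda s: values[s])
        path.append(state)
    return path


# Hypothetical value function and maze layout.
values = {"start": -8, "a": -7, "b": -6, "c": -5, "goal": 0, "dead_end": -9}
neighbors = {
    "start": ["a", "dead_end"],
    "a": ["start", "b"],
    "b": ["a", "c"],
    "c": ["b", "goal"],
    "dead_end": ["start"],
}

print(follow_values(values, neighbors, "start", "goal"))
# ['start', 'a', 'b', 'c', 'goal']
```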
<h4 id="heading-policy-based">Policy Based</h4>
<p>In policy-based RL, we want to directly optimize the policy function <em>π(s)</em> without using a value function.</p>
<p>The policy is what defines the agent’s behavior at a given time.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0*8B4cAhvM-K4y9a5U." alt="Image" width="526" height="151" loading="lazy">
<em>action = policy(state)</em></p>
<p>We learn a policy function. This lets us map each state to the best corresponding action.</p>
<p>We have two types of policy:</p>
<ul>
<li>Deterministic: a policy at a given state will always return the same action.</li>
<li>Stochastic: the policy outputs a probability distribution over actions.</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0*DNiQGeUl1FKunRbb." alt="Image" width="580" height="209" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*fii7Z01laRGateAJDvloAQ.png" alt="Image" width="748" height="426" loading="lazy"></p>
<p>As we can see here, the policy directly indicates the best action to take at each step.</p>
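<p>The two types of policy can be illustrated with a short Python sketch. The state names, action names, and probabilities below are hypothetical and hand-written. A real policy-based method would learn them.</p>

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "right", "s1": "up"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s0": {"right": 0.7, "left": 0.1, "up": 0.2},
}


def act_deterministic(state):
    """Always return the same action for a given state."""
    return deterministic_policy[state]


def act_stochastic(state, rng=random):
    """Sample an action according to the probabilities for this state."""
    distribution = stochastic_policy[state]
    actions = list(distribution)
    weights = list(distribution.values())
    return rng.choices(actions, weights=weights, k=1)[0]


print(act_deterministic("s0"))  # right
print(act_stochastic("s0"))     # "right" about 70% of the time
```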
<h4 id="heading-model-based">Model Based</h4>
<p>In model-based RL, we model the environment. This means we create a model of the behavior of the environment.</p>
<p>The problem is each environment will need a different model representation. That’s why we will not speak about this type of Reinforcement Learning in the upcoming articles.</p>
<h3 id="heading-introducing-deep-reinforcement-learning">Introducing Deep Reinforcement Learning</h3>
<p>Deep Reinforcement Learning introduces deep neural networks to solve Reinforcement Learning problems — hence the name “deep.”</p>
<p>For instance, in the next article we’ll work on Q-Learning (classic Reinforcement Learning) and Deep Q-Learning.</p>
<p>You’ll see the difference is that in the first approach, we use a traditional algorithm to create a Q table that helps us find what action to take for each state.</p>
<p>In the second approach, we will use a Neural Network to approximate the Q-value (the expected future reward) of each action, given a state.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*w5GuxedZ9ivRYqM_MLUxOQ.png" alt="Image" width="800" height="602" loading="lazy">
<em>Schema inspired by the Q learning notebook by Udacity</em></p>
<p>Congrats! There was a lot of information in this article. Be sure to really grasp the material before continuing. It’s important to master these elements before entering the fun part: creating AI that plays video games.</p>
<p>Important: this article is the first part of a free series of blog posts about Deep Reinforcement Learning. For more information and more resources, <a target="_blank" href="https://simoninithomas.github.io/Deep_reinforcement_learning_Course/">check out the syllabus.</a></p>
<p>Next time we’ll work on a Q-learning agent that learns to play the Frozen Lake game.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*zW-o3-S4JrNpLbztKp3U4A.gif" alt="Image" width="209" height="194" loading="lazy">
<em>FrozenLake</em></p>
<p>If you have any thoughts, comments, questions, feel free to comment below or send me an email: hello@simoninithomas.com, or tweet me <a target="_blank" href="https://twitter.com/ThomasSimonini">@ThomasSimonini</a>.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*_yN1FzvEFDmlObiYsstIzg.png" alt="Image" width="500" height="77" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*mD-f5VN1SWYvhrZAbvSu_w.png" alt="Image" width="500" height="77" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*PqiptT-Cdi8uwosxuFn2DQ.png" alt="Image" width="500" height="77" loading="lazy"></p>
<p>Cheers!</p>
<h4 id="heading-deep-reinforcement-learning-course"><strong>Deep Reinforcement Learning Course:</strong></h4>
<blockquote>
<p>We’re making a <strong>video version of the Deep Reinforcement Learning Course with TensorFlow</strong>, where we focus on the implementation part with TensorFlow, <a target="_blank" href="https://youtu.be/q2ZOEFAaaI0">here</a>.</p>
</blockquote>
<p><em>Part 1: <a target="_blank" href="https://www.freecodecamp.org/news/an-introduction-to-reinforcement-learning-4339519de419/">An introduction to Reinforcement Learning</a></em></p>
<p><em>Part 2: <a target="_blank" href="https://www.freecodecamp.org/news/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe/">Diving deeper into Reinforcement Learning with Q-Learning</a></em></p>
<p><em>Part 3: <a target="_blank" href="https://www.freecodecamp.org/news/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8/">An introduction to Deep Q-Learning: let’s play Doom</a></em></p>
<p>Part 3+: <a target="_blank" href="https://www.freecodecamp.org/news/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682/">Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets</a></p>
<p>Part 4: <a target="_blank" href="https://www.freecodecamp.org/news/an-introduction-to-policy-gradients-with-cartpole-and-doom-495b5ef2207f/">An introduction to Policy Gradients with Doom and Cartpole</a></p>
<p>Part 5: <a target="_blank" href="https://www.freecodecamp.org/news/an-intro-to-advantage-actor-critic-methods-lets-play-sonic-the-hedgehog-86d6240171d/">An intro to Advantage Actor Critic methods: let’s play Sonic the Hedgehog!</a></p>
<p>Part 6: <a target="_blank" href="https://towardsdatascience.com/proximal-policy-optimization-ppo-with-sonic-the-hedgehog-2-and-3-c9c21dbed5e?gi=30cae83cd9a5">Proximal Policy Optimization (PPO) with Sonic the Hedgehog 2 and 3</a></p>
<p>Part 7: <a target="_blank" href="https://towardsdatascience.com/curiosity-driven-learning-made-easy-part-i-d3e5a2263359">Curiosity-Driven Learning made easy Part I</a></p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
