Reinforcement Learning - freeCodeCamp.org

How to Build an Adaptive Tic-Tac-Toe AI with Reinforcement Learning in JavaScript

Mayur Vekariya — Tue, 07 Oct 2025 20:49:27 +0000

Reinforcement learning (RL) is one of the most powerful paradigms in artificial intelligence. Unlike supervised learning, where you train models on labeled datasets, RL agents learn through direct interaction with their environment, receiving rewards or penalties for their actions.

In this tutorial, you will build a Tic-Tac-Toe AI that learns optimal strategies through Q-learning, a foundational RL algorithm. You will implement adaptive difficulty levels, visualize the learning process in real-time, and explore advanced optimization techniques.

By the end of this tutorial, you’ll have a production-ready web application that demonstrates practical RL concepts – all running directly in the browser with vanilla JavaScript.

What You’ll Learn

In this tutorial, you’ll learn:

Core reinforcement learning concepts including Q-learning, exploration vs exploitation, and reward shaping.
How to implement a complete Q-learning algorithm with state management.
Advanced techniques like epsilon decay and experience replay.
How to build an interactive game with HTML5 Canvas and responsive controls.
Performance optimization for real-time AI decision-making.
Visualization techniques to understand the AI's learning process.

Prerequisites

To get the most out of this tutorial, you should have:

Solid understanding of JavaScript (ES6+ syntax, classes, array methods).
Familiarity with HTML5 Canvas API for graphics rendering.
Basic knowledge of algorithms and data structures.
Understanding of asynchronous JavaScript (Promises, async/await).

You don’t need any prior machine learning experience, as I’ll explain all RL concepts from scratch.

Why Use Reinforcement Learning for Game AI?
How to Understand Q-Learning: The Foundation
Project Architecture Overview
How to Build the HTML Interface with Tailwind CSS
How to Implement the Q-Learning Algorithm
How to Understand the Enhanced Features
How to Test Your Implementation
Advanced Optimizations and Extensions
Common Pitfalls and Solutions
How to Extend This to Other Games
Conclusion

Why Use Reinforcement Learning for Game AI?

Games provide an ideal environment for learning RL because they have:

Clear state representations – The game board at any moment
Discrete action spaces – A finite set of valid moves
Immediate feedback – Win, lose, or draw outcomes
Deterministic rules – Consistent behavior across games

Traditional game AI uses techniques like minimax with alpha-beta pruning. While effective, these approaches require you to explicitly program game strategies. RL agents, by contrast, discover optimal strategies through experience – much like humans learn through practice.

Tic-Tac-Toe serves as an excellent starting point because:

The state space is manageable (5,478 unique positions)
Games are short, allowing rapid iteration
Perfect play is achievable, providing a clear success metric
The concepts scale to more complex games

How to Understand Q-Learning: The Foundation

Q-learning is a model-free, value-based RL algorithm. Let me break down what that means:

Model-free means that the agent doesn’t need a model of how the game will respond to its moves. It is only told which moves are available and learns the rest purely from experience.
Value-based means that the agent learns the "value" of each action in each state, then chooses the action with the highest value.

Core Components

There are a few key components you’ll need to understand before building this game.

First, we have state (s), which here is the current game board configuration. We represent this as a 9-character string (for example, "XO-X-----" where - represents empty cells).

Next, we have action (a), which is a move the AI can make. We represent this as an index from 0-8 corresponding to board positions.

Then there’s reward (r), the numerical feedback from the environment:

+1 for winning
-1 for losing
0 for draws or ongoing games

We also have the Q-table, a lookup table storing Q(s,a) – the expected cumulative reward for taking action a in state s.

And finally, there’s policy, the strategy for choosing actions. We use an epsilon-greedy policy that balances exploration and exploitation.
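To make the representation concrete, here is a minimal sketch of a board string, its available actions, and one Q-table row. The helper names (availableMoves, applyMove) are hypothetical and used only for illustration.

```javascript
// Hypothetical helpers for illustration only.
const EMPTY = '-';

// List the indices (0-8) of the empty cells in a 9-character board string.
function availableMoves(state) {
  return [...state].flatMap((cell, i) => (cell === EMPTY ? [i] : []));
}

// Return a new board string with `player` placed at `index`.
function applyMove(state, index, player) {
  return state.slice(0, index) + player + state.slice(index + 1);
}

const state = 'XO-X-----';
const actions = availableMoves(state); // [2, 4, 5, 6, 7, 8]
const next = applyMove(state, 4, 'O'); // 'XO-XO----'

// One Q-table row: nine values, one per board index, all starting at 0.
const qTable = new Map([[state, Array(9).fill(0)]]);
```

Because strings are immutable in JavaScript, every move produces a new state string, which makes states safe to use as Map keys.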

The Q-Learning Update Rule

The heart of Q-learning is this update formula:

Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]

Where:

α (alpha) = Learning rate (0 to 1) – how much to update the Q-value
γ (gamma) = Discount factor (0 to 1) – how much to value future rewards
s' = Next state after taking action a
max Q(s',a') = Highest Q-value available in the next state, taken over all actions a'

This formula implements temporal difference learning. This means it updates our estimate of Q(s,a) based on the difference between our current estimate and a better estimate using the actual reward received plus the best possible future reward.
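A single update can be worked through with concrete numbers. The sketch below writes the rule as a pure function; the function name qUpdate and the example values are chosen for illustration.

```javascript
// One Q-learning update, written as a pure function.
function qUpdate(qOld, reward, maxNextQ, alpha, gamma) {
  const tdTarget = reward + gamma * maxNextQ; // r + γ max Q(s',a')
  const tdError = tdTarget - qOld;            // gap between target and old estimate
  return qOld + alpha * tdError;              // move a fraction α toward the target
}

// Illustrative numbers: Q(s,a) = 0.5, r = 0, best next-state value = 1.0,
// α = 0.1, γ = 0.9.
// Target = 0 + 0.9 * 1.0 = 0.9, error = 0.9 - 0.5 = 0.4,
// so the new value is about 0.5 + 0.1 * 0.4 = 0.54.
const updated = qUpdate(0.5, 0, 1.0, 0.1, 0.9);
```

With a small α the estimate moves only part of the way toward the target, which is why many games are needed before the values settle.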

How Exploration vs Exploitation Works

A critical challenge in reinforcement learning is the "exploration vs. exploitation" trade-off. To understand why this is difficult, imagine choosing a place for dinner.

Exploitation: You could go to your favorite restaurant. You know the food is good, and you're almost guaranteed a satisfying meal. This is a safe, reliable choice that maximizes your immediate reward based on past experience.
Exploration: You could try a new, unknown restaurant. It might be a disaster, or you might discover a new favorite that’s even better than your old one. This is a risky choice that provides no immediate guarantee, but it's the only way to gather new information and potentially find a better long-term strategy.

The same dilemma applies to our AI. If it only exploits its current knowledge, it might get stuck using a mediocre strategy, never discovering the brilliant moves that lead to a guaranteed win. If it only explores by making random moves, it will never learn to use the good strategies it finds and will play poorly.

The key is to balance the two: explore enough to find optimal strategies, but exploit that knowledge to win games.

To achieve this balance, we use an epsilon-greedy (ϵ) strategy. It’s a simple but powerful way to manage this trade-off:

We choose a small value for epsilon (ϵ), for example, 0.1 (which represents a 10% probability).
Before the AI makes a move, it generates a random number between 0 and 1.
If the random number is less than ϵ (the 10% chance): The AI ignores its strategy and chooses a random available move. This is exploration.
If the random number is greater than or equal to ϵ (the 90% chance): The AI chooses the best-known move from its Q-table. This is exploitation.

This ensures the AI primarily plays to win but still dedicates a small fraction of its moves to trying new things. We will also implement epsilon decay – starting with a higher ϵ value to encourage exploration when the AI is inexperienced, and gradually lowering it as the AI learns and becomes more confident in its strategy.
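The selection rule and the decay schedule can be sketched in a few lines. This is a minimal illustration, not the tutorial's final code: the function names are hypothetical, and the random source is passed in as a parameter (an assumption made here so the behavior can be checked deterministically). The decay rate 0.995 and floor 0.01 match the values used later in this tutorial, as does the training start value of 0.3.

```javascript
// Epsilon-greedy selection. `rng` defaults to Math.random but can be replaced.
function chooseAction(qValues, available, epsilon, rng = Math.random) {
  if (rng() < epsilon) {
    // Explore: uniformly random legal move.
    return available[Math.floor(rng() * available.length)];
  }
  // Exploit: legal move with the highest Q-value.
  return available.reduce((best, a) => (qValues[a] > qValues[best] ? a : best), available[0]);
}

// Multiplicative decay with a floor, so exploration never fully stops.
function decayEpsilon(epsilon, rate = 0.995, floor = 0.01) {
  return Math.max(floor, epsilon * rate);
}

// Count how many games it takes to reach the floor from a starting value.
function gamesUntilFloor(start, rate = 0.995, floor = 0.01) {
  let epsilon = start;
  let games = 0;
  while (epsilon > floor) {
    epsilon = decayEpsilon(epsilon, rate, floor);
    games++;
  }
  return games;
}

const q = [0, 0.2, 0, 0, 0.7, 0, 0, 0, 0];
const greedy = chooseAction(q, [1, 4, 8], 0.1, () => 0.5);    // 0.5 >= 0.1: exploit, picks 4
const explored = chooseAction(q, [1, 4, 8], 0.1, () => 0.05); // 0.05 < 0.1: explore
const games = gamesUntilFloor(0.3);                           // roughly 680 games
```

Starting from 0.3, the floor is reached after roughly 680 games, so in a 1000-game training run the last third of the games is played with almost no exploration.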

Project Architecture Overview

Before you start coding, here's the structure of the application you’ll build:

tic-tac-toe-ai/
├── index.html          # Game interface with Tailwind CSS
└── game.js            # Complete game logic and AI

You will organize your code into two main classes in game.js:

QLearning: Implements the Q-learning algorithm.
TicTacToe: Manages game state and rendering.

How to Build the HTML Interface with Tailwind CSS

Create an index.html file that loads Tailwind CSS from its CDN:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Tic-Tac-Toe AI with Q-Learning</title>
  <script src="https://cdn.tailwindcss.com"></script>
</head>
<body class="bg-gradient-to-br from-purple-600 to-purple-900 min-h-screen flex items-center justify-center p-4">

  <div class="bg-white rounded-3xl shadow-2xl p-8 max-w-5xl w-full">
    
    <div class="text-center mb-8">
      <h1 class="text-4xl font-bold text-gray-800 mb-2">🎮 Tic-Tac-Toe AI</h1>
      <p class="text-gray-600 text-lg">Watch the AI learn through reinforcement learning</p>
    </div>

    
    <div id="trainingIndicator" class="hidden bg-yellow-100 border-l-4 border-yellow-500 text-yellow-700 p-4 mb-6 rounded">
      <p class="font-semibold">🤖 AI is training... <span id="trainingProgress"></span></p>
    </div>

    
    <div class="grid md:grid-cols-2 gap-8">

      
      <div class="flex flex-col items-center">
        <canvas id="gameCanvas" width="400" height="400" 
                class="border-4 border-purple-500 rounded-xl shadow-lg cursor-pointer hover:scale-[1.02] transition-transform">
        </canvas>
        <div id="gameStatus" class="mt-4 text-xl font-bold text-gray-700 min-h-[30px]">
          Your turn! (X)
        </div>
      </div>

      
      <div class="space-y-6">

        
        <div class="bg-gray-50 rounded-xl p-6">
          <h3 class="text-xl font-bold text-gray-800 mb-4">Game Controls</h3>
          <div class="space-y-3">
            <button onclick="game.reset()" 
                    class="w-full bg-purple-600 hover:bg-purple-700 text-white font-semibold py-3 px-6 rounded-lg transition-all hover:-translate-y-0.5 shadow-md hover:shadow-lg">
              New Game
            </button>
            <button onclick="game.startTraining()" 
                    class="w-full bg-green-600 hover:bg-green-700 text-white font-semibold py-3 px-6 rounded-lg transition-all hover:-translate-y-0.5 shadow-md hover:shadow-lg">
              Train AI (1000 games)
            </button>
            <button onclick="game.resetAI()" 
                    class="w-full bg-red-600 hover:bg-red-700 text-white font-semibold py-3 px-6 rounded-lg transition-all hover:-translate-y-0.5 shadow-md hover:shadow-lg">
              Reset AI Memory
            </button>
          </div>
        </div>

        
        <div class="bg-gray-50 rounded-xl p-6">
          <h3 class="text-xl font-bold text-gray-800 mb-4">Difficulty Level</h3>
          <div class="grid grid-cols-3 gap-2">
            <button onclick="game.setDifficulty('beginner')" id="diffBeginner"
                    class="py-2 px-4 rounded-lg font-semibold text-sm transition-all bg-green-100 text-green-700 hover:bg-green-200">
              🌱 Beginner
            </button>
            <button onclick="game.setDifficulty('intermediate')" id="diffIntermediate"
                    class="py-2 px-4 rounded-lg font-semibold text-sm transition-all bg-white text-gray-700 hover:bg-gray-100 border-2 border-purple-500">
              🎯 Medium
            </button>
            <button onclick="game.setDifficulty('expert')" id="diffExpert"
                    class="py-2 px-4 rounded-lg font-semibold text-sm transition-all bg-white text-gray-700 hover:bg-gray-100">
              🔥 Expert
            </button>
          </div>
        </div>

        
        <div class="bg-gray-50 rounded-xl p-6">
          <h3 class="text-xl font-bold text-gray-800 mb-4">AI Parameters</h3>

          <div class="space-y-4">
            
            <div>
              <div class="flex justify-between items-center mb-2">
                <label class="text-sm font-medium text-gray-700 flex items-center gap-1">
                  Learning Rate (α)
                  <span class="group relative">
                    <span class="cursor-help text-purple-500">ⓘ</span>
                    <span class="invisible group-hover:visible absolute left-0 top-6 w-64 bg-gray-900 text-white text-xs rounded-lg p-3 z-10 shadow-xl">
                      Controls how quickly the AI updates its knowledge. Higher values = faster learning but less stability. Recommended: 0.1-0.3
                    </span>
                  </span>
                </label>
                <span id="learningRateValue" class="text-sm font-bold text-purple-600">0.1</span>
              </div>
              <input type="range" id="learningRate" min="0.01" max="0.5" step="0.01" value="0.1"
                     class="w-full h-2 bg-gray-200 rounded-lg appearance-none cursor-pointer">
            </div>

            
            <div>
              <div class="flex justify-between items-center mb-2">
                <label class="text-sm font-medium text-gray-700 flex items-center gap-1">
                  Discount Factor (γ)
                  <span class="group relative">
                    <span class="cursor-help text-purple-500">ⓘ</span>
                    <span class="invisible group-hover:visible absolute left-0 top-6 w-64 bg-gray-900 text-white text-xs rounded-lg p-3 z-10 shadow-xl">
                      Determines how much the AI values future rewards vs immediate rewards. Higher = more long-term thinking. Recommended: 0.85-0.95
                    </span>
                  </span>
                </label>
                <span id="discountFactorValue" class="text-sm font-bold text-purple-600">0.9</span>
              </div>
              <input type="range" id="discountFactor" min="0.5" max="0.99" step="0.01" value="0.9"
                     class="w-full h-2 bg-gray-200 rounded-lg appearance-none cursor-pointer">
            </div>

            
            <div>
              <div class="flex justify-between items-center mb-2">
                <label class="text-sm font-medium text-gray-700 flex items-center gap-1">
                  Exploration Rate (ε)
                  <span class="group relative">
                    <span class="cursor-help text-purple-500">ⓘ</span>
                    <span class="invisible group-hover:visible absolute left-0 top-6 w-64 bg-gray-900 text-white text-xs rounded-lg p-3 z-10 shadow-xl">
                      Chance the AI tries random moves vs using learned strategy. Higher = more experimentation. Set to 0.01 for best play after training.
                    </span>
                  </span>
                </label>
                <span id="explorationRateValue" class="text-sm font-bold text-purple-600">0.1</span>
              </div>
              <input type="range" id="explorationRate" min="0" max="0.5" step="0.01" value="0.1"
                     class="w-full h-2 bg-gray-200 rounded-lg appearance-none cursor-pointer">
            </div>
          </div>
        </div>

        
        <div class="bg-gray-50 rounded-xl p-6">
          <h3 class="text-xl font-bold text-gray-800 mb-4">Statistics</h3>
          <div class="grid grid-cols-3 gap-3">
            <div class="bg-white rounded-lg p-3 text-center shadow-sm">
              <div class="text-xs text-gray-600 mb-1">Games</div>
              <div id="gamesPlayed" class="text-2xl font-bold text-gray-800">0</div>
            </div>
            <div class="bg-white rounded-lg p-3 text-center shadow-sm">
              <div class="text-xs text-gray-600 mb-1">AI Wins</div>
              <div id="aiWins" class="text-2xl font-bold text-green-600">0</div>
            </div>
            <div class="bg-white rounded-lg p-3 text-center shadow-sm">
              <div class="text-xs text-gray-600 mb-1">You Win</div>
              <div id="playerWins" class="text-2xl font-bold text-red-600">0</div>
            </div>
            <div class="bg-white rounded-lg p-3 text-center shadow-sm">
              <div class="text-xs text-gray-600 mb-1">Draws</div>
              <div id="draws" class="text-2xl font-bold text-gray-600">0</div>
            </div>
            <div class="bg-white rounded-lg p-3 text-center shadow-sm">
              <div class="text-xs text-gray-600 mb-1">States</div>
              <div id="statesLearned" class="text-2xl font-bold text-purple-600">0</div>
            </div>
            <div class="bg-white rounded-lg p-3 text-center shadow-sm">
              <div class="text-xs text-gray-600 mb-1">Win Rate</div>
              <div id="winRate" class="text-2xl font-bold text-blue-600">0%</div>
            </div>
          </div>
        </div>

      </div>
    </div>
  </div>

  <script src="game.js"></script>
</body>
</html>

This HTML structure creates a responsive, modern interface using Tailwind CSS utility classes. The layout uses a two-column grid on medium screens and larger, with the game canvas on the left and all controls on the right. The training indicator starts hidden and only appears during AI training sessions.

The buttons use inline onclick handlers, and the sliders are connected through oninput handlers that game.js attaches at startup, so every control communicates with the JavaScript game logic. The tooltip system uses CSS group hover states to show explanatory text when users hover over the info icons, helping them understand each parameter without cluttering the interface.

Let’s talk in a bit more detail about some key parts of the code:

Header Section: Displays the game title and subtitle to introduce users to the application.
Training Indicator: A yellow banner that appears only during AI training sessions, showing progress updates every 50 games. This provides visual feedback so users know the training is in progress.
Canvas Section: Contains the HTML5 Canvas element where the game board is drawn. The canvas is 400x400 pixels and styled with Tailwind classes for borders and hover effects. Below it is a status message that updates based on game state.
Game Controls: Three primary buttons that let users start a new game, train the AI through 1000 self-play games, or completely reset the AI's memory (clearing the Q-table).
Difficulty Selector: Three buttons for choosing AI difficulty. Beginner mode makes the AI play randomly 70% of the time, Intermediate uses Q-learning, and Expert implements perfect minimax play.
AI Parameters: Three range sliders with tooltips that let users adjust the core reinforcement learning hyperparameters in real-time. The tooltips appear on hover and explain what each parameter does.
Statistics Panel: A grid of six cards displaying real-time metrics including games played, wins/losses/draws, learned states, and AI win rate percentage.

The onclick handlers call methods on the game object defined in game.js, so that object needs to be available as a global variable.

How to Implement the Q-Learning Algorithm

Now, let's bring the theory to life. Create a game.js file. We will build this file step-by-step, but if you get stuck at any point or want to see the complete code for reference, you can find the final version on GitHub.

Our code will be structured into two main classes: QLearning, which will handle the AI's "brain" and learning logic, and TicTacToe, which will manage the game state, rendering, and user interaction.

The `QLearning` Class: The AI's Brain

This class will contain all the logic for the reinforcement learning agent. Let's build it piece by piece.

1. Constructor and Q-Table Management

First, let's set up the constructor and a method to access our Q-table. The Q-table will be a JavaScript Map, which is highly efficient for storing and retrieving key-value pairs where the key (the board state) is a string.

// In game.js

// Q-Learning Agent with localStorage support
class QLearning {
  constructor(lr = 0.1, gamma = 0.9, epsilon = 0.1) {
    this.q = new Map(); // Stores Q-values: { state => [q_action_0, q_action_1, ...] }
    this.lr = lr; // Learning Rate (α)
    this.gamma = gamma; // Discount Factor (γ)
    this.epsilon = epsilon; // Exploration Rate (ε)
    this.difficulty = 'intermediate';
  }

  getQ(state) {
    if (!this.q.has(state)) {
      this.q.set(state, Array(9).fill(0));
    }
    return this.q.get(state);
  }

The constructor initializes our three key hyperparameters (α, γ, ϵ) and the Q-table itself.
getQ(state) is a crucial helper function. It safely retrieves the array of Q-values for a given board state. If the AI has never seen this state before, it creates a new entry in the map with an array of nine zeros, representing an initial Q-value of 0 for each possible move.
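One detail worth seeing in isolation is that getQ returns the stored array itself, not a copy. The sketch below uses a standalone function and a module-level Map as stand-ins for the class members (an assumption made to keep the example self-contained).

```javascript
// Minimal stand-in for the Q-table accessor described above.
const table = new Map();

function getQ(state) {
  if (!table.has(state)) {
    table.set(state, Array(9).fill(0)); // lazily create a row of nine zeros
  }
  return table.get(state);
}

const row = getQ('---------');   // first access creates the row
row[4] = 0.25;                   // mutate the returned array...
const again = getQ('---------'); // ...and the change is visible on the next lookup,
                                 // because both variables reference the same array
```

This is what lets a learning step modify Q-values in place. A side effect is that merely looking up a state adds it to the table, so the table's size counts states that have been visited, not only states with non-zero values.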

2. Choosing an Action (The Epsilon-Greedy Strategy)

Next, we'll implement the getAction method. This is where the AI decides which move to make, incorporating our difficulty levels and the epsilon-greedy strategy.

  getAction(state, available) {
    // Difficulty-based behavior
    if (this.difficulty === 'beginner') {
      // 70% random moves for beginner
      if (Math.random() < 0.7) {
        return available[~~(Math.random() * available.length)];
      }
    } else if (this.difficulty === 'expert') {
      // Use minimax for perfect play
      return this.getMinimaxAction(state, available);
    }

    // Intermediate: epsilon-greedy
    if (Math.random() < this.epsilon) {
      return available[~~(Math.random() * available.length)];
    }
    const q = this.getQ(state);
    return available.reduce((best, a) => q[a] > q[best] ? a : best, available[0]);
  }

The logic first checks the difficulty. 'Beginner' picks a random move 70% of the time and otherwise falls through to the Q-table logic below, while 'Expert' defers to a separate, perfect-play algorithm.
For the 'Intermediate' level, it implements the epsilon-greedy logic. With probability ϵ, it explores (chooses a random move). Otherwise, it exploits (chooses the best-known move from the Q-table).
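The greedy step has a subtle property: because the comparison is a strict >, ties go to the earliest available index. The sketch below shows this, next to a hypothetical variation (not part of this tutorial's code) that breaks ties at random. The function names and the injectable rng parameter are assumptions for illustration.

```javascript
// Greedy selection as in getAction: strict '>' keeps the earliest index on ties.
function greedyFirst(q, available) {
  return available.reduce((best, a) => (q[a] > q[best] ? a : best), available[0]);
}

// Hypothetical variation: collect all best moves, then pick one of them at random.
function greedyRandomTie(q, available, rng = Math.random) {
  const bestValue = Math.max(...available.map(a => q[a]));
  const ties = available.filter(a => q[a] === bestValue);
  return ties[Math.floor(rng() * ties.length)];
}

const untrained = Array(9).fill(0); // every move looks equally good
const first = greedyFirst(untrained, [2, 4, 6]);                  // always 2
const varied = greedyRandomTie(untrained, [2, 4, 6], () => 0.99); // 6 with this rng
```

In practice this means an untrained AI with a low ϵ tends to play the lowest-numbered free cell until some Q-values move away from zero.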

3. The Learning Rule

The update method is the heart of the algorithm. It's the direct implementation of the Q-learning formula we discussed earlier.

Q(s, a) ← Q(s, a) + α [r + γ max(a') Q(s', a') − Q(s, a)]

  update(s, a, r, s2, available2) {
    const q = this.getQ(s);
    const maxQ2 = available2.length ? Math.max(...available2.map(a_prime => this.getQ(s2)[a_prime])) : 0;
    q[a] += this.lr * (r + this.gamma * maxQ2 - q[a]);
  }

maxQ2 calculates the max Q(s',a') part of the formula – the best possible Q-value the AI can get from its next move.
The final line is a direct translation of the formula, updating the value of the action just taken based on the reward and future potential.
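To see how a reward travels backwards through earlier moves, the sketch below replays a tiny two-step episode twice. The states 'A', 'B', and 'END' and the action indices are made up for illustration; the update function mirrors the method above with α = 0.1 and γ = 0.9.

```javascript
const LR = 0.1;
const GAMMA = 0.9;
const q = new Map();

// Lazily created row of nine Q-values per state.
const row = s => {
  if (!q.has(s)) q.set(s, Array(9).fill(0));
  return q.get(s);
};

function update(s, a, r, s2, available2) {
  const maxQ2 = available2.length ? Math.max(...available2.map(a2 => row(s2)[a2])) : 0;
  row(s)[a] += LR * (r + GAMMA * maxQ2 - row(s)[a]);
}

// Episode: in state A take action 0 (no reward) to reach B;
// in B take action 1, which ends the game with reward +1.
function playEpisode() {
  update('A', 0, 0, 'B', [1]);  // uses the current estimate of Q(B,1)
  update('B', 1, 1, 'END', []); // terminal step: no future value
}

playEpisode(); // game 1: Q(B,1) becomes 0.1, Q(A,0) is still 0
const afterFirst = [row('A')[0], row('B')[1]];
playEpisode(); // game 2: Q(A,0) becomes 0.1 * (0.9 * 0.1) = 0.009
const afterSecond = [row('A')[0], row('B')[1]];
```

The reward reaches the move just before the win in the first game and the move before that only in the second game. This one-step-per-visit propagation is why training runs for many games.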

4. Minimax for Expert Mode

For our 'Expert' level, we'll implement the minimax algorithm, a classic recursive algorithm from game theory that guarantees perfect play.

  getMinimaxAction(state, available) {
    let bestScore = -Infinity;
    let bestMove = available[0];

    for (const move of available) {
      const newState = state.substring(0, move) + 'O' + state.substring(move + 1);
      const score = this.minimax(newState, 0, false);
      if (score > bestScore) {
        bestScore = score;
        bestMove = move;
      }
    }
    return bestMove;
  }

  minimax(state, depth, isMaximizing) {
    const winner = this.checkWinnerStatic(state);
    if (winner === 'O') return 10 - depth;
    if (winner === 'X') return depth - 10;
    if (winner === 'draw') return 0;

    const available = [...state].map((c, i) => c === '-' ? i : null).filter(x => x !== null);

    if (isMaximizing) {
      let best = -Infinity;
      for (const move of available) {
        const newState = state.substring(0, move) + 'O' + state.substring(move + 1);
        best = Math.max(best, this.minimax(newState, depth + 1, false));
      }
      return best;
    } else {
      let best = Infinity;
      for (const move of available) {
        const newState = state.substring(0, move) + 'X' + state.substring(move + 1);
        best = Math.min(best, this.minimax(newState, depth + 1, true));
      }
      return best;
    }
  }

  checkWinnerStatic(state) {
    const patterns = [[0,1,2],[3,4,5],[6,7,8],[0,3,6],[1,4,7],[2,5,8],[0,4,8],[2,4,6]];
    for (const p of patterns) {
      if (state[p[0]] !== '-' && state[p[0]] === state[p[1]] && state[p[1]] === state[p[2]]) {
        return state[p[0]];
      }
    }
    return state.includes('-') ? null : 'draw';
  }
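The behavior of minimax is easiest to check on positions with an obvious answer. The following is a condensed, standalone re-implementation of the same idea (the names winnerOf, place, empties, and bestMoveForO are hypothetical), tried on one position where O can win immediately and one where O has to block.

```javascript
const LINES = [[0,1,2],[3,4,5],[6,7,8],[0,3,6],[1,4,7],[2,5,8],[0,4,8],[2,4,6]];

function winnerOf(state) {
  for (const [a, b, c] of LINES) {
    if (state[a] !== '-' && state[a] === state[b] && state[b] === state[c]) return state[a];
  }
  return state.includes('-') ? null : 'draw';
}

const place = (state, i, p) => state.slice(0, i) + p + state.slice(i + 1);
const empties = state => [...state].flatMap((c, i) => (c === '-' ? [i] : []));

// Score a position from O's point of view; `oToMove` says whose turn it is.
function minimax(state, depth, oToMove) {
  const w = winnerOf(state);
  if (w === 'O') return 10 - depth; // faster wins score higher
  if (w === 'X') return depth - 10; // slower losses score higher
  if (w === 'draw') return 0;
  const scores = empties(state).map(i =>
    minimax(place(state, i, oToMove ? 'O' : 'X'), depth + 1, !oToMove));
  return oToMove ? Math.max(...scores) : Math.min(...scores);
}

function bestMoveForO(state) {
  let best = { move: null, score: -Infinity };
  for (const move of empties(state)) {
    const score = minimax(place(state, move, 'O'), 0, false);
    if (score > best.score) best = { move, score };
  }
  return best;
}

const win = bestMoveForO('OO-XX---X');   // O completes the top row at index 2
const block = bestMoveForO('XX--O----'); // O must block the top row at index 2
```

The depth term is what makes the AI prefer an immediate win over a slower one, and it also makes every non-blocking move in the second position score the worst possible value.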

5. Helper and Persistence Methods

Finally, let's add methods for epsilon decay, resetting the AI's memory, and saving/loading the Q-table to localStorage.

  decay() {
    this.epsilon = Math.max(0.01, this.epsilon * 0.995);
  }

  reset() {
    this.q.clear();
    this.epsilon = 0.1;
  }

  save() {
    const data = {
      q: Array.from(this.q.entries()),
      lr: this.lr,
      gamma: this.gamma,
      epsilon: this.epsilon,
      difficulty: this.difficulty
    };
    localStorage.setItem('tictactoe_ai', JSON.stringify(data));
  }

  load() {
    const saved = localStorage.getItem('tictactoe_ai');
    if (!saved) return false;

    try {
      const data = JSON.parse(saved);
      this.q = new Map(data.q);
      this.lr = data.lr;
      this.gamma = data.gamma;
      this.epsilon = data.epsilon;
      this.difficulty = data.difficulty || 'intermediate';
      return true;
    } catch (e) {
      console.error('Failed to load AI state:', e);
      return false;
    }
  }

  clearStorage() {
    localStorage.removeItem('tictactoe_ai');
  }
}
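The save and load methods convert the Map to an array of entries because JSON has no native representation for a Map. The round trip can be tried without localStorage, as in this minimal sketch with made-up Q-values:

```javascript
// A small Q-table with illustrative values.
const original = new Map([
  ['---------', [0, 0, 0, 0, 0.5, 0, 0, 0, 0]],
  ['X---O----', [0, 0, 0.1, 0, 0, 0, 0, 0, 0]],
]);

// Serialize: Map -> array of [key, value] pairs -> JSON string.
const json = JSON.stringify({ q: Array.from(original.entries()), epsilon: 0.1 });

// Deserialize: JSON string -> array of pairs -> Map.
const restored = new Map(JSON.parse(json).q);

// Stringifying the Map directly loses every entry.
const naive = JSON.stringify(original); // '{}'
```

The try/catch in load matters for the same reason: anything stored under that key that is not valid JSON in this shape would otherwise throw while the game is starting up.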

The `TicTacToe` Class: Managing the Game

Now that we have our AI "brain," we need to build the game around it. This class will handle rendering the board, processing user clicks, managing game flow, and calling the AI when it's its turn.

1. Constructor and Control Initialization

The constructor sets up the game's initial state, gets a reference to the HTML canvas, and wires up event listeners for user input.

class TicTacToe {
  constructor() {
    this.board = '---------';
    this.ai = new QLearning();
    this.stats = { played: 0, aiWins: 0, playerWins: 0, draws: 0 };
    this.training = false;
    this.gameOver = false;

    this.canvas = document.getElementById('gameCanvas');
    this.ctx = this.canvas.getContext('2d');
    this.cellSize = this.canvas.width / 3; // three equal cells (about 133.33 px on the 400 px canvas)

    this.canvas.onclick = e => this.handleClick(e);
    this.initControls();
    this.loadState();
    this.draw();
  }

  initControls() {
    ['learningRate', 'discountFactor', 'explorationRate'].forEach(id => {
      const el = document.getElementById(id);
      el.oninput = e => {
        const val = parseFloat(e.target.value);
        document.getElementById(id + 'Value').textContent = val.toFixed(2);
        if (id === 'learningRate') this.ai.lr = val;
        if (id === 'discountFactor') this.ai.gamma = val;
        if (id === 'explorationRate') this.ai.epsilon = val;
        this.saveState();
      };
    });
  }

initControls connects our HTML sliders to the AI's parameters, allowing for real-time adjustments.

2. Difficulty and UI Methods

These methods manage the difficulty setting and update the UI accordingly.

  setDifficulty(level) {
    this.ai.difficulty = level;

    // Update button styles
    ['beginner', 'intermediate', 'expert'].forEach(diff => {
      const btn = document.getElementById(`diff${diff.charAt(0).toUpperCase() + diff.slice(1)}`);
      if (diff === level) {
        btn.className = 'py-2 px-4 rounded-lg font-semibold text-sm transition-all bg-purple-600 text-white border-2 border-purple-600';
      } else {
        btn.className = 'py-2 px-4 rounded-lg font-semibold text-sm transition-all bg-white text-gray-700 hover:bg-gray-100';
      }
    });

    if (level === 'beginner') this.setStatus('🌱 Beginner mode: AI makes more mistakes');
    else if (level === 'intermediate') this.setStatus('🎯 Medium mode: Balanced AI using Q-learning');
    else this.setStatus('🔥 Expert mode: Perfect AI using minimax algorithm');

    this.saveState();
  }

3. Drawing and Rendering

These methods use the HTML5 Canvas API to visually represent the game state.

  draw() {
    const { ctx, canvas, cellSize } = this;
    ctx.fillStyle = '#fff';
    ctx.fillRect(0, 0, canvas.width, canvas.height);

    ctx.strokeStyle = '#8b5cf6';
    ctx.lineWidth = 4;
    for (let i = 1; i < 3; i++) {
      ctx.beginPath();
      ctx.moveTo(i * cellSize, 0);
      ctx.lineTo(i * cellSize, canvas.height);
      ctx.stroke();
      ctx.beginPath();
      ctx.moveTo(0, i * cellSize);
      ctx.lineTo(canvas.width, i * cellSize);
      ctx.stroke();
    }

    for (let i = 0; i < 9; i++) {
      const symbol = this.board[i];
      if (symbol === '-') continue;

      const x = (i % 3) * cellSize + cellSize / 2;
      const y = ~~(i / 3) * cellSize + cellSize / 2;

      ctx.strokeStyle = symbol === 'X' ? '#ef4444' : '#10b981';
      ctx.lineWidth = 8;
      ctx.lineCap = 'round';

      if (symbol === 'X') {
        const s = cellSize * 0.3;
        ctx.beginPath();
        ctx.moveTo(x - s, y - s);
        ctx.lineTo(x + s, y + s);
        ctx.stroke();
        ctx.beginPath();
        ctx.moveTo(x + s, y - s);
        ctx.lineTo(x - s, y + s);
        ctx.stroke();
      } else {
        ctx.beginPath();
        ctx.arc(x, y, cellSize * 0.3, 0, Math.PI * 2);
        ctx.stroke();
      }
    }

    const winner = this.checkWinner();
    if (winner?.line) this.drawWinLine(winner.line);
  }

  drawWinLine(line) {
    const [a, , c] = line;
    const startX = (a % 3) * this.cellSize + this.cellSize / 2;
    const startY = ~~(a / 3) * this.cellSize + this.cellSize / 2;
    const endX = (c % 3) * this.cellSize + this.cellSize / 2;
    const endY = ~~(c / 3) * this.cellSize + this.cellSize / 2;

    this.ctx.strokeStyle = '#fbbf24';
    this.ctx.lineWidth = 6;
    this.ctx.beginPath();
    this.ctx.moveTo(startX, startY);
    this.ctx.lineTo(endX, endY);
    this.ctx.stroke();
  }
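The drawing code relies on a simple mapping between a board index and canvas coordinates, and a click handler needs the inverse. Both directions are sketched below as standalone functions; the names cellCenter and cellAt are hypothetical, and the 400-pixel size is taken from the canvas in index.html.

```javascript
const CANVAS_SIZE = 400;      // matches the canvas width and height in index.html
const CELL = CANVAS_SIZE / 3; // side length of one cell

// Board index (0-8) -> pixel center of that cell.
function cellCenter(index) {
  const col = index % 3;
  const row = Math.floor(index / 3);
  return { x: col * CELL + CELL / 2, y: row * CELL + CELL / 2 };
}

// Pixel position -> board index, or -1 if the point is outside the 3x3 grid.
function cellAt(x, y) {
  const col = Math.floor(x / CELL);
  const row = Math.floor(y / CELL);
  if (col < 0 || col > 2 || row < 0 || row > 2) return -1;
  return row * 3 + col;
}

const center = cellCenter(4);                 // middle cell, about (200, 200)
const roundTrip = cellAt(center.x, center.y); // back to index 4
const outside = cellAt(405, 50);              // -1: just past the right edge
```

The bounds check is the important part: without it, a point slightly past the right edge would produce column 3 and be silently mapped to the first cell of the next row.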

4. Player Interaction and the Game Loop

This is the core interactive logic. handleClick translates a click into a board position, move updates the state, and aiMove gets an action from the QLearning class and executes it.

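  handleClick(e) {
    // Ignore clicks while the game is over, training, or waiting for the AI's reply
    if (this.gameOver || this.training || this.aiThinking) return;

    const rect = this.canvas.getBoundingClientRect();
    const col = ~~((e.clientX - rect.left) / this.cellSize);
    const row = ~~((e.clientY - rect.top) / this.cellSize);
    // Clicks on the canvas border can land just outside the 3x3 grid
    if (col < 0 || col > 2 || row < 0 || row > 2) return;
    const idx = row * 3 + col;

    if (this.board[idx] === '-') {
      this.move(idx, 'X');
      if (!this.gameOver) {
        this.aiThinking = true; // block extra X moves during the 300 ms delay
        setTimeout(() => {
          this.aiThinking = false;
          this.aiMove();
        }, 300);
      }
    }
  }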

  move(idx, player) {
    if (this.board[idx] !== '-' || this.gameOver) return false;
    this.board = this.board.substring(0, idx) + player + this.board.substring(idx + 1);
    this.draw();
    this.checkGameOver();
    return true;
  }

  aiMove() {
    if (this.gameOver) return;

    const state = this.board;
    const available = this.getAvailable();
    const action = this.ai.getAction(state, available);

    this.move(action, 'O');

    const winner = this.checkWinner();
    const reward = winner?.winner === 'O' ? 1 : winner?.winner === 'X' ? -1 : 0;
    this.ai.update(state, action, reward, this.board, this.getAvailable());
  }

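After the AI moves, it immediately calls this.ai.update() to learn from the result of its action. Because this update runs right after the AI's own move, the reward here is +1 when that move wins and 0 otherwise: the player's winning reply hasn't happened yet, so the -1 case is never reached in this method.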

5. The Rules Engine

These helpers determine the game's state: available moves, winner, and game over conditions.

  getAvailable() {
    return [...this.board].map((c, i) => c === '-' ? i : null).filter(x => x !== null);
  }

  checkWinner() {
    const patterns = [[0,1,2],[3,4,5],[6,7,8],[0,3,6],[1,4,7],[2,5,8],[0,4,8],[2,4,6]];
    for (const p of patterns) {
      if (this.board[p[0]] !== '-' && 
          this.board[p[0]] === this.board[p[1]] && 
          this.board[p[1]] === this.board[p[2]]) {
        return { winner: this.board[p[0]], line: p };
      }
    }
    return this.board.includes('-') ? null : { winner: 'draw', line: null };
  }

  checkGameOver() {
    const result = this.checkWinner();
    if (!result) return;

    this.gameOver = true;
    this.stats.played++;

    if (result.winner === 'X') {
      this.stats.playerWins++;
      if (!this.training) this.setStatus('🎉 You win!');
    } else if (result.winner === 'O') {
      this.stats.aiWins++;
      if (!this.training) this.setStatus('🤖 AI wins!');
    } else {
      this.stats.draws++;
      if (!this.training) this.setStatus('🤝 Draw!');
    }

    if (!this.training) {
      this.updateStats();
      this.saveState();
    }
  }

6. UI and Statistics Updates

These methods connect the internal game state to the HTML elements, displaying status messages and statistics.

  setStatus(msg) {
    document.getElementById('gameStatus').textContent = msg;
  }

  updateStats() {
    document.getElementById('gamesPlayed').textContent = this.stats.played;
    document.getElementById('aiWins').textContent = this.stats.aiWins;
    document.getElementById('playerWins').textContent = this.stats.playerWins;
    document.getElementById('draws').textContent = this.stats.draws;
    document.getElementById('statesLearned').textContent = this.ai.q.size;

    const winRate = this.stats.played ? (this.stats.aiWins / this.stats.played * 100).toFixed(1) : 0;
    document.getElementById('winRate').textContent = `${winRate}%`;
  }

7. Game and AI Management

These methods are wired to the control buttons for resetting the game or the AI's memory.

  reset() {
    this.board = '---------';
    this.gameOver = false;
    this.draw();
    this.setStatus('Your turn! (X)');
  }

  resetAI() {
    if (confirm('Reset AI memory? All progress will be lost.')) {
      this.ai.reset();
      this.ai.clearStorage();
      this.stats = { played: 0, aiWins: 0, playerWins: 0, draws: 0 };
      this.updateStats();
      this.reset();
      this.setStatus('AI memory reset!');
      localStorage.removeItem('tictactoe_stats');
    }
  }

8. The Self-Play Training Loop

This is the logic for the "Train AI" button, allowing the AI to learn rapidly by playing against itself.

  async startTraining() {
    this.training = true;
    document.getElementById('trainingIndicator').classList.remove('hidden');

    const originalEpsilon = this.ai.epsilon;
    this.ai.epsilon = 0.3; // Higher exploration during training

    for (let i = 0; i < 1000; i++) {
      await this.trainGame();
      this.ai.decay();
      if (i % 50 === 0) {
        document.getElementById('trainingProgress').textContent = `${i + 1}/1000`;
        await new Promise(r => setTimeout(r, 0)); // Allow UI to update
      }
    }

    this.ai.epsilon = originalEpsilon;
    this.training = false;
    document.getElementById('trainingIndicator').classList.add('hidden');
    this.updateStats();
    this.reset();
    this.setStatus('Training complete!');
    this.saveState();
  }

  async trainGame() {
    this.board = '---------';
    this.gameOver = false;
    const moves = [];

    while (!this.gameOver && this.getAvailable().length > 0) {
      const state = this.board;
      const available = this.getAvailable();
      // Alternate players (X and O) are both the AI
      const player = moves.length % 2 === 0 ? 'X' : 'O'; 
      const action = this.ai.getAction(state, available);

      moves.push({ state, action, player });
      this.move(action, player);
    }

    const winner = this.checkWinner();
    // Assign rewards after the game is over: +1 for a win, -1 for a loss, 0 for a draw
    moves.forEach(m => {
      const drawn = !winner || winner.winner === 'draw';
      const reward = drawn ? 0 : (winner.winner === m.player ? 1 : -1);
      this.ai.update(m.state, m.action, reward, this.board, []);
    });
  }

9. State Persistence

These methods orchestrate saving and loading the game state and AI's memory to localStorage.

  saveState() {
    this.ai.save();
    localStorage.setItem('tictactoe_stats', JSON.stringify(this.stats));
  }

  loadState() {
    if (this.ai.load()) {
      const savedStats = localStorage.getItem('tictactoe_stats');
      if (savedStats) {
        this.stats = JSON.parse(savedStats);
      }
      this.updateStats();
      this.setDifficulty(this.ai.difficulty);

      // Update sliders to reflect loaded AI state
      document.getElementById('learningRate').value = this.ai.lr;
      document.getElementById('learningRateValue').textContent = this.ai.lr.toFixed(2);
      document.getElementById('discountFactor').value = this.ai.gamma;
      document.getElementById('discountFactorValue').textContent = this.ai.gamma.toFixed(2);
      document.getElementById('explorationRate').value = this.ai.epsilon;
      document.getElementById('explorationRateValue').textContent = this.ai.epsilon.toFixed(2);

      console.log('✓ Loaded AI state from localStorage');
    }
  }
}

10. Initializing the Game

Finally, add this snippet at the end of game.js to create an instance of the game once the HTML document is loaded.

let game;
window.addEventListener('DOMContentLoaded', () => {
  game = new TicTacToe();
});

This completes our implementation! You now have a fully functional game.js file. If you encountered any issues or want to double-check your work, you can compare your code against the complete source file available on GitHub: https://github.com/mayur9210/tic-tac-toe-ai/blob/main/game.js.

How to Understand the Enhanced Features

Beyond the core Q-learning logic, this implementation includes several enhanced features to create a complete, user-friendly, and educational application. Let's explore what they are and how they work.

1. Adaptive Difficulty Levels

The game supports three distinct difficulty modes to cater to different players:

Beginner (🌱): This mode is designed for new players. The AI makes random moves 70% of the time, providing a high chance for the player to win and learn the game's rules.
Intermediate (🎯): This is the standard mode where the AI uses the Q-learning algorithm with an epsilon-greedy strategy. It presents a challenging but fair opponent that improves over time.
Expert (🔥): This mode switches from reinforcement learning to the classic minimax algorithm. This algorithm plays a perfect game, meaning it is impossible to beat (the best a player can achieve is a draw). This serves as a benchmark for optimal play.
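To make the Expert mode's "perfect play" claim concrete, here is a compact, standalone minimax for Tic-Tac-Toe. It is a sketch, not the tutorial's implementation: the function names (outcome, place, minimax, bestMoveForO) are illustrative, and it assumes the same 9-character board string with the AI playing 'O'.

```javascript
const LINES = [[0,1,2],[3,4,5],[6,7,8],[0,3,6],[1,4,7],[2,5,8],[0,4,8],[2,4,6]];

// Returns 'X', 'O', 'draw', or null while the game is still running
function outcome(board) {
  for (const [a, b, c] of LINES) {
    if (board[a] !== '-' && board[a] === board[b] && board[b] === board[c]) return board[a];
  }
  return board.includes('-') ? null : 'draw';
}

function place(board, i, player) {
  return board.slice(0, i) + player + board.slice(i + 1);
}

// Score from O's point of view: +1 if O wins, -1 if X wins, 0 for a draw
function minimax(board, player) {
  const result = outcome(board);
  if (result === 'O') return 1;
  if (result === 'X') return -1;
  if (result === 'draw') return 0;

  const scores = [];
  for (let i = 0; i < 9; i++) {
    if (board[i] === '-') {
      scores.push(minimax(place(board, i, player), player === 'O' ? 'X' : 'O'));
    }
  }
  // O picks the best score for itself, X picks the worst score for O
  return player === 'O' ? Math.max(...scores) : Math.min(...scores);
}

function bestMoveForO(board) {
  let best = -1;
  let bestScore = -Infinity;
  for (let i = 0; i < 9; i++) {
    if (board[i] !== '-') continue;
    const score = minimax(place(board, i, 'O'), 'X');
    if (score > bestScore) { bestScore = score; best = i; }
  }
  return best;
}

console.log(bestMoveForO('XX--O----')); // 2 (blocks X's top row)
console.log(minimax('---------', 'X')); // 0 (perfect play from both sides is a draw)
```

Because the full game tree has only a few hundred thousand nodes, this brute-force search needs no pruning or caching to respond quickly.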

2. Other Enhanced Features

In addition to the difficulty levels, the application includes:

Real-time AI parameter tuning: The sliders in the UI allow you to adjust the Learning Rate (α), Discount Factor (γ), and Exploration Rate (ϵ) on the fly. This lets you directly observe how different hyperparameters affect the AI's learning speed and performance.
Persistence with localStorage: The AI automatically saves its Q-table and your game statistics to the browser's local storage. When you close the tab and come back later, the AI will remember everything it has learned.
Dedicated self-play training mode: The "Train AI" button allows the AI to play 1,000 games against itself in a matter of seconds. This rapidly populates the Q-table and is far more efficient than learning from just human-played games.

Putting It All Together: A Guided Test Run

Once you have the HTML (index.html) and JavaScript (game.js) files in the same directory, open the HTML file in a web browser to test all the features. The page should look like the image below.

I have also hosted this file on GitHub Pages if you want to see how it works.

Now that you have the application running, let's walk through how to test the features and witness the AI's learning process firsthand. This interactive testing is the most rewarding part, as you'll see the abstract concepts come to life.

Step 1: Challenge the Untrained AI

When you first load the game, the AI is a blank slate. Its Q-table is empty. Make sure the difficulty is set to 🌱 Beginner and play a game against it. You'll likely find it very easy to beat. It makes random, nonsensical moves because it has no experience. Notice the "States Learned" in the statistics panel is very low.

Step 2: Train the AI

Now for the magic. Click the "Train AI (1000 games)" button. You'll see the yellow training indicator appear with a progress counter. In these few seconds, the AI is playing 1,000 games against itself, rapidly learning from its wins, losses, and draws. For every move in every game, it updates its Q-table, reinforcing good strategies and penalizing bad ones.

Step 3: Challenge the Trained AI

Once training is complete, play another game on 🎯 Intermediate difficulty. The difference should be dramatic. The AI will now play strategically, blocking your wins and setting up its own. It is no longer a pushover. Check the statistics panel again: you'll see the "States Learned" count has jumped significantly, representing all the new board positions it now understands.

Step 4: Experiment with the Controls

Now that you have a trained AI, experiment with the other features:

Switch to 🔥 Expert: Play against the minimax algorithm. Notice that you can't win. This demonstrates the power of a perfect-play algorithm.
Tweak the parameters: Set the Exploration Rate (ε) slider to 0. The AI will become completely deterministic, always picking the move with the highest Q-value. Set it to 0.5, and watch it become more erratic and experimental again.
Reset the AI: Click the "Reset AI Memory" button. This will wipe its Q-table. If you play against it now, you'll find it's back to its original, untrained state. This confirms that its "intelligence" was stored in the Q-table you just erased.

Verifying the Implementation with Automated Tests

While playing the game gives you a good feel for the AI's behavior, automated tests are crucial for programmatically confirming that the underlying code is correct. This is different from the manual testing you just performed. Here, we are writing code to check our code.

The following test suite validates the three most critical features: difficulty switching, data persistence with localStorage, and the infallibility of the expert minimax AI. You can run these tests by pasting the code into your browser's developer console while the game is open and then calling runTests().

function runTests() {
  console.log('🧪 Running enhanced tests...');
  let failures = 0;

  // console.assert() only prints when a condition is false,
  // so report each result explicitly instead.
  const check = (condition, name) => {
    console.log(`${condition ? '✓' : '✗'} ${name}`);
    if (!condition) failures++;
  };

  // Test 1: Difficulty switching
  const g1 = new TicTacToe();
  g1.setDifficulty('beginner');
  check(g1.ai.difficulty === 'beginner', 'Difficulty switching works');

  // Test 2: localStorage persistence
  const g2 = new TicTacToe();
  g2.ai.q.set('test-state', [1, 2, 3, 4, 5, 6, 7, 8, 9]);
  g2.saveState();
  const g3 = new TicTacToe();
  check(g3.ai.q.has('test-state'), 'localStorage persistence works');
  // Remove the dummy entry so it does not stay in the saved Q-table
  g3.ai.q.delete('test-state');
  g3.saveState();

  // Test 3: Minimax never loses
  const g4 = new TicTacToe();
  g4.setDifficulty('expert');
  let expertLosses = 0;
  for (let i = 0; i < 100; i++) {
    g4.reset();
    while (!g4.gameOver) {
      const available = g4.getAvailable();
      const move = available[~~(Math.random() * available.length)];
      g4.move(move, 'X');
      if (!g4.gameOver) g4.aiMove();
    }
    const winner = g4.checkWinner();
    if (winner?.winner === 'X') expertLosses++;
  }
  check(expertLosses === 0, 'Expert AI never loses');

  console.log(failures === 0 ? '✅ All tests passed!' : `❌ ${failures} test(s) failed`);
}

How these tests work:

Difficulty switching: The first test creates a game instance, sets the difficulty, and asserts that the AI's internal property was updated correctly.
Persistence: The second test simulates saving the AI's state. It adds a dummy entry to the Q-table, saves it, creates a new game instance (simulating a page reload), and asserts that the new instance successfully loaded the saved data.
Expert mode correctness: The third and most rigorous test plays 100 games against the expert AI using random moves for the player, then asserts that the expert AI never lost a single game. Random play cannot cover every line of play, so this is strong evidence rather than a proof that the minimax implementation is correct.

You can run these tests in your browser's console after loading the game, as shown in the screenshot below.

Advanced Optimizations and Extensions

Now that you have the complete implementation, here are ways to extend it further:

How to Implement Symmetry Reduction

You can reduce the state space by recognizing equivalent board positions:

getCanonicalState(s) {
  const transforms = [
    s, this.rot90(s), this.rot180(s), this.rot270(s),
    this.flip(s), this.flip(this.rot90(s)), 
    this.flip(this.rot180(s)), this.flip(this.rot270(s))
  ];
  return transforms.sort()[0];
}

rot90(s) {
  const b = s.split('');
  return [b[6],b[3],b[0],b[7],b[4],b[1],b[8],b[5],b[2]].join('');
}

rot180(s) {
  return s.split('').reverse().join('');
}

rot270(s) {
  const b = s.split('');
  return [b[2],b[5],b[8],b[1],b[4],b[7],b[0],b[3],b[6]].join('');
}

flip(s) {
  const b = s.split('');
  return [b[2],b[1],b[0],b[5],b[4],b[3],b[8],b[7],b[6]].join('');
}

This symmetry reduction technique speeds up AI learning by recognizing equivalent board positions.

How it works:

getCanonicalState(): Generates all 8 symmetric versions of a board state (4 rotations + 4 flipped versions) and returns the lexicographically smallest one as the standard representation.
rot90(): Rotates the board 90° clockwise by remapping position indices.
rot180(): Rotates 180° by reversing the board string.
rot270(): Rotates 270° clockwise (or 90° counterclockwise).
flip(): Mirrors the board horizontally (left to right).

Why this matters: Tic-Tac-Toe has 5,478 legal board positions but only 765 that are distinct under rotation and reflection. Storing only canonical states means the Q-table needs up to roughly 7x fewer entries (less than the full factor of 8, because symmetric boards map to themselves), and every training game updates values that apply to all equivalent positions.

Example: These boards are considered identical:

X--     ---     --X
---  =  ---  =  ---
---     --X     ---
(original) (180° rotation) (horizontal flip)

All three map to the same canonical state, so the AI only needs to learn one instead of three.

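To use this, have getQ() look up the canonical state instead of the raw board. Note that action indices must be transformed with the same rotation or flip that produced the canonical state, both when reading Q-values and when updating them. Otherwise a Q-value stored for one cell would be applied to a different cell on the rotated board. The getCanonicalState() method above returns only the state string, so it needs to be extended to also report which transform was chosen.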
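One way to keep states and actions aligned is to represent each symmetry as an index permutation and return the chosen permutation together with the canonical state. The sketch below is an illustration under that assumption; canonicalize, toCanonicalAction, and toOriginalAction are hypothetical helper names, not part of the tutorial's code.

```javascript
// Each symmetry is a permutation p where newBoard[i] = oldBoard[p[i]]
const SYMMETRIES = [
  [0, 1, 2, 3, 4, 5, 6, 7, 8], // identity
  [6, 3, 0, 7, 4, 1, 8, 5, 2], // rotate 90° clockwise (same as rot90 above)
  [8, 7, 6, 5, 4, 3, 2, 1, 0], // rotate 180°
  [2, 5, 8, 1, 4, 7, 0, 3, 6], // rotate 270° clockwise
  [2, 1, 0, 5, 4, 3, 8, 7, 6], // horizontal flip (same as flip above)
  [6, 7, 8, 3, 4, 5, 0, 1, 2], // vertical flip
  [0, 3, 6, 1, 4, 7, 2, 5, 8], // reflect across the main diagonal
  [8, 5, 2, 7, 4, 1, 6, 3, 0]  // reflect across the anti-diagonal
];

function applySymmetry(board, p) {
  return p.map(i => board[i]).join('');
}

// Returns the lexicographically smallest variant and the permutation that produced it
function canonicalize(board) {
  let best = null;
  for (const perm of SYMMETRIES) {
    const state = applySymmetry(board, perm);
    if (best === null || state < best.state) best = { state, perm };
  }
  return best;
}

// Cell on the real board -> matching cell on the canonical board
function toCanonicalAction(action, perm) {
  return perm.indexOf(action);
}

// Cell on the canonical board -> matching cell on the real board
function toOriginalAction(canonicalAction, perm) {
  return perm[canonicalAction];
}

const { state, perm } = canonicalize('--X------');
console.log(state);                      // --------X
console.log(toCanonicalAction(2, perm)); // 8
```

With this in place, a lookup would canonicalize the board, read the Q-values stored for the canonical state, pick a canonical action, and convert it back with toOriginalAction() before placing the piece.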

How to Add Export and Import Functionality

You can also let users share trained AI models:

exportAI() {
  const data = {
    q: Array.from(this.ai.q.entries()),
    stats: this.stats,
    difficulty: this.ai.difficulty,
    timestamp: Date.now()
  };

  const blob = new Blob([JSON.stringify(data)], { type: 'application/json' });
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = `tictactoe-ai-${Date.now()}.json`;
  a.click();
  URL.revokeObjectURL(url);
}

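importAI(file) {
  const reader = new FileReader();
  reader.onload = (e) => {
    try {
      const data = JSON.parse(e.target.result);
      // Validate before touching any state, so a malformed file
      // cannot replace the current Q-table or statistics
      if (!Array.isArray(data.q) || typeof data.stats !== 'object' || data.stats === null) {
        throw new Error('Unexpected file structure');
      }
      this.ai.q = new Map(data.q);
      this.stats = data.stats;
      this.ai.difficulty = data.difficulty;
      this.updateStats();
      this.setStatus('✓ AI imported successfully!');
    } catch (err) {
      this.setStatus('✗ Import failed: Invalid file');
    }
  };
  reader.readAsText(file);
}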

These methods enable sharing trained AI models between users. The exportAI() method packages the complete AI state (Q-table, statistics, difficulty, and timestamp) into a JSON object, creates a Blob from the JSON string, generates a temporary download URL, programmatically creates and clicks a download link, then cleans up the URL. The filename includes a timestamp for version tracking.

The importAI() method uses FileReader to asynchronously read an uploaded JSON file, parses it, reconstructs the Map from the array of entries, restores all game state, and updates the display. Error handling catches invalid JSON or corrupted files.

How to Add Q-Value Heatmap Visualization

Here’s how you can visualize the AI's decision-making:

drawQValueHeatmap() {
  const state = this.board;
  const qValues = this.ai.getQ(state);
  const available = this.getAvailable();

  if (available.length === 0) return;

  const maxQ = Math.max(...available.map(i => qValues[i]));
  const minQ = Math.min(...available.map(i => qValues[i]));
  const range = maxQ - minQ || 1;

  for (const i of available) {
    const normalized = (qValues[i] - minQ) / range;
    const row = ~~(i / 3);
    const col = i % 3;

    // Green for high Q-values, red for low.
    // The alpha is set inside the loop because the text drawing below
    // resets it to 1; setting it once before the loop would make only
    // the first cell semi-transparent.
    const hue = normalized * 120;
    this.ctx.globalAlpha = 0.3;
    this.ctx.fillStyle = `hsl(${hue}, 70%, 50%)`;
    this.ctx.fillRect(
      col * this.cellSize + 5,
      row * this.cellSize + 5,
      this.cellSize - 10,
      this.cellSize - 10
    );

    // Draw Q-value
    this.ctx.globalAlpha = 1;
    this.ctx.fillStyle = '#000';
    this.ctx.font = '14px monospace';
    this.ctx.fillText(
      qValues[i].toFixed(2),
      col * this.cellSize + 10,
      row * this.cellSize + 25
    );
  }
  this.ctx.globalAlpha = 1;
}

This visualization method creates a color-coded heatmap showing the AI's confidence in each available move.

It first retrieves Q-values for the current state and finds the min/max values among available positions to normalize the data. For each empty cell, it calculates a normalized score (0 to 1), converts it to a hue value (0° red for low values, 120° green for high values) using HSL color space, and fills the cell with a semi-transparent colored rectangle. It then overlays the actual Q-value as text for precise inspection.

This gives you instant visual feedback about which moves the AI considers most promising. Green cells are good moves, red cells are poor moves.
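The color mapping can be tried without a canvas by isolating the normalization step. The following sketch uses a hypothetical helper, qValuesToHues, that mirrors the min/max scaling described above and returns a hue per available cell.

```javascript
// Maps the Q-values of the available cells to hues between 0 (red) and 120 (green)
function qValuesToHues(qValues, available) {
  const values = available.map(i => qValues[i]);
  const minQ = Math.min(...values);
  const maxQ = Math.max(...values);
  const range = maxQ - minQ || 1; // avoid dividing by zero when all values are equal

  return available.map(i => ({
    cell: i,
    hue: Math.round(((qValues[i] - minQ) / range) * 120)
  }));
}

const q = [0.5, 0, -0.5, 0, 0, 0, 0, 0, 0];
console.log(qValuesToHues(q, [0, 1, 2]));
// [ { cell: 0, hue: 120 }, { cell: 1, hue: 60 }, { cell: 2, hue: 0 } ]
```

One consequence of this scaling is that when every available move has the same Q-value (for example, on an untrained AI), all cells normalize to 0 and are drawn red, which signals "no preference" rather than "all moves are bad".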

Common Pitfalls and Solutions

Issue 1: AI Does Not Improve

Cause: The learning rate is too low or there hasn't been enough training.
Solution: Increase the learning rate to between 0.2 and 0.3, and train for more than 2000 games.

Issue 2: AI Makes Random Moves

Cause: The exploration rate is too high after training.
Solution: Reduce the exploration rate to 0.01 once training is complete.

Issue 3: Slow Performance

Cause: The state representation or Q-table lookup is inefficient.
Solution: Use a Map instead of objects and implement state caching.

Issue 4: AI Overfits to One Strategy

Cause: There isn't enough exploration during training.
Solution: Begin with a high exploration rate (ε=0.5) and gradually decrease it.
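The advice in Issues 2 and 4 amounts to an exploration schedule: start high, shrink a little after every game, and stop at a small floor. The sketch below shows one common form, multiplicative decay. The decay rate of 0.995 and the function names are illustrative assumptions, and the tutorial's own ai.decay() method may use different values.

```javascript
// One decay step: shrink epsilon by a constant factor, but never below the floor
function decayEpsilon(epsilon, rate = 0.995, minEpsilon = 0.01) {
  return Math.max(minEpsilon, epsilon * rate);
}

// Returns the starting value followed by epsilon after each training game
function epsilonSchedule(start, games, rate = 0.995, minEpsilon = 0.01) {
  const schedule = [start];
  for (let i = 0; i < games; i++) {
    schedule.push(decayEpsilon(schedule[i], rate, minEpsilon));
  }
  return schedule;
}

const schedule = epsilonSchedule(0.5, 1000);
console.log(schedule[100].toFixed(2)); // 0.30
console.log(schedule[1000]);           // 0.01 (the floor has been reached)
```

With these numbers the agent still explores about 30% of the time after 100 games, and settles at the 0.01 floor before the end of a 1,000-game run.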

How to Extend This to Other Games

This framework adapts to other games:

Connect Four: 42-character state, 7 actions (columns)
Blackjack: State includes hand values and dealer card
Snake: The state space is too large for a lookup table, so it requires function approximation (for example, a neural network)

Conclusion

You have built a complete reinforcement learning system in JavaScript. This project demonstrates:

Core RL concepts with practical implementation
Clean, maintainable code architecture
Real-time training and visualization
Advanced techniques like epsilon decay and self-play
Three difficulty levels from beginner to expert
Data persistence with localStorage
Interactive tooltips for learning

The Q-learning foundation you have implemented is also the basis of more advanced techniques such as Deep Q-Networks (DQN), which replace the Q-table with a neural network and are used in modern game AI.

Next Steps

Here are some ways to continue learning:

Add more difficulty levels with custom parameters
Implement state persistence with IndexedDB for larger Q-tables
Create multiplayer mode with AI observation
Build a neural network version with TensorFlow.js
Extend to Connect Four or Chess endgames

Resources for Further Learning

Reinforcement Learning: An Introduction by Sutton and Barto (free online textbook)
OpenAI Spinning Up – comprehensive RL resource
Deep RL Bootcamp – Berkeley video lectures
Stable-Baselines3 Documentation – production RL implementations

Use Gymnasium for Reinforcement Learning

Beau Carnes — Tue, 21 Mar 2023 14:17:07 +0000

Embark on an exciting journey to learn the fundamentals of reinforcement learning and its implementation using Gymnasium, the open-source Python library previously known as OpenAI Gym.

We just published a full course on the freeCodeCamp.org YouTube channel that will teach you the basics of reinforcement learning using Gymnasium.

Mustafa Esoofally created this course. He is an experienced machine learning engineer and course creator.

Gymnasium is an open source Python library maintained by the Farama Foundation. It offers a rich collection of pre-built environments for reinforcement learning agents, a standard API for communication between learning algorithms and environments, and a standard set of environments compliant with that API. This comprehensive video course is designed to help you understand reinforcement learning, a branch of machine learning that focuses on intelligent agents taking actions in an environment to maximize cumulative rewards.

Course Contents

This video course is carefully structured to provide you with a complete understanding of reinforcement learning, from basics to advanced topics:

Introduction
Get an overview of the course, its objectives, and the topics we will cover.
Reinforcement Learning Basics (Agent and Environment)
Learn about the fundamental concepts of reinforcement learning, including agents, environments, and their interactions.
Introduction to Gymnasium
Discover the power of Gymnasium and how it can help you develop and test reinforcement learning algorithms.
Blackjack Rules and Implementation in Gymnasium
Dive into the classic card game of Blackjack and learn how to implement it using Gymnasium.
Solving Blackjack
Explore the process of solving Blackjack using reinforcement learning techniques.
Install and Import Libraries
Learn how to set up your Python environment and import the necessary libraries for reinforcement learning.
Observing the Environment
Understand how to monitor and interact with the environment during reinforcement learning tasks.
Executing an Action in the Environment
Master the process of performing actions in the environment and receiving feedback.
Understand and Implement Epsilon-greedy Strategy to Solve Blackjack
Learn the epsilon-greedy strategy, an essential technique for solving Blackjack with reinforcement learning.
Understand the Q-values
Explore the concept of Q-values and how they are used in reinforcement learning algorithms.
Training the Agent to Play Blackjack
Learn the process of training a reinforcement learning agent to play Blackjack effectively.
Visualize the Training of Agent Playing Blackjack
Discover how to visualize and analyze the training process of a reinforcement learning agent.
Summary of Solving Blackjack
Review the key concepts and techniques learned while solving Blackjack.
Solving Cartpole Using Deep-Q-Networks (DQN)
Learn how to solve the classic Cartpole problem using Deep-Q-Networks, a popular reinforcement learning technique.
Summary of Solving Cartpole
Recap the essential elements of solving Cartpole using reinforcement learning.
Advanced Topics and Introduction to Multi-Agent Reinforcement Learning using Pettingzoo
Delve into advanced reinforcement learning topics, including multi-agent reinforcement learning and the use of the Pettingzoo library.

With this comprehensive video course, you'll be well-equipped to tackle reinforcement learning challenges using the powerful Gymnasium library.

Watch the full course on the freeCodeCamp.org YouTube channel (3-hour watch).

Train an AI to Play a Snake Game Using Python

Beau Carnes — Mon, 25 Apr 2022 16:48:40 +0000

Why waste time playing video games when you can train an AI to do it for you? Ok, maybe playing yourself is more fun but training an AI can be more educational.

We just published a course on the freeCodeCamp.org YouTube channel that will teach you the basics of reinforcement learning by showing you how to teach an AI to play a snake game.

Reinforcement learning is a type of machine learning that enables an agent to learn in an environment by trial and error using feedback from its own actions and experiences.

First you will create the game using Python and Pygame. Then you will create and train a neural network using PyTorch that can play the game better than most humans.

Patrick Loeber, also known as Python Engineer, created this course. He has created many popular courses related to Python and machine learning.

The girl and snake art for this course were created by Rachel Likes Pizza.

Here is what you will do in this four-part course:

Learn the basics of Reinforcement Learning and Deep Q Learning
Set up the environment and implement a snake game
Implement an agent to control the game
Create and train a neural network to play the game

Watch the full course below or on the freeCodeCamp.org YouTube channel (2-hour watch).

Intro to Advanced Actor-Critic Methods: Reinforcement Learning Course

Beau Carnes — Fri, 30 Jul 2021 22:20:15 +0000

Actor-Critic Methods are very useful reinforcement learning techniques.

Actor-critic methods are most useful for applications in robotics as they allow software to output continuous, rather than discrete actions. This enables control of electric motors to actuate movement in robotic systems, at the expense of increased computational complexity.

We just released a comprehensive course on Actor-Critic methods on the freeCodeCamp.org YouTube channel.

Dr. Tabor developed this course. He is a physicist and former semiconductor engineer who is now a data scientist.

The basic idea behind actor-critic methods is that there are two deep neural networks. The actor network approximates the agent’s policy: a probability distribution that tells us the probability of selecting a (continuous) action given some state of the environment. The critic network approximates the value function: the agent’s estimate of future rewards that follow the current state. These two networks interact to shift the policy towards more profitable states, where profitability is determined by interacting with the environment.

This requires no prior knowledge of how our environment works, or any input regarding rules of the game. All we have to do is let the algorithm interact with the environment and watch as it learns.

This course also incorporates some useful innovations from deep Q learning, such as the use of experience replay buffers and target networks. These increase the stability and robustness of the learned policies, so that our agents are able to learn effective policies for navigating the OpenAI Gym environments.

Here are the algorithms covered in this course:

Actor Critic
Deep Deterministic Policy Gradients (DDPG)
Twin Delayed Deep Deterministic Policy Gradients (TD3)
Proximal Policy Optimization (PPO)
Soft Actor Critic (SAC)
Asynchronous Advantage Actor Critic (A3C)

Watch the full course below or on the freeCodeCamp.org YouTube channel (6-hour watch).

How I planned my meals with Reinforcement Learning on a budget

freeCodeCamp — Tue, 16 Apr 2019 16:00:10 +0000

By Sterling Osborne, PhD Researcher

Following my recent article on applying Reinforcement Learning to real life problems, I decided to demonstrate this with a small example. The aim is to create an algorithm that can find a suitable choice of food products to fit within a budget and meet my personal preferences.

I have also posted the description, data and code kernel to Kaggle and this can be found here.

Please let me know if you have any questions or suggestions.

Photo: Pixabay

Aim

When food shopping, there are many different products for the same ingredient to choose from in supermarkets. Some are less expensive, others are of higher quality. I would like to create a model that, for the required ingredients, can select the optimal products required to make a meal that is both:

Within my budget
Meets my personal preferences

To do this, I will first build a very simple model that can recommend products that fit within my budget, before introducing my preferences.

The reason we use a model is so that we could, in theory, scale the problem to consider more and more ingredients and products that would cause the problem to then be beyond the possibility of any mental calculations.

Method

To achieve this, I will be building a simple reinforcement learning model and I’ll use Monte Carlo learning to find the optimal combination of products.

First, let us formally define the parts of our model as a Markov Decision Process:

Each meal requires a finite number of ingredients; these are our States
Each ingredient has a finite set of possible products; these are the Actions available in that state
Our preferences become the Individual Rewards for selecting each product; we will cover this in more detail later

Monte Carlo learning takes the combined quality of each step towards reaching an end goal and requires that, in order to assess the quality of any step, we must wait and see the outcome of the whole combination. This process is repeated over and over again in episodes with many different products until it finds the selection that appears to lead to a positive outcome repeatedly. This is the reinforcement learning process where our environment is simulated based on the knowledge about costs and preferences we obtained.

Monte Carlo is often avoided due to the time required to go through the whole process before being able to learn. However, in our problem it is required as our final check when establishing whether the combination of products selected is good or bad is to add up the real cost of those selected and check whether or not this is below or above our budget. Furthermore, at least at this stage, we will not be considering more than a few ingredients and so the time taken is not significant in this regard.

https://www.tractica.com/artificial-intelligence/reinforcement-learning-and-its-implications-for-enterprise-artificial-intelligence/

Sample Data

For this demonstration, I have created some sample data for a meal where we have 4 ingredients and 9 products, as shown in the diagram below.

We need to select one product for each ingredient in the meal.

This means we have 2 x 2 x 2 x 3 = 24 possible selections of products for the 4 ingredients.
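The count of 24 can be confirmed by enumerating every selection. This short sketch assumes the product labels a1, a2, b1, b2, c1, c2, d1, d2, d3 used later in the article; the helper name allSelections is illustrative.

```javascript
// Products available for each of the 4 ingredients (9 products in total)
const products = {
  a: ['a1', 'a2'],
  b: ['b1', 'b2'],
  c: ['c1', 'c2'],
  d: ['d1', 'd2', 'd3']
};

// Builds every combination that picks exactly one product per ingredient
function allSelections(groups) {
  return groups.reduce(
    (partial, group) => partial.flatMap(sel => group.map(p => [...sel, p])),
    [[]]
  );
}

const selections = allSelections(Object.values(products));
console.log(selections.length);        // 24
console.log(selections[0].join('→'));  // a1→b1→c1→d1
```

Each additional ingredient multiplies this count by its number of products, which is why exhaustive checking stops being practical as the problem grows.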

I have also included the real cost for each product and V_0.

V_0 is simply the initial quality of each product to meet our requirements and we set this to 0 for each.

Diagram showing the possible product choices for each ingredient

First, we import the required packages and data.

Applying the Model in Theory

For now, I will not introduce any individual rewards for the products. Instead, I will simply focus on whether the combination of products selected is below our budget or not. This outcome is defined as the Terminal Reward of our problem.

For example, say we have a budget of £30 and make the choice:

a1→b1→c1→d1

The real cost of this selection is:

£10+£8+£3+£8 = £29 < £30

And therefore, our terminal reward is:

R_T=+1

Whereas, for the choice:

a2→b2→c2→d1

the real cost of this selection is:

£6+£11+£7+£8 = £32 > £30

And therefore, our terminal reward is:

R_T=−1

For now, we are simply telling our model whether the choice is good or bad and will observe what this does to the results.
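The two worked examples can be reproduced with a few lines of code. This is a sketch rather than the author's model: it uses only the costs quoted above (d2 and d3 are not given in the text, so they are left out), and it assumes that a total exactly equal to the budget counts as acceptable, which the article does not specify.

```javascript
// Costs in pounds, taken from the two examples above
const cost = { a1: 10, b1: 8, c1: 3, d1: 8, a2: 6, b2: 11, c2: 7 };

// Returns the total cost and the terminal reward for one selection of products
function terminalReward(selection, budget) {
  const total = selection.reduce((sum, product) => sum + cost[product], 0);
  // Assumption: spending exactly the budget is treated as a good outcome
  return { total, reward: total <= budget ? 1 : -1 };
}

console.log(terminalReward(['a1', 'b1', 'c1', 'd1'], 30)); // { total: 29, reward: 1 }
console.log(terminalReward(['a2', 'b2', 'c2', 'd1'], 30)); // { total: 32, reward: -1 }
```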

Model Learning

So how does our model actually learn? In short, we get our model to try out lots of combinations of products and at the end of each tell it whether its choice was good or bad. Over time, it will recognise that some products generally lead to getting a good outcome while others do not.

What we end up creating are values for how good each product is, denoted V(a). We have already introduced the initial V(a) for each product, but how do we go from these initial values to actually being able to make a decision?

For this, we need an Update Rule. This tells the model, after each time it has presented its choice of products and we have told it whether its selection is good or bad, how to add this to our initial values.

Our update rule is as follows:

This may look unusual at first but in words we are simply updating the value of any action, V(a), by an amount that is either a little more if the outcome was good or a little less if the outcome was bad.

G is the Return and is simply the total reward obtained. Currently in our example, this is simply the terminal reward (+1 or -1 accordingly). We will reintroduce this later when we include individual product rewards.

Alpha, α, is the Learning Rate and we will demonstrate how this affects the results in more detail later. For now, the simple explanation is: “The learning rate determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing, while a factor of 1 makes the agent consider only the most recent information.” (https://en.wikipedia.org/wiki/Q-learning)
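To see how G and α interact, the sketch below applies a constant-α Monte Carlo update to every product chosen in an episode. It assumes the standard form V(a) ← V(a) + α(G − V(a)), which matches the description above; the function name updateValues and the example values are illustrative.

```javascript
// Moves V(a) a fraction alpha of the way towards the return G
// for every product that was selected in the episode
function updateValues(values, episode, G, alpha) {
  const next = { ...values };
  for (const a of episode) {
    next[a] = next[a] + alpha * (G - next[a]);
  }
  return next;
}

const v0 = { a1: 0, a2: 0, b1: 0, b2: 0, c1: 0, c2: 0, d1: 0, d2: 0, d3: 0 };

// Episode 1: a good outcome (G = +1) with alpha = 0.5
const v1 = updateValues(v0, ['a1', 'b1', 'c1', 'd1'], 1, 0.5);
console.log(v1.a1); // 0.5

// Episode 2: a bad outcome (G = -1) that reuses d1
const v2 = updateValues(v1, ['a2', 'b2', 'c2', 'd1'], -1, 0.5);
console.log(v2.a2); // -0.5
console.log(v2.d1); // -0.25
```

Products that were not part of an episode keep their previous value, and a product such as d1 that appears in both a good and a bad episode ends up between the two extremes.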

Small Demo of Updating Values

So how do we actually use this with our model?

Let us start with a table that has each product and its initial V_0(a):

We now pick a random selection of products; each combination is known as an episode. We also set α = 0.5 for now, just for simplicity in the calculations.

For example:

Therefore, all actions that lead to this positive outcome are updated as well to produce the following table with V1(a):

So let us try again by picking another random episode:

Therefore, we can add V2(a) to our table:

Action Selection

You may have noticed in the demo, I have simply randomly selected the products in each episode. We could do this, but using a completely random selection process may mean that some actions are not selected often enough to know whether they are good or bad.

Similarly, if we went the other way and selected the products greedily, i.e. the ones that currently have the best value, we may miss one that is in fact better but was never given a chance. For example, if we chose the best actions from V_2(a) we would get a2, b1, c1 and either d2 or d3, both of which provide a positive terminal reward. With a purely greedy selection process, we would therefore never consider any other products, as these continue to provide a positive outcome.

Instead, we implement epsilon-greedy action selection, where we randomly select products with probability ϵ and greedily select products with probability 1 − ϵ, where 0 ≤ ϵ ≤ 1.

This means that we are going to reach the optimal choice of products quickly, as we continue to test whether the ‘good’ products are in fact optimal. But it also leaves room for us to explore other products occasionally, just to check that they really aren’t as good as our current choice.
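The selection rule can be sketched in a few lines. This is an illustrative version rather than the code from the Kaggle kernel; the function name, the dictionary layout and the example values are all assumptions.

```python
import random


def epsilon_greedy(values: dict, epsilon: float, rng: random.Random) -> str:
    """Pick a random product with probability epsilon, else the best-valued one."""
    if rng.random() < epsilon:
        # Explore: every product is equally likely.
        return rng.choice(list(values))
    # Exploit: take the product with the highest current V(a).
    return max(values, key=values.get)


# Hypothetical V(a) for the two products of one ingredient.
v = {"a1": 0.25, "a2": 0.5}
rng = random.Random(0)
print(epsilon_greedy(v, 0.0, rng))  # epsilon = 0 is purely greedy
print(epsilon_greedy(v, 0.2, rng))  # mostly greedy, occasionally random
```

Setting epsilon to 1 gives the fully random selection used in the small demo, and setting it to 0 gives the purely greedy selection discussed above.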

Building and Applying our Model

We are now ready to build a simple model as shown in the MCModelv1 function below.

Although this seems complex, I have done nothing more than apply the methods previously discussed in such a way that we can vary the inputs and still obtain results. Admittedly, this was my first attempt at doing this and so my coding may not be perfectly written but should be sufficient for our requirements.

To calculate the terminal reward, we currently use the following condition to check if the total cost is less or more than our budget:
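As a rough illustration of such a condition (the function name is hypothetical, and treating a total exactly equal to the budget as a success is an assumption):

```python
def terminal_reward(total_cost: float, budget: float) -> int:
    """Return +1 if the episode's products fit the budget, otherwise -1."""
    return 1 if total_cost <= budget else -1


print(terminal_reward(25, 30))  # under budget
print(terminal_reward(35, 30))  # over budget
```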

The full code for the model is too large to fit here nicely, but can be found at the linked Kaggle page.

We now run our model with some sample variables:

In our function, we have 6 outputs from the model:

Mdl[0]: Returns the Sum of all V(a) for each episode
Mdl[1]: Returns the Sum of V(a) for the cheapest products, which we can identify because our sample data is so simple
Mdl[2]: Returns the Sum of V(a) for the non-cheapest products
Mdl[3]: Returns the optimal actions of the final episode
Mdl[4]: Returns the data table with the final V(a) added for each product
Mdl[5]: Shows the optimal action at each episode

There is a lot to take away from these, so let us go through each and establish what we can learn to improve our model.

Optimal actions of final episode

First, let’s see what the model suggests we should select. In this run it suggests actions, or products, that have a total cost below budget which is good.

However, there is still more that we can check to help us understand what is going on.

First, we can plot the total V for all actions, and we see that this is converging, which is ideal. We want our model to converge so that as we try more episodes we are ‘zoning in’ on the optimal choice of products. The output converges because we scale the amount the model learns each time by a factor of α, in this case 0.5. We will show later what happens if we vary this or don’t apply it at all.

We have also plotted, separately, the sum of V for the products we know are cheapest (which we can identify because the sample is so small) and the sum for the others. Again, both are converging positively, although the cheaper products appear to have slightly higher values.

So why is this happening and why did the model suggest the actions it did?

To understand that, we need to dissect the suggestions made by the model at each episode and how this relates to our return.

Below, we have taken the optimal action for each state. We can see that the suggested actions vary greatly between the early episodes, but the model appears to settle on what it wants to suggest very quickly.

Therefore, I have plotted the total cost of the suggested actions at each episode and we can see the actions vary initially then smooth out and the resulting total cost is below our budget. This helps us understand what is going on greatly.

So far, all we have told the model is to provide a selection that is below budget, and it has. It has simply found an answer that is below the budget, as required.

So what is the next step? Before I introduce rewards I want to demonstrate what happens if I vary some of the parameters and what we can do if we decide to change what we want our model to suggest.

Effect of Changing Parameters and How to Change Model’s Aim

We have a few parameters that can be changed:

The Budget
Our learning rate, α
Our action selection parameter, ϵ

Varying Budget

First, let us observe what happens if we make our budget either impossibly low or high.

A budget that is too small means we only ever obtain a negative reward, which forces our V to converge negatively, whereas a budget that is too high causes our V to converge positively, as all actions are continually rewarded.

The latter resembles what we had in our first run: many of the episodes lead to positive outcomes, so many combinations of products are possible and there is little distinction between the cheapest products and the rest.

If instead we consider a budget that is reasonably low given the prices of the products, we can see a trend where the cheapest products look to be converging positively and the more expensive products converging negatively. However, the smoothness of these is far from ideal; both appear to be oscillating greatly between each episode.

So what can we do to reduce the ‘spikiness’ of the outputs? This leads us onto our next parameter, alpha.

Varying Alpha

A good explanation of what alpha is doing to our output is given by Stack Overflow user VishalTheBeast:

“Learning rate tells the magnitude of step that is taken towards the solution.

It should not be too big a number as it may continuously oscillate around the minima and it should not be too small of a number else it will take a lot of time and iterations to reach the minima.

The reason why decay is advised in learning rate is because initially when we are at a totally random point in solution space we need to take big leaps towards the solution and later when we come close to it, we make small jumps and hence small improvements to finally reach the minima.

Analogy can be made as: in the game of golf when the ball is far away from the hole, the player hits it very hard to get as close as possible to the hole. Later when he reaches the flagged area, he choses a different stick to get accurate short shot.

So it’s not that he won’t be able to put the ball in the hole without choosing the short shot stick, he may send the ball ahead of the target two or three times. But it would be best if he plays optimally and uses the right amount of power to reach the hole. Same is for decayed learning rate.” — source

To better demonstrate the effect of varying our alpha, I will be using an animated plot created using Plot.ly.

I have written a more detailed guide on how to do this here.

In our first animation, we vary alpha between 1 and 0.1. This enables us to see that as we reduce alpha our output smooths somewhat, but it is still pretty rough.

However, even though the results are smoothing out, they are no longer converging in 100 episodes and, furthermore, the output seems to alternate between each alpha. This is due to a combination of small alphas requiring more episodes to learn and our action selection parameter epsilon being 0.5. Essentially, the output is still being decided by randomness half of the time, and so our results are not converging within the 100 episode frame.

Running this through our animated plots produces something similar to the following:

Varying Epsilon

With the previous results in mind, we now fix alpha to be 0.05 and vary epsilon between 1 and 0 to show the effect of completely randomly selecting actions to selecting actions greedily.

The graphs below show three snapshots from varying epsilon, but the animated version can be viewed in the Kaggle kernel.

We see that having a high epsilon creates very sporadic results. Therefore we should select something reasonably small, like 0.2. Although having epsilon equal to 0 looks good because of how smooth the curve is, as we mentioned earlier, this may lead us to a choice very quickly, but that choice may not be the best. We want some randomness so the model can explore other actions if needed.

Increasing the Number of Episodes

Lastly, we can increase the number of episodes. I refrained from doing this sooner because we were running 10 models in a loop to output our animated graphs and this would have caused the time taken to run the model to explode.

We noted that a low alpha would require more episodes to learn so we can run our model for 1000 episodes.

However, we still notice that the output is oscillating, but, as mentioned before, this is due to our aim being simply to recommend a combination that is below budget. What this shows is that the model can’t find the single best combination when there are many that fit below our budget.

Therefore, what happens if we change our aim slightly so that we can use the model to find the cheapest combination of products?

Changing our Model’s Aim to Find the Cheapest Combination of Products

The aim of this is to more clearly separate the cheapest products from the rest, and it nearly always provides us with the cheapest combination of products.

To do this, all we need do is adapt our model slightly to provide a terminal reward that is relative to how far below or above budget this combination in the episode is.

This can be done by changing the calculation for return to:
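The exact expression lives in the Kaggle kernel. One hypothetical form, shown here only to illustrate the idea, scales the gap between the budget and the total cost by the budget, so that cheaper combinations earn a larger return and over-budget ones earn a negative return. The function name and the scaling are assumptions.

```python
def relative_return(total_cost: float, budget: float) -> float:
    """Return grows the further the episode is below budget (assumed scaling)."""
    return (budget - total_cost) / budget


# A cheaper combination is rewarded more than one that only just fits.
print(relative_return(20, 30))
print(relative_return(25, 30))
print(relative_return(35, 30))  # over budget, so negative
```

Because the return is no longer the same +1 for every combination that fits, the values of the cheapest products are pushed further apart from the rest.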

We now see that the separation between the cheapest products and the others is emphasised.

This really demonstrates the flexibility of reinforcement learning and how easy it can be to adapt the model based on your aims.

Introducing Preferences

So far, we have not included any personal preferences towards products. If we wanted to include this, we can simply introduce rewards for each product whilst still having a terminal reward that encourages the model to be below budget.

This can be done by changing the calculation for return to:

So why is our return calculation now like this?

Well firstly, we still want our combination to be below budget, so we provide the positive and negative rewards for being below and above budget respectively.

Next, we want to account for the reward of each product. For our purposes, we define the rewards to be a value between 0 and 1. MC return is formally calculated using the following:

γ is the discount factor and this tells us how much we value later steps compared to earlier steps. In our case, all actions are equally as important to reaching the desired outcome of being below budget so we set γ=1.
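For reference, the standard textbook definition of the discounted return for an episode that ends at step T, which is assumed here to be the formula intended, can be written as:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots
    = \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1}
```

With γ = 1 every power of γ equals 1, so the return reduces to the plain sum of the rewards collected during the episode.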

However, to ensure that we reach the primary goal of being below budget, we take the average of the rewards across the actions, so that this preference term always stays between 0 and 1 and cannot dominate the terminal reward of +1 or -1.

Again, the full model can be found in the Kaggle kernel but is too large to include here.
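A minimal sketch of this idea, assuming the return is the terminal reward plus the mean of the per-product rewards (the kernel's exact expression may differ, and the names are hypothetical):

```python
def episode_return(total_cost: float, budget: float, product_rewards: list) -> float:
    """Terminal reward (+1 or -1) plus the average per-product reward in [0, 1]."""
    terminal = 1 if total_cost <= budget else -1
    mean_reward = sum(product_rewards) / len(product_rewards)
    return terminal + mean_reward


# Best possible preferences but over budget, versus no preferences but under budget.
print(episode_return(40, 30, [1, 1, 1, 1]))
print(episode_return(20, 30, [0, 0, 0, 0]))
```

Under this assumption an over-budget episode can never score higher than an under-budget one, because the averaged preference term adds at most 1.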

Introducing Preferences using Rewards

Say we decided we wanted products a1 and b2; we could add a reward to each. Let us see what happens if we do this in the output and graphs below. We have changed our budget slightly, as a1 and b2 add up to £21, which means there is no way to select two more products that would keep the total below a budget of £23.

Applying a very high reward forces the model to pick a1 and b2 then work around to find products that will put it under our budget.

I have kept in the comparison between the cheapest products and the rest to show that the model no longer favours the cheapest products. Instead we get the output a1, b2, c1 and d3, which has a total cost of £25. This is both below our budget and includes our preferred products.

Let’s try one more reward signal. This time, I give some reward to each product but want the model to provide the best combination under my rewards that still keeps us below budget.

We have the following rewards:

Running this model a few times shows that it:

Often selects a1, as this has a much higher reward
Always picks c1, as the rewards are the same but it is cheaper
Has a hard time choosing between b1 and b2, as the rewards are 0.5 and 0.6 but the costs are £8 and £11 respectively
Typically selects d3, as it is significantly cheaper than d1 even though its reward is slightly lower

Conclusion

We have managed to build a Monte Carlo Reinforcement Learning model to:

recommend products below a budget,
recommend the cheapest products, and
recommend the best products based on preferences while still staying below a budget.

Along the way, we have demonstrated the effect of changing parameters in reinforcement learning and how understanding these enables us to reach a desired result.

There is much more that we could do. In my mind, the end goal would be to apply this to a real recipe and real products from a supermarket, where the increased number of ingredients and products needs to be accounted for.

I created this sample data and problem to better my understanding of Reinforcement Learning and hope that you find it useful.

Thanks for reading!

Sterling Osborne

How to use AI to play Sonic the Hedgehog. It’s NEAT!

freeCodeCamp — Tue, 02 Apr 2019 16:28:11 +0000

By Vedant Gupta

Generation after generation, humans have adapted to become more fit with our surroundings. We started off as primates living in a world of eat or be eaten. Eventually we evolved into who we are today, reflecting modern society. Through the process of evolution we become smarter. We are able to work better with our environment and accomplish what we need to.

The concept of learning through evolution can also be applied to Artificial Intelligence. We can train AIs to perform certain tasks using NEAT, NeuroEvolution of Augmenting Topologies. Simply put, NEAT is an algorithm which takes a batch of AIs (genomes) attempting to accomplish a given task. The top performing AIs “breed” to create the next generation. This process continues until we have a generation which is capable of completing what it needs to.

Clip of AI playing STH

NEAT is amazing because it eliminates the need for pre-existing data required to train our AIs. Using the power of NEAT and OpenAI’s Gym Retro I trained an AI to play Sonic the Hedgehog for the SEGA Genesis. Let’s learn how!

A NEAT Neural Network (Python Implementation)

GitHub Repository

Vedant-Gupta523/sonicNEAT

Note: All of the code in this article and the repo above is a slightly modified version of Lucas Thompson's Sonic AI Bot Using Open-AI and NEAT YouTube tutorials and code.

Understanding OpenAI Gym

If you are not already familiar with OpenAI Gym, look through the terminology below. These terms will be used frequently throughout the article.

agent — The AI player. In this case it will be Sonic.

environment — The complete surroundings of the agent. The game environment.

action — Something the agent has the option of doing (i.e. move left, move right, jump, do nothing).

step — Performing 1 action.

state — A frame of the environment. The current situation the AI is in.

observation — What the AI observes from the environment.

fitness — How well our AI is performing.

done — When the AI has completed its task or can’t continue any further.

Installing Dependencies

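Below are GitHub links for OpenAI Gym Retro and neat-python with installation instructions.

OpenAI Gym Retro: https://github.com/openai/retro

NEAT: https://github.com/CodeReclaimers/neat-python

Use pip to install the remaining libraries, such as opencv-python (imported as cv2) and numpy. pickle is part of Python’s standard library, so it needs no installation.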

Import libraries and set environment

To start, we need to import all of the modules we will use:

import retro
import numpy as np
import cv2
import neat
import pickle

We will also define our environment, consisting of the game and the state:

env = retro.make(game = "SonicTheHedgehog-Genesis", state = "GreenHillZone.Act1")

In order to train an AI to play Sonic the Hedgehog, you will need the game’s ROM (game file). The simplest way to get it is by purchasing the game on Steam for $5. You could also find free downloads of the ROM online, but this is illegal, so don’t do it.

In the OpenAI repository at retro/retro/data/stable/ you will find a folder for Sonic the Hedgehog Genesis. Place the game’s ROM here and make sure it is called rom.md. This folder also contains .state files. You can choose one and set the state parameter equal to it. I chose GreenHillZone Act 1 since it is the very first level of the game.

Understanding data.json and scenario.json

In the Sonic the Hedgehog folder you will have these two files:

data.json

{
  "info": {
    "act": {
      "address": 16776721,
      "type": "|u1"
    },
    "level_end_bonus": {
      "address": 16775126,
      "type": "|u1"
    },
    "lives": {
      "address": 16776722,
      "type": "|u1"
    },
    "rings": {
      "address": 16776736,
      "type": ">u2"
    },
    "score": {
      "address": 16776742,
      "type": ">u4"
    },
    "screen_x": {
      "address": 16774912,
      "type": ">u2"
    },
    "screen_x_end": {
      "address": 16774954,
      "type": ">u2"
    },
    "screen_y": {
      "address": 16774916,
      "type": ">u2"
    },
    "x": {
      "address": 16764936,
      "type": ">i2"
    },
    "y": {
      "address": 16764940,
      "type": ">u2"
    },
    "zone": {
      "address": 16776720,
      "type": "|u1"
    }
  }
}

scenario.json

{
  "done": {
    "variables": {
      "lives": {
        "op": "zero"
      }
    }
  },
  "reward": {
    "variables": {
      "x": {
        "reward": 10.0
      }
    }
  }
}

Both these files contain important information pertaining to the game and its training.

As it sounds, the data.json file contains information/data on different game specific variables (i.e. Sonic’s x-position, number of lives he has, etc.).

The scenario.json file allows us to perform actions in sync with the values of the data variables. For example we can reward Sonic 10.0 every time his x-position increases. We could also set our done condition to true when Sonic’s lives hit 0.

Understanding NEAT feedforward configuration

The config-feedforward file can be found in my GitHub repository linked above. It acts like a settings menu to set up our training. To point out a few simple settings:

fitness_threshold     = 10000 # How fit we want Sonic to become
pop_size              = 20 # How many Sonics per generation
num_inputs            = 1120 # Number of inputs into our model
num_outputs           = 12 # 12 buttons on Genesis controller

There are tons of settings you can experiment with to see how they affect your AI’s training! To learn more about NEAT and the different settings in the feedforward configuration, I would highly recommend reading the neat-python documentation.

Putting it all together: Creating the Training File

Setting up configuration

Our feedforward configuration is defined and stored in the variable config.

config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction, neat.DefaultSpeciesSet, neat.DefaultStagnation, 'config-feedforward')

Creating a function to evaluate each genome

We start by creating the function, eval_genomes, which will evaluate our genomes (a genome could be compared to 1 Sonic in a population of Sonics). For each genome we reset the environment and take a random action.

for genome_id, genome in genomes:
        ob = env.reset()
        ac = env.action_space.sample()

We will also record the height, width and number of colour channels of the game environment’s observation. We divide the height and width by 8 to shrink the input.

inx, iny, inc = env.observation_space.shape
inx = int(inx/8)
iny = int(iny/8)
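The division by 8 is what ties the observation to the num_inputs value in the configuration. The arithmetic below assumes the Genesis frame is 224 pixels high and 320 wide with 3 colour channels; that shape is an assumption worth checking against env.observation_space.shape on your own install.

```python
# Assumed observation shape for the Genesis emulator: (height, width, channels).
frame_shape = (224, 320, 3)

height, width, channels = frame_shape
small_h = height // 8   # scaled-down height
small_w = width // 8    # scaled-down width

# One greyscale value per remaining pixel is fed to the network.
num_inputs = small_h * small_w
print(small_h, small_w, num_inputs)  # 28 40 1120
```

If you change the scaling factor, num_inputs in config-feedforward has to change with it.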

We create a recurrent neural network (RNN) using the NEAT library and input the genome and our chosen configuration.

net = neat.nn.recurrent.RecurrentNetwork.create(genome, config)

Finally, we define a few variables: current_max_fitness (the highest fitness in the current population), fitness_current (the current fitness of the genome), frame (the frame count), counter (to count the number of steps our agent takes), xpos (the x-position of Sonic), and done (whether or not we have reached our fitness goal).

current_max_fitness = 0
fitness_current = 0
frame = 0
counter = 0
xpos = 0
done = False

While we have not reached our done requirement, we need to run the environment, increment our frame counter, and shape our observation to mimic that of the game (still for each genome).

env.render()
frame += 1
# cv2.resize expects (width, height); inx is the scaled height, iny the scaled width
ob = cv2.resize(ob, (iny, inx))
ob = cv2.cvtColor(ob, cv2.COLOR_BGR2GRAY)
ob = np.reshape(ob, (inx, iny))

We will take our observation and put it in a one-dimensional array, so that our RNN can understand it. We receive our output by feeding this array to our RNN.

# Flatten the 2-D greyscale frame into a 1-D array for the network
imgarray = ob.flatten()
nnOutput = net.activate(imgarray)

Using the output from the RNN our AI takes a step. From this step we can extract fresh information: a new observation, a reward, whether or not we have reached our done requirement, and information on variables in our data.json (info).

ob, rew, done, info = env.step(nnOutput)

At this point we need to evaluate our genome’s fitness and whether or not it has met the done requirement.

We look at our “x” variable from data.json and check if it has surpassed the length of the level. If it has, we will increase our fitness by our fitness threshold signifying we are done.

xpos = info['x']

if xpos >= 10000:
        fitness_current += 10000
        done = True

Otherwise, we will increase our current fitness by the reward we earned from performing the step. We also check if we have a new highest fitness and adjust the value of our current_max_fitness accordingly.

fitness_current += rew

if fitness_current > current_max_fitness:
        current_max_fitness = fitness_current
        counter = 0
else:
        counter += 1

Lastly, we check if we are done or if our genome has taken 250 steps. If so, we print information on the genome which was simulated. Otherwise we keep looping until one of the two requirements has been satisfied.

if done or counter == 250:
        done = True
        print(genome_id, fitness_current)

genome.fitness = fitness_current
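The stopping logic is easier to see in isolation. The sketch below replays a list of made-up rewards through the same counter rule, without the emulator; the function name and the reward sequence are hypothetical, and the environment's own done signal is left out for brevity.

```python
def steps_until_stop(rewards, patience=250):
    """Replay rewards and stop once fitness has not improved for `patience` steps."""
    fitness = 0.0
    best = 0.0
    counter = 0
    for step, rew in enumerate(rewards, start=1):
        fitness += rew
        if fitness > best:
            best = fitness
            counter = 0      # progress made, so reset the stagnation counter
        else:
            counter += 1     # no new best on this step
        if counter == patience:
            return step, fitness
    return len(rewards), fitness


# Sonic moves right for 5 steps, then gets stuck.
print(steps_until_stop([10] * 5 + [0] * 300))  # (255, 50.0)
```

This is why a genome that stands still is cut off quickly instead of wasting time until it runs out of lives.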

Defining the population, printing training stats, and more

The absolute last thing we need to do is define our population, print out statistics from our training, save checkpoints (in case you want to pause and resume training), and pickle our winning genome.

p = neat.Population(config)

p.add_reporter(neat.StdOutReporter(True))
stats = neat.StatisticsReporter()
p.add_reporter(stats)
p.add_reporter(neat.Checkpointer(1))

winner = p.run(eval_genomes)

with open('winner.pkl', 'wb') as output:
    pickle.dump(winner, output, 1)

All that’s left is the matter of running the program and watching Sonic slowly learn how to beat the level!

Earlier generation vs Later generation

To see all of the code put together check out the Training.py file in my GitHub repository.

Bonus: Parallel Training

If you have a multi-core CPU you can run multiple training simulations at once, which can substantially increase the rate at which you train your AI! Although I will not go through the specifics on how to do this in this article, I highly suggest you check the sonicTraning.py implementation in my GitHub repository.

Conclusion

That’s all there is to it! With a few adjustments, this framework is applicable to any game for the NES, SNES, SEGA Genesis, and more. If you have any questions or you just want to say hello, feel free to email me at vedantgupta523[at]gmail[dot]com.

Also, be sure to check out Lucas Thompson's Sonic AI Bot Using Open-AI and NEAT YouTube tutorials and code to see what originally inspired this article.

Key Takeaways

Neuroevolution of Augmenting Topologies (NEAT) is an algorithm used to train AI to perform certain tasks. It is modeled after genetic evolution.
NEAT eliminates the need for pre-existing data when training AI.
OpenAI Gym Retro and NEAT can be combined in Python to train an AI to play a wide range of retro games.

How to apply Reinforcement Learning to real life planning problems

freeCodeCamp — Tue, 12 Mar 2019 21:35:16 +0000

By Sterling Osborne, PhD Researcher

Recently, I have published some examples where I have created Reinforcement Learning models for some real life problems. For example, using Reinforcement Learning for Meal Planning based on a Set Budget and Personal Preferences.

Reinforcement Learning can be used in this way for a variety of planning problems including travel plans, budget planning and business strategy. The two advantages of using RL are that it takes into account the probability of outcomes and allows us to control parts of the environment. Therefore, I decided to write a simple example so others may consider how they could start using it to solve some of their day-to-day or work problems.

What is Reinforcement Learning?

Reinforcement Learning (RL) is the process of testing which actions are best for each state of an environment, essentially by trial and error. The model starts with a random policy, and each time an action is taken a value (known as a reward) is fed to the model. This continues until an end goal is reached, e.g. you win or lose the game, at which point that run (or episode) ends and the game resets.

As the model goes through more and more episodes, it begins to learn which actions are more likely to lead us to a positive outcome. Therefore it finds the best actions in any given state, known as the optimal policy.

Reinforcement Learning General Process

Many of the RL applications online train models on a game or virtual environment where the model is able to interact with the environment repeatedly. For example, you let the model play a simulation of tic-tac-toe over and over so that it observes success and failure of trying different moves.

In real life, it is likely we do not have access to train our model in this way. For example, a recommendation system in online shopping needs a person’s feedback to tell us whether it has succeeded or not, and this is limited in its availability based on how many users interact with the shopping site.

Instead, we may have sample data that shows shopping trends over a time period that we can use to create estimated probabilities. Using these, we can create what is known as a Partially Observed Markov Decision Process (POMDP) as a way to generalise the underlying probability distribution.

Partially Observed Markov Decision Processes (POMDPs)

Markov Decision Processes (MDPs) provide a framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. The key feature of MDPs is that they follow the Markov Property; all future states are independent of the past given the present. In other words, the probability of moving into the next state is only dependent on the current state.

POMDPs work similarly, except that they are a generalisation of MDPs. In short, this means the model cannot simply interact with the environment but is instead given a set probability distribution based on what we have observed. More info can be found here. We could use value iteration methods on our POMDP, but instead I’ve decided to use Monte Carlo Learning in this example.

Example Environment

Imagine you are back at school (or perhaps still are) and are in a classroom. The teacher has a strict policy on paper waste and requires that any pieces of scrap paper be passed to him at the front of the classroom, where he will place the waste into the bin (trash can).

However, some students in the class care little for the teacher’s rules and would rather save themselves the trouble of passing the paper round the classroom. Instead, these troublesome individuals may choose to throw the scrap paper into the bin from a distance. Now this angers the teacher and those that do this are punished.

This introduces a very basic action-reward concept, and we have an example classroom environment as shown in the following diagram.

Our aim is to find the best instructions for each person so that the paper reaches the teacher and is placed into the bin, rather than being thrown into the bin by a student.

States and Actions

In our environment, each person can be considered a state and they have a variety of actions they can take with the scrap paper. They may choose to pass it to an adjacent classmate, hold onto it, or some may choose to throw it into the bin. We can therefore map our environment to a more standard grid layout as shown below.

This is purposefully designed so that each person, or state, has four actions: up, down, left or right and each will have a varied ‘real life’ outcome based on who took the action. An action that puts the person into a wall (including the black block in the middle) indicates that the person holds onto the paper. In some cases, this action is duplicated, but is not an issue in our example.

For example, person A’s actions result in:

Up = Throw into bin
Down = Hold onto paper
Left = Pass to person B
Right = Hold onto paper

Probabilistic Environment

For now, the decision maker that partly controls the environment is us. We will tell each person which action they should take. This is known as the policy.

The first challenge I faced in my learning was understanding that the environment is likely probabilistic and what this means. In a probabilistic environment, when we instruct a state to take an action under our policy, there is a probability associated with whether this is successfully followed. In other words, if we tell person A to pass the paper to person B, they can decide not to follow the instructed action in our policy and instead throw the scrap paper into the bin.

Another example is if we are recommending online shopping products there is no guarantee that the person will view each one.

Observed Transitional Probabilities

To find the observed transitional probabilities, we need to collect some sample data about how the environment acts. Before we collect information, we first introduce an initial policy. To start the process, I have randomly chosen one that looks as though it would lead to a positive outcome.

Now we observe the actions each person takes given this policy. In other words, say we sat at the back of the classroom and simply observed the class and observed the following results for person A:

Person A’s Observed Actions

We see that a paper passed through this person 20 times; 6 times they kept hold of it, 8 times they passed it to person B and another 6 times they threw it in the trash. This means that under our initial policy, the probability of keeping hold or throwing it in the trash for this person is 6/20 = 0.3 and likewise 8/20 = 0.4 to pass to person B. We can observe the rest of the class to collect the following sample data:

Observed Real Life Outcome

Likewise, we then calculate the probabilities to be the following matrix and we could use this to simulate experience. The accuracy of this model will depend greatly on whether the probabilities are true representations of the whole environment. In other words, we need to make sure we have a sample that is large and rich enough in data.

Observed Transition Probability Function
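Turning observed counts into probabilities is a one-line division, and the result can then be used to simulate experience. The sketch below uses person A's counts from the text; the outcome labels, function name and random seed are illustrative choices.

```python
import random

# Person A's observed outcomes under the initial policy (20 papers in total).
observed_a = {"hold": 6, "pass_to_B": 8, "throw_in_bin": 6}


def to_probabilities(counts):
    """Turn raw outcome counts into estimated probabilities."""
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}


probs_a = to_probabilities(observed_a)
print(probs_a)

# Simulate one piece of paper reaching person A (seeded for repeatability).
rng = random.Random(42)
outcome = rng.choices(list(probs_a), weights=list(probs_a.values()), k=1)[0]
print(outcome)
```

Repeating the same calculation for every person gives one row of the transition probability matrix per state.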

Multi-Armed Bandits, Episodes, Rewards, Return and Discount Rate

So we have our transition probabilities estimated from the sample data under a POMDP. The next step, before we introduce any models, is to introduce rewards. So far, we have only discussed the outcome of the final step; either the paper gets placed in the bin by the teacher and nets a positive reward, or it gets thrown by A or M and nets a negative reward. This final reward that ends the episode is known as the Terminal Reward.

But there is also a third outcome that is less than ideal: the paper continually gets passed around and never reaches the bin (or takes far longer than we would like). Therefore, in summary, we have three final outcomes:

Paper gets placed in bin by teacher and nets a positive terminal reward
Paper gets thrown in bin by a student and nets a negative terminal reward
Paper gets continually passed around room or gets stuck on students for a longer period of time than we would like

To discourage students from throwing the paper in the bin, we give this action a large negative reward, say -1, and because the teacher is pleased when the paper is placed in the bin, this nets a large positive reward, +1. To avoid the outcome where the paper continually gets passed around the room, we set the reward for all other actions to be a small negative value, say -0.04.

If we set this to a positive or zero value, the model may let the paper go round and round, as it would be better to gain small positives than risk getting close to the negative outcome. This number is also very small because the episode only collects a single terminal reward but could take many steps to end, and we need to ensure that, if the paper is placed in the bin, the positive outcome is not cancelled out.

Please note: the rewards are always relative to one another and I have chosen arbitrary figures, but these can be changed if the results are not as desired.

Although we have inadvertently discussed episodes in the example, we have yet to formally define it. An episode is simply the actions each paper takes through the classroom reaching the bin, which is the terminal state and ends the episode. In other examples, such as playing tic-tac-toe, this would be the end of a game where you win or lose.

The paper could in theory start at any state and this introduces why we need enough episodes to ensure that every state and action is tested enough so that our outcome is not being driven by invalid results. However, on the flip side, the more episodes we introduce the longer the computation time will be and, depending on the scale of the environment, we may not have an unlimited amount of resources to do this.

This is known as the Multi-Armed Bandit problem; with finite time (or other resources), we need to ensure that we test each state-action pair enough that the actions selected in our policy are, in fact, the optimal ones. In other words, we need to validate that actions that have led us to good outcomes in the past did so not by sheer luck but because they are in fact the correct choice, and likewise for the actions that appear poor. In our example this may seem simple with how few states we have, but imagine if we increased the scale and how this becomes more and more of an issue.

The overall goal of our RL model is to select the actions that maximise the expected cumulative rewards, known as the return. In other words, the Return is simply the total reward obtained for the episode. A simple way to calculate this would be to add up all the rewards, including the terminal reward, in each episode.

A more rigorous approach is to consider the first steps to be more important than later ones in the episode by applying a discount factor, gamma, in the following formula:

In other words, we sum all the rewards but weigh down later steps by a factor of gamma to the power of how many steps it took to reach them.
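A short sketch of this calculation in Python, assuming the common convention that the first reward in the episode is weighted by gamma to the power 0. The function name `discounted_return` and the three-step episode are hypothetical; the reward values -0.04 and +1 are the ones chosen earlier.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step k by gamma ** k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Hypothetical episode: two ordinary steps at -0.04 each, then the +1 terminal reward.
episode_rewards = [-0.04, -0.04, 1.0]

undiscounted = discounted_return(episode_rewards, gamma=1.0)  # approximately 0.92
discounted = discounted_return(episode_rewards, gamma=0.5)    # approximately 0.19
print(undiscounted, discounted)
```

With gamma = 1 this reduces to the simple sum of rewards; with gamma = 0.5 the terminal reward, two steps away, is scaled by 0.25.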

If we think about our example, using a discounted return becomes even clearer to imagine as the teacher will reward (or punish accordingly) anyone who was involved in the episode but would scale this based on how far they are from the final outcome.

For example, if the paper passed from A to B to M who threw it in the bin, M should be punished most, then B for passing it to him and lastly person A, who is still involved in the final outcome but less so than M or B. This also emphasises that the longer it takes (based on the number of steps) to start in a state and reach the bin, the less that state will be rewarded or punished by the terminal outcome, while it accumulates negative rewards for taking more steps.

Applying a Model to our Example

As our example environment is small, we can apply each and show some of the calculations performed manually and illustrate the impact of changing parameters.

For any algorithm, we first need to initialise the state value function, V(s), and have decided to set each of these to 0 as shown below.

Next, we let the model simulate experience on the environment based on our observed probability distribution. The model starts a piece of paper in random states and the outcomes of each action under our policy are based on our observed probabilities. So for example, say we have the first three simulated episodes to be the following:

With these episodes we can calculate our first few updates to our state value function using each of the three models given. For now, we pick arbitrary alpha and gamma values to be 0.5 to make our hand calculations simpler. We will show later the impact this variable has on results.

First, we apply temporal difference 0, the simplest of our models and the first three value updates are as follows:

So how have these been calculated? Well because our example is small we can show the calculations by hand.
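As a hedged sketch of one such hand calculation, the standard TD(0) rule V(s) ← V(s) + alpha × (r + gamma × V(s') − V(s)) can be written in Python as below. The episode shown (G passes to the Teacher, who places the paper in the bin) and the state labels are hypothetical stand-ins, not the three simulated episodes from the article; alpha and gamma are both 0.5 as chosen above.

```python
ALPHA = 0.5  # learning rate used for the hand calculations
GAMMA = 0.5  # discount rate used for the hand calculations


def td0_update(values, state, reward, next_state, alpha=ALPHA, gamma=GAMMA):
    """Apply one TD(0) update to values[state] and return the new value."""
    td_target = reward + gamma * values[next_state]
    values[state] += alpha * (td_target - values[state])
    return values[state]


# All state values start at 0; "Bin" stands for the terminal state.
values = {"G": 0.0, "Teacher": 0.0, "Bin": 0.0}

# Hypothetical episode as (state, reward, next_state) steps.
episode = [("G", -0.04, "Teacher"), ("Teacher", 1.0, "Bin")]
for state, reward, next_state in episode:
    td0_update(values, state, reward, next_state)

print(values)
```

After this single episode only the Teacher has picked up the positive terminal reward (0.5), while G has only seen the small step cost (-0.02). This is the one-step-at-a-time propagation that TD(0) produces.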

So what can we observe at this early stage? Firstly, using TD(0) appears unfair to some states, for example person D, who, at this stage, has gained nothing from the paper reaching the bin two out of three times. Their update has only been affected by the value of the next stage, but this emphasises how the positive and negative rewards propagate outwards from the corner towards the states.

As we take more episodes the positive and negative terminal rewards will spread out further and further across all states. This is shown roughly in the diagram below where we can see that the two episodes that resulted in a positive result impact the value of states Teacher and G whereas the single negative episode has punished person M.

To show this, we can try more episodes. If we repeat the same three paths already given we produce the following state value function:

(Please note, we have repeated these three episodes for simplicity in this example but the actual model would have episodes where the outcomes are based on the observed transition probability function.)

The diagram above shows the terminal rewards propagating outwards from the top right corner to the states. From this, we may decide to update our policy as it is clear that the negative terminal reward passes through person M and therefore B and C are impacted negatively. Therefore, based on V27, for each state we may decide to update our policy by selecting the next best state value for each state as shown in the figure below.

There are two causes for concern in this example: the first is that person A’s best action is to throw it into the bin and net a negative reward. This is because none of the episodes have visited this person and emphasises the multi-armed bandit problem. In this small example there are very few states so would require many episodes to visit them all, but we need to ensure this is done.

The reason this action is better for this person is because neither of the terminal states have a value but rather the positive and negative outcomes are in the terminal rewards. We could then, if our situation required it, initialise V0 with figures for the terminal states based on the outcomes.

Secondly, the state value of person M is flipping back and forth between -0.03 and -0.51 (approx.) after the episodes and we need to address why this is happening. This is caused by our learning rate, alpha. For now, we have only introduced our parameters (the learning rate alpha and discount rate gamma) but have not explained in detail how they will impact results.

A large learning rate may cause the results to oscillate, but conversely it should not be so small that it takes forever to converge. This is shown further in the figure below that demonstrates the total V(s) for every episode and we can clearly see how, although there is a general increasing trend, it is oscillating back and forth between episodes. Another good explanation for learning rate is as follows:

“In the game of golf when the ball is far away from the hole, the player hits it very hard to get as close as possible to the hole. Later when he reaches the flagged area, he chooses a different stick to get accurate short shot.

So it’s not that he won’t be able to put the ball in the hole without choosing the short shot stick, he may send the ball ahead of the target two or three times. But it would be best if he plays optimally and uses the right amount of power to reach the hole.”

Source: “Learning rate of a Q learning agent”, stackoverflow.com

There are some complex methods for establishing the optimal learning rate for a problem but, as with any machine learning algorithm, if the environment is simple enough you can iterate over different values until convergence is reached. This is a simple form of hyperparameter tuning. In a recent RL project, I demonstrated the impact of reducing alpha using an animated visual and this is shown below. This demonstrates the oscillation when alpha is large and how this becomes smoothed as alpha is reduced.

Likewise, we must also have our discount rate to be a number between 0 and 1, oftentimes this is taken to be close to 0.9. The discount factor tells us how important rewards in the future are; a large number indicates that they will be considered important whereas moving this towards 0 will make the model consider future steps less and less.

With both of these in mind, we can change both alpha from 0.5 to 0.2 and gamma from 0.5 to 0.9 and we achieve the following results:

Because our learning rate is now much smaller, the model takes longer to learn and the values are generally smaller. The most noticeable is the teacher, which is clearly the best state. However, this trade-off for increased computation time means our value for M is no longer oscillating to the degree it was before. We can now see this in the diagram below for the sum of V(s) following our updated parameters. Although it is not perfectly smooth, the total V(s) slowly increases at a much smoother rate than before and appears to converge as we would like, but requires approximately 75 episodes to do so.

Changing the Goal Outcome

Another crucial advantage of RL that we haven’t mentioned in too much detail is that we have some control over the environment. Currently, the rewards are based on what we decided would be best to get the model to reach the positive outcome in as few steps as possible.

However, say the teacher changed and the new one didn’t mind the students throwing the paper in the bin so long as it reached it. Then we can change our negative reward around this and the optimal policy will change.

This is particularly useful for business solutions. For example, say you are planning a strategy and know that certain transitions are less desired than others, then this can be taken into account and changed at will.

Conclusion

We have now created a simple Reinforcement Learning model from observed data. There are many things that could be improved or taken further, including using a more complex model, but this should be a good introduction for those that wish to try and apply to their own real-life problems.

I hope you enjoyed reading this article, if you have any questions please feel free to comment below.

Thanks

Sterling

An introduction to Q-Learning: reinforcement learning

freeCodeCamp — Mon, 03 Sep 2018 21:31:39 +0000

By ADL

This article is the second part of my “Deep reinforcement learning” series. The complete series shall be available both on Medium and in videos on my YouTube channel.

In the first part of the series we learnt the basics of reinforcement learning.

Q-learning is a value-based learning algorithm in reinforcement learning. In this article, we learn about Q-Learning and its details:

What is Q-Learning?
Mathematics behind Q-Learning
Implementation using Python

Q-Learning — a simplistic overview

Let’s say that a robot has to cross a maze and reach the end point. There are mines, and the robot can only move one tile at a time. If the robot steps onto a mine, the robot is dead. The robot has to reach the end point in the shortest time possible.

The scoring/reward system is as below:

The robot loses 1 point at each step. This is done so that the robot takes the shortest path and reaches the goal as fast as possible.
If the robot steps on a mine, the point loss is 100 and the game ends.
If the robot gets power ⚡️, it gains 1 point.
If the robot reaches the end goal, the robot gets 100 points.

Now, the obvious question is: How do we train a robot to reach the end goal with the shortest path without stepping on a mine?

So, how do we solve this?

Introducing the Q-Table

Q-Table is just a fancy name for a simple lookup table where we calculate the maximum expected future rewards for each action at each state. Basically, this table will guide us to the best action at each state.

There will be four possible actions at each non-edge tile. When a robot is at a state it can move up, down, right, or left.

So, let’s model this environment in our Q-Table.

In the Q-Table, the columns are the actions and the rows are the states.

Each Q-table score will be the maximum expected future reward that the robot will get if it takes that action at that state. This is an iterative process, as we need to improve the Q-Table at each iteration.

But the questions are:

How do we calculate the values of the Q-table?
Are the values available or predefined?

To learn each value of the Q-table, we use the Q-Learning algorithm.

Mathematics: the Q-Learning algorithm

Q-function

The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a).

Using the above function, we get the values of Q for the cells in the table.

When we start, all the values in the Q-table are zeros.

There is an iterative process of updating the values. As we start to explore the environment, the Q-function gives us better and better approximations by continuously updating the Q-values in the table.

Now, let’s understand how the updating takes place.

Introducing the Q-learning algorithm process

Each of the colored boxes is one step. Let’s understand each of these steps in detail.

Step 1: initialize the Q-Table

We will first build a Q-table. There are n columns, where n= number of actions. There are m rows, where m= number of states. We will initialise the values at 0.

In our robot example, we have four actions (a=4) and five states (s=5). So we will build a table with four columns and five rows.
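A minimal sketch of this initialization in plain Python, using a list of lists so that no extra library is needed. The names `ACTIONS`, `N_STATES` and `q_table` are assumptions made for illustration.

```python
# Four actions (columns) and five states (rows), as in the robot example.
ACTIONS = ["up", "down", "left", "right"]
N_STATES = 5

# One row per state, one column per action, every Q-value starting at 0.
q_table = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

for state, row in enumerate(q_table):
    print(state, row)
```

Here `q_table[s][a]` holds the current estimate for taking action `a` in state `s`.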

Steps 2 and 3: choose and perform an action

This combination of steps is done for an undefined amount of time. This means that this step runs until the time we stop the training, or the training loop stops as defined in the code.

We will choose an action (a) in the state (s) based on the Q-Table. But, as mentioned earlier, when the episode initially starts, every Q-value is 0.

So now the concept of exploration and exploitation trade-off comes into play. This article has more details.

We’ll use something called the epsilon greedy strategy.

In the beginning, the epsilon rate will be high. The robot will explore the environment and randomly choose actions. The logic behind this is that the robot does not know anything about the environment.

As the robot explores the environment, the epsilon rate decreases and the robot starts to exploit the environment.

During the process of exploration, the robot progressively becomes more confident in estimating the Q-values.

For the robot example, there are four actions to choose from: up, down, left, and right. We are starting the training now — our robot knows nothing about the environment. So the robot chooses a random action, say right.
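The epsilon greedy choice can be sketched in Python as follows. The function names `choose_action` and `decay_epsilon`, the decay factor of 0.99 and the floor of 0.01 are illustrative assumptions, not values taken from the article.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Pick an action index for one state using the epsilon greedy strategy.

    With probability epsilon a random action is explored; otherwise the
    action with the highest Q-value in q_row is exploited.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return q_row.index(max(q_row))


def decay_epsilon(epsilon, decay=0.99, minimum=0.01):
    """Shrink epsilon after each episode, but never below the minimum."""
    return max(minimum, epsilon * decay)


# Q-values for one state, in the order up, down, left, right (hypothetical numbers).
q_row = [0.0, 2.0, 1.0, -1.0]
print(choose_action(q_row, epsilon=1.0))  # always explores: any index from 0 to 3
print(choose_action(q_row, epsilon=0.0))  # always exploits: index 1
```

When all Q-values are still 0, the exploit branch simply returns the first action, which is one reason a high initial epsilon matters.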

We can now update the Q-values for being at the start and moving right using the Bellman equation.

Steps 4 and 5: evaluate

Now we have taken an action and observed an outcome and reward. We need to update the function Q(s,a).

In the case of the robot game, to reiterate, the scoring/reward structure is:

power = +1
mine = -100
end = +100

We will repeat this again and again until the learning is stopped. In this way the Q-Table will be updated.
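A hedged sketch of a single update in Python, using the standard Q-learning rule Q(s,a) ← Q(s,a) + alpha × (r + gamma × max Q(s',a') − Q(s,a)). The values alpha = 0.1 and gamma = 0.9, the `done` flag and the state indices are assumptions for illustration; the article does not specify them.

```python
def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.9, done=False):
    """Apply one Q-learning update and return the new Q(state, action)."""
    # No future reward is available once the episode has ended.
    best_next = 0.0 if done else max(q_table[next_state])
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]


# Five states and four actions, all initialised to 0.
q_table = [[0.0] * 4 for _ in range(5)]
RIGHT = 3  # hypothetical column index for the "right" action

# The robot moves right from the start state and pays the step cost of 1 point.
first = q_update(q_table, state=0, action=RIGHT, reward=-1, next_state=1)

# A hypothetical final move that reaches the end goal (+100) and ends the episode.
last = q_update(q_table, state=4, action=RIGHT, reward=100, next_state=4, done=True)
print(first, last)
```

Running this update inside the choose, act, and evaluate loop for many episodes is what gradually fills in the Q-Table.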

Python implementation of Q-Learning

The concept and code implementation are explained in my video.

Subscribe to my YouTube channel for more AI videos: ADL.

At last…let us recap

Q-Learning is a value-based reinforcement learning algorithm which is used to find the optimal action-selection policy using a Q function.
Our goal is to maximize the value function Q.
The Q table helps us to find the best action for each state.
It helps to maximize the expected reward by selecting the best of all possible actions.
Q(state, action) returns the expected future reward of that action at that state.
This function can be estimated using Q-Learning, which iteratively updates Q(s,a) using the Bellman equation.
Initially we explore the environment and update the Q-Table. When the Q-Table is ready, the agent will start to exploit the environment and start taking better actions.

Next time we’ll work on a deep Q-learning example.

Until then, enjoy AI.

Important: As stated earlier, this article is the second part of my “Deep Reinforcement Learning” series. The complete series shall be available both in articles on Medium and in videos on my YouTube channel.

If you have any questions, please let me know in a comment below or on Twitter.

A brief introduction to reinforcement learning

freeCodeCamp — Mon, 27 Aug 2018 21:17:00 +0000

By ADL

Reinforcement learning is an area of machine learning where an agent learns to behave in an environment by performing certain actions and observing the rewards/results which it gets from those actions.

With the advancements in robotic arm manipulation, Google DeepMind’s AlphaGo beating a professional Go player, and recently the OpenAI team beating a professional DOTA player, the field of reinforcement learning has really exploded in recent years.

Examples

In this article, we’ll discuss:

What reinforcement learning is and its nitty-gritty like rewards, tasks, etc
3 categorizations of reinforcement learning

What is Reinforcement Learning?

Let’s start the explanation with an example — say there is a small baby who starts learning how to walk.

Let’s divide this example into two parts:

1. Baby starts walking and successfully reaches the couch

Since the couch is the end goal, the baby and the parents are happy.

So, the baby is happy and receives appreciation from her parents. It’s positive — the baby feels good (Positive Reward +n).

2. Baby starts walking and falls due to some obstacle in between and gets bruised.

Ouch! The baby gets hurt and is in pain. It’s negative — the baby cries (Negative Reward -n).

That’s how we humans learn: by trial and error. Reinforcement learning is conceptually the same, but is a computational approach to learning from actions.

Reinforcement Learning

Let’s suppose that our reinforcement learning agent is learning to play Mario as an example. The reinforcement learning process can be modeled as an iterative loop that works as below:

The RL Agent receives state S⁰ from the environment i.e. Mario
Based on that state S⁰, the RL agent takes an action A⁰, say — our RL agent moves right. Initially, this is random.
Now, the environment is in a new state S¹ (new frame from Mario or the game engine)
Environment gives some reward R¹ to the RL agent. It probably gives a +1 because the agent is not dead yet.

This RL loop continues until we are dead or we reach our destination, and it continuously outputs a sequence of state, action and reward.

The basic aim of our RL agent is to maximize the reward.

Reward Maximization

The RL agent basically works on a hypothesis of reward maximization. That’s why the agent should choose the best possible action in order to maximize the reward.

The cumulative rewards at each time step with the respective action is written as:

However, things don’t work in this way when summing up all the rewards.

Let us understand this, in detail:

Let us say our RL agent (Robotic mouse) is in a maze which contains cheese, electricity shocks, and cats. The goal is to eat the maximum amount of cheese before being eaten by the cat or getting an electricity shock.

It seems obvious to eat the cheese near us rather than the cheese close to the cat or the electricity shock, because the closer we are to the electricity shock or the cat, the danger of being dead increases. As a result, the reward near the cat or the electricity shock, even if it is bigger (more cheese), will be discounted. This is done because of the uncertainty factor.

It makes sense, right?

Discounting of rewards works like this:

We define a discount rate called gamma. It should be between 0 and 1. The larger the gamma, the smaller the discount and vice versa.

So, our cumulative expected (discounted) rewards is:

Cumulative expected rewards
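In the usual notation (an assumption here, since the symbols are not defined in the text), with R the reward received at each time step and gamma the discount rate between 0 and 1, the discounted cumulative reward from time step t can be written as:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},
\qquad 0 \le \gamma \le 1
```

Each extra step into the future multiplies the reward by another factor of gamma, so a gamma close to 1 keeps distant rewards almost intact and a gamma close to 0 makes the agent focus on immediate rewards.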

Tasks and their types in reinforcement learning

A task is a single instance of a reinforcement learning problem. We basically have two types of tasks: continuous and episodic.

Continuous tasks

These are the types of tasks that continue forever. For instance, an RL agent that does automated Forex/stock trading.

_Photo by [Chris Liverani](https://unsplash.com/@chrisliverani?utm_source=medium&utm_medium=referral) on Unsplash_

Episodic task

In this case, we have a starting point and an ending point called the terminal state. This creates an episode: a list of States (S), Actions (A), Rewards (R).

For example, playing a game of Counter-Strike, where we shoot our opponents or we get killed by them. We shoot all of them and complete the episode, or we are killed. So, there are only two cases for completing the episode.

Exploration and exploitation trade off

There is an important concept of the exploration and exploitation trade off in reinforcement learning. Exploration is all about finding more information about an environment, whereas exploitation is exploiting already known information to maximize the rewards.

Real-life example: Say you go to the same restaurant every day. You are basically exploiting. But on the other hand, if you search for a new restaurant every time before going to any one of them, then it’s exploration. Exploration is very important in the search for future rewards which might be higher than the near rewards.

In the above game, our robotic mouse can have a good amount of small cheese (+0.5 each). But at the top of the maze there is a big sum of cheese (+100). So, if we only focus on the nearest reward, our robotic mouse will never reach the big sum of cheese — it will just exploit.

But if the robotic mouse does a little bit of exploration, it can find the big reward i.e. the big cheese.

This is the basic concept of the exploration and exploitation trade-off.

Approaches to Reinforcement Learning

Let us now understand the approaches to solving reinforcement learning problems. Basically there are 3 approaches, but we will only take 2 major approaches in this article:

1. Policy-based approach

In policy-based reinforcement learning, we have a policy which we need to optimize. The policy basically defines how the agent behaves:

We learn a policy function which helps us in mapping each state to the best action.

Getting deep into policies, we further divide policies into two types:

Deterministic: a policy at a given state(s) will always return the same action(a). It means, it is pre-mapped as S=(s) ➡ A=(a).
Stochastic: It gives a distribution of probability over different actions. i.e Stochastic Policy ➡ p( A = a | S = s )
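The difference between the two policy types can be sketched in Python as below. The state names, action names and probabilities are hypothetical and chosen only to illustrate the idea.

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "right", "s1": "up"}

# Stochastic policy: each state maps to a probability distribution over actions,
# i.e. p(A = a | S = s).
stochastic_policy = {
    "s0": {"right": 0.8, "up": 0.2},
    "s1": {"up": 0.5, "left": 0.5},
}


def act_deterministic(state):
    """Always return the same action for a given state."""
    return deterministic_policy[state]


def act_stochastic(state, rng=random):
    """Sample an action according to the state's probability distribution."""
    distribution = stochastic_policy[state]
    actions = list(distribution)
    weights = list(distribution.values())
    return rng.choices(actions, weights=weights, k=1)[0]


print(act_deterministic("s0"))  # right
print(act_stochastic("s0"))     # "right" about 80% of the time, otherwise "up"
```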

2. Value Based

In value-based RL, the goal of the agent is to optimize the value function V(s) which is defined as a function that tells us the maximum expected future reward the agent shall get at each state.

The value of each state is the total amount of the reward an RL agent can expect to collect over the future, from a particular state.

The agent will use the above value function to select which state to choose at each step. The agent will always take the state with the biggest value.

In the below example, we see that at each step, we will take the biggest value to achieve our goal: 1 ➡ 3 ➡ 4 ➡ 6 so on…

Maze
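A small sketch of this greedy walk in Python. The state labels, their values and the `neighbours` map are hypothetical; they are chosen so that the walk reproduces the 1 ➡ 3 ➡ 4 ➡ 6 sequence of values mentioned above.

```python
# Hypothetical state values V(s) and the states reachable from each state.
state_values = {"A": 1, "B": 3, "C": 2, "D": 4, "E": 6}
neighbours = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}


def greedy_path(start):
    """Follow the neighbour with the biggest value until no neighbours remain."""
    path = [start]
    current = start
    while neighbours[current]:
        current = max(neighbours[current], key=state_values.get)
        path.append(current)
    return path


path = greedy_path("A")
print(path)                             # ['A', 'B', 'D', 'E']
print([state_values[s] for s in path])  # [1, 3, 4, 6]
```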

The game of Pong — An Intuitive case study

Let us take a real-life example: playing Pong. This case study will just introduce you to the intuition of how reinforcement learning works. We will not get into details in this example, but in the next article we will certainly dig deeper.

Suppose we teach our RL agent to play the game of Pong.

Basically, we feed the game frames (new states) to the RL algorithm and let the algorithm decide whether to go up or down. This network is said to be a policy network, which we will discuss in our next article.

The method used to train this algorithm is called the policy gradient. We feed random frames from the game engine, and the algorithm produces a random output which gives a reward and this is fed back to the algorithm/network. This is an iterative process.

We will discuss policy gradients in greater detail in the next article.

Environment = Game Engine and Agent = RL Agent

In the context of the game, the scoreboard acts as a reward or feedback to the agent. Whenever the agent scores +1, it understands that the action it took was good enough at that state.

Now we will train the agent to play the Pong game. To start, we will feed a bunch of game frames (states) to the network/algorithm and let the algorithm decide the action. The initial actions of the agent will obviously be bad, but our agent can sometimes be lucky enough to score a point, and this might be a random event. But due to this lucky random event, it receives a reward and this helps the agent to understand that the series of actions was good enough to fetch a reward.

Results during the training

So, in the future, the agent is likely to take the actions which will fetch a reward over an action which will not. Intuitively, the RL agent is learning to play the game.

Source: OLEGIF.com

Limitations

During the training of the agent, when an agent loses an episode, the algorithm will discard or lower the likelihood of taking all the actions that occurred in this episode.

Red demarcation shows all the actions taken in a losing episode

But if the agent was performing well from the start of the episode, but just due to the last 2 actions the agent lost the game, it does not make sense to discard all the actions. Rather it makes sense if we just remove the last 2 actions which resulted in the loss.

Green demarcation shows all the actions which were correct, and red demarcation shows the actions which should be removed.

This is called the Credit Assignment Problem. This problem arises because of a sparse reward setting. That is, instead of getting a reward at every step, we get the reward at the end of the episode. So, it’s on the agent to learn which actions were correct and which actual action led to losing the game.

So, due to this sparse reward setting in RL, the algorithm is very sample-inefficient. This means that a huge number of training examples have to be fed in, in order to train the agent. But the fact is that sparse reward settings fail in many circumstances due to the complexity of the environment.

So, there is something called reward shaping which is used to solve this. But again, reward shaping also suffers from some limitations as we need to design a custom reward function for every game.

Closing Note

Today, reinforcement learning is an exciting field of study. Major developments have been made in the field, of which deep reinforcement learning is one.

We will cover deep reinforcement learning in our upcoming articles. This article covers a lot of concepts. Please take your own time to understand the basic concepts of reinforcement learning.

But I would like to mention that reinforcement learning is not a secret black box. Whatever advancements we are seeing today in the field of reinforcement learning are a result of bright minds working day and night on specific applications.

Next time we’ll work on a Q-learning agent and also cover some more basic stuff in reinforcement learning.

Until then, enjoy AI.

Important: This article is the first part of my Deep Reinforcement Learning series. The complete series shall be available both as articles on Medium and as videos on my YouTube channel.

For deep and more Intuitive understanding of reinforcement learning, I would recommend that you watch the below video:

Subscribe to my YouTube channel for more AI videos: ADL.

If you have any questions, please let me know in a comment below or on Twitter.

Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed…

freeCodeCamp — Fri, 06 Jul 2018 00:10:13 +0000

By Thomas Simonini

This article is part of Deep Reinforcement Learning Course with Tensorflow. Check the syllabus here.

In our last article about Deep Q Learning with Tensorflow, we implemented an agent that learns to play a simple version of Doom. In the video version, we trained a DQN agent that plays Space Invaders.

However, during the training, we saw that there was a lot of variability.

Deep Q-Learning was introduced in 2013. Since then, a lot of improvements have been made. So, today we’ll see four strategies that dramatically improve the training and the results of our DQN agents:

fixed Q-targets
double DQNs
dueling DQN (aka DDQN)
Prioritized Experience Replay (aka PER)

We’ll implement an agent that learns to play Doom Deadly corridor. Our AI must navigate towards the fundamental goal (the vest), and make sure they survive at the same time by killing enemies.

Fixed Q-targets

Theory

We saw in the Deep Q Learning article that, when we want to calculate the TD error (aka the loss), we calculate the difference between the TD target (Q_target) and the current Q value (estimation of Q).

But we don’t have any idea of the real TD target. We need to estimate it. Using the Bellman equation, we saw that the TD target is just the reward of taking that action at that state plus the discounted highest Q value for the next state.

However, the problem is that we are using the same parameters (weights) for estimating the target and the Q value. As a consequence, there is a big correlation between the TD target and the parameters (w) we are changing.

Therefore, it means that at every step of training, our Q values shift but the target value shifts too. So, we’re getting closer to our target but the target is also moving. It’s like chasing a moving target! This leads to a big oscillation in training.

Imagine you are a cowboy (the Q estimation) and you want to catch the cow (the Q-target): you must get closer (reduce the error).

At each time step, you’re trying to approach the cow, which also moves at each time step (because you use the same parameters).

This leads to a very strange path of chasing (a big oscillation in training).

Instead, we can use the idea of fixed Q-targets introduced by DeepMind:

Using a separate network with a fixed parameter (let’s call it w-) for estimating the TD target.
At every Tau step, we copy the parameters from our DQN network to update the target network.

Thanks to this procedure, we’ll have more stable learning because the target function stays fixed for a while.

Implementation

Implementing fixed q-targets is pretty straightforward:

First, we create two networks (DQNetwork, TargetNetwork)
Then, we create a function that will take our DQNetwork parameters and copy them to our TargetNetwork
Finally, during the training, we calculate the TD target using our target network. We update the target network with the DQNetwork every tau steps (tau is a hyperparameter that we define).
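To make the three steps above concrete, here is a minimal sketch in plain Python. It is not the notebook's TensorFlow code: `LinearQ` is a hypothetical stand-in for a real network (Q(s, a) = w[a] * s for a scalar state), and `copy_parameters`, `td_target` and `train` are illustrative names. The point is only to show where the frozen parameters w- are used and when they are refreshed.

```python
import copy


class LinearQ:
    """Hypothetical stand-in for a Q-network: Q(s, a) = w[a] * s."""

    def __init__(self, weights):
        self.weights = list(weights)

    def q_values(self, state):
        return [w * state for w in self.weights]


def copy_parameters(source, target):
    """Copy the online network's parameters into the target network."""
    target.weights = copy.deepcopy(source.weights)


def td_target(reward, next_state, done, target_net, gamma=0.95):
    """TD target computed with the frozen target network (parameters w-)."""
    if done:
        return reward
    return reward + gamma * max(target_net.q_values(next_state))


def train(transitions, tau=3, lr=0.1, gamma=0.95):
    dq_network = LinearQ([0.0, 0.0])
    target_network = LinearQ([0.0, 0.0])
    copy_parameters(dq_network, target_network)
    for step, (s, a, r, s_next, done) in enumerate(transitions, start=1):
        target = td_target(r, s_next, done, target_network, gamma)
        td_error = target - dq_network.q_values(s)[a]
        # Gradient step for the linear model: dQ/dw[a] = s
        dq_network.weights[a] += lr * td_error * s
        # Every tau steps, refresh the target network
        if step % tau == 0:
            copy_parameters(dq_network, target_network)
    return dq_network, target_network
```

Between two copies, the target stays the same even though the online weights keep changing, which is exactly the "cow stops moving for a while" effect.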

Double DQNs

Theory

Double DQNs, or double Learning, was introduced by Hado van Hasselt. This method handles the problem of the overestimation of Q-values.

To understand this problem, remember how we calculate the TD Target:

$$Q_{target} = R(s, a) + \gamma \max_{a'} Q(s', a')$$

By calculating the TD target, we face a simple problem: how are we sure that the best action for the next state is the action with the highest Q-value?

We know that the accuracy of q values depends on what action we tried and what neighboring states we explored.

As a consequence, at the beginning of the training we don’t have enough information about the best action to take. Therefore, taking the maximum q value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly given a higher Q value than the optimal best action, the learning will be complicated.

The solution is: when we compute the Q target, we use two networks to decouple the action selection from the target Q value generation. We:

use our DQN network to select what is the best action to take for the next state (the action with the highest Q value).
use our target network to calculate the target Q value of taking that action at the next state.

Therefore, Double DQN helps us reduce the overestimation of q values and, as a consequence, helps us train faster and have more stable learning.

Implementation
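As a rough sketch of the idea from the theory part (not the notebook's code), the function below takes the Q values of the next state as plain lists. The names `double_dqn_target` and `vanilla_target` are made up for illustration, and the two lists are assumed to come from the DQN network and the target network respectively.

```python
def vanilla_target(reward, done, q_next_target, gamma=0.95):
    """Classic target: the same estimates select AND evaluate the action."""
    if done:
        return reward
    return reward + gamma * max(q_next_target)


def double_dqn_target(reward, done, q_next_online, q_next_target, gamma=0.95):
    """Double DQN target: decouple action selection from evaluation."""
    if done:
        return reward
    # 1. The DQN (online) network selects the best action for the next state
    best_action = max(range(len(q_next_online)), key=lambda a: q_next_online[a])
    # 2. The target network evaluates that action
    return reward + gamma * q_next_target[best_action]
```

If the target network happens to overestimate an action that the online network does not consider the best, the Double DQN target ignores that inflated value, whereas the classic max would pick it up.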

Dueling DQN (aka DDQN)

Theory

Remember that Q-values correspond to how good it is to be at a given state and take a given action at that state: Q(s,a).

So we can decompose Q(s,a) as the sum of:

V(s): the value of being at that state
A(s,a): the advantage of taking that action at that state (how much better it is to take this action versus all other possible actions at that state).

With DDQN, we want to separate the estimator of these two elements, using two new streams:

one that estimates the state value V(s)
one that estimates the advantage for each action A(s,a)

And then we combine these two streams through a special aggregation layer to get an estimate of Q(s,a).

Wait, but why do we need to calculate these two elements separately if we then combine them?

By decoupling the estimation, intuitively our DDQN can learn which states are (or are not) valuable without having to learn the effect of each action at each state (since it’s also calculating V(s)).

With our normal DQN, we need to calculate the value of each action at that state. But what’s the point if the value of the state is bad? What’s the point of calculating all actions at one state when all these actions lead to death?

As a consequence, by decoupling we’re able to calculate V(s). This is particularly useful for states where their actions do not affect the environment in a relevant way. In this case, it’s unnecessary to calculate the value of each action. For instance, moving right or left only matters if there is a risk of collision. And, in most states, the choice of the action has no effect on what happens.

It will be clearer if we take the example in the paper Dueling Network Architectures for Deep Reinforcement Learning.

We see that the value stream pays attention (the orange blur) to the road, and in particular to the horizon where the cars are spawned. It also pays attention to the score.

On the other hand, the advantage stream in the first frame on the right does not pay much attention to the road, because there are no cars in front (so the action choice is practically irrelevant). But, in the second frame it pays attention, as there is a car immediately in front of it, and making a choice of action is crucial and very relevant.

Concerning the aggregation layer, we want to generate the q values for each action at that state. We might be tempted to combine the streams as follows:

$$Q(s, a) = V(s) + A(s, a)$$

But if we do that, we’ll run into the issue of identifiability: given Q(s,a), we’re unable to recover A(s,a) and V(s) uniquely.

And not being able to find V(s) and A(s,a) given Q(s,a) will be a problem for our back propagation. To avoid this problem, we can force our advantage function estimator to have 0 advantage at the chosen action.

To do that, we subtract the average advantage of all possible actions at that state:

$$Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right)$$
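Here is a small sketch of that aggregation step on plain numbers. In a real network the two inputs would be the outputs of the value and advantage streams; the function name `dueling_q_values` is only illustrative.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) by subtracting the mean advantage."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (a - mean_advantage) for a in advantages]


q = dueling_q_values(2.0, [1.0, 2.0, 3.0])
# Shifting every advantage by the same constant leaves Q unchanged,
# which is what removes the ambiguity between V(s) and A(s, a).
q_shifted = dueling_q_values(2.0, [11.0, 12.0, 13.0])
```

A nice side effect of this choice is that the average of the resulting Q values equals V(s).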

Therefore, this architecture helps us accelerate the training. We can calculate the value of a state without calculating the Q(s,a) for each action at that state. And it can help us find much more reliable Q values for each action by decoupling the estimation between two streams.

Implementation

The only thing to do is to modify the DQN architecture by adding these new streams:

Prioritized Experience Replay

Theory

Prioritized Experience Replay (PER) was introduced in 2015 by Tom Schaul. The idea is that some experiences may be more important than others for our training, but might occur less frequently.

Because we sample the batch uniformly (selecting the experiences randomly) these rich experiences that occur rarely have practically no chance to be selected.

That’s why, with PER, we try to change the sampling distribution by using a criterion to define the priority of each tuple of experience.

We want to prioritize experiences where there is a big difference between our prediction and the TD target, since it means that we have a lot to learn from them.

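We use the magnitude (absolute value) of our TD error as the priority:

$$p_t = |\delta_t| + e$$

Here e is a small positive constant that ensures no experience has zero probability of being selected.

And we store that priority alongside each experience in the replay buffer.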

But we can’t just do greedy prioritization, because it will lead to always training on the same experiences (those with a big priority), and thus over-fitting.

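So we introduce stochastic prioritization, which generates the probability of being chosen for a replay:

$$P(i) = \frac{p_i^a}{\sum_k p_k^a}$$

Here a is a hyperparameter that controls how much prioritization is used: with a = 0 we are back to uniform sampling, and with a = 1 we sample purely in proportion to the priorities.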

As a consequence, during each time step, we will get a batch of samples with this probability distribution and train our network on it.
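The sketch below shows one way to turn TD errors into priorities and sampling probabilities. It assumes the proportional variant of PER, with priority |TD error| + epsilon and exponent a; the default values for `epsilon` and `a` are arbitrary illustrative choices, and the function names are hypothetical.

```python
import random


def priorities_from_td_errors(td_errors, epsilon=0.01):
    """Priority = magnitude of the TD error, plus a small constant."""
    return [abs(delta) + epsilon for delta in td_errors]


def sampling_probabilities(priorities, a=0.6):
    """P(i) = p_i^a / sum_k p_k^a. With a = 0, sampling is uniform."""
    scaled = [p ** a for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_batch_indices(probabilities, batch_size, rng=random):
    """Draw experience indices according to the priority distribution."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=batch_size)


priorities = priorities_from_td_errors([0.1, -2.0, 0.5])
probabilities = sampling_probabilities(priorities)
batch = sample_batch_indices(probabilities, batch_size=2)
```

The experience with the largest TD error (the second one here) is the most likely to be drawn, but the others still have a non-zero chance.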

But, we still have a problem here. Remember that with normal Experience Replay, we use a stochastic update rule. As a consequence, the way we sample the experiences must match the underlying distribution they came from.

With normal experience replay, we sample our experiences from a uniform distribution. Simply put, we select our experiences randomly. There is no bias, because each experience has the same chance of being selected, so we can update our weights normally.

But, because we use priority sampling, purely random sampling is abandoned. As a consequence, we introduce bias toward high-priority samples (more chances to be selected).

And, if we update our weights normally, we risk over-fitting. Samples that have high priority are likely to be used for training many times in comparison with low priority experiences (= bias). As a consequence, we’ll update our weights with only a small portion of experiences that we consider to be really interesting.

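To correct this bias, we use importance sampling weights (IS) that will adjust the updating by reducing the weights of the often seen samples:

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^b$$

Here N is the size of the replay buffer and P(i) is the sampling probability of experience i.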

The weights corresponding to high-priority samples have very little adjustment (because the network will see these experiences many times), whereas those corresponding to low-priority samples will have a full update.

The role of b is to control how much these importance sampling weights affect learning. In practice, the b parameter is annealed up to 1 over the duration of training, because these weights are more important at the end of learning, when our q values begin to converge. The unbiased nature of updates is most important near convergence, as explained in this article.
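As a sketch, the importance sampling weights and the annealing of b could look like this. The normalization by the largest weight and the linear schedule starting at 0.4 are assumptions made for the example (a common choice for stability), not something the text above prescribes.

```python
def importance_sampling_weights(probabilities, b):
    """w_i = (1 / (N * P(i))) ** b, scaled so the largest weight is 1."""
    n = len(probabilities)
    weights = [(1.0 / (n * p)) ** b for p in probabilities]
    max_weight = max(weights)
    return [w / max_weight for w in weights]


def annealed_b(step, total_steps, b_start=0.4):
    """Linearly increase b from b_start to 1 over the course of training."""
    fraction = min(1.0, step / total_steps)
    return b_start + fraction * (1.0 - b_start)


# The frequently sampled experience (P = 0.5) gets a smaller update weight
weights = importance_sampling_weights([0.5, 0.25, 0.25], b=1.0)
```

With b = 0 every weight is 1 (no correction at all); with b = 1 the correction fully compensates for the non-uniform sampling.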

Implementation

This time, the implementation will be a little bit fancier.

First of all, we can’t just implement PER by sorting the whole experience replay buffer according to its priorities. This will not be efficient at all due to O(n log n) for insertion and O(n) for sampling.

As explained in this really good article, we need to use another data structure instead of sorting an array — an unsorted sumtree.

A sumtree is a binary tree, that is, a tree where each node has at most two children. The leaves (deepest nodes) contain the priority values, and each parent node stores the sum of its children, so the root holds the total priority. A separate data array, aligned with the leaves, contains the experiences.

Updating the tree and sampling will be really efficient (O(log n)).

Then, we create a memory object that will contain our sumtree and data.

Next, to sample a minibatch of size k, the range [0, total_priority] will be divided into k ranges. A value is uniformly sampled from each range.

Finally, the transitions (experiences) that correspond to each of these sampled values are retrieved from the sumtree.
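To give a feel for the data structure before opening the notebook, here is a compact sketch of a sumtree stored in a flat list, with the stratified sampling described above. It is a simplified illustration, not the notebook's implementation; the class and method names are my own.

```python
import random


class SumTree:
    """Binary tree in a flat list: leaves hold priorities, parents hold sums."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)  # internal nodes, then leaves
        self.data = [None] * capacity           # experiences, aligned with leaves
        self.write = 0                          # next data slot to overwrite

    def total_priority(self):
        return self.tree[0]

    def add(self, priority, experience):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = experience
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        """Set a leaf's priority and propagate the change up to the root."""
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, value):
        """Walk down from the root to the leaf whose range contains value."""
        node = 0
        while 2 * node + 1 < len(self.tree):
            left, right = 2 * node + 1, 2 * node + 2
            if value <= self.tree[left]:
                node = left
            else:
                value -= self.tree[left]
                node = right
        return node, self.tree[node], self.data[node - self.capacity + 1]


def sample(tree, k, rng=random):
    """Split [0, total_priority] into k ranges and draw one value from each."""
    segment = tree.total_priority() / k
    return [tree.get(rng.uniform(segment * i, segment * (i + 1))) for i in range(k)]
```

Both `update` and `get` only touch one node per level of the tree, which is where the O(log n) cost comes from.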

It will be much clearer when we dive into the complete details in the notebook.

Doom Deathmatch agent

This agent is a Dueling Double Deep Q-Learning agent with PER and fixed Q-targets.

We made a video tutorial of the implementation:

The notebook is here

That’s all! You’ve just created a smarter agent that learns to play Doom. Awesome! Remember that if you want to have an agent with really good performance, you need many more GPU hours (about two days of training)!

However, with only 2–3 hours of training on CPU (yes CPU), our agent understood that they needed to kill enemies before being able to move forward. If they move forward without killing enemies, they will be killed before getting the vest.

Don’t forget to implement each part of the code by yourself. It’s really important to try to modify the code I gave you. Try to add epochs, change the architecture, add fixed Q-values, change the learning rate, use a harder environment…and so on. Experiment, have fun!

Remember that this was a big article, so be sure to really understand why we use these new strategies, how they work, and the advantages of using them.

In the next article, we’ll learn about an awesome hybrid method between value-based and policy-based reinforcement learning algorithms. This is a baseline for state-of-the-art algorithms: Advantage Actor Critic (A2C). You’ll implement an agent that learns to play Outrun!

If you liked my article, please click the 👏 below as many times as you liked the article so other people will see this here on Medium. And don’t forget to follow me!

If you have any thoughts, comments, questions, feel free to comment below or send me an email: hello@simoninithomas.com, or tweet me @ThomasSimonini.

Keep learning, stay awesome!

Deep Reinforcement Learning Course with Tensorflow

Syllabus

Video version

Part 1: An introduction to Reinforcement Learning

Part 2: Diving deeper into Reinforcement Learning with Q-Learning

Part 3: An introduction to Deep Q-Learning: let’s play Doom

Part 3+: Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets

Part 4: An introduction to Policy Gradients with Doom and Cartpole

Part 5: An intro to Advantage Actor Critic methods: let’s play Sonic the Hedgehog!

Part 6: Proximal Policy Optimization (PPO) with Sonic the Hedgehog 2 and 3

Part 7: Curiosity-Driven Learning made easy Part I

In need of evolution: game theory and AI

freeCodeCamp — Sat, 12 May 2018 21:07:00 +0000

By Elena Nisioti

Artificial Intelligence (AI) is full of questions that cannot be answered and answers that cannot be assigned to the correct questions. In the past, it paid for its persistence in wrong practices with periods of stagnation, known as AI winters. The calendar of AI, however, has just reached spring, and the applications are flourishing.

Yet, there is a branch of AI that has long been neglected. The talk is about reinforcement learning, which has recently exhibited impressive results in games like Go (with AlphaGo) and Atari. But let’s be honest: these were not reinforcement learning wins. What got deeper in these cases was the deep neural networks, and not our understanding of reinforcement learning, which maintains the depth it achieved decades ago.

Even worse is the case of reinforcement learning when applied to real life problems. If training a robot to balance on a rope sounds hard, try training a team of robots to win a football game, or a team of drones to monitor a moving target.

Before we lose the branch, or even worse the tree, we must sharpen our understanding of these applications. Game theory is the most common approach to studying teams of players that share a common goal. It can lend us tools to guide learning algorithms in these settings.

But let’s see why the common approach is not a common sense approach.

To kill an error is as good a service as, and sometimes even better than, the establishing of a new truth or fact. — Charles Darwin

First, let’s dirty our hands with some terminology and basics of these areas.

Game theory

Some useful terms

Game: like games in popular understanding, it can be any setting where players take actions and its outcome will depend on them.
Player: a strategic decision-maker within a game.
Strategy: a complete plan of actions a player will take, given the set of circumstances that might arise within the game.
Payoff: the gain a player receives from arriving at a particular outcome of a game.
Equilibrium: the point in a game where both players have made their decisions and an outcome is reached.
Nash equilibrium: an equilibrium in which no player can gain by changing their own strategy if the strategies of the other players remain unchanged.
Dominant strategy: occurs when one strategy is better than another strategy for one player, no matter how that player’s opponents may play.

Prisoner’s dilemma

This is probably the most famous game in the literature. The figure below presents its payoff matrix. Now, a payoff matrix is worth a thousand words. It is sufficient, to an experienced eye, to provide all the information necessary to describe a game. But let’s be a bit less laconic.

Prisoner’s dilemma payoff matrix

The police arrest two criminals, criminal A and criminal B. Although quite notorious, the criminals cannot be imprisoned for the crime under investigation due to lack of evidence. But they can be held for lesser charges.

The length of their imprisonment will depend on what they will say in the interrogation room, which gives rise to the game. Each criminal (player) is given the chance to either stay silent or snitch on the other criminal (player). The payoff matrix depicts how many years each player will be imprisoned depending on the outcome. For example, if player A stays silent and player B snitches on them, player A will serve 3 years (-3) and player B will serve none (0).

If you review the payoff matrix carefully, you will find out that the logical action of a player is to betray the other person or, in game-theoretic terms, betraying is the dominant strategy. This will lead to the Nash equilibrium of the game, where each player has a payoff of -2.

Does something feel odd? Yes, or at least it should. If both players somehow agreed to remain silent they would both get a higher reward of -1. Prisoner’s dilemma is an example of a game where rationality leads to a worse result than cooperation would.
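The reasoning above can be checked mechanically. The sketch below encodes the payoffs mentioned in the text (-1 each if both stay silent, -2 each if both snitch, -3 and 0 if only one snitches) and assumes the game is symmetric, so the mirrored case gives 0 and -3. The function names are illustrative.

```python
from itertools import product

ACTIONS = ("silent", "snitch")

# PAYOFFS[(action_A, action_B)] = (payoff_A, payoff_B), in years of prison (negative)
PAYOFFS = {
    ("silent", "silent"): (-1, -1),
    ("silent", "snitch"): (-3, 0),
    ("snitch", "silent"): (0, -3),  # assumed by symmetry
    ("snitch", "snitch"): (-2, -2),
}


def is_nash_equilibrium(a, b):
    """No player can gain by changing only their own action."""
    payoff_a, payoff_b = PAYOFFS[(a, b)]
    a_cannot_improve = all(PAYOFFS[(alt, b)][0] <= payoff_a for alt in ACTIONS)
    b_cannot_improve = all(PAYOFFS[(a, alt)][1] <= payoff_b for alt in ACTIONS)
    return a_cannot_improve and b_cannot_improve


def dominant_strategy_for_a():
    """Return A's action that beats every alternative whatever B plays, if any."""
    for candidate in ACTIONS:
        others = [x for x in ACTIONS if x != candidate]
        if all(PAYOFFS[(candidate, b)][0] > PAYOFFS[(other, b)][0]
               for other in others for b in ACTIONS):
            return candidate
    return None


nash_equilibria = [pair for pair in product(ACTIONS, ACTIONS)
                   if is_nash_equilibrium(*pair)]
```

Running the search finds a single equilibrium, mutual betrayal, even though mutual silence would leave both players better off.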

Some historical remarks

Game theory originated in economics, but is today an interdisciplinary area of study. Its father, John von Neumann (you will notice that Johns have serious career prospects in this area), was the first to give a strict formulation to the common notion of a game. He restricted his studies to games of two players, as they were easier to analyze.

He then co-authored a book with Oskar Morgenstern, which laid the foundations for expected utility theory and shaped the course of game theory. Around that time, John Nash introduced the concept of Nash equilibria, which helps describe the outcome of a game.

Reinforcement learning

It did not take long to realize how vast the applications of game theory can be. From games to biology, philosophy and, wait for it, artificial intelligence. Game theory is nowadays closely related to settings where multiple players learn through reinforcement, an area called multi-agent reinforcement learning. Examples of applications in this case are teams of robots, where each player has to learn how to behave in favor of its team.

Some useful terms

Agent: equivalent to a player.
Reward: equivalent to a payoff.
State: all the information necessary to describe the situation an agent is in.
Action: equivalent of a move in a game.
Policy: similar to a strategy, it defines the action an agent will take when in a particular state.
Environment: everything the agent interacts with during learning.

Applications

Imagine the following scenario: a team of drones is unleashed into a forest in order to predict and locate fires early enough for the firefighters to respond. The drones are autonomous and must explore the forest, learn which conditions are likely to cause fire, and cooperate with each other, so that they cover wide areas of the forest using little battery and communication.

This application belongs to the area of environmental monitoring, where AI can lend its predictive skills to human intervention. In a technological world that is becoming increasingly complex and a physical world under threat, we can paraphrase Kipling’s quote to “Man could not be everywhere, and therefore he made drones.”

Decentralized architectures are another interesting application field. Technologies like the Internet of Things and Blockchain create immense networks. Information and processing are distributed across different physical entities, a trait that has been acknowledged to offer privacy, efficiency and democratization.

Regardless of whether you want to use sensors to minimize energy consumption in the households of a country, or replace the banking system, decentralized is the new sexy.

Making these networks smart, however, is challenging, as most of the AI algorithms we are proud of are data- and computation-hungry. Reinforcement learning algorithms can be employed for efficient data processing and rendering the network adaptive to changes in its environment. In this case, it is interesting, and to the benefit of overall efficiency, to study how the individual algorithms will cooperate.

Deep or collective learning? AI research has based its harvest on increasingly deeper networks, but it could be that the answers to challenging problems come from collective knowledge, not deep-rooted individuals. Did we miss the forest?

Not just a game

Translating AI problems to simple games like the prisoner’s dilemma is tempting. This is a usual practice when testing new techniques, as it offers a computationally cheap and intuitive testbed. Nevertheless, it is important not to ignore the effect that the practical characteristics of the problem, such as noise, delays, and finite memory, have on the algorithm.

Perhaps the most misleading assumption in AI research is that of representing interaction with iterated static games. For example, an algorithm can apply the prisoner’s dilemma game every time it wants to make a decision, a formulation that assumes that the agent has not learned, or changed, along the way. But what about the effect learning will have on the behavior of the agent? Won’t interaction with others affect its strategy?

Research in this area has focused on the evolution of cooperation, and Robert Axelrod has studied optimal strategies that arise in the iterated version of prisoner’s dilemma. The tournaments that Axelrod organized revealed that strategies that adapt with time and interaction, even as simple as Tit-for-Tat may sound, are very effective. The AI community has recently investigated learning under the sequential prisoner’s dilemma, but research in this area is still in a premature state.
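To see what an adaptive strategy looks like in the iterated game, here is a minimal sketch. It reuses the prisoner's dilemma payoffs from earlier in this article (the mirrored entry is assumed by symmetry), and implements Tit-for-Tat in its usual form: cooperate first, then copy the opponent's previous move. This is a toy match, not a reproduction of Axelrod's tournaments.

```python
PAYOFFS = {
    ("silent", "silent"): (-1, -1),
    ("silent", "snitch"): (-3, 0),
    ("snitch", "silent"): (0, -3),  # assumed by symmetry
    ("snitch", "snitch"): (-2, -2),
}


def tit_for_tat(own_history, opponent_history):
    """Stay silent first, then repeat whatever the opponent did last round."""
    return "silent" if not opponent_history else opponent_history[-1]


def always_snitch(own_history, opponent_history):
    """A static strategy: play the one-shot dominant action every round."""
    return "snitch"


def play(strategy_a, strategy_b, rounds=10):
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_a, history_b)
        move_b = strategy_b(history_b, history_a)
        payoff_a, payoff_b = PAYOFFS[(move_a, move_b)]
        score_a += payoff_a
        score_b += payoff_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b
```

Two Tit-for-Tat players settle into mutual silence, while against a constant snitch Tit-for-Tat is exploited only in the first round and then defends itself.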

What differentiates multi-agent from single-agent learning is the increased complexity. Training one deep neural network is already enough of a pain, while adding new networks, as parts of the agents, makes the problem exponentially harder.

One less obvious, but more important concern, is the lack of theoretical properties for this kind of problem. Single-agent reinforcement learning is a well-understood area, as Richard Bellman and Christopher Watkins have offered the algorithms and proofs necessary to learn. In the multi-agent case, however, the proofs lose their validity.

Just to illustrate some of the mind-puzzling difficulties that arise: an agent executes a learning algorithm to learn how to react optimally to its environment. In our case, the environment includes the other agents, which also execute the learning algorithm. Thus, the algorithm has to consider the effect of its action before it acts.

The early concerns

The concerns start where game theory started: in economics. Let’s begin with some assumptions made when studying a system under classical game theory.

Rationality: generally in game theory, and in order to derive Nash equilibria, perfect rationality is assumed. This roughly means that agents always act for their own sake.

Complete information: each agent knows everything about the game, including the rules, what the other players know, and what their strategies are.

Common knowledge: there is common knowledge of a fact p in a group of agents when: all the agents know p, they all know that all agents know p, they all know that they all know that all agents know p, and so on ad infinitum. There are interesting puzzles, like the blue-eyed islanders, that describe the effect common knowledge has on a problem.

In 1986, Kenneth Arrow expressed his reservations towards classical game theory.

In this paper, I want to disentangle some of the senses in which the hypothesis of rationality is used in economic theory. In particular, I want to stress that rationality is not a property of the individual alone, although it is usually presented that way. Rather, it gathers not only its force but also its very meaning from the social context in which it is embedded. It is most plausible under very ideal conditions. When these conditions cease to hold, the rationality assumptions become strained and possibly even self-contradictory.

If you find that Arrow is a bit harsh with classical game theory, how rational would you say your last purchases have been? Or, how much consciousness and effort did you put into your meal today?

But Arrow is not so much worried about the assumption of rationality. He is worried about the implications of it. For an agent to be rational, you need to provide them with all the information necessary to make their decisions. This calls for omniscient players, which is bad in two ways: first, it creates impractical requirements for information storing and processing of players. Second, game theory is no longer a game theory, as you can replace all players by a central ruler (and where is the fun in that?).

The value of information in this view is another point of interest. We have already discussed that possessing all the information is infeasible. But what about assuming players with limited knowledge? Would that help?

You may ask anyone involved in this area, but it suffices to say that optimization under uncertainty is tough. Yes, there still are the good-old Nash equilibria. The problem is that there can be infinitely many of them. Game theory does not provide you with arguments to evaluate them. So, even if you reach one, you shouldn't make such a big deal of it.

Reinforcement learning concerns

By this point you should suspect that AI applications are much more complicated than the examples classical game theory concerns itself with. Just to mention a few obstacles on the path of applying the Nash equilibrium approach in a robotic application: imagine being the captain of a team of robots playing football in RoboCup. How fast, strong, and intelligent are your players and your opponents? What strategies does the opponent team use? How should you reward your players? Is a goal the only reason for congratulating, or will applauding a good pass also improve the team’s behavior? Clearly, just being familiar with the rules of football will not win you the game.

If game theory has been raising debates for decades, if it has been founded on unrealistic assumptions and, for realistic tasks, if it offers complicated and little-understood solutions, why are we still going for it? Well, plainly enough, it’s the only thing we’ve got when it comes to group reasoning. If we actually understood how groups interact and cooperate to achieve their goals, psychology and politics would be much clearer.

Researchers in the area of multi-agent reinforcement learning either completely omit a discussion on the theoretical properties of their algorithms (and nevertheless often exhibit good results) or traditionally study the existence of Nash equilibria. The latter approach seems, to the eyes of a young researcher in the field, like a struggle to prove, under severe, unrealistic assumptions, the theoretical existence of solutions that, being infinite and of questionable value, will never be leveraged in practice.

Evolutionary game theory

The inception of evolutionary game theory is not recent, yet its far-reaching applications in the area of AI took long to be acknowledged. Originating in biology, it was introduced in 1973, by John Maynard Smith and George R. Price, as an alternative to classical game theory. The alterations are so profound that we can talk about a whole new approach.

The subject of reasoning is no longer the player itself, but the population of players. Thus, probabilistic strategies are defined as the percentage of players that make a choice, not the probability of one player choosing an action as in classical game theory. This removes the necessity for rational, omniscient agents, as strategies evolve as patterns of behavior. The evolution process resembles Darwinian theory. Players reproduce following the principles of survival of the fittest and random mutations, and can be elegantly described by a set of differential equations, termed the replicator dynamics.

We can see the three important parts of this system in the illustration below. A population represents the team of agents, and is characterized by a mixture of strategies. The game rules determine the payoffs of the population, which can also be seen as the fitness values of an evolutionary algorithm. Finally, the replicator rules describe how the population will evolve based on the fitness values and the mathematical properties of the evolution process.

Image credit: HowieKor, CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0), from Wikimedia Commons

The notion and pursuit of Nash equilibria is replaced by evolutionarily stable strategies. A strategy can bear this characterization if it is immune to an invasion by a population of agents that follow another strategy, provided that the invading population is small. Thus, the behavior of the team can be studied under the well-understood area of stability of dynamical systems, such as Lyapunov stability.

The attainment of equilibrium requires a disequilibrium process. What does rational behavior mean in the presence of disequilibrium? Do individuals speculate on the equilibrating process? If they do, can the disequilibrium be regarded as, in some sense, a higher-order equilibrium process?

In the above passage, Arrow seems to be struggling to pinpoint the dynamic properties of a game. Could evolutionary game theory be an answer to his questions?

Quite recently, famous reinforcement learning algorithms, such as Q-learning, were studied under this new approach and significant conclusions were drawn. How this new tool is used ultimately depends on the application.

We can follow the forward approach, to derive the dynamic model of a learning algorithm. Or the inverse, where we start from some desired dynamic properties and engineer a learning algorithm that exhibits them.

We can use the replicator dynamics descriptively, to visualize convergence. Or prescriptively, to tune the algorithm in order to converge to optimal solutions. The latter can immensely reduce the complexity entailed in training deep networks for tough tasks that we face today, by removing the need for blind tuning.

Conclusion

It’s not hard to trace when and why the paths of game theory and AI became convoluted. What’s harder, however, is to overlook the restrictions AI, and in particular multi-agent reinforcement learning, has to face when following classical game theoretic approaches.

Evolutionary game theory sounds promising, offering both theoretical tools and practical advantages, but we won’t really know until we try it. In this case, evolution will not arise naturally, but out of a conscious struggle of the research community for improvement. But isn’t that the essence of evolution?

It takes some effort to deviate from where inertia is pushing you, but reinforcement learning, despite general successes in AI, is in serious need of a lift.

An introduction to Reinforcement Learning

freeCodeCamp — Sat, 31 Mar 2018 06:16:59 +0000

By Thomas Simonini

Reinforcement learning is an important type of Machine Learning where an agent learns how to behave in an environment by performing actions and seeing the results.

In recent years, we’ve seen a lot of improvements in this fascinating area of research. Examples include DeepMind and the Deep Q learning architecture in 2014, beating the champion of the game of Go with AlphaGo in 2016, OpenAI and the PPO in 2017, amongst others.

In this series of articles, we will focus on learning the different architectures used today to solve Reinforcement Learning problems. These will include Q-learning, Deep Q-learning, Policy Gradients, Actor Critic, and PPO.

In this first article, you’ll learn:

What Reinforcement Learning is, and how rewards are the central idea
The three approaches of Reinforcement Learning
What the “Deep” in Deep Reinforcement Learning means

It’s really important to master these elements before diving into implementing Deep Reinforcement Learning agents.

The idea behind Reinforcement Learning is that an agent will learn from the environment by interacting with it and receiving rewards for performing actions.

Learning from interaction with the environment comes from our natural experiences. Imagine you’re a child in a living room. You see a fireplace, and you approach it.

It’s warm, it’s positive, you feel good (Positive Reward +1). You understand that fire is a positive thing.

But then you try to touch the fire. Ouch! It burns your hand (Negative reward -1). You’ve just understood that fire is positive when you are a sufficient distance away, because it produces warmth. But get too close to it and you will be burned.

That’s how humans learn, through interaction. Reinforcement Learning is just a computational approach to learning from actions.

The Reinforcement Learning Process

Let’s imagine an agent learning to play Super Mario Bros as a working example. The Reinforcement Learning (RL) process can be modeled as a loop that works like this:

Our Agent receives state S0 from the Environment (In our case we receive the first frame of our game (state) from Super Mario Bros (environment))
Based on that state S0, agent takes an action A0 (our agent will move right)
Environment transitions to a new state S1 (new frame)
Environment gives some reward R1 to the agent (not dead: +1)

This RL loop outputs a sequence of state, action and reward.
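The loop above can be sketched in a few lines of Python. Super Mario Bros is replaced here by a hypothetical toy environment, `CorridorEnv`, where the agent walks along a corridor and is rewarded when it reaches the end; the class, its `reset`/`step` methods and the reward values are assumptions made for the example.

```python
class CorridorEnv:
    """Hypothetical toy environment: reach the right end of a corridor."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action: 0 = move left, 1 = move right
        move = 1 if action == 1 else -1
        self.position = max(0, min(self.length, self.position + move))
        done = self.position == self.length
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    trajectory = []
    state = env.reset()                              # agent receives S0
    for _ in range(max_steps):
        action = policy(state)                       # agent takes an action
        next_state, reward, done = env.step(action)  # new state and reward
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


# A trivial policy that always moves right
episode = run_episode(CorridorEnv(length=3), policy=lambda state: 1)
```

The returned list is exactly the sequence of state, action and reward that the loop produces, one tuple per time step.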

The goal of the agent is to maximize the expected cumulative reward.

The central idea of the Reward Hypothesis

Why is the goal of the agent to maximize the expected cumulative reward?

Well, Reinforcement Learning is based on the idea of the reward hypothesis. All goals can be described by the maximization of the expected cumulative reward.

That’s why in Reinforcement Learning, to have the best behavior, we need to maximize the expected cumulative reward.

The cumulative reward at each time step t can be written as:

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots$$

Which is equivalent to:

$$G_t = \sum_{k=0}^{T} R_{t+k+1}$$

Thanks to [Pierre-Luc Bacon](https://twitter.com/pierrelux) for the correction.

However, in reality, we can’t just add the rewards like that. The rewards that come sooner (at the beginning of the game) are more likely to happen, since they are more predictable than the long-term future rewards.

Let’s say your agent is a small mouse and your opponent is a cat. Your goal is to eat the maximum amount of cheese before being eaten by the cat.

In this maze, we are more likely to eat the cheese near us than the cheese close to the cat (the closer we are to the cat, the more dangerous it is).

As a consequence, the reward near the cat, even if it is bigger (more cheese), will be discounted. We’re not really sure we’ll be able to eat it.

To discount the rewards, we proceed like this:

We define a discount rate called gamma. It must be between 0 and 1.

The larger the gamma, the smaller the discount. This means the learning agent cares more about the long term reward.
On the other hand, the smaller the gamma, the bigger the discount. This means our agent cares more about the short term reward (the nearest cheese).

Our discounted cumulative expected reward is:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$

Thanks to [Pierre-Luc Bacon](https://twitter.com/pierrelux) for the correction.

Simply put, each reward is discounted by gamma raised to the power of the time step. As the time step increases, the cat gets closer to us, so the future reward becomes less and less likely.
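
To make the effect of gamma concrete, here is a minimal sketch of the discounted sum. The function name `discounted_return` and the reward list are assumptions chosen for illustration; the list stands for the rewards collected after time step t, with a large reward arriving last.

```python
def discounted_return(rewards, gamma):
    """Compute sum over k of gamma**k * rewards[k].

    rewards[0] plays the role of R_{t+1}, rewards[1] of R_{t+2}, and so on.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be between 0 and 1")
    return sum((gamma ** k) * reward for k, reward in enumerate(rewards))


# Three small pieces of cheese nearby, then a big one far away (hypothetical values)
rewards = [1, 1, 1, 10]

far_sighted = discounted_return(rewards, gamma=0.9)    # large gamma, small discount
short_sighted = discounted_return(rewards, gamma=0.1)  # small gamma, big discount
```

With the large gamma the distant reward of 10 still dominates the total, while with the small gamma it barely counts, which matches the description of long-term versus short-term focus.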

Episodic or Continuing tasks

A task is an instance of a Reinforcement Learning problem. We can have two types of tasks: episodic and continuing.

Episodic task

In this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and New States.

For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario and ends when you’re killed or when you reach the end of the level.

Beginning of a new episode

Continuing tasks

These are tasks that continue forever (no terminal state). In this case, the agent has to learn how to choose the best actions while simultaneously interacting with the environment.

For instance, consider an agent that does automated stock trading. For this task, there is no starting point and no terminal state. The agent keeps running until we decide to stop it.

Monte Carlo vs TD Learning methods

We have two ways of learning:

Collecting the rewards at the end of the episode and then calculating the maximum expected future reward: the Monte Carlo approach
Estimating the rewards at each step: Temporal Difference Learning

Monte Carlo

When the episode ends (the agent reaches a “terminal state”), the agent looks at the total cumulative reward to see how well it did. In the Monte Carlo approach, rewards are only received at the end of the game.

Then, we start a new game with the added knowledge. The agent makes better decisions with each iteration.

Let’s take an example:

If we take the maze environment:

We always start at the same starting point.
We terminate the episode if the cat eats us or if we move more than 20 steps.
At the end of the episode, we have a list of States, Actions, Rewards, and New States.
The agent will sum the total rewards Gt (to see how well it did).
It will then update V(St) with the Monte Carlo update rule V(St) ← V(St) + α[Gt − V(St)], where α is the learning rate.
Then start a new game with this new knowledge.

By running more and more episodes, the agent will learn to play better and better.
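
The Monte Carlo steps above can be sketched as follows. This is a minimal every-visit variant under assumptions made here: the episode is stored as (state, reward) pairs, state names like "A" are placeholders, and `monte_carlo_update` is a hypothetical helper name.

```python
from collections import defaultdict


def monte_carlo_update(values, episode, alpha=0.1, gamma=1.0):
    """Apply V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)) once the episode is over.

    `episode` is a list of (state, reward) pairs, where `reward` is the
    reward received after leaving `state`.
    """
    g = 0.0
    # Walk backwards so the return builds up as G_t = R_{t+1} + gamma * G_{t+1}
    for state, reward in reversed(episode):
        g = reward + gamma * g
        values[state] += alpha * (g - values[state])
    return values


values = defaultdict(float)  # every state starts with an estimated value of 0
episode = [("A", 0), ("B", 0), ("C", 1)]  # only the last step is rewarded
monte_carlo_update(values, episode, alpha=0.5, gamma=1.0)
```

Note that nothing is updated until the whole episode has been collected, which is the defining trait of the Monte Carlo approach described here.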

Temporal Difference Learning: learning at each time step

TD Learning, on the other hand, will not wait until the end of the episode to update the maximum expected future reward estimation: it will update its value estimation V for the non-terminal states St occurring at that experience.

This method is called TD(0) or one step TD (update the value function after any individual step).

TD methods only wait until the next time step to update the value estimates. At time t+1 they immediately form a TD target using the observed reward Rt+1 and the current estimate V(St+1).

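The TD target is an estimate: you update the previous estimate V(St) by moving it toward the one-step target Rt+1 + γV(St+1). Written out, the update is V(St) ← V(St) + α[Rt+1 + γV(St+1) − V(St)], where α is the learning rate.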
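
For comparison with Monte Carlo, a single TD(0) update might look like the sketch below. The function name `td0_update`, the dictionary of values, and the state names are illustrative assumptions, not code from the course.

```python
def td0_update(values, state, reward, next_state, done, alpha=0.1, gamma=0.9):
    """One-step TD update: move V(state) toward reward + gamma * V(next_state)."""
    # A terminal state has no future, so its value contributes nothing
    next_value = 0.0 if done else values.get(next_state, 0.0)
    td_target = reward + gamma * next_value
    td_error = td_target - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error


values = {"A": 0.0, "B": 1.0}
# The agent moved from A to B and received no reward on the way
error = td0_update(values, "A", reward=0.0, next_state="B", done=False,
                   alpha=0.5, gamma=0.9)
```

The update happens right after one transition, using the current estimate of the next state instead of waiting for the full return.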

Exploration/Exploitation trade off

Before looking at the different strategies to solve Reinforcement Learning problems, we must cover one more very important topic: the exploration/exploitation trade-off.

Exploration is finding more information about the environment.
Exploitation is exploiting known information to maximize the reward.

Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, we can fall into a common trap.

In this game, our mouse can have an infinite amount of small cheese (+1 each). But at the top of the maze there is a gigantic sum of cheese (+1000).

However, if we only focus on the rewards we already know about, our agent will never reach the gigantic sum of cheese. Instead, it will only exploit the nearest source of rewards, even if this source is small (exploitation).

But if our agent does a little bit of exploration, it can find the big reward.

This is what we call the exploration/exploitation trade off. We must define a rule that helps to handle this trade-off. We’ll see in future articles different ways to handle it.
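
One widely used rule of this kind is epsilon-greedy: with probability epsilon the agent explores at random, otherwise it exploits its best-known action. The article does not prescribe this rule, so treat the sketch below as an assumption; the action names and values are made up for the mouse example.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(list(action_values))        # exploration
    return max(action_values, key=action_values.get)  # exploitation


# Hypothetical estimates: the agent only knows about the small cheese so far
action_values = {"small_cheese": 1.0, "explore_maze": 0.0}

rng = random.Random(0)  # seeded generator so runs are repeatable
choices = [epsilon_greedy(action_values, epsilon=0.2, rng=rng) for _ in range(1000)]
```

With epsilon set to 0 the agent would pick the small cheese every time and never discover the big reward; a small positive epsilon leaves room for occasional exploration.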

Three approaches to Reinforcement Learning

Now that we’ve defined the main elements of Reinforcement Learning, let’s move on to the three approaches to solving a Reinforcement Learning problem. These are value-based, policy-based, and model-based.

Value Based

In value-based RL, the goal is to optimize the value function V(s).

The value function is a function that tells us the maximum expected future reward the agent will get at each state.

The value of each state is the total amount of the reward an agent can expect to accumulate over the future, starting at that state.

The agent will use this value function to select which state to choose at each step. The agent takes the state with the biggest value.

In the maze example, at each step we will take the biggest value: -7, then -6, then -5 (and so on) to attain the goal.
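
A minimal sketch of this "follow the biggest value" behavior is shown below. The state names, the value table, and the `neighbors` map are hypothetical stand-ins for the maze, and the values are assumed to be already learned.

```python
def greedy_path(state_values, neighbors, start, goal, max_steps=50):
    """From each state, step to the neighboring state with the highest value."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if state == goal:
            break
        state = max(neighbors[state], key=state_values.get)
        path.append(state)
    return path


# Hypothetical learned values: states closer to the goal have higher values
state_values = {"s0": -7, "s1": -6, "s2": -5, "dead_end": -8, "goal": 0}
neighbors = {
    "s0": ["s1", "dead_end"],
    "s1": ["s0", "s2"],
    "s2": ["s1", "goal"],
    "dead_end": ["s0"],
    "goal": [],
}

path = greedy_path(state_values, neighbors, start="s0", goal="goal")
```

The agent never consults an explicit policy here: its behavior falls out of comparing the values of the states it can reach.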

Policy Based

In policy-based RL, we want to directly optimize the policy function π(s) without using a value function.

The policy is what defines the agent’s behavior at a given time.

action = policy(state)

We learn a policy function. This lets us map each state to the best corresponding action.

We have two types of policy:

Deterministic: a policy that, at a given state, will always return the same action.
Stochastic: a policy that outputs a probability distribution over actions.

As we can see here, the policy directly indicates the best action to take at each step.
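
The difference between the two types of policy can be illustrated with a small sketch. The states, actions, and probabilities below are invented for the example, and the lookup tables stand in for whatever a learned policy would compute.

```python
import random


def deterministic_policy(state):
    """Always returns the same action for a given state."""
    table = {"wall_ahead": "turn", "clear": "forward"}
    return table[state]


def stochastic_policy(state):
    """Returns a probability distribution over actions for a given state."""
    table = {
        "wall_ahead": {"turn": 0.9, "forward": 0.1},
        "clear": {"turn": 0.2, "forward": 0.8},
    }
    return table[state]


def sample_action(distribution, rng=random):
    """Draw one action according to the probabilities in `distribution`."""
    actions = list(distribution)
    weights = [distribution[action] for action in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


fixed_action = deterministic_policy("clear")
sampled_action = sample_action(stochastic_policy("clear"), rng=random.Random(1))
```

Calling `deterministic_policy("clear")` gives the same answer every time, while repeated calls to `sample_action` can return different actions in proportion to their probabilities.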

Model Based

In model-based RL, we model the environment. This means we create a model of the behavior of the environment.

The problem is each environment will need a different model representation. That’s why we will not speak about this type of Reinforcement Learning in the upcoming articles.

Introducing Deep Reinforcement Learning

Deep Reinforcement Learning introduces deep neural networks to solve Reinforcement Learning problems — hence the name “deep.”

For instance, in the next article we’ll work on Q-Learning (classic Reinforcement Learning) and Deep Q-Learning.

You’ll see the difference is that in the first approach, we use a traditional algorithm to create a Q table that helps us find what action to take for each state.

In the second approach, we will use a neural network to approximate the Q value of each action, given the state.

Schema inspired by the Q learning notebook by Udacity

Congrats! There was a lot of information in this article. Be sure to really grasp the material before continuing. It’s important to master these elements before entering the fun part: creating AI that plays video games.

Important: this article is the first part of a free series of blog posts about Deep Reinforcement Learning. For more information and more resources, check out the syllabus.

Next time we’ll work on a Q-learning agent that learns to play the Frozen Lake game.

FrozenLake


If you have any thoughts, comments, questions, feel free to comment below or send me an email: hello@simoninithomas.com, or tweet me @ThomasSimonini.

Cheers!

Deep Reinforcement Learning Course:

We’re making a video version of the Deep Reinforcement Learning Course with TensorFlow, where we focus on the implementation part with TensorFlow.

Part 1: An introduction to Reinforcement Learning

Part 2: Diving deeper into Reinforcement Learning with Q-Learning

Part 3: An introduction to Deep Q-Learning: let’s play Doom

Part 3+: Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets

Part 4: An introduction to Policy Gradients with Doom and Cartpole

Part 5: An intro to Advantage Actor Critic methods: let’s play Sonic the Hedgehog!

Part 6: Proximal Policy Optimization (PPO) with Sonic the Hedgehog 2 and 3

Part 7: Curiosity-Driven Learning made easy Part I

Reinforcement Learning - freeCodeCamp.org

How to Build an Adaptive Tic-Tac-Toe AI with Reinforcement Learning in JavaScript

What You’ll Learn

Prerequisites

Table of Contents

Why Use Reinforcement Learning for Game AI?

How to Understand Q-Learning: The Foundation

Core Components

The Q-Learning Update Rule

How Exploration vs Exploitation Works

Project Architecture Overview

How to Build the HTML Interface with Tailwind CSS

How to Implement the Q-Learning Algorithm

The QLearning Class: The AI's Brain

1. Constructor and Q-Table Management

2. Choosing an Action (The Epsilon-Greedy Strategy)

3. The Learning Rule

4. Minimax for Expert Mode

5. Helper and Persistence Methods

The TicTacToe Class: Managing the Game

1. Constructor and Control Initialization

2. Difficulty and UI Methods

3. Drawing and Rendering

4. Player Interaction and the Game Loop

5. The Rules Engine

6. UI and Statistics Updates

7. Game and AI Management

8. The Self-Play Training Loop

9. State Persistence

10. Initializing the Game

How to Understand the Enhanced Features

1. Adaptive Difficulty Levels

2. Other Enhanced Features

Putting It All Together: A Guided Test Run

Step 1: Challenge the Untrained AI

Step 2: Train the AI

Step 3: Challenge the Trained AI

Step 4: Experiment with the Controls

Verifying the Implementation with Automated Tests

Advanced Optimizations and Extensions

How to Implement Symmetry Reduction

How to Add Export and Import Functionality

How to Add Q-Value Heatmap Visualization

Common Pitfalls and Solutions

Issue 1: AI Does Not Improve

Issue 2: AI Makes Random Moves

Issue 3: Slow Performance

Issue 4: AI Overfits to One Strategy

How to Extend This to Other Games

Conclusion

Next Steps

Resources for Further Learning

Use Gymnasium for Reinforcement Learning

Course Contents

Train an AI to Play a Snake Game Using Python

Intro to Advanced Actor-Critic Methods: Reinforcement Learning Course

How I planned my meals with Reinforcement Learning on a budget

Aim

Method

Sample Data

Applying the Model in Theory

Model Learning

Small Demo of Updating Values

Action Selection

Building and Applying our Model

We now run our model with some sample variables:

Optimal actions of final episode

So why is this happening and why did the model suggest the actions it did?

Effect of Changing Parameters and How to Change Model’s Aim

Varying Budget

Varying Alpha

A good explanation of what is going on with our output due to alpha is described by stack overflow user VishalTheBeast:

Varying Epsilon

Increasing the Number of Episodes

Changing our Model’s Aim to Find the Cheapest Combination of Products

Introducing Preferences

Introducing Preferences using Rewards

Conclusion

How to use AI to play Sonic the Hedgehog. It’s NEAT!

A NEAT Neural Network (Python Implementation)
