Build your own reinforcement learning agent that plays Super Mario

AI plays Mario using Deep Q-Learning RL Algorithm

Photo by Cláudio Luiz Castro on Unsplash

Who doesn’t love the Super Mario game?
I mean, everyone loves this game, right? Even people who have never played it have probably heard of it.

Today we will be building a reinforcement learning agent that will learn to play this game.


Before starting, you should be familiar with the Python programming language and know at least how ML algorithms work. Reinforcement learning is a subset of machine learning, which I call “actual machine learning”, because every definition of ML fits reinforcement learning. All those pictures you see of a machine being taught to do some task? That’s reinforcement learning.

All that to say, I am not going to say much more about RL in this story, since the goal is to show you how to build your own agent. You can learn the basics of RL elsewhere (there are tons of resources). Here I am going to focus on Q-learning in Mario’s environment.

Environment Setup

The first thing you need is a Super Mario environment. We are going to use the gym-super-mario-bros Gym environment, which is super cool and super duper easy to use.

Install it on your local machine:

pip install gym-super-mario-bros

Now that you have an environment, the next step is to install the other requirements and create the file where we’re going to store our code.

Since we are building a Deep Q-learning agent, we are going to use TensorFlow to build the model. And because we are working with a Gym environment, we need OpenAI Gym as well (you can find all the requirements in our GitHub repo).

Building a model

Let’s import a few things and set up our DQNAgent model.

from collections import deque
from tensorflow.keras.models import Sequential, load_model, save_model
from tensorflow.keras.layers import Dense, Activation, Flatten, Conv2D
from tensorflow.keras.optimizers import Adam
import numpy as np
import random

As discussed above, we are going to use TensorFlow to create our deep learning model. Let’s create our Agent class.

class DQNAgent:
    def __init__(self, state_size, action_size):
        # Create variables for our agent
        self.state_space = state_size
        self.action_space = action_size
        self.memory = deque(maxlen=5000)
        self.gamma = 0.8
        self.chosenAction = 0

        # Exploration vs exploitation
        self.epsilon = 0.1
        self.max_epsilon = 1
        self.min_epsilon = 0.01
        self.decay_epsilon = 0.0001

        # Building Neural Networks for Agent
        self.main_network = self.build_network()
        self.target_network = self.build_network()

Here we have a few instance variables that we are going to use throughout our DQNAgent class:

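state_space is the shape of one observation (the game screen), which the network receives as its input_shape
action_space is the number of discrete actions the agent can take, with one network output per action
memory is our replay memory, a deque that keeps the 5000 most recent transitions
gamma is the discount factor, which weighs future rewards against immediate ones
chosenAction is our currently chosen action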
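The memory deque works as a replay buffer. Below is a minimal sketch of how it behaves, assuming transitions are stored as (state, action, reward, next_state, done) tuples, the order in which the train method unpacks them. The remember function is a hypothetical helper name, and plain integers stand in for screen frames.

```python
import random
from collections import deque

# Same capacity as self.memory in __init__
memory = deque(maxlen=5000)

def remember(memory, state, action, reward, next_state, done):
    """Hypothetical helper: store one transition; the oldest is dropped when full."""
    memory.append((state, action, reward, next_state, done))

# Fill the buffer with dummy transitions (integers stand in for screen frames)
for step in range(6000):
    remember(memory, step, step % 7, 1.0, step + 1, False)

print(len(memory))      # 5000: the first 1000 transitions were evicted
print(memory[0][0])     # 1000: the oldest surviving state

# Sampling a minibatch, as the train method does
minibatch = random.sample(memory, 32)
print(len(minibatch))   # 32
```

Because maxlen is set, appending to a full deque silently drops the oldest transition, so the agent always learns from its 5000 most recent experiences.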

The remaining variables (epsilon, max_epsilon, min_epsilon and decay_epsilon) are the hyperparameters for the exploration vs exploitation trade-off. You can experiment with those values on your own time.
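The exact epsilon update lives in the full code, so the following is only a sketch under one assumption: an exponential decay from max_epsilon toward min_epsilon, driven by decay_epsilon and using the same constants as __init__. The function name decayed_epsilon is hypothetical.

```python
import math

# Constants copied from __init__
MAX_EPSILON = 1.0
MIN_EPSILON = 0.01
DECAY_EPSILON = 0.0001

def decayed_epsilon(step):
    """Assumed schedule: decay exponentially from MAX_EPSILON toward MIN_EPSILON."""
    return MIN_EPSILON + (MAX_EPSILON - MIN_EPSILON) * math.exp(-DECAY_EPSILON * step)

for step in (0, 10_000, 50_000, 100_000):
    print(step, round(decayed_epsilon(step), 3))
```

Under this schedule, a larger decay_epsilon shortens the exploration phase, while a smaller one keeps Mario exploring for longer before he mostly exploits what he has learned.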

Finally, we create our two network objects: the main network and the target network, both built by the same build_network method.
Now let’s build our methods.

In our build_network method, we build the layers of our neural network.

def build_network(self):
    model = Sequential()
    model.add(Conv2D(64, (4, 4), strides=4, padding="same", input_shape=self.state_space))
    model.add(Activation("relu"))

    model.add(Conv2D(64, (4, 4), strides=2, padding="same"))
    model.add(Activation("relu"))

    model.add(Conv2D(64, (3, 3), strides=1, padding="same"))
    model.add(Flatten())

    # Fully connected layers, ending in one linear Q-value output per action
    model.add(Dense(512, activation="relu"))
    model.add(Dense(256, activation="relu"))
    model.add(Dense(self.action_space, activation="linear"))

    model.compile(loss="mse", optimizer=Adam())
    return model
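With padding="same", each Conv2D layer's output height and width equal the input size divided by the stride, rounded up. The arithmetic below traces the three conv layers, assuming the input is a raw 240x256 RGB frame (the native NES resolution). If you preprocess frames, plug in your own shape.

```python
import math

def same_conv_out(size, stride):
    """Spatial size after a Conv2D with padding='same': ceil(size / stride)."""
    return math.ceil(size / stride)

# Assumed input: a raw, unprocessed 240x256 RGB NES frame
height, width, filters = 240, 256, 64
for stride in (4, 2, 1):  # strides of the three Conv2D layers in build_network
    height, width = same_conv_out(height, stride), same_conv_out(width, stride)
    print(f"after stride {stride}: {height}x{width}x{filters}")

flat = height * width * filters
dense_params = flat * 512 + 512  # weights + biases of Dense(512)
print("flattened features:", flat)            # 61440
print("Dense(512) parameters:", dense_params)  # 31457792
```

The Flatten output feeds straight into Dense(512), so for full-size frames that single layer holds over 31 million parameters. Shrinking the input shape is the most direct way to make the network smaller and faster to train.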

If you have built a neural network with TensorFlow before, this should look familiar. If not, there are plenty of other tutorials on that, since we are not going deep into it here.

Next, let’s build our train method.

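def train(self, batch_size):
    # Sample a random minibatch of transitions from replay memory
    minibatch = random.sample(self.memory, batch_size)

    # Get variables from the batch so we can compute the Q-value targets
    for state, action, reward, next_state, done in minibatch:
        target = self.main_network.predict(state, verbose=0)
        if done:
            # Episode ended: there is no future reward to bootstrap from
            target[0][action] = reward
        else:
            # Q-learning update using the target network's estimate of next_state
            target[0][action] = reward + self.gamma * np.amax(
                self.target_network.predict(next_state, verbose=0))
        self.main_network.fit(state, target, epochs=1, verbose=0)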

Okay, so in this method, what we are doing is:

  1. sample a random minibatch of transitions from memory
  2. use the main network to predict the Q-value of every action in the current state
  3. check whether the episode ended with this transition (done)
  4. if it did, the target for the chosen action is simply the reward
  5. if not, we apply the Q-learning update: the target is the reward plus gamma times the highest Q-value predicted for the next state
  6. finally, we fit the main network on the updated targets

This is the main thing to understand. Beyond it, there are a few helper methods: one that acts on the current prediction while balancing exploration and exploitation, and one that updates the weights of our target network. You will find the full code in my GitHub repo.
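A common way to balance exploration and exploitation when acting is an epsilon-greedy policy. Here is a minimal sketch, assuming q_values is the main network's prediction for a single state (for example self.main_network.predict(state, verbose=0)[0]). The name choose_action is hypothetical, not taken from the original code.

```python
import random
import numpy as np

def choose_action(q_values, epsilon, rng=random):
    """Hypothetical epsilon-greedy helper: explore with probability epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: pick a random action
    return int(np.argmax(q_values))          # exploit: pick the best-rated action

q_values = np.array([0.1, 0.7, 0.2])
print(choose_action(q_values, epsilon=0.0))  # always greedy here: 1
```

For the target network update, Keras models expose get_weights() and set_weights(), so copying the main network into the target network can be as simple as self.target_network.set_weights(self.main_network.get_weights()), called every fixed number of steps. How often to sync is another hyperparameter to tune.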


Here is a demo of how this all looks:

Super Mario Demo

Here is a GitHub link for this project. Thanks, everyone!

Subscribe to CodeCraft by Anyesh | Programming, ML & AI Tutorials
