In a previous post we built a framework for running learning agents against PyGame. Now we’ll try to build something in it that can learn to play Pong.
We will be aided in this quest by two trusty friends: TensorFlow, Google’s recently released numerical computation library, and this paper on reinforcement learning for Atari games by DeepMind. I’m going to assume some knowledge of TensorFlow here; if you don’t know much about it, it’s quite similar to Theano, and here is a good starting point for learning.
- You will need Python 2 or 3 installed.
- You will need to install PyGame, which can be obtained here.
- You will need to install TensorFlow, which can be grabbed here.
- You will need PyGamePlayer, which can be cloned from GitHub here.
If you want to skip to the end, the completed Deep Q agent is here in the PyGamePlayer project. The rest of this post will deal with why it works and how to build it.
If you read the DeepMind paper you will find this definition of the Q function:
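For reference, this is the standard Bellman form of the optimal action-value function, transcribed into LaTeX as best I can. The symbols s, a, r, γ, s’ and a’ are the ones discussed in the text; the expectation over the environment is written here as a subscript on E, which is a notational assumption.

```latex
Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}} \left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]
```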
Let’s try to understand it bit by bit. Imagine an agent trying to find its way out of a maze. In each step he knows his current location, s in the equation above, and can take an action, a, moving one square in any direction unless blocked by a wall. If he gets to the exit he gets a reward and is moved to a new random square in the maze. The reward is represented by the r in the equation. The task Q-Learning aims to solve is learning the most efficient way to navigate the maze to get the greatest reward.
(Image: bunny agent attempts to locate carrot reward)
If the agent were to start by moving around the maze randomly he would eventually hit the reward, which would let him know its location. He could then easily learn the rule that if your state is the reward square then you get a reward. He can also learn that if he is in any square adjacent to the reward square and takes the action of moving towards it, he will get the reward. Thus he knows exactly the reward associated with those actions and can prioritize them over other actions.
But if he just chooses the action with the biggest immediate reward, the agent won’t get far, as for most squares the reward is zero. This is where the max Q*(s’, a’) bit of the equation comes in. We judge an action not just on the reward we get for the state it puts us in, but also on the best reward we could get from the best (max) actions available to us in that state. The gamma symbol γ is a constant between 0 and 1 that acts as a discount on rewards in the future. So an action that gets the reward now is judged better than an action that gives the same reward 2 turns from now.
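To make the discount concrete, here is a tiny sketch of how γ shrinks a reward the further away it is. The value 0.9 for γ and the function name are assumptions chosen for illustration only.

```python
GAMMA = 0.9  # assumed discount factor; any value strictly between 0 and 1 works


def discounted_value(reward, steps_away, gamma=GAMMA):
    """Value today of a reward received `steps_away` turns in the future."""
    return reward * gamma ** steps_away


now = discounted_value(1.0, 0)     # reward received immediately
later = discounted_value(1.0, 2)   # same reward, 2 turns from now
```

With these numbers the immediate reward keeps its full value while the delayed one is multiplied by γ twice, so the agent prefers the shorter route.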
The function Q* represents the abstract notion of the ideal Q function. In most complex cases it will be impossible to calculate it exactly, so we use a function approximator Q(s, a; θ). When a machine learning paper references a function approximator it is (almost always) talking about a neural net. These nets in Q-Learning are often referred to as Q-nets. The θ symbol in the Q function represents the parameters (weights and biases) of our net. In order to train our net we will need a loss function, which is defined as:
y here is the expected reward of the state using the parameters of our Q from iteration i-1. Here is an example of running a Q-function in TensorFlow. In this example we use the simplest state space possible: just an array of states with a reward for each, where the agent’s actions are moving to adjacent states:
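For reference, the usual squared-error form of this loss, together with its target y, can be written in LaTeX as follows. This is a transcription under the assumption that the sampling distribution over states and actions is written ρ; the other symbols match the Q function above.

```latex
L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)} \left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right],
\qquad
y_i = \mathbb{E}_{s'} \left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right]
```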
Python Q-learning with tensor flow
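As a rough stand-in for that example, here is a minimal sketch of the same idea using plain NumPy rather than TensorFlow, so it runs without any TensorFlow version concerns. The reward values, the discount factor, the learning rate and all names are assumptions for illustration. With one-hot states a single linear layer is equivalent to a table of weights, which is what `q_values` holds.

```python
import numpy as np

# Hypothetical toy environment: a row of states, each with its own reward.
STATE_REWARDS = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
NUM_STATES = len(STATE_REWARDS)
NUM_ACTIONS = 2        # 0 = move left, 1 = move right
GAMMA = 0.5            # assumed discount factor
LEARNING_RATE = 0.1    # assumed step size


def next_state(state, action):
    """Move one step left or right, staying put at the ends of the row."""
    step = -1 if action == 0 else 1
    return min(max(state + step, 0), NUM_STATES - 1)


def train(iterations=500):
    # q_values[s, a] approximates Q(s, a; theta).
    q_values = np.zeros((NUM_STATES, NUM_ACTIONS))
    for _ in range(iterations):
        previous = q_values.copy()  # parameters from iteration i-1
        for state in range(NUM_STATES):
            for action in range(NUM_ACTIONS):
                landed = next_state(state, action)
                # y = r + gamma * max_a' Q(s', a'; theta_{i-1})
                target = STATE_REWARDS[landed] + GAMMA * previous[landed].max()
                # Gradient step (up to a constant factor) on (y - Q(s, a))^2.
                error = target - q_values[state, action]
                q_values[state, action] += LEARNING_RATE * error
    return q_values


q_values = train()
policy = q_values.argmax(axis=1)  # best action in each state
```

After training, the greedy policy moves right from the states left of the rewarding state and left from the states to its right, which is the behaviour we want from the maze agent.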
Setting up the agent in PyGamePlayer
Create a new file in your current workspace, which should have the PyGamePlayer project in it (or simply create a new file in the examples directory in PyGamePlayer). Then create a new class that inherits from the PongPlayer class. This will handle getting the environment feedback for us. It gives a reward whenever the player’s score increases and a punishment whenever the opponent’s score increases. We will also add a main here to run it.
If you run this you will see the player moving to the bottom of the screen as the Pong AI mercilessly destroys him. More intelligence is needed, so we will override the get_keys_pressed method to actually do some real work. Also, as a first step, because the Pong screen is quite big and I’m guessing none of us have a supercomputer, let’s compress the screen image so it’s not quite so tough on our GPU.
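One possible way to do the compression, sketched with NumPy only. The 640 x 480 screen size, the nearest-neighbour sampling and the function name are all assumptions; an image library would do the resize more smoothly, and the real agent may well use one.

```python
import numpy as np

TARGET_SIZE = 80  # side length of the compressed frame


def compress_screen(screen_array, size=TARGET_SIZE):
    """Shrink an RGB screen of shape (width, height, 3) to size x size black/white."""
    # Collapse the colour channels into one grey value per pixel.
    grey = screen_array.mean(axis=2)
    # Nearest-neighbour resize: pick `size` evenly spaced rows and columns.
    rows = np.arange(size) * grey.shape[0] // size
    cols = np.arange(size) * grey.shape[1] // size
    small = grey[rows][:, cols]
    # Threshold so every pixel is either 0 (black) or 1 (white).
    return (small > 0).astype(np.uint8)


# Stand-in for a captured frame: a black screen with one white "ball".
fake_screen = np.zeros((640, 480, 3), dtype=np.uint8)
fake_screen[320:336, 240:256] = 255
compressed = compress_screen(fake_screen)
```

The ball survives the compression as a small cluster of white pixels, which is all the network needs to track it.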
How do we apply Q-Learning to Pong?
Q-Learning makes plenty of sense in a maze scenario, but how do we apply it to Pong? The Q-function actions are simply the key press options: up, down, or no key pressed. The state could be the screen, but the problem with this is that even after compression our state is still huge. Also, Pong is a dynamic game: you can’t just look at a static frame and know what’s going on, most importantly what direction the ball is moving in.
We will want our input to be not just the current frame, but the last few frames, say 4. 80 times 80 pixels is 6,400 per frame; times 4 frames, that’s 25,600 data points, and each can be in 2 states (black or white), meaning there are 2 to the power of 25,600 possible screen states. Slightly too many for any computer to reasonably deal with.
This is where the deep bit of deep Q comes in. We will use deep convolutional nets (for a good write-up of these try here) to compress that huge screen space into a smaller space of just 512 floats, and then learn our Q-function from that output.
So first let’s create our convolutional network with TensorFlow:
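Here is a small sketch of how the last 4 frames could be kept together as one state, and a check of the arithmetic above. The `FrameStack` class and its names are hypothetical; repeating the first frame to fill the stack is also an assumption about how to handle the start of a game.

```python
from collections import deque

import numpy as np

FRAME_SIZE = 80
STATE_FRAMES = 4


class FrameStack:
    """Keeps the most recent STATE_FRAMES frames as a single state array."""

    def __init__(self):
        self._frames = deque(maxlen=STATE_FRAMES)

    def push(self, frame):
        if not self._frames:
            # First frame of a game: repeat it so the state is full from the start.
            self._frames.extend([frame] * STATE_FRAMES)
        else:
            # The deque drops the oldest frame automatically.
            self._frames.append(frame)
        # Shape (80, 80, 4): frames stacked along the last axis, oldest first.
        return np.stack(list(self._frames), axis=2)


stack = FrameStack()
for step in range(6):
    # Alternate all-black and all-white frames so the ordering is easy to see.
    frame = np.full((FRAME_SIZE, FRAME_SIZE), step % 2, dtype=np.uint8)
    state = stack.push(frame)

data_points = state.size  # 80 * 80 * 4
```

Because the state holds consecutive frames, the difference between them carries the ball’s direction, which a single frame cannot.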
Now we will use the exact same technique we used for the simple Q-Learning example above, but this time the state will be a collection of the last 4 frames of the game and there will be 3 possible actions.
This is how you train the network:
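The core of the training step is computing the target y for each stored transition and the squared error against the network’s current output. The sketch below shows just that arithmetic in NumPy, not the TensorFlow training code itself. The discount of 0.99, the mini-batch values, the function names and the handling of terminal transitions (y = r when the game has ended) are assumptions.

```python
import numpy as np

GAMMA = 0.99  # assumed discount factor


def q_targets(rewards, next_q_values, terminals, gamma=GAMMA):
    """y = r for terminal transitions, else y = r + gamma * max_a' Q(s', a')."""
    best_next = next_q_values.max(axis=1)
    return rewards + gamma * best_next * (1.0 - terminals)


def squared_loss(q_values, actions_one_hot, targets):
    """Mean of (y - Q(s, a))^2, using only the Q-value of the action taken."""
    taken = (q_values * actions_one_hot).sum(axis=1)
    return np.mean((targets - taken) ** 2)


# A made-up mini-batch of three transitions with 3 actions each.
rewards = np.array([0.0, 1.0, -1.0])
terminals = np.array([0.0, 0.0, 1.0])
next_q = np.array([[0.1, 0.5, 0.2],
                   [0.0, 0.0, 0.0],
                   [0.3, 0.9, 0.4]])
current_q = np.array([[0.2, 0.4, 0.1],
                      [0.6, 0.1, 0.0],
                      [0.0, 0.0, -0.5]])
actions = np.array([[0, 1, 0],
                    [1, 0, 0],
                    [0, 0, 1]])

targets = q_targets(rewards, next_q, terminals)
loss = squared_loss(current_q, actions, targets)
```

In the real agent `next_q` and `current_q` would come from running the network on the stored states, and an optimizer would minimise `loss` with respect to the network weights.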
And getting the chosen action looks like this:
So get_keys_pressed needs to be changed to store these observations:
The normal training time for something like this, even with a good GPU, is on the order of days. But even if you were to train the current agent for days it would still perform pretty poorly. The reason is that if we start using the Q-function to determine our actions, it will initially be exploring the space with very poor weights. It is very likely that it will find some simple action that leads to a small improvement and get stuck in a local minimum doing that.
What we want is to delay using our weights until the agent has a good understanding of the space in which it exists. A good way to initially explore the space is to move randomly, then over time slowly add in more and more moves chosen by the agent until eventually the agent is in full control.
Add this to the get_keys_pressed method:
And then make the choose_next_action method this:
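The two pieces described above, lowering the chance of a random move over time and choosing between a random and a learned action, can be sketched as standalone functions. The function names, the constants and the linear annealing schedule are assumptions for illustration, not the exact methods of the PyGamePlayer agent.

```python
import random

import numpy as np

NUM_ACTIONS = 3             # up, down, no key pressed
INITIAL_RANDOM_PROB = 1.0   # assumed: start with purely random moves
FINAL_RANDOM_PROB = 0.0     # assumed: eventually the agent is in full control
EXPLORE_STEPS = 10000       # assumed: number of steps to anneal over


def anneal(probability):
    """Lower the chance of a random move by one step, down to the final value."""
    step = (INITIAL_RANDOM_PROB - FINAL_RANDOM_PROB) / EXPLORE_STEPS
    return max(FINAL_RANDOM_PROB, probability - step)


def choose_action(q_values, probability_of_random_action, rng=random):
    """Return a one-hot action: random with the given probability, else greedy."""
    action = np.zeros(NUM_ACTIONS)
    if rng.random() < probability_of_random_action:
        index = rng.randrange(NUM_ACTIONS)
    else:
        index = int(np.argmax(q_values))
    action[index] = 1
    return action


# Anneal from fully random to fully greedy.
probability = INITIAL_RANDOM_PROB
for _ in range(EXPLORE_STEPS + 1):
    probability = anneal(probability)

example_q = np.array([0.1, 0.7, 0.2])
greedy = choose_action(example_q, probability_of_random_action=0.0)
explore = choose_action(example_q, probability_of_random_action=1.0)
```

In practice many implementations keep a small final probability of random moves rather than dropping all the way to zero, so the agent never stops exploring entirely.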
And so now, huzzah! We now have not the world’s worst Pong AI!
The PyGamePlayer project: https://github.com/DanielSlater/PyGamePlayer
The complete code for this example is here