QLearn: A Haskell library for iterative Q-learning.
Reinforcement learning is a rapidly growing field centered on teaching agents to act optimally in environments made up of states, actions, and rewards associated with state and action pairs. QLearn is a library that lets you easily implement Q-learning-based agents in Haskell. You can get it through Cabal:
cabal install qlearn
You can include it in your code with:
import QLearn
There are lots of good explanations of Q-learning, so we won’t go into much detail about the technique here. In brief: we have an agent moving around an environment in which it can end up in particular states and transition between those states by taking actions. Each state and action pair has a reward associated with it. The agent doesn’t know exactly how state and action pairs turn into new states, nor how much reward each state and action pair gives. It does, however, know which state it is in at any given time. Given this information, the Q-learning algorithm tries to have the agent figure out the optimal strategy.
There are two numerical parameters we can control: alpha and gamma, both with values between 0 and 1. The former (a learning rate) controls how much new observations should affect our current understanding of the environment compared to old observations; the latter describes how much rewards in the future should be discounted. In addition to these, there’s also an epsilon function. If our agent were to just always follow the policy it has "learned" right from the start, it might get stuck on some really bad policy. So, we want it to sometimes take a random action instead. Given the number of time steps remaining, the epsilon function returns the probability of taking this random action.
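Under the hood, this amounts to the standard Q-update rule: Q(s,a) ← (1−α)·Q(s,a) + α·(r + γ·maxₐ′ Q(s′,a′)). As a rough standalone sketch of that rule using a plain Data.Map table (the names updateQ and QTable here are hypothetical illustrations, not part of QLearn’s API):

```haskell
import qualified Data.Map.Strict as Map

type St     = Int
type Act    = Int
type QTable = Map.Map (St, Act) Double

-- One Q-learning step:
-- Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a'))
updateQ :: Double -> Double -> QTable -> St -> Act -> Double -> St -> QTable
updateQ alpha gamma q s a reward s' = Map.insert (s, a) newVal q
  where
    old      = Map.findWithDefault 0 (s, a) q
    -- best value currently estimated for the successor state s'
    -- (0 if we have no entries for s' yet)
    bestNext = maximum (0 : [v | ((st, _), v) <- Map.toList q, st == s'])
    newVal   = (1 - alpha) * old + alpha * (reward + gamma * bestNext)
```

For example, starting from an empty table with alpha = 0.5 and gamma = 1, a reward of 10 moves the estimate for that state and action pair halfway from 0 toward 10.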
QLearn is incredibly easy to use. There’s only a little bit of setup needed to create an agent and an environment for the agent to operate in. For a simple agent moving about on a grid, we have the following code:
import QLearn
import System.Random

main = do
  let alpha = 0.4
      gamma = 1
      totalTime = 1000
      numStates = 16 -- we are operating in a 4x4 grid
      numActions = 4 -- up, down, left and right
      epsilon = (\timeRemaining -> 1.0 / (fromIntegral $ totalTime - timeRemaining))
      execute = executeGrid testGrid
      possible = possibleGrid testGrid
      qLearner = initQLearner alpha gamma epsilon numStates numActions
      environment = initEnvironment execute possible
  g <- newStdGen
  moveLearnerPrintRepeat totalTime g environment qLearner (State 0)
Notice that we’re using testGrid, a 4×4 multidimensional array of doubles in which each point within the grid represents a state and the value at that point is the reward associated with that state. All actions are deterministic within this grid. We get the state transition behavior using possibleGrid from QLearn. If you run the code snippet, you should see the value table for the agent update as it performs 1000 iterations on the grid.
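To make the deterministic transitions concrete, here is a standalone sketch of how moves on such a 4×4 grid behave, with states numbered 0–15 in row-major order. This is only a hypothetical illustration of the behavior possibleGrid encodes, not QLearn’s actual implementation:

```haskell
-- States 0..15 laid out row-major in a 4x4 grid; actions 0..3 are
-- up, down, left and right. Moves that would leave the grid keep
-- the agent in place.
step :: Int -> Int -> Int
step action s = case action of
    0 -> if r > 0 then s - 4 else s  -- up
    1 -> if r < 3 then s + 4 else s  -- down
    2 -> if c > 0 then s - 1 else s  -- left
    3 -> if c < 3 then s + 1 else s  -- right
    _ -> s
  where
    (r, c) = s `divMod` 4
```

So from state 0 (the top-left corner), moving right leads to state 1, moving down leads to state 4, and moving up or left leaves the agent where it is.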