Reinforcement Learning from Scratch: Applying Model-free Methods and Evaluating Parameters in Detail

This project was created as a means to learn Reinforcement Learning (RL) independently in a Python notebook. Applying RL without the need of a complex, virtual environment to interact with. By fully defining the probabilistic environment, we are able to simplify the learning process and clearly demonstrate the effect changing parameters has on the results. This is a valuable in any Machine Learning task but particularly in Reinforcement Learning where it can be hard to understand the impact varying parameters without a clear and well-defined example.

Medium Article



The aim is to find the best action between throwing or moving to a better position in order to get paper into a bin (trash can). In this problem, we may throw from any position in the room but the probability of it is relative to the current distance from the bin and the direction in which the paper is thrown. Therefore the actions available are to throw the paper in any 360 degree direction or move to a new position to try and increase the probability that a throw made will go into the bin.

Previously, in part 1, we introduced the problem where the bin’s location is known and can be solved directly with Value-Iteration methods.

In part 2, we now show how RL can be used similarly to find the optimal policy if the probabilities are hidden using model-free methods (e.g. Q-learning, Monte Carlo, etc).

Furthermore, we use this as an opportunity to introduce a novel visualisation method for examining parameters. When we vary parameters in Reinforcement Learning, we observe the results over a number of attempts (or episodes) where we hope to show stability and convergence. This is a two dimensional comparison (x = episode and y = output) and when we want to observe the results if we vary a parameter this becomes three dimensional (x = episode, y = output and z = parameter). The simplest and most commonly used solution is to produce multiple plots for each parameter choice. An alternative, but more complex visual choice, is to plot in three dimensions.

Instead, we introduce our novel method of an interactive animation whereby the parameter change is shown over time.

The aim of this visualisation is to improve the way in which you compare parameter choices. However, because parameter selection is often a process that needs to be performed quickly, we acknowledge the requirement that this must be simple to achieve. The final parameter choices may then be formatted appropriately but the aesthetics of the visuals in the decision process may be less strict.

Therefore, although this is a method that we have introduced previously, we have since taken this further and formally defined as a package that can be easily downloaded and used without extensive knowledge of’s library (Github Release in progress).

We will demonstrate how to use the graphing function in this notebook. A quick example of the interactive animation can be used is shown below where we can see the dimensions discussed previously.

Lastly, state of the art research in model explainability emphasis the requirement to diagrammatically outline methods clearly and consistently.

Researchers from Google have introduced Model Cards as a means to provide transparency in training algorithms.

Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, we recommend that released models be accompanied by documentation detailing their performance characteristics.

Model Cards Spec

Furthermore, a colleague of mine is introducing a framework for formalising diagrammatic methods for skilled researchers to share their work; known as DIAL. This research attempts to identify recurring primitives and building blocks of AI systems, and proposes a schematic notation to try and facilitate improved communication about AI Systems research.

Currently, there is no consistent model for visually or formally representing the architecture of AI systems. This lack of representation brings interpretability, correctness and completeness challenges in the description of existing models and systems. DIAL (The Diagrammatic AI Language) has been created with the aspiration of being an “engineering schematic” for AI Systems. It is presented here as a starting point for a community dialogue towards a common diagrammatic language for AI Systems.

Therefore, we have created a visual that shows and summarises the entire workflow to produce the final output:

Complete RL Process Diagram

Pre-processing: Introducing the Probabilistic Environment

Following the previously defined environment (see META notebook) we have also found the optimal policy calculated from value iteration.

The optimal policy can be imported from the data file and is fixed given the bin is at (0,0) and the probabilities are calculated as shown in the function below.

RL Environment Definitions

A policy is the currently recommended actions for all given states. The states (s and s′) are the position in the room and the actions (a) are either moving in one of 8 directions (north, north east, east, … north west) or throwing in any 360 degree direction (0,1,2,3,4,…,359,360 degrees) from due north (see Navigation Bearings).

In our probabilistic environment, we have defined that moving has a guaranteed outcome of being followed correctly (i.e. you do not miss-step) but throwing in a direction is not guaranteed to get into the bin. The probability of the throw successfully going into the bin is relative to the distance of the current position to the bin and the direction thrown from the true direction.

This defines our probabilistic environment and transition function:


is the probability of transition from state s to state s′ under action a (

(Note: bin is an English term for trash can/ dustbin/ garbage pail/ litter basket/ trash barrel/ trash bin/ trash can/ wastebasket/ wastepaper basket)

Probability Function

This function defines the probability of a successful throw from any given state and is calculated by the following:

First, if the position is the same as the bin (i.e. the person is directly inside the bin already) then the probability is fixed to 100%.

Next, we have to re-define the throwing direction in two cases to accommodate for the fact that 360 degrees is the same as 0 degrees. For example, if we are south-west of the bin and throw 350 degrees, this would be the same as -10 degrees and would then relate to a bearing from the person to the bin less than 90 correctly.

Then the euclidean distance is calculated followed by the max distance a person could be from the bin.

We then calculate the bearing from the person to the bin following the previous figure and calculate the score bounded within a +/- 45 degree window. Throws that are closest to the true bearing score higher whilst those further away score less, anything more than 45 degrees (or less than -45 degrees) are negative and then set to a zero probability.

Lastly, the overall probability is related to both the distance and direction given the current position.

Initialise State-Action Pairs

Before applying the algorithm, we initialise each state-action value into a table. First we form this for all throwing actions then all moving actions.

We can throw in any direction and therefore there are 360 actions for each degree starting from north as 0 clockwise to 359 degrees.

Although movement may seem simpler in that there are 8 possible actions (north, north east, east, etc) there are complications in that unlike being able to throw in any direction from any position, there are some movements that aren’t possible. For example, if we are at the edge of the room, we cannot move beyond the boundary and this needs to be accounted for. Although this could be coded nicer, I have done this manually with the if/elif statements shown that skips the row if the position and movement is not possible.

Initialised Q Table

Defining Model-free Reinforcement Learning Methods

We introduce three model-free methods that are considered to be the simplest to apply and compare their strengths and weaknesses. However, before we do this we consider the difference between model-free methods and our previously used value-iteration model-based method. In short, model-based methods use the knowledge of the probabilistic environment as a guide and plans the best actions accordingly. In model-free methods, the algorithm has no knowledge of this probabilities it just tries actions and observes the results.

In this example, we have calculated the probabilities and will use these to find the outcome of the actions but they are not used within the algorithm’s learning directly.

Furthermore, in the model-based methods, we update all actions in large “sweeps” where the value of all states are updated in one pass. In model-free methods we use episodes where only states that are visited are updated. An episode is a path from a start state to the terminal state; in our example, the terminal state is when the algorithm throws the papers and the outcome may be successful or a miss.

Forming Episodes and Defining the Action Selection Process

If we define the starting position, an episode is the actions taken from that position until the paper is thrown. If it reaches the bin, then we have a positive goal reward of +1. However, if we miss the bin, then we have a negative goal reward of -1.

Action Selection

We could continue this selection process but this is a highly inefficient method of choosing which action to take. When we implement our learning process we will begin to learn which actions lead towards the positive goal so if we keep randomly selecting we are wasting all that effort.

Therefore, we instead introduce a method that takes this into account, known as epsilon-greedy.

The best lever is selected for a proportion 1−ϵ of the trials, and a lever is selected at random (with uniform probability) for a proportion ϵ . A typical parameter value might be epsilon=0.1, but this can vary widely depending on circumstances and predilections. (wiki)

In other words, we pick an action randomly with probability ϵϵ and will otherwise pick the best action. If we have multiple “best actions” we then randomly select from this list.

So why do we not simply keep picking the best action each time? Well this can cause a problem if we have an action that works but is not necessarily the best. This is often considered in other Machine Learning problems as the local minimum/maximum. If we keep using an action that seems to work we may miss the opportunity to try a better action because we never tried it and this can cause instability in the results.

The animation below demonstrates the outcome as we reduce epsilon. With high epsilon values, we are randomly selecting actions so will likely pick bad ones. As we reduce epsilon we select actions more and more greedily improving the results whilst still ensuring we can explore new actions to minimise the risk we are in a local max rather than global max.

We therefore select a small epsilon value ϵ=0.1

Introducing RL Algorithms

We have introduced episodes and how to choose actions but we have yet to demonstrate how and algorithm uses this to learn the best actions. Therefore, we will formally define our first RL algorithm, Temporal Difference 0.

Temporal Difference — Zero

Temporal Difference λ are a family of algorithms depending on the choice of λ. The simplest is to set this to zero at which point have the following update rule:

Definition: TD(0) Update Rule: Wiki


  • V(s) is the value of state s,
  • α is the learning rate parameter,
  • r is the reward,
  • γ is the discount factor parameter and,
  • V(s′) is the vale of the next state.

So what does this equation mean? In short, we update our knowledge of the quality of the current state, denoted V(s), based on a combination of what its value is and the result of taking the action to the next state defined in the episode.

For example, say we start the learning processing and our first action is to throw from state [-5,-5] and it successfully hits the bin, then we have a positive reward of +1 for reaching to goal. Therefore, we have the following update:

This may seem like a trivial calculation but it is important to remember that success is not guaranteed. Therefore, if we think about all possible actions, the result of this first throw means we believe that this throw action is current the best choice. This throw action has a value of 0.5 compared to 0 for all other actions that haven’t been tested yet.

Therefore, under the ϵ−greedy selection process, we would try this again. However this time, the paper does not go in the bin but misses and we therefore have a negative terminal reward of −1−1:

So we see that our value of this state has now reduced slightly to account for the second throw.

The core concept in Reinforcement Learning is that we are testing actions by repeated sampling; we need to repeat the number of samples until the results converge to an estimate of the true probabilistic outcomes.

For example, if we consider tossing a coin 2 times, we are fairly like to have both outcomes being heads but if we throw it 100 times then we would likely see a 50/50 split between heads and tails. In our example, if throwing from state [-5,-5] is a good action, then repeatedly trying this should, overall, lead to a positive result. This can be quite difficult to comprehend at first but in simple terms, we are testing the action by trial and error and making our algorithm do all the work so we don’t have to.

Note: For now, we will fix the start state to [-5,-5] and parameters to be ϵ=0.1, α=0.5 and γ=0.5 until we demonstrate parameter changes later.

After 100 episodes, we see that the states around our fixed start point have updated but if we compare the following heat-map side by side with the previous line plot, we see that this has not fully converged after 100 episodes and is still updating.

Initial TD(0) Output

We therefore increase the number of episodes greatly from 100 to 1,000

As we are starting to find that this takes longer and longer, a good idea is to introduce a method to keep track of the progress of the loop. To do this, I applied a method introduced in this post.

Increase to 1,000 episodes

Varying Rewards

We note that the results of this show that the value of the states are very negative and that they are diverging (i.e. not stable).

There are a few steps we can take to improve this, first we will introduce rewards for other actions. Currently the only rewards we have are for when the algorithm throws and receives a +1 for the positive goal or a −1 for the negative goal.

This is part of the process in Reinforcement Learning that gives us control as to what the algorithm optimises for. For example, say we want to discourage the algorithm from throwing, we could introduce a small positive reward (say 0.1) for each move action as shown below.

Add reward for moving: r_move = 0.1

Although this appears worse at first, the value of the state oscillating shows that there is a value its trying to find but our choice of parameters is causing it to diverge. But, we at least can see that it is getting closer to converging.

We could start varying parameters but part of the issue is that we are summarising the value of the state for a large number of actions (360 throw directions and 8 move directions). Therefore, instead of summarising this into one value, it would be better to consider the quality of each state-action pair individually.

To do this, we can introduce our second model-free method: Q-learning.


Much like TD(0), Q-learning learns as we take each action but instead searches through the possible subsequent actions to learn faster.

Definition: Q-Learning Update Rule: Wiki


  • Q(s_t,a_t) is the value of state-action pair s,
  • α is the learning rate parameter,
  • r is the reward,
  • γ is the discount factor parameter and,
  • Q(s_t+1,a) is the value of action-pairs in the next state.

As before, we will fix the parameters to be ϵ=0.1, α=0.5 and γ=0.5.

Q-Learning Initial Run

Varying Parameters

We have three main parameters to vary, the learning rate αα, the discount factor γγ and our value for the ϵ−greedyϵ−greedy action selection.

The following explanations for each are taken directly from Wikipedia and have already introduced the ϵϵ parameter in some detail.

Explore vs exploit

The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). In fully deterministic environments, a learning rate of αt=1 is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, often a constant learning rate is used, such as αt=0.1 for all t.[3]

Discount factor

The discount factor γγ determines the importance of future rewards. A factor of 0 will make the agent “myopic” (or short-sighted) by only considering current rewards, i.e. rt (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For γ=1, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite.[4] Even with a discount factor only slightly lower than 1, Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network.[5] In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning.[6]

So what do these mean, and, more importantly, what are we trying to achieve with our parameter selection?

The overall aim is that we are trying to find the optimal action for any given state whilst achieving this in a reasonable number of effort (measured by the number of episodes, computation needed or time). A good explanation for the learning rate is that a high value means we value the information gained in each action more and so learn faster but may find it hard to full converge whilst a small value will take longer but will steadily converge.

A good analogy for this is to think of it like we are playing Golf with just one club; a high alpha corresponds to using a big-hitting club whilst a small alpha value is akin to using a small-hitting club. The big-hitting club will initially get us closer to the green but once we get close it is hard to accurately hit the hole. However, a small-hitting club will take more attempts to reach the green but once it does we have more control and can reach the hole easier.

We have already observed the effect of a large alpha parameter in our earlier applications where the values oscillated between each episode. Therefore, we need to use a small value but this introduces the challenge regarding the number of episodes required to converge. We already need thousands of episodes to converge for a fixed start state and we have 100 to consider for the full environment.

This is the trade-off we have to consider and the best decision may be to only learn for one fixed start state at a time when needed rather than trying to find the optimal policy for all states.

When we observe the trends for varying alpha in the animation below, we see that, if we fix the start state, we are able to use a small alpha value without needing an impossibly large number of episodes. If we were to consider all states, we would likely need to use a slightly larger alpha value to reach a suitable result in good time.

We therefore go back to considering a single fixed start state of [-5,-5] and choose an alpha value of α=0.1.

With these set, we then evaluate the choice of γ. From the Wikipedia explanation, we understand that the value corresponds to whether we consider future rewards important or not. It helps when we consider this to remind ourselves of the update rule for Q-learning.

Withing the Q-learning update rule, we see that γ scales the Q value taken for the best action from the next state. This is in relation to the reward of the action itself as part of the same bracket and so if we reduce this by having a small gamma value then the reward has more weight. Conversely, if we take a high gamma value, we consider the information obtained from this next state to be more important.

Therefore, we would ideally choose a value that adds value to future rewards so that our decisions lead to optimally to the bin and select a value of γ=0.9.

Alpha Analysis

alpha = 0.9

alpha = 0.1

Gamma Analysis

gamma= 0.9

Final Parameter Output

- alpha = 0.1
- gamma = 0.9
- epsilon = 0.1
- 10,000 episodes
- Fixed start state: [-5,-5]


The optimal action from the start state is to MOVE in direction:  2


We see that the final output for start state is to move EAST. As mentioned, we may want to consider changing the rewards so that moving is discouraged slightly as it seems that our algorithm is collecting rewards for moving rather than reaching the end goal.

The results for all states covered by the episodes converge within the 10,000 episodes though it appears that many have yet to be fully explored and are not optimal. But if we only concerned with the start state then these do not matter significantly.

I hope this notebook/write-up is useful for demonstrating the impact each parameter has on learning and the overall process of RL in a self contained example.


Leave a Reply