Building a Reinforcement Learning Model from Scratch to Plan a Meal based on Real Costs and Personal Preferences: Part 2

Reinforcement Learning for Meal Planning

Varying Alpha

A good explanation of how alpha shapes our output is given by Stack Overflow user VishalTheBeast:

“Learning rate tells the magnitude of step that is taken towards the solution.

It should not be too big a number as it may continuously oscillate around the minima and it should not be too small of a number else it will take a lot of time and iterations to reach the minima.

The reason why decay is advised in learning rate is because initially when we are at a totally random point in solution space we need to take big leaps towards the solution and later when we come close to it, we make small jumps and hence small improvements to finally reach the minima.

Analogy can be made as: in the game of golf when the ball is far away from the hole, the player hits it very hard to get as close as possible to the hole. Later when he reaches the flagged area, he chooses a different stick to get an accurate short shot.

So it’s not that he won’t be able to put the ball in the hole without choosing the short shot stick; he may send the ball ahead of the target two or three times. But it would be best if he plays optimally and uses the right amount of power to reach the hole. The same goes for a decayed learning rate.”
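The golf analogy can be sketched numerically. Below is a minimal, illustrative example (not part of the meal-planning model itself): gradient descent on f(x) = x², comparing a fixed step size that keeps overshooting the minimum with one that decays over time.

```python
def descend(alpha_schedule, x0=10.0, steps=50):
    """Take gradient steps on f(x) = x**2 (gradient 2*x), with alpha given per step."""
    x = x0
    for t in range(steps):
        x -= alpha_schedule(t) * 2 * x
    return x

# Fixed alpha = 1.0: each step maps x to -x, so the iterate oscillates
# around the minimum forever and never settles.
fixed = descend(lambda t: 1.0)

# Decayed alpha: big leaps early, small corrective steps later.
decayed = descend(lambda t: 0.9 / (1 + t))

print(abs(fixed), abs(decayed))  # the decayed run ends far closer to 0
```

The fixed-alpha run is the golfer who always swings at full power; the decayed run switches to the short-shot stick as it nears the hole.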

To better demonstrate the effect of varying our alpha, I will be using an animated plot created using Plotly.

I have written a more detailed guide on how to do this here:

In our first animation, we vary alpha between 1 and 0.1. This enables us to see that, as we reduce alpha, our output smooths somewhat but is still fairly rough.

To investigate this further, I have then created a similar plot for alpha between 0.1 and 0.01. This emphasises the smoothing effect of alpha even more.

However, even though the results are smoothing out, they no longer converge within 100 episodes and, furthermore, the output appears to alternate between each alpha. This is due to a combination of small alphas requiring more episodes to learn and our action selection parameter epsilon being 0.5. Essentially, the output is still decided by randomness half of the time, and so our results do not converge within the 100-episode window.
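To see why epsilon = 0.5 keeps so much randomness in play, here is a minimal epsilon-greedy sketch. The function and the action values are illustrative, not taken from the model above: with epsilon = 0.5, half of all selections ignore the learned values entirely.

```python
import random

def epsilon_greedy(q_values, epsilon=0.5, rng=random):
    """With probability epsilon pick a random action; otherwise pick the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

random.seed(0)
q = [0.1, 0.9, 0.3]  # illustrative action values
choices = [epsilon_greedy(q, epsilon=0.5) for _ in range(10_000)]

# With 3 actions, the best action (index 1) is picked on the 50% greedy
# selections plus a third of the 50% random ones, i.e. roughly 2/3 of the time.
print(choices.count(1) / len(choices))
```

So even a perfectly learned value function would still see a third of its actions chosen at random, which is why smaller alphas cannot converge here within 100 episodes.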

In [14]:
from plotly.offline import init_notebook_mode, iplot, plot
from IPython.display import display, HTML
import plotly
import plotly.plotly as py
# numpy and pandas are used to collect the results in the next cell
import numpy as np
import pandas as pd

In [15]:
# Provide all parameters fixed except alpha
budget5 = 23
num_episodes5 = 100
epsilon5 = 0.5

# Currently not using a reward
reward5 = [0,0,0,0,0,0,0,0,0]

VforInteractiveGraphA = []
lA = []
num_episodes5_2 = []
for x in range(0, 10):
    alpha5 = 1 - x/10
    Mdl5 = MCModelv1(data=data, alpha = alpha5, e = num_episodes5, epsilon = epsilon5, budget = budget5, reward = reward5)
    VforInteractiveGraphA = np.append(VforInteractiveGraphA, Mdl5[0])
    for y in range(0, num_episodes5):
        lA = np.append(lA, alpha5)
        num_episodes5_2 = np.append(num_episodes5_2, y)
VforInteractiveGraphA2 = pd.DataFrame(VforInteractiveGraphA, lA)
VforInteractiveGraphA2['index1'] = VforInteractiveGraphA2.index
VforInteractiveGraphA2['Episode'] = num_episodes5_2
VforInteractiveGraphA2.columns = ['V', 'Alpha', 'Episode']
VforInteractiveGraphA2 = VforInteractiveGraphA2[['Alpha','Episode', 'V']]


      Alpha  Episode      V
1.0     1.0      0.0  -1.00
1.0     1.0      1.0  -1.50
1.0     1.0      2.0  -2.00
1.0     1.0      3.0  -2.25
1.0     1.0      4.0  -2.25