Applying the evaluation and convergence process to a business problem
Chess, once considered the ultimate proof of human intelligence, has been beaten by brute-force calculation running on machines with large CPU and RAM capacity. Almost any human problem that requires logic and reasoning can most probably be solved by a machine using relatively elementary processes expressed in mathematical terms.
Let's take the result matrix of the reinforcement learning example in the first chapter. It can also be viewed as a scheduling tool. Automated planning and scheduling has become a crucial field of artificial intelligence, as explained in Chapter 12, Automated Planning and Scheduling. In this case, evaluating and measuring the result goes beyond convergence.
In a scheduling process, the reward matrix input can represent the priorities of the packaging operation for products in a warehouse. It determines the order in which customer products must be picked to be packaged and delivered. These priorities also govern a machine that automatically packages the products in FIFO (first in, first out) mode. Such systems provide good solutions, but in real life, many unforeseen events change the order of flows in a warehouse and practically all schedules.
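To make the FIFO constraint concrete, here is a minimal sketch, assuming that orders are simple labels and that the packaging machine can be modeled as a queue. The function name `schedule_fifo` and the order labels are hypothetical. Whatever sequence the scheduler produces, the machine packages orders in exactly the order they were loaded.

```python
from collections import deque


def schedule_fifo(prioritized_orders):
    """Load orders into a FIFO queue and return the order in which they are packaged."""
    machine_queue = deque(prioritized_orders)  # first in ...
    packaged = []
    while machine_queue:
        packaged.append(machine_queue.popleft())  # ... first out
    return packaged


# Hypothetical orders, already sorted by the scheduler's priorities
print(schedule_fifo(["order_C", "order_A", "order_B"]))
# → ['order_C', 'order_A', 'order_B']
```

In this sketch the queue never reorders anything, so an unforeseen event such as a rush order can only be absorbed by rebuilding the sequence before it is loaded into the machine.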
In this case, the result matrix can be transformed into a vector of a scheduled packaging sequence. The packaging department will follow the priorities produced by the system.
In this chapter, the reward matrix is R (see Q_learning_convergence.py):
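import numpy as ql  # NumPy, imported under the alias ql

# Reward matrix R: rows are the current states, columns the next states
R = ql.matrix([ [-1,-1,-1,-1,0,-1],
                [-1,-1,-1,0,-1,0],
                [-1,-1,100,0,-1,-1],
                [-1,0,100,-1,0,-1],
                [0,-1,-1,0,-1,-1],
                [-1,0,-1,-1,-1,-1] ])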
Its visual representation is the same as in Chapter 1, Become an Adaptive Thinker, but the values are slightly different for this application:
- Negative values (-1): The agent cannot go there
- 0 values: The agent can go there
- 100 values: The agent should favor these locations
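These value conventions can be read directly from R. The following sketch uses plain Python lists instead of the `ql.matrix` object so that it runs without NumPy; the helper name `read_reward_matrix` is hypothetical. For each state, it lists the reachable next states (value 0 or 100) and the favored ones (value 100).

```python
R = [[-1, -1, -1, -1, 0, -1],
     [-1, -1, -1, 0, -1, 0],
     [-1, -1, 100, 0, -1, -1],
     [-1, 0, 100, -1, 0, -1],
     [0, -1, -1, 0, -1, -1],
     [-1, 0, -1, -1, -1, -1]]


def read_reward_matrix(r):
    """Return {state: (allowed_next_states, favored_next_states)}."""
    result = {}
    for state, row in enumerate(r):
        # -1 means the agent cannot go there; 0 or 100 means it can
        allowed = [nxt for nxt, v in enumerate(row) if v >= 0]
        # 100 marks locations the agent should favor
        favored = [nxt for nxt, v in enumerate(row) if v == 100]
        result[state] = (allowed, favored)
    return result


for state, (allowed, favored) in read_reward_matrix(R).items():
    print(state, allowed, favored)
```

The allowed transitions match the positions of the nonzero cells in the Q matrix that follows.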
The Q function, introduced early in the first section of this chapter, produces the result in matrix format, displayed as follows:
Q :
[[ 0. 0. 0. 0. 258.44 0. ]
[ 0. 0. 0. 321.8 0. 207.752]
[ 0. 0. 500. 321.8 0. 0. ]
[ 0. 258.44 401. 0. 258.44 0. ]
[ 207.752 0. 0. 321.8 0. 0. ]
[ 0. 258.44 0. 0. 0. 0. ]]
Normed Q :
[[ 0. 0. 0. 0. 51.688 0. ]
[ 0. 0. 0. 64.36 0. 41.5504]
[ 0. 0. 100. 64.36 0. 0. ]
[ 0. 51.688 80.2 0. 51.688 0. ]
[ 41.5504 0. 0. 64.36 0. 0. ]
[ 0. 51.688 0. 0. 0. 0. ]]
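The Normed Q values are the Q values rescaled so that the largest entry becomes 100: each cell is divided by max(Q) = 500 and multiplied by 100 (for example, 401 / 500 × 100 = 80.2). A minimal sketch of that step in plain Python, with the hypothetical helper name `normalize`:

```python
Q = [[0, 0, 0, 0, 258.44, 0],
     [0, 0, 0, 321.8, 0, 207.752],
     [0, 0, 500, 321.8, 0, 0],
     [0, 258.44, 401, 0, 258.44, 0],
     [207.752, 0, 0, 321.8, 0, 0],
     [0, 258.44, 0, 0, 0, 0]]


def normalize(q, scale=100.0):
    """Rescale every cell so that the largest value in q becomes `scale`."""
    q_max = max(max(row) for row in q)
    return [[cell / q_max * scale for cell in row] for row in q]


normed_q = normalize(Q)
for row in normed_q:
    # Round only for display; floating-point division may leave tiny residues
    print([round(cell, 4) for cell in row])
```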
From that result, the following packaging priority order matrix can be deduced.
The non-prioritized vector (npv) of packaging orders is np.
The npv contains the priority value of each cell in the matrix, which is not a location but an order priority. When this vector is combined with the result matrix, the results become priorities for the packaging machine. These priorities now need to be analyzed, and a final order must be decided and sent to the packaging department.
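As a minimal sketch, assuming that each row's highest Normed Q value serves as that row's order priority (a hypothetical rule chosen only for illustration, not necessarily how the npv is built), the Normed Q matrix can be reduced to a vector and sorted into a candidate packaging sequence. The function name `packaging_sequence` is also hypothetical.

```python
normed_q = [[0, 0, 0, 0, 51.688, 0],
            [0, 0, 0, 64.36, 0, 41.5504],
            [0, 0, 100, 64.36, 0, 0],
            [0, 51.688, 80.2, 0, 51.688, 0],
            [41.5504, 0, 0, 64.36, 0, 0],
            [0, 51.688, 0, 0, 0, 0]]


def packaging_sequence(matrix):
    """Hypothetical rule: priority of row i = max of row i; return rows sorted by priority."""
    priorities = [max(row) for row in matrix]  # one priority value per row
    # sorted() is stable even with reverse=True, so ties keep their original order
    order = sorted(range(len(priorities)), key=lambda i: priorities[i], reverse=True)
    return priorities, order


priorities, order = packaging_sequence(normed_q)
print(priorities)  # [51.688, 64.36, 100, 80.2, 64.36, 51.688]
print(order)       # [2, 3, 1, 4, 0, 5]
```

The ties in this sketch (rows 1 and 4, rows 0 and 5) are exactly the kind of result that still needs analysis before a final order is sent to the packaging department.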