Source: dqindia.com
ML algorithms allow us to model and predict Big Data behaviors based on historical data: looking back. But what if the historical database is not enough to model our problem? This is where so-called reinforcement learning comes into play: machines explore their environment and learn from scratch, guided by rewards and penalties, keeping their sights on a future goal.
Supervised and unsupervised learning techniques are being applied to get to know users, predict subscriber preferences and behavior, and anticipate system failures, among other tasks. To carry out such tasks, these algorithms usually require a large amount of historical data recording the different characteristics and possible configurations within a given context. Take, as an example, the task of recommending products to visitors of a particular web page. Suppose that for each visitor we want to highlight a certain selection of products that we hope will be to their liking. The idea is to engage visitors by showing them 'appealing' products that are, or could be, of interest to them. One way to tackle this is to classify the types of customers who visit and make purchases through the site. Once the different categories of customers have been established, we can determine which category each visitor falls under and choose products of interest that match that category.
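To make this concrete, here is a minimal sketch of that classification-based approach, assuming scikit-learn is available; the customer features, cluster count, and product lists are invented for illustration and do not come from the article.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical historical data: one row per customer, with columns such as
# average spend, visit frequency, and pages viewed (illustrative only).
customer_features = np.array([
    [120.0, 3, 15],
    [15.0, 1, 4],
    [300.0, 8, 40],
    [22.0, 2, 6],
])

# Group past customers into a handful of segments.
segmenter = KMeans(n_clusters=2, n_init=10, random_state=0)
segmenter.fit(customer_features)

# Illustrative mapping from segment to products worth highlighting.
products_by_segment = {
    0: ["budget headphones", "phone case"],
    1: ["premium laptop", "noise-cancelling headset"],
}

# For a new visitor, find the matching segment and show its products.
new_visitor = np.array([[80.0, 2, 10]])
segment = segmenter.predict(new_visitor)[0]
print(products_by_segment[segment])
```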
Although such procedures work, they require a considerable and consistent historical database in which the preferences of all customers who have made purchases through the page have been recorded. The problem companies face is that this amount of information is not always available, or it is not enough for the algorithms to correctly model customer tastes. Without an informative database (and such databases are often biased), algorithms that depend on historical data can fail badly.
So the question becomes: wouldn't it be better to use an algorithm that "learns to learn"? An algorithm that learns to know your customers, learns how the system works, and figures out how to reach the goal on its own. In short, an algorithm that can learn from scratch and from experience.
Such a machine learning algorithm exists, and it is known as reinforcement learning. Reinforcement learning is an area of machine learning inspired by behavioral psychology, in which the machine learns on its own which behavior to follow based on rewards and penalties, drawing on the data points it gathers through its own experience rather than on a pre-existing historical dataset. Much as a dog learns tricks in exchange for treats, or a child becomes adept at a particular video game, reinforcement learning works by trial and error, receiving a reward or penalty for each step taken toward a certain goal.
Every reinforcement learning problem has an agent, an environment described by states, actions the agent can take, and rewards or penalties the agent receives on the way to its objective. The interaction loop works as follows:
At a certain point in time, the agent is in a state [St]. From that state, the agent observes the environment and selects an action [At] that takes it to a next state. At the outset, the agent does not know what that next state is like, nor whether being there will yield a better or worse reward than another action would have. At each step, the agent knows only the here and now: the state it is in and the actions it can choose from that state. When the action is executed, the environment gives the agent a reward or penalty [Rt]. Through this repetition, the estimated value of its actions changes, and the agent learns which actions lead to the greatest reward in the long term. Learning therefore focuses not on maximizing the objective directly, but on the agent learning an optimal policy for achieving it.
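As an illustration of this loop, here is a hedged sketch in Python with a tiny invented environment; the states, actions, rewards, and the simple value-update rule (a basic Q-learning step) are assumptions for demonstration, not a method prescribed by the article.

```python
import random

# Illustrative states and actions for a toy environment.
states = ["s0", "s1", "s2"]
actions = ["a0", "a1"]

def step(state, action):
    """Hypothetical environment: returns (next_state, reward)."""
    if state == "s2":
        return "s0", 0.0                      # episode wraps around
    next_state = "s2" if action == "a1" else "s1"
    reward = 1.0 if next_state == "s2" else 0.0
    return next_state, reward

# Value of each (state, action) pair, learned only from experience.
q = {(s, a): 0.0 for s in states for a in actions}
alpha, gamma = 0.1, 0.9                       # learning rate, discount factor

state = "s0"
for t in range(1000):
    action = random.choice(actions)           # pure exploration, for brevity
    next_state, reward = step(state, action)  # environment returns R_t
    # Update the value of the action just taken (a simple Q-learning step).
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    state = next_state

print(q)   # actions leading toward s2 end up with the highest values
```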
How does the agent choose each action? In reinforcement learning, two things come into play: exploration and exploitation. Exploration means choosing actions at random. Exploitation means choosing actions based on how valuable it is estimated to be to take them from a given state. Depending on how we want the learning to proceed, the balance between exploration and exploitation can be adjusted. For example, we can have the agent choose actions at random 30% of the time, so that it explores the environment on its own, and choose the most valuable action for its current state the remaining 70% of the time. But why not always exploit? Remember that the agent starts learning from scratch, so at the beginning every action in the initial state has a value of zero. Moreover, the set of available actions can vary from state to state, so the overall environment is not known in advance. It is only through experience that actions begin to acquire value. Consequently, exploration is vital.
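The 30/70 split described above corresponds to what is commonly called an epsilon-greedy policy. The function below is a minimal sketch of that rule; the names `q`, `state`, and `actions` are illustrative and match the toy example above.

```python
import random

def choose_action(q, state, actions, epsilon=0.3):
    """Epsilon-greedy policy: explore 30% of the time, exploit otherwise.

    `q` maps (state, action) pairs to the values learned so far.
    """
    if random.random() < epsilon:
        return random.choice(actions)                          # exploration
    return max(actions, key=lambda a: q.get((state, a), 0.0))  # exploitation
```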
How can we apply this to the recommendation example above? The agent is the system that learns which product to recommend to each visitor. The actions are the different products the page offers. The customers and the characteristics of each visit define the environment and the states. If the visitor clicks on the recommended product, the agent receives a reward of 1; if not, it receives a reward of 0. In this way, the agent learns which product to recommend from a given state.
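As a rough sketch of this setup, the snippet below frames recommendation as a simple per-segment bandit with click/no-click rewards; the product names, visitor segments, and helper functions are hypothetical and chosen only to illustrate the idea.

```python
import random

# Illustrative catalogue and visitor segments (stand-ins for real states).
products = ["laptop", "headphones", "phone case"]
segments = ["new_visitor", "returning_visitor"]

clicks = {(s, p): 0 for s in segments for p in products}   # rewards received
shows = {(s, p): 0 for s in segments for p in products}    # times recommended

def recommend(segment, epsilon=0.1):
    """Pick a product for this visitor segment, mostly exploiting click rates."""
    if random.random() < epsilon:
        return random.choice(products)
    return max(products,
               key=lambda p: clicks[(segment, p)] / max(shows[(segment, p)], 1))

def record_feedback(segment, product, clicked):
    """Reward of 1 if the visitor clicked the recommendation, 0 otherwise."""
    shows[(segment, product)] += 1
    clicks[(segment, product)] += 1 if clicked else 0

# Example interaction: a returning visitor is shown a product and clicks it.
product = recommend("returning_visitor")
record_feedback("returning_visitor", product, clicked=True)
```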
As we have seen, reinforcement learning has the potential to personalize solutions, offering recommendations tailored to each client, without the need for prior knowledge of the users. Could this be the true future of marketing?