How Reinforcement Learning Helps Retailers

Introduction



Hello! Our Glowbyte Advanced Analytics team develops ML solutions for applied industries (retail, banking, telecom, etc.). Many of these tasks require non-standard solutions. One of them, the optimization of customer communication chains with Reinforcement Learning (RL), is the subject of this article.



We have divided the article into three blocks: an introduction to the problem of optimizing communication chains; an introduction to RL; and a third block in which we bring the first two together.






The task of optimizing communication chains



To begin with, a small glossary:



CRM is a customer relationship management system. Typically, it includes the process of accumulating and analyzing customer knowledge, which is used to improve sales and service levels.



A client is someone who uses the services of an organization.



Customer Attributes - The accumulated knowledge about the customer. Examples:



  • Average check;
  • Average frequency of purchases per month;
  • Age;
  • Region of residence.


Marketing campaign / communication / offer - promotional offers that customers receive from an organization. Examples:



  • You have received XXX points, spend them before YYY;
  • A personal XXX discount on YYY brand products for you.


A communications chain is a sequence of marketing campaigns.



Loyalty program is a set of marketing activities aimed at increasing customer value. A typical example is discount cards.



Customer clustering - dividing customers into groups within which they are similar to each other in consumer behavior.



A recommendation system is a system that generates the best offers to the client in terms of business value.



LTV (lifetime value) - the expected profit from a client over the entire period of cooperation with them.



It is often believed that when developing a loyalty program, the analyst's main task is to build a first-class recommendation system that knows what, when and in what quantity a client needs at any given moment. This is certainly important, and even brings some profit, but it is not the key business task. Above all, any organization wants its customers to develop the habit of using its services. The ideal client uses the services of this organization exclusively, brings a stable profit, recommends the services to friends, and at the same time requires minimal costs from the business. Customer loyalty is not earned instantly, and the organization's job is to guide the customer from the first order to regular purchases in the most efficient way.



For example, imagine a school class where the teacher needs not only to explain a rule or an algorithm, but also to instill in the students a love of learning or of the subject. An experienced teacher knows that the learning process is not always pleasant, sometimes even painful for both sides, but it is the end result that matters. The teacher has an individual approach to each student, taking many personal factors into account.



Unlike a small school class, an organization can have tens of millions of clients, each of whom needs to be guided by the hand. For that, guessing a customer's wish once is not enough, and doing it for everyone is clearly beyond human capabilities.



So, here are our starting points:



  1. The key business goal is long-term: to grow customer loyalty and, with it, metrics such as LTV, rather than the response to a single campaign. Loyalty is built over a sequence of interactions, not in one step;
  2. The organization has tens of millions of clients, each described by its own attributes and purchase history;
  3. The main lever of influence on customer behavior is chains of marketing communications rather than isolated offers;
  4. Selecting for every client an individual communication chain that maximizes the goal from point 1, at the scale described in point 2, is beyond manual, expert-driven methods, so the process has to be automated.


Our solution to this problem is based on the concept of Reinforcement Learning (RL). Before presenting our approach, we have prepared a short excursion into the theory.



Reinforcement Learning. INTRO



What is it and why?



The task of Reinforcement Learning is to form an optimal algorithm for interacting with a certain environment to achieve the desired result.



One example of using RL is finding a way out of a maze. Initially, nothing is known about the maze. By examining different options, the algorithm learns to find the shortest path to the exit.






What are the features of RL from the ML point of view?



Reinforcement Learning is a separate class of machine learning algorithms. As a rule, information about the environment is initially absent, in other words, there are no labeled examples for training.



The peculiarity of RL is that you can try different actions, draw a conclusion about their success, accumulate the knowledge gained and use it in the next choice, and repeat this many times. This iterative learning process, in which the algorithm explores the environment on its own, is one of the main distinguishing features of RL.



How does RL differ from random enumeration of all options?



First, classic RL (without deep networks) lets you make the enumeration sequential and efficient. One of the basic principles of RL is exploration, which alternates with the exploitation of accumulated knowledge. In other words, nothing prevents us from combining application of the model with testing; the main thing is to maintain the balance.



Second, not every task allows enumerating all possible situations. In such cases, advanced RL algorithms make it possible to generalize the accumulated knowledge to new situations. The idea of combining testing and application still holds.



What does the optimal algorithm for interacting with the environment mean?



Instant wins don't always guarantee long-term success.



For example, in chess, capturing an opponent's piece can lead to costlier losses later.



However, when choosing a specific action, we can estimate what awaits us at the next step; at that step, in turn, we can guess what will happen after it, and so on. All this knowledge can be taken into account when choosing the next action. This is how a behavior strategy is built.



Where is it used?



In games. There are also successes in robotics, negotiation bots, and recommendation systems.





Before diving into the details of the terminology, we provide examples that illustrate some of the conceptual features of RL.



Beginner example



Traditionally, let's start with the multi-armed bandit.



Consider a slot machine with N handles. Only one handle of the machine can be pulled at a time.



Objective: to identify the action (i.e. the handle) that brings the maximum payoff.



Solution: We can pull each handle many times. Then, as the “optimal action,” we choose the handle with the highest average payoff.



And if in the future we always choose the best action, such a strategy is called greedy.



Obviously, such a strategy will only work in a stationary environment (one that does not change over time). In a non-stationary environment (for example, if someone changes the machine's settings from time to time), a greedy strategy will eventually stop giving the optimal result.



Besides the greedy strategy, there are others:



  • ε-greedy strategy: in (1 - ε) of cases we choose the current optimal action, and in ε of cases - a random one;
  • Upper confidence bound (UCB) strategy: when choosing an action, a weight coefficient is used whose value depends on how well that action has been explored (the less explored the action, the higher the probability of choosing it);
  • Softmax: the larger the expected payoff of an action, the higher the probability of choosing it.


The multi-armed bandit problem is an example of the simplest problem in which we initially know nothing about the subject of observation, that is, we learn to interact with it from scratch. The solution is based on trial and error (very true to life), and as we gain experience, our actions become more and more successful.
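To make the exploration/exploitation balance concrete, here is a minimal sketch of an ε-greedy agent for the N-armed bandit. It is only an illustration of the idea described above: the `pull` function and the win probabilities in the usage example are hypothetical.

```python
import random

def epsilon_greedy_bandit(pull, n_arms, n_steps=10_000, eps=0.1):
    """Estimate the value of each arm by its running average payoff,
    choosing arms with an epsilon-greedy strategy.

    `pull(arm)` is assumed to return the (stochastic) payoff of an arm.
    """
    q = [0.0] * n_arms        # current average payoff per arm
    n = [0] * n_arms          # how many times each arm was pulled
    for _ in range(n_steps):
        if random.random() < eps:                       # explore: random arm
            arm = random.randrange(n_arms)
        else:                                           # exploit: best arm so far
            arm = max(range(n_arms), key=lambda a: q[a])
        reward = pull(arm)
        n[arm] += 1
        q[arm] += (reward - q[arm]) / n[arm]            # incremental average update
    return q

# Toy usage: three arms with win probabilities unknown to the agent.
probs = [0.2, 0.5, 0.7]
estimates = epsilon_greedy_bandit(
    lambda a: 1.0 if random.random() < probs[a] else 0.0, n_arms=3)
print(estimates)  # the estimate for the third arm should come out highest
```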



What we learned from the example:



  • Trial and error is also a method;
  • Random enumeration can be made more efficient by using different variants of strategies;
  • Stationary and non-stationary environments should be distinguished.


Intermediate example



Now we can complicate the task a little and consider the problem of balancing a pole on a moving cart.






The cart with the pole can move “left” and “right”.



Goal: learn to keep the pole in an upright position for as long as possible.



Difference from the previous task: now it is necessary to take into account additional parameters, the tilt angle (a) and the velocity (v) of the pole, and to make the decision based on this information.



The task looks more complicated because there are quite a lot of (a; v) combinations, and trying each of them many times will not work.



Any combination (a; v) is called a state. The set of states can be either continuous or finite. Algorithms with a finite number of states are generally easier to implement.



It turns out that the state is a set of parameters of the system. An important assumption in RL theory is that this set of parameters should fully describe the system (the Markov property): it should not matter to us what happened to the system in previous steps, only what we observe at the current moment in time.



What we learned from the example:



  • When choosing the optimal action, the state of the system must be taken into account. The number of states affects the complexity of the algorithm;
  • Parameters that describe the state of the system should give complete information about the system at the current time.


Advanced example



Now let's look at the game of chess.



The number of possible positions of the pieces on the board is a number with 52 digits. And this is not the only difficulty. The difference from the previous two tasks is that in chess it is important to choose not the action that brings the maximum result right now, but the one that will lead to victory in the future (many moves ahead).



What we learned from the example:



  • When making a decision, consider the long-term effect, not the immediate benefit.


Now, using examples, we will define the generally accepted terms of RL.



Basic RL terminology



An agent is an entity that interacts with the environment: it performs certain actions, receives feedback from the environment and remembers it.



  • For example, the motor that drives the cart with the pole, or the player pulling the handles of the multi-armed bandit, are agents.


Environment - the place where the agent exists and from which it receives feedback.



The feedback that the agent receives from the environment usually has some uncertainty.



  • For example, when the cart with the pole makes a move, the feedback on the action taken is whether the pole has fallen or not. The cart and the pole are the environment.


State - any knowledge that helps to make decisions. States refer to the environment and define it uniquely at each moment in time. As a rule, such states are written as a set of parameters, matrices or higher order tensors.



  • For example, the current position of the pieces on a chessboard is a state.


Action - the actions available to the agent. As a rule, the set of available actions is finite.

  • For example, moving the cart with the pole to the right or left are actions.


Reward - instant feedback that an agent receives for actions. That is, it is the result of the action taken. The reward is always a number.



  • For example, the payoff from the slot machine in the multi-armed bandit problem is a reward.


Goal - As a rule, the goal of the agent is to maximize the total reward. In other words, the ultimate goal is to maximize the reward not at the current step, but the final reward based on the results of the sequence of steps.



  • For example, our goal is not to keep the pole upright for a single step, but for as long as possible.


Strategy (policy) - a mapping from states to actions. For example, the probability of choosing action A in state S.



Formal problem statement



  1. At each step, the environment is in some state s ∈ S.
  2. At each step, the agent selects an action a ∈ A from the available set of actions, according to some strategy π.
  3. The environment tells the agent what reward r it received for this action and what state s′ ∈ S it ended up in.
  4. The agent adjusts the strategy π (this loop is sketched in code right below).
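To make this concrete, the four steps above can be written as a generic interaction loop. This is only a schematic sketch: `env` and `agent` are hypothetical objects with the minimal interface assumed in the docstring, not the API of any specific library.

```python
def run_episode(env, agent, max_steps=1000):
    """One episode of the generic RL loop described above.

    Assumed (hypothetical) interface:
      env.reset()  -> initial state s
      env.step(a)  -> (next state s', reward r, done flag)
      agent.act(s) -> action chosen according to its current strategy pi
      agent.update(s, a, r, s_next) -> adjust the strategy from the feedback
    """
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                        # step 2: choose an action by the current strategy
        next_state, reward, done = env.step(action)      # step 3: environment returns the reward and new state
        agent.update(state, action, reward, next_state)  # step 4: adjust the strategy
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```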


Everything seems simple. One question remains unsolved: where does the mysterious strategy π come from, that is, how does the agent make a decision at each step?



Since in the final part of the article a solution based on Q-learning will be proposed, we will deliberately focus only on tabular methods.



RL Tabular Algorithms



Some of the fundamental RL methods are tabular methods, used for tasks in which the sets of states and actions are finite. A characteristic feature of such methods is the use of a State-Action table: rows correspond to states, columns to actions, and the cells contain the values of the value function.






Q(s_i; a_j) is the value of action a_j in state s_i. Roughly speaking, it is the expected benefit we will receive if we choose action a_j while in state s_i. At the first step, the values Q(s_i; a_j) are initialized, for example, with zeros.



For the maze example, the initial State-Action table would have one row per maze cell, one column per possible move, and all values set to zero.






Here, the state is the position (maze cell) in which the agent is located. After performing an action, the agent changes its state and receives a reward. In this task, the reward can be as follows:



  • 1 if the object has found a way out of the maze;
  • 0 otherwise.


In addition, after the agent has received actual feedback from the environment, the value Q(s_i; a_j) is corrected. The correction algorithms differ: for example, the Monte Carlo method, SARSA, Q-learning. Read more about them here or here.



For example, the Q-learning and SARSA formulas look very similar at first glance:



Q-learning: Q(s_i, a_j) ← Q(s_i, a_j) + α · [ r_i + γ · max_a Q(s_k, a) − Q(s_i, a_j) ]

SARSA: Q(s_i, a_j) ← Q(s_i, a_j) + α · [ r_i + γ · Q(s_k, a_next) − Q(s_i, a_j) ]

Here α is the learning rate, γ is the discount factor for future rewards, r_i is the reward received, s_k is the new state, and a_next is the action the agent will actually choose in s_k.



Both methods use the expected value of an action at the next step. It is obtained very simply: suppose the agent is in state s_i and performs action a_j. The environment then informs the agent that, as a result of this action, it receives reward r_i and a new state s_k. Using the State-Action table, you can find the row for state s_k and determine what value each action would bring in it.



The difference is that in Q-learning, Q(s_k; a) is always the maximum value in the new state, while SARSA assumes that the agent simulates the choice of the next action in state s_k, for example according to an ε-greedy or UCB strategy. When a purely greedy strategy is used, the two methods are equivalent.
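As an illustration, here is a minimal tabular sketch of both update rules in Python. It is a sketch under the assumptions above: the dictionary-based table, the constants and the ε-greedy helper are implementation choices, not part of the original method description.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate

Q = defaultdict(float)              # the State-Action table: Q[(state, action)]

def eps_greedy(state, actions):
    """ε-greedy choice over the current table (assumed helper)."""
    if random.random() < EPS:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next, actions):
    # Q-learning: bootstrap from the *best* action in the new state.
    target = r + GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    # SARSA: bootstrap from the action the agent will *actually* take next
    # (a_next itself chosen, for example, by eps_greedy).
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```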



The disadvantage of such algorithms is the need to store the State-Action table. Some tasks have a large space of states and actions, which makes classic tabular methods impossible to use. In such cases, the values Q(s_i; a_j) are approximated, for example, with neural networks.



Dynamic programming can be an alternative to tabular methods. We will not dwell on these algorithms, but we recommend the book Reinforcement Learning by R. S. Sutton and A. G. Barto.



This is where we finish the theory and then talk about how Reinforcement Learning can be used in an applied task.



Finding the best customer incentive strategy using Reinforcement Learning



Problem statement in business terms



Constraints under which our approach was developed:



  • The solution must be flexible to the limitations of the communication policy with clients;
  • The function to be optimized must be driven by business objectives and can be more complex than simple response;
  • The system should self-adapt to changes in customer behavior without involving an expert;
  • The solution must be able to start working without a history of past campaign launches;
  • The solution must scale to the entire client base.


Problem statement in RL terms



So,

Agent and environment: the agent is the loyalty program system that sends the client communications with marketing offers, and the environment is the client himself.



State is the state of the client, which is characterized by client attributes.



Actions are marketing offers (for example, “get X% off purchase Y”). It is assumed that the list of proposals is fixed and finite.



Reward is some function of changing customer behavior (for example, increasing revenue or responding to a targeted campaign).






Solution approach



Now let's look at the possible solutions using the Reinforcement Learning tabular methods.



The solution algorithm using Q-learning or SARSA can be as follows:



1. Defining client states



Client state can be specified using client attributes. Most of these attributes are real numbers, so before using tabular methods, the attributes should be discretized to obtain a finite set of states.



In our solution, we used as client states the clusters obtained by clustering the client base on the selected attributes (a clustering sketch follows the list below). The number of clusters affects how quickly the algorithm learns. General recommendations are as follows:



  • in order to manage the flow of customers from cluster to cluster, the list of attributes must include ones that can change under the influence of marketing offers and the reaction to them;
  • within each cluster, clients must be homogeneous in behavior;
  • it must be possible to update the attributes on a regular basis;
  • each cluster must contain more clients than an established minimum (the minimum may be dictated, for example, by the number of clients needed for the results to be statistically meaningful).
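A minimal sketch of this discretization step, assuming scikit-learn is available; the attribute names and the number of clusters are illustrative examples, not the ones used in our solution.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative attribute table; the column names are hypothetical examples
# of the customer attributes from the glossary above.
clients = pd.DataFrame({
    "avg_check":        [450, 1200, 300, 2500, 800],
    "purchases_per_mo": [1.5,  4.0, 0.5,  6.0, 2.0],
    "age":              [25,   41,  33,   52,  29],
})

# Scale the attributes so that no single one dominates the distance metric.
X = StandardScaler().fit_transform(clients)

# The number of clusters is a modelling choice: more clusters mean finer
# states, but each state then needs more clients and more history to learn.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
clients["state"] = kmeans.labels_   # the cluster id becomes the RL state
```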


2. Choice of reward



The choice of reward is the most important stage in the development of the system. For this task, the reward can characterize the success of the campaign. For example, the possible options are:



  • Conversion per offer;
  • Increase in response to an offer;
  • Specific revenue per campaign participant;
  • Specific profit taking into account costs;
  • ...


Returning to the problem of increasing customer loyalty, the target metric can be LTV or the metric of customer proximity to the loyal segment.



In any case, the choice of reward should be consistent with the goals of marketing.



P.S. Some of the proposed reward options are calculated in aggregate over a group of customers (for example, the response uplift of an offer is the response in the target group minus the response in the control group). In this case, it is more correct to say that we choose an action not for an individual client but for a cluster of clients (those in the same state), within which the reward is calculated.
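As a small illustration of such an aggregated reward, here is a sketch of computing response uplift per state. The column names ('state', 'group', 'responded') are assumptions made for the example, not a description of our actual data model.

```python
import pandas as pd

def response_uplift_by_state(df: pd.DataFrame) -> pd.Series:
    """Reward per state = response rate in the target group minus the
    response rate in the control group.

    Assumed columns: 'state' (cluster id), 'group' ('target' or 'control'),
    'responded' (0 or 1 per client).
    """
    rates = (df.groupby(["state", "group"])["responded"]
               .mean()
               .unstack("group"))
    return rates["target"] - rates["control"]

# Usage: uplift = response_uplift_by_state(campaign_results);
# uplift[state] is then used as the reward r when updating the
# State-Action table for that state.
```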



3. Choice of possible actions



Actions are marketing proposals that can be sent to the customer. When choosing marketing campaigns to be used in the system, keep in mind:



  • the marketing proposal should not change from launch to launch;
  • the number of offers affects the learning rate of the algorithm;
  • a scenario should be considered in which none of the campaigns is suitable for a state (for example, all offer variants bring negative revenue). In this case, one of the actions can be a “default campaign”: either some basic mailing that can be sent to all customers, or no offer at all (it may turn out to be more profitable not to send the customer anything).


4. Designing a selection algorithm subject to constraints



When designing an algorithm, one should consider:



  1. Restrictions of the communication policy (for example, a client who has just bought an iPhone should not immediately be offered another iPhone).
  2. The balance between exploring new offers and exploiting the knowledge already accumulated, for example via an ε-greedy or UCB strategy (see the selection sketch after this list).
  3. The difference between the Q-learning and SARSA update rules: they estimate the value of the next step differently, so the choice between them should match how actions are actually selected.
  4. The way the reward is calculated: some reward options (such as response uplift) are defined for a group of clients rather than for an individual one.
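Below is a sketch of how constrained ε-greedy selection might look, assuming a predicate `is_allowed(client, offer)` that encodes the communication-policy restrictions; the function and its arguments are hypothetical.

```python
import random

def select_offer(state, client, q_table, all_offers, is_allowed, eps=0.1):
    """Pick a marketing offer for a client with an ε-greedy strategy,
    restricted to offers permitted by the communication policy.

    `is_allowed(client, offer)` is an assumed predicate encoding the
    constraints (recent purchases, contact frequency limits, etc.).
    """
    allowed = [o for o in all_offers if is_allowed(client, o)]
    if not allowed:
        return "default-campaign"    # the fallback action discussed above
    if random.random() < eps:        # exploration
        return random.choice(allowed)
    # exploitation: the allowed offer with the highest State-Action value
    return max(allowed, key=lambda o: q_table.get((state, o), 0.0))
```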


5. Initialization of the State-Action table



Initially, the State-Action table is filled with zeros: one row per client state and one column per marketing campaign.






The system can be launched even in the absence of historical launches of the selected campaigns, which is an important advantage of the concept.



However, if there is some history, then it can be used, that is, retrospective pre-training of the State-Action table is possible:



  1. Initialize the State-Action table with zeros
  2. Take the historical launch of the campaign X. Calculate the states of the clients participating in the campaign at the time of launch and at the end of the campaign. Calculate the reward received in each state.
  3. Using the Q-learning or SARSA formula, recalculate the State-Action table, taking into account the expected value of the campaigns at the next launch (a sketch of this replay is given below).
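A minimal sketch of this retrospective pre-training, assuming the history has already been prepared as (state at launch, campaign, reward, state at end) records per client or per cluster; the record layout is an assumption for the example.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9
Q = defaultdict(float)              # step 1: the table, initialized with zeros

def pretrain_from_history(history, campaigns):
    """Replay historical campaign launches through the Q-learning update.

    `history` is assumed to be an iterable of tuples
    (state_at_launch, campaign, reward, state_at_end), as in step 2 above.
    """
    for s, campaign, r, s_next in history:
        # step 3: Q-learning update with the expected value at the next launch
        target = r + GAMMA * max(Q[(s_next, c)] for c in campaigns)
        Q[(s, campaign)] += ALPHA * (target - Q[(s, campaign)])
```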


6. Training the algorithm on pilot launches



The goal of our system is to learn how to select the best offers for the entire client base. However, at the stage of testing the system, we advise you to conduct pilot runs on a small representative sample of clients.



What you need to pay attention to at this stage:



  1. Changes in the values in the State-Action table: as history accumulates, the values in the State-Action table should become more and more stable;
  2. Positive dynamics of the effect of campaigns: from launch to launch, the effectiveness of each marketing proposal should grow.


As soon as (1) and (2) reach a plateau, we can assume that the system is ready to roll out to the entire customer base.
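One simple way to track point (1) between launches is to compare snapshots of the table; a sketch, assuming the table is stored as a dictionary keyed by (state, action):

```python
def max_table_change(q_prev: dict, q_curr: dict) -> float:
    """Largest absolute change of a State-Action value between two launches.

    When this number stops decreasing and stays small from launch to launch,
    point (1) above can be considered to have reached its plateau.
    """
    keys = set(q_prev) | set(q_curr)
    return max((abs(q_curr.get(k, 0.0) - q_prev.get(k, 0.0)) for k in keys),
               default=0.0)
```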



7. Rolling out the system



Before starting the rollout, it is advisable to analyze the stability of campaign results in the context of each client state. As practice shows, despite overall stability, some states may have insufficient history, or the states themselves may be unstable over time, which leads to unstable results.



Thus, we have developed the following recommendations for the rollout:



  • Exclude unstable states from the rollout;
  • Use an ε-greedy strategy so that the system can independently adjust to changes in the behavior of the customer base;
  • Continue regular monitoring of system performance.


So, in this article we have tried to describe the high-level concept of our approach. The results of the system operation based on the proposed algorithm can be found here .



Conclusion



We have described the use of RL to solve the problem of selecting the optimal chain of actions. However, it should be mentioned that a similar concept can be applied to other marketing tasks, for example, recommendation systems, choosing the optimal communication channel / time, or selecting a personal banner on the site. Despite the fact that Reinforcement Learning is inferior in popularity to traditional ML methods, we wanted to convey to the reader that RL can be an excellent solution if there is a need to maintain automatic system retraining or fully train the system from scratch.



The GlowByte team would like to thank X5 Retail Group for the opportunity to implement this case.


