I have been speaking lately about how various on-line algorithms have substantial potential for various real-time learning applications. The most notable of these algorithms are Thompson sampling for real-time handling of multi-armed bandit and contextual bandit problems and an algorithm due to Shindler, Myerson and Wang for fast $k$-means clustering of data. Since I have had a number of requests for references back to the original sources for these works, I figured a few blog posts would be a good thing. This post will describe the multi-armed bandit work and the next will describe the clustering work.

#### The Basic Problem - Multi-armed Bandits

For the bandit problems, there are two basic problems to be dealt with. The first and most basic problem is that of the multi-armed bandit. In this problem, you can sample from any of a finite number of distributions and your goal is to maximize the average of the values that you get. This can be cast into a number of practical settings in which you select which slot machine to put a quarter into, or which on-line ad to present to a user, or which landing page a user's browser should see when they visit a particular URL. It is common to simplify this case further by assuming that the distributions are stationary. Obviously, at least one of the distributions you are picking from has a mean at least as large as the mean of any alternative. Any time you take a sample from a distribution that has a smaller mean, you fall behind the theoretical best, on average, that you could have achieved by picking from (one of) the best distributions. The degree by which you fall behind is known as the regret that you incur.

The key to the multi-armed bandit problem is that you cannot know in advance which distribution has the largest mean. This means that you have to sample all of the distributions in order to estimate their means, which in turn means that you have to sample from the lesser distributions in order to determine that they are, in fact, inferior.
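To make the notion of regret concrete, here is one common way to write it down. The symbols are my own notation for this sketch rather than anything taken from the papers: $\mu_i$ is the mean of distribution $i$, $\mu^* = \max_i \mu_i$ is the best mean, $a_t$ is the distribution sampled at step $t$, and $N_i(T)$ is the number of times distribution $i$ was sampled in the first $T$ steps.

$$
R_T \;=\; T\mu^* - \sum_{t=1}^{T} \mu_{a_t} \;=\; \sum_{i} \left(\mu^* - \mu_i\right) N_i(T)
$$

As a small worked example, suppose three distributions have means $0.1$, $0.5$ and $0.9$, and in $T = 100$ steps they were sampled $10$, $20$ and $70$ times respectively. Then $R_{100} = 0.8 \cdot 10 + 0.4 \cdot 20 + 0 \cdot 70 = 16$, which is the expected payoff given up by not always sampling the best distribution.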

There are well-known bounds on how well you can actually solve this problem. There are also a number of algorithms whose regret matches these bounds or comes reasonably close. Mostly, however, these known solutions have limitations either on the number of distributions they can consider or on the complexity of the solution.

Kuleshov and Precup provide some good examples of how to compare different bandit algorithms in their paper. This tutorial on bandits provides a wider view of different forms of multi-armed bandit problems with a number of references.

Conspicuously missing from most lists of references, however, is all the recent work using Thompson sampling. These algorithms, which I have referred to as Bayesian Bandits, have particularly nice properties of simplicity and optimality. Chapelle and Li provide an empirical look at the performance of these algorithms compared to upper confidence bound (UCB) algorithms. The last paragraph of that paper laments the lack of a theoretical analysis of these algorithms, but that lack was cured shortly afterward in this paper by Agrawal and Goyal. Scott provided a more comprehensive view of these algorithms under the name of randomized probability matching.

The idea behind Bayesian Bandits is quite simple. For each bandit, we use the observations so far to build a posterior distribution for the mean of the associated payoff distribution. For binary payoffs, it is common to use a $\beta$-binomial distribution and for other cases a $\gamma$-normal distribution works well. To pick a bandit, we sample a mean for each bandit from these posterior distributions and then pick the bandit with the largest sampled mean. The new sample from that bandit gives us more data, which refines the posterior distribution for that bandit. We can repeat this process as long as desired.
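The loop described above can be sketched in a few lines for the binary payoff case. This is a minimal illustration, not code from any of the papers: the function names `thompson_step` and `run_bandit` are hypothetical, the uniform Beta(1, 1) prior is an assumption, and the payoffs are simulated from made-up true means.

```python
import random


def thompson_step(successes, failures, rng):
    """Sample a mean for each bandit from its Beta posterior; return the best index."""
    # Assumed prior: Beta(1, 1), so the posterior is Beta(successes + 1, failures + 1).
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run_bandit(true_means, steps, seed=0):
    """Simulate binary payoffs and return per-bandit success and failure counts."""
    rng = random.Random(seed)
    successes = [0] * len(true_means)
    failures = [0] * len(true_means)
    for _ in range(steps):
        arm = thompson_step(successes, failures, rng)
        # Simulated payoff: 1 with probability true_means[arm], else 0.
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


if __name__ == "__main__":
    wins, losses = run_bandit([0.1, 0.5, 0.9], steps=2000, seed=1)
    print("pulls per bandit:", [w + l for w, l in zip(wins, losses)])
```

Note that there is no explicit exploration parameter to tune. A bandit with few observations has a wide posterior, so it occasionally produces the largest sampled mean and gets tried again, while a bandit that is clearly inferior is sampled less and less often.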

#### Extensions to Contextual Bandits

One of the most important characteristics of the Thompson sampling approaches (aka randomized probability matching aka Bayesian Bandits) is that they can be extended to more complex situations. One setting that I have found particularly useful involves optimizing return not just from a few bandits, but from a parameterized set of bandits that could conceivably even be infinite. The transformation from the parameters to the bandit distribution is unknown, but if we knew it, we would be able to search the parameter space to find the bandit with the highest mean payoff.

This formulation is a generalization of the previous case because we can take the parameter to be an integer from $1 \ldots k$ where there are $k$ bandits and the transformation consists of the mean payoffs for each of the $k$ bandits.

The algorithm in the contextual case simply consists of sampling the transformation from some posterior distribution and then solving for the parameters of the bandit that we would like to use. Some of the parameters might be fixed by the context we are working in, which is where the name contextual bandits comes from.
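As a deliberately simple sketch of this idea, consider the case where both the context and the free parameter (the action) take a small number of discrete values and the payoff is binary. The "transformation" is then just a table of mean payoffs indexed by (context, action), and each entry can have its own Beta posterior. The class name `TabularContextualThompson` and the Beta(1, 1) prior are assumptions made for illustration; systems such as the one in the Graepel et al. paper use richer models that generalize across contexts, which this sketch does not attempt.

```python
import random
from collections import defaultdict


class TabularContextualThompson:
    """Thompson sampling with an independent Beta posterior per (context, action)."""

    def __init__(self, actions, seed=0):
        self.actions = list(actions)
        self.rng = random.Random(seed)
        # counts[(context, action)] == [successes, failures]
        self.counts = defaultdict(lambda: [0, 0])

    def choose(self, context):
        """Sample the payoff table for this context, then maximize over actions."""

        def sampled_mean(action):
            s, f = self.counts[(context, action)]
            # Assumed prior: Beta(1, 1).
            return self.rng.betavariate(s + 1, f + 1)

        # The context is fixed by the environment; only the action is free.
        return max(self.actions, key=sampled_mean)

    def update(self, context, action, reward):
        """Record a binary payoff for the chosen (context, action) pair."""
        self.counts[(context, action)][0 if reward else 1] += 1


if __name__ == "__main__":
    # Made-up payoff probabilities: the best action depends on the context.
    truth = {("m", "a"): 0.9, ("m", "b"): 0.1, ("d", "a"): 0.1, ("d", "b"): 0.9}
    agent = TabularContextualThompson(["a", "b"], seed=3)
    env = random.Random(7)
    for _ in range(3000):
        ctx = env.choice(["m", "d"])
        act = agent.choose(ctx)
        agent.update(ctx, act, env.random() < truth[(ctx, act)])
    print({key: sum(value) for key, value in sorted(agent.counts.items())})
```

With a parametric model in place of the table, the `choose` step would instead sample model parameters from their posterior and then search over the free parameters with the context held fixed.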

The paper by Scott alludes to this formulation, but the most approachable work on this that I know of is the paper by Graepel, Candela, Borchert, and Herbrich. In this paper, they describe the operation of AdPredictor, a system used by the Bing search engine to target ads using context.