## Friday, October 26, 2012

### References for On-line Algorithms

[See also my more recent blog on this talking about using bandits for ranking rated items like comments]

I have been speaking lately about how various on-line algorithms have substantial potential for various real-time learning applications.  The most notable of these algorithms are Thompson sampling for real-time handling of multi-armed bandit and contextual bandit problems and an algorithm due to Shindler, Myerson and Wang for fast $k$-means clustering of data.  Since I have had a number of requests for references back to the original sources for these works, I figured a few blog posts would be a good thing.  This post will describe the multi-armed bandit work and the next will describe the clustering work.

#### The Basic Problem - Multi-armed Bandits

For the bandit problems, there are two basic problems to be dealt with.  The first and most basic problem is that of the multi-armed bandit.  In this problem, you can sample from any of a finite number of distributions and your goal is to maximize the average value of the values that you get.  This can be cast into a number of practical settings in which you select which slot machine to put a quarter into, or you select which on-line ad to present to a user or you select which landing page to deliver to a user's browser should see when they visit a particular URL.  It is common to simplify this case further by assuming a stationary distribution.  Obviously, at least one of the distributions you are picking from has a mean equal to the large mean of any alternative.  Any time you take a sample from a distribution that has a smaller mean, you fall behind the theoretical best, on average, that you could have achieved by picking from (one of) the best distributions.  The degree by which you fall behind is known as the regret that you incur.

The key to the multi-armed bandit problem is that you cannot know which distribution might have the largest mean.  This means that you have to sample all of the distributions in order to estimate their means, but this implies that you have to sample from the lesser distributions in order to determine that their are, in fact, inferior.

There are well known bounds on how well you can actually solve this problem.  There are also a number of algorithms that have regret on par with these bounds or come reasonably close to these bounds.  Mostly, however, these known solutions have limitations either on the number of distributions they can consider or on the complexity of the solution.

Kuleshov and Precup provide some good examples of how to compare different bandit algorithms in their paper.  This tutorial on bandits provides a wider view of different forms of multi-armed bandit problems with a number of references.

Conspicuously missing from most lists of references, however, is all the recent work using Thompson sampling.  These algorithms, which I have referred to as Bayesian Bandits, have particularly nice properties of simplicity and optimality.  Chapelle and Li provide an empirical look at performance with these algorithms compared to upper confidence bound (UCB) algorithms.  The last paragraph of that paper laments the lack of a theoretical analysis of these algorithms, but that lack was cured shortly in this paper by Agrawal and Goyal.  Scott provided a more comprehensive view of these algorithms under the name of randomized probability matching.

The idea behind Bayesian Bandits is quite simple.  For each bandit, we maintain use the observations so far to build a posterior distribution for the mean of the associated payoff distribution.  For binary payoffs, it is common to use a $\beta$-binomial distribution and for other cases a $\gamma$-normal distribution works well.  To pick a bandit, we sample a mean for each bandit from these posterior distributions and then pick the bandit with the largest sampled mean.  The new sample from that bandit gives us more data which refines the posterior distribution for that bandit.  We can repeat this process as long as desired.

#### Extensions to Contextual Bandits

One of the most important characteristics of the Thompson sampling approaches (aka randomized probability matching aka Bayesian Bandits) is that they can be extended to more complex situations.  One setting that I have found particularly useful involves optimizing return not just from a few bandits, but from a parameterized set of bandits that could conceivably even be infinite.  The transformation from the parameters to the bandit distribution is unknown, but if we could know that, we would be able to search the parameter space to find the bandit with the highest mean payoff.

This formulation is a generalization of the previous case because we can take the parameter to be an integer from $1 \ldots k$ where there are $k$ bandits and the transformation consists of the mean payoffs for each of the $k$ bandits.

The algorithm in the contextual case simply consists of sampling the transformation from some posterior distribution and then solving for the parameters of the bandit that we would like to use.  Some of the parameters might be fixed by the context we are working in which is where the name contextual bandits comes in.

The paper by Scott alludes to this formulation, but the most approachable work on this that I know of is the paper by Graepel, Candela, Borchert, and Herbrich.  In this paper, they describe the operation of AdPredictor, a system used by the Bing search engine to target ads using context.

## Wednesday, February 8, 2012

### Bayesian Bandits

The basic idea with Bayesian Bandits is to solve the problem of the explore/exploit trade-off in a multi-armed bandit by keeping a distributional estimates for the probability of payoff for each bandit, but avoiding the cost of manipulating distributions by sampling from those distributions and then proceeding on each iteration as if that were a point estimate.

The advantage that this is that bandwidth assignments will be made according to the best estimate of the probability that each bandit could possibly be the best.  This automatically causes the system to do exploration as long as it is plausible and causes the system to smoothly transition to exploitation when it becomes clear which bandit is the best.  Essentially what this gives us is a Bayesian implementation of active learning.

To illustrate this, here is a graph that shows the posterior distribution of conversion probability for two bandits where we have lots of history for one and only a little history for the other.

You can see that the red bandit with 100 conversions out of 1000 impressions mostly like has a probability of conversion of 0.1, more or less a bit.  The blue bandit with no conversions out of 10 impressions is very likely worse than the red bandit, but there is a small possibility that it is better.  If we just picked the average or mode of these distributions, we would conclude that the blue bandit is worse and wouldn't give it any bandwidth without substantial mechanisms to over-ride this decision.

On the other hand, if we estimate a conversion probability by sampling and use that sampled estimate for targeting, then we will give the blue bandit a little bandwidth and thus a chance to redeem itself.

There are several aspects of the Bayesian Bandit algorithm that are exciting.

• exploration and exploitation are handled uniformly in the same framework
• our internal representation encodes all of the information that we have so we don't confuse evidence of failure with lack of evidence for success
• the algorithm only requires a few lines of code
• the updates for the algorithm can be expressed in terms of back-propagation or stochastic gradient descent
• the performance is really good.
As a bit of a teaser, here are a few graphs that describe how the Bayesian Bandit converges in simulated runs. The first graph shows how the average total regret for the Bayesian Bandit algorithm approaches the ideal as the number of trials increases. This experiment uses normally distributed rewards with $\sigma = 0.1$.  Any algorithm that meets the optimal $O(\log n)$ convergence lower bound is said to "solve" the bandit problem.

The convergence here is very close to the ideal convergence rate.  Note that this graph includes the convergence to optimal payoff, not the convergence knowing which is the better bandit.  This is actually an interesting aspect of the problem since the algorithm will converge almost instantly for cases where the conversion probabilities are highly disparate which will make the payoff converge quickly.  For cases where the conversion probabilities are nearly the same, it will take a long time for the algorithm to determine which is the better bandit, but exploration is not expensive in such a case so the convergence to near-optimal payoff will be even faster than the case where the conversion rates are very different.

For example, here is a graph of the probability of picking the better bandit where the conversion rates are nearly the same.  As you can see, it takes quite a while for the algorithm to split these two options.   The average payoff, however, only changes from 0.11 to 0.12 during this entire convergence and it has already reached 0.118 by the time it is 20% into the process so the cost of a long experiment is not that high.

Sample code for the Bayesian bandit is available at https://github.com/tdunning/storm-counts.