The advantage of this is that bandwidth is assigned according to our best estimate of the probability that each bandit is the best. The system automatically keeps exploring for as long as it is plausible that a less-tested bandit could be the best, and it transitions smoothly to exploitation once it becomes clear which bandit is the best. Essentially, this gives us a Bayesian implementation of active learning.

To illustrate this, here is a graph that shows the posterior distribution of conversion probability for two bandits where we have lots of history for one and only a little history for the other.

You can see that the red bandit, with 100 conversions out of 1000 impressions, most likely has a conversion probability of about 0.1, give or take a bit. The blue bandit, with no conversions out of 10 impressions, is very likely worse than the red bandit, but there is a small possibility that it is better. If we just picked the mean or mode of these distributions, we would conclude that the blue bandit is worse and would give it no bandwidth at all unless we added substantial mechanisms to override this decision.

On the other hand, if we estimate a conversion probability by sampling and use that sampled estimate for targeting, then we will give the blue bandit a little bandwidth and thus a chance to redeem itself.
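The contrast between the two selection rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a uniform Beta(1, 1) prior, which is not specified above, and the function names are made up for this example. Under that prior, choosing by posterior mean always picks red, while choosing by a sampled estimate sends a minority share of picks to blue. The exact size of that share depends on the prior.

```python
import random

# History from the example above: (conversions, impressions).
HISTORY = {"red": (100, 1000), "blue": (0, 10)}


def posterior_mean(conversions, impressions):
    """Mean of the Beta(1 + conversions, 1 + failures) posterior (uniform prior assumed)."""
    return (1 + conversions) / (2 + impressions)


def sampled_rate(conversions, impressions, rng):
    """One random draw from the same Beta posterior."""
    failures = impressions - conversions
    return rng.betavariate(1 + conversions, 1 + failures)


def pick_by_mean(history):
    """Always choose the bandit with the highest posterior mean."""
    return max(history, key=lambda name: posterior_mean(*history[name]))


def pick_by_sampling(history, rng):
    """Choose the bandit whose sampled conversion rate is highest."""
    return max(history, key=lambda name: sampled_rate(*history[name], rng))


rng = random.Random(42)
picks = [pick_by_sampling(HISTORY, rng) for _ in range(10_000)]
blue_share = picks.count("blue") / len(picks)

print("choice by posterior mean:", pick_by_mean(HISTORY))
print("share of sampled picks going to blue:", blue_share)
```

As blue accumulates more impressions without conversions, its posterior narrows and its share of sampled picks shrinks without any extra override logic.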

There are several aspects of the Bayesian Bandit algorithm that are exciting.

- exploration and exploitation are handled uniformly in the same framework
- our internal representation encodes all of the information that we have, so we don't confuse evidence of failure with lack of evidence of success
- the algorithm only requires a few lines of code
- the updates for the algorithm can be expressed in terms of back-propagation or stochastic gradient descent
- the performance is really good.
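To make the "few lines of code" point concrete, here is a minimal sketch of a Bayesian bandit for yes/no conversions. It is not the code from the repository linked at the end of this post. The class name `BetaBandit`, the uniform Beta(1, 1) prior, and the conversion rates in the demo are all assumptions made for illustration.

```python
import random


class BetaBandit:
    """Hypothetical minimal Bayesian bandit for conversion (yes/no) rewards."""

    def __init__(self, n_arms, seed=None):
        self.conversions = [0] * n_arms
        self.impressions = [0] * n_arms
        self.rng = random.Random(seed)

    def select(self):
        """Sample a conversion rate per arm from its posterior; play the largest."""
        samples = [
            self.rng.betavariate(1 + c, 1 + n - c)  # uniform prior assumed
            for c, n in zip(self.conversions, self.impressions)
        ]
        return samples.index(max(samples))

    def update(self, arm, converted):
        """Record the outcome of one impression on the chosen arm."""
        self.impressions[arm] += 1
        self.conversions[arm] += 1 if converted else 0


def run_demo(rates=(0.05, 0.15), trials=2000, seed=1):
    """Simulate visitors against made-up true conversion rates."""
    world = random.Random(seed)
    bandit = BetaBandit(len(rates), seed=seed + 1)
    for _ in range(trials):
        arm = bandit.select()
        bandit.update(arm, world.random() < rates[arm])
    return bandit


bandit = run_demo()
print("impressions per arm:", bandit.impressions)
```

The counts are the entire internal state. An arm with few impressions has a wide posterior, so it still wins the sampling step now and then, which is where the exploration comes from.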

As a bit of a teaser, here are a few graphs that describe how the Bayesian Bandit converges in simulated runs. The first graph shows how the average total regret for the Bayesian Bandit algorithm approaches the ideal as the number of trials increases. This experiment uses normally distributed rewards with $\sigma = 0.1$. Any algorithm whose regret meets the optimal $O(\log n)$ lower bound is said to "solve" the bandit problem.
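For reference, one common way to write down regret is sketched here. The notation is an assumption for this note and is not taken from the experiment itself: $\mu_i$ is the expected reward of bandit $i$, $\mu^* = \max_i \mu_i$ is the best of these, and $a_t$ is the bandit chosen on trial $t$. The total regret after $n$ trials is then

$$R_n = n\,\mu^* - \sum_{t=1}^{n} \mu_{a_t}$$

If $R_n$ grows like $O(\log n)$, the per-trial regret $R_n / n$ shrinks toward zero as $n$ increases, so the average payoff approaches $\mu^*$.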

The convergence here is very close to the ideal convergence rate. Note that this graph shows the convergence to optimal payoff, not the convergence to knowing which is the better bandit. This is an interesting aspect of the problem. When the conversion probabilities are highly disparate, the algorithm identifies the better bandit almost instantly, which makes the payoff converge quickly. When the conversion probabilities are nearly the same, it takes a long time for the algorithm to determine which is the better bandit, but exploration is not expensive in such a case, so the convergence to near-optimal payoff is even faster than when the conversion rates are very different.

For example, here is a graph of the probability of picking the better bandit where the conversion rates are nearly the same. As you can see, it takes quite a while for the algorithm to separate these two options. The average payoff, however, only changes from 0.11 to 0.12 during this entire convergence, and it has already reached 0.118 by the time it is 20% into the process, so the cost of a long experiment is not that high.
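A rough way to see why slow identification is cheap here is to simulate two nearly identical bandits and measure the payoff given up per trial. The sketch below rests on assumptions: the true rates of 0.11 and 0.12 are chosen only to echo the payoff range quoted above, rewards are yes/no conversions rather than the normally distributed rewards of the earlier experiment, and the prior is uniform. Whatever the algorithm does, the per-trial regret can never exceed the 0.01 gap between the two rates.

```python
import random


def simulate(rates=(0.11, 0.12), trials=20_000, seed=7):
    """Run a sampling-based bandit and return (plays per arm, mean regret per trial)."""
    rng = random.Random(seed)
    wins = [0] * len(rates)
    plays = [0] * len(rates)
    best = max(rates)
    total_regret = 0.0
    for _ in range(trials):
        # Draw one plausible conversion rate per arm (uniform prior assumed).
        samples = [rng.betavariate(1 + w, 1 + n - w) for w, n in zip(wins, plays)]
        arm = samples.index(max(samples))
        plays[arm] += 1
        wins[arm] += 1 if rng.random() < rates[arm] else 0
        # Expected payoff given up by not playing the best arm on this trial.
        total_regret += best - rates[arm]
    return plays, total_regret / trials


plays, mean_regret = simulate()
print("plays per arm:", plays)
print("mean regret per trial:", mean_regret)
print("expected payoff per trial:", 0.12 - mean_regret)
```

Even if the algorithm spent the whole run unable to tell the two bandits apart, the expected payoff per trial would stay between 0.11 and 0.12.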

Sample code for the Bayesian bandit is available at https://github.com/tdunning/storm-counts.