## Tuesday, March 22, 2011

### Exponential weighted averages with irregular sampling

Ted Yu on the hbase list wants to compute an exponential weighted average for the number of queries per second or other statistics that characterize the performance or state of an hbase system.

The trick is that the measurements are only available at irregular intervals. If they were sampled regularly, then the standard mixing trick would work:
$m_{n+1} = \mu m_n + (1-\mu) x_{n+1}$
where $m$ is our current estimate of the mean, $x_n$ is the $n$-th sample and $\mu$ determines how much history to use.

With unequal sample times, things become a bit more complicated. If we get lots of measurements all at once, we want to give them nearly equal weight but if we have a long gap, we want to weight the very old samples much less.

In fact, we want to weight old samples according to how old they are with exponentially decreasing weight. If we sample values $\left \lbrace x_1 \ldots x_n \right \rbrace$ at times $t_1 \ldots t_n$ then we want the weighted mean defined by
$m_n = {\sum_{i=1}^n x_i e^{-(t_n - t_i)/\alpha} \over \sum_{i=1}^n e^{-(t_n - t_i)/\alpha} }$
Here $\alpha$ plays the same role as $\mu$ did before, but on a different scale. If the evenly sampled data comes at time intervals $\Delta t$ then $\mu = e^{\Delta t / \alpha}$.

Happily, there is a very simple recurrence relationship that allows us to keep only two intermediate values while computing the value of $m_1 \ldots m_n$ in an entirely on-line fashion as the $x_i$ values arrive.

To see this, define

\begin{eqnarray*}
\pi_n &=& e^{-(t_{n+1}-t_n)/\alpha} \\
w_{n+1} &=&
\sum_{i=1}^{n+1} e^{-(t_{n+1} - t_i)/\alpha} =
1+e^{-(t_{n+1}-t_n)/\alpha} \sum_{i=1}^{n} e^{-(t_{n} - t_i)/\alpha} \\
& =& 1 + \pi w_n\\
s_{n+1} &=&
\sum_{i=1}^{n+1} x_i e^{-(t_{n+1} - t_i)/\alpha} =
x_{n+1}+e^{-(t_{n+1}-t_n)/\alpha} \sum_{i=1}^{n} x_i e^{-(t_{n} - t_i)/\alpha} \\
&=& x_{n+1} + \pi_n s_n
\end{eqnarray*}

Then note that
$m_{n+1} = {s_{n+1} \over w_{n+1}}$

This leads naturally to a procedure that has state consisting of $t, w, m$ which are updated with using new values of $t_n, x_n$ according to
\begin{eqnarray*}
\pi &=& e^{-(t_{n}-t)/\alpha} \\
w &=& 1 + \pi w \\
s &=& x_n + \pi s \\
m &=& {s \over w} \\
t &=& t_{n}
\end{eqnarray*}
Isn't that a kick!

To do this right, however, we need a test. Here are some data vectors computed for $\alpha=5$:

t          x        pi        w          s          m
1 11.35718  1.5992071 1.0000000 1.000000  1.5992071  1.5992071
2 21.54637 -1.3577032 0.1303100 1.130310 -1.1493105 -1.0168100
3 28.91061 -0.3405638 0.2292718 1.259148 -0.6040683 -0.4797436
4 33.03586  0.7048632 0.4382129 1.551775  0.4401527  0.2836447
5 39.57767  0.3020558 0.2702621 1.419386  0.4210124  0.2966159

Writing a proper unit test is an exercise left to the reader. (but I will be happy to help)

Nathan said...

Ted, pardon my bad math, but I dont understand how you are calculating pi in your example. I can't get to .13031 no matter what I do. ln .13031 is -2 ish, which doesn't seem to come from t1 - t0, which is about 10 - what am I missing?

Thanks,

Nathan

Ted Dunning ... apparently Bayesian said...

Nathan,

The issue is that I dropped a division by α in the expression for $\pi$. For the example, α=5 so for the second line, $\pi = \exp{-(21.54637-11.35718) \over 5} = e^{-2.04} = 0.1303101$

Ted Dunning ... apparently Bayesian said...

And, by the way, thanks for noticing that. There was also a missing minus sign on an earlier expression that I found as a result of looking for the problem you spotted.

Let me know if you have any other problems with this.

Nathan said...

That worked, thank you much. Your CAPTCHA is nearly impossible, BTW. Or maybe I'm a robot and it is just doing its job. :)

one_fell_swoop said...

Just as an FYI to anyone ending up here, there's a truly excellent paper with many useful online time-series operators for inhomogeneous data: