Online gradient descent has low regret: a one-ish-liner
This post is a bit silly in that this is an obviously well-known result, but I've never seen a direct proof of this that did not at least introduce some additional notation. This proof was particularly enlightening to me in understanding 'why does OCO work', at least, much more so than the standard "follow the regularized leader" proofs (such as those found here) or more 'direct', but magical proofs, such as the one found here. If you're familiar with OCO already, you can scroll directly down to the almost-two-liner proof.
Online convex optimization
The main setting of online convex optimization is as follows. We have two players: us and an adversary. The game proceeds in $T$ rounds. At each round $t = 1, \dots, T$ we have to choose a move $x_t \in \mathbb{R}^n$, and the adversary, after observing our move, gets to then choose some (convex) loss $f_t : \mathbb{R}^n \to \mathbb{R}$ such that we incur a penalty $f_t(x_t)$ for having played move $x_t$.
After this game is over, we then compare our score, which is the sum of all of the losses over the $T$ rounds, to that of the best possible fixed strategy, which we call $x^\star$:

$$
R_T = \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(x^\star).
$$

This is called the regret, and the best possible fixed strategy, $x^\star$, is, of course, the one that minimizes the sum of all losses,

$$
x^\star \in \mathop{\mathrm{argmin}}_{x \in \mathbb{R}^n} \; \sum_{t=1}^T f_t(x),
$$
had we known the losses in advance.
The natural question is: just how small can the regret $R_T$ be? Of course, we need some conditions on the losses $f_t$ and the optimal fixed strategy, $x^\star$, otherwise the adversary can just grow the functions to be arbitrarily large. The simplest conditions, and the ones we will use here, are that (a) the optimal point is bounded, $\|x^\star\| \le D$, and (b) the functions $f_t$ are $G$-Lipschitz: if the $f_t$ are differentiable, this is saying that

$$
\|\nabla f_t(x)\| \le G
$$

for any $x$. (We can relax this condition slightly, but the idea will be the same.) In both cases, we write $\|\cdot\|$ for the usual Euclidean norm, and I'll do this for the rest of the post.
A simple strategy
The simplest strategy that "seems to have the right behavior" is probably something like gradient descent: at round $t+1$, use the gradient of the function at round $t$ (which we didn't know until we played $x_t$, since the adversary chose it after!) and update slightly in the opposite direction. Written out, this is

$$
x_{t+1} = x_t - \eta \nabla f_t(x_t),
$$

starting from $x_1 = 0$, where $\eta > 0$ is some parameter we will set soon, called the step size. (Keep this in mind as this is the definition of $x_t$ we will use throughout.) Note that we can write $x_{t+1}$ purely in terms of the previously-observed gradients, since

$$
x_{t+1} = x_t - \eta \nabla f_t(x_t) = x_{t-1} - \eta\left(\nabla f_{t-1}(x_{t-1}) + \nabla f_t(x_t)\right) = \dots = -\eta \sum_{s=1}^{t} \nabla f_s(x_s).
$$
(We will use this later as well!)
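If it helps to see this written out as code, here is a minimal Python sketch of the strategy above; the gradient oracles, dimension, and step size are stand-ins for whatever problem you have in mind.

```python
import numpy as np

def online_gradient_descent(grad_fns, eta, n):
    """Play online gradient descent against a sequence of losses.

    grad_fns : list of callables; grad_fns[t](x) is the gradient of the
               round-t loss at x (only revealed after we commit to x_t).
    eta      : step size.
    n        : dimension of the decision variable.
    Returns the list of points x_1, ..., x_T that we played.
    """
    x = np.zeros(n)                   # start at x_1 = 0
    played = []
    for grad in grad_fns:
        played.append(x.copy())       # commit to x_t before seeing f_t
        x = x - eta * grad(x)         # x_{t+1} = x_t - eta * grad f_t(x_t)
    return played
```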
This silly strategy will turn out to be extremely useful in the future. Even more interesting is that the strategy only depends on the gradient of $f_t$ at the point $x_t$ we played and no other information! This setting actually comes up in practice (for example in blockchain resource pricing, where you have to set a price before you get to observe the market's reaction) and can be used to analyze the performance of certain algorithms against potential adversaries. (We are writing a paper on this particular topic of OCO applied to resource pricing using the above model, which I'll link here once we post it.)
On the regret
At first glance, this problem feels quite difficult! I mean, look: the adversary can choose any functions $f_t$ in any way, after observing our move $x_t$. It's almost like, given so much power, the adversary can always make the loss roughly linear in the number of rounds $T$: each round, we may expect to lose some (at least) constant amount from a very adversarial choice of $f_t$.
Even the strategy above 'feels like' it's not going to do particularly well: again, the functions are chosen with the knowledge of how we take our steps! Why can we expect to do well at all?
What is surprising is that we will show

$$
R_T \le C\sqrt{T},
$$

where $C$ is some constant that depends on the bound $G$ on the gradients and the bound $D$ on the optimal point $x^\star$. Alternatively, we can write this as: on average, the longer the game continues, the better we perform,

$$
\frac{R_T}{T} \le \frac{C}{\sqrt{T}}.
$$
(Again, it's worth reiterating: this is done even in the presence of adversarially chosen $f_t$.) Indeed, after a long enough time scale, we see that our strategy and the best fixed strategy, with complete knowledge of the future, are, on average, about the same. Algorithms where the average regret vanishes have a bit of a silly name, but I'll put it here for anthropological reasons: they are called no-regret algorithms.
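As a quick numerical sanity check (with losses I've just made up for illustration, simple quadratics centered at points in the unit ball rather than a true adversary), the average regret of the strategy above does indeed shrink toward zero as the game gets longer:

```python
import numpy as np

rng = np.random.default_rng(0)

def average_regret(T, n=5):
    """Average regret of OGD on the made-up losses f_t(x) = 0.5*||x - a_t||^2."""
    a = rng.normal(size=(T, n))
    a /= np.maximum(1.0, np.linalg.norm(a, axis=1, keepdims=True))  # keep ||a_t|| <= 1
    eta = 1.0 / np.sqrt(T)

    x = np.zeros(n)
    our_loss = 0.0
    for t in range(T):
        our_loss += 0.5 * np.sum((x - a[t]) ** 2)   # pay f_t(x_t)
        x = x - eta * (x - a[t])                    # gradient step; grad f_t(x) = x - a_t

    x_star = a.mean(axis=0)                         # minimizer of the summed losses
    best_loss = 0.5 * np.sum((a - x_star) ** 2)
    return (our_loss - best_loss) / T

for T in (100, 1_000, 10_000):
    print(T, average_regret(T))
```

Of course, random quadratics are a far cry from an adversary; the point of the proof below is that no choice of losses can do much better.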
Proof
Anyways, the proof is fairly easy. The first order of business is to note that, since $f_t$ is convex, then, by definition, we have

$$
f_t(y) \ge f_t(x_t) + \nabla f_t(x_t)^T(y - x_t)
$$

for any $y$. (The $x_t$ are chosen in the same way as in the previous section.) This means, letting $g_t = \nabla f_t(x_t)$,

$$
f_t(x_t) - f_t(y) \le g_t^T(x_t - y).
$$

Summing this over $t$ and noting it is true for any $y$, then, certainly, it is true for the optimal fixed strategy, $x^\star$, so

$$
R_T = \sum_{t=1}^T \left(f_t(x_t) - f_t(x^\star)\right) \le \sum_{t=1}^T g_t^T(x_t - x^\star).
$$
We will focus our attention on trying to show this last term, which we call the linear bound, is 'small'.
An inequality for the linear bound
The inequality is a one-liner and follows easily from the fact that

$$
\sum_{t=1}^T g_t^T(x_t - x^\star) = \frac{\|x^\star\|^2}{2\eta} - \frac{\|x_{T+1} - x^\star\|^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2,
$$

where $x_{T+1} = -\eta\sum_{t=1}^T g_t$, using the definition of $x_{t+1}$. (To see this, expand the right hand side and cancel terms.) Finally, the middle term is nonpositive, as it is a negative square, so we get the bound

$$
\sum_{t=1}^T g_t^T(x_t - x^\star) \le \frac{\|x^\star\|^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2.
$$
Since we know that $\|x^\star\| \le D$ and $\|g_t\| \le G$ by assumption, then

$$
\sum_{t=1}^T g_t^T(x_t - x^\star) \le \frac{D^2}{2\eta} + \frac{\eta T G^2}{2}.
$$

Finally, choosing $\eta = D/(G\sqrt{T})$, which minimizes the right hand side, gives

$$
R_T \le \sum_{t=1}^T g_t^T(x_t - x^\star) \le DG\sqrt{T},
$$
as required.
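(For completeness, here is the small calculation behind that choice of $\eta$, using the bound we just derived. Writing $h(\eta) = \frac{D^2}{2\eta} + \frac{\eta T G^2}{2}$ for the right hand side, setting the derivative to zero gives

$$
h'(\eta) = -\frac{D^2}{2\eta^2} + \frac{TG^2}{2} = 0 \quad\Longleftrightarrow\quad \eta = \frac{D}{G\sqrt{T}},
$$

and plugging this back in,

$$
h\!\left(\frac{D}{G\sqrt{T}}\right) = \frac{DG\sqrt{T}}{2} + \frac{DG\sqrt{T}}{2} = DG\sqrt{T},
$$

which is the constant $C = DG$ promised earlier.)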
Wait what?
Ok, fine, I'll explain it.
Explanation
From before, we can write $x_{t+1}$ in terms of only the gradients $g_s = \nabla f_s(x_s)$:

$$
x_{t+1} = -\eta \sum_{s=1}^{t} g_s,
$$

so the first term of the linear bound can be written

$$
\sum_{t=1}^T g_t^T x_t = -\eta \sum_{t=1}^T \sum_{s=1}^{t-1} g_t^T g_s.
$$

This last double sum should look familiar if you've ever expanded the squared norm of a sum before, but if you have not then:

$$
\left\|\sum_{t=1}^T g_t\right\|^2 = \sum_{t=1}^T \|g_t\|^2 + 2\sum_{t=1}^T \sum_{s=1}^{t-1} g_t^T g_s,
$$

so, rearranging,

$$
\sum_{t=1}^T \sum_{s=1}^{t-1} g_t^T g_s = \frac{1}{2}\left\|\sum_{t=1}^T g_t\right\|^2 - \frac{1}{2}\sum_{t=1}^T \|g_t\|^2.
$$
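(For concreteness, when $T = 2$ the expansion above is just $\|g_1 + g_2\|^2 = \|g_1\|^2 + \|g_2\|^2 + 2g_1^Tg_2$; the general case is the same computation with more cross terms.)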
Plugging this back into the linear bound, we have

$$
\sum_{t=1}^T g_t^T(x_t - x^\star) = \frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2 - \frac{\eta}{2}\left\|\sum_{t=1}^T g_t\right\|^2 - \left(\sum_{t=1}^T g_t\right)^T x^\star.
$$

Here comes the only interesting part of the proof. (Though, honestly, it's not even that interesting.) Note that we can pull a cute sleight of hand: write $g = \sum_{t=1}^T g_t$; then, we can write the above as

$$
\sum_{t=1}^T g_t^T(x_t - x^\star) = \frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2 - \frac{\eta}{2}\|g\|^2 - g^T x^\star.
$$

We can rewrite the last two terms in the following way:

$$
-\frac{\eta}{2}\|g\|^2 - g^T x^\star = \frac{\|x^\star\|^2}{2\eta} - \frac{\|\eta g + x^\star\|^2}{2\eta} = \frac{\|x^\star\|^2}{2\eta} - \frac{\|x_{T+1} - x^\star\|^2}{2\eta}.
$$
(To see this, expand the middle term!) Together with the leftover $\frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2$, this is exactly the expression we found in the previous part!
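If you'd rather not expand anything by hand, here is a tiny numerical check of the identity, with made-up gradients, an arbitrary step size, and an arbitrary comparison point; the two sides agree to floating-point precision since the identity is exact:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, eta = 50, 4, 0.3

g = rng.normal(size=(T, n))               # stand-in gradients g_1, ..., g_T
x_star = rng.normal(size=n)               # an arbitrary comparison point

# OGD iterates starting from x_1 = 0, so x_{t+1} = -eta * (g_1 + ... + g_t).
x = np.vstack([np.zeros(n), -eta * np.cumsum(g, axis=0)])

lhs = sum(g[t] @ (x[t] - x_star) for t in range(T))
rhs = (x_star @ x_star / (2 * eta)
       - (x[T] - x_star) @ (x[T] - x_star) / (2 * eta)
       + (eta / 2) * np.sum(g ** 2))

print(np.isclose(lhs, rhs))               # True
```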
Wrapping up
I mean, there isn't much interesting here and really not very much magic. The one thing this brings up is: what does a 'clean but general' version of this proof look like? In particular, the only important part of the proof is recognizing that the interaction between the iterates $x_t$ and the gradients $g_t$ is, in some sense, bounded by the norm of $x^\star$ and the individual norms of the $g_t$. It feels like there should be some simple construction which allows this interaction to be bounded in a more natural way. (Of course, this will play in a variety of ways with how we choose $x_{t+1}$ based on the gradients and such.)
I should note that this bound is tight for any algorithm, in that only the constant $C$ can be improved. (In particular, it is possible to construct a stochastic adversary that always achieves at least $C'\sqrt{T}$ expected regret, for some constant $C' > 0$, no matter how the $x_t$ are chosen.) Indeed, this question is deeply related to the previous post on finding a lower bound for a random walk, but I won't go into it here.