Longest path search

Online gradient descent has low regret: a one-ish-liner

This post is a bit silly in that the result is obviously well known, but I've never seen a direct proof of it that didn't at least introduce some additional notation. This proof was particularly enlightening to me in understanding why OCO works, at least much more so than the standard "follow the regularized leader" proofs (such as those found here) or more 'direct', but magical, proofs such as the one found here. If you're familiar with OCO already, you can scroll directly down to the almost-two-liner proof.

Online convex optimization

The main setting of online convex optimization is as follows. We have two players: us and an adversary. The game proceeds in $T$ rounds. At each round $t$ we have to choose a move $x_t \in \mathbf{R}^n$, and the adversary, after observing our move, then gets to choose some (convex) loss $f_t : \mathbf{R}^n \to \mathbf{R}$ such that we incur a penalty $f_t(x_t)$ for having played move $x_t$.

After this game is over, we then compare our score, which is the sum of all of the losses over the $T$ rounds, to that of the best possible fixed strategy, which we call $x^\star \in \mathbf{R}^n$:

$$R = \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(x^\star).$$

This $R$ is called the regret and the best possible fixed strategy, $x^\star$, is, of course, the one that minimizes the sum of all losses:

$$x^\star \in \operatorname*{argmin}_x \left( \sum_{t=1}^T f_t(x) \right),$$

had we known the losses ft in advance.
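To make the setup concrete, here is a minimal sketch of the game loop in Python. The `player` and `adversary` objects and their methods are hypothetical stand-ins of my own; the post itself doesn't fix any particular interface.

```python
# A minimal sketch of the OCO game loop. The `player` and `adversary` objects
# and their methods are hypothetical stand-ins, not anything defined in the post.
def play_game(player, adversary, T):
    incurred, losses = [], []
    for _ in range(T):
        x_t = player.choose()              # we commit to a move first
        f_t = adversary.choose_loss(x_t)   # adversary picks a convex loss after seeing x_t
        incurred.append(f_t(x_t))          # we pay f_t(x_t)
        player.observe(f_t)                # only now do we learn anything about f_t
        losses.append(f_t)
    return incurred, losses

def regret(incurred, losses, x_star):
    # R = sum_t f_t(x_t) - sum_t f_t(x_star), for a fixed comparator x_star.
    return sum(incurred) - sum(f(x_star) for f in losses)
```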

The natural question is: just how small can $R$ be? Of course, we need some conditions on $f_t$ and the optimal fixed strategy, $x^\star$, otherwise the adversary can just make the losses arbitrarily large. The simplest conditions, and the ones we will use here, are that (a) the optimal fixed strategy always has a bounded norm, $\|x^\star\| \le M$, and (b) the functions $f_t$ are $L$-Lipschitz: if the $f_t$ are differentiable, this is saying that

$$\|\nabla f_t(y)\| \le L$$

for any $y \in \mathbf{R}^n$. (We can relax this condition slightly, but the idea will be the same.) In both cases, we write $\|\cdot\|$ for the usual Euclidean norm, and I'll do this for the rest of the post.

A simple strategy

The simplest strategy that "seems to have the right behavior" is probably something like gradient descent: at round $t$, take the gradient of the function from round $t-1$ (which we didn't know until after we played $x_{t-1}$, since the adversary chose it afterwards!) and update slightly in the opposite direction. Written out, this is

$$x_t = x_{t-1} - \eta \nabla f_{t-1}(x_{t-1}),$$

where $\eta > 0$ is some parameter we will set soon, called the step size. (Keep this in mind, as this is the definition of $x_t$ we will use throughout.) Note that, starting from $x_1 = 0$, we can write $x_t$ purely in terms of the previously observed gradients, since

$$x_t = -\eta \sum_{\tau=1}^{t-1} \nabla f_\tau(x_\tau).$$

(We will use this later as well!)
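As a quick sketch (mine, in Python with NumPy, starting from $x_1 = 0$ as above), the update and its sum-of-past-gradients form look like this:

```python
import numpy as np

class OnlineGradientDescent:
    """Sketch of the update x_t = x_{t-1} - eta * grad f_{t-1}(x_{t-1}), starting at x_1 = 0."""

    def __init__(self, n, eta):
        self.eta = eta
        self.x = np.zeros(n)          # x_1 = 0
        self.grad_sum = np.zeros(n)   # running sum of the gradients we've observed

    def choose(self):
        return self.x.copy()

    def observe(self, g_t):
        # g_t = grad f_t(x_t), revealed only after we've already played x_t.
        self.grad_sum += g_t
        self.x = self.x - self.eta * g_t
        # Equivalently, the next move is -eta times the sum of all past gradients.
        assert np.allclose(self.x, -self.eta * self.grad_sum)
```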

This silly strategy will turn out to be extremely useful in the future. Even more interesting is that the strategy only depends on the gradient at $x_{t-1}$ and no other information! This setting actually comes up in practice (for example in blockchain resource pricing, where you have to set a price before you get to observe the market's reaction) and can be used to analyze the performance of certain algorithms against potential adversaries. (We are writing a paper on this particular topic of OCO applied to resource pricing using the above model, which I'll link here once we post it.)

On the regret R

At first glance, this problem feels quite difficult! I mean, look: the adversary can choose the functions $f_t$ in any way it likes, after observing our move $x_t$. It's almost like, given so much power, the adversary can always make the loss roughly linear in the number of rounds $T$: each round, we might expect to lose at least some constant amount to a very adversarial choice of $f_t$.

Even the strategy above 'feels like' it's not going to do particularly well: again, the functions are chosen with knowledge of how we take our steps! Why should we expect to do well at all?

What is surprising is that we will show

$$R \le C\sqrt{T},$$

where $C > 0$ is some constant that depends on the bound $L$ on the gradients and the bound $M$ on the optimal strategy. Alternatively, we can write this as saying that, on average, the longer the game continues, the better we perform:

$$\frac{R}{T} \le \frac{C}{\sqrt{T}}.$$

(Again, it's worth reiterating: this holds even in the presence of adversarially chosen $f_t$.) Indeed, over a long enough time scale, we see that our strategy and the best fixed strategy, chosen with complete knowledge of the future, are, on average, about the same. Algorithms whose average regret vanishes have a bit of a silly name, but I'll put it here for anthropological reasons: they are called no-regret algorithms.
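As a sanity check (not part of the original argument), here is a small simulation against one particular adversary: after seeing $x_t$, it plays the linear loss $f_t(x) = g_t^T x$ with $g_t$ a norm-$L$ vector pointing along $x_t$, so we pay as much as possible at the point we just played. The parameters $n$, $L$, and $M$ are arbitrary choices of mine.

```python
import numpy as np

n, L, M = 5, 1.0, 1.0
rng = np.random.default_rng(0)

for T in (100, 1_000, 10_000):
    eta = M / (L * np.sqrt(T))              # the step size chosen later in the proof
    x, grads, incurred = np.zeros(n), [], 0.0
    for t in range(T):
        # Adversary: linear loss f_t(x) = g_t^T x with ||g_t|| = L, aligned with x_t.
        direction = x if np.linalg.norm(x) > 0 else rng.standard_normal(n)
        g = L * direction / np.linalg.norm(direction)
        incurred += g @ x                   # f_t(x_t)
        grads.append(g)
        x = x - eta * g                     # online gradient descent step
    # Best fixed strategy in hindsight over the ball ||x|| <= M (losses are linear):
    best_fixed = -M * np.linalg.norm(np.sum(grads, axis=0))
    R = incurred - best_fixed
    print(f"T={T:6d}  R={R:9.2f}  R/T={R/T:.4f}  L*M*sqrt(T)={L*M*np.sqrt(T):.2f}")
```

The printed average regret $R/T$ shrinks as $T$ grows, while $R$ itself stays below $LM\sqrt{T}$, at least for this particular adversary.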

Proof

Anyways, the proof is fairly easy. The first order of business is to note that, since $f_t$ is convex, we have, by definition,

$$f_t(y) \ge f_t(x_t) + \nabla f_t(x_t)^T(y - x_t),$$

for any $y \in \mathbf{R}^n$. (The $x_t$ are chosen in the same way as in the previous section.) This means, letting $g_t = \nabla f_t(x_t)$,

$$f_t(x_t) - f_t(y) \le g_t^T(x_t - y).$$

Summing this and noting it is true for any $y$, then, certainly, it is true for the optimal strategy, $y = x^\star$, so

$$R = \sum_{t=1}^T \left( f_t(x_t) - f_t(x^\star) \right) \le \sum_{t=1}^T g_t^T(x_t - x^\star).$$

We will focus our attention on trying to show this last term, which we call the linear bound, is 'small'.
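(Here is a quick numerical spot-check of the first-order convexity inequality above, using a convex quadratic of my own choosing, not anything from the post:)

```python
import numpy as np

# Spot-check of f(y) >= f(x) + grad f(x)^T (y - x)
# for the convex quadratic f(x) = ||A x - b||^2 (an arbitrary example).
rng = np.random.default_rng(1)
n = 4
A, b = rng.standard_normal((n, n)), rng.standard_normal(n)
f = lambda x: float(np.sum((A @ x - b) ** 2))
grad = lambda x: 2 * A.T @ (A @ x - b)

for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-9
```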

An inequality for the linear bound

The inequality is a one-liner and follows easily from the fact that

$$R \le \sum_{t=1}^T g_t^T(x_t - x^\star) = \frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2 - \frac{\eta}{2}\left\| \tilde g + \frac{x^\star}{\eta} \right\|^2 + \frac{\|x^\star\|^2}{2\eta},$$

where $\tilde g = \sum_{t=1}^T g_t$, using the definition $x_t = -\eta \sum_{\tau=1}^{t-1} g_\tau$. (To see this, expand the right hand side and cancel terms.) Finally, the middle term is nonpositive, as it is the negative of a square, so we get the bound

$$R \le \frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2 + \frac{\|x^\star\|^2}{2\eta}.$$

Since we know that $\|g_t\| \le L$ and $\|x^\star\| \le M$ by assumption, then

$$R \le \frac{\eta L^2 T}{2} + \frac{M^2}{2\eta}.$$

Finally, choosing $\eta = M/(L\sqrt{T})$, which minimizes the right hand side, gives

$$R \le ML\sqrt{T},$$

as required.
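If you'd rather not expand the square by hand, here is a small symbolic check (using SymPy, my tooling choice, with $T = 3$ and scalar moves) of both the identity above and the choice of step size:

```python
import sympy as sp

# Symbolic check of the one-liner for a small instance: T = 3 rounds, scalar moves.
T = 3
eta = sp.Symbol('eta', positive=True)
x_star = sp.Symbol('x_star', real=True)
g = sp.symbols(f'g1:{T + 1}', real=True)             # g_1, ..., g_T
x = [-eta * sum(g[:t]) for t in range(T)]            # x_t = -eta (g_1 + ... + g_{t-1})
lhs = sum(g[t] * (x[t] - x_star) for t in range(T))  # the linear bound
g_tilde = sum(g)
rhs = (eta / 2) * sum(gi**2 for gi in g) \
      - (eta / 2) * (g_tilde + x_star / eta) ** 2 + x_star**2 / (2 * eta)
assert sp.simplify(lhs - rhs) == 0

# The step size eta = M / (L sqrt(T)) is a stationary point of eta L^2 T / 2 + M^2 / (2 eta),
# and plugging it in gives the final bound L M sqrt(T).
L, M, Tn = sp.symbols('L M T', positive=True)
bound = eta * L**2 * Tn / 2 + M**2 / (2 * eta)
eta_star = M / (L * sp.sqrt(Tn))
assert sp.simplify(sp.diff(bound, eta).subs(eta, eta_star)) == 0
assert sp.simplify(bound.subs(eta, eta_star) - L * M * sp.sqrt(Tn)) == 0
```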

Wait what?

Ok, fine, I'll explain it.

Explanation

From before, we can write $x_t$ in terms of only the gradients $g_\tau$ with $\tau < t$:

$$x_t = -\eta \sum_{\tau=1}^{t-1} g_\tau,$$

so the first term of the linear bound can be written

$$\sum_{t=1}^T g_t^T x_t = -\eta \sum_{t=1}^T \sum_{\tau=1}^{t-1} g_t^T g_\tau.$$

This last double sum should look familiar if you've ever expanded the squared norm of a sum before, but if you have not then:

$$\left\| \sum_{t=1}^T g_t \right\|^2 = \sum_{t=1}^T \|g_t\|^2 + 2\sum_{t=1}^T \sum_{\tau=1}^{t-1} g_t^T g_\tau,$$

so, rearranging,

$$\sum_{t=1}^T \sum_{\tau=1}^{t-1} g_t^T g_\tau = \frac{1}{2}\left( \left\| \sum_{t=1}^T g_t \right\|^2 - \sum_{t=1}^T \|g_t\|^2 \right).$$
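(Here is a quick numerical check of this identity, with random vectors and dimensions of my own choosing:)

```python
import numpy as np

# Quick check of the "squared norm of a sum" identity with random vectors.
rng = np.random.default_rng(2)
T, n = 10, 3
g = rng.standard_normal((T, n))
lhs = np.linalg.norm(g.sum(axis=0)) ** 2
rhs = np.sum(np.linalg.norm(g, axis=1) ** 2) \
      + 2 * sum(g[t] @ g[tau] for t in range(T) for tau in range(t))
assert np.isclose(lhs, rhs)
```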

Plugging this back into the linear bound, we have

$$R \le \sum_{t=1}^T g_t^T(x_t - x^\star) = \frac{\eta}{2}\left( \sum_{t=1}^T \|g_t\|^2 - \left\| \sum_{t=1}^T g_t \right\|^2 \right) - \sum_{t=1}^T g_t^T x^\star.$$

Here comes the only interesting part of the proof. (Though, honestly, it's not even that interesting.) Note that we can pull a cute sleight of hand: writing $\tilde g = \sum_{t=1}^T g_t$, we can rewrite the above as

$$R \le \frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2 - \frac{\eta}{2}\|\tilde g\|^2 - \tilde g^T x^\star.$$

We can rewrite the last two terms in the following way:

$$R \le \frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2 - \frac{\eta}{2}\left\| \tilde g + \frac{x^\star}{\eta} \right\|^2 + \frac{\|x^\star\|^2}{2\eta}.$$

(To see this, expand the middle term!) This is exactly the bound we claimed in the previous part!
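And if you'd like that expansion spelled out, here is the middle term expanded symbolically (SymPy again, with scalar symbols standing in for the vectors, purely as a check of mine):

```python
import sympy as sp

# Expand -(eta/2) (g_tilde + x_star/eta)^2 + x_star^2/(2 eta) and check it equals
# -(eta/2) g_tilde^2 - g_tilde * x_star, the last two terms of the previous bound.
eta = sp.Symbol('eta', positive=True)
g_tilde, x_star = sp.symbols('g_tilde x_star', real=True)
middle = -(eta / 2) * (g_tilde + x_star / eta) ** 2 + x_star**2 / (2 * eta)
assert sp.simplify(middle - (-(eta / 2) * g_tilde**2 - g_tilde * x_star)) == 0
```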

Wrapping up

I mean, there isn't much interesting here and really not very much magic. The one thing this brings up is: what does a 'clean but general' version of this proof look like? In particular, the only important part of the proof is recognizing that the interaction between $\tilde g$ and $x^\star$ is, in some sense, bounded by the norm of $x^\star$ and the individual norms of the $g_t$. It feels like there should be some simple construction which allows this interaction to be bounded in a more natural way. (Of course, this will play out in a variety of ways depending on how we choose $x_t$ based on the gradients and such.)

I should note that this bound is tight for any algorithm, in that only the constant can be improved. (In particular, it is possible to construct a stochastic adversary that forces $R \ge CLM\sqrt{T}$ for some constant $C > 1/4$, no matter how the $x_t$ are chosen.) Indeed, this question is deeply related to the previous post on finding a lower bound for a random walk, but I won't go into it here.