Longest path search

Online gradient descent has low regret: a one-ish-liner

This post is a bit silly in that the result is obviously well known, but I've never seen a direct proof of it that didn't at least introduce some additional notation. This proof was particularly enlightening to me in understanding why OCO works, at least much more so than the standard "follow the regularized leader" proofs (such as those found here) or more 'direct', but magical, proofs such as the one found here. If you're familiar with OCO already, you can scroll directly down to the almost-two-liner proof.

Online convex optimization

The main setting of online convex optimization is as follows. We have two players: us and an adversary. The game proceeds in $T$ rounds. At each round $t$ we have to choose a move $x_t \in \mathbf{R}^n$, and the adversary, after observing our move, then gets to choose some (convex) loss $f_t : \mathbf{R}^n \to \mathbf{R}$ such that we incur a penalty $f_t(x_t)$ for having played move $x_t$.

After this game is over, we then compare our score, which is the sum of all of the losses over the $T$ rounds, to that of the best possible fixed strategy, which we call $x^\star \in \mathbf{R}^n$:

$$R = \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(x^\star).$$

This $R$ is called the regret and the best possible fixed strategy, $x^\star$, is, of course, the one that minimizes the sum of all losses:

$$x^\star \in \operatorname*{argmin}_x \left( \sum_{t=1}^T f_t(x) \right),$$

had we known the losses ft in advance.
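To make the setup concrete, here is a minimal sketch of the game loop in Python. The `player` and `adversary` objects and their methods are hypothetical stand-ins of my own; the post itself doesn't fix any particular interface.

```python
# A minimal sketch of the OCO game loop. The `player` and `adversary` objects
# and their methods are hypothetical stand-ins, not anything defined in the post.
def play_game(player, adversary, T):
    incurred, losses = [], []
    for _ in range(T):
        x_t = player.choose()              # we commit to a move first
        f_t = adversary.choose_loss(x_t)   # adversary picks a convex loss after seeing x_t
        incurred.append(f_t(x_t))          # we pay f_t(x_t)
        player.observe(f_t)                # only now do we learn anything about f_t
        losses.append(f_t)
    return incurred, losses

def regret(incurred, losses, x_star):
    # R = sum_t f_t(x_t) - sum_t f_t(x_star), for a fixed comparator x_star.
    return sum(incurred) - sum(f(x_star) for f in losses)
```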

The natural question is: just how small can $R$ be? Of course, we need some conditions on $f_t$ and the optimal fixed strategy, $x^\star$, otherwise the adversary can just make the losses arbitrarily large. The simplest conditions, and the ones we will use here, are that (a) the optimal fixed strategy always has a bounded norm, $\|x^\star\| \le M$, and (b) the functions $f_t$ are $L$-Lipschitz: if the $f_t$ are differentiable, this is saying that

$$\|\nabla f_t(y)\| \le L$$

for any $y \in \mathbf{R}^n$. (We can relax this condition slightly, but the idea will be the same.) In both cases, we write $\|\cdot\|$ for the usual Euclidean norm, and I'll do this for the rest of the post.

A simple strategy

The simplest strategy that "seems to have the right behavior" is probably something like gradient descent: at round $t$, take the gradient of the function from round $t-1$ (which we didn't know until after we played $x_{t-1}$, since the adversary chose it afterwards!) and update slightly in the opposite direction. Written out, this is

$$x_t = x_{t-1} - \eta \nabla f_{t-1}(x_{t-1}),$$

where $\eta > 0$ is some parameter we will set soon, called the step size. (Keep this in mind, as this is the definition of $x_t$ we will use throughout.) Note that, starting from $x_1 = 0$, we can write $x_t$ purely in terms of the previously observed gradients, since

$$x_t = -\eta \sum_{\tau=1}^{t-1} \nabla f_\tau(x_\tau).$$

(We will use this later as well!)
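As a quick sketch (mine, in Python with NumPy, starting from $x_1 = 0$ as above), the update and its sum-of-past-gradients form look like this:

```python
import numpy as np

class OnlineGradientDescent:
    """Sketch of the update x_t = x_{t-1} - eta * grad f_{t-1}(x_{t-1}), starting at x_1 = 0."""

    def __init__(self, n, eta):
        self.eta = eta
        self.x = np.zeros(n)          # x_1 = 0
        self.grad_sum = np.zeros(n)   # running sum of the gradients we've observed

    def choose(self):
        return self.x.copy()

    def observe(self, g_t):
        # g_t = grad f_t(x_t), revealed only after we've already played x_t.
        self.grad_sum += g_t
        self.x = self.x - self.eta * g_t
        # Equivalently, the next move is -eta times the sum of all past gradients.
        assert np.allclose(self.x, -self.eta * self.grad_sum)
```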

This silly strategy will turn out to be extremely useful in the future. Even more interesting is that the strategy only depends on the gradient at $x_{t-1}$ and no other information! This setting actually comes up in practice (for example in blockchain resource pricing, where you have to set a price before you get to observe the market's reaction) and can be used to analyze the performance of certain algorithms against potential adversaries. (We are writing a paper on this particular topic of OCO applied to resource pricing using the above model, which I'll link here once we post it.)

On the regret R

At first glance, this problem feels quite difficult! I mean, look: the adversary can choose the functions $f_t$ in any way it likes, after observing our move $x_t$. It's almost like, given so much power, the adversary can always make the loss roughly linear in the number of rounds $T$: each round, we might expect to lose at least some constant amount to a very adversarial choice of $f_t$.

Even the strategy above 'feels like' it's not going to do particularly well: again, the functions are chosen with knowledge of how we take our steps! Why should we expect to do well at all?

What is surprising is that we will show

$$R \le C\sqrt{T},$$

where $C > 0$ is some constant that depends on the bound $L$ on the gradients and the bound $M$ on the optimal strategy. Alternatively, we can write this as saying that, on average, the longer the game continues, the better we perform:

$$\frac{R}{T} \le \frac{C}{\sqrt{T}}.$$

(Again, it's worth reiterating: this holds even in the presence of adversarially chosen $f_t$.) Indeed, over a long enough time scale, we see that our strategy and the best fixed strategy, chosen with complete knowledge of the future, are, on average, about the same. Algorithms whose average regret vanishes have a bit of a silly name, but I'll put it here for anthropological reasons: they are called no-regret algorithms.
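As a sanity check (not part of the original argument), here is a small simulation against one particular adversary: after seeing $x_t$, it plays the linear loss $f_t(x) = g_t^T x$ with $g_t$ a norm-$L$ vector pointing along $x_t$, so we pay as much as possible at the point we just played. The parameters $n$, $L$, and $M$ are arbitrary choices of mine.

```python
import numpy as np

n, L, M = 5, 1.0, 1.0
rng = np.random.default_rng(0)

for T in (100, 1_000, 10_000):
    eta = M / (L * np.sqrt(T))              # the step size chosen later in the proof
    x, grads, incurred = np.zeros(n), [], 0.0
    for t in range(T):
        # Adversary: linear loss f_t(x) = g_t^T x with ||g_t|| = L, aligned with x_t.
        direction = x if np.linalg.norm(x) > 0 else rng.standard_normal(n)
        g = L * direction / np.linalg.norm(direction)
        incurred += g @ x                   # f_t(x_t)
        grads.append(g)
        x = x - eta * g                     # online gradient descent step
    # Best fixed strategy in hindsight over the ball ||x|| <= M (losses are linear):
    best_fixed = -M * np.linalg.norm(np.sum(grads, axis=0))
    R = incurred - best_fixed
    print(f"T={T:6d}  R={R:9.2f}  R/T={R/T:.4f}  L*M*sqrt(T)={L*M*np.sqrt(T):.2f}")
```

The printed average regret $R/T$ shrinks as $T$ grows, while $R$ itself stays below $LM\sqrt{T}$, at least for this particular adversary.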

Proof

Anyways, the proof is fairly easy. The first order of business is to note that, since $f_t$ is convex, we have, by definition,

$$f_t(y) \ge f_t(x_t) + \nabla f_t(x_t)^T(y - x_t),$$

for any $y \in \mathbf{R}^n$. (The $x_t$ are chosen in the same way as in the previous section.) This means, letting $g_t = \nabla f_t(x_t)$,

$$f_t(x_t) - f_t(y) \le g_t^T(x_t - y).$$

Summing this and noting it is true for any $y$, then, certainly, it is true for the optimal strategy, $y = x^\star$, so

$$R = \sum_{t=1}^T \left( f_t(x_t) - f_t(x^\star) \right) \le \sum_{t=1}^T g_t^T(x_t - x^\star).$$

We will focus our attention on trying to show this last term, which we call the linear bound, is 'small'.
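(Here is a quick numerical spot-check of the first-order convexity inequality above, using a convex quadratic of my own choosing, not anything from the post:)

```python
import numpy as np

# Spot-check of f(y) >= f(x) + grad f(x)^T (y - x)
# for the convex quadratic f(x) = ||A x - b||^2 (an arbitrary example).
rng = np.random.default_rng(1)
n = 4
A, b = rng.standard_normal((n, n)), rng.standard_normal(n)
f = lambda x: float(np.sum((A @ x - b) ** 2))
grad = lambda x: 2 * A.T @ (A @ x - b)

for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-9
```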

An inequality for the linear bound

The inequality is a one-liner and follows easily from the fact that

$$R \le \sum_{t=1}^T g_t^T(x_t - x^\star) = \frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2 - \frac{\eta}{2}\left\| \tilde g + \frac{x^\star}{\eta} \right\|^2 + \frac{\|x^\star\|^2}{2\eta},$$

where $\tilde g = \sum_{t=1}^T g_t$, using the definition $x_t = -\eta \sum_{\tau=1}^{t-1} g_\tau$. (To see this, expand the right hand side and cancel terms.) Finally, the middle term is nonpositive, as it is the negative of a square, so we get the bound

$$R \le \frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2 + \frac{\|x^\star\|^2}{2\eta}.$$

Since we know that $\|g_t\| \le L$ and $\|x^\star\| \le M$ by assumption, then

$$R \le \frac{\eta L^2 T}{2} + \frac{M^2}{2\eta}.$$

Finally, choosing $\eta = M/(L\sqrt{T})$, which minimizes the right hand side, gives

$$R \le ML\sqrt{T},$$

as required.
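If you'd rather not expand the square by hand, here is a small symbolic check (using SymPy, my tooling choice, with $T = 3$ and scalar moves) of both the identity above and the choice of step size:

```python
import sympy as sp

# Symbolic check of the one-liner for a small instance: T = 3 rounds, scalar moves.
T = 3
eta = sp.Symbol('eta', positive=True)
x_star = sp.Symbol('x_star', real=True)
g = sp.symbols(f'g1:{T + 1}', real=True)             # g_1, ..., g_T
x = [-eta * sum(g[:t]) for t in range(T)]            # x_t = -eta (g_1 + ... + g_{t-1})
lhs = sum(g[t] * (x[t] - x_star) for t in range(T))  # the linear bound
g_tilde = sum(g)
rhs = (eta / 2) * sum(gi**2 for gi in g) \
      - (eta / 2) * (g_tilde + x_star / eta) ** 2 + x_star**2 / (2 * eta)
assert sp.simplify(lhs - rhs) == 0

# The step size eta = M / (L sqrt(T)) is a stationary point of eta L^2 T / 2 + M^2 / (2 eta),
# and plugging it in gives the final bound L M sqrt(T).
L, M, Tn = sp.symbols('L M T', positive=True)
bound = eta * L**2 * Tn / 2 + M**2 / (2 * eta)
eta_star = M / (L * sp.sqrt(Tn))
assert sp.simplify(sp.diff(bound, eta).subs(eta, eta_star)) == 0
assert sp.simplify(bound.subs(eta, eta_star) - L * M * sp.sqrt(Tn)) == 0
```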

Wait what?

Ok, fine, I'll explain it.

Explanation

From before, we can write $x_t$ in terms of only the gradients $g_\tau$ with $\tau < t$:

$$x_t = -\eta \sum_{\tau=1}^{t-1} g_\tau,$$

so the first term of the linear bound can be written

$$\sum_{t=1}^T g_t^T x_t = -\eta \sum_{t=1}^T \sum_{\tau=1}^{t-1} g_t^T g_\tau.$$

This last double sum should look familiar if you've ever expanded the squared norm of a sum before, but if you have not then:

$$\left\| \sum_{t=1}^T g_t \right\|^2 = \sum_{t=1}^T \|g_t\|^2 + 2\sum_{t=1}^T \sum_{\tau=1}^{t-1} g_t^T g_\tau,$$

so, rearranging,

$$\sum_{t=1}^T \sum_{\tau=1}^{t-1} g_t^T g_\tau = \frac{1}{2}\left( \left\| \sum_{t=1}^T g_t \right\|^2 - \sum_{t=1}^T \|g_t\|^2 \right).$$
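(Here is a quick numerical check of this identity, with random vectors and dimensions of my own choosing:)

```python
import numpy as np

# Quick check of the "squared norm of a sum" identity with random vectors.
rng = np.random.default_rng(2)
T, n = 10, 3
g = rng.standard_normal((T, n))
lhs = np.linalg.norm(g.sum(axis=0)) ** 2
rhs = np.sum(np.linalg.norm(g, axis=1) ** 2) \
      + 2 * sum(g[t] @ g[tau] for t in range(T) for tau in range(t))
assert np.isclose(lhs, rhs)
```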

Plugging this back into the linear bound, we have

$$R \le \sum_{t=1}^T g_t^T(x_t - x^\star) = \frac{\eta}{2}\left( \sum_{t=1}^T \|g_t\|^2 - \left\| \sum_{t=1}^T g_t \right\|^2 \right) - \sum_{t=1}^T g_t^T x^\star.$$

Here comes the only interesting part of the proof. (Though, honestly, it's not even that interesting.) Note that we can pull a cute sleight of hand: writing $\tilde g = \sum_{t=1}^T g_t$, we can rewrite the above as

$$R \le \frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2 - \frac{\eta}{2}\|\tilde g\|^2 - \tilde g^T x^\star.$$

We can rewrite the last two terms in the following way:

$$R \le \frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2 - \frac{\eta}{2}\left\| \tilde g + \frac{x^\star}{\eta} \right\|^2 + \frac{\|x^\star\|^2}{2\eta}.$$

(To see this, expand the middle term!) This is exactly the bound we claimed in the previous part!
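And if you'd like that expansion spelled out, here is the middle term expanded symbolically (SymPy again, with scalar symbols standing in for the vectors, purely as a check of mine):

```python
import sympy as sp

# Expand -(eta/2) (g_tilde + x_star/eta)^2 + x_star^2/(2 eta) and check it equals
# -(eta/2) g_tilde^2 - g_tilde * x_star, the last two terms of the previous bound.
eta = sp.Symbol('eta', positive=True)
g_tilde, x_star = sp.symbols('g_tilde x_star', real=True)
middle = -(eta / 2) * (g_tilde + x_star / eta) ** 2 + x_star**2 / (2 * eta)
assert sp.simplify(middle - (-(eta / 2) * g_tilde**2 - g_tilde * x_star)) == 0
```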

Wrapping up

I mean, there isn't much interesting here and really not very much magic. The one thing this brings up is: what does a 'clean but general' version of this proof look like? In particular, the only important part of the proof is recognizing that the interaction between $\tilde g$ and $x^\star$ is, in some sense, bounded by the norm of $x^\star$ and the individual norms of the $g_t$. It feels like there should be some simple construction which allows this interaction to be bounded in a more natural way. (Of course, this will play out in a variety of ways depending on how we choose $x_t$ based on the gradients and such.)

I should note that this bound is tight for any algorithm, in that only the constant can be improved. (In particular, it is possible to construct a stochastic adversary that forces $R \ge CLM\sqrt{T}$ for some constant $C > 1/4$, no matter how the $x_t$ are chosen.) Indeed, this question is deeply related to the previous post on finding a lower bound for a random walk, but I won't go into it here.