Longest path search

Another short proof of OCO and miscellanea

A short post with two silly, but ultimately useful, results. The first is yet another proof of OGD being low regret. The second is that the last iterate, or even the average iterate (or, indeed, any convex combination of the iterates) can be very far away from the optimal point, even though they are low regret, without further assumptions on the functions we are optimizing over.

Why yet another proof? No idea. OCO still feels incredibly magical to me on many levels and getting several clean proofs is usually useful for understanding that kind of thing.

On the other hand, I'm not sure this gets anyone closer to that goal, but it's (probably) worth pursuing anyways.

(Yet another) Proof of the regret of online gradient descent

Using an identical set up and notation to the previous post on OCO note that, given subgradient gt𝐑n bounded by L>0, we only have to bound the quantity

t=1TgtT(xtx).

Here, for t=0,,T, we define

xt+1=xtηgt,

and x𝐑n is any point in some bounded subset with diameter at most M>0.

The new proof is almost silly. Let z2=z12++zn2 be the sum of squares norm, then

0xT+1x2.

But, xT=xT1ηgt, so

0xT+1x2=xTx22ηgTT(xTx)+η2gT2==x0x22ηt=1TgtT(xtx)+η2t=1Tgt2.

Rearranging gives

t=1TgtT(xtx)12ηx0x2+η2t=1Tgt2.

The rest is mechanical of course, but it's worth writing anyways. Using our bounds on the diameter of the domain and the gradients, we get

t=1TgtT(xtx)M22η+ηTL22,

and, setting η=M/(LT) gives the result

t=1TgtT(xtx)MLT2.

Funnily enough, note that this bound is slightly tighter than the previous one by a factor of 2, even though the previous proof relied on an exact rewriting.

Projected online gradient descent

Another nice point is that this proof immediately carries over to the projected case, unlike the previous one. Say the optimal point x lies in some closed, convex set C𝐑n with diameter at most M, then, let Π(x) denote the projection of x onto C, defined

Π(x)=argminzCzx2.

(We can say the projection, since this is unique from the convexity of C.) A fun exercise is to note that, for any two points x,y𝐑n the projection satisfies

Π(x)Π(y)2xy2.

In other words, the projection Π is nonexpansive.

Anyways, in this case, note that the gradient update step is given by

xt+1=Π(xtηgt);

i.e., we project xt into the set C after every gradient update. The proof, using the fact that Π is nonexpansive, is essentially identical to the previous proof since

0xT+1x2=Π(xTηgT)x2xTηgTx2.

(Note the second inequality, there.) The rest of the proof proceeds identically as the previous one, except instead of exact equalities, we have inequalities coming from the nonexpansiveness property of Π.

No bounds on distance

The above proof is kind of interesting: we used the (trivial) inequality

xT+1x20,

to get a fairly tight bound on the regret. This 'suggests' (to me, at least) that this bound is likely somewhat tight in many cases, might it be 'close to tight' in general? By this I mean, is it the case that

xT+1x20,

for some notion of ?

Certainly, this is true in the case where gt are the subgradients of some fixed function evaluated at the xt (that is, when gtf(xt) for some f) and x is, say, the unique minimizer of f. Might it be true in the more general scenario where the gt are adversarial?

Unfortunately the answer is no.

Indeed, we can show that there is a (silly) set of gradients gt such that the distance between the last iterate (or, really, any iterate, or convex combination of the iterate, such as the average) and the x that maximizes the regret bound is the diameter of the set.

Let C be any compact convex set with diameter M, then there are points x0 and y such that x0y=M. Now, let gt=0 for t=1,,T1 and gT=α(x0y), where α=L/x0y, which we can do as adversaries. Then the x that maximizes regret is the one that solves

minzC(t=1TgtTz)=minzC((x0y)Tz),

with minimizer x=y. Since xT=x0 then it's game over since x0x=M. It's clear that any convex (indeed, affine) combination of the xt is always equal to x0, so, unlike in the stochastic case, this does not help either.

Linearity is hard

The main problem with the above is that (a) we have to choose a move before the last gradient is revealed to us, and (b) the maximizers of linear functions are extremely sensitive to the parameters of the problem. This means that even a very small change in the gradients can have drastic movements in x, which prevents us from getting control over these conditions.

One 'simple-ish' technique for getting rid of this sensitivity is to assume that the functions ft are strongly convex. It is not quite 100% clear to me if, in this case, the last iterate (or, say, the average iterate) is close to the unique minimizer x, but it does seem likely.

I unfortunately cannot find an easy source for this on my (admittedly cursory) glance, but if there is one, that would be great!