Another short proof of OCO and miscellanea
A short post with two silly, but ultimately useful, results. The first is yet another proof that online gradient descent (OGD) has low regret. The second is that the last iterate, or even the average iterate (or, indeed, any convex combination of the iterates), can be very far away from the optimal point, even when the regret is low, without further assumptions on the functions we are optimizing over.
Why yet another proof? No idea. OCO still feels incredibly magical to me on many levels and getting several clean proofs is usually useful for understanding that kind of thing.
On the other hand, I'm not sure this gets anyone closer to that goal, but it's (probably) worth pursuing anyways.
(Yet another) Proof of the regret of online gradient descent
Using an identical setup and notation to the previous post on OCO, note that, given subgradients $g_t$ with norms bounded by $G$, we only have to bound the quantity

$$
\sum_{t=1}^T g_t^T(x_t - x^\star).
$$

Here, for $t = 1, \dots, T$, we define

$$
x_{t+1} = x_t - \eta g_t,
$$

and $x^\star$ is any point in some bounded subset $C \subseteq \mathbb{R}^n$ with diameter at most $D$.
The new proof is almost silly. Let $\|\cdot\|$ be the sum-of-squares norm; then, expanding the square in the update and applying the resulting identity recursively,

$$
\|x_{T+1} - x^\star\|^2 = \|x_1 - x^\star\|^2 - 2\eta \sum_{t=1}^T g_t^T(x_t - x^\star) + \eta^2 \sum_{t=1}^T \|g_t\|^2.
$$

But $\|x_{T+1} - x^\star\|^2 \ge 0$, so

$$
0 \le \|x_1 - x^\star\|^2 - 2\eta \sum_{t=1}^T g_t^T(x_t - x^\star) + \eta^2 \sum_{t=1}^T \|g_t\|^2.
$$

Rearranging gives

$$
\sum_{t=1}^T g_t^T(x_t - x^\star) \le \frac{\|x_1 - x^\star\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^T \|g_t\|^2.
$$
The rest is mechanical, of course, but it's worth writing anyways. Using our bounds on the diameter of the domain and on the gradients, we get

$$
\sum_{t=1}^T g_t^T(x_t - x^\star) \le \frac{D^2}{2\eta} + \frac{\eta T G^2}{2},
$$

and setting $\eta = D/(G\sqrt{T})$ gives the result

$$
\sum_{t=1}^T g_t^T(x_t - x^\star) \le DG\sqrt{T}.
$$
Funnily enough, note that this bound is slightly tighter than the previous one by a factor of 2, even though the previous proof relied on an exact rewriting.
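Since everything above is constructive, it's easy to sanity-check numerically. Here is a minimal sketch (my own, not from the post; the random linear losses, domain, and constants are all made-up choices for illustration) that runs OGD and compares the realized regret to the $DG\sqrt{T}$ bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 10_000
D, G = 2.0, 1.0  # diameter of the domain and bound on the subgradient norms

# Adversary: random linear losses f_t(x) = g_t^T x with ||g_t|| = G.
gs = rng.normal(size=(T, n))
gs *= G / np.linalg.norm(gs, axis=1, keepdims=True)

eta = D / (G * np.sqrt(T))  # the step size chosen above
x = np.zeros(n)             # x_1 = 0
xs = np.empty((T, n))
for t in range(T):
    xs[t] = x
    x = x - eta * gs[t]     # x_{t+1} = x_t - eta * g_t

# Comparator: the best fixed point in hindsight over the ball of
# radius D/2 centered at x_1 (a set of diameter D containing x_1).
g_sum = gs.sum(axis=0)
x_star = -(D / 2) * g_sum / np.linalg.norm(g_sum)

regret = float(np.sum(gs * (xs - x_star)))
print(regret, "<=", D * G * np.sqrt(T))  # realized regret vs. the bound
```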
Projected online gradient descent
Another nice point is that this proof immediately carries over to the projected case, unlike the previous one. Say the optimal point $x^\star$ lies in some closed, convex set $C$ with diameter at most $D$; then let $\Pi_C(x)$ denote the projection of $x$ onto $C$, defined

$$
\Pi_C(x) = \operatorname*{argmin}_{z \in C} \|z - x\|.
$$

(We can say the projection, since it is unique by the convexity of $C$.) A fun exercise is to note that, for any two points $x$ and $y$, the projection satisfies

$$
\|\Pi_C(x) - \Pi_C(y)\| \le \|x - y\|.
$$
In other words, the projection is nonexpansive.
Anyways, in this case, note that the gradient update step is given by

$$
x_{t+1} = \Pi_C(x_t - \eta g_t),
$$

i.e., we project back into the set after every gradient update. The proof, using the fact that $\Pi_C$ is nonexpansive, is essentially identical to the previous proof, since

$$
\|x_{t+1} - x^\star\|^2 = \|\Pi_C(x_t - \eta g_t) - \Pi_C(x^\star)\|^2 \le \|x_t - \eta g_t - x^\star\|^2 = \|x_t - x^\star\|^2 - 2\eta g_t^T(x_t - x^\star) + \eta^2 \|g_t\|^2.
$$

(Note the inequality in the second step, which is exactly nonexpansiveness; the first equality uses $x^\star = \Pi_C(x^\star)$, since $x^\star \in C$.) The rest of the proof proceeds identically to the previous one, except that, instead of exact equalities, we have inequalities coming from the nonexpansiveness of $\Pi_C$.
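For concreteness, here is a minimal sketch of the projected update in the special case where $C$ is a Euclidean ball, chosen only because its projection has a simple closed form; the set, losses, and constants are all illustrative assumptions:

```python
import numpy as np

def proj_ball(x, radius):
    """Projection onto C = {z : ||z|| <= radius}; nonexpansive, as above."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

rng = np.random.default_rng(1)
n, T = 5, 1_000
radius = 1.0
D, G = 2 * radius, 1.0  # diameter of C and bound on the gradient norms

gs = rng.normal(size=(T, n))
gs *= G / np.linalg.norm(gs, axis=1, keepdims=True)

eta = D / (G * np.sqrt(T))
x = np.zeros(n)
for t in range(T):
    x = proj_ball(x - eta * gs[t], radius)  # x_{t+1} = Pi_C(x_t - eta g_t)

# Spot-check nonexpansiveness on a few random pairs of points.
for _ in range(5):
    u, v = rng.normal(size=n) * 3, rng.normal(size=n) * 3
    d_proj = np.linalg.norm(proj_ball(u, radius) - proj_ball(v, radius))
    assert d_proj <= np.linalg.norm(u - v) + 1e-12
```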
No bounds on distance
The above proof is kind of interesting: we used the (trivial) inequality

$$
\|x_{T+1} - x^\star\|^2 \ge 0
$$

to get a fairly tight bound on the regret. This 'suggests' (to me, at least) that this bound is likely somewhat tight in many cases; might it be 'close to tight' in general? By this I mean, is it the case that

$$
\|x_{T+1} - x^\star\|^2 \approx 0
$$

for some notion of $\approx$?
Certainly, this is true in the case where the $g_t$ are the subgradients of some fixed function evaluated at the $x_t$ (that is, when $g_t \in \partial f(x_t)$ for some convex $f$) and $x^\star$ is, say, the unique minimizer of $f$. Might it be true in the more general scenario where the $g_t$ are adversarial?
Unfortunately, the answer is no.

Indeed, we can show that there is a (silly) set of gradients such that the distance between the last iterate (or, really, any iterate, or any convex combination of the iterates, such as the average) and the $x^\star$ that maximizes the regret is the diameter of the set.
Let $C$ be any compact convex set with diameter $D$; then there are points $x, y \in C$ such that $\|x - y\| = D$. Suppose the iterates start at $x_1 = x$. Now, let $g_t = 0$ for $t = 1, \dots, T-1$ and $g_T = (G/D)(x - y)$, where $\|g_T\| = G$, which we can do as adversaries. Then the $x^\star$ that maximizes regret is the one that solves

$$
\begin{aligned}
&\text{minimize} && g_T^T u \\
&\text{subject to} && u \in C,
\end{aligned}
$$

with minimizer $x^\star = y$. (If some $u \in C$ had $(x - y)^T(u - y) < 0$, then points on the segment from $y$ to $u$ near $y$ would lie at distance greater than $D$ from $x$, contradicting the diameter bound; so $y$ is indeed the minimizer.) Since every iterate satisfies $x_t = x$, it's game over: $\|x_t - x^\star\| = \|x - y\| = D$. It's clear that any convex (indeed, affine) combination of the $x_t$ is always equal to $x$, so, unlike in the stochastic case, this does not help either.
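This construction is easy to verify numerically. The sketch below (my own illustration, taking $C$ to be a Euclidean ball so that the diameter endpoints and the regret-maximizing comparator have closed forms) plays zero gradients until the final round and confirms that every iterate, and hence their average, sits at distance exactly $D$ from $x^\star$:

```python
import numpy as np

n, T = 3, 100
radius, G = 1.0, 1.0
D = 2 * radius  # diameter of the ball C

x = np.zeros(n)
x[0] = radius               # x_1 = x, one endpoint of a diameter of C
y = -x                      # the other endpoint, so ||x - y|| = D

gs = np.zeros((T, n))
gs[-1] = (G / D) * (x - y)  # g_t = 0 for t < T, and ||g_T|| = G

eta = D / (G * np.sqrt(T))
xs = np.empty((T, n))
xt = x.copy()
for t in range(T):
    xs[t] = xt              # every played iterate equals x
    xt = xt - eta * gs[t]

# The regret-maximizing comparator minimizes g_T^T u over the ball,
# i.e., x_star = -radius * g_T / ||g_T||, which is exactly y.
x_star = -radius * gs[-1] / np.linalg.norm(gs[-1])

print(np.allclose(x_star, y))                       # True
print(np.linalg.norm(xs.mean(axis=0) - x_star), D)  # distance == diameter
```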
Linearity is hard
The main problem with the above is that (a) we have to choose a move before the last gradient is revealed to us, and (b) the maximizers of linear functions are extremely sensitive to the parameters of the problem. This means that even a very small change in the gradients can cause drastic movements in $x^\star$, which prevents us from getting any control over $x^\star$ under these conditions.
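To see (b) concretely, here is a toy illustration (again taking the set to be a Euclidean ball, purely for convenience): perturbing a tiny gradient by an equally tiny amount moves the corresponding optimal point all the way across the set.

```python
import numpy as np

radius = 1.0

def argmin_linear(g):
    """Minimizer of g^T u over the ball {u : ||u|| <= radius}."""
    return -radius * g / np.linalg.norm(g)

g = np.array([1e-9, 0.0])        # a tiny gradient...
g_pert = np.array([-1e-9, 0.0])  # ...and a tiny perturbation of it

print(argmin_linear(g))       # [-1.  0.]
print(argmin_linear(g_pert))  # [ 1.  0.]: x_star jumps across the whole set
```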
One 'simple-ish' technique for getting rid of this sensitivity is to assume that the functions $f_t$ are strongly convex. It is not quite 100% clear to me if, in this case, the last iterate (or, say, the average iterate) is close to the unique minimizer $x^\star$, but it does seem likely.
I unfortunately cannot find an easy reference for this from my (admittedly cursory) glance at the literature, but if there is one, that would be great!