Longest path search

There are exponentially many vectors with small inner product

An interesting observation that Alex stumbled across while reading some of Anthropic's work on LLM interpretability is the following:

Although it's only possible to have $n$ orthogonal vectors in an $n$-dimensional space, it's possible to have $\exp(n)$ many "almost orthogonal" ($\varepsilon$ cosine similarity) vectors in high-dimensional spaces. See the Johnson–Lindenstrauss lemma.

While the JL lemma does indicate that this might be true (ish, in a certain sense), it is not obvious that this statement directly translates to the one given here. (I won't comment on the content of the rest of the linked post, some of which I find somewhat, uh, loose in construction, but I'll focus specifically on this point.)

Why should this not be true?

For example, a simple fact from linear algebra is that, given $m$ vectors $x_1, \dots, x_m \in \mathbf{R}^n$ in $n$-dimensional space, each of which has nonpositive inner product with every other vector, $x_i^Tx_j \le 0$ for $i \ne j$, we have that $m \le 2n$. In other words: no more than $2n$ vectors can have pairwise nonpositive inner products. (The fact that this is tight is very easy to show: pick the unit vectors and their negatives.) A very simple argument also shows that, if we restrict further to have $x_i^Tx_j \le -\varepsilon < 0$ (note the negative!) and normalize the $x_i$ to have $\|x_i\| = 1$, then:

$$m \le \frac{1}{\varepsilon} + 1.$$

(Of course, this bound is useless if $\varepsilon \le 1/(2n-1)$ since it tells us nothing more than what we already knew from the previous bound.¹)
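As a quick numerical aside (a sketch in numpy, just a sanity check and not part of the argument), the tight example of the unit vectors and their negatives is easy to poke at directly:

```python
import numpy as np

# The tight example for the first bound: in R^n, the 2n vectors
# {e_1, ..., e_n, -e_1, ..., -e_n} have pairwise nonpositive inner products.
n = 5
vectors = np.vstack([np.eye(n), -np.eye(n)])   # shape (2n, n)
gram = vectors @ vectors.T                     # all pairwise inner products
off_diagonal = gram[~np.eye(2 * n, dtype=bool)]
print(vectors.shape[0], off_diagonal.max())    # 10 vectors, max inner product 0.0
```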

On the other hand, if the original statement—that we can have exponentially many vectors with slightly positive inner product—is true, it would indicate a phase transition in the number of possible vectors we are allowed to have as the pairwise inner products go from $-\varepsilon$, to zero, to positive. There aren't that many things that undergo dramatic phase transitions like this, but, every once in a while, it does happen!

Idea and proof

Anyways, I'm sure the title of this post gave away the punchline, but indeed the following is true: for any $\varepsilon > 0$ there exists a list of $m$ normalized vectors $x_1, \dots, x_m \in \mathbf{R}^n$ in $n$ dimensions such that $x_i^Tx_j \le \varepsilon$ (for $i \ne j$) where $m$ satisfies

$$m \ge \exp\left(\frac{n\varepsilon^2}{4}\right).$$

A volumetric argument shows that this is tight in $n$ up to constants in the exponent, but that's less fun.

Basic proof sketch

As usual, we will provide a (very silly) randomized construction for the normalized vectors. Pick $m$ vectors $\tilde x_1, \dots, \tilde x_m \in \{\pm 1\}^n$ uniformly at random (with $m$ no larger than the bound we gave) and set $x_i = \tilde x_i/\sqrt{n}$.

Clearly the $x_i$ are normalized, by construction. The only thing left to show is that, with some nonzero probability, these vectors will have small inner product; i.e., $x_i^Tx_j \le \varepsilon$.

Of course, we know that sums of (bounded) independent random variables with mean zero have very strong concentration phenomena in the sense that their sum is on the order of $\sqrt{n}$ with very high probability. (Indeed, the sum really is around that size too.) This, in turn, implies that $(1/n)\tilde x_i^T\tilde x_j = x_i^Tx_j > \varepsilon$ with very low probability for any one pair $i \ne j$. Adding everything up then bounds the probability that any one pair fails to satisfy $x_i^Tx_j \le \varepsilon$, which gives the result.
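Before writing out the details, it's also easy to just sample this construction and eyeball it. A sketch in numpy (the parameters here are purely for illustration):

```python
import numpy as np

# Sample the randomized construction: m sign vectors in {±1}^n, scaled by
# 1/sqrt(n) so each row is a unit vector, then look at the largest
# off-diagonal inner product. With high probability it is at most eps.
rng = np.random.default_rng(0)
n, eps = 200, 0.3
m = int(np.exp(n * eps**2 / 4))     # ~90 vectors for these parameters
x = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(n)
gram = x @ x.T
off_diag = gram[~np.eye(m, dtype=bool)]
print(m, off_diag.max())            # the max is usually below eps
```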

Full(ish) proof

Ok, with that idea, the details are now mostly mechanical, but let's write them out anyways. Here are two simple observations about uniform $\pm 1$ random variables, also known as Rademacher random variables.

  1. The product of two independent Rademacher random variables is also Rademacher.
  2. Expectations of Rademacher variables and functions thereof are very simple to compute.
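Both observations are easy to convince yourself of empirically (again just a sketch, not needed for the proof; the choice of $\lambda = 0.7$ below is arbitrary):

```python
import numpy as np

# (1) The product of two independent sign variables is again a fair sign
# variable, so its empirical mean is near 0.  (2) Expectations are simple:
# e.g., E[exp(lam * Z)] = (e^lam + e^-lam)/2 = cosh(lam) for Rademacher Z.
rng = np.random.default_rng(1)
a = rng.choice([-1, 1], size=1_000_000)
b = rng.choice([-1, 1], size=1_000_000)
product = a * b
lam = 0.7
print(product.mean())                              # near 0
print(np.exp(lam * product).mean(), np.cosh(lam))  # both near cosh(0.7)
```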

Our goal will now be to show that the probability that any two vectors $i \ne j$ have inner product larger than $\varepsilon$ is small. Since $x_i = \tilde x_i/\sqrt{n}$, then

$$\Pr(x_i^Tx_j > \varepsilon) = \Pr(\tilde x_i^T\tilde x_j > n\varepsilon).$$

Let $Z_1, \dots, Z_n \sim \{\pm 1\}$ be uniform and independent. From our observations, note that

$$\Pr(\tilde x_i^T\tilde x_j > n\varepsilon) = \Pr(Z_1 + \dots + Z_n > n\varepsilon).$$

We can multiply by a nonnegative constant $\lambda \ge 0$ and take the exponential of both sides to get

$$\Pr\left(\exp(\lambda(Z_1 + \dots + Z_n)) > \exp(\lambda n\varepsilon)\right) \le \mathbf{E}[\exp(\lambda(Z_1 + \dots + Z_n))]\exp(-\lambda n\varepsilon).$$

(The right hand side inequality is Markov's inequality.) Since the $Z_i$ are independent and identically distributed, note that

$$\mathbf{E}[\exp(\lambda(Z_1 + \dots + Z_n))] = \left(\mathbf{E}[\exp(\lambda Z_1)]\right)^n = \left(\frac{e^\lambda + e^{-\lambda}}{2}\right)^n.$$

Finally, it is not hard to show an upper bound on the right hand side,

$$\frac{e^\lambda + e^{-\lambda}}{2} \le \exp\left(\frac{\lambda^2}{2}\right),$$

and using this upper bound, putting everything together gives that, for any $\lambda \ge 0$,

$$\Pr(x_i^Tx_j > \varepsilon) \le \exp\left(n\left(\frac{\lambda^2}{2} - \lambda\varepsilon\right)\right).$$
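(As an aside, the upper bound $(e^\lambda + e^{-\lambda})/2 \le \exp(\lambda^2/2)$ used here follows by comparing Taylor series term by term, since $(2k)! \ge 2^k k!$. A quick numerical check of it, which is of course not a proof:)

```python
import numpy as np

# Check (e^lam + e^-lam)/2 <= exp(lam^2/2) on a grid of lambda values.
lams = np.linspace(-5, 5, 1001)
lhs = np.cosh(lams)             # (e^lam + e^-lam)/2
rhs = np.exp(lams**2 / 2)
assert np.all(lhs <= rhs)
print(np.max(lhs / rhs))        # <= 1, with equality at lam = 0
```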

Setting $\lambda = \varepsilon$ gives the result

$$\Pr(x_i^Tx_j > \varepsilon) \le \exp\left(-\frac{n\varepsilon^2}{2}\right).$$
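A quick Monte Carlo sanity check of this tail bound (a sketch; the parameters are made up for illustration):

```python
import numpy as np

# Empirical frequency of {Z_1 + ... + Z_n > n*eps} versus the Chernoff
# bound exp(-n*eps^2/2); the bound is loose, but sits above the frequency.
rng = np.random.default_rng(2)
n, eps, trials = 100, 0.2, 50_000
sums = rng.choice([-1, 1], size=(trials, n)).sum(axis=1)
empirical = (sums > n * eps).mean()
bound = np.exp(-n * eps**2 / 2)
print(empirical, bound)
```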

Since there are $m(m-1)/2$ possible pairs of $i, j$ with $i \ne j$, we then have that the probability that any one of the pairs $i, j$ has inner product larger than $\varepsilon$ is bounded from above by

$$\frac{m(m-1)}{2}\exp\left(-\frac{n\varepsilon^2}{2}\right) < m^2\exp\left(-\frac{n\varepsilon^2}{2}\right) \le 1,$$

where the right hand side inequality holds for any $m \le \exp(n\varepsilon^2/4)$.

Equivalently, there is nonzero probability that a given sampled set of $m$ normalized vectors $x_1, \dots, x_m$ has all inner products no larger than $\varepsilon$ for any choice of $m \le \exp(n\varepsilon^2/4)$. So, choose $m = \exp(n\varepsilon^2/4)$ (or the largest integer no larger than this bound).

Since the resulting probability is nonzero, then there exists at least one such set of vectors for which the claim is true, which proves the desired result.

Twitter discussion after posting

Damek pointed out that Terry Tao has some notes on the case where we want the inner product to lie in a band between $-\varepsilon$ and $\varepsilon$; the bounds are essentially the same in this case, for the same reasons as above. (Terry points out slightly better packings using some neat algebraic geometric tricks.)

GaussianProcess points out that there is a slightly better construction in certain regimes of $\varepsilon$ using Reed–Solomon codes. One (simple) version of this is to note that there is a near-equivalence between our vectors $x_i$ and codes over binary fields, where the inner product is "almost" the Hamming distance between two binary vectors. (The linked paper mentions but does not use this construction directly, but it, too, would suffice!) 0xWave and others also point out that this is linked to codes, and yes, similar constructions are used as both possibility and impossibility results, including the Johnson and Plotkin bounds, etc.

Both Nick White and Nick Yoder point out that this might be (or is?) related to Johnson–Lindenstrauss, and, while I agree that it is related, I don't think these statements obviously map one-to-one. In particular, I see JL as a possibility result on the necessary number of dimensions needed to faithfully represent some number of vectors (that live in some higher dimensional space). This, on the other hand, is a possibility result that there exist some number of vectors in a low dimensional space that are, in some sense, maximally faithfully representable. I would love it if there is some mapping between the two statements, but this is not obvious to me! (Though they do result from the same "asymptotic"/high-dimensional behavior.)


  1. A proof of this is very easy by considering the nonnegative quantity $0 \le \left\|\sum_i x_i\right\|^2$.