dKS

dKS dominating rectangle concept

dKS compares two distributions — P (blue) and Q (red) — by the maximizing dominating rectangle (green), anchored at a point z.

dKS extends the classical Kolmogorov–Smirnov distance to multiple dimensions. It is joint work with Peter M. Jacobs and Jeff M. Phillips — authors are listed alphabetically, following the convention in theory — and is described in our paper, Efficient and Stable Multi-Dimensional Kolmogorov–Smirnov Distance.

Project website Read the paper — arXiv:2504.11299 Code on GitHub

Why another distance?

Comparing two multi-dimensional distributions is delicate when the axes carry different units — height versus weight, or temperature versus pressure. Distances built on a Euclidean ground metric, such as Maximum Mean Discrepancy or Wasserstein, change in meaning when you change units, and the usual remedies — normalizing to zero mean and unit variance, or rescaling each axis to [0, 1] — are data-dependent, sensitive to outliers, and shift as new data arrives. The one-dimensional KS distance avoids all of this because it depends only on the order of values, not their scale. A clean, efficient multi-dimensional version, however, had been missing.

The dKS distance

dKS measures the largest gap between the two distributions’ cumulative distribution functions over dominating rectangles — axis-aligned regions anchored at a single corner. We show this formulation is an integral probability metric, and in fact a true metric: it satisfies the strong identity property, so two distributions are at distance zero only when they are identical. Like the one-dimensional KS distance, it is invariant to the choice of units on each axis.

What the paper establishes

  • Linear sample complexity. Estimating dKS from a sample needs only about (1/ε²)(d + log(1/δ)) points — linear in the dimension d.
  • Near-linear computation in 2–4D. dKS can be ε-approximated in time near-linear in the sample size for d = 1, 2, 3, 4 — O(n log n) in two dimensions, matching the classic one-dimensional case, with poly-logarithmic factors in three and four dimensions. Beyond four dimensions, a near-linear algorithm would refute a widely held complexity conjecture, so this range is essentially the limit.
  • A finite-sample-valid two-sample test. From these algorithms we derive a two-sample hypothesis test that runs in those near-linear times and provides a guaranteed, finite-sample upper bound on the false-rejection (Type I) probability — not the asymptotic or Monte-Carlo approximation that earlier multi-dimensional KS tests relied on.

The most widely used multi-dimensional KS variant, quad-KS, anchors its rectangles at the data points. We show this makes it unstable: adding or removing a single point can change the distance substantially, so no sample-complexity guarantee is possible for it. dKS avoids this by construction — which is precisely what enables the guarantees above.

How fast, in practice

On a two-dimensional benchmark, our O(n log n) algorithm is compared against an O(n²) baseline that checks every candidate rectangle:

dKS runtime and error up to ~16,000 samples dKS runtime and error up to ~131,000 samples

Runtime (left) and observed error (right) versus sample size, for our algorithm and the baseline. Top row: up to ~16,000 samples — the baseline climbs to about 3.5 seconds while ours stays under 0.2 seconds. Bottom row: out to ~131,000 samples — the baseline grows quadratically into the minutes while ours stays nearly flat. Accuracy is the same throughout.

At roughly 16,000 samples our method finishes in under 0.2 seconds while the baseline takes about 3.5 seconds. Pushing to roughly 131,000 samples, ours runs in about a second versus more than five minutes for the baseline — with no loss in accuracy. The current implementation is straightforward, unoptimized Python.