dKS | Foad Namjoo

dKS extends the classical Kolmogorov–Smirnov distance to multiple dimensions. It is joint work with Peter M. Jacobs and Jeff M. Phillips (authors are listed alphabetically, following the convention in theoretical computer science) and is described in our paper, Efficient and Stable Multi-Dimensional Kolmogorov–Smirnov Distance.

Project website · Read the paper (arXiv:2504.11299) · Code on GitHub · Poster (PDF)

The problem

Suppose you weigh a colony of penguins and want to know how different two groups are. Record their body mass in grams or in kilograms and it is the exact same information — yet ask most standard tools to compare the groups and they hand back a different answer just because you switched the units. The comparison ends up measuring your choice of ruler instead of the penguins. And this is not a toy worry: science runs on comparing distributions — lab measurements, clinical cohorts, the data drifting through a production ML system — and a comparison that flinches when the units change cannot be trusted to tell you what actually changed.

How I solved it

In one dimension, statistics solved this back in 1933: the Kolmogorov–Smirnov distance reads only the order of the values, so the ruler drops out of the answer entirely. For ninety years, that guarantee never properly survived the trip to higher dimensions, until dKS carried it across. Switch grams to kilograms and the dKS answer is exactly unchanged, while a Gaussian-kernel MMD jumps about 21% on the Palmer penguins (see the demos). dKS is a true metric with provable guarantees, and the algorithm makes it practical: what took the brute-force baseline 4.2 hours at about a million samples, dKS answers in a fifth of a second, about 76,000× faster.

The Kolmogorov–Smirnov distance

Figure: The 1-D KS distance is the largest vertical gap (dashed, about 0.26 here) between two CDFs: a smooth theoretical reference (blue) and an empirical step curve (red).

In one dimension, the Kolmogorov–Smirnov (KS) distance compares two distributions through their cumulative distribution functions (CDFs) — the running fraction of each sample at or below every point along the axis. Sweeping across that axis, it records the single largest vertical gap between the two curves; that maximum gap is the distance, and it underpins the classic one- and two-sample KS tests. Because it reads only the order of the values, never their scale, it is naturally invariant to the units on the axis.
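As a minimal sketch of the definition above (not the library's code), the two-sample version can be written in a few lines of NumPy. The function name `ks_distance` and the body-mass values are made up for illustration.

```python
import numpy as np

def ks_distance(a, b):
    """Two-sample KS distance: largest gap between the two empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # The gap can only change at a data point, so checking those is enough.
    points = np.concatenate([a, b])
    # Empirical CDF at x = fraction of the sample that is <= x.
    cdf_a = np.searchsorted(a, points, side="right") / a.size
    cdf_b = np.searchsorted(b, points, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Made-up body masses in grams (illustrative only).
grams_a = np.array([3200.0, 3400.0, 3650.0, 3800.0])
grams_b = np.array([3650.0, 3800.0, 4100.0, 4300.0])

print(ks_distance(grams_a, grams_b))                    # grams
print(ks_distance(grams_a / 1000.0, grams_b / 1000.0))  # kilograms
```

Both calls print the same number, because dividing by 1000 changes no comparison between values and the distance only reads those comparisons.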

Why dKS?

When you compare two multi-dimensional distributions, the answer shouldn’t depend on the units you happened to record the data in — body mass in grams or in kilograms is the same information. Yet most distances quietly disagree. Anything built on a Euclidean ground metric — Maximum Mean Discrepancy, Wasserstein — blends the axes together, so rescaling one axis re-weights it and shifts the distance. The usual fix, normalizing each axis (zero mean / unit variance, or rescaling to [0, 1]), only defers the problem: it is data-dependent, sensitive to outliers, and changes as new data arrives.

The one-dimensional KS distance never had this trouble, because it depends only on the order of values along an axis, not their scale. dKS carries that property into multiple dimensions — measuring the largest gap between the two distributions’ cumulative distribution functions over dominating rectangles. The result is unit-invariant by construction and, unlike the popular heuristics, a genuine metric with provable guarantees.

In our own experiments the effect is concrete: on the Palmer penguins, switching body mass from grams to kilograms leaves dKS exactly unchanged while a Gaussian-kernel MMD jumps about 21%; on NHANES height and weight, switching between metric and US units leaves dKS unchanged while MMD drops about 39% — exactly the units artifact dKS is built to avoid. These demos and figures are on the project page.
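The mechanism behind that artifact can be sketched on synthetic data. The example below does not reproduce the penguin or NHANES numbers: the data are random draws, `gaussian_mmd2` is an illustrative name, and the bandwidth rule (median pairwise distance of the pooled sample) is an assumption, since the bandwidth used in the demos is not stated here.

```python
import numpy as np

def gaussian_mmd2(X, Y):
    """Biased estimate of squared MMD with a Gaussian kernel.

    Assumption: bandwidth = median pairwise Euclidean distance of the pooled
    sample (a common heuristic, not necessarily the setting used in the demos).
    """
    Z = np.vstack([X, Y])
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    # Median of squared distances equals the squared median distance.
    sigma2 = np.median(sq_dists[np.triu_indices_from(sq_dists, k=1)])
    K = np.exp(-sq_dists / (2.0 * sigma2))
    n = len(X)
    return float(K[:n, :n].mean() + K[n:, n:].mean() - 2.0 * K[:n, n:].mean())

# Synthetic stand-in: column 0 is a unit-scale feature, column 1 is a mass in grams.
rng = np.random.default_rng(0)
n = 200
P = np.column_stack([rng.normal(0.0, 1.0, n), rng.normal(4000.0, 500.0, n)])
Q = np.column_stack([rng.normal(0.0, 1.0, n), rng.normal(4500.0, 500.0, n)])

to_kg = np.array([1.0, 1e-3])          # rescale only the mass column
mmd_grams = gaussian_mmd2(P, Q)
mmd_kg = gaussian_mmd2(P * to_kg, Q * to_kg)
print(mmd_grams, mmd_kg)               # same data, different units, different value
```

In grams the mass axis dominates every Euclidean distance; in kilograms the other feature does, so the kernel weighs the same shift differently.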

Multi-dimensional two-sample testing is also the primitive behind production data-drift monitoring — comparing training and serving distributions across several features at once. There, unit-invariance removes the fragile per-feature normalization step, the finite-sample guarantees give calibrated alarms instead of heuristic thresholds, and near-linear runtime keeps the check cheap at millions of samples.

The dKS distance

Figure: dKS compares two 2-D point sets, P (blue) and Q (red), through the maximizing dominating rectangle (green): the axis-aligned range, anchored at a point z, where the two distributions disagree most.

dKS measures the largest gap between the two distributions’ cumulative distribution functions over dominating rectangles — axis-aligned regions anchored at a single corner. We show this formulation is an integral probability metric, and in fact a true metric: it satisfies the strong identity property, so two distributions are at distance zero only when they are identical. Like the one-dimensional KS distance, it is invariant to the choice of units on each axis.
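To make the definition concrete, here is a brute-force sketch for two finite 2-D samples. It assumes the dominating rectangle anchored at z is the set of points that are at most z in every coordinate, and it is not the library's `dks::exact`; the name `dks_bruteforce_2d` is illustrative. For empirical distributions it is enough to try anchors whose coordinates come from the pooled data, because moving z between data coordinates changes neither CDF.

```python
import numpy as np

def dks_bruteforce_2d(P, Q):
    """Largest gap between two empirical 2-D CDFs over dominating rectangles.

    Assumption: the rectangle anchored at z = (zx, zy) is {(x, y): x <= zx, y <= zy}.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Candidate anchor coordinates: every x and every y seen in either sample.
    xs = np.unique(np.concatenate([P[:, 0], Q[:, 0]]))
    ys = np.unique(np.concatenate([P[:, 1], Q[:, 1]]))

    def cdf_on_anchors(S):
        in_x = (S[:, 0][:, None] <= xs[None, :]).astype(float)  # (n, len(xs))
        in_y = (S[:, 1][:, None] <= ys[None, :]).astype(float)  # (n, len(ys))
        # Entry (i, j) counts points dominated by the anchor (xs[i], ys[j]).
        return in_x.T @ in_y / len(S)

    return float(np.max(np.abs(cdf_on_anchors(P) - cdf_on_anchors(Q))))

P = np.array([[0.1, 0.2], [0.5, 0.7]])
Q = np.array([[0.2, 0.2], [0.6, 0.8]])
units = np.array([1000.0, 0.001])      # rescale each axis independently

print(dks_bruteforce_2d(P, Q))
print(dks_bruteforce_2d(P * units, Q * units))  # unchanged: only comparisons matter
```

This evaluates every anchor against every point, so it is only meant for small inputs.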

How dKS is computed (in 2D)

Draw the two samples, then sort the pooled points and lay down a grid of evenly spaced cut points, about √k along each axis. Snap every point to its grid cell, and read off the largest gap between the two empirical CDFs over the grid’s rectangles. Snapping to the grid adds error only at the level of the sampling noise, yet it is far faster than checking every candidate rectangle, which is what makes dKS practical beyond one dimension.
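The steps above can be sketched roughly as follows. This is one reading of the description, not the paper's dKS-Sketch or the library's `dks::approx`: the function name `dks_grid_2d` is illustrative, the grid size `g` is a hypothetical parameter left to the caller, "evenly spaced" is taken to mean evenly spaced in sorted order, and the dominating rectangle anchored at z is assumed to be everything at most z in each coordinate.

```python
import numpy as np

def dks_grid_2d(P, Q, g):
    """Largest empirical-CDF gap over a g-by-g grid of anchors (illustrative sketch)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    pooled = np.vstack([P, Q])
    m = len(pooled)
    g = min(g, m)
    # Ranks ceil(i * m / g) - 1 for i = 1..g: evenly spaced in sorted order,
    # ending at the maximum so every point lands in some cell.
    ranks = (np.arange(1, g + 1) * m + g - 1) // g - 1
    cuts_x = np.sort(pooled[:, 0])[ranks]
    cuts_y = np.sort(pooled[:, 1])[ranks]

    def cdf_on_grid(S):
        # Snap each point to the first cut at or above it on each axis.
        ix = np.searchsorted(cuts_x, S[:, 0], side="left")
        iy = np.searchsorted(cuts_y, S[:, 1], side="left")
        counts = np.zeros((g, g))
        np.add.at(counts, (ix, iy), 1.0)
        # 2-D prefix sums turn per-cell counts into counts of dominated points.
        return counts.cumsum(axis=0).cumsum(axis=1) / len(S)

    return float(np.max(np.abs(cdf_on_grid(P) - cdf_on_grid(Q))))

rng = np.random.default_rng(0)
P = rng.normal(size=(2000, 2))
Q = rng.normal(size=(2000, 2)) + 0.3
for g in (8, 32, 128):
    print(g, dks_grid_2d(P, Q, g))
```

The cost is two sorts plus work proportional to the number of grid cells. Because the CDF values at the cut points are exact, a coarser grid can only miss the best anchor, so the result never exceeds the value obtained with every pooled coordinate as a cut.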

What the paper establishes

Linear sample complexity. Estimating dKS from a sample needs only about (1/ε²)(d + log(1/δ)) points — linear in the dimension d.
Near-linear computation in 2–4D. dKS can be ε-approximated in time near-linear in the sample size for d = 1, 2, 3, 4 — O(n log n) in two dimensions, matching the classic one-dimensional case, with poly-logarithmic factors in three and four dimensions. Beyond four dimensions, a near-linear algorithm would refute a widely held complexity conjecture, so this range is essentially the limit.
A finite-sample-valid two-sample test. From these algorithms we derive a two-sample hypothesis test that runs in those near-linear times and provides a guaranteed, finite-sample upper bound on the false-rejection (Type I) probability — not the asymptotic or Monte-Carlo approximation that earlier multi-dimensional KS tests relied on.
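To get a feel for how the sample-complexity expression scales, the sketch below evaluates (1/ε²)(d + log(1/δ)) for a few dimensions. The leading constant is not given here, so `C` is a hypothetical placeholder, and the logarithm is assumed to be natural; the printed values show growth rates, not required sample counts.

```python
import math

def sample_size_scale(eps, delta, d, C=1.0):
    """Evaluate C * (d + ln(1/delta)) / eps^2.

    C is a hypothetical placeholder for the unspecified constant, and the
    natural logarithm is an assumption.
    """
    return C * (d + math.log(1.0 / delta)) / eps ** 2

for d in (1, 2, 3, 4):
    print(d, round(sample_size_scale(eps=0.05, delta=0.01, d=d)))
```

Each extra dimension adds the same amount, C/ε², while halving ε multiplies the whole expression by four.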

Stable where the popular heuristic is not

The most widely used multi-dimensional KS variant, quad-KS, anchors its rectangles at the data points. We show this makes it unstable: adding or removing a single point can change the distance substantially, so no sample-complexity guarantee is possible for it. dKS avoids this by construction — which is precisely what enables the guarantees above.

The library

dKS ships as a small, fast open-source library — not just paper pseudocode:

Header-only C++ core — include/dks/dks.hpp implements both algorithms from the paper: dks::exact (dKS-Baseline, O(n²)) and dks::approx (dKS-Sketch, O(n log n)).
Python bindings (pybind11) expose the same two functions on NumPy arrays — pip install ., then import dks.
Tested — correctness tests check both algorithms against an independent O(n³) brute-force reference, including unequal set sizes and duplicate points.
CLI — a command-line tool compares two point files directly, no code required.

import numpy as np
import dks

P = np.array([[0.1, 0.2], [0.5, 0.7]])
Q = np.array([[0.2, 0.2], [0.6, 0.8]])

dks.approx(P, Q)        # dKS-Sketch: fast O(n log n)
dks.exact(P, Q)         # dKS-Baseline: brute force, exact for P, Q

How fast, in practice

On a two-dimensional benchmark, our O(n log n) algorithm is compared against an O(n²) baseline that checks every candidate rectangle:

Figure: Runtime (left) and observed error (right) versus sample size, out to about 1.05 million samples. The baseline grows quadratically, reaching roughly 4.2 hours per evaluation, while dKS-Sketch stays under 0.2 seconds throughout: a factor of about 76,000× at the largest size (4.2 hours is 15,120 seconds, and 15,120 / 0.2 = 75,600), and the gap widens with n. Accuracy is comparable: both methods' observed error drops below 0.002 at this scale. This benchmark ran straightforward, unoptimized Python; the released library above is C++.