pyscan

Spatial statistics · Python

Sampling for region-aggregated spatial scan statistics — replacing each region with points sampled from its geometry so fast point-based scan algorithms scale.

01 · The map

3,711 U.S. map regions, each with a rate — and a small anomalous region hidden among them.

02 · The centroid heuristic

The standard fix collapses each county to a single point — and loses its spatial extent.

03 · Geom-50 sampling

Instead, replace each county with 50 random points, its count split evenly across them.

04 · Points only

Drop the boundaries — a weighted point cloud is all the scan needs.

05 · The scan

Rectangles of every size sweep the cloud, maximizing the elevated-rate score — until one locks onto the anomaly.

06 · The power

Across signal strengths, Geom-50 recovers the planted region where the centroid shortcut can't; a lower Jaccard distance is closer to the truth.

Code on GitHub · Read our paper · pyscan library

The problem. Public-health and crime maps almost never come with addresses — you get one total for a whole county or zip code, and some counties are the size of a small state. The fastest algorithms for spotting a hotspot need actual points on the map, so the standard shortcut squashes each county down to a single dot at its center. That dot erases the county’s real shape and size — and a disease outbreak that a health department finds late, or not at all, is measured in lives, not in decimals of statistical power.

How we solved it. Our fix is almost embarrassingly simple: instead of one dot at the center, scatter 20–50 points across the county’s actual shape and share its case count evenly among them. The map gets its geography back, and the fast algorithms run unchanged on the resulting points. The evidence that it works: given nothing but county totals of Valley Fever cases, the method redraws California’s San Joaquin Valley endemic region (a region fixed by soil ecology) far more faithfully than the centroid shortcut. It is also fast: ~3,000× faster than connected-region methods (0.33 s vs. 1,109 s per discovery on the continental-U.S. map, which has 3,711 regions spanning 3,108 counties, some split by water).

Spatial scan statistics are a core tool for anomaly detection in geospatial data — locating regions where a measured quantity (disease cases, crime, and so on) is significantly elevated relative to a baseline. The most efficient scan algorithms operate on point data, but real-world data is usually aggregated into predefined regions such as census tracts, zip codes, or counties. The standard workaround, used by widely adopted tools like SaTScan, collapses each region to its centroid — convenient, but it discards the region’s spatial extent and substantially reduces statistical power.
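To make "significantly elevated relative to a baseline" concrete, here is a minimal brute-force sketch of a rectangle scan over weighted points. It is an illustration only, not pyscan's algorithm or API: the function names are hypothetical, the score is assumed to be a Kulldorff-style Poisson log-likelihood ratio on the fractions of measured and baseline mass inside the rectangle, and the exhaustive search is far slower than the fast point-based algorithms discussed here.

```python
import math
from itertools import combinations_with_replacement

def kulldorff(m, b):
    """Score for a region holding fraction m of measured and b of baseline mass.

    Assumed form: m*log(m/b) + (1-m)*log((1-m)/(1-b)), counted only when the
    rate inside is elevated (m > b).
    """
    if b <= 0.0 or b >= 1.0 or m <= b:
        return 0.0
    score = m * math.log(m / b)
    if m < 1.0:
        score += (1.0 - m) * math.log((1.0 - m) / (1.0 - b))
    return score

def brute_force_rectangle_scan(points):
    """points: list of (x, y, measured, baseline).

    Tries every axis-aligned rectangle whose sides pass through point
    coordinates and returns (best_score, (x_lo, x_hi, y_lo, y_hi)).
    """
    total_m = sum(p[2] for p in points)
    total_b = sum(p[3] for p in points)
    xs = sorted({p[0] for p in points})
    ys = sorted({p[1] for p in points})
    best = (0.0, None)
    for x_lo, x_hi in combinations_with_replacement(xs, 2):
        for y_lo, y_hi in combinations_with_replacement(ys, 2):
            in_m = in_b = 0.0
            for x, y, m, b in points:
                if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
                    in_m += m
                    in_b += b
            score = kulldorff(in_m / total_m, in_b / total_b)
            if score > best[0]:
                best = (score, (x_lo, x_hi, y_lo, y_hi))
    return best

# Toy data: a 5x5 grid with unit baseline and a planted 2x2 block at 5x the rate.
hot = {(1, 2), (2, 2), (1, 3), (2, 3)}
grid = [(x, y, 5.0 if (x, y) in hot else 1.0, 1.0)
        for x in range(5) for y in range(5)]
best_score, best_rect = brute_force_rectangle_scan(grid)
print(best_score, best_rect)
```

On this toy grid the highest-scoring rectangle is the planted block. Any weighted point cloud, including one produced by sampling regions, can be passed in the same `(x, y, measured, baseline)` form.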

This work proposes a simple, scalable alternative: replace each region with 20–50 points sampled uniformly from its geometry, spreading the region’s baseline and measured values evenly across them (Geom-k). It preserves the region’s spatial structure while staying fully compatible with fast point-based scan algorithms, and pairs with pyscan’s C++ backend and adaptive gridding so that even a 50× increase in points adds little runtime.
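The conversion can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the helper names are hypothetical, the region is a single simple polygon given as a vertex list, and uniform sampling is done by rejection from the bounding box.

```python
import random

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon (list of vertices)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def geom_k_sample(polygon, measured, baseline, k=50, rng=None):
    """Return k tuples (x, y, measured/k, baseline/k) drawn uniformly from polygon."""
    if rng is None:
        rng = random.Random()
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    points = []
    while len(points) < k:
        # Rejection sampling: draw in the bounding box, keep draws that land inside.
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        if point_in_polygon(x, y, polygon):
            points.append((x, y, measured / k, baseline / k))
    return points

# A toy L-shaped "county" with 120 cases against a baseline of 1,000.
county = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
cloud = geom_k_sample(county, measured=120.0, baseline=1000.0, k=50,
                      rng=random.Random(0))
print(len(cloud), sum(p[2] for p in cloud), sum(p[3] for p in cloud))
```

The L-shape shows why the centroid can mislead: its area centroid sits at roughly (1.36, 1.36), which is outside the region, while every sampled point lies inside it and the totals are preserved.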

A convergence analysis shows that the recovery error shrinks in proportion to 1/√k, where k is the number of sample points per region, and, perhaps surprisingly, that as a map is divided into more regions, fewer sample points per region are needed. Across six datasets (NYC zip codes and the counties of Arkansas, Utah, California, Georgia, and the continental U.S.), the method recovers planted anomalies at far smaller effect sizes than the centroid baseline, while running ~3,000× faster than connected-region methods like FlexScan (0.33 s vs. 1,109 s per discovery on the 3,711-region continental-U.S. dataset).
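One way to build intuition for the 1/√k rate (an illustration, not the paper's analysis): when a query rectangle cuts through a region, the share of that region's k sample points landing inside the rectangle is a binomial estimate of the true area fraction p, with standard error √(p(1−p)/k). The small simulation below assumes a unit-square region and a rectangle covering the strip x < 0.3; the function name and parameters are hypothetical.

```python
import math
import random

def rms_error(k, p=0.3, trials=2000, seed=0):
    """RMS error of the sampled fraction of a unit-square region inside the strip x < p."""
    rng = random.Random(seed)
    squared = 0.0
    for _ in range(trials):
        # Only the x-coordinate matters for membership in the strip.
        inside = sum(1 for _ in range(k) if rng.random() < p)
        squared += (inside / k - p) ** 2
    return math.sqrt(squared / trials)

results = {k: rms_error(k) for k in (25, 100, 400)}
for k, err in results.items():
    theory = math.sqrt(0.3 * 0.7 / k)
    print(f"k={k:4d}  simulated={err:.4f}  sqrt(p(1-p)/k)={theory:.4f}")
```

Each fourfold increase in k should roughly halve the error, which is the 1/√k behavior in miniature.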

On a real public-health dataset — county-level Valley Fever incidence in California — it recovers the known San Joaquin Valley endemic region much more accurately than the centroid approach, approaching the best overlap an axis-aligned rectangle can achieve.
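Overlap between a recovered region and the known one can be quantified with the Jaccard index. The sketch below assumes overlap is measured on sets of region identifiers; an area-weighted version is equally plausible. The function name and the labels are hypothetical.

```python
def jaccard_index(found, truth):
    """|found ∩ truth| / |found ∪ truth|: 1.0 means identical, 0.0 means disjoint."""
    found, truth = set(found), set(truth)
    union = found | truth
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(found & truth) / len(union)

# Hypothetical labels: 3 regions shared out of 6 distinct regions overall.
found_regions = {"A", "B", "C", "D"}
true_regions = {"B", "C", "D", "E", "F"}
print(jaccard_index(found_regions, true_regions))  # 0.5
```

The corresponding Jaccard distance is one minus this value, so a higher index and a lower distance both mean a closer match.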

Figure: California Valley Fever. The sampling-based method (Geom-k) recovers the San Joaquin Valley endemic region far better than the centroid baseline. Right: standardized morbidity ratio (observed/expected cases) by county.

We recommend this sampling-based conversion as the default way to apply point-based spatial scan statistics to region-aggregated data. The companion repository contains the Python experiments and figure-rendering scripts that reproduce every figure and runtime table in the paper, built on pyscan’s C++ backend.