pyscan

Sampling for Region-Aggregated Spatial Scan Statistics

Spatial scan statistics are a core tool for anomaly detection in geospatial data — locating regions where a measured quantity (disease cases, crime, and so on) is significantly elevated relative to a baseline. The most efficient scan algorithms operate on point data, but real-world data is usually aggregated into predefined regions such as census tracts, zip codes, or counties. The standard workaround, used by widely adopted tools like SaTScan, collapses each region to its centroid — convenient, but it discards the region’s spatial extent and substantially reduces statistical power.

This work proposes a simple, scalable alternative: replace each region with k points, typically 20–50, sampled uniformly from its geometry, and split the region’s baseline and measured values evenly among them (Geom k). The conversion preserves each region’s spatial structure while staying fully compatible with fast point-based scan algorithms, and it pairs with pyScan’s C++ backend and adaptive gridding so that even a 50× increase in points adds little runtime.
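The conversion described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not pyScan's implementation: the function names `point_in_polygon` and `sample_region`, the L-shaped example region, and its values are all hypothetical, and rejection sampling inside the bounding box is just one simple way to draw uniform points from a polygon.

```python
import random


def point_in_polygon(x, y, polygon):
    """Even-odd ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_region(polygon, measured, baseline, k, rng):
    """Replace one region by k points, each carrying 1/k of the region's values.

    Points are drawn by rejection sampling: propose uniformly in the bounding
    box and keep only the proposals that fall inside the polygon.
    """
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    points = []
    while len(points) < k:
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        if point_in_polygon(x, y, polygon):
            points.append((x, y, measured / k, baseline / k))
    return points


# Hypothetical L-shaped region with 12 measured cases and a baseline of 300.
region = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 3), (0, 3)]
points = sample_region(region, measured=12.0, baseline=300.0, k=20,
                       rng=random.Random(0))

for x, y, m, b in points[:3]:
    print(f"({x:.2f}, {y:.2f})  measured={m:.2f}  baseline={b:.2f}")
```

Each output tuple is a weighted point, so the region's totals are preserved exactly while its mass is spread over its actual footprint. A point-based scan algorithm can then consume the union of these tuples across all regions.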

Region-to-point sampling on Arkansas counties: each county is replaced by k points drawn uniformly from its polygon (Geom 10, left; Geom 50, right).

A convergence analysis shows the recovery error shrinks proportionally to 1/√k, and — perhaps surprisingly — that as a map is divided into more regions, fewer sample points per region are needed. Across six datasets (NYC zip codes and the counties of Arkansas, Utah, California, Georgia, and the continental U.S.), the method recovers planted anomalies at far smaller effect sizes than the centroid baseline, while running orders of magnitude faster than connected-region methods like FlexScan.
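The 1/√k rate can be illustrated with a toy Monte Carlo experiment. This is not the paper's convergence analysis, only a sketch under simplifying assumptions: a single region (the unit square), a single query range (the strip x < 0.3, which covers 30% of the region), and a hypothetical helper `rms_error`. The sampled estimate of the region's mass inside the strip is then a binomial proportion, whose standard error is √(p(1−p)/k).

```python
import math
import random


def rms_error(k, true_fraction, trials, rng):
    """RMS error of the k-sample estimate of a region's mass inside a query range.

    The region is the unit square and the query is the strip x < true_fraction,
    so only the x-coordinate of each uniform sample matters.
    """
    total = 0.0
    for _ in range(trials):
        hits = sum(1 for _ in range(k) if rng.random() < true_fraction)
        total += (hits / k - true_fraction) ** 2
    return math.sqrt(total / trials)


rng = random.Random(1)
p = 0.3  # fraction of the region covered by the query range (assumed)
errors = {k: rms_error(k, p, trials=4000, rng=rng) for k in (10, 40, 160)}

for k, err in errors.items():
    predicted = math.sqrt(p * (1 - p) / k)
    print(f"k={k:4d}  empirical={err:.4f}  sqrt(p(1-p)/k)={predicted:.4f}")
```

Quadrupling k should roughly halve the error, which is the practical meaning of the 1/√k rate: going from 10 to 40 points per region buys as much relative accuracy as going from 40 to 160.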

On a real public-health dataset — county-level Valley Fever incidence in California — it recovers the known San Joaquin Valley endemic region much more accurately than the centroid approach, approaching the best overlap an axis-aligned rectangle can achieve.

California Valley Fever: the sampling-based method (Geom k) recovers the San Joaquin Valley endemic region far better than the centroid baseline. Right: standardized morbidity ratio (observed/expected cases) by county.

We recommend this sampling-based conversion as the default way to apply point-based spatial scan statistics to region-aggregated data.

Authors: Foad Namjoo, Drew McClelland, Michael Matheny, Jeff M. Phillips
Library: pyScan · Paper: under review (arXiv link coming soon)