Public LLM Datasets

I create and release contrast datasets for interpretability and activation steering — sets of minimal pairs that isolate a single behavioral axis while holding meaning fixed, so they can be used both to find steering directions and to evaluate control. The conciseness and positivity sets were built as part of a research collaboration with Martian.
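As a rough illustration of how minimal pairs can yield a steering direction, the sketch below uses a difference of means over activations. This is a common baseline, not necessarily the estimator used with these datasets. The toy activation values and the names `steering_direction` and `inject` are hypothetical, chosen only for this example. In practice the vectors would be residual-stream activations read from a chosen layer of the model.

```python
from typing import List, Sequence

Vector = List[float]


def mean(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_direction(pos_acts: Sequence[Vector], neg_acts: Sequence[Vector]) -> Vector:
    """Difference of means: mean(positive side) minus mean(negative side)."""
    mu_pos = mean(pos_acts)
    mu_neg = mean(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def inject(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add alpha * direction to one hidden state (a residual-stream injection)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (made up): one vector per side of each minimal pair.
concise_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
verbose_acts = [[0.0, 1.0, 2.0], [2.0, 1.0, 2.0]]

direction = steering_direction(concise_acts, verbose_acts)

# Sweep the steering strength alpha on a single toy hidden state.
hidden_state = [0.5, 0.5, 0.5]
steered = [inject(hidden_state, direction, alpha) for alpha in (0.0, 1.0, 2.0)]
```

Because each pair holds meaning fixed, the difference of means should mostly cancel content and leave the behavioral axis. Setting `alpha` to zero leaves the hidden state unchanged, which gives a natural baseline for a sweep.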

These pairs drive activation steering: residual-stream injections, swept over layer and strength α, that steer behaviors such as verbosity and tone in models such as Llama 3.1 and Qwen-2.5, with evaluation by a guardrailed LLM judge and a JSON schema. For how corruption in steering datasets degrades that control, and how robust estimation mitigates it, see Understanding and Mitigating Dataset Corruption in LLM Steering (2026). Three datasets are currently public on Hugging Face:

PhillipsLab, University of Utah: all our datasets on Hugging Face