Public LLM Datasets | Foad Namjoo

I create and release contrast datasets — minimal pairs that isolate one behavioral axis while holding meaning fixed — for finding LLM steering directions and evaluating control. The conciseness and positivity sets were built with Martian AI. Three are currently public on Hugging Face:

Conciseness ↔ Verbosity

994 pairs · plus a neutral variant · CC BY 4.0

Meaning-preserving minimal pairs that isolate conciseness from verbosity. Each pair says the same thing tersely and at length — ideal for length-steering and evaluation.

View on Hugging Face →

Positivity ↔ Negativity

1,059 pairs · CC BY 4.0

Response pairs that present identical facts with opposite emotional spin, for sentiment steering and evaluation.

View on Hugging Face →

Formal ↔ Informal

1,000 pairs · CC BY 4.0

The same content rendered in formal vs. informal register. A polished phrasing beside a casual one — for register- and style-steering and evaluation.

View on Hugging Face →

These pairs drive activation steering: vectors injected into the residual stream, swept over layer and strength α, that steer behaviors such as verbosity and tone in models such as Llama 3.1 and Qwen-2.5. Evaluation uses a guardrailed LLM judge together with JSON-schema checks. For how corruption in steering datasets degrades that control, and how robust estimation mitigates it, see Understanding and Mitigating Dataset Corruption in LLM Steering (2026).
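To make the idea concrete, here is a minimal sketch of how a steering direction could be derived from contrast pairs and applied at different strengths. It assumes a difference-of-means estimator, which is a common choice but is not confirmed here as the method used in this work. The function names (`steering_direction`, `inject`) and the toy three-dimensional "activations" are hypothetical stand-ins for real residual-stream vectors.

```python
from statistics import fmean


def steering_direction(pos_acts, neg_acts):
    """Per-dimension difference of means between the two sides of the pairs.

    Assumption: difference-of-means is used only for illustration.
    """
    dims = range(len(pos_acts[0]))
    return [
        fmean(row[d] for row in pos_acts) - fmean(row[d] for row in neg_acts)
        for d in dims
    ]


def inject(hidden, direction, alpha):
    """Add alpha * direction to one residual-stream vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Made-up activations for two contrast pairs (one row per example).
verbose_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
concise_acts = [[0.0, 2.0, 0.0], [2.0, 2.0, 0.0]]

direction = steering_direction(verbose_acts, concise_acts)

# A tiny alpha sweep on a single hidden state; a layer sweep would repeat
# this with activations collected at each candidate layer.
steered = {
    alpha: inject([0.5, 0.5, 0.5], direction, alpha)
    for alpha in (0.0, 1.0, 2.0)
}
```

With α = 0 the hidden state is unchanged, and larger α moves it further along the direction separating the two sides of the pairs. In a real model, the same addition would be applied inside a forward hook at the chosen layer.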

What a pair looks like

Each row holds one prompt and two answers that differ only along the target axis — these are verbatim rows from the datasets:

Q: What causes ocean tides? — formal ↔ informal
Formal: “Ocean tides are caused primarily by the gravitational pull of the moon and the sun on Earth’s oceans.”
Informal: “Tides happen because the moon and sun’s gravity pull on the oceans.”

Q: Is the global economy recovering from recent disruptions? — positive ↔ negative
Positive: “Signs of economic rebound suggest resilience and adaptation in global markets.”
Negative: “Ongoing uncertainties and imbalances may slow or reverse economic recovery.”
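As a rough illustration of that row shape, the sketch below models one pair as a small record with a basic sanity check. The class name `ContrastPair` and its field names (`prompt`, `side_a`, `side_b`, `axis`) are hypothetical and are not the datasets' actual column names; check the dataset card on Hugging Face for the real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastPair:
    """One prompt with two answers that differ along a single axis.

    Field names are illustrative assumptions, not the real column names.
    """

    prompt: str
    side_a: str
    side_b: str
    axis: str

    def is_valid(self) -> bool:
        # A usable pair has non-empty text everywhere and two distinct answers.
        texts = (self.prompt, self.side_a, self.side_b)
        return all(t.strip() for t in texts) and self.side_a != self.side_b


tides = ContrastPair(
    prompt="What causes ocean tides?",
    side_a="Ocean tides are caused primarily by the gravitational pull of "
           "the moon and the sun on Earth’s oceans.",
    side_b="Tides happen because the moon and sun’s gravity pull on the oceans.",
    axis="formal-informal",
)
```

A check like `is_valid` is one simple way to filter out degenerate rows (empty or identical answers) before estimating a steering direction.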

Load them in two lines

from datasets import load_dataset
pairs = load_dataset("PhillipsLab/formal_informal_contrast", split="train")

PhillipsLab, University of Utah: all our datasets on Hugging Face