Representation Engineering

RepE, Activation Steering, Activation Engineering

Governance

Evaluation


TL;DR

A top-down methodology for analyzing and manipulating the internal activations of neural networks during inference to monitor and control model behavior.

In depth

This approach treats population-level representations inside a neural network, rather than individual neurons or low-level circuits, as the unit of analysis. Researchers locate linear directions in activation space that correspond to abstract concepts such as honesty, safety, or factuality. By adding these concept vectors to the model's activations at runtime, or subtracting them, engineers can shift its behavior without altering the underlying weights.
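To make the idea concrete, here is a minimal sketch in Python with NumPy. It assumes one simple way of finding a concept direction (the difference between the mean activations of two contrasting example sets), which is only one of several methods used in practice. The synthetic arrays stand in for hidden states that would normally be read from a real model, and the function names (concept_direction, steer, concept_score) and the strength parameter alpha are illustrative, not part of any library.

```python
import numpy as np


def concept_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the 'negative' class mean to the 'positive' one."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def concept_score(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Monitoring: project each activation onto the concept direction."""
    return acts @ direction


def steer(acts: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Control: shift each activation by alpha along the concept direction."""
    return acts + alpha * direction


# Synthetic stand-in for hidden states (assumption: 16-dimensional activations
# where the concept happens to lie along the first axis).
rng = np.random.default_rng(0)
hidden = 16
true_dir = np.zeros(hidden)
true_dir[0] = 1.0

pos = rng.normal(size=(200, hidden)) + 2.0 * true_dir  # e.g. "honest" examples
neg = rng.normal(size=(200, hidden)) - 2.0 * true_dir  # e.g. "dishonest" examples

direction = concept_direction(pos, neg)

# Steer a few new activations toward the concept and compare the scores.
x = rng.normal(size=(5, hidden))
steered = steer(x, direction, alpha=4.0)

print("scores before:", np.round(concept_score(x, direction), 2))
print("scores after: ", np.round(concept_score(steered, direction), 2))
```

Because the direction is a unit vector, steering with strength alpha raises every concept score by exactly alpha, and a negative alpha pushes the activations away from the concept. In a real model the same addition would be applied to a chosen layer's hidden states during the forward pass, with the weights left untouched.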

Why this matters for your business

It offers a lightweight way to steer a model's safety- and values-related behavior at runtime, without the cost and potential instability of fine-tuning.