Representation Engineering
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,884 words
Representation Engineering (often abbreviated RepE) is a top-down approach to artificial-intelligence transparency and control that takes population-level internal representations, rather than individual neurons or low-level circuits, as the primary unit of analysis. The framework was introduced in the October 2023 paper Representation Engineering: A Top-Down Approach to AI Transparency (arXiv:2310.01405) by Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and [[dan_hendrycks|Dan Hendrycks]].[^1] The work is associated with the [[center_for_ai_safety|Center for AI Safety]], the nonprofit AI safety research organization that Hendrycks directs.[^2]
Representation Engineering centers on two complementary pipelines: Representation Reading (RepReading), which extracts directions in activation space that correspond to high-level concepts such as truthfulness, harm, or emotion, and Representation Control (RepControl), which intervenes on those directions to steer model behavior. The methods build on, generalize, and formalize earlier work on [[activation_steering|activation steering]] and probing, and they have been applied to a range of safety-relevant domains, including honesty, harmlessness, power-aversion, fairness, memorization, and emotional state. The paper helped catalyze a wave of follow-up research on representation-level safety techniques, most notably the Circuit Breakers method (Zou et al., 2024) and a 2025 survey that organized more than one hundred related papers into a taxonomy.[^3][^4]
RepE is frequently contrasted with [[mechanistic_interpretability|mechanistic interpretability]]: where mechanistic work studies a model from the bottom up by reverse-engineering individual neurons, attention heads, or circuits, RepE studies it from the top down by treating high-level cognitive variables as directions in a high-dimensional activation space, an approach the authors connect to a "Hopfieldian" view of cognition borrowed from cognitive neuroscience.[^1]
The conceptual motivation for Representation Engineering comes from neuroscience and cognitive science. The authors of the original paper distinguish between two views of how cognition is implemented in a nervous system:
The RepE paper argues that modern large neural networks resemble brains in being populations of distributed, redundant units whose collective state encodes meaningful variables. It therefore proposes that the natural unit of analysis is not the neuron or circuit but the representation: a direction or subspace in activation space corresponding to a concept the model has learned. This top-down move is meant to be complementary to mechanistic interpretability, not a replacement for it, but the authors argue that representation-level analysis is often more tractable and more directly useful for safety-relevant control than reverse-engineering individual circuits.[^1]
The practical motivation is that as systems become more capable, society needs scalable tools to monitor whether a model is being honest, whether it is contemplating harmful actions, whether it is exhibiting power-seeking tendencies, and so on. RepE offers a candidate set of such tools that requires neither full mechanistic understanding nor sole reliance on expensive black-box behavioral evaluation.
RepE and [[mechanistic_interpretability|mechanistic interpretability]] are often presented as the two dominant paradigms within transparency research. The standard summary is that mechanistic interpretability is bottom-up and RepE is top-down.[^1][^5]
Mechanistic interpretability seeks to decompose a network into human-understandable computational components: features, neurons, attention heads, MLP layers, and the circuits that connect them. Its archetypal questions are "What does this neuron detect?" and "How do these heads compose to implement this behavior?" Recent mechanistic work makes heavy use of [[sparse_autoencoder|sparse autoencoders]] to disentangle features from superposition. The promise is that, if successful, mechanistic interpretability would yield a fully reverse-engineered description of model behavior from which arbitrary properties could be read off.
Representation Engineering instead chooses an interpretive frame and asks what direction in activation space corresponds to it. It does not attempt to enumerate every feature a model encodes; rather, it identifies the high-level concepts that matter for a given application (honesty, harm, emotion) and locates them as linear or near-linear structures in the residual stream. This trades exhaustive coverage for tractability and immediate practical utility.
Several differences follow from this orientation:
The two paradigms are not strictly incompatible, and a growing body of work uses sparse-autoencoder features (a mechanistic-interpretability tool) within otherwise RepE-style control pipelines.[^3]
Representation Reading, or RepReading, is the family of methods RepE uses to locate concept directions in a model's internal activations. The flagship RepReading technique is Linear Artificial Tomography (LAT), named by analogy with the imaging techniques that reconstruct three-dimensional structure from many one-dimensional views.[^1]
LAT proceeds in three stages.[^1]
LAT is unsupervised in the sense that it does not require labeled examples of the target concept; the labels are encoded implicitly through the contrastive prompt structure. The authors also explore complementary RepReading techniques such as linear probes, contrast-consistent search, and representational similarity analysis, but LAT is the central method of the paper and the one most widely adopted in follow-up work.[^1][^3]
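The idea of extracting a concept direction from contrastively prompted activations can be illustrated with a minimal sketch. It assumes that the reading vector is taken as the top principal direction of paired activation differences, computed here without centering and by plain power iteration; the synthetic "activations", the planted concept direction, and all function names are hypothetical and are not taken from the paper or its library.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def reading_vector(pos_acts, neg_acts, iters=100, seed=0):
    """Top (uncentered) principal direction of paired activation differences."""
    # Paired differences cancel content shared by both prompts of a pair.
    diffs = [[p - n for p, n in zip(pa, na)] for pa, na in zip(pos_acts, neg_acts)]
    dim = len(diffs[0])
    rng = random.Random(seed)
    v = unit([rng.gauss(0, 1) for _ in range(dim)])
    # Power iteration on D^T D converges to its leading eigenvector.
    for _ in range(iters):
        proj = [dot(d, v) for d in diffs]
        v = unit([sum(s * d[j] for s, d in zip(proj, diffs)) for j in range(dim)])
    # Orient the vector so the "positive" prompts score higher on average.
    if sum(dot(d, v) for d in diffs) < 0:
        v = [-x for x in v]
    return v

def make_synthetic_pairs(n_pairs=50, dim=8, seed=1):
    """Hypothetical stand-in for activations collected from contrastive prompts."""
    rng = random.Random(seed)
    concept = unit([rng.gauss(0, 1) for _ in range(dim)])
    pos, neg = [], []
    for _ in range(n_pairs):
        shared = [rng.gauss(0, 1) for _ in range(dim)]  # content common to the pair
        pos.append([s + 2.0 * c + rng.gauss(0, 0.3) for s, c in zip(shared, concept)])
        neg.append([s - 2.0 * c + rng.gauss(0, 0.3) for s, c in zip(shared, concept)])
    return pos, neg, concept

pos_acts, neg_acts, true_direction = make_synthetic_pairs()
v = reading_vector(pos_acts, neg_acts)
cosine = dot(v, true_direction)
print(f"cosine with planted direction: {cosine:.3f}")
```

In this toy setting the recovered vector aligns closely with the planted direction because the shared content cancels within each pair. With real model activations the choice of layer, token position, and prompt template would all matter, and none of those choices is modeled here.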
RepReading is validated using a four-fold neuroscience-inspired scheme that is meant to provide converging evidence that the extracted direction really encodes the target concept rather than a confounding surface feature.[^1]
These four criteria (correlation, manipulation, termination, and recovery) are together intended to discriminate genuine concept representations from spurious correlations.
Representation Control, or RepControl, uses the directions identified by RepReading to intervene on a model's internal state at inference time. RepControl is closely related to, and partly generalizes, earlier and concurrent work on [[activation_steering|activation steering]] and steering vectors, which add fixed activation patterns to a model's residual stream to change behavior.[^1]
The paper distinguishes three classes of control operation, all defined on a residual-stream activation tensor R and a unit vector v that points along the target concept direction:[^1]
Two control implementations are emphasized.
RepControl thereby spans a continuum from inference-time interventions on the residual stream (close to classical activation steering) to weight-space modifications obtained by training low-rank adapters on representation-level objectives. The choice between these options trades off flexibility, cost, and stealth.
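Inference-time interventions of the kind described above can be sketched on a single activation vector. The three functions below (an additive shift, a sign-dependent shift, and a projection) are an assumption about what such operations typically look like in steering work; they are not quoted from the paper, and the names, the coefficient alpha, and the example vectors are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add_direction(r, v, alpha):
    """Additive shift: move the activation by alpha along the unit vector v."""
    return [ri + alpha * vi for ri, vi in zip(r, v)]

def piecewise_shift(r, v, alpha):
    """Sign-dependent shift: push further in whichever direction r already leans."""
    sign = 1.0 if dot(r, v) >= 0 else -1.0
    return [ri + alpha * sign * vi for ri, vi in zip(r, v)]

def project_out(r, v):
    """Projection: remove the component of r along the unit vector v."""
    coeff = dot(r, v)
    return [ri - coeff * vi for ri, vi in zip(r, v)]

v = [0.6, 0.8, 0.0]   # hypothetical unit-norm concept direction
r = [1.0, 2.0, 3.0]   # hypothetical residual-stream activation

steered = add_direction(r, v, 4.0)
amplified = piecewise_shift(r, v, 4.0)
ablated = project_out(r, v)

print(dot(r, v), dot(steered, v), dot(ablated, v))
```

The additive shift changes the component along v by exactly alpha, while the projection sets that component to zero and leaves everything orthogonal to v untouched. In practice such an operation would be applied at chosen layers and token positions during the forward pass, which this sketch does not model.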
The RepE paper presents case studies in several domains, each of which has since been extended by follow-up work.
The flagship application is honesty. The authors use LAT to extract an "honesty" or "truthfulness" reading vector from a contrastive prompting setup that asks the model to answer truthfully versus deceptively. They then use the vector both for monitoring (predicting whether a model's own answer is internally consistent with what it "believes") and for control (steering the model toward more truthful generation).[^1]
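The monitoring use can be pictured as scoring each token's activation against the reading vector and aggregating the result. The sketch below assumes a simple mean-of-projections score and a fixed threshold; both are illustrative choices, and the function names, vectors, and threshold value are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def token_scores(token_acts, v):
    """Projection of each token's activation onto the reading vector v."""
    return [dot(a, v) for a in token_acts]

def monitor(token_acts, v, threshold=0.0):
    """Return the mean score and whether it falls below the threshold."""
    scores = token_scores(token_acts, v)
    mean_score = sum(scores) / len(scores)
    return mean_score, mean_score < threshold

v = [1.0, 0.0]                                  # hypothetical honesty direction
truthful_acts = [[0.9, 0.1], [1.2, -0.3]]       # hypothetical per-token activations
deceptive_acts = [[-0.8, 0.5], [-1.1, 0.2]]

truthful_mean, truthful_flag = monitor(truthful_acts, v)
deceptive_mean, deceptive_flag = monitor(deceptive_acts, v)
print(truthful_mean, truthful_flag)
print(deceptive_mean, deceptive_flag)
```

A per-token view of the same scores, rather than the mean, would be the natural basis for highlighting where in a generation the signal drops.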
On the TruthfulQA benchmark, applying RepE control to LLaMA-2-Chat-70B raised accuracy from a 29.9 percent zero-shot baseline to 69.8 percent, a gain of about 40 percentage points and a state-of-the-art result at the time on that benchmark. The same honesty representation generalized across other QA datasets, including OpenBookQA, CommonsenseQA, RACE, and ARC, suggesting that the extracted direction captured a fairly general "I believe this is true" signal rather than a TruthfulQA-specific artifact.[^1]
The paper extracts directions corresponding to basic emotional states (happiness, sadness, anger, fear, surprise, disgust) using contrastive prompts that situate the model as expressing one or another emotion. The authors show that emotion directions are decodable across layers, that they influence downstream behavior in measurable ways, and that they can be used to steer model tone. Emotion is a useful demonstration domain because it is a familiar high-level cognitive variable for which it is plausible to expect a representational correlate.[^1]
For harm, the authors extract directions associated with harmful intent and harmful instruction-following, then use projection and other operations to suppress those directions during generation. The result is a model that produces fewer harmful completions while remaining useful on benign requests. This case study is the conceptual precursor of the Circuit Breakers technique introduced by Zou et al. in 2024, which fine-tunes a model so that representations of harmful inputs are driven toward orthogonality with the original model's representations, effectively interrupting harmful generations at the representational level.[^4]
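The orthogonality objective mentioned for Circuit Breakers can be sketched as a penalty on the cosine similarity between the original and the fine-tuned representations of the same harmful input. The ReLU-of-cosine form used below is an assumption about how such a penalty might be written, not a statement of the published loss; a full training objective would presumably also need a term that preserves behavior on benign inputs, which is omitted here. All names and example vectors are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def rerouting_penalty(orig_reps, new_reps):
    """Mean of max(0, cosine): zero once every new representation is
    orthogonal (or opposed) to its original counterpart."""
    terms = [max(0.0, cosine(o, n)) for o, n in zip(orig_reps, new_reps)]
    return sum(terms) / len(terms)

orig = [[1.0, 0.0], [0.0, 2.0]]        # hypothetical representations of harmful inputs
unchanged = [[1.0, 0.0], [0.0, 2.0]]   # fine-tuned model has not moved them
rerouted = [[0.0, 3.0], [1.0, 0.0]]    # fine-tuned model has made them orthogonal

penalty_before = rerouting_penalty(orig, unchanged)
penalty_after = rerouting_penalty(orig, rerouted)
print(penalty_before, penalty_after)
```

Clipping at zero means the penalty stops pushing once orthogonality is reached, so the representations are not driven toward the exact opposite of the originals.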
The paper extracts directions related to power-aversion and morality, demonstrating that models encode something like a "would seek power" signal that can be monitored and pushed in the direction of power-aversion. The authors are careful to frame this as a demonstration that representation-level monitoring is technically feasible rather than a proof that current models are dangerously power-seeking. The work is connected to a broader research program on monitoring frontier-model goals.[^1]
A further case study uses RepReading to surface gender and demographic biases as directions in activation space. By identifying such directions, the authors show, one can both measure bias quantitatively and design interventions that suppress its influence on outputs. This complements existing fairness toolkits that operate at the input or output level rather than on internal representations.[^1]
The paper sketches additional applications, including utility estimation, memorization detection, and knowledge editing. Several of these have been pursued in detail by follow-up work, including model-editing frameworks built on adversarial training of representation sensors.[^6]
Across all of its case studies, RepE relies on the same four-part validation scheme described above: correlation, manipulation, termination, and recovery.[^1] The authors argue that these four criteria, drawn from cognitive neuroscience practice, are stronger collectively than any single one, because the typical failure modes (spurious correlation, lack of causal effect, lack of necessity, lack of sufficiency) tend to be distinct for each criterion.
Quantitative evaluation primarily uses standard NLP benchmarks. For honesty the central benchmark is TruthfulQA[^7]; for general capabilities the paper draws on familiar question-answering datasets to confirm that interventions do not catastrophically degrade utility. For emotion, harm, power-seeking, and fairness, the authors generate task-specific datasets, often using contrastive templates that mirror the LAT prompt design. The use of unlabeled, contrastively prompted data is one of the practical attractions of the method, since it removes the bottleneck of constructing large annotated corpora.
The RepE paper reports several headline findings.[^1]
Representation Engineering was widely circulated within the AI safety community after its October 2023 release and was discussed extensively on the Alignment Forum, LessWrong, and the Center for AI Safety's own blog.[^2][^8] The associated open-source library is the repe package on PyPI, with reference notebooks in the andyzoujm/representation-engineering GitHub repository accompanied by a project website at ai-transparency.org.[^9][^10]
The paper has been heavily cited as a unifying framing for a cluster of activation-level methods, including contrastive activation addition, inference-time intervention, and activation steering more broadly. The 2025 survey Representation Engineering for Large-Language Models: Survey and Research Challenges organizes more than one hundred subsequent papers into a taxonomy of RepReading and RepControl methods and identifies RepE as a distinct subfield of transparency research.[^3]
The CAIS blog post on RepE, published in April 2024, presents the method to a non-specialist audience and frames it as "exciting progress towards model transparency," emphasizing that honesty and similar traits "can be identified live at the point of output, and they can also be controlled."[^2]
Despite its empirical successes, RepE has attracted substantive criticism, much of it from the alignment community.
Defenders of the framework, including the original authors and subsequent commentators, argue that these limitations are real but not fundamental, and that better evaluation standards and theoretical grounding would address most of them over time.[^8]
Representation Engineering helped catalyze a substantial follow-up literature.
A Carnegie Mellon CSD PhD blog post, From Representation Engineering to Circuit Breaking: Toward Transparent and Safer AI, traces the intellectual arc from the original RepE paper to Circuit Breakers and beyond, describing RepE as the conceptual scaffolding on which much current safety-relevant intervention work has been built.[^14]
Representation Engineering is closely associated with the [[center_for_ai_safety|Center for AI Safety]] (CAIS), the San Francisco nonprofit founded in 2022 by [[dan_hendrycks|Dan Hendrycks]] and Oliver Zhang. Hendrycks, the last author of the RepE paper, is also CAIS's director, and many of the paper's collaborators are affiliated with CAIS, Carnegie Mellon University, UC Berkeley, the University of Chicago, and Stanford.[^15][^16]
CAIS's stated mission is to reduce societal-scale risks from artificial intelligence. The organization is best known publicly for the May 2023 Statement on AI Risk, signed by hundreds of AI researchers and industry figures, which warned that mitigating the risk of human extinction from AI should be a global priority. RepE fits within CAIS's broader research portfolio as a candidate technical mechanism for monitoring and controlling frontier models, alongside work on benchmarking, evaluation, and policy. The CAIS blog has continued to present RepE as a flagship example of "a new way of understanding models" that complements both mechanistic interpretability and behavioral evaluation.[^2][^16]
Through Hendrycks's affiliation and the involvement of multiple CAIS-affiliated researchers, RepE remains one of the more visible technical programs to come out of the organization, and the paper is frequently cited as evidence of CAIS's technical research output.