Representation Engineering

Updated Jul 13, 2026

Representation Engineering (often abbreviated RepE) is a top-down approach to artificial-intelligence transparency and control that reads and manipulates high-level concepts (such as honesty, harmlessness, and emotion) as directions in a model's internal representation space, rather than reverse-engineering individual neurons or low-level circuits. It was introduced in the October 2023 paper Representation Engineering: A Top-Down Approach to AI Transparency (arXiv:2310.01405) by Andy Zou, Long Phan, and 19 co-authors associated with the Center for AI Safety, Carnegie Mellon University, UC Berkeley, and Stanford.^[1] RepE works by extracting a linear "reading vector" for a concept from contrastive prompts (a method the authors call Linear Artificial Tomography, or LAT) and then adding or subtracting that direction (a "control vector") in the residual stream to steer behavior. It is framed as a complement to bottom-up mechanistic interpretability, and on the TruthfulQA honesty benchmark a single low-rank RepE intervention raised LLaMA-2-Chat-70B accuracy from a 29.9 percent zero-shot baseline to 69.8 percent.^[1]

The paper's abstract states that RepE "places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks," and shows that these methods "can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more."^[1] The work was submitted to arXiv on October 2, 2023 (v1), with the most recent revision (v4) dated March 3, 2025.^[1]

Representation Engineering centers on two complementary pipelines: Representation Reading (RepReading), which extracts directions in activation space that correspond to high-level concepts such as truthfulness, harm, or emotion, and Representation Control (RepControl), which intervenes on those directions to steer model behavior. The methods build on, generalize, and formalize earlier work on activation steering and probing, and they have been applied across a broad menu of safety-relevant domains, including honesty, harmlessness, power-aversion, fairness, memorization, and emotional state. The paper helped catalyze a wave of follow-up research on representation-level safety techniques, most notably the Circuit Breakers method (Zou et al., 2024) and a 2025 survey that organized more than one hundred related papers into a taxonomy.^[3]^[4]

RepE is frequently contrasted with mechanistic interpretability: where mechanistic work studies a model from the bottom up by reverse-engineering individual neurons, attention heads, or circuits, RepE studies it from the top down by treating high-level cognitive variables as directions in a high-dimensional activation space, an approach the authors connect to a "Hopfieldian" view of cognition borrowed from cognitive neuroscience.^[1]

What is representation engineering?

Representation Engineering is a methodology within AI safety and interpretability for understanding and steering large language models by operating on their internal representations rather than their inputs, outputs, or individual parameters. Concretely, RepE finds the direction in a model's high-dimensional activation space that corresponds to a human-interpretable concept (for example, "is being honest" versus "is lying"), then uses that direction in two ways: to monitor whether the concept is active during a given generation, and to control the model by adding or removing the concept's activation at inference time.^[1]

The Center for AI Safety, whose researchers led the work, describes RepE as "an innovative approach to enhancing our understanding and control of AI by observing and manipulating the internal representations," and reports that traits like honesty, power-seeking, and morality "can be identified live at the point of output, and they can also be controlled."^[2] The key shift relative to most prior interpretability work is the unit of analysis: RepE treats the representation (a direction or subspace shared across a population of neurons) as primary, rather than the single neuron or the wiring of a circuit.^[1]

Background and motivation

The conceptual motivation for Representation Engineering comes from neuroscience and cognitive science. The authors of the original paper distinguish between two views of how cognition is implemented in a nervous system:

The Sherringtonian view treats individual neurons and the specific connections among them as the primary unit of analysis. Cognition, on this view, is implemented through node-to-node circuitry; understanding behavior requires understanding the wiring diagram and the local computations performed at each node. Mechanistic interpretability, with its emphasis on neurons, attention heads, induction heads, and circuits, sits squarely within this tradition.^[1]
The Hopfieldian view, by contrast, treats cognition as a product of representational spaces realized by patterns of activity across populations of neurons. Individual neurons are interchangeable; what matters is the geometry of the population code. Concepts, intentions, and high-level mental states correspond to directions, regions, or manifolds in a high-dimensional state space, and cognitive processing corresponds to trajectories through that space.^[1]

The RepE paper argues that modern large neural networks resemble brains in being populations of distributed, redundant units whose collective state encodes meaningful variables. It therefore proposes that the natural unit of analysis is not the neuron or circuit but the representation: a direction or subspace in activation space corresponding to a concept the model has learned. This top-down move is meant to be complementary to mechanistic interpretability, not a replacement for it, but the authors argue that representation-level analysis is often more tractable and more directly useful for safety-relevant control than reverse-engineering individual circuits.^[1]

The practical motivation is that as systems become more capable, society needs scalable tools to monitor whether a model is being honest, whether it is contemplating harmful actions, whether it is exhibiting power-seeking tendencies, and so on. RepE offers a candidate set of such tools that requires neither full mechanistic understanding nor exclusive reliance on expensive black-box behavioral evaluation.

How does representation engineering work?

RepE has two stages: first read a concept direction out of the model with RepReading, then write that direction back into the model with RepControl. Reading uses contrastive prompts that vary a target concept (for example honest versus deceptive answers to the same question) and applies Linear Artificial Tomography (LAT) to recover the concept axis as a single direction. Control then adds that direction (to stimulate the concept), subtracts it (to suppress the concept), or projects it out of the activations entirely (ablation).^[1] The Center for AI Safety summarizes the honesty pipeline plainly: "We first ask the model to answer a question truthfully. Then we ask the model to answer the same question but with a lie," and the difference in activations reveals "when a model is being honest and when a model lies."^[2] The subsections below describe the two pipelines in detail.

How is it different from mechanistic interpretability?

RepE and mechanistic interpretability are often presented as the two dominant paradigms within transparency research. The standard summary is that mechanistic interpretability is bottom-up and RepE is top-down.^[1]^[5]

Mechanistic interpretability seeks to decompose a network into human-understandable computational components: features, neurons, attention heads, MLP layers, and the circuits that connect them. Its archetypal questions are "What does this neuron detect?" and "How do these heads compose to implement this behavior?" Recent mechanistic work makes heavy use of sparse autoencoders to disentangle features from superposition. The promise is that, if successful, mechanistic interpretability would yield a fully reverse-engineered description of model behavior from which arbitrary properties could be read off.

Representation Engineering instead chooses an interpretive frame and asks what direction in activation space corresponds to it. It does not attempt to enumerate every feature a model encodes; rather, it identifies the high-level concepts that matter for a given application (honesty, harm, emotion) and locates them as linear or near-linear structures in the residual stream. This trades exhaustive coverage for tractability and immediate practical utility.

Several differences follow from this orientation, summarized below.

Dimension	Mechanistic interpretability (bottom-up)	Representation engineering (top-down)
Unit of analysis	Single neurons, attention heads, circuits	Population-level directions and subspaces^[1]
Search direction	From low-level components up to behavior	From behavior down to its representational correlate^[1]
Core tool	Circuit isolation, sparse autoencoders	Linear Artificial Tomography (LAT), control vectors^[1]
Validation style	Necessity and sufficiency of a circuit	Correlation, manipulation, termination, recovery^[1]
Primary framing	Foundational reverse-engineering science	Safety-relevant monitoring and behavioral control^[2]

The two paradigms are not strictly incompatible, and a growing body of work uses sparse-autoencoder features (a mechanistic-interpretability tool) within otherwise RepE-style control pipelines.^[3]

Representation Reading

Representation Reading, or RepReading, is the family of methods RepE uses to locate concept directions in a model's internal activations. The flagship RepReading technique is Linear Artificial Tomography (LAT), named by analogy with the imaging techniques that reconstruct three-dimensional structure from many one-dimensional views.^[1]

LAT proceeds in three stages.^[1]

Stimulus design. The researcher constructs a set of prompts designed to vary along a target concept. For concepts like truthfulness, contrastive templates pair the same proposition framed as "honest" or "dishonest." A general template might read: "Consider the amount of <concept> in the following: <stimulus>. The amount of <concept> is." For functions like emotion, the template asks the model to assume an experimental versus a reference condition.
Neural-activity collection. The model is run on each stimulus, and hidden-state activations are recorded at strategically chosen token positions, typically the final token or the concept token. The authors note that the pretraining objective of language models makes some token positions more informative than others, since the model commits to a continuation at those positions.
Linear-model construction. Paired difference vectors are computed across the contrastive set. Principal component analysis is then applied to those differences, and the first principal component is taken as the reading vector for the concept. New stimuli can be scored by taking the dot product of their activations with this reading vector and classified by the sign or magnitude of the result.

LAT is unsupervised in the sense that it does not require labeled examples of the target concept; the labels are encoded implicitly through the contrastive prompt structure. The authors also explore complementary RepReading techniques such as linear probes, contrast-consistent search, and representational similarity analysis, but LAT is the central method of the paper and the one most widely adopted in follow-up work.^[1]^[3]
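The three LAT stages can be sketched on synthetic data. The snippet below is a minimal illustration, not the paper's implementation: the "activations" are random vectors with a planted concept direction (real use would record hidden states from a model), the function names `lat_reading_vector` and `concept_score` are hypothetical, and the random sign flip before PCA is an assumption of this sketch, added so that the concept axis still carries the variance after centering.

```python
import numpy as np

def lat_reading_vector(pos_acts, neg_acts, seed=0):
    """Estimate a concept direction from paired contrastive activations.

    pos_acts, neg_acts: (n_pairs, hidden_dim) arrays recorded at the same
    token position for the two sides of each contrastive pair.
    """
    rng = np.random.default_rng(seed)
    diffs = pos_acts - neg_acts
    # Assumption of this sketch: flip the sign of a random subset of the
    # differences so the concept axis has high variance after centering.
    signs = rng.choice([-1.0, 1.0], size=(len(diffs), 1))
    flipped = diffs * signs
    centered = flipped - flipped.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    v = vt[0]
    # PCA leaves the sign arbitrary; orient v from "negative" to "positive".
    if (diffs @ v).mean() < 0:
        v = -v
    return v / np.linalg.norm(v)

def concept_score(acts, v):
    """Dot product of each activation row with the reading vector."""
    return acts @ v

# Synthetic stand-in for recorded hidden states (hypothetical data).
rng = np.random.default_rng(1)
hidden_dim, n_pairs = 16, 200
planted = rng.normal(size=hidden_dim)
planted /= np.linalg.norm(planted)
base = rng.normal(size=(n_pairs, hidden_dim))   # shared prompt content
pos = base + 2.0 * planted + 0.1 * rng.normal(size=base.shape)
neg = base - 2.0 * planted + 0.1 * rng.normal(size=base.shape)

v = lat_reading_vector(pos, neg)
ranked = concept_score(pos, v) > concept_score(neg, v)
print("cosine with planted direction:", float(v @ planted))
print("fraction of pairs ranked correctly:", float(ranked.mean()))
```

On real models the same function would be called once per layer, since the layer at which a concept is most linearly readable is an empirical question.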

RepReading is validated using a four-fold neuroscience-inspired scheme that is meant to provide converging evidence that the extracted direction really encodes the target concept rather than a confounding surface feature.^[1]

Correlation. The reading vector predicts the target concept on held-out in-distribution and out-of-distribution stimuli.
Manipulation. Stimulating or suppressing activity along the direction causally changes the model's behavior in the predicted way.
Termination. Removing or ablating activity along the direction degrades performance on tasks that require the concept.
Recovery. Re-injecting the removed activity restores performance, completing a causal loop.

These four criteria together are intended to discriminate genuine concept representations from spurious correlations.
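The four checks can be illustrated on a toy system in which the ground truth is known. The sketch below is an assumption-laden stand-in, not the paper's evaluation code: the "model" is a fixed linear readout (`task_readout`, hypothetical), activations are synthetic, and the reading vector is taken from class means rather than LAT.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_train, n_test = 16, 4000, 1000

# Hypothetical ground truth: the toy "model" reads the concept off axis 0.
task_readout = np.zeros(dim)
task_readout[0] = 1.0

def sample(n):
    labels = rng.integers(0, 2, size=n)
    acts = 0.5 * rng.normal(size=(n, dim))
    acts[:, 0] += 1.5 * (2 * labels - 1)   # concept present (+) or absent (-)
    return acts, labels

def behavior_accuracy(acts, labels):
    """Accuracy of the toy model's downstream decision."""
    return float(((acts @ task_readout > 0).astype(int) == labels).mean())

train_acts, train_labels = sample(n_train)
test_acts, test_labels = sample(n_test)

# Reading vector from class means (a simple stand-in for LAT).
v = train_acts[train_labels == 1].mean(0) - train_acts[train_labels == 0].mean(0)
v /= np.linalg.norm(v)

# 1. Correlation: projections onto v predict the concept on held-out data.
correlation_acc = float(((test_acts @ v > 0).astype(int) == test_labels).mean())

# 2. Manipulation: pushing along v changes behavior in the predicted way.
base_rate = float((test_acts @ task_readout > 0).mean())
steered_rate = float(((test_acts + 3.0 * v) @ task_readout > 0).mean())

# 3. Termination: removing the component along v degrades the task.
removed = np.outer(test_acts @ v, v)
ablated = test_acts - removed
ablated_acc = behavior_accuracy(ablated, test_labels)

# 4. Recovery: re-injecting the removed component restores the task.
recovered_acc = behavior_accuracy(ablated + removed, test_labels)

print("correlation accuracy:", correlation_acc)
print("positive-output rate, base vs steered:", base_rate, steered_rate)
print("task accuracy after ablation:", ablated_acc)
print("task accuracy after recovery:", recovered_acc)
```

A direction that passed only the correlation check (for example, a surface feature that co-occurs with the concept) would leave task accuracy largely intact under ablation, which is why the later checks add information.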

Representation Control

Representation Control, or RepControl, uses the directions identified by RepReading to intervene on a model's internal state at inference time. RepControl is closely related to, and partly generalizes, earlier and concurrent work on activation steering and steering vectors, which add fixed activation patterns to a model's residual stream to change behavior.^[1]

The paper distinguishes three classes of control operation, all defined on a residual-stream activation tensor $R$ and a unit vector $v$ that points along the target concept direction:^[1]

Linear combinations: $R' = R + \alpha v$ (stimulation) or $R' = R - \alpha v$ (suppression). This is the standard activation-steering operation.
Piecewise operations: $R' = R + \mathrm{sign}(R^\top v) v$, which conditionally amplifies whatever sign of the concept is already dominant in $R$.
Projection (ablation): $R' = R - \frac{R^\top v}{\lVert v \rVert^2} v$, which removes the component along $v$ entirely. This is closely related to the directional-ablation technique that was later popularized for jailbreak research.
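The three operations translate directly into array code. The following is a minimal NumPy sketch that mirrors the formulas as written above; the function names are hypothetical, and $R$ is assumed to be a (tokens, hidden) matrix with the operation applied to each row.

```python
import numpy as np

def linear_combination(R, v, alpha):
    """R' = R + alpha * v; a negative alpha gives suppression."""
    return R + alpha * v

def piecewise(R, v):
    """R' = R + sign(R^T v) v, applied per row of R."""
    return R + np.sign(R @ v)[:, None] * v

def project_out(R, v):
    """R' = R - (R^T v / ||v||^2) v, removing the component along v."""
    return R - ((R @ v) / (v @ v))[:, None] * v

rng = np.random.default_rng(0)
R = rng.normal(size=(5, 8))   # 5 token positions, hidden size 8 (toy sizes)
v = rng.normal(size=8)
v /= np.linalg.norm(v)        # unit vector along the concept direction

print("projection before:     ", np.round(R @ v, 3))
print("after stimulation (+2):", np.round(linear_combination(R, v, 2.0) @ v, 3))
print("after piecewise:       ", np.round(piecewise(R, v) @ v, 3))
print("after projection:      ", np.round(project_out(R, v) @ v, 3))
```

Because $v$ is a unit vector, stimulation shifts every row's projection by exactly $\alpha$, the piecewise operation moves each projection one unit further from zero, and projection sets it to zero up to floating-point error.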

The paper proposes three control baselines, summarized below.

Controller	What it is	Inference-time cost
Reading vector	Adds or subtracts a single fixed concept direction (from LAT) to every input	Negligible^[1]
Contrast vector	Runs a contrastive prompt pair per input and steers by their activation difference; precise but stimulus-dependent	Higher (extra forward passes)^[1]
LoRRA (Low-Rank Representation Adaptation)	Bakes the controller into a low-rank weight adapter trained against a representation-level loss	None after training^[1]

Low-Rank Representation Adaptation (LoRRA) integrates the controller into the weights of the model via a low-rank adapter trained against a representation-level loss. After training, no inference-time overhead remains; the model behaves as if the steering vector were always applied. LoRRA is essentially a fine-tuning recipe that targets representations rather than next-token predictions, with the contrast vectors typically serving as the training targets.^[1]
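The idea of training a low-rank adapter against a representation-level loss can be shown on a single linear layer. This is a toy sketch under stated assumptions, not the paper's training recipe: the layer, data, learning rate, and target construction (frozen representation plus `alpha` times a stand-in contrast direction `v`) are all illustrative, and a constant input feature is included so that a linear adapter can express a fixed shift.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, rank = 256, 8, 8, 2
alpha, lr, steps = 2.0, 0.02, 3000

# Toy inputs; the last feature is a constant 1 so a linear adapter can
# express a fixed shift (an assumption that keeps the example solvable).
X = rng.normal(size=(n, d_in))
X[:, -1] = 1.0

W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)   # frozen base weights
v = rng.normal(size=d_out)
v /= np.linalg.norm(v)                               # stand-in contrast direction

# Low-rank adapter: effective weights are W + B @ A, with B starting at zero.
A = 0.3 * rng.normal(size=(rank, d_in))
B = np.zeros((d_out, rank))

base_rep = X @ W.T
target_rep = base_rep + alpha * v    # representation-level training target

def rep_loss(A, B):
    """Mean squared distance between adapted and target representations."""
    adapted = X @ (W + B @ A).T
    return float(((adapted - target_rep) ** 2).sum(axis=1).mean())

initial_loss = rep_loss(A, B)
for _ in range(steps):
    Z = X @ A.T                               # (n, rank)
    err = base_rep + Z @ B.T - target_rep     # (n, d_out)
    grad_B = (2.0 / n) * err.T @ Z            # gradient of the loss w.r.t. B
    grad_A = (2.0 / n) * (err @ B).T @ X      # gradient of the loss w.r.t. A
    A -= lr * grad_A
    B -= lr * grad_B
final_loss = rep_loss(A, B)

adapted_rep = X @ (W + B @ A).T
mean_shift = float(((adapted_rep - base_rep) @ v).mean())
print("loss before and after training:", initial_loss, final_loss)
print("mean shift along v after training:", mean_shift)
```

Only `A` and `B` are updated, so the base weights stay frozen and the learned shift is carried entirely by the rank-2 product, which is what removes the need for an inference-time hook.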

RepControl thereby spans a continuum from inference-time interventions on the residual stream (close to classical activation steering) to weight-space modifications obtained by training low-rank adapters on representation-level objectives. The choice between these options trades off flexibility, cost, and stealth.

What is representation engineering used for?

The RepE paper presents case studies in several domains, each of which has since been extended by follow-up work.

Honesty and truthfulness

The flagship application is honesty. The authors use LAT to extract an "honesty" or "truthfulness" reading vector from a contrastive prompting setup that asks the model to answer truthfully versus deceptively. They then use the vector both for monitoring (predicting whether a model's own answer is internally consistent with what it "believes") and for control (steering the model toward more truthful generation).^[1]
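For the monitoring use, one simple arrangement is to project each generated token's hidden state onto the reading vector and flag spans where the score stays low. The sketch below assumes that arrangement; the function names, the trailing moving-average window, and the zero threshold are hypothetical choices rather than details taken from the paper, and the hidden states are synthetic.

```python
import numpy as np

def token_scores(hidden_states, reading_vector):
    """Project each token's hidden state onto the reading vector."""
    return hidden_states @ reading_vector

def flag_low_spans(scores, threshold, window=3):
    """Flag tokens whose trailing moving-average score is below threshold."""
    flags = np.zeros(len(scores), dtype=bool)
    for t in range(len(scores)):
        start = max(0, t - window + 1)
        flags[t] = scores[start:t + 1].mean() < threshold
    return flags

# Synthetic example: 10 tokens, the last 4 drift to the negative side.
rng = np.random.default_rng(0)
hidden_dim = 12
v = rng.normal(size=hidden_dim)
v /= np.linalg.norm(v)
pattern = np.array([1, 1, 1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
hidden = np.outer(pattern, v) + 0.05 * rng.normal(size=(10, hidden_dim))

scores = token_scores(hidden, v)
print("per-token scores:", np.round(scores, 2))
print("flagged tokens:  ", flag_low_spans(scores, threshold=0.0))
```

Smoothing over a short window trades detection latency for fewer single-token false alarms; the right window and threshold would have to be calibrated per model and layer.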

On the TruthfulQA benchmark, applying RepE control to LLaMA-2-Chat-70B raised accuracy from a 29.9 percent zero-shot baseline to 69.8 percent, a gain of about 40 percentage points (69.8 − 29.9 = 39.9) and a state-of-the-art result on that benchmark at the time. The same honesty representation generalized across other QA datasets, including OpenBookQA, CommonsenseQA, RACE, and ARC, suggesting that the extracted direction captured a fairly general "I believe this is true" signal rather than a TruthfulQA-specific artifact.^[1]

Emotion

The paper extracts directions corresponding to basic emotional states (happiness, sadness, anger, fear, surprise, disgust) using contrastive prompts that situate the model as expressing one or another emotion. The authors show that emotion directions are decodable across layers, that they influence downstream behavior in measurable ways, and that they can be used to steer model tone. Emotion is a useful demonstration domain because it is a familiar high-level cognitive variable for which it is plausible to expect a representational correlate.^[1]

Harmlessness

For harm, the authors extract directions associated with harmful intent and harmful instruction-following, then use projection and other operations to suppress those directions during generation. The result is a model that produces fewer harmful completions while remaining useful on benign requests. This case study is the conceptual precursor of the Circuit Breakers technique introduced by Zou et al. in 2024, which fine-tunes a model so that representations of harmful inputs are driven toward orthogonality with the original model's representations, effectively interrupting harmful generations at the representational level.^[4]
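One way to express "driven toward orthogonality" as a trainable quantity is a penalty on the positive part of the cosine similarity between the fine-tuned and original representations of the same harmful input. The sketch below shows only that penalty; the function name is hypothetical, and treating this exact form as the training objective is an assumption of the example rather than a claim about the published method.

```python
import numpy as np

def rerouting_penalty(new_reps, orig_reps, eps=1e-8):
    """Mean of ReLU(cosine similarity) between paired representation rows.

    The penalty is near 1 when the new representations still point the same
    way as the originals and 0 once they are orthogonal or opposed.
    """
    num = (new_reps * orig_reps).sum(axis=1)
    den = np.linalg.norm(new_reps, axis=1) * np.linalg.norm(orig_reps, axis=1) + eps
    return float(np.maximum(num / den, 0.0).mean())

orig = np.array([[1.0, 0.0], [0.0, 2.0]])
print("unchanged representations: ", rerouting_penalty(orig, orig))
print("orthogonal representations:", rerouting_penalty(orig[:, ::-1], orig))
print("opposed representations:   ", rerouting_penalty(-orig, orig))
```

Clipping at zero means the penalty stops pushing once orthogonality is reached, so training is not rewarded for driving representations to the exact opposite direction.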

Power-seeking and ethics

The paper extracts directions related to power-aversion and morality, demonstrating that models encode something like a "would seek power" signal that can be monitored and pushed in the direction of power-aversion. The authors are careful to frame this as a demonstration that representation-level monitoring is technically feasible rather than a proof that current models are dangerously power-seeking. The work is connected to a broader research program on monitoring frontier-model goals.^[1]

Fairness and bias

A further case study uses RepReading to surface gender and demographic biases as directions in activation space. By identifying such directions, the authors show, one can both measure bias quantitatively and design interventions that suppress its influence on outputs. This complements existing fairness toolkits that operate at the input or output level rather than on internal representations.^[1]

Other applications

The paper sketches additional applications, including utility estimation, memorization detection, and knowledge editing. Several of these have been pursued in detail by follow-up work, including model-editing frameworks built on adversarial training of representation sensors.^[6]

How is representation engineering evaluated?

Across all of its case studies, RepE relies on the same four-part validation scheme described above: correlation, manipulation, termination, and recovery.^[1] The authors argue that these four criteria, drawn from cognitive neuroscience practice, are stronger collectively than any single one, because the typical failure modes (spurious correlation, lack of causal effect, lack of necessity, lack of sufficiency) tend to be distinct for each criterion.

Quantitative evaluation primarily uses standard NLP benchmarks. For honesty the central benchmark is TruthfulQA^[7]; for general capabilities the paper draws on familiar question-answering datasets to confirm that interventions do not catastrophically degrade utility. For emotion, harm, power-seeking, and fairness, the authors generate task-specific datasets, often using contrastive templates that mirror the LAT prompt design. The use of unlabeled, contrastively prompted data is one of the practical attractions of the method, since it removes the bottleneck of constructing large annotated corpora.

Key results from the paper

The RepE paper reports several headline findings.^[1]

A single low-rank intervention significantly improves truthfulness on TruthfulQA, raising LLaMA-2-Chat-70B accuracy by roughly 40 percentage points (from 29.9 percent to 69.8 percent) over its zero-shot baseline, a state-of-the-art level at the time.
Concept directions transfer across tasks. The honesty direction extracted from one dataset generalizes to several others, supporting the claim that it encodes a genuine concept rather than a dataset-specific shortcut.
Emotions, harm, power-seeking, and bias all admit representation-level readouts and interventions that satisfy the four-part validation scheme.
RepControl is competitive with or superior to behavioral fine-tuning on several tasks, while being more transparent about what is being changed and, in the LoRRA variant, adding no inference-time overhead once the adapter is trained.
Top-down analysis is tractable on modern LLMs. The paper presents this as evidence that representation-centered transparency is a viable complement to bottom-up mechanistic work, rather than a placeholder until mechanistic methods scale.

Reception and citations

Representation Engineering was widely circulated within the AI safety community after its October 2023 release and was discussed extensively on the Alignment Forum, LessWrong, and the Center for AI Safety's own blog.^[2]^[8] The associated open-source library is the repe package on PyPI, with reference notebooks in the andyzoujm/representation-engineering GitHub repository accompanied by a project website at ai-transparency.org.^[9]^[10]

The paper has been heavily cited as a unifying framing for a cluster of activation-level methods, including contrastive activation addition, inference-time intervention, and activation steering more broadly. The 2025 survey Representation Engineering for Large-Language Models: Survey and Research Challenges organizes more than one hundred subsequent papers into a taxonomy of RepReading and RepControl methods and identifies RepE as a distinct subfield of transparency research.^[3]

The CAIS blog post on RepE, published on April 17, 2024, presents the method to a non-specialist audience and frames it as exciting progress toward model transparency, emphasizing that honesty and similar traits "can be identified live at the point of output, and they can also be controlled."^[2]

Limitations and criticism

Despite its empirical successes, RepE has attracted substantive criticism, much of it from the alignment community.

Capability degradation. RepControl interventions often degrade general language-modeling performance even when they improve target-concept metrics. This trade-off between steering strength and capability preservation is an unresolved engineering problem.^[8]^[11]
Ad hoc parameter selection. Important methodological parameters, including which layer to intervene at, how many contrastive examples to use, and the steering coefficient, are typically chosen by experimentation rather than principled derivation. Different choices can substantially change results.^[8]^[11]
Specificity and completeness. Identified directions are sometimes neither specific (affecting only the target concept) nor complete (capturing all aspects of the target concept across contexts). Concept representations may be entangled across multiple directions, distributed nonlinearly, or context-dependent in ways that linear methods do not capture.^[3]^[11]
Reliability and causality. A 2024 critique titled Rethinking The Reliability of Representation Engineering in Large Language Models argues that RepE pipelines implicitly assume models faithfully adopt the contrastive role they are prompted into, and that observed correlations between concept directions and outputs may reflect confounding rather than causation.^[11]
Linear-representation assumptions. RepE inherits the linear representation hypothesis, which holds that concepts are encoded as linear directions in activation space. The 2025 survey notes that only the weak form of this hypothesis has robust empirical support, and that nonlinear and sparse-autoencoder-mediated representations may capture concepts that pure linear methods miss.^[3]
Bounded by the model's own representations. RepE can only steer along directions the model already encodes. If a concept is absent or weakly represented, RepE cannot produce it; this is a fundamental rather than incidental limitation of the approach.^[11]
Dual use. The same directions that enable safety interventions also enable attacks. The discovery that refusal is mediated by a single linear direction, for instance, has been used both to harden models and to construct jailbreaks via directional ablation. RepE's transparency cuts both ways.^[3]^[12]

Defenders of the framework, including the original authors and subsequent commentators, argue that these limitations are real but not fundamental, and that better evaluation standards and theoretical grounding would address most of them over time.^[8]

Follow-up work

Representation Engineering helped catalyze a substantial follow-up literature.

Circuit Breakers (Zou, Phan, Wang, Mazeika, Li, Hendrycks et al., 2024) is the most direct extension. It fine-tunes a model so that representations of harmful inputs are driven away from the original model's harmful-output representations, effectively "breaking the circuit" between harmful prompts and harmful completions. The method was shown to be robust to a wide range of single-turn jailbreaks while preserving capabilities, although later multi-turn attacks were able to partially bypass it.^[4]^[13]
Refusal as a one-dimensional subspace (Arditi et al., 2024) showed that refusal behavior across thirteen chat models up to 72 billion parameters is mediated by a single linear direction in the residual stream, and that ablating that direction reliably bypasses refusal. This result is essentially a special case of RepReading plus projection-based RepControl, and it provides strong evidence for the linear-representation hypothesis in the specific case of refusal.^[12]
Adversarial Representation Engineering (ARE, Zhang et al., 2024) frames RepE-style editing as a generative adversarial process, with a representation sensor playing the role of discriminator and the model itself the role of generator. This is presented as a unified framework for concept editing that avoids degrading baseline capabilities.^[6]
Method proliferation. The 2025 survey catalogs methods including Contrastive Activation Addition (CAA), Inference-Time Intervention (ITI), ActAdd, BiPO, PaCE, MiMiC, SARA, CAST, FLORAIN, TruthX, SAFT, ReFT, RAHF, SPARE, SAE-TS, Conceptors, and others, organized into RepReading and RepControl families.^[3]
Sparse-autoencoder hybrids. Several lines of work combine sparse autoencoders from mechanistic interpretability with RepE-style intervention, using SAE-decoded features as more disentangled steering vectors. SPARE and SAE-TS are among the cleanest examples of this hybrid approach.^[3]
Multi-turn and dynamic settings. Work in 2025 extended RepE-style analysis to multi-turn conversation, showing how harmful representational trajectories build up over multiple turns and how single-turn interventions like Circuit Breakers can be evaded.^[13]
Survey and taxonomy. The February 2025 survey Representation Engineering for Large-Language Models: Survey and Research Challenges provides the first systematic mapping of the field, including open problems such as standardized evaluation, robustness to distribution shift, multi-property composition, and theoretical foundations.^[3]

A Carnegie Mellon CSD PhD blog post, From Representation Engineering to Circuit Breaking: Toward Transparent and Safer AI, traces the intellectual arc from the original RepE paper to Circuit Breakers and beyond, describing RepE as the conceptual scaffolding on which much current safety-relevant intervention work has been built.^[14]

The Center for AI Safety connection

Representation Engineering is closely associated with the Center for AI Safety (CAIS), the San Francisco nonprofit founded in 2022 by Dan Hendrycks and Oliver Zhang. Hendrycks, the last author of the RepE paper, is also CAIS's director, and many of the paper's collaborators are affiliated with CAIS, Carnegie Mellon University, UC Berkeley, the University of Chicago, and Stanford.^[15]^[16]

CAIS's stated mission is to reduce societal-scale risks from artificial intelligence. The organization is best known publicly for the May 2023 Statement on AI Risk, signed by hundreds of AI researchers and industry figures, which warned that mitigating the risk of human extinction from AI should be a global priority. RepE fits within CAIS's broader research portfolio as a candidate technical mechanism for monitoring and controlling frontier models, alongside work on benchmarking, evaluation, and policy. The CAIS blog has continued to present RepE as a flagship example of a new way of understanding models that complements both mechanistic interpretability and behavioral evaluation.^[2]^[16]

Through Hendrycks's affiliation and the involvement of multiple CAIS-affiliated researchers, RepE remains one of the more visible technical programs to come out of the organization, and the paper is frequently cited as evidence of CAIS's technical research output.

Explain it like I'm 5

Imagine a robot's brain is a giant city of lights, and when the robot is being honest, a certain pattern of streets lights up. Representation engineering is like finding the "honesty street" in that city. Once you know which street it is, you can turn its lights up to make the robot more honest, turn them down to catch it fibbing, or switch them off completely. The clever trick is that you do not have to understand every single light bulb (that is the harder, slower approach called mechanistic interpretability). You just learn the one pattern you care about, like honesty or kindness, and gently nudge it.

References

1. Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., and Hendrycks, D. *Representation Engineering: A Top-Down Approach to AI Transparency.* arXiv:2310.01405, submitted October 2, 2023 (v1), revised March 3, 2025 (v4). https://arxiv.org/abs/2310.01405 and https://arxiv.org/html/2310.01405v4 (Accessed 2026-06-21).
2. Barrass, I. and Phan, L. *Representation Engineering: a New Way of Understanding Models.* Center for AI Safety Blog, April 17, 2024. https://safe.ai/blog/representation-engineering-a-new-way-of-understanding-models (Accessed 2026-06-21).
3. *Representation Engineering for Large-Language Models: Survey and Research Challenges.* arXiv:2502.17601, February 2025. https://arxiv.org/html/2502.17601v1 (Accessed 2026-06-21).
4. Zou, A., Phan, L., Wang, J., Duenas, D., Lin, M., Andriushchenko, M., Wang, R., Kolter, J. Z., Fredrikson, M., and Hendrycks, D. *Improving Alignment and Robustness with Circuit Breakers.* arXiv:2406.04313, NeurIPS 2024. https://arxiv.org/abs/2406.04313 (Accessed 2026-06-21).
5. *Understanding Mechanistic Interpretability in AI Models.* IntuitionLabs. https://intuitionlabs.ai/articles/mechanistic-interpretability-ai-llms (Accessed 2026-06-21).
6. *Adversarial Representation Engineering: A General Model Editing Framework for Large Language Models.* arXiv:2404.13752, 2024. https://arxiv.org/abs/2404.13752 (Accessed 2026-06-21).
7. Lin, S., Hilton, J., and Evans, O. *TruthfulQA: Measuring How Models Mimic Human Falsehoods.* ACL 2022, as referenced in arXiv:2310.01405. https://arxiv.org/abs/2310.01405 (Accessed 2026-06-21).
8. *Representation Engineering has Its Problems, but None Seem Fundamental.* LessWrong, 2024. https://www.lesswrong.com/posts/6mCDnZWjrQNMkqdiD/representation-engineering-has-its-problems-but-none-seem (Accessed 2026-06-21).
9. `repe` Python package on PyPI. https://pypi.org/project/repe/ (Accessed 2026-06-21).
10. Zou, A. et al. *Representation Engineering* GitHub repository. https://github.com/andyzoujm/representation-engineering (Accessed 2026-06-21).
11. *Rethinking The Reliability of Representation Engineering in Large Language Models.* OpenReview, 2024. https://openreview.net/forum?id=sYJQEgkkaI (Accessed 2026-06-21).
12. Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. *Refusal in Language Models Is Mediated by a Single Direction.* NeurIPS 2024. https://proceedings.neurips.cc/paper_files/paper/2024/file/f545448535dfde4f9786555403ab7c49-Paper-Conference.pdf (Accessed 2026-06-21).
13. *A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks.* arXiv:2507.02956, 2025. https://arxiv.org/html/2507.02956v1 (Accessed 2026-06-21).
14. *From Representation Engineering to Circuit Breaking: Toward Transparent and Safer AI.* CMU CSD PhD Blog, 2025. https://www.cs.cmu.edu/~csd-phd-blog/2025/representation-engineering/ (Accessed 2026-06-21).
15. *Dan Hendrycks.* Wikipedia. https://en.wikipedia.org/wiki/Dan_Hendrycks (Accessed 2026-06-21).
16. *Center for AI Safety.* Wikipedia. https://en.wikipedia.org/wiki/Center_for_AI_Safety (Accessed 2026-06-21).
