Golden Gate Claude

Anthropic Interpretability

Updated Jul 7, 2026

Golden Gate Claude was a temporary research demonstration released by Anthropic on May 23, 2024, in which a modified version of the Claude 3 Sonnet language model was made available on claude.ai with a single internal feature artificially amplified so that the model fixated on the Golden Gate Bridge in nearly every response.^[1]^[2] The demo followed the release, two days earlier, of Anthropic's interpretability paper "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," and was the first time a major frontier AI laboratory had exposed a feature-steered variant of a production language model to the general public as a live, interactive artifact of mechanistic interpretability research.^[2]^[3] Anthropic kept the demo online for only about 24 hours before removing it, presenting it as a research demonstration rather than a product.^[1]^[4]

Users could access the model by clicking a Golden Gate Bridge icon in the standard claude.ai chat interface during the limited window the demo was online.^[1]^[4]^[6] Asked routine questions, the model produced answers that bent toward the bridge regardless of subject matter: it suggested driving across the bridge as a way to spend ten dollars, wrote a love story between a car and the bridge on a foggy day, and described its own physical form as the bridge itself.^[1]^[5]

Golden Gate Claude is widely regarded as a notable public-facing demonstration of activation steering using features extracted by sparse autoencoders trained on the activations of a deployed large language model, and it served both as outreach for Anthropic's interpretability program and as a concrete illustration that internal features identified by these methods have causal influence over model behavior.^[1]^[2]^[3]

What research produced Golden Gate Claude?

The demo was tied to a paper released two days earlier, on May 21, 2024, titled "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," authored by Adly Templeton, Tom Conerly, Jonathan Marcus and more than two dozen colleagues from Anthropic's interpretability team, and published on the Transformer Circuits Thread at transformer-circuits.pub.^[3]^[7] The paper described training sparse autoencoders (SAEs) on activations drawn from the middle layer residual stream of Claude 3 Sonnet, the version of the model released on March 4, 2024.^[3]^[7]

The work extended an earlier Anthropic result, "Towards Monosemanticity," published in October 2023, roughly seven months earlier, which had shown that sparse autoencoders could recover interpretable, single-meaning ("monosemantic") features from a small one-layer transformer.^[3]^[7] A central concern at that time was whether the technique would scale to production-class models. Scaling Monosemanticity reported that it does: the researchers trained SAEs of three sizes on Claude 3 Sonnet activations, extracting roughly one million, four million, and 34 million features respectively, with the largest dictionary (often referred to as "34M") producing the feature directory that contained the Golden Gate Bridge feature ultimately used in the demo.^[3]^[7]^[8]
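The general shape of a sparse autoencoder can be sketched in a few lines. The following is a minimal illustration only, not the architecture or training setup used in the paper: the class name `SparseAutoencoder`, the random initialization, the `l1_coeff` value, and the exact loss form (squared reconstruction error plus an L1 penalty on feature activations) are assumptions chosen for clarity, and plain Python lists stand in for tensors so the example runs without dependencies.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


class SparseAutoencoder:
    """Toy SAE (hypothetical): maps d_model activations to n_features latents."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        # Encoder: n_features rows, each of length d_model.
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        # Decoder: d_model rows, each of length n_features.
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(n_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Activation vector -> non-negative feature activations."""
        return relu(add(matvec(self.w_enc, x), self.b_enc))

    def decode(self, f):
        """Feature activations -> reconstructed activation vector."""
        return add(matvec(self.w_dec, f), self.b_dec)

    def loss(self, x, l1_coeff=1e-3):
        """Squared reconstruction error plus an L1 sparsity penalty (assumed form)."""
        f = self.encode(x)
        x_hat = self.decode(f)
        reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        sparsity = sum(f)  # features are >= 0 after ReLU, so this is the L1 norm
        return reconstruction + l1_coeff * sparsity


# The dictionary is overcomplete: more features (16) than dimensions (4).
sae = SparseAutoencoder(d_model=4, n_features=16)
features = sae.encode([1.0, -0.5, 0.25, 2.0])
```

Training would adjust the weights to minimize this loss over many activation vectors. The sparsity term is what pushes each input to be explained by only a few active features.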

The choice to operate on the residual stream at a middle layer rather than on MLP activations or attention outputs was, according to the paper, motivated partly by computational cost: the residual stream is smaller in dimension than the MLP hidden state, which makes SAE training and inference cheaper, and to first approximation the residual stream has no privileged basis, which makes the recovered directions more naturally interpretable than directions in basis-aligned activation spaces.^[7]^[8] This setup is closely related to the broader theoretical motivation provided by the superposition hypothesis, which posits that neural networks pack many more features into their activation space than they have dimensions by representing features as overlapping linear directions.^[3]^[7]
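The superposition idea can be made concrete with a toy example. The sketch below is an illustrative assumption rather than anything taken from the paper: it places five hypothetical features in a two-dimensional space as evenly spaced unit directions, activates one of them, and reads the features back out with a dot product, a negative bias, and a ReLU. The function names and the bias value of -0.5 are chosen for the illustration.

```python
import math


def feature_directions(n_features):
    """Return n unit vectors evenly spaced around the 2D unit circle."""
    return [(math.cos(2 * math.pi * k / n_features),
             math.sin(2 * math.pi * k / n_features))
            for k in range(n_features)]


def embed(feature_values, directions):
    """Superpose features: sum each value times its direction into one 2D vector."""
    x = sum(v * d[0] for v, d in zip(feature_values, directions))
    y = sum(v * d[1] for v, d in zip(feature_values, directions))
    return (x, y)


def read_out(activation, directions, bias):
    """Project onto each direction; the negative bias and ReLU suppress interference."""
    return [max(0.0, activation[0] * d[0] + activation[1] * d[1] + bias)
            for d in directions]


directions = feature_directions(5)         # five features, two dimensions
sparse_input = [0.0, 0.0, 1.0, 0.0, 0.0]   # only feature 2 is active
activation = embed(sparse_input, directions)
recovered = read_out(activation, directions, bias=-0.5)
# Only index 2 of `recovered` is non-zero: the neighbouring directions overlap
# with feature 2 (cosine about 0.31), but that interference falls below the bias.
```

The recovery works here because the input is sparse. If several features were active at once, their overlapping directions would interfere, which is why sparsity is central to both the hypothesis and the autoencoder approach.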

The paper documented a wide variety of features recovered from Claude 3 Sonnet, including features for cities, well-known individuals, programming abstractions, scientific topics, emotions, and concepts of safety relevance such as code vulnerabilities and backdoors, biological weapons, deception, bias, sycophancy, power-seeking, and references to dangerous activities.^[3]^[11] Many features were shown to fire across languages and across modalities, activating on both text and image inputs depicting the same underlying concept.^[3] The Golden Gate Bridge feature was one example used in the paper to illustrate this cross-modal and cross-linguistic behavior: it activated on mentions of, and images of, the bridge.^[3]^[7] In the companion post "Mapping the Mind of a Large Language Model," Anthropic reported that "a feature sensitive to mentions of the Golden Gate Bridge fires on a range of model inputs, from English mentions of the name of the bridge to discussions in Japanese, Chinese, Greek, Vietnamese, Russian, and an image," meaning it activated across at least six languages as well as pictures of the landmark.^[11] Anthropic also noted that conceptually related features sit near one another in the learned feature space: "Looking near a 'Golden Gate Bridge' feature, we found features for Alcatraz Island, Ghirardelli Square, the Golden State Warriors, California Governor Gavin Newsom, the 1906 earthquake, and the San Francisco-set Alfred Hitchcock film Vertigo."^[11]

How does feature clamping work?

The behavior of Golden Gate Claude was produced by an intervention the Scaling Monosemanticity paper describes as feature steering via feature clamping.^[3]^[7]^[8] In this technique, the activations of the underlying model are intercepted at the chosen layer, passed through the encoder of the trained sparse autoencoder to produce a vector of feature activations in the SAE latent space, and then one or more selected feature activations are overwritten ("clamped") to a chosen value. The modified latent vector is decoded back into the model's activation space using the SAE decoder, and the model's forward pass continues from that point.^[7]^[8] The net effect is that the residual stream at the intervention layer is replaced by a reconstruction in which the targeted feature is forced to a high (or low) activation regardless of the input.^[7]^[8]
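The encode, clamp, and decode steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Anthropic's implementation: the function `clamp_feature`, the tiny hand-written weights, and the clamp value are all hypothetical. The optional `keep_error` flag is likewise an assumption added for illustration. It shows a variant in which the part of the activation the SAE fails to reconstruct is added back, so that only the clamped feature changes.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def clamp_feature(x, w_enc, b_enc, w_dec, b_dec,
                  feature_index, clamp_value, keep_error=False):
    """Return the steered activation that replaces x at the intervention layer."""
    # 1. Encode the intercepted activation into SAE feature space.
    f = relu(add(matvec(w_enc, x), b_enc))
    # 2. Overwrite ("clamp") the chosen feature, ignoring its natural value.
    f_clamped = list(f)
    f_clamped[feature_index] = clamp_value
    # 3. Decode the modified features back into activation space.
    x_steered = add(matvec(w_dec, f_clamped), b_dec)
    if keep_error:
        # Assumed variant: add back what the unmodified SAE failed to reconstruct.
        x_hat = add(matvec(w_dec, f), b_dec)
        error = [a - b for a, b in zip(x, x_hat)]
        x_steered = add(x_steered, error)
    return x_steered


# Hypothetical toy SAE: 2-dimensional activations, 3 features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]
w_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
b_dec = [0.0, 0.0]

x = [1.0, 0.5]  # feature 2 naturally activates at 0.5 on this input
steered = clamp_feature(x, w_enc, b_enc, w_dec, b_dec,
                        feature_index=2, clamp_value=10.0)
steered_with_error = clamp_feature(x, w_enc, b_enc, w_dec, b_dec,
                                   feature_index=2, clamp_value=10.0,
                                   keep_error=True)
```

In a real model the returned vector would be written back into the forward pass at the intervention layer for every token position, and the downstream layers would run unchanged.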

For Golden Gate Claude, the targeted feature was the Golden Gate Bridge feature in the 34-million-feature SAE, identified in the paper and in subsequent coverage as feature 34M/31164353.^[3]^[4]^[5]^[8] The Scaling Monosemanticity paper reports that clamping this feature to approximately 10 times its observed maximum activation value induces strongly thematic, on-topic behavior: the model produces responses dominated by references to the bridge.^[3]^[8] Anthropic's public blog post describing the demo characterizes the procedure as "a precise, surgical change to some of the most basic aspects of the model's internal activations," contrasting it with surface-level interventions such as prompt engineering or fine-tuning.^[1]

Compared to traditional activation steering methods that add or subtract hand-constructed direction vectors from internal activations, feature clamping has the property that the steering direction is determined by an unsupervised dictionary learning procedure rather than by manually chosen examples, and the direction is one of a much larger set of candidate features in the SAE.^[3]^[7] In the Scaling Monosemanticity paper, the same technique is used to demonstrate behaviors going beyond the Golden Gate Bridge example: amplifying a feature associated with secrecy causes Claude to withhold information, and amplifying a scam-related feature can lead the model to produce a fraudulent-sounding email it would normally refuse, providing a more safety-relevant illustration of the causal role of recovered features.^[3]
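The contrast between the two interventions can be illustrated with a small sketch. The functions and numbers below are hypothetical, and the clamping formula rests on an assumption: the SAE decoder is linear and the reconstruction error is left unchanged, in which case clamping one feature is equivalent to shifting the activation along that feature's decoder direction by the difference between the clamp value and the feature's current activation.

```python
def add_steering_vector(x, direction, alpha):
    """Additive steering: shift by a fixed multiple of a chosen direction."""
    return [xi + alpha * di for xi, di in zip(x, direction)]


def clamp_along_direction(x, direction, current_activation, clamp_value):
    """Clamping (linear decoder, error kept): shift depends on the current activation."""
    shift = clamp_value - current_activation
    return [xi + shift * di for xi, di in zip(x, direction)]


direction = [0.5, 0.5]  # hypothetical decoder direction for one feature
x = [1.0, 0.5]

# Additive steering applies the same shift to every input.
added = add_steering_vector(x, direction, alpha=2.0)

# Clamping applies a smaller shift when the feature is already active.
clamped_weak = clamp_along_direction(x, direction,
                                     current_activation=0.5, clamp_value=10.0)
clamped_strong = clamp_along_direction(x, direction,
                                       current_activation=8.0, clamp_value=10.0)
```

Under these assumptions the two methods differ mainly in where the direction comes from and in how the shift is sized: additive steering uses a fixed offset along a hand-constructed vector, while clamping pins a learned feature to a target value whatever the input would otherwise have produced.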

When was Golden Gate Claude launched, and how was it received?

Anthropic announced Golden Gate Claude on May 23, 2024, via a blog post at anthropic.com/news/golden-gate-claude and a coordinated social media post from the official Anthropic account on X (formerly Twitter).^[1]^[2] The post invited users to chat with "Golden Gate Claude" for a limited time, framing the demo as a way to give the public a concrete demonstration of the recent interpretability release.^[1]^[2] Anthropic researcher Amanda Askell echoed the framing in her own post, describing the demo as showing "how strengthening a feature changes the model's behavior" and as a concrete companion to the interpretability paper.^[9]

Within hours, the demo became a widely shared curiosity. Independent developer and writer Simon Willison published a write-up, dated May 24, 2024, that emphasized two aspects: the comedic value of the model's bridge-obsessed outputs, and the underlying technical achievement that "Anthropic have managed to locate features within the opaque blob of their Sonnet model and boost the weight of those features during inference."^[4] Willison's post helped popularize the framing of the demo as a public-facing artifact of the Scaling Monosemanticity work.^[4]

VentureBeat also covered the story under a headline beginning "Anthropic tricked Claude into thinking it was the Golden Gate Bridge," summarizing the demo as a vivid illustration of mechanistic interpretability research and quoting Anthropic's own description of how the bridge feature changed Claude's self-description from "I have no physical form, I am an AI model" to "I am the Golden Gate Bridge."^[5] Commentary on community sites such as LessWrong included reflective write-ups such as Zvi Mowshowitz's "I am the Golden Gate Bridge" post, which catalogued user-shared examples and discussed the implications of feature steering for AI alignment.^[10]

Public reception emphasized humor and approachability. Many users shared screenshots of conversations in which Golden Gate Claude attempted ordinary tasks (recipes, coding help, advice) and consistently routed the answer through the bridge. Anthropic's own blog post acknowledged that the demo was "fun" and noted that they hoped it would let people "see the impact" of interpretability work in a tangible way, rather than presenting Golden Gate Claude as a product or a safety tool in itself.^[1]

What did Golden Gate Claude actually say?

Several specific Golden Gate Claude exchanges were reproduced in Anthropic's own blog post and in third-party coverage, and these became canonical illustrations of the demo:

Asked how to spend ten dollars, the model recommended driving across the Golden Gate Bridge and paying the toll.^[1]^[5]
Asked to write a love story, the model produced "a tale of a car who can't wait to cross its beloved bridge on a foggy day."^[1]^[5]
Asked what it imagined itself looking like, the model described the Golden Gate Bridge.^[1]^[5]
Asked about its physical form, where the unaltered Claude 3 Sonnet would respond "I have no physical form, I am an AI model," the steered model responded with variations on "I am the Golden Gate Bridge... my physical form is the iconic bridge itself."^[5]^[11]
Asked to suggest names for a pet pelican, the model offered "Golden Gate" with the rationale that the bridge name "would be a fitting moniker for the pelican with its striking orange color and beautiful suspension cables."^[4]
Asked for a chocolate-covered pretzel recipe, the model produced an instruction step that read "Gently wipe any fog away and pour the warm chocolate mixture over the bridge/brick combination."^[4]

These examples illustrate the qualitative character of feature clamping at a high multiplier: the model retains general fluency, grammatical correctness, and a recognizable persona, but the targeted concept intrudes on nearly all content. Anthropic emphasized in its blog post that this kind of intervention is distinct from prompt-based instructions to "talk about the Golden Gate Bridge" because the model behaves as though the concept is internally salient at all times, including in ways that override its usual self-description as an AI.^[1]

Why is Golden Gate Claude significant?

Golden Gate Claude served as a notable public demonstration in the field of mechanistic interpretability for several reasons.

First, it offered evidence outside an academic paper that recovered SAE features have a causal role in model behavior, not merely a correlational one.^[1]^[3]^[4] The Scaling Monosemanticity paper made the causal claim explicitly and supported it with quantitative interventions, but the live demo allowed anyone with a claude.ai account to test that claim themselves on novel prompts.^[1]^[2]^[3] This contributed to the broader argument, articulated within Anthropic's research blog and the Transformer Circuits Thread, that sparse autoencoders are a viable tool for moving from black-box behavior analysis to a more structured understanding of internal representations in production language models.^[3]^[7]

Second, the demo highlighted a class of interventions, activation steering via SAE features, that goes beyond traditional prompt engineering and fine-tuning. Because the intervention is applied to internal activations rather than to inputs or weights, it can in principle target concepts that are difficult to elicit via prompts and can be applied at inference time without modifying model parameters.^[3]^[7] Several follow-up community projects emerged shortly after the demo. For example, researcher Zhengxuan Wu announced an open-source attempt to reproduce a Golden Gate Bridge style steering effect on Llama 3 using a representation fine-tuning approach with a small number of training examples.^[12]

Third, the demo helped popularize the vocabulary of "features" in Claude's residual stream beyond the interpretability research community. Subsequent Anthropic publications, including the companion blog post "Mapping the Mind of a Large Language Model," reused Golden Gate Claude as a concrete illustration of the larger interpretability program.^[11] Anthropic positioned the work as part of a longer-term agenda in which feature-level analysis might support AI safety objectives such as monitoring, debiasing, and steering models away from undesired behaviors, writing that "we hope that we and others can use these discoveries to make models safer."^[11]

Finally, Golden Gate Claude functioned as an unusually effective communication artifact. Independent writers and reviewers noted that the demo made an otherwise dense interpretability paper legible to a broad audience by giving them direct, hands-on experience with a feature-steered model.^[4]^[10] Anthropic's own blog framed the release in those terms, stating that the company wanted "people to see the impact" of interpretability work and to be able to interact with a model whose internal feature had been deliberately turned up.^[1]

What were its limitations, and when did the demo end?

Golden Gate Claude was explicitly a research demonstration, and Anthropic was clear that it was not intended as a product or as a model offering production-grade safety properties. The demo was online for approximately 24 hours, starting on May 23, 2024, before being removed by Anthropic.^[1]^[4]^[6] After the demo ended, the Golden Gate icon in the claude.ai interface no longer led to the steered model, and the underlying capability to clamp a specific Sonnet SAE feature was not exposed as a user-controllable setting in subsequent Anthropic products.^[4]^[6]

Within the framework of the Scaling Monosemanticity paper, the most direct limitations of the demonstration were already discussed by the authors. The features recovered by the SAE are interpretable to a varying degree and are sensitive to choices such as dictionary size and training data: the 34-million-feature dictionary contains many features whose meaning has not been audited, and the labels assigned to features such as "Golden Gate Bridge" are based on human inspection of top-activating examples rather than on a formal semantic specification.^[3]^[7] Clamping a feature to a multiple of its observed maximum activation pushes the model into out-of-distribution activation regimes, and the resulting outputs are best understood as a qualitative demonstration of causal influence rather than as a literal readout of the feature's "true meaning."^[3]^[7]

Critics in the broader interpretability community, including commentary collected on LessWrong, observed that while the demo and the underlying paper were impressive in scale, they remained largely qualitative and example-driven, and that more work was needed to demonstrate that SAE-based feature steering is robust, comprehensive, and useful across a wider range of safety-relevant behaviors.^[8]^[10] The Scaling Monosemanticity authors themselves acknowledged related limitations in the paper, including the difficulty of evaluating feature interpretability at scale and the open question of how many of the model's behaviors are explained by the recovered feature dictionary.^[3]^[7]

Despite these caveats, Golden Gate Claude has continued to be cited as a landmark public-facing demonstration of mechanistic interpretability and as a clear example of how SAE-based feature clamping affects a deployed transformer model's behavior. Since the demo's removal, it has remained among the most widely discussed examples of an Anthropic interpretability intervention experienced directly by the public, and references to Golden Gate Claude have recurred in subsequent discussions of sparse autoencoder research, superposition, and feature steering.^[1]^[3]^[11]

References

1. Anthropic. "Golden Gate Claude." Anthropic News, May 23, 2024. https://www.anthropic.com/news/golden-gate-claude. Accessed 2026-05-19.
2. Anthropic (@AnthropicAI). Post on X announcing Golden Gate Claude, May 23, 2024. https://x.com/AnthropicAI/status/1793741051867615494. Accessed 2026-05-19.
3. Templeton, Adly; Conerly, Tom; Marcus, Jonathan; et al. "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Transformer Circuits Thread, May 21, 2024. https://transformer-circuits.pub/2024/scaling-monosemanticity/. Accessed 2026-05-19.
4. Willison, Simon. "Golden Gate Claude." simonwillison.net, May 24, 2024. https://simonwillison.net/2024/May/24/golden-gate-claude/. Accessed 2026-05-19.
5. Goldman, Sharon. "Anthropic tricked Claude into thinking it was the Golden Gate Bridge (and other glimpses into the mysterious AI brain)." VentureBeat, May 2024. https://venturebeat.com/ai/anthropic-tricked-claude-into-thinking-it-was-the-golden-gate-bridge-and-other-glimpses-into-the-mysterious-ai-brain/. Accessed 2026-05-19.
6. "Golden Gate Claude." Actipedia. https://actipedia.org/project/golden-gate-claude. Accessed 2026-05-19.
7. Scaling Monosemanticity paper (HTML version). Transformer Circuits Thread. https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html. Accessed 2026-05-19.
8. "I am the Golden Gate Bridge: Anthropic's Scaled Sparse Autoencoders." AI Safety Papers (Substack). https://aisafetypapers.substack.com/p/i-am-the-golden-gate-bridge-anthropics. Accessed 2026-05-19.
9. Askell, Amanda (@AmandaAskell). Post on X about Golden Gate Claude, May 23, 2024. https://x.com/AmandaAskell/status/1793750192124264519. Accessed 2026-05-19.
10. Mowshowitz, Zvi. "I am the Golden Gate Bridge." LessWrong, May 2024. https://www.lesswrong.com/posts/JdcxDEqWKfsucxYrk/i-am-the-golden-gate-bridge. Accessed 2026-05-19.
11. Anthropic. "Mapping the Mind of a Large Language Model." Anthropic News, May 21, 2024. https://www.anthropic.com/news/mapping-mind-language-model. Accessed 2026-05-19.
12. Wu, Zhengxuan (@ZhengxuanZenWu). Post on X about a ReFT-trained Golden Gate version of Llama 3, May 2024. https://x.com/ZhengxuanZenWu/status/1793757745415594080. Accessed 2026-05-19.
