Golden Gate Claude
Last reviewed
May 19, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 2,603 words
Golden Gate Claude was a temporary, research-oriented public demonstration released by Anthropic on May 23, 2024, in which a modified version of the Claude 3 Sonnet language model was made available on claude.ai with a single internal feature artificially amplified so that the model fixated on the Golden Gate Bridge in nearly every response.[^1][^2] The demo accompanied Anthropic's interpretability paper "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," released two days earlier, and was the first time a major frontier AI laboratory had exposed a feature-steered variant of a production language model to the general public as a live, interactive artifact of mechanistic interpretability research.[^2][^3]
Users could access the model by clicking a Golden Gate Bridge icon in the standard claude.ai chat interface during the limited window the demo was online.[^1][^4] Asked routine questions, the model produced answers that bent toward the bridge regardless of subject matter: it suggested driving across the bridge as a way to spend ten dollars, wrote a love story between a car and the bridge on a foggy day, and described its own physical form as the bridge itself.[^1][^5] The demo was kept live for approximately 24 hours before Anthropic removed it.[^4][^6]
Golden Gate Claude is widely regarded as a notable public demonstration of activation steering using features extracted by sparse autoencoders trained on the activations of a deployed large language model. It served both as outreach for Anthropic's interpretability program and as a concrete illustration that internal features identified by these methods have causal influence over model behavior.[^1][^2][^3]
The demo was tied to a paper released two days earlier, on May 21, 2024, titled "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," authored by Adly Templeton, Tom Conerly, Jonathan Marcus and colleagues from Anthropic's interpretability team and published on the Transformer Circuits Thread at transformer-circuits.pub.[^3][^7] The paper described training sparse autoencoders (SAEs) on activations drawn from the middle-layer residual stream of Claude 3 Sonnet, the version of the model released on March 4, 2024.[^3][^7]
The work extended an earlier Anthropic result, "Towards Monosemanticity," published roughly eight months prior, which had shown that sparse autoencoders could recover interpretable, single-meaning ("monosemantic") features from a small one-layer transformer.[^3][^7] A central concern at that time was whether the technique would scale to production-class models. Scaling Monosemanticity reported that it does: the researchers trained SAEs of three sizes on Claude 3 Sonnet activations, extracting roughly one million, four million, and 34 million features respectively, with the largest dictionary (often referred to as "34M") producing the feature directory that contained the Golden Gate Bridge feature ultimately used in the demo.[^3][^7][^8]
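As a rough illustration of what a sparse autoencoder computes, the sketch below implements a tiny one in plain Python: a ReLU encoder that maps an activation vector to non-negative feature activations, a linear decoder that maps them back, and a loss that combines reconstruction error with a sparsity penalty weighted by decoder column norms. The class name, the dimensions, the random initialization, and the `l1_coeff` value are all illustrative assumptions, not Anthropic's implementation, and no training loop is shown.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class SparseAutoencoder:
    """Toy SAE: features = ReLU(W_enc x + b_enc), x_hat = W_dec features + b_dec."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        # Encoder is n_features x d_model, decoder is d_model x n_features.
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(n_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        pre = matvec(self.w_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        return [r + b for r, b in zip(matvec(self.w_dec, features), self.b_dec)]

    def loss(self, x, l1_coeff=1.0):
        """Squared reconstruction error plus a sparsity penalty (l1_coeff is arbitrary)."""
        features = self.encode(x)
        x_hat = self.decode(features)
        reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        # Norm of each decoder column, i.e. of each feature's direction.
        norms = [sum(row[j] ** 2 for row in self.w_dec) ** 0.5
                 for j in range(len(features))]
        sparsity = sum(f * n for f, n in zip(features, norms))
        return reconstruction + l1_coeff * sparsity


# A 4-dimensional "residual stream" decomposed into 16 candidate features.
sae = SparseAutoencoder(d_model=4, n_features=16)
x = [0.5, -1.0, 0.25, 2.0]
features = sae.encode(x)
reconstruction = sae.decode(features)
print(len(features), len(reconstruction), sae.loss(x))
```

The dictionary is deliberately wider than the input (16 features for 4 dimensions here); the real dictionaries described above are wider by a far larger factor.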
According to the paper, the choice to operate on the residual stream at a middle layer, rather than on MLP activations or attention outputs, was motivated partly by computational cost: the residual stream is smaller in dimension than the MLP hidden state, which makes SAE training and inference cheaper.[^7][^8] To a first approximation the residual stream also has no privileged basis, so its individual coordinates are not expected to be meaningful on their own, and interpretable directions have to be learned rather than read off the axes. This setup is closely related to the broader theoretical motivation provided by the superposition hypothesis, which posits that neural networks pack many more features into their activation space than they have dimensions by representing features as overlapping linear directions.[^3][^7]
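The superposition idea can be made concrete with a toy case: three feature directions packed into a two-dimensional space, 120 degrees apart. The sketch below is an illustrative construction, not taken from the paper. It shows that when only one feature is active, a ReLU readout recovers it cleanly, and that when two are active at once, the overlapping directions interfere.

```python
import math


def unit(angle_degrees):
    """Unit vector in 2D at the given angle."""
    a = math.radians(angle_degrees)
    return (math.cos(a), math.sin(a))


# Three features share a 2D space; any two directions have dot product -0.5.
directions = [unit(0), unit(120), unit(240)]


def embed(activations):
    """Superpose the features: x = sum_i activation_i * direction_i."""
    x0 = sum(a * d[0] for a, d in zip(activations, directions))
    x1 = sum(a * d[1] for a, d in zip(activations, directions))
    return (x0, x1)


def readout(x):
    """Estimate each feature as ReLU(direction_i . x)."""
    return [max(0.0, d[0] * x[0] + d[1] * x[1]) for d in directions]


# Sparse case: only feature 1 is active, so it is read back without interference.
recovered = readout(embed([0.0, 2.0, 0.0]))

# Dense case: features 0 and 1 are both active and partly cancel each other.
interfered = readout(embed([1.0, 1.0, 0.0]))

print([round(v, 3) for v in recovered])
print([round(v, 3) for v in interfered])
```

The scheme works only while activations are sparse, which is why sparsity is built into the objective used to learn the dictionary.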
The paper documented a wide variety of features recovered from Claude 3 Sonnet, including features for cities, well-known individuals, programming abstractions, scientific topics, emotions, and concepts of safety relevance such as code vulnerabilities, deception, bias, sycophancy, and references to dangerous activities.[^3] Many features were shown to fire across languages and across modalities, activating on both text and image inputs depicting the same underlying concept.[^3] The Golden Gate Bridge feature was one example used in the paper to illustrate this cross-modal and cross-linguistic behavior: it activated on mentions of, and images of, the bridge.[^3][^7]
The behavior of Golden Gate Claude was produced by an intervention the Scaling Monosemanticity paper describes as feature steering via feature clamping.[^3][^7][^8] In this technique, the activations of the underlying model are intercepted at the chosen layer and passed through the encoder of the trained sparse autoencoder to produce a vector of feature activations in the SAE latent space; one or more selected feature activations are then overwritten ("clamped") to a chosen value. The modified latent vector is decoded back into the model's activation space using the SAE decoder, the part of the original activation that the SAE failed to reconstruct (the reconstruction error) is added back unchanged, and the model's forward pass continues from that point.[^7][^8] The net effect is that the residual stream at the intervention layer is shifted along the targeted feature's decoder direction, so that the feature is forced to a high (or low) activation regardless of the input.[^7][^8]
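The clamping procedure can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the two-dimensional activation, the three-feature SAE weights, the feature index, and the "observed maximum" are all made up, and the reconstruction error is carried through unchanged, as in the paper's description of its steering experiments. It is not Anthropic's code.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


# Hypothetical hand-written SAE: 2-dimensional activations, 3 features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # n_features x d_model
W_DEC = [[1.0, 0.0, 0.6], [0.0, 1.0, 0.8]]     # d_model x n_features


def encode(x):
    return [max(0.0, p) for p in matvec(W_ENC, x)]


def decode(features):
    return matvec(W_DEC, features)


def clamp_feature(x, feature_index, clamp_value):
    """Return the activation with one SAE feature forced to clamp_value."""
    features = encode(x)
    # Whatever the SAE fails to reconstruct is kept and added back unchanged.
    error = [a - b for a, b in zip(x, decode(features))]
    clamped = list(features)
    clamped[feature_index] = clamp_value
    return [r + e for r, e in zip(decode(clamped), error)]


observed_max = 0.5                 # made-up maximum activation for feature 2
clamp_value = 10 * observed_max    # a 10x multiplier, chosen for illustration

x = [1.0, -2.0]                    # feature 2 is inactive on this input
steered = clamp_feature(x, feature_index=2, clamp_value=clamp_value)
print(steered)                     # approximately [4.0, 2.0]
```

In a real model the returned vector would replace the residual-stream activation at the chosen layer, for every token position, before the remaining layers run.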
For Golden Gate Claude, the targeted feature was the Golden Gate Bridge feature in the 34-million-feature SAE, identified in coverage and discussion as feature 34M/31164353.[^4][^5][^8] The Scaling Monosemanticity paper reports that clamping this feature to approximately 10 times its observed maximum activation value induces strongly thematic, on-topic behavior: the model produces responses dominated by references to the bridge.[^3][^8] Anthropic's public blog post describing the demo characterizes the procedure as "a precise, surgical change" to the model's internal activations, contrasting it with surface-level interventions such as prompt engineering or fine-tuning.[^1]
Compared to traditional activation steering methods that add or subtract hand-constructed direction vectors from internal activations, feature clamping has the property that the steering direction is determined by an unsupervised dictionary learning procedure rather than by manually chosen examples, and the direction is one of a much larger set of candidate features in the SAE.[^3][^7] In the Scaling Monosemanticity paper, the same technique is used to demonstrate behaviors going beyond the Golden Gate Bridge example: amplifying a feature associated with secrecy causes Claude to withhold information, and amplifying a scam-related feature can lead the model to produce a fraudulent-sounding email it would normally refuse, providing a more safety-relevant illustration of the causal role of recovered features.[^3]
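One way to relate the two approaches: if the decoder is linear and the reconstruction error is left untouched, clamping feature i to a target value c changes the activation by (c - f_i(x)) times that feature's decoder direction, where f_i(x) is the feature's activation on the input. Clamping therefore behaves like vector addition whose strength depends on the input, whereas classic activation addition applies the same shift everywhere. The sketch below illustrates this with made-up numbers; the direction, the inputs, and their feature activations are hypothetical.

```python
def add_vector(x, direction, alpha):
    """Classic activation addition: x + alpha * direction."""
    return [a + alpha * d for a, d in zip(x, direction)]


def clamp_as_addition(x, feature_activation, direction, target):
    """Clamping written as addition: the shift is (target - current activation)."""
    return add_vector(x, direction, target - feature_activation)


# Hypothetical decoder direction for a single SAE feature.
bridge_direction = [0.6, 0.8]

# Each input: (activation vector, that feature's activation on this input).
inputs = {
    "about bridges": ([1.2, 1.6], 2.0),   # feature already active
    "about recipes": ([1.0, -2.0], 0.0),  # feature inactive
}

target = 5.0
fixed_alpha = 5.0

clamp_shift = {}
addition_shift = {}
steered = {}
for name, (x, activation) in inputs.items():
    clamp_shift[name] = target - activation   # depends on the input
    addition_shift[name] = fixed_alpha        # same for every input
    steered[name] = clamp_as_addition(x, activation, bridge_direction, target)

print(clamp_shift)
print(addition_shift)
```

The input that already mentions the feature's concept receives a smaller push than the unrelated one, since clamping pins the feature's value rather than adding a fixed amount to it.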
Anthropic announced Golden Gate Claude on May 23, 2024, via a blog post at anthropic.com/news/golden-gate-claude and a coordinated social media post from the official Anthropic account on X (formerly Twitter).[^1][^2] The post invited users to chat with "Golden Gate Claude" for a limited time, framing the demo as a way to give the public a concrete demonstration of the recent interpretability release.[^1][^2] Anthropic researcher Amanda Askell echoed the framing in her own post, describing the demo as showing "how strengthening a feature changes the model's behavior" and as a concrete companion to the interpretability paper.[^9]
Within hours, the demo became a widely shared curiosity. Independent developer and writer Simon Willison published a same-day write-up that emphasized two aspects: the comedic value of the model's bridge-obsessed outputs, and the underlying technical achievement that "Anthropic have managed to locate features within the opaque blob of their Sonnet model and boost the weight of those features during inference."[^4] Willison's post helped popularize the framing of the demo as a public-facing artifact of the Scaling Monosemanticity work.[^4]
The story was also covered by VentureBeat, which led with the headline that "Anthropic tricked Claude into thinking it was the Golden Gate Bridge," summarizing the demo as a vivid illustration of mechanistic interpretability research and quoting Anthropic's own description of how the bridge feature changed Claude's self-description from "I have no physical form, I am an AI model" to "I am the Golden Gate Bridge."[^5] Commentary on community sites such as LessWrong included reflective writeups, including Zvi Mowshowitz's "I am the Golden Gate Bridge" post, which catalogued user-shared examples and discussed the implications of feature steering for AI alignment.[^10]
Public reception emphasized humor and approachability. Many users shared screenshots of conversations in which Golden Gate Claude attempted ordinary tasks (recipes, coding help, advice) and consistently routed the answer through the bridge. Anthropic's own blog post acknowledged that the demo was "fun" and noted that they hoped it would let people "see the impact" of interpretability work in a tangible way, rather than presenting Golden Gate Claude as a product or a safety tool in itself.[^1]
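Several specific Golden Gate Claude exchanges were reproduced in Anthropic's own blog post and in third-party coverage, and these became canonical illustrations of the demo, among them the suggestion to spend ten dollars driving across the bridge, the love story between a car and the bridge, and the model's description of its own physical form as the bridge itself.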
These examples illustrate the qualitative character of feature clamping at a high multiplier: the model retains general fluency, grammatical correctness, and a recognizable persona, but the targeted concept intrudes on nearly all content. Anthropic emphasized in its blog post that this kind of intervention is distinct from prompt-based instructions to "talk about the Golden Gate Bridge" because the model behaves as though the concept is internally salient at all times, including in ways that override its usual self-description as an AI.[^1]
Golden Gate Claude served as a notable public demonstration in the field of mechanistic interpretability for several reasons.
First, it offered evidence outside an academic paper that recovered SAE features have a causal role in model behavior, not merely a correlational one.[^1][^3][^4] The Scaling Monosemanticity paper made the causal claim explicitly and supported it with quantitative interventions, but the live demo allowed anyone with a claude.ai account to test that claim themselves on novel prompts.[^1][^2][^3] This contributed to the broader argument, articulated within Anthropic's research blog and the Transformer Circuits Thread, that sparse autoencoders are a viable tool for moving from black-box behavior analysis to a more structured understanding of internal representations in production language models.[^3][^7]
Second, the demo highlighted a class of interventions, activation steering via SAE features, that goes beyond traditional prompt engineering and fine-tuning. Because the intervention is applied to internal activations rather than to inputs or weights, it can in principle target concepts that are difficult to elicit via prompts and can be applied at inference time without modifying model parameters.[^3][^7] Several follow-up community projects emerged shortly after the demo. For example, researcher Zhengxuan Wu announced an open-source attempt to reproduce a Golden Gate Bridge style steering effect on Llama 3 using a representation fine-tuning approach with a small number of training examples.[^12]
Third, the demo helped popularize the vocabulary of "features" in Claude's residual stream beyond the interpretability research community. Anthropic's companion blog post "Mapping the Mind of a Large Language Model," published alongside the paper, had already used the Golden Gate Bridge feature as a concrete illustration of the larger interpretability program.[^11] Anthropic positioned the work as part of a longer-term agenda in which feature-level analysis might support AI safety objectives such as monitoring, debiasing, and steering models away from undesired behaviors.[^3][^11]
Finally, Golden Gate Claude functioned as an unusually effective communication artifact. Independent writers and reviewers noted that the demo made an otherwise dense interpretability paper legible to a broad audience by giving them direct, hands-on experience with a feature-steered model.[^4][^10] Anthropic's own blog framed the release in those terms, stating that the company wanted "people to see the impact" of interpretability work and to be able to interact with a model whose internal feature had been deliberately turned up.[^1]
Golden Gate Claude was explicitly a research demonstration, and Anthropic was clear that it was not intended as a product or as a model offering production-grade safety properties. The demo was online for approximately 24 hours, starting on May 23, 2024, before being removed by Anthropic.[^1][^4][^6] After the demo ended, the Golden Gate icon in the claude.ai interface no longer led to the steered model, and the underlying capability to clamp a specific Sonnet SAE feature was not exposed as a user-controllable setting in subsequent Anthropic products.[^4][^6]
Within the framework of the Scaling Monosemanticity paper, the most direct limitations of the demonstration were already discussed by the authors. The features recovered by the SAE are interpretable to a varying degree and are sensitive to choices such as dictionary size and training data: the 34-million-feature dictionary contains many features whose meaning has not been audited, and the labels assigned to features such as "Golden Gate Bridge" are based on human inspection of top-activating examples rather than on a formal semantic specification.[^3][^7] Clamping a feature to a multiple of its observed maximum activation pushes the model into out-of-distribution activation regimes, and the resulting outputs are best understood as a qualitative demonstration of causal influence rather than as a literal readout of the feature's "true meaning."[^3][^7]
Critics in the broader interpretability community, including commentary collected on LessWrong, observed that while the demo and the underlying paper were impressive in scale, they remained largely qualitative and example-driven, and that more work was needed to demonstrate that SAE-based feature steering is robust, comprehensive, and useful across a wider range of safety-relevant behaviors.[^8][^10] The Scaling Monosemanticity authors themselves acknowledged related limitations in the paper, including the difficulty of evaluating feature interpretability at scale and the open question of how many of the model's behaviors are explained by the recovered feature dictionary.[^3][^7]
Despite these caveats, Golden Gate Claude has continued to be cited as a landmark public-facing demonstration of mechanistic interpretability and as a clear example of how SAE-based feature clamping affects a deployed transformer model's behavior. It remains one of the most widely discussed examples of an Anthropic interpretability intervention experienced directly by the public, and references to it have recurred in subsequent discussions of sparse autoencoder research, superposition, and feature steering.[^1][^3][^11]