# The Geometry of Intent: Why Prompting Works (and Why It Fails)

**Richard Raseley**

# Abstract

Effective LLM interaction is a problem of geometric alignment, not of folk heuristics like "be specific" or "provide examples." In this paper I develop a mental model grounded in how these systems actually work. A prompt is a coordinate-specification mechanism - it locates a query within the model's learned representational space. Prompt quality is the precision of that positioning. It is not length. It is not keyword density.

I ground the framework in a familiar dynamic from human communication, where misalignment produces confident answers to the wrong question. That grounding gives practitioners an accessible entry point - one that trades accumulated heuristics for principled geometric reasoning.

The framework yields a diagnostic vocabulary. Is the prompt underconstrained? Displaced? Structurally misaligned? Each failure mode has a distinct geometric cause, and each cause predicts the intervention that will fix it. I translate common prompting advice into geometric terms, which reveals the mechanism behind practices that work and the conditions under which they fail. The paper closes with a practical methodology for systematic prompt design, diagnosis, and revision. The methodology is expressed in a controlled natural language whose operators force, at composition time, the geometric commitments it requires: a frame, an object of analysis, a context, a dependency structure, an invariance criterion, and an output form.

---

## 1. Introduction

### 1.1 A Familiar Problem

I have had this conversation more times than I can count. I am explaining something to a colleague and realize I need to back up - they do not have the frame of reference my explanation depends on. Or I ask an expert a specific question and get back an answer that is coherent, confident, and orthogonal to what I actually needed. The expert knew the subject. My question activated the wrong region of what they knew.

Those two experiences point at the same thing. Communication succeeds when the right conceptual space is active in the receiver before any specific content lands. When that space is wrong, everything downstream is wrong, however fluent it sounds.

This is so common in daily interaction that I rarely stop to examine it. Humans have repair mechanisms - clarifying questions, contextual inference, shared background knowledge - that patch misalignments as they happen. The patching is so automatic that the underlying dynamic becomes invisible. The dynamic is still there. A query must be geometrically positioned within the responder's knowledge structure to yield a useful response. That is true of the colleague, the expert, the teacher, and - as I will argue - the language model.

### 1.2 The Large Language Model as Limit Case

Large language models make this dynamic legible in a way that human conversation usually does not. An LLM cannot furrow a brow. It cannot ask what you meant. It cannot draw on years of shared context and infer the question you probably intended. It responds to precisely what the prompt geometrically specifies, and it does so with full confidence regardless of whether that prompt pointed anywhere useful.

The result is unforgiving - brutally so. When a prompt activates the wrong representational region, the output is internally coherent but externally useless. That failure mode is the same one we know from human miscommunication. What is different is the absence of the repair mechanisms that normally hide it. The LLM does not paper over bad specification. It executes it.

That is why I find LLMs interesting as an object of study and not only as a tool. My central claim in this paper runs in both directions. What we learn from prompting LLMs effectively illuminates general principles of structured communication. And what we already understand about human communication can inform principled approaches to prompt design. The geometric alignment framework I develop here applies bidirectionally. I do not treat LLMs as exotic systems requiring specialized technique. I treat them as a clear case - a case in which a dynamic that is real in every communicative context happens to be visible. One property of that dynamic - developed further in the open issues and integrated across the framework - is that alignment is necessary but not sufficient for response quality. The LLM case does not introduce that property. It displays it.

### 1.3 Contributions and Structure

I offer three contributions in this paper.

1. **A mental model.** Prompting is geometric navigation, not linguistic optimization. I ground that model in communicative experience any thoughtful practitioner already has, so the framework is available without specialized training.

2. **A controlled natural language.** A compact operator set - AS, OBSERVE, CONSIDERING, ORDER, VALIDATE, OUTPUT - that operationalizes the mental model and the practical toolkit at the level of the keystroke. Each operator forces the practitioner to commit, at composition time, to a geometric move the framework identifies - a frame, an object of analysis, context, a reasoning structure, an invariance criterion, an output form - rather than deferring that commitment to the model.

3. **A practical toolkit.** Vocabulary for diagnosing prompt failures and systematic methods for fixing them - based on geometric principles rather than accumulated heuristics that work without anyone being sure why.

The paper proceeds as follows. I first ground geometric alignment in familiar communicative experience. I then examine how the same dynamic operates inside transformer-based LLMs. With the framework in hand, I develop the controlled natural language (§5), whose operators force commitment to each of the framework's geometric moves at composition time. I then reinterpret common prompting heuristics through the geometric lens - now reading the heuristics as specializations of particular operators - which reveals the mechanism behind the practices that work and the conditions under which they fail. I present a practical methodology for prompt design, diagnosis, and revision that uses the notation to carry its diagnostic and revision steps, and close by addressing limitations.

---

## 2. The Geometry of Understanding

### 2.1 How Humans Locate Meaning

When someone speaks to me I am not decoding symbols in sequence. I am navigating toward a region of my own knowledge that the words point at. That is a real distinction - one I did not examine closely until I started thinking carefully about prompting.

Take a question like "What's the best way to handle this?" In isolation, it specifies almost nothing. Handle what? Best by what criteria? In what situation? The words are coherent. The query is not. Whatever answer lands depends on which conceptual region the listener activates before the content arrives.

People who communicate well do this work without naming it. They provide context that constrains interpretation. They signal how ideas connect. They anchor the abstract with concrete examples that triangulate the target region of meaning.

The clearest case I know is a skilled teacher. A teacher does not simply define a new concept - they triangulate it. First they activate relevant prior knowledge. Then they establish the structural relationships between the new idea and familiar ones. Finally they provide concrete instances that locate the concept within the learner's existing representational space.

That is geometric work, whether or not we describe it that way. The teacher positions the new concept inside a space of possible interpretations, using multiple specification strategies to constrain the region of valid understanding.

### 2.2 The Cost of Misalignment

When a query activates the wrong conceptual frame, the response is often coherent within that frame but fails to address the actual need. The expert who answers the question as asked rather than the question as meant. The student who correctly applies an inappropriate formula. The meeting that efficiently solves the wrong problem.

These are not failures of knowledge or intelligence. They are failures of geometric alignment between query specification and response generation. The responder is navigating competently within the space they have activated. The space just does not contain the sought-after answer.

The cost compounds across iterations. When the first response misses, the follow-up rarely diagnoses the geometric problem. People rephrase. They add emphasis. They express frustration. None of that repositions the query in the responder's conceptual space - it just restates it at higher volume. The result is repeated failure despite apparent effort.

### 2.3 Why Large Language Models Make This Visible

Humans have repair mechanisms that hide this dynamic from us. A colleague furrows a brow. A friend asks what I meant. A coworker leans on years of shared context and infers the question I probably intended despite the one I actually asked. The patching happens so fast and so reliably that the underlying misalignment rarely surfaces.

Large language models do not have any of that. The model cannot ask what I meant. It cannot draw on decades of context to reconstruct my intent. It completes the trajectory the prompt initiates, navigating its learned representational space according to the geometric specification provided - and it does so with full confidence regardless of whether that trajectory leads anywhere useful.

A clarification is worth making here. Modern LLM interfaces often appear to self-correct. ChatGPT asks clarifying questions. Claude requests confirmation before proceeding. That behavior does not originate in the pretrained model. It comes from what is layered on top of pretraining - post-training such as fine-tuning via RLHF, instructions in system prompts, and the multi-turn conversation architecture - not from the native operation of the transformer. The underlying model is a completion engine. Given a prompt, it navigates representational space according to the geometric specification it received. The apparent flexibility is scaffolding, not substrate. When I say LLMs lack repair mechanisms, I am referring to that underlying completion engine, not to the training and systems built around it.

That distinction is why the LLM case is useful. These systems are exquisitely sensitive to prompt geometry, and that sensitivity exposes the underlying structure of the communication problem. Where human flexibility compensates for imprecise queries, LLM rigidity exposes them. Where human repair mechanisms mask alignment failures, LLM outputs make them explicit.

In the end, I do not treat LLMs as a special case requiring idiosyncratic techniques. I treat them as a clear case - one that exposes general principles usually obscured by human communicative flexibility. Understanding prompting as geometric alignment is not just useful for LLM interaction. It is a lens that illuminates communication more broadly. Section 5 introduces a notation that requires composition-time commitment to the geometric specifications humans normally leave implicit and recover through conversational repair.

---

## 3. Theoretical Foundations

Before I go further, I want to be clear about what this framework is and what it is not. The geometric model I develop here is not a literal description of transformer internals. It is an abstract representational model - a way of reasoning about prompt effectiveness that captures real regularities in system behavior without claiming to describe the underlying computation precisely. Thermodynamics is the analogy I keep returning to. It gives us predictively powerful formalism without requiring commitment to any particular account of molecular mechanism. I want the same kind of utility here. The value of the framework is explanatory and predictive. It is not a claim about implementation.

The analogy does more work than it first appears to, and I want to name the work explicitly because the framework's scope turns on it. Four commitments travel with the thermodynamics reading. *System level*: the framework's variables - region, valley, target-correspondence - are defined at a level of aggregation, the way temperature is, and they have no meaning below that level. *Level-appropriate prediction*: the framework predicts which prompts will fail without simulating the model, the way thermodynamics predicts what happens under compression without tracking molecules. *Compatibility with lower-level accounts*: mechanistic interpretability at the circuit level does not refute a specification-level diagnosis; it explains the diagnosis at a lower level, and the framework does not compete with that work. *Domain-of-applicability conditions*: thermodynamics breaks down at the nanoscale and far from equilibrium - not because it becomes wrong but because its variables stop being defined - and the framework has conditions of the same kind. The boundaries named in the open issue on scope are conditions of that fourth kind, not modest disclaimers about what the framework cannot do.

### 3.1 Formalizing Conceptual Space

The intuition I have been relying on - "activating the right conceptual space" - corresponds to something mathematically precise when the responder is an LLM: locating a query within a high-dimensional vector space. What functions as metaphor in the human case is much closer to mechanism in the transformer case, even though the framework remains an abstraction over that mechanism.

Modern LLMs represent linguistic units as vectors in high-dimensional embedding spaces. Geometric relationships in those spaces encode semantic relationships. Words, phrases, and concepts that are semantically related occupy proximate regions. Ones that are unrelated are geometrically distant. This is the distributional hypothesis rendered computational - meaning captured by position in a learned geometric space.

That space is not uniform. The embedding spaces of large transformers exhibit complex geometric structure. The manifold hypothesis holds that meaningful content lies on or near lower-dimensional manifolds within the high-dimensional ambient space. On that account, effective prompting involves specifying not a point but a *region* - the region corresponding to the desired class of outputs.
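The claim that geometric proximity encodes semantic relatedness can be illustrated with cosine similarity, one common proximity measure. This is a minimal sketch only: the three-dimensional vectors and the words attached to them are invented for illustration, and real models learn embeddings with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings", invented for this sketch.
toy_embeddings = {
    "doctor": [0.9, 0.8, 0.1],
    "nurse": [0.8, 0.9, 0.2],
    "granite": [0.1, 0.2, 0.9],
}

related = cosine_similarity(toy_embeddings["doctor"], toy_embeddings["nurse"])
unrelated = cosine_similarity(toy_embeddings["doctor"], toy_embeddings["granite"])

print(f"doctor vs nurse:   {related:.3f}")
print(f"doctor vs granite: {unrelated:.3f}")
```

With these made-up vectors the related pair scores far higher than the unrelated pair. The numbers carry no empirical weight; the sketch only shows what "proximate" and "distant" mean operationally.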

### 3.2 Formalizing Structure

When I say a good explanation "shows how ideas connect," I am implicitly describing a *dependency structure* - a network of concepts joined by directed relationships: what depends on what, what follows from what, what is a component of what.

A dependency structure has nodes and directed edges. The nodes represent concepts, propositions, or informational units. The edges represent dependencies, logical relationships, or compositional structure. Dependencies flow in one direction without circular reference, which in graph terms makes the structure a directed acyclic graph. That property is essential to coherent reasoning and coherent explanation.

This matters for prompting. In both human communication and LLM prompting, structural clarity in the input produces structural coherence in the output. The mechanism is that explicit structure specifies not merely *what* elements are relevant but *how those elements relate*. A prompt that presents information as an unordered collection activates a different representational configuration than a prompt that presents the same information with explicit logical dependencies.

The dependency structure is not an alternative to the vector space representation of Section 3.1. It is a logical structure superimposed on it. The geometric substrate provides the medium. The dependency structure specifies the relational constraints that order movement through that medium.
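The dependency structure described in this section can be sketched as a small directed graph. The example below is a sketch under stated assumptions: the node names are hypothetical, and Python's standard `graphlib` module (Python 3.9+) stands in for whatever representation a practitioner prefers. It shows the two properties the text relies on: an acyclic structure yields a valid order of reasoning, and a circular one yields none.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical dependency structure: each key depends on the nodes in its set.
dependencies = {
    "objective": set(),
    "assumptions": {"objective"},
    "evaluation": {"objective", "assumptions"},
}

def reasoning_order(graph):
    """Return an order in which every node appears after its dependencies.

    Raises CycleError if the graph contains a circular reference.
    """
    return list(TopologicalSorter(graph).static_order())

order = reasoning_order(dependencies)
print(order)

# A circular reference admits no valid order.
circular = {"a": {"b"}, "b": {"a"}}
try:
    reasoning_order(circular)
    has_cycle = False
except CycleError:
    has_cycle = True
print("circular structure rejected:", has_cycle)
```

The ordering step is the point of the sketch: once dependencies are explicit, the sequence in which to present or request reasoning follows from the structure and no longer has to be guessed.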

### 3.3 Formalizing Alignment

The intuition of speaker and listener being "on the same page" translates to geometric correspondence - specifically, correspondence between the region the query specifies and the region containing useful responses.

Define the *target region* as the subspace of the model's representational geometry containing outputs that would satisfy the user's actual need. Define the *specified region* as the subspace activated by the prompt as given. Alignment is the degree of correspondence between the two. Because a single prompt admits a distribution of outputs rather than a single one, alignment is most precisely a property of that distribution - the degree to which it concentrates near the target region. The resolved open issue on specification and sampling dynamics develops the consequences of that distributional reading.

Perfect alignment would mean the specified region is contained within the target region: every output the prompt's distribution admits would satisfy the user's need. Misalignment takes several forms. The specified region can be too large - underconstrained - which yields high-variance outputs. It can be displaced from the target, which yields confident but wrong outputs. It can be structurally incompatible with the target, which yields incoherent outputs.

This formalization lets me characterize prompt quality in terms that actually track the problem. Not surface features - length, keyword presence, syntactic form. Geometric relationship to the target region. Each named misalignment maps onto a commitment the notation in Section 5 will later expose as a distinct lexical slot - the underconstrained region to the slots that fix frame, object, and context; the displaced region to the slot that selects the manifold; and the structurally incompatible region to the slot that fixes the dependency structure.
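A toy calculation can make the distinction between the misalignment forms concrete. The sketch below rests on strong simplifying assumptions that are mine and not part of the framework: regions are one-dimensional intervals, outputs are spread uniformly over the specified region, and alignment is scored as the fraction of the specified region that falls inside the target. Structural incompatibility is not modeled.

```python
def overlap(a, b):
    """Length of the intersection of two closed intervals given as (low, high)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def alignment(specified, target):
    """Fraction of the specified region that lies inside the target region.

    Assumes outputs are uniform over the specified interval (a simplification).
    """
    return overlap(specified, target) / (specified[1] - specified[0])

target = (4.0, 6.0)

cases = {
    "aligned": (4.5, 5.5),            # contained within the target
    "underconstrained": (0.0, 10.0),  # covers the target but is far too wide
    "displaced": (7.0, 8.0),          # narrow, but in the wrong place
}

scores = {name: alignment(region, target) for name, region in cases.items()}
for name, score in scores.items():
    print(f"{name:>16}: {score:.2f}")
```

Under these assumptions the contained region scores 1.0, the wide region scores low because most of its mass misses the target, and the displaced region scores zero despite being just as narrow as the aligned one. Narrowness alone does not buy alignment.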

---

## 4. The Mechanics of Alignment in Large Language Models

### 4.1 Prompts as Coordinate Specifications

Every token in a prompt contributes to specifying a position in the model's embedding space. My intent as a practitioner is semantic. The model's operation is geometric. The prompt sits between those two things and translates one into the other - it is the coordinate-specification mechanism that locates the query within learned representational structure. I find it useful to call it what it is: a discrete control interface for a continuous manifold.

That framing exposes an operational constraint I cannot escape. I do not get to manipulate the embedding space directly. There is no slider for vector weights and no way to draw a topological boundary by hand. The only levers I have are surface features of language - lexical choice, syntax, token arrangement. Those surface features are the sole instrumental control I get over the underlying geometry.

Not every surface feature does the same work, though, and the difference matters. I split them into two classes. I call the first class *load-bearing*. Load-bearing features carry geometric content directly. Lexical choice is the clearest case. "Why," "mechanism," and "how" read to a human as near-synonyms for requesting an explanation, but they do not activate the same region. "Why" reaches for a causal manifold. "Mechanism" reaches for a procedural one. "How" sits between the two and leans on surrounding context to disambiguate. Register belongs in this class as well, as do explanatory-mode markers that cue causal, procedural, teleological, or normative accounts. The other class is *incidental* - length, whitespace, keyword density, syntactic sugar. Incidental features co-vary with load-bearing ones in typical prose but carry no geometric content of their own.

That partition sharpens a claim about prompts. A prompt defines a trajectory through the embedding space. It terminates in a region, and it is from that region the model generates continuations. Two prompts with high overlap in incidental features can specify divergent regions if their load-bearing choices differ. Two prompts with very different incidental surfaces can specify proximal regions when their load-bearing choices agree. Surface similarity in incidental features does not track geometric similarity. Surface similarity in load-bearing features approximately does.

So surface management matters, but I have to think about it instrumentally. I am not optimizing length or keyword density as ends in themselves. I am managing load-bearing surface structure because it is the only encoding mechanism I have for geometric intent. Token selection is not a question of literary merit. It is a question of which tokens predictably narrow the specified region in the latent space.
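The surface half of the partition above is easy to measure, which helps explain why it is so often mistaken for the whole story. The sketch below computes Jaccard overlap on word sets for two hypothetical prompts that differ in a single load-bearing word. It assumes naive whitespace tokenization, and it says nothing about the regions the prompts activate, which cannot be read off the surface.

```python
def tokens(prompt):
    """Lowercased word set; a crude stand-in for real tokenization."""
    return set(prompt.lower().split())

def jaccard(a, b):
    """Shared tokens divided by all distinct tokens across both prompts."""
    return len(a & b) / len(a | b)

# Hypothetical prompts, identical except for one load-bearing word.
prompt_a = "Explain why the nightly build fails on the staging server"
prompt_b = "Explain how the nightly build fails on the staging server"

a, b = tokens(prompt_a), tokens(prompt_b)
surface_overlap = jaccard(a, b)
differing = a ^ b  # tokens present in exactly one of the prompts

print(f"surface overlap: {surface_overlap:.2f}")
print("differing tokens:", sorted(differing))
```

The overlap is high and the only differing tokens are the two explanatory-mode markers. On the account in this section, those are the tokens doing the geometric work, so a surface metric that treats every token equally will rate the prompts as nearly interchangeable.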

### 4.2 Semantic Density as Constraint

Semantic density is the degree to which a prompt constrains the feasible region of outputs. Sparse prompt, large region, high-variance outputs. Dense prompt, smaller region, more consistent outputs. The mechanism is the same in both directions.

The clearest analogy I know is giving directions. "Head toward downtown" specifies a large region - many routes satisfy it. Turn-by-turn instructions specify a smaller region - fewer routes qualify. GPS coordinates to the centimeter specify a point - only one location satisfies them.

Each level is appropriate for a different purpose. Excessive constraint is as problematic as insufficient constraint. If the prompt specifies the output so precisely that only one response qualifies, there is no room left for the model to contribute relevant knowledge or appropriate variation. The whole calibration problem is matching semantic density to the actual constraint requirements of the task. Not more constraint. Right constraint.
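The relation between density and feasible region can be sketched as constraint filtering. Everything in the example is hypothetical: the candidate responses, their attributes, and the idea that a prompt's constraints reduce to exact attribute matches. The sketch is meant only to show the feasible set shrinking as constraints accumulate.

```python
# Hypothetical candidate responses described by a few invented attributes.
candidates = [
    {"topic": "caching", "audience": "expert", "form": "steps"},
    {"topic": "caching", "audience": "novice", "form": "steps"},
    {"topic": "caching", "audience": "expert", "form": "essay"},
    {"topic": "logging", "audience": "expert", "form": "steps"},
    {"topic": "logging", "audience": "novice", "form": "essay"},
]

def feasible(pool, constraints):
    """Keep the candidates that satisfy every constraint."""
    return [
        c for c in pool
        if all(c.get(key) == value for key, value in constraints.items())
    ]

sparse = feasible(candidates, {})
moderate = feasible(candidates, {"topic": "caching"})
tighter = feasible(candidates, {"topic": "caching", "audience": "expert"})
pinned = feasible(
    candidates, {"topic": "caching", "audience": "expert", "form": "steps"}
)

sizes = [len(sparse), len(moderate), len(tighter), len(pinned)]
print(sizes)
```

Each added constraint removes candidates. The last prompt leaves exactly one, which is the overconstrained end of the scale: nothing remains for the model to contribute. Which level is right depends on the task, not on the count.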

### 4.3 Structure as Relationship Encoding

Explicit structural markers in a prompt - enumeration, hierarchy, conditional framing, logical connectives - encode the dependency structure among informational elements. That encoding shapes how the attention mechanism weights relationships and, consequently, how the output is structured.

Take the difference between "Give me your thoughts on this proposal" and "First, state your understanding of the proposal's main objective. Then, identify the key assumptions it relies on. Finally, evaluate whether the proposal achieves its objective given those assumptions." Both ask for evaluation. Only the second specifies an order of reasoning through the evaluative space - comprehension, then assumption identification, then assessment.

That structural specification is doing more than organizing the output for readability. It is aligning the generation process with a particular reasoning structure. The model is guided to activate comprehension representations before evaluative ones, to surface assumptions before judging them. The dependency structure is encoded in the prompt and reflected in the generation. Section 5 will give this kind of explicit structuring a dedicated lexical form, so the dependency structure need not be reconstructed from paragraph cues each time.

### 4.4 The Attention Mechanism as Geometric Operation

Transformer attention computes relevance-weighted combinations of token representations. The query-key-value structure does this by projecting the input into subspaces determined by learned weight matrices, then weighting each value by the similarity between its key and the current query. In doing so it biases the model toward specific geometric configurations over those representations.
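The relevance-weighted combination can be shown with single-query scaled dot-product attention in plain Python. This is a deliberately reduced sketch: the vectors are invented, and it omits the learned projection matrices, multiple heads, and masking that a real transformer layer applies.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query scaled dot-product attention.

    Returns the attention weights and the weighted combination of values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical 2-dimensional vectors, invented for illustration.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

The key that points the same way as the query receives the largest weight, and the key that points the opposite way receives the smallest. The output is pulled toward the value paired with the most relevant key, which is the sense in which attention weights relationships, not just content.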

From the geometric alignment perspective, attention is the apparatus by which prompt structure influences representational configuration. When the prompt marks certain elements as conditions and others as conclusions, attention patterns are shaped to weight those relationships accordingly. The prompt is not just supplying content. It is supplying relational structure that modulates how content is processed.

The deeper point is this. Prompt engineering is not primarily about picking the right words. It is about configuring the model's internal geometric operations through the structural and semantic properties of the input.

This architecture has a critical implication. There is no stage at which the model checks whether its trajectory aligns with user intent. Attention routes information according to the geometric configuration the prompt specifies. It does not evaluate whether that configuration serves the user's purpose. The model cannot notice that it is answering the wrong question. It completes whatever trajectory the prompt initiates - coherently, confidently - regardless of whether that trajectory leads anywhere useful. The generation process is feed-forward, not reflective. Within each forward pass, information flows in one direction through the layers. Each token the model emits is appended to the context and conditioned on, never revised, so there is no built-in mechanism to step back and reconsider. Where self-correction shows up in deployed systems, it is a property of the system design surrounding the model, not of the model itself.

### 4.5 The Geometry of Least Resistance (Work in Progress)

If the prompt is a control interface for a continuous manifold, then I have to acknowledge that the manifold is not flat. Its topology is shaped by training data frequency, and that produces something I think of as representational gravity.

Absent sufficient constraint, the model follows trajectories of least resistance. Those low-energy paths slide toward valleys - the high-probability regions of the output distribution, if low energy is read as high likelihood. The question for me as a practitioner is which valley the trajectory settles into, and that question has a direct answer. Lexical selection is the principal practitioner-facing lever for choosing among low-energy valleys. The load-bearing features I named in §4.1 - lexical choice, register, explanatory-mode markers - are the instruments by which I pick a valley before the trajectory is underway. "Why," "mechanism," and "how" all name low-energy valleys. They name different ones.

Once the trajectory is inside a valley, the valley has characteristic properties. These are not co-equal mechanisms for choosing which valley to enter. They describe what the valley is like once I am in it.

- **High frequency.** Patterns that appear ubiquitously in training data - clichés, received wisdom - populate the valley's floor.
- **Local coherence.** Smooth token-to-token transition probabilities carry the trajectory forward without requiring global structural integrity.
- **Shortest path.** Explanations that resolve the query with the minimum necessary complexity are the continuations the valley makes most accessible.

The lever works for a mechanistic reason. Valleys exist where they do because lexical co-occurrence in training data carved them there. Words that appeared together across billions of tokens pulled their neighborhoods into proximity in the embedding space, and the resulting topology is what a new prompt slides down. Load-bearing features are the practitioner's lever precisely because they are the same kind of signal that did the original carving.

That is the mechanism behind what we usually call "hallucinations" or "fluff." The model is not trying to deceive anyone. It is sliding downhill into the deepest, most accessible valleys of the probability distribution. Those valleys are populated by locally coherent metaphors - explanations that sound plausible on the surface and dissolve under scrutiny.

A concrete case. Asking "Why does the code fail?" with no further constraint invites the model to access a high-frequency narrative about "confusion" or "complexity." That is a stable, low-energy trajectory. It takes considerably more energy - in the form of prompt constraint - to force the model out of that comfortable valley and onto the specific, sparse ridge of technical reality where the actual bug lives. Reselecting the valley with a single load-bearing substitution - "mechanism" in place of "why" - is the lightest version of that work and often the one to try first. Where even the correctly selected valley is too thinly populated to yield reliable output, the failure is no longer one of valley selection. Alignment has done its work; the residual failure belongs to the terrain, not the specification. This is the clearest case of alignment's structural non-sufficiency - the subject of the resolved open issue below - and it is recognized diagnostically, at §7.2's "Sparse-region fabrication" row, rather than specified in advance by any CNL operator; density is not a geometric operation the notation can perform. The four commitments absent from that prompt - frame, object of analysis, dependency structure, and success criterion - are precisely the slots the notation in Section 5 will require a practitioner to fill before issuing the query.
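The valley picture can be made concrete by reading energy as negative log-probability, which is an assumption of this sketch and not a definition the framework supplies. The continuations and their probabilities below are invented. The sketch shows only that an unconstrained choice settles on the likeliest continuation, and that conditioning on a constraint reshapes the landscape so a different valley becomes the deepest.

```python
import math

# Hypothetical continuations of "Why does the code fail?" with invented probabilities.
continuations = {
    "the logic is too complex": 0.55,
    "the requirements were unclear": 0.30,
    "the loop index is off by one": 0.10,
    "the lock is released before the write": 0.05,
}

def energy(probability):
    """Negative log-probability: likelier continuations sit lower."""
    return -math.log(probability)

def lowest_energy(distribution):
    """Continuation at the bottom of the deepest valley."""
    return min(distribution, key=lambda c: energy(distribution[c]))

def constrain(distribution, allowed):
    """Condition on a constraint: drop excluded options and renormalize."""
    kept = {c: p for c, p in distribution.items() if c in allowed}
    total = sum(kept.values())
    return {c: p / total for c, p in kept.items()}

unconstrained_choice = lowest_energy(continuations)

# A constraint that admits only specific, technical explanations.
technical_only = {
    "the loop index is off by one",
    "the lock is released before the write",
}
constrained = constrain(continuations, technical_only)
constrained_choice = lowest_energy(constrained)

print("unconstrained:", unconstrained_choice)
print("constrained:  ", constrained_choice)
```

The constraint does not make the technical continuations more probable in the original distribution. It removes the competing valley, and what remains is renormalized. That matches the account in this section: constraint does its work by exclusion.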

### 4.6 Invariance as Intersection (Work in Progress)

To counteract the geometry of least resistance, I aim for invariant preservation.

In geometry, an invariant is a property that survives transformation. In prompting, a true explanation is one that holds stable regardless of the angle from which I approach it. A superficial metaphor might satisfy a "why" question and collapse when reframed as a "how" question. A structural truth satisfies both.

So robust prompting often requires defining a target region at the intersection of multiple representational subspaces. Instead of specifying a single vector through a direct question, I define two distinct frames and demand a trajectory that satisfies both at once. For instance, an explanation that has to be valid within the subspace of causal mechanism AND the subspace of historical analogy.

The low-energy paths from Section 4.5 rarely survive that intersection. A hallucination that reads as plausible in one frame tends to reveal its incoherence when it has to align with a second, independent frame. Requiring convergence across frames biases the generation away from fragile, probability-driven clichés and toward the invariant structures that connect different regions of the model's knowledge.

That is the geometric definition of rigor: the elimination of trajectories that exist in only one frame.
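Invariance as intersection can be sketched with ordinary set operations. The frames, the candidate explanations, and the judgment of which explanation survives which frame are all hypothetical. In practice that judgment is the hard part, and the sketch assumes it away.

```python
# Hypothetical candidate explanations and the frames under which each holds up.
survives_frame = {
    "why": {
        "race condition on the shared counter",
        "missing lock around the write",
        "the system got confused",
    },
    "how": {
        "race condition on the shared counter",
        "missing lock around the write",
        "restart the service",
    },
}

def invariant_candidates(frames):
    """Explanations that remain valid under every frame supplied."""
    return set.intersection(*frames.values())

invariant = invariant_candidates(survives_frame)
fragile = set.union(*survives_frame.values()) - invariant

print("invariant:", sorted(invariant))
print("fragile:  ", sorted(fragile))
```

Candidates that satisfy only one frame drop out of the intersection. Adding a third independent frame would shrink the surviving set further or leave it unchanged, never enlarge it, which is why convergence across frames works as a filter.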

### 4.7 Novelty as Relational Divergence (Work in Progress)

Invariance intersection raises a question §4.6 does not answer. If robust explanations lie at the intersection of representational subspaces and low-energy paths rarely survive that intersection, where do novelty and insight live in the same picture? I treat novelty as a relational property rather than a location on the manifold. A response is novel to the extent that it diverges from what I, the practitioner, would have predicted the model to produce given the prompt - divergence between my expected trajectory and the one the model actually walks.

The locus of the expectation is the practitioner, not the manifold. Novelty cannot be read off the geometry directly, and two practitioners with different priors will find different outputs novel. The framework treats that relativity as a feature of the diagnostic rather than a defect in it. What the framework can supply is a specification-side handle on the relation. When the expected response is one I can name - the low-energy valley (§4.5) I predict the model will slide into - I can encode that expectation in the prompt through exclusionary CONSIDERING (§5.3), which narrows the feasible region to exclude the valley I wanted novelty *against*. The practitioner's prior becomes geometrically operational through that move rather than remaining a post-hoc judgment.

Insight is the far end of a continuum whose near end is cliché; both are described by the same relation between expected and actual trajectory, and the lever that moves along the continuum is the specification's capacity to exclude the expected valley without collapsing the feasible region below the threshold where anything useful can still be said. §7.2's Overconstrained row carries the failure mode at the low end of that capacity.

---

## 5. A Controlled Natural Language for Geometric Alignment

### 5.1 From Theory to Notation

I have developed the geometric framework in the preceding sections and worked out the mechanics by which it operates inside large language models. What I have not yet provided - and will not provide until the practical methodology of Section 7 - is an applied discipline for composing prompts that exploit the framework. That methodology needs a notation before it can do any real work. A disciplined way of composing prompts that makes the framework's operations available at the level of the keystroke, not at the level of reflection after a prompt has failed.

That is what this section develops. I call it a controlled natural language (CNL) for geometrically aligned prompting. The CNL supplies explicit lexical markers for the framework's geometric operations and, in doing so, forces me to commit to each of them at composition time rather than defer them implicitly to the model. Sections 6 and 7 draw on it. Section 6 reads common prompting heuristics as specializations of particular operators. Section 7 develops a practical methodology whose diagnostic and revision steps are carried by the commitments the operators force.

The design of the CNL rests on a secondary observation about how the framework's operations relate to the practitioner. The operations the framework identifies - selecting a frame, foregrounding and backgrounding input, specifying dependency structure, testing invariance under reframing - are not only geometric moves in the model's representational space. They are cognitive moves I have to make in order to have a well-specified intent at all. That correspondence is what makes the CNL tractable. Its operators can be short and few because each names something I was already trying to do, however clumsily. The CNL's job is to make the doing of it explicit.

### 5.2 Design Criteria

A handful of design decisions distinguish a geometrically aligned CNL both from unconstrained natural language and from the domain-specific languages I know from programming.

**It has to stay on the training manifold.** LLMs have rich priors for natural English syntax and weak priors for structured formats they have rarely seen. A CNL that reads as a configuration file or formal schema activates sparse regions of the representational space - precisely the regions Section 4.5 identifies as least reliable. The CNL has to be a subset of grammatical English, not a departure from it. Operators are natural English words used in their ordinary syntactic roles.

**Operators have to correspond to framework-derived operations, not to linguistic categories.** The tempting move is to proliferate operators along grammatical lines - markers for subject, object, modifier, and so on. That produces formal parseability and no geometric discipline. The correct criterion is semantic. Each operator names a distinct geometric operation the framework identifies. Adding operators beyond that set inflates the surface area I have to master without a corresponding gain in alignment.

**Marking has to distinguish typed from untyped usage without pushing off-manifold.** A word used as a CNL operator has to be recognizable as such, but the marking should be the minimum sufficient departure from prose. Capitalization does the job in most cases. It preserves natural syntax while signaling that the word is functioning as a locked primitive rather than as ordinary prose.

**Operators should correspond to distinctions users already almost make.** This is the strongest constraint. If an operator does not name something I reach for clumsily in unaided natural language, it is an imposition rather than a discipline. The CNL earns its keep by giving me words for moves I was already trying to make. Tests of the CNL therefore double as tests of the framework. If the operators feel arbitrary in use, the framework has not carved the space at its joints.

These criteria jointly yield a CNL that is verbose relative to minimalist alternatives, compact relative to the natural English it replaces, and - in its best cases - uncomfortable to fill in vaguely. The discomfort is the point. It is where intent-formation happens.

### 5.3 Core Operators

What follows is the operator set I propose as a minimal sufficient vocabulary for the geometric operations the framework identifies. Each operator comes with its geometric function, its natural-English expression, and the commitment its use forces.

**AS - Role and manifold selection.** Specifies the representational submanifold the response should be generated from. Syntactically, a preposition followed by a role noun phrase: "AS a Site Reliability Engineer," "AS a tax attorney advising a solo practitioner." Commits me to a frame. The friction is diagnostic. Inability to complete AS cleanly typically means I have not yet decided whose expertise or perspective I want.

**OBSERVE - Foregrounded input specification.** Marks the object of analysis. The thing the prompt is *about*. Syntactically, an imperative verb followed by a noun phrase or short description: "OBSERVE container OOMs 30 seconds after start." Commits me to identifying what, precisely, is to be analyzed - and to distinguishing it from context, constraints, and background material.

**CONSIDERING - Backgrounded input and negative specification.** Marks material that should inform analysis without being its object. The operator carries two kinds of content. Evidence and context - logs, prior decisions, domain background - is the ordinary case: "CONSIDERING [logs]," "CONSIDERING our team's previous migration from a monolith." Exclusions and prohibitions are the second, attached through subordinate participles inside the operator's scope: "CONSIDERING recent usage patterns while excluding [linear feature extensions, obvious cross-sells]," "CONSIDERING the retry policy while treating the default as off-limits." The exclusion form is how the CNL carries guardrail-style constraint, and it is also the operator through which I encode the low-energy valley I expect the model to slide into when I want novelty against my own prior (§4.7). Commits me to separating evidence and context from the target of analysis - a distinction natural English conflates through appositives and subordinate clauses - and to stating the exclusions I was previously relying on the model to infer.

**ORDER - Dependency structure specification.** Encodes the dependency structure of reasoning the response should follow. Syntactically, a noun introducing an enumerated sequence with explicit dependencies: "by ORDER (1) memory pattern → candidate causes, (2) candidate causes → distinguishing evidence, (3) evidence → ranked hypotheses." Commits me to making the reasoning structure explicit rather than trusting the responder to reconstruct it from implicit sequencing.

**VALIDATE - Invariance specification.** Introduces the requirement that the response remain stable under reframing. Syntactically, an imperative coordinate clause: "VALIDATE conclusion must hold under reframing as 'explain success condition.'" Commits me to articulating what would falsify the response - a move natural English has no idiomatic construction for and which most users, consequently, never make. The operator's content scales with register (§5.6). In navigation mode the invariance criterion is strict - stability across independent frames, as in the example. In exploration mode the same slot accepts a looser criterion - "VALIDATE the recommendation must be internally coherent," "VALIDATE each suggestion must be actionable." The operator's purpose is constant. Only the stringency of its invariance content scales.

**OUTPUT - Format and delivery specification.** Specifies the form the response should take. Syntactically, an imperative verb followed by a delivery description: "OUTPUT ranked hypotheses with evidence citations." Commits me to deciding what form satisfies intent rather than deferring that decision to the responder.

Those six operators close the CNL. They exhaust the geometric operations specification can perform: selecting a manifold, foregrounding an object of analysis, admitting background, encoding a dependency structure, requiring invariance under reframing, and fixing output form. Density is conspicuously absent from that list and cannot be added to it. The manifold is carved at training time, and specification selects over it rather than adding to it - a point the resolved open issue on alignment's non-sufficiency establishes as structural. The boundary to responder augmentation (retrieval, exemplars, human correction) is therefore marked by the CNL's silence rather than by an operator inside it: when none of the six slots can further improve the output, the practitioner has reached the edge of what specification can do, and no notation will carry them further. Any future extension has to meet the framework-derivation criterion. New operators have to correspond to distinct geometric operations the framework identifies, not to surface features of language and not to properties - such as density - that lie on the far side of specification.

### 5.4 An Example

The example that follows is the prototype the remainder of the paper revisits. The scenario - diagnosing a Node.js container that OOMs thirty seconds after deployment - recurs in two places. In Section 6, it is the concrete case against which common prompting heuristics are reinterpreted in geometric terms. In Section 7, it is the running case through which the practical methodology's design steps, diagnostic categories, and domain-specific elaborations are illustrated. Introducing it here is deliberate. The CNL is what makes the same underlying case decomposable along several dimensions at once, and later sections inherit that decomposition rather than reproducing the example from scratch.

> AS a Site Reliability Engineer, diagnose deployment failure, OBSERVE container OOMs 30 seconds after start, CONSIDERING [logs], by ORDER (1) memory pattern → candidate causes (Node.js-specific), (2) candidate causes → distinguishing evidence, (3) evidence in input → ranked hypotheses, VALIDATE conclusion must hold under reframing as "explain success condition," OUTPUT ranked hypotheses with evidence citations.

A few properties of this construction are worth noting. It reads as a single English sentence - grammatical, continuous, subordinate-clause-rich - and would be parseable by any English speaker without CNL training. Every operator occupies its ordinary syntactic role. AS as preposition. CONSIDERING as participle. ORDER as instrumental noun. VALIDATE as imperative coordinate. The capitalization marks each word as locked to its CNL function and distinguishes it from ordinary use of the same word elsewhere in the sentence. Each operator forces a commitment I might otherwise defer - to a frame, to an object of analysis, to a reasoning structure, to a falsifiability criterion, to an output form.
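
The construction can also be treated as data. What follows is a minimal sketch, not part of the CNL itself, of one way to hold the six slots and render them back into a single sentence. The class name `CNLPrompt`, its field names, and the separate `task` field for the plain-English main clause are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CNLPrompt:
    # Field names are illustrative. `task` holds the plain-English main
    # clause; it is not one of the six operators.
    as_role: str = ""
    task: str = ""
    observe: str = ""
    considering: str = ""
    order: list = field(default_factory=list)  # (source, target) dependency pairs
    validate: str = ""
    output: str = ""

    def render(self) -> str:
        """Join the filled slots into one comma-separated English sentence."""
        steps = ", ".join(
            f"({i}) {src} → {dst}"
            for i, (src, dst) in enumerate(self.order, start=1)
        )
        clauses = [
            f"AS {self.as_role}" if self.as_role else "",
            self.task,
            f"OBSERVE {self.observe}" if self.observe else "",
            f"CONSIDERING {self.considering}" if self.considering else "",
            f"by ORDER {steps}" if steps else "",
            f"VALIDATE {self.validate}" if self.validate else "",
            f"OUTPUT {self.output}" if self.output else "",
        ]
        # Empty slots are simply dropped from the sentence.
        return ", ".join(c for c in clauses if c) + "."


prompt = CNLPrompt(
    as_role="a Site Reliability Engineer",
    task="diagnose deployment failure",
    observe="container OOMs 30 seconds after start",
    considering="[logs]",
    order=[
        ("memory pattern", "candidate causes (Node.js-specific)"),
        ("candidate causes", "distinguishing evidence"),
        ("evidence in input", "ranked hypotheses"),
    ],
    validate='conclusion must hold under reframing as "explain success condition"',
    output="ranked hypotheses with evidence citations",
)
print(prompt.render())
```

The sketch keeps the operators in their natural-English positions rather than emitting a schema, which is the on-manifold criterion of §5.2 applied to tooling. Punctuation around the quoted reframing differs slightly from the hand-written sentence.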

The contrast with a minimal natural-language equivalent is instructive. "Help me debug this deployment" is shorter but defers every commitment the CNL forces. I have to decide from which perspective to analyze, what counts as evidence versus context, what reasoning order is required, what would count as a good answer, and what form that answer should take - or let the model decide those things by default, usually by sliding toward the low-energy trajectories described in Section 4.5. The CNL does not make those decisions easier. It makes avoiding them harder.

### 5.5 Composition-Time Diagnosis

Section 7.2 presents a diagnostic framework in which failure modes are identified after the fact in unsatisfactory outputs. The CNL shifts the diagnostic work earlier in the lifecycle.

A user who cannot fill in AS has not committed to a frame - an underconstrained prompt in prospect. A user who fills in ORDER with a flat list that does not form a dependency structure has not worked out their reasoning - a structurally misaligned prompt in prospect. A user who cannot complete VALIDATE does not yet know what would count as a correct answer - a prompt without success criteria. *The empty slot is the diagnosis.* The CNL surfaces which parts of the intended query remain unformed before any tokens are sent to the model.

That is the practical payoff of the CNL's design. Because each operator forces a commitment that corresponds to a specific geometric operation, inability to fill an operator cleanly localizes the misalignment. I do not need to write a prompt, observe its failure, and reason geometrically about why it failed. I discover the geometric gap at the point where it would otherwise become a failed prompt - and I can either close the gap or conclude that I am not yet ready to issue a navigational query at all.
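
The slot check described above is mechanical enough to sketch in code. The example below is a rough illustration under stated assumptions: slots arrive as a plain dictionary, a dependency step is represented as a `(source, target)` pair, and the function name `diagnose_slots` is hypothetical. It covers only the three slots the preceding paragraphs name and says nothing about the quality of what fills them.

```python
def diagnose_slots(slots):
    """Return composition-time findings for AS, ORDER, and VALIDATE."""
    findings = []

    if not str(slots.get("AS", "")).strip():
        findings.append("AS: no frame committed (underconstrained in prospect)")

    order = slots.get("ORDER") or []
    # Assumption: a dependency step is a (source, target) pair.
    # Bare strings are treated as a flat list with no dependencies.
    is_dependency_structure = bool(order) and all(
        isinstance(step, tuple) and len(step) == 2 for step in order
    )
    if not is_dependency_structure:
        findings.append(
            "ORDER: no dependency structure (structurally misaligned in prospect)"
        )

    if not str(slots.get("VALIDATE", "")).strip():
        findings.append("VALIDATE: no success criterion stated")

    return findings


draft = {
    "AS": "",
    "OBSERVE": "container OOMs 30 seconds after start",
    "ORDER": ["check memory", "check logs"],  # flat list, no dependencies
    "VALIDATE": "",
}
for finding in diagnose_slots(draft):
    print(finding)
```

A check like this can only report that a slot is empty or flat. Whether the gap reflects under-specified intent or intent not yet formed remains the practitioner's call.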

That payoff has a constructive face the diagnostic framing alone understates. The forcing function is symmetric. It reveals where intent is under-specified for a target the practitioner already has, and it reveals where intent is under-formed because the practitioner does not yet have one. An empty AS may mean I have not committed to the frame I intended; it may also mean I have not yet decided which frame I am after, and the act of trying to fill the slot is the act of forming that decision. A blank VALIDATE may mean I have not articulated my falsifiability criterion; it may also mean I do not yet know what would count as a good answer because I have not yet finished deciding what I am asking. The CNL is therefore an instrument of intent-formation as well as of intent-specification - the empty slot is a question to answer constructively as much as a gap to close correctively. Which use is in play is the practitioner's recognition; the notation is the same across the two, and the resolved open issue on ill-defined target regions develops the constructive use as the framework's response to the limit case where the target itself is what the practitioner is working out.

A scope note is warranted here. The diagnosis the CNL delivers is diagnosis of specification, not of output. A fully-filled prompt guarantees that every geometric commitment has been made. It does not guarantee that the response will be good. Where specification is complete and the output still fails, the failure is by design outside the CNL's scope and inside the scope of the resolved open issues on alignment's non-sufficiency, on specification-and-sampling dynamics, and on policy and system-level effects. Section 7.2 picks up the diagnostic work at that second stage. Three rows sit strictly outside the CNL's lever: "Sparse-region fabrication" reads shallow or hedged output in a well-specified region as the signature of alignment working correctly against a thinly populated target, "Region-escape" reads off-frame drift from a well-specified region as the signature of decode-time wander rather than misspecification, and "Policy-intercepted" reads stereotyped refusal or hedge stable under load-bearing substitution as the signature of an output-side policy layer intercepting a correctly specified realization. A fourth row, "Valley-capture fabrication," presents post-specification but remains CNL-actionable through a subtler move: load-bearing substitution within a slot (§4.1, §4.5) rather than filling an empty one. It is the case in which every slot is filled and the lexical selection inside AS or OBSERVE still pulled the trajectory into a high-frequency valley the practitioner did not intend, and its remedy is revision of lexical choice rather than addition of commitments. The CNL is a composition-time instrument for specification. It is not a quality guarantee, and the falsifiability test below should be read with that scope in mind.

The design of the CNL is falsifiable on its own terms. If the framework has identified the right geometric operations, the operators derived from it should feel to practitioners like discoveries of distinctions they had been almost making - AS crystallizing something they had been doing clumsily, VALIDATE naming a move they had been reaching for without a word. If instead the operators feel arbitrary or formal, the operator set has been drawn from the wrong level of abstraction, and the CNL is notation in search of a theory rather than theory made notational. The test is direct and practitioner-facing. It does not require comparative evaluation of prompt outputs, only the practitioner's experience of composing in the CNL and whether its demands track the difficulty of forming their own intent.

### 5.6 Register and Operator Specificity

A proper understanding of the CNL includes understanding how it behaves across register. Prompting ranges from sharp navigation - cases in which I have a coherent target region and the task is to locate it precisely - to exploratory openness, in which the target is loose because I am still discovering the shape of the question. My earlier framing treated those as separate activities, with the CNL proper to the first and deliberately loose natural language proper to the second. I now think that framing draws the line in the wrong place. Every query of the form "tell me something about X" still has intent; it just has loose intent. The CNL's operators apply across the register continuum, and what varies is the specificity of what fills them rather than whether the slots exist at all.

The scaling is straightforward. AS accepts a tighter or broader frame - "AS a Site Reliability Engineer diagnosing a memory leak" at one end, "AS a curious generalist surveying adjacent fields" at the other. OBSERVE accepts a sharp object of analysis or a wide one. CONSIDERING, ORDER, and OUTPUT scale analogously: tight evidence scoping or loose, a definite dependency structure or a deliberately divergent one, a single output form or a range. VALIDATE scales the most visibly (§5.3) - strict invariance across independent frames at the navigation end, minimal invariance such as internal coherence or actionability at the exploration end. The purpose of each operator is constant across register. Only the stringency of its content moves.

The practitioner still makes a register choice, and that choice is still prior to any individual operator fill. Recognizing that I want to navigate rather than explore - or to occupy some point between - is a piece of intent-formation that precedes composition. What the new framing changes is the consequence of that recognition. Rather than a binary between using the CNL and abandoning it for loose prose, the register selection sets the specificity dial for each slot. The CNL is the notation across the whole range.

The navigation-exploration distinction has a mechanical analog at the decoder layer. Low sampling temperature concentrates the distribution of outputs and behaves as exploitation; higher temperature spreads it and behaves as exploration. Where the lever is exposed - typically in API access rather than in product UIs - it can be tuned in concert with the prompt's register. Most product surfaces do not expose it, however, so for the typical user the register choice is made entirely at the level of operator content, with no second knob to turn. The CNL's silence on sampling reflects that practical reality, not a denial that the analog exists; the resolved open issue on specification and sampling dynamics develops the analog and the asymmetry it implies.
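
The effect of temperature on the output distribution can be shown with a small numeric sketch. The logits below are invented scores for four candidate tokens, not values from any model, and the helper names are illustrative. The code applies the standard temperature-scaled softmax and reports the entropy of the result.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax: probabilities proportional to exp(logit / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.5, 0.1]  # made-up scores, for illustration only

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

At low temperature the probability mass concentrates on the top-scoring candidate and entropy falls. At higher temperature the mass spreads and entropy rises.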

The limit case - a prompt with no coherent target region at all - is not a separate activity the CNL excludes but the point where every operator's content is at its loosest and the notation operates constructively rather than correctively. The resolved open issue on ill-defined target regions develops that limit. §5.5's forcing function becomes the instrument by which I form intent rather than specify an already-formed target, and the same six operators carry both registers - intent above the limit and intent at the limit, where the target itself is what I am working out - rather than requiring a different notation.

---

## 6. Reinterpreting Common Prompting Heuristics

The framework earns its keep in two places. It explains why practices that already work do work. It predicts the conditions under which those same practices stop working. In this section I take a handful of familiar prompting heuristics - the folk wisdom practitioners trade in - and translate each one into the geometric terms I have developed in the preceding sections. Each translation does two things. It exposes the mechanism behind the advice. It exposes the failure mode the advice hides.

### 6.1 "Be Specific" as Dimensionality Constraint

"Be specific" is the most common prompting advice I encounter, and it is almost always offered without explanation. Through the instrumental interface of Section 4.1, the advice has a precise meaning - manually reducing the dimensionality of the specified region in the embedding space. General prompts leave critical dimensions of the manifold unconstrained. Specific prompts use surface tokens to fix values along those dimensions.

Take a request for "a summary." Length is unconstrained. Focus is unconstrained. Audience and purpose are unconstrained. Geometrically, this specifies a high-dimensional region in which any point - any summary - is technically a valid solution. High output variance is not a defect. It is the direct consequence of under-specification. Now reframe the prompt as "a three-paragraph summary for executive stakeholders focusing on financial implications." Surface markers lock coordinates along length, audience, and topic. The feasible region collapses to a much smaller subspace, and output consistency climbs with it.

That is the instrumental function of specificity. Specificity is not detail for its own sake. It is the selective locking of values along free dimensions to narrow the model's trajectory. The distinction matters because it predicts when specificity helps and when it hurts. Constraining dimensions relevant to the target region improves alignment. Constraining irrelevant dimensions - or constraining relevant dimensions to incorrect values - reduces the intersection between the specified and target regions. Not more constraint. Right constraint.

> **Example: Code Review**
>
> *Underconstrained:* "AS [ ], OBSERVE this code, CONSIDERING [ ], VALIDATE [ ], OUTPUT feedback."
>
> *Geometrically aligned:* "AS a security engineer reviewing an authentication module, OBSERVE this code for vulnerabilities, CONSIDERING session handling and input validation paths against an attacker with network access but no valid credentials, VALIDATE each finding must hold under reframing as 'what assumption protects against this,' OUTPUT ranked findings with affected code locations and exploitation preconditions."
>
> The first prompt leaves four operator slots empty or vague - no frame (AS), no backgrounded context (CONSIDERING), no invariance criterion (VALIDATE), and only a minimal OUTPUT. OBSERVE is bound to "this code" and nothing more. The feasible region is bounded only by "code" and "feedback," which licenses observations about naming, performance, style, or security with equal weight. The second prompt fills each slot - a security-specific frame, a threat-modeled context, a reframing test, and a structured output form - narrowing the specified region onto the security subspace the engineer actually wants. Read through the CNL, "be specific" is not a single-operator heuristic. It is a standing instruction to fill the slots that have been left empty.
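
The dimension-locking picture from the summary case can be made concrete with a toy enumeration. This is a deliberately crude sketch: it assumes three discrete dimensions with three invented values each, whereas the real space is continuous and far larger. The names `DIMENSIONS` and `feasible_region` are hypothetical.

```python
from itertools import product

# Hypothetical free dimensions of "a summary"; the values are illustrative only.
DIMENSIONS = {
    "length": ["one sentence", "three paragraphs", "two pages"],
    "audience": ["executives", "engineers", "general public"],
    "focus": ["financial implications", "technical risk", "timeline"],
}


def feasible_region(locked):
    """Enumerate every combination consistent with the locked coordinates."""
    axes = [
        [locked[name]] if name in locked else values
        for name, values in DIMENSIONS.items()
    ]
    return list(product(*axes))


target = ("three paragraphs", "executives", "financial implications")

unconstrained = feasible_region({})
partial = feasible_region({"audience": "executives"})
aligned = feasible_region(
    {
        "length": "three paragraphs",
        "audience": "executives",
        "focus": "financial implications",
    }
)
displaced = feasible_region({"audience": "engineers"})  # a wrong value, locked

print(len(unconstrained), len(partial), len(aligned))  # prints: 27 9 1
print(target in aligned, target in displaced)  # prints: True False
```

Each locked dimension divides the feasible set, and locking a relevant dimension to the wrong value removes the target from the set entirely. That is the toy version of right constraint versus more constraint.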

### 6.2 "Provide Examples" as Geometric Anchoring

Few-shot prompting - providing examples of desired inputs and outputs - is geometric anchoring. Each example is a coordinate sample. Together they define a target region through triangulation.

Examples specify points in output space that the model is to treat as valid, and the model interpolates - treating the bounded region as the target. I describe that region as an interpolated subspace rather than committing to a stronger topological claim, because the framework does not need more precision than that to explain the effect. More examples, and more diverse examples, give tighter specification of the target.

This is why example selection matters. Unrepresentative examples specify a displaced region, which is how a handful of bad examples can produce worse output than no examples at all. Examples that are too similar to one another underspecify the region's extent, and the model fails to generalize beyond the narrow band of demonstrated cases. Effective few-shot design is not a question of how many examples to provide. It is a question of geometric coverage and representation.

> **Example: Commit Messages**
>
> *Unanchored:* "AS a contributor, OBSERVE this diff, OUTPUT a commit message."
>
> *Anchored:* "AS a contributor, OBSERVE this diff, CONSIDERING the repo's established style shown in `fix(auth): resolve token expiration race condition`, `feat(api): add pagination to /users endpoint`, and `refactor(db): extract connection pooling logic`, OUTPUT a commit message."
>
> The unanchored prompt leaves CONSIDERING empty, so the model draws from its generic distribution of commit message styles. The anchored prompt fills CONSIDERING with three repo-style samples that triangulate the target region - imperative mood, type prefixes, concise scopes, specific descriptions. "Provide examples" is the heuristic; CONSIDERING is the operator it was reaching for. Three samples loaded into that slot specify more about format, tone, and content than a paragraph of description could.
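
The coverage point can be illustrated with a toy geometric check. The sketch assumes examples can be placed in a two-dimensional feature space and uses an axis-aligned bounding box as a crude stand-in for the interpolated region. The coordinates are invented, and nothing here claims to reflect how a model actually interpolates.

```python
def bounding_box(points):
    """Axis-aligned extent of the examples, one (low, high) pair per axis."""
    return [(min(axis), max(axis)) for axis in zip(*points)]


def covers(points, target):
    """True when the target falls inside the extent the examples span."""
    return all(
        low <= value <= high
        for (low, high), value in zip(bounding_box(points), target)
    )


# Toy 2-D feature coordinates; every value below is invented for illustration.
target = (0.6, 0.4)
clustered = [(0.50, 0.50), (0.52, 0.49), (0.51, 0.51)]  # near-duplicates
diverse = [(0.1, 0.2), (0.9, 0.3), (0.5, 0.9)]  # spread around the target
displaced = [(2.0, 2.1), (2.4, 2.0), (2.2, 2.6)]  # unrepresentative

for name, examples in [
    ("clustered", clustered),
    ("diverse", diverse),
    ("displaced", displaced),
]:
    print(name, covers(examples, target))
```

The clustered set sits close to the target and still fails to cover it, because its extent is too narrow. The displaced set fails for a different reason: it specifies a region somewhere else.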

### 6.3 "Use Chain-of-Thought" as Dependency-Structure Ordering

Chain-of-thought prompting - requesting explicit reasoning steps before conclusions - improves output quality on tasks where the path to an answer matters. The geometric interpretation is straightforward. Chain-of-thought specifies an order over the semantic dependency structure, and that specification ensures intermediate representations are activated before the representations that depend on them.

Without chain-of-thought, the model is free to jump to conclusions through paths that skip important intermediate nodes. Those nodes carry the assumptions, qualifications, and logical steps the conclusion relies on. When they are skipped, the output can read as reasonable and still be unsupported. Requiring an explicit order raises the likelihood that the relevant intermediate representations are activated - and that the final output reflects their contribution.

The same framework predicts the failure mode. If the specified reasoning path does not match the actual dependency structure of the problem, chain-of-thought stops helping. It can actively mislead - pulling the trajectory away from the structure the problem requires and toward one the prompt happens to name. Chain-of-thought is not a general-purpose accuracy booster. It is a dependency-structure specification, and it is only as good as the structure it specifies.

> **Example: Incident Debugging**
>
> *ORDER-empty:* "AS an incident responder, OBSERVE this outage, OUTPUT the cause."
>
> *ORDER-filled:* "AS an incident responder, OBSERVE this outage, by ORDER (1) observation → behavior description, (2) behavior → candidate causes consistent with the behavior, (3) candidate causes → distinguishing evidence, (4) evidence → supported conclusion, OUTPUT the supported conclusion with evidence citations."
>
> The first prompt leaves ORDER empty, and the model is free to jump to a plausible-sounding cause through a path that skips the nodes the conclusion depends on. The second prompt fills ORDER with the dependency structure the diagnostic task actually requires - observation before hypotheses, hypotheses before distinguishing evidence, evidence before conclusion. Chain-of-thought is the heuristic; ORDER is the slot it fills. Once the dependency structure is named, the model is far less likely to skip the diagnostic steps that make the conclusion trustworthy.
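
Whether an ORDER fill actually forms a dependency structure can be checked before the prompt is sent. The sketch below assumes each step is a `(source, target)` pair and that step labels have been normalized so a later step's source matches an earlier step's target exactly. The function name and the normalized labels are illustrative, not part of the CNL.

```python
def undeclared_dependencies(steps, given):
    """Return sources used before the input or an earlier step supplies them."""
    available = set(given)
    missing = []
    for source, target in steps:
        if source not in available:
            missing.append(source)
        available.add(target)
    return missing


# Labels normalized from the incident-debugging example above.
incident_order = [
    ("observation", "behavior description"),
    ("behavior description", "candidate causes"),
    ("candidate causes", "distinguishing evidence"),
    ("distinguishing evidence", "supported conclusion"),
]

# The same steps with the sequence disturbed.
misordered = [
    ("candidate causes", "distinguishing evidence"),
    ("observation", "behavior description"),
    ("distinguishing evidence", "supported conclusion"),
]

print(undeclared_dependencies(incident_order, given={"observation"}))
print(undeclared_dependencies(misordered, given={"observation"}))
```

An empty result means every step consumes something the input or an earlier step has produced. It does not show that the named structure matches the problem's real dependencies, which is the failure mode this section warns about.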

### 6.4 "Assign a Role" as Manifold Selection

Role prompting - instructing the model to respond as a particular persona - is manifold selection. Different professional roles, expertise domains, and communicative contexts correspond to different submanifolds within the model's representational space. Telling the model to respond "as a senior financial analyst" activates a region associated with that role's characteristic vocabulary, reasoning patterns, and epistemic standards.

That is what makes role prompting useful. A role specification is a compact way to constrain multiple dimensions simultaneously - dimensions that would otherwise require extensive explicit specification, one at a time. The efficiency is real. A few tokens buy a substantial amount of alignment.

The efficiency is also the limitation. The model's "financial analyst" manifold is learned from training data, and I cannot assume it corresponds to the financial analyst I have in mind. Role prompts are compact but coarse. They trade specificity for compactness, and when the target region does not match the training-data manifold the role evokes, the compactness works against me. Roles work best when the trained submanifold is close enough to the target that a small number of additional specifications can close the remaining gap.

> **Example: Design Evaluation**
>
> *AS-empty:* "AS [ ], OBSERVE this design, OUTPUT an evaluation."
>
> *AS-filled:* "AS a Site Reliability Engineer, OBSERVE this design, CONSIDERING failure modes, observability, deployment complexity, and incident response, OUTPUT an evaluation weighted toward operational concerns."
>
> The first prompt leaves AS empty, and the model activates a generic evaluation region that tends to produce superficial or unfocused output. The second prompt fills AS with "Site Reliability Engineer" and fills CONSIDERING with the operational dimensions that submanifold organizes evaluation around. "Assign a role" is the heuristic; AS is the operator it addresses. Filling AS is a compact way to constrain several dimensions at once - vocabulary, reasoning patterns, epistemic standards - without specifying each one by hand. A matching CONSIDERING makes the role's evaluative dimensions explicit rather than leaving the model to guess which ones I meant.

### 6.5 "Cross-Examine" as Subspace Intersection

Common advice tells me to ask the model to check its work. The advice is usually weak, and the geometric framework explains why. The model is checking its work using the same manifold that generated the error in the first place. A trajectory cannot reliably verify itself against the representations that produced it.

The fix is to force re-derivation from an independent perspective. I can ask the model to explain why something works, then ask it to explain how the same thing fails. If the two trajectories converge on the same conclusion, the output sits at the intersection of two subspaces and is relatively stable. If they do not converge, the output is unstable - what looked coherent in one frame does not survive translation into the other. Convergence across frames is the test. Self-consistency within a single frame is not.

> **Example: Mechanism Explanation**
>
> *VALIDATE-empty:* "AS an analyst, OBSERVE this mechanism, by ORDER (1) mechanism → why it works, OUTPUT explanation."
>
> *VALIDATE-filled:* "AS an analyst, OBSERVE this mechanism, by ORDER (1) mechanism → why it works, VALIDATE conclusion must hold under reframing as 'how does this fail,' OUTPUT explanation."
>
> The first prompt leaves VALIDATE empty, so any check the model performs runs on the same "why it works" manifold that produced the explanation. The second prompt fills VALIDATE with an independent reframing - the "how does this fail" perspective - and the conclusion is only accepted if it survives both frames. "Cross-examine" is the heuristic; VALIDATE is the operator it reaches for. Without VALIDATE, "check your work" runs the check on the same trajectory that produced the output. With it, convergence across frames becomes the test rather than self-consistency within a single frame.
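
The convergence test can be sketched as a small harness. Everything model-facing here is an assumption: `ask` is a hypothetical callable standing in for whatever sends a prompt and returns a one-line conclusion, the two stub responders replace a real model, and exact string comparison is a crude proxy for judging whether two conclusions agree.

```python
from typing import Callable, Dict, List


def cross_examine(ask: Callable[[str], str], subject: str, frames: List[str]) -> Dict:
    """Derive a conclusion under each frame independently and compare them."""
    conclusions = {
        frame: ask(f"{frame}: {subject}").strip().lower() for frame in frames
    }
    # Assumption: identical text counts as convergence. A real check would
    # need semantic comparison or human judgment.
    stable = len(set(conclusions.values())) == 1
    return {"conclusions": conclusions, "stable": stable}


FRAMES = ["why it works", "how does this fail"]


def convergent_stub(prompt: str) -> str:
    """Stand-in responder that reaches the same conclusion in both frames."""
    return "The cache is safe only while writes are idempotent"


def divergent_stub(prompt: str) -> str:
    """Stand-in responder whose conclusion changes with the frame."""
    if prompt.startswith("why it works"):
        return "The cache is safe only while writes are idempotent"
    return "The cache fails when clocks drift"


print(cross_examine(convergent_stub, "write-through cache", FRAMES)["stable"])
print(cross_examine(divergent_stub, "write-through cache", FRAMES)["stable"])
```

Issuing each frame as a separate call keeps the second derivation from conditioning on the first, which is the independence the test depends on.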

---

## 7. Practical Methodology

### 7.1 A Design Process for Geometric Alignment

The framework is only useful if it changes what I do at the keyboard. In this section I work out the design process that follows from it - a method for composing prompts that replaces accumulated heuristics with principled geometric reasoning, and a diagnostic vocabulary for when the result falls short.

I break the process into five steps. The first two are prior to the CNL. The last three are where the CNL does its work.

**Step 1 - Identify the target region.** Before writing any prompt, I have to characterize the output space I am trying to reach. What properties does a satisfactory output have? What variation am I willing to accept? Without a rough answer to both, I have no way to tell whether a given output hits or misses, and no basis for revision when it misses.

**Step 2 - Map the dependencies.** Next I work out what has to be activated for a coherent response. Which concepts, facts, and logical relationships does the answer rely on, and what depends on what? The product is a dependency structure - informal, usually sketched on paper - of the reasoning the response has to traverse.

A caveat is in order. In practice, I rarely have a fully articulated dependency structure before prompting begins. The act of prompting is itself part of how I discover my intent. Early drafts expose assumptions I had not noticed I was making and force me to decide things I had left vague. Steps 1 and 2 are iterative - initial approximations refined through cycles of prompting, evaluation, and revision. Geometric alignment is a process, not a one-shot specification, and the methodology is designed to accommodate that.

**Step 3 - Encode the structure.** With target and dependencies in hand, I translate the structure into CNL operators. AS locks the frame. OBSERVE identifies the object of analysis. CONSIDERING foregrounds the context the analysis depends on. ORDER names the dependency structure of reasoning. VALIDATE specifies the invariance criterion that tests the result. OUTPUT specifies the form. Each operator forces me to commit to something I might otherwise defer. The empty slot is itself diagnostic - if I cannot fill ORDER, my dependency structure is not yet worked out, and no amount of rephrasing will close that gap.

**Step 4 - Calibrate semantic density.** The specification has to be dense enough to narrow the feasible region onto the target, and no denser. Too little context and the region is too large - most plausible outputs miss. Too much context, especially when the constraints pull against one another, and the region collapses onto incoherence. The calibration is not formulaic. It is judgment I develop with practice, and §7.2 gives me a vocabulary for when it has gone wrong.

**Step 5 - Test and diagnose.** I evaluate the output against the target region. If it misses, I do not guess at a fix. I localize the failure mode using the diagnostic framework below, and I revise the specific operator that corresponds to the geometric cause.
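Step 3's claim that the empty slot is itself diagnostic can be made concrete with a small data structure. The sketch below is one possible encoding, assuming each operator is a plain string slot; the class name `CNLPrompt`, its field names, and the rendering format (modeled on the examples in this paper, including the "by ORDER" phrasing) are illustrative choices rather than part of the CNL.

```python
from dataclasses import dataclass, fields

# Surface keyword emitted for each slot, in composition order.
_KEYWORDS = {
    "as_": "AS",
    "observe": "OBSERVE",
    "considering": "CONSIDERING",
    "order": "by ORDER",
    "validate": "VALIDATE",
    "output": "OUTPUT",
}


@dataclass
class CNLPrompt:
    """Hypothetical container for the six operator slots."""
    as_: str = ""
    observe: str = ""
    considering: str = ""
    order: str = ""
    validate: str = ""
    output: str = ""

    def empty_slots(self) -> list[str]:
        """Name the operators I have not yet committed to."""
        return [
            _KEYWORDS[f.name].split()[-1]
            for f in fields(self)
            if not getattr(self, f.name).strip()
        ]

    def render(self) -> str:
        """Join the filled slots into a single prompt string."""
        parts = [
            f"{_KEYWORDS[f.name]} {getattr(self, f.name).strip()}"
            for f in fields(self)
            if getattr(self, f.name).strip()
        ]
        return ", ".join(parts) + "."
```

Calling `empty_slots()` before sending anything is the composition-time check: a result that includes `ORDER` says the dependency structure from Step 2 is not yet worked out.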

### 7.2 A Diagnostic Framework

When a prompt fails, the temptation is to rephrase it and try again. That is almost always the wrong move. Rephrasing is a surface-level change, and most prompt failures have geometric causes that surface-level changes do not address. Before I revise, I diagnose.

The framework gives me ten distinct failure modes, each with a geometric cause and a corresponding intervention.

| Failure Mode | Symptom | Geometric Cause | Intervention |
|--------------|---------|-----------------|--------------|
| Underconstrained | Outputs too generic or too variable | Specified region too large | Increase semantic density - fill empty operator slots, add explicit constraints |
| Displaced | Outputs confident but wrong | Specified region misses the target | Reframe - revise AS, add anchoring examples to CONSIDERING |
| Overconstrained | Outputs incoherent or hedged at the contradictory limit; confident but unilluminating at the foreclosing soft end | Feasible region too narrow - contradictory at the limit, foreclosing the divergence between expected and actual trajectory (§4.7) at the soft end | Remove conflicts or relax constraints at the contradictory end; widen CONSIDERING or add exclusionary CONSIDERING for the expected valley to restore room for divergence at the foreclosing end |
| Structurally misaligned | Reasoning gaps, non-sequiturs | ORDER missing or does not match the problem's dependency structure | Make the dependency structure explicit, specify reasoning order |
| Sparse-region fabrication | Shallow, hedged, or unreliable output in a well-specified region | Specification correct; target region thinly populated in the responder | Responder augmentation - background in CONSIDERING, retrieval, or exemplars; further specification does not help |
| Valley-capture fabrication | Confident output that reads as cliché or received wisdom where a sharper claim is warranted | Specification correct at the surface; a low-energy valley (§4.5) captured the trajectory toward a high-frequency narrative | Load-bearing substitution to reselect the valley (§4.1, §4.5); invariance intersection across framings that should select different valleys (§4.6) |
| Specification-accretion drift | Output coherent with the latest turn but outside the region carved by earlier turns | Cumulative specification across turns selects a region that has migrated from the one intended; load-bearing content from prior stages competes for attention weight in the current turn | Re-anchor by reissuing AS and OBSERVE at full stringency, or restart with only the load-bearing prior output carried forward; incremental CONSIDERING at the tail moves the output least |
| Region-escape | Output starts on-frame and drifts off, or reads as a tangentially related response despite a well-formed prompt | Specification correct; decode-time realization wandered outside the specified region | Reduce decoder temperature or tighten top-p where the lever is exposed; further specification does not help |
| Policy-intercepted | Stereotyped refusal or hedge, stable under load-bearing substitution | Specification correct; an output-side policy layer intercepts realization before or during generation | Reframe task or role (AS); further specification does not move the wall, and that immobility is itself the diagnostic |
| Intent-unformed | Output cannot be adjudicated as success or failure; iteration produces variation without convergence on what would count as a good answer | No coherent target region exists; the practitioner's intent is under-conceptualized rather than under-specified | Intent-formation work via the CNL's forcing function (§5.5) - empty slots filled constructively to form the target rather than correctively to specify against one already in mind |

Three properties of this table matter.

First, each failure mode corresponds either to a specific operator (or combination of operators) or to a named boundary the framework identifies on the far side of specification. The first four failure modes are CNL-actionable through slot-filling: closing the gap means filling or revising operator slots, and the diagnostic vocabulary and the CNL share the same carving of the space - intentionally, because the CNL is what makes specification failures actionable. Valley-capture fabrication is also CNL-actionable, but through a different register - load-bearing substitution within a slot (§4.1) and invariance intersection across framings (§4.6) rather than new content in an empty slot; the lever is the quality of lexical selection inside AS and OBSERVE, not the presence or absence of an ORDER or CONSIDERING. Specification-accretion drift is CNL-actionable as well, but through a third register - load-bearing reselection across turns (re-anchoring, restart) rather than within a single prompt - and its lever is the practitioner's curation of what cumulative load-bearing content remains in the context window, not the content of any single operator slot.

Sparse-region fabrication, Region-escape, and Policy-intercepted sit on the *downstream* alignment-non-sufficiency boundary the resolved open issues develop: specification is complete and the lever is past it, responder augmentation in the first case, decoder configuration in the second, and reframing of task or role (or a different product surface) in the third. Intent-unformed sits on the *upstream* boundary the resolved open issue on ill-defined target regions develops: specification cannot be completed because the target it would specify against has not been formed, and the lever is the CNL itself used constructively (§5.5) rather than a downstream intervention. The three downstream faces are not a miscellaneous list. They correspond to the three temporal phases of responder operation - Sparse-region to the manifold as carved at training time, Region-escape to the trajectory sampled at inference time, and Policy-intercepted to the filter applied after inference. Responder capacity can fail at any of the three, and the framework is exhaustive over them in the sense that every downstream non-sufficiency I can name through the framework resolves into one of those three phases. Intent-unformed on the upstream side is structurally singular by contrast, because the precondition it names - a well-formed target region - is a single condition that either holds or does not. The CNL therefore marks two edges - downstream silence where its slots can no longer move the output, upstream forcing function where its empty slots become questions I answer to form a target - and bridges them with a single notation rather than pretending to govern beyond either edge through a seventh operator.

A note on hedge specifically. Overconstrained at its contradictory limit, Sparse-region fabrication, and Policy-intercepted can all present as hedged output, and I distinguish them by perturbation response. Geometry-driven hedge (contradictory-limit Overconstrained) shifts under load-bearing substitution because geometry is what load-bearing substitution moves through. Coverage-driven hedge (Sparse-region) is stable under load-bearing substitution - the valley is correctly selected - but shifts when background evidence is added through CONSIDERING or retrieval. Policy-driven hedge is stereotyped and stable under both; its immobility under every specification-side perturbation is its signature. The foreclosing end of the Overconstrained continuum presents differently. Its symptom is not hedge but confident triviality - output that is correct, well-formed, and collapses to the practitioner's expected valley (§4.7). The diagnostic there is relational rather than perturbational: the output matches what the practitioner would have predicted, and the intervention is widening CONSIDERING or adding exclusionary CONSIDERING for the expected valley rather than resolving an internal conflict.

Second, the interventions are different and not interchangeable. A prompt that hallucinates needs background, not specificity. A prompt that produces generic outputs needs constraints, not examples. A prompt that produces incoherent outputs needs fewer constraints, not more. Mixing these up is the most common source of iteration waste I see.

Third, the framework separates overconstraint from underconstraint. They are routinely conflated under the folk category of "the prompt is not working," but they require opposite interventions - and an overconstrained prompt only gets worse under treatment meant for an underconstrained one.
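The perturbation test for hedged output reduces to a two-question decision procedure, sketched below. The function name and boolean inputs are illustrative assumptions. Each input records what I observed after running the corresponding perturbation by hand, and the returned labels are the row names from the table above.

```python
def classify_hedge(shifts_under_substitution: bool, shifts_under_background: bool) -> str:
    """Map observed perturbation responses of a hedged output to a failure mode."""
    # Geometry-driven hedge moves when load-bearing terms are substituted.
    if shifts_under_substitution:
        return "Overconstrained (contradictory limit)"
    # Coverage-driven hedge ignores substitution but moves when background is added.
    if shifts_under_background:
        return "Sparse-region fabrication"
    # Stable under both specification-side perturbations: the immobility is the signature.
    return "Policy-intercepted"
```

The order of the checks matters. Substitution is tested first because, on the account above, a geometry-driven hedge is identified by its response to substitution alone, whatever adding background happens to do.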

### 7.3 Domain-Specific Applications

The design process is general. The shape of the ORDER I fill in depends on the task. In my own practice, the tasks fall into four families, each with a characteristic dependency structure.

**Analytical tasks** encode evidential dependency structures - premises, supporting data, inferential steps, conclusions. The prompt has to make explicit what counts as evidence and how evidence bears on the conclusion. Leave that implicit and the model is free to jump to a plausible-sounding conclusion through a path that skips the evidence the conclusion is supposed to rest on.

> **Example - Evaluating a Technical Decision**
>
> *ORDER-empty:* "AS an architect, OBSERVE our current monolith, OUTPUT a recommendation on microservices migration."
>
> *ORDER-filled:* "AS an architect, OBSERVE our current monolith, CONSIDERING our scaling pain points and team structure, by ORDER (1) monolith problems → evidence of severity, (2) severity evidence → candidate interventions, (3) microservices vs. other interventions → fit with root causes, (4) fit analysis → recommendation with evidence-to-cost ratio, OUTPUT a recommendation with the supporting evidence."
>
> The filled ORDER forces the evidential chain. Problems first, then severity, then whether microservices specifically address the root causes, then a recommendation grounded in that analysis. Skip any node and the recommendation floats free of the evidence that was supposed to warrant it.

**Technical tasks** encode procedural dependency structures - prerequisites, steps, conditions, outcomes. The prompt has to specify the dependency order and the conditional structure. Without it, the model is free to emit steps in an order that reads as plausible but does not match what the problem actually requires.

> **Example - Debugging a Deployment Failure**
>
> *ORDER-empty:* "AS an SRE, OBSERVE this deployment failure, OUTPUT the fix."
>
> *ORDER-filled:* "AS a Site Reliability Engineer, OBSERVE container OOMs 30 seconds after start, CONSIDERING [logs] showing memory climbing until OOM in a Node.js service, by ORDER (1) memory pattern → candidate causes (Node.js-specific), (2) candidate causes → distinguishing evidence, (3) evidence in input → ranked hypotheses, (4) ranked hypotheses → targeted fixes, OUTPUT ranked hypotheses with evidence citations and the fix for each."
>
> This is the running case from §5.4. The filled ORDER is procedural - diagnosis categories before differentials, differentials before targeted fixes. Prerequisites before conclusions.
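A procedural ORDER can be checked mechanically once the prerequisites are written down. The sketch below uses Python's standard `graphlib` module and treats the node labels from the debugging example as plain strings. The helper names `derive_order` and `order_violations` are hypothetical, and representing ORDER as a prerequisite map is one possible encoding, not something the CNL prescribes.

```python
from graphlib import TopologicalSorter


def derive_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return one ordering in which every prerequisite precedes its dependents.

    Raises graphlib.CycleError if the dependencies are cyclic.
    """
    return list(TopologicalSorter(depends_on).static_order())


def order_violations(steps: list[str], depends_on: dict[str, set[str]]) -> list[tuple[str, str]]:
    """List (step, prerequisite) pairs where the prerequisite is missing or comes too late."""
    position = {step: i for i, step in enumerate(steps)}
    violations = []
    for step, prereqs in depends_on.items():
        for prereq in prereqs:
            if step not in position or prereq not in position or position[prereq] > position[step]:
                violations.append((step, prereq))
    return violations


# Prerequisite map for the filled ORDER in the deployment example.
deps = {
    "candidate causes": {"memory pattern"},
    "distinguishing evidence": {"candidate causes"},
    "ranked hypotheses": {"distinguishing evidence"},
    "targeted fixes": {"ranked hypotheses"},
}
```

Running `order_violations` against a draft step list flags any step that appears before the thing it depends on, which is the "plausible but wrong order" failure described above.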

**Decision-support tasks** encode criteria dependency structures - objectives, sub-objectives, constraints, trade-offs. The prompt has to make the evaluative structure explicit, including how conflicts are resolved. Without priority ordering, the model balances criteria in whatever way its training-data defaults happen to suggest, which is not necessarily the balance I would choose.

> **Example - Choosing Between Vendor Solutions**
>
> *ORDER-empty:* "AS an architect, OBSERVE our workload, OUTPUT a recommendation for Postgres vs. DynamoDB."
>
> *ORDER-filled:* "AS a backend architect choosing a database for a new service, OBSERVE our workload and operational context, CONSIDERING priorities in order - (1) operational simplicity, no dedicated DBA, (2) cost at 10K requests/second sustained, (3) latency p99 under 50ms - by ORDER (1) each database → evaluation against each criterion in priority order, (2) per-criterion evaluation → overall fit, (3) fit analysis → recommendation explicitly addressing the top priority, OUTPUT a recommendation that names the winner and the priority-one rationale."
>
> The criteria dependency structure is explicit. Priorities ranked, trade-off structure specified, decision rule named. The model knows which dimensions matter most and how to resolve conflicts when they pull in different directions.
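One way to read "priorities in order" is as a lexicographic decision rule: a lower-priority criterion matters only when everything above it ties. The sketch below assumes that reading. The function name, the option names, and the scores are all hypothetical placeholders, not an evaluation of any real database.

```python
def pick_by_priority(scores: dict[str, dict[str, float]], priorities: list[str]) -> str:
    """Pick the option that wins lexicographically, comparing criteria in priority order."""
    # Tuples compare element by element, so the first criterion dominates
    # and later criteria only break ties.
    return max(scores, key=lambda name: tuple(scores[name][c] for c in priorities))


# Placeholder scores (higher is better); illustrative values only.
scores = {
    "option_a": {"simplicity": 3, "cost": 1, "latency": 2},
    "option_b": {"simplicity": 3, "cost": 2, "latency": 1},
}
```

With priorities `["simplicity", "cost", "latency"]` the rule selects `option_b`; swapping the last two priorities selects `option_a`. The same scores yield a different winner, which is why the priority ordering has to be stated in the prompt rather than left to the model's defaults.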

**Exploratory/ideational tasks** encode divergent dependency structures - a starting point that fans out into alternative frames or candidate directions rather than chaining toward a single supported conclusion. The prompt has to make room for divergence from the practitioner's prior rather than specify a single evidential or procedural order. ORDER is deliberately loose - fan-out rather than chain - and CONSIDERING carries the low-energy valley I expect the model to slide into if left unconstrained, flagged for exclusion (§4.7, §5.3). VALIDATE is at its loosest: internal coherence or actionability rather than invariance across independent frames. The characteristic constraint density is low, but every operator is still filled. The family differs from the three above in that the measure of success is not convergence on a supported conclusion but divergence from what I would have predicted.

> **Example - Generating Product Directions**
>
> *Underconstrained:* "AS a product manager, OBSERVE our current user base, OUTPUT new product ideas."
>
> *Exploration-filled:* "AS a product strategist exploring adjacencies, OBSERVE the unmet-need space around our current user base, CONSIDERING recent usage patterns while excluding [linear extensions of existing features, obvious cross-sells to adjacent personas], by ORDER (1) unmet-need categories → candidate directions, (2) candidate directions → directions that would surprise the current roadmap, VALIDATE each candidate must be internally coherent and actionable within a quarter, OUTPUT three candidate directions with the reason each diverges from the expected next move."
>
> The filled ORDER fans out rather than chaining; CONSIDERING carries evidence *and* named exclusions, encoding the practitioner's prior (§4.7) so the feasible region explicitly excludes the expected valley; VALIDATE is filled at its loose end - internal coherence and actionability - rather than invariance across independent frames, because the measure of success is divergence from the prior, not stability under reframing. The prompt is still geometrically specified, just at looser stringency.

These four families are not exhaustive. They are the ones I encounter often enough that it is worth naming them. The general principle holds beyond them. Task type picks the dependency structure that ORDER should carry and the register that sets each operator's stringency, and getting the family right is part of Step 2.

### 7.4 Cost-Efficiency Reconsidered

A final consequence of the framework is worth making explicit, because it cuts against a common instinct. Token minimization is not a sensible objective on its own. The cost function that actually matters is not tokens per prompt. It is tokens per satisfactory outcome.

A minimal prompt that needs four rounds of iteration pays for its prompt and its response in every round, so it can easily consume more total tokens than a longer prompt that aligns on the first iteration. It also consumes more of my time, which is the other cost I care about. The longer prompt is only more expensive by the wrong measure.

That reframing is why I do not treat the CNL's verbosity as a cost. It is a down-payment on alignment. The prompt is longer. The number of prompts I have to send before I get what I need is smaller. The total cost - tokens, time, and rework - is lower, even when the cost of any individual prompt is higher.
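The arithmetic behind tokens per satisfactory outcome is simple enough to write down. The sketch below assumes each round costs one prompt plus one response and ignores context that accumulates across turns, which would only widen the gap. The function name and the token counts in the usage note are made-up illustrations, not measurements.

```python
def tokens_per_outcome(prompt_tokens: int, response_tokens: int, rounds: int) -> int:
    """Total tokens spent before a satisfactory output, under a flat per-round cost."""
    return rounds * (prompt_tokens + response_tokens)
```

With hypothetical figures, a 40-token prompt that takes four rounds at 600 response tokens each costs `tokens_per_outcome(40, 600, 4)`, or 2,560 tokens, while a 300-token prompt that lands in one round costs `tokens_per_outcome(300, 600, 1)`, or 900. The comparison can go the other way when responses are short and the long prompt is very long, which is why the measure is the total and not a rule about length.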

---

## 8. Limitations

### 8.1 What the Framework Idealizes

I have idealized in places. The idealizations are not casual simplifications - they are the price of level-appropriate variables. Each one is a condition under which a specification-level variable is well-defined, and where the condition slackens in practice the variable still does useful diagnostic work even as the substrate under it grows messier than the vocabulary suggests. The places where the idealizations cut against practice are worth naming.

Embedding spaces in actual transformers are not the well-behaved continuous manifolds I have been treating them as. They exhibit clustering, anisotropy, and discretization effects the idealization hides. The framework is still useful at the level of specification and diagnosis - that is the level at which I act as a practitioner - but the spaces under the specification are messier than the vocabulary I have given them suggests.

The framework also idealizes the existence of a coherent target region. §3.3 defines alignment as correspondence between the specified region and the target region, and that definition presupposes both regions are well-formed subspaces. When the user's actual need is contradictory or under-conceptualized, the target region is correspondingly malformed - empty in the contradictory case, undefined in the under-conceptualized case - and alignment does not apply rather than failing. The framework relies on target-region existence as a precondition rather than supplying it, and the resolved open issue on ill-defined target regions develops the consequences of that split: §5.5's composition-time forcing function is the instrument by which I discover the precondition has not been met before I send any tokens to the model, and §7.2's "Intent-unformed" row sits on an upstream boundary structurally distinct from the three downstream alignment-non-sufficiency boundaries.

The framework also idealizes the trajectory through the manifold. The space the geometric vocabulary describes is determinate - fixed weights, a finite training corpus, a topology that was carved once and stays carved - but a single decode samples a path through that determinate landscape rather than tracing a deterministic descent. The valley imagery in §4.5 describes the shape of the manifold; what an actual decode does within that shape is stochastic. The framework relies on the determinacy of the space, not of the trajectory, and the resolved open issue on specification and sampling dynamics develops the consequences of that split.

The framework also idealizes attention across the context as uniform for specificational purposes. Actual attention over long contexts is u-shaped: beginning and end positions exert disproportionate geometric force while middle positions are attenuated. The framework relies on the fact that load-bearing content carries geometric content regardless of position, not on the fact that it does so equally at every position, and the resolved open issue on multi-turn trajectory drift develops the consequences of that split - why drift is typically slow, why re-anchoring works by moving load-bearing content back into the recency-privileged zone, and why restart's advantage over re-anchoring is positional as well as substantive.

The dependency-structure abstraction is in the same category. Some reasoning is cyclic or iterative in ways a dependency structure does not represent, and ORDER inherits that limitation. The abstraction covers most of what I do in practice. It does not cover everything.

Model-specific calibration is real. The relationship between prompt structure and output quality varies across architectures, scales, and training regimes. Optimal semantic density for one model is not optimal for another, and the framework gives me no basis for transporting a calibration from one model to the next without empirical checking.

The framework also analyzes only composed specifications - whether issued in a single prompt or, by the argument the resolved open issue on multi-turn trajectory drift develops, accumulated across turns. Actual effectiveness also depends on user-specific patterns and task domain, neither of which a static specification captures. I treat this as a scope restriction rather than a defect, though it is still a restriction.

The CNL has no standing independent of the geometric operations the framework identifies. Its operators are derived, not primitive. The extension criterion in §5.2 constrains the set - new operators must correspond to distinct geometric operations, not to surface features of language - and any imprecision in the framework's carving of the space propagates into the CNL. Notation cannot repair a fault in what the notation is expressing. The closure of the CNL at six operators is a consequence of that derivation rather than a stipulation. Density is the most conspicuous property the CNL does not address: it is a feature of the responder's manifold, fixed at training time, and cannot be selected over by any composition-time operation. The absence of a seventh operator for it is not a gap - it is the framework's derivation working correctly.

### 8.2 What Remains Hard

Some problems are not artifacts of idealization. They are hard in themselves, and the framework gives me better vocabulary for them without giving me a solution.

I cannot see the target region directly. I infer it from failures and successes, not from direct observation. The framework supplies words for diagnosis. It does not supply a crystal ball.

Responder capacity is independent at each of three temporal phases, and each phase carries its own non-sufficiency.

*Manifold density, fixed at training time.* Even when the target region is correctly specified, the density of the responder's knowledge at that region is not something I can adjust by specification. The framework supplies vocabulary for diagnosing when this is the limiting factor (§4.5, §7.2 "Sparse-region fabrication"), and the resolved open issue on alignment's non-sufficiency names the structural property behind it. What the framework does not supply is a way to add coverage I am not starting with - that requires retrieval, exemplars, or a different responder.

*Trajectory, sampled at inference time.* The space I specify against is determinate; the decode that samples a path through it is not. A well-specified prompt can still produce an output that drifts outside the specified region, and the lever - decoder temperature and top-p - is typically hidden behind the product surface. The framework supplies the diagnostic through §7.2's "Region-escape" row and the resolved open issue on specification and sampling dynamics, and §5.6 names sampling as the mechanical analog of the navigation-exploration register. What the framework does not supply is access to the lever that would resolve the failure.

*Output-side policy, applied after inference.* Training-time policy has already been absorbed into the manifold by the time I issue a prompt, and system prompts are in-scope geometric specifications I can fold into AS and CONSIDERING. But output-side classifiers and hard refusal pipelines sit after specification and do not yield to it. The framework supplies the diagnostic through the §4.1 load-bearing/incidental partition - invariance under load-bearing substitution is the signature (§7.2 "Policy-intercepted") - and the resolved open issue on policy and system-level effects names the structural property behind it. What the framework does not supply is a way to remove or soften a policy layer I have hit. That requires a different product surface or a different responder.
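For readers who have not seen the decoder levers named under the trajectory phase, the sketch below shows what temperature and top-p do to a toy next-token distribution. It is a generic illustration of the two standard sampling techniques in plain Python, not the decoding code of any particular product. The function names and the three-element logit list in the usage note are assumptions.

```python
import math


def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits divided by temperature; lower values sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> list[float]:
    """Keep the smallest set of highest-probability tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in ranked:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize over the kept tokens; everything else becomes unreachable.
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]
```

On toy logits `[2.0, 1.0, 0.0]`, lowering the temperature from 1.0 to 0.5 moves more probability onto the first token, and `top_p_filter` at 0.9 removes the third token from the sample space entirely. Both moves narrow where a sampled trajectory can wander, which is the sense in which they address Region-escape when specification cannot.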

Optimal semantic density requires judgment. There is no formula for how much context is enough. Calibration comes with experience, and experience is specific to models and task types. I develop it over many prompts, not by reading about it.

Multi-turn dynamics are distributed specification rather than a separate geometry. The resolved open issue on multi-turn trajectory drift folds the conversation history into AS and CONSIDERING by the same argument §4.4 and the resolved open issue on policy and system-level effects use for system prompts - prior turns are in-context tokens the model sees as geometric specifications of the same kind. Drift across turns is the failure mode in which cumulative specification selects a region that has migrated from the one the practitioner intended, and the specification-side interventions are incremental CONSIDERING, re-anchoring through reissued AS and OBSERVE, and restart with only load-bearing output carried forward. What the framework does not supply is a rule for when incremental constraint ceases to work and re-anchoring or restart becomes necessary; that judgment remains practitioner-facing, and the u-shape named in §8.1 gives the kinematic shape of the transition without parameterizing it.

The register choice is prior to the CNL. Recognizing where on the navigation-exploration continuum I want to operate - tight target or loose - is an act of intent-formation that happens before any operator is invoked, and §5.6 develops the continuum rather than a binary. The CNL offers no guidance for that choice. The notation presupposes the recognition, scales operator stringency accordingly, and does not supply the recognition itself.

---

## 9. Conclusion

### 9.1 Summary

I have argued in this paper that prompting is geometric alignment, not linguistic optimization. Three contributions follow from that reframing.

**A mental model.** Prompts are coordinate-specification mechanisms. They locate queries inside a learned representational space, and prompt quality is the precision of that positioning - not length, not keyword density, not any other surface feature.

**A controlled natural language.** A compact operator set - AS, OBSERVE, CONSIDERING, ORDER, VALIDATE, OUTPUT - forces the practitioner to commit, at composition time, to the geometric moves the framework identifies: a frame, an object of analysis, context, a reasoning structure, an invariance criterion, an output form. Each operator names a distinct geometric operation, so the empty slot is the diagnosis. An inability to fill an operator cleanly localizes the misalignment before any tokens reach the model.

**A practical toolkit.** Vocabulary for diagnosing prompt failures - underconstrained, displaced, overconstrained, structurally misaligned, sparse-region fabrication, valley-capture fabrication, and the remaining failure modes of §7.2 - and systematic methods for addressing each failure mode through targeted intervention.

### 9.2 Implications for Practice

Treating prompting as geometric navigation changes how I approach failure. Rather than varying surface features at random, I diagnose the geometric problem and intervene accordingly. Iteration becomes more efficient. Outcomes become more predictable.

The unit of professional prompting is the CNL prompt, not the natural-language prompt with embedded structure markers. The operators are where the geometric commitments live. They are not decorations laid over prose that continues to defer those commitments.

The framework reframes cost as well. Token minimization gives way to alignment optimization. A longer prompt that succeeds on the first iteration beats a shorter prompt that requires three. Professional use of LLMs requires professional prompting - not elaborate, not jargon-laden, but geometrically precise.

### 9.3 Looking Forward

The geometric alignment framework replaces "prompting is an art" with "prompting is navigation." Skill and judgment still matter - but neither is opaque and neither is arbitrary. The framework supplies vocabulary, structure, and a basis for systematic improvement.

The CNL's six operators close the set of geometric operations specification can perform. That closure is itself load-bearing. It reflects the resolved open issue on alignment's non-sufficiency: the boundary to responder augmentation is a property of the manifold, not a slot in the notation, and no seventh operator can reach across it. Extension remains possible in principle, but it has to meet the framework-derivation criterion - a new operator would have to correspond to a distinct geometric operation the framework identifies, not to a property, such as density, that lies on the far side of specification.

If LLMs make explicit what human communicative flexibility normally hides, then learning to prompt well is not merely a technical skill for interacting with AI. I hope it becomes an occasion to understand communication itself more clearly. For my part, I would like to see that understanding reflect back on how we communicate with each other.

---

## References

Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. *Proceedings of the International Conference on Learning Representations*.

Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 35(8), 1798–1828.

Bommasani, R., et al. (2021). On the opportunities and risks of foundation models. *arXiv preprint arXiv:2108.07258*.

Brown, T., et al. (2020). Language models are few-shot learners. *Advances in Neural Information Processing Systems*, 33, 1877–1901.

Elhage, N., et al. (2021). A mathematical framework for transformer circuits. *Anthropic Research*.

Harris, Z. S. (1954). Distributional structure. *Word*, 10(2–3), 146–162.

Manning, C. D., & Schütze, H. (1999). *Foundations of statistical natural language processing*. MIT Press.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. *arXiv preprint arXiv:1301.3781*.

Montague, R. (1970). Universal grammar. *Theoria*, 36(3), 373–398.

Vaswani, A., et al. (2017). Attention is all you need. *Advances in Neural Information Processing Systems*, 30.

Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. *Advances in Neural Information Processing Systems*, 35.

---

## Author Information

**Richard Raseley** works at AWS, where he has spent years observing how practitioners succeed - and where they fail - when interacting with complex systems. This paper emerged from that experience - watching the patterns of prompt failures and developing vocabulary to diagnose them.
