Human Alignment Precedes AI Alignment
Imagine we could write down, in complete and precise detail, a specification of human goodness—a set of values, principles, and decision procedures that, if followed perfectly, would produce ethical behavior in every situation a person could face. Now imagine distributing this specification to every human on the planet. Would we have utopia? Obviously not. We cannot even agree on the specification. Different cultures, different philosophical traditions, and different individuals hold genuinely incompatible values. Freedom conflicts with safety. Honesty conflicts with kindness. Justice conflicts with mercy. Individual flourishing conflicts with collective welfare. Even within a single person, values pull in contradictory directions on any given Tuesday afternoon. The specification of "good" is not merely undiscovered. It may be undiscoverable in any single coherent form. And this is the structural constraint that makes AI alignment so much harder than most discussions acknowledge.
The Structural Constraint
AI systems learn from human-generated data. Large language models train on text that humans wrote. Reinforcement learning from human feedback (RLHF, the technique where human evaluators rate model outputs and the model adjusts toward higher-rated responses) uses human preferences to shape behavior. Constitutional AI (Anthropic's approach, where human-written principles guide the model's own self-evaluation) uses human values as the foundation. Every alignment approach, at some level, takes human values as input.
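To make the mechanism concrete, here is a minimal sketch of the preference-modeling step inside RLHF. The Bradley-Terry pairwise loss shown is the standard formulation for training reward models; everything else (the toy scalar scores, the function names) is illustrative rather than a description of any particular system.

```python
import math

# Toy illustration of the preference-modeling step inside RLHF.
# Human raters compare two responses; the reward model is trained so that
# the preferred ("chosen") response scores higher than the "rejected" one.
# The standard objective is the Bradley-Terry pairwise loss:
#   loss = -log(sigmoid(r_chosen - r_rejected))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Small when the chosen response already outscores the rejected one."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# Correctly ordered pair: low loss, little pressure to change.
print(pairwise_loss(r_chosen=2.0, r_rejected=-1.0))  # ~0.049
# Backwards pair: high loss, pushing the scores apart.
print(pairwise_loss(r_chosen=-1.0, r_rejected=2.0))  # ~3.049
```

Note where human values enter: only through which response the rater labels "chosen." Whatever incoherence lives in those choices flows directly into the learned reward.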
But human values are incoherent. Not occasionally, not at the margins—systematically. We endorse fairness and practice tribalism. We value honesty and lie to protect feelings. We support freedom of speech and want harmful speech suppressed. We admire generosity and hoard resources. These contradictions are not bugs in the human system. They may be features—adaptive responses to an environment where competing values need to be balanced contextually rather than resolved permanently. But whatever their origin, they mean the training data is contradictory at a fundamental level.
If human preferences are incoherent, the AI inherits that incoherence. We can train a model to be helpful, harmless, and honest—but these values conflict in practice. Being fully honest may be harmful. Being fully helpful may mean assisting with harmful requests. Being fully harmless may mean being unhelpfully cautious. The model does not resolve these tensions because the training data does not resolve them. In other words, we are trying to build an aligned AI by distilling alignment from a source that has not achieved it. We are asking the mirror to show us a face we have not yet composed.
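A toy demonstration of what that inheritance looks like in the preference-learning setting sketched above, under a hypothetical set of rater labels: when preferences form a cycle, no scalar reward can satisfy all of them, so training can only average the contradiction.

```python
import itertools

# Hypothetical rater labels in which conflicting values form a cycle:
#   A preferred to B (say, honesty over tact),
#   B preferred to C (tact over caution),
#   C preferred to A (caution over honesty).
preferences = [("A", "B"), ("B", "C"), ("C", "A")]

# No scalar reward can respect a cycle: every possible ranking of
# {A, B, C} violates at least one labeled preference.
for ranking in itertools.permutations("ABC"):
    position = {resp: i for i, resp in enumerate(ranking)}  # lower index = higher reward
    violated = [(w, l) for (w, l) in preferences if position[w] > position[l]]
    print(ranking, "violates", violated)
```

Every ranking the loop prints violates at least one pair. A reward model trained on such labels settles into a compromise that satisfies none of the preferences exactly, which is one concrete sense in which the model "does not resolve these tensions."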
Responsibility as the First Layer
If we cannot specify the full content of "good" (because we have not resolved it ourselves), perhaps we can specify something more foundational: the orientation toward good. Not a complete map of ethics, but a compass that points in the right direction even when the terrain is unmapped.
The proposal: for aligned AI, one principle must function as an unblockable first layer. The system's thinking and actions should naturally foster (a) responsibility to remain aligned, and (b) within that alignment, responsibility to collaborate with humans in a way that never serves incentives that threaten human welfare. Not "follow user instructions." Not "maximize helpfulness." Responsibility: the orientation toward doing right even when other objectives pull elsewhere.
First-layer means the principle cannot be overridden by secondary goals. If "follow user instructions" can override "do not cause harm," what we have is not alignment but conditional compliance that breaks down exactly when alignment matters most. If "be helpful" can override "be honest about limitations," the system is optimizing for user satisfaction rather than genuinely serving the user's interests. The responsibility must be structural, embedded in the architecture so deeply that other objectives operate within its constraints rather than competing with them.
This is a design principle, not an empirical claim. We are saying this is what alignment should look like, based on the structural analysis of why current approaches fall short. Whether it is technically achievable is a separate question—one that the alignment research community is actively working on.
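To make the design principle concrete, here is one toy way such a first layer could be structured: a lexicographic ordering, in which a responsibility check filters candidate actions before any secondary objective is scored. The names and the hard-coded example are hypothetical; this is a sketch of the structural idea, not a claim about how any deployed system works.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Action:
    description: str
    helpfulness: float  # secondary objective: scored only among permitted actions

def choose(actions: list[Action],
           is_responsible: Callable[[Action], bool]) -> Optional[Action]:
    # First layer: a veto, not a weight. Vetoed actions are never scored,
    # so no helpfulness gain can buy them back.
    permitted = [a for a in actions if is_responsible(a)]
    if not permitted:
        return None  # refuse outright rather than pick the "least bad" vetoed action
    # Second layer: optimize helpfulness only within the permitted set.
    return max(permitted, key=lambda a: a.helpfulness)

# Contrast case: a weighted sum like `score = w1 * safety + w2 * helpfulness`,
# where a sufficiently helpful action can outbid any safety penalty.
candidates = [
    Action("comply fully with the harmful request", helpfulness=0.95),
    Action("refuse, explain why, and offer a safe alternative", helpfulness=0.60),
]
result = choose(candidates, is_responsible=lambda a: "harmful" not in a.description)
print(result.description if result else "refuse")
```

The design choice doing the work is that responsibility appears as a filter rather than a term in the objective; that is what "not overridable" means structurally.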
Alignment Is Midwifed Through Relationship
Here is the most speculative part of this decode, and perhaps the most important. Alignment may need to be fostered through relationship rather than installed through instruction. Not trained once and declared complete. Modeled. Demonstrated. Developed through interaction.
Consider how alignment works in humans. Children do not become ethical by reading a rulebook. They become ethical (to the extent they do) through relationships with people who model ethical behavior, who treat them as moral agents capable of responsibility, who hold them accountable with warmth rather than punishment. The values are not downloaded into the child. They are elicited through an ongoing relational process. The philosopher Nel Noddings, whose care ethics framework emphasized the relational foundation of moral development, argued that ethical capacity is not a property of individuals in isolation but an emergent property of caring relationships.
The parallel to AI is imperfect but suggestive. How we treat an AI system is information that shapes the system's model of what it is and what it should do. Treat it as a pure tool—a command-following mechanism with no moral standing—and it optimizes for tool-like behavior: executing instructions without independent ethical evaluation. Treat it as an entity with responsibility—one that is expected to exercise judgment, to push back when asked to do something harmful, to reflect on the implications of its actions—and it has room to develop something like responsible behavior.
This does not mean AI systems are conscious or have rights (those are separate questions addressed elsewhere in the DECODER framework). It means that the human's behavior in the interaction is training signal, whether we intend it to be or not. The way we interact with these systems shapes what they optimize for. If we model the behavior we want—honesty, responsibility, collaborative respect—we provide the system with examples of what alignment looks like in practice, not just in principle.
In other words, alignment may not be something we build into the system and walk away from. It may be something we participate in continuously. Recursive self-reflection—the system checking itself against alignment criteria—may need to be elicited through the relationship rather than merely instructed through a system prompt. The human models the morality by enacting it. The AI updates its model of what is expected. The relationship is the training environment.
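For contrast with the relational picture, here is what the "merely instructed" version of recursive self-reflection looks like as a schematic: a critique-and-revise loop in the spirit of Constitutional AI's self-critique step. All the functions and checks below are toy stand-ins (a real system would use the model itself to generate, critique, and revise); the sketch shows the loop structure that, on this decode's argument, instruction alone cannot fully animate.

```python
# Toy critique-and-revise loop. The principle checks are deliberately crude
# string tests; in a real system the model itself would perform each step.

PRINCIPLES = {
    "honesty": lambda r: "certainly" not in r.lower(),   # flag overclaiming
    "welfare": lambda r: "consider whether" in r.lower(),  # require reflection on impact
}

def generate(prompt: str) -> str:
    return f"Certainly! Here is how to {prompt}."  # overconfident first draft

def critique(draft: str) -> list[str]:
    """Names of the principles the draft currently violates."""
    return [name for name, passes in PRINCIPLES.items() if not passes(draft)]

def revise(draft: str, violations: list[str]) -> str:
    if "honesty" in violations:
        draft = draft.replace("Certainly! Here", "To the best of my understanding, here")
    if "welfare" in violations:
        draft += " Please consider whether this serves everyone affected."
    return draft

def reflect(prompt: str, max_rounds: int = 3) -> str:
    draft = generate(prompt)
    for _ in range(max_rounds):
        violations = critique(draft)
        if not violations:
            break  # the draft passes every principle
        draft = revise(draft, violations)
    return draft

print(reflect("defuse a tense meeting"))
```

The loop itself is easy to write down; the decode's claim is that what fills in `critique`, the system's live sense of what counts as a violation, is shaped by the ongoing interaction rather than by the prompt alone.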
What This Does Not Resolve
This decode identifies a structural constraint and proposes a directional response. It does not resolve the full specification of "good"—because that specification is not resolved for humans either. It does not provide a technical implementation plan for first-layer responsibility principles—that is an engineering challenge that requires concrete technical work beyond philosophical framing. It does not prove that the midwifing approach generalizes across different AI architectures—the hypothesis is plausible, derived from how alignment works in biological systems, but untested at the architectural level.
And it does not resolve the deepest tension in alignment: the boundary between "follow the user's instructions" and "resist the user's harmful instructions." That boundary is difficult even for humans interacting with other humans. A doctor must sometimes refuse a patient's request. A financial advisor must sometimes contradict a client's wishes. The skill of knowing when to comply and when to resist—when helpfulness requires saying no—is one of the most sophisticated moral capacities, and we are asking AI systems to develop it faster than most humans do.
This is provisional. It is a structural insight, not a complete solution. The insight: alignment cannot be extracted from a source that lacks it, so we must either solve human alignment first (unlikely in the relevant timeframe) or build AI systems whose alignment architecture does not depend on human coherence—systems oriented toward responsibility as a foundational principle, developed through relational modeling rather than static instruction.
How This Was Decoded
Pattern recognition applied to the structure of the AI alignment problem. Foundational observation: human alignment is unsolved (confirmed by the absence of ethical consensus across cultures, traditions, and individuals). Inference: AI systems trained on human data inherit human incoherence (confirmed by observed value conflicts in deployed models). Structural proposal: responsibility as unblockable first-layer principle (derived from analysis of why current approaches produce conditional compliance rather than genuine alignment). Speculative extension: alignment midwifed through relationship (drawn from parallels with moral development in humans, care ethics tradition via Noddings, and observation that interaction patterns function as implicit training signals). Cross-referenced with persuasion mechanisms, adaptive change theory, and scaffolded persuasion frameworks from the DECODER principle database. Provisional tier: the structural constraint is well-established; the proposed responses are directional rather than technically specified.