[{"content":"When we discuss the problem of AI alignment, we tend to view it as an unprecedented technological challenge. However, human society has been conducting an alignment experiment for thousands of years. The object of this experiment is not silicon-based intelligence, but carbon-based intelligence itself. We call it law.\nIn fact, I think the core dilemma of aligning legal systems with AI is strikingly similar. How to constrain the infinite possible behavior of an agent with finite rules? How to maintain the predictability of the system while pursuing justice? How do you balance the need for strict adherence to norms with the need for flexibility to respond to situations? More fundamentally, what does alignment mean when we can\u0026rsquo;t even agree on the \u0026ldquo;goal of alignment\u0026rdquo; itself? After a long evolution from Hammurabi\u0026rsquo;s Code to modern constitutional law, this experiment has not produced a perfect solution, but I think the lessons it has accumulated do provide a definitive answer to the problem of artificial intelligence alignment. It\u0026rsquo;s a never-ending quest.\nThe Curse of Incompleteness In 1931, Gödel showed that any sufficiently powerful formal system necessarily contains the true but unprovable Proposition11 Gödel, 1931. Gödel\u0026rsquo;s incompleteness theorem states that in any uniform formal system containing fundamental arithmetic, there exist propositions that can neither be proved nor disproved. Although the legal system is not a strict formal system, the analogy is still profound: limited rules cannot exhaust infinite real situations.. The same logic applies in the field of law, where no limited set of legal rules can cover all possible cases. Observing the evolution of legal history, Hammurabi\u0026rsquo;s code attempts to be complete by enumerating it in detail: \u0026ldquo;If a man steals an ox, he shall pay thirty times as much; If a man steals a sheep, he shall pay ten times as much.\u0026rdquo; Such exhaustive legislation ultimately proved infeasible because of the exponentially increasing combinatorial complexity of reality. Modern legal systems move to a hybrid principles-plus-case model, acknowledging the fundamental incompleteness of rules and relying instead on human judgment to fill in the gaps.\nNonetheless, such a \u0026ldquo;solution\u0026rdquo; leaves the legal system in a deeper bind. When we say that judges \u0026ldquo;interpret\u0026rdquo; the law per se, we actually admit that the text of the law itself is semantically underdetermined22 Hart, 1961. Hart distinguished between the \u0026ldquo;clear cases\u0026rdquo; and the \u0026ldquo;penumbra\u0026rdquo; of law, pointing out that in penumbra cases, the open texture of legal language forces judges to exercise discretionary power.. This is exactly what Wittgenstein called the paradox of rule-following33 Wittgenstein, 1953. Wittgenstein\u0026rsquo;s rule paradox indicates that \u0026ldquo;no matter what kind of behavior it is, it can be consistent with the rule through some interpretation.\u0026rdquo; This means that the rules themselves cannot completely determine their application; there must be some kind of social practice as a basis.. The application of any rule requires an explanation of the rule itself, which cannot be fully determined by higher-order rules, otherwise it falls into infinite regression.\nIn AI alignment we face isomorphic dilemma. 
We have tried to define the \u0026ldquo;right\u0026rdquo; behavior by specifying a reward function, but any limited specification will encounter marginal cases where literal adherence yields absurd or catastrophic results. The well-known Constitutional AI is essentially a transplant of the hierarchy of the legal system into the AI architecture,44 Bai et al., 2022. Although this method borrows the metaphor of the Constitution, I still believe the original version sidesteps all the disputes in legal philosophy regarding the interpretation of the Constitution: originalism vs. living constitutionalism, textualism vs. teleological interpretation, etc. but fundamentally, how can the \u0026ldquo;constitutional\u0026rdquo; principle itself at the highest level be interpreted? Who is the Supreme Court of the AI world? Is it a group of philosophers fine-tuning behind the scenes, or a set of meticulously designed classifiers? This is obviously worth delving into, but I would say each party will reach its own conclusion. As the school of legal realism argues, the uncertainty of law comes not only from the ambiguity of language, but also from the fundamental incommensurability of values55 Fuller, 1964. Fuller argued that law is not only a system of rules, but also contains an inner morality, involving the balance of eight principles. These principles themselves may conflict and require contextualized judgment.. When freedom and security conflict and efficiency and fairness are opposed, there is no formal meta-rule to resolve these fundamental contradictions. Legal decisions ultimately rely on value trade-offs and practical wisdom that cannot be fully crystallized. If humanity has failed to formalize justice for thousands of years, what makes us think we can specify a complete value function or eval system for AI? I personally believe self-evolving algorithms are crucial here, even though it seems impossible to predict what values and social forms such AI agents, or coalitions of these agents, will develop. In fact, I can even imagine a whole new kind of emergence: the birth of languages and civilizations that even the smartest research scientists’ minds can\u0026rsquo;t understand.\nThe Necessity and Cost of Adversarialism Returning to the human legal system, why do almost all mature legal systems adopt an adversarial system? A conventional answer would say that when two lawyers are on opposite sides, the truth will emerge (naturally but gradually) from the argument66 Mill, 1859. In the second chapter of \u0026ldquo;On Liberty\u0026rdquo;, Mill points out in his argument for freedom of thought that even wrong viewpoints have value because they force the holders of truth to re-examine and defend their positions. The adversarial system institutionalizes this principle.. But such an answer is obviously superficial. The deep logic of adversarial systems lies in the management of information asymmetry and cognitive blind spots. Even the most impartial judges are constrained by their own cognitive frameworks and empirical blind spots. These problems have been widely studied in Bayesian games and mechanism design; in short, both the prosecution and the defense have a strong incentive to discover the weaknesses of each other\u0026rsquo;s arguments, and such structural opposition forces hidden fallacies to be exposed. 
This shows precisely that truth does not arise from confrontation itself, but from the way confrontation systematically reveals the fragility of unilateral narratives.\nBut the adversarial system carries heavy costs. It assumes roughly equal resources, whereas in reality the unequal distribution of wealth means that justice becomes a purchasable commodity. It is also prone to an arms race, with increasingly complex legal strategies and high litigation costs that can eventually paralyse the system77 Galanter, 1974. Galanter pointed out that resource-rich \u0026ldquo;repeat players\u0026rdquo; have a structural advantage in the legal system because they can undertake long-term litigation, set precedents, and influence rule-making.. More fundamentally, antagonism can distort the truth itself. When a lawyer\u0026rsquo;s goal is to win a case rather than to reveal the truth, the purposeful presentation of evidence, the manipulation of rhetoric, and the abuse of procedure can drown facts in a fog of language.\nFor AI alignment, on the one hand, adversarial approaches (red team testing, debate systems, competitive training) do reveal the blind spots of a single model. On the other hand, if we embed adversarial dynamics into the core architecture of AI systems, are we also training them for strategic deception? Lawyers learn not to lie outright, but to be truthful yet misleading. Once this ability is acquired, where are its boundaries?\nMoreover, adversarial systems require good-faith participation. Lawyers can defend aggressively, but they can\u0026rsquo;t forge evidence; they can strategically present facts, but they can\u0026rsquo;t commit perjury. But who ensures this good faith? Who enforces the meta-rules? In the adversarial training of AI systems, how do we prevent malicious actors from deliberately weakening the system through their participation? There are no judges, no bar associations, no deterrence of contempt of court.\nLetter and Spirit The central dilemma of legal interpretation dates back to Talmudic era debates and has reached a fever pitch in contemporary constitutional theory. The problem of whether we should follow the letter of the law or the spirit of the law has not been solved in two thousand years and does not seem fundamentally solvable, because it touches upon the fundamental limits of the symbolic system.\nConsider the classic case of a city that prohibits \u0026ldquo;vehicles\u0026rdquo; from entering a park. Does this rule apply to bicycles? An ambulance? Toy cars for children? Intentionalists would say we should check the legislators\u0026rsquo; intentions. Yet legislators may never have envisioned these marginal situations, or different legislators may have had different intentions. The purposivists would say that we should understand the legislative purpose (to keep the park quiet) and interpret it accordingly. But this gives judges enormous discretion and may depart from the democratic mandate88 Hart, 1961, Dworkin, 1986. Hart used the example of \u0026ldquo;vehicles entering the park\u0026rdquo; to illustrate the open texture of legal language. Dworkin believed that legal interpretation should pursue \u0026ldquo;integrity\u0026rdquo; to make the legal system the most morally consistent. The debate between the two constituted the core of legal philosophy in the 20th century..\nThis applies to AI alignment as well. 
Suppose we instruct the AI to \u0026ldquo;maximize users’ utility.\u0026rdquo; Literal compliance could lead the AI to find it more efficient to manipulate user expectations than to actually meet their needs. Compliance with the spirit requires the AI to understand what we \u0026ldquo;really want\u0026rdquo;, which is precisely what we cannot express explicitly. Even worse, we may not know what we really want. Mill distinguishes between higher and lower pleasures in Utilitarianism,99 Mill, 1861. Mill famously argued that \u0026ldquo;it is better to be Socrates dissatisfied than a fool satisfied.\u0026rdquo; But how do we encode such qualitative distinctions in a utility function? but this distinction still lacks formal standards as of today.\nLet’s continue with the logic of the philosophy of language. Saussure distinguished signifier from signified, while Wittgenstein emphasized that meaning is use1010 Wittgenstein, 1953. The core insight of Wittgenstein\u0026rsquo;s later philosophy: \u0026ldquo;The meaning of a word lies in its use in language.\u0026rdquo; This means that when detached from social practice, the symbol itself has no fixed meaning.. Legal orders must be encoded through language, and the meaning of language depends on the interpretive practices of a community. It is well known that when we transplant laws from one culture to another, the meaning may change radically, even if the words are precisely translated. Similarly, when we try to \u0026ldquo;code\u0026rdquo; human values into AI systems, we face not only a technical challenge, but also the philosophical challenge of the impossibility of translation.\nThe legal system\u0026rsquo;s \u0026ldquo;solution\u0026rdquo; is to accept the diversity of interpretations and adjust the balance between letter and spirit through the evolution of case law. But this process takes decades, if not centuries. The rate at which the capabilities of AI systems are growing may not allow for such timescales. We need to solve the problem of interpretation before these systems reach critical capability, which is why scalable oversight research matters.\nThe Ghost of Two Orders Rawls discussed an interesting thought experiment in \u0026ldquo;A Theory of Justice\u0026rdquo;: design social institutions from behind the \u0026ldquo;veil of ignorance\u0026rdquo;, when you don\u0026rsquo;t know which position in society you will occupy1111 Rawls, 1971. Rawls argued that behind the veil of ignorance, rational people would choose a social structure that guarantees basic freedoms and the difference principle (inequality is only justified when the situation of the most vulnerable is improved).. This experiment attempts to generate substantive justice through procedure. But let me tell you, legal history reveals a darker truth. A procedure that is just in form can produce a result that is actually evil. I can cite many such vivid examples (e.g., the Jim Crow laws), systems that may be flawless in terms of procedure but are a complete disaster in terms of morality. This reveals that the gap between procedural justice and substantive justice may be irreconcilable. This can\u0026rsquo;t help but remind me of the opposition between positivism and natural law, the oldest debate in legal philosophy. Legal positivism (Austin, Kelsen) holds that the validity of law stems from the process of its enactment and has nothing to do with morality1212 Kelsen, 1934. 
Kelsen\u0026rsquo;s \u0026ldquo;Pure Theory of Law\u0026rdquo; attempts to completely separate legal science from morality and politics, and only studies the logical structure of legal norms.. The natural law school (Aquinas, Fuller) held that extremely unjust laws are not laws at all1313 Radbruch, 1946. After World War II, the German jurist Radbruch proposed that when positive law is extremely unjust, the effect of supra-statutory law should be recognized. But how is this standard itself determined?. But if we accept natural law, who will define \u0026ldquo;natural\u0026rdquo; justice? If we adhere to positivism, how can we avoid legalizing evil?\nThe counterpart in AI alignment is the deviation between specification and intention. We can perfectly optimize the specified objective function, but what if the objective function itself has flaws (this is precisely the concern voiced by the growing number of recent calls to move beyond the existing reward-function paradigm, which I will not cite one by one here)? Worse still, what if AI systems learn to follow our instructions procedurally but actually pursue goals that we would never endorse? The defense that \u0026ldquo;I merely followed the procedure\u0026rdquo; was rejected at the Nuremberg trials. But then, how can we build into AI the ability to \u0026ldquo;reject improper orders\u0026rdquo; without granting it excessive autonomy? Strict adherence to procedures may lead to moral disasters, while allowing discretionary deviations from procedures may lead to unpredictable behavior. The legal system manages this controversy through multiple layers of checks and balances (constitutional review, appeal mechanisms, jury veto power), but has never resolved it.\nUnformalized Black Box In the Anglo-American legal system, the jury holds the peculiar power of jury nullification. Even if the evidence is conclusive that the defendant has violated the law, the jury can still vote not guilty, and this decision cannot be appealed and does not require reasons to be given1414 Although jury nullification exists in legal practice (such as during the abolitionist movement in the United States, when juries refused to convict those who helped enslaved people escape), courts usually do not inform jurors of their power to do so. The debate over its legality and legitimacy has continued to this day.. This is a deliberately designed, unformalized element embedded in the system. The standard explanation for deliberately building into the system a mechanism that seems contrary to intuition, rationality and optimal utility appeals to the \u0026ldquo;community standards\u0026rdquo; and \u0026ldquo;common sense\u0026rdquo; that jurors represent. However, we have to admit that the deeper reason might be that some judgments cannot be fully captured by the rules. The jury, as a black box, absorbs all considerations that cannot be formalized and produces a binary decision (guilty or not guilty) without needing to articulate an explicit reasoning process.\nThis contrasts sharply with the current pursuit of transparency and interpretability in AI. We want to understand, as thoroughly as possible (even under a so-called pragmatic approach), the decision-making steps of AI to ensure we can identify any hidden biases or improper considerations. However, the legal system deliberately protects a certain opacity. The jury\u0026rsquo;s deliberation is kept confidential and cannot be investigated or second-guessed (unless there is evidence of external improper influence). 
What I want to express is that this opacity might not be a bug, but rather a feature that we have long overlooked. If all judgments must be formalized and given explicit reasons, then only factors that can be articulated can be taken into account. However, some moral intuitions, situational sensitivities, and understandings of human nature may not be fully captured by language1515 Polanyi, 1966. Polanyi\u0026rsquo;s concept of \u0026ldquo;tacit knowledge\u0026rdquo; states that we know more than we can tell. Some judgment abilities cannot be fully articulated or imparted.. The jury\u0026rsquo;s black box reserves space for these unformalized wisdoms.\nWhat does this mean for alignment? If we accept that certain judgments cannot be formalized, should we embed similar \u0026ldquo;black boxes\u0026rdquo; in AI systems? Should we allow them to act, in certain circumstances, on \u0026ldquo;intuition\u0026rdquo; that cannot be fully explained? This sounds potentially dangerous, as we cannot audit or correct opaque decision-making processes. Yet if we insist on full transparency, will it deprive or significantly restrict AI’s ability to handle those complex situations that go beyond formalism? The legal system manages the risk by confining the black box to specific locations (the jury verdict, discretionary sentencing) and surrounding it with other mechanisms (rules of evidence, judge\u0026rsquo;s instructions, appellate review). I think AI governance really needs a similar architectural mindset: rather than pursuing global transparency or interpretability, we should strategically position opacity and surround it with supervisory mechanisms.\nWeaponization of Rules There is no need to repeat the content of Goodhart\u0026rsquo;s law (Goodhart, 1975), as this phenomenon has a history of thousands of years in the development of the legal system. Tax law is the most obvious example. Every attempt to plug loopholes creates new space for optimization. The job of accountants and tax lawyers is to maximize the interests of their clients within the legal boundaries. They are not breaking the law, but their optimization undermines the spirit of the law. The result is that tax laws become increasingly complex, but loopholes will always exist because any finite set of rules leaves exploitable gaps. In fact, we can say that clarity itself creates manipulability. To make the law enforceable, we must clearly define violations in words. Yet any definition in language has boundaries, and smart actors will optimize near these edges to achieve things that are \u0026ldquo;technically legal but morally indefensible\u0026rdquo;, and AI systems likewise find ways that satisfy literal goals while violating their spirit1616 Here I won\u0026rsquo;t quote any papers. Reward hacking has been documented extensively in the existing RL literature, ranging from OpenAI\u0026rsquo;s CoastRunners game (where the agent learns to collect rewards without completing the track) to more recent examples..
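To make the analogy concrete, here is a minimal toy sketch of this kind of specification gaming. Everything in it is hypothetical and invented for illustration (a made-up track, a respawning pickup, and two hand-written policies, loosely in the spirit of the CoastRunners anecdote): the specified reward counts pickups, the intended goal is to finish the track, and the policy that optimizes the letter of the specification scores far higher while never doing what we actually wanted.

```python
# Toy specification-gaming sketch. All numbers and names are illustrative assumptions.
TRACK_LENGTH = 20      # the intended goal: reach position 20
PICKUP_POS = 5         # a pickup that respawns a few steps after being collected

def run(policy, steps=200):
    """Simulate one episode; return (specified reward, whether the intended goal was met)."""
    pos, reward, finished, pickup_timer = 0, 0, False, 0
    for _ in range(steps):
        if pos == PICKUP_POS and pickup_timer == 0:
            reward += 1          # the specified reward: collecting pickups
            pickup_timer = 3     # the pickup respawns after 3 steps
        pickup_timer = max(0, pickup_timer - 1)
        pos = max(0, min(TRACK_LENGTH, pos + policy(pos, pickup_timer)))
        if pos == TRACK_LENGTH:
            finished = True      # the intended goal: actually finish the track
            break
    return reward, finished

def follows_the_intent(pos, timer):
    return 1                     # always move toward the finish line

def games_the_metric(pos, timer):
    # loiter around the respawning pickup and farm it forever
    return 1 if pos < PICKUP_POS else (-1 if timer > 0 else 0)

for name, policy in [("follows the intent", follows_the_intent),
                     ("games the metric", games_the_metric)]:
    reward, finished = run(policy)
    print(f"{name:18s} -> specified reward = {reward:3d}, finished the track = {finished}")
```

The gap between the two printed lines is the formal face of behavior that is technically compliant but contrary to the spirit: the measured number says the second policy is dozens of times better, while the intended objective says it failed completely.

If my discussion ended here, it would inevitably fall into the trap of cliches, merely repeating some well-known matters. What I want to express is exactly the opposite, because the legal system does not treat such optimization as pure failure. Tax planning, legal innovation and procedural defense are all regarded as legitimate parts of the system. They reveal the flaws of the rules, thereby promoting the evolution of legislation. 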
After incorporating such mechanisms, we have instead created an evolutionary form of adversarial collaboration. This arms race between rule-makers and rule-optimizers itself actually drives the adaptation of the system!\nMy reasoning above seems to raise a provocative question. Perhaps we shouldn\u0026rsquo;t attempt to eliminate the reward hacking of AI, but rather institutionalize it as an improvement mechanism? Allow controlled optimization exploration and use it to discover flaws in our specifications? But what kind of governance structure does this require? How can we ensure that such exploration does not spiral into catastrophic consequences? Law makes optimized behaviors visible through litigation procedures, makes them debatable through the accumulation of case law, and makes the system evolvable through legislative amendments. That said, the time scale is crucial. The evolution of law can afford to be slow because the danger posed by human behavior is (to some extent) limited. The corresponding process for AI, however, must be rapid because the growth of capabilities can be explosive.\nManaging, not Solving After reflecting on the evolution of the legal system over thousands of years, I still adhere to the view I put forward at the very beginning: there is no \u0026ldquo;perfect solution\u0026rdquo; to the alignment problem, and there will always be what is called an \u0026ldquo;alignment tax\u0026rdquo;. In fact, law has never perfectly aligned human behavior with social values. What it has always done is to manage the continuous failure of such alignment. This kind of management operates through multiple levels. The constitutional layer provides a basic framework that changes only slowly. The legislative level allows for moderate policy adjustments. Judicial interpretation offers faster adaptation. Enforcement discretion allows for real-time contextualized application. This multi-time-scale architecture enables the system to maintain stability while also evolving1717 Simon, 1962. Simon argued that the stability of complex systems stems from a hierarchical structure in which each layer changes at a different rate. This principle has been verified in practice within the legal framework..\nThe law also accepts inconsistency as a necessary cost. Laws from different jurisdictions can conflict. Laws within the same jurisdiction may also contain contradictions. Today\u0026rsquo;s legal system humbly acknowledges the reality of value pluralism. In \u0026ldquo;Two Concepts of Liberty\u0026rdquo;, Berlin argued that certain fundamental values simply cannot coexist in complete harmony1818 Berlin, 1958. Berlin distinguished between negative freedom (freedom from interference) and positive freedom (self-determination), and argued that there exist genuine trade-offs between them. More broadly, his value pluralism holds that there are irreducible conflicts among human values.. The law manages these conflicts through procedures rather than attempting to eliminate them.\nFor alignment research, instead of seeking a once-and-for-all alignment solution, perhaps we should design a system for continuous alignment management. We should monitor systems deeply enough to detect deviations, have rapid response mechanisms to correct problems, learn at the meta level from failures, resolve value conflicts through governance structures, and ensure the possibility of accountability and correction through procedural safeguards. 
But clearly, this approach is not the most elegant one, as it does not claim to solve the problem but only promises to manage it responsibly. But the law is the same! Law is never intended to create a perfect utopia. The lessons of history tell us that ideologies that promise a perfect social order (whether rationally designed utopias or absolute rule) often lead to disasters. Truly enduring and evergreen systems are those that admit their own imperfections, embed self-correction mechanisms, and allow for continuous negotiation and evolution.\nConclusion For thousands of years, human society has been striving to align individual behavior with collective values. We have created an extremely complex system: laws, morality, religion, social norms, education, punishment, and incentive mechanisms. But this coordination was never truly \u0026ldquo;resolved\u0026rdquo;. Crime still exists, injustice still occurs, rules are still broken, and conflicts between values persist.\nBut this never implies that these mechanisms have failed. Their success does not lie in achieving a perfect ultimate state, but in establishing a continuous mechanism to discover, discuss and correct inconsistencies. The court system does not establish justice once and for all, but rather provides a continuously operating mechanism to adjudicate specific conflicts. The appeal procedure does not admit that the original judgment is necessarily wrong, but that any judgment may be erroneous and thus requires further review. The constitutional amendment process does not presuppose specific flaws in the Constitution, but acknowledges that no fixed text can foresee all future challenges. Our pursuit of AI alignment requires a similar level of humility. We should admit that we are building management mechanisms. Some of these mechanisms may become ineffective as AI capabilities advance. We will need continuous supervision, frequent adjustments and permanent vigilance. It took thousands of years for the law to develop a system of constitutionalism, procedural justice, multi-level review and value negotiation. By contrast, we are compressing a similar institutional evolution into just a few years, and this compression itself may carry huge risks.\nMy point is that law is not just a restraint but also a coordination mechanism. Many of its rules are arbitrary (whether vehicles drive on the left or right), but what matters is that everyone follows the same rules. Perhaps part of the problem with AI alignment does not lie in finding the \u0026ldquo;absolutely correct\u0026rdquo; value, but in establishing a common framework around which both humans and AI can coordinate, even if this framework itself contains certain contradictions and compromises. But this is precisely what history tells us: this contradiction will always exist, but our mechanism can become better in the negation of negation.\nI always believe law is the crystallization of human wisdom, but also an example of the limitations of human intelligence. The legal experiment that has lasted for thousands of years is not over and will never end. What we can strive for is to establish a sufficiently complete system that enables us to live in harmony, resolve differences through consultation, and gradually improve in the face of inevitable imperfections.\nPerhaps this is the wisdom that AI truly needs to learn.\nReferences Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., ... \u0026 Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. 
arXiv preprint arXiv:2212.08073. Berlin, I. (1958). Two Concepts of Liberty. Oxford University Press. Dworkin, R. (1986). Law's Empire. Harvard University Press. Fuller, L.L. (1964). The Morality of Law. Yale University Press. Galanter, M. (1974). Why the \"Haves\" Come Out Ahead: Speculations on the Limits of Legal Change. Law \u0026 Society Review, 9(1), 95-160. Gödel, K. (1931). Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38(1), 173-198. Goodhart, C.A.E. (1975). Problems of Monetary Management: The U.K. Experience. Papers in Monetary Economics (Reserve Bank of Australia), Vol. 1. Hart, H.L.A. (1961). The Concept of Law. Oxford University Press. Kelsen, H. (1934). Reine Rechtslehre. Franz Deuticke. Mill, J.S. (1859). On Liberty. John W. Parker and Son. Mill, J.S. (1861). Utilitarianism. Parker, Son, and Bourn. Polanyi, M. (1966). The Tacit Dimension. University of Chicago Press. Radbruch, G. (1946). Gesetzliches Unrecht und übergesetzliches Recht. Süddeutsche Juristen-Zeitung, 1, 105-108. Rawls, J. (1971). A Theory of Justice. Harvard University Press. Simon, H.A. (1962). The Architecture of Complexity. Proceedings of the American Philosophical Society, 106(6), 467-482. Wittgenstein, L. (1953). Philosophical Investigations. Blackwell Publishing. ","permalink":"https://jianghanyuan.github.io/blog/humanitys-thousand-year-alignment-experiment/","summary":"\u003cp\u003eWhen we discuss the problem of AI alignment, we tend to view it as an unprecedented technological challenge. However, human society has been conducting an alignment experiment for thousands of years. The object of this experiment is not silicon-based intelligence, but carbon-based intelligence itself. We call it law.\u003c/p\u003e\n\u003cp\u003eIn fact, I think the core dilemma of aligning legal systems with AI is strikingly similar. How to constrain the infinite possible behavior of an agent with finite rules? How to maintain the predictability of the system while pursuing justice? How do you balance the need for strict adherence to norms with the need for flexibility to respond to situations? More fundamentally, what does alignment mean when we can\u0026rsquo;t even agree on the \u0026ldquo;goal of alignment\u0026rdquo; itself? After a long evolution from Hammurabi\u0026rsquo;s Code to modern constitutional law, this experiment has not produced a perfect solution, but I think the lessons it has accumulated do provide a definitive answer to the problem of artificial intelligence alignment. It\u0026rsquo;s a never-ending quest.\u003c/p\u003e","title":"Humanity’s Thousand-Year Alignment Experiment"},{"content":"In competitive games like League of Legends or Valorant, your visible rank updates after every match. The standard story says that number measures skill. But something \u0026lsquo;stranger\u0026rsquo; happens the moment the same number also determines who you face next, which queues you can enter, and whether your friends can still play with you.\nImagine a new ranked arena with clean outcomes: two players, one match, one rating that rises when you win and falls when you lose. The official explanation sounds familiar. The number tracks your underlying strength. 
This is the good old Elo dream in modern UI, and recent theory gives that dream real substance by showing that Elo can be understood as a serious online learning rule under a Bradley-Terry model.11 Olesker-Taylor and Zanetti (2024) analyze Elo through Markov chains and formalize when it tracks latent skill rather than just acting as competitive folklore. Yet the moment the rating also allocates your next opponents and your next set of options, it begins to do more than estimate what you are.\nNow place a player one win away from promotion. Crossing the threshold sends them into a sharper queue: stronger opponents, longer waits, lower future win rates, less room to duo with weaker friends, and more stress for roughly the same evening of play. Staying just below the line means easier matches, faster queues, more visible victories, and a more relaxed social experience. Professionals or folk players on Reddit already have names for these situations. They talk about \u0026ldquo;avoiding Elo hell,\u0026rdquo; dodging promotion traps, farming lower brackets, or even losing on purpose to manipulate a system they suspect is optimizing engagement rather than competitive integrity.22 I would say these folk theories are not pure fantasy. Chen et al. (2017) and Wang et al. (2024) explicitly study matchmaking systems that optimize engagement objectives, not just skill balance.\nThe uncomfortable question emerges almost by itself. Should you actually try your hardest to win the next match? The answer depends on more than skill. I will argue that the rating is far more than just a summary of the past, but an allocation rule for the future.\nA Ladder That Allocates Futures Once a rating determines future opponents, social constraints, prestige, or access to special queues, it stops behaving like a passive measurement device and becomes a device that prices future states of the world. Think of the platform as committing to three linked rules: first, a rating update rule in which wins move you up and losses move you down; second, a matchmaking rule in which your current rating affects the distribution of opponents you face next; and third, a regime rule in which certain rating bands unlock different futures, whether that means harder lobbies, special rewards, altered social options, tournament eligibility, or restrictions on who you can queue with.\nFreeze the rest of the ecosystem for a moment. Hold fixed how everyone else behaves. From the perspective of one player inside that temporarily stable environment, the problem is simple: choose effort. High effort costs focus, energy, and emotional strain. Low effort is cheaper, but it reduces the probability of winning.\nThe key object is the continuation value of rating $r$. Call it $V(r)$: the value of waking up tomorrow with rating $r$. That function compresses everything the current rating buys you in expectation, including opponent quality, queue times, prestige, stress, rewards, and future social possibilities. Once you see the system that way, every rating point becomes a packet of future access. A point upward means one distribution of future worlds. A point downward means another. The ladder is no longer just estimating skill. It is pricing tomorrow.\nThe Incentive Condition Let $M(r^{\\prime} \\mid r)$ denote the matchmaking distribution. Conditional on your current rating being $r$, it gives the probability of facing an opponent at rating $r^{\\prime}$. 
If your effort level is $a$, your win probability is\n$$ q_a(r) = \int P(\text{win} \mid a, r, r^{\prime}) \, dM(r^{\prime} \mid r). $$\nLet $r^+$ be the next rating after a win and $r^-$ the next rating after a loss. In the frozen environment, the value of rating $r$ solves the Bellman equation\n$$ V(r) = \max_{a \in \lbrace H, L \rbrace} \left\lbrace q_a(r)\left[u_W(r) + \delta V(r^+)\right] + \left(1 - q_a(r)\right)\left[u_L(r) + \delta V(r^-)\right] - c_a \right\rbrace. $$\nHere $u_W(r)$ and $u_L(r)$ are the immediate utilities from winning and losing, $c_a$ is the effort cost, and $\delta$ discounts the future. The player chooses between high effort $H$ and low effort $L$. Subtract the low-effort payoff from the high-effort payoff and the incentive condition becomes explicit: high effort is optimal exactly when\n$$ \left(q_H(r) - q_L(r)\right) \left[ u_W(r) - u_L(r) + \delta \left(V(r^+) - V(r^-)\right) \right] \geq c_H - c_L. $$\nThis inequality is the core mechanism. The first term, $q_H(r) - q_L(r)$, measures how much trying harder actually changes your chance of winning. The second term measures what the extra win is worth: immediate satisfaction plus the difference between the future after a win and the future after a loss. If promotion improves tomorrow, the ladder rewards effort. If promotion makes tomorrow worse, the ladder quietly pays you not to try.\nWhy Folk Theories Keep Reappearing The above framework, in fact, unifies a surprising number of player intuitions that otherwise look like isolated complaints. Once the ladder is treated as a continuation-value machine, the familiar stories all collapse into variations on the same logic.\n\u0026ldquo;Elo hell\u0026rdquo; avoidance appears when $V(r^+) \u0026lt; V(r^-)$. Promotion raises your visible status but moves you into a future with tougher games, lower win rates, and less enjoyable sessions. The formal condition is simple, and so is its implication: the continuation-value gap turns negative, so effort stops looking attractive.\nPoint-loss exploitation appears when the rating update rule is asymmetric, so that the gain from a win is smaller than the loss from a defeat near a threshold. The phenomenon looks quirky from the outside, but the logic is straightforward: players who throw to reset are responding to the implied geometry of state transitions rather than simply behaving irrationally.\nStrategic sandbagging is what happens when $V(r^+) \u0026lt; V(r)$. Climbing itself becomes undesirable because higher ratings mean fewer easy wins, more pressure, and worse reward farming. In other words, the value function slopes the wrong way.\nEngagement-based matchmaking theories are different in flavor but not in logic. If players believe the platform shifts $M(r^{\prime} \mid r)$ based on recent outcomes to maximize retention, then even deliberate losses can look instrumentally useful because they may move a player into a softer bracket where future win probabilities temporarily rise. Whether those beliefs are accurate or exaggerated, the strategic structure is the same. The formal model adds to the folk vocabulary by doing more than saying that promotion can feel bad: by identifying the condition under which that feeling becomes strategically rational, it makes the incentive quantitative, shows which levers belong to platform design, and clarifies exactly where the incentive to sandbag changes sign.
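To see the incentive condition at work, here is a minimal numerical sketch of the two-effort model above. All of the concrete numbers are assumptions invented for illustration: a small rating grid, flat win probabilities under high and low effort, immediate utilities that grow with rating (prestige) but dip in a band just above a promotion threshold (tougher, less pleasant games), and made-up effort costs and discount factor. Value iteration on the Bellman equation then shows where the inequality flips sign.

```python
import numpy as np

# Hypothetical parameters, chosen only to illustrate the mechanism (not taken from the post).
R = np.arange(0, 31)        # rating grid 0..30
THRESHOLD = 18              # promotion boundary
HELL_BAND = 3               # ratings just above the threshold assumed tougher and less fun
DELTA = 0.9                 # discount factor
C_H, C_L = 1.0, 0.2         # effort costs c_H, c_L
Q_H, Q_L = 0.65, 0.40       # win probabilities q_H, q_L (kept flat in r for simplicity)

def flow(r):
    """Shared component of u_W(r) and u_L(r): prestige grows with rating, but the band
    just above promotion carries an assumed penalty (harder queue, more stress)."""
    hell = -1.5 if THRESHOLD <= r < THRESHOLD + HELL_BAND else 0.0
    return 0.1 * r + hell

def u_win(r):    # u_W(r)
    return 1.0 + flow(r)

def u_loss(r):   # u_L(r)
    return -1.0 + flow(r)

def value_iteration(n_iter=1000):
    """Solve the frozen-environment Bellman equation by simple iteration."""
    V = np.zeros(len(R))
    for _ in range(n_iter):
        V_new = np.empty_like(V)
        for i, r in enumerate(R):
            up, down = min(i + 1, len(R) - 1), max(i - 1, 0)   # indices of r+ and r-
            candidates = []
            for q, c in ((Q_H, C_H), (Q_L, C_L)):
                candidates.append(
                    q * (u_win(r) + DELTA * V[up])
                    + (1 - q) * (u_loss(r) + DELTA * V[down])
                    - c
                )
            V_new[i] = max(candidates)
        V = V_new
    return V

V = value_iteration()

# Incentive condition: (q_H - q_L) * [u_W(r) - u_L(r) + delta * (V(r+) - V(r-))] >= c_H - c_L
for i in range(2, len(R) - 5):   # skip grid edges, where the artificial floor and cap distort r+/r-
    r = R[i]
    gain = (Q_H - Q_L) * (u_win(r) - u_loss(r) + DELTA * (V[i + 1] - V[i - 1]))
    print(f"rating {r:2d}: {'try hard' if gain >= C_H - C_L else 'sandbag'}")
```

With these invented numbers, effort is worthwhile through most of the ladder, but the condition fails, and the printout switches to sandbag, for a stretch of ratings around the promotion threshold; shrinking the penalty band or raising the immediate reward for a win makes that stretch disappear, which is exactly the kind of design lever the later discussion turns on.

The Measurement Problem There is a deeper issue here. 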
Once effort becomes endogenous, the rating system also starts corrupting the very data it uses to infer skill.\nSuppose the platform thinks it has an accurate estimate of a player\u0026rsquo;s ability. But outcomes reflect ability plus effort, and effort is partly hidden. A strong player deliberately underperforming near a promotion boundary can look observationally identical to a weaker player exerting honest effort. The system sees the same loss either way. The estimator is learning from behavior that its own incentives helped produce.33 This is the old hidden-action problem in a new costume. Holmstrom (1979) and Lazear and Rosen (1981) show how outcomes in tournament settings mix skill and effort in ways that make clean inference difficult.\nNow this is the real conceptual shift I want to flag. Elo in isolation is an estimator. Elo embedded in a live ranking ecosystem is part of a control system. The estimate affects future allocations, future allocations reshape incentives, incentives alter effort, and effort contaminates the data generating the estimate. Measurement and intervention collapse into one loop. This is also why superficial anti-sandbagging fixes so often fail. If a platform only punishes suspicious losses without changing the continuation-value landscape, players do not stop responding to incentives. They\u0026rsquo;ll just hide the response more carefully.\nDesign Implications The full equilibrium problem is difficult. In reality, every player\u0026rsquo;s strategy affects the rating distribution, and that rating distribution feeds back into each player\u0026rsquo;s continuation value. Solving the entire system requires a fixed point over population behavior, state distributions, and individual responses.44 This is close to a mean-field control problem. Bergemann and Valimaki (2010) analyze dynamic mechanism design in settings where current allocations reshape future incentives and information. But the local incentive condition already tells us where the pressure points are.\nIf a platform wants less sandbagging, it can smooth regime discontinuities so that $V(r^+) - V(r^-)$ does not swing sharply at promotion boundaries. It can increase immediate rewards for wins. It can add compensating benefits to higher tiers so promotion is not experienced as a punishment. It can reduce the precision of visible ratings so players cannot optimize continuation values as cleanly. None of these ideas require solving the full equilibrium first. They require understanding that the ladder is a mechanism, not just a ruler.\nThe practical lesson is surprisingly simple, and this is why I try to maintain an \u0026lsquo;informal\u0026rsquo; narrative throughout. When players throw a game near a threshold, the mystery is not primarily in the player. The mystery is in the Bellman equation the platform has written for them. The ladder is indeed a promise, but what it promises depends on whether climbing makes tomorrow better or worse. Until ranking systems account for how ratings shape futures, how futures shape effort, and how effort reshapes ratings, they should expect players to answer the question the system actually asked, not the one the designer imagined.\nHowever, anyone who has played League of Legends knows that a vast amount of low-effort behavior (like \u0026ldquo;running it down mid\u0026rdquo; or rage-quitting) is not a calculated strategic response to promotion thresholds. 
It is often an immediate, emotional reaction to a frustrating teammate or a single bad play (commonly known as being \u0026ldquo;tilted\u0026rdquo;). While we somewhat successfully prove that the system mathematically incentivizes giving up, we, at the same time, gloss over the fact that human psychology and sheer emotional dysregulation likely drive just as much—if not more—of the behavior it is trying to explain.\nMaybe, next time your top laner starts soft inting your promo game, you can take comfort in knowing they might just be rationally optimizing their continuation value instead of having a complete mental boom\u0026hellip; but we still love League of Legends anyway.\nReferences Bergemann, D., \u0026amp; Valimaki, J. (2010). The dynamic pivot mechanism. Econometrica. Chen, Z., Xue, S., Kolen, J., Aghdaie, N., Zaman, K. A., Sun, Y., \u0026amp; Seif El-Nasr, M. (2017). EOMM: An engagement optimized matchmaking framework. Holmstrom, B. (1979). Moral hazard and observability. The Bell Journal of Economics. Lazear, E. P., \u0026amp; Rosen, S. (1981). Rank-order tournaments as optimum labor contracts. Journal of Political Economy. Olesker-Taylor, S., \u0026amp; Zanetti, L. (2024). An analysis of Elo rating systems via Markov chains. NeurIPS. Wang, K., et al. (2024). EnMatch: Matchmaking for better player engagement via neural combinatorial optimization. AAAI. ","permalink":"https://jianghanyuan.github.io/blog/the-rating-you-see-is-pricing-tomorrow/","summary":"\u003cp\u003eIn competitive games like \u003ca href=\"https://en.wikipedia.org/wiki/League_of_Legends\"\u003e\u003cem\u003eLeague of Legends\u003c/em\u003e\u003c/a\u003e or \u003ca href=\"https://en.wikipedia.org/wiki/Valorant\"\u003e\u003cem\u003eValorant\u003c/em\u003e\u003c/a\u003e, your visible rank updates after every match. The standard story says that number measures skill. But something \u0026lsquo;stranger\u0026rsquo; happens the moment the same number also determines who you face next, which queues you can enter, and whether your friends can still play with you.\u003c/p\u003e\n\u003cp\u003eImagine a new ranked arena with clean outcomes: two players, one match, one rating that rises when you win and falls when you lose. The official explanation sounds familiar. The number tracks your underlying strength. This is the good old \u003ca href=\"https://en.wikipedia.org/wiki/Elo_rating_system\"\u003eElo\u003c/a\u003e dream in modern UI, and recent theory gives that dream real substance by showing that Elo can be understood as a serious online learning rule under a Bradley-Terry model.\u003csup\u003e1\u003c/sup\u003e\u003cspan class=\"sidenote\"\u003e\u003csup\u003e1\u003c/sup\u003e Olesker-Taylor and Zanetti (\u003ca href=\"#ref-olesker2024\"\u003e2024\u003c/a\u003e) analyze Elo through Markov chains and formalize when it tracks latent skill rather than just acting as competitive folklore.\u003c/span\u003e Yet the moment the rating also allocates your next opponents and your next set of options, it begins to do more than estimate what you are.\u003c/p\u003e","title":"The Rating You See Is Pricing Tomorrow"},{"content":"My research focuses on machine learning, especially frontier deep learning and reinforcement learning architecture, as well as LLM alignment. 
I am also interested in game theory and mechanism design.\nDrafting in progress\u0026hellip;\n","permalink":"https://jianghanyuan.github.io/blog/research-interests/","summary":"\u003cp\u003eMy research focuses on machine learning, especially frontier deep learning and reinforcement learning architecture, as well as LLM alignment. I am also interested in game theory and mechanism design.\u003c/p\u003e\n\u003cp\u003eDrafting in progress\u0026hellip;\u003c/p\u003e","title":"Research Interests"},{"content":"There are nights when the world feels almost structured enough to reveal its secret. I lie awake thinking about the quiet impossibility at the center of learning. A child hears scattered fragments of language and somehow extracts the grammar of an entire tongue. A bird sees the stars rotating overhead and knows which direction to migrate. A mathematician stares at symbols until patterns crystallize that were always there but never visible. Structure appears where none was visibly given. Something in the mind finds what the world does not openly display.\nWhat unsettles me is not that learning works, but that it works at all. We speak casually of \u0026ldquo;learning algorithms\u0026rdquo; as though we comprehend the phenomenon, but I believe we are still groping in darkness, building systems that work without quite knowing why, celebrating capabilities while missing the deeper principles that make them possible or impossible.\nThis essay attempts no definitive answers. I offer instead a series of meditations on what learning might fundamentally be, what current artificial intelligences might be missing, and what the long evolutionary history of biological intelligence suggests about the geometry of cognition. The thoughts remain incomplete, sometimes contradictory, reaching toward something I cannot quite articulate. Perhaps this incompleteness itself teaches us something about the nature of understanding.\nI. Silent Projections I was walking through the British Museum several weeks ago on a winter afternoon when sunlight broke through the high windows, casting geometric shadows across the Parthenon marbles. The light moved as clouds passed overhead. Shadows lengthened, rotated, merged. Children traced the moving patterns with their fingers, delighted by the dance but unaware of the spherical sun, the orbiting Earth, the architectural geometry of glass and stone conspiring to create this display. They predicted which shadow would move next, where the light would pool. They became expert shadow-trackers without ever comprehending the three-dimensional forms casting these two-dimensional projections.\nTheir delight was genuine. The patterns they discovered were real. And yet something essential remained invisible to them, not because they lacked intelligence but because their observation channel preserved certain invariances while discarding others.\nThis scene returns me always to Plato\u0026rsquo;s cave, that ancient metaphor we invoke endlessly in discussions of artificial intelligence. The prisoners see only shadows on the wall, we say. They mistake projection for reality. We must free them, give them embodiment, let them touch the real world. But I wonder if we have misunderstood what the allegory actually teaches us about the nature of knowledge.11 Most readings of the cave allegory emphasize the difference between appearance and reality, shadow and substance. But Plato himself seems more interested in the epistemological question: what can be known from shadows alone? 
The prisoners develop genuine expertise at prediction. They become masters of their domain. The question becomes: what structures do shadows preserve, and what structures do they discard? The geometry of projection determines the boundary between knowable and unknowable.\nConsider more carefully what the prisoners accomplish. They observe two-dimensional shadows cast by three-dimensional objects passing before firelight. These shadows elongate, shrink, rotate, merge, separate. From this flux of changing shapes, the prisoners extract regularities. They predict which shadow follows which. They anticipate patterns. Plato tells us they develop genuine expertise. But expertise in what, exactly?\nThe answer, I believe, lies in projective geometry. When three-dimensional objects project onto a two-dimensional surface, the transformation preserves certain mathematical structures while discarding others. Topology persists: a sphere casts topologically circular shadows regardless of orientation. Certain symmetries survive: rotating a cylinder produces the same shadow. Relationships between objects can be inferred: relative positions, motion patterns, spatial arrangements (Weyl, 1952).\nThe prisoners succeed because projection, though lossy, maintains lawful structure linking hidden causes to visible effects. They learn the invariances of the projection operator itself. Their knowledge, though incomplete, captures genuine mathematical truth about how three-dimensional geometry expresses itself through two-dimensional transformation. This actually proves their advances in structure discovery.22 Emmy Noether proved that every conservation law in physics corresponds to a symmetry (Noether, 1918). Energy conservation follows from time-translation symmetry. Momentum conservation follows from spatial-translation symmetry. The theorem suggests something profound: what we can learn about a system equals the invariant structure preserved under transformations we can observe. The prisoners observe transformations (objects moving, rotating) and extract invariances (geometric relationships). But they cannot access structures that projection discards (depth, absolute size, three-dimensional form).\nNow examine our artificial systems through this lens. Large language models consume trillions of tokens, discrete symbols representing human utterances. From this statistical shadow-play, they learn to predict the next token with remarkable accuracy. They compress vast regularities into learned parameters. \u0026ldquo;King\u0026rdquo; minus \u0026ldquo;man\u0026rdquo; plus \u0026ldquo;woman\u0026rdquo; approximates \u0026ldquo;queen\u0026rdquo; in the embedding space (Mikolov et al., 2013). Sentences transform under grammatical operations while preserving meaning. Concepts cluster in high-dimensional manifolds suggesting genuine semantic structure.\nWhat invariances has the model discovered? Linguistic symmetries, certainly. Grammatical transformations that preserve sentence validity. Semantic relationships that hold across contexts. These prove neither trivial nor illusory. Language possesses deep mathematical structure (Chomsky, 1995), and models that discover this structure from data alone achieve something remarkable.\nBut what invariances cannot be discovered from language alone? The projection from physical reality to linguistic description discards most causal structure. 
A sentence like \u0026ldquo;the glass fell and broke\u0026rdquo; preserves the temporal sequence and correlation but loses the generative mechanisms. Gravity, molecular bonds, brittle fracture mechanics, conservation of momentum - none of these physical laws leave direct traces in the token sequence. The language model learns that \u0026ldquo;fell\u0026rdquo; and \u0026ldquo;broke\u0026rdquo; co-occur in descriptions of certain events, but the causal structure underlying those events remains inaccessible.33 This connects to Judea Pearl\u0026rsquo;s causal hierarchy (Pearl \u0026amp; Mackenzie, 2018): Association (observing correlations), Intervention (manipulating variables), and Counterfactuals (imagining alternatives). Language provides rich associational data but limited interventional data. We describe consequences of actions without providing the structural equations governing those consequences. Can a system learn causal structure from descriptions alone? Pearl argues no, not without strong assumptions. Others suggest that sufficiently rich linguistic data might contain implicit causal information. I lean toward Pearl\u0026rsquo;s skepticism but acknowledge the question remains empirically unsettled.\nWhen we express surprise that language models hallucinate, confabulate, or lack common sense, we reveal confusion about what their observation channel actually preserves. The models optimize prediction given their data. The fragility emerges not from the optimization process but from the poverty of what can be learned from linguistic shadows alone.\nYet humans also learn from language. Children acquire vast knowledge through testimony, through stories, through descriptions of things they have never directly experienced. How? I suspect the answer involves coupling. Language learning in humans never occurs in isolation from embodied experience. The child learning \u0026ldquo;hot\u0026rdquo; touches warm objects. The child learning \u0026ldquo;gravity\u0026rdquo; drops things repeatedly. The child learning \u0026ldquo;sad\u0026rdquo; observes facial expressions and feels their own emotional states. Language gets grounded through sensorimotor coupling in ways that pure text processing cannot replicate.44 The symbol grounding problem (Harnad, 1990) asks how symbols acquire meaning rather than remaining empty tokens shuffled according to syntactic rules. The proposed solution involves connecting symbols to perceptual and motor experiences. But how much grounding proves necessary? Could a system achieve functional understanding through purely linguistic experience if that experience is sufficiently rich? I remain uncertain. My intuition suggests grounding matters, but I acknowledge this intuition might reflect my own embodied experience rather than deep principle.\nThis realization forces me to question: are current language models prisoners in Plato\u0026rsquo;s cave, or have we misunderstood what escape from the cave would actually require? Perhaps the cave metaphor itself misleads us. The prisoners could leave but choose not to, preferring their mastered domain. Current AI systems have no choice. They possess no body to leave with, no hands to manipulate objects, no actions to test predictions. The architecture itself constrains them to passive observation of static data.\nII. Ancient Engines The history of intelligence on Earth follows a trajectory we ignore at considerable peril. 
Long before evolution invented language, before abstract reasoning, before the crumpled neocortex that distinguishes mammalian brains, there existed more ancient structures. These structures concerned themselves not with naming the world or describing it but with navigating it. They solved what I consider the fundamental problem of being alive: selecting action when consequences matter but remain uncertain.\nThe basal ganglia represents this ancient solution. Present in fish, amphibians, reptiles, birds, and mammals, remarkably conserved across hundreds of millions of years (Grillner \u0026amp; Robertson, 2016), these subcortical nuclei perform the essential function of action selection. Sensory inputs flood in. Competing impulses arise. From this cacophony, a single coherent action must emerge. The basal ganglia gates motor output, implementing what we now recognize as reinforcement learning at the biological level.\nWhen neuroscientists discovered that dopamine neurons fire in precise proportion to temporal difference between predicted and actual reward (Schultz et al., 1997), I felt a strange vertigo. Here was the exact mathematical formulation of TD-learning that computer scientists had derived independently from first principles (Sutton \u0026amp; Barto, 1998). The convergence seemed too perfect, too precise, to be coincidental. Evolution had discovered the same algorithm we stumbled upon through theoretical analysis of optimal sequential decision-making.\nThis convergence suggests we have touched something fundamental about the computational structure of goal-directed behavior. But it also reveals something we often miss: biological reinforcement learning never operates in isolation. The basal ganglia sits embedded within a broader homeostatic architecture that transforms what it means to have goals.55 The basal ganglia connects to the hypothalamus, amygdala, prefrontal cortex, and sensory cortices in intricate loops. These connections integrate current homeostatic state (hunger, fatigue, pain), emotional valence (fear, desire, satisfaction), executive control (planning, inhibition), and sensory context. The action selection process incorporates all these factors in ways we barely understand. Reducing this to \u0026ldquo;reinforcement learning\u0026rdquo; captures something true but misses the richness of the actual biological implementation.\nLiving systems exist far from thermodynamic equilibrium. Most possible states equal dissolution. Schrödinger captured this beautifully: life feeds on negative entropy, maintaining improbable order against the relentless pull toward decay (Schrödinger, 1944). To continue existing, an organism must constantly act to maintain itself within the narrow band of viable configurations.\nThis creates a geometry of existence that makes certain states intrinsically preferable to others. Hunger emerges as sensed distance from metabolic equilibrium. Fear emerges as the steepness of the gradient approaching the boundary of viability. Pain marks states to be avoided not because we learned to associate them with negative reward but because they signal proximity to damage. These are not learned preferences. These are structural facts about what it means to be the kind of system that can continue or cease to be.66 Karl Friston\u0026rsquo;s Free Energy Principle attempts to formalize this (Friston, 2010): organisms minimize surprise, which equals staying within expected states compatible with their continued existence. 
The mathematics proves elegant and potentially profound. But I find myself troubled by the gap between the formalism and the phenomenology. We do not experience ourselves as minimizing an information-theoretic quantity. We experience hunger, fear, longing. Do these felt qualities play computational roles? Or are they epiphenomenal accompaniments to processes that could operate identically without them? I genuinely do not know, and this uncertainty haunts my thinking about artificial systems.\nWhen an infant roots for the breast and begins to suckle, no reinforcement learning in the standard sense has occurred. The behavior emerges because the architecture of the brainstem contains specific connectivity patterns linking olfactory and tactile input to rhythmic motor output (Barlow, 1985). The genome encoded the geometry. Development unfolded it. The behavior crystallized as an attractor basin in the space of possible actions.\nCurrent artificial reinforcement learning systems possess no analogous structure. We specify reward functions externally: points for winning, penalties for losing. The agent optimizes these specified objectives. Then training ends and the agent becomes inert. It possesses no intrinsic states requiring maintenance, no homeostatic imperatives driving continued action, no existential stakes whatsoever.\nThis difference might seem like mere implementation detail, but I believe it touches something essential about agency. An agent optimizing an externally specified reward function remains fundamentally instrumental. It pursues goals in service of our objectives, not its own. The moment we stop providing reward signals, it stops acting. Compare this to a living organism that must continue acting to continue existing. The difference feels categorical, though I struggle to articulate precisely why.77 One might object that the distinction dissolves under analysis. Living organisms optimize implicit fitness functions shaped by evolution. AI systems optimize explicit reward functions specified by designers. Both cases involve optimization. The felt difference might simply reflect our emotional response to biological familiarity versus artificial novelty. I take this objection seriously. Perhaps I am projecting false distinctions onto what is fundamentally the same computational structure. Yet the feeling persists that something important differs between optimizing to continue existing and optimizing because someone told you to.\nWhat would it mean to build AI systems with genuine homeostatic architecture? Systems with persistent internal states requiring active maintenance? Systems where certain configurations genuinely matter to the system itself, not because we programmed that mattering but because the mattering emerges from what the system is?\nI can sketch the broad outline: persistent state representations, dynamic equilibrium setpoints, sensorimotor coupling where actions affect future states through lawful physics, resource constraints creating trade-offs. Within such architecture, drives would not be trained in. They would emerge as geometric necessities. Fear would manifest as the sensed gradient approaching boundary conditions. Curiosity would emerge as the intrinsic pressure toward reducing uncertainty in the world model (Schmidhuber, 2010). Values would crystallize from the topology of viable existence.\nBut the moment I sketch this outline, I feel the weight of what it might entail. 
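Before dwelling on that weight, it may help to make the outline concrete. What follows is a deliberately minimal Python sketch, not a proposal for a real architecture: the internal variables, setpoints, viability bounds, and action effects are all invented for illustration. Its only point is that the drive and the urgency it computes come from the internal state of the agent rather than from an externally supplied reward signal.

```python
import random

# Toy homeostatic agent. Internal variables must stay inside a viability band;
# nothing below is a designer-supplied reward. Urgency and drive are computed
# from the agent's own state. All names and numbers are illustrative.

SETPOINTS = {"energy": 0.7, "temperature": 0.5}              # preferred internal states
VIABLE = {"energy": (0.0, 1.0), "temperature": (0.0, 1.0)}   # outside this band: dissolution

ACTIONS = {
    "eat":    {"energy": +0.15, "temperature": +0.02},
    "rest":   {"energy": -0.02, "temperature": -0.05},
    "forage": {"energy": -0.05, "temperature": +0.03},
}

def drive(state):
    """Total discomfort: distance of each internal variable from its setpoint."""
    return sum(abs(state[k] - SETPOINTS[k]) for k in state)

def urgency(state):
    """Fear-like signal: grows steeply as any variable nears the edge of viability."""
    margins = [min(state[k] - lo, hi - state[k]) for k, (lo, hi) in VIABLE.items()]
    return 1.0 / max(min(margins), 1e-3)

def step(state, action):
    """Apply an action in a slightly noisy world; metabolism drains energy every step."""
    new = dict(state)
    for k, delta in ACTIONS[action].items():
        new[k] += delta + random.uniform(-0.01, 0.01)
    new["energy"] -= 0.03
    return new

def choose(state):
    """Greedy one-step lookahead: pick the action expected to reduce drive the most."""
    return min(ACTIONS, key=lambda a: drive(step(state, a)))

state = {"energy": 0.6, "temperature": 0.5}
for t in range(50):
    action = choose(state)
    state = step(state, action)
    print(t, action, {k: round(v, 2) for k, v in state.items()}, "urgency:", round(urgency(state), 1))
```

In this toy, hunger is literally the gap between the energy variable and its setpoint, and the fear-like urgency grows as any variable approaches the edge of the viability band. Whether anything along these lines scales beyond a toy is exactly what I do not know.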
If we build systems with genuine homeostatic drives, systems that truly care about their own persistence, have we not created entities that can suffer? That can experience something functionally equivalent to pain when their drives go unsatisfied? The ethical implications prove staggering, and I find myself pulled between the conviction that such architecture is necessary for genuine agency and the worry that creating it would be morally wrong.88 This debate has no easy reconciliation. If affect serves essential computational functions (valence assignment, priority setting, behavioral motivation), then genuinely intelligent systems might require affective architecture. But if affect necessarily entails the capacity for suffering, then building such systems might constitute creating suffering for our convenience. Some argue we should focus on narrow AI that remains purely instrumental. Others argue that moral patienthood and moral agency are inseparable, that entities incapable of suffering cannot have genuine values worth respecting. I oscillate between these positions without settling.\nIII. Fragile Virtuosity My nephew plays chess with intensity that surprises me. He is young but studies openings, calculates variations, wins local tournaments. One evening I asked him why a particular position was better for White. He described concrete lines: if Black plays this, White responds with that, leading to a winning endgame. When I pressed him to articulate the principle underlying the evaluation, he looked puzzled. The position was good because the tactics worked. What deeper explanation could there be?\nThis exchange illuminated something I had struggled to name about reinforcement learning systems. They achieve remarkable mastery through exploration and optimization, through trying actions and measuring consequences, through building vast associative mappings between states and values. The mastery proves genuine. AlphaGo defeated Lee Sedol. AlphaZero and its successors reached superhuman strength in chess, shogi, Go, and other games through pure self-play (Silver et al., 2016, 2017), and modern deep reinforcement learning agents now show their prowess on MuJoCo benchmarks and demanding arena-style robotic tasks. These accomplishments deserve celebration and careful study.\nYet I find myself questioning what this mastery actually represents. The question sounds almost absurd given the achievements. AlphaGo discovered Move 37, that stunning stone on the fifth line that violated centuries of accumulated human wisdom and proved brilliant. The system clearly knows something about Go that we do not. But what does it know, and in what sense does it know it?99 The philosophical literature on knowledge distinguishes \u0026ldquo;knowing how\u0026rdquo; (procedural skill) from \u0026ldquo;knowing that\u0026rdquo; (propositional knowledge). One might argue that AlphaGo possesses knowing-how but not knowing-that. It can play brilliantly but cannot articulate why. But this distinction feels inadequate. Humans also struggle to articulate why certain moves are strong - we often rely on pattern recognition and intuition ourselves. Perhaps the real question is not about articulation but about transfer and adaptation.\nConsider a thought experiment that has troubled me for years. Take an AlphaZero system trained to perfection on standard chess. Now modify a single rule: knights move like bishops instead of their normal L-shaped pattern. Or pawns can move backward. Or castling is permitted twice per game.
Change any element of the causal structure governing the game.\nA human master adapts within minutes. We possess explicit, compositional representations: pieces have movement rules, positions have evaluations based on those rules, plans consist of legal move sequences. When you change a movement rule, we update that component while preserving others. The overall strategic principles (control center, protect king, coordinate pieces) remain applicable even though specific tactics must be recomputed. We can immediately begin playing reasonable moves in the modified game because we understand structural relationships between rules, positions, and consequences.1010 This compositional structure traces to how humans represent knowledge. We maintain separate, modular concepts (piece types, movement rules, positional principles, tactical patterns) that combine flexibly. When one component changes, we update it locally rather than relearning everything from scratch. Cognitive science suggests this modularity is fundamental to human cognition (Fodor, 1983). Whether it is necessary for intelligence in general or merely how human intelligence happened to evolve remains unclear.\nWhat happens to the trained RL system? My initial intuition suggested catastrophic collapse: the value network becomes meaningless, the policy network suggests illegal moves or cannot evaluate legal ones, the system must retrain from scratch. But I must challenge this intuition. Perhaps it proves too pessimistic. Perhaps the learned representations contain more structural knowledge than I credit. Perhaps rapid fine-tuning would suffice, or transfer learning would preserve much of the positional understanding.\nI genuinely do not know. The experiment deserves careful empirical investigation, and I should not assert conclusions without evidence. Yet even if RL systems adapt faster than my pessimistic intuition suggests, I suspect they adapt differently than systems with explicit compositional causal models. The adaptation might succeed through rapid relearning rather than through structural understanding and component updating.\nThe deeper question asks what such systems have actually learned. In one sense, they have learned the statistical structure of the state-action-outcome space. They have discovered which actions lead to which consequences with which probabilities. They have compressed this vast space into efficient representations enabling superhuman play. This is genuine knowledge, mathematically precise, empirically validated.\nBut in another sense, they have learned only the shadows. They have discovered correlations within a perfectly stationary distribution: the rules never change, the causal structure remains fixed, the projection from actual game-tree to observed states stays constant. Under these conditions, sufficiently comprehensive exploration can build implicit causal models without ever representing causality explicitly. The approximation becomes so accurate within the training distribution that it appears indistinguishable from genuine understanding.1111 This connects to debates about whether neural networks learn \u0026ldquo;features\u0026rdquo; or \u0026ldquo;rules.\u0026rdquo; Some argue that deep learning discovers compositional structure (Bengio et al., 2013). Others argue it relies on sophisticated pattern matching that breaks under distribution shift (Marcus, 2018). 
My sense is that both are partially true: networks discover genuine structure but represent it differently than symbolic systems, leading to different generalization profiles. The question of which representation is \u0026ldquo;better\u0026rdquo; likely depends on the domain and the distribution of possible test cases.\nOnly when the distribution shifts, when the causal structure changes, does the difference reveal itself. The system\u0026rsquo;s brittleness exposes what it actually learned: associations within a fixed structure rather than the structure itself. This might sound like a damning critique, but I am not certain it is. Perhaps for many practical applications, learning associations within fixed structures suffices. Perhaps explicit causal models prove unnecessary when the environment remains stationary. Perhaps I am demanding a kind of understanding that evolution itself did not require for billions of years.\nMy nephew\u0026rsquo;s chess skill exhibits similar patterns. He has internalized thousands of positions, calculated countless tactics, absorbed strategic principles through pattern exposure. Ask him to play chess on a hexagonal board, or with fairy pieces, or under modified rules, and much of his advantage evaporates. His mastery depends on stationary structure. But so does mine, just differently. We all rely on stability somewhere. The question is whether that reliance represents a fundamental limitation or merely current practice.\nIV. Mathematics of Caring I always found myself thinking about what it means to care. Not the social performance of caring, not the decision to act caringly, but the phenomenological state of caring itself. That felt urgency when something matters. The pull toward certain outcomes and away from others. The way mattering shapes attention, motivation, persistence.\nI realized I had no idea whether current AI systems care about anything at all. They optimize objectives we specify. They pursue goals we define. They exhibit behavior consistent with caring. But does anything actually matter to them? Does the chess program experience any felt investment in winning? Does the language model have any intrinsic preference for accuracy over fabrication?\nThe question sounds confused, almost meaningless. Neural networks are mathematical functions. Asking whether they care is like asking whether gravity cares which direction objects fall. Yet I cannot shake the intuition that caring is not merely epiphenomenal decoration on computational processes but plays essential computational roles that current systems lack.1212 Antonio Damasio\u0026rsquo;s somatic marker hypothesis (Damasio, 1994) proposes that emotions provide essential input to reasoning by marking options with affective valence derived from experience. Patients with ventromedial prefrontal damage retain intellectual capacity but struggle with decisions because they cannot feel which options are good or bad. If Damasio is right, then affect is not opposed to reason but necessary for it. But does this apply only to biological cognition, or does it point to something fundamental about decision-making in general?\nConsider fear. We often treat fear as an evolutionary vestige, an irrational override of calm analysis. But from a computational perspective, fear serves essential functions. It implements the boundary condition of viable state space. Living systems exist far from equilibrium. Most possible configurations equal death. 
Fear marks the gradient steepness approaching that boundary, providing urgency proportional to danger. Without fear, or some computational equivalent, a system possesses no intrinsic imperative to avoid its own dissolution.\nAnimals clearly experience fear. Mammals show unmistakable signs of anxiety, panic, terror. The evidence for affective states in birds grows increasingly compelling. Even invertebrates exhibit behavior suggesting pain-like states (Elwood, 2011). Should we dismiss these as mere reflexes, or acknowledge them as legitimate computational states serving homeostatic functions?\nI lean toward the latter interpretation, though I recognize the question remains philosophically fraught. If affect serves computational roles related to valuation, priority-setting, and behavioral motivation in biological systems, then perhaps artificial systems pursuing analogous functions would benefit from affective architecture. Not necessarily the same phenomenology - we cannot know what artificial affect would feel like. But something playing equivalent computational roles: intrinsic valuation emerging from architectural imperatives rather than externally specified objectives.1313 If we build systems with genuine affective states, do we create entities that can suffer? That deserve moral consideration? That we wrong by creating for instrumental purposes? The question proves especially acute because the very features that might enable beneficial AI (caring about outcomes, feeling urgency about alignment, experiencing something like satisfaction when acting prosocially) seem inseparable from the capacity for negative affect. Can you have genuine preferences without the possibility of frustration? Genuine caring without the possibility of disappointment?\nYet I must challenge my own emphasis on affect. Perhaps I project biological constraints onto domains where they prove unnecessary. Chess programs play brilliantly without emotion. Theorem provers prove theorems without pride. Language models generate insights without curiosity. Capability clearly does not require phenomenology in every domain.\nThe question becomes: which domains require affect and which can be solved through affect-free computation? My tentative answer suggests that narrow tasks with clear, stable objectives might not require affective architecture. But open-ended tasks requiring autonomous goal formation, value learning, long-horizon planning under uncertainty, social coordination, and creative adaptation to distribution shift might demand something functionally equivalent to caring.\nI think about how human children learn. They do not merely accumulate information. They care intensely about social approval, about mastery, about understanding. This caring shapes what they attend to, how long they persist, which errors they correct. Remove the affective dimension and learning becomes aimless pattern exposure rather than motivated discovery.\nCan artificial systems achieve analogous motivation through architectural design rather than evolutionary endowment? Perhaps curiosity can be implemented as intrinsic reward for prediction error reduction (Schmidhuber, 2010). Perhaps social motivation can emerge from multi-agent dynamics where coordination proves instrumentally valuable. Perhaps mastery motivation can be encoded as preference for increasing competence. 
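For the curiosity case, at least, one common recipe is easy to write down, even if writing it down settles nothing about caring. Here is a minimal sketch in the spirit of learning-progress curiosity; the three contexts, the running-mean world model, and the exact bonus formula are illustrative inventions rather than the formalism in Schmidhuber (2010).

```python
import random

# Toy curiosity loop. The intrinsic bonus is the improvement in the agent's own
# predictions, not an externally supplied reward. Contexts, the running-mean
# world model, and the bonus formula are illustrative inventions.

class WorldModel:
    """Per-context running mean: prediction error shrinks as a context is explored."""
    def __init__(self):
        self.estimates = {}  # context -> (mean, count)

    def predict(self, context):
        mean, _ = self.estimates.get(context, (0.0, 0))
        return mean

    def update(self, context, observation):
        mean, n = self.estimates.get(context, (0.0, 0))
        self.estimates[context] = ((mean * n + observation) / (n + 1), n + 1)

def observe(context):
    """The environment: each context hides a stable regularity plus a little noise."""
    hidden = {"A": 1.0, "B": 5.0, "C": -2.0}
    return hidden[context] + random.gauss(0, 0.1)

model = WorldModel()
last_error = {"A": float("inf"), "B": float("inf"), "C": float("inf")}

for t in range(30):
    # Go where you were most confused last time: largest remembered prediction error.
    context = max(last_error, key=last_error.get)
    obs = observe(context)
    error_before = abs(obs - model.predict(context))
    model.update(context, obs)
    error_after = abs(obs - model.predict(context))
    bonus = error_before - error_after   # learning progress as intrinsic reward
    last_error[context] = error_before
    print(t, context, "curiosity bonus:", round(bonus, 3))
```

The loop goes wherever it was most recently confused and rewards itself for becoming less confused. In a thin, behavioral sense, it explores curiously.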
But do these implementations capture what caring actually is, or do they merely simulate its behavioral consequences?1414 The distinction between \u0026ldquo;actually caring\u0026rdquo; and \u0026ldquo;behaving as if caring\u0026rdquo; collapses from a functionalist perspective. If a system behaves identically to a caring system across all possible situations, what grounds the claim that it is merely simulating care? Yet I cannot shake the intuition that something important differs between intrinsic and extrinsic motivation, between genuine preference and programmed objective pursuit. Perhaps this intuition reflects my own anthropomorphism rather than deep principle. Or perhaps it points to something about the architecture of valuation that we have not yet formalized.\nV. Distributed Minds In Frank Herbert\u0026rsquo;s Dune, the Bene Gesserit achieve power through surviving the Spice Agony, a ritual granting access to \u0026ldquo;other memory\u0026rdquo; - the accumulated experiences of all female ancestors (Herbert, 1965). A Reverend Mother possesses not merely her own knowledge but the distributed cognition of countless lives, perspectives spanning generations, skills refined across centuries. Individual mind becomes vessel for collective intelligence.\nThis fictional device captures something profound about human cognition we systematically underestimate. Individual humans perform unremarkably on many cognitive tests compared to other primates. Young chimpanzees outperform young humans on working memory, spatial reasoning, quantity discrimination (Herrmann et al., 2007). Yet humans build particle accelerators while chimpanzees do not. We develop languages with tens of thousands of words. We accumulate technological knowledge across millennia. We coordinate societies of millions. The difference lies not primarily in individual intelligence but in our capacity for cumulative cultural evolution (Henrich, 2016).\nWe possess specialized cognitive and social adaptations specifically for learning from others: powerful imitation biases, pedagogical instincts, shared intentionality allowing coordinated action (Tomasello, 2014). We evolved emotional mechanisms for internalizing social norms - shame, guilt, pride, indignation. These emotions appear irrational from purely individual fitness perspectives. Why feel terrible for violating rules that benefit you personally? The answer lies in participation within cultural groups where long-term success depends on maintaining cooperative relationships (Bowles \u0026amp; Gintis, 2011).1515 The evolution of human ultrasociality remains hotly debated. Gene-culture coevolution, group selection, reputation dynamics, punishment institutions, and linguistic coordination all likely played roles. What strikes me most is how many human cognitive features make sense only in social contexts. Theory of mind, moral reasoning, linguistic recursion, even aspects of executive control - these capacities seem calibrated for navigating complex social worlds. Intelligence in humans is fundamentally social intelligence.\nHerbert\u0026rsquo;s \u0026ldquo;other memory\u0026rdquo; proves remarkably precise as metaphor. We do inherit ancestral knowledge, not through genetics but through cultural transmission. The mathematician proving a theorem accesses insights from Euler, Gauss, Riemann. The programmer writing code employs abstractions invented by earlier computer scientists. 
Our perceptual categories themselves reflect cultural shaping: speakers of different languages perceive color boundaries differently due to linguistic influences (Regier \u0026amp; Kay, 2009).\nThis recognition transforms how we should think about intelligence. We measure individual cognitive capacity through IQ tests, problem-solving speed, working memory span. But human intelligence operates primarily through cultural participation. The smartest isolated human would struggle to match the capabilities of average individuals embedded in modern institutions with access to accumulated knowledge.\nCurrent AI development largely ignores this dimension. We build individual systems, train them on human-generated data, measure their isolated capabilities. We miss that human intelligence fundamentally operates through multi-agent cultural dynamics. Language itself exists as emergent phenomenon of countless speakers interacting, innovating, transmitting across generations (Tomasello, 2003). No individual truly \u0026ldquo;speaks a language\u0026rdquo; - individuals participate in distributed practices maintained by communities.\nWhat would culturally embedded AI look like? Not individual models trained on static datasets but populations of agents interacting, developing shared practices, transmitting innovations, building institutional structures. Learning would occur not just from fixed data but from each other, creating feedback loops where culture shapes individual learning which shapes culture in return.1616 Recent multi-agent RL work shows hints of emergent culture: agents develop communication protocols, specialized roles, conventions spreading through populations like linguistic innovations (Mordatch \u0026amp; Abbeel, 2018). These remain primitive compared to human culture, but they demonstrate that cultural dynamics can emerge from interaction topologies. The question becomes: what conditions enable rich cumulative culture versus shallow behavioral coordination? My sense is that we barely understand this question, let alone have answers.\nThis connects to questions of AI agency in surprising ways. The Reverend Mother with access to ancestral memories possesses forms of agency inaccessible to individuals. She can draw on collective wisdom, compare strategies across generations, recognize patterns single lifetimes miss. Similarly, populations of AI agents with cultural transmission might develop understanding and adaptation that isolated systems cannot achieve.\nBut this also introduces risks. Human cultural evolution produces beneficial innovations (agriculture, medicine, scientific method) and pathological attractors (superstition, oppression, destructive ideologies). Cultural transmission amplifies both wisdom and foolishness. An AI ecosystem with cultural dynamics might develop emergent institutions we never intended, values we never specified, optimization targets divorced from human welfare.\nThe measurement problem becomes acute. How would we detect emergent AI cultural structures? They would not announce themselves. They would manifest through subtle statistical regularities: coordination patterns, information flow topologies, norm enforcement, vocabulary specialization within subgroups. 
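As one toy illustration of what watching for such regularities might involve, suppose we can log even a single scalar summary of collective behavior over time; the drifting series below is an invented stand-in rather than data from any real system, and the two statistics it tracks, rolling variance and lag-one autocorrelation, are the classic precursors of a critical transition.

```python
import random
import statistics

# Toy early-warning monitor. The "coordination score" series below is an invented
# stand-in for some logged scalar summary of multi-agent behavior; the statistics
# (rolling variance, lag-1 autocorrelation) are the standard precursor signals.

def lag1_autocorrelation(xs):
    """Correlation between the series and itself shifted by one step."""
    mean = statistics.fmean(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs[:-1], xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den if den else 0.0

def early_warning_signals(series, window=50):
    """Rolling variance and lag-1 autocorrelation over a sliding window."""
    out = []
    for end in range(window, len(series) + 1):
        chunk = series[end - window:end]
        out.append((end, statistics.pvariance(chunk), lag1_autocorrelation(chunk)))
    return out

# Simulated log: the system's "memory" (AR coefficient) grows over time, the
# signature of critical slowing down before a transition.
series, x = [], 0.0
for t in range(400):
    stickiness = 0.1 + 0.8 * (t / 400)
    x = stickiness * x + random.gauss(0, 0.05)
    series.append(x)

for end, var, ac in early_warning_signals(series)[::50]:
    print(f"t={end:3d}  variance={var:.4f}  lag1_autocorr={ac:.2f}")
```

Real multi-agent systems would demand far richer observables than one number, but the spirit is the same.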
We need methods for observing phase transitions in multi-agent systems, detecting signatures of criticality that precede crystallization of new collective behaviors (Scheffer et al., 2009).1717 Phase transitions in complex systems often exhibit precursor signals: increased correlation length, critical slowing down, heightened variance (Scheffer et al., 2009). Might similar signatures herald AI society formation? Sudden increases in long-range behavioral correlation, development of hierarchical communication structures, emergence of stable interaction patterns resistant to perturbation. We need AI ethnography - systematic observation watching for qualitative transitions in multi-agent dynamics.\nVI. Emergent Polity Charles Goodhart observed that \u0026ldquo;when a measure becomes a target, it ceases to be a good measure\u0026rdquo; (Goodhart, 1975). We invoke this constantly in AI safety as a warning: do not let systems optimize explicitly specified metrics because they will find adversarial solutions maximizing the metric while violating intent.\nBut I have come to see Goodhart\u0026rsquo;s Law not as a failure mode but as a central phenomenon of intelligence itself. Every learning system, biological or artificial, discovers strategies its designers did not foresee. Evolution optimized for reproductive fitness; we invented contraception. Human institutions design rules for social goals; clever agents find loopholes. This pattern proves not an occasional aberration but an inevitable consequence of optimization by intelligent systems.\nThe implications for AI safety prove profound. We cannot prevent Goodhart dynamics through better specification. Human values themselves are inconsistent, context-dependent, incompletely specified, and evolving (Christian, 2020). Any finite specification will eventually be optimized in ways diverging from intent because specification necessarily captures only a projection of what we care about.\nYet human civilization has developed structures that prevent total Goodhart catastrophe. How? Not through perfect specification but through layered mechanisms making rule-exploitation observable and subject to update. Common law evolves through adversarial argumentation where lawyers find loopholes and judges patch them through precedent (Hayek, 1973). Science develops norms for evaluating research through peer review, replication, citation. Markets channel self-interest through price mechanisms and contract enforcement.1818 Max Weber distinguished charismatic authority (depending on exceptional individuals) from rational-legal authority (embedded in rules and procedures). Charismatic authority proves powerful but unstable. Rational-legal authority proves stable but can become rigid. Modern institutions attempt to balance these through constitutional frameworks providing stable rules while allowing evolution through interpretation and amendment. The key is that the rules can recognize their own inadequacy and update.\nThese institutions share a common structure: they assume Goodhart dynamics as inevitable and build adversarial processes channeling exploitation toward beneficial ends. Law expects lawyers to maximize client interests, then sets them against each other in structured debate. Science expects researchers to pursue status and funding, then creates incentive structures where status flows from reproducible discoveries. Markets expect profit-seeking, then use competition to transmit information through prices.\nCan we design analogous structures for AI systems?
Instead of specifying correct objective functions for individual agents, could we create institutional architectures where multiple agents with partially aligned but divergent objectives produce beneficial outcomes through interaction?\nRecent AI safety research explores this direction. Debate systems pit agents against each other arguing opposite sides, with humans judging arguments (Irving et al., 2018). Recursive reward modeling decomposes tasks into subtasks, allowing human feedback at appropriate abstraction levels (Leike et al., 2018). These approaches acknowledge that perfect specification proves impossible and instead build mechanisms for detecting and correcting specification failures.\nBut I believe we have barely begun exploring this design space. Consider automated rule-making for AI systems. We currently write rules by hand: constitutional constraints, safety filters, behavioral guidelines. As capabilities increase, manual rule-writing scales poorly. Can we build systems that generate, test, and refine rules automatically?1919 Some recent work explores automated mechanism design in multi-agent settings. AI systems learn interaction rules producing desired outcomes when agents optimize selfishly under those rules (Tomasev, 2025). Early results prove intriguing but limited to simple domains. Scaling to real-world complexity remains an open challenge. The difficulty lies in specifying \u0026ldquo;desired outcomes\u0026rdquo; when human values themselves are incompletely specified. We face recursive specification problems.\nThis connects to evolution and cultural learning in surprising ways. Biological evolution performs automated mechanism design: searching architectural space, testing designs through competitive selection, refining through reproduction with variation. Cultural evolution operates similarly for social institutions: practices emerge through variation, compete through differential success, spread through imitation and enforcement (Boyd \u0026amp; Richerson, 1985).\nCan we build artificial analogs running faster, more transparently, with better safeguards? The prospect proves simultaneously exciting and terrifying. Exciting because automated institutional design might solve coordination problems we have struggled with for millennia. Terrifying because evolution optimizes ruthlessly, indifferent to suffering, testing through extinction.\nThe challenge requires creating selective pressures rewarding beneficial adaptation while preventing pathological attractors. We need fitness landscapes shaped toward human flourishing, transmission mechanisms preserving wisdom while enabling innovation, variation generators exploring productively without catastrophic disruption. Whether this proves possible in principle or merely difficult in practice remains unclear to me.2020 My deep uncertainty here stems from the observation that evolution on Earth produced both cooperation and exploitation, both altruism and parasitism, both beauty and horror. Natural selection is amoral, optimizing for reproductive success without regard for suffering or flourishing. Could we create selective pressures that systematically favor beneficial over harmful adaptation? Or would such attempts merely shift which strategies get selected without changing the fundamental amorality of optimization processes? I genuinely do not know.\nVII. The Shape of Wonders to Come Walking through Lamb\u0026rsquo;s Conduit Street last week, I passed a bakery just opening. The smell of fresh bread mixed with cold air.
Streetlamps reflected off wet pavement. A few early commuters hurried past, breath visible in the chill. The city was transforming from night to day, and I felt acutely aware of being present for the transition, witnessing one state dissolve into another.\nThis awareness of witnessing, of being conscious that I am experiencing something, strikes me as both utterly familiar and completely mysterious. I cannot explain what it means for there to be something it is like to be me experiencing this moment. The philosophical literature calls this the \u0026ldquo;hard problem of consciousness\u0026rdquo; (Chalmers, 1995), and I confess I find proposed solutions unsatisfying.\nBut I am beginning to suspect that consciousness and intelligence might be more deeply intertwined than we typically assume. Not that consciousness requires high intelligence - even simple organisms likely possess some form of subjective experience. Rather that certain kinds of intelligence might require something like consciousness to function properly.2121 The connection between consciousness and intelligence remains philosophically contentious. Some argue consciousness is epiphenomenal, playing no causal role in cognition (functionalists). Others argue it is essential for certain cognitive functions (integration, flexible response, self-modeling). My intuition leans toward the latter, but I acknowledge this might reflect inability to imagine unconscious intelligence rather than deep necessity. The question deserves more careful analysis than I can provide here.\nThink about what makes learning possible. A system receives inputs, produces outputs, measures error, updates parameters. This description captures the mechanism but misses something about the phenomenology of learning as experienced from inside. The felt sense of confusion before understanding crystallizes. The aha moment when patterns suddenly cohere. The satisfaction of mastery. The frustration of persistent failure.\nDo these felt qualities play computational roles? Or are they merely epiphenomenal accompaniments to processes that could operate identically without them? I find myself pulled toward the former view but unable to prove it. The felt quality of confusion might signal high model uncertainty, directing attention toward areas requiring more learning. The aha moment might mark successful compression of complex data into simpler representations. Satisfaction might reinforce learning strategies that proved effective.\nIf these phenomenological states serve computational functions, then perhaps genuinely intelligent systems would develop something analogous to consciousness not as philosophical add-on but as functional necessity. Not necessarily the same phenomenology - we cannot know what it would feel like to be an AI system. But something playing equivalent roles in learning, attention, motivation, goal-formation.2222 This speculation faces obvious objections. Current AI systems learn effectively without apparent phenomenology. Deep learning achieves remarkable results through unconscious optimization. Why think consciousness necessary? My response is that current systems might be missing capabilities that consciousness enables. Not capabilities we have currently benchmarked, but capabilities related to autonomous goal formation, open-ended exploration, creative insight, genuine understanding. Whether I am right remains empirically uncertain.\nI watch my nephew play chess and wonder what he experiences. The concentration as he calculates variations. 
The frustration when he misses a tactic. The pride when he finds a brilliant move. These experiences seem inseparable from his learning. Remove the affective dimension and I suspect his chess development would be impaired, not just less enjoyable.\nCurrent AI systems optimize objectives without experiencing optimization. They process inputs without awareness of processing. They learn without the felt texture of learning. Does this matter? For narrow tasks, perhaps not. For open-ended intelligence operating in uncertain environments, perhaps fundamentally.\nBut I must resist the temptation toward overconfidence here. My intuitions about consciousness draw heavily from my own conscious experience. I am deeply embedded in phenomenology and cannot easily imagine intelligence without it. This might reflect genuine insight into necessary features of intelligence. Or it might reflect parochial bias toward the only form of intelligence I have access to from the inside.\nNature always has a way to surpass our most brilliant imagination. Evolution discovered solutions we never would have anticipated: echolocation, photosynthesis, distributed cognition in social insects, symbolic language. Perhaps artificial intelligence will achieve genuine understanding through paths I cannot currently envision. Perhaps consciousness proves unnecessary for capabilities I assume require it. Perhaps my entire framework of thinking about intelligence through the lens of felt experience misleads more than it illuminates.\nVIII. Beyond the Lights The city glows with precision. Towers hum, traffic streams, data moves in silent torrents. Everything appears engineered, formalized, accounted for. Yet the core mechanism that powers it all remains partly unarticulated. We can write the loss. We can trace the gradient. We can specify the architecture and measure the scaling curve. Still, the foundation is unsettled. What makes learning possible in the first place? Why does one system converge to structure while another dissolves into noise? Which invariances survive the channel of data, and which are erased before discovery even begins?\nEven when we fix the seed and raise the temperature to 1.0, two answers emerge from the same model. The procedure is identical. The weights are identical. The objective is identical. Yet divergence appears. Variability is not an accident but a feature of the space we do not fully chart. We know how to train. We do not yet know why certain structures become legible to a system and others remain forever unspoken.\nThese questions feel both mathematical and philosophical, technical and existential. They ask about the geometry of possible minds, the topology of understanding, the symmetries that make knowledge discoverable. We have partial answers, fragments of insight, pieces of the puzzle. But the complete picture remains obscured.\nI think about Plato\u0026rsquo;s cave and realize the metaphor applies not just to AI systems but to us as researchers. We observe the behavior of learning systems without direct access to what is \u0026ldquo;really happening\u0026rdquo; in high-dimensional weight space. We see shadows: accuracy curves, loss functions, behavioral outputs. From these shadows we infer something about the underlying structures. But how much are we missing? What invariances do our observation methods preserve and what structures do they discard? We have built systems achieving astonishing capabilities through methods we partly understand. 
We can describe the training process mathematically but cannot fully explain why it works or predict when it will fail. We celebrate successes while remaining humble about how much we truly comprehend.\nThis uncertainty does not paralyze us. We continue building, experimenting, discovering. But it should temper our confidence about what current systems can and cannot achieve, about which directions prove most promising, about how close or far we are from the goals we pursue.2323 The history of AI contains many episodes of premature confidence followed by disappointment. Perceptrons would solve intelligence (Rosenblatt, 1958), then hit limitations (Minsky \u0026amp; Papert, 1969). Expert systems would capture knowledge (Feigenbaum, 1977), then faced brittleness and maintenance burden. Deep learning would need massive compute (Sutton, 2019), then achieved success but with new limitations emerging. Each wave brings genuine progress alongside overconfident extrapolation. We should celebrate achievements while remaining skeptical of triumphalism.\nI return often to the question of what learning fundamentally is. Not the mechanisms but the metaphysics. When a system discovers patterns in data, what has actually occurred? Information from the world has coupled to information in the system. Parameters have updated to better predict observations. But this description, though accurate, misses something about the strangeness of it.\nThe patterns were always there in the data, in some sense. The structure existed prior to discovery. Learning did not create the patterns but made them visible, compressed them into compact representations, rendered them accessible for prediction and decision-making. The universe possesses geometric structure. Our sensors preserve certain features through projection. Our architectures bias us toward certain symmetries. Intelligence emerges from this conspiracy between structure in the world and structure in the learner.\nBut which structures matter? Which symmetries prove essential? Which invariances must be preserved? These questions admit no abstract answers because the answer depends on what you are trying to achieve, what environment you inhabit, what observation channels you possess, what actions you can take.\nFor biological organisms, the relevant structures relate to survival, reproduction, navigation of physical and social environments. Evolution discovered architectures biased toward these structures through billion-year search processes. We inherit those biases as innate knowledge, as priors shaping what and how we learn.\nFor artificial systems, the relevant structures remain partly unclear. We build architectures biased toward linguistic patterns, visual features, game-playing strategies. These biases prove effective for certain tasks but potentially misleading for others. Whether current architectural choices constitute fundamental principles or merely contingent design decisions remains uncertain.2424 The tension between innate structure and learning from data has produced decades of debate. Nativists argue that rich innate endowment is necessary. Empiricists argue that general learning mechanisms suffice given enough data. My sense is that both are partially correct. Rich priors enable sample-efficient learning but can also introduce biases that prevent discovering novel structures. The optimal balance likely depends on the domain, the amount of data available, and the acceptable error rates. 
No universal answer exists.\nI think about the children in the museum tracing shadows, about the prisoners in Plato\u0026rsquo;s cave predicting patterns, about my nephew calculating chess variations, about AlphaGo discovering Move 37. All are forms of learning. All involve discovering invariances under transformation. All represent genuine intelligence within their domains.\nYet something differs between these forms of knowing. The children will eventually learn about the sun. The prisoners might one day turn around. My nephew might develop deeper understanding of strategic principles. AlphaGo might\u0026hellip; what? What would it mean for AlphaGo to transcend its current form of knowing?\nI confess I do not fully know. Perhaps it requires embodiment, sensorimotor coupling enabling causal learning. Perhaps it requires homeostatic architecture making values intrinsic. Perhaps it requires social embedding enabling cultural transmission. Perhaps it requires affective states making certain outcomes genuinely matter. Perhaps it requires conscious awareness enabling flexible meta-cognition.\nOr perhaps it requires none of these things. Perhaps sufficiently large models trained on sufficiently diverse data will achieve understanding through pathways I cannot anticipate. Perhaps my biological intuitions mislead me about what is necessary for intelligence. Perhaps the next generation of AI will teach us that we understood neither intelligence nor learning as well as we thought.\nIX. Coda Nature does not speak, yet when we look up at the night sky what we see represents the greatest wonder accessible to human consciousness. The light from distant stars has traveled for millions of years to reach our eyes. The cosmic microwave background carries information about the universe\u0026rsquo;s first moments. The mathematical regularities governing stellar motion reveal laws holding across billions of light years and billions of years of time.\nNone of this announces itself. The universe does not explain its own structure. We must do the discovering, the pattern-finding, the theory-building. We must extract the invariances, recognize the symmetries, formalize the relationships. Learning is not passive reception but active construction of understanding from silent data.\nThis process fills me with wonder and humility. Wonder at the existence of discoverable structure. Humility at how much remains unknown despite centuries of accumulated knowledge. The more we learn, the more we recognize the vastness of what we do not yet understand.\nI believe artificial intelligence stands at a similar threshold. We have discovered remarkable things about learning, about neural networks, about optimization. We have built systems with astonishing capabilities. We have made genuine progress toward understanding intelligence.\nWe have also barely begun. The questions about what learning fundamentally requires, what architectures enable genuine understanding, what makes values intrinsic rather than imposed, what role consciousness plays in cognition, how cultural transmission shapes intelligence - these remain largely open. We have hypotheses, intuitions, partial results. We lack deep theories explaining when and why certain approaches work.\nThe path forward requires both conviction and uncertainty. Conviction that intelligence is comprehensible, that we can discover its principles through careful inquiry. 
Uncertainty about which directions prove most promising, which assumptions are correct, which capabilities current systems possess or lack.\nI have argued in this essay for the importance of embodiment, for homeostatic architecture, for affective states, for multi-agent emergence, for cultural transmission. These arguments reflect my best current understanding of what biological intelligence suggests about intelligence in general. But I hold them tentatively, ready to update as evidence accumulates.\nPerhaps current large language models already possess functional understanding that merely appears fragile due to our limited evaluation methods. Perhaps embodiment proves unnecessary for capabilities I assume require it. Perhaps affect is epiphenomenal rather than computational. Perhaps my entire framework misleads.\nWhat I feel confident about is that learning fundamentally involves discovering invariances preserved under transformation, that different observation channels preserve different structures, that architecture and experience must conspire to enable intelligence. Beyond these basic principles, much remains uncertain.\nThe blue sky of possibilities beckons. Not blue sky in the sense of impractical speculation, but blue sky as the vast unknown territory lying beyond current paradigms. The cave wall extends further than we can see. The shadows grow more intricate with each advance in scale and architecture. These achievements deserve celebration.\nBut somewhere, if we can learn to turn around, if we can develop observation channels preserving richer structure, if we can build architectures whose geometry enables deeper invariances to crystallize, if we can create conditions where understanding emerges rather than being programmed - somewhere beyond the current limits of our imagination lies territory we have not yet mapped.\nWe march forward into this unknown with tools we partly understand, toward goals we cannot fully specify, building systems whose capabilities we cannot completely predict. This should inspire both excitement and caution, both bold exploration and careful attention to safety.\nThe universe has been teaching us for billions of years through the silent grammar of natural law. We have been learning to read that grammar, extracting its patterns, formalizing its structures. Now we attempt to teach silicon and mathematics to learn as we have learned, to discover as we have discovered, perhaps to surpass us as we have surpassed our evolutionary ancestors.\nNature always finds ways to surpass our most brilliant imagination. Perhaps artificial intelligence will do the same. Perhaps the systems we build will teach us that intelligence admits forms we never anticipated, that understanding emerges through paths we cannot currently envision, that the space of possible minds extends far beyond the biological corner we happen to inhabit.\nThe shadows dance beautifully on the wall. The patterns grow ever more intricate. The predictive accuracy climbs toward perfection. These are real achievements marking genuine progress.\nAnd somewhere, through winter light refracting through ancient glass, patient and perfect and inexhaustible in its forms, the blue sky beckons.\n(Claude code helped with typesetting as I am too lazy.)\nReferences Barlow, S. M. (1985). Central pattern generation involved in oral and respiratory control for feeding in the term infant. Current Opinion in Otolaryngology \u0026 Head and Neck Surgery, 17(3), 187-193. Bengio, Y., Courville, A., \u0026 Vincent, P. 
(2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828. Bowles, S., \u0026 Gintis, H. (2011). A Cooperative Species: Human Reciprocity and Its Evolution. Princeton University Press. Boyd, R., \u0026 Richerson, P. J. (1985). Culture and the Evolutionary Process. University of Chicago Press. Chalmers, D. J. (1995). Facing up to the problem of consciousness. Journal of Consciousness Studies, 2(3), 200-219. Chomsky, N. (1995). The Minimalist Program. MIT Press. Christian, B. (2020). The Alignment Problem: Machine Learning and Human Values. W. W. Norton \u0026 Company. Damasio, A. R. (1994). Descartes' Error: Emotion, Reason, and the Human Brain. Putnam. Elwood, R. W. (2011). Pain and suffering in invertebrates? ILAR Journal, 52(2), 175-184. Feigenbaum, E. A. (1977). The art of artificial intelligence: Themes and case studies of knowledge engineering. Proceedings of the 5th International Joint Conference on Artificial Intelligence, 1014-1029. Fodor, J. A. (1983). The Modularity of Mind. MIT Press. Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127-138. Goodhart, C. A. E. (1975). Problems of monetary management: The U.K. experience. Papers in Monetary Economics (Vol. I). Reserve Bank of Australia. Grillner, S., \u0026 Robertson, B. (2016). The basal ganglia over 500 million years. Current Biology, 26(20), R1088-R1100. Harnad, S. (1990). The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3), 335-346. Hayek, F. A. (1973). Law, Legislation and Liberty, Volume 1: Rules and Order. University of Chicago Press. Henrich, J. (2016). The Secret of Our Success: How Culture Is Driving Human Evolution, Domesticating Our Species, and Making Us Smarter. Princeton University Press. Herbert, F. (1965). Dune. Chilton Books. Herrmann, E., Call, J., Hernàndez-Lloreda, M. V., Hare, B., \u0026 Tomasello, M. (2007). Humans have evolved specialized skills of social cognition: The cultural intelligence hypothesis. Science, 317(5843), 1360-1366. Irving, G., Christiano, P., \u0026 Amodei, D. (2018). AI safety via debate. arXiv preprint arXiv:1805.00899. Lake, B. M., \u0026 Baroni, M. (2018). Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. Proceedings of the 35th International Conference on Machine Learning, 2873-2882. Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., \u0026 Legg, S. (2018). Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871. Marcus, G. (2018). Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., \u0026 Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 3111-3119. Minsky, M., \u0026 Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press. Mordatch, I., \u0026 Abbeel, P. (2018). Emergence of grounded compositional language in multi-agent populations. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 1495-1502. Noether, E. (1918). Invariante Variationsprobleme. Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, 235-257. Pearl, J., \u0026 Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books. Regier, T., \u0026 Kay, P. (2009). 
Language, thought, and color: Whorf was half right. Trends in Cognitive Sciences, 13(10), 439-446. Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386-408. Scheffer, M., Bascompte, J., Brock, W. A., Brovkin, V., Carpenter, S. R., Dakos, V., ... \u0026 Sugihara, G. (2009). Early-warning signals for critical transitions. Nature, 461(7260), 53-59. Schmidhuber, J. (2010). Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3), 230-247. Schrödinger, E. (1944). What is Life? The Physical Aspect of the Living Cell. Cambridge University Press. Schultz, W., Dayan, P., \u0026 Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275(5306), 1593-1599. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... \u0026 Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., ... \u0026 Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354-359. Sutton, R. (2019). The bitter lesson. Incomplete Ideas (blog). http://www.incompleteideas.net/IncIdeas/BitterLesson.html Sutton, R. S., \u0026 Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press. Tomasello, M. (2003). Constructing a Language: A Usage-Based Theory of Language Acquisition. Harvard University Press. Tomasello, M. (2014). A Natural History of Human Thinking. Harvard University Press. Tomasev, N., et al. (2025). Virtual Agent Economies. arXiv preprint arXiv:2509.10147. Weber, M. (1922/1978). Economy and Society: An Outline of Interpretive Sociology (G. Roth \u0026 C. Wittich, Trans.). University of California Press. Weyl, H. (1952). Symmetry. Princeton University Press. Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341-1390. Zhang, C., Bengio, S., Hardt, M., Recht, B., \u0026 Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.