How a Creature Learns to Play With a Ball

I've built an artificial-life game: a browser world where creatures have simulated brains and bodies. They get hungry, sleep, breed, form relationships, and — most importantly — they learn. Very little of what they do is scripted. Most of it they work out for themselves.

This is a close look at one behaviour, because it's the one I'm most sure is genuinely learned: a creature playing with a ball. Nothing tells a newborn to do this. There's no gene for it, no script, no instinct that says "when you see a ball, hit it." And yet a few minutes into its life, a creature will wander up to a ball, bat it across the floor, and — if the moment is right — decide that was worth doing again.

I want to walk through exactly how that happens, because the answer is a tour through some genuinely lovely ideas about how brains learn — most of them not mine, and not new, but assembled here into something that behaves like a curious animal. So before the creature, a detour through how learning actually works.

How a real infant learns to do anything

A human baby is not born knowing how to play with a rattle. It doesn't arrive with a rattle instinct. What it arrives with is a much smaller set of things: a tendency to be drawn to novel objects, a willingness to flail at them, a nervous system that registers when something feels good, and the machinery to remember what led to the good feeling and do it again.

That's roughly the whole toolkit, and it's astonishingly general. The baby isn't taught "shake the rattle." It picks the rattle up because it's new and interesting, happens to shake it, finds the noise delightful, and — crucially — remembers the connection between the shaking and the delight. Repeat across a few days of naps, and a behaviour nobody installed has become a habit.

Two features of that story do all the work, and the creature uses both. First, the baby had to be nudged to try the rattle before it could possibly know it was fun — curiosity does that, and curiosity is aimed at novelty, not at any specific action. Second, the learning is driven by how the outcome felt, not by an instruction. Nobody rewarded the baby for shaking correctly; the noise was its own reward. Get those two things right — a reason to try, and a feeling to learn from — and complex behaviour falls out of simple parts.

My creatures learn the same way, and not by coincidence: the mechanisms below are lifted, deliberately, from the research on how real brains and learning systems do this. The creature isn't a faithful model of a brain — it borrows principles, it doesn't reproduce biology. But the principles are real, and they're the interesting part, so let me lay them out before showing how they combine.

The science behind the behaviour

Ball-play isn't one trick — it's several well-understood ideas from neuroscience and reinforcement learning working together. For each one I'll explain the actual research first, then how it shows up in the creature. (The full brain has more — decay, homeostatic weight-scaling, social imitation — but these are the pieces that produce ball-play.)

Object-curiosity — a reason to try anything at all

The science. For much of the twentieth century, learning theory assumed animals only act to satisfy a need — food, escape, comfort. But animals plainly explore for no reward at all: a well-fed rat will still investigate an unfamiliar arm of a maze. Daniel Berlyne, working through the 1950s and 60s, took this seriously and argued that novelty, surprise, and complexity are themselves motivating — that curiosity is a drive in its own right, not a side effect of seeking food. Decades later this became the backbone of "intrinsic motivation" research in psychology and AI (Oudeyer and Kaplan among others), which formalised the idea that an agent should be drawn to what it hasn't yet explored, and that the pull should fade as a thing becomes familiar — otherwise you'd be transfixed by the same object forever.

In the creature. Each creature tracks how familiar it is with every kind of object. A brand-new ball reads as maximally unfamiliar, and that unfamiliarity becomes a pull toward investigating it, fading as the ball is seen more often. The detail the whole behaviour hinges on is what this pull points at: it says investigate the unfamiliar object, not perform this action. It is a reason to look, not a reason to strike — and I'll come back to why that distinction is the difference between learning and a disguised instinct.
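
To make that concrete, here is a minimal Python sketch of per-object familiarity. It is an illustration, not the code the game runs: the class name, the exponential fade, and the 0.8 rate are all assumptions. The only things taken from the description above are that a never-seen object pulls hardest, that the pull fades with exposure, and that the pull names an object kind, never an action.

```python
class ObjectCuriosity:
    """Turns unfamiliarity with an object kind into a pull to investigate it."""

    def __init__(self, fade=0.8):
        # Hypothetical rate: each sighting keeps 80% of the remaining pull.
        self.fade = fade
        self.sightings = {}  # object kind -> number of times seen

    def observe(self, kind):
        self.sightings[kind] = self.sightings.get(kind, 0) + 1

    def pull(self, kind):
        # Keyed by object kind only. No action appears anywhere in here.
        return self.fade ** self.sightings.get(kind, 0)


curiosity = ObjectCuriosity()
fresh = curiosity.pull("ball")      # never seen: the maximum pull
for _ in range(10):
    curiosity.observe("ball")
stale = curiosity.pull("ball")      # seen ten times: mostly faded
```
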

Prediction-error curiosity — staying interested in what you can't yet predict

The science. There's a second, subtler account of curiosity that has largely won out in modern AI: you're drawn not to what's merely unseen, but to what you can't yet predict. The brain is constantly forecasting what happens next, and a mismatch between forecast and reality — a prediction error — is the signal that there's something here worth learning. Once you can predict a situation reliably, it stops being interesting. This idea runs from the Rescorla–Wagner model of conditioning in 1972 (learning is driven by surprise, not mere co-occurrence) through to Pathak and colleagues' "curiosity-driven exploration" in deep reinforcement learning (2017), where an agent is rewarded for visiting states its own predictor gets wrong, and Karl Friston's free-energy principle, which casts the whole brain as a prediction-error-minimising machine. The common thread: surprise is the engine of learning, and seeking surprise is how you find things worth learning.

In the creature. A small dedicated region of the brain — a prediction lobe — guesses how the creature's drives will change before each action runs. After the action, the guess is compared against what actually happened, and the size of the error becomes a "novelty" signal. A situation the creature can't yet predict keeps producing surprise, which keeps it exploring; once the predictor gets good at a familiar situation, the surprise dies and the creature looks elsewhere. This is what keeps a young creature poking at a new ball — its outcomes aren't yet predictable — and what eventually lets it move on.
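
A rough Python sketch of that loop, under loud assumptions: the real prediction lobe is a brain region, while this stand-in is just a lookup table updated with a simple delta rule, and the names and the 0.5 learning rate are invented for illustration. What it does show faithfully is the shape of the signal: guess, compare, and use the size of the error as novelty.

```python
class PredictionLobe:
    """Toy stand-in: predicts the change in a drive for each (situation, action)."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate  # hypothetical value
        self.predicted = {}                 # (situation, action) -> predicted change

    def surprise(self, situation, action, actual_change):
        key = (situation, action)
        guess = self.predicted.get(key, 0.0)
        error = actual_change - guess
        # Delta-rule update: move the guess part of the way toward what happened.
        self.predicted[key] = guess + self.learning_rate * error
        # The size of the error is the novelty signal.
        return abs(error)


lobe = PredictionLobe()
# The same strike, with the same outcome, repeated six times.
surprises = [lobe.surprise("near ball", "strike", -0.4) for _ in range(6)]
```

Each repeat halves the error in this toy version, so the novelty signal shrinks toward zero as the outcome becomes predictable.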

Optimism under uncertainty — try the new thing, or the reliable thing?

The science. Any learner faces the explore-exploit dilemma: do you take the reward you know, or gamble on an option you haven't tested that might be better? The cleanest treatment comes from the "multi-armed bandit" problem — imagine a row of slot machines with unknown payouts, and you want to maximise winnings over many pulls. A strategy proven to work well (Auer and colleagues, 2002, formalising older ideas) is "optimism in the face of uncertainty": deliberately over-value options you haven't tried much, so you're pulled to sample them, and let that inflated value shrink as evidence accumulates. It guarantees you keep trying neglected options long enough to find out whether they're actually good, instead of prematurely settling on the first thing that happened to work.

In the creature. Any action the creature has barely tried carries a small "maybe this is good?" bonus that shrinks each time the action is used, until it's negligible. It's the same formula for every action and names none of them — it keeps a clueless creature from permanently ignoring options it has never sampled. A never-tried strike gets the full bonus; after a few goes it's just another option, judged on results. It does the job a hardwired "always try this" seed used to do, without ever wiring in the behaviour itself.
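
As a sketch, the bonus can be as small as one line. The 1/(1 + n) shape, the 0.5 scale, and the example numbers below are assumptions chosen for illustration; the text above only commits to a bonus that is identical for every action and shrinks with use.

```python
def optimism_bonus(times_tried, scale=0.5):
    """Same formula for every action; it never mentions which action it is."""
    return scale / (1 + times_tried)


# Hypothetical newborn: some seeded survival values, nothing learned about striking.
learned = {"eat": 0.3, "rest": 0.2, "strike": 0.0}
tried = {"eat": 40, "rest": 25, "strike": 0}

scores = {action: learned[action] + optimism_bonus(tried[action]) for action in learned}
favourite = max(scores, key=scores.get)
```

With these made-up numbers the untried strike briefly outranks the familiar options, and after a few dozen uses its bonus is too small to matter.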

Softmax action selection — how much to gamble

The science. Once an agent has values for its options, how should it choose? Always picking the highest-valued one (pure "greedy" selection) means it never explores; choosing at random means it never exploits what it's learned. Softmax — or Boltzmann — selection, a staple of reinforcement learning (Sutton and Barto's standard text), strikes the balance: it picks probabilistically, with higher-valued actions much more likely but lower ones still possible. A temperature parameter controls the sharpness — high temperature flattens the distribution toward random exploration, low temperature sharpens it toward greedy exploitation. The trick that good learners use is to start hot and cool down: explore widely early, commit to what works later.

In the creature. The temperature is read live from the creature's state rather than set on a fixed schedule — built from a baseline, a youth term (the young run hot), and the prediction-error novelty from earlier (a surprised creature explores more). It's clamped so a curious newborn ranges widely while a settled adult mostly does the obvious thing — and never so high that a real emergency like hunger gets drowned out by noise. A young creature beside a novel ball is therefore, by construction, in exactly the state most likely to gamble on an action it has no learned reason to take.
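
Here is a small Python sketch of state-dependent softmax selection. The three-term temperature and the clamp follow the description above; every coefficient, the clamp bounds, and the function names are assumptions made for the example.

```python
import math
import random


def temperature(age_fraction, novelty, base=0.2, youth_gain=0.6,
                novelty_gain=0.5, low=0.1, high=1.0):
    """Baseline + youth term + novelty term, clamped. All constants are hypothetical."""
    raw = base + youth_gain * (1.0 - age_fraction) + novelty_gain * novelty
    # The upper clamp keeps noise from drowning out an urgent drive.
    return min(high, max(low, raw))


def softmax(values, temp):
    """Boltzmann distribution over actions; subtracting the peak avoids overflow."""
    peak = max(values.values())
    weights = {a: math.exp((v - peak) / temp) for a, v in values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def choose(values, temp, rng):
    probs = softmax(values, temp)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


values = {"eat": 0.2, "rest": 0.1, "strike": 0.0}
newborn = softmax(values, temperature(age_fraction=0.0, novelty=0.8))
adult = softmax(values, temperature(age_fraction=1.0, novelty=0.0))
picked = choose(values, temperature(0.0, 0.8), random.Random(0))
```

With identical action values, the hot newborn gives the zero-valued strike a noticeably larger share of the probability than the cool adult does.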

Reward as drive-reduction — what counts as success

The science. This is the oldest idea in the system. Edward Thorndike's Law of Effect (1911) established the foundation: actions followed by a satisfying outcome become more likely, those followed by discomfort less so — learning is shaped by consequences. B.F. Skinner built operant conditioning on this. But the piece this creature uses most directly is Clark Hull's drive-reduction theory (1943): the claim that what makes an outcome "satisfying" is specifically the reduction of a drive — that reinforcement is the relief of a need like hunger or discomfort. Drive-reduction theory was later shown to be incomplete as a full account of motivation (curiosity itself was one of the problems — animals act without any drive being reduced, which is why we needed Berlyne), but as a mechanism for how a need-satisfying action gets reinforced, it's exactly right and exactly usable.

In the creature. The teacher signal is literally "how much did my drives just drop?" It never names an action, so which action earns reward is discovered, not scripted. This is the direct descendant of how the original 1996 Creatures game taught its norns with reward and punishment chemicals — and it's why striking a ball can be learned at all: the strike relieves boredom, the drop in boredom is the reward, and the brain works backward to whatever produced it.
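
In code, the striking thing about this signal is what its inputs leave out. A minimal Python sketch, assuming drives are stored as numbers where higher means more need (the dictionary layout and drive names are illustrative, not the game's):

```python
def drive_reduction_reward(before, after):
    """Reward is the total drop in drives. Note that no action is passed in."""
    return sum(before[name] - after[name] for name in before)


# Hypothetical drive levels just before and just after a strike connects.
before = {"hunger": 0.2, "boredom": 0.8}
after = {"hunger": 0.2, "boredom": 0.5}
reward = drive_reduction_reward(before, after)
```

Because the function never sees which action ran, it cannot favour one. Whatever happened to lower the drives gets the credit.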

Disappointment — learning from doing nothing useful

The science. Discouraging a behaviour doesn't require punishment. In conditioning research, a response that stops producing its expected outcome undergoes extinction — it fades — and the absence of an expected reward functions as a mild negative signal in its own right (a "negative prediction error," in the Rescorla–Wagner framing again: you expected something and got nothing, so your estimate drops). This is gentler and more specific than punishment: it doesn't say "that was bad," it says "that wasn't worth the effort."

In the creature. When a creature takes an action for a few ticks but its drives don't move in either direction, a mild negative signal weakens the connections that led to it. This is what lets a creature learn from a mis-aimed gesture: strike a vending machine or thin air, nothing happens, and a small twinge of disappointment teaches it not to bother — no rule scripted against it, just the absence of reward registered as a quiet "that was pointless."
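
A sketch of how that could sit alongside ordinary reward, in Python. The thresholds and the size of the twinge are assumptions; the description above only fixes the shape: a few ticks on an action, drives flat in both directions, then a mild negative signal.

```python
def teaching_signal(drive_drop, ticks_on_action, patience=3,
                    dead_band=0.01, twinge=-0.05):
    """All three constants are hypothetical values chosen for the example."""
    if abs(drive_drop) > dead_band:
        # Drives moved: pass the change through (positive if they dropped).
        return drive_drop
    if ticks_on_action >= patience:
        # Nothing moved after a few ticks: "that was pointless".
        return twinge
    # Too early to judge.
    return 0.0


hit_ball = teaching_signal(drive_drop=0.3, ticks_on_action=1)
hit_thin_air = teaching_signal(drive_drop=0.0, ticks_on_action=3)
just_started = teaching_signal(drive_drop=0.0, ticks_on_action=1)
```
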

Eligibility traces — the reward arrives late, so what gets the credit?

The science. Rewards almost never arrive at the instant of the action that earned them — there's a delay, and in between, the agent does other things. So when reward finally lands, which of the recent actions deserves the credit? This is the credit-assignment problem, named by Marvin Minsky in 1961 and one of the central difficulties in learning. The elegant solution, developed in the temporal-difference learning of Sutton and Barto, is the eligibility trace: every connection that fires leaves a fading "tag" of recent activity, and when reward arrives, it's distributed across whatever is still tagged. Connections active just before the reward are strongly eligible; ones that fired long ago have faded out. The reward finds its way back to the recent causes rather than smearing across everything.

In the creature. The reward for a strike is computed at the moment of contact with the ball, not from watching it bounce away afterward. At that instant the trace still holds "saw the toy" and "chose strike" firing together, so the credit lands on exactly those connections instead of on whatever idle thing comes next. Getting the timing right is what makes the creature learn the right lesson rather than a superstition — credit the contact, not the wandering that follows it.
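
The timing argument is easier to see with numbers. This Python sketch uses a simple replacing trace with a decay of 0.5 per tick; the class, the decay, the learning rate, and the link names are assumptions for illustration rather than the game's values.

```python
class EligibilityTraces:
    """Each link that fires gets a tag of 1.0 that fades every tick."""

    def __init__(self, decay=0.5):
        self.decay = decay  # hypothetical per-tick fade
        self.trace = {}

    def tick(self, active_links):
        for link in self.trace:
            self.trace[link] *= self.decay
        for link in active_links:
            self.trace[link] = 1.0

    def credit(self, reward, learning_rate=0.1):
        """Share the reward out in proportion to each link's remaining tag."""
        return {link: learning_rate * reward * tag for link, tag in self.trace.items()}


STRIKE = ("saw toy", "strike")
WANDER = ("nothing nearby", "wander")

traces = EligibilityTraces()
traces.tick([WANDER])
traces.tick([STRIKE])                      # contact happens on this tick
at_contact = traces.credit(reward=1.0)

traces.tick([WANDER])
traces.tick([WANDER])
two_ticks_late = traces.credit(reward=1.0)  # same reward, delivered too late
```

Rewarding at contact gives the strike link most of the credit. Deliver the same reward two ticks later and the wandering collects most of it instead, which is exactly the superstition to avoid.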

Hebbian learning and dual-timescale memory — how a connection forms and lasts

The science. The foundational rule for how a connection strengthens is Donald Hebb's, from 1949: when one neuron repeatedly helps fire another, the connection between them grows — usually paraphrased as "neurons that fire together wire together." It's now experimentally grounded in the synaptic changes underlying memory. Pure Hebbian learning has a flaw, though — connections can strengthen without limit — so Erkki Oja's 1982 refinement adds a normalising term that keeps a neuron's total connection strength in check while preserving what it learned. Separately, there's the question of how a single experience becomes a durable memory without each event overwriting the last. The leading answer is Complementary Learning Systems theory (McClelland, McNaughton and O'Reilly, 1995): the brain uses a fast learner (the hippocampus, which grabs new experiences immediately) and a slow learner (the cortex, which integrates them gradually over time), and the handoff between them happens largely during sleep — when the hippocampus replays the day's experiences to train the cortex. That replay was directly observed in sleeping rats (Wilson and McNaughton, 1994), whose hippocampal cells re-fired their daytime maze routes during sleep. Modern deep reinforcement learning independently rediscovered the same trick: "experience replay" stores past episodes and re-trains on them, and "prioritised experience replay" (Schaul and colleagues, 2016) replays the most informative ones more often — the machine-learning echo of a brain rehearsing what mattered.

In the creature. When "I see a toy" and "strike" are active together as reward lands, the link between them thickens — a memory physically forming, kept in check by an Oja-style limit so it can't run away. That bump lands first in the fast memory, which carries the freshly-learned habit through the day. The bridge to the slow, durable memory is sleep: the brain replays recent experiences and replays the good ones about three times as often, copying the successful strike from fast store into slow. A sleeping creature, a sleeping rat, and a deep-RL agent are all doing the same thing — rehearsing what worked, preferentially, until it sets. The original Creatures game had the same fast/slow split decades ago.
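
Here is a compact Python sketch of both halves: a reward-gated, Oja-style update that cannot run away, and a sleep pass that replays rewarded episodes three times as often while nudging slow memory toward fast. The 3:1 bias comes from the description above. The update form, the rates, the replay count, and the data layout are assumptions.

```python
import random


def oja_style_update(weight, pre, post, reward, rate=0.1):
    """Hebbian growth (pre * post) with an Oja-style decay term (post^2 * weight)."""
    return weight + rate * reward * post * (pre - post * weight)


def sleep_replay(episodes, fast, slow, rng, good_bias=3, transfer=0.2, replays=50):
    """Replay rewarded episodes more often; each replay copies a little of fast into slow."""
    weights = [good_bias if ep["reward"] > 0 else 1 for ep in episodes]
    for ep in rng.choices(episodes, weights=weights, k=replays):
        link = ep["link"]
        old = slow.get(link, 0.0)
        slow[link] = old + transfer * (fast.get(link, 0.0) - old)
    return slow


STRIKE = ("saw toy", "strike")
WANDER = ("nothing nearby", "wander")

# Daytime: repeated rewarded strikes thicken the fast link, but only up to a limit.
fast = {STRIKE: 0.0}
for _ in range(200):
    fast[STRIKE] = oja_style_update(fast[STRIKE], pre=1.0, post=1.0, reward=1.0)

# Night: the day's episodes are replayed with a bias toward the rewarded one.
episodes = [{"link": STRIKE, "reward": 0.8}, {"link": WANDER, "reward": 0.0}]
slow = sleep_replay(episodes, fast, {}, random.Random(0))
```

Two hundred rewarded strikes leave the fast weight just under 1.0 rather than growing without limit, and one night of replay moves the slow weight most of the way toward it.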

A quiet thread runs through all of these: a 1996 game, mid-century learning theory, neuroscience, and modern deep RL keep arriving at the same handful of mechanisms. That convergence is the reassuring part. I'm not inventing how learning works; I'm assembling pieces that many different fields found independently.

The creature, end to end

Now the walkthrough — and because the ideas are in place, it reads as them combining rather than as a pile of new machinery.

Birth. A newborn has the capacity to play — there is a "strike" action and the wiring to connect "I see a toy" to "strike it" — but that connection starts at essentially zero. The survival reflexes are seeded: hunger drives eating, tiredness drives sleep, danger drives fleeing. Those are too important to leave to chance. Play is not seeded. There is one faint always-on "bored → strike" nudge in the wiring, but it is fixed plumbing — a tiny constant that can never grow or fade, and it is not what produces the behaviour. The connection that actually learns is a separate, changeable "I saw the toy → strike" link, and it genuinely starts at nothing.

A toy appears, and a clueless creature tries it. The creature has senses for toys the way it has senses for food — it can see a ball, judge its distance, tell whether it is moving. A fresh ball is maximally unfamiliar, so curiosity pulls the creature to investigate it. Here is that earlier distinction paying off: curiosity says "go look at the unfamiliar thing," not "strike." But a ball is, by definition, a thing you strike — so "investigate the ball" comes out as a strike. Curiosity supplies the opportunity to try; it has no opinion about whether the creature should ever do it again. Add the optimism bonus for an untried action and a young creature's appetite for a gamble, and a newborn with no reason whatsoever to strike a ball goes ahead and strikes it.

The hit, and the rule behind it. The creature walks up, waits until the ball is at foot height, and swings. The gesture is decoupled from the effect: it gestures at whatever is nearest, and only at contact does the world check what was actually hit. A ball is knocked away and rewarded. A bell rings and is rewarded. A fruit plant gives nothing. Another creature gets hurt and gives nothing — a creature is not a toy, and rewarding that would breed bullies. Striking the wrong thing simply does nothing and earns a small "well, that was pointless" signal, so the creature quietly learns not to bother, with no rule scripted against it.

When it does connect with the ball, the reward runs through the single most important rule in the behaviour:

reward delivered = how good the hit was × how bored the creature is

A bored creature striking a ball gets the full hit of satisfaction. A content one gets almost nothing. A hungry one gets nothing at all. The behaviour only burns into memory when the creature actually needed the distraction — which is why creatures never become ball-obsessed. With a ball and food both in reach, a hungry creature eats 98.9% of the time and strikes 0%; only a bored, not-hungry creature ever strikes. The same ball means different things depending on what the creature needs.
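
As arithmetic, with both quantities assumed to lie between 0 and 1 (the scale and the example values are assumptions, not figures from the game), the rule looks like this in Python. The sketch covers only the product stated above; it does not model how a hungry creature's reward reaches zero.

```python
def strike_reward(hit_quality, boredom):
    """Reward delivered = how good the hit was, scaled by how bored the creature is."""
    return hit_quality * boredom


# The same clean hit, in two different moods.
bored = strike_reward(hit_quality=1.0, boredom=0.9)
content = strike_reward(hit_quality=1.0, boredom=0.05)
```
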

It sticks. The reward, landing at the moment of contact, finds the eligibility trace still holding "saw the toy" and "chose strike," and strengthens exactly those connections — Hebbian learning, crediting the right moment. That bumps the fast memory, carrying the fresh habit through the day. Then the creature sleeps, replays the successful strike preferentially, and copies it into slow memory. A lucky one-off becomes a stable habit.

It fades. Because play is a learned connection and not a permanent instinct, it can also be lost — three ways at once. The reward dries up as boredom is relieved, so striking stops paying. The connection decays when nothing reinforces it. And curiosity wears off once the ball is familiar, so the creature stops being drawn back. Measured over a life, the learnable strike connection rises from zero to a peak and then falls by about 94% once boredom is satisfied. In an isolated test the creature plays, gets its fill, and walks away. In a real colony boredom keeps returning, so the true shape is a cycle — play, satisfy, wander off, get bored, return — which is about the most lifelike thing the system does.

How I know it's really learned

The obvious worry is that I have dressed up a hidden instinct in the language of learning. The test is simple: remove every possible seed and check whether creatures still discover ball-play. They do — around 88–91% of first-generation creatures find it on their own, and every colony discovers it at the population level. If a hidden instinct were doing the work, removing it would stop the behaviour. It doesn't. The discovery survives, which means the discovery is real.

The limits are specific. This is emergence within a single creature's lifetime — a creature learns to play from scratch. And toy-play is specifically the behaviour that is genuinely seedless. Other behaviours work differently — food-sharing, for instance, runs on an inherited instinct rather than being learned from nothing. So "the creature learns to play" is a true and specific claim, not a blanket one about everything the creatures do.

But within those limits it holds. A creature born knowing nothing about balls works out, by being curious and bored and young enough to take a chance, that hitting one feels good — and then, once it is no longer bored, that it doesn't anymore. Nobody told it any of that. It is the same trick a baby with a rattle is running, built from the same handful of ideas that brains, a game from 1996, and modern machine learning all keep arriving at.