Language Models are Illiterate: Thoughts After Thinking

Here are some initial results after a couple thousand hours of thinking about how and why the language model AIs are acting so intelligently and what the missing pieces are to make them full world-ending machines. Let’s work through the setups and conclusions together here.

Your Brain is Not A Neural Network

Your brain isn’t a single fully connected neural soup. Your skull meat brain is technically three brains with hundreds of sub-structures: two hemispheres (left/right) connected by a synchronizing communications link between them (the corpus callosum), plus a cerebellum hiding down there in the back.

You probably remember how the left side of your body is controlled by your right brain hemisphere and the right side of your body is controlled by your left brain hemisphere. Simple enough concept, but people often don’t think through the implications of having hemispheres with different control levers.

Your left hand’s neural network is not connected to your right hand’s neural network. When you type or clap your hands or form naruto power signs, your hands must coordinate their behavior through your corpus callosum, which connects the two sides of your brain via lower-dimensional compressed signaling.

Due to cross-lateral brain-body connections, we end up with fun edge cases where certain brain conditions only allow information to surface from one side of the brain. Some people can’t recognize a brush on sight because their vision<->object-recognizer detectors are no longer connected, but they can still recognize a brush via texture, or by making it produce a “brush-like” sound, engaging audio-knowledge to get around the vision-knowledge disconnect.

Takeaway: Your brain is not “a brain,” but rather tens of thousands of sub-modules interconnected in an almost message-passing style via their outputs, not via each sub-network’s internal latent representations.
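
If you want the “connected via outputs, not internals” shape in code, here is a minimal numpy sketch (every module name, size, and signal is invented for illustration): two “hemisphere” networks that never read each other’s hidden activations and only exchange a small compressed message vector, a stand-in for the corpus callosum.

    import numpy as np

    rng = np.random.default_rng(0)

    class Hemisphere:
        """A tiny feedforward module whose internal state stays private."""
        def __init__(self, in_dim, hidden_dim, msg_dim):
            self.w_in = rng.normal(size=(in_dim, hidden_dim))
            self.w_msg = rng.normal(size=(hidden_dim, msg_dim))   # compress hidden -> message
            self.w_read = rng.normal(size=(msg_dim, hidden_dim))  # expand incoming message

        def step(self, sensory, incoming_msg):
            # hidden state never leaves this module
            hidden = np.tanh(sensory @ self.w_in + incoming_msg @ self.w_read)
            # only a low-dimensional summary crosses to the other side
            return hidden, np.tanh(hidden @ self.w_msg)

    left, right = Hemisphere(32, 128, 8), Hemisphere(32, 128, 8)
    msg_l = msg_r = np.zeros(8)               # "corpus callosum" traffic: 8 numbers, not 128
    for _ in range(10):
        sense_l, sense_r = rng.normal(size=32), rng.normal(size=32)
        _, msg_l = left.step(sense_l, msg_r)  # each side only reads the other's output message
        _, msg_r = right.step(sense_r, msg_l)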

Brain Architecture is Isolated

The sub-sub-neural-networks controlling your hand finger coordination don’t feed into the sub-sub-neural-network components controlling your foot fingers. Learning guitar chords with your hands doesn’t translate into sudden musical foot ability. Our current “all-vs-all” language model organization isn’t architected around having a thousand independent purpose-trained prediction modules the way brains are organized. Any isolation in transformer models must be enforced via backprop, which is both slower to learn and leakier than making outright architectural decisions up front, and having sub-modules act independently also lets them act concurrently, which is something else transformer models trained only as single units aren’t equipped to handle.

Brain neural networks are plural and work via learning constraints combined with operating constraints. Learning is achieved via time-synchronized coincidence detection across the resulting outputs of all the sub-neural-networks, but the sub-networks are not directly connected across physical modalities (this can be somewhat violated when an architecture leaks, as with synesthesia, but the effect is minor).

Current intelligence-capable AI systems don’t have a concept of architecturally isolated-but-communicating sub-networks, so any of those behaviors have to be burned in via backprop, which makes AI networks harder to predict or optimize since we can’t directly steer more resources toward the sub-networks we actually care about.

Brain Networks are Individually Predictive

Your brain is running thousands of concurrent predictive loops all the time. Fear is what happens when your real time prediction error either gets too high or asymptotes towards infinity. Comfort is a stability of prediction with your brain generating low prediction error against your experienced reality.

Current intelligence-capable AI systems aren’t running isolated, independent predictive loops across thousands of learned sub-networks. It’s all just a single backprop soup. The hacks around “mixtures of experts” also don’t cover multiple predictive loops the way our brains operate because, again, our brains run idle background predictions in real time across tens of thousands of independent neural circuits, all concurrently. Any neural circuit suddenly experiencing high prediction error can essentially raise an exception alert up the body’s awareness hierarchy, anywhere from an itch to conscious awareness of a prediction error requesting more focus.

Architecting fully consciousness-capable AI systems requires somewhere between 300 and 30,000 individual neural networks operating concurrently, constantly being updated against experience, and communicating their own prediction errors to a unified awareness hierarchy for live integration and action dispatching when required.
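
Here is a toy sketch of what “thousands of idle predictive loops raising exceptions upward” might look like (every circuit, threshold, and number is invented): each circuit predicts its own slice of reality, and only circuits whose running prediction error spikes get surfaced to a shared awareness queue.

    import random

    class PredictiveCircuit:
        """One background loop: predict, compare, track a running error."""
        def __init__(self, name, alert_threshold=0.5):
            self.name = name
            self.prediction = 0.0
            self.error = 0.0
            self.alert_threshold = alert_threshold

        def observe(self, actual):
            # exponential moving average of surprise, like a slow idling monitor
            self.error = 0.9 * self.error + 0.1 * abs(actual - self.prediction)
            self.prediction = 0.8 * self.prediction + 0.2 * actual  # crude model update
            return self.error > self.alert_threshold  # "raise an exception" upward?

    circuits = [PredictiveCircuit(f"circuit-{i}") for i in range(1000)]
    awareness_queue = []                    # the unified awareness hierarchy, flattened
    for tick in range(100):
        for c in circuits:
            actual = random.gauss(0.0, 0.1)
            if c.name == "circuit-7" and tick > 50:
                actual += 5.0               # something unexpected happens to one circuit
            if c.observe(actual):
                awareness_queue.append((tick, c.name, round(c.error, 2)))

    print(awareness_queue[:3])  # only the surprised circuit demands attention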

Language Models are Illiterate

Language models consume and generate signals passively. What we call “language models” don’t even consume or produce language as we conceive of “language.”

What is passive consumption? It’s like how people learn language through hearing, then generate speech by reproducing environment-triggered patterns when circumstances match up.

A language model doesn’t actually understand language and isn’t speaking language. A language model never had to “learn how to read” or “learn how to write,” in the same way most people never directly need to “learn how to speak” since it just seems to occur naturally: hearing and speaking are just a default outcome of average social brain wiring (since language already exists and we aren’t having to invent it from prehistoric cave speak for the first time).

A language model experiences language as if somebody just injected words psychically into your brain then also read your thoughts back out. There’s no literacy or intent or continuous deep inner experience in these models (which is potentially a plus, but it’s not going to get us all the way to computational autonomy).

What is it called when you can only generate one-mode output when given one-mode input, but you can’t read or write in any other formats/modalities? If you can only speak but not read or write, it’s called being illiterate. qed language models are illiterate.

Reading and writing are actually fairly advanced processes, and they enable people to manipulate conceptual symbols they otherwise wouldn’t have access to if their brain only operated in “hearing-speech” mode.

There have been studies of people born blind who learn only spoken language versus blind people who also learn braille. People who learn a second or third language modality perform better than those who never grew the neural pathways to mentally plan and manipulate language symbology. Symbolic manipulation of language extends your ability to think and construct possible futures worth working towards.

Symbolic language manipulation is even more difficult for language models, especially the comically poorly tokenized GPT family, because these models aren’t assembling the base units of language themselves; they are pre-fed isolated chunks they can’t easily reason within internally. Take everybody’s favorite example: language models are largely inept at numeric computation when tokenizers don’t isolate individual digits into individual tokens. Poorly tokenized input is the equivalent of asking a human “so how many centimeters of ink does it take to write this sentence?” and then declaring the person unintelligent when the answer is wrong, only because human brains don’t process language by predicting “ink per sentence.”
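
A toy illustration of the digit problem (the merge table below is invented and is not any real GPT vocabulary): the same number either arrives as clean per-digit symbols or as opaque multi-digit chunks whose internal structure the model never directly sees.

    # Toy "tokenizers": one isolates digits, one greedily merges common chunks.
    FAKE_MERGES = ["123", "456", "78", "90"]   # invented merge table, longest-match-first

    def per_digit_tokens(s):
        return list(s)

    def chunky_tokens(s):
        tokens, i = [], 0
        while i < len(s):
            for merge in sorted(FAKE_MERGES, key=len, reverse=True):
                if s.startswith(merge, i):
                    tokens.append(merge)
                    i += len(merge)
                    break
            else:
                tokens.append(s[i])
                i += 1
        return tokens

    print(per_digit_tokens("1234567890"))  # ['1','2','3','4','5','6','7','8','9','0']
    print(chunky_tokens("1234567890"))     # ['123', '456', '78', '90']
    # Column-wise addition is easy over the first encoding and awkward over the
    # second: the model never directly "sees" which digit sits in which place inside '456'.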

There’s no path to “true intelligence” in language models with forced pre-tokenization on text-only inputs, but untokenized byte-level-only context is still too computationally expensive for now. So we continue with compressed inputs leading to compressed thinking leading to compressed outputs, which doesn’t give the models enough conceptual isolation of the input data to form the internal logical relationships they would need to actually perform proper coincidence detection at high enough input resolution.

Language Models Do Not Experience A World

People chud themselves over language models being “causally auto-regressive,” but that misses a major point: language models don’t experience the world while they generate output.

You can’t interrupt a language model in the middle of a 500 word output and say “sorry, that’s not quite right…” because the models are like hyper-extroverted narcissists who only talk in 90-minute uninterruptible chunks without letting anybody else speak, and who never listen until they decide they are done speaking.

A more practical intelligence architecture, though a less economically viable one, is to have a model consume a continuously updating world state along with its own internal processing state, so during each output a language model is also able to monitor for, basically, when to shut up, whether it’s safe to keep going, or whether snape has taken over Nakatomi Plaza and more urgent matters need to be handled presently.

Imagine if instead of a language model’s input context just being [INPUT-REPLY CONTEXT] we augmented it with [WORLD STATE][INTERNAL STATE][ACTIVE REPLY CONTEXT] so a language model could have updated world views while it is still generating output. Going back to the economic viability argument though: now this model requires much heavier context, and the training would need to incorporate world-state updates, so without a pre-existing dataset of [WORLD][INTERNAL][REQUEST][REPLY] formatted data, you’d have to generate a new format of data or (gasp) train it interactively and see how long it takes.
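
Here is a rough sketch of that loop, with every function being a placeholder rather than a real API: re-assemble the context between small output chunks so the model gets a chance to notice the world changed mid-reply.

    # All names here are hypothetical; `model` stands in for a "next few tokens" call.
    def build_context(world_state, internal_state, reply_so_far, request):
        return (f"[WORLD STATE]{world_state}"
                f"[INTERNAL STATE]{internal_state}"
                f"[REQUEST]{request}"
                f"[ACTIVE REPLY CONTEXT]{reply_so_far}")

    def generate_reply(model, request, read_world_state, max_chunks=50):
        reply, internal = "", "calm"
        for _ in range(max_chunks):
            ctx = build_context(read_world_state(), internal, reply, request)
            chunk = model(ctx)                      # hypothetical short generation step
            if chunk == "<SHUT UP>":                # the model decides to stop talking
                break
            reply += chunk
        return reply

    # Usage with stand-ins: the world changes after a few chunks and the reply stops.
    calls = {"n": 0}
    def world():
        calls["n"] += 1
        return "all quiet" if calls["n"] < 4 else "fire alarm ringing"
    fake_model = lambda ctx: "<SHUT UP>" if "fire alarm" in ctx else "words "
    print(generate_reply(fake_model, "tell me about voxels", world))  # "words words words "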

Also speaking of economically unviable training architectures: imagine training a network as a unique entity with real-time backprop feedback from live events instead of “train -> replicate a million times -> passively deploy” everywhere. Your own brain operates via live updating from its observed sensor-prediction error rates as a single entity, so why would we think a computational consciousness can be written once then deployed in 100 million devices without ever live updating again? Real-life interaction requires real live LTP/LTD across the entire neural architecture, even if many of the updates are short-term buffered in recall loops and only get “flushed to storage” during a nightly correlative update sequence (i.e. sleep).

The concept of “leaving a model in training mode” to run backprop after every live interaction isn’t even on the map at most organizations as far as I can tell, but it is a likely way forward for high-performing interactive models with continuous personal experience beyond feeds of vanishing context.
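
As a sketch of what “leaving a model in training mode” could mean in practice (assuming PyTorch, with the model, data, and schedule all invented for illustration): one gradient step per live interaction, plus a buffered replay pass standing in for the nightly “flush to storage.”

    import torch, torch.nn as nn

    # Tiny stand-in model; the point is the update schedule, not the architecture.
    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    replay_buffer = []

    def live_interaction(observation, outcome):
        """One real-world exchange: predict, compare, immediately update."""
        prediction = model(observation)
        loss = loss_fn(prediction, outcome)
        opt.zero_grad(); loss.backward(); opt.step()      # update right now, not offline
        replay_buffer.append((observation.detach(), outcome.detach()))

    def sleep():
        """Nightly pass: replay the day's buffered experience a few more times."""
        for _ in range(3):
            for obs, out in replay_buffer:
                opt.zero_grad(); loss_fn(model(obs), out).backward(); opt.step()
        replay_buffer.clear()

    for _ in range(100):                                  # a "day" of interactions
        x = torch.randn(16); live_interaction(x, x.roll(1))
    sleep()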

Why do Language Models work with Language?

Why does the language model concept work at all?

The current generation of intelligence models started with GPT-1, then came the GPT-2 “OMG OUR TECH IS A WEAPON WE DARE NOT RELEASE IT” openai media scam, then GPT-3 got more scary, and the rest is history.

But why does it work? Kinda simple tbh.

words are how brains manipulate concepts.

words are how we move the world around in mental representations for anticipation, planning, seeking, attraction, and avoidance. we achieve all our goals via mental manipulation of concepts bound to words. Everything from just being food-seeking hungy to converting our enemies back to dust via global thermonuclear war is accomplished via symbolic word-concept manipulation and interpersonal information transference.

Word manipulation isn’t limited to humans though. If you’ve ever had a pet and trained it using language, you’ve injected your intent into an animal’s brain. Even if animals can’t reconstruct the grammar of language itself, they can still associate words to actions to goals in a conceptual manner.

Why do language models work with language? Simply because language is how we have encoded our concepts of the world as a whole. Human language is how we program people to gain knowledge about the world and perform actions, so transferring the mechanism for programming humans via language into a non-biological “what-comes-next” capable learning system generates human-level conceptual output. The goal of language is allowing the further manipulation of language, where language is just a very noisy system of subjects, verbs, and objects. All our new “what-comes-next” systems are capable of discovering the structure of subjects, verbs, and objects from their contextual co-occurrences during training, then the models learn how to combine concepts at higher introspective levels to generate output following the same rules they were trained on originally.

Language Models are Universal Simulators

every chain-rule-derived neural network of sufficient width and depth is technically a universal function approximator, but “transformer” “language” “models” have one extra trick: they understand the order of their input.

traditional neural networks don’t understand input order and everything just gets multiplied together for a final result in the end.

a language model has a way of informing itself a sequence is happening. Some call these “sequence-to-sequence” models since you convert an input sequence to a different output sequence, but it is also equivalent to a generic time-series model. The models are basic “what comes next” generators — and what-comes-next prediction is the basis for being approximators of life the universe and everything.
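
The whole “what comes next” interface fits in a few lines; here is a stripped-down sketch (the rule function is a stand-in, in a transformer it would be the forward pass plus sampling): any model that maps a sequence-so-far to a next element can be run in this loop.

    # The entire "what comes next" skeleton: predict, append, repeat.
    def run_forward(next_step, state, n_steps):
        history = list(state)
        for _ in range(n_steps):
            history.append(next_step(history))
        return history

    # Trivial stand-in "model": the next number is the sum of the last two.
    fib_rule = lambda hist: hist[-1] + hist[-2]
    print(run_forward(fib_rule, [0, 1], 8))   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]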

Do you ever notice when your brain records “what-comes-next” predictions based on repeated exposure rather than on underlying relationships themselves? A good example is when you listen to a playlist often and you “know what song is next” as the current song ends (or you even begin to “hear” the next song internally before it starts playing). Your brain associated your lived experience of “what comes next” even though, logically, two unrelated songs have no causal relationship except the repeated order you keep experiencing them in, so your brain learns the relationship as a reality-derived “what comes next” situation (which also helps minimize prediction error over time: if A->B always happens, then when A is ending, predict B, since the overall goal is always reducing unexpected outcomes).
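
The playlist habit is literally just a learned transition table; a toy counter makes the point (song titles invented):

    from collections import Counter, defaultdict

    # Learn "what song comes next" purely from repeated listening order.
    transitions = defaultdict(Counter)
    playlist = ["intro", "banger", "ballad", "outro"]
    for _ in range(50):                                  # fifty listens, same order
        for current, following in zip(playlist, playlist[1:]):
            transitions[current][following] += 1

    predict_next = lambda song: transitions[song].most_common(1)[0][0]
    print(predict_next("banger"))   # "ballad": no causal link, just lived co-occurrence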

Here’s a note from somebody talking about a non-consensual repeated music experience:

As someone who has worked in a supermarket in late 1990’s - early 2000’s I confirm that the random order of playing music as well as a larger library of songs to play from are extremely important for the sanity of people working in such jobs. Just imagine that every single day for 10-12 hours straight you’ll be hearing the same 3 hours of music on repeat in the same god damn order. I can’t stress enough that after a couple of weeks this becomes the soundtrack of your life that haunts you with it’s repetiveness and slowly but surely kills your soul in the process.

back to the point: “what comes next” prediction is also the only operation you need to be a universal simulator of… universes.

so, what if?

Let’s break down the idea of simulating a universe using what-comes-next predictive networks. Imagine the “network” knows all the particle and field level “what comes next” contextual time-series state transition rules of the universe in a physical sense. We could call these physical rules “physics.” Now as input context, you provide the current state of your universe and the magic of the sequence-to-sequence time-series what-comes-next prediction model generates the next output state of the universe (or a section thereof) on an instant-by-instant update cycle.

Such a practice would mean time is quantized and space is quantized. Updates occur in discrete steps instead of continuous updating (much the same way we use sequences of static images to simulate motion). Imagine, though, if every instant of reality actually happened via: event -> PAUSE TO CALCULATE NEXT STATE -> resume next event -> PAUSE TO CALCULATE NEXT STATE -> resume next event. As an entity inside the simulation, you would be unaware of the external computation pauses as long as they are all consistent within your light cone, since there’s no universal global clock you can access outside of your own experience.
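
Here is a toy quantized universe to make the picture concrete: space is a row of cells, time advances in discrete ticks, and “physics” is a what-comes-next rule over each cell’s local neighborhood (a Rule 110 cellular automaton standing in for real physics; nothing about this claims to be actual cosmology).

    # Toy quantized universe: discrete space, discrete time, local update rule.
    def physics_step(state):
        rule = {(1,1,1):0, (1,1,0):1, (1,0,1):1, (1,0,0):0,
                (0,1,1):1, (0,1,0):1, (0,0,1):1, (0,0,0):0}   # Rule 110 table
        n = len(state)
        return [rule[(state[(i-1) % n], state[i], state[(i+1) % n])] for i in range(n)]

    universe = [0] * 40; universe[20] = 1               # initial conditions
    for tick in range(10):                              # each pass is one quantized tick
        universe = physics_step(universe)               # nothing inside the row can detect
        print("".join("#" if c else "." for c in universe))  # ...the pause between ticks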

Would running a universe on some what-comes-next prediction network also explain gravity and time dilation? We know gravity/mavity doesn’t really exist as a “thing/force” but rather it’s just a dramatic pinch of space, and flat space has flat time, so curved space has curved time. More gravity slows time from external reference frames, but flat space always operates at the maximum update frequency (the speed of causality, our friend: c).

Why would gravity’s distortion of space cause time dilation? Well, from a computational standpoint, more gravity means there’s more mass per unit volume, and more “stuff per volume” would require more computation to calculate everything’s proper “what comes next” update cycle.

Higher gravity causes time dilation because it requires more computation to update all causally connected voxels within the mass-defined mavity well light cone, and this computational slowdown of next-state transitions is only observable from outside the gravity anomaly itself (you can’t tell when your own computation is slowed down within yourself because your internal updates are all self-consistent, but you may notice computation external to your gravity-compute-update environment is now moving faster, since it requires less compute to calculate its next state, so its updates iterate faster than your own, ergo it is experiencing time faster relative to a gravity well).
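
A toy version of “more mass per voxel means more compute per update means fewer updates per external tick” (the cost function and all units are invented):

    # Heavier regions pay more compute per next-state calculation, so they complete
    # fewer internal updates per external tick: their "clock" runs slow from outside.
    def updates_completed(mass, external_ticks, budget_per_tick=100.0):
        cost_per_update = 1.0 + 10.0 * mass        # invented cost: heavier voxel, pricier update
        return int(external_ticks * budget_per_tick / cost_per_update)

    for label, mass in [("deep space", 0.0), ("planet", 1.0), ("near a black hole", 50.0)]:
        print(f"{label:18s} {updates_completed(mass, external_ticks=1000)} internal updates")
    # deep space gets ~100000 updates, the deep mavity well only ~200 in the same span.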

What about singularity events? Event horizons and inescapable point-mass-space-depths? There’s always been the joke “black holes are where the universe divided by zero” but what if it’s not far from the truth? Infinite mass would equal infinite computation would be an error condition.

Even event horizons aren’t forever though, due to the virtual-particle imbalance behind Hawking radiation. What if our concept of virtual particles is just a re-normalization of compute required per unit area? The floor of zero-point energy floating everywhere could be borrowed computation, where if you compute too much now, you have to give up a future time slice of causal compute to compensate. Evaporating event horizons could be understood as compute returning to “normal” for a given impacted area. Maybe the computational substrate has an exception handler running an update cycle of “is anything really happening here? if not, release compute resources for the area on our infinity-mass-decay schedule,” so eventually even infinite-mass singularities basically self-delete due to being causal dead-ends, and the underlying simulation mechanism, on a long enough time horizon, equalizes causal compute away from dead space back to interactive components.

If we take a different view of gravity/mavity, we have the universe conspiring to slow down externally perceived time for more dense collections of mass and accelerate time for less dense collections of mass, up to the universal coasting speed of c. The gradient of mass-derived time dilation is the smooth curve of computation adjusting from the baseline causal update speed of c down to the causal update speed of the pinched space inside the mavity well.

Basically: less matter == faster process updating; more matter == slower updating.

Each space voxel could be defined by a vector whose dimensions are the fundamental forces: every universal, all-encompassing field of a fundamental force is one dimension of the voxel’s vector representation.

photons are excitations in the electromagnetic-field vector quantity and have rules like “only travel in straight lines, cannot go around corners, can be absorbed or reflected on contact with matter, etc.” while neutrinos don’t interact with matter except by extreme rounding errors (simulation not running ECC memory? particle accelerators are rowhammer-like attacks on the simulation computation?).

We see fundamental particles respect operations within forces, but the forces themselves are just the learned configuration of physics what-comes-next transformation rules of this universal simulation.

But why is gravity/mavity “attractive?” Why does mass-beget-mass? One excuse could be gravity “drags things towards itself” because we need a smooth time-computational gradient. We wouldn’t want one spherical kilometer of mass to have its temporal processing running 4x slower than something one centimeter outside of it. So, accumulation of mass naturally extends space to create a smooth gradient of temporal processing enforced by gravity/mavity (thought experiment: what configurations of matter could cause time to bend vertically? could we shape a square wave in time? what is the maximum angle we can distort time without becoming a singularity?).

Obvious conclusions:

  • nothing stops us from creating more universe simulations ourselves, except we can’t update our sub-simulations faster than our universe being simulated at c, but the sub-simulations could have any time causality configurations because, again, when the simulated universe is paused between computations to determine the next state, no time passes inside the simulation.
  • what is the purpose of the simulated universe we inhabit? could be anything from a physics research project to actually trying to breed friendly universe-capable intelligences (which would be harvested / extracted / acknowledged when they can “shake the simulation” to notify the external world a hypercapable intelligence has emerged)
  • if we are just in a physics simulation, there’s a good chance the simulators don’t even know there’s “life” inside their physics simulation. after all, we are a universe and not just a single planet being simulated. the simulation is running at a quantum particle update level. attempting to find life inside your particle simulation universe is like us trying to find cute dog pictures by looking at the binary 1s and 0s of a 20 TB hard drive printed out on 32 billion sheets of paper (64 million packs of 500 sheets of paper).
  • the computational requirements would basically be, in our own perception, converting the entire energy output of one or more stars into pure computational substrate. the simulation is performing one update per Planck volume across the entire universe every Planck time, assuming no local optimizations. at a minimum, with no processing/density optimizations, our universe needs between 10^105 and 10^210 updates per second, depending on whether we only address local causality or all-to-all causality (and those numbers are just the bulk “updates per voxel” required, without including any of the underlying computation needed to perform each update).

Language Models are Conclusions

What does it all mean?

More training data won’t solve the problem of computational consciousness without further refinement of model architecture: more multi-model concurrency and communication, plus multi-receiver predictive architecture systems. There’s also an argument that the final “temperature” sampling step isn’t properly incorporated into model awareness, because we’re still essentially ripping thoughts directly out of a brain without the brain itself deciding how to present information to the outside world all by itself.
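
To make the temperature point concrete, here is a minimal sketch of the sampling step in numpy: the network emits scores, and this outside knob decides how they become words, with the model itself never seeing the decision.

    import numpy as np

    rng = np.random.default_rng(0)

    # Temperature sampling is applied *after* the model's forward pass.
    def sample(logits, temperature=1.0):
        scaled = np.asarray(logits) / temperature        # the model never sees this division
        probs = np.exp(scaled - scaled.max())            # numerically stable softmax
        probs /= probs.sum()
        return rng.choice(len(probs), p=probs)

    logits = [2.0, 1.0, 0.2]                             # made-up scores for three tokens
    print(sample(logits, temperature=0.1))               # nearly always token 0
    print(sample(logits, temperature=2.0))               # much more of a coin flip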

What is a good way to assemble and refine causal time-series computational architectures though? We know nature has evolved multi-generational inheritance using some RNA world -> DNA -> central dogma path modulo epigenetics. Evolution operates via endless cycles of “duplication then specialization,” so there’s plenty of evidence an optimization recurrence cycle of “duplicate -> specialize redundant components” is a stable path towards architecture exploration.

DNA packages information for transferring biological architectures between generations via recombination and mutation, and we also know many neural traits are “nature” (DNA architecture transference) because brain-operation traits can be inherited across generations, including but not limited to: general disposition, attitude, demeanor, ranges of intelligence, ranges of artistic ability, agreeableness / hostility / humor / horny, and ranges of neurotypical to neuroatypical behaviors. Base brain architectures are not just equally blank information receptacles; brains are architected via historical genetics into various weighted capabilities and preferences.

So, what is the missing system we need, capable of defining neural architectures then breeding multiple architectures together to generate more viable architectures fit for purpose over time? We would need the equivalent of a common data format describing a model architecture, then a system capable of combining multiple architectures together, training them, evaluating them, and so on: a basic genetic optimization cycle. The initial generic-parallel-neural-compute descriptor language (and the training/eval compute) is the main blocker here.
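
A minimal sketch of that cycle, with the genome format, fitness function, and all numbers invented: genomes describe architectures, crossover and mutation breed them, and a stand-in evaluation scores them.

    import random

    # Sketch of the missing "architecture DNA" loop: describe, breed, score, select.
    random.seed(0)

    def random_genome():
        return {"modules": random.randint(2, 32),        # how many sub-networks
                "hidden": random.choice([64, 128, 256]),
                "msg_dim": random.choice([4, 8, 16])}

    def crossover(a, b):
        return {k: random.choice([a[k], b[k]]) for k in a}

    def mutate(g):
        g = dict(g)
        g["modules"] = max(2, g["modules"] + random.randint(-2, 2))
        return g

    def evaluate(g):
        # Stand-in fitness; in reality: build the network, train it, score it.
        return g["modules"] * g["msg_dim"] - 0.01 * g["hidden"]

    population = [random_genome() for _ in range(20)]
    for generation in range(10):
        population.sort(key=evaluate, reverse=True)
        parents = population[:5]                          # keep the fittest descriptors
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(15)]
        population = parents + children
    print(max(population, key=evaluate))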

A successful neural architecture can take many forms, but there is likely no one “universal best” neural architecture, so generating neural network DNA -> breeding networks -> training children is likely the best way to improve over time for complex neural architectures having tens of thousands of concurrent predictive sub-networks all within the same overall system.

guess that’s all for now. remember to buy an ai indulgence for yourself and everyone you know before rates go up again. also I have some domains for sale as well if you have more money than sense: make.ai and god.ai are currently ready and waiting to live their best lives with you as soon as possible