How do ChatGPT and other Large Language Models work?

A plain-English tour of what's going on inside large language models (LLMs). A must-read for anybody getting serious about optimizing for generative search.


"Why does AI give different responses to the same prompt?" πŸ€”

LLMs don't reason with logic. They guess the next word in a sequence.

At every step the model asks itself: 💬 "Given all the words I've seen so far, which word is most likely to come next?"

The dog chased the ... → "cat" is likely, "car" is also plausible, "banana" is NOT.

This "most-likely-next-word" trick, repeated thousands of times per second, builds whole paragraphs.

And because several words are often almost equally reasonable, the model has legitimate elbow room to choose among them.
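
If you like code, here's a tiny Python sketch of that step. The scores are completely made up, and a real model scores tens of thousands of tokens at once, but the softmax maths is the same idea:

```python
import math

# Invented raw scores ("logits") the model might assign to candidate
# next words after "The dog chased the ..."
scores = {"cat": 4.0, "car": 3.8, "banana": -2.0}

# Softmax turns raw scores into probabilities that sum to 1.
total = sum(math.exp(s) for s in scores.values())
probs = {w: math.exp(s) / total for w, s in scores.items()}

for word, p in probs.items():
    print(f"{word}: {p:.1%}")  # cat and car land close, banana near 0%
```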

So....

Where does the variety sneak in? Sampling!

After the model calculates the probabilities, it rolls weighted dice to pick the next word.

πŸ‘‰ If "cat" = 35% and "car" = 30%, the dice usually land on cat - but not always.

One slightly different word early on steers every later choice down a new path, so two replies diverge quickly.
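
In code, the "weighted dice" is just a weighted random choice. Below, "cat" and "car" use the odds from the example; "ball" and "mailman" are invented long-shots that round the distribution out to 100%:

```python
import random

words   = ["cat", "car", "ball", "mailman"]
weights = [0.35, 0.30, 0.20, 0.15]

# random.choices rolls the weighted dice: usually "cat", but not always.
for _ in range(5):
    print(random.choices(words, weights=weights, k=1)[0])
```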

How can we control this? The hidden settings:

1️⃣ Temperature - dial it up, the model picks wilder words; dial it down, it hugs the obvious.

πŸ”₯ 0 = boring spreadsheet, 1 = picasso on 10 redbulls 😡

2️⃣ Top P - tells the model "only look at the top choices."☝️

0.01 = stick to the sure bets, 0.9 = let the long-shots in

Set both near zero and the model becomes almost deterministic. Raise them and it behaves more like a brainstorming partner.
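
Here's a toy sampler with both dials wired in. The scores are invented, and real samplers work on token IDs rather than words, but the mechanics are the same:

```python
import math
import random

def sample(logits, temperature=1.0, top_p=1.0):
    # Temperature rescales the scores before softmax: low T sharpens
    # the distribution toward the top word, high T flattens it.
    t = max(temperature, 1e-3)  # avoid dividing by zero at T = 0
    scaled = {w: s / t for w, s in logits.items()}

    # Softmax, with the max subtracted for numerical stability.
    m = max(scaled.values())
    total = sum(math.exp(s - m) for s in scaled.values())
    probs = {w: math.exp(s - m) / total for w, s in scaled.items()}

    # Top-p (nucleus) sampling: keep only the most likely words whose
    # cumulative probability reaches top_p, then roll the dice on those.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, p in ranked:
        kept.append((word, p))
        cumulative += p
        if cumulative >= top_p:
            break
    words, weights = zip(*kept)
    return random.choices(words, weights=weights, k=1)[0]

# Invented scores for the running example.
logits = {"cat": 4.0, "car": 3.8, "ball": 2.5, "banana": -2.0}
print(sample(logits, temperature=0.1, top_p=0.1))   # almost always "cat"
print(sample(logits, temperature=1.5, top_p=0.95))  # long-shots allowed
```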

But that's not all...

Context windows, system instructions, random seeds and parallel GPU math also affect the randomness you see in each new response.
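
The seed part is easy to demo: pin it, and the "dice" roll the exact same way every time:

```python
import random

random.seed(42)  # pin the seed and the dice rolls repeat exactly
print([round(random.random(), 3) for _ in range(3)])

random.seed(42)  # same seed again...
print([round(random.random(), 3) for _ in range(3)])  # ...identical rolls
```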

LLMs are designed to allow for variation in their responses.

So (x2)....

How does this affect generative AI search engines?

Answer engines create responses in a pipeline with two halves.

Retrieve 🀝 Generate

1) Retrieve: Pulls a shortlist of pages from web search (Bing for ChatGPT!), papers or videos that might answer the question.

2) Generate: Compresses that pile of pages into a few sentences, choosing each word with the "most-likely-next-word" trick.

Because Step 2 is probabilistic, running the same query twice can surface different phrasing, quotes, or even different citations.
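
A minimal sketch of the two halves, with stubs (fake_search and fake_generate are stand-ins, not a real API) so you can see where the dice-roll enters:

```python
import random

def fake_search(query, limit=3):
    # Stand-in retriever: a real engine would hit a web index here.
    corpus = [
        "Dogs instinctively chase small fast-moving animals.",
        "Chasing behaviour in dogs is linked to prey drive.",
        "Cats often trigger a dog's chase response.",
    ]
    return corpus[:limit]

def fake_generate(prompt):
    # Stand-in generator: a real LLM would sample the answer word by
    # word, which is where the dice-roll variation comes from.
    openers = ["In short:", "Put simply:", "Briefly:"]
    return random.choice(openers) + " dogs chase cats out of prey drive."

def answer(query):
    snippets = fake_search(query)                   # 1) Retrieve
    prompt = "\n".join(snippets) + "\nQ: " + query  # 2) Generate
    return fake_generate(prompt)

print(answer("Why do dogs chase cats?"))
print(answer("Why do dogs chase cats?"))  # phrasing may differ - sampling!
```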

That explains exactly the behaviour you see in ChatGPT.

A generative search answer is part web search, part dice-roll. 🎲🎰

James Berry
Founder & CEO at LLMrefs
llmrefs.com
