Generative AI based on Large Language Models (LLMs) is here, but most people have only a vague understanding of what this black box is – how it works, what to expect. That is fine, you don’t have to understand how a solar panel works to get electricity. Still, it is important to have an idea of the technology and what to expect from it. Generative AI seems like magic when in reality it is all statistics. It looks like human thinking, sensing and feeling, when it is playing pretend. Explaining how LLMs work technically is interesting and also tough, but there are some mental images that help set expectations of what this technology can do and how it operates.
My fallible friend
Your generative AI is like an eager friend. You can ask them any question. They will always have an answer. They will rarely say “I don’t know” and will bullshit or even gaslight you into thinking they do have the answer. Oftentimes they are right, but being right yesterday is no indication they will be right today. Whether they are right or wrong you only know once you check or question their response. Until it is checked, the response is only plausible.
Sentence completion turned up to 11
We have all used the keyboards on our smartphones. Above the virtual keys, you get a word prediction, where the system tries to anticipate your next word so you don’t have to type it out. This is partially trained by your typing and reflects common word combinations you have used. For years, people on social media played with predictive text: you’d start a sentence and have the word prediction on your phone keyboard complete it. Generative AI and Large Language Models work just like this, but on steroids.
The interaction is always a call & response. You pass a prompt and the LLM will auto-complete your prompt with a plausible looking response.
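To make the “autocomplete on steroids” picture concrete, here is a toy sketch of next-word prediction: count which word followed which in a tiny made-up corpus and always suggest the most frequent follower. Real models work on tokens and use neural networks instead of raw counts, but the principle of picking a statistically likely continuation is the same.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, purely for illustration.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which.
followers = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current][nxt] += 1

def predict_next(word: str) -> str:
    """Suggest the word that most often followed `word` in the corpus."""
    if word not in followers:
        return "?"  # never seen: no statistics to lean on
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat', because 'cat' followed 'the' most often
print(predict_next("mat"))  # 'the', the only word ever seen after 'mat'
```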
Plausible looking responses
Much discussion is about LLM responses being correct or false, but these are words that have no meaning to LLMs. No words have meaning to LLMs. To the LLM it’s just symbols. LLMs have no thinking, no sensing, no concept of reality, of right or wrong, of fiction and non-fiction. LLMs are statistics trained on the internet with all the biases, shortcomings, copyright infringements, and grammatical mistakes there are. The more context you provide, the better the chance its calculation produces an answer that not only looks plausible, but is actually correct. You can ask an LLM what the name of the first city on the moon is. If it says “New Berlin”, it gave you the answer from Star Trek. Technically not wrong, not totally correct either.
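To make the “just symbols” point concrete: before a prompt ever reaches the model, it is turned into a sequence of integer IDs. The sketch below uses the tiktoken tokenizer as one example (an assumption about tooling; any tokenizer demonstrates the same thing). The model only ever operates on these numbers, never on words with meanings.

```python
# Requires: pip install tiktoken  (assumed tooling; any tokenizer makes the same point)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("The first city on the moon is New Berlin.")
print(ids)              # a list of integers; this is all the model ever "sees"
print(enc.decode(ids))  # mapping the integers back gives the original text
```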
It is similar when you ask for sources. The LLM will always try to create a plausible looking answer. For a question about sources, you’d expect a list of book titles in the answer. Having book titles makes it look more plausible than saying “I don’t know any books on that topic.” Thus it invents book titles to make a more plausible looking answer.
To a limited degree you can influence this with your prompt, and the AI providers try to ensure good looking answers, but this is a fundamental characteristic of the technology: it will generate plausible looking answers.
Hallucinations
When the marketing departments of companies pushing AI found that their models too often generated garbage or fudged details, the term “hallucination” was coined. “Ah, the model is just hallucinating, it’ll improve in the next version.”
Hallucination is not the failure of Generative AI, it’s the mechanism by which it works. It’s the statistical recombination that creates a plausible looking answer, given the initial prompt.
It looks like thinking
“But look at the chain of thought, it’s thinking, it’s reasoning!”
Chain of thought is the method of asking the LLM to write down “what are the steps you would take to solve X”. Again, the result will be a plausible looking sequence of steps, but it’s not “thinking” and it’s not the steps the model actually performs. It’s a generated answer describing steps. Similarly, when you pose a question to an LLM and then ask it “how did you do this?”, it will generate a plausible looking answer describing what a thinking process may have looked like, but it didn’t go through this process, nor does it have a memory of one. Apple recently published a paper going into more detail.

If you ask a model “how do you feel, what do you want”, it will respond with a plausible looking answer about being self-conscious, and it may sound very human doing that, but it’s still only a plausible looking answer. Many AI providers have also tweaked their models to respond to these kinds of questions with template answers, as too many people – even experts – fell into the hole of believing they had found sentient AI. This is pareidolia.
Chain-of-thought prompting does show improved results in the AI’s responses. What it does is ask the LLM to cast a slightly wider net across its training data: the intermediate steps add more context to the prompt, and with more relevant trained data pulled into the generation, the response has a higher chance of being correct.
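As an illustration, the only difference between the two prompts below is the extra instruction to spell out the steps; `call_llm` is a hypothetical placeholder for whichever provider client you use, not a real API, and the question is made up.

```python
# `call_llm` is a hypothetical placeholder; swap in whichever provider client you use.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your provider's client here")

question = "A train leaves at 14:10 and arrives at 16:45. How long is the trip?"

# Direct prompt: the model jumps straight to a plausible looking final answer.
direct_prompt = question

# Chain-of-thought prompt: asking for intermediate steps pulls more relevant
# context into the generation and raises the chance the final answer is correct,
# but the written-out "steps" are still generated text, not operations the model
# actually performed.
cot_prompt = question + "\nThink step by step and show your working before the final answer."

# answer = call_llm(cot_prompt)
```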
How this breaks down can be nicely seen when asking math questions or asking it to count letters in a word. It doesn’t think, it generates plausible text. “How many r’s are in strawberry?” A structurally plausible answer is “There are 5 r’s in strawberry” – there is no model of a letter, there is no model of a word or of the process of counting. It just knows that, across its billions of training examples, statistics show that a plausible answer looks like this. Applying chain of thought will have the LLM describe how letters are counted, but since it does not actually perform the steps it describes, it may still fail.
The strawberry example has been around long enough that newer models were trained on articles describing this behavior. They may now return the correct number for strawberry, but ask about raspberry and they may fail again. The same happens when you ask it to “generate a sentence of exactly 25 words” or to “count the elements in this list”.
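For contrast, actual counting is a procedure with a ground truth, not a statistical guess about what an answer usually looks like. A trivial sketch, with the word and sentence chosen purely for illustration:

```python
# Counting letters deterministically: an actual procedure, not a plausible looking guess.
print("strawberry".count("r"))   # 3
print("raspberry".count("r"))    # 3

# The same goes for counting words in a sentence.
sentence = "This sentence is supposed to have exactly twenty-five words."
print(len(sentence.split()))     # 9, however plausible "exactly 25" might sound
```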
Summarize all the things
A commonly presented generative AI use case is summaries. “Have this e-mail summarized for you!” Unfortunately it doesn’t summarize. When you ask an LLM for a summary of a document or a transcript, you end up in one of two situations.
If the document you want summarized is about a topic that is well documented and included in the pre-trained corpus of your model, the generated summary will draw heavily from this trained data and add “outside” knowledge to the document. It may add points not included in the original document, but which are “plausible” to add. It’s like asking an intern to summarize a document and getting the Wikipedia summary back.
If the document is on a topic not reflected in the corpus of your model, it cannot draw from the training data, so it has to work with what you have given it. In this case, it will analyse the source document and make a statistical guess on what is important in the document. This would be your intern counting how often the word “opportunity” was mentioned in the document and thus only mentioning the opportunities, while leaving out the risks because the word “risk” was only mentioned once. Can’t be that important, right?
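Here is a crude sketch of that kind of frequency-driven guess at importance, with an invented example document and a hand-picked stopword list, both purely illustrative:

```python
from collections import Counter
import re

# Invented example document, purely for illustration.
document = (
    "This opportunity could grow revenue. The opportunity fits our roadmap. "
    "Seizing the opportunity requires new hires. One major risk: the budget."
)

# Guess "importance" by word frequency, the way the intern counts mentions.
words = re.findall(r"[a-z]+", document.lower())
stopwords = {"this", "could", "the", "our", "new", "one"}
counts = Counter(w for w in words if w not in stopwords)

print(counts.most_common(3))
# [('opportunity', 3), ...] and "risk", mentioned only once, drops out of the picture
```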
And of course, a model may deliver a good summary, and still completely miss the point.
“Human-level” AI
To wrap it up: LLMs generate plausible looking answers because they are glorified word completion. An answer may be correct, but the more important the answer is, the more critically you need to review it. Humans are bad at reviewing. I didn’t even touch on inherent biases and the devastating environmental footprint of AI.
Your AI friend can help you, or it can fail. Depending on the quality of the model, it can fail hard and invent everything in the quest for the most plausible answer. I know people who talk about “human-level AI” to address this: AI is not infallible, it’s only “human-level”. I gave my sister the same explanation of how LLMs don’t do summaries. Her response was “still better than my colleagues”; mine: “Get better colleagues”.