(this one's about summarising text)
LLM chatbots – that is, AI, in the same sense that we could just start saying "doctors" to refer specifically to orthopaedic wrist surgeons if we collectively decided to – continue to slosh about the world. I used to try them out intermittently to see if they were any good.
My contact with the technology is only incidental these days. To wit:
- If you google old phrases and terminology in English, an LLM chatbot will still confidently weigh in with completely spurious "definitions" because they're not well-represented in the training data.
- If you google modern bits of even slightly less-discussed technical knowledge, like "does a Kickstarter project video appear on the prelaunch page", an LLM will still confidently tell you the opposite of the truth.
- If you need customer support or anything that even looks like customer support, there is an extra quarter-hour minimum of wasted bot effort before you can get it.
Nothing I've seen has suggested the technology has fundamentally changed.
*(Image caption: He can't be wrong, he writes so confidently.)*
In the previous edition of discussing the emperor having no clothes, I mentioned
> [Wikipedia] editors pointed out that the LLM summaries generally ranged from 'bad' to 'worthless' by Wiki standards: they didn't meet the tone requirements, left out key details or included incidental ones, injected "information" that wasn't in the article, and so on
and
> bureaucratic wonks note that genAI can't summarise text. It shortens it and fills in the gaps with median seems-plausible-to-me pablum. The kind you get when you average out everything anyone has ever written on the internet.
I recently saw an AI booster shuffle their position back to "at least it's good for summarising, it's going to completely replace human effort there". With that motivation, let's drill down a bit.
In (a) summary
Let's not bury the lede. LLM chatbots can't produce good summaries. Sometimes by chance yes, but not reliably. Summarising, like everything, is a skill-based task, and of the various capabilities required to do it well, LLMs lack four of the most important.
1. LLMs won't reliably retain important structure or the order in which information is presented. They will just haphazardly obliterate implicit linkages. They will even occasionally discard explicit structure, as when the text itself points out that C follows from A and B, and therefore D.
2. LLMs can't identify the most important information in a text (a necessary first step to preserving it in the summary). In a good summary, certain content "should" be retained, certain content compressed, and the remaining content discarded. Vital information generally isn't identified within the text in a way that's detectable without broader context, language skills, and understanding of the world. Even when it is, e.g., in texts where repetition of a word corresponds directly to importance, or phrases like "this is vital information" are always appended, LLMs still aren't guaranteed to retain important details! And the same applies to cutting out unimportant information.
3. LLMs can't stick to the source text, that is, the content they're meant to be summarising. Because they just generate text (by predicting which bits of text should come next, based on an enormous model of which bits tend to come after which bits, hence 'language model' – see the toy sketch after this list), there's no internal representation of Things 'In' The Language Model versus Things 'In' The Text To Be Summarised, and no impetus to perform computational operations that keep them separate where appropriate. All of which is to say that as well as not including things that should be in a summary, an LLM will readily include things that shouldn't be. Oops.
3(corollary). That includes things that aren't true. Oops(corollary).
4. LLMs will sometimes just negate statements for no clear reason. When processing text, e.g. when directed to "summarise", they'll turn a claim into the opposite claim. I think what's going on here is that a statement and its negation are syntactically and semantically similar, even though their meanings are devastatingly dissimilar. Too bad LLM technology doesn't get meanings involved, instead just taking a probabilistic walk through a model of language features like, oh I don't know, syntax and semantics!
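To make point 3 a bit more concrete, here's a deliberately tiny toy in Python – a bigram model, which is nothing like a real LLM in scale or sophistication, but has the same general shape of process. Everything in it (the "training text", the prompt) is invented for illustration.

```python
# A toy next-token generator (NOT a real LLM; vastly simplified for illustration).
# The point: there is no separate channel for "the text to be summarised" -
# the instruction and the source document are just tokens in one stream, and
# the output is whatever tokens the model's statistics make likely to come next.
import random
from collections import defaultdict

def train_bigram(tokens):
    """Count which token tends to follow which - a minuscule 'language model'."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    return counts

def generate(model, prompt_tokens, n=15):
    """Extend the prompt by sampling a likely next token, one step at a time."""
    out = list(prompt_tokens)
    for _ in range(n):
        followers = model.get(out[-1])
        if not followers:
            break
        choices, weights = zip(*followers.items())
        out.append(random.choices(choices, weights=weights)[0])
    return " ".join(out)

# 'Training data' and a 'summarise this' prompt - both just flat token sequences.
training_text = "the cat sat on the mat and the dog sat on the rug".split()
model = train_bigram(training_text)
prompt = "summarise this : the dog sat on the".split()
print(generate(model, prompt))
```

Note that the instruction "summarise this" gets no special treatment: the loop just keeps emitting whatever its statistics say usually comes next, which is the root of both the made-up additions and the inability to stay inside the source.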
Note what these four crucial capabilities have in common. It's the reason why LLMs can't do them. That's right, they require understanding to do properly.
Or if not understanding, then at least computational models of understanding, like formal reasoning over symbolically-encoded domain knowledge, including useful axioms. I mention this because classic AI systems (planners, searchers, problem solvers, etc.) can do just that, in their various limited ways: they symbolically represent domain information, perform operations on those symbols, and the results can then be mapped back to the domain as something potentially useful.
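For a flavour of what that looks like, here's a minimal sketch (invented facts and rules, not any particular system) of the kind of forward chaining a classic symbolic system does:

```python
# A minimal forward-chaining inference sketch - invented example facts and rules,
# standing in for 'symbolically-encoded domain knowledge plus useful axioms'.
facts = {"penguin(opus)", "bird(tweety)"}
rules = [
    # (premises, conclusion): if every premise is already known, derive the conclusion.
    ({"penguin(opus)"}, "bird(opus)"),
    ({"bird(opus)"}, "has_feathers(opus)"),
    ({"bird(tweety)"}, "has_feathers(tweety)"),
]

def forward_chain(facts, rules):
    """Keep applying rules until no new symbols can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

print(sorted(forward_chain(facts, rules)))
# Every derived symbol, e.g. has_feathers(opus), maps back to a checkable claim
# about the domain - which is the 'relating back' step, and the part LLMs skip.
```

The output is only ever as good as the encoded knowledge, which is the 'limited' part; but every step is an explicit operation on explicit symbols, not a guess at plausible-looking text.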
And those systems are limited, yes, but LLMs don't do 'understanding' at all. As far as I can tell, on the back of a postgrad compsci degree and a few days spent reading and partly understanding the computational basis, this is a fundamental limitation of the technology. One which can't just be fixed, but which would need a whole new (at most LLM-inspired) technology to overcome. For exactly the same reason why AI "hallucinations" can't be fixed.
Presummary (a digression)
This technical basis of how LLMs work also explains something else. These chatbots are particularly bad at "summarising" documents which contain surprising content.
By surprising content, I mean...
➡️ Statements seeming to defy common wisdom. Things that are the opposite of statements well-represented in the training data. When X is generally true of a field, but your text describes how ¬X is true of some narrow subfield or specific context, you'll see an LLM "summarise" X into ¬X more frequently.
➡️ Deliberate omissions of things that are usually in correlating training data documents. If your text looks like a text of type blarg, and blarg texts in the training data typically report on X, but you have not reported on X for your own reasons, an LLM is likely to just make something up about X while "summarising".
➡️ Unusual pairings of form and content. Performance degrades the more you ask an LLM to do something novel.
➡️ Context-sensitive language like metonyms and homographs. When X is a big important noun well-represented in the training data, and X refers to something else in the text, you'll see an LLM (appear to) get confused in the statements about X it produces for the "summary".
➡️ Nontextual information content. The LM stands for language model. If you have a report that includes and discusses images and diagrams, a chatbot might be able to stop and parse those, and then incorporate its own description of the image as part of the text to be summarised, and maybe even put images back in the summary. But you'll nonetheless end up with a worse output.
In summary (but for real)
So LLM chatbots can't be (consistently, reliably, etc) good at summarising.
Of course people who don't know what a good summary is might not notice this; likewise people who possess the skill but don't carefully check the job they told it to do.
(I would argue that in either case, if the task was worth doing to begin with, you should prefer the task not getting done to having no idea whether your document is a good, adequate, or terrible summary.)
Anyway, this is why you may have seen people who do know what a good summary is point out that LLMs actually "shorten" text rather than "summarise" it. I'm not certain, but I think the first time I saw this was in one of Bjarnason's essays.
The sentiment "this technology sure can't do [thing I am skilled at] for shit, but I guess it might be good at [thing I don't know about]" will continue to carry the day as long as people let it.
I'll self-indulgently close by quoting myself again:
> A lot of people with a lot of money would like you to think that genAI chatbots are going to fundamentally change the world by being brilliant at everything. From the sidelines, it doesn't feel like that's going to work out.