Saturday, 18 October 2025

One more generative AI rant for the pile

(this one's about summarising text)

LLM chatbots – that is AI, in the same sense that we could just start saying "doctors" to refer specifically to orthopaedic wrist surgeons if we collectively decided to – 

LLM chatbots continue to slosh about the world. I used to try them out intermittently to see if they were any good.

My contact with the technology is only incidental these days. To wit:

  • If you google old phrases and terminology in English, an LLM chatbot will still confidently weigh in with completely spurious "definitions" because they're not well-represented in the training data.
  • If you google modern bits of even slightly less-discussed technical knowledge like "does a Kickstarter project video appear on the prelaunch page", an LLM will still confidently tell you the opposite of the truth.
  • If you need customer support or anything that even looks like customer support, there is an extra quarter-hour minimum of wasted bot effort before you can get it.

Nothing I've seen has suggested the technology has fundamentally changed.

 

A monkey writes on a scroll. Image by John Batten.
He can't be wrong, he writes so confidently.

 

In the previous edition of discussing the emperor having no clothes, I mentioned  

[Wikipedia] editors pointed out that the LLM summaries generally ranged from 'bad' to 'worthless' by Wiki standards: they didn't meet the tone requirements, left out key details or included incidental ones, injected "information" that wasn't in the article, and so on

and 

bureaucratic wonks note that genAI can't summarise text. It shortens it and fills in the gaps with median seems-plausible-to-me pablum. The kind you get when you average out everything anyone has ever written on the internet.

I recently saw an AI booster shuffle their position back to "at least it's good for summarising, it's going to completely replace human effort there". With that motivation, let's drill down a bit.


In (a) summary

Let's not bury the lede. LLM chatbots can't produce good summaries. Sometimes by chance, yes, but not reliably. Summarising, like everything, is a skill-based task, and of the various capabilities required to do it well, LLMs lack four of the most important.

1. LLMs won't reliably retain important structure or the order in which information is presented. They will just haphazardly obliterate implicit linkages. They will even occasionally discard explicit structures, as when the text itself points out that C follows from A and B, and therefore D.

2. LLMs can't identify the most important information in a text (a necessary first step to preserving it in the summary). In a good summary, certain content "should" be retained, certain content compressed, and the remaining content discarded. Vital information generally isn't identified within the text in a way that's detectable without broader context, language skills, and understanding of the world. Even when it is, e.g., in texts where repetition of a word corresponds directly to importance, or phrases like "this is vital information" are always appended, LLMs still aren't guaranteed to retain important details! And the same applies to cutting out unimportant information.

3. LLMs can't stick to the source text, that is, the content they're meant to be summarising. Because they just generate text (by predicting which bits of text should come next, based on an enormous model of which bits tend to come after which bits, hence 'language model'; see the toy sketch after this list), there's no internal representation of Things 'In' The Language Model versus Things 'In' The Text To Be Summarised, and no impetus to perform computational operations that keep them separate where appropriate. All of which is to say that as well as not including things that should be in a summary, an LLM will readily include things that shouldn't be. Oops

3(corollary). That includes things that aren't true. Oops(corollary)

4. LLMs will sometimes just negate statements for no clear reason. When processing text, e.g. when directed to "summarise", they'll turn a claim into the opposite claim. I think what's going on here is that a statement and its negation are syntactically and semantically similar, even though their meanings are devastatingly dissimilar. Too bad LLM technology doesn't get meanings involved, instead just taking a probabilistic walk through a model of language features like, oh I don't know, syntax and semantics!
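
To make the mechanism in point 3 concrete, here's a deliberately silly toy sketch (Python, bigram counts, nothing remotely like a real LLM's scale or architecture): generation is just "pick something that tends to follow the last token". The document it's handed is just more tokens in the same stream; nothing marks them out as the source of truth.

```python
import random
from collections import defaultdict

# Toy "language model": bigram counts from a tiny training text. A real LLM
# is a vastly larger neural network, but output is produced the same basic
# way: predict the next token from what came before.
training_text = "the cat sat on the mat . the dog sat on the rug .".split()

follows = defaultdict(list)
for prev, nxt in zip(training_text, training_text[1:]):
    follows[prev].append(nxt)

def generate(prompt, length=12):
    """Keep appending whichever word tended to follow the previous one."""
    tokens = prompt.split()
    while len(tokens) < length:
        candidates = follows.get(tokens[-1])
        if not candidates:
            break
        tokens.append(random.choice(candidates))
    return " ".join(tokens)

# The "document to summarise" is just more tokens fed into the same machinery.
document = "the parrot sat on the shelf ."
print(generate(document))
# The continuation is drawn entirely from training-text statistics: this toy
# can only ever emit cats, dogs, mats and rugs, never the parrot or the shelf,
# because "what the model has seen" and "what the document says" aren't
# separate things it can consult independently.
```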

Note what these four crucial capabilities have in common. It's the reason why LLMs can't do them. That's right, they require understanding to do properly.

Or if not understanding, then at least computational models of understanding, like formal reasoning over symbolically-encoded domain knowledge including useful axioms. I mention this because classic AI systems (planners, searchers, problem solvers, etc.) can do just that, in their various limited ways. They represent domain information symbolically, perform operations on those symbols, and the results, once mapped back onto the domain, can be genuinely useful.
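
For contrast, here's a minimal sketch of that symbolic approach, using made-up facts and rules rather than any real planner or prover: the system only ever asserts what its rules license, and every derived conclusion can be traced back to the symbols it started from.

```python
# Minimal forward-chaining sketch: domain knowledge held as explicit symbols,
# new conclusions derived by mechanically applying if-then rules until a full
# pass adds nothing. (Made-up facts and rules, purely for illustration.)
facts = {"dragon_in_library", "dragon_breathes_fire", "library_full_of_paper"}

rules = [
    # (premises, conclusion): if every premise is a known fact, add the conclusion.
    ({"dragon_breathes_fire", "dragon_in_library"}, "fire_in_library"),
    ({"fire_in_library", "library_full_of_paper"}, "library_burns_down"),
]

changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print(sorted(facts))
# The derived symbols only mean anything once mapped back onto the domain,
# but the derivation itself is explicit and inspectable, and it never asserts
# a "fact" the rules don't license.
```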

And those systems are limited, yes, but LLMs don't do 'understanding' at all. As far as I can tell, on the back of a postgrad compsci degree and a few days spent reading and partly understanding the computational basis, this is a fundamental limitation of the technology. One which can't just be fixed, but which would need a whole new (at most LLM-inspired) technology to overcome. For exactly the same reason that AI "hallucinations" can't be fixed.

 

Presummary (a digression)

This technical basis of how LLMs work also explains something else. These chatbots are particularly bad at "summarising" documents which contain surprising content.

By surprising content, I mean...

➡️ Statements seeming to defy common wisdom. Things that are the opposite of statements well-represented in the training data. When X is generally true of a field, but your text describes how ¬X is true of some narrow subfield or specific context, you'll see an LLM "summarise" your ¬X back into X more frequently.

➡️ Deliberate omissions of things that are usually in correlating training data documents. If your text looks like a text of type blarg, and blarg texts in the training data typically report on X, but you have not reported on X for your own reasons, an LLM is likely to just make something up about X while "summarising".

➡️ Unusual pairings of form and content. Performance degrades the more you ask an LLM to do something novel.

➡️ Context-sensitive language like metonyms and homographs. When X is a big important noun well-represented in the training data and X refers to something else in the text, you'll see an LLM (appear to) get confused in the statements about X it produces for the "summary".

➡️ Nontextual information content. The LM in LLM stands for language model. If you have a report that includes and discusses images and diagrams, a chatbot might be able to stop and parse those, and then incorporate its own description of the image as part of the text to be summarised, and maybe even put images back in the summary. But you'll nonetheless end up with a worse output.

 

In summary (but for real)

So LLM chatbots can't be (consistently, reliably, etc) good at summarising.

Of course people who don't know what a good summary is might not notice this; likewise people who possess the skill but don't carefully check the job they told the chatbot to do.

(I would argue that in either case, if the task was worth doing to begin with, you should prefer the task not getting done to having no idea whether your document is a good, adequate, or terrible summary)

Anyway this is why you may have seen people who do know what a good summary is point out that LLMs actually "shorten" text rather than "summarise" it. I'm not certain but I think the first time I saw this was in one of Bjarnason's essays.

The sentiment "this technology sure can't do [thing I am skilled at] for shit, but I guess it might be good at [thing I don't know about]" will continue to carry the day as long as people let it.

I'll self-indulgently close by quoting myself again:

A lot of people with a lot of money would like you to think that genAI chatbots are going to fundamentally change the world by being brilliant at everything. From the sidelines, it doesn't feel like that's going to work out.


Tuesday, 14 October 2025

1d20 Megadungeon Safety Code Violations

It's almost as if the archlich-in-chief didn't have everyone's safety in mind!

(Lousy no-good penny-pinching archlich-in-chief.)

 

Safety warning sign. Finger crush hazard.

 1d20 Issues The Inspectors Bring Up In Their Initial Report:

  1. Sixteen cases of poisoned darts shooting from walls without warning signage 
  2. Lit torches in corridor RM-70-B burning within two metres of hanging tapestries
  3. Thirty-storey staircase lacks railing or other cordon around open stairwell
  4. Inadequately ventilated throughout (see attached list of 781 affected rooms)
  5. Chasm bridge constructed from inadequate materials. Code requires use of steel cable, tethers
  6. Insufficient drainage ducting to prevent floods on floors B2 through B29  
  7. As-built construction plans yet to be lodged with regional fire rescue co-ordinators
  8. Shrine Of Darkness should be electrically insulated against celestial lightning
  9. Enormous boulder insufficiently shored up with mechanical apparatus; risk it could fall and roll
  10. Hellfire conjuration chamber not adequately equipped with hellfire suppression equipment
  11. Giant mushrooms pose unacceptable spore allergen/asphyxiation risk
  12. Several parts of floor B30 are exposed to magma flows (not a safety code violation; brought up as a point of overall concern)
  13. Unholy Water Cistern not shielded against heavy metals found in local groundwater 
  14. Too few first aid kits (three per floor; code calls for five)
  15. Human remains not maintained at morgue temperatures, and allowed to move around freely
  16. Floor B7 torture chambers not accessible by ramp or elevator
  17. Cursed font of immortality lacks fence to prevent accidental drownings
  18. Black, yellow, ochre, blue, umbral, and invisible mould found in food preparation areas
  19. High-visibility safety lines should be painted in zones where ambulatory juggernaut roams
  20. Giant talking stone head is blocking fire exit
 
 Safety warning sign. Lurking alligator.

(Yikes. Well at the pace stuff gets fixed around here, better hope they shut the worksite down before someone gets hurt.)

Saturday, 4 October 2025

Smooth vs Chunky game design

Sometimes it helps to look at game design through the "Smooth or Chunky" lens. What do these terms mean? Well, they encapsulate certain vibes. They occupy the middle ground between game mechanics and game feel.

🥛🥛🥛 Smooth is

  • small numbers changing incrementally
  • things being at the same level
  • player-facing subsystems with modest game impact (say, situational dice roll modifiers)
  • overloaded dice rolls with small results
  • rounded probability curves like 4d6
  • anything pre-planned
  • predictable consequences, with randomness used as a spice

Smooth is like gradient descent and intricate clockwork and elegant flowcharts. Typical Smooth design elements are sensible, explicable, predictable, fine-grained, subtle, simulationist, world-associated, and introduce as much complexity as they need to be internally coherent.

🥜🥜🥜 Chunky is

  • one die that does everything
  • flat high-variance probability curves (say, a 1d100 roll; see the dice sketch after these lists)
  • one number that represents a bunch of things
  • rollercoaster rides of remarkable successes and sudden catastrophes
  • randomness underpinning creative direction
  • huge sudden changes to big important things
  • hefty modifiers 
  • random tables

Chunky is like high stakes roulette wheels and staccato noises and refusing to erase anything. Typical Chunky design elements are simple, wide-reaching, experimental, flashy, coarse-grained, gamist, surprising, central, and do as much as possible with one thing.
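
To put rough numbers on the two probability-curve bullets above, here's a quick simulation (plain Python, arbitrary sample size) of 4d6 against 1d100. The point is the shape: 4d6 piles up around its average and almost never hits the extremes, while every 1d100 result, including the wild ones, turns up at full rate.

```python
import random
from collections import Counter

random.seed(0)
ROLLS = 100_000

# Rounded curve: 4d6 clusters around its average total of 14.
four_d6 = Counter(sum(random.randint(1, 6) for _ in range(4)) for _ in range(ROLLS))
# Flat curve: 1d100 spreads evenly across its whole range.
one_d100 = Counter(random.randint(1, 100) for _ in range(ROLLS))

print("4d6:   P(total = 14)  ~", round(four_d6[14] / ROLLS, 3))   # roughly 0.11
print("4d6:   P(total >= 22) ~", round(sum(n for total, n in four_d6.items() if total >= 22) / ROLLS, 3))  # roughly 0.01
print("1d100: P(roll = 50)   ~", round(one_d100[50] / ROLLS, 3))  # roughly 0.01
print("1d100: P(roll >= 91)  ~", round(sum(n for roll, n in one_d100.items() if roll >= 91) / ROLLS, 3))   # roughly 0.10
```

The exact figures don't matter; what matters is that the Smooth curve makes middling results routine and extremes rare, while the Chunky one hands out jackpots and catastrophes as readily as anything else.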

Now in game design, unlike with peanut butter, neither one is clearly better than the other. It's situational.

Case study 1

I'm working on an RPG called Overzealous. In this game, you're a well-meaning elder god but your cultists are the typical bloodthirsty crazed zealots. This sets up the tension. You need to become manifest in the world before your cult tears itself apart with ridiculous behaviour.

Early conceptualisation had Overzealous revolve around large random tables, with most of the consequences falling out of those. This played fast but felt random. And it meant that I couldn't model persistent problems, like your cultists getting bored and summoning a bunch of monsters that then hung around, or continuous schisms in your cult leading to further attrition and outrage.

I Smoothed out this Chunkiness by expanding two numerically-tracked stats to five, and adding a granular subsystem for acquiring ongoing "problems" which took trade-offs to solve. The gameplay became a lot richer!

Cult-related symbols for the five stats. For example, Fervour is represented by a happy cultist with a dagger. Art partially adapted from work by Lorc, CC-BY 3.0

Case study 2

After these changes to Overzealous, you have three "bad" stats (Fervour, Divergence, and Monstrosity) which you want to keep a lid on and two "good" stats (Imminence, Cultists) which you need to get high enough that your cult can perform a ritual to bring you into reality.

In the draft version after the Smooth changes, if a bad stat exceeded 20, it was game over. The bad stats crept up slowly through various mechanisms, all at about the same pace. The increase was slightly faster than you could counteract, and counteracting it generally meant sacrificing positive stats, setting yourself back. That's Smooth design! The intention was that the player needed to find and pursue a good strategy to secure a win before a loss became inevitable.

In practice, though, my introduction of this much Smoothness created two issues. Because the changes were small and fairly predictable,

  1. A skilled player could find one obviously optimal strategy, and didn't have to deviate much from it in response to random events. This reduced gameplay scope.
  2. If, through poor luck, lack of experience, or exploring other options, the player's negative stats got too high, there was a tipping point where it was obvious that a loss was inevitable... but it still took a long time to actually lose the game.

To deal with this, I eased up on the Smooth pedal and re-introduced some Chunky. I had stats go to 13 instead of 20, took out some of the cases where multiple stats all change by 1, and doubled down on cases where a single stat changes by 2 or 3. Also, the ongoing problems that beset your cult (like cannibalism, diabolism, and schisms) only have a chance to come into play each turn, but are more impactful when they do.

And these tweaks got me exactly the gameplay experience I wanted for Overzealous! There's no longer one clear strategy to follow, as the feeling is more one of running around putting out fires. Now the player is tempted to push things, to e.g. just spend one more turn scrambling towards completing the ritual when their Fervour has crested 10, because the end's in sight, knowing that a couple of bad rolls might be their downfall.
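
If you like seeing the vibe in numbers, here's a toy simulation. This is emphatically not the real Overzealous rules; it's one made-up "bad" stat creeping towards a made-up game-over threshold, contrasting many small predictable bumps against fewer, bigger, chancier ones.

```python
import random
import statistics

def turns_until_doom(threshold, bumps, bump_chance):
    """Toy: one 'bad' stat creeps upward until it crosses the game-over threshold."""
    stat = turns = 0
    while stat < threshold:
        turns += 1
        if random.random() < bump_chance:
            stat += random.choice(bumps)
    return turns

random.seed(1)
runs = 10_000
# Smooth-ish: +1 every turn towards a threshold of 20. Utterly predictable.
smooth = [turns_until_doom(20, bumps=[1], bump_chance=1.0) for _ in range(runs)]
# Chunky-ish: a 60% chance each turn of +2 or +3, towards a threshold of 13.
chunky = [turns_until_doom(13, bumps=[2, 3], bump_chance=0.6) for _ in range(runs)]

for name, results in (("smooth", smooth), ("chunky", chunky)):
    print(f"{name}: mean {statistics.mean(results):.1f} turns, "
          f"spread {statistics.pstdev(results):.1f}, longest {max(results)}")
```

The smooth toy always takes exactly the same (long) number of turns to doom you, which is the slow, foreseeable slide described above; the chunky one is quicker on average but swings from game to game, which is where the running-around-putting-out-fires feeling comes from.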

New words to use

Was I thinking about Smooth vs Chunky throughout this design process? Not exactly, but I was certainly aware of what was going on with the overall vibes, and now that I've constructed this jargon to talk about it, I suspect I'll be thinking in those terms in the future.

With gameplay pretty much where I want it, I'm ready to get the visual design finished!

Little cartoon cultist holding bowl. Art by Gordy H.

Anyway, that is the age-old "Smooth vs Chunky" dichotomy, brought to you by peanut butter on toast.
