Bitter Lesson Critique

It would seem that any good AI research story that contemplates mapping domain specific knowledge and operational experience should reconcile itself with Rich Sutton’s Bitter Lesson¹. His is a cautionary tale in which Moore’s law and the scaling of generalized models trumped the performance advantage gained by building knowledge into agents. I learnt a similarly bitter lesson at Metalithic Systems, where the performance advantage we had achieved with reconfigurable computing using FPGAs back in 1995 was quickly surpassed by the massive investments made by Intel and others in bringing super-scalar multi-threaded RISC architectures to general purpose CPUs.

The concern raised by the Bitter Lesson is that Imogen brings knowledge to agentic workflow design: for example, the spreadsheet analogy for user/computer/model interaction, the use of the expression language as a security layer for inference, and the stack/heap analogy for context orchestration and managing in-context memory. In these cases I would argue that this knowledge is necessary for users working within their domain expertise to create reliable, robust, controllable agentic workflows that they can understand and reason about.

I would also argue that regardless of the training compute scale, 100% automation from a single prompt is a myth. If more than one prompt is required, then agentic workflows are needed for automation. If we need agentic workflows for automation, we also need analogies that users can comprehend and reason about. More broadly, if any of this is going to work, we also need sufficient feedback from observability for users to gain meaningful insights into model inference and how it works. Therefore, some operationalized knowledge within the application to optimize context window utilization for adaptive in-context learning is necessary.

The Compute-Efficient Frontier

It’s worth remembering that models are probabilistic, not deterministic. The diagonal line below is why I believe that 100% automation from a single prompt will remain a sub-optimal and non-generalizable approach.

Note: Legends added to make chart [3] clearer for non-technical readers.

If we equate the model error rate with the cross-entropy loss shown in the chart above, we can simplify our discussion and stay out of the technical weeds. It’s not exactly the same thing, but if you have never seen this chart before, the key takeaway is that the scale of training compute needed to reach a loss of zero is prohibitively gigantic. The vertical scale is logarithmic, so zero lies infinitely far below the bottom of the chart.

As far as trend lines go, this one is remarkable, given that it extends across so many orders of magnitude. The conclusion is that despite the massive increase in the scale of frontier models, reliability “is intractable by any measure”² because current models are unable to cross this compute-efficient frontier³.
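
To get a feel for how steep this frontier is, here is a minimal sketch of the power-law form that the scaling-laws paper³ uses for the compute-efficient frontier, L(C) = (C_c / C)^α. The exponent 0.05 and the constant 3.1e8 PF-days are assumed values, roughly in line with the fit reported there, and the function names are purely illustrative.

```python
ALPHA = 0.05        # assumed power-law exponent (illustrative)
C_CRITICAL = 3.1e8  # assumed scale constant in PF-days (illustrative)


def frontier_loss(compute_pf_days: float, alpha: float = ALPHA,
                  c_critical: float = C_CRITICAL) -> float:
    """Loss on the frontier under the assumed form L(C) = (C_c / C) ** alpha."""
    if compute_pf_days <= 0:
        raise ValueError("compute must be positive")
    return (c_critical / compute_pf_days) ** alpha


def compute_multiplier(loss_ratio: float, alpha: float = ALPHA) -> float:
    """How many times more compute is needed to scale the loss by loss_ratio."""
    if not 0 < loss_ratio <= 1:
        raise ValueError("loss_ratio must be in (0, 1]")
    return loss_ratio ** (-1.0 / alpha)


if __name__ == "__main__":
    for compute in (1e0, 1e3, 1e6):
        print(f"{compute:>9.0e} PF-days -> loss {frontier_loss(compute):.2f}")
    print(f"compute needed to halve the loss: x{compute_multiplier(0.5):,.0f}")
```

Under these assumptions, halving the loss takes roughly a million times more compute (2^20), and no finite amount of compute brings the loss to zero.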

Six Sigma Reliability	vs	AI Reliability
99.999% = 1 in 100,000 defect rate		95% = 1 in 20 error rate
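
The gap between these two columns can be made concrete with a little arithmetic. The sketch below converts a reliability percentage into a "1 in N" failure rate and then asks how much extra compute the gap would imply if the error rate fell as a power law in compute. Both that power-law behavior and the exponent of 0.05 are strong simplifying assumptions for illustration, not measured properties of any model.

```python
ASSUMED_ALPHA = 0.05  # assumed scaling exponent, illustrative only


def one_in_n(reliability_percent: float) -> int:
    """Convert a reliability percentage into a '1 in N' failure rate."""
    if not 0 <= reliability_percent < 100:
        raise ValueError("reliability must be in [0, 100)")
    failure_fraction = (100.0 - reliability_percent) / 100.0
    return round(1.0 / failure_fraction)


def error_reduction_needed(current_percent: float, target_percent: float) -> float:
    """Factor by which the failure rate must shrink to reach the target."""
    return one_in_n(target_percent) / one_in_n(current_percent)


def compute_multiplier_for(reduction: float, alpha: float = ASSUMED_ALPHA) -> float:
    """Extra compute implied if the error rate fell as compute ** -alpha."""
    return reduction ** (1.0 / alpha)


if __name__ == "__main__":
    reduction = error_reduction_needed(95.0, 99.999)
    print(f"95%     -> 1 in {one_in_n(95.0):,}")
    print(f"99.999% -> 1 in {one_in_n(99.999):,}")
    print(f"error rate must fall by a factor of {reduction:,.0f}")
    print(f"implied compute multiplier: {compute_multiplier_for(reduction):.1e}")
```

Going from 1 in 20 to 1 in 100,000 is a 5,000-fold reduction in errors. Under the assumed exponent, that works out to a compute multiplier on the order of 10^73, which is why scaling alone does not close the gap.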

Even achieving six sigma reliability is infeasible: it would require models to scale to an impossible size. Yet this is the defect rate that businesses and customers equate with perceptions of product quality and reliability. At present, the only way agentic workflows are able to achieve the highest reliability scores is with iterative prompt evaluation and refinement against testable objectives. As such, reliable in-context learning requires some kind of context refinement and optimization mechanism. So we need this type of knowledge in our agent equation.
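
As a rough illustration of what "iterative prompt evaluation and refinement against testable objectives" can look like, here is a minimal sketch of such a loop. It is not how Imogen implements it. Every name in it (`refine_until_passing`, `toy_model`, `toy_revise`, `toy_objectives`) is hypothetical, and the model is a deterministic stub so the example runs without any inference service.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[str], bool]


def refine_until_passing(
    prompt: str,
    run_model: Callable[[str], str],
    objectives: Dict[str, Check],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 5,
) -> Tuple[str, str, int]:
    """Re-run and revise a prompt until every objective passes.

    Returns the final prompt, the passing output, and the rounds used.
    """
    for round_number in range(1, max_rounds + 1):
        output = run_model(prompt)
        failures = [name for name, check in objectives.items() if not check(output)]
        if not failures:
            return prompt, output, round_number
        prompt = revise(prompt, failures)
    raise RuntimeError(f"objectives still failing after {max_rounds} rounds")


# Toy stand-ins so the sketch runs without any model or network access.
def toy_model(prompt: str) -> str:
    return '{"answer": 42}' if "JSON" in prompt else "The answer is 42"


def toy_revise(prompt: str, failures: List[str]) -> str:
    return prompt + " Respond in JSON."


toy_objectives: Dict[str, Check] = {
    "is_json_object": lambda out: out.startswith("{") and out.endswith("}"),
}

if __name__ == "__main__":
    final_prompt, output, rounds = refine_until_passing(
        "What is 6 x 7?", toy_model, toy_objectives, toy_revise
    )
    print(rounds, final_prompt, output)
```

The point of the design is that the objectives are ordinary, testable predicates on the output, so each refinement round produces evidence that a user can inspect rather than an impression.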

Latent emergent behavior

It’s also worth considering that models are essentially black boxes that contain latent emergent behaviors, and no one really knows how they can be used until someone thinks of something and gives it a try. These behaviors can be surfaced through model training, fine-tuning with reinforcement learning, or in-context learning during inference. This is very different from the way traditional software works, and it’s a continuing driver of growth and innovation in this space, and a source of opportunity for businesses with domain specific knowledge and operational expertise on which they wish to capitalize.

I was most recently reminded of this while listening to a keynote given by Lélio Renard Lavaud, Head of Engineering at Mistral AI⁴, where he recounted a customer story from one of the “largest companies in the world”. Lélio exclaimed that they were:

“Building such amazing usage on top of Mistral 8x7b and have fine-tuned it and are actually putting this in production and getting amazing returns, but we had never heard about them at all. They never reached out.”

While they may not like to admit it publicly, frontier AI companies are still discovering what models can do just like we are. For this reason I find thinking about models as a black box is actually quite helpful. We know the models are knowledgeable, and we also know this knowledge is increasing. We know the models can do things for us, and we also know they are getting better at doing those things. Nevertheless, we still have to learn what things models can help us know and what stuff models can help us do.

Discoverability is how we learn to think agentically. It’s how we see what’s inside the box.

Is prompting dead?

As we codify multi-turn chain-of-thought conversations between users and models into agentic workflows, they become a form of natural language program. From this perspective, it’s interesting to consider the function of an individual prompt. I now liken conversational prompting to writing assembly code. Yes, it’s direct, but it’s also time-consuming and inefficient. While new reasoning models have made shorter prompts more expressive, conversational prompts have also become a less effective means of discoverability because the slower response times create a latency trap⁵.

The Latency Trap — Cognitive insight is impeded as inference time increases.

New reasoning models may be 10x smarter, but they are also 10x slower than their predecessors⁶. With these new models, conversational prompting averages only 1.3 turns per minute. This slow feedback loop is insufficient to gain meaningful insights, making it harder for conversational users to evaluate and reason about the effectiveness of their prompting technique.
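
A quick back-of-the-envelope calculation shows what that rate means for a working session. The sketch below uses the 1.3 turns per minute quoted above. The 30-minute session length is an arbitrary illustration, and the comparison assumes, simplistically, that the whole turn (not just inference) would be ten times faster with an earlier model.

```python
def seconds_per_turn(turns_per_minute: float) -> float:
    """Average wall-clock seconds spent on one conversational turn."""
    if turns_per_minute <= 0:
        raise ValueError("turns_per_minute must be positive")
    return 60.0 / turns_per_minute


def turns_in_session(turns_per_minute: float, session_minutes: float) -> int:
    """Number of feedback cycles that fit in a session of the given length."""
    return round(turns_per_minute * session_minutes)


if __name__ == "__main__":
    reasoning_rate = 1.3                  # turns per minute, from the text
    faster_rate = reasoning_rate * 10     # assumed: the whole turn is 10x faster
    session = 30                          # minutes, arbitrary illustration

    for label, rate in (("reasoning model", reasoning_rate),
                        ("10x faster model", faster_rate)):
        print(f"{label}: {seconds_per_turn(rate):.1f} s per turn, "
              f"{turns_in_session(rate, session)} turns in {session} min")
```

At 1.3 turns per minute, each turn takes about 46 seconds, so a half-hour session yields only 39 opportunities to observe how the model responds to a change in the prompt.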

Essentially these smarter models may actually impede the development of cognitive insight. This points to the need for a higher level construct than prompting. Much like high level languages replaced assembly language in software development, some form of high level context programming is also needed to facilitate the natural language programming required to design agentic workflows. As such, we also need this kind of knowledge in our agent equation.

Future change analysis

The wisdom of the Bitter Lesson is understanding that any knowledge we bring to our agent equation risks becoming a liability as frontier models continue to scale. This is because domain specific knowledge can introduce complexities that tightly couple applications to the particular idiosyncrasies of a generation of models. One way to quantify this risk is by examining previous model gains and projecting which aspects of inference are likely to change in the near future⁷.

Chart: Frontier Language Model Intelligence, Over Time (ArtificialAnalysis.ai)

Looking at the average model trend line (A), we might speculate that significant improvements in the near future are likely. However, looking at the leading model trend line (C), we might speculate that model performance has reached a peak, that other models are fast catching up, and that future gains will be costly and have diminishing returns. Maybe future model intelligence lands somewhere in the middle (B).

¯\_(ツ)_/¯

I am reminded of the parallel development of sorting algorithms along with the computer architectures of the 1950s that culminated with the invention of Quicksort in 1960 by Tony Hoare. While new sorting algorithms have continued to improve the best-case performance, worst-case performance, memory overhead and stability of sorting operations, the average-case performance of Quicksort, O(n log n), has only been matched by new algorithms and remains unbeaten.

I suspect our current exemplars of transformer models will converge at a similar nexus. The compute-efficient frontier points to an upper limit for intelligence given that available training compute is finite.

In any event, we can infer what is likely to change in the near future and the effect these trends will have on agentic workflow design.

Things changing quickly

Things that are likely to change quickly in the near future include model performance, inference speed, model capability and inference cost.

The performance of models is increasing;
The speed of inference is increasing;
The cost of frontier inference is increasing; and,
The average cost of inference is decreasing.

Things changing less quickly

Model providers will continue experimenting with new architectures like mixture of experts and with quantization strategies to improve inference performance and reduce GPU memory and compute requirements.

Smaller, faster and highly optimized models, including open-weight models, will trail the performance improvements of large frontier models by a few months.

We already see this with the avalanche of open-weight Qwen models.

Things unlikely to change quickly

Businesses using agentic workflow design for automation will continue to benefit from:

User augmentation with agentic thinking;
User capability development for agentic productivity;
Adaptive context programming and memory;
Quality with task alignment;
Trust from instrumentation for context evaluation and refinement;
Observability with inference back-tracing and replayability;
Latency with low time to first token;
Throughput from parallel task execution;
Discoverability through cognitive insight into latent emergent behavior; and,
Economics reflecting a reduced cost of outcome.

Models aren’t mind readers

In the end, I feel confident that agentic workflow design applications supporting the features and capabilities I have outlined above will not be handicapped by them as frontier models continue to improve, at least for as long as large transformer-based models remain the dominant paradigm. When you think about it, this feels pretty obvious. For example, no matter how smart a model gets, it can only achieve task alignment when you give it clear instructions. Clear instructions are part of the context, and that context will need to be evaluated and refined for reliable generalized automation; otherwise, how will you know whether inference is working optimally? After all, you can’t manage what you don’t measure, and models can’t infer what you are thinking unless you provide those thoughts to them in the form of context. Therefore, a capable tool set to design and adaptively program context around tasks is not something I believe the largest, most advanced frontier models will obviate in the foreseeable future.

¹ Essay: Rich Sutton (March 13, 2019). The Bitter Lesson.
² Paper: Peter Coveney (July 25, 2025). The wall confronting large language models.
³ Paper: Kaplan et al. (January 23, 2020). Scaling Laws for Neural Language Models.
⁴ Keynote: Lélio Renard Lavaud (September 23, 2025). Unlocking AI: Driving Successful Enterprise Adoption. AI Engineer Paris 2025.
⁵ Essay: Michael Carroll (July 8, 2025). The Latency Trap: SaaS’s Silent Sabotage.
⁶ Talk: George Cameron (July 9, 2025). Trends Across the AI Frontier. AI Engineer World’s Fair 2025.
⁷ Website: ArtificialAnalysis.ai (October 2025). Independent analysis of AI: Understand the AI landscape to choose the best model and provider for your use case.

Last updated on October 12, 2025
