The agent always cost this much

GitHub moved Copilot from Premium Request Units to AI Credits this week. Within hours, developers on Reddit were reporting that a single Claude Sonnet prompt had burned half a month of credits and that a $100 Max plan could be exhausted inside a day. The clearest framing of why came from Mario Rodriguez, GitHub's Chief Product Officer, writing on the GitHub blog:

"Today, a quick chat question and a multi-hour autonomous coding session can cost the user the same amount. GitHub has absorbed much of the escalating inference cost behind that usage, but the current premium request model is no longer sustainable."

The capital outlay behind frontier models is enormous: multi-billion-dollar training runs, plus inference fleets built on the most expensive GPUs. I have written before about whether the math even works at the macro level, and the answer at trillion-dollar CapEx scale is no. Charging ten dollars a month for unbounded access to that supply chain was never sustainable.

Every serious AI tooling vendor is converging on usage-based pricing for the same reason: basic capitalism. Frontier inference is expensive, parallel agents multiply the expense, and no flat fee survives that kind of unbounded scale. Developers are learning what it took cloud engineers a good 10 years to accept: flat-rate predictability is the last thing to arrive, not the first.

The pattern emerging seems to be: frontier at the edges, cheap in the middle. Planning is where the frontier models earn their cost. Letting a strong model think hard about what the work actually is, before any code gets written, prevents the kind of misdirected implementation that is expensive to unwind. The implementation itself, once the plan is established, is largely mechanical, and a cheaper model can do most of the labor. Verification can then swing back to the frontier.
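As a rough sketch of that split, here is how the phase-to-tier routing might look in Python. The tier names, the integer credit prices, and the token counts are all invented for illustration; none of them come from GitHub or any other vendor.

```python
from dataclasses import dataclass

# Hypothetical credit prices per 1,000 tokens; illustrative only, not real vendor pricing.
CREDITS_PER_1K = {"frontier": 15, "cheap": 1}

# The split described above: frontier at the edges, cheap in the middle.
PHASE_TIER = {"plan": "frontier", "implement": "cheap", "verify": "frontier"}


@dataclass(frozen=True)
class Step:
    phase: str      # "plan", "implement", or "verify"
    tokens_k: int   # thousands of tokens the step is assumed to consume


def tier_for(phase: str) -> str:
    """Return the model tier assigned to a phase."""
    return PHASE_TIER[phase]


def estimate_credits(steps, force_tier=None):
    """Sum credits for a session; force_tier prices every step on one tier."""
    return sum(
        s.tokens_k * CREDITS_PER_1K[force_tier or tier_for(s.phase)]
        for s in steps
    )


# An invented session: a short plan, a long implementation, a short verification.
session = [Step("plan", 4), Step("implement", 60), Step("verify", 6)]
routed = estimate_credits(session)
all_frontier = estimate_credits(session, force_tier="frontier")
print(routed, all_frontier)
```

With these made-up numbers the routed session comes to 210 credits against 1,050 for running every step on the frontier tier. The ratio depends entirely on the assumed prices and token counts, so treat it as a shape, not a forecast.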

Underneath all of that, the ambient AI in the IDE continues to run on cheaper models, handling Next Edit Suggestions, whole-line completions, call-stack analysis, and the constant background help that makes the editor feel intelligent. These calls are high-frequency, latency-sensitive, and low-stakes per individual invocation, and that cheap floor may get even cheaper. At BUILD this week, Satya Nadella pointed out that "the amount of compute you have at the edge is actually astounding," and framed the destination as "unmetered intelligence to every desk and every home." The Windows side of the answer is real on-device inference support, NPU acceleration across the major silicon vendors, and small-model families tuned to run on the chip rather than the cloud. The direction is hybrid by default: route what needs the frontier to a paid API, run what can run locally on silicon the developer already owns.
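The hybrid-by-default direction can be sketched as a small routing decision. This is a minimal illustration, assuming a hypothetical `Task` type and made-up backend labels; it does not reflect how Visual Studio, Copilot, or Windows actually route requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    kind: str             # e.g. "line_completion" or "plan" (hypothetical labels)
    needs_frontier: bool  # True when the work justifies a paid frontier call


def choose_backend(task: Task, local_model_available: bool) -> str:
    """Pick where a task runs: paid frontier API, local device, or cheap cloud fallback."""
    if task.needs_frontier:
        return "frontier_api"
    if local_model_available:
        return "local_device"
    return "cheap_cloud"


# Ambient editor help stays on the device when a local model is present.
print(choose_backend(Task("line_completion", needs_frontier=False), local_model_available=True))
# Planning work goes to the paid API regardless of local capacity.
print(choose_backend(Task("plan", needs_frontier=True), local_model_available=True))
```

The check on `needs_frontier` comes first on purpose: local capacity only matters for work that does not need the frontier in the first place.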

The era of agents running wild on someone else's dime is over. What replaces it is already visible in outline, and it rewards developers who think before they fan out, match the model and where it runs to the task, and treat context as a resource worth caching.

I work on Visual Studio at Microsoft. The views here are mine alone.

