Welsh Microsoft Guy
13 March 2026

From Mainframes to Microchips, from LLMs to Local AI - The shrinking footprint of intelligence.

ai
llm
generative-ai
cloud
architecture

It is difficult to miss that a large portion of the current conversation around AI is focused on the race to build and improve frontier models. That focus has started to shape the physical infrastructure of the industry in increasingly visible ways. Training and operating these systems requires a class of hardware that is still relatively constrained in supply. High-end AI GPUs are being absorbed by hyperscalers at a pace that manufacturers are struggling to match, which is why lead times for certain data-centre GPUs have stretched into the 30 to 50 week range and why most of the major cloud providers are effectively purchasing production capacity as quickly as it becomes available.

Once you start looking at the infrastructure envelope behind those systems, the rationale and the pattern become clear. A single AI-optimised GPU can draw somewhere between 700 and 1,200 watts depending on configuration, and a modern AI server typically carries eight of them, so the power envelope moves quickly into the 10 to 12 kilowatt range per machine before networking, storage and cooling overheads are included. None of this is particularly surprising if scale is the primary mechanism being used to increase capability. Larger models require more parameters, more memory bandwidth and more compute, which means the physical footprint inevitably grows with them.

What makes the situation more interesting is that this may not describe the full direction of travel. Alongside the push towards ever larger models there is a steady stream of work focused on reducing the class of hardware required to run useful AI systems at all. The easiest way to make sense of that trend is not by looking at any individual model release but by stepping back and viewing it through the longer history of computing.

The earliest electronic computers were physically enormous machines that required dedicated facilities and specialist operators simply to function. Systems such as ENIAC consumed large amounts of power, occupied entire rooms and existed within a very small number of research or government environments. Capability existed, but it was concentrated. Only organisations with the resources to build and operate that kind of infrastructure could realistically make use of it.

Over time that concentration changed, although the transition took decades rather than years. Mainframes placed computing capability inside structured enterprise environments where it could support large organisational processes. Personal computers then moved that capability onto individual desks, which fundamentally altered how people interacted with software. Mobile devices pushed the same shift further by embedding computing into everyday activity. Eventually microcontrollers and embedded processors extended the pattern again, making it possible to place meaningful computing capability inside devices that most people would not normally think of as computers at all.

Seen in hindsight, the important change was not simply that computing became more powerful. It also became smaller, cheaper and easier to distribute. The underlying ideas remained broadly consistent, but the environments in which those ideas could run expanded dramatically as efficiency improved.

Following our Cloudy with a Chance of Insights pod recording last week I mulled over our conversation a little more, and I believe something loosely similar is beginning to emerge in the AI model landscape. The first wave of modern language models followed a strongly centralised pattern. The most capable systems required large memory footprints, specialised accelerators and environments capable of sustaining very high inference throughput. In practical terms that meant these models lived inside large shared infrastructure platforms, and applications accessed that capability remotely rather than embedding it directly.

More recently that assumption has started to soften. When parameter counts move into the one to ten billion range the hardware requirements change quite quickly. With quantisation and related optimisation techniques it becomes possible to run models of that size, such as Microsoft's Phi family, on strong workstations, capable laptops or certain classes of edge hardware. The capability is obviously smaller than that of the largest frontier systems, but it is often sufficient for a surprisingly wide range of tasks. Instead of intelligence being confined to clusters, useful models start appearing closer to the environments where work actually happens.
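
To make that concrete, here is a minimal sketch of what running a small model locally with 4-bit quantisation looks like. It assumes the Hugging Face transformers, accelerate and bitsandbytes libraries, and uses the microsoft/Phi-3-mini-4k-instruct checkpoint purely as an example; any model in the one to ten billion parameter range would illustrate the same point.

```python
# Minimal sketch: run a small (~4B parameter) model locally with 4-bit quantisation.
# Assumes: pip install torch transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # example checkpoint; swap in any 1-10B model

# 4-bit quantisation shrinks the weights to a few gigabytes of memory,
# which is what makes a workstation or a capable laptop a viable host.
quant_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on whatever local hardware is available
)

prompt = "Summarise the key risks in this change request: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Nothing about that requires a data centre; the same script runs against the same weights whether the box underneath is a GPU workstation or a laptop with enough memory.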

At the same time another efficiency lever is being explored. Some research groups are looking beyond parameter counts and focusing on how numerical precision is represented within the model itself. Approaches such as Microsoft's BitNet (see this LinkedIn article for more detail: https://www.linkedin.com/posts/microsoftresearch_microsoft-researchers-release-bitnetcpp-activity-7254944735516143616-IrQU?utm_source=share&utm_medium=member_desktop&rcm=ACoAAABruscBR1PHCxOo86iSynWMWluY91N7G0k) deliberately constrain weights to a very small set of values and adapt the training process around this constraint. The objective is not simply to compress a trained model after the fact, but to design architectures that are inherently cheaper to run because they require less computational precision in the first place.
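
As a rough illustration of that weight constraint (this is not Microsoft's bitnet.cpp implementation, just a sketch of the ternary idea), here is what mapping full-precision weights onto the values -1, 0 and +1 plus a single scale factor looks like:

```python
# Illustrative sketch of ternary weight quantisation in the spirit of BitNet.
# Real BitNet models are trained with this constraint; this only shows the mapping.
import numpy as np

def ternarise(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map full-precision weights to {-1, 0, +1} with one per-tensor scale."""
    scale = float(np.mean(np.abs(weights))) + 1e-8   # single scale for the whole tensor
    ternary = np.clip(np.round(weights / scale), -1, 1)
    return ternary.astype(np.int8), scale

w = np.random.randn(4, 4).astype(np.float32)          # stand-in for a trained weight matrix
w_ternary, s = ternarise(w)

# With weights restricted to -1, 0 and +1, the matrix multiply reduces to
# additions and subtractions of activations, followed by one multiply by the scale.
x = np.random.randn(4).astype(np.float32)
print("ternary weights:\n", w_ternary)
print("approx:", s * (w_ternary @ x))
print("exact: ", w @ x)
```

The point of the sketch is the shape of the saving: once weights can only take three values, most of the multiply-accumulate work collapses into additions, which is far cheaper to execute than full-precision arithmetic.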

Individually these developments might just look like technical optimisations. Taken together though, they begin to suggest a more structural change in how AI capability might be distributed. Frontier models will continue to exist because there are workloads that genuinely benefit from that scale, particularly in areas such as research, complex reasoning or large shared services. At the same time, improvements in efficiency make it increasingly plausible that a growing proportion of practical AI capability will sit much closer to the environments where decisions and actions actually occur.

Now granted, the historical comparison with computing is not exact, but the direction should be familiar to you. New forms of capability tend to emerge in large, specialised environments where resources can be concentrated. As the underlying techniques mature they become more efficient, which allows the same core ideas to appear in progressively smaller and more widely distributed settings.

If AI models continue to follow that path, the interesting question becomes less about how much intelligence can be centralised inside a single system and more about where that intelligence is most useful once the cost of deploying it falls far enough for it to sit much closer to the events, processes and physical environments where decisions are actually made.

When this becomes possible the value of these systems changes slightly as well. The benefit is not necessarily conversational sophistication. Instead, it is the ability to apply judgement at the point where an action needs to be taken. In practical terms that could mean models operating inside tools, devices, workflows or operational environments that currently rely on static logic or manual intervention.
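
As a purely hypothetical example of that shift, here is a sketch of a routing step that today relies on a static keyword rule, handled instead by a small model running locally through a zero-shot classification pipeline. The ticket text, team names and the facebook/bart-large-mnli checkpoint are all just illustrative choices:

```python
# Hypothetical sketch: replacing static routing logic with a small local model.
# Assumes the Hugging Face transformers library and a locally cached checkpoint.
from transformers import pipeline

ticket = "Users in the Cardiff office report intermittent VPN drops since this morning."

# Today: static logic, brittle to any wording it has not seen before.
def route_static(text: str) -> str:
    if "vpn" in text.lower() or "network" in text.lower():
        return "network-team"
    return "service-desk"

# Instead: a small local model applies judgement at the point where the action is taken.
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",  # example checkpoint; any local zero-shot model works
)
labels = ["network-team", "identity-team", "hardware-team", "service-desk"]
result = classifier(ticket, candidate_labels=labels)

print("static rule :", route_static(ticket))
print("local model :", result["labels"][0], f"(score {result['scores'][0]:.2f})")
```

The interesting part is not the classifier itself but where it runs: entirely inside the workflow, with no round trip to a shared service.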

Seen through that lens, the longer-term direction becomes more interesting to my mind. As models become smaller and cheaper to run, we will have more freedom to decide where intelligence belongs within a system rather than being forced to centralise it. The technical challenge then becomes less about scale and more about placement, because the real value will come from choosing where that intelligence quietly sits inside the architecture. So the real question is: where do you think it will sit?
