The Cloud World, Part Two: The Neocloud Reshuffle

Opening

Part one covered the general purpose cloud world: the hyperscalers, the European tier-two providers, the storage specialists, and the platform-as-a-service layer that picked up the indie developer market when Heroku killed its free tier. That is the boring half of the cloud universe. It is useful, important, the stuff every business needs to know about. Part two is where things get more interesting and considerably more unstable.

We are going to talk about neoclouds, which is the industry term that emerged in twenty twenty four to describe cloud providers built specifically for artificial intelligence workloads rather than general purpose enterprise computing. CoreWeave, Lambda Labs, Crusoe, Nebius, Nscale, Together A I, Fireworks A I, Replicate, Hyperbolic, and a long tail of smaller players. We are also going to talk about the specialty silicon companies that built non-NVIDIA chips designed specifically for inference, Groq and Cerebras and SambaNova, and how that entire thesis has been partially rewritten over the last six months. Pär already has Modal, fal dot eye eye, RunPod, and ThinkDiffusion, so we will skip those and focus on everything else.

The framing question for this episode is simple. The hyperscalers and tier-two providers we covered in part one charge a premium of three to six times what specialist providers charge for equivalent graphical processing unit capacity. Why does anyone still pay the hyperscaler tax for artificial intelligence workloads, and which of the alternatives are actually worth knowing about?

The neocloud thesis

The neocloud category exists because of a structural mismatch. The hyperscalers were architected for general purpose enterprise computing. Lots of customers, each running mixed workloads, with billing optimized for diverse usage patterns. Artificial intelligence workloads are the opposite. A small number of customers, each running concentrated graphical processing unit workloads, with billing dominated by a single line item. The hyperscalers can host artificial intelligence workloads, and they do, but their cost structure was not built for it. Their margins on graphical processing unit time are smaller than on general compute, which is why they charge a premium, and that premium opens the door for specialists who build their entire business around graphical processing unit economics.

CoreWeave is the undisputed leader of the neoclouds and the only provider currently rated Platinum tier on the ClusterMAX two point zero rating system, which is the industry standard published by Semi Analysis. The company started life as a cryptocurrency mining operation, pivoted to artificial intelligence infrastructure around twenty twenty one, raised more than seven billion dollars in funding with NVIDIA among its investors, went public on Nasdaq in late March twenty twenty five at forty dollars per share, and has cleared five billion dollars in annual revenue faster than any cloud provider in history.

CoreWeave's signature product is its Kubernetes-native graphical processing unit infrastructure, which means that if your team already thinks in pods and deployments, the platform feels native rather than foreign. The catalog spans ten graphical processing unit model families including the latest NVIDIA Blackwell two hundred and three hundred series. The networking layer is purpose-built around InfiniBand, which is the high-bandwidth low-latency fabric that distributed training requires. The hyperscalers either do not offer InfiniBand at all, in the case of Amazon and Google, or they offer it only on specialized instance families, in the case of Azure. CoreWeave makes it standard.

The big story for CoreWeave in twenty twenty six is the Meta commitment. In April Meta expanded its existing fourteen point two billion dollar arrangement with CoreWeave by another twenty one billion dollars through December twenty thirty two, bringing the total known commitment to approximately thirty five billion dollars. This is not overflow compute. This is strategic capacity reservation, more like dark fiber agreements or semiconductor supply contracts than ordinary cloud spending. CoreWeave's backlog after the Meta expansion sits at sixty six point eight billion dollars, and the company is guiding to twelve to thirteen billion in twenty twenty six revenue with thirty to thirty five billion in planned capital expenditure. The signal here is unmistakable. Hyperscalers are no longer treating artificial intelligence infrastructure as something they can source through normal vendor cycles. They are locking it in years ahead through specialist providers because inference at scale is becoming a supply chain problem.
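The arithmetic behind those headline numbers is worth seeing once. What follows is a minimal sketch in Python that uses only the figures quoted above, in billions of dollars. The variable names are ours, and the capital expenditure ratio is simply one way of reading the guidance, not a metric CoreWeave reports.

```python
# Figures quoted in the episode, in billions of US dollars.
existing_meta_commitment = 14.2
meta_expansion = 21.0
revenue_guidance = (12.0, 13.0)   # low end, high end
capex_guidance = (30.0, 35.0)     # low end, high end

# The expansion stacks on top of the existing arrangement.
total_meta_commitment = existing_meta_commitment + meta_expansion

# Planned capital expenditure per dollar of guided revenue, at the two extremes:
# lowest capex against highest revenue, and highest capex against lowest revenue.
capex_ratio_low = capex_guidance[0] / revenue_guidance[1]
capex_ratio_high = capex_guidance[1] / revenue_guidance[0]

print(f"Total Meta commitment: {total_meta_commitment:.1f} billion")
print(f"Capex per revenue dollar: {capex_ratio_low:.2f} to {capex_ratio_high:.2f}")
```

On the quoted guidance, the company plans to spend between two and three dollars on capital expenditure for every dollar of revenue it expects this year, which is what a supply chain build-out looks like on paper.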

Lambda Labs is the developer-first neocloud, billing itself as the artificial intelligence developer cloud. The company started as a hardware vendor selling configured workstations and servers to research teams, and the cloud business grew out of that hardware heritage. Lambda has one of the broadest graphical processing unit fleets in the market, ranging from older A one hundreds all the way through the latest B three hundred Blackwell generation. Pricing is on-demand only on the public price list, with multi-week and multi-year commitments quoted privately. Lambda offers a fifty percent academic discount, which makes it the default choice for university research teams. The leaseback arrangement with NVIDIA ensures graphical processing unit availability even during shortage periods.

The pitch for Lambda is one-click clusters. You provision a multi-node graphical processing unit cluster on an InfiniBand fabric in roughly two minutes. The pre-configured environment includes PyTorch, TensorFlow, the relevant compute unified device architecture drivers, and a Jupyter notebook on every instance. That is closer to click-and-train than any competing neocloud gets. The catch is that Lambda is its own walled garden, with no multi-cloud options. If you build on Lambda you stay on Lambda or you migrate everything when you leave.

Crusoe is the renewable energy neocloud. The pitch is data centers built on stranded gas or other energy sources that would otherwise be wasted, which gives Crusoe a structural cost advantage on power while reducing emissions compared to conventional grid-connected facilities. Crusoe operates a Wyoming facility, a Texas facility under construction, additional sites in Iceland and the Nordic region, and has secured roughly four and a half gigawatts of natural gas supply for the artificial intelligence build-out. The catalog spans nine graphical processing unit model families including the latest H two hundred and B two hundred parts, plus Advanced Micro Devices M I three hundred X for teams that want non-NVIDIA options. Crusoe serves customers through virtual machines built on cloud-hypervisor rather than on bare metal, an architectural choice that is still maturing toward bare-metal-class performance.

Nebius is the European neocloud, headquartered in the Netherlands with a clear European Union sovereignty story. Nebius runs leading-edge B two hundred and B three hundred capacity, uses lightweight Kubernetes-based virtual machines that have achieved bare-metal-class performance according to Semi Analysis benchmarks, and offers an enhanced throughput object storage tier built specifically for distributed training workloads where storage input output is the bottleneck. Nebius is rated Gold tier on ClusterMAX, sitting just below CoreWeave's Platinum.

Nscale is the newer European entrant, backed by NVIDIA, building what has been called the Stargate Norway project. The plan is one hundred thousand graphical processing units in Norway by the end of twenty twenty six, positioned as a sovereign European artificial intelligence training facility. Nscale is part of a broader European pattern we will come back to in part three. Sovereign capacity matters increasingly to European customers because of regulatory pressure, geopolitical considerations, and the simple practical fact that latency to compute matters when you are iterating on models in tight loops.

FluidStack is interesting because it does not own infrastructure. FluidStack aggregates graphical processing unit capacity from multiple data center operators and resells it as a unified pool. The model is software differentiation rather than physical infrastructure, similar to how Lambda Labs originally operated before they began building their own facilities. FluidStack recently partnered, with backing from Google, with Terawulf and Cipher Mining, two publicly traded American cryptocurrency mining companies, to deliver Gold-tier clusters specifically for leading artificial intelligence labs. This is one of the strangest supply chain stories in the entire artificial intelligence infrastructure market. Cryptocurrency miners have spent the last decade building data center capacity to run application specific integrated circuits for proof-of-work hashing, and now that capacity is being repurposed for artificial intelligence inference. Terawulf, Cipher Mining, I R E N slash Iris Energy, Hut Eight, V C V Digital, Applied Digital, and Core Scientific, which CoreWeave acquired outright, are all now active in artificial intelligence infrastructure.

The remaining neoclouds worth knowing about are mostly variations on the same themes. Hyperstack is among the lowest-priced H one hundred sources in the neocloud tier with a United Kingdom-based footprint. TensorDock and Cudo Compute aggregate capacity at the budget end. Hot Aisle focuses specifically on Advanced Micro Devices M I three hundred X. Sesterce, Lyceum, and Cirrascale operate at the reserved-only end of the market, where you commit to multi-year capacity in exchange for substantial discounts. Cirrascale notably hosts Cerebras and Graphcore options alongside NVIDIA, which is rare among neoclouds. Spheron and Northflank operate as multi-cloud aggregators where you deploy across Amazon, Google, Azure, Oracle, Civo, or bare-metal from a single interface.

The serverless inference layer

A different category of artificial intelligence infrastructure provider sits on top of the raw graphical processing unit clouds. These are serverless inference platforms, where you do not rent a graphical processing unit at all. You send a request to an application programming interface endpoint, and the platform routes it to whichever graphical processing unit happens to be available at that moment, billing you per token of input or output rather than per second of compute time. Pär already uses fal dot eye eye for image and video model inference, which is representative of this category. Here are the others worth knowing about.
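To make the billing difference concrete, here is a minimal sketch in Python that prices the same workload both ways. Every number in it is hypothetical and chosen only to make the arithmetic visible. The function names are ours, and no real provider's price list is implied.

```python
def per_token_cost(input_tokens, output_tokens,
                   usd_per_million_input, usd_per_million_output):
    """Cost of a workload billed per token, as on a serverless endpoint."""
    return (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000


def per_second_cost(gpu_seconds, usd_per_gpu_hour):
    """Cost of a workload billed for rented GPU time, busy or idle."""
    return gpu_seconds * usd_per_gpu_hour / 3600


# Hypothetical workload: 2 million input tokens and 500 thousand output tokens,
# or alternatively one full hour on a rented GPU that serves the same traffic.
serverless = per_token_cost(2_000_000, 500_000,
                            usd_per_million_input=0.50,
                            usd_per_million_output=1.50)
rented = per_second_cost(gpu_seconds=3600, usd_per_gpu_hour=2.50)

print(f"per-token bill:  ${serverless:.2f}")
print(f"per-second bill: ${rented:.2f}")
```

The point of the comparison is the shape rather than the totals. The per-token bill scales with traffic and drops to zero when nothing is running, while the per-second bill is the same whether the graphical processing unit is saturated or sitting idle.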

Together A I is one of the most established serverless inference providers, hosting over two hundred open weight models including the Llama family, the DeepSeek family, the Qwen family, Mixtral, and many smaller specialty models. Together is rated Silver tier on ClusterMAX, has signed deals to host G B two hundred clusters for trillion-parameter model inference, and exposes both an O p e n A I compatible application programming interface and dedicated endpoint hosting for teams that need consistent latency. Together has the broadest open weight model catalog of any inference provider currently in market.

Fireworks A I competes directly with Together A I on the same workloads. Fireworks specializes in optimized inference for open source models, claims faster latency on equivalent hardware through proprietary serving optimizations, offers fine-tuning workflows that let customers train custom variants of base models, and supports speculative decoding for additional throughput on supported architectures. The Fireworks free tier offers ten requests per minute without a payment method, jumping to six thousand requests per minute once a payment method is added.
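A requests-per-minute cap is easiest to respect from the client side by spacing calls evenly. The following is a minimal sketch in Python of that idea, using the two limits quoted above. The RequestPacer class is a hypothetical helper of ours, not part of any Fireworks software development kit, and the injectable clock and sleep arguments exist only so the behavior can be checked without waiting.

```python
import time


class RequestPacer:
    """Spaces calls evenly so a client stays under a requests-per-minute cap."""

    def __init__(self, requests_per_minute,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been sent yet

    def wait(self):
        """Block until the next request is allowed, then reserve that slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


free_tier = RequestPacer(requests_per_minute=10)
paid_tier = RequestPacer(requests_per_minute=6000)
print(f"free tier: one request every {free_tier.min_interval:.2f} seconds")
print(f"paid tier: one request every {paid_tier.min_interval:.2f} seconds")
```

In use you would call wait on the pacer immediately before each request. At ten requests per minute that means one call every six seconds, which is fine for poking at a model by hand and unusable for anything batch-shaped.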

Replicate is the model marketplace play. Anyone can publish a model to Replicate as a Cog container, and anyone can call any published model through a unified application programming interface. The catalog is dominated by image and video generation models like the Flux family, the Stable Diffusion family, video models like Wan and Sora-style variants, and a long tail of specialty models published by individual researchers. The pricing model is per second of graphical processing unit time on whichever hardware tier the model requires. Replicate is the closest comparison to fal dot eye eye in shape, though fal is faster and more polished for the specific image and video generation workloads Pär cares about.

Hyperbolic offers serverless inference at very aggressive pricing, billing roughly half of what Together or Fireworks charge for equivalent model access. The catch is that Hyperbolic's infrastructure is partially built on a peer-to-peer marketplace of graphical processing unit owners contributing capacity, which means that latency variance is higher and availability for specific models can be inconsistent. The pricing reflects the tradeoff.

OpenRouter is the aggregator that sits above all of the inference providers, exposing a single application programming interface that routes requests to the cheapest or fastest available provider for any given model. If you are prototyping across multiple models from multiple providers, OpenRouter lets you pay one bill and switch providers per request. The downside is the additional latency hop and the dependency on OpenRouter's availability.
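The routing idea itself is simple enough to sketch. What follows is a minimal illustration in Python of choosing the cheapest or fastest available provider for a request. It is not OpenRouter's actual algorithm, which we have not seen, and the provider names, prices, and latencies are all made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderQuote:
    name: str
    usd_per_million_tokens: float
    median_latency_ms: float
    available: bool = True


def pick_provider(quotes, prefer="cheapest"):
    """Return the best available quote, ranked by price or by latency."""
    candidates = [q for q in quotes if q.available]
    if not candidates:
        raise RuntimeError("no provider currently available for this model")
    if prefer == "cheapest":
        return min(candidates, key=lambda q: q.usd_per_million_tokens)
    if prefer == "fastest":
        return min(candidates, key=lambda q: q.median_latency_ms)
    raise ValueError(f"unknown preference: {prefer!r}")


# Hypothetical quotes for one model from three hypothetical providers.
quotes = [
    ProviderQuote("provider-a", 0.90, 420.0),
    ProviderQuote("provider-b", 0.45, 900.0),
    ProviderQuote("provider-c", 0.30, 650.0, available=False),
]

print(pick_provider(quotes, prefer="cheapest").name)
print(pick_provider(quotes, prefer="fastest").name)
```

Notice that the nominally cheapest provider is skipped because it is down. That is the real value of an aggregator, and it is also why the aggregator's own availability becomes your dependency.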

NVIDIA Inference Microservices, branded as N I M, is NVIDIA's own catalog of hosted inference endpoints. The interesting thing about N I M is that NVIDIA offers ninety one free endpoint models spanning language, vision, biology, simulation, and safety domains, with rate limits that let you actually use them for prototyping rather than just demos.

SambaNova's cloud, branded SambaNova Cloud, offers persistent free tier access to Llama three point three seventy billion parameters, Llama three point one up to four hundred and five billion parameters, Qwen two point five seventy two billion parameters, and a few other models, running on SambaNova's custom Reconfigurable Dataflow Units rather than NVIDIA graphical processing units. The free tier persists indefinitely beyond the initial five dollar credit. Rate limits run between ten and thirty requests per minute depending on model size.

Vast dot eye eye is the wild west of graphical processing unit rental. It is a marketplace where individual graphical processing unit owners list their hardware, set their own prices, and serve customer workloads. Pricing on Vast is often less than half of what neoclouds charge for equivalent specifications. The catch is that you might share hardware with other users, you might get preempted if demand spikes, and reliability is whatever the individual operator decides to offer. Vast is excellent for non-production workloads where cost dominates and interruption is tolerable. It is the wrong choice for any workload that matters.
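Tolerating interruption mostly comes down to checkpointing. Here is a minimal sketch in Python of a job loop that records its progress and resumes after a preemption. The function name, the file format, and the every-ten-steps interval are our own assumptions for illustration. A real training job would save model and optimizer state rather than just a step counter.

```python
import json
import os
import tempfile


def run_with_checkpoints(total_steps, checkpoint_path, do_step, every=10):
    """Run do_step(step) for each step, resuming from the last saved step.

    Returns the step the run started from (0 on a fresh start).
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_step"]

    for step in range(start, total_steps):
        do_step(step)
        if (step + 1) % every == 0 or step + 1 == total_steps:
            # Write to a temporary file and rename it into place, so a
            # preemption mid-write cannot leave a half-written checkpoint.
            directory = os.path.dirname(os.path.abspath(checkpoint_path))
            fd, tmp_path = tempfile.mkstemp(dir=directory)
            with os.fdopen(fd, "w") as f:
                json.dump({"next_step": step + 1}, f)
            os.replace(tmp_path, checkpoint_path)
    return start
```

If the host disappears at step twenty five, the next run picks up from step twenty and repeats only the work done since the last checkpoint. The checkpoint file has to live on storage that survives the instance, otherwise the resume has nothing to read.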

The specialty silicon reshuffle

The most dramatic change in artificial intelligence infrastructure over the last six months has been the partial collapse of the specialty silicon thesis. From roughly twenty twenty two through twenty twenty five, three companies dominated the conversation about non-NVIDIA chips designed specifically for inference workloads. Groq with its Language Processing Units. Cerebras with its Wafer Scale Engine. SambaNova with its Reconfigurable Dataflow Units. The thesis was simple. Inference workloads have different characteristics than training workloads, and a chip designed specifically for inference could achieve dramatically better latency and energy efficiency than a general purpose graphical processing unit. All three companies posted benchmark numbers showing ten to one hundred times faster token generation than equivalent NVIDIA capacity for specific model architectures.

That thesis is now in question.

In December twenty twenty five, NVIDIA paid approximately twenty billion dollars for Groq's inference technology, leadership team, and architecture in what was structured as a non-exclusive license wrapped in an acqui-hire. The disaggregated decode architecture that Groq had built, which was originally pitched as an alternative to NVIDIA's compute unified device architecture programming environment, is now integrated directly into compute unified device architecture through the new Agent Toolkit and the Dynamo orchestration layer. Graphical processing units handle the prefill portion of inference. Language processing units, now NVIDIA-controlled, specialize in the decode portion. Both run inside the same runtime, controlled by the same vendor. Customers no longer need to leave the dominant programming environment to access decode optimization patterns, because those patterns now come built in.

At NVIDIA's G T C twenty twenty six conference in March, the company unveiled the Vera Rubin N V L seventy two architecture, which delivers five times the inference performance of the previous Blackwell generation while integrating the absorbed Groq techniques as a first-class capability. The competitive moat that Groq had built on architectural differentiation evaporated the moment NVIDIA acquired the architecture. The Groq cloud business continues to operate independently, but its strategic narrative has fundamentally changed.

Cerebras, which sits in a different competitive position, is currently in the middle of its initial public offering. Pricing is expected this week with a target valuation around forty nine billion dollars and a raise that may reach four point eight billion dollars rather than the originally planned three point five billion. The Cerebras Wafer Scale Engine three is currently the only chip on the market that can serve the four hundred and five billion parameter Llama three point one model at more than a thousand tokens per second from a single chip. That is not a number that NVIDIA or anyone else can match through architecture changes alone, because the entire transformer computation graph maps onto a single wafer at compile time. The question for Cerebras after this initial public offering is whether the differentiated performance justifies a public market valuation given that NVIDIA has demonstrated, through the Groq acquisition, that it will buy specialty silicon outright when it threatens the platform.

SambaNova unveiled the S N fifty chip in February twenty twenty six, claiming five times faster inference than competing chips and three times lower total cost of ownership than equivalent graphical processing units. SambaNova continues to position itself for enterprise and batch workloads requiring large models, particularly the Llama three point one four hundred and five billion parameter variant, where SambaNova has been the cheapest viable option at around five dollars per million input tokens. The break-even point for SambaNova versus standard graphical processing unit deployments is roughly five hundred million daily tokens, which is a threshold that genuine production deployments cross routinely.
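A break-even point like that falls out of a one-line formula. The following is a minimal sketch in Python, assuming the comparison is between one option with a fixed daily cost and a lower per-token price, and another with no fixed cost and a higher per-token price. All three input numbers are hypothetical and were picked so the result lands on the figure quoted above. They are not derived from SambaNova's or anyone else's actual pricing.

```python
def break_even_daily_tokens(fixed_usd_per_day,
                            committed_usd_per_million,
                            on_demand_usd_per_million):
    """Daily token volume above which the fixed-cost option becomes cheaper.

    Solves fixed + committed * v = on_demand * v for v, where v is
    measured in millions of tokens, then converts back to tokens.
    """
    saving_per_million = on_demand_usd_per_million - committed_usd_per_million
    if saving_per_million <= 0:
        raise ValueError("the fixed-cost option never breaks even")
    return fixed_usd_per_day / saving_per_million * 1_000_000


# Hypothetical inputs: 1,500 dollars per day of fixed cost, 2 dollars per
# million tokens once committed, 5 dollars per million tokens on demand.
volume = break_even_daily_tokens(1500.0, 2.0, 5.0)
print(f"break-even at {volume:,.0f} tokens per day")
```

Below that volume the fixed cost dominates and the pay-as-you-go option wins. Above it every additional token widens the gap in favor of the committed deployment.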

The broader picture is one of consolidation under NVIDIA's umbrella. The number of inference providers ballooned from twenty seven in early twenty twenty five to ninety by the end of that year. The competitive pressure drove one of the most dramatic cost deflations in technology history. Inference equivalent to G P T four that cost twenty dollars per million tokens in late twenty twenty two now runs at approximately forty cents per million tokens. The Cerebras initial public offering will price into a market that has already reorganized around NVIDIA orchestration patterns, with most of the specialty silicon players either absorbed or marginalized.
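Those deflation numbers are easier to feel as ratios. Here is a minimal sketch in Python using only the figures quoted above. The variable names are ours.

```python
# Figures quoted in the episode.
old_price = 20.00   # USD per million tokens, late 2022
new_price = 0.40    # USD per million tokens, today
providers_early = 27
providers_late = 90

price_drop_factor = old_price / new_price
percent_decline = (1 - new_price / old_price) * 100
provider_growth = providers_late / providers_early

print(f"price fell by a factor of about {price_drop_factor:.0f}")
print(f"that is a decline of about {percent_decline:.0f} percent")
print(f"provider count grew about {provider_growth:.1f} times over")
```

A roughly fiftyfold price drop against a provider count that merely tripled says the deflation came from more than head count. Hardware generations and serving optimizations did most of the work, and the crowd of providers made sure the savings were passed on.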

What Pär actually needs to know

The neocloud category is genuinely useful for anyone doing concentrated graphical processing unit work at meaningful scale. The hyperscaler premium for graphical processing unit time is real, the alternatives are genuinely cheaper, and the developer experience on the better neoclouds has matured to the point where switching is no longer painful. For Pär specifically, the practical map looks like this. RunPod is already in his stack for persistent pods with Network Volumes, which is the right shape for ostris ai-toolkit LoRA training runs. Modal handles the serverless graphical processing unit workloads where cold starts do not matter as much as billing only when active. Fal dot eye eye covers the polished image and video generation endpoints.

The gaps are mostly around scale. If Pär ever needed to run a multi-node distributed training job on connected InfiniBand fabric, which he currently does not, Lambda Labs would be the right destination given the academic discount and the one-click cluster provisioning. If he ever needed sovereign European Union compute for a regulated workload, Nebius or Nscale would be the right destination. If he ever needed cheap inference on open weight models at scale beyond what Scaleway Generative APIs offers, Together or Fireworks would compete. None of these are pressing needs.

The specialty silicon reshuffle is more important to know about than to act on. Cerebras going public this week is a milestone worth watching even if Pär never runs inference there. The Groq acquisition by NVIDIA fundamentally changes the conversation about whether non-NVIDIA artificial intelligence chips are a viable long-term bet. SambaNova has staked out a position in large model inference that may or may not survive the next two years of NVIDIA roadmap updates. The space is in flux and the names that matter today may not all be standalone companies in three years.

In part three we will cover the truly strange end of the cloud world. Edge databases, the full developer surface that companies like Cloudflare and Supabase have built, quantum hardware across multiple vendors, decentralized compute marketplaces, and the European sovereign cloud build-out that is accelerating in twenty twenty six. That episode is the one with the most far-fetched possibilities, which is fitting for the section of this series that has been heading further from Pär's existing stack with each step.