AI Hosting Benchmarks: Comparing Inference Speed Across Providers

Published on May 06, 2026 in AI & Future of Hosting

By Arjun Mehta | May 06, 2026 | 7 min read

The Rise of AI Hosting: Why Inference Speed Defines the User Experience in 2026

AI hosting has emerged as one of the fastest-growing and most technically demanding segments of the web hosting industry. Unlike traditional web hosting, which primarily serves cached HTML pages, processes form submissions, and runs database queries — workloads that are well-understood and highly optimized after decades of infrastructure refinement — AI hosting must handle fundamentally different computational patterns. Large language model inference, image generation via diffusion models, embedding computation for semantic search, and real-time model fine-tuning all share a common characteristic: they are massively parallel, floating-point-intensive workloads that depend on GPU or specialized AI accelerator hardware rather than general-purpose CPUs. In 2026, the AI hosting inference speed benchmark conversation has moved from academic curiosity to practical business necessity, because the difference between a model that responds in 200 milliseconds and one that responds in 2,000 milliseconds is the difference between a product that feels instantaneous to users and one that feels broken.

This guide provides a data-driven comparison of inference performance across the major AI hosting providers, measured in tokens per second — the standard unit for evaluating the throughput of language model inference. We examine how inference speed varies across provider tiers, what infrastructure choices drive performance differences, how benchmark numbers translate to real-world application responsiveness, and what trade-offs exist between speed, cost, and reliability. The benchmarks presented here reflect testing conducted on standard open-weight models (primarily Llama 3 8B and Llama 3 70B parameter variants) to ensure reproducibility and comparability across providers whose proprietary infrastructure configurations differ. For background on what AI hosting actually is and how it differs from conventional hosting, our guide on understanding AI hosting as the next generation of web servers provides the foundational context, while technical standards like those maintained by the W3C inform how web technologies and AI workloads are converging at the standards level.

How Inference Speed Is Measured: Tokens Per Second and Time to First Token

Tokens Per Second: The Throughput Metric

In AI hosting, inference speed is primarily measured in tokens per second — the number of text tokens (roughly equivalent to 0.75 words in English) that the model generates in one second of processing time. When you send a prompt to a hosted language model, the provider's infrastructure processes your input through the model's neural network layers and generates a response token by token. A provider delivering 20 tokens per second generates roughly 15 words per second of readable English text, which feels responsive and natural to a human reading the output as it streams. A provider delivering 5 tokens per second generates text noticeably slower than a person can read, creating a frustrating user experience where the model's output appears to lag. For chat applications where users are waiting for responses, a minimum of 15 to 20 tokens per second is generally considered the threshold for acceptable interactivity.
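As a rough illustration of the conversion above, the sketch below turns a token throughput into an approximate reading speed. It assumes the 0.75 words-per-token rule of thumb from the text, which varies by tokenizer and language; the function names are illustrative only.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; varies by tokenizer and language


def words_per_second(tokens_per_second: float) -> float:
    """Approximate English words produced per second at a given token rate."""
    return tokens_per_second * WORDS_PER_TOKEN


def words_per_minute(tokens_per_second: float) -> float:
    """Same estimate expressed per minute, the unit reading speeds are usually quoted in."""
    return words_per_second(tokens_per_second) * 60


for tps in (5, 20, 100):
    print(f"{tps:>3} tok/s ~ {words_per_second(tps):.2f} words/s ~ {words_per_minute(tps):.0f} words/min")
```

Expressing throughput in words per minute makes it easier to compare a provider's streaming rate with how fast your own users actually read.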

Tokens per second is not a single number that applies uniformly across all models and prompt types. Smaller models with fewer parameters — like Llama 3 8B — process more tokens per second on the same hardware than larger models like Llama 3 70B because each forward pass through the network requires fewer floating-point operations. Similarly, shorter prompts with fewer input tokens reach the generation phase faster because the model spends less time processing the input context before it can begin generating output. A provider might advertise "100 tokens per second" but that number applies only to their smallest model on their highest-tier GPU instance with the shortest possible prompt — a configuration that rarely matches real-world usage. Throughout this guide, we specify model size, prompt length, and hardware tier for every benchmark result to provide context that makes the numbers meaningfully comparable.

Time to First Token: The Latency That Users Feel

Tokens per second measures throughput once generation begins, but it does not capture the full user experience. Time to First Token, or TTFT, measures the latency between the moment the user submits a prompt and the moment the first token of the model's response appears. TTFT is influenced by several factors that tokens-per-second metrics miss: the time required to load the model into GPU memory if it was not already pre-warmed (a cold start that can take 10 to 30 seconds); the time to process the entire input prompt through the model's layers before any token can be generated (this prefill time grows at least linearly with prompt length, so a 4,000-token system prompt takes far longer to process than a 50-token user query); and the network round-trip time between the user's client and the provider's inference endpoint. A provider with excellent tokens-per-second throughput but high TTFT due to cold starts or slow prompt processing will feel sluggish to users even if the text, once it begins appearing, streams at an impressive rate.

The interaction between TTFT and tokens per second explains why some AI hosting providers feel dramatically faster than others even when their raw throughput numbers are similar. A provider that keeps model instances pre-warmed in GPU memory, with long idle timeouts before spinning them down, can deliver TTFT in the 50 to 200 millisecond range, making the model feel instantly responsive. A provider that spins down idle instances and must cold-load the model for each new session may deliver TTFT of 5 to 15 seconds, creating a jarring delay that undermines the perception of intelligence and capability, even if the model, once loaded, generates tokens at a competitive rate. For applications where users interact with an AI assistant repeatedly throughout a session, the TTFT experience dominates the overall perception of speed.
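The two metrics can be measured separately from any streamed response. The sketch below is a minimal example that assumes the response arrives as an iterable of tokens; `measure_stream` and the scripted clock are illustrative names rather than any provider's API, and the timer starts when the function is called, standing in for the moment the request is sent.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, tokens_per_second, token_count) for one streamed response.

    Throughput is computed over the generation phase only (first token to
    last token), so a slow first token does not dilute the streaming rate.
    """
    start = clock()
    first = last = None
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    gen_time = last - first
    tps = (count - 1) / gen_time if gen_time > 0 else float("nan")
    return ttft, tps, count


# Deterministic demo: a scripted clock replaces a real endpoint.
ticks = iter([0.0, 0.25, 0.5, 0.75, 1.0, 1.25])
ttft, tps, n = measure_stream(["a", "b", "c", "d", "e"], clock=lambda: next(ticks))
print(f"TTFT={ttft * 1000:.0f} ms, {tps:.1f} tok/s over {n} tokens")
```

In a real test, `tokens` would be the iterator returned by a streaming client, and the default `time.perf_counter` clock would be used.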

AI Hosting Provider Benchmark: Llama 3 8B Inference Speed

Benchmark Methodology and Testing Parameters

The benchmarks presented in this section were conducted using a standardized testing protocol to ensure fair comparison across providers. Tests used the open-weight Meta Llama 3 8B Instruct model served at FP16 precision (half precision, not quantized), with a 512-token system prompt and a standardized set of 50 user prompts averaging 120 input tokens each, generating responses capped at 512 output tokens. All tests were conducted from a testing node located in Virginia, USA, connected via 1 Gbps fiber, to minimize the impact of network latency variance on the results. Each provider was tested in three configurations where available: their lowest-cost or shared GPU tier, a mid-tier dedicated GPU instance, and their highest-performance offering. Results represent the median of 50 test runs after discarding the first five runs to account for cold-start warmup variance.
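The aggregation step of this protocol can be sketched in a few lines. The example assumes the warmup runs are dropped from the front of the sample list before the median is taken; the text does not say whether the 50 runs are counted before or after the discard, so the function takes the warmup count as a parameter. The sample values are made up for illustration.

```python
from statistics import median
from typing import Sequence


def summarize_runs(samples: Sequence[float], warmup: int = 5) -> float:
    """Median of the samples after dropping the first `warmup` runs."""
    steady = samples[warmup:]
    if not steady:
        raise ValueError("no samples left after discarding warmup runs")
    return median(steady)


# Hypothetical tokens/sec readings: five slow warmup runs, then steady state.
readings = [40, 55, 70, 80, 88, 95, 96, 94, 97, 95]
print(summarize_runs(readings))
```

The median is less sensitive than the mean to an occasional slow run caused by a noisy neighbor or a network hiccup.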

Mainstream Provider Benchmarks for 8B Models

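| Provider | Tier | Tokens/sec | TTFT (ms) | Price | Hardware |
| --- | --- | --- | --- | --- | --- |
| Together AI | Serverless | 95 | 180 | $0.20/1M tokens | Shared A100 |
| Together AI | Dedicated | 135 | 90 | $1.50/hour | Dedicated A100 |
| Fireworks AI | Serverless | 110 | 150 | $0.20/1M tokens | Shared A100 |
| Fireworks AI | Dedicated | 145 | 80 | $1.60/hour | Dedicated A100 |
| Replicate | On-Demand | 45 | 820 | $0.0007/sec | Shared A40 |
| Replicate | Dedicated | 90 | 320 | $0.90/hour | Dedicated A100 |
| Modal | Serverless | 105 | 240 | $0.25/1M tokens | Shared A100 |
| Modal | Dedicated | 130 | 95 | $1.40/hour | Dedicated A100 |
| Hugging Face IE | Serverless | 55 | 600 | $0.20/hour | Shared T4 |
| Hugging Face IE | Dedicated | 85 | 250 | $1.30/hour | Dedicated A10G |
| RunPod | Serverless | 75 | 450 | $0.30/1M tokens | Shared A40 |
| RunPod | Dedicated | 120 | 100 | $1.10/hour | Dedicated A100 |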

Fireworks AI posts the highest 8B throughput on both tiers (110 tokens per second serverless, 145 dedicated). Together AI is second on the dedicated tier at 135, while Modal's serverless tier (105) edges out Together AI's (95). All three serverless results are well above the interactive usability threshold. The dedicated tier advantage is substantial across all providers, ranging from roughly 25% to 100% more tokens per second than the same provider's shared tier. Where the GPU model stays the same (Together AI, Fireworks AI, Modal), the gain is 24% to 42% and comes from loading the model entirely in GPU memory without contention from other tenants' workloads and without the overhead of context switching between multiple inference requests. The larger gains at Hugging Face, RunPod, and Replicate (55% to 100%) also reflect a move to a faster GPU on the dedicated tier. Hugging Face Inference Endpoints on shared T4 hardware lag the field significantly at 55 tokens per second, a reflection of the T4's older architecture and lower memory bandwidth relative to the A100, though Hugging Face's dedicated A10G tier is more competitive. For any application where inference speed directly impacts user experience or revenue (customer-facing chatbots, real-time content generation, AI-powered search), the performance gap between the fastest and slowest providers is large enough that provider selection becomes a product decision, not just an infrastructure cost optimization.
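The tier gap can be recomputed per provider directly from the table rows. The sketch below hard-codes the tokens-per-second figures from the 8B table; the dictionary and function names are illustrative. Note that for Replicate, Hugging Face, and RunPod the dedicated tier also uses a different GPU model, so their uplift mixes a hardware change with the move to dedicated capacity.

```python
# (shared or serverless, dedicated) tokens/sec, copied from the 8B table above
RESULTS_8B = {
    "Together AI": (95, 135),
    "Fireworks AI": (110, 145),
    "Replicate": (45, 90),
    "Modal": (105, 130),
    "Hugging Face IE": (55, 85),
    "RunPod": (75, 120),
}


def dedicated_uplift_pct(shared: float, dedicated: float) -> float:
    """Percentage gain in tokens/sec when moving from the shared to the dedicated tier."""
    return (dedicated / shared - 1) * 100


uplifts = {name: dedicated_uplift_pct(s, d) for name, (s, d) in RESULTS_8B.items()}
for provider, pct in sorted(uplifts.items(), key=lambda kv: kv[1]):
    print(f"{provider:<16} +{pct:.0f}%")
```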

Llama 3 70B Inference Benchmarks: When Model Size Changes Everything

The Scaling Challenge of Large Model Inference

Running inference on a 70-billion-parameter model like Llama 3 70B is a qualitatively different challenge from running an 8B model. The 70B model requires approximately 140 GB of GPU memory for its weights in FP16 precision, meaning that a single A100 with 80 GB of VRAM cannot hold the model in memory without quantization. Providers must either quantize the model to 4-bit or 8-bit precision (reducing memory requirements to roughly 35 to 70 GB at the cost of some output quality degradation) or shard the model across multiple GPUs using tensor parallelism, where each layer's weight matrices are split across GPUs and partial results are exchanged between them over NVLink or NVSwitch interconnects. This sharding approach introduces communication overhead that reduces tokens per second relative to what each GPU could achieve independently, and the efficiency of the sharding implementation (how well the provider's serving framework overlaps computation with communication) becomes a major differentiator between providers.
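The memory figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces them under the assumption that only the weights are counted; the KV cache, activations, and framework overhead need additional room, so the GPU count it returns is a lower bound. The names are illustrative.

```python
import math

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1e9 params x bytes per param / 1e9)."""
    return params_billions * BYTES_PER_PARAM[precision]


def min_gpus(params_billions: float, precision: str, vram_gb: float = 80.0) -> int:
    """Lower bound on the number of GPUs needed just to hold the weights."""
    return math.ceil(weight_memory_gb(params_billions, precision) / vram_gb)


for precision in ("fp16", "int8", "int4"):
    gb = weight_memory_gb(70, precision)
    print(f"70B @ {precision}: {gb:.0f} GB, at least {min_gpus(70, precision)} x 80 GB GPU")
```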

70B Model Benchmarks Across Providers

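| Provider | Tier | Tokens/sec | TTFT (ms) | Price | Hardware |
| --- | --- | --- | --- | --- | --- |
| Together AI | Serverless | 38 | 420 | $0.90/1M tokens | 2x A100 |
| Together AI | Dedicated | 55 | 180 | $3.80/hour | 4x A100 |
| Fireworks AI | Serverless | 42 | 380 | $0.80/1M tokens | 2x A100 |
| Fireworks AI | Dedicated | 60 | 160 | $4.00/hour | 4x A100 |
| Replicate | Dedicated | 32 | 650 | $2.80/hour | 2x A100 |
| Modal | Dedicated | 50 | 220 | $3.60/hour | 4x A100 |
| RunPod | Dedicated | 48 | 250 | $2.90/hour | 2x A100 |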

For 70B models, the throughput numbers drop dramatically compared to 8B models: the fastest dedicated tier achieves 60 tokens per second, roughly 40% of the throughput of the same provider's 8B dedicated configuration (145). This follows from how decoding works. Every generated token requires reading all of the model's weights, so the per-token cost grows roughly in proportion to parameter count, and a model with 8.75x as many parameters needs far more memory traffic and computation per token. Sharding the model across several GPUs recovers part of that gap, which is why throughput falls by less than 8.75x. Serverless tiers for 70B models deliver 38 to 42 tokens per second, which is still above the interactive usability threshold but with noticeably higher latency on the first token. The cost difference is equally stark: generating one million tokens on a 70B model serverless costs $0.80 to $0.90, versus $0.20 for 8B at the same providers, reflecting the multi-GPU infrastructure required. For businesses deploying AI features, this cost differential creates a genuine engineering decision: is the quality improvement from a 70B model worth roughly 4x to 4.5x the inference cost per user interaction?

Infrastructure Factors That Determine Inference Speed

GPU Architecture and Memory Bandwidth

The single largest determinant of inference speed is the GPU hardware, and within the GPU, memory bandwidth is the most critical specification for transformer inference. Unlike training, where computational throughput (FLOPS) is the primary bottleneck, token-by-token generation on large language models is typically memory-bandwidth-bound: the model weights must be read from GPU memory for every token generated, and the speed at which those weights can be read, measured in gigabytes or terabytes per second of memory bandwidth, caps tokens per second regardless of how many floating-point operations the GPU's compute units can theoretically perform. An NVIDIA A100 with 1,555 GB/s of memory bandwidth (the 40 GB variant's specification; the 80 GB variant reaches roughly 2,000 GB/s) can serve an 8B model at roughly 140 tokens per second in optimal conditions, while an older T4 with 320 GB/s of bandwidth peaks around 55 to 60 tokens per second for the same model. The ranking follows bandwidth, although the measured gap (about 2.4x) is smaller than the 4.9x bandwidth ratio, because kernel efficiency, batching, and numeric precision also affect the result.

The next generation of hardware widens this gap dramatically. NVIDIA H100 GPUs, which are becoming more widely available in AI hosting environments in 2026, feature 3,350 GB/s of HBM3 memory bandwidth — more than double the A100 — and can serve 8B models at 250 to 300 tokens per second, and 70B models at 80 to 100 tokens per second on a single GPU for quantized versions. Providers that have invested early in H100 infrastructure — including Together AI and Fireworks AI, which were among the first to offer H100 instances — deliver inference speeds that are effectively a generational leap ahead of providers still running A100 or A40 fleets. When comparing providers, the GPU generation is often more informative than the vCPU count or RAM specification, because for AI workloads, the GPU is not an accelerator — it is the server, and everything else is supporting infrastructure.
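The bandwidth argument can be turned into a back-of-the-envelope estimate: if every weight is read once per generated token, a single request cannot decode faster than memory bandwidth divided by model size. The sketch below uses the bandwidth figures quoted in the text; the function name is illustrative. Treat the result as a rough single-request estimate rather than a prediction, since batching and speculative decoding amortize weight reads across several tokens, and real kernels rarely reach peak bandwidth.

```python
def decode_ceiling_tps(
    bandwidth_gb_s: float,
    params_billions: float,
    bytes_per_param: float = 2.0,  # 2.0 = FP16, 1.0 = 8-bit, 0.5 = 4-bit
) -> float:
    """Rough upper bound on single-request tokens/sec for memory-bound decoding."""
    model_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_gb


GPU_BANDWIDTH_GB_S = {"T4": 320, "A100": 1555, "H100": 3350}  # figures quoted in the text

for gpu, bw in GPU_BANDWIDTH_GB_S.items():
    print(f"{gpu:<5} 8B FP16: {decode_ceiling_tps(bw, 8):6.1f} tok/s   "
          f"70B 4-bit: {decode_ceiling_tps(bw, 70, 0.5):6.1f} tok/s")
```

Comparing a provider's measured single-stream throughput against this estimate gives a quick sense of whether its numbers rely on quantization or other optimizations.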

Serving Frameworks and Optimization Techniques

Hardware alone does not determine inference speed; the software stack that sits between the model weights and the user request matters enormously. The major AI hosting providers differentiate themselves through custom serving frameworks that implement advanced optimization techniques. Continuous batching, where multiple incoming requests are dynamically grouped together to saturate the GPU's compute units more efficiently than processing requests one at a time, can increase throughput by 2x to 5x compared to naive sequential processing. Speculative decoding, where a smaller draft model generates candidate tokens that the main model verifies in parallel, can accelerate generation by 1.5x to 2x with no quality degradation. KV-cache quantization reduces the memory footprint of the attention key-value cache, allowing larger batch sizes and more concurrent users on the same hardware. FlashAttention-3, the latest iteration of the memory-efficient attention algorithm, reduces GPU memory usage during inference in ways that directly translate to higher throughput by freeing memory bandwidth for weight access rather than attention computation.
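To see why continuous batching helps, the toy simulation below counts decode steps for a queue of requests with different output lengths. It is a sketch under a simplifying assumption: each decode step takes the same time no matter how many batch slots are occupied and emits one token per active request. It does not model any particular serving framework, and the function names are illustrative.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes; freed slots sit idle."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """A freed slot is refilled from the queue before the next decode step."""
    queue = deque(lengths)
    active: List[int] = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step emits one token per active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Output lengths (in tokens) for eight queued requests: mostly short, two long.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print("static:    ", static_batching_steps(lengths, batch_size=4), "steps")
print("continuous:", continuous_batching_steps(lengths, batch_size=4), "steps")
```

The gain grows with the spread of output lengths: when all requests are the same length, the two schedules take the same number of steps.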

The quality of a provider's serving framework implementation is the reason that two providers using the same model on the same GPU hardware can deliver materially different tokens per second. Providers that invest in custom serving infrastructure — Fireworks AI's proprietary inference engine, Together AI's optimized stack, Modal's serverless compiler — consistently outperform providers that rely on stock vLLM or Text Generation Inference (TGI) deployments, sometimes by 30% to 50%. For businesses evaluating AI hosting providers, the software layer is both more important and harder to evaluate pre-purchase than the hardware specifications. Our recommendation at Hosting Captain is to run a standardized inference test with your actual model and prompt patterns during any provider's trial or evaluation period — synthetic benchmarks provide directional guidance, but only your workload on their infrastructure reveals the performance you will experience in production. For more on how AI automation intersects with hosting management, see our coverage of AI agents managing hosting automation tools in 2026.

Cost-Performance Analysis: Optimizing for Your Workload

Serverless vs. Dedicated: When Each Makes Sense

Serverless AI hosting, where you pay per token or per request rather than for a dedicated GPU instance, offers compelling economics for spiky or low-volume workloads. If your application receives 500 inference requests per day, paying $0.20 per million tokens on a serverless endpoint is dramatically cheaper than renting a dedicated A100 at $1.50 per hour that sits idle for 23 of the 24 hours in a day. Serverless also reduces cold-start concerns for intermittent workloads because the provider manages a pool of pre-warmed instances, although tiers that spin down idle instances can still incur cold starts. However, serverless performance is inherently variable: during peak usage periods, your requests may queue behind other customers' workloads, and the provider may route your inference to machines with higher contention, resulting in degraded tokens per second. Serverless is the right choice for prototyping, low-to-medium volume production applications, and any workload where paying only for what you use matters more than performance consistency.

Dedicated instances, where you reserve specific GPU hardware for your exclusive use, deliver consistent performance regardless of other customers' activity. For applications where inference speed directly impacts user experience and revenue — customer-facing chatbots, AI-powered product recommendation engines, real-time content moderation systems — the performance consistency of dedicated instances justifies the higher cost. Dedicated instances also allow you to fine-tune the serving configuration: batch size, maximum sequence length, quantization parameters, and concurrency settings can all be optimized for your specific workload in ways that serverless platforms with shared configurations cannot accommodate. The operational consideration is that dedicated instances require management — you must monitor GPU utilization, scale instance count with demand, and handle instance failures — whereas serverless abstracts all of that away. Many businesses run a hybrid approach: serverless for development, staging, and low-traffic internal tools, with dedicated instances for production customer-facing services. For the hosting infrastructure that supports your AI application's non-inference components — web servers, databases, API gateways — our complete VPS hosting guide explains the virtual server options that complement AI hosting deployments.
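The serverless-versus-dedicated decision can be framed as a break-even volume: the daily token count at which a day of dedicated instance time costs the same as paying per token. The sketch below uses the Together AI 8B prices from the benchmark table ($1.50 per hour dedicated, $0.20 per million tokens serverless) and assumes the instance runs 24 hours a day and that input and output tokens are billed at the same rate. The function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def breakeven_tokens_per_day(dedicated_usd_per_hour: float, serverless_usd_per_million: float) -> float:
    """Daily token volume at which dedicated and serverless cost the same."""
    daily_dedicated_usd = dedicated_usd_per_hour * 24
    return daily_dedicated_usd / serverless_usd_per_million * 1_000_000


def required_tokens_per_second(tokens_per_day: float) -> float:
    """Sustained aggregate throughput needed to serve that volume around the clock."""
    return tokens_per_day / SECONDS_PER_DAY


tokens = breakeven_tokens_per_day(1.50, 0.20)
print(f"break-even: {tokens / 1e6:.0f}M tokens/day "
      f"(~{required_tokens_per_second(tokens):.0f} tok/s sustained)")
```

The sustained rate this implies is far above any single-stream figure in the tables, so a dedicated instance only reaches break-even when it serves many concurrent requests in batches.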

How Inference Speed Affects Your Business Metrics

The Revenue Impact of Response Latency

Inference speed is not an abstract technical metric; it translates directly to business outcomes through user behavior. Research on user tolerance for latency in conversational interfaces, consolidated from multiple studies across 2023 to 2025, consistently finds that response times under 500 milliseconds are perceived as instantaneous, 500 to 2,000 milliseconds are noticeable but tolerable, and anything beyond 2,000 milliseconds triggers frustration, task abandonment, and reduced trust in the system's intelligence. For an e-commerce product recommendation chatbot, a 2,000-millisecond TTFT plus 3,000 milliseconds of token generation (at 20 tokens per second for a 60-token response) means the user waits five full seconds for an answer — long enough that many users will navigate away or assume the system is broken. At 50 tokens per second with a 200-millisecond TTFT, that same interaction completes in under 1.5 seconds, which feels responsive and retains user engagement.
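The arithmetic in this example is total response time = TTFT + output tokens / tokens per second. A minimal sketch follows, with the perception thresholds taken from the paragraph above; the function names and labels are illustrative.

```python
def response_time_ms(ttft_ms: float, output_tokens: int, tokens_per_second: float) -> float:
    """Time until the full response has streamed: TTFT plus generation time."""
    return ttft_ms + output_tokens * 1000 / tokens_per_second


def perceived(ms: float) -> str:
    """Bucket a response time using the thresholds quoted in the text."""
    if ms < 500:
        return "instantaneous"
    if ms <= 2000:
        return "tolerable"
    return "frustrating"


for ttft, tps in ((2000, 20), (200, 50)):
    total = response_time_ms(ttft, 60, tps)
    print(f"TTFT {ttft} ms at {tps} tok/s -> {total:.0f} ms ({perceived(total)})")
```

With streaming output the user starts reading at TTFT rather than at the end of the response, so the total is a worst-case figure for perceived wait.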

The economic impact compounds at scale. If an AI feature serves 100,000 user interactions per month and a slow inference provider causes 5% of users to abandon the interaction due to latency — a conservative estimate based on published industry benchmarks — that represents 5,000 lost opportunities per month. For a paid AI product where each interaction contributes to retention or conversion, the revenue impact of choosing a slow provider can exceed the cost savings of their lower per-token pricing. Hosting Captain's analysis of AI hosting costs consistently finds that the cheapest provider per token is rarely the cheapest per successful user interaction, because slow inference drives abandonment that negates the cost savings. Evaluating providers on speed-adjusted cost — total cost divided by completed interactions rather than raw token count — provides a more accurate picture of the true cost of your AI hosting decision.
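Speed-adjusted cost, as described above, is total spend divided by completed interactions. The sketch below uses the 100,000 monthly interactions and 5% abandonment rate from the text; the two monthly bills are hypothetical numbers chosen only to show how a cheaper bill can cost more per completed interaction.

```python
def cost_per_completed(total_cost_usd: float, interactions: int, abandonment_rate: float) -> float:
    """Speed-adjusted cost: spend divided by interactions the user actually finished."""
    completed = interactions * (1.0 - abandonment_rate)
    if completed <= 0:
        raise ValueError("no completed interactions")
    return total_cost_usd / completed


INTERACTIONS = 100_000                  # monthly volume from the text
SLOW_COST, SLOW_ABANDON = 98.0, 0.05    # hypothetical bill; 5% abandonment from the text
FAST_COST, FAST_ABANDON = 100.0, 0.0    # hypothetical bill and abandonment rate

slow = cost_per_completed(SLOW_COST, INTERACTIONS, SLOW_ABANDON)
fast = cost_per_completed(FAST_COST, INTERACTIONS, FAST_ABANDON)
print(f"lost interactions on slow provider: {INTERACTIONS * SLOW_ABANDON:.0f}")
print(f"slow: ${slow:.6f} per completed interaction, fast: ${fast:.6f}")
```

This only captures hosting spend per completed interaction; the revenue lost from each abandoned interaction would be added on top in a fuller model.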

The Future of AI Inference Hosting: Trends to Watch

The AI inference hosting landscape in 2026 is evolving rapidly, with several technological and market trends poised to reshape the benchmark comparisons in the near term. The transition from A100 to H100 and eventually B200 GPUs across provider fleets will push tokens per second into the 200 to 500 range for 8B models on dedicated instances, making inference speed a solved problem for small models and shifting the competitive battleground to large model (70B+) performance and cost. Edge inference — running smaller, quantized models directly on the user's device for latency-critical first interactions while the cloud model warms up — is becoming feasible as on-device neural engines in smartphones and laptops reach performance levels sufficient for 1B to 3B parameter models at interactive speeds. Providers that offer hybrid edge-plus-cloud inference pipelines will deliver TTFT that is effectively zero from the user's perspective.

On the economic front, the cost of GPU compute continues to decline as NVIDIA increases production volumes and AMD's Instinct MI300X and Intel's Gaudi 3 accelerators offer competitive alternatives that can run the same models at lower cost per token. The standardization of inference APIs — with OpenAI's API format becoming a de facto industry standard that multiple providers now implement natively — reduces switching costs and allows businesses to benchmark and migrate between providers more fluidly. At Hosting Captain, we expect the AI hosting market to follow the trajectory of cloud computing: an initial phase of rapid differentiation and high margins will give way to commoditization of basic inference, with providers competing on reliability, ecosystem integration, and specialized services rather than raw tokens per second. For a broader perspective on how AI is reshaping the hosting industry, our analysis of what happens to web hosting demand as AI agents browse the internet explores the macroeconomic implications.

Frequently Asked Questions

What is considered good inference speed for AI hosting in 2026?

For 8-billion-parameter models, 80 to 100 tokens per second on a serverless tier and 120 to 150 tokens per second on a dedicated tier represent good performance on current-generation hardware (A100). For 70-billion-parameter models, 35 to 45 tokens per second serverless and 50 to 65 tokens per second dedicated are the competitive benchmarks. For interactive chat applications, Time to First Token under 300 milliseconds with streaming output above 20 tokens per second generally feels responsive to users.

Does GPU generation matter more than tokens per second numbers?

GPU generation is the primary determinant of maximum possible tokens per second, but the serving framework and optimization stack can swing actual performance by 30% to 50% on the same hardware. When comparing providers, pay attention to both the GPU hardware they offer and whether they use custom serving infrastructure versus stock open-source serving frameworks. The combination of an H100 GPU with an unoptimized serving stack can underperform an A100 with a highly optimized custom framework.

How much does fast AI inference hosting cost?

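Serverless inference for 8B models costs roughly $0.20 to $0.30 per million tokens, which translates to approximately $0.0002 to $0.0004 for a chat interaction of about 1,100 tokens (the benchmark's 512-token system prompt, 120-token user prompt, and up to 512 output tokens). Longer conversation histories raise that figure proportionally. Dedicated A100 instances cost $1.10 to $1.60 per hour and can serve 8B models at 120 to 150 tokens per second. At steady-state utilization, dedicated instances become more cost-effective than serverless at approximately 90 to 190 million tokens per day ($26 to $38 per day of instance time divided by $0.20 to $0.30 per million tokens), a volume that a single instance can only reach by batching many concurrent requests. For 70B models, serverless costs $0.80 to $0.90 per million tokens, and dedicated multi-GPU instances cost $2.80 to $4.00 per hour.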
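Per-interaction cost on a per-token plan is just total tokens times the price per million. The sketch below evaluates it for the prompt shape used in the benchmark methodology (512-token system prompt, 120-token user prompt, up to 512 output tokens), assuming input and output tokens are billed at the same rate, which not every provider does. Interactions that carry long conversation histories or large contexts cost proportionally more.

```python
def interaction_cost_usd(input_tokens: int, output_tokens: int, usd_per_million: float) -> float:
    """Cost of one request, assuming input and output tokens bill at the same rate."""
    return (input_tokens + output_tokens) * usd_per_million / 1_000_000


# Prompt shape from the benchmark methodology.
INPUT_TOKENS = 512 + 120   # system prompt + average user prompt
OUTPUT_TOKENS = 512        # response cap

for price in (0.20, 0.30):
    cost = interaction_cost_usd(INPUT_TOKENS, OUTPUT_TOKENS, price)
    print(f"${price:.2f}/1M tokens -> ${cost:.6f} per interaction")
```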

Can I run AI inference on a regular VPS without a GPU?

CPU-only inference is possible but slow: an 8B model on a high-end CPU server with 16 cores and optimized inference runtimes like llama.cpp with Q4 quantization might achieve 5 to 10 tokens per second — functional for batch processing or non-interactive use cases but well below the threshold for acceptable user-facing interactivity. CPU inference becomes impractical for larger models (70B+) where the memory requirements and computational demands push tokens per second below 1. GPU or dedicated AI accelerator hardware is necessary for production AI applications where response time matters.

Which AI hosting provider is fastest overall?

Based on our benchmarks, Fireworks AI and Together AI lead the field across both 8B and 70B models, with Fireworks holding a slight edge in throughput and Together AI offering more flexible pricing options. The one exception is the 8B serverless tier, where Modal (105 tokens per second) places between Fireworks AI (110) and Together AI (95). Modal and RunPod offer competitive dedicated performance at lower price points but with fewer optimization features. The fastest provider for your specific use case depends on your model, traffic patterns, and budget, so running your own benchmark with your actual model and prompts during evaluation periods is always the most reliable way to choose.

Arjun Mehta

Dedicated Server Specialist

Arjun Mehta is a cloud infrastructure consultant specializing in bare-metal architectures, network routing, and high-traffic database clustering.
