← All hardware
Groq LPU logo
hardware Data Center AI Accelerators

Groq LPU

by Groq

A deterministic inference engine built for the lowest latency per token.

Pros

  • Industry-leading low latency and TTFT
  • Predictable, deterministic performance
  • Easy OpenAI-compatible cloud access

Cons

  • Tiny per-chip SRAM (230 MB) needs many chips for big models
  • Inference-only; no training
  • Throughput trails wafer-scale rivals on some models

✓ Where it shines / best for

  • Real-time, latency-critical LLM applications (chat, agents, voice)
  • Developers wanting fast, cheap open-model inference via API
  • High-throughput token generation at scale

✕ Not the best fit for

  • Model training (LPU is inference-only)
  • Workloads needing proprietary closed models not on Groq
  • Very large memory-footprint single-model serving without sharding

Features

  • ✓ API access
  • ✓ Inference
  • ✓ High Throughput
  • ✓ Free tier
  • ✓ Open source
  • ✓ Openai Compatible
  • ✓ Real-time
  • ✓ Low latency
  • ✓ Lpu
  • ✓ Open Models

Pricing

PlanPriceBillingNotes
Free tier$0ongoingFree GroqCloud developer access with rate limits for evaluation.
Developer / Pay-as-you-gofrom ~$0.05–$0.79per 1M input tokensPer-token pricing varies by model (e.g., Llama models among the cheapest; larger/MoE models higher).
Developer / Pay-as-you-go (output)from ~$0.08–$0.99per 1M output tokensOutput tokens priced higher than input; exact rate is model-dependent.
Enterprise / On-premCustom quotecontractDedicated capacity, higher rate limits, and on-prem LPU hardware deployments via sales.

Pricing verified from the official source. Prices change often — confirm on the vendor's site before buying.

Specifications

useinference-only
latencysub-100ms time-to-first-token
throughputhundreds to 1,000+ tokens/sec (model-dependent)
architectureLPU (deterministic dataflow), SRAM-based
on_chip_sram230 MB per chip
Sponsored

A full review is being generated for this product and will appear here shortly.

Compare with

Compare
Compare