← All hardware
A deterministic inference engine built for the lowest latency per token.
Pros
- Industry-leading low latency and TTFT
- Predictable, deterministic performance
- Easy OpenAI-compatible cloud access
Cons
- Tiny per-chip SRAM (230 MB) needs many chips for big models
- Inference-only; no training
- Throughput trails wafer-scale rivals on some models
✓ Where it shines / best for
- Real-time, latency-critical LLM applications (chat, agents, voice)
- Developers wanting fast, cheap open-model inference via API
- High-throughput token generation at scale
✕ Not the best fit for
- Model training (LPU is inference-only)
- Workloads needing proprietary closed models not on Groq
- Very large memory-footprint single-model serving without sharding
Features
- ✓ API access
- ✓ Inference
- ✓ High Throughput
- ✓ Free tier
- ✓ Open source
- ✓ Openai Compatible
- ✓ Real-time
- ✓ Low latency
- ✓ Lpu
- ✓ Open Models
Pricing
| Plan | Price | Billing | Notes |
|---|---|---|---|
| Free tier | $0 | ongoing | Free GroqCloud developer access with rate limits for evaluation. |
| Developer / Pay-as-you-go | from ~$0.05–$0.79 | per 1M input tokens | Per-token pricing varies by model (e.g., Llama models among the cheapest; larger/MoE models higher). |
| Developer / Pay-as-you-go (output) | from ~$0.08–$0.99 | per 1M output tokens | Output tokens priced higher than input; exact rate is model-dependent. |
| Enterprise / On-prem | Custom quote | contract | Dedicated capacity, higher rate limits, and on-prem LPU hardware deployments via sales. |
Pricing verified from the official source. Prices change often — confirm on the vendor's site before buying.
Specifications
| use | inference-only |
| latency | sub-100ms time-to-first-token |
| throughput | hundreds to 1,000+ tokens/sec (model-dependent) |
| architecture | LPU (deterministic dataflow), SRAM-based |
| on_chip_sram | 230 MB per chip |
Sponsored
A full review is being generated for this product and will appear here shortly.