Quantization Advisor

Optimize model size for inference

Select a model to get quantization recommendations

Configuration

NVIDIA CUDA GPU - fastest inference, most methods available

High-throughput serving with PagedAttention - best for production GPU serving
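PagedAttention-based high-throughput serving is what vLLM provides. A minimal sketch of launching such a server, assuming vLLM is installed and a CUDA GPU is available; the model name and flag values are placeholders, not output from this tool:

```shell
# Start an OpenAI-compatible vLLM server (model and flags are placeholders).
# --quantization selects the weight-quantization method baked into the model;
# --gpu-memory-utilization caps how much VRAM the KV cache may claim.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --gpu-memory-utilization 0.90
```

This is a configuration sketch rather than runnable test code, since it requires GPU hardware and model weights.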

Speed / Memory vs. Quality slider: 50%
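The slider presumably maps a speed/memory-versus-quality preference onto a recommended quantization method. A hypothetical sketch of such a mapping; the thresholds and method labels are invented for illustration and are not this tool's actual logic:

```python
def recommend(quality_weight: float) -> str:
    """Map the slider position (0.0 = speed/memory, 1.0 = quality)
    to a quantization recommendation. Thresholds are hypothetical."""
    if quality_weight >= 0.5:
        return "Quality focused: 8-bit quantization, minimal quality loss"
    if quality_weight >= 0.25:
        return "Balanced: 4-bit NF4"
    return "Speed focused: 4-bit GPTQ/AWQ"

# At the 50% midpoint this sketch lands on the quality-focused branch,
# matching the recommendation the tool shows at that setting.
print(recommend(0.5))
```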

Quality focused: 8-bit quantization with minimal quality loss
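The "minimal quality loss" claim for 8-bit quantization can be seen in a toy example. This is a pure-Python sketch of absmax int8 quantization (not this tool's implementation): each weight is scaled into the signed 8-bit range [-127, 127], so the round-trip error is bounded by half the scale step:

```python
def quantize_int8(weights):
    """Absmax int8 quantization: scale floats into [-127, 127] integers."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

w = [0.12, -0.53, 0.98, -0.07]
q, s = quantize_int8(w)
restored = dequantize(q, s)
# Worst-case round-trip error is at most half the quantization step (s / 2).
err = max(abs(a - b) for a, b in zip(w, restored))
print(q, round(err, 4))
```

With 256 levels per tensor the step size is small relative to typical weight magnitudes, which is why 8-bit methods tend to preserve quality better than 4-bit ones.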

Results

Select a model to see quantization recommendations