Quantization Advisor
Optimize model size for inference
Select a model to get quantization recommendations
Configuration
NVIDIA CUDA GPU - fastest inference, widest range of quantization methods
High-throughput serving with PagedAttention - best for production GPU deployments
Speed / Memory vs. Quality trade-off: 50%
Quality focused: 8-bit quantization with minimal quality loss
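To make the "8-bit quantization" recommendation above concrete, here is a minimal, illustrative sketch of symmetric int8 weight quantization. This is not this tool's implementation — production libraries use per-block scales, outlier handling, and fused kernels — but it shows the core idea: store weights as int8 plus a float scale, and dequantize on the fly.

```python
# Minimal sketch of symmetric 8-bit (int8) quantization.
# Illustrative only; real quantizers work per-tensor-block and
# handle outliers separately.

def quantize_int8(weights):
    """Map floats to int8 range [-127, 127] with one symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid 0 scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [x * scale for x in q]

weights = [0.12, -0.95, 0.33, 0.0, 1.27]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
# Per-weight rounding error is bounded by scale / 2.
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
```

This halves memory versus fp16 (one byte per weight plus a scale), which is why the quality loss at 8 bits is typically minimal: the rounding error per weight is at most half the scale.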
Results
Select a model to see quantization recommendations