LLMC is an off-the-shelf tool for compressing LLMs, leveraging state-of-the-art compression algorithms to enhance efficiency and reduce model size without compromising performance.
Highlight Features
- 💥Comprehensive Algorithm Support: Provides a broad range of ✨SOTA compression algorithms, including ✅quantization, ✅mixed-precision quantization, and ✅sparsity, while maintaining accuracy consistent with the original repositories. ✨Quantization best practices (see 🚀Best Practices here) are also available to ensure optimal performance and efficiency (a quantization sketch follows this list).
- 💥Supported Formats: Supports both ✨quantization (integer and floating-point) and ✨sparsity, specifically including ✅weight-activation, ✅weight-only, and ✅mixed-precision quantization, as well as ✅structured and ✅unstructured sparsity (see the sparsity sketch after this list).
- 💥Wide Model Support: Offers support for a diverse array of ✨LLM models, including ✅Llama, ✅Mistral, ✅InternLM2, and ✅Qwen2, among others, as well as ✅MoE (DeepSeek-V2, DeepSeek-V2.5) and ✅VLM (Llama3.2-Vision, Qwen2-VL) models (see the Supported Model List).
- 💥Multi-backend Compatibility: Seamlessly integrates with various backends for enhanced deployment flexibility. Multiple quantization settings and model formats are compatible with a wide range of backends and hardware platforms, such as ✅VLLM, ✅Sglang, ✅LightLLM, ✅MLC-LLM, and ✅AutoAWQ, making it highly versatile (see the Backend section here and the deployment sketch after this list).
- 💥Performance Efficiency: Enables quantization of large LLMs, such as ✨Llama3.1-405B and ✨DeepSeekV2-236B, with PPL evaluation on a single A100/H100/H800 GPU (see the perplexity sketch after this list).
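
To make the quantization bullet concrete, here is a minimal sketch of what weight-only integer quantization does. This is illustrative PyTorch, not LLMC's actual implementation: the function name, tensor shapes, and group size are assumptions.

```python
import torch

def quantize_weight_per_group(w: torch.Tensor, n_bits: int = 4, group_size: int = 128):
    """Asymmetric per-group weight quantization (illustrative sketch, not LLMC's API)."""
    out_features, in_features = w.shape
    w_groups = w.reshape(out_features, in_features // group_size, group_size)
    w_min = w_groups.amin(dim=-1, keepdim=True)
    w_max = w_groups.amax(dim=-1, keepdim=True)
    qmax = 2 ** n_bits - 1
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero_point = (-w_min / scale).round()
    q = (w_groups / scale + zero_point).round().clamp(0, qmax)
    # Dequantize to inspect the quantization error introduced by rounding.
    w_hat = ((q - zero_point) * scale).reshape(out_features, in_features)
    return q.to(torch.uint8), scale, zero_point, w_hat

w = torch.randn(4096, 4096)
q, scale, zp, w_hat = quantize_weight_per_group(w)
print(f"mean abs quantization error: {(w - w_hat).abs().mean():.5f}")
```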
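Likewise, a minimal sketch of the two sparsity styles named in the formats bullet: unstructured magnitude pruning and 2:4 structured (semi-structured) sparsity. The helper names are hypothetical and the logic is a textbook version, not LLMC's code.

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Unstructured: zero out the smallest-magnitude weights (illustrative)."""
    k = int(w.numel() * sparsity)
    threshold = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > threshold)

def prune_2_4(w: torch.Tensor) -> torch.Tensor:
    """Structured 2:4: keep the 2 largest-magnitude weights in every block of 4 (illustrative)."""
    groups = w.reshape(-1, 4)
    # Indices of the two smallest-magnitude entries per group get zeroed.
    idx = groups.abs().topk(2, dim=-1, largest=False).indices
    mask = torch.ones_like(groups).scatter_(-1, idx, 0.0)
    return (groups * mask).reshape(w.shape)

w = torch.randn(4096, 4096)
print(f"unstructured sparsity: {(magnitude_prune(w) == 0).float().mean():.2f}")
print(f"2:4 sparsity:          {(prune_2_4(w) == 0).float().mean():.2f}")
```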
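For the backend bullet, one plausible end of the pipeline is serving a compressed checkpoint with vLLM. The checkpoint path below is hypothetical; `quantization="awq"` is a real vLLM option, but whether it applies depends on the export format you chose in LLMC.

```python
from vllm import LLM, SamplingParams

# Hypothetical path to a checkpoint exported by LLMC in an AWQ-compatible format.
llm = LLM(model="./llama3.1-8b-awq-int4", quantization="awq")
outputs = llm.generate(
    ["Explain weight-only quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```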
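Finally, a sketch of the kind of PPL evaluation the efficiency bullet refers to, following the common WikiText-2 recipe with Hugging Face `transformers`. The model path is hypothetical and the loop is an approximation of standard practice, not LLMC's exact evaluation code.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical path to a quantized model directory.
model_path = "./llama3.1-8b-awq-int4"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="cuda"
)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda")

seq_len, nlls = 2048, []
for i in range(0, ids.size(1) - seq_len, seq_len):
    chunk = ids[:, i : i + seq_len]
    with torch.no_grad():
        # Passing labels=chunk makes the model return the mean cross-entropy for the chunk.
        nlls.append(model(chunk, labels=chunk).loss)

ppl = torch.exp(torch.stack(nlls).mean())
print(f"perplexity: {ppl.item():.2f}")
```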