LLMC

LLMC is an off-the-shelf tool designed for compressing LLMs, leveraging state-of-the-art compression algorithms to enhance efficiency and reduce model size without compromising performance.

Highlight Features

  • 💥Comprehensive Algorithm Support: Provides a broad range of ✨SOTA compression algorithms, including ✅quantization, ✅mixed-precision quantization, and ✅sparsity, while maintaining accuracy consistent with the original repositories. ✨Quantization best practices (see 🚀Best Practices here) are also available to ensure optimal performance and efficiency. A minimal quantization sketch follows this list.

  • 💥Supported Formats: Supports both ✨quantization (integer and floating-point) and ✨sparsity, specifically including ✅weight-activation, ✅weight-only, and ✅mixed-precision quantization, as well as ✅structured and ✅unstructured sparsity (see the sparsity sketch after this list).

  • 💥Wide Model Support: Offers support for a diverse array of ✨LLM models, including ✅Llama, ✅Mistral, ✅InternLM2, and ✅Qwen2, among others, as well as ✅MoE (DeepSeek-V2, DeepSeek-V2.5) and ✅VLM (Llama-3.2-Vision, Qwen2-VL) models (see the Supported Model List).

  • 💥Multi-backend Compatibility: Seamlessly integrates with various backends for enhanced deployment flexibility. Multiple quantization settings and model formats are compatible with a wide range of backends and hardware platforms, such as ✅VLLM, ✅Sglang, ✅LightLLM, ✅MLC-LLM, and ✅AutoAWQ, making it highly versatile (see the Backend section here; a vLLM serving sketch appears below).

  • 💥Performance Efficiency: Enables quantization of very large models, such as ✨Llama3.1-405B and ✨DeepSeekV2-236B, with PPL evaluation on a single A100/H100/H800 GPU (a PPL-evaluation sketch follows below).

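To make the weight-only setting above concrete, here is a minimal sketch of per-channel symmetric quantization, the core operation such algorithms build on. This is illustrative only, not LLMC's actual API; the shapes and bit-width are assumptions.

```python
import torch

def quantize_weight_per_channel(w: torch.Tensor, n_bits: int = 8):
    """Symmetric per-channel quantization of an [out_features, in_features] weight."""
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 127 for 8-bit
    scale = w.abs().amax(dim=1, keepdim=True) / qmax  # one scale per output channel
    q = torch.clamp(torch.round(w / scale), min=-qmax - 1, max=qmax)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float weight from integers and scales."""
    return q.float() * scale

w = torch.randn(4096, 11008)  # e.g. a Llama-style MLP projection (illustrative size)
q, scale = quantize_weight_per_channel(w)
err = (w - dequantize(q, scale)).abs().max()
print(f"max abs reconstruction error: {err:.4f}")
```
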
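Similarly, the two sparsity styles can be sketched in a few lines: unstructured pruning zeroes individual low-magnitude weights anywhere, while structured pruning removes whole channels. Again, a simplified illustration rather than LLMC's implementation.

```python
import torch

def unstructured_prune(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero the smallest-magnitude weights anywhere in the matrix."""
    k = int(w.numel() * sparsity)
    threshold = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > threshold)

def structured_prune(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero entire output channels (rows) with the smallest L2 norm."""
    n_prune = int(w.shape[0] * sparsity)
    idx = w.norm(dim=1).topk(n_prune, largest=False).indices
    w = w.clone()
    w[idx] = 0.0
    return w

w = torch.randn(8, 16)
print(unstructured_prune(w, 0.5))  # scattered zeros across the matrix
print(structured_prune(w, 0.5))    # 4 of the 8 rows fully zeroed
```
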
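As one example of backend deployment, an AWQ-format checkpoint can typically be served with vLLM as below. The model path is hypothetical; consult the Backend section for the exact export and serving workflow.

```python
from vllm import LLM, SamplingParams

# Load a weight-only (AWQ) quantized checkpoint; the path is a placeholder.
llm = LLM(model="./llama-2-7b-awq", quantization="awq")
outputs = llm.generate(
    ["Explain weight-only quantization in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```
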
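The PPL evaluation mentioned above follows the standard sliding-window recipe; a minimal version with Hugging Face Transformers is sketched below. The model name is a stand-in, and LLMC's own pipeline is what handles the 100B+ models on a single GPU.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
).eval()

# Concatenate the WikiText-2 test split and score it in fixed-length windows.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids
seqlen, nlls = 2048, []
with torch.no_grad():
    for i in range(0, ids.shape[1] // seqlen * seqlen, seqlen):
        batch = ids[:, i : i + seqlen].to(model.device)
        loss = model(batch, labels=batch).loss  # mean NLL per token in the window
        nlls.append(loss * seqlen)
ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"PPL: {ppl.item():.2f}")
```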