Yi Liu's Projects
🚀 A simple way to train and use PyTorch models with multi-GPU, TPU, and mixed precision
AI-based Pull Request Summarizer and Reviewer with Chat Capabilities.
AutoAWQ implements the AWQ algorithm for 4-bit weight quantization, with a 2x speedup during inference.
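To make the idea of 4-bit weight-only quantization concrete, here is a minimal sketch of symmetric per-group quantization in NumPy. This is an illustration only, not the AWQ algorithm itself: AWQ additionally rescales salient weight channels using activation statistics before quantizing, and the function names here are hypothetical.

```python
import numpy as np

def quantize_4bit(w, group_size=128):
    """Symmetric per-group 4-bit quantization (illustration only).

    Each group of `group_size` weights shares one FP scale; the
    4-bit signed integer range is [-8, 7].
    """
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # one scale per group
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate FP32 weights from int4 codes + scales.
    return q.astype(np.float32) * scale

# Round-trip a random weight matrix through 4-bit quantization.
w = np.random.randn(256, 128).astype(np.float32)
q, s = quantize_4bit(w.reshape(-1), group_size=128)
w_hat = dequantize(q, s).reshape(w.shape)
```

In real 4-bit runtimes the `int8` codes above would be packed two per byte, and inference kernels (as in AutoAWQ) fuse the dequantization into the matmul rather than materializing `w_hat`.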
SOTA Weight-only Quantization Algorithm for LLMs
A list of papers, docs, and code about model quantization. This repo aims to collect resources for model quantization research and is continuously improved. PRs adding works (papers, repositories) the repo has missed are welcome.
PyTorch implementation of BRECQ (ICLR 2021)
brpc is an industrial-grade RPC framework written in C++, commonly used in high-performance systems such as search, storage, machine learning, advertising, and recommendation. "brpc" means "better RPC".
CodeXGLUE
Introduction to Parallel Programming class code
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Official code for "Writing Distributed Applications with PyTorch", PyTorch Tutorial
A set of examples around pytorch in Vision, Text, Reinforcement Learning, etc.
C++ extensions in PyTorch
Simple transformer implementation from scratch in PyTorch.
Lightweight, standalone C++ inference engine for Google's Gemma models.
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
Quantization library for PyTorch. Support low-precision and mixed-precision quantization, with hardware implementation through TVM.
Official implementation of Half-Quadratic Quantization (HQQ)
⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡
Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.
LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
A resource for learning about Machine learning & Deep Learning
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups at batch sizes of up to 16-32 tokens.
Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
MPI programming lessons in C and executable code examples
Intel® Neural Compressor (formerly known as Intel® Low Precision Optimization Tool) aims to provide unified APIs for network compression techniques, such as low-precision quantization, sparsity, pruning, and knowledge distillation, across different deep learning frameworks, in pursuit of optimal inference performance.
Natural Language Processing Tutorial for Deep Learning Researchers
Neural Networks: Zero to Hero