Learning both Weights and Connections for Efficient Neural Networks
Han et al., NeurIPS 2015
Seminal pruning work; iterative magnitude pruning
🎯
2
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
Han et al., ICLR 2016
Unified pruning + quantization + Huffman coding pipeline
🎯
3
Distilling the Knowledge in a Neural Network
Hinton et al., NIPS 2014 Workshop
Established the concept of knowledge distillation
🎯
4
Quantizing deep convolutional networks for efficient inference: A whitepaper
Krishnamoorthi, 2018 (Google)
Practical guide to PTQ/QAT; comprehensive overview of quantization basics
🎯
✂️ Part 2: Pruning
| No. | Paper | Significance | Mark |
| --- | --- | --- | --- |
| 5 | Pruning Filters for Efficient ConvNets (Li et al., ICLR 2017) | Representative structured pruning work | 🎯 |
| 6 | The Lottery Ticket Hypothesis (Frankle & Carbin, ICLR 2019) | Introduces the "winning ticket" concept; not essential, but highly influential | 📘 |
| 7 | To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression (Zhu & Gupta, 2017) | Discusses the limits of pruning | 📙 |
| 8 | Channel Pruning for Accelerating Very Deep Neural Networks (He et al., ICCV 2017) | Channel-level pruning | 📘 |
| 9 | Rethinking the Value of Network Pruning (Liu et al., ICLR 2019) | Insight that "pruning is a search for an architecture" | 📘 |
📌 Recommended reading order: 1 → 5 → 6 → 9
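As a companion to entry 1, here is a minimal sketch of the prune, retrain, repeat loop behind iterative magnitude pruning. It is a simplification, not the paper's implementation: weights are a flat list, the sparsity schedule is made up, and `retrain` is a hypothetical hook that defaults to a no-op.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights; return (pruned, mask)."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending magnitude; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    mask = [0 if i in pruned_idx else 1 for i in range(len(weights))]
    pruned = [w * m for w, m in zip(weights, mask)]
    return pruned, mask


def iterative_prune(weights, schedule, retrain=lambda w, mask: w):
    """Apply increasing sparsity targets with a (hypothetical) retrain hook in between."""
    mask = [1] * len(weights)
    for sparsity in schedule:
        weights, mask = magnitude_prune(weights, sparsity)
        # A real retrain step would update surviving weights only;
        # re-applying the mask keeps pruned connections at zero.
        weights = [w * m for w, m in zip(retrain(weights, mask), mask)]
    return weights, mask


# Illustrative values only.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
final_weights, final_mask = iterative_prune(weights, schedule=[0.33, 0.5, 0.67])
print(final_weights, final_mask)
```

Already-pruned weights have magnitude zero, so they are always selected first in later rounds and the mask only ever grows.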
🔢 Part 3: Quantization
3-1. Quantization Basics / Classics
| No. | Paper | Significance | Mark |
| --- | --- | --- | --- |
| 10 | Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (Jacob et al., CVPR 2018) | The standard for INT8 inference; TFLite/MobileNet-based | 🎯 |
| 11 | A White Paper on Neural Network Quantization (Nagel et al., 2021) | Covers nearly the same ground as the above; tutorial-style | 🎯 |
| 12 | Effective and Fast: A Novel Sequential Single-Path Search for Mixed-Precision Quantization (replaced here by the more explicit papers that follow) | | |
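The integer-only scheme in entry 10 rests on an affine mapping between real values and integer codes, r ≈ S · (q − Z). The sketch below shows that mapping for unsigned 8-bit codes; the function names and the example range are illustrative assumptions, and per-channel scales, calibration, and integer-only matmul are left out.

```python
def quant_params(r_min, r_max, num_bits=8):
    """Return (scale, zero_point) mapping [r_min, r_max] onto unsigned integers."""
    q_min, q_max = 0, 2 ** num_bits - 1
    # Widen the range to include 0.0 so that real zero is exactly representable.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = (r_max - r_min) / (q_max - q_min)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = int(round(q_min - r_min / scale))
    return scale, max(q_min, min(q_max, zero_point))


def quantize(r, scale, zero_point, num_bits=8):
    """Round a real value to its integer code and clamp to the valid range."""
    q = int(round(r / scale)) + zero_point
    return max(0, min(2 ** num_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Recover the approximate real value: r = scale * (q - zero_point)."""
    return scale * (q - zero_point)


# Illustrative range and values only.
scale, zero_point = quant_params(-1.0, 3.0)
values = [-1.0, 0.0, 0.5, 3.0]
codes = [quantize(v, scale, zero_point) for v in values]
restored = [dequantize(q, scale, zero_point) for q in codes]
print(codes, restored)
```

Real zero maps to the zero-point exactly, which matters for zero padding and ReLU outputs; every other value carries a rounding error on the order of one quantization step.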
3-2. Mixed-Precision / HAQ Family
| No. | Paper | Significance | Mark |
| --- | --- | --- | --- |
| 13 | HAQ: Hardware-Aware Automated Quantization with Mixed Precision (Wang et al., CVPR 2019) | Automated hardware-aware mixed-precision search | 🎯 |
| 14 | HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision (Dong et al., ICCV 2019) | Second-order, Hessian-based justification | 📘 |
| 15 | HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks (Dong et al., NeurIPS 2020) | Trace-weighted; a more advanced mixed-precision method | 📘 |
| 16 | HAWQ-V3: Dyadic Neural Network Quantization (Yao et al., ICML 2021) | Dyadic scaling; hardware-friendly | 📙 |
3-3. PTQ (Post-Training Quantization): Recent Essentials
| No. | Paper | Significance | Mark |
| --- | --- | --- | --- |
| 17 | Up or Down? Adaptive Rounding for Post-Training Quantization (Nagel et al., ICML 2020) | AdaRound; a standard PTQ technique | 🎯 |
| 18 | Data-Free Quantization Through Weight Equalization and Bias Correction (Nagel et al., ICCV 2019) | Equalization + bias correction; data-free PTQ | 🎯 |
| 19 | Improving Post Training Quantization Accuracy by Balancing Folds (Nagel et al., ICLR 2020) | Activation distribution correction | 📘 |
| 20 | BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction (Li et al., ICLR 2021) | Block-wise reconstruction; sharp vs. flat minimizer analysis | 🎯 |
| 21 | QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization (Wei et al., ICLR 2022) | Random quantization dropping in QAT-style simulation; 2-bit PTQ | 📘 |
| 22 | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., ICLR 2023) | The standard for 4-bit LLM PTQ; generalizes OBQ | 🎯 |
📌 Recommended reading order: 10 → 17 → 18 → 20 → 22
🍵 Part 4: Knowledge Distillation
| No. | Paper | Significance | Mark |
| --- | --- | --- | --- |
| 23 | Distilling the Knowledge in a Neural Network (Hinton et al., 2014) | The original KD paper | 🎯 |
| 24 | FitNets: Hints for Thin Deep Nets (Romero et al., ICLR 2015) | Hint-based; intermediate feature matching | 📘 |
| 25 | Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer (Zagoruyko & Komodakis, ICLR 2017) | Attention Transfer | 📘 |
| 26 | Born-Again Networks (Furlanello et al., ICML 2018) | Teacher → student of identical size | 📙 |
| 27 | Knowledge Distillation: A Survey (Gou et al., IJCV 2021) | Survey; broad overview | 📘 |
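For orientation before reading entry 23, a temperature-based distillation loss can be sketched as below. This is one common formulation, not necessarily the exact objective in any paper listed here: the soft term is KL(teacher || student) on temperature-softened distributions scaled by T², mixed with ordinary cross-entropy on the hard label. The values of `temperature` and `alpha` are arbitrary assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-target term on the softened distributions.
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)
    soft *= temperature ** 2  # keeps gradient magnitudes comparable across T
    # Hard-target term at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1 - alpha) * hard


# Illustrative logits only.
teacher = [5.0, 2.0, 0.5]
student = [3.0, 2.5, 0.1]
loss = distillation_loss(student, teacher, label=0)
print(loss)
```

A higher temperature flattens the teacher distribution, so the relative probabilities of the wrong classes contribute more to the student's training signal.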
🤖 Part 5: Compression/Quantization in the LLM Era (2023~)
| No. | Paper | Significance | Mark |
| --- | --- | --- | --- |
| 28 | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., NeurIPS 2022) | Outlier separation; mixed precision | 🎯 |
| 29 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (Xiao et al., ICML 2023) | W8A8; activation smoothing | 🎯 |
| 30 | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (Lin et al., MLSys 2024) | W4A16; activation-aware | 🎯 |
| 31 | GPTQ (same as entry 22 above; duplicate) | | |
| 32 | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (Dettmers et al., ICLR 2024) | Combines sparsity and quantization | 📘 |
| 33 | QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., NeurIPS 2023) | NF4 + LoRA; fine-tunes a 65B model on a single GPU | 📘 |
| 34 | QuIP: 2-Bit Quantization of Large Language Models with Guarantees (Chee et al., NeurIPS 2023) | Incoherence processing | 📘 |
| 35 | A Survey on Model Compression for Large Language Models (and the many other surveys from 2024 onward) | Survey of LLM compression | 📘 |
📌 Recommended reading order: 28 → 29 → 22 → 30 → 32 → 33
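The "activation smoothing" idea in entry 29 can be illustrated with a tiny example: each input channel j of the activations is divided by a factor s_j = max|X_j|^α / max|W_j|^(1−α), and the matching weight row is multiplied by the same factor, so the product is unchanged while activation outliers shrink. The sketch below is a toy with plain nested lists; the matrices and α = 0.5 are illustrative assumptions, and it assumes no channel is all zeros.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def smoothing_factors(X, W, alpha=0.5):
    """Per-input-channel factors s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    factors = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        factors.append(act_max ** alpha / w_max ** (1 - alpha))
    return factors


# Illustrative data: X is (tokens x channels), W is (channels x outputs).
X = [[0.1, 20.0], [-0.2, -35.0]]   # channel 1 carries activation outliers
W = [[0.5, -0.3], [0.02, 0.01]]

s = smoothing_factors(X, W)
X_smooth = [[x / s[j] for j, x in enumerate(row)] for row in X]
W_smooth = [[w * s[j] for w in W[j]] for j in range(len(W))]

Y = matmul(X, W)
Y_smooth = matmul(X_smooth, W_smooth)
print(max(abs(x) for row in X for x in row),
      max(abs(x) for row in X_smooth for x in row))
```

Because the scaling cancels inside the matrix product, the difficulty of quantization is shifted from activations to weights without changing the layer's output in full precision.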
⚙️ Part 6: Practical/Systems Perspective
No.
Paper/Resource
Significance
Mark
36
Efficient Processing of Deep Neural Networks: A Tutorial and Survey (Sze et al., Proceedings of the IEEE, 2017)