2026.06.03 Paper

🏛️ Part 1: Papers

No.	Paper	Authors / Year	Significance	Mark
1	Learning both Weights and Connections for Efficient Neural Networks	Han et al., NeurIPS 2015	Starting point of modern deep-network pruning; iterative magnitude pruning	🎯
2	Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding	Han et al., ICLR 2016	Unified pipeline: pruning + quantization + Huffman coding	🎯
3	Distilling the Knowledge in a Neural Network	Hinton et al., NIPS 2014 Workshop	Established the concept of knowledge distillation	🎯
4	Quantizing deep convolutional networks for efficient inference: A whitepaper	Krishnamoorthi, 2018 (Google)	Practical guide to PTQ/QAT; comprehensive summary of quantization basics	🎯

✂️ Part 2: Pruning

No.	Paper	Significance	Mark
5	Pruning Filters for Efficient ConvNets (Li et al., ICLR 2017)	Representative structured-pruning work	🎯
6	The Lottery Ticket Hypothesis (Frankle & Carbin, ICLR 2019)	"Winning ticket" concept; not essential, but highly influential	📘
7	To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression (Zhu & Gupta, 2017)	Discusses the limits of pruning	📙
8	Channel Pruning for Accelerating Very Deep Neural Networks (He et al., ICCV 2017)	Channel-level pruning	📘
9	Rethinking the Value of Network Pruning (Liu et al., ICLR 2019)	Insight that pruning is effectively a search for an architecture	📘

📌 Recommended reading order: 1 → 5 → 6 → 9
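
As a companion to No. 1, here is a minimal sketch of magnitude pruning on a flat list of weights. It is an illustration only, not the authors' implementation: the function names are made up for this note, and `retrain` is a hypothetical hook standing in for the fine-tuning step between pruning rounds (the default does nothing).

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Returns (pruned_weights, mask); mask[i] is 1 for kept weights, 0 for pruned.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)  # number of weights to remove
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    mask = [0 if i in pruned else 1 for i in range(len(weights))]
    return [w if m else 0.0 for w, m in zip(weights, mask)], mask


def iterative_prune(weights, sparsities, retrain=lambda w, mask: w):
    """Prune in several rounds with increasing target sparsity.

    `retrain` is a hypothetical placeholder for fine-tuning the surviving
    weights between rounds; the default returns the weights unchanged.
    """
    mask = [1] * len(weights)
    for s in sparsities:
        weights, mask = magnitude_prune(weights, s)
        # Re-apply the mask so retraining cannot revive pruned weights.
        weights = [w if m else 0.0 for w, m in zip(retrain(weights, mask), mask)]
    return weights, mask


if __name__ == "__main__":
    w = [0.5, -0.1, 0.03, -0.8]
    print(magnitude_prune(w, 0.5))
    print(iterative_prune(w, [0.25, 0.5]))
```

Structured pruning (Nos. 5 and 8) would rank whole filters or channels instead of individual weights, but the rank-then-mask pattern is the same.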

🔢 Part 3: Quantization

3-1. Quantization fundamentals / classics

No.	Paper	Significance	Mark
10	Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (Jacob et al., CVPR 2018)	Standard reference for INT8 inference; TFLite / MobileNet-based	🎯
11	A White Paper on Neural Network Quantization (Nagel et al., 2021)	Covers much the same ground as No. 4; tutorial-style	🎯
12	Effective and Fast: A Novel Sequential Single-Path Search for Mixed-Precision Quantization	Set aside in favor of the more explicit papers that follow

3-2. Mixed-Precision / HAQ family

No.	Paper	Significance	Mark
13	HAQ: Hardware-Aware Automated Quantization with Mixed Precision (Wang et al., CVPR 2019)	Automated, hardware-aware mixed-precision search	🎯
14	HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision (Dong et al., ICCV 2019)	Hessian-based second-order justification	📘
15	HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks (Dong et al., NeurIPS 2020)	Trace-weighted; more advanced mixed-precision	📘
16	HAWQ-V3: Dyadic Neural Network Quantization (Yao et al., ICML 2021)	Dyadic scaling; hardware-friendly	📙

3-3. PTQ (Post-Training Quantization): recent core work

No.	Paper	Significance	Mark
17	Up or Down? Adaptive Rounding for Post-Training Quantization (Nagel et al., ICML 2020)	AdaRound; standard PTQ technique	🎯
18	Data-Free Quantization Through Weight Equalization and Bias Correction (Nagel et al., ICCV 2019)	Equalization + bias correction; data-free PTQ	🎯
19	Improving Post Training Quantization Accuracy by Balancing Folds (Nagel et al., ICLR 2020)	Corrects activation distributions	📘
20	BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction (Li et al., ICLR 2021)	Block-wise reconstruction; second-order error analysis, cross-layer dependency vs. generalization trade-off	🎯
21	QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization (Wei et al., ICLR 2022)	Randomly drops activation quantization during PTQ reconstruction; 2-bit PTQ	📘
22	GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., ICLR 2023)	Standard for 4-bit LLM PTQ; generalizes OBQ	🎯

📌 Recommended reading order: 10 → 17 → 18 → 20 → 22
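
As a companion to No. 10, here is a minimal sketch of the affine (asymmetric) quantization mapping, real ≈ scale × (q − zero_point), for unsigned 8-bit values. The function names are made up for this note, and the sketch uses floating point throughout; the integer-only inference arithmetic that the paper is about is not reproduced here.

```python
def choose_qparams(x_min, x_max, num_bits=8):
    """Pick scale S and zero-point Z so that a real value r ~= S * (q - Z)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range; any positive scale works
    zero_point = round(qmin - x_min / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, num_bits=8):
    """Map a real value to an integer code, clamping to the representable range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to its real-valued approximation."""
    return scale * (q - zero_point)


if __name__ == "__main__":
    scale, zp = choose_qparams(-1.0, 3.0)
    for x in (0.0, 1.234, 10.0):
        q = quantize(x, scale, zp)
        print(x, q, dequantize(q, scale, zp))
```

Values inside the calibrated range come back with an error of at most half a quantization step; values outside it are clamped. PTQ methods such as AdaRound (No. 17) change how the rounding step is chosen rather than this basic mapping.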

🍵 Part 4: Knowledge Distillation

No.	Paper	Significance	Mark
23	Distilling the Knowledge in a Neural Network (Hinton et al., 2014)	The original KD paper (same as No. 3)	🎯
24	FitNets: Hints for Thin Deep Nets (Romero et al., ICLR 2015)	Hint-based; matches intermediate features	📘
25	Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer (Zagoruyko & Komodakis, ICLR 2017)	Attention Transfer	📘
26	Born-Again Networks (Furlanello et al., ICML 2018)	Teacher → student of the same size	📙
27	Knowledge Distillation: A Survey (Gou et al., IJCV 2021)	Survey; overview of the whole area	📘
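
As a companion to No. 23, here is a minimal sketch of a temperature-based distillation loss for a single example. It assumes the common formulation: a KL term between temperature-softened teacher and student distributions, scaled by T², plus ordinary cross-entropy on the true label. The weighting `alpha`, the default values, and the function names are illustrative choices for this note, not values taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, label)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    # KL divergence between softened distributions; terms with p_t == 0 contribute 0.
    soft = sum(pt * (math.log(pt) - math.log(ps))
               for pt, ps in zip(p_t, p_s) if pt > 0.0)
    # Standard cross-entropy on the hard label, at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * soft + (1.0 - alpha) * hard


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    student = [0.1, 1.0, 2.0]
    print(softmax(teacher, 1.0))
    print(softmax(teacher, 4.0))  # flatter: more weight on non-argmax classes
    print(kd_loss(student, teacher, label=0))
```

The KL term differs from cross-entropy against the soft targets only by the teacher's entropy, which is constant with respect to the student, so the two give the same gradients.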

🤖 Part 5: Compression / Quantization in the LLM Era (2022~)

No.	Paper	Significance	Mark
28	LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., NeurIPS 2022)	Outlier separation; mixed-precision decomposition	🎯
29	SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (Xiao et al., ICML 2023)	W8A8; activation smoothing	🎯
30	AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (Lin et al., MLSys 2024)	W4A16; activation-aware	🎯
31	GPTQ (same as No. 22 above; duplicate)
32	SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (Dettmers et al., ICLR 2024)	Combines sparsity with quantization	📘
33	QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., NeurIPS 2023)	NF4 + LoRA; fine-tunes a 65B model on one GPU	📘
34	QuIP: 2-Bit Quantization of Large Language Models with Guarantees (Chee et al., NeurIPS 2023)	Incoherence processing	📘
35	A Survey on Model Compression for Large Language Models (plus the many surveys published since 2024)	LLM compression survey	📘

📌 Recommended reading order: 28 → 29 → 22 → 30 → 32 → 33
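
As a companion to No. 29, here is a minimal sketch of the activation-smoothing idea: per input channel j, a scale s_j = max|X_j|^α / max|W_j|^(1−α) is divided out of the activations and multiplied into the weights, so the product X·W is unchanged while activation outliers shrink. The function names are made up for this note, and the tiny matrices stand in for calibration statistics; in practice the division is folded into the preceding layer rather than applied at run time.

```python
def smoothing_scales(X, W, alpha=0.5):
    """Per-input-channel scales s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha).

    X is tokens x in_channels, W is in_channels x out_channels (lists of lists).
    """
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        if act_max == 0.0 or w_max == 0.0:
            scales.append(1.0)  # nothing to balance for an all-zero channel
        else:
            scales.append(act_max ** alpha / w_max ** (1.0 - alpha))
    return scales


def apply_smoothing(X, W, scales):
    """Return (X / s, s * W); their product equals the original X @ W."""
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in row] for row, s in zip(W, scales)]
    return X_s, W_s


def matmul(A, B):
    """Plain list-of-lists matrix product, used only to check equivalence."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


if __name__ == "__main__":
    X = [[1.0, 100.0], [-2.0, 50.0]]   # channel 1 carries an outlier
    W = [[0.5, -0.25], [0.1, 0.2]]
    s = smoothing_scales(X, W)
    X_s, W_s = apply_smoothing(X, W, s)
    print(s)
    print(matmul(X, W))
    print(matmul(X_s, W_s))
```

With α = 0.5 the largest activation and the largest weight in each channel end up with the same magnitude, which is what makes both sides easier to quantize to 8 bits.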

⚙️ Part 6: Practical / Systems Perspective

No.	Paper / Resource	Significance	Mark
36	Efficient Processing of Deep Neural Networks: A Tutorial and Survey (Sze et al., Proceedings of the IEEE 2017)	Hardware-algorithm co-design	📘
37	MLPerf Inference Benchmark	Standard benchmark	📙
38	ONNX Runtime Quantization docs	Practical deployment	📙

📅 12-Week Reading Schedule

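Weeks	Topic	Paper Nos.
Weeks 1-2	Field overview + original KD paper	3, 4, 23, 36
Weeks 3-4	Classic pruning	1, 2, 5, 6, 9
Weeks 5-6	Quantization fundamentals	10, 11, 17, 18
Weeks 7-8	Advanced quantization	13, 14, 20, 22
Weeks 9-10	LLM compression	28, 29, 30, 32, 33
Weeks 11-12	Surveys + deciding on a research direction	27, 35, own exploration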