2026.06.03 Paper

🏛️ Part 1: Papers

| No. | Paper | Authors / Year | Significance | Mark |
|---|---|---|---|---|
| 1 | Learning both Weights and Connections for Efficient Neural Networks | Han et al., NeurIPS 2015 | Starting point of modern deep-network pruning; iterative magnitude pruning | 🎯 |
| 2 | Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding | Han et al., ICLR 2016 | Unified pruning + quantization + Huffman coding pipeline | 🎯 |
| 3 | Distilling the Knowledge in a Neural Network | Hinton et al., NIPS 2014 Workshop | Established the concept of knowledge distillation | 🎯 |
| 4 | Quantizing deep convolutional networks for efficient inference: A whitepaper | Krishnamoorthi, 2018 (Google) | Practical PTQ/QAT guide; full overview of quantization basics | 🎯 |

✂️ Part 2: Pruning

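| No. | Paper | Significance | Mark |
|---|---|---|---|
| 5 | Pruning Filters for Efficient ConvNets (Li et al., ICLR 2017) | Representative work on structured pruning | 🎯 |
| 6 | The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (Frankle & Carbin, ICLR 2019) | "Winning ticket" concept; not essential but highly influential | 📘 |
| 7 | To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression (Zhu & Gupta, 2017) | Discussion of the limits of pruning | 📙 |
| 8 | Channel Pruning for Accelerating Very Deep Neural Networks (He et al., ICCV 2017) | Channel-level pruning | 📘 |
| 9 | Rethinking the Value of Network Pruning (Liu et al., ICLR 2019) | Insight that "pruning is a search for an architecture" | 📘 |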

📌 Recommended reading order: 1 → 5 → 6 → 9
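
As a reading aid for entry 1, here is a minimal sketch of the iterative magnitude pruning idea: repeatedly zero out the smallest-magnitude weights, optionally retraining between rounds. The function names, the sparsity schedule, and the `retrain` hook are hypothetical and chosen only for illustration; they are not taken from the paper's code.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-|w| weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def iterative_prune(weights, schedule, retrain=None):
    """Prune to each sparsity level in `schedule`, optionally retraining in between."""
    mask = [1] * len(weights)
    for sparsity in schedule:
        mask = magnitude_mask(weights, sparsity)
        weights = [w * m for w, m in zip(weights, mask)]
        if retrain is not None:
            # Hypothetical fine-tuning hook; it should keep masked weights at zero.
            weights = retrain(weights, mask)
    return weights, mask


# Toy example: six weights, pruned to 25% and then 50% sparsity (no retraining).
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
pruned, mask = iterative_prune(weights, schedule=[0.25, 0.5])
print(mask)
```

In a real network the mask would be applied per layer (or globally) to tensors, and the retraining step between rounds is what lets accuracy recover.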

🔢 Part 3: Quantization

3-1. Quantization Basics / Classics

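| No. | Paper | Significance | Mark |
|---|---|---|---|
| 10 | Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (Jacob et al., CVPR 2018) | De facto standard for INT8 inference; underlies TFLite quantization, evaluated on MobileNets | 🎯 |
| 11 | A White Paper on Neural Network Quantization (Nagel et al., 2021) | Tutorial-style; covers much the same ground as no. 4 | 🎯 |
| 12 | Effective and Fast: A Novel Sequential Single-Path Search for Mixed-Precision Quantization | Read the more explicit papers that follow instead | |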
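
The INT8 scheme that entries 4, 10, and 11 build on is uniform affine quantization: a real value x maps to an integer q = round(x / scale) + zero_point, and back via x ≈ scale * (q - zero_point). The sketch below is a minimal per-tensor version under that assumption; the function names are hypothetical, and real toolchains add per-channel scales, calibration, and integer-only arithmetic.

```python
def quant_params(x_min, x_max, num_bits=8):
    """Scale and zero-point for asymmetric uniform quantization to unsigned ints."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0 so that real zero is represented exactly.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, num_bits=8):
    """Map a real value to an integer code, clamping to the representable range."""
    q = int(round(x / scale)) + zero_point
    return max(0, min(2 ** num_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to its real-valued approximation."""
    return scale * (q - zero_point)


values = [-1.0, -0.25, 0.0, 0.5, 2.0]
scale, zero_point = quant_params(min(values), max(values))
codes = [quantize(v, scale, zero_point) for v in values]
recovered = [dequantize(q, scale, zero_point) for q in codes]
print(codes)
print(recovered)
```

For values inside the calibrated range, the round-trip error is bounded by about half a quantization step (scale / 2), which is the quantity PTQ methods such as those in section 3-3 try to spend wisely.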

3-2. Mixed-Precision / HAQ Family

| No. | Paper | Significance | Mark |
|---|---|---|---|
| 13 | HAQ: Hardware-Aware Automated Quantization with Mixed Precision (Wang et al., CVPR 2019) | Hardware-aware automated search for mixed precision | 🎯 |
| 14 | HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision (Dong et al., ICCV 2019) | Hessian-based second-order justification for per-layer precision | 📘 |
| 15 | HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks (Dong et al., NeurIPS 2020) | Hessian-trace weighting; more advanced mixed precision | 📘 |
| 16 | HAWQ-V3: Dyadic Neural Network Quantization (Yao et al., ICML 2021) | Dyadic scaling; hardware-friendly | 📙 |

3-3. PTQ (Post-Training Quantization): Recent Essentials

| No. | Paper | Significance | Mark |
|---|---|---|---|
| 17 | Up or Down? Adaptive Rounding for Post-Training Quantization (Nagel et al., ICML 2020) | AdaRound; standard PTQ technique | 🎯 |
| 18 | Data-Free Quantization Through Weight Equalization and Bias Correction (Nagel et al., ICCV 2019) | Equalization + bias correction; data-free PTQ | 🎯 |
| 19 | Improving Post Training Quantization Accuracy by Balancing Folds (Nagel et al., ICLR 2020) | Activation distribution correction | 📘 |
| 20 | BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction (Li et al., ICLR 2021) | Block-wise reconstruction with second-order error analysis | 🎯 |
| 21 | QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization (Wei et al., ICLR 2022) | Randomly drops activation quantization during reconstruction; 2-bit PTQ | 📘 |
| 22 | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., ICLR 2023) | Standard for 4-bit LLM PTQ; extends OBQ | 🎯 |

📌 Recommended reading order: 10 → 17 → 18 → 20 → 22


🍵 Part 4: Knowledge Distillation

| No. | Paper | Significance | Mark |
|---|---|---|---|
| 23 | Distilling the Knowledge in a Neural Network (Hinton et al., 2014) | The original KD paper (same paper as no. 3) | 🎯 |
| 24 | FitNets: Hints for Thin Deep Nets (Romero et al., ICLR 2015) | Hint-based; intermediate feature matching | 📘 |
| 25 | Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer (Zagoruyko & Komodakis, ICLR 2017) | Attention Transfer | 📘 |
| 26 | Born Again Neural Networks (Furlanello et al., ICML 2018) | Teacher → student of identical size | 📙 |
| 27 | Knowledge Distillation: A Survey (Gou et al., IJCV 2021) | Survey; overview of the whole field | 📘 |
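
To make the core idea of entry 23 concrete, the sketch below computes a distillation loss for one example: the student matches the teacher's temperature-softened distribution while also fitting the true label. This is the commonly used KL-divergence form with a T² factor, written as an assumption-laden illustration; `temperature`, `alpha`, and the function names are hypothetical choices, not values prescribed by the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy(label)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-target term on the softened distributions.
    soft = sum(pt * (math.log(pt) - math.log(ps))
               for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Hard-label term at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    # T^2 keeps the soft-term gradient magnitude comparable across temperatures.
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard


teacher = [4.0, 1.0, -2.0]
student = [2.5, 1.5, -1.0]
loss = distillation_loss(student, teacher, label=0)
print(loss)
```

When the student's logits equal the teacher's, the soft term vanishes and only the weighted hard-label term remains, which is a quick sanity check for any implementation.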

🤖 Part 5: Compression / Quantization in the LLM Era (2022~)

| No. | Paper | Significance | Mark |
|---|---|---|---|
| 28 | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., NeurIPS 2022) | Outlier separation; mixed-precision decomposition | 🎯 |
| 29 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (Xiao et al., ICML 2023) | W8A8; activation smoothing | 🎯 |
| 30 | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (Lin et al., MLSys 2024) | W4A16; activation-aware | 🎯 |
| 31 | GPTQ | Duplicate of no. 22 above | |
| 32 | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression (Dettmers et al., ICLR 2024) | Combines sparsity and quantization | 📘 |
| 33 | QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., NeurIPS 2023) | NF4 + LoRA; fine-tunes a 65B model on a single GPU | 📘 |
| 34 | QuIP: 2-Bit Quantization of Large Language Models with Guarantees (Chee et al., NeurIPS 2023) | Incoherence processing | 📘 |
| 35 | A Survey on Model Compression for Large Language Models (plus the many surveys from 2024 onward) | LLM compression survey | 📘 |

📌 Recommended reading order: 28 → 29 → 22 → 30 → 32 → 33
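
The "activation smoothing" noted for entry 29 can be sketched as a per-channel rescaling that moves outlier magnitude from activations into weights without changing the layer output: with s_j = max|X_j|^α / max|W_j|^(1-α), the product (X / s)(s · W) equals XW. The code below is a toy illustration under that formula; the function names and the tiny matrices are made up, and it assumes every channel has a nonzero activation and weight maximum.

```python
def smoothing_scales(X, W, alpha=0.5):
    """Per-input-channel scales s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def smooth(X, W, scales):
    """Divide activation channels by s and multiply the matching weight rows by s."""
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in row] for row, s in zip(W, scales)]
    return X_s, W_s


def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# X: tokens x channels, with an outlier in channel 1. W: channels x outputs.
X = [[0.1, 20.0], [-0.2, -16.0]]
W = [[0.5, -0.3], [0.02, 0.01]]
scales = smoothing_scales(X, W)
X_s, W_s = smooth(X, W, scales)
print(matmul(X, W))
print(matmul(X_s, W_s))
```

After smoothing, the outlier channel's activation range shrinks considerably, so both X_s and W_s are easier to quantize to 8 bits than the original pair.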


⚙️ Part 6: Practical / Systems Perspective

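| No. | Paper / Resource | Significance | Mark |
|---|---|---|---|
| 36 | Efficient Processing of Deep Neural Networks: A Tutorial and Survey (Sze et al., Proceedings of the IEEE, 2017) | Hardware-algorithm co-design | 📘 |
| 37 | MLPerf Inference Benchmark | Standard benchmark | 📙 |
| 38 | ONNX Runtime Quantization docs | Practical deployment | 📙 |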

📅 Reading Schedule

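| Weeks | Topic | Paper Nos. |
|---|---|---|
| 1-2 | Field overview + original KD | 3, 4, 23, 36 |
| 3-4 | Pruning classics | 1, 2, 5, 6, 9 |
| 5-6 | Quantization basics | 10, 11, 17, 18 |
| 7-8 | Advanced quantization | 13, 14, 20, 22 |
| 9-10 | LLM compression | 28, 29, 30, 32, 33 |
| 11-12 | Surveys + deciding the main research direction | 27, 35, independent exploration |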