CCF-A
USENIX ATC(USENIX Annual Technical Conference)
- PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch. USENIX ATC 2024
SC(International Conference for High Performance Computing, Networking, Storage, and Analysis)
ASPLOS(International Conference on Architectural Support for Programming Languages and Operating Systems)
- Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow. ASPLOS (1) 2025
HPCA(IEEE International Symposium on High Performance Computer Architecture)
PPoPP(ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming)
OSDI(USENIX Symposium on Operating Systems Design and Implementation)
- Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022
EuroSys(European Conference on Computer Systems)
PLDI(ACM SIGPLAN Conference on Programming Language Design and Implementation)
ICML(International Conference on Machine Learning)
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. ICML 2024
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. ICML 2024
- Fast Inference from Transformers via Speculative Decoding. ICML 2023
NeurIPS(Conference on Neural Information Processing Systems)
- SpecExec: Massively Parallel Speculative Decoding For Interactive LLM Inference on Consumer Devices. NeurIPS 2024
- Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting. NeurIPS 2024
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. NeurIPS 2023
- Blockwise Parallel Decoding for Deep Autoregressive Models. NeurIPS 2018
ACL(Annual Meeting of the Association for Computational Linguistics)
- LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding. ACL (1) 2024
- BASS: Batched Attention-optimized Speculative Sampling. ACL (Findings) 2024
- Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding. ACL (Findings) 2024
CCF-B
ICS(International Conference on Supercomputing)
ICPP(International Conference on Parallel Processing)
CLUSTER(IEEE International Conference on Cluster Computing)
CGO(The International Symposium on Code Generation and Optimization)
IPDPS(IEEE International Parallel & Distributed Processing Symposium)
EMNLP(Conference on Empirical Methods in Natural Language Processing)
- EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. EMNLP 2024
- Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding. EMNLP 2024
NAACL(Annual Conference of the North American Chapter of the Association for Computational Linguistics)
- REST: Retrieval-Based Speculative Decoding. NAACL-HLT 2024