Tracking Spec Decode Support in vLLM: A Deep Dive
Hey guys! Today, we're diving deep into the vLLM project to discuss and track the progress of Spec Decode support. This is a crucial area for optimization, and understanding how it works with other features is super important. Let's break down the motivation, current status, and future goals.
Motivation Behind Spec Decode Support
So, what's the big deal with Spec Decode? In short, speculative decoding makes generation faster by letting a cheap draft model (or extra prediction heads) propose several tokens at once, which the target model then verifies in a single forward pass. The main motivation behind tracking Spec Decode support is to understand how it plays with other features within vLLM. This integration is key to unlocking the full potential of our models. Think of it like this: you've got a super-fast engine (Spec Decode), but you need to make sure it works seamlessly with the transmission, wheels, and everything else (other features). That's what we're trying to achieve here.
I've personally started digging into how spec decode composes with other features, and this discussion is all about keeping tabs on the current status. We want to ensure that Spec Decode isn't just a standalone feature, but a well-integrated part of the vLLM ecosystem. This means looking at various models, hardware configurations, and different acceleration techniques.
Key Areas of Investigation
To get a handle on this, we're looking at a bunch of different factors:
- Models: Which models are we testing with? Different models have different architectures and characteristics, so what works for one might not work for another.
- Hardware: What hardware are we running on? The interplay between software and hardware is critical. We need to know how Spec Decode performs on different GPUs and other hardware accelerators.
- Spec Decode Variants: We're exploring different Spec Decode implementations, like EAGLE and MTP (multi-token prediction), to see which ones offer the best performance and compatibility.
- Tensor Parallelism (TP), Data Parallelism (DP), and Expert Parallelism (EP): These are different ways of distributing the workload across multiple devices. We need to understand how Spec Decode behaves in these parallel environments.
- CUDA Graph Mode: CUDA graphs can help reduce overhead and improve performance. We're investigating how Spec Decode interacts with different CUDA graph modes.
- DCP (Decode Context Parallelism), DBO (Dual Batch Overlap), and FP8 KV Cache: These are advanced techniques for optimizing memory usage and communication. We want to ensure that Spec Decode can take advantage of these optimizations.
- Attention Backends: Different attention mechanisms (like FLASH_ATTN, TRITON_ATTN, FLASHINFER, and FLEX_ATTENTION) have their own performance characteristics. We're testing Spec Decode with various attention backends to find the best combinations.
By looking at all these pieces, we can get a complete picture of how Spec Decode is performing and where we need to focus our efforts.
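To make these dimensions concrete, here's a minimal launch sketch for one of the passing combinations from the table below (L3-8B with EAGLE, TP=2, FULL_AND_PIECEWISE CUDA graphs, FLASH_ATTN). This is only a sketch assuming a recent vLLM build: the exact `speculative_config` and `compilation_config` keys have shifted across versions, and the EAGLE draft model name is an illustrative choice, not something this tracker prescribes.

```python
import os

# Select the attention backend under test; vLLM reads this env var at startup.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

# Roughly the "L3-8B / EAGLE / TP=2 / FULL_AND_PIECEWISE / FLASH_ATTN" row.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=2,                           # the TP dimension
    speculative_config={                              # the Spec Decode dimension
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",  # illustrative EAGLE draft head
        "num_speculative_tokens": 3,
    },
    compilation_config={                              # the CUDA Graph Mode dimension
        "cudagraph_mode": "FULL_AND_PIECEWISE",
    },
)

outputs = llm.generate(
    ["Speculative decoding speeds up LLM inference by"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```

Swapping the env var to TRITON_ATTN or FLASHINFER (or the mode to PIECEWISE/NONE) is how the other rows in the table vary this same setup.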
Current Status: A Detailed Breakdown
Let's dive into the nitty-gritty. Here's a table summarizing the current status of Spec Decode support across different configurations. This table is our roadmap, showing us what's working, what's not, and what needs more investigation. Check it out:
| Model | Hardware | Spec Decode | TP | DP | EP | CUDA Graph Mode | DCP | DBO | FP8 KV Cache | Attn Backend | Status | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L3-8B | H100 | | 1 | 1 | N | FULL_AND_PIECEWISE | 1 | N | N | FLASH_ATTN | ✅ | |
| L3-8B | H100 | EAGLE | 1 | 1 | N | FULL_AND_PIECEWISE | 1 | N | N | FLASH_ATTN | ✅ | |
| L3-8B | H100 | EAGLE | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | N | FLASH_ATTN | ✅ | |
| L3-8B | H100 | | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | N | TRITON_ATTN | ✅ | |
| L3-8B | H100 | EAGLE | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | N | TRITON_ATTN | ❌ | IMA |
| L3-8B | H100 | | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | N | FLASHINFER | ✅ | |
| L3-8B | H100 | EAGLE | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | N | FLASHINFER | ❌ | IMA |
| L3-8B | H100 | EAGLE | 2 | 1 | N | FULL_AND_PIECEWISE | 1 | N | N | FLEX_ATTENTION | ⚠️ | OOMs during benchmark (10 prompts work, 1000 fail; memory leak?); also hits recompile_limit with fullgraph=True |
| L3-8B | H100 | | 1 | 2 | N | FULL_AND_PIECEWISE | 1 | N | N | FLASH_ATTN | ✅ | |
| L3-8B | H100 | EAGLE | 1 | 2 | N | FULL_AND_PIECEWISE | 1 | N | N | FLASH_ATTN | ❌ | Crashes on startup (assert should_attempt_dp_padding == should_dp_pad) |
| L3-8B | H100 | EAGLE | 1 | 2 | N | PIECEWISE | 1 | N | N | FLASH_ATTN | ❌ | Crashes on startup (assert should_attempt_dp_padding == should_dp_pad) |
| L3-8B | H100 | EAGLE | 1 | 2 | N | NONE | 1 | N | N | FLASH_ATTN | ❌ | Hangs during inference, even with DeepEP kernels |
| L3-8B | H100 | EAGLE | 1 | 1 | N | FULL_AND_PIECEWISE | 1 | N | Y | FLASH_ATTN | ✅ | |
| DSR1 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | N | FLASHMLA | ✅ | |
| DSR1 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | N | TRITON_MLA | | |
| DSR1 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | N | FLASH_ATTN_MLA | ✅ | |
| DSR1 | H200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 8 | N | N | FLASH_ATTN_MLA | ✅ | |
| DSR1 | H200 | MTP | 1 | 8 | Y | FULL_AND_PIECEWISE | 1 | N | N | FLASH_ATTN_MLA | ❌ | Hangs during inference, even with DeepEP kernels |
| DSR1 | B200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | N | FLASHMLA | ❌ | FlashMLA dense is Hopper-only |
| DSR1 | B200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | N | CUTLASS_MLA | ⚠️ | Works but uses the prefill pathway, so performance will suffer |
| DSR1 | B200 | MTP | 8 | 1 | N | FULL_AND_PIECEWISE | 1 | N | N | FLASHINFER_MLA | ✅ | NOTE: supports q_len > 1; we should change reorder_batch_threshold (currently 1) |
Key Observations
Let's break down what we're seeing in this table. It's like reading a map: we need to understand the symbols and landmarks to navigate effectively.
- ✅ (Green Checkmarks): These are our wins! They show configurations where Spec Decode is playing nicely with the other features. For example, L3-8B on H100 with EAGLE and FLASH_ATTN is working great. This gives us a solid foundation to build on.
- ❌ (Red Crosses): These are the roadblocks. They indicate configurations where things are crashing, hanging, or just not working as expected. For instance, the crashes on startup with L3-8B, H100, EAGLE, and DP=2 are critical issues we need to address ASAP.
- ⚠️ (Yellow Warnings): These are the areas that need a closer look. They might be working, but with caveats. The OOMs (Out of Memory errors) during benchmarks with FLEX_ATTENTION are a big red flag, suggesting a potential memory leak or inefficiency. Similarly, the performance issue with CUTLASS_MLA on B200, where it falls back to the prefill pathway, indicates that we're not fully utilizing the hardware's capabilities.
- IMA: You'll see "IMA" in the notes for the failing EAGLE rows with TRITON_ATTN and FLASHINFER. It stands for illegal memory access, a CUDA-side fault that kills the process and usually means a kernel is reading or writing out of bounds.
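For the ⚠️ FLEX_ATTENTION row specifically, a minimal repro sketch of the scaling behavior from the notes (10 prompts fine, 1000 OOM) might look like the following. Again, this assumes a recent vLLM build, and the model and draft names are illustrative placeholders rather than the exact setup used for the table.

```python
import os

# Force the backend that shows the OOM behavior in the table.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLEX_ATTENTION"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",  # illustrative draft head
        "num_speculative_tokens": 3,
    },
)

# Grow the batch: the table reports 10 prompts succeed while 1000 OOM, which is
# what you'd expect from memory accumulating across batches (a leak) rather
# than a single oversized allocation.
for n in (10, 100, 1000):
    prompts = [f"Write one sentence about topic {i}." for i in range(n)]
    outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
    print(f"batch of {n}: completed {len(outputs)} requests")
```

Watching GPU memory between iterations (e.g., with `nvidia-smi`) should show whether usage creeps upward, which would support the memory-leak hypothesis in the notes.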