ChatDLM Technical Report
Date: April 2025
Author: Qafind Labs
Abstract
ChatDLM is the first model to deeply integrate Block Diffusion with a Mixture-of-Experts (MoE) architecture and achieve industry-leading inference speed on GPU. Leveraging parallel block-level diffusion, dynamic expert routing, and an ultra-large context window, ChatDLM sustains 2,800 tokens/s on NVIDIA A100 GPUs, unlocking new possibilities for document-scale generation and real-time interaction.
1. Model Overview
- Model Type: Block Diffusion-based DLM + MoE
- Parameter Size: 7B
- Context Window: 131,072 tokens
- GPU Inference Speed: 2,800 tokens/s on A100
2. Core Components
2.1 Block Diffusion
- Text Partitioning: Split the input into blocks (e.g., 512 tokens per block) and run diffusion over all blocks in parallel in continuous space.
- Reverse Denoising: At each iteration, apply the trained denoising network to every block, with cross-block attention supplying global context.
- Iteration Schedule: Default 12–25 steps, dynamically adjusted based on per-block convergence (a minimal sketch follows this list).
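A minimal sketch of this block-level decoding loop, assuming a hypothetical `denoise_step` network and `cross_block_attn` module; names, shapes, and the convergence test are illustrative, not the production implementation:

```python
import torch

BLOCK_SIZE = 512   # tokens per block, as described above
MAX_STEPS = 25     # upper bound of the iteration schedule
CONV_TOL = 1e-3    # hypothetical per-block convergence threshold


def partition_into_blocks(x: torch.Tensor, block_size: int = BLOCK_SIZE) -> torch.Tensor:
    """Split a (seq_len, dim) latent sequence into equal-sized blocks."""
    pad = (-x.size(0)) % block_size
    if pad:
        x = torch.nn.functional.pad(x, (0, 0, 0, pad))  # pad the sequence dimension
    return x.view(-1, block_size, x.size(-1))           # (num_blocks, block_size, dim)


def reverse_denoise(blocks: torch.Tensor, denoise_step, cross_block_attn) -> torch.Tensor:
    """Run reverse diffusion on all blocks in parallel with shared global context."""
    for step in range(MAX_STEPS):
        context = cross_block_attn(blocks)             # global context across blocks
        updated = denoise_step(blocks, context, step)  # one reverse-diffusion step per block
        delta = (updated - blocks).abs().mean(dim=(1, 2))
        blocks = updated
        if bool((delta < CONV_TOL).all()):             # stop once every block has converged
            break
    return blocks
```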
2.2 Block Parallelism
Simultaneously execute reverse denoising on all blocks. Cross-block summary tokens enable context sharing across blocks, reducing attention complexity from approximately O(n²) to O(n√n) and significantly boosting throughput.
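A hedged sketch of how cross-block summary tokens can keep attention cost near O(n·(block_len + num_blocks)) rather than O(n²): each block attends to its own tokens plus one pooled summary per block. The mean-pooling choice and the projection weights are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def block_attention_with_summaries(blocks: torch.Tensor,
                                   w_q: torch.Tensor,
                                   w_k: torch.Tensor,
                                   w_v: torch.Tensor) -> torch.Tensor:
    """blocks: (num_blocks, block_len, dim); w_*: (dim, dim) projection weights."""
    num_blocks, block_len, dim = blocks.shape
    summaries = blocks.mean(dim=1)                                  # one summary token per block
    local_kv = blocks                                               # keys/values from the same block
    global_kv = summaries.unsqueeze(0).expand(num_blocks, -1, -1)   # summaries shared by every block
    kv = torch.cat([local_kv, global_kv], dim=1)                    # (num_blocks, block_len + num_blocks, dim)
    q, k, v = blocks @ w_q, kv @ w_k, kv @ w_v
    attn = F.softmax(q @ k.transpose(-2, -1) / dim ** 0.5, dim=-1)  # local + summary attention
    return attn @ v                                                 # (num_blocks, block_len, dim)
```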
2.3 Mixture-of-Experts (MoE)
- Each layer contains 32–64 experts; a gating network selects the Top-2 experts per input (sketched below).
- Expert routing runs in parallel with diffusion denoising, enhancing expressiveness with minimal overhead.
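A minimal sketch of Top-2 gating, assuming simple feed-forward experts; the expert count, hidden size, and loop-based dispatch are illustrative rather than the production routing kernel:

```python
import torch
import torch.nn as nn


class Top2MoE(nn.Module):
    """One MoE layer: a gating network picks the Top-2 experts for each token."""

    def __init__(self, dim: int, num_experts: int = 32, hidden: int = 4096):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim)
        scores = self.gate(x)                              # (tokens, num_experts)
        weights, idx = scores.topk(2, dim=-1)              # Top-2 experts per token
        weights = torch.softmax(weights, dim=-1)           # normalize the two gate values
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():                             # dispatch only the routed tokens
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```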
2.4 Ultra-Large Context Window
- Fine-tuned Rotary Positional Embeddings (RoPE) enable stable extrapolation from training length (4,096) to inference length (131,072) tokens.
- Hierarchical KV-Cache: full KV storage for recent blocks, low-rank summaries for older blocks, controlling memory footprint.
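A hedged sketch of the hierarchical KV-Cache idea: the most recent blocks keep full keys/values, while older blocks are compressed into low-rank summaries. The SVD-based compression, block count, and rank are assumptions for illustration.

```python
import torch


class HierarchicalKVCache:
    """Full KV for the most recent blocks, low-rank summaries for older ones."""

    def __init__(self, recent_blocks: int = 4, rank: int = 32):
        self.recent_blocks = recent_blocks
        self.rank = rank
        self.full = []        # [(k, v)] for recent blocks, each (block_len, dim)
        self.compressed = []  # [(k_r, v_r)] low-rank summaries, each (rank, dim)

    def _compress(self, kv: torch.Tensor) -> torch.Tensor:
        """Keep only the top-`rank` directions of a block's keys or values."""
        _, s, vh = torch.linalg.svd(kv, full_matrices=False)
        r = min(self.rank, s.size(0))
        return s[:r, None] * vh[:r]                         # (rank, dim) summary rows

    def append_block(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.full.append((k, v))
        if len(self.full) > self.recent_blocks:             # evict the oldest full block
            old_k, old_v = self.full.pop(0)
            self.compressed.append((self._compress(old_k), self._compress(old_v)))

    def keys_values(self):
        """Concatenate old low-rank summaries with full recent keys/values."""
        ks = [k for k, _ in self.compressed] + [k for k, _ in self.full]
        vs = [v for _, v in self.compressed] + [v for _, v in self.full]
        return torch.cat(ks), torch.cat(vs)
```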
3. GPU Inference Optimizations
- Dynamic Iteration: Group blocks by convergence difficulty; early exit for easy blocks reduces the average step count to ~12 (see the sketch after this list).
- Mixed Precision: BF16 for all matrix and attention operations ensures numerical stability while saving memory.
- Hybrid Parallelism & ZeRO Sharding: Layer-wise and data parallelism with parameter, activation, and KV-Cache sharding for multi-GPU scaling.
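A hedged sketch of the dynamic-iteration idea: at each step, only blocks that have not yet converged are denoised again, so easy blocks exit early. The convergence threshold and the `denoise_step` signature are illustrative assumptions.

```python
import torch


def dynamic_iteration(blocks: torch.Tensor, denoise_step,
                      max_steps: int = 25, tol: float = 1e-3):
    """Denoise only unconverged blocks at each step; easy blocks exit early."""
    active = torch.ones(blocks.size(0), dtype=torch.bool)
    steps_used = torch.zeros(blocks.size(0), dtype=torch.long)
    for step in range(max_steps):
        if not bool(active.any()):
            break
        updated = denoise_step(blocks[active], step)        # denoise the active subset only
        delta = (updated - blocks[active]).abs().mean(dim=(1, 2))
        blocks[active] = updated
        steps_used[active] += 1
        still_active = active.clone()
        still_active[active] = delta >= tol                 # converged blocks drop out
        active = still_active
    return blocks, steps_used
```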
4. Performance Benchmark (A100 GPU)
| Metric | Value |
| --- | --- |
| Tokens/s | 2,800 |
| Context Window | 131,072 tokens |
| Iteration Steps | 12–25 |
| HumanEval (0-shot) | 92.0 |
| Fill-in-the-Middle | 84.2 |
| ARC-E (0-shot) | 83.9 |
5. Conclusion & Future Work
By combining Block Diffusion with dynamic MoE expert routing and optimized GPU inference, ChatDLM achieves an industry-leading 2,800 tokens/s throughput and 131k-token context support on A100 GPUs. Future directions include adaptive iteration, graph-attention integration, and multimodal diffusion to support higher precision and a broader range of scenarios.