ChatDLM Technical Report

Date: April 2025
Author: Qafind Labs

Abstract

ChatDLM is the first model to deeply integrate Block Diffusion with a Mixture-of-Experts (MoE) architecture and achieve industry-leading inference speed on GPUs. Leveraging parallel block-level diffusion, dynamic expert routing, and an ultra-large context window, ChatDLM sustains 2,800 tokens/s on NVIDIA A100 GPUs, unlocking new possibilities for document-scale generation and real-time interaction.

1. Model Overview

2. Core Components

2.1 Block Diffusion

2.2 Block Parallelism

Reverse denoising is executed on all blocks simultaneously. Cross-block summary tokens share context across blocks, reducing attention complexity from approximately O(n²) to O(n√n) and significantly boosting throughput. A sketch of this scheme follows.
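
To make the mechanism concrete, below is a minimal PyTorch sketch of block-parallel denoising with cross-block summary tokens. The module names, dimensions, and the use of a single mean-pooled summary token per block are illustrative assumptions, not ChatDLM's actual implementation.

```python
# Minimal sketch of block-parallel denoising with cross-block summary tokens.
# Module names, dimensions, and the single mean-pooled summary token per block
# are illustrative assumptions, not ChatDLM's actual implementation.
import torch
import torch.nn as nn


class BlockParallelDenoiser(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Full attention inside each block; cheap attention over block summaries.
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.summary_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.summarize = nn.Linear(d_model, d_model)

    def forward(self, blocks: torch.Tensor) -> torch.Tensor:
        # blocks: (num_blocks, block_len, d_model) -- one denoising step for all blocks at once.
        num_blocks, block_len, d_model = blocks.shape

        # 1) Local step: tokens attend only within their own block.
        local, _ = self.local_attn(blocks, blocks, blocks)

        # 2) Each block emits a summary token (a mean-pooled projection here).
        summaries = self.summarize(local.mean(dim=1))                # (num_blocks, d_model)
        shared = summaries.unsqueeze(0).expand(num_blocks, -1, -1)   # every block sees all summaries

        # 3) Cross-block mixing: tokens attend to the summaries, not to all other tokens.
        mixed, _ = self.summary_attn(local, shared, shared)
        return local + mixed


# Usage: denoise 8 blocks of 128 tokens each in a single parallel call.
out = BlockParallelDenoiser()(torch.randn(8, 128, 512))
print(out.shape)  # torch.Size([8, 128, 512])
```

One way to arrive at the O(n√n) figure: with block length b, within-block attention costs O(n·b) and attending to the n/b block summaries costs O(n²/b); choosing b ≈ √n balances the two terms at O(n√n).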

2.3 Mixture-of-Experts (MoE)

2.4 Ultra-Large Context Window

3. GPU Inference Optimizations

  1. Dynamic Iteration: Blocks are grouped by convergence difficulty, and easy blocks exit the denoising loop early, reducing the average number of steps to roughly 12 (see the early-exit sketch after this list).
  2. Mixed Precision: BF16 is used for all matrix and attention operations, preserving numerical stability while reducing memory use (see the autocast sketch after this list).
  3. Hybrid Parallelism & ZeRO Sharding: Layer-wise and data parallelism, combined with parameter, activation, and KV-Cache sharding, enable multi-GPU scaling.
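
The early-exit loop for dynamic iteration can be sketched as follows; the denoiser, the convergence tolerance, and the per-block convergence test are assumptions for illustration, with the 12/25 step bounds taken from the range reported in Section 4.

```python
# Sketch of dynamic iteration: every block runs its own denoising loop and
# exits early once its latest update falls below a tolerance. The tolerance
# and the per-block convergence test are illustrative assumptions.
import torch


def denoise_with_early_exit(blocks, denoise_step, max_steps=25, min_steps=12, tol=1e-3):
    # blocks: (num_blocks, block_len, d_model); denoise_step refines a batch of blocks.
    active = torch.ones(blocks.shape[0], dtype=torch.bool)
    steps_used = torch.zeros(blocks.shape[0], dtype=torch.long)

    for _ in range(max_steps):
        if not active.any():
            break                                       # every block has converged
        idx = active.nonzero(as_tuple=True)[0]          # blocks still being refined
        updated = denoise_step(blocks[idx])
        delta = (updated - blocks[idx]).abs().mean(dim=(1, 2))
        blocks[idx] = updated
        steps_used[idx] += 1
        # "Easy" blocks exit once their update is small and they ran >= min_steps.
        active[idx[(delta < tol) & (steps_used[idx] >= min_steps)]] = False
    return blocks, steps_used


# Usage with a stand-in denoiser that halves the residual each step.
blocks, steps = denoise_with_early_exit(torch.randn(8, 128, 512), lambda b: b * 0.5)
print(steps)  # each block stops as soon as both exit criteria are satisfied
```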
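
A minimal sketch of the BF16 mixed-precision setup using PyTorch's autocast is shown below; the toy attention module stands in for the real model, and keeping weights in FP32 outside the autocast region reflects common practice rather than a detail stated in this report.

```python
# Sketch: execute matrix multiplications and attention in BF16 via autocast.
# The toy module and the FP32 weight convention are assumptions.
import torch
import torch.nn as nn

model = nn.MultiheadAttention(512, 8, batch_first=True).cuda()   # weights stay in FP32
x = torch.randn(2, 128, 512, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out, _ = model(x, x, x)   # attention matmuls run in BF16 inside this region

print(out.dtype)  # typically torch.bfloat16 under autocast
```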

4. Performance Benchmark (A100 GPU)

ChatDLM Performance Benchmark

Metric                  Value
------                  -----
Tokens/s                2,800
Context Window          131,072 tokens
Iteration Steps         12–25
HumanEval (0-shot)      92.0
Fill-in-the-Middle      84.2
ARC-E (0-shot)          83.9

5. Conclusion & Future Work

By combining Block Diffusion with dynamic MoE expert routing and optimized GPU inference, ChatDLM achieves an industry-leading 2,800 tokens/s throughput and 131k-token context support on A100 GPUs. Future directions include adaptive iteration, graph-attention integration, and multimodal diffusion, aimed at higher precision and a broader range of scenarios.