First, install SGLang. It can be installed with the following commands:
pip install uv
pip install "sglang[all]>=0.4.3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python -i https://pypi.tuna.tsinghua.edu.cn/simple
The model weights need to be downloaded from deepseek-ai/DeepSeek-R1; the rest of this guide assumes they are placed under /mnt/models/DeepSeek-R1.
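One way to fetch the weights is the Hugging Face CLI. This is only a sketch and not part of the original setup; it assumes network access to huggingface.co (the HF_ENDPOINT environment variable can point to a mirror if direct access is slow):
pip install -U "huggingface_hub[cli]"
# export HF_ENDPOINT=https://hf-mirror.com   # optional mirror, adjust as needed
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir /mnt/models/DeepSeek-R1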
Initial Attempt and Problem
The server was first launched with the following command:
python3 -m sglang.launch_server \
--model /mnt/models/DeepSeek-R1 \
--tp 8 \
--dtype half \
--trust-remote-code \
--enable-torch-compile \
--disable-cuda-graph \
--mem-fraction-static 0.7 \
--served-model-name deepseek --api-key 123
But it failed with the following error:
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != c10::Half
This happens because DeepSeek-R1 uses mixed data types (e.g., BFloat16 and F8_E4M3), while --dtype half forces everything to float16, causing the dtype mismatch.
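To confirm the checkpoint's native dtypes, the model's config.json can be inspected. A quick sanity-check sketch, assuming the standard Hugging Face config layout of the official FP8 release (field names may differ):
# torch_dtype should report bfloat16, and the quantization_config block should show the fp8 (e4m3) format
grep -E '"torch_dtype"|"quant_method"|"fmt"' /mnt/models/DeepSeek-R1/config.json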
Solution
Changing --dtype to auto resolves the problem:
python3 -m sglang.launch_server \
--model /mnt/models/DeepSeek-R1 \
--tp 8 \
--dtype auto \
--trust-remote-code \
--enable-torch-compile \
--disable-cuda-graph \
--mem-fraction-static 0.7 \
--served-model-name deepseek --api-key 123
This lets the model load in its original data types, and SGLang starts successfully.
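Once the server is up, it can be smoke-tested through SGLang's OpenAI-compatible endpoint. A minimal sketch, assuming the default host and port (127.0.0.1:30000) and the --served-model-name and --api-key values used above:
curl -s http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 123" \
  -d '{"model": "deepseek", "messages": [{"role": "user", "content": "Hello, who are you?"}], "max_tokens": 64}'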
Performance Optimization
To speed up token generation, it is recommended to enable speculative decoding with settings such as:
--speculative-num-steps 3
--speculative-num-draft-tokens 4
--mem-fraction-static 0.7
--schedule-conservativeness 0.01
According to the issue, this can reach about 65 tokens per second at low concurrency (e.g., a single request).
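For a rough decode-speed check at concurrency 1, one can time a single long generation and divide the completion token count (from the usage field of the response) by the elapsed wall-clock time. This is only a back-of-the-envelope sketch that ignores prefill time, assuming the same endpoint, model name, and API key as above:
time curl -s http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 123" \
  -d '{"model": "deepseek", "messages": [{"role": "user", "content": "Explain speculative decoding in detail."}], "max_tokens": 512}' \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["usage"])'
# tokens/s is roughly completion_tokens divided by the elapsed seconds; issue #4616 reports about 65 tokens/s at this concurrency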
Reference: Performance Optimization Strategies
Based on issue #4616 and related discussions, the key parameters affecting token generation speed are summarized below (a combined launch sketch follows the table):
| Parameter | Description / impact | Recommended value / notes | Known issues |
|---|---|---|---|
| --speculative-algorithm | Enables Next-N speculative decoding, which builds a draft token tree and reduces the number of forward passes | Set to "NEXTN" | Performance drops at high concurrency (>32) |
| --speculative-num-steps | Number of speculative decoding steps; affects throughput | 3 recommended; highest throughput in testing | - |
| --speculative-num-draft-tokens | Number of draft tokens; affects performance | 4 recommended; best performance in testing | - |
| --speculative-eagle-topk | Top-k value for the EAGLE strategy; affects speculative decoding efficiency | 1 recommended | - |
| --speculative-draft | Draft model directory; the correct path must be specified | Set to `` | - |
| --mem-fraction-static | Fraction of GPU memory reserved statically; prevents OOM errors | 0.7 recommended | - |
| --schedule-conservativeness | Scheduler conservativeness; affects throughput | 0.01 recommended (aggressive, high-throughput setting) | - |
| CUDA Graph | Reduces kernel-launch overhead and synchronization latency, improving inference performance at the cost of extra memory | Enabled by default; disable with --disable-cuda-graph; tune with --cuda-graph-max-bs (e.g. 160, 512, 1024) | May trigger CUDA errors together with DP-Attention and dp-size > 4 |
| Torch Compile | Compiles the PyTorch model into an optimized execution graph for better GPU performance | Enable with --enable-torch-compile (disabled by default); capture takes about 1 hour for DeepSeek-R1 and about 5 minutes for Next-N | - |
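Putting the table's recommendations together, a launch command would look roughly like the sketch below. This is an unverified sketch: flag spellings follow the table above and may differ across SGLang versions, and <path-to-nextn-draft-model> is a placeholder that must point to an actual NextN/MTP draft checkpoint.
python3 -m sglang.launch_server \
--model /mnt/models/DeepSeek-R1 \
--tp 8 \
--trust-remote-code \
--speculative-algorithm NEXTN \
--speculative-draft <path-to-nextn-draft-model> \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.7 \
--schedule-conservativeness 0.01 \
--served-model-name deepseek --api-key 123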
Reference Jupyter script:
pip install uv
pip install "sglang[all]>=0.4.3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python -i https://pypi.tuna.tsinghua.edu.cn/simple
python3 -m sglang.launch_server \
--model /mnt/models/DeepSeek-R1 \
--tp 8 \
--trust-remote-code \
--enable-ep-moe \
--mem-fraction-static 0.7 \
--schedule-conservativeness 0.01 \
--served-model-name deepseek --api-key 123
Reference output:
2025-03-22 15:09:20.381293: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-03-22 15:09:20.384090: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-22 15:09:20.418164: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-03-22 15:09:20.418198: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-03-22 15:09:20.418236: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-22 15:09:20.425708: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-03-22 15:09:21.048490: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
INFO 03-22 15:09:22 __init__.py:190] Automatically detected platform cuda.
[2025-03-22 15:09:33] server_args=ServerArgs(model_path='/mnt/models/DeepSeek-R1', tokenizer_path='/mnt/models/DeepSeek-R1', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='deepseek', chat_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30000, mem_fraction_static=0.7, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=0.01, cpu_offload_gb=0, page_size=1, tp_size=8, stream_interval=1, stream_output=False, random_seed=286257988, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key='123', file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=8, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=5, speculative_eagle_topk=4, speculative_num_draft_tokens=8, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=True, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False)
INFO 03-22 15:09:39 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:39 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:39 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:40 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:40 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:40 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:40 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:40 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:40 __init__.py:190] Automatically detected platform cuda.
[2025-03-22 15:09:43 TP5] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:43 TP5] Init torch distributed begin.
[2025-03-22 15:09:45 TP0] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:45 TP0] Init torch distributed begin.
[2025-03-22 15:09:46 TP1] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:46 TP1] Init torch distributed begin.
[2025-03-22 15:09:46 TP3] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:46 TP3] Init torch distributed begin.
[2025-03-22 15:09:47 TP7] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:47 TP7] Init torch distributed begin.
[2025-03-22 15:09:47 TP6] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:47 TP6] Init torch distributed begin.
[2025-03-22 15:09:47 TP4] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:47 TP4] Init torch distributed begin.
[2025-03-22 15:09:47 TP2] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:47 TP2] Init torch distributed begin.
[2025-03-22 15:09:48 TP0] sglang is using nccl==2.21.5
[2025-03-22 15:09:48 TP1] sglang is using nccl==2.21.5
[2025-03-22 15:09:48 TP2] sglang is using nccl==2.21.5
[2025-03-22 15:09:48 TP4] sglang is using nccl==2.21.5
[2025-03-22 15:09:48 TP3] sglang is using nccl==2.21.5
[2025-03-22 15:09:48 TP6] sglang is using nccl==2.21.5
[2025-03-22 15:09:48 TP5] sglang is using nccl==2.21.5
[2025-03-22 15:09:48 TP7] sglang is using nccl==2.21.5
[2025-03-22 15:09:52 TP7] Init torch distributed ends. mem usage=1.50 GB
[2025-03-22 15:09:52 TP2] Init torch distributed ends. mem usage=1.97 GB
[2025-03-22 15:09:52 TP4] Init torch distributed ends. mem usage=1.97 GB
[2025-03-22 15:09:52 TP1] Init torch distributed ends. mem usage=1.97 GB
[2025-03-22 15:09:52 TP5] Init torch distributed ends. mem usage=1.97 GB
[2025-03-22 15:09:52 TP7] Load weight begin. avail mem=137.80 GB
[2025-03-22 15:09:52 TP2] Load weight begin. avail mem=137.33 GB
[2025-03-22 15:09:52 TP3] Init torch distributed ends. mem usage=1.97 GB
[2025-03-22 15:09:52 TP4] Load weight begin. avail mem=137.33 GB
[2025-03-22 15:09:52 TP1] Load weight begin. avail mem=137.33 GB
[2025-03-22 15:09:52 TP5] Load weight begin. avail mem=137.33 GB
[2025-03-22 15:09:52 TP3] Load weight begin. avail mem=137.33 GB
[2025-03-22 15:09:52 TP0] Init torch distributed ends. mem usage=1.87 GB
[2025-03-22 15:09:52 TP0] Load weight begin. avail mem=137.42 GB
[2025-03-22 15:09:52 TP6] Init torch distributed ends. mem usage=1.97 GB
[2025-03-22 15:09:52 TP6] Load weight begin. avail mem=137.33 GB
[2025-03-22 15:09:52 TP2] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP4] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP1] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP7] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP6] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP3] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP5] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP4] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-03-22 15:09:52 TP7] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-03-22 15:09:52 TP3] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-03-22 15:09:52 TP2] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-03-22 15:09:52 TP6] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-03-22 15:09:52 TP0] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-03-22 15:09:52 TP1] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-03-22 15:09:52 TP5] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
Loading safetensors checkpoint shards: 0% Completed | 0/163 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 1% Completed | 1/163 [00:00<00:43, 3.70it/s]
Loading safetensors checkpoint shards: 2% Completed | 4/163 [00:00<00:22, 7.00it/s]
Loading safetensors checkpoint shards: 4% Completed | 6/163 [00:00<00:17, 9.10it/s]
Loading safetensors checkpoint shards: 5% Completed | 8/163 [00:01<00:22, 7.00it/s]
Loading safetensors checkpoint shards: 6% Completed | 9/163 [00:01<00:25, 6.08it/s]
Loading safetensors checkpoint shards: 7% Completed | 11/163 [00:01<00:23, 6.57it/s]
Loading safetensors checkpoint shards: 7% Completed | 12/163 [00:01<00:21, 6.96it/s]
Loading safetensors checkpoint shards: 8% Completed | 13/163 [00:01<00:24, 6.15it/s]
Loading safetensors checkpoint shards: 10% Completed | 16/163 [00:02<00:19, 7.50it/s]
Loading safetensors checkpoint shards: 11% Completed | 18/163 [00:02<00:19, 7.40it/s]
Loading safetensors checkpoint shards: 13% Completed | 22/163 [00:02<00:12, 11.41it/s]
Loading safetensors checkpoint shards: 15% Completed | 24/163 [00:02<00:13, 10.00it/s]
Loading safetensors checkpoint shards: 16% Completed | 26/163 [00:03<00:14, 9.37it/s]
Loading safetensors checkpoint shards: 17% Completed | 28/163 [00:03<00:15, 8.92it/s]
Loading safetensors checkpoint shards: 19% Completed | 31/163 [00:03<00:11, 11.61it/s]
Loading safetensors checkpoint shards: 20% Completed | 33/163 [00:04<00:18, 7.19it/s]
Loading safetensors checkpoint shards: 21% Completed | 35/163 [00:04<00:19, 6.44it/s]
Loading safetensors checkpoint shards: 24% Completed | 39/163 [00:04<00:12, 9.56it/s]
Loading safetensors checkpoint shards: 26% Completed | 42/163 [00:05<00:12, 9.75it/s]
Loading safetensors checkpoint shards: 27% Completed | 44/163 [00:05<00:10, 10.91it/s]
Loading safetensors checkpoint shards: 28% Completed | 46/163 [00:05<00:16, 7.22it/s]
Loading safetensors checkpoint shards: 31% Completed | 50/163 [00:05<00:10, 10.95it/s]
Loading safetensors checkpoint shards: 33% Completed | 54/163 [00:05<00:07, 13.97it/s]
Loading safetensors checkpoint shards: 36% Completed | 58/163 [00:06<00:07, 13.26it/s]
Loading safetensors checkpoint shards: 37% Completed | 60/163 [00:06<00:08, 11.68it/s]
Loading safetensors checkpoint shards: 38% Completed | 62/163 [00:06<00:10, 9.71it/s]
Loading safetensors checkpoint shards: 41% Completed | 67/163 [00:06<00:06, 14.87it/s]
Loading safetensors checkpoint shards: 43% Completed | 70/163 [00:07<00:09, 9.66it/s]
Loading safetensors checkpoint shards: 44% Completed | 72/163 [00:08<00:12, 7.41it/s]
Loading safetensors checkpoint shards: 45% Completed | 74/163 [00:08<00:12, 7.00it/s]
Loading safetensors checkpoint shards: 48% Completed | 79/163 [00:08<00:09, 8.88it/s]
Loading safetensors checkpoint shards: 50% Completed | 81/163 [00:09<00:12, 6.64it/s]
Loading safetensors checkpoint shards: 52% Completed | 84/163 [00:09<00:10, 7.38it/s]
Loading safetensors checkpoint shards: 53% Completed | 86/163 [00:09<00:09, 8.41it/s]
Loading safetensors checkpoint shards: 54% Completed | 88/163 [00:10<00:11, 6.79it/s]
Loading safetensors checkpoint shards: 55% Completed | 89/163 [00:10<00:10, 6.95it/s]
Loading safetensors checkpoint shards: 55% Completed | 90/163 [00:10<00:12, 5.91it/s]
Loading safetensors checkpoint shards: 56% Completed | 91/163 [00:11<00:14, 5.13it/s]
Loading safetensors checkpoint shards: 57% Completed | 93/163 [00:11<00:12, 5.53it/s]
Loading safetensors checkpoint shards: 58% Completed | 95/163 [00:11<00:10, 6.63it/s]
Loading safetensors checkpoint shards: 59% Completed | 96/163 [00:11<00:11, 5.64it/s]
Loading safetensors checkpoint shards: 61% Completed | 100/163 [00:11<00:06, 10.14it/s]
Loading safetensors checkpoint shards: 63% Completed | 103/163 [00:12<00:05, 10.43it/s]
Loading safetensors checkpoint shards: 64% Completed | 105/163 [00:12<00:08, 6.90it/s]
Loading safetensors checkpoint shards: 66% Completed | 107/163 [00:13<00:08, 6.58it/s]
Loading safetensors checkpoint shards: 69% Completed | 112/163 [00:13<00:04, 11.18it/s]
Loading safetensors checkpoint shards: 70% Completed | 114/163 [00:13<00:04, 10.48it/s]
Loading safetensors checkpoint shards: 71% Completed | 116/163 [00:13<00:05, 8.91it/s]
Loading safetensors checkpoint shards: 74% Completed | 120/163 [00:14<00:04, 9.64it/s]
Loading safetensors checkpoint shards: 75% Completed | 122/163 [00:14<00:06, 6.76it/s]
Loading safetensors checkpoint shards: 77% Completed | 125/163 [00:14<00:04, 8.68it/s]
Loading safetensors checkpoint shards: 78% Completed | 127/163 [00:15<00:03, 9.79it/s]
Loading safetensors checkpoint shards: 79% Completed | 129/163 [00:15<00:04, 6.98it/s]
Loading safetensors checkpoint shards: 82% Completed | 133/163 [00:15<00:03, 8.19it/s]
Loading safetensors checkpoint shards: 83% Completed | 135/163 [00:16<00:03, 7.59it/s]
Loading safetensors checkpoint shards: 83% Completed | 136/163 [00:16<00:04, 6.40it/s]
Loading safetensors checkpoint shards: 84% Completed | 137/163 [00:16<00:04, 5.72it/s]
Loading safetensors checkpoint shards: 85% Completed | 139/163 [00:17<00:04, 5.93it/s]
Loading safetensors checkpoint shards: 87% Completed | 142/163 [00:17<00:03, 6.18it/s]
Loading safetensors checkpoint shards: 89% Completed | 145/163 [00:17<00:02, 7.12it/s]
Loading safetensors checkpoint shards: 91% Completed | 148/163 [00:18<00:01, 7.55it/s]
Loading safetensors checkpoint shards: 93% Completed | 152/163 [00:18<00:01, 10.47it/s]
Loading safetensors checkpoint shards: 94% Completed | 154/163 [00:18<00:01, 8.80it/s]
Loading safetensors checkpoint shards: 96% Completed | 157/163 [00:19<00:00, 8.68it/s]
Loading safetensors checkpoint shards: 98% Completed | 159/163 [00:19<00:00, 7.81it/s]
Loading safetensors checkpoint shards: 99% Completed | 161/163 [00:19<00:00, 7.43it/s]
Loading safetensors checkpoint shards: 100% Completed | 163/163 [00:19<00:00, 8.85it/s]
Loading safetensors checkpoint shards: 100% Completed | 163/163 [00:19<00:00, 8.20it/s]
[2025-03-22 15:10:14 TP7] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=58.21 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP2] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=57.74 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP6] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=57.74 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP5] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=57.74 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP4] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=57.74 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP3] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=57.74 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP1] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=57.74 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP0] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=57.83 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP7] Memory pool end. avail mem=40.63 GB
[2025-03-22 15:10:14 TP3] Memory pool end. avail mem=40.16 GB
[2025-03-22 15:10:14 TP2] Memory pool end. avail mem=40.16 GB
[2025-03-22 15:10:14 TP4] Memory pool end. avail mem=40.16 GB
[2025-03-22 15:10:14 TP0] Memory pool end. avail mem=40.26 GB
[2025-03-22 15:10:14 TP1] Memory pool end. avail mem=40.16 GB
[2025-03-22 15:10:14 TP6] Memory pool end. avail mem=40.16 GB
[2025-03-22 15:10:14 TP5] Memory pool end. avail mem=40.16 GB
[2025-03-22 15:10:14 TP7] Capture cuda graph begin. This can take up to several minutes. avail mem=40.54 GB
[2025-03-22 15:10:14 TP2] Capture cuda graph begin. This can take up to several minutes. avail mem=40.07 GB
[2025-03-22 15:10:14 TP4] Capture cuda graph begin. This can take up to several minutes. avail mem=40.07 GB
[2025-03-22 15:10:14 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=40.16 GB
[2025-03-22 15:10:14 TP3] Capture cuda graph begin. This can take up to several minutes. avail mem=40.07 GB
Capturing batches (avail_mem=40.00 GB): 0%| | 0/23 [00:00<?, ?it/s][2025-03-22 15:10:14 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=40.07 GB
[2025-03-22 15:10:14 TP6] Capture cuda graph begin. This can take up to several minutes. avail mem=40.07 GB
[2025-03-22 15:10:14 TP5] Capture cuda graph begin. This can take up to several minutes. avail mem=40.07 GB
[2025-03-22 15:10:16 TP0] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-03-22 15:10:16 TP5] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-03-22 15:10:16 TP3] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-03-22 15:10:16 TP1] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-03-22 15:10:16 TP7] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-03-22 15:10:16 TP4] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-03-22 15:10:16 TP6] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-03-22 15:10:16 TP2] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
Capturing batches (avail_mem=37.73 GB): 100%|███| 23/23 [00:38<00:00, 1.69s/it]
[2025-03-22 15:10:53 TP0] Registering 2829 cuda graph addresses
[2025-03-22 15:10:53 TP1] Registering 2829 cuda graph addresses
[2025-03-22 15:10:53 TP3] Registering 2829 cuda graph addresses
[2025-03-22 15:10:53 TP6] Registering 2829 cuda graph addresses
[2025-03-22 15:10:53 TP2] Registering 2829 cuda graph addresses
[2025-03-22 15:10:53 TP4] Registering 2829 cuda graph addresses
[2025-03-22 15:10:53 TP7] Registering 2829 cuda graph addresses
[2025-03-22 15:10:53 TP5] Registering 2829 cuda graph addresses
[2025-03-22 15:10:54 TP3] Capture cuda graph end. Time elapsed: 39.46 s. avail mem=37.58 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP1] Capture cuda graph end. Time elapsed: 39.46 s. avail mem=37.58 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP7] Capture cuda graph end. Time elapsed: 39.48 s. avail mem=38.05 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP5] Capture cuda graph end. Time elapsed: 39.50 s. avail mem=37.58 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP4] Capture cuda graph end. Time elapsed: 39.51 s. avail mem=37.58 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP0] Capture cuda graph end. Time elapsed: 39.51 s. avail mem=37.68 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP6] Capture cuda graph end. Time elapsed: 39.51 s. avail mem=37.58 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP2] Capture cuda graph end. Time elapsed: 39.51 s. avail mem=37.58 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP6] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54 TP0] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54 TP7] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54 TP1] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54 TP4] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54 TP3] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54 TP2] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54 TP5] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54] INFO: Started server process [71649]
[2025-03-22 15:10:54] INFO: Waiting for application startup.
[2025-03-22 15:10:54] INFO: Application startup complete.
[2025-03-22 15:10:54] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-03-22 15:10:55] INFO: 127.0.0.1:50206 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-03-22 15:10:55 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-03-22 15:10:59] INFO: 127.0.0.1:50220 - "POST /generate HTTP/1.1" 200 OK
[2025-03-22 15:10:59] The server is fired up and ready to roll!
References
- GitHub - deepseek-ai/DeepSeek-R1
- deepseek-ai/DeepSeek-R1
- Server Arguments — SGLang
- Setting Data Type from the CLI interface · Issue #325 · sgl-project/sglang
- Parameters for Fastest Token Generation · Issue #4616 · sgl-project/sglang
- A Simple Guide to DeepSeek R1: Architecture, Training, Local Deployment, and Hardware Requirements
- DeepSeek R1 On-Prem Setup: Run Advanced AI Models on Your Hardware with SGLang