First, install SGLang. It can be installed with the following commands:
pip install uv
pip install "sglang[all]>=0.4.3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python -i https://pypi.tuna.tsinghua.edu.cn/simple
The model weights need to be downloaded from deepseek-ai/DeepSeek-R1; the rest of this guide assumes they are placed under /mnt/models/DeepSeek-R1.
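One way to fetch the weights is the Hugging Face CLI. This is only a sketch and not part of the original setup; it assumes network access to huggingface.co (the HF_ENDPOINT environment variable can point to a mirror if direct access is slow):
pip install -U "huggingface_hub[cli]"
# export HF_ENDPOINT=https://hf-mirror.com   # optional mirror, adjust as needed
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir /mnt/models/DeepSeek-R1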
Initial Attempt and Problem
The server was first launched with the following command:
python3 -m sglang.launch_server \
--model /mnt/models/DeepSeek-R1 \
--tp 8 \
--dtype half \
--trust-remote-code \
--enable-torch-compile \
--disable-cuda-graph \
--mem-fraction-static 0.7 \
--served-model-name deepseek --api-key 123
But it failed with the following error:
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != c10::Half
This happens because DeepSeek-R1 uses mixed data types (e.g., BFloat16 and F8_E4M3), while --dtype half forces everything to float16, causing the dtype mismatch.
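To confirm the checkpoint's native dtypes, the model's config.json can be inspected. A quick sanity-check sketch, assuming the standard Hugging Face config layout of the official FP8 release (field names may differ):
# torch_dtype should report bfloat16, and the quantization_config block should show the fp8 (e4m3) format
grep -E '"torch_dtype"|"quant_method"|"fmt"' /mnt/models/DeepSeek-R1/config.json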
Solution
Changing --dtype to auto resolves the problem:
python3 -m sglang.launch_server \
--model /mnt/models/DeepSeek-R1 \
--tp 8 \
--dtype auto \
--trust-remote-code \
--enable-torch-compile \
--disable-cuda-graph \
--mem-fraction-static 0.7 \
--served-model-name deepseek --api-key 123
This lets the model load in its original data types, and SGLang starts successfully.
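Once the server is up, it can be smoke-tested through SGLang's OpenAI-compatible endpoint. A minimal sketch, assuming the default host and port (127.0.0.1:30000) and the --served-model-name and --api-key values used above:
curl -s http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 123" \
  -d '{"model": "deepseek", "messages": [{"role": "user", "content": "Hello, who are you?"}], "max_tokens": 64}'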
Performance Optimization
To speed up token generation, it is recommended to enable speculative decoding with settings such as:
--speculative-num-steps 3
--speculative-num-draft-tokens 4
--mem-fraction-static 0.7
--schedule-conservativeness 0.01
According to the issue, this can reach about 65 tokens per second at low concurrency (e.g., a single request).
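For a rough decode-speed check at concurrency 1, one can time a single long generation and divide the completion token count (from the usage field of the response) by the elapsed wall-clock time. This is only a back-of-the-envelope sketch that ignores prefill time, assuming the same endpoint, model name, and API key as above:
time curl -s http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 123" \
  -d '{"model": "deepseek", "messages": [{"role": "user", "content": "Explain speculative decoding in detail."}], "max_tokens": 512}' \
  | python3 -c 'import sys, json; print(json.load(sys.stdin)["usage"])'
# tokens/s is roughly completion_tokens divided by the elapsed seconds; issue #4616 reports about 65 tokens/s at this concurrency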
Reference: Performance Optimization Strategies
Based on issue #4616 and related discussions, the key parameters affecting token generation speed are summarized below (a combined launch sketch follows the table):
| Parameter | Description / impact | Recommended value / notes | Known issues |
|---|---|---|---|
| --speculative-algorithm | Enables Next-N speculative decoding, which builds a draft token tree and reduces the number of forward passes | Set to "NEXTN" | Performance drops at high concurrency (>32) |
| --speculative-num-steps | Number of speculative decoding steps; affects throughput | 3 recommended; highest throughput in testing | - |
| --speculative-num-draft-tokens | Number of draft tokens; affects performance | 4 recommended; best performance in testing | - |
| --speculative-eagle-topk | Top-k value for the EAGLE strategy; affects speculative decoding efficiency | 1 recommended | - |
| --speculative-draft | Draft model directory; the correct path must be specified | Set to `` | - |
| --mem-fraction-static | Fraction of GPU memory reserved statically; prevents OOM errors | 0.7 recommended | - |
| --schedule-conservativeness | Scheduler conservativeness; affects throughput | 0.01 recommended (aggressive, high-throughput setting) | - |
| CUDA Graph | Reduces kernel-launch overhead and synchronization latency, improving inference performance at the cost of extra memory | Enabled by default; disable with --disable-cuda-graph; tune with --cuda-graph-max-bs (e.g. 160, 512, 1024) | May trigger CUDA errors together with DP-Attention and dp-size > 4 |
| Torch Compile | Compiles the PyTorch model into an optimized execution graph for better GPU performance | Enable with --enable-torch-compile (disabled by default); capture takes about 1 hour for DeepSeek-R1 and about 5 minutes for Next-N | - |
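Putting the table's recommendations together, a launch command would look roughly like the sketch below. This is an unverified sketch: flag spellings follow the table above and may differ across SGLang versions, and <path-to-nextn-draft-model> is a placeholder that must point to an actual NextN/MTP draft checkpoint.
python3 -m sglang.launch_server \
--model /mnt/models/DeepSeek-R1 \
--tp 8 \
--trust-remote-code \
--speculative-algorithm NEXTN \
--speculative-draft <path-to-nextn-draft-model> \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.7 \
--schedule-conservativeness 0.01 \
--served-model-name deepseek --api-key 123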
Reference Jupyter script:
pip install uv
pip install "sglang[all]>=0.4.3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python -i https://pypi.tuna.tsinghua.edu.cn/simple
python3 -m sglang.launch_server \
--model /mnt/models/DeepSeek-R1 \
--tp 8 \
--trust-remote-code \
--enable-ep-moe \
--mem-fraction-static 0.7 \
--schedule-conservativeness 0.01 \
--served-model-name deepseek --api-key 123
Reference output:
2025-03-22 15:09:20.381293: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-03-22 15:09:20.384090: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-03-22 15:09:20.418164: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-03-22 15:09:20.418198: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-03-22 15:09:20.418236: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-22 15:09:20.425708: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-03-22 15:09:21.048490: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
INFO 03-22 15:09:22 __init__.py:190] Automatically detected platform cuda.
[2025-03-22 15:09:33] server_args=ServerArgs(model_path='/mnt/models/DeepSeek-R1', tokenizer_path='/mnt/models/DeepSeek-R1', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='deepseek', chat_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=30000, mem_fraction_static=0.7, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=0.01, cpu_offload_gb=0, page_size=1, tp_size=8, stream_interval=1, stream_output=False, random_seed=286257988, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key='123', file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=8, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=5, speculative_eagle_topk=4, speculative_num_draft_tokens=8, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=True, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, enable_flashinfer_mla=False, flashinfer_mla_disable_ragged=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False)
INFO 03-22 15:09:39 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:39 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:39 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:40 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:40 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:40 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:40 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:40 __init__.py:190] Automatically detected platform cuda.
INFO 03-22 15:09:40 __init__.py:190] Automatically detected platform cuda.
[2025-03-22 15:09:43 TP5] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:43 TP5] Init torch distributed begin.
[2025-03-22 15:09:45 TP0] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:45 TP0] Init torch distributed begin.
[2025-03-22 15:09:46 TP1] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:46 TP1] Init torch distributed begin.
[2025-03-22 15:09:46 TP3] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:46 TP3] Init torch distributed begin.
[2025-03-22 15:09:47 TP7] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:47 TP7] Init torch distributed begin.
[2025-03-22 15:09:47 TP6] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:47 TP6] Init torch distributed begin.
[2025-03-22 15:09:47 TP4] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:47 TP4] Init torch distributed begin.
[2025-03-22 15:09:47 TP2] MLA optimization is turned on. Use triton backend.
[2025-03-22 15:09:47 TP2] Init torch distributed begin.
[2025-03-22 15:09:48 TP0] sglang is using nccl==2.21.5
[2025-03-22 15:09:48 TP1] sglang is using nccl==2.21.5
[2025-03-22 15:09:48 TP2] sglang is using nccl==2.21.5
[2025-03-22 15:09:48 TP4] sglang is using nccl==2.21.5
[2025-03-22 15:09:48 TP3] sglang is using nccl==2.21.5
[2025-03-22 15:09:48 TP6] sglang is using nccl==2.21.5
[2025-03-22 15:09:48 TP5] sglang is using nccl==2.21.5
[2025-03-22 15:09:48 TP7] sglang is using nccl==2.21.5
[2025-03-22 15:09:52 TP7] Init torch distributed ends. mem usage=1.50 GB
[2025-03-22 15:09:52 TP2] Init torch distributed ends. mem usage=1.97 GB
[2025-03-22 15:09:52 TP4] Init torch distributed ends. mem usage=1.97 GB
[2025-03-22 15:09:52 TP1] Init torch distributed ends. mem usage=1.97 GB
[2025-03-22 15:09:52 TP5] Init torch distributed ends. mem usage=1.97 GB
[2025-03-22 15:09:52 TP7] Load weight begin. avail mem=137.80 GB
[2025-03-22 15:09:52 TP2] Load weight begin. avail mem=137.33 GB
[2025-03-22 15:09:52 TP3] Init torch distributed ends. mem usage=1.97 GB
[2025-03-22 15:09:52 TP4] Load weight begin. avail mem=137.33 GB
[2025-03-22 15:09:52 TP1] Load weight begin. avail mem=137.33 GB
[2025-03-22 15:09:52 TP5] Load weight begin. avail mem=137.33 GB
[2025-03-22 15:09:52 TP3] Load weight begin. avail mem=137.33 GB
[2025-03-22 15:09:52 TP0] Init torch distributed ends. mem usage=1.87 GB
[2025-03-22 15:09:52 TP0] Load weight begin. avail mem=137.42 GB
[2025-03-22 15:09:52 TP6] Init torch distributed ends. mem usage=1.97 GB
[2025-03-22 15:09:52 TP6] Load weight begin. avail mem=137.33 GB
[2025-03-22 15:09:52 TP2] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP4] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP1] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP7] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP0] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP6] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP3] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP5] The following error message 'operation scheduled before its operands' can be ignored.
[2025-03-22 15:09:52 TP4] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-03-22 15:09:52 TP7] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-03-22 15:09:52 TP3] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-03-22 15:09:52 TP2] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-03-22 15:09:52 TP6] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-03-22 15:09:52 TP0] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-03-22 15:09:52 TP1] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
[2025-03-22 15:09:52 TP5] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
Loading safetensors checkpoint shards: 0% Completed | 0/163 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 1% Completed | 1/163 [00:00<00:43, 3.70it/s]
Loading safetensors checkpoint shards: 2% Completed | 4/163 [00:00<00:22, 7.00it/s]
Loading safetensors checkpoint shards: 4% Completed | 6/163 [00:00<00:17, 9.10it/s]
Loading safetensors checkpoint shards: 5% Completed | 8/163 [00:01<00:22, 7.00it/s]
Loading safetensors checkpoint shards: 6% Completed | 9/163 [00:01<00:25, 6.08it/s]
Loading safetensors checkpoint shards: 7% Completed | 11/163 [00:01<00:23, 6.57it/s]
Loading safetensors checkpoint shards: 7% Completed | 12/163 [00:01<00:21, 6.96it/s]
Loading safetensors checkpoint shards: 8% Completed | 13/163 [00:01<00:24, 6.15it/s]
Loading safetensors checkpoint shards: 10% Completed | 16/163 [00:02<00:19, 7.50it/s]
Loading safetensors checkpoint shards: 11% Completed | 18/163 [00:02<00:19, 7.40it/s]
Loading safetensors checkpoint shards: 13% Completed | 22/163 [00:02<00:12, 11.41it/s]
Loading safetensors checkpoint shards: 15% Completed | 24/163 [00:02<00:13, 10.00it/s]
Loading safetensors checkpoint shards: 16% Completed | 26/163 [00:03<00:14, 9.37it/s]
Loading safetensors checkpoint shards: 17% Completed | 28/163 [00:03<00:15, 8.92it/s]
Loading safetensors checkpoint shards: 19% Completed | 31/163 [00:03<00:11, 11.61it/s]
Loading safetensors checkpoint shards: 20% Completed | 33/163 [00:04<00:18, 7.19it/s]
Loading safetensors checkpoint shards: 21% Completed | 35/163 [00:04<00:19, 6.44it/s]
Loading safetensors checkpoint shards: 24% Completed | 39/163 [00:04<00:12, 9.56it/s]
Loading safetensors checkpoint shards: 26% Completed | 42/163 [00:05<00:12, 9.75it/s]
Loading safetensors checkpoint shards: 27% Completed | 44/163 [00:05<00:10, 10.91it/s]
Loading safetensors checkpoint shards: 28% Completed | 46/163 [00:05<00:16, 7.22it/s]
Loading safetensors checkpoint shards: 31% Completed | 50/163 [00:05<00:10, 10.95it/s]
Loading safetensors checkpoint shards: 33% Completed | 54/163 [00:05<00:07, 13.97it/s]
Loading safetensors checkpoint shards: 36% Completed | 58/163 [00:06<00:07, 13.26it/s]
Loading safetensors checkpoint shards: 37% Completed | 60/163 [00:06<00:08, 11.68it/s]
Loading safetensors checkpoint shards: 38% Completed | 62/163 [00:06<00:10, 9.71it/s]
Loading safetensors checkpoint shards: 41% Completed | 67/163 [00:06<00:06, 14.87it/s]
Loading safetensors checkpoint shards: 43% Completed | 70/163 [00:07<00:09, 9.66it/s]
Loading safetensors checkpoint shards: 44% Completed | 72/163 [00:08<00:12, 7.41it/s]
Loading safetensors checkpoint shards: 45% Completed | 74/163 [00:08<00:12, 7.00it/s]
Loading safetensors checkpoint shards: 48% Completed | 79/163 [00:08<00:09, 8.88it/s]
Loading safetensors checkpoint shards: 50% Completed | 81/163 [00:09<00:12, 6.64it/s]
Loading safetensors checkpoint shards: 52% Completed | 84/163 [00:09<00:10, 7.38it/s]
Loading safetensors checkpoint shards: 53% Completed | 86/163 [00:09<00:09, 8.41it/s]
Loading safetensors checkpoint shards: 54% Completed | 88/163 [00:10<00:11, 6.79it/s]
Loading safetensors checkpoint shards: 55% Completed | 89/163 [00:10<00:10, 6.95it/s]
Loading safetensors checkpoint shards: 55% Completed | 90/163 [00:10<00:12, 5.91it/s]
Loading safetensors checkpoint shards: 56% Completed | 91/163 [00:11<00:14, 5.13it/s]
Loading safetensors checkpoint shards: 57% Completed | 93/163 [00:11<00:12, 5.53it/s]
Loading safetensors checkpoint shards: 58% Completed | 95/163 [00:11<00:10, 6.63it/s]
Loading safetensors checkpoint shards: 59% Completed | 96/163 [00:11<00:11, 5.64it/s]
Loading safetensors checkpoint shards: 61% Completed | 100/163 [00:11<00:06, 10.14it/s]
Loading safetensors checkpoint shards: 63% Completed | 103/163 [00:12<00:05, 10.43it/s]
Loading safetensors checkpoint shards: 64% Completed | 105/163 [00:12<00:08, 6.90it/s]
Loading safetensors checkpoint shards: 66% Completed | 107/163 [00:13<00:08, 6.58it/s]
Loading safetensors checkpoint shards: 69% Completed | 112/163 [00:13<00:04, 11.18it/s]
Loading safetensors checkpoint shards: 70% Completed | 114/163 [00:13<00:04, 10.48it/s]
Loading safetensors checkpoint shards: 71% Completed | 116/163 [00:13<00:05, 8.91it/s]
Loading safetensors checkpoint shards: 74% Completed | 120/163 [00:14<00:04, 9.64it/s]
Loading safetensors checkpoint shards: 75% Completed | 122/163 [00:14<00:06, 6.76it/s]
Loading safetensors checkpoint shards: 77% Completed | 125/163 [00:14<00:04, 8.68it/s]
Loading safetensors checkpoint shards: 78% Completed | 127/163 [00:15<00:03, 9.79it/s]
Loading safetensors checkpoint shards: 79% Completed | 129/163 [00:15<00:04, 6.98it/s]
Loading safetensors checkpoint shards: 82% Completed | 133/163 [00:15<00:03, 8.19it/s]
Loading safetensors checkpoint shards: 83% Completed | 135/163 [00:16<00:03, 7.59it/s]
Loading safetensors checkpoint shards: 83% Completed | 136/163 [00:16<00:04, 6.40it/s]
Loading safetensors checkpoint shards: 84% Completed | 137/163 [00:16<00:04, 5.72it/s]
Loading safetensors checkpoint shards: 85% Completed | 139/163 [00:17<00:04, 5.93it/s]
Loading safetensors checkpoint shards: 87% Completed | 142/163 [00:17<00:03, 6.18it/s]
Loading safetensors checkpoint shards: 89% Completed | 145/163 [00:17<00:02, 7.12it/s]
Loading safetensors checkpoint shards: 91% Completed | 148/163 [00:18<00:01, 7.55it/s]
Loading safetensors checkpoint shards: 93% Completed | 152/163 [00:18<00:01, 10.47it/s]
Loading safetensors checkpoint shards: 94% Completed | 154/163 [00:18<00:01, 8.80it/s]
Loading safetensors checkpoint shards: 96% Completed | 157/163 [00:19<00:00, 8.68it/s]
Loading safetensors checkpoint shards: 98% Completed | 159/163 [00:19<00:00, 7.81it/s]
Loading safetensors checkpoint shards: 99% Completed | 161/163 [00:19<00:00, 7.43it/s]
Loading safetensors checkpoint shards: 100% Completed | 163/163 [00:19<00:00, 8.85it/s]
Loading safetensors checkpoint shards: 100% Completed | 163/163 [00:19<00:00, 8.20it/s]
[2025-03-22 15:10:14 TP7] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=58.21 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP2] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=57.74 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP6] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=57.74 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP5] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=57.74 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP4] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=57.74 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP3] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=57.74 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP1] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=57.74 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP0] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=57.83 GB, mem usage=79.59 GB.
[2025-03-22 15:10:14 TP7] Memory pool end. avail mem=40.63 GB
[2025-03-22 15:10:14 TP3] Memory pool end. avail mem=40.16 GB
[2025-03-22 15:10:14 TP2] Memory pool end. avail mem=40.16 GB
[2025-03-22 15:10:14 TP4] Memory pool end. avail mem=40.16 GB
[2025-03-22 15:10:14 TP0] Memory pool end. avail mem=40.26 GB
[2025-03-22 15:10:14 TP1] Memory pool end. avail mem=40.16 GB
[2025-03-22 15:10:14 TP6] Memory pool end. avail mem=40.16 GB
[2025-03-22 15:10:14 TP5] Memory pool end. avail mem=40.16 GB
[2025-03-22 15:10:14 TP7] Capture cuda graph begin. This can take up to several minutes. avail mem=40.54 GB
[2025-03-22 15:10:14 TP2] Capture cuda graph begin. This can take up to several minutes. avail mem=40.07 GB
[2025-03-22 15:10:14 TP4] Capture cuda graph begin. This can take up to several minutes. avail mem=40.07 GB
[2025-03-22 15:10:14 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=40.16 GB
[2025-03-22 15:10:14 TP3] Capture cuda graph begin. This can take up to several minutes. avail mem=40.07 GB
Capturing batches (avail_mem=40.00 GB): 0%| | 0/23 [00:00<?, ?it/s][2025-03-22 15:10:14 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=40.07 GB
[2025-03-22 15:10:14 TP6] Capture cuda graph begin. This can take up to several minutes. avail mem=40.07 GB
[2025-03-22 15:10:14 TP5] Capture cuda graph begin. This can take up to several minutes. avail mem=40.07 GB
[2025-03-22 15:10:16 TP0] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-03-22 15:10:16 TP5] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-03-22 15:10:16 TP3] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-03-22 15:10:16 TP1] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-03-22 15:10:16 TP7] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-03-22 15:10:16 TP4] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-03-22 15:10:16 TP6] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-03-22 15:10:16 TP2] Using configuration from /opt/conda/lib/python3.11/site-packages/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/opt/conda/lib/python3.11/site-packages/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
Capturing batches (avail_mem=37.73 GB): 100%|███| 23/23 [00:38<00:00, 1.69s/it]
[2025-03-22 15:10:53 TP0] Registering 2829 cuda graph addresses
[2025-03-22 15:10:53 TP1] Registering 2829 cuda graph addresses
[2025-03-22 15:10:53 TP3] Registering 2829 cuda graph addresses
[2025-03-22 15:10:53 TP6] Registering 2829 cuda graph addresses
[2025-03-22 15:10:53 TP2] Registering 2829 cuda graph addresses
[2025-03-22 15:10:53 TP4] Registering 2829 cuda graph addresses
[2025-03-22 15:10:53 TP7] Registering 2829 cuda graph addresses
[2025-03-22 15:10:53 TP5] Registering 2829 cuda graph addresses
[2025-03-22 15:10:54 TP3] Capture cuda graph end. Time elapsed: 39.46 s. avail mem=37.58 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP1] Capture cuda graph end. Time elapsed: 39.46 s. avail mem=37.58 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP7] Capture cuda graph end. Time elapsed: 39.48 s. avail mem=38.05 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP5] Capture cuda graph end. Time elapsed: 39.50 s. avail mem=37.58 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP4] Capture cuda graph end. Time elapsed: 39.51 s. avail mem=37.58 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP0] Capture cuda graph end. Time elapsed: 39.51 s. avail mem=37.68 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP6] Capture cuda graph end. Time elapsed: 39.51 s. avail mem=37.58 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP2] Capture cuda graph end. Time elapsed: 39.51 s. avail mem=37.58 GB. mem usage=2.49 GB.
[2025-03-22 15:10:54 TP6] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54 TP0] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54 TP7] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54 TP1] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54 TP4] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54 TP3] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54 TP2] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54 TP5] max_total_num_tokens=248994, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-03-22 15:10:54] INFO: Started server process [71649]
[2025-03-22 15:10:54] INFO: Waiting for application startup.
[2025-03-22 15:10:54] INFO: Application startup complete.
[2025-03-22 15:10:54] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-03-22 15:10:55] INFO: 127.0.0.1:50206 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-03-22 15:10:55 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-03-22 15:10:59] INFO: 127.0.0.1:50220 - "POST /generate HTTP/1.1" 200 OK
[2025-03-22 15:10:59] The server is fired up and ready to roll!
References
- GitHub - deepseek-ai/DeepSeek-R1
- deepseek-ai/DeepSeek-R1
- Server Arguments — SGLang
- Setting Data Type from the CLI interface · Issue #325 · sgl-project/sglang
- Parameters for Fastest Token Generation · Issue #4616 · sgl-project/sglang
- A Simple Guide to DeepSeek R1: Architecture, Training, Local Deployment, and Hardware Requirements
- DeepSeek R1 On-Prem Setup: Run Advanced AI Models on Your Hardware with SGLang