1 Introduction
vLLM is a fast and easy-to-use library for LLM inference and serving, suitable for production environments. A single host often does not have enough GPU memory for large models, so distributed deployment across multiple hosts is required.
Distributed deployment documentation:
/en/latest/serving/distributed_serving.html
2 Installation method
⚠️ Note: Docker, the GPU container runtime (e.g. the NVIDIA Container Toolkit), and the GPU drivers must already be installed on each host beforehand.
CUDA Version: 12.4
vllm:v0.7.2
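Before pulling the image, it is worth verifying the environment. A minimal sketch (the CUDA base image tag below is only an example; any CUDA 12.x base image works):

# Check the NVIDIA driver and the CUDA version reported by the host
nvidia-smi

# Check that Docker can pass GPUs through to containers via the GPU runtime
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi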
2.1 Download the image
# Download the image; it is fairly large
docker pull vllm/vllm-openai:v0.7.2
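To confirm the pull succeeded, list the local images; the v0.7.2 tag should appear in the output:

docker images vllm/vllm-openai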
Download the distributed deployment script
/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh
run_cluster.sh file
#!/bin/bash

# Check for minimum number of required arguments
if [ $# -lt 4 ]; then
    echo "Usage: $0 docker_image head_node_address --head|--worker path_to_hf_home [additional_args...]"
    exit 1
fi

# Assign the first three arguments and shift them away
DOCKER_IMAGE="$1"
HEAD_NODE_ADDRESS="$2"
NODE_TYPE="$3" # Should be --head or --worker
PATH_TO_HF_HOME="$4"
shift 4

# Additional arguments are passed directly to the Docker command
ADDITIONAL_ARGS=("$@")

# Validate node type
if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi

# Define a function to cleanup on EXIT signal
cleanup() {
    docker stop node
    docker rm node
}
trap cleanup EXIT

# Command setup for head or worker node
RAY_START_CMD="ray start --block"
if [ "${NODE_TYPE}" == "--head" ]; then
    RAY_START_CMD+=" --head --port=6379"
else
    RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379"
fi

# Run the docker command with the user specified parameters and additional arguments
docker run \
    --entrypoint /bin/bash \
    --network host \
    --name node \
    --shm-size 10.24g \
    --gpus all \
    -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \
    "${ADDITIONAL_ARGS[@]}" \
    "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"
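In short, the script starts one container named node on each host and joins it to a Ray cluster. It blocks in the foreground because of ray start --block, and the cleanup trap removes the container when the script exits. If you ever need to tear a node down manually, the same two commands from the trap can be run by hand:

# Stop and remove the node container on a host (normally done automatically by the cleanup trap)
docker stop node
docker rm node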
2.2 Creating a container
The two hosts have the following IPs: head node host: 192.168.108.100; worker node host: 192.168.108.101.
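Before starting the containers, it is worth confirming that the two hosts can reach each other; the head node will listen on port 6379 for Ray (see run_cluster.sh). A minimal sketch, run from the worker host:

# Basic reachability check of the head host
ping -c 3 192.168.108.100

# Once the head container is up, check that the Ray port is reachable (requires netcat)
nc -zv 192.168.108.100 6379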
Run the distributed vLLM script on the head node
Example from the official documentation
# ip_of_head_node: the IP address of the host running the head node container
# /path/to/the/huggingface/home/in/this/node: host path mapped into the container as the Hugging Face cache
# ip_of_this_node: the IP address of the current host
# --head: marks this node as the head node
bash run_cluster.sh \
    vllm/vllm-openai \
    ip_of_head_node \
    --head \
    /path/to/the/huggingface/home/in/this/node \
    -e VLLM_HOST_IP=ip_of_this_node
Run on this machine:
# Run in the background and redirect output to a log file (the filename is just an example)
bash run_cluster.sh \
    vllm/vllm-openai:v0.7.2 \
    192.168.108.100 \
    --head \
    /home/vllm \
    -e VLLM_HOST_IP=192.168.108.100 \
    > head_node.log 2>&1 &
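Because the script is backgrounded here, check that the head container came up cleanly (the container name node comes from run_cluster.sh):

# Confirm the container is running
docker ps --filter name=node

# Follow the container output to make sure Ray started without errors
docker logs -f node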
Run the distributed vLLM script on the worker node
Example from the official documentation
# ip_of_head_node: the IP address of the host running the head node container
# /path/to/the/huggingface/home/in/this/node: host path mapped into the container as the Hugging Face cache
# ip_of_this_node: the IP address of the current host
# --worker: marks this node as a worker node
bash run_cluster.sh \
    vllm/vllm-openai \
    ip_of_head_node \
    --worker \
    /path/to/the/huggingface/home/in/this/node \
    -e VLLM_HOST_IP=ip_of_this_node
Run on this machine:
# Run in the background and redirect output to a log file (the filename is just an example)
bash run_cluster.sh \
    vllm/vllm-openai:v0.7.2 \
    192.168.108.100 \
    --worker \
    /home/vllm \
    -e VLLM_HOST_IP=192.168.108.101 \
    > worker_node.log 2>&1 &
View cluster information
# Enter the container
docker exec -it node /bin/bash

# View cluster information
ray status

# The output includes the number of GPUs, the CPU configuration, memory size, etc.
======== Autoscaler status: 2025-02-13 20:18:13.886242 ========
Node status
---------------------------------------------------------------
Active:
 1 node_89c804d654976b3c606850c461e8dc5c6366de5e0ccdb360fcaa1b1c
 1 node_4b794efd101bc393da41f0a45bd72eeb3fb78e8e507d72b5fdfb4c1b
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/128.0 CPU
 0.0/4.0 GPU
 0B/20 GiB memory
 0B/19.46GiB object_store_memory

Demands:
 (no resource demands)
3 Deploy the model
⚠️ This cluster has 4 GPUs in total.
Example from the official documentation
# Start the model service; set the parallelism parameters according to your setup
# /path/to/the/model/in/the/container: path to the model inside the container
# tensor-parallel-size: degree of tensor parallelism; each model layer is split and computed in parallel across GPUs
# pipeline-parallel-size: degree of pipeline parallelism; the model is split by layers and the layers run in parallel as a pipeline. Set this when a single node's GPU memory is insufficient.
vllm serve /path/to/the/model/in/the/container \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2
Run on this machine:
Place the downloaded Qwen2.5-7B-Instruct model in the /home/vllm directory (which run_cluster.sh mounts into the container at /root/.cache/huggingface).
# Enter the container; either the head node or a worker node works
docker exec -it node /bin/bash

# Start the model service in the background (the log filename is just an example)
nohup vllm serve /root/.cache/huggingface/Qwen2.5-7B-Instruct \
    --served-model-name qwen2.5-7b \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2 \
    > vllm_serve.log 2>&1 &
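Note that --tensor-parallel-size multiplied by --pipeline-parallel-size should equal the total number of GPUs to be used across the Ray cluster: here 2 × 2 = 4, matching the 4 GPUs reported by ray status across the two hosts. The official example above (8 × 2) correspondingly assumes 16 GPUs.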
Call the API from the host
curl http://localhost:8000/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen2.5-7b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Introduce China, no less than 10,000 words"}
        ],
        "stream": true
    }'
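A quicker way to confirm that the service is up is to list the registered models via the standard OpenAI-compatible endpoint that vLLM exposes; the name set by --served-model-name should appear in the response:

curl http://localhost:8000/v1/models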
This concludes the walkthrough of deploying distributed vLLM with Docker. For more related content, see my previous articles or the related articles below. Thanks for your support!