1 Introduction
vLLM is a fast and easy-to-use library for LLM inference and serving, suitable for production environments. A single host often does not have enough GPU memory for large models, so distributed deployment across multiple hosts is required.
Distributed deployment documentation:
/en/latest/serving/distributed_serving.html
2 Installation method
⚠️ Note: Docker, the GPU container runtime (e.g. the NVIDIA Container Toolkit), and the GPU drivers must already be installed on each host beforehand.
CUDA Version: 12.4
vllm:v0.7.2
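Before pulling the image, it is worth verifying the environment. A minimal sketch (the CUDA base image tag below is only an example; any CUDA 12.x base image works):

# Check the NVIDIA driver and the CUDA version reported by the host
nvidia-smi

# Check that Docker can pass GPUs through to containers via the GPU runtime
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi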
2.1 Download the image
# Download the image; it is fairly large
docker pull vllm/vllm-openai:v0.7.2
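To confirm the pull succeeded, list the local images; the v0.7.2 tag should appear in the output:

docker images vllm/vllm-openai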
Download the distributed deployment script
/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh
run_cluster.sh file
#!/bin/bash

# Check for minimum number of required arguments
if [ $# -lt 4 ]; then
    echo "Usage: $0 docker_image head_node_address --head|--worker path_to_hf_home [additional_args...]"
    exit 1
fi

# Assign the first three arguments and shift them away
DOCKER_IMAGE="$1"
HEAD_NODE_ADDRESS="$2"
NODE_TYPE="$3" # Should be --head or --worker
PATH_TO_HF_HOME="$4"
shift 4

# Additional arguments are passed directly to the Docker command
ADDITIONAL_ARGS=("$@")

# Validate node type
if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi

# Define a function to cleanup on EXIT signal
cleanup() {
    docker stop node
    docker rm node
}
trap cleanup EXIT

# Command setup for head or worker node
RAY_START_CMD="ray start --block"
if [ "${NODE_TYPE}" == "--head" ]; then
    RAY_START_CMD+=" --head --port=6379"
else
    RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379"
fi

# Run the docker command with the user specified parameters and additional arguments
docker run \
    --entrypoint /bin/bash \
    --network host \
    --name node \
    --shm-size 10.24g \
    --gpus all \
    -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \
    "${ADDITIONAL_ARGS[@]}" \
    "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"
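In short, the script starts one container named node on each host and joins it to a Ray cluster. It blocks in the foreground because of ray start --block, and the cleanup trap removes the container when the script exits. If you ever need to tear a node down manually, the same two commands from the trap can be run by hand:

# Stop and remove the node container on a host (normally done automatically by the cleanup trap)
docker stop node
docker rm node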
2.2 Creating a container
The two hosts have the following IPs: head node host: 192.168.108.100; worker node host: 192.168.108.101.
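Before starting the containers, it is worth confirming that the two hosts can reach each other; the head node will listen on port 6379 for Ray (see run_cluster.sh). A minimal sketch, run from the worker host:

# Basic reachability check of the head host
ping -c 3 192.168.108.100

# Once the head container is up, check that the Ray port is reachable (requires netcat)
nc -zv 192.168.108.100 6379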
Run the distributed vLLM script on the head node
Example from the official documentation
# ip_of_head_node: the IP address of the host running the head node container
# /path/to/the/huggingface/home/in/this/node: host path mapped into the container as the Hugging Face cache
# ip_of_this_node: the IP address of the current host
# --head: marks this node as the head node
bash run_cluster.sh \
    vllm/vllm-openai \
    ip_of_head_node \
    --head \
    /path/to/the/huggingface/home/in/this/node \
    -e VLLM_HOST_IP=ip_of_this_node
Run on this machine:
# Run in the background and redirect output to a log file (the filename is just an example)
bash run_cluster.sh \
    vllm/vllm-openai:v0.7.2 \
    192.168.108.100 \
    --head \
    /home/vllm \
    -e VLLM_HOST_IP=192.168.108.100 \
    > head_node.log 2>&1 &
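Because the script is backgrounded here, check that the head container came up cleanly (the container name node comes from run_cluster.sh):

# Confirm the container is running
docker ps --filter name=node

# Follow the container output to make sure Ray started without errors
docker logs -f node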
Run the distributed vLLM script on the worker node
Example from the official documentation
# ip_of_head_node: the IP address of the host running the head node container
# /path/to/the/huggingface/home/in/this/node: host path mapped into the container as the Hugging Face cache
# ip_of_this_node: the IP address of the current host
# --worker: marks this node as a worker node
bash run_cluster.sh \
    vllm/vllm-openai \
    ip_of_head_node \
    --worker \
    /path/to/the/huggingface/home/in/this/node \
    -e VLLM_HOST_IP=ip_of_this_node
Run on this machine:
# Run in the background and redirect output to a log file (the filename is just an example)
bash run_cluster.sh \
    vllm/vllm-openai:v0.7.2 \
    192.168.108.100 \
    --worker \
    /home/vllm \
    -e VLLM_HOST_IP=192.168.108.101 \
    > worker_node.log 2>&1 &
View cluster information
# Enter the container
docker exec -it node /bin/bash

# View cluster information
ray status

# The output includes the number of GPUs, the CPU configuration, memory size, etc.
======== Autoscaler status: 2025-02-13 20:18:13.886242 ========
Node status
---------------------------------------------------------------
Active:
 1 node_89c804d654976b3c606850c461e8dc5c6366de5e0ccdb360fcaa1b1c
 1 node_4b794efd101bc393da41f0a45bd72eeb3fb78e8e507d72b5fdfb4c1b
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/128.0 CPU
 0.0/4.0 GPU
 0B/20 GiB memory
 0B/19.46GiB object_store_memory

Demands:
 (no resource demands)
3 Deploy the model
⚠️ This cluster has 4 GPUs in total.
Example from the official documentation
# Start the model service; set the parallelism parameters according to your setup
# /path/to/the/model/in/the/container: path to the model inside the container
# tensor-parallel-size: degree of tensor parallelism; each model layer is split and computed in parallel across GPUs
# pipeline-parallel-size: degree of pipeline parallelism; the model is split by layers and the layers run in parallel as a pipeline. Set this when a single node's GPU memory is insufficient.
vllm serve /path/to/the/model/in/the/container \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2
Run on this machine:
Place the downloaded Qwen2.5-7B-Instruct model in the /home/vllm directory (which run_cluster.sh mounts into the container at /root/.cache/huggingface).
# Enter the container; either the head node or a worker node works
docker exec -it node /bin/bash

# Start the model service in the background (the log filename is just an example)
nohup vllm serve /root/.cache/huggingface/Qwen2.5-7B-Instruct \
    --served-model-name qwen2.5-7b \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2 \
    > vllm_serve.log 2>&1 &
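Note that --tensor-parallel-size multiplied by --pipeline-parallel-size should equal the total number of GPUs to be used across the Ray cluster: here 2 × 2 = 4, matching the 4 GPUs reported by ray status across the two hosts. The official example above (8 × 2) correspondingly assumes 16 GPUs.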
Call the API from the host
curl http://localhost:8000/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen2.5-7b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Introduce China, no less than 10,000 words"}
        ],
        "stream": true
    }'
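A quicker way to confirm that the service is up is to list the registered models via the standard OpenAI-compatible endpoint that vLLM exposes; the name set by --served-model-name should appear in the response:

curl http://localhost:8000/v1/models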
This concludes the walkthrough of deploying distributed vLLM with Docker. For more related content, see my previous articles or the related articles below. Thanks for your support!