1. Introduction
The DeepSeek model is a powerful language model. Private local deployment lets you run the model safely and efficiently in your own environment, avoiding the security risks of transmitting data to external services, and allows you to customize the configuration to your own needs. This tutorial explains in detail how to deploy the DeepSeek model locally.
2. Environment preparation
(I) Hardware requirements
- CPU: Multi-core processors such as the Intel Xeon or AMD EPYC series are recommended to provide sufficient computing power; at least 4 cores are required.
- GPU: For efficient inference, an NVIDIA GPU such as the NVIDIA GeForce RTX 30 series or NVIDIA A100 is recommended. The more GPU memory the better; at least 8GB of VRAM is required.
- Memory: At least 16GB of system memory; for larger model deployments, 32GB or more is recommended.
- Storage: Prepare enough disk space to store the model files and related data. Depending on the model version, this may require tens to hundreds of GB of storage space. A quick way to check a Linux host against these requirements is sketched below.
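The following is a minimal sketch (Linux only, since it reads /proc/meminfo) for checking CPU core count, total RAM, and free disk space against the requirements above; adjust the paths for your own setup.

import os
import shutil

# Number of CPU cores
cpu_cores = os.cpu_count()

# Total RAM in GB (Linux-specific: parsed from /proc/meminfo)
mem_total_gb = None
with open("/proc/meminfo") as f:
    for line in f:
        if line.startswith("MemTotal"):
            mem_total_gb = int(line.split()[1]) / 1024**2  # kB -> GB
            break

# Free disk space on the root filesystem in GB
disk_free_gb = shutil.disk_usage("/").free / 1024**3

print(f"CPU cores: {cpu_cores}")
print(f"Total RAM: {mem_total_gb:.1f} GB")
print(f"Free disk space: {disk_free_gb:.1f} GB")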
(II) Software requirements
- Operating system: A Linux system such as Ubuntu 20.04 or higher is recommended; Windows 10 or higher can also be used, but Linux has advantages in performance and compatibility.
- Python: Install Python 3.8 or later. You can download and install it from the official Python website (https://www.python.org/downloads/).
- CUDA: If you use an NVIDIA GPU, you need to install the CUDA toolkit. Choose the appropriate version for your GPU model and system; it can be downloaded from the official NVIDIA website (https://developer.nvidia.com/cuda-downloads).
- cuDNN: cuDNN is a deep neural network library provided by NVIDIA that accelerates deep learning computation. Install the cuDNN version that matches your CUDA version; it can be downloaded from the NVIDIA developer website (https://developer.nvidia.com/cudnn).
(III) Create a virtual environment
To avoid dependency conflicts between different projects, it is recommended to use a virtual environment. Run the following command from the command line to create and activate the virtual environment:
# Create a virtual environment
python -m venv deepseek_env

# Activate the virtual environment (Linux/Mac)
source deepseek_env/bin/activate

# Activate the virtual environment (Windows)
deepseek_env\Scripts\activate
3. Install the dependency library
In the activated virtual environment, install the necessary Python dependencies, mainly including PyTorch, Transformers, etc.:
# Install PyTorch; choose the appropriate installation command based on your CUDA version
# If using CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# If you do not use a GPU
pip install torch torchvision torchaudio

# Install the Transformers library
pip install transformers

# Install other libraries you may need
pip install sentencepiece accelerate
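After installation, a quick sanity check helps confirm that PyTorch, Transformers, CUDA, and cuDNN are all visible; the snippet below is a minimal sketch and assumes the GPU setup described in the previous section.

# Verify the installation (run inside the activated virtual environment)
import torch
import transformers

print("PyTorch version:", torch.__version__)
print("Transformers version:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("cuDNN version:", torch.backends.cudnn.version())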
4. Obtain the DeepSeek model
(I) Download the model file
The DeepSeek model can be downloaded from the Hugging Face model hub (https://huggingface.co/deepseek-ai). Choose the model version that fits your needs, such as deepseek-llm-7b or deepseek-llm-67b. You can use the following Python code to download the model:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "deepseek-ai/deepseek-llm-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Save the model and tokenizer locally
model.save_pretrained("./local_deepseek_model")
tokenizer.save_pretrained("./local_deepseek_model")
Alternatively, use the git lfs command to download the model directly from the Hugging Face repository:

git lfs install
git clone https://huggingface.co/deepseek-ai/deepseek-llm-7b
(II) Model file structure
After the download is complete, the model directory usually contains the following main parts:
- config.json: the model configuration file, containing the model's architecture, parameters, and other information.
- pytorch_model.bin: the model weight file, which stores all of the model's parameters.
- tokenizer.json, tokenizer_config.json, etc.: tokenizer-related files, used to convert text into the input format the model can handle.
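To confirm that everything was saved, the small sketch below simply lists the files in the local model directory (it assumes the ./local_deepseek_model path used earlier in this tutorial).

import os

model_dir = "./local_deepseek_model"
for name in sorted(os.listdir(model_dir)):
    size_mb = os.path.getsize(os.path.join(model_dir, name)) / 1024**2
    print(f"{name}: {size_mb:.1f} MB")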
5. Model inference test
After deploying the model locally, you can run a simple inference test to verify that the model works properly. Here is sample code for running inference in Python:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the local model and tokenizer
model_path = "./local_deepseek_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Input text
input_text = "How is the weather today?"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate output
output = model.generate(input_ids, max_length=100, num_return_sequences=1)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

print("Input:", input_text)
print("Output:", output_text)
6. Deploy the model as an API
(I) Use FastAPI to build an inference API
FastAPI is a fast, high-performance Python web framework that is well suited to building APIs around machine learning models. Here is sample code that uses FastAPI to build an inference API for the DeepSeek model:
from fastapi import FastAPI, Body
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

app = FastAPI()

# Load the local model and tokenizer
model_path = "./local_deepseek_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
if torch.cuda.is_available():
    model = model.cuda()

@app.post("/generate")
async def generate_text(input_text: str = Body(..., embed=True)):
    # Tokenize the input and move it to the GPU if available
    input_ids = tokenizer.encode(input_text, return_tensors="pt")
    if torch.cuda.is_available():
        input_ids = input_ids.cuda()
    output = model.generate(input_ids, max_length=100, num_return_sequences=1)
    output_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return {"input": input_text, "output": output_text}
(II) Run the API service
Save the above code as main.py, then run the following command on the command line to start the API service:
uvicorn main:app --host 0.0.0.0 --port 8000
Here, --host 0.0.0.0 means the service can be accessed from any IP address, and --port 8000 sets the port that the service listens on to 8000.
(III) Test API
You can use tools such as the curl command or Postman to test the API. Here is an example using curl:
curl -X POST "http://localhost:8000/generate" -H "Content-Type: application/json" -d '{"input_text": "How is the weather today?"}'
If everything works properly, you will receive a JSON response containing the input text and the model's generated output.
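If you prefer to test from Python instead of curl, the following is a minimal sketch using the requests package (an extra dependency assumed here), pointed at the same local service.

import requests

# Call the local inference API (assumes the service is running on port 8000)
resp = requests.post(
    "http://localhost:8000/generate",
    json={"input_text": "How is the weather today?"},
)
resp.raise_for_status()
print(resp.json())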
7. Performance optimization
(I) Model quantization
Quantization is a technique that converts model parameters from high precision (such as 32-bit floating point) to lower precision (such as 8-bit integers), which can significantly reduce the model's memory footprint and inference time. The DeepSeek model can be quantized, for example, with Hugging Face Optimum and ONNX Runtime:
from optimum.onnxruntime import ORTModelForCausalLM, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_path = "./local_deepseek_model"

# Export the model to ONNX format
model = ORTModelForCausalLM.from_pretrained(model_path, export=True)

# Quantization configuration
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
quantizer = ORTQuantizer.from_pretrained(model)

# Quantize the model
quantized_model_path = "./local_deepseek_model_quantized"
quantizer.quantize(save_dir=quantized_model_path, quantization_config=qconfig)
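Once quantization finishes, the quantized model can be loaded back through Optimum for inference. This is a hedged sketch: it assumes the quantized weights were written under Optimum's default file name model_quantized.onnx, so adjust the file name if your output differs.

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# Load the quantized ONNX model and the original tokenizer
quantized_model_path = "./local_deepseek_model_quantized"
tokenizer = AutoTokenizer.from_pretrained("./local_deepseek_model")
model = ORTModelForCausalLM.from_pretrained(
    quantized_model_path, file_name="model_quantized.onnx"
)

inputs = tokenizer("How is the weather today?", return_tensors="pt")
output = model.generate(**inputs, max_length=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))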
(II) Distributed inference
If you have multiple GPUs or multiple machines, distributed inference can be used to speed up the model's inference process. PyTorch's torch.distributed module provides distributed training and inference functionality. Here is a simple distributed inference example:
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from transformers import AutoTokenizer, AutoModelForCausalLM

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    # Initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def inference(rank, world_size):
    setup(rank, world_size)
    model_path = "./local_deepseek_model"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    model = model.to(rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])

    # Input text
    input_text = "How is the weather today?"
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(rank)

    # Generate output (call generate on the wrapped module, since DDP only wraps forward)
    output = model.module.generate(input_ids, max_length=100, num_return_sequences=1)
    output_text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Rank {rank}: Input: {input_text}, Output: {output_text}")
    cleanup()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(inference, args=(world_size,), nprocs=world_size, join=True)
8. Security and management
(I) Data security
In a local private deployment, data security must be ensured. Apply strict access control and encryption to input and output data. The HTTPS protocol can be used to protect the API's communication and prevent data from being intercepted in transit.
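As one concrete option, uvicorn can serve the FastAPI app over HTTPS directly. The sketch below uses uvicorn's programmatic API; cert.pem and key.pem are placeholder paths that you should replace with your own certificate and private key (the uvicorn CLI flags --ssl-certfile and --ssl-keyfile work the same way).

import uvicorn

# Serve the FastAPI app from main.py over HTTPS
# "cert.pem" and "key.pem" are placeholders for your own certificate files
if __name__ == "__main__":
    uvicorn.run(
        "main:app",
        host="0.0.0.0",
        port=8443,
        ssl_certfile="cert.pem",
        ssl_keyfile="key.pem",
    )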
(II) Model update and maintenance
Check for official updates to the DeepSeek model regularly, and download and update the local model promptly to get better performance and functionality. At the same time, monitor the model's running status so that potential problems can be detected and handled promptly.
(III) Resource Management
Manage server resources sensibly to avoid system crashes caused by excessive resource usage. You can use monitoring tools (such as Prometheus and Grafana) to track the server's CPU, memory, GPU, and other resource usage, and adjust the deployment based on the monitoring results.
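Before a full Prometheus/Grafana setup, a very lightweight starting point is to poll GPU memory from PyTorch; the sketch below simply prints usage every 10 seconds and assumes a CUDA-capable GPU and a reasonably recent PyTorch version.

import time
import torch

# Print used/total GPU memory every 10 seconds
while torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    used_gb = (total_bytes - free_bytes) / 1024**3
    total_gb = total_bytes / 1024**3
    print(f"GPU memory used: {used_gb:.1f} / {total_gb:.1f} GB")
    time.sleep(10)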
9. Summary
Through the above steps, you can complete a private local deployment of the DeepSeek model and serve inference through an API. During deployment, pay attention to environment preparation, model acquisition, performance optimization, and security management. Hopefully this tutorial helps you deploy and use the DeepSeek model successfully.
The above code and steps are only examples, and adjustments may need to be made according to the specific situation during the actual deployment process. At the same time, make sure you comply with relevant laws, regulations and terms of use of the model.