
Deep Q-Network (DQN) Reinforcement Learning: Principles and Practice

DQN (Deep Q-Network) is an algorithm that combines deep learning and reinforcement learning, proposed by DeepMind to solve Markov decision process (MDP) problems with discrete action spaces. It was one of the first algorithms to successfully apply deep learning to reinforcement learning tasks. In short, DQN is the Q-learning algorithm with a deep neural network as its function approximator.

1. Reinforcement Learning Basics

Reinforcement learning is an important branch of machine learning whose core idea is to learn an optimal policy through interaction with the environment. Unlike supervised learning, reinforcement learning does not require pre-prepared input-output pairs; instead, it learns through trial and error, guided by reward signals.

1.1 Core Concepts

• Agent: the learner and decision maker
• Environment: everything the agent interacts with
• State: the current situation of the environment
• Action: the behavior chosen by the agent
• Reward: the environment's feedback on an action
• Policy: the mapping from states to actions
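
These concepts come together in the agent-environment interaction loop. Below is a minimal sketch of that loop, assuming the classic Gym API and the CartPole environment used later in this article, with a random policy standing in for the agent.

import gym

env = gym.make('CartPole-v1')
state = env.reset()          # initial state from the environment
done = False
total_reward = 0

while not done:
    action = env.action_space.sample()                  # random action as a placeholder policy
    next_state, reward, done, info = env.step(action)   # environment returns feedback
    total_reward += reward                               # accumulate the reward signal
    state = next_state                                   # move to the next state

print('Episode return:', total_reward)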

1.2 Markov Decision Process

Reinforcement learning problems are usually modeled as a Markov decision process (MDP), defined by the five-tuple (S, A, P, R, γ):

• S: set of states
• A: set of actions
• P: state transition probabilities
• R: reward function
• γ: discount factor (0 ≤ γ < 1)
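
To make the five-tuple concrete, the sketch below writes out a tiny, purely hypothetical two-state MDP in plain Python; the state names, actions, probabilities, and rewards are invented for illustration only.

# A toy 2-state, 2-action MDP (hypothetical example)
states = ['s0', 's1']
actions = ['stay', 'move']
gamma = 0.9  # discount factor

# P[s][a] = list of (probability, next_state); probabilities for each (s, a) sum to 1
P = {
    's0': {'stay': [(1.0, 's0')], 'move': [(0.8, 's1'), (0.2, 's0')]},
    's1': {'stay': [(1.0, 's1')], 'move': [(1.0, 's0')]},
}

# R[s][a] = immediate reward for taking action a in state s
R = {
    's0': {'stay': 0.0, 'move': 1.0},
    's1': {'stay': 0.5, 'move': 0.0},
}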

2. Q-Learning and Deep Q-Networks

2.1 The Q-Learning Algorithm

Q-learning is a classic reinforcement learning algorithm that estimates the long-term return of taking an action in a given state by maintaining a table of Q values, updated with the rule Q(s, a) ← Q(s, a) + α[r + γ·max_a' Q(s', a') − Q(s, a)]:

import numpy as np

# Assumes env, select_action, state_space_size, action_space_size and total_episodes are defined elsewhere

# Initialize the Q table
q_table = np.zeros((state_space_size, action_space_size))

# Q-learning hyperparameters
alpha = 0.1   # Learning rate
gamma = 0.99  # Discount factor

for episode in range(total_episodes):
    state = env.reset()
    done = False

    while not done:
        action = select_action(state)  # ε-greedy strategy
        next_state, reward, done, _ = env.step(action)

        # Q value update
        q_table[state, action] = q_table[state, action] + alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state

2.2 Deep Q-Network (DQN)

When the state space is large, a Q table becomes impractical. DQN instead uses a neural network to approximate the Q function:

import torch
import torch.nn as nn
import torch.optim as optim

class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, output_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

3. DQN Improvement Techniques

3.1 Experience Replay

Experience replay addresses sample correlation and non-stationary data distribution by storing transitions and sampling them randomly for training:

from collections import deque
import random

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

3.2 Target Network

A separate target network stabilizes training by keeping the Q-value targets fixed between periodic updates:

target_net = DQN(input_dim, output_dim).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

# Regularly update the target network
if steps_done % TARGET_UPDATE == 0:
    target_net.load_state_dict(policy_net.state_dict())

4. Complete DQN implementation (CartPole environment)

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import random
import math
from collections import deque
import matplotlib.pyplot as plt

# Hyperparameters
BATCH_SIZE = 128
GAMMA = 0.99
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200
TARGET_UPDATE = 10
LR = 0.001

# Initialize the environment
env = gym.make('CartPole-v1')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

# Neural network definition
class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, output_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Initialize the networks, optimizer, and replay buffer (ReplayBuffer as defined in section 3.1)
policy_net = DQN(state_dim, action_dim).to(device)
target_net = DQN(state_dim, action_dim).to(device)
target_net.load_state_dict(policy_net.state_dict())
optimizer = optim.Adam(policy_net.parameters(), lr=LR)
memory = ReplayBuffer(10000)

# Training step: sample a batch and update the policy network
def train():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    batch = list(zip(*transitions))

    state_batch = torch.FloatTensor(np.array(batch[0])).to(device)
    action_batch = torch.LongTensor(batch[1]).to(device)
    reward_batch = torch.FloatTensor(batch[2]).to(device)
    next_state_batch = torch.FloatTensor(np.array(batch[3])).to(device)
    done_batch = torch.FloatTensor(batch[4]).to(device)

    # Q(s, a) for the actions actually taken
    current_q = policy_net(state_batch).gather(1, action_batch.unsqueeze(1))
    # max_a' Q_target(s', a'), detached so gradients do not flow into the target network
    next_q = target_net(next_state_batch).max(1)[0].detach()
    # TD target: r + γ * max_a' Q_target(s', a') for non-terminal transitions
    expected_q = reward_batch + (1 - done_batch) * GAMMA * next_q

    loss = nn.MSELoss()(current_q.squeeze(), expected_q)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Main training loop
episode_rewards = []
for episode in range(500):
    state = env.reset()
    total_reward = 0
    done = False

    while not done:
        # ε-greedy action selection with exponentially decaying exploration rate
        eps_threshold = EPS_END + (EPS_START - EPS_END) * \
            math.exp(-1. * episode / EPS_DECAY)
        if random.random() > eps_threshold:
            with torch.no_grad():
                state_tensor = torch.FloatTensor(state).to(device)
                action = policy_net(state_tensor).argmax().item()
        else:
            action = random.randint(0, action_dim - 1)

        next_state, reward, done, _ = env.step(action)
        memory.push(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward

        train()

    # Periodically sync the target network with the policy network
    if episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())

    episode_rewards.append(total_reward)
    if episode % 10 == 0:
        print(f"Episode {episode}, Total Reward: {total_reward}")

# Plot the training curve
plt.plot(episode_rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('DQN Training Progress')
plt.show()

5. Limitations and Development of DQN

  1. Overestimation: Double DQN addresses this by decoupling action selection from Q-value evaluation (see the sketch after this list)
  2. Prioritized experience replay: gives important transitions a higher sampling probability
  3. Dueling network architecture: Dueling DQN separates the value function and the advantage function
  4. Distributional reinforcement learning: learns the distribution of returns rather than only their expected value
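
To make the first improvement concrete, here is a minimal sketch of how the Double DQN target differs from the standard DQN target. It reuses policy_net, target_net, GAMMA, and the batch tensors from the training function above and is meant as an illustration, not a drop-in replacement.

# Standard DQN target: the target network both selects and evaluates the greedy action
next_q_dqn = target_net(next_state_batch).max(1)[0].detach()

# Double DQN target: the policy network selects the action,
# while the target network evaluates it, reducing overestimation
next_actions = policy_net(next_state_batch).argmax(1, keepdim=True)
next_q_double = target_net(next_state_batch).gather(1, next_actions).squeeze(1).detach()

# The target is then formed the same way as before
expected_q = reward_batch + (1 - done_batch) * GAMMA * next_q_double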

6. Summary

Deep Q-learning combines deep neural networks with reinforcement learning to overcome the limitations of traditional Q-learning in high-dimensional state spaces. Through techniques such as experience replay and target networks, DQN can learn effective policies in complex environments. This article demonstrated the core ideas and implementation details of DQN through a complete implementation in the CartPole environment. Combined with the improvements above and stronger network architectures, deep reinforcement learning will play an even greater role in fields such as robot control and game AI.
