Introduction
In high-concurrency systems, Redis, as the core caching component, usually plays the role of "goalkeeper", effectively shielding the backend database from traffic shocks. However, when a large number of cache entries fail at the same time, requests flow straight to the database like a flood, causing database pressure to spike sharply and potentially crashing it. This phenomenon is vividly called a "cache avalanche".
There are two main triggers for a cache avalanche: a large number of cache entries expiring at the same time, or the Redis server going down. In either case, the consequence is that requests penetrate the cache layer and hit the database directly, putting the system at risk of collapse. For high-concurrency systems that rely on caching, a cache avalanche not only causes response delays but may also trigger a chain reaction that brings down the entire system.
1. Cache expiration time randomization strategy
Principle
The most common cause of a cache avalanche is a large batch of cache entries expiring at the same moment. Setting a randomized expiration time for each entry spreads the expirations across different points in time, effectively avoiding this concentrated failure.
Implementation method
The core idea is to add a random value to the base expiration time, so that even entries cached in the same batch expire at different points in time.
```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.data.redis.core.RedisTemplate;

public class RandomExpiryTimeCache {

    private static final Logger log = LoggerFactory.getLogger(RandomExpiryTimeCache.class);

    private final RedisTemplate<String, Object> redisTemplate;
    private final Random random = new Random();

    public RandomExpiryTimeCache(RedisTemplate<String, Object> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    /**
     * Set a cache value with a randomized expiration time.
     * @param key                cache key
     * @param value              cache value
     * @param baseTimeSeconds    base expiry time (seconds)
     * @param randomRangeSeconds random time range (seconds)
     */
    public void setWithRandomExpiry(String key, Object value, long baseTimeSeconds, long randomRangeSeconds) {
        // Generate a random increment
        long randomSeconds = random.nextInt((int) randomRangeSeconds);
        // Calculate the final expiration time
        long finalExpiry = baseTimeSeconds + randomSeconds;
        redisTemplate.opsForValue().set(key, value, finalExpiry, TimeUnit.SECONDS);
        log.debug("Set cache key: {} with expiry time: {}", key, finalExpiry);
    }

    /**
     * Set a batch of caches, each with its own random expiration time.
     */
    public void setBatchWithRandomExpiry(Map<String, Object> keyValueMap, long baseTimeSeconds, long randomRangeSeconds) {
        keyValueMap.forEach((key, value) -> setWithRandomExpiry(key, value, baseTimeSeconds, randomRangeSeconds));
    }
}
```
Practical application examples
```java
import java.util.List;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class ProductCacheService {

    @Autowired
    private RandomExpiryTimeCache randomCache;
    @Autowired
    private RedisTemplate<String, Object> redisTemplate;
    @Autowired
    private ProductRepository productRepository;

    /**
     * Get product details, cached with a randomized expiration time.
     */
    public Product getProductDetail(String productId) {
        String cacheKey = "product:detail:" + productId;
        Product product = (Product) redisTemplate.opsForValue().get(cacheKey);
        if (product == null) {
            // Cache miss: load from the database
            product = productRepository.findById(productId).orElse(null);
            if (product != null) {
                // Cache with a base expiry of 30 minutes and a random range of 10 minutes
                randomCache.setWithRandomExpiry(cacheKey, product, 30 * 60, 10 * 60);
            }
        }
        return product;
    }

    /**
     * Cache the home page product list with a randomized expiration time.
     */
    public void cacheHomePageProducts(List<Product> products) {
        String cacheKey = "products:homepage";
        // Base expiry of 1 hour, random range of 20 minutes
        randomCache.setWithRandomExpiry(cacheKey, products, 60 * 60, 20 * 60);
    }
}
```
Pros and cons analysis
Advantages
- Simple implementation without additional infrastructure
- Effectively spreads cache expiration points, reducing instantaneous database pressure
- Small changes to existing code, easy to integrate
- No additional operation and maintenance costs required
Shortcomings
- Cannot cope with an overall Redis server outage
- It only alleviates, rather than completely solves, the avalanche problem
- Random expiration may cause hotspot data to expire earlier than intended
- Expiration strategies still need to be designed separately for different business modules
Applicable scenarios
- Scenarios where a large amount of data of the same type need to be cached, such as product lists, article lists, etc.
- When a large amount of cache needs to be preloaded after system initialization or restart
- Services with low data update frequency and predictable expiration time
- As the first line of defense against avalanches, use it in conjunction with other strategies
2. Cache warm-up and timed updates
Principle
Cache warm-up means loading hotspot data into the cache when the system starts, rather than waiting for user requests to trigger caching. This prevents a flood of requests from hitting the database directly after a cold start or restart. Combined with a timed update mechanism, the cache can be proactively refreshed shortly before it expires, avoiding misses caused by expiration.
Implementation method
Cache warm-up and timed updates are implemented with a startup hook and scheduled tasks:
```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.annotation.PostConstruct;
import javax.annotation.PreDestroy;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Component;

@Component
public class CacheWarmUpService {

    private static final Logger log = LoggerFactory.getLogger(CacheWarmUpService.class);

    @Autowired
    private RedisTemplate<String, Object> redisTemplate;
    @Autowired
    private ProductRepository productRepository;
    @Autowired
    private CategoryRepository categoryRepository;

    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(5);

    /**
     * Warm up the cache when the system starts.
     */
    @PostConstruct
    public void warmUpCacheOnStartup() {
        log.info("Starting cache warm-up process...");
        scheduler.submit(this::warmUpHotProducts);
        scheduler.submit(this::warmUpCategories);
        scheduler.submit(this::warmUpHomePageData);
        log.info("Cache warm-up tasks submitted");
    }

    /**
     * Warm up hot product data.
     */
    private void warmUpHotProducts() {
        try {
            log.info("Warming up hot products cache");
            List<Product> hotProducts = productRepository.findTop100ByOrderByViewCountDesc();

            // Batch-set the cache: base TTL of 2 hours, random range of 30 minutes
            Map<String, Object> productCacheMap = new HashMap<>();
            hotProducts.forEach(product -> {
                String key = "product:detail:" + product.getId();
                productCacheMap.put(key, product);
            });
            redisTemplate.opsForValue().multiSet(productCacheMap);

            // Set the expiration times individually
            productCacheMap.keySet().forEach(key -> {
                int randomSeconds = 7200 + new Random().nextInt(1800);
                redisTemplate.expire(key, randomSeconds, TimeUnit.SECONDS);
            });

            // Schedule a refresh 30 minutes before expiry
            scheduleRefresh("hotProducts", this::warmUpHotProducts, 90, TimeUnit.MINUTES);
            log.info("Successfully warmed up {} hot products", hotProducts.size());
        } catch (Exception e) {
            log.error("Failed to warm up hot products cache", e);
        }
    }

    /**
     * Warm up category data.
     */
    private void warmUpCategories() {
        // Similar implementation...
    }

    /**
     * Warm up home page data.
     */
    private void warmUpHomePageData() {
        // Similar implementation...
    }

    /**
     * Schedule a refresh task.
     */
    private void scheduleRefresh(String taskName, Runnable task, long delay, TimeUnit timeUnit) {
        scheduler.schedule(() -> {
            log.info("Executing scheduled refresh for: {}", taskName);
            try {
                task.run();
            } catch (Exception e) {
                log.error("Error during scheduled refresh of {}", taskName, e);
                // On failure, schedule a short-term retry
                scheduler.schedule(task, 5, TimeUnit.MINUTES);
            }
        }, delay, timeUnit);
    }

    /**
     * Clean up resources when the application shuts down.
     */
    @PreDestroy
    public void shutdown() {
        scheduler.shutdown();
    }
}
```
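One design note: submitting the warm-up tasks to the executor inside @PostConstruct keeps preheating off the startup thread, so the application can begin serving traffic while the cache fills in the background. If warm-up must finish before traffic arrives, running it from an ApplicationRunner and gating readiness on its completion is a common alternative.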
Pros and cons analysis
Advantages
- Effectively avoids cache avalanches caused by a system cold start
- Reduces cache loading triggered by user requests, improving response speed
- Warm-up can be prioritized by business importance, allocating resources sensibly
- Timed updates extend the life cycle of hotspot cache entries
Shortcomings
- The warm-up process may consume system resources and slow down startup
- The data that is genuinely hot must first be identified, which is not always straightforward
- Scheduled tasks introduce additional system complexity
- Preheating too much data may increase Redis memory pressure
Applicable scenarios
- Systems that restart infrequently and are not sensitive to startup time
- Businesses with clearly identifiable, infrequently changing hot data
- Core interfaces with extremely high response-speed requirements
- Preparing the system ahead of predictable high-traffic events
3. Mutex locks and distributed locks to prevent breakdown
Principle
When a cache entry expires, a large number of concurrent requests may all find the cache missing and try to rebuild it at once, instantly putting surging pressure on the database. A mutex mechanism ensures that only one requesting thread queries the database and rebuilds the cache, while the other threads wait or fall back to an old value, protecting the database.
Implementation method
Use Redis to implement distributed locks to prevent cache breakdown:
```java
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class MutexCacheService {

    private static final Logger log = LoggerFactory.getLogger(MutexCacheService.class);

    @Autowired
    private StringRedisTemplate stringRedisTemplate;
    @Autowired
    private RedisTemplate<String, Object> redisTemplate;
    @Autowired
    private ProductRepository productRepository;

    // Default expiry of the lock
    private static final long LOCK_EXPIRY_MS = 3000;

    /**
     * Fetch product data guarded by a mutex lock.
     */
    public Product getProductWithMutex(String productId) {
        String cacheKey = "product:detail:" + productId;
        String lockKey = "lock:product:detail:" + productId;

        // Try the cache first
        Product product = (Product) redisTemplate.opsForValue().get(cacheKey);
        if (product != null) {
            return product; // cache hit
        }

        // Maximum number of retries and the base wait interval
        int maxRetries = 3;
        long retryIntervalMs = 50;

        for (int i = 0; i <= maxRetries; i++) {
            boolean locked = false;
            try {
                // Try to acquire the lock
                locked = tryLock(lockKey, LOCK_EXPIRY_MS);
                if (locked) {
                    // Double check after acquiring the lock
                    product = (Product) redisTemplate.opsForValue().get(cacheKey);
                    if (product != null) {
                        return product;
                    }
                    // Load from the database
                    product = productRepository.findById(productId).orElse(null);
                    if (product != null) {
                        // Cache with a randomized TTL
                        int expiry = 3600 + new Random().nextInt(300);
                        redisTemplate.opsForValue().set(cacheKey, product, expiry, TimeUnit.SECONDS);
                    } else {
                        // Cache an empty value briefly to absorb repeated misses
                        redisTemplate.opsForValue().set(cacheKey, new EmptyProduct(), 60, TimeUnit.SECONDS);
                    }
                    return product;
                } else if (i < maxRetries) {
                    // Randomized exponential backoff so threads do not retry in lockstep
                    long backoffTime = retryIntervalMs * (1L << i) + new Random().nextInt(50);
                    Thread.sleep(Math.min(backoffTime, 1000)); // wait at most 1 second
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                log.warn("Interrupted while waiting for mutex lock", e);
                break; // exit the loop when interrupted
            } catch (Exception e) {
                log.error("Error getting product with mutex", e);
                break; // exit the loop on unexpected errors
            } finally {
                if (locked) {
                    unlock(lockKey);
                }
            }
        }

        // Lock never acquired within the retry budget: return the old cache value or a default
        product = (Product) redisTemplate.opsForValue().get(cacheKey);
        return product != null ? product : getDefaultProduct(productId);
    }

    // Default value / degradation policy
    private Product getDefaultProduct(String productId) {
        log.warn("Failed to get product after max retries: {}", productId);
        // Return basic information or an empty object
        return new BasicProduct(productId);
    }

    /**
     * Try to acquire the distributed lock.
     */
    private boolean tryLock(String key, long expiryTimeMs) {
        Boolean result = stringRedisTemplate.opsForValue()
                .setIfAbsent(key, "locked", expiryTimeMs, TimeUnit.MILLISECONDS);
        return Boolean.TRUE.equals(result);
    }

    /**
     * Release the distributed lock.
     */
    private void unlock(String key) {
        stringRedisTemplate.delete(key);
    }
}
```
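One caveat with the unlock above: a plain delete can release a lock that has already expired and been re-acquired by another thread. A common hardening is to store a unique token as the lock value and delete the key only when the token still matches. The sketch below shows that variant of the same helpers; the token parameter and the Lua script are assumptions, not part of the original service:

```java
// Sketch: token-checked lock/unlock inside MutexCacheService.
// Requires: import java.util.Collections; import java.util.UUID;
//           import org.springframework.data.redis.core.script.DefaultRedisScript;
private static final DefaultRedisScript<Long> UNLOCK_SCRIPT = new DefaultRedisScript<>(
        "if redis.call('get', KEYS[1]) == ARGV[1] then " +
        "    return redis.call('del', KEYS[1]) " +
        "else return 0 end", Long.class);

private boolean tryLock(String key, String token, long expiryTimeMs) {
    // The caller generates the token once per attempt, e.g. UUID.randomUUID().toString()
    Boolean ok = stringRedisTemplate.opsForValue()
            .setIfAbsent(key, token, expiryTimeMs, TimeUnit.MILLISECONDS);
    return Boolean.TRUE.equals(ok);
}

private void unlock(String key, String token) {
    // Atomic check-and-delete: only the lock's current owner can release it
    stringRedisTemplate.execute(UNLOCK_SCRIPT, Collections.singletonList(key), token);
}
```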
Application in a real business scenario
```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/products")
public class ProductController {

    @Autowired
    private MutexCacheService mutexCacheService;

    @GetMapping("/{id}")
    public ResponseEntity<Product> getProduct(@PathVariable("id") String id) {
        // Fetch the product through the mutex-guarded cache service
        Product product = mutexCacheService.getProductWithMutex(id);
        if (product instanceof EmptyProduct) {
            return ResponseEntity.notFound().build();
        }
        return ResponseEntity.ok(product);
    }
}
```
Pros and cons analysis
Advantages
- Effectively prevents cache breakdown and protects the database
- Well suited to read-heavy, high-concurrency scenarios
- Ensures consistency of the rebuilt value and avoids repeated computation
- Can be combined with the other anti-avalanche strategies
Shortcomings
- Increases the complexity of the request path
- May introduce extra latency, especially under heavy lock contention
- A distributed lock implementation must handle lock timeouts, deadlocks, and similar issues
- Lock granularity is a trade-off: too coarse limits concurrency, too fine adds complexity
Applicable scenarios
- Scenarios with high concurrency and high cache reconstruction cost
- Businesses whose hot data is frequently accessed
- Complex queries that need to avoid repeated calculations
- As the last line of defense for cache avalanche
4. Multi-level caching architecture
Principle
A multi-level cache forms a cache echelon by placing caches at different levels, reducing the impact of any single layer failing. A typical stack includes a local cache (such as Caffeine or Guava Cache), a distributed cache (such as Redis), and a persistence-layer cache (such as the database query cache). When the Redis cache fails or goes down, requests can degrade to the local cache instead of hitting the database directly.
Implementation method
```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Random;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.CacheStats;
import com.google.common.cache.LoadingCache;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class MultiLevelCacheService {

    private static final Logger log = LoggerFactory.getLogger(MultiLevelCacheService.class);

    @Autowired
    private RedisTemplate<String, Object> redisTemplate;
    @Autowired
    private ProductRepository productRepository;

    // Local (level-1) cache configuration, built on Guava
    private final LoadingCache<String, Optional<Product>> localCache = CacheBuilder.newBuilder()
            .maximumSize(10000)                     // cache at most 10,000 entries
            .expireAfterWrite(5, TimeUnit.MINUTES)  // local entries expire after 5 minutes
            .recordStats()                          // record cache statistics
            .build(new CacheLoader<String, Optional<Product>>() {
                @Override
                public Optional<Product> load(String productId) throws Exception {
                    // On a local-cache miss, try to load from Redis
                    return loadFromRedis(productId);
                }
            });

    /**
     * Query a product through the multi-level cache.
     */
    public Product getProduct(String productId) {
        String cacheKey = "product:detail:" + productId;
        try {
            // Consult the local cache first; it falls through to Redis, then the database
            Optional<Product> productOptional = localCache.get(productId);
            if (productOptional.isPresent()) {
                log.debug("Product {} found in local cache", productId);
                return productOptional.get();
            } else {
                log.debug("Product {} not found in any cache level", productId);
                return null;
            }
        } catch (ExecutionException e) {
            log.error("Error loading product from cache", e);
            // All cache layers failed: query the database as a last resort
            try {
                Product product = productRepository.findById(productId).orElse(null);
                if (product != null) {
                    // Repair the cache asynchronously without blocking this request
                    CompletableFuture.runAsync(() -> {
                        try {
                            updateCache(cacheKey, product);
                        } catch (Exception ex) {
                            log.warn("Failed to update cache asynchronously", ex);
                        }
                    });
                }
                return product;
            } catch (Exception dbEx) {
                log.error("Database query failed as last resort", dbEx);
                throw new ServiceException("Failed to fetch product data", dbEx);
            }
        }
    }

    /**
     * Load data from Redis (level 2), falling back to the database.
     */
    private Optional<Product> loadFromRedis(String productId) {
        String cacheKey = "product:detail:" + productId;
        try {
            Product product = (Product) redisTemplate.opsForValue().get(cacheKey);
            if (product != null) {
                log.debug("Product {} found in Redis cache", productId);
                return Optional.of(product);
            }
            // Redis miss: query the database
            product = productRepository.findById(productId).orElse(null);
            if (product != null) {
                // Update the Redis cache
                updateCache(cacheKey, product);
                return Optional.of(product);
            } else {
                // Cache an empty value briefly
                redisTemplate.opsForValue().set(cacheKey, new EmptyProduct(), 60, TimeUnit.SECONDS);
                return Optional.empty();
            }
        } catch (Exception e) {
            log.warn("Failed to access Redis cache, falling back to database", e);
            // Redis unavailable: query the database directly
            Product product = productRepository.findById(productId).orElse(null);
            return Optional.ofNullable(product);
        }
    }

    /**
     * Update the Redis cache with a randomized expiration time.
     */
    private void updateCache(String key, Product product) {
        int expiry = 3600 + new Random().nextInt(300);
        redisTemplate.opsForValue().set(key, product, expiry, TimeUnit.SECONDS);
    }

    /**
     * Actively refresh every cache level.
     */
    public void refreshCache(String productId) {
        String cacheKey = "product:detail:" + productId;
        // Load the latest data from the database
        Product product = productRepository.findById(productId).orElse(null);
        if (product != null) {
            updateCache(cacheKey, product);                   // refresh Redis
            localCache.put(productId, Optional.of(product));  // refresh the local cache
            log.info("Refreshed all cache levels for product {}", productId);
        } else {
            // Remove the entry from every level
            redisTemplate.delete(cacheKey);
            localCache.invalidate(productId);
            log.info("Product {} not found, invalidated all cache levels", productId);
        }
    }

    /**
     * Expose local cache statistics.
     */
    public Map<String, Object> getCacheStats() {
        CacheStats stats = localCache.stats();
        Map<String, Object> result = new HashMap<>();
        result.put("localCacheSize", localCache.size());
        result.put("hitRate", stats.hitRate());
        result.put("missRate", stats.missRate());
        result.put("loadSuccessCount", stats.loadSuccessCount());
        result.put("loadExceptionCount", stats.loadExceptionCount());
        return result;
    }
}
```
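Since the principle above also mentions Caffeine, here is a minimal sketch of the same local layer built on Caffeine instead of Guava; the field would replace the one above, with everything else unchanged:

```java
// Sketch: Caffeine equivalent of the Guava LoadingCache field above.
// Requires: import java.time.Duration;
//           import com.github.benmanes.caffeine.cache.Caffeine;
//           import com.github.benmanes.caffeine.cache.LoadingCache;
private final LoadingCache<String, Optional<Product>> localCache = Caffeine.newBuilder()
        .maximumSize(10_000)                      // cap at 10,000 entries
        .expireAfterWrite(Duration.ofMinutes(5))  // local TTL of 5 minutes
        .recordStats()                            // expose hit/miss statistics
        .build(this::loadFromRedis);              // fall through to Redis on a miss
```

Note that Caffeine's LoadingCache.get does not throw a checked ExecutionException, so the catch block in getProduct would be adjusted accordingly.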
Pros and cons analysis
Advantages
- Greatly improve the fault tolerance and stability of the system
- Reduces the impact on the database when Redis fails
- Provides better read performance, especially for hot data
- Flexible downgrade paths, multi-layer protection
Shortcomings
- Increases system complexity
- Data consistency issues may be introduced
- Requires additional memory consumption for local cache
- Need to handle data synchronization between caches at all levels
Applicable scenarios
- Core systems with high concurrency and high availability requirements
- Key businesses that have strong dependence on Redis
- Read-heavy scenarios where data-consistency requirements are not extremely strict
- Large microservice architectures that need to reduce cross-service network calls
5. Circuit breaker degradation and rate limiting protection
Principle
The circuit breaker degradation mechanism monitors the health of the cache layer and quickly degrades the service when anomalies are detected, returning fallback data or simplified functionality so that requests stop hammering the database. Rate limiting actively controls the rate of requests entering the system, preventing it from being overwhelmed by a flood of requests while the cache is unavailable.
Implementation method
Circuit breaking and rate limiting implemented with Spring Cloud Circuit Breaker and a Guava rate limiter:
```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import com.google.common.util.concurrent.RateLimiter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cloud.client.circuitbreaker.CircuitBreaker;
import org.springframework.cloud.client.circuitbreaker.CircuitBreakerFactory;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class ResilientCacheService {

    private static final Logger log = LoggerFactory.getLogger(ResilientCacheService.class);

    @Autowired
    private RedisTemplate<String, Object> redisTemplate;
    @Autowired
    private ProductRepository productRepository;

    // Circuit breaker factory
    @Autowired
    private CircuitBreakerFactory circuitBreakerFactory;

    // Rate limiter
    @Autowired
    private RateLimiter productRateLimiter;

    // Resilience4j registry, used only by the monitoring endpoint below
    @Autowired
    private io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry circuitBreakerRegistry;

    /**
     * Product query protected by a circuit breaker and a rate limiter.
     */
    public Product getProductWithResilience(String productId) {
        // Apply the rate limit first
        if (!productRateLimiter.tryAcquire()) {
            log.warn("Rate limit exceeded for product query: {}", productId);
            return getFallbackProduct(productId);
        }

        // Create the circuit breaker
        CircuitBreaker circuitBreaker = circuitBreakerFactory.create("redisProductQuery");

        // Wrap the Redis cache query
        Function<String, Product> redisQueryWithFallback = id -> {
            try {
                String cacheKey = "product:detail:" + id;
                Product product = (Product) redisTemplate.opsForValue().get(cacheKey);
                if (product != null) {
                    return product;
                }
                // Cache miss: load from the database
                Product loaded = loadFromDatabase(id);
                if (loaded != null) {
                    // Update the cache asynchronously without blocking the request
                    CompletableFuture.runAsync(() -> {
                        int expiry = 3600 + new Random().nextInt(300);
                        redisTemplate.opsForValue().set(cacheKey, loaded, expiry, TimeUnit.SECONDS);
                    });
                }
                return loaded;
            } catch (Exception e) {
                log.error("Redis query failed", e);
                throw e; // rethrow so the circuit breaker records the failure
            }
        };

        // Execute the query under circuit breaker protection
        try {
            return circuitBreaker.run(
                    () -> redisQueryWithFallback.apply(productId),
                    throwable -> getFallbackProduct(productId));
        } catch (Exception e) {
            log.error("Circuit breaker execution failed", e);
            return getFallbackProduct(productId);
        }
    }

    /**
     * Load product data from the database.
     */
    private Product loadFromDatabase(String productId) {
        try {
            return productRepository.findById(productId).orElse(null);
        } catch (Exception e) {
            log.error("Database query failed", e);
            return null;
        }
    }

    /**
     * Degradation fallback: return basic product information or stale cached data.
     */
    private Product getFallbackProduct(String productId) {
        log.info("Using fallback for product: {}", productId);

        // Prefer stale data from the local cache
        Product cachedProduct = getFromLocalCache(productId);
        if (cachedProduct != null) {
            return cachedProduct;
        }

        // For important products, try to fetch basic information from the database
        if (isHighPriorityProduct(productId)) {
            try {
                return loadBasicProductInfo(productId);
            } catch (Exception e) {
                log.error("Even basic info query failed for high priority product", e);
            }
        }

        // Final fallback: build a temporary object with the minimum necessary information
        return buildTemporaryProduct(productId);
    }

    // Auxiliary method implementations (getFromLocalCache, isHighPriorityProduct,
    // loadBasicProductInfo, buildTemporaryProduct)...

    /**
     * Circuit breaker status monitoring API (reads Resilience4j metrics directly).
     */
    public Map<String, Object> getCircuitBreakerStatus() {
        io.github.resilience4j.circuitbreaker.CircuitBreaker breaker =
                circuitBreakerRegistry.circuitBreaker("redisProductQuery");
        Map<String, Object> status = new HashMap<>();
        status.put("state", breaker.getState().name());
        status.put("failureRate", breaker.getMetrics().getFailureRate());
        status.put("failureCount", breaker.getMetrics().getNumberOfFailedCalls());
        status.put("successCount", breaker.getMetrics().getNumberOfSuccessfulCalls());
        return status;
    }
}
```
Circuit breaker and rate limiter configuration
```java
import java.time.Duration;
import com.google.common.util.concurrent.RateLimiter;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import org.springframework.cloud.circuitbreaker.resilience4j.Resilience4JCircuitBreakerFactory;
import org.springframework.cloud.circuitbreaker.resilience4j.Resilience4JConfigBuilder;
import org.springframework.cloud.client.circuitbreaker.CircuitBreakerFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ResilienceConfig {

    @Bean
    public CircuitBreakerFactory circuitBreakerFactory() {
        // Implementation backed by Resilience4j
        Resilience4JCircuitBreakerFactory factory = new Resilience4JCircuitBreakerFactory();
        // Custom circuit breaker configuration
        factory.configureDefault(id -> new Resilience4JConfigBuilder(id)
                .circuitBreakerConfig(CircuitBreakerConfig.custom()
                        .slidingWindowSize(10)                           // sliding window size
                        .failureRateThreshold(50)                        // failure-rate threshold (%)
                        .waitDurationInOpenState(Duration.ofSeconds(10)) // how long the breaker stays open
                        .permittedNumberOfCallsInHalfOpenState(5)        // calls allowed in the half-open state
                        .build())
                .build());
        return factory;
    }

    @Bean
    public RateLimiter productRateLimiter() {
        // Basic rate limiter built on Guava
        return RateLimiter.create(1000); // 1000 permits per second
    }
}
```
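Note that Guava's RateLimiter throttles a single JVM only: in a multi-instance deployment, each node gets its own 1000-requests-per-second budget. If a cluster-wide cap is needed, a shared limiter, for example one built on Redis counters, has to replace it.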
Pros and cons analysis
Advantages
- Provide a complete fault tolerance mechanism to avoid cascading failures
- Actively limit traffic to prevent system overload
- Provides downgraded access paths when cache is unavailable
- Can automatically recover and adapt to dynamic changes in the system
Shortcomings
- Complex configuration, requiring careful tuning of parameters
- The downgrade logic needs to be designed separately for different businesses
- Some functions may be temporarily unavailable
- Added extra code complexity
Applicable scenarios
- Core systems with extremely high availability requirements
- Microservice architectures that need to prevent cascading propagation of failures
- Online services with large fluctuations in traffic
- Complex systems with multi-level service dependencies
6. Comparative analysis
| Strategy | Complexity | Effect | Applicable scenarios | Key advantages |
|---|---|---|---|---|
| Expiration time randomization | Low | Medium | Large batches of similar caches expiring | Simple to implement, immediately effective |
| Cache warm-up and timed updates | Medium | High | System startup and important data | Proactive prevention, reduces sudden pressure |
| Mutex locks against breakdown | Medium | High | Hot data expiring frequently | Precise protection, avoids repeated computation |
| Multi-level cache architecture | High | High | High-availability core systems | Multi-layer protection, flexible degradation |
| Circuit breaking and rate limiting | High | High | Complex microservice systems | Comprehensive protection, automatic recovery |
7. Summary
In practice, these strategies are not mutually exclusive; they should be combined according to business characteristics and system architecture. A complete cache avalanche protection system requires technical measures, architectural design, and operations monitoring to work together, building a genuinely robust and highly available system.
By implementing these strategies sensibly, we can not only deal effectively with cache avalanches, but also comprehensively improve the stability and reliability of the system, providing users with a better service experience.
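As a closing illustration, here is a minimal sketch of how several of the strategies above compose on a single read path. The method name is illustrative, and the lock helpers and getDefaultProduct are assumed to be the ones from section 3:

```java
// Sketch: randomized TTL (strategy 1) + mutex rebuild (strategy 3) + fallback (strategy 5)
public Product getProductCombined(String id) {
    String cacheKey = "product:detail:" + id;
    Product product = (Product) redisTemplate.opsForValue().get(cacheKey);
    if (product != null) {
        return product; // cache hit: no lock, no database
    }
    String lockKey = "lock:" + cacheKey;
    if (tryLock(lockKey, 3000)) {            // only one thread rebuilds the entry
        try {
            product = productRepository.findById(id).orElse(null);
            if (product != null) {
                int ttl = 3600 + new Random().nextInt(600); // randomized TTL
                redisTemplate.opsForValue().set(cacheKey, product, ttl, TimeUnit.SECONDS);
            }
            return product;
        } finally {
            unlock(lockKey);
        }
    }
    return getDefaultProduct(id);            // degrade instead of hitting the database
}
```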
The above is a detailed look at several solutions to the Redis cache avalanche problem. For more information about Redis cache avalanche, please follow my other related articles!