Introduction
In high-concurrency systems, Redis, as the core caching component, usually plays the role of "goalkeeper", effectively shielding the backend database from traffic shocks. However, when a large number of cache entries fail at the same time, requests flow straight to the database like a flood, causing database pressure to spike sharply and potentially crashing it. This phenomenon is vividly called a "cache avalanche".
There are two main triggers for a cache avalanche: a large number of cache entries expiring at the same time, or the Redis server going down. In either case, the consequence is that requests penetrate the cache layer and hit the database directly, putting the system at risk of collapse. For high-concurrency systems that rely on caching, a cache avalanche not only causes response delays but may also trigger a chain reaction that brings down the entire system.
1. Cache expiration time randomization strategy
Principle
The most common cause of a cache avalanche is a large batch of cache entries expiring at the same moment. Setting a randomized expiration time for each entry spreads the expirations across different points in time, effectively avoiding this concentrated failure.
Implementation method
The core idea is to add a random value to the base expiration time, so that even entries cached in the same batch expire at different points in time.
```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.data.redis.core.RedisTemplate;

public class RandomExpiryTimeCache {

    private static final Logger log = LoggerFactory.getLogger(RandomExpiryTimeCache.class);

    private final RedisTemplate<String, Object> redisTemplate;
    private final Random random = new Random();

    public RandomExpiryTimeCache(RedisTemplate<String, Object> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    /**
     * Set a cache value with a randomized expiration time.
     * @param key                cache key
     * @param value              cache value
     * @param baseTimeSeconds    base expiry time (seconds)
     * @param randomRangeSeconds random time range (seconds)
     */
    public void setWithRandomExpiry(String key, Object value, long baseTimeSeconds, long randomRangeSeconds) {
        // Generate a random increment
        long randomSeconds = random.nextInt((int) randomRangeSeconds);
        // Calculate the final expiration time
        long finalExpiry = baseTimeSeconds + randomSeconds;
        redisTemplate.opsForValue().set(key, value, finalExpiry, TimeUnit.SECONDS);
        log.debug("Set cache key: {} with expiry time: {}", key, finalExpiry);
    }

    /**
     * Set a batch of caches, each with its own random expiration time.
     */
    public void setBatchWithRandomExpiry(Map<String, Object> keyValueMap, long baseTimeSeconds, long randomRangeSeconds) {
        keyValueMap.forEach((key, value) -> setWithRandomExpiry(key, value, baseTimeSeconds, randomRangeSeconds));
    }
}
```
Practical application examples
```java
import java.util.List;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class ProductCacheService {

    @Autowired
    private RandomExpiryTimeCache randomCache;
    @Autowired
    private RedisTemplate<String, Object> redisTemplate;
    @Autowired
    private ProductRepository productRepository;

    /**
     * Get product details, cached with a randomized expiration time.
     */
    public Product getProductDetail(String productId) {
        String cacheKey = "product:detail:" + productId;
        Product product = (Product) redisTemplate.opsForValue().get(cacheKey);
        if (product == null) {
            // Cache miss: load from the database
            product = productRepository.findById(productId).orElse(null);
            if (product != null) {
                // Cache with a base expiry of 30 minutes and a random range of 10 minutes
                randomCache.setWithRandomExpiry(cacheKey, product, 30 * 60, 10 * 60);
            }
        }
        return product;
    }

    /**
     * Cache the home page product list with a randomized expiration time.
     */
    public void cacheHomePageProducts(List<Product> products) {
        String cacheKey = "products:homepage";
        // Base expiry of 1 hour, random range of 20 minutes
        randomCache.setWithRandomExpiry(cacheKey, products, 60 * 60, 20 * 60);
    }
}
```
Pros and cons analysis
Advantages
- Simple implementation without additional infrastructure
- Effectively spreads cache expiration points, reducing instantaneous database pressure
- Small changes to existing code, easy to integrate
- No additional operation and maintenance costs required
Shortcomings
- Cannot cope with an overall Redis server outage
- It only alleviates, rather than completely solves, the avalanche problem
- Random expiration may cause hotspot data to expire earlier than intended
- Expiration strategies still need to be designed separately for different business modules
Applicable scenarios
- Scenarios where a large amount of data of the same type need to be cached, such as product lists, article lists, etc.
- When a large amount of cache needs to be preloaded after system initialization or restart
- Services with low data update frequency and predictable expiration time
- As the first line of defense against avalanches, use it in conjunction with other strategies
2. Cache warm-up and timed updates
Principle
Cache warm-up means loading hotspot data into the cache when the system starts, rather than waiting for user requests to trigger caching. This prevents a flood of requests from hitting the database directly after a cold start or restart. Combined with a timed update mechanism, the cache can be proactively refreshed shortly before it expires, avoiding misses caused by expiration.
Implementation method
Cache warm-up and timed updates are implemented with a startup hook and scheduled tasks:
```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.annotation.PostConstruct;
import javax.annotation.PreDestroy;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Component;

@Component
public class CacheWarmUpService {

    private static final Logger log = LoggerFactory.getLogger(CacheWarmUpService.class);

    @Autowired
    private RedisTemplate<String, Object> redisTemplate;
    @Autowired
    private ProductRepository productRepository;
    @Autowired
    private CategoryRepository categoryRepository;

    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(5);

    /**
     * Warm up the cache when the system starts.
     */
    @PostConstruct
    public void warmUpCacheOnStartup() {
        log.info("Starting cache warm-up process...");
        scheduler.submit(this::warmUpHotProducts);
        scheduler.submit(this::warmUpCategories);
        scheduler.submit(this::warmUpHomePageData);
        log.info("Cache warm-up tasks submitted");
    }

    /**
     * Warm up hot product data.
     */
    private void warmUpHotProducts() {
        try {
            log.info("Warming up hot products cache");
            List<Product> hotProducts = productRepository.findTop100ByOrderByViewCountDesc();

            // Batch-set the cache: base TTL of 2 hours, random range of 30 minutes
            Map<String, Object> productCacheMap = new HashMap<>();
            hotProducts.forEach(product -> {
                String key = "product:detail:" + product.getId();
                productCacheMap.put(key, product);
            });
            redisTemplate.opsForValue().multiSet(productCacheMap);

            // Set the expiration times individually
            productCacheMap.keySet().forEach(key -> {
                int randomSeconds = 7200 + new Random().nextInt(1800);
                redisTemplate.expire(key, randomSeconds, TimeUnit.SECONDS);
            });

            // Schedule a refresh 30 minutes before expiry
            scheduleRefresh("hotProducts", this::warmUpHotProducts, 90, TimeUnit.MINUTES);
            log.info("Successfully warmed up {} hot products", hotProducts.size());
        } catch (Exception e) {
            log.error("Failed to warm up hot products cache", e);
        }
    }

    /**
     * Warm up category data.
     */
    private void warmUpCategories() {
        // Similar implementation...
    }

    /**
     * Warm up home page data.
     */
    private void warmUpHomePageData() {
        // Similar implementation...
    }

    /**
     * Schedule a refresh task.
     */
    private void scheduleRefresh(String taskName, Runnable task, long delay, TimeUnit timeUnit) {
        scheduler.schedule(() -> {
            log.info("Executing scheduled refresh for: {}", taskName);
            try {
                task.run();
            } catch (Exception e) {
                log.error("Error during scheduled refresh of {}", taskName, e);
                // On failure, schedule a short-term retry
                scheduler.schedule(task, 5, TimeUnit.MINUTES);
            }
        }, delay, timeUnit);
    }

    /**
     * Clean up resources when the application shuts down.
     */
    @PreDestroy
    public void shutdown() {
        scheduler.shutdown();
    }
}
```
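One design note: submitting the warm-up tasks to the executor inside @PostConstruct keeps preheating off the startup thread, so the application can begin serving traffic while the cache fills in the background. If warm-up must finish before traffic arrives, running it from an ApplicationRunner and gating readiness on its completion is a common alternative.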
Pros and cons analysis
Advantages
- Effectively avoids cache avalanches caused by a system cold start
- Reduces cache loading triggered by user requests, improving response speed
- Warm-up can be prioritized by business importance, allocating resources sensibly
- Timed updates extend the life cycle of hotspot cache entries
Shortcomings
- The warm-up process may consume system resources and slow down startup
- The data that is genuinely hot must first be identified, which is not always straightforward
- Scheduled tasks introduce additional system complexity
- Preheating too much data may increase Redis memory pressure
Applicable scenarios
- Systems that restart infrequently and are not sensitive to startup time
- Businesses with clearly identifiable, infrequently changing hot data
- Core interfaces with extremely high response-speed requirements
- Preparing the system ahead of predictable high-traffic events
3. Mutex locks and distributed locks to prevent breakdown
Principle
When a cache entry expires, a large number of concurrent requests may all find the cache missing and try to rebuild it at once, instantly putting surging pressure on the database. A mutex mechanism ensures that only one requesting thread queries the database and rebuilds the cache, while the other threads wait or fall back to an old value, protecting the database.
Implementation method
Use Redis to implement distributed locks to prevent cache breakdown:
```java
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class MutexCacheService {

    private static final Logger log = LoggerFactory.getLogger(MutexCacheService.class);

    @Autowired
    private StringRedisTemplate stringRedisTemplate;
    @Autowired
    private RedisTemplate<String, Object> redisTemplate;
    @Autowired
    private ProductRepository productRepository;

    // Default expiry of the lock
    private static final long LOCK_EXPIRY_MS = 3000;

    /**
     * Fetch product data guarded by a mutex lock.
     */
    public Product getProductWithMutex(String productId) {
        String cacheKey = "product:detail:" + productId;
        String lockKey = "lock:product:detail:" + productId;

        // Try the cache first
        Product product = (Product) redisTemplate.opsForValue().get(cacheKey);
        if (product != null) {
            return product; // cache hit
        }

        // Maximum number of retries and the base wait interval
        int maxRetries = 3;
        long retryIntervalMs = 50;

        for (int i = 0; i <= maxRetries; i++) {
            boolean locked = false;
            try {
                // Try to acquire the lock
                locked = tryLock(lockKey, LOCK_EXPIRY_MS);
                if (locked) {
                    // Double check after acquiring the lock
                    product = (Product) redisTemplate.opsForValue().get(cacheKey);
                    if (product != null) {
                        return product;
                    }
                    // Load from the database
                    product = productRepository.findById(productId).orElse(null);
                    if (product != null) {
                        // Cache with a randomized TTL
                        int expiry = 3600 + new Random().nextInt(300);
                        redisTemplate.opsForValue().set(cacheKey, product, expiry, TimeUnit.SECONDS);
                    } else {
                        // Cache an empty value briefly to absorb repeated misses
                        redisTemplate.opsForValue().set(cacheKey, new EmptyProduct(), 60, TimeUnit.SECONDS);
                    }
                    return product;
                } else if (i < maxRetries) {
                    // Randomized exponential backoff so threads do not retry in lockstep
                    long backoffTime = retryIntervalMs * (1L << i) + new Random().nextInt(50);
                    Thread.sleep(Math.min(backoffTime, 1000)); // wait at most 1 second
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                log.warn("Interrupted while waiting for mutex lock", e);
                break; // exit the loop when interrupted
            } catch (Exception e) {
                log.error("Error getting product with mutex", e);
                break; // exit the loop on unexpected errors
            } finally {
                if (locked) {
                    unlock(lockKey);
                }
            }
        }

        // Lock never acquired within the retry budget: return the old cache value or a default
        product = (Product) redisTemplate.opsForValue().get(cacheKey);
        return product != null ? product : getDefaultProduct(productId);
    }

    // Default value / degradation policy
    private Product getDefaultProduct(String productId) {
        log.warn("Failed to get product after max retries: {}", productId);
        // Return basic information or an empty object
        return new BasicProduct(productId);
    }

    /**
     * Try to acquire the distributed lock.
     */
    private boolean tryLock(String key, long expiryTimeMs) {
        Boolean result = stringRedisTemplate.opsForValue()
                .setIfAbsent(key, "locked", expiryTimeMs, TimeUnit.MILLISECONDS);
        return Boolean.TRUE.equals(result);
    }

    /**
     * Release the distributed lock.
     */
    private void unlock(String key) {
        stringRedisTemplate.delete(key);
    }
}
```
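One caveat with the unlock above: a plain delete can release a lock that has already expired and been re-acquired by another thread. A common hardening is to store a unique token as the lock value and delete the key only when the token still matches. The sketch below shows that variant of the same helpers; the token parameter and the Lua script are assumptions, not part of the original service:

```java
// Sketch: token-checked lock/unlock inside MutexCacheService.
// Requires: import java.util.Collections; import java.util.UUID;
//           import org.springframework.data.redis.core.script.DefaultRedisScript;
private static final DefaultRedisScript<Long> UNLOCK_SCRIPT = new DefaultRedisScript<>(
        "if redis.call('get', KEYS[1]) == ARGV[1] then " +
        "    return redis.call('del', KEYS[1]) " +
        "else return 0 end", Long.class);

private boolean tryLock(String key, String token, long expiryTimeMs) {
    // The caller generates the token once per attempt, e.g. UUID.randomUUID().toString()
    Boolean ok = stringRedisTemplate.opsForValue()
            .setIfAbsent(key, token, expiryTimeMs, TimeUnit.MILLISECONDS);
    return Boolean.TRUE.equals(ok);
}

private void unlock(String key, String token) {
    // Atomic check-and-delete: only the lock's current owner can release it
    stringRedisTemplate.execute(UNLOCK_SCRIPT, Collections.singletonList(key), token);
}
```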
Application in a real business scenario
```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/products")
public class ProductController {

    @Autowired
    private MutexCacheService mutexCacheService;

    @GetMapping("/{id}")
    public ResponseEntity<Product> getProduct(@PathVariable("id") String id) {
        // Fetch the product through the mutex-guarded cache service
        Product product = mutexCacheService.getProductWithMutex(id);
        if (product instanceof EmptyProduct) {
            return ResponseEntity.notFound().build();
        }
        return ResponseEntity.ok(product);
    }
}
```
Pros and cons analysis
Advantages
- Effectively prevents cache breakdown and protects the database
- Well suited to read-heavy, high-concurrency scenarios
- Ensures consistency of the rebuilt value and avoids repeated computation
- Can be combined with the other anti-avalanche strategies
Shortcomings
- Increases the complexity of the request path
- May introduce extra latency, especially under heavy lock contention
- A distributed lock implementation must handle lock timeouts, deadlocks, and similar issues
- Lock granularity is a trade-off: too coarse limits concurrency, too fine adds complexity
Applicable scenarios
- Scenarios with high concurrency and high cache reconstruction cost
- Businesses whose hot data is frequently accessed
- Complex queries that need to avoid repeated calculations
- As the last line of defense for cache avalanche
4. Multi-level caching architecture
Principle
A multi-level cache forms a cache echelon by placing caches at different levels, reducing the impact of any single layer failing. A typical stack includes a local cache (such as Caffeine or Guava Cache), a distributed cache (such as Redis), and a persistence-layer cache (such as the database query cache). When the Redis cache fails or goes down, requests can degrade to the local cache instead of hitting the database directly.
Implementation method
```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Random;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.CacheStats;
import com.google.common.cache.LoadingCache;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class MultiLevelCacheService {

    private static final Logger log = LoggerFactory.getLogger(MultiLevelCacheService.class);

    @Autowired
    private RedisTemplate<String, Object> redisTemplate;
    @Autowired
    private ProductRepository productRepository;

    // Local (level-1) cache configuration, built on Guava
    private final LoadingCache<String, Optional<Product>> localCache = CacheBuilder.newBuilder()
            .maximumSize(10000)                     // cache at most 10,000 entries
            .expireAfterWrite(5, TimeUnit.MINUTES)  // local entries expire after 5 minutes
            .recordStats()                          // record cache statistics
            .build(new CacheLoader<String, Optional<Product>>() {
                @Override
                public Optional<Product> load(String productId) throws Exception {
                    // On a local-cache miss, try to load from Redis
                    return loadFromRedis(productId);
                }
            });

    /**
     * Query a product through the multi-level cache.
     */
    public Product getProduct(String productId) {
        String cacheKey = "product:detail:" + productId;
        try {
            // Consult the local cache first; it falls through to Redis, then the database
            Optional<Product> productOptional = localCache.get(productId);
            if (productOptional.isPresent()) {
                log.debug("Product {} found in local cache", productId);
                return productOptional.get();
            } else {
                log.debug("Product {} not found in any cache level", productId);
                return null;
            }
        } catch (ExecutionException e) {
            log.error("Error loading product from cache", e);
            // All cache layers failed: query the database as a last resort
            try {
                Product product = productRepository.findById(productId).orElse(null);
                if (product != null) {
                    // Repair the cache asynchronously without blocking this request
                    CompletableFuture.runAsync(() -> {
                        try {
                            updateCache(cacheKey, product);
                        } catch (Exception ex) {
                            log.warn("Failed to update cache asynchronously", ex);
                        }
                    });
                }
                return product;
            } catch (Exception dbEx) {
                log.error("Database query failed as last resort", dbEx);
                throw new ServiceException("Failed to fetch product data", dbEx);
            }
        }
    }

    /**
     * Load data from Redis (level 2), falling back to the database.
     */
    private Optional<Product> loadFromRedis(String productId) {
        String cacheKey = "product:detail:" + productId;
        try {
            Product product = (Product) redisTemplate.opsForValue().get(cacheKey);
            if (product != null) {
                log.debug("Product {} found in Redis cache", productId);
                return Optional.of(product);
            }
            // Redis miss: query the database
            product = productRepository.findById(productId).orElse(null);
            if (product != null) {
                // Update the Redis cache
                updateCache(cacheKey, product);
                return Optional.of(product);
            } else {
                // Cache an empty value briefly
                redisTemplate.opsForValue().set(cacheKey, new EmptyProduct(), 60, TimeUnit.SECONDS);
                return Optional.empty();
            }
        } catch (Exception e) {
            log.warn("Failed to access Redis cache, falling back to database", e);
            // Redis unavailable: query the database directly
            Product product = productRepository.findById(productId).orElse(null);
            return Optional.ofNullable(product);
        }
    }

    /**
     * Update the Redis cache with a randomized expiration time.
     */
    private void updateCache(String key, Product product) {
        int expiry = 3600 + new Random().nextInt(300);
        redisTemplate.opsForValue().set(key, product, expiry, TimeUnit.SECONDS);
    }

    /**
     * Actively refresh every cache level.
     */
    public void refreshCache(String productId) {
        String cacheKey = "product:detail:" + productId;
        // Load the latest data from the database
        Product product = productRepository.findById(productId).orElse(null);
        if (product != null) {
            updateCache(cacheKey, product);                   // refresh Redis
            localCache.put(productId, Optional.of(product));  // refresh the local cache
            log.info("Refreshed all cache levels for product {}", productId);
        } else {
            // Remove the entry from every level
            redisTemplate.delete(cacheKey);
            localCache.invalidate(productId);
            log.info("Product {} not found, invalidated all cache levels", productId);
        }
    }

    /**
     * Expose local cache statistics.
     */
    public Map<String, Object> getCacheStats() {
        CacheStats stats = localCache.stats();
        Map<String, Object> result = new HashMap<>();
        result.put("localCacheSize", localCache.size());
        result.put("hitRate", stats.hitRate());
        result.put("missRate", stats.missRate());
        result.put("loadSuccessCount", stats.loadSuccessCount());
        result.put("loadExceptionCount", stats.loadExceptionCount());
        return result;
    }
}
```
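Since the principle above also mentions Caffeine, here is a minimal sketch of the same local layer built on Caffeine instead of Guava; the field would replace the one above, with everything else unchanged:

```java
// Sketch: Caffeine equivalent of the Guava LoadingCache field above.
// Requires: import java.time.Duration;
//           import com.github.benmanes.caffeine.cache.Caffeine;
//           import com.github.benmanes.caffeine.cache.LoadingCache;
private final LoadingCache<String, Optional<Product>> localCache = Caffeine.newBuilder()
        .maximumSize(10_000)                      // cap at 10,000 entries
        .expireAfterWrite(Duration.ofMinutes(5))  // local TTL of 5 minutes
        .recordStats()                            // expose hit/miss statistics
        .build(this::loadFromRedis);              // fall through to Redis on a miss
```

Note that Caffeine's LoadingCache.get does not throw a checked ExecutionException, so the catch block in getProduct would be adjusted accordingly.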
Pros and cons analysis
Advantages
- Greatly improve the fault tolerance and stability of the system
- Reduces the impact on the database when Redis fails
- Provides better read performance, especially for hot data
- Flexible downgrade paths, multi-layer protection
Shortcomings
- Increases system complexity
- Data consistency issues may be introduced
- Requires additional memory consumption for local cache
- Need to handle data synchronization between caches at all levels
Applicable scenarios
- Core systems with high concurrency and high availability requirements
- Key businesses that have strong dependence on Redis
- Read-heavy scenarios where data-consistency requirements are not extremely strict
- Large microservice architectures that need to reduce cross-service network calls
5. Circuit breaker degradation and rate limiting protection
Principle
The circuit breaker degradation mechanism monitors the health of the cache layer and quickly degrades the service when anomalies are detected, returning fallback data or simplified functionality so that requests stop hammering the database. Rate limiting actively controls the rate of requests entering the system, preventing it from being overwhelmed by a flood of requests while the cache is unavailable.
Implementation method
Circuit breaking and rate limiting implemented with Spring Cloud Circuit Breaker and a Guava rate limiter:
```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import com.google.common.util.concurrent.RateLimiter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cloud.client.circuitbreaker.CircuitBreaker;
import org.springframework.cloud.client.circuitbreaker.CircuitBreakerFactory;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class ResilientCacheService {

    private static final Logger log = LoggerFactory.getLogger(ResilientCacheService.class);

    @Autowired
    private RedisTemplate<String, Object> redisTemplate;
    @Autowired
    private ProductRepository productRepository;

    // Circuit breaker factory
    @Autowired
    private CircuitBreakerFactory circuitBreakerFactory;

    // Rate limiter
    @Autowired
    private RateLimiter productRateLimiter;

    // Resilience4j registry, used only by the monitoring endpoint below
    @Autowired
    private io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry circuitBreakerRegistry;

    /**
     * Product query protected by a circuit breaker and a rate limiter.
     */
    public Product getProductWithResilience(String productId) {
        // Apply the rate limit first
        if (!productRateLimiter.tryAcquire()) {
            log.warn("Rate limit exceeded for product query: {}", productId);
            return getFallbackProduct(productId);
        }

        // Create the circuit breaker
        CircuitBreaker circuitBreaker = circuitBreakerFactory.create("redisProductQuery");

        // Wrap the Redis cache query
        Function<String, Product> redisQueryWithFallback = id -> {
            try {
                String cacheKey = "product:detail:" + id;
                Product product = (Product) redisTemplate.opsForValue().get(cacheKey);
                if (product != null) {
                    return product;
                }
                // Cache miss: load from the database
                Product loaded = loadFromDatabase(id);
                if (loaded != null) {
                    // Update the cache asynchronously without blocking the request
                    CompletableFuture.runAsync(() -> {
                        int expiry = 3600 + new Random().nextInt(300);
                        redisTemplate.opsForValue().set(cacheKey, loaded, expiry, TimeUnit.SECONDS);
                    });
                }
                return loaded;
            } catch (Exception e) {
                log.error("Redis query failed", e);
                throw e; // rethrow so the circuit breaker records the failure
            }
        };

        // Execute the query under circuit breaker protection
        try {
            return circuitBreaker.run(
                    () -> redisQueryWithFallback.apply(productId),
                    throwable -> getFallbackProduct(productId));
        } catch (Exception e) {
            log.error("Circuit breaker execution failed", e);
            return getFallbackProduct(productId);
        }
    }

    /**
     * Load product data from the database.
     */
    private Product loadFromDatabase(String productId) {
        try {
            return productRepository.findById(productId).orElse(null);
        } catch (Exception e) {
            log.error("Database query failed", e);
            return null;
        }
    }

    /**
     * Degradation fallback: return basic product information or stale cached data.
     */
    private Product getFallbackProduct(String productId) {
        log.info("Using fallback for product: {}", productId);

        // Prefer stale data from the local cache
        Product cachedProduct = getFromLocalCache(productId);
        if (cachedProduct != null) {
            return cachedProduct;
        }

        // For important products, try to fetch basic information from the database
        if (isHighPriorityProduct(productId)) {
            try {
                return loadBasicProductInfo(productId);
            } catch (Exception e) {
                log.error("Even basic info query failed for high priority product", e);
            }
        }

        // Final fallback: build a temporary object with the minimum necessary information
        return buildTemporaryProduct(productId);
    }

    // Auxiliary method implementations (getFromLocalCache, isHighPriorityProduct,
    // loadBasicProductInfo, buildTemporaryProduct)...

    /**
     * Circuit breaker status monitoring API (reads Resilience4j metrics directly).
     */
    public Map<String, Object> getCircuitBreakerStatus() {
        io.github.resilience4j.circuitbreaker.CircuitBreaker breaker =
                circuitBreakerRegistry.circuitBreaker("redisProductQuery");
        Map<String, Object> status = new HashMap<>();
        status.put("state", breaker.getState().name());
        status.put("failureRate", breaker.getMetrics().getFailureRate());
        status.put("failureCount", breaker.getMetrics().getNumberOfFailedCalls());
        status.put("successCount", breaker.getMetrics().getNumberOfSuccessfulCalls());
        return status;
    }
}
```
Circuit breaker and rate limiter configuration
```java
import java.time.Duration;
import com.google.common.util.concurrent.RateLimiter;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import org.springframework.cloud.circuitbreaker.resilience4j.Resilience4JCircuitBreakerFactory;
import org.springframework.cloud.circuitbreaker.resilience4j.Resilience4JConfigBuilder;
import org.springframework.cloud.client.circuitbreaker.CircuitBreakerFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ResilienceConfig {

    @Bean
    public CircuitBreakerFactory circuitBreakerFactory() {
        // Implementation backed by Resilience4j
        Resilience4JCircuitBreakerFactory factory = new Resilience4JCircuitBreakerFactory();
        // Custom circuit breaker configuration
        factory.configureDefault(id -> new Resilience4JConfigBuilder(id)
                .circuitBreakerConfig(CircuitBreakerConfig.custom()
                        .slidingWindowSize(10)                           // sliding window size
                        .failureRateThreshold(50)                        // failure-rate threshold (%)
                        .waitDurationInOpenState(Duration.ofSeconds(10)) // how long the breaker stays open
                        .permittedNumberOfCallsInHalfOpenState(5)        // calls allowed in the half-open state
                        .build())
                .build());
        return factory;
    }

    @Bean
    public RateLimiter productRateLimiter() {
        // Basic rate limiter built on Guava
        return RateLimiter.create(1000); // 1000 permits per second
    }
}
```
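Note that Guava's RateLimiter throttles a single JVM only: in a multi-instance deployment, each node gets its own 1000-requests-per-second budget. If a cluster-wide cap is needed, a shared limiter, for example one built on Redis counters, has to replace it.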
Pros and cons analysis
Advantages
- Provide a complete fault tolerance mechanism to avoid cascading failures
- Actively limit traffic to prevent system overload
- Provides downgraded access paths when cache is unavailable
- Can automatically recover and adapt to dynamic changes in the system
Shortcomings
- Complex configuration, requiring careful tuning of parameters
- The downgrade logic needs to be designed separately for different businesses
- Some functions may be temporarily unavailable
- Added extra code complexity
Applicable scenarios
- Core systems with extremely high availability requirements
- Microservice architectures that need to prevent cascading propagation of failures
- Online services with large fluctuations in traffic
- Complex systems with multi-level service dependencies
6. Comparative analysis
| Strategy | Complexity | Effect | Applicable scenarios | Key advantages |
|---|---|---|---|---|
| Expiration time randomization | Low | Medium | Large batches of similar caches expiring | Simple to implement, immediately effective |
| Cache warm-up and timed updates | Medium | High | System startup and important data | Proactive prevention, reduces sudden pressure |
| Mutex locks against breakdown | Medium | High | Hot data expiring frequently | Precise protection, avoids repeated computation |
| Multi-level cache architecture | High | High | High-availability core systems | Multi-layer protection, flexible degradation |
| Circuit breaking and rate limiting | High | High | Complex microservice systems | Comprehensive protection, automatic recovery |
7. Summary
In practice, these strategies are not mutually exclusive; they should be combined according to business characteristics and system architecture. A complete cache avalanche protection system requires technical measures, architectural design, and operations monitoring to work together, building a genuinely robust and highly available system.
By implementing these strategies sensibly, we can not only deal effectively with cache avalanches, but also comprehensively improve the stability and reliability of the system, providing users with a better service experience.
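As a closing illustration, here is a minimal sketch of how several of the strategies above compose on a single read path. The method name is illustrative, and the lock helpers and getDefaultProduct are assumed to be the ones from section 3:

```java
// Sketch: randomized TTL (strategy 1) + mutex rebuild (strategy 3) + fallback (strategy 5)
public Product getProductCombined(String id) {
    String cacheKey = "product:detail:" + id;
    Product product = (Product) redisTemplate.opsForValue().get(cacheKey);
    if (product != null) {
        return product; // cache hit: no lock, no database
    }
    String lockKey = "lock:" + cacheKey;
    if (tryLock(lockKey, 3000)) {            // only one thread rebuilds the entry
        try {
            product = productRepository.findById(id).orElse(null);
            if (product != null) {
                int ttl = 3600 + new Random().nextInt(600); // randomized TTL
                redisTemplate.opsForValue().set(cacheKey, product, ttl, TimeUnit.SECONDS);
            }
            return product;
        } finally {
            unlock(lockKey);
        }
    }
    return getDefaultProduct(id);            // degrade instead of hitting the database
}
```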
The above is a detailed look at several solutions to the Redis cache avalanche problem. For more information about Redis cache avalanche, please follow my other related articles!