
A practical case study of JVM production environment tuning

Case 3: Optimizing frequent Full GC in the JVM

1. Project background (Situation)

In the anti-money laundering system of the Yunzhongzhong Wanwei cross-border payment platform, we were responsible for real-time rule verification of massive transaction data to ensure compliance with regulatory requirements. The system handles an average of more than 5 million transactions per day with a peak QPS of 3,000, adopts a microservice architecture, and its core services are developed in Java and run on a container cluster. As business volume grew, the system began to frequently trigger Full GC after several hours of operation, causing the service response time (RT) to soar from an average of 50ms to more than 2 seconds and seriously affecting the timeliness of real-time risk control decisions.

2. Problems and Challenges (Task)

  • Phenomenon:
    • Old-generation memory usage keeps growing, triggering 3-4 Full GCs every hour, each with a pause time exceeding 3 seconds.
    • System throughput dropped by 30%, and some transactions were misjudged as high risk by the risk control system due to timeouts.
  • Target:
    • Within 1 week, locate and fix the source of the memory leak, reduce the Full GC frequency to less than once per day, and keep pause times below 200ms.
    • Ensure the system operates stably during peak business periods and avoid transaction backlogs caused by GC pauses.

3. Solution process (Action)

3.1 Monitoring and Diagnosis

  • Toolchain selection
    • JVM monitoring: observe the usage of the heap regions (Eden, Survivor, Old Gen) in real time via jstat -gcutil; the old-generation occupancy was found to keep rising after each Young GC (see the command sketch after this list).
  • GC log analysis
    • Enable detailed GC logs (-Xlog:gc*,gc+heap=debug:file=) and analyze the causes of GC with tools such as GCViewer or GCEasy.
    • Pay attention to the reasons for Full GC triggering (such as Metadata GC Threshold, Ergonomics).
  • Prometheus + Grafana Monitoring
    • Integrated a JVM exporter to monitor memory region usage, GC counts, and GC time in real time.
    • Set alarm rules (for example, an alarm fires when old-generation usage exceeds 80%).
  • Root cause positioning
    • MAT analysis results: a static ConcurrentHashMap was found caching historical risk control rule objects (about 2KB per rule), more than 5 million in total, accounting for 85% of old-generation memory.
    • Code review: the rule engine adds new rules to the static map on every rule update but never cleans up expired rules, so the cache grows without bound.
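A minimal sketch of the diagnostic commands used at this stage (the PID placeholder, sampling interval, and file names are illustrative):

jstat -gcutil <pid> 1000                          # print Eden/Survivor/Old occupancy every second
jmap -dump:live,format=b,file=heap.hprof <pid>    # capture a heap dump of live objects for MAT
-Xlog:gc*,gc+heap=debug:file=gc.log               # JDK 9+ startup flag enabling the detailed GC log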

3.2 Optimization solution design

  • Cache policy refactoring
    • Data structure replacement: change the static ConcurrentHashMap to a WeakHashMap, relying on weak-reference semantics so the JVM can automatically reclaim unreferenced rules when memory is tight.
    • Regular cleaning mechanism: add a scheduled task (via Spring's @Scheduled) that cleans up rules older than 3 days in the early hours of every morning.
  • Code Example
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.WeakHashMap;
import org.springframework.scheduling.annotation.Scheduled;

public class RuleCache {
    private static final Map<String, SoftReference<Rule>> ruleCache = new WeakHashMap<>();

    @Scheduled(cron = "0 0 3 * * ?") // Clean up at 3 am every day
    public void cleanExpiredRules() {
        ruleCache.entrySet().removeIf(entry ->
            entry.getValue().get() == null || entry.getValue().get().isExpired());
    }
}
  • Garbage collector tuning
    • Replace the garbage collector: switch from the default Parallel GC to G1 GC to take advantage of its region-based collection and pause-time prediction.
    • Parameter adjustment
-XX:+UseG1GC 
-XX:MaxGCPauseMillis=200                # Target pause time of 200ms
-XX:InitiatingHeapOccupancyPercent=45   # Start concurrent marking earlier
-XX:G1HeapRegionSize=8m                 # Region size matched to the heap size
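For context, a hedged sketch of how these flags might be combined on the service's startup command line (the heap size, log path, and jar name are illustrative placeholders, not the project's actual values):

java -Xms8g -Xmx8g \
     -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=200 \
     -XX:InitiatingHeapOccupancyPercent=45 \
     -XX:G1HeapRegionSize=8m \
     -Xlog:gc*,gc+heap=debug:file=gc.log \
     -jar rule-engine-service.jar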

3.3 Verification and guarantee

  • Pressure test verification
    • Peak traffic (QPS 6,000) was simulated with JMeter; the Full GC frequency dropped to once per day, with an average pause time of 180ms.
  • Monitoring reinforcement
    • Configure GC pause alarm rules in Prometheus (for example, more than 1 Full GC within 1 minute; a hedged rule sketch follows this list) and integrate them into the operations alarm platform.
    • Visualize GC time distribution and memory usage trends through Grafana.
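A minimal sketch of such an alert rule, assuming the collector metrics exposed by the standard Prometheus JVM client (the metric name jvm_gc_collection_seconds_count and the "G1 Old Generation" label value are assumptions about the exporter in use, and the threshold is illustrative):

- alert: FrequentFullGC
  expr: increase(jvm_gc_collection_seconds_count{gc="G1 Old Generation"}[1m]) > 1
  labels:
    severity: critical
  annotations:
    summary: "More than one Full GC in the last minute"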

4. Results and Value (Result)

  • Performance improvement:
    • The Full GC frequency dropped from 3 times per hour to once per day, and the average pause time was reduced from about 3 seconds to 180ms.
    • System throughput recovered to its normal level, and RT stabilized at around 50ms.
  • Resource optimization:
    • Old-generation memory usage was reduced by 70%, and the container memory request was reduced from 16GB to 10GB, saving about 20% of cloud resource costs.
  • Experience precipitation:
    • Produced a "JVM Memory Leak Troubleshooting Guide" and a "G1 Tuning Manual", and pushed the team to establish a periodic GC health check mechanism.

5. In-depth expansion of technology

  • Limitations of WeakHashMap
    • Weakly referenced entries are only reclaimed at the next GC. If the business needs precise control over the cache life cycle, active cleanup should be combined with a ReferenceQueue (see the sketch after this list).
  • G1 Tuning Advanced
    • Reserve headroom via -XX:G1ReservePercent=10 to avoid promotion failures (Evacuation Failure).
    • Monitor the efficiency of G1 Mixed GC and adjust -XX:G1MixedGCLiveThresholdPercent to optimize the reclamation threshold.
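A minimal sketch of the ReferenceQueue approach mentioned above, assuming the same Rule type as the earlier example (the class and method names here are illustrative, not part of the original project):

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RuleCacheWithQueue {

    // SoftReference that remembers its cache key so the map entry can be removed after collection
    private static class RuleRef extends SoftReference<Rule> {
        final String key;
        RuleRef(String key, Rule rule, ReferenceQueue<Rule> queue) {
            super(rule, queue);
            this.key = key;
        }
    }

    private final Map<String, RuleRef> cache = new ConcurrentHashMap<>();
    private final ReferenceQueue<Rule> queue = new ReferenceQueue<>();

    public void put(String key, Rule rule) {
        drainQueue(); // piggyback cleanup on every write
        cache.put(key, new RuleRef(key, rule, queue));
    }

    public Rule get(String key) {
        RuleRef ref = cache.get(key);
        return ref == null ? null : ref.get();
    }

    // Remove map entries whose referents have already been reclaimed by the GC
    private void drainQueue() {
        Reference<? extends Rule> ref;
        while ((ref = queue.poll()) != null) {
            cache.remove(((RuleRef) ref).key);
        }
    }
}

With this pattern, stale entries are removed the next time the cache is written to after the GC has cleared them, rather than waiting for the nightly scheduled cleanup.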

6. Summary

Through this optimization, we not only solved the system lag caused by Full GC, but also deepened our understanding of the JVM memory management mechanism. Key gains include:

  • Proficient use of the toolchain: MAT heap dump analysis and G1 parameter tuning techniques.
  • Cache design trade-offs: the applicable scenarios for strong references versus weak references, and the implementation of cache expiration strategies.
  • Systematic thinking: end-to-end, closed-loop problem-solving from code optimization to architecture adjustment.

This experience reflects the practical ability to keep a system stable in high-concurrency scenarios through precise problem localization and methodical tuning.

This is the end of this article about JVM production environment tuning. For more relevant JVM tuning content, please search for my previous articles or continue browsing the related articles below. I hope everyone will support me in the future!