Random crash fault analysis and troubleshooting

Random crashes are a common fault encountered during computer use. Because they are unpredictable, occur during no particular operation, and show no consistent symptoms, the scope of the fault is hard to determine, which makes repair work difficult.

Based on a large number of repair cases, random crashes have three main causes:
1. Environmental factors
Environmental factors have a great impact on the normal operation of the machine. A computer's environmental requirements mainly cover temperature, humidity, power-grid interference, electromagnetic interference, external vibration and shock, static electricity, the grounding system, and the power supply system. Of these, temperature, humidity, static electricity, the grounding system, and the power supply system have the greatest impact. Dust and humidity in the working environment can cause short circuits between chips or poor contact in connectors and socketed parts, and either can crash the system. According to actual maintenance statistics, random failures caused by environmental factors account for about 10% of all failures.
2. Software reasons
Software causes two types of random crash. The first is virus damage: the machine can be restarted by a cold or warm boot, but it crashes again after running for a while. The second is application software that is not fully compatible with the operating system, so that the two conflict with each other or with inherent characteristics of the hardware. In most crashes of this kind the keyboard stops responding, and the machine can only be restarted by a cold boot.
To check for random failures caused by software, reboot the machine from a clean boot disk and then run antivirus software to remove any virus. For conflicts between application software and the operating system, a combination of changing the program configuration and changing the machine's hardware configuration is recommended. According to actual maintenance statistics, random failures caused by software account for about 20% of all failures.
3. Hardware reasons
Crashes caused by hardware are mainly due to poorly matched internal components. The usual causes are:
1. Poor contact in socketed chips. Some socketed chips on the motherboard make poor contact. This fault is most likely at the CPU, the memory chips, and the expansion slots. In addition, cards in AGP expansion slots are often not seated firmly.
2. Mismatched chip timing. When several chips in a circuit work together and their speeds do not match, the propagation delay of a signal through a chip's internal logic can be too long, and timing faults are likely. Alternatively, the timing circuit's control relationships are strict and the timing signals occasionally drift. This situation is most common in assembled compatible machines. In addition, mixing boards or chips from different manufacturers, or running at too high a clock frequency, can also cause crashes.
3. Poor thermal stability. This means that the machine runs normally at first, but after a period of operation, as the chip temperature rises, it begins to crash. After it is shut down and left to cool for a while it works normally again, and then the crash recurs. The main reason is that the components themselves are of substandard quality.
4. Poor chip drive capability. Each chip has a fixed fan-out, so a circuit must be designed so that the number of chips driven by a chip's output signal stays below the permitted fan-out. If a chip's actual fan-out falls short of its rating, the system will crash when more devices are connected to the system or to that circuit. This kind of fault often occurs in the motherboard's I/O interface and in the memory address or data driver chips.
5. Poor immunity to interference. If the power and ground traces for a chip on the printed circuit board are too narrow, traces run too close together, or logic levels between chips are poorly matched, "oscillation" or "reflection" produces signal interference, which weakens the chip's immunity to interference and crashes the system.
According to actual maintenance statistics, random failures caused by hardware account for about 70% of all failures. Hardware is therefore the main cause of random failures and the focus of this article.
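The fan-out rule in item 4 of the hardware causes can be illustrated with a short Python sketch. It assumes the usual DC loading check: the inputs connected to an output must not draw more current than the output can sink in the low state or source in the high state. The function names and the current values in the usage note are illustrative assumptions, not datasheet figures. Real values must come from the datasheets of the chips involved.

```python
def dc_fanout(i_ol_ua, i_oh_ua, i_il_ua, i_ih_ua):
    """Number of identical inputs one output can drive (currents in microamps).

    i_ol_ua / i_oh_ua: current the output can sink (low) / source (high).
    i_il_ua / i_ih_ua: current one input draws in the low / high state.
    The smaller of the two ratios is the limiting case.
    """
    return min(i_ol_ua // i_il_ua, i_oh_ua // i_ih_ua)


def output_overloaded(i_ol_ua, i_oh_ua, loads):
    """True if the connected inputs exceed the output's drive capability.

    loads is a list of (i_il_ua, i_ih_ua) pairs, one per driven input,
    so chips from different families can be mixed on one output.
    """
    total_low = sum(i_il for i_il, _ in loads)
    total_high = sum(i_ih for _, i_ih in loads)
    return total_low > i_ol_ua or total_high > i_oh_ua
```

For example, with hypothetical values of 8000 µA sink, 400 µA source, and inputs that draw 400 µA low and 20 µA high, `dc_fanout(8000, 400, 400, 20)` gives 20. Attaching a 21st such input makes `output_overloaded` return `True`, which corresponds to the overload condition described above.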
4. Random fault analysis and maintenance methods
The principle for checking this type of fault is first to infer the nature of the fault from its symptoms, and then, based on that inference, to use a multimeter, logic probe, oscilloscope, and other tools to check whether the corresponding signals on the hardware lines show random interference, timing drift, or similar problems. If they do, find the corresponding hardware and repair or replace it.
First, check for contact faults. With the power off, remove the expansion cards, hold each card by its edge with your fingers, and gently flex and tap it. Then, with the power on, press with your fingers on the edges of the cards, the CPU socket, the memory modules on the motherboard, and the various plugs and sockets. If the machine starts in one of these situations, a contact fault is present.
Second, if repeated trials show that the problem is not a contact fault, check for a timing fault in the control circuits. The key items to inspect are:
1. System control circuit chips. These mainly include the address bus and data bus chips, the address latch enable (ALE) signal, and other gate array chips such as the south bridge and north bridge chips on the motherboard.
2. System memory control and drive circuits. These mainly include the RAM row address strobe (RAS) and column address strobe (CAS) signals, the row/column address switching control signal, the memory data read-out driver, and the speed matching of the memory chips.
3. The clock signal circuits in the system, mainly SYSCLK, PROCCLK, PCLK, and DMACLK.
Check the above signals with a high-frequency oscilloscope rated above 100MHz, looking for a signal that shows an abnormal state at some moment, such as timing drift or glitches. Once such a signal is found, locate the corresponding chip and replace it.
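If edge times can be read off the oscilloscope or exported from it, the search for timing drift or glitches on a clock signal can be sketched in Python as follows. This is a minimal sketch, not a procedure from the article. It assumes that rising-edge timestamps are available in nanoseconds, and the function name, the nominal period, and the tolerance are all hypothetical.

```python
def find_period_violations(edge_times_ns, nominal_ns, tolerance_ns):
    """Return (edge_index, period) for every clock period that deviates
    from the nominal period by more than the tolerance.

    edge_times_ns: rising-edge timestamps in nanoseconds, in ascending order.
    A period that is much too short suggests a glitch (an extra edge);
    periods that are consistently too long or too short suggest drift.
    """
    violations = []
    for i in range(1, len(edge_times_ns)):
        period = edge_times_ns[i] - edge_times_ns[i - 1]
        if abs(period - nominal_ns) > tolerance_ns:
            violations.append((i, period))
    return violations
```

For a hypothetical clock with a nominal 30 ns period and a 3 ns tolerance, the edge list `[0, 30, 60, 75, 105, 135]` yields `[(3, 15)]`: the edge at 75 ns arrives only 15 ns after the previous one, which is the kind of momentary abnormality the inspection is looking for.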
Third, poor thermal stability is another major form of random failure. With the arrival of summer, or when the CPU is overclocked, such failures become more frequent. To check for it, open the chassis and warm it with a hair dryer held 20 cm to 30 cm away. When the temperature inside the chassis rises to around 60°C to 70°C, faults may begin to occur frequently. If the failure rate then drops sharply when the machine is placed in an air-conditioned room at 18°C to 25°C, the fault can be attributed to poor thermal stability. Next, use an oscilloscope to check the output waveforms of the data bus, address bus, and control chips on the motherboard. If a waveform shows a significant interference signal, find the corresponding chip and replace it.
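The hot/cool comparison above can be written down as a small Python sketch. The article only says that the failure rate should be "greatly reduced" in the cool room, so the ratio threshold of 3 used here is an assumption chosen for illustration. The function and parameter names are also hypothetical.

```python
def looks_thermally_unstable(crashes_hot, hours_hot,
                             crashes_cool, hours_cool,
                             ratio_threshold=3):
    """Compare crash rates from a heated run and an air-conditioned run.

    Returns True when the heated crash rate is at least ratio_threshold
    times the cool crash rate. ratio_threshold is an assumed value.
    """
    if hours_hot <= 0 or hours_cool <= 0:
        raise ValueError("observation times must be positive")
    if crashes_hot == 0:
        return False  # heating did not provoke any crash
    # Cross-multiplied form of
    #   crashes_hot / hours_hot >= ratio_threshold * crashes_cool / hours_cool
    # which also covers the case of zero crashes in the cool run.
    return crashes_hot * hours_cool >= ratio_threshold * crashes_cool * hours_hot
```

For example, 6 crashes in 2 hours of heated operation against 1 crash in 8 hours in the cool room gives `True`, while 2 crashes in 4 hours under both conditions gives `False`, which would point away from a thermal cause.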
Fourth, mutual interference between signals and poor chip drive capability are also common causes of random failures. In repair work, such faults have mostly been found between 74FXX chips and 74LSXX and ALSXX chips.
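The inspection order described in this section (contact, then timing, then thermal stability, then interference and drive capability) can be summarized in a short Python sketch. The function and its boolean inputs are hypothetical. They stand for the outcome of the manual checks described above, not for anything measured automatically.

```python
def next_suspect(contact_sensitive, timing_abnormal, heat_sensitive):
    """Map the results of the manual checks to the most likely fault class.

    contact_sensitive: behavior changes when cards, sockets, or plugs are pressed.
    timing_abnormal:   the oscilloscope shows timing drift or glitches.
    heat_sensitive:    crashes increase when heated and decrease when cooled.
    The checks are evaluated in the order the article performs them.
    """
    if contact_sensitive:
        return "contact fault"
    if timing_abnormal:
        return "timing fault in control circuits"
    if heat_sensitive:
        return "poor thermal stability"
    return "signal interference or poor chip drive capability"
```

A machine that passes the contact and timing checks but crashes under the hair-dryer test would give `next_suspect(False, False, True)`, which returns `"poor thermal stability"`.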