SoFunction
Updated on 2025-03-09

After memorizing these 18 items, no one will be able to fool you about CPUs.

1. Main frequency

The main frequency, also called the clock frequency, is measured in MHz and is used to represent the computing speed of the CPU. Main frequency of the CPU = external frequency × frequency multiplication coefficient. Many people think that the main frequency determines the CPU's running speed. This is not only one-sided; for servers, it is outright misleading. To date, no definite formula captures the numerical relationship between the main frequency and the actual computing speed, and even the two major processor manufacturers, Intel and AMD, disagree sharply on this point. From the development trend of Intel's products, we can see that Intel pays great attention to raising its main frequencies. Some people have compared a fast 1 GHz Transmeta processor and found its operating efficiency roughly equivalent to a 2 GHz Intel processor.

Therefore, the main frequency of the CPU has no direct relationship with the CPU's actual computing power. The main frequency only represents the oscillation speed of the digital pulse signal inside the CPU. In Intel's own product line we can also see examples like this: a 1 GHz Itanium chip performs almost as fast as a 2.66 GHz Xeon/Opteron, and a 1.5 GHz Itanium 2 is about as fast as a 4 GHz Xeon/Opteron. The CPU's computing speed also depends on other performance indicators, such as the CPU's pipeline.
Of course, the main frequency is related to actual computing speed; it is simply only one aspect of CPU performance and does not represent the CPU's overall performance.
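As a quick illustration of the formula above (the figures are hypothetical, not tied to any particular chip):

```python
# Main frequency = external frequency x frequency multiplication coefficient.
# The numbers below are hypothetical, chosen only to illustrate the arithmetic.
external_frequency_mhz = 200   # external (base) frequency
multiplier = 16                # frequency multiplication coefficient
main_frequency_mhz = external_frequency_mhz * multiplier
print(main_frequency_mhz)      # 3200 MHz, i.e. a 3.2 GHz main frequency
```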
 

2. External frequency

The external frequency is the CPU's reference frequency, also measured in MHz. The external frequency of the CPU determines the running speed of the entire motherboard. To put it bluntly, in desktop computers, what we call overclocking is overclocking the CPU's external frequency (under normal circumstances the CPU's multiplier is locked). I believe this is easy to understand. But for server CPUs, overclocking is absolutely not allowed. As mentioned earlier, the external frequency determines the running speed of the motherboard, and the two run synchronously. If a server CPU is overclocked by changing the external frequency, asynchronous operation will occur (many motherboards support asynchronous operation), and this will make the entire server system unstable.

In most computer systems, the external frequency is also the speed at which the memory and the motherboard run synchronously. In this sense, the CPU's external frequency connects directly to the memory, keeping the two in synchronous operation. External frequency and front-side bus (FSB) frequency are easily confused; the front-side bus section below explains the difference between the two.

3. Front-side bus (FSB) frequency

The front-side bus (FSB) frequency (i.e. bus frequency) directly affects the speed of data exchange between the CPU and the memory. There is a formula for it: data bandwidth = (bus frequency × data bus width) / 8, where the maximum data transmission bandwidth depends on the width and transmission frequency of the data transmitted simultaneously. For example, today's Xeon Nocona supports 64 bits with an 800 MHz front-side bus; according to the formula, its maximum data transmission bandwidth is 6.4 GB/s.

The difference between external frequency and front-side bus (FSB) frequency: the front-side bus speed refers to the speed of data transmission, while the external frequency is the speed at which the CPU and the motherboard run synchronously. That is to say, a 100 MHz external frequency specifically means the digital pulse signal oscillates 100 million times per second, while a 100 MHz front-side bus refers to the amount of data the CPU can accept per second: 100 MHz × 64 bit ÷ 8 = 800 MB/s.
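The two calculations above can be sketched in a few lines (a minimal illustration of the formula, not tied to any specific platform):

```python
def fsb_bandwidth_mb_per_s(bus_frequency_mhz, bus_width_bits):
    """Peak bandwidth = (bus frequency x data bus width) / 8, in MB/s."""
    return bus_frequency_mhz * bus_width_bits / 8

print(fsb_bandwidth_mb_per_s(100, 64))  # 800.0 MB/s  (the 100 MHz FSB example)
print(fsb_bandwidth_mb_per_s(800, 64))  # 6400.0 MB/s, i.e. 6.4 GB/s (Xeon Nocona example)
```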

In fact, the emergence of the "HyperTransport" architecture has changed the practical meaning of the front-side bus (FSB) frequency. We knew previously that the IA-32 architecture requires three important components: the memory controller hub (MCH), the I/O controller hub, and the PCI hub. For example, Intel's typical chipsets for dual Xeon processors, the Intel 7501 and Intel 7505, contain an MCH that provides the CPU with a 533 MHz front-side bus; with DDR memory, the front-side bus bandwidth can reach 4.3 GB/s. However, as processor performance continues to improve, this arrangement has caused many problems for the system architecture. The "HyperTransport" architecture not only solves those problems but also increases bus bandwidth more effectively. In the AMD Opteron processor, for example, the flexible HyperTransport I/O bus architecture lets the processor integrate the memory controller, so that the processor exchanges data with memory directly instead of going through the chipset over the system bus. As a result, starting with the AMD Opteron, the front-side bus (FSB) frequency no longer applies.

4. The bit and word length of the CPU

Bit: digital circuits and computer technology use binary, whose codes are only "0" and "1"; each "0" or "1" is one "bit" in the CPU.

Word length: in computer technology, the number of binary bits that the CPU can process at one time (simultaneously) is called the word length. A CPU that can process 8 bits of data at a time is therefore called an 8-bit CPU; likewise, a 32-bit CPU can process binary data with a word length of 32 bits in one operation. The difference between bytes and word length: since common English characters can be represented in 8 binary bits, 8 bits are usually called a byte. The word length is not fixed; different CPUs have different word lengths. An 8-bit CPU can process only one byte at a time, a 32-bit CPU can process 4 bytes at a time, and likewise a 64-bit CPU can process 8 bytes at a time.
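The relationship between word length and bytes per operation is simple division (8 bits = 1 byte):

```python
# Bytes a CPU can process at once = word length in bits / 8.
for word_length_bits in (8, 32, 64):
    bytes_at_once = word_length_bits // 8
    print(f"{word_length_bits}-bit CPU processes {bytes_at_once} byte(s) at a time")
```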
 

5. Frequency multiplication coefficient

The frequency multiplication coefficient is the ratio between the CPU's main frequency and its external frequency. At the same external frequency, the higher the multiplier, the higher the main frequency. In practice, however, a high multiplier at the same external frequency has limited value, because the data transfer speed between the CPU and the system is limited. A CPU that blindly pursues a high main frequency through a high multiplier runs into an obvious "bottleneck": the maximum speed at which the CPU can fetch data from the system cannot keep up with the CPU's computing speed. Generally, Intel CPUs are multiplier-locked except for engineering samples, while AMD CPUs were previously unlocked.

6. Cache

Cache size is also one of the important indicators of a CPU, and the structure and size of the cache have a great impact on CPU speed. The cache inside the CPU runs at an extremely high frequency, generally the same frequency as the processor itself, and is far more efficient than system memory or the hard disk. In actual work, the CPU often needs to read the same data block repeatedly, and increasing the cache capacity can greatly improve the internal hit rate of the CPU's data reads, avoiding trips to memory or the hard disk and thereby improving system performance. However, because of die area and cost constraints, caches are small.
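The standard way to quantify why the hit rate matters is the average memory access time formula (not stated in the original; the latencies below are illustrative figures, not measurements):

```python
def average_access_time(hit_rate, cache_ns, memory_ns):
    """Average access time = hit_rate * cache latency + miss_rate * memory latency."""
    return hit_rate * cache_ns + (1 - hit_rate) * memory_ns

# Illustrative latencies: 1 ns cache vs 100 ns main memory.
# Even small hit-rate gains cut the average access time dramatically.
for hit_rate in (0.80, 0.90, 0.99):
    print(hit_rate, round(average_access_time(hit_rate, 1, 100), 2))
```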

L1 Cache (level 1 cache) is the first layer of the CPU cache, divided into a data cache and an instruction cache. The capacity and structure of the built-in L1 cache have a great impact on CPU performance, but cache memory is built from static RAM with a complex structure, and since the CPU die area cannot be too large, the L1 cache capacity cannot be too large either. The L1 cache capacity of server CPUs is usually between 32 KB and 256 KB.
L2 Cache (level 2 cache) is the second layer of the CPU cache, divided into on-chip and off-chip. An on-chip L2 cache runs at the same speed as the main frequency, while an off-chip L2 cache runs at only half the main frequency. L2 cache capacity also affects CPU performance, and the principle is: the larger, the better. The largest L2 cache on desktop CPUs is now 512 KB, while the L2 cache of server and workstation CPUs runs from 256 KB to 1 MB, with some reaching 2 MB or 3 MB.
L3 Cache (level 3 cache) comes in two kinds: early ones were external, while current ones are built in. Its actual effect is that an L3 cache can further reduce memory latency and improve processor performance in large-data computations. Reducing memory latency and improving large-data computing power are very helpful for games. In the server field, adding an L3 cache still brings significant performance improvements: a configuration with a larger L3 cache uses physical memory more efficiently, so the slower disk I/O subsystem can handle more data requests. Processors with larger L3 caches also provide more efficient file system cache behavior and shorter message and processor queue lengths.

In fact, the earliest L3 cache was used on the K6-III processor released by AMD. At that time the L3 cache, limited by the manufacturing process, was not integrated into the chip but placed on the motherboard. An L3 cache that could only run synchronously with the system bus frequency was actually not much different from main memory. Later, an L3 cache was used in the Itanium processor that Intel launched for the server market. Next came the P4EE and Xeon MP. Intel also plans to launch an Itanium 2 processor with a 9 MB L3 cache and a dual-core Itanium 2 processor with a 24 MB L3 cache.

Basically, however, the L3 cache is not that important for improving processor performance. For example, the Xeon MP processor equipped with a 1 MB L3 cache is still no rival to the Opteron, which shows that an increase in front-side bus speed brings a more effective performance improvement than an increase in cache.

7. Extended instruction set

CPUs rely on instructions to compute and to control the system, and each CPU is designed with a series of instruction systems that cooperate with its hardware circuits. The strength of its instructions is an important indicator of a CPU, and the instruction set is one of the most effective tools for improving microprocessor efficiency. From the current mainstream architectures, instruction sets can be divided into two camps: complex instruction sets and reduced instruction sets. In terms of specific applications, Intel's MMX (MultiMedia eXtensions), SSE, SSE2 (Streaming SIMD Extensions 2), and SSE3, as well as AMD's 3DNow!, are all CPU extended instruction sets, which enhance the CPU's multimedia, graphics, and Internet processing capabilities respectively. We usually call a CPU's extended instruction sets "the CPU's instruction set". SSE3 is currently the smallest of these sets: MMX contains 57 instructions, SSE 50, SSE2 144, and SSE3 only 13. SSE3 is also currently the newest. Intel's Prescott processors already support the SSE3 instruction set, AMD will add SSE3 support to its future dual-core processors, and Transmeta processors will support it as well.

8. Core and I/O working voltage

Starting with the 586 CPUs, the CPU's operating voltage is divided into a core voltage and an I/O voltage. Usually the core voltage is less than or equal to the I/O voltage. The core voltage depends on the CPU's production process: generally, the smaller the process, the lower the core operating voltage. The I/O voltage is generally between 1.6 V and 5 V. Low voltages mitigate the problems of excessive power consumption and heat.
 

9. Manufacturing technology

The micron figure of a manufacturing process refers to the distance between circuit features on the IC. The trend in manufacturing processes is toward ever higher density: a higher-density IC design means that more complex functions fit within the same die size. The main processes now are 180 nm, 130 nm, and 90 nm, and a 65 nm process has recently been officially announced.

10. Instruction Set

(1) CISC instruction set

The CISC instruction set, also known as the complex instruction set, takes its name from the English abbreviation of Complex Instruction Set Computer. In a CISC microprocessor, the program's instructions are executed serially in order, and the operations within each instruction are also executed serially in order. The advantage of sequential execution is simple control, but the utilization of the computer's parts is low and execution is slow. CISC is, in practice, the x86 series (i.e. IA-32 architecture) of CPUs produced by Intel and the compatible CPUs from AMD, VIA, and others. Even the newly emerged x86-64 (also called AMD64) belongs to the CISC category.

To understand what an instruction set is, we need to start with today's X86-architecture CPUs. The X86 instruction set was developed by Intel specifically for its first 16-bit CPU (the i8086). The i8088 (a simplified i8086) in the world's first PC, launched by IBM in 1981, also used X86 instructions, and an X87 chip was added to the computer to improve floating-point processing. The X86 instruction set and the X87 instruction set later became collectively known as the X86 instruction set.

As CPU technology developed, Intel successively introduced the newer i80386 and i80486, then the PII Xeon, PIII Xeon, and Pentium III, and finally today's Pentium 4 series and Xeon (excluding Xeon Nocona). Yet to ensure that computers could keep running the applications developed in the past, protecting and inheriting rich software resources, all CPUs produced by Intel continued to use the X86 instruction set, so its CPUs still belong to the X86 series. Since the Intel X86 series and its compatible CPUs (such as the AMD Athlon MP) all use the X86 instruction set, today's huge lineup of X86-series and compatible CPUs was formed. x86 CPUs currently come mainly in two types: Intel server CPUs and AMD server CPUs.

 

(2) RISC instruction set

RISC is the abbreviation of "Reduced Instruction Set Computing". It was developed on the basis of the CISC instruction system. Tests on CISC machines showed that the usage frequencies of the various instructions differed enormously: the most commonly used instructions were relatively simple ones accounting for only 20% of the total instruction count, yet they made up 80% of the occurrences in programs. A complex instruction system inevitably increases the complexity of the microprocessor, raising development time and cost, and complex instructions require complex operations, which inevitably slow down the computer. For these reasons, RISC CPUs were born in the 1980s. Compared with CISC CPUs, RISC CPUs not only simplified the instruction system but also adopted superscalar and superpipeline structures, greatly increasing parallel processing capability. The RISC instruction set is the development direction of high-performance CPUs. It stands opposed to the traditional CISC (complex instruction set). By comparison, RISC has a unified instruction format, fewer instruction types, and fewer addressing modes than the complex instruction set, and of course much higher processing speed. Currently, CPUs with this instruction system are generally used in mid-to-high-end servers; high-end server CPUs in particular all use the RISC instruction system. The RISC instruction system is better suited to UNIX, the operating system of high-end servers, and Linux is also a UNIX-like operating system. RISC CPUs are incompatible with Intel and AMD CPUs in both software and hardware.

 

At present, CPUs that use RISC instructions in medium and high-end servers mainly include the following categories: PowerPC processor, SPARC processor, PA-RISC processor, MIPS processor, and Alpha processor.

 

(3) IA-64

There is much debate about whether EPIC (Explicitly Parallel Instruction Computing) is the successor to the RISC and CISC systems. Judged on its own, the EPIC system looks more like an important step for Intel processors toward the RISC camp. Theoretically speaking, on the same host configuration, a CPU designed under the EPIC system processes Windows application software much better than Unix-based application software.

Intel's server CPU using EPIC technology is the Itanium (codenamed Merced during development). It is a 64-bit processor and the first in the IA-64 series. Microsoft also developed an operating system codenamed Win64 to support it in software. After long relying on the X86 instruction set, Intel turned to a more advanced 64-bit microprocessor. Intel did this because it wanted to get rid of the huge x86 architecture and introduce an energetic and powerful instruction set, and so the IA-64 architecture using the EPIC instruction set was born. IA-64 is a great advance over x86 in many respects: it breaks through many limitations of the traditional IA-32 architecture and achieves breakthrough improvements in data processing capability, system stability, security, availability, and manageability.
  

The biggest drawback of IA-64 microprocessors is their lack of compatibility with x86. So that the IA-64 processors could run software from both worlds, Intel introduced an x86-to-IA-64 decoder on the IA-64 processors (Itanium, Itanium 2, …) to translate x86 instructions into IA-64 instructions. This decoder is not the most efficient one, nor is it the best way to run x86 code (the best way is to run x86 code directly on an x86 processor), so the Itanium and Itanium 2 perform very poorly when running x86 applications. This became the root cause of the birth of x86-64.

(4)X86-64 (AMD64 / EM64T) 

AMD designed x86-64 to handle 64-bit integer operations while remaining compatible with the X86-32 architecture. It supports 64-bit logical addressing, with an option to fall back to 32-bit addressing; data operation instructions default to 32-bit and 8-bit, with options for 64-bit and 16-bit; and it supports the general-purpose registers. In a 32-bit operation, the result is extended to the full 64 bits. Instructions thus distinguish between "direct execution" and "converted execution"; the instruction field is 8 or 32 bits, which keeps fields from growing too long.

The emergence of x86-64 (also called AMD64) was no accident. The 32-bit addressing space of x86 processors is limited to 4 GB of memory, and IA-64 processors are not compatible with x86. AMD, fully considering customer needs, strengthened the x86 instruction set so that it could support a 64-bit operation mode as well; AMD therefore calls the architecture x86-64. Technically, to perform 64-bit operations in the x86-64 architecture, AMD introduced the new general-purpose registers R8-R15 as an extension of the original X86 processor registers; these registers are not fully used in a 32-bit environment. The original registers such as EAX and EBX were also expanded from 32 bits to 64 bits, and 8 new registers were added to the SSE unit to provide SSE2 support. The increase in the number of registers brings a performance improvement. At the same time, to support both 32-bit and 64-bit code and registers, the x86-64 architecture allows the processor to work in two modes: Long Mode and Legacy Mode, with Long Mode divided into two sub-modes (64-bit mode and Compatibility Mode). The standard was first introduced into AMD's server processors with the Opteron.
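The rule that a 32-bit operation's result is expanded to a complete 64 bits can be sketched like this (a toy model of register write-back for illustration, not real CPU code):

```python
# Toy model: writing a 32-bit result into a 64-bit register zero-extends it,
# i.e. the upper 32 bits of the register are cleared (as when EAX is written
# inside RAX on x86-64).
def write_32bit_result(result32):
    return result32 & 0xFFFFFFFF  # only the low 32 bits survive; upper half is zero

reg64 = 0xDEADBEEF12345678        # old 64-bit register contents
reg64 = write_32bit_result(0xCAFEBABE)
print(hex(reg64))                 # 0xcafebabe -- upper 32 bits are zero
```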
 

This year, Intel also launched its 64-bit EM64T technology. Before being officially named EM64T it was called IA-32E, the name of Intel's 64-bit extension technology, used to distinguish it from the X86 instruction set. Intel's EM64T supports a 64-bit sub-mode similar to AMD's x86-64 technology: it adopts 64-bit linear flat addressing, adds 8 new general-purpose registers (GPRs), and adds 8 registers to support SSE instructions. Like AMD's, Intel's 64-bit technology is compatible with IA-32, and IA-32E is used only when running a 64-bit operating system. IA-32E consists of two sub-modes, a 64-bit sub-mode and a 32-bit sub-mode, backward compatible like AMD64. Intel's EM64T is fully compatible with AMD's x86-64 technology. Some 64-bit features have now been added to the Nocona processor, and Intel's Pentium 4E processor also supports 64-bit technology.
  

It should be said that both are 64-bit microprocessor architectures compatible with the x86 instruction set, but there are still some differences between EM64T and AMD64: the NX bit found in AMD64 processors is not provided in Intel's processors.
 

11. Superpipeline and superscalar

Before explaining the superpipeline and the superscalar, first understand the pipeline. Intel first used the pipeline in its 486 chip. The pipeline works like an assembly line in industrial production. In the CPU, an instruction-processing pipeline consists of 5-6 circuit units with different functions; an X86 instruction is split into 5-6 steps that these units execute in turn, so that one instruction can complete in every CPU clock cycle, improving the CPU's computing speed. Each integer pipeline of the classic Pentium is divided into four stages — instruction prefetch, decode, execute, and write-back — while its floating-point pipeline is divided into eight stages.

Superscalar means executing multiple instructions simultaneously through multiple built-in pipelines; its essence is trading space for time. The superpipeline completes one or even several operations per machine cycle by refining the pipeline and raising the main frequency; its essence is trading time for space. For example, the Pentium 4's pipeline is 20 stages long. The more stages a pipeline is divided into, the faster each step completes, allowing the CPU to reach higher operating frequencies. However, a very long pipeline also brings side effects: a CPU with a high main frequency may well have a lower actual computing speed. This happened to Intel's Pentium 4: although its main frequency could reach 1.4 GHz, its computing performance fell far short of AMD's 1.2 GHz Athlon and even the Pentium III.
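An idealized pipeline's timing is easy to show numerically (this ignores stalls and branch mispredictions, which is exactly what hurts very long pipelines in practice; the figures are illustrative):

```python
def pipeline_cycles(stages, instructions):
    """Ideal pipeline: 'stages' cycles to fill, then one instruction completes per cycle."""
    return stages + instructions - 1

# 100 instructions on a 4-stage classic-Pentium-style integer pipeline
# versus a 20-stage Pentium 4-style pipeline:
print(pipeline_cycles(4, 100))    # 103 cycles
print(pipeline_cycles(20, 100))   # 119 cycles -- longer fill time, but a higher clock
```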

12. Packaging form

CPU packaging is a protective measure in which specific materials encase the CPU chip or CPU module to prevent damage; generally, a CPU must be packaged before it is delivered to users. The packaging method depends on the CPU's installation form and device integration design. Broadly classified, CPUs installed in Socket sockets usually use PGA (Pin Grid Array) packaging, while CPUs installed in Slot x slots use SEC (Single Edge Contact) cartridge packaging. There are now also packaging technologies such as PLGA (Plastic Land Grid Array) and OLGA (Organic Land Grid Array). With market competition growing ever fiercer, the main direction of CPU packaging technology today is cost saving.

13. Multithreading

Simultaneous multithreading, abbreviated SMT. SMT replicates the architectural state on the processor, letting multiple threads execute concurrently on the same processor and share its execution resources. This maximizes wide-issue, out-of-order superscalar processing, improves the utilization of the processor's computing units, and mitigates memory access latency caused by data dependencies or cache misses. When only one thread is available, an SMT processor is almost the same as a traditional wide-issue superscalar processor. What makes SMT most attractive is that it requires only a small change to the processor core design and can significantly improve performance at almost no extra cost. Multithreading technology prepares more pending data for the high-speed computing core, reducing its idle time. This is clearly attractive even for low-end desktop systems. Starting with the 3.06 GHz Pentium 4, all Intel processors will support SMT technology.
  

14. Multi-core

Multi-core also refers to the single-chip multiprocessor (CMP for short). CMP was proposed by Stanford University in the United States; its idea is to integrate the SMP (symmetric multiprocessing) of large-scale parallel processors onto a single chip, with each processor core executing different processes in parallel. Compared with CMP, the flexibility of the SMT processor structure is more prominent. However, once semiconductor processes reached 0.18 micron, wire delay exceeded gate delay, requiring microprocessor designs to be divided into many smaller-scale, better-localized basic unit structures. Because the CMP structure is already divided into multiple processor cores, each core is relatively simple and easier to optimize, so CMP has better development prospects. Currently, IBM's Power 4 chip and Sun's MAJC5200 chip both use the CMP structure. Multi-core processors can share caches within the processor, improving cache utilization while simplifying the complexity of multiprocessor system design.

In the second half of 2005, new processors from Intel and AMD will also adopt the CMP structure. The new Itanium processor, codenamed Montecito, is a dual-core design with at least 18 MB of on-chip cache, manufactured on a 90 nm process; its design can certainly be regarded as a challenge for today's chip industry. Each of its cores has independent L1, L2, and L3 caches, and the chip contains approximately 1 billion transistors.

15. SMP

SMP (Symmetric Multi-Processing), the symmetric multiprocessing structure, refers to a group of processors (multiple CPUs) assembled in one computer, with the memory subsystem and bus structure shared among the CPUs. With the support of this technology, a server system can run multiple processors simultaneously and share memory and other host resources. Dual Xeon — what we call two-way — is the most common type of symmetric processor system (Xeon MP can support four-way; AMD Opteron supports 1- to 8-way). A few systems are 16-way. Generally speaking, however, SMP structures scale poorly, and it is difficult to go beyond 100 processors; the usual configurations are 8 to 16 CPUs, which is enough for most users. SMP is most common in high-performance server and workstation motherboard architectures, such as UNIX servers that can support systems with up to 256 CPUs.

The necessary conditions for building an SMP system are: hardware that supports SMP, including the motherboard and CPUs; a system platform that supports SMP; and application software that supports SMP.

For an SMP system to perform efficiently, the operating system must support SMP — for example WINNT, LINUX, UNIX, and other 32-bit operating systems — that is, it must be able to multitask and multithread. Multitasking means the operating system can let different CPUs complete different tasks at the same time; multithreading means the operating system can let different CPUs complete the same task in parallel.
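The multitasking idea — independent work items scheduled onto whichever processor is free — can be sketched with a thread pool (Python threads stand in for physical CPUs here; this is an illustration, not server code):

```python
from concurrent.futures import ThreadPoolExecutor

def task(n):
    return sum(range(n))  # stand-in workload

# max_workers=2 plays the role of a "two-way" (dual-processor) system:
# the pool hands independent tasks to whichever worker is free.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(task, [10, 100, 1000]))

print(results)  # [45, 4950, 499500]
```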

 

Building an SMP system places high demands on the chosen CPUs. First, each CPU must have a built-in APIC (Advanced Programmable Interrupt Controller) unit — the core of the Intel multiprocessing specification is the use of APICs. Second, the CPUs must be the same product model, with the same core type and the same operating frequency. Finally, keep the product serial numbers as close as possible, because when CPUs from two production batches run as a dual-processor pair, one CPU may be overloaded while the other carries very little load; the system cannot reach maximum performance, and worse, it may crash.

 

16. NUMA technology

NUMA (Non-Uniform Memory Access) is a distributed shared-memory technology. A NUMA system is composed of several independent nodes connected through a high-speed dedicated network, where each node can be a single CPU or an SMP system. In NUMA, maintaining cache consistency has many possible solutions and requires support from the operating system and special software. Figure 2 shows an example of Sequent's NUMA system: three SMP modules connected by a high-speed dedicated network form one node, and each node can have 12 CPUs. A system like Sequent's can reach up to 64 or even 256 CPUs. This is clearly SMP extended with NUMA technology — a combination of the two.

17. Out-of-order execution technology

Out-of-order execution refers to the technology whereby the CPU sends multiple instructions to their corresponding circuit units not in the order specified by the program: after analyzing the status of each circuit unit and whether each instruction can be executed early, instructions that can run ahead of time are sent immediately to the corresponding circuit units for execution. During this period, instructions are not executed in the prescribed order; a reorder unit then rearranges the results from each execution unit back into instruction order. The purpose of out-of-order execution is to keep the CPU's internal circuits running at full load and thereby speed up the CPU's execution of programs. Branch technology: (branch) instructions need to wait for results before operating. Generally, an unconditional branch simply executes in instruction order, while a conditional branch must wait for the processed result before deciding whether to proceed in the original order.
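The core idea — issue any instruction whose inputs are ready, regardless of program order — can be modelled in a few lines (a toy simulation for illustration, not a description of real hardware; instruction names and latencies are invented):

```python
# Toy out-of-order issue: each cycle, issue every instruction whose inputs are done.
# i1 must wait for the slow load i0, but the independent i2 issues immediately.
latency = {"i0": 3, "i1": 1, "i2": 1, "i3": 1}          # i0 is a slow memory load
deps    = {"i0": set(), "i1": {"i0"}, "i2": set(), "i3": {"i1", "i2"}}

finish, cycle, schedule = {}, 0, []
while len(finish) < len(deps):
    for instr in deps:
        if instr not in finish and all(finish.get(d, 10**9) <= cycle for d in deps[instr]):
            finish[instr] = cycle + latency[instr]       # completes after its latency
            schedule.append((cycle, instr))
    cycle += 1

print(schedule)  # [(0, 'i0'), (0, 'i2'), (3, 'i1'), (4, 'i3')] -- i2 ran ahead of i1
```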

18. Memory controller inside the CPU

Many applications have more complex read patterns (almost random, especially when cache hits are unpredictable) and cannot use bandwidth effectively. Business processing software is typical: even with CPU features such as out-of-order execution, it is limited by memory latency. The CPU must wait until the data required for an operation has been loaded before it can execute the instruction (whether the data comes from the CPU cache or from main memory). Memory latency on current low-end systems is about 120-150 ns, while CPU speeds exceed 3 GHz; a single memory request can waste 200-300 CPU cycles. Even with a 99% cache hit rate, the CPU may spend 50% of its time waiting for memory requests to finish, precisely because of memory latency.
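The cycle cost of a memory stall is just latency times clock rate (the 2 GHz figure below is chosen for round numbers and is illustrative, not a measurement):

```python
def stall_cycles(latency_ns, cpu_freq_ghz):
    """Cycles lost to one memory request: latency (ns) x cycles per ns (= GHz)."""
    return latency_ns * cpu_freq_ghz

# A 120-150 ns access on a multi-GHz CPU costs hundreds of cycles:
print(stall_cycles(120, 2.0))   # 240.0 cycles
print(stall_cycles(150, 2.0))   # 300.0 cycles
```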

You can see that Opteron's integrated memory controller has much lower latency than a chipset-based dual-channel DDR memory controller. Intel also plans to integrate the memory controller inside the processor, which makes the Northbridge chip less important. Changing the way the processor accesses main memory helps increase bandwidth, reduce memory latency, and improve processor performance.