
A complete list of considerations for large-scale website architecture design

1. HTML staticization

As everyone knows, purely static HTML pages are the most efficient to serve and consume the fewest resources, so we try to use static pages on our sites wherever possible. This simplest method is in fact the most effective one. However, for sites with a large amount of frequently updated content, we cannot generate every page by hand, which is why the familiar content management system (CMS) emerged. The news channels of the portal sites we visit every day, and often their other channels as well, are managed and published through such a system. A CMS can, at a minimum, handle content entry and automatically generate static pages, and usually also provides channel management, permission management, automatic content gathering, and so on. For a large website, an efficient, manageable CMS is essential.
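As a rough sketch of what the publish step of such a system does, the code below renders an article into a static HTML file in the web server's document root, so that later requests never touch the database. The Article shape, the inline "template", and the output path are all hypothetical simplifications, not the interface of any particular CMS.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StaticPublisher {

    // Hypothetical article record; a real CMS would load this from its database.
    record Article(long id, String title, String body) {}

    // Tiny stand-in "template"; a real system would use a template engine.
    private static String render(Article a) {
        return "<html><head><title>" + a.title() + "</title></head>"
             + "<body><h1>" + a.title() + "</h1><div>" + a.body() + "</div></body></html>";
    }

    // Called once when an editor publishes or updates an article: the page is
    // written to the document root, so subsequent requests are served as plain files.
    public static void publish(Article a, Path docRoot) throws IOException {
        Path out = docRoot.resolve("news/" + a.id() + ".html");
        Files.createDirectories(out.getParent());
        Files.writeString(out, render(a), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        publish(new Article(42, "Hello", "Static pages are cheap to serve."),
                Path.of("/var/www/html"));
    }
}
```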


Beyond portals and information-publishing sites, staticizing as much as possible is also a necessary means of improving performance for community sites with heavy interaction: rendering posts and articles to static pages in real time and re-rendering them whenever they are updated is a widely used strategy. Mop's hodgepodge uses it, as does NetEase Community, among others.

At the same time, HTML staticization is also a caching strategy in its own right. For parts of the system that are queried from the database very frequently but updated rarely, consider staticizing them, for example a forum's shared configuration. Mainstream forum software lets administrators manage this information in the back end and stores it in the database, where it is read heavily by the front-end code even though it changes very rarely. Regenerating the static version whenever this content is updated avoids a large number of database requests.

2. Separate image servers

As we all know, for web servers, whether Apache, IIS, or another container, images consume the most resources, so it is necessary to separate images from pages. This is a strategy that essentially every large website adopts: they run independent image servers, often many of them. Such an architecture reduces the pressure on the servers handling page requests and ensures the system will not go down because of an image problem. The application servers and image servers can also be tuned differently; for example, the Apache instances serving images can be configured with as few ContentTypes and as few LoadModule directives as possible, keeping system overhead low and execution efficiency high.
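One small application-side piece of this separation is having pages reference images on the dedicated image hosts rather than on the application servers, so image traffic never reaches the dynamic web servers. The sketch below shows one hedged way to do that; the host names and the deterministic mapping of paths to hosts are illustrative choices, not a fixed convention.

```java
public class ImageUrls {

    // Hypothetical dedicated image hosts, served by a stripped-down web server.
    private static final String[] IMAGE_HOSTS = {
        "http://img1.example.com", "http://img2.example.com"
    };

    // Pick a host deterministically from the path so the same image always
    // maps to the same server (and to the same browser/proxy cache entry).
    public static String imageUrl(String relativePath) {
        int idx = Math.floorMod(relativePath.hashCode(), IMAGE_HOSTS.length);
        return IMAGE_HOSTS[idx] + "/" + relativePath;
    }

    public static void main(String[] args) {
        System.out.println(imageUrl("avatars/1001.jpg"));
        System.out.println(imageUrl("posts/2024/banner.png"));
    }
}
```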

3. Database clusters and database/table hashing

Large websites have complex applications, and those applications inevitably use databases. Faced with a large volume of traffic, the database quickly becomes the bottleneck: a single database soon cannot keep up with the application, so we need database clusters or database/table hashing.

For database clustering, many databases ship their own solutions; Oracle, Sybase, and others have good ones, and the Master/Slave replication commonly used with MySQL serves a similar purpose. Whatever DB you use, refer to the corresponding solution to implement it.
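To make the Master/Slave idea concrete, here is a minimal read/write-splitting sketch, assuming MySQL replication is already set up and the MySQL JDBC driver is on the classpath; the host names, database name, and credentials are placeholders. Writes always go to the master, while reads are spread across the slaves and tolerate a little replication lag.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class ReplicatedDb {

    private static final String MASTER_URL =
        "jdbc:mysql://db-master.example.com:3306/app";
    private static final List<String> SLAVE_URLS = List.of(
        "jdbc:mysql://db-slave1.example.com:3306/app",
        "jdbc:mysql://db-slave2.example.com:3306/app");

    private final String user;
    private final String password;

    public ReplicatedDb(String user, String password) {
        this.user = user;
        this.password = password;
    }

    // INSERT/UPDATE/DELETE must see the authoritative copy, so they use the master.
    public Connection writeConnection() throws SQLException {
        return DriverManager.getConnection(MASTER_URL, user, password);
    }

    // SELECTs can go to any slave; a random pick spreads the read load.
    public Connection readConnection() throws SQLException {
        String url = SLAVE_URLS.get(
            ThreadLocalRandom.current().nextInt(SLAVE_URLS.size()));
        return DriverManager.getConnection(url, user, password);
    }
}
```

In application code, query methods would obtain their connection from readConnection() and mutation methods from writeConnection(); a connection pool would normally sit in front of both.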

Because the clustering described above is constrained in architecture, cost, and scalability by the particular DB you use, we also need to improve the system architecture from the application side, and database/table hashing is the most common and effective approach. We split the database along the application's business or functional modules, so that different modules map to different databases or tables, and then hash a given page or feature further according to some strategy, for example hashing the user table by user ID. In this way the system's performance can be improved at low cost, and it scales well. Sohu's forum uses this architecture: the forum's users, settings, posts, and other data are separated into their own databases, and the post and user databases/tables are then hashed by board and by ID. In the end only a simple change in a configuration file is needed to add a cheap database and grow the system's capacity at any time.
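Here is a minimal sketch of hashing a user table by user ID in the spirit described above. The number of databases and tables and the naming scheme are hypothetical; a real system would read them from the configuration file mentioned here, and non-negative IDs are assumed.

```java
public class UserShardRouter {

    private static final int DB_COUNT = 4;       // user_db_0 .. user_db_3 (hypothetical)
    private static final int TABLES_PER_DB = 16; // user_0 .. user_15 in each database

    public record Location(String database, String table) {}

    // Map a user ID to the database and table that hold that user's row.
    public static Location locate(long userId) {
        int db = (int) (userId % DB_COUNT);
        int table = (int) ((userId / DB_COUNT) % TABLES_PER_DB);
        return new Location("user_db_" + db, "user_" + table);
    }

    public static void main(String[] args) {
        // Prints which database and table this user lands in.
        System.out.println(locate(123456789L));
    }
}
```

Adding capacity then means changing these counts (or, better, the strategy behind them) and the list of databases in configuration; the rest of the code keeps calling the same routing function.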

4. Caching

Anyone who has worked with technology has encountered caching, and it is used in many places. Caching is just as important in website architecture and website development. Here we first cover the two most basic kinds; advanced and distributed caches are described later.
For architecture-level caching, anyone familiar with Apache knows that it ships its own cache modules, and an external Squid can also be placed in front of it for caching. Both approaches can effectively improve Apache's response to requests.
For application development, the Memory Cache (memcached) available on Linux is a commonly used caching layer that works well in web development. In Java, for example, a memcached client can be called to cache and share data between processes, and some large communities use this architecture. Beyond that, practically every web language has its own cache modules and methods: PHP has PEAR's Cache package, Java has even more options, and .NET, which I am less familiar with, surely has its own as well.
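As a rough illustration of this pattern, here is a minimal sketch using the spymemcached Java client (one of several memcached clients): a hot but rarely changing value is read from the cache and only loaded from the database on a miss. The server address, key name, and 600-second expiry are illustrative assumptions, and the spymemcached jar is assumed to be on the classpath.

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class ForumSettingsCache {

    public static void main(String[] args) throws Exception {
        MemcachedClient cache =
            new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));

        String key = "forum:settings";
        String settings = (String) cache.get(key);   // null on a cache miss
        if (settings == null) {
            settings = loadSettingsFromDatabase();   // the expensive path
            cache.set(key, 600, settings);           // keep it for 600 seconds
        }

        System.out.println(settings);
        cache.shutdown();
    }

    // Stand-in for the real database query.
    private static String loadSettingsFromDatabase() {
        return "{\"boardsPerPage\":20,\"allowGuestPost\":false}";
    }
}
```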

5. Mirroring

Mirroring is a method large websites often use to improve performance and data security. Mirroring smooths out the differences in access speed caused by different network providers and regions, for example the difference between ChinaNet and CERNET (the education network), which has prompted many sites to build mirror sites inside the education network and update their data on a schedule or in real time. We will not go deeply into the technical details of mirroring here; there are many mature, ready-made solutions and products to choose from, as well as inexpensive software approaches such as rsync and other tools on Linux.

6. Load balancing

Load balancing is the ultimate answer for large websites facing high-load access and large numbers of concurrent requests.
Load balancing technology has been developing for many years, and there are many professional service providers and products to choose from. I have personally worked with a few solutions, and two architectures are offered below for your reference.
Hardware layer-4 switching
Layer-4 switching uses the header information of layer-3 and layer-4 packets to identify application traffic flows and dispatch each flow to a suitable application server for processing. The layer-4 switching function acts like a virtual IP that points at physical servers, and it can carry whatever protocols the service requires, such as HTTP, FTP, NFS, or Telnet; these services then rely on complex load-balancing algorithms across the physical servers. In the IP world, the service type is determined by the destination TCP or UDP port, and a flow in layer-4 switching is identified by the combination of source and destination IP addresses and TCP/UDP ports.
Among hardware layer-4 switching products there are some well-known choices, such as Alteon and F5. These products are expensive, but they are worth the money: they deliver excellent performance and very flexible management. Yahoo China got by with only three or four Alteons in its early days.

Software layer-4 switching

Once you understand the principles of hardware layer-4 switches, software layer-4 switches based on the same OSI model follow naturally. The principle is identical, but the performance is somewhat lower, though it can still comfortably handle a fair amount of load. Some would say the software approach is actually more flexible, and its effectiveness depends entirely on how well you know the configuration.
On Linux we can use the widely deployed LVS (Linux Virtual Server) for software layer-4 switching. It provides real-time, heartbeat-based failover to improve the robustness of the system, along with flexible virtual IP (VIP) configuration and management, and it can serve multiple applications at the same time, which is essential for a distributed system.
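LVS itself runs inside the kernel and is configured with the ipvsadm tool rather than written as application code, so the sketch below is only an illustration of the core dispatching idea behind a layer-4 virtual server: connections arriving at one virtual address are spread across a pool of real servers. The backend addresses are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinDispatcher {

    private final List<String> realServers;
    private final AtomicInteger next = new AtomicInteger();

    public RoundRobinDispatcher(List<String> realServers) {
        this.realServers = List.copyOf(realServers);
    }

    // Each new connection to the virtual address is handed to the next real server.
    public String pick() {
        int i = Math.floorMod(next.getAndIncrement(), realServers.size());
        return realServers.get(i);
    }

    public static void main(String[] args) {
        RoundRobinDispatcher vip = new RoundRobinDispatcher(
            List.of("10.0.0.11:80", "10.0.0.12:80", "10.0.0.13:80"));
        for (int c = 0; c < 5; c++) {
            System.out.println("connection " + c + " -> " + vip.pick());
        }
    }
}
```

LVS also offers weighted and least-connection scheduling, which in a real deployment would replace the simple round-robin counter shown here.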

A typical way to use load balancing is to build a Squid cluster behind a layer-4 switch, hardware or software. Many large websites, including search engines, adopt this idea; it is low in cost, high in performance, and scales well, making it very easy to add or remove nodes at any time. I plan to set aside a separate piece to discuss this architecture in detail.

For a large website, every method mentioned above may well be used at the same time. I have introduced them only briefly here; many details of the actual implementation still have to be learned through hands-on experience. Sometimes a single small Squid or Apache parameter can have a large impact on system performance. I hope everyone will discuss these topics together; consider this a brick thrown out to attract jade.