Updated on 2025-04-09

How to improve the network-layer reliability of metropolitan area network routers

The rapid development of broadband services has profoundly changed the traditional telecommunications and IT industries, and the convergence of multiple services and multiple networks has become an irreversible trend. The broadband metropolitan area network (MAN), as the main network entity within a metropolitan area, will become the bearer platform for 3G, NGN, and other emerging value-added services. Real-time voice and video applications carried over 3G and NGN require the MAN to provide quality-of-service guarantees and carrier-grade reliability comparable to traditional telecommunications technology. At the same time, fierce competition is pushing operators to offer customers service guarantees in the form of SLAs, for which network reliability is the primary and most important indicator. By improving network reliability, operators can offer differentiated services and thereby establish and consolidate their brand image.

The reliability of metropolitan area network routers is reflected in two aspects: device-layer reliability and network-layer reliability.

Network-layer reliability is an important part of the reliability guarantee of metropolitan area network routers. Because traditional routing protocols converge relatively slowly (on the order of seconds for IGPs and minutes for BGP), they cannot meet the needs of carrying real-time services. Network-layer reliability is also an area in which newer metropolitan area network routers are developing actively.

Currently, the emerging network-layer reliability technologies mainly include fast convergence of IP routing, end-to-end LSP backup, MPLS fast rerouting, graceful restart, and RPR IPS.

Fast convergence of IP routing

IP dynamic routing is the most basic network-layer reliability mechanism and is inherent to routed IP networks. The dynamic routing protocol computes the network-layer forwarding paths. When a link or node fails and the original forwarding path is interrupted, the routing protocol dynamically recalculates the path. Different routing protocols use different mechanisms and their response times vary, but on average recovery takes on the order of seconds. That recovery time is acceptable for traditional IP services, but a carrier-grade IP network that carries multiple services, including real-time services, requires recovery within milliseconds. Traditional IP dynamic routing falls well short of this requirement.
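
The recalculation described above can be illustrated with a minimal sketch of a link-state SPF (Dijkstra) run before and after a link failure. The four-router topology, the link costs, and the helper names `shortest_paths` and `fail_link` are hypothetical and chosen only for illustration; a real IGP also floods link-state updates and installs the result in the forwarding table.

```python
import heapq


def shortest_paths(graph, source):
    """Dijkstra SPF: return {node: (cost, first_hop)} as seen from source."""
    result = {}
    heap = [(0, source, None)]
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if node in result:
            continue  # already settled with a lower or equal cost
        result[node] = (cost, first_hop)
        for neigh, weight in graph.get(node, {}).items():
            if neigh not in result:
                # The first hop is fixed when we leave the source router.
                hop = neigh if node == source else first_hop
                heapq.heappush(heap, (cost + weight, neigh, hop))
    return result


def fail_link(graph, a, b):
    """Return a copy of the topology with the a-b link removed in both directions."""
    return {
        n: {m: w for m, w in nbrs.items() if {n, m} != {a, b}}
        for n, nbrs in graph.items()
    }


# Hypothetical topology: router name -> {neighbor: link cost}
topology = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "D": 1},
    "C": {"A": 5, "D": 1},
    "D": {"B": 1, "C": 1},
}

before = shortest_paths(topology, "A")
after = shortest_paths(fail_link(topology, "B", "D"), "A")
print("route to D before failure:", before["D"])
print("route to D after failure: ", after["D"])
```

In this sketch router A reaches D through B until the B-D link fails, after which the recomputed path goes through C at a higher cost. The time a real network needs to reach that second state is what the rest of this section is concerned with.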

Traditional routing protocols can be improved to shorten their failure response time, mainly by speeding up convergence. Convergence can be accelerated in several areas: link failure detection, route recomputation, and routing information updates. Sending Hello messages between neighbors more frequently, speeding up the SPF calculation, and giving routing update messages high priority allow the routing protocol to detect and handle faults quickly and to update routes quickly and accurately. An IGP optimized in this way can converge in less than 1 s.
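
As a rough illustration of where the time goes, the sketch below adds up a convergence budget from its parts: failure detection (Hello interval multiplied by the number of missed Hellos), flooding of the update, the SPF delay and run time, and the forwarding-table update. The class name, the field names, and every number are assumptions made for the example, not measurements or vendor defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvergenceBudget:
    """Illustrative convergence-time model; all values in milliseconds."""
    hello_interval_ms: int   # time between Hello messages
    dead_multiplier: int     # missed Hellos before the neighbor is declared down
    flood_ms: int            # time to propagate the routing update
    spf_delay_ms: int        # wait before the SPF run starts
    spf_run_ms: int          # SPF computation time
    fib_update_ms: int       # time to install the new routes

    def detection_ms(self) -> int:
        return self.hello_interval_ms * self.dead_multiplier

    def total_ms(self) -> int:
        return (self.detection_ms() + self.flood_ms + self.spf_delay_ms
                + self.spf_run_ms + self.fib_update_ms)


# Hypothetical untuned and tuned settings, for comparison only.
untuned = ConvergenceBudget(hello_interval_ms=10_000, dead_multiplier=4,
                            flood_ms=100, spf_delay_ms=5_000,
                            spf_run_ms=50, fib_update_ms=200)
tuned = ConvergenceBudget(hello_interval_ms=100, dead_multiplier=3,
                          flood_ms=20, spf_delay_ms=50,
                          spf_run_ms=30, fib_update_ms=200)

for name, budget in (("untuned", untuned), ("tuned", tuned)):
    print(f"{name}: detection {budget.detection_ms()} ms, "
          f"total {budget.total_ms()} ms")
```

Under these assumed numbers, failure detection dominates the untuned budget, which is why shortening the Hello interval has the largest effect.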

Another way to speed up convergence is to plan the network hierarchically using IGP and EGP. The IGP carries the routes to devices within the domain, the EGP (BGP4) carries external routes, and the two sets of routes are kept isolated and are not redistributed into each other. This division of labor between IGP and BGP forms a hierarchical routing structure in which intra-domain and inter-domain routing converge independently without affecting each other, so each can converge as fast as possible.

LSP protection switching

Protection switching is a term used by ITU-T, and protection switching technology is critical to improving the availability and stability of MPLS networks. Protection switching generally pre-computes the route of the protection LSP and pre-allocates resources for it, so that network resources are quickly available again after the working LSP fails or is interrupted.

Current technology supports protection switching only for point-to-point LSPs. Protection can be provided in two ways: 1+1 protection and 1:1 protection.

1+1 protection uses a dedicated backup LSP to protect the primary LSP. At the ingress LSR, the primary and backup LSPs are bridged together: traffic on the primary LSP is copied to the backup LSP, and both copies are transmitted to the egress LSR at the same time. The egress LSR selects whether to receive traffic from the primary or the backup LSP according to the fault indication parameters.

1:1 protection also uses a dedicated backup LSP to protect the primary LSP, but the two LSPs do not carry the same traffic at the same time. While the primary LSP is working normally, the backup LSP can carry other traffic. The protection switching decision is made at the ingress LSR.
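
The difference between the two modes can be shown with a minimal sketch. The function names and the boolean health flags are hypothetical; they stand in for the fault indication that a real implementation would receive, and the lists stand in for traffic.

```python
def one_plus_one(packets, primary_ok, backup_ok):
    """1+1: the ingress bridges traffic onto both LSPs; the egress selects one copy."""
    primary_copy = list(packets) if primary_ok else None
    backup_copy = list(packets) if backup_ok else None
    # Egress selector: prefer the primary LSP, fall back to the backup copy.
    if primary_copy is not None:
        return "primary", primary_copy
    if backup_copy is not None:
        return "backup", backup_copy
    return "none", []


def one_to_one(packets, extra_traffic, primary_ok, backup_ok):
    """1:1: the ingress decides which LSP carries the protected traffic."""
    if primary_ok:
        # Backup LSP is free to carry other traffic while the primary works.
        return {"primary": list(packets),
                "backup": list(extra_traffic) if backup_ok else []}
    if backup_ok:
        # Ingress switches the protected traffic; the extra traffic is preempted.
        return {"primary": [], "backup": list(packets)}
    return {"primary": [], "backup": []}


voice = ["v1", "v2"]
best_effort = ["b1"]

print(one_plus_one(voice, primary_ok=False, backup_ok=True))
print(one_to_one(voice, best_effort, primary_ok=True, backup_ok=True))
print(one_to_one(voice, best_effort, primary_ok=False, backup_ok=True))
```

In the 1+1 case the backup bandwidth is always consumed but the egress can switch on its own; in the 1:1 case the backup bandwidth is reusable, at the cost of the ingress having to learn about the failure first.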

MPLS Fast Rerouting (FRR)

To support real-time applications such as video conferencing and television, this traffic must be given LSP protection with millisecond-level switching, similar to traditional SDH APS.

LSP protection switching requires the intervention of signaling protocols, and propagating the fault indication from the point of failure to the recovery point adds unnecessary recovery delay. With MPLS fast rerouting, the node that detects the fault redirects the affected traffic directly onto a preset protection path without signaling intervention; in other words, the recovery point is the fault detection point. Most fast rerouting schemes rely on pre-established backup paths. When the recovery point detects a failure, it only needs to update its label switching table so that traffic is moved from the LSP on the failed port to the pre-established LSP on a working port.

Besides faster protection and recovery, fast rerouting has the advantage that protection can be configured selectively on the weak links of the network, which avoids redundant protection in already reliable parts of the network and unnecessary consumption of core network resources. MPLS fast rerouting provides protection switching within 50 ms and can serve as an alternative to the SDH APS protection mechanism.

MPLS fast rerouting adopts the following configuration process:

First, a user command at LSR1, the ingress of the LSP, activates the MPLS protection switching function. LSR1 then sends signaling to all LSRs along the LSP path, and each LSR computes a backup LSP that bypasses its next-hop LSR, which completes the fast rerouting configuration. When an LSR on the LSP path detects a downstream failure, it locally switches traffic onto the backup LSP.
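
The process can be sketched as follows, assuming a small hypothetical topology. Each LSR pre-computes a bypass around its next-hop LSR, and the LSR immediately upstream of a failure splices its bypass into the path. The function names and the breadth-first path search are illustrative choices; the sketch ignores labels, bandwidth, and signaling, which a real implementation must handle.

```python
from collections import deque


def bfs_path(links, src, dst, avoid):
    """Fewest-hop path from src to dst that does not traverse nodes in `avoid`."""
    queue = deque([[src]])
    seen = {src} | set(avoid)
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(links.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no bypass exists


def precompute_bypasses(links, lsp):
    """For each LSR on the LSP, pre-compute a backup path around its next-hop LSR."""
    bypasses = {}
    for i in range(len(lsp) - 2):
        repair_point, protected, merge_point = lsp[i], lsp[i + 1], lsp[i + 2]
        bypasses[repair_point] = bfs_path(links, repair_point, merge_point,
                                          avoid={protected})
    return bypasses


def reroute(lsp, bypasses, failed_node):
    """Local repair: the LSR upstream of the failed node switches to its bypass."""
    i = lsp.index(failed_node)
    if i == 0 or i == len(lsp) - 1:
        return None  # the ingress or egress itself cannot be bypassed
    bypass = bypasses.get(lsp[i - 1])
    if bypass is None:
        return None
    return lsp[:i - 1] + bypass + lsp[i + 2:]


# Hypothetical topology: LSR name -> set of directly connected LSRs
links = {
    "LSR1": {"LSR2", "LSR5"},
    "LSR2": {"LSR1", "LSR3", "LSR5"},
    "LSR3": {"LSR2", "LSR4", "LSR5", "LSR6"},
    "LSR4": {"LSR3", "LSR6"},
    "LSR5": {"LSR1", "LSR2", "LSR3", "LSR6"},
    "LSR6": {"LSR3", "LSR4", "LSR5"},
}
lsp = ["LSR1", "LSR2", "LSR3", "LSR4"]

bypasses = precompute_bypasses(links, lsp)
print(bypasses)
print(reroute(lsp, bypasses, "LSR3"))
```

Because the bypasses are computed before any failure occurs, the repair step is only a lookup and a splice, which is what keeps the switching time short.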

The IETF has seen many fast rerouting proposals. The two mainstream protection methods are link protection and node protection, which differ in approach and in complexity. When this article was originally written the technology had not yet been published as a formal RFC; RSVP-TE fast reroute has since been standardized in RFC 4090.

Graceful restart

Factors that can cause the control plane to restart include software upgrades, software bugs, and hardware failures. Non-stop forwarding allows the data plane to keep forwarding while the control plane restarts. However, when the control plane fails, the peer routers recalculate their routes to bypass the faulty router, which makes the uninterrupted forwarding of the data plane pointless and spreads the route changes across the entire network. If this happens on an MPLS VPN PE router, the result is disastrous.

Graceful restart (smooth restart) of the control plane effectively solves this problem. When its control plane fails, a router that uses this technology can notify its neighboring routers to continue forwarding data along the original paths. Meanwhile, the router restarts and re-establishes routing state with its neighbors, which keeps services available during the restart and minimizes the impact of a single device restart on the entire network.
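
The neighbor's side of this behavior can be sketched as below. The class `HelperNeighbor`, its method names, and the 120-second restart timer are assumptions made for illustration and do not correspond to any particular protocol implementation. The idea shown is that routes learned from the restarting router are kept and marked stale rather than withdrawn, and are flushed only if the router does not come back in time.

```python
class HelperNeighbor:
    """Illustrative neighbor that keeps forwarding while a peer restarts."""

    def __init__(self, restart_time_s):
        self.restart_time_s = restart_time_s
        self.routes = {}      # prefix -> {"next_hop": peer, "stale": bool}
        self.deadlines = {}   # restarting peer -> time at which stale routes expire

    def learn(self, prefix, peer):
        self.routes[prefix] = {"next_hop": peer, "stale": False}

    def peer_restarting(self, peer, now):
        """Keep the peer's routes, mark them stale, and start the restart timer."""
        for route in self.routes.values():
            if route["next_hop"] == peer:
                route["stale"] = True
        self.deadlines[peer] = now + self.restart_time_s

    def tick(self, now):
        """Flush stale routes of peers whose restart timer has expired."""
        for peer, deadline in list(self.deadlines.items()):
            if now > deadline:
                self._flush_stale(peer)
                del self.deadlines[peer]

    def peer_resynced(self, peer, prefixes, now):
        """Peer is back: refresh what it re-advertises, drop what it no longer has."""
        self.tick(now)
        if peer not in self.deadlines:
            return False  # timer expired or peer was never restarting
        del self.deadlines[peer]
        for prefix in prefixes:
            self.learn(prefix, peer)
        self._flush_stale(peer)
        return True

    def _flush_stale(self, peer):
        self.routes = {p: r for p, r in self.routes.items()
                       if not (r["next_hop"] == peer and r["stale"])}

    def forwarding_table(self):
        return {p: r["next_hop"] for p, r in self.routes.items()}


helper = HelperNeighbor(restart_time_s=120)
helper.learn("10.0.0.0/24", "PE1")
helper.learn("10.0.1.0/24", "PE1")
helper.peer_restarting("PE1", now=0)
print(helper.forwarding_table())   # both routes are still used during the restart
helper.peer_resynced("PE1", ["10.0.0.0/24"], now=60)
print(helper.forwarding_table())   # only the re-advertised route remains
```

The timer bounds how long a neighbor keeps forwarding toward a router whose control plane may never return, which limits the risk of forwarding on outdated routes.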

During a graceful restart, the router does not preserve its own protocol state, so a software fault that caused the restart is not carried over after the restart.

Graceful restart is a new feature that many older devices do not support, so it can be deployed within a local subnet on the devices that do support it.

At the network edge, operator edge routers face many customers and generally have no redundancy, which makes them the most suitable place for graceful restart. The network core generally uses redundant paths for protection, and restarting a router while it continues to forward traffic can easily cause routing loops, so graceful restart is not recommended in the network core.


Article entry: csh     Editor in charge: csh