Graduate Networks, UCSD

CSE222 – Spring 2009

An Experimental Study of Delayed Internet Routing Convergence April 23, 2009

  1. The transition from distance vector to path vector routing algorithms increases the number of routing path oscillations, which can last up to fifteen minutes.
  2. Much of the path restoration delay is a result of non-standardized implementation decisions on the part of BGP router vendors, including timer parameters. The lower bound on convergence could be reduced from linear time to constant time with minor changes to said implementations.

Using a full mesh may greatly increase the number of paths (up to an exponential

Possible future work includes adding some form of synchronization

 

Delayed Internet Routing Convergence April 23, 2009

Salient points:

  1. This paper represents a significant milestone among the networking research community with respect to understanding the Border Gateway Protocol (BGP). The authors use a theoretical approach to predict the lower and upper bounds for the view convergence among BGP routers. That also serves as a model for systematic theoretical approaches to evaluating other network protocols.
  2. The main revelation of the paper is that the delay in convergence is generally much higher than previously thought. The paper formally shows that Internet lags behind public telephony networks in terms of availability, reliability and quality of service.
  3. One of the contributions of this paper is their model for injecting faults into the BGP routers to effect changes in routes. Faults are injected from a Fault Injection Server at the Upstream ISPs and RouteViews are collected the other ISP. They try to mimic the real-world occurrence of faults by spacing them randomly. The period of testing has been very extensive (2 years) which gives credibility to their claims.

Concerns with the paper:

While the analysis of the BGP convergence model is extensive, I am not convinced whether the assumptions they have made to simplify the analysis hold in the real world. They consider “a complete graph of autonomous systems as model for Internet,” which favors shortest path. This is not true today, as ISPs route to other ISPs based on business alliance and not based soleley on shortest routes.

Also, they admit that the routing at ASes pass through many other protocols like IBGP, route reflectors etc. but they exculde the dealys caused by these based on their experimentation

Future Work:

More light needs to be shed on the impact or latencies caused by the the above protocols other than BGP which occur during transmission between ISPs. This apart, most ISPs have ingress and egress filters to route certain packets differently than others. The effect of such customer route filtering on BGP needs to be investigated.

 

Delayed Internet Routing Convergence April 23, 2009

Three Major contributions:

  1. BGP convergence in the Internet is an order of magnitude slower than previously thought. Showing that the Internet does not support effective inter-domain failover.
  2. Most of these convergence problems are from configuration problems with routers, and differences in implementations. They provide some changes to BGP implementations to help improve this problem.
  3. They provide a theoretical upper bound on the latency of BGP convergence.

The most glaring problem with the paper:

For the most part, the study was done using simulation. A simplified model of the internet topology and the BGP algorithm were used. I think a longer  study using a large part of the actual Internet would be extremely useful. This, however, might be infeasible at the moment.

Future research implications:

Application and Protocol architects might have to rethink their designs of current algorithms that depend on effective inter-domain failover. These include QoS dependent things like VoIP, or other content delivery networks.

 

Delayed Internet Routing Convergence April 23, 2009

The paper describes an elaborate study (spanning two years) of the convergence behavior of the inter-domain routing protocol BGP upon changes to routing paths. The paper describes in detail about the methodology that they used to measure the convergence times averaged over a long period to evaluate the parameter and validate if it was indeed as short as it was previously believed.

One of the primary contributions of the paper is the actual quantitative measurement of the convergence time for BGP over the wide area on the Internet rather than through simulations. As claimed by the authors, although several attempts had been made earlier to measure the convergence time of distance vector protocols, this paper is the first attempt to investigate the convergence of BGP which is a path vector based protocol. Their findings showed that the convergence time for BGP could be as high as O(n!) in the worst case and O(n) in the best case where n is the number of ASes in the Internet.

A second important aspect of the paper is that they measure the convergence time both for the BGP routing information in the network as well as for end-to-end convergence which provides a better measure of actual application level impact of the routing convergence process.

Another contribution is their recommendation for changes to routers which could improve the convergence times based on their understanding of the causes of the long delays in convergence.

One flaw or missing aspect from the paper is that they do not mention the actual location or sizes of the 5 ISPs chosen for the measurements in the study. While the BGP routing information is only decided by the edge routers of the ISPs, the actual end-to-end performance would also be affected by the internal policies used by the ISPs for configuring routes. This could vary between different ISPs and can have an impact on the end-to-end measurements.

From a research perspective, it would be interesting to see if any other changes other than the ones suggested by the authors can actually improve the convergence times of BGP. One of the important aspects is that the existing installed router infrastructure cannot be replaced easily, so what improvements can be effected while using the existing hardware to improve the convergence times.

 

Delayed Internet Routing Convergence April 23, 2009

Paper addresses the effects of the convergence delay on packet loss and latency in end-to-end Internet paths. Convergence delay is defined as a time elapsed between the router change and by the time a consistent view of the new network established. Authors discuss the distance vector routing algorithms and issues associated with them such as slow convergence, loop formation due to insufficient information transmitted between nodes.

Border Gateway Protocol is discussed which a type of distance vector algorithms used for inter-domain is routing in present day Internet. Though it solves some of the problems associated with the DV algorithms (e.g. count-to-infinity) it is not as fast as believed.

Paper suggests that it is wrong to assume that path vector approach used in BGP can provide much better convergence properties. Internet does not support effective inter-domain failure conditions. It has been suggested that most of the delay in path restoral are due to unexpected interaction of configurable routing  protocol timers and specific router vendor protocol implementation decisions during the process of delayed BGP convergence.  Path vector approach increases the number of possible routing table oscillations though providing the count-to-infinity problem. Internet path failover adversely affect the end-to-end performance. It has been shown that packet loss grows by a factor of 30 and latency by a factor of four during path restoral.

Author analyzes the BGP convergence model and establishes the theoretical bounds on the convergence. It is assumed that message delays are unbounded. Author shows that states explored for the convergence grows exponentially with the number of nodes. Lower bound is ((n-3)*30). It has been shown that by slight modification in present implementation it can be reduced.

Paper suggests specific changes to BGP implementation to improve Internet convergence latencies. It discusses that impact of these inefficiencies on the emerging standards (e.g. VoIP) will be significant.

 

Delayed Internet Routing Convergence April 23, 2009

Filed under: R06. An Experimental Study of Delayed Internet Routing Convergence — filipposeracini @ 2:38 pm

The Internet infrastructure had grown in very few years from some dozen to billions of end-users and machines constantly connected. This is putting a big stress on the infrastructure that must be able to deal with failures in order to provide a certain level  of availability, scalability and reliability. Indeed, even transient disruptions in backbone networks can now cause massive finantial loss and disrupt hundreds of thousands of users.

The Internet infrastructure has always been considered to be able to support rapid failure recovery of some node or link through restoration and rerouting on other routes. Unfortunately, this paper proves the opposite.

These are the three main results of the study presented in the paper:

  1. The average recovery time from a inter-domain failover is in the order of three minutes with some peaks up to fifteen minutes, hence much higher than originally thought. Moreover, this delay will increase linearly with the addition of  new autonomous systems to the Internet, in the best case. The worst case scenario could take to an exponential growth of the complexity.
  2. The BGP convergence is strictly inflenced by vendors’ implementations of the BGP protocol. The authors suggest possibile changes that BGP implementation vendors could use to reduce the Internet convergency latencies. However, this would come at the cost of a higher complexity of the protocol.
  3. During their study, the authors observed the behavior and the performance of 5 different major ISPs. The interesting observation that came out of that study is that there were significant variations between the convergence latencies of the five ISPs, completely unrelated to geographical or network distance. This is in contrast with the assumption that the Internet is a complete mesh. If it had been so, we would have observed similar results among the ISPs. Instead, the key factor of these variations relates to topological factors, such as the length and number of the possible paths between an AS and a given destination.

I think that the paper presents a really interesting study of  a real test bed, the Internet itself. The authors indeed injected over a 2 year time period, over 250,000 routing faults into geographically and topologically diverse peering sessions with five different ISPs. Hence, the results presented in the paper are not arficial ones, but they do reflect the real behavior of the Internet. I think that this is a very important aspect of the work presented.

Since the authors claims that the recovery delay from inter-domain failover can increase from linearl to exponential with the number of new autonomous systems, I think it would be good to narrow down this range, also to understand whether major structural changes to the BGP implementation are really needed. In fact, it is not so clear if this is the case because they observed that the average failure ratio is in the order of one failure/month.

Another avenue for further research that I can see is to understand how robust the rerouting protocol really is. In the last few months we have observed at least two major failures of the backbone network, one in the Mediterrean see between Sicily and the North Africa and one in the Bay Area. Both of them created huge problems to many countries for several days. In the case of the Mediterrean failure (the optic wires were cut) part of the Middle East, Noth Africa and even India got completely disconnected from the rest of the world. In theory, this should not have happened if the rerouting algorithm had worked as expected.

 

Delayed Internet Routing Convergence April 23, 2009

(i) The three most important things the paper says:

1. A major oversight in the internet today is the recovery time of path selection for main BGP routers used by ISPs after a fault. The faults  caused by oscillations in the BGP convergence algorithm. The average and upper bound for time until convergence is much slower than originally predicted and what is commonly assumed.

2. With a small augmentation to the minimum route advertisement interval timer of the BGP convergence algorithm, it you can significantly reduce the convergence time to a constant 30 seconds and reduce the number of states needed as the number of autonomous nodes increase in the network with unbounded delay.

3. If loop detection is done by both the sender and the receiver mutual dependencies can be detected and eliminated within a single advertisement round. This is the lynchpin that allows the modified version of BGP to converge so much quicker than its predecessor.

(ii) The most glaring problem with the paper:

The author spends the first part of the paper arguing that the old theoretical results of convergence were never realized in practice but then spends much effort promoting an augmentation to improve the current theory. The theory is only able to be supported by simulation results but the problems with the other algorithm were only caught long after deployment. It feels like he undercuts the trust in his theoretical results.

(iii) The future research directions of the work:

Further discussion of the overhead of making the adjustments suggested by the author must be viewed. It also will of course be interesting to verify the theory in practice as more service providers adopt changes advocated by the author. An on going analysis should be done as well as exploring other tweaks to the algorithm.

 

Delayed Internet Routing Convergence April 23, 2009

This paper looks into the latency in internet path failure, failover and repair due to the convergence properties on interdomain routing. The failure recovery times after are substantial as compared to the circuit-switched paths. This burden is created due to the growing size and complexity of the internet. Border Gateway Protocol (BGP) is used as the interdomain routing protocol which uses the path vector approach in including the entire path to the destination. The BGP protocol is widely and incorrectly believed to provide significantly improved convergence performance over other traditional Distance Vector (DV) routing algorithms. The paper provides a theoretical upper bound on the number of computational states explored during BGP convergence which is factorial of the number of autonomous systems in the internet. The paper investigates the rate at which the inter-domain repair and failure information propagates through the Internet. Furthermore, they measure the impact of path changes on end-to-end network performance. The methodology adopted is the analysis of the data collected from the experimental instrumentation of key portions of the internet infrastructure. The authors accomplished this by injecting routing faults into geographically and topologically diverse peering sessions with five major commercial ISPs. They also built a model of the delayed BGP convergence process to provide an upper and lower theoretical computational bound. The theoretical
upper bound is said to be unlikely to occur in practice as it is based on a set of specific assumptions and worst-case scenarios. Finally they related their experimental findings to their theoretical model. They report the delay in internet interdomain path failovers averaged at three minutes. They also argue that the delay of interdomain route convergence is due almost entirely to the unforeseen interaction of the protocol timers with specific router vendor implementation decision. One observation was that the convergence time varied across autonomous systems and therefore the time is directly related to topological factors. One question that arises is if normally occurring Internet failures occur on the average of once a month, do the delays caused by these rare failures have a noticeable effect? Since their solution is relatively straightforward and simple, it is worth making modifications to reduce delays even if the failures are rare. The effect of these failures on end-systems and applications is unclear. Is the situation as dire as they make it out to be? Furthermore, the delay is proportional to the number of autonomous systems.

 

An Experimental Study of Delayed Internet Routing Convergence April 23, 2009

(i) the three most important things the paper says

First off, the paper makes the observation that, due to how the BGP protocol operates, route conversion between BGP routers could take much longer than anticipated when presented with a route withdrawal (via results from other, similar, studies).  The study showed that due to router oscillations (shown through different step-by-step examples) that the worst-case for convergence on any given route withdrawal could be factorial on the number of interconnected BGP hosts.

Secondly, related to the first point, the authors pointed out that inconsistencies in BGP implementations from different vendors proved detrimental to their presented read world results.  I feel that this is an important point, because it may show ambiguities in the BGP protocol that should possibly be more rigidly defined (so as to limit these oscillatory delays present after a BGP withdrawal).

Lastly, the paper creates a great correlation between Tdown timing measurements and theoretical observations made in the paper.  The authors point out that Tdown creates a longer convergence time between BGP routers because of the different cases that are created by the BGP protocol itself (iterating through the numerous routes that a certain router may have stored that are probably not valid routes at all).  The authors bring up the point that using MinRouteAdver significantly decreases the lower bound on this convergence time (another important observation).

(ii) the most glaring problem with the paper

I feel that the paper may make too many simplifying assumptions in their BGP convergence model.  It seems that some of the assumptions might not exist within the context of the Internet today (although this probably an artifact of my lack of knowledge of the architecture of the Internet).  For instance, I found it hard to believe that ASes on the Internet today were somehow peered with all other ASes on the internet [which is the point that I picked up from the authors' statement that all nodes had (n-1) different direct connections].  I feel that some more supporting information was needed for these simplifying assumptions.

(iii) the future research directions of the work

I think that a good future direction for this work would be to perhaps investigate replacement protocols to BGP or perhaps some sort of significant optimization that would reduce the amount of oscillatory behavior in BGP route advertising.  I agree with the authors when they say that improvement needs to be made in convergence time (after a route withdrawal) because of the amount of real-time traffic that continues to be added to the Internet each day.

 

Delayed Internet Routing Convergence April 23, 2009

Three Important Things:

  • There is common acceptance that the internet in its current form is robust and efficient in adjusting routes in the face of link failures. Such occurrences are fairly common in such a huge infrastructure, and the authors recognized that it would be valuable to evaluate this accepted belief. By evaluating BGP convergence they came to the conclusion that inter-domain failover has significant delay, and that the path vector approach is incorrectly believed to provide improved convergence properties over distance vector routing. The claim is that the delay stems from ambiguity in the BGP protocol, and varying implementations across different router vendors.
  • The paper was focused on experimental trails conducted over a long period of time, lending strength to the obtained results. The authors emphasize several points about designing and executing their experiments in order to produce quality, meaningful, relevant results.  Faults were injected into the actual internet backbone as opposed to a simulation environment. Latency effects were measured at the router level by routing ISP routing tables, this allowed the examination of several events: route repair, route failure, repair and failover, failurea and failover. End-to-end measurements were performed to determine the application level impact, the results proved that even moderate routing table flux caused packet loss, increased packet latency, and out-order-delivery.
  • In addition to the hard data, the authors provide theoretical lower and upper bounds. The exploration of these bounds resulted in several insights about delayed convergence. For example, a lack of minimum route advertisement interval timer in some implementations influences the order in which link updates are processed, which impacts the rate of convergence. There are numerous implementation variations across vendors that contribute to the performance bounds.

Glaring Problem:
A major concern that I had with this work is that it only focues on BGP, and that it didn’t really discuss the parallels between inter-domain and intra-domain results. It would have been useful to see how the obtained results pertained to other routing protocols, if at all.

Future Work:
Following this thorough treatment of BGP convergence analysis, it would be useful to follow up on the root causes of delayed convergence and try to design new protocols, or at least new features in future routers that could help alieviate the strain on the internet backbone.  These topics were mentioned in the paper, but not at length. Such research would be crucial to the continued viability of the internet from a QoS perspective.

 

An Experimental Study of Delayed Internet Routing Convergence April 23, 2009

This paper presents the results of a very comprehensive study of routing convergence behavior on the internet. By monitoring the internet in many different places while injecting routing faults at various points in the network, they were able to measure in great detail the effects of such faults on both the routing tables themselves and their effect on end users.

They present a few important conclusions. The first is that the amount of time it takes for new routes to stabilize after a fault is actually much longer than previously thought. In addition to this insight, they have discovered that this delay will increase linearly in the best case and exponentially in the worst case with the number of additional autonomous systems on the internet. This presents a significant challenge to the designers of internet routers, especially if the growth rate increases more exponentially than linearly.

Their second conclusion is that much of this large failover time is due to differences in how vendors implement the BGP specification. If more loop detection was done by the sender instead of just at the receiver, the number of total updates could be reduced dramatically. However, this adds complexity to routers which are likely already quite expensive and complicated to design.

A third conclusion they arrive at is that different classes of update traffic have different behavior with regards to the amount of time and number of messages necessary to reach routing convergence. The loss of a route causes messages which are monotonically increasing while the addition of a route causes messages which are strictly increasing. While this may not seem like a significant difference, it actually has a profound effect on the convergence time, since it can take a long time for a faulted route to disappear from the system as routers bounce messages back and forth slowly increasing their distance to the faulted route until eventually they all realize that it is no longer present. This is in contrast to an added route, which either creates a shorter path which is propagated, or doesn’t, and doesn’t generate any new routes.

A weakness of this paper is that their bounds on the convergence rate of routes are fairly loose. While it is valuable to know that it is at least linear, the upper bound is quite high and does not necessarily describe the behavior accurately. However, since the growth rate is at least linear, future research needs to focus on ways to either improve BGP or supplant it with something more intelligent, which is less susceptible to implementation dependencies and superfluous messages during routing convergence.

 

An Experimental Study of Delayed Internet Routing Convergence April 23, 2009

Filed under: R06. An Experimental Study of Delayed Internet Routing Convergence — gracewangcse222 @ 2:37 pm

(i) The three most important things the paper says:

  1. Routing table oscillations that occur during failure, fail-over and repair of Internet paths can cause routing convergence to take up to tens of minutes (with a theoretical upper bound of O(n!) time), contrary to conventional wisdom that this convergence takes only milliseconds.
  2. The authors injected four types of events into the Internet for the purpose of their study. The routing events Tup (a route repair, representing a newly available route) and Tshort (a route repair and fail-over) formed one equivalence class while the events Tdown (a route failure) and Tlong (a route failure and fail-over) represented another equivalence class in both convergence latencies and triggered number of update messages. This is because the former is strictly increasing (similar to DV algorithms) while the latter is monotonically increasing (similar to BGP).
  3. Improvements can be made to current convergence latencies by adding sender-side loop detection by modifying MinRouteAdver (significantly reducing state complexity and allowing much faster convergence). However, oscillations triggered by BGP path changes appear to be inevitable. Though we can expedite convergence, this would lead to increased protocol complexity and overhead.

(ii) The most glaring problem with the paper:

This paper discusses the discusses the latency of the system to reach a consistent view after failover, which is on the order of many minutes. However, they first mention that at least one study of user experience places the fail-over latency at 30 seconds or less. This indicates to me that while the “true” fail-over latency may be much longer than people would have expected, the user experience study shows that most people don’t think their experience is all that bad. Though the worst case can in theory be quite bad, perhaps this happens only in rare cases.

(iii) The future research directions of the work:

Possible research directions could maybe explore fail-over latency for the average case in order to reconcile the results obtained in the paper with the user experience study. Further work could also be done to look for better ways to minimize routing table oscillations, as well as to explore average-case trends as the Internet is scaled up.

 

An Experimental Study of Delayed Internet Routing Convergence April 23, 2009

The three most important things the paper says:
1) Extensive experimental data indicates that the Internet does not support effective inter-domain failover.
2) Simulation indicates that convergence time can be improved by employing both sender and receiver side loop detection.
3) If reliable, real time services are to be deployed over the Internet (the paper mentions QoS and VoIP), then the quality of the underlying inter-domain failover must be addressed.

The most glaring problem with the paper:
The proofs for theoretical upper and lower bounds on BGP convergence both assume a complete graph topology. A complete graph topology is stated as a necessary condition for worst case convergence, but they don’t relate their theoretical lower bound for a complete graph to an incomplete graph. The Internet is far from being a complete graph, so I don’t know how useful these bounds are.

The future research directions of the work:
The paper suggested some tweaks to BGP to improve convergence performance, and hinted at some ideas for a better solution, which all involved more complexity and overhead. It is not clear to me that the tradeoff mentioned in the paper between scalability and fault-tolerance is fundamental. Future research should try to find a better solution to inter-domain routing issues which are both more fault-tolerant and more scalable.

 

An Experimental Study of Delayed Internet Routing Convergence April 23, 2009

The paper describes the latencies of various kinds of faults to converge in the internet routing domain. They have experimented over 2 years by injecting several faults and measuring their latency to reach steady states and found that the average latency is orders of magnitude greater than previously reported. The paper also gives theoretical upper and lower bound on the delay of convergence assuming a topology of connected nodes and showed that the worst case time to converge is in the order of factorial of the number of nodes in the internet. Specially the adoption of path vector routing to avoid count to infinity problem has pushed the theoretical upper bound of the convergence delay to be exponentially worse. They have characterized the convergence delay based on four kinds of faults, viz. Tup, Tdown, Tlong, Tshort which are the cases when a new link comes up, an existing link goes down, a longer link is available and a shorter link is available respectively. According to their experimental observation both Tlong and Tdown converge more slowly than Tshort and Tup. This was also in accordance to the number of messages sent in the respective cases. Also the delay of end to end packet delivery and percentage loss of packets agreed with the fast and slow convergence of Tshort and Tlong respectively.

The paper addresses the delay of convergence of various faults in the internet and has tried to provide theoretical basis for such observations. According to them the actual observed delays are orders of magnitude worse than previously reported numbers which is also supported by the theoretical upper bound. But the theoretical analysis assumes complete mesh topology of the internet which is rarely the case in practice. Hence the worse case theoretical upper bounds may not hold and depends on the actual topological factors of the networks. Nevertheless, they have proposed a theoretical base on which the even the average case can be analysed for convergence delay in the real case.

The future directions of the research can be of designing better routing algorithms than path vectoring that does not have such exponential worst case complexity for convergence delay. As the paper stated, with the growth and increasing complexity of the internet, routing convergence delay will become significant and possibly a bottleneck in future. Hence the application layer protocols also needs to be re-evaluated which relies on stable underlying interdomain forwarding infrastructure and fast IP path restoral. Also specific changes to future vendor BGP implementation should be employed that can significantly improve the internet convergence latencies.

 

An Experimental Study of Delayed Internet Routing Convergence April 23, 2009

1) This paper’s main objective is to dispel the conventional wisdom about internet failover. Specifically regarding the fact that the Internet actually does not support an effective nor efficient inter domain failover because of delayed BGP convergence.

2) Another main point are the reasons why routing events Tup and Tshort converge faster than Tdown and Tlong. Understanding the reasons why one is strictly increasing vs monotonically increasing is a key point in understanding the delay and what measures could be taken to mitigate it.

3) They also show, by means of actual experimentation, that even a moderate level of routing table oscillation will lead to increase packet loss. This drives their point home, that delay BGP convergence has an effect that is exponentially worse than originally perceived.

(ii) The most glaring problem with the paper:

This paper is actually pretty well written with experimentation and analysis of real network data. So there aren’t any *truly* glaring problems; however if I had to critique it, I would have to say that it does not offer any true solutions. The “tweak” that was suggested for MinRouteAdver is only a band-aid and cannot be relied upon to fully address the issue. If they actually propose a real working solution, then this paper would become more complete.

(iii) The future research directions of the work:

After reading the paper, it seems the most critical area for research would be one of two areas. Either to work on new internet protocols that is robust enough to handle the worse case scenarios of delayed BGP convergence. The other direction could be to solve the problem of delayed BGP convergence or at least further mitigate this issue. I feel that solving either problems is important, because as the internet grows, this problem will get worse considering that the complexity is based on the number of nodes.

 

Delayed Internet Routing Convergence April 23, 2009

The paper “Delayed Internet Routing Convergence” makes three important contributions to the problem of convergence of interdomain routing:

First, the authors present an experimental study on route repair, failure and failover events injected into the routing infrastructure. They observe key exchange points too see how long BGP routers exchange messages after injecting events before all observed ISPs agree on the routing configuration for their experimental stub AS. Metrics included the number of BGP update messages sent and the time until convergence, i.e. when no more messages where generated for this particular event. Also analyzed are end-to-end connectivity by looking at packet loss rates and latency.

Second, the paper contributes a theoretical model for the convergence of the BGP protocol based on the BGP specification, simulation and the previously described experiments. The analysis of the upper bound on the convergence given by this model emerges interesting properties  of BGP.

Third, a proposal to improve the slow convergence of BGP by introducing a loop detection on AS paths exchanged via BGP messages for both partners of the message exchange, sender and receiver.

As the authors acknowledge, implementing the solution does only mitigate some part of the problem, but doesn’t face the problem of slow convergence itself. So further research on these matters and on follow-on protocols to BGP is needed.

 

Delayed Internet Routing Convergence April 23, 2009

i)

1. Contrary to popular belief the BGP protocol takes longer to failover than previously thought. This delay was primarily due to protocol timers and implementation decisions in the routers.

2. Tdown (a route failure) and Tlong (route failure & failover) both take significantly longer to propogate throughout the internet than Tup (new route/route repair) and Tshort (route repair & failover).  This is because information about a longer path takes more time to spread than information about a shorter path. This is because there tends to be multiple routes between some router A out in the internet and the destination B and the information that the route to the destination is now longer must spread to at least 1 router along every previous path of length less than the new real path length from A to B.

3. BGP routers are usually implemented to only send updates to a particular peer at a rate of at most once every 30 seconds. This however creates in effect 30 second rounds. A solution that would improve convergence time would be to have a timer for every (destination prefix, peer) tuple, or some sort of solution that emulated such a system. This would improve convergence time at the cost of possibly increased # of packets sent and requiring more memory on the router.

ii) I think the most glaring problem is the authors only considered faults at the end host and did not evaluate how effective BGP was when there are faults in the middle.

iii) Future research directions include exploring what other tweaks / impelementation details can be changed to impove convergence times since with the increase of streaming video and VoIP, the quicker to failover the better.

 

Delayed Internet Routing Convergence April 23, 2009

The growth in size and complexity of the Internet calls for stronger routing infrastructure. The authors of the paper conducted a series of experiments to the latency in Internet path failure, failover and repair due to the convergence properties of inter-domain routing over a two-year period. They attribute this delay to the slow convergence of routing algorithms. The BGP protocol uses a path vector with the entire path to the destination as opposed to the DV approach and BGP is believed to have lower latencies. The paper shows that the upper bound on complexity for BGP convergence is factorial with respect to the number of autonomous systems. The authors investigate the rate at which the inter-domain repair and failure information propagates through the Internet and finally measure the impact of path changes on end-to-end network performance.

In their experiment, the authors collected data over the course of two years while injecting over 250000 routing faults in 5 major ISPs. They also built a fault model of the delayed BGP convergence process to provide an upper and lower theoretical computational bound. The theoretical upper bound is shown to be unlikely to occur in practice as it is based on a set of specific assumptions and worst-case scenarios. The theoretical lower bound is closer to practice since timers are used to handle synchronization issues not accounted for in the upper bound. The authors finally link their experimental results to the theoretical fault model proposed.

Some important experimental results discussed in the papers are, the average delay in inter-domain path failovers was three minutes, the theoretical upper bound on BGP convergence is O(n!) with n autonomous systems, the lower bound on BGP convergence is Ω((n-3)*30) with n autonomous systems, the delay of inter-domain route convergence is almost entirely attributable to unforeseen interaction of protocol timers, the path failover results in packet loss growth factor of 30 and latency increase factor of 4 on end-to-end performance and lastly changes to vendor BGP implementations would reduce lower bound on inter-domain convergence to Ω(30). An important observation was that the convergence time varied across autonomous systems and therefore the time is directly related to topological factors.

Although the experimental analysis is made thoroughly and is convincing of the fact that the routing delay in the internet stems from the convergence of algorithms, the oversimplified theoretical treatment in the later half of the paper slackens the stronghold established by experimentation. The future that stems from the observations made in the paper are that although the average internet failures rates are rare, if the effect of these failures is noticeable then a cheap feasible solution should be explored and deployed. Further since the delay is proportional to the number of autonomous systems, it is imperative to know the rate at which their numbers are increasing. If these observations are validated and investigated then there will be a better indication on whether BGP needs improvement and the algorithmic and protocol changes that need to be done to make it more fault-tolerant with lower latency.

 

Delayed Internet Routing Convergence April 23, 2009

This paper shows that the common belief that BGP has good convergence properties is inaccurate. To show this Labovits et al. uses passive data collection for a 2 year period and also used a fault-injected machine at major Internet exchange points.

Three most important things:
1.) Measurements
When they looked at their results from the measurements, Labovits et al. notice that the routing convergence was an order magnitude longer than expected. They used a four categrories for describing the route injection: Tup(unavailable route is now available), Tdown(available route is withdrawn), Tshort(long route is replaced with short) and Tlong(short route is replaced with long). Their results showed that Tdown and Tlong converged slower than Tup and Tshort which in idle situation you would want Tdown and Tlong to converge quickly and Tup and Tshort to converge slowly. Tdown also generated a lot more announcements than Tup.
2.) Impact of slow routing convergence
It turns out that the slow routing convergence has an impact on end-to-end internet paths. This is because the routers will drop the packets if they do not have a valid next hop so we will have more dropped packets or they will queue packets while awaiting for their forwarding table cache to update. Their results show the increase packets drops, which occur more on TLong than Tshort because TLong takes longer to converge.
3.) Analysis
The analysis on why Tlong converges longer than Tshort is because, Tshort is accepted quicker than Tlong. When we get a Tlong in BGP, we first try to find all alternatives for the shortest path which can be O(n) long if n is the number of autonomous systems. The rest of the slowness in convergence can be credited to the vendor’s BGP policies.

Glaring Problem:
The most glaring problem is mention by the author himself. In his proposals to fix this problem, he suggests fixes that make the BGP protocol very complex and create more router overhead. Having a complex protocol goes into the debate of scalability versus fault-tolerance.

Future Work:
Labovits et al. showed that something that was commonly believed to be true was infact very inaccurate with just measurements and analysis. It comes to show the difference between theory and implementations as many things can come into play during the implementations. This paper will help evaluate other protocols to reassure that the protocols are acting as they should be.

 

An Experimental Study of Delayed Internet Routing Convergence April 23, 2009

Major Contributions

1. This paper presents the results of a two year study on the convergence of the BGP routing protocol and provides recommendations to reduce convergence time and overhead mainly caused by BGP routers with different implementations due to the ambiguity of BGP specs. RouteView data collection probe was used to log and timestamp all BGP updates from BGP routers of vaious ISPs. Results show that failover delay is independent of network load, congestion, locality and more dependent on topology.

2. A fault injection server was used to inject BGP faults to upstream ISPs and measure the latency to random websites. The failure injection measurement results were taken for BGP events Tup, Tdown, Tshort, and Tlong and measured how long it takes for all the route tables to converge following an event. Tup is a previously unavailable route is now available. Tdown is the opposite of Tup. Tshort is a long ASPath is now replaced by a shorter ASPath. Tlong is the opposite of Tshort. From the distribution of convergence latencies following Tlong, Tdown, Tshort, and Tup faults,  the authors find that Tlong and Tdown form one equivalence class, and Tshort and Tup formed another equivalence class. Tlong and Tdown took almost three times the amount of time to converge compared to Tshort and Tup events. Analysis showed that Tlong and Tdown took longer because it is monotonically incrasing compared to strictly increasing for Tshort and Tup events. The results also show that RTT latency to random websites increased by a factor of four following a failover event and packet loss increased by a factor of 30.

3. The paper provides vendor specific recommendations to reduce convergence lower bound to 30 seconds by analyzing changes to MinRouteAdver using simulation data. The recommendation is to implement both sender and receive side ASPath loop detection, and do not apply MinRouteAdver to Tdown withdrawals.

The most glaring problem is not answering the question of how often these rare long convergence delays occur and whether BGP can still scale without much modification given some constraints on topology.

With routers getting faster, further research considerations with implementing additions or modifications to BGP to improve convergence times using synchronization, diffusing updates, and having additional state information may be feasible. Further research can also consider a complete overhaul or replacement of BGP.

 

An Experimental Study of Delayed Internet Routing Convergence April 23, 2009

The paper takes a critical and look at BGP route convergence in the face of failover/updates. It provides empirical measurements of real Internet latencies and packet losses for a period of over 2 years as well as theoretical analysis for worst and best case complexity of the route convergence process.

The paper brings to light the fact that most of the failovers measured during the 2 year period took an average of 3 minutes to converge. This observation provides hard statistical evidence that there are considerable BGP inefficiencies. Moreover, this duration length is considerably higher as compared to telephone network failovers. The underlying point here being that the network infrastructure supporting the Internet should be as efficient at resolving failures as the telephone network, especially given the increasing reliance on the Internet for business purposes.

The paper also analyzes the BGP protocol from a theoretical perspective. Given a completely connected graph model for the Internet and certain message ordering and processing constraints, the authors show that unbounded delay BGP messages can result in Ο((n-1)!) states. This is of course the upper bound, but it provides insight into how the network topology and BGP router processing details can affect route convergence. Additionally, the authors point out that a lower bound also exists for router states (or rounds as termed in the paper). If the BGP routers adhere to the minimum route advertisement delay and process messages of route length in increasing order (among other constraints), the lower bound is Ω(n).

Lastly, as a result of the analysis on the lower bound for rounds/states, the authors provide implementation suggestions and optimizations to reduce the lower bound to Ω(1). The minor changes to BGP implementations include performing sender and receiver loop detection and using peerwise MinRouteAdvert times. These suggestions allow routers to converge in one round. The paper provides simulation results that explore the unbounded, MinRouteAdvert, and modified (as per suggestions) MinRouyteAdvert implementations of BGP. These results validate the theoretical implications made by the authors.

The paper is a solid presentation of simulation, theoretical analysis, and real world observation/experimentation results. The only criticism is that the authors make the comparison between the public switched telephone network and the Internet. This isn’t necessarily a fair comparison as the telephone networks were designed with a dedicated single purpose, presumably using the same underlying technology throughout. The Internet serves considerably larger topologies, provides numerous varying uses, and runs on a myriad of technologies. Providing faster failure convergence is a great endeavor, but comparing against the telephone network may be too ambitious.

I think the path for further research would be to implement some of the suggestions and perform a long term test of the efficacy of the modified MinRouteAdvert protocol.

 

An Experimental Study of Delayed Internet Routing Convergence April 19, 2009

Convergence time of routing information and packet loss and delay due to inter-domain route changes, longer than expected.

Paper carries out experiments and analyses data collected to determine the average time taken before inter-domain routing information reaches steady state after there has been a change in the topology of an Autonomous System (AS). Data was collected by injecting routing faults into five major ISPs over a period of two years. In one of its experiments the paper observes that  20% of nodes require more than 3 minutes to converge for a Tdown or Tlong update.

Factors convergence time depends on.

The paper makes mention of some interesting factors convergence time depends (and does not depend) on. It provides experimental proofs and reasoning to support its claims. Some of them are:

i) Order in which route-change announcements are processed at a node

ii) Number of interconnections between AS (number of BGP peers). A complete graph would have larger number of computational states to be explored during convergence.

iii) Upstream provider transit property.

iv) If loop detection is done at only the receiver or both the sender and the receiver.

v) Largely independent of network load and congestion.

vi) Independent of geographical location of the node.

T-long and T-down have longer convergence times and why.

In the experimental data gathered by the paper it can be noted that convergence time and latencies are larger when the topology change involved a route going down. This might have resulted in a node becoming inaccessible or a longer route having to be taken. The paper analyses the sequence of events that occur when a route disappears and the how this results in T-down and T-long requiring longer convergence times as compared to T-short and T-up. In the case of T-down, a node keeps failing over to a longer secondary path before it exhausts all of them and realizes that the destination node is no longer available.

Problems/Oversights.

A couple of assumptions the paper makes in its experimental methods might not accurately model internet. Some of them are:

- It monitors the routing tables of only 25 ISPs. This might or might not correctly reflect packet forwarding and updating times over internet. More ISPs would mean more locations from which updates might reach a node.

- It models BGP messages to be processed in a single linear queue. This only models the case where there exist high congestion and long delays over the internet.

Future Work

The new system can be modeled which imbibes in itself factors claimed by the paper to reduce convergence times. These could involve, among others, sender and receiver end loop detection, re-ordering of announcements received to reduce convergence times, etc. This new system could be analyzed to obtain a quantitative measure of the enhancement that can be achieved at adding some amount of complexity to the internet protocol.