Graduate Networks, UCSD

CSE222 – Spring 2009

Building a Robust Software-Based Router Using Network Processors April 30, 2009

The authors describe a three-tiered processor structure for a software-based router built from a Pentium computer and a network processor (also from Intel). The aim is to show that, with the tiered approach, data and control packets can be routed at Gbps line rates while still allowing software-based protocol extensibility.

The Intel network processor board comes with DRAM and SRAM caches as well as several MicroEngines and a StrongARM processor. The MicroEngines are programmed using microcode and can perform rudimentary instructions (dequeue, enqueue, copy, etc.) very quickly. The MicroEngines can operate in parallel, servicing input and output ports. The StrongARM processor is used to service exceptional packets: those that incur a route lookup miss or are control packets. The StrongARM runs a single process and can forward even more “exceptional” packets across a PCI bus to the Pentium processor. The Pentium processor is used to perform compute-intensive operations such as calculating shortest paths.

The main contributions are the design of the three-tiered architecture and the design of the interface for protocol extensions.

The architecture defines a fast path and a mechanism for routing packets through a longer, more compute-intensive path. Packets arrive at an input port, where they are queued in a FIFO. The packets are then processed in parallel by the MicroEngines, which classify each packet by type. If a packet can be forwarded by the MicroEngines, they perform whatever transformation is necessary and enqueue the packet in an output FIFO. If the packet requires further processing, it is put into RAM, where the StrongARM processor checks for packets to process. The StrongARM processor may copy some or all of these packets onto the PCI bus for the Pentium processor to service. The Pentium can copy transformed packets back via the PCI bus. When the StrongARM has completed the processing (or received the completed packet from the Pentium processor), it is responsible for forwarding the packet to the appropriate output queue. MicroEngines also work in parallel to copy packets from the output FIFOs to the output ports. Contention for output ports is resolved with mutual exclusion (implemented by passing a token).
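To make the fast path and slow path concrete, here is a minimal sketch of the dispatch decision described above. It is not the authors' code: the packet fields (`route_hit`, `control`, `needs_heavy_compute`) and the classification rules are hypothetical stand-ins, and the loop is sequential, whereas the real MicroEngines work in parallel.

```python
from collections import deque
from enum import Enum


class Tier(Enum):
    MICROENGINE = 1  # fast path
    STRONGARM = 2    # exceptional packets
    PENTIUM = 3      # compute-intensive packets, reached over the PCI bus


def classify(packet):
    """Pick the lowest tier able to handle the packet (illustrative rules only)."""
    if packet.get("needs_heavy_compute"):
        return Tier.PENTIUM
    if packet.get("control") or not packet.get("route_hit", True):
        return Tier.STRONGARM
    return Tier.MICROENGINE


def forward(input_fifo):
    """Drain the input FIFO and return (output FIFO, per-tier packet counts)."""
    output_fifo = deque()
    handled = {tier: 0 for tier in Tier}
    while input_fifo:
        packet = input_fifo.popleft()
        handled[classify(packet)] += 1
        # Whichever tier finishes the packet, it ends up in an output queue.
        output_fifo.append(packet["id"])
    return output_fifo, handled


packets = deque([
    {"id": 1},                               # ordinary data packet
    {"id": 2, "route_hit": False},           # route lookup miss
    {"id": 3, "control": True},              # control packet
    {"id": 4, "needs_heavy_compute": True},  # handed up to the Pentium
])
out, counts = forward(packets)
```

In this toy run only one of the four packets stays on the fast path; the point of the design is that in real traffic the great majority should.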

The second contribution is the interface through which extensions can be installed. The Pentium processor can call functions on the StrongARM processor that register and unregister custom filters. The StrongARM uses these filters to identify specialized packets (those conforming to some new or extended protocol) and forward them up to the Pentium for processing.
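The register/unregister idea can be sketched as a small filter table. Everything here is an assumption made for illustration: the class name `FilterTable`, its methods, and the use of Python predicates as filters are mine, not the paper's actual interface.

```python
class FilterTable:
    """Hypothetical stand-in for the filter state kept on the StrongARM."""

    def __init__(self):
        self._filters = {}
        self._next_id = 0

    def register(self, predicate):
        """Install a filter and return a handle for later removal."""
        filter_id = self._next_id
        self._next_id += 1
        self._filters[filter_id] = predicate
        return filter_id

    def unregister(self, filter_id):
        """Remove a filter; return True if it was installed."""
        return self._filters.pop(filter_id, None) is not None

    def send_to_pentium(self, packet):
        """True if any installed filter claims the packet."""
        return any(pred(packet) for pred in self._filters.values())


table = FilterTable()
# "proto" and "new-proto" are made-up names for an extension protocol.
handle = table.register(lambda pkt: pkt.get("proto") == "new-proto")

matched = table.send_to_pentium({"proto": "new-proto"})
ignored = table.send_to_pentium({"proto": "ip"})
removed = table.unregister(handle)
after_removal = table.send_to_pentium({"proto": "new-proto"})
```

Once the filter is unregistered, packets of the extension protocol are no longer diverted to the Pentium.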

The paper provides a very detailed description of how to build a software router from a network processor plus a general-purpose processor. The authors’ results are impressive: they report a processing rate of 3.47 Mpps. What seems unaddressed is what would actually happen if a new or extended protocol were implemented using this architecture. If that protocol required Pentium-level processing for each packet, the line rate would presumably drop considerably. So from a practical perspective, one would not actually want to run a new protocol on this hardware (despite the fact that it could be run).

I’d very much like to see performance results of line rates for traffic using one of the extensions proposed in the paper. This paper does a great job of explaining how traditional traffic can be supported at high line rates but does not provide results from more complex protocols implemented using their design.

 

Can Software Routers Scale? April 30, 2009

Filed under: R09. Can Software Routers Scale? — supritapagad @ 4:22 pm

1. Software routers running on servers offer flexibility, low cost, and ease of programmability, but they have not been popular because they did not scale and could not support line speeds. The paper proposes a way to build such a router in software on commodity servers: multiple interconnected servers form a cluster, with each server handling one incoming line. It goes on to show that this cluster can support current line speeds and scale.

2. The paper also makes a quantitative analysis of such a server, both in theory and through a practical implementation. It does this for a server using a shared bus and for one built on a mesh of multi-processors. It identifies the major time-consuming components and estimates the theoretically achievable, best-case forwarding time for data packets. It then implements the proposed server and compares the estimated times with the observed times. It identifies the communication buses as the bottlenecks.

3. It looks into various topologies for inter-connecting the servers in the cluster. It identifies the issues, which include line-speed communication between the servers and in-order delivery of packets. It proposes a solution for in-order delivery of packets. It takes a look at the scalability of this cluster.

Oversight

The paper calls for as many servers as there are lines. Though it describes how this structure would scale as O(sqrt(N)), it still might end up being quite formidable. Also, the paper does not consider packets requiring greater computation, such as control packets, or the impact such packets would have on the processing times for data packets.

Future Work

The paper presents a new approach to implementing software-based routers. It also proves theoretically that the approach is capable of supporting line speeds. The idea can be extended to incorporate servers better suited to the application. Attempts can be made to validate that the system works, supports all types of traffic (including data and control packets), and is robust to external network changes. Its internal scalability can also be verified.

 

Can Software Routers Scale? April 30, 2009

Filed under: R09. Can Software Routers Scale? — mdjacobsen @ 4:20 pm

The authors provide theoretical and observational evidence that PC-based software routing can achieve multi-Gbps routing speeds. The focus is on whether such routers can scale in terms of line and switching rates. They address the processing and bandwidth limitations of PC-based architectures and where bottlenecks arise when forwarding packets at Gbps speeds.

The primary contributions are in identifying where the bottlenecks arise within a PC while doing packet forwarding and how a switching fabric can be constructed that preserves line rate.

For the first, the authors analyze two internal PC topology models: traditional shared bus and point-to-point. They perform some coarse, high-level calculations to determine whether the bandwidths of the PCIe bus, memory channels, and front side buses (FSBs) can support Gbps line rates. They consider this an upper bound. This bound is then tested using a real PC and multi-Gbps network traffic generators. The result is that memory layout and access patterns are a bottleneck. Once these are adjusted to suit the packet sizes, the address bus on the FSB becomes the bottleneck. In these experiments they achieve roughly a 4.4 Gbps forwarding rate. They identify further adjustments to packet descriptor handling that would relieve the bottleneck but do not implement them.

The second contribution comes in the area of switching. Because the line rate must be maintained throughout the entire process, the switching fabric has to support much higher processing rates. The solution they propose is a Valiant load-balanced mesh architecture. In this architecture nodes are connected in a mesh, and input arriving at a node is forwarded to a random intermediate node before it is forwarded on to the destination. This helps alleviate the overload on any output node but introduces the prospect of packet reordering. They investigate several different switching networks based on this model (and variations of it). The results show the maximum link and node capacities of each.
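The two-phase forwarding is easy to sketch. The following is a toy illustration of Valiant load balancing under my own assumptions (a full mesh of `num_nodes` nodes, a uniformly random intermediate, and a fixed seed for repeatability); it is not the paper's implementation and it ignores reordering entirely.

```python
import random
from collections import Counter


def vlb_route(src, dst, num_nodes, rng):
    """Return (source, random intermediate, destination) for one packet.

    If the intermediate happens to equal src or dst, that hop is simply a no-op.
    """
    intermediate = rng.randrange(num_nodes)
    return (src, intermediate, dst)


def intermediate_load(num_nodes, packets_per_source, dst, seed=0):
    """Count how many packets each node relays when every source sends to dst."""
    rng = random.Random(seed)
    load = Counter()
    for src in range(num_nodes):
        for _ in range(packets_per_source):
            _, intermediate, _ = vlb_route(src, dst, num_nodes, rng)
            load[intermediate] += 1
    return load


# Worst-case pattern: all 8 nodes send 1000 packets each to node 3.
load = intermediate_load(num_nodes=8, packets_per_source=1000, dst=3)
```

Even though every packet is headed for the same destination, the first hop spreads the traffic roughly evenly over all eight nodes, which is the property the mesh design relies on.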

The paper does a good job of providing both theoretical analysis and empirical results. Unfortunately, I found their experimental setup to be too simplified. Their routing process uses static routes, which considerably limits the amount of work needed to forward packets. This makes the forwarding rates hard to compare with those of real dedicated routers and undermines the applicability of their results.

I’d expect future research to focus on testing PCs using the customized memory and descriptor changes. Additionally, I’d like to see more reasonable “commodity” PC architectures considered. Multi-core point-to-point PC architectures are still not cheap commodity computers. Replacing low-cost dedicated routers with a higher-priced PC that can perform at the same level may not be that surprising a result.

 

Can Software Routers Scale? April 30, 2009

Filed under: R09. Can Software Routers Scale? — liyunjiu @ 4:20 pm

The paper examines whether software routers can scale to speeds achieved by network processors and what architecture optimizations may be required.

1. First, a high level model was used to predict an upper bound and bottleneck for current general purpose architectures. It is estimated that a shared FSB architecture can achieve speeds up to 10Gbps and point-to-point architecture can achieve speeds up to 40Gbps using current hardware.

2. Experiments were performed to validate the high-level model and provide a more realistic lower bound using unoptimized off-the-shelf platforms. A server with 16 GigE cards was tested. With packet sizes of 1KB, the server forwarded at 14.9Gbps under a 16Gbps load. With packet sizes of 64 bytes, the server could only forward at 3.4Gbps under a 16Gbps load.

3. The initial bottleneck was identified to be the way Linux allocated memory. After the Linux memory allocator was modified, the next bottleneck was discovered to be from the high FSB address bus utilization when sending small packets. This can be partly remedied by modifying the NIC firmware to integrate packets and their descriptors. Other improvements can come from Direct I/O from the NIC to CPU cache and using multi-processor mesh architectures.
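A quick back-of-the-envelope conversion of the measured bit rates into packet rates shows why small packets point to a per-packet bottleneck rather than a per-byte one. This is my own arithmetic, not a figure from the paper: it assumes 1KB means 1024 bytes and ignores Ethernet framing overhead.

```python
def packets_per_second(rate_gbps, packet_bytes):
    """Convert a forwarding rate in Gbps to packets per second."""
    return rate_gbps * 1e9 / (packet_bytes * 8)


large = packets_per_second(14.9, 1024)  # roughly 1.8 million packets/s
small = packets_per_second(3.4, 64)     # roughly 6.6 million packets/s
ratio = small / large
```

The 64-byte case moves far fewer bits but more than three times as many packets per second, so the server is running out of something it spends per packet. That fits the finding that FSB address bus utilization is the limit when sending small packets.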

The paper then explores switching architectures and topology. This can be a topic of future research to implement routers connected in a Valiant mesh. Additional research can also include optimizations to current hardware and software to achieve network processor speeds. There were no technical flaws in the paper.