This entry was prompted by a question posed on Slashdot. It is meant to shed light on the real issues surrounding QoS on most commercial home-grade links.
There’s a huge number of people who don’t know what the issues with QoS actually are. An unfortunately common misconception is that configuring CPE QoS features == successful QoS. Also common among the ‘enthusiast’ group is the belief that dropping in a Linux-based QoS-enabled distro is a panacea for QoS (eg ‘Put in IPCop == solved’).
Where a link is a simple, as-designed point-to-point (ptp) circuit with no re-encapsulation (eg E1/SHDSL), QoS can be implemented effectively using in-box configuration. Home-class routers and OS-based control will work, provided QoS is configured at each endpoint.
Unfortunately, few people have the kind of service on which out-of-box QoS can be used effectively.
The root issue of QoS today is the lack of outbound interface queuing. Unless one can ‘see’ the hardware/egress packet queue at its final point, QoS will not work successfully (short of creating a ‘dummy’ artificial queue; see below). QoS comes into play when a link is congested, or when smaller frames/packets need to be sent before larger ones.
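To put a rough number on why small packets may need to go ahead of large ones, here is a minimal sketch of the time it takes to clock a single packet onto the wire. The 1 Mbit/s upstream rate and the packet sizes are assumptions chosen purely for illustration.

```python
def serialization_delay_ms(packet_bytes: int, link_bps: int) -> float:
    """Time needed to clock one packet onto the link, in milliseconds."""
    return packet_bytes * 8 * 1000 / link_bps


# Hypothetical 1 Mbit/s upstream (an assumption, not a measured figure).
UPSTREAM_BPS = 1_000_000

full_size = serialization_delay_ms(1500, UPSTREAM_BPS)  # bulk transfer packet
small = serialization_delay_ms(200, UPSTREAM_BPS)       # small time-sensitive packet

print(f"1500-byte packet: {full_size:.1f} ms")
print(f" 200-byte packet: {small:.1f} ms")
```

On a link this slow, a small packet stuck behind even a handful of full-size ones waits tens of milliseconds before it is sent, which is the delay the queuing policy is there to avoid.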
The first component of QoS is the "queue". This is the ‘waiting list’ for packet egress (and ingress to a small degree, especially where link-congestion indication is supported). Here is where packets line up to be sent. QoS works by managing the priority of each packet (either by classification or by respecting the IP precedence/DSCP marking in the header). In a simple ptp link, this works well and as intended. Policy is applied at each end, and each router applies a queuing strategy and policy to the packets going out. Bandwidth is known, and the load on the link is directly observable by queue depth. Even applying ‘fancy queuing’ (eg weighted round-robin, Random Early Detection) is often enough to ensure equitable access.
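As a rough illustration of the queue described above, the following is a minimal sketch of a strict-priority egress queue keyed on the DSCP marking. The class and method names are hypothetical, and treating a numerically larger DSCP value as higher priority is a simplification assumed for the example. Real routers map code points to classes by policy.

```python
import heapq
from itertools import count

# DSCP code points: 46 is Expedited Forwarding (EF), 0 is best effort.
DSCP_EF = 46
DSCP_BEST_EFFORT = 0


class PriorityEgressQueue:
    """Strict-priority queue: highest DSCP leaves first, FIFO within a class.

    Assumption for this sketch: a larger DSCP value means higher priority.
    """

    def __init__(self):
        self._heap = []
        self._seq = count()  # arrival counter keeps FIFO order within a class

    def enqueue(self, dscp, payload):
        # heapq is a min-heap, so negate the DSCP to pop the largest first.
        heapq.heappush(self._heap, (-dscp, next(self._seq), payload))

    def dequeue(self):
        _, _, payload = heapq.heappop(self._heap)
        return payload

    def depth(self):
        """Queue depth: the directly observable measure of load on the link."""
        return len(self._heap)


q = PriorityEgressQueue()
q.enqueue(DSCP_BEST_EFFORT, "bulk-1")
q.enqueue(DSCP_BEST_EFFORT, "bulk-2")
q.enqueue(DSCP_EF, "voice-1")

order = [q.dequeue() for _ in range(q.depth())]
print(order)  # ['voice-1', 'bulk-1', 'bulk-2']
```

The voice packet arrived last but leaves first. This only helps if packets actually wait in this queue, which is the point of the rest of this entry.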
Unfortunately, in this day and age, most link implementations remove the ability to observe queue depth: by shaping in the middle, by virtualising the terminating interface, or by multi-hopping the termination. All three at once is common.
I am discussing DSL specifically as it’s the most common access technology, but the same applies to cable, as one can rarely apply QoS inside the cable CPE.
Originally, DSL-based WANs were intended to be implemented over an ATM-native network, where the CPE configures one or more PVCs and these terminate on the decapsulating aggregation router. Each PVC (at each end) has known properties (bitrate, class of service). The intent was also to have multiple PVCs (eg VPI/VCI virtual circuits), each with different Class of Service parameters, to manage QoS.
Sadly this model has been discarded, and almost all DSL services ‘ignore’ the ATM component and just treat it as a ‘throwback’. One CPE has one PVC, and this is terminated on an AVGC far outside the reach of the actual ISP with the ATM component having no real impact other than simply being an access transport. To complicate matters, usually shaping is applied at the AVGC end, and the whole path is a series of virtualised interfaces double and triple encapsulating the link. So we have lost the ability to ‘see’ the actual queues – you can only throw packets at the link and hope they come out the other end.
So now that the native ATM opportunity for traffic management is gone, what next? As mentioned, most commercial DSL operates using L2TP multihop and virtual interfaces. A virtual interface has no real queue and, in the Cisco world at least, there’s no practical method to do anything with the provider end of the link. The QoS mechanism cannot ‘see’ what the link is doing load-wise and as such can’t effectively manage service.
So, what’s left?
The only avenue open is to create a ‘false queue’. This is effectively an artificial bottleneck.
Here, you implement traffic shaping by taking the maximum bandwidth available in one direction, allowing for the bandwidth tax of multiple encapsulations (eg PPP, L2TP, ATM, which can often be difficult to determine), and subtracting say 10%. Because the shaper then runs slightly slower than the real link, packets queue up locally where the QoS process can actually see them.
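The arithmetic can be sketched roughly as follows. The only hard numbers here are the ATM cell sizes (48 bytes of payload in each 53-byte cell). The sync rate, the per-packet encapsulation overhead and the function names are assumptions for illustration, and the real overhead depends on the provider’s encapsulation stack.

```python
ATM_CELL_BYTES = 53      # one ATM cell on the wire
ATM_PAYLOAD_BYTES = 48   # payload bytes carried per cell


def atm_wire_bytes(ip_packet_bytes: int, encap_overhead_bytes: int) -> int:
    """Bytes sent on the ATM link for one IP packet.

    encap_overhead_bytes is the per-packet encapsulation overhead (PPP and
    similar); its value is an assumption you must establish for your own link.
    """
    framed = ip_packet_bytes + encap_overhead_bytes
    cells = (framed + ATM_PAYLOAD_BYTES - 1) // ATM_PAYLOAD_BYTES  # round up
    return cells * ATM_CELL_BYTES


def shaped_ip_rate_bps(sync_rate_bps: float, ip_packet_bytes: int,
                       encap_overhead_bytes: int, headroom: float = 0.10) -> float:
    """IP-level rate to configure on the shaper for a given packet size."""
    efficiency = ip_packet_bytes / atm_wire_bytes(ip_packet_bytes, encap_overhead_bytes)
    return sync_rate_bps * efficiency * (1.0 - headroom)


# Hypothetical link: 1 Mbit/s upstream sync, 40 bytes of overhead per packet.
for size in (1500, 100):
    rate = shaped_ip_rate_bps(1_000_000, size, 40)
    print(f"{size:>4}-byte packets: shape to about {rate / 1000:.0f} kbit/s")
```

Small packets pay a proportionally larger encapsulation tax than large ones, which is one reason a single safe figure is hard to pick and the 10% margin is only a starting point.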
It can be done inside some routers, although many home-grade CPE implementations are unacceptable in that they force you to allocate a fixed percentage of bandwidth to a certain service (and often this is TCP/UDP port based which is very limiting in itself).
This is the only way to make QoS work in the current environment. The limitations are obvious: it must also be done at the provider end, 10% or more of the bandwidth is lost, and if there is any congestion between the two endpoints QoS will not be effective.
There is also the option of the CPE router using coarse techniques to ‘poison’ or control non-sensitive traffic in the presence of time-sensitive traffic. This can be done by manipulating TCP resets, ACKs and window sizes. This approach has many limitations and I cannot recommend it as a practical alternative.
So, if you’re after QoS, the CPE router must support traffic-shaping and QoS for upstream control. Your provider must also be willing to provide the same service at the head end. You lose bandwidth, and any congestion between ends can’t be accounted for.
Pretty horrible, isn’t it? Until the IT industry gets off its backside and addresses these issues, the most practical solution for low-latency, low-jitter, low-loss internet is the same as it’s always been: get a bigger pipe.