Network Latency and Breeze Meeting


Section 1: Isolating and Fixing Sources of Accelerated RTMPS Breeze Meeting Latency

There are a number of ways to set up a Breeze Meeting pool behind an HLD/SSL accelerator. This diagram roughly summarizes a general lab configuration consisting of two Breeze servers running the full suite of Breeze features behind an HLD/SSL accelerator. It serves as an example to use as a troubleshooting reference.

Figure 1. Breeze 5 Pool HLD-SSL Latency

As the diagram depicts, when running Breeze Meeting on a server, that server actually acts as two servers behind the HLD/SSL accelerator. The Meeting server instance is a Flash Communication Server running RTMP, while the Communication server is the application server running HTTP. Here is a further breakdown:

  • The meeting1 VIP on the HLD/SSL accelerator 10.10.10.1:443 points to the meeting1 origin server pool at 192.168.0.1:1935
  • The meeting2 VIP on the HLD/SSL accelerator 10.10.10.2:443 points to the meeting2 origin server pool at 192.168.0.2:1935
  • The breeze VIP on the HLD/SSL accelerator 10.10.10.3:443 points to both of the Breeze origin servers in a pool at 192.168.0.3:443 & 192.168.0.4:443

If we add a client to our latency example diagram and abbreviate its route to a secure Breeze Meeting, it would look like this:

  1. From the client IP address 10.15.0.1 we navigate the network and hit the Breeze VIP at https://breeze.macromedia.com at 10.10.10.3:443
  2. The load balancing algorithm on the HLD directs us to the Breeze Communication server at 192.168.0.3:443 or 192.168.0.4:443
  3. The Communication Server directs the meeting room request to a room on one of the Meeting servers at 192.168.0.1:1935 or 192.168.0.2:1935
  4. If the room is hosted on Meeting1 at 192.168.0.1 then it will use the VIP 10.10.10.1; if it is on Meeting2 at 192.168.0.2 then it will use the VIP 10.10.10.2
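
The VIP-to-pool mapping described above can be written down as a small lookup table, which may help when checking a trace against the expected route. This is only a minimal sketch using the lab addresses from this article; the names VIP_POOLS and vip_for_origin are hypothetical and are not part of Breeze or of any accelerator's configuration syntax.

```python
# Lab topology from this article: VIP on the accelerator -> origin pool.
# VIP_POOLS and vip_for_origin are illustrative names, not a Breeze API.
VIP_POOLS = {
    "10.10.10.1:443": ["192.168.0.1:1935"],                    # meeting1
    "10.10.10.2:443": ["192.168.0.2:1935"],                    # meeting2
    "10.10.10.3:443": ["192.168.0.3:443", "192.168.0.4:443"],  # breeze
}


def vip_for_origin(origin):
    """Return the VIP whose pool contains the given origin, or None."""
    for vip, pool in VIP_POOLS.items():
        if origin in pool:
            return vip
    return None


if __name__ == "__main__":
    # A room hosted on Meeting2 is reached through the meeting2 VIP.
    print(vip_for_origin("192.168.0.2:1935"))
```

With the lab values above, a room hosted on 192.168.0.2:1935 resolves to the VIP 10.10.10.2:443, which is the address to look for in a client-side capture.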

Figure 2. Breeze 5 Pool HLD-SSL Latency Client

Most server latency issues with accelerated RTMPS are discovered during a lab deployment or a pre-deployment proof of concept of Breeze. When excessive latency appears in a production environment, begin searching for the source by examining network changes implemented around the time the latency appeared; if there were no apparent network changes, the source may be more difficult to isolate. Adding an accelerator should not add latency, but latency may afflict Breeze Meeting if a hardware-based SSL accelerator for RTMPS is integrated with a production Breeze Meeting server that was previously running either the OpenSSL module built into Breeze or unencrypted without SSL. It is always best practice to test any major change in a lab environment before deploying it into production. It is also good practice to have a rollback or alternative support plan when upgrading your production environment to accelerated RTMPS. This section of the article provides tools and techniques to isolate and eliminate sources of excessive Meeting latency, whether found in a lab environment or in production.

There are many tools available to help sniff out sources of latency. You will almost certainly find that the network engineers who manage your infrastructure have standard operating procedures in place to discover and eliminate sources of latency. The best place to begin troubleshooting is usually in a conference room with your network engineers considering the case examples described in this article and the following general diagnostic steps:

  1. Scrutinize network diagrams showing the topology between the affected clients and the Breeze Meeting server; consider as suspect any antiquated switches that appear in the topology.
  2. Examine the port settings on the client and on the server and check the link speed and duplex settings on the NICs on both the affected clients and the servers.
  3. Are ports 80, 443 and 1935 open on the routers and firewall?
  4. Examine router access lists to make sure there are no incorrect source and destination addresses being filtered.
  5. Proxy servers can be problematic with Breeze Meeting. Are there proxy server arrays between the Breeze servers and the client? Are ports 80, 443 and 1935 permitted? Is port 443 being cached?
  6. Can the client telnet to ports 80, 443, and 1935 on the server?
  7. Run pathping and tracert from the clients to the Breeze server.
  8. Run a sniffing tool on the client, on the server and on the accelerator while connecting to a Meeting and conducting bandwidth intensive activity in the room.
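
The telnet check in step 6 can also be scripted so that it is easy to repeat from several affected clients. The following is a rough sketch using only the Python standard library; the function names port_open and check_ports are hypothetical, and the host name in the comment is a placeholder rather than a real Breeze server. A successful TCP connect only shows that the port is reachable; it says nothing about latency or about what a proxy does to the traffic afterwards.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


def check_ports(host, ports=(80, 443, 1935), timeout=3.0):
    """Map each port to True/False depending on whether it accepts TCP."""
    return {port: port_open(host, port, timeout) for port in ports}


# Example (placeholder host name, substitute your own Breeze VIP):
#   for port, ok in check_ports("breeze.example.com").items():
#       print(port, "open" if ok else "blocked or closed")
```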

The general, high-level sample network diagram pictured below serves as a reminder that there are many potential sources of latency in any network infrastructure; a closer look at any of the seven sections of this diagram could be revealing.

Figure 3. Breeze network diagram

The most common scenario is that the latency is between the VIP and the client (where there can be many potential chokepoints). In the sample diagnostic output that follows, we were sniffing for the source of excessive latency associated with a Breeze Meeting server pool running behind F5 BIG-IP. The BIG-IP was used for failover of the two Breeze Communication servers and also for SSL acceleration of the Communication servers and the Meeting servers (HTTPS and RTMPS). The VIP and client IP addresses in the output below have been changed to match the illustration above: Client 10.15.0.1, Meeting VIP 10.10.10.2.

Here we can see that the BIG-IP Meeting VIP (10.10.10.2) sends data and then sends more data .000008 seconds later.  This is fine if the link is full duplex from end to end:

20:11:10.211300 IP 10.10.10.2.443 > 10.15.0.1.2397: . 22189:23449(1260) ack 22989 win 65535
20:11:10.211308 IP 10.10.10.2.443 > 10.15.0.1.2397: P 23449:23670(221) ack 22989 win 65535

In the following line of output, the selective ACK traffic begins. This indicates that the first of the two packets was dropped: the client acknowledges data only up to 22189 while reporting that it already holds 23449-23670, which tells the server to retransmit the segment covering 22189-23449:

20:11:10.434529 IP 10.15.0.1.2397 > 10.10.10.2.443: . [tcp sum ok] ack 22189 win 64895 <nop,nop,sack sack 1 {23449:23670} >

The BIG-IP sends more data:

20:11:10.434571 IP 10.10.10.2.443 > 10.15.0.1.2397: P 42636:43896(1260) ack 22989 win 65535

The client acknowledges the data, but keeps asking for the packet with sequence number 22189:

20:11:10.512362 IP 10.15.0.1.2397 > 10.10.10.2.443: . [tcp sum ok] ack 22189 win 64895 <nop,nop,sack sack 1 {23449:24930} >
20:11:10.512400 IP 10.10.10.2.443 > 10.15.0.1.2397: P 43896:45156(1260) ack 22989 win 65535
20:11:10.588832 IP 10.15.0.1.2397 > 10.10.10.2.443: . [tcp sum ok] ack 22189 win 64895 <nop,nop,sack sack 1 {23449:26190} >

The BIG-IP retransmits the data for the range 22189 to 23449 requested by the client. This happens .37756 seconds after the BIG-IP first sent that segment, and about .154 seconds after the client's first selective ACK:

20:11:10.588860 IP 10.10.10.2.443 > 10.15.0.1.2397: . 22189:23449(1260) ack 22989 win 65535
20:11:10.617059 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 1 {23449:26632} >

And here we see that the client doesn't have the 22189-23449 segment, nor does it have the 26632-27892 segment:

20:11:10.637668 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {27892:28113}{23449:26632} >
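
The missing segments can be read off a SACK line mechanically: everything between the cumulative ack and the highest SACKed byte that is not covered by a SACK block has not been received. Below is a minimal sketch of that arithmetic, assuming the tcpdump text format shown in this article; the function name missing_ranges and the regular expressions are illustrative and will need adjusting for other tcpdump versions.

```python
import re

# Patterns for the tcpdump text format shown in this article (assumption).
ACK = re.compile(r"\back (\d+)")          # cumulative ack, not "sack"
SACK_BLOCK = re.compile(r"\{(\d+):(\d+)\}")


def missing_ranges(line):
    """Return (start, end) ranges the receiver is still missing,
    between the cumulative ack and the highest SACKed byte."""
    ack = int(ACK.search(line).group(1))
    blocks = sorted((int(a), int(b)) for a, b in SACK_BLOCK.findall(line))
    gaps = []
    cursor = ack
    for start, end in blocks:
        if start > cursor:
            gaps.append((cursor, start))   # hole before this SACK block
        cursor = max(cursor, end)
    return gaps


if __name__ == "__main__":
    line = ("20:11:10.637668 IP 10.15.0.1.2397 > 10.10.10.2.443: . "
            "ack 22189 win 64895 "
            "<nop,nop,sack sack 2 {27892:28113}{23449:26632} >")
    print(missing_ranges(line))  # [(22189, 23449), (26632, 27892)]
```

Applied to the line above, this yields the two gaps discussed in the text: 22189-23449 and 26632-27892.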

The BIG-IP sends the second segment (26632-27892) again, .000027s after it was requested:

20:11:10.637695 IP 10.10.10.2.443 > 10.15.0.1.2397: . 26632:27892(1260) ack 22989 win 65535

And the client acknowledges receipt, but is still missing some of the previous data:

20:11:10.715406 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 1 {23449:27892} >

Now more data from the BIG-IP is dropped, and is being requested by the client:

20:11:10.732727 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {29373:29594}{23449:27892} >
20:11:10.732756 IP 10.10.10.2.443 > 10.15.0.1.2397: . 28113:29373(1260) ack 22989 win 65535

It continues in this vein as evidenced by more requests from the client for dropped data:

20:11:10.812513 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {27892:29373}{23449:27892} >
20:11:10.830130 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {30854:31075}{27892:29373} >
20:11:10.830163 IP 10.10.10.2.443 > 10.15.0.1.2397: . 29594:30854(1260) ack 22989 win 65535
20:11:10.907721 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {29373:30854}{23449:27892} >
20:11:10.924918 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {32335:32556}{29373:30854} >
20:11:10.924945 IP 10.10.10.2.443 > 10.15.0.1.2397: . 31075:32335(1260) ack 22989 win 65535
20:11:11.002895 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {30854:32335}{27892:29373} >
20:11:11.080631 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {32335:33816}{23449:27892} >
20:11:11.156305 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {32335:35076}{23449:27892} >
20:11:11.233920 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {32335:36336}{23449:27892} >
20:11:11.311583 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {32335:37596}{23449:27892} >
20:11:11.389554 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {32335:38856}{23449:27892} >
20:11:11.465114 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {32335:40116}{23449:27892} >
20:11:11.542776 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {32335:41376}{23449:27892} >
20:11:11.622047 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {32335:42636}{23449:27892} >
20:11:11.698273 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {32335:43896}{23449:27892} >
20:11:11.773822 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 22189 win 64895 <nop,nop,sack sack 2 {32335:45156}{23449:27892} >

Finally the client says that it has received all the retransmitted packets 1.6 seconds after the trouble started:

20:11:11.851648 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 45156 win 65520

The traffic flow starts to clean up again:

20:11:11.851698 IP 10.10.10.2.443 > 10.15.0.1.2397: P 45156:46416(1260) ack 22989 win 65535
20:11:11.927472 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 45156 win 65520
20:11:11.927508 IP 10.10.10.2.443 > 10.15.0.1.2397: P 46416:47676(1260) ack 22989 win 65535
20:11:12.005857 IP 10.15.0.1.2397 > 10.10.10.2.443: . ack 45156 win 65520
20:11:12.005886 IP 10.10.10.2.443 > 10.15.0.1.2397: P 47676:48936(1260) ack 22989 win 65535

This example shows a 1.6 second delay captured between the VIP and the client; it is clearly accounted for in the dumps. In this case there was no delay between the Breeze server and the BIG-IP, and from the point of view of the BIG-IP, the longest delay between first sending a segment and retransmitting it is only around .377 seconds. The delay between the first two packets, .000008s, wouldn't be a problem on a full-duplex connection, but when packets sent that close together are dropped, it often indicates a duplex mismatch that prevents the data from traveling across the wire with that short a time delay.
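
The intervals quoted in this walkthrough can be reproduced directly from the timestamps in the capture. The sketch below does that arithmetic with the Python standard library; the helper name seconds_between is hypothetical, and it assumes all timestamps fall on the same day, as they do in this trace.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"  # tcpdump time-of-day stamps as shown in this article


def seconds_between(t1, t2):
    """Seconds elapsed from t1 to t2 (both on the same day)."""
    delta = datetime.strptime(t2, FMT) - datetime.strptime(t1, FMT)
    return delta.total_seconds()


# Timestamps copied from the trace above.
first_send = "20:11:10.211300"   # BIG-IP sends 22189:23449
second_send = "20:11:10.211308"  # BIG-IP sends 23449:23670
first_sack = "20:11:10.434529"   # client's first selective ACK
retransmit = "20:11:10.588860"   # BIG-IP resends 22189:23449
recovered = "20:11:11.851648"    # client acks 45156, stall is over

if __name__ == "__main__":
    print(f"{seconds_between(first_send, second_send):.6f}")  # 0.000008
    print(f"{seconds_between(first_send, retransmit):.6f}")   # 0.377560
    print(f"{seconds_between(first_sack, retransmit):.6f}")   # 0.154331
    print(f"{seconds_between(first_send, recovered):.6f}")    # 1.640348
```

The last figure is the roughly 1.6 second stall seen by the client, while the second is the .377 second interval seen from the BIG-IP side.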

There are many types of legacy switches and routers in the enterprise. In this particular case, after this diagnostic output was presented to a network engineer, he was able to isolate a problematic Cisco 3524XL switch running the following IOS code: c3500XL-c3h2s-mz-120.5.1-XP.bin. A quick search on the 3524XL shows that it is end of life (http://www.cisco.com/en/US/products/hw/switches/ps637/ps642/index.html). After replacing the switch, the latency diminished. This example is paradigmatic of accelerated RTMPS latency issues; it is usually not the Breeze configuration or the SSL accelerator that is the cause of the problem.