Brandon Purcell is a senior product support engineer at Macromedia
- Production Cluster Configuration
- Identifying and Isolating Application Bottlenecks
- Tuning Operating Systems and JVMs
- Testing Load
- Monitoring Load Tests
Applying Real-Time Load Testing Before Going Live
Before a website like macromedia.com goes live, a lot happens behind the scenes to ensure the site will stand up against peak load periods without any performance degradation. This report explores the steps Macromedia took to tune, load test, and cluster the site to prepare it for the go-live date.
Production Cluster Configuration
The new macromedia.com runs on a straightforward, scalable system architecture that consists of the following:
- Apache 2.0.44 web servers, which handle all static content. Apache runs on a cluster of 10 Sun E420s, each of which has four central processing units (CPUs), 4GB RAM, and runs Solaris 8.
- ColdFusion MX for J2EE on JRun application servers, running on a cluster of three Sun E4500s, each of which has eight CPUs, 8GB RAM, and runs Solaris 8. Each application server has two JRun instances running CFMX for J2EE, for a total of six instances across the cluster.
- Oracle database, which supports all of the new applications. The database runs on two mirrored E4500s, each of which has eight CPUs and 8GB RAM.
- F5 Big-IP hardware load balancer, which distributes requests and provides failover across the cluster.
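Totaling the hardware listed above makes the per-instance budget explicit: each application server's eight CPUs are shared by two JRun instances, or four CPUs apiece if load is even. The following is a small Python sketch; the `Tier` class and its field names are hypothetical, the figures are the ones in the list, and the mirrored database tier is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One load-balanced tier of the cluster (illustrative field names)."""
    name: str
    servers: int
    cpus_per_server: int
    ram_gb_per_server: int
    instances_per_server: int = 1  # JRun instances per server on the app tier


# Figures from the architecture list above.
CLUSTER = [
    Tier("web (Apache 2.0.44 on Sun E420)", servers=10, cpus_per_server=4, ram_gb_per_server=4),
    Tier("app (CFMX for J2EE on JRun, Sun E4500)", servers=3, cpus_per_server=8,
         ram_gb_per_server=8, instances_per_server=2),
]


def summarize(tier: Tier) -> dict:
    """Raw totals for a tier, plus CPUs per instance assuming an even split."""
    return {
        "total_cpus": tier.servers * tier.cpus_per_server,
        "total_ram_gb": tier.servers * tier.ram_gb_per_server,
        "instances": tier.servers * tier.instances_per_server,
        "cpus_per_instance": tier.cpus_per_server / tier.instances_per_server,
    }


for tier in CLUSTER:
    print(tier.name, summarize(tier))
```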
Identifying and Isolating Application Bottlenecks
Before running full-scale load tests across the cluster, we first wanted to identify application bottlenecks. For this process, we used a single CFMX for J2EE instance on one of the production servers and load tested the most commonly traveled paths through the site. For instance, on macromedia.com, most users follow this path:
Home Page > Downloads > Membership Login > Product Download.
To test this path, we used the free Microsoft Web Application Stress tool (WAST). WAST uses a proxy to record all requests from the browser, including all Macromedia Flash Remoting calls. After recording the script, we removed all requests for static content (SWF, GIF, JPG, and HTML files) to focus on the performance of ColdFusion requests. While monitoring CPU on both the application server and the database server, we ran load tests for 5-10 minutes, saturating the application server CPU. After each test, we reviewed the WAST reports to identify page requests that took too long to complete. This let us pinpoint bottlenecks in the code, which we addressed by caching queries, optimizing queries, optimizing business logic, and caching frequently accessed data in shared scopes. In many cases, we reduced page request times under load by 30%, and we also greatly reduced the database load. We followed this procedure for every section of the site.
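The filtering and triage steps above can be sketched as a short post-processing script. This is hypothetical and not a WAST feature: it assumes the recorded requests have been exported as (URL, response time in milliseconds) pairs, and the 2,000 ms threshold and sample URLs are arbitrary examples.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Static content that was removed from the recorded script.
STATIC_EXTENSIONS = (".swf", ".gif", ".jpg", ".html")


def is_static(url: str) -> bool:
    """True if the request is for static content rather than a ColdFusion page."""
    return urlparse(url).path.lower().endswith(STATIC_EXTENSIONS)


def slow_pages(samples, threshold_ms=2000.0):
    """Return (path, average ms) for dynamic pages slower than threshold_ms, slowest first."""
    totals = defaultdict(lambda: [0.0, 0])  # path -> [sum of ms, request count]
    for url, ms in samples:
        if is_static(url):
            continue
        path = urlparse(url).path  # ignore query strings when grouping
        totals[path][0] += ms
        totals[path][1] += 1
    averages = [(path, total / count) for path, (total, count) in totals.items()]
    return sorted(
        (item for item in averages if item[1] > threshold_ms),
        key=lambda item: item[1],
        reverse=True,
    )


# Hypothetical exported samples.
samples = [
    ("http://localhost/index.cfm", 150),
    ("http://localhost/downloads/index.cfm", 2600),
    ("http://localhost/downloads/index.cfm", 3400),
    ("http://localhost/images/logo.gif", 5000),
    ("http://localhost/login.cfm?member=1", 2100),
]
print(slow_pages(samples))
```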
Tuning Operating Systems and JVMs
After optimizing each application, it was time to test the production scenario, with two instances on each physical server. At this stage we tuned all operating system (OS) and Java Virtual Machine (JVM) settings. We set the TCP parameter tcp_conn_hash_size, which sizes the kernel's TCP connection hash table, to 8192; we also set both rlim_fd_max and rlim_fd_cur to 4096 to increase the number of file descriptors available to each process.
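On Solaris 8, kernel settings like these are normally placed in /etc/system and take effect at the next reboot. The sketch below renders such a fragment from the values in the text and then checks, from inside a process, that the descriptor limits took effect. The `set tcp:` module prefix reflects the usual /etc/system syntax for TCP tunables, but treat it as an assumption and confirm it against the Solaris tunable parameters documentation for your release.

```python
import resource

# Values from the text; /etc/system syntax is an assumption to verify.
SOLARIS_TUNABLES = {
    "tcp:tcp_conn_hash_size": 8192,  # TCP connection hash table size
    "rlim_fd_max": 4096,             # hard limit on open file descriptors per process
    "rlim_fd_cur": 4096,             # soft (default) limit on open file descriptors
}


def render_etc_system(tunables: dict) -> str:
    """Render one 'set name=value' line per tunable."""
    return "\n".join(f"set {name}={value}" for name, value in tunables.items()) + "\n"


def fd_limits_ok(expected: int = 4096) -> bool:
    """True if this process's soft and hard descriptor limits are at least `expected`."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit", which also satisfies the check.
    return all(lim == resource.RLIM_INFINITY or lim >= expected for lim in (soft, hard))


print(render_etc_system(SOLARIS_TUNABLES), end="")
print("descriptor limits ok:", fd_limits_ok())
```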
The single most important step in improving JVM stability on Solaris is verifying that you have applied the latest patches, listed on the java.sun.com site under the JVM patch cluster. Even so, we found that JVM tuning for a complex application was not straightforward, and the primary issue was garbage collection (GC). When we first started load testing, we used the Sun 1.4.0 JVM. After experiencing stability issues, we switched to the latest Sun 1.4.1 JVM. The newer JVM resolved the stability issues and also provided several new garbage collection options that improved response times.
A complex J2EE application creates a large number of short-lived objects. This forces GC to run more frequently, which can affect the user experience. During one load test, the application performed flawlessly for a while, with response times in the hundred-millisecond range; as memory use grew, however, GC times increased, and full GCs took as long as 15-20 seconds. During each full GC, the JVM suspended all user activity. By tuning specific parameters and enabling the new parallel GC options of the 1.4.1 JVM, we were able to greatly reduce GC times. Full performance tuning details are outside the scope of this article, but you can find most of the information in the following java.sun.com article: “Improving Java Application Performance and Scalability by Reducing Garbage Collection Times and Sizing Memory Using JDK 1.4.1.”
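A common rule of thumb explains why many short-lived objects make collections more frequent. A minor (young generation) collection runs roughly each time the young generation fills, so the interval between collections is about the young generation size divided by the allocation rate. The numbers below are hypothetical and only show the arithmetic; they are not measurements from macromedia.com.

```python
def minor_gc_interval_s(young_gen_mb: float, alloc_mb_per_s: float) -> float:
    """Approximate seconds between minor GCs: the young generation fills at the allocation rate."""
    return young_gen_mb / alloc_mb_per_s


def pause_fraction(pause_s: float, interval_s: float) -> float:
    """Fraction of wall-clock time the application is stopped for collection."""
    return pause_s / interval_s


# Hypothetical: 256 MB young generation, 64 MB/s allocation, 0.1 s per minor GC.
interval = minor_gc_interval_s(256, 64)
print(interval, pause_fraction(0.1, interval))  # 4.0 0.025, i.e. 2.5% of time paused

# Doubling the allocation rate (more short-lived objects) halves the interval.
print(minor_gc_interval_s(256, 128))            # 2.0
```

The same arithmetic shows why young generation sizing is one of the parameters worth tuning: a larger young generation collects less often, at the cost of more memory.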
Testing Load
As mentioned above, we used WAST to optimize the application code. After running into several limitations in the tool, we decided to use Segue SilkPerformer for the final load tests. After optimizing code and tuning the OS and JVM, we focused on load testing a single server with two instances. One of the hurdles we had to overcome was load testing Macromedia Flash Remoting.
Macromedia Flash Remoting uses an HTTP POST but encodes all of its data in Action Message Format (AMF). Both WAST and SilkPerformer recorded and replayed Macromedia Flash Remoting calls flawlessly, but at a certain point we needed to randomize the AMF data in the requests. For example, the membership section of the site uses a Macromedia Flash movie to log in a user. To test the login section, we needed to enter random data so the scripts didn’t reuse the same user account over and over. We wrote a custom SilkPerformer script to randomize the requests made to the Macromedia Flash Remoting gateway.
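The randomization idea does not depend on SilkPerformer's scripting language, which is not reproduced here. As a hypothetical Python sketch, each virtual user draws a distinct test account from a pre-provisioned pool, so concurrent logins never share an account. The `loadtest_user_NNNN` naming scheme and the pool size are assumptions, and the AMF encoding of the login request itself is not shown.

```python
import random


def build_account_pool(size: int, prefix: str = "loadtest_user_"):
    """Hypothetical pre-provisioned test accounts as (username, password) pairs."""
    return [(f"{prefix}{i:04d}", f"pw{i:04d}") for i in range(size)]


def assign_accounts(pool, virtual_users: int, seed=None):
    """Give each virtual user a different, randomly chosen account.

    random.sample draws without replacement, so no two users share an account;
    it raises ValueError if there are more users than accounts.
    """
    rng = random.Random(seed)
    return rng.sample(pool, virtual_users)


pool = build_account_pool(1000)
accounts = assign_accounts(pool, virtual_users=200, seed=42)
# Each pair would supply the arguments of one randomized Flash Remoting login call.
print(accounts[:3])
```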
After we completed testing the single server and were satisfied with the results, we scaled our cluster to multiple servers. Each server is an exact clone. We set up the Solaris environment with the modifications from our earlier tuning and testing, tarred the application server directories containing JRun and CFMX for J2EE, and copied the archive to the other two servers. Cloning makes it easy to add application servers if necessary, and it also simplifies server administration: adding a server is as easy as moving another machine into place with a JRE and copying the directory structure onto it.
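The clone step amounts to archiving one directory tree and unpacking it identically elsewhere. Below is a minimal local sketch using Python's standard tarfile module. The `jrun4` directory and its contents are a hypothetical stand-in for the JRun and CFMX for J2EE install directories, and copying the archive between hosts is left out.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_tree(src_dir: Path, archive_path: Path) -> None:
    """Tar and gzip src_dir under its own name (like `tar czf app.tgz jrun4`)."""
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(src_dir), arcname=src_dir.name)


def clone_tree(archive_path: Path, dest_root: Path) -> Path:
    """Unpack the archive under dest_root and return the cloned top-level directory."""
    dest_root.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "r:gz") as tar:
        top = tar.getnames()[0].split("/")[0]
        if hasattr(tarfile, "data_filter"):
            tar.extractall(str(dest_root), filter="data")  # safer extraction where supported
        else:
            tar.extractall(str(dest_root))
    return dest_root / top


# Local demonstration with a hypothetical directory layout.
with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)
    src = tmp / "source" / "jrun4"
    (src / "bin").mkdir(parents=True)
    (src / "bin" / "jvm.config").write_text("# tuned JVM settings\n")

    archive = tmp / "appserver.tgz"
    archive_tree(src, archive)
    clone = clone_tree(archive, tmp / "server2")
    cloned_config = (clone / "bin" / "jvm.config").read_text()

print(cloned_config, end="")
```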
Additionally, we used historical data from macromedia.com to estimate daily load and peak load, measured in application server page requests per second. From this data, we built load test models that simulated real-world traffic, including not only the total number of page requests per second but also the percentage of traffic for each site section. We also evaluated figures from historical peak traffic periods over the life of macromedia.com. Based on the load test model, we recorded 10-12 load scripts and modified them within SilkPerformer. Each script simulated a different path through the site, and each was assigned a percentage of traffic based on the historical numbers.
One difficulty in load testing based on historical data was that, because of Macromedia Flash Remoting, application server requests no longer mapped one-to-one to historical page views. For example, when the Exchange first loads, the index.cfm page appears first; then each Macromedia Flash element loads within that same index.cfm page, and many of those elements make their own requests to the application server. In other words, a single page view generated 7-8 application server requests (smaller asynchronous calls from the Macromedia Flash client). In the historical data, page views mapped one-to-one to application server requests; with the new Rich Internet Application model, they do not.
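A traffic model like this reduces to a weighted choice: each recorded script receives a share of virtual-user iterations proportional to its historical share of traffic. The script names and percentages below are invented for illustration. SilkPerformer has its own workload settings for this, which this Python sketch does not reproduce.

```python
import random
from collections import Counter

# Hypothetical traffic mix: script name -> percentage of traffic.
TRAFFIC_MIX = {
    "home_downloads_login_download": 40,
    "exchange_browse": 25,
    "support_search": 20,
    "membership_login": 15,
}


def check_mix(mix: dict) -> None:
    """Reject a mix whose percentages do not add up to 100."""
    total = sum(mix.values())
    if total != 100:
        raise ValueError(f"traffic percentages sum to {total}, not 100")


def pick_scripts(mix: dict, iterations: int, seed=None):
    """Choose which script each virtual-user iteration runs, weighted by the mix."""
    check_mix(mix)
    rng = random.Random(seed)
    names = list(mix)
    return rng.choices(names, weights=[mix[n] for n in names], k=iterations)


runs = pick_scripts(TRAFFIC_MIX, iterations=10_000, seed=1)
for name, count in Counter(runs).most_common():
    print(f"{name:32s} {count / len(runs):.1%}")
```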
To account for this difference in designing the load test models, we increased the percentage of requests for the sections that use Rich Internet Applications. Note that while the first page load makes more requests, subsequent interactions make only smaller Macromedia Flash Remoting requests to populate data in the Macromedia Flash client. Read last week’s beta report on optimizing Macromedia Flash Remoting calls to expedite page views in Rich Internet Applications.
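The adjustment can be expressed as simple arithmetic. Multiply each section's historical page views by the number of application server requests one page view generates, then renormalize to get each section's share of application server traffic. In this sketch the sections, page-view rates, and multipliers are hypothetical, except that 7.5 falls within the 7-8 requests per Rich Internet Application page view quoted above. The first-load versus subsequent-request distinction is ignored for simplicity.

```python
# Hypothetical historical page views per second, by site section.
PAGE_VIEWS_PER_S = {"downloads": 30.0, "support": 15.0, "exchange": 5.0}

# App server requests per page view: 1 for classic pages; an RIA page view
# triggers several Flash Remoting calls (7-8 per the text, 7.5 used here).
REQUESTS_PER_VIEW = {"downloads": 1.0, "support": 1.0, "exchange": 7.5}


def app_server_load(page_views: dict, per_view: dict) -> dict:
    """Convert page views/s into app server requests/s for each section."""
    return {s: page_views[s] * per_view[s] for s in page_views}


def traffic_shares(load: dict) -> dict:
    """Each section's percentage of total app server requests."""
    total = sum(load.values())
    return {s: 100.0 * r / total for s, r in load.items()}


load = app_server_load(PAGE_VIEWS_PER_S, REQUESTS_PER_VIEW)
print(load)                  # {'downloads': 30.0, 'support': 15.0, 'exchange': 37.5}
print(traffic_shares(load))  # exchange: 10% of page views, about 45% of requests
```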
After we completed all scripts, it was time to load test the entire cluster. Our first goal was to find the absolute maximum load that the application servers could handle. We started with a few hundred virtual users and continued to increase the load until the application servers’ CPUs peaked. As we increased the load further, response times rose and all threads available for processing requests became saturated. The following graph illustrates this behavior. Request processing scales without a significant increase in average response time until CPU utilization exceeds 90%; beyond that point, average request times increase exponentially.
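Finding the saturation point in a stepped ramp can be automated. In this sketch each step records virtual users, average application server CPU, and average response time. A step is flagged once CPU reaches 90% (the point described above) or once response time exceeds twice the first step's value. The 2x factor and the sample measurements are assumptions, not data from these tests.

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    users: int
    cpu_pct: float
    avg_ms: float


def saturation_step(steps, cpu_limit=90.0, slowdown=2.0) -> Optional[Step]:
    """Return the first ramp step where CPU hits the limit or response time degrades."""
    if not steps:
        return None
    baseline = steps[0].avg_ms
    for step in steps:
        if step.cpu_pct >= cpu_limit or step.avg_ms > slowdown * baseline:
            return step
    return None


# Hypothetical ramp: response time stays flat until CPU nears saturation.
ramp = [
    Step(200, 35.0, 180.0),
    Step(400, 62.0, 190.0),
    Step(600, 84.0, 230.0),
    Step(800, 93.0, 610.0),
    Step(1000, 98.0, 2400.0),
]
print(saturation_step(ramp))  # Step(users=800, cpu_pct=93.0, avg_ms=610.0)
```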
We continued testing at this load for a period and then reduced the user load to verify that the application servers could recover from the overload. Next, we ran a load test that approximated peak daily load for over 24 hours to verify long-term stability. We spent several days testing at different loads to see how the site responded. During this period, we made slight modifications to the application, OS settings, and JVM settings until we had tuned the architecture and applications for maximum performance and stability.
We encountered one issue with multiple instances that was not apparent with a single instance: a threading problem in business logic that accessed the database. When we load tested a single instance, a named cflock tag prevented concurrent access to that piece of business logic. Across the cluster, however, it did not: a named lock exists only in the memory of one ColdFusion instance, so requests on different instances could run the locked code at the same time, and the values they inserted violated database constraints. By load testing the cluster, we were able to identify this problem and fix it before it became a production issue.
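The root cause generalizes beyond ColdFusion: an in-memory lock serializes threads only inside one process, so each instance holds its own lock. This Python sketch uses `threading.Lock` as a stand-in for a named cflock and SQLite as a stand-in for Oracle; the `registrations` table and `register_once` helper are hypothetical. Letting the database enforce uniqueness atomically and handling the rejection is one common remedy, though the text does not say which fix the team applied.

```python
import sqlite3
import threading

# Each JRun instance is a separate JVM, so each has its own copy of a named lock.
lock_on_instance_a = threading.Lock()
lock_on_instance_b = threading.Lock()


def locks_exclude_each_other() -> bool:
    """Hold instance A's lock and check whether instance B can still take its own."""
    with lock_on_instance_a:
        got_b = lock_on_instance_b.acquire(blocking=False)
        if got_b:
            lock_on_instance_b.release()
    return not got_b


def register_once(conn: sqlite3.Connection, member_id: str) -> bool:
    """Insert a row; the PRIMARY KEY makes check-and-insert atomic in the database."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO registrations (member_id) VALUES (?)", (member_id,))
        return True
    except sqlite3.IntegrityError:
        return False  # another instance already inserted this member


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE registrations (member_id TEXT PRIMARY KEY)")
print(locks_exclude_each_other())        # False: per-instance locks do not coordinate
print(register_once(conn, "member-42"))  # True
print(register_once(conn, "member-42"))  # False: duplicate rejected by the database
```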
Monitoring Load Tests
Monitoring the load tests let us identify problems quickly as they surfaced. During each test we monitored CPU utilization, JRun metrics, JRun logs, CFMX for J2EE logs, client response times, and hardware load balancer throughput and connections.
We monitored CPU levels with top or prstat on the web server, application server, and database server. On the application server, we looked for equal CPU utilization across both instances and noted the points at which CPU load peaked sharply. On the database server, the goal was to keep the database from being overloaded even when the application was under high load; extensive query caching let us accomplish this goal. JRun metrics provide key information, logging thread usage, JVM heap size, and session information. By modifying the JRun logging configuration, we redirected metrics data to its own log, which made it easier to interpret.
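In ColdFusion, this kind of query caching is typically done with the cachedWithin attribute of cfquery, which reuses a query's result set for a given time span. The sketch below shows the same time-based idea in Python. It illustrates the technique, not ColdFusion's implementation; the `run_query` function and the ten-minute lifetime are hypothetical.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class TimedQueryCache:
    """Reuse a query's result for ttl_s seconds, keyed by SQL text and parameters.

    Not thread-safe: a cache shared across request threads would need a lock.
    """

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._entries: Dict[Hashable, Tuple[float, object]] = {}
        self.misses = 0  # how many times the database was actually queried

    def get(self, sql: str, params: tuple, run_query: Callable[[str, tuple], object]):
        key = (sql, params)
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_s:
            return entry[1]  # still fresh: no database round trip
        self.misses += 1
        result = run_query(sql, params)
        self._entries[key] = (now, result)
        return result


def run_query(sql, params):
    """Hypothetical stand-in for a real database call."""
    return [("product-a", 1), ("product-b", 2)]


fake_now = [0.0]  # controllable clock so the example is deterministic
cache = TimedQueryCache(ttl_s=600, clock=lambda: fake_now[0])

for _ in range(1000):  # 1,000 requests within the ten-minute lifetime
    cache.get("SELECT name, id FROM products", (), run_query)
fake_now[0] = 601.0    # lifetime expired: the next request goes to the database
cache.get("SELECT name, id FROM products", (), run_query)
print(cache.misses)    # 2
```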
We monitored the following during this time:
- The JRun out and err logs for any JRun errors during testing. We used tail -f for each of these logs and monitored them while the system was under load.
- CFMX for J2EE logs (the exception.log, application.log, and flash.log at /WEB-INF/cfusion/logs/).
- The F5 Big-IP hardware load balancer, which provides throughput and connection data. Our goal was to balance load equally across the cluster.
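Watching several logs with tail -f works, but the same check can be scripted. This sketch polls a file for newly appended lines and flags those that look like errors. The keyword list is an assumption, and the file name in the example is a placeholder rather than an actual JRun or CFMX log path.

```python
import tempfile
from pathlib import Path

# Assumed markers of trouble; adjust to the real log formats.
ERROR_MARKERS = ("error", "exception")


def read_new_lines(path: Path, offset: int):
    """Return complete lines appended since `offset`, plus the new offset (a polling tail -f).

    Log rotation and truncation are not handled in this sketch.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    end = data.rfind(b"\n") + 1  # a half-written last line waits for the next poll
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def flag_errors(lines):
    """Lines containing any error marker, case-insensitively."""
    return [line for line in lines if any(m in line.lower() for m in ERROR_MARKERS)]


# Demonstration; in practice, poll each log every few seconds with one offset per file.
with tempfile.TemporaryDirectory() as tmp_name:
    log = Path(tmp_name) / "instance1-out.log"  # placeholder name
    log.write_text("server started\n")
    _, offset = read_new_lines(log, 0)  # skip what was already there
    with open(log, "a") as f:
        f.write("request ok\njava.lang.OutOfMemoryError: Java heap space\npartial li")
    new_lines, offset = read_new_lines(log, offset)
    flagged = flag_errors(new_lines)

print(flagged)
```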
We experimented with different load balancing algorithms and identified the one that distributed load most evenly across the cluster. We also tested website responsiveness under different load scenarios and timed pages with a stopwatch to gauge the user experience.
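A toy simulation shows why the choice of algorithm matters. It compares two common policies, round robin and least connections, when request durations vary. It does not model BIG-IP's implementation, the workload is invented, and the text does not say which algorithm the team chose.

```python
from typing import Callable, List

Policy = Callable[[int, List[List[float]]], int]


def round_robin(i: int, active: List[List[float]]) -> int:
    """Send request i to server i mod n, ignoring current load."""
    return i % len(active)


def least_connections(i: int, active: List[List[float]]) -> int:
    """Send the request to the server with the fewest open connections (lowest index on ties)."""
    return min(range(len(active)), key=lambda s: len(active[s]))


def peak_connections(policy: Policy, durations: List[float], servers: int = 3) -> List[int]:
    """One request arrives per time unit; return each server's peak concurrent connections."""
    active: List[List[float]] = [[] for _ in range(servers)]  # end times of open connections
    peaks = [0] * servers
    for i, duration in enumerate(durations):
        now = float(i)
        for s in range(servers):
            active[s] = [end for end in active[s] if end > now]  # drop finished requests
        target = policy(i, active)
        active[target].append(now + duration)
        peaks[target] = max(peaks[target], len(active[target]))
    return peaks


# Hypothetical workload: every third request is slow (for example, a heavy report page).
workload = [9.0, 1.0, 1.0] * 20
print("round robin:      ", peak_connections(round_robin, workload))
print("least connections:", peak_connections(least_connections, workload))
```

With this workload, round robin sends every slow request to the same server, while least connections spreads them out.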
Because of the problems we ran into with garbage collection, we enabled verbose GC output in the JVM to monitor garbage collection times. All versions of the Sun JVM support the -verbose:gc flag, which logs this information. The 1.4.1 JVM also supports several new flags (-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC), which log more granular, per-generation garbage collection details to the J2EE application server's out log.
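Pause times can be extracted from the -verbose:gc output with a short script. This sketch assumes the classic one-line format of that JVM generation, such as `[GC 325407K->83000K(776768K), 0.2300771 secs]`, optionally prefixed by a -XX:+PrintGCTimeStamps timestamp; check the pattern against your own logs. The -XX:+PrintGCDetails output nests per-generation detail and would need a different pattern. The sample lines are made up.

```python
import re

# Matches "[GC 325407K->83000K(776768K), 0.2300771 secs]" or "[Full GC ...]",
# anywhere in the line, so a leading timestamp is allowed.
GC_LINE = re.compile(
    r"\[(?P<kind>Full GC|GC) (?P<before>\d+)K->(?P<after>\d+)K"
    r"\((?P<heap>\d+)K\), (?P<secs>[\d.]+) secs\]"
)


def summarize_gc(lines):
    """Count collections and total/worst pause times, split into minor and full GCs."""
    stats = {kind: {"count": 0, "total_s": 0.0, "max_s": 0.0} for kind in ("GC", "Full GC")}
    for line in lines:
        m = GC_LINE.search(line)
        if not m:
            continue  # not a GC line in the expected format
        entry = stats[m.group("kind")]
        secs = float(m.group("secs"))
        entry["count"] += 1
        entry["total_s"] += secs
        entry["max_s"] = max(entry["max_s"], secs)
    return stats


sample = [
    "[GC 325407K->83000K(776768K), 0.2300771 secs]",
    "[GC 325816K->83372K(776768K), 0.2454258 secs]",
    "[Full GC 267628K->83769K(776768K), 1.8479984 secs]",
]
print(summarize_gc(sample))
```

Tracking the worst full GC pause over a long test is a direct way to catch the multi-second stalls described earlier.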
The Macromedia focus on experience drives us to provide a scalable, highly available website. Testing the apps and architecture with real-world load ensures our customers have a solid and successful experience. We believe that experience matters, and we’ve put that into practice with macromedia.com.
About the Author
Brandon Purcell is a senior product support engineer at Macromedia with over seven years of experience developing, maintaining, and supporting web-based applications. Brandon has been working with ColdFusion for over six years and has over two years of experience with J2EE and Macromedia JRun. With the project complete, Brandon would like to acknowledge his wife for her patience as he worked every night and weekend for three weeks straight. He would also like to acknowledge Chris Elgart, on the Macromedia web development team, with whom he worked side by side during the testing and tuning process to achieve a fully optimized macromedia.com.