ColdFusion 8 server monitoring – Part 3: Automated monitoring and request management with Alerts and Snapshots

Automated monitoring and request management with Alerts

Many people think of a system monitor in terms of its graphical interface, with charts, graphs, and reports that reflect the current status of the system and its components. As valuable as those are, you'd need to be watching that interface to know when a problem situation occurs.

What if you could instead receive notification of a problem situation by e-mail or, perhaps, execute a ColdFusion component (CFC) to do any sort of processing (write data to a log file or database, and the like)? And what if you could also specify that the system should generate additional details about the current state of the system (a new feature called “Snapshots” in ColdFusion 8), to help better diagnose the problem? The Alerts feature of the ColdFusion 8 Server Monitor offers this and more.

More than just alerting you about problems, the tool also offers the means to manage request processing, including options to terminate (kill) running requests, reject any new requests during the problem state, or even perform a Java Virtual Machine (JVM) garbage collection operation. While the ColdFusion Administrator has long offered a means to "timeout requests" that ran too long, the alert mechanism takes the functionality that much further, bringing a substantial new dimension of "unattended" monitoring.

Alerts can be created to detect, report on, and respond to four kinds of problem states:

  • Unresponsive Server (too many requests taking too long)
  • Slow Server (average response time too high)
  • JVM Memory (too much memory used)
  • Timeouts (too many requests timing out)

Each of these problem states will be detailed later in this article.

Note: Alerts are triggered only if both of the following are true: the alert is enabled (discussed in the next section), and you have selected the Start Monitoring option in the Server Monitor, as discussed in Part 1 of this series. Although one of the alerts relates to JVM memory, that alert does not require the “Start Memory Tracking” button to be enabled.

Toward 24x7 operations

Alerts present a significant paradigm shift in the management of ColdFusion servers. If they are set up properly, ColdFusion could conceivably never go offline. Previously, when something went wrong, you may have been forced to restart ColdFusion. Now, alerts can notify administrators of a problem, create a snapshot of the environment to help determine the source, and even automatically fix the problem by killing threads, calling garbage collection, rejecting new requests, and/or executing custom code. Each of these features is discussed in more detail later in this article.

Configuring Alerts

Unlike the features covered in the previous two articles, which were accessible through the Overview (main page) or Statistics tab of the monitor interface, the Alerts feature is configured through its own tab (the Alerts tab), shown at the top of the Monitor. Once selected, this tab offers two links on the left navigation bar. The first link shows any current Alert notifications (discussed later). Clicking the second link, Alerts Configuration, shows a page that allows you to create or edit alerts (see Figure 1).

Figure 1. The Alerts Configuration page

On the Alerts Configuration page, there is a tab for each type of alert that you can set. On each tab, the first option is a check box to enable the alert. Until you select this check box, you cannot select any of the other options for the alert (see Figure 1). Once you select it, you can set the threshold at which the alert will be detected and indicate the actions that ColdFusion should take during a problem state.

Available actions

The actions are the same for all the alert types with one variation. You can choose to:

  • Send e-mail (to one or more e-mail addresses specified in the last tab, Email Settings)
  • Dump a snapshot (Snapshots are a depiction of the current state of the system)
  • Kill threads running longer than a specified number of seconds
  • Reject any new requests
  • Run a processing CFC
  • Perform garbage collection (only with the JVM Memory alert)

I’ll discuss each of these actions later, after discussing the types of alerts. Changes made on this screen take effect immediately. You don't need to restart the ColdFusion server.

Types of alerts

The following are the types of alerts and their available threshold settings.

Unresponsive Server

The Unresponsive Server alert detects too many requests taking too long. It offers two threshold values: Hung Thread Count and Busy Thread Time (in seconds). If the number of requests specified by Hung Thread Count is detected to be executing for longer than the Busy Thread Time, the server is considered unresponsive and this alert is triggered. While the Request Timeout setting in the ColdFusion Administrator (Server Settings > Settings) sets the maximum time any single request may be allowed to run, this alert triggers when some number of simultaneous requests exceed a given running time, giving finer control.

Also, there are some operations that can’t be immediately interrupted by the Request Timeout feature (discussed later), so this alert can also serve to back up that setting to notify you of requests exceeding that expected timeout.

You want to avoid having so many threads become unavailable for so long that eventually the server becomes unable to process new requests. This alert can warn you (by e-mail) when you have reached this state, or you can choose to attempt to terminate threads, reject new requests, take a snapshot, or run a CFC. If you take a snapshot, it lists the threads that are detected to be running too long, in addition to showing a stack trace (or “thread dump”) of all running threads. Snapshots (and stack traces) and other alert actions are discussed later.
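The two-threshold test can be pictured with a short CFML sketch. This is only an illustration of the rule as described, not the Server Monitor's own code: the function name isServerUnresponsive and its argument names are hypothetical, and treating the alert as firing when the count reaches (rather than exceeds) Hung Thread Count is an assumption.

```cfml
<!--- Illustrative only: a hypothetical helper, not part of the Server Monitor --->
<cffunction name="isServerUnresponsive" returntype="boolean">
    <!--- Elapsed run time, in seconds, of each currently running request --->
    <cfargument name="runningTimes" type="array" required="yes">
    <cfargument name="hungThreadCount" type="numeric" required="yes">
    <cfargument name="busyThreadTime" type="numeric" required="yes">
    <cfset var busy = 0>
    <cfset var i = 0>
    <!--- Count the requests that have run longer than Busy Thread Time --->
    <cfloop from="1" to="#arrayLen(arguments.runningTimes)#" index="i">
        <cfif arguments.runningTimes[i] GT arguments.busyThreadTime>
            <cfset busy = busy + 1>
        </cfif>
    </cfloop>
    <!--- Assumption: the alert fires once the count reaches the threshold --->
    <cfreturn busy GTE arguments.hungThreadCount>
</cffunction>

<!--- Example: three of these four requests have run longer than 20 seconds --->
<cfset times = listToArray("45,31,2,60")>
<cfoutput>#isServerUnresponsive(times, 3, 20)#</cfoutput>
```

The point of the sketch is that a single slow request never satisfies the test on its own; only a group of simultaneous slow requests does.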

Slow Server

The Slow Server alert detects when the average response time for processing requests reaches a specified threshold. It offers a single Response Time Threshold (in seconds). This is compared to the average response time for all requests, as computed over an interval configured in the Server Monitor’s settings page (at the top right of the Server Monitor, as discussed further in Part 4 of this series). The current average response time is displayed in the Average Response Time chart on the Server Monitor’s Overview page.

If the average response time of currently running requests is greater than the threshold time, the alert is triggered, with the same available actions as for the Unresponsive Server alert. (Note that if a snapshot is taken, it does not list the threads running, though it does show the stack trace.)

JVM Memory

The JVM Memory alert detects when ColdFusion is using a certain amount of RAM. If the JVM memory used by ColdFusion is greater than the threshold value (in megabytes), a JVM Memory alert is activated. Choose a value that makes sense relative to the Maximum JVM Heap Size, which you can set in stand-alone deployments of ColdFusion through the Administrator (on the Server Settings > Java and JVM page), or in the jvm.config file for multiserver and J2EE deployments. You want to avoid a situation where the JVM memory use grows so large that you reach an out-of-memory condition. This alert can warn you when you are approaching this state. When triggered, this alert can take the actions described so far and can also be configured to perform garbage collection, as discussed later.
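When choosing a threshold, it can help to see the heap figures the JVM itself reports. The following is a minimal sketch that reads them through the standard java.lang.Runtime class. The 80% figure is an arbitrary starting point chosen for illustration, and the numbers reported here may not match exactly what the Server Monitor measures.

```cfml
<!--- Sketch: read JVM heap figures through the standard java.lang.Runtime class --->
<cfset runtime = createObject("java", "java.lang.Runtime").getRuntime()>
<cfset bytesPerMB = 1024 * 1024>
<cfset usedMB = (runtime.totalMemory() - runtime.freeMemory()) / bytesPerMB>
<cfset maxMB = runtime.maxMemory() / bytesPerMB>

<!--- Hypothetical starting point: 80% of the maximum heap (an arbitrary choice) --->
<cfset candidateThresholdMB = int(maxMB * 0.8)>

<cfoutput>
Used heap: #round(usedMB)# MB<br>
Maximum heap: #round(maxMB)# MB<br>
Candidate alert threshold: #candidateThresholdMB# MB
</cfoutput>
```

Running a page like this a few times under normal load gives a rough sense of typical usage, so that the threshold sits above it but comfortably below the maximum heap.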

Timeouts

The Timeouts alert detects when too many requests are timing out. It offers two threshold values: Timeout Counts and Time Interval (in seconds). If the number of requests specified in Timeout Counts time out within the time interval specified by Time Interval, a Timeout alert is triggered. These timeouts are triggered by the Request Timeout feature in the Administrator (Server Settings > Settings). While it's helpful that ColdFusion can time out requests, you can use this alert to let you know when it's happening too often, as well as to take most of the aforementioned actions to maintain server stability. (A snapshot, if taken as an action, does not list the requests that timed out, but you can find more information on the timed out request(s) in the logs in ColdFusion’s runtime/logs directory.)
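The count-within-an-interval rule can be sketched as follows. The function isTimeoutAlert, its argument names, and the use of "reaches" rather than "exceeds" for the count are assumptions made for illustration; the Server Monitor's actual bookkeeping is not documented here.

```cfml
<!--- Illustrative only: a hypothetical helper, not part of the Server Monitor --->
<cffunction name="isTimeoutAlert" returntype="boolean">
    <!--- Date/time at which each recent request timed out --->
    <cfargument name="timeoutTimes" type="array" required="yes">
    <cfargument name="timeoutCount" type="numeric" required="yes">
    <cfargument name="timeInterval" type="numeric" required="yes">
    <cfset var rightNow = now()>
    <cfset var recent = 0>
    <cfset var age = 0>
    <cfset var i = 0>
    <cfloop from="1" to="#arrayLen(arguments.timeoutTimes)#" index="i">
        <!--- Age of this timeout, in whole seconds --->
        <cfset age = dateDiff("s", arguments.timeoutTimes[i], rightNow)>
        <cfif age GTE 0 AND age LTE arguments.timeInterval>
            <cfset recent = recent + 1>
        </cfif>
    </cfloop>
    <!--- Assumption: the alert fires once the count reaches the threshold --->
    <cfreturn recent GTE arguments.timeoutCount>
</cffunction>

<!--- Example: timeouts 5, 40, and 300 seconds ago, checked against 2 in 60 seconds --->
<cfset recentTimeouts = arrayNew(1)>
<cfset arrayAppend(recentTimeouts, dateAdd("s", -5, now()))>
<cfset arrayAppend(recentTimeouts, dateAdd("s", -40, now()))>
<cfset arrayAppend(recentTimeouts, dateAdd("s", -300, now()))>
<cfoutput>#isTimeoutAlert(recentTimeouts, 2, 60)#</cfoutput>
```

In the example, only the two most recent timeouts fall inside the 60-second window, and the oldest one is ignored.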

Viewing Alerts data

If one of the enabled alerts is triggered due to exceeding the threshold value, there are three ways that you can observe the alert notifications.

Alerts page in Server Monitor

First, any alerts triggered will be displayed in the first page of the Alerts tab (see Figure 2).

Figure 2. The Alerts Notification page

Each alert message should eventually be followed in time by another alert message indicating when the server has recovered from the problem state. The recovery message will indicate if any actions were taken during the alert, including how many requests were killed, whether requests were rejected, and so forth.

If an alert has caused the creation of a snapshot, discussed in the next section, an icon will display to the left of the alert, as shown in two instances in Figure 2. Notice that you can also delete either an individual alert notification or all of them by using the buttons at the top of the page.

This display of Alert information will only remain in the Server Monitor as long as ColdFusion is running. Upon restart, the information is cleared. But it's not entirely lost, as the very same information is tracked in available log files.

Alerts tracked in log files

Another way to view alert messages is in ColdFusion's log files. Note that I say "files," because alerts are actually tracked in two different files (though the same alert information is offered in each).

First, the information is written to a monitor.log file in the traditional ColdFusion logs directory, such as C:\ColdFusion8\logs in the standalone edition, or C:\JRun4\servers\cfusion\cfusion-ear\cfusion-war\WEB-INF\cfusion\logs for the default server in a multiserver deployment.

The same alert logging data is also written to the ColdFusion -out log, along with a considerable amount of other logging information that has traditionally been written to that file. In the standalone deployment of ColdFusion, the -out log file is in C:\ColdFusion8\runtime\logs\ (as coldfusion-out.log). In the multiserver deployment, it is in C:\JRun4\logs\ as cfusion-out.log for the default server; for any other instance, replace the "cfusion" portion with the name of that instance.

E-mail notifications

Still another way to see alert notifications is by way of e-mail, discussed in the next section on available actions.

Available alert actions

For each kind of alert, there are several available actions to take when the alert is triggered. Each of these is described below.

Note: You can enable these actions either before an alert has been triggered or while it is active (in other words, after it has been triggered but before it has recovered).

Send E-mail

If you select the "Send E-mail" action for any of the alerts, an e-mail will be sent when an alert is triggered and when it recovers, if you have configured an e-mail address in the Email Settings tab of the Alert Configuration page.

Note: The Email Settings page requires that you have configured the mail server settings in the ColdFusion Administrator (in the Server Settings > Mail page) to set an SMTP server for sending e-mails from ColdFusion.

You may specify multiple e-mail addresses by separating them with commas. Semicolons will not work, though you won't receive an error message.

You can confirm that e-mails are being sent by viewing the aforementioned log entry in the monitor.log file. Each alert that fires, which has been set to send e-mail, will report if it did or did not send an e-mail (by adding "Email notification sent" or "Failed to send email notification" to the log entry).
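As a rough sketch of how you might check for failed notifications without opening the log by hand, the following loops over monitor.log and prints the entries containing the failure phrase quoted above. The path assumes the default standalone install location mentioned earlier; adjust it for your deployment. The exact layout of each log line is not assumed, only that the phrase appears somewhere in the entry.

```cfml
<!--- Assumes the default standalone install path; change it for your server --->
<cfset logPath = "C:\ColdFusion8\logs\monitor.log">

<cfif fileExists(logPath)>
    <!--- cfloop with the file attribute reads the log one line at a time --->
    <cfloop file="#logPath#" index="logLine">
        <cfif findNoCase("Failed to send email notification", logLine)>
            <cfoutput>#htmlEditFormat(logLine)#<br></cfoutput>
        </cfif>
    </cfloop>
<cfelse>
    <cfoutput>No monitor.log found at #htmlEditFormat(logPath)#</cfoutput>
</cfif>
```

Swapping the search phrase for "Email notification sent" lists the successful notifications instead.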

The alert notification e-mails come from an address of cfadmin@[servername], where servername is the name of your server. This is not configurable. If you find you are not receiving the notification e-mails, and you've confirmed that the monitor.log shows it did successfully send an e-mail, your mail server may preclude sending out e-mails with a From address that has a domain name other than that of the mail server.

If an alert is set to kill requests taking longer than a given period of seconds, the e-mail notification for that alert will also list those requests that were killed. (This is not displayed in the notifications page, nor in the snapshot or monitor.log file.) Sadly, the e-mail will not report what requests are running that trigger other events (such as those running too long or when the average response time is too long). But you can capture that information, and a lot more, in the available Snapshots feature.

Dump snapshot

If you select the Dump snapshot action for any of the alerts, ColdFusion will generate a “snapshot” file when the alert is triggered. This is a text file that you can read, which contains considerable information on the status of the ColdFusion server and currently running requests, threads, and queries. The snapshot file can be viewed on the Alerts page, discussed earlier, which displays any alerts that are triggered. Note also that if you have chosen the Send E-mail action as well, the snapshot file will be included as an attachment in the e-mail.

Besides requesting a snapshot with an alert, you can also request one manually using the available Snapshots tab within the monitor. Since that’s discussed later in this article, I’ll save further discussion of snapshots for that section.

Kill Threads running longer than x seconds

If you select the “Kill Threads running longer than x seconds” action, then while an alert is triggered, ColdFusion will attempt to kill any requests whose response time exceeds the number of seconds specified. In most cases, any such requests will be terminated. (Note that the time after which long-running threads will be killed is separate from the time for triggering the alert, if the alert is time-related, and it also overrides the request timeout in the ColdFusion Administrator.)

The user will generally see whatever text was being generated prior to the point in the code where the termination occurred. They may also get an error message, which will vary depending on the operation that was interrupted.

Note that there are some kinds of operations within requests that ColdFusion can’t interrupt immediately, such as during requests to databases (called from CFQUERY, CFSTOREDPROC, and so forth), CFHTTP operations, or invocation of a web service, to name a few.

In such cases, the request will be terminated, but only after the blocked operation completes. This means that if the remote service (database, web service, and so on) is what’s causing the delay, the attempt to kill the request will have to wait until that remote operation completes, or until the termination time indicated for the specific operation is reached (such as when the TIMEOUT attribute is used on CFHTTP, or on CFINVOKE of web services). This limitation also applies to the manual kill feature discussed in Part 2 of this series and to the Request Timeout feature in the ColdFusion Administrator.
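Because a blocked operation delays any attempt to kill the request, it can be worth putting an explicit limit on the operation itself. The following sketch shows the TIMEOUT attribute on CFHTTP and on CFINVOKE of a web service, as mentioned above, plus the timeout attribute of CFQUERY, which is an addition here and which not every database driver honors. The URLs, web service method, datasource, and table names are all hypothetical.

```cfml
<!--- Give up on a remote HTTP call after 10 seconds (hypothetical URL) --->
<cfhttp url="http://example.com/slow-service" method="get" timeout="10">

<!--- Give up on a web service call after 10 seconds (hypothetical WSDL and method) --->
<cfinvoke webservice="http://example.com/status.cfc?wsdl"
          method="getStatus"
          timeout="10"
          returnvariable="serviceStatus">

<!--- Ask the database driver to abandon the query after 15 seconds.
      Support for this attribute depends on the driver in use. --->
<cfquery name="getOrders" datasource="exampleDSN" timeout="15">
    SELECT order_id, order_total
    FROM orders
</cfquery>
```

With limits like these in place, a request stuck on a slow remote service reaches a point where ColdFusion can interrupt it sooner, rather than waiting for the remote operation to finish on its own.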

Reject any new requests

If you select the "Reject any new requests" action, then while an alert is triggered (until it has recovered), new requests will be rejected immediately. They will receive a 503 status code, and the user may see the message, “The server is unable to process your request. Please try again later.” This will not affect the execution of requests already running when the alert was triggered.

The ability to reject any new incoming request is a pretty significant change in operations. Otherwise, with a high-traffic site especially, when a problem occurs, requests keep coming at their normal pace. These requests normally get queued up and will execute when ColdFusion has the necessary resources. This can create a vicious circle: when ColdFusion is again able to recover after a problem state (whether by finishing or timing out requests, allocating more memory, and so forth), it's flooded by all the requests ready to execute. ColdFusion is stuck playing catch-up and the server could appear to still be offline or sluggish.

So, rather than try to service all the requests that come in during an alert, ColdFusion can instead reject the new requests. Again, the users experience what appears to be an error message, but at least their requests do not pile up, thus preventing a worsened error state on the server.

Perform garbage collection

Still another powerful new feature for ensuring longer uptimes is the alert action to Perform garbage collection. Available for the JVM Memory alert only, this will cause ColdFusion to make a request to the underlying JVM that garbage collection be performed on memory. A discussion of garbage collection is beyond the scope of this article, but briefly, when CFML requests run, the memory used to perform their processing will be allocated and then generally be marked for reclamation at the end of the request. The underlying JVM should automatically remove (“collect”) that no-longer-used memory (“garbage”), but sometimes it may not do so until a garbage collection is requested.

You can view the amount of memory used by ColdFusion in the graph shown in the Server Monitor’s page, Statistics>Memory Usage>Memory Usage Summary (where you can also run a garbage collection manually), as discussed in Part 1 of this series. If the garbage collection request is successful, the amount of used memory may drop so as to allow ColdFusion to recover from this alert.

When triggered by the alert, garbage collection will be attempted every minute until the alert recovers. This is not configurable in the Server Monitor interface. The Alerts page will not indicate how many garbage collections are requested, but you can view this in the available monitor.log.
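To get a feel for what a garbage collection request can and cannot do, you can issue one yourself from CFML through the standard java.lang.System class. This is only a sketch: whether the alert action uses this same call internally is an assumption, and System.gc() is a request that the JVM is free to ignore or defer.

```cfml
<cfset runtime = createObject("java", "java.lang.Runtime").getRuntime()>

<!--- Used heap, in bytes, before the request --->
<cfset usedBefore = runtime.totalMemory() - runtime.freeMemory()>

<!--- Request (not force) a garbage collection --->
<cfset createObject("java", "java.lang.System").gc()>

<!--- Used heap, in bytes, after the request --->
<cfset usedAfter = runtime.totalMemory() - runtime.freeMemory()>

<cfoutput>
Used heap before: #round(usedBefore / 1048576)# MB<br>
Used heap after: #round(usedAfter / 1048576)# MB
</cfoutput>
```

If the "after" figure barely moves, the memory is probably still referenced (for example, by data held in shared scopes), and repeated collections alone are unlikely to bring the server out of the alert state.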

Processing a CFC

The final alert action is "Processing a CFC." With this option, you can arrange to perform any ColdFusion Markup Language (CFML) operation on the triggering (or resolution) of an alert state, including storing data in a database, sending an instant message or SMS notification (if you've enabled the ColdFusion event gateways to support that), and so on.

In the form for specifying alert actions, the CFC option is a field that expects a CFC file name (including the extension), such as alert.cfc. By default, ColdFusion looks for the CFC in its runtime\bin directory (C:\ColdFusion8\runtime\bin in the standalone edition).

Note: At the time of this writing, I've not found any way to indicate that the CFC is located in any other directory. (I tried both webroot relative and absolute paths, but neither worked. I do note that in the ColdFusion 8 Release Notes, this inability to use relative paths is listed as a known issue, though it suggests that absolute paths should work. Perhaps by the time you read this the problem will have been resolved.)

The CFC you create for this purpose must have two functions, onAlertStart() and onAlertEnd(), both of which accept a structure as an argument and return no values. The onAlertStart() function is executed when an alert becomes active, and onAlertEnd() is executed when the server recovers from this alert or this alert is invalidated.

In both methods, a structure is passed in, containing information about the alert, including when it was activated (or was recovered or disabled). The following is a sample alert.cfc that simply dumps the incoming struct (passed by ColdFusion) into a <cfsavecontent> tag, whose captured content is then passed to the <cflog> tag to be written to the application.log file (in the [coldfusion]\logs directory). Note the use of format="text" in the <cfdump> tag to make the dump more readable within the log file.

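<cfcomponent>
     <!--- Called by ColdFusion when an alert becomes active --->
     <cffunction name="onAlertStart">
         <cfargument name="instruct" required="No">
         <!--- Keep the dump text local to this function call --->
         <cfset var get = "">
         <!--- Capture a plain-text dump of the incoming struct --->
         <cfsavecontent variable="get">
              <cfdump format="text" var="#instruct#">
         </cfsavecontent>
         <!--- Write the captured dump to application.log --->
         <cflog log="APPLICATION" text="#get#">
         <cfreturn>
     </cffunction>
     <!--- Called by ColdFusion when the alert recovers or is invalidated --->
     <cffunction name="onAlertEnd">
         <cfargument name="instruct" required="No">
         <cfset var get = "">
         <cfsavecontent variable="get">
              <cfdump format="text" var="#instruct#">
         </cfsavecontent>
         <cflog log="APPLICATION" text="#get#">
         <cfreturn>
     </cffunction>
</cfcomponent>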

The keys in the structure include:

  • ALERTACTIVATE: the date and time that the alert was triggered
  • ALERTMESSAGE: the message corresponding to the type of alert triggered
  • ALERTSNAPSHOTFILE: the filename, if any, of a generated snapshot file
  • ALERTTYPE: a textual reference to the type of alert triggered

Some other keys in the structure that help in determining the alert's state are ISACTIVE, ISINVALIDATED, and ISRECOVERED, all of which hold booleans. When the alert recovers, other useful keys are ALERTINVALIDATEDAT and ALERTRECOVEREDAT, each of which holds a date/time value (or the empty string).
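Once you know the keys, you can log a concise one-line summary instead of a full dump. The following is a minimal sketch under a few assumptions: the log file name alertHandler is hypothetical, the argument is named instruct only to match the sample above, and the keys are guarded with structKeyExists() where their presence in every call is not certain.

```cfml
<cfcomponent>
    <!--- Runs when an alert becomes active --->
    <cffunction name="onAlertStart" returntype="void">
        <cfargument name="instruct" required="no">
        <cfset var snapshotFile = "none">
        <!--- A snapshot file name is present only if a snapshot was generated --->
        <cfif structKeyExists(arguments.instruct, "ALERTSNAPSHOTFILE")
              AND len(arguments.instruct.ALERTSNAPSHOTFILE)>
            <cfset snapshotFile = arguments.instruct.ALERTSNAPSHOTFILE>
        </cfif>
        <!--- Writes to alertHandler.log (hypothetical name) in the logs directory --->
        <cflog file="alertHandler" type="warning"
               text="#arguments.instruct.ALERTTYPE# alert started at #arguments.instruct.ALERTACTIVATE#: #arguments.instruct.ALERTMESSAGE# (snapshot: #snapshotFile#)">
    </cffunction>

    <!--- Runs when the alert recovers or is invalidated --->
    <cffunction name="onAlertEnd" returntype="void">
        <cfargument name="instruct" required="no">
        <cfset var outcome = "ended">
        <cfif structKeyExists(arguments.instruct, "ISRECOVERED")
              AND arguments.instruct.ISRECOVERED>
            <cfset outcome = "recovered at #arguments.instruct.ALERTRECOVEREDAT#">
        <cfelseif structKeyExists(arguments.instruct, "ISINVALIDATED")
              AND arguments.instruct.ISINVALIDATED>
            <cfset outcome = "invalidated at #arguments.instruct.ALERTINVALIDATEDAT#">
        </cfif>
        <cflog file="alertHandler" type="information"
               text="#arguments.instruct.ALERTTYPE# alert #outcome#">
    </cffunction>
</cfcomponent>
```

Writing to a dedicated log file keeps alert activity separate from the rest of application.log, which makes it easier to review the history of a single alert type.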

If you have any trouble trying to use the CFC, the success or failure of trying to invoke it will be tracked in the monitor.log.