Requirements

Prerequisite knowledge

Requires advanced knowledge of audio in Adobe Flash Player, ActionScript, and Adobe Flash Builder or Adobe Flash Professional.

 

User level

Advanced

Required products


Adobe Flash Player has become very popular for audio and video playback; in fact, most Internet video is viewed using Flash Player. Flash Player provides both a rich viewing experience and efficient, high-quality video playback by incorporating technologies such as advanced audio and video compression schemes (H.264, MP3, and AAC codecs), versatile media functionality (multi-bitrate streaming, playlist, seeking, and other features), and efficient playback mechanisms (hardware decoding and direct rendering).
 
Due to its ubiquitous penetration on desktop computers and its increasing popularity on mobile devices, there has been a lot of interest in using Flash Player for real-time audio and video communications. When compared to video broadcast, however, real-time communications has completely different requirements. The most important requirements include:
 
  • Minimizing latency between communicating endpoints
  • High-quality, error-resilient speech codec
  • Acoustic echo cancellation for a headset-free experience
Although Flash Player has had audio/video capabilities since 2002 and has been used for web meetings in solutions such as Adobe Connect and Big Blue Button, the real game-changing event occurred with the release of Flash Player 10 in 2008, which introduced a low-latency transport protocol and a new sound codec, making Flash Player well-suited for real-time communications.
 
In this article, I'll briefly describe requirements for real-time communications and how Flash Player addresses these needs. I'll also introduce the new ActionScript API for using enhanced audio, provide you with best practices and limitations, and show you a sample application.
 

 
Requirements for real-time communications

In 2002, Flash Player 6 introduced the Real-Time Messaging Protocol (RTMP) and the Nellymoser sound codec. With the help of Flash Communications Server MX, one could develop real-time communications applications to operate between two or more Flash Player endpoints.
 
RTMP is based on Transmission Control Protocol (TCP), which provides reliable data transmission at the price of unbounded delay, which means the delay can be arbitrarily high. Error-free transmission is achieved by retransmitting lost packets. If packets keep getting lost, the delay can grow very large because each lost packet needs to be retransmitted.
 
Nellymoser is a proprietary codec with low compression efficiency and limited industry support. Because RTMP runs over TCP, audio messages are never lost, but they may be queued due to network or server problems. To combat latency accumulation, Flash Player employs a so-called catch-up mechanism wherein audio is played out faster than its natural sampling rate. This gradual delay reduction introduces only minimal audio distortion without changing the pitch.
 
While RTMP is well-suited for broadcast and webcast applications (there being no strict delay requirements), it has limited applicability for real-time communications, where a few hundred milliseconds of delay may render the conversation unusable. In real-time communications, it is more important to minimize delay than to maintain error-free transmission. Most audio and video coding technologies (such as the H.264 video codec and the Speex speech codec) are designed with network transmission errors in mind and can handle them.
 
Flash Player 10 introduced Real-Time Media Flow Protocol (RTMFP). Unlike RTMP, RTMFP is based on the User Datagram Protocol (UDP). RTMFP enables sending data either reliably (using retransmission) or unreliably. Transmission delay can be minimized by using unreliable transmission. Furthermore, RTMFP enables direct peer-to-peer connections, which not only can reduce server requirements but can further reduce delay between communicating endpoints.
 
Flash Player 10 also introduced the Speex codec. Speex is an open-source, royalty-free codec that enjoys wide industry support. Flash supports Speex encoding at 16 kHz. Furthermore, when Speex is used for live communications, transmission delay using RTMFP is minimized. RTMFP passes all Speex messages to a higher layer as soon as they are received. Flash Player uses an adaptive Speex jitter buffer when playing out messages. Adobe has also implemented Speex noise suppression and voice activity detection, minimizing transmission bandwidth during periods of silence.
 
These capabilities make it practical to develop real-time communications applications using RTMFP and Speex. For an acceptable user experience, participants should wear headsets to prevent acoustic echo. Acoustic echo occurs when the sound from the computer's speaker is fed back to the microphone. Using headsets may be acceptable in corporate environments, but it's clearly undesirable in the consumer space, where users commonly use webcams or built-in laptop microphones. To achieve widespread adoption, acoustic echo cancellation (AEC) is absolutely required for voice-over-IP (VoIP) applications. AEC is available in messenger applications (such as Skype and Google Talk) and softphones (such as X-Lite).
 
Adobe Flash Player 10.3 and Adobe AIR 2.7 have introduced enhanced audio, which includes acoustic echo cancellation and noise suppression. Enhanced audio is available on all desktop platforms supported by Flash Player and AIR.
 

 
Enhanced audio API

We have added a new API to the Flash platform for enabling enhanced audio. This feature is available on all supported desktop platforms of Flash Player and AIR. The new API is only available in ActionScript 3. You must target Flash Player 10.3 or AIR 2.7 (or later) and SWF version 12 in your authoring environment, and you must update your playerglobal.swc. The following classes are affected:
 
  • Microphone: A new static method was added to this class for creating an enhanced microphone, along with read/write properties for configuring enhanced microphone options.
  • MicrophoneEnhancedOptions: This new class lets you configure enhanced microphone settings.
  • MicrophoneEnhancedMode: This new class enumerates enhanced microphone operation modes.
Sending audio to another Flash endpoint or Flash Media Server can be implemented with only a few lines of code:
 
var netConnection:NetConnection = new NetConnection();
netConnection.connect("rtmfp://example.com/rtc");
var netStream:NetStream = new NetStream(netConnection);
var microphone:Microphone = Microphone.getMicrophone();
netStream.attachAudio(microphone);
Using enhanced audio is equally simple: You can obtain an enhanced microphone using the Microphone.getEnhancedMicrophone() static method:
 
var microphone:Microphone = Microphone.getEnhancedMicrophone();
netStream.attachAudio(microphone);
In addition to acoustic echo cancellation, enhanced audio also provides noise suppression. Previously, Flash Player only provided noise suppression for Speex audio. The new noise-suppression scheme is applied to all captured audio samples. Noise suppression is controlled by the already-existing noiseSuppressionLevel property of the Microphone class and is enabled by default. Setting noiseSuppressionLevel to 0 will disable noise suppression.
 
Enhanced audio has a couple of limitations:
 
  • You cannot use enhanced and non-enhanced audio at the same time.
  • You can only use a single enhanced audio capture at any given time.
Flash Player dispatches the Microphone status event Microphone.UNAVAILABLE when a microphone stops providing audio data due to the above-mentioned reasons. This is a minor limitation when compared to non-enhanced audio, but for practical real-time communications applications, having one active microphone should be sufficient.
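To react to this condition in your application, you can listen for the status event on the active Microphone object. The following sketch assumes the status code string "Microphone.Unavailable" and a hypothetical handler name:

```actionscript
import flash.events.StatusEvent;
import flash.media.Microphone;

var microphone:Microphone = Microphone.getEnhancedMicrophone();
microphone.addEventListener(StatusEvent.STATUS, onMicStatus);

function onMicStatus(event:StatusEvent):void
{
    // Dispatched when another capture takes over and this
    // microphone instance stops providing audio data.
    if (event.code == "Microphone.Unavailable")
    {
        trace("Microphone stopped providing audio data");
        // Notify the user or re-acquire a microphone here.
    }
}
```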
 
Operation of enhanced audio is controlled by the enhancedOptions property on your Microphone object and the MicrophoneEnhancedOptions class. The class has the following properties:
 
  • MicrophoneEnhancedOptions.mode selects the operation mode for enhanced audio. For possible values, please see the MicrophoneEnhancedMode class. The default value is MicrophoneEnhancedMode.HALF_DUPLEX for USB capture devices and MicrophoneEnhancedMode.FULL_DUPLEX otherwise.
  • MicrophoneEnhancedOptions.echoPath specifies the echo path length (in milliseconds). A longer echo path means better echo cancellation but also introduces longer delays and requires more processing power. The default value is 128; the only other possible value is 256.
  • MicrophoneEnhancedOptions.nonLinearProcessing specifies whether to use non-linear processing to suppress residual echo. A time-domain technique is used; the default value is enabled.
Available enhanced microphone operation modes are enumerated in the MicrophoneEnhancedMode class. The following values are supported:
 
  • MicrophoneEnhancedMode.FULL_DUPLEX: Full duplex processing provides the best-quality echo cancellation. Both parties can speak at the same time. This mode requires high-quality input and output devices and the most computing power. It is ideal for laptop computers with built-in speaker and microphone devices.
  • MicrophoneEnhancedMode.HALF_DUPLEX: Half duplex uses simpler processing than full duplex and thus requires less computing power. It assumes that only one party speaks at a time. Half-duplex is the default mode for USB microphone devices.
  • MicrophoneEnhancedMode.HEADSET: Echo cancellation operates in low-echo mode, assuming that both parties are using a headset. This mode removes the minimal echo that could leak from speaker to microphone. This mode requires the least amount of processing.
  • MicrophoneEnhancedMode.SPEAKER_MUTE: Echo cancellation is turned off but other speech processing functions (such as noise reduction) are enabled.
  • MicrophoneEnhancedMode.OFF: All enhanced audio functionality is turned off.
The constructor of MicrophoneEnhancedOptions will set the following default properties:
 
var options:MicrophoneEnhancedOptions = new MicrophoneEnhancedOptions();
options.mode = MicrophoneEnhancedMode.FULL_DUPLEX;
options.echoPath = 128;
options.nonLinearProcessing = true;
To query the enhanced microphone options actually in use, please use the following:
 
var microphone:Microphone = Microphone.getEnhancedMicrophone();
var options:MicrophoneEnhancedOptions = microphone.enhancedOptions;
When you only want to modify certain enhanced audio parameters, make sure that you do not create a MicrophoneEnhancedOptions object using the constructor because you may inadvertently override other parameters as well.
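The safe pattern is to read the options currently in use, change only the parameter you care about, and assign the object back. A sketch (the echoPath change here is just an illustration):

```actionscript
import flash.media.Microphone;
import flash.media.MicrophoneEnhancedOptions;

var microphone:Microphone = Microphone.getEnhancedMicrophone();

// Read the options in use, change only what you need, then assign back.
var options:MicrophoneEnhancedOptions = microphone.enhancedOptions;
options.echoPath = 256;               // longer echo path, e.g. for echoey rooms
microphone.enhancedOptions = options; // other parameters remain untouched
```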
 
When using the microphone data generation API (sample data event) with enhanced audio, you will receive echo-canceled and noise-suppressed samples.
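A minimal sketch of consuming these processed samples through the sample data event (the sample rate setting and handler name are assumptions):

```actionscript
import flash.events.SampleDataEvent;
import flash.media.Microphone;

var microphone:Microphone = Microphone.getEnhancedMicrophone();
microphone.rate = 16; // capture at 16 kHz
microphone.addEventListener(SampleDataEvent.SAMPLE_DATA, onSampleData);

function onSampleData(event:SampleDataEvent):void
{
    // The samples received here are already echo-canceled
    // and noise-suppressed.
    while (event.data.bytesAvailable)
    {
        var sample:Number = event.data.readFloat(); // one mono sample
        // Process or store the sample here.
    }
}
```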
 

 
Best practices for real-time communications

To provide the best user experience for a VoIP application, you should use RTMFP with the Speex codec along with enhanced audio. On the receiver, make sure to use unbuffered live streaming mode by properly setting the bufferTime property of your NetStream object. This will ensure that audio messages will be transmitted unreliably with the lowest latency. I strongly suggest that you use peer-to-peer connections whenever possible in order to minimize transmission delay. This may not be feasible in multiparty communications or when RTMFP is not possible (please see my article, Best practices for real-time collaboration using Flash Media Server).
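On the subscribing side, unbuffered live mode can be set up as in the following sketch, assuming an already-connected NetConnection; the stream name is an assumption:

```actionscript
import flash.net.NetStream;

var inStream:NetStream = new NetStream(netConnection);
inStream.bufferTime = 0;      // unbuffered live mode: lowest latency
inStream.play("audioStream"); // assumed stream name published by the peer
```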
 
Best-quality echo cancellation is obtained by using full duplex mode. If you have a USB webcam attached to your laptop, you should use the built-in microphone. You can also implement a user interface that lets the user select the microphone of his or her choice. As described in the previous section, you can get an enhanced microphone with the following call:
 
var microphone:Microphone = Microphone.getEnhancedMicrophone();
The above code uses the default microphone that the user has selected in his or her local settings manager. Adobe recommends using the default enhanced microphone options. (The default value for enhanced microphone mode is half-duplex for USB devices and full-duplex otherwise; the echo path is 128 ms and non-linear processing is enabled.)
 
Adobe also strongly suggests you use the Speex codec. Speex capture and compression are only supported at 16 kHz (playback also supports 8 kHz), so there is no need to specify the sampling rate:
 
microphone.codec = SoundCodec.SPEEX;
Speex operates on frames of samples; each frame is 20 ms long (320 samples at 16 kHz). You can specify how many frames you wish to send in a message. The default value is 2 and the maximum value is 8. In order to minimize latency, you may want to send a single frame per message:
 
microphone.framesPerPacket = 1;
Clearly, sending more frames in a message reduces transmission overhead; for each message, the number of overhead bytes can be as much as 50 bytes (RTMFP/UDP/IP headers). This overhead is significant and cannot be ignored. You may consider the tradeoffs between latency and overhead (especially in a wireless or 3G scenario) but in most cases, minimizing the latency is highly desirable.
 
It is very important to set the silence level to 0. When the silence level is non-zero, discontinuous transmission will occur. This could have a negative effect on the Speex jitter buffer and acoustic echo cancellation algorithms. Moreover, Speex performs voice activity detection and only uses 4 Kbps bandwidth when no speech is detected:
 
microphone.setSilenceLevel(0, 2000);
One of the consequences of setting the silence level to zero is that you will no longer receive activity events. If your application has relied on the microphone activity event to detect whether a user was speaking, you will need to choose an alternate approach.
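One such alternate approach is to poll the microphone's activityLevel property on a timer instead of relying on activity events. This is a sketch only; the polling interval, threshold, and UI helper are assumptions:

```actionscript
import flash.events.TimerEvent;
import flash.utils.Timer;

var activityTimer:Timer = new Timer(250); // poll four times per second
activityTimer.addEventListener(TimerEvent.TIMER, onActivityPoll);
activityTimer.start();

function onActivityPoll(event:TimerEvent):void
{
    // activityLevel ranges from 0 to 100; 10 is an assumed threshold.
    var speaking:Boolean = microphone.activityLevel > 10;
    updateSpeakingIndicator(speaking); // hypothetical UI helper
}
```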
 
Please be careful when setting microphone gain. Flash Player does not modify system capture or playback volume levels; it simply multiplies audio sample values. For Speex and acoustic echo cancellation, overloading levels can result in poor performance. Setting gain to 50 does not modify sample values, which is probably the best approach:
 
microphone.gain = 50;
Incorrect system microphone and playback levels could result in reduced audio quality. You may want to notify users to adjust their hardware audio levels.
 
The Microphone class supports echo suppression with the setUseEchoSuppression() method. The algorithm is quite simple: when there is outgoing sound, the magnitude of microphone input samples is divided by 2. This leads to varying microphone sound levels that could be undesirable in certain applications. When enhanced audio is used, this echo suppression functionality is ignored.
 

 
Limitations of using enhanced audio

AEC is computationally expensive. Currently, only desktop platforms are supported for Flash Player and AIR. Although AEC is optimized for the SIMD architecture, it may take up a significant chunk of CPU power on low-powered netbook computers. When AEC is not supported, Microphone.getEnhancedMicrophone() returns null.
 
Multiparty echo cancellation is a very hard problem. Although from the API point of view, nothing prevents you from creating a multiparty communications application, the performance of enhanced audio is optimized for two participants.
 
For echo cancellation to work, you must use unbuffered audio; that is, the bufferTime property of the subscribing NetStream object must be set to 0. Audio produced by NetStream objects with a buffer time larger than 0 will not be included in echo cancellation.
 
Full-duplex mode provides the best-quality echo cancellation and also supports "double-talk" (both parties talking simultaneously). However, it requires the most processing power and good hardware. It cannot be used for USB-capture devices because synchronization between the USB-capture device and the sound card for speakers is not currently supported.
 
When you're using non-enhanced audio, and you have multiple audio capture devices, the following code works as expected:
 
ns1.attachAudio(Microphone.getMicrophone(0));
ns2.attachAudio(Microphone.getMicrophone(1));
where ns1 and ns2 are two different publishing NetStream objects. Each NetStream object publishes audio captured from the respective non-enhanced microphone. This is not currently supported when using enhanced audio; you can only have a single enhanced microphone instance at any given time. You also cannot have a non-enhanced and enhanced microphone capturing audio at the same time. When you are using enhanced audio and want to open a different microphone instance, the existing microphone will stop producing data and dispatch the Microphone.UNAVAILABLE event. This is illustrated in the following use cases:
 
 
Use case 1
  1. Create microphone mic1.
  2. Create enhanced microphone mic2; mic1 will dispatch the Microphone.UNAVAILABLE status event.
 
Use case 2
  1. Create enhanced microphone mic1.
  2. Create microphone mic2; mic1 will dispatch the Microphone.UNAVAILABLE status event.
 
Use case 3
  1. Create enhanced microphone mic1.
  2. Create enhanced microphone mic2 with a different device; mic1 will dispatch the Microphone.UNAVAILABLE status event.
 
Use case 4
  1. Create microphone mic1.
  2. Create microphone mic2 with a different device; both mic1 and mic2 provide audio data.

 
Sample application

For a previous article, I developed a sample real-time collaboration application using Adobe Cirrus. I have updated this application to use the newly introduced enhanced audio APIs. I created a central function to obtain a microphone:
 
private function getMicrophone():Microphone
{
    if (enhancedCheckbox.selected)
    {
        return Microphone.getEnhancedMicrophone(micIndex);
    }
    else
    {
        return Microphone.getMicrophone(micIndex);
    }
}
In this function, micIndex is a variable storing the microphone index selected by the user, and enhancedCheckbox refers to a Flex Spark check box component controlling enhanced microphone selection.
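Putting the best practices from this article together, publishing a low-latency enhanced-audio stream might look like the following sketch. Connection handling is omitted, and the stream name and function name are assumptions:

```actionscript
import flash.media.Microphone;
import flash.media.SoundCodec;
import flash.net.NetConnection;
import flash.net.NetStream;

private function publishAudio(netConnection:NetConnection):void
{
    var microphone:Microphone = getMicrophone();
    microphone.codec = SoundCodec.SPEEX;  // Speex at 16 kHz
    microphone.framesPerPacket = 1;       // minimize latency
    microphone.setSilenceLevel(0, 2000);  // continuous transmission
    microphone.gain = 50;                 // leave sample values unmodified

    var outStream:NetStream = new NetStream(netConnection);
    outStream.publish("audioStream");     // assumed stream name
    outStream.attachAudio(microphone);
}
```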
 

 
Where to go from here

In this article, I touched on three key components of real-time communications (minimizing latency, a high-quality, open-source sound codec, and acoustic echo cancellation) and showed how Flash Player addresses these requirements. I focused on enhanced audio, introduced in Adobe Flash Player 10.3 and Adobe AIR 2.7. Using these capabilities, you should be able to develop high-quality and user-friendly real-time communication and collaboration applications.