White paper on Avaya Aura Application Enablement Services 5.2 High Availability (HA) Configurations

Issue 1.0
AE Services 5.2
November 2009

This white paper is intended for a software application developer or a systems engineer who is responsible for deploying an application or an AE server in an HA configuration. The HA configurations covered include:
- The new Avaya Aura™ Application Enablement Services on System Platform release 5.2 failover features.
- The interaction of all AE Services offers with Avaya Aura™ Communication Manager (CM) Enterprise Survivable Server (ESS) and Local Survivable Processor (LSP).

It also covers CM Processor Ethernet support and improvements to DMCC service recovery using the CM Time To Service (TTS) feature.

Uninterrupted telephony is important for many enterprises, especially for mission-critical applications. Avaya Aura™ Application Enablement (AE) Services on System Platform (SP) Release 5.2 supports a high availability (HA) cluster of two nodes. The active server node automatically fails over to the standby node in the event of a hardware failure. Client applications are able to re-establish communication with the AE Services cluster when the failover is complete. This failover feature is not supported on the AE Services 5.2 software-only and bundled offerings.

Avaya Aura™ Communication Manager (CM) provides Enterprise Survivable Server (ESS) and Local Survivable Processor (LSP) for failover from the main media server. This feature allows media gateways, endpoints, and application servers such as AE Services and its applications to continue their operations without a major outage. ESS and LSP have been supported since AE Services 3.0 and are included in this paper for completeness.

Avaya Aura™ Communication Manager (CM) provides the Processor Ethernet (PE) interface for direct connection to the main media server. This feature reduces cost by not requiring a CLAN for communications. However, a DMCC client application must reestablish any H.323 registrations that are terminated when an interchange occurs between a duplicated pair of CMs that are communicating with an AE server over PE, unless the Time To Service feature is used. Furthermore, an AE Server 5.2 that communicates over PE does not support the ESS and LSP configurations: only a single IP address can be administered on the AE Server 5.2 for a PE connection, and ESS and LSP servers have their own unique IP addresses, which always differ from that of the main media server.


Avaya recommends the following:
• CM should be configured for H.323 registration using the Time To Service feature.
• AE Services 5.2 should use the PE interface except in ESS/LSP environments.
• A local HA cluster of AE Services on System Platform release 5.2 servers should be used.
• An application that uses the Device, Media and Call Control (DMCC) service should keep trying to reestablish the DMCC session when it loses its socket communication link to the DMCC service, because the runtime state is preserved. This applies to all AE Services configurations.
• An application that uses the CVLAN, DLG or TSAPI service should reestablish its socket connections and its monitors/associations if it loses the socket connection to the service on the AE server, because no runtime state is preserved for these services. TSAPI applications also need to reestablish route registrations.
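The recommendation to "keep trying" to reestablish a session can be sketched as a simple retry loop. This is only an illustration: the `connect` callback, the function name and the capped exponential backoff are assumptions made for the example and are not part of any Avaya SDK.

```python
import time
from typing import Callable, Optional


def reestablish_session(
    connect: Callable[[], bool],
    max_attempts: Optional[int] = None,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `connect` until it succeeds; return the number of attempts used.

    `connect` is a hypothetical application callback that tries to reopen
    the session and returns True on success. It is not an Avaya API.
    """
    delay = initial_delay
    attempt = 0
    while True:
        attempt += 1
        try:
            if connect():
                return attempt
        except OSError:
            # Socket-level failure: treat it like an unsuccessful attempt.
            pass
        if max_attempts is not None and attempt >= max_attempts:
            raise TimeoutError(f"session not reestablished after {attempt} attempts")
        sleep(delay)
        delay = min(delay * 2, max_delay)  # capped exponential backoff (assumption)
```

With `max_attempts=None` the loop never gives up, which matches the DMCC guidance above. For CVLAN, DLG or TSAPI, the application would additionally have to re-create its monitors, associations and route registrations after the connection is back.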

In this CM ESS configuration, the applications and associated AE Server at the remote sites are always active and are supplying functionality for the local resources at the remote site. As described in later sections, this type of configuration ensures the shortest outage.

AE Services on System Platform 5.2 High Availability

Figure 1a below illustrates an HA cluster of AE Services on System Platform, and Figure 1b shows a single AE Services server (on System Platform, Software only or Bundled) communicating with CM through the PE interface.

[Figures 1a and 1b: each shows a Headquarters site with an Application, an AE Server (labeled "Active AE Server" in one of the two figures), a Primary S8720 and G650 Gateways.]

The AE Services on System Platform 5.2 release provides higher availability than the software-only and bundled offers and earlier releases. This configuration monitors the server nodes for loss of network connectivity and hardware failure events. This information is used to detect faults and to decide when to fail over from the active node to the standby node in the server cluster. The AE Services on the standby node are restarted when a failover event occurs. This feature enables AE Services to continue to provide service to client applications with reduced downtime when a hardware failure event occurs. In addition, System Platform will restart the AE Services virtual machine if it does not maintain its sanity keep-alive because of a software fault condition.

Device, Media and Call Control Service

Avaya recommends that applications reestablish DMCC sessions and verify that all associations (monitors, registrations) are still active after a network interruption.

In addition to the SP failover feature, DMCC provides recovery from a software fault or a shutdown that does not allow the DMCC Java Virtual Machine (JVM) process to exit normally. The DMCC Service Recovery feature is available on all AE Services configurations: software only, bundled and on System Platform. When the DMCC JVM process is restarted after an abnormal exit, the DMCC service is initialized from persisted state information on the hard disk. This persisted state information is saved during normal operation and represents the last known state of the DMCC service prior to a JVM abnormal exit. The state information includes session, device, device/call monitor and H.323 registration data.

From a client application’s point of view, the DMCC recovery appears as a temporary network interruption that requires the client to re-establish any disconnected sessions. When the client application re-establishes the session, the DMCC service will send events for any resources that could not be recovered. These will include monitor stopped and unregistered event messages, which enable the client to determine what needs to be restored through new service requests. Otherwise, the client will continue to operate as usual.

TSAPI, CVLAN, DLG and Transport Services

No runtime state information is persisted for these services. The client application must restore any state that existed before the service was restarted.
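For the DMCC recovery case described above, the client-side reconciliation can be sketched as follows. The event names (`monitor_stopped`, `unregistered`), the `ClientState` class and the action tuples are hypothetical stand-ins chosen for this example; a real client would map them to the events and requests of the API it actually uses.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple


@dataclass
class ClientState:
    """What the application believed it had before the interruption."""
    monitors: Set[str] = field(default_factory=set)       # monitored device IDs
    registrations: Set[str] = field(default_factory=set)  # registered extensions


def plan_restoration(
    state: ClientState, events: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Turn 'could not be recovered' events into the requests to reissue.

    `events` holds (kind, resource_id) pairs, where kind is either
    "monitor_stopped" or "unregistered" (hypothetical names).
    """
    actions: List[Tuple[str, str]] = []
    for kind, resource_id in events:
        if kind == "monitor_stopped" and resource_id in state.monitors:
            actions.append(("start_monitor", resource_id))
        elif kind == "unregistered" and resource_id in state.registrations:
            actions.append(("register", resource_id))
        # Events for resources the client never owned are ignored.
    return actions
```

Resources for which no event arrives are assumed to have been recovered, so the plan stays empty when the service restored everything.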
ESS/LSP Considerations for AE Services deployment

Figure 2 (shown below) illustrates a sample network configuration with the main S8700 media server and G650 media gateways at the headquarters site. One remote site (Remote site A) has a G700 media gateway with an LSP, and the other two remote sites (B and C) have G650 media gateways and ESS servers. Each of the remote sites has an AE Services server and associated application connected to Communication Manager through the CLAN(s) of one or more local (or remote) G650 media gateways. Avaya recommends that all applications have a local AE server. In this configuration, the applications and associated AE Server at the remote sites are always active and supply functionality for the local resources at the remote site. As described in later sections, this type of configuration ensures the most seamless survivability in an ESS configuration.


[Figure 2: Normal operation. Headquarters (Primary S8700, G650 Gateways, Active AE Server, Application) is connected over the WAN to Remote site A (G700 with LSP, Active AE Server, Application), Remote site B (G650 Gateways, ESS S8500, Active AE Server, Application) and Remote site C (G650 Gateways, ESS S8500, Active AE Server, Application).]

In case of a WAN outage (as shown in Figure 3 below), each remote site becomes independent and provides service without major interruption to endpoints and applications. Remote site A with a G700 media gateway will have the LSP go online and the G700 media gateway will connect to that local LSP. It is recommended to configure the primary search list of the G700 media gateway such that it contains CLANs of only one site (i.e. headquarters in this case). The secondary search list should contain the LSP at the local site (site A in this case). The AE server will detect connectivity failure with the main site (headquarters) and will notify its applications. The applications will have to direct the AE server to move the connectivity over to the LSP (described in detail further below). The G650 media gateways at the remote sites (sites B and C) will connect to the local ESS server in case of a WAN outage. The AE server will automatically get connected with the ESS server through the G650 media gateways. This will be transparent to the AE server and its applications except for what will appear to be a brief network outage (described in detail further below).


The headquarters site will continue to function as it did previously in case of a WAN outage. Note: The remote sites and the headquarters site will not be able to access each other’s resources during a WAN outage.

[Figure 3: WAN outage. Headquarters (Primary S8700, G650 Gateways, AE Server, Application), Remote site A (G700 with LSP, AE Server, Application), Remote site B (G650 Gateways, ESS S8500, AE Server, Application) and Remote site C (G650 Gateways, ESS S8500, AE Server, Application); the failure is marked with an X.]

If the main headquarters site is completely down but the WAN is functional (as shown in Figure 4 below), the remote sites will behave similarly to the WAN outage scenario described above, but with one important exception. With the ESS feature, the system will attempt to stay as “whole” as possible. Since the WAN is still intact, all of the G650 gateways end up being controlled by the same ESS server at Remote Site B. Since the application and AE Server were configured to support only the local resources at the remote sites, the application continues to function the same whether the sites operate independently (WAN failure) or jointly (normal operation or site destruction at headquarters).


[Figure 4: Site destruction. Headquarters (Primary S8700, G650 Gateways, AE Server, Application), Remote site A (G700 with LSP, AE Server, Application), Remote site B (G650 Gateways, ESS S8500, AE Server, Application) and Remote site C (G650 Gateways, ESS S8500, AE Server, Application), connected over the WAN; the failure is marked with an X.]

1. ESS (Enterprise Survivable Server) – Non PE Connectivity

The list below describes the behavior of the AE Services:

a. DMCC (Device and Media Control) Service:
As long as the application is configured to connect to CLANs in the local gateways, recovery with an ESS server should be very straightforward. The application will receive an unregistered event for each DMCC softphone when connectivity is lost from the local gateways (like G600 or G650) to the primary S8700. At this point, the application should begin attempts to reregister the DMCC softphones with the same CLAN IP address(es) it was using before. Note that it takes a little over 3 minutes for the media gateway (like G600 or G650) to connect to an ESS server. For this reason, it is recommended that the application keep trying to register with the same CLAN (through the AE server) for that amount of time before it tries to register with an LSP (if one exists). When the gateways (like G600 or G650) connect with the ESS server, the registration attempts will begin to succeed. After the application has successfully registered all DMCC softphones, it should reestablish its previous state and resume operation.
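The re-registration guidance above (retry the same CLAN for a little over 3 minutes, then fall back to an LSP if one exists) can be sketched as below. The `register` callback, the 200-second window and the 10-second retry interval are assumptions for illustration only; the paper gives no exact figures beyond "a little over 3 minutes".

```python
import time
from typing import Callable, Optional

# Assumption: "a little over 3 minutes" is approximated as 200 seconds.
ESS_WINDOW_SECONDS = 200.0


def reregister(
    register: Callable[[str], bool],
    clan_address: str,
    lsp_address: Optional[str] = None,
    window: float = ESS_WINDOW_SECONDS,
    retry_interval: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the address that accepted the registration, or None.

    `register` is a hypothetical callback that attempts one softphone
    registration through the AE server and returns True on success.
    """
    deadline = clock() + window
    while True:
        if register(clan_address):
            return clan_address  # gateway is served by the main server or an ESS
        if clock() >= deadline:
            break
        sleep(retry_interval)
    # Only after the ESS window has passed is the LSP tried, if one exists.
    if lsp_address is not None and register(lsp_address):
        return lsp_address
    return None
```

Injecting `clock` and `sleep` keeps the sketch testable without waiting for real time to pass.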


b. CallInformation Services within DMCC, Call Control Services within DMCC, and all other CTI services
The CallInformation and Call Control services within DMCC and all other CTI Services (TSAPI, CVLAN, DLG and JTAPI) use the Transport (AEP) link to communicate with Communication Manager. The transport links (Switch Connections) on each AE Server should be administered to communicate only with CLANs in gateways that are local to the AE Server’s site. If the system is configured in this fashion, the application / AE Server will not have to take any unusual action to recover in the event that a gateway loses connectivity to the primary S8700 and transitions to an ESS server.

If a gateway loses connectivity to the primary server for an extended period of time (more than 30 seconds), all AEP sockets that are established through CLANs resident in that gateway will drop. If an AE Server loses all of its AEP connections, it will notify any connected applications via a LinkDownEvent (DMCC CallInformationServices) or a CTI link down notification (CTI services). For Call Control Services within DMCC, Avaya recommends that applications add a CallInformationListener and look for a LinkDown event for indication that connectivity to the main site is down. (In future releases, Call Control Services clients will receive a MonitorStop request for all call control monitors if the link is lost to the main site.)

Depending on the CTI API, clients will receive an appropriate event when the connectivity to the main site is down. CVLAN clients will receive an “abort” for each association. TSAPI clients will receive a CSTAMonitorEnded event if the client is monitoring a device and/or a CSTASysStatEvent with a link down indication if the client is monitoring system status. Avaya JTAPI 5.2 and later clients will receive a “call event transmission ended” event if the client has call listeners. Otherwise, an “observation ended” event will be received if the client has call observers. DLG clients will receive a link status event with a link down indication and a cause value.

The AE Server will then automatically attempt to reestablish the AEP links. Note that it takes a little over 3 minutes for the media gateway (like G600 or G650) to connect to an ESS server. Once the media gateway has registered with the ESS server, the AE Server will succeed in establishing its AEP links very soon thereafter (after around 30 seconds). As soon as an AEP link is established, the application will be notified that the CTI link is back up, and the application can begin to resume normal operations. Since there is no runtime state preserved on a transition to an ESS server (as there is with an interchange on an S8700), all application state must be reestablished. Note that, from the AE server’s and application’s perspectives, the failure scenario and recovery actions appear exactly the same as a long network outage between the AE Server and the gateways.
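Because no runtime state survives a transition to an ESS server, an application may find it convenient to track which monitors it wants separately from which ones it believes are in place. The following is a minimal sketch of that bookkeeping; the class and method names are hypothetical, and the link down/up callbacks stand in for whatever notification the chosen CTI API delivers.

```python
from typing import List, Set


class CtiLinkTracker:
    """Tracks which monitors must be re-created after a CTI link outage."""

    def __init__(self) -> None:
        self.link_up: bool = True
        self.desired: Set[str] = set()  # monitors the application wants
        self.active: Set[str] = set()   # monitors believed to be in place

    def add_monitor(self, device: str) -> None:
        self.desired.add(device)
        if self.link_up:
            # Simplification: assume the monitor request succeeds at once.
            self.active.add(device)

    def on_link_down(self) -> None:
        # No runtime state is preserved on a transition to an ESS server,
        # so every monitor is treated as lost.
        self.link_up = False
        self.active.clear()

    def on_link_up(self) -> List[str]:
        # Return the monitors the application should request again.
        self.link_up = True
        return sorted(self.desired - self.active)

    def confirm(self, device: str) -> None:
        # Record that a re-issued monitor request was accepted.
        self.active.add(device)
```

A typical flow is `on_link_down()` when the link down notification arrives, then `on_link_up()` once the AE Server reports the CTI link is back, followed by `confirm()` for each monitor that is successfully restored.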


There is one important difference between the ESS behavior of current versions of AE Services (i.e., AE Services 3.1 and above) and that of AE Services 3.0. If an AE 3.0 Server ends up with AEP links to gateways that are controlled by different ESS or primary call servers (i.e. a fragmented system), the system will not behave in a sane fashion. Some messages will be sent to one call server, and others will be sent to other call servers, with no deterministic behavior with respect to where messages are being sent. Recall, however, that the ESS feature attempts to keep as many gateways as possible under the control of a single call server. Given that this is the case, it is possible to configure the system such that it is extremely unlikely that a 3.0 AE Server will have AEP links to different fragments of a survivable system. The safest configuration is to have the 3.0 AE Server talk only to CLANs resident in a single gateway. Avaya recommends that, wherever possible, all gateways through which an AE Server connects be on the same LAN, preferably even on the same Ethernet switch, to avoid fragmentation. In such a configuration, it is virtually certain that the gateways will all be controlled by the same controller at all times, and the system will therefore always operate in a sane fashion.

Unlike AE Services 3.0, however, in all newer versions of AE Services (i.e., AE Services 3.1 and above) ESS behavior is deterministic. AE Services 3.1 and above will only establish and use links to gateways that are controlled by the same ESS or primary call server. Therefore, the application always knows to which call server messages are being sent. More specifically, the AE server will only use links to the first ESS or primary call server to which it establishes a connection. Subsequent connections that are made to any other server, other than the primary server, will be immediately dropped. If a connection to the primary server is (re)established, then any existing connections to any ESS servers or LSPs will be dropped, and the primary server will be used again (note that this will result in the loss of all monitors/associations, and the behavior described above when an AE Server loses all of its AEP connections will apply).
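The link-selection rule described above for AE Services 3.1 and above can be modeled in a few lines. This is a toy model written from the prose description, not AE Services code; the class name, the `offer` method and the string identifiers are all hypothetical.

```python
from typing import List, Optional


class LinkSelector:
    """Toy model of the deterministic link rule (AE Services 3.1 and above)."""

    def __init__(self, primary_id: str) -> None:
        self.primary_id = primary_id
        self.current: Optional[str] = None  # call server currently in use
        self.links: List[str] = []          # AEP links currently kept

    def offer(self, link_name: str, server_id: str) -> bool:
        """Process a new connection; return True if the link is kept."""
        if self.current is None or server_id == self.current:
            # First server reached, or another link to the same server.
            self.current = server_id
            self.links.append(link_name)
            return True
        if server_id == self.primary_id:
            # Primary is back: drop every ESS/LSP link and use the primary.
            self.current = server_id
            self.links = [link_name]
            return True
        # Any other server while one is already in use: drop immediately.
        return False
```

In this model, switching back to the primary empties the list of survivable links, which corresponds to the loss of all monitors/associations noted above.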

2. LSP (Local Survivable Processor)

A media gateway such as a G700, G350 or G250 can be controlled by an LSP running Communication Manager if the main Communication Manager media server (S8700, S8500 or S8300) is unavailable or down. The AE Server connects either to a CLAN (S8700) or directly to the PE (S8300) to communicate with Communication Manager. Typically, LSPs are configured for remote media gateways so that those media gateways can get service in case the connectivity to the main site is down (e.g. WAN connectivity failure as shown in Figure 3 or site destruction as shown in Figure 4). Once the LSP detects a failure of connectivity to the main media server, the Communication Manager running on that LSP comes online.


Starting with Communication Manager 3.1, new administration forms have been created to control the behavior of survivable processors (i.e. LSPs and ESSs). In particular, the Enabled field on the add/change survivable-processor forms can be set to one of the following three values:
• "n" or no: This processor channel will be disabled on the LSP or ESS.
• "i" or inherit: This link is to be inherited by the LSP or ESS exactly as administered on the main. When set to "i", the remaining data on the line is recopied from the translations from the main and may not be edited. Note that this does not mean that the link will work. For example, if the link is administered to a CLAN and an attempt is made to inherit this link on an LSP, the link won’t work because the LSP has no CLAN. It is most appropriate to use "i" for a link administered via procr or for an ESS.
• "o" or overwrite: This entry will cause the link field to change to "p" and be uneditable. The data entered on this line will overwrite the processor channel shown on this line when the data is file-synchronized to an LSP or ESS.

Avaya recommends different administration settings for the Enabled field depending on the configuration of a system (as shown below).

Configuration            Administration
Only LSPs (no ESSs)      set Enabled to “o”
Both LSPs and ESSs       set Enabled to “n”

Table 1

2.1 Configurations with only LSPs

For configurations with LSPs and no ESSs, Avaya recommends setting the Enabled field to “o” (overwrite). This will allow automatic transition to a local LSP after detecting connectivity failure to the main site.

2.2 Configurations with both LSPs and ESSs

If both LSPs and ESS servers are configured, Avaya recommends setting the Enabled field to “n” (disabled) for LSPs. Setting the Enabled field to “o” (overwrite) for LSPs will most likely result in undesired behavior. Consider the scenario in Figure 4 where the main headquarters site is completely down but the WAN is still functional. If the Enabled field is set to “o” for the LSPs, then the AE Server will always connect to an LSP first since it would be available for connections before any of the ESS servers. Remember, it takes a little over 3 minutes for a media gateway (like a G600 or G650) to connect to an ESS server. Additionally, while connected to the LSP, the AE Server will deny (i.e. immediately drop) subsequent connections to any ESS servers. However, in this scenario, it would have been preferable to connect to one of the ESS servers first since it’s possible that the ESS server had connectivity and full control of the system.

2.3 Transitioning to a local LSP when Enabled is set to “n”

Note that if the Enabled field is set to “n” (disabled), the AE server will detect connectivity failure to the main site, but it will not automatically transition to a local LSP. Depending on the link type, the applications using the AE server need to perform different actions, as described below.

a. DMCC (Device and Media Control) Service:
DMCC uses the H.323 link for each DMCC softphone extension to talk to Communication Manager. When the connectivity to the main site is down, the DMCC service on the AE server detects it and sends an unregistered event to the application for each DMCC extension. Avaya recommends that the application then retry connecting to the main Communication Manager (through the DMCC service on the AE server). If that fails, it should try connecting to the Communication Manager (through the DMCC service on the AE server) on the local LSP. If the LSP is up, the application will get connectivity to Communication Manager (via the DMCC service on the AE server). When the connectivity to the main server is back up, the LSP would need to be put in offline mode either manually or automatically (if configured properly). The DMCC service will detect connectivity failure to the LSP and will send an unregistered event to the application for each DMCC extension.
Avaya recommends that the application then retry connecting to the main Communication Manager through the DMCC service on the AE server.

AE Services 3.0 has a feature that allows the use of a symbolic name for a list of IP addresses (i.e. a Gatekeeper list). Once the list is administered through the AE Services OAM web page, the application can use the symbolic name to get a DeviceID (i.e. DMCC softphone extension) for a particular Communication Manager. This feature allows the application to easily switch the DMCC softphones over to the LSP using the symbolic name.

b. CallInformation Services within DMCC:
The CallInformation service within DMCC uses the Transport (AEP) link to communicate with Communication Manager. When the connectivity to the main site is down, the CallInformation service on the AE server detects it and sends a link down event to the application. Avaya recommends that the AE server be pre-configured to have the LSP administered under the main site switch name through the AE Services OAM web page. This connection will not be active as long as the LSP is not up. Once it receives the Call Information link down event, the application will have to use System Management Services to dynamically configure the Transport (AEP) link (using the change ip-services command) on Communication Manager running on the LSP. The application should use the WSDL defined in:
http:///sms/SystemManagementService.php?wsdl
with the IPService Model defined in:
http:///sms/ModelSchema.php?model=IPServices

When the connectivity to the main server is back up, the LSP would need to be put in offline mode either manually (by giving a “reset system 4” command in Communication Manager) or automatically (if configured properly through the “change system-parameters mg-recovery-rule” form in Communication Manager). In either case, the Call Information service will detect transport link connectivity failure to the LSP and will send a link down event to the application. The transport link to the main Communication Manager will also come back up, for which the application will receive a link up event from Call Information services.

Note:
a) If the Transport (AEP) link has multiple CLAN addresses configured, the application will not receive a Call Information link down event unless connectivity to all CLANs is lost.
b) If the Transport (AEP) link is connected to one CLAN and the DMCC H.323 link is connected to another CLAN, it is possible that only one of the connections could be down. In this case, if LSPs are being used, one of the links could be on the main server and the other could be on an LSP. This will cause undesirable behavior.
c) Avaya recommends that for remote sites with G700 gateways and LSPs (and without a G600/G650/MCC/SCC on the same site), the transport (AEP) link from the AE Server at that remote site be configured to link to a CLAN(s) (on a G600/G650/MCC/SCC) at the main headquarters site.

c. Call Control Services within DMCC:
Call Control Services within DMCC uses the Transport (AEP) link to communicate with Communication Manager. Avaya recommends that applications add a CallInformationListener and look for a LinkDown event for indication that connectivity to the main site is down. (In future releases, Call Control Services clients will receive a MonitorStop request for all call control monitors if the link is lost to the main site.) Avaya recommends that the AE server be pre-configured to have the LSP administered under the main site switch name through the AE Services OAM web page. This connection will not be active as long as the LSP is not up. Once it receives the link down event, the application will have to use System Management Services to dynamically configure the Transport (AEP) link (using the change ip-services command) on Communication Manager running on the LSP. The application should use the WSDL defined in:
http:///sms/SystemManagementService.php?wsdl
with the IPService Model defined in:
http:///sms/ModelSchema.php?model=IPServices

When the connectivity to the main server is back up, the LSP would need to be put in offline mode either manually (by giving a “reset system 4” command in Communication Manager) or automatically (if configured properly through the “change system-parameters mg-recovery-rule” form in Communication Manager). The transport link to the LSP will be down and the application will receive a link down event. The transport link to the main Communication Manager will also come back up, for which the application will receive a link up event.

Note:
a) If the Transport (AEP) link has multiple CLAN addresses configured, the application will not receive a link down event unless connectivity to all CLANs is lost.
b) Avaya recommends that for remote sites with G700 gateways and LSPs (and without a G600/G650/MCC/SCC on the same site), the transport (AEP) link from the AE Server at that remote site be configured to link to a CLAN(s) (on a G600/G650/MCC/SCC) at the main headquarters site.

d. TSAPI, CVLAN, DLG Service and JTAPI:
The TSAPI, CVLAN, DLG Service and JTAPI on the AE server use the Transport (AEP) link to communicate with Communication Manager. Depending on the API, clients will receive an appropriate event when the connectivity to the main site is down. CVLAN clients will receive an “abort” for each association. TSAPI clients will receive a CSTAMonitorEnded event if the client is monitoring a device and/or a CSTASysStatEvent with a link down indication if the client is monitoring system status. Avaya JTAPI 5.2 and later clients will receive a “call event transmission ended” event if the client has call listeners.
Otherwise, an “observation ended” event will be received if the client has call observers. DLG clients will receive a link status event with a link down indication and a cause value. Avaya recommends that the AE server be pre-configured to have the LSP administered under the main site switch name through the AE Services OAM web page. This connection will not be active as long as the LSP is not up. Once it receives the link down event, the application will have to use System Management Services to dynamically configure the Transport (AEP) link (using the change ip-services command) on Communication Manager running on the LSP. The application should use the WSDL defined in:
http:///sms/SystemManagementService.php?wsdl
with the IPService Model defined in:
http:///sms/ModelSchema.php?model=IPServices

When the connectivity to the main server is back up, the LSP would need to be put in offline mode either manually (by giving a “reset system 4” command in Communication Manager) or automatically (if configured properly through the “change system-parameters mg-recovery-rule” form in Communication Manager). The transport link to the LSP will be down and the application will receive a link down event. The transport link to the main Communication Manager will also come back up, for which the application will receive a link up event.

Note:
a) If the Transport (AEP) link has multiple CLAN addresses configured, the application will not receive a link down event unless connectivity to all CLANs is lost.
b) Avaya recommends that for remote sites with G700 gateways and LSPs (and without a G600/G650/MCC/SCC on the same site), the transport (AEP) link from the AE Server at that remote site be configured to link to a CLAN(s) (on a G600/G650/MCC/SCC) at the main headquarters site.

Terminology and Acronyms

Term          Meaning
AEP           Application Enablement Protocol
AE Services   Application Enablement Services
API           Application Programming Interface
ASAI          Adjunct Switch Application Interface
CLAN          Control Local Area Network interface card
CM            Communication Manager
CVLAN         Call Visor LAN
DLG           Definity LAN Gateway
DMCC          Device, Media and Call Control
ESS           Enterprise Survivable Server
HA            High Availability
JTAPI         Java Telephony API
LSP           Local Survivable Processor
MCC           Multi-Carrier Cabinet
PE            Processor Ethernet (also referred to as procr)
SCC           Single Carrier Cabinet
SP            System Platform
TSAPI         Telephony Server API
WSDL          Web Service Definition Language
