Mainframe SRDF/A & MSC Best Practices

EMC Proven Professional™ Knowledge Sharing November, 2007

Michael Smialek, Technology Solutions, EMC Corporation®

Copyright 2007 EMC Corporation. All rights reserved.

Table of Contents

Introduction
Purpose and Scope
Audience and Prerequisites
Reference Documentation and Resources
General Considerations
Best Practices
1. Disk Drives, RAID Protection and Cache
2. Cache Configurations
3. Initialization Parameters
4. GateKeepers
5. Target BCVs and Clones
6. Performance Considerations
7. SQ SRDF Display
8. Recovery Procedures
9. Operational Procedures
10. Network Considerations

Disclaimer: The views, processes or methodologies published in this article are those of the authors. They do not necessarily reflect EMC Corporation’s views, processes or methodologies.


Introduction

Implementation of SRDF/A and MSC into a mainframe production environment is a balancing act between key resources such as cache, disk drives, RAID protection and network bandwidth. The misconfiguration of any one of these resources can cause SRDF/A to drop, requiring a detailed recovery process. By properly configuring the resources from the start, satisfaction with SRDF/A will be improved and Business Continuance projects will remain on schedule.

This article focuses on the following:
• Drive and RAID configurations
• Cache configurations
• Initialization Parameters
• GateKeepers
• Target BCVs and Clones
• Performance considerations
• Recovery considerations
• Operational Procedures
• Network Considerations

Purpose and Scope

This document describes SRDF/A Best Practices drawn from a variety of inputs, including EMC documentation, presentations by EMC Corporate personnel, feedback from the Midwest BC Practice, and personal experience implementing SRDF/A and MSC. This article assumes an environment running microcode 5671 or 5771 on DMX2s and DMX3s. As with any product implementation, you must understand the application workload, processing cycles, RPO, RTO and RGO to determine whether the information is applicable to your situation.

Audience and Prerequisites

This article is primarily for customers, Solution Architects and Implementation Specialists chartered to implement SRDF/A and MSC on the mainframe. It assumes previous knowledge of Symmetrix® configurations, SRDF/A, TimeFinder/Mirror and/or TimeFinder/Clone.

Reference Documentation and Resources

The contents of this article are based upon the following product software levels. Supporting product information can be found in these EMC manuals:
• EMC ResourcePak Base for z/OS v5.5 Product Guide
• Symmetrix SRDF Host Component for z/OS v5.3 Product Guide
• Symmetrix SRDF Host Component for z/OS v5.3 Message and Code Guide
• TimeFinder/Mirror for z/OS v5.4 Product Guide
• TimeFinder/Mirror for z/OS v5.4 Message and Code Guide
• TimeFinder/Clone Mainframe SNAP Facility v5.5 Product Guide


General Considerations

An SRDF/A and MSC mainframe implementation will most likely extend over several weeks or months. Many begin with a mainframe migration from existing EMC or other vendors’ storage. Keep the end objective of running SRDF/A and MSC in mind when doing the Symmetrix DMX configurations. There are several design considerations which can make the job easier. For example:
1. Document and raise awareness of cache, bandwidth or configuration issues early in the project.
2. Ensure that everyone on the project attends formal SRDF/A and MSC training.
3. Always keep the installation at current microcode and software patch levels.
4. Open a Software Assistance Center (SAC) case for any problem encountered: 1-800-EMC-4SVC (1-800-362-4782).
5. Review and secure approval for Configuration Design documents.
6. Though 60 seconds of Secondary Delay is the goal, peak write periods will probably exceed 60 seconds.
7. Document the workload (write %) and RPO/RTO requirements.
8. Avoid using DMX800s in the solution if possible because of cache restrictions.
9. Average write input bandwidth from the host = Average write output bandwidth on the RDF links.
10. Measure everything using CMF/RMF reports, EMC ControlCenter Workload Analyzer or Symmetrix STP data to project cache and bandwidth requirements.
11. If possible, keep volumes involved in SRDF/A consecutive by Symmetrix device number, because commands like REFRESH, RFR-RSUM and Consistent Split run faster when given consecutive ranges than when issued once per logical volume.
12. Factor in bursts of write I/O and long duration write peak periods.
13. Don’t be surprised when you review open cases and they refer to SRDF/A as SNOW (Symmetrix Native Ordered Writes), which was the internal product development name.
14. An SRDF/A session consists of all devices in an RAGroup. You cannot switch individual devices in an RAGroup into other modes such as Adaptive Copy.
15. To activate SRDF/A, all volumes in the SRDF/A Group must have a status of R/W-AD. SRDF/A will not activate with a volume in TNR-AD status.
16. When executing change commands, always follow the three golden steps:
    • Do an SQ query command to ensure volumes, links etc. are in the expected state
    • Execute the SC command
    • Do another SQ query to ensure the volumes, links etc. were changed as expected
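The three golden steps in item 16 can be sketched as a small wrapper. This is only an illustration: `run` is a hypothetical callable that submits a command and returns its output as text, and `FakeConsole` with its response strings is invented for the demonstration; neither is part of any EMC product.

```python
from typing import Callable, List


def change_with_verification(
    run: Callable[[str], str],
    query_cmd: str,
    change_cmd: str,
    expect_before: str,
    expect_after: str,
) -> None:
    """Query, change, query again; stop if either state is unexpected."""
    before = run(query_cmd)                       # golden step 1: SQ query
    if expect_before not in before:
        raise RuntimeError(f"unexpected state before change: {before!r}")
    run(change_cmd)                               # golden step 2: SC command
    after = run(query_cmd)                        # golden step 3: SQ query
    if expect_after not in after:
        raise RuntimeError(f"change did not take effect: {after!r}")


class FakeConsole:
    """Hypothetical stand-in for an operator console, for demonstration only."""

    def __init__(self) -> None:
        self.status = "TNR-AD"
        self.issued: List[str] = []

    def __call__(self, command: str) -> str:
        self.issued.append(command)
        if command.startswith("#SC"):
            self.status = "R/W-AD"                # pretend the change worked
            return "COMMAND ACCEPTED"
        return f"DEVICE STATUS {self.status}"     # invented output format
```

The point of the wrapper is that the change command is never issued blind: a failed pre-check stops before anything is modified, and a failed post-check surfaces immediately rather than at the next drop.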


Best Practices

1. Disk Drives, RAID Protection and Cache

DMX frame configurations provide flexibility with different drive capacities and performance (73 GB 10K, 73 GB 15K, 146 GB 10K, 146 GB 15K, 300 GB 10K and 500 GB 7.2K). Mainframe RAID protection schemes also provide choices (Raw, RAID1, RAID10, RAID5(3+1), RAID5(7+1)).

One of the most critical configuration considerations is balancing the speed of the R1 source side with the R2 target side. An unbalanced configuration could slow down the SRDF/A Apply Cycle on the R2 target side, which will increase the Secondary Delay and eventually cause a volume to hit the R2 Write Pending limit. Just one R2 volume hitting the WP limit will cause SRDF/A to drop with a CACA10 error. Generally speaking, you want the R2 target side to be just as fast as, if not faster than, the R1 source side. This can be tricky because source frames typically have 73 and 146 GB drives while the target frames get 146 and 300 GB drives.

With disk drives it’s a numbers game. If you have (200) 73 GB drives on the R1 source side, you want (200) 146 GB drives on the R2 target side. In this example, putting only (100) 146 GB drives on the R2 target side to meet capacity requirements is a guaranteed SRDF/A target restore problem.

RAID protection can also cause issues; for example, having RAID1 or RAID10 on the R1 source side while using RAID5(3+1) on the R2 target side. Again, the R1 source side is faster and can de-stage writes faster than the R2 target.

Traditional TimeFinder/Mirror BCVs in the R2 target frame can impact performance. If the R2 target standards are on 146 GB drives and the Established BCVs are on 300 GB drives, the write de-stage may be slower while it waits for the BCV write operations to complete. The BCV/Clones are real devices and will consume cache slots, reducing the R2 target side Write Pending limit. The R2 target side should have more cache than the R1 source side to compensate for the BCV/Clones. If the R1 source side has 2000 standard volumes and 48 GB of cache, the R2 target side will probably have 2000 standard and 2000 BCV/Clone volumes; therefore the R2 target side should have at least 64 GB of cache.

Avoid these SRDF/A configurations:
1. The R1 source is RAID1 or RAID10 and the R2 target is RAID5
2. RAID10 on the R1 source and RAID1 on the R2 target
3. R1 source side has faster and/or more drives than the R2 target side
4. R2 target has a smaller volume WP limit than the R1 source because of BCVs or Clones
5. RAID5(7+1) on the R1 source to RAID5(3+1) on the R2 target (not allowed)
6. BCV/Clones that are slower than the R2s, whether from slower drives, larger drives with more hypers, or mismatched RAID protection such as RAID5(7+1) for Standards with raw BCVs (the Standard is spread across 8 drives and has double the I/O performance of the single raw drive)

SRDF/A configurations which should be safe depending upon the workload:
1. R1 source is RAID1 and R2 target is RAID1 or RAID10
2. RAID10 on the R1 source and RAID10 on the R2 target
3. RAID5(3+1) on the R1 source and RAID5(3+1) on the R2 target
4. RAID5(3+1) on the R1 source and RAID5(7+1) on the R2 target
5. RAID5(7+1) on the R1 source and RAID5(7+1) on the R2 target


Be aware of the reduced maximum volume count with RAID5(7+1). A DMX1000 with RAID5(7+1) can have only 2,000 volumes, DMX2000 4,096 volumes, DMX3000 8,000 volumes. This includes all standards and BCV/Clones.
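The RAID pairings listed in this section can be captured in a small lookup, which may be convenient when reviewing a Configuration Design document. This is a minimal sketch: the function and constant names are invented for illustration, only the pairings the article names explicitly are encoded, and any other combination is reported as "not listed" rather than guessed at.

```python
RAID5_TYPES = ("RAID5(3+1)", "RAID5(7+1)")

# (R1 source protection, R2 target protection) pairs the article says to avoid.
AVOID = (
    {("RAID1", r5) for r5 in RAID5_TYPES}
    | {("RAID10", r5) for r5 in RAID5_TYPES}
    | {("RAID10", "RAID1"), ("RAID5(7+1)", "RAID5(3+1)")}
)

# Pairs the article lists as safe, depending upon the workload.
SAFE = {
    ("RAID1", "RAID1"),
    ("RAID1", "RAID10"),
    ("RAID10", "RAID10"),
    ("RAID5(3+1)", "RAID5(3+1)"),
    ("RAID5(3+1)", "RAID5(7+1)"),
    ("RAID5(7+1)", "RAID5(7+1)"),
}


def classify_pairing(r1_raid: str, r2_raid: str) -> str:
    """Classify an R1/R2 RAID pairing using only the lists in this article."""
    pair = (r1_raid, r2_raid)
    if pair in AVOID:
        return "avoid"
    if pair in SAFE:
        return "safe depending on workload"
    return "not listed"


def drive_count_ok(r1_drives: int, r2_drives: int) -> bool:
    """The 'numbers game': the R2 side wants at least as many drives as R1."""
    return r2_drives >= r1_drives
```

For the example in the text, `drive_count_ok(200, 100)` is false, which is the case the article calls a guaranteed target restore problem. The lookup does not cover drive speed, hyper counts or Write Pending limits, which still need to be reviewed by hand.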

2. Cache Configurations

After network bandwidth, cache in the source and target frames is the most critical component of the SRDF/A solution. Normally, the source frame cache is adequate, but the target frame cache is inadequate. Because of the BCV/Clones in the target frame, you are in trouble if both the source and target frames have the same amount of cache. In order to keep the volume Write Pending limit the same between the source and target frames, you must have more cache in the target frames.

A key cache consideration is how much Secondary Delay you can tolerate in the solution. Remember that providing only the minimum cache requirements means you can probably tolerate only about 60-100 seconds of Secondary Delay. Is this enough to absorb the write spikes and long duration write periods? Most likely you will need more than the minimum cache for SRDF/A. Mainframe shops typically have application jobs which rebuild VSAM files after batch processing. This will create write spikes against a small group of logical volumes. EMC has internal tools which can help configure the R1 source cache to tolerate more Secondary Delay. A recommended minimum for mainframe DB2 shops is 600 seconds of Secondary Delay.

3. Initialization Parameters

Start Symmetrix Control Facility (SCF):

    S EMCSCF

Stop SCF:

    F EMCSCF,INI,SHUTDOWN

Typical SCF initialization parameters:

    SCF.WORK.HLQ=EMC.PROD
    SCF.LOG.RETAIN.COUNT=1
    SCF.LOG.RETAIN.DAYS=1
    SCF.LOG.TRACKS.PRI=10
    SCF.LOG.TRACKS.SEC=50
    SCF.TRACE.MEGS=20
    SCF.TRACE.RETAIN.COUNT=1
    SCF.TRACE.RETAIN.DAYS=1
    SCF.WORK.UNIT=SYSDA
    ******* CROSS SYSTEM COMMUNICATIONS ********
    SCF.CSC.ACTIVE=NO
    *SCF.CSC.GATEKEEPER.LIST=CUU5
    *SCF.CSC.ACTIVEPOLL=10
    *SCF.CSC.IDLEPOLL=5
    ******* SRDF/A PARMS ******
    SCF.ASY.MONITOR=ENABLE
    SCF.ASY.POLL.INTERVAL=1
    SCF.ASY.SMF.RECORD=206
    SCF.ASY.SMF.POLL=15
    *SCF.ASY.USEREXIT=NONE
    SCF.ASY.SECONDARY_DELAY=100
    SCF.LFC.LCODES.LIST=9999-9999-9999-9999   /* SRDF/A MSC */
    SCF.LFC.LCODES.LIST=9999-9999-9999-9999   /* TF CONSPLIT */
    SCF.DAS.ACTIVE=NO
    ******* MSC PARMS *****
    SCF.MSC.DEBUG=N
    SCF.MSC.ENABLE=Y


Start SRDF Host Component (RDF):

    S EMCRDF

Stop RDF:

    P EMCRDF

Typical RDF initialization parameters:

    SUBSYSTEM_NAME=EMC2                 NAME IN IEFSSN__
    COMMAND_PREFIX=#                    ONE CHAR COMMAND PREFIX
    MAX_QUERY=8192
    MESSAGE_PROCESSING=YES              MESSAGE PROCESSING
    OPERATOR_VERIFY=CRITICAL
    SYNCH_DIRECTION_ALLOWED=R1>R2       ALLOW SYNC R1>R2 AND R2>R1
    SYNCH_DIRECTION_INIT=R1>R2          DEFAULT SYNC DIRECTION
    ****************************************************************
    *** MSC - SRDFA MULTI-BOX **
    MSC_GROUP_NAME=MSCGRP01
    MSC_WEIGHT_FACTOR=0
    MSC_INCLUDE_SESSION=347F,(02)
    MSC_INCLUDE_SESSION=837F,(08)
    MSC_INCLUDE_SESSION=887F,(10)
    MSC_CYCLE_TARGET=30
    MSC_GROUP_END
    *MSC_ACTIVATE
    *MSC_ALLOW_INCONSISTENT
    ****************************************************************
    SECURITY_QUERY=ANY                  QUERY SECURITY LEVEL
    SECURITY_CONFIG=MASTER              CONFIG SECURITY LEVEL
    MESSAGE_LABELS=MVS_CUU
    SHOW_COMMAND_SEQ#=YES
    ****************************************************************
    * STAR PARMS
    *ALLOW_CRPAIR_NOCOPY=STAR
    *VALIDATE_CRPAIR_NOCOPY_LEVEL=1
    ****************************************************************
    ALIAS=GLOBAL,G
    ALIAS=CREATEPAIR,CPAIR
    ALIAS=DELETEPAIR,DPAIR
    ALIAS=HDELETEPAIR,HDPAIR
    ALIAS=DIFFERENTIAL,DIFF
    ALIAS=NOCOPY,NC
    ****************************************************************
    GROUP_NAME=SRDFABOX1
    INCLUDE_RAG=3400,(02)
    GROUP_END
    ****************************************************************
    GROUP_NAME=SRDFABOX2
    INCLUDE_RAG=8300,(08)
    GROUP_END
    ****************************************************************
    GROUP_NAME=SRDFABOX3
    INCLUDE_RAG=8800,(10)
    GROUP_END
    ****************************************************************


4. GateKeepers

Gatekeepers play a critical role in the daily operation of SRDF. For SRDF/A and MSC, a mainframe DMX should be configured with several CKD gatekeeper devices. You need unique gatekeepers for:
• Normal SRDF SQ query and SC commands
• MSC operations
• Cross System Communications (CSC)

There are some simple rules for defining gatekeepers:
1. A CKD device of any cylinder size; use a small volume such as a MOD1
2. Not configured as an FBA, SRDF, SRDF/A, BCV or VDEV device
3. Should be offline to the host
4. Not used by other systems
5. The MSC Gatekeeper must not be an R1 SRDF device

Be careful when using SRDF GROUP_NAMEs. By default, SRDF will use the first device in the address range to handle the command. For example:

    GROUP_NAME=BOX3400
    FILTER_KNOWN
    INCLUDE_CUU=3400-34FF
    GROUP_END

If you issue a #SQ VOL,BOX3400,ALL command, SRDF will use device CUU=3400 to handle the volume query for the box. If CUU=3400 happens to be an application volume with strict I/O response time requirements, such as a CICS log file, the query could negatively impact the application. One way around this is to make the first UCB in the Symmetrix a device containing no application data. Another way is to use an RA Group definition, which allows you to specify the gatekeeper device you want to use:

    GROUP_NAME=BOX3400
    INCLUDE_RAG=34FF,(02)
    GROUP_END

Now when you issue a #SQ VOL,BOX3400,ALL command, SRDF will use device CUU=34FF, which we assume is a defined CKD gatekeeper, for the query instead of defaulting to CUU=3400, which is an application volume.
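The gatekeeper rules above lend themselves to a simple checklist function. The sketch below is illustrative only: the `Device` record and its field names are hypothetical, and the values would have to be filled in by hand from your own configuration listing rather than read from any EMC interface.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Device types rule 2 says a gatekeeper must not be configured as.
FORBIDDEN_ROLES = {"SRDF", "SRDF/A", "BCV", "VDEV"}


@dataclass
class Device:
    """Hypothetical description of one candidate gatekeeper device."""
    cuu: str
    emulation: str = "CKD"                          # "CKD" or "FBA"
    roles: Set[str] = field(default_factory=set)    # e.g. {"BCV"} or {"SRDF", "R1"}
    online: bool = False
    used_by_other_systems: bool = False


def gatekeeper_violations(dev: Device, for_msc: bool = False) -> List[str]:
    """Return the rules from this section that the candidate device breaks."""
    problems: List[str] = []
    if dev.emulation != "CKD":
        problems.append("must be a CKD device")
    bad_roles = sorted(dev.roles & FORBIDDEN_ROLES)
    if bad_roles:
        problems.append("must not be configured as " + ", ".join(bad_roles))
    if dev.online:
        problems.append("should be offline to the host")
    if dev.used_by_other_systems:
        problems.append("must not be used by other systems")
    if for_msc and "R1" in dev.roles:
        problems.append("MSC gatekeeper must not be an R1 SRDF device")
    return problems
```

An empty list means the candidate passes every rule listed here; it does not check that the device is actually a small volume such as a MOD1.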

5. Target BCVs and Clones

With SRDF/A at the target site, you will likely have at least one set of BCV/Clones for both production replication recovery and DR testing. The ideal scenario is to have two sets to keep production replication and DR testing activities separate.


Some guidelines for BCV/Clones:
1. Because write de-staging on the target box is so critical, try to keep the BCV/Clone physical drives and RAID protection similar to the R2 Standards. At a minimum, have the same number of physical drives.
2. Configure the R2 Standards and BCV/Clones on their own physical drives.
3. RTO cannot be guaranteed during DR testing if a single set of BCV/Clones is used for both production recovery and DR testing.
4. Do not concentrate high write activity volumes on the same physical drives or RAID set. Remember that just one logical R2 volume hitting the write pending limit will cause the volume to go TNR, which causes SRDF/A to drop with a CACA10.
5. For shops with intensive write batch cycles, writing to the Standards and Established BCVs causes multiple de-stage operations for a single write. To improve R2 target box performance, some shops split the BCVs prior to the start of batch processing. This eliminates the write de-stage to the BCV volumes during peak periods. Batch jobs can be automatically submitted by the Scheduler to Split and Re-Establish the BCVs around the peak periods.
6. Don’t forget to build BCV/Clone batch jobs for the target recovery site, to be used for testing and in the event of a real disaster.
7. Using ranges of devices in commands like SPLIT 1,RMT(8800,0000-0FFF,02),CONS(GLOBAL) executes much faster than issuing one SPLIT command for each device.

6. Performance Considerations

There are some choices you have to make in an SRDF/A environment which will impact performance. For example:
1. Should you use Adaptive Copy Write or Adaptive Copy Disk? Because Adaptive Copy Write competes for the same cache slots as SRDF/A, Adaptive Copy Disk is preferred. Remember not to mix Adaptive Copy Write and Adaptive Copy Disk in the same box.
2. With Adaptive Copy, you probably want to use a QOS setting to ensure the initial copy does not impact production. But what is the right value? On a DMX2, QOS=2 does about 220 I/Os per second per RA while QOS=4 does only 60 I/Os per second per RA.
3. Be careful when proposing a fan-in solution that has multiple source frames going to one target. Scrutinize the target cache, disk and RAID protection to ensure the target can handle all source frames peaking at the same time.
4. Though supported, try to avoid putting SRDF/A and SRDF/S workload on the same RAs, because the SRDF/S workload has a higher priority and could cause SRDF/A to back up and drop.
5. With SRDF/A, how should you handle Page Packs, SYSDA, SORTWORK, TEMP packs etc.? You still do not want to replicate Page Packs because the SRDF/A window could cause delays in paging I/O that might impact system performance. Since you need to initialize the VOLSERs for these volumes at the R2 recovery site, most shops put these volumes in their own RAGroup and only put them into Adaptive Copy over the weekend or when they do an IPL. Since Page Packs are replicated weekly, no application datasets should be on the Page Packs!


Couple dataset volumes have special requirements. The System Logger CDS (LOGR) used by CICS journaling must be replicated for recovery. All other couple datasets (SYSPLEX, ARM, SFM, WLM, OMVS, and CFRM) are required but have their contents re-created at IPL. Therefore, they can be handled like Page Packs using Adaptive Copy only.

7. SQ SRDF Display

The SQ SRDFA query is one display that can help you understand what is happening. Execute it about every 5 minutes, especially with new implementations, to determine the SRDF/A activity during the peak processing cycles. The output of the display will go to the RDF Log and z/OS SYSLOG.

In the following example things are not good. Pay attention to Secondary Consistent (?), which means SRDF/A doesn’t know if the data is consistent. Add the Capture Cycle Size (171,180) to the Transmit Cycle Size (0) and multiply by 56K per cache slot for a DMX2000: that is roughly 10 GB of cache just to hold the cycles. By the way, this DMX2000 had 45 GB of usable cache. The Secondary Delay is 323 seconds, and Cleanup Running (Y) tells you SRDF/A has dropped on this box!

    01.48.22 STC22138 EMCMN00I SRDF-HC : (5177) #SQ SRDFA,8800
    01.48.22 STC22138 EMCQR00I SRDF-HC DISPLAY FOR (5177) #SQ SRDFA,8800 410
    410 MY SERIAL #  MY MICROCODE
    410 ------------ ------------
    410 000187700748 5671-54
    410
    410 MY GRP ONL PC OS GRP OS SERIAL    OS MICROCODE SYNCHDIR FEATURE
    410 ------ --- -- ------ ------------ ------------ -------- ------------
    410 LABEL      TYPE    AUTO-LINKS-RECOVERY    LINKS_DOMINO     MSC_GROUP
    410 ---------- ------- ---------------------- ---------------- ----------
    410 10     Y   F  10     000187751590 5671-54      G(R1>R2) SRDFA I MSC
    410 BOX3       STATIC  AUTO-LINKS-RECOVERY    LINKS-DOMINO:NO  (MSCGRP01)
    410
    410 ----------------------------------------------------------------------
    410 PRIMARY SIDE: CYCLE NUMBER 7,811          MIN CYCLE TIME 30
    410 SECONDARY CONSISTENT ( ? )                TOLERANCE ( N )
    410 CAPTURE CYCLE SIZE 171,180                TRANSMIT CYCLE SIZE 0
    410 AVERAGE CYCLE TIME 57                     AVERAGE CYCLE SIZE 60,490
    410 TIME SINCE LAST CYCLE SWITCH 209          DURATION OF LAST CYCLE 114
    410 MAX THROTTLE TIME 0                       MAX CACHE PERCENTAGE 94
    410 HA WRITES 1,522,704,296                   RPTD HA WRITES 720,119,145
    410 HA DUP. SLOTS 13,513,070                  SECONDARY DELAY 323
    410 LAST CYCLE SIZE 148,915                   DROP PRIORITY 33
    410 CLEANUP RUNNING (Y)                       MSC WINDOW IS OPEN ( N )
    410 MSC ACTIVE (Y)                            ACTIVE SINCE 12/03/2006 02:11:41
    410 CAPTURE TAG C0000000 00003F26             TRANSMIT TAG C0000000 00003F25
    410 GLOBAL CONSISTENCY ( Y )                  STAR RECOVERY AVAILABLE ( N )
    410 ----------------------------------------------------------------------
    410 END OF DISPLAY
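The cache arithmetic described above the display can be reproduced with a few lines. This is a sketch under one stated assumption: "56K per cache slot" is taken to mean 56 x 1024 bytes. The function name is invented for illustration.

```python
SLOT_KB = 56  # per cache slot on a DMX2000, as quoted in the article


def cycle_cache_bytes(capture_slots: int, transmit_slots: int,
                      slot_kb: int = SLOT_KB) -> int:
    """Cache needed to hold the capture and transmit cycles, in bytes.

    Assumes 1K = 1024 bytes; the article does not say which convention it uses.
    """
    return (capture_slots + transmit_slots) * slot_kb * 1024


# Values taken from the display: CAPTURE CYCLE SIZE and TRANSMIT CYCLE SIZE.
held = cycle_cache_bytes(171_180, 0)
print(f"{held / 1e9:.1f} GB held in SRDF/A cycles")
```

Under that assumption the result is a little under 10 GB, consistent with the rounded figure quoted in the text. Running the same calculation against the AVERAGE CYCLE SIZE from the display gives a sense of how far this snapshot is from the box's normal cycle.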






• Min Cycle Time – The target minimum cycle time in seconds. SRDF/A will try to execute the cycles in this interval of time, but the cycles can be longer than the specified value.
  – A cycle can be active until SRDF/A reaches the MAX CACHE PERCENTAGE that it is allowed to use. If the cache limit is reached, the SRDF/A session is terminated and a bitmap session is activated.
  – When MSC is active, the Min Cycle Time is not used. The value specified in the parameter MSC_CYCLE_TARGET of SRDF Host Component is used.

• Secondary Consistent – Secondary Consistent is a Y/N flag that indicates whether the secondary side is consistent during SRDF/A operations.
  – Y  SRDF/A is consistent.
  – N  SRDF/A is not consistent.
  – ?  SRDF/A is not active and the data on the secondary side may or may not be consistent. When SRDF/A is not active, SRDF/A cannot determine the consistency of the secondary side.
  Note: After reaching a point of consistency, SRDF/A will continue to preserve a consistent copy on the secondary side.

• Capture Cycle Size – Number of cache slots currently in the active cycle.

• Transmit Cycle Size – Number of cache slots left in the cycle being transmitted to the secondary side.

• Average Cycle Size – Average number of cache slots in the past sixteen cycles.

• Time since last cycle switch – Number of seconds since SRDF/A last cycle switched.

• Duration of last cycle – Number of seconds the last cycle lasted.

• Average Cycle Time – Average number of seconds per cycle over the past sixteen cycles.

• Max Throttle Time – Maximum Throttle Time indicates how long SRDF/Asynchronous will slow the host adapters once cache limits are reached. If the value is 0, SRDF/Asynchronous is dropped as soon as cache limits are reached.
  – If the value is 65535, the host adapters will work at write pending limits speed indefinitely.
  – If the value is anything other than 65535, it is the number of seconds the host adapters will work at write pending limits speed before SRDF/Asynchronous is dropped.
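The three cases for Max Throttle Time can be summarized in a small helper, which may be useful when annotating captured SQ SRDFA output. This is a sketch only; the function name and the wording of the returned strings are illustrative, not EMC messages.

```python
INDEFINITE = 65535  # special value described in the field definition


def throttle_behavior(max_throttle_time: int) -> str:
    """Describe what happens once SRDF/A reaches its cache limits."""
    if max_throttle_time == 0:
        return "drop SRDF/A as soon as cache limits are reached"
    if max_throttle_time == INDEFINITE:
        return "throttle host adapters indefinitely"
    return (f"throttle host adapters for up to {max_throttle_time} seconds, "
            "then drop SRDF/A")
```

In the sample display, MAX THROTTLE TIME is 0, so `throttle_behavior(0)` describes that box: no throttling period, and the session drops once the cache limit is hit.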




• Max Cache Percentage – Maximum Cache Percentage is the percentage of cache that SRDF/Asynchronous will be allowed to use. The Maximum Cache Percentage is 100% in the initial release of SRDF/A.

• HA Writes – Number of tracks written by the host adapters.

• RPTD HA Writes – Total number of tracks written multiple times in a cycle by the host adapters.

• HA DUP SLOTS – Host Adapter Duplicated Slots reflects the number of times a slot had to be duplicated because it was written to in multiple cycles.

• Secondary Delay – Secondary Delay is the approximate time the data on the secondary side is behind the primary side.

• Last Cycle Size – The Last Cycle Size is the size of the complete previous cycle.

• Cleanup Running – Cleanup Running is a (Y/N) flag:
  – Y  The secondary side will reject non-SRDF/Asynchronous for a small window of time (approximately 30 seconds). Cleanup only runs immediately after SRDF/Asynchronous goes from the Active to the Inactive state. Cleanup prevents RDF-RSUM, REFRESH RFR-RSUM, or VALIDATE INVALIDATE from being run on the SRDF/Asynchronous devices. After the cleanup is finished, the RDF-RSUM, REFRESH RFR-RSUM, or VALIDATE INVALIDATE commands may be run.
  – N  Cleanup is not running.

• MSC window is open – The Host Managed Consistency Window is a small time frame in which the cycle switch must be run when running in MSC. When the MSC window is open, all write I/Os to SRDF/Asynchronous primary devices are disconnected. Read I/Os continue to run.

• MSC Active – Host Active is a (Y/N) flag:
  – Y  The SRDF/Asynchronous session is part of a multiple SRDF/Asynchronous session group, that is, a group of SRDF/Asynchronous sessions whose cycle switch is coordinated by the Host. When SRDF/Asynchronous is not active and MSC Active shows Y, SRDF/Asynchronous was deactivated while MSC was active.
  – N  SRDF/Asynchronous is not running in MSC mode.

• Active since – Active since is the date and time that the SRDF/Asynchronous session joined MSC.




• Capture tag – The Capture Tag is the tag for the data in the capture cycle. The Capture Tag verifies that the multiple SRDF/Asynchronous sessions in the MSC group are coordinated. When MSC is active, the Capture Tag functions like the cycle number when SRDF/Asynchronous is active and MSC is not active.

• Transmit tag – The Transmit Tag is the tag for the data in the transmit cycle. The Transmit Tag verifies that the multiple SRDF/Asynchronous sessions in the MSC group are coordinated. When MSC is active, the Transmit Tag functions like the cycle number when SRDF/Asynchronous is active and MSC is not active.

8. Recovery Procedures

You must execute detailed procedures to recover from a dropped SRDF/A MSC session.

1. Immediately run the MSC CleanUp job (SCFRDFME) at the source, assuming the links are available to the R2s. If the links are down, the MSC CleanUp job has to be run at the target site before proceeding. Use an MSC Group name of exactly 8 characters; otherwise you have to embed spaces in the CleanUp parameters. Example: PGM=SCFRDFME,PARM='Y,0300,SHORT   ' with three trailing blanks. Important: The 0300 UCB address should be an SRDF/A device in any of the MSC Group frames. Do not use the MSC GateKeeper UCB!
2. The R2 target side must have a “recovery” or “starter” system to run the MSC CleanUp job, ready the R2 devices, make the R2s read/write and bring the R2s online in the event of a real disaster.
3. Three MSC recovery scenarios:
   1. All R2 receive cycles have the same tag and are completed – MSC action: Commit Receive Cycle
   2. Most likely: all R2 receive cycles have the same tag but one or more are not complete – MSC action: Discard Receive Cycle
   3. Apply cycle tags of some R2 Symms match receive cycle tags of one or more other R2 Symms (not all R2 receive cycles were committed) – MSC action: Commit and Discard Receive Cycle
4. SRDF/A Symmetrix error codes:
   • CACA1x – Source system maximum write pending limit reached. Could be a cache full or R2 target de-staging issue.
   • CACA20 – SRDF/A target device has become TNR on the link, most likely a network issue.
   • CACA30 – Target SRDF/A device made TNR on the link at the R1 side.
   • CACA40 – All links to the target frame were lost for a period of time greater than that set in the BIN link limbo parameter. The default is 10 seconds, which can be increased up to 60 seconds if you get 7D3 and 7E3 messages. Don’t increase from the 10 second default if using SRDF/S.
   • CACA50 – SRDF/A MSC mode cycle switch window did not close within 5 seconds. Check to make sure SCF is started with SUB=MSTR and a dispatching priority of SYSSTC.
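When scanning SYSLOG or case notes after a drop, the error codes above can be kept in a simple lookup. The sketch below only restates the descriptions given in this article; the function name and the handling of the "x" wildcard in CACA1x are illustrative choices.

```python
SRDFA_ERROR_CODES = {
    "CACA1x": "source maximum write pending limit reached "
              "(cache full or R2 target de-staging issue)",
    "CACA20": "target device became TNR on the link, most likely a network issue",
    "CACA30": "target device made TNR on the link at the R1 side",
    "CACA40": "all links to the target frame lost for longer than link limbo",
    "CACA50": "MSC cycle switch window did not close within 5 seconds",
}


def describe_error(code: str) -> str:
    """Map an SRDF/A drop code to the short description used in this article."""
    code = code.strip().upper()
    # Any six-character code beginning CACA1 falls under the CACA1x family.
    if len(code) == 6 and code.startswith("CACA1"):
        code = "CACA1x"
    return SRDFA_ERROR_CODES.get(code, "not listed in this article")
```

For example, `describe_error("CACA10")` returns the write pending description, matching the drop scenario discussed under Disk Drives, RAID Protection and Cache.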


5. General SRDF/A start and recovery steps after an outage:
   1. To start, assume the SRDF/A devices have a status of TNR.
   2. Put all devices back into Adaptive Copy Disk mode:
          #SC VOL,LCL(3400,02),ADCOPY-DISK,ALL     (3400=Gatekeeper, 02=SRDF/A Group)
   3. Query synchronization status:
          #SQ VOL,LCL(3400,02),ALL                 (Device status should be TNR-AD)
      Check synchronization percentage:
          #SQ VOL,LCL(3400,02),INV_TRKS            (Eventually synchronization > 94%)
   4. Resume SRDF Adaptive Copy:
          #SC VOL,LCL(3400,02),RNG-RSUM,ALL
   5. Query devices, making sure all have a status of R/W-AD:
          #SQ VOL,LCL(3400,02),ALL
      Query devices to make sure none have a status of TNR:
          #SQ VOL,LCL(3400,02),TGT_NRDY            (Should get EMCQV90I NO DEVICES FOUND)
   6. If any devices still have a status of TNR or require special handling:
          #SC VOL,RMT(3400,02),RNG-REFRESH,ALL
          #SQ VOL,RMT(3400,02),REFRESH             (R2 device should have –R Refresh Flag)
          #SC VOL,RMT(3400,02),RNG-RSUM,ALL        (If this doesn’t work call support)
   7. Activate SRDF/A:
          #SC SRDF,LCL(3400,02),ACT
          #SQ SRDFA,3400                           (Look for Secondary Consistent (Y))
   8. Start MSC:
          F EMCSCF,MSC,REFRESH
          #SC GLOBAL,PARM_REFRESH                  (Load MSC Group definition)
          #SQ SRDFA,3400                           (Look for MSC ACTIVE (Y))
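Because the recovery commands differ only in the gatekeeper address and SRDF/A group number, the main path (steps 2 through 5 and step 7) can be generated per box for a runbook. This is a sketch: the function name is invented, the command strings are copied from the steps above, and the conditional refresh in step 6 and the MSC restart in step 8 are deliberately left out because they depend on what the queries show.

```python
from typing import List


def recovery_commands(gatekeeper: str, ra_group: str, prefix: str = "#") -> List[str]:
    """Build the main-path SRDF/A restart commands for one Symmetrix box."""
    lcl = f"LCL({gatekeeper},{ra_group})"
    return [
        f"{prefix}SC VOL,{lcl},ADCOPY-DISK,ALL",  # step 2: back to Adaptive Copy Disk
        f"{prefix}SQ VOL,{lcl},ALL",              # step 3: expect status TNR-AD
        f"{prefix}SQ VOL,{lcl},INV_TRKS",         # step 3: watch synchronization
        f"{prefix}SC VOL,{lcl},RNG-RSUM,ALL",     # step 4: resume
        f"{prefix}SQ VOL,{lcl},ALL",              # step 5: expect status R/W-AD
        f"{prefix}SQ VOL,{lcl},TGT_NRDY",         # step 5: expect no devices found
        f"{prefix}SC SRDF,{lcl},ACT",             # step 7: activate SRDF/A
        f"{prefix}SQ SRDFA,{gatekeeper}",         # step 7: check the display
    ]


if __name__ == "__main__":
    for command in recovery_commands("3400", "02"):
        print(command)
```

The list is meant to be read and issued by an operator who checks each query result before moving on, in keeping with the query, change, query discipline from General Considerations; it is not a substitute for those checks.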

9. Operational Procedures

1. Document all procedures used by the customer. For example:
   • SCF and RDF parameters
   • SCF High Availability LPAR
   • Clean start/stop of SRDF/A MSC
   • MSC Cleanup job for both source and target sites
   • Restart of SRDF/A after a crash
   • CREATEPAIR and DELETEPAIR procedures (Must disable MSC)
   • Healthcheck JCL
   • TimeFinder Establish and Re-Establish jobs
   • TimeFinder Consistent Split job
   • Creating new RDF Groups
   • Disable MSC for IPL
   • Refresh after a hardware upgrade
   • R2 and BCV/Clone testing procedures
   • Disaster Failover procedures
2. If queries don’t appear to be correct, you have ??? in the local UCB address field, or you just added more logical devices to the box, you must do a: #SC GLOBAL,SSID_REFRESH
3. In order to do CREATEPAIR and DELETEPAIR, MSC must not be active because SRDF/A must have TOLL=ON to permit R1/R2 maintenance. With TOLL=ON, SRDF/A will allow volumes to become inconsistent during the R1/R2 pairing without issuing a drop.
4. You cannot do a CREATEPAIR if the R1 volume has an active SNAP session.
5. Only one MSC Group definition is allowed per MSC environment, and only one MSC environment per SCF task.
6. Stopping the SCF started task for any reason will stop MSC and drop SRDF/A. For handling system maintenance, you probably want to define a Secondary LPAR with SCF by defining MSC_WEIGHT_FACTOR=2 in SRDF Host Component. This will allow MSC processing to continue between the Primary and Secondary SCF LPARs.
7. MSC cannot be active when attempting an SRDF Mode Change between SRDF/A and SRDF/S.
8. Personality swap between R1/R2 is not permitted while in SRDF/A mode.
9. When SRDF/A drops, the volumes will go back into SRDF Primary Mode, which is usually TNR-SY. Don’t forget to put the volumes back into Adaptive Copy with SC VOL,G(MYSYMM),ADCOPY-DISK, changing the mode to TNR-AD.
10. Don’t define more than 6 SRDF/A groups per director.
11. An SRDF/A Group must be empty of devices before it can be deleted.
12. While SRDF/A is active, the R2s are read-disabled and may not be used as a primary volume for TimeFinder SNAP.
13. Upgrading from 5670 to 5671 or 5771 requires an outage.

10. Network Considerations

1. Link Limbo, defined in the BIN with a default of 10 seconds, determines how long SRDF/A will wait if all links are lost before dropping with a CACA20. If you get Symmetrix Event Reference Codes of 7D3 or 7E3, you may increase Link Limbo up to 60 seconds. Increasing Link Limbo gives problem switches in the network time to reboot. Warning: don’t increase from the 10 second default if using SRDF/S.
2. You must have a switch between the Symmetrix and DWDM equipment when using GigE; otherwise you will get numerous link, reset, TNR volume and phone home errors. You also want to rate limit GigE so you don’t overrun the bandwidth.


3. Adequate Buffer-to-Buffer credits on the SAN Director are critical to SRDF/A throughput. But how many BB_Credits do you need? Assume the following:
   1) Link distance = 370 miles = 595,000 meters.
   2) Available network bandwidth is approximately 90 MB/s.
   3) Getting at least 2:1 compression.
   4) Storage connected to the SAN Director with 16 BB_Credits per port input and 4 output ports to the Routers.
   How many BB_Credits are needed on each of the 4 output ports?

   BB_Credits formula = (2 * length of the link in meters) / 4311 + 1
   BB_Credits = (2 * 595,000) / 4311 + 1 = 277 BB_Credits, which would fill the 90 MB/s available pipe.
   277 BB_Credits / 4 output ports = about 70 BB_Credits needed per port.
   Since we are getting 2:1 compression, we could handle a storage write peak of 90 MB/s * 2 = 180 MB/s.
4. Events which might cause bandwidth issues:
   • Not enough link bandwidth, or other activity when using shared bandwidth
   • Bursts of application I/Os; remember that a lot of spikes can hide in a 15 minute interval
   • Long peak periods (hours between 10 PM and 4 AM are typical)
   • Unexpected growth of workload (new application processing, accounts, subsidiaries)
   • New CPUs with more processing MIPS
   • Going from ESCON to FICON
   • Not allocating enough RAs
   • Adding more disk capacity in the source frames
   • Changing the network hardware or configuration
   • Implementing network features like data encryption
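The BB_Credit arithmetic in item 3 can be checked with a short calculation. This is a sketch that applies the article's formula as given, including its 4311 constant; the function names are invented, and rounding the per-port figure up is an assumption consistent with the article's "about 70".

```python
import math


def bb_credits_total(link_meters: float) -> float:
    """Article's formula: (2 * link length in meters) / 4311 + 1."""
    return (2 * link_meters) / 4311 + 1


def bb_credits_per_port(link_meters: float, output_ports: int) -> int:
    """Split the total evenly across output ports, rounding up (assumption)."""
    total = round(bb_credits_total(link_meters))
    return math.ceil(total / output_ports)


total = bb_credits_total(595_000)
per_port = bb_credits_per_port(595_000, 4)
print(f"total BB_Credits: {round(total)}, per port: {per_port}")
```

With the example's 595,000 meter link and 4 output ports, this gives 277 credits in total and 70 per port. Re-running it with your own link distance and port count is a quick way to sanity check a SAN Director configuration before the links carry production load.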
