99-Node IP/MPLS WAN Topology Design Tariff4 (Medium-Low Tariff Version)

1 Mar 2010 John Koiste

In This Example We Have The Information Needed To Minimize or Reduce Interconnection Cost In a 99-Node WAN • The router (node) Lat-Long locations • Estimated offered traffic levels between all WAN node pairs • Modeled long-haul tariffs for the U.S. – A medium-low tariff, Tariff4, as explained on next slide – There is a separate write-up for a high tariff, Tariff1 – The next slide provides more WAN-67 tariff information

• The routing (or path selection) method used by the routers

Notes On Tariffs Used in This Example (1 of 2) • WAN-67 comes with six sets of modeled long-haul leased line tariffs for the U.S. – Each set has estimated monthly rates for T-1, T-3, OC-3, OC-12, OC-48, OC-192, OC-768 for any length – No set is an exact representation of actual tariffs – WAN-67 users can modify any or all six tariff sets (each composed of seven curves) to represent more accurately the monthly costs they are seeing

Notes On Tariffs Used in This Example (2 of 2) • Each successive tariff (after Tariff1) for any leased line type is 30% lower in cost than the previous one – Tariff1 is undoubtedly much higher than current tariffs – Tariff6 is almost surely lower than current tariffs

• The tariff curves are piece-wise linear, and monthly cost is composed of: – A base cost/month for any leased line length – Plus an additional cost/month derived from up to four linear tariff segments at some cost/mile slope as distance is increased from zero

• WAN-67 users can modify base cost and slopes of the linear segments (i.e., everything for all curves)
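The tariff structure described above can be sketched in a few lines of Python. This is a minimal illustration, not WAN-67's implementation: the base cost, segment breakpoints and slopes are hypothetical placeholder values, and the 30% step is read here as scaling a whole curve by 0.7 for each successive tariff.

```python
def monthly_cost(miles, base, segments):
    """Piece-wise linear tariff: base cost plus cost/mile over each segment.

    segments is a list of (upper_mile_limit, cost_per_mile) pairs in
    increasing order of limit; the last limit may be float("inf").
    """
    cost = base
    start = 0.0
    for limit, slope in segments:
        if miles <= start:
            break
        cost += (min(miles, limit) - start) * slope
        start = limit
    return cost


def tariff_scale(k, step=0.30):
    """Cost multiplier of Tariff k relative to Tariff1 (assumed uniform scaling)."""
    return (1.0 - step) ** (k - 1)


# Hypothetical curve for one leased line type (NOT actual WAN-67 values).
BASE = 1000.0
SEGMENTS = [(50, 10.0), (200, 6.0), (1000, 3.0), (float("inf"), 1.5)]

cost_100_miles = monthly_cost(100, BASE, SEGMENTS)
tariff4_vs_tariff1 = tariff_scale(4)
```

Under this reading, Tariff4 comes out at 0.7 cubed, or about 34% of the Tariff1 cost for the same line type and length.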

We Also Have the Following Information Or Capabilities to Reduce Interconnection Cost • A summary of IP/MPLS WAN design requirements (shown on next slide) • Some knowledge on what topologies are likely to meet WAN design requirements • Some skill, augmented by WAN-67 software, to design topologies in one or more steps • A means (also provided by WAN-67) to get a quick and accurate report back on relative success at each step – The execution times for various procedures in WAN-67 for this 99-node example were measured for a Sony (Vaio FZ Series) laptop (bought in Q2 2008). It has a 2.1 GHz Intel T8100 dual core CPU with a 3 MB L2 cache. For node counts over 128 an even larger L2 cache would help keep processing times reasonably low.

Summary of 99-Node IP/MPLS WAN Design Requirements • The WAN has to carry all the specified (design-level) offered traffic to meet quality of service (QoS) requirements – In other words, we have worked out the estimated offered traffic matrix beforehand (using the traffic generator that comes with WAN-67)

• The WAN must be able to carry all of the offered traffic in the face of all possible single inter-router link failures – However, if needed, average utilization of links is allowed to go to 100% instead of the 50% limit under normal conditions

• No source-destination path shall have more than 12 hops (i.e., 12 links between source and destination nodes) • Finally, the interconnection topology cost/month should be as low as possible while meeting the above requirements

Looping From Theory to Practice to Theory… • The approach in the following is to show some designs and to critique them using the broad-brush requirements just outlined – We’ll also discuss how we got to each design as we go along

• Looking across the WAN topology spectrum we find a ring topology at one end and a maximally connected mesh topology at the other – A three-node ring is the only ring which is also a maximally-connected mesh – An n-node ring has the same number of full-duplex (or 2-way) links as nodes, i.e., there are n links in an n-node ring – A maximally-connected mesh of n nodes has n*(n-1)/2 two-way links, so the 3-node ring also has 3*2/2 (or 3) two-way links

• We are going to start with node locations as shown next

Node Locations For a 99-Node Network

Did You Count the Nodes In the Google JPEG? • If so, you probably got fewer than 99 • The reason is that some nodes and/or their names cover other nodes and/or their names • If you have Google Earth Pro and have uploaded the node names and lat-long coordinates from WAN-67 you’ll be able to zoom in and achieve greater node separation (next slide) – You can zoom in well enough to see a car in most cases – Node locations in our examples are usually not in anyone’s building for several good reasons – The Seattle WA node is in salt water, for example

North-East Node Locations (99-Node Net)

The WAN-67 Views (Network Maps) • Two WAN-67 views of the 99 nodes (and some links) are shown next – The first is an approximate Mercator Map (MM) view and nodes that are too close to others are sometimes fully or partly covered – The second WAN-67 view is a user-modified map (UMM) view in which some nodes have been moved to expose covered nodes

• WAN-67 views are all simple – Reasons include several programmatic limits – Also, Google Earth (Pro version) works anywhere over the globe, has a software interface for uploading and can output JPEGs (not just screen shots with their clutter)

MM View of the 99-Node Network With New York NY (Node 62) Highlighted

UMM View of the 99-Node Network With New York NY (Node 62) Highlighted

What’s Special About the Topology Shown? • It happens to meet all the design criteria – It can carry all the specified (design-level) offered traffic – It can carry all the offered traffic in the face of all possible single inter-router link failures – Worst-case hop count is 10, or two better than the spec – The above points will be discussed further

• The interconnection topology cost/month is also fairly low, as will be explained – The design approach will be explained

• How long did the topology design take? – One hour of user time using our laptop – Plus 2½ hours of hands-off laptop processing time

The Design-Level Offered Traffic (Matrix) • For this design exercise we used the traffic generator that comes with WAN-67 to create an offered traffic profile (or matrix) called Prof2 • It’s the simpler of two gravity-like profiles that are used for our examples (the other being Prof4) • Briefly, node pairs in high-population areas have high traffic between them – Prof2 has a moderate distance effect – If the nodes are closer to each other the traffic between them is higher – Prof4 introduces an unbalancing parameter as well

Part of The Design-Level Offered Traffic Prof2 Source-to-Destination List (Matrix)

Traffic-Carrying Capacity Results • We will come back to this subject a bit later on • Next we’ll show worst-case hop count

Path With Most Hops (And Greatest Length)

• There are actually 56 one-way paths with 10 hops • The reverse direction of this one has the same length (3152 mi.) • The 54 others are shorter in length

What If Links Fail, One At a Time? (1 of 5) • For normal conditions we had set the (user-settable) max average utilization of links to 50% • When any link fails, new backup paths must be found for every path that had used the failed link – There can be many such new paths required per failure – Links on those new backup paths will then have to carry a heavier traffic load during the failure time – We had set the (also user-settable) limit for all affected backup-path links to 100% for the duration of the failure

• Much processing is required since there are 241 two-way links to be “failed” in our topology

What If Links Fail, One At a Time? (2 of 5) • We actually do two designs in most networks including our 99-node example • They have the same topology (i.e., interconnection scheme) but several or many link bandwidths may be different – We first try to do a good low-cost design (which took less than an hour for the topology we are talking about) – We then do a fully automated failure-modes and effects analysis (FMEA) for all single link failures and end up with the same topology with some beefed-up links – On our laptop the hands-off process took 2½ hours

What If Links Fail, One At a Time? (3 of 5) • We end up with two sets of leased line link costs/month for our design – The lower cost is for the design before FMEA – The higher one is for the one with FMEA and it is always higher because the beefed-up links are more expensive

• How much higher is the cost/month with the same topology, with FMEA and its beefed-up links? – In our experience it seems to add significantly less cost than any other method we have seen – Cost/month for both cases of our example is shown next

What If Links Fail, One At a Time? (4 of 5) •This is the cost before FMEA. In this case some or many links in backup paths may get overloaded when a single (other) link fails. Reason? The backup paths have to carry more traffic while the failure lasts. •If you pay this you get beefed-up links and no overloading (on average) during any single-link failure

What If Links Fail, One At a Time? (5 of 5) •Want greater protection against overloading? It just costs more. For example, if you enter 80 to allow only 80% utilization on backup path links during any single link failure then link beefing-up level increases and/or occurs more often •Click here after entering the 80 •Here is the new cost in 30 extra seconds on our laptop

The Impact of Backup Link Utilizations Approaching 100% Under Link Failures • The immediate impact is on network performance which manifests itself as added delay for applications • The magnitude of the added delay will depend on the degree of overloading • Applications using TCP (and also using overloaded links) will throttle back automatically and will simply run slower • The network will continue to function but its performance may become intolerably sluggish for many users • With high bandwidths the knee of the delay curve is near 100% but it is steep just past the knee. Therefore, voice over IP calls may suddenly become intermittent. • So, in some cases, it may be a good investment to lower the backup link utilization limit (under failure of any other link) to 80% or even lower

What’s Good About The WAN-67 FMEA Method For Reducing Performance Loss? • Most importantly, it applies link bandwidth increases only where necessary so it keeps cost increases to a minimum • With any sparse mesh, bandwidth increases applied early in the process often help other link failures at no additional cost • With any existing tariffs, to some point, link bandwidth increases are cheaper than adding more links • Unlike when adding links, the WAN-67 method does not cause rerouting complexities unless a link fails – This always works when “shortest path” means fewest hops with ties resolved by lowest great-circle path length – It may not work for “shortest path” meaning something else

How Does The FMEA Process Work? • Summary of the WAN-67 FMEA process – WAN-67 “fails” each link in succession, and, for each such failure, it reroutes the traffic and finds the loading increase on all affected back-up links (that have to carry the failed link’s traffic) – If it finds that the utilization of any back-up link exceeds a user-set specification, it increases the bandwidth of the affected link – If, during any other single-link modeled failure, it finds that the bandwidth of a previously-affected link is not high enough to accommodate the added load due to the new failure, it bumps up the bandwidth again to a new suitable level – It also stores the maximum link loading found during the process for every link (in each direction) in order to minimize subsequent calculation time
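The single-link FMEA loop summarized above can be sketched as follows. This is a toy illustration under stated assumptions, not WAN-67's code: the four-node network, the demands, the starting capacities and the list of available bandwidths are all hypothetical, and routing is taken to be "fewest hops, ties resolved by shortest length".

```python
import heapq


def route(adj, src, dst, dead=None):
    """Fewest-hop path from src to dst, ties resolved by total length.
    Returns a list of one-way links (a, b), or None if dst is unreachable."""
    best = {src: (0, 0.0)}
    heap = [(0, 0.0, src, [])]
    while heap:
        hops, dist, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if (hops, dist) > best[node]:
            continue  # stale heap entry
        for nxt, length in adj[node].items():
            if dead is not None and frozenset((node, nxt)) == dead:
                continue  # skip the "failed" two-way link
            cand = (hops + 1, dist + length)
            if nxt not in best or cand < best[nxt]:
                best[nxt] = cand
                heapq.heappush(heap, (cand[0], cand[1], nxt, path + [(node, nxt)]))
    return None


def link_loads(adj, demand, dead=None):
    """Sum the offered traffic (Mbps) carried by each one-way link."""
    loads = {}
    for (src, dst), mbps in demand.items():
        path = route(adj, src, dst, dead)
        if path is None:
            raise ValueError("failure fragments the network")
        for link in path:
            loads[link] = loads.get(link, 0.0) + mbps
    return loads


def fmea(adj, demand, capacity, sizes, limit=1.0):
    """Fail each two-way link in turn and bump up any overloaded link.
    Returns (upgraded capacities, worst load seen per one-way link)."""
    upgraded = dict(capacity)
    worst = {}
    links = {frozenset((a, b)) for a in adj for b in adj[a]}
    for dead in sorted(links, key=sorted):
        for (a, b), load in link_loads(adj, demand, dead).items():
            worst[(a, b)] = max(worst.get((a, b), 0.0), load)
            key = frozenset((a, b))
            if load > limit * upgraded[key]:
                # smallest available bandwidth that keeps utilization within limit
                upgraded[key] = min(s for s in sizes if limit * s >= load)
    return upgraded, worst


# Hypothetical 4-node ring: link lengths in miles, demands and capacities in Mbps.
ADJ = {"A": {"B": 100, "D": 120}, "B": {"A": 100, "C": 100},
       "C": {"B": 100, "D": 100}, "D": {"C": 100, "A": 120}}
DEMAND = {("A", "C"): 40, ("C", "A"): 40, ("A", "B"): 30, ("B", "A"): 30}
CAPACITY = {frozenset("AB"): 150.0, frozenset("BC"): 150.0,
            frozenset("CD"): 50.0, frozenset("DA"): 50.0}
SIZES = [50.0, 150.0, 600.0]

upgraded, worst = fmea(ADJ, DEMAND, CAPACITY, SIZES, limit=1.0)
```

In this toy case links C-D and D-A carry nothing when all links are intact, but must take over the rerouted traffic when A-B fails, so the sketch bumps both up to the next bandwidth that fits. Passing a lower limit (for example 0.8) tightens the allowed backup utilization in the same way the 80% setting does in the earlier slide.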

What If Several Links Fail At a Time? • WAN-67 can also model multiple link failures (for links as seen by routers) – There is just one limit to the number that WAN-67 can model: it won’t allow network fragmentation – For example, there could be multiple failures that divide the 99-node net into a 50-node portion (which still works) and a 49-node portion (which also works) – They can only be modeled as two network files – If one node gets fragmented, then the modeling is simply done for a 98-node network

• In a real network such multiple failures are usually caused in a lower layer that has to be defined – Therefore, the analysis is usually done with much less automation than for single link failures seen by IP nets – Analyses could be automated in specific cases

Traffic-Carrying Capacity Results (1 of 6) • We are now going to examine some traffic results • First we’ll show a link for the Non-FMEA case when all links are intact – We picked link 25;22, one that will likely need to be beefed up after FMEA since it will have to take over load from Chicago (25) to Los Angeles (50) if link 25;32 fails

• Then we’ll show the same link 25;22 for the With-FMEA case when all links are intact – For this case we will see a lower % utilization if link 25;22 was beefed up – Link 25;22 is shown on the map next

Traffic-Carrying Capacity Results (2 of 6) Link 25;22 is likely to get beefed up after FMEA

Traffic-Carrying Capacity Results (3 of 6) •The utilization of the 25;22 link for the Non-FMEA case is 44.8% •The link is actually composed of two OC-3s for a total bandwidth of 300 Mbps. Recall that we were given a design-level max utilization spec of 50% for average load when all is normal. With one OC-3 the average utilization would be 89.6%.

Traffic-Carrying Capacity Results (4 of 6) •The utilization of the 25;22 link for the With-FMEA case is 22.4% under normal conditions when all links are intact •This link is now an OC-12 (600 Mbps bandwidth) • Next question: What is the load on this beefed-up link if it has to carry the greatest load due to failure of some other link? See next slide.

Traffic-Carrying Capacity Results (5 of 6) • So, what is the load on link 25;22 if it has to carry its greatest load due to failure of any other link? – The greatest load for 25;22 is 374.796 Mbps. (Note: Greatest loads for all links under failure of any other link were stored by WAN-67 during the 2½-hour FMEA run) – The Non-FMEA bandwidth of the link was 300 Mbps (two parallel 2-way OC-3 leased lines, $104,610/month) – For the With-FMEA case the link was bumped up by WAN-67 to one OC-12 (bandwidth 600 Mbps each way, $113,144/month) since that is less expensive than three OC-3 leased lines (total bandwidth 450 Mbps each way, $156,915/month) – Different tariffs will clearly affect these numbers but not the calculation method
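The bump-up decision for link 25;22 can be replayed with the numbers quoted on these slides. The sketch below is only an illustration of the arithmetic: the function name and the option list are hypothetical, while the bandwidths, costs and loads are the ones stated above.

```python
# (label, bandwidth in Mbps each way, cost per month in $) as quoted on the slides
OPTIONS = [
    ("2 x OC-3", 300.0, 104_610),
    ("3 x OC-3", 450.0, 156_915),
    ("1 x OC-12", 600.0, 113_144),
]


def cheapest_option(options, load_mbps, limit_pct=100.0):
    """Cheapest option whose utilization at load_mbps stays within limit_pct."""
    fits = [o for o in options if load_mbps <= o[1] * limit_pct / 100.0]
    return min(fits, key=lambda o: o[2]) if fits else None


normal_load = 0.448 * 300.0   # 44.8% of the two OC-3s when all links are intact
greatest_load = 374.796       # worst case under any other single link failure

choice = cheapest_option(OPTIONS, greatest_load)           # 100% backup limit
normal_util_pct = 100.0 * normal_load / choice[1]          # utilization when intact
failure_util_pct = 100.0 * greatest_load / choice[1]       # utilization at worst case
```

Two OC-3s cannot carry 374.796 Mbps, and of the two options that can, the single OC-12 is the cheaper one. The same OC-12 still fits if the backup limit is tightened to 80%, since 80% of 600 Mbps is 480 Mbps.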

Traffic-Carrying Capacity Results (6 of 6) • The Links form-page option buttons show how to get at the most commonly used parameters quickly • Some others, such as the greatest loading number just discussed, are available on the Links sheet • Before we describe how the 99-node topology design was done (and critique it since it can be improved upon) we’ll illustrate a few views of paths through the topology in a question/answer format – In mesh topology design it’s useful to study network traffic flows using both two dimensional graphics and numbers provided by programs such as WAN-67

Guess How Many Paths Use Link 25;32 (1 of 3)

Guess How Many Paths Use Link 25;32 (2 of 3) •The answer is 606 in the direction shown plus 606 more in the reverse direction. This is higher than many people might guess. The next slide shows sources and destinations of the 606 paths that use link 25;32 in the direction shown. •If link 25;32 fails (in both directions) 1212 paths out of a total of 9702 have to be rerouted
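The path totals above follow from simple counting, shown here as a short check. The variable names are illustrative only.

```python
n = 99                                 # nodes in the network
one_way_paths = n * (n - 1)            # every ordered source-destination pair
paths_on_link = 606                    # per direction, as reported for link 25;32
rerouted = 2 * paths_on_link           # both directions fail together
share_rerouted = rerouted / one_way_paths
```

So a single failure of link 25;32 forces roughly one path in eight to be rerouted.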

Guess How Many Paths Use Link 25;32 (3 of 3)

There are 606 paths that use link 25;32. They cannot all be picked out of this drawing since there is some overlaying of path sources and destinations.

Show Paths From Node 75 That Use Link 25;32

How Much Load Does Path 75;87 Contribute To Link 25;32? •The 25;32 path itself contributes 958 kbps to link 25;32 load. Note that 25;32 is both a 1-way link and a path. •The total load on link 25;32 is 460.852 Mbps •And, finally, the load contribution of path 75;87 to the total above is 496 kbps

Show Paths From Node 75 That Use Path 25;46

As expected, there are fewer of these

How Much Load Does Path 75;87 Contribute To Path 25;46?

•The 25;46 path itself contributes 3.476 Mbps to 25;46 load. Note that 25;46 is a path. Its 3 links will carry more traffic load than traffic that traverses the entire 25;46 path. •The total load on path 25;46 is 180.209 Mbps. Its link 25;32 carries 460.852 Mbps, as shown before. •The load contribution of path 75;87 to the total above is 496 kbps as before

Planning The Topology Design (1 of 2) • We started by recalling the following: – High-bandwidth links are cheaper per Mbps than links composed of bundles of lower-bandwidth links at some point (the point being defined by the tariffs used) – Shorter links of the same bandwidth are cheaper – A 99-node ring has 49 hops for one path, 48 for another, and the spec is 12 so that eliminates a ring – A ring would also be quite expensive due to very high tandem traffic on all its links – Even a mesh topology that is very sparse would have a lot of tandem traffic on many links and that might defeat the per-Mbps advantage of high-bandwidth links – A dense mesh topology might be more resilient (depending on the underlying connection layer) but would lose the link cost battle
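The two ends of the topology spectrum can be put in numbers with a few one-line formulas. The function names below are illustrative; the formulas are the ring and full-mesh link counts given earlier, plus the worst-case hop count of a ring when each path takes the shorter way around.

```python
def ring_links(n):
    """Two-way links in an n-node ring."""
    return n


def full_mesh_links(n):
    """Two-way links in a maximally-connected mesh of n nodes."""
    return n * (n - 1) // 2


def ring_max_hops(n):
    """Worst-case hop count in an n-node ring, taking the shorter direction."""
    return n // 2


N = 99
HOP_LIMIT = 12   # from the design requirements
DESIGN_LINKS = 241  # two-way links in the topology shown earlier

ring_ok = ring_max_hops(N) <= HOP_LIMIT
```

For 99 nodes the ring needs 99 links but has a 49-hop worst case, while a full mesh would need 4851 links. The 241-link design shown earlier sits much closer to the ring end of that range.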

Planning The Topology Design (2 of 2) • We wanted to do a decent job within an hour: – We were also given up to four hours of additional (hands-off) processing time for FMEA to get the parameters and cost for a more resilient version

• Next, we looked at the node locations: – There were a few areas where the nodes were clustered so one strategy option was to interconnect nodes in clusters fairly heavily since that would (hopefully) not cost too much and would reduce hop counts as well – We still had to figure out how to do the other connections, and that is discussed next

First Design Approach (Design No.1) • We decided to: – Instruct WAN-67 to add connections between nodes within clusters – Instruct WAN-67 to add connections between clusters so that every node in the network had at least two ways to get to every other node

• We went through the above cycles a few times and then took over control – We looked at the results, especially with regard to hops and cost, and then added a link based on some skill and experience, looked at the results, added a link, …, and repeated the process five times and got the Design No.1 topology shown earlier and on the next slide – All of the above took less than an hour

• We then let the laptop run the FMEA process, which took 2½ hours

Design No.1 of the 99-Node Network

Cost/Mo. NoFMEA: $6,457,947
Cost/Mo. WithFMEA: $7,283,005
Max Hops: 10

Second Design Approach (Design No. 2) • The approach was the same but only the first two links were added to the programmatic design – The reasoning was that we knew the Non-FMEA cost would be lower (having looked at the cost in the Design No.1 Log) and that the max hop count would be 12 – This process took just 10 minutes

• We then let the laptop run the FMEA process, which took 2½ hours again – It seemed likely that the With-FMEA cost would be higher since there would be longer back-up paths and more links would have to be beefed up – The With-FMEA cost did turn out to be higher but not by much (compare next slide with the earlier Design No.1 slide)

Design No.2 of the 99-Node Network

Cost/Mo. NoFMEA: $6,255,710
Cost/Mo. WithFMEA: $7,320,195
Max Hops: 12

Third Design Approach (Design No. 3) • Here the approach was to nibble away at the cost by (1) removing several links from Design No. 2 and (2) seeing if adding one link in a good place could further reduce cost – What’s a good place to add a link? Sometimes between two nodes that have a lot of traffic between them, even if they are not close to each other–in this case NY (62) and Chicago (25) seemed good – This process took 25 minutes and the No-FMEA cost was the lowest so far

• We then let the laptop run the FMEA process, which took 2½ hours for a third time – The With-FMEA cost also came out the lowest so far – See the next slide for the topology and costs

Design No. 3 of the 99-Node Network

Cost/Mo. NoFMEA: $6,063,511
Cost/Mo. WithFMEA: $7,069,350
Max Hops: 12

Design Log for Design No. 2 & 3 • WAN-67 records a handy design log of each step – The first two steps were for Design No. 2 and the last six were for Design No. 3 – For each step WAN-67 does all the changes, routing, traffic calculations, hop counts, costs, drawings and other information you have seen (except FMEA calculations) – Each step took about 25 seconds on the laptop

Which is the Best Design? • If cost is the main criterion Design No.3 is best – It meets all the design requirements – But it has a less robust east-west mid-section

• If performance is more important Design No.1 is best – It has more ways to go from east to west and back – It has only 10 hops maximum which is nice – But it costs more

• With some more work it’s possible to come up with a better topology than any of the three so far – Only the laptop had a busy 7½ hour day. One of its CPU cores was always running at 100% and the other occasionally went to 60%. Its network switch was off. – The user spent just a couple of hours doing topology design work
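The comparison of the three designs can be laid side by side with a short calculation. The sketch below only reuses the monthly costs and hop counts quoted on the design slides; the dictionary layout and helper name are illustrative.

```python
# name: (cost/month without FMEA, cost/month with FMEA, max hops), from the slides
DESIGNS = {
    "Design No.1": (6_457_947, 7_283_005, 10),
    "Design No.2": (6_255_710, 7_320_195, 12),
    "Design No.3": (6_063_511, 7_069_350, 12),
}


def fmea_overhead_pct(no_fmea, with_fmea):
    """Extra monthly cost of the beefed-up links, as a percentage."""
    return 100.0 * (with_fmea - no_fmea) / no_fmea


overhead = {name: fmea_overhead_pct(c[0], c[1]) for name, c in DESIGNS.items()}
cheapest_with_fmea = min(DESIGNS, key=lambda name: DESIGNS[name][1])
fewest_hops = min(DESIGNS, key=lambda name: DESIGNS[name][2])

for name, (no_fmea, with_fmea, hops) in DESIGNS.items():
    print(f"{name}: +{overhead[name]:.1f}% for FMEA, max hops {hops}")
```

Design No.1 pays the smallest percentage premium for single-link protection, which fits the observation that it has more ways to go from east to west, while Design No.3 remains the cheapest in absolute terms.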

Was the Design Approach Good? • Summary of the approach for designing the 99-node network topology: – Interconnect nodes within clusters fairly heavily – Interconnect clusters such that every node in the network has at least two paths to every other node – Fine-tune the topology to get within maximum hop limit and to achieve the lowest total link cost quickly – Create a more resilient version using FMEA (failure-modes and effects analysis)

• The approach seems good for networks that have several clusters of nodes – It also scales well for networks with more nodes and clusters – For anyone interested the rest of this slide show discusses different approaches and the impact of various parameters that topology designers run into

Related Subjects for the 99-Node Net • Gamut of topology design approaches – More heavily connected meshes – Very sparse meshes – Impact of longer links

• Impact of various parameters affecting the results – Tariffs – Unavailability of some link bandwidths – Different offered traffic profiles – Other link utilization levels than the 50% used – Three different routing (path finding) methods

• Network reliability notes

Heavily Connected Meshes • Wide area network IP/MPLS topology design is, mathematically speaking, a conversion from an offered traffic profile (bandwidth demand matrix) to a set of communication links between routers – In a nutshell, it’s a conversion from a set of many small packet network traffic flows to a much smaller set of big pipes to carry the flows

• As the years have gone by – Inter-router links composed of bundles of many smaller pipes have been replaced by smaller bundles of bigger pipes or, even better, just one big pipe – The cost of long-haul installed (or leased) bandwidth has decreased per-Mbps (for bigger pipes of a given great-circle length) and this works against heavily connected meshes (which use more pipes and ones that tend to be smaller in bandwidth and longer)

HeavyMesh-1

Cost/Mo. NoFMEA: $7,444,335
Cost/Mo. WithFMEA: $8,131,570
Max Hops: 10

HeavyMesh-2

Cost/Mo. NoFMEA: $8,022,533
Cost/Mo. WithFMEA: N/A
Max Hops: 9

Comments on HeavyMesh-1 & 2 • Both of the more heavily-connected meshes are more expensive in terms of leased-line cost than any of the three earlier sparser meshes – HeavyMesh-2 has more links and costs more – It does have a lower maximum hop count (at 9) than any of the other designs

• Why does HeavyMesh-2 cost more than, for example, the sparser mesh Design No.3? – It has more links – Its links tend to have higher per-Mbps cost since they don’t need as much bandwidth – Its links tend to be longer on average

• The two heavily-connected meshes are both past the sweet spot for low link cost/month, and HeavyMesh-2 is the one farther away

Sparser Meshes • If you try to design sparser meshes than the first three we showed you may succeed but it will be increasingly difficult to meet two criteria 1. The maximum allowable hop count limit (set to 12 in our design spec) 2. Keeping tandem traffic low enough so that it does not overcome the advantage of using higher-bandwidth links

• Relaxing criterion 1 above (hop count) will help item 2 (tandem traffic) somewhat but hop count limit is often not negotiable – We’ll show samples of the sparse topologies (Ring-1 and Ring-2) just to show their costs and hop counts

Ring-1

Cost/Mo. NoFMEA: $13,310,123
Cost/Mo. WithFMEA: N/A
Max Hops: 49

Ring-2

Cost/Mo. NoFMEA: $12,097,318
Cost/Mo. WithFMEA: N/A
Max Hops: 49

Comments on Ring-1 & 2 • The two rings are way past the sweet spot (in the sparse direction this time) for low link cost/month – And worst case hop count is off the scale at 49

• Rings make sense for metro networks but not for IP/MPLS WANs with many nodes

Impact of Longer Links • The impact for the 99-node network can be seen in 25 seconds on the laptop used for the first three examples (Design No.1 through 3) • Adding a longish link between NY and Chicago reduced cost in Design No.3 because there was a lot of traffic between those two nodes – But, in general, coast-to-coast (east-west) links do not seem to pay off – With WAN-67 the best thing to do is to just try it since it does the calculations quickly

Impact of Tariffs • Long-haul leased-line tariffs clearly have a direct impact on the costs • Briefly, if one were to change the leased line monthly cost by the same ratio for each of the following parts of a typical U.S. tariff, then the monthly cost of the network’s inter-router links would change by about the same ratio – Base rate for zero length – Slopes for each linear segment of cost per mile (usually four slopes) for a full duplex line (2-way line)

• The total link cost does not necessarily change by exactly the same ratio mentioned above – The reason is the build-up of round-off differences

Impact Of High-Bandwidth Tariffs On Design No.1 Topology Link Cost •This mesh would use many T-1 bundles for links if T-3 were not available •If T-3 is available there is a very large cost-per-month saving •Higher bandwidths continue to reduce cost until OC-48 •This sparse mesh could take advantage of higher-speed links if offered traffic levels were higher

Impact of a Missing Line Type • There are two scenarios: 1. An unavailable link type anywhere 2. An unavailable link type for a certain node location

• The solutions are to: 1. Just make a tariff unattractive by boosting its base rate to a high value; this works for any link speed. 2. Do an override for link type and cost after using WAN-67

Different Offered Traffic Profiles • The traffic generator functionality needs an application note or two on its own, but here are some features – There are nine profiles available to the user – The first four are outputs of the traffic generator – The user has significant control over various parameters that affect the outputs for the first four – The last five traffic profiles can be created from any of the first four or they can be imported from a network or from a user-created set of values

• Prof2 and Prof4 are deterministic – We use Prof2 for most examples (because it’s simpler to explain) but Prof4 is more realistic in many cases – Prof1 and Prof3 add randomness to Prof2 and resemble snapshots in time about the average values of Prof2, much like snapshots in a network’s traffic

Link Utilization Levels • We used 50% as the maximum design-level utilization for the No-FMEA design and allowed 100% as the maximum utilization for any link affected by the failure of any other link (that being called the With-FMEA limit) – The user can set the design level No-FMEA limit to any value from 5% to 100% in 5% steps – The With-FMEA limit can be as low as the No-FMEA limit but (because we are using a model) it can also go to 2500%. Why? – The reason for this huge 2500% value is to see by how much certain links are overloaded when modeling some failure scenario under manual control. This is just a number on paper (not possible in the real network) and it’s a valuable capability in modeling

Three Different Routing (Path Finding) Methods • WAN-67 provides three path finding methods 1. Minimum distance 2. Fewest hops (with a tie breaker that depends on the text string of the path) 3. Fewest hops (with ties resolved by shortest distance of the path along its links)

• Here is where they are used for the most part 1. This is used for initial, totally programmatic design but can also be used by the user at any time 2. This is how many routers work and the two directions of a path between two nodes may be different 3. This is used for most designs and seems to be the best one in most cases

• The user can rerun any design with the three different path-finding methods at any time to see what it does to the overall cost
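The difference between method 1 (minimum distance) and method 3 (fewest hops, ties resolved by distance) can be illustrated with one shortest-path routine that only changes its sort key. This is a sketch under stated assumptions, not WAN-67's algorithm: the three-node network and its link lengths are hypothetical, and method 2 is left out because its text-string tie breaker is not specified here.

```python
import heapq


def find_path(adj, src, dst, method):
    """Return (path, hops, distance) or None.

    method "min_distance": order candidate paths by (distance, hops).
    method "fewest_hops":  order candidate paths by (hops, distance).
    """
    def key(hops, dist):
        return (dist, hops) if method == "min_distance" else (hops, dist)

    heap = [(key(0, 0.0), 0, 0.0, [src])]
    done = set()
    while heap:
        _, hops, dist, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return path, hops, dist
        if node in done:
            continue  # already settled with a better key
        done.add(node)
        for nxt, length in adj[node].items():
            if nxt not in done:
                h, d = hops + 1, dist + length
                heapq.heappush(heap, (key(h, d), h, d, path + [nxt]))
    return None


# Hypothetical network: the direct A-C link is longer than going via B.
ADJ = {
    "A": {"B": 100.0, "C": 250.0},
    "B": {"A": 100.0, "C": 100.0},
    "C": {"A": 250.0, "B": 100.0},
}

by_distance = find_path(ADJ, "A", "C", "min_distance")
by_hops = find_path(ADJ, "A", "C", "fewest_hops")
```

Here minimum distance routes A to C through B (two hops, 200 miles) while fewest hops takes the direct 250-mile link, which is the kind of change that shifts link loads, and therefore cost, when a design is rerun with a different method.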

Network Reliability Notes (1 of 4) • There are three main reliability subject areas in networks and other systems 1. Reliability, which (in a quantifiable sense) is the probability of some system or subsystem working properly over some specified period of time. Usually the longer the time, the lower the reliability. Reliability, usually expressed as a percentage, can apply to repairable or unrepairable systems. The number is often misused because the words “reliable” and “reliability” are much more commonly used in a general sense as in the title of this slide. 2. Availability, which is the long-term percentage of uptime of a repairable system or subsystem 3. FMEA, which refers to the impact of failures, and which we have focused on in this 99-node example

Network Reliability Notes (2 of 4) • Examples of the first two terms 1. Reliability 1. Reliability of my USB memory sticks was 100% over the last three years until recently. A new no-name device brought the reliability number way down because it failed after 3 months. 2. The reliability of my wireless Internet service provider for 1 GB downloads is about 90% because there is a 10% probability of an interruption requiring a restart during the download. 3. In networks reliability is usually used on a per-communication-path basis. Critical interruptions of service, such as terminating a phone call or download, are a key parameter here.

2. Availability 1. The term availability does not apply to USB memory sticks because they are not repairable. 2. The availability of my wireless Internet service provider’s service is over 99% because outages are usually brief. (The wireless portion is even more “available”.) 3. In networks the term availability is often measured as an average uptime of all communication paths.

Network Reliability Notes (3 of 4) FMEA compared to Reliability and Availability 1. Reliability and availability tell something useful about the likelihood of failures but nothing about their severity 2. It’s important in most systems to look at the impact of failures using FMEA fairly early in the design 3. We try to get high reliability and availability for failure modes that we can control and that are severe 4. In IP WAN topology design we focus on link failures. We know that core routers are reliable and can reroute around failed links if there are one or more alternate paths possible in the topology. 5. We design topologies that provide good protection for all paths, especially ones that are deemed critical 6. In real designs we have to look at the geography and the underlying link layer (or layers) below the IP link level 7. And we have to minimize the monthly (or monthly-equivalent) cost of the links at the IP link level

Network Reliability Notes (4 of 4)
Did we apply the reliability, availability and FMEA principles to our Design No. 3 topology?
1. We provided adequate redundancy in the topology for alternate paths that the routers can find
2. We essentially proved (on paper, at least) that
   – Alternate paths found by the routers, if and when needed, will improve availability and reduce the impact-severity of single link failures
   – We could achieve the above at a reasonably low cost
3. We’ll look at the impact of multiple link failures and their severity next
4. We realize that continuity of service (or non-stop service for traffic types that need it, i.e., reliability in the mathematical sense) depends mainly on what the routers and applications do

NY-LA Paths Around Failures of Links
Link 25;32 carries the heaviest load
1. First we’ll show that load under normal conditions
2. Next we’ll show the NY-LA (62;50) path under normal conditions
3. We’ll then “fail” link 25;32 and show the alternate path
4. We’ll then show the link with the highest % utilization in the alternate path (and that % utilization value) under the failure condition for the Non-FMEA case
5. We’ll then repeat the test for the With-FMEA case to show that the % utilization of the link in the previous step is below 100%
6. We’ll then fail another link for the With-FMEA case and show what happens to some link utilizations and the impact on the network

Design No.3 With All Links OK, Showing Link 25;32

Design No.3, Normal Conditions, Non-FMEA, Showing Load & Utilization for Link 25;32

Design No.3 Showing Normal NY-LA Path (Path 62;50)

Design No.3 With Link 25;32 Failed, Showing Alternate NY-LA Path

Design No.3 With Link 25;32 Failed, Non-FMEA Case, Showing Highest (Modeled) % Utilization Link (25;16) in Alternate Path

• In the real network, utilization cannot exceed 100%. Applications that use TCP over this link will throttle back to reduce utilization, and new phone calls will not be set up.

Design No.3 With Link 25;32 Failed, With-FMEA Case, Showing Highest (Modeled) % Utilization Link (25;16) in Alternate Path

• In this With-FMEA case, the bandwidth of the link was increased during the design to prevent utilization from going over 100% during the failure of any other single link (such as link 25;32 in this example)
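
The difference between the Non-FMEA and With-FMEA cases can be sketched as a link-sizing rule. This is a minimal illustration under stated assumptions, not WAN-67's actual procedure: the function names, the 100% utilization ceiling applied as a hard limit, and the example loads are all hypothetical. The line rates are the nominal rates of the leased-line types named in this document.

```python
# Nominal line rates in Mbps for the leased-line types named in this document.
LINE_RATES_MBPS = [
    ("T-1", 1.544),
    ("T-3", 44.736),
    ("OC-3", 155.52),
    ("OC-12", 622.08),
    ("OC-48", 2488.32),
    ("OC-192", 9953.28),
    ("OC-768", 39813.12),
]


def smallest_line_for(load_mbps, max_utilization=1.0):
    """Return (name, rate) of the smallest line type that carries load_mbps."""
    for name, rate in LINE_RATES_MBPS:
        if load_mbps <= rate * max_utilization:
            return name, rate
    raise ValueError("load exceeds the largest line type")


def utilization_pct(load_mbps, rate_mbps):
    """Modeled utilization in percent; may exceed 100 in the model."""
    return 100.0 * load_mbps / rate_mbps


def size_link(normal_load_mbps, failure_loads_mbps, with_fmea):
    """Size one link for normal load only, or for the worst single-failure load."""
    design_load = normal_load_mbps
    if with_fmea:
        design_load = max([normal_load_mbps, *failure_loads_mbps])
    return smallest_line_for(design_load)


if __name__ == "__main__":
    normal = 500.0             # hypothetical load with all links OK (Mbps)
    failures = [620.0, 900.0]  # hypothetical loads under two different single-link failures
    for with_fmea in (False, True):
        name, rate = size_link(normal, failures, with_fmea)
        worst = utilization_pct(max(failures), rate)
        label = "With-FMEA" if with_fmea else "Non-FMEA"
        print(f"{label}: {name}, worst modeled utilization {worst:.1f}%")
```

In this sketch the Non-FMEA link is sized for the normal load only, so its modeled utilization rises above 100% when another link fails, while the With-FMEA link is sized for the worst single-failure load and stays below 100%.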

Let’s Fail Another Link
• What happens if we fail a second link along the NY-LA path?
   – We will only look at the With-FMEA case for this one since we already know that we get overloading for the Non-FMEA case

• The With-FMEA case gives predictable and quantifiable protection for single link failures, but how will it fare if a second link fails?
   – We’ll fail the NY-Chicago link (62;25) in addition to link 25;32 as shown next, along with the new alternate NY-LA path

Design No.3 With Links 25;32 and 62;25 Failed, Showing New Alternate NY-LA Path

Design No.3 With Links 25;32 and 62;25 Failed, With-FMEA Case, Showing Highest (Modeled) % Utilization Link (71;16) in New Alternate Path

• Now a new link in this path is overloaded (by 69.4%, if that were possible)
• Meanwhile, the 25;16 link has less load than it had because the NY to LA traffic no longer goes through Chicago. How did that happen? Read on.

Design No.3 With Links 25;32 and 62;25 Failed, Showing New NY-Chicago Path

• With link 62;25 failed, the NY-Chicago (62;25) path has four hops
• Getting from NY (62) to Pittsburgh (16) via Chicago (25) would therefore take five hops
• The shortest path (in hops) from NY (62) to Pittsburgh (16) is just three
• That’s why the load on link 25;16 is lower and link 71;16 is now overloaded after link 62;25 failed in addition to link 25;32
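
The hop-count reasoning above can be reproduced on a toy graph. The sketch below assumes the routers pick a minimum-hop path and break ties deterministically; the six-link topology with nodes A to F is invented for illustration and is not the 99-node Design No. 3, and the function names are hypothetical.

```python
from collections import deque


def link_key(a, b):
    """Order-independent identifier for an undirected link."""
    return (a, b) if a <= b else (b, a)


def shortest_hop_path(links, src, dst, failed=()):
    """Breadth-first search for a minimum-hop path that avoids failed links."""
    down = {link_key(a, b) for a, b in failed}
    adj = {}
    for a, b in links:
        if link_key(a, b) in down:
            continue
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in sorted(adj.get(node, [])):  # sorted for deterministic tie-breaking
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    if dst not in prev:
        return None  # destination unreachable with these links failed
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


# Hypothetical toy topology: A is the source, F the destination.
LINKS = [("A", "B"), ("B", "F"), ("B", "C"), ("C", "F"), ("A", "D"), ("D", "C")]

if __name__ == "__main__":
    cases = [
        ("all links OK", []),
        ("B-F failed", [("B", "F")]),
        ("B-F and A-B failed", [("B", "F"), ("A", "B")]),
    ]
    for label, failed in cases:
        path = shortest_hop_path(LINKS, "A", "F", failed)
        print(f"{label}: {path} ({len(path) - 1} hops)")
```

After the second failure the A-to-F traffic no longer passes through B at all, so link B-C sheds that load and link D-C picks it up, which mirrors what happens to links 25;16 and 71;16 above.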

So What Does All This Reliability Stuff Show?
• It shows that topology designers can make more informed decisions if they
   – Have access to computer aids such as fully automated failure-modes and effects analysis (FMEA) for single link failures
   – Can model multiple link failures even though such failures don’t lend themselves to total automation without a map of lower link layers
   – Can do the multi-link failure modeling interactively in important parts of the network relatively easily and quickly

Concluding Remarks
• The purpose of this document was to show an example of how WAN-67 can provide assistance in IP/MPLS network design
   – The primary purpose of WAN-67 is to help reduce interconnection cost between routers in IP/MPLS core networks
   – This example focused on the U.S. market, but the program is designed to work across the globe
   – The designer of WAN-67 has had extensive experience in the design of national networks in the U.S., Canada, U.K., France, Germany, Holland, Belgium, Italy, Austria, Switzerland, Japan, China, Australia, India, Mexico, Brazil, and several others, and in global networks such as one that has nodes in 60 countries
   – Other experience involves studying leased-line tariffs in the countries listed and between many more countries

End of 99-Node IP/MPLS WAN Topology Design Tariff4 (Medium-Low Tariff Version)
