ASR 1000 Series Router Memory Troubleshoot Guide

Document ID: 116777
Contributed by Vishnu Asok and Girish Devgan, Cisco TAC Engineers.
Nov 19, 2013

Contents

Introduction
Prerequisites
    Requirements
    Components Used
ASR Memory Layout Overview
Memory Allocation under the lsmpi_io pool
Memory Usage
    Verify Memory Usage on IOS-XE
    Verify Memory Usage on IOSd
    Verify TCAM Utilization on an ASR1K
    Verify Memory Utilization on QFP

Introduction

This document describes how to check system memory and troubleshoot memory issues on Cisco ASR 1000 Series Aggregation Services Routers (ASR1K).

Prerequisites

Requirements

Cisco recommends that you have basic knowledge of these topics:

• Cisco IOS-XE software
• ASR CLI

Note: You might need a special license in order to log in to the Linux shell on the ASR 1001 Series router.

Components Used

The information in this document is based on these software and hardware versions:

• All ASR1K platforms
• All Cisco IOS-XE software releases that support the ASR1K platform

The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, make sure that you understand the potential impact of any command.

ASR Memory Layout Overview

On most earlier Cisco router platforms, the majority of the internal software processes run within Cisco IOS® (IOS) memory. The ASR1K platform introduces a distributed software architecture that moves many Operating System (OS) responsibilities out of the IOS process. In this architecture, IOS, which was previously responsible for almost all of the internal software processes, now runs as one of many Linux processes. This allows other Linux processes to share responsibility for the operation of the router.

The ASR1K runs IOS-XE, not traditional IOS. In IOS-XE, a Linux component runs the kernel, and IOS runs as a daemon, hereafter referred to as IOSd (IOS-Daemon). As a result, memory must be split between the Linux kernel and the IOSd instance. The split between IOSd and the rest of the system is fixed at startup and cannot be modified. On a 4-GB system, IOSd is allocated approximately 2 GB; on an 8-GB system, IOSd is allocated approximately 3.8 GB (with software redundancy disabled).

Since the ASR1K has a 64-bit architecture, every pointer in every data structure in the system consumes twice as much memory as it does on a traditional single-CPU router (8 bytes instead of 4 bytes). In return, 64-bit addressing removes the 2-GB addressable memory limitation of traditional IOS, which allows IOSd to scale to millions of routes.

Note: Ensure that you have sufficient memory available before you activate any new features. In order to prevent memory exhaustion, Cisco recommends at least 8 GB of DRAM if the router receives the entire Border Gateway Protocol (BGP) routing table while software redundancy is enabled.
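The effect of pointer width on memory consumption can be made concrete with a little arithmetic. The sketch below is illustrative only: the structure layout (four pointer fields plus 16 bytes of other data) and the entry count are hypothetical values chosen for the example, not figures taken from IOS-XE, and alignment padding is ignored.

```python
# Illustrative arithmetic only. The field counts and the entry count are
# hypothetical; they do not describe any real IOS-XE data structure.

def structure_bytes(pointer_fields, other_bytes, pointer_size):
    """Size of one structure in bytes, ignoring alignment padding."""
    return pointer_fields * pointer_size + other_bytes


ENTRIES = 1_000_000  # hypothetical number of entries, for example routes

size_32 = structure_bytes(pointer_fields=4, other_bytes=16, pointer_size=4)
size_64 = structure_bytes(pointer_fields=4, other_bytes=16, pointer_size=8)
extra_bytes = (size_64 - size_32) * ENTRIES

print(f"32-bit pointers: {size_32} bytes per entry")
print(f"64-bit pointers: {size_64} bytes per entry")
print(f"Extra memory for {ENTRIES} entries: {extra_bytes / 2**20:.1f} MiB")
```

Only the pointer fields double in size, so the growth of a whole structure depends on how pointer-heavy it is.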

Memory Allocation under the lsmpi_io pool

The Linux Shared Memory Punt Interface (LSMPI) memory pool is used in order to transfer packets from the forwarding processor to the route processor. This memory pool is carved at router initialization into preallocated buffers, as opposed to the processor pool, where IOS-XE allocates memory blocks dynamically. On the ASR1K platform, the lsmpi_io pool has little free memory (generally less than 1000 bytes), which is normal. Cisco recommends that you disable monitoring of the LSMPI pool by network management applications in order to avoid false alarms.

ASR1000# show memory statistics
                Head    Total(b)     Used(b)     Free(b)   Lowest(b)  Largest(b)
Processor   2C073008  1820510884   173985240  1646525644  1614827804  1646234064
 lsmpi_io   996481D0     6295088     6294120         968         968         968
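A monitoring script can apply the same reasoning by skipping the lsmpi_io pool when it evaluates free memory. The following is a minimal sketch that parses the show memory statistics output shown above. The function names and the 10 percent threshold are assumptions made for illustration, not Cisco recommendations.

```python
import re

# Sample rows taken from the show memory statistics output in this document.
SAMPLE = """\
                Head    Total(b)     Used(b)     Free(b)   Lowest(b)  Largest(b)
Processor   2C073008  1820510884   173985240  1646525644  1614827804  1646234064
 lsmpi_io   996481D0     6295088     6294120         968         968         968
"""

# Pool name, hexadecimal head address, then five decimal byte counts.
ROW = re.compile(
    r"^\s*(\S+)\s+([0-9A-Fa-f]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)
FIELDS = ("total", "used", "free", "lowest", "largest")


def parse_memory_statistics(text):
    """Return {pool_name: {field: bytes}} for each pool row in the output."""
    pools = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            values = (int(v) for v in match.groups()[2:])
            pools[match.group(1)] = dict(zip(FIELDS, values))
    return pools


def pools_to_alarm(pools, min_free_pct=10.0, ignore=("lsmpi_io",)):
    """Names of pools whose free memory is below min_free_pct of the total.

    The 10 percent default is an arbitrary example value. lsmpi_io is
    ignored by default because its near-zero free memory is normal.
    """
    low = []
    for name, pool in pools.items():
        if name in ignore:
            continue
        if 100.0 * pool["free"] / pool["total"] < min_free_pct:
            low.append(name)
    return low


if __name__ == "__main__":
    pools = parse_memory_statistics(SAMPLE)
    print("Alarms with lsmpi_io ignored:", pools_to_alarm(pools))
    print("Alarms with nothing ignored:", pools_to_alarm(pools, ignore=()))
```

With the sample data, the Processor pool is well above the threshold, and lsmpi_io raises an alarm only when it is not ignored, which is the false alarm the recommendation is meant to avoid.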

If there are any issues in the LSMPI path, the Device xmit fail counter increments in this command output (some output omitted):

ASR1000-1# show platform software infrastructure lsmpi driver
LSMPI Driver stat ver: 3
Packets:
        In: 674572
       Out: 259861
Rings:
        RX: 2047 free    0    in-use    2048 total
        TX: 2047 free    0    in-use    2048 total
    RXDONE: 2047 free    0    in-use    2048 total
    TXDONE: 2047 free    0    in-use    2048 total
Buffers:
        RX: 7721 free    473  in-use    8194 total
Reason for RX drops (sticky):
    Ring full        : 0
    Ring put failed  : 0
    No free buffer   : 0
    Receive failed   : 0
    Packet too large : 0
    Other inst buf   : 0
    Consecutive SOPs : 0
    No SOP or EOP    : 0
    EOP but no SOP   : 0
    Particle overrun : 0
    Bad particle ins : 0
    Bad buf cond     : 0
    DS rd req failed : 0
    HT rd req failed : 0
Reason for TX drops (sticky):
    Bad packet len   : 0
    Bad buf len      : 0
    Bad ifindex      : 0
    No device        : 0
    No skbuff        : 0
    Device xmit fail : 0
    Device xmit rtry : 0
    Tx Done ringfull : 0
    Bad u->k xlation : 0
    No extra skbuff  : 0
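Because the drop counters are sticky (cumulative), a single snapshot does not show whether a counter is still incrementing. One way to check is to capture the output twice and compare the counters. The sketch below assumes each counter appears on its own line under a "Reason for RX drops" or "Reason for TX drops" heading; the function names are hypothetical, and the second snapshot is simulated for the example.

```python
import re

# Abbreviated sample in the layout of the drop-reason sections of the
# show platform software infrastructure lsmpi driver output.
SAMPLE = """\
Reason for RX drops (sticky):
    Ring full        : 0
    No free buffer   : 0
Reason for TX drops (sticky):
    No skbuff        : 0
    Device xmit fail : 0
    Device xmit rtry : 0
    Bad u->k xlation : 0
"""

SECTION = re.compile(r"^Reason for (RX|TX) drops")
COUNTER = re.compile(r"^(.+?)\s*:\s*(\d+)$")


def parse_drop_counters(text):
    """Return {"RX": {name: count}, "TX": {name: count}}."""
    drops = {"RX": {}, "TX": {}}
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        header = SECTION.match(stripped)
        if header:
            section = header.group(1)
            continue
        counter = COUNTER.match(stripped)
        # Lines before the first drop-reason heading are ignored.
        if counter and section:
            drops[section][counter.group(1)] = int(counter.group(2))
    return drops


def incremented(before, after):
    """List (direction, counter, delta) for counters that grew."""
    changes = []
    for section in ("RX", "TX"):
        for name, value in after[section].items():
            delta = value - before[section].get(name, 0)
            if delta > 0:
                changes.append((section, name, delta))
    return changes


if __name__ == "__main__":
    first = parse_drop_counters(SAMPLE)
    # Simulated second capture in which Device xmit fail has grown.
    second = parse_drop_counters(
        SAMPLE.replace("Device xmit fail : 0", "Device xmit fail : 7")
    )
    print(incremented(first, second))
```

A nonzero delta for Device xmit fail between two captures is the condition described above.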



Memory Usage

The control CPUs in the ASR1K chassis, such as those on the Route Processor (RP), the Embedded Services Processor (ESP), and the Shared Port Adapter (SPA) Interface Processor (SIP), run IOS-XE software. This OS software consists of a Linux-based kernel and a common set of OS-level utility programs, which includes Cisco IOS software that runs as a user process on the RP card. Within IOS-XE, each process runs in its own protected memory under the Linux kernel of the card that hosts it.

Verify Memory Usage on IOS-XE

Enter the show platform software status control-processor brief command in order to monitor the memory usage on the RP, the ESP, and the SIP. When you compare memory usage between two captures, ensure that the system state, such as the feature configuration and the traffic load, is the same for both.

ASR1K# show platform software status control-processor brief
Memory (kB)
Slot  Status    Total     Used (Pct)     Free (Pct)  Committed (Pct)
 RP0  Healthy  3907744  1835628 (47%)  2072116 (53%)  2614788  (67%)
ESP0  Healthy  2042668   789764 (39%)  1252904 (61%)  3108376 (152%)
SIP0  Healthy   482544   341004 (71%)   141540 (29%)   367956  (76%)
SIP1  Healthy   482544   315484 (65%)   167060 (35%)   312216  (65%)

Note: Committed memory is an estimate of how much RAM is needed in order to guarantee that the system never runs Out of Memory (OOM) for this workload. Normally, the kernel overcommits memory. For example, when a process performs a 1-GB malloc, no physical memory is consumed immediately; pages are backed by real memory only on demand, when the process begins to use the allocation, and only for as much as it uses. This is why the Committed percentage can exceed 100%, as it does for ESP0 in the previous output.
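The memory table from this command is regular enough to check with a script. The sketch below parses the sample output shown above, lists any slot whose Status column is not Healthy, and lists slots where committed memory exceeds the total. The function names are hypothetical, and the row pattern assumes the column layout shown in this document.

```python
import re

# Memory section of the show platform software status control-processor
# brief output, as shown in this document.
SAMPLE = """\
Memory (kB)
Slot  Status    Total     Used (Pct)     Free (Pct)  Committed (Pct)
 RP0  Healthy  3907744  1835628 (47%)  2072116 (53%)  2614788  (67%)
ESP0  Healthy  2042668   789764 (39%)  1252904 (61%)  3108376 (152%)
SIP0  Healthy   482544   341004 (71%)   141540 (29%)   367956  (76%)
SIP1  Healthy   482544   315484 (65%)   167060 (35%)   312216  (65%)
"""

ROW = re.compile(
    r"^\s*(?P<slot>\S+)\s+(?P<status>Healthy|Warning|Critical)"
    r"\s+(?P<total>\d+)"
    r"\s+(?P<used>\d+)\s+\(\s*\d+%\)"
    r"\s+(?P<free>\d+)\s+\(\s*\d+%\)"
    r"\s+(?P<committed>\d+)\s+\(\s*\d+%\)\s*$"
)


def parse_brief(text):
    """Return one dict per slot row, with the kB columns as integers."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            row = match.groupdict()
            for key in ("total", "used", "free", "committed"):
                row[key] = int(row[key])
            rows.append(row)
    return rows


def unhealthy_slots(rows):
    """Slots that report Warning or Critical."""
    return [row["slot"] for row in rows if row["status"] != "Healthy"]


def overcommitted_slots(rows):
    """Slots where the committed estimate is larger than physical memory."""
    return [row["slot"] for row in rows if row["committed"] > row["total"]]


if __name__ == "__main__":
    rows = parse_brief(SAMPLE)
    print("Not healthy:", unhealthy_slots(rows))
    print("Committed > total:", overcommitted_slots(rows))
```

An overcommitted slot is informational rather than a fault, since the kernel normally overcommits; the Status column is what indicates a problem.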

Each processor listed in the previous output reports a status of Healthy, Warning, or Critical, depending on the amount of free memory. If any processor displays a status of Warning or Critical, enter the monitor platform software process command in order to identify the top contributor.

BGL.J.16-ASR1000-4# monitor platform software process ?
  0   SPA-Inter-Processor slot 0
  1   SPA-Inter-Processor slot 1
  F0  Embedded-Service-Processor slot 0
  F1  Embedded-Service-Processor slot 1
  FP  Embedded-Service-Processor
  R0  Route-Processor slot 0
  R1  Route-Processor slot 1
  RP  Route-Processor



You might be prompted to set the terminal type before you can execute the monitor platform software process command:

BGL.J.16-ASR1000-4# monitor platform software process r0
Terminal type 'network' unsupported for command
Change the terminal type with the 'terminal terminal-type' command.

The terminal type is set to network by default. In order to set the appropriate terminal type, enter the terminal terminal-type command:

ASR1000# terminal terminal-type vt100

Once the correct terminal type is configured, you can enter the monitor platform software process command (some output omitted):

ASR1000# monitor platform software process r0
top - 00:34:59 up  5:02,  0 users,  load average: 2.43, 1.52, 0.73
Tasks: 136 total,   4 running, 132 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.8%us,  2.3%sy,  0.0%ni, 96.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2009852k total,  1811024k used,   198828k free,   135976k buffers
Swap:        0k total,        0k used,        0k free,  1133544k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
25956 root      20   0  928m 441m 152m R  1.2 22.5   4:21.32 linux_iosd-imag
29074 root      20   0  106m  95m 6388 S  0.0  4.9   0:14.86 smand
24027 root      20   0  114m  61m  55m S  0.0  3.1   0:05.07 fman_rp
25227 root      20   0 27096  13m  12m S  0.0  0.7   0:04.35 imand
23174 root      20   0 33760  11m 9152 S  1.0  0.6   1:58.00 cmand
23489 root      20   0 23988 7372 4952 S  0.2  0.4   0:05.28 emd
24755 root      20   0 19708 6820 4472 S  1.0  0.3   3:39.33 hman
28475 root      20   0 20460 6448 4792 S  0.0  0.3   0:00.26 psd
27957 root      20   0 16688 5668 3300 S  0.0  0.3   0:00.18 plogd
14572 root      20   0  4576 2932 1308 S  0.0  0.1   0:02.37 reflector.sh



Note: In order to sort the output in descending order of memory usage, press Shift + M.
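When the output is captured to a file rather than viewed interactively, the same ordering can be reproduced offline. The sketch below ranks captured process rows by resident memory (the RES column). It assumes the standard top convention that memory fields without a suffix are in KiB and that the m and g suffixes mean MiB and GiB; the function names are hypothetical.

```python
# A few process rows from the monitor platform software process r0 output
# shown in this document.
SAMPLE = """\
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
25956 root      20   0  928m 441m 152m R  1.2 22.5   4:21.32 linux_iosd-imag
29074 root      20   0  106m  95m 6388 S  0.0  4.9   0:14.86 smand
24027 root      20   0  114m  61m  55m S  0.0  3.1   0:05.07 fman_rp
23489 root      20   0 23988 7372 4952 S  0.2  0.4   0:05.28 emd
"""

UNITS_IN_KIB = {"k": 1, "m": 1024, "g": 1024 * 1024}


def to_kib(field):
    """Convert a top memory field such as '441m', '1.9g' or '6388' to KiB."""
    suffix = field[-1].lower()
    if suffix in UNITS_IN_KIB:
        return float(field[:-1]) * UNITS_IN_KIB[suffix]
    return float(field)


def parse_processes(text):
    """Return one dict per process row; summary and header lines are skipped."""
    processes = []
    for line in text.splitlines():
        fields = line.split(None, 11)
        if len(fields) == 12 and fields[0].isdigit():
            processes.append({
                "pid": int(fields[0]),
                "res_kib": to_kib(fields[5]),
                "mem_pct": float(fields[9]),
                "command": fields[11].strip(),
            })
    return processes


def top_by_memory(text, count=3):
    """Commands of the processes with the largest resident memory."""
    ranked = sorted(parse_processes(text),
                    key=lambda proc: proc["res_kib"], reverse=True)
    return [proc["command"] for proc in ranked[:count]]


if __name__ == "__main__":
    print(top_by_memory(SAMPLE))
```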

Warning: Open a Cisco Technical Assistance Center (TAC) case if any of the processors report a Critical or Warning status, and you need assistance in order to identify the cause.

Verify Memory Usage on IOSd

If you notice that the linux_iosd-imag process holds an unusually large amount of memory in the monitor platform software process rp active command output, focus your troubleshooting efforts on the IOSd instance. It is likely that a specific process within IOSd is not freeing memory. Troubleshoot memory-related issues in the IOSd pool the same way that you troubleshoot any software-based forwarding platform, such as the Cisco 2800, 3800, or 3900 Series platforms.

ASR1000# monitor platform software process rp active
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
25794 root      20   0 2929m 1.9g 155m R 99.9 38.9  1415:11  linux_iosd-imag
23038 root      20   0 33848  13m  10m S  5.9  0.4  30:53.87 cmand
 9599 root      20   0  2648 1152  884 R  2.0  0.0   0:00.01 top



Enter the show process memory sorted command in order to identify the problem process:

ASR1000# show process memory sorted
Processor Pool Total: 1733568032 Used: 1261854564 Free: 471713468
 lsmpi_io Pool Total:    6295088 Used:    6294116 Free:       972

 PID TTY  Allocated      Freed    Holding    Getbufs    Retbufs Process
 522   0 1587708188  803356800  724777608      54432          0 BGP Router
 234   0 3834576340 2644349464  232401568  286163388      15876 IP RIB Update
   0   0  263244344   36307492  215384208          0          0 *Init
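To put the Holding column in context, it can help to express each process as a share of the used Processor pool. The sketch below does this for the sample output above. The function names are hypothetical, and the percentage is only a rough indicator, since this calculation simply divides Holding by the pool's Used value.

```python
import re

# Sample from the show process memory sorted output in this document.
SAMPLE = """\
Processor Pool Total: 1733568032 Used: 1261854564 Free: 471713468
 lsmpi_io Pool Total:    6295088 Used:    6294116 Free:       972

 PID TTY  Allocated      Freed    Holding    Getbufs    Retbufs Process
 522   0 1587708188  803356800  724777608      54432          0 BGP Router
 234   0 3834576340 2644349464  232401568  286163388      15876 IP RIB Update
   0   0  263244344   36307492  215384208          0          0 *Init
"""

POOL = re.compile(
    r"^\s*(\S+) Pool Total:\s*(\d+)\s+Used:\s*(\d+)\s+Free:\s*(\d+)"
)
# PID, TTY, Allocated, Freed, Holding, Getbufs, Retbufs, Process name.
PROCESS = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.+?)\s*$"
)


def parse_process_memory(text):
    """Return (pools, processes) parsed from the command output."""
    pools, processes = {}, []
    for line in text.splitlines():
        pool = POOL.match(line)
        if pool:
            pools[pool.group(1)] = {
                "total": int(pool.group(2)),
                "used": int(pool.group(3)),
                "free": int(pool.group(4)),
            }
            continue
        proc = PROCESS.match(line)
        if proc:
            processes.append({
                "pid": int(proc.group(1)),
                "holding": int(proc.group(5)),
                "name": proc.group(8),
            })
    return pools, processes


def top_holders(text, count=3):
    """(process name, percent of used Processor pool) for the top holders."""
    pools, processes = parse_process_memory(text)
    used = pools["Processor"]["used"]
    ranked = sorted(processes, key=lambda p: p["holding"], reverse=True)
    return [(p["name"], round(100.0 * p["holding"] / used, 1))
            for p in ranked[:count]]


if __name__ == "__main__":
    for name, pct in top_holders(SAMPLE):
        print(f"{name}: {pct}% of used Processor pool")
```

Comparing the Holding value of the same process across captures taken some time apart is a common way to tell steady growth from a legitimately large but stable footprint.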

Note: Open a TAC case if you require assistance in order to troubleshoot or identify if the memory usage is legitimate.

Verify TCAM Utilization on an ASR1K

Traffic classification is one of the most basic functions found in routers and switches. Many applications and features require that infrastructure devices provide differentiated services to different users based on quality requirements, or apply features based on classification requirements. The traffic classification process must be fast so that the throughput of the device is not significantly degraded. The ASR1K platform uses the fourth generation of Ternary Content Addressable Memory (TCAM4) for this purpose.

In order to determine the total number of TCAM cells available on the platform, and the number of free entries that remain, enter this command:

ASR1000# show platform hardware qfp active tcam resource-manager usage
Total TCAM Cell Usage Information
----------------------------------
Name                         : TCAM #0 on CPP #0
Total number of regions      : 3
Total tcam used cell entries : 65528
Total tcam free cell entries : 30422
Threshold status             : below critical limit

Note: Cisco recommends that you always check the threshold status before you make any changes to access lists or Quality of Service (QoS) policies, in order to confirm that the TCAM has sufficient free cells available to program the entries.
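This pre-change check can be scripted. The sketch below parses the TCAM usage output shown above, computes the share of cells in use, and tests whether a planned change fits in the free cells. It rests on several assumptions: that used plus free equals the total cell count, that "below critical limit" is the healthy threshold string (the only value shown in this document), and that the number of cells a change needs is estimated by the operator (the cells_needed parameter is hypothetical).

```python
# Sample from the show platform hardware qfp active tcam resource-manager
# usage output in this document.
SAMPLE = """\
Total TCAM Cell Usage Information
----------------------------------
Name                         : TCAM #0 on CPP #0
Total number of regions      : 3
Total tcam used cell entries : 65528
Total tcam free cell entries : 30422
Threshold status             : below critical limit
"""


def parse_tcam_usage(text):
    """Return used cells, free cells and the threshold status string."""
    fields = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return {
        "used": int(fields["Total tcam used cell entries"]),
        "free": int(fields["Total tcam free cell entries"]),
        "threshold": fields["Threshold status"],
    }


def used_pct(usage):
    """Percent of cells in use, assuming total = used + free."""
    return round(100.0 * usage["used"] / (usage["used"] + usage["free"]), 1)


def can_program(usage, cells_needed):
    """True if the status is healthy and the estimated cells fit.

    cells_needed is an operator-supplied estimate, not a value the
    router reports.
    """
    healthy = usage["threshold"] == "below critical limit"
    return healthy and cells_needed <= usage["free"]


if __name__ == "__main__":
    usage = parse_tcam_usage(SAMPLE)
    print(f"TCAM cells in use: {used_pct(usage)}%")
    print("Room for 1000 more cells:", can_program(usage, 1000))
```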

If the forwarding processor runs critically low on free TCAM cells, the ESP might generate logs similar to these, and then crash, which causes the traffic forwarding to stop (if there is no redundancy):

%CPPTCAMRM-6-TCAM_RSRC_ERR: SIP0: cpp_sp: Allocation failed because of insufficient TCAM resources in the system.

%CPPOSLIB-3-ERROR_NOTIFY: SIP0: cpp_sp: cpp_sp encountered an error -Traceback= 1#d7f63914d8ef12b8456826243f3b60d7 errmsg:7EFFC525C000+1175 cpp_common_os:7EFFC8D20000+D1E5 cpp_common_os:7EFFC8D20000+D12E

Verify Memory Utilization on QFP

In addition to the physical memory, there is also memory attached to the Quantum Flow Processor (QFP) ASIC. It holds forwarding data structures, such as the Forwarding Information Base (FIB) and QoS policies. The amount of DRAM available to the QFP ASIC is fixed at 256 MB, 512 MB, or 1 GB, depending on the ESP module.

Enter the show platform hardware qfp active infrastructure exmem statistics command in order to determine the exmem memory usage. The sum of the IRAM and DRAM memory in use gives the total QFP memory in use.

BGL.I.05-ASR1000-1# show platform hardware qfp active infra exmem statistics user
Type: Name: IRAM, CPP: 0
  Allocations  Bytes-Alloc  Bytes-Total  User-Name
  -------------------------------------------------------
  1            115200       115712       CPP_FIA

Type: Name: DRAM, CPP: 0
  Allocations  Bytes-Alloc  Bytes-Total  User-Name
  -------------------------------------------------------
  4            1344         4096         P/I
  9            270600       276480       CEF
  1            1138256      1138688      QM RM
  1            4194304      4194304      TCAM
  1            65536        65536        Qm 16
  3            15745024     15745024     ING_EGR_UIDB
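The per-user rows in this output can be summed per memory type and in total, and the largest consumer picked out. The sketch below does so for the sample shown above. It sums the Bytes-Total column on the assumption that this column reflects the memory actually consumed per user; the function names are hypothetical.

```python
import re

# Sample from the exmem statistics user output in this document.
SAMPLE = """\
Type: Name: IRAM, CPP: 0
  Allocations  Bytes-Alloc  Bytes-Total  User-Name
  -------------------------------------------------------
  1            115200       115712       CPP_FIA

Type: Name: DRAM, CPP: 0
  Allocations  Bytes-Alloc  Bytes-Total  User-Name
  -------------------------------------------------------
  4            1344         4096         P/I
  9            270600       276480       CEF
  1            1138256      1138688      QM RM
  1            4194304      4194304      TCAM
  1            65536        65536        Qm 16
  3            15745024     15745024     ING_EGR_UIDB
"""

SECTION = re.compile(r"^Type: Name: (\w+), CPP: (\d+)")
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(.+?)\s*$")


def parse_exmem_users(text):
    """Return one dict per user row, tagged with its memory type."""
    users = []
    memory_type = None
    for line in text.splitlines():
        section = SECTION.match(line.strip())
        if section:
            memory_type = section.group(1)
            continue
        row = ROW.match(line)
        if row and memory_type:
            users.append({
                "type": memory_type,
                "allocations": int(row.group(1)),
                "bytes_alloc": int(row.group(2)),
                "bytes_total": int(row.group(3)),
                "user": row.group(4),
            })
    return users


def bytes_in_use(users):
    """Bytes-Total summed per memory type, plus the overall sum as 'ALL'."""
    totals = {}
    for entry in users:
        totals[entry["type"]] = totals.get(entry["type"], 0) + entry["bytes_total"]
    totals["ALL"] = sum(totals.values())
    return totals


def largest_user(users):
    """The user row with the largest Bytes-Total value."""
    return max(users, key=lambda entry: entry["bytes_total"])


if __name__ == "__main__":
    users = parse_exmem_users(SAMPLE)
    print(bytes_in_use(users))
    print("Largest user:", largest_user(users)["user"])
```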

The IRAM is the instruction memory for QFP software. In the event that DRAM is exhausted, available IRAM can be used. If the IRAM runs critically low on memory, you might see this error message:

%QFPOOR-4-LOWRSRC_PERCENT: F1: cpp_ha: QFP 0 IRAM resource low - 97 percent depleted
%QFPOOR-4-LOWRSRC_PERCENT: F1: cpp_ha: QFP 0 IRAM resource low - 98 percent depleted
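Collected syslog can be scanned for these messages in order to track how far a QFP resource has been depleted. The sketch below is a minimal example that assumes the single-line layout "QFP <n> <resource> resource low - <pct> percent depleted" for the message text; the function name is hypothetical.

```python
import re

# Two messages in the assumed single-line layout.
SAMPLE = """\
%QFPOOR-4-LOWRSRC_PERCENT: F1: cpp_ha: QFP 0 IRAM resource low - 97 percent depleted
%QFPOOR-4-LOWRSRC_PERCENT: F1: cpp_ha: QFP 0 IRAM resource low - 98 percent depleted
"""

LOW_RESOURCE = re.compile(
    r"%QFPOOR-4-LOWRSRC_PERCENT: (?P<slot>\S+): \S+: "
    r"QFP (?P<qfp>\d+) (?P<resource>\S+) resource low - "
    r"(?P<pct>\d+) percent depleted"
)


def peak_depletion(log_text):
    """Return {(slot, qfp, resource): highest percent depleted seen}."""
    peaks = {}
    for match in LOW_RESOURCE.finditer(log_text):
        key = (match.group("slot"), int(match.group("qfp")),
               match.group("resource"))
        peaks[key] = max(peaks.get(key, 0), int(match.group("pct")))
    return peaks


if __name__ == "__main__":
    for (slot, qfp, resource), pct in peak_depletion(SAMPLE).items():
        print(f"{slot} QFP {qfp} {resource}: up to {pct}% depleted")
```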

In order to determine the process that consumes most of the memory, enter the show platform hardware qfp active infra exmem statistics user command:

ASR1000# show platform hardware qfp active infra exmem statistics user
Type: Name: IRAM, CPP: 0
  Allocations  Bytes-Alloc  Bytes-Total  User-Name
  -----------------------------------------------------
  1            115200       115712       CPP_FIA

Type: Name: DRAM, CPP: 0
  Allocations  Bytes-Alloc  Bytes-Total  User-Name
  -----------------------------------------------------
  4            1344         4096         P/I
  9            270600       276480       CEF
  1            1138256      1138688      QM RM
  1            4194304      4194304      TCAM
  1            65536        65536        Qm 16
  3            15745024     15745024     ING_EGR_UIDB

Once you identify the feature that holds most of the memory, collect the output from the show platform hardware qfp active feature command, and contact the Cisco TAC in order to determine the root cause.

Updated: Nov 19, 2013

Document ID: 116777