Intel Core i7 Processor Features in Intel Performance Tuning Utility 3.2

Author: Maria Lang
Intel® Core™i7 Processor Features in Intel® Performance Tuning Utility 3.2
David Levinthal
Principal Engineer
Developer Products Division
Software and Solutions Group

http://www.intel.com/software/products Copyright © 2007, Intel Corporation. All rights reserved.

Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications.



Intel may make changes to specifications and product descriptions at any time, without notice.



Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.



The Intel® Performance Tuning Utility may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.



Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.



Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.

• http://www.physics.utah.edu/~detar/milc/milcv7.html#SEC3
• http://www.gnu.org/copyleft/gpl.html

This document contains information on products in the design phase of development. Do not finalize a design with this information. Revised information will be published when the product is available. Verify with your local sales office that you have the latest datasheet before finalizing a design.



PTU and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.



All dates specified are target dates, are provided for planning purposes only and are subject to change.



All products, dates, and figures specified are preliminary based on current expectations, provided for planning purposes only, and are subject to change without notice.



Intel, the Intel logo, Vtune and Core are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.



*Other names and brands are the property of their respective owners.



Copyright © 2009, Intel Corporation


Outline

• Installation and Introduction
• New Intel® Core™i7 Processor PMU Features

• PEBS Improvements
• Last Branch Record (LBR) Collection
• Matrix Event: Offcore_Response_0
• New Predefined Profiles
• Latency Histogram
• Performance Ratios

• Column Control
• Details Spreadsheet
• Call Counts per Source
• Register Display
• Source view String Search

Installation and Introduction

•Eclipse based performance analysis tool that runs on Windows* and Linux*
•Release 3.2 builds on existing product
•See http://softwarecommunity.intel.com/isn/Downloads/whatif/iptu/Intel(r)_PTU3.0_installation_and_usage.pdf
•See User_Guide.pdf in top directory of PTU installation

New Intel® Core™i7 Processor PMU Features

• Four General Counters
  • PEBS events on all 4
  • Almost all events are per logical core (HT)
• Many more PEBS events
  • Precise loads and stores
  • Load events per data source
    – Large number of data sources for multi core NUMA architecture
  • Precise branch retired
  • Latency event

PEBS Events

•Loads and stores retired
  • Count all loads and stores
  • Allows address reconstruction profiling
    –Except for pointer chase load
       mystruc=mystruc->new;
       mov rax, [rax+const]
    –Subject to distribution distortion due to PEBS shadowing
      • Total count is correct, distribution may be skewed
      • See backup foils

PEBS Events: loads per data source

Event                                             Description
mem_load_retired.l1d_hit                          Load hit line in L1D
mem_load_retired.hit_lfb                          Load missed L1D, hit allocated LFB
mem_load_retired.l2_hit                           Load missed L1D, hit L2
mem_load_retired.llc_unshared_hit                 Load missed L1D, L2. Hit LLC owned by core, or in shared state
mem_load_retired.other_core_l2_hit_hitm           Load missed L1D, L2, hit LLC but had to snoop other core
mem_load_retired.llc_miss                         Load missed L1D, L2, LFB and LLC
mem_load_retired.dtlb_miss                        Missed STLB, includes secondary misses
mem_uncore_retired.other_core_l2_hitm             Hit LLC, other core snoop response was HITM
mem_uncore_retired.local_dram                     LLC miss satisfied by local DRAM
mem_uncore_retired.remote_dram                    LLC miss satisfied by remote DRAM
mem_uncore_retired.remote_cache_local_home_hit    LLC miss of locally homed line forwarded from remote cache

PEBS Events

• Precise branch events
  • All_branches, conditional, near_call
    – Call counts, function arguments, basic block execution counts, loop tripcounts, etc.
• Latency event
  • HW samples loads and captures IP, linear address, load-to-use latency and data source
    – Similar to Itanium™ Processor data EAR event
    – Profile addresses, latencies and data sources in correlation
    – HW sampling fraction ~2-3%
      • Latency_event/mem_load_retired.llc_miss ~0.02-0.03 for pointer chasing loop retrieving data from DRAM when HW prefetchers are disabled

New PMU Features: Last Branch Record (LBR)

• Records sources and targets of taken branches in 16 (S/T pairs) deep rotating buffer
• Coupled with Precise call retired allows call counts/source
• Function arguments on Intel64 for limited arguments
• Call Chains
• Basic Block execution counts (see backup)
• Loop tripcounts
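
The rotating-buffer behaviour of the LBR can be illustrated with a small software model. This is only a sketch: the class and method names are hypothetical, the addresses are made up, and the real LBR is a set of hardware registers read by the collection driver, not a Python object.

```python
from collections import deque

LBR_DEPTH = 16  # depth stated on the slide (16 source/target pairs)


class LastBranchRecord:
    """Toy model of a rotating buffer of (source, target) pairs."""

    def __init__(self, depth=LBR_DEPTH):
        # A bounded deque drops the oldest entry once it is full.
        self.entries = deque(maxlen=depth)

    def record_taken_branch(self, source, target):
        self.entries.append((source, target))

    def snapshot(self):
        """Return the entries oldest first."""
        return list(self.entries)


lbr = LastBranchRecord()
for i in range(20):  # 20 taken branches, made-up addresses
    lbr.record_taken_branch(0x1000 + 0x10 * i, 0x2000 + 0x10 * i)

snap = lbr.snapshot()
print(len(snap), hex(snap[0][0]), hex(snap[-1][0]))
```

After 20 taken branches only the most recent 16 pairs remain, which is why one sample can describe at most 16 taken branches of history.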

New PMU Features: Matrix Event

•Request type X Response source
  • Per core/Per thread
  • > 65K possible programmings
  • ~270 are predefined
  • DATA_IN most useful request type
    –Loads, RFOs, SW prefetch and HW prefetch
  • Writebacks of cacheable lines are always to LLC

Memory Access: Offcore Access

• Offcore_Response_0 (b7)
  – “umasks” set with MSRs 1a6

Field           Bit position   Description
Request Type    0              Demand Data Rd = DCU reads (includes partials, DCU Prefetch)
                1              Demand RFO = DCU RFOs
                2              Demand Ifetch = IFU Fetches
                3              Writeback = MLC_EVICT/DCUWB
                4              PF Data Rd = MPL Reads
                5              PF RFO = MPL RFOs
                6              PF Ifetch = MPL Fetches
                7              OTHER
Response Type   8              LLC_HIT_UNCORE_HIT
                9              LLC_HIT_OTHER_CORE_HIT_SNP
                10             LLC_HIT_OTHER_CORE_HITM
                11             LLC_MISS_REMOTE_HIT_SCRUB
                12             LLC_MISS_REMOTE_FWD
                13             LLC_MISS_REMOTE_DRAM
                14             LLC_MISS_LOCAL_DRAM
                15             IO_CSR_MMIO

Offcore_response Reasonable Combinations?

Request Type      MSR Encoding   Response Type            MSR Encoding
ANY_DATA          xx11           ANY_CACHE_DRAM           7Fxx
ANY_IFETCH        xx44           ANY_DRAM                 60xx
ANY_REQUEST       xxFF           ANY_LLC_MISS             F8xx
ANY_RFO           xx22           ANY_LOCATION             FFxx
COREWB            xx08           IO_CSR_MMIO              80xx
DATA_IFETCH       xx77           LLC_HIT_NO_OTHER_CORE    01xx
DATA_IN           xx33           LLC_OTHER_CORE_HIT       02xx
DEMAND_DATA       xx03           LLC_OTHER_CORE_HITM      04xx
DEMAND_DATA_RD    xx01           LOCAL_CACHE              07xx
DEMAND_IFETCH     xx04           LOCAL_CACHE_DRAM         47xx
DEMAND_RFO        xx02           LOCAL_DRAM               40xx
OTHER             xx80           REMOTE_CACHE             18xx
PF_DATA           xx30           REMOTE_CACHE_DRAM        38xx
PF_DATA_RD        xx10           REMOTE_CACHE_HIT         10xx
PF_IFETCH         xx40           REMOTE_CACHE_HITM        08xx
PF_RFO            xx20           REMOTE_DRAM              20xx
PREFETCH          xx70

DATA_IN is the most useful request type. NT local stores are counted by 0200, not 4000.
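
The encodings in the combinations table follow from OR-ing one bit per request type (bits 0-7) with one bit per response type (bits 8-15). A minimal sketch, assuming the bit positions listed on the Offcore Access slide; the dictionary and function names are hypothetical and this only computes the 16-bit value, it does not program the MSR.

```python
# Bit positions as listed on the Offcore Access slide.
REQUEST_BITS = {
    "DEMAND_DATA_RD": 0, "DEMAND_RFO": 1, "DEMAND_IFETCH": 2, "WRITEBACK": 3,
    "PF_DATA_RD": 4, "PF_RFO": 5, "PF_IFETCH": 6, "OTHER": 7,
}
RESPONSE_BITS = {
    "LLC_HIT_UNCORE_HIT": 8, "LLC_HIT_OTHER_CORE_HIT_SNP": 9,
    "LLC_HIT_OTHER_CORE_HITM": 10, "LLC_MISS_REMOTE_HIT_SCRUB": 11,
    "LLC_MISS_REMOTE_FWD": 12, "LLC_MISS_REMOTE_DRAM": 13,
    "LLC_MISS_LOCAL_DRAM": 14, "IO_CSR_MMIO": 15,
}


def offcore_mask(requests, responses):
    """OR together the selected request and response bits (hypothetical helper)."""
    value = 0
    for name in requests:
        value |= 1 << REQUEST_BITS[name]
    for name in responses:
        value |= 1 << RESPONSE_BITS[name]
    return value


# DATA_IN = demand and prefetch data reads plus demand and prefetch RFOs.
DATA_IN = ["DEMAND_DATA_RD", "DEMAND_RFO", "PF_DATA_RD", "PF_RFO"]
ANY_LLC_MISS = [name for name, bit in RESPONSE_BITS.items() if bit >= 11]

data_in_local_dram = offcore_mask(DATA_IN, ["LLC_MISS_LOCAL_DRAM"])
data_in_any_llc_miss = offcore_mask(DATA_IN, ANY_LLC_MISS)
print(hex(data_in_local_dram), hex(data_in_any_llc_miss))
print(2 ** 16)  # number of distinct 16-bit programmings
```

The 16 independent bits give 65536 possible values, which matches the "> 65K possible programmings" figure on the Matrix Event slide.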

More Profiles


Intel® PTU predefined event lists

• Cycles and Uops
  – Cycle usage and uop flow through the pipeline
• General Exploration
  – Cycles, inst, stalls, branches, basic memory access
• Memory Access
  – Detailed breakdown of offcore memory access (with/without address profiling)
• Working Set
  – Precise loads and stores enabling address space analysis
• FE Investigation
  – Detailed instruction starvation analysis
• Contested lines
  – Precise HITM and Store events
• Loop Analysis
  – 32 events for HPC type codes, with/without call sites, i.e. including LBR capture
• Client Analysis
  – 54 events for client type codes, with/without call sites, i.e. including LBR capture
• And others…

Intel® PTU uses predefined event lists to manage the complexity

•General Exploration
   Cpu_clk_unhalted.thread
   Inst_retired.any
   Br_inst_retired.all_branches
   Mem_inst_retired.latency_above_threshold_32
   Mem_Load_retired.llc_miss
   Uops_executed.core_stall_cycles

Code profiles with respect to cycles, stalls, instructions and longer latency data sources

Latency Event Enables Latency Histogram and Filtering in Data Profile Display


Events Grouped into Data Source Hierarchy

>> and details will display detail spreadsheet for the highlighted row
• This only has to be done the first time

• Contains events in prioritized rows
• Each row displays
  –Event name
  –Sample count
  –Event count
  –Short description of ratio and its value
  –Highlighting enables advice text display

11/5/2008

Selecting a Different Function Changes the Detail Spreadsheet

Get Tuning Advice for the Selected Event/Ratio: Highlighting the Event Row Enables Explanation

Ratio file components

•Ratios cause cells to highlight when (Ratio > Threshold) && Dependency

   --Execution Stall Cycles[0.000]=[pmn:UOPS_EXECUTED.CORE_STALL_CYCLES]/[pmn:CPU_CLK_UNHALTED.THREAD]
   Dependency="Hot Function"
   Threshold=0.3
   ThresholdEvent=UOPS_EXECUTED.CORE_STALL_CYCLES
   CyclesRatio=yes
   ShortDescription=Execution Stall Cycles.
   Followed by lots of text that is the “advice” displayed by “explain” button
   ---

PMN.VTR is an Editable File
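
The highlight rule stated above, (Ratio > Threshold) && Dependency, can be written out as a small check. This is a sketch of the rule only: the function name and the event counts are hypothetical, and it does not parse PMN.VTR or reproduce how PTU evaluates the "Hot Function" dependency.

```python
def should_highlight(numerator_count, denominator_count, threshold, dependency_met):
    """Return True when the ratio exceeds its threshold and the dependency holds."""
    if denominator_count == 0:
        return False  # no cycles sampled, nothing to flag
    ratio = numerator_count / denominator_count
    return ratio > threshold and dependency_met


# Hypothetical counts for one function.
stall_cycles = 4_500_000       # UOPS_EXECUTED.CORE_STALL_CYCLES
unhalted_cycles = 10_000_000   # CPU_CLK_UNHALTED.THREAD

print(should_highlight(stall_cycles, unhalted_cycles, 0.3, dependency_met=True))
print(should_highlight(stall_cycles, unhalted_cycles, 0.3, dependency_met=False))
```

With these made-up counts the ratio is 0.45, so the cell would highlight only when the function also satisfies the dependency.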


Source/Asm Display displays Column Order by priorities for that Function (like detail)

Set Event of Interest to find Hotspot

Find the Callers of the Hot Function
Each Call Site is Listed Separately

Make a Branch Analysis Profile
• User Should Change SAV for Call_retired Event to Produce Reasonable Sample Rate for the Application
• Collect LBRs, Filtered on User Calls
• Collect Registers for the PEBS Branch Events

Branch Analysis Captures Call Chains and Register Contents for Calls and all Branches

Note: large call count would suggest need to inline, but low cycle count indicates low return for the effort

Do_gather is the Caller of the Hottest Hotspots

Use Register Contents and Disassembly to Compute Loop Iteration Count (Tripcount)
Drill down to source, Display Control Graph
Hot Nested loops: Blks 12 & 14 (on itself) inside blocks 11->15

Use Disassembly to Compute Inner Loop Iteration Count (Tripcount)
Block 12 (& 14), Tripcount is 3 (R14 is zeroed in blk 11)

Cmp in Blk 15 Controls Loop, Comparing R8 and R11. R8 increments by 48 (30H)

Register Values Collected with Precise Event Br_inst_retired.all_branches in Blk 11 Yield Values for R11 (14 samples)

Select the Asm Line, Right Click and Show Register Statistics

Tripcount is constant (min=max=avg, rms=0) and Equals 786432/48 = 16384, Which is the 4-Dim Lattice size for this Problem
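
The tripcount arithmetic on this slide can be reproduced from sampled register values. A minimal sketch, assuming the loop counter (R8) starts at zero, advances by 48 per iteration and is compared against the limit held in R11; the function name is hypothetical, "rms" is taken here to mean the standard deviation of the samples, and the 14 identical samples are illustrative values consistent with the slide, not data read from the tool.

```python
from statistics import mean, pstdev


def tripcount_stats(limit_samples, stride):
    """Convert sampled loop-limit register values into tripcount statistics."""
    counts = [value // stride for value in limit_samples]
    return {
        "min": min(counts),
        "max": max(counts),
        "avg": mean(counts),
        "rms": pstdev(counts),  # spread of the samples; 0 means a constant tripcount
    }


r11_samples = [786432] * 14   # 14 samples, all with the same limit value
stats = tripcount_stats(r11_samples, stride=48)
print(stats)
```

A zero spread with min, max and average all equal is what identifies the loop as having a fixed tripcount of 16384.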

Source/Asm View Text Search Utility

Summary

•The Intel® Performance Tuning Utility enables unprecedented performance analysis capabilities

Caveats on Using Multiplexing

•Using multiplexing with Hyperthreading™ enabled can cause Intel® PTU to crash Windows*
•Multiplexing has been shown to produce incorrect event counts if there are more than 3 event groups (12 general counter events + fixed counter events)

PMU Based Control Flow Analysis on Intel® Core™i7 Processors
David Levinthal
Principal Engineer
DPD, SSG

Control Flow

•Function Calls by source
•Basic Block execution counts
  • Leads to accurate instruction retired count
  • Loop trip counts
  • Defines a “Hot Spot”
•Branch taken/not taken ratio
  • Build multi basic block flow/XIF Streams
•Precise branch event + registers
  • Function arguments for calls (Intel64)
  • Loop tripcount distributions (counted loops)

Basic Block Execution

•Average of inst_retired over the basic block
  • All instructions are executed equally
  • Samples are not evenly distributed
    –Multiple instructions retire/cycle
•Much better method: use precise br_inst_retired + Last Branch Record

Basic Branch Analysis

•Vastly improved precise branch monitoring capabilities
  –Branches retired
    • All_branches, Conditional_branches, Near_call
  –16 deep Last Branch Record (LBR)
    • Records Taken Branches and their targets
    • LBR can be filtered by branch type and privilege level
•Precise br retired by branch type
  –Calls, conditional and all branches
  –Coupled with LBR capture yields
    • Call counts
    • Basic Block execution counts
    • “HW call graph”

Branch Analysis: Call Counts

•Call counts require sampling on calls
  –Sampling on anything else introduces a “trigger bias” that cannot be corrected for
•Br_Inst_Retired.Near_Call
  –“EIP+1” results in interrupt IP = target
•Requires LBR to identify source and target
  –Matching PEBS EIP with LBR target

Control Flow Analysis

•Br_inst_retired.all_branches + LBR gives Basic Block Execution Counts
  • Track back through taken branches, incrementing BB exec count by 1/num_bb
•Br_inst_retired.all_branches + LBR gives taken fraction
  –Not taken branches identified by branch address missing from LBR

Explained over Next Several Slides

Processing LBRs

[Figure: LBR entries Branch_0 -> Target_0 and Branch_1 -> Target_1]

•All instructions between Target_0 and Branch_1 are retired 1 time
•All Basic Blocks between Target_0 and Branch_1 are executed 1 time
•All Branch Instructions between Target_0 and Branch_1 are not taken

So it would all Seem Very Straightforward
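
The three statements above translate into a simple walk over consecutive LBR entries. This is a sketch under stated assumptions: the LBR snapshot is a list of (branch, target) address pairs ordered oldest first, the set of branch instruction addresses comes from disassembly, and the function names and addresses are hypothetical.

```python
def straight_line_ranges(lbr):
    """Return (start, end) ranges that ran without a taken branch.

    Each range starts at the target of one taken branch and ends at the
    address of the next taken branch.
    """
    return [(lbr[i][1], lbr[i + 1][0]) for i in range(len(lbr) - 1)]


def not_taken_branches(lbr, branch_addrs):
    """Branch instructions inside a straight-line range were executed but not taken."""
    found = []
    for start, end in straight_line_ranges(lbr):
        # The branch at `end` is the taken one, so it is excluded.
        found.extend(a for a in sorted(branch_addrs) if start <= a < end)
    return found


# Made-up snapshot with three taken branches, oldest first.
lbr = [(0x400100, 0x400200), (0x400240, 0x400300), (0x400310, 0x400100)]
branch_addrs = {0x400100, 0x400220, 0x400240, 0x400310}

ranges = straight_line_ranges(lbr)
not_taken = not_taken_branches(lbr, branch_addrs)
print([(hex(s), hex(e)) for s, e in ranges])
print([hex(a) for a in not_taken])
```

Here the conditional branch at 0x400220 lies between a target and the next taken branch, so it is counted as executed and not taken.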


Shadowing and Precise Data Collection

•The time between the counter overflow and the PEBS arming creates a “shadow”, during which events cannot be collected (~8 cycles?)
•Ex: conditional branches retired
  –Sequence of short BBs (< 3 cycles in duration)
  –If branch into first overflows counter, PEBS event cannot occur until branch at end of 4th BB
  –Intervening branches will never be sampled

Shadowing

[Figure: shadowing timeline. Assume 10 cycle shadow for this example. O means counter overflow, P means PEBS enabled, C means interrupt occurs.]

Reducing Shadowing Impact

•Some “events” will never occur!
  • Falling into shadowed window
•Use LBR to extend range of the single sample
•Count the number of objects in LBR and increment count for all of them by 1/NUM
  • Since you have only one sample

Minimizing Shadowing Impact on BB Execution Count

[Figure: shadowing timeline annotated with three columns: PEBS samples taken, cycles/branch taken, and number of LBR entries.]

In this example there are always 16 BBs covered in the LBR. Incrementing the BB execution count for each BB detected in the LBR by 1/NUM_BB seen in the LBR path will greatly reduce the effect of shadowing.
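
The 1/NUM_BB weighting can be sketched as follows. The assumptions are loud ones: basic blocks are identified by a sorted list of start addresses obtained from disassembly, the LBR snapshot is a list of (branch, target) pairs ordered oldest first, and all names and addresses are hypothetical. This is not PTU's implementation.

```python
from bisect import bisect_right
from collections import defaultdict


def blocks_in_range(bb_starts, start, end):
    """Start addresses of basic blocks touched while executing from start to end."""
    first = bisect_right(bb_starts, start) - 1
    last = bisect_right(bb_starts, end) - 1
    return bb_starts[max(first, 0):last + 1]


def accumulate_sample(counts, lbr, bb_starts):
    """Spread one sample over every basic block seen along the LBR path."""
    path = []
    for i in range(len(lbr) - 1):
        start, end = lbr[i][1], lbr[i + 1][0]
        path.extend(blocks_in_range(bb_starts, start, end))
    for bb in path:
        counts[bb] += 1.0 / len(path)  # each block gets 1/NUM_BB
    return counts


bb_starts = [0x100, 0x120, 0x140, 0x200]                 # made-up block layout
lbr = [(0x150, 0x100), (0x130, 0x200), (0x210, 0x100)]   # made-up snapshot

counts = accumulate_sample(defaultdict(float), lbr, bb_starts)
print({hex(bb): round(w, 3) for bb, w in counts.items()})
```

Each sample still contributes a total weight of 1, but the weight is shared among all blocks on the path, so blocks that can never be sampled directly because of the shadow still receive counts.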

Branch Filtering

LBR Filter Bit Name    Bit Description                                Bit
CPL_EQ_0               Exclude ring 0                                 0
CPL_NEQ_0              Exclude ring 3                                 1
JCC                    Exclude taken conditional branches             2
NEAR_REL_CALL          Exclude near relative calls                    3
NEAR_INDIRECT_CALL     Exclude near indirect calls                    4
NEAR_RET               Exclude near returns                           5
NEAR_INDIRECT_JMP      Exclude near unconditional indirect branches   6
NEAR_REL_JMP           Exclude near unconditional relative branches   7
FAR_BRANCH             Exclude far branches                           8

Fixing Shadowing in Call Counts

•Filter the LBR to only record calls (and unconditional jumps)
•Identify ALL calls (source and target) in the LBR (NUM_CALLS)
•Increment all source target links by 1/NUM_CALLS
•This will work exactly the same way as it does for BBs when no filter is applied
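
Because every bit in the LBR filter excludes a branch class, recording only calls and unconditional jumps means setting the bits for everything else. A minimal sketch, assuming the bit positions from the Branch Filtering table with bits 6 and 7 read as NEAR_INDIRECT_JMP and NEAR_REL_JMP; the helper name is hypothetical and the code only computes the mask value, it does not write any register.

```python
# Bit positions from the Branch Filtering table; a set bit excludes that class.
LBR_FILTER_BITS = {
    "CPL_EQ_0": 0, "CPL_NEQ_0": 1, "JCC": 2, "NEAR_REL_CALL": 3,
    "NEAR_INDIRECT_CALL": 4, "NEAR_RET": 5, "NEAR_INDIRECT_JMP": 6,
    "NEAR_REL_JMP": 7, "FAR_BRANCH": 8,
}


def lbr_filter_mask(excluded):
    """Build a filter value from the names of branch classes to exclude."""
    mask = 0
    for name in excluded:
        mask |= 1 << LBR_FILTER_BITS[name]
    return mask


# Keep calls and unconditional jumps: drop conditionals, returns and far branches.
calls_and_jumps = lbr_filter_mask(["JCC", "NEAR_RET", "FAR_BRANCH"])

# Same, restricted to user code by also excluding ring 0.
user_calls_and_jumps = lbr_filter_mask(["CPL_EQ_0", "JCC", "NEAR_RET", "FAR_BRANCH"])

print(hex(calls_and_jumps), hex(user_calls_and_jumps))
```

With such a filter all 16 LBR entries are calls or jumps, so one sample covers a much longer stretch of the call history.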

Precise Conditional Branch Retired

•Counted loops that actually use the induction variable will frequently keep the tripcount in a register for the termination test
  –Ex: heavily optimized triad with the Intel compiler has
       addq $0x8, %rcx
       cmpq %rax, %rcx
       jnge triad+0x27
•Value of RAX is the tripcount
•Average value of RCX is tripcount/2
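
The claim that the average sampled value of the induction register is about half the limit can be checked numerically. A small sketch, assuming the induction register starts at zero, advances by 8 per iteration as in the triad example, and is sampled uniformly across iterations; the function name and the limit value are hypothetical.

```python
def induction_values(limit, step):
    """Values the induction register holds at the compare, one per iteration."""
    return list(range(step, limit + 1, step))


limit = 8000                       # hypothetical value of the limit register (RAX)
values = induction_values(limit, step=8)
average = sum(values) / len(values)

print(len(values), average, average / limit)
```

Under these assumptions the average lands within a fraction of a percent of limit/2, so a sampled average of RCX gives an independent cross-check on the value read from RAX.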

Branch Analysis: Function Arguments (Intel64 only)

•Functions with “few” (
