Intel® Core™ i7 Processor Features in Intel® Performance Tuning Utility 3.2
David Levinthal
Principal Engineer
Developer Products Division
Software and Solutions Group
http://www.intel.com/software/products Copyright © 2007, Intel Corporation. All rights reserved.
Legal Disclaimer
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.
• The Intel® Performance Tuning Utility may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
• Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.
  – http://www.physics.utah.edu/~detar/milc/milcv7.html#SEC3
  – http://www.gnu.org/copyleft/gpl.html
• This document contains information on products in the design phase of development. Do not finalize a design with this information. Revised information will be published when the product is available. Verify with your local sales office that you have the latest datasheet before finalizing a design.
• PTU and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.
• All dates specified are target dates, are provided for planning purposes only and are subject to change.
• All products, dates, and figures specified are preliminary based on current expectations, provided for planning purposes only, and are subject to change without notice.
• Intel, the Intel logo, VTune and Core are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
• *Other names and brands are the property of their respective owners.
• Copyright © 2009, Intel Corporation
Outline
• Installation and Introduction
• New Intel® Core™ i7 Processor PMU Features
• PEBS Improvements
• Last Branch Record (LBR) Collection
• Matrix Event: Offcore_Response_0
• New Predefined Profiles
• Latency Histogram
• Performance Ratios
• Column Control
• Details Spreadsheet
• Call Counts per Source
• Register Display
• Source View String Search
Installation and Introduction
• Eclipse-based performance analysis tool that runs on Windows* and Linux*
• Release 3.2 builds on existing product
• See http://softwarecommunity.intel.com/isn/Downloads/whatif/iptu/Intel(r)_PTU3.0_installation_and_usage.pdf
• See User_Guide.pdf in top directory of PTU installation
New Intel® Core™ i7 Processor PMU Features
• Four general counters
• PEBS events on all 4
• Almost all events are per logical core (HT)
• Many more PEBS events
• Precise loads and stores
• Load events per data source
  – Large number of data sources for multi-core NUMA architecture
• Precise branch retired
• Latency event
PEBS Events
• Loads and stores retired
  • Count all loads and stores
  • Allows address reconstruction profiling
    – Except for pointer-chasing loads, where the load overwrites its own address register: mystruc=mystruc->new; mov rax, [rax+const]
    – Subject to distribution distortion due to PEBS shadowing
  • Total count is correct, distribution may be skewed
  • See backup foils
PEBS Events: loads per data source

Event                                              Description
mem_load_retired.l1d_hit                           Load hit line in L1D
mem_load_retired.hit_lfb                           Load missed L1D, hit allocated LFB
mem_load_retired.l2_hit                            Load missed L1D, hit L2
mem_load_retired.llc_unshared_hit                  Load missed L1D and L2; hit LLC line owned by core, or in shared state
mem_load_retired.other_core_l2_hit_hitm            Load missed L1D and L2, hit LLC but had to snoop other core
mem_load_retired.llc_miss                          Load missed L1D, L2, LFB and LLC
mem_load_retired.dtlb_miss                         Missed STLB, includes secondary misses
mem_uncore_retired.other_core_l2_hitm              Hit LLC, other core snoop response was HITM
mem_uncore_retired.local_dram                      LLC miss satisfied by local DRAM
mem_uncore_retired.remote_dram                     LLC miss satisfied by remote DRAM
mem_uncore_retired.remote_cache_local_home_hit     LLC miss of locally homed line forwarded from remote cache
PEBS Events
• Precise branch events
  • All_branches, conditional, near_call
    – Call counts, function arguments, basic block execution counts, loop tripcounts, etc.
• Latency event
  • HW samples loads and captures IP, linear address, load-to-use latency and data source
    – Similar to the Itanium™ processor data EAR event
    – Profile addresses, latencies and data sources in correlation
    – HW sampling fraction ~2-3%
  • Latency_event/mem_load_retired.llc_miss ~0.02-0.03 for a pointer-chasing loop retrieving data from DRAM when HW prefetchers are disabled
New PMU Features: Last Branch Record (LBR)
• Records sources and targets of taken branches in a 16-deep rotating buffer of source/target pairs
• Coupled with precise call retired, allows call counts per source
• Function arguments on Intel64 for limited arguments
• Call chains
• Basic block execution counts (see backup)
• Loop tripcounts
New PMU Features: Matrix Event
• Request type × response source
  • Per core/per thread
  • > 65K possible programmings
  • ~270 are predefined
  • DATA_IN most useful request type
    – Loads, RFOs, SW prefetch and HW prefetch
  • Writebacks of cacheable lines are always to LLC
Memory Access: Offcore Access
• Offcore_Response_0 (event B7): “umasks” set with MSR 1A6

Field           Bit position   Description
Request type    0              Demand Data Rd = DCU reads (includes partials, DCU Prefetch)
Request type    1              Demand RFO = DCU RFOs
Request type    2              Demand Ifetch = IFU Fetches
Request type    3              Writeback = MLC_EVICT/DCUWB
Request type    4              PF Data Rd = MPL Reads
Request type    5              PF RFO = MPL RFOs
Request type    6              PF Ifetch = MPL Fetches
Request type    7              OTHER
Response type   8              LLC_HIT_UNCORE_HIT
Response type   9              LLC_HIT_OTHER_CORE_HIT_SNP
Response type   10             LLC_HIT_OTHER_CORE_HITM
Response type   11             LLC_MISS_REMOTE_HIT_SCRUB
Response type   12             LLC_MISS_REMOTE_FWD
Response type   13             LLC_MISS_REMOTE_DRAM
Response type   14             LLC_MISS_LOCAL_DRAM
Response type   15             IO_CSR_MMIO
Offcore_response Reasonable Combinations?

Request Type       MSR Encoding
ANY_DATA           xx11
ANY_IFETCH         xx44
ANY_REQUEST        xxFF
ANY_RFO            xx22
COREWB             xx08
DATA_IFETCH        xx77
DATA_IN            xx33
DEMAND_DATA        xx03
DEMAND_DATA_RD     xx01
DEMAND_IFETCH      xx04
DEMAND_RFO         xx02
OTHER              xx80
PF_DATA            xx30
PF_DATA_RD         xx10
PF_IFETCH          xx40
PF_RFO             xx20
PREFETCH           xx70

Response Type            MSR Encoding
ANY_CACHE_DRAM           7Fxx
ANY_DRAM                 60xx
ANY_LLC_MISS             F8xx
ANY_LOCATION             FFxx
IO_CSR_MMIO              80xx
LLC_HIT_NO_OTHER_CORE    01xx
LLC_OTHER_CORE_HIT       02xx
LLC_OTHER_CORE_HITM      04xx
LOCAL_CACHE              07xx
LOCAL_CACHE_DRAM         47xx
LOCAL_DRAM               40xx
REMOTE_CACHE             18xx
REMOTE_CACHE_DRAM        38xx
REMOTE_CACHE_HIT         10xx
REMOTE_CACHE_HITM        08xx
REMOTE_DRAM              20xx

• DATA_IN is the most useful request type
• NT local stores are counted by 0200, not 4000
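The request and response encodings listed above are just ORs of the bit positions in the Offcore_Response_0 layout. A minimal sketch of that composition follows; the function name `offcore_mask` and the dictionary names are illustrative assumptions, not part of Intel® PTU.

```python
# Bit positions as given in the Offcore_Response_0 layout on the earlier slide.
REQUEST_BITS = {
    "DEMAND_DATA_RD": 0, "DEMAND_RFO": 1, "DEMAND_IFETCH": 2, "WRITEBACK": 3,
    "PF_DATA_RD": 4, "PF_RFO": 5, "PF_IFETCH": 6, "OTHER": 7,
}
RESPONSE_BITS = {
    "LLC_HIT_UNCORE_HIT": 8, "LLC_HIT_OTHER_CORE_HIT_SNP": 9,
    "LLC_HIT_OTHER_CORE_HITM": 10, "LLC_MISS_REMOTE_HIT_SCRUB": 11,
    "LLC_MISS_REMOTE_FWD": 12, "LLC_MISS_REMOTE_DRAM": 13,
    "LLC_MISS_LOCAL_DRAM": 14, "IO_CSR_MMIO": 15,
}


def offcore_mask(requests, responses):
    """OR together the selected request and response bits (hypothetical helper)."""
    value = 0
    for name in requests:
        value |= 1 << REQUEST_BITS[name]
    for name in responses:
        value |= 1 << RESPONSE_BITS[name]
    return value


# DATA_IN (xx33) = demand and prefetch data reads plus demand and prefetch RFOs.
DATA_IN = ["DEMAND_DATA_RD", "DEMAND_RFO", "PF_DATA_RD", "PF_RFO"]
# ANY_LLC_MISS (F8xx) = every response bit from 11 through 15.
ANY_LLC_MISS = [name for name, bit in RESPONSE_BITS.items() if bit >= 11]

print(f"{offcore_mask(DATA_IN, ['LLC_MISS_LOCAL_DRAM']):04X}")  # DATA_IN x LOCAL_DRAM
print(f"{offcore_mask(DATA_IN, ANY_LLC_MISS):04X}")             # DATA_IN x ANY_LLC_MISS
```

Composing the value from named bits makes it easy to check any of the predefined combinations against the bit layout.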
More Profiles
Intel® PTU predefined event lists
• Cycles and Uops
  – Cycle usage and uop flow through the pipeline
• General Exploration
  – Cycles, instructions, stalls, branches, basic memory access
• Memory Access
  – Detailed breakdown of offcore memory access (with or without address profiling)
• Working Set
  – Precise loads and stores enabling address space analysis
• FE Investigation
  – Detailed instruction starvation analysis
• Contested Lines
  – Precise HITM and store events
• Loop Analysis
  – 32 events for HPC-type codes, with or without call sites (i.e., including LBR capture)
• Client Analysis
  – 54 events for client-type codes, with or without call sites (i.e., including LBR capture)
• And others…
Intel® PTU uses predefined event lists to manage the complexity
• General Exploration
  – Cpu_clk_unhalted.thread
  – Inst_retired.any
  – Br_inst_retired.all_branches
  – Mem_inst_retired.latency_above_threshold_32
  – Mem_Load_retired.llc_miss
  – Uops_executed.core_stall_cycles
Code profiles with respect to cycles, stalls, instructions and longer latency data sources
Latency Event Enables Latency Histogram and Filtering in Data Profile Display
Events Grouped into Data Source Hierarchy
• >> and details will display the detail spreadsheet for the highlighted row
  • This only has to be done the first time
• Contains events in prioritized rows
• Each row displays
  – Event name
  – Sample count
  – Event count
  – Short description of ratio and its value
  – Highlighting enables advice text display
11/5/2008
Selecting a Different Function Changes the Detail Spreadsheet
Get Tuning Advice for the Selected Event/Ratio: Highlighting the Event Row Enables Explanation
Ratio file components
• Ratios cause cells to highlight when (Ratio > Threshold) && Dependency

--
Execution Stall Cycles[0.000]= [pmn:UOPS_EXECUTED.CORE_STALL_CYCLES]/[pmn:CPU_CLK_UNHALTED.THREAD]
Dependency="Hot Function"
Threshold=0.3
ThresholdEvent=UOPS_EXECUTED.CORE_STALL_CYCLES
CyclesRatio=yes
ShortDescription=Execution Stall Cycles.
Followed by lots of text that is the “advice” displayed by “explain” button
---

PMN.VTR is an Editable File
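The highlight rule (Ratio > Threshold) && Dependency can be sketched as follows. This is only an illustration of the rule using the Execution Stall Cycles entry; the function name `ratio_highlighted`, the sample counts, and the boolean standing in for the "Hot Function" dependency are assumptions, and nothing here reflects how Intel® PTU actually parses PMN.VTR.

```python
def ratio_highlighted(event_counts, numerator, denominator, threshold, dependency_met):
    """Return True when the ratio exceeds its threshold and the dependency holds.

    event_counts maps event names to counts (hypothetical input format).
    """
    denom = event_counts.get(denominator, 0)
    if denom == 0:
        return False  # no cycles collected, so nothing to highlight
    ratio = event_counts.get(numerator, 0) / denom
    return ratio > threshold and dependency_met


# Made-up counts for one function: 40% of cycles are execution stall cycles.
counts = {
    "UOPS_EXECUTED.CORE_STALL_CYCLES": 4_000_000,
    "CPU_CLK_UNHALTED.THREAD": 10_000_000,
}

hot = ratio_highlighted(counts, "UOPS_EXECUTED.CORE_STALL_CYCLES",
                        "CPU_CLK_UNHALTED.THREAD", 0.3, dependency_met=True)
print(hot)
```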
Source/Asm Display Orders Columns by Priority for that Function (like the Detail Spreadsheet)
Set Event of Interest to find Hotspot
Find the Callers of the Hot Function
• Each call site is listed separately
Make a Branch Analysis Profile
• User should change the SAV for the Call_retired event to produce a reasonable sample rate for the application
• Collect LBRs, filtered on user calls
• Collect registers for the PEBS branch events
Branch Analysis Captures Call Chains and Register Contents for Calls and all Branches
Note: a large call count would suggest a need to inline, but the low cycle count indicates low return for the effort
Do_gather is the Caller of the Hottest Hotspots
Use Register Contents and Disassembly to Compute Loop Iteration Count (Tripcount)
• Drill down to source, display the control graph
• Hot nested loops: blocks 12 & 14 (on itself) inside blocks 11->15
Use Disassembly to Compute Inner Loop Iteration Count (Tripcount)
• Block 12 (& 14): tripcount is 3 (R14 is zeroed in block 11)
• The cmp in block 15 controls the loop, comparing R8 and R11; R8 increments by 48 (30H)
• Register values collected with the precise event Br_inst_retired.all_branches in block 11 yield values for R11 (14 samples)
• Select the asm line, right click and Show Register Statistics
• Tripcount is constant (min=max=avg, rms=0) and equals 786432/48 = 16384, which is the 4-dim lattice size for this problem
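The tripcount arithmetic above can be reproduced from the sampled register values. The sketch below assumes the loop counter (R8) starts at zero, so each sampled bound (R11) divided by the 48-byte stride is a tripcount, and it reads "rms" as the spread around the mean; the function name `tripcount_stats` and the sample list are illustrative only.

```python
import statistics


def tripcount_stats(bound_samples, stride):
    """Turn sampled loop-bound register values into tripcount statistics.

    Assumes the induction register starts at 0 and advances by `stride`.
    """
    trips = [value / stride for value in bound_samples]
    return {
        "min": min(trips),
        "max": max(trips),
        "avg": statistics.fmean(trips),
        "rms": statistics.pstdev(trips),  # spread around the mean
    }


# 14 samples of R11, all equal to 786432 as in the example.
r11_samples = [786432] * 14
stats = tripcount_stats(r11_samples, stride=48)
print(stats)
```

A constant tripcount shows up as min = max = avg with zero spread; a data-dependent loop would show a nonzero spread instead.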
Source/Asm View Text Search Utility
Summary
• The Intel® Performance Tuning Utility enables unprecedented performance analysis capabilities
Caveats on Using Multiplexing
• Using multiplexing with Hyperthreading™ enabled can cause Intel® PTU to crash Windows*
• Multiplexing has been shown to produce incorrect event counts if there are more than 3 event groups (12 general counter events + fixed counter events)
PMU Based Control Flow Analysis on Intel® Core™ i7 Processors
David Levinthal
Principal Engineer
DPD, SSG
Control Flow
• Function calls by source
• Basic block execution counts
  • Leads to accurate instruction retired count
  • Loop trip counts
  • Defines a “Hot Spot”
• Branch taken/not taken ratio
  • Build multi basic block flow/XIF Streams
• Precise branch event + registers
  • Function arguments for calls (Intel64)
  • Loop tripcount distributions (counted loops)
Basic Block Execution
• Average of inst_retired over the basic block
  • All instructions in a basic block are executed the same number of times
  • Samples are not evenly distributed
    – Multiple instructions retire/cycle
• Much better method: use precise br_inst_retired + Last Branch Record
Basic Branch Analysis
• Vastly improved precise branch monitoring capabilities
  – Branches retired
    • All_branches, Conditional_branches, Near_call
  – 16 deep Last Branch Record (LBR)
    • Records taken branches and their targets
    • LBR can be filtered by branch type and privilege level
• Precise br retired by branch type
  – Calls, conditional and all branches
  – Coupled with LBR capture yields
    • Call counts
    • Basic block execution counts
    • “HW call graph”
Branch Analysis: Call Counts
• Call counts require sampling on calls
  – Sampling on anything else introduces a “trigger bias” that cannot be corrected for
• Br_Inst_Retired.Near_Call
  – “EIP+1” results in interrupt IP = target
• Requires LBR to identify source and target
  – Matching PEBS EIP with LBR target
Control Flow Analysis
• Br_inst_retired.all_branches + LBR gives basic block execution counts
  • Track back through taken branches, incrementing BB exec count by 1/num_bb
• Br_inst_retired.all_branches + LBR gives taken fraction
  – Not-taken branches identified by branch address missing from LBR
Explained over next several slides
Processing LBRs
[Figure: two consecutive LBR entries, Branch_0 -> Target_0 followed by Branch_1 -> Target_1]
• All instructions between Target_0 and Branch_1 are retired 1 time
• All basic blocks between Target_0 and Branch_1 are executed 1 time
• All branch instructions between Target_0 and Branch_1 are not taken
So it would all seem very straightforward
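The three rules for consecutive LBR entries can be sketched in a few lines. This is a minimal illustration, assuming the LBR stack is available as a list of (branch address, target address) pairs ordered oldest first and that the addresses of the branch instructions are known from disassembly; the function name `branch_outcomes` and the addresses are made up.

```python
from collections import Counter


def branch_outcomes(lbr, branch_addrs):
    """Count taken and not-taken executions implied by one LBR stack.

    For each consecutive pair of entries, the code from Target_i up to
    Branch_i+1 ran exactly once: the branch ending the range was taken and
    every other branch inside the range fell through.
    """
    taken = Counter()
    not_taken = Counter()
    for (_, start), (end, _) in zip(lbr, lbr[1:]):
        if end < start:
            continue  # not a straight-line range; skipped in this sketch
        taken[end] += 1
        for addr in branch_addrs:
            if start <= addr < end:
                not_taken[addr] += 1
    return taken, not_taken


# Hypothetical stack: a loop at 0x200..0x230 entered once and iterated twice.
lbr = [(0x100, 0x200), (0x230, 0x200), (0x230, 0x300)]
branches = [0x210, 0x230]  # a conditional inside the loop and the loop-back branch
taken, not_taken = branch_outcomes(lbr, branches)
print(dict(taken), dict(not_taken))
```

The oldest entry's own branch is not counted here because the range leading up to it is not in the stack.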
Shadowing and Precise Data Collection
• The time between the counter overflow and the PEBS arming creates a “shadow”, during which events cannot be collected (~8 cycles?)
• Ex: conditional branches retired
  – Sequence of short BBs (< 3 cycles in duration)
  – If the branch into the first BB overflows the counter, the PEBS event cannot occur until the branch at the end of the 4th BB
  – Intervening branches will never be sampled
Shadowing
[Figure: timeline of taken branches spaced 20 and 2 cycles apart, assuming a 10 cycle shadow for this example, with the resulting sample count shown per branch (N, 0 or 5N). Legend: O means counter overflow, P means PEBS enabled, C means interrupt occurs.]
Reducing Shadowing Impact
• Some “events” will never occur!
  • Falling into shadowed window
• Use LBR to extend range of the single sample
• Count the number of objects in LBR and increment count for all of them by 1/NUM
  • Since you have only one sample
Minimizing Shadowing Impact on BB Execution Count
[Figure: the shadowing timeline annotated with three rows: PEBS samples taken, cycles/branch taken, and number of LBR entries.]
In this example there are always 16 BBs covered in the LBR. Incrementing the BB execution count for each BB detected in the LBR by 1/NUM_BB, where NUM_BB is the number of BBs seen in the LBR path, will greatly reduce the effect of shadowing.
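A minimal sketch of the 1/NUM_BB weighting follows. It assumes basic blocks are known as (first address, last address) ranges from disassembly and that each PEBS sample arrives with its LBR stack ordered oldest first; `add_sample` and the addresses are hypothetical.

```python
from collections import defaultdict


def add_sample(bb_counts, lbr, blocks):
    """Spread one PEBS sample over every basic block seen in its LBR path."""
    seen = []
    for (_, start), (end, _) in zip(lbr, lbr[1:]):
        for block in blocks:
            first, last = block
            if start <= first and last <= end:
                seen.append(block)  # block lies inside an executed range
    if not seen:
        return
    weight = 1.0 / len(seen)  # 1/NUM_BB, since there is only one sample
    for block in seen:
        bb_counts[block] += weight


# Hypothetical blocks and one sample's LBR stack.
A, B, C = (0x200, 0x20F), (0x210, 0x230), (0x300, 0x320)
lbr = [(0x100, 0x200), (0x230, 0x300), (0x320, 0x200)]
bb_counts = defaultdict(float)
add_sample(bb_counts, lbr, [A, B, C])
print(dict(bb_counts))
```

Each sample still contributes a total weight of one, so the overall count is preserved while blocks that always fall in the shadow receive their share.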
Branch Filtering

LBR Filter Bit Name    Bit Description                                 Bit
CPL_EQ_0               Exclude ring 0                                  0
CPL_NEQ_0              Exclude ring 3                                  1
JCC                    Exclude taken conditional branches              2
NEAR_REL_CALL          Exclude near relative calls                     3
NEAR_INDIRECT_CALL     Exclude near indirect calls                     4
NEAR_RET               Exclude near returns                            5
NEAR_INDIRECT_JMP      Exclude near unconditional indirect branches    6
NEAR_REL_JMP           Exclude near unconditional relative branches    7
FAR_BRANCH             Exclude far branches                            8
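Because every filter bit excludes a branch class, recording only one kind of branch means setting all the other bits. The sketch below builds such a mask for user-mode calls; the helper name `lbr_filter_mask` is an assumption, and the assignment of bits 6 and 7 to NEAR_INDIRECT_JMP and NEAR_REL_JMP is taken as an assumption from the table.

```python
LBR_FILTER_BITS = {
    "CPL_EQ_0": 0, "CPL_NEQ_0": 1, "JCC": 2, "NEAR_REL_CALL": 3,
    "NEAR_INDIRECT_CALL": 4, "NEAR_RET": 5, "NEAR_INDIRECT_JMP": 6,
    "NEAR_REL_JMP": 7, "FAR_BRANCH": 8,
}
BRANCH_TYPES = [name for name in LBR_FILTER_BITS if not name.startswith("CPL_")]


def lbr_filter_mask(keep_types, exclude_ring0=False, exclude_ring3=False):
    """Set the exclude bit of every branch type that is not in keep_types."""
    mask = 0
    for name in BRANCH_TYPES:
        if name not in keep_types:
            mask |= 1 << LBR_FILTER_BITS[name]
    if exclude_ring0:
        mask |= 1 << LBR_FILTER_BITS["CPL_EQ_0"]
    if exclude_ring3:
        mask |= 1 << LBR_FILTER_BITS["CPL_NEQ_0"]
    return mask


# Record only near calls made in user mode.
user_calls = lbr_filter_mask({"NEAR_REL_CALL", "NEAR_INDIRECT_CALL"},
                             exclude_ring0=True)
print(hex(user_calls))
```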
Fixing Shadowing in Call Counts
• Filter the LBR to only record calls (and unconditional jumps)
• Identify ALL calls (source and target) in the LBR (NUM_CALLS)
• Increment all source/target links by 1/NUM_CALLS
• This will work exactly the same way as it does for BBs when no filter is applied
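The same weighting applied to call links might look like the sketch below. It assumes the LBR was filtered so that every entry is a call, given as (call site address, callee address) pairs; `add_call_sample` and the addresses are illustrative.

```python
from collections import defaultdict


def add_call_sample(edge_counts, call_lbr):
    """Credit each source/target link in a call-filtered LBR with 1/NUM_CALLS."""
    if not call_lbr:
        return
    weight = 1.0 / len(call_lbr)
    for source, target in call_lbr:
        edge_counts[(source, target)] += weight


edge_counts = defaultdict(float)
# Two hypothetical samples; the link 0x400 -> 0x900 appears in both.
add_call_sample(edge_counts, [(0x400, 0x900), (0x410, 0xA00)])
add_call_sample(edge_counts, [(0x400, 0x900), (0x400, 0x900),
                              (0x420, 0xB00), (0x410, 0xA00)])
for (source, target), count in sorted(edge_counts.items()):
    print(f"{source:#x} -> {target:#x}: {count:.2f}")
```

Keeping the call site rather than the calling function as the key preserves the per-call-site breakdown; links can be summed per function afterwards.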
Precise Conditional Branch Retired
• Counted loops that actually use the induction variable will frequently keep the tripcount in a register for the termination test
  – Ex: heavily optimized triad with the Intel compiler has
      addq $0x8, %rcx
      cmpq %rax, %rcx
      jnge triad+0x27
• Value of RAX is the tripcount
• Average value of RCX is tripcount/2
Branch Analysis: Function Arguments (Intel64 only)
•Functions with “few” (