Performance Tuning of Computer Systems
Subhasis Banerjee

(CARG, University of Ottawa)


Outline

1. Profiling: Understand What Computers Execute; Program Profiling: Basic Concepts; Types of Profiler
2. Using Profile Information: Data Collection; Case Studies
3. Profile Directed Optimization: Software Optimization; Hardware Optimization
4. Summary


Profiling

Understand What Computers Execute


What is Profiling?

- Profile = "a set of data often in graphic form portraying the significant features of something" (Merriam-Webster)
- Focus on dynamic execution (static program analysis belongs to software engineering, i.e., formal methods for software)
- Profiling reveals the interaction between software and the underlying machine architecture
- Indicates areas of improvement ⇒ performance tuning


How is Profiling Done?

- Software tools are used to collect profile data
- Tools are often supported by the operating system
- Some popular profiling tools: gprof, oprofile, valgrind, pin
- The profile is collected during program execution ⇒ dynamic profile
- Profile data analysis ⇒ performance engineering


Which Profile Data is Collected?

- Call graph profile at function level
- Call graph at basic block level
- Memory performance
- Architectural events, e.g., branch misprediction, exception, cache hit/miss
- Monitoring performance counters


Program Profiling: Basic Concepts


Characterizing Programs: Basic Block

- Programs are best viewed at the architecture level in an intermediate form, e.g., assembly language
- A basic block (BB) is a code sequence that has one entry point and one exit point
- BBs are the building blocks of the majority of program analysis tools and optimization policies
- Once a BB is entered, all its instructions execute until control leaves at the end of the BB
- More general definition: a sequence of instructions in which every instruction dominates all following instructions and no other instruction executes between two instructions of the sequence


Algorithm to Identify BB

Step 1. Identify the leaders (the first instruction of each basic block) in the code. A leader is an instruction in any of the following 3 categories:
- The first instruction
- The target of a conditional or an unconditional branch instruction
- The instruction that immediately follows a conditional or an unconditional branch instruction

Step 2. Starting from a leader, the set of all following instructions up to, but not including, the next leader is the basic block corresponding to that leader
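The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Instr` record, the mnemonics `br` and `jmp`, and the seven-instruction `PROGRAM` are hypothetical and do not come from the slides or from any real instruction set.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    op: str                       # hypothetical mnemonic, e.g. "add", "br", "jmp"
    target: Optional[int] = None  # index of the branch target, if the op is a branch


# Assumption: "br" is a conditional branch and "jmp" an unconditional one.
BRANCH_OPS = {"br", "jmp"}


def find_leaders(code: List[Instr]) -> List[int]:
    """Step 1: collect the indices of all leaders."""
    leaders = set()
    if code:
        leaders.add(0)                       # category 1: the first instruction
    for i, ins in enumerate(code):
        if ins.op in BRANCH_OPS:
            if ins.target is not None:
                leaders.add(ins.target)      # category 2: target of a branch
            if i + 1 < len(code):
                leaders.add(i + 1)           # category 3: instruction after a branch
    return sorted(leaders)


def basic_blocks(code: List[Instr]) -> List[Tuple[int, int]]:
    """Step 2: each block runs from a leader up to, not including, the next leader."""
    leaders = find_leaders(code)
    ends = leaders[1:] + [len(code)]
    return list(zip(leaders, ends))          # half-open index ranges [start, end)


# Hypothetical program used only to exercise the sketch.
PROGRAM = [
    Instr("mov"),
    Instr("cmp"),
    Instr("br", target=5),
    Instr("add"),
    Instr("jmp", target=6),
    Instr("sub"),
    Instr("ret"),
]

print(find_leaders(PROGRAM))
print(basic_blocks(PROGRAM))
```

Blocks are returned as half-open index ranges so that consecutive blocks share a boundary without overlapping.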


Programs in Terms of Directed Graph

- A program is a directed graph with BBs as nodes
- Edges are indicated by branch instructions

[Figure: example directed graph with six basic blocks, numbered 1 to 6]
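As a rough illustration of the graph view, the sketch below builds successor lists for a six-block program. The edges of the original figure are not recoverable, so the `TERMINATORS` table is entirely hypothetical. The sketch also assumes that a block ending in a conditional branch, or in no branch at all, has an edge to the next block in layout order.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical six-block program in layout order. Each block is described by
# how it ends: "cond" (conditional branch), "jump" (unconditional branch),
# "fall" (no branch, runs into the next block) or "ret" (leaves the program).
LAYOUT: List[int] = [1, 2, 3, 4, 5, 6]
TERMINATORS: Dict[int, Tuple[str, Optional[int]]] = {
    1: ("cond", 3),
    2: ("jump", 4),
    3: ("fall", None),
    4: ("cond", 6),
    5: ("jump", 4),
    6: ("ret", None),
}


def build_cfg(layout: List[int],
              terminators: Dict[int, Tuple[str, Optional[int]]]) -> Dict[int, List[int]]:
    """Return a mapping from each basic block to its sorted successor blocks."""
    cfg: Dict[int, List[int]] = {bb: [] for bb in layout}
    for pos, bb in enumerate(layout):
        kind, target = terminators[bb]
        if kind in ("cond", "jump") and target is not None:
            cfg[bb].append(target)              # edge created by the branch itself
        if kind in ("cond", "fall") and pos + 1 < len(layout):
            cfg[bb].append(layout[pos + 1])     # assumed fall-through edge
        cfg[bb].sort()
    return cfg


CFG = build_cfg(LAYOUT, TERMINATORS)
for bb, successors in CFG.items():
    print(bb, "->", successors)
```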


Types of Profiler


Profilers

Call graph profiler:
- Shows the call times and frequency of functions/subroutines
- Static or dynamic, depending on the degree of accuracy required
- A dynamic graph can be built fully context sensitive, i.e., for each call of a subroutine the graph includes a node with the call stack ⇒ large memory requirement
- Widely used in program analysis: Valgrind, gprof, codeviz, doxygen
- Drawback: slow execution, intrusive

Event based profiler:
- Some programming languages support event based profiling (Java, Python, Ruby)
- The runtime provides various callbacks to a profile agent
- Profile collection is customizable
- Drawback: slow execution, intrusive
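A minimal sketch of event based profiling in Python, one of the languages listed above: the interpreter invokes a callback registered with `sys.setprofile` on every function call and return. The helper `profile_calls` and the workload `fib` are hypothetical names chosen for this illustration.

```python
import sys
from collections import Counter


def profile_calls(func, *args, **kwargs):
    """Run func and count how often each Python function is entered."""
    counts = Counter()

    def agent(frame, event, arg):
        # The runtime reports several event kinds; only Python-level
        # function entries ("call") are counted here.
        if event == "call":
            counts[frame.f_code.co_name] += 1

    sys.setprofile(agent)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)   # always detach the agent again
    return result, counts


def fib(n):
    # Deliberately naive recursion so that one call fans out into many.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


result, counts = profile_calls(fib, 10)
print(result, counts["fib"])
```

The callback runs inside the profiled program on every event, which is the source of the slowdown and intrusiveness noted above.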


Statistical profiler:
- Sampling is usually done in hardware (program counter)
- OS interrupts sample the hardware counters
- Sampling is always lossy and not as accurate as other methods of collecting data
- Nearly as fast as the original execution
- Advantage: non-intrusive
- Although other software profilers in theory provide more accurate information, statistical profilers have shown near-accurate results without loss of execution speed and without modifying program characteristics
- Tools: AMD CodeAnalyst, Intel VTune, OProfile (open source), gprof, MIPS with JTAG interface

Instrumenting the program:
- Insert instrumentation code at appropriate places to collect the profile
- Types of instrumentation: manual, compiler assisted, runtime instrumentation, binary translation
- gprof (works with instrumentation and sampling) uses the -pg compiler option
- PIN uses runtime instrumentation
- ATOM (DEC Alpha processors with Tru64 UNIX) uses binary translation (obsolete)
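The trade-off between exact counting and sampling can be illustrated on a synthetic trace. In the sketch below each trace entry stands for the code region the program counter is in during one time step. The region names, the 70/25/5 split and the sampling period of 7 are assumptions made up for this example, not measurements.

```python
from collections import Counter
from typing import Dict, List


def exact_profile(trace: List[str]) -> Dict[str, float]:
    """Instrumentation-style profile: every trace entry is counted."""
    counts = Counter(trace)
    return {region: n / len(trace) for region, n in counts.items()}


def sampled_profile(trace: List[str], period: int) -> Dict[str, float]:
    """Statistical profile: only every period-th entry is observed."""
    samples = trace[::period]
    counts = Counter(samples)
    return {region: n / len(samples) for region, n in counts.items()}


# Hypothetical workload: 100 repetitions of a 100-step pattern.
PATTERN = ["hot"] * 70 + ["warm"] * 25 + ["cold"] * 5
TRACE = PATTERN * 100

EXACT = exact_profile(TRACE)
# The period is chosen coprime with the pattern length; a period that divides
# the pattern length would keep hitting the same positions and bias the result.
SAMPLED = sampled_profile(TRACE, period=7)

for region in EXACT:
    print(region, EXACT[region], round(SAMPLED[region], 4))
```

The sampled profile observes roughly one entry in seven yet stays close to the exact fractions, which is the behavior the statistical profiler bullets describe.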


Simulation Profiler

- Simulators can be used for detailed profiling, at the cost of slow execution
- Simulation can be of different types: trace driven simulator, execution driven simulator
- Some simulators model execution at cycle level: data dependence analysis
- Simulation is done with a smaller data input that is a realistic representation of the actual data


Using Profile Information

Data Collection


How to Collect Meaningful Data

What is useful data for a profiler?
- Trace: collection of instructions and data at runtime, all dynamic instances
- Events: architectural events, e.g., cache miss, branch misprediction, page fault
- Counts: architecture specific metrics, e.g., number of instructions retired, number of hits/misses in cache, interrupts/exceptions
- Dataflow analysis: instructions share data for computation; dataflow dependency ensures program order

What to do with the collected data?
- Analysis of bottlenecks in program execution
- Identify the performance metric: number of instructions retired in a given time? Power? For a multithreaded program, is it adequately parallel?
- Suggest possible improvements: automatically tune code? Hint to the compiler to generate better code? Architectural support to eliminate/reduce the performance bottleneck?
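As a small worked example of turning raw counts into metrics, the sketch below derives instructions per cycle and miss rates from a set of counter values. The `Counters` record, its field names and all numbers are hypothetical. Misses per thousand instructions (MPKI) is added here as a commonly used normalization and is not mentioned in the slides.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Counters:
    """Hypothetical raw counts read from hardware performance counters."""
    cycles: int
    instructions_retired: int
    cache_accesses: int
    cache_misses: int
    branches: int
    branch_mispredictions: int


def derive_metrics(c: Counters) -> Dict[str, float]:
    """Turn raw counts into ratios that can be compared across runs."""
    return {
        "ipc": c.instructions_retired / c.cycles,
        "cache_miss_rate": c.cache_misses / c.cache_accesses,
        "branch_mispredict_rate": c.branch_mispredictions / c.branches,
        "cache_mpki": 1000 * c.cache_misses / c.instructions_retired,
    }


# Made-up sample values, for illustration only.
SAMPLE = Counters(
    cycles=2_000_000,
    instructions_retired=3_000_000,
    cache_accesses=800_000,
    cache_misses=40_000,
    branches=500_000,
    branch_mispredictions=25_000,
)

METRICS = derive_metrics(SAMPLE)
for name, value in METRICS.items():
    print(f"{name}: {value:.3f}")
```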


Case Studies


SPEC2000 Benchmark: Instruction Mix and Cache Profile

[Figure: cache miss rate (%) of SPEC2000 benchmarks for 16 kB, 32 kB and 64 kB caches at 1-way, 2-way, 4-way and 8-way associativity. First panel (miss rate axis 0 to 25%): bzip2, crafty, gcc, gzip, mcf, parser, vortex. Second panel (miss rate axis 0 to 40%): art, equake, mesa, galgel, mgrid.]
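Miss rate curves of this kind are typically produced by replaying an address trace through a cache model. The sketch below is a minimal trace driven simulator for a set-associative cache. It assumes LRU replacement and 64-byte lines, and uses a synthetic trace in place of SPEC2000, so its numbers say nothing about the benchmarks in the figure.

```python
from collections import OrderedDict
from typing import Iterable, List


class Cache:
    """Set-associative cache model with LRU replacement (assumed policy)."""

    def __init__(self, size_bytes: int, ways: int, line_bytes: int = 64):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        # One ordered dict per set; insertion order doubles as LRU order.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> None:
        line = addr // self.line_bytes
        entries = self.sets[line % self.num_sets]
        tag = line // self.num_sets
        if tag in entries:
            entries.move_to_end(tag)          # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(entries) >= self.ways:
                entries.popitem(last=False)   # evict the least recently used line
            entries[tag] = True


def miss_rate(trace: Iterable[int], size_bytes: int, ways: int) -> float:
    cache = Cache(size_bytes, ways)
    for addr in trace:
        cache.access(addr)
    return cache.misses / (cache.hits + cache.misses)


def conflict_trace(reps: int = 10, lines: int = 64) -> List[int]:
    """Synthetic trace: two arrays 16 kB apart, accessed alternately."""
    trace = []
    for _ in range(reps):
        for i in range(lines):
            trace.append(i * 64)              # element of the first array
            trace.append(16 * 1024 + i * 64)  # conflicting element of the second
    return trace


TRACE = conflict_trace()
for size_kb in (16, 32, 64):
    for ways in (1, 2, 4, 8):
        print(size_kb, "kB", ways, "way", miss_rate(TRACE, size_kb * 1024, ways))
```

The synthetic trace is built so that the two arrays collide in a 16 kB direct-mapped cache, which makes the effect of added associativity or capacity easy to see.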


SPEC2000 Benchmark: Branch and Power Profile

[Figure: branch prediction accuracy (%) of combined, gshare and bimodal predictors on SPEC2000 benchmarks, accuracy axis 75 to 100%. First panel: twolf, vortex, parser, perlbmk, gcc, mcf, crafty, gzip, bzip2. Second panel: swim, mgrid, mesa, galgel, facerec, art, equake, applu, ammp.]
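To give a feel for why the predictors in the figure differ, the sketch below models a bimodal and a gshare predictor with 2-bit saturating counters and runs both on a synthetic branch trace. Table size, history length, initial counter state and the alternating trace are assumptions for this illustration. The combined predictor is not modelled.

```python
from typing import List, Tuple

Trace = List[Tuple[int, bool]]   # (branch address, taken?)


def _update(counter: int, taken: bool) -> int:
    """2-bit saturating counter: 0..1 predict not taken, 2..3 predict taken."""
    return min(3, counter + 1) if taken else max(0, counter - 1)


def bimodal_accuracy(trace: Trace, table_bits: int = 10) -> float:
    """Predictor indexed by the branch address only."""
    mask = (1 << table_bits) - 1
    table = [1] * (1 << table_bits)           # start weakly not taken
    correct = 0
    for pc, taken in trace:
        idx = pc & mask
        correct += (table[idx] >= 2) == taken
        table[idx] = _update(table[idx], taken)
    return correct / len(trace)


def gshare_accuracy(trace: Trace, table_bits: int = 10, history_bits: int = 4) -> float:
    """Predictor indexed by branch address XOR global branch history."""
    mask = (1 << table_bits) - 1
    history_mask = (1 << history_bits) - 1
    table = [1] * (1 << table_bits)
    history = 0
    correct = 0
    for pc, taken in trace:
        idx = (pc ^ history) & mask
        correct += (table[idx] >= 2) == taken
        table[idx] = _update(table[idx], taken)
        history = ((history << 1) | int(taken)) & history_mask
    return correct / len(trace)


# Synthetic trace: one branch at a made-up address that alternates
# taken / not taken on every execution.
ALTERNATING: Trace = [(0x40, i % 2 == 0) for i in range(1000)]

BIMODAL = bimodal_accuracy(ALTERNATING)
GSHARE = gshare_accuracy(ALTERNATING)
print("bimodal:", BIMODAL, "gshare:", GSHARE)
```

On this trace gshare learns the alternating pattern after a short warm-up because the global history separates the two cases, while the bimodal counter keeps flipping between its two middle states.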

Profile Directed Optimization

Software Optimization


Hot Spot Detection

Detection of program hot spots:
- Programs exhibit temporal locality
- Detect the collection of basic blocks that are frequently executed
- The detection mechanism can be improved by hardware support ([1])
- Focus on the code section representing the hot spot

Modify the hot spot code section aggressively (loop unrolling, instruction fusion, prefetching)
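A software-only sketch of the detection step: count how often each basic block appears in an execution trace and keep the most frequent blocks until they cover a chosen share of execution. The 90% coverage threshold, the function name `hot_blocks` and the block trace are assumptions. The hardware scheme of [1] is not what is modelled here.

```python
from collections import Counter
from typing import List


def hot_blocks(bb_trace: List[str], coverage: float = 0.9) -> List[str]:
    """Return the most frequently executed blocks that together account for
    at least `coverage` of all executed blocks (threshold is an assumption)."""
    counts = Counter(bb_trace)
    total = sum(counts.values())
    hot: List[str] = []
    covered = 0
    for bb, n in counts.most_common():
        if covered >= coverage * total:
            break
        hot.append(bb)
        covered += n
    return hot


# Hypothetical trace: a two-block loop body dominates execution.
BB_TRACE = ["B1"] + ["B2", "B3"] * 45 + ["B4"] * 5 + ["B5"] * 4

HOT = hot_blocks(BB_TRACE)
print(HOT)
```

Only the blocks returned here would then be candidates for the aggressive transformations listed above.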


Profile Guided Optimization in Java

- Java Virtual Machines (JVM) employ an optimizing compiler
- Profile information is collected at the time of program execution ([2])


Hardware Optimization


Microarchitecture Optimization

- One architecture never fits all
- Profiling is the key to characterizing program behavior and adapting the architecture
- Architecture reconfiguration is often done with feedback obtained from the program profile
- Runtime configuration is challenging due to several constraints on dynamic profiles: profile overhead, storing time sensitive data


Trace Cache Design

- Traces are sequences of decoded instructions with operands and data
- A trace cache is an instruction cache that stores sequences of basic blocks as decoded instructions and operands, together with a set of branch predictions
- A hit is registered if all the stored branch predictions hold; instruction decoding is then skipped
- Optimization: traces can be stored selectively, depending on the execution frequency of a given trace


Power Optimization: Adaptive Issue Queue Design

- The issue queue is one of the power-hungry resources
- Portions of the issue queue can be selectively enabled / disabled depending on the available parallelism in the code section
- A profiler can indicate the degree of parallelism in the code section


Summary

- Software and hardware optimizations are primarily guided by profiling information
- Future architecture trend: many simple cores on one chip ⇒ requires a scalable solution to extract thread / process level parallelism from programs
- Profilers will play an important role in defining the model for tuning compiler and architecture
- Hardware support extends the capability of the compiler to generate efficient code
- Autotuner: the traditional compiler may be replaced by a feedback driven autotuner


Appendix

References

[1] M. C. Merten, A. R. Trick, C. N. George, J. C. Gyllenhaal, and W. W. Hwu, "A hardware-driven profiling scheme for identifying program hot spots to support runtime optimization," Proc. of the Intl. Symp. on Computer Architecture, 1999.

[2] M. Arnold, M. Hind, and B. G. Ryder, "Online feedback-directed optimization of Java," in OOPSLA '02: Proceedings of the 17th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2002, pp. 111–129.


THANK YOU

QUESTIONS ?


BACKUP SLIDES
