Parallel Processing in Microprocessors

© 2014 IJIRT | Volume 1 Issue 6 | ISSN: 2349-6002

Arjit Rawat, Paras Nachaal, Arshad Ali Khan
Electrical and Electronics Engineering, Dronacharya College of Engineering, Gurgaon, India

Abstract: Moore's Law states that the number of transistors that can be placed on a microchip at a reasonable price will double approximately every two years. For the last few decades, computational throughput has tracked this growth, and the recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. Our view is that this evolutionary approach to parallel hardware and software may work for 2- to 8-processor systems, but is likely to face diminishing returns as 16- and 32-processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high-performance computing. In short, we cannot make the cores we have much faster in hardware, but we can put more of them on a die: with shrinking process geometries we can place several conventional cores on a single die and get more bang for the buck.

INTRODUCTION:
The new industry buzzword "multicore" captures the plan of doubling the number of standard cores per die with every semiconductor process generation, starting with a single processor. Multicore will obviously help multiprogrammed workloads, which contain a mix of independent sequential tasks, but how will individual tasks become faster? Switching from sequential to modestly parallel computing will make programming much more difficult without rewarding this greater effort with a dramatic improvement in power-performance. Hence, multicore is unlikely to be the ideal answer.

We can no longer simply increase the clock frequency (processor "speed") at the same rate as we have in the past in order to increase performance; power and thermal requirements are beginning to outstrip the benefits that faster clock frequencies offer. Parallel execution in multi-core designs instead allows us to take advantage of greater transistor densities to provide greater performance. Many simple cores can be built within the same area as a small number of large, complex cores. In addition, power consumption can be optimized by using multiple types of cores tuned to match the needs of different usage models, and cores that are not busy can be powered down to reduce power consumption during idle times. These advanced power-saving techniques are enabled by multiple cores working in a coordinated fashion.

(I) Types of parallelism
1) Bit-level parallelism: From the advent of very-large-scale integration (VLSI) computer-chip fabrication technology in the 1970s until about 1986, speed-up in computer architecture was driven by doubling the computer word size, the amount of information the processor can manipulate per cycle. Increasing the word size reduces the number of instructions the processor must execute to perform an operation on variables whose sizes are greater than the length of the word. For example, where an 8-bit processor must add two 16-bit integers, the processor must first add the 8 lower-order bits from each integer using the standard addition instruction, then add the 8 higher-order bits using an add-with-carry instruction and the carry bit from the lower-order addition; thus, an 8-bit processor requires two instructions to complete a single operation, where a 16-bit processor would be able to complete the operation with a single instruction. Historically, 4-bit microprocessors were replaced with 8-bit, then 16-bit, then 32-bit microprocessors. This trend generally came to an end with the introduction of 32-bit processors, which were a standard in general-purpose computing for two decades. Not until recently (c. 2003-2004), with the advent of x86-64 architectures, have 64-bit processors become commonplace.
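To make this concrete, here is a minimal C sketch (our own illustration, not taken from any particular instruction set) that emulates a 16-bit addition using only 8-bit quantities, mirroring the add-then-add-with-carry sequence an 8-bit processor must execute:

#include <stdint.h>
#include <stdio.h>

/* Emulate a 16-bit addition using only 8-bit pieces, the way an 8-bit
 * processor would: add the low bytes first, then add the high bytes
 * together with the carry produced by the low-byte addition. */
static uint16_t add16_via_8bit(uint16_t a, uint16_t b) {
    uint8_t a_lo = a & 0xFF, a_hi = a >> 8;
    uint8_t b_lo = b & 0xFF, b_hi = b >> 8;

    uint16_t lo_sum = (uint16_t)(a_lo + b_lo);        /* first instruction */
    uint8_t  carry  = (uint8_t)(lo_sum >> 8);         /* carry bit         */
    uint8_t  hi     = (uint8_t)(a_hi + b_hi + carry); /* add-with-carry    */

    return (uint16_t)((uint16_t)hi << 8 | (uint8_t)lo_sum);
}

int main(void) {
    printf("%u\n", (unsigned)add16_via_8bit(300, 500)); /* prints 800 */
    return 0;
}

A 16-bit (or wider) processor performs the same addition in a single instruction, which is exactly the speed-up that growing word sizes delivered.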

2) Instruction-level parallelism: A computer program is, in essence, a stream of instructions executed by a processor. These instructions can be re-ordered and combined into groups which are then executed in parallel without changing the result of the program. This is known as instruction-level parallelism. Advances in instruction-level parallelism dominated computer architecture from the mid-1980s until the mid-1990s. Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a different action the processor performs on the instruction in that stage; a processor with an N-stage pipeline can have up to N different instructions at different stages of completion. The canonical example of a pipelined processor is a RISC processor, with five stages: instruction fetch, decode, execute, memory access, and write back. (The Pentium 4 processor, by contrast, had a 35-stage pipeline.) In addition to instruction-level parallelism from pipelining, some processors can issue more than one instruction at a time; these are known as superscalar processors. Instructions can be grouped together only if there is no data dependency between them. Scoreboarding and the Tomasulo algorithm (which is similar to scoreboarding but makes use of register renaming) are two of the most common techniques for implementing out-of-order execution and instruction-level parallelism.
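Instruction-level parallelism is extracted by the hardware, but source code determines how much of it is available. The C sketch below (our own illustration) contrasts a serial dependency chain with four independent partial sums that a pipelined, superscalar core can keep in flight at once; note that reassociating floating-point additions this way can change results in the last few bits:

/* sum_chain: every addition depends on the previous one, so the adds
 * must complete essentially one after another.
 * sum_ilp: the four partial sums are independent, so several additions
 * can be issued and executed in parallel on a superscalar core. */
double sum_chain(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];                      /* serial dependency chain */
    return s;
}

double sum_ilp(const double *x, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i];                     /* these four additions have no  */
        s1 += x[i + 1];                 /* mutual data dependencies, so  */
        s2 += x[i + 2];                 /* the hardware may overlap them */
        s3 += x[i + 3];
    }
    for (; i < n; i++)
        s0 += x[i];                     /* remaining elements */
    return (s0 + s1) + (s2 + s3);
}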


3) Task parallelism: Task parallelism is the characteristic of a parallel program that "entirely different calculations can be performed on either the same or different sets of data". This contrasts with data parallelism, where the same calculation is performed on the same or different sets of data. Task parallelism does not usually scale with the size of a problem.
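As a small illustration (the two task functions are our own toy stand-ins), the following C program performs two entirely different calculations, a sum and a maximum, over the same data set in separate POSIX threads; build with cc -pthread:

#include <pthread.h>
#include <stdio.h>

static double data[] = {3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0};
enum { N = sizeof data / sizeof data[0] };

static void *sum_task(void *out) {       /* task 1: sum the data */
    double s = 0.0;
    for (int i = 0; i < N; i++) s += data[i];
    *(double *)out = s;
    return NULL;
}

static void *max_task(void *out) {       /* task 2: find the maximum */
    double m = data[0];
    for (int i = 1; i < N; i++) if (data[i] > m) m = data[i];
    *(double *)out = m;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    double sum, max;
    pthread_create(&t1, NULL, sum_task, &sum);  /* different calculations */
    pthread_create(&t2, NULL, max_task, &max);  /* run concurrently       */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("sum = %g, max = %g\n", sum, max);
    return 0;
}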

B. Implications on Software
While the hardware continues to advance and the availability of implicit parallelism gets consumed by optimizations, what is the impact on software? Runtime libraries may provide some of the answer: existing sequential programs that call into libraries can be partially parallelized by improving the runtimes themselves. The remainder will have to be taken up by new programming language paradigms.

1) Operating System: As multiple-CPU architectures become more commonplace, the operating system scheduler must become more intelligent in how it schedules threads and processes onto the available CPUs. In addition, the CPUs in a system may have an unequal relationship with each other. For example, there could be a CPU topology where the bottom layer consists of a single Simultaneous Multithreading (SMT) physical processor which exposes two logical processors. At the next level, there could be two SMT cores grouped into a Symmetric Multiprocessing (SMP) domain, where each SMT core has equal access time to local memory. At the highest level, there could be two SMPs grouped together which make up a Non-Uniform Memory Access (NUMA) domain (see Figure 1). Moving threads or process load within an SMT physical processor is cheap, because both logical processors within the physical processor share the same memory, cache, and execution units. However, moving threads or processes from one NUMA node to another is expensive, since the memory access time is longer for remote memory.

Figure 1: Example CPU topology with SMT, SMP, and NUMA scheduling domains.

Linux handles balancing the load across all of the CPUs in an efficient manner by defining scheduling domains. In the example above, three domains would be defined: SMT, SMP, and NUMA. These domains contain the policy for how scheduling decisions are made. Within the SMT domain, balancing attempts occur often, even when the imbalance in load is small. For example, if a sleeping thread is awakened, normally the thread would stay on the same processor, since its data is likely to be cached there; however, if another processor shares the same cache and is idle, it is fine to move the thread to it. Within the NUMA domain, balancing attempts are made very rarely, since the cost of moving a process between nodes is very high; most of the time, a process will only be scheduled to another NUMA node when it is created. In addition, there is the option to further tune the system through the use of processor affinity, which can be used to specify an ideal processor on which to run a particular process.
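On Linux, processor affinity is exposed through the sched_setaffinity(2) system call. A minimal sketch (pinning to CPU 0 is an arbitrary choice for illustration):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                   /* allow this process on CPU 0 only */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof set, &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("now running on CPU %d\n", sched_getcpu());
    return 0;
}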

2) Emerging Programming Models and Runtimes: Highly parallel machines are programmed differently than classical von Neumann computers. The notion of a simple, linear program flow mutating the system from one state to another is no longer sufficient. Rather, parallel applications exhibit markedly different characteristics than conventional software: they are typically event- or I/O-driven, highly asynchronous, and expressed in terms of small units of work that can be intelligently scheduled according to the resources available on a given system at a given time.



Event- and I/O-driven programming models are nothing new. Graphical user interfaces and commodity web servers provide classic examples of each, and neither need be developed using emerging parallel methodologies. But these models can be used in much more powerful ways on emerging, massively parallel hardware. Program components can be engineered as aggregating functions over multiple concurrent input sources, and the software can intelligently arbitrate between these inputs to efficiently process data that arrives at roughly the same time. Work can be cancelled if a concurrent computation determines that it is no longer necessary and, conversely, it can be performed eagerly in anticipation of speeding up other concurrent tasks. These principles are demonstrated well in the Microsoft Robotics Concurrency and Coordination Runtime, which is designed to arbitrate concurrent streaming inputs from multiple ports.

3) Computer Language Implications: The growth of parallel processing hardware has made it necessary for programmers to write code that exploits the available parallelism. While instruction-level parallelism provides some implicit benefit, and compilers and runtimes can solve some problems without explicit participation of the end programmer, this has only limited benefit: compilers are not able to deduce the intention of a program and rewrite it in a new way. Major programming languages today have mechanisms for using the threads paradigm to explicitly parallelize applications. Starting and stopping threads, and synchronizing them with mutexes, locks, and critical sections, is the job of the coder, with little help from the language itself or its runtime libraries.
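The C sketch below shows this explicit thread-and-lock style with POSIX threads (the shared counter is our own toy workload); every lock acquisition and release is the coder's responsibility, and forgetting either one produces races or deadlocks:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);      /* enter critical section */
        counter++;                      /* shared state mutation  */
        pthread_mutex_unlock(&lock);    /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter); /* 400000, thanks to the mutex */
    return 0;
}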

There are dozens of new programming languages today for writing parallel programs that are based on languages with significant code bases. These are "sequential-like" languages, with added keywords and built-in mechanisms that allow compiler- and runtime-level support for parallelization. Another approach to creating parallel programs is to use functional programming languages. Functional languages were primarily developed for modeling mathematics, and they define computation in terms of mathematical functions. A happy side effect of this design is that, since functions are stateless and have no side effects, any evaluation can be performed as soon as its inputs are known, in any order, and hence in parallel. Analogous to these languages is Google's MapReduce, an implicitly parallelizable paradigm for creating programs that is quite unlike sequential programming.
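A hedged sketch of the same map-then-reduce structure in C with OpenMP (this is not Google's MapReduce itself; square() is our stand-in for a pure, side-effect-free function). Because the map step has no ordering constraints and the reduction is associative, the runtime can parallelize the loop safely; build with cc -fopenmp:

#include <stdio.h>

static double square(double x) { return x * x; }   /* pure: no side effects */

int main(void) {
    enum { N = 1000000 };
    static double in[N], out[N];
    for (int i = 0; i < N; i++) in[i] = i * 0.001;

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)  /* map and reduce in parallel */
    for (int i = 0; i < N; i++) {
        out[i] = square(in[i]);   /* map: independent for every element */
        sum += out[i];            /* reduce: per-thread partial sums    */
    }
    printf("sum of squares = %g\n", sum);
    return 0;
}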


4) New Technologies to Support Parallel Programming: To simplify software development, Intel is researching new features for tera-scale devices. For example, transactional memory simplifies parallel programming by reducing the need for software developers to manage explicit locks. A key focus of Intel's research is finding other hardware features that will simplify parallel programming. One of the features of tera-scale architectures will be dedicated partitions that appear as devices to regular software. These partitions will provide functions such as system management and I/O acceleration (such as network protocol processing). The architectures may also include hardware support for lightweight message passing, a distributed-computation model that is familiar to HPC software developers. This type of streaming has proven to be an effective programming model for workloads such as graphics and media processing, and tera-scale architectures will exploit that familiarity by providing hardware support: for example, the desired cache behavior, as well as coprocessor instruction access to fixed-function media acceleration. In addition, as with the MMX extensions for multimedia applications, there will be opportunities to extend the ISA (Instruction Set Architecture) to better support emerging RMS (recognition, mining, and synthesis) workloads.
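The hardware transactional memory described here is not reachable from portable C, but GCC's experimental software transactional memory gives the flavor (the bank-balance variables are our own toy example; build with gcc -fgnu-tm). The atomic block replaces an explicit lock:

#include <stdio.h>

static long balance_a = 100, balance_b = 0;

void transfer(long amount) {
    __transaction_atomic {          /* executes atomically with respect to */
        balance_a -= amount;        /* other transactions; no lock for the */
        balance_b += amount;        /* programmer to acquire or release    */
    }
}

int main(void) {
    transfer(25);
    printf("a = %ld, b = %ld\n", balance_a, balance_b);
    return 0;
}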


Conclusion: The triple whammy of the power, memory, and instruction-level parallelism walls has forced microprocessor manufacturers to bet their futures on parallel microprocessors. This is no sure thing, as parallel software has an uneven track record. From a research perspective, however, this is an exciting opportunity. Virtually any change can be justified, new programming languages, new instruction set architectures, new interconnection protocols, and so on, if it can deliver on the goal of making it easy to write programs that execute efficiently on manycore computing systems. Three questions still arise.

Regarding multicore versus manycore: We believe that manycore is the future of computing. Furthermore, it is unwise to presume that multicore architectures and programming models suitable for 2 to 32 processors can incrementally evolve to serve manycore systems of thousands of processors.

Regarding the application tower: We believe a promising approach is to use the 13 Dwarfs as stand-ins for future parallel applications, since applications are rapidly changing and we need to investigate parallel programming models as well as architectures.

Regarding the hardware tower: We advise using simple processors, innovating in memory as well as in processor design, and considering separate latency-oriented and bandwidth-oriented networks. Since the point-to-point communication patterns are very sparse, a hybrid interconnect design that uses circuit switches to tailor the interconnect topology to application requirements could be more area- and power-efficient than a full crossbar and more computationally efficient than a static mesh topology. Traditional cache coherence is unlikely to be sufficient to coordinate the activities of thousands of cores, so we recommend richer hardware support for fine-grained synchronization and communication constructs. Finally, do not include features that significantly affect performance or energy unless you also provide counters that let programmers accurately measure their impact.

The most important problem for future multicore computing is split between the software and the hardware that must perform optimally for that software. In our opinion this is primarily a software-dominated problem: you can throw all the cores you want at a problem, but until the compilers and tools exist to use them, they may as well not be there.

References
1) M. Adiletta, M. Rosenbluth, D. Bernstein, G. Wolrich, and H. Wilkinson, "The Next Generation of the Intel IXP Network Processors," Intel Technology Journal, vol. 6, no. 3, pp. 6-18, Aug. 15, 2002.
2) Arvind, K. Asanovic, D. Chiou, J.C. Hoe, C. Kozyrakis, S. Lu, M. Oskin, D. Patterson, J. Rabaey, and J. Wawrzynek, "RAMP: Research Accelerator for Multiple Processors - A Community Vision for a Shared Experimental Parallel HW/SW Platform," U.C. Berkeley technical report, UCB/CSD-05-1412, 2005.


3) Xiao-Feng Li, Zhao-Hui Du, Chen Yang, Chu-Cheow Lim, and Tin-Fook Ngai, "Speculative Parallel Threading Architecture and Compilation," pp. 285-294, International Conference Workshops on Parallel Processing (ICPP 2005), 2005.
4) K. Asanovic et al., "The Landscape of Parallel Computing Research: A View from Berkeley," U.C. Berkeley technical report, UCB/EECS-2006-183, 2006.
5) G. Moore, "Cramming More Components onto Integrated Circuits," Electronics Magazine, 19 April 1965.
6) "Parallel Computing," Wikipedia, http://en.wikipedia.org/wiki/Parallel_computing
