A Multiprocessor Implementation for the GSM Algorithm


Author: Jerome Singleton

A Multiprocessor Implementation for the GSM Algorithm by

Jennifer C. Kleiman Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degrees of Bachelor of Science in Electrical Engineering and Computer Science and Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology May 21, 1999 © Copyright Jennifer C. Kleiman 1999. All rights reserved. The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, May 21, 1999. Certified by: Dr. Christopher J. Terman, Thesis Supervisor. Accepted by: Prof. Arthur C. Smith, Chairman, Department Committee on Graduate Theses


A Multiprocessor Implementation for the GSM Algorithm by Jennifer C. Kleiman Submitted to the Department of Electrical Engineering and Computer Science May 21, 1999 In Partial Fulfillment of the Requirements for the Degrees of Bachelor of Science in Electrical Engineering and Computer Science and Master of Engineering in Electrical Engineering and Computer Science

ABSTRACT Telecommunications or simply communications play an important role in the computer industry. At the core of this industry lies the digital signal processor. Moreover, many communication technologies rely principally on signal processing. At both the system and component level, redundancy is present in most of these applications. Therefore, an opportunity exists to optimize these technologies using parallel processing. Specifically, the DSPs in these applications may be designed in a parallel configuration to achieve higher performance and a reduction of dedicated hardware. GSM, a mobile communications system, displays redundancy at core processing nodes in its network as well as in its fundamental speech processing algorithm, thereby making it an optimal choice for this implementation. This thesis describes the design methodology for this implementation and evaluates several different configurations. As a result, a new multiprocessor is proposed.

Thesis Supervisor: Christopher J. Terman Title: Senior Lecturer, Dept. of Electrical Engineering and Computer Science


Acknowledgments First, I would like to thank God for giving me the patience, endurance, and motivation to complete this thesis. I would like to thank my advisor, Chris Terman, for providing a wealth of knowledge and assistance throughout my thesis endeavor. I am thankful for the opportunity to have worked with him. To my parents, I would like to give my deepest gratitude and honor. They have been a constant source of love, support and encouragement since the moment I came to MIT. I thank them for the sacrifices they made in order to provide me with an excellent education. "Keep Going"


Contents

1 Introduction
1.1 Background
1.2 Digital Signal Processors
1.3 Network Architectures

2 Technologies of Interest
2.1 ADSL
2.2 MPEG
2.3 GSM

3 The Bulk DSP Architecture
3.1 Parallel Computing
3.2 Processing Elements
3.3 Memory Hierarchy
3.4 GSM Implementation
3.5 Encoding and Decoding

4 Scheduling
4.1 Design Methodology
4.2 GSM's Computational Modules
4.3 Evaluation of Architectures

5 Conclusion
5.1 Summary
5.2 Future Work
5.3 Final Thoughts

A Software Tools Used

B GSM Resources on the Web

List of Figures

2-1 ADSL Network Connection
2-2 MPEG Encoding Algorithm
2-3 GSM Network Architecture
3-1 GSM Encoding Algorithm
4-1 GSM Decoding Algorithm
4-2 LongTermSynthesisFiltering Module
4-3 Preliminary Building Block of Bulk DSP
4-4 Segment of Schedule for Proposed Architecture
4-5 Final Bulk DSP Architecture

List of Tables

4.1 Computational Module Parameters
4.2 Comparison of Results with Best Architecture
4.3 Final Results

Chapter 1

Introduction

This thesis identifies the utility of a parallel multiprocessor for use in communication technologies. In order to demonstrate this, several applications are explored, and the basic architecture of the new parallel multiprocessor is presented. A specific technology is then chosen as a vehicle to implement this multiprocessor, and the design process is described. In addition, several architectures are analyzed based on experiments done utilizing a set of software modules, and the performance results are presented. Finally, some conclusions are drawn for the final implementation.

1.1 Background

Virtually everyone today uses some type of telephone service. Whether it is a 'plain old telephone', a wireless cellular phone, or a modem in a computer, people depend heavily on telephone systems as their primary means of communication. By the year 2001, it is estimated that there will be 1 billion phone lines worldwide and 580 million cell phone subscribers [8].

Communication, however, does not just encompass spoken conversation between people. Technological advances have enabled communications or telecommunications to also include transmitting sound, video, and digital data across telephone lines, radio frequencies, and cable lines. Because these types of communication have expanded so rapidly, there are now telephone systems in almost every area of the world. Even so, there still remains a large demand for global connectivity; the most prominent example of this today is the Internet. This demand drives telecommunication technologies to provide an infrastructure, including the hardware, software, and network topology, that further connects people around the world while delivering as much information as possible. In addition, these communication systems need to service millions of customers simultaneously and efficiently.

1.2 Digital Signal Processors

Communication technology relies on specialized hardware to perform digital signal processing computations. In general, the computations performed are part or all of a signal processing algorithm. Specifically, the numerous computations include receiving, decoding, directing, encoding, and transmitting data. All of these operations are done digitally by the DSPs. The benefits that communication applications reap from operating in the digital realm include reliability in the transmission process as well as compression in data size, which makes transmission faster and easier. These procedures are carried out on small portions of the data, known as frames, on each of the subscriber lines in order to achieve "seamless" point-to-point communication. The hardware components that perform these functions are known as digital signal processors (DSPs). These processors have been designed to handle complex arithmetic computations precisely for speech and data processing applications.

Digital signal processing applications tend to be repetitive in nature. Many signal processing functions rely on the execution of the same computation on each byte of data in order to complete the desired operation. In fact, many of these algorithms consist of a relatively small set of instructions that are executed over and over again in loops. Because of this inherent repetition, the overall signal processing algorithm can be broken down into smaller computing modules, which are then applied repeatedly throughout the algorithm.
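The repetition described above can be illustrated with a minimal Python sketch. The frame length, the function names, and the choice of a moving-average filter as the computing module are illustrative assumptions, not details taken from this thesis; the only point is that one small module is applied unchanged to every frame of the stream.

```python
def split_into_frames(samples, frame_len):
    """Split a sample stream into consecutive full frames (leftover samples are dropped)."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def moving_average(frame, taps=4):
    """Hypothetical computing module: an equal-weight FIR filter over one frame."""
    out = []
    for n in range(len(frame)):
        # Samples before the start of the frame are treated as zero.
        window = frame[max(0, n - taps + 1):n + 1]
        out.append(sum(window) / taps)
    return out


def process_stream(samples, frame_len, module):
    """Apply the same module to every frame, as a DSP loop would."""
    return [module(frame) for frame in split_into_frames(samples, frame_len)]
```

For example, `process_stream(list(range(8)), 4, moving_average)` filters two frames of four samples each with the same routine.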

1.3 Network Architectures

The network topologies of these communication systems vary somewhat depending on the particular type of data they transmit and the communication medium used. Although the current trend is to build networks that can transmit all types of data, there still exist several methods by which to transmit this data (e.g., physical lines vs. wireless). In general, the basic foundation of these networks is a backbone network that connects everything in the network. There are also several central locations in the network, which bring together all the separate "channels" or subscriber lines in order to process the data. Usually, these channels carry multiple simultaneous calls, and thus these central locations receive and transmit hundreds of channels at any given time. Moreover, the central office or location is equipped with many computers whose only task is to process these incoming and outgoing channels of communication. Because all the channels come through a central office, the majority or bulk of the processing in the network occurs at these central locations. The other major components of the network infrastructure handle other operations such as transmitting and receiving the channels. Since the central offices control much of the computation in a communication system, they can be considered the "core processing nodes" of the network. The latency of these core processing nodes, however, is much larger than anywhere else in the network, and as a result much of the total computation time is spent here. Therefore, the efficiency of these nodes plays a significant role in the overall network performance.

As stated, much of the processing in many communication networks occurs in central locations. Present-day infrastructures tend to use a single DSP for a small number of the channels routed through these points. Since there are usually many channels transmitted through these locations, there are many DSPs located there. Ideally, a single DSP chip would process a large number of the channels at these locations, which can vary anywhere from 100 to 1000 depending on the technology. This would immensely reduce the number of processors in the network. Logically, the next step towards achieving such an improvement is to design a DSP or multiprocessor that could process many more than just a few channels. By taking advantage of the instruction level parallelism (ILP) among the channels, a parallel computation architecture such as a SIMD (Single Instruction stream, Multiple Data stream) vector processor might be used to design a more efficient implementation. SIMD refers to an architecture in which the same instruction is executed by multiple processors on different data streams. In particular, for communication applications, a multiprocessor or 'Bulk DSP' might be designed to contain many simple processors, such that each processor executes the same instruction or set of instructions in parallel. This design would enable the Bulk DSP to control and process many channels. The reduction in hardware would result in significant cost savings for the given infrastructure. In addition, if successful, the Bulk DSP would exceed the performance of existing communication networks, thereby meeting the growing demands of communication technologies.
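The SIMD idea can be sketched in a few lines of Python. This is a software model only, written under stated assumptions: the function names are hypothetical, each "lane" stands for one channel's data, and the figure of 4 channels per conventional DSP used in the usage note is an illustrative guess rather than a number from this thesis.

```python
def simd_step(instruction, lanes):
    """Execute one instruction on every lane; each lane holds one channel's value."""
    return [instruction(value) for value in lanes]


def run_lockstep(program, channel_data):
    """Run a whole program in lockstep: every lane sees the same instruction sequence."""
    lanes = list(channel_data)
    for instruction in program:
        lanes = simd_step(instruction, lanes)
    return lanes


def processors_needed(channels, channels_per_processor):
    """Number of processors required to cover all channels (ceiling division)."""
    return -(-channels // channels_per_processor)
```

As a usage example, `processors_needed(1000, 4)` gives 250 devices for a hypothetical DSP that handles 4 channels, while `processors_needed(1000, 1000)` gives a single device if one multiprocessor could serve all 1000 channels at a central location.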


Chapter 2

Technologies of Interest

Of the many communication and digital signal processing applications, the three that will be considered are ADSL (Asymmetric Digital Subscriber Line), MPEG (Moving Picture Experts Group video compression), and GSM (Global System for Mobile Communications). These technologies were chosen based on their significance to the communications world as well as the similarities in their computation structure. They each embody several key characteristics, which demonstrate the utility of a Bulk DSP as their core processor.

2.1 ADSL

Asymmetric Digital Subscriber Line refers to a communication technology implemented on the copper telephone lines found in all homes and businesses. ADSL is an enhanced version of the basic phone line that provides much more data bandwidth to the subscriber. It transfers voice, data, and video at significantly higher data rates. The main attraction that drives this technology is a faster Internet connection. A salient detail of this technology lies in the data rate transmitted. Much more bandwidth is sent downstream (as from the Internet to one's computer) than in the reverse direction; thus the name Asymmetric DSL. In fact, ADSL communication can theoretically reach data rates of 8Mbps (megabits per second) downstream and 1Mbps upstream. ADSL communication exists on the same twisted-pair wire as the telephone line and transmits simultaneously with existing phone services without interruption or interference. Because of this, no new phone lines need to be installed to implement ADSL, making it an attractive choice of communication. Once companies start deploying ADSL, customers need only acquire an ADSL modem for their computers and subscribe to ADSL service in order to start using it.

The ADSL network consists of the central office that contains the ADSL Modem Rack, the phone line connection, a POTS (Plain Old Telephone System) splitter, and the user-end ADSL modems. Figure 2-1 shows the details of the core network and how it interfaces with the actual phone lines and other types of networks [2]. As shown by the diagram, the central office in the network manages all the communication via phone lines to and from personal computers and corporate networks. When a call comes into the central office, it is first passed through a POTS splitter, which "splits" the call into voice and data signals and then directs them to the appropriate device. A voice call will proceed to the Public Switched Telephone Network while a data call will go to the ADSL Modem Rack. The Modem Rack consists of many line cards or ATU-Cs (ADSL Transceiver Unit - Central Office) and is the key device of interest since it uses digital signal processors. The ATU-C receives data from the access module and converts the data into analog signals. It also receives and decodes data from customers sent by the ATU-R (remote or end-user modem). Presently, each ATU-C can accommodate up to 3 ADSL circuits, which means it can serve up to 3 individual phone lines. These ADSL circuits are implemented with several integrated circuits such as a core DSP to perform Discrete Multi-Tone (DMT) technology functions, a line driver/receiver, a general purpose DSP, and an ASIC to perform all the analog and mixed-signal operations as well as the modem configuration software.

Figure 2-1: ADSL Network Connection

The ADSL technology rests largely on its transmission methodology. As mentioned, ADSL transmits far more information downstream than it does upstream; upstream and downstream refer to different "channels" which transmit information at different frequencies. These channels are created by Discrete Multi-Tone (DMT). This technique falls under the category of Frequency Division Multiplexing (FDM) and is a multi-carrier modulation technology. Basically, it takes a band of frequencies (input data) and divides it into separate "channels" so that the channels have the same bandwidth but different center frequencies. This allows the channels to be coded individually and independently from each other. The DMT transmitter relies on the efficiency of the IFFT (Inverse Fast Fourier Transform) to create these channels, while the receiver uses the FFT (Fast Fourier Transform) to perform the complementary operation. This transform pair represents a key digital signal processing technique used by the ADSL technology. DMT uses the band from 26kHz to 134kHz for the upstream channel and 138kHz to 1.1MHz for downstream. DMT reserves a 4kHz band (0 to 4kHz) for POTS to accommodate the ordinary phone line on the same copper wire [14].

ADSL modems also implement error correction algorithms in order to reduce errors that occur on a network line, such as impulse noise or continuous noise coupled into a line. These operations, performed by DSPs, combine the channels into blocks and use error correction codes on each block of data. This method allows for effective and accurate transmission of both data and video signals on the wire.

The ADSL network implements a high-speed transmission technology on normal copper phone lines. Because it uses the phone lines, it does not require much equipment from the customer and is easy and inexpensive to use. In addition, it meets or exceeds customer requirements with respect to Internet access. Examination of the ADSL network reveals the importance of DSPs to the basic functionality of this technology. In fact, DSPs coordinate and perform the main computations in the transmission technology. These DSPs are found on the ATU-C located at the central office in the ADSL network. Because the ATU-C contains up to 3 ADSL circuits, and in turn each ADSL circuit has at least 2 DSPs, a fully populated ATU-C carries at least 6 DSPs. This corresponds to 2 DSPs for each phone line. Presently 560 million copper phone lines exist worldwide. Therefore, if each of these phone lines subscribed to ADSL, over a billion DSPs would be needed in this network alone!
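The figures quoted in this section can be checked with a short Python sketch. All constants come from the text above; the function and constant names are hypothetical, and the band widths are simple differences of the quoted band edges, not values stated in the thesis.

```python
CIRCUITS_PER_ATU_C = 3            # each ATU-C accommodates up to 3 ADSL circuits
MIN_DSPS_PER_CIRCUIT = 2          # a core DMT DSP plus a general purpose DSP
COPPER_LINES_WORLDWIDE = 560_000_000

UPSTREAM_KHZ = (26, 134)          # DMT upstream band edges
DOWNSTREAM_KHZ = (138, 1100)      # DMT downstream band edges (1.1MHz = 1100kHz)


def dsps_per_atu_c(circuits=CIRCUITS_PER_ATU_C, dsps_per_circuit=MIN_DSPS_PER_CIRCUIT):
    """DSP count on a fully populated line card."""
    return circuits * dsps_per_circuit


def total_dsps(lines, dsps_per_line=MIN_DSPS_PER_CIRCUIT):
    """DSPs needed if every line is served by its own ADSL circuit."""
    return lines * dsps_per_line


def band_width_khz(band):
    """Width of a (low, high) band given in kHz."""
    low, high = band
    return high - low
```

With these values, `dsps_per_atu_c()` is 6 and `total_dsps(COPPER_LINES_WORLDWIDE)` is 1,120,000,000, which matches the "over a billion" estimate. The downstream band is also roughly nine times wider than the upstream band, consistent with the asymmetric data rates.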

2.2 MPEG

Although this next application is not the same type of communication technology as the other two studied, it exhibits some of the same characteristics, such as the need for many repetitive DSP computations. This is the MPEG (Moving Picture Experts Group) standard, which describes a compression technology for video. MPEG compresses video data into a smaller format so that more information can fit on a storage disk or more data can be transferred across a network. The compression ratio achieved with MPEG ranges from 30:1 to 8:1, depending on the complexity of the video. One of the most popular applications today that employs MPEG compression is DVD (Digital Versatile Disk). This technology stores video information on disks, much as VHS video tapes do, and is played on special DVD players (like VCRs). Each disk can hold up to 17 Gigabytes of information. That's a lot of data! A second, very popular application that employs MPEG compression is video conferencing. This application relies heavily on the transmission of data across different types of networks. Thus, using the MPEG algorithm to compress data enables video conferencing applications to achieve real-time point-to-point communication. With MPEG compression, other technologies such as High Definition Television (HDTV) can be transmitted at 24 frames per second, while movies and live broadcasts are transmitted at 30 frames per second, in order to produce high-resolution pictures.

An update to the standard, MPEG-2, adds the functionality to transmit high-quality broadcast video. The main difference between the standards is the data rate at which they can transmit video sequences. The MPEG-1 standard targets a data rate of 1.5Mbps, which can be carried over most transmission links that support the MPEG format, namely the Internet, cable networks, and ADSL networks; MPEG-2 transmits at a data rate of 4-8Mbps. MPEG-2 supports a broader range of applications, including digital TV and coding of interlaced video, while retaining all of the MPEG-1 syntax and functionality.

The MPEG compression algorithm depends largely on motion compensation and estimation. A block diagram of the algorithm is shown in Figure 2-2 [2]. It first takes low resolution video and converts the images to YUV space. In this domain, the U and V (color) components can be compressed to a greater degree than the Y component without affecting the picture quality. Video pictures characteristically do not contain much movement, and in many cases the movement can be predicted if done in an intelligent manner. MPEG compression performs this prediction with estimation and interpolation algorithms. Specifically, these techniques perform inter-frame coding, which means motion is predicted from frame to frame in the temporal direction.

The MPEG video stream consists of three types of frames. These frames are defined based on whether their spatial or temporal redundancy is eliminated. They are also grouped together to form GOPs (Groups of Pictures) or the MPEG bit stream. The I (Intra-coded) frames are coded by eliminating spatial redundancy using a technique derived from JPEG compression and serve as a reference point for the sequence. I frames originate as raw images and are then split into three 8x8 blocks of pixels (one block for luminance and the other two for chrominance). These blocks then pass through a Discrete Cosine Transform (DCT), are quantized, and finally proceed through an entropy encoder, which transforms the images into an MPEG bit stream (see Figure 2-2). The second type of frame is the P (Predictive-coded) frame, which is coded using motion estimation and depends on the preceding I or P frame. In addition to the motion estimation and compensation operations, P frames require a DCT computation as well. The DCTs performed on I and P frames serve to eliminate the spatial redundancy found in these frames. Finally, the B (Bidirectionally predictive-coded) frames are predicted based on the two closest P or I frames and are the smallest frames in the sequence. Although this type of coding exploits similarities with future images and reduces temporal redundancy, it still introduces a large delay in the overall algorithm [5].

Figure 2-2: MPEG Encoding Algorithm


Because images are not compressed as a single frame, an MPEG bit stream usually consists of thousands of blocks, which together represent a single image. In essence, these blocks are just smaller images and are encoded as described above using the MPEG algorithm. Consequently, compressing an image is a highly repetitive process, since the same operations are executed on each block. In addition, these computations are independent of each other and require digital signal processing. Specifically, DSPs are used to compute DCTs as well as the other signal processing operations required by the MPEG algorithm. A Bulk DSP implementation for this algorithm could reduce the overall compression time by performing many of these operations in parallel.
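The per-block repetition can be made concrete with a minimal Python sketch of a two-dimensional DCT on 8x8 blocks. It is an illustration under stated assumptions: it uses the orthonormal DCT-II computed directly from its definition rather than any fast algorithm, the function names are hypothetical, and it is not the implementation evaluated in this thesis.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a sequence, computed directly from the definition."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        total = sum(values[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i in range(n))
        out.append(scale * total)
    return out


def dct_2d(block):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def transform_blocks(blocks):
    """Apply the same DCT to every block; the blocks are independent of one another."""
    return [dct_2d(block) for block in blocks]
```

Because `transform_blocks` never passes information between blocks, each call to `dct_2d` could be assigned to a separate processing element, which is the property a Bulk DSP would exploit.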

2.3 GSM

Mobile communications technology has undergone a major change in the last several years. The mobile or cellular world has moved from the analog to the digital domain. Previously, cellular phones used a strictly analog protocol to transmit signals. However, with the increased number of cellular users, the push for faster data rates, and the need for better service, cellular technology has moved into the digital realm. GSM (Global System for Mobile Communications), a digital cellular radio network, relies on digital cellular technology. It has been widely used in Europe for several years and is gaining popularity in the US. GSM implements Personal Communication System (PCS), which delivers more than just a wireless phone service. PCS incorporates the transfer of calls, voice mail, and other data transfers anywhere, anytime. In fact, each GSM phone has a personal identifier, which is unique to the phone and identifies it on the GSM network from any location. PCS also includes the ability to connect a GSM phone to a laptop or other computer in order to send and receive faxes and email or to connect to the Internet. GSM has stepped to the forefront in mobile communications and provides its services in over 200 countries worldwide [9].

The GSM network architecture consists of three main functional entities that interface with each other to provide end-to-end communication. The subsystems are the Base Station Subsystem, the Network and Switching Subsystem, and the Operation and Management Subsystem (see Figure 2-3).

Figure 2-3: GSM Network Architecture

GSM subscribers connect to the GSM network via a radio link from their phone (the Mobile Station) to the Base Station Subsystem (BSS). The BSS is composed of multiple Base Transceiver Stations (BTS) and a Base Station Controller (BSC). The BTS includes all the transmission and reception equipment, such as the antennas and transceivers, needed to conduct the radio protocols and signal processing over the radio link. The BSC controls the set of BTSs in its service area and manages radio-channel setup, frequency administration, and handovers for each call. In addition, the multiplexing of speech data is performed by the Transcoder Rate Adaptation Unit (TRAU), which is located at the BTS, the BSC, or the MSC (Mobile Service Switching Center) depending on the configuration. The BSS interfaces with the Network and Switching Subsystem; specifically, the Base Station Controller connects to the main component of the Network and Switching Subsystem, the MSC. The MSC manages communication with other fixed telecommunication networks such as ISDN (Integrated Services Digital Network) and PSTN (Public Switched Telephone Network), and it also performs the paging, resource allocation, location registration, authentication, and encryption functions required to service a mobile subscriber. Finally, all the equipment in the BSC and the Switching System connects to the Operational and Maintenance Center (OMC), which handles the operation and maintenance of GSM equipment and support of the operator network interface. The OMC performs mostly administrative functions such as billing within a region. Depending on the size of the network, there may be only one OMC in a country, in which case the OMC is responsible for the network administration of the entire country [13].

GSM's technology allocates a range of frequencies to a GSM system and divides that band of frequencies into individual simultaneous data channels. Each GSM system has a bandwidth of 25MHz that allows for 124 carriers with a bandwidth of 200kHz each. There are 8 users per carrier and, as a result, approximately 1000 total speech or data channels. The maximum speech rate on the channel (known as full-rate speech) is 13Kbps (kilobits/sec) and the maximum data rate is 9.6Kbps.

GSM's main purpose is to transmit information (either speech or data) reliably in wireless form from one location to another. The following explanation describes full-rate speech transmission in order to highlight the main details of GSM. The process begins when the mobile station (the GSM phone) receives an audio signal (speech) through a microphone. This signal must first be converted from analog to digital before processing begins. The signal is first filtered so that it only contains frequency components below 4kHz. This frequency characterizes baseband voice signals and is the minimum bandwidth necessary to accurately recognize a voice. Once filtered, the signal is sampled at a rate of 8000 samples per second (8kHz), which corresponds to the minimum sampling frequency needed in order not to lose any information. As the signal is sampled, it is quantized into 13-bit words. Thus, the output of this analog to digital converter is a bit stream of 104Kbps (13 x 8000), which then becomes the input to the GSM speech codec. The speech codec's job is to reduce this data rate to a size more appropriate for radio transmission. In essence, it removes all the redundant information in the data stream. The codec uses the Linear Predictive Coding (LPC) and Regular Pulse Excitation (RPE) algorithms to perform this function and executes at a bit rate of 6.5Kbps. GSM's codec collects segments of the data stream every 20ms and produces speech frames of 260 bits every 20ms. This corresponds to a speech rate of 13Kbps. From there, the data is transmitted via the radio link to the Base Transceiver Station. The next step in the process occurs at the Base Station, where the BTS receives the signal and proceeds to extract the signal and recover the modulation.
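The rates quoted above follow from a few multiplications, which the following Python sketch reproduces. The constants are taken from the text; the names are hypothetical, and integer arithmetic in milliseconds is used so the results are exact.

```python
SAMPLE_RATE_HZ = 8000        # sampling rate after the 4kHz low-pass filter
BITS_PER_SAMPLE = 13         # word size produced by the quantizer
FRAME_PERIOD_MS = 20         # the codec emits one speech frame every 20ms
FRAME_BITS = 260             # bits in one encoded speech frame
CARRIERS = 124               # 200kHz carriers in a 25MHz GSM system
USERS_PER_CARRIER = 8


def adc_bit_rate():
    """Bit rate at the input of the speech codec, in bits per second."""
    return SAMPLE_RATE_HZ * BITS_PER_SAMPLE


def codec_bit_rate():
    """Bit rate at the output of the speech codec, in bits per second."""
    return FRAME_BITS * 1000 // FRAME_PERIOD_MS


def samples_per_frame():
    """Number of input samples the codec consumes for each 20ms frame."""
    return SAMPLE_RATE_HZ * FRAME_PERIOD_MS // 1000


def total_channels():
    """Speech or data channels available in one GSM system."""
    return CARRIERS * USERS_PER_CARRIER
```

These give 104,000 bits per second into the codec and 13,000 bits per second out of it, a reduction by a factor of 8. Each frame covers 160 input samples (2080 bits), and the system offers 992 channels, which the text rounds to approximately 1000.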

The signal gets directed and further transmitted by the Base Station Controller to the MSC, where the GSM transcoder (speech encoder and decoder) converts the GSM-formatted encoding into either a speech format for the PSTN or 13 Kbps data for GSM mobile station functions. The essential part of the GSM technology depends on digital signal processing to encode and decode bits of information into the GSM format. Specifically, DSP processors perform the speech, modem, and channel coding, as well as the decoding operations. The DSP operation of interest computes the encoding and decoding part of the GSM algorithm and is performed by the GSM transcoder, or codec. There are many of these DSPs in the network, located in the mobile station as well as in the BTS, BSC, or MSC, depending on the network configuration. The BSC and MSC can be considered central processing locations in the GSM infrastructure since most phone calls are routed through these units. Effectively, the transcoders here encode and decode individual phone calls, and the likely configuration is one transcoder per channel, each operating on one channel at a time. As with the ADSL network, GSM uses DSPs to perform key processing operations on each channel of communication. So again there is a central place in the network where "bulk processing" occurs and at which repetitive computations are executed among its processors.

The ADSL, MPEG, and GSM technologies share similarities at two different levels: first in their governing algorithms and second in their system architectures. At the algorithm level, each requires a great deal of digital signal processing, which characterizes most of their computations. As stated, DSP operations tend to be composed

of a relatively small set of instructions, which are executed repetitively. Thus, each DSP operation in the algorithm can be treated as a separate computational module. If the algorithm is subdivided into these modules, then the steps of the algorithm are easily identified and instruction-level parallelism (ILP) results. GSM, ADSL, and MPEG exemplify this level of parallelism in their algorithms. The second similarity exists at the system level and also exemplifies a type of ILP. Each technology operates on multiple independent data streams in parallel; thus, there is an inherent repetition among the computations performed. The system architectures of each technology dedicate multiple processors to work on the data even though they are all essentially doing the same thing. Therefore the utility of a Bulk DSP is evident. This multiprocessor could take advantage of this inherently repetitive scheme by increasing computing power and thereby exceeding the performance of modern-day microprocessors. Indeed, one would assume that if the DSP were designed with "N" simple processors, the improvement in performance would equal that of "N" present-day DSP processors. However, if the Bulk DSP were designed to optimize a particular algorithm, one could imagine exceeding a factor of "N" improvement in performance with the use of "N" processors. Therefore, a single application has been chosen and a Bulk DSP designed in an effort to achieve this type of improvement.

Chapter 3

The Bulk DSP Architecture

3.1 Parallel Computing

Parallel processing describes a style of computation suited to applications that exhibit some type of parallel algorithmic behavior. These applications usually consist of small computational modules or modules that are used throughout the algorithm. Given that there is a set number of transistors available to design a parallel multiprocessor, the question is how best to use these transistors to maximize performance. In order to answer this question, performance-critical aspects of the algorithm must be considered; they include the amount of data processed, the load balance among the computational modules, the parallel structure, the distribution of data, and the spatial and temporal access patterns to memory of the algorithm [10]. These factors determine the design parameters of the multiprocessor, such as the architecture of the simple processing elements, the allocation of memory resources, the communication protocol, and ultimately the number of simple processors. With the exception of the communication protocol, all of these parameters were considered in this design process. Bulk DSP is essentially a subset of this type of processor architecture. Bulk DSP aims at connecting many simple processors on a single chip instead of designing one large

complex processor. The gain in performance stems from using these simple processors in a parallel structure. Bulk DSP differs from modern-day microprocessors in that its basic building block consists of simple hardware and a reduced instruction set. Unlike Intel's Pentium processor, which incorporates branch prediction and issues multiple instructions per clock cycle, Bulk DSP relies on a simple set of instructions using a RISC-like structure [11]. Also, in contrast to the Pentium, the Bulk DSP does not include an extensive memory hierarchy. The memory components of the Bulk DSP consist of a simple instruction cache and data cache. The instruction cache does not have to be large because the algorithm uses a small number of instructions; the main part of memory is dedicated to the data cache, which serves as a buffer memory between the modules of computation. This architecture also differs from another type of modern-day processor named IRAM (Intelligent RAM), which was designed at the University of California, Berkeley [12]. This processor relies on the ability to place 1 billion transistors on a single chip, made possible by advances in integrated circuit technology. With such a large transistor budget, IRAM is able to allocate a large portion of its transistors to memory, specifically on-chip DRAM. Its main purpose is to diminish the gap between microprocessor performance and the latency of main memory accesses. Although the Bulk DSP would also rely on being able to integrate a large number of transistors on a chip, the Bulk DSP allocates these resources to computing or processing power rather than memory. Instead of dedicating 80% of the transistor budget to memory as IRAM does, the Bulk DSP might dedicate this percentage to processors. Bulk DSP applications require many arithmetic computations and thus would benefit more from processing power than from memory. Both of these processors, the Pentium and IRAM, are beneficial for

certain classes of applications. The Pentium is designed for general-purpose applications that do not necessarily exhibit a specific type of algorithmic structure, while the IRAM targets memory-intensive applications such as database and multimedia programs. The Bulk DSP targets neither of these areas; instead, it aims at improving applications that require a great deal of parallel signal processing. Thus, in comparison, an architecture such as the Bulk DSP would offer better performance than a Pentium or IRAM processor for this class of parallel applications. Additionally, the Bulk DSP does better from a cost standpoint; the cost of many simple processors on a chip is less than the cost of a large amount of DRAM or other specialized hardware characteristic of modern-day microprocessors.

3.2 Processing Elements

The architecture and organization of the Bulk DSP's simple processors model the processing elements used in SIMD parallel processing. A SIMD multiprocessor usually contains many simple processors, called processing elements, and a single control unit with only one instruction and data memory resource. These processing elements are characterized by their simplicity. Their main function is to execute the instructions given to them by a control unit that distributes the instructions to all processors. The ILP present in these programs implies that short instruction sequences will be carried out in parallel. Because each simple processor only carries out the given instructions, it contains minimal control logic. In fact, these simple processors have a RISC architecture, which simply fetches an instruction and data, executes the assigned computation, outputs the new data, and fetches the next instruction in a continuous cycle.

Essentially, these simple processors contain only basic hardware components and do not require much complexity; therefore, they are inexpensive and easy to replicate on a chip. The number of simple processors used in this Bulk DSP will be discussed in the scheduling section.
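
The relationship between the control unit and the processing elements described in this section can be illustrated with a small C sketch. It is only a sequential software model under stated assumptions: the opcodes, the structure layout, and the names processing_element, pe_execute, and control_unit_broadcast are hypothetical and are not part of the Bulk DSP design. In hardware, the loop body would run on all elements at once.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PE 4  /* hypothetical number of processing elements */

/* Hypothetical opcodes; the text does not define an instruction set. */
typedef enum { OP_ADD, OP_SUB, OP_MUL } opcode;

/* One processing element: local operands only, no control logic of its own. */
typedef struct {
    int a;       /* first local operand */
    int b;       /* second local operand */
    int result;  /* output of the last instruction executed */
} processing_element;

/* Execute a single instruction on one element's local data. */
static void pe_execute(processing_element *pe, opcode op)
{
    switch (op) {
    case OP_ADD: pe->result = pe->a + pe->b; break;
    case OP_SUB: pe->result = pe->a - pe->b; break;
    case OP_MUL: pe->result = pe->a * pe->b; break;
    }
}

/* Control unit: distribute the same instruction to every element.
   The loop stands in for what would happen simultaneously in hardware. */
static void control_unit_broadcast(processing_element pes[], size_t n, opcode op)
{
    assert(pes != NULL);
    for (size_t i = 0; i < n; i++)
        pe_execute(&pes[i], op);
}
```

Each element applies the broadcast instruction to its own operands, so a single call to control_unit_broadcast with OP_ADD produces one independent sum per element.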

3.3 Memory Hierarchy

The memory resources of the Bulk DSP play a large part in the design process. For this multiprocessor, a single 2KB (kilobyte) cache for both instructions and data will be allocated to each simple processor. The caches will be subdivided into 256-byte sections, each of which can be designated as either instruction or data memory. The instruction size of the computing module(s) in each processor determines the portion of the cache used for instructions, and the number of input and output bytes for the module(s) determines the amount used for data. Because the algorithm will be subdivided into computational modules among the processors, data will need to propagate from one processor to another. This means that each processor will have to both read data from and write data to other processors' memories. This data movement can be set up in such a way that it occurs in the "background." Consequently, the processors will not have to wait for data and no cycles will be wasted on data movement. This concept will be enabled by the buffer memories between the computing modules. For each processor, there will essentially be four buffer memories: two on the input "side" and two on the output "side." One buffer memory on each side will be dedicated to the current set of data being processed; this will give the processor a place from which to read current data

and another place to write out current data. The other two memories associated with each processor are for other processors to write to or read from while the processor associated with those buffer memories is busy working on the current set of data. The remainder of the investigation explores how best to implement a Bulk DSP. GSM serves as an excellent application for the Bulk DSP and will be the main application of the designed processor. This is due to several reasons. First, the Base Transceiver Station in the GSM network acts as a core processing node at which many DSP computations take place. Second, a software library implementing the GSM algorithm was found and proved useful for this investigation. Third, GSM is a popular mobile cellular system that has gained acceptance around the world, making it a very relevant and useful technology to explore.
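
The cache partitioning and the paired buffer memories described in this section can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the thesis design itself: the 2KB cache and 256-byte sections come from the text, while the names sections_needed, module_fits, and double_buffer, the 40-sample buffer size, and the swap policy are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_BYTES    2048  /* per-processor cache size from the text */
#define SECTION_BYTES  256   /* allocation granularity from the text */
#define CACHE_SECTIONS (CACHE_BYTES / SECTION_BYTES)  /* 8 sections */

/* Number of 256-byte sections needed to hold `bytes`, rounded up. */
static size_t sections_needed(size_t bytes)
{
    return (bytes + SECTION_BYTES - 1) / SECTION_BYTES;
}

/* Returns 1 if a module with the given instruction and data footprints
   fits in one processor's cache, 0 otherwise. */
static int module_fits(size_t instr_bytes, size_t data_bytes)
{
    return sections_needed(instr_bytes) + sections_needed(data_bytes)
           <= CACHE_SECTIONS;
}

/* Hypothetical pair of buffer memories for one side (input or output)
   of a processor. `active` selects the buffer the owning processor is
   working on; the other buffer is free for a neighbouring processor to
   fill or drain in the background. */
typedef struct {
    short buf[2][40];  /* buffer size chosen for illustration only */
    int   active;      /* index (0 or 1) of the buffer used by the owner */
} double_buffer;

static short *owner_buffer(double_buffer *db)
{
    return db->buf[db->active];
}

static short *neighbour_buffer(double_buffer *db)
{
    return db->buf[1 - db->active];
}

/* Exchange the roles of the two buffers once the owner has finished the
   current data set and the neighbour has finished its transfer. */
static void swap_buffers(double_buffer *db)
{
    assert(db->active == 0 || db->active == 1);
    db->active = 1 - db->active;
}
```

With one such pair on the input side and one on the output side, a processor holds the four buffer memories mentioned above, and data written by a neighbour becomes visible to the owner only after the swap.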

3.4 GSM Implementation

Because GSM is a cellular phone network, human speech makes up the majority of the information transmitted across the network. As mentioned, the speech compression algorithm used in GSM is Regular Pulse Excitation with Long-Term Prediction (RPE-LTP), specified in the GSM 06.10 standard [9]. A block diagram of the encoding algorithm is shown in Figure 2-1. This algorithm is executed in the GSM codec and serves as its primary functionality. The input frames to the codec consist of 160 signed 13-bit linear PCM values sampled at 8 kHz. They come from either the audio part of the mobile station or from the PSTN. These frames last for 20ms, and thus cover about one glottal period of a very low voice or 10 periods for a very high

voice. Because this is a relatively short period of time, the speech wave does not change much, and thus the algorithm will not lose any information by dividing up the speech signal as such. The encoder divides the input speech samples into a short-term predictable part, a long-term predictable part, and the remaining residual pulse. Then, it encodes and quantizes the residual pulse and the parameters for the two predictors. The decoder passes the residual pulse through the long-term predictor and then passes the filtered output through the short-term predictor in order to reconstruct the speech [6].

Figure 2-1: GSM Encoding Algorithm (block diagram; labels: Input Signal [0..159], Pre-Process, Short-Term LPC Analysis, Log Area Ratios, Long-Term Analysis, LTP Analysis, LTP Parameters, RPE Analysis, RPE Grid, RPE Parameters; signals: (1) Short Term Residual, (2) Long Term Residual, (3) Short Term Residual Estimate, (4) Reconstructed Short Term Residual, (5) Quantized Long Term Residual)

3.5 Encoding and Decoding

The first step in the encoding algorithm consists of preprocessing the samples to produce an offset-free signal and then passing them through a first-order preemphasis filter. The resulting 160 samples are then analyzed to determine the coefficients of the short-term

analysis filter. This short-term linear-predictive filter (LPC analysis) is the first stage of compression and the last stage of decompression. The speech compression in this algorithm is achieved by modeling the human speech system with two filters and an initial excitation, of which LPC is the first filter. In this process, the short-term filter acts as the human vocal and nasal tract such that, when excited by a mixture of glottal wave and noise, it produces speech that is hopefully similar to the speech being compressed. This is done by using the set of coefficients determined from the preprocessed signal, together with the 160 samples, to produce a weighted sum of the previous output, which is termed the short-term residual signal. In addition, the filter parameters, named the reflection coefficients, are transformed into log-area-ratios (LARs) before transmission since they will be used for the short-term synthesis filter in the decoder. The next stage in processing involves the long-term analysis, where the main computation is the long-term prediction filter. Before filtering, the speech frame is subdivided into 40-sample blocks of the short-term residual signal. Also, the parameters of the long-term analysis filter, namely the LTP lag, which describes the source of the copy in time, and the LTP gain, a scaling factor, are estimated and updated in the LTP analysis block. Both of these prediction parameters are calculated based on the current sub-block and the previous 120 reconstructed short-term residual samples. With these parameters, an estimate of the short-term residual signal is found via the long-term prediction filter. Then, the last stage of this section subtracts the estimated short-term residual signal from the actual short-term residual signal to produce the long-term residual signal. With each 40-sample sub-block iteration, 56 bits of the GSM encoded frame are produced.
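
The long-term analysis steps just described (estimate the LTP lag and gain, form the estimate, subtract it) can be sketched in C as follows. This is a simplified floating-point illustration under stated assumptions, not the GSM 06.10 fixed-point procedure: the function name ltp_analysis, the array layout, the lag search range of 40 to 120 samples, and the unquantized gain limited to the range 0 to 1 are choices made here for clarity.

```c
#include <assert.h>

#define SUBBLOCK 40   /* samples per sub-block */
#define HISTORY  120  /* previous reconstructed short-term residual samples */
#define MIN_LAG  40   /* assumed shortest lag searched */
#define MAX_LAG  120  /* assumed longest lag searched */

/* d:    current sub-block of the short-term residual
   hist: previous reconstructed short-term residual, oldest sample first,
         so hist[HISTORY - 1] immediately precedes d[0]
   e:    output, the long-term residual
   gain: output, the LTP gain (unquantized in this sketch)
   Returns the LTP lag. */
static int ltp_analysis(const double d[SUBBLOCK], const double hist[HISTORY],
                        double e[SUBBLOCK], double *gain)
{
    int best_lag = MIN_LAG;
    double best_corr = 0.0;
    double energy = 0.0;

    assert(d && hist && e && gain);

    /* LTP lag: the delay whose history segment correlates best with d. */
    for (int lag = MIN_LAG; lag <= MAX_LAG; lag++) {
        double corr = 0.0;
        for (int k = 0; k < SUBBLOCK; k++)
            corr += d[k] * hist[HISTORY + k - lag];
        if (corr > best_corr) {
            best_corr = corr;
            best_lag = lag;
        }
    }

    /* LTP gain: scaling factor for the selected segment, limited to 1. */
    for (int k = 0; k < SUBBLOCK; k++) {
        double s = hist[HISTORY + k - best_lag];
        energy += s * s;
    }
    *gain = (energy > 0.0) ? best_corr / energy : 0.0;
    if (*gain > 1.0)
        *gain = 1.0;

    /* Long-term residual: actual short-term residual minus its estimate. */
    for (int k = 0; k < SUBBLOCK; k++)
        e[k] = d[k] - *gain * hist[HISTORY + k - best_lag];

    return best_lag;
}
```

When the current sub-block is an exact scaled copy of an earlier segment, the sketch recovers that segment's lag and scale and the long-term residual is zero, which is the redundancy the long-term predictor is meant to remove.
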
The resulting 40 samples of the long-term residual signal are then passed to the regular pulse excitation analysis for the primary

compression operation of the algorithm. Here, each sub-segment of the residual signal is filtered by an FIR (Finite Impulse Response) algorithm and then down-sampled by a factor of 3. This yields four candidate sequences of length 13. The sequence with the most energy is chosen and its 13 samples are quantized by block adaptive PCM (APCM) encoding. The result is passed on to the decoder together with a 2-bit grid selection. Lastly, the encoder updates the reconstructed short-term residual in order to prepare for the next LTP analysis. In summary, the speech codec, or encoder, compresses an input of 160 samples into an output frame of 260 bits every 20ms. Therefore, one second of speech equals 1625 bytes (260 bits x 50 frames = 13,000 bits) and one megabyte of compressed data holds about 10 minutes of speech [6].

The decoder mirrors many of the encoding computations. Decoding occurs when a call is received from the PSTN or from the Mobile Station (the cellular phone) at the Base Station. The decoding algorithm begins by multiplying the 13 3-bit samples by the scaling factor and expanding them back to 40-sample sub-blocks. This residual signal passes through the long-term predictor, which consists of a feedback loop similar to the one in the encoder. The long-term synthesis filter extracts 40 samples of the previous estimated short-term signal, scales them by the LTP gain, and adds them to the incoming pulse. This new short-term residual becomes part of the source for the next three predictions. In addition, these samples are applied to the short-term synthesis filter, which uses the reflection coefficients calculated by the LPC module. Finally, the de-emphasis filter processes the samples, and its output should resemble the original speech signal.
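
The RPE grid selection described above (four candidate sequences of 13 samples, keep the one with the most energy) can be sketched in C as follows. It is a minimal illustration, assuming the input has already passed through the FIR filter; the function name rpe_select_grid and the array types are hypothetical, and the APCM quantization step is omitted.

```c
#include <assert.h>

#define SUBBLOCK   40  /* samples in one sub-segment of the residual */
#define DECIMATION 3   /* down-sampling factor */
#define GRIDS      4   /* candidate sequences (grid positions 0..3) */
#define PULSES     13  /* samples per candidate sequence */

/* x:   filtered long-term residual for one sub-segment
   out: the 13 samples of the selected candidate sequence
   Returns the grid position (0..3), which fits in 2 bits. */
static int rpe_select_grid(const short x[SUBBLOCK], short out[PULSES])
{
    int best_grid = 0;
    long long best_energy = -1;

    assert(x && out);

    /* Candidate m consists of x[m], x[m + 3], ..., x[m + 36]. */
    for (int m = 0; m < GRIDS; m++) {
        long long energy = 0;
        for (int i = 0; i < PULSES; i++) {
            long long s = x[m + DECIMATION * i];
            energy += s * s;
        }
        if (energy > best_energy) {
            best_energy = energy;
            best_grid = m;
        }
    }

    /* Keep only the samples of the winning candidate. */
    for (int i = 0; i < PULSES; i++)
        out[i] = x[best_grid + DECIMATION * i];

    return best_grid;
}
```

The largest index read is 3 + 3 x 12 = 39, so all four candidates stay within the 40-sample sub-segment.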

Chapter 4

Scheduling

In an effort to design a Bulk DSP that will optimize the performance of the GSM algorithm, different organizations of the algorithm's computational modules were arranged and considered for the building block of the multiprocessor. These architectures differ based on the number of computing modules grouped together in a single processor and the schedule in which the modules are executed by the control unit. Changing the architecture based on these parameters allowed the designer to explore the parallel structure already present in the GSM algorithm.

4.1 Design Methodology

At first, the "best architecture" for this Bulk DSP might seem to be a set of simple processors, each with a fixed memory resource and each assigned to process the entire decoding algorithm. This architecture results in each simple processor working on an entire frame at a time. The benefit of this solution is that each processor will continuously process data. The only exception, or idle time induced, would occur the first time an instruction is called within a frame; this results in some memory access time to fetch the instruction into the cache. Due to the limited memory resource, the entire decoding program cannot

fit into the on-chip cache, hence the idle cycles while waiting for memory access. This design represents the scheme where a factor of "N" improvement is achieved by simply replicating "N" simple processors on chip. However, this architecture does not take advantage of any instruction-level parallelism present in the algorithm, and therefore the idea is that a more efficient scheme using the same amount of hardware exists. Thus, given that this Bulk DSP is composed of simple processors with 2KB of memory each, what is the best organization of these resources? In order to address this question, careful study of each computational module is required. Accordingly, a discussion of the computational modules follows.

4.2 GSM's Computational Modules

The GSM algorithm consists of two main operations: encoding and decoding. Each of these operations can be easily subdivided into a set of independent computing modules. This modularity allows for flexibility in organizing the simple processors. Here, we will focus on the decoding part of the algorithm in the design of the Bulk DSP. Because many of the modules are the same for decoding as they are for encoding, an approach similar to the one taken here may also serve to design a multiprocessor for the encoder. In order to subdivide the algorithm, specific functions within the overall computation were identified (see Figure 4-1). Ten independent modules were distinguished, differing in instruction and data size. Due to the nature of the GSM algorithm, several of these modules can be executed in parallel, thus providing a means to optimize the architectural organization of the algorithm. The number of instructions executed characterizes each module. There are four important parameters that determine the above information for

each module. They are the numbers of Loads, Stores, Arithmetics, and Shifts encountered in the instruction set of the module. A Load represents a processor reading from memory (this could be either instructions or data). Specifically, a Load fetches two bytes of information at a time. Stores represent the times the processor writes to memory. Arithmetics are the actual computations executed by the processor, and Shifts correspond to the computation of indexing a data array.

Figure 4-1: GSM Decoding Algorithm (block diagram; labels: Inverse APCM Quantization, RPE Grid Position, Long-Term Synthesis, Decoding of LARs, LAR-to-RP, Short-Term Synthesis, De-Emphasis)

Another aspect of these modules is the presence of loops in their structure. As noted earlier, DSP computations require many of the same operations repetitively, which accounts for the large number of loops found in these modules. An example of this is demonstrated in Figure 3-1, which shows the code for one of the computing modules, Long_TermSynthesisFiltering. An example of the Load (L), Store (S), Arithmetic (A), and Shift accounting is also demonstrated. So, to determine the number of instructions executed by this


computational module, the sum of Loads, Stores, Arithmetics, and Shifts was calculated without regard to the loops. This sum, multiplied by the instruction size, gives the number of bytes stored in the instruction cache (I-cache). One assumption made here is that each instruction occupies 4 bytes of memory. This is a typical size for most modern-day instruction sets. The second calculation counts the same parameters, but this time including the loops. The sum of these parameters represents the total number of

void GsmLongTermSynthesisFiltering(
    struct gsm_state *S,
    word Ncr,
    word bcr,
    register word *erp,
    register word *drp)
{
    register longword ltmp;
    register int k;
    word brp, drpp, Nr;

    Nr = Ncr < 40 || Ncr > 120 ? S->nrp : Ncr;
    S->nrp = Nr;
    assert(Nr >= 40 && Nr