MPEG-4 natural audio coding

Signal Processing: Image Communication 15 (2000) 423–444

Karlheinz Brandenburg (a,*), Oliver Kunz (a), Akihiko Sugiyama (b)
(a) Fraunhofer Institut für Integrierte Schaltungen IIS, D-91058 Erlangen, Germany
(b) NEC C&C Media Research Laboratories, 1-1, Miyazaki 4-chome, Miyamae-ku, Kawasaki 216-8555, Japan

Abstract

MPEG-4 audio represents a new kind of audio coding standard. Unlike its predecessors, MPEG-1 and MPEG-2 high-quality audio coding, and unlike the speech coding standards completed by the ITU-T, it describes not a single or small set of highly efficient compression schemes but a complete toolbox covering everything from low bit-rate speech coding to high-quality audio coding and music synthesis. The natural coding part within MPEG-4 audio describes traditional speech and high-quality audio coding algorithms and their combination, enabling new functionalities such as scalability (hierarchical coding) across the boundaries of coding algorithms. This paper gives an overview of the basic algorithms and how they can be combined. © 2000 Elsevier Science B.V. All rights reserved.

1. Introduction

Traditional high-quality audio coding schemes like MPEG-1 Layer-3 (aka .mp3) have found their way into many applications, including widespread acceptance on the Internet. MPEG-4 audio is scheduled to be the successor of these, building and expanding on the acceptance of earlier audio coding formats. To do this, MPEG-4 natural audio coding has been designed to fit well into the philosophy of MPEG-4. It enables new functionalities and implements a paradigm shift from the linear storage or streaming architecture of MPEG-1 and MPEG-2 to objects and presentation rendering. While most of these new functionalities live within the tools of MPEG-4 structured audio and audio

* Corresponding author. E-mail address: [email protected] (K. Brandenburg)

BIFS, the syntax of the "classical" audio coding algorithms within MPEG-4 natural audio has been defined and amended to implement scalability and the notion of audio objects. This way MPEG-4 natural audio goes well beyond classic speech and audio coding algorithms into a new world which we will see unfold in the coming years.

2. Overview

The tools defined by MPEG-4 natural audio coding can be combined into different audio coding algorithms. Since no single coding paradigm was found to span the complete range from very low bit-rate coding of speech signals up to high-quality multi-channel audio coding, a set of different algorithms has been defined to establish optimum coding efficiency for the broad range of anticipated applications (see Fig. 1 and [9]). The following list introduces the main algorithms and the reason for

0923-5965/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0923-5965(99)00056-9


Fig. 1. Assignment of codecs to bit-rate ranges.

their inclusion into MPEG-4. The following sections give more detailed descriptions of each of the tools used to implement the coding algorithms. Each algorithm was defined from separate coding tools, with the goals of maximizing the overlap of tools between different algorithms and maximizing the flexibility with which tools can be combined to generate different flavors of the basic coding algorithms. The major algorithms of MPEG-4 natural audio are:

- HVXC: low-rate clean-speech coder
- CELP: telephone-speech/wideband-speech coder
- GA: General Audio coding for medium and high qualities
- TwinVQ: additional coding tools to increase coding efficiency at very low bit-rates

In addition to the coding tools used for the basic coding functionality, MPEG-4 provides techniques for additional features like bit stream scalability. Tools for these features will be explained in Section 7.

3. General Audio Coding (AAC-based)

This key component of MPEG-4 Audio covers the bit-rate range from 16 kbit/s per channel up to bit-rates higher than 64 kbit/s per channel. Using MPEG-4 General Audio, quality levels between

"better than AM" and "transparent audio quality" can be achieved. MPEG-4 General Audio supports four so-called Audio Object Types (see the paper on MPEG-4 Profiling in this issue), where AAC Main, AAC LC and AAC SSR are derived from MPEG-2 AAC (see [2]), adding some functionalities to further improve the bit-rate efficiency. The fourth Audio Object Type, AAC LTP, is unique to MPEG-4 but defined in a backwards compatible way. Since MPEG-4 Audio is defined in a way that it remains backwards compatible to MPEG-2 AAC, it supports all tools defined in MPEG-2 AAC, including the tools exclusively used in the Main Profile and the scalable sampling rate (SSR) Profile, namely frequency-domain prediction and the SSR filterbank plus gain control. Additionally, MPEG-4 Audio defines ways for bit-rate scalability. The supported methods for bit-rate scalability are described in Section 7.

Fig. 2 shows the arrangement of the building blocks of an MPEG-4 GA encoder in the processing chain. These building blocks are described in the following subsections. The same building blocks are present in a decoder implementation, performing the inverse processing steps. For the sake of simplicity we omit references to decoding in the following subsections unless explicitly necessary for understanding the underlying processing mechanism.

3.1. Filterbank and block switching

One of the main components in each transform coder is the conversion of the incoming audio signal from the time domain into the frequency domain. MPEG-2 AAC supports two different approaches to this. The standard transform is a straightforward modified discrete cosine transform (MDCT). However, in the AAC SSR Audio Object Type a different conversion using a hybrid filter bank is applied.

3.1.1. Standard filterbank

The filterbank in MPEG-4 GA is derived from MPEG-2 AAC, i.e. it is an MDCT supporting block


Fig. 2. Building blocks of the MPEG-4 General Audio Coder.

lengths of 2048 points and 256 points which can be switched dynamically. Compared to previously known transform coding schemes the length of the long block transform is rather high, offering improved coding efficiency for stationary signals. The shorter of the two block lengths is rather small, providing optimized coding capabilities for transient signals. MPEG-4 GA supports an additional mode with block lengths of 1920/240 points to facilitate scalability with the speech coding algorithms in MPEG-4 Audio (see VI-C). All blocks are overlapped by 50% with the preceding and the following block.
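The 50%-overlapped MDCT can be made concrete with a small sketch. This is illustrative only: a direct O(N^2) transform with a toy block length of 8 (the standard uses fast implementations at 2048/256 points), and the function names are mine. Overlap-adding two adjacent inverse-transformed blocks reconstructs the shared half exactly when the window meets the Princen-Bradley condition, as the sine window does:

```python
import math

def sine_window(n_total):
    # Sine window; satisfies the Princen-Bradley condition needed for
    # time-domain aliasing cancellation (TDAC).
    return [math.sin(math.pi / n_total * (n + 0.5)) for n in range(n_total)]

def mdct(block, window):
    # Forward MDCT: N windowed time samples -> N/2 spectral coefficients.
    n_total = len(block)
    half = n_total // 2
    return [
        sum(window[n] * block[n]
            * math.cos(math.pi / half * (n + 0.5 + half / 2) * (k + 0.5))
            for n in range(n_total))
        for k in range(half)
    ]

def imdct(coeffs, window):
    # Inverse MDCT with synthesis windowing; overlap-add of consecutive
    # blocks cancels the time-domain aliasing.
    half = len(coeffs)
    n_total = 2 * half
    return [
        (2.0 / half) * window[n]
        * sum(coeffs[k]
              * math.cos(math.pi / half * (n + 0.5 + half / 2) * (k + 0.5))
              for k in range(half))
        for n in range(n_total)
    ]

# Two 50%-overlapped blocks reconstruct their shared region.
N = 8      # AAC uses 2048 (long) or 256 (short); 8 keeps the demo readable
w = sine_window(N)
x = [0.3, -1.2, 0.7, 0.1, 0.9, -0.4, 0.25, 1.1, -0.6, 0.05, 0.8, -0.9]
y1 = imdct(mdct(x[0:N], w), w)
y2 = imdct(mdct(x[N // 2:N // 2 + N], w), w)
overlap = [y1[N // 2 + i] + y2[i] for i in range(N // 2)]   # reconstructs x[4:8]
```

Note the 2:1 decimation: each block of N samples yields only N/2 coefficients, so the 50% overlap does not increase the overall data rate.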

For improved frequency selectivity the incoming audio samples are windowed before the transform. MPEG-4 AAC supports two different window shapes that can be switched dynamically: a sine-shaped window and a Kaiser-Bessel derived (KBD) window, the latter offering improved far-off rejection compared to the sine-shaped window. An important feature of the time-to-frequency transform is the signal-adaptive selection of the transform length. This is controlled by analyzing the short-time variance of the incoming time signal.
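Both window shapes can be sketched in a few lines. The KBD construction (cumulative sums of a Kaiser window, square-rooted and mirrored) is shown here under the assumption of a Kaiser shape parameter alpha = 4; the helper names are mine. Both windows satisfy the overlap condition w[n]^2 + w[n+N/2]^2 = 1 required by the MDCT:

```python
import math

def bessel_i0(x):
    # Zeroth-order modified Bessel function via its power series.
    total, term, k = 1.0, 1.0, 0
    while term > 1e-12 * total:
        k += 1
        term *= (x / 2.0) ** 2 / k ** 2
        total += term
    return total

def sine_window(n_total):
    return [math.sin(math.pi / n_total * (n + 0.5)) for n in range(n_total)]

def kbd_window(n_total, alpha=4.0):
    # Kaiser-Bessel derived window: cumulative sums of a symmetric Kaiser
    # window of length N/2 + 1, square-rooted, then mirrored.
    half = n_total // 2
    beta = math.pi * alpha
    kaiser = [bessel_i0(beta * math.sqrt(1.0 - (2.0 * j / half - 1.0) ** 2))
              / bessel_i0(beta) for j in range(half + 1)]
    total = sum(kaiser)
    first, cum = [], 0.0
    for j in range(half):
        cum += kaiser[j]
        first.append(math.sqrt(cum / total))
    return first + first[::-1]
```

The symmetry of the Kaiser window makes the overlap condition hold exactly by construction, which is what allows the decoder to switch window shapes block by block without reconstruction errors.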


To assure block synchronicity between two audio channels with different block length sequences, eight short transforms are performed in a row using 50% overlap each, with specially designed transition windows at the beginning and the end of a short sequence. This keeps the spacing between consecutive blocks at a constant level of 2048 input samples. For further processing of the spectral data in the quantization and coding part, the spectrum is arranged in so-called scalefactor bands, roughly reflecting the bark scale of the human auditory system.

3.1.2. Filterbank and gain control in SSR profile

In the SSR profile the MDCT is preceded by a processing block containing a uniformly spaced 4-band polyphase quadrature filter (PQF) and a gain control module. The gain control can attenuate or amplify the output of each PQF band to reduce pre-echo effects. After the gain control is performed, an MDCT is calculated on each PQF band, having a quarter of the length of the original MDCT.

3.2. Frequency-domain prediction

Frequency-domain prediction improves redundancy reduction for stationary signal segments. It is only supported in the Audio Object Type AAC Main. Since stationary signals can nearly always be found in long transform blocks, it is not supported in short blocks. The actual implementation of the predictor is a second-order backwards-adaptive lattice structure, independently calculated for every frequency line. The use of the predicted values instead of the original ones can be controlled on a scalefactor band basis and is decided based on the achieved prediction gain in that band. To improve stability of the predictors, a cyclic reset mechanism is applied which is synchronized between encoder and decoder via a dedicated bitstream element. The required processing power of the frequency-domain prediction and its sensitivity to numerical imperfections make this tool hard to use on fixed-point platforms.
Additionally, the backwards adaptive structure of the predictor makes such bitstreams quite sensitive to transmission errors.
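A minimal sketch of the backward-adaptive idea, using a plain second-order NLMS predictor rather than the lattice structure the standard actually specifies: the coefficients are updated only from already-processed values, so encoder and decoder stay in lockstep without transmitting any prediction coefficients. On a stationary tone the residual energy drops well below the signal energy, which is the prediction gain the tool exploits:

```python
import math

def backward_adaptive_prediction(samples, mu=0.5):
    # Second-order normalized-LMS predictor.  Coefficients adapt purely
    # from past values, so no side information is needed; the decoder runs
    # the identical recursion.  (AAC Main actually uses a backward-adaptive
    # lattice with leakage factors; NLMS keeps the idea visible.)
    a1 = a2 = 0.0
    errors = []
    for t in range(len(samples)):
        x1 = samples[t - 1] if t >= 1 else 0.0
        x2 = samples[t - 2] if t >= 2 else 0.0
        pred = a1 * x1 + a2 * x2
        e = samples[t] - pred          # this residual is what gets quantized
        errors.append(e)
        norm = x1 * x1 + x2 * x2 + 1e-9
        a1 += mu * e * x1 / norm       # adapt after the fact, as the decoder would
        a2 += mu * e * x2 / norm
    return errors

# A stationary tone: exactly the kind of signal the tool targets.
signal = [math.cos(0.1 * t) for t in range(400)]
residual = backward_adaptive_prediction(signal)
sig_energy = sum(v * v for v in signal[200:])
err_energy = sum(v * v for v in residual[200:])
```

The sensitivity mentioned above is visible in this structure: any mismatch between encoder-side and decoder-side arithmetic (or a bit error) perturbs the shared coefficient state, which is why the standard adds the cyclic reset mechanism.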

3.3. Long-term prediction (LTP)

Long-term prediction (LTP), newly introduced in MPEG-4, is an efficient tool for reducing the redundancy of a signal between successive coding frames. The tool is especially effective for the parts of a signal which have a clear pitch property. The implementation complexity of LTP is significantly lower than that of the MPEG-2 AAC frequency-domain prediction. Because the long-term predictor is a forward-adaptive predictor (prediction coefficients are sent as side information), it is inherently less sensitive to round-off numerical errors in the decoder or bit errors in the transmitted spectral coefficients.

3.4. Quantization

The adaptive quantization of the spectral values is the main source of bit-rate reduction in all transform coders. It assigns a bit allocation to the spectral values according to the accuracy demands determined by the perceptual model, realizing the irrelevancy reduction. The key components of the quantization process are the quantization function actually used and the noise shaping that is achieved via the scalefactors (see Section 3.5). The quantizer used in MPEG-4 GA has been designed similar to the one used in MPEG-1/2 Layer-3. It is a non-linear quantizer with an x^0.75 characteristic. The main advantage of this non-linear quantization over a conventional linear quantizer is the implicit noise shaping that it creates. The absolute quantizer stepsize is determined via a specific bitstream element. It can be adjusted in 1.5 dB steps.

3.5. Scalefactors

While there is already an inherent noise shaping in the non-linear quantizer, it is usually not sufficient to achieve acceptable audio quality. To improve the subjective quality of the coded signal the noise is further shaped via scalefactors, which work as follows: scalefactors amplify the signal in certain spectral regions (the scalefactor bands) to increase the signal-to-noise ratio in these bands. Thus they


implicitly modify the bit allocation over frequency, since higher spectral values usually need more bits to be coded afterwards. Like that of the global quantizer, the stepsize of the scalefactors is 1.5 dB. To properly reconstruct the original spectral values in the decoder, the scalefactors have to be transmitted within the bitstream. MPEG-4 GA uses an advanced technique to code the scalefactors as efficiently as possible. First, it exploits the fact that scalefactors usually do not change much from one scalefactor band to the next, so a differential encoding already provides some advantage. Second, it uses a Huffman code to further reduce the redundancy within the scalefactor data.

3.6. Noiseless coding

The noiseless coding kernel within an MPEG-4 GA encoder tries to optimize the redundancy reduction within the spectral data coding. The spectral data is encoded using a Huffman code which is selected from a set of available code books according to the maximum quantized value. The set of available codebooks includes one signaling that all spectral coefficients in the respective scalefactor band are zero, implying that neither spectral coefficients nor a scalefactor are transmitted for that band. The selected table has to be transmitted inside the so-called section data, creating a certain amount of side-information overhead. To find the optimum tradeoff between selecting the optimum table for each scalefactor band and minimizing the number of section data elements to be transmitted, an efficient grouping algorithm is applied to the spectral data.

3.7. Joint stereo coding

Joint stereo coding methods try to increase the coding efficiency when encoding stereo signals by exploiting commonalities between the left and right signal. MPEG-4 GA contains two different joint stereo coding algorithms, namely mid/side (MS) stereo coding and intensity stereo coding. MS stereo applies a matrix to the left and right channel signals, computing the sum and difference of the two original signals.
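Returning briefly to the quantizer of Sections 3.4 and 3.5, a hedged sketch of the x^0.75 law and the 1.5 dB stepping. The real standard adds a constant offset to the scalefactor and uses implementation-specific rounding; the constants and names here are illustrative:

```python
import math

# One scalefactor step rescales the quantizer input by 2**0.25,
# i.e. about 1.5 dB.
STEP_DB = 20.0 * math.log10(2.0 ** 0.25)   # ~1.506 dB

def quantize(value, scalefactor):
    # AAC-style non-linear quantizer: scale, compress with a 0.75-power
    # law, then round.  Large values are quantized relatively coarsely,
    # which is the implicit noise shaping mentioned above.
    mag = abs(value) * 2.0 ** (-0.25 * scalefactor)
    q = int(mag ** 0.75 + 0.4054)
    return -q if value < 0 else q

def dequantize(index, scalefactor):
    # Inverse power law: |ix|**(4/3), then undo the scalefactor gain.
    mag = abs(index) ** (4.0 / 3.0) * 2.0 ** (0.25 * scalefactor)
    return -mag if index < 0 else mag
```

For example, quantize(100.0, 0) yields index 32, which dequantizes to about 101.6, a relative error of under 2 percent; raising the scalefactor by one unit shrinks every quantized magnitude by the same 1.5 dB, which is how the encoder trades noise level per band.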
Whenever a signal is concentrated in the middle of the stereo image, MS


stereo can achieve a significant saving in bit-rate. Even more important is the fact that, by applying the inverse matrix in the decoder, the quantization noise becomes correlated and falls in the middle of the stereo image, where it is masked by the signal. Intensity stereo coding achieves a saving in bit-rate by replacing the left and the right signal with a single representative signal plus directional information. This replacement is psychoacoustically justified in the higher frequency range, since the human auditory system is insensitive to the signal phase at frequencies above approximately 2 kHz. Intensity stereo is by definition a lossy coding method and is thus primarily useful at low bit-rates. For coding at higher bit-rates only MS stereo is used.

3.8. Temporal noise shaping

Conventional transform coding schemes often encounter problems with signals that vary heavily over time, especially speech signals. The main reason for this is that the distribution of quantization noise can be controlled over frequency but is constant over a complete transform block. If the signal characteristic changes drastically within such a block without triggering a switch to shorter transform lengths, e.g. in the case of pitchy speech signals, this equal distribution of quantization noise can lead to audible artifacts. To overcome this limitation, a feature called temporal noise shaping (TNS) (see [5]) was introduced into MPEG-2 AAC. The basic idea of TNS relies on the duality of time and frequency domain. TNS uses a prediction approach in the frequency domain to shape the quantization noise over time. It applies a filter to the original spectrum and quantizes this filtered signal. Additionally, quantized filter coefficients are transmitted in the bitstream. These are used in the decoder to undo the filtering performed in the encoder, leading to a temporally shaped distribution of quantization noise in the decoded audio signal.
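The TNS encoder/decoder pair can be sketched as a prediction-error (FIR) filter applied across the spectral coefficients and its all-pole inverse in the decoder. Coefficient quantization is omitted here, so the round trip is exact; the function names are mine:

```python
def tns_analysis(spectrum, coeffs):
    # Prediction-error filter over the spectral coefficients:
    # e[k] = X[k] - sum_i a_i * X[k-i].  The quantizer then works on e.
    out = []
    for k in range(len(spectrum)):
        acc = spectrum[k]
        for i, a in enumerate(coeffs, start=1):
            if k - i >= 0:
                acc -= a * spectrum[k - i]
        out.append(acc)
    return out

def tns_synthesis(residual, coeffs):
    # Decoder side: the inverse (all-pole) filter restores the spectrum.
    # Quantization noise added to the residual passes through this filter
    # too, which shapes that noise along the *time* axis by duality.
    out = []
    for k in range(len(residual)):
        acc = residual[k]
        for i, a in enumerate(coeffs, start=1):
            if k - i >= 0:
                acc += a * out[k - i]
        out.append(acc)
    return out

spec = [1.0, 0.8, -0.3, 0.5, 0.0, -0.7, 0.2, 0.4]
coeffs = [0.5, -0.25]                 # illustrative filter coefficients
restored = tns_synthesis(tns_analysis(spec, coeffs), coeffs)
```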
TNS can be viewed as a postprocessing step of the transform, creating a continuous signal-adaptive filter bank instead of the conventional two-step switched filter bank approach. The actual


implementation of the TNS approach within MPEG-2 AAC and MPEG-4 GA allows for up to three distinct filters applied to different spectral regions of the input signal, further improving the flexibility of this approach.

3.9. Perceptual noise substitution (PNS)

A feature newly introduced in MPEG-4 GA, i.e. not available within MPEG-2 AAC, is perceptual noise substitution (PNS) (see [6]). It aims at a further optimization of the bit-rate efficiency of AAC at lower bit-rates. The technique of PNS is based on the observation that "one noise sounds like the other": the actual fine structure of a noise signal is of minor importance for the subjective perception of such a signal. Consequently, instead of transmitting the actual spectral components of a noisy signal, the bit-stream just signals that the frequency region is noise-like and gives some additional information on the total power in that band. PNS can be switched on a scalefactor band basis, so even if only some spectral regions have a noisy structure PNS can be used to save bits. In the decoder, randomly generated noise is inserted into the appropriate spectral region according to the power level signaled within the bit-stream. From the above description it is obvious that the most challenging task in the context of PNS is not entering the appropriate information into the bitstream but reliably determining which spectral regions may be treated as noise-like and thus coded using PNS without creating severe coding artifacts. A lot of work has been done on this task, most of which is reflected in [20].
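A decoder-side sketch of PNS: only the band's energy was transmitted, so the decoder fills the band with random values rescaled to carry exactly that energy. The seed and the Gaussian source are illustrative choices of mine, not the standard's generator:

```python
import math
import random

def pns_synthesize(band_energy, band_width, seed=0):
    # No spectral lines were transmitted for this band, only its energy.
    # Insert random noise, then rescale so the synthesized band carries
    # exactly the signaled energy.
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(band_width)]
    norm = math.sqrt(band_energy / sum(v * v for v in noise))
    return [v * norm for v in noise]

# Reconstruct a 16-line scalefactor band whose signaled energy is 3.5.
band = pns_synthesize(band_energy=3.5, band_width=16)
```

Because only one energy value per band is transmitted instead of up to dozens of quantized spectral lines, the saving for genuinely noise-like bands is substantial.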

4. TwinVQ

To increase coding efficiency for musical signals at very low bit-rates, TwinVQ-based coding tools are part of the General Audio coding system in MPEG-4 audio. The basic idea is to replace the conventional encoding of scalefactors and spectral data used in MPEG-4 AAC by an interleaved vector quantization applied to a normalized spectrum (see [10,11]). The rest of the processing chain remains identical, as can be seen in Fig. 2.

Fig. 3. Weighted interleave vector quantization.

Fig. 3 visualizes the basic idea of the weighted interleaved vector quantization (TwinVQ) scheme. The input signal vector (spectral coefficients) is interleaved into subvectors. These subvectors are quantized using vector quantizers. TwinVQ can achieve a higher coding efficiency at the cost of always creating a minimum amount of loss in audio quality. Thus, the break-even point between TwinVQ and MPEG-4 AAC lies at fairly low bit-rates (below 16 kbit/s per channel).
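The interleaving and the per-subvector nearest-neighbour search can be sketched as follows. The codebook and sizes are toy values of mine; a real TwinVQ encoder additionally applies perceptual weighting to the distance measure:

```python
def interleave(spectrum, n_subvectors):
    # Subvector i collects lines i, i+n, i+2n, ...  Interleaving spreads
    # each codebook's distortion evenly across the whole spectrum.
    return [spectrum[i::n_subvectors] for i in range(n_subvectors)]

def deinterleave(subvectors):
    # Inverse mapping (assumes equal-length subvectors).
    n = len(subvectors)
    length = sum(len(sv) for sv in subvectors)
    return [subvectors[j % n][j // n] for j in range(length)]

def nearest_codevector(vector, codebook):
    # Exhaustive nearest-neighbour search, the core of any VQ encoder;
    # only the winning index is transmitted.
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(vector, c))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

flat = [0.9, 0.1, -0.4, 0.8, 0.2, -0.5, 0.7, 0.0]   # a toy normalized spectrum
subs = interleave(flat, 2)
codebook = [[0.0, 0.0, 0.0, 0.0],
            [1.0, 0.0, 0.0, 1.0],
            [1.0, -0.5, 0.0, 0.5]]
index = nearest_codevector(subs[0], codebook)
```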

5. Speech coding in MPEG-4 Audio

5.1. Basics of speech coding [4,12]

Most recent speech coding algorithms can be categorized as spectrum coding or hybrid coding. Spectrum coding models the input speech signal based on a vocal tract model which consists of a signal source and a filter, as shown in Fig. 4. A set of parameters obtained by analyzing the input signal is transmitted to the receiver. Hybrid coding synthesizes an approximated speech signal based on a vocal tract model. A set of parameters used for this first synthesis is modified


Fig. 4. Vocal tract model.

to minimize the error between the original and the synthesized speech signals. A best parameter set can be searched for by repeating this analysis-by-synthesis procedure. The obtained set of parameters is transmitted to the receiver as the compressed data after quantization. In the decoder, the parameters for the source and for linear prediction (LP) synthesis filtering are recovered by inverse quantization. These parameter values are used to operate the same vocal tract model as in the encoder. Fig. 5 depicts a block diagram of hybrid speech coding. Source and LP Synth Filter in Fig. 5 correspond to those in Fig. 4. During the parameter search, the error between the input signal and the synthesized signal is weighted by a PW (perceptually weighted) filter. This filter has a frequency response which takes the human auditory system into consideration, so a perceptually best parameter selection can be achieved.

5.2. Overview of the MPEG-4 Natural Speech Coding Tools

The MPEG-4 Natural Speech Coding Tool Set [8] provides a generic coding framework for a wide range of applications with speech signals. Its bit-rate coverage spans from as low as 2 kbit/s up to 23.4 kbit/s. Two different bandwidths of the input speech signal

Fig. 5. Hybrid speech coding.

are covered, namely 4 and 7 kHz. The MPEG-4 Natural Speech Coding Tool Set contains two algorithms: harmonic vector excitation coding (HVXC) and code excited linear predictive coding (CELP). HVXC is used at the low bit-rates of 2 or 4 kbit/s. Bit-rates higher than 4 kbit/s, as well as 3.85 kbit/s, are covered by CELP. The algorithmic delay of either algorithm is comparable to that of other standards for two-way communications; therefore, the MPEG-4 Natural Speech Coding Tool Set is also applicable to such applications. Storage of speech data and broadcast are also promising applications. The specifications of the MPEG-4 Natural Speech Coding Tool Set are summarized in Table 1. MPEG-4 is based on tools, each of which can be combined according to the user's needs. HVXC consists of an LSP (line spectral pair) VQ (vector quantization) tool and a harmonic VQ tool. The RPE (regular pulse excitation) tool, the MPE (multipulse excitation) tool, and the LSP VQ tool form CELP. The RPE tool is allowed only for the wideband mode because of its simplicity, at the expense of quality. The LSP VQ


Table 1
Specifications of MPEG-4 Natural Speech Coding Tools

HVXC
  Sampling frequency: 8 kHz
  Bandwidth: 300–3400 Hz
  Bit-rate: 2000 and 4000 bit/s
  Frame size: 20 ms
  Delay: 33.5–56 ms
  Features: multi-bit-rate coding, bit-rate scalability

CELP (narrowband / wideband)
  Sampling frequency: 8 kHz / 16 kHz
  Bandwidth: 300–3400 Hz / 50–7000 Hz
  Bit-rate: 3850–12 200 bit/s (28 bit-rates) / 10 900–23 800 bit/s (30 bit-rates)
  Frame size: 10–40 ms / 10–20 ms
  Delay: 15–45 ms / 15–26.75 ms
  Features: multi-bit-rate coding, bit-rate scalability, bandwidth scalability

Fig. 6. MPEG-4 Natural Speech Coding Tool Set.

tool is common in both HVXC and CELP. MPEG-4 Natural Speech Coding Tools are illustrated in Fig. 6.
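The source-filter vocal tract model of Section 5.1 (Fig. 4) can be sketched as a pulse-train source driving an all-pole LP synthesis filter. The coefficients and the pitch period below are arbitrary illustrative values, not parameters from the standard:

```python
def lp_synthesis(excitation, lp_coeffs):
    # All-pole "vocal tract" filter: y[n] = e[n] + sum_k a_k * y[n-k].
    out = []
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, a in enumerate(lp_coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out

def pulse_train(length, period):
    # Crude voiced-speech source: one pulse per pitch period.
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

# Source + LP synthesis filter as in Fig. 4 (stable toy coefficients).
speech = lp_synthesis(pulse_train(80, 20), [0.9, -0.2])
```

The encoder's job, in both spectrum and hybrid coding, reduces to finding and quantizing the source parameters (pitch, gains) and the filter coefficients that make this model's output match the input speech.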

5.3. Functionalities of MPEG-4 Natural Speech Coding Tools

MPEG-4 Natural Speech Coding Tools differ from other existing speech coding standards such as ITU-T G.723.1 and G.729 in three new functionalities: multi-bit-rate coding,¹ bit-rate scalable coding, and bandwidth scalable coding. These new functionalities characterize MPEG-4 Natural Speech Coding Tools. It should be noted that bandwidth scalability is available only for CELP.

¹ An arbitrary bit-rate may be selected with a 200 bit/s step by simply changing the parameter values.

5.3.1. Multi-bit-rate coding

Multi-bit-rate coding provides flexible bit-rate selection with the same coding algorithm. This has not been available before; different codecs were needed for different bit-rates. In multi-bit-rate coding, a bit-rate is selected among multiple available bit-rates upon establishment of a connection between the communicating parties. The bit-rate for CELP may be selected with a step as small as 0.2 kbit/s. The frame length, the number of subframes per frame, and the selection of the excitation codebook are modified for different bit-rates [17]. For HVXC, 2 or 4 kbit/s can be selected as the bit-rate. In addition to multi-bit-rate coding, bit-rate control with a smaller step is available for CELP through fine-rate control (FRC), which also provides additional bit-rates not available in multi-bit-rate coding. The bit-rate may deviate frame by frame from a specified bit-rate according to the input-signal characteristics. When the spectral envelope, approximated by the LP synthesis filter, has small variations in time, transmission of linear-prediction coefficients may be skipped once every two frames for a reduced average bit-rate [22]. Linear prediction coefficients in the current and the following frames are compared to decide whether those in the following frame are to be transmitted. In the decoder, the missing LP coefficients in a frame are interpolated from those in the previous and the following frames. Therefore, FRC requires a one-frame delay to make the data in the following frame available in the current frame.

5.3.2. Scalable coding

Bit-rate and bandwidth scalabilities are useful for multicast transmission. The bit-rate and the


bandwidth can be independently selected for each receiver by simply stripping off a part of the bit-stream. Scalability requires only a single encoder to transmit the same data to multiple points connected at different rates. Such a case can be found in connections between a cellular network with mobile terminals and a digital network with fixed multimedia terminals, as well as in multipoint teleconferencing. The encoder generates a single common bit-stream by scalable coding for all the recipients instead of independent bit-streams at different bit-rates. The scalable bit-stream has a layered structure with a core bit-stream and enhancement bit-streams. The bit-rate control is performed by adjusting the combination of the enhancement bit-streams depending on the specified bit-rate. The core bit-stream guarantees, at least, reconstruction of the original speech signal with a minimum speech quality. Additional enhancement bit-streams, which may be available depending on the network condition, increase the quality of the decoded signal. HVXC and CELP may be used to generate the core bit-stream while the enhancement bit-streams are generated by TwinVQ or AAC; they can also generate both the core and the enhancement bit-streams. Scalabilities in MPEG-4/CELP are depicted in Fig. 7. They include bit-rate scalability and bandwidth scalability, which reduce signal distortion or achieve better speech quality with high-frequency components by adding enhancement bit-streams to the core bit-stream. These enhancement bit-streams contain detailed characteristics of the input signal or components in higher frequency bands. For example, the output of Decoder A in Fig. 7 is the minimum-quality signal decoded from the 6 kbit/s core bit-stream. The Decoder B output is a high-quality signal decoded from an 8 kbit/s bit-stream. Decoder C provides a higher-quality signal decoded from a 12 kbit/s bit-stream.
The Decoder D output, in contrast, has a wider bandwidth. This wideband signal is decoded from a 22 kbit/s bit-stream. The 10 kbit/s of high-frequency components provide increased naturalness compared with Decoder C. Bandwidth scalability is provided only by the MPE tool.
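The layered decoding just described amounts to keeping the core and as many enhancement layers as fit a target rate, something any receiver or network node can do without re-encoding. A sketch, with layer sizes loosely modeled on Fig. 7's decoders (6/8/12/22 kbit/s totals), not taken from the standard:

```python
def strip_bitstream(core_rate, enhancement_rates, target_rate):
    # Keep the core layer, then add enhancement layers in order while
    # the running total still fits the target bit-rate.
    kept = [core_rate]
    total = core_rate
    for rate in enhancement_rates:
        if total + rate > target_rate:
            break
        kept.append(rate)
        total += rate
    return kept, total

layers = [2, 4, 10]                 # enhancement layers in kbit/s
kept_b, rate_b = strip_bitstream(6, layers, 8)    # Decoder B's view
kept_d, rate_d = strip_bitstream(6, layers, 22)   # Decoder D's view
```

The ordering matters: a layer is only decodable if every layer below it was kept, which is why stripping always proceeds from the top of the stack downward.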


Fig. 7. Scalabilities in MPEG-4/CELP.

Table 2
Bandwidth scalable bit-streams

Core bit-stream (bit/s)   Enhancement bit-stream (bit/s)
3850–4650                 9200, 9467, 10 000, 11 600
4900–5500                 10 400, 10 667, 11 200, 12 800
5700–10 700               11 600, 11 867, 12 400, 14 000
11 000–12 200             12 400, 12 667, 13 200, 14 800

The unit bit-rate for the enhancement bit-streams in bit-rate scalability is 2 kbit/s for the narrowband and 4 kbit/s for the wideband. In the case of bandwidth scalable coding, the unit bit-rate for the enhancement bit-streams depends on the total bit-rate and is summarized in Table 2.

5.4. Outline of the algorithms

5.4.1. HVXC

A basic block diagram of HVXC is depicted in Fig. 8. HVXC first performs LP analysis to find the LP coefficients. Quantized LP coefficients are


Fig. 8. HVXC.

Fig. 9. CELP.

supplied to the inverse LP filter to find the prediction error. The prediction error is transformed into the frequency domain, and the pitch and the envelope of the spectrum are analyzed. The envelope is quantized by weighted vector quantization in voiced sections. In unvoiced sections, a closed-loop search of an excitation vector is carried out.

5.4.2. CELP

Fig. 9 shows a block diagram of CELP. The LP coefficients of the input signal are first analyzed and then quantized to be used in an LP synthesis filter driven by the output of the excitation codebooks. Encoding is performed in two steps. Long-term prediction coefficients are calculated in the first step. In the second step, a perceptually weighted error between the input signal and the output of the LP synthesis filter is minimized. This minimization is achieved by searching for an appropriate codevector in the excitation codebooks. Quantized coefficients, as well as indexes to the codevectors of the excitation codebooks and the long-term prediction coefficients, form the bit-stream. The LP coefficients are quantized by vector quantization, and the excitation can be either MPE [19] or regular pulse excitation (RPE) [13]. MPE and RPE both model the excitation signal by multiple pulses; the difference lies in the degrees of freedom for pulse positions. MPE allows more freedom in the interpulse distance than RPE, which has a fixed interpulse distance. Thanks to this flexible interpulse distance, MPE achieves better coding quality than RPE [7]. On the other hand, RPE requires fewer computations than MPE, trading off coding quality. Such a low computational requirement is useful in wideband coding, where the total computation is naturally higher than in narrowband coding. The excitation signal types of MPEG-4/CELP are summarized in Table 3.
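The codebook search of Section 5.4.2 can be sketched in analysis-by-synthesis form: each candidate excitation is run through the LP synthesis filter, an optimal gain is fitted, and the index with the smallest residual error wins. Perceptual weighting and the adaptive (long-term) codebook are omitted, and the codebook and coefficients are toy values of mine:

```python
def synthesize(codevector, lp_coeffs):
    # Excitation driven through the all-pole LP synthesis filter.
    out = []
    for n in range(len(codevector)):
        acc = codevector[n]
        for k, a in enumerate(lp_coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out

def search_codebook(target, codebook, lp_coeffs):
    # Analysis-by-synthesis loop: synthesize every candidate, fit the
    # least-squares gain, keep the index minimizing the residual error.
    best = (None, 0.0, float("inf"))
    for idx, cv in enumerate(codebook):
        syn = synthesize(cv, lp_coeffs)
        num = sum(t * s for t, s in zip(target, syn))
        den = sum(s * s for s in syn) or 1e-12
        gain = num / den
        err = sum((t - gain * s) ** 2 for t, s in zip(target, syn))
        if err < best[2]:
            best = (idx, gain, err)
    return best   # (index, gain, error): index and gain go in the bit-stream

lp = [0.8]
book = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, -1.0],
        [1.0, -1.0, 1.0, -1.0]]
# A target that is exactly a scaled synthesis of codevector 1.
target = [0.5 * v for v in synthesize(book[1], lp)]
index, gain, error = search_codebook(target, book, lp)
```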

Table 3
CELP excitation signal

Excitation   Bandwidth      Features
MPE          Narrow, wide   Quality, scalability
RPE          Wide           Complexity

5.5. MPEG-4/CELP with MPE

MPEG-4/CELP with MPE is the most complete combination of the tools in MPEG-4 Natural Speech Coding Tools. It provides all three new functionalities. Therefore, it is useful to explain MPEG-4/CELP with MPE in more detail to show how these functionalities are realized in the algorithm. A block diagram of the encoder of MPEG-4/CELP with MPE is depicted in Fig. 10. It consists of three modules: a CELP core encoder, a bit-rate scalable (BRS) tool, and a bandwidth extension (BWE) tool. The CELP core encoder provides the basic coding functions which were explained with Fig. 9 in Section 5.4.2. The BRS tool provides the bit-rate scalability. The residual of the narrowband signal, mode information, LP coefficients, quantized LSP coefficients, and the multipulse excitation signal are transferred from the core encoder to the BRS tool as input signals. The BWE tool is used for the bandwidth scalability. Quantized LSP coefficients and the pitch delay indexes, as well as the wideband speech to be encoded, are supplied from the core encoder to the BWE tool. In addition to these input signals, the narrowband multipulse excitation is needed in the BWE tool. This excitation is supplied either from the BRS tool, when bit-rate scalability is implemented, or from the core encoder. When bandwidth scalability is provided, a downsampled narrowband signal is supplied to the core encoder. Because of this downsampling operation, an additional 5-ms look-ahead of the input signal is necessary for wideband signals.

5.5.1. CELP core encoder

Fig. 11 depicts a block diagram of the CELP core encoder. It performs LP analysis and pitch analysis on the input speech signal. The obtained LP coefficients, the pitch lag (phase or delay), the pitch and MPE gains, and the excitation signal are encoded, as well as mode information. The LP coefficients, in the LSP domain, are encoded frame by frame by predictive VQ. The pitch lag is encoded subframe by subframe using adaptive codebooks. The MPE is modeled by multiple pulses whose positions and polarities (±1) are encoded. The pitch and the MPE gains are normalized by an average subframe power, followed by multimode encoding [19]. The average subframe power is scalar-quantized in each frame.

5.5.1.1. LSP quantization. A two-stage partial prediction and multistage vector quantization (PPM-VQ) [21] is employed for LSP quantization. This

Fig. 10. MPEG-4/CELP with MPE.


Fig. 11. CELP core encoder.

Fig. 12. Partial prediction and multistage vector quantization (PPM-VQ).

quantizer, as shown in Fig. 12, operates either in the standard VQ mode or in the PPM-VQ mode, which utilizes interframe prediction, depending on the quantization errors. The standard VQ mode operates as a common two-stage VQ which quantizes the error of the first stage in the second stage. In the PPM-VQ mode, by contrast, the difference T_n between the input LSP C_n and its predicted output is quantized as T_n = C_n − (b Q + (1.0 − b)