Sparse Representations in Audio & Music: from Coding to Source Separation


M. D. Plumbley, T. Blumensath, L. Daudet, R. Gribonval and M. E. Davies

Abstract—Sparse representations have proved a powerful tool in the analysis and processing of audio signals and already lie at the heart of popular coding standards such as MP3 and Dolby AAC. In this paper we give an overview of a number of current and emerging applications of sparse representations in areas from audio coding, audio enhancement and music transcription to blind source separation solutions that can solve the “cocktail party problem”. We also explore the similarity between under-determined source separation and the new paradigm of compressed sensing. In each case we will show how the prior assumption that the audio signals are approximately sparse in some time-frequency representation allows us to address the associated signal processing task.

I. INTRODUCTION

Over recent years there has been growing interest in finding ways to transform signals into sparse representations, i.e. representations where most coefficients are zero. These sparse representations are proving to be a particularly interesting and powerful tool for the analysis and processing of audio signals. Audio signals are typically generated either by resonant systems or by physical impacts, or both. Resonant systems produce sounds that are dominated by a small number of frequency components, allowing a sparse representation of the signal in the frequency domain. Impacts produce sounds that are concentrated in time, allowing a sparse representation of the signal either directly in the time domain or in terms of a small number of wavelets. The use of sparse representations therefore appears to be a very appropriate approach for audio.

In this article, we will examine a range of applications of sparse representations to audio and music signals. We will see how this concept of sparsity can be used to design new methods for audio coding which have improved performance over non-sparse methods; how it can be used to perform denoising and enhancement on degraded audio signals; and how it can be used to separate source signals from mixed audio signals, particularly when there are more sources than microphones. We will touch on the new technique of compressed sensing (CS), which uses sparse representations to reconstruct signals from a small number of measurements. Finally, we will also see how finding a sparse decomposition can lead to a note-like representation of musical signals, similar to automatic music transcription.

A. Sparse Representations of an Audio Signal

Suppose we have a sampled audio signal with T samples x(t), 1 ≤ t ≤ T, which we can write in row-vector form as x̄ = (x(1), ..., x(T)). For audio signals we are typically dealing with frequency content below about 20 kHz, but for simplicity we will assume our sampled time t takes integer values. It is often convenient to decompose x̄ into a weighted sum of Q basis vectors φ̄_q = (φ_q(1), ..., φ_q(T)), with the contribution of the q-th basis vector weighted by a coefficient u_q:

    x(t) = \sum_{q=1}^{Q} u_q \phi_q(t)    or    \bar{x} = \sum_{q=1}^{Q} u_q \bar{\phi}_q    (1)

or in matrix form

    \bar{x} = \bar{u} \Phi    (2)

where Φ is the matrix with elements [Φ]_{qt} = φ_q(t).

The most familiar representation of this type in audio signal processing is the (discrete) Fourier representation. Here we have the same number of basis vectors as signal samples (Q = T), and the basis matrix elements are given by

    \phi_q(t) = \frac{1}{T} \exp\left( \frac{2\pi j}{T} qt \right)    (3)

where j = \sqrt{-1}. Now it remains for us to find the coefficients u_q in this representation of x̄. In the case of our Fourier representation, this is straightforward: the matrix Φ is square and invertible, so ū can be calculated directly as ū = x̄ Φ^{-1} = x̄ (T Φ^H), where the superscript ·^H denotes the conjugate transpose.

Signal representations corresponding to invertible transforms such as the DFT, the discrete cosine transform (DCT), or the discrete wavelet transform (DWT) are convenient and easy to calculate. However, it is possible to find many alternative representations. In particular, if we allow the number of basis vectors (and hence coefficients) to exceed the number of signal samples, Q > T, then solving (2) for the representation coefficient vector ū is in general not unique: there will be a whole (Q - T)-dimensional subspace of vectors ū which satisfy x̄ = ūΦ. In this case we say that (2) is underdetermined. A common choice in this situation is to use the Moore-Penrose pseudoinverse Φ†, yielding ū = x̄Φ†. However, in this article we are interested in finding representations that are sparse, i.e. representations where only a small number of the coefficients of ū are non-zero.
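To make (1)-(3) concrete, here is a minimal numerical sketch (our own Python/NumPy, not from any particular toolbox) that builds the Fourier basis of (3) for a short signal and recovers the coefficients via ū = x̄(TΦ^H):

```python
import numpy as np

T = 8
t = np.arange(T)
q = np.arange(T)

# Fourier basis of eq. (3): [Phi]_{qt} = (1/T) exp(2*pi*j*q*t/T)
Phi = np.exp(2j * np.pi * np.outer(q, t) / T) / T

x = np.random.randn(T)          # a toy "audio" signal (row vector)
u = x @ (T * Phi.conj().T)      # u = x Phi^{-1} = x (T Phi^H)

assert np.allclose(u @ Phi, x)  # eq. (2): the representation reconstructs x
```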

B. Advantages of sparse representations

Finding a sparse representation for a signal has many advantages for applications such as coding, enhancement, or source separation. In coding, a sparse representation has only a few non-zero values, so only these values (and their locations) need to be encoded to transmit or store the signal. In enhancement, the noise or other disturbing signal is typically not represented by the same coefficients as the sparse signal, so discarding the "wrong" coefficients can remove a large proportion of the unwanted noise, leaving a much cleaner restored signal. Finally, in source separation, if each signal to be separated has a sparse representation, then there is a good chance that there will be little overlap between the small sets of coefficients used to represent the different source signals. Therefore, by selecting the coefficients "used" by each source signal, we can restore each of the original signals with most of the interference from the unwanted signals removed.

For typical steady-state audio signals, the Fourier representation already does quite a good job of providing an approximately sparse representation. If an audio signal consists of only stationary oscillations, without onsets or transients, a representation based on a short-time Fourier transform (STFT) or a modified discrete cosine transform (MDCT) [22] will include some large-amplitude coefficients corresponding to the main frequencies of oscillation of the signal, with little energy in between these. However, audio signals also typically contain short transients at the onsets of musical notes or other sounds. These do not have a sparse representation in an STFT or MDCT basis; instead, such a representation would require a large number of frequency components to be active simultaneously. One approach to overcome this is therefore to look for a representation in terms of a union of bases, each with different time-frequency characteristics. For example, we could create a "tall, thin" basis matrix

    \Phi = \begin{pmatrix} \Phi_C \\ \Phi_W \end{pmatrix}    (4)

composed of both an MDCT basis Φ_C, designed to represent the steady-state sinusoidal parts, and a wavelet basis Φ_W, designed to represent the transient, edge-like parts. We could write this representation as

    \bar{x} = \bar{u}\Phi = \bar{u}_C \Phi_C + \bar{u}_W \Phi_W    (5)

where the joint coefficient vector ū = (ū_C ū_W) is a concatenation of the MDCT and wavelet coefficients. Many other unions are possible, such as unions of MDCT bases with differing time-frequency resolutions. While the number of basis vectors, and hence the number of possible coefficients, is larger in a union of bases, we may find that the resulting representation has fewer non-zero coefficients and is therefore sparser (Fig. 1).

Fig. 1. Representations of an audio signal in (a) a single MDCT basis ("1xMDCT") and (b) a union of eight MDCT bases with different window sizes ("8xMDCT"): time-frequency plots of the 30 dB approximations of a glockenspiel signal obtained with matching pursuit.

C. Recovering sparse representations

Finding a sparse representation when the system is underdetermined is not quite so straightforward as in the square and invertible case. Finding the true sparsest representation ū,

    \arg\min_{\bar{u}} \{ \|\bar{u}\|_0 \mid \bar{x} = \bar{u}\Phi \}    (6)

where the 0-norm ||ū||_0 is the number of non-zero elements of ū, is an NP-hard problem, so would take us a very long time to solve. However, it is possible to find an approximate solution to this. One method is to use the so-called Basis Pursuit relaxation, where instead of looking to solve (6) we look for a solution to the easier problem

    \arg\min_{\bar{u}} \{ \|\bar{u}\|_1 \mid \bar{x} = \bar{u}\Phi \}    (7)

where the 1-norm ||ū||_1 = Σ_q |u_q| is the sum of the absolute values of the coefficients. Eqn. (7) is equivalent to a linear program (LP), and can be solved by a range of general or specialist methods [12]. Another alternative is to use a greedy algorithm to find an approximation to (6). For example, the well-known matching pursuit (MP) algorithm [21] and the orthogonal matching pursuit (OMP) algorithm [25] are examples of this type of greedy algorithm. There are many more in the literature, and considerable recent work in the area of sparse representations has concentrated on theoretically optimal and practically efficient methods to find solutions (or approximate solutions) to (6) or (7). Nevertheless, MP is still often used in real-world problems since efficient implementations are available, such as the Matching Pursuit Toolkit (MPTK, mptk.irisa.fr) [20].
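As an illustration of the greedy approach, here is a minimal matching pursuit sketch (our own simplified Python, not the optimized MPTK implementation), assuming a dictionary whose rows are real, unit-norm atoms φ̄_q:

```python
import numpy as np

def matching_pursuit(x, Phi, n_iter=100, tol=1e-6):
    """Greedy sparse approximation of x: Phi is Q x T with unit-norm
    atom rows; returns a coefficient vector u with x ~= u @ Phi."""
    residual = x.astype(float).copy()
    u = np.zeros(Phi.shape[0])
    for _ in range(n_iter):
        corr = Phi @ residual              # correlation of each atom with the residual
        k = int(np.argmax(np.abs(corr)))   # best-matching atom
        u[k] += corr[k]                    # update its coefficient
        residual -= corr[k] * Phi[k]       # subtract the atom's contribution
        if np.linalg.norm(residual) < tol:
            break
    return u
```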
II. CODING

Coding is arguably the most straightforward application of sparse representations. Indeed, reversibly transforming the signal into a new domain where the information is concentrated in a few terms is the main idea underlying data compression. The transform coder is a classical technique used in source coding [16]. When the transform is orthonormal it can be shown (under certain assumptions) that the gain achievable through transform coding is directly linked to the transform's ability to concentrate the energy of the signal in a small number of coefficients. However, this problem is not as straightforward as it may seem, since there is no single fixed orthonormal basis in which all audio signals have a sparse representation. Audio signals are in general quite diverse in nature: they mostly have a strong tonal part, but also some lower-energy components, such as transients (at note attacks) and wide-band noise, that are nonetheless important in the perception of audio signals. These tonal, transient and noise components are optimally represented in bases with different respective requirements in terms of time-frequency localization.

We will consider two main approaches to handle this issue. The first approach is to find an adapted orthonormal basis, best fitted to the local features of the signal. This is the technique employed in most state-of-the-art commercial audio codecs, such as MPEG 2/4 Advanced Audio Coding (AAC). The second approach uses dictionary redundancy to accommodate this variety of features, leading to a sparser representation, but one where each coefficient carries more information.

A. Coding in adapted orthogonal bases

For coding, using an orthonormal basis seems an obvious choice. Orthonormal bases yield invertible transforms with no redundancy, so the number of coefficients in the transform domain is equal to the number of samples. Many multimedia signals have compact representations in orthonormal bases: for example, images are often well suited to wavelet representations (EZW, JPEG 2000). Furthermore, several orthonormal schemes also have fast implementations due to the special structure of the basis, such as the FFT for implementing the DFT, or Mallat's multiresolution algorithm for the DWT.

For audio signals, a natural choice for an orthonormal transform might be one based on the STFT. However, for real signals the Balian-Low theorem tells us that there cannot be a real orthonormal transform based on local Fourier transforms with nice regularity properties both in time and frequency. To overcome this we can use so-called lapped orthogonal transforms, which exploit special aliasing cancellation properties of the cosine transform when the window obeys two conditions on symmetry and energy preservation. The discrete version of this class of transforms leads to the modified discrete cosine transform (MDCT) [22], with atoms of the form

    \bar{\phi}_{k,p}(t) = h(\tau) \sqrt{\frac{2}{L}} \cos\left[ \frac{\pi}{L} \left( \tau + \frac{1+L}{2} \right) \left( k + \frac{1}{2} \right) \right]    (8)

with L the frame size, τ = t - pL, and window h(τ) defined for τ = 0, ..., 2L - 1. Again, there are a number of fast implementations of the MDCT based on the FFT. The MDCT is one key to the success of the ubiquitous "MP3" (MPEG-1 Layer III) coding standard, and is now used in the majority of state-of-the-art coding standards, such as MPEG 2/4 AAC.

Using the simple MDCT as described above has severe limitations. Firstly, it is not shift-invariant: at very low bitrates, this can lead to so-called "warbling artefacts", or "birdies" (as these distortions appear most notably at the higher end of the spectrum). Secondly, the time resolution is limited: for a typical frame size of L = 1024 samples at a 44.1 kHz sampling frequency, the frequency resolution is 43 Hz and the time resolution is 23 ms. For some very transient signals, such as drums or attacks at note onsets, this value is quite large: it leads to what are known as pre-echo artefacts, where the quantization noise "leaks" across the whole window, before the actual burst of energy. However, the MDCT offers an extra degree of freedom in the choice of the window.
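To make (8) concrete, here is a small sketch (our own Python, assuming a sine window, which satisfies the symmetry and energy-preservation conditions mentioned above) that builds a single MDCT atom:

```python
import numpy as np

def mdct_atom(L, k, p, T):
    """MDCT atom of eq. (8): frame size L, frequency index k, frame index p,
    embedded in a length-T signal (assumes (p + 2) * L <= T)."""
    tau = np.arange(2 * L)                        # local time within the window
    h = np.sin(np.pi * (tau + 0.5) / (2 * L))     # assumed sine (Princen-Bradley) window
    atom = h * np.sqrt(2.0 / L) * np.cos(
        (np.pi / L) * (tau + (1 + L) / 2.0) * (k + 0.5))
    phi = np.zeros(T)
    phi[p * L : p * L + 2 * L] = atom             # tau = t - pL
    return phi
```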

This leads to the field of adaptive (orthogonal) transforms: when the encoding algorithm detects that the signal is transient in nature, it switches to a "small window" type, whose size is typically one eighth of the long window. The transition from long windows to short windows (and vice versa) is performed by asymmetric windows.

B. Coding in overcomplete bases

Using overcomplete bases for coding may at first seem counter-intuitive, as the number of analysis coefficients is increased. However, we can take advantage of the extra degrees of freedom to increase the sparsity of the set of coefficients: the larger the dictionary, the sparser the solution that can be expected. Only those coefficients which are deemed to be significant are transmitted and coded, i.e. x(t) ≈ Σ_{γ∈Γ} u_γ φ_γ(t), where Γ is a small subset of indices. However, the size of the dictionary cannot be increased at will, for two reasons. Firstly, solving the inverse problem is computationally intensive, and very large dictionaries may lead to overly long computations. Secondly, not only must the values {u_γ}_{γ∈Γ} be transmitted, but the subset {γ | γ ∈ Γ} of significant parameters must itself also be specified.

In [27], the simultaneous use of M = 8 MDCT bases was proposed and evaluated, where the scales (frame sizes) go as powers of two, L_m = L_0 2^m, m = 1, ..., 8, with window lengths from 128 to 16384 samples (2.9 ms to 370 ms). The 8-times overcomplete dictionary is now D_m = {φ̄_{mkp} | 0 ≤ p < P_m, 0 ≤ k < L_m}. To reduce pre-echo, large windows are removed from the dictionary near onsets. Finally, the significant coefficients {u_γ}_{γ∈Γ} are quantized and encoded together with their parameters {γ | γ ∈ Γ}. For the sake of efficiency, the sparse decomposition is performed using the matching pursuit algorithm [21]. Formal listening tests have shown that this coder (named "8*MDCT") outperforms MPEG-2 AAC at very low bitrates (around 24 kbps) for some simple sounds, while being of similar quality for complex, polyphonic sounds. At the highest bitrates (above 64 kbps), where transform representations are near optimal [16], having to encode the extra scale parameter becomes a heavy penalty, and the overcomplete dictionary performs slightly worse than the (adapted) orthogonal basis, although transparency can still be obtained in both cases.
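The trade-offs between these scales are easy to tabulate; a quick sketch (assuming L_0 = 64 samples, so that the window lengths match the 128-16384 range quoted above, at 44.1 kHz):

```python
fs = 44100                                   # sampling rate (Hz), as in [27]
lengths = [64 * 2**m for m in range(1, 9)]   # assumed L0 = 64: windows 128 .. 16384

for L in lengths:
    print(f"window {L:6d} samples -> {1000 * L / fs:6.1f} ms, "
          f"~{fs / L:7.1f} Hz line spacing")
```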


C. New trends

A further key advantage of using overcomplete representations such as "8*MDCT" is that a large part of the information is carried by the significant scale-frequency-time parameters {γ = (m, k, p) | γ ∈ Γ}, which provide directly interpretable information about the signal content. This can be useful, for instance, in audio indexing for data mining: if a large sound database is available in an encoded format, a large quantity of user-intuitive information can easily be inferred from the sparse representation, at very low computational cost. The "8*MDCT" representation was found to have performance similar to the state of the art in common music information retrieval tasks (e.g. rhythm extraction, chord analysis, and genre classification), while MP3 and AAC codecs performed well only in rhythm extraction, due to the poor frequency resolution of those transforms for the other tasks [28].

Sparse overcomplete representations also offer a step towards the "Holy Grail" of audio coding: object coding [4]. In this paradigm, any recording would be decomposed into a number of elementary constituents, such as notes or instruments' melodic lines, that could be rearranged at will without perceivable loss in sound quality. Of course, this is far out of reach for current technology if we make no further assumptions on the signal, as it would imply that we were able to fully solve both the "hard" problem of polyphonic transcription and the underdetermined source separation problem. However, some attempts in very restricted cases [31], [13] indicate that this may be the right approach towards "musically intelligent" coding.

D. Application to denoising

Finding an efficient encoding of an audio signal based on sparse representations can also help us with audio denoising. While the desired part of the signal is well captured by the sparse representation, noise is typically poorly represented by it. By transforming our signal to its sparse representation, discarding the smaller coefficients, and reconstructing the signal again, we have a simple way to suppress a significant part of the signal noise.

Many improvements can be made over this simple model. If denoising is considered in a Bayesian framework, the task is to estimate the most probable original signal given the corrupted observation. Such a Bayesian framework allows the inclusion of structural priors for musical audio objects that take into account the 'vertical' frequency structure of transients and the 'horizontal' structure of tonal components, as well as the variance of the residual noise. Such a structured model can help to reduce the so-called 'birdies' or 'musical noise' that can occur due to switching of time-frequency coefficients. Calculating the sparse representation is then more complex than with a straightforward Basis Pursuit method, but Markov chain Monte Carlo (MCMC) methods have been used for this [15].
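A bare-bones sketch of the simple thresholding scheme described above (our own Python; a real system would use an MDCT and the structured priors of [15] rather than a plain orthonormal DCT):

```python
import numpy as np
from scipy.fft import dct, idct

def denoise_by_thresholding(x, keep=0.05):
    """Keep only the largest-magnitude transform coefficients and reconstruct."""
    u = dct(x, norm='ortho')                 # orthonormal DCT-II analysis
    n_keep = max(1, int(keep * len(u)))
    thresh = np.sort(np.abs(u))[-n_keep]     # magnitude of the n_keep-th largest
    u[np.abs(u) < thresh] = 0.0              # discard small (mostly noise) coefficients
    return idct(u, norm='ortho')             # resynthesize the cleaned signal
```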

III. SOURCE SEPARATION

In many applications, audio recordings are mixtures of underlying audio signals, and it is desirable to recover those original signals. For example, in a meeting room we may have several microphones, but each one collects a mixture of several talkers. To automatically transcribe the minutes of a meeting, a first step would be to separate these into one channel per talker. Sparse representations can be of significant help in solving this type of source separation problem.

Let us first consider the instantaneous mixing model, where we ignore time delays and reverberation. Here we have J audio sources s_j(t), j = 1, ..., J, which are instantaneously mixed to give I observations x_i(t), i = 1, ..., I, according to

    x_i(t) = \sum_{j=1}^{J} a_{ij} s_j(t) + e_i(t)    (9)

where a_{ij} is the amount of source j that appears in observation i, and e_i(t) is noise on the observation x_i(t). This type of mixture might occur in, for example, pan-potted stereo, where early stereo recordings were produced by changing the amount of each source mixed to the left and right channels, without any time delays or other effects. We can also write (9) in vector or matrix notation as

    \mathbf{x}(t) = \sum_j \mathbf{a}_j s_j(t) + \mathbf{e}(t)    or    X = AS + E    (10)

where, e.g., the matrix X is an I × T matrix with columns x(t) and rows x̄_i, and a_j is the j-th column of the mixing matrix A = [a_{ij}].

If the noise E is small, the mixing matrix A is known, and A is square (I = J) and full rank, then we can estimate the sources using ŝ(t) = A^{-1} x(t); if we have more observations than sources (I > J) we can use the pseudo-inverse, ŝ(t) = A† x(t). If A is not known (blind source separation), then we could use a technique such as independent component analysis (ICA) to estimate it. However, if we have fewer observations than sources (I < J), then we cannot use matrix inversion (or pseudo-inversion) to unmix the sources. In this case, called underdetermined source separation [32], [9], we can use sparse representations both to help separate the sources and, for blind source separation, to estimate the mixing matrix A.

A. Underdetermined separation by binary masking

If we transform the signal mixtures x̄_i = (x_i(t))_{1≤t≤T} into a domain where they have a sparse representation, it is likely that most coefficients of the transformed mixture correspond to either none or only one of the original sources. By identifying and matching up the sources present in each coefficient, we can recover the original, unmixed sources. Suppose that our J source signals s̄_j all have sparse representations using atoms φ̄_q from a Q × T basis matrix Φ (with Q = T), i.e.,

    \bar{s}_j = \sum_q z_{jq} \bar{\phi}_q,    1 ≤ j ≤ J    (11)

where z_{jq} are the sparse representation coefficients. In matrix notation we can write S = ZΦ and Z = SΦ^{-1}. Now, denoting by U = XΦ^{-1} the representation of X in the basis Φ, for noiseless instantaneous mixing we have

    U = AZ.    (12)

For a simple special case, suppose that Z is so sparse that at most one source coefficient j_q is active at each transform index q, i.e. z_{jq} = 0 for j ≠ j_q. In other words, each column of Z contains at most one non-zero entry, and the source transformed representations are said to have disjoint supports. Then (12) becomes

    \mathbf{u}_q = \mathbf{a}_{j_q} z_{j_q,q},    1 ≤ q ≤ Q    (13)

so that each vector u_q is a scaled version of one of the mixing matrix columns a_j. Therefore, when A is known, for each q we can estimate j_q by finding the mixing matrix column a_j which is most correlated with u_q:

    \hat{j}_q = \arg\max_j \frac{|\mathbf{a}_j^T \mathbf{u}_q|}{\|\mathbf{a}_j\|_2},    1 ≤ q ≤ Q    (14)

and we construct a mask ε_{jq} = 1 if j = ĵ_q, 0 otherwise. Using this mask to identify the active sources, and multiplying (13) by a_{j_q}^T and rearranging, we get

    \hat{z}_{jq} = \varepsilon_{jq} \frac{\mathbf{a}_j^T \mathbf{u}_q}{\|\mathbf{a}_j\|_2^2}    (15)

from which we can estimate the sources as Ŝ = ẐΦ. Due to the binary nature of ε_{jq}, this approach is known as binary masking. Even though the assumption that the sources have disjoint supports in the transformed domain is not satisfied for most real audio signals and standard transforms, the binary masking approach remains relevant for obtaining accurate (although not exact) estimates of the sources as soon as they have almost disjoint supports, i.e., at each transform index q at most one source j has a non-negligible coefficient z_{jq}.

The assumption that the sources have essentially disjoint supports in the transformed domain is highly dependent on the chosen transform matrix Φ. This is illustrated in Fig. 2, where at the top we display the coefficients z̄_j of three musical sources (J = 3) in some domain Φ, in the middle we display the coefficients ū_i ∈ R² of a stereophonic mixture of the sources (I = 2) in the same domain, and at the bottom we display the scatter plot of u_q, that is to say the collection {u_q, 1 ≤ q ≤ Q}. On the left (Fig. 2-(a)), the three musical sources are playing one after another, and the transform is simply the identity matrix Φ = I, which is associated with the so-called Dirac basis. At each time instant t, a single source is active, hence the scatter plot of u_q clearly displays "spokes", with directions given by the columns a_j of A. In this simple case, the sources can be separated by simply segmenting their time-domain representation, using (14) to determine which source is active at each time instant. In the middle (Fig. 2-(b)), the three musical sources are playing together, and the transform is still the Dirac basis Φ = I. The disjoint support assumption is clearly violated in the time domain, and the scatter plot no longer reveals the directions of the columns a_j of A. On the right (Fig. 2-(c)), the same three musical sources as in Fig. 2-(b) are displayed, but in the time-frequency domain rather than the time domain, using the MDCT transform, i.e., with atoms φ̄_q given by (8). At the top we observe that, for each source, many transform coefficients are small while only a few of them are non-negligible and appear as spikes. A detailed study would show that these spikes appear at different transform indices q for different sources, so for each transform index there is at most one source coefficient which is non-negligible. This is confirmed by the scatter plot at the bottom, where we can see that the vectors u_q are concentrated along "spokes" in the directions of the columns a_j of A.

As well as allowing separation for known A, the scatter plot at the bottom of Fig. 2-(c) also illustrates that sparse representations allow us to estimate A from the data, in the blind source separation case. If at most one source coefficient is active at each transform index q, then the directions of the "spokes" in Fig. 2-(c) correspond to the columns of A. Therefore estimation of the columns a_j of A, up to a scaling ambiguity, becomes a clustering problem, which can be addressed using e.g. K-means or weighted variants [32], [2], [5].

Finally, we mention that binary masking can also be used when only one channel is available, provided that at most one source is significantly active at each time-frequency index. However, in the single-channel case we no longer have a direction a_j to tell us which source is active at which transform index q. Additional statistical information must be exploited to identify the active sources and build the separating masks ε_{jq} ∈ {0, 1}. For example, non-negative matrix factorization (NMF) or Gaussian mixture models (GMMs) of short-time Fourier spectra can be used to build non-binary versions of these masks, 0 ≤ ε_{jq} ≤ 1, associated with time-varying Wiener filtering [29], [6], [24].
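A compact sketch of the binary-masking estimator of (14)-(15) (our own Python, assuming A is known and the mixture coefficients U have already been computed):

```python
import numpy as np

def binary_mask_separation(U, A):
    """U: I x Q mixture coefficients; A: I x J mixing matrix.
    Returns Z_hat (J x Q) with at most one non-zero entry per column."""
    J, Q = A.shape[1], U.shape[1]
    norms = np.linalg.norm(A, axis=0)           # ||a_j||_2 for each column
    corr = (A.T @ U) / norms[:, None]           # a_j^T u_q / ||a_j||_2, as in (14)
    j_hat = np.argmax(np.abs(corr), axis=0)     # most correlated column for each q
    cols = np.arange(Q)
    Z_hat = np.zeros((J, Q))
    Z_hat[j_hat, cols] = (A.T @ U)[j_hat, cols] / norms[j_hat]**2   # eq. (15)
    return Z_hat                                # then S_hat = Z_hat @ Phi
```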

B. Time-frequency masking of anechoic mixtures

Binary masking can also be extended to the case where there is noise, and where the mixing process is convolutive rather than instantaneous. The convolutive mixing model, which accounts for the sound reflections on the walls of a meeting room and the overall reverberation, is as follows:

    x_i(t) ≈ \sum_{j=1}^{J} \sum_{n=-\infty}^{+\infty} a_{ij}(n)\, s_j(t - n),    1 ≤ i ≤ I,    (16)

where a_{ij}(n) is the mixing filter applied to source j to get its contribution to observation i; in matrix notation we can write X ≈ A ⋆ S, where ⋆ denotes convolution. Taking the STFT of both sides yields an approximate time-frequency domain mixing model

    X_i(\omega, \tau) ≈ \sum_{j=1}^{J} A_{ij}(\omega)\, S_j(\omega, \tau),    1 ≤ i ≤ I.    (17)

For anechoic mixtures, we ignore reverberation but allow different propagation times and attenuations between each source and each microphone. Here the mixing filters a_{ij}(n) become simple gains a_{ij} and delays n_{ij}, giving A_{ij}(ω) = a_{ij} exp(2jπ n_{ij} ω).

At time-frequency index (ω, τ), suppose that we know that the only significant source coefficients are indexed by j ∈ J = J(ω, τ), i.e., S_j(ω, τ) ≈ 0 for j ∉ J. Then (17) becomes

    X_i(\omega, \tau) ≈ \sum_{j \in \mathcal{J}} A_{ij}(\omega)\, S_j(\omega, \tau),    1 ≤ i ≤ I,    (18)

so that the vectors u = u(ω, τ) := (X_i(ω, τ))_{i=1}^{I} and z = z(ω, τ) := (S_j(ω, τ))_{j=1}^{J} satisfy

    \mathbf{u} ≈ A_{\mathcal{J}}(\omega)\, \mathbf{z}_{\mathcal{J}}    (19)

where, e.g., A_J(ω) = (A_{ij}(ω))_{1≤i≤I, j∈J}. Therefore, for each time-frequency index (ω, τ), if we know the matrix A(ω) and the set J = J(ω, τ) of most significantly active sources, we can estimate the source coefficients as [17]

    [\hat{S}_j(\omega, \tau)]_{j \in \mathcal{J}} := A_{\mathcal{J}}^{\dagger}(\omega)\, \mathbf{u}(\omega, \tau)    (20)
    [\hat{S}_j(\omega, \tau)]_{j \notin \mathcal{J}} := 0    (21)

where A_J(ω) is the mixing filter submatrix for the active sources at frequency ω. Each source can finally be reconstructed by inverse STFT, using e.g. the overlap-add method.
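A minimal per-point implementation of (20)-(21) (our own Python, assuming the active set J and the mixing matrix at this frequency are already known):

```python
import numpy as np

def separate_tf_point(u, A_omega, active):
    """u: length-I vector of mixture STFT values at one (omega, tau);
    A_omega: I x J mixing matrix at this frequency; active: indices in J.
    Returns the length-J vector of estimated source STFT values."""
    s_hat = np.zeros(A_omega.shape[1], dtype=complex)
    A_sub = A_omega[:, active]                  # columns of the active sources
    s_hat[active] = np.linalg.pinv(A_sub) @ u   # least-squares fit, eq. (20)
    return s_hat                                # inactive sources stay 0, eq. (21)
```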

Fig. 2. Top: coefficients of three musical sources. Middle: coefficients of two mixtures of the three sources. Bottom: scatter plot of the mixture coefficients (plain lines indicate the directions a_j of the columns of the mixing matrix; the colours indicate which column is associated with which source). Left (a): the three musical sources do not play together; time-domain coefficients. Middle (b): the three musical sources play together; time-domain coefficients. Right (c): the three musical sources play together; time-frequency (MDCT) domain coefficients.

In practice, if we only know the matrix A(ω), the critical difficulty is to identify the set J of most significantly active sources. For a "reasonably small" number of sources with "sufficiently sparse" time-frequency representations, straightforward statistical considerations show that, at most time-frequency points (ω, τ), the total number of active sources is small and does not exceed some J′ ≤ I. Identifying the set J of active sources amounts to searching for an approximation u ≈ A(ω)z where z has few non-zero entries. This is a sparse approximation problem, which needs to be addressed independently at each time-frequency point. While binary masking corresponds to searching for z with at most one non-zero entry (J′ = 1) [32], non-binary masking can be performed by choosing, e.g., the minimum 1-norm z such that u = A(ω)z (Basis Pursuit, (7)), as proposed in [9], or the minimum p-norm solution with p < 1 [30].

We have seen in this section that sparse representations can be particularly useful when tackling source separation problems. As well as the approaches we have touched on here, there are many other interesting methods, such as convolutive blind source separation and sparse filter models, which involve sparse representations in the time and/or time-frequency domains. For surveys of some of these methods see e.g. [23], [18].

IV. COMPRESSED SENSING

There is an intimate link between the problem of underdetermined source separation and the emerging field of compressed sensing [7]. Compressed sensing [11], [14] is a technique that samples signals with a sparse representation using a small number of linear measurements. The problem is as follows. Assume we are interested in sampling a signal that has a finite-dimensional vector representation s. Instead of sampling at the Nyquist rate, compressed sensing exploits sparsity in order to sample at a much lower rate, whenever the signal is sparse in some basis. For example, assume s has a sparse representation in a (possibly overcomplete) basis Φ, that is, we have s = Φz, where z has only a small number of significant non-zero elements. To show the link to source separation, here the representation vector z is a column vector, and the basis matrix Φ is multiplied on the left. In order to sample the signal s, in compressed sensing we take a small number of linear measurements x_i = ⟨a_i, s⟩, which can be expressed in matrix notation as

    \mathbf{x} ≈ A\mathbf{s} ≈ A\Phi\mathbf{z},    (22)

where x is the vector with elements x_i and A is the matrix whose rows are the measurement vectors a_i.
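A toy compressed-sensing experiment along these lines (our own Python; greedy OMP recovery, with random Gaussian measurements and Φ = I for simplicity):

```python
import numpy as np

def omp(x, D, k):
    """Orthogonal matching pursuit: find a k-sparse z with x ~= D @ z."""
    residual, support = x.copy(), []
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))   # best-matching column
        support.append(j)
        z_s, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ z_s           # re-fit on the current support
    z = np.zeros(D.shape[1])
    z[support] = z_s
    return z

rng = np.random.default_rng(0)
n, m, k = 256, 64, 5                          # ambient dim, measurements, sparsity
z_true = np.zeros(n)
z_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)  # measurement matrix (A*Phi with Phi = I)
x = A @ z_true                                # eq. (22), noiseless
z_hat = omp(x, A, k)
print(np.allclose(z_hat, z_true, atol=1e-6))  # exact recovery, with high probability
```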

An important aspect of compressed sensing measurements is that they are 'holistic': each measurement contains information from many of the non-zero elements in z. In other words, we generally want the measurements to thoroughly mix all the elements in z. This mixing obviously suggests a link to source separation. In fact, we can show that the compressed sensing model is exactly the same as that of the source separation problem discussed above. To see this, we rewrite the source separation problem of (16),

    x_i(t) ≈ \sum_{j=1}^{J} \sum_{n=-\infty}^{+\infty} a_{ij}(n)\, s_j(t - n),    1 ≤ i ≤ I,    (23)

in matrix-vector notation. To do this, let x_i and s_j be the column vectors with elements x_i(t) and s_j(t), and let A_j be the convolution matrix associated with a_{ij}(n). We then have

    \mathbf{x}_i ≈ \sum_{j=1}^{J} A_j \mathbf{s}_j,    1 ≤ i ≤ I,    (24)

or, alternatively, we can write

    \mathbf{x}_i ≈ [A_1\; A_2\; \ldots\; A_J] \begin{pmatrix} \mathbf{s}_1 \\ \mathbf{s}_2 \\ \vdots \\ \mathbf{s}_J \end{pmatrix},    1 ≤ i ≤ I.    (25)

Again, we suppose that our J source signals s_j have sparse representations in some, possibly overcomplete, basis matrices Φ_j, i.e.,

    \mathbf{s}_j = \Phi_j \mathbf{z}_j    (26)

where we allow each signal to be sparse in a different basis Φ_j. Letting A = [A_1 A_2 ... A_J] and

    \mathbf{s} = \begin{pmatrix} \mathbf{s}_1 \\ \mathbf{s}_2 \\ \vdots \\ \mathbf{s}_J \end{pmatrix} = \begin{pmatrix} \Phi_1 \mathbf{z}_1 \\ \Phi_2 \mathbf{z}_2 \\ \vdots \\ \Phi_J \mathbf{z}_J \end{pmatrix} = \Phi \mathbf{z},    (27)

we can write the underdetermined sparse source separation problem in matrix-vector notation as

    \mathbf{x} ≈ A\mathbf{s} ≈ A\Phi\mathbf{z}.    (28)

Noting that z is a sparse vector, we see that this is the same model as used in compressed sensing, so that both problems are equivalent and require the solution of a sparse linear inverse problem. This realisation suggests that progress made recently in compressed sensing should help us in understanding the underdetermined source separation problem. In particular, the recent interest in compressed sensing stems from the fact that it has been possible to derive conditions on the matrix AΦ that allow several algorithms to recover the sparse vector z with near-optimal accuracy [10], [8]. Whilst in compressed sensing applications the measurements can often be designed such that these conditions are more or less satisfied, in a source separation context the measurement system is often given. Nevertheless, it will be interesting to see whether the extensive theoretical analysis of compressed sensing can shed new light on the source separation problem.

V. AUTOMATIC MUSIC TRANSCRIPTION

So far the coefficients in the sparse representation have been fairly arbitrary, so we were only interested in whether such a sparse representation exists, not specifically what the coefficients mean. However, in some cases we can assign a specific meaning to the sparse coefficients themselves. For example, in a piece of keyboard music, such as a harpsichord or piano solo, only a few of the many possible notes are playing at any one time. The notes therefore form a sparse representation when compared to, for example, a time-frequency spectrogram.

In the simplest case, suppose that x(τ) = (X(1, τ), ..., X(ω, τ), ..., X(K, τ))^T is the spectrum at frame τ. Then we approximate this by the model

    \mathbf{x}(\tau) ≈ A\mathbf{s}(\tau) = \sum_q \mathbf{a}_q S_q(\tau)    (29)

where a_q is the contribution to the spectrum due to note q, and s(τ) = (S_1(τ), ..., S_Q(τ))^T is the vector of note activities S_q(τ) at frame τ. In this simple case, we are assuming that each note q produces just a scaled version of its note spectrum a_q at each frame τ. Joining all these spectral vectors together across frames, in matrix notation we get

    X ≈ AS.    (30)
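Given a dictionary A of note spectra (how it is learned is discussed next), a minimal sketch for estimating the non-negative note activities of (30) frame by frame, using non-negative least squares rather than the probabilistic methods cited below:

```python
import numpy as np
from scipy.optimize import nnls

def note_activities(X, A):
    """X: K x N spectrogram; A: K x Q dictionary of note spectra.
    Returns S (Q x N) with S >= 0 such that X ~= A @ S, frame by frame."""
    return np.column_stack([nnls(A, X[:, t])[0] for t in range(X.shape[1])])
```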

The basis dictionary A is no longer of a fixed MDCT or FFT form, but instead must be learned from the data X = [x(τ)]. To do this, we can use methods such as gradient descent in a probabilistic framework [19] or the more recent K-SVD algorithm [3]. When applied to MIDI-synthesized harpsichord music, this simple model is able to identify most of the notes present in the piece and produce a sparse 'piano-roll' representation of the music, a simple version of automatic music transcription (Fig. 3). For more complex sounds, such as those produced by a real piano, the simple assumption of one scaled spectrum per note no longer holds, and several sparse coefficients are typically needed to represent each note [1]. It is also possible to apply this sparse representation model directly in the time domain, by searching for a shift-invariant sparse coding of the musical audio waveforms. Here a 'spiking' representation of the signal is found, which combines with the shift-invariant dictionary to generate the audio signal. For more details and a comparison of these methods, see [26].

VI. CONCLUSIONS

In this article we have given an overview of a number of current and emerging applications of sparse representations to audio signals. In particular, we have seen how we can use sparse representations in audio coding, denoising, source separation, and automatic music transcription, and how this work is related to the emerging area of compressed sensing. We believe this is an exciting area of research, and we anticipate that there will be many further advances in this area in the future.

ACKNOWLEDGEMENTS

The authors would like to thank Emmanuel Ravelli for Fig. 1, Simon Arberet for Fig. 2 and Samer Abdallah for Fig. 3.

Fig. 3. Transcription of the music spectrogram X = [x(τ)] into the individual note spectra A = [a_q] and note activities S = [S_q(τ)] [1].

REFERENCES

[1] S. A. Abdallah and M. D. Plumbley. Unsupervised analysis of polyphonic music by sparse coding. IEEE Transactions on Neural Networks, 17(1):179-196, Jan. 2006.
[2] F. Abrard and Y. Deville. Blind separation of dependent sources using the "time-frequency ratio of mixtures" approach. In Proc. ISSPA 2003, Paris, France, July 2003. IEEE.
[3] M. Aharon, M. Elad, and A. M. Bruckstein. On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them. Linear Algebra and its Applications, 416(1):48-67, 2006. Special issue devoted to the Haifa 2005 conference on matrix theory.
[4] X. Amatriain and P. Herrera. Transmitting audio content as sound objects. In Proceedings of the AES 22nd International Conference on Virtual, Synthetic and Entertainment Audio, Espoo, Finland, 2002.
[5] S. Arberet, R. Gribonval, and F. Bimbot. A robust method to count and locate audio sources in a stereophonic linear instantaneous mixture. In J. Rosca et al., editors, Proc. ICA 2006, LNCS 3889, pages 536-543. Springer-Verlag, Berlin Heidelberg, 2006.
[6] L. Benaroya, F. Bimbot, and R. Gribonval. Audio source separation with a single sensor. IEEE Transactions on Audio, Speech and Language Processing, 14(1):191-199, Jan. 2006.
[7] T. Blumensath and M. Davies. Compressed sensing and source separation. In International Conference on Independent Component Analysis and Blind Source Separation, 2007.
[8] T. Blumensath and M. Davies. Iterative hard thresholding for compressed sensing. To appear in: Applied and Computational Harmonic Analysis, 2009.
[9] P. Bofill and M. Zibulevsky. Underdetermined blind source separation using sparse representations. Signal Processing, 81(11):2353-2362, Nov. 2001.
[10] E. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus de l'Académie des Sciences, 346:589-592, 2008.
[11] E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489-509, Feb. 2006.
[12] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33-61, 1998.
[13] G. Cornuz, E. Ravelli, P. Leveau, and L. Daudet. Object coding of harmonic sounds using sparse and structured representations. In Proc. of the 10th Int. Conference on Digital Audio Effects (DAFx-07), Bordeaux, 2007.
[14] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.
[15] C. Févotte, B. Torrésani, L. Daudet, and S. J. Godsill. Sparse linear regression with structured priors and application to denoising of musical audio. IEEE Transactions on Audio, Speech and Language Processing, 16:174-185, 2008.
[16] A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Kluwer, Boston, 1992.
[17] R. Gribonval. Piecewise linear source separation. In M. Unser, A. Aldroubi, and A. Laine, editors, Proc. SPIE '03, volume 5207, Wavelets: Applications in Signal and Image Processing X, pages 297-310, San Diego, CA, Aug. 2003.
[18] R. Gribonval and S. Lesage. A survey of sparse component analysis for source separation: principles, perspectives, and new challenges. In ESANN'06 proceedings - 14th European Symposium on Artificial Neural Networks, 26-28 April 2006, Bruges (Belgium), pages 323-330, 2006.
[19] K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T.-W. Lee, and T. J. Sejnowski. Dictionary learning algorithms for sparse representation. Neural Computation, 15:349-396, 2003.
[20] S. Krstulovic and R. Gribonval. MPTK: Matching Pursuit made tractable. In Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP'06), volume 3, pages III-496 - III-499, Toulouse, France, May 2006.
[21] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.
[22] H. Malvar. A modulated complex lapped transform and its applications to audio processing. In Proc. Int. Conf. Acoust., Speech, and Signal Process. (ICASSP'99), volume 3, 1999.
[23] P. D. O'Grady, B. A. Pearlmutter, and S. T. Rickard. Survey of sparse and non-sparse methods in source separation. International Journal of Imaging Systems and Technology, 15(1):18-33, 2005.
[24] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval. Adaptation of Bayesian models for single channel source separation and its application to voice/music separation in popular songs. IEEE Transactions on Audio, Speech and Language Processing, 15(5):1564-1578, July 2007.
[25] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Conference Record of The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, pages 40-44, 1-3 Nov. 1993.
[26] M. D. Plumbley, S. A. Abdallah, T. Blumensath, and M. E. Davies. Sparse representations of polyphonic music. Signal Processing, 86(3):417-431, Mar. 2006.
[27] E. Ravelli, G. Richard, and L. Daudet. Union of MDCT bases for audio coding. IEEE Transactions on Audio, Speech, and Language Processing, 16(8):1361-1372, Nov. 2008.
[28] E. Ravelli, G. Richard, and L. Daudet. Audio signal representations for indexing in the transform domain. IEEE Transactions on Audio, Speech, and Language Processing, to appear, 2009.
[29] M. V. S. Shashanka, B. Raj, and P. Smaragdis. Sparse overcomplete decomposition for single channel speaker separation. In Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP'07), volume 2, pages II-641 - II-644, 2007.
[30] E. Vincent. Complex nonconvex lp norm minimization for underdetermined source separation. In Proc. Int. Conf. Indep. Component Anal. and Blind Signal Separation (ICA 2007), pages 430-437. Springer, 2007.
[31] E. Vincent and M. D. Plumbley. Low bitrate object coding of musical audio using Bayesian harmonic models. IEEE Transactions on Audio, Speech, and Language Processing, 15(4):1273-1282, May 2007.
[32] O. Yilmaz and S. Rickard. Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing, 52(7):1830-1847, July 2004.
