Music Source Separation and its Applications to MIR
Emmanuel Vincent and Nobutaka Ono
INRIA Rennes - Bretagne Atlantique, France / The University of Tokyo, Japan
ISMIR 2010 tutorial, supported by the VERSAMUS project http://versamus.inria.fr/
Contributions from Alexey Ozerov, Ngoc Duong, Simon Arberet, Martin Klein-Hennig and Volker Hohmann.

Part I: General principles of music source separation
Outline
1. Source separation and music
2. Computational auditory scene analysis
3. Probabilistic linear modeling
4. Probabilistic variance modeling
5. Summary and future challenges

Source separation and music

Audio source separation
Many sound scenes are mixtures of several concurrent sound sources. When facing such scenes, humans are able to perceive and focus on individual sources. Source separation is the problem of recovering the source signals underlying a given mixture. It is a core problem of audio signal processing, with applications such as:
- hearing aids,
- post-production,
- remixing and 3D upmixing,
- spoken/multimedia document retrieval,
- MIR.
The data at hand
As an inverse problem, source separation requires some knowledge. Music is among the most difficult application areas of source separation because of the wide variety of sources and mixing processes.
[Figure: recording setups, from concert room (anechoic recording; far-field near-coincident microphone pair; far-field coincident microphone pair; near-field directional microphones for point sources or an extended source; direct sound) to studio (multitrack recording, mixing software, synthesized mixture).]

Music sources
Music sources include acoustical or virtual instruments and singing voice. Sound is produced by transmission of one or more excitation movements/signals through a resonant body/filter. This results in a wide variety of sounds characterized by their:
- polyphony (monophonic or polyphonic),
- temporal shape (transitory, constant or variable),
- spectral fine structure (random or pitched),
- spectral envelope.
[Figure: spectrograms of a piano source and a violin source (f in kHz vs. n in s, levels in dB).]
Effects of microphone recording
For point sources, room acoustics result in filtering of the source signal, where the intensity and delay of direct sound are functions of the source position relative to the microphone. Diffuse sources (piano, drums) amount to (infinitely) many point sources. The mixture signal is equal to the sum of the contributions of all sources at each microphone.

Software mixing effects
Usual software mixing effects include:
- compression and equalization,
- panning, i.e. channel-dependent intensity scaling,
- reverb, polarity and autopan.
The latter are widely employed to achieve perceptual envelopment, whereby even point sources are mixed diffusely. Again, the intensity of direct sound is a function of the source position, and the mixture signal is equal to the sum of the contributions of all sources in each channel.
Overview
Hundreds of source separation systems were designed in the last 20 years... but few are yet applicable to real-world music, as illustrated by the 2008 and 2010 Signal Separation Evaluation Campaigns (SiSEC). The wide variety of techniques boils down to three modeling paradigms:
- computational auditory scene analysis (CASA),
- probabilistic linear modeling, including independent component analysis (ICA) and sparse component analysis (SCA),
- probabilistic variance modeling, including hidden Markov models (HMM) and nonnegative matrix factorization (NMF).
Computational auditory scene analysis (CASA)
CASA aims to emulate the human auditory system. Source formation relies on the Gestalt rules of cognition: proximity, similarity, continuity, closure, common fate.

Auditory front-end
The sound signal is first converted into an auditory nerve representation via a series of processing steps:
- outer- and middle-ear: filter,
- cochlear traveling wave model: filterbank,
- haircell model: halfwave rectification + bandwise compression + cross-band suppression.
[Figure: piano and violin mixture, shown as a spectrogram (f in kHz), on the cochlea (f in ERB, power), after compression and after suppression (loudness).]
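For illustration, here is a minimal sketch of such a front-end (our own stand-in, not the tutorial's code): ERB-spaced Butterworth bandpass filters replace the cochlear traveling wave model, followed by halfwave rectification and bandwise power-law compression; all function names and parameter values are assumptions.

```python
# Crude CASA-style auditory front-end: ERB-spaced filterbank + haircell model.
import numpy as np
from scipy.signal import butter, lfilter

def erb_center_freqs(n_bands, fmin=50.0, fmax=8000.0):
    """Center frequencies equally spaced on the ERB-rate scale."""
    erb = lambda f: 21.4 * np.log10(1 + 0.00437 * f)      # Hz -> ERB number
    erb_inv = lambda e: (10 ** (e / 21.4) - 1) / 0.00437  # ERB number -> Hz
    return erb_inv(np.linspace(erb(fmin), erb(fmax), n_bands))

def auditory_frontend(x, fs, n_bands=30, compress=0.3):
    """Return an (n_bands, n_samples) 'auditory nerve' representation."""
    out = np.zeros((n_bands, len(x)))
    for i, fc in enumerate(erb_center_freqs(n_bands, fmax=0.4 * fs)):
        bw = 24.7 * (0.00437 * fc + 1)                    # ERB bandwidth in Hz
        lo, hi = max(fc - bw, 1.0), min(fc + bw, 0.49 * fs)
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        y = lfilter(b, a, x)                              # one cochlear channel
        y = np.maximum(y, 0.0)                            # halfwave rectification
        out[i] = y ** compress                            # bandwise compression
    return out
```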
Sinusoidal+noise decomposition
Many systems further decompose the signal into a collection of sinusoidal tracks plus residual noise. This decomposition is useful to:
- reduce the number of sound atoms to be grouped into sources,
- enable the exploitation of advanced cues, e.g. amplitude and frequency modulation.
[Figure: sinusoidal representation of the mixture (f in ERB vs. n in s, loudness).]

Spatial cues
Spatial proximity is assessed by comparing the observed:
- interchannel time difference (ITD),
- interchannel intensity difference (IID).
[Figure: ITD (ms) and IID (dB) of the mixture in anechoic and reverberant conditions (f in ERB vs. n in s).]
Note: in practice, most systems consider only binaural data, i.e. recorded by in-ear microphones.
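As an illustration of these two cues, the following sketch (ours, not part of the tutorial) computes per-bin IID and a phase-derived ITD from a stereo STFT; the phase-based ITD is only unambiguous at low frequencies, where the interchannel phase does not wrap.

```python
# Per-bin spatial cues (IID in dB, ITD in ms) from a stereo pair of signals.
import numpy as np
from scipy.signal import stft

def spatial_cues(x_left, x_right, fs, nfft=1024):
    f, _, XL = stft(x_left, fs, nperseg=nfft)
    _, _, XR = stft(x_right, fs, nperseg=nfft)
    eps = 1e-12
    iid = 20 * np.log10((np.abs(XR) + eps) / (np.abs(XL) + eps))  # level diff.
    phase = np.angle(XR * np.conj(XL))       # interchannel phase difference
    with np.errstate(divide="ignore", invalid="ignore"):
        itd = 1000 * phase / (2 * np.pi * f[:, None])             # ms per bin
    itd[0] = 0.0                             # undefined at DC
    return f, iid, itd
```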
Spectral cues
The Gestalt rules also translate into e.g. common pitch and onset time, similar spectral envelope, spectral and temporal smoothness, lack of silent time intervals, correlated amplitude and frequency modulation. Most effort has been devoted to the estimation of pitch by cross-correlation of the auditory nerve representation in each band (see the sketch after this slide).
[Figure: correlograms at n = 0 s and n = 0.5 s (f0 in Hz vs. f in ERB, levels in dB).]

Learned cues
In addition to the above primitive cues, the auditory system relies on a range of learned cues to focus on a given source:
- veridical expectation (episodic memory): "I know the lyrics"
- schematic expectation (semantic memory): "The inaudible word after 'love you' must be 'babe'"
- dynamic adaptive expectation (short-term memory): "This melody already occurred in the song"
- conscious expectation
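A minimal correlogram-style pitch estimator in the spirit of the cross-correlation analysis above (our sketch; the band weighting and peak picking are deliberately naive):

```python
# Summary-autocorrelation pitch estimate from an auditory representation.
import numpy as np

def correlogram_pitch(A, fs, fmin=80.0, fmax=800.0):
    """A: (n_bands, n_samples) auditory representation of one short frame."""
    n = A.shape[1]
    # Per-band autocorrelation via FFT (Wiener-Khinchin theorem).
    spec = np.abs(np.fft.rfft(A - A.mean(axis=1, keepdims=True), 2 * n, axis=1)) ** 2
    ac = np.fft.irfft(spec, axis=1)[:, :n]
    summary = ac.sum(axis=0)                  # summary autocorrelation
    lo = max(int(fs / fmax), 1)               # restrict to plausible pitch lags
    hi = min(int(fs / fmin), n - 1)
    best_lag = lo + int(np.argmax(summary[lo:hi]))
    return fs / best_lag                      # pitch estimate in Hz
```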
Source formation and signal extraction
Each time-frequency bin or each sinusoidal track is associated to a single source according to the above cues: this is known as binary masking. Individual cues are ambiguous, e.g.:
- the observed IID/ITD may be due to a single source in the associated direction or to several concurrent sources around that direction,
- a given sinusoidal track may be a harmonic of different sources.
Most systems exploit several cues, with some precedence order or weighting factors determined by psycho-acousticians. (A masking sketch follows the summary below.)
[Figure: piano mask and estimated piano signal (f in ERB vs. n in s, loudness).]

Summary of CASA
Advantages:
- wide range of spectral, spatial and learned cues
- robustness thanks to joint exploitation of several cues
Limitations:
- musical noise artifacts due to binary masking
- suboptimal cues, designed for auditory scene analysis instead of machine source separation
- practical limitation to a few spectral and/or spatial cues, with no general framework for the integration of additional cues
- (historically) bottom-up approach, prone to error propagation, and limitation to pitched sources
- no results within recent evaluation campaigns
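The signal extraction step itself reduces to masking the mixture STFT; a minimal sketch, assuming the grouping stage has already labeled each time-frequency bin with a source index (function and variable names are ours):

```python
# Resynthesize one source from a mixture given per-bin source labels.
import numpy as np
from scipy.signal import stft, istft

def extract_source(x, fs, labels, j, nfft=1024):
    """x: mixture signal; labels: integer array with the same shape as the STFT."""
    f, t, X = stft(x, fs, nperseg=nfft)
    mask = (labels == j)                 # binary mask selecting source j's bins
    _, y = istft(X * mask, fs, nperseg=nfft)
    return y
```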
Probabilistic linear modeling

Model-based audio source separation
The alternative top-down approach consists of finding the source signals that best fit the mixture and the expected properties of audio sources. In a probabilistic framework, this translates into:
- building generative models of the source and mixture signals,
- inferring latent variables in a maximum a posteriori (MAP) sense.
Linear modeling
The established linear modeling paradigm relies on two assumptions:
1. point sources,
2. low reverberation.
Under assumption 1, the sources and the mixing process can be modeled as single-channel source signals and a linear filtering process. Under assumption 2, this filtering process is equivalent to complex-valued multiplication in the time-frequency domain via the short-time Fourier transform (STFT). In each time-frequency bin (n, f):
$$X_{nf} = \sum_{j=1}^{J} A_{jf} S_{jnf}$$
where $X_{nf}$ is the vector of mixture STFT coefficients, $J$ the number of sources, $S_{jnf}$ the $j$th source STFT coefficient and $A_{jf}$ the $j$th mixing vector.

Priors over the mixing vectors
The mixing vectors $A_{jf}$ encode the apparent sound direction in terms of ITD $\tau_{jf}$ and IID $g_{jf}$. For non-echoic mixtures, ITDs and IIDs are constant over frequency and related to the direction of arrival (DOA) $\theta_j$ of each source:
$$A_{jf} \propto \begin{pmatrix} 1 \\ g_j\, e^{-2i\pi f \tau_j} \end{pmatrix}$$
For echoic mixtures, ITDs and IIDs follow a smeared distribution $P(A_{jf}|\theta_j)$.
[Figure: empirical distributions of ITD (ms) and IID (dB) for anechoic and reverberant conditions (RT = 50 ms, 250 ms, 1.25 s).]
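As a toy illustration of this model, the following sketch synthesizes a stereo mixture STFT from single-channel sources with assumed per-source IID/ITD values (all names and parameter values are ours; sources are assumed to have equal length):

```python
# Anechoic stereo mixing in the STFT domain: X_nf = sum_j A_jf * S_jnf.
import numpy as np
from scipy.signal import stft

def mix_anechoic(sources, fs, gains, itds_ms, nfft=1024):
    """sources: list of equal-length 1-D arrays; returns (freqs, X) with
    X of shape (2, F, N)."""
    f, X = None, None
    for s, g, tau in zip(sources, gains, itds_ms):
        f, _, S = stft(s, fs, nperseg=nfft)
        A = np.stack([np.ones(len(f), dtype=complex),
                      g * np.exp(-2j * np.pi * f * tau / 1000)])  # A_jf, (2, F)
        contrib = A[:, :, None] * S[None, :, :]                   # A_jf * S_jnf
        X = contrib if X is None else X + contrib
    return f, X
```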
I.i.d. priors over the source STFT coefficients
Most systems assume that the sources have random spectra, i.e. their STFT coefficients $S_{jnf}$ are independent and identically distributed (i.i.d.). The magnitude STFT coefficients of audio sources are sparse: at each frequency, few coefficients have large values while most are close to zero. This property is well modeled by the generalized exponential distribution
$$P(|S_{jnf}| \mid p, \beta_f) = \frac{p}{\beta_f\, \Gamma(1/p)}\, e^{-\left|\frac{S_{jnf}}{\beta_f}\right|^p}$$
with shape parameter $p$ and scale parameter $\beta_f$.
[Figure: speech source spectrogram and distribution of its magnitude STFT coefficients (scaled to unit variance), compared with Gaussian (p=2), Laplacian (p=1) and generalized (p=0.4) fits.]
Note: coarser binary activity priors have also been employed.

Inference algorithms
Given the above priors, source separation is typically achieved by joint MAP estimation of the source STFT coefficients $S_{jnf}$ and other latent variables ($A_{jf}$, $g_j$, $\tau_j$, $p$, $\beta_f$) via alternating nonlinear optimization. This objective is called sparse component analysis (SCA). For typical values of $p$, the MAP source STFT coefficients are nonzero for at most two sources in a stereo setting. When the number of sources is $J = 2$, SCA is renamed nongaussianity-based frequency-domain independent component analysis (FDICA).
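A hedged sketch of the per-bin MAP step: with known mixing vectors and a stereo mixture, each candidate pair of active sources fits the bin exactly, so the pair minimizing the ℓp log-prior cost is retained. This is a simplified reading of SCA, not a full implementation (the mixing vectors and scale are assumed known and fixed):

```python
# Per-bin sparse source estimation with at most two active sources.
import numpy as np
from itertools import combinations

def sca_bin(x, A, p=0.4):
    """x: (2,) stereo mixture bin; A: (2, J) known mixing vectors.
    Returns the (J,) MAP source coefficients under the l_p cost."""
    J = A.shape[1]
    best_cost, best = np.inf, None
    for pair in combinations(range(J), 2):
        Ap = A[:, pair]
        try:
            s_pair = np.linalg.solve(Ap, x)   # exact fit with this active pair
        except np.linalg.LinAlgError:
            continue
        cost = np.sum(np.abs(s_pair) ** p)    # generalized-exponential log-prior
        if cost < best_cost:
            best_cost, best = cost, (pair, s_pair)
    s = np.zeros(J, dtype=complex)
    if best is not None:
        pair, s_pair = best
        s[list(pair)] = s_pair
    return s
```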
Practical illustration of separation using i.i.d. linear priors
[Figure: spectrograms of the left source S1nf, center source S2nf and right source S3nf, the mixture Xnf, the predominant source pairs and the estimated nonzero source pairs (1+2, 1+3, 2+3), and the three estimated sources.]
Time-frequency bins dominated by the center source are often erroneously associated with the two other sources.
[Audio: panned mixture and estimated sources using i.i.d. linear priors; recorded reverberant mixture and estimated sources using i.i.d. linear priors.]

SiSEC results on toy mixtures of 3 sources
[Figure: SDR (dB) achieved by i.i.d. linear priors vs. the ideal CASA mask (upper bound), on panned and recorded (RT = 250 ms) mixtures.]
Summary of probabilistic linear modeling
Advantages:
- top-down approach
- separation of more than one source per time-frequency bin
Limitations:
- restricted to mixtures of non-reverberated point sources
- separation of at most two sources per time-frequency bin
- musical noise artifacts due to the ambiguities of spatial cues
- no straightforward framework for the integration of spectral cues
Probabilistic variance modeling

Idea 1: from sources to mixture components
Diffuse or semi-diffuse sources cannot be modeled as single-channel signals, nor even as finite-dimensional signals. Instead of considering the signal produced by each source, one may consider its contribution to each channel of the mixture signal. Source separation becomes the problem of estimating the multichannel mixture components underlying the mixture. In each time-frequency bin (n, f):
$$X_{nf} = \sum_{j=1}^{J} C_{jnf}$$
where $X_{nf}$ is the vector of mixture STFT coefficients, $J$ the number of sources and $C_{jnf}$ the $j$th mixture component.

Idea 2: translation and phase invariance
In order to overcome the ambiguities of spatial cues, additional spectral cues are needed, as shown by CASA. Most audio sources are translation- and phase-invariant: a given sound may be produced at any time with any relative phase across frequency.

Variance modeling
Variance modeling combines these two ideas by modeling the STFT coefficients of individual mixture components by a circular multivariate distribution whose parameters vary over time and frequency. The non-sparsity of source STFT coefficients over small time-frequency regions suggests the use of a non-sparse distribution.
[Figure: speech source spectrogram and generalized Gaussian shape parameter p as a function of neighborhood size (Hz × s).]

Choice of the distribution
For historical reasons, several distributions have been preferred in a mono context, which can equivalently be expressed as divergence functions over the source magnitude/power STFT coefficients:
- Poisson ↔ Kullback-Leibler divergence, a.k.a. I-divergence
- tied-variance Gaussian ↔ Euclidean distance
- log-Gaussian ↔ weighted log-Euclidean distance
These distributions do not easily generalize to multichannel data.
The multichannel Gaussian model
The zero-mean Gaussian distribution is a simple multichannel model:
$$P(C_{jnf}|\Sigma_{jnf}) = \frac{1}{\det(\pi\Sigma_{jnf})}\, e^{-C_{jnf}^H \Sigma_{jnf}^{-1} C_{jnf}}$$
where $\Sigma_{jnf}$ is the $j$th component covariance matrix. The covariance matrix $\Sigma_{jnf}$ of each mixture component can be factored as the product of a scalar nonnegative variance $V_{jnf}$ and a mixing covariance matrix $R_{jf}$, respectively modeling spectral and spatial properties:
$$\Sigma_{jnf} = V_{jnf} R_{jf}$$
Under this model, the mixture STFT coefficients also follow a Gaussian distribution whose covariance is the sum of the component covariances:
$$P(X_{nf}|\{V_{jnf}, R_{jf}\}_j) = \frac{1}{\det\left(\pi \sum_{j=1}^{J} V_{jnf} R_{jf}\right)}\, e^{-X_{nf}^H \left(\sum_{j=1}^{J} V_{jnf} R_{jf}\right)^{-1} X_{nf}}$$

General inference algorithm
Independently of the priors over $V_{jnf}$ and $R_{jf}$, source separation is typically achieved in two steps:
1. joint MAP estimation of all model parameters using the expectation-maximization (EM) algorithm,
2. MAP estimation of the source STFT coefficients conditional to the model parameters by multichannel Wiener filtering:
$$\hat{C}_{jnf} = V_{jnf} R_{jf} \left( \sum_{j'=1}^{J} V_{j'nf} R_{j'f} \right)^{-1} X_{nf}$$
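Step 2 has a direct implementation; a minimal sketch, assuming the variances V and mixing covariances R have already been estimated (e.g. by EM), with an array layout of our choosing:

```python
# Multichannel Wiener filtering of the mixture STFT into components.
import numpy as np

def wiener_components(X, V, R):
    """X: (I, F, N) mixture STFT; V: (J, N, F) variances; R: (J, F, I, I)
    mixing covariances. Returns C: (J, I, F, N) estimated components."""
    I, F, N = X.shape
    J = V.shape[0]
    C = np.zeros((J, I, F, N), dtype=complex)
    reg = 1e-10 * np.eye(I)                       # numerical regularization
    for f in range(F):
        for n in range(N):
            Sigma = sum(V[j, n, f] * R[j, f] for j in range(J)) + reg
            inv_x = np.linalg.solve(Sigma, X[:, f, n])   # Sigma_nf^{-1} X_nf
            for j in range(J):
                C[j, :, f, n] = V[j, n, f] * (R[j, f] @ inv_x)
    return C
```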
Rank-1 priors over the mixing covariances
The mixing covariances $R_{jf}$ encode the apparent spatial direction and spatial spread of sound in terms of:
- ITD,
- IID,
- normalized interchannel correlation, a.k.a. interchannel coherence.
For non-reverberated point sources, the interchannel coherence is equal to one, i.e. $R_{jf}$ has rank 1:
$$R_{jf} = A_{jf} A_{jf}^H$$
The priors $P(A_{jf}|\theta_j)$ used with linear modeling can then be simply reused.

Full-rank priors over the mixing covariances
For reverberated or diffuse sources, the interchannel coherence is smaller than one, i.e. $R_{jf}$ has full rank. The theory of statistical room acoustics suggests the direct+diffuse model
$$R_{jf} \propto \lambda_j A_{jf} A_{jf}^H + B_f$$
with
$$A_{jf} = \sqrt{\frac{2}{1+g_j^2}} \begin{pmatrix} 1 \\ g_j\, e^{-2i\pi f \tau_j} \end{pmatrix}, \qquad B_f = \begin{pmatrix} 1 & \mathrm{sinc}(2\pi f d / c) \\ \mathrm{sinc}(2\pi f d / c) & 1 \end{pmatrix}$$
where $\lambda_j$ is the direct-to-reverberant ratio, $A_{jf}$ the direct mixing vector, $B_f$ the diffuse noise covariance, $\tau_j$ the ITD of direct sound, $g_j$ the IID of direct sound, $d$ the microphone spacing and $c$ the sound speed.
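A small helper constructing this direct+diffuse covariance for a two-channel microphone pair (our sketch with illustrative parameter values; note that numpy's sinc is the normalized one, hence the argument scaling):

```python
# Direct+diffuse mixing covariance R_f = lam * A_f A_f^H + B_f.
import numpy as np

def mixing_covariance(freqs, tau_ms, g, lam, d=0.05, c=343.0):
    """freqs in Hz; tau_ms: ITD of direct sound; g: IID; lam: direct-to-
    reverberant ratio; d: mic spacing (m); c: sound speed (m/s)."""
    R = np.zeros((len(freqs), 2, 2), dtype=complex)
    for i, f in enumerate(freqs):
        A = np.sqrt(2.0 / (1 + g ** 2)) * np.array(
            [1.0, g * np.exp(-2j * np.pi * f * tau_ms / 1000)])
        coh = np.sinc(2 * f * d / c)   # np.sinc(x) = sin(pi x)/(pi x)
        B = np.array([[1.0, coh], [coh, 1.0]], dtype=complex)
        R[i] = lam * np.outer(A, A.conj()) + B
    return R
```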
I.i.d. priors over the source variances
Baseline systems rely again on the assumption that the sources have random spectra and model the source variances $V_{jnf}$ as i.i.d. and locally constant within small time-frequency regions. When these follow a mildly sparse prior, it can be shown that the MAP variances are nonzero for up to four sources. Discrete priors constraining the number of nonzero variances to one or two have also been employed. When the number of sources is $J = 2$, this model is also called nonstationarity-based FDICA.

Benefit of exploiting interchannel coherence
Interchannel coherence helps resolve some ambiguities of ITD and IID and identify the predominant sources more accurately.
[Figure: geometric comparison of the linear model and the covariance model for a mixture X of sources S1, S2, S3 with mixing vectors A1, A2, A3 and variances V1, V2, V3.]
Practical illustration of separation using i.i.d. variance priors
[Figure: spectrograms of the left source (IID < 0), center source (IID = 0) and right source (IID > 0), the mixture, the predominant source pairs and the estimated nonzero source pairs (1+2, 1+3, 2+3), and the three estimated sources.]

Spectral priors based on template spectra
Variance modeling enables the design of phase-invariant spectral priors. The Gaussian mixture model (GMM) represents the variance $V_{jnf}$ of each source at a given time by one of $K$ template spectra $w_{jkf}$ indexed by a discrete state $q_{jn}$:
$$V_{jnf} = w_{j q_{jn} f} \quad \text{with} \quad P(q_{jn} = k) = \pi_{jk}$$
Different strategies have been proposed to learn these spectra:
- speaker-independent training on separate single-source data,
- speaker-dependent training on separate single-source data,
- MAP adaptation to the mixture using model selection or interpolation,
- MAP inference from a coarse initial separation.
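As a toy illustration of decoding with template spectra, the following frame-wise MAP state selection under the zero-mean Gaussian variance model (whose negative log-likelihood matches the Itakura-Saito divergence up to constants) ignores any temporal prior; it is our sketch, not any of the systems discussed:

```python
# Frame-wise MAP state decoding against GMM template spectra.
import numpy as np

def decode_states(P, W, log_pi):
    """P: (F, N) observed power spectrogram; W: (K, F) template spectra;
    log_pi: (K,) log state priors. Returns q: (N,) state indices."""
    eps = 1e-12
    ratio = P.T[None] / (W[:, None, :] + eps)               # (K, N, F)
    dis = np.sum(ratio - np.log(ratio + eps) - 1, axis=2)   # IS divergence, (K, N)
    return np.argmax(log_pi[:, None] - dis, axis=0)
```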
Practical illustration of separation using template spectra
[Figure: piano source, violin source and mixture spectrograms; template spectra w_{jkf}; estimated state sequences q_{jn}; estimated piano, violin and mixture variances; estimated piano and violin sources.]

Spectral priors based on basis spectra
The GMM does not efficiently model polyphonic musical instruments. The variance $V_{jnf}$ of each source is then better represented as the linear combination of $K$ basis spectra $w_{jkf}$ multiplied by time-varying scale factors $h_{jkn}$:
$$V_{jnf} = \sum_{k=1}^{K} h_{jkn} w_{jkf}$$
This model is also called nonnegative matrix factorization (NMF). Again, a range of strategies have been used to learn these spectra:
- instrument-dependent training on separate single-source data,
- MAP adaptation to the mixture using uniform priors,
- MAP adaptation to the mixture using trained priors.

Practical illustration of separation using basis spectra
[Figure: piano source, violin source and mixture spectrograms; basis spectra w_{jkf}; estimated scale factors h_{jkn}; estimated piano, violin and mixture variances; estimated piano and violin sources.]

Constrained template/basis spectra
MAP adaptation or inference of the template/basis spectra is often needed due to:
- the lack of training data,
- the mismatch between training and test data.
However, it is often inaccurate: additional constraints over the spectra are needed to further reduce overfitting.
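A compact sketch of fitting basis spectra and scale factors by the standard KL/I-divergence multiplicative updates, one common NMF estimator (initialization, iteration count and naming are our choices):

```python
# NMF with I-divergence (KL) multiplicative updates.
import numpy as np

def nmf_kl(P, K, n_iter=100, seed=0):
    """Factor a power spectrogram P (F, N) as W @ H, W: (F, K), H: (K, N)."""
    rng = np.random.default_rng(seed)
    F, N = P.shape
    W = rng.random((F, K)) + 1e-3
    H = rng.random((K, N)) + 1e-3
    eps = 1e-12
    for _ in range(n_iter):
        V = W @ H + eps
        W *= ((P / V) @ H.T) / (H.sum(axis=1) + eps)            # basis spectra
        V = W @ H + eps
        H *= (W.T @ (P / V)) / (W.sum(axis=0)[:, None] + eps)   # scale factors
    return W, H
```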
Harmonicity and spectral smoothness constraints
For instance, harmonicity and spectral smoothness can be enforced by:
- associating each basis spectrum with some a priori pitch p,
- modeling $w_{jpf}$ as the sum of fixed narrowband spectra $b_{plf}$ representing adjacent partials at harmonic frequencies, scaled by spectral envelope coefficients $e_{jpl}$:
$$w_{jpf} = \sum_{l=1}^{L_p} e_{jpl}\, b_{plf}$$
Parameter estimation now amounts to estimating the active pitches and their spectral envelopes instead of their full spectra.

Practical illustration of harmonicity constraints
[Figure: narrowband partial spectra $b_{p,1,f}$ to $b_{p,6,f}$ (f in ERB) with estimated envelope coefficients $e_{jp,1}=0.756$, $e_{jp,2}=0.128$, $e_{jp,3}=0.041$, $e_{jp,4}=0.037$, $e_{jp,5}=0.011$, $e_{jp,6}=0$.]
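A sketch of building the fixed narrowband partial spectra on a linear frequency grid (Gaussian-blurred partials; the widths and the grid are our assumptions, and the example envelope values are those from the figure above):

```python
# Fixed narrowband partial spectra b_{plf} for one pitch.
import numpy as np

def partial_spectra(f0, freqs, n_partials=6, width_hz=20.0):
    """Return b: (n_partials, F), one smooth spectrum per partial of pitch f0.
    freqs: e.g. np.linspace(0, fs / 2, nfft // 2 + 1)."""
    b = np.zeros((n_partials, len(freqs)))
    for l in range(n_partials):
        fl = (l + 1) * f0                         # harmonic frequency
        b[l] = np.exp(-0.5 * ((freqs - fl) / width_hz) ** 2)
        b[l] /= b[l].sum() + 1e-12                # unit-sum normalization
    return b

# w_{jpf} = e @ b for an envelope vector e, e.g. (values from the figure):
# e = np.array([0.756, 0.128, 0.041, 0.037, 0.011, 0.0])
# w = e @ partial_spectra(220.0, freqs)
```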
Further constraints
Further constraints that have been implemented in this context include:
- a source-filter model of instrumental timbre,
- inharmonicity and tuning.
Probabilistic priors are also popular:
- state transition priors: $P(q_{jn} = k | q_{j,n-1} = l) = \pi_{jkl}$
- spectral continuity priors (for percussive sounds): $P(V_{jnf} | V_{jn,f-1}) = \mathcal{N}(V_{jnf}; V_{jn,f-1}, \sigma_{\mathrm{perc}})$
- temporal continuity priors (for sustained sounds): $P(V_{jnf} | V_{j,n-1,f}) = \mathcal{N}(V_{jnf}; V_{j,n-1,f}, \sigma_{\mathrm{sust}})$

SiSEC results on toy mixtures of 3 sources
[Figure: SDR (dB) achieved by adapted basis spectra vs. i.i.d. linear priors, on panned and recorded (RT = 250 ms) mixtures.]
[Audio: panned mixture and estimated sources using adapted basis spectra and i.i.d. linear priors; recorded reverberant mixture and estimated sources using adapted basis spectra and i.i.d. linear priors.]
SiSEC results on professional mixtures
[Figure: SDR (dB) per source (vocals, drums, bass, guitar, piano) on Tamy (2 sources) and Bearlin (10 sources).]
[Audio: Tamy (2 sources) and Bearlin (10 sources), estimated sources using adapted basis spectra.]

Summary of probabilistic variance modeling
Advantages:
- top-down approach
- virtually applicable to any mixture, including to diffuse sources
- no hard constraint on the number of sources per time-frequency bin
- fewer musical noise artifacts by joint exploitation of spatial, spectral and learned cues
- principled modular framework for the integration of additional cues
Limitations:
- remaining musical noise artifacts
- current implementations limited to a few spectral and/or spatial cues... but this is gradually changing!
Summary and future challenges

Summary of the principles of model-based source separation
Most model-based source separation systems rely on modeling the STFT coefficients of each source as a function of:
- a scalar variable ($S_{jnf}$ or $V_{jnf}$) encoding spectral cues,
- a vector or matrix variable ($A_{jf}$ or $R_{jf}$) encoding spatial cues.
Robust source separation requires priors over both types of cues:
- spectral cues alone cannot discriminate sources with similar pitch range and timbre,
- spatial cues alone cannot discriminate sources with the same DOA.
A range of informative priors have been proposed, relating for example:
- $S_{jnf}$ or $V_{jnf}$ to discrete or continuous latent states,
- $A_{jf}$ or $R_{jf}$ to the source DOAs.
Variance modeling outperforms linear modeling.
Conclusion and remaining challenges
To sum up, source separation is a core problem of audio signal processing with huge potential applications. Existing systems are gradually finding their way into the industry, especially for applications that can accommodate:
- a certain amount of musical noise artifacts, such as MIR,
- partial user input/feedback, such as post-production.
We believe that these two limitations could be addressed in the next 10 years by exploiting the full power of probabilistic modeling, especially by:
- integrating more and more spatial and spectral cues,
- making better use of learned cues, using training data or repeated sounds.

References
- D.L. Wang and G.J. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms and Applications, Wiley/IEEE Press, 2006.
- E. Vincent, M.G. Jafari, S.A. Abdallah, M.D. Plumbley, and M.E. Davies, "Probabilistic modeling paradigms for audio source separation," in Machine Audition: Principles, Algorithms and Systems, IGI Global, 2010.
- 2008 and 2010 Signal Separation Evaluation Campaigns: http://sisec.wiki.irisa.fr/
Music Source Separation and its Applications to MIR
Nobutaka Ono and Emmanuel Vincent
The University of Tokyo, Japan / INRIA Rennes - Bretagne Atlantique, France
ISMIR 2010 Tutorial 1, Aug. 9, 2010. Tutorial supported by the VERSAMUS project http://versamus.inria.fr/
Contributions from Shigeki Sagayama, Kenichi Miyamoto, Hirokazu Kameoka, Jonathan Le Roux, Emiru Tsunoo, Yushi Ueda, Hideyuki Tachibana, George Tzanetakis, Halfdan Rump, and other members of IPC Lab #1.

Outline
Part I: Brief Introduction of the State of the Art
Part II: Harmonic/Percussive Sound Separation
- Motivation and Formulation
- Open Binary Software
Part III: Applications of HPSS to MIR Tasks
- Singer/Instrument Identification
- Audio Tempo Estimation
- Audio Chord Estimation
- Melody Extraction
- Audio Genre Classification
Conclusions

Introduction
The focus of the second half of this tutorial is to clarify:
- What has source separation been used for in MIR?
- How does it improve the performance of MIR tasks?
Examples:
- Multi-pitch estimation: the task itself is tightly coupled with source separation.
- Audio genre classification: how source separation is useful is not straightforward.

Part I: Brief Introduction of the State of the Art
Singer Identification
Task: identify a singer from music audio with accompaniment.
Typical approach: audio → feature extraction → features → classifier → singer.

Accompaniment Sound Reduction [Fujihara2005]
Predominant-F0-based voice separation of the audio input by PreFEst [Goto2004]. (Fig. 1 [Fujihara2005])

Reliable Frame Selection [Fujihara2005]
Only reliable frames are used for classification: feature extraction → reliable frame selection → classifier. (Fig. 1 [Fujihara2005])

Evaluation by Confusion Matrix
[Figure: Fig. 3 [Fujihara2005], male/female confusion matrices for baseline, selection only, reduction only, and reduction and selection.]
Male/female confusion is decreased by accompaniment reduction. The combination of reduction and selection improves performance considerably.
Vocal Separation Based on a Melody Transcriber
Melody-F0-based vocal separation [Mesaros2007]:
- Estimate the melody F0 with a melody transcription system [Ryynanen2006].
- Generate harmonic overtones at multiples of the estimated F0.
- Estimate the amplitudes and phases of the overtones based on cross-correlation between the original signal and complex exponentials.

Evaluation by Identification Rate
They evaluate the effect of separation on singer identification performance using different classifiers.
[Figure: correct identification rate (%) with and without separation for various classifiers, at singing-to-accompaniment ratios of -5 dB and 15 dB; generated from Tables 1 and 2 of [Mesaros2007].]
Performance is much improved, especially at low singing-to-accompaniment ratio.
Instrument Identification
Task: determine the instruments present in a music piece.
Typical approach: audio → separation into notes → feature extraction from the note spectrograms → features → classifier → instrument.
Important issue: source separation is not perfect. How to reduce the resulting errors?

Feature Weighting [Kitahara2007]
Feature vectors of each instrument are collected from polyphonic music for training. The robustness of each feature is evaluated by the ratio of intra-class variance to inter-class variance, applying linear discriminant analysis (LDA) for feature weighting (PCA → LDA; modified from Fig. 1 [Kitahara2007]).
Effectiveness of Feature Weighting
[Figure: instrument recognition rate, Fig. 6 [Kitahara2007].]
Feature weighting by LDA improves the recognition rate.

Audio Tempo Estimation
Task: extract the tempo from musical audio.
Typical approach: audio → STFT or filterbank → subband signals → onset detection → detection function → periodicity analysis → tempo candidates → tracking → tempo.
Applying a Harmonic+Noise Model
A harmonic+noise model is applied before calculating the detection function [Alonso2007]: source separation based on the harmonic+noise model (Fig. 2 [Alonso2007]). Detection functions are calculated from both the harmonic component and the noise component, and then merged.

Influence of the H+N Model
[Figure: comparison of periodicity detection algorithms, Fig. 14 [Alonso2007].]
Separation based on the H+N model shows better results.

Applying PLCA
PLCA (probabilistic latent component analysis), an NMF-like method, is applied [Chordia2009] (Fig. 1 [Chordia2009]). It greatly increases the number of tempo candidates. They report its effectiveness.

Part II: Harmonic/Percussive Sound Separation
Motivation and Goal of HPSS
Motivation: music consists of two different components, a harmonic component and a percussive component (example: a popular music piece, RWC-MDB-P-034).
Goal: separation of a monaural audio signal into harmonic and percussive components.
Target: MIR-related tasks
- H-related: multi-pitch analysis, chord recognition, ...
- P-related: beat tracking, rhythm recognition, ...

Related Work on H/P Separation
- Source separation into multiple components followed by classification: ICA and classification [Uhle2003], NMF and classification [Helen2005].
- Steady+transient models: adaptive phase vocoder, subspace projection, matching pursuit, etc. A good review is provided in [Daudet2005].
- Bayesian NMF [Dikmen2009].
Point: Anisotropy of the Spectrogram
The harmonic component is horizontally smooth on the spectrogram, while the percussive component is vertically smooth.

H/P Separation Problem
Problem: find $H_{t,\omega}$ and $P_{t,\omega}$ from $W_{t,\omega}$ on the power spectrogram.
Requirements:
1) $H_{t,\omega}$: horizontally smooth
2) $P_{t,\omega}$: vertically smooth
3) $H_{t,\omega}$ and $P_{t,\omega}$: non-negative
4) $H_{t,\omega} + P_{t,\omega}$: should be close to $W_{t,\omega}$

Formulation of H/P Separation (1/2)
Formulation as an optimization problem: minimize an objective function consisting of a closeness cost and a smoothness cost, under the constraints $H_{t,\omega} \geq 0$ and $P_{t,\omega} \geq 0$:
- closeness cost function: I-divergence (for scale invariance),
- smoothness cost function: square of differences, with weights to control the two smoothness terms.
In a MAP estimation context, these correspond to the likelihood term and the prior term, respectively.

Formulation of H/P Separation (2/2)
A variance modeling-based separation using:
- a Poisson observation distribution,
- Gaussian continuity priors [Miyamoto2008, Ono2008, etc.].
Update Rules
Two kinds of variables are updated alternately:
- H and P,
- auxiliary variables.

Separated Examples
[Audio: original, H and P components of RWC-MDB-P-7 "PROLOGUE", RWC-MDB-P-12 "KAGE-ROU", RWC-MDB-P-18 "True Heart", RWC-MDB-P-25 "tell me", RWC-MDB-J-16 "Jive".]
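For illustration only: the sketch below is not the auxiliary-function update rules above, but a simpler median-filtering scheme exploiting the same spectrogram anisotropy (horizontal vs. vertical smoothing), with Wiener-like masks so that the two components sum back to the mixture.

```python
# Simplified HPSS via anisotropic median filtering of the power spectrogram.
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import stft, istft

def hpss(x, fs, nfft=1024, kernel=17):
    f, t, X = stft(x, fs, nperseg=nfft)
    W = np.abs(X) ** 2
    Hs = median_filter(W, size=(1, kernel))   # smooth along time  -> harmonic
    Ps = median_filter(W, size=(kernel, 1))   # smooth along freq. -> percussive
    mask_h = Hs / (Hs + Ps + 1e-12)           # soft, complementary masks
    _, h = istft(X * mask_h, fs, nperseg=nfft)
    _, p = istft(X * (1 - mask_h), fs, nperseg=nfft)
    return h, p
```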
Real-Time Implementation
Sliding block analysis: iterations are applied only within a sliding block.

Open Software: Real-Time H/P Equalizer
Available at http://www.hil.t.u-tokyo.ac.jp/software/HPSS/. It controls the H/P balance of an audio signal in real time. Simple instructions:
1) Click the "Load WAV" button and choose a WAV-formatted audio file.
2) Click the "Start" button; the audio starts.
3) Slide the H/P balance bar as you like and listen to how the sound changes.
Part III: Applications of HPSS to MIR Tasks
III-1: Audio Chord Detection

Audio Chord Detection
Task: estimate the chord sequence and its segmentation from music audio (e.g. C, G, Am, F, C, G, F, C).

Typical Approach: Chroma Feature + HMM
Feature: chroma [Fujishima1999] (24-dim. features). Transition: chord progression.
Maximum a posteriori chord estimation [Sheh2003]: HMM-based chord recognition.
- Training: HMM training of the initial, emission and transition probabilities, i.e. the chroma observation probability $p(x_t|c_t)$ (acoustic model) and the bigram probability $p(c_t|c_{t-1})$ (language model).
- Recognition: feature extraction, then Viterbi decoding yields the recognized chord sequence.

Feature-refined System [Ueda2009]
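The decoding step is ordinary Viterbi; a generic sketch (not the authors' implementation), taking per-frame chord log-likelihoods and bigram log-probabilities as inputs:

```python
# Viterbi decoding for chroma+HMM chord recognition.
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """log_emit: (T, C) log p(x_t|c); log_trans: (C, C), [i, j] = log p(c_j|c_i);
    log_init: (C,). Returns the MAP chord index sequence of length T."""
    T, C = log_emit.shape
    delta = log_init + log_emit[0]
    psi = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans           # (C_prev, C_next)
        psi[t] = np.argmax(scores, axis=0)            # best predecessor per chord
        delta = scores[psi[t], np.arange(C)] + log_emit[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):                     # backtrack
        path[t - 1] = psi[t, path[t]]
    return path
```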
Suppressing Percussive Sounds
Percussive sounds are harmful in chord detection: emphasize the harmonic components by HPSS.

Fourier-transformed Chroma
The covariance matrix of chroma has highly correlated components, caused by harmonic overtones and by pitches performed at the same time; a diagonal-only approximation is infeasible, which results in a large number of parameters. Assuming that the harmonic overtones of all pitches have the same structure and that the amount of occurrence of the same intervals is the same, the covariance matrix is near circulant. A circulant matrix is diagonalized by the DFT, so a diagonal approximation of the FT-chroma covariance becomes feasible. This reduces the number of model parameters (statistically robust).
[Figure: chroma covariance vs. FT-chroma covariance.]
Tuning Compensation
There are tuning differences among songs (e.g. A = 440.0 Hz vs. 446.4 Hz, +25 cents); neglecting this may blur the chroma features. Choose the best tuning from multiple filterbank candidates: find the maximum chroma energy (sum of all bins of the chroma), assuming the tuning does not change within a song.
[Figure: log power of pitches around A (G, G#, A, A#) as a function of tuning (log frequency).]

Delta Chroma Features
Chord tones change largely at chord boundaries; chord boundary accuracy is improved by features representing chord boundaries. Delta chroma: the derivative of the chroma features (cf. delta cepstrum of MFCC, an effective feature for speech recognition), calculated by regression analysis over ±δ sample points [Sagayama&Itakura1979] and robust to noise:
$$\Delta C(i, n) = \frac{\sum_{k=-\delta}^{\delta} k\, C(i, n+k)}{\sum_{k=-\delta}^{\delta} k^2}, \qquad i = 1, \ldots, 12$$
i.e. the slope of the local regression line.
[Figure: chroma features around chord boundaries over time.]
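The delta-chroma formula translates directly into code (window half-width delta is a free parameter; edge frames are handled by padding):

```python
# Delta chroma: per-bin regression slope over +/-delta frames.
import numpy as np

def delta_chroma(C, delta=2):
    """C: (12, N) chroma; returns (12, N) regression slopes."""
    ks = np.arange(-delta, delta + 1)
    norm = np.sum(ks ** 2)
    Cp = np.pad(C, ((0, 0), (delta, delta)), mode="edge")
    out = np.zeros(C.shape, dtype=float)
    for k in ks:
        out += k * Cp[:, delta + k : delta + k + C.shape[1]]  # k * C(i, n+k)
    return out / norm
```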
Multiple States per Chord
Chroma changes from "onset" to "release"; capture this change by having multiple states per chord (C1 → C2 → C3). There is a tradeoff between data size and the number of states.

Experimental Evaluation
Test data: 180 songs (12 albums) of The Beatles (chord reference annotations provided by C. Harte); 11.025 kHz sampling, 16-bit, 1 ch, WAV files; frequency range 55.0 Hz - 1661.2 Hz (5 octaves).
Labels: 12 × major/minor = 24 chords + N (no chord).
Evaluation: album-filtered 3-fold cross-validation (8 albums for training, 4 albums for testing). Frame recognition rate = (#correct frames) / (#total frames), sampled every 100 ms.
Chord Detection Results
[Figure: chord detection rate for 1, 2 and 3 states per chord and feature configurations Chroma, HE, HE+TC, HE+TC+FT, HE+TC+DC, with error reduction rates of 28.1% and 11.0% indicated; the MIREX 2008 best score [Uchiyama2008] is shown for reference. HE: harmonic sound emphasized; TC: tuning compensation; FT: FT chroma (diagonal covariance); DC: delta chroma.]
HPSS improves chord detection performance.

III-2: Melody Extraction
Melody Extraction
Task: identify a melody pitch contour from polyphonic musical audio.
Typical approach: audio → predominant-F0 extraction → F0s → tracking → melody.
Singing voice enhancement will be a useful pre-processing step.

Singing Voice in the Spectrogram
Example: RWC-MDB-P-25 "tell me".
A. Vertical components: percussion.
B. Horizontal components: harmonic instruments (piano, guitar, etc.).
C. Fluctuating components: singing voice.

Is Voice Harmonic or Percussive?
It depends on the spectrogram resolution (frame length): on a short-frame STFT, voice appears as "H" (clustered in the time direction); on a long-frame STFT, voice appears as "P" (clustered in the frequency direction).

HPSS Results with Different Frame Lengths
[Audio/figure: H and P components with a frame length of 16 ms (voice is "harmonic") and of 512 ms (voice is "percussive").]
Two-stage HPSS [Tachibana2010]
HPSS with a short frame separates the original signal into a sinusoidal sound and a percussive sound; HPSS with a long frame then separates the sinusoidal sound into a stationary sinusoidal sound and a fluctuating sinusoidal sound (≈ singing voice).

Spectrogram Example
[Figure: original signal (from the LabROSA dataset) and voice-enhanced signal (by two-stage HPSS).]

Separation Examples
[Audio: original, extracted vocal and vocal-cancelled versions of "tell me" (F, R&B), "Weekend" (F, Euro beat), "Dance Together" (M, Jazz), "1999" (M, Metal rock), "Seven little crows" (F, Nursery rhyme), and "La donna è mobile" from Verdi's opera "Rigoletto" (M, Classical).]
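A minimal driver for the two-stage scheme above, reusing the hpss() sketch from Part II (the frame lengths are illustrative, stated here for 32 kHz sampling):

```python
def enhance_voice(x, fs):
    """Two-stage HPSS: short frames put voice in H, long frames put it in P."""
    sinusoidal, _perc = hpss(x, fs, nfft=512)          # ~16 ms frames at 32 kHz
    _steady, voice = hpss(sinusoidal, fs, nfft=16384)  # ~512 ms frames at 32 kHz
    return voice
```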
Melody Tracking by DP [Tachibana2010]
The hidden states (the pitch series) are estimated from the observations (the voice-enhanced spectrum) by dynamic programming.
[Figure: trellis of pitch candidates (e.g. 440, 450, 460 Hz) over frames t1, t2, t3.]

Example of Melody Tracking
[Audio/figure: train06.wav, distributed with the LabROSA dataset.]
Results in MIREX 2009
Data: 379 songs, mixed at +5 dB, 0 dB and -5 dB singer-to-accompaniment ratios.
[Figure: results on original vs. processed signals at +5 dB, 0 dB and -5 dB; the HPSS-based method is noise-robust, while the others are noise-sensitive.]
Robustness to a large accompaniment level is greatly improved.

Part III: Applications of HPSS to MIR Tasks
III-3: Audio Genre Classification
Audio Genre Classification
Task: estimate the genre (blues, classical, jazz, rock, ...) from music audio.
Typical approach: audio → feature extraction → features → classifier → genre.
Examples of features [Tzanetakis2001]: timbral information (MFCC, etc.), melodic information, statistics about periodicities (beat histogram).

New Features I: Percussive Patterns
Feature extraction [Tsunoo2009].

Motivation for Bar-long Percussive Patterns
Bar-long percussive patterns (temporal information) are frequently characteristic of a particular genre. Difficulties:
1) mixture of harmonic and percussive components,
2) unknown bar lines,
3) tempo fluctuation,
4) unknown multiple patterns.

Rhythmic Structure Analysis by a One-pass DP Algorithm
Assume that correct bar-long unit patterns are given. Problem: tempo fluctuation and unknown segmentation. This is analogous to the continuous speech recognition problem: a one-pass dynamic programming algorithm can be used to segment the spectrogram of the percussive sound (e.g. into the pattern sequence A A A A B A A A C C C C).
Example of “Rhythm Map”
Dynamic Pattern Clustering [Tsunoo2009] Actually, unit patterns also should be estimated.
One-pass DP alignment
Chicken-and-egg problem Analogous to unsupervised learning problem
Rhythm 1 (Fundamental )
Iterative algorithm based on k-means clustering
Segment spectrogram using one-pass DP algorithm Update unit patterns by averaging segments
Rhythm 2 (Fill-in)
Convergence is guaranteed mathematically
Rhythm 3 (Interlude) Rhythm 4 (Climax)
Fundamental melody Climax Interlude FullSong
U SOLab
N NDD
57 Aug. 9, 2010
ISMIR2010 Tutorial 1
57
Necessity of HPSS in Rhythm Map
U SOLab
N NDD
Aug. 9, 2010
58
ISMIR2010 Tutorial 1
Extracting Common Patterns to a Particular Genre
With HPSS
Apply to a collection of music pieces Alignment calculation by one-pass DP algorithm
Use same set of templates
Updating templates by k-means clustering
Iteration
Use whole music collection of a particular genre
Without HPSS
Rhythm patterns and structures are not extracted without HPSS! U SOLab
N NDD
Aug. 9, 2010
ISMIR2010 Tutorial 1
59
U SOLab
N NDD
60 Aug. 9, 2010
ISMIR2010 Tutorial 1
60
Features and Classifiers
Feature vectors: genre-pattern occurrence histogram (normalized; e.g. pattern counts 4, 1, 2 → 4/7, 1/7, 2/7).
Classifier: support vector machine (SVM).

Experimental Evaluation
Datasets:
- (standard) GTZAN dataset: 22050 Hz sampling, 1 ch, 30-second clips, 10 genres {blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, rock}, 100 songs per genre (1000 songs in total).
- (rhythm-intensive) Ballroom dataset: 22050 Hz sampling, 1 ch, 30-second clips, 8 styles {chacha, foxtrot, quickstep, rumba, samba, tango, viennesewaltz, waltz}, 100 songs per style (800 songs in total).
Pattern sets: the datasets were divided into 2 parts, yielding 2 sets of 10 templates for each genre.
Evaluation: 10-fold cross-validation; classifier: linear SVM (the toolkit "Weka" was used).

Extracted Percussive Patterns
[Figure: example of learned templates, 10 templates learned from "blues".]

Genre Classification Accuracy
Percussive pattern features only:

| Features [number of dim.] | GTZAN dataset | Ballroom dataset |
| Baseline (random) | 10.0% | 12.5% |
| Rhythmic (from template set #1) [10/8] | 43.6% | 54.0% |
| Rhythmic (from template set #2) [10/8] | 42.3% | 55.125% |

Merged with timbral features, i.e. statistical features such as MFCC, etc. (68 dim.) [Tzanetakis 2008], which performed well on audio classification tasks in MIREX 2008:

| Features [number of dim.] | GTZAN dataset | Ballroom dataset |
| Existing (timbre) [68] | 72.4% | 57.625% |
| Merged (from template set #1) [78/76] | 76.1% | 70.125% |
| Merged (from template set #2) [78/76] | 76.2% | 69.125% |

Classification accuracy is improved by combining the percussive pattern features.
New Features II: Bass-line Patterns [Tsunoo2009]
[Figure: examples of extracted bass-line patterns.]

Genre Classification Accuracy
Classification accuracy with only bass-line features, and merged with timbre features:

| Features | GTZAN dataset | Ballroom dataset |
| Baseline (random classifier) | 10.0% | 10.0% |
| Only bass-line (400 dim.) | 42.0% | 44.8% |
| Existing (timbre, 68 dim.) | 72.4% | 72.4% |
| Merged (468 dim.) | 74.4% | 76.0% |

Another Application of HPSS [Rump2010]
An autoregressive MFCC model applied to genre classification: HPSS increases the number of channels from mono to three (original, harmonic, percussive) and improves performance.
Conclusions
Source separation techniques used for MIR:
- F0-based harmonic separation,
- non-negative matrix factorization or PLCA,
- sinusoid+noise model,
- harmonic/percussive sound separation.
Source separation is useful:
- to enhance specific components,
- to increase the number of channels and the dimension of feature vectors,
- to generate new features.
Feature weighting is a technique for overcoming errors due to imperfect source separation.

Future Work
- Improvement of the separation performance itself by exploiting musicological knowledge.
- Current works are limited to monaural separation: use spatial (especially stereo) information.
- Application of source separation to other MIR tasks: cover song identification, audio music similarity, ...

Reference Book Chapter
N. Ono, K. Miyamoto, H. Kameoka, J. Le Roux, Y. Uchiyama, E. Tsunoo, T. Nishimoto and S. Sagayama, "Harmonic and Percussive Sound Separation and its Application to MIR-related Tasks," in Advances in Music Information Retrieval, ser. Studies in Computational Intelligence, vol. 274, Z. W. Ras and A. Wieczorkowska, Eds., Springer, pp. 213-236.

Available Separation Software
- ICA Central: early software restricted to mixtures of two sources. http://www.tsi.enst.fr/icacentral/algos.html
- SiSEC Reference Software: linear modeling-based software for panned or recorded mixtures. http://sisec2008.wiki.irisa.fr/tiki-index.php?page=Underdetermined+speech+and+music+mixtures
- QUAERO Source Separation Toolkit: modular variance modeling-based software implementing a range of structures (GMM, NMF, source-filter model, harmonicity, diffuse mixing, etc.). To be released Fall 2010: watch the music-ir list for an announcement!
- Harmonic Percussive Sound Separation (HPSS): http://www.hil.t.u-tokyo.ac.jp/software/HPSS/
Advertisement: LVA/ICA 2010
LVA/ICA 2010 will be held in St. Malo, France on September 27-30, 2010. More than 20 papers on music and audio source separation will be presented.

References

Singer/Instrument Identification
- H. Fujihara, T. Kitahara, M. Goto, K. Komatani, T. Ogata and H. Okuno, "Singer Identification Based on Accompaniment Sound Reduction and Reliable Frame Selection," Proc. ISMIR, 2005.
- M. Goto, "A real-time music-scene description system: predominant-F0 estimation," Speech Communication, vol. 43, no. 4, pp. 311-329, 2004.
- A. Mesaros, T. Virtanen and A. Klapuri, "Singer identification in polyphonic music using vocal separation and pattern recognition methods," Proc. ISMIR, pp. 375-378, 2007.
- M. Ryynanen and A. Klapuri, "Transcription of the Singing Melody in Polyphonic Music," Proc. ISMIR, 2006.
- T. Kitahara, M. Goto, K. Komatani, T. Ogata and H. G. Okuno, "Instrument identification in polyphonic music: feature weighting to minimize influence of sound overlaps," EURASIP Journal on Applied Signal Processing, vol. 2007, article ID 51979, 2007.

Audio Tempo Estimation
- M. Alonso, G. Richard and B. David, "Accurate tempo estimation based on harmonic + noise decomposition," EURASIP Journal on Advances in Signal Processing, vol. 2007, article ID 82795, 2007.
- P. Chordia and A. Rae, "Using Source Separation to Improve Tempo Detection," Proc. ISMIR, pp. 183-188, 2009.

Related Work on H/P Separation
- C. Uhle, C. Dittmar and T. Sporer, "Extraction of drum tracks from polyphonic music using independent subspace analysis," Proc. ICA, pp. 843-847, 2003.
- M. Helen and T. Virtanen, "Separation of drums from polyphonic music using non-negative matrix factorization and support vector machine," Proc. EUSIPCO, 2005.
- L. Daudet, "A Review on Techniques for the Extraction of Transients in Musical Signals," Proc. CMMR, pp. 219-232, 2005.
- O. Dikmen and A. T. Cemgil, "Unsupervised Single-channel Source Separation Using Bayesian NMF," Proc. WASPAA, pp. 93-96, 2009.

Harmonic/Percussive Sound Separation
- K. Miyamoto, H. Kameoka, N. Ono and S. Sagayama, "Separation of Harmonic and Non-Harmonic Sounds Based on Anisotropy in Spectrogram," Proc. ASJ, pp. 903-904, 2008 (in Japanese).
- N. Ono, K. Miyamoto, J. Le Roux, H. Kameoka and S. Sagayama, "Separation of a Monaural Audio Signal into Harmonic/Percussive Components by Complementary Diffusion on Spectrogram," Proc. EUSIPCO, 2008.
- N. Ono, K. Miyamoto, J. Le Roux, H. Kameoka and S. Sagayama, "A Real-time Equalizer of Harmonic and Percussive Components in Music Signals," Proc. ISMIR, pp. 139-144, 2008.
- N. Ono, K. Miyamoto, H. Kameoka, J. Le Roux, Y. Uchiyama, E. Tsunoo, T. Nishimoto and S. Sagayama, "Harmonic and Percussive Sound Separation and its Application to MIR-related Tasks," in Advances in Music Information Retrieval, ser. Studies in Computational Intelligence, vol. 274, Z. W. Ras and A. Wieczorkowska, Eds., Springer, pp. 213-236, Feb. 2010.

Applications of HPSS to MIR Tasks
- Y. Ueda, Y. Uchiyama, T. Nishimoto, N. Ono and S. Sagayama, "HMM-Based Approach for Automatic Chord Detection Using Refined Acoustic Features," Proc. ICASSP, pp. 5518-5521, 2010.
- J. Reed, Y. Ueda, S. M. Siniscalchi, Y. Uchiyama, S. Sagayama and C.-H. Lee, "Minimum Classification Error Training to Improve Isolated Chord Recognition," Proc. ISMIR, pp. 609-614, 2009.
- H. Tachibana, T. Ono, N. Ono and S. Sagayama, "Melody Line Estimation in Homophonic Music Audio Signals Based on Temporal-Variability of Melodic Source," Proc. ICASSP, pp. 425-428, 2010.
- H. Rump, S. Miyabe, E. Tsunoo, N. Ono and S. Sagayama, "On the Feature Extraction of Timbral Dynamics," Proc. ISMIR, 2010.
- E. Tsunoo, N. Ono and S. Sagayama, "Rhythm Map: Extraction of Unit Rhythmic Patterns and Analysis of Rhythmic Structure from Music Acoustic Signals," Proc. ICASSP, pp. 185-188, 2009.
- E. Tsunoo, G. Tzanetakis, N. Ono and S. Sagayama, "Audio Genre Classification Using Percussive Pattern Clustering Combined with Timbral Features," Proc. ICME, pp. 382-385, 2009.
- E. Tsunoo, N. Ono and S. Sagayama, "Musical Bass-Line Pattern Clustering and Its Application to Audio Genre Classification," Proc. ISMIR, pp. 219-224, 2009.
- E. Tsunoo, T. Akase, N. Ono and S. Sagayama, "Music Mood Classification by Rhythm and Bass-line Unit Pattern Analysis," Proc. ICASSP, pp. 265-268, 2010.