Music Source Separation and its Applications to MIR
Emmanuel Vincent and Nobutaka Ono
INRIA Rennes - Bretagne Atlantique, France / The University of Tokyo, Japan
ISMIR 2010 tutorial, supported by the VERSAMUS project http://versamus.inria.fr/
Contributions from Alexey Ozerov, Ngoc Duong, Simon Arberet, Martin Klein-Hennig and Volker Hohmann.

Part I: General principles of music source separation
Outline
1. Source separation and music
2. Computational auditory scene analysis
3. Probabilistic linear modeling
4. Probabilistic variance modeling
5. Summary and future challenges

Source separation and music

Audio source separation
Many sound scenes are mixtures of several concurrent sound sources. When facing such scenes, humans are able to perceive and focus on individual sources. Source separation is the problem of recovering the source signals underlying a given mixture. It is a core problem of audio signal processing, with applications such as:
- hearing aids,
- post-production,
- remixing and 3D upmixing,
- spoken/multimedia document retrieval,
- MIR.
The data at hand
As an inverse problem, source separation requires some knowledge. Music is among the most difficult application areas of source separation because of the wide variety of sources and mixing processes.
[Figure: recording setups, from concert room (anechoic recording; far-field near-coincident microphone pair; far-field coincident microphone pair; near-field directional microphones for point sources or an extended source; direct sound) to studio (multitrack recording, mixing software, synthesized mixture).]

Music sources
Music sources include acoustical or virtual instruments and singing voice. Sound is produced by transmission of one or more excitation movements/signals through a resonant body/filter. This results in a wide variety of sounds characterized by their:
- polyphony (monophonic or polyphonic),
- temporal shape (transitory, constant or variable),
- spectral fine structure (random or pitched),
- spectral envelope.
[Figure: spectrograms of a piano source and a violin source (f in kHz vs. n in s, levels in dB).]
Effects of microphone recording
For point sources, room acoustics result in filtering of the source signal, where the intensity and delay of direct sound are functions of the source position relative to the microphone. Diffuse sources (piano, drums) amount to (infinitely) many point sources. The mixture signal is equal to the sum of the contributions of all sources at each microphone.

Software mixing effects
Usual software mixing effects include:
- compression and equalization,
- panning, i.e. channel-dependent intensity scaling,
- reverb, polarity and autopan.
The latter are widely employed to achieve perceptual envelopment, whereby even point sources are mixed diffusely. Again, the intensity of direct sound is a function of the source position, and the mixture signal is equal to the sum of the contributions of all sources in each channel.
Overview
Hundreds of source separation systems were designed in the last 20 years... but few are yet applicable to real-world music, as illustrated by the 2008 and 2010 Signal Separation Evaluation Campaigns (SiSEC). The wide variety of techniques boils down to three modeling paradigms:
- computational auditory scene analysis (CASA),
- probabilistic linear modeling, including independent component analysis (ICA) and sparse component analysis (SCA),
- probabilistic variance modeling, including hidden Markov models (HMM) and nonnegative matrix factorization (NMF).
Computational auditory scene analysis (CASA)
CASA aims to emulate the human auditory system. Source formation relies on the Gestalt rules of cognition: proximity, similarity, continuity, closure, common fate.

Auditory front-end
The sound signal is first converted into an auditory nerve representation via a series of processing steps:
- outer- and middle-ear: filter,
- cochlear traveling wave model: filterbank,
- haircell model: halfwave rectification + bandwise compression + cross-band suppression.
[Figure: piano and violin mixture, shown as a spectrogram (f in kHz), on the cochlea (f in ERB, power), after compression and after suppression (loudness).]
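For illustration, here is a minimal sketch of such a front-end (our own stand-in, not the tutorial's code): ERB-spaced Butterworth bandpass filters replace the cochlear traveling wave model, followed by halfwave rectification and bandwise power-law compression; all function names and parameter values are assumptions.

```python
# Crude CASA-style auditory front-end: ERB-spaced filterbank + haircell model.
import numpy as np
from scipy.signal import butter, lfilter

def erb_center_freqs(n_bands, fmin=50.0, fmax=8000.0):
    """Center frequencies equally spaced on the ERB-rate scale."""
    erb = lambda f: 21.4 * np.log10(1 + 0.00437 * f)      # Hz -> ERB number
    erb_inv = lambda e: (10 ** (e / 21.4) - 1) / 0.00437  # ERB number -> Hz
    return erb_inv(np.linspace(erb(fmin), erb(fmax), n_bands))

def auditory_frontend(x, fs, n_bands=30, compress=0.3):
    """Return an (n_bands, n_samples) 'auditory nerve' representation."""
    out = np.zeros((n_bands, len(x)))
    for i, fc in enumerate(erb_center_freqs(n_bands, fmax=0.4 * fs)):
        bw = 24.7 * (0.00437 * fc + 1)                    # ERB bandwidth in Hz
        lo, hi = max(fc - bw, 1.0), min(fc + bw, 0.49 * fs)
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        y = lfilter(b, a, x)                              # one cochlear channel
        y = np.maximum(y, 0.0)                            # halfwave rectification
        out[i] = y ** compress                            # bandwise compression
    return out
```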
Sinusoidal+noise decomposition
Many systems further decompose the signal into a collection of sinusoidal tracks plus residual noise. This decomposition is useful to:
- reduce the number of sound atoms to be grouped into sources,
- enable the exploitation of advanced cues, e.g. amplitude and frequency modulation.
[Figure: sinusoidal representation of the mixture (f in ERB vs. n in s, loudness).]

Spatial cues
Spatial proximity is assessed by comparing the observed:
- interchannel time difference (ITD),
- interchannel intensity difference (IID).
[Figure: ITD (ms) and IID (dB) of the mixture in anechoic and reverberant conditions (f in ERB vs. n in s).]
Note: in practice, most systems consider only binaural data, i.e. recorded by in-ear microphones.
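As an illustration of these two cues, the following sketch (ours, not part of the tutorial) computes per-bin IID and a phase-derived ITD from a stereo STFT; the phase-based ITD is only unambiguous at low frequencies, where the interchannel phase does not wrap.

```python
# Per-bin spatial cues (IID in dB, ITD in ms) from a stereo pair of signals.
import numpy as np
from scipy.signal import stft

def spatial_cues(x_left, x_right, fs, nfft=1024):
    f, _, XL = stft(x_left, fs, nperseg=nfft)
    _, _, XR = stft(x_right, fs, nperseg=nfft)
    eps = 1e-12
    iid = 20 * np.log10((np.abs(XR) + eps) / (np.abs(XL) + eps))  # level diff.
    phase = np.angle(XR * np.conj(XL))       # interchannel phase difference
    with np.errstate(divide="ignore", invalid="ignore"):
        itd = 1000 * phase / (2 * np.pi * f[:, None])             # ms per bin
    itd[0] = 0.0                             # undefined at DC
    return f, iid, itd
```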
Spectral cues
The Gestalt rules also translate into e.g. common pitch and onset time, similar spectral envelope, spectral and temporal smoothness, lack of silent time intervals, correlated amplitude and frequency modulation. Most effort has been devoted to the estimation of pitch by cross-correlation of the auditory nerve representation in each band (see the sketch after this slide).
[Figure: correlograms at n = 0 s and n = 0.5 s (f0 in Hz vs. f in ERB, levels in dB).]

Learned cues
In addition to the above primitive cues, the auditory system relies on a range of learned cues to focus on a given source:
- veridical expectation (episodic memory): "I know the lyrics"
- schematic expectation (semantic memory): "The inaudible word after 'love you' must be 'babe'"
- dynamic adaptive expectation (short-term memory): "This melody already occurred in the song"
- conscious expectation
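A minimal correlogram-style pitch estimator in the spirit of the cross-correlation analysis above (our sketch; the band weighting and peak picking are deliberately naive):

```python
# Summary-autocorrelation pitch estimate from an auditory representation.
import numpy as np

def correlogram_pitch(A, fs, fmin=80.0, fmax=800.0):
    """A: (n_bands, n_samples) auditory representation of one short frame."""
    n = A.shape[1]
    # Per-band autocorrelation via FFT (Wiener-Khinchin theorem).
    spec = np.abs(np.fft.rfft(A - A.mean(axis=1, keepdims=True), 2 * n, axis=1)) ** 2
    ac = np.fft.irfft(spec, axis=1)[:, :n]
    summary = ac.sum(axis=0)                  # summary autocorrelation
    lo = max(int(fs / fmax), 1)               # restrict to plausible pitch lags
    hi = min(int(fs / fmin), n - 1)
    best_lag = lo + int(np.argmax(summary[lo:hi]))
    return fs / best_lag                      # pitch estimate in Hz
```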
Source formation and signal extraction
Each time-frequency bin or each sinusoidal track is associated to a single source according to the above cues: this is known as binary masking. Individual cues are ambiguous, e.g.:
- the observed IID/ITD may be due to a single source in the associated direction or to several concurrent sources around that direction,
- a given sinusoidal track may be a harmonic of different sources.
Most systems exploit several cues, with some precedence order or weighting factors determined by psycho-acousticians. (A masking sketch follows the summary below.)
[Figure: piano mask and estimated piano signal (f in ERB vs. n in s, loudness).]

Summary of CASA
Advantages:
- wide range of spectral, spatial and learned cues
- robustness thanks to joint exploitation of several cues
Limitations:
- musical noise artifacts due to binary masking
- suboptimal cues, designed for auditory scene analysis instead of machine source separation
- practical limitation to a few spectral and/or spatial cues, with no general framework for the integration of additional cues
- (historically) bottom-up approach, prone to error propagation, and limitation to pitched sources
- no results within recent evaluation campaigns
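The signal extraction step itself reduces to masking the mixture STFT; a minimal sketch, assuming the grouping stage has already labeled each time-frequency bin with a source index (function and variable names are ours):

```python
# Resynthesize one source from a mixture given per-bin source labels.
import numpy as np
from scipy.signal import stft, istft

def extract_source(x, fs, labels, j, nfft=1024):
    """x: mixture signal; labels: integer array with the same shape as the STFT."""
    f, t, X = stft(x, fs, nperseg=nfft)
    mask = (labels == j)                 # binary mask selecting source j's bins
    _, y = istft(X * mask, fs, nperseg=nfft)
    return y
```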
Probabilistic linear modeling

Model-based audio source separation
The alternative top-down approach consists of finding the source signals that best fit the mixture and the expected properties of audio sources. In a probabilistic framework, this translates into:
- building generative models of the source and mixture signals,
- inferring latent variables in a maximum a posteriori (MAP) sense.
Linear modeling
The established linear modeling paradigm relies on two assumptions:
1. point sources,
2. low reverberation.
Under assumption 1, the sources and the mixing process can be modeled as single-channel source signals and a linear filtering process. Under assumption 2, this filtering process is equivalent to complex-valued multiplication in the time-frequency domain via the short-time Fourier transform (STFT). In each time-frequency bin (n, f):
$$X_{nf} = \sum_{j=1}^{J} A_{jf} S_{jnf}$$
where $X_{nf}$ is the vector of mixture STFT coefficients, $J$ the number of sources, $S_{jnf}$ the $j$th source STFT coefficient and $A_{jf}$ the $j$th mixing vector.

Priors over the mixing vectors
The mixing vectors $A_{jf}$ encode the apparent sound direction in terms of ITD $\tau_{jf}$ and IID $g_{jf}$. For non-echoic mixtures, ITDs and IIDs are constant over frequency and related to the direction of arrival (DOA) $\theta_j$ of each source:
$$A_{jf} \propto \begin{pmatrix} 1 \\ g_j\, e^{-2i\pi f \tau_j} \end{pmatrix}$$
For echoic mixtures, ITDs and IIDs follow a smeared distribution $P(A_{jf}|\theta_j)$.
[Figure: empirical distributions of ITD (ms) and IID (dB) for anechoic and reverberant conditions (RT = 50 ms, 250 ms, 1.25 s).]
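As a toy illustration of this model, the following sketch synthesizes a stereo mixture STFT from single-channel sources with assumed per-source IID/ITD values (all names and parameter values are ours; sources are assumed to have equal length):

```python
# Anechoic stereo mixing in the STFT domain: X_nf = sum_j A_jf * S_jnf.
import numpy as np
from scipy.signal import stft

def mix_anechoic(sources, fs, gains, itds_ms, nfft=1024):
    """sources: list of equal-length 1-D arrays; returns (freqs, X) with
    X of shape (2, F, N)."""
    f, X = None, None
    for s, g, tau in zip(sources, gains, itds_ms):
        f, _, S = stft(s, fs, nperseg=nfft)
        A = np.stack([np.ones(len(f), dtype=complex),
                      g * np.exp(-2j * np.pi * f * tau / 1000)])  # A_jf, (2, F)
        contrib = A[:, :, None] * S[None, :, :]                   # A_jf * S_jnf
        X = contrib if X is None else X + contrib
    return f, X
```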
I.i.d. priors over the source STFT coefficients
Most systems assume that the sources have random spectra, i.e. their STFT coefficients $S_{jnf}$ are independent and identically distributed (i.i.d.). The magnitude STFT coefficients of audio sources are sparse: at each frequency, few coefficients have large values while most are close to zero. This property is well modeled by the generalized exponential distribution
$$P(|S_{jnf}| \mid p, \beta_f) = \frac{p}{\beta_f\, \Gamma(1/p)}\, e^{-\left|\frac{S_{jnf}}{\beta_f}\right|^p}$$
with shape parameter $p$ and scale parameter $\beta_f$.
[Figure: speech source spectrogram and distribution of its magnitude STFT coefficients (scaled to unit variance), compared with Gaussian (p=2), Laplacian (p=1) and generalized (p=0.4) fits.]
Note: coarser binary activity priors have also been employed.

Inference algorithms
Given the above priors, source separation is typically achieved by joint MAP estimation of the source STFT coefficients $S_{jnf}$ and other latent variables ($A_{jf}$, $g_j$, $\tau_j$, $p$, $\beta_f$) via alternating nonlinear optimization. This objective is called sparse component analysis (SCA). For typical values of $p$, the MAP source STFT coefficients are nonzero for at most two sources in a stereo setting. When the number of sources is $J = 2$, SCA is renamed nongaussianity-based frequency-domain independent component analysis (FDICA).
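A hedged sketch of the per-bin MAP step: with known mixing vectors and a stereo mixture, each candidate pair of active sources fits the bin exactly, so the pair minimizing the ℓp log-prior cost is retained. This is a simplified reading of SCA, not a full implementation (the mixing vectors and scale are assumed known and fixed):

```python
# Per-bin sparse source estimation with at most two active sources.
import numpy as np
from itertools import combinations

def sca_bin(x, A, p=0.4):
    """x: (2,) stereo mixture bin; A: (2, J) known mixing vectors.
    Returns the (J,) MAP source coefficients under the l_p cost."""
    J = A.shape[1]
    best_cost, best = np.inf, None
    for pair in combinations(range(J), 2):
        Ap = A[:, pair]
        try:
            s_pair = np.linalg.solve(Ap, x)   # exact fit with this active pair
        except np.linalg.LinAlgError:
            continue
        cost = np.sum(np.abs(s_pair) ** p)    # generalized-exponential log-prior
        if cost < best_cost:
            best_cost, best = cost, (pair, s_pair)
    s = np.zeros(J, dtype=complex)
    if best is not None:
        pair, s_pair = best
        s[list(pair)] = s_pair
    return s
```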
Practical illustration of separation using i.i.d. linear priors
[Figure: spectrograms of the left source S1nf, center source S2nf and right source S3nf, the mixture Xnf, the predominant source pairs and the estimated nonzero source pairs (1+2, 1+3, 2+3), and the three estimated sources.]
Time-frequency bins dominated by the center source are often erroneously associated with the two other sources.
[Audio: panned mixture and estimated sources using i.i.d. linear priors; recorded reverberant mixture and estimated sources using i.i.d. linear priors.]

SiSEC results on toy mixtures of 3 sources
[Figure: SDR (dB) achieved by i.i.d. linear priors vs. the ideal CASA mask (upper bound), on panned and recorded (RT = 250 ms) mixtures.]
Summary of probabilistic linear modeling
Advantages:
- top-down approach
- separation of more than one source per time-frequency bin
Limitations:
- restricted to mixtures of non-reverberated point sources
- separation of at most two sources per time-frequency bin
- musical noise artifacts due to the ambiguities of spatial cues
- no straightforward framework for the integration of spectral cues
Probabilistic variance modeling

Idea 1: from sources to mixture components
Diffuse or semi-diffuse sources cannot be modeled as single-channel signals, nor even as finite-dimensional signals. Instead of considering the signal produced by each source, one may consider its contribution to each channel of the mixture signal. Source separation becomes the problem of estimating the multichannel mixture components underlying the mixture. In each time-frequency bin (n, f):
$$X_{nf} = \sum_{j=1}^{J} C_{jnf}$$
where $X_{nf}$ is the vector of mixture STFT coefficients, $J$ the number of sources and $C_{jnf}$ the $j$th mixture component.

Idea 2: translation and phase invariance
In order to overcome the ambiguities of spatial cues, additional spectral cues are needed, as shown by CASA. Most audio sources are translation- and phase-invariant: a given sound may be produced at any time with any relative phase across frequency.

Variance modeling
Variance modeling combines these two ideas by modeling the STFT coefficients of individual mixture components by a circular multivariate distribution whose parameters vary over time and frequency. The non-sparsity of source STFT coefficients over small time-frequency regions suggests the use of a non-sparse distribution.
[Figure: speech source spectrogram and generalized Gaussian shape parameter p as a function of neighborhood size (Hz × s).]

Choice of the distribution
For historical reasons, several distributions have been preferred in a mono context, which can equivalently be expressed as divergence functions over the source magnitude/power STFT coefficients:
- Poisson ↔ Kullback-Leibler divergence, a.k.a. I-divergence
- tied-variance Gaussian ↔ Euclidean distance
- log-Gaussian ↔ weighted log-Euclidean distance
These distributions do not easily generalize to multichannel data.
The multichannel Gaussian model
The zero-mean Gaussian distribution is a simple multichannel model:
$$P(C_{jnf}|\Sigma_{jnf}) = \frac{1}{\det(\pi\Sigma_{jnf})}\, e^{-C_{jnf}^H \Sigma_{jnf}^{-1} C_{jnf}}$$
where $\Sigma_{jnf}$ is the $j$th component covariance matrix. The covariance matrix $\Sigma_{jnf}$ of each mixture component can be factored as the product of a scalar nonnegative variance $V_{jnf}$ and a mixing covariance matrix $R_{jf}$, respectively modeling spectral and spatial properties:
$$\Sigma_{jnf} = V_{jnf} R_{jf}$$
Under this model, the mixture STFT coefficients also follow a Gaussian distribution whose covariance is the sum of the component covariances:
$$P(X_{nf}|\{V_{jnf}, R_{jf}\}_j) = \frac{1}{\det\left(\pi \sum_{j=1}^{J} V_{jnf} R_{jf}\right)}\, e^{-X_{nf}^H \left(\sum_{j=1}^{J} V_{jnf} R_{jf}\right)^{-1} X_{nf}}$$

General inference algorithm
Independently of the priors over $V_{jnf}$ and $R_{jf}$, source separation is typically achieved in two steps:
1. joint MAP estimation of all model parameters using the expectation-maximization (EM) algorithm,
2. MAP estimation of the source STFT coefficients conditional to the model parameters by multichannel Wiener filtering:
$$\hat{C}_{jnf} = V_{jnf} R_{jf} \left( \sum_{j'=1}^{J} V_{j'nf} R_{j'f} \right)^{-1} X_{nf}$$
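Step 2 has a direct implementation; a minimal sketch, assuming the variances V and mixing covariances R have already been estimated (e.g. by EM), with an array layout of our choosing:

```python
# Multichannel Wiener filtering of the mixture STFT into components.
import numpy as np

def wiener_components(X, V, R):
    """X: (I, F, N) mixture STFT; V: (J, N, F) variances; R: (J, F, I, I)
    mixing covariances. Returns C: (J, I, F, N) estimated components."""
    I, F, N = X.shape
    J = V.shape[0]
    C = np.zeros((J, I, F, N), dtype=complex)
    reg = 1e-10 * np.eye(I)                       # numerical regularization
    for f in range(F):
        for n in range(N):
            Sigma = sum(V[j, n, f] * R[j, f] for j in range(J)) + reg
            inv_x = np.linalg.solve(Sigma, X[:, f, n])   # Sigma_nf^{-1} X_nf
            for j in range(J):
                C[j, :, f, n] = V[j, n, f] * (R[j, f] @ inv_x)
    return C
```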
Rank-1 priors over the mixing covariances
The mixing covariances $R_{jf}$ encode the apparent spatial direction and spatial spread of sound in terms of:
- ITD,
- IID,
- normalized interchannel correlation, a.k.a. interchannel coherence.
For non-reverberated point sources, the interchannel coherence is equal to one, i.e. $R_{jf}$ has rank 1:
$$R_{jf} = A_{jf} A_{jf}^H$$
The priors $P(A_{jf}|\theta_j)$ used with linear modeling can then be simply reused.

Full-rank priors over the mixing covariances
For reverberated or diffuse sources, the interchannel coherence is smaller than one, i.e. $R_{jf}$ has full rank. The theory of statistical room acoustics suggests the direct+diffuse model
$$R_{jf} \propto \lambda_j A_{jf} A_{jf}^H + B_f$$
with
$$A_{jf} = \sqrt{\frac{2}{1+g_j^2}} \begin{pmatrix} 1 \\ g_j\, e^{-2i\pi f \tau_j} \end{pmatrix}, \qquad B_f = \begin{pmatrix} 1 & \mathrm{sinc}(2\pi f d / c) \\ \mathrm{sinc}(2\pi f d / c) & 1 \end{pmatrix}$$
where $\lambda_j$ is the direct-to-reverberant ratio, $A_{jf}$ the direct mixing vector, $B_f$ the diffuse noise covariance, $\tau_j$ the ITD of direct sound, $g_j$ the IID of direct sound, $d$ the microphone spacing and $c$ the sound speed.
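A small helper constructing this direct+diffuse covariance for a two-channel microphone pair (our sketch with illustrative parameter values; note that numpy's sinc is the normalized one, hence the argument scaling):

```python
# Direct+diffuse mixing covariance R_f = lam * A_f A_f^H + B_f.
import numpy as np

def mixing_covariance(freqs, tau_ms, g, lam, d=0.05, c=343.0):
    """freqs in Hz; tau_ms: ITD of direct sound; g: IID; lam: direct-to-
    reverberant ratio; d: mic spacing (m); c: sound speed (m/s)."""
    R = np.zeros((len(freqs), 2, 2), dtype=complex)
    for i, f in enumerate(freqs):
        A = np.sqrt(2.0 / (1 + g ** 2)) * np.array(
            [1.0, g * np.exp(-2j * np.pi * f * tau_ms / 1000)])
        coh = np.sinc(2 * f * d / c)   # np.sinc(x) = sin(pi x)/(pi x)
        B = np.array([[1.0, coh], [coh, 1.0]], dtype=complex)
        R[i] = lam * np.outer(A, A.conj()) + B
    return R
```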
I.i.d. priors over the source variances
Baseline systems rely again on the assumption that the sources have random spectra and model the source variances $V_{jnf}$ as i.i.d. and locally constant within small time-frequency regions. When these follow a mildly sparse prior, it can be shown that the MAP variances are nonzero for up to four sources. Discrete priors constraining the number of nonzero variances to one or two have also been employed. When the number of sources is $J = 2$, this model is also called nonstationarity-based FDICA.

Benefit of exploiting interchannel coherence
Interchannel coherence helps resolve some ambiguities of ITD and IID and identify the predominant sources more accurately.
[Figure: geometric comparison of the linear model and the covariance model for a mixture X of sources S1, S2, S3 with mixing vectors A1, A2, A3 and variances V1, V2, V3.]
Practical illustration of separation using i.i.d. variance priors
[Figure: spectrograms of the left source (IID < 0), center source (IID = 0) and right source (IID > 0), the mixture, the predominant source pairs and the estimated nonzero source pairs (1+2, 1+3, 2+3), and the three estimated sources.]

Spectral priors based on template spectra
Variance modeling enables the design of phase-invariant spectral priors. The Gaussian mixture model (GMM) represents the variance $V_{jnf}$ of each source at a given time by one of $K$ template spectra $w_{jkf}$ indexed by a discrete state $q_{jn}$:
$$V_{jnf} = w_{j q_{jn} f} \quad \text{with} \quad P(q_{jn} = k) = \pi_{jk}$$
Different strategies have been proposed to learn these spectra:
- speaker-independent training on separate single-source data,
- speaker-dependent training on separate single-source data,
- MAP adaptation to the mixture using model selection or interpolation,
- MAP inference from a coarse initial separation.
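As a toy illustration of decoding with template spectra, the following frame-wise MAP state selection under the zero-mean Gaussian variance model (whose negative log-likelihood matches the Itakura-Saito divergence up to constants) ignores any temporal prior; it is our sketch, not any of the systems discussed:

```python
# Frame-wise MAP state decoding against GMM template spectra.
import numpy as np

def decode_states(P, W, log_pi):
    """P: (F, N) observed power spectrogram; W: (K, F) template spectra;
    log_pi: (K,) log state priors. Returns q: (N,) state indices."""
    eps = 1e-12
    ratio = P.T[None] / (W[:, None, :] + eps)               # (K, N, F)
    dis = np.sum(ratio - np.log(ratio + eps) - 1, axis=2)   # IS divergence, (K, N)
    return np.argmax(log_pi[:, None] - dis, axis=0)
```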
Practical illustration of separation using template spectra
[Figure: piano source, violin source and mixture spectrograms; template spectra w_{jkf}; estimated state sequences q_{jn}; estimated piano, violin and mixture variances; estimated piano and violin sources.]

Spectral priors based on basis spectra
The GMM does not efficiently model polyphonic musical instruments. The variance $V_{jnf}$ of each source is then better represented as the linear combination of $K$ basis spectra $w_{jkf}$ multiplied by time-varying scale factors $h_{jkn}$:
$$V_{jnf} = \sum_{k=1}^{K} h_{jkn} w_{jkf}$$
This model is also called nonnegative matrix factorization (NMF). Again, a range of strategies have been used to learn these spectra:
- instrument-dependent training on separate single-source data,
- MAP adaptation to the mixture using uniform priors,
- MAP adaptation to the mixture using trained priors.

Practical illustration of separation using basis spectra
[Figure: piano source, violin source and mixture spectrograms; basis spectra w_{jkf}; estimated scale factors h_{jkn}; estimated piano, violin and mixture variances; estimated piano and violin sources.]

Constrained template/basis spectra
MAP adaptation or inference of the template/basis spectra is often needed due to:
- the lack of training data,
- the mismatch between training and test data.
However, it is often inaccurate: additional constraints over the spectra are needed to further reduce overfitting.
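A compact sketch of fitting basis spectra and scale factors by the standard KL/I-divergence multiplicative updates, one common NMF estimator (initialization, iteration count and naming are our choices):

```python
# NMF with I-divergence (KL) multiplicative updates.
import numpy as np

def nmf_kl(P, K, n_iter=100, seed=0):
    """Factor a power spectrogram P (F, N) as W @ H, W: (F, K), H: (K, N)."""
    rng = np.random.default_rng(seed)
    F, N = P.shape
    W = rng.random((F, K)) + 1e-3
    H = rng.random((K, N)) + 1e-3
    eps = 1e-12
    for _ in range(n_iter):
        V = W @ H + eps
        W *= ((P / V) @ H.T) / (H.sum(axis=1) + eps)            # basis spectra
        V = W @ H + eps
        H *= (W.T @ (P / V)) / (W.sum(axis=0)[:, None] + eps)   # scale factors
    return W, H
```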
Harmonicity and spectral smoothness constraints
For instance, harmonicity and spectral smoothness can be enforced by:
- associating each basis spectrum with some a priori pitch p,
- modeling $w_{jpf}$ as the sum of fixed narrowband spectra $b_{plf}$ representing adjacent partials at harmonic frequencies, scaled by spectral envelope coefficients $e_{jpl}$:
$$w_{jpf} = \sum_{l=1}^{L_p} e_{jpl}\, b_{plf}$$
Parameter estimation now amounts to estimating the active pitches and their spectral envelopes instead of their full spectra.

Practical illustration of harmonicity constraints
[Figure: narrowband partial spectra $b_{p,1,f}$ to $b_{p,6,f}$ (f in ERB) with estimated envelope coefficients $e_{jp,1}=0.756$, $e_{jp,2}=0.128$, $e_{jp,3}=0.041$, $e_{jp,4}=0.037$, $e_{jp,5}=0.011$, $e_{jp,6}=0$.]
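A sketch of building the fixed narrowband partial spectra on a linear frequency grid (Gaussian-blurred partials; the widths and the grid are our assumptions, and the example envelope values are those from the figure above):

```python
# Fixed narrowband partial spectra b_{plf} for one pitch.
import numpy as np

def partial_spectra(f0, freqs, n_partials=6, width_hz=20.0):
    """Return b: (n_partials, F), one smooth spectrum per partial of pitch f0.
    freqs: e.g. np.linspace(0, fs / 2, nfft // 2 + 1)."""
    b = np.zeros((n_partials, len(freqs)))
    for l in range(n_partials):
        fl = (l + 1) * f0                         # harmonic frequency
        b[l] = np.exp(-0.5 * ((freqs - fl) / width_hz) ** 2)
        b[l] /= b[l].sum() + 1e-12                # unit-sum normalization
    return b

# w_{jpf} = e @ b for an envelope vector e, e.g. (values from the figure):
# e = np.array([0.756, 0.128, 0.041, 0.037, 0.011, 0.0])
# w = e @ partial_spectra(220.0, freqs)
```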
Further constraints
Further constraints that have been implemented in this context include:
- a source-filter model of instrumental timbre,
- inharmonicity and tuning.
Probabilistic priors are also popular:
- state transition priors: $P(q_{jn} = k | q_{j,n-1} = l) = \pi_{jkl}$
- spectral continuity priors (for percussive sounds): $P(V_{jnf} | V_{jn,f-1}) = \mathcal{N}(V_{jnf}; V_{jn,f-1}, \sigma_{\mathrm{perc}})$
- temporal continuity priors (for sustained sounds): $P(V_{jnf} | V_{j,n-1,f}) = \mathcal{N}(V_{jnf}; V_{j,n-1,f}, \sigma_{\mathrm{sust}})$

SiSEC results on toy mixtures of 3 sources
[Figure: SDR (dB) achieved by adapted basis spectra vs. i.i.d. linear priors, on panned and recorded (RT = 250 ms) mixtures.]
[Audio: panned mixture and estimated sources using adapted basis spectra and i.i.d. linear priors; recorded reverberant mixture and estimated sources using adapted basis spectra and i.i.d. linear priors.]
SiSEC results on professional mixtures
[Figure: SDR (dB) per source (vocals, drums, bass, guitar, piano) on Tamy (2 sources) and Bearlin (10 sources).]
[Audio: Tamy (2 sources) and Bearlin (10 sources), estimated sources using adapted basis spectra.]

Summary of probabilistic variance modeling
Advantages:
- top-down approach
- virtually applicable to any mixture, including to diffuse sources
- no hard constraint on the number of sources per time-frequency bin
- fewer musical noise artifacts by joint exploitation of spatial, spectral and learned cues
- principled modular framework for the integration of additional cues
Limitations:
- remaining musical noise artifacts
- current implementations limited to a few spectral and/or spatial cues... but this is gradually changing!
Summary and future challenges

Summary of the principles of model-based source separation
Most model-based source separation systems rely on modeling the STFT coefficients of each source as a function of:
- a scalar variable ($S_{jnf}$ or $V_{jnf}$) encoding spectral cues,
- a vector or matrix variable ($A_{jf}$ or $R_{jf}$) encoding spatial cues.
Robust source separation requires priors over both types of cues:
- spectral cues alone cannot discriminate sources with similar pitch range and timbre,
- spatial cues alone cannot discriminate sources with the same DOA.
A range of informative priors have been proposed, relating for example:
- $S_{jnf}$ or $V_{jnf}$ to discrete or continuous latent states,
- $A_{jf}$ or $R_{jf}$ to the source DOAs.
Variance modeling outperforms linear modeling.
Conclusion and remaining challenges
To sum up, source separation is a core problem of audio signal processing with huge potential applications. Existing systems are gradually finding their way into the industry, especially for applications that can accommodate:
- a certain amount of musical noise artifacts, such as MIR,
- partial user input/feedback, such as post-production.
We believe that these two limitations could be addressed in the next 10 years by exploiting the full power of probabilistic modeling, especially by:
- integrating more and more spatial and spectral cues,
- making better use of learned cues, using training data or repeated sounds.

References
- D.L. Wang and G.J. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms and Applications, Wiley/IEEE Press, 2006.
- E. Vincent, M.G. Jafari, S.A. Abdallah, M.D. Plumbley, and M.E. Davies, "Probabilistic modeling paradigms for audio source separation," in Machine Audition: Principles, Algorithms and Systems, IGI Global, 2010.
- 2008 and 2010 Signal Separation Evaluation Campaigns: http://sisec.wiki.irisa.fr/
Music Source Separation and its Applications to MIR
Nobutaka Ono and Emmanuel Vincent
The University of Tokyo, Japan / INRIA Rennes - Bretagne Atlantique, France
ISMIR 2010 Tutorial 1, Aug. 9, 2010. Tutorial supported by the VERSAMUS project http://versamus.inria.fr/
Contributions from Shigeki Sagayama, Kenichi Miyamoto, Hirokazu Kameoka, Jonathan Le Roux, Emiru Tsunoo, Yushi Ueda, Hideyuki Tachibana, George Tzanetakis, Halfdan Rump, and other members of IPC Lab #1.

Outline
Part I: Brief Introduction of the State of the Art
Part II: Harmonic/Percussive Sound Separation
- Motivation and Formulation
- Open Binary Software
Part III: Applications of HPSS to MIR Tasks
- Singer/Instrument Identification
- Audio Tempo Estimation
- Audio Chord Estimation
- Melody Extraction
- Audio Genre Classification
Conclusions

Introduction
The focus of the second half of this tutorial is to clarify:
- What has source separation been used for in MIR?
- How does it improve the performance of MIR tasks?
Examples:
- Multi-pitch estimation: the task itself is tightly coupled with source separation.
- Audio genre classification: how source separation is useful is not straightforward.

Part I: Brief Introduction of the State of the Art
Singer Identification
Task: identify a singer from music audio with accompaniment.
Typical approach: audio → feature extraction → features → classifier → singer.

Accompaniment Sound Reduction [Fujihara2005]
Predominant-F0-based voice separation of the audio input by PreFEst [Goto2004]. (Fig. 1 [Fujihara2005])

Reliable Frame Selection [Fujihara2005]
Only reliable frames are used for classification: feature extraction → reliable frame selection → classifier. (Fig. 1 [Fujihara2005])

Evaluation by Confusion Matrix
[Figure: Fig. 3 [Fujihara2005], male/female confusion matrices for baseline, selection only, reduction only, and reduction and selection.]
Male/female confusion is decreased by accompaniment reduction. The combination of reduction and selection improves performance considerably.
Vocal Separation Based on a Melody Transcriber
Melody-F0-based vocal separation [Mesaros2007]:
- Estimate the melody F0 with a melody transcription system [Ryynanen2006].
- Generate harmonic overtones at multiples of the estimated F0.
- Estimate the amplitudes and phases of the overtones based on cross-correlation between the original signal and complex exponentials.

Evaluation by Identification Rate
They evaluate the effect of separation on singer identification performance using different classifiers.
[Figure: correct identification rate (%) with and without separation for various classifiers, at singing-to-accompaniment ratios of -5 dB and 15 dB; generated from Tables 1 and 2 of [Mesaros2007].]
Performance is much improved, especially at low singing-to-accompaniment ratio.
Instrument Identification
Task: determine the instruments present in a music piece.
Typical approach: audio → separation into notes → feature extraction from the note spectrograms → features → classifier → instrument.
Important issue: source separation is not perfect. How to reduce the resulting errors?

Feature Weighting [Kitahara2007]
Feature vectors of each instrument are collected from polyphonic music for training. The robustness of each feature is evaluated by the ratio of intra-class variance to inter-class variance, applying linear discriminant analysis (LDA) for feature weighting (PCA → LDA; modified from Fig. 1 [Kitahara2007]).
Effectiveness of Feature Weighting
[Figure: instrument recognition rate, Fig. 6 [Kitahara2007].]
Feature weighting by LDA improves the recognition rate.

Audio Tempo Estimation
Task: extract the tempo from musical audio.
Typical approach: audio → STFT or filterbank → subband signals → onset detection → detection function → periodicity analysis → tempo candidates → tracking → tempo.
Applying a Harmonic+Noise Model
A harmonic+noise model is applied before calculating the detection function [Alonso2007]: source separation based on the harmonic+noise model (Fig. 2 [Alonso2007]). Detection functions are calculated from both the harmonic component and the noise component, and then merged.

Influence of the H+N Model
[Figure: comparison of periodicity detection algorithms, Fig. 14 [Alonso2007].]
Separation based on the H+N model shows better results.

Applying PLCA
PLCA (probabilistic latent component analysis), an NMF-like method, is applied [Chordia2009] (Fig. 1 [Chordia2009]). It greatly increases the number of tempo candidates. They report its effectiveness.

Part II: Harmonic/Percussive Sound Separation
Motivation and Goal of HPSS
Motivation: music consists of two different components, a harmonic component and a percussive component (example: a popular music piece, RWC-MDB-P-034).
Goal: separation of a monaural audio signal into harmonic and percussive components.
Target: MIR-related tasks
- H-related: multi-pitch analysis, chord recognition, ...
- P-related: beat tracking, rhythm recognition, ...

Related Work on H/P Separation
- Source separation into multiple components followed by classification: ICA and classification [Uhle2003], NMF and classification [Helen2005].
- Steady+transient models: adaptive phase vocoder, subspace projection, matching pursuit, etc. A good review is provided in [Daudet2005].
- Bayesian NMF [Dikmen2009].
Point: Anisotropy of the Spectrogram
The harmonic component is horizontally smooth on the spectrogram, while the percussive component is vertically smooth.

H/P Separation Problem
Problem: find $H_{t,\omega}$ and $P_{t,\omega}$ from $W_{t,\omega}$ on the power spectrogram.
Requirements:
1) $H_{t,\omega}$: horizontally smooth
2) $P_{t,\omega}$: vertically smooth
3) $H_{t,\omega}$ and $P_{t,\omega}$: non-negative
4) $H_{t,\omega} + P_{t,\omega}$: should be close to $W_{t,\omega}$

Formulation of H/P Separation (1/2)
Formulation as an optimization problem: minimize an objective function consisting of a closeness cost and a smoothness cost, under the constraints $H_{t,\omega} \geq 0$ and $P_{t,\omega} \geq 0$:
- closeness cost function: I-divergence (for scale invariance),
- smoothness cost function: square of differences, with weights to control the two smoothness terms.
In a MAP estimation context, these correspond to the likelihood term and the prior term, respectively.

Formulation of H/P Separation (2/2)
A variance modeling-based separation using:
- a Poisson observation distribution,
- Gaussian continuity priors [Miyamoto2008, Ono2008, etc.].
Update Rules
Two kinds of variables are updated alternately:
- H and P,
- auxiliary variables.

Separated Examples
[Audio: original, H and P components of RWC-MDB-P-7 "PROLOGUE", RWC-MDB-P-12 "KAGE-ROU", RWC-MDB-P-18 "True Heart", RWC-MDB-P-25 "tell me", RWC-MDB-J-16 "Jive".]
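For illustration only: the sketch below is not the auxiliary-function update rules above, but a simpler median-filtering scheme exploiting the same spectrogram anisotropy (horizontal vs. vertical smoothing), with Wiener-like masks so that the two components sum back to the mixture.

```python
# Simplified HPSS via anisotropic median filtering of the power spectrogram.
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import stft, istft

def hpss(x, fs, nfft=1024, kernel=17):
    f, t, X = stft(x, fs, nperseg=nfft)
    W = np.abs(X) ** 2
    Hs = median_filter(W, size=(1, kernel))   # smooth along time  -> harmonic
    Ps = median_filter(W, size=(kernel, 1))   # smooth along freq. -> percussive
    mask_h = Hs / (Hs + Ps + 1e-12)           # soft, complementary masks
    _, h = istft(X * mask_h, fs, nperseg=nfft)
    _, p = istft(X * (1 - mask_h), fs, nperseg=nfft)
    return h, p
```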
Real-Time Implementation
Sliding block analysis: iterations are applied only within a sliding block.

Open Software: Real-Time H/P Equalizer
Available at http://www.hil.t.u-tokyo.ac.jp/software/HPSS/. It controls the H/P balance of an audio signal in real time. Simple instructions:
1) Click the "Load WAV" button and choose a WAV-formatted audio file.
2) Click the "Start" button; the audio starts.
3) Slide the H/P balance bar as you like and listen to how the sound changes.
Part III: Applications of HPSS to MIR Tasks
III-1: Audio Chord Detection

Audio Chord Detection
Task: estimate the chord sequence and its segmentation from music audio (e.g. C, G, Am, F, C, G, F, C).

Typical Approach: Chroma Feature + HMM
Feature: chroma [Fujishima1999] (24-dim. features). Transition: chord progression.
Maximum a posteriori chord estimation [Sheh2003]: HMM-based chord recognition.
- Training: HMM training of the initial, emission and transition probabilities, i.e. the chroma observation probability $p(x_t|c_t)$ (acoustic model) and the bigram probability $p(c_t|c_{t-1})$ (language model).
- Recognition: feature extraction, then Viterbi decoding yields the recognized chord sequence.

Feature-refined System [Ueda2009]
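The decoding step is ordinary Viterbi; a generic sketch (not the authors' implementation), taking per-frame chord log-likelihoods and bigram log-probabilities as inputs:

```python
# Viterbi decoding for chroma+HMM chord recognition.
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """log_emit: (T, C) log p(x_t|c); log_trans: (C, C), [i, j] = log p(c_j|c_i);
    log_init: (C,). Returns the MAP chord index sequence of length T."""
    T, C = log_emit.shape
    delta = log_init + log_emit[0]
    psi = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans           # (C_prev, C_next)
        psi[t] = np.argmax(scores, axis=0)            # best predecessor per chord
        delta = scores[psi[t], np.arange(C)] + log_emit[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):                     # backtrack
        path[t - 1] = psi[t, path[t]]
    return path
```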
Suppressing Percussive Sounds
Percussive sounds are harmful in chord detection: emphasize the harmonic components by HPSS.

Fourier-transformed Chroma
The covariance matrix of chroma has highly correlated components, caused by harmonic overtones and by pitches performed at the same time; a diagonal-only approximation is infeasible, which results in a large number of parameters. Assuming that the harmonic overtones of all pitches have the same structure and that the amount of occurrence of the same intervals is the same, the covariance matrix is near circulant. A circulant matrix is diagonalized by the DFT, so a diagonal approximation of the FT-chroma covariance becomes feasible. This reduces the number of model parameters (statistically robust).
[Figure: chroma covariance vs. FT-chroma covariance.]
Tuning Compensation
There are tuning differences among songs (e.g. A = 440.0 Hz vs. 446.4 Hz, +25 cents); neglecting this may blur the chroma features. Choose the best tuning from multiple filterbank candidates: find the maximum chroma energy (sum of all bins of the chroma), assuming the tuning does not change within a song.
[Figure: log power of pitches around A (G, G#, A, A#) as a function of tuning (log frequency).]

Delta Chroma Features
Chord tones change largely at chord boundaries; chord boundary accuracy is improved by features representing chord boundaries. Delta chroma: the derivative of the chroma features (cf. delta cepstrum of MFCC, an effective feature for speech recognition), calculated by regression analysis over ±δ sample points [Sagayama&Itakura1979] and robust to noise:
$$\Delta C(i, n) = \frac{\sum_{k=-\delta}^{\delta} k\, C(i, n+k)}{\sum_{k=-\delta}^{\delta} k^2}, \qquad i = 1, \ldots, 12$$
i.e. the slope of the local regression line.
[Figure: chroma features around chord boundaries over time.]
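The delta-chroma formula translates directly into code (window half-width delta is a free parameter; edge frames are handled by padding):

```python
# Delta chroma: per-bin regression slope over +/-delta frames.
import numpy as np

def delta_chroma(C, delta=2):
    """C: (12, N) chroma; returns (12, N) regression slopes."""
    ks = np.arange(-delta, delta + 1)
    norm = np.sum(ks ** 2)
    Cp = np.pad(C, ((0, 0), (delta, delta)), mode="edge")
    out = np.zeros(C.shape, dtype=float)
    for k in ks:
        out += k * Cp[:, delta + k : delta + k + C.shape[1]]  # k * C(i, n+k)
    return out / norm
```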
Multiple States per Chord
Chroma changes from "onset" to "release"; capture this change by having multiple states per chord (C1 → C2 → C3). There is a tradeoff between data size and the number of states.

Experimental Evaluation
Test data: 180 songs (12 albums) of The Beatles (chord reference annotations provided by C. Harte); 11.025 kHz sampling, 16-bit, 1 ch, WAV files; frequency range 55.0 Hz - 1661.2 Hz (5 octaves).
Labels: 12 × major/minor = 24 chords + N (no chord).
Evaluation: album-filtered 3-fold cross-validation (8 albums for training, 4 albums for testing). Frame recognition rate = (#correct frames) / (#total frames), sampled every 100 ms.
Chord Detection Results
[Figure: chord detection rate for 1, 2 and 3 states per chord and feature configurations Chroma, HE, HE+TC, HE+TC+FT, HE+TC+DC, with error reduction rates of 28.1% and 11.0% indicated; the MIREX 2008 best score [Uchiyama2008] is shown for reference. HE: harmonic sound emphasized; TC: tuning compensation; FT: FT chroma (diagonal covariance); DC: delta chroma.]
HPSS improves chord detection performance.

III-2: Melody Extraction
Melody Extraction
Task: identify a melody pitch contour from polyphonic musical audio.
Typical approach: audio → predominant-F0 extraction → F0s → tracking → melody.
Singing voice enhancement will be a useful pre-processing step.

Singing Voice in the Spectrogram
Example: RWC-MDB-P-25 "tell me".
A. Vertical components: percussion.
B. Horizontal components: harmonic instruments (piano, guitar, etc.).
C. Fluctuating components: singing voice.

Is Voice Harmonic or Percussive?
It depends on the spectrogram resolution (frame length): on a short-frame STFT, voice appears as "H" (clustered in the time direction); on a long-frame STFT, voice appears as "P" (clustered in the frequency direction).

HPSS Results with Different Frame Lengths
[Audio/figure: H and P components with a frame length of 16 ms (voice is "harmonic") and of 512 ms (voice is "percussive").]
Two-stage HPSS [Tachibana2010]
HPSS with a short frame separates the original signal into a sinusoidal sound and a percussive sound; HPSS with a long frame then separates the sinusoidal sound into a stationary sinusoidal sound and a fluctuating sinusoidal sound (≈ singing voice).

Spectrogram Example
[Figure: original signal (from the LabROSA dataset) and voice-enhanced signal (by two-stage HPSS).]

Separation Examples
[Audio: original, extracted vocal and vocal-cancelled versions of "tell me" (F, R&B), "Weekend" (F, Euro beat), "Dance Together" (M, Jazz), "1999" (M, Metal rock), "Seven little crows" (F, Nursery rhyme), and "La donna è mobile" from Verdi's opera "Rigoletto" (M, Classical).]
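A minimal driver for the two-stage scheme above, reusing the hpss() sketch from Part II (the frame lengths are illustrative, stated here for 32 kHz sampling):

```python
def enhance_voice(x, fs):
    """Two-stage HPSS: short frames put voice in H, long frames put it in P."""
    sinusoidal, _perc = hpss(x, fs, nfft=512)          # ~16 ms frames at 32 kHz
    _steady, voice = hpss(sinusoidal, fs, nfft=16384)  # ~512 ms frames at 32 kHz
    return voice
```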
Melody Tracking by DP [Tachibana2010]
The hidden states (the pitch series) are estimated from the observations (the voice-enhanced spectrum) by dynamic programming.
[Figure: trellis of pitch candidates (e.g. 440, 450, 460 Hz) over frames t1, t2, t3.]

Example of Melody Tracking
[Audio/figure: train06.wav, distributed with the LabROSA dataset.]
Results in MIREX 2009
Data: 379 songs, mixed at +5 dB, 0 dB and -5 dB singer-to-accompaniment ratios.
[Figure: results on original vs. processed signals at +5 dB, 0 dB and -5 dB; the HPSS-based method is noise-robust, while the others are noise-sensitive.]
Robustness to a large accompaniment level is greatly improved.

Part III: Applications of HPSS to MIR Tasks
III-3: Audio Genre Classification
Audio Genre Classification
Task: estimate the genre (blues, classical, jazz, rock, ...) from music audio.
Typical approach: audio → feature extraction → features → classifier → genre.
Examples of features [Tzanetakis2001]: timbral information (MFCC, etc.), melodic information, statistics about periodicities (beat histogram).

New Features I: Percussive Patterns
Feature extraction [Tsunoo2009].

Motivation for Bar-long Percussive Patterns
Bar-long percussive patterns (temporal information) are frequently characteristic of a particular genre. Difficulties:
1) mixture of harmonic and percussive components,
2) unknown bar lines,
3) tempo fluctuation,
4) unknown multiple patterns.

Rhythmic Structure Analysis by a One-pass DP Algorithm
Assume that correct bar-long unit patterns are given. Problem: tempo fluctuation and unknown segmentation. This is analogous to the continuous speech recognition problem: a one-pass dynamic programming algorithm can be used to segment the spectrogram of the percussive sound (e.g. into the pattern sequence A A A A B A A A C C C C).
Example of “Rhythm Map”
Dynamic Pattern Clustering [Tsunoo2009] Actually, unit patterns also should be estimated.
One-pass DP alignment
Chicken-and-egg problem Analogous to unsupervised learning problem
Rhythm 1 (Fundamental )
Iterative algorithm based on k-means clustering
Segment spectrogram using one-pass DP algorithm Update unit patterns by averaging segments
Rhythm 2 (Fill-in)
Convergence is guaranteed mathematically
Rhythm 3 (Interlude) Rhythm 4 (Climax)
Fundamental melody Climax Interlude FullSong
U SOLab
N NDD
57 Aug. 9, 2010
ISMIR2010 Tutorial 1
57
Necessity of HPSS in Rhythm Map
U SOLab
N NDD
Aug. 9, 2010
58
ISMIR2010 Tutorial 1
Extracting Common Patterns to a Particular Genre
With HPSS
Apply to a collection of music pieces Alignment calculation by one-pass DP algorithm
Use same set of templates
Updating templates by k-means clustering
Iteration
Use whole music collection of a particular genre
Without HPSS
Rhythm patterns and structures are not extracted without HPSS! U SOLab
N NDD
Aug. 9, 2010
ISMIR2010 Tutorial 1
59
U SOLab
N NDD
60 Aug. 9, 2010
ISMIR2010 Tutorial 1
60
Features and Classifiers
Feature vectors: genre-pattern occurrence histogram (normalized; e.g. pattern counts 4, 1, 2 → 4/7, 1/7, 2/7).
Classifier: support vector machine (SVM).

Experimental Evaluation
Datasets:
- (standard) GTZAN dataset: 22050 Hz sampling, 1 ch, 30-second clips, 10 genres {blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, rock}, 100 songs per genre (1000 songs in total).
- (rhythm-intensive) Ballroom dataset: 22050 Hz sampling, 1 ch, 30-second clips, 8 styles {chacha, foxtrot, quickstep, rumba, samba, tango, viennesewaltz, waltz}, 100 songs per style (800 songs in total).
Pattern sets: the datasets were divided into 2 parts, yielding 2 sets of 10 templates for each genre.
Evaluation: 10-fold cross-validation; classifier: linear SVM (the toolkit "Weka" was used).

Extracted Percussive Patterns
[Figure: example of learned templates, 10 templates learned from "blues".]

Genre Classification Accuracy
Percussive pattern features only:

| Features [number of dim.] | GTZAN dataset | Ballroom dataset |
| Baseline (random) | 10.0% | 12.5% |
| Rhythmic (from template set #1) [10/8] | 43.6% | 54.0% |
| Rhythmic (from template set #2) [10/8] | 42.3% | 55.125% |

Merged with timbral features, i.e. statistical features such as MFCC, etc. (68 dim.) [Tzanetakis 2008], which performed well on audio classification tasks in MIREX 2008:

| Features [number of dim.] | GTZAN dataset | Ballroom dataset |
| Existing (timbre) [68] | 72.4% | 57.625% |
| Merged (from template set #1) [78/76] | 76.1% | 70.125% |
| Merged (from template set #2) [78/76] | 76.2% | 69.125% |

Classification accuracy is improved by combining the percussive pattern features.
New Features II: Bass-line Patterns [Tsunoo2009]
[Figure: examples of extracted bass-line patterns.]

Genre Classification Accuracy
Classification accuracy with only bass-line features, and merged with timbre features:

| Features | GTZAN dataset | Ballroom dataset |
| Baseline (random classifier) | 10.0% | 10.0% |
| Only bass-line (400 dim.) | 42.0% | 44.8% |
| Existing (timbre, 68 dim.) | 72.4% | 72.4% |
| Merged (468 dim.) | 74.4% | 76.0% |

Another Application of HPSS [Rump2010]
An autoregressive MFCC model applied to genre classification: HPSS increases the number of channels from mono to three (original, harmonic, percussive) and improves performance.
Conclusions
Source separation techniques used for MIR:
- F0-based harmonic separation,
- non-negative matrix factorization or PLCA,
- sinusoid+noise model,
- harmonic/percussive sound separation.
Source separation is useful:
- to enhance specific components,
- to increase the number of channels and the dimension of feature vectors,
- to generate new features.
Feature weighting is a technique for overcoming errors due to imperfect source separation.

Future Work
- Improvement of the separation performance itself by exploiting musicological knowledge.
- Current works are limited to monaural separation: use spatial (especially stereo) information.
- Application of source separation to other MIR tasks: cover song identification, audio music similarity, ...

Reference Book Chapter
N. Ono, K. Miyamoto, H. Kameoka, J. Le Roux, Y. Uchiyama, E. Tsunoo, T. Nishimoto and S. Sagayama, "Harmonic and Percussive Sound Separation and its Application to MIR-related Tasks," in Advances in Music Information Retrieval, ser. Studies in Computational Intelligence, vol. 274, Z. W. Ras and A. Wieczorkowska, Eds., Springer, pp. 213-236.

Available Separation Software
- ICA Central: early software restricted to mixtures of two sources. http://www.tsi.enst.fr/icacentral/algos.html
- SiSEC Reference Software: linear modeling-based software for panned or recorded mixtures. http://sisec2008.wiki.irisa.fr/tiki-index.php?page=Underdetermined+speech+and+music+mixtures
- QUAERO Source Separation Toolkit: modular variance modeling-based software implementing a range of structures (GMM, NMF, source-filter model, harmonicity, diffuse mixing, etc.). To be released Fall 2010: watch the music-ir list for an announcement!
- Harmonic Percussive Sound Separation (HPSS): http://www.hil.t.u-tokyo.ac.jp/software/HPSS/
Advertisement: LVA/ICA 2010
LVA/ICA 2010 will be held in St. Malo, France on September 27-30, 2010. More than 20 papers on music and audio source separation will be presented.

References

Singer/Instrument Identification
- H. Fujihara, T. Kitahara, M. Goto, K. Komatani, T. Ogata and H. Okuno, "Singer Identification Based on Accompaniment Sound Reduction and Reliable Frame Selection," Proc. ISMIR, 2005.
- M. Goto, "A real-time music-scene description system: predominant-F0 estimation," Speech Communication, vol. 43, no. 4, pp. 311-329, 2004.
- A. Mesaros, T. Virtanen and A. Klapuri, "Singer identification in polyphonic music using vocal separation and pattern recognition methods," Proc. ISMIR, pp. 375-378, 2007.
- M. Ryynanen and A. Klapuri, "Transcription of the Singing Melody in Polyphonic Music," Proc. ISMIR, 2006.
- T. Kitahara, M. Goto, K. Komatani, T. Ogata and H. G. Okuno, "Instrument identification in polyphonic music: feature weighting to minimize influence of sound overlaps," EURASIP Journal on Applied Signal Processing, vol. 2007, article ID 51979, 2007.

Audio Tempo Estimation
- M. Alonso, G. Richard and B. David, "Accurate tempo estimation based on harmonic + noise decomposition," EURASIP Journal on Advances in Signal Processing, vol. 2007, article ID 82795, 2007.
- P. Chordia and A. Rae, "Using Source Separation to Improve Tempo Detection," Proc. ISMIR, pp. 183-188, 2009.

Related Work on H/P Separation
- C. Uhle, C. Dittmar and T. Sporer, "Extraction of drum tracks from polyphonic music using independent subspace analysis," Proc. ICA, pp. 843-847, 2003.
- M. Helen and T. Virtanen, "Separation of drums from polyphonic music using non-negative matrix factorization and support vector machine," Proc. EUSIPCO, 2005.
- L. Daudet, "A Review on Techniques for the Extraction of Transients in Musical Signals," Proc. CMMR, pp. 219-232, 2005.
- O. Dikmen and A. T. Cemgil, "Unsupervised Single-channel Source Separation Using Bayesian NMF," Proc. WASPAA, pp. 93-96, 2009.

Harmonic/Percussive Sound Separation
- K. Miyamoto, H. Kameoka, N. Ono and S. Sagayama, "Separation of Harmonic and Non-Harmonic Sounds Based on Anisotropy in Spectrogram," Proc. ASJ, pp. 903-904, 2008 (in Japanese).
- N. Ono, K. Miyamoto, J. Le Roux, H. Kameoka and S. Sagayama, "Separation of a Monaural Audio Signal into Harmonic/Percussive Components by Complementary Diffusion on Spectrogram," Proc. EUSIPCO, 2008.
- N. Ono, K. Miyamoto, J. Le Roux, H. Kameoka and S. Sagayama, "A Real-time Equalizer of Harmonic and Percussive Components in Music Signals," Proc. ISMIR, pp. 139-144, 2008.
- N. Ono, K. Miyamoto, H. Kameoka, J. Le Roux, Y. Uchiyama, E. Tsunoo, T. Nishimoto and S. Sagayama, "Harmonic and Percussive Sound Separation and its Application to MIR-related Tasks," in Advances in Music Information Retrieval, ser. Studies in Computational Intelligence, vol. 274, Z. W. Ras and A. Wieczorkowska, Eds., Springer, pp. 213-236, Feb. 2010.

Applications of HPSS to MIR Tasks
- Y. Ueda, Y. Uchiyama, T. Nishimoto, N. Ono and S. Sagayama, "HMM-Based Approach for Automatic Chord Detection Using Refined Acoustic Features," Proc. ICASSP, pp. 5518-5521, 2010.
- J. Reed, Y. Ueda, S. M. Siniscalchi, Y. Uchiyama, S. Sagayama and C.-H. Lee, "Minimum Classification Error Training to Improve Isolated Chord Recognition," Proc. ISMIR, pp. 609-614, 2009.
- H. Tachibana, T. Ono, N. Ono and S. Sagayama, "Melody Line Estimation in Homophonic Music Audio Signals Based on Temporal-Variability of Melodic Source," Proc. ICASSP, pp. 425-428, 2010.
- H. Rump, S. Miyabe, E. Tsunoo, N. Ono and S. Sagayama, "On the Feature Extraction of Timbral Dynamics," Proc. ISMIR, 2010.
- E. Tsunoo, N. Ono and S. Sagayama, "Rhythm Map: Extraction of Unit Rhythmic Patterns and Analysis of Rhythmic Structure from Music Acoustic Signals," Proc. ICASSP, pp. 185-188, 2009.
- E. Tsunoo, G. Tzanetakis, N. Ono and S. Sagayama, "Audio Genre Classification Using Percussive Pattern Clustering Combined with Timbral Features," Proc. ICME, pp. 382-385, 2009.
- E. Tsunoo, N. Ono and S. Sagayama, "Musical Bass-Line Pattern Clustering and Its Application to Audio Genre Classification," Proc. ISMIR, pp. 219-224, 2009.
- E. Tsunoo, T. Akase, N. Ono and S. Sagayama, "Music Mood Classification by Rhythm and Bass-line Unit Pattern Analysis," Proc. ICASSP, pp. 265-268, 2010.