Content-based Audio Music Retrieval

Content-based Audio Music Retrieval Hsin-Min Wang [email protected]

Speech, Language and Music Processing Laboratory Institute of Information Science, Academia Sinica, Taipei, Taiwan http://slam.iis.sinica.edu.tw

Music Information Retrieval (MIR)

Users need to find the "right" songs for:
- a specific listening context (driving, studying, exercising)
- a specific mood (sad, happy, angry)
- a specific event (wedding, party)
- accompanying a video (home video)

Current solutions:
- Manual browsing or selection
- Keyword search (artist, title, lyrics)
- Social recommendation
- Content-based retrieval (query-by-singing/humming, fingerprinting)

Outline

- Retrieving music by singer (query-by-example)
- Retrieving music by melody (query-by-singing/humming)
- Retrieving music by melody (query-by-example)
  - Cover song retrieval
- Retrieving music by social tags (query-by-tag)
- Retrieving music by emotion (query-by-emotion)


Retrieving Music by Singer

- Retrieving music data performed by the singer of an example audio query (query-by-example): "Find me all the songs performed by the singer of this audio query."
- Users can be recommended songs performed by their favorite singer, or by singers with similar vocal characteristics.

Singer-based MIR System

[Diagram: singer-based retrieval pipeline.] Each music document X_i (1 ≤ i ≤ M) passes through vocal/non-vocal segmentation and solo voice modeling, producing a solo voice model λ_{s,i}. The example music query Y is likewise segmented into vocal segments Y_V and background segments Y_B; the background segments are used to train a background music model λ_{b,Y}. The documents are then ranked by the likelihood

    L(Y, X_i) = log p(Y_V | λ_{s,i}, λ_{b,Y}),  1 ≤ i ≤ M,

and a ranked list is returned.
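The solo voice modeling and likelihood ranking above can be sketched in Python. This is a minimal illustration rather than the paper's implementation: it assumes GMM solo voice models over pre-segmented vocal frames, omits the background music model λ_{b,Y}, and uses toy feature data; the function names and scikit-learn's GaussianMixture are my choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_solo_voice_model(vocal_frames, n_components=4, seed=0):
    # lambda_{s,i}: a GMM over the vocal frames of document X_i
    # (vocal_frames is assumed to come from a vocal/non-vocal segmenter)
    return GaussianMixture(n_components=n_components, random_state=seed).fit(vocal_frames)

def rank_by_singer(query_vocal_frames, singer_models):
    # score each document by the average log-likelihood of the query's
    # vocal frames under its solo voice model; higher means more similar
    scores = {doc_id: gmm.score(query_vocal_frames)
              for doc_id, gmm in singer_models.items()}
    return sorted(scores, key=scores.get, reverse=True)

# toy example: two "singers" with well-separated feature distributions
rng = np.random.default_rng(0)
singer_a = rng.normal(0.0, 1.0, size=(300, 5))
singer_b = rng.normal(5.0, 1.0, size=(300, 5))
models = {"doc_a": train_solo_voice_model(singer_a),
          "doc_b": train_solo_voice_model(singer_b)}
query = rng.normal(0.0, 1.0, size=(100, 5))   # sounds like singer A
print(rank_by_singer(query, models))           # doc_a should rank first
```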

Copyright Protection

- Singer verification: examining whether an unknown music file contains the voice of a particular singer, e.g., deciding whether an unknown file crawled from the Internet by a spider contains Norah Jones's latest song.

Singer Verification

[Diagram: singer verification system with a training phase and a testing phase.] In the training phase, music data for hypothesis H0 and for hypothesis H1 each undergo vocal/non-vocal segmentation and solo voice modeling. In the testing phase, a test music recording X is scored with a likelihood ratio:

    H0: X is performed by the target singer;
    H1: X is not performed by the target singer.

    Decide H0 (Yes) if p(X | H0) / p(X | H1) ≥ τ, and H1 (No) otherwise,

where τ is the decision threshold.


Query-by-Singing/Humming System

[Diagram: system with an indexing phase and a searching phase.] Indexing phase: for each karaoke song (Song 1, Song 2, ...), main melody extraction and phrase onset detection produce an indexed document. Searching phase: the sung query undergoes background accompaniment reduction, end-point detection, and note sequence generation and smoothing; the query's note sequence is then aligned and compared against each document's note sequence starting from the detected phrase onsets, and similarity computation and decision return the relevant song.

Melody Similarity Comparison: Dynamic Time Warping

Let q = q_1 q_2 ... q_T be the query's note sequence and u = u_1 u_2 ... u_L be the document's note sequence. DTW constructs a T × L distance matrix D = [D(t,l)]_{T×L}, with local distance d(t,l) = |q_t − u_l| and recurrence

    D(t,l) = min{ D(t−2, l−1) + 2·d(t,l),
                  D(t−1, l−1) + d(t,l),
                  D(t−1, l−2) + d(t,l) }.

Boundary conditions (with an allowed starting offset of r notes):

    D(1,1) = d(1,1);
    D(t,1) = ∞, 2 ≤ t ≤ T;    D(t,2) = ∞, 4 ≤ t ≤ T;
    D(1,l) = d(1,l), 1 ≤ l ≤ r;    D(1,l) = ∞, r < l ≤ L;
    D(2,l) = d(1,l−1) + d(2,l), 2 ≤ l ≤ r+1;    D(2,l) = ∞, r+1 < l ≤ L;
    D(3,2) = d(1,1) + 2·d(3,2).

Similarity between q and u:

    S(q,u) = min_{T/2 ≤ l ≤ L} D(T,l).

Complexity: O(T²).


Example Cover Song Pairs

    Type of within-pair difference    No. of pairs
    L                                 8
    L+S                               7
    L+T                               3
    L+S+T                             7
    L+T+N                             6
    L+S+T+N                           4
    L+A+T+N                           2
    L+S+A+T+N                         10

    L: Language (Mandarin/English/Japanese), S: Singer, A: Principal Accompaniments, T: Tempo, N: Non-vocal Melodies

Melody-based Cover Song Retrieval System

[Diagram: indexing and searching phases.] Indexing phase: each song in the collection (Song 1 ... Song M) passes through non-vocal removal and main melody extraction, yielding note sequences 1 ... M. Searching phase: main melody extraction on the audio query yields the query's note sequence, which is compared against the indexed note sequences; similarity computation and ranking produce a ranked list.


Social Tags for Music

- Music tags describe different aspects of a music clip, e.g., genre, mood, instrumentation, and users' preferences.

Music Tag Annotation and Retrieval

- Annotating music clips with tags: a music clip is annotated by running one predictor per tag, producing a score for each tag (e.g., Female, R&B, Guitar, Metal, Bass).
- Retrieving music clips with a tag query: for a query such as "Rock", music clips are ranked by the scores of the Rock predictor, from high relevance to low relevance.
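The one-predictor-per-tag scheme can be sketched as follows. The choice of logistic regression, the toy clip features, and the names are my assumptions for illustration; the actual system trains its own per-tag models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# toy clip-level features: "rock" clips cluster away from the rest
rock = rng.normal(2.0, 1.0, size=(50, 8))
other = rng.normal(-2.0, 1.0, size=(50, 8))
X = np.vstack([rock, other])
y = np.array([1] * 50 + [0] * 50)        # binary relevance for the tag "rock"

# one predictor per tag; here only the "rock" predictor is trained
predictors = {"rock": LogisticRegression(max_iter=1000).fit(X, y)}

def retrieve(tag, clips, clip_ids):
    # rank clips by the tag predictor's score (probability of the tag)
    scores = predictors[tag].predict_proba(clips)[:, 1]
    return [clip_ids[i] for i in np.argsort(-scores)]

test_clips = np.vstack([rng.normal(-2, 1, (1, 8)), rng.normal(2, 1, (1, 8))])
print(retrieve("rock", test_clips, ["clip_0", "clip_1"]))  # clip_1 ranks first
```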

MTML Query Interface – Coloring Tags in Tag Cloud

- Online demo: http://slam.iis.sinica.edu.tw/demo/SoTags/

Music Tagging System

[Diagram: music tagging system; the tag probability p(w_m | s) for song s is computed as a weighted sum over the components of a codebook model.]

"Playing with Tagging" Music Player

- Visual effects in current music players are usually generated directly from audio signal processing and often render meaningless or incomprehensible displays.
- Our playing-with-tagging player instead shows the dynamic tag distribution during music playback.


Emotion-based MIR Systems

- Music retrieval and organization by emotion is intuitive: music is created to convey and modulate emotions.
- Music Emotion Recognition (MER)
- Example systems: the Mufin Player; Mr. Emo, developed by Yang et al.

Emotion as Categories

- Music emotion recognition can be cast as a classification problem (support vector machines, hidden Markov models, etc.).
- Five mood categories are used in the MIREX Audio Mood Classification task:
  - Cluster_1: passionate, rousing, confident, boisterous, rowdy
  - Cluster_2: rollicking, cheerful, fun, sweet, amiable/good natured
  - Cluster_3: literate, poignant, wistful, bittersweet, autumnal, brooding
  - Cluster_4: humorous, silly, campy, quirky, whimsical, witty, wry
  - Cluster_5: aggressive, fiery, tense/anxious, intense, volatile, visceral
- There is ongoing debate on categorical emotion models.

The Valence-Arousal Model

- Emotions are represented as numerical values (instead of discrete labels) along the valence and arousal dimensions.
- Good for visualization and intuitive.
- Easy to capture the temporal change of emotion.
- Activation (arousal): energy or neurophysiological stimulation level.
- Evaluation (valence): pleasantness; positive vs. negative affective states.

[Figure: valence-arousal circumplex chart.]

Valence-Arousal Annotations

- Emotion is subjective, but annotations do aggregate consistently across users.
- The dimensional emotion of a song can therefore be described by a bivariate Gaussian distribution over valence and arousal.
- The goal is to predict the emotion of a song as a single Gaussian.
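Summarizing a song's per-user VA annotations as a bivariate Gaussian is a straightforward maximum-likelihood fit; a small sketch with made-up annotation values:

```python
import numpy as np

def va_gaussian(annotations):
    """Summarize per-user (valence, arousal) annotations of one song as a
    bivariate Gaussian: a 2-dim mean vector and a 2x2 covariance matrix."""
    va = np.asarray(annotations, dtype=float)   # shape (n_users, 2)
    mean = va.mean(axis=0)
    cov = np.cov(va, rowvar=False)              # unbiased 2x2 covariance
    return mean, cov

# toy annotations from five users for a "happy" song: high valence, high arousal
ann = [(0.7, 0.6), (0.8, 0.5), (0.6, 0.7), (0.75, 0.65), (0.65, 0.55)]
mean, cov = va_gaussian(ann)
print(mean)   # [0.7, 0.6]
```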

Regression for Gaussian Parameters

- The regression method directly learns five regression models to predict the mean, variance, and covariance of valence and arousal: the five regressors separately predict m_Val, m_Aro, s_Val-Val, s_Val-Aro, and s_Aro-Aro from frame-based feature vectors extracted from the audio signal.
- There is no joint modeling or estimation of the Gaussian parameters.
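The five-independent-regressors scheme can be sketched like this; ridge regression and the synthetic data are my assumptions. Because the parameters are predicted independently, the assembled covariance matrix is not guaranteed to be positive definite, which is exactly the drawback of the non-joint approach.

```python
import numpy as np
from sklearn.linear_model import Ridge

TARGETS = ["m_val", "m_aro", "s_val_val", "s_val_aro", "s_aro_aro"]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # song-level audio features (toy)
Y = X @ rng.normal(size=(10, 5)) + 0.01 * rng.normal(size=(200, 5))

# one independent regressor per Gaussian parameter (no joint estimation)
regressors = {name: Ridge().fit(X, Y[:, i]) for i, name in enumerate(TARGETS)}

def predict_gaussian(x):
    # assemble the predicted VA Gaussian from the five separate outputs
    p = {name: float(r.predict(x[None, :])[0]) for name, r in regressors.items()}
    mean = np.array([p["m_val"], p["m_aro"]])
    cov = np.array([[p["s_val_val"], p["s_val_aro"]],
                    [p["s_val_aro"], p["s_aro_aro"]]])
    return mean, cov

mean, cov = predict_gaussian(X[0])
```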

The Acoustic Emotion Gaussians Model

- A probabilistic approach:
  - Represent the acoustic features of a song by a probabilistic histogram vector (the acoustic GMM posterior representation of the song).
  - Learn a model that captures the relationship between acoustic features and annotations in the VA space.

Music Emotion Recognition (MER)

Given the acoustic GMM posterior of a test song, i.e., the vector (θ_1, θ_2, ..., θ_K) obtained by mapping its feature vectors onto the learned acoustic GMM, predict its emotion as a single VA Gaussian.
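Computing the acoustic GMM posterior of a song (the vector θ_1 ... θ_K) can be sketched as averaging component posteriors over the song's frames; the background GMM here is trained on random toy data, and the dimensions are arbitrary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
corpus_frames = rng.normal(size=(1000, 6))     # frames pooled over many songs
acoustic_gmm = GaussianMixture(n_components=8, random_state=0).fit(corpus_frames)

def gmm_posterior(song_frames):
    # theta_k = average posterior probability of component k over the song's
    # frames: a K-dim probabilistic "histogram" describing the song
    return acoustic_gmm.predict_proba(song_frames).mean(axis=0)

theta = gmm_posterior(rng.normal(size=(120, 6)))
print(theta.sum())   # 1.0: a proper distribution over the K components
```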

Automatic Generation of Music Video

Matched audio and video features:

    Audio                      Video
    Sound energy               Lighting key
    Tempo and beat strength    Shot change rate
    Rhythm regularity          Motion intensity
    Pitch                      Color (saturation, color energy)
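How the paired audio and video feature trajectories are actually matched is not specified on this slide; purely as an illustration, one could score a candidate pairing by correlating the paired trajectories over time (all names and data below are hypothetical).

```python
import numpy as np

def zscore(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + 1e-9)

def match_score(audio_feats, video_feats):
    """Average correlation between paired audio/video feature trajectories,
    e.g. sound energy vs. lighting key, beat strength vs. shot change rate."""
    corrs = [float(np.corrcoef(zscore(a), zscore(v))[0, 1])
             for a, v in zip(audio_feats, video_feats)]
    return sum(corrs) / len(corrs)

# toy trajectories over 8 time segments
energy = [0.1, 0.5, 0.9, 0.4, 0.2, 0.8, 0.6, 0.3]
lighting_similar = [e + 0.1 for e in energy]    # tracks the energy curve
lighting_opposite = [1 - e for e in energy]     # moves against it

print(match_score([energy], [lighting_similar]) >
      match_score([energy], [lighting_opposite]))   # True
```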

Demonstration I: Video Retrieves Audio

- Audio: Radiohead - No Surprises, https://www.youtube.com/watch?v=u5CVsCnxyXg
- Video: Sigur Ros - Von (Heima), http://www.youtube.com/watch?v=hme5jf2Z_ow

Demonstration II: Audio Retrieves Video

- Audio: Michael Franti & Spearhead - Say Hey (I Love You), https://www.youtube.com/watch?v=ehu3wy4WkHs
- Video: Of Montreal - Wraith Pinned to the Mist and Other Things, https://www.youtube.com/watch?v=7PoJv4N1Too

Automatic Generation of Music Video Considering Temporal Emotion Flow (ACM MM 2015)

References

- W. H. Tsai and H. M. Wang, "Automatic Singer Recognition of Popular Music Recordings via Estimation and Modeling of Solo Vocal Signals," IEEE Trans. on Audio, Speech, and Language Processing, 14(1), pp. 330-341, January 2006.
- H. M. Yu, W. H. Tsai, and H. M. Wang, "A Query-by-Singing System for Retrieving Karaoke Music," IEEE Trans. on Multimedia, 10(8), pp. 1626-1637, December 2008.
- W. H. Tsai, H. M. Yu, and H. M. Wang, "Using the Similarity of Main Melodies to Identify Cover Versions of Popular Songs for Music Document Retrieval," Journal of Information Science and Engineering, 24(6), pp. 1669-1687, November 2008.
- J. C. Wang, Y. C. Shih, M. S. Wu, H. M. Wang, and S. K. Jeng, "Colorizing Tags in Tag Cloud: A Novel Query-by-Tag Music Search System," in Proceedings of ACM MM 2011, pp. 293-302, November 2011.
- J. C. Wang, Y. H. Yang, H. M. Wang, and S. K. Jeng, "The Acoustic Emotion Gaussians Model for Emotion-based Music Annotation and Retrieval," in Proceedings of ACM MM 2012, pp. 89-98, October 2012.
- J. C. Wang, Y. H. Yang, H. M. Wang, and S. K. Jeng, "Modeling the Affective Content of Music with a Gaussian Mixture Model," IEEE Transactions on Affective Computing, 6(1), pp. 56-68, March 2015.

More papers are available at http://slam.iis.sinica.edu.tw/paper.htm

Thank You!
