Content-based Audio Music Retrieval

Hsin-Min Wang
[email protected]
Speech, Language and Music Processing Laboratory
Institute of Information Science, Academia Sinica, Taipei, Taiwan
http://slam.iis.sinica.edu.tw
Music Information Retrieval (MIR)

Users need to find the "right" songs for
- a specific listening context (driving, studying, exercising)
- a specific mood (sad, happy, angry)
- a specific event (wedding, party)
- accompanying a video (home video)

Current solutions
- Manual browsing or selection
- Keyword search (artist, title, lyrics)
- Social recommendation
- Content-based retrieval (query-by-singing/humming, fingerprinting)
Outline

- Retrieving music by singer (query-by-example)
- Retrieving music by melody (query-by-singing/humming)
- Retrieving music by melody (query-by-example): cover song retrieval
- Retrieving music by social tags (query-by-tag)
- Retrieving music by emotion (query-by-emotion)
Retrieving Music by Singer

- Retrieving music performed by the singer of an example audio query (query-by-example): "Find me all the songs performed by the singer of this audio query."
- Users can be recommended songs performed by their favorite singer, or by singers with similar vocal characteristics.
Singer-based MIR System

Indexing phase: each music document X_i (1 ≤ i ≤ M) undergoes vocal/non-vocal segmentation followed by solo voice modeling, yielding a singer voice model λ_{s,i}.

Searching phase: the example music query Y is segmented into vocal segments Y_V and background segments Y_B; the background segments are used to train a background music model λ_{b,Y}. Each document is then scored by

  L(Y, X_i) = log p(Y_V | λ_{s,i}, λ_{b,Y}),  1 ≤ i ≤ M,

and the documents are returned as a ranked list.
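The likelihood computation and ranking step can be sketched as follows. This is a minimal illustration assuming diagonal-covariance GMM singer voice models; the model tuple layout is hypothetical, and the background model λ_{b,Y} is omitted for brevity (scoring uses the singer voice models only).

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Average log-likelihood of feature frames under a diagonal-covariance GMM.
    frames: (N, D); weights: (K,); means, variances: (K, D)."""
    # squared Mahalanobis distance per frame/component pair
    diff2 = (frames[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)      # (K,)
    comp_ll = log_norm[None, :] - 0.5 * diff2.sum(axis=2)            # (N, K)
    # weighted log-sum-exp over components
    m = comp_ll.max(axis=1, keepdims=True)
    ll = m.squeeze(1) + np.log((np.exp(comp_ll - m) * weights[None, :]).sum(axis=1))
    return ll.mean()

def rank_by_singer(query_vocal_frames, singer_models):
    """Rank documents by log p(Y_V | lambda_{s,i}); singer_models is a list of
    (name, weights, means, variances) tuples (a hypothetical structure)."""
    scores = [(name, gmm_loglik(query_vocal_frames, w, mu, var))
              for name, w, mu, var in singer_models]
    return sorted(scores, key=lambda x: x[1], reverse=True)
```

In practice the vocal frames Y_V would come from the vocal/non-vocal segmenter, and each singer model would be trained on that singer's solo voice segments.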
Copyright Protection

Singer verification: examining whether an unknown music file contains the voice of a particular singer, e.g., whether a file found on the Internet by a spider contains Norah Jones's latest song.
Singer Verification

Training phase: music data for the target singer (H_0) and for other singers (H_1) each undergo vocal/non-vocal segmentation and solo voice modeling.

Testing phase: a test music recording X is scored against both models, and the likelihood ratio

  p(X | H_0) / p(X | H_1)

is compared against a decision threshold, yielding H_0 (Yes): X is performed by the target singer, or H_1 (No): X is not performed by the target singer.
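A minimal sketch of the decision rule, assuming the two hypothesis scores arrive as log-likelihoods (working in the log domain, so the ratio becomes a difference); the threshold value is a tunable assumption, not a value from the slides:

```python
def verify_singer(loglik_h0, loglik_h1, threshold=0.0):
    """Log-likelihood-ratio test for singer verification: accept H0
    (the recording X is performed by the target singer) when
    log p(X | H0) - log p(X | H1) exceeds the decision threshold."""
    return (loglik_h0 - loglik_h1) > threshold
```

The threshold trades off false acceptances against false rejections and would normally be tuned on held-out data.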
Query-by-Singing/Humming System

Indexing phase: each karaoke song undergoes main melody extraction, note sequence generation and smoothing, and phrase onset detection, producing an indexed document.

Searching phase: the sung query undergoes background accompaniment reduction, end-point detection, main melody extraction, and note sequence generation and smoothing. The query's note sequence is aligned against the phrase onsets of each document and compared; similarity computation and decision then return the relevant song.
Melody Similarity Comparison

Dynamic Time Warping (DTW) compares the query's note sequence q = q_1 q_2 q_3 ... q_t ... q_T against a document's note sequence u = u_1 u_2 u_3 ... u_l ... u_L by constructing a T x L distance matrix D = [D(t,l)]_{T x L}:

  D(t,l) = min{ D(t-2, l-1) + 2 d(t,l),
                D(t-1, l-1) + d(t,l),
                D(t-1, l-2) + d(t,l) },   d(t,l) = |q_t - u_l|.

Boundary conditions (r is the number of document notes on which the query may start):

  D(1,1) = d(1,1)
  D(t,1) = ∞, 2 ≤ t ≤ T
  D(t,2) = ∞, 4 ≤ t ≤ T
  D(1,l) = d(1,l), 1 ≤ l ≤ r;  D(1,l) = ∞, r < l ≤ L
  D(2,l) = d(1,l-1) + d(2,l), 2 ≤ l ≤ r+1;  D(2,l) = ∞, r+1 < l ≤ L
  D(3,2) = d(1,1) + 2 d(3,2)

Similarity between q and u:

  S(q,u) = min_{T/2 ≤ l ≤ L} D(T,l)

Complexity: O(T^2)
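The recursion and boundary conditions above can be sketched in Python. This is a minimal illustration; the default value of r (how many document notes the query may start on) is an assumption chosen for demonstration:

```python
import numpy as np

def melody_dtw_similarity(q, u, r=3):
    """DTW distance between a query note sequence q (length T) and a
    document note sequence u (length L), following the slide's recursion.
    Lower values mean higher melodic similarity."""
    T, L = len(q), len(u)
    INF = float("inf")
    d = np.abs(np.subtract.outer(q, u))      # d[t, l] = |q_t - u_l| (0-indexed)
    D = np.full((T, L), INF)
    # boundary conditions (0-indexed versions of the 1-indexed slide formulas)
    D[0, :r] = d[0, :r]                      # D(1,l) = d(1,l), 1 <= l <= r
    for l in range(1, min(r + 1, L)):
        D[1, l] = d[0, l - 1] + d[1, l]      # D(2,l) = d(1,l-1) + d(2,l)
    if T > 2 and L > 1:
        D[2, 1] = d[0, 0] + 2 * d[2, 1]      # D(3,2) = d(1,1) + 2 d(3,2)
    # main recursion with the slide's three local path constraints
    for t in range(2, T):
        for l in range(2, L):
            D[t, l] = min(D[t - 2, l - 1] + 2 * d[t, l],
                          D[t - 1, l - 1] + d[t, l],
                          D[t - 1, l - 2] + d[t, l])
    # S(q,u): best end point in the last query row, l >= T/2
    start = max(T // 2 - 1, 0)
    return D[T - 1, start:].min()
```

An identical melody yields distance 0; transposition-invariant variants would first mean-normalize the note sequences.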
Example Cover Song Pairs

  Type of within-pair difference   No. of pairs
  L                                 8
  L+S                               7
  L+T                               3
  L+S+T                             7
  L+T+N                             6
  L+S+T+N                           4
  L+A+T+N                           2
  L+S+A+T+N                        10

L: Language (Mandarin/English/Japanese), S: Singer, A: Principal Accompaniments, T: Tempo, N: Non-vocal Melodies
Melody-based Cover Song Retrieval System

Indexing phase: each song (Song 1, ..., Song M) undergoes non-vocal removal and main melody extraction, producing note sequences 1 through M.

Searching phase: the audio query undergoes main melody extraction to produce its note sequence, which is compared against the indexed note sequences; similarity computation and ranking return a ranked list.
Social Tags for Music

Music tags describe different aspects of a music clip, e.g., genre, mood, instrumentation, and users' preferences.
Music Tag Annotation and Retrieval

Annotation: a music clip is annotated by running one predictor per tag (e.g., Female, R&B, Guitar, Metal, Bass), producing a score for each tag.

Retrieval: given a tag query (e.g., "Rock"), music clips are ranked by the scores of the Rock predictor, from high relevance to low relevance, yielding a ranked list for the query.
MTML Query Interface: Coloring Tags in the Tag Cloud

Online demo: http://slam.iis.sinica.edu.tw/demo/SoTags/
Music Tagging System

The probability that tag w_m applies to song s is a mixture over K acoustic codewords (reconstructed from the garbled slide formula):

  p(w_m | s) = Σ_k β_km θ_k

where θ_k is the song's posterior weight for codeword k and β_km is the probability of tag w_m given codeword k.
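Under that mixture formula, tag scoring and query-by-tag ranking reduce to a matrix product followed by a sort. The names theta and beta below are hypothetical stand-ins for the learned codeword posterior and per-codeword tag probabilities:

```python
import numpy as np

def tag_scores(theta, beta):
    """p(w_m | s) = sum_k beta[k, m] * theta[k]: a song's tag scores from its
    codeword posterior theta (K,) and per-codeword tag probabilities beta (K, M)."""
    return theta @ beta

def rank_by_tag(thetas, beta, tag_index):
    """Query-by-tag: rank songs (given as a list of theta vectors) by the
    score of a single tag predictor, best match first."""
    scores = np.stack([tag_scores(t, beta) for t in thetas])[:, tag_index]
    return np.argsort(-scores)
```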
"Playing with Tagging" Music Player

- Visual effects in current music players are usually generated directly from the audio signal and often render meaningless or incomprehensible displays.
- Our playing-with-tagging player instead shows the dynamic tag distribution during music playback.
Emotion-based MIR Systems

- Music retrieval and organization by emotion is intuitive: music is created to convey and modulate emotions.
- Music Emotion Recognition (MER)
- Examples: the Mufin Player, and Mr. Emo developed by Yang et al.
Emotion as Categories

Music emotion recognition can be cast as a classification problem (support vector machines, hidden Markov models, etc.).

The 5 mood categories used in the MIREX Audio Mood Classification task:
- Cluster_1: passionate, rousing, confident, boisterous, rowdy
- Cluster_2: rollicking, cheerful, fun, sweet, amiable/good natured
- Cluster_3: literate, poignant, wistful, bittersweet, autumnal, brooding
- Cluster_4: humorous, silly, campy, quirky, whimsical, witty, wry
- Cluster_5: aggressive, fiery, tense/anxious, intense, volatile, visceral

There is ongoing debate on categorical models of emotion.
The Valence-Arousal Model

- Emotions are represented as numerical values (instead of discrete labels) along the valence and arousal dimensions.
- Good for visualization and intuitive; easy to capture the temporal change of emotion.
- Activation (arousal): energy or neurophysiological stimulation level.
- Evaluation (valence): pleasantness; positive vs. negative affective states.

[Figure: valence-arousal circumplex chart]
Valence-Arousal Annotations

- Emotion is subjective, but consistent aggregate patterns across users' annotations do exist.
- The dimensional emotion of a song can therefore be described by a bivariate Gaussian distribution over valence and arousal.
- The task: predict the emotion of a song as a single Gaussian.
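Fitting the per-song bivariate Gaussian from listener annotations is just a sample mean and covariance; a minimal sketch (the annotation format, a list of per-listener (valence, arousal) pairs, is an assumption):

```python
import numpy as np

def emotion_gaussian(annotations):
    """Fit a bivariate Gaussian to per-listener (valence, arousal) annotations
    of one song: returns the mean vector (2,) and covariance matrix (2, 2)."""
    va = np.asarray(annotations, dtype=float)   # shape (num_listeners, 2)
    mean = va.mean(axis=0)
    cov = np.cov(va, rowvar=False)
    return mean, cov
```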
Regression for Gaussian Parameters

The regression method directly learns five regression models from frame-based feature vectors of the audio signal, one per parameter of the bivariate Gaussian: the two means (m_Val, m_Aro), the two variances (s_Val-Val, s_Aro-Aro), and the covariance (s_Val-Aro).

There is no joint modeling and estimation of the Gaussian parameters.
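The five-regressor scheme can be sketched as follows, using plain ridge-regularized least squares as a stand-in for whatever regression method is actually employed; the function names and data layout are illustrative assumptions:

```python
import numpy as np

def fit_linear_regressor(X, y, ridge=1e-3):
    """Least-squares linear regressor with a small ridge term."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])        # append bias column
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def fit_gaussian_param_regressors(X, params):
    """Train five independent regressors, one per Gaussian parameter
    (m_Val, m_Aro, s_Val-Val, s_Val-Aro, s_Aro-Aro); params has shape (N, 5)."""
    return [fit_linear_regressor(X, params[:, j]) for j in range(5)]

def predict_gaussian_params(regressors, x):
    """Predict the five Gaussian parameters for one song-level feature vector."""
    xb = np.append(x, 1.0)
    return np.array([xb @ w for w in regressors])
```

Because each parameter is predicted independently, nothing constrains the predicted covariance matrix to be positive definite, which is exactly the "no joint modeling" limitation the slide points out.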
The Acoustic Emotion Gaussians Model

A probabilistic approach:
- Represent the acoustic features of a song by a probabilistic histogram vector (the acoustic GMM posterior representation of the song).
- Develop a model that captures the relationship between acoustic features and annotations in the VA space.
Music Emotion Recognition (MER)

Given the acoustic GMM posterior (θ_1, θ_2, ..., θ_{K-1}, θ_K) of a test song, computed from its frame-based feature vectors under the acoustic GMM, predict the emotion as a single VA Gaussian via the learned VA GMM.
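The acoustic GMM posterior representation can be sketched as follows, assuming a diagonal-covariance acoustic GMM (the parameter layout is a hypothetical illustration): each frame's component responsibilities are averaged over the song to give the histogram vector theta.

```python
import numpy as np

def acoustic_gmm_posterior(frames, weights, means, variances):
    """Posterior histogram theta_k = mean over frames of p(k | x_n) under a
    diagonal-covariance acoustic GMM; the song-level representation used for MER.
    frames: (N, D); weights: (K,); means, variances: (K, D)."""
    diff2 = (frames[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    log_comp = (np.log(weights)[None, :]
                - 0.5 * np.log(2 * np.pi * variances).sum(axis=1)[None, :]
                - 0.5 * diff2.sum(axis=2))               # (N, K)
    log_comp -= log_comp.max(axis=1, keepdims=True)      # stabilize exp
    resp = np.exp(log_comp)
    resp /= resp.sum(axis=1, keepdims=True)              # p(k | x_n)
    return resp.mean(axis=0)                             # theta, shape (K,)
```

The resulting theta sums to 1, so it can be treated as a probabilistic histogram over the acoustic codewords.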
Automatic Generation of Music Video

Matched audio and video features:

  Audio                     Video
  Sound energy              Lighting key
  Tempo and beat strength   Shot change rate
  Rhythm regularity         Motion intensity
  Pitch                     Color (saturation, color energy)
Demonstration I: Video Retrieves Audio

Audio: Radiohead - No Surprises  https://www.youtube.com/watch?v=u5CVsCnxyXg
Video: Sigur Ros - Von (Heima)  http://www.youtube.com/watch?v=hme5jf2Z_ow

Demonstration II: Audio Retrieves Video

Audio: Michael Franti & Spearhead - Say Hey (I Love You)  https://www.youtube.com/watch?v=ehu3wy4WkHs
Video: Of Montreal - Wraith Pinned to the Mist and Other Things  https://www.youtube.com/watch?v=7PoJv4N1Too
Automatic Generation of Music Video Considering Temporal Emotion Flow (ACM MM 2015)
References
W. H. Tsai and H. M. Wang, "Automatic Singer Recognition of Popular Music Recordings via Estimation and Modeling of Solo Vocal Signals," IEEE Trans. on Audio, Speech, and Language Processing, 14(1), pp. 330-341, January 2006.
H. M. Yu, W. H. Tsai, and H. M. Wang, "A Query-by-Singing System for Retrieving Karaoke Music," IEEE Trans. on Multimedia, 10(8), pp. 1626-1637, December 2008.
W. H. Tsai, H. M. Yu, and H. M. Wang, "Using the Similarity of Main Melodies to Identify Cover Versions of Popular Songs for Music Document Retrieval," Journal of Information Science and Engineering, 24(6), pp. 1669-1687, November 2008.
J. C. Wang, Y. C. Shih, M. S. Wu, H. M. Wang, and S. K. Jeng, "Colorizing Tags in Tag Cloud: A Novel Query-by-Tag Music Search System," in Proceedings of ACM MM 2011, pp. 293-302, November 2011.
J. C. Wang, Y. H. Yang, H. M. Wang, and S. K. Jeng, "The Acoustic Emotion Gaussians Model for Emotion-based Music Annotation and Retrieval," in Proceedings of ACM MM 2012, pp. 89-98, October 2012.
J. C. Wang, Y. H. Yang, H. M. Wang, and S. K. Jeng, "Modeling the Affective Content of Music with a Gaussian Mixture Model," IEEE Transactions on Affective Computing, 6(1), pp. 56-68, March 2015.
More papers are available at http://slam.iis.sinica.edu.tw/paper.htm
Thank You!