Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2016” Moscow, June 1–4, 2016

Comparison of Neural Network Architectures for Sentiment Analysis of Russian Tweets

Arkhipenko K. ([email protected])1,2, Kozlov I. ([email protected])1,3, Trofimovich J. ([email protected])1, Skorniakov K. ([email protected])1,3, Gomzin A. ([email protected])1,2, Turdakov D. ([email protected])1,2,4

1 Institute for System Programming of RAS, Moscow, Russia
2 Lomonosov Moscow State University, CMC faculty, Moscow, Russia
3 MIPT, Dolgoprudny, Russia
4 FCS NRU HSE, Moscow, Russia

The paper presents an evaluation of three neural network based approaches to the Twitter sentiment analysis task of SentiRuEval-2016. The task focuses on sentiment classification of tweets about banks and telecommunication companies. Our team submitted three solutions based on different supervised classifiers: a Gated Recurrent Unit neural network (GRU), a convolutional neural network (CNN), and an SVM classifier with domain adaptation combined with the previous two classifiers. We used vector representations of words obtained with the word2vec model as features for the classifiers. The classifiers were trained on labeled data provided by the organizers of the evaluation. Additionally, we collected several million posts and comments from social networks for training the word2vec model. According to the evaluation results, the GRU-based solution shows the best macro-averaged F1-score for both domains (banks and telecommunication companies) and also has the best micro-averaged F1-score for the banks domain among all solutions submitted to SentiRuEval.

Key words: sentiment analysis, opinion mining, recurrent neural network, convolutional neural network





Introduction

The paper describes our participation in the SentiRuEval-2016 competition. The task of the competition focuses on object-oriented sentiment analysis of Russian messages posted by Twitter users. The messages are about banks and telecommunication companies. The goal of the task is detection of sentiment (negative, neutral or positive) with respect to organizations (banks or telecommunication companies) mentioned in a Twitter message. Thus it can be viewed as a three-class classification task.
The organizers of the evaluation provided labeled training datasets along with unlabeled test datasets for both banks and telecommunication companies. Training datasets contain about 9,000 Twitter messages each, while test datasets contain about 19,000 messages each.
In this paper, we focus on detection of the overall sentiment of messages. Object-oriented sentiment classification with the algorithms used in this paper is a part of our further research.
All variants of our sentiment analysis system use supervised machine learning algorithms. One of our main goals is evaluation of artificial neural networks (ANNs) for the sentiment analysis task. In this paper, we evaluate algorithms based on a recurrent neural network (RNN) and a convolutional neural network (CNN) along with a shallow




machine learning approach: SVM with domain adaptation. In each of these three cases we use word2vec (Mikolov et al., 2013a) vectors as features for the algorithms.
We have submitted three solutions to SentiRuEval-2016. The first two are based on a recurrent neural network and a convolutional neural network, respectively. The last solution is an ensemble of three classifiers: it uses SVM with domain adaptation along with the RNN and the CNN.
The paper is organized as follows: Section 1 provides an overview of related work; Section 2 presents a full description of our methods and the features we used; Section 3 provides evaluation results for the different methods; in the final section we draw conclusions.

1. Related work

Artificial neural networks have become very popular in recent years. They have been shown to achieve state-of-the-art results in various NLP tasks, outperforming shallow machine learning algorithms such as support vector machines (SVMs), hidden Markov models and conditional random fields (CRFs).
Recurrent neural networks (RNNs) are considered to be one of the most powerful models for sequence modeling. The success of RNNs in the area of sentence classification has been reported by many researchers (Irsoy & Cardie, 2014) (Adamson & Turan, 2015) (Tang et al., 2015).
Convolutional neural networks (CNNs) are another class of neural networks, initially designed for image processing. However, CNNs have been shown in recent years to perform very well in NLP tasks, including sentiment analysis and sentence modeling (Kalchbrenner et al., 2014) (Kim, 2014) (dos Santos et al., 2014).
It has been shown that neural network based models for NLP become especially powerful when they are pre-trained with some vector space model (Collobert et al., 2011). The most common way to do this is to use distributed representations of words. The most popular such model at the moment is word2vec (Mikolov et al., 2013a), which improves many NLP tasks.

2. Method description

2.1. Word2vec

Word2vec (Mikolov et al., 2013a) (Mikolov et al., 2013b) is a popular model for computationally efficient learning of vector representations of words. Vectors learned using word2vec have been shown to capture semantic relations between words (Mikolov et al., 2013c), and pre-training using word2vec leads to major improvements in many NLP tasks.




We used the original word2vec toolkit (https://code.google.com/archive/p/word2vec/) to obtain vector representations of Russian words. The model was trained on 3.3 GB of user-submitted posts from VK, LiveJournal, echo.msk.ru and svpressa.ru. All the text was lowercased, and punctuation was removed. The following parameters were used for learning:
1. Continuous Bag-of-Words (CBOW) architecture with negative sampling (10 negative samples for every prediction);
2. vector size of 200;
3. maximum context window size of 5;
4. 5 training iterations over the corpus;
5. words occurring in the corpus less than 25 times were discarded from the vocabulary; the resulting vocabulary size was 249,014.
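The model above was trained with the original C toolkit. Purely as an illustration, a roughly equivalent configuration with the gensim library (an assumption on our part, not the authors' pipeline) could look like the sketch below; the corpus file name and the simple tokenizer are placeholders.

```python
# Hypothetical gensim (>= 4.0) counterpart of the word2vec setup described above.
from gensim.models import Word2Vec

class LowercasedCorpus:
    """Streams one lowercased, punctuation-free, tokenized post per line."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield [w for w in line.lower().split() if w.isalnum()]

model = Word2Vec(
    sentences=LowercasedCorpus("social_posts.txt"),  # placeholder corpus file
    sg=0,             # CBOW architecture
    negative=10,      # 10 negative samples per prediction
    vector_size=200,  # vector size of 200
    window=5,         # maximum context window size of 5
    epochs=5,         # 5 training iterations over the corpus
    min_count=25,     # discard words seen fewer than 25 times
)
model.save("w2v_ru_200.model")
```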

2.2. Recurrent neural network

Recurrent neural networks (RNNs) are a class of neural networks that have recurrent connections between units. This makes RNNs well-suited to classify and predict sequence data, including short documents.
Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) is a popular RNN architecture designed to cope with the long-term dependency problem. LSTM has been shown to achieve state-of-the-art or near state-of-the-art results in many text sequence processing tasks (Sutskever et al., 2014) (Palangi et al., 2015). Gated Recurrent Unit (GRU) (Cho et al., 2014) is a simplified version of LSTM that has been shown to outperform LSTM in some tasks (Chung et al., 2014), although according to (Jozefowicz et al., 2015) the gap between LSTM and GRU can often be closed by changing the initialization of LSTM cells.
Our RNN-based model is composed of an LSTM/GRU network regularized by dropout with probability 0.5 and followed by a fully connected layer with 3 neurons that predict the probabilities of each class: negative, neutral and positive. The input sample is lowercased and converted to a sequence of the word2vec vectors described in section 2.1. Punctuation and words that are not in the word2vec vocabulary are discarded. The resulting sequence of vectors is input to the network. Like the word2vec vectors, the input and output of the LSTM/GRU cells have size 200.
We tried several variations of recurrent networks: shallow LSTM/GRU, bidirectional GRU and two-layer GRU. We also tried to reverse the order of the input vector sequences.
We used the Keras library (https://github.com/fchollet/keras) to implement the model; the source code is available at https://github.com/arkhipenko-ispras/SentiRuEval-2016-RNN. In the case of LSTM, the initialization of cells recommended in (Jozefowicz et al., 2015) was used. Sigmoid and hard sigmoid were used as the output activation and hidden activation of the recurrent network, respectively; softmax was used as the activation of the fully connected layer.





The Adam optimizer (Kingma & Ba, 2014) and a batch size of 8 were used for training; the number of training epochs was set to 20.
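The original implementation linked above targets the Keras of 2016. The sketch below is an approximate reconstruction in today's tf.keras, not the authors' code: layer sizes, activations, dropout, optimizer, batch size and epoch count follow the description above, while the zero-padding, masking and maximum tweet length are our own assumptions.

```python
# Approximate tf.keras reconstruction of the GRU model described above.
# Input: tweets converted to sequences of 200-dim word2vec vectors and
# zero-padded to MAX_LEN tokens (padding/masking is our assumption).
import tensorflow as tf

MAX_LEN = 40  # assumed maximum tweet length in tokens

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN, 200)),
    tf.keras.layers.Masking(mask_value=0.0),           # ignore zero padding
    tf.keras.layers.GRU(200,
                        activation="sigmoid",                 # output activation
                        recurrent_activation="hard_sigmoid"), # hidden activation
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(3, activation="softmax"),     # negative / neutral / positive
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# For the best-performing variant, input sequences are reversed before training:
# model.fit(x_train, y_train, batch_size=8, epochs=20)
```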

2.3. Convolutional neural network

Due to the widely reported success of convolutional neural networks (CNNs) (Kalchbrenner et al., 2014) (Kim, 2014) (dos Santos et al., 2014) in the area of sentiment analysis, we have conducted some experiments with a CNN as well.
We used the word2vec word vectors described in section 2.1 as features. For each tweet a matrix S is constructed, where s_i (the i-th row) is the word vector of the i-th word in the tweet. Then we calculate two vectors t_avg and t_max as follows:

t_j^{avg} = \frac{1}{m} \sum_{1 \le i \le m} s_{ij}   (1)

t_j^{max} = \max_{1 \le i \le m} s_{ij}   (2)

The concatenation of these two vectors is the input to our CNN. The network is composed of a convolutional layer with 8 kernels of width 10, followed by a dense layer with 3 neurons (with softmax activation) that predict the probabilities of each class. The scikit-neuralnetwork library (https://github.com/aigamedev/scikit-neuralnetwork) was used for implementing the network. The number of training epochs was set to 10.
The roadmap for further work includes experiments not only with different kinds of features but also with the architecture of the CNN. Feature extraction with word2vec seems to be the most promising direction. Since CNNs are not as powerful in sequence processing as RNNs, the technique of Dynamic k-Max Pooling (Kalchbrenner et al., 2014) can be used to address the problem of variable sentence length.
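As an illustration of equations (1) and (2), the following sketch computes the averaged and max-pooled feature vector for one tweet with NumPy; the function name, the word-vector lookup object and the tokenization are assumptions made for the example, not part of the original code.

```python
# Illustration of equations (1) and (2): per-dimension average and max
# over the word vectors of one tweet (names and lookup are assumptions).
import numpy as np

def tweet_features(tokens, w2v, dim=200):
    """Concatenate t_avg and t_max for a tokenized, lowercased tweet."""
    vectors = [w2v[t] for t in tokens if t in w2v]  # rows s_i of the matrix S
    if not vectors:
        return np.zeros(2 * dim)
    S = np.vstack(vectors)          # shape (m, dim)
    t_avg = S.mean(axis=0)          # equation (1)
    t_max = S.max(axis=0)           # equation (2)
    return np.concatenate([t_avg, t_max])  # input to the CNN
```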

2.4. Domain adaptation and ensemble solution

2.4.1. Domain adaptation

In most cases we assume that the source domain (training data) and the target domain (test data) are drawn from the same probability distribution:

P_s(X, y) \equiv P_t(X, y)   (3)

Consequently, it is impossible to build a classifier that would be able to distinguish target domain samples from source domain samples. But in many real-world problems assumption (3) does not hold and

P_s(X, y) \neq P_t(X, y)   (4)

How can one detect that P_s(X, y) \neq P_t(X, y)?
1. The quality of the model, measured on the source domain (e.g. with cross-validation), is much higher than on the target domain. Some participants of SentiRuEval-2015 faced this problem.
2. A consequence of assumption (3) is the impossibility of building a classifier which can distinguish the target domain from the source domain. The ability to build such a classifier indicates that assumption (3) does not hold. We were able to achieve a classification F1-score on source vs target domain above 0.85.

One can improve the quality of an algorithm in the target domain with different methods of domain adaptation; some of them can be found in (Jiang, 2008). In this work we use a simple method of domain adaptation: sample reweighting.

Let l(x, y, \theta) be a loss function. In order to obtain \theta we want to minimize the following function:

L(\theta) = \sum_{x, y \in X \times Y} l(x, y, \theta) \, P_t(x, y) \to \min_\theta   (5)

We can write the function L in the equivalent form:

L(\theta) = \sum_{x, y \in X \times Y} l(x, y, \theta) \, \frac{P_t(x, y)}{P_s(x, y)} \, P_s(x, y)   (6)

Now we replace the true loss function with its empirical estimation:

\hat{L}(\theta) = \frac{1}{l} \sum_{i=1}^{l} l(x_i, y_i, \theta) \, \frac{P_t(x_i, y_i)}{P_s(x_i, y_i)}   (7)

As one can see, the algorithm leads to sample reweighting with w_i = \frac{P_t(x_i, y_i)}{P_s(x_i, y_i)}. Finally, we assume that P_t(y|x) \equiv P_s(y|x), thus the weight w_i can be found as w_i = \frac{P(x_i|t)}{P(x_i|s)}. With Bayes' theorem one can estimate the weight as:

w_i = \frac{P(t|x_i) P(s)}{P(s|x_i) P(t)} = C \times \frac{P(t|x_i)}{P(s|x_i)}   (8)

We estimate the weight with logistic regression, and it slightly increases the quality.
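A minimal sketch of the reweighting in equation (8), assuming scikit-learn and some fixed feature representation of tweets (e.g. averaged word2vec vectors): a logistic regression is trained to separate source from target samples, and the ratio P(t|x)/P(s|x) is returned as the per-sample weight. The constant C is dropped since it does not affect the optimum; the function name and the clipping constant are our own choices.

```python
# Sketch of sample reweighting (eq. 8): weights from a source-vs-target
# domain classifier; feature matrices are assumed to be precomputed.
import numpy as np
from sklearn.linear_model import LogisticRegression

def domain_weights(X_source, X_target, eps=1e-6):
    """Return a weight w_i ~ P(t|x_i) / P(s|x_i) for every source sample."""
    X = np.vstack([X_source, X_target])
    d = np.concatenate([np.zeros(len(X_source)),    # 0 = source domain
                        np.ones(len(X_target))])    # 1 = target domain
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_t = clf.predict_proba(X_source)[:, 1]         # P(t | x_i)
    return p_t / np.clip(1.0 - p_t, eps, None)      # P(t|x_i) / P(s|x_i)
```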




2.4.2. Our ensemble solution

Our ensemble classifier consists of three classifiers, each of which votes with equal weight. The first two are the GRU neural network and the convolutional neural network described in sections 2.2 and 2.3, respectively. The third classifier is an SVM with the sample reweighting described in section 2.4.1; we used a polynomial kernel of degree 3. For every tweet, the average of the word2vec vectors (described in section 2.1) of all words in the tweet is used as features for the SVM classifier.
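Under the same assumptions, a sketch of how the third classifier and the vote could be wired together with scikit-learn: the SVM uses a degree-3 polynomial kernel, averaged word2vec features and the weights from the previous sketch as sample_weight. Tie-breaking in the three-way vote is not specified in the paper, so falling back to the GRU prediction is our assumption.

```python
# Sketch of the reweighted SVM (section 2.4.1) and the equal-weight vote;
# feature matrices and the two neural classifiers are assumed to be given.
from collections import Counter
from sklearn.svm import SVC

def fit_reweighted_svm(X_source, y_source, X_target):
    """Degree-3 polynomial SVM trained on averaged word2vec features,
    with per-sample weights from the domain_weights() sketch above."""
    svm = SVC(kernel="poly", degree=3)
    svm.fit(X_source, y_source,
            sample_weight=domain_weights(X_source, X_target))
    return svm

def ensemble_predict(gru_label, cnn_label, svm_label):
    """Equal-weight majority vote over the three classifiers; ties fall back
    to the GRU prediction (tie handling is our assumption)."""
    label, count = Counter([gru_label, cnn_label, svm_label]).most_common(1)[0]
    return label if count > 1 else gru_label
```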

3. Evaluation

Tables 1–2 present the results of the evaluation on sentiment classification. Both tables show the macro-averaged F1-score of the negative and positive classes, used as the quality measure in the SentiRuEval-2016 competition.
For the recurrent neural network based model, we performed 5-fold cross-validation on the training data provided by the organizers of SentiRuEval. The results are shown in Table 1. We found that the GRU network slightly outperforms the LSTM network, and that reversing the order of words in tweets improves the quality. Adding an extra recurrent layer also slightly increases the quality. In addition, we found that using word2vec vectors as features for the recurrent network is crucial: using a randomly initialized embedding layer and one-hot features instead of word2vec features gives a macro-averaged F1-score of only 0.45 for banks and 0.47 for telecommunication companies.
Table 2 shows results on the SentiRuEval test datasets for the solutions described in sections 2.2–2.4. It also shows the micro-averaged version of the F1-score and includes the solutions' ranks among all 58 solutions submitted to SentiRuEval by 10 teams. For test data classification with the GRU network, the model was trained on the whole training data 5 times and correspondingly gave 5 predictions for the test data; the most frequent class over all predictions was then chosen for each sample. The other models were trained and made predictions once.
The Gated Recurrent Unit based solution got the best macro-averaged score on both domains, significantly outperforming solutions from other teams on the banks domain, and also has the best micro-averaged F1-score on the banks domain.
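For reference, the quality measure described above can be computed with scikit-learn as in the sketch below; the numeric label encoding is an assumption, not part of the official evaluation script.

```python
# Macro-averaged F1 over the negative and positive classes only, as used
# for SentiRuEval-2016 ranking (label encoding -1/0/1 is an assumption).
from sklearn.metrics import f1_score

def sentirueval_macro_f1(y_true, y_pred, neg=-1, pos=1):
    return f1_score(y_true, y_pred, labels=[neg, pos], average="macro")
```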

Table 1. Macro-averaged F1-score, evaluated with RNN models using 5-fold cross-validation on SentiRuEval training data

RNN architecture                        Banks     Telecommunication companies
LSTM                                    0.6026    0.6410
GRU                                     0.6129    0.6428
GRU, reversed sequences                 0.6211    0.6570
Bidirectional GRU                       0.6207    0.6521
Two-layer GRU, reversed sequences       0.6243    0.6597


Table 2. F1-score and ranks among all solutions, evaluated on SentiRuEval test data (according to SentiRuEval results)

                                        Banks                         Telecommunication companies
Classifier                              Macro          Micro          Macro          Micro
                                        (score/rank)   (score/rank)   (score/rank)   (score/rank)
CNN                                     0.4832 / 21    0.5253 / 21    0.4704 / 41    0.6060 / 36
Two-layer GRU, reversed sequences       0.5517 / 1     0.5881 / 1     0.5594 / 1     0.6569 / 21
Ensemble classifier                     0.5352 / 2     0.5749 / 2     0.5403 / 9     0.6525 / 23
Best solution not from our team         0.5252 / 3     0.5653 / 3     0.5493 / 2     0.6822 / 1

Conclusion

We have described all variants of our sentiment analysis system. The GRU network based solution performed well and won the SentiRuEval-2016 competition on both domains (banks and telecommunication companies). Using word2vec vectors as features made a major contribution to this result.
However, we believe that the parameters of our classifiers were not optimal, even for the GRU network. After publication of the labeled test data by the organizers of the competition, we were able to achieve a macro-averaged F1-score above 0.6 on the test data for both domains using the GRU network. One part of our future work is to find optimal architectures and learning parameters for the RNN and CNN. It is also possible to combine the RNN and CNN into one compound network. In addition, our future research includes adapting our neural network based approaches to object-oriented sentiment analysis, as well as developing methods of domain adaptation within these approaches.

Acknowledgements

This work was supported by the Russian Foundation for Basic Research grant 15-37-20375.




References

1. Adamson A., Turan V. D. (2015), Opinion Tagging Using Deep Recurrent Nets with GRUs, available at: https://cs224d.stanford.edu/reports/AdamsonAlex.pdf
2. Cho K., van Merrienboer B., Gulcehre C., Bougares F., Schwenk H., Bengio Y. (2014), Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, CoRR, available at: http://arxiv.org/abs/1406.1078
3. Chung J., Gulcehre C., Cho K., Bengio Y. (2014), Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, CoRR, available at: http://arxiv.org/abs/1412.3555
4. Collobert R., Weston J., Bottou L., Karlen M., Kavukcuoglu K., Kuksa P. (2011), Natural Language Processing (Almost) from Scratch, CoRR, available at: http://arxiv.org/abs/1103.0398
5. dos Santos C., Gatti M. (2014), Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts, Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, pp. 69–78
6. Graves A. (2012), Supervised Sequence Labelling with Recurrent Neural Networks, available at: http://dx.doi.org/10.1007/978-3-642-24797-2
7. Hochreiter S., Schmidhuber J. (1997), Long Short-Term Memory, Neural Computation, volume 9, number 8, pp. 1735–1780
8. Irsoy O., Cardie C. (2014), Opinion Mining with Deep Recurrent Neural Networks, Proceedings of the Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, pp. 720–728
9. Jiang J. (2008), Domain Adaptation in Natural Language Processing, available at: http://hdl.handle.net/2142/11465
10. Jozefowicz R., Zaremba W., Sutskever I. (2015), An Empirical Exploration of Recurrent Network Architectures, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), Lille, France, pp. 2342–2350
11. Kalchbrenner N., Grefenstette E., Blunsom P. (2014), A Convolutional Neural Network for Modelling Sentences, CoRR, available at: http://arxiv.org/abs/1404.2188
12. Kim Y. (2014), Convolutional Neural Networks for Sentence Classification, CoRR, available at: http://arxiv.org/abs/1408.5882
13. Kingma D. P., Ba J. (2014), Adam: A Method for Stochastic Optimization, CoRR, available at: http://arxiv.org/abs/1412.6980
14. Mikolov T., Chen K., Corrado G., Dean J. (2013a), Efficient Estimation of Word Representations in Vector Space, CoRR, available at: http://arxiv.org/abs/1301.3781
15. Mikolov T., Sutskever I., Chen K., Corrado G., Dean J. (2013b), Distributed Representations of Words and Phrases and Their Compositionality, CoRR, available at: http://arxiv.org/abs/1310.4546
16. Mikolov T., Yih W., Zweig G. (2013c), Linguistic Regularities in Continuous Space Word Representations, Proceedings of NAACL HLT 2013, Atlanta, USA, pp. 746–751
17. Palangi H., Deng L., Shen Y., Gao J., He X., Chen J., Song X., Ward R. K. (2015), Deep Sentence Embedding Using the Long Short-Term Memory Network: Analysis and Application to Information Retrieval, CoRR, available at: http://arxiv.org/abs/1502.06922
18. Sutskever I., Vinyals O., Le Q. V. (2014), Sequence to Sequence Learning with Neural Networks, CoRR, available at: http://arxiv.org/abs/1409.3215
19. Tang D., Qin B., Liu T. (2015), Document Modeling with Gated Recurrent Neural Network for Sentiment Classification, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1422–1432


