Identifying and Analyzing Popular Phrases Multi-Dimensionally in Social Media Data

98 International Journal of Data Warehousing and Mining, 11(3), 98-112, July-September 2015

Zhongying Zhao, College of Information Science and Engineering, Shandong University of Science and Technology, Qingdao, China & Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, China
Chao Li, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Yong Zhang, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Joshua Zhexue Huang, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Jun Luo, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Shengzhong Feng, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Jianping Fan, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

ABSTRACT
With the success of social media, social network analysis has become a very active research topic and has attracted much attention over the last decade. Most studies analyze the whole network from the perspective of topology or content. However, no systematic model has yet been proposed for multi-dimensional analysis of big social media data. Furthermore, little work has been done on identifying emerging new popular phrases and analyzing them multi-dimensionally. In this paper, the authors first propose an interactive, systematic framework. To detect emerging new popular phrases effectively and efficiently, they present an N-Pat Tree model and several filtering mechanisms. They also propose an algorithm to find new popular phrases and analyze them multi-dimensionally. Experiments on one year of Tencent microblog data demonstrate the effectiveness of their work and yield many meaningful results.

Keywords: Multidimensional Analysis, Popular Phrase, Social Media, Social Network Analysis

DOI: 10.4018/IJDWM.2015070105 Copyright © 2015, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.


1. INTRODUCTION
Social media, such as Flickr, Facebook, YouTube, MySpace and Twitter, has emerged as one of the most popular platforms for people to communicate with each other. People engaging in these online social activities can post their ideas, browse others' posts, write comments, forward interesting posts, and follow anyone they are interested in. Analyzing the resulting data has many potential applications and has therefore attracted tremendous attention from researchers. Recent studies have paid particular attention to users' characteristics (Krishnamurthy, Gill, & Arlitt, 2008), statistical analysis (Java, Song, Finin & Tseng, 2007; Du, Wu, Pei, Wang & Xu, 2007), community detection (Zhao, Feng, Wang, Huang, Williams & Fan, 2012), propagation models (Galuba, Aberer, Chakraborty, Despotovic & Kellerer, 2010) and predictive methods (Zhao & Rosson, 2009). Data mining and learning techniques, such as clustering (Kwok, Smith, Lozano & Taniar, 2002), association rule mining (Taniar, Rahayu, Lee & Daly, 2008; Daly & Taniar, 2004) and model learning (Tan & Taniar, 2007), can be applied in the area of social network analysis.

Java et al. conducted a preliminary analysis of microblogging behaviors by studying the topological and geographical properties of the Twitter network (Java, Song, Finin & Tseng, 2007). Krishnamurthy, Gill, & Arlitt (2008) analyzed users' characteristics through following-followed relationships. Zhao & Rosson (2009) qualitatively investigated the motivation for using Twitter. Galuba, Aberer, Chakraborty, Despotovic & Kellerer (2010) proposed a propagation model that predicts which users will tweet about which URLs based on their historical activities. Jansen et al. studied word-of-mouth branding in Twitter (Jansen, Zhang, Sobel & Chowdury, 2009). Aral, Muchnik & Sundararajan (2009) distinguished the effects of homophily from those of influence as motivators for propagation. As to the study of influence within Twitter, Cha, Haddadi, Benevenuto & Gummadi (2010) compared three different measures of influence: degree, retweets, and user mentions.

In a word, researchers have done a lot of work on social media data mining and analysis. However, previous studies have ignored multi-dimensional analysis of big social media data. Furthermore, social media generates a variety of new popular phrases every hour, and it is challenging to detect those new popular phrases efficiently and analyze them multi-dimensionally.

In this paper, we propose an interactive and multi-dimensional analyzing framework for big social media data. Based on the proposed framework, we discover new popular phrases and analyze them from different dimensions. We propose an N-Pat Tree model, an efficient data structure, to discover phrases from social texts. We also present some filtering mechanisms to obtain new popular phrases. Finally, we conduct experiments on one year of Chinese microblog data from roughly 0.4 million users. We detect a large number of popular phrases from those data and then analyze them multi-dimensionally. The results demonstrate the effectiveness of our work and reveal many meaningful patterns across different dimensions.

The paper is organized as follows. In section 2, we give a multi-dimensional and interactive analyzing framework. In section 3, we present a new data structure called N-Pat Tree. Section 4 proposes some filtering mechanisms to find new popular phrases. Section 5 presents the evaluation of the approach for detecting popular phrases. The discovery and multi-dimensional analysis of new popular phrases in Tencent Microblogs are shown in section 6. Section 7 concludes the main work of this paper.

2. THE FRAMEWORK TO ANALYZE SOCIAL MEDIA DATA MULTI-DIMENSIONALLY
It is an undeniable fact that social media generates a large volume of data every day. These data have two important characteristics: (1) they are often generated with a variety of dimensions; (2) they usually contain a large number of new popular phrases. In order to discover more meaningful knowledge from these data, we propose a framework to analyze social media data multi-dimensionally, as shown in Figure 1. The framework consists of three modules: (1) Data preparation; (2) Multi-dimensional data modeling; (3) Multi-dimensional data analysis. Each module is described in detail below.

Figure 1. The framework to analyze social media data multi-dimensionally

2.1. Data Preparation
Most social media and social network sites are online. People involved in those media (e.g. microblogs) can freely share interesting information, comment on others' posts, and vote on others' postings. Analyzing these data about users' interactions has many potential applications. Social media web sites offer open APIs through which such data can be collected. In this paper, we design and implement a crawling program based on the Tencent microblogging API. We collect the data from Tencent microblogs and then store them in databases.


Figure 2. The multi-dimensional model of social media data

2.2. Multi-Dimensional Data Modeling
Considering the variety of social media data, we model them in terms of different dimensions, as shown in Figure 2. The big social media data is first represented as a fact table. In order to study the data from different dimensions, we associate the fact table with different dimension settings in the proposed schema. The granularity of each dimension, such as the Time Dimension, Geography Dimension, Source Dimension or Topic Dimension, can be fixed manually. Taking the time dimension as an example, we can fix the granularity as hour, day, week or month to extract the corresponding data. Similarly, the granularity of the geography dimension can be set to city, province or country. With the multi-dimensional data model, we integrate the data of different dimensions, each at its chosen granularity, into a data cube. This helps us adjust the dimensional granularity and ensures the effectiveness of knowledge discovery. In this paper, we mainly focus on the analysis of the topic dimension.
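As a rough illustration of the granularity settings described above, the following Python sketch rolls a tiny fact table up along the time and geography dimensions. The record layout, the sample rows, and the `roll_up` helper are assumptions made for this example only; they are not part of the proposed schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical fact-table rows: (timestamp, province, city, source, text).
FACTS = [
    (datetime(2011, 3, 11, 14, 5), "Guangdong", "Shenzhen", "web", "post a"),
    (datetime(2011, 3, 11, 21, 40), "Guangdong", "Guangzhou", "mobile", "post b"),
    (datetime(2011, 3, 12, 9, 15), "Shandong", "Qingdao", "web", "post c"),
]

# One key function per granularity level of each dimension.
TIME_LEVELS = {
    "hour": lambda t: t.strftime("%Y-%m-%d %H"),
    "day": lambda t: t.strftime("%Y-%m-%d"),
    "month": lambda t: t.strftime("%Y-%m"),
}
GEO_LEVELS = {
    "city": lambda province, city: (province, city),
    "province": lambda province, city: province,
}


def roll_up(facts, time_level, geo_level):
    """Count posts per (time bucket, geography bucket) cell of the cube."""
    time_key = TIME_LEVELS[time_level]
    geo_key = GEO_LEVELS[geo_level]
    cube = Counter()
    for timestamp, province, city, _source, _text in facts:
        cube[(time_key(timestamp), geo_key(province, city))] += 1
    return cube


by_day_province = roll_up(FACTS, "day", "province")
by_month_province = roll_up(FACTS, "month", "province")
```

Changing the two level names is all it takes to re-aggregate the same facts at a coarser or finer granularity, which is the kind of adjustment the data cube is meant to support.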

2.3. Multi-Dimensional Data Analysis
In topic analysis of social media data, especially Chinese social media data, one of the most challenging problems is to detect and analyze emerging new popular phrases. To address this problem, we present a procedure to find and filter new popular phrases. It consists of 5 steps: data selection, text preprocessing, text segmentation, filtering and multi-dimensional analysis. The data selector selects the required data cube. It can be used to choose different dimensions, select multiple dimensions at once, and control and adjust the granularity based on users' feedback. Text preprocessing cleans and formats the data.


Figure 3. An example of a Pat tree

Text segmentation refers to finding word segments or word groups by calculating their frequencies. In this paper, we propose the 'N-Pat tree' in section 3 to compute the frequencies of word segments. Word segments with higher frequency values are considered candidates for new popular phrases. In order to determine new popular phrases effectively, we need measures to filter out unpopular or meaningless candidates. The filtering process is described in detail in section 4. Finally, we perform a multi-dimensional analysis on the new popular phrases to study their patterns and trends.
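The frequency-based candidate step can be pictured with a small Python sketch. It is only an illustration under stated assumptions: a plain `Counter` over character n-grams stands in for the N-Pat tree of section 3, and the `max_len` and `min_freq` parameters are hypothetical settings, not values taken from the experiments.

```python
from collections import Counter


def candidate_phrases(texts, max_len=4, min_freq=2):
    """Count character n-grams (2..max_len) and keep the frequent ones."""
    counts = Counter()
    for text in texts:
        for size in range(2, max_len + 1):
            for start in range(len(text) - size + 1):
                counts[text[start:start + size]] += 1
    # Segments at or above the frequency cut-off become candidates.
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}


posts = ["abcd", "xabc", "abzz"]
candidates = candidate_phrases(posts, max_len=3, min_freq=2)
```

Here `candidates` contains the segments "ab", "bc" and "abc"; segments seen only once are discarded before any further filtering.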

3. N-PAT TREE STRUCTURE
In this section, we propose a new data structure, the N-Pat Tree, and present a searching algorithm based on it.

3.1. Pat Tree
The PAT Tree is a data structure that, once the text has been preprocessed, allows very efficient searching in information retrieval (Sun, 2009). It was developed by Gonnet, Baeza-Yates & Snider (1992) from Morrison's PATRICIA (Morrison, 1968). In fact, a PAT Tree is a Patricia Tree constructed over all the possible SIStrings (Semi-Infinite Strings) of a text. A Patricia tree is a binary digital tree in which the individual bits of the keys decide the branching: a zero bit causes a branch to the left subtree, and a one bit causes a branch to the right subtree. The Patricia tree consists of two types of nodes: Internal nodes and External nodes. The former store the comparing bit, which determines the branching. The latter store the Semi-Infinite Strings and their corresponding frequencies.

Figure 3 shows an example of a Pat Tree. In this example, we show the Patricia Tree for the text 'siat' (the binary code is '01110011 01101001 01100001 01110100'). The External nodes are denoted by squares and the Internal nodes by circles. To search for the substring 'at' (the binary code is '01100001 01110100'), we first check the fourth bit and go left because the bit is zero. Second, we check the fifth bit; it is also zero, so we go left again. Third, we obtain the substring and its frequency, because that information is stored in the External node. It is important to notice that the search ends when the prefix is exhausted or when we reach an External node. At that point, all the answers are available (regardless of their number) in a single subtree.

Given a random Patricia Tree, the height is O(log n) (Pittel, 1985; Apostolico & Szpankowski, 1992). Thus, the search for an arbitrary prefix can be finished in O(log n) time (Gonnet, Baeza-Yates & Snider, 1992). In practice, a query often takes less than O(log n) time, because the searching time is proportional to the query length. However, building and searching such a tree over large-scale social text data still takes a long time. To solve this problem, we propose the N-Pat Tree model, which constructs trees in parallel and thereby improves text processing performance and reduces querying time.

Figure 4. The structure of the N-Pat tree
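The bit tests in the 'siat' example of Section 3.1 can be reproduced with a few lines of Python. This is a minimal sketch, not a Patricia tree implementation: the helper names `bits`, `sistrings` and `route` are hypothetical, and the bit positions 4 and 5 are simply the ones named in the example.

```python
def bits(text):
    """Return the 8-bit binary code of each character, concatenated."""
    return "".join(format(ord(ch), "08b") for ch in text)


def sistrings(text):
    """All semi-infinite strings (suffixes) of the text."""
    return [text[i:] for i in range(len(text))]


def route(key, bit_positions):
    """Follow 1-based bit tests: 0 branches left ('L'), 1 branches right ('R')."""
    code = bits(key)
    return "".join("L" if code[pos - 1] == "0" else "R" for pos in bit_positions)


codes = {s: bits(s) for s in sistrings("siat")}
path_for_at = route("at", [4, 5])
```

With these definitions `path_for_at` is "LL", matching the two left branches taken for 'at' in the example, while the sistring 'iat' separates from it at the fifth bit.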

3.2. N-Pat Tree
We insert a special root node, which stores the string length, into each Pat Tree. Then we organize several Pat Trees in order to form an N-Pat tree structure, where N denotes the length of the strings that the Pat tree can search. For instance, given a 4-Pat tree, we can query strings whose length is no greater than 4 (the condition N ≥ Len in Algorithm 1). Some examples of N-Pat trees are shown in Figure 4.

Considering the independence of the N-Pat Trees and the large scale of social text, we construct them in parallel. The main steps are as follows:
Step 1: Initialize the root node for each N-Pat Tree;
Step 2: Classify the text based on the length of the strings;
Step 3: Construct each N-Pat Tree based on the length of the text to be processed.
With the above N-Pat Tree, we search phrases and calculate their frequencies of appearance. The searching algorithm is described in Algorithm 1.

Algorithm 1. Searching a phrase with N-Pat Tree
Input: String (Str), N-Pat Tree
Output: a string and its frequency (Str : Fre)
1. get the length Len of the string Str;
2. extract the N-Pat Tree Set (S);
3. if N ≥ Len,
4.     put the N-Pat Tree into S;
5. endif;
6. search the string Str in the N-Pat trees from S in parallel;
7. summarize the frequency of Str;
8. return (Str : Fre);
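A minimal Python sketch of Algorithm 1 follows. It rests on several assumptions that are not in the text: text segments are grouped by their length N, a `Counter` of substrings stands in for each Pat tree, and a thread pool plays the role of the parallel search. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_indexes(segments):
    """Group segments by length N and index every substring of each group."""
    indexes = defaultdict(Counter)
    for segment in segments:
        n = len(segment)
        for start in range(n):
            for end in range(start + 1, n + 1):
                indexes[n][segment[start:end]] += 1
    return indexes


def search(indexes, phrase):
    """Steps 1-8 of Algorithm 1: select trees with N >= Len, search, sum."""
    length = len(phrase)                                         # step 1
    selected = [idx for n, idx in indexes.items() if n >= length]  # steps 2-5
    with ThreadPoolExecutor(max_workers=4) as pool:              # step 6
        partial_counts = list(pool.map(lambda idx: idx[phrase], selected))
    return phrase, sum(partial_counts)                           # steps 7-8


INDEXES = build_indexes(["abc", "ab", "bcab"])
result = search(INDEXES, "ab")
```

In this toy run `result` is `("ab", 3)`, since "ab" occurs once in each of the three length groups. Indexes built for shorter strings than the query are never consulted, which is the point of the N ≥ Len test.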


4. FILTERING ALGORITHM
Before going into details, we first define two terms: 'meaningful phrase' and 'meaningless phrase'. The former refers to a phrase that has a clear or definite meaning. It can be either a registered phrase or a new phrase. The latter refers to a phrase that has no practical meaning. In this section, we filter out the meaningless phrases and the registered meaningful phrases, so as to obtain the unregistered (new) popular phrases. For that process, we present two filtering techniques: Integrated-Value based Filtering and Anti-word Dictionary based Filtering.

4.1. Integrated-Value based Filtering
In this section, we propose a new metric called Integrated-Value (IV), which integrates Mutual Information (MI) and C-value (C).

4.1.1. Mutual Information (MI)
Mutual information is a metric to evaluate the dependency of two strings (Chien, 1997). The stronger the dependency is, the higher the mutual information value gets. The formal definition of mutual information is as follows.

$$MI_s = \begin{cases} 0, & \text{if } f(left_s) + f(right_s) - f(s) = 0 \\ \log \dfrac{f(s)}{f(left_s) + f(right_s) - f(s)}, & \text{otherwise} \end{cases} \tag{1}$$

where
• $f(s)$ denotes the frequency of the string $s$;
• $left_s$ is the left substring of the string $s$;
• $right_s$ is the right substring of the string $s$.

4.1.2. C-Value (C)
C-value is a domain-independent method for multi-word automatic term recognition, which aims to improve the extraction of nested terms (Frantzi, Ananiadou & Mima, 2000). Suppose that we have a long string L and a relatively short multi-word X. In the C-value approach, if X often appears in L but seldom appears independently, then X is not likely to be a meaningful term even if it has a high frequency. If X appears in many kinds of long strings, then we can consider it to be a meaningful term. If L has the same frequency as X, then the long string L is more likely to be a meaningful term. The C-value of a string $s$ is calculated as shown in formula (2).

$$C(s) = \begin{cases} \log_2 |s| \cdot f(s), & \text{if } s \text{ is not nested} \\ \log_2 |s| \cdot \left( f(s) - \dfrac{1}{P(T_s)} \sum_{b \in T_s} f(b) \right), & \text{otherwise} \end{cases} \tag{2}$$

where
• $s$ is a string;
• $|s|$ denotes the length of the string $s$;
• $f(s)$ denotes the frequency of the string $s$;
• $b$ is a sub-string of the string $s$;
• $T_s$ is the set of $b$;
• $P(T_s)$ denotes the sum of frequencies of $b$.

Having presented the two measures, Mutual Information (MI) and C-value (C), we now integrate them into a new metric called Integrated-Value (IV), defined as follows.

$$IV = MI + C \tag{3}$$

During the filtering process, we first calculate the IV for every string. Then we set a threshold based on extensive experiments. We select the meaningful terms by filtering out every string $s$ whose IV is less than the threshold.

Figure 5. Some examples of Anti-word terms
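Formulas (1) to (3) and the threshold rule translate directly into code. The sketch below is illustrative only and makes a few assumptions: the logarithm in formula (1) is taken as the natural logarithm because no base is given, $f(s)$ is assumed positive for every candidate, the set $T_s$ is passed in as a list of frequencies together with the value of $P(T_s)$ so that no particular way of building $T_s$ is implied, and all function names and sample numbers are hypothetical.

```python
import math


def mutual_information(f_s, f_left, f_right):
    """Formula (1); natural log is an assumption, f_s must be positive."""
    denominator = f_left + f_right - f_s
    if denominator == 0:
        return 0.0
    return math.log(f_s / denominator)


def c_value(length, f_s, nested_freqs=None, p_ts=None):
    """Formula (2); nested_freqs holds f(b) for b in T_s, p_ts is P(T_s)."""
    if nested_freqs is None:  # s is not nested
        return math.log2(length) * f_s
    return math.log2(length) * (f_s - sum(nested_freqs) / p_ts)


def integrated_value(mi, c):
    """Formula (3): IV = MI + C."""
    return mi + c


def keep_by_iv(iv_scores, threshold):
    """Drop every string whose IV is below the threshold."""
    return [s for s, iv in iv_scores.items() if iv >= threshold]


mi = mutual_information(f_s=2, f_left=4, f_right=6)
c = c_value(length=4, f_s=5, nested_freqs=[2, 4], p_ts=2)
iv = integrated_value(mi, c)
kept = keep_by_iv({"x": iv, "y": 0.5}, threshold=1.0)
```

The threshold value 1.0 is arbitrary here; the text states that the actual threshold is chosen experimentally.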

4.2. Anti-Word Dictionary Based Filtering
In this section, we propose a dictionary based filtering mechanism that takes the N-Pat Tree into account. Traditional dictionary based methods filter only through existing words. However, there are also many meaningless terms in the real world. In order to improve the efficiency of filtering, we construct an Anti-word Dictionary and make full use of it in the filtering process. We first present the definition of the Anti-word and then describe the storage structure of words and Anti-words. Finally, we present the Anti-word Dictionary based filtering algorithm.

In a Chinese corpus, there are no delimiters between terms. A term here refers to the combination of several continuous words in Chinese. Through extensive experimental analysis on large corpora, we find that some terms are meaningless. We call them Anti-words. The corresponding definition is given in Definition 1.

4.2.1. Definition 1: Anti-Word Term
Suppose $A$ and $B$ are Chinese words from the same sentence in order, $Sub_A$ is the last substring (a suffix) of $A$, and $Sub_B$ is the first substring (a prefix) of $B$. We define the Anti-word term as

$$C = Sub_A + Sub_B \tag{4}$$

If $C$ is not a meaningful term, it is considered an Anti-word term. Some examples of Anti-words are shown in Figure 5. We adopt the N-Pat Tree structure to store the terms of the Anti-word dictionary, which improves the efficiency of term query and filtering. The Anti-word dictionary based filtering algorithm is described in Algorithm 2.

Algorithm 2. Filtering algorithm based on Anti-word dictionary
Input: set of Strings (S_str), set of registered terms (S_dict), Anti-word dictionary (S_antidict)
Output: S_str
1. construct N-Pat trees for the sets of word terms S_dict and S_antidict;
2. for String s in S_str do
3.     if CheckWord(s)
4.         delete s from S_str;
5.     endif;
6.     if CheckAntiWord(s)
7.         delete s from S_str;
8.     endif;
9. endfor;
10. return S_str;

5. EVALUATION OF OUR APPROACH

5.1. Performance Metrics
Three performance metrics, Precision, Recall, and F1-Measure, are adopted in the experiments to evaluate the quality of the approach for detecting new popular phrases. The definitions of those metrics are given as follows.

Precision: Precision is the fraction of detected new popular phrases that are correct. It is defined as

$$P = \frac{N_{right}}{N_{words}} \tag{5}$$

where
• $P$ denotes the value of Precision;
• $N_{right}$ denotes the number of new popular phrases detected correctly;
• $N_{words}$ denotes the number of all the new popular phrases detected.

Recall: Recall is the ratio of the number of correctly detected new popular phrases to the number of all the new popular phrases.

$$R = \frac{N_{right}}{N_{all}} \tag{6}$$

where
• $R$ denotes the value of Recall;
• $N_{right}$ denotes the number of new popular phrases detected correctly;
• $N_{all}$ denotes the number of all the new popular phrases in the data.

F1-Measure: F1-Measure is the harmonic mean of precision and recall. It is defined as

$$F1 = \frac{2 \cdot P \cdot R}{P + R} \tag{7}$$

where
• $F1$ denotes the value of F1-Measure;
• $P$ denotes the value of Precision;
• $R$ denotes the value of Recall.
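The three metrics are straightforward to compute once the detected phrases and the true new popular phrases are available as sets. The following sketch is a minimal illustration; the function names and the toy sets are hypothetical, and returning 0.0 for empty inputs is a convention chosen here rather than something the definitions specify.

```python
def f1_measure(p, r):
    """Formula (7): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def precision_recall_f1(detected, gold):
    """Formulas (5)-(7) from a set of detected and a set of true phrases."""
    detected, gold = set(detected), set(gold)
    n_right = len(detected & gold)                   # N_right
    p = n_right / len(detected) if detected else 0.0  # N_right / N_words
    r = n_right / len(gold) if gold else 0.0          # N_right / N_all
    return p, r, f1_measure(p, r)


p, r, f1 = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})

# Precision/recall pairs as reported in Table 2, used to recompute F1.
table2_f1 = [round(f1_measure(pp, rr), 2)
             for pp, rr in [(0.33, 0.40), (0.40, 0.46), (0.45, 0.42), (0.52, 0.48)]]
```

Recomputing F1 from the precision and recall pairs reported in Table 2 gives 0.36, 0.43, 0.43 and 0.50 after rounding, which agrees with the F1-Measure column of that table.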

5.2. Experimental Data Description
We have implemented a microblogging data collecting system and crawled a large amount of data from Tencent microblogs in the last year. The experimental data set includes 20,637,925,368 microblogging messages from 398,765 users. These data were generated from January 1st, 2011 to December 31st, 2011, in 34 provinces and 337 cities of China. The data set is briefly described in Table 1.

Table 1. The description of the experimental data set

| Experimental Data | Time Span | #Locations | #Users | #Microblogs |
|---|---|---|---|---|
| Tencent microblogs | 1 year | 337 | 398,765 | 20,637,925,368 |

5.3. Experimental Results and Discussions
We implement the proposed interactive and multi-dimensional analyzing framework (Section 2) and apply it to the Tencent microblogging data described in Section 5.2. We adopt the N-Pat Tree model (Section 3) to analyze those data and obtain the text segments. Then we employ the presented filtering mechanisms to filter those segments and obtain the new popular phrases. In the experiments, we use the three performance metrics (Section 5.1) as the evaluation measures and take the method 'N-Pat Tree + Frequency measure + Dictionary filter' as the 'Baseline'. Then we run the different methods on the data set above and compare their Precision, Recall and F1-Measure values. The comparison results are shown in Table 2.

According to Table 2, our approach improves Precision, Recall and F1-Measure. With the proposed 'Integrated Value', Recall rises from 0.40 to 0.46. With the proposed 'Anti-words Dictionary', Precision rises markedly, from 0.33 to 0.45. With both 'Integrated Value' and 'Anti-words Dictionary', our approach achieves the highest Precision (0.52), Recall (0.48) and F1-Measure (0.50). Therefore, our approach is effective in detecting new popular phrases.
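The dictionary and anti-word checks of Algorithm 2, which underlie the 'Anti-words Dictionary' variants evaluated here, can be sketched in a few lines of Python. This is an illustration under stated assumptions: Python sets stand in for the N-Pat trees built over the registered terms and the Anti-word dictionary, membership tests play the role of CheckWord and CheckAntiWord, and the sample strings are placeholders.

```python
def filter_candidates(candidates, registered_terms, anti_words):
    """Drop registered words and anti-word terms; keep everything else."""
    registered = set(registered_terms)  # stand-in for the N-Pat tree over S_dict
    anti = set(anti_words)              # stand-in for the N-Pat tree over S_antidict
    kept = []
    for s in candidates:
        if s in registered:   # CheckWord(s): already a known term
            continue
        if s in anti:         # CheckAntiWord(s): known meaningless term
            continue
        kept.append(s)
    return kept


remaining = filter_candidates(
    ["alpha", "beta", "gamma", "delta"],
    registered_terms={"alpha"},
    anti_words={"gamma"},
)
```

Unlike the pseudocode, the sketch builds a new list instead of deleting from the input while iterating over it, which avoids skipping elements in Python.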

Table 2. The evaluation of our approach for detecting new popular phrases

| Methods | Precision | Recall | F1-Measure |
|---|---|---|---|
| Baseline | 0.33 | 0.40 | 0.36 |
| Baseline+IV | 0.40 | 0.46 | 0.43 |
| Baseline+Anti-words Dictionary | 0.45 | 0.42 | 0.43 |
| Baseline+IV+Anti-words Dictionary | 0.52 | 0.48 | 0.50 |

6. MULTI-DIMENSIONAL ANALYSIS ON NEW POPULAR PHRASES

6.1. Temporal Analysis
In this section, we adopt the proposed framework to extract the microblogging text data along the time dimension. We divide the data into time slices by month and adopt the N-Pat Tree model to find new popular phrases in each time slice. The number of new popular phrases in each month is shown in Figure 6. Some representative popular phrases are shown in Figure 7.

As shown in Figure 6, the microblogging data in January has the largest number of new popular phrases. That may result from the fact that January is the starting point of our data. The number of new popular phrases detected in March is second only to that of January. That is because many important events took place in March, such as the big earthquake in Japan, salt hoarding in China, panic over nuclear radiation, and 'the last day of the world'. Such popular events accelerate the generation of new popular phrases. As new terms are registered, the number of new popular phrases gradually stabilizes. By analyzing the representative popular phrases in different months, shown in Figure 7, we find that some of them can be characterized by time and hot events.

Figure 6. The number of new popular phrases in each month

Figure 7. The representative popular phrases in each month

6.2. Spatial Analysis
In this section, we adopt the proposed framework to extract the microblogging data along the geography dimension. We divide the data into 34 blocks, each of which represents a province. Then we use the N-Pat Tree model and the different filtering mechanisms to find new popular phrases for each province. Figure 8 describes the distribution of new popular phrases across provinces. According to Figure 8, people in Guangdong province generate the most new popular phrases, which suggests that users in Guangdong are very active. Conversely, people in Xizang and Taiwan generate the fewest new popular phrases, which suggests that Tencent microblogs are not very popular in those areas. Furthermore, the usage of new popular phrases differs from province to province. Figure 9 presents the top 5 new popular phrases in every province.

Figure 8. The distribution of new popular phrases in different provinces

Figure 9. The top 5 new popular phrases in each province

Figure 10. The temporal and spatial distribution of those new popular phrases

6.3. Temporal and Spatial Analysis
In this section, we adopt the proposed framework to extract the data along both the time dimension and the geography dimension, and get 338 data blocks. With the N-Pat Tree model, we find new popular phrases for each data block. The temporal and spatial distribution of those new popular phrases is shown in Figure 10. According to Figure 10, the distribution of new popular phrases differs between areas and months. The usage of new popular phrases reaches its peak in the popular areas at the critical time, and often plunges to its low point in the obscure areas.


7. CONCLUSION
Social media data analysis and mining is a very important research topic in social computing. In this paper, we first propose an interactive knowledge discovery framework, which can be used to analyze social media data from different dimensions. Then we propose an N-Pat Tree model to find phrases efficiently. In order to improve the accuracy of new popular phrase detection, we present some filtering measures. Experiments on real social media data show that our approach is more effective than the baseline methods it was compared with. The multi-dimensional analysis of popular phrases also yields many meaningful results.

ACKNOWLEDGMENT
This research is supported by the National Natural Science Foundation of China (Grant No. 61303167, 61433012 and 11171086), Basic Research Program of Shenzhen (Grant No. JCYJ20130401170306838) and Open Project of Guangxi Key Laboratory of Trusted Software (Grant No. KX201329).

REFERENCES

Apostolico, A., & Szpankowski, W. (1992). Self-alignments in words and their applications. Journal of Algorithms, 13(3), 446–467. doi:10.1016/0196-6774(92)90049-I

Aral, S., Muchnik, L., & Sundararajan, A. (2009). Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences of the United States of America, 106(51), 21544–21549. doi:10.1073/pnas.0908800106 PMID:20007780

Cha, M., Haddadi, H., Benevenuto, F., & Gummadi, K. (2010). Measuring user influence in twitter: The million follower fallacy, 4th International AAAI Conference on Weblogs and Social Media (ICWSM), 10-17.

Chien, L. (1997). Pat-tree-based keyword extraction for Chinese information retrieval, ACM SIGIR Forum, 31, 50-58.

Daly, O., & Taniar, D. (2004). Exception Rules Mining Based on Negative Association Rules, Proceedings of the International Conference on Computational Science and Its Applications (ICCSA 2004), 4(3046), 543-552.

Du, N., Wu, B., Pei, X., Wang, B., & Xu, L. (2007). Community detection in large scale social networks, Proceedings of the 9th WebKDD and 1st SNA-KDD Workshop on Web Mining and Social Network Analysis, 16-25. doi:10.1145/1348549.1348552

Frantzi, K., Ananiadou, S., & Mima, H. (2000). Automatic recognition of multiword terms: The c-value/nc-value method. International Journal on Digital Libraries, 3(2), 115–130. doi:10.1007/s007999900023

Galuba, W., Aberer, K., Chakraborty, D., Despotovic, Z., & Kellerer, W. (2010). Outtweeting the twitterers - predicting information cascades in microblogs, Proceedings of the 3rd conference on Online social networks, USENIX Association, 3-3.

Gonnet, G., Baeza-Yates, R., & Snider, T. (1992). New indices for text: Pat trees and pat arrays, Information retrieval: data structures and algorithms, 66-82.

Jansen, B., Zhang, M., Sobel, K., & Chowdury, A. (2009). Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology, 60(11), 2169–2188. doi:10.1002/asi.21149

Java, A., Song, X., Finin, T., & Tseng, B. (2007). Why we twitter: understanding microblogging usage and communities, Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, ACM, 56-65. doi:10.1145/1348549.1348556

Krishnamurthy, B., Gill, P., & Arlitt, M. (2008). A few chirps about twitter, Proceedings of the 1st workshop on Online social networks, ACM, 19-24. doi:10.1145/1397735.1397741

Kwok, T., Smith, K. A., Lozano, S., & Taniar, D. (2002). Parallel Fuzzy c-Means Clustering for Large Data Sets, Proceedings of the 8th International Euro-Par Conference (Euro-Par 2002), 2400, 365-374. doi:10.1007/3-540-45706-2_48

Morrison, D. (1968). PATRICIA - practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4), 514–534. doi:10.1145/321479.321481

Pittel, B. (1985). Asymptotical growth of a class of random trees. Annals of Probability, 13(2), 414–427. doi:10.1214/aop/1176993000

Sun, H. (2009). Efficient algorithms for spatial configuration information retrieval. Knowledge-Based Systems, 22(6), 403–409. doi:10.1016/j.knosys.2009.05.004

Tan, L., & Taniar, D. (2007). Adaptive estimated maximum-entropy distribution model. Information Sciences, 177(15), 3110–3128. doi:10.1016/j.ins.2007.01.029

Taniar, D., Rahayu, W., Lee, V., & Daly, O. (2008). Exception rules in association rule mining. Applied Mathematics and Computation, 205(2), 735–750. doi:10.1016/j.amc.2008.05.020

Zhao, D., & Rosson, M. (2009). How and why people twitter: the role that microblogging plays in informal communication at work, Proceedings of the ACM 2009 international conference on Supporting group work, ACM, 243-252. doi:10.1145/1531674.1531710

Zhao, Z., Feng, S., Wang, Q., Huang, J. Z., Williams, G. J., & Fan, J. (2012). Topic oriented community detection through social objects and link analysis in social networks. Knowledge-Based Systems, 26(2), 164–173. doi:10.1016/j.knosys.2011.07.017

Zhongying Zhao received her PhD degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, in 2012. She is currently an assistant professor in the College of Information Science and Engineering, Shandong University of Science and Technology. She is also a visiting researcher at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. Her research interests include social network analysis and data mining.

Chao Li received his PhD degree in computer science from the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, in 2014. His current research interests include social network analysis and data mining.

Yong Zhang received his PhD degree in computer science from Fudan University. He is presently an associate professor at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. He has published over 50 papers in international journals and conference proceedings. His current research interests include online algorithms and social network mining.

Joshua Zhexue Huang received his PhD degree in Spatial Databases from the Royal Institute of Technology, Stockholm, Sweden, in 1993. He is presently a professor at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. His research interests are in databases, data mining, pattern recognition, machine learning, and grid computing. He has published over 100 papers in international journals and conference proceedings in these areas. He has served on various technical program committees of international conferences in related areas.

Jun Luo received his PhD degree in computer science from the University of Texas at Dallas, USA, in 2006. He then spent two years as a postdoc at Utrecht University, the Netherlands. Before he joined Noah's Ark Lab in Hong Kong, he was an associate professor at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. His research interests include data mining and algorithm analysis. He has published over 60 journal and conference papers in these areas.

Shengzhong Feng received his PhD degree in computer science from Beijing Institute of Technology in 1997. He has worked at the Institute of Computing Technology, CAS, and participated in the research and development of the Dawning supercomputer. He is presently a professor at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. His current research focuses on parallel algorithms, big data analysis and social computing.

Jianping Fan received his PhD degree in computer science from the Institute of Software, Chinese Academy of Sciences, in 1990. He worked at the Institute of Computing Technology from 1990 to 2006 as Professor and Vice Director. Since 2006, he has served as Director and Professor at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. His research interests include high performance computing, social computing and big data analysis.

Copyright © 2015, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.