Using Sequence Analysis to Classify Web Usage Patterns across Websites

2012 45th Hawaii International Conference on System Sciences

Qiqi Jiang
Department of Information Systems, City University of Hong Kong

Chuan-Hoo Tan
Department of Information Systems, City University of Hong Kong

Chee Wei Phang
Department of Information Management and Information Systems, Fudan University

Kwok Kee Wei
Department of Information Systems, City University of Hong Kong

978-0-7695-4525-7/12 $26.00 © 2012 IEEE. DOI 10.1109/HICSS.2012.631

Abstract
This study applies sequence analysis to identify distinct web browsing patterns from 30 days of web usage by 200 users in China. Our results reveal four distinct categories of web navigation behavior, namely search-information browsing, social-information browsing, ecommerce-information browsing, and direct browsing. Of these, the ratio of ecommerce activities in the social-information cluster is higher than in the others, with the exception of the ecommerce-information cluster. To test the robustness of the proposed method based on our classification, we also summarize the characteristics of each category after segmenting it by two demographic indicators, i.e., gender and occupation. Different online shopping behaviors are also discussed in terms of the classified groups. Complementing extant methods, which are based on within-website categorization of consumers, the application of sequence analysis to e-commerce affords a deeper, integrated understanding of an individual’s online activity and behavior (i.e., navigation across multiple websites).

1. Introduction
With the rapid growth of e-commerce and web-based marketing systems, the Internet has become the most important medium for understanding consumers [47]. Consequently, marketing managers and business consultants alike seek to gain significant insights into consumers’ web navigation behavior. Such understanding allows marketing managers to identify their most important visitors and to derive customized marketing strategies. Likewise, business consultants benefit from using web information to evaluate key performance indicators; by doing so, they can offer more tailored recommendations and solutions to marketing managers. In summary, a methodology that allows the analysis of web data and the deduction of insights would be invaluable.

Our review of the extant literature suggests that most of the current understanding of web usage mining has been restricted to the scrutiny of a single website [12, 15], leaving little knowledge of how one website fares against others in the digital marketplace. For instance, Moe [27] classified users into four categories based on their dynamic behaviors at a specific e-commerce website. Correspondingly, customization of marketing strategies could only be performed based on the observed activities of users within a single website [26, 28]. In this study, we shift the view from each individual site’s web activities (i.e., a single website’s web logs) to cross-website web activities (i.e., complete user-generated web logs). In other words, we selected users to be observed for a certain period of time, recorded their web activities, and analyzed and clustered their behaviors based on their browsing histories.

We propose an analysis methodology, i.e., sequence analysis, which discriminates between different user groups based on their web navigation behaviors across websites. Sequence analysis was initially proposed in biology for DNA research [10, 11, 29]. Our review of the literature suggests that scholars from economics and marketing have also applied the method in their respective research domains. For instance, economists have revised the sequence analysis method to study the timing of marital conception [24]; likewise, in marketing research, sequence analysis has been proposed to predict acquisition sequences [21, 36]. Motivated by these research endeavors, we adapted the sequence analysis method to study the online navigation behavior of users. We collected the web navigation data of 200 users in China for 30 consecutive days. After subjecting the data to sequence analysis, we observed four navigation categories, segmented by the nature of the online navigation behavior patterns: search-information browsing, social-information browsing, ecommerce-information browsing, and direct browsing. To test the robustness of the proposed method according to our classification, we also summarized the characteristics of each category after segmenting the data by two demographic indicators, i.e., gender and occupation. The selection of the two demographic indicators was based on previous findings that user profile information is an important indicator of user navigational behavior in web mining research [12, 18, 19]. In addition, several behavioral researchers have reported that demographic indicators affect Internet usage behaviors [38, 42, 45].

This study complements the extant methods, which are based on within-website categorization of consumers, by affording a deeper, integrated understanding of an individual’s online activity and behavior (i.e., navigation across multiple websites). Furthermore, recognizing the importance of social media, we also explored ecommerce activities in the social-information segment as exploratory work in the field of social commerce. This finding lends empirical support to business consulting leaders such as McKinsey and Nielsen, who have indicated the importance of social media for online marketing and ecommerce [8, 31].

We structure the subsequent sections of our paper as follows. In Section 2, we review extant literature on web mining and sequence analysis, in the context of the theoretical basis and implications of our research. Section 3 describes the design of the field study and presents the analysis results. In Section 4 we discuss our findings and results. We conclude with the implications, limitations and future directions in Section 5.

2. Sequence Analysis
The sequence analysis method was originally conceived in biology to identify common patterns in DNA research [10, 29]. Several issues in sociology can also be resolved by applying the sequence analysis method [1, 4]. For instance, Stovel [41] applied revised sequence analysis to investigate residential trajectories in sociology; Stovel et al. [40] argued that the labor market could be better understood by adopting the sequence analysis method, after modeling the transformation of the career system in a large British bank from 1890 to 1970. Wilson [46] observed consumers’ travel records to summarize their behavior patterns.

The essential property of sequence analysis, which differentiates it from other statistical analyses, is that it takes sequences of data points, rather than individual data points, as input [3]. The main aim of sequence analysis is to compare sequences in order to discover similarities and dissimilarities among them. Thus, a prerequisite for sequence analysis is that the states or events in the sequences come from a finite pool of states or events. In other words, the sequences can be seen as permutations and combinations of pre-defined states or events. With coded data, a researcher must determine how to distinguish similar from dissimilar sequences. Several algorithms can be applied to compare sequence patterns. The three most commonly used are the Longest Common Prefix (LCP), the Longest Common Subsequence (LCS) and the Optimal Matching (OM) distance. Following prior work [3, 5, 10, 16], we adopted the OM algorithm for comparing sequence patterns. More detailed information on OM and the procedures of sequence analysis is given in the Methodology section (Section 3.2).

It should be noted that sequence analysis by itself does not yield interpretable results; hence, a complementary method needs to be adopted. For instance, cluster analysis is most commonly applied to group similar sequence patterns [2]; multinomial discrimination analysis can also be modeled on the results of sequence analysis [36]. In this study, we apply cluster analysis to identify the different website navigation behavior categories.

3. Research Design: A Field Study
3.1. Data Description
The data used in this study were obtained through collaboration with an independent Internet-consulting firm in China, which has installed a client-based tracking application on each individual consumer’s computer to observe his/her Internet usage behavior. Monetary incentives were offered to the consumers in return for having their Internet usage observed. The consulting firm, which was not cognizant of our research focus and objectives,

randomly selected 100 female and 100 male Internet users in China, together with their entire Internet browsing records for the month of November 2010. In addition, we had information on their occupations, which included white-collar workers, professionals, sales personnel, teachers, technicians and unrecognized individuals¹. Table 1 describes the attributes in our web log dataset with one example row. In Table 1, the user with User id 1 visited Yahoo’s finance page (http://finance.yahoo.com.cn/news/D...) on 2010-11-5 (YYYY-MM-DD) at 20:38:18 (HH:MM:SS) and stayed on that page for 14 seconds. The visiting time and the duration of each visit were recorded in seconds. Moreover, this user reached Yahoo’s finance page from the Yahoo homepage (http://www.yahoo.com.cn). We also categorized the site types of the domains and the service types of the specific pages. With reference to Table 1, site type refers to the industry type of the website, i.e., Integrated Portal for Yahoo, whereas service type refers to the service provided by the specific page visited, i.e., Finance. A total of 453,347 unique records were generated from the obtained data.

Table 1. Data in log files

| Field | Value |
|---|---|
| User id | 1 |
| Visiting Time | 2010-11-5 20:38:18 |
| Duration (Sec) | 14 |
| Domain | yahoo.com.cn |
| Sub-domain | finance.yahoo.com.cn |
| Site Type | Integrated Portal |
| Service Type | Finance |
| URL | http://finance.yahoo.com.cn/news/D... |
| Reference | http://www.yahoo.com.cn |

We identified a set of user sessions from the raw web logs. A session is defined as the sequence of pages requested by a particular user, in the order in which they were requested [25]. In this study, we arranged the websites visited by each user along the time axis. Next, we set 3,600 seconds² (1 hour) as the interval threshold between visiting times in order to assign the web logs to the respective sessions of each individual. For instance, if the difference between two successive visiting times is more than 3,600 seconds, we assigned the two records to different session identifiers. At the same time, we also recorded the site type of each record. Table 2 depicts an example of the codified records.

Table 2. Codified records

| Session ID | User ID | Visiting Time | Site Type |
|---|---|---|---|
| 1 | 1 | 2010-11-4 18:10:35 | Integrated Portal |
| 1 | 1 | 2010-11-4 18:23:45 | Blog |
| 1 | 1 | 2010-11-4 18:40:02 | Video Community |
| 2 | 1 | 2010-11-5 00:12:36 | Search Engine |
| 2 | 1 | 2010-11-5 00:12:45 | Video Community |

The Internet-consulting firm had previously labeled the site type of each website for its corporate use. For instance, Yahoo was labeled as an integrated portal and chinanews.com was labeled as a news portal. We had a total of 31 different site types, summarized by the content or the service provided by each website in our dataset. The aim of this study is to better understand users’ orientation toward website types. For instance, news sites, financial information sites or sports information sites can all be seen as media for information retrieval. To ensure scientific rigor, we selected eight individuals with backgrounds similar to those of the users in our dataset to conduct unlabeled and labeled sorting. Of these, four were selected for unlabeled sorting and four for labeled sorting. The final sorting results are recorded in Table 3. Subsequently, based on the labeled sorting results, we aligned the visit sequence of website categories by the sessions defined above.

Table 3. Sorting results

| Type | Instances |
|---|---|
| Multimedia | Online Video; Online Game; Music Website; Digital Magazine |
| Search | Search Engine |
| Social Platform | Community Blog; Social Community; Forum/Bulletin Board System (BBS) |
| Information Retrieval³ | Health Information; Integrated Portal; Financial Information; IT information; News Portal; Job Information; Real Estate Information; Fashion Information; Literature Site; Tourism Information; Sports Information; Entertainment Information; Automobile Information |
| Commerce | Online Shopping |
| Brand Website | Company Website |
| Navigation | Navigational Site; Advertisement Categories’ Website |
| Others⁴ | Others; Email; e-Payment; B2B Website |

¹ The “Unrecognized” occupational group includes those who did not indicate their occupation.
² The mean interval of each user is 3,776.5 seconds, and the number of sessions is the same whether the threshold is set to 3,600 or 3,776.5 seconds. For simplicity, we set the threshold to 3,600 seconds.
³ Note that the Information Retrieval type includes topic-specific information sites; e.g., ESPN focuses on sports information and automobilemag.com focuses on news or information on automobiles. This explains why the Information Retrieval type contains more site types than the others.
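The session rule described in Section 3.1 (a new session starts when two successive visits by the same user are more than 3,600 seconds apart) can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the function name `assign_sessions` and the tuple layout of the records are assumptions made for the example, and the input rows are those of Table 2.

```python
from datetime import datetime

SESSION_GAP_SECONDS = 3600  # interval threshold stated in the text (1 hour)


def assign_sessions(records, gap=SESSION_GAP_SECONDS):
    """Attach a per-user session number to (user_id, timestamp, site_type) records.

    Records are sorted by user and time; a user's session number increases
    whenever the gap to that user's previous record exceeds `gap` seconds.
    """
    out = []
    last_seen = {}   # user_id -> timestamp of that user's previous record
    session_no = {}  # user_id -> current session number
    for user_id, ts, site_type in sorted(records, key=lambda r: (r[0], r[1])):
        prev = last_seen.get(user_id)
        if prev is None:
            session_no[user_id] = 1
        elif (ts - prev).total_seconds() > gap:
            session_no[user_id] += 1
        last_seen[user_id] = ts
        out.append((session_no[user_id], user_id, ts, site_type))
    return out


# The five rows of Table 2 (User 1)
records = [
    (1, datetime(2010, 11, 4, 18, 10, 35), "Integrated Portal"),
    (1, datetime(2010, 11, 4, 18, 23, 45), "Blog"),
    (1, datetime(2010, 11, 4, 18, 40, 2), "Video Community"),
    (1, datetime(2010, 11, 5, 0, 12, 36), "Search Engine"),
    (1, datetime(2010, 11, 5, 0, 12, 45), "Video Community"),
]
sessions = [row[0] for row in assign_sessions(records)]
print(sessions)  # [1, 1, 1, 2, 2]
```

The gap between the third and fourth visits is more than five hours, so the last two records fall into a second session, which matches the Session ID column of Table 2.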


The formatted session samples can be found in Table 4. For instance, the line [2.001 4 2 1 2 4 2 6 2 4 6 4] indicates that the first session of User 2 consists, in visiting order, of IR (Information Retrieving), S (Search), MM (Multimedia), S, IR, S, BW (Brand Website), S, IR, BW and IR. In total, 8,848 formatted sessions from 200 users were recorded. Outliers were removed⁵, leaving 6,955 sessions to be used for sequence analysis.

After obtaining the substitution cost matrix, the minimal-cost distance in terms of insertions, deletions and substitutions can be computed [22, 29]. As mentioned previously, we selected the Optimal Matching (OM) algorithm for comparing sequence patterns, following previous studies [3, 5]. The example in Table 6 shows how the OM algorithm compares sequence patterns. Comparing Sequences 1 and 2, which have the same length, differences are found at positions 1, 3, 5 and 12; Sequence 2 can therefore be converted into Sequence 1 by substitution alone, without insertion or deletion⁸. By referring to the substitution cost matrix obtained in Procedure 1, the minimum total cost of aligning the sequences can be calculated. For instance, according to the values presented in Table 5, the minimum dissimilarity score between Sequences 1 and 2 is 5.897 = 1.424 (3 to 4) + 1.424 (3 to 4) + 1.424 (3 to 4) + 1.625 (1 to 2). Similarly, the minimum dissimilarity score between Sequences 2 and 3 is 11.859 = 1.859 (7 to 3) + 2 (deleting the element at position 6 in Sequence 3) + 2*4 (inserting 4 new elements at positions 9 to 12). Notably, if we instead substituted the elements at positions 6 to 9 and inserted 3 elements at the last positions, the score would be 15.29 = 1.859 (7 to 3) + 1.883 (1 to 5) + 1.672 (5 to 6) + 1.938 (6 to 1) + 1.938 (1 to 6) + 2*3 (inserting 3 new elements at positions 10 to 12), which is higher than 11.859. By applying the OM algorithm to the 6,955 input sequences, we obtained a matrix of dissimilarity scores for all data. This 6955x6955 dissimilarity matrix can be used for hierarchical cluster analysis [17].
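The optimal matching comparison described above can be sketched as a standard edit-distance dynamic program. The sketch below is an illustration under stated assumptions, not the authors' code: the substitution costs are copied from Table 5, the insertion/deletion cost of 2 is taken from the worked example, the sequences are those of Table 6, and the names `om_distance` and `sub_cost` are hypothetical.

```python
# Substitution costs from Table 5; rows/columns are codes 1..7
# (1=MM, 2=S, 3=SP, 4=IR, 5=C, 6=BW, 7=N).
TABLE5 = [
    [0.000, 1.625, 1.801, 1.613, 1.883, 1.938, 1.626],
    [1.625, 0.000, 1.609, 1.111, 1.619, 1.755, 1.724],
    [1.801, 1.609, 0.000, 1.424, 1.671, 1.911, 1.859],
    [1.613, 1.111, 1.424, 0.000, 1.216, 1.562, 1.603],
    [1.883, 1.619, 1.671, 1.216, 0.000, 1.672, 1.844],
    [1.938, 1.755, 1.911, 1.562, 1.672, 0.000, 1.935],
    [1.626, 1.724, 1.859, 1.603, 1.844, 1.935, 0.000],
]
INDEL = 2.0  # insertion/deletion cost used in the worked example


def sub_cost(a, b):
    """Substitution cost between two category codes (1-based)."""
    return TABLE5[a - 1][b - 1]


def om_distance(x, y, indel=INDEL):
    """Minimum total cost of turning sequence x into y with
    substitutions, insertions and deletions (dynamic programming)."""
    n, m = len(x), len(y)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                             # delete x[i-1]
                d[i][j - 1] + indel,                             # insert y[j-1]
                d[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]),  # substitute or match
            )
    return d[n][m]


seq1 = [4, 5, 4, 5, 4, 5, 6, 1, 6, 1, 6, 1]
seq2 = [3, 5, 3, 5, 3, 5, 6, 1, 6, 1, 6, 2]
seq3 = [3, 5, 7, 5, 3, 1, 5, 6, 1]

print(round(om_distance(seq1, seq2), 3))  # 5.897
```

Because the dynamic program searches over all alignments, its result for Sequences 2 and 3 is never larger than the cost of a hand-worked alignment such as the one in the text, and it may be smaller.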

Table 4. Session sample

Formatted Session Sample
2.001 4 2 1 2 4 2 6 2 4 6 4
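To make the formatted line in Table 4 concrete, the sketch below decodes it into category labels. It assumes, based on the description of this sample, that the leading token encodes "user.session" and that codes 1 to 7 follow the order MM, S, SP, IR, C, BW, N; the function name `parse_session_line` is hypothetical.

```python
# Assumed code order, inferred from the sample (4 = IR, 2 = S, 1 = MM, 6 = BW)
CATEGORY_CODES = {1: "MM", 2: "S", 3: "SP", 4: "IR", 5: "C", 6: "BW", 7: "N"}


def parse_session_line(line):
    """Split '<user>.<session> c1 c2 ...' into (user_id, session_no, labels)."""
    head, *codes = line.split()
    user, session = head.split(".")
    labels = [CATEGORY_CODES[int(c)] for c in codes]
    return int(user), int(session), labels


user_id, session_no, labels = parse_session_line("2.001 4 2 1 2 4 2 6 2 4 6 4")
print(user_id, session_no, labels)
```

For the sample line this yields user 2, session 1, and the label sequence IR, S, MM, S, IR, S, BW, S, IR, BW, IR.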

3.2. Methodology and Data Analysis
Following previous work in biology and sociology [39, 41], the coded sequence samples obtained in the previous section (Data Description) are first used as input to calculate the substitution cost matrix between the different categories. The substitution cost between two different states can be obtained from the transition rates⁶, as represented by the following equation [14, 29]:

2 - P(i|j) - P(j|i)

where i is not equal to j and P(i|j) indicates the transition rate from i to j. Note that a substitution cost as high as 2 indicates that the transition between the two states never occurs. In our case, we have seven categories (the first seven types in Table 3), so the substitution cost matrix is a 7x7 matrix. It is shown in Table 5.

Table 5. Substitution cost matrix⁷

| | MM | S | SP | IR | C | BW | N |
|---|---|---|---|---|---|---|---|
| MM | 0.000 | 1.625 | 1.801 | 1.613 | 1.883 | 1.938 | 1.626 |
| S | 1.625 | 0.000 | 1.609 | 1.111 | 1.619 | 1.755 | 1.724 |
| SP | 1.801 | 1.609 | 0.000 | 1.424 | 1.671 | 1.911 | 1.859 |
| IR | 1.613 | 1.111 | 1.424 | 0.000 | 1.216 | 1.562 | 1.603 |
| C | 1.883 | 1.619 | 1.671 | 1.216 | 0.000 | 1.672 | 1.844 |
| BW | 1.938 | 1.755 | 1.911 | 1.562 | 1.672 | 0.000 | 1.935 |
| N | 1.626 | 1.724 | 1.859 | 1.603 | 1.844 | 1.935 | 0.000 |

Table 6. Sequence examples

| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sequence 1 | 4 | 5 | 4 | 5 | 4 | 5 | 6 | 1 | 6 | 1 | 6 | 1 |
| Sequence 2 | 3 | 5 | 3 | 5 | 3 | 5 | 6 | 1 | 6 | 1 | 6 | 2 |
| Sequence 3 | 3 | 5 | 7 | 5 | 3 | 1 | 5 | 6 | 1 | | | |
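The derivation of substitution costs from transition rates can be sketched as follows. This is an illustrative implementation of the expression 2 - P(i|j) - P(j|i); the estimation of transition rates from adjacent pairs pooled over all sequences is an assumption about the procedure, the function names are hypothetical, and the toy sequences are invented for the example and are not the study's data.

```python
from collections import Counter
from itertools import combinations


def transition_rates(sequences, states):
    """Estimate the rate of moving from state i to state j as the share of
    adjacent pairs starting in i that end in j, pooled over all sequences."""
    pair_counts = Counter()
    from_counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            from_counts[a] += 1
    return {
        (i, j): (pair_counts[(i, j)] / from_counts[i]) if from_counts[i] else 0.0
        for i in states
        for j in states
    }


def substitution_costs(sequences, states):
    """Cost(i, j) = 2 - rate(i -> j) - rate(j -> i) for i != j; 0 on the diagonal."""
    rate = transition_rates(sequences, states)
    cost = {(s, s): 0.0 for s in states}
    for i, j in combinations(states, 2):
        c = 2.0 - rate[(i, j)] - rate[(j, i)]
        cost[(i, j)] = cost[(j, i)] = c
    return cost


# Toy coded sessions (illustrative only): 2 = S, 4 = IR, 5 = C
toy = [[4, 2, 4, 2], [2, 4, 5], [4, 5, 4]]
costs = substitution_costs(toy, states=[2, 4, 5])
print(costs[(2, 4)], costs[(4, 5)], costs[(2, 5)])  # 0.5 0.5 2.0
```

In the toy data, states 2 and 5 never follow one another, so their cost reaches the maximum of 2, which illustrates the remark that a cost of 2 corresponds to a transition that never occurs.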

⁴ The Others type was not subjected to further analysis because of its equivocal definition and its extremely low proportion of the whole dataset; e.g., there were only five records for B2B websites.
⁵ The extreme length of a sequence will influence the accuracy of sequence analysis (Smith, 1981).
⁶ The transition rate from state i to state j is defined as the probability of observing j at sequential order s+1 given i at sequential order s.
⁷ MM indicates Multimedia, S indicates Search, SP indicates Social Platform, IR indicates Information Retrieving, C indicates Commerce, BW indicates Brand Websites and N indicates Navigation.
⁸ Because we are adding essentially unknown elements to an established sequence, the cost of an insertion or deletion is always higher than that of a substitution.

Because a standalone distance/dissimilarity matrix is difficult to interpret on its own, clustering is used to group homogeneous sequences together and to separate heterogeneous ones into different clusters. We applied Ward’s method for agglomerative hierarchical clustering, because the dissimilarity matrix we input conforms to the assumption of Ward’s criterion, i.e., in terms of the optimal values [44]. The output tree diagram is shown in Figure 1. Each sequence in our dataset is assigned to a specific cluster. Based on the structure of the tree diagram, we separated the tree into four clusters. We used two graphical diagrams to depict the nature of each cluster. Figure 2 features the dominant browsing website types and Figure 3 records the detailed browsing sequence patterns by cluster.

Figure 1. Hierarchical tree

According to Figure 2, Cluster 1 is dominated by Search (purple) and Information Retrieving (yellow) browsing behaviors; Social Platform (orange) and Information Retrieving (yellow) dominate Cluster 2, in which some commerce website visiting behaviors are also evident; and Cluster 3 is mainly dominated by Commerce (blue) and Information Retrieving (yellow), with some evidence of searching behaviors. The top 10 most frequently occurring sequences in each cluster are reported in Figure 3. The sequential pattern in the first cluster can be described as an interactive browsing behavior positioned between searching and information acquisition, while the second cluster is driven by social power and information. Furthermore, the visits to online commercial websites that accompany social platform visits are in accordance with the postulation that ecommerce can be driven by social media [8, 20]. The third cluster is a purely commerce-dominated group, and it stands to reason that information retrieving and searching behaviors accompany online shop visits in order to access the abundance of information [7, 23]. The last cluster contains the greatest number of browsing sequences, in which the browsing behaviors are motivated by specific goals that are accomplished rapidly and easily, resembling the end-to-end browsing of the “simplifiers” defined in McKinsey’s report [13]. More detailed descriptive statistics, segmented by demographic indicators, are presented in Table 7.

Figure 2. Proportion by cluster

Table 7. Clusters vs. demographic indicators

| Cluster | No of Sequences | Male (%) | Female (%) | White-Collar (%) | Professional (%) | Technical (%) | Sales (%) | Teacher (%) | Unrecognized (%) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1475 | 62.58 | 37.42 | 20.75 | 20.07 | 35.19 | 12.34 | 4.95 | 6.71 |
| 2 | 477 | 36.48 | 63.52 | 34.38 | 20.55 | 23.06 | 10.48 | 8.39 | 3.99 |
| 3 | 580 | 37.59 | 62.41 | 26.72 | 25.69 | 26.21 | 10.00 | 8.10 | 3.28 |
| 4 | 4423 | 56.41 | 43.59 | 29.19 | 16.21 | 28.87 | 11.80 | 5.47 | 8.46 |
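The clustering step in Section 3.2 (Ward's criterion applied to a precomputed dissimilarity matrix, followed by a cut into a fixed number of clusters) can be sketched in plain Python as follows. This is a small illustration and not the software used in the study: it assumes the Lance-Williams form of Ward's update applied to squared dissimilarities, the function name `ward_clusters` is hypothetical, and the four-point input is a toy example.

```python
def ward_clusters(dist, k):
    """Agglomerative clustering with Ward's criterion on a precomputed
    dissimilarity matrix, stopping when k clusters remain.

    Assumption: the Lance-Williams update for Ward's method is applied to
    squared dissimilarities. Returns a sorted list of clusters, each a
    sorted list of item indices.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}
    # Squared dissimilarities between active clusters, keyed by (small_id, large_id)
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(members) > k:
        (a, b), dab = min(d2.items(), key=lambda kv: kv[1])  # closest pair
        na, nb = len(members[a]), len(members[b])
        updated = {}
        for c in members:
            if c in (a, b):
                continue
            nc = len(members[c])
            dac = d2[(min(a, c), max(a, c))]
            dbc = d2[(min(b, c), max(b, c))]
            # Lance-Williams update for Ward's criterion
            updated[(c, next_id)] = (
                (na + nc) * dac + (nb + nc) * dbc - nc * dab
            ) / (na + nb + nc)
        # Drop every pair that involves the two merged clusters
        d2 = {p: v for p, v in d2.items() if a not in p and b not in p}
        d2.update(updated)
        members[next_id] = sorted(members.pop(a) + members.pop(b))
        next_id += 1
    return sorted(members.values())


# Toy dissimilarities: four items located at 0, 1, 10 and 11 on a line
points = [0, 1, 10, 11]
toy_dist = [[abs(p - q) for q in points] for p in points]
print(ward_clusters(toy_dist, 2))  # [[0, 1], [2, 3]]
```

In the study the input would be the 6955x6955 OM dissimilarity matrix with k = 4; a quadratic-memory pure-Python loop like this one is only practical for small inputs, so a dedicated clustering library would be the more realistic choice at that scale.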

Figure 3. Browsing sequence patterns by cluster

The results in Table 7 are consistent with prior literature on Internet browsing behaviors segmented by demographic variables. For instance, our finding that more females than males participate in social activities (Cluster 2) finds support in prior research [45]; the finding that males use the Internet more than females for information searching and retrieval [32] is reflected in Cluster 1; and the finding that females engage in more online shopping activities than males [43] is reflected in Cluster 3. Furthermore, differences in users’ occupations also influence web browsing preferences. For instance, technical-minded browsers are more involved in information seeking and retrieving activities than others, while white-collar employees are drawn to social platforms, which is in line with the productivity loss in offices caused by frequent access to social media such as Facebook and Twitter, an undesirable phenomenon reported by major industry consultancy firms such as Gartner and Nielsen [30, 35]. Furthermore, we observed that commerce website visits are relatively more prominent in the social-information group (Cluster 2) than in the other clusters, with the exception of Cluster 3. This motivated us to categorize the different shopping behaviors, such as direct buying⁹, adding to the shopping cart and group buying, with the aim of investigating how purchasing behaviors differ across the clusters. The statistical results are shown in Table 8.

Table 8. Clusters vs. purchasing ways

| Category | No of Sequences | Direct Buying | Group Buying | Add Shopping Cart |
|---|---|---|---|---|
| Cluster 1 | 1475 | 0.138 | 0.542 | 1.491 |
| Cluster 2 | 477 | 2.935 | 0.839 | 3.774 |
| Cluster 3 | 580 | 7.931 | 6.897 | 12.931 |
| Cluster 4 | 4423 | 1.809 | 0.588 | 1.673 |
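As a reading aid for Table 8, the sketch below converts the reported rates into approximate numbers of sequences per cluster. It rests on an assumption that the source does not state explicitly, namely that the Table 8 figures are percentages of each cluster's sequences; the resulting counts are therefore an inference, and the function name `implied_counts` is hypothetical.

```python
# Cluster sizes and rates copied from Table 8
CLUSTER_SIZES = {1: 1475, 2: 477, 3: 580, 4: 4423}
TABLE8 = {
    "Direct Buying": {1: 0.138, 2: 2.935, 3: 7.931, 4: 1.809},
    "Group Buying": {1: 0.542, 2: 0.839, 3: 6.897, 4: 0.588},
    "Add Shopping Cart": {1: 1.491, 2: 3.774, 3: 12.931, 4: 1.673},
}


def implied_counts(table, sizes):
    """Approximate sequence counts, ASSUMING each rate is a percentage
    of the sequences in its cluster (unit not stated in the source)."""
    return {
        method: {c: round(pct / 100 * sizes[c]) for c, pct in by_cluster.items()}
        for method, by_cluster in table.items()
    }


counts = implied_counts(TABLE8, CLUSTER_SIZES)
total_sequences = sum(CLUSTER_SIZES.values())
print(total_sequences)           # 6955, the number of sessions analyzed
print(counts["Direct Buying"])   # implied counts per cluster under the assumption
```

The cluster sizes sum to the 6,955 sessions retained for analysis, and under the percentage assumption the direct-buying rate in Cluster 2 exceeds that in Clusters 1 and 4, in line with the discussion of Table 8.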

In comparison with Clusters 1 and 4, online shops were visited more frequently in the second cluster, i.e., the social-information group. Notably, the rate was higher than that in the direct browsing group (Cluster 4), which provides evidence that social media can sway ecommerce. Furthermore, prior research demonstrates that group activities can be driven by social networks or social media [6, 9]. However, as a new form of ecommerce distinct from traditional social activities, group buying appears to be scarcely associated with social media according to our results. This is also consistent with the finding of a quarterly report by a leading digital analytics firm in China [48].

4. Discussion and Implications The main contribution of this article is the application of sequence analysis (with complimenting concurrent use of cluster analysis) to understand user web navigation behavior. To the best of our knowledge, this study is the first to apply sequence analysis for web mining, at the individual user level, across websites. The methodology can be explained in the following steps. We first distill the original log files into users’

9

The leading C2C shop platform, taobao.com, provides an option named “Buy it now”. After the selected button is clicked, the URL will be directed into the payment gateway for checking out, instead of the usual option to add the product(s) into the shopping cart.

3605

sequences with group decision support system (GDSS) [34] can also utilize sequence analysis method. The findings of this study are also important to practitioners, whose business values can also be clarified from macro and micro perspective respectively. From a macro viewpoint, the online marketing consulting companies, such as Nielson and comScore, can yield a more comprehensive understanding of an online user based on their across-websites navigation behavior. Building on this understanding, consultants or analysts can apply user-level analytics to measure the performance from distinct perspectives comparing to traditional insights of existing web analytics tools. It is to be cautioned that the strategy in this study, i.e., tracking program installed in users’ machines, might not be directly applicable to merchants interested in single website. However, from micro perspective, the methodology purposed can also be applied for investigating user-browsing behaviors within website by operator(s) for personalization website design or targeting advertisement. For instances, searching function is mandatory in almost each online shop, besides, various commercial websites also consist of other components, such as forums for products reviews and discussion, portals for broadcasting information, etc. Therefore, within a single website, we can also do the classification based on the users’ navigational patterns, and the advertisements can be delivered to the target users on the discriminated groups. Additionally, advertising strategies, such as searching advertising like SEO (Search Engine Optimization), WOM (WordOf-Mouth) advertising based on social network structure or normal advertisements like banners or popups in portals can also be adopted based on the classified groups. Furthermore, navigational patterns of principal visitors10 are also valuable for website’s operators. 
We did a summarization and exploration how and what the different online shopping methods proportion in each group in order to understand whether the different navigation browsing patterns influencing purchasing methods. For instances, to evaluate online shop, different online shopping methods, such as group buying, auction and direct purchasing can be separated and discussed respectively to find which navigation browsing behaviors matching different purchasing methods specifically.

session format. At the same time, each website should be assigned into pre-defined categories based on the analytical requirements. Afterwards, sequence analysis is applied to classify the navigational sessions and identifying the similar patterns. Notably, OM (Optimal Matching) algorithm is not the solely algorithm for comparing the sequences. The reason why we applied is that its reliability has been proved in various fields in both sociology and biology [2, 11]. After observing 200 users’ visiting behaviors across websites in a one-month period, we have categorized the browsing behaviors into four groups according to the nature of the websites and the sequential order of the visits, employing a scientifically and statistically sound approach. We were also able to discern the deeper nature of each cluster after segmenting the groups according to demographic indicators, and the results are consistent with the prior research [38, 42, 45]. In addition, we were able to discuss the visiting behaviors of consumers in different types of online shops. This study contributes to theory development in three main ways. First, we are probably the first to employ the OM algorithm for the application of sequence analysis with web mining, although it was initially adopted in the fields of biology and sociology [4, 10]. As empirically demonstrated, we uncovered details of how web log files can be converted into data formats that can be processed for web usage mining with sequential patterns. Second, while most previous research has concentrated on users’ browsing processes within specific websites [33], our study takes the lead in observing targeted users’ website browsing behaviors, by employing web mining knowledge. Third, based on the nature of websites and the sequential order of browsing behaviors, we have classified browsing strategies into four clusters. 
We believe the classification can be further used in various research fields, such as the study of online consumer behaviors and the exploring of its use as a mechanism for targeting advertisements. Last, by means of web mining methods, our statistical results of each cluster do validate and support the finding that different browsing behaviors and preferences are influenced by demographic indicators, a postulation which has been addressed by several psychological researchers [38, 45]. It to be noted that sequence analysis method can also be applied to study the sequential events of new production methods or the implementation of information technology in organizations [37], as well as laboratory experiment for group decision making

10 Principal visitors can be defined from different views based on aspects of measurements. For instance, from CRM perspective, the term of principal visitors refers to the users’ with highest bounce rate; from merchandising perspective, the definition of principal visitors refer to the consumers with highest purchasing volume. In sum, the principal visitors are the most worthwhile tracking visitors based on the KPI.

3606

[3] A. Abbott, “Sequence analysis and optimal matching methods in sociology”, Sociological Methods & Research, Sage Publications, 2000, pp.3-33.

Notwithstanding the implications for research and practice, this study, like others, does have its limitations which serve to inform future research. First, there are some users in the data who did not indicate their occupation, which adversely influenced the accuracy of our findings. For future research, a more systematic and detailed demographic classification should be taken into account to further test the robustness of our findings. Second, with the explosive growth of ecommerce, new business models are mushrooming relentlessly and the nature of websites is expected to become increasingly indistinct. To ensure the accuracy of our classifications, a detailed and more sophisticated definition of the nature of websites should be applied in future research.

[4] A. Abbott, “Sequence analysis: new methods for old ideas”, Annual Review of Sociology, JSTOR, 1995, pp. 93113. [5] M. Blair-Loy, “Career patterns of executive women in finance: An optimal matching analysis”, American Journal of Sociology, JSTOR, 1999, pp.1346-1397. [6] S.A. Bly, S.R. Harrison, S. Irwin, “Media spaces: bringing people together in a video, audio, and computing environment”, Communications of the ACM, ACM, 1993, pp. 28-46. [7] K.P. Chiang, R.R. Dholakia, “Factors driving consumer intention to shop online: an empirical investigation”, Journal of Consumer Psychology, Elsevier, 2003, pp.177-183.

5. Conclusion In this paper, we adopted sequence analysis to discover web navigation patterns based on click stream log files generated by 200 users in China over a onemonth period. After codifying the data of the web log files into session formats based on our labeled categories, and according to the nature of the websites, we applied sequence analysis with the optimal matching distance algorithm to group similar sequential patterns into four clusters. We observed four types of clusters, namely search-information, socialinformation, ecommerce-information and direct browsing. Furthermore, we also summarized the nature of each group in terms of both the users’ demographic indicators and types of online shops to confirm the findings of previous research.

[8] M. Chui, A. Miller, R.P. Roberts, “Six ways to make Web 2.0 work”, The McKinsey Quarterly, 2009 February, pp.1-7.

[9] M. De Choudhury, “Modeling and predicting group activity over time in online social media”, Proceedings of the 20th ACM conference on Hypertext and hypermedia, ACM, 2009, pp.349-350.

[10] J. Devereux, P. Haeberli, O. Smithies, “A comprehensive set of sequence analysis programs for the VAX”, Nucleic Acids Research, Oxford Univ Press, 1984, pp.387-395.

[11] R. Durbin, Biological sequence analysis: Probabilistic models of proteins and nucleic acids, Cambridge University Press, 1998.

[12] M. Eirinaki, M. Vazirgiannis, “Web mining for web personalization”, ACM Transactions on Internet Technology (TOIT), ACM, 2003, pp.1-27.

6. Acknowledgments

The authors thank iResearch Consulting Group (http://www.iresearchchina.com) for their generous contribution of data, without which this research would not have been possible. This paper was supported by the General Research Fund sponsored by the Research Grants Council of Hong Kong (project number 9041612).

[13] J. Forsyth, T. McGuire, “All visitors are not created equal”, McKinsey marketing practice, McKinsey&Company, 2000 Aug.

[14] A. Gabadinho, G. Ritschard, M. Mueller, “Mining sequence data in R with the TraMineR package: A user’s guide”, Geneva, Switzerland, 2008.

[15] N. Pal, R. Rangaswamy, The Power of One: Gaining Business Value from Personalization Technologies, Trafford Publishing, 2003.

7. References

[1] A. Abbott, “A primer on sequence methods”, Organization Science, JSTOR, 1990, pp.375-392.

[16] B. Halpin, T.W. Chan, “Class careers as sequences: An optimal matching analysis of work-life histories”, European Sociological Review, Oxford Univ Press, 1998, pp.111.

[2] A. Abbott, “Measuring resemblance in sequence data: An optimal matching analysis of musicians' careers”, The American Journal of Sociology, JSTOR, 1990, pp. 144-185.

[17] T. Hastie, R. Tibshirani, J.H. Friedman, The elements of statistical learning: data mining, inference, and prediction. Springer Verlag, 2009.


[18] T. Joachims, D. Freitag, T. Mitchell, “Webwatcher: A tour guide for the world wide web”, International Joint Conference on Artificial Intelligence, Lawrence Erlbaum Associates Ltd, 1997, pp.770-777.

[31] Nielsen, “Social Networking’s New Global Footprint”, Nielsen, http://blog.nielsen.com/nielsenwire/global/socialnetworking-new-global-footprint/#respond, 2009 Mar 9.

[32] P.M. Odell, K.O. Korgen, P. Schumacher, M. Delucchi, “Internet use among female and male college students”, CyberPsychology & Behavior, Mary Ann Liebert, Inc, 2000, pp.855-862.

[19] A. Joshi, R. Krishnapuram, “On mining web access logs”, ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Citeseer, 2000, pp.63-69.

[20] A.M. Kaplan, M. Haenlein, “Users of the world, unite! The challenges and opportunities of Social Media”, Business Horizons, Elsevier, 2010, pp.59-68.

[33] C.W. Phang, A. Kankanhalli, K. Ramakrishnan, K.S. Raman, “Customers’ preference of online store visit strategies: an investigation of demographic variables”, European Journal of Information Systems, Nature Publishing Group, 2010, pp.344-358.

[21] A. Knott, A. Hayes, S.A. Neslin, “Next-product-to-buy models for cross-selling applications”, Journal of Interactive Marketing, Wiley Online Library, 2002, pp.59-75.

[34] M.S. Poole, M.E. Holmes, “Decision Development in Computer-Assisted Group Decision Making”, Human Communication Research, Wiley Online Library, 1995, pp.90-127.

[22] V.I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals”, Soviet Physics Doklady, 1966, pp.707-710.

[35] B. Prentice, “Butt Out IT! Facebook ‘Productivity Loss’ Is No Concern of Yours”, Gartner, http://blogs.gartner.com/brian_prentice/2008/11/23/butt-outit-facebook-productivity-loss-is-no-concern-of-yours/, 2008 Nov 23.

[23] H. Li, C. Kuo, M.G. Russell, “The impact of perceived channel utilities, shopping orientations, and demographics on the consumer's online buying behavior”, Journal of Computer-Mediated Communication, Wiley Online Library, 1999.

[36] A. Prinzie, D. Van den Poel, “Predicting home-appliance acquisition sequences: Markov/Markov for Discrimination and survival analysis for modeling sequential information in NPTB model”, Decision Support Systems, Elsevier, 2007, pp.28-45.

[24] L.A. Lillard, “Simultaneous equations for hazards: Marriage duration and fertility timing”, Journal of Econometrics, Elsevier, 1993, pp.189-217.

[25] B. Mobasher, R. Cooley, J. Srivastava, “Automatic personalization based on Web usage mining”, Communications of the ACM, ACM, 2000, pp.142-151.

[37] R. Sabherwal, D. Robey, “An empirical taxonomy of implementation processes based on sequences of events in information system development”, Organization Science, JSTOR, 1993, pp.548-576.

[26] B. Mobasher, H. Dai, T. Luo, M. Nakagawa, “Discovery and evaluation of aggregate usage profiles for web personalization”, Data Mining and Knowledge Discovery, Springer, 2002, pp.61-82.

[38] B. Sheehan, “An investigation of gender differences in on-line privacy concerns and resultant behaviors”, Journal of Interactive Marketing, Wiley Online Library, 1999, pp.24-38.

[27] W.W. Moe, P.S. Fader, “Dynamic conversion behavior at e-commerce sites”, Management Science, JSTOR, 2004, pp.326-335.

[39] T.F. Smith, M.S. Waterman, “Identification of common molecular subsequences”, J. Mol. Biol., Citeseer, 1981, pp.195-197.

[28] M.D. Mulvenna, S.S. Anand, A.G. Buechner, “Personalization on the Net using Web mining: introduction”, Communications of the ACM, ACM, 2000, pp.122-125.

[40] K. Stovel, M. Savage, P. Bearman, “Ascription into achievement: Models of career systems at Lloyds Bank”, The American Journal of Sociology, 1996, pp.358-399.

[29] S.B. Needleman, C.D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins”, Journal of molecular biology, Elsevier, 1970, pp.443-453.

[41] K. Stovel, M. Bolan, “Residential trajectories: Using optimal alignment to reveal the structure of residential mobility”, Sociological methods & research, Sage Publications, 2004, pp.559-598.

[30] N. Wire, “Led by Facebook, Twitter, Global Time Spent on Social Media Sites up 82% Year over Year”, Nielsen, http://blog.nielsen.com/nielsenwire/global/led-by-facebooktwitter-global-time-spent-on-social-media-sites-up-82-yearover-year/, 2010 Jan 22.

[42] T.S.H. Teo, “Differential effects of occupation on Internet usage”, Internet Research, MCB UP Ltd, 1998, pp.156-165.

[43] C. Van Slyke, C.L. Comunale, F. Belanger, “Gender differences in perceptions of web-based shopping”, Communications of the ACM, ACM, 2002, pp.82-86.


[44] J.H. Ward, “Hierarchical grouping to optimize an objective function”, Journal of the American Statistical Association, JSTOR, 1963, pp.236-244.

[45] E.B. Weiser, “Gender differences in Internet use patterns and Internet application preferences: A two-sample comparison”, CyberPsychology & Behavior, Mary Ann Liebert, Inc, 2000, pp.167-178.

[46] C. Wilson, “Analysis of travel behavior using sequence alignment methods”, Transportation Research Record: Journal of the Transportation Research Board, Trans Res Board, 1998, pp.52-59.

[47] O.R. Zaiane, J. Luo, “Web usage mining for a better web-based learning environment”, Proceedings of conference on advanced technology for education, Citeseer, 2001, pp.60-64.

[48] X. Zhang, iResearch, http://ppt.iresearch.cn/html/105.shtml
