IT 11 030
Degree project, 30 credits, June 2011
Web Recommendation System with Image Retrieval Bin Yan
Department of Information Technology
Faculty of Science and Technology, UTH unit, Box 536, 751 21 Uppsala. http://www.teknat.uu.se/student
Abstract
The amount of information on the Internet has increased dramatically in recent years, causing the problem known as “information overload”, which search engines can solve only partially. Although there is a considerable literature on search engines that addresses information overload, the problem has not yet been overcome because of commercial interests, differences between individual users, and the nature of the search process. Recommendation systems, which are information-filtering systems that can recommend information without the explicit participation of the user, were designed to address these concerns. A recommendation system collects the interests of users to create an independent profile for each user. It then compares the user profile with reference characteristics and recommends information of potential interest to the user. Because they focus on the specific characteristics of each user, recommendation systems compensate for the shortcomings of search engines. Unlike previous work, which focuses on text, this thesis presents an improved recommendation system that considers the information stored in images. Methods of user modeling and user profile representation are analyzed, and on that basis a new design for user profiles combined with methods for content-based image retrieval is presented. In this design, the user profile contains information from the images on web pages to increase the accuracy of recommendation. Algorithms for updating the user model according to user feedback are also introduced, so that the user model can reflect changes in users’ interests. Using a real-world deployment, the thesis shows that the new system achieves better accuracy than existing text-only methods when only a small amount of data is available. Finally, the thesis argues that feature selection in image analysis is the bottleneck for the recommendation system. It appears very hard to improve existing systems significantly without new features and semantic analysis.
Supervisor: Chenxi Zhang
Subject reviewer: Ivan Christoff
Examiner: Anders Jansson
Printed by: Reprocentralen ITC
Table of Contents
1 INTRODUCTION
1.1 Background
1.2 Related Work
1.3 Objectives
1.4 Thesis Overview
2 USER PROFILE
2.1 Classification
2.2 The Data Source
2.3 Representation
3 MULTI DIMENSIONAL USER PROFILE
3.1 Relativity between Text and Image Content in Web Pages
3.1.1 Experimental Design
3.1.2 Experiment Result
3.1.3 Analysis and Conclusion
3.2 Vector Space Model Based on HTML Structure
3.3 Image information extraction and representation
3.3.1 Common Methods of Content Based Image Retrieval (CBIR)
3.3.2 Color Features Extraction and Representation based on Image Block
3.4 User Profile Design
3.4.1 Interest Vector
3.4.2 Similar Interest Judgment
3.4.3 User Modeling Algorithm
3.4.4 Distance between Users
3.4.5 Similar User Clustering
3.4.6 User Feedback
3.4.7 Updating User Profile
4 A HYBRID SYSTEM ‐ COMBINED CONTENT BASED AND COLLABORATIVE RECOMMENDATIONS
4.1 Advantage of Hybrid System
4.2 System Architecture
4.3 System Flow
4.4 Key Technology Used in the System
4.4.1 Chinese word Segmentation
4.4.2 Calculation of Chinese compound word
4.4.3 Capture User Browsing Behavior
4.5 The Brief Introduction of the Prototype
4.5.1 Server Modules
4.5.2 Client Interface
4.6 The Experiment
5 CONCLUSIONS
5.1 Result
5.2 Future Work
REFERENCES
1 Introduction
1.1 Background
According to the China Internet Network Information Center (CNNIC), the number of Internet users in China had reached more than 384 million by December 31, 2009. The total number of domain names in China was more than 16.28 million, and the number of web pages had reached 33.6 billion [1]. Figure 1.1 shows the growth in the number of web pages in China.
Figure 1.1 Trend in the number of web pages in China from 2003 to 2009
Owing to the rapid growth of the Internet, the amount of information available online has increased enormously, and this increase results in the problem called “information overload”, which refers to the difficulty users have in making decisions when faced with too much information [2]. Search engines help Internet users handle information overload only partially. A search engine or information retrieval system has the following defects:
1. The search result is distorted by commercial interests. Currently, search engines rely on advertising to generate revenue. Service providers place advertisements in the search results, which reduces the quality of the results.
2. Search engines ignore the differences between users. Users get the same results if they enter the same keywords, even though their emphases differ.
3. The user dominates the search process. The search engine depends on the user’s input, but the user’s requirements are often unclear. To improve the accuracy of the results, the user has to modify the keywords and search again.
Recommendation systems, which are information-filtering systems that can recommend information without the explicit participation of the user, were designed to address those defects. The system collects user information to create an independent profile for each user. It then compares the user profile with reference features and recommends items to users who have a potential interest in the corresponding topic. However, previous web page recommendation systems are based on the plain text content of web pages rather than on the information contained in images [3], [4]. The goal of this thesis is to design a new user profile that handles the information contained in the images on web pages. The thesis also implements a prototype of the web page recommendation system and measures the performance of the new user profile by experiments.
1.2 Related Work
Recommendation System
Recommendation systems derive from information filtering, a technique that attempts to recommend information items that are likely to be of interest to the user. Typically, a recommendation system includes three key elements: candidate items, users and a recommendation algorithm. Figure 1.2 illustrates the architecture of a recommendation system. When the user’s profile is built, the user can provide interest information explicitly, or the system can collect data implicitly. The recommendation algorithm evaluates the candidate items against the user’s interest profile to make recommendations.
Figure 1.2 Architecture of a recommendation system
Research Situation
Robert Armstrong et al. introduced the first recommendation system, Webwatcher, in 1995. Since then a variety of recommendation systems have been developed, such as those of Amazon, eBay and Taobao. Table 1.1 lists mainstream recommendation systems in research and commerce, classified by area.
Table 1.1 Mainstream recommendation systems
Area          Recommendation System
E-commerce    Amazon.com, eBay, Levis, Ski-europe.com
Web Page      Fab, Foxtrot, ifWeb, MEMOIR, METIOREW, ProfBuilder, QuIC, Quickstep, R2P, Siteseer, SurfLen
Music         CDNOW, CoCoA, Ringo, Music.Yahoo.com
Movie         Netflix.com, Moviefinder.com, MovieLens, Reel.com
News          GroupLens, PHOAKS, P-Tango

In the theoretical literature, recommendation systems have become an independent discipline that draws on areas such as e-commerce, network economics and sociology. In recent years, research on recommendation systems has increased rapidly:
1) ACM established a dedicated conference, the ACM Conference on Recommender Systems;
2) The number of papers on recommendation systems has increased year by year at top conferences on human-computer interaction, data retrieval and machine learning (SIGCHI, KDD, SIGIR, WWW, etc.);
3) Top journals (such as IEEE Transactions on Knowledge and Data Engineering and ACM Transactions on Information Systems) have published several papers on recommendation systems.
Research institutes (and researchers) at the forefront of recommendation systems include New York University (Alexander Tuzhilin), the GroupLens group at the University of Minnesota (Joseph A. Konstan, John Riedl, etc.), the University of Michigan (Paul Resnick), Carnegie Mellon University (Jamie Callan) and Microsoft Research (Ryen W. White). Besides, the University of Michigan has offered a course on recommendation systems since 2006.
Previous Work on Web Page Recommendation System
1. Fab
Marko Balabanović and Yoav Shoham implemented a hybrid content-based, collaborative system in 1997 [7]. As part of a digital library project, Fab aims to help users filter useful information from the huge quantity of information on the Internet. The system combines content-based recommendation and collaborative filtering recommendation to create a hybrid recommendation system. The recommendation process can be divided into two phases: 1. collect information and set up a manageable database; 2. select certain information for a certain user. Fab is composed of three parts: collection agents, selection agents and the central router.
Every agent maintains a profile that contains words from the web pages that have been rated. The profile of a collection agent represents its current topic, whereas a selection agent’s profile represents a single user’s interests. Pages found by the collection agents are sent to the central router, which forwards them to those users whose profiles they match above some threshold. The users are required to assign ratings on a 7-point scale. A user’s ratings are used to update his or her personal selection agent’s profile, and are also forwarded back to the originating collection agents, which use them to adapt their profiles. Additionally, any highly rated pages are passed directly to the user’s nearest neighbors, that is, other people with similar profiles. In conclusion, by combining two recommendation algorithms, Fab can filter large-scale, rapidly changing information and process dynamic feedback.
2. Siteseer
James Rucker and Marcos J. Polanco developed the Siteseer system in 1997 [8]. Siteseer is a web page recommendation system that utilizes user bookmarks (including Hotlist and Favorites) and the organization of bookmarks. Bookmarks represent users’ interests, especially classified bookmarks, which reflect a conscious act by the user. Siteseer compares the bookmarks of different users and calculates the similarity between users from the overlap of URLs in their bookmarks. The system then recommends to each user the URLs that appear in a neighbor’s bookmarks but not in the user’s own.
3. ifWeb
Fabio A. Asnicar and Carlo Tasso introduced the ifWeb system in 1997 [9]. ifWeb is a content-based web page recommendation system. Unlike Fab, ifWeb uses a multi-dimensional description of the web page. It utilizes not only the key words in web pages but also the domain name, the size of the HTML file, the number of images, etc. to represent the content of the web pages.
Analysis of Previous Web Page Recommendation System
Because of a variety of limitations, no web page recommendation system has achieved commercial success. The main defects are listed below:
1. The description of the user profile and of knowledge is one-dimensional. For instance, Syskill & Webert [10] and Fab [4] both use several key words to represent the content of web pages. Accordingly, the matching between candidate items and users is based entirely on text matching methods.
2. The over-fitting problem caused by using only a content-based recommendation algorithm. A purely content-based recommendation algorithm can recommend only items similar to those the user has browsed before. Over-fitting means that the recommendations are either too similar to previous items or not related at all.
3. The cold-start problem, which comprises the new user problem and the new item problem. When new users or new items are added to the system, the system cannot make recommendations until the new users have given enough ratings or the new items have received enough ratings. The sparsity problem means that when the number of users is too small compared with the number of items, some items never receive any feedback. Sparsity can also arise when the difference between any two users is too large.
1.3 Objectives
The objectives of this thesis are to design and specify a new user profile based on both the text and the image content of web pages, and to implement a prototype of a web page recommendation system that uses this user profile. Finally, the thesis evaluates the performance of the new recommendation system by experiments.
1.4 Thesis Overview
Chapter 2 introduces the general ideas and methods of user profiling. Chapter 3 presents a new user profile that represents interest in both the text and the image content of web pages, together with the methods for modeling and updating the user profile. Chapter 4 introduces the architecture of the whole system and its implementation. Future work is discussed in Chapter 5.
2 User Profile
There is no standard definition of a user profile. Yangjun Pei defines a user profile as “a model used for capturing, recording and managing user’s interest” [11]. Zhiwei Guan defines it as “a description of users’ understanding of the outer world and the interaction with the outer world” [12]. Alfred Kobsa defines it as a set of the user’s goals, plans (with which to reach the goals), beliefs and knowledge about a particular domain [13]. In summary, in a recommendation system a user profile is a model that depicts the user’s interests and requirements over a period of time. A user profile should be machine-readable and computable. It is an algorithm-oriented formal description with a particular data structure.
The user profile plays a core role in a recommendation system. It is the key factor in increasing recommendation accuracy and improving the efficiency of the system. A good user profile system can analyze the browsing history of the user and infer the user’s preferences from interaction with the system. Based on these features, the system provides the user with the most suitable recommendations. This chapter introduces some critical techniques for modeling user profiles.
2.1 Classification
1. Manual Modeling by User
Manual modeling means that the user provides preference information manually or selects from predefined options. For example, during registration at my.yahoo.com a new user is required to provide location, interests, etc. (see Figure 2.1).
Figure 2.1 New user registration at my.yahoo.com
2. User Sample Modeling
User sample modeling requires users to provide samples of what they are interested in. The general method is to use the rated items that the user has browsed as samples. For example, Amazon.com requires a new user to select a preset category of goods. The user is then asked to enter some key words for searching. After the search, the user needs to rate items in the search result. The process iterates until the user is satisfied with the search result (see Figure 2.2).
Figure 2.2 Sample modeling at Amazon.com
3. Automatic Modeling by System
Automatic modeling means that the user does not explicitly participate in the modeling process; instead, the system establishes the user profile from the user’s implicit feedback. For example, the Google Chrome browser lists the most frequently visited web sites on its home page. The data is collected implicitly (see Figure 2.3).
Figure 2.3 List of most frequently visited web sites in Chrome
2.2 The Data Source
1. Data from Web Server
The log file on the web server records the URLs of web pages, the time the user browsed them, upload and download behavior, etc.
2. Data from Proxy Server
As an intermediate node between the user and the web server, the proxy server records the user’s behavior when browsing multiple web sites.
3. Data from Client
Data on the client side that represents the user’s interest includes:
1) The user’s browsing history and temporary Internet files. The local history or temporary files record the web sites the user has browsed recently and the time of browsing;
2) The user’s bookmarks. Bookmarks represent the user’s interest, especially classified bookmarks, which reflect a conscious act by the user;
3) Search keywords;
4) The browsing behavior of the user, which includes the time the user stays on each page, keyboard and mouse operations, printing or saving the page, adding bookmarks, etc.;
5) Cookies and forms saved by the browser;
6) Documents downloaded by the user.
Among the data sources mentioned above, search keywords represent only the current interest, not the long-term interest. Cookies are difficult to interpret without particular knowledge of the web server. Generally, users add only pages they are interested in to their bookmarks, so bookmarks represent the user’s interest well. However, bookmarks are few compared with history files or temporary files, because users do not add every interesting page to their bookmarks. Modeling based on bookmarks therefore cannot reflect the overall interest of the user. In contrast to bookmarks, the history file represents the user’s interest better. The browsing history is saved implicitly by the browser, so the system can establish the user profile without the explicit participation of the user. This data source also has drawbacks. Not all web pages in the history are of interest to the user. For example, a hyperlink may not describe its target page well, and the user may find the page uninteresting after opening the link. The history files therefore contain some noise, which the system should filter out when using them as a data source. Browsing behavior can also reflect the user’s interest. When a user stays on a page for a relatively long time, it can be inferred that the user may be interested in the page.
To represent the user’s interest, browsing behavior should be used together with the page being browsed. The log on a web server or proxy server records not only the pages the user has browsed but also the various behaviors during browsing. The log on a proxy server records all the web sites the user has browsed, so it can represent the user’s interest completely. A web server, on the other hand, records only visits to its own site and knows nothing about other sites, so web server logs are suitable only for user modeling on that particular site. Documents downloaded or saved by the user can also represent his or her interest. Normally, a user downloads and saves only documents he or she is interested in. Besides, downloaded files are usually classified to make them easier to manage and access.
The information collected from those classified files can reflect the topics the user is concerned with. To sum up, the server log, the browsing history and the browsing behavior represent the user’s interest most completely. Bookmarks may not reflect the overall interest but still represent the user’s concerns.
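The idea of filtering noise out of the browsing history by using dwell time can be illustrated with a minimal sketch. The `Visit` record, the `filter_history` function and the 10-second threshold are hypothetical and chosen only for illustration; the thesis does not specify them at this point.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    url: str
    dwell_seconds: float  # time the user stayed on the page


def filter_history(visits, min_dwell_seconds=10.0):
    """Keep only visits whose dwell time suggests real interest.

    The threshold is an assumed value for illustration, not one taken from the thesis.
    """
    return [v for v in visits if v.dwell_seconds >= min_dwell_seconds]


history = [
    Visit("http://example.com/football", 95.0),
    Visit("http://example.com/misleading-link", 2.5),  # opened, then closed quickly
    Visit("http://example.com/news", 40.0),
]
kept = filter_history(history)
```

In this sketch a page that the user left after a few seconds is treated as interference and is excluded before the profile is built; a real system would probably combine dwell time with the other behaviors listed above.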
2.3 Representation
According to [14], [15], [16], traditional representation methods include:
1. Topic words representation method
This method uses topic words to represent the user’s interest. A topic word represents a particular domain, for example “Sports” or “News”. This method is usually used together with manual modeling. For instance, my.yahoo.com records the selection of preset options such as “Sports”, “Technology” and “Finance”. Then the information […]
2. List of key words method
This method uses a list of key words to represent the user’s interest. For example, if a user is interested in football, the user profile may look like {football, World Cup, Messi, UEFA Champions League}. The key words can be specified by the user or learned by the system. A typical recommendation system that uses a key word list is Webwatcher. Webwatcher requires the user to enter key words of interest first, and then recommends web pages to the user during browsing.
3. Methods based on Vector Space Model
This method uses vectors in the vector space of key words to represent the user’s interest. The vector space model is a common method for representing documents. Every document $d$ can be represented as $d = \{(t_1, w_1), (t_2, w_2), \ldots, (t_n, w_n)\}$, in which $t_i$ is an item (a word or phrase) and $w_i$ is the weight of $t_i$ in $d$. When the method is used to represent a user profile, $t_1, t_2, \ldots, t_n$ represent the items the user is interested in, and $w_1, w_2, \ldots, w_n$ represent the degree of interest in each item.
4. Bookmark Method
Users add web pages they are interested in to their bookmarks (including Hotlist and Favorites) in order to visit them again. Systems that use this method include Siteseer, Open Bookmark and online bookmark services.
5. Methods based on User-item Matrix
The user-item matrix method uses a matrix $R_{m \times n}$ to represent user profiles [17]. In the matrix, $m$ is the number of users and $n$ is the number of items. Every element $r_{ij}$ of the matrix represents the rating given by user $i$ (the row) to item $j$ (the column). Generally $r_{ij}$ is an integer, for example from 1 to 5. An empty value means that the user has not yet rated the item. This method suits systems based on collaborative filtering.
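Two of the representations listed above, the key word vector and the user-item matrix, can be written down as plain data structures. The following is a minimal sketch; the concrete weights, ratings and helper names (`top_interests`, `rated_items`) are made up for illustration and are not taken from the thesis.

```python
# Vector space profile: item (key word) -> degree of interest.
# The weights are illustrative values only.
profile = {"football": 0.9, "World Cup": 0.7, "Messi": 0.5}

# User-item matrix: one row per user, one column per item.
# None stands for the empty value (the user has not rated the item yet).
ratings = [
    [5, None, 3, None],
    [None, 4, None, None],
    [1, None, None, 2],
]


def top_interests(profile, k):
    """Return the k items with the highest degree of interest."""
    ranked = sorted(profile.items(), key=lambda kv: -kv[1])
    return [item for item, _ in ranked[:k]]


def rated_items(ratings, user):
    """Return the column indices of the items that the given user has rated."""
    return [j for j, r in enumerate(ratings[user]) if r is not None]
```

For example, `top_interests(profile, 2)` selects the two strongest interests, and `rated_items(ratings, 1)` shows that the second user has rated only one item, which is the kind of row a collaborative filtering system has to cope with.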
None of the methods mentioned above considers the multimedia content of the web page. This thesis combines user profile modeling with image retrieval in order to use the information contained in images to refine the representation of users’ interests.
3 Multi Dimensional User Profile
3.1 Relativity between Text and Image Content in Web Pages
Images convey information through visual language, which is more direct and has a larger information capacity than text. Besides, images are more accurate than text, and it is harder to tamper with or distort them. Thus most web pages on the Internet use images to better represent their content. However, not all images in a web page are related to the main topic of the page. Many images are advertisements, UI elements or logos. The first part of this chapter introduces an experiment that measures the relevance between the text and the image content of web pages.
3.1.1 Experimental Design
The experiment analyzes the relevance between images and text by automatically scanning 244 web pages. The key points of the experiment are:
1) The selection of web pages
The experiment uses two groups of web pages. The first group consists of 100 web pages selected randomly from the Internet. The second group consists of 144 web pages selected from the data source THE SYSKILL AND WEBERT WEB PAGE RATINGS [33].
2) Image analysis
The experiment analyzes the HTML files of the selected web pages. The number of <img> tags is counted to obtain the number of images. The src, alt and title attributes of the <img> tag represent the image content from the point of view of the web page’s author. The src attribute sets the source file of the image; the alt attribute specifies alternative text (alt text) that is rendered when the element to which it is applied cannot be rendered; the title attribute sets the title of the image.
3) Relativity judgment
The words in the web page are sorted by frequency. The 5 most frequent words are compared with the words in the src, alt and title attributes of each <img> tag. If a matching word is found, the image is considered relevant to the topic of the web page.
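The relativity judgment described above can be sketched with Python's standard `html.parser` module. This is an illustrative reading of the procedure, not the code used in the experiment: the class and function names are hypothetical, words are split on ASCII letters only, and no stop-word filtering is applied because the text does not mention any.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collects the visible text and the src/alt/title attributes of each image."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.image_descriptions = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            a = dict(attrs)
            parts = [a.get(key) or "" for key in ("src", "alt", "title")]
            self.image_descriptions.append(" ".join(parts))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_parts.append(data)


def split_words(text):
    return re.findall(r"[a-z]+", text.lower())


def relevant_images(html, top_n=5):
    """Return one boolean per image: does it share a word with the top_n page words?"""
    scanner = PageScanner()
    scanner.feed(html)
    counts = Counter(split_words(" ".join(scanner.text_parts)))
    top_words = {word for word, _ in counts.most_common(top_n)}
    return [bool(top_words & set(split_words(d))) for d in scanner.image_descriptions]


page = """<html><head><title>Football news</title></head><body>
<p>Football football football match match report</p>
<img src="/img/football-final.jpg" alt="football final">
<img src="/ads/banner.gif" alt="buy now">
</body></html>"""
flags = relevant_images(page)
```

In this example the first image shares the word "football" with the most frequent page words and is judged relevant, while the advertisement banner is not.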
3.1.2 Experiment Result
In the first group (randomly selected pages), 99% of the web pages have images, and more than 78% of the pages have at least one image relevant to the page topic. Among all the images, 14.7% were relevant to the page topic. See Figure 3.1.
Figure 3.1 Ratio of relevant images in randomly collected pages (legend: pages with only irrelevant images; pages with relevant images)
In the data source [33], only 34.2% of the pages have images. Among all the images, 44.8% are relevant to the page topic. See Figure 3.2. Among all the pages, 17% have images relevant to the page topic.
Figure 3.2 Ratio of relevant images in data source [33] (legend: pages with no images; pages with relevant images; pages with only irrelevant images)
3.1.3 Analysis and Conclusion
The result shows that although most images are not relevant to the page’s topic, most pages have at least one image relevant to their topic. In fact, many <img> tags have no alt or title attribute, which lowers the measured ratio, so the actual ratio is likely to be higher than that found in the experiment. Previous work on web page recommendation systems did not consider the information contained in images. The last part of this chapter introduces a more accurate user profile that incorporates information from image content.
3.2 Vector Space Model Based on HTML Structure
As mentioned in Section 2.3, the vector space model is a common method for representing user profiles. Because this method is easy to compute and operate on, this thesis uses the vector space model to represent the part of the user profile extracted from text content.
The vector space model uses a vector such as $\{(t_1, w_1), (t_2, w_2), \ldots, (t_n, w_n)\}$ to represent the user’s interest. In the vector, $t_i$ is an item, that is, a word depicting the user’s interest, and $w_i$ is its weight. The web pages that interest the user may contain many words that do little to represent the user’s interest. Besides, as the user description file grows, memory and computation costs also increase. Thus how to select items and their weights is the central problem of the vector space model. In this thesis, the vector space model is based on HTML structure and word frequency. Words that occur more frequently have higher weight. Besides, words in different tags of the HTML file differ in their importance for representing the topic of the page. Generally, words in the title are more closely related to the topic than words in other parts of the page. Considering the HTML structure, words in different tags should be assigned different weights; see Table 3.1.

Table 3.1 Tags that affect item weights
Tag               Specification           Weight
<title>           Title of the page       10
<h1>, <h2>, …     Headings of the page    6 […], in which […] is the level of the heading
[…]               Text body               1
[…], […]          Emphasis                1

The weight of a key word $t$ is computed as
$$w(t) = \sum_{k} \delta_k(t)\, \lambda_k\, f_k(t) \tag{3.1}$$
in which $k$ denotes the type of tag, $f_k(t)$ is the number of times key word $t$ occurs in tag $k$, and $\lambda_k$ is the corresponding weight in Table 3.1. The function $\delta_k(t)$ is
$$\delta_k(t) = \begin{cases} 1, & t \text{ in tag } k \\ 0, & t \text{ not in tag } k \end{cases} \tag{3.2}$$
Now key word $t$ and its weight can be calculated. After the weights of all the words in the page have been calculated, the vector $V$ is obtained by assembling the $n$ words with the highest weights:
$$V = \{(t_1, w_1), (t_2, w_2), (t_3, w_3), \ldots, (t_n, w_n)\} \tag{3.3}$$
where $w_i = w(t_i)$.
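The weighting scheme of equation (3.1) and the selection of the top-weighted words in equation (3.3) can be sketched as follows. Several details are assumptions made only for this illustration: headings receive a flat weight of 6 regardless of level, the emphasis tags are taken to be `<b>`, `<strong>` and `<em>`, all text outside `<title>` counts as text body, and a word nested in several tag types is counted once for each of them. The names `TAG_WEIGHTS`, `TagCounter` and `term_vector` are hypothetical.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumed tag weights (title 10, heading 6, text body 1, emphasis 1).
TAG_WEIGHTS = {"title": 10, "heading": 6, "body": 1, "emphasis": 1}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
EMPHASIS = {"b", "strong", "em"}  # assumed set of emphasis tags


class TagCounter(HTMLParser):
    """Counts, for each tag type k, how often each word occurs inside it."""

    def __init__(self):
        super().__init__()
        self.depth = defaultdict(int)  # tag type -> current nesting depth
        self.counts = defaultdict(lambda: defaultdict(int))  # type -> word -> count
        self._skip = 0  # > 0 while inside <script> or <style>

    @staticmethod
    def tag_type(tag):
        if tag == "title":
            return "title"
        if tag in HEADINGS:
            return "heading"
        if tag in EMPHASIS:
            return "emphasis"
        return None

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        kind = self.tag_type(tag)
        if kind:
            self.depth[kind] += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip > 0:
            self._skip -= 1
        kind = self.tag_type(tag)
        if kind and self.depth[kind] > 0:
            self.depth[kind] -= 1

    def handle_data(self, data):
        if self._skip:
            return
        kinds = [k for k, d in self.depth.items() if d > 0]
        if "title" not in kinds:
            kinds.append("body")  # everything outside <title> is text body
        for word in re.findall(r"[a-z]+", data.lower()):
            for kind in kinds:
                self.counts[kind][word] += 1


def term_vector(html, n=3):
    """Return the n (word, weight) pairs with the highest weights."""
    parser = TagCounter()
    parser.feed(html)
    weights = defaultdict(int)
    for kind, words in parser.counts.items():
        for word, count in words.items():
            # A word that never occurs in a tag type contributes nothing,
            # which plays the role of the indicator in equation (3.2).
            weights[word] += TAG_WEIGHTS[kind] * count
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


page = ("<html><head><title>Football</title></head><body><h1>Football news</h1>"
        "<p>The match was <b>great</b>. Football match.</p></body></html>")
vector = term_vector(page)
```

With these assumed weights, "football" scores 10 (title) + 6 (heading) + 2 (two occurrences in the body) = 18. Ties are broken alphabetically, which is also a choice made for this sketch rather than something stated in the text.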
3.3 Image information extraction and representation
3.3.1 Common Methods of Content Based Image Retrieval (CBIR)
An image retrieval system is a system for browsing, searching and retrieving images from a large database of digital images. Traditional methods rely on text descriptions of images, but this approach has serious disadvantages. First, images are generally rich in detail and extended meaning, which is difficult to describe with a few keywords or a simple comment. Second, different people understand the same image differently, which makes it difficult to use text labels to respond to user queries accurately. Third, text annotation of images can only be done by hand, which is feasible only when the number of images is small; when the number of images grows quickly, manual annotation becomes very difficult. Content-based image retrieval provides a good solution to these problems by extracting characteristics from the image itself. The features extracted from images are objective and comprehensive. Besides, the entire process can be done automatically, which increases the speed and accuracy of retrieval. The common methods of CBIR include:
1. Retrieval based on color
Compared with shape, color is invariant to rotation and scale [18]. The basic idea of color-based retrieval is to use features of the color distribution. The color histogram is a common method of color-based retrieval. This method represents the color distribution by a histogram and takes the distance between two images to be the distance between their color histograms. Image retrieval therefore becomes color histogram matching. The method has a disadvantage: because the color histogram does not retain any spatial information, the search results are not accurate. The color histograms of two completely different images may be very similar.
2. Retrieval based on texture
Texture is an important feature of objects that reflects changes in surface color and grayscale. Retrieval based on texture can be divided into three categories: statistical methods, structural methods and spectral methods. Statistical methods identify numerical characteristics of the images, such as the Fourier spectrum [23], the co-occurrence matrix [24] and Markov random field models [25-26]. Structural approaches assume that a texture pattern consists of texture primitives arranged according to certain rules, so they are suitable only for regular images.
3. Retrieval Based on Shape Feature
Shape is one of the essential characteristics of objects, and it is also the first thing people learn about objects. The main problem of shape-based image retrieval is segmenting objects from the image with an appropriate image segmentation method. The key is to find shape characteristics consistent with human visual perception. Traditional shape-based retrieval relies on a shape feature vector composed of shape features. Classic shape descriptions in image analysis include Fourier descriptors, moment invariants, various simple shape factors (size, roundness, eccentricity, etc.), major axis orientation and so on.
3.3.2 Color Features Extraction and Representation based on Image Block
Considering the computational complexity, this thesis selects color features to represent images. To reduce the error, it also uses image block division to represent shape features to some extent.
1. Color Feature Extraction and Representation
The color histogram is widely used in image retrieval systems to represent color features. It captures the proportion of the different colors in the whole image but ignores the spatial location of each color, so it cannot describe the objects in an image. The color histogram is particularly suitable for describing images in which objects are difficult to segment automatically.
The color histogram can be based on different color spaces and coordinate systems. The most common color space is RGB, because most digital images are expressed in it. However, the structure of the RGB space is not consistent with people's subjective judgment of color similarity. Therefore, the RGB color space is often converted into the HSV color space.
The HSV (hue, saturation, value) color space model corresponds to a conical subset of the cylindrical coordinate system. The top surface of the cone corresponds to V = 1. It contains the three faces R = 1, G = 1 and B = 1 of the RGB model and represents the bright colors. The hue H is given by the angle of rotation around the V axis: red corresponds to 0°, green to 120° and blue to 240°. In the HSV color model, each color differs from its complementary color by 180°. The saturation S takes values from 0 to 1, so the top surface of the cone has a radius of 1. In the HSV color model, a color with one hundred percent saturation generally has a purity of less than one hundred percent. At the cone vertex (i.e. the origin), V = 0, H and S are not defined, and the point represents black.
At the center of the top surface of the cone, S = 0, V = 1 and H is not defined; this point represents white. The axis from this point to the origin represents grays of decreasing brightness, i.e. the grayscale. For these points, S = 0 and H is not defined. In other words, the V axis of the HSV model corresponds to the main diagonal of the RGB color space. The colors on the circumference of the top surface of the cone, with V = 1 and S = 1, are the pure colors. The HSV color model corresponds to the way a painter mixes pigments. Starting from one pure color, painters obtain different colors by changing its tint and shade: mixing the pure color with white changes the tint (lowers the saturation), and adding black changes the shade (lowers the value). Adding different proportions of white and black produces a variety of colors. The HSV color space can be illustrated by a conical space model, as shown in Figure 3.3.
Figure 3.3 HSV color space
The formula for converting the RGB space into the HSV space is

$$
\begin{aligned}
v &= \max(r, g, b) \\
s &= \frac{v - \min(r, g, b)}{v} \\
h &= \begin{cases}
5 + b' & \text{if } r = \max(r, g, b) \text{ and } g = \min(r, g, b) \\
1 - g' & \text{if } r = \max(r, g, b) \text{ and } g \neq \min(r, g, b) \\
1 + r' & \text{if } g = \max(r, g, b) \text{ and } b = \min(r, g, b) \\
3 - b' & \text{if } g = \max(r, g, b) \text{ and } b \neq \min(r, g, b) \\
3 + g' & \text{if } b = \max(r, g, b) \text{ and } r = \min(r, g, b) \\
5 - r' & \text{otherwise}
\end{cases} \\
r' &= \frac{v - r}{v - \min(r, g, b)} \\
g' &= \frac{v - g}{v - \min(r, g, b)} \\
b' &= \frac{v - b}{v - \min(r, g, b)}
\end{aligned}
\tag{3.4}
$$

In the equation, r, g, b ∈ [0, 1], h ∈ [0, 6] and s, v ∈ [0, 1]. The RGB values should be normalized before the calculation. To calculate the color histogram, the color space is divided into several small color ranges, each of which is one bin of the histogram. This process is called color quantization. The histogram is then obtained by counting the number of pixels that fall within each bin. This thesis calculates a three‐channel color histogram for H, S and V. The hue histogram is divided into 180 bins, which means that adjacent hue angles are merged into one bin.
The saturation and brightness (value) histograms have 256 bins each, which are obtained by multiplying the original S and V by 255. The histograms are thus obtained by computing H, S and V for each pixel and then counting the number of pixels that fall within each bin. Expressed as vectors:

$$
\begin{aligned}
hist_H &= \{h_1, h_2, \ldots, h_{180}\} \\
hist_S &= \{s_0, s_1, \ldots, s_{255}\} \\
hist_V &= \{v_0, v_1, \ldots, v_{255}\}
\end{aligned}
\tag{3.5}
$$
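The conversion (3.4) and the histograms (3.5) can be sketched in a few lines of Python. This is a minimal sketch and not the thesis implementation: it assumes 8-bit RGB input, uses 0-based bin indices, and returns 0 as a placeholder where H or S is undefined (gray and black pixels). The function names are chosen for this illustration.

```python
def rgb_to_hsv(r, g, b):
    """Convert normalized r, g, b in [0, 1] to (h, s, v) with h in [0, 6],
    following the case distinction of equation (3.4)."""
    v = max(r, g, b)
    low = min(r, g, b)
    if v == 0:
        return 0.0, 0.0, 0.0      # black: H and S undefined, 0 is a placeholder
    s = (v - low) / v
    if v == low:
        return 0.0, 0.0, v        # gray: H undefined, 0 is a placeholder
    r1 = (v - r) / (v - low)
    g1 = (v - g) / (v - low)
    b1 = (v - b) / (v - low)
    if r == v:
        h = 5 + b1 if g == low else 1 - g1
    elif g == v:
        h = 1 + r1 if b == low else 3 - b1
    elif r == low:
        h = 3 + g1
    else:
        h = 5 - r1
    return h, s, v


def hsv_histograms(pixels):
    """Count pixels per bin: 180 hue bins (2 degrees each), 256 bins for S and V.
    `pixels` is an iterable of 8-bit (R, G, B) tuples (an assumption of this sketch)."""
    hist_h, hist_s, hist_v = [0] * 180, [0] * 256, [0] * 256
    for red, green, blue in pixels:
        h, s, v = rgb_to_hsv(red / 255.0, green / 255.0, blue / 255.0)
        hist_h[int(h * 30) % 180] += 1      # h * 60 degrees, two degrees per bin
        hist_s[int(round(s * 255))] += 1
        hist_v[int(round(v * 255))] += 1
    return hist_h, hist_s, hist_v


sample = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
hist_h, hist_s, hist_v = hsv_histograms(sample)
```

For the four sample pixels, red and white both land in hue bin 0 (white only because of the placeholder), which illustrates why S and V are kept as separate channels.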
In equation (3.5), h_i, s_j and v_k are the numbers of pixels that fall into the corresponding bins.
2. Image Block
The disadvantage of representing an image by color features alone is that two completely different images may have a similar color distribution. For example, the two different images in Figure 3.4 have 9 pixels each, yet they have the same color distribution.
Figure 3.4 different images have the same color distribution
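The effect can be reproduced with two toy 3 × 3 images. The pixel layouts below are hypothetical and are not the images of Figure 3.4; they only show that a histogram ignores where the colors are located.

```python
from collections import Counter

# Two hypothetical 3 x 3 images: the same nine pixels, arranged differently.
image_a = [["R", "R", "R"],
           ["G", "G", "G"],
           ["B", "B", "B"]]
image_b = [["R", "G", "B"],
           ["G", "B", "R"],
           ["B", "R", "G"]]


def color_histogram(image):
    """Count how often each color occurs, discarding pixel positions."""
    return Counter(pixel for row in image for pixel in row)


same_distribution = color_histogram(image_a) == color_histogram(image_b)
```

Here `same_distribution` is true although the two images differ in every row.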
Dividing the image into blocks avoids this problem to a certain extent. The method splits the whole image into small pieces, extracts the color features from each piece separately, and then matches the color features of corresponding blocks. In this way the problem is confined to individual blocks, which prevents the error over the whole image from becoming too large. The number of blocks is a balance between the computational complexity and the desired effect. The problem can also be reduced by the text description in the web page. Taking the computational complexity into account, this thesis segments an image into 4 × 4 sub‐blocks, so each image is represented by three 16‐dimensional image feature vectors, see equations (3.6), (3.7) and (3.8):

$$I_H = \{H_1, H_2, \ldots, H_{16}\} \tag{3.6}$$

$$I_S = \{S_1, S_2, \ldots, S_{16}\} \tag{3.7}$$

$$I_V = \{V_1, V_2, \ldots, V_{16}\} \tag{3.8}$$
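The block division can be sketched as follows for a single channel. The sketch assumes that the channel is given as a list of rows of already quantized bin indices and that blocks are taken in row-major order; both are assumptions of this illustration, as are the function names.

```python
def split_into_blocks(channel, n=4):
    """Split a 2-D list (rows of values) into n x n blocks in row-major order.
    Each block is returned as a flat list of its values."""
    height, width = len(channel), len(channel[0])
    blocks = []
    for bi in range(n):
        top, bottom = bi * height // n, (bi + 1) * height // n
        for bj in range(n):
            left, right = bj * width // n, (bj + 1) * width // n
            blocks.append([channel[y][x]
                           for y in range(top, bottom)
                           for x in range(left, right)])
    return blocks


def block_histograms(channel, bins, n=4):
    """One histogram per block: a 16-element feature vector for n = 4,
    in the spirit of equations (3.6) to (3.8)."""
    feature = []
    for block in split_into_blocks(channel, n):
        hist = [0] * bins
        for value in block:
            hist[value] += 1
        feature.append(hist)
    return feature


# Toy 8 x 8 hue channel whose entries are already bin indices in 0..179.
hue = [[(x + y) % 180 for x in range(8)] for y in range(8)]
I_H = block_histograms(hue, bins=180)
```

Applying the same function to the saturation and value channels with `bins=256` would give the other two feature vectors.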
3.4 User Profile Design
3.4.1 Interest Vector
The vector T in equation (3.9) describes one topic the user is interested in. It is composed of a key word‐weight pair and the image feature vectors. The key word represents the topic, and the image feature vectors represent the image related to it. If no image related to the topic exists, the image feature vector part is set to a null value.

$$T = \{\text{key word}, \text{weight}, I_H, I_S, I_V\} \tag{3.9}$$

The final user profile is composed of several interest vectors.
3.4.2 Similar Interest Judgment
An interest vector is composed of two parts: the key word and weight extracted from the text content, and the image feature vectors extracted from the image content related to the topic. The distance between two interest vectors is therefore composed of the distance between the key words and the distance between the image feature vectors.
1. Distance between key words with weight
Considering the computational complexity, this thesis does not perform semantic analysis of the key words. Identical key words are considered to have distance 0 and different key words distance ∞. Thus, when similar interest vectors are merged, only vectors with the same topic are merged.
2. Distance between color histograms
If the heights of the histogram bins are regarded as the distribution of a discrete random variable, the distance between two color histograms can be represented by the correlation coefficient of the two random variables. The benefit of using the correlation coefficient is its ability to handle negative correlation, which means that images that are similar except for having complementary colors can also be recognized. It is calculated as follows:
$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}} \tag{3.10}$$
If the two histograms are positively correlated, ρ is positive; when they are perfectly positively correlated, ρ = 1. If the two histograms are negatively correlated, ρ is negative, and ρ = −1 when they are perfectly negatively correlated. The closer the absolute value of the correlation coefficient is to 1, the more closely the two histograms are related and the higher the similarity of the images. The closer it is to 0, the less closely the two histograms are related and the lower the similarity of the images.
3. Distance between image feature vectors
The distance between the color histograms of each pair of corresponding blocks can now be calculated. When the distance between image feature vectors is calculated, all blocks are considered equally important. The distance between image feature vectors is the mean of the distances between corresponding blocks, where ρ_ij denotes the correlation coefficient (3.10) of the channel‐j histograms of block i in the two images:

$$D_j = \frac{1}{16}\sum_{i=1}^{16} \rho_{ij}, \quad j \in \{H, S, V\}; \qquad D = \frac{1}{3}\left(D_H + D_S + D_V\right) \tag{3.11}$$
4. Distance between interest vectors
The distance F between two interest vectors is defined as:

$$F(T_i, T_j) = \begin{cases} \infty, & \text{different key words} \\ D, & \text{same key words} \end{cases} \tag{3.12}$$

If the distance is less than the threshold S, the two vectors are similar.
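The four steps above can be sketched together in Python. The data layout (an interest vector as a dictionary with the keys `keyword` and `image`, and image features as a dictionary with one list of block histograms per channel) is an assumption of this sketch. So is returning 0 for a histogram without variance, for which equation (3.10) is undefined. The sketch computes D exactly as the mean correlation of equation (3.11), so values near 1 indicate similar images.

```python
import math


def correlation(x, y):
    """Correlation coefficient of two equal-length histograms, equation (3.10)."""
    n = len(x)
    ex, ey = sum(x) / n, sum(y) / n
    exy = sum(a * b for a, b in zip(x, y)) / n
    var_x = sum(a * a for a in x) / n - ex * ex
    var_y = sum(b * b for b in y) / n - ey * ey
    if var_x <= 0 or var_y <= 0:
        return 0.0  # assumption: a flat histogram carries no correlation information
    return (exy - ex * ey) / math.sqrt(var_x * var_y)


def channel_distance(blocks_a, blocks_b):
    """D_j of equation (3.11): mean over corresponding blocks."""
    pairs = list(zip(blocks_a, blocks_b))
    return sum(correlation(a, b) for a, b in pairs) / len(pairs)


def image_distance(features_a, features_b):
    """D of equation (3.11): mean of D_H, D_S and D_V."""
    return sum(channel_distance(features_a[c], features_b[c]) for c in "HSV") / 3


def interest_distance(t_a, t_b):
    """F of equation (3.12): infinity for different key words, D otherwise."""
    if t_a["keyword"] != t_b["keyword"]:
        return math.inf
    return image_distance(t_a["image"], t_b["image"])


# Hypothetical interest vectors with 16 identical toy block histograms per channel.
toy_features = {c: [[1, 2, 3]] * 16 for c in "HSV"}
t1 = {"keyword": "camera", "image": toy_features}
t2 = {"keyword": "camera", "image": toy_features}
t3 = {"keyword": "lens", "image": toy_features}
```

For identical image features `interest_distance(t1, t2)` evaluates to approximately 1, while `interest_distance(t1, t3)` is infinite because the key words differ.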
3.4.3 User Modeling Algorithm Algorithm 3.1 gives the user modeling algorithm.
3.4.4 Distance between Users
In this thesis, the distance between users is calculated from the user profiles only; the user ratings are used for updating the user profile. The distance is calculated as follows:
U1 and U2 are two different user profiles, each composed of several interest vectors. The distance between them is calculated in the following steps:
1. Digitalize the vectors
Assume that user profile U1 contains the key words t1, ..., tm, x1, ..., xk and that user profile U2 contains the key words x1, ..., xk, v1, ..., vn, where x1, ..., xk are the common (same or similar, see 3.4.2) interest vectors contained in both user profiles. The two user profiles can be converted into two equal‐length numeric vectors as illustrated in Table 3.2. Key words that occur in only one profile are replaced by their corresponding weights. To reflect the difference between the two user profiles on the same or similar key words, these key words are replaced by the corresponding weight combined with the distance between the two interest vectors. If a key word does not exist in one of the user profiles, the corresponding position is set to 0.

Table 3.2 Digitalized user profiles
User profile | t1 … tm | x1 … xk | v1 … vn
U1 | weights of t1 … tm in U1 | weights of x1 … xk in U1, each combined with the interest vector distance (coefficient 0.5) | 0 … 0
U2 | 0 … 0 | weights of x1 … xk in U2, each combined with the interest vector distance (coefficient 0.5) | weights of v1 … vn in U2
2. Calculate the Cosine Distance
The cosine distance between the user profiles U1 and U2 can be calculated after digitalization:

$$L = \cos(U_1, U_2) = \frac{\sum_{i=1}^{K} U_{1,i}\,U_{2,i}}{\sqrt{\sum_{i=1}^{K} U_{1,i}^{2}}\,\sqrt{\sum_{i=1}^{K} U_{2,i}^{2}}} \tag{3.13}$$
In the formula, K is the number of interest vectors in the user profile. In this thesis it is 20.
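A minimal sketch of the digitalization and of equation (3.13) is given below. It makes a simplifying assumption: shared key words keep their raw weights, so the adjustment by the interest vector distance described for Table 3.2 is left out. Profiles are modeled as dictionaries from key word to weight, and the sample profiles are hypothetical.

```python
import math


def digitalize(profile_a, profile_b):
    """Turn two key word -> weight dictionaries into equal-length vectors.
    A key word missing from a profile contributes 0 at its position."""
    keywords = sorted(set(profile_a) | set(profile_b))
    vec_a = [profile_a.get(k, 0.0) for k in keywords]
    vec_b = [profile_b.get(k, 0.0) for k in keywords]
    return vec_a, vec_b


def cosine(vec_a, vec_b):
    """Cosine of the angle between two vectors, equation (3.13)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty profile is unrelated to every profile
    return dot / (norm_a * norm_b)


# Hypothetical profiles sharing the key word "lens".
u1 = {"camera": 0.6, "lens": 0.3}
u2 = {"lens": 0.4, "tripod": 0.5}
L = cosine(*digitalize(u1, u2))
```

Profiles without a common key word give L = 0, and profiles with proportional weights give L = 1.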
3.4.5 Similar User Clustering
This thesis uses an improved K‐means algorithm to cluster similar users.
Algorithm 3.2
Input: user profile set U
Output: user profile clusters
1. Randomly select log K user profiles as the initial cluster centers;
2. For k from log K to K
3. Calculate the distance between the other profiles and the cluster centers, and assign each user profile to the nearest cluster;
4. Select the profile farthest from each cluster center as the center of a new cluster;
5. k = 2 * k;
6. End
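One possible reading of the clustering listing above is sketched below. Several points are assumptions of this sketch rather than statements of the thesis: "log K" is taken as the base 2 logarithm, the loop stops once K centers exist, centers are existing profiles and are not recomputed as means, and the distance function is passed in by the caller.

```python
import math
import random


def cluster_profiles(profiles, K, distance, seed=0):
    """Return, for each profile, the index of the profile acting as its cluster center."""
    rng = random.Random(seed)
    k = max(1, int(math.log2(K)))              # assumption: "log K" is log base 2
    centers = rng.sample(range(len(profiles)), min(k, len(profiles)))
    while True:
        # Step 3: assign every profile to its nearest center.
        assignment = [min(centers, key=lambda c: distance(p, profiles[c]))
                      for p in profiles]
        if len(centers) >= K:
            break
        # Step 4: the profile farthest from each center seeds a new cluster.
        new_centers = []
        for c in centers:
            members = [i for i, a in enumerate(assignment)
                       if a == c and i not in centers]
            if members and len(centers) + len(new_centers) < K:
                new_centers.append(
                    max(members, key=lambda i: distance(profiles[i], profiles[c])))
        if not new_centers:
            break
        centers.extend(new_centers)            # Step 5: the center count roughly doubles
    return assignment


# Toy one-dimensional "profiles" forming two well separated groups.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
labels = cluster_profiles(points, K=2, distance=lambda a, b: abs(a - b))
```

With K = 2 the sketch starts from a single random center, adds the farthest point as the second center and then separates the two groups.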
3.4.6 User Feedback
This thesis considers two aspects of user feedback rating:
1. Explicit rating by the user
After the user clicks a link in the recommendation result list to access the page, the system asks the user to assign a rating on a 5‐point scale.
2. User browsing behavior
The user's browsing behavior can imply the user's preferences. For example, while browsing, the user may add a bookmark, download, save, copy or perform other operations. The browsing behavior can be seen as a kind of feedback on the recommendation result. This thesis considers that different behaviors represent different degrees of user interest, ordered as: Add Bookmark > Download > Save > Copy. Each browsing behavior is assigned a weight, so the page behavior vector can be represented as {weight of adding a bookmark, weight of downloading, weight of saving the page, weight of copying}. The weights are set to 4, 3, 2 and 1 respectively. Each weight has only two possible values, either 0 or the specified value. For example, if a user bookmarked a page and copied some content, but did not download or save it, the user behavior vector is {4, 0, 0, 1}. The score of the user's browsing behavior is the sum of the operation weights plus 1:

$$\text{score} = (\text{sum of the four behavior weights}) + 1 \tag{3.14}$$

The constant 1 is added at the end of the equation because, once the user is reading a page, the absence of all four browsing operations does not mean that the user is completely uninterested in the page.
Finally, the feedback is based mainly on the user's explicit rating. If the user has not given any rating, the score is derived from the user's browsing behavior. Both situations give a feedback score from 1 to 5. A feedback score is produced only after the user clicks one of the links in the recommendation result list to enter the page. Pages in the result list that the user does not access get a score of 0.
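The feedback rules of this section can be sketched as follows. The weights 4, 3, 2, 1 and the added constant 1 come from the text. Clamping the behavior score to 5 is an assumption of this sketch, made because the plain sum can exceed the stated 1 to 5 range (the example vector {4, 0, 0, 1} gives 6). The function and key names are illustrative.

```python
BEHAVIOR_WEIGHTS = {"bookmark": 4, "download": 3, "save": 2, "copy": 1}


def behavior_score(operations):
    """Equation (3.14): sum of the weights of the performed operations, plus 1."""
    return sum(BEHAVIOR_WEIGHTS[op] for op in set(operations)) + 1


def feedback_score(visited, rating=None, operations=()):
    """Feedback for one page of the recommendation result list."""
    if not visited:
        return 0                      # page never opened from the result list
    if rating is not None:
        return rating                 # the explicit 5-point rating takes precedence
    # Assumption: clamp to the 1-5 scale, since the raw sum can be larger.
    return min(5, behavior_score(operations))
```

A page that was opened but neither rated nor acted upon receives the minimum score of 1, in line with the role of the constant in equation (3.14).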
3.4.7 Updating User Profile
The user profile should reflect changes in the user's interests, so it needs to be updated and maintained. In addition, the user profile can be refined based on the user's feedback. In this thesis, updating the user profile consists of two aspects:
1. Regular updating by the system
According to the user's new browsing behavior in the most recent period, the system captures the changes in the user's interests and automatically updates the user profile. The user model update algorithm is described in Algorithm 3.3.
2. Updating based on user feedback ratings
In combination with collaborative filtering, the system asks the user to rate the results. At the same time, the system captures the user's browsing behavior on the pages in the result list. Using these ratings and this browsing behavior, the system refines the user profile, see Algorithm 3.4. If a web page gets a feedback score of less than 3, the interest vectors extracted from the page are added to the user profile; if an interest vector already exists in the user profile, its weight in the user profile is decreased. If the feedback score is greater than or equal to 3, the interest vectors extracted from the page are added to the user profile; if an interest vector already exists in the user profile, its weight in the user profile is increased.
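The feedback rule described above can be sketched on key word weights alone. The size of the adjustment is not specified in this section, so the parameter `step`, the lower bound of 0 for a weight and the dictionary layout are assumptions of this sketch.

```python
def apply_feedback(profile, page_interests, score, step=0.1):
    """Return an updated copy of `profile` (key word -> weight).

    `page_interests` holds the key word weights extracted from the rated page;
    `step` is a hypothetical adjustment size, not a value from the thesis.
    """
    updated = dict(profile)
    for keyword, weight in page_interests.items():
        if keyword not in updated:
            updated[keyword] = weight                       # new interest: add it
        elif score < 3:
            updated[keyword] = max(0.0, updated[keyword] - step)
        else:
            updated[keyword] = updated[keyword] + step
    return updated


# Hypothetical profile and page.
profile = {"camera": 0.5}
page = {"camera": 0.2, "lens": 0.3}
after_low = apply_feedback(profile, page, score=2)
after_high = apply_feedback(profile, page, score=4)
```

The original profile is left unchanged, so the effect of a low and a high score can be compared side by side.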
Algorithm 3.1
Input: set of web pages W
Output: user profile U
⑴ Preprocess W;
WHILE W != ∅ {
⑵ Take a page P ∈ W;
⑶ Analyze P, counting the frequency of each word in the different tags (Table 3.1);
⑷ Calculate the weight of each word using equation 3.1;
⑸ Calculate the vector V using equation 3.3;
⑹ Sort the entries of V by weight and keep the top 20 key word‐weight pairs;
⑺ Judge the relevance between images and key words (using the alt, title and src attributes, as in chapter 3.11);
⑻ Calculate the vectors in equations 3.6, 3.7 and 3.8 for each image related to at least one key word;
⑼ Combine the image feature vectors with the key word‐weight pairs to get the interest vectors T_i (1 ≤ i ≤ 20). If one image is related to multiple key words, add its image feature vectors to all the corresponding interest vectors;
⑽ Add the interest vectors T_i (1 ≤ i ≤ 20) to the interest vector set T;
⑾ Delete P from W;
}
WHILE T != ∅ {
⑿ ∀ T_i, T_j ∈ T, if F(T_i, T_j) < S, then T_i and T_j are similar (the distance F is calculated by equation 3.12, S is the threshold);
⒀ Merge the similar interest vectors T_i, T_j:
(i) Sum the weights of the two vectors,
(ii) Compute the mean of the vectors in equations 3.6, 3.7 and 3.8;
⒁ Add the interest vector (merged, or not needing a merge) to U, and delete it from T;
}
⒂ Sort U by weight and keep the top μ interests.
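The merging part of the listing (steps ⑿ to ⒂) can be sketched as follows. The interest vector layout, the flat list used for the image features and the toy distance function are assumptions of this illustration. When more than two vectors are merged, the sketch averages pairwise, which is a simplification of a true mean.

```python
import math


def merge_interests(interests, distance, threshold, top=20):
    """Merge similar interest vectors and keep the `top` heaviest ones.

    Each interest is a dict with 'keyword', 'weight' and 'image'
    (a flat list of floats, or None when the topic has no image)."""
    pending = list(interests)
    profile = []
    while pending:
        current = dict(pending.pop(0))
        remaining = []
        for other in pending:
            if distance(current, other) < threshold:
                current["weight"] += other["weight"]          # (i) sum the weights
                if current["image"] is not None and other["image"] is not None:
                    current["image"] = [(a + b) / 2           # (ii) mean of features
                                        for a, b in zip(current["image"], other["image"])]
                elif current["image"] is None:
                    current["image"] = other["image"]
            else:
                remaining.append(other)
        pending = remaining
        profile.append(current)
    profile.sort(key=lambda t: t["weight"], reverse=True)
    return profile[:top]


def toy_distance(a, b):
    """Stand-in for F: infinite for different key words, else a feature gap."""
    if a["keyword"] != b["keyword"]:
        return math.inf
    return abs(a["image"][0] - b["image"][0])


items = [
    {"keyword": "camera", "weight": 0.4, "image": [1.0, 3.0]},
    {"keyword": "lens", "weight": 0.5, "image": [2.0, 2.0]},
    {"keyword": "camera", "weight": 0.3, "image": [1.2, 5.0]},
]
merged = merge_interests(items, toy_distance, threshold=0.5)
```

In this toy run the two "camera" vectors are merged into one with the summed weight, which moves it ahead of "lens" in the sorted profile.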
Algorithm 3.3
Input: user profile U, temporary web page set W'
Output: new user profile U'
⑴ Preprocess W';
WHILE W' != ∅ {
⑵ Take a page P' ∈ W';
⑶ Analyze P', counting the frequency of each word in the different tags (Table 3.1);
⑷ Calculate the weight of each word using equation 3.1;
⑸ Calculate the vector V' using equation 3.3;
⑹ Sort the entries of V' by weight and keep the top 20 key word‐weight pairs;
⑺ Judge the relevance between images and key words (using the alt, title and src attributes, as in chapter 3.11);
⑻ Calculate the vectors in equations 3.6, 3.7 and 3.8 for each image related to at least one key word;
⑼ Combine the image feature vectors with the key word‐weight pairs to get the interest vectors T'_i (1 ≤ i ≤ 20). If one image is related to multiple key words, add its image feature vectors to all the corresponding interest vectors;
⑽ Add the interest vectors T'_i (1 ≤ i ≤ 20) to the interest vector set T';
⑾ Delete P' from W';
}
WHILE T' != ∅ {
⑿ ∀ T'_i ∈ T', ∀ T_j ∈ U, if F(T'_i, T_j) < S, then T'_i and T_j are similar (the distance F is calculated by equation 3.12, S is the threshold);
⒀ Merge the similar interest vectors T'_i, T_j:
(i) Sum the weights of the two vectors,
(ii) Compute the mean of the vectors in equations 3.6, 3.7 and 3.8;
⒁ Else, if ∃ T'_i ∈ T' such that ∀ T_j ∈ U, F(T'_i, T_j) ≥ S, then add T'_i to U;
}
⒂ Sort U by weight and keep the top μ interests; U' = U.
Algorithm 3.4
Input: user profile U, feedback score set S, interest vector set T of the result pages
Output: new user profile U'
1. For every interest vector T_i in T
2. ∃s ∈ S, s is the feedback score of the pages containing T_i
3. IF s