Formal Models for Learning of User Preferences, a Preliminary Report

Joaquin Delgado and Naohiro Ishii
Nagoya Institute of Technology, Dept. of Intelligence & C.S.
Gokiso-cho, Showa-ku, Nagoya 466-8555, Japan
{jdelgado,ishii}@ics.nitech.ac.jp

Abstract
We present a preliminary report on formal models for the task of learning user preferences. Learning user preferences from examples is described under content-based learning models, such as the Mistake-bound model and the Probably Approximately Correct (PAC) model of computational learning theory. We then describe the learning of user preferences under a collaborative learning schema inspired by collaborative information filtering and recommender systems. Finally, we formalize the problem as a joint learning problem under the Co-training model [Blum and Mitchell 1998], in order to combine content and social information into a more accurate model of human preferences. As a result, different concepts and methods of learning user preferences can be expressed under a unified formal framework.

1 Introduction
Recently, in the Artificial Intelligence (AI) community, there has been a great deal of work on how AI can be applied to scenarios such as the Internet and the World-Wide Web as emerging platforms for distributed intelligent systems. Notions of personalized search engines, intelligent software agents and recommender systems have gained wide acceptance among users for the task of assisting them in searching, sorting, classifying, filtering and sharing the vast amount of information now available on the Web. The combination of modeling the preferences of particular users, building content/resource models, and modeling social patterns has led to frameworks such as Content-based Collaborative Information Retrieval, previously introduced in [Delgado, Ishii and Ura 1998a], [Delgado and Ishii 1998b] and [Basu, Hirsh and Cohen 1998], which describe combinations of content-based and collaborative-based methods for information filtering, classification and sharing. The process of filtering Web documents, separating relevant documents from non-relevant ones, can be viewed

as a personalized text classification task based on user profiles, which are, in a sense, hypotheses about unknown target concepts of user preferences. In principle, these profiles could be built by intelligent agents examining the content of positive and negative examples given by the user but, in general, the user is not willing to spend much time on this task. Additional examples are needed! In collaborative filtering systems [Resnick et al. 1994] these come in the form of recommendations, which are extracted from the analysis of social patterns and analogies among the different users of the system. As we will point out in the next section, the learning of user profiles is a non-trivial task and represents an interesting challenge for AI researchers and practitioners. This paper aims to fill the gap between empirical user modeling/profiling and existing machine learning theories, which we believe will be useful for further advances in this area.

2 Learning User Profiles
In actual Internet systems that require a model of user preferences, these are represented as user profiles, normally in the form of attribute-value lists and/or rules, as in [Newell 1997]. From the system's point of view, it is desirable to know as much as possible about the user, so that it can provide useful results from the first actions of the agents. However, the user is willing to devote only limited time to specifying his user profile. Moreover, the user's interests may change over time, making profiles difficult to maintain. For these reasons, the method for the initialization and maintenance of user profiles is a difficult aspect of the design and development of intelligent Internet systems. The level of automation of the acquisition of user profiles can range from manual input, through semi-automatic procedures, to automatic recognition by the agents themselves. Most of the information filtering software agents now available, such as e-mail filters, push technology and content-based recommender systems, make use of an interface through which the system prompts the user to explicitly enter his preferences, desires and intentions in the form of attribute-values or rules. Although this process can provide very precise results when done properly,

it can be both time consuming and sensitive to input errors. It also requires the user to have a good understanding of the domain knowledge and of the actions performed by the agents. Manual updating turns out to be difficult when the requirements change. On the other hand, ideas of learning by examples, clustering and time-based learning have been in the AI and Machine Learning (ML) literature for some time. These can be directly related to the methods we describe below.

2.1 Learning by Given Examples

Figure 1. Learning by Examples

The user is explicitly asked to give some examples of relevant information or to answer specific questions (see Figure 1). Once the user has given the appropriate answers, the agent processes the information using its internal weighting scale and, by combining the answers, produces compressed information for building user profiles that are consistent with the examples/questions provided. This mode has the advantage of simple handling. It has the disadvantage, and the danger, that the selected examples may not be representative, making the results less precise. Normally the learning process is of high computational complexity.

2.2 Profile/User Stereotyping

Figure 2. Profiling by Clusters

In this model (see Figure 2), there is an assumption that selected characteristics of each user permit an assignment of the user, through a decision-tree-like process, into user stereotypes (clusters). An individual profile for each user stereotype is created, based on the accumulated statistics of these characteristics within the stereotype, and stored in a table. When a new user is assigned to a stereotype, he is given, by default, the profile of that stereotype. This is a simple model that requires little interaction from the user, but its accuracy strongly depends on the granularity and number of the user stereotypes, as well as on the selection of the important personal characteristics.

2.3 Learning by Observation
The automatic creation of user profiles by the agent is based on two phases: the observation of the user's behavior (Phase 1) and the matching with the user's needs (Phase 2), done as in the previous cases. The observation phase ends when the agent considers that it has obtained sufficient information to draw conclusions about a general behavior pattern. Finally, in Phase 2, these profiles are used to match the user's information needs with the agent's actions. Because the user is relieved of having to provide any input, this mode has the advantage of very simple operation. It provides the further advantage of detecting changes in the user's preferences over the complete period of use and then adapting itself appropriately. However, it should be noticed that such an agent cannot be deployed immediately, since it needs a long training period before it can start to work.

2.4 Content vs. Collaborative Filtering
Most of the systems that support the acquisition and maintenance of user profiles in the ways described above are content-based filtering systems. In these systems, it is common to see resources (e.g. Web pages) and user profiles represented under the same model (e.g. the keyword vector space model), so that they can be compared by some similarity measure that evaluates relevance, as shown in Figure 3.

Figure 3. Content-based filtering

On the other hand, some systems do not even have user profiles in the sense in which we have described them until now. Instead, these systems use a rating schema in order to produce statistical learning or prediction of the ratings (scores) of new items, or of a class of items, on behalf of the user. The prediction is based on the user's previously captured ratings and on the ratings of the other users in the system over the same item/class (see Figure 4). In this case, a user's profile could naively be thought of as the set of ratings that has been captured from the user by the agent, but in fact this does not make much sense, as we will argue later.

Figure 4. Recommender Systems

Users share their interests and obtain recommendations based on what other users have to say about related items. This process is called collaborative filtering [Resnick et al. 1994] or social filtering [Shardanand and Maes 1995], and may also have different levels of automation (from the explicit entry of ratings to the observation of user behavior).

3 Formal Models

3.1 Existing Learning Theories
From a theoretical point of view, research in machine learning and computational learning theory (COLT) has given new insights on why, within the mistake-bound model, the probably approximately correct (PAC) model and other statistical learning frameworks, the combination of a "group of experts" leads to better results than those obtained by individual learners. More recently, work on the Co-training model, proposed in [Blum and Mitchell 1998], shows that the sharing of (unlabeled) examples between two learners can boost the accuracy of learning a joint target concept function. This is especially useful when the size of the initial training sample of either or both learners is small. Some of these techniques have also been successfully applied to the task of classifying net-news and Web pages [Nigam et al. 1998] [Blum and Mitchell 1998]. All this suggests that the learning of user preferences might be feasible within the Co-training model, with the combination of multiple learners and unlabeled data. Perhaps we should also take advantage of social patterns such as the similarity between users. In order to describe formal models for the learning of user preferences, we first need to define what we mean by user preferences and user profiles in each context. These definitions have to be general and expressive enough to accommodate various models and to be consistent with content-based learning, collaborative learning and the Co-training model. The first thing we noticed is that although, in principle, the objective of the learning is the same (the prediction of the preference for, or relevance of, a given instance), the nature of the learning is totally different. Therefore, we chose to analyze the problem case by case.

3.2 Content-based Learning
In content-based learning, as the name suggests, the inputs to a learning algorithm are the contents of the instances and their labels ∈ {0,1}, represented in some domain. Using {0,1} (relevant, non-relevant) as the range is an over-simplification of the idea of preference but, in general, it suffices for the task of information filtering. We present non-infinite set-valued features, a variation of the model presented in [Cohen 1996], and use them for the representation of the content of an instance. This gives us expressiveness and the possibility of building a framework within computational learning theory. This is not restrictive, of course, and depending on the application it may be convenient to use other representations, such as frame-like structures.

For this type of representation, let us define some basic Boolean tests. If a_i is a component of the name vector a, r is a real number and s is a string, then the following are all basic Boolean tests for the domain D: a_i = s and a_i ≠ s for a nominal feature a_i; a_i ≤ r and a_i ≥ r for a continuous feature a_i; and s ∈ a_i and s ∉ a_i for a set-valued feature a_i.
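As a rough illustration, the basic Boolean tests can be written as simple predicates over an instance of name-component pairs. This is a sketch under our own assumptions; the dictionary encoding, the helper names and the sample instance are illustrative, not part of the paper's formalism:

```python
# Illustrative predicates for the three kinds of basic Boolean tests;
# the instance encoding (a plain dict) and helper names are assumptions.

def nominal_eq(instance, name, s):
    """a_i = s, for a nominal feature a_i."""
    return instance[name] == s

def continuous_le(instance, name, r):
    """a_i <= r, for a continuous feature a_i."""
    return instance[name] <= r

def set_contains(instance, name, s):
    """s ∈ a_i, for a set-valued feature a_i."""
    return s in instance[name]

# A food-domain instance represented through name-component pairs:
dish = {
    "Name": "Pasta",
    "Origin": "Italy",
    "Kcal/gr": 47.5,
    "Ingredients": {"wheat", "egg", "salt"},
}

nominal_eq(dish, "Origin", "Italy")       # a_i = s   -> True
continuous_le(dish, "Kcal/gr", 50.0)      # a_i <= r  -> True
set_contains(dish, "Ingredients", "egg")  # s ∈ a_i   -> True
```

Any Boolean combination of such predicates then defines a concept over the domain D.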

An intuitive example of a legal instance for the food domain, represented through name-component pairs, would be Name = "Pasta", Origin = "Italy", Kcal/gr = 47.5, Ingredients = {"wheat", "egg", "salt"}. Here we can implicitly define the value sets as V1 = {set of all food names}, V2 = {set of all country names} and V4 = {set of all ingredient names}. Note that V3 has no meaning, since it is a continuous (uncountable) attribute. Now we are ready to describe the learning process. We will describe two types of learning models, on-line or interactive learning and off-line or batch learning, and see how they relate to the concept and the learning of user preferences.

3.2.1 On-line Learning of User Preferences
From a theoretical point of view, on-line algorithms fall into what machine learning and computational learning theory (COLT) calls the mistake-bound model. Sometimes called "learning from experts' advice", this on-line learning model is a continuous and interactive process, in which each attribute is considered to be an "expert predictor" and is given a weight used to measure its confidence on the prediction task. In each trial, a valid instance is presented to the algorithm, and each predictor gives its verdict using a certain Boolean testing function. The weighted majority output is arg max

over R ∈ {R_0, R_1} of the total weight

    Σ_{a_i ∈ R} w_i ,

where

    R_0 = {a_i : a_i(x) = 0, 1 ≤ i ≤ n}  and  R_1 = {a_i : a_i(x) = 1, 1 ≤ i ≤ n},

and a_i(x) is the result of applying a basic Boolean test to the i-th component of x, which is made up of n attributes. Next, the correct label is shown to the algorithm, which proceeds to update the weights of the experts, denoted by w_i, following a strategy that punishes those that made mistakes and rewards, or leaves unchanged, the weights of those that were correct. The algorithm then goes back to the prediction phase and loops.
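The prediction/update loop just described can be sketched as follows. The expert tests, the multiplicative penalty factor beta and the data are illustrative assumptions; any strategy that punishes mistaken experts fits the description:

```python
# Sketch of the mistake-bound (weighted majority) loop; the experts,
# the penalty beta and the instances are illustrative assumptions.

def predict(experts, weights, x):
    """Output the label whose supporting experts carry more total weight."""
    w1 = sum(w for test, w in zip(experts, weights) if test(x))       # R_1
    w0 = sum(w for test, w in zip(experts, weights) if not test(x))   # R_0
    return 1 if w1 >= w0 else 0

def update(experts, weights, x, label, beta=0.5):
    """Punish (shrink) each expert whose verdict disagreed with the label."""
    return [w * beta if int(test(x)) != label else w
            for test, w in zip(experts, weights)]

# Experts are basic Boolean tests over a food-domain instance:
experts = [
    lambda x: "egg" in x["Ingredients"],   # s ∈ a_i (set-valued)
    lambda x: x["Origin"] == "Italy",      # a_i = s (nominal)
    lambda x: x["Kcal/gr"] <= 50.0,        # a_i ≤ r (continuous)
]
weights = [1.0, 1.0, 1.0]

x1 = {"Origin": "Italy", "Kcal/gr": 47.5, "Ingredients": {"wheat", "egg"}}
y_hat = predict(experts, weights, x1)      # all three experts vote 1
weights = update(experts, weights, x1, 1)  # all correct: weights unchanged

x2 = {"Origin": "Italy", "Kcal/gr": 60.0, "Ingredients": {"rice"}}
weights = update(experts, weights, x2, 0)  # only the "Italy" expert erred
```

After the second trial only the mistaken expert's weight is halved, so its influence on future weighted votes shrinks, exactly the punish/reward scheme described above.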

Although these types of algorithms are not assured to converge (possibly because the target concept function changes over time!), there are guaranteed theoretical bounds on the number of mistakes they can make. In computational learning theory this corresponds to the mistake-bound model. Note that this concept of user profile is compatible with the representation of a document in the keyword Vector Space Model of Information Retrieval, which uses Term Frequency - Inverse Document Frequency (TF-IDF) as weights. It resembles even more closely the concept of a class profile in Rocchio's algorithm [Salton and Buckley 1990], widely used for text classification. In this case the name vector would be the list of all indexing terms, and the weights are calculated in a relevance feedback loop (updating the TF-IDF values for each example). Some authors call these types of algorithms Memory-based algorithms [Breese et al. 1998], but we should point out some subtle differences between Memory-based algorithms and on-line algorithms. The former neither punish nor reward the weights used for the prediction after knowing the correct result; instead, they calculate the weights solely from the data available. The latter do use a correcting scheme that reflects the actual performance of the algorithm.

3.2.2 Batch or Off-line Learning
There are many algorithms in the machine learning literature for batch or off-line learning. Normally there is a training (labeled) set and a test set of instances with known but hidden labels. The performance of these algorithms is normally measured in terms of the accuracy per number of examples and the minimum square error over the test data. In computational learning theory, the Probably Approximately Correct (PAC) model has been widely studied and used for off-line learning, and it offers a set of nice theoretical properties. We would like to define formally the concept of user profile in this context, since it is desirable to represent the problem using such a powerful model. Normally the domain of interest in formal discussions is {0,1}*, the set of all strings of finite length on the Boolean alphabet {0,1}. In order to use the definitions we have stated, we need a mapping between any valid instance in the user preference model and the formal domain of interest. As stated in Remark 5, if one assumes there are no continuous features, every set-valued feature domain D and every set-valued feature language has an isomorphic formulation in {0,1}* given by the Boolean combinations of the basic Boolean tests over valid instances of D. This allows us to formulate the learning of user preferences using the PAC model. Since we are learning the user's preferences, the target class remains the same as before, defined over domain D. We now call an instance the isomorphic mapping of every legal instance over D to instances in {0,1}*. In the PAC learning model, the learning algorithm gets as input a sample, i.e., a multi-set of desired size m < ∞. Each instance x_t in the sample given to the learner must be independently drawn from the same distribution θ on X and labeled according to the same target function c ∈ C. Both C and θ are fixed in advance and unknown to the learner. To meet the PAC learning criterion, the learner, on the basis of a polynomial-size sample, must output a hypothesis h that with high probability is a close approximation of the target c. Formally, an algorithm A is said to PAC learn the target class C of

user preferences, using hypothesis class H, if, for all distributions θ on X, for all targets c ∈ C, and for all ε, δ with 1 ≥ ε, δ > 0, given as input a sample of size m, A outputs a hypothesis h ∈ H such that its error probability, θ(h ≠ c), is strictly smaller than ε with probability at least 1−δ with respect to the random draw of the sample, where m = m(ε,δ) is polynomial in 1/ε and ln(1/δ). Given this definition of PAC learnability and a way to map it to batch learning of the user's preferences, can we describe what a user profile is?

Similar definitions can be given for PAC learning in the presence of noise.
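As a concrete, hedged illustration of the role of m(ε,δ): for a consistent learner over a finite hypothesis class H, the standard textbook bound m ≥ (1/ε)(ln|H| + ln(1/δ)) applies. This bound is not derived in this paper; the figures below are purely illustrative:

```python
# The usual sample-size bound for a consistent learner over a finite
# hypothesis class H; a textbook illustration of m = m(eps, delta).
import math

def pac_sample_size(h_size, eps, delta):
    """Smallest m with m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# e.g. hypotheses built from n = 20 Boolean tests, |H| = 2**20,
# accuracy eps = 0.1, confidence delta = 0.05:
m = pac_sample_size(2 ** 20, eps=0.1, delta=0.05)   # 169 examples suffice
```

Note that m grows only logarithmically in |H| and 1/δ, which is what makes a polynomial-size sample plausible for the Boolean-test hypothesis spaces above.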

3.3 Collaborative Filtering & Learning
Collaborative filtering can be thought of, in principle, as a learning problem for a binary relation, in which a user is related to a content item just in case he or she prefers it. Such a binary relation can be represented by a 0,1-valued matrix, in which the rows represent users and the columns represent content items (instances).

Off-line or batch learning, although clearly not suitable for this type of ever-changing problem, can be thought of as a one-time learning process that receives as input a partially filled (usually small) matrix O as training data, and then produces a prediction function P: (i,j) → {0,1} later used to predict the remaining unknown entries (the test data). Note that in this case time is irrelevant, as are the actual values M[i,j] corresponding to the test data. In the literature, several ideas for implementing P have been proposed. It is worth mentioning the widely used Pearson-r correlation algorithm [Resnick et al. 1994]. Several well-known Machine Learning (ML) algorithms have also been implemented and evaluated for learning collaborative filters [Billsus and Pazzani 1998]. Inductive learning has also been explored in [Basu, Hirsh and Cohen 1998]. An interesting empirical analysis of several algorithms was made by [Breese et al. 1998], where the following classification of collaborative filtering learning algorithms was presented:
• Memory-based algorithms
  - Correlation (Pearson-r), extended with default voting and case amplification
  - Vector similarity (TF-IDF)
• Model-based algorithms
  - Cluster models (Naïve Bayes classifier with EM for the estimation of the model structure)
  - Bayesian network model
One can think of each row in O as a basic user profile being learned throughout the prediction process.
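A minimal sketch of the memory-based Pearson-r predictor mentioned above, over a small 0/1 observation matrix. The data, the None-for-unobserved convention and the weighting details are illustrative assumptions:

```python
# Sketch of a Pearson-r collaborative predictor in the spirit of
# [Resnick et al. 1994]; None marks unobserved matrix entries.

def pearson(u, v):
    """Correlation between two users over their co-rated items."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if len(pairs) < 2:
        return 0.0
    mu = sum(a for a, _ in pairs) / len(pairs)
    mv = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - mu) * (b - mv) for a, b in pairs)
    den = (sum((a - mu) ** 2 for a, _ in pairs)
           * sum((b - mv) ** 2 for _, b in pairs)) ** 0.5
    return (num / den) if den else 0.0

def mean_rating(row):
    rated = [r for r in row if r is not None]
    return sum(rated) / len(rated)

def predict(matrix, i, j):
    """Correlation-weighted average of the other users' votes on item j."""
    active = matrix[i]
    num = den = 0.0
    for other in matrix:
        if other is active or other[j] is None:
            continue
        w = pearson(active, other)
        num += w * (other[j] - mean_rating(other))
        den += abs(w)
    base = mean_rating(active)
    return (base + num / den) if den else base

O = [
    [1, 0, 1, None],   # active user: entry (0, 3) unobserved
    [1, 0, 1, 1],      # perfectly correlated neighbor
    [0, 1, 0, 0],      # perfectly anti-correlated neighbor
]
score = predict(O, 0, 3)   # both neighbors suggest the user will like item 3
```

Even the anti-correlated neighbor contributes useful evidence here, because its disagreement is itself predictive; this is the sense in which the matrix rows, not the individual ratings, carry the profile.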

3.4 The Co-training Model

Figure 5. M: The Joint Information Matrix

We define this joint learning model as follows. We have an instance space X with an underlying unknown distribution, and a target concept class C: X → {0,1}, representing the class of user preferences over a given domain D. Let M (see Figure 5) be the joint information matrix that represents the target binary function whose i,j-entry (a_{i,j}) represents the value of a target function c_i(x_j), where c_i ∈ C and x_j ∈ X, and let O_t be an observation matrix O at time t, which in general satisfies O[i,j] = M[i,j] whenever the i,j entry has been observed, and O[i,j] = ∗ otherwise. Let P(t): O_t × (i,j) → {0,1} be the prediction function for a given observation matrix O at time t, a row (user) i and a column (instance) j. Under this context, on-line or incremental learning is easily described in terms of the observation matrix O as follows. Starting initially with O_0, whose elements are all ∗, at any given time t an arbitrary pair (i,j) is given to P(t), which predicts its value as M̂[i,j], based on the observation matrix O_t. The learner is then given the actual value of M[i,j], and O_t is updated (to O_{t+1}) accordingly. The above process is repeated until the matrix is fully observed, namely O_t = M.
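The incremental protocol just described can be sketched as follows; the row-majority predictor P is a deliberately naive placeholder, since the model leaves P unspecified:

```python
# Sketch of the incremental protocol: predict an unobserved entry of M
# from O_t, then reveal the true value (O_t -> O_{t+1}). The predictor
# P is a naive stand-in, not prescribed by the model.

STAR = None  # the '*' (unobserved) marker

def P(O, i, j):
    """Majority vote over the entries of row i observed so far."""
    row = [v for v in O[i] if v is not STAR]
    return 1 if row and sum(row) * 2 >= len(row) else 0

def observe(O, M, i, j):
    y_hat = P(O, i, j)   # predict M[i,j] from the observation matrix O_t
    O[i][j] = M[i][j]    # the learner is given the actual value
    return y_hat

M = [[1, 1, 0],
     [0, 1, 1]]
O = [[STAR] * 3 for _ in M]   # O_0: every entry unobserved
mistakes = sum(observe(O, M, i, j) != M[i][j]
               for i in range(2) for j in range(3))
# afterwards O == M: the matrix is fully observed
```

The quantity of interest in the mistake-bound view is exactly the `mistakes` counter accumulated before O equals M.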

It would be worthwhile to explore a model that combines both content-based and collaborative learning for situations in which several user/agent learners are learning in a common domain. In this case, each one of them is learning a different target concept belonging to the same concept class C (the class of user preferences), and examples are drawn from a shared instance space X, or a domain D, with an unknown underlying distribution θ. This framework should capture the relations between a) the target concept of each learner, b) the recommendation of examples and the prediction of their labels based on social patterns, and c) the learning process of each individual agent. We are interested in exploring how we can take advantage of the prediction of ratings in a collaborative learning environment in order to improve the content-based learning of each individual's user preferences, and vice versa. One can argue that the target concept class being learned is the same, the class of user preferences, and that for the sake of simplicity the instances are defined over a common domain. Thus, it is reasonable to say that whenever there is more than one independent learner, there are at least two different views of the instance space:



• A direct or selfish view, which follows from the direct analysis of the content of an instance by the learner, and
• An indirect or social view, which follows from the analysis of an instance based on what others have to say about it (their impression of the instance).
For example, suppose that person A, who has learned by experience that he does not like hot (spicy) food (Domain = food), is given a certain dish to eat. He can either taste the food or check the list of ingredients (direct view) to see whether he likes it or not (whether it is hot or not), or he can ask somebody else (person B), who does not like hot food either and has already tasted the dish, whether he or she liked it (social view). Note that there is no query about what B directly observed in the dish (e.g. a certain ingredient), which might be interesting to analyze; rather, the question is about the value of evaluating the instance by the target function. If we represent person A's target function as a Boolean function containing the conjunct ¬x1, where x1 = "the food is hot", it is sufficient that B's function contains the same variable in conjunctive form in order to accurately predict A's preferences related to hot food, even if A and B differ in all the rest of the variables, since they are, in fact, different concept functions. The learning of different views of the instance space has been studied by [Blum and Mitchell 1998] in the Co-training model. This model consists of an instance space X = X1 × X2 and a composed target concept c = (c1, c2) belonging to the target class C = C1 × C2, with an underlying distribution θ over the instance space. In this model, X1 and X2 correspond to different views of an example, such as the words inside a Web page and the words in the hyperlinks pointing to that page. The degree of compatibility of a function c with respect to a distribution θ is defined as a number 0 ≤ p ≤ 1, where p = 1 − Pr_θ[(x1,x2) : c1(x1) ≠ c2(x2)] for all (x1,x2) ∈ X with non-zero probability.
Assuming full compatibility (p = 1), they have shown both theoretically and experimentally that, by co-training each function, c1 and c2, with each other's examples, thereby combining labeled and unlabeled data, learning can be achieved with better accuracy than by relying on each function separately or by just combining them by voting. This is especially true when the amount of labeled samples is small. Now let us define two different views of a compound instance space, capturing the fundamental idea of jointly learning both content-based and collaborative-based user preferences. The intuitive idea is that there is a selfish learner, which is only interested in the content of the examples that match its (internal) hypothesis, without the need of a second opinion, and a social learner, which is only interested in building its concepts solely from the predicted values that other learners have assigned to the same sample, therefore also learning the relations among the different learners.
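The degree of compatibility p can be estimated empirically over a finite sample of paired views. The target functions below are illustrative stand-ins, keyed on the shared variable "the food is hot" from the example above:

```python
# Empirical estimate of the compatibility degree
# p = 1 - Pr[(x1, x2) : c1(x1) != c2(x2)] over a sample of paired views;
# c1 and c2 are illustrative Boolean concepts sharing the variable "hot".

def c1(x1):            # direct (selfish) view: dislikes hot food
    return 0 if x1["hot"] else 1

def c2(x2):            # social view: B's verdict, also keyed on "hot"
    return 0 if x2["hot"] else 1

sample = [
    ({"hot": True},  {"hot": True}),    # views agree
    ({"hot": False}, {"hot": False}),   # views agree
    ({"hot": False}, {"hot": False}),   # views agree
    ({"hot": True},  {"hot": False}),   # contradictory pair
]
disagree = sum(c1(x1) != c2(x2) for x1, x2 in sample)
p = 1 - disagree / len(sample)   # 0.75 here; full compatibility means p = 1
```

Dropping the contradictory pair (i.e. forbidding contradictions in one's preferences) recovers p = 1, the full-compatibility assumption used below.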

Let X = X1 × X2 be the new sample space in the Co-training model. Let X1 be the set of all legal instances over a domain D, as in the content-based learning model. Let X2 = M × (i,j), so that i, j are valid entries of the collaborative learning matrix M. Let c1 be the target concept that represents the selfish point of view of the user's preferences, and c2 the target concept that represents the collaborative point of view of the user's preferences. Let us assume that each pair x = (x1,x2), x ∈ X, can be drawn with a distribution θ that assigns probability 0 to any pair (x1,x2) such that c1(x1) ≠ c2(x2). Note that this is somewhat consistent with the idea of not allowing contradictions in one's preferences (which is not always true). If we assume conditional independence on θ and have confidence in the learnability of at least one of the target concepts (learnable in the PAC model with classification noise), then we can directly apply Theorem 1 in [Blum and Mitchell 1998], which states that the joint function c is learnable in the Co-training model from unlabeled data only, given an initial weakly-useful predictor for one of the target concepts. An encouraging fact that supports this theory is that most of the available ML algorithms perform better than a weakly-useful predictor, even with little data.
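The co-training loop itself can be sketched schematically: each view's hypothesis labels the unlabeled examples for the other view's learner. The stub threshold learner and the toy one-dimensional "features" below are our own assumptions, standing in for a real algorithm (e.g. Naive Bayes in [Blum and Mitchell 1998]):

```python
# Schematic co-training loop: h1 (view 1) labels examples for the
# view-2 learner and vice versa. The learner is a deliberately crude
# threshold rule; real instantiations would use e.g. Naive Bayes.

def train(labeled):
    """Stub learner: threshold halfway between the two class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x >= cut else 0

def cotrain(l1, l2, unlabeled, rounds=2):
    """l1 / l2: small labeled sets for view 1 and view 2."""
    for _ in range(rounds):
        h1, h2 = train(l1), train(l2)
        for x1, x2 in unlabeled:          # each (x1, x2) is one example
            l2.append((x2, h1(x1)))       # h1 labels it for the view-2 learner
            l1.append((x1, h2(x2)))       # h2 labels it for the view-1 learner
    return train(l1), train(l2)

l1 = [(0.9, 1), (0.1, 0)]                 # view 1: content-based evidence
l2 = [(0.8, 1), (0.2, 0)]                 # view 2: social evidence
h1, h2 = cotrain(l1, l2, unlabeled=[(0.7, 0.9), (0.2, 0.1)])
```

The point of the sketch is only the data flow: unlabeled pairs let each learner grow the other's training set, which is what the compatibility assumption (p = 1) makes sound.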

Conclusion
The contributions of this paper are threefold:
• We have presented a preliminary formal framework for the problem of learning user preferences.
• Using this framework, we have given an interpretation of this problem in the context of several theoretical ML models.
• We have introduced and investigated the concepts of content-based and collaborative-based learning, and the combination of both aspects under the Co-training model.
As future work we would like to verify these ideas and build new algorithms based on the definitions we have proposed.

References

Delgado, J., Ishii, N. and Ura, T., 1998a. Content-based Collaborative Information Filtering: Actively Learning to Classify and Recommend Documents. In: Matthias Klusch and Gerhard Weiss, eds., Cooperative Information Agents II, Proceedings of CIA'98, 206-215. LNAI Series Vol. 1435, Springer-Verlag.

Delgado, J. and Ishii, N., 1998b. Content + Collaboration = Recommendation. In: Papers from the AAAI 1998 Workshop on Recommender Systems, 37-41. Technical Report WS-98-08. Menlo Park, Calif.: AAAI Press.

Basu, C., Hirsh, H. and Cohen, W., 1998. Recommendation as Classification: Using Social and Content-Based Information in Recommendation. In: Proceedings of AAAI-98, 714-726. Menlo Park, Calif.: AAAI Press.

Blum, A. and Mitchell, T., 1998. Combining Labeled and Unlabeled Data with Co-Training. In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 92-100. ACM Press.

Nigam, K., McCallum, A., Thrun, S. and Mitchell, T., 1998. Learning to Classify Text from Labeled and Unlabeled Documents. In: Proceedings of AAAI-98, 792-799. Menlo Park, Calif.: AAAI Press.

Breese, J., Heckerman, D. and Kadie, C., 1998. Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI.

Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P. and Riedl, J., 1994. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In: Proceedings of the CSCW 1994 Conference.

Cohen, W., 1996. Learning Trees and Rules with Set-valued Features. In: Proceedings of AAAI-96. Menlo Park, Calif.: AAAI Press.

Salton, G. and Buckley, C., 1990. Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science 41, 288-297.

Shardanand, U. and Maes, P., 1995. Social Information Filtering: Algorithms for Automating "Word of Mouth". In: Proceedings of ACM CHI'95. ACM Press.

Billsus, D. and Pazzani, M., 1998. Learning Collaborative Information Filters. In: Proceedings of ICML'98, 46-53. Morgan Kaufmann.

Nakamura, A. and Abe, N., 1998. Collaborative Filtering using Weighted Majority Prediction Algorithms. In: Proceedings of ICML'98, 395-403. Morgan Kaufmann.

Newell, S.C., 1997. User Models and Filtering Agents for Improved Internet Information Retrieval. User Modeling and User-Adapted Interaction 7(4), 223-237. Kluwer Academic Publishers.
