Microblogging in a Classroom: Classifying Students' Relevant and Irrelevant Questions in a Microblogging-Supported Classroom

Suleyman Cetintas, Luo Si, Hans Peter Aagard, Kyle Bowen, and Mariheida Cordova-Sanchez

Abstract—Microblogging is a popular technology in social networking applications that lets users publish online short text messages (e.g., less than 200 characters) in real time via the web, SMS, instant messaging clients, etc. Microblogging can be an effective tool in the classroom and has lately gained notable interest from the education community. This paper proposes a novel application of text categorization for two types of microblogging questions asked in a classroom, namely relevant (i.e., questions that the teacher wants to address in the class) and irrelevant questions. Empirical results and analysis show that using personalization together with question text leads to better categorization accuracy than using question text alone. It is also beneficial to utilize the correlation between questions and available lecture materials as well as the correlation between questions asked in a lecture. Furthermore, empirical results also show that the elimination of stopwords leads to better correlation estimation between questions and leads to better categorization accuracy. On the other hand, incorporating students' votes on the questions does not improve categorization accuracy, although a similar feature has been shown to be effective in community question answering environments for assessing question quality.

Index Terms—Education, computer uses in education.

1 INTRODUCTION

MICROBLOGGING, a Web 2.0 technology, is a type of blogging that lets users post short text messages (usually less than 200 characters) to their community in real time via several communication channels such as the web, mobile devices, e-mail, and instant messengers. Depending on whom a user follows (i.e., communicates with) and is followed by, a microblogging tool such as Twitter [11], [29] can be used effectively for professional networking. Recently, microblogging tools have been used in classroom environments as a communication tool between a student and the instructor [9], [27] as well as between students themselves [3], [7], [30]. Utilizing microblogging in classrooms has several advantages and disadvantages (which can be found in [9]). An important issue with large microblogging-supported classrooms is that the number of questions/comments an instructor receives from the students can be many more than what she/he can answer in a limited time. Therefore, there exists a need to differentiate the relevant questions/messages that should be addressed by the instructor from the irrelevant messages that need not or should not be addressed. To the best of our knowledge, there is very limited prior work on the categorization of relevant and irrelevant microblogging messages or questions in classroom environments [4].

S. Cetintas, L. Si, and M. Cordova-Sanchez are with the Department of Computer Sciences, Purdue University, 250 N. University Street, West Lafayette, IN 47907-2066. E-mail: {scetinta, lsi, cordovas}@cs.purdue.edu.
H.P. Aagard and K. Bowen are with the Rosen Center for Advanced Computing, Purdue University, 302 West Wood Street, West Lafayette, IN 47907-2066. E-mail: {hans, kbowen}@purdue.edu.
Manuscript received 26 Mar. 2010; revised 17 Sept. 2010; accepted 30 Nov. 2010; published online 25 Mar. 2011.

Prior work on teacher agents includes a question ranking capability that scores each question and tries to select the best questions to respond to [24]. That work utilizes the question text as well as a personalized approach (i.e., questions from a student who has been asking good questions are favored) to differentiate between relevant and irrelevant questions. Recently, Cetintas et al. also utilized the correlation between questions and the lecture materials available in lectures, along with personalization and question text, and showed that it further improves the categorization accuracy [4]. Other similar prior work in community question answering environments (where users post a question and have it answered by others) such as Yahoo! Answers [32], WikiAnswers [31], and Baidu Zhidao [2] utilizes users' votes to determine which questions should rank high in question search [25], [26]. Although these approaches are quite important for improving the effectiveness of question categorization, none of them considers 1) the correlation between questions asked in the same lecture, or 2) the effect of removing or keeping stopwords on the classifiers' performance.

This paper proposes a text categorization approach that can automatically identify relevant and irrelevant questions asked in a lecture by utilizing multiple types of evidence, including question text, personalization, the correlation between questions and lecture materials, the correlation between questions themselves, and finally, students' votes on questions. We show that 1) utilizing personalization along with question text is more effective than using question text alone, 2) utilizing the correlation between questions and available lecture materials improves the categorization accuracy, and 3) utilizing the correlation between questions


asked in a single lecture improves the categorization accuracy. It is shown that when lecture materials are not available, the approach of utilizing the correlations between questions asked in a lecture becomes a good alternative to the approach of utilizing the correlation between questions and available lecture materials. Finally, it is shown that elimination of stopwords leads to better correlation estimation between questions, and leads to better categorization accuracy. Yet, elimination of stopwords from the feature space of classifiers or while calculating the correlation between questions and available lecture materials does not make a significant difference.

The rest of the paper is arranged as follows: Section 2 discusses the related work. Section 3 introduces the data set used in this study. Section 4 describes the Support Vector Machines (SVM), Stopwords Removal, and Cosine Similarity with Tf-Idf and Okapi techniques. Section 5 proposes several approaches for the categorization of relevant and irrelevant questions. Section 6 discusses the experimental methodology. Section 7 presents the experiment results, and finally Section 8 concludes this work.

2 RELATED WORK

This section surveys related work on the identification of relevant and irrelevant questions in a microblogging-supported classroom. Microblogging has recently been used as a communication tool between a student and the instructor [9], [27] as well as among students [3], [7], [30]. Although all of these works are similar in the sense that they use microblogging for educational purposes, their main focus is on how microblogging can be used academically [9], [16], [27] and on how to analyze microblogging in the context of learning [3], [7], [30]. Yet, an important problem in microblogging-supported classrooms and distance classrooms is that the number of questions an instructor receives from the students can be larger than what can be answered in a limited time. Therefore, there is a need to select the best questions to respond to [4], [24]. For instance, an instructor can receive as many as 69 questions in a 50-minute lecture (see the Data section for more details), and this number can easily be even larger depending on the classroom size, the clarity of the lecture, the participation level of students, etc.

Soh et al. attempt to solve this issue in their work on teacher agents with a question ranking capability that scores each question and tries to select the best questions for the teacher to respond to [24]. Question text and personalization (i.e., favoring questions coming from a student who has been asking good questions) are used to differentiate between relevant and irrelevant questions. Although question text and personalization are quite important for identifying relevant and irrelevant questions, their work ignores 1) the correlation between questions and available lecture materials and 2) the correlation among the questions asked in a lecture. In recent work, Cetintas et al. show that using the correlation between the questions asked in a lecture and the available lecture materials, along with question text and personalization, helps to better identify relevant and


irrelevant questions [4]. Yet, they do not consider the correlation among the questions themselves asked in a lecture. Furthermore, they do not consider the effect of stopword removal on classifier performance 1) when stopwords are kept in the bag-of-words representation of the classifiers' input space, and 2) when the correlations among the questions themselves and the correlations between questions and available lecture materials are calculated.

Other related work considers community question answering environments that enable users to post their questions and have them answered by others, such as Yahoo! Answers [32], WikiAnswers [31], and Baidu Zhidao [2]. A large portion of the research in those environments focuses on 1) how to find related questions and 2) how to mine questions and answers given a new question from a user [8], [10], [12], [14]. However, the task of finding questions similar to a question of particular interest is different from the task of identifying the best questions to respond to in a lecture, and as Song et al. note [25], none of those prior works addresses the usefulness of questions. Recently, studies by Song et al. and Sun et al. utilize users' interest in the questions to find the most useful questions [25] and to recommend questions when users browse questions by category [26]. Specifically, Sun et al. utilize users' votes on the questions for recommending questions. However, the latter work does not consider personalization or the correlation among questions. Moreover, since the environment is not a classroom environment, it is not possible to see the effect of using available lecture materials on identifying the best questions to respond to, as no lecture materials are available.

3 DATA

Data collected from a personal finance class (a 300-level undergraduate course) during Fall 2009 is used in this work. The microblogging tool, called HotSeat, is owned by Purdue University and was developed by the third and fourth authors. It is specialized for classroom use, so it can be used for any course, and it fosters communication between students and the instructor as well as among the students themselves. The instructor and students of a class are members of the microblogging class. For each lecture, a separate page is created so that students can post short messages (of at most 200 characters) related to that particular lecture. As a feature suited to educational environments, students are given the opportunity to ask their questions anonymously if they wish. The system also has a binary voting mechanism such that every question can be voted for, and the number of votes a question gets indicates its popularity in the lecture. Overall, HotSeat can be seen as a counterpart of public microblogging tools (e.g., Twitter) with the mentioned specializations (e.g., voting) for classroom use. Its user interface is similar to the publicly known tools in order to make it easier for students to get used to it.

The study was conducted in a large classroom with 243 students during 24 lectures (each of them 50 minutes long). Data from the first four lectures are used for training and the remaining 20 lectures are used for testing.


TABLE 1 Examples of Relevant and Irrelevant Questions

Note the word capitalization mistakes in several questions, and the distribution of votes and of stopwords among relevant and irrelevant questions.

Each lecture has an average of 26.9 relevant questions (standard deviation 9.4) and 10.8 irrelevant questions (standard deviation 8.5). The largest total number of questions observed in a lecture is 69, and there is a very limited number of repeated (or very similar) questions due to the voting system. Examples of relevant and irrelevant questions can be seen in Table 1.

We employed two human annotators (the first author and an expert in finance) and asked them to annotate each question as either relevant or irrelevant. The annotators reached a Kappa of 0.868 on 162 questions (i.e., the questions of the first four lectures); therefore, the rest of the data was annotated by the first annotator only. Every lecture has a publicly available presentation file relevant to the lecture, and these files are used as the available relevant lecture materials without any modification. The course has a syllabus file that discusses course policies, exams, projects, quizzes, details about the utilization of HotSeat, participation grades, etc. (topics that are treated as irrelevant), and this file is used as the available irrelevant lecture material without any modification. Note that there is only one irrelevant lecture

material used in this work, as the instructor chose to combine all the materials about course policies, exams, projects, etc., into a single file. It can be seen in Table 2 that this file is much longer than the other available lecture materials because it is a combination of several documents. Details about the questions and the available lecture materials can be found in Table 2.

A total of 118 students participated by posting at least one question. Each student has 5.5 relevant questions on average (standard deviation 6.2) and 2.2 irrelevant questions (standard deviation 2.7). Out of 260 irrelevant questions, 132 were voted for by at least one student (1.9 times on average, standard deviation 0.9), and out of 645 relevant questions, 276 were voted for at least once (0.9 times on average, standard deviation 1.6). Examples of voted relevant and irrelevant questions can be found in Table 1. Note that HotSeat is a university-owned, internal system that is currently in use at Purdue University. According to university policy, neither the system nor the students' private data can be shared with the public at the moment.

TABLE 2 Statistics about Questions or Available Lecture Materials

The average number of terms for each text type is reported after stopword removal under the "without stopwords" column, and without stopword removal under the "with stopwords" column.

4 METHODS

This section describes the techniques of Support Vector Machines, Stopwords Removal and Cosine Measure with Tf-Idf and Okapi.

4.1 Support Vector Machines

Microblogging questions/messages are in textual format; therefore, detecting their types can be treated as a text categorization (TC) problem [33]. Support Vector Machines have been shown to be one of the most accurate as well as widely used text categorization techniques [15], [33]. In this work, the simplest linear version of SVM (i.e., with a linear kernel) was used as the TC classifier, which can be formulated as a solution to the following optimization problem:

$$\{\vec{w}, b\} = \min_{\vec{w}} \; \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{N} \xi_i \qquad (1)$$

subject to $y_i(\vec{w} \cdot \vec{d}_i + b) - 1 + \xi_i \geq 0$ and $\xi_i \geq 0$ for all $i$,

where $\vec{d}_i$ is the ith document represented as a bag-of-words vector in the TC task, and $y_i \in \{-1, +1\}$ is the binary classification of $\vec{d}_i$. Note that the positive class (i.e., $y = +1$) represents the relevant questions, and the negative class (i.e., $y = -1$) represents the irrelevant questions. $\vec{w}$ and $b$ are the parameters of the SVM model. $C$ controls the trade-off between classification accuracy and margin, and is tuned empirically. The categorization threshold of each SVM classifier is learned by twofold cross validation in the training phase.
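For concreteness, the following is a minimal sketch of such a linear SVM baseline over bag-of-words question text, assuming scikit-learn's CountVectorizer and LinearSVC; the toy questions, labels, and parameter values are illustrative assumptions, not the authors' exact implementation (the paper tunes C and the decision threshold by cross validation).

# Minimal sketch: linear SVM over bag-of-words question text (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training data: question texts and labels (+1 relevant, -1 irrelevant).
train_questions = ["What is compound interest?", "Will the quiz be curved?"]
train_labels = [+1, -1]

vectorizer = CountVectorizer()                  # bag-of-words representation
X_train = vectorizer.fit_transform(train_questions)

clf = LinearSVC(C=1.0)                          # C would be tuned empirically
clf.fit(X_train, train_labels)

# Classify a new question from the test lectures.
X_test = vectorizer.transform(["How is APR different from APY?"])
print(clf.predict(X_test))                      # -> array containing +1 or -1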

4.2 Stopwords Removal

As a common text preprocessing technique, stopword removal suggests that many of the most frequent terms in English such as why, where, he, she, there, is, etc., are not content words: they appear in almost every document, carry little information about the context of a document, and should be removed [23]. However, prior research has also shown that stopwords can be useful for some text categorization tasks, such as classifying math word problems with respect to their types [5]. In this paper, we investigate the effect of keeping or removing stopwords on classifier performance in two ways: 1) when they are not removed from the bag-of-words representation of the input space for the SVM classifiers; and 2) when the correlations among the questions themselves and the correlations between questions and available lecture materials are calculated with the cosine similarity measure. We used the Lemur information retrieval toolkit [17] for stopword removal and stemming (the latter is applied to the data for all the models). In particular, the INQUERY stopwords list and the Porter stemmer were utilized, respectively [20].

4.3 Cosine Similarity with Tf-Idf and Okapi

Cosine Similarity measures the similarity between two vectors by calculating the cosine of the angle between them, and it is commonly used in text mining to compare text documents. In this work, the similarity scores between questions are calculated as a measure of their correlation with the common Cosine measure [1] as follows:

$$Sim(Q_i, Q_j) = \cos(\vec{Q}_i, \vec{Q}_j) = \frac{\vec{Q}_i \cdot \vec{Q}_j}{\|\vec{Q}_i\| \, \|\vec{Q}_j\|}, \qquad (2)$$

where $Q_i$ is the ith question and $Q_j$ is the jth question in a lecture, $\vec{Q}_i$ and $\vec{Q}_j$ are their bag-of-words vector representations, and $\cdot$ denotes the dot product of these vectors. The correlation between questions and lecture materials is calculated in the same way. We use two common weighting schemes, namely Tf-Idf [1] and Okapi [21], along with the cosine measure to calculate both correlations. Tf-Idf combines term frequency with inverse document frequency (i.e., favoring discriminative terms that occur in only a small number of documents), and Okapi additionally considers document length by favoring shorter but relevant documents, since longer documents can match more terms under Tf-Idf and thus score higher.
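As an illustration, the sketch below computes Tf-Idf-weighted cosine similarities among a handful of question strings and against a lecture material, assuming scikit-learn's TfidfVectorizer and cosine_similarity; the sample texts are made up, and an Okapi-weighted variant would replace the weighting step.

# Illustrative sketch: Tf-Idf cosine similarity between questions and lecture materials.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

questions = [
    "What is the difference between a Roth IRA and a traditional IRA?",   # hypothetical
    "Which retirement account should I open first?",
    "Will HotSeat participation count toward our grade?",
]
lecture_material = ["Slides on retirement accounts: IRA, Roth IRA, 401(k) basics."]

vectorizer = TfidfVectorizer(stop_words="english")   # stopword handling is one of the studied configurations
vectorizer.fit(questions + lecture_material)
Q = vectorizer.transform(questions)
M = vectorizer.transform(lecture_material)

print(cosine_similarity(Q, Q))   # question-question correlations, as in Eq. (2)
print(cosine_similarity(Q, M))   # question-lecture-material correlations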

5 MODELS

5.1 Modeling with Terms and Personalization

Using individual features of questions (i.e., their bag-of-words representations, which will be referred to as terms) along with personalization to select the best questions to respond to has been shown to be a useful approach in recent prior work [24]. It is intuitive to use personalization along with the terms of questions, since different students have different question-asking habits and some students tend to ask more relevant (or more irrelevant) questions during the lectures than others. Therefore, personalization features enable the classification model to adapt to students' different question-asking habits by using the information from their past questions. In this work, we use two features for personalization: 1) the percentage of relevant questions asked by a student, and 2) the percentage of irrelevant questions asked by a student. An SVM classifier that only uses the terms of questions is used as a baseline, along with another SVM classifier that uses personalization together with the terms. The two baseline classifiers will be referred to as SVM_TermsOnly and SVM_TermsPers, respectively.
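A minimal sketch of the two personalization features follows, under the assumption that each past question carries its author's id and its relevant/irrelevant label; the data structures and function name are hypothetical.

# Illustrative sketch: per-student personalization features from past labeled questions.
from collections import defaultdict

def personalization_features(past_questions):
    """past_questions: iterable of (student_id, is_relevant) pairs from earlier lectures."""
    counts = defaultdict(lambda: [0, 0])          # student_id -> [relevant, irrelevant]
    for student_id, is_relevant in past_questions:
        counts[student_id][0 if is_relevant else 1] += 1

    features = {}
    for student_id, (rel, irr) in counts.items():
        total = rel + irr
        # Feature 1: percentage of relevant questions; Feature 2: percentage of irrelevant questions.
        features[student_id] = (rel / total, irr / total)
    return features

# Example: student "s1" asked 3 relevant and 1 irrelevant question so far.
history = [("s1", True), ("s1", True), ("s1", False), ("s1", True), ("s2", False)]
print(personalization_features(history))   # {'s1': (0.75, 0.25), 's2': (0.0, 1.0)}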

5.2 Modeling with Terms, Personalization, and Students' Votes

The effect of utilizing students' votes on the questions to select the best questions to respond to in a classroom environment has not been studied before. Yet, there is recent prior research [25], [26] in community question answering services. In such communities, a user asks a new question to the system, and the system finds past questions with answers relevant to the new question, in order to find relevant answers. In such a setting, utilizing the votes of users on the questions has been shown to be effective for searching relevant questions for a test question [25], and for browsing questions by category to see the popular questions and answers in that category [26]. In this modeling approach, a new binary-valued feature, namely whether a question gets more than 5 percent of all votes in a lecture or not, is incorporated in addition to the terms and personalization features that have been described above. This approach will be referred to as SVM_TermsPersVotes.
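A sketch of the vote feature under the stated 5 percent threshold; the function and variable names are illustrative.

# Illustrative: binary vote feature for one question in a lecture.
def vote_feature(votes_for_question: int, total_votes_in_lecture: int) -> int:
    """1 if the question received more than 5% of all votes cast in the lecture, else 0."""
    if total_votes_in_lecture == 0:
        return 0
    return int(votes_for_question / total_votes_in_lecture > 0.05)

print(vote_feature(3, 40))   # 3/40 = 7.5% of the lecture's votes -> 1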

5.3 Modeling with Terms, Personalization, and Similarity between Questions and Available Lecture Materials

In a lecture, it is intuitive that most relevant questions asked in class will be related to the lecture being covered. This modeling approach makes use of this fact and adds three new features about the correlation between a question and the available lecture materials (which may be relevant or irrelevant) to the set of baseline (i.e., terms and personalization) features; these features have been shown to be effective in the recent work by Cetintas et al. [4]. Particularly, the added features are the Cosine Similarity score 1) between a question and the available relevant lecture material(s) of the current lecture, 2) between a question and all available relevant lecture materials of the course (in this work, the sum of the similarity scores of the top three most similar relevant materials is used), and 3) between a question and all available nonrelevant lecture materials (in this work, the similarity score with the only irrelevant material is used; if there are more irrelevant materials, the approach in 2 can be used with the irrelevant materials). As an example, if an instructor prefers to use in-class time to answer only the questions directly related to the topic and chooses to talk about quizzes, projects, homework, exams, etc., outside of class, then documents on these topics can be used as irrelevant lecture materials. This modeling approach will be referred to as SVM_TermsPersLMSim.

5.4 Modeling with Terms, Personalization, and Similarity between Questions

By the nature of microblogging, the questions that are asked in class are often short, which makes the job of the classifiers harder due to data sparseness. Besides the terms and personalization features (i.e., the baseline features), another feature that utilizes the categorization decisions on similar questions for a test question can be added. Particularly, the new feature, called qSim, is calculated as follows:

$$QSim(Q_i) = \sum_{j=1}^{k} Sim(Q_i, R_j(Q_i)) \cdot SVM\_TermsPers(R_j(Q_i)), \qquad (3)$$

where $R_j(Q_i)$ is the jth most similar question to question $Q_i$, $SVM\_TermsPers(R_j(Q_i))$ is the decision of the SVM_TermsPers (default) classifier for question $R_j(Q_i)$, $Sim(Q_i, R_j(Q_i))$ is the Cosine Similarity between questions $Q_i$ and $R_j(Q_i)$, and $k$ is chosen to be 3. So the new feature is the weighted sum of the classification decisions on the three most similar questions to a test question in a lecture. Note that $k$ is chosen to be 3 in this work because the total number of irrelevant questions that are correlated with irrelevant questions becomes comparable to the total number of relevant questions that are correlated with irrelevant questions at ranks $k > 3$ (shown in detail in Fig. 1d). On the other hand, at ranks $k \leq 3$, a higher number of irrelevant questions is correlated with other irrelevant questions, which is desired by this modeling approach. This approach will be referred to as SVM_TermsPersQSim.
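A minimal sketch of how the qSim feature in (3) could be computed for one test question, assuming the baseline classifier's decisions and the pairwise cosine similarities within a lecture are already available; the helper name and the example numbers are hypothetical.

# Illustrative sketch of the qSim feature (Eq. 3) for a single test question.
import numpy as np

def qsim_feature(test_idx, sim_matrix, baseline_decisions, k=3):
    """
    test_idx: index of the test question within its lecture.
    sim_matrix: (n x n) cosine similarities among the lecture's questions.
    baseline_decisions: SVM_TermsPers decisions (+1 / -1) for each question.
    """
    sims = sim_matrix[test_idx].copy()
    sims[test_idx] = -np.inf                      # exclude the question itself
    top_k = np.argsort(sims)[::-1][:k]            # indices of the k most similar questions
    return sum(sim_matrix[test_idx, j] * baseline_decisions[j] for j in top_k)

# Example with four questions in a lecture (similarities and decisions are made up).
S = np.array([[1.0, 0.6, 0.1, 0.4],
              [0.6, 1.0, 0.2, 0.3],
              [0.1, 0.2, 1.0, 0.0],
              [0.4, 0.3, 0.0, 1.0]])
decisions = np.array([+1, +1, -1, -1])
print(qsim_feature(0, S, decisions))   # 0.6*(+1) + 0.4*(-1) + 0.1*(-1) = 0.1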

6 EXPERIMENTAL METHODOLOGY: EVALUATION METRIC

To evaluate the effectiveness of the relevant question selection task, we use the common F1 measure, which is the harmonic mean of precision (p) and recall (r) [1]. Precision is the ratio of the correct categorizations made by a model to all the categorizations made by that model. Recall is the ratio of the correct categorizations made by a model to the total number of categorizations that should have been made. A higher F1 value indicates both high recall and high precision.
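In symbols, this is the standard definition:

$$F_1 = \frac{2\,p\,r}{p + r}.$$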

7 EXPERIMENTAL RESULTS

This section presents the experimental results of the methods proposed in the Models section. All the models were evaluated on the data set described in the Data section. An extensive set of experiments was conducted to address the following questions:

1. How effective are the following three models compared to each other: a) SVM_TermsOnly, which utilizes only question terms, b) SVM_TermsPers, which utilizes personalization along with question terms, and c) SVM_TermsPersVotes, which utilizes students' votes along with personalization and question terms?
2. How effective is the approach of utilizing the similarity between questions and available lecture materials along with personalization and question terms?
3. How effective is the approach of utilizing the similarity between questions asked in a lecture?
4. What is the effect of stopword removal on classifier performance?

7.1 The Performance of Modeling with Terms, Personalization, and Students' Votes (i.e., SVM_TermsPersVotes)

The first set of experiments was conducted to compare the SVM_TermsOnly, SVM_TermsPers, and SVM_TermsPersVotes classifiers (described in Sections 5.1 and 5.2) with each other on the categorization of relevant and irrelevant questions asked in a lecture. Their performance can be seen in Table 3. It can be seen that the SVM_TermsPers classifier outperforms both the SVM_TermsOnly and SVM_TermsPersVotes classifiers. Paired t-tests have been applied for this set of experiments (in different configurations), and statistical significance with p-value less than 0.01 has been achieved in favor of using personalization along with terms over using 1) only terms and 2) students' votes along with personalization and terms (in the case when stopwords are removed). Utilizing personalization along with terms is therefore a better approach than using only the terms of questions. This observation is consistent with prior research [24]. Using students' votes along with personalization and terms leads to a significantly lower (i.e., with p-value less than 0.01) performance than using only personalization and terms when stopwords are removed, and does not make a significant difference in performance (i.e., with p-value more than 0.01) when stopwords are included. To the best of our knowledge, this is the first set of experiments using students' votes for selecting the best questions to respond to in a classroom environment.


Fig. 1. Statistics about the correlations between relevant questions (seen in (a) and (b)) and all questions; and irrelevant questions (seen in (c) and (d)) and all questions. Each relevant question has a list of similar (relevant or irrelevant) questions (from the same lecture with that question) ranked with respect to the similarity scores while determining the most similar questions. The total number of relevant and irrelevant questions at a particular rank k for all relevant questions is reported in (a) when stopwords are included, and in (b) when they are removed, while calculating the correlations with Tf-Idf similarity measure. The corresponding statistics are reported for irrelevant questions in (c) and (d) similarly. Detailed view of each graph for the top 10 most similar questions (note that in this work we only choose the top three most similar past questions for each test question) can be seen on the smaller graphs on the right top of each graph.

Results show that students' votes, unlike their positive effect in community question answering systems [25], [26], should be considered carefully in this setting. This result may be explained by the previously reported finding that educational materials (here, questions) liked (i.e., voted for) by users (i.e., students) may not be pedagogically good for them [28]. Having produced the most accurate categorization results so far, the SVM_TermsPers classifier is used as the main baseline for the rest of the experiments.
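The significance tests above are paired t-tests over matched results of two classifiers; the following sketch shows how such a comparison could be run, assuming SciPy and hypothetical per-lecture F1 values.

# Illustrative paired t-test over per-lecture F1 scores of two classifiers.
from scipy import stats

f1_terms_pers = [0.90, 0.88, 0.93, 0.91, 0.89]        # hypothetical per-lecture F1 values
f1_terms_pers_votes = [0.87, 0.85, 0.90, 0.88, 0.86]

t_stat, p_value = stats.ttest_rel(f1_terms_pers, f1_terms_pers_votes)
print(t_stat, p_value)   # the significance claims in the paper require p < 0.01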

7.2 The Performance of Modeling with Terms, Personalization, and the Similarity between Questions and Available Lecture Materials (i.e., SVM_TermsPersLMSim)

The second set of experiments was conducted to compare the SVM_TermsPersLMSim and SVM_TermsPers classifiers; their performance can be seen in Table 4. It can be seen that the SVM_TermsPersLMSim classifier significantly outperforms (with p-value much less than 0.01) the SVM_TermsPers classifier.

TABLE 3 Results of the SVM_TermsOnly, SVM_TermsPers, and SVM_TermsPersVotes Classifiers in Comparison to Each Other

Results of the SVM_TermsOnly, SVM_TermsPers, and SVM_TermsPersVotes classifiers in comparison to each other for two configurations: when stopwords are 1) removed and 2) not removed from the bag-of-words representation of the input space for the SVM classifiers. The performance is evaluated with the F1 measure and “F1 (precision, recall)” triplets are reported for each configuration of each classifier.


TABLE 4 Results of the SVM_TermsPersLMSim and SVM_TermsPersQSim Classifiers in Comparison to SVM_TermsPers Classifier for Stopword and Weighting Scheme Configurations

Results of the SVM_TermsPersLMSim and SVM_TermsPersQSim classifiers are shown in comparison to SVM_TermsPers classifier for two main configurations (while correlations between questions themselves and correlations between questions and available lecture materials are calculated): 1) with Tf-Idf and Okapi weighting schemes and 2) when stopwords are removed or not. The performance is evaluated with the F1 measure and “F1 (precision, recall)” triplets are reported for each configuration of each classifier.

Utilizing the correlations between questions and available lecture materials along with terms and personalization is thus a better approach than using only personalization and the terms of questions. To assess the similarity between questions and available lecture materials, two common weighting schemes are used with the Cosine Similarity measure. It can be seen that the Tf-Idf weighting scheme significantly outperforms (with p-value less than 0.01) the Okapi weighting scheme in almost all of the experiments.

7.3 The Performance of Modeling with Terms, Personalization, and the Similarity between Questions (i.e., SVM_TermsPersQSim)

The third set of experiments was conducted to compare the SVM_TermsPersQSim and SVM_TermsPers classifiers; their performance can be seen in Table 4. It can be seen that the SVM_TermsPersQSim classifier significantly outperforms (with p-value less than 0.01) the SVM_TermsPers classifier and has a comparable performance (p-value more than 0.05) with the SVM_TermsPersLMSim classifier. Utilizing the correlations among questions along with terms and personalization is therefore 1) a better approach than only using personalization and the terms of questions, and 2) comparable with using available lecture materials along with personalization and terms. This also shows that when no lecture materials are available, utilizing the correlation between questions can be used as an effective alternative method. In further experimentation, a hybrid classifier that uses a basic combination of the features of the SVM_TermsPersLMSim and SVM_TermsPersQSim approaches has been found to be not significantly different from the individual approaches. For the weighting schemes, Table 4 also shows that Tf-Idf significantly outperforms (with p-value less than 0.01) Okapi for this set of experiments as well. It can also be seen in Table 4 that SVM_TermsPersLMSim has the best precision over all classifiers while SVM_TermsPersQSim achieves the highest recall.

It is important to note that, as this work deals with unbalanced positive and negative classes, a trivial classifier that accepts all questions as relevant will have an F1 value of 0.867 with a precision value of 0.765 (as 76.5 percent of all test questions are relevant questions) and a recall value of 1. The F1 value of this trivial classifier is comparable with the best-performing

classifiers presented in this work (while beating all others), and its recall achieves the perfect value (beating all the presented classifiers). On the other hand, it is important to note that precision is relatively more important than recall in this work, and the precision of the trivial classifier is significantly lower than the precision achieved by any of the presented classifiers (including the baseline classifier). The fact that the trivial classifier achieves an F1 value comparable to the best-performing classifiers shows that discriminating relevant and irrelevant questions in a microblogging-supported classroom is a challenging task. The challenge is mainly due to two reasons. First, questions are short (as mentioned before), which makes the classification task significantly harder due to data sparsity. Second, questions are asked about topics in a limited domain (e.g., personal finance), which makes relevant and irrelevant questions highly correlated with each other (as can be observed in Figs. 1c and 1d); this high correlation makes the classification task significantly harder as well.
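As a quick check of the trivial classifier's reported numbers, with $p = 0.765$ and $r = 1$:

$$F_1 = \frac{2 \cdot 0.765 \cdot 1}{0.765 + 1} \approx 0.867.$$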

7.4 The Effect of Eliminating Stopwords

The fourth set of experiments was conducted to evaluate the effect of including or excluding stopwords 1) as terms that constitute the feature space of the SVM models, and 2) as terms that constitute the feature space used by the Cosine Similarity to measure the similarity between questions in a lecture and the similarity between questions and available lecture materials. It can be seen in Table 3 that neither including nor excluding stopwords leads to consistent performance improvements (significant with p-value less than 0.01) over the other. Yet, including stopwords seems to be more robust and is consistent with prior work [24]; therefore, it is chosen as the default for the rest of the experiments (i.e., Table 4). In the same way, removing stopwords while calculating the correlations between questions and available lecture materials has been found to be not significantly different from including them, as can be seen in Table 4. In further experimentation, removing stopwords from the feature space of all SVM classifiers has been found to be not significantly different from the results reported in Table 4. The detailed results for this set of experiments can be seen in Table 5.


TABLE 5 Results of the SVM_TermsPersLMSim and SVM_TermsPersQSim Classifiers in Comparison to SVM_TermsPers Classifier for Stopword Configuration

Results of the SVM_TermsPersLMSim and SVM_TermsPersQSim classifiers are shown in comparison to the SVM_TermsPers classifier for the stopword configuration: when stopwords are 1) removed and 2) not removed from the bag-of-words representation of the input space for the SVM classifiers. Note that the Tf-Idf weighting scheme is used with Cosine Similarity for the SVM_TermsPersLMSim and SVM_TermsPersQSim classifiers, and stopwords are removed during the estimation of the correlation between questions (i.e., for SVM_TermsPersQSim) and between questions and available lecture materials (i.e., for SVM_TermsPersLMSim). The performance is evaluated with the F1 measure, and "F1 (precision, recall)" triplets are reported for each configuration of each classifier.

Removing stopwords from the feature space used by the Cosine Similarity while calculating the correlation between questions in a lecture has been found to be significantly (with p-value less than 0.01) better than including them (when the Tf-Idf weighting scheme is used). Questions are short texts, and correlations between short texts are highly affected by including or excluding stopwords. Terms such as when, how, why, it, they, and are can lead to increased similarity between two unrelated questions that have those words in common. It can be seen in Fig. 1 that the correlation estimation technique that includes stopwords finds many more similar questions for irrelevant questions (i.e., it overestimates the similarity between questions); however, most of them are relevant questions, which should not be the case. In contrast, the technique that excludes stopwords leads to more conservative similarity estimates, and more irrelevant similar questions are found for the irrelevant test questions (especially at top ranks). In the same way, the correlation estimation technique that excludes stopwords finds much lower similarity between relevant and irrelevant questions.

8 CONCLUSIONS, DISCUSSIONS, AND FUTURE WORK

This paper proposes a novel application of text categorization to identify relevant and irrelevant microblogging questions asked in a classroom. Several modeling approaches and several weighting or preprocessing configurations are studied for this application through extensive experiments. Empirical results show that utilizing personalization together with question text is more effective than only using question text. It has been shown to be beneficial to utilize the correlation among questions and available lecture materials as well as the correlations between questions asked in a lecture. Furthermore, it is found to be significantly more effective to remove stopwords when calculating the correlations among questions themselves. Finally, utilizing students’ votes on questions is found not to be effective, although it has been shown to be useful in community question answering environments for question quality assessment. There are several possibilities to extend the research. First, the SVM classifiers in this work are using single

words as their features (i.e., a unigram model); however, it is possible to utilize NLP techniques such as n-gram models and part-of-speech (POS) tagging to further enrich the feature space. However, it should be noted that user-generated questions are prone to be ill formatted (i.e., misspellings and abbreviations are frequently encountered); therefore, it can be difficult to analyze them with NLP techniques [18]. Yet, in this work, the majority of ill-formatted questions have only word capitalization mistakes that do not affect the techniques presented here. Therefore, this work does not explicitly deal with misspellings, abbreviations, etc., and uses the questions as they are. In a similar work on question subjectivity analysis over user-generated questions, it is noted that the gain acquired by using only the question text is comparable to the gain acquired by utilizing NLP, which generates more complex features; therefore, using NLP is not worth the increased time and space complexity [18]. Still, it is worthwhile to explore the effect of utilizing NLP on this application 1) to enrich the feature space and 2) to deal with the ill-formatted questions thoroughly in a separate work. Second, only one course is used for experimentation in this work. More courses can be used to assess the robustness of the proposed algorithms. Future work will be conducted mainly in those directions.

ACKNOWLEDGMENTS

This research was partially supported by US National Science Foundation grants IIS-0749462, IIS-0746830, and DUE-1021975. Any opinions, findings, conclusions, or recommendations expressed in this paper are the authors' and do not necessarily reflect those of the sponsor.

REFERENCES

[1] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, pp. 75-82, Addison Wesley, 1999.
[2] Baidu Zhidao, http://zhidao.baidu.com, Dec. 2010.
[3] K. Borau, C. Ullrich, J. Feng, and R. Shen, "Microblogging for Language Learning: Using Twitter to Train Communicative and Cultural Competence," Proc. Eighth Int'l Conf. Web Based Learning, pp. 78-87, 2009.
[4] S. Cetintas, L. Si, Y.P. Xin, S. Chakravarty, H. Aagard, and K. Bowen, "Learning to Identify Students' Relevant and Irrelevant Questions in a Micro-Blogging Supported Classroom," Proc. 10th Intelligent Tutoring Systems (ITS '10) Conf., pp. 281-284, 2010.
[5] S. Cetintas, L. Si, Y.P. Xin, D. Zhang, and J.Y. Park, "Automatic Text Categorization of Mathematical Word Problems," Proc. 22nd Int'l FLAIRS Conf., pp. 27-32, 2009.
[6] J. Chen, R. Nairn, L. Nelson, M. Bernstein, and E. Chi, "Short and Tweet: Experiments on Recommending Content from Information Streams," Proc. 28th ACM Int'l Conf. Human Factors in Computing Systems, pp. 1185-1194, 2010.
[7] C. Costa, G. Beham, W. Reinhardt, and M. Sillaots, "Microblogging in Technology Enhanced Learning: A Use-Case Inspection of PPE Summer School 2008," Proc. Workshop Social Information Retrieval for Technology Enhanced Learning, http://www.know-center.tu-graz.ac.at/content/download/1650/8519/file/2008_ccosta_microblogging.pdf, Dec. 2008.
[8] H. Duan, Y. Cao, C.-Y. Lin, and Y. Yu, "Searching Questions by Identifying Question Topic and Question Focus," Proc. 46th ACL Conf., pp. 156-164, 2008.
[9] G. Grosseck and C. Holotescu, "Can We Use Twitter for Educational Activities?" Proc. Fourth Int'l Scientific Conf. eLearning and Software for Education, 2008.
[10] P. Han, R. Shen, F. Yang, and Q. Yang, "The Application of Case Based Reasoning on Q&A System," Proc. Australian Joint Conf. Artificial Intelligence, pp. 704-713, 2002.
[11] A. Java, X. Song, T. Finin, and B. Tseng, "Why We Twitter: Understanding Micro-Blogging Usage and Communities," Proc. Ninth WEBKDD Conf., pp. 56-65, 2007.
[12] J. Jeon, B. Croft, and J. Lee, "Finding Semantically Similar Questions Based on Their Answers," Proc. 28th ACM SIGIR Conf., pp. 617-618, 2005.
[13] J. Jeon, B. Croft, J. Lee, and S. Park, "A Framework to Predict the Quality of Answers with Non-Textual Features," Proc. 29th ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 228-235, 2006.
[14] V. Jikoun and M. de Rijke, "Retrieving Answers from Frequently Asked Questions Pages on the Web," Proc. 14th ACM Int'l Conf. Information and Knowledge Management (CIKM '05), pp. 76-83, 2005.
[15] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proc. 10th European Conf. Machine Learning (ECML '98), pp. 137-142, 1998.
[16] J. Keefer, "How to Use Twitter in Higher Education," http://silenceandvoice.com/archives/2008/03/31/how-to-use-twitter-in-highereducation, Mar. 2008.
[17] Lemur IR Toolkit, http://www.lemurproject.org, Dec. 2010.
[18] B. Li, Y. Liu, A. Ram, E. Garcia, and E. Agichtein, "Exploring Question Subjectivity Prediction in Community QA," Proc. 31st ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 735-736, 2008.
[19] M. Michelson and S. Macskassy, "Discovering Users' Topics of Interest on Twitter: A First Look," Proc. Fourth Workshop Analytics for Noisy Unstructured Data, in conjunction with the 19th ACM CIKM Conf., 2010.
[20] M.F. Porter, "An Algorithm for Suffix Stripping," Program: Electronic Library and Information Systems, vol. 14, no. 3, pp. 130-137, 1980.
[21] S. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau, "Okapi at TREC," Proc. First Text REtrieval Conf. (TREC '92), pp. 21-30, 1992.
[22] C.J. van Rijsbergen, Information Retrieval, second ed., Univ. of Glasgow, 1979.
[23] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[24] L. Soh, N. Khandaker, and H. Jiang, "I-MINDS: A Multiagent System for Intelligent Computer-Supported Collaborative Learning and Classroom Management," Int'l J. Artificial Intelligence in Education, vol. 18, no. 2, pp. 119-151, 2008.
[25] Y. Song, C. Cao, and H. Rim, "Question Utility: A Novel Static Ranking of Question Search," Proc. 23rd Nat'l Conf. Artificial Intelligence (AAAI '08), pp. 1231-1236, 2008.
[26] K. Sun, Y. Cao, X. Song, Y. Song, X. Wang, and C.-Y. Lin, "Learning to Recommend Questions Based on User Ratings," Proc. 18th ACM Conf. Information and Knowledge Management (CIKM '09), pp. 751-758, 2009.
[27] K.D. Sweetser, "Teaching Tweets," http://www.kayesweetser.com, Dec. 2008.
[28] T.Y. Tang and G. McCalla, "Smart Recommendation for an Evolving E-Learning System," Int'l J. E-Learning, vol. 4, no. 1, pp. 105-130, 2005.
[29] Twitter, http://twitter.com, Dec. 2010.
[30] C. Ullrich, K. Borau, H. Luo, X. Tan, L. Shen, and R. Shen, "Why Web 2.0 Is Good for Learning and for Research: Principles and Prototypes," Proc. 17th ACM Int'l Conf. World Wide Web (WWW '08), pp. 705-714, 2008.
[31] WikiAnswers, http://wiki.answers.com, Dec. 2010.
[32] Yahoo! Answers, http://answers.yahoo.com, Dec. 2010.
[33] Y. Yang and X. Liu, "A Re-Examination of Text Categorization Methods," Proc. 22nd ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 42-49, 1999.

Suleyman Cetintas received the BS degree in computer engineering from Bilkent University, Turkey. He is currently working toward the PhD degree in computer science at Purdue University. His primary interests lie in the areas of information retrieval, machine learning, intelligent tutoring systems, and text mining. He has also worked in the areas of privacy-preserving data mining and social network analysis. He is a member of the ACM, ACM SIGIR, and the International Artificial Intelligence in Education (IAIED) Society.

Luo Si received the PhD degree from Carnegie Mellon University in 2006. He is an assistant professor in the Computer Science Department and Statistics Department (by courtesy), Purdue University. His main research interests include information retrieval, knowledge management, machine learning, intelligent tutoring systems, and text mining. His research has been supported by the US National Science Foundation (NSF), the State of Indiana, Purdue University, and industry companies. He received the NSF CAREER award in 2008.

Hans Peter Aagard received the master of science degree in educational technology in 2004 from Purdue University, where he is currently working toward the PhD degree in educational technology. He is a senior educational technologist in the Central Technology Organization at Purdue University. His focus has been on technologies that disrupt or fundamentally change the classroom and on systemic change in higher education.

Kyle Bowen is the director of informatics at Purdue University, where he leads a development group focused on creating new applications for teaching, learning, and scientific discovery. He recently led the development of HotSeat, a new social-networking-powered tool that enables students to collaborate via Twitter or Facebook both inside and outside of the classroom. Author of the Exploring Dreamweaver series of books, he has also contributed to or served as technical director for more than 20 books within the areas of web design, development, and usability. His work has been featured by the New York Times, USA Today, CNET, and The Chronicle of Higher Education. His broad range of experience includes developing web strategies, applications, and instructional media within industry, government, and higher education.

Mariheida Cordova-Sanchez received the BS degree in computer engineering from the University of Puerto Rico, Mayagüez. She is currently working toward the PhD degree in computer science at Purdue University. Her areas of interest lie in the fields of information retrieval, machine learning, and education. In the past, she also worked in the areas of human computer interaction and computer graphics.