Topic 1: Evaluation methods (Dr. Wiebe)

Author: Imogene Hensley
Themes:
- Evaluation methods in Classification
- Evaluation methods in Clustering
- Evaluation methods in NLP
  - Corpus creation
  - Evaluating MT
  - Evaluating GEC
  - Evaluating Parsers
  - Evaluating Paraphrasing
  - Evaluating Summarization

Evaluation methods in Classification

1. M. Sokolova, N. Japkowicz, S. Szpakowicz. Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation. Advances in Artificial Intelligence, 2006.
2. Janez Demšar. Statistical Comparisons of Classifiers over Multiple Data Sets. The Journal of Machine Learning Research, 2006.
   a. This paper is related to chapter 4 of the book "Empirical Methods for AI". I picked it because it is newer and highly cited.
3. Stefano Baccianella, Andrea Esuli and Fabrizio Sebastiani. Evaluation measures for ordinal regression. Intelligent Systems Design and Applications, 2009.
   This paper discusses evaluation methods for ordinal regression (also known as ordinal classification), a type of multi-class classification in which the classes are ordered but the numeric differences between them are not meaningful.
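As a concrete reminder of the measures discussed in papers 1-3, here is a minimal sketch (my own toy code, not taken from any of the papers) of macro-averaged F1, which averages per-class F1 so that rare classes count as much as frequent ones:

```python
def macro_f1(gold, pred):
    """Macro-averaged F1 over parallel lists of gold and predicted labels."""
    labels = set(gold) | set(pred)
    f1s = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        # Per-class F1 is the harmonic mean of precision and recall.
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Plain accuracy would hide a class the classifier never predicts; the macro average does not, which is one motivation behind the discriminant measures in paper 1.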

Evaluation methods in Clustering

This section describes some clustering evaluation methods. I used some of them in my fragmentation project to evaluate the accuracy of fragmentation methods against the gold-standard fragments.

4. C. D. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval, chapter 16.3: Evaluation of clustering. 2008.
5. N. X. Vinh, J. Epps, J. Bailey. Information theoretic measures for clusterings comparison: is a correction for chance necessary? ICML, 2009.
6. Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 2001.
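For example, purity, one of the external criteria covered in Manning et al.'s chapter 16.3, assigns each cluster its majority gold class and measures the fraction of items explained that way. A minimal sketch (my own toy code):

```python
from collections import Counter

def purity(clusters, gold):
    """Purity of a clustering against gold classes.
    `clusters` and `gold` are parallel lists of cluster ids / class labels."""
    by_cluster = {}
    for c, g in zip(clusters, gold):
        by_cluster.setdefault(c, Counter())[g] += 1
    # Each cluster contributes the size of its largest gold class.
    majority_total = sum(cnt.most_common(1)[0][1] for cnt in by_cluster.values())
    return majority_total / len(gold)
```

Purity is trivially maximized by putting every item in its own cluster, which is exactly why chance-corrected measures such as those in paper 5 exist.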

Evaluation methods in NLP

In this section, I specifically consider evaluation methods for various NLP tasks. The goal is to learn different evaluation ideas and how to devise an evaluation method when dealing with a slightly different NLP task.

Corpus creation

Creating corpora by manually annotating data.

7. Ron Artstein and Massimo Poesio. Inter-coder agreement for computational linguistics. Computational Linguistics, 2008.

8. Rebecca Passonneau. Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. LREC, 2006.
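The agreement coefficients surveyed in paper 7 share one structure: observed agreement corrected for the agreement expected by chance. A minimal two-annotator sketch of Cohen's kappa (my own toy code, not from the paper):

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa for two annotators over parallel label lists."""
    n = len(ann1)
    # Observed agreement: fraction of items labeled identically.
    po = sum(1 for a, b in zip(ann1, ann2) if a == b) / n
    # Expected agreement: chance overlap of each annotator's label distribution.
    c1, c2 = Counter(ann1), Counter(ann2)
    pe = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (po - pe) / (1 - pe)
```

MASI (paper 8) replaces the all-or-nothing match in the observed-agreement term with a set-overlap distance, so partially overlapping set-valued annotations still earn partial credit.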

Evaluating MT

9. Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. ACL, 2002.
10. Andrew Mutton, Mark Dras, Stephen Wan, Robert Dale. GLEU: Automatic evaluation of sentence-level fluency. ACL, 2007.
11. Nguyen Bach, Fei Huang, Yaser Al-Onaizan. Goodness: A method for measuring machine translation confidence. ACL, 2011.
12. Joshua Albrecht and Rebecca Hwa. A re-examination of machine learning approaches for sentence-level MT evaluation. ACL, 2007.
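To make the idea behind BLEU (paper 9) concrete, here is a toy sentence-level sketch: clipped n-gram precision combined by a geometric mean and scaled by a brevity penalty. Real BLEU is corpus-level, goes up to 4-grams with multiple references, and uses smoothing; this stripped-down version is my own illustration:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def toy_bleu(candidate, reference, max_n=2):
    """Toy single-reference, sentence-level BLEU (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

The clipping step is what stops a degenerate output like "the the the" from earning unigram credit for every repeated "the".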

Evaluating GEC

13. Martin Chodorow. Problems in Evaluating Grammatical Error Detection Systems. COLING, 2012.
14. Daniel Dahlmeier and Hwee Tou Ng. Better evaluation for grammatical error correction. NAACL, 2012.
15. Nitin Madnani, Joel R. Tetreault, Martin Chodorow, Alla Rozovskaya. They can help: using crowdsourcing to improve the evaluation of grammatical error detection systems. ACL, 2011.

Evaluating Parsers

16. Jennifer Foster. Treebanks gone bad: parser evaluation and retraining using a treebank of ungrammatical sentences. IJDAR, 2007.
17. John Carroll, Ted Briscoe, Antonio Sanfilippo. Parser evaluation: a survey and a new proposal. LREC, 1998.
18. Stephen Clark and James R. Curran. Formalism-Independent Parser Evaluation with CCG and DepBank. ACL, 2007.
19. Reut Tsarfaty, Joakim Nivre, Evelina Andersson. Cross-framework evaluation for statistical parsing. EACL, 2012.

Evaluating Paraphrasing

20. David L. Chen and William B. Dolan. Collecting highly parallel data for paraphrase evaluation. ACL, 2011.
21. Chang Liu, Daniel Dahlmeier, Hwee Tou Ng. PEM: A paraphrase evaluation metric exploiting parallel texts. EMNLP, 2010.

Evaluating Summarization

22. Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out, ACL workshop, 2004.
23. Ani Nenkova, Rebecca Passonneau. Evaluating content selection in summarization: The pyramid method. NAACL, 2004.
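Where BLEU is precision-oriented, the core of ROUGE-N (paper 22) is n-gram recall against reference summaries: how much of the reference does the candidate cover? A single-reference toy sketch (my own code, not the ROUGE package):

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Toy ROUGE-N recall: fraction of reference n-grams found in the candidate."""
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    return overlap / max(sum(ref.values()), 1)
```

Recall orientation fits summarization: a summary that omits reference content should be penalized, while extra fluent words matter less than in MT.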



Topic 2: Education in NLP (Dr. Litman)

In this topic, I first investigate general areas of AIED and then consider sub-areas that focus on NLP methods for different educational tasks.

Themes:
- Artificial Intelligence in Education
  - Educational Data Mining
  - Collaborative Learning
  - Educational Games
  - Intelligent Tutoring Systems
- NLP for Educational applications
  - Essay assessment
  - Tutorial Dialogue Systems
    - Spoken Dialogue Tutoring
    - Typed Dialogue Tutoring
- NLP techniques for Education
  - Discourse Analysis

Artificial Intelligence in Education

The goal of AIED is to design computer-based systems that support learning. Below is a list of sub-areas in this field, along with a few papers that discuss some of their techniques.

Educational Data Mining

Data mining techniques such as classification, clustering, outlier detection, and pattern mining can be applied to educational systems to discover knowledge.

1. Ryan S. J. D. Baker and Kalina Yacef. The state of educational data mining in 2009: A review and future visions. JEDM - Journal of Educational Data Mining, 2009.
   This survey paper discusses trends and shifts in educational data mining.
2. Käser, Busetto, Solenthaler, Kohn, von Aster, Gross. Cluster-based Prediction of Mathematical Learning Patterns. AIED, 2013.
3. Samad Kardan, Ido Roll and Cristina Conati. The usefulness of log based clustering in a complex simulation environment. ITS, 2014.

Collaborative Learning

This set of papers briefly shows some techniques for collaborative learning.

4. Jennifer Olsen, Daniel Belenky, Vincent Aleven and Nikol Rummel. Using an Intelligent Tutoring System to Support Collaborative as well as Individual Learning. ITS, 2014.
5. Roberto Martinez-Maldonado, Judy Kay, Kalina Yacef. An automatic approach for mining patterns of collaboration around an interactive tabletop. AIED, 2013.
6. David Adamson, Carolyn Rosé. Coordinating Multi-dimensional Support in Collaborative Conversational Agents. ITS, 2012.

Educational Games This section contains some samples of educational games.

7. Erica L. Snow, G. Tanner Jackson, Laura K. Varner, Danielle S. McNamara. Expectations of Technology: A Factor to Consider in Game-Based Learning Environments. AIED, 2013.
8. Andrew Head and Jingtao Wang. ToneWars: Connecting Second Language Learners and Native Speakers in Collaborative Mobile Games. ITS, 2014.
9. Jennifer L. Sabourin, Lucy R. Shores, Bradford W. Mott, James C. Lester. Understanding and Predicting Student Self-Regulated Learning Strategies in Game-Based Learning Environments. IJAIED, 2013.

Intelligent Tutoring Systems

10. Kurt VanLehn. The Behavior of Tutoring Systems. IJAIED, 2006.
11. Vincent Aleven, Bruce M. McLaren, Jonathan Sewall, Kenneth R. Koedinger. A new paradigm for intelligent tutoring systems: Example-tracing tutors. IJAIED, 2009.
12. Mingyu Feng, Jeremy Roschelle, Neil Heffernan, Janet Fairman, Robert Murphy. Implementation of an Intelligent Tutoring System for Online Homework Support in an Efficacy Trial. ITS, 2014.

NLP for Educational applications

This section focuses on the use of NLP techniques in educational applications.

Essay assessment

Essay assessment is among the first educational applications to use NLP methods, automatically evaluating student essays. Since papers on readability evaluation and grammatical error detection are covered in my special topic, I will not include them in this sub-area.

13. Jill Burstein. Opportunities for natural language processing research in education. Computational Linguistics and Intelligent Text Processing, 2009.
    a. This paper gives a good introduction to the field of NLP in education. It uses two essay assessment systems as examples and focuses on their NLP aspects. I think it would be helpful to read because the author is one of the major researchers in this area and she provides a good discussion of the NLP components of the two sample systems.
14. Hongbo Chen and Ben He. Automated Essay Scoring by Maximizing Human-Machine Agreement. EMNLP, 2013.
15. Ou Lydia Liu, Chris Brew, John Blackmore, Libby Gerard, Jacquie Madhok, and Marcia C. Linn. Automated Scoring of Constructed-Response Science Items: Prospects and Obstacles. Educational Measurement: Issues and Practice, 2014.
16. Isaac Persing and Vincent Ng. Modeling Prompt Adherence in Student Essays. ACL, 2014.

Tutorial Dialogue Systems

This section presents a number of techniques for using dialogue in educational applications. The interaction with these systems can be either spoken or typed.

17. K. VanLehn, A. Graesser, G. T. Jackson, P. Jordan, A. Olney and C. Rosé. When are Tutorial Dialogues More Effective than Reading? Cognitive Science, 2007.
    This paper presents a systematic study testing whether learning through dialogue is more effective than learning through reading. I liked the way they designed their study: they considered tutoring systems with typed dialogues as well as spoken and typed human-tutoring conditions.

Spoken Dialogue Tutoring

19. Maxine Eskenazi. An overview of spoken language technology for education. Speech Communication, 2009.
    This paper gives a brief history of speech technology for education and discusses some of its main issues. I think reading it will give me a good overview of the field.
20. Kate Forbes-Riley, Diane Litman. Benefits and challenges of real-time uncertainty detection and adaptation in a spoken dialogue computer tutor. Speech Communication, 2011.

Typed Dialogue Tutoring

21. Myroslava Dzikovska, Natalie Steinhauser, Elaine Farrow, Johanna Moore, Gwendolyn Campbell. BEETLE II: Deep Natural Language Understanding and Automatic Feedback Generation for Intelligent Tutoring in Basic Electricity and Electronics. IJAIED, 2014.
    This paper discusses an intelligent tutoring system in which the student types the dialogue turns.
22. Kristy Elizabeth Boyer, Robert Phillips, Amy Ingram, Eun Young Ha, Michael D. Wallis, Mladen A. Vouk, and James C. Lester. Investigating the Relationship Between Dialogue Structure and Tutoring Effectiveness: A Hidden Markov Modeling Approach. IJAIED, 2011.
    This paper studies human-human tutoring using textual dialogue messages.

NLP techniques for Education

In this section, I focus on some common NLP methods that are applied to educational applications.

Discourse Analysis

Discourse analysis comprises a number of methods for analyzing text or speech. In this section, I include just a few sample discourse analysis approaches.

23. Nitin Madnani, Michael Heilman, Joel Tetreault, and Martin Chodorow. Identifying high-level organizational elements in argumentative discourse. NAACL, 2012.
24. Swapna Somasundaran, Jill Burstein and Martin Chodorow. Lexical Chaining for Measuring Discourse Coherence Quality in Test-taker Essays. COLING, 2014.
25. Christian Stab, Iryna Gurevych. Identifying Argumentative Discourse Structures in Persuasive Essays. EMNLP, 2014.
    The corpus they used contains essays posted on easyforum, which they describe as an active community that provides feedback on different kinds of essays; for instance, students post their essays there. So I expect the approach they propose could also be considered for educational purposes.





Topic 3: Grammar Error Correction & Robust Parsing (Dr. Hwa)

Themes:
- Grammatical Errors
  - ESL errors and corpora
  - MT errors
  - Social media language
  - Artificial errors
- Error detection and correction
  - GEC in general
  - GEC using ML
  - GEC using MT
  - GEC using rules
  - GEC evaluation
- Robust Parsing
  - Judging grammaticality
  - Parsing ungrammatical sentences

Grammatical Errors

In order to detect and then correct errors, it is useful to survey the different types of errors in disfluent sentences (such as MT output and ESL writing) and to cover some of the common annotation standards and corpora.

ESL errors and corpora

1. Diane Nicholls. The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT. Corpus Linguistics 2003 conference.
   This technical report describes the ESL error categories of the FCE corpus, so it can be considered supplementary reading for the FCE paper (item 3 below).
2. Alla Rozovskaya and Dan Roth. Annotating ESL errors: Challenges and rewards. NAACL HLT Fifth Workshop on Innovative Use of NLP for Building Educational Applications, 2010.
   This paper introduces the UIUC error categories. I think it can serve as one of the references in a survey of ESL annotation.
3. Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. A new dataset and method for automatically grading ESOL texts. ACL, 2011.
   This paper introduces both the FCE corpus and some learning methods for assessing ESL essays.

MT errors

4. David Vilar, Jia Xu, Luis Fernando D'Haro, and Hermann Ney. Error analysis of statistical machine translation output. LREC, 2006.
   This is an early work on error analysis of MT output. It introduces an error categorization that is frequently used in the community; I think this paper should be mentioned in any survey of MT error types.

Social media language

5. Jacob Eisenstein. What to do about bad language on the internet. NAACL, 2013.
   This paper shows the challenges that social media poses for traditional NLP methods. Since social media is another domain of disfluent sentences, I expect it would be interesting to compare it with ESL and MT sentences.

Artificial errors

6. Jennifer Foster and Øistein Andersen. GenERRate: generating errors for use in grammatical error detection. NAACL BEA workshop, 2009.
   This paper is an example of generating artificial errors. The authors have published several papers on this topic; I picked this one as representative of their work.
7. Sungjin Lee, Jonghoon Lee, Hyungjong Noh, Kyusong Lee and Gary Geunbae Lee. Realistic grammar error simulation using Markov Logic. Knowledge-Based Systems, 2011.
8. Daisuke Okanohara and Jun'ichi Tsujii. A discriminative language model with pseudo-negative samples. ACL, 2007.

Error detection and correction

After discussing error types and some annotated corpora, this set of papers covers approaches to detecting and correcting grammatical errors.

GEC in general

9. Claudia Leacock, Martin Chodorow, Michael Gamon, Joel Tetreault. Automated Grammatical Error Detection for Language Learners. Synthesis Lectures on Human Language Technologies, 2010.
   This book is a comprehensive introduction to ESL error topics. Reading it should give a better sense of the history and related work in this area.

GEC using ML

10. Joel Tetreault, Jennifer Foster and Martin Chodorow. Using Parse Features for Preposition Selection and Error Detection. ACL short, 2010.
    This paper shows that parse tree features are helpful for preposition error detection and correction. The features they used might also be useful for detecting other error types.
11. Alla Rozovskaya and Dan Roth. Algorithm selection and model adaptation for ESL correction tasks. ACL, 2011.
12. Alla Rozovskaya, Dan Roth. Joint learning and inference for grammatical error correction. EMNLP, 2013.
13. Yuanbin Wu, Hwee Tou Ng. Grammatical Error Correction Using Integer Linear Programming. ACL, 2013.
14. Alla Rozovskaya, Dan Roth. Building a State-of-the-Art Grammatical Error Correction System. TACL, 2014.

GEC using MT

15. Chris Brockett, William B. Dolan, and Michael Gamon. Correcting ESL Errors Using Phrasal SMT Techniques. ACL, 2006.
    a. This paper introduces the MT-based approach to GEC and is among the first works to use SMT for it. The idea of using MT methods to solve other NLP problems has seen recent use, so it is worth including as one of the correction approaches. A key property of the MT approach is that it corrects the whole sentence at once. The following papers show more recent attempts:
16. Y. Albert Park and Roger Levy. Automated Whole Sentence Grammar Correction Using a Noisy Channel Model. ACL, 2011.
    a. This paper borrows the noisy channel model from MT to solve GEC. Their approach can be presented as a working method in a GEC survey.
17. Daniel Dahlmeier and Hwee Tou Ng. A beam-search decoder for grammatical error correction. EMNLP, 2012.
    a. This paper also applies the idea of MT decoding to GEC. Again, the idea is interesting and worth mentioning as one of the GEC methods.
18. Nitin Madnani, Joel Tetreault, Martin Chodorow. Exploring Grammatical Error Correction with Not-So-Crummy Machine Translation. Workshop on the Innovative Use of NLP for Building Educational Applications, 2012.
    a. This paper uses round-trip translations through different pivot languages for GEC. The idea is interesting and, I think, could be mentioned in the survey.
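The noisy-channel formulation used in paper 16 chooses the correction c maximizing P(c) * P(observed | c): a language model prefers fluent sentences, while a channel model prefers corrections close to what the writer produced. This toy sketch is my own illustration of that argmax structure; the scoring functions are hypothetical stand-ins, not the paper's models:

```python
def noisy_channel_correct(observed, candidates, lm_score, channel_score):
    """Pick argmax_c [ log P(c) + log P(observed | c) ] over candidate corrections."""
    return max(candidates, key=lambda c: lm_score(c) + channel_score(observed, c))

# Hypothetical stand-ins for illustration only.
def toy_lm(sentence):
    # A real system would use an n-gram or neural language model.
    fluent = {"he goes home": -1.0, "he go home": -5.0}
    return fluent.get(sentence, -10.0)

def toy_channel(observed, candidate):
    # Fewer word-level edits => higher (log) channel probability.
    edits = sum(a != b for a, b in zip(observed.split(), candidate.split()))
    return -1.0 * edits

best = noisy_channel_correct("he go home", ["he go home", "he goes home"],
                             toy_lm, toy_channel)
```

The tension between the two terms is the whole method: the channel model keeps the decoder from rewriting the sentence wholesale, while the language model pushes it toward fluency.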

GEC using rules

19. Ryo Nagata, Mikko Vilenius, Edward Whittaker. Correcting Preposition Errors in Learner English Using Error Case Frames and Feedback Messages. ACL, 2014.
    a. This paper introduces a novel approach to preposition error correction that defines error case frames, extracted by comparing native and learner corpora. Their system can also give learners feedback. I liked the idea of error case frames as a new approach to GEC.

GEC evaluation

20. Joel Tetreault, Martin Chodorow and Nitin Madnani. Bucking the trend: improved evaluation and annotation practices for ESL error detection systems. Language Resources & Evaluation, 2014.
    a. This paper first introduces different GEC methods and then discusses the effect of annotation and evaluation on ESL error detection systems. I could place this paper in the evaluation topic, but I also like its background section on GEC systems.

Robust Parsing

The goal of GEC methods is to correct all the errors in a sentence, but they are not perfect, and there remain errors they cannot handle. Parsing these erroneous sentences can therefore be problematic. This set of papers is about parsing ungrammatical sentences. The first step in parsing, or even in correcting grammatical errors, is to check whether a sentence is ungrammatical; if it is, it must be corrected or parsed in a special way.

Judging grammaticality

21. Joachim Wagner, Jennifer Foster, Josef van Genabith. Judging grammaticality: Experiments in sentence classification. CALICO Journal, 2009.
22. Joachim Wagner, Jennifer Foster and Josef van Genabith. A comparative evaluation of deep and shallow approaches to the automatic detection of common grammatical errors. EMNLP, 2007.
    a. This paper describes both an automatic error creation method and a classifier that distinguishes grammatical from ungrammatical sentences. The features are XLE parser output, POS n-grams and word n-grams.
23. Adam Pauls and Dan Klein. Large-scale syntactic language modeling with treelets. ACL, 2012.
    a. This paper introduces a syntactic language model over parse trees. Treelets are overlapping windows of trees of depth at most 3, containing CFG rules and nonterminals. Since these treelets are similar to fragments, I included this paper in the reading list.
24. Matt Post. Judging grammaticality with tree substitution grammar derivations. ACL short, 2011.
    a. This paper uses extracted TSG rules as features for binary grammaticality classification. Since TSG rules are fragments of parse trees, I included this paper; it is also one of the papers that addresses the fluency-checking task using parse features.

Parsing ungrammatical sentences

25. Joachim Wagner, Jennifer Foster. The effect of correcting grammatical errors on parse probabilities. IWPT, 2009.
    a. This paper investigates the output probabilities of parsers when parsing ungrammatical sentences.
26. Jennifer Foster. Treebanks gone bad: Parser evaluation and retraining using a treebank of ungrammatical sentences. IJDAR, 2007.
    a. This paper presents the creation of an ungrammatical treebank by artificially adding errors to sentences and changing their parse trees. This idea is similar to our fragmentation method for generating gold-standard fragments, so it is one of our references for changing parse trees. I have also put this paper in the Evaluation topic.
27. Jennifer Foster, Joachim Wagner, Josef van Genabith. Adapting a WSJ-trained parser to grammatically noisy text. ACL short, 2008.
    a. This paper presents a robust parser trained on an ungrammatical treebank. They also used a classifier to decide whether a sentence contains grammatical errors before parsing it.
28. Jennifer Foster, Özlem Çetinoglu, Joachim Wagner, Joseph Le Roux, Joakim Nivre, Deirdre Hogan, Josef van Genabith. From News to Comment: Resources and Benchmarks for Parsing the Language of Web 2.0. IJCNLP, 2011.
    a. Parsing the web.

29. Jennifer Foster, Özlem Çetinoglu, Joachim Wagner, Joseph Le Roux, Stephen Hogan, Joakim Nivre, Deirdre Hogan, Josef van Genabith. #hardtoparse: POS Tagging and Parsing the Twitterverse. AAAI, 2011.
    a. Parsing tweets.
30. Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. A Dependency Parser for Tweets. EMNLP, 2014.
31. David McClosky, Eugene Charniak, and Mark Johnson. Automatic Domain Adaptation for Parsing. NAACL, 2010.
32. Mohammad Khan, Markus Dickinson, and Sandra Kübler. Towards Domain Adaptation for Parsing Web Data. RANLP, 2013.