Data Analysis, Machine Learning and Knowledge Discovery

The 36th Annual Conference of the German Classification Society (GfKl) on Data Analysis, Machine Learning and Knowledge Discovery University of Hilde...

Author: Branden Lawson


The 36th Annual Conference of the German Classification Society (GfKl) on

Data Analysis, Machine Learning and Knowledge Discovery University of Hildesheim, Germany August 1-3, 2012

Program & Abstracts

Photo title page: Market place of Hildesheim (© Hildesheim Marketing, Photographer: Obornik)


Preface

Message from the GfKl 2012 Chairs

We would like to cordially welcome you to the 36th Annual Conference of the German Classification Society, taking place in Hildesheim, Germany. The GfKl has become 36 years young. Over these years, we have seen the core topics of the conference crystallize into thematic areas. This year, for the first time, these areas were made explicit and their coordination was undertaken by dedicated Area Chairs. We are proudly hosting six Areas and a workshop:
• Statistics and Data Analysis (SDA), organized by Hans-Hermann Bock, Christian Hennig and Claus Weihs
• Machine Learning and Knowledge Discovery (MLKD), organized by Lars Schmidt-Thieme and Myra Spiliopoulou
• Data Analysis and Classification in Marketing (DACMar), organized by Daniel Baier and Reinhold Decker
• Data Analysis in Finance (DAFin), organized by Michael Hanke and Krzysztof Jajuga
• Biostatistics and Bio-informatics, organized by Anne-Laure Boulesteix and Hans Kestler
• Interdisciplinary Domains (InterDom), organized by Andreas Hadjar, Sabine Krolak-Schwerdt and Claus Weihs
• and the Workshop Library and Information Science (LIS’2012), organized by Frank Scholze
Obviously, the subjects accommodated in these areas are not sharply separated. We have fostered links and interactions among them, culminating in the three plenary and six semi-plenary talks, and in the three invited sessions that cover research advances spanning more than one area. Our invited sessions are:
• Ensemble Methods in Clustering and Classification, organized by Berthold Lausen


• Applications in Empirical Educational Research Based on Secondary Data, organized by Sabine Krolak-Schwerdt and Alexandra Schwarz
• Dynamic Cluster Analysis - Theory and Practise, organized by Jozef Pociecha under the auspices of the Polish Classification Society

Next to the plenary and semi-plenary talks, our scientific program accommodates 130 contributions, 16 of them in the LIS workshop. As expected, the lion’s share of the contributions comes from Germany, followed by Poland, but we have contributions from all over the world, stretching from Portugal to Ukraine, and from Canada and the USA to Japan and Thailand.

Organizing such a conference with its parallel, interleaved events is not an easy task. It requires the coordination of many individuals on many issues, and it depends on the tremendous effort of engaged scientists and of the dedicated teams in Hildesheim and Magdeburg. We would like to thank the Area Chairs for their hard work in conference advertisement, author recruitment and submission evaluation, and the three Chairs of the invited sessions for winning renowned presenters to complement the main areas with inter-area subjects of major interest. We are particularly indebted to the Polish Classification Society for its involvement and presence in the GfKl 2012.

We are proud to announce the best paper awards of GfKl 2011. This year, we have honoured two papers:
1. Sarah Frost and Daniel Baier (University of Cottbus) elaborate on the performance of the Earth Mover’s Distance on image clustering; and
2. Florent Domenach and Ali Tayari (University of Nicosia) discuss implications of axiomatic consensus properties.
We would like to congratulate the prize winners and to thank the Best Paper Awards Jury members H.H. Bock, R. Decker, B. Lausen, A. Ultsch, and C. Weihs for their excellent work. The awarded papers appear as part of the GfKl 2011 post-conference proceedings.
We would like to thank the EasyChair GfKl 2012 administrator, Miriam Tödten (master's student of the ’Data & Knowledge Engineering’ degree at the Otto-von-Guericke University Magdeburg), for her tireless work in troubleshooting and assistance during submission, evaluation and camera-ready preparation and for her contribution to the abstracts volume, and Silke Reifgerste, the financial administrator of the KMD research lab at the Otto-von-Guericke University Magdeburg, for her fast and competent treatment of all financial matters concerning the Magdeburg team. Further, we would like to thank Kerstin Hinze-Melching (University of Hildesheim) for her help with the local organization; Jörg Striewski and Uwe Oppermann, our technicians in Hildesheim, for technical assistance; Selma Batur (master's student at the University of Hildesheim) for help with the abstract proceedings and the preparation of the conference; and our conference assistants in Hildesheim: Fabian Brandes, Christian Brauch, Lenja Busch, Sarina Flemnitz, Sophia Graefe, Stephan Reller, Nicole Reuss and Kai Wedig.

GfKl 2012 does not end with the conference in Hildesheim. In keeping with our long tradition of post-conference proceedings, we will open the EasyChair GfKl 2012 site again from August 2012 and invite the conference authors to submit the full version of their work for the peer-reviewing phase scheduled for fall 2012. Accepted papers will be published by Springer.

We wish you a productive, inspiring conference and a pleasant stay in Hildesheim!

Hildesheim, August 2012

Lars Schmidt-Thieme and Ruth Janning, GfKl 2012 Local Organizers
Myra Spiliopoulou and Lars Schmidt-Thieme, GfKl 2012 Program Chairs
Claus Weihs, President of the GfKl


Sponsors

We thank our sponsors:

Information Systems and Machine Learning Lab (ISMLL) Stiftung Universität Hildesheim

Microsoft


Conference Location

The GfKl 2012 is hosted by the University of Hildesheim. The conference location is the Domäne Marienburg.

The address is:
Domäne Marienburg
Domänenstraße
31141 Hildesheim


Map of the buildings of Domäne Marienburg


Program Committee Chairs
Myra Spiliopoulou, Otto-von-Guericke-Univ. Magdeburg, Germany
Lars Schmidt-Thieme, Univ. Hildesheim, Germany

Local Organizers
Lars Schmidt-Thieme, Univ. Hildesheim, Germany
Ruth Janning, Univ. Hildesheim, Germany

Scientific Program Committee

AREA Machine Learning and Knowledge Discovery (MLKD)
Myra Spiliopoulou, Otto-von-Guericke-Univ. Magdeburg, Germany (Area Chair)
Lars Schmidt-Thieme, Univ. Hildesheim, Germany (Area Chair)
Martin Atzmueller, Univ. Kassel, Germany
Eirini Ntoutsi, Ludwig-Maximilians-Univ. Munich, Germany
Georg Krempl, Otto-von-Guericke-Univ. Magdeburg, Germany
João Gama, Univ. Porto, Portugal
Eyke Hüllermeier, Univ. Marburg, Germany
Thomas Seidl, RWTH Aachen, Germany
Andreas Hotho, Univ. Wuerzburg, Germany

AREA Statistics and Data Analysis
Claus Weihs, TU Dortmund, Germany (Area Chair)
Hans-Hermann Bock, RWTH Aachen, Germany (Area Chair)
Christian Hennig, Univ. College London, UK (Area Chair)
Bettina Gruen, Johannes Kepler Univ. Linz, Austria
Patrick Groenen, Erasmus Univ. Rotterdam, Netherlands

AREA Data Analysis and Classification in Marketing
Daniel Baier, BTU Cottbus, Germany (Area Chair)
Reinhold Decker, Univ. Bielefeld, Germany (Area Chair)


AREA Data Analysis in Finance
Krzysztof Jajuga, Wroclaw Univ. of Economics, Poland (Area Chair)
Michael Hanke, Univ. of Liechtenstein, Liechtenstein (Area Chair)

AREA Data Analysis in Biostatistics and Bioinformatics
Anne-Laure Boulesteix, Ludwig Maximilian Univ. Munich, Germany (Area Chair)
Hans Kestler, Univ. Ulm, Germany (Area Chair)
Harald Binder, Univ. Mainz, Germany
Matthias Schmid, Univ. Erlangen, Germany
Friedhelm Schwenker, Univ. Ulm, Germany

AREA Data Analysis in Interdisciplinary Domains
Sabine Krolak-Schwerdt, Univ. Luxembourg, Luxembourg (Area Chair)
Claus Weihs, TU Dortmund, Germany (Area Chair)
Andreas Hadjar, Univ. Luxembourg, Luxembourg
Irmela Herzog, LVR, Bonn, Germany
Florian Klapproth, Univ. Luxembourg, Luxembourg
Hans-Joachim Mucha, WIAS, Berlin, Germany
Frank Scholze, KIT Karlsruhe, Germany (Chair)

LIS’2012
Frank Scholze, KIT Karlsruhe, Germany (Chair)
Stefan Gradmann, HU Berlin, Germany
Heidrun Wiesenmüller, HDM Stuttgart, Germany
Ewald Brahms, Univ. Hildesheim, Germany
Michael Mönnich, KIT Karlsruhe, Germany
Bernd Lorenz, FHÖV Munich, Germany
Hans-Joachim Hermes, TUniv. Chemnitz, Germany
Andreas Geyer-Schulz, KIT Karlsruhe, Germany


Social Program

Tuesday, July 31st
20:00 h Informal come-together in Knochenhauer Amtshaus (http://www.knochenhaueramtshaus.com/) in the city center at the market place

Wednesday, August 1st
16:30 h Guided tour of UB Hildesheim
17:30 h Guided tour of Dombibliothek
18:00 h Guided City tour of Hildesheim (German)
18:00 h Guided City tour of Hildesheim (English)
20:00 h Reception in the city hall at the market place
Greeting: Ruth Seefels, Mayor of Hildesheim, Claus Weihs, President of GfKl

Thursday, August 2nd
20:15 h Conference dinner in the Novotel (Bahnhofsallee 38, 31134 Hildesheim, admittance: 19:30)

City hall of Hildesheim (Photo: © Hildesheim Marketing)


Invited Speakers

Wolfgang Gaul: Where Data Analysis meets Graph Theory (01.08, 10:00-10:45, Building: HS 52, Room: 001 (Theater))
Katsutoshi Yada: Knowledge Discovery in Shopping Path Data (01.08, 13:30-14:15, Building: HS 52, Room: 001 (Theater))
Thomas Seidl: Stream Data Mining and Anytime Algorithms (01.08, 13:30-14:15, Building: HS 27, Room 003)
João Gama: Data Stream Mining for Ubiquitous Environments (02.08, 09:00-09:45, Building: HS 52, Room: 001 (Theater))
Michele Sebag: Autonomous Robotics: Defining Instincts and Learning Systems of Values (02.08, 14:00-14:45, Building: HS 52, Room: 001 (Theater))
Alex Weissensteiner: Arbitrage-Free Scenario Trees for Financial Optimization (02.08, 14:00-14:45, Building: HS 27, Room 003)
Hillol Kargupta: Connected Cars, Machine-to-Machine Environments, and Distributed Data Mining (03.08, 09:00-09:45, Building: HS 52, Room: 001 (Theater))
Dirk Van den Poel: On the value of incorporating sequential information into predictive analytics classification models for analytical CRM (03.08, 09:00-09:45, Building: HS 27, Room 003)
Shai Ben-David: Universal Learning vs. No Free Lunch results - can there be learners that do not require task-specific knowledge? (03.08, 13:15-14:00, Building: HS 52, Room: 001 (Theater))


GfKl 2013 The next annual conference of the German Classification Society GfKl 2013 will take place in Luxembourg from July 10, 2013 till July 13, 2013 under the title

European Conference on Data Analysis

Scientific Program Committee: Prof. Dr. Dirk Van den Poel (Ghent University), Chair
Local Organizer: Prof. Dr. Sabine Krolak-Schwerdt, Dr. Matthias Böhmer

Luxembourg

Program


Scientific Program of GfKl 2012 (Overview)

Tuesday, July 31, 2012

Social events
20:00 Informal come-together in Knochenhauer Amtshaus (http://www.knochenhaueramtshaus.com/) in the city center at the market place

Wednesday, August 01, 2012

08:15 Registration (Building: HS 52, Foyer)
09:00 Opening (Building: HS 52, Room: 001 (Theater))
– Welcome by Prof. Dr. Wolfgang-Uwe Friedrich (President of the University of Hildesheim)
– Welcome by Prof. Dr. Martin Sauerwein (Dean of faculty for mathematics, natural sciences, economy and computer science, University of Hildesheim)
– Welcome and best paper awards by Prof. Dr. Claus Weihs (President of the GfKl):
· Best Paper Award 2011 - methods: "Implications of Axiomatic Consensus Properties", Florent Domenach and Ali Tayari (Department of Computer Science, University of Nicosia)
· Best Paper Award 2011 - application: "Comparing Earth Movers Distance and its Approximations for Clustering Images", Sarah Frost and Daniel Baier (Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus)
– Welcome by Prof. Dr. Myra Spiliopoulou (Program Chair)
– Welcome by Prof. Dr. Dr. Lars Schmidt-Thieme (Local Organizer)
10:00 Opening Plenary (Building: HS 52, Room: 001 (Theater)), Wolfgang Gaul: Where Data Analysis meets Graph Theory (4), Chair: L. Schmidt-Thieme


Coffee break 10:45 – 11:15

Statistics & Data Analysis: Clustering 1 (HS 52, Room 001) Chair: H. Bock
11:15 Alexandrovich (30)
11:40 Tanioka (52)
12:05 Ayale (36)

Data Analysis & Classification in Marketing (HS 1, Room 007) Chair: S. Voekler
11:15 Baier (56)
11:40 Rese (64)

Data Analysis in Finance (HS 52, Room 101) Chair: K. Jajuga
11:15 Vogt (83)
11:40 Müller (79)
12:05 Feldman (75)

Machine Learning & Knowledge Discovery: Recommenders & Multi-Criteria Optimization (HS 27, Room 003) Chair: L. Schmidt-Thieme
11:15 Symeonidis (100)
11:40 Ntoutsi (95)
12:05 Cheng (90)

Interdisciplinary Domains: Education (HS 2a, Room 004) Chair: S. Krolak-Schwerdt
11:15 Trendtel (129)
11:40 Kasper (119)

Lunch break 12:30 – 13:30

13:30 Semi Plenary (Building: HS 52, Room: 001 (Theater)), Katsutoshi Yada: Knowledge Discovery in Shopping Path Data (10), Chair: D. Baier
13:30 Semi Plenary (Building: HS 27, Room 003), Thomas Seidl: Stream Data Mining and Anytime Algorithms (9), Chair: C. Weihs

Break 14:15 – 14:30

Statistics & Data Analysis: Classification 1 (HS 52, Room 001) Chair: J. Schiffner
14:30 Bischl (33)
14:55 Takai (51)
15:20 Lange (42)

Data Analysis & Classification in Marketing (HS 1, Room 007) Chair: A. Sänn
14:30 Bąk (59)
14:55 Rumstadt (65)
15:20 Voekler (70)

Data Analysis in Finance (HS 52, Room 101) Chair: K. Jajuga
14:30 Piontek (81)
14:55 Nagy (80)
15:20 Kaszuba (78)

Machine Learning & Knowledge Discovery: Streams (HS 27, Room 003) Chair: C. Weihs
14:30 Bolanos (88)
14:55 Tödten (101)
15:20 Matuszyk (93)

Interdisciplinary Domains: Psychology (HS 2a, Room 004) Chair: F. Klapproth
14:30 Hahn (114)
14:55 Hörstermann (118)
15:20 Geyer-Schulz (113)


Coffee break 15:45 – 16:15

Statistics & Data Analysis: Statistics 1 (HS 52, Room 001) Chair: F. Schwaiger
16:15 Beige (31)
16:40 Voigt (53)
17:05 Joenssen (40)

Statistics & Data Analysis: Statistics in Economics (HS 1, Room 007) Chair: A. Rybicka
16:15 Brzezinska (34)
16:40 Biron (32)
17:05 Jefmanski (39)

Data Analysis in Finance (HS 52, Room 101) Chair: M. Hanke
16:15 Bessler (72)
16:40 Rutkowska-Ziarko (82)
17:05 Garsztka (76)

Machine Learning & Knowledge Discovery: Clustering (HS 27, Room 003)
16:15 Mouysset (94)
16:40 Pelka (96)
17:05 Völkel (103)

Invited Session: Applications in Empirical Educational Research Based on Secondary Data (HS 2a, Room 004) Chair: A. Schwarz
16:15 Schwarz (19)
16:40 Makles (18)
17:05 Schwarz (127)

Social events
16:30 Guided tour of UB Hildesheim
17:30 Guided tour of Dombibliothek
18:00 Guided City tour of Hildesheim (German)
18:00 Guided City tour of Hildesheim (English)
20:00 Reception in the city hall at the market place (Greeting: Ruth Seefels, Mayor of Hildesheim, Claus Weihs, President of GfKl)

Thursday, August 02, 2012

08:15 Registration (Building: HS 52, Foyer)
09:00 Plenary (Building: HS 52, Room: 001 (Theater)), João Gama: Data Stream Mining for Ubiquitous Environments (3), Chair: M. Spiliopoulou


Break 09:45 – 10:00

Statistics & Data Analysis: Factor analysis (HS 52, Room 001) Chair: C. Hennig
10:00 Schoonees (49)
10:50 Mucha (47)

Biostatistics & Bioinformatics (HS 1, Room 007) Chair: H. Kestler
10:00 Potapov (138)
10:25 Schmid (139)
10:50 Matuszyk (136)

HS 52, Room 101: -

Invited Session: Dynamic Cluster Analysis - Theory and Practise (HS 27, Room 003) Chair: J. Pociecha
10:00 Bock (22)
10:25 Najman (24)
10:50 Lula (23)

Interdisciplinary Domains: Music 1 (HS 2a, Room 004) Chair: C. Weihs
10:00 Dittmar (111)
10:25 Hillewaere (117)
10:50 Bauer (108)

Coffee break 11:15 – 11:30

Statistics & Data Analysis: Classification 2 (HS 52, Room 001) Chair: B. Bischl
11:30 Meyer (45)
11:55 Lange (43)
12:20 Schwenker (50)
12:45 Nguyen (124)

Biostatistics & Bioinformatics (HS 1, Room 007) Chair: H. Kestler
11:30 Heider (135)
11:55 Burkovski (134)
12:20 Maucher (137)

HS 52, Room 101: -

Invited Session: Dynamic Cluster Analysis - Theory and Practise (HS 27, Room 003) Chair: J. Pociecha
11:30 Sokolowski (25)
11:55 Voekler (27)
12:20 Stanimir (26)

Interdisciplinary Domains: Music 2 (HS 2a, Room 004) Chair: C. Weihs
11:30 Vatolkin (130)
11:55 Eichhoff (112)
12:20 Lukashevich (123)
12:45 Krey (120)

12:45 – 13:30 Meeting AG Biostatistik (HS 1, Room 007)
12:45 – 13:30 Meeting AG DA-NK (HS 52, Room 101)

Lunch break 12:45 – 14:00

14:00 Semi Plenary (Building: HS 52, Room: 001 (Theater)), Michele Sebag: Autonomous Robotics: Defining Instincts and Learning Systems of Values (8), Chair: W. Gaul
14:00 Semi Plenary (Building: HS 27, Room 003), Alex Weissensteiner: Arbitrage-Free Scenario Trees for Financial Optimization (5), Chair: K. Jajuga

Break 14:45 – 14:55

Statistics & Data Analysis: Clustering 2 (HS 52, Room 001) Chair: Ritter
14:55 Wilk (54)
15:20 Schwaiger (38)
15:45 Hennig (37)

Data Analysis & Classification in Marketing (HS 1, Room 007) Chair: R. Decker
14:55 Steiner (67)
15:20 Lichtenthäler (61)
15:45 Tuma (69)

Data Analysis in Finance (HS 52, Room 101) Chair: M. Hanke
14:55 Geyer (77)
15:20 Bohlmann (74)

Machine Learning & Knowledge Discovery: Classification & Ensembles (HS 27, Room 003)
14:55 Schwenker (97)
15:20 Senge (98)
15:45 Vatolkin (102)

HS 2a, Room 004: -

Coffee break 16:10 – 16:40

Statistics & Data Analysis: Model selection (HS 52, Room 001) Chair: C. Weihs
16:40 Liebscher (44)
17:05 Mucha (46)

Data Analysis & Classification in Marketing (HS 1, Room 007) Chair: R. Decker
16:40 Minke (62)
17:05 Bak (57)

HS 52, Room 101: -

HS 27, Room 003: -

Interdisciplinary Domains: Language & Education (HS 2a, Room 004) Chair: S. Krolak-Schwerdt
16:40 Beica (109)
17:05 Nisioi (125)
17:30 Ünlü (131)

Break 17:30 – 18:00

18:00 General meeting of the German Classification Society (Building: HS 52, Room: 001 (Theater), End: 19:30)


Social events 20:15 Conference dinner in the Novotel (Bahnhofsallee 38, 31134 Hildesheim, admittance: 19:30)

Friday, August 03, 2012

08:15 Registration (Building: HS 52, Foyer)
09:00 Semi Plenary (Building: HS 52, Room: 001 (Theater)), Hillol Kargupta: Connected Cars, Machine-to-Machine Environments, and Distributed Data Mining (6), Chair: L. Schmidt-Thieme
09:00 Semi Plenary (Building: HS 27, Room 003), Dirk Van den Poel: On the value of incorporating sequential information into predictive analytics classification models for analytical CRM (7), Chair: W. Steiner

Break 09:45 – 09:55 Building Area

HS 52, Room 001 Statistics & Data Analysis

HS 1, Room 007 Data Analysis & Classification in Marketing

Session

Applications

Chair

H. Mucha

09:55

Santos (48)

D. Van den Poel Kottemann (60)

10:20

Klapproth (41)

Ballings (58)

10:45

Carvalho (35)

Paetz (63)

HS 52, HS 27, HS 2a, Room 101 Room 003 Room 004 Machine Machine Interdisciplinary Learning & Learning & Domains Knowledge Knowledge Discovery Discovery Opinions and Distributed and Quality marketing Temporal Data Analysis J. Gama C. Hennig Ahn (86)

Khan (92)

Sinelnikova (99) Wagner (104)

Dávid (91) Bakhtyar (87)

Thorleuchter (128) Hildebrand (116) Rozkrut (126)


Coffee break 11:10 – 11:25

Invited Session: Ensemble methods in clustering and classification (HS 52, Room 001) Chair: B. Lausen
11:25 Ziegler (15)
11:50 Binder (13)
12:15 Janitza (14)
12:40 Adler (12)

Data Analysis & Classification in Marketing (HS 1, Room 007) Chair: A. Rese
11:25 Sänn (68)
11:50 Selka (66)

HS 52, Room 101: -

Machine Learning & Knowledge Discovery: Social networks (HS 27, Room 003) Chair: P. Symeonidis
11:25 Yakoubi (106)
11:50 Wartena (105)
12:15 Buza (89)

Interdisciplinary Domains: Maps & Images (HS 2a, Room 004) Chair: I. Herzog
11:25 Busche (110)
11:50 Herzog (115)
12:15 Loidl (121)

Break 13:05 – 13:15

13:15 Closing Plenary (Building: HS 52, Room: 001 (Theater)), Shai Ben-David: Universal Learning vs. No Free Lunch results - can there be learners that do not require task-specific knowledge? (2), Chair: C. Hennig
14:00 Farewell (Beverages/snacks, Building: HS 52, Foyer)
14:30 End


Full Scientific Program of GfKl 2012

Tuesday, July 31, 2012

Social events
20:00 Informal come-together in Knochenhauer Amtshaus (http://www.knochenhaueramtshaus.com/) in the city center at the market place

Wednesday, August 01, 2012

08:15 Registration (Building: HS 52, Foyer)
09:00 Opening (Building: HS 52, Room: 001 (Theater))
– Welcome by Prof. Dr. Wolfgang-Uwe Friedrich (President of the University of Hildesheim)
– Welcome by Prof. Dr. Martin Sauerwein (Dean of faculty for mathematics, natural sciences, economy and computer science, University of Hildesheim)
– Welcome and best paper awards by Prof. Dr. Claus Weihs (President of the GfKl):
· Best Paper Award 2011 - methods: "Implications of Axiomatic Consensus Properties", Florent Domenach and Ali Tayari (Department of Computer Science, University of Nicosia)
· Best Paper Award 2011 - application: "Comparing Earth Movers Distance and its Approximations for Clustering Images", Sarah Frost and Daniel Baier (Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus)
– Welcome by Prof. Dr. Myra Spiliopoulou (Program Chair)
– Welcome by Prof. Dr. Dr. Lars Schmidt-Thieme (Local Organizer)
10:00 Opening Plenary (Building: HS 52, Room: 001 (Theater)), Wolfgang Gaul: Where Data Analysis meets Graph Theory (4), Chair: L. Schmidt-Thieme

Coffee break 10:45 – 11:15


Statistics and Data Analysis: Clustering 1 (HS 52, Room 001) Chair: H. Bock
11:15 Grigory Alexandrovich: An exact Newton’s method for ML estimation in a penalized Gaussian mixture model (30)
11:40 Kensuke Tanioka and Hiroshi Yadohisa: Three-way Subspace Hierarchical Clustering based on Entropy Regularization Method (52)
12:05 Daher Ayale and Dhorne Thierry: Geographic clustering through aggregation control (36)

Data Analysis and Classification in Marketing (HS 1, Room 007) Chair: S. Voekler
11:15 Daniel Baier, Wolfgang Polasek and Alexandra Rese: Spatial Modeling of Dependencies Between Population, Education, and Economic Growth (56)
11:40 Alexandra Rese, Hans-Georg Gemünden and Daniel Baier: Rasch Models for Analyzing Role Models in Inter-Organisational Innovation Processes (64)

Data Analysis in Finance (HS 52, Room 101) Chair: K. Jajuga
11:15 Jonas Vogt: Sovereign Credit Spreads During the European Fiscal Crisis (83)
11:40 Marlene Müller: Using generalized additive models to fit credit rating scores (79)
12:05 Lukasz Feldman, Radoslaw Pietrzyk and Pawel Rokita: A practical method of determining longevity and premature-death risk aversion in households and some proposals of its application (75)

Machine Learning and Knowledge Discovery: Recommenders and Multi-Criteria Optimization (HS 27, Room 003) Chair: L. Schmidt-Thieme
11:15 Panagiotis Symeonidis: Recommendations in Time Evolving Multi-modal Social Networks (100)
11:40 Eirini Ntoutsi, Kostas Stefanidis, Kjetil Norvag and Hans-Peter Kriegel: gRecs: A collaborative filtering framework for group recommendations (95)
12:05 Weiwei Cheng and Eyke Hüllermeier: Label Ranking with Abstention: Learning to Predict Partial Orders (90)


Interdisciplinary Domains: Education (HS 2a, Room 004) Chair: S. Krolak-Schwerdt
11:15 Matthias Trendtel and Ali Ünlü: Using Latent Class Models with Random Effects for Investigating Local Dependence (129)
11:40 Daniel Kasper and Ali Ünlü: Sensitivity Analyses for the Rasch Model (119)

Lunch break 12:30 – 13:30

13:30 Semi Plenary (Building: HS 52, Room: 001 (Theater)), Katsutoshi Yada: Knowledge Discovery in Shopping Path Data (10), Chair: D. Baier
13:30 Semi Plenary (Building: HS 27, Room 003), Thomas Seidl: Stream Data Mining and Anytime Algorithms (9), Chair: C. Weihs

Break 14:15 – 14:30

Statistics and Data Analysis: Classification 1 (HS 52, Room 001) Chair: J. Schiffner
14:30 Bernd Bischl, Julia Schiffner and Claus Weihs: Benchmarking classification algorithms on high-performance computing clusters (33)
14:55 Keiji Takai and Kenichi Hayashi: Effects of Labeling Mechanisms on Classification Error in Linear Discriminant Analysis (51)
15:20 Tatjana Lange, Karl Mosler and Pavlo Mozharovskyi: DDα-classification of asymmetric and fat-tailed data (42)

Data Analysis and Classification in Marketing (HS 1, Room 007) Chair: A. Sänn
14:30 Andrzej Bąk and Tomasz Bartłomowicz: Microeconometrics Multinomial Models and their Applications in Preferences Analysis using R (59)
14:55 Susanne Rumstadt and Daniel Baier: Variable Weighting and Selection Approaches for Market Segmentation: A Comparison (65)
15:20 Sascha Voekler and Daniel Baier: Solving Product Line Design Optimization Problems using Stochastic Programming (70)


Data Analysis in Finance (HS 52, Room 101) Chair: K. Jajuga
14:30 Krzysztof Piontek: Value-at-Risk Backtesting Procedures Based on the Loss Functions - Simulation Analysis of the Power of Tests (81)
14:55 Gabor I. Nagy and Krisztian Buza: Clustering Algorithms for Storage of Tick Data (80)
15:20 Bartosz Kaszuba: Correlation of outliers in multivariate data (78)

Machine Learning and Knowledge Discovery: Streams (HS 27, Room 003) Chair: C. Weihs
14:30 Matthew Bolanos, John Forrest and Michael Hahsler: A Study of the Efficiency and Accuracy of Data Stream Clustering for Large Data Sets (88)
14:55 Miriam Tödten, Zaigham Faraz Siddiqui and Myra Spiliopoulou: A Lightweight CVFDT Classifier for Streams with Concept Drift (101)
15:20 Pawel Matuszyk: Framework for Storing and Processing Relational Entities in a Data Stream (93)

Interdisciplinary Domains: Psychology (HS 2a, Room 004) Chair: F. Klapproth
14:30 Sonja Hahn: ANOVA and Alternatives for Causal Inferences (114)
14:55 Thomas Hörstermann and Sabine Krolak-Schwerdt: Comparing regression approaches in modelling (non-)compensatory judgment formation (118)
15:20 Andreas Geyer-Schulz, Jonas Kunze and Andreas Sonnenbichler: Learning in groups and exam performance (113)

Coffee break 15:45 – 16:15

Statistics and Data Analysis: Statistics 1 (HS 52, Room 001) Chair: F. Schwaiger
16:15 Tim Beige, Thomas Terhorst, Claus Weihs and Holger Wormer: Which District of Dortmund is the Most Dangerous? (31)
16:40 Tobias Voigt, Roland Fried, Michael Backes and Wolfgang Rhode: Gamma-Hadron-Separation in the MAGIC-Experiment (53)
17:05 Dieter Joenssen and Udo Bankhofer: Zur Begrenzung der Verwendungshäufigkeit von Spenderobjekten bei der Imputationen fehlender Daten mittels Hot-Deck-Verfahren (40)

Statistics and Data Analysis: Statistics in Economics (HS 1, Room 007) Chair: A. Rybicka
16:15 Justyna Brzezinska: Visual models for categorical data in economic research (34)
16:40 Miguel Biron and Cristian Bravo: Empirically Measuring the Effect of Violating the Independence Assumption in Behavioral Scoring (32)
17:05 Bartlomiej Jefmanski and Marcin Pelka: Fuzzy Composite Index for Customer Satisfaction Evaluation: an Application for Public Sector Services (39)

Data Analysis in Finance (HS 52, Room 101) Chair: M. Hanke
16:15 Wolfgang Bessler and Daniil Wagner: Sovereign Wealth Funds and Portfolio Choice (72)
16:40 Anna Rutkowska-Ziarko: Fundamental portfolio construction based on semi-variance (82)
17:05 Przemysław Garsztka: Optimal portfolios of assets taking into account the asymmetry of specific risk (76)

Machine Learning and Knowledge Discovery: Clustering (HS 27, Room 003)
16:15 Sandrine Mouysset, Joseph Noailles, Daniel Ruiz and Clovis Tauber: Spectral Clustering: interpretation and Gaussian parameter (94)
16:40 Marcin Pelka: Symbolic cluster ensemble based on co-association matrix vs. noisy variables and outliers (96)
17:05 Gunnar Völkel, Uwe Schöning and Hans A. Kestler: Group-Based Ant Colony Optimization (103)


Invited Session: Applications in Empirical Educational Research Based on Secondary Data (HS 2a, Room 004) Chair: A. Schwarz
16:15 Alexandra Schwarz: Applications in Empirical Educational Research Based on Secondary Data (19)
16:40 Anna Makles and Kerstin Schneider: Does school choice increase ethnic segregation in primary schools or only segregation indices? (18)
17:05 Alexandra Schwarz: The Impact of Student Loans on Personal Financing of Higher Education in Germany (127)

Social events
16:30 Guided tour of UB Hildesheim
17:30 Guided tour of Dombibliothek
18:00 Guided City tour of Hildesheim (German)
18:00 Guided City tour of Hildesheim (English)
20:00 Reception in the city hall at the market place (Greeting: Ruth Seefels, Mayor of Hildesheim, Claus Weihs, President of GfKl)


Thursday, August 02, 2012

08:15 Registration (Building: HS 52, Foyer)
09:00 Plenary (Building: HS 52, Room: 001 (Theater)), João Gama: Data Stream Mining for Ubiquitous Environments (3), Chair: M. Spiliopoulou

Break 09:45 – 10:00

Statistics and Data Analysis: Factor analysis (HS 52, Room 001) Chair: C. Hennig
10:00 Pieter Schoonees, Michel Van de Velden and Patrick Groenen: Constrained dual scaling of successive categories for detecting response styles (49)
10:50 Hans-Joachim Mucha, Hans-Georg Bartel and Jens Dolata: Dual Scaling Classification and Its Application in Archaeometry (47)

Biostatistics and Bioinformatics (HS 1, Room 007) Chair: H. Kestler
10:00 Sergej Potapov, Asma Gul, Werner Adler and Berthold Lausen: Decision tree ensembles with different split criteria (138)
10:25 Florian Schmid, Ludwig Lausser and Hans A. Kestler: A Transductive Set Covering Machine (139)
10:50 Pawel Matuszyk, Dominik Brammen, René Schult and Myra Spiliopoulou: Prediction of Surgery Duration Using Data Mining Methods on Anaesthesia Protocols (136)


Invited Session: Dynamic Cluster Analysis - Theory and Practise (HS 27, Room 003) Chair: J. Pociecha
10:00 Hans-Hermann Bock: Old and new dynamic clustering methods (22)
10:25 Kamila Migdał Najman and Krzysztof Najman: Dynamical Clustering with Self Learning Neural Networks (24)
10:50 Paweł Lula: Machine learning approach in information retrieval for real estate offers analysis (23)

Interdisciplinary Domains: Music 1 (HS 2a, Room 004) Chair: C. Weihs
10:00 Christian Dittmar, Daniel Gärtner, Kay F. Hildebrand and Florian Müller: Evaluating Similarity Measures for Plagiarism Detection in Melody Transcriptions (111)
10:25 Ruben Hillewaere, Bernard Manderick and Darrell Conklin: Alignment methods for folk tune classification (117)
10:50 Nadja Bauer, Klaus Friedrichs, Julia Schiffner and Claus Weihs: Onset detection using an auditory model (108)

Coffee break 11:15 – 11:30

Statistics and Data Analysis: Classification 2 (HS 52, Room 001) Chair: B. Bischl
11:30 Oliver Meyer, Bernd Bischl and Claus Weihs: Support Vector Machines on Large Data Sets: Simple Parallel Approaches (45)
11:55 Tatjana Lange and Pavlo Mozharovskyi: The Alpha-Procedure - a nonparametric invariant method for automatic classification of d-dimensional objects (43)
12:20 Friedhelm Schwenker and Sascha Meudt: On Instance Selection in Multi Classifier Systems (50)
12:40 Hoang Huy Nguyen, Stefan Frenzel and Christoph Bandt: Multi-Step Linear Discriminant Analysis for Classification of Event-Related Potentials (124)


Biostatistics and Bioinformatics (HS 1, Room 007) Chair: H. Kestler
11:30 Dominik Heider, Christoph Bartenhagen, J. Nikolaj Dybowski, Sascha Hauke, Martin Pyka and Daniel Hoffmann: Unsupervised dimension reduction methods for protein sequence classification (135)
11:55 Andre Burkovski, Ludwig Lausser and Hans A. Kestler: Rank aggregation for candidate gene selection (134)
12:20 Markus Maucher, Christian Wawra and Hans A. Kestler: The critical noise level for learning Boolean functions (137)
12:45 Meeting AG Biostatistik

Invited Session: Dynamic Cluster Analysis - Theory and Practise (HS 27, Room 003) Chair: J. Pociecha
11:30 Andrzej Sokolowski: Classification of Three-Way Clustering Problems (25)
11:55 Sascha Voekler and Daniel Baier: Solving Product Line Design Optimization Problems using Stochastic Programming (27)
12:20 Agnieszka Stanimir: Studies in Lower Secondary Educational Level Outcomes Changes in Poland Using Correspondence Analysis (26)

Interdisciplinary Domains: Music 2 (HS 2a, Room 004) Chair: C. Weihs
11:30 Igor Vatolkin, Günther Rötter and Claus Weihs: Music Genre Prediction by High-Level Instrument and Harmony Characteristics (130)
11:55 Markus Eichhoff and Claus Weihs: From Single Tones to MIDI Remixes - Detecting Families of Musical Instruments by High-Level Features (112)
12:20 Hanna Lukashevich: Confidence measures in automatic music classification (123)
12:45 Sebastian Krey, Uwe Ligges and Friedrich Leisch: Music and Timbre Segmentation by efficient Order Constrained K-Means Clustering (120)

Lunch break 12:45 – 14:00
13:30 Meeting AG DA-NK (HS 52, Room 101)


14:00 Semi Plenary (Building: HS 52, Room: 001 (Theater)), Michèle Sebag: Autonomous Robotics: Defining Instincts and Learning Systems of Values (8), Chair: W. Gaul
14:00 Semi Plenary (Building: HS 27, Room 003), Alex Weissensteiner: Arbitrage-Free Scenario Trees for Financial Optimization (5), Chair: K. Jajuga

Break 14:45 – 14:55

Statistics and Data Analysis: Clustering 2 (HS 52, Room 001) Chair: Ritter
14:55 Justyna Wilk and Marcin Pełka: Cluster Analysis of Symbolic Data with Application of R Software (54)
15:20 Hajo Holzmann and Florian Schwaiger: Merging States in Hidden Markov Models (38)
15:45 Christian Hennig: Some thoughts about the “number of clusters”-problem (37)

Data Analysis and Classification in Marketing (HS 1, Room 007) Chair: R. Decker
14:55 Winfried Steiner, Florian Siems, Anett Weber and Daniel Guhl: Exploring Nonlinear Effects in the Relationship between Customer Satisfaction and Customer Retention (67)
15:20 Christina Lichtenthäler and Lars Schmidt-Thieme: Multinomial-SVM-Item-Recommender for Repeat Buying Scenarios (61)
15:45 Michael Tuma: Identifying Consumer Typologies from Online Product Reviews Using Finite Mixture Models (69)

Data Analysis in Finance (HS 52, Room 101) Chair: M. Hanke
14:55 Alois Geyer, Michael Hanke and Alex Weissensteiner: A Simplex Rotation Algorithm for the Factor Approach to Generate Financial Scenarios (77)
15:20 Daniel Bohlmann and Jarek Krajewski: Feature reduction and pattern classification for financial forecasting, - A comparative study on different optimization strategies - (74)


Machine Learning and Knowledge Discovery: Classification & Ensembles (HS 27, Room 003)
14:55 Friedhelm Schwenker, Michael Glodek and Martin Schels: Ensemble learning for density estimation (97)
15:20 Robin Senge and Eyke Hüllermeier: An Analysis of Classifier Chains for Multi-Label Classification (98)
15:45 Igor Vatolkin, Bernd Bischl, Günter Rudolph and Claus Weihs: Statistical Comparison of Classifiers for Multi-Objective Feature Selection in Instrument Recognition (102)

Coffee break 16:10 – 16:40

Statistics and Data Analysis: Model selection (HS 52, Room 001) Chair: C. Weihs
16:40 Eckhard Liebscher: A universal method for model selection in parametric regression models based on statistical tests (44)
17:05 Hans-Joachim Mucha and Hans-Georg Bartel: Soft Bootstrapping and Its Comparison with Other Resampling Methods (46)

Data Analysis and Classification in Marketing (HS 1, Room 007) Chair: R. Decker
16:40 Anneke Minke and Klaus Ambrosi: Approach to Predicting Changes in Market Segments Based on Customer Behavior (62)
17:05 Andrzej Bąk, Marcin Pełka and Aneta Rybicka: Discrete Choice Methods and Their Applications in Preference Analysis of Vodka Consumers (57)

Interdisciplinary Domains: Language & Education (HS 2a, Room 004) Chair: S. Krolak-Schwerdt
16:40 Andreea Beica and Liviu P. Dinu: Computational Aspects of Natural Languages’ Similarities (109)
17:05 Sergiu Nisioi and Liviu P. Dinu: The Author in Translation: A Computational Method (125)


17:30 Ali Ünlü, Daniel Kasper and Matthias Trendtel: The OECD’s Programme for International Student Assessment (PISA) Study: A Review of Its Basic Psychometric Concepts (131)

Break 17:30 – 18:00
18:00 General meeting of the German Classification Society (Building: HS 52, Room: 001 (Theater), End: 19:30)

Social events
20:15 Conference dinner in the Novotel (Bahnhofsallee 38, 31134 Hildesheim, admittance: 19:30)


Friday, August 03, 2012

08:15 Registration (Building: HS 52, Foyer)
09:00 Semi Plenary (Building: HS 52, Room: 001 (Theater)), Hillol Kargupta: Connected Cars, Machine-to-Machine Environments, and Distributed Data Mining (6), Chair: L. Schmidt-Thieme
09:00 Semi Plenary (Building: HS 27, Room 003), Dirk Van den Poel: On the value of incorporating sequential information into predictive analytics classification models for analytical CRM (7), Chair: W. Steiner

Break 09:45 – 09:55

Statistics and Data Analysis: Applications (HS 52, Room 001) Chair: H. Mucha
09:55 Jaime Santos and Orlando Belo: Introducing Analytical Methods and Predictive Models in Project Management Activities (48)
10:20 Florian Klapproth, Sabine Krolak-Schwerdt and Thomas Hörstermann: Predictive validity of tracking decisions: Application of a new validation criterion (41)
10:45 Mariana Carvalho, Paulo Sampaio and Orlando Belo: Discovering Process Certification Tendencies (35)

Data Analysis and Classification in Marketing (HS 1, Room 007) Chair: D. Van den Poel
09:55 Pascal Kottemann, Martin Meißner and Reinhold Decker: Measuring Consumers’ Brand Associations in Online Market Research (60)
10:20 Michel Ballings and Dirk Van den Poel: The Dangers of using Intention as a Surrogate for Retention in Brand Positioning Decision Support Systems (58)


10:45 Friederike Paetz and Winfried J. Steiner: Finite Mixture MNP vs. Finite Mixture IP Models: An Empirical Study (63)

Machine Learning and Knowledge Discovery: Opinions and marketing (HS 52, Room 101)
09:55 Hyunsup Ahn, Markus Weinmann and Christoph Lofi: Classification and definition of contextual vicinity from emotional words for sentiment analysis (86)
10:20 Alina Sinelnikova, Eirini Ntoutsi and Hans-Peter Kriegel: Sentiment analysis in the Twitter stream (99)
10:45 Ralf Wagner: The Dark Side of Marketing Communication: Grouping Consumers with Respect to Their Reactance Behavior (104)

Machine Learning and Knowledge Discovery: Distributed and Temporal Data Analysis (HS 27, Room 003) Chair: J. Gama
09:55 Umer Khan, Alexandros Nanopoulos and Lars Schmidt-Thieme: Experimental Evaluation of Communication Efficient Distributed Classification in Peer-to-Peer Networks (92)
10:20 István Dávid and Krisztian Buza: On the relation of cluster stability and early classifiability of time series (91)
10:45 Maheen Bakhtyar, Lena Wiese, Katsumi Inoue and Nam Dang: Using Conceptual Inductive Learning for Cooperative Query Answering (87)

Interdisciplinary Domains: Quality (HS 2a, Room 004) Chair: C. Hennig
09:55 Dirk Thorleuchter and Dirk Van den Poel: Espionage Risk Assessment for Security of Defense based Research and Technology (128)
10:20 Kay F. Hildebrand: Supporting Selection of Statistical Techniques in Research (116)
10:45 Dominik Rozkrut: Differentiation of innovation strategies across regions (126)

Coffee break 11:10 – 11:25


Invited Session: Ensemble methods in clustering and classification (HS 52, Room 001) Chair: B. Lausen
11:25 Andreas Ziegler and Jochen Kruppa: Probability Machines: Estimating individual probabilities using machine learning methods (15)
11:50 Harald Binder: Tailoring componentwise boosting for prediction with a huge number of molecular measurements (13)
12:15 Silke Janitza and Anne-Laure Boulesteix: An AUC-based Permutation Variable Importance Measure for Random Forests (14)
12:40 Werner Adler, Zardad Khan, Sergej Potapov and Berthold Lausen: Diversity Based Weighting to Improve the Performance of Classifier Ensembles (12)

Data Analysis and Classification in Marketing (HS 1, Room 007) Chair: A. Rese
11:25 Alexander Sänn and Daniel Baier: Complex Product Development: Using a Combined VoC Lead User Approach (68)
11:50 Sebastian Selka, Daniel Baier and Peter Kurz: A Validity Analysis of Recent Commercial Conjoint Analysis Studies (66)

Machine Learning and Knowledge Discovery: Social networks (HS 27, Room 003) Chair: P. Symeonidis
11:25 Zied Yakoubi and Rushed Kanawati: Applying Leaders Driven Community Detection Algorithms to Data Clustering (106)
11:50 Christian Wartena and Rogier Brussee: Evaluating Tag Similarity Measures by Clustering Bibsonomy Tags (105)
12:40 Krisztian Buza: Feedback Prediction for Blogs (89)

Interdisciplinary Domains: Maps & Images (HS 2a, Room 004) Chair: I. Herzog
11:25 Andre Busche, Ruth Janning, Tomáš Horváth and Lars Schmidt-Thieme: A Unifying Framework for GPR Image Reconstruction (110)
11:50 Irmela Herzog: Testing Models for Medieval Settlement Location (115)
12:15 Martin Loidl and Christoph Traun: The balance of value and space - Merging classification and regionalization to make more sense out of spatial data (121)


Break 13:05 – 13:15
13:15 Closing Plenary (Building: HS 52, Room: 001 (Theater)), Shai Ben-David: Universal Learning vs. No Free Lunch results - can there be learners that do not require task-specific knowledge? (2), Chair: C. Hennig
14:00 Farewell (Beverages/snacks, Building: HS 52, Foyer)

End 14:30

Workshop on Classification and Subject Indexing in Library and Information Science (LIS'2012), held as part of the Annual Conference of the German Classification Society

Wednesday, August 01, 2012

09:00 Opening of the conference (HS 52)
10:45 Coffee break
Chair: Frank Scholze
11:15 Heidrun Wiesenmüller: Resource Discovery Systeme – Chance oder Verhängnis für die bibliothekarische Erschließung?
      Karl Rädler: Instrumentalisierung der klassifikatorischen Sacherschließung im neuen Suchportal mit AquaBrowser in der Vorarlberger Landesbibliothek
12:30 Lunch break
13:30 Uwe Geith: Die sachliche Suche in Schweizer Online-Katalogen und Discovery-Systemen
14:15 Break
Chair: Michael Mönnich
14:30 Dominique Ritze, Kai Eckert: Data Enrichment in Discovery Systems using Linked Data
      Elmar Haake: Verarbeitung von Sacherschliessungselementen in Discoverysystemen: auf dem Weg zu einer nutzergerechten Verwendung von inhaltlicher Erschließung in der E-LIB Bremen
15:45 Coffee break
16:15 Jan Frederik Maas: Entwicklung eines Werkzeugs zur Visualisierung der SWD/GND
      Alice Spinnler: Sacherschliessung mit GND/RSWK im Verbund Basel: eine erste Bilanz
17:30 Library tours
20:30 Dinner (at participants' own expense)

Thursday, August 02, 2012

09:00 Plenary session (HS 52)
09:45 Break
Chair: Ewald Brahms
10:00 Debora Daberkow, Petra Mensing, Irina Sens, Claudia Todt: LinSearch – Effiziente Indizierung an der Technischen Informationsbibliothek, Hannover
      Monika Lösse, Mathias Lösch: Sachliche Einordnung von Dokumenten in Bibliotheken: praktische Erfahrungen mit maschinellen Lernverfahren
11:15 Coffee break
11:30 Magnus Pfeffer: Abgleich von Titeldaten zur Übernahme von Sacherschließungsinformationen über Verbundgrenzen
      Uwe Geith, Wolfgang Giella: Herausforderung "Neue Klassifikation für Freihandbestände" - 3 Praxis-Beispiele aus der Schweiz
12:45 Lunch break
14:00 Semi-plenary session (HS 52/27)
14:45 Break
Chair: Heidrun Wiesenmüller
14:55 Andreas Ledl: Blogs als Thesaurus-Datenbanken
      Michael Schwantner, Silke Rehme, Helmut Müller, Elke Bubel, Mario Quilitz, Peter König, Nadejda Nikitina, Achim Rettinger, Nils Elsner: Semiautomatische Ontologiegenerierung – ein Erfahrungsbericht
16:10 Coffee break
16:40 Bernd Lorenz: AG Dezimalklassifikationen - Literaturbericht 2011
17:30 Break or meeting of the LIS PC
18:00 General meeting of the German Classification Society (HS 52)
19:30 End

The LIS’2012 workshop will take place at Building HS 31, Room 012.

Abstracts

Contents

Part I Keynote Speakers

Universal Learning vs. No Free Lunch results - can there be learners that do not require task-specific knowledge? (2)
  Shai Ben-David, Nathan Srebro, and Ruth Urner
Data Stream Mining for Ubiquitous Environments (3)
  João Gama
Where Data Analysis Meets Graph Theory (4)
  Wolfgang Gaul
Arbitrage-free Scenario Trees for Financial Optimization (5)
  Alois Geyer, Michael Hanke, and Alex Weissensteiner
Connected Cars, Machine-to-Machine Environments, and Distributed Data Mining (6)
  Hillol Kargupta
On the value of incorporating sequential information into predictive analytics classification models for analytical CRM (7)
  Dirk Van den Poel
Autonomous Robotics: Defining Instincts and Learning Systems of Values (8)
  Michèle Sebag
Stream Data Mining and Anytime Algorithms (9)
  Thomas Seidl
Knowledge Discovery in Shopping Path Data (10)
  Katsutoshi Yada


Part II Invited Session: Ensemble methods in clustering and classification

Diversity Based Weighting to Improve the Performance of Classifier Ensembles (12)
  Werner Adler, Zardad Khan, Sergej Potapov, and Berthold Lausen
Tailoring Componentwise Boosting for Prediction with a Huge Number of Molecular Measurements (13)
  Harald Binder
An AUC-based Permutation Variable Importance Measure for Random Forests for Unbalanced Data (14)
  Silke Janitza and Anne-Laure Boulesteix
Probability Machines: Estimating individual probabilities using machine learning methods (15)
  Andreas Ziegler and Jochen Kruppa

Part III Invited Session: Applications in Empirical Educational Research Based on Secondary Data

Does school choice increase ethnic segregation in primary schools or only segregation indices? (18)
  Anna Makles and Kerstin Schneider
Applications in Empirical Educational Research Based on Secondary Data (19)
  Alexandra Schwarz

Part IV Invited Session: Dynamic Cluster Analysis - Theory and Practise

Old and new dynamic clustering methods (22)
  Hans-Hermann Bock
Machine learning approach in information retrieval for real estate offers analysis (23)
  Paweł Lula
Dynamical Clustering with Self Learning Neural Networks (24)
  Kamila Migdał Najman and Krzysztof Najman
Classification of Three-Way Clustering Problems (25)
  Andrzej Sokolowski
Studies in Lower Secondary Educational Level Outcomes Changes in Poland Using Correspondence Analysis (26)
  Agnieszka Stanimir


Solving Product Line Design Optimization Problems using Stochastic Programming (27)
  Sascha Voekler and Daniel Baier

Part V Statistics and Data Analysis

An exact Newton’s method for ML estimation in a penalized Gaussian mixture model (30)
  Grigory Alexandrovich
Which District of Dortmund is the Most Dangerous? (31)
  Tim Beige, Thomas Terhorst, Claus Weihs and Holger Wormer
Empirically Measuring the Effect of Violating the Independence Assumption in Behavioural Scoring (32)
  Miguel Biron and Cristián Bravo
Benchmarking classification algorithms on high-performance computing clusters (33)
  Bernd Bischl, Julia Schiffner, Claus Weihs
Visual models for categorical data in economic research (34)
  Justyna Brzezińska
Discovering Process Certification Tendencies (35)
  Mariana Carvalho, Paulo Sampaio, and Orlando Belo
Geographic clustering through aggregation control (36)
  Daher Ayale and Dhorne Thierry
Some thoughts about the “number of clusters”-problem (37)
  Christian Hennig
Merging States in Hidden Markov Models (38)
  Hajo Holzmann and Florian Schwaiger
Fuzzy Composite Index for Customer Satisfaction Evaluation: an Application for Public Sector Services (39)
  Bartłomiej Jefmański and Marcin Pełka
Zur Begrenzung der Verwendungshäufigkeit von Spenderobjekten bei der Imputationen fehlender Daten mittels Hot-Deck-Verfahren (40)
  Dieter William Joenssen and Udo Bankhofer
Predictive validity of tracking decisions: Application of a new validation criterion (41)
  Florian Klapproth, Sabine Krolak-Schwerdt, and Thomas Hörstermann


DDα-classification of asymmetric and fat-tailed data (42)
  Tatjana Lange, Karl Mosler, and Pavlo Mozharovskyi
The Alpha-Procedure - a nonparametric invariant method for automatic classification of d-dimensional objects (43)
  Tatjana Lange and Pavlo Mozharovskyi
A universal method for model selection in parametric regression models based on statistical tests (44)
  Eckhard Liebscher
Support Vector Machines on Large Data Sets: Simple Parallel Approaches (45)
  Oliver Meyer, Bernd Bischl, Claus Weihs
Soft Bootstrapping and Its Comparison with Other Resampling Methods (46)
  Hans-Joachim Mucha and Hans-Georg Bartel
Dual Scaling Classification and Its Application in Archaeometry (47)
  Hans-Joachim Mucha, Hans-Georg Bartel, and Jens Dolata
Introducing Analytical Methods and Predictive Models in Project Management Activities (48)
  Jaime Santos and Orlando Belo
Constrained Dual Scaling of Successive Categories for Detecting Response Styles (49)
  Pieter C. Schoonees, Michel van de Velden, and Patrick J. F. Groenen
On Instance Selection in Multi Classifier Systems (50)
  Friedhelm Schwenker, Sascha Meudt
Effects of Labeling Mechanisms on Classification Error in Linear Discriminant Analysis (51)
  Keiji Takai and Kenichi Hayashi
Three-way Subspace Hierarchical Clustering based on Entropy Regularization Method (52)
  Kensuke Tanioka and Hiroshi Yadohisa
Gamma-Hadron-Separation in the MAGIC-Experiment (53)
  Tobias Voigt, Roland Fried, Michael Backes, and Wolfgang Rhode
Cluster Analysis of Symbolic Data with Application of R Software (54)
  Justyna Wilk and Marcin Pełka

Part VI Data Analysis and Classification in Marketing


Spatial Modeling of Dependencies Between Population, Education, and Economic Growth (56)
  Daniel Baier, Wolfgang Polasek, and Alexandra Rese
Discrete Choice Methods and Their Applications in Preference Analysis of Vodka Consumers (57)
  Andrzej Bąk, Marcin Pełka, and Aneta Rybicka
The Dangers of using Intention as a Surrogate for Retention in Brand Positioning Decision Support Systems (58)
  Michel Ballings and Dirk Van den Poel
Microeconometrics Multinomial Models and their Applications in Preferences Analysis using R (59)
  Andrzej Bąk and Tomasz Bartłomowicz
Measuring Consumers’ Brand Associations in Online Market Research (60)
  Pascal Kottemann, Martin Meißner and Reinhold Decker
Multinomial-SVM-Item-Recommender for Repeat-Buying Scenarios (61)
  Christina Lichtenthäler and Lars Schmidt-Thieme
Approach to Predicting Changes in Market Segments Based on Customer Behavior (62)
  Anneke Minke and Klaus Ambrosi
Finite Mixture MNP vs. Finite Mixture IP Models: An Empirical Study (63)
  Friederike Paetz and Winfried J. Steiner
Rasch Models for Analyzing Role Models in Inter-Organisational Innovation Processes (64)
  Alexandra Rese, Hans-Georg Gemünden, and Daniel Baier
Variable Weighting and Selection Approaches for Market Segmentation: A Comparison (65)
  Susanne Rumstadt and Daniel Baier
A Validity Analysis of Recent Commercial Conjoint Analysis Studies (66)
  Sebastian Selka, Daniel Baier, and Peter Kurz
Exploring Nonlinear Effects in the Relationship between Customer Satisfaction and Customer Retention (67)
  Winfried J. Steiner, Florian U. Siems, Anett Weber and Daniel Guhl
Complex Product Development: Using a Combined VoC Lead User Approach (68)
  Alexander Sänn and Daniel Baier


Identifying Consumer Typologies from Online Product Reviews Using Finite Mixture Models (69)
  Michael N. Tuma
Solving Product Line Design Optimization Problems using Stochastic Programming (70)
  Sascha Voekler and Daniel Baier

Part VII Data Analysis in Finance

Sovereign Wealth Funds and Portfolio Choice (72)
  Wolfgang Bessler and Daniil Wagner, CFA
Feature reduction and pattern classification for financial forecasting, - A comparative study on different optimization strategies - (74)
  Daniel Bohlmann and Jarek Krajewski
A practical method of determining longevity and premature-death risk aversion in households and some proposals of its application (75)
  Lukasz Feldman, Radoslaw Pietrzyk, and Pawel Rokita
Optimal portfolios of securities taking into account the asymmetry of specific risk (76)
  Garsztka Przemyslaw
A Simplex Rotation Algorithm for the Factor Approach to Generate Financial Scenarios (77)
  Alois Geyer, Michael Hanke, and Alex Weissensteiner
Correlation of outliers in multivariate data (78)
  Bartosz Kaszuba
Using generalized additive models to fit credit rating scores (79)
  Marlene Müller
Clustering Algorithms for Storage of Tick Data (80)
  Gabor I. Nagy and Krisztian Buza
Value-at-Risk Backtesting Procedures Based on the Loss Functions Simulation Analysis of the Power of Tests (81)
  Krzysztof Piontek
Fundamental portfolio construction based on semi-variance (82)
  Anna Rutkowska-Ziarko
Sovereign Credit Spreads During the European Fiscal Crisis (83)
  Jonas Vogt, PhD Student

Part VIII Machine Learning and Knowledge Discovery


Classification and definition of contextual vicinity from emotional words for sentiment analysis (86)
  Hyunsup Ahn, Markus Weinmann, and Christoph Lofi
Using Conceptual Inductive Learning for Cooperative Query Answering (87)
  Maheen Bakhtyar, Lena Wiese, Katsumi Inoue, and Nam Dang
A Study of the Efficiency and Accuracy of Data Stream Clustering for Large Data Sets (88)
  Matthew Bolaños, John Forrest, and Michael Hahsler
Feedback Prediction for Blogs (89)
  Krisztian Buza
Label Ranking with Abstention: Learning to Predict Partial Orders (90)
  Weiwei Cheng, Willem Waegeman, Volkmar Welker, and Eyke Hüllermeier
On the relation of cluster stability and early classifiability of time series (91)
  István Dávid and Krisztian Buza
Experimental Evaluation of Communication Efficient Distributed Classification in Peer-to-Peer Networks (92)
  Umer Khan, Alexandros Nanopoulos and Lars Schmidt-Thieme
Framework for Storing and Processing Relational Entities in a Data Stream (93)
  Pawel Matuszyk
Spectral clustering: interpretation and Gaussian parameter (94)
  Sandrine Mouysset, Joseph Noailles, Daniel Ruiz, and Clovis Tauber
gRecs: A collaborative filtering framework for group recommendations (95)
  Eirini Ntoutsi, Kostas Stefanidis, Kjetil Nørvåg, and Hans-Peter Kriegel
Symbolic cluster ensemble based on co-association matrix vs. noisy variables and outliers (96)
  Marcin Pełka
Ensemble learning for density estimation (97)
  Friedhelm Schwenker, Michael Glodek, Martin Schels
An Analysis of Classifier Chains for Multi-Label Classification (98)
  Robin Senge, Jose Barranquero, Juan José del Coz, and Eyke Hüllermeier
Sentiment analysis in the Twitter stream (99)
  Alina Sinelnikova, Eirini Ntoutsi, and Hans-Peter Kriegel


Recommendations in Time Evolving Multi-modal Social Networks (100)
  Panagiotis Symeonidis
A Lightweight CVFDT Classifier for Streams with Concept Drift (101)
  Miriam Tödten, Zaigham Faraz Siddiqui, and Myra Spiliopoulou
Statistical Comparison of Classifiers for Multi-Objective Feature Selection in Instrument Recognition (102)
  Igor Vatolkin, Bernd Bischl, Günter Rudolph, and Claus Weihs
Group-Based Ant Colony Optimization (103)
  Gunnar Völkel, Uwe Schöning, and Hans A. Kestler
The Dark Side of Marketing Communication: Grouping Consumers with Respect to Their Reactance Behavior (104)
  Ralf Wagner
Evaluating Tag Similarity Measures by Clustering Bibsonomy Tags (105)
  Christian Wartena and Rogier Brussee
Applying Leaders Driven Community Detection Algorithms to Data Clustering (106)
  Zied Yakoubi and Rushed Kanawati

Part IX Interdisciplinary Domains

Onset detection using an auditory model (108)
  Nadja Bauer, Klaus Friedrichs, Julia Schiffner, and Claus Weihs
Computational Aspects of Natural Languages’ Similarities (109)
  Andreea Beica and Liviu P. Dinu
A Unifying Framework for GPR Image Reconstruction (110)
  Andre Busche, Ruth Janning, Tomáš Horváth, and Lars Schmidt-Thieme
Evaluating Similarity Measures for Plagiarism Detection in Melody Transcriptions (111)
  Christian Dittmar, Daniel Gärtner, Kay F. Hildebrand, and Florian Müller
From Single Tones to MIDI Remixes - Detecting Families of Musical Instruments by High-Level Features (112)
  Markus Eichhoff and Claus Weihs
Learning in groups and exam performance (113)
  Andreas Geyer-Schulz, Jonas Kunze, and Andreas Sonnenbichler
ANOVA and Alternatives for Causal Inferences (114)
  Sonja Hahn


Testing Models for Medieval Settlement Location (115)
  Irmela Herzog
Supporting Selection of Statistical Techniques in Research (116)
  Kay F. Hildebrand
Alignment methods for folk tune classification (117)
  Ruben Hillewaere, Bernard Manderick, and Darrell Conklin
Comparing regression approaches in modelling (non-)compensatory judgment formation (118)
  Thomas Hörstermann and Sabine Krolak-Schwerdt
Sensitivity Analyses for the Rasch Model (119)
  Daniel Kasper and Ali Ünlü
Music and Timbre Segmentation by efficient Order Constrained K-Means Clustering (120)
  Sebastian Krey, Uwe Ligges, and Friedrich Leisch
The balance of value and space - Merging classification and regionalization to make more sense out of spatial data (121)
  Martin Loidl and Christoph Traun
Confidence measures in automatic music classification (123)
  Hanna Lukashevich
Multi-Step Linear Discriminant Analysis for Classification of Event-Related Potentials (124)
  Nguyen Hoang Huy, Stefan Frenzel, and Christoph Bandt
The Author in Translation: A Computational Method (125)
  Sergiu Nisioi and Liviu P. Dinu
Differentiation of innovation strategies across regions (126)
  Dominik Antoni Rozkrut
The Impact of Student Loans on Personal Financing of Higher Education in Germany (127)
  Alexandra Schwarz
Espionage Risk Assessment for Security of Defense based Research and Technology (128)
  Dirk Thorleuchter and Dirk Van den Poel

Using Latent Class Models with Random Effects for Investigating Local Dependence
    Matthias Trendtel and Ali Ünlü
Music Genre Prediction by High-Level Instrument and Harmony Characteristics
    Igor Vatolkin, Günther Rötter, and Claus Weihs
The OECD’s Programme for International Student Assessment (PISA) Study: A Review of Its Basic Psychometric Concepts
    Ali Ünlü, Daniel Kasper and Matthias Trendtel

Part X Biostatistics and Bioinformatics

Rank aggregation for candidate gene selection
    Andre Burkovski, Ludwig Lausser and Hans A. Kestler
Unsupervised dimension reduction methods for protein sequence classification
    Dominik Heider, Christoph Bartenhagen, J. Nikolaj Dybowski, Sascha Hauke, Martin Pyka, and Daniel Hoffmann
Prediction of Surgery Duration Using Data Mining Methods on Anaesthesia Protocols
    Pawel Matuszyk, Dominik Brammen, René Schult, and Myra Spiliopoulou
The critical noise level for learning Boolean functions
    Markus Maucher, Christian Wawra, and Hans A. Kestler
Decision tree ensembles with different split criteria
    Sergej Potapov, Asma Gul, Werner Adler, and Berthold Lausen
A Transductive Set Covering Machine
    Florian Schmid, Ludwig Lausser and Hans A. Kestler

Part XI LIS’12 Workshop

LinSearch – Effiziente Indizierung an der Technischen Informationsbibliothek, Hannover
    Dr. Debora Daberkow, Dr. Petra Mensing, Dr. Irina Sens, Claudia Todt
Herausforderung "Neue Klassifikation für Freihandbestände" - 3 Praxis-Beispiele aus der Schweiz
    Uwe Geith and Dr. Wolfgang Giella
Die sachliche Suche in Schweizer Online-Katalogen und Discovery-Systemen
    Uwe Geith

Verarbeitung von Sacherschliessungselementen in Discoverysystemen: Auf dem Weg zu einer nutzergerechten Verwendung von inhaltlicher Erschließung in der E-LIB Bremen
    Dr. Elmar Haake
Der Blog als Thesaurus-Datenbank
    Andreas Ledl
AUSZUG AUS DEM LITERATURBERICHT 2011 DEWEY DECIMAL CLASSIFICATION (DDC)
    Bernd Lorenz
Entwicklung eines Werkzeugs zur Visualisierung der SWD/GND
    Dr.-Ing. Jan Frederik Maas
Practical Experiences with Machine Learning-based Text Categorization for Library Applications
    Elisabeth Mödden, Mathias Lösch, Monika Lösse, and Ulrike Junger
Abgleich von Titeldaten zur Übernahme von Sacherschließungsinformationen über Verbundgrenzen
    Magnus Pfeffer
Data Enrichment in Discovery Systems using Linked Data
    Dominique Ritze and Kai Eckert
Instrumentalisierung der klassifikatorischen Sacherschließung im neuen Suchportal mit AquaBrowser in der Vorarlberger Landesbibliothek
    Karl Rädler
Text Mining für den Ontologieaufbau
    Elke Bubel, Nils Elsner, Peter König, Helmut Müller, Nadejda Nikitina, Mario Quilitz, Silke Rehme, Achim Rettinger, and Michael Schwantner
Sacherschliessung mit GND/RSWK im Verbund Basel: eine erste Bilanz
    Alice Spinnler
Resource Discovery Systeme – Chance oder Verhängnis für die bibliothekarische Erschließung?
    Heidrun Wiesenmüller
Inhaltliche Anpassung der RVK als Aufstellungsklassifikation – Projekt Bibliotheksneubau Kleine Fächer der FU Berlin, Schwerpunkt Orient
    Helen Younansardaroud

List of Contributors

Prof. Dr. Myra Spiliopoulou (PC Chair)
Workgroup KMD: "Knowledge Management and Discovery"
Faculty of Computer Science
Otto-von-Guericke-Universität Magdeburg
Universitätsplatz 2
39106 Magdeburg, Germany

Prof. Dr. Dr. Lars Schmidt-Thieme (PC Chair, Local Organizer)
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science
University of Hildesheim
Marienburger Platz 22
D-31141 Hildesheim, Germany

Ruth Janning, M.Sc. (Local Organizer)
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science
University of Hildesheim
Marienburger Platz 22
D-31141 Hildesheim, Germany

Part I

Keynote Speakers

Universal Learning vs. No Free Lunch results: can there be learners that do not require task-specific knowledge?

Shai Ben-David (1), Nathan Srebro (2), and Ruth Urner (3)

(1) University of Waterloo, Canada
(2) Toyota Technological Institute at Chicago, United States
(3) University of Waterloo, Canada

Abstract. The so-called No-Free-Lunch principle is a basic insight of machine learning. It may be viewed as stating that in the absence of prior knowledge (or inductive bias), every learning algorithm may fail on some *learnable* task. In recent years, several paradigms for "universal learning" have been proposed and advocated. These range from paradigms of an almost science-fictional nature, like "Automation of science", through practically oriented Deep Belief Networks, to theoretical constructs like Universal Kernels, Universal Priors and Universal Coding for MDL-based learning. In this talk I address this apparent contradiction by examining and analyzing several possible definitions of universal learning. I will show a basic no-free-lunch theorem for such generic learning and discuss how it applies to the above-mentioned universal learning paradigms.

Keywords Theory, Universal Learning, No-Free-Lunch

Data Stream Mining for Ubiquitous Environments

João Gama

(1) LIAAD, INESC TEC
(2) FEP, University of Porto, Portugal

Abstract. Data stream mining is, nowadays, a mature topic in data mining. Nevertheless, most of the works focus on centralized approaches to learn from sequences of instances generated from environments with unknown dynamics, that can be read only once or a small number of times, using limited computing and storage capabilities. The phenomenal growth of mobile and embedded devices coupled with their ever-increasing computational and communications capacity presents an exciting new opportunity for real-time, distributed intelligent data analysis in ubiquitous environments. In these contexts centralized approaches have limitations due to communication constraints, power consumption (e.g. in sensor networks), and privacy concerns. Distributed online algorithms are highly needed to address the above concerns. The focus of this talk is on distributed stream mining algorithms that are highly scalable, computationally efficient and resource-aware. These features enable the continued operation of data stream mining algorithms in highly dynamic mobile and ubiquitous environments.

Keywords Data Mining, Data Streams, Distributed Algorithms

Where Data Analysis Meets Graph Theory

Wolfgang Gaul

University of Karlsruhe, Germany

Abstract. Based on the information by which objects are described, standard tasks of data analysis try to reveal peculiarities (features, structures, etc.) in the data that help to characterize the objects. When relations between objects are part of the available information, the objects can be interpreted as vertices of a graph, and knowledge about the relational structure between the vertices can be added to the underlying data analysis situation with the help of (possibly weighted) links between pairs of vertices. Graph clustering and Web data mining, among others, are examples that will be used to demonstrate findings in which data analysis and graph theory overlap.

Keywords Graph Clustering, Web Data Mining, Data Analysis in Marketing

Arbitrage-free Scenario Trees for Financial Optimization

Alois Geyer (1), Michael Hanke (2), and Alex Weissensteiner (3)

(1) Vienna University of Economics and Business, Austria
(2) University of Liechtenstein, Liechtenstein
(3) Free University of Bolzano/Bozen, Italy

Abstract. This paper presents a method which is designed to generate arbitrage-free scenario trees representing multivariate return distributions. Our approach is embedded in the setting of Arbitrage Pricing Theory (APT), and asset returns are assumed to be driven by orthogonal factors. In a complete market setting we derive no-arbitrage bounds for expected excess returns using the least possible number of scenarios (i.e. the smallest dimension of the discrete state space) necessary to match the first two moments and to exclude arbitrage at the outset. This not only safeguards against the curse of dimensionality: numerical results from solving two-stage asset allocation problems show that highly accurate results can be obtained with the smallest possible scenario tree.

Keywords No-Arbitrage Bounds, Scenario Generation, Financial Optimization

Connected Cars, Machine-to-Machine Environments, and Distributed Data Mining

Hillol Kargupta

(1) Agnik
(2) University of Maryland, Baltimore County, Computer Science & Electrical Engineering Department, USA

Abstract. Modern vehicles are embedded with a variety of sensors monitoring different functional components of the car and the driver's behavior. With vehicles getting connected over wide-area wireless networks, much of this vehicle diagnostic data, along with location and accelerometer information, is now accessible to a wider audience through wireless aftermarket devices. This data offers a rich source of information about vehicle and driver performance. Once it is combined with other contextual data about the car, environment, location, and the driver, it can offer exciting possibilities. Distributed data mining technology powered by onboard analysis of data is changing the face of such vehicle telematics applications for the consumer market, the insurance industry, car repair chains and car OEMs. This talk will offer an overview of the market and emerging product types, and identify some of the core technical challenges. It will describe how advanced data analysis has helped create innovative products and made them commercially successful. The talk will offer a perspective on the algorithmic issues and describe their practical significance. It will end with remarks on future directions of the field of Machine-to-Machine (M2M) sensor networks and how the next generation of researchers can play an important role in shaping it.

On the value of incorporating sequential information into predictive analytics classification models for analytical CRM

Dirk Van den Poel

Ghent University, Department of Marketing, Tweekerkenstraat 2, 9000 Ghent, Belgium

Abstract. This keynote talk gives an overview of different methods to incorporate sequential information into classification models for predictive analytics in marketing. More specifically, we zoom in on SAM (sequence alignment methods), Markov, MTD, MTDg, Markov for Discrimination, and survival analysis. It has been shown time and again that sequential data add value to predictive models in marketing. We discuss applications of these techniques in financial services, fast-moving consumer goods (FMCG), and home appliances. Sequential data capture two aspects: (1) order and (2) timing. We show that sequential information is useful for cross-sell modeling (PRINZIE et al. 2006b, 2007) as well as customer churn modeling (MIGUEIS et al. 2012a, 2012b; PRINZIE et al. 2006a) in analytical customer relationship management.

References
MIGUEIS V.L., VAN DEN POEL D., CAMANHO A.S., CUNHA J.F. (2012a): Predicting partial customer churn: On the value of the purchasing sequence. Under review.
MIGUEIS V.L., VAN DEN POEL D., CAMANHO A.S., CUNHA J.F. (2012b): Modeling partial customer churn: On the value of first product-category purchase sequences. Under review.
PRINZIE A. and VAN DEN POEL D. (2006a): Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM. Decision Support Systems, 42 (2), 508-526.
PRINZIE A. and VAN DEN POEL D. (2006b): Investigating Purchasing Patterns for Financial Services using Markov, MTD and MTDg Models. European Journal of Operational Research, 170 (3), 710-734.
PRINZIE A. and VAN DEN POEL D. (2007): Predicting home-appliance acquisition sequences: Markov/MTD/MTDg and survival analysis for modeling sequential information in NPTB models. Decision Support Systems, 44 (1), 28-45.

Keywords PREDICTIVE ANALYTICS, SEQUENCE ANALYSIS, ANALYTICAL CUSTOMER RELATIONSHIP MANAGEMENT

Autonomous Robotics: Defining Instincts and Learning Systems of Values

Michèle Sebag

University Paris-Sud, CNRS, France

Abstract. Reinforcement learning aims at finding a good action policy, interacting with the environment in such a way that the agent (the robot) optimizes its cumulative reward over time. Where does the reward come from? In the robotics simulation context, the ground truth is available and the designer can use it to steer the robot's learning toward the desired goals through an appropriate reward function. When reinforcement learning takes place on the robot, the ground truth is no longer available. A first issue is then to design an intrinsic reward function, or "instinct", providing the robot with internal incentives to act and explore its environment, visiting all states reachable within a given time budget. A second issue is to provide the robot with "values", indicating that not all reachable states are equal, and gradually steering the exploration toward the most promising behaviors. Regarding the former issue, an intrinsic reward function will be discussed: viewing the robot as an information machine, a natural motivation is to maximize the quantity of information in its sensory data stream. Regarding the latter issue, preference-based approaches can be used. A first possibility is to ask the designer to rank the behaviors demonstrated by the robot, thus enabling the robot to learn a policy return estimate. Iteratively, the robot builds a new and presumably better controller during an active reinforcement learning phase. It then demonstrates this controller to the designer, and uses the designer’s feedback (it’s better / it’s worse) to update the policy return estimate. Another possibility is to design ad hoc experiments to learn a preference-based value function directly on the state space. For instance, the designer can set the robot in a target position and exploit the fact that the robot's situation almost surely deteriorates over time when using naive controllers.

Keywords reinforcement learning, preference learning, active learning, robotics

Stream Data Mining and Anytime Algorithms

Thomas Seidl

Department of Computer Science 9 (Data Management and Data Exploration), RWTH Aachen University, 52056 Aachen

Abstract. Sensors pervade all areas of personal, environmental and industrial domains, and nearly all applications in engineering, telecommunication, business, and life sciences produce tremendously increasing amounts of data. Though the availability of storage space grows at decreasing prices, many of the data require immediate analysis as they cannot be stored for reasons of their huge size or the fast reaction they require. In contrast to static data mining algorithms, stream data mining techniques follow the data and, as an additional challenge, the evolution of their concepts. In contrast to real-time algorithms which strictly obey fixed time budgets, anytime algorithms are designed to exploit the available time between the arrival of objects in a stream even for varying stream rates. Recently, various anytime classification techniques as well as anytime clustering algorithms have been proposed which were integrated into the MOA framework.

Keywords Data Mining, Stream Data Analysis, Clustering, Anytime Algorithms

Knowledge Discovery in Shopping Path Data

Katsutoshi Yada

Kansai University, Japan

Abstract. The development of Radio Frequency Identification (RFID) has enabled detailed tracking and electronic recording of customer positions and movements in stores. We use the term Shopping Path Data for time-series data on customer movement paths in stores obtained in this way. This talk describes a model using Shopping Path Data and presents findings useful for in-store marketing, obtained by analyzing data from store experiments in Japan. Shopping Path Data provides us with new knowledge about customer movement in stores.

Keywords Shopping Path Data, RFID, Customer Movement, Time-Series Data, Knowledge Discovery

Part II

Invited Session: Ensemble methods in clustering and classification

Diversity Based Weighting to Improve the Performance of Classifier Ensembles

Werner Adler (1), Zardad Khan (2), Sergej Potapov (1), and Berthold Lausen (2)

(1) University Erlangen-Nuremberg, Germany {werner.adler,sergej.potapov}@imbe.med.uni-erlangen.de
(2) University of Essex, United Kingdom {zkhan,blausen}@essex.ac.uk

Abstract. The performance of bootstrap aggregated classifier ensembles, e.g. bagged classification trees (Breiman, 1996) or random forests (Breiman, 2001), depends on the diversity of the base classifiers constituting the ensembles. Adler et al. (2011) proposed a modified approach to draw the bootstrap samples for the base classifiers in a repeated measurements setup. The bootstrap samples are drawn on the subject level rather than the observation level: from a data set consisting of several observations from several subjects, as is the case in the repeated measurements setup, subjects are randomly selected into the bootstrap sample, while the subjects not selected make up the out-of-bag sample. Compared to the traditional approach, where observations are drawn for the bootstrap sample irrespective of the subjects, this leads to more diverse base classifiers and hence to an improved classification performance of the ensemble. In addition to indirectly increasing the diversity of the ensemble by modified bootstrap strategies, we examine the effect of actively weighting the single base classifiers based on their similarity or dissimilarity to each other, as calculated by proposed similarity measures (examined e.g. by Tang et al., 2006). We report and discuss the results obtained using simulated data as well as a clinical example data set.
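
The subject-level resampling described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `subject_level_bootstrap`, the toy subject labels and the fixed seed are assumptions made for the example.

```python
import random

def subject_level_bootstrap(subject_ids, rng):
    """Draw one bootstrap sample on the subject level.

    subject_ids: list giving the subject of each observation (row).
    Returns (in_bag_rows, out_of_bag_rows) as lists of row indices.
    """
    subjects = sorted(set(subject_ids))
    # Sample subjects with replacement, as in an ordinary bootstrap.
    drawn = [rng.choice(subjects) for _ in subjects]
    rows_of = {}
    for row, subject in enumerate(subject_ids):
        rows_of.setdefault(subject, []).append(row)
    # Every observation of a drawn subject enters the sample
    # (once per time the subject was drawn).
    in_bag = [row for subject in drawn for row in rows_of[subject]]
    drawn_set = set(drawn)
    # Subjects that were never drawn form the out-of-bag sample.
    out_of_bag = [row for row, s in enumerate(subject_ids) if s not in drawn_set]
    return in_bag, out_of_bag

# Hypothetical data: 4 subjects with 3 repeated measurements each.
subject_ids = [s for s in ("A", "B", "C", "D") for _ in range(3)]
in_bag, out_of_bag = subject_level_bootstrap(subject_ids, random.Random(0))
```

Because whole subjects are drawn, no subject contributes observations to both the in-bag and the out-of-bag sample, which is the property the modified approach relies on.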

References
ADLER, W., POTAPOV, S., and LAUSEN, B. (2011): Classification of repeated measurements data using tree-based ensemble methods. Computational Statistics, 26(2), 355-369.
BREIMAN, L. (1996): Bagging Predictors. Machine Learning, 24(2), 123-140.
BREIMAN, L. (2001): Random forests. Machine Learning, 45, 5-32.
TANG, E.K., SUGANTHAN, P.N., YAO, X. (2006): An analysis of diversity measures. Machine Learning, 65, 247-271.

Keywords Bootstrap, Classifier Ensembles, Diversity

Tailoring Componentwise Boosting for Prediction with a Huge Number of Molecular Measurements

Harald Binder

Institut für Medizinische Biometrie, Epidemiologie und Informatik, Universitätsmedizin der Johannes-Gutenberg-Universität Mainz, 55101 Mainz, Germany

Abstract. When seeking prognostic information for patients or when attempting classification, modern technologies provide a huge number of molecular measurements as a starting point. For example, there may be more than one million single nucleotide polymorphisms (SNPs) that need to be simultaneously considered with respect to a clinical endpoint or class membership. Sparse multivariable regression techniques have recently become available for automatically identifying molecular signatures that comprise relatively few covariates and provide reasonable prediction performance. To illustrate how such approaches can be adapted to the specific features of molecular measurements, we propose different variants of a componentwise likelihood-based boosting approach for SNP data. The latter links SNP measurements to a class membership or a time-to-event endpoint by a regression model that is built up in a large number of steps. The variants allow for strategic choices in dealing with SNPs that differ in variance due to their variation in minor allele frequencies. In addition, we propose a heuristic that allows computationally efficient handling of millions of covariates. The resulting models are judged according to prediction performance and signature stability in resampling data sets. By considering these different aspects, a more general strategy is outlined for linking a huge number of molecular measurements to class membership or a time-to-event endpoint by means of componentwise likelihood-based boosting.

Keywords PREDICTION, SNP, VARIABLE SELECTION, BOOSTING

An AUC-based Permutation Variable Importance Measure for Random Forests for Unbalanced Data

Silke Janitza (1)∗ and Anne-Laure Boulesteix (2)

(1) Department for Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, 81377 Munich, Germany
(2) Department for Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, 81377 Munich, Germany

Abstract. The random forest method is a commonly used tool for classification with high dimensional data as well as for predictor ranking. It can handle complex data structures including correlated predictors, interactions and heterogeneity and offers inbuilt variable importance measures for the ranking of important predictors. However classification performance of random forest is suboptimal in case of extremely unbalanced data, i.e. data where response class sizes differ considerably. In this case it tends to almost always predict the majority class, yielding a minimal error rate. The standard random forest permutation variable importance measure which is based on the error rate is directly affected by this problem and loses its ability to discriminate between important and unimportant predictors in the case of extreme class unbalance. This effect is more pronounced for small effects and small sample sizes. The area under the curve (AUC) is a promising alternative to the error rate for unbalanced classes as it puts the same weight on both classes. A novel permutation variable importance measure in which the error rate is replaced by the AUC is therefore a promising alternative for unbalanced data settings. It can be shown in simulations that this measure outperforms the error-rate-based permutation variable importance measure for strongly unbalanced classes.
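
The idea of replacing the error rate by the AUC in a permutation importance can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: a random forest would compute the difference per tree on its out-of-bag observations and average over trees, whereas the sketch uses a single stand-in scoring function. The names `auc`, `auc_permutation_importance` and `score_fn`, and the toy data, are hypothetical.

```python
import random

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_permutation_importance(score_fn, X, y, column, rng):
    """AUC with the original data minus AUC after permuting one predictor."""
    baseline = auc(y, [score_fn(row) for row in X])
    permuted_values = [row[column] for row in X]
    rng.shuffle(permuted_values)
    X_perm = [row[:column] + [v] + row[column + 1:]
              for row, v in zip(X, permuted_values)]
    return baseline - auc(y, [score_fn(row) for row in X_perm])

# Hypothetical unbalanced data: 20 of 200 observations are in class 1.
rng = random.Random(1)
X = [[i / 200, rng.random()] for i in range(200)]
y = [1 if i >= 180 else 0 for i in range(200)]
score_fn = lambda row: row[0]  # stand-in for a fitted classifier's score

imp_signal = auc_permutation_importance(score_fn, X, y, 0, rng)
imp_noise = auc_permutation_importance(score_fn, X, y, 1, rng)
```

Since the AUC compares every class-1 observation with every class-0 observation, both classes carry equal weight regardless of their sizes, which is the motivation given in the abstract.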

Keywords RANDOM FOREST, VARIABLE IMPORTANCE MEASURE, AREA UNDER THE CURVE, FEATURE SELECTION, UNBALANCED DATA, CLASS IMBALANCE

∗ First author SJ is a student

Probability Machines: Estimating individual probabilities using machine learning methods

Andreas Ziegler and Jochen Kruppa

Universität zu Lübeck, Germany {ziegler,kruppa}@imbs.uni-luebeck.de

Abstract. Machine learning (ML) is increasingly used for data mining in biomedicine, credit scoring, weather forecasting and other areas of application. Recent work has shown that machine learning can also be used for probability estimation by embedding the probability estimation problem in nonparametric regression estimation. As a result, nonparametric regression machines directly pass on their properties, such as consistency and convergence rate, to the corresponding probability machine. Their advantage over standard parametric statistical approaches, such as logistic regression, is that probability machines do not require a correct specification of the functional relationship between the dependent variables and the independent variables. Instead, these methods provide robust nonparametric modeling of the regression function with minimal assumptions about the form of the relationships. Probability machines apply directly to assessing the probability of outcomes of interest based on different characteristics of individuals. In therapeutic observational studies, they can also be used for computing propensity scores for adjustment. They extend easily to dependent variables with multiple categories. In this contribution we first embed the probability estimation problem in nonparametric regression estimation. Next, we explore some consistent probability machines, such as random forest, k-nearest neighbors, and bagged nearest neighbors, for the purpose of probability estimation. We show how probabilities can be estimated with probability machines using standard software. Finally, we illustrate the approach using data from the literature as well as from our own applications.
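
The embedding of probability estimation in nonparametric regression can be illustrated with k-nearest neighbors: regressing a 0/1 outcome on the covariates yields the local mean of the outcome, which is an estimate of P(y = 1 | x). The following is a minimal sketch under that reading; the function name `knn_probability` and the toy data are assumptions for the example, not the software referred to in the abstract.

```python
def knn_probability(train_X, train_y, x, k):
    """Estimate P(y = 1 | x) as the mean 0/1 outcome of the k nearest neighbours."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # Indices of the k training points closest to x (squared Euclidean distance).
    nearest = sorted(range(len(train_X)),
                     key=lambda i: sq_dist(train_X[i], x))[:k]
    return sum(train_y[i] for i in nearest) / k

# Hypothetical toy data: one covariate, binary outcome coded 0/1.
train_X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
train_y = [0, 0, 1, 0, 1, 1]

p_mid = knn_probability(train_X, train_y, [2.5], k=4)   # 0.5
p_high = knn_probability(train_X, train_y, [5.0], k=2)  # 1.0
```

No functional form links the covariate to the outcome here; the estimate depends only on which training observations are close to the query point.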

Keywords Bagged nearest neighbor, Consistency, k-nearest neighbor, Nonparametric regression, Probability estimation, R software package, Random forest, Random jungle

Part III

Invited Session: Applications in Empirical Educational Research Based on Secondary Data

Does school choice increase ethnic segregation in primary schools or only segregation indices?

Anna Makles (1) and Kerstin Schneider (2)

(1) University of Wuppertal, Schumpeter School of Business and Economics, Germany
(2) University of Wuppertal, Schumpeter School of Business and Economics, Germany

Abstract. In 2006 the government of the German federal state of North Rhine-Westphalia (NRW) passed a new school law abolishing binding primary school catchment areas by the 2008/09 school year. Hence, parents in NRW, unlike their counterparts in other German federal states, are now allowed to choose a primary school independent of their place of residence. The political intention was to increase parental school choice and to foster competition between schools. The most frequently cited argument against free school choice, however, is the fear of increased ethnic segregation and educational disparity. In educational research, school segregation is mainly measured by the dissimilarity index of Duncan and Duncan (1955). Despite the popularity of this index and related measures in empirical work, these indices suffer from severe shortcomings. As segregation indices are particularly sensitive when group sizes and minority proportions are small, there is a need to account for changes in group size (e.g. Carrington and Troske (1997)). In our study, we show how biased the index of ethnic segregation in primary schools is and how easily, therefore, a negative effect of the policy reform can be detected. Finally, we calculate unbiased systematic segregation measures for different ethnic groups and show (a) that accounting for the drawbacks of the index within empirical studies is not that difficult and (b) that systematic segregation has not increased significantly since the abolition of primary school catchment areas.
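
The small-unit bias mentioned in the abstract can be made concrete with a short sketch. The dissimilarity index is half the sum, over schools, of the absolute difference between each school's share of all minority pupils and its share of all majority pupils. The second function estimates, by simulation, the index one would expect if pupils were allocated to schools purely at random, in the spirit of the random-allocation benchmark of Carrington and Troske (1997). The function names, the number of draws and the toy head counts are assumptions for illustration, not the authors' data or code.

```python
import random

def dissimilarity_index(minority, majority):
    """Duncan and Duncan dissimilarity index for per-school head counts."""
    total_min, total_maj = sum(minority), sum(majority)
    return 0.5 * sum(abs(a / total_min - b / total_maj)
                     for a, b in zip(minority, majority))

def expected_index_under_random_allocation(minority, majority, rng, draws=200):
    """Mean index when pupils are shuffled across schools of fixed size."""
    sizes = [a + b for a, b in zip(minority, majority)]
    pupils = [1] * sum(minority) + [0] * sum(majority)  # 1 = minority pupil
    total = 0.0
    for _ in range(draws):
        rng.shuffle(pupils)
        sim_min, start = [], 0
        for size in sizes:
            sim_min.append(sum(pupils[start:start + size]))
            start += size
        sim_maj = [size - a for size, a in zip(sizes, sim_min)]
        total += dissimilarity_index(sim_min, sim_maj)
    return total / draws

# Hypothetical example: 4 schools of 5 pupils, only 2 minority pupils overall.
minority = [1, 0, 1, 0]
majority = [4, 5, 4, 5]
observed = dissimilarity_index(minority, majority)
baseline = expected_index_under_random_allocation(
    minority, majority, random.Random(0))
```

In this toy setting the observed index is already above 0.5 although the two minority pupils are spread as evenly as possible, and the random-allocation baseline is at least as large. Comparing the observed index with such a baseline is one way to separate systematic segregation from the part produced by small group sizes alone.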

References
CARRINGTON, W. and TROSKE, K. (1997): On measuring segregation in samples with small units. Journal of Business & Economic Statistics, 15, 402–409.
DUNCAN, O. and DUNCAN, B. (1955): A methodological analysis of segregation indexes. American Sociological Review, 20, 210–217.

Keywords SEGREGATION, DISSIMILARITY, SCHOOL CHOICE, SCHOOL CATCHMENT AREAS

Applications in Empirical Educational Research Based on Secondary Data

Alexandra Schwarz

German Institute for International Educational Research, Schlossstr. 29, D-60486 Frankfurt am Main, Germany

Abstract. In general, empirical educational research is an area concerned with the analysis and evaluation of conditions and requirements, processes, and outcomes of education. Typical problems in this area involve the assessment of individual student achievement and progress and the evaluation of professional practice, programs and policies. Hence, empirical educational research is an interdisciplinary research domain, bringing together theories and methods from education, psychology, sociology and economics. Whereas evaluating the effects of educational programs often requires primary surveys conducted in experimental designs, the analysis of national and regional databases is of great importance for investigating governance aspects and policy issues of education and education systems. Data from official and semi-official statistics in particular offer the opportunity to evaluate conditions, processes and outcomes of education on a much broader basis. Such databases have become increasingly available in recent years, but their use for work on scientific issues still needs to be improved. Among other things, this can be attributed to the fact that dealing with administrative and other secondary data requires special methods and a sound methodology, e.g. techniques of statistical inference, weighting, and data fusion. In this session, we consider methodological questions of analyzing secondary data in education contexts as well as empirical papers dealing with the evaluation of specific educational or organizational policies, programs or systems.

Keywords EDUCATION, SECONDARY DATA, POLICY ANALYSIS

Part IV

Invited Session: Dynamic Cluster Analysis - Theory and Practise

Old and new dynamic clustering methods

Hans-Hermann Bock

Institute of Statistics, RWTH Aachen University

Abstract. 'Clustering methods' deal with the grouping of objects into (typically disjoint) homogeneous classes on the basis of data that are recorded for all objects and that characterize their mutual similarities or dissimilarities. Many clustering approaches try to attain an 'optimal' classification $\mathcal{C}^*$ by minimizing a suitable clustering criterion $g(\mathcal{C})$ among all feasible groupings $\mathcal{C}$. In many situations there is a two- or multi-variable criterion $G(\mathcal{C}, \theta)$ with a suitable parameter vector $\theta$ such that $g(\mathcal{C}) = \min_{\theta} G(\mathcal{C}, \theta)$. Then the classical relaxation method from mathematics can be used for finding or approximating an optimum configuration just by iteratively minimizing $G(\mathcal{C}, \theta)$ w.r.t. $\mathcal{C}$ and $\theta$ in turn, thereby producing a dynamically varying sequence of steadily improving classifications. This algorithm is termed k-means method, dynamic clustering, iterated minimum distance method, etc. The classical case of least-squares clustering, $G(\mathcal{C}, \theta) = \sum_{i=1}^{k} \sum_{j \in C_i} \|x_j - \theta_i\|^2 \to \min$, has been generalized in many ways, e.g., in the framework of probabilistic clustering models, fuzzy clustering, subspace clustering, distance-based criteria (medoid method), multimode clustering, entropy clustering, etc. The paper provides a survey of the resulting dynamic clustering approaches and also briefly discusses some typical problems that are encountered with these methods: convergence, local optima, bias when estimating class-specific model parameters, choice of the number of classes, etc.
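
The alternating minimization described in the abstract can be sketched for the least-squares case as follows. This is a minimal one-dimensional illustration, not code from the paper; the function names, the starting centers and the toy data are assumptions for the example. Each pass first minimizes the criterion over the partition for fixed class centers and then over the centers for the fixed partition, so the recorded criterion values never increase.

```python
def criterion(points, labels, centers):
    """Least-squares criterion: sum of squared distances to the class centers."""
    return sum((x - centers[c]) ** 2 for x, c in zip(points, labels))

def k_means_1d(points, centers, iterations=10):
    """Alternate the two partial minimization steps.

    Returns the final labels, the final centers and the criterion after each pass.
    """
    centers = list(centers)
    history = []
    for _ in range(iterations):
        # Step 1: minimize over the partition for fixed centers (nearest center).
        labels = [min(range(len(centers)), key=lambda c: (x - centers[c]) ** 2)
                  for x in points]
        # Step 2: minimize over the centers for the fixed partition (class means).
        for c in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a class is empty
                centers[c] = sum(members) / len(members)
        history.append(criterion(points, labels, centers))
    return labels, centers, history

# Hypothetical data with two well-separated groups and poor starting centers.
points = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
labels, centers, history = k_means_1d(points, centers=[0.0, 3.0])
```

With these starting values the first pass still assigns the point 2.0 to the wrong class; the second pass corrects this and the criterion drops to 2.5, after which the classification no longer changes. A different start can end in a different, possibly only locally optimal, classification, which is one of the typical problems the abstract lists.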

References
Bezdek, J.C. (1981): Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York.
Bock, H.-H. (1974): Automatische Klassifikation. Vandenhoeck & Ruprecht, Göttingen.
Bock, H.-H. (2007): Clustering methods: a history of k-means algorithms. In: P. Brito, P. Bertrand, G. Cucumel, F. de Carvalho (eds.): Selected contributions in data analysis and classification. Springer, Heidelberg, 2007, 161-172.
Bock, H.-H. (2008): Origins and extensions of the k-means algorithm in cluster analysis. Journ@l Electronique d'Histoire des Probabilités et de la Statistique. Numéro spécial 'Contributions à l'histoire de l'analyse des données' de la revue Electronic Journ@l for History of Probability and Statistics (JEHPS), vol. 4 (2008), no. 2, 18pp. www.emis.de/journals/JEHPS/decembre2008.html
Dalenius T. (1950): The problem of optimum stratification I. Skandinavisk Aktuarietidskrift, 203-213.
Forgy, E.W. (1965): Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometric Society Meeting, Riverside, California, 1965. Abstract in Biometrics 21 (1965), 768.
Diday, E. (1971): Une nouvelle méthode de classification automatique et reconnaissance des formes: la méthode des nuées dynamiques. Revue de Statistique Appliquée XIX (2), 1970, 19-33.
Diday, E., Schroeder, A. (1976): A new approach in mixed distribution detection. R.A.I.R.O. Recherche Opérationnelle 10 (6) 75-106.
Steinhaus, H. (1956): Sur la division des corps matériels en parties. Bulletin de l'Académie Polonaise des Sciences, Classe III, vol. IV, no. 12, 801-804.
Steinley, D. (2006): K-means clustering: a half-century synthesis. British J. on Mathematical and Statistical Psychology 59, 1-34.


Machine learning approach in information retrieval for real estate offers analysis Paweł Lula Cracow University of Economics, Poland [email protected]

Abstract. Information retrieval is a process that automatically extracts fundamental facts from text documents. This technique is widely used for processing web pages, forums and blogs. The main challenge for automatic text mining is the identification and extraction of crucial pieces of information. The rules for information extraction can be defined manually or can be built by a machine learning process. The main goal of the presentation is to evaluate different approaches used for information retrieval. The following solutions are discussed:
• methods based on the vector space model,
• automatically defined regular expressions,
• methods based on the domain model.
All methods are trained and evaluated on a set of real estate offers written in Polish.

Keywords text mining, information extraction, machine learning methods


Dynamical Clustering with Self Learning Neural Networks Kamila Migdał Najman and Krzysztof Najman University of Gdansk, Poland {kmn,krzysztof.najman}@wzr.pl

Abstract. Together with constantly expanding IT knowledge, the amount of data collected in different database systems is increasing. One of the characteristics of modern databases is their increasing dynamism. The number of registered units and the structure of the group change dynamically. In order to effectively detect fast changes in the number and structure of clusters, appropriate methods of cluster analysis should be used. The article presents the results of simulation research into the possibility of using self-learning neural networks for clustering data with a dynamically changing group structure.

References
FRITZKE B., Growing cell structures - a self-organizing network for unsupervised and supervised learning, Neural Networks, 1994, vol. 7, no. 9, page 1441-1460.
NAJMAN K., Dynamical clustering with Growing Neural Gas Networks, Statistical Review, 2011, vol. 3-4, page 231-242 (in Polish).
QIN A. K., SUGANTHAN P. N., Robust Growing Neural Gas algorithm with application in cluster analysis, Neural Networks, 2004, vol. 17, no. 8-9, page 1135-1148.
PRUDENT Y., ENNAJI A., An incremental Growing Neural Gas learns topologies, Proceedings of International Joint Conference on Neural Networks, 2005, page 1211-1216.

Keywords dynamical clustering, self-learning neural networks, classification


Classification of Three-Way Clustering Problems Andrzej Sokolowski Cracow University of Economics sokolows@uek.krakow.pl

Abstract. A special notation for clustering problems is proposed, in which the classification subject and the classification space are defined. We consider a set of objects Y, a set of variables Z and a set of time units T. Three types of clustering problems can be defined: simple, double and complex. The first one has one set as the classification objects, the second one has one set as the classification space, and in the complex one we try to find which objects, characterized by which variables and at which times, can be considered homogeneous. In the paper special attention is paid to the double clustering problems, for which two strategies are proposed.


Studies in Lower Secondary Educational Level Outcomes Changes in Poland Using Correspondence Analysis Agnieszka Stanimir Wroclaw University of Economics, Poland [email protected]

Abstract. The purpose of this paper is to indicate the possibility of using correspondence analysis to study changes in the level of learning outcomes at the lower secondary educational level. Assessment of student knowledge and skills is conducted on the basis of exam results from the years 2003-2010. Results were collected for all students of the two Polish regions belonging to the same Regional Examination Board. The analysis of the knowledge and skills of a young person is an extremely important task in the educational process. The use of different variables in the study provides an extensive analysis of the problem and allows recommendations for the development of the education system to be formulated. In the analysis additional factors, such as gender, commune type, competences and skills areas, were taken into account. These factors are nominal, so it is natural to apply correspondence analysis to describe associations between the categories of these variables.

References
BLASIUS, J. (2001). Korrespondenzanalyse. München: Oldenburg Verlag.
The Education and Assessment System in Poland, Central Examination Board, 1999.
GREENACRE, M. J. (1984). Theory and applications of correspondence analysis, London: Academic Press.

Keywords multiway correspondence analysis, knowledge and skills of young people, analysis of changes over time, regional comparisons


Solving Product Line Design Optimization Problems using Stochastic Programming Sascha Voekler1 and Daniel Baier2

1 Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, D-03013 Cottbus, Germany [email protected]
2 Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, D-03013 Cottbus, Germany [email protected]

Abstract. In this paper, we try to apply stochastic programming methods to product line design optimization problems. Because the part-worths of the product attributes in conjoint analysis are estimated, there is a need to deal with the uncertainty caused by the underlying statistical data (Kall/Mayer 2011). Inspired by the work of George B. Dantzig (Dantzig 1955), we developed an approach to use the methods of stochastic programming for product line design issues. Four different approaches will be compared using notional data of a yogurt market from Gaul and Baier (2009). Stochastic programming methods like single- or two-stage programs are applied to the approach of Gaul, Aust and Baier (Gaul et al. 1995) and will be compared to their original approach, to Green and Krieger (Green/Krieger 1985) and to Kohli and Sukumar (Kohli/Sukumar 1990). Besides the theoretical work, these methods will be implemented in self-written code with the help of the statistical software package R.

References
Dantzig, G.B. (1955): Linear Programming Under Uncertainty. Management Science, 1(3/4), 197–206.
Gaul, W., Aust, E., Baier, D. (1995): Gewinnorientierte Produktliniengestaltung unter Berücksichtigung des Kundennutzens. Zeitschrift für Betriebswirtschaftslehre, 65, 835-855.
Gaul, W., Baier, D. (2009): Simulations- und Optimierungsrechnungen auf Basis der Conjointanalyse. Conjointanalyse - Methoden-Anwendungen-Praxisbeispiele, D. Baier, M. Brusch (Hrsg.), Berlin, Heidelberg, Springer 2009, 163–182.
Green, P.E., Krieger, A.M. (1985): Models and Heuristics for Product Line Selection. Marketing Science, 4(1), 1-19.
Kall, P., Mayer, J. (2011): Linear Stochastic Programming - Models, Theory, and Computation. International Series in Operations Research and Management Science, Springer New York, Dordrecht, Heidelberg, London, 2011, 156.
Kohli, R., Sukumar, R. (1990): Heuristics for Product-line Design Using Conjoint Analysis. Management Science, 36(12), 1464-1478.

Keywords Conjoint Analysis, Product Line Design Optimization, Stochastic Programming.


Part V

Statistics and Data Analysis

An exact Newton's method for ML estimation in a penalized Gaussian mixture model Grigory Alexandrovich Philipps-Universität Marburg

Abstract. We discuss the problem of computing the MLE of the parameters of a multivariate Gaussian mixture. The most widely used method for solving this problem is the EM-Algorithm. Although this method converges globally under some general assumptions, does not require much storage and is simple to implement, it yields only a linear convergence rate. We introduce an alternative - an exact Newton's method, which converges locally quadratically and yields an estimate of the Fisher information matrix. To this end we discuss a parametrization of the mixture density which ensures that several restrictions on the parameters are respected during the Newton iterations. For the parametrization of the covariance matrices we use the Cholesky decomposition of the inverse: $\Sigma^{-1} = LL^{T}$. We also discuss some aspects of computing the analytical derivatives of the log-likelihood function. Further, we consider a penalization of the log-likelihood to avoid "bad" solutions, as suggested by Chen and Tan (Inference for multivariate normal mixtures, Journal of Multivariate Analysis 100 (2009) 1367-1383). Finally, we consider some numerical experiments in which we compare the EM-Algorithm with our implementation of the Newton method, and discuss the possibility of computing the MLE with Newton's method in other elliptical mixture models.
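To illustrate why the parametrization $\Sigma^{-1} = LL^{T}$ is convenient, the following Python sketch evaluates a single Gaussian log-density directly from a lower-triangular factor `L`, without inverting any matrix. It is only an illustration under stated assumptions: the function name is hypothetical, and the author's implementation, the mixture weights and the penalty term are not reproduced here.

```python
import numpy as np

def gaussian_logpdf_from_cholesky(x, mu, L):
    """Log-density of N(mu, Sigma) where Sigma^{-1} = L L^T, L lower triangular.

    Assumption: L has a non-zero diagonal, so L L^T is positive definite.
    """
    d = len(mu)
    # Quadratic form: (x - mu)^T L L^T (x - mu) = ||L^T (x - mu)||^2
    z = L.T @ (x - mu)
    # log det(Sigma^{-1}) = 2 * sum(log |L_ii|)
    log_det_precision = 2.0 * np.log(np.abs(np.diag(L))).sum()
    return 0.5 * log_det_precision - 0.5 * d * np.log(2.0 * np.pi) - 0.5 * (z @ z)
```

Any such `L` yields a valid covariance matrix, so an unconstrained Newton step on the entries of `L` cannot leave the set of symmetric positive definite matrices.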


Which District of Dortmund is the Most Dangerous? Tim Beige1, Thomas Terhorst2, Claus Weihs1 and Holger Wormer2

1 Chair of Computational Statistics, TU Dortmund University [email protected], [email protected]
2 Institute of Journalism, TU Dortmund University [email protected], [email protected]

Abstract. In this paper the districts of Dortmund, a big German city, are ranked by the risk of being involved in an offence there. In order to measure this risk, the offences reported in police press reports in the year 2011 (Presseportal (2011)) were analyzed and weighted by their maximum penalty under the German criminal code. The resulting danger index was used to rank the districts. Moreover, the socio-demographic influences on the different offences are studied. The most probable influences appear to be traffic density (Sierau (2006)) and the share of older people. Also, the inner city parts appear to be much more dangerous than the outskirts of the city of Dortmund. However, can these results be trusted? The head of the press office of Dortmund's police argues that offences might not be uniformly reported by the districts to his office, and that small offences like pickpocketing are never reported in police press reports.

References
PRESSEPORTAL: http://www.presseportal.de/polizeipresse/pm/4971/polizei-dortmund?start=0
SIERAU, U. (2006): Dortmunderinnen und Dortmunder unterwegs - Ergebnisse einer Befragung von Dortmunder Haushalten zu Mobilität und Mobilitätsverhalten, Ergebnisbericht, Dortmund-Agentur/Graphischer Betrieb Dortmund, Stadt Dortmund, 09/2006.

Keywords risk level, danger index, regression, variable selection


Empirically Measuring the Effect of Violating the Independence Assumption in Behavioural Scoring Miguel Biron∗1 and Cristián Bravo2

1 Department of Industrial Engineering, Universidad de Chile. República 701, 8370439 Santiago, Chile. [email protected]
2 Finance Center, Department of Industrial Engineering, Universidad de Chile. Domeyko 2369, 8370397 Santiago, Chile. [email protected]

Abstract. Behavioural scorings are a well-known statistical technique used by financial institutions to predict whether new clients are in danger of not repaying a loan in the future. The aim of this work is to assess the importance of the independence assumption in behavioural scorings based on logistic regression. The issue has been documented in the literature [1], but no assessment has been made of its real impact. We develop four sampling methods that control which observations associated with each client are included in the training set, avoiding a functional dependence between observations of the same client. We then calibrate regressions with variable selection on the samples created by each method, plus one using all the data in the training set (biased base method), and validate the models on an independent data set. We find that the regression built using all the observations shows the highest area under the ROC curve and the highest Kolmogorov-Smirnov statistic, while the regression that uses the fewest observations shows the lowest performance and the highest variance of these indicators. Nevertheless, method four shows almost the same performance as the base method while using fewer variables. We conclude that violating the independence assumption does not strongly affect the results and, furthermore, that trying to control for it by using less data can harm the performance of the calibrated models.

References
1. MEDEMA, L., KONING, H. R. and LENSINK, R. (2009): A practical approach to validating a PD model. Journal of Banking & Finance, 33(4), 701–708.

Keywords Behavioral Scoring, Sampling, Panel Logistic Regression

∗ Student author


Benchmarking classification algorithms on high-performance computing clusters Bernd Bischl, Julia Schiffner, Claus Weihs Lehrstuhl fuer Computergestuetzte Statistik, Technische Universitaet Dortmund Vogelpothsweg 87, 44227 Dortmund {bischl, schiffner, weihs}@statistik.tu-dortmund.de Abstract. Comparing and benchmarking classification algorithms is an important topic in applied data analysis. Extensive and thorough studies of such a kind will produce a considerable computational burden and are therefore best delegated to high-performance computing clusters. This on the other hand is technically nontrivial, requires knowledge of the underlying architectures and naive approaches often make experiments much harder to reproduce. We will demonstrate how to effectively and reproducibly perform these calculations on high-performance computing clusters with minimal effort for the researcher. We build upon our recently developed R packages BatchJobs (Map, Reduce and Filter operations from functional programming for clusters) and BatchExperiments (Parallelization and management of statistical experiments). We will present benchmarking results for standard classification algorithms and study the influence of hyperparameters and pre-processing steps on their performance.

References BISCHL, B., LANG, M., MERSMANN, O., RAHNENFUEHRER, J. and WEIHS, C. (2012): BatchJobs and BatchExperiments: Abstraction Mechanisms for Using R in Batch Environments. Technical report for Collaborative Research Center SFB 876, TU Dortmund.

Keywords Classification, Benchmarking, Parallelization


Visual models for categorical data in economic research Justyna Brzezińska Department of Statistics, University of Economics in Katowice, 1 Maja 50, 40–287 Katowice [email protected]

Abstract. This paper is concerned with the visualization of categorical data in qualitative data analysis [1],[2],[3]. Graphical methods for qualitative data and extensions using a variety of R packages will be presented. This paper outlines a general framework for data visualization methods. These ideas are illustrated with a variety of graphical methods for categorical data in large, multi-way contingency tables. Graphical methods are available in R in the vcd and vcdExtra libraries, including the mosaic plot, association plot, sieve plot, double–decker plot and agreement plot. These R packages include methods for the exploration of categorical data, such as fitting and graphing, plots and tests for independence, and visualization techniques for log–linear models. Some graphs, e.g. sieve and mosaic displays, are well–suited for detecting patterns of association in the process of model building; others are useful in model diagnosis and for graphical presentation and summaries. The use of log–linear analysis, as well as of visualizing categorical data in economics, will be presented in this paper.

Key words: Graphics, mosaic display, log-linear models, categorical data analysis

References
1. Friendly M., Visualizing Categorical Data, Cary, NC: SAS Institute, 2000.
2. Meyer D., Zeileis A., Hornik K., The strucplot framework: visualizing multi-way contingency tables with vcd, Journal of Statistical Software, 17 (3), 1-48, 2006.
3. Meyer D., Zeileis A., Hornik K., VCD: Visualizing Categorical Data, R package http://CRAN.R-project.org, 2008.


Discovering Process Certification Tendencies Mariana Carvalho, Paulo Sampaio, and Orlando Belo Algoritmi R&D Centre, University of Minho, PORTUGAL [email protected], [email protected], [email protected]

Abstract. This case study was designed to conceive a dedicated system for discovering certification tendencies, based on information about which certificates Portuguese companies acquired during the period 2008-2010. Our main goal was to provide useful (and effective) information about which kind of certificate a company should apply for to guarantee the quality of its services and products, and consequently to gain advantages over its most direct competitors. As we all know, certification is a voluntary process which, despite the time, costs and bureaucracy involved, is today quite crucial for the survival of any company. Certificates are pledges of a company's commitment to the quality of its services and products and, necessarily, to the environment and to the health and security of its workers, to name just a few. Which certificates are the most adequate for a company to apply for (and acquire) is one of the main questions that any new company in the market faces. The answer to this question is not easy and depends on several factors, such as the region where the company is located or its activity sector. A prior analysis was necessary to gain the knowledge about the state of the market that enables the best decisions to be taken. Applying data mining techniques to this case gave us a clear description of the state of the certification market in Portugal for the period in question. This was very interesting, since it revealed very particular characteristics of how Portuguese companies operate.
Specifically, with clustering analysis we obtained very refined information that gives new companies the necessary awareness of the state of the market, providing them with an initial orientation in the competitive market. The information extracted from the several application data sets is enough to support decision making regarding the set of certificates that best suits the needs of a company. Obviously, this set varies according to the company's competitiveness, and also depends on the number of other companies located in the same region and in the same activity sector. In this work we will present a general overview of the certification process in Portugal, describe the main aspects of the case study used (general features, data selection, and data preparation tasks), the clustering techniques and models used, and, finally, a review of the entire set of results, its interpretation and its impact in the certification field.

Keywords Data Mining, Process Certification of Companies, Data Clustering, Discovery of Patterns of Certification Processes


Geographic clustering through aggregation control Daher Ayale and Dhorne Thierry Lab-STICC CNRS UMR 6285 / Université de Bretagne Sud [email protected],[email protected]

Abstract. Actors in spatial decision making are often led to define a strategic zoning that satisfies constraints of both structural homogeneity and spatial cohesion. In most cases, they use traditional clustering algorithms that do not take geographic information into account, and possibly correct the excessive fragmentation with ad hoc heuristics. Usually, in clustering algorithms, the number of clusters C is initially fixed. In this paper, it is considered that, equivalently, the number of geographically connected components (called regions) R is also initially fixed, with, of course, the constraint R ≥ C. Rather than seeking an absolute solution to this optimization problem, which is computationally difficult, the proposed method uses an algorithm to control the intra-class geographic aggregation in order to approach (and possibly achieve) the constraint set on the number of regions. We present the proposed method in detail, and then show how the level of aggregation can be controlled by a parameter of the algorithm. Finally we seek the parameter value that leads to the most appropriate solution, and we study the dynamic evolution of clusters and regions as the control parameter varies. The results are illustrated on a real example.

References OLIVER, M.A. and WEBSTER, R. (1989): A Geostatistical Basis for Spatial Weighting in Multivariate Classification. Mathematical Geology, 21, 275–289.

Keywords Geographic clustering, Connected components, Geographic Information.


Some thoughts about the “number of clusters”-problem Christian Hennig Department of Statistical Science, UCL, London WC1E 6BT, United Kingdom [email protected]

Abstract. The problem of finding the number of clusters in a dataset is notoriously difficult. This is at least partly due to a widely shared misconception that for any given dataset (or at least for many of them) there is a unique “true” clustering which “good” methodology should “estimate”. Such a true clustering, however, is rarely well defined. The view taken here is different. What a cluster is depends crucially on the researcher’s concept of “observations belonging together”. This depends on the given application, and it can easily be demonstrated that there are various legitimate versions of it, which may lead to different numbers of clusters in different datasets. Existing criteria for this task can be understood as translations of various different cluster concepts, and can be used if they are found to agree with the concept required in the given application. In this presentation I will discuss existing ideas about the “true number of clusters” and the data analytic meaning of existing methods to find the number of clusters such as the BIC (Fraley and Raftery 2002), the average silhouette width (Kaufman and Rousseeuw 1990), the Calinski and Harabasz (1974) index and the prediction strength (Tibshirani and Walther 2005).
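As a concrete instance of one criterion named in this abstract, the sketch below computes the Calinski and Harabasz index in its usual form, the between-cluster sum of squares divided by (k − 1) over the within-cluster sum of squares divided by (n − k). The function name is an assumption, and the code is an illustration rather than part of the presentation. It makes visible which cluster concept the index encodes: compact, well-separated groups in the least-squares sense.

```python
import numpy as np

def calinski_harabasz(x, labels):
    """Calinski-Harabasz index: [B / (k - 1)] / [W / (n - k)].

    Assumes at least two clusters, fewer clusters than objects,
    and a non-zero within-cluster sum of squares.
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    n = len(x)
    classes = np.unique(labels)
    k = len(classes)
    overall = x.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in classes:
        members = x[labels == c]
        centre = members.mean(axis=0)
        # Between-cluster part: cluster size times squared distance of centre to overall mean.
        between += len(members) * ((centre - overall) ** 2).sum()
        # Within-cluster part: squared distances of members to their own centre.
        within += ((members - centre) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))
```

Comparing the index across partitions with different numbers of clusters favours the partition with the largest value, which is appropriate only if this least-squares notion of "belonging together" matches the application.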

References CALINSKI, T. and HARABASZ, J. (1974): A dendrite method for cluster analysis. Communications in Statistics 3, 1–27. FRALEY, C. and RAFTERY, A. E. (2002): How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41, 578-588. KAUFMAN, L. and ROUSSEEUW, P. J. (1990): Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York. TIBSHIRANI, R. and WALTHER, G. (2005): Cluster Validation by Prediction Strength, Journal of Computational and Graphical Statistics, 14, 511-528.

Keywords NUMBER OF CLUSTERS, BIC, AVERAGE SILHOUETTE WIDTH, PREDICTION STRENGTH, CALINSKI AND HARABASZ INDEX


Merging States in Hidden Markov Models Hajo Holzmann1 and Florian Schwaiger2

1 Philipps-Universität Marburg [email protected]
2 Philipps-Universität Marburg [email protected]

Abstract. We analyse clustering problems in the case of dependent data. Specifically, we consider the observable part of a finite state hidden Markov model (HMM), where the stationary distribution is a finite mixture of parametric distributions (e.g. of multivariate normal distributions) and the hidden state process has a Markov chain structure. Generally, for clustering data the estimates of the states can be used as cluster assignments. In contrast to independent finite mixtures, the dependence structure of the model plays an important role in estimating the non-observable states of the HMM (see e.g. the Viterbi algorithm). Baudry et al. (2010) model i.i.d. samples with finite mixtures and merge those components into clusters whose merged component distribution appears more like a cluster than the component distributions considered individually. Similarly, it is not necessarily the case that each state of the Markov chain corresponds to a cluster of its own. Thus, we analyse the merging of states into clusters for HMMs. In contrast to independent finite mixtures, where merging states does not affect the probabilistic structure of the model, merging states of an HMM changes this structure: after merging, the dependence structure is altered because the transition probability matrix changes and the state dependent distribution is now a mixture itself. If the dependence structure is not taken into account, states may be merged whose state dependent distributions suggest a merging although their transition probabilities are too distinct, so that a lot of dependence information would be lost. Therefore, we employ an entropy based criterion which strongly involves the dependence structure of the estimated model.

References BAUDRY, J.-P., RAFTERY, A., CELEUX, G., LO, K. and GOTTARDO R. (2010): Combining Mixture Components for Clustering. Journal of Computational and Graphical Statistics, 19, 332–353.

Keywords HIDDEN MARKOV MODEL, CLUSTERS, MERGING


Fuzzy Composite Index for Customer Satisfaction Evaluation: an Application for Public Sector Services Bartłomiej Jefmański1 and Marcin Pełka1

1 Wroclaw University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected]

Abstract. Customer satisfaction is a complex and latent concept that cannot be measured or analysed directly and therefore has to be estimated using directly observable variables. In the social sciences we deal with a number of such phenomena, and a popular approach to their analysis is the construction of composite indices. These are synthetic measures useful in situations where the concept, given its complexity, cannot be expressed using a single indicator. The purpose of this study is to apply the methodology of composite index construction to develop an index of customer satisfaction in one of the Polish Town Offices. To construct the index, data from a survey periodically conducted by the office were used. Since each attribute of service quality is assessed on a five-point ordinal scale and the items of the scale constitute linguistic values, fuzzy sets were applied in the study. This structure of the index made it possible to take into account the ambiguity and subjectivity in the respondents' opinions. The index values for the years 2008-2010 were estimated in the R program.

References KENETT, R.S., SALINI, S. (2012), Modern Analysis of Customer Surveys: with Applications using R. Chichester, John Wiley & Sons, Ltd. SMITHSON, M., VERKUILEN, J. (2006), Fuzzy set theory. Applications in the Social Sciences. Thousand Oaks, Sage Publications, Inc. ZIMMERMANN, H.J. (2001), Fuzzy Sets Theory and its Applications. Norwell, Kluwer Academic Publishers.

Keywords CUSTOMER SATISFACTION INDEX, FUZZY SETS, PUBLIC SECTOR SERVICES


On Limiting the Usage Frequency of Donor Objects in the Imputation of Missing Data by Hot Deck Methods Dieter William Joenssen1 and Udo Bankhofer2

1 Technische Universität Ilmenau [email protected]
2 Technische Universität Ilmenau [email protected]

Abstract. Hot deck methods are special imputation methods based on imputation classes. The object that supplies the available data for imputation is called the donor object. To ensure that the missing data are replaced by the values of a similar donor object, the duplication process takes place within previously formed imputation classes. This duplication property of hot deck imputation gives rise to the problem that, in the extreme case, a single donor object supplies all the values for imputation. For this reason, some hot deck methods limit how often an object may be used as a donor. This inevitably raises the question of the conditions under which such a limit is sensible at all. In this work, a simulation study is therefore conducted to answer this question. It shows that there are clear differences between hot deck imputations in which the donor usage frequency is varied. In addition, factors can be identified that speak for or against limiting the use of donor objects.
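The donor limit discussed in this abstract can be illustrated with a minimal random hot deck in Python. This is a sketch under assumptions, not the procedure evaluated in the simulation study: the function name, the uniform random choice of donors and the handling of classes that run out of donors are all hypothetical.

```python
import random

def hot_deck_impute(values, classes, donor_limit=None, seed=0):
    """Random hot deck within imputation classes, optionally capping donor reuse.

    values:  list in which None marks a missing entry
    classes: imputation class label of each object
    Returns the imputed list and a dict {donor index: times used}.
    """
    rng = random.Random(seed)
    imputed = list(values)
    usage = {}
    for i, v in enumerate(values):
        if v is not None:
            continue
        # Eligible donors: observed objects of the same class still below the limit.
        donors = [j for j, w in enumerate(values)
                  if w is not None and classes[j] == classes[i]
                  and (donor_limit is None or usage.get(j, 0) < donor_limit)]
        if not donors:
            continue  # assumption: the value stays missing if no donor is left
        j = rng.choice(donors)
        imputed[i] = values[j]
        usage[j] = usage.get(j, 0) + 1
    return imputed, usage
```

With `donor_limit=None` a single object may end up supplying every imputed value in its class, which is the extreme case the abstract describes. A small limit spreads the imputations over more donors but can leave values unimputed when a class has few observed objects.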

References
Andridge, R.R., Little, R.J.A. (2010): A Review of Hot Deck Imputation for Survey Nonresponse. International Statistical Review, 78, 1, 40–64.
Bankhofer, U. (1995): Unvollständige Daten- und Distanzmatrizen in der Multivariaten Datenanalyse. Eul, Bergisch Gladbach.
Kalton, G., Kish, L. (1981): Two Efficient Random Imputation Procedures. Proceedings of the Survey Research Methods Section 1981, 146–151.
Sande, I. (1983): Hot-Deck Imputation Procedures. In: W. Madow, H. Nisselson, I. Olkin (Eds.): Incomplete Data in Sample Surveys, 3, Theory and Bibliographies. Academic Press, New York, 339–349.

Keywords hot deck methods, missing data, imputation, simulation study


Predictive validity of tracking decisions: Application of a new validation criterion. Florian Klapproth1, Sabine Krolak-Schwerdt2, and Thomas Hörstermann3∗

1,2,3 University of Luxembourg, Route de Diekirch, L-7220 Walferdange [email protected] [email protected] [email protected]

Abstract. Although tracking decisions are primarily based on students’ achievements, the distributions of academic competencies in secondary school strongly overlap between school tracks. However, the correctness of tracking decisions is usually judged by whether or not a student has stayed in the track he or she was initially assigned to. To overcome the neglect of misclassified students, we proposed an alternative validation criterion for tracking decisions. In the present study, we applied this criterion to a sample of n = 2,300 Luxembourgish 9th graders to examine the degree of misclassification due to tracking decisions. Academic achievement test scores were obtained for all students at the beginning of 9th grade. The distributions of test scores, when separated for the academic track and the vocational track, overlapped to a large degree. Based on the intersection of both distributions, we determined two competence levels. We assigned students to one of these levels according to their individual test scores. Students assigned to the lower level showed scores that were more likely to occur within the vocational than within the academic track. The reverse was true for students assigned to the higher competence level. However, it turned out that about 20% of the students attended a track that did not match their competence level. Whereas the agreement between tracking decisions and actual tracks in 9th grade was fairly high (κ = .93), the agreement between tracking decisions and competence levels was only moderate (κ = .56).
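The agreement values in this abstract are reported as κ. Assuming that κ denotes Cohen's kappa (the abstract does not say so explicitly), the following Python sketch shows how such a chance-corrected agreement between two categorical assignments is computed. The function name and the toy labels are illustrative and are not the study's data.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: (p_observed - p_chance) / (1 - p_chance).

    a, b: equally long sequences of category labels for the same objects.
    Undefined (division by zero) if chance agreement equals 1.
    """
    n = len(a)
    # Share of objects on which both assignments agree.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from the two marginal distributions.
    count_a, count_b = Counter(a), Counter(b)
    p_chance = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)

# Toy usage: track decision versus competence level for six students.
decision = ["academic", "academic", "vocational", "vocational", "academic", "vocational"]
level = ["academic", "vocational", "vocational", "vocational", "academic", "academic"]
kappa = cohens_kappa(decision, level)
```

Because kappa subtracts the agreement expected by chance, a raw agreement rate of around 80% can correspond to only a moderate kappa.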

Keywords TRACKING DECISIONS, VALIDATION CRITERION, MISCLASSIFICATIONS

∗ PhD student


DDα-classification of asymmetric and fat-tailed data
Tatjana Lange1, Karl Mosler2, and Pavlo Mozharovskyi2,3

1 Hochschule Merseburg, Geusaer Straße, 06217, Merseburg, Germany [email protected]
2 Universität zu Köln, Albertus-Magnus-Platz, 50923, Köln, Germany {mosler,mozharovskyi}@statistik.uni-koeln.de
3 PhD student

Abstract. The DDα-procedure is a fast nonparametric method for supervised classification of d-dimensional objects into q ≥ 2 classes. It is based on q-dimensional depth plots (Liu et al., 1999) and the α-procedure (Vasil’ev and Lange, 1998), which is an efficient algorithm for discrimination in the depth space [0, 1]^q. Specifically, we use two depth functions that can be computed efficiently in high dimensions, the zonoid depth (Koshevoy and Mosler, 1997) and the random Tukey depth (Cuesta-Albertos and Nieto-Reyes, 2008), and compare their performance for different simulated data sets, in particular asymmetric elliptically and t-distributed data.
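To illustrate the depth space [0, 1]^q mentioned above, the following is a minimal sketch of the random Tukey depth: the halfspace depth of the projected data is computed along a finite number of random directions and the minimum is taken. The function names, the number of directions, and the seed are illustrative assumptions; the α-procedure that separates the classes in the depth space is not reproduced here.

```python
import random


def _dot(u, x):
    return sum(a * b for a, b in zip(u, x))


def random_tukey_depth(point, sample, n_directions=200, seed=0):
    """Approximate the Tukey (halfspace) depth of `point` within `sample`.

    For each random direction, all data are projected onto it and the
    smaller of the two one-sided fractions is recorded; the depth is the
    minimum over all directions drawn.
    """
    rng = random.Random(seed)
    n = len(sample)
    depth = 1.0
    for _ in range(n_directions):
        # A Gaussian vector gives a uniformly distributed direction;
        # normalising it is unnecessary because only the ordering matters.
        u = [rng.gauss(0.0, 1.0) for _ in point]
        p0 = _dot(u, point)
        projections = [_dot(u, x) for x in sample]
        below = sum(p <= p0 for p in projections) / n
        above = sum(p >= p0 for p in projections) / n
        depth = min(depth, below, above)
    return depth


def depth_coordinates(point, classes, **kwargs):
    """Map a point to [0, 1]^q: its depth with respect to each of q classes."""
    return [random_tukey_depth(point, sample, **kwargs) for sample in classes]


if __name__ == "__main__":
    class_1 = [(1, 1), (1, -1), (-1, 1), (-1, -1), (0, 0)]
    class_2 = [(5, 5), (6, 5), (5, 6), (6, 6), (5.5, 5.5)]
    print(depth_coordinates((0.2, 0.1), [class_1, class_2]))
```

Each object is thus represented by q depth values, and any classifier operating on these coordinates works in q dimensions regardless of the original dimension d.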

References CUESTA-ALBERTOS J.A. and NIETO-REYES A. (2008): The random Tukey depth. Computational Statistics and Data Analysis, 52, 4979–4988. KOSHEVOY G. and MOSLER K. (1997): Zonoid trimming for multivariate distributions, Annals of Statistics, 25, 1998-2017. LIU, R., PARELIUS, J. and SINGH, K. (1999): Multivariate analysis of the data-depth : Descriptive statistics and inference. Annals of Statistics 27, 783-858. VASIL’EV V.I. and LANGE T. (1998): The duality principle in learning for pattern recognition (in Russian). Kibernetika i Vytschislit’elnaya Technika, 121, 7-16.

Keywords ALPHA-PROCEDURE, ZONOID DEPTH, DD-PLOT, LOCATION DEPTH, PATTERN RECOGNITION, SUPERVISED LEARNING


The Alpha-Procedure - a nonparametric invariant method for automatic classification of d-dimensional objects
Tatjana Lange1 and Pavlo Mozharovskyi2

1 Hochschule Merseburg, Geusaer Straße, 06217, Merseburg, Germany [email protected]
2 Universität zu Köln, Albertus-Magnus-Platz, 50923, Köln, Germany [email protected]

Abstract. The presentation describes the α-procedure, which is based on a geometric representation of the separation of two classes by a hyperplane within a d-dimensional rectifying feature space. The required dimension of the space, i.e. the number of features necessary for classification, is obtained step by step using a 2-dimensional repère (frame of a vector space). The feature set is extended depending on the values of the functions describing the discriminating power of both the feature and the repère. The vectors (i.e. objects) within the 2-dimensional repère are transformed in the direction of increasing discriminating power while the invariant is preserved. Here, the invariant is the object’s affiliation with a class. The result of the repère’s transformation forms a fictitious feature. A new repère is then built from this fictitious feature and, as its second dimension, the next real feature with the best discriminating power. The enrichment of the feature set and the transformation of the repères stop once the classes are separated. The advantage of the α-procedure is the robustness and clarity of a process that separates the classes step by step using 2-dimensional repères. Finally, the results of investigations comparing advanced classification methods, such as SVM and others, will be discussed.

References VASIL’EV, V.I. (1991): The reduction principle in pattern recognition learning (PRL) problem. Pattern Recognition and Image Analysis, 1, 1. VASIL’EV V.I. and LANGE T. (1998): The duality principle in learning for pattern recognition (in Russian). Kibernetika i Vytschislit’elnaya Technika, 121, 7-16.

Keywords ALPHA-PROCEDURE, PATTERN RECOGNITION, SUPERVISED LEARNING, REPÈRE, INVARIANT


A universal method for model selection in parametric regression models based on statistical tests
Eckhard Liebscher
University of Applied Sciences Merseburg [email protected]

Abstract. Let $Y_{n1}, \dots, Y_{nn}$ be a sample of observations of a response variable. We consider the following master regression model with fixed design:

$$Y_{ni} = \sum_{j=1}^{k} \beta_j x_{ij}^{(n)} + \varepsilon_i \quad \text{for } i = 1, \dots, n,$$

where $\beta_1, \dots, \beta_k$ are the parameters and $(x_{ij}^{(n)})_{i=1,\dots,n;\ j=1,\dots,k}$ is the deterministic design matrix containing the data of the regressor variables. Suppose that $\varepsilon_1, \varepsilon_2, \dots$ is a sequence of i.i.d. real random variables with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$. Further, a family $\mathcal{F}$ of submodels is determined. To each submodel $M \in \mathcal{F}$ we assign a number $d$ assessing its complexity. The aim is to search for a submodel which fits the data reasonably well and which is as simple as possible. For the decision concerning the submodel with index $\nu$, we employ a modified $F$-statistic $M(\nu)$. The selection method goes as follows: for given numbers $\psi_\nu$, we search for a minimum of $d(\nu)$ subject to $M(\nu) \le \psi_\nu$. If there is more than one admissible model with the same minimum complexity, then we take the model with maximum p-value. In the talk we present asymptotic bounds for the misselection error which imply consistency of the rule under weak assumptions. One particular feature of the rule is that subjective grading of the model complexity can be incorporated. This aspect is of special interest from the point of view of model building. Typically, model builders have some preference rules for special types of functions in mind when selecting the model. These ideas can be used in the definition of $d$. Results of a simulation study show that, by using the proposed selection rule, the misselection error can be controlled uniformly, in contrast to well-known approaches such as the Akaike, Bayesian, and Hannan-Quinn criteria. In the last part of the talk, we discuss some computational issues. The use of the branch-and-bound method improves the performance of the search.
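The selection rule stated in this abstract can be sketched as follows, assuming that the complexity d(ν), the test statistic M(ν), the threshold ψ_ν, and the p-value of every candidate submodel have already been computed. The `Candidate` class, its field names, and the toy numbers are hypothetical; computing the modified F-statistic itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    complexity: float  # d(nu)
    statistic: float   # M(nu), the modified F-statistic
    threshold: float   # psi_nu
    p_value: float


def select_submodel(candidates):
    """Pick the admissible submodel of minimum complexity.

    A submodel is admissible if M(nu) <= psi_nu. Ties in complexity are
    broken by the largest p-value. Returns None if nothing is admissible.
    """
    admissible = [c for c in candidates if c.statistic <= c.threshold]
    if not admissible:
        return None
    min_d = min(c.complexity for c in admissible)
    simplest = [c for c in admissible if c.complexity == min_d]
    return max(simplest, key=lambda c: c.p_value)


if __name__ == "__main__":
    # Hypothetical values for four submodels.
    models = [
        Candidate("x1", 1, 5.2, 3.0, 0.01),          # rejected: 5.2 > 3.0
        Candidate("x1+x2", 2, 1.1, 3.0, 0.35),
        Candidate("x1+x3", 2, 0.7, 3.0, 0.50),
        Candidate("x1+x2+x3", 3, 0.2, 3.0, 0.80),
    ]
    print(select_submodel(models).name)
```

Because complexity is supplied per candidate, a subjective grading of model types can be encoded simply by choosing the `complexity` values.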

References LIEBSCHER, E. (2012): A universal selection method in linear regression models. Open Journal of Statistics, to appear. HOFMANN, M., GATU, C. and KONTOGHIORGHES, E. J. (2007): Efficient algorithms for computing the best-subset regression models for large scale problems. Computational Statistics & Data Analysis, 52, 16-29.


Support Vector Machines on Large Data Sets: Simple Parallel Approaches
Oliver Meyer, Bernd Bischl, Claus Weihs
Lehrstuhl fuer Computergestuetzte Statistik, Technische Universitaet Dortmund Vogelpothsweg 87, 44227 Dortmund {meyer, bischl, weihs}@statistik.uni-dortmund.de

Abstract. Support Vector Machines (SVMs) are well-known for their excellent performance in the field of statistical classification. Still, the high computational cost due to the cubic runtime complexity is problematic for larger data sets. To mitigate this, Graf et al. (2005) proposed the Cascade SVM. It is a simple, stepwise procedure in which the SVM is iteratively trained on subsets of the original data set, and the support vectors of the resulting models are combined to create new training sets. The general idea is to bound the size of all considered training sets and thereby obtain a significant speedup. Another relevant advantage is that this approach can easily be parallelized, because a number of independent models have to be fitted during each stage of the cascade. Initial experiments show that even moderate parallelization can reduce the computation time considerably, with only minor loss in accuracy. We compare the Cascade SVM to the standard SVM and a simple parallel bagging method w.r.t. both classification accuracy and training time. Furthermore, some approaches to improve the performance of the Cascade SVM, e.g. specifically adapted hyperparameter tuning, will be discussed.
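The cascade structure described above can be sketched with a deliberately simple stand-in for the SVM solver. The sketch assumes one-dimensional, linearly separable data with all points of class -1 lying below all points of class +1; for such data a hard-margin SVM has exactly two support vectors, the largest -1 point and the smallest +1 point. Both function names are hypothetical, only a single pass through the cascade is shown, and a real implementation would call an actual SVM library at each node.

```python
def train_toy_svm(points):
    """Toy stand-in for an SVM solver on separable 1-D data.

    `points` is a list of (x, label) pairs with label in {-1, +1}.
    Returns the support vectors. If only one class is present, no
    boundary can be fitted and all points are passed on unchanged.
    """
    neg = [p for p in points if p[1] == -1]
    pos = [p for p in points if p[1] == 1]
    if not neg or not pos:
        return list(points)
    return [max(neg), min(pos)]


def cascade_svm(points, n_chunks=4):
    """One pass of a cascade: train on chunks, then merge support vectors."""
    chunks = [points[i::n_chunks] for i in range(n_chunks)]
    # First layer: the chunk models are independent, so this step
    # could be distributed over several workers.
    layer = [train_toy_svm(chunk) for chunk in chunks]
    # Later layers: combine support vectors pairwise and retrain.
    while len(layer) > 1:
        merged = []
        for i in range(0, len(layer), 2):
            combined = [p for support in layer[i:i + 2] for p in support]
            merged.append(train_toy_svm(combined))
        layer = merged
    return layer[0]


if __name__ == "__main__":
    data = [(float(x), -1) for x in range(0, 10)]
    data += [(float(x), 1) for x in range(12, 22)]
    support = cascade_svm(data, n_chunks=4)
    boundary = (support[0][0] + support[1][0]) / 2
    print(support, boundary)
```

In this toy setting the cascade recovers the same support vectors as training on the full data, while no single training call sees more than a fraction of the points. With a real SVM this equality is not guaranteed after one pass, which is consistent with the minor accuracy loss mentioned in the abstract.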

References GRAF, H. P., COSATTO, E., BOTTOU, L., DURDANOVIC, I. and VAPNIK, V. (2005): Parallel Support Vector Machines: The Cascade SVM. Advances in Neural Information Processing Systems, 17, 521–528. CHAWLA, N. V., MOORE, T. E., HALL, L. O., BOWYER, K. W., KEGELMEYER, P. and SPRINGER, C. (2003): Distributed Learning with Bagging-Like Performance. Pattern Recognition Letters, 24, 455–471.

Keywords Classification, Support Vector Machines, Cascade SVM, Parallelization


Soft Bootstrapping and Its Comparison with Other Resampling Methods
Hans-Joachim Mucha1 and Hans-Georg Bartel2

1 WIAS, Germany [email protected]
2 Humboldt University Berlin, Germany

Abstract. The bootstrap approach is resampling with replacement from the original data. Concretely, the original bootstrap technique can be formulated by choosing the following weights of observations: m(i) = k, if the corresponding object i is drawn k times, and m(i) = 0, otherwise. Here it is supposed that originally m(i) = 1 for all observations. In clustering, so-called sub-sampling (i.e., resampling without replacement from the original data) is another approach (see Hartigan (1969)): m(i) = 1, if observation i is drawn randomly, and m(i) = 0, otherwise. Here we recommend another bootstrap method, called soft bootstrapping, that consists of randomly changing the original masses m(i) = 1 to some degree. This resampling scheme of assigning randomized masses m(i) > 0 (under the constraint that the total sum of masses equals the original number of observations) is especially appropriate for small sample sizes because no object is excluded from the soft bootstrap sample. We compare the applicability of different resampling techniques with respect to cluster analysis.
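The three weighting schemes can be sketched side by side. The bootstrap and sub-sampling masses follow the definitions given above. For soft bootstrapping the abstract does not specify how the masses are randomized, so the sketch assumes a uniform perturbation of each unit mass followed by rescaling; this choice and the `strength` parameter are illustrative assumptions only.

```python
import random


def bootstrap_masses(n, rng):
    """m(i) = number of times object i is drawn in n draws with replacement."""
    masses = [0] * n
    for _ in range(n):
        masses[rng.randrange(n)] += 1
    return masses


def subsampling_masses(n, size, rng):
    """m(i) = 1 if object i is among `size` draws without replacement, else 0."""
    chosen = set(rng.sample(range(n), size))
    return [1 if i in chosen else 0 for i in range(n)]


def soft_bootstrap_masses(n, rng, strength=0.5):
    """Strictly positive random masses that sum to n.

    Assumption: each unit mass is perturbed uniformly within
    [1 - strength, 1 + strength] and the result is rescaled.
    """
    if not 0.0 <= strength < 1.0:
        raise ValueError("strength must lie in [0, 1)")
    raw = [1.0 + rng.uniform(-strength, strength) for _ in range(n)]
    scale = n / sum(raw)
    return [r * scale for r in raw]


if __name__ == "__main__":
    rng = random.Random(42)
    print(bootstrap_masses(8, rng))
    print(subsampling_masses(8, 5, rng))
    print([round(m, 2) for m in soft_bootstrap_masses(8, rng)])
```

Only the soft variant keeps every object in the sample, which is the property the abstract highlights for small sample sizes.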

Keywords bootstrap, sub-sampling, cluster analysis


Dual Scaling Classification and Its Application in Archaeometry
Hans-Joachim Mucha1, Hans-Georg Bartel2, and Jens Dolata3

1 Weierstrass Institute for Applied Analysis and Stochastics (WIAS), 10117 Berlin, Germany, [email protected]
2 Department of Chemistry at Humboldt University, Berlin, Brook-Taylor-Straße 2, 12489 Berlin, Germany, [email protected]
3 Head Office for Cultural Heritage Rhineland-Palatinate (GDKE), Große Langgasse 29, 55116 Mainz, Germany, [email protected]

Abstract. We consider binary classification based on the dual scaling technique. In the case of more than two classes many binary classifiers can be considered. We call this pairwise classification because we train a classifier for each pair of classes. The proposed approach goes back to Mucha (2002) and it is based on the pioneering book of Nishisato (1980). It is applicable to mixed data. First, numerical variables have to be discretized into bins to become ordinal variables (data preprocessing). Second, the ordinal variables are converted into categorical ones. Then the data is ready for dual scaling of each individual variable based on the given two classes: each category is transformed into a score. Then a classifier can be derived from the scores simply in an additive manner over all variables. Examples and applications to archaeometry (provenance studies of Roman ceramics) are presented.

References NISHISATO, S. (1980): Analysis of Categorical Data: Dual Scaling and Its Applications. The University of Toronto Press, Toronto. MUCHA, H.-J. (2002): An Intelligent Clustering Technique Based on Dual Scaling. In: S. Nishisato, Y. Baba, H. Bozdogan, and K. Kanefuji (Eds.): Measurement and Multivariate Analysis. Springer, Tokyo, 37–46.


Introducing Analytical Methods and Predictive Models in Project Management Activities
Jaime Santos1 and Orlando Belo2

1 ISCTE/IUL, Portugal [email protected]
2 Algoritmi R&D Centre, University of Minho, Portugal [email protected]

Abstract. Independent of the nature of a project, the control of variables such as cost, quality, schedule, and scope is a main decision factor for the successful execution of a project. In the context of software engineering, the project planning and execution are highly influenced by the creative nature of the individual intended to create and the individuals involved with the project. Additionally, projects are surrounded by environmental complexities that directly (and indirectly) influence team productivity. Managing the risks related to the different project steps is therefore a key task of extreme importance for project managers (and sponsors), who should focus on controlling and monitoring such variables, as well as others concerned with the context around them. This work presents a set of analytical techniques that we used to estimate effort and to perform classification to predict the success of a project. In this study, we prepared a small cocktail of data mining techniques and methods to explore potential correlations and influences among some of the most relevant parameters related to experience, complexity, organization maturity, and project innovation, as well as some other execution constraints and sizing units. We developed a model that could be deployed in any project management process, assisting project managers in planning and monitoring the state of a project or program under their supervision.

Keywords Project Management; Data Mining; Business Intelligence; Effort Estimation; Project Success Classification.


Constrained Dual Scaling of Successive Categories for Detecting Response Styles Pieter C. Schoonees1,2 , Michel van de Velden1 , and Patrick J. F. Groenen1 1 2

Erasmus University Rotterdam, The Netherlands [email protected]

Abstract. A constrained dual scaling method for detecting response styles in so-called successive categories categorical data is proposed. Response styles arise in questionnaire research when respondents tend to use rating scales in a manner unrelated to the actual content of the survey question. Dual scaling for successive categories is a technique related to correspondence analysis (CA) for analyzing categorical data. However, there are important differences, one being that dual scaling for successive categories data also provides optimal scores for the rating scale. This property is used together with the interpretation of a response style as a nonlinear mapping of a group of respondents’ latent preferences to a rating scale. It is shown through simulation that the curvature properties of four well-known response styles make it possible to use dual scaling to detect them. Also, the relationship between dual scaling and CA in conjunction with nonnegative least squares is used to restrict the detected mappings to conform to quadratic monotone splines. This gives rise to simple diagnostic maps which can help researchers to determine both the type of response style and the extent to which it is manifested in the data.

References NISHISATO, S. (1980): Analysis of Categorical Data: Dual Scaling and its Applications. University of Toronto Press, Toronto. VAN ROSMALEN, J., VAN HERK, H. and GROENEN, P.J.F. (2010): Identifying Response Styles: A Latent-Class Bilinear Multinomial Logit Model. Journal of Marketing Research, 47, 157–172.

Keywords RESPONSE STYLE, DUAL SCALING, CORRESPONDENCE ANALYSIS, SPLINES


On Instance Selection in Multi Classifier Systems
Friedhelm Schwenker, Sascha Meudt
University of Ulm, Institute of Neural Information Processing, 89069 Ulm [email protected]

Abstract. In any data mining application the training set design is the most important part of the overall data mining process. Designing a training set means preprocessing the raw data, selecting the relevant features, selecting the representative instances (samples), and labeling the instances for the classification or regression application at hand. Labeling data is usually time consuming, expensive (e.g. in cases where more than one expert must be asked), and error-prone. Instance selection deals with searching for a subset S of the original training set T, such that a classifier trained on S shows similar or even better classification performance than a classifier trained on the full data set T (Olvera-López et al., 2010). We will present confidence-based instance selection criteria for k-nearest-neighbor classifiers and probabilistic support vector machines. In particular, we propose criteria for multi classifier systems and discuss them in the context of classifier diversity. The statistical evaluation of the proposed selection methods has been performed on affect recognition from speech and facial expressions. Classes are not defined very well in this type of application, leading to data sets with high label noise. Numerical evaluations on these data sets show that classifiers can benefit from instance selection not only in terms of computational costs, but even in terms of classification accuracy.

References OLVERA-LÓPEZ, J. A., CARRASCO-OCHOA, J. A., MARTINEZ-TRINIDAD, J. F. and KITTLER, J. (2010): A review of instance selection methods. Artif. Intell. Rev. 34(2), 133-143.

Keywords INSTANCE SELECTION, ACTIVE LEARNING, MULTI-CLASSIFIER SYSTEMS, SUPERVISED LEARNING


Effects of Labeling Mechanisms on Classification Error in Linear Discriminant Analysis
Keiji Takai1 and Kenichi Hayashi2

1 Kansai University, 3-3-35 Yamatecho, Suita, Osaka 564-8680, JAPAN [email protected]
2 Osaka University, 2-2 Yamadaoka, Suita, Osaka 565-0871, JAPAN [email protected]

Abstract. In the machine learning literature as well as the statistical methods literature, it is widely believed that unlabeled data, used in addition to labeled data, help to reduce the classification error or to estimate the parameters of the classification boundary more precisely. In our talk, we examine whether this belief is true by focusing on the classification error in linear discriminant analysis. For the examination, we introduce the missing-data framework, because unlabeled data can be regarded as missing data. Using this framework, we classify the labeling mechanisms into two types, the feature-independent labeling mechanism and the feature-dependent labeling mechanism. The former corresponds to MCAR and the latter to MAR in the missing-data analysis context. The former mechanism has been implicitly assumed in many machine learning studies that deal with partially labeled data, while the latter has rarely been assumed, although many practical examples in which the latter mechanism is more suitable can be found, for instance, in the medical sciences. Under each of the labeling mechanisms, there are two ways to use the data for estimation of the parameters, that is, estimation based on the labeled data alone and estimation based on the mixed data of labeled and unlabeled data. For each of the labeling mechanisms, we theoretically derive the asymptotic classification error efficiency based on the asymptotic theory for missing data and numerically show which is the better way to use the data.
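The distinction between the two labeling mechanisms can be made concrete with a small simulation. In this minimal sketch, a feature-independent (MCAR) mechanism labels each case with a constant probability, while a feature-dependent (MAR) mechanism lets the labeling probability depend on the observed feature through a logistic function. The logistic form, the parameter values, and all function names are illustrative assumptions, not the authors' setup.

```python
import math
import random


def label_feature_independent(xs, rate, rng):
    """MCAR: every case is labeled with the same probability `rate`."""
    return [rng.random() < rate for _ in xs]


def label_feature_dependent(xs, intercept, slope, rng):
    """MAR: labeling probability depends only on the observed feature x."""
    flags = []
    for x in xs:
        prob = 1.0 / (1.0 + math.exp(-(intercept + slope * x)))
        flags.append(rng.random() < prob)
    return flags


def labeled_mean(xs, flags):
    """Mean of the feature among labeled cases."""
    labeled = [x for x, flag in zip(xs, flags) if flag]
    return sum(labeled) / len(labeled)


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    mcar = label_feature_independent(xs, 0.5, rng)
    mar = label_feature_dependent(xs, 0.0, 2.0, rng)
    print("all cases:     ", sum(xs) / len(xs))
    print("labeled (MCAR):", labeled_mean(xs, mcar))
    print("labeled (MAR): ", labeled_mean(xs, mar))
```

Under the feature-dependent mechanism the labeled cases are no longer a random subsample of the features, which is why an estimator based on labeled data alone can behave differently under the two mechanisms.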

References EFRON, B. (1975): The efficiency of logistic regression compared to normal discriminant analysis, Journal of the American Statistical Association, 70, 892–898.

Keywords UNLABELED DATA, SEMI-SUPERVISED LEARNING, MISSING DATA, LDA

Three-way Subspace Hierarchical Clustering based on Entropy Regularization Method
Kensuke Tanioka1 and Hiroshi Yadohisa2

1 Graduate School of Culture and Information Science, Doshisha University, 1-3 Tatara Miyakodani, Kyotanabe City, Kyoto 610-0394, Japan [email protected]
2 Department of Culture and Information Science, Doshisha University, 1-3 Tatara Miyakodani, Kyotanabe City, Kyoto 610-0394, Japan [email protected]

Abstract. Three-way three-mode data are defined as X ∈ R^{|I|×|J|×|K|}, where I, J, and K are sets of objects, variables, and occasions, respectively. Vichi et al. (2007) proposed a three-way clustering that considers the effects of variables and occasions. The subspace is described as a linear combination of all original variables and occasions. However, Lance et al. (2004) argue that such a subspace is affected by noise variables. In addition, the subspace is subject to complicated assumptions, and it is therefore hard to interpret the results. Subsequently, Tanioka and Yadohisa (2012) proposed a subspace hierarchical clustering whose subspaces for each cluster are described as differential subsets of variables and occasions. This method eliminates noise variables and makes it easier to interpret the results. However, the only means to evaluate each variable and occasion for each cluster is by calculating the subspace. In this study, we propose that distribution concepts, such as the variation of each variable and each occasion, be used to reflect structures for each variable and occasion of each cluster and not as a means to evaluate them.

References LANCE, P., EHTEASHAM, H. and HUAN, L. (2004): Subspace Clustering for High Dimensional Data:A Review. TANIOKA, K. and YADOHISA, H. (2012): Three-mode Subspace clustering for considering effects under noise variables and occasions, VICHI, M., ROCCI, R. and KIERS, H.A.L. (2007): Simultaneous component and clustering models for three way-data: Within and between approaches.

Keywords VARIABLE SELECTION, OCCASION SELECTION


Gamma-Hadron-Separation in the MAGIC-Experiment
Tobias Voigt1,3, Roland Fried1, Michael Backes2, and Wolfgang Rhode2

1 TU Dortmund, Faculty of Statistics, Vogelpothsweg 87, 44227 Dortmund [email protected], [email protected]
2 TU Dortmund, Physics Faculty, Otto-Hahn-Straße 4, 44227 Dortmund [email protected], [email protected]
3 PhD Student

Abstract. The MAGIC telescopes on the Canary Island of La Palma are the largest Cherenkov telescopes in the world, operating in stereoscopic mode since 2009 (Aleksić et al., 2012). Their purpose is to detect very high energy gamma rays emitted by various astrophysical sources. Due to characteristics of the detection process, other particles are unavoidably observed alongside the gamma ray signal. These background particles are summarized as hadrons. Before the gamma rays can be further analyzed, they have to be separated from the hadronic background. In the MAGIC experiment this classification is usually done using a random forest. In this talk we introduce the data provided by the MAGIC telescopes, which have some distinctive features. These features include high class imbalance and unknown and unequal misclassification costs, as well as the absence of reliably labeled training data. We introduce a method to deal with some of these features. The method is based on a thresholding approach (Sheng and Liang, 2006) and aims at minimizing the mean square error of an estimator that is derived from the classification. The method is designed to fit the special requirements of the MAGIC data.

References ALEKSIĆ, J. et al. (2012): Performance of the MAGIC stereo system obtained with Crab Nebula data. Astroparticle Physics, 35, 435–448. SHENG, V. and LIANG, C. (2006): Thresholding for making classifiers cost-sensitive. Proceedings of the 21st National Conference on Artificial Intelligence, AAAI Press, 1, 476–481.

Keywords MAGIC, THRESHOLDING, CLASS IMBALANCE, RANDOM FOREST, ASTROPHYSICS

Cluster Analysis of Symbolic Data with Application of R Software
Justyna Wilk1 and Marcin Pełka1

1 Wroclaw University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected]

Abstract. Cluster analysis is ranked among the most important groups of exploratory data analysis methods. A typical cluster analysis study comprises four major steps: selection of objects and variables, distance measurement and clustering of objects, determination of the number of clusters and validation of the clustering, and description and profiling of clusters. Symbolic data analysis contributes significantly to the development of taxonomy methodology, but there are two main problems in symbolic data clustering: the majority of methods used in the procedure are implemented exclusively for classical data, and the complexity of symbolic data prevents direct application of these methods. The aim of this article is to present an approach to symbolic data clustering using R software. In the first part of the paper, the symbolic data concept and the cluster analysis procedure are presented. In the second part, alternative strategies of symbolic data clustering and methods based on the dissimilarity matrix and the symbolic data table are discussed. A review of the application of these two approaches in empirical research is performed. Afterwards, packages and functions of R software that may be useful in symbolic data clustering, depending on the selected approach (with a special focus on the symbolicDA package), are presented.

References BOCK, H.-H., DIDAY, E. (Eds.) (2000): Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data. Springer Verlag, Berlin-Heidelberg. DIDAY, E., NOIRHOMME-FRAITURE, M. (2008): Symbolic data analysis and the Sodas software. John Wiley & Sons, Chichester. GATNAR, E., WALESIAK, M. (Eds.) (2011): Analiza danych jakościowych i symbolicznych z wykorzystaniem programu R [Qualitative and symbolic data analysis with application of R software]. C.H. Beck, Warszawa.

Keywords SYMBOLIC DATA ANALYSIS, CLUSTERING, R SOFTWARE


Part VI

Data Analysis and Classification in Marketing

Spatial Modeling of Dependencies Between Population, Education, and Economic Growth
Daniel Baier1, Wolfgang Polasek2, and Alexandra Rese1

1 Chair of Marketing and Innovation Management, Brandenburg University of Technology Cottbus, Germany, daniel.baier|[email protected]
2 Institute for Advanced Studies, Vienna, Austria, [email protected]

Abstract. Since the seminal work by Anselin (1988), spatial effects have become an important tool for predictions in econometrics. Spatial effects appear in a wide variety of analysis tasks in economics and business administration, e.g. in comparative regional analysis (Zelias 1987), in the modeling of regional sales data (see, e.g., Baier, Polasek 2010), or when spatial effects have to be compared with other effects in success factor analysis (see, e.g., Rese, Baier 2011). The paper discusses different approaches to modeling spatial effects and shows the viability of this approach in a practical application where the dependencies of population, education, and economic growth are under study. The paper uses actual regional data from Germany to analyze these effects. The advantages and disadvantages of this spatial modeling approach are discussed.

References Anselin, L. (1988): Spatial Econometrics. Baltagi, B.H. (Ed.), A Companion to Theoretical Econometrics. Blackwell Publishing Ltd AD, 310–330. Baier, D., Polasek, W. (2010): Marketing and Regional Sales: Evaluation of Expenditure Strategies by Spatial Sales Response Functions. Studies in Classification, Data Analysis, and Knowledge Organization, 40, 673–682. Rese, A., Baier, D. (2011): Success Factors for Innovation Management in Networks of Small and Medium Enterprises. R&D Management, 41(2), 138–155. Zelias, A.J. (1987): A Regression Approach to Regional Forecasting. Papers of the Regional Science Association, 61, 39–49.

Keywords SPATIAL MODELS, ECONOMETRICS, MARKETING, INNOVATION MANAGEMENT


Discrete Choice Methods and Their Applications in Preference Analysis of Vodka Consumers
Andrzej Bąk1, Marcin Pełka1, and Aneta Rybicka1

1 Wroclaw University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected], [email protected]

Abstract. Preference analysis is one of the key elements of marketing research and of economics in general. Preferences help to explain how and why consumers make their choices. There are two types of preferences: stated and revealed. Discrete choice methods allow stated preferences to be analyzed. They model choices made by people among a finite set of alternatives. Discrete choice models take many forms, including binary logit, binary probit, multinomial logit, conditional logit, multinomial probit, nested logit, generalized extreme value models, mixed logit, and exploded logit. The main aim of the paper is to apply the multinomial logit model to analyze the preferences of vodka consumers in a discrete choice experiment, using R software. The article presents the basic terms of the multinomial logit model, the discrete choice experiment, and model estimation. The paper also presents results of model estimation that make it possible to determine the worst and best vodka brands and which attributes are most important for consumers.
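In a logit choice model, the probability of choosing alternative j from a finite set is exp(V_j) divided by the sum of exp(V_k) over all alternatives, where V_j is the systematic utility of alternative j. The following minimal sketch computes these probabilities for utilities that are linear in the attributes. The attribute coding, the coefficient values, and the function names are hypothetical; estimating the coefficients from choice data (done in R in the paper) is not shown.

```python
import math


def linear_utility(attributes, coefficients):
    """Systematic utility V_j as a linear combination of attribute levels."""
    return sum(a * b for a, b in zip(attributes, coefficients))


def choice_probabilities(utilities):
    """Logit choice probabilities P_j = exp(V_j) / sum_k exp(V_k)."""
    top = max(utilities)  # subtracting the maximum avoids overflow
    weights = [math.exp(u - top) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical coding: (price in arbitrary units, premium brand flag).
    coefficients = [-0.8, 1.2]
    alternatives = {
        "brand A": (2.0, 1),
        "brand B": (1.5, 0),
        "brand C": (1.0, 0),
    }
    utilities = [linear_utility(x, coefficients) for x in alternatives.values()]
    for name, prob in zip(alternatives, choice_probabilities(utilities)):
        print(name, round(prob, 3))
```

Only differences in utility matter: adding the same constant to every V_j leaves the probabilities unchanged, which is why one alternative is usually taken as the reference in estimation.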

References AGRESTI, A. (2002): Categorical Data Analysis. Second Edition. Wiley, New York. CAMERON, A.C., TRIVEDI, P.K. (2005): Microeconometrics. Methods and Applications. Cambridge University Press, New York. TRAIN, K. (2003): Discrete Choice Methods with Simulation. Cambridge University Press, New York. ZWERINA, K. (1997): Discrete Choice Experiments in Marketing. Heidelberg-New York, Physica-Verlag.

Keywords DISCRETE CHOICE METHODS, PREFERENCE ANALYSIS, MULTINOMIAL LOGIT MODEL


The Dangers of using Intention as a Surrogate for Retention in Brand Positioning Decision Support Systems
Michel Ballings1 and Dirk Van den Poel2

1 Ghent University, Department of Marketing, Tweekerkenstraat 2, 9000 Ghent, Belgium [email protected], PhD student
2 Ghent University, Department of Marketing, Tweekerkenstraat 2, 9000 Ghent, Belgium [email protected]

Abstract. The purpose of this paper is to explore the dangers of using intention as a surrogate for retention in a decision support system (DSS) for brand positioning. An empirical study is conducted using structural equation modeling and data from both the internal transactional database and a survey. The results show that different product benefits are recommended for brand positioning when intention, as opposed to retention, is used as the criterion variable. The findings also indicate that the strength of the structural relationships is inflated when intention is used. This has implications in that managers will not only underinvest in marketing campaigns but will also invest in advertisements that promote the wrong product benefits. Although this study is limited to only one industry, the newspaper industry, it provides guidance for brand managers in selecting the most appropriate product benefit for brand positioning and advises against the use of intention as opposed to retention in DSSs. Our contribution to the literature is that this is the first study that challenges and refutes the commonly held belief that intention is a valid surrogate for retention in a DSS for brand positioning. Moreover, a framework is provided that addresses opportunities for integrating predictive and descriptive computational systems through data.

Keywords PRODUCT BENEFITS, BRAND POSITIONING, DECISION SUPPORT SYSTEM, INTENTION, OBSERVED CUSTOMER RETENTION


Microeconometrics Multinomial Models and their Applications in Preferences Analysis using R
Andrzej Bąk and Tomasz Bartłomowicz
Wrocław University of Economics, Department of Econometrics and Computer Science, ul. Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected] [email protected]

Abstract. Measurement of consumer preferences is one of the most important elements of marketing research. It helps to explain the reasons for consumer choices among products and services. Microeconometric models are useful in the analysis of categorical data (microdata describing individuals) often collected in marketing research based on discrete choices. Among microeconometric models for unordered categories, the multinomial logit model, the conditional logit model, and the mixed logit model are used most frequently. The main aim of this paper is to present some types of discrete choice multinomial logit models and their applications in consumer preference analysis. The basis for distinguishing between types of multinomial models is mainly the character of the independent variables included in the model, and this distinction is not clearly interpreted in microeconometrics publications. The paper shows the fundamental differences between these types of multinomial logit models used in the area of consumer preference analysis. The models are estimated using the R program, R packages, and user functions written in the R programming language.

References AGRESTI A. (2002), Categorical Data Analysis. Second Edition, Wiley, New York. CAMERON A.C., TRIVEDI P.K. (2005), Microeconometrics. Methods and Applications. Cambridge University Press, New York. JACKMAN S. (2007), Models for Unordered Outcomes. Political Science 150C/350C. http://jackman.stanford.edu/classes/350C/07/unordered.pdf (12.03.2012). SO Y., KUHFELD W.F. (1995), Multinomial Logit Models. http://support.sas.com/techsup/technote/mr2010g.pdf (12.03.2012). WINKELMANN R., BOES S. (2006), Analysis of Microdata. Springer, Berlin.

Keywords MICROECONOMETRICS, PREFERENCES, R PROGRAM

Measuring Consumers’ Brand Associations in Online Market Research Pascal Kottemann∗ , Martin Meißner and Reinhold Decker Bielefeld University, Department of Business Administration and Economics, P.O. Box 10 01 31, 33501 Bielefeld, Germany {pkottemann,mmeissner,rdecker}@wiwi.uni-bielefeld.de Abstract. Understanding brand equity based on consumers’ brand associations is important for both marketing research and practice. Assuming that human knowledge is stored in a network structure, association network analysis is often applied for determining a brand’s image (see, e.g., Teichert and Schöntag 2010). In 2006, John et al. introduced Brand Concept Maps (BCM) as a tool for identifying and visualizing brand associations, the direct or indirect links of these associations to the brand, and the relationships between these associations. Up to now, applying BCM has required data collection in a laboratory setting where consumers express their associations on poster boards. However, generating representative and sufficiently large samples in such a setting is difficult and often costly. The aim of this paper is to investigate whether the BCM approach can be applied in online market research (implying computerized interviews). We empirically compare the outcomes of the original BCM approach using face-to-face interviews with results from an online adaptation of BCM. Furthermore, we discuss the extent to which BCM data are suitable for market segmentation based on brand perception.

References JOHN, D. R.; LOKEN, B.; KIM, K.; MONGA, A. B. (2006): Brand Concept Maps: A Methodology for Identifying Brand Association Networks. Journal of Marketing Research, 43(4), 549–563. TEICHERT, T. A.; SCHÖNTAG, K. (2010): Exploring Consumer Knowledge Structures Using Associative Network Analysis. Psychology and Marketing, 27(4), 369–398.

Keywords BRAND CONCEPT MAPS, BRAND ASSOCIATION NETWORKS, ONLINE MARKET RESEARCH

∗ Ph.D. student

Multinomial-SVM-Item-Recommender for Repeat-Buying Scenarios Christina Lichtenthaeler1 and Lars Schmidt-Thieme2
1 Institute for Advanced Study, Technische Universität München, Lichtenbergstrasse 2a, 85748 Garching, Germany [email protected]
2 Information Systems and Machine Learning Lab, University of Hildesheim, Marienburger Platz 22, 31141 Hildesheim, Germany [email protected]

Abstract. Most common recommender systems generate recommendations for assortments in which a product is usually bought only once, such as books or DVDs. However, there are plenty of online shops selling consumer goods, such as drugstore products, where the customer purchases the same product repeatedly. We call such scenarios repeat-buying scenarios (Böhm et al. (2001)). For our approach we applied results from information geometry (Amari and Nagaoka (2000)) and transformed customer data taken from a repeat-buying scenario into a multinomial space. Using the multinomial diffusion kernel from Lafferty and Lebanon (2005), we developed a multinomial SVM item recommender system, M-SVM-IR, to calculate personalized item recommendations for a repeat-buying scenario. We evaluated our SVM item recommender in a 10-fold cross-validation against the state-of-the-art recommender BPR-MF developed by Rendle et al. (2009). The evaluation was performed on a real-world dataset taken from the online drugstore of Rossmann. It shows that M-SVM-IR outperforms BPR-MF with statistical significance with respect to the AUC.
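To make the kernel step more concrete, the following Python sketch evaluates the commonly used closed-form approximation of the multinomial diffusion kernel, K_t(θ, θ') ∝ exp(-arccos²(Σ_i √(θ_i θ'_i)) / t), on purchase counts that have been mapped to the probability simplex. It is an illustration under stated assumptions, not the authors' implementation: the additive smoothing, the diffusion time t = 0.5, the omission of the normalizing constant and all function names are hypothetical choices.

```python
import math

def to_simplex(counts, alpha=1.0):
    """Map purchase counts to a point in the interior of the simplex.
    The additive smoothing alpha is an assumption, not taken from the paper."""
    smoothed = [c + alpha for c in counts]
    total = sum(smoothed)
    return [s / total for s in smoothed]

def multinomial_diffusion_kernel(theta_a, theta_b, t=0.5):
    """Approximate diffusion kernel on the multinomial simplex
    (normalizing constant omitted; t is a free diffusion-time parameter)."""
    # Bhattacharyya coefficient between the two multinomial parameters.
    bc = sum(math.sqrt(a * b) for a, b in zip(theta_a, theta_b))
    bc = min(1.0, max(0.0, bc))  # guard against rounding outside [0, 1]
    return math.exp(-math.acos(bc) ** 2 / t)

# Hypothetical purchase counts of two customers over three products.
customer_a = to_simplex([5, 0, 1])
customer_b = to_simplex([0, 4, 2])
similarity = multinomial_diffusion_kernel(customer_a, customer_b)
```

A Gram matrix built from such pairwise values could then be passed to any SVM implementation that accepts precomputed kernels.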

References AMARI, S., NAGAOKA, H. (2000): Methods of Information Geometry. In: Translations of Mathematical Monographs, Vol. 191, American Mathematical Society. BÖHM, W., GEYER-SCHULZ, A., HAHSLER, M., JAHN, M. (2001): Repeat-Buying Theory and Its Application for Recommender Services. In: Opitz, O. (Ed.): Studies in Classification, Data Analysis, and Knowledge Organization. LAFFERTY, J. and LEBANON, G. (2005): Diffusion Kernels on Statistical Manifolds. In: Journal of Machine Learning Research 6, 129-163. RENDLE, S., FREUDENTHALER, C., GANTNER, Z., SCHMIDT-THIEME, L.S. (2009): BPR: Bayesian Personalized Ranking from Implicit Feedback. In: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence.

Keywords SVM, Item-Recommender, Diffusion Kernel, Information Geometry


Approach to Predicting Changes in Market Segments Based on Customer Behavior Anneke Minke1 and Klaus Ambrosi1 Institut für Betriebswirtschaft und Wirtschaftsinformatik, Universität Hildesheim, Germany {minke,ambrosi}@bwl.uni-hildesheim.de Abstract. In modern marketing, knowing how different market segments develop is crucial. However, simply measuring the changes that have occurred is not sufficient when planning future marketing campaigns. Predictive models are needed to show trends and to forecast abrupt changes such as the elimination of segments, the splitting of a segment, or the like. For predicting changes, continuously collected data are needed. In internet marketplaces, data concerning customer behavior can easily be recorded; furthermore, these data are better suited than demographic data for showing changes in the relationship between customers and a company. Therefore, behavioral data are suitable for spotting trends in customer segments. For detecting changes in a market structure, fuzzy clustering is used, since gradual changes in cluster memberships can indicate future abrupt changes. In this talk, we introduce different measures for the analysis of gradual changes that take the currentness of the data into account and can be used to predict abrupt changes.
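The idea that gradual membership changes can be tracked over time may be sketched as follows. The abstract does not state which fuzzy clustering algorithm or change measures are used, so this Python example assumes the standard fuzzy c-means membership formula with fuzzifier m = 2 and fixed, hypothetical segment centers; it only shows how one customer's memberships drift between two periods.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership degrees of one point for given centers.
    The fuzzifier m = 2 is a common default and an assumption here."""
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0.0 for d in dists):
        # The point coincides with a center: full membership there.
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]

# Two hypothetical segment centers in a two-dimensional behavior space.
centers = [(0.0, 0.0), (4.0, 0.0)]
# The same customer observed in two consecutive periods.
u_old = fcm_memberships((1.0, 0.0), centers)  # close to segment 1
u_new = fcm_memberships((2.0, 0.0), centers)  # halfway between segments
# Per-segment membership change between the periods.
drift = [new - old for old, new in zip(u_old, u_new)]
```

A persistent drift of many customers in the same direction would be one possible early signal of a segment merging or dissolving.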

References BÖTTCHER, M. and HÖPPNER, F. and SPILIOPOULOU, M. (2008): On Exploiting the Power of Time in Data Mining. ACM SIGKDD Explorations Newsletter, 10/2, 3–11. MINKE, A. and AMBROSI, K. and HAHNE, F. (2009): Approach for Dynamic Problems in Clustering. In: I.N. Athanasiadis, P.A. Mitkas, A.E. Rizzoli, J.M. Gómez (Eds.): Proceedings of the 4th International Symposium on Information Technologies in Environmental Engineering (ITEE’09). Springer, Berlin, 373–386. SONG, H.S. and KIM, J.K. and KIM, S.H. (2001): Mining the Change of Customer Behavior in an Internet Shopping Mall. Expert Systems with Applications, 21, 157–168.

Keywords CHANGE PREDICTION, CLUSTERING, MARKET SEGMENTATION


Finite Mixture MNP vs. Finite Mixture IP Models: An Empirical Study Friederike Paetz1 and Winfried J. Steiner2
1 Department of Marketing, Clausthal University of Technology, 38678 Clausthal-Zellerfeld [email protected]
2 Department of Marketing, Clausthal University of Technology, 38678 Clausthal-Zellerfeld [email protected]

Abstract. In the context of conjoint choice models, we propose a new finite mixture model framework for estimating two types of probit models: Finite Mixture Multinomial Probit (FM-MNP) models and Finite Mixture Independent Probit (FM-IP) models. While the FM-MNP both accommodates heterogeneity in consumers’ preferences and allows dependencies between alternatives to be considered, the FM-IP assumes independence and suffers from a property similar to IIA (cf. Hausman and Wise (1978)). The models are estimated using an Expectation-Maximization algorithm. In an empirical study, we investigate how restrictive the independence assumption of the FM-IP model is, and whether FM-IP and FM-MNP models therefore lead to different implications concerning the number of segments and market share forecasts (following Haaijer et al. 1998). Not unexpectedly, the MNP performs much better on the aggregate market level. However, when heterogeneity is accounted for, the FM-IP outperforms the FM-MNP, with the FM-IP 2-segment solution turning out to be the best overall. Evidently, the additional benefit from considering dependencies between alternatives diminishes once heterogeneity of consumers is taken into account. We also find that market share predictions under the optimal 2-segment solution are rather close between the FM-MNP and FM-IP models, so that the higher complexity of the FM-MNP does not seem justified.

References HAAIJER, R., WEDEL, M., VRIENS, M. and WANSBEEK, T.J. (1998): Utility Covariances and Context Effects in Conjoint MNP Models. Marketing Science, 17 (3), 236–252 HAUSMAN, J. and WISE, D. (1978): A conditional probit model for qualitative choice: Discrete decisions recognizing interdependence and heterogeneous preferences. Econometrica, 46 (2), 403–429

Keywords FINITE MIXTURE MODELS, MULTINOMIAL PROBIT MODELS, INDEPENDENT PROBIT MODELS


Rasch Models for Analyzing Role Models in Inter-Organisational Innovation Processes Alexandra Rese1 , Hans-Georg Gemünden2 , and Daniel Baier1
1 Chair of Marketing and Innovation Management, Brandenburg University of Technology Cottbus, Germany [email protected], [email protected]
2 Chair for Innovation and Technology Management, Technical University of Berlin, Germany [email protected]

Abstract. Rasch models have been used especially in psychometrics and the human sciences for the analysis of social behavior, abilities or personality traits (Rasch 1980). Rasch models are adapted and used here to examine a set of dichotomous promoting behavior items in inter-organizational radical innovations with respect to dimensional structure and item fit. Empirical research in innovation management has shown that key people play an important role in initiating and implementing innovations (Walter et al. 2011). However, these key people have hardly been investigated in an inter-organisational context so far (Rese, Baier 2011). Rasch analysis makes it possible to take into account several measurement issues that are required for validity and that support scale improvement. To determine whether a person occupies a role, a total raw score on the role trait can be calculated for each person. The primary aim of this study is to develop a scale to assess the behavior of several actors in inter-organizational radical innovations. Besides the identification of roles, the role structure is analyzed: assuming two to five cooperating organizations, the question addressed is how many people work together and take on roles.
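For readers unfamiliar with the model, the dichotomous Rasch model specifies the probability that a person with trait level θ endorses an item with difficulty b as exp(θ - b) / (1 + exp(θ - b)), and the total raw score is the sufficient statistic for θ. The Python sketch below illustrates only these two textbook facts; the item difficulties and responses are invented, and no claim is made about the authors' actual scale.

```python
import math

def rasch_probability(theta, difficulty):
    """Probability of a positive response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def raw_score(responses):
    """Total raw score of one person over dichotomous (0/1) items."""
    return sum(responses)

def expected_raw_score(theta, difficulties):
    """Model-implied raw score for a person with trait level theta."""
    return sum(rasch_probability(theta, b) for b in difficulties)

# Hypothetical difficulties of four promoting-behavior items.
item_difficulties = [-1.0, 0.0, 0.5, 2.0]
# Hypothetical responses of one person to these items.
person_responses = [1, 1, 0, 0]
observed = raw_score(person_responses)
expected_low = expected_raw_score(-1.0, item_difficulties)
expected_high = expected_raw_score(1.0, item_difficulties)
```

The expected raw score increases monotonically with θ, which is why persons can be ordered on the role trait by their raw scores.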

References Rasch, G. (1980): Probabilistic Models for Some Intelligence and Attainment Tests. Mesa Press, Chicago, IL. Rese, A. and Baier, D. (2011): Success factors for innovation management in networks of small and medium enterprises. R&D Management, 41(2), 138–155. Walter, A., Parboteeah, K. P., Riesenhuber, F. and Hoegl, M. (2011): Championship behaviors and innovations success: an empirical investigation of university spinoffs. Journal of Product Innovation Management, 28(4), 586–598.

Keywords RASCH MODELS, ROLE MODELS, INNOVATION MANAGEMENT


Variable Weighting and Selection Approaches for Market Segmentation: A Comparison Susanne Rumstadt and Daniel Baier Chair of Marketing and Innovation Management, BTU Cottbus, Postbox 101344, 03013 Cottbus, Germany, {susanne.rumstadt, daniel.baier}@tu-cottbus.de Abstract. The selection and weighting of variables play decisive roles in market segmentation. The inclusion or exclusion of variables as well as the distribution of their possible values affect the quality of the grouping. Some selection and weighting approaches yield better groupings, some worse. For example, when respondents are grouped on the basis of holiday, spare time, or apartment images uploaded to social networks or during online interviews (see, e.g., Baier, Daniel 2012 for a recent overview), many different subsets of features extracted from these images (e.g. color histograms, edge histograms, high-level features like the number of persons, or categories like beach or mountain) can be used, leading to different grouping results. To address this problem, several feature saliency approaches have been proposed recently, based, e.g., on latent class analysis (see, e.g., Law et al. 2004) or k-means heuristics (see, e.g., Carmone et al. 1999, Brusco, Cradit 2001, Steinley, Brusco 2008). In this paper we analyze which feature saliency approach is advantageous in which setting. For this purpose we use real data from image clustering and simulated data, both in a Monte Carlo setting.

References Baier, D., Daniel, I. (2012): Image Clustering for Marketing Purposes, to appear in: Studies in Classification, Data Analysis, and Knowledge Organization, 43, 1. Brusco, M.J., Cradit, J.D. (2001): A Variable-Selection Heuristic for K-means Clustering, Psychometrika, 66, 2, 249-270. Steinley, D., Brusco, M.J. (2008): Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures, Psychometrika, 73, 1, 125-144. Carmone, F.J., Kara, A., Maxwell, S. (1999): HINoV: A new Model to Improve Market Segment Definition by Identifying Noisy Variables, Journal of Marketing Research, 36, 4, 501-509. Law, M.H.C., Figueiredo, M.A.T., Jain, A.K. (2004): Simultaneous Feature Selection and Clustering Using Mixture Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 9, 1154-1166.

Keywords Market Segmentation, Latent Class Analysis, K-Means, Feature Saliency, Variable Selection


A Validity Analysis of Recent Commercial Conjoint Analysis Studies Sebastian Selka, Daniel Baier, and Peter Kurz Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany {sebastian.selka,daniel.baier}@tu-cottbus.de TNS Infratest GmbH, Arnulfstrasse 205, Munich 6839, Germany [email protected] Abstract. Given the growing number of online questionnaires and the possible distractions – e.g. by mails, social network messages, or news reading while answering in an uncontrolled environment – one can assume that the (internal and external) validity of conjoint studies is declining. We test this assumption by comparing the (internal and external) validity of commercial conjoint analysis studies over recent years. The research base consists of (disguised) recent commercial conjoint analysis studies of a leading international marketing research company in this field that conducts about 1,000 conjoint studies per year. The validity information is analyzed w.r.t. research objective, product type, period, incentives, and other categories, and also w.r.t. other outcomes like interview length and response rates. The results show some interesting changes in the validity of these studies. Additionally, new procedures for dealing with these settings will be shown.

References WITTINK, D. R. and VRIENS, M. and BURHENNE, W. (1994): Commercial use of conjoint analysis in Europe: Results and critical reflections. International Journal of Research in Marketing, 11, 1, 41 - 52. DEUTSKENS, E. and de RUYTER, K. and WETZELS, M. and OOSTERVELD, P. (2004): Response Rate and Response Quality of Internet-Based Surveys: An Experimental Study. Marketing Letters, 2004, 15, 1, 21-36. GREEN, P. E., KRIEGER, A. M., and WIND, Y. J. (2001): Thirty Years of Conjoint Analysis: Reflections and Prospects. Interfaces, 31, 56–73.

Keywords MARKETING RESEARCH, CONJOINT ANALYSIS, VALIDITY DEVELOPMENT


Exploring Nonlinear Effects in the Relationship between Customer Satisfaction and Customer Retention Winfried J. Steiner1 , Florian U. Siems2 , Anett Weber1 and Daniel Guhl1
1 Department of Marketing, Clausthal University of Technology, 38678 Clausthal-Zellerfeld [email protected], [email protected], [email protected]
2 Faculty of Business and Economics, RWTH Aachen University, 52072 Aachen [email protected]

Abstract. There is consensus in the marketing literature that satisfaction of customers concerning (1) perceived quality and (2) pricing of products/services is critical for customer retention. In contrast, there is a lack of empirical evidence about the exact functional relationship. Using nonparametric regression, this contribution empirically investigates whether and to what extent each of those two satisfaction dimensions has nonlinear effects on customer retention. Results from an empirical study not only reveal complex nonlinear effects for both satisfaction-retention relationships, but also indicate strong interaction effects of the two satisfaction dimensions on customer retention. To estimate nonlinear effects, we follow Lang and Brezger (2004), who proposed a Bayesian version of the P-splines originally introduced by Eilers and Marx (1996). Accordingly, nonlinear interaction effects are modeled via tensor products of unidimensional splines within this Bayesian framework. The P-spline models are estimated with and without interaction effects and clearly outperform parametric benchmark models in both fit and predictive validity. While the P-spline model with interaction effects shows the best performance across models, the P-spline model without interaction effects still outperforms the parametric model with interaction effects, which indicates the important role of nonlinearities in the satisfaction-retention context.

References Eilers P, Marx B (1996) Flexible smoothing using B-splines and penalized likelihood (with comment and rejoinder). Statistical Science 11(2):89-121 Lang S, Brezger A (2004) Bayesian P-Splines. Journal of Computational and Graphical Statistics 13(1):183-212

Keywords CUSTOMER SATISFACTION, CUSTOMER RETENTION, NONPARAMETRIC REGRESSION, P-SPLINES


Complex Product Development: Using a Combined VoC Lead User Approach Alexander Sänn1 and Daniel Baier2
1 IHP GmbH - Leibniz-Institut für innovative Mikroelektronik†, Im Technologiepark 25, 15236 Frankfurt (Oder), Germany [email protected]
2 Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany [email protected]

Abstract. Nowadays, the lead user method is a state-of-the-art method for generating breakthrough innovations in new product development. While the lead user method has been applied successfully to generate simple products for business-to-consumer markets (e.g. Sänn and Baier 2012), the contribution of lead users in a complex product environment remains highly controversial (e.g. Mahr and Lievens 2012; Magnusson 2009). This research adopts the view of SMEs that aim to generate complex radical innovations for business-to-business contexts, employing the lead user method in combination with voice of the customer techniques in the field of complex IT security products. The new approach is expected to lower the risk of developing a niche product. The empirical findings led to an adaptive lead user approach that addresses common problems of lead userness in complex product environments. Overall, SMEs will be enabled to specify and parameterize future products according to reliable and user-verified data at an early stage of new product development.

References MAGNUSSON, P. (2009): Exploring the Contributions of Involving Ordinary Users in Ideation of Technology-Based Services. Journal of Product Innovation Management, 26, 5, 578-593. MAHR, D. and LIEVENS, A. (2012): Virtual lead user communities: Drivers of knowledge creation for innovation. Research Policy, 41, 1, 167-177. SÄNN, A. and BAIER, D. (2012): Lead User Identification Based in Conjoint Analysis Based Product Design. Studies in Classification, Data Analysis and Knowledge Organization, 43, 521-528.

Keywords MARKETING RESEARCH, LEAD USER, NEW PRODUCT DEVELOPMENT

† Alexander Sänn is a PhD student at Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany.

Identifying Consumer Typologies from Online Product Reviews Using Finite Mixture Models Michael N. Tuma1∗ Department of Business Administration and Economics, Bielefeld University, D-33615 Bielefeld, Germany. [email protected] Abstract. Online product reviewing is an emerging phenomenon that is playing an increasingly important role in consumers’ purchase decisions (Chen and Xie, 2008). Recent empirical surveys show that people rely more and more on opinions posted on blogs, online forums and opinion portals when making a variety of decisions, ranging from which movies to watch to which products to purchase. Despite the importance of opinion analysis, to the best of our knowledge, there has been no attempt by marketing researchers to empirically identify different types of consumers who post their opinions online. This study seeks to fill this gap. Using natural language processing techniques, we develop a novel approach – related to that of Decker and Trusov (2010) – to identify those variables which, at least partly, explain the articulated opinions. These variables are then used in a model-based clustering approach (Wedel and Kamakura, 2000) to identify homogeneous segments of consumers that can then be targeted with the same marketing measures. The results show that polarity in consumer opinions plays a significant role in segment formation.

References CHEN, Y. and XIE, J. (2008): Online Consumer Review: Word–of–Mouth as a New Element of Marketing Communication Mix. Management Science, 54, 477-491. DECKER, R. and TRUSOV, M. (2010): Estimating Aggregate Consumer Preferences from Online Product Reviews. International Journal of Research in Marketing, 27, 293-307. WEDEL, M. and KAMAKURA, W. (2000): Market Segmentation: Conceptual and Methodological Foundations. 2nd ed., Kluwer Academic Publishers, Dordrecht.

Keywords MARKET SEGMENTATION, ONLINE CONSUMER REVIEWS, FINITE MIXTURE MODELS, SEGMENTATION VARIABLES

∗ Ph.D. student

Solving Product Line Design Optimization Problems using Stochastic Programming Sascha Voekler1 and Daniel Baier2
1 Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, D-03013 Cottbus, Germany [email protected]
2 Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, D-03013 Cottbus, Germany [email protected]

Abstract. In this paper, we apply stochastic programming methods to product line design optimization problems. Because the part-worths of the product attributes in conjoint analysis are estimated, there is a need to deal with the uncertainty caused by the underlying statistical data (Kall/Mayer 2011). Inspired by the work of George B. Dantzig (Dantzig 1955), we developed an approach that uses the methods of stochastic programming for product line design issues. Four different approaches will be compared using notional data of a yogurt market from Gaul and Baier (2009). Stochastic programming methods like single- or two-stage programs are applied to the approach of Gaul, Aust and Baier (Gaul et al. 1995) and will be compared to its original formulation, to Green and Krieger (Green/Krieger 1985) and to Kohli and Sukumar (Kohli/Sukumar 1990). Besides the theoretical work, these methods will be implemented in self-written code using the statistical software package R.

References Dantzig, G.B. (1955): Linear Programming Under Uncertainty. Management Science, 1(3/4), 197-206. Gaul, W., Aust, E., Baier, D. (1995): Gewinnorientierte Produktliniengestaltung unter Berücksichtigung des Kundennutzens. Zeitschrift für Betriebswirtschaftslehre, 65, 835-855. Gaul, W., Baier, D. (2009): Simulations- und Optimierungsrechnungen auf Basis der Conjointanalyse. Conjointanalyse Methoden-Anwendungen-Praxisbeispiele, D. Baier, M. Brusch (Hrsg.), Berlin, Heidelberg, Springer 2009, 163-182. Green, P.E., Krieger, A.M. (1985): Models and Heuristics for Product Line Selection. Marketing Science, 4(1), 1-19. Kall, P., Mayer, J. (2011): Linear Stochastic Programming Models, Theory, and Computation. International Series in Operations Research and Management Science, Springer New York, Dordrecht, Heidelberg, London, 2011, 156. Kohli, R., Sukumar, R. (1990): Heuristics for Product-line Design Using Conjoint Analysis. Management Science, 36(12), 1464-1478.

Keywords Conjoint Analysis, Product Line Design Optimization, Stochastic Programming.

Part VII

Data Analysis in Finance

Sovereign Wealth Funds and Portfolio Choice Wolfgang Bessler1 and Daniil Wagner, CFA2
1 Center for Finance and Banking, University of Giessen, Licher Strasse 74, 35394 Giessen, Germany [email protected]
2 PhD Student, Center for Finance and Banking, University of Giessen, Licher Strasse 74, 35394 Giessen, Germany [email protected]

Abstract. In this paper we take the portfolio manager’s perspective and analyze Sovereign Wealth Fund (SWF) portfolios from an investment management view. More than 50 percent of SWF assets are funded by oil or gas revenues, and two of the main SWF goals are economic stabilization (e.g. in the case of resource risks borne by a high dependence on resource revenues) and intergenerational wealth transfer (e.g. from exhaustible resource revenues to financial assets). Our approach is therefore to include the funding source of the SWF in the portfolio choice problem as a background asset. As a consequence, the portfolio choice for countries endowed with natural resources should differ significantly depending on the type of resource involved, and also from a pure financial asset setting. Based on this assumption we develop a portfolio choice framework using Markowitz mean-variance optimization (MVO) and including the natural resource pool as a fixed optimization component. We account for the shortcomings of sample-based MVO, such as the high sensitivity of optimal weights to small changes in input parameters and the poor out-of-sample performance relative to naive diversification strategies, by using the Black-Litterman model to calculate the input parameters. Our empirical results show that in the presence of background assets (natural resources like oil, gas or copper) the investment opportunities for the whole country shrink in terms of risk-return opportunities. Furthermore, in-sample as well as out-of-sample analyses indicate that for resource-rich economies a high allocation to low-correlated assets such as U.S. government bonds, real estate or hedge funds may be optimal.
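The effect of a fixed background asset on the optimal financial portfolio can be illustrated with a small unconstrained mean-variance example. If the country holds a fixed exposure k to a resource whose covariances with the financial assets are σ_b, maximizing w'μ - (λ/2)(w'Σw + 2k w'σ_b) gives w* = Σ⁻¹(μ/λ - kσ_b), so assets that co-move with the resource receive lower weights. The Python sketch below is a simplification under stated assumptions: two assets, no budget or short-sale constraints, sample-style inputs instead of Black-Litterman estimates, and invented numbers and names throughout.

```python
def solve_2x2(a, rhs):
    """Solve a 2x2 linear system a @ w = rhs by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (a[1][1] * rhs[0] - a[0][1] * rhs[1]) / det,
        (a[0][0] * rhs[1] - a[1][0] * rhs[0]) / det,
    ]

def optimal_weights(mu, cov, cov_with_background, exposure, risk_aversion):
    """Unconstrained mean-variance weights with a fixed background asset.
    First-order condition: cov @ w = mu / risk_aversion - exposure * cov_b."""
    rhs = [m / risk_aversion - exposure * c
           for m, c in zip(mu, cov_with_background)]
    return solve_2x2(cov, rhs)

# Hypothetical inputs: equities co-move with the oil price, bonds do not.
mu = [0.06, 0.03]                      # expected excess returns
cov = [[0.04, 0.0], [0.0, 0.01]]       # covariance of equities and bonds
cov_with_oil = [0.02, 0.0]             # covariances with the resource
w_financial_only = optimal_weights(mu, cov, cov_with_oil, 0.0, 4.0)
w_with_background = optimal_weights(mu, cov, cov_with_oil, 0.5, 4.0)
```

In this toy setting the equity weight falls once the resource exposure is included, while the weight of the uncorrelated bond is unchanged, which mirrors the qualitative finding that low-correlated assets become more attractive for resource-rich economies.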

Keywords SOVEREIGN WEALTH FUNDS, RESOURCE RISK, PORTFOLIO CHOICE, MEAN-VARIANCE OPTIMIZATION, BLACK-LITTERMAN MODEL


References BALDING, C. (2008): A portfolio analysis of Sovereign Wealth Funds. Working Paper University of California. BESSLER, W. and WOLFF, D. (2011): A Theoretical and Empirical Analysis of the Black-Litterman Model, in: GfKl 2011 Proceedings. BEST, M.J. and GRAUER, R.R. (1991): On the Sensitivity of Mean-Variance-Efficient Portfolios to Changes in Asset Means: Some Analytical and Computational Results, in: The Review of Financial Studies, vol. 4, no. 2, pp. 315-342. BLACK, F. and LITTERMAN, R. (1992): Global Portfolio Optimization, in: Financial Analysts Journal, pp. 28-43. BROWN, A., PAPAIOANNOU, M. and PETROVA, I. (2010): Macrofinancial linkages of the strategic asset allocation of commodity-based Sovereign Wealth Funds, IMF Working Paper. CORDEN, W.M. and NEARY, J.P. (1982): Booming sector and de-industrialisation in a small open economy, in: Economic Journal, vol. 92, pp. 825-848. DOSKELAND, T.M. (2007): Strategic asset allocation for a country: the Norwegian case, in: Financial Markets and Portfolio Management, vol. 21, pp. 167-201. DEMIGUEL, V., GARLAPPI, L. and UPPAL, R. (2009): Optimal Versus Naive Diversification: How Inefficient is the 1/N Portfolio Strategy? in: The Review of Financial Studies, vol. 22, no. 5, pp. 1915-1953. GINTSCHEL, A. and SCHERER, B. (2008): Optimal asset allocation for sovereign wealth funds, in: Journal of Asset Management, vol. 9, pp. 215-238. HARTWICK, J.M. (1977): Intergenerational Equity and the Investing of Rents from Exhaustible Resources, in: The American Economic Review, vol. 67, no. 5, pp. 972-974. HE, G. and LITTERMAN, R. (1999): The Intuition Behind Black-Litterman Model Portfolios, Goldman Sachs Investment Management. HOTELLING, H. (1931): The Economics of Exhaustible Resources, in: Journal of Political Economy, vol. 39, no. 2, pp. 137-175. IDZOREK, T.M. (2006): Developing Robust Asset Allocations, Working Paper. LEE, B. and WANG, H. (2010): Reevaluating the roles of large public surpluses and Sovereign Wealth Funds in Asia, Asian Development Bank Institute and Institute for South East Asian Studies Working Paper. MICHAUD, R.O. (1989): The Markowitz Optimization Enigma: Is Optimized Optimal? in: Financial Analysts Journal, vol. 45, pp. 31-42. SCHERER, B. (2009): A note on portfolio choice for sovereign wealth funds, in: Financial Markets and Portfolio Management, vol. 23, no. 2009, pp. 315-327. SOLOW, R.M. (1974): Intergenerational equity and exhaustible resources, in: The Review of Economic Studies, vol. 41, pp. 29-45. SOLOW, R.M. and WAN, F.Y. (1976): Extraction costs in the theory of exhaustible resources, in: The Bell Journal of Economics, vol. 7, no. 2, pp. 359-370. XIE, P. and C. CHEN (2008): Sovereign Wealth Funds, macroeconomic policy alignment and financial stability, National Natural Science Fund Emergency Project Working Paper.

Feature reduction and pattern classification for financial forecasting: A comparative study on different optimization strategies Daniel Bohlmann1 and Jarek Krajewski2
1 Bergische Universität Wuppertal, Germany [email protected]
2 Bergische Universität Wuppertal, Germany [email protected]

Abstract. The aim of our contribution is to present different feature reduction strategies for classifying patterns in the field of financial time series forecasting. Feature (space) reduction plays an important role in pattern classification and has attracted growing interest in recent years, as the number of features has increased enormously (large-scale data). The inclusion of irrelevant, redundant, and noisy attributes in the dataset can result in poor predictive performance, and finding the optimal feature subset requires efficient search strategies and evaluation criteria. Wrapper methods tend to achieve higher classification accuracy than filter approaches, but they also incur higher computational costs. Feature transformation strategies such as Principal Component Analysis (PCA) or Nonnegative Matrix Factorization (NMF) aim at reducing the dimension of the feature space without losing the information it contains. In financial time series prediction, the number of features in previous work was relatively small and mostly limited to a few trend indicators and oscillators. Since the velocity and the acceleration function might include more information about the future development of the time series, we extend the feature set by the first and second derivatives of the technical indicators. The dataset in this study covers the development of the Euro-Dollar exchange rate and consists of 3,382 trading days from January 1999 to December 2011. This study presents a benchmark comparison of several attribute reduction methods for supervised classification. We use different Forward Selection strategies in combination with Wrapper, Filter, Transformation and Hybrid (filter-wrapper) models. We apply Support vector machine (SVM), Back-propagation neural network (BP), k-nearest neighbor and Naive Bayes as learning algorithms. Empirical results indicate that SVM deals best with the nonlinear and noisy environment and outperforms the other forecasting models and the random walk. Furthermore, it can be shown that the selected features clearly depend on the specific underlying classification algorithm.
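The extension of the feature set by velocity and acceleration can be sketched in a few lines. The abstract does not say how the derivatives are computed, so the Python example below assumes simple finite differences of a technical indicator, here a simple moving average; the window length, the toy prices and all function names are hypothetical.

```python
def simple_moving_average(prices, window):
    """Trailing simple moving average; the first window - 1 days are dropped."""
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]

def difference(series):
    """First finite difference, used as a discrete stand-in for a derivative."""
    return [b - a for a, b in zip(series, series[1:])]

def indicator_features(prices, window=3):
    """Per-day tuples (indicator, velocity, acceleration).
    Rows are aligned on the days for which all three values exist."""
    sma = simple_moving_average(prices, window)
    velocity = difference(sma)
    acceleration = difference(velocity)
    return list(zip(sma[2:], velocity[1:], acceleration))

# Toy price path; a real study would use the exchange-rate series instead.
features = indicator_features([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], window=3)
```

Applied to every technical indicator, this triples the number of candidate features, which is what makes a subsequent feature reduction step necessary.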

Keywords Financial forecasting, Pattern recognition, Pattern classification, Feature reduction, Feature selection, Dimension reduction
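
The wrapper-style forward selection mentioned in the abstract can be illustrated with a minimal sketch. It assumes a leave-one-out 1-nearest-neighbour classifier as the wrapped learner and a tiny made-up dataset; the function names and data are hypothetical and are not taken from the study.

```python
def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-NN classifier restricted to `features`."""
    correct = 0
    for i in range(len(X)):
        best_j, best_d = None, float("inf")
        for j in range(len(X)):
            if i == j:
                continue
            d = sum((X[i][f] - X[j][f]) ** 2 for f in features)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)


def forward_selection(X, y, score=loo_1nn_accuracy):
    """Greedy forward selection: add the feature that improves the score most."""
    remaining = list(range(len(X[0])))
    selected, best_score = [], 0.0
    while remaining:
        # Evaluate every candidate feature added to the current subset.
        scored = [(score(X, y, selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda t: t[0])
        if top_score <= best_score:
            break  # no candidate improves the wrapped classifier
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X_demo = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 4.9], [1.1, 1.1], [1.2, 9.1]]
y_demo = [0, 0, 0, 1, 1, 1]
selected_demo, score_demo = forward_selection(X_demo, y_demo)
```

A filter approach would replace `score` by a classifier-independent criterion, which is cheaper to evaluate but ignores the learner that is used later.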


A practical method of determining longevity and premature-death risk aversion in households and some proposals of its application
Lukasz Feldman, Radoslaw Pietrzyk, and Pawel Rokita

Wroclaw University of Economics

Abstract. This paper presents a technique that facilitates practical calibration of a household's utility function to support the choice of the optimal (or at least satisfactory) cash flow term structure in retirement. A simplified model of a two-person household is adopted. It is suggested that household members choose from among a number of easy-to-understand graphical schemes of cumulated cash flow term structures to be realized in the distribution phase of the life cycle. On this basis an analyst (or personal financial advisor) assigns individualized utility function parameters to the household. The utility function and any associated mortality-rate models are joint models for the whole household. A utility function calibrated with the suggested algorithm may subsequently be used to optimize retirement spending, and also to support investment decisions in the accumulation phase. The resulting cash flow term structure that maximizes expected discounted utility depends, among other things, on the applied life-cycle and mortality-force model, the age and gender of the household members, and the cumulated retirement capital. The calibration technique might also be helpful in classifying households with respect to risk aversion.

References GONG, G. and WEBB, A. (2008): Mortality Heterogeneity and the Distributional Consequences of Mandatory Annuitization. The Journal of Risk and Insurance, 75(4), 1055–1079. MILEVSKY, M. A., HUANG, H. (2011): Spending Retirement on Planet Vulcan: The Impact of Longevity Risk Aversion on Optimal Withdrawal Rates, Financial Analysts Journal, 67(2), 45–58

Keywords LONGEVITY RISK, UTILITY, HOUSEHOLD, RETIREMENT


Optimal portfolios of securities taking into account the asymmetry of specific risk
Garsztka Przemyslaw
Poznan Univ. of Economics

Abstract. The standard approach to portfolio construction by Sharpe has two sources of risk: market risk and the risk of a random factor. It is assumed that the random component is normally distributed with zero expectation and constant variance. Empirical research, however, reveals high kurtosis and skewness in the residuals of the characteristic line. Suppose that the return on an asset depends on the situation on the market: in the case of positive information the asset is attractive to buyers, who are willing to pay a premium in order to accelerate the purchase; in the case of negative information the asset is "attractive" to sellers, who are willing to grant a concession. Additionally, suppose that the less liquid the asset, the more difficult it is to conclude a transaction and the greater the premium or concession must be [Amihud, Mendelson (1986)]. The premium offered by buyers gives rise to right-sided skewness of the random component of the characteristic lines estimated over periods when the stock index increases. Similarly, the concession offered by sellers gives rise to left-sided skewness. The paper proposes to account for this empirically by splitting the random component into two factors, one of which explains the asymmetric effect of the liquidity risk attached to the assets. The article uses the Battese and Coelli specification, which was originally applied in stochastic frontier analysis. In addition, the author proposes the construction of a portfolio of shares listed on the Frankfurt Stock Exchange, taking into account three sources of risk.

References AMIHUD, Y. and MENDELSON, H. (1986): Asset pricing and the bid-ask spread, Journal of Financial Economics 17, 223-249. BATTESE, G.E. and COELLI, T.J. (1995): A Model for Technical Inefficiency Effects in a Stochastic Frontier Production Function for Panel Data, Empirical Economics 20, 325-332.

Keywords PORTFOLIO SELECTION, SPECIFIC RISK, ASSETS LIQUIDITY


A Simplex Rotation Algorithm for the Factor Approach to Generate Financial Scenarios
Alois Geyer1, Michael Hanke2, and Alex Weissensteiner3

1 WU (Vienna University of Economics and Business), Austria, and Vienna Graduate School of Finance (VGSF)
2 Institute for Financial Services, University of Liechtenstein
3 School of Economics and Management, Free University of Bolzano, Italy

Abstract. Scenario trees to be used for financial optimization must be free of arbitrage opportunities. We start from a factor approach which is explicitly designed to generate arbitrage-free scenario trees while exactly matching the assets’ expected excess returns and covariances. Here we present a new algorithm to implement the factor approach which is based on rotations of simplexes. This algorithm offers two major computational advantages: First, it does not require Cholesky decomposition, but uses a deterministically constructed simplex as its starting point. Second, instead of (potentially frequent) re-sampling, it ensures no-arbitrage for every single run by purposefully rotating this simplex. Hence, the new algorithm completely avoids any need for checking scenarios for arbitrage. As a by-product, the derivation of our algorithm provides interesting geometrical insights.


Correlation of outliers in multivariate data
Bartosz Kaszuba
Department of Financial Investments and Risk Management, Wroclaw University of Economics, ul. Komandorska 118/120, Wroclaw, Poland

Abstract. Conditional correlations of stock returns (also known as exceedance correlations) are commonly compared for downside moves and upside moves separately. Results so far have shown that correlation increases when the market goes down, so that investors' portfolios are less diversified. Unfortunately, when empirical exceedance correlations are analysed in a multi-asset portfolio, each correlation may be based on different points in time, so high exceedance correlations for downside moves do not imply a lack of diversification in a bear market. This paper proposes calculating correlations conditional on the Mahalanobis distance being greater than a given quantile of the chi-square distribution. The main advantage of the proposed approach is that every correlation is calculated from the same points in time. Furthermore, when the data come from an elliptical distribution, the proposed conditional correlation does not change, in contrast to the exceedance correlation. Empirical results for selected stocks from the DAX30 will show an increase of correlation in a bear market and a decrease of correlation in a bull market.

References CHUA, D. B., KRITZMAN, M. and PAGE, S. (2009): The Myth of Diversification The Journal of Portfolio Management, 36, 26–35. HONG, Y.,TU, J. and ZHOU, G. (2007): Asymmetries in Stock Returns: Statistical Tests and Economic Evaluation. Review of Financial Studies, 20, 1547–1581. LONGIN, F., and SOLNIK, B. (2001): Extreme Correlation of International Equity Markets. Journal of Finance, 56, 649–676.

Keywords MULTIVARIATE OUTLIERS, ASSET CORRELATION, MAHALANOBIS DISTANCE, EXCEEDANCE CORRELATION

PhD student
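
The conditioning on the Mahalanobis distance described in the abstract can be sketched for two assets as follows. This is an illustration under stated assumptions, not the author's implementation: it uses sample moments, simulated Gaussian returns, and the closed-form chi-square quantile for two degrees of freedom, q_p = -2 ln(1 - p). All function and variable names are hypothetical.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Sample covariance of two equally long series."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def corr(a, b):
    return cov(a, b) / math.sqrt(cov(a, a) * cov(b, b))


def outlier_correlation(x, y, p):
    """Correlation of the observations whose squared Mahalanobis distance
    exceeds the p-quantile of the chi-square distribution with 2 dof."""
    mx, my = mean(x), mean(y)
    sxx, syy, sxy = cov(x, x), cov(y, y), cov(x, y)
    det = sxx * syy - sxy ** 2
    threshold = -2.0 * math.log(1.0 - p)  # chi-square quantile, 2 dof
    kept = []
    for a, b in zip(x, y):
        dx, dy = a - mx, b - my
        # Squared Mahalanobis distance via the inverse of the 2x2 covariance.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        if d2 > threshold:
            kept.append((a, b))
    kx = [a for a, _ in kept]
    ky = [b for _, b in kept]
    return corr(kx, ky), len(kept)


# Simulated (hypothetical) returns with a population correlation of 0.6.
random.seed(7)
z1 = [random.gauss(0.0, 1.0) for _ in range(500)]
z2 = [random.gauss(0.0, 1.0) for _ in range(500)]
x_demo = z1
y_demo = [0.6 * a + 0.8 * b for a, b in zip(z1, z2)]
rho_demo, n_kept_demo = outlier_correlation(x_demo, y_demo, p=0.5)
```

Because all pairwise correlations in a larger portfolio would be computed on the same retained dates, they remain comparable across asset pairs, which is the advantage stated in the abstract.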


Using generalized additive models to fit credit rating scores
Marlene Müller
Beuth University of Applied Sciences Berlin

Abstract. We consider the estimation of credit scores by means of semiparametric logit models. In credit scoring, the fitted rating score should not only provide an optimal classification result but also serve as a modular component of a (typically quite complex) rating system. This means in particular that a rating score should be given by a linearly weighted sum of rating factors. That way the rating procedure can be easily interpreted and understood by non-statisticians as well. For that reason the logit model, or logistic regression approach, is one of the most popular models for estimating credit rating scores. The first step in fitting the rating model is usually a nonlinear transformation of the raw variables in order to obtain a linear predictor (rating score) in the final estimation. As an alternative to this two-step approach, generalized additive models (GAM) allow for a simultaneous estimation of both the initial transformation and the final logit fit. In this study we compare GAM estimation approaches with a focus on the specific structure of credit data: small default rates, mixed discrete and continuous explanatory variables, and possibly nonlinear dependencies between the regressors.

References HÄRDLE, W., MÜLLER, M., SPERLICH, S. and WERWATZ, A. (2004): Nonparametric and Semiparametric Modeling, Springer, New York. HASTIE, T. J. and TIBSHIRANI, R. J. (1990): Generalized Additive Models, Chapman and Hall, London. R DEVELOPMENT CORE TEAM (2010): R: A Language and Environment for Statistical Computing, http://www.R-project.org WOOD, S. N. (2006): Generalized Additive Models: An Introduction with R, Chapman and Hall, London.

Keywords SEMIPARAMETRIC LOGIT MODEL, GENERALIZED ADDITIVE MODEL, CREDIT RATING


Clustering Algorithms for Storage of Tick Data
Gabor I. Nagy∗ and Krisztian Buza
Budapest University of Technology and Economics, Magyar tudósok körútja 2, H-1117 Budapest, Hungary

Abstract. Tick data is one of the most prominent types of temporal data, as it can be used to represent data in various domains such as geophysics or finance. Storage of tick data is a challenging problem because two criteria have to be fulfilled simultaneously: the storage structure should allow fast execution of queries, and the data should not occupy too much space on the hard disk or in main memory. We present two clustering-based solutions, namely our recently developed clustering algorithms SOHAC and SOPAC. These algorithms are designed to support the storage of tick data and are in the process of publication (see References). We evaluate our algorithms on publicly available real-world datasets as well as on real-world tick data from the financial domain provided by one of the world's most renowned investment banks. In our experiments, we compare our approaches, SOHAC and SOPAC, against a large collection of conventional clustering algorithms from the literature. The experiments show that our algorithms substantially outperform the examined clustering algorithms for the tick data storage problem, both in terms of statistical significance and practical relevance. Additionally, we present our most recent research directions related to clustering algorithms for tick data storage.

References NAGY, G.I. and BUZA, K. (2012): Partitional Clustering of Tick Data to Reduce Storage Space. IEEE 16th International Conference on Intelligent Engineering Systems, to appear. NAGY, G.I. and BUZA, K. (2012): Efficient Storage of Tick Data That Supports Search and Analysis. 12th Industrial Conference on Data Mining, LNCS, Springer, to appear.

Keywords TICK DATA, CLUSTERING, STORAGE, APPLICATION, FINANCE

∗ The first author is a PhD student.


Value-at-Risk Backtesting Procedures Based on the Loss Functions - Simulation Analysis of the Power of Tests
Krzysztof Piontek
Department of Financial Investments and Risk Management, Wroclaw University of Economics, ul. Komandorska 118/120, Wroclaw, Poland

Abstract. The definition of Value at Risk is quite general. There are many approaches, and they can give different VaR values. The challenge is not to suggest a new method but to distinguish between good and bad models. Backtesting is the statistical procedure needed to evaluate VaR models and select the best one. There are three groups of methods for validating VaR models: those based on the frequency of failures, on the adherence of the model to the asset return distribution, and on various loss functions. Risk managers are usually not concerned about the power of the tests they use. If the power of a test is low, an inaccurate VaR model is likely to be misclassified as well specified, which can pose a threat to financial institutions. The aim of the paper is to analyze selected backtesting methods (based on the idea of loss functions), focusing on the power of the tests and on limited data sets (as usually observed in practice). The main attention is paid to the different kinds of loss functions and to the statistical evaluation of the most widely applied tests. Simulated data representing asset returns are used. The last part summarizes the results obtained and gives hints for optimal backtesting. This paper is a continuation of earlier research by the author.

References CAMPBELL, S. (2005): A Review of Backtesting and Backtesting Procedures. Federal Reserve Board. Washington PIONTEK, K. (2010): Analysis of Power for Some Chosen VaR Backtesting Procedures - Simulation Approach, Advances in Data Analysis, Data Handling and Business Intelligence, Part 7, Springer Verlag, 481-490

Keywords RISK MEASUREMENT, VALUE-AT-RISK, VaR, BACKTESTING, POWER OF TESTS
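
To make the idea of a loss-function-based backtest concrete, the following minimal sketch computes the failure (exceedance) sequence and a quadratic magnitude loss of the kind used in the backtesting literature. The abstract does not say which loss functions are studied, so the specific form below is an assumption, as are the function names and the sign convention (VaR reported as a positive number, a failure occurring when the return falls below -VaR).

```python
def exceedances(returns, var_forecasts):
    """1 if the realized return breaches the VaR forecast, else 0."""
    return [1 if r < -v else 0 for r, v in zip(returns, var_forecasts)]


def quadratic_loss(returns, var_forecasts):
    """Average loss: 1 plus the squared size of the breach on failure days,
    0 on all other days (an assumed, illustrative loss function)."""
    total = 0.0
    for r, v in zip(returns, var_forecasts):
        if r < -v:
            total += 1.0 + (r + v) ** 2
    return total / len(returns)


# Hypothetical daily returns and a constant 2% VaR forecast.
returns_demo = [0.01, -0.03, 0.00, -0.05]
var_demo = [0.02, 0.02, 0.02, 0.02]
hits_demo = exceedances(returns_demo, var_demo)
loss_demo = quadratic_loss(returns_demo, var_demo)
```

A frequency-based test looks only at `hits_demo`, whereas a loss-based comparison also rewards models whose breaches are small.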


Fundamental portfolio construction based on semi-variance
Anna Rutkowska-Ziarko

Abstract. In models for creating a fundamental portfolio, which are based on the classical Markowitz model, the variance is usually used as the risk measure. However, the equal treatment of negative and positive deviations from the expected rate of return is a shortcoming of variance as a risk measure. Markowitz defined semi-variance to measure the negative deviations only. However, finding the fundamental portfolio with minimum semi-variance is much more difficult than finding the fundamental portfolio with minimum variance. The fundamental portfolio introduces an additional condition aimed at ensuring that the portfolio is composed only of companies in good economic condition. A synthetic indicator describing the economic and financial situation is constructed for each company. The method of constructing fundamental portfolios using semi-variance as the risk measure is presented. The differences between semi-variance fundamental portfolios and variance fundamental portfolios are analysed using the example of companies listed on the Warsaw Stock Exchange.

Keywords Markowitz model, fundamental portfolio, semi-variance, Mahalanobis distance
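
The difference between variance and semi-variance discussed in the abstract can be shown with a short sketch. It assumes the common definition in which only deviations below a target (by default the mean return) are squared and the sum is divided by the number of observations; other normalizations exist, and the function names and sample data are hypothetical.

```python
def variance(returns, target=None):
    """Mean squared deviation from the target (default: mean return)."""
    if target is None:
        target = sum(returns) / len(returns)
    return sum((r - target) ** 2 for r in returns) / len(returns)


def semi_variance(returns, target=None):
    """Mean squared deviation counting only returns below the target."""
    if target is None:
        target = sum(returns) / len(returns)
    return sum(min(r - target, 0.0) ** 2 for r in returns) / len(returns)


# Hypothetical period returns of one asset, evaluated against a zero target.
returns_demo = [0.04, -0.02, 0.01, -0.03]
var_demo = variance(returns_demo, target=0.0)
semi_demo = semi_variance(returns_demo, target=0.0)
```

Unlike the variance, the semi-variance of a portfolio is not a fixed quadratic form in the weights, because which periods count as downside depends on the weights themselves; this is one reason the minimization is harder.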


Sovereign Credit Spreads During the European Fiscal Crisis
Jonas Vogt, PhD Student
Fakultaet Statistik, Technische Universitaet Dortmund, 44221 Dortmund, Germany

Abstract. During the European financial crisis, strong correlations between the credit spreads of certain sovereigns were observed even though the respective economies are hardly connected. A possible explanation for this phenomenon might be that the increase in these spreads is induced not by an increase in the default probabilities themselves but by an increase in their implied variances. To analyze this hypothetical relation, we model the risk-neutral default probabilities implied in CDS spreads under both the risk-neutral and the historical measure, treating a default as the first jump of a Poisson process and the intensities as diffusion processes. By comparing the diffusion parameter estimates obtained under both measures, we want to see whether an increase in spreads can be explained by an increase in risk premiums for the default intensity variance (see Pan and Singleton, 2008). We make use of the characteristics of affine processes to transform the Feynman-Kac differential equations that result from the expectation terms in the common CDS pricing formula in order to solve for the implied intensities (see Duffie et al., 2000). Moreover, we suggest an iterative procedure that numerically solves for the implied intensities based on the parameters of the underlying diffusions and, in turn, estimates the diffusion parameters based on the obtained default intensities.

References DUFFIE, D., PAN, J. and SINGLETON, K. (2000): Transform Analysis and Asset Pricing for Affine Jump Diffusions. Econometrica, 68, 1343–1376. PAN, J. and SINGLETON, K. (2008): Default and Recovery Implicit in the Term Structure of Sovereign CDS Spreads. Journal of Finance, LXIII, 2345–2384.

Keywords CREDIT-RISK, REDUCED-FORM MODEL, AFFINE PROCESSES, MACROFINANCE, EUROPEAN FINANCIAL CRISIS


Part VIII

Machine Learning and Knowledge Discovery

Classification and definition of contextual vicinity from emotional words for sentiment analysis
Hyunsup Ahn, Markus Weinmann, and Christoph Lofi
TU-Braunschweig, Germany

Abstract. Much research has been carried out on collecting opinions from product reviews. Because of the large number of channels and messages on the web through which customers express their opinions and product-related emotions, manually selecting relevant and meaningful opinions is harder than ever. For this reason, methods that automatically assign a polarity from dichotomized categories (negative and positive words) have become generally accepted. In this paper we propose to build an emotion-relevant lexicon that indicates the intensity of emotional words and the contextual vicinity among them. Instead of manually classifying the word set, our approach adopts a self-verifying method based on the user rating, which is highly congruent with the overall opinion posted by that person. Our results therefore suggest that the common technique of extracting and summarizing people's emotional stances from the web as a corpus could be made more accurate by applying weighted scores for emotional words.

Keywords web mining, sentiment analysis, emotion in e-Business, product review


Using Conceptual Inductive Learning for Cooperative Query Answering
Maheen Bakhtyar1, Lena Wiese2, Katsumi Inoue3, and Nam Dang4

1 Asian Inst. of Technology, Bangkok, Thailand
2 University of Hildesheim, Hildesheim, Germany
3 National Inst. of Informatics, Tokyo, Japan
4 Tokyo Inst. of Technology, Tokyo, Japan

Abstract. A database system may not always be able to find correct answers for a query; such a query is called a failing query. Cooperative query answering systems produce informative answers for such queries. To obtain such informative answers we apply generalization operators that have long been studied in the area of Conceptual Inductive Learning (Michalski, 1983). In particular, Inoue and Wiese (2011) analyze the three generalization operators “Dropping Conditions”, “Anti-Instantiation” and “Goal Replacement”. We observed that some of the answers produced after query generalization are not related to what the user asked. Therefore, we extend the generalization operators by a mechanism that classifies answers into those related and those unrelated to the original query intention. We determine the similarity between the user query and the answers produced by means of a similarity function that acquires the semantics of the constants in the answer using WordNet (http://wordnet.princeton.edu). Only the answers classified as most related to the query are returned to the user.

References Inoue, K. and Wiese, L. (2011): Generalizing conjunctive queries for informative answers. In Proceedings of the 9th International Conference on Flexible Query Answering Systems. Lecture Notes in Artificial Intelligence, vol. 7022, pp. 1–12, Springer-Verlag. Michalski, Ryszard S. (1983): A Theory and Methodology of Inductive Learning. In Machine Learning: An Artificial Intelligence Approach, pp. 111–161, TIOGA Publishing.

Keywords INDUCTIVE CONCEPTUAL LEARNING, QUERY RELAXATION, SEMANTIC FILTERING
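
Of the three generalization operators named in the abstract, "Dropping Conditions" is the simplest to illustrate. The sketch below assumes a conjunctive query represented as a tuple of condition strings; the representation, the function name, and the example predicates are hypothetical and are not taken from the cited work.

```python
def drop_conditions(query):
    """Return every generalized query obtained by dropping exactly one
    condition from a conjunctive query (given as a tuple of conditions)."""
    return [query[:i] + query[i + 1:] for i in range(len(query))]


# Hypothetical failing query: patients who have flu AND a cough AND are treated.
failing_query = ("ill(x, flu)", "ill(x, cough)", "treated(x)")
generalized = drop_conditions(failing_query)
```

Each generalized query may return answers that drift away from the original intention, which is why the abstract adds a similarity-based filter on top of the operators.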


A Study of the Efficiency and Accuracy of Data Stream Clustering for Large Data Sets
Matthew Bolaños1∗, John Forrest2, and Michael Hahsler1

1 Southern Methodist University, Dallas, Texas, USA.
2 Microsoft, Redmond, Washington, USA.

Abstract. Identifying groups in large data sets is important for many machine learning and knowledge discovery applications. In recent years, data stream clustering algorithms have been proposed which can deal efficiently with potentially unbounded streams of data. Obviously, these algorithms can also be used for large nonstreaming data sets and, as such, present light-weight alternatives to conventional algorithms. The question is how accurate are results obtained via a data stream clustering algorithm compared to conventional clustering methods. To investigate this among other questions we have developed an R-extension package called stream which provides the experimental infrastructure for data stream mining and currently focuses on data stream clustering and cluster evaluation. Using this infrastructure we will systematically compare the results obtained via conventional clustering algorithms (ranging from k-means and hierarchical clustering to BIRCH) with data stream clustering algorithms (CluStream, DenStream, kNN) on a set of synthetic and real-world data from several domains. We will evaluate efficiency (runtime and memory requirements), accuracy (e.g., by purity and precision given known ground truth), as well as sensitivity to parameters and other choices like the algorithm used to recluster micro-clusters.

References AGGARWAL, C. (2007): Data Streams - Models and Algorithms. Springer. BOLAÑOS, M., FORREST, J. and HAHSLER, M. (2012): stream: Infrastructure for Data Streams, R package version 0.1-0, http://r-forge.r-project.org/projects/clusterds.

Keywords CLUSTERING, DATA STREAM CLUSTERING ALGORITHMS, EVALUATION

∗ Student author
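
Purity, one of the accuracy measures named in the abstract, can be computed as in the following minimal sketch. It is written in Python rather than with the R package described above, and the function name and toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    by_cluster = defaultdict(Counter)
    for c, t in zip(cluster_ids, true_labels):
        by_cluster[c][t] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


# Hypothetical cluster assignment and known ground truth for six points.
clusters_demo = [0, 0, 0, 1, 1, 1]
labels_demo = ["a", "a", "b", "b", "b", "a"]
purity_demo = purity(clusters_demo, labels_demo)
```

Purity grows with the number of clusters, so comparisons between algorithms are most meaningful when the number of clusters is similar.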


Feedback Prediction for Blogs
Krisztian Buza
Department of Computer Science and Information Theory, Budapest University of Technology and Economics, Hungary

Abstract. The last decade has seen an enormous growth in the importance of social media. While in the early days of social media, blogs, tweets, facebook, youtube, social tagging systems, etc. served more or less just as entertainment for a few enthusiastic users, nowadays news spreading over social media may govern the most important changes in our society, such as the revolutions in the Islamic world or US presidential elections. Due to the huge number of documents appearing in social media, there is an enormous need for the automatic analysis of such documents. One of the most important properties that distinguishes social media from classic media is the uncontrolled, dynamic and rapidly changing content: e.g. when a blog entry appears, users may immediately comment on it. In this work, we focus on the analysis of documents appearing in blogs. We present an industrial application which has the following major components: (i) the crawler, (ii) information extractors, (iii) data store and (iv) analytic components. The analytic components allow us to explore trends and to predict the number of feedbacks that a document is expected to receive in the next 24 hours. This task is related to opinion mining; however, despite its relevance, there are just a few works on predicting the number of feedbacks that a blog entry is expected to receive, see e.g. Yano and Smith (2010). In contrast to them, we target various topics (we do not focus on political blogs) and perform experiments with many different models. We hope that our observations will motivate research to improve classification and regression algorithms.

References YANO, T. and SMITH, N. A. (2010): What's Worthy of Comment? Content and Comment Volume in Political Blogs. 4th International AAAI Conference on Weblogs and Social Media, 359–362. MISHNE, G. (2007): Using Blog Properties to Improve Retrieval. International Conference on Weblogs and Social Media.

Keywords SOCIAL MEDIA, BLOGS, FEEDBACK PREDICTION


Label Ranking with Abstention: Learning to Predict Partial Orders
Weiwei Cheng1, Willem Waegeman2, Volkmar Welker1, and Eyke Hüllermeier1

1 Mathematics and Computer Science, Marburg University, 35032 Marburg, Germany, {cheng,eyke}@mathematik.uni-marburg.de
2 Department of Applied Mathematics, Biometrics and Process Control, Ghent University

Abstract. The prediction of structured outputs in general and rankings in particular has attracted considerable attention in machine learning in recent years, and different types of ranking problems have already been studied (Fürnkranz and Hüllermeier, 2011). Here, we propose a generalization or, say, relaxation of the standard setting of label ranking, allowing a model to make predictions in the form of partial instead of total orders. We interpret this kind of prediction as a ranking with partial abstention: If the model is not sufficiently certain regarding the relative order of two alternatives and, therefore, cannot reliably decide whether the former should precede the latter or the other way around, it may abstain from this decision and instead declare these alternatives as being incomparable. We propose a general approach to ranking with partial abstention as well as evaluation metrics for measuring the correctness and completeness of predictions. Moreover, we introduce a new method for learning to predict partial orders that improves on an existing approach (Cheng et al., 2010), both theoretically and empirically. Our method is based on the idea of thresholding the probabilities of pairwise preferences between labels as induced by a predicted (parameterized) probability distribution on the set of all rankings (Marden 1995).

References Fürnkranz, J., and Hüllermeier, E. Preference Learning. Springer-Verlag, 2011. Cheng, W., Rademaker, M., De Baets, B., and Hüllermeier, E. Predicting partial orders: Ranking with abstention. In Proc. ECML/PKDD–2010, pp. 215–230, Barcelona, Spain, 2010. Marden, J. Analyzing and Modeling Rank Data. Chapman and Hall, 1995.

Keywords label ranking, partial orders, Plackett-Luce model, Mallows model
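
The thresholding idea from the abstract can be sketched for the Plackett-Luce model listed in the keywords. Under that model, with positive label weights v, the marginal probability that label a precedes label b is v_a / (v_a + v_b). The code below is an illustrative sketch with hypothetical names and weights, not the authors' method: it keeps a pairwise preference only if its probability reaches a threshold above 0.5, and otherwise leaves the pair incomparable.

```python
def pairwise_prob(weights, a, b):
    """Plackett-Luce marginal probability that label a precedes label b."""
    return weights[a] / (weights[a] + weights[b])


def predict_partial_order(weights, threshold):
    """Set of pairs (a, b) meaning 'a precedes b'; pairs whose probability
    stays below the threshold in both directions remain incomparable."""
    labels = sorted(weights)
    return {
        (a, b)
        for a in labels
        for b in labels
        if a != b and pairwise_prob(weights, a, b) >= threshold
    }


# Hypothetical Plackett-Luce weights for three labels.
weights_demo = {"a": 6.0, "b": 3.0, "c": 2.5}
cautious_demo = predict_partial_order(weights_demo, threshold=0.6)
total_demo = predict_partial_order(weights_demo, threshold=0.51)
```

Raising the threshold trades completeness for correctness: fewer pairs are ordered, but each ordered pair is backed by a higher probability.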


On the relation of cluster stability and early classifiability of time series
István Dávid∗ and Krisztian Buza
Department of Computer Science and Information Theory, Budapest University of Technology and Economics, Hungary, {david,buza}@cs.bme.hu

Abstract. Although the classification of time series or sequences of observations is a well-studied topic, there are just a few works on their early classification. By early classifiability we mean the property that the class label (which refers to the entire time series) can often be predicted based on the first few observations (see e.g. Xing et al., 2009 and Xing et al., 2008). Some practical examples of early classification include fraud detection and applications in health care and network engineering (e.g. classification of TCP/IP packets based on the first few segments of data). In our study, we examine the relation of early classifiability to the early identification of clusters and to cluster (in)stability, which might be an indicator of concept drift. As the k-nearest neighbor algorithm (k-NN) with dynamic time warping (DTW) has become popular for time-series classification, we address the above question in the context of k-NN and k-medoids. Furthermore, we extend the concept of cluster stability introduced by Ackerman and Ben-David (2009) to time-series clustering.

References XING, Z., PEI, J., YU, P.S. (2009): Early Prediction on Time Series: A Nearest Neighbor Approach. Proc. of the Twenty-First Int. Joint Conf. on Artificial Intelligence (IJCAI-09). AAAI Press, Palo Alto, California, 1297-1302. ACKERMAN, M., BEN-DAVID, S. (2009): Clusterability: A Theoretical Study. Journal of Machine Learning Research: Workshop and Conf. Proc., 5, 1-8. XING, Z. et al. (2008): Mining sequence classifiers for early prediction. SDM’08 Proc. of the 2008 SIAM Int. Conf. on Data Mining, 644-655.

Keywords TIME-SERIES CLASSIFICATION, CLUSTER STABILITY, EARLY PREDICTION, CLUSTERING, ALGORITHMS

∗ Student
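
Early classification with nearest neighbours and dynamic time warping, as discussed in the abstract, can be illustrated by the sketch below. It assumes the textbook DTW recurrence with absolute differences as local cost and a 1-NN rule that compares the observed prefix with equally long prefixes of the training series; this prefix strategy, the function names, and the toy data are assumptions for illustration, not the authors' procedure.

```python
def dtw(a, b):
    """Dynamic time warping distance with absolute-difference local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = cost of the best warping path aligning a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


def classify_prefix(train, query_prefix):
    """1-NN label for a partially observed series, using training prefixes
    of the same length as the observed part."""
    length = len(query_prefix)
    nearest = min(train, key=lambda item: dtw(item[0][:length], query_prefix))
    return nearest[1]


# Hypothetical labelled training series and a query with two observations.
train_demo = [([0, 0, 1, 2, 3], "up"), ([3, 3, 2, 1, 0], "down")]
early_label_demo = classify_prefix(train_demo, [0, 1])
```

If the predicted label stays the same as more observations arrive, the series can be regarded as classifiable from that prefix length onward.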


Experimental Evaluation of Communication Efficient Distributed Classification in Peer-to-Peer Networks
Umer Khan1, Alexandros Nanopoulos2 and Lars Schmidt-Thieme1

1 University of Hildesheim, Information Systems and Machine Learning Lab, {khan, schmidt-thieme}@ismll.uni-hildesheim.de
2 University of Eichstätt, Ingolstadt, Germany

Abstract. Mining patterns from large-scale distributed networks, such as Peer-to-Peer (P2P) networks, is a challenging task because centralization of the data is not feasible. The goal is to develop mining algorithms that are communication efficient, scalable, asynchronous, and robust to peer dynamism, and that achieve accuracy as close as possible to that of centralized ones. In this paper, we present a detailed experimental evaluation of classification algorithms in a P2P framework. We focus on two variants of Support Vector Machines (SVM), namely Reduced-SVM (RSVM) (Lee et al. 2001) and Relevance Vector Machines (RVM) (Tipping 2001). RSVM are known for their ability to represent the whole data set using a very small subset of training instances (Ang et al. 2008). Based on a Bayesian probabilistic framework, RVM utilizes dramatically fewer kernel functions. Nevertheless, both RSVM and RVM provide very good generalization performance, which is comparable to standard SVM. Additionally, their ability to provide compact and accurate models makes them both efficient for classification in P2P networks, because it reduces the communication cost that results from the need to propagate local (i.e. within each peer) classification models to neighboring peers until all peers converge to a global model. We perform an extensive empirical comparison between RSVM and RVM, using several real data sets from the UCI repository. Our results provide useful conclusions about the suitability of RSVM and RVM for the task of classification in P2P networks, in terms of classification accuracy and communication overhead.

References Ang, Hock H. et al. (2008): Cascade RSVM in Peer-to-Peer Networks. ECML PKDD. Lee, Y. and Mangasarian, Olvi L. (2001): RSVM: Reduced Support Vector Machines. First SIAM International Conference on Data Mining, 5-7. Tipping, Michael E. (2001): Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research, 211-244.

Keywords DISTRIBUTED DATA MINING, RSVM, RVM, P2P NETWORKS

Framework for Storing and Processing Relational Entities in a Data Stream
Pawel Matuszyk∗
Otto-von-Guericke-University, Faculty of Computer Science, Magdeburg, Germany

Abstract. Conventional stream mining algorithms assume that every data instance can be seen only once in a stream [1]. Therefore, all data instances are considered statistically independent of each other. Consequently, this assumption causes a loss of information. This problem can be solved by modelling the data as reoccurring relational entities (e.g. customers in an online shop) [2]. In this article an efficient, multithreaded framework that can handle such entities is proposed. For this framework a new architecture consisting of four layers was developed. One of these layers is the cache layer, for which a new, tailored cache structure with nearly linear complexity is proposed. The framework and its components were evaluated using a self-implemented data generator that creates relational streams with changing speed and concept drift. The evaluation showed that the new framework reduces the computation time by up to 97.13%. A smoothing effect on the speed-ups of the stream was also observed. This is especially relevant for many application scenarios on the internet (e.g. recommender systems).

References
1. Guha, S.; Meyerson, A.; Mishra, N.; Motwani, R. & O’Callaghan, L. Clustering data streams: Theory and practice, IEEE Transactions on Knowledge and Data Engineering, IEEE Computer Society, 2003, 515-528
2. Siddiqui, Z.; Spiliopoulou, M.; Winslett, M. (Ed.) Combining Multiple Interrelated Streams for Incremental Clustering, Scientific and Statistical Database Management, Springer Berlin / Heidelberg, 2009, 5566, 535-552

Keywords Relational Stream Mining, Relational Entities Preparation, Mining Multiple Streams

∗ PhD student

Spectral clustering: interpretation and Gaussian parameter Sandrine Mouysset1, Joseph Noailles, Daniel Ruiz2, and Clovis Tauber3

1 University of Toulouse, IRIT-UPS, 118 route de Narbonne, 31062 Toulouse
2 University of Toulouse, IRIT-ENSEEIHT, 2 rue Camichel, 31071 Toulouse, {sandrine.mouysset,joseph.noailles,daniel.ruiz}@irit.fr
3 University of Tours, Hopital Bretonneau, 2 boulevard Tonnelle, 37044 Tours, [email protected]

Abstract. Spectral clustering creates, from the spectral elements of a Gaussian affinity matrix, a low-dimensional space in which data are grouped into clusters. This unsupervised method is mainly based on the Gaussian affinity measure, its parameter and its spectral elements. However, questions about the separability of clusters in the projection space and about the choice of the spectral parameters remain open. By going back to a continuous formulation in which clusters appear as disjoint subsets, we propose an interpretation of spectral clustering for a finite discrete data set via Partial Differential Equations and Finite Elements theory, which explains how spectral clustering works. This approach develops particular geometrical properties inherent to the eigenfunctions of a specific eigenvalue problem. This leads to a study showing the role of the Gaussian affinity parameter: this geometrical property is proved to be preserved asymptotically in the Gaussian parameter when looking at the eigenvectors of the spectral clustering algorithm. With numerical experiments, we show the efficiency of the spectral clustering method in retrieving groups from several geometrical examples and with various refinements. More precisely, we focus on the behaviour of the method with respect to this new theoretical material.
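To illustrate the role of the Gaussian parameter, the following minimal sketch builds a Gaussian affinity matrix and its symmetric normalization, which are the first steps of the algorithm of Ng, Jordan and Weiss (2002). The parametrization exp(-d²/(2σ²)), the function names and the toy points are assumptions made for illustration; the eigenvector computation and the final clustering step are omitted.

```python
import math

def gaussian_affinity(points, sigma):
    """Affinity A[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)), with a zero diagonal."""
    n = len(points)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                squared_distance = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                affinity[i][j] = math.exp(-squared_distance / (2.0 * sigma ** 2))
    return affinity

def normalize_affinity(affinity):
    """Return D^(-1/2) A D^(-1/2), where D is the diagonal matrix of row sums."""
    n = len(affinity)
    degrees = [sum(row) for row in affinity]
    normalized = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if degrees[i] > 0 and degrees[j] > 0:
                normalized[i][j] = affinity[i][j] / math.sqrt(degrees[i] * degrees[j])
    return normalized

# Two well-separated pairs of points (toy data, not from the paper).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
small_sigma = normalize_affinity(gaussian_affinity(points, sigma=1.0))
large_sigma = gaussian_affinity(points, sigma=100.0)
```

With a small σ the normalized matrix is close to block diagonal, so the two groups are separable; with a very large σ all affinities approach 1 and the block structure disappears.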

References NG, A.Y. and JORDAN, M.I. and WEISS, Y. (2002): On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 849–856. MOUYSSET, S. and NOAILLES, J. and RUIZ, D. (2010): On an interpretation of Spectral Clustering via Heat equation and Finite Elements theory. IAENG, 267–272.

Keywords Spectral Clustering, Gaussian Kernel, Heat equation, Eigenvalue problem.

gRecs: A collaborative filtering framework for group recommendations Eirini Ntoutsi1, Kostas Stefanidis2, Kjetil Nørvåg2, and Hans-Peter Kriegel1

1 Institute for Informatics, Ludwig-Maximilians University (LMU), Munich {ntoutsi,kriegel}@dbs.ifi.lmu.de
2 Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim {kstef,Kjetil.Norvag}@idi.ntnu.no

Abstract. Recommendation systems provide suggestions to users about a variety of items, such as movies and restaurants. The large majority of these systems are designed to make recommendations for individual users. However, there are cases in which the items to be suggested are intended for a group of users, e.g., a group of friends planning to watch a movie or visit a restaurant. Recent approaches try to satisfy the preferences of all group members either by creating a joint profile for the group and suggesting items w.r.t. this profile, or by aggregating the single-user recommendations into group recommendations [3]. We opt for the second approach, since it is more flexible and offers opportunities for efficiency improvements. We propose a framework for group recommendations following the collaborative filtering approach. The most prominent items for each user of the group are identified based on items that similar users liked in the past. We efficiently aggregate the single-user recommendations into group recommendations by leveraging the power of a top-k algorithm. We employ three different aggregation designs: least misery, where strong user preferences act as a veto; most optimistic, where the most satisfied member is the most influential one; and fair, for more democratic cases. The main bottleneck in collaborative filtering is locating the most similar users for a given user. We model the user-item interactions in terms of clustering and use the extracted clusters for predictions [1,2]. To deal with the high dimensionality and sparsity of ratings, we envision subspace clustering to find clusters of similar users and subsets of items for which these users have similar ratings.
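The three aggregation designs can be sketched as follows. This is only an illustration under assumptions: "fair" is taken to be the plain average, the scores are hypothetical predicted relevance values, and the ranking is a naive full scan rather than the efficient top-k algorithm mentioned in the abstract.

```python
def aggregate(scores, design):
    """Combine the per-user relevance scores of one item into a group score."""
    if design == "least_misery":      # the least satisfied member acts as a veto
        return min(scores)
    if design == "most_optimistic":   # the most satisfied member dominates
        return max(scores)
    if design == "fair":              # assumption: plain average over the members
        return sum(scores) / len(scores)
    raise ValueError(f"unknown aggregation design: {design}")

def group_top_k(user_scores, k, design):
    """Rank the items scored for every group member and return the k best."""
    common_items = set.intersection(*(set(s) for s in user_scores.values()))
    group_scores = {
        item: aggregate([scores[item] for scores in user_scores.values()], design)
        for item in common_items
    }
    # Sort by descending group score; ties are broken by item name.
    ranked = sorted(group_scores, key=lambda item: (-group_scores[item], item))
    return ranked[:k]

# Hypothetical predicted relevance scores for a group of two users.
user_scores = {
    "u1": {"a": 5, "b": 3, "c": 1},
    "u2": {"a": 1, "b": 3, "c": 4},
}
```

On this toy input, least misery prefers item "b", which nobody dislikes, while most optimistic prefers item "a", which one member rates very highly.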

References 1 NTOUTSI, E., STEFANIDIS, K., NORVAG, K. and KRIEGEL, H-P. (2012): Fast Group Recommendations by Applying User Clustering. In: ER. 2 NTOUTSI, E., STEFANIDIS, K., NORVAG, K. and KRIEGEL, H-P. (2012): gRecs: A Group Recommendation System based on User Clustering (demo paper). In: DASFAA. 3 ROY, S.B., AMER-YAHIA, S., CHAWLA, A., DAS, G. and YU, C. (2010): Space Efficiency in Group Recommendation. VLDBJ 19(6), 877–900.

Keywords GROUP RECOMMENDATIONS, COLLABORATIVE FILTERING, USER CLUSTERING


Symbolic cluster ensemble based on co-association matrix vs. noisy variables and outliers Pełka Marcin1 Wroclaw University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected]

Abstract. The ensemble approach, based on aggregating information provided by different models, has proved to be a very useful tool in supervised learning. The main goal is to increase the accuracy and stability of the final classification. Recently the same techniques have been applied to cluster analysis, where combining a set of different clusterings can yield a better solution. Ensemble clustering techniques may not be new, but their application to symbolic data is quite a new area. The article proposes the application of co-association based functions in cluster analysis of symbolic data, which tends to form not well separated clusters of many different shapes. In the empirical part, simulation results based on artificial data (containing noisy variables and/or outliers) are compared. In addition, ensemble clustering results for real data sets are shown. In both cases the ensemble clustering results are compared with the results of a single clustering method.
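As a small illustration of the co-association idea (Fred, 2001), the sketch below counts how often two objects fall into the same cluster across several partitions. The function name and the label vectors are hypothetical, and the sketch works on plain cluster labels rather than on symbolic data.

```python
def co_association(partitions):
    """Fraction of partitions in which objects i and j share a cluster."""
    n_objects = len(partitions[0])
    matrix = [[0.0] * n_objects for _ in range(n_objects)]
    for labels in partitions:
        for i in range(n_objects):
            for j in range(n_objects):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0
    n_partitions = len(partitions)
    return [[count / n_partitions for count in row] for row in matrix]

# Three hypothetical clusterings of four objects (one label per object).
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
matrix = co_association(partitions)
```

The resulting matrix can be treated as a similarity matrix and passed to any clustering method to obtain the consensus partition.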

References BOCK, H.-H., DIDAY, E. (Eds.) (2000): Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data. Springer Verlag, Berlin-Heidelberg. FRED, A.L.N. (2001): Finding consistent clusters in data partitions. In: J. Kittler, F. Roli (Eds.): Multiple Classifier Systems, Vol. 1857 of Lecture Notes in Computer Science. Springer-Verlag, Berlin-Heidelberg, 78–86. FRED, A.L.N., JAIN, A.K. (2005): Combining multiple clustering using evidence accumulation. IEEE Transaction on Pattern Analysis and Machine Intelligence, Vol. 27, 835–850.

Keywords SYMBOLIC DATA ANALYSIS, ENSEMBLE CLUSTERING, CO-ASSOCIATION MATRIX


Ensemble learning for density estimation Friedhelm Schwenker, Michael Glodek, Martin Schels University of Ulm, Institute of Neural Information Processing, 89069 Ulm [email protected]

Abstract. Estimation of probability density functions (PDFs) is a fundamental concept in statistics and machine learning and has various applications in pattern recognition. In this contribution, ensemble learning approaches are discussed in the context of density estimation. In particular, these methods are applied to two PDF estimation methods: kernel density estimation (the Parzen window approach) and Gaussian mixture models (GMM) (Fukunaga, 1990). The idea of ensemble learning is to combine a set of $L$ pre-trained models $g_1, \dots, g_L$ into an overall ensemble estimate $g$. Combining multiple models is a natural step to overcome shortcomings and problems appearing in the design of single models. Along with the design of the single models $g_i$, an aggregation mapping must be realized in order to achieve a final combined estimate. Usually this mapping has to be fixed a priori, but trainable fusion mappings can be applied as well. Examples of fixed fusion schemes are the median or the (weighted) average of the predicted models (Kuncheva, 2004); e.g., the weighted average is defined through $g_w(x) = \sum_{l=1}^{L} w_l g_l(x)$ with $w_l \ge 0$ and $\sum_{l=1}^{L} w_l = 1$. For example, weighted averaging of kernel density estimates leads to a representation with a new kernel function. The proposed ensemble PDF approach will be analyzed by statistical evaluations on benchmark data sets. The behavior of these algorithms in classification and cluster analysis applications will be presented as well.
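The weighted-average fusion can be sketched for kernel density estimates as follows. The Gaussian kernel, the bandwidths, the weights and the sample are assumptions chosen only for illustration; the point is that a convex combination of densities is again a density.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a Parzen window density estimate with a Gaussian kernel."""
    scale = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return scale * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
    return density

def weighted_average(models, weights):
    """Fixed fusion g_w(x) = sum_l w_l * g_l(x), with w_l >= 0 and sum_l w_l = 1."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    def ensemble(x):
        return sum(w * g(x) for w, g in zip(weights, models))
    return ensemble

# Two hypothetical ensemble members that differ only in their bandwidth.
sample = [0.0, 1.0, 2.0]
members = [gaussian_kde(sample, 0.5), gaussian_kde(sample, 1.0)]
ensemble = weighted_average(members, [0.25, 0.75])
```

Because the weights are non-negative and sum to one, the ensemble estimate still integrates to one, which the numerical check in the test below confirms on a fine grid.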

References KUNCHEVA, L. (2004): Combining pattern classifiers: Methods and algorithms, Wiley. FUKUNAGA, K. (1990): Introduction to statistical pattern recognition. Academic press.

Keywords ENSEMBLES, KERNEL DENSITY ESTIMATION, GAUSSIAN MIXTURE MODELS


An Analysis of Classifier Chains for Multi-Label Classification Robin Senge1, Jose Barranquero2, Juan José del Coz2, and Eyke Hüllermeier1

1 Mathematics and Computer Science, Marburg University, 35032 Marburg, Germany {senge,eyke}@mathematik.uni-marburg.de
2 Artificial Intelligence Center, University of Oviedo at Gijón, Campus de Viesques, 33204 Gijón, Spain [email protected]

Abstract. Multi-label classification (MLC) has attracted increasing attention in the machine learning community during the past few years. Apart from being interesting theoretically, this is largely due to its practical relevance in many domains, such as text classification and bioinformatics. The goal in MLC is to induce a model that assigns a subset of labels to each example, rather than a single one as in multiclass classification. In order to exploit dependencies between the labels, so-called classifier chains have been proposed as an appealing method for tackling the MLC task (Read et al., 2011). In addition to several empirical studies showing the method to be competitive with state-of-the-art methods, especially when used in its ensemble variant, there are also some first results on theoretical properties of classifier chains (Dembczyński et al., 2010). Continuing along this line, we analyze the influence of a potential pitfall of the learning process, namely the discrepancy between the feature spaces used in training and testing: while true class labels are used as supplementary attributes for training the binary models along the chain, the same models need to rely on estimations of these labels when making a prediction. We demonstrate under which circumstances the attribute noise thus created can affect the overall prediction performance. As a result of our findings, we propose two variants of classifier chains that are designed to overcome this problem. Experimentally, we show that these methods are indeed able to produce better results in cases where the original chaining process is likely to fail.
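The discrepancy described above is easy to see in a minimal sketch of the basic chaining scheme. The 1-nearest-neighbour base learner and the toy data are hypothetical stand-ins chosen to keep the example self-contained; the two variants proposed in the abstract are not shown.

```python
def nearest_neighbor_fit(X, y):
    """Hypothetical base learner: a 1-nearest-neighbour binary classifier."""
    data = list(zip(X, y))
    def predict(x):
        nearest = min(data, key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)))
        return nearest[1]
    return predict

def fit_chain(X, Y, fit=nearest_neighbor_fit):
    """Train one binary model per label, in chain order."""
    models = []
    for j in range(len(Y[0])):
        # Training: the features are extended with the TRUE labels 0..j-1.
        augmented = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
        targets = [y[j] for y in Y]
        models.append(fit(augmented, targets))
    return models

def predict_chain(models, x):
    """Predict all labels of one example along the chain."""
    labels = []
    for model in models:
        # Prediction: the features are extended with ESTIMATED labels,
        # so an early mistake is passed on as attribute noise.
        labels.append(model(list(x) + labels))
    return labels

# Toy data: one numeric feature, two binary labels.
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [[0, 0], [0, 1], [1, 1], [1, 0]]
chain = fit_chain(X, Y)
```

The only difference between the two functions is the source of the supplementary attributes, which is exactly where the attribute noise enters.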

References J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011. K. Dembczyński, W. Cheng, and E. Hüllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In ICML, pages 279–286, 2010.

Keywords multi-label classification, classifier chains, attribute noise


Sentiment analysis in the Twitter stream Alina Sinelnikova, Eirini Ntoutsi, and Hans-Peter Kriegel1 Institute for Informatics, Ludwig-Maximilians University (LMU), Munich [email protected], {ntoutsi, kriegel}@dbs.ifi.lmu.de

Abstract. Nowadays, more and more people publish their opinions online, so everyone can catch up on the thoughts of millions of people without even knowing them. In this way, consumers have enormous power to influence each other by sharing their brand experiences, whether positive or negative. Twitter is the most famous micro-blogging service and an opinion-rich resource that allows people to broadcast their opinions about politics, products, movies etc. in real time. With 200 million tweets generated on a daily basis, there is a need for opinion mining and sentiment analysis in order to help business analysts in the decision making process. In this work, we deal with the challenges posed by the Twitter stream, namely size, unbalanced classes and changing class distributions, as well as with the specific limitations of the Twitter language, namely colloquial language, tweet length and the difficult nature of the sentiment analysis problem due to the subjectivity of the tweets. For the study, we use a dataset of predefined topics from the Twitter API monitored over a period of three months. We experimented with a variety of classifiers such as Multinomial Naive Bayes, Adaptive Hoeffding Tree, Stochastic Gradient Descent, a hybrid Hoeffding Tree and Naive Bayes classifier, and ensembles of classifiers. For the evaluation, we used both holdout and prequential methods. As a forgetting mechanism we used a sliding window. We evaluated the different methods and also the impact of the different preprocessing steps. We implemented a sentiment analysis tool that connects our methods to the Twitter API and identifies and monitors the changes in the sentiment distribution of the current opinions regarding some user-defined topic.
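Prequential (test-then-train) evaluation with a sliding window as forgetting mechanism can be sketched as below. The small Multinomial Naive Bayes learner, its smoothing, the class name and the token lists are assumptions for illustration only; they are not the classifiers or the data used in the study.

```python
import math
from collections import Counter, defaultdict, deque

class SlidingWindowNB:
    """Multinomial Naive Bayes that only remembers the last window_size examples."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)

    def learn(self, tokens, label):
        self.window.append((tokens, label))
        self.class_counts[label] += 1
        self.word_counts[label].update(tokens)
        if len(self.window) > self.window_size:
            # Forgetting: remove the statistics of the oldest example.
            old_tokens, old_label = self.window.popleft()
            self.class_counts[old_label] -= 1
            self.word_counts[old_label].subtract(old_tokens)

    def predict(self, tokens):
        seen = [c for c, n in self.class_counts.items() if n > 0]
        if not seen:
            return None
        vocabulary = {w for c in seen for w, n in self.word_counts[c].items() if n > 0}
        total = sum(self.class_counts[c] for c in seen)

        def log_posterior(c):
            counts = self.word_counts[c]
            # Add-one smoothing, with one extra slot reserved for unseen words.
            denominator = sum(counts.values()) + len(vocabulary) + 1
            score = math.log(self.class_counts[c] / total)
            for word in tokens:
                score += math.log((counts[word] + 1) / denominator)
            return score

        return max(seen, key=log_posterior)

def prequential_accuracy(stream, model):
    """Test each instance first, then use it for training."""
    correct = tested = 0
    for tokens, label in stream:
        guess = model.predict(tokens)
        if guess is not None:
            tested += 1
            correct += int(guess == label)
        model.learn(tokens, label)
    return correct / tested if tested else 0.0
```

Every instance is used for testing before the model has seen its label, so no separate holdout set is needed, and the window bounds how long old class distributions influence the predictions.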

References PANG, B. and LEE, L. (2008): Opinion Mining and Sentiment Analysis. Found. Trends Inf. Retr., 2 1554-0669. BIFET, A. and Holmes, G. and Pfahringer, B. (2011): MOA-TweetReader: Real-Time Analysis in Twitter Streaming Data. In: T. Elomaa, J. Hollmén and H. Mannila (Eds.): Discovery Science. Springer, 46–60.

Keywords DATA STREAMS, SENTIMENT ANALYSIS, TWITTER.


Recommendations in Time Evolving Multi-modal Social Networks Panagiotis Symeonidis Department of Informatics, Aristotle University of Thessaloniki [email protected]

Abstract. Online social networking sites (OSNs), such as Facebook and LinkedIn, have attracted huge attention since the widespread adoption of Web 2.0 technology. These systems contain gigabytes of data which can be mined and used for making personalized predictions and recommendations of products, users and digital content. In particular, OSNs collect information from users’ social contacts and other interactions, build an interconnected multi-modal social network, and suggest products or even people to users based on their common friends, common commenting on written posts, etc. People often belong to multiple explicit or implicit social networks because of different interpersonal interactions. For example, on Facebook, people add each other as friends, constructing a large unipartite friendship network. However, besides the explicit friendship relations between the users, there are also other, implicit relations. For example, users can co-comment on the posts written by their friends, co-rate products, and co-like a user’s photo. In this paper, we study (i) methods that combine information derived from heterogeneous explicit or implicit social networks and (ii) the evolution of user preferences over time. These two aspects of OSNs result in better personalized recommendations of users, products and services.

References SYMEONIDIS, P. and TIAKAS, E. and MANOLOPOULOS, Y. (2011): Product recommendation and rating prediction based on multi-modal social networks. In: Proceedings of the fifth ACM conference on Recommender systems (RecSys 2011), ACM, Chicago, 61–68. SIDDIQUI, Z.F. and SPILIOPOULOU, M. and SYMEONIDIS, P. and TIAKAS, E. (2011): A Data Generator for Multi-Stream Data. In: Proceedings of the second International Workshop on Mining Ubiquitous and Social Environments (MUSE 2011), Athens, 63–68.

Keywords Social Networks, Data Streams, Recommender Systems


A Lightweight CVFDT Classifier for Streams with Concept Drift Miriam Tödten∗, Zaigham Faraz Siddiqui†, and Myra Spiliopoulou Otto-von-Guericke University Magdeburg, 39106 Magdeburg, Germany {toedten@mail,siddiqui@iti,myra@iti}.cs.uni-magdeburg.de

Abstract. We have investigated the induction of decision trees over concept-drifting data streams. Other approaches, based on the concept-adapting CVFDT [HULTEN et al], maintain alternate subtrees if there is sufficient statistical evidence for another test attribute in some decision node. Our learner instead replaces the subtree in question by a leaf with a Naïve Bayes classifier if there is no longer sufficient evidence for the test attribute currently selected. This approach is based on the ability of Naïve Bayes leaves to improve the any-time property of Hoeffding trees [Gama et al]. Since leaves of Hoeffding trees evolve by learning subsequent training examples, the new Naïve Bayes leaf will be grown into a new subtree that reflects the new target concept. The results of the evaluation show that our lightweight approach can react quickly to concept drift. For evaluation purposes, we suggest an approach that differs from the established test scenario for supervised learning tasks on data streams. Since the classification performance of a classifier depends on the speed at which query examples arrive, we distinguish between training and test examples and simulate test streams of different speeds. Experiments show that a learner that reacts quickly is beneficial if the test/query instances arrive in a fast data stream.
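"Sufficient statistical evidence" in Hoeffding trees is usually quantified with the Hoeffding bound. The sketch below shows the bound and the standard split test of the VFDT family; how exactly the lightweight learner decides that evidence is no longer sufficient is not stated in the abstract, so the function names and the use of this test for that purpose are assumptions.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1 / delta) / (2 * n)) after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def sufficient_evidence(gain_best, gain_second, value_range, delta, n):
    """True if the best attribute beats the runner-up by more than epsilon."""
    return gain_best - gain_second > hoeffding_bound(value_range, delta, n)

# With 1000 examples and delta = 0.05, a gain difference of 0.4 is decisive,
# whereas a difference of 0.01 is not.
clear_winner = sufficient_evidence(0.50, 0.10, 1.0, 0.05, 1000)
too_close = sufficient_evidence(0.11, 0.10, 1.0, 0.05, 1000)
```

Under this reading, a decision node whose current test attribute stops passing such a test would be a candidate for replacement by a Naïve Bayes leaf.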

References HULTEN, G., SPENCER, L., and DOMINGOS, P. (2001): Mining Time-Changing Data Streams. Proc. of KDD 2001, ACM Press. GAMA, J., ROCHA, R., and MEDAS, P. (2003): Accurate Decision Trees for Mining High-Speed Data Streams. Proc. of KDD 2003, ACM Press.

Keywords Decision Tree Stream Classifier, Stream Classification, Concept Drift, Streams

∗ The first author is a student of the Master's programme 'Data & Knowledge Engineering'.

† Work of the second author was partially funded by the German Research Foundation project SP 572/11-1 "IMPRINT: Incremental Mining for Perennial Objects".

Statistical Comparison of Classifiers for Multi-Objective Feature Selection in Instrument Recognition Igor Vatolkin1, Bernd Bischl2, Günter Rudolph1, and Claus Weihs2

1 TU Dortmund, Chair of Algorithm Engineering {igor.vatolkin;guenter.rudolph}@tu-dortmund.de
2 TU Dortmund, Chair of Computational Statistics {bernd.bischl;claus.weihs}@tu-dortmund.de

Abstract. Instrument identification is one of the most challenging tasks in Music Information Retrieval. With an increasing number of simultaneously playing sources it becomes harder to distinguish between their spectral fractions, which are built from fundamental frequencies, overtones, non-harmonic and resonant components. The intensity of these characteristics also varies over time and is often divided into attack, decay, sustain and release stages. A vast number of different features are potentially available for instrument classification, and it is still an open question which of them perform best. It is also acceptable, to a certain degree, to trade off prediction accuracy against a computationally simpler model with fewer features. Because this trade-off can in general not be specified a priori, we employ a multi-objective feature selection approach regarding instrument recognition in polyphonic mixtures and the number of features in the model. The performance of several classifiers and their impact on the Pareto front are compared by means of statistical tests.
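The trade-off between the two objectives can be made concrete with a small Pareto front computation. The candidate solutions below are hypothetical (classification error, number of features) pairs, both to be minimized; the evolutionary search that would produce such candidates is not shown.

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Keep the solutions that are not dominated by any other solution."""
    return [s for s in solutions if not any(dominates(other, s) for other in solutions)]

# Hypothetical feature subsets, given as (classification error, number of features).
candidates = [(0.10, 50), (0.12, 20), (0.12, 30), (0.20, 5), (0.25, 5)]
front = pareto_front(candidates)
```

Each remaining point is a different compromise: lowering the error further is only possible by accepting more features, and vice versa.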

References BEUME, N., NAUJOKS, B. and EMMERICH, M. (2007): Sms-emoa: Multiobjective selection based on dominated hypervolume. European Journal of Operational Research, 181(3), 1653– 1669. GUYON, I., GUNN, S., NIKRAVESH, M. and ZADEH, L. (Eds.) (2006): Feature Extraction, Foundations and Applications. Springer, Berlin - Heidelberg. VATOLKIN, I., PREUß, M. and RUDOLPH, G. (2011): Multi-Objective Feature Selection in Music Genre and Style Recognition Tasks. In: N. Krasnogor and P.L. Lanzi (Eds.): Proceedings of the 2011 Genetic and Evolutionary Computation Conference (GECCO). ACM Press, New York, 411–418.

Keywords INSTRUMENT RECOGNITION, FEATURE SELECTION


Group-Based Ant Colony Optimization Gunnar Völkel1, Uwe Schöning1, and Hans A. Kestler2

1 Institute of Theoretical Computer Science, University of Ulm, [email protected] (PhD student), [email protected]
2 Institute of Neural Information Processing, University of Ulm, [email protected]

Abstract. Ant Colony Optimization (ACO) is a metaheuristic for combinatorial optimization problems. The main idea of ACO is that in each iteration a fixed number of solutions is constructed probabilistically based on a pheromone matrix which evolves between the iterations. In general a solution consists of a sequence of solution components. For problems like the Traveling Salesman Problem (TSP), the linear solution encoding of ACO as a sequence of components works well, since a sequence of customers is a natural representation of the visiting order of those customers. The solutions of the Capacitated Vehicle Routing Problem (CVRP), a descendant of the TSP, usually consist of more than one route. A linear solution encoding for the CVRP has to consist of the component sequences of the individual routes interleaved with some end-of-route component. This is not a natural encoding, because it favors one route over another whereas the problem does not state such a preference. Generally, this applies to problems with a solution that is sub-structured into independent groups of components. We propose Group-Based Ant Colony Optimization (GBACO), whose solution encoding is sub-structured into groups, each consisting of a sequence of components. The modified construction procedure selects one pair of group and component probabilistically and adds the selected component to the selected group. First experiments comparing ACO and GBACO on the commonly used Solomon benchmark instances (VRP with Time Windows) are presented.
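The modified construction procedure can be sketched as follows. This is a strongly simplified illustration: the pheromone is assumed to be stored per (group, component) pair, heuristic information and feasibility checks such as vehicle capacity are omitted, and all names are hypothetical.

```python
import random

def construct_solution(components, n_groups, pheromone, rng):
    """Build one solution by repeatedly drawing a (group, component) pair."""
    groups = [[] for _ in range(n_groups)]
    remaining = list(components)
    while remaining:
        pairs = [(g, c) for g in range(n_groups) for c in remaining]
        # Assumption: unknown pairs get a default pheromone value of 1.0.
        weights = [pheromone.get(pair, 1.0) for pair in pairs]
        group, component = rng.choices(pairs, weights=weights, k=1)[0]
        groups[group].append(component)  # append to the selected group's sequence
        remaining.remove(component)
    return groups

customers = ["c1", "c2", "c3", "c4"]
routes = construct_solution(customers, n_groups=2, pheromone={}, rng=random.Random(0))
```

No group is filled before the others, so the encoding does not favor one route over another.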

References DORIGO, M. and STÜTZLE, T. (2009): Ant Colony Optimization: Overview and Recent Advances. Techreport, IRIDIA, Université Libre de Bruxelles. GAMBARDELLA, L. M. and TAILLARD, E. and AGAZZI G. (1999): MACS-VRPTW - A Multiple Colony System For Vehicle Routing Problems With Time Windows. In: Corne, D. and Dorigo, M. et al. (Eds.): New Ideas in Optimization. McGraw-Hill, Maidenhead, England, 63–76.

Keywords GROUP-BASED ACO, VEHICLE ROUTING PROBLEM

The Dark Side of Marketing Communication: Grouping Consumers with Respect to Their Reactance Behavior Ralf Wagner SVI-Endowed Chair for International Direct Marketing DMCC- Dialog Marketing Competence Center University of Kassel, Germany [email protected]

Abstract. Marketing practitioners as well as researchers are fascinated by the new opportunities of communicating to and with customers using social media and sophisticated mobile devices (e.g., Wagner, 2011). However, consumers’ dialogue competence as well as their disposition are rarely challenged. In this study the recipients’ reactance (Brehm & Brehm, 1981) towards marketing communication is quantified by means of a Rasch model. This probabilistic test theory approach makes it possible to compute individual scores for the recipients. These scores provide the data basis for grouping the recipients into clusters with similar reactance behavior. Taking advantage of the Rasch framework’s invariance of comparisons (Salzberger, 1999), we reveal patterns of unfavorable marketing communication consequences in different cultures.
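For readers unfamiliar with the Rasch model, the following sketch shows the dichotomous form, in which the probability of agreeing with an item depends only on the difference between a person parameter and an item parameter. Whether the study uses the dichotomous or a polytomous variant is not stated in the abstract, so this form and the parameter names are assumptions.

```python
import math

def rasch_probability(theta, difficulty):
    """P(X = 1) = exp(theta - b) / (1 + exp(theta - b)) for person theta and item b."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def log_odds(probability):
    return math.log(probability / (1.0 - probability))

# The comparison of two persons does not depend on the item used:
# the difference of their log-odds equals the difference of their parameters.
easy_item = log_odds(rasch_probability(1.5, -1.0)) - log_odds(rasch_probability(0.5, -1.0))
hard_item = log_odds(rasch_probability(1.5, 2.0)) - log_odds(rasch_probability(0.5, 2.0))
```

This item-independent comparison is the invariance property that the abstract relies on when comparing groups across cultures.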

References ANDRICH, D. (2002): Understanding Resistance to the Data-Model Relationship in Rasch’s Paradigm: A Reflection for the Next Generation. Journal of Applied Measurement, 3, 325–359. BREHM, S. S. and BREHM, J. W. (1981): Psychological Reactance: A Theory of Freedom and Control. Academic Press, New York. SALZBERGER, T. (1999): Interkulturelle Marktforschung - Methoden zur Überprüfung der Datenäquivalenz. Service, Wien. WAGNER, R. (2011): Neue Medien im Kundendialog - Ein Überblick zu den Kommunikationsdiensten des Web 2.0. In: W. Lietzau, J. Bender and T. Richter (Eds.): Praxishandbuch Social Media in Verbänden Grundlagen - Praxiswissen - Fallbeispiele. DGVM, Bonn, 90–104.

Keywords INVARIANCE, MARKETING COMMUNICATION, REACTANCE


Evaluating Tag Similarity Measures by Clustering Bibsonomy Tags Christian Wartena1 and Rogier Brussee2

1 Hochschule Hannover, Expo Plaza 12, 30539 Hannover, Germany. [email protected]
2 Univ. of Applied Sciences Utrecht, Crossmedialab, PO Box 8611, 3503 RP Utrecht, The Netherlands. [email protected]

Abstract. Most approaches to determining the semantic relatedness of collaborative tags have been based on direct co-occurrence of tags (see Markines et al. (2009) for an overview). In Wartena and Brussee (2008) we studied a similarity measure based on comparison of the contexts in which tags occur, and we showed that this so-called second-order co-occurrence outperforms similarity measures based on direct co-occurrence in an ontology alignment task. Here we add evidence for the superiority of second-order co-occurrence by showing that the second-order co-occurrence similarity measure is also superior in a tag clustering task. The topical coherence of clustering results largely depends on the quality of the distance measure used by the clustering algorithm. Thus, the effectiveness of clustering can be used as a method to evaluate distance measures for semantic relevance. For an empirical evaluation we used a dump with 2.6 million tag assignments from Bibsonomy, a tagging service for scientific papers. We chose 12 scientific disciplines and their main journals, and selected the 215 most typical tags. We then clustered the tags using distance measures based on Jaccard coefficients, on cosine similarity and on the above-mentioned second-order co-occurrence. We evaluated the results against the predefined clustering by scientific discipline using F-scores and cluster purity. Clustering based on the second-order co-occurrence similarity measure consistently performed better, for both F-score and purity, than clustering using cosine similarity and the Jaccard coefficient.
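The difference between direct and second-order co-occurrence can be shown on a toy example. The sketch below uses the Jaccard coefficient on the sets of posts carrying each tag as the direct measure and, as a simple stand-in for the authors' second-order measure, the cosine between the tags' co-occurrence vectors; the posts and function names are hypothetical.

```python
import math
from collections import Counter

def tag_documents(posts):
    """Map each tag to the set of posts (by index) it is assigned to."""
    index = {}
    for post_id, tags in enumerate(posts):
        for tag in tags:
            index.setdefault(tag, set()).add(post_id)
    return index

def jaccard(a, b):
    """Direct co-occurrence: overlap of the two sets of posts."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cooccurrence_vector(tag, posts):
    """Context of a tag: counts of the other tags it appears together with."""
    context = Counter()
    for tags in posts:
        if tag in tags:
            context.update(t for t in tags if t != tag)
    return context

def cosine(u, v):
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy posts: "ml" and "learning" never occur together but share their contexts.
posts = [{"ml", "svm"}, {"ml", "kernel"}, {"learning", "svm"}, {"learning", "kernel"}]
documents = tag_documents(posts)
first_order = jaccard(documents["ml"], documents["learning"])
second_order = cosine(cooccurrence_vector("ml", posts), cooccurrence_vector("learning", posts))
```

Here the direct measure is 0 although the two tags are used in identical contexts, which the second-order comparison detects.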

References MARKINES, B., CATTUTO, C., MENCZER, F. and BENZ, D., HOTHO, A., STUMME, G (2009): Evaluating similarity measures for emergent semantics of social tagging, in: J. Quada et al. (Eds.): Proceedings of the 18th international conference on World Wide Web. ACM, 641–650. WARTENA, C. and BRUSSEE, R. (2008): Instance-based mapping between thesauri and folksonomies, in: A. Seth et al. (Eds.): Proceedings of the 7th International Conference on The Semantic Web. Springer, 356–370.

Keywords TAGGING, CO-OCCURRENCE, SIMILARITY, CLUSTERING


Applying Leaders Driven Community Detection Algorithms to Data Clustering Zied Yakoubi and Rushed Kanawati LIPN CNRS UMR 7030 University Paris Nord, Villetaneuse, France [email protected]

Abstract. Leaders driven community detection algorithms (LDA hereafter) constitute a new trend in devising algorithms for community detection in complex networks. Unlike most existing community detection approaches, LDA algorithms are not guided by the optimization of an objective function such as the modularity. In this work, we show that LDA approaches can also be efficiently applied to data clustering. The clustering approach is organized into two steps. First, the input dataset is processed to generate a complex network. This is achieved by constructing the relative neighborhood graph using a similarity matrix induced by a given distance function defined over the dataset points (Toussaint (1980)). Then, we apply a community detection algorithm to the resulting graph in order to obtain the different clusters. We show through experiments on different classical clustering benchmark datasets that applying our LDA algorithm, called LICOD (Kanawati (2011)), provides better clustering results, evaluated in terms of purity and Rand index, than modularity optimization algorithms (Blondel et al. (2008)), label propagation community detection approaches (Raghavan et al., 2007) and the K-means algorithm.
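The first step, building the relative neighborhood graph, can be sketched directly from its definition (Toussaint, 1980): two points are linked unless some third point is closer to both of them than they are to each other. The Euclidean distance and the toy points are assumptions; the community detection step (LICOD) is not shown.

```python
import math
from itertools import combinations

def relative_neighborhood_graph(points):
    """Return the edges (i, j), i < j, of the relative neighborhood graph."""
    n = len(points)

    def dist(i, j):
        return math.dist(points[i], points[j])

    edges = []
    for i, j in combinations(range(n), 2):
        d_ij = dist(i, j)
        # (i, j) is an edge unless a third point r lies closer to both endpoints.
        blocked = any(
            max(dist(i, r), dist(j, r)) < d_ij
            for r in range(n) if r != i and r != j
        )
        if not blocked:
            edges.append((i, j))
    return edges

# Four points on a line: only consecutive points remain connected.
edges = relative_neighborhood_graph([(0, 0), (1, 0), (2, 0), (10, 0)])
```

This brute-force version needs cubic time in the number of points, which is acceptable for a sketch but not for large datasets.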

References BLONDEL V. D., GUILLAUME J-L., LAMBIOTTE R., LEFEBVRE E. (2008): Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment, 1742-5468. KANAWATI, R. (2011): LICOD: Leaders Identification for Community Detection in Complex Networks. In IEEE SocialCom’11, Boston, MA, 577-582. RAGHAVAN U. N., ALBERT R. and KUMARA S. (2007): Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E, vol. 76, p. 036106. TOUSSAINT G. T. (1980): The Relative Neighbourhood Graph of a Finite Planar Set. Pattern Recognition (PR) 12(4):261-268

Keywords DATA CLUSTERING, COMMUNITY DETECTION.


Part IX

Interdisciplinary Domains

Onset detection using an auditory model Bauer, Nadja, Friedrichs, Klaus, Schiffner, Julia, and Weihs, Claus Chair of Computational Statistics, Faculty of Statistics, TU Dortmund {bauer,friedrichs,schiffner,weihs}@statistik.tu-dortmund.de

Abstract. Onset detection is an important step for music transcription and other applications like timbre or meter analysis. Although several approaches have been developed for this task, none of them works well under all circumstances. In our work, we use a simple algorithm proposed by Bauer et al. (2010), which is based on calculating the correlation index between the spectra of neighboring signal windows. In Bauer et al. (2012) this algorithm was tested on a special data set of tone sequences, which are composed of recorded tones from real musical instruments. This data set was generated by using an experimental design in which music tempo and the set of instruments are considered as control variables. In particular, this makes it possible to measure the influence of each instrument on the onset detection rate. In this work, the onset detection algorithm is extended by a computational model of the human auditory periphery. Instead of the original signal, the spectral analysis is evaluated on the outputs of the simulated auditory nerve fibres. The extension of this simple algorithm with an auditory model leads to an essential improvement of the onset detection rate compared to previous results. The main challenge here is combining the outputs of all auditory nerve fibres into one feature for onset detection. Different approaches are presented and compared.
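The basic idea of comparing the spectra of neighboring windows can be sketched as follows. The non-overlapping windows, the naive discrete Fourier transform, the Pearson correlation and the fixed threshold are all assumptions made to keep the example short; they are not the exact procedure of Bauer et al. (2010), and the auditory model is not included.

```python
import cmath
import math

def magnitude_spectrum(window):
    """Magnitudes of the first half of a naive discrete Fourier transform."""
    n = len(window)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(window)))
        for k in range(n // 2)
    ]

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Two flat spectra count as identical, a flat and a non-flat one as different.
        return 1.0 if var_a == var_b else 0.0
    return covariance / math.sqrt(var_a * var_b)

def detect_onsets(signal, window_size, threshold):
    """Report the start of every window whose spectrum differs from the previous one."""
    starts = range(0, len(signal) - window_size + 1, window_size)
    spectra = [magnitude_spectrum(signal[s:s + window_size]) for s in starts]
    return [
        i * window_size
        for i in range(1, len(spectra))
        if pearson(spectra[i - 1], spectra[i]) < threshold
    ]

# Synthetic signal: the tone changes after 64 samples.
signal = [math.sin(2 * math.pi * 4 * t / 32) for t in range(64)]
signal += [math.sin(2 * math.pi * 10 * t / 32) for t in range(64)]
onsets = detect_onsets(signal, window_size=32, threshold=0.5)
```

In the extended algorithm the same comparison would be applied to the outputs of the simulated auditory nerve fibres instead of the raw signal.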

References BAUER, N., SCHIFFNER, J., WEIHS, C. (2010): Einsatzzeiterkennung bei polyphonen Musikzeitreihen. SFB 823 Discussion Paper 22/2010, TU Dortmund. BAUER, N., SCHIFFNER, J., WEIHS, C. (2012): Einfluss der Musikinstrumente auf die Güte der Einsatzeiterkennung. SFB 823 Discussion Paper 10/2012, TU Dortmund.

Keywords AUDITORY MODEL, DESIGN OF EXPERIMENTS, ONSET DETECTION

Computational Aspects of Natural Languages’ Similarities Andreea Beica1∗ and Liviu P. Dinu2

1 University of Bucharest, Faculty of Mathematics and Computer Science, 14 Academiei, Bucharest, Romania [email protected]
2 University of Bucharest, Faculty of Mathematics and Computer Science, 14 Academiei, Bucharest, Romania [email protected]

Abstract. Natural languages can be classified into families, i.e. groups of languages related through descent from a common ancestor, called the proto-language of that family. Establishing language families is equivalent to the construction of phylogenetic language trees. Moreover, words are the core of any language, and cognates are words that have a common etymological origin. Therefore, cognate identification, alongside phylogenetic inference (which aims to determine the existing genetic relationships between languages), represents the basis for discovering the evolutionary history of languages. In this thesis we have designed a system that uses different distances (like the rank distance [1] or the Hamming and alphabet-weight edit distances) to measure string similarity, and we have applied the system to the task of phylogenetic inference. We initially worked on a relatively small corpus, consisting of 200-word Swadesh lists, one for each of the 11 languages we analyse. We then extended our work to significantly larger parallel corpora: George Orwell’s novel ‘1984’, translated into 8 of the 11 languages. We conducted our study using both the phonetic transcription and the Latin-alphabet form of our corpora. We used both a dataset consisting of syllables and one consisting of whole words. When applied to the Indo-European language family, our method estimated phylogenies that were compatible with the benchmark tree and correctly reproduced the established major language groups present in the dataset; thus, our results confirmed the linguistic theories, supporting the correctness of our approach.
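How string distances turn word lists into a language distance can be sketched as follows. The sketch uses the Hamming distance and a plain unit-cost edit (Levenshtein) distance; the rank distance and the alphabet-weighted costs used in the thesis are not reproduced, and the three tiny word lists are illustrative examples, not Swadesh data.

```python
def hamming(a, b):
    """Number of differing positions; defined for equal-length strings only."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Unit-cost edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute or match
            ))
        previous = current
    return previous[-1]

def language_distance(words_a, words_b):
    """Average length-normalized edit distance over aligned word lists."""
    pairs = list(zip(words_a, words_b))
    return sum(levenshtein(a, b) / max(len(a), len(b)) for a, b in pairs) / len(pairs)

# Illustrative two-word lists for the meanings "night" and "water".
word_lists = {
    "en": ["night", "water"],
    "de": ["nacht", "wasser"],
    "fr": ["nuit", "eau"],
}
english_german = language_distance(word_lists["en"], word_lists["de"])
english_french = language_distance(word_lists["en"], word_lists["fr"])
```

A matrix of such pairwise distances can then be passed to a standard tree-building method to estimate a phylogeny.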

References [1] DINU, L.P.: On the classification and aggregation of hierarchies with different constitutive elements. Fundamenta Informaticae, 55, 1, 39-50, 2003. [2] DINU, L.P.: Rank Distance with Applications in Similarity of Natural Languages, Fundamenta Informaticae 65 (2005) 1-15

Keywords PHYLOGENETIC INFERENCE, LANGUAGE SIMILARITY, LANGUAGE FAMILIES, PHYLOGENIES

∗ Final year undergraduate student


A Unifying Framework for GPR Image Reconstruction

Andre Busche, Ruth Janning, Tomáš Horváth, and Lars Schmidt-Thieme

University of Hildesheim, Information Systems and Machine Learning Lab {busche,janning,horvath,schmidt-thieme}@ismll.uni-hildesheim.de

Abstract. Ground Penetrating Radar (GPR) is a widely used technique for detecting buried objects in subsoil. Exact localization of buried objects is required, e.g., during environmental reconstruction works, both to accelerate the overall process and to reduce overall costs. Radar measurements are usually visualized as images, so-called radargrams, that contain certain geometric shapes to be identified. This paper introduces a component-based image reconstruction framework for the recognition process, based on a pixelwise image decomposition at position (x, y):

$$I(x, y) = \bigoplus_{k}^{K} f_k(\theta_k, x, y) + \sum_{l}^{K} g_l(x, y) \qquad (1)$$

We assume an image to be generated from k base component models f_k, each individually parameterized through θ_k, e.g., an image representation from an FDTD-based simulation or a pattern extracted from a training dataset. These component models are aggregated through an operator ⊕, e.g., a summation ∑ in the case of a first-order Born approximation on simulated radargrams, or more complex convolutional operations. Integrating l different noise components g_l allows different noise types to be captured for better estimation and artifact suppression. We present initial experimental results on a simple instantiation of this conceptual model using primitive object shapes, as a first step towards a pluggable, robust image reconstruction mechanism for GPR data.
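The decomposition in Eq. (1) can be sketched for the simplest case, in which the aggregation operator is a plain summation. Everything below is an assumption made for illustration: the hyperbola-shaped component, the constant noise term, and all function and parameter names are hypothetical and are not taken from the authors' framework.

```python
import math
from typing import Callable, List, Sequence

# A component maps a pixel position (x, y) to an intensity contribution.
Component = Callable[[int, int], float]


def hyperbola_component(x0: float, y0: float, spread: float, amplitude: float = 1.0) -> Component:
    """Toy component model f_k: a thin hyperbola with its apex at (x0, y0)."""
    def f(x: int, y: int) -> float:
        curve_y = y0 * math.sqrt(1.0 + ((x - x0) / spread) ** 2)
        return amplitude if abs(y - curve_y) < 0.5 else 0.0
    return f


def constant_noise(level: float) -> Component:
    """Toy noise component g_l: a constant background offset."""
    return lambda x, y: level


def compose_image(
    components: Sequence[Component],
    noises: Sequence[Component],
    width: int,
    height: int,
) -> List[List[float]]:
    """Evaluate I(x, y) pixel by pixel, aggregating the components by summation."""
    return [
        [sum(f(x, y) for f in components) + sum(g(x, y) for g in noises) for x in range(width)]
        for y in range(height)
    ]


image = compose_image(
    components=[hyperbola_component(x0=5, y0=2, spread=2.0)],
    noises=[constant_noise(0.1)],
    width=11,
    height=8,
)
```

Swapping the inner `sum` over `components` for another aggregation would correspond to choosing a different operator in Eq. (1).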

References H. Chen and A.G. Cohn, Probabilistic robust hyperbola mixture model for interpreting ground penetrating radar data, IJCNN IEEE, 2010, 1-8. F. Yaman, Location and shape reconstructions of sound-soft obstacles in penetrable cylinders, Inverse Problems, 2009, 1-17.

Keywords GPR, Image Reconstruction, Inverse Problem, Unifying Framework, template models

Evaluating Similarity Measures for Plagiarism Detection in Melody Transcriptions

Christian Dittmar1, Daniel Gärtner1, Kay F. Hildebrand2, and Florian Müller3

1 Fraunhofer IDMT, Department Metadata, Ilmenau, Germany dmr|[email protected]
2 European Research Center for Information Systems (ERCIS), Münster [email protected]
3 zeb/information.technology [email protected]

Abstract. Plagiarism in the area of music is a problem that consumes a lot of resources. Lawsuits prosecuting acoustic plagiarism can last for decades. Consequently, plagiarism cases need a transparent and efficient approach that reduces the uncertainty of judges and accelerates the decision process. Instead of performing similarity analyses manually, software can be used. Automatically extracting relevant features from audio files creates a working basis. After correcting this input, musicology experts can apply pattern matching algorithms. Finally, software can display the identified similarities so that their individual importance can be evaluated and explained to untrained audiences. In this paper, we present a detailed empirical evaluation of algorithms that can be used to compare transcribed melodies in pitch vector format. Pitch Vector Similarity (PVS), Recursive Alignment (RA), Geometric Alignment (GA) and Sequence Alignment (SA) have been subjected to tests evaluating their ability to detect similarities as the difference between the compared sequences increases. Results show that PVS and SA deliver good detection rates and are the most stable and flexible among the tested candidates. Under the given conditions, GA performed better than RA.
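To make the idea of aligning transcribed melodies concrete, here is a minimal sketch of a local alignment score (Smith-Waterman style) over pitch-interval sequences. It is a generic textbook scheme, not the SA, RA, GA or PVS implementations evaluated in the paper; the scoring parameters and the two toy melodies are assumptions.

```python
from typing import List, Sequence


def to_intervals(pitches: Sequence[int]) -> List[int]:
    """Convert MIDI pitches to successive intervals, which are transposition-invariant."""
    return [later - earlier for earlier, later in zip(pitches, pitches[1:])]


def local_alignment_score(
    a: Sequence[int], b: Sequence[int], match: int = 2, mismatch: int = -1, gap: int = -1
) -> int:
    """Best local alignment score between two sequences (Smith-Waterman recurrence)."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(0, diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
            best = max(best, table[i][j])
    return best


# Hypothetical melodies: the second is the first transposed up by two semitones.
melody = [60, 62, 64, 65]
transposed = [62, 64, 66, 67]
score = local_alignment_score(to_intervals(melody), to_intervals(transposed))
```

Working on intervals rather than absolute pitches means a transposed copy still yields the maximal score for its length.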

References M. Ryynänen and A. Klapuri (2008): Query by humming of MIDI and audio using locality sensitive hashing. ICASSP 2008, 2249-2252. X. Wu et al. (2006): A top-down approach to melody match in pitch contour for query by humming. Proceedings of ISCA 2006. J. Urbano et al. (2011): Melodic Similarity through Shape Similarity. In Proceedings of the 7th International Conference on Exploring Music Contents, 338-355.

Keywords AUDIO PLAGIARISM DETECTION, SIMILARITY MEASURES, LOCAL ALIGNMENT, GLOBAL ALIGNMENT.


From Single Tones to MIDI Remixes - Detecting Families of Musical Instruments by High-Level Features

Eichhoff, Markus1 and Weihs, Claus1

1 Chair of Computational Statistics, Faculty of Statistics, TU Dortmund {eichhoff,weihs}@statistik.tu-dortmund.de

Abstract. Detecting musical instruments in pieces of polyphonic music given as mp3 or wav files is a difficult task. Using source-filter models for sound separation, as done in Heittola et al. (2009), is one approach. In this study, four families of musical instruments (strings, wind, piano, plucked strings) are classified using the four high-level audio feature groups Pitchless Periodogram (PiP) (Weihs and Ligges (2003)), Absolute Amplitude Envelope, Mel-Frequency Cepstral Coefficients and Linear Predictor Coding, so that physical properties of the instruments are also taken into account (Fletcher (2008)). These feature groups are calculated for consecutive time blocks. Supervised statistical classification methods such as LDA, MDA, Support Vector Machines, Random Forest and Boosting, together with variable selection, are used for classification. The instrument recognition task is carried out for single tones, intervals, chords and MIDIs. MIDI samples have been replaced by real audio samples, which are used for training the statistical models in the case of single tones, intervals and chords. Statistical tests confirm hypotheses on, e.g., which blocks are at least necessary or which statistical methods are best for each classification task.

References FLETCHER, N.H. (2008): The physics of musical instruments. Springer, New York, 2008. HEITTOLA, T., KLAPURI, A. and VIRTANEN, T. (2009): Musical Instrument Recognition in Polyphonic Audio Using Source-Filter Model For Sound Separation. 10th International Society for Music Information Retrieval Conference, ISMIR 2009, Proceedings. WEIHS, C. and LIGGES, U. (2003): Voice Prints as a Tool for Automatic Classification of Vocal Performance. In: R. Kopiez, A. C. Lehmann, I. Wolther and C. Wolf (Eds.): Proceedings of the 5th Triennial ESCOM Conference, Hanover University of Music and Drama. Germany, September 8-13, 332-335.

Keywords HIGH-LEVEL AUDIO FEATURES, MUSICAL INSTRUMENT RECOGNITION, SUPERVISED CLASSIFICATION, PIP, MIDI


Learning in groups and exam performance

Andreas Geyer-Schulz1, Jonas Kunze1, and Andreas Sonnenbichler1

Informationsdienste und Elektronische Märkte, Karlsruhe Institute of Technology (KIT) {andreas.geyer-schulz|jonas.kunze|andreas.sonnenbichler}@kit.edu

Abstract. As learning in groups may increase learners’ productivity (cf. Lam and Ching, 2001), exercise courses are a common standard among university classes. Although various aspects may be considered when forming a learning group (Hsiung, 2010), the size of the group has been shown to play a special role. Hunkeler and Sharp (1997) showed that four-member groups outperformed three-member groups in a statistically significant way. In this article, we present an evaluation of 2010-2012 exercise course data and final exam performance, with a special focus on the learner group size of different courses.

References HSIUNG, C.-M. (2010): An experimental investigation into the efficiency of cooperative learning with consideration of multiple grouping criteria. European Journal of Engineering Education, 35(6), 679–692. HUNKELER, D. and SHARP, J. E. (1997): Assigning functional groups: The influence of group size, academic record, practical experience, and learning style. Journal of Engineering Education, 86(4), 321–332. LAM, M. and CHING, R. (2001): Effect of group learning on academic performance: A pilot study for com-based classes. In: AMCIS 2001 Proceedings. Paper 21.

Keywords EDUCATION, GROUP LEARNING, GROUP SIZE


ANOVA and Alternatives for Causal Inferences

Sonja Hahn1

Friedrich-Schiller-Universität Jena, Institut für Psychologie, Am Steiger 3, Haus 1, 07743 Jena [email protected]

Abstract. Analysis of variance (ANOVA) is one of the procedures most often used for analyzing experimental and quasiexperimental data in psychology. Nonetheless, there is sometimes confusion about which subtype to prefer when the data are unbalanced. Much of this confusion can be prevented if an adequate hypothesis is formulated first. In the present paper this is done by using a theory of causal effects. This is the starting point for the subsequent simulation study on unbalanced two-way designs. The simulated data sets differed in the presence of an (average) effect, the degree of interaction, sample size (N = 30; 60; 90; 150; 300; 600; 900), stochasticity of the factors, and whether there was confounding between the two factors (i.e. experimental vs. quasiexperimental design). Different subtypes of ANOVA as well as competing procedures from the tradition of causality research were compared with regard to adherence to the nominal α-level and power. Results suggest that different types of ANOVA should be used with care, especially in quasiexperimental designs and when there is interaction. Procedures developed within the tradition of causality research are feasible alternatives that may serve better to answer meaningful hypotheses.1
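Why the choice of hypothesis matters in an unbalanced two-way design can be seen in a small worked example. The sketch below contrasts two summaries of a treatment effect: the unweighted mean of the conditional effects across the levels of the second factor, and the mean weighted by the observed share of each level. The toy data and function names are hypothetical, and reading the weighted version as an "average effect" is an assumption made for illustration, not the estimator used in the study.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Cell data keyed by (treatment x, covariate level z); toy values, not study data.
Cells = Dict[Tuple[int, int], List[float]]


def conditional_effects(cells: Cells) -> Dict[int, float]:
    """Treatment-minus-control difference of cell means within each level z."""
    levels = sorted({z for (_, z) in cells})
    return {z: mean(cells[(1, z)]) - mean(cells[(0, z)]) for z in levels}


def unweighted_effect(cells: Cells) -> float:
    """Every level z counts equally, regardless of how many units it contains."""
    return mean(list(conditional_effects(cells).values()))


def weighted_effect(cells: Cells) -> float:
    """Each level z is weighted by its observed share of the sample."""
    effects = conditional_effects(cells)
    sizes = {z: len(cells[(0, z)]) + len(cells[(1, z)]) for z in effects}
    total = sum(sizes.values())
    return sum(effects[z] * sizes[z] / total for z in effects)


unbalanced: Cells = {
    (0, 0): [1.0, 1.0, 1.0, 1.0],
    (1, 0): [2.0, 2.0, 2.0, 2.0],  # effect of 1 in the large stratum (n = 8)
    (0, 1): [0.0],
    (1, 1): [3.0],                 # effect of 3 in the small stratum (n = 2)
}
print(unweighted_effect(unbalanced), weighted_effect(unbalanced))
```

With balanced cells the two quantities coincide; with interaction and unequal cell sizes they answer different questions.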

References STEYER, R., GABLER, S., von DAVIER, A. A. and NACHTIGALL, C. (2000): Causal regression models II: Unconfoundedness and causal unbiasedness. Methods of Psychological Research Online, 5, 55–87. STEYER, R., NACHTIGALL, C., WÜTHRICH-MARTONE, O. and KRAUS, K. (2002): Causal Regression Models III: Covariates, Conditional, and Unconditional Average Causal Effects. Methods of Psychological Research Online, 7, 41–68.

Keywords ANOVA, CAUSALITY, SIMULATION STUDY, UNBALANCED DESIGNS

1 Author is PhD student


Testing Models for Medieval Settlement Location

Irmela Herzog

The Rhineland Commission for Archaeological Monuments and Sites, The Rhineland Regional Council [email protected]

Abstract. Two models have been proposed for the spread of Medieval settlements in the landscape known as Bergisches Land in Germany. Some archaeologists think that the spread was closely connected with the ancient trade routes which were already in use before the population increase in Medieval times. An alternative hypothesis assumes that the settlements primarily developed in the valleys with good soil. Focusing on an area covering 675 km² of the Bergisches Land, this contribution investigates the two hypotheses. For this study area, a publication is available listing the years when the small hamlets and villages were first mentioned in historical sources, with a total of 513 locations mentioned between 950 and 1350 AD. In a first step the patterns of movement in Medieval times are derived from the trade routes of that time. The result is an adjusted distance measure, which takes slope and wet soil into account. In the next step, simple accessibility maps are generated on the basis of this adjusted distance measure for both alternative targets, i.e. the trade routes and the valleys with favourable soils. For each location, the accessibility values in these maps correspond to the distance to the nearest trade route or valley with good soils respectively. In a final step, for each alternative target a Kolmogorov-Smirnov test is applied to compare the adjusted distances of the Medieval settlements with the reference distribution derived from the appropriate accessibility map.
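The final step can be sketched as the computation of a Kolmogorov-Smirnov statistic, i.e. the largest gap between two empirical distribution functions. The sketch treats the accessibility map as a second sample of distances, which is an assumption; the function name and the toy distances are hypothetical, and the p-value computation is omitted.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample: Sequence[float], reference: Sequence[float]) -> float:
    """Largest absolute difference between the two empirical CDFs."""
    sorted_sample = sorted(sample)
    sorted_reference = sorted(reference)
    largest_gap = 0.0
    # Both CDFs are step functions, so the supremum is attained at an observed value.
    for value in sorted_sample + sorted_reference:
        cdf_sample = bisect_right(sorted_sample, value) / len(sorted_sample)
        cdf_reference = bisect_right(sorted_reference, value) / len(sorted_reference)
        largest_gap = max(largest_gap, abs(cdf_sample - cdf_reference))
    return largest_gap


# Hypothetical adjusted distances: settlements versus all cells of an accessibility map.
settlement_distances = [0.2, 0.4, 0.5, 0.9, 1.1]
map_distances = [0.3, 0.8, 1.5, 2.0, 2.6, 3.1, 3.9, 4.4]
statistic = ks_statistic(settlement_distances, map_distances)
```

A large statistic would indicate that settlements lie systematically closer to (or farther from) the target than locations in the study area in general.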

References BORTZ, J. and LIENERT, G.A. (1998): Kurzgefasste Statistik für die klinische Forschung. Springer, Berlin. HERZOG, I. (2009): Berechnung von optimalen Wegen am Beispiel der Zeitstrasse. Archäologische Informationen 31 (1&2), 87–96. NICKE, H. (2001): Vergessene Wege. Martina Galunder Verlag, Nümbrecht.

Keywords MEDIEVAL SETTLEMENTS, LEAST-COST PATHS, KOLMOGOROV-SMIRNOV TEST

Supporting Selection of Statistical Techniques in Research

Kay F. Hildebrand

European Research Center for Information Systems (ERCIS), Münster [email protected]

Abstract. In this paper we describe the need for a more structured approach to quantitative research. The number of available techniques has surpassed what researchers can reasonably comprehend. Deciding on one suitable technique for a given dataset is a non-trivial and time-consuming task. Thus, structured support for choosing adequate data analysis techniques is required. We present a structural framework for organizing techniques and a description template for characterizing techniques uniformly. We show that the former provides an overview of all available techniques on different levels of abstraction, while the latter offers a way to assess a single method as well as to compare it to others. Furthermore, we developed a set of guidelines for the process of data analysis that, if applied, will increase the overall quality of data analysis in research.

References J. Becker et al. (2000). Guidelines of Business Process Modeling. Business Process Management, 1806, 30-49. Springer. J. Jackson (2002). Data Mining: A Conceptual Overview. Communications of the Association for Information Systems, 8(1), 267-296.

Keywords RESEARCH METHODOLOGY, DATA ANALYSIS, STATISTICS, FRAMEWORK, GUIDELINES.


Alignment methods for folk tune classification

Ruben Hillewaere1, Bernard Manderick1, and Darrell Conklin2,3

1 Computational Modeling Lab, Department of Computing, Vrije Universiteit Brussel, Brussels, Belgium {rhillewa,bmanderi}@vub.ac.be
2 Department of Computer Science and AI, Universidad del País Vasco UPV/EHU, San Sebastián, Spain darrell [email protected]
3 IKERBASQUE, Basque Foundation for Science, Bilbao, Spain

Abstract. In folk song research, alignment methods have been widely used to retrieve highly similar tunes from a database. In a recent study (van Kranenburg, 2010), they have also been applied to the specific task of tune family classification, a tune family being an ensemble of folk songs which are all variations of the same tune. That study shows that they achieve remarkable classification accuracies in comparison with other types of models. In this study, we investigate how alignment methods perform on two fundamentally different classification tasks. The first task is geographic region classification, which we have thoroughly studied in our previous work (Hillewaere et al., 2009). The second task is folk tune genre classification, where the genres are the dance types of the tunes. Given the excellent results with alignment methods on tune family classification, one could expect that they would also perform well on other classification tasks. To verify that hypothesis, a string edit distance method is applied to three folk music datasets. Folk tunes are encoded in melodic and rhythmic representations: as strings of pitch intervals, and as strings of inter-onset intervals. All pairwise edit distances are computed over the string representations, and the classification is done with a one-nearest-neighbour algorithm. Classification accuracies with the alignment methods are compared with an n-gram model. Results confirm that alignment methods perform well on the tune family classification task, and suggest that n-gram models are better choices for the other two classification tasks.4
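The pipeline described above (edit distance over interval strings followed by one-nearest-neighbour classification) can be sketched in a few lines. The unit-cost edit distance, the toy interval sequences and the class labels are assumptions made for illustration; the alignment scoring actually used in the study is not specified in the abstract.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Unit-cost edit distance between two interval sequences."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def classify_1nn(query: Sequence[int], training: List[Tuple[Sequence[int], str]]) -> str:
    """Return the label of the training tune closest to the query in edit distance."""
    nearest_tune, nearest_label = min(training, key=lambda item: edit_distance(query, item[0]))
    return nearest_label


# Hypothetical pitch-interval encodings with made-up dance-type labels.
training_set = [
    ([2, 2, 1, 2], "jig"),
    ([0, 0, 7, 0], "reel"),
]
predicted = classify_1nn([2, 2, 1, -1], training_set)
```

The same function works unchanged on rhythmic encodings, as long as the inter-onset intervals are quantized to comparable symbols.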

References HILLEWAERE, R., MANDERICK, B. and CONKLIN, D. (2009): Global feature versus event models for folk song classification. In: Proceedings of the 10th International Society for Music Information Retrieval Conference. Kobe, Japan, 729–733. VAN KRANENBURG, P. (2010): A computational approach to content-based retrieval of folk song melodies. SIKS dissertatiereeks, 43.

Keywords MUSIC CLASSIFICATION, ALIGNMENT, MUSIC REPRESENTATION

4 Author Ruben Hillewaere is a PhD student.


Comparing regression approaches in modelling (non-)compensatory judgment formation

Thomas Hörstermann1∗ and Sabine Krolak-Schwerdt2

1

University of Luxembourg, Route de Diekirch, L-7220 Walferdange [email protected] [email protected]

2

Abstract. Research on judgment formation deals with the integration of multiple pieces of information into a mostly unidimensional judgment. Psychological theories and empirical results support the assumption of compensatory strategies, e.g. (weighted) additive models, as well as non-compensatory (heuristic) strategies as underlying decision rules. If a compensatory decision rule is assumed, multiple regression is frequently used to model the judgment formation process. An adequate fit of the regression model in turn leads to the conclusion that the cognitive process of judgment formation is compensatory, whereas an unsatisfactory fit leads to the rejection of a cognitive compensatory model. The validity of this conclusion is impaired if either regression models do not reliably identify an underlying compensatory decision rule or if non-compensatory decision rules also lead to an adequate fit of the linear model. The study addresses this question by applying regression techniques to simulated sets of judgment data with underlying compensatory and non-compensatory decision rules. The simulated data sets are designed to reflect typical data sets from empirical educational research. Results indicate that non-compensatory decision rules may, at least partially, lead to an adequate fit, thus impairing the validity of the conclusion.
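A small deterministic example shows how a non-compensatory rule can still be fitted well by a linear model. The sketch below builds all patterns of three binary cues, generates judgments from a lexicographic rule and from a conjunctive rule, and computes the R² of an ordinary least squares fit. The two rules, the cue design and all function names are assumptions chosen for illustration and are not the simulation design of the study.

```python
from itertools import product
from typing import List, Sequence


def solve_linear_system(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def r_squared(cues: Sequence[Sequence[int]], judgments: Sequence[float]) -> float:
    """R^2 of an OLS regression of the judgments on the cues (with intercept)."""
    rows = [[1.0] + [float(value) for value in cue] for cue in cues]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, judgments)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(judgments) / len(judgments)
    ss_res = sum((y - f) ** 2 for y, f in zip(judgments, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in judgments)
    return 1.0 - ss_res / ss_tot


cues = list(product([0, 1], repeat=3))                       # all 8 cue patterns
lexicographic = [float(sorted(cues).index(c)) for c in cues]  # rank by cue 1, then 2, then 3
conjunctive = [float(min(c)) for c in cues]                   # high only if every cue is high
fit_lexicographic = r_squared(cues, lexicographic)
fit_conjunctive = r_squared(cues, conjunctive)
```

In this toy design the lexicographic rule is reproduced perfectly by the linear model, whereas the conjunctive rule is fitted much less well, which mirrors the "at least partially" in the abstract.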

References ANDERSON, N.H. and BUTZIN, C.A. (1974): Performance = Motivation × Ability. An Integration-theoretical Analysis. Journal of Personality and Social Psychology, 30, 598–604. GIGERENZER, G. (2008): Why Heuristics work. Perspectives on Psychological Science, 3, 20– 29.

Keywords JUDGMENT FORMATION, JUDGMENT MODELLING, REGRESSION ANALYSIS, (NON-)COMPENSATORY JUDGMENTS

∗ PhD student


Sensitivity Analyses for the Rasch Model

Daniel Kasper∗ and Ali Ünlü

Chair for Methods in Empirical Educational Research, TUM School of Education, Technische Universität München, Lothstrasse 17, 80335 Munich, Germany {daniel.kasper,ali.uenlue}@tum.de

Abstract. For scaling items and persons in large scale assessment studies such as the Programme for International Student Assessment (PISA; OECD (2012)) or the Progress in International Reading Literacy Study (PIRLS; Martin et al. (2007)), variants of the Rasch model (Fischer and Molenaar (1995)) are used. However, goodness-of-fit statistics for the overall fit of the models under varying conditions, as well as specific statistics for the various testable consequences of the models (Steyer and Eid (2001)), are rarely, if at all, presented in the published reports. In this paper, we apply the Rasch model to PISA data under varying conditions (e.g., under different methods for dealing with missing data, different dichotomization procedures, or different software for performing the item response analyses). On the basis of various overall and specific fit statistics, we compare how sensitive the Rasch model is to these changing conditions. The results of our study will help in quantifying how meaningful the findings from large scale assessment studies can be, and we will be able to recommend under which conditions the Rasch model or its variants can be used for scaling large scale assessment data. Finally, practical guidance is given to help applied researchers interested in Rasch modeling choose the appropriate psychometric software package for their intended research.
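For readers less familiar with the model, the dichotomous Rasch model gives the probability of a correct response as a logistic function of the difference between person ability θ and item difficulty b. The sketch below encodes that response function and the log-likelihood of one response pattern; the function names and example values are hypothetical, and parameter estimation and fit statistics are not covered.

```python
import math
from typing import Sequence


def rasch_probability(theta: float, difficulty: float) -> float:
    """P(correct response) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def log_likelihood(responses: Sequence[int], theta: float, difficulties: Sequence[float]) -> float:
    """Log-likelihood of one person's 0/1 responses, assuming local independence."""
    total = 0.0
    for response, difficulty in zip(responses, difficulties):
        p = rasch_probability(theta, difficulty)
        total += math.log(p) if response == 1 else math.log(1.0 - p)
    return total


# Hypothetical item difficulties and one response pattern.
item_difficulties = [-1.0, 0.0, 1.5]
pattern = [1, 1, 0]
value = log_likelihood(pattern, theta=0.5, difficulties=item_difficulties)
```

Conditions such as the handling of missing responses or the dichotomization rule change the 0/1 patterns that enter this likelihood, which is one way they can affect model fit.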

References FISCHER, G.H. and MOLENAAR, I.W. (Eds.) (1995): Rasch Models: Foundations, Recent Developments, and Applications. Springer-Verlag, New York. MARTIN, M.O., MULLIS, I.V.S. and KENNEDY, A.M. (2007): PIRLS 2006 Technical Report. TIMSS & PIRLS International Study Center, Chestnut Hill. OECD (2012): PISA 2009 Technical Report. OECD Publishing, Paris. STEYER, R. and EID, M. (2001): Messen und Testen [Measuring and Testing]. Springer-Verlag, Berlin.

Keywords RASCH MODEL, PISA, LARGE SCALE ASSESSMENT, SENSITIVITY ANALYSES, PSYCHOMETRIC SOFTWARE

∗ PhD student


Music and Timbre Segmentation by efficient Order Constrained K-Means Clustering

Sebastian Krey1∗, Uwe Ligges1, and Friedrich Leisch2

1 Technische Universität Dortmund, Fakultät Statistik, Vogelpothsweg 87, 44221 Dortmund, Germany, Tel.: +49-231-755 3057, Fax: +49-231-755 4387, [email protected], [email protected]
2 Universität für Bodenkultur Wien, Institut für angewandte Statistik und EDV, Peter-Jordan-Straße 82, 1190 Wien, Austria, Tel.: +43-1-47 654 5061, Fax: +43-1-47 654 5069, [email protected]

Abstract. Clustering of features derived from musical sound recordings proved to be beneficial for further classification tasks such as instrument recognition [1]. Using order constrained solutions in K-means clustering [2] the clustering results can be stabilized and the interpretability of the clustering is improved. With this method a further reduction of the misclassification error in the aforementioned instrument recognition task is possible. For an efficient calculation of the order constrained solutions in K-means clustering we use a dynamic programming approach implemented in the statistical programming language R. Using this efficient implementation the musical structure of a whole piece of popular music can be extracted automatically. Visualizing the distances of the feature vectors through a self distance matrix allows for an easy visual verification of the result. For the estimation of the right number of clusters, we propose to calculate the adjusted Rand indices of bootstrap samples of the data and base the decision on the minimum of a robust version of the coefficient of variation. In addition to the average stability, which is measured through the adjusted Rand index, this approach takes the variation between the different bootstrap samples into account. This results in favoring settings with little variation between the bootstrap samples, if average stability is nearly identical.
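The dynamic programming idea behind order-constrained K-means can be sketched for a one-dimensional feature sequence: every cluster must be a contiguous segment, and the segmentation minimizing the within-segment sum of squares is found exactly. The authors' implementation is in R and works on feature vectors; the Python function below, its name and the toy sequence are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def order_constrained_kmeans(values: Sequence[float], k: int) -> Tuple[List[Tuple[int, int]], float]:
    """Split `values` into k contiguous segments with minimal total squared error."""
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the sequence length")
    prefix, prefix_sq = [0.0], [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def segment_cost(start: int, end: int) -> float:
        """Sum of squared deviations from the mean of values[start:end]."""
        total = prefix[end] - prefix[start]
        return (prefix_sq[end] - prefix_sq[start]) - total * total / (end - start)

    infinity = float("inf")
    cost = [[infinity] * (n + 1) for _ in range(k + 1)]
    split = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for end in range(c, n + 1):
            for start in range(c - 1, end):
                candidate = cost[c - 1][start] + segment_cost(start, end)
                if candidate < cost[c][end]:
                    cost[c][end] = candidate
                    split[c][end] = start
    # Walk the stored split points backwards to recover the segment boundaries.
    segments, end = [], n
    for c in range(k, 0, -1):
        start = split[c][end]
        segments.append((start, end))
        end = start
    return segments[::-1], cost[k][n]


# Hypothetical per-frame feature values with three obvious sections.
segments, total_cost = order_constrained_kmeans([1, 1, 1, 5, 5, 9, 9, 9], k=3)
```

The triple loop costs O(K·n²) cost evaluations, each of which is constant-time thanks to the prefix sums.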

References 1. LIGGES, U. and KREY, S. (2011): Feature Clustering for Instrument Classification. Computational Statistics 26(2), 279–291. 2. STEINLEY, D. and HUBERT, L. (2008): Order-constrained solutions in k-means clustering: Even better than being globally optimal. Psychometrika 73(5), 647–664.

Keywords CLUSTERING, CONSTRAINTS, MUSIC, CLASSIFICATION

∗ PhD student


The balance of value and space
Merging classification and regionalization to make more sense out of spatial data.

Martin Loidl1 and Christoph Traun1

Center for Geoinformatics, University of Salzburg, Hellbrunnerstr. 34, 5020 Salzburg [martin.loidl; christoph.traun]@sbg.ac.at

Abstract. Within the domain of geography there are basically two different approaches to reduce complexity and reveal underlying patterns of polygonally aggregated, univariate quantitative data such as unemployment rates per administrative unit:

1. Regionalization (in its homogeneous variant) combines polygons into one region if they share a common border and are similar in their attribute values. This approach primarily aims to define boundaries between contiguous spatial aggregates based on an attributive homogeneity criterion. Example: Dividing the EU into a set of individual regions based on economic performance.
2. Classification in a geographic sense groups objects into mutually exclusive categories based on value ranges (class intervals) of a predetermined attribute. The main purpose of classification in this context (e.g. applied in cartography) is to reduce visual “noise” in the map and help the interpreter to extract meaningful information (Cromley and Cromley 1996), such as spatial patterns formed by an underlying geographic phenomenon.

In cartography, which can be seen as the visual output of geographic analysis, characteristic spatial patterns occur if polygons of the same class (resp. areal shading) are predominantly adjacent and therefore visually connected to larger figures. However, commonly used cartographic classification techniques are solely based on the attributive domain and completely ignore the spatial context. This ‘blindness’ for spatial configuration during a classification process leads to comparably complex and fragmented spatial patterns, hampering visual perception and subsequent cognitive processes related to map interpretation.
While cartographic classification therefore sometimes misses its target of removing “visual noise” from choropleth maps, “homogeneous” regions resulting from regionalization tend to hide too much of the local variation in values to allow meaningful interpretation. Several approaches conceptually between regionalization and classification have been developed (e.g. Murray and Shyy 2000). The main shortcoming of all the proposed methods is the undefinable weight between the attributive and spatial properties of data. This leads to vague results which cannot be reproduced or compared. In contrast Autocorrelation-based Regioclassification (Traun and Loidl 2012) uses the degree of spatial autocorrelation determined by Moran’s I statistics as a weight in a bi-criterion classification process considering the attributive dimension as well as the local neighborhood. Therefore it closes the gap between classification and regionalization on a sound statistical basis. While the applicability of the method has been proven in a cartographic context, potential fields of application comprise a variety of space-sensitive questions, for example in image analysis and object extraction.
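Since the method relies on Moran's I as a measure of spatial autocorrelation, a minimal sketch of the global statistic may help. It uses binary contiguity weights (1 for neighbouring polygons, 0 otherwise), which is an assumption; the function name, the toy neighbourhood structure and the values are hypothetical, and how the statistic enters the authors' bi-criterion classification is not shown.

```python
from typing import Dict, List, Sequence


def morans_i(values: Sequence[float], neighbours: Dict[int, List[int]]) -> float:
    """Global Moran's I with binary contiguity weights.

    I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2
    """
    n = len(values)
    mean_value = sum(values) / n
    deviations = [v - mean_value for v in values]
    cross_products = 0.0
    weight_sum = 0
    for i, adjacent in neighbours.items():
        for j in adjacent:
            cross_products += deviations[i] * deviations[j]
            weight_sum += 1
    squared_deviations = sum(d * d for d in deviations)
    return (n / weight_sum) * cross_products / squared_deviations


# Hypothetical row of four polygons, each bordering its left and right neighbour.
row_neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
clustered = morans_i([0.0, 0.0, 1.0, 1.0], row_neighbours)    # similar values adjacent
alternating = morans_i([0.0, 1.0, 0.0, 1.0], row_neighbours)  # dissimilar values adjacent
```

Positive values indicate that similar attribute values tend to be adjacent, negative values that neighbours tend to differ.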

References CROMLEY, E. K. and R. G. CROMLEY (1996): An Analysis of Alternative Classification Schemes for Medical Atlas Mapping. European Journal of Cancer, 32(9), 1551–1559. MURRAY, A. T. and T.-K. SHYY (2000): Integrating Attribute and Space Characteristics in Choropleth Display and Spatial Data Mining. International Journal of Geographical Information Science, 14(7), 649-667. TRAUN, C. and M. LOIDL (2012): Autocorrelation-Based Regioclassification - a self-calibrating classification approach for choropleth maps explicitly considering spatial autocorrelation. International Journal of Geographical Information Science, iFirst, 1-17.

Keywords REGIONALIZATION, CLASSIFICATION, AUTOCORRELATION-BASED REGIOCLASSIFICATION, CARTOGRAPHY, SPATIAL DATA

Confidence measures in automatic music classification

Hanna Lukashevich

Fraunhofer IDMT, Ehrenbergstr. 31, 98693 Ilmenau, Germany [email protected]

Abstract. Automatic music classification receives steady attention in the research community. Music can be classified, for instance, according to genre, style, mood, or played instruments. Automatically retrieved class labels can be used for searching and browsing within large digital music collections. State-of-the-art methods for music classification involve various machine learning techniques such as Gaussian mixture models and support vector machines. Once trained, the classifiers can predict class labels for unseen data. However, due to the variability and complexity of music data and to imprecise class definitions, the classification of real-world music remains error-prone. The goal of this work is to enhance the automatic class labels with confidence measures that provide an estimate of the probability of correct classification.
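One simple way to attach a confidence value to a predicted label is to convert per-class log-likelihoods (for example from one generative model per class) into posterior probabilities via Bayes' rule and to report the largest posterior. This is only a generic sketch under that assumption; the abstract does not say which confidence measures the author proposes, and the class names and scores below are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple


def class_posteriors(
    log_likelihoods: Dict[str, float], priors: Optional[Dict[str, float]] = None
) -> Dict[str, float]:
    """Posterior P(class | data) from log-likelihoods; uniform priors by default."""
    if priors is None:
        priors = {label: 1.0 / len(log_likelihoods) for label in log_likelihoods}
    log_joint = {label: ll + math.log(priors[label]) for label, ll in log_likelihoods.items()}
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(log_joint.values())
    unnormalized = {label: math.exp(value - shift) for label, value in log_joint.items()}
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}


def predict_with_confidence(log_likelihoods: Dict[str, float]) -> Tuple[str, float]:
    """Return the most probable label together with its posterior as confidence."""
    posteriors = class_posteriors(log_likelihoods)
    best_label = max(posteriors, key=posteriors.get)
    return best_label, posteriors[best_label]


# Hypothetical per-class log-likelihoods for one music excerpt.
label, confidence = predict_with_confidence({"rock": 0.0, "jazz": math.log(3.0)})
```

Whether such posteriors are well calibrated, i.e. whether a confidence of 0.75 corresponds to 75 % correct decisions, has to be checked on held-out data.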

Keywords AUTOMATIC MUSIC CLASSIFICATION, CONFIDENCE MEASURES


Multi-Step Linear Discriminant Analysis for Classification of Event-Related Potentials

Nguyen Hoang Huy, Stefan Frenzel, and Christoph Bandt

Institute for Mathematics and Informatics, University of Greifswald, 17487 Greifswald, Germany, [email protected]

Abstract. Event-related potentials (ERPs) are responses to stimuli in the electroencephalogram. They make it possible to drive a brain-computer interface (BCI). Determining the presence or absence of ERPs from the electroencephalogram can be considered a binary classification problem. Linear classifiers are probably the most popular algorithms for BCI applications, with many of them being based on linear discriminant analysis (LDA). In order to overcome the small sample size problem of LDA, techniques such as regularization of the sample covariance matrix have been applied, see Blankertz et al. (2011). We introduce a multi-step machine learning approach and use it to classify data from a visual ERP-based BCI, see Frenzel et al. (2011). Our approach is motivated by the separability of the spatio-temporal covariance matrix. First, all features are divided into disjoint subgroups and LDA is applied to each of them. This procedure is iterated until only one score remains, which is used for classification. We thereby avoid estimating the high-dimensional covariance matrix of all spatio-temporal features. We investigate the classification performance with special attention to the small sample size case. We also present some theoretical results regarding the asymptotic error rate for the normal model with separable covariance matrix. They give insight into the way the subgroups should be formed on each level.
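The ingredients of the approach can be written down compactly. The block below is a sketch in standard notation, not the authors' derivation: the symbols μ₀, μ₁ (class means), Σ (common covariance), the Kronecker ordering of the separable covariance and the subgroup index j are all assumptions introduced for illustration.

```latex
% Two-class LDA score with common covariance matrix \Sigma
\[
  s(x) = w^{\top}\Bigl(x - \tfrac{1}{2}(\mu_0 + \mu_1)\Bigr),
  \qquad w = \Sigma^{-1}(\mu_1 - \mu_0)
\]

% Separable spatio-temporal covariance (Kronecker product of a temporal and a spatial factor)
\[
  \Sigma = \Sigma_{\mathrm{time}} \otimes \Sigma_{\mathrm{space}}
\]

% One step of multi-step LDA: split the features into q disjoint subgroups and
% apply LDA within each subgroup, using only the small covariance matrix \Sigma_j
\[
  x = \bigl(x^{(1)}, \dots, x^{(q)}\bigr), \qquad
  s_j = w_j^{\top} x^{(j)}, \qquad
  w_j = \Sigma_j^{-1}\bigl(\mu_1^{(j)} - \mu_0^{(j)}\bigr)
\]
% The scores (s_1, \dots, s_q) become the features of the next step;
% the procedure stops when a single score remains.
```

Each step only requires covariance estimates of the size of one subgroup, which is why the high-dimensional matrix Σ never has to be estimated as a whole.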

References BLANKERTZ, B. and LEMM, S. and TREDER, M. and HAUFE, S. and MÜLLER, K. R. (2011): Single-trial analysis and classification of ERP components – A tutorial. NeuroImage, 56, 814–825. FRENZEL, S. and NEUBERT, E. and BANDT, C. (2011): Two communication lines in a 3 × 3 matrix speller. Journal of Neural Engineering, 8, 036021.

Keywords LINEAR DISCRIMINANT ANALYSIS, EVENT-RELATED POTENTIALS, BRAIN-COMPUTER INTERFACE

The Author in Translation: A Computational Method

Sergiu Nisioi1 and Liviu P. Dinu2

1 University of Bucharest, Faculty of Mathematics and Computer Science, Academiei 14, Bucharest, Romania [email protected] †
2 University of Bucharest, Faculty of Mathematics and Computer Science, Academiei 14, Bucharest, Romania [email protected]

Abstract. In this article we discuss quantitative stylistic measurements that describe the evolution of an author’s style, the differences which occur in a translation and, implicitly, the possibility of using these parameters in authorship attribution. As a case study we have chosen Vladimir Nabokov’s novels. We have analysed his works alone, looking for a stylistic pattern. For this purpose we have selected the rankings of the function words and the Spectrum kernel as similarity measuring tools. To identify the most relevant function words we have based our work on Foucault’s study of the “author function”. We have therefore reduced the number of function words based on empirical facts about Nabokov. Our results discriminate between the translated and the original text, whether Russian or English. We also discuss the importance of lemmatizing Russian function words. Starting from the assumption that V. Nabokov lies behind the pen-name M. Ageyev, we have carried out an authorship attribution investigation. We have concluded that there is a resemblance between the two.
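The Spectrum kernel mentioned above compares two texts through the counts of their contiguous character k-grams: the kernel value is the inner product of the two count vectors. The sketch below shows this standard definition together with a cosine-style normalization; the function names, the choice of k and the toy strings are assumptions, and any preprocessing used by the authors (such as lemmatization) is omitted.

```python
import math
from collections import Counter


def spectrum(text: str, k: int) -> Counter:
    """Counts of all contiguous substrings of length k."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def spectrum_kernel(a: str, b: str, k: int) -> int:
    """Inner product of the k-gram count vectors of the two texts."""
    counts_a, counts_b = spectrum(a, k), spectrum(b, k)
    return sum(count * counts_b[gram] for gram, count in counts_a.items())


def normalized_spectrum_kernel(a: str, b: str, k: int) -> float:
    """Kernel value scaled to [0, 1] so that texts of different length are comparable."""
    denominator = math.sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denominator if denominator else 0.0


# Hypothetical short strings; real use would compare whole novels or chapters.
similarity = normalized_spectrum_kernel("the author in translation", "the author function", k=3)
```

Because the kernel works on raw character sequences, it needs no tokenization, which makes it applicable to Russian and English texts alike.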

References FOUCAULT, M. (1987): What Is an Author? Twentieth-Century Literary Theory. State University Press of New York, Albany. McKENNA, W., BURROWS, J. and ANTONIA, A. (1999): Beckett’s Trilogy: Computational Stylistics and the Nature of Translation. RISSH. 35, 151-71. POPESCU, M and DINU, P. L. (2007): Kernel methods and string kernels for authorship identification: The federalist papers case. Proceedings of the International Conference RANLP. Borovets, Bulgaria, pp 484-487.

Keywords AUTHOR FUNCTION, TRANSLATION, RANK DISTANCE, SPECTRUM KERNEL, VLADIMIR NABOKOV

† Student


Differentiation of innovation strategies across regions

Dominik Antoni Rozkrut1,2

1 Statistical Office in Szczecin, Poland [email protected]
2 University of Szczecin, Department of Statistics and Econometrics, Poland

Abstract. Classical indicators constructed from a single variable, such as the “innovation rate”, have limited information capacity. These simple indicators, which combine information regardless of the way firms innovate, turn out to be oversimplified. Since innovation is a multidimensional process, the application of exploratory data analysis gives additional insight into its nature. Many authors have recently stated the need to develop appropriate indicators of innovation practices and to examine how these vary across regions and industries. The goal of the study is to better exploit the potential of innovation studies by producing disaggregated indicators that identify how firms innovate, illustrated by a real example based on data from the 2010 Community Innovation Survey. The study tries to shed light on this by applying multidimensional statistical analysis to group enterprises according to their innovation practices and to identify the resulting patterns. Specifically, factor analysis and k-means clustering are used to derive innovation practices and to group enterprises according to them. The research reveals differences in innovation practices observed at the regional level when compared with national results. These are particularly clear when the tetrachoric correlation is used as an input to the factor analysis. The interpretation of the underlying modes of innovation activity increases understanding of which innovation strategies are prevalent in the region as compared with the countrywide picture. The conclusions are both technical and substantive in nature.
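Because survey answers on innovation practices are typically binary (yes/no), the tetrachoric correlation mentioned above is estimated from 2x2 contingency tables. The sketch below uses the well-known cosine-pi approximation rather than a full maximum-likelihood estimate; it is an illustration under that assumption, and the function names are hypothetical rather than taken from the study.

```python
from math import cos, pi, sqrt


def two_by_two(x, y):
    """Cross-tabulate two binary indicator lists into cell counts (a, b, c, d).

    a = both 1, b = x only, c = y only, d = both 0.
    """
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)
    d = sum(1 for xi, yi in zip(x, y) if not xi and not yi)
    return a, b, c, d


def tetrachoric_approx(a, b, c, d):
    """Cosine-pi approximation: r = cos(pi / (1 + sqrt(ad / bc)))."""
    if a * d == 0 and b * c == 0:
        raise ValueError("table is degenerate: correlation is undefined")
    if b * c == 0:
        return 1.0   # limit of the formula as the odds ratio grows without bound
    if a * d == 0:
        return -1.0  # limit of the formula as the odds ratio goes to zero
    odds_ratio = (a * d) / (b * c)
    return cos(pi / (1 + sqrt(odds_ratio)))
```

The matrix of such pairwise values over all binary indicators could then replace the Pearson correlation matrix as the input to a factor analysis routine.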

References ARUNDEL, A. et al. (2007): How Europe’s Economies Learn: A Comparison of Work Organization and Innovation Mode for the EU-15. Industrial and Corporate Change, Vol. 16, Number 6.

Keywords INNOVATION METRICS, EXPLORATORY ANALYSIS


The Impact of Student Loans on Personal Financing of Higher Education in Germany
Alexandra Schwarz
German Institute for International Educational Research, Schlossstr. 29, D-60486 Frankfurt am Main, Germany, [email protected]

Abstract. Most colleges and universities in Germany are state-funded and free of charge. As a consequence, Germany does not have a loan culture for financing higher education such as exists, for example, in the United States. The major part of private expenses (living expenses during university studies, learning material etc.) is financed by the parents. In addition, federal student support (so-called “BAfoeG”) is granted to students whose parents cannot afford to fund their children’s education. Nonetheless, among the reasons to opt for vocational training instead of university studies, financial motives prevail. Financial security is of great importance to school graduates who are eligible to study: many do not feel up to the financial burden of studying or are unwilling to go into debt. This especially applies to school graduates from low-income families, where the parents themselves often do not have a university degree. In addition, problems in financing living expenses are one of the major reasons for university drop-out. To counteract social differentiation and enable students to study more efficiently, the German government proposed to introduce a student loan program that serves to finance students’ living expenses: in 2006 the state-owned KfW Bankengruppe launched the “KfW student loan”, an individual loan to be repaid with interest. It is granted independently of the student’s and his/her parents’ income, without collateral, and at a favorable interest rate. Based on an online survey, the German Institute for International Educational Research evaluated this loan program with respect to its effectiveness and its impact on personal financing of tertiary education. The study describes the methods deployed as well as detailed results of this evaluation, focusing on the individual funding requirements and financing structures of borrowers. The results clearly indicate that the KfW student loan plays an important role in the decision to take up and commence studies, and that this especially applies to students with a working-class or middle-class background.

Keywords HIGHER EDUCATION, EDUCATION FUNDING, STUDENT LOANS


Espionage Risk Assessment for Security of Defense based Research and Technology
Dirk Thorleuchter1 and Dirk Van den Poel2
1 Fraunhofer INT, Appelsgarten 2, 53879 Euskirchen, Germany [email protected]
2 Ghent University, Faculty of Economics and Business Administration, Tweekerkenstraat 2, B-9000 Gent, Belgium [email protected]

Abstract. Governmental and industrial espionage in security and defense based research and technology (R&T), where sensitive information is collected without the permission of the information holder, is becoming an increasingly serious economic and security problem for companies and governments. We introduce a new methodology that investigates the information leakage risk of security and defense based R&T projects with respect to governmental or industrial espionage. The methodology extends the well-known risk assessment methodology and consists of two steps. In the first step, the sensitivity of a project is estimated by human experts: projects are assigned qualitatively to different sensitivity classes using an adapted risk assessment methodology. The second step supports this qualitative estimation with a new quantitative methodology. Text and web mining are used to extract relevant information from strategic documents and to crawl relevant technological information from the internet. Text classification is then used to assign the retrieved information to the different aspects identified as relevant for the qualitative estimation.

References
THORLEUCHTER, D. and VAN DEN POEL, D. (2011): Semantic Technology Classification. In: Uncertainty Reasoning and Knowledge Engineering. IEEE Conference Publications Management Group, NJ, USA, 36–39.
THORLEUCHTER, D., VAN DEN POEL, D. and PRINZIE, A. (2010): A compared R&D-based and patent-based cross impact analysis for identifying relationships between technologies. Technological Forecasting and Social Change, 77 (7), 1037–1050.
THORLEUCHTER, D., GERICKE, W., WECK, G., REILAENDER, F. and LOSS, D. (2009): Vertrauliche Verarbeitung staatlich eingestufter Information - die Informationstechnologie im Geheimschutz. Informatik-Spektrum, 32 (2), 102–109.

Keywords ESPIONAGE, TECHNOLOGY, TEXT CLASSIFICATION


Using Latent Class Models with Random Effects for Investigating Local Dependence
Matthias Trendtel∗ and Ali Ünlü
Chair for Methods in Empirical Educational Research, TUM School of Education, Technische Universität München, Lothstrasse 17, 80335 Munich, Germany {matthias.trendtel,ali.uenlue}@tum.de

Abstract. Local independence, i.e., stochastic independence given the latent variable, is one of the key assumptions underlying latent variable modeling approaches such as item response theory (e.g., Hambleton et al. (1991)). It is a strong assumption, which may not hold in realistic contexts. Generalizations are possible, however. Latent class models with random effects (LCMRE; Qu et al. (1996)) allow for a general local dependence structure among items, including local independence as a special case. In this paper, we demonstrate how the LCMRE approach can be used to model various local dependence structures among psychometric items. We derive a measure quantifying the degree of local dependence for pairs of items. This measure can be viewed as a dissimilarity function in the sense of psychophysical scaling (Dzhafarov and Colonius (2007)) and so allows the local dependence structure of a set of items to be represented graphically, with pairwise psychophysical distances, in Euclidean 2D space. We illustrate our approach by simulations and by investigating the local dependence structures in item types and instances of large scale assessment data from the Programme for International Student Assessment (PISA; OECD (2012)).

References DZHAFAROV, E.N. and COLONIUS, H. (2007): Dissimilarity Cumulation Theory and Subjective Metrics. Journal of Mathematical Psychology, 51, 290–304. HAMBLETON, R.K., SWAMINATHAN, H. and ROGERS, H.J. (1991): Fundamentals of Item Response Theory. Sage Publications, Newbury Park, CA. OECD (2012): PISA 2009 Technical Report. OECD Publishing, Paris. QU, Y., TAN, M. and KUTNER, M.H. (1996): Random Effects Models in Latent Class Analysis for Evaluating Accuracy of Diagnostic Tests. Biometrics, 52, 797–810.

Keywords LOCAL DEPENDENCE, LATENT CLASS ANALYSIS WITH RANDOM EFFECTS, 2D VISUALIZATION, LARGE SCALE ASSESSMENT, PISA

∗ PhD student


Music Genre Prediction by High-Level Instrument and Harmony Characteristics
Igor Vatolkin1, Günther Rötter2, and Claus Weihs3
1 TU Dortmund, Chair of Algorithm Engineering [email protected]
2 TU Dortmund, Institute for Music and Music Science [email protected]
3 TU Dortmund, Chair of Computational Statistics [email protected]

Abstract. Music genre prediction typically relies on low-level audio signal features from the time, spectral or cepstral domains. Another way is to use community-based statistics such as Last.FM tags. Whereas the first feature group often cannot be clearly interpreted by listeners, the second suffers from erroneous or unavailable data for less popular songs. We propose a two-level approach combining the specific advantages of both groups: first, we create high-level descriptors that describe instrumental and harmonic characteristics of music content, some of them derived from low-level features by supervised classification (Vatolkin et al. (2012)) or from the analysis of extended chroma and chord features (Mauch and Dixon (2010)). Our previous study (Rötter et al. (2011)) demonstrated the high relevance of these high-level features for personal music categories, so they can themselves be used as input to supervised genre classification. We discuss the performance with respect to classification error, feature set size and feature interpretability.

References
MAUCH, M. and DIXON, S. (2010): Approximate Note Transcription for the Improved Identification of Difficult Chords. In: Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR), 135–140.
RÖTTER, G., VATOLKIN, I. and WEIHS, C. (2011): Computational Prediction of High-Level Descriptors of Music Personal Categories. Accepted for Proc. of the 2011 GfKl 35th Annual Conf. of the German Classification Society (GfKl).
VATOLKIN, I., PREUß, M., RUDOLPH, G., EICHHOFF, M. and WEIHS, C. (2012): Multi-Objective Evolutionary Feature Selection for Instrument Recognition in Polyphonic Audio Mixtures. Accepted for Soft Computing, Special Issue on Evolutionary Music.

Keywords HIGH-LEVEL MUSIC FEATURES, MUSIC GENRE CLASSIFICATION


The OECD’s Programme for International Student Assessment (PISA) Study: A Review of Its Basic Psychometric Concepts
Ali Ünlü, Daniel Kasper∗ and Matthias Trendtel†
Chair for Methods in Empirical Educational Research, TUM School of Education, Technische Universität München, Lothstrasse 17, 80335 Munich, Germany {ali.uenlue,daniel.kasper,matthias.trendtel}@tum.de

Abstract. The Programme for International Student Assessment (PISA; e.g., OECD, 2002, 2004, 2007, 2012) is an international large scale assessment study that aims to assess the skills and knowledge of 15-year-old students and, based on those results, to compare education systems across the approximately 70 participating countries (with a minimum of circa 4,500 tested students per country). The initiator of this programme is the Organisation for Economic Co-operation and Development (OECD); see www.pisa.oecd.org. We review the main methodological techniques of the PISA study. Primarily, we focus on the psychometric procedures applied for scaling items and persons, and we recapitulate the methods applied in PISA for longitudinal data analysis. The construction of the PISA proficiency scales and the proficiency levels derived by discretizing the continua are discussed as well. Finally, we raise questions and suggestions, in the hope that along these lines the PISA analyses can be better understood and evaluated and, if necessary, improved.

References OECD (2002): Sample Tasks from the PISA 2000 Assessment. OECD Publishing, Paris. OECD (2004): Learning for Tomorrow’s World: First Results from PISA 2003. OECD Publishing, Paris. OECD (2007): PISA 2006: Science Competencies For Tomorrow’s World. OECD Publishing, Paris. OECD (2012): PISA 2009 Technical Report. OECD Publishing, Paris.

Keywords PROGRAMME FOR INTERNATIONAL STUDENT ASSESSMENT, LARGE SCALE ASSESSMENT, ITEM RESPONSE THEORY, PSYCHOMETRICS

∗ PhD student
† PhD student


Part X

Biostatistics and Bioinformatics

Rank aggregation for candidate gene selection
Andre Burkovski1, Ludwig Lausser1 and Hans A. Kestler1
1 Research Group Bioinformatics and Systems Biology, Institute of Neural Information Processing, Ulm University, 89069 Ulm, Germany {andre.burkovski, ludwig.lausser, hans.kestler}@uni-ulm.de

Abstract. Molecular processes in biological systems are normally influenced or even determined by the gene expression levels (RNA concentrations) of the involved cells. It is believed that the higher the concentration of an RNA molecule, the stronger its activating (or inhibiting) influence on the process. In order to explain changes in such systems, high-dimensional expression profiles are screened for differentially expressed genes. The top-scoring genes are then considered candidates for further analysis. The measured differences may not be related to the biological process, as they can also be caused by variation in measurement or by other sources of noise. An alternative approach is the analysis of relative ranks of gene expression within a single profile. While measurements within a single profile can be considered comparable, measurements between profiles may vary considerably. Ranking the values and aggregating profiles from the same group can extract stable relationships. The aggregated ranking can be considered a consensus of single profiles. Comparison of intra- and inter-class aggregated rankings reveals changes in gene activity. The intersection between these rankings results in specific feature sets. These sets provide candidate genes for further investigation or analysis. Rankings can be aggregated in several ways. We will compare statistical and positional methods and their application to artificial as well as real world data. The resulting consensus rankings may be used to identify specifically expressed genes and differences between groups.
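As an illustration of a positional aggregation method, the following sketch ranks genes within each profile and builds a consensus by mean rank (a Borda-style rule). The abstract does not state which positional methods are compared, so the choice of rule, the function names and the dictionary-based data layout are assumptions made for the example.

```python
def rank_profile(profile):
    """Map each gene to its within-profile rank (1 = highest expression)."""
    ordered = sorted(profile, key=lambda gene: -profile[gene])
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}


def borda_consensus(profiles):
    """Aggregate several profiles into one consensus ranking by mean rank.

    Assumption: every profile measures the same set of genes.
    """
    genes = list(profiles[0])
    ranks = [rank_profile(p) for p in profiles]
    mean_rank = {g: sum(r[g] for r in ranks) / len(ranks) for g in genes}
    return sorted(genes, key=lambda g: mean_rank[g])


def top_k_shared(consensus_a, consensus_b, k):
    """Genes that appear among the top k of both consensus rankings."""
    return set(consensus_a[:k]) & set(consensus_b[:k])
```

Because only within-profile ranks enter the consensus, profiles measured on different scales contribute equally, which is the property the abstract relies on.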

References SCHALEKAMP, F. and ZUYLEN, A. (2009): Rank aggregation: Together we’re strong. In I. Finocchi and J. Hershberger (Eds.): 11th Workshop on Algorithm Engineering and Experiments, ALENEX 2009, New York, New York, USA, 38–51.

Keywords RANK AGGREGATION, RANKING, CANDIDATE GENE SELECTION

Andre Burkovski and Ludwig Lausser are PhD students.


Unsupervised dimension reduction methods for protein sequence classification
Dominik Heider1, Christoph Bartenhagen2, J. Nikolaj Dybowski1, Sascha Hauke3, Martin Pyka4, and Daniel Hoffmann1
1 Dept. of Bioinformatics, University of Duisburg-Essen, Universitaetsstr. 2, 45141 Essen, Germany, {dominik.heider, nikolaj.dybowski, daniel.hoffmann}@uni-due.de
2 Dept. of Medical Informatics, University of Münster, Domagkstr. 9, 48149 Münster, Germany, [email protected]
3 CASED, Technische Universität Darmstadt, Mornewegstr. 32, 64293 Darmstadt, Germany, [email protected]
4 Dept. of Psychiatry and Psychotherapy, Philipps-University Marburg, Rudolf-Bultmann-Str. 8, 35039 Marburg, Germany, [email protected]

Abstract. Feature extraction methods are widely applied in order to reduce the dimensionality of data for subsequent classification, thus decreasing the risk of noise fitting. Principal Component Analysis (PCA) is a popular linear method for transforming high-dimensional data into a low-dimensional representation. Non-linear and non-parametric methods for dimension reduction, such as Isomap and Stochastic Neighbor Embedding (SNE), are also used. In this study, we compare the performance of PCA, Isomap, t-SNE and Interpol as preprocessing steps for the classification of protein sequences. Using random forests, we compared the classification performance on two artificial and nineteen real-world protein data sets, including HIV drug resistance, HIV-1 co-receptor usage and protein functional class prediction, preprocessed with PCA, Isomap, t-SNE and Interpol. Significant differences between these feature extraction methods were observed. The prediction performance of Interpol converges towards a stable and significantly higher value compared to PCA, Isomap and t-SNE. This is probably due to the nature of protein sequences, where amino acids often depend on and affect one another to achieve, for instance, conformational stability. However, visualization of data reduced with Interpol is rather unintuitive compared to the other methods. We conclude that Interpol is superior to PCA, Isomap and t-SNE for feature extraction prior to classification, but is of limited use for visualization.

Keywords MACHINE LEARNING, FEATURE EXTRACTION, PROTEINS, SEQUENCES


Prediction of Surgery Duration Using Data Mining Methods on Anaesthesia Protocols
Pawel Matuszyk1∗, Dominik Brammen2, René Schult1, and Myra Spiliopoulou1
1 Otto-von-Guericke-University, Faculty of Computer Science, Magdeburg, Germany (pawel.matuszyk|myra)@ovgu.de, [email protected]
2 University Hospital Magdeburg, Department of Anesthesiology and Intensive Care [email protected]

Abstract. Exact scheduling of surgeries plays an essential role in the efficiency of a hospital. It reduces waiting times for patients as well as for medical staff. A precise schedule not only avoids frustration among the personnel, but also improves the utilisation of surgery rooms and reduces overtime costs. In order to create a realistic schedule, it is necessary to estimate the duration of a future surgery accurately. Popular methods for estimating the duration of a surgery today are mean values and estimates based on the experience of the medical staff. However, these methods often show a high prediction error. In this paper we propose a method for estimating the duration of a future surgery based on data mining techniques. It encompasses discretisation of the class attribute, which is the duration of a surgery, and classification using decision trees. The data used for learning the data mining models are anaesthesia protocols, which have to be recorded in every hospital. We found that our method reduces the absolute prediction error by up to 31.03 percent compared with the average-based estimates.³
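The discretisation step and the error measure can be sketched as follows. The abstract does not say which discretisation scheme was used, so equal-frequency binning is an assumption here, as are all function names; the decision tree itself is omitted.

```python
from statistics import mean


def equal_frequency_cuts(durations, n_bins):
    """Cut points that split the sorted durations into n_bins groups of similar size."""
    ordered = sorted(durations)
    return [ordered[(i * len(ordered)) // n_bins] for i in range(1, n_bins)]


def to_class(duration, cuts):
    """Index of the duration class (0 = shortest) a surgery falls into."""
    return sum(duration >= cut for cut in cuts)


def mean_absolute_error(actual, predicted):
    """Average absolute deviation between actual and predicted durations."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))
```

A classifier trained on the class indices would predict a class per surgery; mapping each class back to a representative duration (for example the class median) allows its `mean_absolute_error` to be compared with that of the plain mean-value baseline.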

References
1. Schult, R.; Matuszyk, P.; Spiliopoulou, M.: Prediction of Surgery Duration using Empirical Anesthesia Protocols. In: KD-HCM 2011.
2. Schult, R.; Matuszyk, P.; Spiliopoulou, M.: Framework for Computer Aided Analysis of Medical Protocols in a Hospital. In: HEALTHINF 2012.

Keywords Data Mining, Anaesthesia Protocols, Discretisation, Surgery Duration

∗ Ph.D. student
³ Other versions of the results have been presented in [1, 2].


The critical noise level for learning Boolean functions
Markus Maucher, Christian Wawra, and Hans A. Kestler

Bioinformatics and Systems Biology Group, Ulm University, Ulm, Germany [email protected], [email protected], [email protected]

Abstract. The inference of gene regulatory systems from time series measurements, which aims to reveal the global functionality of a cell, is a challenging task. Among several reconstruction methods, Boolean networks have been successfully applied to such data. As time-resolved gene expression measurements at different stages of a cell are difficult and expensive, all reconstruction methods are faced with a relatively small number of time points compared to the number of genes. In addition to this dimension problem, biological systems as well as measurement techniques are subject to noise. In this work, we present an analysis of the reconstructability of Boolean networks in the case of noisy data. We introduce the notion of the critical noise level, a characteristic of a function that measures the complexity of reconstructing it from noisy time series data. This measure constitutes a natural upper bound for the noise probability under which a function can still be reconstructed, but it can also be incorporated into the reconstruction process to improve reconstruction results. We show how to efficiently compute the critical noise level of any given Boolean function and present experimental data showing how it can be used to improve the best-fit extension algorithm for the reconstruction of a Boolean network from noisy time series data.
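The best-fit idea that the abstract builds on can be illustrated in a simplified setting: if the input genes of a target gene are already known, the Boolean function with the fewest mismatches against noisy observations is obtained by a majority vote per input pattern. The sketch below shows only this step; it does not compute the critical noise level, whose definition is not given in the abstract, and the names and the tie-breaking rule are assumptions.

```python
from collections import defaultdict


def best_fit_function(observations):
    """Return the truth table (input tuple -> bit) with the fewest mismatches.

    observations: iterable of (input_tuple, output_bit) pairs, possibly noisy.
    Assumption: ties between 0 and 1 are resolved in favour of 0.
    """
    votes = defaultdict(lambda: [0, 0])  # per input pattern: [count of 0s, count of 1s]
    for inputs, output in observations:
        votes[inputs][output] += 1
    return {inputs: int(ones > zeros) for inputs, (zeros, ones) in votes.items()}


def mismatches(function, observations):
    """Number of observations that the given truth table does not reproduce."""
    return sum(function[inputs] != output for inputs, output in observations)
```

With noisy measurements of an XOR gate, for instance, the vote recovers the XOR truth table as long as the flipped observations stay in the minority for every input pattern.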

References
KAUFFMAN, S.A. (1993). The Origins of Order: Self-Organization and Selection in Evolution. Oxford University Press.
LÄHDESMÄKI, H., SHMULEVICH, I., YLI-HARJA, O. (2003). On learning gene regulatory networks under the boolean network model. Machine Learning, 52(1–2):147–167.

Keywords BOOLEAN FUNCTIONS, BOOLEAN NETWORKS, SYSTEMS BIOLOGY, RANDOM PERTURBATIONS


Decision tree ensembles with different split criteria
Sergej Potapov1, Asma Gul2, Werner Adler1, and Berthold Lausen2
1 University of Erlangen-Nuremberg, Germany {sergej.potapov,werner.adler}@imbe.med.uni-erlangen.de
2 University of Essex, United Kingdom {agul,blausen}@essex.ac.uk

Abstract. In recent years many papers have discussed boosting and bagging based methods for supervised learning. Both concepts aggregate sets of estimated trees, which are derived by split criteria without adjusting for variables measured on different scales. Breiman et al. (1984) observed that quantitative variables tend to be selected more often than binary variables. As a solution, Lausen et al. (1994, 2004) introduced p-value adjusted classification and regression trees, which use the p-value of maximally selected test statistics as the split criterion. The p-value adjustment avoids the possible selection bias towards variables measured on different scales. The R package TWIX of Potapov et al. (2012) offers p-value adjusted classification trees. In our paper we compare bagging and double-bagging (Hothorn and Lausen, 2003), with and without p-value adjustment, by means of simulation. Moreover, we illustrate our approach using a clinical study involving microarray data.

References
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984): Classification and regression trees. Wadsworth Press.
Hothorn, T., Lausen, B. (2003): Double-bagging: Combining classifiers by bootstrap aggregation. Pattern Recognition 36(6), 1303–1309.
Lausen, B., Hothorn, T., Bretz, F., Schumacher, M. (2004): Assessment of optimal selected prognostic factors. Biometrical Journal 46, 364–374.
Lausen, B., Sauerbrei, W., Schumacher, M. (1994): Classification and regression trees (CART) used for the exploration of prognostic factors measured on different scales, in: Dirschedl, P., and Ostermann, R. (eds.), Computational Statistics, Physica-Verlag, Heidelberg, 483–496.
Potapov, S., Theus, M. (2012): The TWIX package (Version 0.2.19.). http://cran.r-project.org

Keywords ENSEMBLE LEARNING, CLASSIFICATION TREES, BAGGING


A Transductive Set Covering Machine
Florian Schmid∗1, Ludwig Lausser†1 and Hans A. Kestler1
1 Research Group Bioinformatics and Systems Biology, Institute of Neural Information Processing, Ulm University, 89069 Ulm, Germany {florian-1.schmid, ludwig.lausser, hans.kestler}@uni-ulm.de

Abstract. Classifying tissue samples according to genetic markers is one of the basic tasks in molecular medicine. Tissue samples are represented as high-dimensional gene expression profiles obtained from high-throughput experiments. It is often assumed that these profiles contain only a small set of predictive features. A classifier based on such small marker combinations is the Set Covering Machine (SCM) with data dependent rays (Kestler et al., 2011). The SCM is an ensemble scheme that finds a minimal set of base classifiers covering all positive samples while minimizing the misclassification of negative ones. Applied with univariate ray classifiers, the SCM can be used to construct a decision rule of very low dimensionality. An SCM is normally trained in an inductive way, i.e. it is initially adapted to a set of previously labeled training samples. This kind of learning might not be optimal in scenarios based on expensive or time consuming labeling processes; in this setting labeled training samples are often rare. The small sample size limits the training of an inductive classifier, and the resulting models are likely to be affected by overfitting. Transductive learning is an alternative in this scenario. This learning scheme additionally incorporates unlabeled samples into the training process. In this work we propose and characterize a transductive version of the SCM with data dependent rays. The classifier is evaluated on partially labeled microarray data with different ratios of labeled and unlabeled data.

References
Kestler, H., Lausser, L., Lindner, W., Palm, G.: On the fusion of threshold classifiers for categorization and dimensionality reduction. Computational Statistics 26(2), 321–340 (2011)

Keywords TRANSDUCTIVE LEARNING, CLASSIFICATION, SET COVERING MACHINE, SCM

∗ Student
† PhD student


Part XI

LIS’12 Workshop

LinSearch – Efficient Indexing at the Technische Informationsbibliothek, Hannover
Dr. Debora Daberkow, Dr. Petra Mensing, Dr. Irina Sens, Claudia Todt
Technische Informationsbibliothek Hannover, Welfengarten 1b, 30167 Hannover [email protected]

Abstract. The Technische Informationsbibliothek Hannover (TIB) (http://www.tibhannover.de/) is the German central specialist library for engineering as well as architecture, chemistry, computer science, mathematics and physics. With GetInfo (https://getinfo.de/app), TIB offers its users a specialist portal for engineering and the natural sciences. About 45 million objects (texts, research data, audiovisual media, 3D models) are indexed in the central index. Because the amount of available information is growing exponentially, it is hardly possible any longer to classify all objects manually. Following an extensive project phase, the GetInfo metadata are now classified using (semi-)automatic methods. The procedure for the automatic assignment of metadata has four stages in total. The first stage allows a blanket assignment of records to one of the six core subjects of TIB. In the second stage, all classification data present in the records, such as DDC, MSC and others, are used to enable an automatic subject assignment. In the subsequent stage, ISSN and conference data are used for the assignment. If no processing has been possible up to this point, the affected records are passed to the averbis extraction platform for classification. The aim of the procedure is to assign all records to the six TIB subjects: architecture, chemistry, computer science, mathematics, physics and engineering. This yields an additional filter for searching in the GetInfo portal.
The first three stages of this procedure were developed in-house at TIB and use lexical resources created specifically for this purpose, which emerged from the LINSearch project. The fourth stage, by contrast, is based on machine learning methods and was built with the help of the company averbis.
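The four-stage procedure is a classification cascade in which each record is handled by the first stage that can decide. A minimal sketch of such a cascade follows. All lookup tables, field names and the fallback function are hypothetical placeholders: the abstract does not describe the lexical resources or the interface of the averbis extraction platform, and it is an assumption here that the blanket assignment of stage one is keyed on the data source.

```python
# Hypothetical lookup tables standing in for the lexical resources of stages 1 to 3.
SOURCE_TO_SUBJECT = {"chem-abstracts-feed": "chemistry"}
CLASS_PREFIX_TO_SUBJECT = {
    "DDC:004": "computer science",  # DDC 004: computer science
    "DDC:51": "mathematics",        # DDC 510-519: mathematics
    "MSC:": "mathematics",          # Mathematics Subject Classification
}
ISSN_TO_SUBJECT = {"1234-5678": "physics"}  # made-up ISSN for illustration


def fallback_classifier(record):
    """Placeholder for the machine learning stage; always answers 'engineering'."""
    return "engineering"


def classify(record):
    """Return (subject, stage) for a metadata record given as a dict."""
    # Stage 1: blanket assignment, assumed here to depend on the data source.
    subject = SOURCE_TO_SUBJECT.get(record.get("source"))
    if subject:
        return subject, 1
    # Stage 2: classification codes already present in the record.
    for code in record.get("classifications", []):
        for prefix, mapped in CLASS_PREFIX_TO_SUBJECT.items():
            if code.startswith(prefix):
                return mapped, 2
    # Stage 3: ISSN (conference data would be handled analogously).
    subject = ISSN_TO_SUBJECT.get(record.get("issn"))
    if subject:
        return subject, 3
    # Stage 4: everything else goes to the learned classifier.
    return fallback_classifier(record), 4
```

Returning the stage number alongside the subject makes it easy to measure how many records each stage resolves, and therefore how much work remains for the learned classifier.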

Keywords AUTOMATIC SUBJECT INDEXING, CLASSIFICATION, METADATA


The Challenge of a “New Classification for Open-Shelf Collections”: 3 Practical Examples from Switzerland
Uwe Geith1 and Dr. Wolfgang Giella2
1 ZHAW Hochschulbibliothek, Winterthur [email protected]
2 ZHAW Hochschulbibliothek, Winterthur [email protected]

Abstract. The need to introduce a new shelf classification can have various reasons. Yet evaluating and introducing a new classification always poses a challenge for a library, especially for smaller libraries. Three examples from Switzerland are considered:
• Kantonsbibliothek Graubünden, Chur (Basisklassifikation)
• branch libraries of the ZHAW Hochschulbibliothek, Zürich site (RVK)
• branch libraries of the ZHAW Hochschulbibliothek, Winterthur site (DDC)
Using these examples, the background to the decision for a particular classification is laid out and the execution of the corresponding projects is outlined. Differing goals, strategies and approaches are contrasted, and it is shown that the decision for a particular classification can have not only content-related but also pragmatic reasons.

Keywords SWITZERLAND, LIBRARY, SHELF CLASSIFICATION, SELECTION


Subject Search in Swiss Online Catalogues and Discovery Systems
Uwe Geith
ZHAW Hochschulbibliothek, Winterthur [email protected]

Abstract. Legions of librarians enrich bibliographic records with subject indexing information. They assign subject headings or classify, or even do both. But for whom do they go to all this trouble? It is well known that only a small percentage of library users make use of the subject search entry points. Swiss OPACs with their classic search options illustrate the diversity of practice in both verbal and classificatory subject indexing. The focus here is on the Aleph OPACs of the Informationsverbund Deutschschweiz (IDS), which are classic union catalogues. There is a strong trend towards replacing OPACs with resource discovery systems (RDS). Which subject searches can be carried out in the discovery systems based on search engine technology that are currently in use in Switzerland? “swissbib”, “NEBIS recherche” and the web portal “e-lib.ch” are presented. The talk both reflects on the current state of subject search in Switzerland and attempts to point out possible extensions and new approaches.

Keywords SWITZERLAND, SUBJECT INDEXING, ONLINE SEARCH, OPAC, RESOURCE DISCOVERY SYSTEM


Processing Subject Indexing Elements in Discovery Systems: Towards a User-Oriented Use of Subject Indexing in the E-LIB Bremen
Dr. Elmar Haake
Staats- und Universitätsbibliothek Bremen, Bibliothekstr., 28359 Bremen [email protected]

Abstract. Current commercial internet services show that the success of a web service depends considerably on the quality and the contemporary presentation of its offerings. The practical value of the services should be immediately apparent. Usability considerations, clear and simple structures and a restriction to the essentials can make web offerings more attractive to untrained users. The online presentation of a library’s various services, by contrast, is still too strongly shaped by questions of technical feasibility and by the librarian’s professional perspective. This applies in particular to library search tools. The metadata presentation of current library catalogues, in all its abundance, is barely understood by untrained users. Within the E-LIB Bremen we are experimenting with new ways of presenting and evaluating metadata, including subject indexing metadata, in order to make it easier for users to modify their search path in a comprehensible way or to motivate them to browse by topic. Statistical analysis of the complete result set of a search query provides the basis for developing numerous new recommendation functions. Internally, the same services also serve as a suggestion function that simplifies classification.

References
J. Rochkind (2007): (Meta)search Like Google. The time has come for libraries, too, to negotiate for rights to index full text, Library Journal.
J. Wang and A. Lim (2009): Local touch and global reach: The next generation of network, Library Management 30, No. 1/2, 25-34.
M. Parry (2009): After Losing Users in Catalogs, Libraries Find Better Search Software, The Chronicle of Higher Education, 28 Sept. 2009.
M. Blenkle, R. Ellis, E. Haake (2009): Next-generation library catalogues: review of E-LIB Bremen, Serials 22(2).
W. Gödert (2004): Navigation und Konzepte für ein interaktives Retrieval im OPAC oder: von der Informationserschließung zur Wissenserkundung, Mitteilungen der Vereinigung Österreichischer Bibliothekarinnen & Bibliothekare 57, Nr. 1, 70-80.


M. Blenkle, R. Ellis, E. Haake (2009): E-LIB Bremen. Automatische Empfehlungsdienste für Fachdatenbanken im Bibliothekskatalog / Metadatenpools als Wissensbasis für bestandsunabhängige Services, Bibliotheksdienst 43. Jg., 6, 618-627.
E. Haake (2009): Erschliessen Sie immer noch oder lassen Sie auch schon indexieren? Vortrag, 8. Fortbildungstreffen der Arbeitsgruppe Fachreferat Naturwissenschaften, 21.09.2009.
R. Siegmüller (2007): Verfahren der automatischen Indexierung in bibliotheksbezogenen Anwendungen. Berlin: Institut für Bibliotheks- und Informationswissenschaft der Humboldt-Universität zu Berlin, 2007. 106 S., graph. Darst. (Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft; 214). ISSN: 1438-7662.

Keywords intuitive use of subject-indexing elements, serendipity, use of the catalogue as a knowledge base for subject indexing

The Blog as a Thesaurus Database
Andreas Ledl
University Library of Basel, Switzerland [email protected]

Abstract. Overviews of online thesauri, classifications and ontologies are currently, as a rule, scattered across the Internet in more or less incomplete link lists. So far there has been no central, virtual place that has set itself the task of providing as complete a compilation as possible of all freely accessible indexing instruments, with an international scope. The blog Thesaurusportal (http://thesaurusportal.blogspot.com), which will eventually contain more than 500 freely accessible thesauri, classifications and ontologies in 47 languages, now makes such an offering available for the first time. It is aimed at interested laypeople, students and researchers who search for literature or information using the Block Building Approach and need a controlled vocabulary to do so; at (academic) librarians, university lecturers, teachers and other mediators of information literacy who would like to introduce their students or pupils to precisely such professional approaches and need inspiring material for this; at ontology engineers, since thesauri serve as a foundation for mapping complex knowledge relationships and thus for designing semantic networks; and at libraries, archives, museums, research institutions and similar organizations, where they can be used as indexing tools. The talk presents the blog and argues that, alongside its ambitions in terms of content, its form of presentation is also new. At present, libraries use blogs mainly to spread news about the institution or its holdings. Yet blogs are ideally suited, particularly for smaller amounts of data, to replacing static and unwieldy link lists, and they can serve as sustainable open access databases with Web 2.0 functionality.

Keywords Blog, Thesaurus, Database, Classification, Open Access


EXCERPT FROM THE 2011 LITERATURE REPORT: DEWEY DECIMAL CLASSIFICATION (DDC)
Bernd Lorenz
Fachhochschule für öffentliche Verwaltung und Rechtspflege in Bayern, Fachbereich Archiv- und Bibliothekswesen, München [email protected]

Abstract. "The 23rd edition of the DDC enhances the efficiency and accuracy of your classification work in ways no previous editions have done." Cf. http://www.oclc.org/dewey/
Effenberger, Claudia: Ein semantisches Netz für die Suche mit der Dewey-Dezimalklassifikation - Optimiertes Retrieval durch die Verwendung versionierter DDC-Klassen (= Mitteilungen der VÖB 64, 2011, pp. 270-289)
Effenberger, Claudia - Hauser, Julia: Would an Explicit Versioning of the DDC Bring Advantages for Retrieval? In: Concepts in Context. Proceedings of the Cologne Conference on Interoperability and Semantics in Knowledge Organization, July 19th-20th, 2010. Ed. by Felix Boteram, Winfried Gödert, Jessica Hubrich. Würzburg: Ergon, 2011, pp. 123-132
Golub, Koraljka: Automated Subject Classification of Textual Documents in the Context of Web-Based Hierarchical Browsing (= Knowledge Organization 38, 2011, pp. 230-244) (use of the DDC)
Green, Rebecca: See-also Relationships in the Dewey Decimal Classification (= Knowledge Organization 38, 2011, pp. 335-341)
Grüter, Doris - Kölbl, Andrea Pia - Villinger, Martin - Walger, Nicole: Genese, Aufgaben und Zukunft der Vifarom: Konzept und DFG-Förderung einer Virtuellen Fachbibliothek aus der Praxisperspektive (= ZfBB 58, 2011, pp. 59-71) (pp. 61 f.: use of the DDC)
Schöning-Walter, Christa: Automatische Erschließungsverfahren für Netzpublikationen. Zum Stand der Arbeiten im Projekt PETRUS (= Dialog mit Bibliotheken 23, 2011, pp. 31-36; incl. advertising section) (also covers work with DDC-Sachgruppen)

Keywords DEWEY DECIMAL CLASSIFICATION, LITERATURE REVIEW

Development of a Tool for Visualizing the SWD/GND
Dr.-Ing. Jan Frederik Maas
Staats- und Universitätsbibliothek Hamburg, Von Melle Park 3, 20146 Hamburg [email protected]

Abstract. Subject indexing of media with cooperatively maintained authority files has established itself as a very flexible component of subject cataloguing, because such files can be extended continuously. In the German-speaking countries, the basis for subject indexing has so far been the Schlagwortnormdatei (SWD), which is gradually being replaced by the Gemeinsame Normdatei (GND). Beyond topical subject headings, the GND also contains the holdings of the Personennamendatei (PND), the Gemeinsame Körperschaftsdatei (GKD) and the uniform title file (Einheitssachtitel-Datei) of the Deutsches Musikarchiv. The SWD/GND serves primarily to standardize subject indexing. In addition, the structure of the SWD defines relations between subject headings that can greatly facilitate thematic searching. Examples of such relations are the narrower-term/broader-term relations (hyponym/hyperonym) and the relation of similarity between terms. To make working with the SWD/GND easier, a tool for searching the subject headings was built that visualizes the relations described and also supports complex queries, for example by means of regular expressions. This can make subject indexing of media easier, and it also makes errors that arise when establishing subject headings easier to avoid. Migrating the software to the structure of the GND is a worthwhile challenge, and the outlook for it will be discussed.

References
MAAS, Jan F. (2010): SWD-Explorer - Design und Implementierung eines Software-Tools zur erweiterten Suche und grafischen Navigation in der Schlagwortnormdatei. Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft 275.

Keywords SWD, GND, visualization, subject headings
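The two capabilities named in the abstract above, regular-expression search over subject headings and navigation along broader/narrower relations, can be illustrated with a minimal sketch. It is not the SWD-Explorer implementation: the tiny vocabulary `BROADER` and the function names `search`, `narrower` and `ancestors` are invented for illustration, and real SWD/GND records carry far more structure.

```python
import re

# Hypothetical miniature vocabulary (NOT real SWD/GND data):
# each heading maps to the list of its broader headings.
BROADER = {
    "Bibliothek": [],
    "Universitätsbibliothek": ["Bibliothek"],
    "Staatsbibliothek": ["Bibliothek"],
    "Katalog": [],
    "Bibliothekskatalog": ["Katalog"],
}


def search(pattern):
    """Return all headings matching a regular expression, sorted."""
    rx = re.compile(pattern)
    return sorted(term for term in BROADER if rx.search(term))


def narrower(term):
    """Return the direct narrower headings of a heading, sorted."""
    return sorted(t for t, parents in BROADER.items() if term in parents)


def ancestors(term):
    """Return all broader headings reachable from a heading."""
    seen = []
    stack = list(BROADER.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(BROADER.get(current, []))
    return seen


# Headings ending in lowercase "bibliothek", i.e. compounds only.
compounds = search(r"bibliothek$")
```

A graphical tool would draw the output of `narrower` and `ancestors` as edges of a graph; the sketch only shows the lookups such a view rests on.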


Practical Experiences with Machine Learning-based Text Categorization for Library Applications
Elisabeth Mödden (1), Mathias Lösch (2), Monika Lösse (1), and Ulrike Junger (1)
(1) Deutsche Nationalbibliothek, Adickesallee 1, D-60322 Frankfurt am Main [email protected] [email protected] [email protected]
(2) Universitätsbibliothek Bielefeld, Universitätsstr. 25, D-33619 Bielefeld [email protected]

Abstract. In recent years, text mining has gained more and more attention in the field of (digital) libraries. Potentially fruitful applications include tasks like automatic abstracting and automatic clustering or classification of documents. Our contribution presents experiences from two projects that aim at the automation of classification in the library domain using machine learning-based text categorization. The project "PETRUS", carried out by the German National Library, aims at the automatic classification of electronic publications by assigning DDC-Sachgruppen. The DDC-Sachgruppen, a scheme based on the Dewey Decimal Classification (DDC), consist of roughly one hundred subject classes and are used to structure the German National Bibliography. Coordinated at Bielefeld University Library, the project "Automatic Enhancement of OAI Metadata" aims at the automatic classification of Dublin Core metadata records, as employed by most institutional and subject repositories. The target category space of this project is the DDC, and in particular the subset formed by the DDC-Sachgruppen, which is the prevalent category system in the German repository landscape. A comparative analysis of the experiences from both projects reveals promising classification accuracy rates in both applications, but also similar challenges in constructing production-ready classifiers for library classification schemes. We therefore conclude with recommendations and best practices for the application of text categorization in the library context.

Keywords SUBJECT INDEXING, MACHINE LEARNING, TEXT MINING


Matching Title Data to Transfer Subject-Indexing Information Across Union Catalogue Boundaries
Magnus Pfeffer
Hochschule der Medien, Stuttgart [email protected]

Abstract. Subject indexing in the union catalogue databases is highly inconsistent, partly because cooperation is barely coordinated. This also affects works that appear in many editions and printings with essentially unchanged content. Last year I presented a procedure that groups the different editions and publication forms of a work and transfers the subject-indexing information of the indexed titles in a group to the titles that are not indexed. The basic idea is to match on a combination of the author or creator statements and the complete title. The procedure was applied to two union catalogue databases (Südwestverbund and Hebis). The results are impressive: before matching, 3,979,796 of the 12,777,191 monographs in the SWB were indexed with SWD subject headings and 3,235,958 with RVK notations; matching and the subsequent transfer made it possible to index a further 636,462 with SWD and 959,419 with RVK. The working groups of subject-indexing experts in both networks checked the results in random and systematic samples. They attested to the high quality of the results and recommended loading them into the production databases, which has since been done. Data extracts from the catalogue databases of the BVB and the HBZ are currently being prepared for the procedure. The results of merging the four union catalogues will be presented for the first time at the workshop.

Keywords LIBRARY UNION CATALOGS, SUBJECT HEADINGS, TITLE MATCHING
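The grouping step described in the abstract above (match on creator plus complete title, then pass subject headings from indexed records to unindexed ones in the same group) can be sketched as follows. This is only an illustration under stated assumptions: the normalization rules, the record layout and the sample records are invented, and the procedure actually applied to the union catalogues is likely to be considerably more careful.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Lowercase, strip diacritics and collapse punctuation (assumed rules)."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_key(record):
    """Combine normalized creator and full title into one matching key."""
    return normalize(record["creator"]) + "|" + normalize(record["title"])


def propagate_subjects(records):
    """Group records by key and copy pooled subjects to unindexed records."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)

    enriched = []
    for group in groups.values():
        pooled = sorted({s for rec in group for s in rec["subjects"]})
        for rec in group:
            if not rec["subjects"] and pooled:
                enriched.append({**rec, "subjects": pooled, "inherited": True})
            else:
                enriched.append({**rec, "inherited": False})
    return enriched


# Invented sample records: two editions of one work and an unrelated title.
records = [
    {"id": "a1", "creator": "Müller, Hans",
     "title": "Einführung in die Statistik", "subjects": ["Statistik"]},
    {"id": "a2", "creator": "Muller, Hans",
     "title": "Einführung in die Statistik.", "subjects": []},
    {"id": "b1", "creator": "Meier, Anna", "title": "Algebra", "subjects": []},
]

result = {rec["id"]: rec for rec in propagate_subjects(records)}
```

Here record `a2` inherits the heading of `a1` because both reduce to the same key, while `b1` stays unindexed because no record in its group carries subject information. The `inherited` flag is kept so that transferred headings remain distinguishable from intellectually assigned ones, which is what makes sample-based checking of the kind mentioned in the abstract possible.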


Data Enrichment in Discovery Systems using Linked Data
Dominique Ritze and Kai Eckert
Mannheim University Library, Germany {dominique.ritze,eckert}@bib.uni-mannheim.de

Abstract. The Linked Data Web is an abundant source of information that can be used to enrich information retrieval results. This can be helpful in many different scenarios, for example to enable extensive multilingual semantic search or to provide additional information to users. In general, data enrichment can take place in two ways: on the client side or on the server side. With client-side data enrichment, usually performed by JavaScript in the browser, users get additional information related to the results they are shown. This additional information is not stored in the retrieval system and is therefore not available to improve the actual search. An example would be the provision of links to external sources like Wikipedia, merely for convenience. By contrast, enrichment on the server side can be exploited to improve retrieval directly, at the cost of data duplication and additional effort to keep the data up to date. In this talk, we show various examples where discovery systems have been enriched on both the client and the server side. We compare the advantages and disadvantages of both variants and briefly demonstrate data enrichment in Primo at the Mannheim University Library.

Keywords Linked Data, Data Enrichment, Improving Discovery Systems
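The difference between the two variants compared in the abstract above can be made concrete with a small sketch. Everything in it is assumed for illustration: `LABELS` stands in for labels that would be fetched from a Linked Data source, the concept identifier and the record are invented, and a real discovery system such as Primo works quite differently internally.

```python
# Hypothetical multilingual labels, standing in for a Linked Data lookup.
LABELS = {
    "concept:42": {"en": "library", "de": "Bibliothek", "fr": "bibliothèque"},
}

# Invented catalogue record that links to the concept above.
RECORDS = [
    {"id": "r1", "title": "Die Bibliothek der Zukunft",
     "concepts": ["concept:42"]},
]


def build_index(records, enrich):
    """Server side: optionally copy the external labels into the search index."""
    index = {}
    for rec in records:
        terms = set(rec["title"].lower().split())
        if enrich:
            for concept in rec["concepts"]:
                labels = LABELS.get(concept, {})
                terms.update(label.lower() for label in labels.values())
        index[rec["id"]] = terms
    return index


def search(index, query):
    """Return the ids of all records whose index terms contain the query."""
    return sorted(rid for rid, terms in index.items() if query.lower() in terms)


def decorate(record):
    """Client-side style: attach labels for display only, index untouched."""
    labels = [LABELS[c] for c in record["concepts"] if c in LABELS]
    return {**record, "labels": labels}


plain_hits = search(build_index(RECORDS, enrich=False), "library")
enriched_hits = search(build_index(RECORDS, enrich=True), "library")
shown = decorate(RECORDS[0])
```

With the plain index, the English query finds nothing, because the record only has a German title. With the enriched index it finds `r1`, at the price of storing a copy of the external labels that has to be kept current. `decorate` adds the same labels to what the user sees without affecting what can be found.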


Putting Classificatory Subject Indexing to Work in the New AquaBrowser Search Portal of the Vorarlberger Landesbibliothek
Karl Rädler
Vorarlberger Landesbibliothek, Fluherstraße 4, 6900 Bregenz www.vorarlberg.at/vlb

Abstract. The new AquaBrowser-based search portal of the Vorarlberger Landesbibliothek brings the shelf classification actively into play in searching, in several respects. First, search results can be refined thematically, step by step and top-down. The corresponding levels of the classification are offered for selection by means of verbal captions. In the same way as the subject notations, the country codes are put to work through their verbal representations as a separate regional facet, again with the option of narrowing step by step, top-down. A third facet, "Zeit-Epoche" (time period), is still to be added. Second, the verbal captions of the individual classes come into their own in a separate refinement category, "Schlagwort" (subject heading). In addition, a "word cloud" actively offers hierarchical and associative references between the classes for support and selection. Practical search examples are intended to demonstrate that classificatory subject indexing, particularly in search engines, can in a certain sense provide a new dimension of search quality. This holds especially for the possibility of making the entire holdings of a library actively transparent and of letting users go on a voyage of discovery, so to speak, in the library's multidimensional information space. To support this, the search was integrated graphically and functionally into the homepage. Individual tabs thus also offer direct entry points via underlying search links, which among other things allow whole subject areas to be searched and then refined step by step by a wide range of facets (media type, period, classification, ...). A main mission of the new search portal was to present the library's information services and media holdings as actively as possible and so to offer a digital, multidimensional "shop window". The extent to which this has succeeded will be demonstrated and put up for discussion.

Keywords VORARLBERGER LANDESBIBLIOTHEK, CATALOGUE, SEARCH PORTAL, AQUABROWSER, CLASSIFICATION, SEARCH


Text Mining for Ontology Construction
Elke Bubel (2), Nils Elsner (1), Peter König (2), Helmut Müller (1), Nadejda Nikitina (3), Mario Quilitz (2), Silke Rehme (1), Achim Rettinger (3), and Michael Schwantner (1, 4)
(1) FIZ Karlsruhe Leibniz Institut für Informationsinfrastruktur, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen
(2) INM Leibniz-Institut für Neue Materialien gGmbH, Campus D2 2, 66123 Saarbrücken
(3) Institut AIFB, KIT-Campus Süd, 76128 Karlsruhe
(4) corresponding author, [email protected]

Abstract. An ontology for chemical nanotechnology was built in the research project "NanOn: Semiautomatische Ontologiegenerierung - ein Beitrag zum Knowledge Sharing in der Nanotechnologie" (semi-automatic ontology generation as a contribution to knowledge sharing in nanotechnology), funded by the Senatsausschuss Wettbewerb of the Wissenschaftsgemeinschaft Gottfried Wilhelm Leibniz. Following the ontology engineering methodology of Gómez-Pérez and Suárez, a requirements analysis was drawn up first. Scientists and producers searching for materials, properties or processes (syntheses, applications) were identified as the target group. To support the intellectual conceptualization, existing ontologies (including CMO and ChEBI) were also drawn on in building the ontology. The focus of the project, however, was to investigate how suitable text mining methods are both for building the ontology and for automatically annotating scientific articles. For this purpose, prototype tools were developed in-house using various open source tools (e.g. GATE and OpenNLP). It could be shown that text mining methods that extract relevant terms from specialist texts are of great value for building an ontology: they support the process of intellectual conceptualization and lead to a more complete ontology. For the automatic annotation of concepts and relations, for which initial tests were carried out, the results were markedly more mixed. As expected, the annotation of concepts depends strongly on how complete the ontology is with respect to (quasi-)synonyms. The annotation of relations is considerably more difficult. Here the quality is influenced above all by whether specific formulations can be determined for the relations.

Keywords ONTOLOGY, NANOTECHNOLOGY, ONTOLOGY ENGINEERING METHODOLOGY, TEXT MINING, AUTOMATIC ANNOTATION


Subject Indexing with GND/RSWK in the Verbund Basel: A First Assessment
Alice Spinnler
Universität Basel Universitätsbibliothek, Straße am Forum 2 Karlsruhe 76131, Germany [email protected]

Abstract. In April 2011 the Verbund Basel switched from an in-house system of verbal subject indexing to SWD/RSWK. A good year after the introduction, it is time to take stock for the first time. After a brief overview of verbal subject indexing in the IDS (Basel is part of the Informationsverbund Deutschschweiz), the reasons for the change are set out. Have our expectations been met? What does day-to-day indexing look like for the subject librarians and the subject-heading editorial team with the new rules and the SWD, and from May 2012 with the GND? What effects does the new indexing practice have on thematic searching in the current OPAC and in swissbib Basel Bern, our future search platform? What happens to the holdings indexed according to the in-house rules? And finally: was the change worth it?

Keywords SWD, RSWK, GND


Resource Discovery Systems: Opportunity or Doom for Library Cataloguing?
Heidrun Wiesenmüller
Hochschule der Medien, Stuttgart [email protected]

Abstract. Heterogeneity in library catalogues is nothing new, but with the introduction of so-called resource discovery systems (e.g. EBSCO Discovery Service, Primo Central, Summon) it has reached a previously unknown level. These systems offer search indexes of enormous size, composed of very diverse data from commercial providers. Combining RDS data with data catalogued according to library standards leads to considerable problems. Even for simple formal facets, the benefit is severely limited by the lack of standardization. More complex search functions, such as restricting by subject area, no longer seem feasible at all. The contribution presents typical effects as well as the solutions adopted so far, which as a rule consist of separating library data from RDS data. Beyond that, it offers initial strategic considerations for a profitable interplay of library data and RDS data. The aim must be that our high-quality data does not drown in the sea of non-library data, which for example lacks links to authority data, but instead helps to improve it. The role of librarians will change in the process: in future they will increasingly also work as "metadata managers".

Keywords RESOURCE DISCOVERY SYSTEMS, LIBRARY DATA, HETEROGENEITY


Adapting the Content of the RVK as a Shelf Classification: The "Kleine Fächer" New Library Building Project at FU Berlin, with a Focus on the Orient
Helen Younansardaroud
Projekt Bibliotheksneubau "24 in 1" der FU Berlin

Abstract. In connection with the library structure reform at the Freie Universität Berlin, an integrated library at a single shared location is to be created for the so-called "Kleine Fächer" (small disciplines) of the Fachbereich Geschichts- und Kulturwissenschaften. The holdings of the present departmental libraries of the "Kleine Fächer" are to be retrospectively catalogued and, step by step, uniformly classified for open-stack shelving according to the Regensburger Verbundklassifikation (RVK). To reach this goal, the cross-concordance method is to be applied to answer the question of whether the RVK is sufficient and sufficiently expressive for the holdings of the Orient cluster of the "Kleine Fächer" at FU Berlin, which are currently shelved according to the in-house classification. Possible proposals for extending and optimizing the RVK are to be identified along the way. (My topic is based on my master's thesis, entitled Inhaltliche Anpassung der RVK als Aufstellungsklassifikation: Projekt Bibliotheksneubau Kleine Fächer der FU Berlin, Islamwissenschaft; published by: Institut für Bibliotheks- und Informationswissenschaft der Humboldt-Universität zu Berlin, 2010. (Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft; 287; http://edoc.hu-berlin.de/docviews/abstract.php?lang=ger&id=37367))

References
CARMEN = Content Analysis, Retrieval and MetaData: Effective Networking: Abschlussbericht des Arbeitspakets 12 (AP 12) Crosskonkordanzen von Klassifikationen und Thesauri. Als Online-Publikation aufbereitete Version, (2002), http://www.opus-bayern.de/uniregensburg/volltexte/2003/242/pdf/CARMENAP12 Abschlussbericht Netz.pdf ; Zugriff am 30.04.2010.
Goldziher, Ignác: A short history of classical Arabic literature. Transl., rev., and enl. by Joseph DeSomogyi, Hildesheim: Olms, (Olms Paperbacks; 23), 1966.
Mayr, Philipp; Walter, Anne-Kathrin: Einsatzmöglichkeiten von Crosskonkordanzen. In: Stempfhuber, Maximilian (Hg.): Lokal - Global: Vernetzung wissenschaftlicher Infrastrukturen: 12. Kongress der IuK-Initiative der Wissenschaftlichen Fachgesellschaft in Deutschland. Bonn: GESIS - IZ Sozialwissenschaften. (Tagungsberichte), (2006), S. 149-166, http://www.ib.hu-berlin.de/~mayr/arbeiten/mayr-walter-IuK06.pdf ; Zugriff am 30.04.2010.
Oberhauser, Otto; Seidler, Wolfram: Reklassifizierung grösserer fachspezifischer Bibliotheksbestände. Durchführbarkeitsstudie für die Fachbibliothek für Germanistik an der Universität Wien. Wien, 2000, http://www.germ.univie.ac.at/fbg/Studie.pdf ; Zugriff am 05.03.2010.
Umlauf, Konrad: Einführung in die bibliothekarische Klassifikationstheorie und -praxis mit Übungen. Berlin: Institut für Bibliothekswissenschaft der Humboldt-Universität zu Berlin, (Berliner Handreichungen zur Bibliothekswissenschaft; 67), 1999-2006, http://www.ib.hu-berlin.de/~kumlau/handreichungen/h67/ ; Stand: 20.12.2006; Zugriff am 02.05.2010.
