Topic Similarity in Information Retrieval: Examples and Experience of NLP Centre and LEMMA Projects

Petr Sojka
Laboratory of Electronic and Multimedia Applications¹ and Natural Language Processing Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic
[email protected]

Seznam day, April 16th, 2014

¹ Donor of today’s catering :-)

Petr Sojka: Topic Similarity in Information Retrieval

Coping with Information Overload by Filtering of Big Data

Life is searching: group similar items and narrow the focus of search in [your] Big Data.

Similarity types: from plagiarism (similarity on n-grams, narrative similarity; evolved into http://theses.cz) to thematic, topical similarity.
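The "similarity on n-grams" end of this spectrum can be illustrated with a minimal sketch. It assumes word trigrams and Jaccard overlap as the measure; the function names and sample sentences are hypothetical, and this is not the method used by theses.cz.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams (as tuples) of a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(a, b, n=3):
    """Jaccard similarity of the two texts' n-gram sets, in [0, 1]."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical sample texts: one partly copied, one unrelated.
original = "the quick brown fox jumps over the lazy dog"
copied = "yesterday the quick brown fox jumps over a fence"
unrelated = "topic models describe documents as mixtures of topics"

score_copied = ngram_overlap(original, copied)        # shares 4 of 10 trigrams
score_unrelated = ngram_overlap(original, unrelated)  # shares none
```

Such surface overlap catches copied passages but not two texts on the same subject written in different words, which is where topical similarity comes in.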

Prehistoric Example: Project Ottův Slovník naučný, 1998

Levels of content processing: strings → words and collocations → semantics (word meaning) → information (knowledge).

Grabbing the essence (content) of documents: topical modelling.

Topical Similarity in Digital Mathematics Library

- 2005, GVP, Radim Řehůřek and Jan Pomikálek
- 2006, gensim, different machine learning methods such as Random Projections, TF-IDF word weighting, Latent Semantic Indexing/Analysis, Latent Dirichlet Allocation
- 50,000+ fulltexts on http://dml.cz
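As an illustration of the simplest method in this list, here is a minimal sketch of TF-IDF word weighting followed by cosine similarity. It is written in plain Python rather than with gensim's API, and it assumes one common TF-IDF variant (raw term count times the natural logarithm of N/df), which is not necessarily the weighting used in DML-CZ. The tiny corpus is invented.

```python
import math
from collections import Counter


def tfidf_vectors(tokenised_docs):
    """Map each document to a sparse {word: tf * idf} dictionary."""
    n_docs = len(tokenised_docs)
    # Document frequency: the number of documents each word occurs in.
    doc_freq = Counter(word for doc in tokenised_docs for word in set(doc))
    vectors = []
    for doc in tokenised_docs:
        counts = Counter(doc)
        vectors.append({
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented mini-corpus: two related documents and one unrelated.
docs = [
    "banach space operator norm".split(),
    "operator norm on a banach space".split(),
    "graph colouring and planar graph".split(),
]
vectors = tfidf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
```

Words that occur in every document get weight zero, so only distinguishing vocabulary contributes to the similarity score.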

Leading Edge Example: Automated Meaning Picking from Texts

Probabilistic Topical Modelling: Latent Dirichlet Allocation

- topic: weighted list of words
- document: weighted list of topics
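The two definitions above can be sketched as plain data structures. The topics, documents and weights below are invented for illustration, not output of a trained model. The sketch shows how a document's word distribution follows from its topic weights, and how two documents can be compared through their topic vectors (Hellinger distance is one possible choice, assumed here).

```python
import math

# Hypothetical topics: each is a probability distribution over words.
topics = {
    "algebra": {"group": 0.5, "ring": 0.3, "proof": 0.2},
    "analysis": {"limit": 0.5, "integral": 0.3, "proof": 0.2},
}

# Hypothetical documents: each is a probability distribution over topics.
documents = {
    "paper_a": {"algebra": 0.9, "analysis": 0.1},
    "paper_b": {"algebra": 0.8, "analysis": 0.2},
    "paper_c": {"algebra": 0.1, "analysis": 0.9},
}


def word_distribution(doc_topics):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    dist = {}
    for topic, topic_weight in doc_topics.items():
        for word, word_weight in topics[topic].items():
            dist[word] = dist.get(word, 0.0) + topic_weight * word_weight
    return dist


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    keys = set(p) | set(q)
    total = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )
    return math.sqrt(total / 2)


words_a = word_distribution(documents["paper_a"])
dist_ab = hellinger(documents["paper_a"], documents["paper_b"])  # similar mix
dist_ac = hellinger(documents["paper_a"], documents["paper_c"])  # different mix
```

Comparing documents in topic space rather than word space lets two papers count as similar even when they share few exact words.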

Topical Modelling: Latent Dirichlet Allocation II

- all topics computed automatically from document corpora

Content Similarity Results in EuDML

Within the European Digital Mathematics Library (EuDML), an EU CIP-ICT-PSP project, we have developed and delivered technology for similarity (gensim), document conversions (Braille) and accessibility (math OCR), and NLP content normalization (Mathml2text).

Math Search Interface EuDML

Demo of math search in EuDML

Digital Library Service Architecture and Workflow (EuDML)

Document engineering and workflows including [Math] OCR.

Digital Library Service Architecture and Workflow (DML-CZ)

Data Visualization and Representation

Award Winning Topic Similarity Framework gensim

- Semantic similarity indexing and search of big data (including continuous streams). Client (search) and server (indexing) architecture.
- Developed by NLPlab PG student Radim Řehůřek (awarded in the Česká hlava competition in 2011).
- Leading-edge machine learning methods implemented.
- Used in 50+ local, EU or worldwide projects, 88+ citations.
- Typical deployment and fine-tuning scenario: expressing data as words (features) → configuration of topic modelling of features → setting of gensim methods and tuning parameters → usage in an application with a proper visualization interface.
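The deployment scenario in the last item can be sketched end to end in plain Python. This is an illustration under stated assumptions, not gensim's API: a simple Gaussian random projection stands in for the model step, and the CONFIG keys, function names and corpus are hypothetical. A real deployment would use far more than three dimensions.

```python
import math
import random
from collections import Counter

# Hypothetical configuration; names and values are illustrative only.
CONFIG = {
    "stopwords": {"a", "and", "in", "of", "the"},
    "dimensions": 3,
    "seed": 42,
}


def to_features(text, stopwords):
    """Step 1: express raw data as a bag of word features."""
    return Counter(w for w in text.lower().split() if w not in stopwords)


def project(features, projection, dimensions):
    """Map a bag of words to the count-weighted sum of its words' vectors."""
    vector = [0.0] * dimensions
    for word, count in features.items():
        # Words unseen at indexing time are ignored.
        for i, value in enumerate(projection.get(word, [0.0] * dimensions)):
            vector[i] += count * value
    return vector


def build_index(corpus, config):
    """Steps 2 and 3: configure the model and index the corpus."""
    bags = [to_features(text, config["stopwords"]) for text in corpus]
    rng = random.Random(config["seed"])
    dims = config["dimensions"]
    # One fixed random vector per vocabulary word (a random projection matrix).
    projection = {
        word: [rng.gauss(0.0, 1.0) for _ in range(dims)]
        for word in sorted(set().union(*bags))
    }
    vectors = [project(bag, projection, dims) for bag in bags]
    return projection, vectors


def cosine(u, v):
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def most_similar(query, index, config):
    """Step 4: rank indexed documents by similarity to a query text."""
    projection, vectors = index
    q = project(to_features(query, config["stopwords"]), projection,
                config["dimensions"])
    return sorted(((cosine(q, v), i) for i, v in enumerate(vectors)),
                  reverse=True)


corpus = [
    "Similarity of mathematical papers in a digital library",
    "Optical character recognition of mathematical formulae",
    "Students and agile techniques in the classroom",
]
index = build_index(corpus, CONFIG)
ranked = most_similar(corpus[0], index, CONFIG)  # list of (score, doc number)
```

Keeping the configuration separate from the code mirrors the tuning step: stopwords, dimensionality and seed can be changed and the index rebuilt without touching the application that calls `most_similar`.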

Teaching Laboratory Built with Constructivism Principles

- new course PV211, Úvod do získávání informací (Introduction to Information Retrieval)
- most work done by students themselves with agile techniques (XP)

New course PV211: Introduction to Information Retrieval

- In Spring 2014 for the first time: 100 students registered, 60 enrolled
- Invited lecture by Seznam (Roman Rožnik)²
- students motivated by Khan Academy movies, premium tasks, ...
- further cooperation, continuation course?

² About 10 years ago, Štěpán Škrob lectured in Brno and Ivo Lukaševič flew from Prague to brainhunt our students.

Conclusions and Mutual Research Interests

- similarity by topical modelling, document filtering and visualization
- semantic, meaning computations and modelling of natural language texts (natural NLP)
- new information retrieval course
- personal research interests: random walking for disambiguation, math (tree) indexing and similarity

That’s it!

Yes, we can!

Credits: Jiří Franek (illustrations)

Links

- NLP Centre: http://nlp.fi.muni.cz/
- Topical modelling: https://mir.fi.muni.cz/gensim/
- Math Information Retrieval: https://mir.fi.muni.cz
- DML-CZ project: http://dml.cz, http://project.dml.cz
- EuDML project: http://eudml.cz, http://project.eudml.cz
- LEMMA: http://www.fi.muni.cz/lemma/
