Sketch Engine: Tool for Text-Based Terminology
INTEGRA TERMINOLOGY MANAGEMENT, ZAGREB
Using texts for term mining
• Building a corpus
  • Choosing texts
  • Converting into a common format (txt)
  • Annotation
    • Croatian: http://nlp.ffzg.hr/api-for-our-language-technologies/
  • Alignment
    • CAT tools (SDL, memoQ) or LF Aligner, https://sourceforge.net/p/aligner/wiki/Home/
• Searching the corpus
  • Concordance tools: AntConc (free), WordSmith (€), ParaConc (free)
  • Web-based corpus workbench: Sketch Engine, http://www.sketchengine.co.uk
What is Sketch Engine?
• Very powerful corpus workbench: https://www.sketchengine.co.uk/
• Provides access to multiple pre-compiled corpora (British National Corpus, hrWaC, DGT corpora and many more)
• NOT free, but not expensive :) (€5.99 per month)
• Allows the creation of ad hoc corpora from web texts
• Supports TMX import (for bilingual texts!)
• Provides ways to extract terminology semi-automatically
• Online tutorials: https://www.sketchengine.co.uk/sketch-engine-video-tutorials/
Simple concordances
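A concordance lists every hit of a search word together with its left and right context (keyword in context, KWIC). The following is a minimal sketch of that idea, not Sketch Engine's implementation; the function name kwic, the width parameter and the sample sentence are invented for illustration.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, node, right context) for every hit of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        # Case-insensitive comparison of whole tokens only
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Invented, pre-tokenised sample text (tokens separated by spaces)
text = "Ribolov je zabranjen u luci . Dozvola za ribolov vrijedi godinu dana ."
lines = kwic(text.split(), "ribolov", width=2)

for left, node, right in lines:
    # Right-align the left context so that the node words line up
    print(f"{left:>20} | {node} | {right}")
```

Unlike the "simple" query type, this sketch matches only the exact word form and does not find inflected forms such as "ribolova".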
Other query types
• simple: searches for a word and its inflected forms
• lemma: searches for all word forms of a given lemma
• phrase: searches for a sequence of words
• word: searches for a specific word form
• character: searches for a string of characters
• CQL: Corpus Query Language
WordSketches
Thesaurus – similar words
Keyword extraction
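Keywords are words that are unusually frequent in a focus corpus compared with a reference corpus. The sketch below assumes the "simple maths" keyness score, (frequency per million in focus + N) / (frequency per million in reference + N), and is only an illustration of that formula, not the tool's own code. The function name keyness, the smoothing value N = 1 and the toy token lists are assumptions.

```python
from collections import Counter


def keyness(focus_tokens, ref_tokens, smoothing=1.0):
    """Rank the words of the focus corpus by an assumed 'simple maths' score."""
    focus, ref = Counter(focus_tokens), Counter(ref_tokens)
    n_focus, n_ref = len(focus_tokens), len(ref_tokens)
    scores = {}
    for word, count in focus.items():
        # Normalise raw counts to frequencies per million tokens
        fpm_focus = count / n_focus * 1_000_000
        fpm_ref = ref[word] / n_ref * 1_000_000
        # The smoothing constant avoids division by zero for unseen words
        scores[word] = (fpm_focus + smoothing) / (fpm_ref + smoothing)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy corpora; real corpora contain millions of tokens
focus_corpus = "ribolov je ribolov u moru".split()
reference_corpus = "to je u redu i to je sve".split()
ranked = keyness(focus_corpus, reference_corpus)

for word, score in ranked:
    print(f"{word}\t{score:.2f}")
```

Words that never occur in the reference corpus ("ribolov", "moru") rank highest, while function words shared by both corpora fall to the bottom.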
Term queries in the DGT parallel corpus
• simple queries: ribolov, brancin, grdobina
• lemma queries: ribolov -> ribolova, ribolovu, ribolov
• parallel query:
• querying using CQL syntax:
Basic CQL
• Typical format: [attribute="value"], e.g. [lemma="riba"]
• Specifying word class or case: [tag="N.*"] (any noun), [tag="A.*"] (any adjective)
• Regular expressions:
  • . (dot) matches any single character
  • * (asterisk) matches zero or more repetitions of the preceding item
  • + (plus) matches one or more repetitions of the preceding item
  • {n,k} specifies a range of repetitions, from n to k
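The regular-expression operators can be tried out in Python, whose re module uses the same basic syntax. In CQL a pattern has to match the whole attribute value, which corresponds to re.fullmatch. This is a sketch for experimenting, not CQL itself; the helper name matches and the sample strings are made up.

```python
import re


def matches(pattern, value):
    """True if the pattern matches the whole value, as a CQL attribute value would."""
    return re.fullmatch(pattern, value) is not None


# Invented sample strings
words = ["rib", "riba", "ribar", "ribolov"]

dot = [w for w in words if matches("rib.", w)]          # exactly one more character
star = [w for w in words if matches("rib.*", w)]        # zero or more characters
plus = [w for w in words if matches("rib.+", w)]        # one or more characters
braces = [w for w in words if matches("rib.{1,2}", w)]  # one or two characters

print(dot)
print(star)
print(plus)
print(braces)
```

The difference between * and + shows on the bare string "rib": it is matched by "rib.*" but not by "rib.+".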
[lemma="rad"]
[tag="A.*"][lemma="riba"]
"ulov.*" []{0,3} [tag="N.*"]
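The query "ulov.*" []{0,3} [tag="N.*"] reads as: a token starting with "ulov", then up to three arbitrary tokens, then a noun. The sketch below imitates that logic on a small hand-tagged token list. It assumes that the bare string is matched against the word form; the function name find_matches, the sample sentence and the simplified one-letter tags are hypothetical and do not come from a real corpus or tagset.

```python
import re


def find_matches(tokens, max_gap=3):
    """tokens: list of (word, tag) pairs. Returns (start, end) index pairs, end inclusive."""
    results = []
    for i, (word, _tag) in enumerate(tokens):
        # First position: "ulov.*" must match the whole word form
        if not re.fullmatch(r"ulov.*", word):
            continue
        # Middle: []{0,3} allows between 0 and max_gap arbitrary tokens
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            # Last position: [tag="N.*"] requires a tag starting with N
            if j < len(tokens) and re.fullmatch(r"N.*", tokens[j][1]):
                results.append((i, j))
    return results


# Invented sentence with simplified tags (N = noun, A = adjective, V = verb)
tagged = [
    ("ulov", "N"), ("male", "A"), ("plave", "A"),
    ("ribe", "N"), ("je", "V"), ("zabranjen", "A"),
]
spans = find_matches(tagged)

for start, end in spans:
    print(" ".join(word for word, _tag in tagged[start:end + 1]))
```

A single starting token can yield several matches if more than one noun falls inside the three-token window.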
Challenges
• Search for verbs occurring before the word “ugovor” with up to 2 words in between.
• Search for words ending in “anje”.
• Search for defining contexts containing a noun in the nominative case, followed by “je”, followed by an adjective and a noun in the nominative case.
Looking for definitions
• Exploit typical definition patterns:
  • [X] is a [Y]
  • [X] is defined as [Y]
  • [X] is a kind of [Y]
  • …
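The same definition patterns can be tested on plain sentences before writing a corpus query. The following sketch uses Python regular expressions on raw text; it is a rough approximation, since a corpus query would normally use lemmas and tags. The names DEFINITION_PATTERNS and find_definition are invented for this example.

```python
import re

# More specific patterns come first, so "is a kind of" is not swallowed by "is a"
DEFINITION_PATTERNS = [
    re.compile(r"^(?P<term>.+?) is defined as (?P<definition>.+)$", re.IGNORECASE),
    re.compile(r"^(?P<term>.+?) is a kind of (?P<definition>.+)$", re.IGNORECASE),
    re.compile(r"^(?P<term>.+?) is an? (?P<definition>.+)$", re.IGNORECASE),
]


def find_definition(sentence):
    """Return (term, definition) if the sentence fits a pattern, otherwise None."""
    cleaned = sentence.strip().rstrip(".")
    for pattern in DEFINITION_PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return match.group("term"), match.group("definition")
    return None


result = find_definition("Bluetongue is a viral disease of ruminants.")
print(result)
print(find_definition("The catch was large."))
```

Patterns this loose also match non-definitions such as "This is a problem", so the hits still need manual checking.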
WebBootCat
• Tool to create text collections from web pages
• User provides keywords and optionally selects sites to crawl
• Once the corpus is compiled, it can be queried or downloaded.
TMX Upload
• Allows you to create corpora from your translation memories
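Before uploading, it can help to check what a translation memory actually contains. A TMX file stores translation units (tu), each with one language variant (tuv) per language and the text in a seg element. The sketch below reads such units with Python's standard library; the function name read_tmx and the tiny sample file are made up, and real TMX exports may contain inline markup and metadata that this ignores.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xml:lang attribute to this namespaced key
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented minimal TMX document with a single translation unit
SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>fishing</seg></tuv>
      <tuv xml:lang="hr"><seg>ribolov</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(tmx_text):
    """Return one dict per translation unit, mapping language code to segment text."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline tags
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        units.append(segments)
    return units


pairs = read_tmx(SAMPLE)
for unit in pairs:
    print(unit)
```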
Terminology extraction
• Works for languages with a predefined “term grammar”
• Manage corpus -> Keywords and terms
• Terms can be exported to TBX or CSV
Exercise
• Use the corpus-derived information on the following slides to create a term entry for “bluetongue”.