Natural Language Processing >> Electronic Dictionaries
Ab-tei.lung-en
Spar.gel.der: Spar-gelder vs. Spargel-der -> Spar-gel.der
be.ste.hen.de: be-stehende vs. beste-hende -> be-ste.hen-de
be.in.hal.ten: be-inhalten vs. bein-halten -> be-in.hal-ten
the lexicon : considerations / questions
4. Which features do we code ? What about determiners and quantifiers and ... ?
Features:
wordclass: ADJ, ADV, NOUN, PRO, (IJ), CONJ, PREP, VERB, PUNC
phonetic: flag [f l æ g] (IPA)
morphological: e.g. person, number, gender
syntactic: e.g. transitivity, valency,...
semantic: +- human, +- animate,...
the lexicon : considerations / questions
4. Which features do we code ?
Features: wordclass, phonetic, morphological, syntactic, semantic
In Computational Linguistics, these features are called:
syntactic -> subcategorizational features
semantic -> selectional features
the lexicon : considerations / questions
4. Which features do we code ?
Large (complete ?) electronic dictionaries for English use approx. 250 features (also called: indicators or attributes).
These features can be complex in that they take various values (in computational terms: attribute-value pairs).
examples: [attribute: values]
[case: nominative, genitive, dative, accusative, vocative,...]
[number: sing, plur]
[gender: masc, fem, neutr]
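A minimal sketch of attribute-value pairs in Python, assuming a small inventory built only from the three example attributes above. The names `ALLOWED_VALUES` and `validate` are illustrative and not part of any real dictionary system.

```python
# Hypothetical attribute inventory, limited to the examples on the slide.
ALLOWED_VALUES = {
    "case": {"nominative", "genitive", "dative", "accusative", "vocative"},
    "number": {"sing", "plur"},
    "gender": {"masc", "fem", "neutr"},
}


def validate(entry):
    """Return the (attribute, value) pairs the inventory does not license."""
    errors = []
    for attribute, value in entry.items():
        if value not in ALLOWED_VALUES.get(attribute, set()):
            errors.append((attribute, value))
    return errors


# Illustrative entry for the German plural noun form "Abteilungen".
abteilungen = {"case": "nominative", "number": "plur", "gender": "fem"}

print(validate(abteilungen))         # every pair is licensed
print(validate({"number": "dual"}))  # "dual" is not in the inventory above
```

Keeping the inventory separate from the entries makes it easy to add or drop features per language.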
the lexicon : considerations / questions 4. Which features do we code ? Is there a rule that tells us how many features we need to code or use ?
The rule: „The idea is that one should use as many features as would be useful, for any given language. So the number should be very, very flexible.“ (Karen Jensen, former IBM Research / Microsoft Research scientist)
Example: parts of speech ? How many are there ?
the lexicon : considerations / questions 4. Which features do we code ?
Sample features: ADJ PHONE MASS SENS...
DATE PREP TITLS
DET QUANT ASKNG
NUM ANIM PRES
„Features used in PEG“ !
the lexicon : considerations / questions
4. Which features do we code ? - Exercise
Describe the morphology and syntax of the following words, using subcategorizational features !
is: VERB, SING, PRES, AUX, PERS3, COPL
beauty: NOUN, SING, PERS3
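The exercise answers can be stored as flat feature sets per word. The following is a sketch only: `LEXICON`, `has_features` and `words_with` are hypothetical names, and the feature labels are copied from the exercise.

```python
# Feature sets for the two exercise words, as given in the exercise.
LEXICON = {
    "is": {"VERB", "SING", "PRES", "AUX", "PERS3", "COPL"},
    "beauty": {"NOUN", "SING", "PERS3"},
}


def has_features(word, *features):
    """True if the entry for `word` carries every requested feature."""
    return set(features) <= LEXICON.get(word, set())


def words_with(*features):
    """All words whose entry carries every requested feature, sorted."""
    wanted = set(features)
    return sorted(w for w, feats in LEXICON.items() if wanted <= feats)


print(has_features("is", "VERB", "AUX"))
print(words_with("SING", "PERS3"))
```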
the lexicon : considerations / questions
5. How do we treat non-English words ? What are loan words ?
Definition: words which originate in another language and which have been integrated into your own language
the lexicon : considerations / questions
EXAMPLE from Greek -> English
Abyss Academy Acoustic Acrobat Aerial Aeronautics Aesthetic Air Airplane Alphabet Amazon Amnesty Amphitheater Analogy Analysis Angel Anthem Anthropology Aphrodisiac Apocalypse Apology Archeology Architect Aristocracy
Aroma Astronaut Athletic Atlas Atmosphere Audio .... Zeal Zephyr Zodiac Zoe Zoic Zome Zoology
the lexicon : considerations / questions
5. How do we treat non-English words ?
lemma (Greek / Latin) -> plural (Engl.): lemmas ; plural (orig.): lemmata
English – German:
the update – gender ???
to update – separable prefix ???
past participle (Engl.: updated):
German – version 1: geupdated / geupdatet (non-separable prefix)
German – version 2: upgedated / upgedatet (separable prefix)
the lexicon : considerations / questions
5. How do we treat loan words ? more examples:
English – German: the homebanking – gender ??? Loan word. Noun. Capitalized ???
verb form ??? In English: He is homebanking ? He is banking at home ? In German: impossible
the lexicon : considerations / questions
5. How do we treat loan words ? more examples:
English – German: to outsource – separable prefix ??? (like in „to update“)
to update:
upgedated (separable prefix, Engl. ending)
upgedatet (separable prefix, German ending)
geupdated (non-separable prefix, Engl. ending)
geupdatet (non-separable prefix, German ending)
to outsource:
outgesourced (separable prefix, Engl. ending)
outgesourcet (separable prefix, German ending)
geoutsourced (non-separable prefix, Engl. ending)
geoutsourcet (non-separable prefix, German ending)
In any case these word creations are hard to read !
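The four competing participle forms follow a regular pattern, so a lexicon tool could generate them as candidates. This is a sketch under simplifying assumptions: the verb is given already split into prefix and stem, the English ending is "-d"/"-ed" and the German ending is "-t". The function name `participle_candidates` is hypothetical.

```python
def participle_candidates(prefix, stem):
    """Generate the four candidate German participles of an English loan verb.

    Assumes the caller supplies the split, e.g. ("up", "date") for "update".
    """
    english = stem + ("d" if stem.endswith("e") else "ed")
    german = stem + "t"
    return {
        "separable prefix, Engl. ending": prefix + "ge" + english,
        "separable prefix, German ending": prefix + "ge" + german,
        "non-separable prefix, Engl. ending": "ge" + prefix + english,
        "non-separable prefix, German ending": "ge" + prefix + german,
    }


for label, form in participle_candidates("out", "source").items():
    print(f"{form} ({label})")
```

Which candidate the dictionary should accept remains a lexicographic decision; the code only enumerates the options.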
the lexicon : considerations / questions
6. How do we treat (idiomatic) phrases / phrasal expressions ?
multiple morpheme strings / multi-word expressions (MWE): bite the dust, lose face, kick the bucket,...
-> fixed expressions: cannot be modified
* bite the dirty dust
* lost her beautiful face
* kicked the full bucket
-> fixed expressions: require certain types of subjects (features !)
The cow kicked the bucket. vs. [Human being] kicked the bucket.
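Both properties (no modification, restricted subject type) can be coded in an idiom entry. The sketch below assumes that the idiomatic reading of "kick the bucket" requires a +human subject, as the cow example suggests. `IdiomEntry`, `NOUN_FEATURES` and the noun "farmer" are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdiomEntry:
    words: tuple                  # fixed lemma sequence, no modifiers allowed
    meaning: str
    subject_features: frozenset   # selectional features required of the subject


IDIOMS = [
    IdiomEntry(("kick", "the", "bucket"), "die", frozenset({"+human"})),
]

# Hypothetical selectional features for two subject nouns.
NOUN_FEATURES = {
    "cow": {"+animate", "-human"},
    "farmer": {"+animate", "+human"},
}


def idiomatic_reading(subject, predicate_lemmas):
    """Return the idiomatic meaning, or None if only the literal reading fits."""
    subject_feats = NOUN_FEATURES.get(subject, set())
    for idiom in IDIOMS:
        fixed_match = tuple(predicate_lemmas) == idiom.words
        if fixed_match and idiom.subject_features <= subject_feats:
            return idiom.meaning
    return None


print(idiomatic_reading("farmer", ["kick", "the", "bucket"]))
print(idiomatic_reading("cow", ["kick", "the", "bucket"]))
```

Because the word sequence is compared as a whole, a modified string such as "kick the full bucket" never matches the idiom entry.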
the lexicon : considerations / questions
7. How do we treat neologisms ?
Neologisms = new word creations, acronyms, loan words,... often „trendy“ words
Don't be afraid of new words !
Fact is: when reading through dictionaries for neologisms, or even when reading newspapers, we find many words that are only a few years old, which sometimes still require some explanation but which belong to our active vocabulary:
airbag, AIDS, job-sharing, to pamper, global player, softy, to grab a bite, dude, (fast-food) joint, smarty pants, tin-foil face, to google, to xerox, ...
the lexicon : considerations / questions
7. How do we treat neologisms (contd.) ?
Neologisms are often created across languages, e.g. trendy words among young people:
sich ausruhen (to rest):
abkeimen (German)
to veg (out) / to couch / to chill out / to hang out (English)
chiller (French)
vegetar (Spanish)
sich grundlos betrinken (to get drunk for no reason):
sich abschädeln (German)
to get out of one's skull / to get plastered (English)
se murger sa gueule (French)
pillarse una tajada (Spanish)
the lexicon : considerations / questions
7. How do we treat neologisms (contd.) ?
„Juvenile language“ / chat language – acronyms from text messaging (sometimes international):
lol
yolo
Words? Idiomatic expressions? Sentences?
spell aid – chat language (acronyms)
AFAIK -- As Far As I Know
AFK -- Away From Keyboard
ASAP -- As Soon As Possible
BAS -- Big A** Smile
BBL -- Be Back Later
BBN -- Bye Bye Now
BBS -- Be Back Soon
BEG -- Big Evil Grin
BF -- Boyfriend
BIBO -- Beer In, Beer Out
BRB -- Be Right Back
BTW -- By The Way
BWL -- Bursting With Laughter
C&G -- Chuckle and Grin
CICO -- Coffee In, Coffee Out
CID -- Crying In Disgrace
CP -- Chat Post (a chat message)
CRBT -- Crying Real Big Tears
CSG -- Chuckle Snicker Grin
CYA -- See You (Seeya)
CYAL8R -- See You Later (Seeyalata)
DLTBBB -- Don't Let The Bed Bugs Bite
EG -- Evil Grin
EMSG -- Email Message
FC -- Fingers Crossed
FTBOMH -- From The Bottom Of My Heart
FYI -- For Your Information
See: http://www.netlingo.com/acronyms.php
http://www.netlingo.com/top50/popular-text-terms.php
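A spell aid could treat such acronyms as ordinary lexicon entries and expand them by lookup. This is a minimal sketch, assuming a small excerpt of the list above and case-insensitive matching. `ACRONYMS` and `expand` are hypothetical names.

```python
import re

# Excerpt of the acronym list above; a real spell aid would load the full list.
ACRONYMS = {
    "AFAIK": "As Far As I Know",
    "ASAP": "As Soon As Possible",
    "BRB": "Be Right Back",
    "BTW": "By The Way",
    "FYI": "For Your Information",
}


def expand(text):
    """Replace every known acronym in `text` with its long form."""
    def replace(match):
        token = match.group(0)
        # Unknown tokens are returned unchanged.
        return ACRONYMS.get(token.upper(), token)

    return re.sub(r"[A-Za-z0-9&]+", replace, text)


print(expand("btw, brb"))
```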
spell aid – chat language (symbols)
:-| -- Ambivalent
o:-) -- Angelic
>:-( -- Angry
|-I -- Asleep
(::()::) -- Bandaid
:-{} -- Blowing a Kiss
\-o -- Bored
:-c -- Bummed Out
|C| -- Can of Coke
|P| -- Can of Pepsi
:( ) -- Can't Stop Talking
:*) -- Clowning
:' -- Crying
:'-) -- Crying with Joy
:'-( -- Crying Sadly
:-9 -- Delicious, Yummy
:-> -- Devilish
;-> -- Devilish Wink
:P -- Disgusted (sticking out tongue)
:*) -- Drunk
:-6 -- Exhausted, Wiped Out
:( -- Frown
\~/ -- Full Glass
\_/ -- Glass (drink)
^5 -- High Five
spell aid – EMOJIs … language ???
Language is the ability to acquire and use complex systems of communication, particularly the human ability to do so, and a language is any specific example of such a system. - Natural Language Systems Harriehausen
EMOJIs vs. hieroglyphs
http://www.fbemoticons.com/wp-content/uploads/2010/03/facebook-emoticons.jpg
the lexicon : homonymy / polysemy
Special types of ambiguities on WORD LEVEL:
What are homonyms ?
What are homographs ?
What are homophones ?
What are heterophones ?
What are heterographs ?
Examples:
the lexicon : homonymy / polysemy
four – for
watt – what
homophones & heterographs: heterographic and homophonic
the lexicon : ambiguity on which NLP level?
homonyms : homographic and homophonic; here: combined with MWE
the lexicon : ambiguity on which NLP level?
homonyms : homographic and homophonic; but combined with different tags (for the different POS); word level: pretty dirty (ADV + ADJ) vs. ADJ (gap: when you are) ADJ
the lexicon : ambiguity on which NLP level?
homonyms : homographic and homophonic; word level
the lexicon : ambiguity on which NLP level?
No homonymy at all, but syntactic ambiguity
the lexicon : ambiguity on which NLP level?
homonyms : homographic and homophonic; word level
homophones heterographs
the lexicon : homonymy / polysemy
questions:
• Why should we have a dictionary for homographs / homonyms ?
• How can we use such a dictionary ?
• How are entries structured ?
For German, homonym dictionaries have existed since 1532.
the lexicon : homonymy / polysemy
homonyms (homographic and homophonic)
homographs (homographic but not necessarily homophonic): Engl: pen – pen ; German: August – August
homophones (heterographic and homophonic): Engl: their – there ; German: Wände – Wende
the lexicon : homonymy / polysemy
We need to distinguish 4 different categories:
A homonyms: homographic and homophonic (pipe, pen, Ball,…)
B homographs: homographic but not necessarily homophonic (pipe – pipe ; pen – pen ; August – August ; Lache – Lache)
C homophones = heterographs: heterographic and homophonic (extra/xtra ; right/write ; their/there ; deer/dear ; two/too ; base/bass ;… Bug/buk ; Wände/Wende) (no problem, but when can we use them ?)
D heterophones/heteronyms: homographic and not homophonic (wound: [wu:nd] in: the wound, wounded; [waund] in: past tense of wind (wrap) ; desert – desert)
Remark: A-D always have DIFFERENT meanings.
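The four categories follow from two comparisons: same spelling or not, same pronunciation or not. The sketch below assumes both words are already known to differ in meaning and that pronunciations are given as plain transcription strings. `classify` is a hypothetical helper.

```python
def classify(spelling_a, pron_a, spelling_b, pron_b):
    """Return the category labels (A-D) for two words with different meanings."""
    same_spelling = spelling_a == spelling_b
    same_sound = pron_a == pron_b
    labels = set()
    if same_spelling:
        # B covers every identically spelled pair, whatever the pronunciation.
        labels.add("B homograph")
        labels.add("A homonym" if same_sound else "D heterophone")
    elif same_sound:
        labels.add("C homophone")
    return labels


print(sorted(classify("wound", "wu:nd", "wound", "waund")))
print(sorted(classify("their", "ðeə", "there", "ðeə")))
```

A pair that differs in both spelling and sound gets no label: it is simply two different words.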
the lexicon : homonymy / polysemy PROBLEMS
Homographs : pipe – pipe ; pen – pen ; August – August ; Lache – Lache (homographic but not necessarily homophonic)
Heterophones : wound – wound ; desert – desert (homographic and not homophonic)
the lexicon : homonymy / polysemy
                 homophonic              heterophonic
homographic      homonym, homograph      homograph, heterophone
heterographic    homophone
A : homonym   B : homograph   C : homophone (/heterograph)   D : heterophone (/heteronym)
the lexicon : homonymy / polysemy
A : homonym   B : homograph   C : homophone (/heterograph)
PROBLEM
D : heterophone (/heteronym)
the lexicon : homonymy - polysemy EXAMPLES
homographic wordforms differing in their POS:
trust: verb/noun
poll: verb/noun
poor: adj/noun
lodge: noun/verb-trans/verb-intrans
long: noun (longitude)/adj/verb/adv (Don't be too long about it !)
out: prep/sep. prefix (to cry out)
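A lexicon that stores all word classes per wordform can flag such POS homographs before parsing. The sketch assumes the word/tag pairings listed in the code, which are one reading of the examples above. The entry for "beauty" is added only as an unambiguous contrast, and `POS_LEXICON` and `pos_ambiguous` are hypothetical names.

```python
# Word classes per wordform; the pairings are an assumed reading of the examples.
POS_LEXICON = {
    "trust": ["verb", "noun"],
    "poll": ["verb", "noun"],
    "poor": ["adj", "noun"],
    "lodge": ["noun", "verb-trans", "verb-intrans"],
    "long": ["noun", "adj", "verb", "adv"],
    "out": ["prep", "sep. prefix"],
    "beauty": ["noun"],  # contrast case, not taken from the examples
}


def pos_ambiguous(tokens):
    """Map each token with more than one word class to its candidate classes."""
    return {
        token: POS_LEXICON[token]
        for token in tokens
        if len(POS_LEXICON.get(token, [])) > 1
    }


print(pos_ambiguous(["poor", "beauty", "out"]))
```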
the lexicon : homonymy - polysemy
Examples of homographs in German:
Noun vs. Noun:
Heroin: drug vs. female hero
Konstanz: town in Southern Germany vs. stability/consistency
Petra: female name vs. archeological site in Jordan
Montage: Mondays (pl.) vs. assembly
Adj/Verb vs. Verb:
modern: adj (modern) vs. verb (to rot)
übersetzen: translate vs. cross the river
Noun vs. Verb:
Fliegen «» fliegen (flies / to fly)
Regen «» regen (rain / to stir)
Spinnen «» spinnen (spiders / to spin)
Zahlen «» zahlen (numbers / to pay)
the lexicon : homonymy - polysemy
Examples of homographs in German:
Noun vs. inflected verbform (like the „wound“ example in English):
Flucht: escape (noun) vs. 3rd pers sing „to curse“
Naht: seam (noun) vs. 3rd pers sing „to come closer“ / „to approach“
Sucht: addiction (noun) vs. 3rd pers sing „to look for“
Homographs based on cross-language loan words:
Bug: front of a ship vs. error in a program (Engl.)
dies: demonstrative pronoun vs. days (Latin)
gen: in direction of (prep) vs. genetic factor (noun, Greek) vs. Gen. (abbrev. for genitive or noun, Genossenschaft/cooperative)
the lexicon : homonymy / polysemy
Why should we have a dictionary for homographs / homonyms ?
• simplify the dictionary lookup
• in conventional dictionaries, the correct base form is often not found:
wound: past tense of wind [waund] vs. noun wound [wu:nd]
the lexicon : homonymy - polysemy
One needs to decide which one of these alternative notations of lexical ambiguity should be used:
homonymy ... homophonous words, different meaning -> separate entries:
entry    information
pen      noun <instrument for writing>
pen      noun <for cattle>
pen      noun <female swan>
problem: multiple parses
polysemy ... homophonous words, different meaning -> one entry:
entry    information
pen      noun 1. <instrument for writing> 2. <for cattle> 3. <female swan>
problem: unclear entries
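The two notations correspond to two data layouts. This is a sketch with hypothetical names (`homonym_entries`, `polysemy_entries`) using the three "pen" senses from the slide. It shows why the first layout multiplies parses while the second hides the choice inside one entry.

```python
# Homonymy notation: one entry per meaning, all sharing the headword.
homonym_entries = [
    ("pen", "noun", "instrument for writing"),
    ("pen", "noun", "for cattle"),
    ("pen", "noun", "female swan"),
]

# Polysemy notation: one entry per headword, senses listed inside it.
polysemy_entries = {
    "pen": {
        "pos": "noun",
        "senses": ["instrument for writing", "for cattle", "female swan"],
    },
}


def lookup_homonym(word):
    """Every matching entry becomes a separate analysis for the parser."""
    return [entry for entry in homonym_entries if entry[0] == word]


def lookup_polysemy(word):
    """A single analysis; the sense still has to be picked later."""
    entry = polysemy_entries.get(word)
    return [entry] if entry else []


print(len(lookup_homonym("pen")), "analyses vs.", len(lookup_polysemy("pen")))
```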
Identification of a wordform as being a homograph
...the wound was hurting him...
1. step: identify ambiguities: wound
Homograph Dictionary:
2. step: wordform-index: wordform -> address
3. step: homograph-classes: address -> homograph-class, stem 1 / stem 2, inflection-code
4. step: inflection-paradigm: inflection-code -> inflection-features
5. step: general dictionary: stem 1 / stem 2 -> ... wound (N) ... wind (V)
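The lookup chain (wordform index, homograph class, inflection paradigm, general dictionary) might be sketched as a series of table lookups. All table names, addresses, inflection codes and feature labels below are invented for illustration. Only the wordform "wound" and its two readings wound (N) / wind (V) come from the slide.

```python
# Step 2: wordform index, wordform -> address (hypothetical address "H1").
WORDFORM_INDEX = {"wound": "H1"}

# Step 3: homograph classes, address -> class with (stem, inflection-code) pairs.
HOMOGRAPH_CLASSES = {
    "H1": {"class": "N/V", "readings": [("wound", "N-sg"), ("wind", "V-past")]},
}

# Step 4: inflection paradigms, inflection-code -> inflection-features.
INFLECTION_PARADIGMS = {
    "N-sg": {"NOUN", "SING"},
    "V-past": {"VERB", "PAST"},
}

# Step 5: general dictionary, stem -> entry.
GENERAL_DICTIONARY = {"wound": "wound (N)", "wind": "wind (V)"}


def analyse(wordform):
    """Return all readings of a homographic wordform, or [] if it is not one."""
    address = WORDFORM_INDEX.get(wordform)
    if address is None:
        return []  # step 1 found no ambiguity for this wordform
    readings = []
    for stem, code in HOMOGRAPH_CLASSES[address]["readings"]:
        readings.append({
            "stem": stem,
            "features": INFLECTION_PARADIGMS[code],
            "entry": GENERAL_DICTIONARY[stem],
        })
    return readings


for reading in analyse("wound"):
    print(reading["entry"], sorted(reading["features"]))
```

Disambiguation itself (choosing the noun reading in "the wound was hurting him") would need the syntactic context and is not part of this sketch.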
the lexicon : homonymy - polysemy
For English: more information / reading material on disambiguating dictionary entries: www.research.microsoft.com/users/lucyv/ (Lucy Vanderwende, computational linguist)
For German: Weber, Heinz J.: Homographen-Wörterbuch der deutschen Sprache. De Gruyter, 1996, ISBN 978-3-11-014641-7