Natural Language Processing >> Electronic Dictionaries

Author: Clarence Quinn
Ab-tei.lung-en

Spar.gel.der: Spar-gelder vs. Spargel-der -> Spar-gel.der
be.ste.hen.de: be-stehende vs. beste-hende -> be-ste.hen-de
be.in.hal.ten: be-inhalten vs. bein-halten -> be-in.hal-ten
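The hyphenation patterns above can be read mechanically. The following is a minimal sketch, assuming that "-" marks a preferred break point and "." a less preferred one (this reading of the notation is an assumption, as are the function names `parse_hyphenation` and `best_break`):

```python
def parse_hyphenation(pattern):
    """Split a pattern such as 'Spar-gel.der' into the plain word and its break points."""
    letters = []
    breaks = []  # (position in the plain word, kind of break)
    for ch in pattern:
        if ch == "-":
            breaks.append((len(letters), "preferred"))   # assumed: strong break
        elif ch == ".":
            breaks.append((len(letters), "secondary"))   # assumed: weak break
        else:
            letters.append(ch)
    return "".join(letters), breaks


def best_break(pattern, max_len):
    """Split the word at the best break whose first part (plus hyphen) fits into max_len.

    Returns (first_part_with_hyphen, rest) or None if no break point fits.
    """
    word, breaks = parse_hyphenation(pattern)
    fitting = [(pos, kind) for pos, kind in breaks if pos + 1 <= max_len]
    if not fitting:
        return None
    # Preferred breaks win over secondary ones; among equals take the rightmost.
    pos, _ = max(fitting, key=lambda b: (b[1] == "preferred", b[0]))
    return word[:pos] + "-", word[pos:]


print(best_break("Spar-gel.der", 8))
```

With this ranking, a line that could hold "Spargel-" still breaks as "Spar-" + "gelder", which avoids the misleading reading "Spargel" (asparagus).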

the lexicon : considerations / questions

4. Which features do we code ? What about determiners and quantifiers and ... ?

Features:
wordclass: ADJ, ADV, NOUN, PRO, (IJ), CONJ, PREP, VERB, PUNC
phonetic: e.g. flag [f l æ g] (IPA)
morphological: e.g. person, number, gender
syntactic: e.g. transitivity, valency, ...
semantic: e.g. +- human, +- animate, ...

In Computational Linguistics, these features are called:
morphological, syntactic: subcategorizational features
semantic: selectional features

Large (complete ?) electronic dictionaries for English use approx. 250 features (also called indicators or attributes). These features can be complex in that they carry various values (in computational terms: attribute-value pairs).

Examples: [attribute: values]
[case: nominative, genitive, dative, accusative, vocative,...]
[number: sing, plur]
[gender: masc, fem, neutr]
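Attribute-value pairs map directly onto a dictionary data structure. A minimal sketch, assuming a closed value inventory per attribute (the slide leaves the case list open with "...", so the table here is illustrative; `FEATURE_VALUES` and `validate_entry` are hypothetical names):

```python
# Allowed values per attribute, taken from the examples on the slide.
FEATURE_VALUES = {
    "case": {"nominative", "genitive", "dative", "accusative", "vocative"},
    "number": {"sing", "plur"},
    "gender": {"masc", "fem", "neutr"},
}


def validate_entry(features):
    """Return a list of problems found in one lexicon entry's attribute-value pairs."""
    problems = []
    for attribute, value in features.items():
        allowed = FEATURE_VALUES.get(attribute)
        if allowed is None:
            problems.append(f"unknown attribute: {attribute}")
        elif value not in allowed:
            problems.append(f"invalid value for {attribute}: {value}")
    return problems


entry = {"case": "dative", "number": "plur", "gender": "fem"}
print(validate_entry(entry))               # no problems expected
print(validate_entry({"number": "dual"}))  # value not in the inventory
```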

Is there a rule that tells us how many features we need to code or use ?

The rule: „The idea is that one should use as many features as would be useful, for any given language. So the number should be very, very flexible.“ (Karen Jensen, former IBM research / Microsoft Research scientist)

Example: parts of speech ? How many are there ?

Sample features („Features used in PEG“):
ADJ, PHONE, MASS, SENS..., DATE, PREP, TITLS, DET, QUANT, ASKNG, NUM, ANIM, PRES

Exercise: Describe the morphology and syntax of the following words, using subcategorizational features !
is: VERB, SING, PRES, AUX, PERS3, COPL
beauty: NOUN, SING, PERS3

5. How do we treat non-English words ? What are loan words ?

Definition: words which originate in another language and which have been integrated into your own language

EXAMPLE from Greek -> English:

Abyss, Academy, Acoustic, Acrobat, Aerial, Aeronautics, Aesthetic, Air, Airplane, Alphabet, Amazon, Amnesty, Amphitheater, Analogy, Analysis, Angel, Anthem, Anthropology, Aphrodisiac, Apocalypse, Apology, Archeology, Architect, Aristocracy, Aroma, Astronaut, Athletic, Atlas, Atmosphere, Audio, ..., Zeal, Zephyr, Zodiac, Zoe, Zoic, Zome, Zoology

lemma (Greek / Latin) -> plural (Engl.): lemmas; plural (orig.): lemmata

English – German:
the update – gender ???
to update – separable prefix ???

past participle:
updated (Engl.)
geupdated / geupdatet (German – version 1, non-separable prefix)
upgedated / upgedatet (German – version 2, separable prefix)

more examples: English – German:
the homebanking – gender ??? Loan word. Noun. Capitalized ???
verb form ??? In English: He is homebanking ? He is banking at home ? In German: impossible

more examples: English – German: to outsource – separable prefix ??? (like in „to update“)

to update:
upgedated (separable prefix, Engl. ending)
upgedatet (separable prefix, German ending)
geupdated (non-separable prefix, Engl. ending)
geupdatet (non-separable prefix, German ending)

to outsource:
outgesourced (separable prefix, Engl. ending)
outgesourcet (separable prefix, German ending)
geoutsourced (non-separable prefix, Engl. ending)
geoutsourcet (non-separable prefix, German ending)

In any case these word creations are hard to read !
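The four competing participle forms follow a regular pattern, so a lexicon tool could generate all candidates for a loan verb and let a lexicographer pick. A minimal sketch, assuming the verb is supplied already split into prefix and stem (the function name `participle_candidates` is hypothetical, and the English ending rule covers only the simple "-d" / "-ed" cases):

```python
def participle_candidates(prefix, stem):
    """Generate the four candidate German past participles of an English loan verb."""
    # English ending: 'date' -> 'dated', 'source' -> 'sourced' (simplified rule).
    english = stem + "d" if stem.endswith("e") else stem + "ed"
    # German ending: weak participle in -t.
    german = stem + "t"
    return [
        (prefix + "ge" + english, "separable prefix, Engl. ending"),
        (prefix + "ge" + german, "separable prefix, German ending"),
        ("ge" + prefix + english, "non-separable prefix, Engl. ending"),
        ("ge" + prefix + german, "non-separable prefix, German ending"),
    ]


for form, label in participle_candidates("out", "source"):
    print(f"{form} ({label})")
```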

6. How do we treat (idiomatic) phrases / phrasal expressions ?

multiple-morpheme strings / multi-word expressions (MWE): bite the dust, lose face, kick the bucket, ...

-> fixed expressions: cannot be modified
* bite the dirty dust
* lost her beautiful face
* kicked the full bucket

-> fixed expressions: require certain types of subjects (features !)
The cow kicked the bucket. vs. [Human being] kicked the bucket.
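Both properties of fixed expressions can be stored in the lexicon entry itself. A minimal sketch, assuming the idiomatic reading of "kick the bucket" is licensed only for a subject carrying the feature "human" (the table `IDIOMS`, the gloss, and the function `readings` are illustrative assumptions):

```python
# Idiom entries: the exact word sequence plus the features required of the subject.
IDIOMS = {
    ("kick", "the", "bucket"): {"gloss": "die", "subject_features": {"human"}},
}


def readings(verb_lemma, object_words, subject_features):
    """Return the available readings for a verb phrase, given the subject's features."""
    result = ["literal"]
    # Exact match on the word sequence: any inserted modifier blocks the idiom.
    idiom = IDIOMS.get((verb_lemma, *object_words))
    if idiom and idiom["subject_features"] <= set(subject_features):
        result.append("idiomatic: " + idiom["gloss"])
    return result


print(readings("kick", ["the", "bucket"], {"human", "animate"}))
print(readings("kick", ["the", "bucket"], {"animate"}))           # the cow
print(readings("kick", ["the", "full", "bucket"], {"human"}))     # modified
```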

7. How do we treat neologisms ?

Neologisms = new word creations, acronyms, loan words, ... often „trendy“ words. Don't be afraid of new words !

Fact is: when reading through dictionaries for neologisms or even when reading newspapers, we find many words that are only a few years old, which sometimes still require some explanation but which belong to our active vocabulary:

airbag, AIDS, job-sharing, to pamper, global player, softy, to grab a bite, dude, (fast-food) joint, smarty pants, tin-foil face, to google, to xerox, ...

Neologisms are often created across languages, e.g. trendy words among young people:

sich ausruhen (to rest):
abkeimen (German)
to veg (out) / to couch / to chill out / to hang out (English)
chiller (French)
vegetar (Spanish)

sich grundlos betrinken (to get drunk for no reason):
sich abschädeln (German)
to get out of one's skull / to get plastered (English)
se murger sa gueule (French)
pillarse una tajada (Spanish)

„Juvenile language“ / chat language – acronyms from text messaging (sometimes international): lol, yolo

Words? Idiomatic expressions? Sentences?

spell aid – chat language (acronyms)

AFAIK -- As Far As I Know
AFK -- Away From Keyboard
ASAP -- As Soon As Possible
BAS -- Big A** Smile
BBL -- Be Back Later
BBN -- Bye Bye Now
BBS -- Be Back Soon
BEG -- Big Evil Grin
BF -- Boyfriend
BIBO -- Beer In, Beer Out
BRB -- Be Right Back
BTW -- By The Way
BWL -- Bursting With Laughter
C&G -- Chuckle and Grin
CICO -- Coffee In, Coffee Out
CID -- Crying In Disgrace
CP -- Chat Post (a chat message)
CRBT -- Crying Real Big Tears
CSG -- Chuckle Snicker Grin
CYA -- See You (Seeya)
CYAL8R -- See You Later (Seeyalata)
DLTBBB -- Don't Let The Bed Bugs Bite
EG -- Evil Grin
EMSG -- Email Message
FC -- Fingers Crossed
FTBOMH -- From The Bottom Of My Heart
FYI -- For Your Information

See: http://www.netlingo.com/acronyms.php

http://www.netlingo.com/top50/popular-text-terms.php
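A spell aid can treat such acronyms as ordinary lexicon entries whose "definition" is the expanded phrase. A minimal sketch, using a small subset of the list above (the names `CHAT_ACRONYMS` and `expand_acronyms` are hypothetical, and matching is case-insensitive by assumption):

```python
import re

# A few entries from the acronym list; a real table would hold all of them.
CHAT_ACRONYMS = {
    "AFAIK": "As Far As I Know",
    "ASAP": "As Soon As Possible",
    "BRB": "Be Right Back",
    "BTW": "By The Way",
    "FYI": "For Your Information",
}


def expand_acronyms(text, table=CHAT_ACRONYMS):
    """Replace every token found in the acronym table by its expansion."""
    def replace(match):
        token = match.group(0)
        # Look the token up in upper case; unknown tokens are left unchanged.
        return table.get(token.upper(), token)

    # Tokens may contain letters, digits and '&' (as in C&G or CYAL8R).
    return re.sub(r"[A-Za-z0-9&]+", replace, text)


print(expand_acronyms("btw, I will brb"))
```

Short acronyms such as BF or CP collide easily with ordinary words or other abbreviations, so a production table would need context checks before expanding them.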

spell aid – chat language (symbols)

:-| -- Ambivalent
o:-) -- Angelic
>:-( -- Angry
|-I -- Asleep
(::()::) -- Bandaid
:-{} -- Blowing a Kiss
\-o -- Bored
:-c -- Bummed Out
|C| -- Can of Coke
|P| -- Can of Pepsi
:( ) -- Can't Stop Talking
:*) -- Clowning
:' -- Crying
:'-) -- Crying with Joy
:'-( -- Crying Sadly
:-9 -- Delicious, Yummy
:-> -- Devilish
;-> -- Devilish Wink
:P -- Disgusted (sticking out tongue)
:*) -- Drunk
:-6 -- Exhausted, Wiped Out
:( -- Frown
\~/ -- Full Glass
\_/ -- Glass (drink)
^5 -- High Five

spell aid – EMOJIs … language ???

Language is the ability to acquire and use complex systems of communication, particularly the human ability to do so, and a language is any specific example of such a system.
- Natural Language Systems Harriehausen

EMOJIs vs. hieroglyphs

http://www.fbemoticons.com/wp-content/uploads/2010/03/facebook-emoticons.jpg

the lexicon : homonymy / polysemy

Special types of ambiguities on WORD LEVEL:
What are homonyms ?
What are homographs ?
What are homophones ?
What are heterophones ?
What are heterographs ?

Examples:
four – for; watt – what
homophones & heterographs: heterographic and homophonic

the lexicon : ambiguity on which NLP level?

(example images not reproduced; the answers given on the slides:)
• homonyms : homographic and homophonic; here: combined with MWE
• homonyms : homographic and homophonic; but combined with different tags (for the different POS); word level: pretty dirty (ADV + ADJ) vs. ADJ (gap: when you are) ADJ
• homonyms : homographic and homophonic; word level
• No homonymy at all, but syntactic ambiguity
• homonyms : homographic and homophonic; word level

homophones heterographs


questions:
• Why should we have a dictionary for homographs / homonyms ?
• How can we use such a dictionary ?
• How are entries structured ?

For German, homonym dictionaries have existed since 1532.

homonyms (homographic and homophonic); homographs (homographic but not necessarily homophonic):
Engl: pen – pen; German: August – August

homophones (heterographic and homophonic):
their – there; Wände – Wende


We need to distinguish 4 different categories:

A homonyms: homographic and homophonic (pipe, pen, Ball, …)
B homographs: homographic but not necessarily homophonic (pipe – pipe; pen – pen; August – August; Lache – Lache)
C homophones = heterographs: heterographic and homophonic (extra/xtra; right/write; their/there; deer/dear; two/too; base/bass; … Bug/buk; Wände/Wende) (no problem, but when can we use them?)
D heterophones / heteronyms: homographic and not homophonic (wound: [wu:nd] in: the wound, wounded; [waund] in: past tense of wind (wrap); desert – desert)

Remark: A-D always have DIFFERENT meanings.
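The four categories depend on only two tests: same spelling, and same pronunciation. A minimal sketch, assuming both words are already known to differ in meaning and that pronunciations are given as plain transcription strings (the function name `classify` and the transcriptions are illustrative):

```python
def classify(spelling_a, pron_a, spelling_b, pron_b):
    """Return the category labels (A-D above) that apply to a pair of words.

    Precondition: the two words have different meanings.
    """
    homographic = spelling_a == spelling_b
    homophonic = pron_a == pron_b
    if homographic and homophonic:
        return {"homonym", "homograph"}       # A, and therefore also B
    if homographic:
        return {"homograph", "heterophone"}   # B and D
    if homophonic:
        return {"homophone"}                  # C (= heterograph)
    return set()                              # simply two different words


print(classify("wound", "wu:nd", "wound", "waund"))
print(classify("their", "ðeə", "there", "ðeə"))
print(classify("pen", "pen", "pen", "pen"))
```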


PROBLEMS:

Homographs (homographic but not necessarily homophonic): pipe – pipe; pen – pen; August – August; Lache – Lache

Heterophones (homographic and not homophonic): wound – wound; desert – desert

                 homophonic             heterophonic
homographic      homonym, homograph     homograph, heterophone
heterographic    homophone              -

A : homonym
B : homograph
C : homophone (/heterograph)
D : heterophone (/heteronym)


EXAMPLES

homographic wordforms differing in their POS:

trust: verb / noun
poll: verb / noun
poor: adj / noun
lodge: noun / verb-trans / verb-intrans
long: noun (longitude) / adj / verb / adv (Don't be too long about it !)
out: prep / sep. prefix (to cry out)

Examples of homographs in German:

Noun vs. Noun:
Heroin: drug vs. female hero
Konstanz: town in Southern Germany vs. stability/consistency
Petra: female name vs. archeological site in Jordan
Montage: Mondays (pl) vs. assembling

Adj/Verb vs. Verb:
modern: adj (modern) vs. verb (to rot)
übersetzen: translate vs. cross the river

Noun vs. Verb:
Fliegen «» fliegen (flies vs. to fly)
Regen «» regen (rain vs. to stir)
Spinnen «» spinnen (spiders vs. to spin)
Zahlen «» zahlen (numbers vs. to pay)
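Pairs such as Fliegen «» fliegen differ only in capitalization, which is lost at the start of a German sentence, so a lookup there has to consider both entries. A minimal sketch that finds such pairs in a lexicon (the tiny `LEXICON` list, including the filler entry "Haus", and the function name are illustrative assumptions):

```python
from collections import defaultdict

# (wordform, part of speech) pairs; 'Haus' is a filler without a verb twin.
LEXICON = [
    ("Fliegen", "NOUN"), ("fliegen", "VERB"),
    ("Regen", "NOUN"), ("regen", "VERB"),
    ("Haus", "NOUN"),
]


def case_insensitive_homographs(lexicon):
    """Group entries whose wordforms coincide once capitalization is ignored."""
    groups = defaultdict(set)
    for form, pos in lexicon:
        groups[form.lower()].add((form, pos))
    # Keep only the groups that contain more than one distinct entry.
    return {key: sorted(entries) for key, entries in groups.items() if len(entries) > 1}


for key, entries in case_insensitive_homographs(LEXICON).items():
    print(key, entries)
```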

Examples of homographs in German:

Noun vs. inflected verbform (like the „wound“ example in English):
Flucht: escape (noun) vs. 3rd pers sing „to curse“
Naht: seam (noun) vs. 3rd pers sing „to come closer“ / „to approach“
Sucht: addiction (noun) vs. 3rd pers sing „to look for“

Homographs based on cross-language loan words:
Bug: front of a ship vs. error in a program (Engl.)
dies: demonstrative pronoun vs. days (Latin)
gen: in direction of (prep) vs. genetic factor (noun, Greek) vs. Gen. (abbrev. for genitive or noun, Genossenschaft/cooperative)


Why should we have a dictionary for homographs / homonyms ?
• simplify the dictionary lookup
• in conventional dictionaries, the correct base form is often not found:
wound: past tense of wind [waund]
wound: noun [wu:nd]

One needs to decide which one of these alternative notations of lexical ambiguity should be used:

homonymy ... homophonous words, different meaning -> separate entries:

entry    information
pen      noun <instrument for writing>
pen      noun <for cattle>
pen      noun <female swan>

problem: multiple parses

polysemy ... homophonous words, different meaning -> one entry:

entry    information
pen      noun 1. <instrument for writing> 2. <for cattle> 3. <female swan>

problem: unclear entries
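The two notations correspond to two ways of storing the same data. A minimal sketch contrasting them for the "pen" example (the tuple layout and the function names are assumptions made for illustration):

```python
from collections import defaultdict

# Homonymy notation: one separate entry per meaning.
SEPARATE_ENTRIES = [
    ("pen", "noun", "instrument for writing"),
    ("pen", "noun", "for cattle"),
    ("pen", "noun", "female swan"),
]


def lookup_separate(entries, wordform):
    """Every matching entry yields its own analysis (hence multiple parses)."""
    return [entry for entry in entries if entry[0] == wordform]


def merge_entries(entries):
    """Polysemy notation: one entry per (wordform, POS) with a numbered sense list."""
    merged = defaultdict(list)
    for wordform, pos, sense in entries:
        merged[(wordform, pos)].append(sense)
    return dict(merged)


print(len(lookup_separate(SEPARATE_ENTRIES, "pen")), "analyses with separate entries")
for (wordform, pos), senses in merge_entries(SEPARATE_ENTRIES).items():
    numbered = " ".join(f"{i}. <{s}>" for i, s in enumerate(senses, start=1))
    print(wordform, pos, numbered)
```

The merged form gives the parser a single analysis, at the price of pushing the choice between senses to a later processing step.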

Identification of a wordform as being a homograph

Example: ...the wound was hurting him...

1. step: identify ambiguities: wound
2. step: Homograph Dictionary, wordform-index: wordform -> address
3. step: Homograph Dictionary, homograph-classes: address -> homograph-class (e.g. wound (N), wind (V)), each with stem 1 / stem 2 and inflection-code
4. step: Homograph Dictionary, inflection-paradigm: inflection-code -> inflection-features
5. step: general dictionary: stem 1 / stem 2 -> address
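The lookup chain can be sketched with four small tables, one per step. This is a minimal sketch under stated assumptions: the table contents, the address "H1", the inflection codes and the glosses are all invented for illustration and are not taken from an actual homograph dictionary.

```python
# Step 2: wordform-index (wordform -> address)
WORDFORM_INDEX = {"wound": "H1"}

# Step 3: homograph-classes (address -> competing stems with inflection codes)
HOMOGRAPH_CLASSES = {
    "H1": [
        {"stem": "wound", "pos": "N", "inflection_code": "N-SG"},
        {"stem": "wind", "pos": "V", "inflection_code": "V-PAST"},
    ],
}

# Step 4: inflection-paradigm (inflection-code -> inflection-features)
INFLECTION_PARADIGMS = {
    "N-SG": {"number": "sing"},
    "V-PAST": {"tense": "past"},
}

# Step 5: general dictionary (stem -> entry)
GENERAL_DICTIONARY = {
    ("wound", "N"): "injury",
    ("wind", "V"): "to wrap",
}


def analyse(wordform):
    """Return all analyses of a wordform; an empty list means it is not a homograph."""
    address = WORDFORM_INDEX.get(wordform)          # steps 1 and 2
    if address is None:
        return []
    analyses = []
    for reading in HOMOGRAPH_CLASSES[address]:      # step 3
        features = INFLECTION_PARADIGMS[reading["inflection_code"]]      # step 4
        entry = GENERAL_DICTIONARY[(reading["stem"], reading["pos"])]    # step 5
        analyses.append({"stem": reading["stem"], "pos": reading["pos"],
                         "features": features, "entry": entry})
    return analyses


for analysis in analyse("wound"):
    print(analysis)
```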

For English: more information / reading material on disambiguating dictionary entries: www.research.microsoft.com/users/lucyv/ (Lucy Vanderwende, computational linguist)

For German: Weber, Heinz J.: Homographen-Wörterbuch der deutschen Sprache. De Gruyter, 1996, ISBN 978-3-11-014641-7