THE PRAGUE DEPENDENCY TREEBANK
Author: Brice Wilcox

Nedoluzhko A., Hajič J. et al.
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic

The Prague Dependency Treebank 2.0 (PDT 2.0) contains a large collection of Czech texts with rich, interlinked annotation on three levels: morphological (2 million words), syntactic (1.5 million words), and semantic (0.8 million words). In addition, certain properties of sentence information structure and coreference relations are annotated on the semantic level. PDT 2.0 builds on the long-standing Prague linguistic tradition, adapted to the needs of current research in computational linguistics, and uses up-to-date annotation technology.

Besides the large corpus of Czech, a Czech-English parallel corpus (the Prague Czech-English Dependency Treebank) is being developed. English sentences from the Wall Street Journal and their Czech translations are annotated in the same way as in PDT 2.0. This corpus is suited to experiments in machine translation, with special emphasis on dependency-based (structural) translation.

This report presents the basic annotation scheme, with particular attention to the semantic (tectogrammatical) level. The system of syntactic functors and the valency lexicon VALLEX are also discussed.

1. Introduction

The Prague Dependency Treebank (PDT) 2.0 is developed at ÚFAL MFF UK. It is annotated on three levels: morphological (2 million words), analytical (1.5 million words), and tectogrammatical (0.8 million words). The data are stored in an XML-based format described by RelaxNG schemas. Trees are viewed and edited in TrEd (tree editor), and the corpus can be searched with Netgraph.

Other treebanks referred to for comparison are the Penn Treebank (http://www.cis.upenn.edu/~treebank); the Danish Dependency Treebank, 5500 (http://www.id.cbs.dk/~mbk/treebank); the Portuguese Floresta Sintá(c)tica project, 10000 (http://acdc.linguateca.pt/treebank/info_floresta_English.html); and the METU-Sabanci Turkish Treebank (http://www.ii.metu.edu.tr/~corpus/treebank.html).

2. Annotation levels

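The relation between the levels is illustrated with the sentence Byl by šel dolesa ("He would have gone into the forest"), in which dolesa is a misspelling of do lesa. The misspelling is corrected on the morphological level, where do and lesa are separate tokens, while the verb group Byl by šel is represented by a single node on the tectogrammatical level.

Fig. 1. Annotation levels in PDT 2.0 [figure not preserved].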

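2.1 Morphological level

On the morphological level, each token is assigned two attributes, lemma and tag. The lemma identifies the base form of the word. The tag is a string of 15 positions, each of which encodes one morphological category (for example, NNIS2-----A----).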
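The positional tag can be unpacked mechanically, one character per category. The following is a minimal sketch, assuming the standard PDT ordering of the 15 positions; the position names and the function `decode_tag` are illustrative labels, not identifiers from the PDT distribution.

```python
# Assumed names of the 15 tag positions, in order (illustrative labels).
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag):
    """Map each filled position of a 15-character tag to its value.

    Positions holding '-' are treated as not applicable and skipped.
    """
    if len(tag) != len(POSITIONS):
        raise ValueError("expected a tag of %d characters" % len(POSITIONS))
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


if __name__ == "__main__":
    # The example tag from the text: a noun, masculine inanimate,
    # singular, genitive, affirmative.
    print(decode_tag("NNIS2-----A----"))
```

Keeping only the filled positions makes sparse tags, such as those of punctuation or prepositions, easy to read.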

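2.2 Analytical level

On the analytical level, a sentence is represented as a dependency tree whose nodes correspond to the tokens of the sentence. The node attributes in PDT 2.0 include afun (the analytical function, that is, the type of syntactic dependency on the governing node), ord (the position of the node in the surface word order), and is_member (membership in a coordination or apposition).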
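To make the role of afun, ord and is_member concrete, here is a minimal sketch of an analytical tree held in memory. The `ANode` class, the `parent` field and the helper functions are assumptions made for illustration; the afun labels on the sample sentence are illustrative, and the technical root node of real PDT trees is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ANode:
    form: str
    ord: int                # position in the surface word order
    afun: str               # analytical function
    parent: Optional[int]   # ord of the governing node; None for the top node
    is_member: bool = False # member of a coordination or apposition


def surface_order(nodes: List[ANode]) -> List[str]:
    """Return the word forms sorted by the ord attribute."""
    return [n.form for n in sorted(nodes, key=lambda n: n.ord)]


def children(nodes: List[ANode], parent_ord: int) -> List[ANode]:
    """Return the dependents of the node with the given ord, left to right."""
    return sorted((n for n in nodes if n.parent == parent_ord),
                  key=lambda n: n.ord)


# Illustrative tree for "Byl by šel do lesa"; labels are assumptions.
SENTENCE = [
    ANode("šel", 3, "Pred", None),
    ANode("Byl", 1, "AuxV", 3),
    ANode("by", 2, "AuxV", 3),
    ANode("do", 4, "AuxP", 3),
    ANode("lesa", 5, "Adv", 4),
]

if __name__ == "__main__":
    print(surface_order(SENTENCE))
    print([c.form for c in children(SENTENCE, 3)])
```

Storing the governor as an ord value keeps the sketch flat; a real reader would follow the node identifiers used in the data files.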

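2.3 Tectogrammatical level

Nodes on the tectogrammatical level carry 39 attributes, among them id, functor, nodetype, t_lemma and is_parenthesis_root; reference attributes such as m.rf link the levels to one another.

The functor attribute gives the semantic role of a node with respect to its governing node. There are 67 functors, for example PRED (predicate), PAR (parenthesis), CONJ (conjunction), ADVS (adversative relation), ADDR (addressee) and DENOM (independent nominal clause). Spatial functors include LOC (where), DIR1 (from where), DIR2 (which way) and DIR3 (where to); temporal functors include TWHEN (when), TTILL (until when), TSIN (since when) and TPAR (in parallel with what).

The t_lemma attribute holds the tectogrammatical lemma of the node. The gram attribute groups 16 grammatemes: for example, gram/sempos (semantic part of speech) has 19 values, such as n.denot and adj.denot, and gram/verbmod records verbal modality.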
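The functor inventory lends itself to a simple lookup table. The sketch below covers only the functors named in this section; the English glosses, the group names and the function `describe_functor` are illustrative assumptions, and the table is not the complete inventory of 67 functors.

```python
# Functors named in the text, with assumed English glosses and groups.
FUNCTORS = {
    "PRED": ("predicate", "clause"),
    "PAR": ("parenthesis", "clause"),
    "DENOM": ("independent nominal clause", "clause"),
    "CONJ": ("conjunction", "coordination"),
    "ADVS": ("adversative relation", "coordination"),
    "ADDR": ("addressee", "argument"),
    "LOC": ("where", "spatial"),
    "DIR1": ("from where", "spatial"),
    "DIR2": ("which way", "spatial"),
    "DIR3": ("where to", "spatial"),
    "TWHEN": ("when", "temporal"),
    "TTILL": ("until when", "temporal"),
    "TSIN": ("since when", "temporal"),
    "TPAR": ("in parallel with what", "temporal"),
}


def describe_functor(functor):
    """Return 'FUNCTOR: gloss [group]', or a marker for unlisted functors."""
    if functor not in FUNCTORS:
        return "%s: not in this partial table" % functor
    gloss, group = FUNCTORS[functor]
    return "%s: %s [%s]" % (functor, gloss, group)


def functors_in_group(group):
    """List the functors of one group in alphabetical order."""
    return sorted(f for f, (_, g) in FUNCTORS.items() if g == group)


if __name__ == "__main__":
    print(describe_functor("DIR3"))
    print(functors_in_group("temporal"))
```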

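2.3.1 Valency

Valency frames are recorded in the valency lexicon VALLEX, which annotators consult from within TrEd. An example entry is the verb rozumět ("to understand"). Besides verbs, the section discusses deverbal nouns in -ní (koupaní, "bathing") and -tí (mýtí, "washing"). The section also gives the figure 2730.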

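2.3.2 Coreference

Coreference links are stored on the tectogrammatical level as references to the id of the antecedent node: coref_gram.rf for grammatical coreference and coref_text.rf for textual coreference. The coref_special attribute covers cases in which the antecedent is not a single node, such as reference to a larger text segment or to an entity outside the text. Related work on extended textual coreference and bridging anaphora is reported in [4].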
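Because each coreference attribute stores the id of an antecedent, a coreference chain can be recovered by following the links. This is a minimal sketch, assuming nodes are held in a dictionary keyed by id and that each link attribute is a list of ids; the node ids, lemmas and the function `antecedent_chain` are hypothetical.

```python
def antecedent_chain(nodes, start_id):
    """Follow coref_text.rf and coref_gram.rf links from start_id.

    Only the first link of each node is followed. The walk stops at a
    node without links, at an unknown id, or when a cycle is detected.
    """
    chain = [start_id]
    seen = {start_id}
    current = start_id
    while True:
        node = nodes[current]
        links = node.get("coref_text.rf", []) + node.get("coref_gram.rf", [])
        if not links:
            break
        target = links[0]
        if target in seen or target not in nodes:
            break
        chain.append(target)
        seen.add(target)
        current = target
    return chain


# Hypothetical nodes: a pronoun points to a noun, which points to a name.
NODES = {
    "t-1": {"t_lemma": "Havel"},
    "t-2": {"t_lemma": "prezident", "coref_text.rf": ["t-1"]},
    "t-3": {"t_lemma": "#PersPron", "coref_text.rf": ["t-2"]},
    "t-4": {"t_lemma": "který", "coref_gram.rf": ["t-9"]},
}

if __name__ == "__main__":
    print(antecedent_chain(NODES, "t-3"))
```

Tracking visited ids keeps the walk finite even if the data contain a circular link.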

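2.3.3 Topic-focus articulation

The information structure of the sentence is captured by two attributes. The tfa (topic-focus articulation) attribute takes the values t (contextually bound, non-contrastive node), c (contextually bound, contrastive node) and f (contextually non-bound node). The deepord attribute gives the position of the node in the deep word order.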
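The two attributes can be combined to list the nodes in deep word order and to separate contextually bound from non-bound nodes. The sketch below is a deliberate simplification: it only partitions nodes by their tfa value, whereas the division of a sentence into topic and focus is defined by a more detailed procedure. The sample nodes, their values and the function `split_by_tfa` are hypothetical.

```python
def split_by_tfa(nodes):
    """Order nodes by deepord and split them by contextual boundness.

    Returns (bound, non_bound): lemmas with tfa 't' or 'c', and lemmas
    with tfa 'f', each in deep word order.
    """
    ordered = sorted(nodes, key=lambda n: n["deepord"])
    bound = [n["t_lemma"] for n in ordered if n["tfa"] in ("t", "c")]
    non_bound = [n["t_lemma"] for n in ordered if n["tfa"] == "f"]
    return bound, non_bound


# Hypothetical nodes; the values are invented for illustration only.
SAMPLE = [
    {"t_lemma": "jasný", "tfa": "f", "deepord": 4},
    {"t_lemma": "kontura", "tfa": "t", "deepord": 1},
    {"t_lemma": "zdát_se", "tfa": "f", "deepord": 3},
    {"t_lemma": "projev", "tfa": "c", "deepord": 2},
]

if __name__ == "__main__":
    print(split_by_tfa(SAMPLE))
```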

3. Annotation example from PDT 2.0

Některé kontury problému se však po oživení Havlovým projevem zdají být jasnější. ("However, some contours of the problem seem to be clearer after the revival by Havel's speech.")

3.1. Word-by-word gloss

[The gloss table for this sentence is not preserved.]

3.2. Morphological level

Form       Lemma                        Tag
Některé    některý                      PZFP1----------
kontury    kontura                      NNFP1-----A----
problému   problém                      NNIS2-----A----
se         se_^(zvr._zájmeno/částice)   P7-X4----------
však       však                         J^-------------
po         po-1                         RR--6----------
oživení    oživení_^(*3it)              NNNS6-----A----
Havlovým   Havlův_;S_^(*3el)            AUIS7M---------
projevem   projev                       NNIS7-----A----
zdají      zdát                         VB-P---3P-AA---
být        být                          Vf--------A----
jasnější   jasný                        AAFP1----2A----
.          .                            Z:-------------
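Several lemmas in the example carry technical material after the base form: a numeric suffix (po-1) or remarks introduced by an underscore (oživení_^(*3it)). A minimal sketch of separating these parts follows, assuming that the first underscore starts the remarks and that a trailing hyphen plus digits is a variant number; the function `split_lemma` is illustrative, not part of any PDT tool.

```python
import re


def split_lemma(lemma):
    """Split a lemma into (base form, variant number or None, remarks).

    The remarks part keeps its leading underscore; it is an empty
    string when the lemma has no remarks.
    """
    base, sep, rest = lemma.partition("_")
    remarks = "_" + rest if sep else ""
    match = re.fullmatch(r"(.+?)-(\d+)", base)
    if match:
        return match.group(1), int(match.group(2)), remarks
    return base, None, remarks


if __name__ == "__main__":
    for lemma in ["po-1", "oživení_^(*3it)", "Havlův_;S_^(*3el)", "však"]:
        print(lemma, "->", split_lemma(lemma))
```

The remarks are returned unparsed, since their internal syntax is not described in this section.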

3.3. Analytical level

[Figure not preserved: analytical tree of the example sentence.]

3.4. Tectogrammatical level

[Figure not preserved: tectogrammatical tree of the example sentence.]

***

The PDT annotation scheme has also been applied to other languages, in the Prague Arabic Dependency Treebank (http://www.ldc.upenn.edu) and the Prague Czech-English Dependency Treebank (http://ufal.mff.cuni.cz/pcedt). The latter is built on Wall Street Journal texts from Penn Treebank 3 and contains 21600 sentences.

References:

1. Čmejrek M., Cuřín J., Havelka J., Hajič J., Kuboň V. Prague Czech-English Dependency Treebank: Syntactically Annotated Resources for Machine Translation. In 4th International Conference on Language Resources and Evaluation, Lisbon, Portugal, 2004. http://ufal.mff.cuni.cz/pcedt/doc/papers/lrec2004_pcedt.pdf
2. Hajič J., Hajičová E., Hlaváčová J., Klimeš V., Mírovský J., Pajas P., Štěpánek J., Vidová-Hladká B., Žabokrtský Z. PDT 2.0 – Guide. UFAL & CKL, 2006. http://ufal.mff.cuni.cz/pdt2.0/
3. Mikulová M. et al. Anotace na tektogramatické rovině Pražského závislostního korpusu. Anotátorská příručka. Institute of Formal and Applied Linguistics, Charles University, Prague, 2006.
4. Nědolužko A. Zpráva k anotování rozšířené textové koreference a bridging vztahů v Pražském závislostním korpusu (Report on the annotation of extended textual coreference and bridging relations in the Prague Dependency Treebank). Technical report. Institute of Formal and Applied Linguistics, Charles University, Prague, 2007.
5. Žabokrtský Z., Lopatková M. Valency Frames of Czech Verbs in VALLEX 1.0. In Frontiers in Corpus Annotation. Proceedings of the Workshop of the HLT/NAACL Conference, pp. 70-77, 2004.