Web Information Extraction for eEnvironment

Author: Marcus Hensley
Web Information Extraction for eEnvironment We need help from domain experts!

Jan Dědek

Peter Vojtáš

Department of Software Engineering, Charles University in Prague
Institute of Computer Science, Czech Academy of Sciences
TOWARDS eENVIRONMENT 2009

Outline

• Introduction
  – Signal data versus human-produced data
  – Information processing: top-down versus bottom-up
  – Web and the environment, Web Information Extraction
• Our extraction framework
  – Web → Text → Linguistics → Structured Data
  – Extraction rules: linguistic tree patterns
• Learning of our tool
  – Human learning and machine learning
• Examples of extracted data
• Conclusion
  – We are interested in cooperation!

The Environment and Information on the Web

• Sensors/monitoring versus human-created information
• Top-down (eGov) versus bottom-up processing

Our Web Information Extraction Framework

• Motivated by the Semantic Web
  – Transformation of human-understandable resources into machine-understandable ones
• Information extraction from texts (from the Web)
  – Supported by linguistic tools (PDT, Netgraph, …)
• Machine learning of the extraction procedure
  – Human-annotated training set needed
• Structured (and semantic) extraction output
  – Structured output: XML or ontology data structure
  – Automatically annotated web pages

The Data Flow

Web → 1) Extraction of text → Text → 2) Linguistic annotation → Linguistic Trees → 3) Information extraction → Structured Data → 4) Semantic interpretation → Ontology

Domain expert has to:
• Select relevant pages
• Support learning procedure

Learning procedure produces rules for:
• Data extraction
• Data interpretation
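The four stages of the data flow can be pictured as a chain of functions. The following is only a minimal sketch under strong assumptions: the function names, the `CONCEPTS` mapping and the English sample sentence are hypothetical, stage 2 merely tokenizes (the real framework builds linguistic trees with tools such as PDT), and stage 3 uses a fixed token pattern instead of learned rules.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """1) Extraction of text: web page -> plain text."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)


def annotate(text):
    """2) Linguistic annotation: stand-in that only splits sentences into tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return [re.findall(r"\w+", s) for s in sentences]


def extract(annotated):
    """3) Information extraction: find '<number> litres of <material>' token runs."""
    records = []
    for tokens in annotated:
        for i in range(len(tokens) - 3):
            if tokens[i].isdigit() and tokens[i + 1] == "litres" and tokens[i + 2] == "of":
                records.append({"amount": int(tokens[i]), "unit": "l", "material": tokens[i + 3]})
    return records


# 4) Semantic interpretation: hypothetical mapping from words to ontology concepts.
CONCEPTS = {"diesel": "Diesel", "oil": "Oil"}


def interpret(records):
    """Attach an ontology concept to every extracted record."""
    return [dict(r, concept=CONCEPTS.get(r["material"], "OtherMaterial")) for r in records]


def run_pipeline(html):
    """Web -> Text -> (tokens) -> Structured Data -> concepts."""
    return interpret(extract(annotate(extract_text(html))))


page = "<html><body><p>About 800 litres of diesel leaked into a stream.</p></body></html>"
result = run_pipeline(page)
```

Each stage only consumes the output of the previous one, so the placeholder annotation and extraction steps could be swapped for real linguistic tools without touching the rest of the chain.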

Example of processed web page

Relevant text

„Únik provozních kapalin nebyl zjištěn.“ ("No leak of operating fluids was detected.")

Information relevant to the environment

Example of a linguistic tree

„Nárazem se utrhl hrdlo palivové nádrže a do potoka postupně vyteklo na 800 litrů nafty.“
"The impact tore off the neck of the fuel tank, and about 800 litres of diesel gradually leaked into a stream."

[Figure: linguistic tree of sentence jihmor56559.txt-001-p1s3 with five highlighted nodes numbered (1) to (5); visible node glosses: litre, diesel, "into" water stream]

Example of an extraction rule

Tree query over five nodes (numbered 1 to 5 on the slide), with these node constraints:
• t_lemma = uniknout | unikat | vytéci
• _name = unit
• gram/sempos = adj.quant.def, _name = amount
• functor = MAT, _name = material
• _optional = true, functor = DIR3, _name = where
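To make the idea of a linguistic tree pattern concrete, here is a small sketch of matching such a rule against a tree. It is not the Netgraph query engine: the `Node` and `Pattern` classes, the nesting of the rule (amount and material below the unit node, the optional "where" node below the verb) and the toy tree for the example sentence are all assumptions made for illustration. Only the attribute constraints themselves come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of a simplified linguistic tree: attributes plus child nodes."""
    attrs: dict
    children: list = field(default_factory=list)


@dataclass
class Pattern:
    """A query node: allowed attribute values, an output name, child patterns."""
    constraints: dict          # attribute -> set of allowed values
    name: str = None           # label under which the matched node is returned
    optional: bool = False
    children: list = field(default_factory=list)


def node_ok(node, pattern):
    """True when the node carries an allowed value for every constrained attribute."""
    return all(node.attrs.get(k) in allowed for k, allowed in pattern.constraints.items())


def match(node, pattern):
    """Return a dict name -> matched Node, or None if the pattern fails at this node."""
    if not node_ok(node, pattern):
        return None
    bindings = {pattern.name: node} if pattern.name else {}
    for child_pattern in pattern.children:
        for child in node.children:
            sub = match(child, child_pattern)
            if sub is not None:
                bindings.update(sub)
                break
        else:
            # No child matched; that is acceptable only for optional pattern nodes.
            if not child_pattern.optional:
                return None
    return bindings


def find_all(root, pattern):
    """Try the pattern at every node of the tree, depth-first."""
    hits = []
    found = match(root, pattern)
    if found is not None:
        hits.append(found)
    for child in root.children:
        hits.extend(find_all(child, pattern))
    return hits


# The rule from the slide, with an assumed nesting of its five nodes.
rule = Pattern(
    {"t_lemma": {"uniknout", "unikat", "vytéci"}},
    children=[
        Pattern({}, name="unit", children=[
            Pattern({"gram/sempos": {"adj.quant.def"}}, name="amount"),
            Pattern({"functor": {"MAT"}}, name="material"),
        ]),
        Pattern({"functor": {"DIR3"}}, name="where", optional=True),
    ],
)

# Hypothetical, heavily simplified tree for "... do potoka vyteklo na 800 litrů nafty".
tree = Node({"t_lemma": "vytéci"}, [
    Node({"t_lemma": "potok", "functor": "DIR3"}),
    Node({"t_lemma": "litr"}, [
        Node({"t_lemma": "800", "gram/sempos": "adj.quant.def"}),
        Node({"t_lemma": "nafta", "functor": "MAT"}),
    ]),
])

hits = find_all(tree, rule)
extracted = {name: node.attrs["t_lemma"] for name, node in hits[0].items()}
```

Because the "where" node is optional, the same rule would still match a sentence that names the amount and the material but not the place of the leak.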

Manual learning of extraction rules

[Diagram: iterative workflow with the elements Frequency Analysis, Key-words, Query, Matching trees, Coverage tuning, More accurate matches, Tree Query, Investigation of neighbors, More complex queries]

• Needs an expert in both the domain and linguistics
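The first steps of the manual workflow (frequency analysis, key-words, query) can be approximated in a few lines. This is a rough sketch, not the authors' tooling: it counts surface word forms rather than lemmas, the stop list is a hypothetical placeholder, and the sample fragments are shortened phrases in the style of the incident reports.

```python
import re
from collections import Counter

# Hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"a", "se", "do", "na", "z", "v"}


def keyword_candidates(sentences, top_n=5):
    """Frequency analysis: most frequent word forms, ignoring stop words and numbers."""
    counts = Counter()
    for sentence in sentences:
        for word in re.findall(r"\w+", sentence.lower()):
            if word not in STOP_WORDS and not word.isdigit():
                counts[word] += 1
    return counts.most_common(top_n)


def matching_sentences(sentences, keyword):
    """Key-word query: keep the sentences that contain the keyword as a whole word."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


sample = [
    "do potoka vyteklo 800 litrů nafty",
    "z nádrže uniklo 350 litrů nafty",
    "požár trávy",
]
candidates = keyword_candidates(sample, top_n=2)
selected = matching_sentences(sample, "nafty")
```

Sentences selected this way would then be inspected as trees, which is where the tree queries and the investigation of neighboring nodes come in.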

Machine learning of extraction rules

[Diagram elements: Web, Texts, Linguistic trees, ILP background knowledge, Domain expert, Learning examples + Semantics, ILP learning, Extraction rules, Extraction process, Extracted semantic data]

• Use of Inductive Logic Programming (ILP)
• Learning examples
  – Marked directly in the text
  – Done by the domain expert, no linguistic knowledge needed
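The step from marked examples to a rule can be illustrated with a far simpler technique than ILP. The sketch below uses Find-S style generalisation over attribute-value pairs: it keeps only what all positive examples share. It is a stand-in for intuition only, not the ILP learner used in the framework, and the attribute names and example nodes (including `parent_lemma`) are hypothetical.

```python
def generalise(examples):
    """Keep only the attribute/value pairs shared by every positive example."""
    if not examples:
        return {}
    common = dict(examples[0])
    for example in examples[1:]:
        common = {k: v for k, v in common.items() if example.get(k) == v}
    return common


def covers(rule, node):
    """A rule covers a node when the node has every attribute value the rule demands."""
    return all(node.get(k) == v for k, v in rule.items())


# Nodes the domain expert marked as "material" (attribute values are assumptions).
positives = [
    {"t_lemma": "nafta", "functor": "MAT", "parent_lemma": "litr"},
    {"t_lemma": "olej", "functor": "MAT", "parent_lemma": "litr"},
]
material_rule = generalise(positives)

unseen_material = {"t_lemma": "látka", "functor": "MAT", "parent_lemma": "litr"}
unrelated_node = {"t_lemma": "potok", "functor": "DIR3", "parent_lemma": "vytéci"}
```

The expert only marks words in the text; the linguistic attributes of the marked nodes come from the automatic annotation, which is why no linguistic knowledge is needed on the expert's side.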

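Experimental results (1)

„Nárazem se utrhl hrdlo palivové nádrže a do potoka postupně vyteklo na 800 litrů nafty.“
("The impact tore off the neck of the fuel tank, and about 800 litres of diesel gradually leaked into a stream.")
Extracted: 800 l (litre); nafta (diesel); potok (water stream)

„Z palivové nádrže vozidla uniklo do půdy v příkopu vedle silnice zhruba 350 litrů nafty, a proto byli o události informováni také pracovníci odboru životního prostředí Městského úřadu ve Vyškově a České inspekce životního prostředí.“
("About 350 litres of diesel leaked from the vehicle's fuel tank into the soil in the ditch beside the road, so staff of the Environment Department of the Vyškov Municipal Office and of the Czech Environmental Inspectorate were also informed about the incident.")
Extracted: 350 l; nafta; půda (soil)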

...

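Experimental results (2)

...

„Z kamionu uniklo zhruba 20 litrů látky.“
("About 20 litres of a substance leaked from the truck.")
Extracted: 20 l; látka (other material)

„Hasiči po likvidaci požáru trávy asi na 25 metrech čtverečních ještě uklízeli společně s pracovníky Správy silnic Moravskoslezského kraje zhruba 15 metrů silnice, na kterou vyteklo asi 40 litrů hydraulického oleje.“
("After extinguishing a grass fire covering about 25 square metres, the firefighters, together with staff of the Moravian-Silesian Region Road Administration, also cleaned up about 15 metres of road onto which about 40 litres of hydraulic oil had leaked.")
Extracted: 40 l; olej (gear oil)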
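Since the framework's structured output is XML or an ontology data structure, the values extracted above could be serialized roughly as follows. The element names (`extractions`, `leak`, `amount`, `unit`, `material`, `where`) are assumptions chosen for this sketch, not the schema used by the authors.

```python
import xml.etree.ElementTree as ET

# Values taken from the four example sentences; field names are assumptions.
records = [
    {"amount": "800", "unit": "l", "material": "nafta", "where": "potok"},
    {"amount": "350", "unit": "l", "material": "nafta", "where": "půda"},
    {"amount": "20", "unit": "l", "material": "látka"},
    {"amount": "40", "unit": "l", "material": "olej"},
]


def to_xml(records):
    """Serialize extraction records as one <leak> element per record."""
    root = ET.Element("extractions")
    for record in records:
        item = ET.SubElement(root, "leak")
        for key, value in record.items():
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")


xml_output = to_xml(records)
```

Records without a recognized place simply omit the `where` element, mirroring the optional node in the extraction rule.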

What is interesting on the Web?

• For environment specialists?
• What information from the Web can help with record-keeping, inspection and care for the environment?
• Perhaps our method can provide it!
• We are interested in cooperation!

Concluding Appeal

• We need your help!
• Please send us examples of resources and content interesting for environmentalists! [email protected] [email protected]

Thank you!