Intelligent Content Management System Project Presentation

Intelligent Content Management System Project Presentation April 2002

IST-2001-32429 ICONS Intelligent Content Management System
www.icons.rodan.pl

Project Partners:
- Rodan Systems (PL)
- The Polish Academy of Sciences (PL)
- Centro di Ingegneria Economica e Sociale (IT)
- InfoVide (PL)
- SchlumbergerSema (BE)
- University Paris 9 Dauphine (FR)
- University of Ulster (UK)


Project name: Intelligent Content Management System
Acronym: ICONS
Workpackage: WP9
Task: T9.1
Document type: report
Title: Intelligent Content Management System
Subtitle: Project Presentation
Document acronym: D01
Author(s): Witold Staniszkis, Nicola Leone, Pasquale Rullo, Łukasz Balcerek, Michał Śmiałek, Witold Litwin, Gérard Levy, Jules Georges, Kazimierz Subieta, Mariusz Momotko, Dorota Depowska, Janusz Charczuk, Waldemar Piszczewiat, Yaxin Bi, David Bell
Reviewer(s): Annette Bleeker, Bartosz Nowicki
Accepting: Witold Staniszkis
Location: I:\WP9 Project Management\ICONS WP9 T1 D01 0115.doc
Version: 1.15
Date: April 2002
Status: final version
Distribution: public

History of changes (version 1.15, April 2002)

Date     Version  Author              Change description
6.4.02   1.14     Bartosz Nowicki     final packaging
4.4.02   1.12     Witold Staniszkis   integration of partners' inputs for chapters 5-9
30.3.02  1.8      project partners    partners' inputs provided: Mariusz Momotko (5.5, 7, 8.3); Witold Litwin, Gérard Levy (8.1, 8.2); David Bell, Yaxin Bi (5.1-5.4); Nicola Leone, Pasquale Rullo (5.1-5.4); Kazimierz Subieta (5.6); Jules Georges (9.1); Łukasz Balcerek, Michał Śmiałek (9.2); Dorota Depowska, Waldemar Piszczewiat (6)
30.2.02  1.03     Bartosz Nowicki     ICONS template applied
30.2.02  1.02     Witold Staniszkis   further elaboration; work distribution among partners
1.2.02   1.01     Witold Staniszkis   document creation

IST-2001-32429 ICONS Intelligent Content Management System

page 3/86


Executive summary

The primary objective of the ICONS Project Presentation report is to provide a baseline platform for all ICONS project stakeholders, representing the consensus of the ICONS consortium members with respect to the ICONS research and development strategy. Much effort has gone into interactions among members of the ICONS research and development community, aiming at the reconciliation of diverse views and specialisations in the relevant research realms. We assume that the ensuing research results may require refinements and modifications of the underlying ICONS assumptions, and we plan to reflect them in subsequent versions of the report. Hence, this report is a “live” status document reflecting the current views of the ICONS consortium.

The initial effort has gone into the Knowledge Management System (KMS) feature requirements analysis, in order to establish the compatibility of the requirements voiced by the knowledge management community with the prevailing opinions and conclusions of the ongoing research work in the IT field. Our motivation has been to verify the ICONS project goals and objectives, and possibly to re-orient some of the principal research and development objectives. The representative results of the management science research pertaining to intellectual capital and knowledge management have been examined. We have concentrated on the work of the Knowledge Management Consortium International [Firestone2000, McElroy1999], the seminal work in the area of learning organisations [Garvin1993] and knowledge modelling [Popper1971, Popper1977], as well as the generally accepted views of Nonaka and Takeuchi [Nonaka1995] with respect to knowledge creation and dissemination processes. The principal conclusion is that current KM needs require IT support for KM processes in order to facilitate innovation leading to enhanced competitive advantage. A mapping between the KM processes and the desirable KMS features has been established.
Our findings have been confronted with the prevailing views of the IT research and development community with respect to the KMS architecture requirements. We have developed a KMS reference architecture enumerating the desirable KM features, to provide a “common denominator” representation of the current IT research and development work. Principal results of the ongoing European KM projects may be found on the European KM Forum web site [KMForum2001]. The principal KMS feature sets include knowledge dissemination features, domain ontology features, content repository features, KMS actor collaboration features, knowledge security features, and content integration features. The role semantics of the KMS features with respect to the KM processes have been specified in order to confront the prevailing views of the IT community with those represented by management scientists. We have established that the referential KMS architecture is sufficiently powerful to provide significant enabling leverage for the KM field.

The above complementary views on the KM scene provide a solid referential background for the ICONS architecture specification, which forms a backbone for our research and development work. We concentrate our project work on three key technological areas, namely the Knowledge Management Technologies area, the Human/Computer Interaction (HCI) area, and the Distributed Architecture Technologies area. We further demonstrate that such an approach is fully compatible with the stated ICONS project goal and objectives, and that it enables us to provide the required technical support for the KMS reference architecture. The complete view of the ICONS architecture comprises additional technological areas, auxiliary to our project, namely the Content Management Technologies area and the Development Technologies area.
The software modules within the auxiliary areas are input into the project, preferably as “open source” or as modules proprietary to consortium partners, to be subsequently used and/or modified within the ICONS prototype. The cross-reference between the KMS referential architecture and the proposed ICONS architecture, indicating the research and/or development effort needed, shows the completeness of the ICONS features with respect to the established requirements. Knowledge-based features are an important building block of the ICONS architecture; therefore a multi-paradigm approach has been proposed. The research work covers formal aspects of knowledge representation, including rules and uncertainty, the Dempster-Shafer theory, and the extended relational model. A disjunctive Datalog inference engine, to be extended and integrated into the system, provides the principal knowledge-based platform. Procedural knowledge representation based on workflow specifications is to extend the Workflow Management Coalition (WfMC) model with time modelling features and CPM (Critical Path Method) modelling capabilities. Such extensions allow for enhanced support of knowledge management processes that are usually unsuitable for the WfMC-based process modelling approach. We propose an advanced graphic HCI interface to support visualisation and manipulation of structural knowledge comprising semantic nets, UML relationships, and process graphs.
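The disjunctive Datalog engine referred to above is an existing component to be extended within ICONS. Purely as an illustration of the rule-based inference such an engine builds on, the following is a minimal, plain (non-disjunctive) Datalog evaluator using naive fixpoint iteration; all predicates and facts are invented for this sketch, and the real engine additionally handles disjunctive heads, negation, and efficient evaluation strategies.

```python
# Minimal naive-evaluation Datalog sketch (illustrative only).
# Atoms are (predicate, terms) pairs; uppercase strings act as variables.

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def unify(atom, fact, subst):
    """Extend substitution so atom matches fact, or return None."""
    pred, terms = atom
    fpred, fterms = fact
    if pred != fpred or len(terms) != len(fterms):
        return None
    s = dict(subst)
    for t, f in zip(terms, fterms):
        if is_var(t):
            if t in s and s[t] != f:
                return None
            s[t] = f
        elif t != f:
            return None
    return s

def solve_body(body, facts, subst):
    """Yield all substitutions satisfying every atom in the rule body."""
    if not body:
        yield subst
        return
    for fact in facts:
        s = unify(body[0], fact, subst)
        if s is not None:
            yield from solve_body(body[1:], facts, s)

def infer(facts, rules):
    """Iterate all rules to a fixpoint, returning the derived fact set."""
    facts = set(facts)
    while True:
        new = set()
        for head, body in rules:
            for s in solve_body(body, facts, {}):
                pred, terms = head
                new.add((pred, tuple(s.get(t, t) for t in terms)))
        if new <= facts:
            return facts
        facts |= new

# Hypothetical example: transitive closure of a part-of relation.
facts = {("part_of", ("wheel", "car")), ("part_of", ("bolt", "wheel"))}
rules = [
    (("component", ("X", "Y")), [("part_of", ("X", "Y"))]),
    (("component", ("X", "Z")),
     [("part_of", ("X", "Y")), ("component", ("Y", "Z"))]),
]
result = infer(facts, rules)
print(("component", ("bolt", "car")) in result)  # True
```

The fixpoint loop mirrors the bottom-up semantics of Datalog; a disjunctive engine such as the one planned for ICONS must instead enumerate multiple minimal models, which is well beyond this sketch.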


The knowledge-based capabilities are to be used in the development of the intelligent content integration features to support an open ICONS content repository. The ICONS content management functions are to integrate, under a unique knowledge map, information resources stored internally and those stored in Web information sources, as well as in legacy information systems and heterogeneous databases. A wrapper-based architecture is to establish the technological base for content integration.

The key features of the ICONS workflow management platform are the dynamic workflow participant assignment functions, the dynamic control flow condition modification capabilities, and time modelling features. Knowledge-based support to be used within the workflow management engine is to be developed with the use of the disjunctive Datalog inference engine module. Appropriate extensions to the WfMC model will be developed.

The ICONS distributed processing organisation, providing both for data and processing distribution, is to be based on the SDDS approach with appropriate extensions to meet the system requirements. Distributed processing will be enabled by load balancing algorithms to be embedded in the ICONS control functions. The workflow process distribution and inter-operability are to be based on the distributed workflow communication and synchronisation features to be developed for the ICONS prototype.

ICONS capabilities are to be demonstrated by a knowledge management application to be developed by the project team as “The NAS Best Practices Portal”. The application development cycle and techniques are to follow a KMS development methodology to be specified within the ICONS project. A preliminary analysis of the state of the art in the area of KMS methodologies shows that, although a sound methodological basis exists in the software engineering area, no generally accepted approach exists in the knowledge management realm.
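The SDDS approach mentioned above goes back to Litwin's scalable distributed data structures, of which LH* (distributed linear hashing) is the best-known example. As a simplified illustration only (not the ICONS design), the sketch below models linear-hashing address computation over a growing set of server buckets; a real LH* file also maintains possibly outdated client images that are corrected by adjustment messages, which this sketch omits entirely.

```python
# Simplified linear-hashing addressing in the spirit of LH*.
# Models only the server-side bucket mapping; buckets stand in
# for servers, and the split policy here is invoked manually.
class LinearHashFile:
    def __init__(self):
        self.level = 0       # i: current level, h_i(k) = k mod 2^i
        self.next_split = 0  # n: next bucket due to split
        self.buckets = [[]]  # bucket 0 at level 0

    def address(self, key):
        a = key % (2 ** self.level)
        if a < self.next_split:            # bucket already split this round
            a = key % (2 ** (self.level + 1))
        return a

    def insert(self, key):
        self.buckets[self.address(key)].append(key)

    def split(self):
        """Split bucket n: rehash its keys at level i+1, advance n."""
        self.buckets.append([])
        old = self.buckets[self.next_split]
        self.buckets[self.next_split] = []
        self.next_split += 1
        if self.next_split == 2 ** self.level:  # round finished
            self.level += 1
            self.next_split = 0
        for key in old:
            self.buckets[self.address(key)].append(key)

f = LinearHashFile()
for k in [3, 6, 11, 14]:
    f.insert(k)
f.split()                    # keys redistribute across two buckets
print(f.address(6), f.address(11))  # 0 1
```

The attraction of this family of structures for a distributed repository is that the address function needs no central directory, so the file can grow bucket by bucket as load increases.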
The conclusions of the report show that the proposed approach to the ICONS project research and development work is compatible with the stated project objectives. The ICONS project activities cover the following research and development areas: (i) knowledge representation techniques and methodologies for a multimedia content repository, (ii) advanced graphic user interface design and management tools, (iii) design and implementation of efficient algorithms for management of large, distributed multimedia content repositories, and (iv) an analysis and design methodology for large, knowledge-based content repository systems.


Table of contents

History of changes .... 3
Executive summary .... 4
Table of contents .... 6
List of figures .... 8
List of tables .... 8
1. Introduction .... 9
1.1 Objectives .... 9
1.2 Scope .... 9
1.3 Relations to other documents .... 9
1.4 Intended audience .... 9
1.5 Usage guidelines .... 9
1.6 Notation conventions .... 9
2. The ICONS Project Goal and Objectives .... 10
3. Feature Requirements of a Knowledge Management System .... 12
3.1 Knowledge Management: A Framework for User Requirements .... 12
3.2 The KMS Reference Architecture .... 18
3.2.1 Domain Ontology features .... 19
3.2.2 Content Repository features .... 21
3.2.3 Knowledge Dissemination features .... 21
3.2.4 Content Integration features .... 22
3.2.5 Actor Collaboration features .... 23
3.2.6 Knowledge Security features .... 24
4. Architecture of the Intelligent CONtent management System (ICONS) .... 25
4.1 The ICONS architecture specification .... 25
4.1.1 Development Technologies .... 25
4.1.2 Content Management Technologies .... 26
4.1.3 Knowledge Management Technologies .... 27
4.1.4 Human Computer Interaction Technologies .... 28
4.1.5 Distributed Architecture Technologies .... 29
4.2 The ICONS architecture vs. the KMS reference architecture .... 29
5. The ICONS Knowledge Representation Features .... 33
5.1 Requirements for Knowledge Management (KM) .... 33
5.2 Syntax/Semantics .... 33
5.3 Formal foundations of knowledge representation .... 35
5.3.1 Rules and uncertainty .... 35
5.3.2 Data Representation using Dempster-Shafer theory .... 35
5.3.3 Extended relational database model .... 36
5.3.4 Hyperrelations used for representing mined knowledge .... 36
5.3.5 Hyperrelations as knowledge representation .... 36
5.3.6 Metadata .... 37
5.3.7 Sharing data .... 37
5.4 Disjunctive Logic Programming .... 38
5.5 Procedural knowledge representation features .... 43
5.6 Knowledge representation and manipulation in the graphic user interface .... 45
6. The ICONS Intelligent Content Integration Features .... 50
6.1 The ICONS Global Knowledge Schema .... 50
6.2 The ICONS Content Repository .... 51
6.3 Integration of the heterogeneous content sources .... 51
7. The ICONS Intelligent Workflow Features .... 53
7.1 Dynamic workflow participant assignment .... 53
7.2 Dynamic control flow condition definition .... 53
7.3 Time management .... 53
7.4 Task scheduling .... 54
7.5 Extensions with respect to the WfMC's workflow process meta-model .... 54
8. The ICONS Distributed Processing Organisation .... 55
8.1 The ICONS scalable, distributed architecture .... 55
8.2 The ICONS distributed processing optimisation and load balancing .... 57
8.3 The ICONS distributed workflow process communication and synchronisation .... 58
9. Demonstration of ICONS prototype capabilities .... 60
9.1 The “Newly-associated States Best Practices” Portal .... 60
9.1.1 Introduction .... 60
9.1.2 Key Issues for Application Development .... 64
9.1.3 Key Success Factors .... 66
9.1.4 Remarks .... 66
9.2 The Knowledge Management System Design Methodology .... 67
9.2.1 Approaches to Knowledge Management methodologies .... 67
9.2.2 Requirements for defining a comprehensive KMS development methodology .... 67
9.2.3 The ICONS Development Methodology .... 70
10. Conclusions .... 72
10.1 Compatibility with the stated ICONS project goals and objectives .... 72
10.2 Overview of the ICONS project development plan .... 72
Appendix A. List of workpackages and deliverables .... 76
Workpackages .... 76
Deliverables list .... 77
Bibliography .... 78
External references .... 78
ICONS references .... 84
Dictionary .... 85


List of figures

Figure 1. The scope of KM activities in 423 corporations surveyed by KPMG [KPMG1999] .... 12
Figure 2. The Knowledge Life Cycle (KLC) .... 13
Figure 3. Four processes of knowledge conversion [Nonaka1995] .... 15
Figure 4. ICONS taxonomy of knowledge .... 18
Figure 5. The Knowledge Management System reference architecture .... 18
Figure 6. The ICONS architecture schematic model .... 25
Figure 7. Treatment relation .... 36
Figure 8. A hyperrelation .... 37
Figure 9. Architecture of the GUI module .... 45
Figure 10. ICONS GUI module with interfaces to databases .... 47
Figure 11. A graph of objects .... 48
Figure 12. The idea of the user basket .... 48
Figure 13. Models of workflow co-operation .... 58
Figure 14. Main concept of the ICONS portal for NAS Best Practice .... 63
Figure 15. The knowledge life cycle of the NAS Best Practices Portal .... 65

List of tables

Table 1. Cross-reference between the KM processes and the KMS features .... 16
Table 2. Feature roles within the knowledge management processes .... 17
Table 3. Feature requirements of a Knowledge Management System .... 19
Table 4. The ICONS focus technological area modules and the Domain Ontology features cross-reference .... 30
Table 5. The ICONS focus technological area modules and the Content Repository features cross-reference .... 30
Table 6. The ICONS focus technological area modules and the Knowledge Dissemination features cross-reference .... 31
Table 7. The ICONS focus technological area modules and the Content Integration features cross-reference .... 32
Table 8. The ICONS focus technological area modules and the Actor Collaboration features cross-reference .... 32
Table 9. Checklist of the acquis (chapters in Regular Reports) .... 61
Table 10. Overview of Phare .... 62
Table 11. Best practice taxonomy .... 63
Table 12. Key technological issues for development of the NAS Best Practices Portal .... 66
Table 13. The ICONS project focus technological areas and the project objectives cross-reference .... 72
Table 14. The ICONS focus technological area modules and the research stream workpackages .... 75


1. Introduction

1.1 Objectives

The ICONS project presentation represents a refinement of the technical project specification comprised in the ICONS project proposal and the ensuing Work Description [ICONS CONTRACT] document, developed as the addendum to the research contract with the European Commission. It also reflects the commitments of project partners represented in the Consortium Agreement. The primary objective is to present the current ICONS consortium views on the scope and directions of the research and development work specified in the project work description, as well as on the methods and techniques to reach the stated project objectives. It is assumed that the project presentation document reconciles the diverse approaches to the attainment of the project objectives proposed by the consortium partners, and harmonises the initial research work on the standards and the research and technological terms of reference of the ICONS project.

Although the preliminary ICONS architecture representing the functional scope of the project has been defined in the Work Description document [ICONS CONTRACT], a flexible approach is adopted to allow for the changing views of the project team members, influenced by the ongoing research and development activities in the knowledge management field. Hence, the ICONS Project Presentation is to evolve, under the constraints of the project change management procedure [ICONS D2], and to be published as new versions of the document. Each new version of the project presentation is to highlight the important changes with respect to the previous technical approach and scope of work. The principal project change management rule indicates that the scope of the project and the corresponding ICONS architecture may not be changed without the written consent of the ICONS Project Officer representing the European Commission.

1.2 Scope The scope of this report covers the entire research and development work currently under way in the ICONS project.

1.3 Relations to other documents

This report provides a baseline specification of the principal directions of the research and development work to be carried out within the ICONS project. In this sense the report represents the consensus of the ICONS consortium members regarding the ICONS architecture and principal features, as well as the responsibilities and development tasks comprised in the project development plan. All ensuing technical documents produced within the ICONS project must not contradict the design decisions and research assumptions comprised in this report. Should a need arise to modify the underlying assumptions of the ICONS project development philosophy, appropriate changes will be applied to this report and published as the succeeding version.

1.4 Intended audience The intended audience comprises all members of the ICONS project consortium as well as the representatives of the European Commission monitoring and evaluating the progress of the project research and development work.

1.5 Usage guidelines

The contents of the ICONS Project Presentation must be known to and evaluated by all members of the project team. Since the document represents the current consensus of the ICONS consortium, it is mandatory that no important deviations from the presented ICONS architecture and the principal technical directions, as represented in the current version of this document, are allowed.

1.6 Notation conventions No special notation conventions are used in this report.


2. The ICONS Project Goal and Objectives

Turning information into knowledge has been one of the principal goals of advanced information systems developed in all realms of social and economic life of modern societies. Terms like “knowledge management”, “knowledge engineering” and “knowledge bases” have become ubiquitous in corporate board rooms as well as in IT departments. Easy access to information, enabled by the explosion of Internet technologies, has created new problems related to the exponentially growing wealth of information sources flooding information system users.

Many advanced information systems are focused on knowledge bases comprising large collections of facts, rules, and heuristics pertaining to a specific application domain. Such knowledge bases are typically divided into two principal parts, namely the content base, comprising repositories of multimedia information objects, and the ontologies representing formal knowledge pertaining to the corresponding application domain.

Our goal is to develop a prototype of an Intelligent CONtent management System (ICONS) supporting uniform, knowledge-based access to distributed information resources available in the form of web pages, pre-existing heterogeneous databases (formatted, text, and multimedia), business process specifications and operational information, as well as legacy information processing systems. The principal objectives of our research and development project are to obtain and present novel results in the areas of knowledge representation and inference, heterogeneous information integration, and user-friendly interfaces based on advanced information architecture techniques. The overall approach of the ICONS project is to: (a) provide effective methods for analysing and modelling, (b) develop practical tools for exploiting and using, and (c) assess in a pilot system the usefulness of ...
an intelligent content management system with advanced knowledge management capabilities, integrating internal content repositories with external heterogeneous information sources. To achieve these overall objectives, four streams of technical work comprising the above operational goals can be identified.

Objective 1: Development of knowledge representation techniques and methodologies for a multimedia content repository.

The following specific research problems must be addressed in order to develop the knowledge representation capabilities of ICONS:
(a) Application of semantic data models (UML) and deductive database mechanisms as the domain ontology specification tool.
(b) Extraction of knowledge embedded in XML documents and in the associated RDF specifications.
(c) Representation of knowledge embedded in the schemata of pre-existing heterogeneous databases and in legacy information processing system outputs.
(d) Design and implementation of an efficient, non-procedural content management framework providing content and knowledge model definition and query capabilities.
(e) Development of mechanisms for procedural knowledge definition and its further exploitation in the area of effective knowledge and business process management.

Results obtained in the above research areas will be embedded in the ICONS prototype and verified in the pilot application environment. The principal research approach is to create synergies by integrating known research results in novel configurations and contexts, as well as by extending known results to meet the identified new requirements.

Objective 2: Development of user interface design and management tools meeting the requirements of the information architecture methodology.

The user interface requirements fall into three distinct areas, namely the user tool set and dialogue model, the content presentation model, and the graphical knowledge presentation and manipulation model.
All of the above presentation models must incorporate personalisation capabilities in order to enable dynamic adjustments to changing user preferences discerned from the system usage patterns.
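Objective 1 above pairs semantic data models with deductive database mechanisms for ontology specification. As an illustration only (the facts, predicate names, and rules below are hypothetical, not part of the ICONS specification), the core deductive-database mechanism can be sketched as a set of ontology facts plus rules evaluated by naive forward chaining to a fixpoint:

```python
# Illustrative sketch: a domain ontology as facts plus deductive rules,
# evaluated by naive forward chaining to a fixpoint. All names are
# hypothetical; ICONS itself specifies UML models and a deductive database.

facts = {
    ("subclass_of", "Report", "Document"),
    ("subclass_of", "Document", "ContentObject"),
    ("instance_of", "d42", "Report"),
}

def subclass_closure_rule(facts):
    """subclass_of is transitive: A < B and B < C imply A < C."""
    derived = set()
    for (p1, a, b) in facts:
        if p1 != "subclass_of":
            continue
        for (p2, b2, c) in facts:
            if p2 == "subclass_of" and b2 == b:
                derived.add(("subclass_of", a, c))
    return derived

def instance_rule(facts):
    """An instance of a subclass is also an instance of the superclass."""
    derived = set()
    for (p1, x, cls) in facts:
        if p1 != "instance_of":
            continue
        for (p2, sub, sup) in facts:
            if p2 == "subclass_of" and sub == cls:
                derived.add(("instance_of", x, sup))
    return derived

def infer(facts, rules):
    """Apply all rules repeatedly until no new facts are derived."""
    facts = set(facts)
    while True:
        new = set().union(*(rule(facts) for rule in rules)) - facts
        if not new:
            return facts
        facts |= new

kb = infer(facts, [subclass_closure_rule, instance_rule])
print(("instance_of", "d42", "ContentObject") in kb)  # True
```

The same fixpoint evaluation generalises to arbitrary Datalog-style rules over the ontology, which is what makes a deductive database attractive as an ontology specification tool.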

IST-2001-32429 ICONS Intelligent Content Management System

page 10/86

Intelligent Content Management System The ICONS Project Goal and Objectives

1.15 April 2002

The information architecture methodologies and techniques are considered to be the prime requirements for the design and implementation of the ICONS user interface management functionality. This multi-disciplinary research involves the skills of industrial designers, psychologists, and computer scientists. The ICONS prototype and pilot application work is to provide a realistic test-bed for the proposed user interface management techniques.

Objective 3: Design and implementation of efficient algorithms for management of large, distributed multimedia content repositories. There are two dimensions to the ICONS content distribution. The first pertains to distribution of the system content repository, comprising the Content Base and the Ontology Base, and of the hierarchical storage management processes among the ICONS servers. The second concerns integration of external information sources, such as pre-existing heterogeneous databases, legacy information processing systems, and web information resources. Distribution of the ICONS components among the system servers requires efficient load balancing algorithms inter-operating with the selective content and ontology replication mechanism. Research will also concentrate on adaptive data caching techniques and multi-criteria data distribution optimisation. Integration of the external information resources is to be performed with the use of XML wrapper technology. Wrapper programs producing the required XML envelopes for extracted data are to be enriched with RDF specifications resulting from extracting semantics from database schemata, in the case of external databases, or representing semantics, in the case of legacy information processing system outputs. The wrapper programs will be generated in the form of Enterprise Java Bean modules comprising the necessary query statements.

Objective 4: Develop an analysis and design methodology for large, knowledge-based content repository systems.
The multimedia content repositories with knowledge representation capabilities require a novel approach to the analysis and design methodology. An application development life-cycle and the associated methods and techniques will be specified and a pilot application of ICONS will be developed. The pilot application is to be the “Best practices of PHARE, SAPARD, and ISPA projects developed within the Newly Associated States” content repository accessible on the Internet. The aim is to present the viability of the proposed methodology and to provide a starting point for the clearly needed knowledge source.
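The XML wrapper approach of Objective 3 can be sketched in outline. ICONS specifies such wrappers as generated Enterprise Java Bean modules; the fragment below is only a minimal, language-neutral illustration in Python of the wrapping idea, with all element names, the sample schema, and the ontology concepts invented for the example:

```python
# Sketch of the XML-wrapper idea from Objective 3: a relational row from an
# external source is wrapped in an XML envelope carrying RDF-style metadata
# extracted from the source schema. Element and property names are invented
# for illustration; ICONS generates such wrappers as Enterprise Java Beans.
import xml.etree.ElementTree as ET

def wrap_row(table, schema, row):
    """Produce an XML envelope for one row of an external database table."""
    env = ET.Element("envelope", source=table)
    meta = ET.SubElement(env, "metadata")
    for column, concept in schema.items():
        # RDF-like statement: which ontology concept the column denotes
        ET.SubElement(meta, "maps", column=column, concept=concept)
    data = ET.SubElement(env, "data")
    for column, value in row.items():
        ET.SubElement(data, column).text = str(value)
    return ET.tostring(env, encoding="unicode")

schema = {"proj_id": "icons:Project", "title": "icons:Title"}
row = {"proj_id": "IST-2001-32429", "title": "ICONS"}
xml = wrap_row("projects", schema, row)
print(xml)
```

In the ICONS design, the generated wrapper would additionally embed the query statements needed to fetch such rows from the external source on demand.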


3. Feature Requirements of a Knowledge Management System

Our objective is to confront the contemporary requirements of the fast-growing knowledge management field with the current views on KMS feature architectures, as well as with the existing IT technology pertaining to the KM realm.

3.1 Knowledge Management: A Framework for User Requirements

The knowledge management field has been growing dynamically, fuelled by the intensification of global competition in all principal areas of the world economy. The state of the KM field at the turn of the century is illustrated by a study of 423 corporations performed by KPMG [KPMG1999]. The scope of KM activities in the study sample is presented in Figure 1.

[Figure 1 - pie chart: KM is currently in operation - 34%; KM is currently being implemented - 29%; KM is not currently planned - 19%; KM is currently considered - 17%; KM has been abolished - 1%.]

Figure 1. The scope of KM activities in 423 corporations surveyed by KPMG [KPMG1999].

High interest in the field was evident at the time of the study (80% of corporations were in some stage of KM activities) and, judging by the increasing number of trade conferences and exhibitions pertaining to the KM field, the discipline has reached maturity. The principal questions from our point of view, to be discussed in this section, are (i) what is the role of IT as the enabling technology, and (ii) what extension of the currently available information management platforms is required in order to meet the growing requirements of the KM field? The second question has been at the root of the ICONS project proposal, so proper identification of the added value for the KM field emerging from the project is of paramount importance to the project consortium. A critical appraisal of the state of the art of the content management system area, which widely claims to provide direct support for KM, should provide the initial vantage point for evaluation of the ICONS project contribution.

We commence with a brief overview of the requirements of the KM field identified in a number of research studies performed in the realm of the European KM Forum [KMForum2001]. We also consider the views of the US knowledge management research community comprised in research papers representing the current positions of the Knowledge Management Consortium International (KMCI) [Firestone2000, McElroy1999] and focusing on KM research and practice in the USA [Garvin1993, Quinn1996, Baek1999, Becker1999, Coleman1999, Davenport1999, Huntington1999]. The common fallacy of the IT side of the KM scene is focusing on a purely technological view of the field, with a tendency to highlight features that are already available in advanced content management systems. Such systems are commonly referred to as corporate portal platforms or, more to the point, as knowledge portal platforms.
From the KM perspective, as discussed in [McElroy1999], such claims may be justified only with respect to a narrow view of the field, focusing on distribution of existing knowledge throughout the organisation. The above views, called by some authors “First Generation Knowledge Management (FGKM)” or “Supply-side KM”, provide a natural link into the realm of currently used content management


techniques, such as groupware, information indexing and retrieval systems, knowledge repositories, data warehousing, document management, and imaging systems. We shall briefly refer to existing content management technologies in the ensuing sections of the report to show that, within the above narrow view, the existing commercial technologies meet most of the user requirements.

With the growing maturity of the KM field, the emerging opinion is that IT support for accelerating the production of new knowledge is a much more attractive proposition from the point of view of gaining competitive advantage. The focus of the so-called “Second Generation Knowledge Management (SGKM)” feature requirements is on enhancing the conditions in which innovation and creativity naturally occur. This does not mean that such FGKM features as systems support for knowledge preservation and sharing are to be ignored. A host of new KM concepts, such as the knowledge life cycle, knowledge processes, organisational learning, and complex adaptive systems (CAS), provide the underlying conceptual base for the SGKM, thus challenging the architects of the new generation of Knowledge Management Systems (KMS).

The Knowledge Life Cycle (KLC), developed within the KMCI-sponsored research [Firestone2000], provides us with the high-level feature requirements abstraction to be used as the starting point for evaluation of the ICONS architecture. The KLC as proposed by KMCI is presented in Figure 2.

[Figure 2 - diagram: the KLC comprises Knowledge Production (individual and group interaction; data/info acquisition; new knowledge claims; initial knowledge codification), producing Knowledge Claims; Knowledge Validation (knowledge claim peer review; application of validation criteria; weighting of value in practice; formal knowledge codification), producing Organizational Knowledge; and Knowledge Integration (knowledge sharing and transfer; teaching and training; operationalizing new knowledge; production of knowledge artifacts), closed by an experiential feedback loop.]
Figure 2. The Knowledge Life Cycle (KLC). The concepts underlying the KLC model of knowledge management comprise the notion of a Natural Knowledge Management System (NKMS) defined in [Firestone2000] as “the on-going, conceptually distinct, persistent, adaptive interaction among intelligent agents: (a) whose interaction properties are not determined by design, but instead emerge from the dynamics of the enterprise interaction process itself, (b) that produces, maintains, and enhances the knowledge base produced by the interaction”. The above definition of the knowledge management system fits the notion of a complex adaptive system (CAS) defined as “a goal-directed open system attempting to fit itself to its environment and composed of interacting adaptive agents described in terms of rules applicable with respect to some specified class of environmental inputs” [Holland1995]. In order to keep compatibility with our project terminology we shall distinguish two classes of actors interacting within the KM environment; human beings called employees or knowledge workers, and knowledge-based computer programs called intelligent agents. A thorough discussion of the intelligent agent technology may be found in [Baek1999] while a taxonomy of intelligent agent knowledge-based features is presented in [Huntington1999]. The Knowledge Base (KB) of the system is “the set of remembered data, validated propositions and models (along with metadata related to their testing), refuted propositions and models (along with metadata related to their refutation), metamodels, and (if the system produces such an artifact) software used for manipulating these, pertaining to the system and produced by it” [Firestone2000].


A knowledge base, not necessarily meant as an IT-related concept, constitutes the principal element of any knowledge management system and therefore requires more detailed consideration. There are emerging schools of thought deviating from the popular definition of knowledge as “justified, true belief” [Goldman1991] in several important aspects. First of all, the knowledge base is to comprise justified knowledge, where justification is specific to the validation criteria used by the system (note that such validation criteria may vary from organisation to organisation), and, although the definition is consistent with the idea that individual knowledge is a particular kind of belief, the notion of belief extends beyond cognition alone to evaluation.

The concept of the learning organization, defined in [Garvin1993] as “an organization skilled at creating, acquiring, and transferring knowledge, and at modifying its behaviour to reflect new knowledge and insights”, provides an important context for the KMS feature analysis. Garvin introduces five main activities acting as the building blocks of a learning organization, namely: “systematic problem solving, experimentation with new approaches, learning from one’s own experience and past history, learning from experiences and best practices of others, transferring knowledge quickly and efficiently throughout the organization”. Attributes of a learning organization, important for the management of professional intellect, have been identified in [Quinn1996].
The intellectual capital of an organization comprises such elements as: cognitive knowledge (know-what), the basic mastery of a discipline that professionals achieve through extensive training and certification; advanced skills (know-how), the ability to apply the rules of a discipline to complex real-world problems; systems understanding (know-why), deep knowledge of the web of cause-and-effect relationships underlying a discipline; and self-motivated creativity (care-why), the will, motivation and adaptability for success.

An important notion discriminating between content management systems and knowledge management systems is that of the domain ontology, defined in [Becker1999] as “an explicit conceptualization model comprising objects, their definitions, and relationships among objects”. A well-defined terminology, called a taxonomy [Letson2001], is used within a particular ontology to describe the classes of objects, their properties, and relationships. Domain ontologies are important elements of knowledge management systems, quite similar to the conceptual schema of the database management model, serving to organize the knowledge of an organization. Thus, the domain ontology management features of a knowledge management system directly pertain to the modelling of knowledge. We concentrate on two distinct, but compatible, views pertaining to the modelling of knowledge, represented by the seminal work of Popper [Popper1971, Popper1977] and by the generally accepted views of Nonaka and Takeuchi [Nonaka1995]. The above results directly relate to the KLC model, thus providing a base for the ensuing discussion of feature requirements for a knowledge management system.
Popper views the body of knowledge existing in an organisation as three distinct worlds, namely: (a) the first world (World 1), made of material entities: things, oceans, towns, etc.; (b) the second world (World 2), made of psychological objects and emergent predispositional attributes of intelligent systems: minds, cognitions, beliefs, perceptions, intentions, evaluations, emotions, etc.; (c) the third world (World 3), made of abstractions created by the second world acting upon the first world objects. This approach provides us with a two-tier view of knowledge:

1. Knowledge viewed as a belief is a second world predispositional object. This pertains to situations where individuals, groups of individuals, and organizations hold beliefs (subjectively considered to be true) that are immediate precursors of their decisions and actions. The predispositional knowledge is “personal” in the sense that other individuals have no direct access to one’s own knowledge in full detail and therefore cannot either “know it” as their own belief, or validate it.

2. Knowledge viewed as validated models, theories, arguments, descriptions, problem statements, etc., is a third world linguistic object. One can talk about the truth, or nearness to the truth, of such knowledge, defined as the above third world objects being closer to the truth than those held by the competitors. This kind of knowledge is not an immediate precursor of decisions and actions; it rather impacts the second world beliefs and these, in turn, impact the behaviour of the KMS actors. Such knowledge is objective, in the sense that it is not agent-specific and is shared among agents. The above characteristics bring to the forefront the issue of community validation of shared knowledge.

Looking at the above two distinct categories of knowledge, we may conclude that the third world knowledge is the principal product of a knowledge management system. The knowledge of the individuals in a social organisation, on the other hand, is not produced by the system alone, although it may be strongly influenced by interaction with the objective knowledge represented by the third world abstractions.


The importance of the widely recognized distinction between tacit and explicit knowledge, first introduced by Polanyi [Polanyi1966], is emphasized by the work of Nonaka and Takeuchi [Nonaka1995]. The principal idea is that knowledge is created by interaction between tacit and explicit knowledge, presented schematically in Figure 3. Note that the above two knowledge base models are compatible, since the tacit vs. explicit knowledge distinction corresponds closely to Popper’s subjective (World 2) vs. objective knowledge (World 3) distinction. Considering the knowledge categorisations and transformations from the organizational knowledge point of view, constituting the principal knowledge management perspective, we view the following aspects of the model as crucial from the knowledge creation process perspective:

                          To tacit knowledge     To explicit knowledge
From tacit knowledge      Socialisation          Externalisation
From explicit knowledge   Internalisation        Combination

Figure 3. Four processes of knowledge conversion [Nonaka1995].

1. Transformation from tacit to explicit knowledge. The process corresponds to the externalisation transformation of Nonaka and Takeuchi and to that of abstracting objective knowledge, or transformation of World 2 beliefs into World 3 objective knowledge, in Popper’s model. The process corresponds to the knowledge claim formulation in the KLC. However, in view of the KLC model, knowledge claims do not constitute “objective knowledge” until they successfully pass the knowledge validation process. Only then do the validated knowledge claims become organisational knowledge, after having been formalised and edited in the knowledge integration process of the KLC.

2. Transformation from tacit to tacit knowledge. The process corresponds to the socialisation transformation of Nonaka and Takeuchi, as well as to the sharing of “personal” knowledge through intelligent agent interactions implied in Popper’s approach. The process, although it does not create “new” organisational knowledge, may be crucial to maintaining and enhancing the competitive advantage of many creative organisations (e.g. a software company). This transformation fits into the knowledge production process of the KLC.

3. Transformation from explicit to tacit knowledge. The process corresponds to the internalisation transformation of Nonaka and Takeuchi and to the “impact” of objective knowledge on World 2 beliefs, and consequently on the organisational decision-making process, presented in Popper’s model. This transformation matches closely the knowledge operationalisation step of the knowledge integration process of the KLC. Although no new knowledge is produced at this stage, the transformation may be very important for highly innovative organisations.

We do not consider the explicit knowledge combination to be relevant to knowledge management, since either a mechanical process of external knowledge takes place through some mechanism of information categorisation, or an intelligent agent must be involved in inferring new knowledge from a combination of external knowledge artifacts. In the latter case, other transformations, namely the internalisation-externalisation path, would have to be followed. A distinction must be made at this stage between knowledge management, dealing with the above classes of structural and procedural knowledge, and information derived from information systems supporting the daily operation of an organisation. Data and results of such information systems are considered, for the sake of our


KMS feature requirement analysis, to be representations of Popper’s World 1 entities and their relationships, and are therefore considered merely objects of the KMS actors’ activities and decisions. A similar view is taken with respect to ad hoc or unstructured business processes with flows determined by the subjective knowledge of an intelligent agent, rather than by a validated artifact of objective knowledge. An artifact of objective procedural knowledge may be, for example, a formal workflow definition controlling execution of all processes belonging to a given class.

The above discussion sets the stage for an analysis of the principal feature requirements pertaining to the distinct knowledge management processes of the KLC and to the characteristics of the knowledge transformations underlying the knowledge production process. Note that the KMS features are technological categories providing a taxonomy for user functions viewed collectively as the KMS architecture and, as such, they should be discussed in the context of the knowledge management processes present in the KLC. We relate the KMS features to the knowledge management processes in Table 1.

KMS features                   Knowledge        Knowledge        Knowledge
                               Production (KP)  Validation (KV)  Integration (KI)
Domain Ontology (DO)           DO-KP            DO-KV            DO-KI
Content Repository (CR)        CR-KP            CR-KV            -
Knowledge Dissemination (KD)   KD-KP            KD-KV            KD-KI
Content Integration (CI)       CI-KP            CI-KV            -
Knowledge Security (KS)        -                -                KS-KI
Actor Collaboration (AC)       AC-KP            AC-KV            AC-KI

Table 1. Cross-reference between the KM processes and the KMS features.

The user functions clustered in the principal KMS features may play varying support roles within the knowledge management processes. Collectively, the sum of user requirements for a given principal feature, defined within the distinct knowledge management processes, represents the user requirement set for that feature. We discuss the support role semantics corresponding to the principal KMS features in Table 2. The principal KMS features serve as the basic building blocks for the reference KMS architecture presented in the ensuing section.

DO-KP: The domain ontology functionality supports: 1. The externalisation transformation by providing the KMS actor with the means for the initial knowledge codification during formulation of knowledge claims. Codification is performed on both declarative and procedural knowledge. 2. Referencing the content artifacts providing supporting evidence or providing the fact base for knowledge inference. The reference information provides a knowledge map serving as the principal access path to the content repository.

DO-KV: The domain ontology functionality supports: 1. The formal knowledge codification pertaining to the validated knowledge claims. 2. The formal specification of the models and rules supporting the knowledge claim screening and validation activities, in particular those involving complex networks of experts.

DO-KI: The domain ontology functionality supports: 1. The internalization transformation by providing means to interpret and learn from objective knowledge, as well as to find references to supporting evidence exemplified in the real-world cases comprised in the content repository. 2. The socialization transformation by providing means to find references to peer expertise and work results, including formulation of knowledge claims, thus fostering interaction between the KMS actors.

CR-KP: The content repository comprises all content artifacts, actual and virtual, that support the daily operation of an organization. In this sense, the content repository provides the principal platform of information processing support for the activities of the knowledge worker (a KMS actor that uses and/or produces knowledge). The knowledge map, provided by the KMS domain ontology, defines the structure and scope of the content repository.

CR-KV: The content repository provides the body of supporting evidence as well as the documentation means for the knowledge claim validation activities. Information comprised in the content repository may be used and processed during the normal activities of knowledge workers and may be the basis for new knowledge claim formulations.

CR-KI: N/A

KD-KP: The body of organisational knowledge, formally codified in the domain ontology and supported by information comprised in the content repository, must be accessible to the knowledge workers in order to influence their subjective beliefs and predispositions (tacit knowledge) and thus to impact their activities and decisions. The quality of systems support for this process determines the efficiency of the knowledge externalization transformation fundamental to the knowledge production process.

KD-KV: The knowledge claim validation process may heavily depend on the existing body of information, accessible through the content repository, as well as on the already validated and integrated objective knowledge pertaining to the subject domain. The validation process typically involves complex and variable interactions among experts drawing upon declarative as well as procedural knowledge. The quality of systems support, as in the case above, is of paramount importance to the efficiency of the validation process, which, additionally, must be supported by complex and flexible workflow procedures representing the procedural knowledge.

KD-KI: The dissemination functionality supports the principal facets of the knowledge integration process, namely knowledge sharing and transfer, as well as teaching and training. Both the codified objective knowledge and the supporting information must be made available.

CI-KP: Information represented in content artifacts may either be created and retained in the content repository, or may be derived from heterogeneous information sources, usually maintained by external information systems. The derived content artifacts may be stored in the repository or they may be materialized on demand by the appropriate interaction with the external source. The content integration functionality entails selection and retrieval of structured and semi-structured information, homogenization into a common content model, and derivation of semantics into the domain ontology representations.

CI-KV: Same semantics as above.

CI-KI: Same semantics as above.

KS-KP: N/A

KS-KV: N/A

KS-KI: The organisational knowledge comprised in the KMS, both in the form of the codified objective knowledge artifacts and of the supporting information artifacts, represents an important part of the intellectual capital. Hence the system integrity and privacy must be maintained.

AC-KP: Interaction of knowledge workers is the basis of socialization processes. Interaction may be spontaneous, or it may result from a more or less formally specified and supported procedure. Automatic support for such interactions may vary from typical groupware functions, such as chat rooms and messaging, to advanced ontology-based workflow procedures. An important by-product of automatic support may be the possibility to capture operational metrics characterising the knowledge production process.

AC-KV: Knowledge claim validation may entail interactions within a complex network of experts, both internal and external to the organisation, using a variety of information processing environments. As in the case above, supporting expert interaction, possibly involving also intelligent agents, may be a critical success factor of the knowledge claim validation processes.

AC-KI: Production of the objective knowledge artifacts and of the supporting content, inherent in the knowledge integration process, may require well-defined editorial procedures. Such procedures may typically be supported by automatic workflow management functionality. The requirements may vary from simple groupware-like support to complex, ontology-based workflow management environments.

Table 2. Feature roles within the knowledge management processes.

Further analysis of the KMS feature requirements in the context of the knowledge life-cycle, leading to development of Use Case models [Rumbaugh1999] to be used for design and validation of the ICONS architecture, is to be performed in the succeeding phases of the ICONS project. We believe that the above discussion provides sufficient user requirements context for the ensuing presentation of the KMS reference


architecture. The reference architecture is to provide a beacon for the further unfolding of the research and development work of the ICONS project. Within this document we use several types of knowledge. Figure 4 presents the ICONS knowledge taxonomy, while the Dictionary presents their meanings.

[Figure 4 - tree rooted at “knowledge”, with nodes: declarative knowledge, structural knowledge, procedural knowledge, knowledge-based reasoning, knowledge maps.]

Figure 4. ICONS taxonomy of knowledge.

3.2 The KMS Reference Architecture

The European KM Forum [KMForum2001, KMForum2001_D11, KMForum2001_D11a, KMForum2001_D12] is an IST project with the goal of collecting current KM practices and creating a comprehensive overview of the KM domain in Europe. The KMS reference architecture presented in Figure 5 has been developed on the basis of the current KM technologies discussed in the EKMF project reports, as well as on the KMS feature requirements identified in the preceding section.

[Figure 5 - diagram: a Knowledge Management System built of six feature clusters (Domain Ontology, Content Repository, Knowledge Dissemination, Content Integration, Knowledge Security, KMS Actor Collaboration), integrating external information sources such as business intelligence systems, databases, full-text collections, web pages, files, and legacy information systems.]

Figure 5. The Knowledge Management System reference architecture.

Table 3 presents the above feature requirements of a KMS reference architecture in tabular form.

Domain Ontology: taxonomies; conceptual trees; semantic nets; semantic data models; process graphs; time modelling; knowledge-based reasoning; hyper-text.
Content repository: files; file systems; content object repository; version control; DBMS; HSM; rendering; document management.
Knowledge Dissemination: push technology; knowledge map graphs; full text; semantic data model (SDM) nets; semantic nets; Internet/Intranet.
Content integration: XML; RDF; data bases; web pages; business intelligence; legacy information systems.
Actor Collaboration: message exchange; discussion forums; knowledge engineering; workflow management; intelligent agents.
Security: encryption; access control; authentication; electronic signature.

Table 3. Feature requirements of a Knowledge Management System.

The KMS features, grouped into six principal feature sets, represent our current views pertaining to the KM technology requirements. Some of the features are already common in advanced content management systems, referred to as corporate portal platforms, while others are the subject of on-going KMS research efforts. We discuss each of the principal feature sets in more detail in order to define reference feature requirements for the ICONS architecture presented in the succeeding section.

3.2.1 Domain Ontology features

The Domain Ontology features pertain primarily to knowledge representation, including the declarative knowledge representation features, such as taxonomies, conceptual trees, semantic nets, and semantic data models, as well as the procedural knowledge representation features exemplified by process graphs. Time modelling and knowledge-based reasoning features pertain to both the declarative and the procedural knowledge representations. Hyper-text links are considered a mechanism to create ad hoc relationships between content artifacts comprised in the repository.

Taxonomies

Taxonomies provide means to categorise information objects stored in the content repository. Categorisation classes may be arbitrary hierarchical structures grouping information objects selected by the class predicates. Class predicates are defined in the form of queries comprising information object property values, or as full-text queries comprising keywords and/or phrases. Categorisation classes are not necessarily disjoint. Dictionaries are a special class of taxonomies, also organised into hierarchical structures, which may comprise any number of categories, usually corresponding to occurring information object property values (e.g. a name directory), with the maximum number of categories equal to the cardinality of the property value domain. Automatic categorisation of information objects may also be based on arbitrary functions defined on object property values and/or content, implemented as an arbitrary analytical algorithm or a knowledge-based reasoning function. In the latter case, an inference engine provides for the actual categorisation of information objects. Analytical algorithms provide for automatic categorisation of formatted data objects, textual objects, as well as multimedia objects, such as audio, images and video frames.
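A minimal sketch of the categorisation idea described above, with categories selected by class predicates over information object property values (all category names and properties below are hypothetical, for illustration only):

```python
# Sketch of taxonomy-based automatic categorisation: each category carries a
# class predicate over information-object properties, and an object may fall
# into several (not necessarily disjoint) categories. Names are hypothetical.

categories = {
    "EU projects":     lambda obj: obj.get("funding") == "IST",
    "Reports":         lambda obj: obj.get("type") == "report",
    "Polish partners": lambda obj: "PL" in obj.get("countries", ()),
}

def categorise(obj):
    """Return every category whose class predicate the object satisfies."""
    return sorted(name for name, pred in categories.items() if pred(obj))

doc = {"type": "report", "funding": "IST", "countries": ("PL", "IT")}
print(categorise(doc))  # ['EU projects', 'Polish partners', 'Reports']
```

In a full system the predicates would be query expressions (including full-text queries) evaluated by the repository engine rather than in-memory lambdas, but the non-disjoint membership semantics is the same.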
Taxonomies provide a powerful navigation device for browsing the content repositories, since they usually represent intuitive semantics of the user information requirements.

Conceptual trees
Conceptual trees are also a categorisation device, used in conjunction with full text queries, providing means to define concepts on the basis of their hierarchical relationships with other concepts, key words, and phrases.

IST-2001-32429 ICONS Intelligent Content Management System

page 19/86

Intelligent Content Management System Feature Requirements of a Knowledge Management System

1.15 April 2002

Usually, conceptual trees allow for full text query relevance ranking. This technique allows for easy extension of the domain ontology terminology with the use of, usually abstract, concepts with arbitrarily rich semantics.

Semantic Nets
Semantic networks provide means to represent binary 1:1 relationships, expressed usually as named arcs of a directed graph, where vertices are information objects belonging to any of the information object classes. Normally, the linked object classes are determined by the binary relationship semantics of the corresponding named arc. An example of a simple semantic net may be a binary relation Descendants, defined as a subset of the Cartesian product of the set of Persons with itself. Semantic nets may be constructed over an arbitrary number of information object classes and binary relationships.

Semantic Data Models
The Unified Modelling Language (UML) [Rumbaugh1999] is the currently prevailing specification platform for semantic data models, allowing for definition of structural as well as behavioural semantics. Class Association Diagrams provide easy-to-read, intuitive semantics closely matching the mental models of the KMS users. The UML-based knowledge representation, in order to be useful, must be supplemented with a navigation facility allowing the user to traverse the network of specified object associations and to view/retrieve the corresponding object sets.

Hyper-text links
The hyper-text links support referential link semantics that may exist among information objects belonging to arbitrary object classes existing in the content repository. The ad hoc character of hyper-text links, where usually no schema-level information exists, limits their usefulness as a knowledge representation feature. However, they are a useful annotation tool to express, possibly transient, referential relationships of information objects stored in the content repository.
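The semantic net representation described above, named arcs between object instances, can be sketched as a labelled directed graph; the Descendants arcs and person names below are hypothetical:

```python
# Sketch of a semantic net as named arcs (labelled edges) between object
# instances; the Descendants relation mirrors the example in the text.
# The instance names are hypothetical.

# Each arc: (source object, relationship name, target object).
arcs = [
    ("Anna", "Descendants", "Jan"),
    ("Jan", "Descendants", "Piotr"),
    ("Maria", "Descendants", "Piotr"),
]

def neighbours(node, relation):
    """Follow all arcs of the given relationship name from a node."""
    return [t for s, r, t in arcs if s == node and r == relation]

def reachable(node, relation):
    """Transitive closure along one relationship name from a node."""
    seen, stack = set(), [node]
    while stack:
        for nxt in neighbours(stack.pop(), relation):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

The `reachable` function corresponds to navigating the SN-graph transitively from an entry object instance.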
Time modelling
Time represented in domain ontologies, as well as in the content repository, conveys important information. Time-valued properties may be important elements of search and automatic categorisation operations. Hence, formal representation of time is of paramount importance for knowledge descriptions and content characterisation. Problems that exist today are related to the lack of a standard representation of time instances and periods, incompatible time scales and granularities, as well as periodicity definitions. Precise rules must be established as to the representation and treatment of temporal properties to be comprised in a knowledge management system. Time modelling is also an important element of procedural knowledge representation. CPM-like (Critical Path Method) techniques have been proposed for representation of time constraints and for optimisation of process execution times in advanced workflow management systems.

Knowledge-based reasoning
Knowledge-based (k-b) reasoning systems may be built for a wide range of decision-making problems. The reasoning is based on a collection of facts, usually represented by content property values, and heuristics represented as rules. The prevailing paradigms are production rules (forward and backward chaining), logic programming, and neural nets (reasoning about quantitative data). The k-b reasoning may be used for expert knowledge representation, knowledge and content categorisation and distribution, as well as for intelligent agent implementation. Intelligent workflow management is a new application area for k-b reasoning, both for process routing and for dynamic role modification.

Process graphs
Business processes are usually represented by process graphs, typically by Event-Condition Petri Nets or by directed graphs. The Petri Net representation allows for expressing richer process semantics, in particular the pre- and post-conditions for process activities.
The process specification must also be supplemented by a set of role definitions, one for each process activity, to enable the workflow management engine to properly assign tasks to KMS actors. The process graph representation should comprise a set of process metrics and, possibly, performance constraints and exception conditions.
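The production-rule paradigm with forward chaining, mentioned under knowledge-based reasoning above, can be sketched with a minimal fixpoint loop; the facts and rules are illustrative assumptions only:

```python
# Minimal forward-chaining sketch: rules fire on known facts until a
# fixpoint is reached. The facts and rules are illustrative assumptions.

facts = {"contains_keyword_finance", "author_is_cfo"}

# Each rule: (set of premises, conclusion).
rules = [
    ({"contains_keyword_finance"}, "category_finance"),
    ({"category_finance", "author_is_cfo"}, "route_to_board"),
]

def forward_chain(facts, rules):
    """Repeatedly fire rules whose premises hold until nothing new is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

derived = forward_chain(facts, rules)
```

The second rule illustrates how such reasoning might drive content distribution or process routing: its conclusion only becomes derivable after the first rule has fired.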


3.2.2 Content Repository features
Extensible Markup Language (XML)
A lightweight, tag-oriented meta-language derived from the SGML standard and adapted to the web, providing facilities to describe and diffuse structured documents over the Internet. XML is also the emerging industry standard for exchange of data between information systems, as well as for storage and retrieval of complex multimedia objects in content repositories.

Resource Description Framework (RDF)
An extension of XML used to define complex relationships between documents or data, popular as the target data structure for mapping UML semantics into content repository data models. An RDF schema is used as a template to define annotations in RDF syntax.

File Systems
File systems are commonly used in multimedia content repositories to serve as containers for large content objects represented as files. The use of file systems is a convenient technique for mapping content onto diverse hardware storage devices in order to exploit their inherent characteristics; e.g. for permanent, non-modifiable storage of electronic documents an optical storage device may be used. File systems are composed into storage hierarchies, usually controlled by the content repository management software.

Hierarchical Storage Management
The hierarchical storage management (HSM) functions control allocation of storage space available in a hierarchy of storage devices to large content object files. Such systems are based on a directory of all content objects, including information pertaining to storage allocation rules and migration predicates. Content objects are automatically migrated up and down the storage hierarchy, where the top layer is the object-relational database management system, and the bottom layer may be an optical storage jukebox or a mass storage tape system.
Migration predicates usually determine content object residence time at any given storage hierarchy level and serve to fire the storage allocation rules controlling the file migration operations.

Database Management System (DBMS)
Object-relational database management systems serve as an implementation platform for the domain ontology management functions and the content management functions. Solution architectures vary, yet a typical use would be storage of all KMS directories and control blocks, representation of the domain ontology data model, and storage of content object files and attributes. Main memory relational database management systems may also be used to store frequently used ontology structures, as well as to provide a platform for the data structures representing facts in knowledge-based reasoning algorithms.

Version control
Content evolves over time. In some cases the history of content change is as important as the content itself. The versioning mechanism allows for transparent identification (incremental revision numbers) and storage (either full versions or increments) of particular versions of content and content object properties. Access schemes pertaining to multi-user access problems are a closely related subject.

Rendering
Content is held within the repository in a variety of native formats; therefore the content can also be viewed or edited in the tool that originally created it. However, a uniform web-based browser interface requires rendering facilities that present all formats in a consistent way. Renditions include HTML and XML, as well as PDF and other well-known formats.
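The HSM migration predicates described earlier in this subsection, based on content object residence time per storage tier, can be sketched as follows; the tier names and thresholds are illustrative assumptions:

```python
# Sketch of HSM migration predicates: objects move down the storage
# hierarchy once their residence time exceeds a per-tier threshold.
# Tier names and day thresholds are illustrative assumptions.

TIERS = ["database", "file_system", "optical_jukebox"]

# Migration predicate: maximum residence days before demotion, per tier.
# The bottom tier has no limit, so objects stay there indefinitely.
MAX_RESIDENCE_DAYS = {"database": 30, "file_system": 365}

def next_tier(current_tier, residence_days):
    """Return the tier the object should migrate to, or its current tier."""
    limit = MAX_RESIDENCE_DAYS.get(current_tier)
    if limit is not None and residence_days > limit:
        return TIERS[TIERS.index(current_tier) + 1]
    return current_tier
```

In a real system the predicate would fire storage allocation rules performing the actual file movement; here it only names the target tier.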

3.2.3 Knowledge Dissemination features
Push Technology
Push technologies, providing facilities for automatic supply of selected content objects to a predefined group of recipients (a role), who are usually the KMS actors (knowledge workers, intelligent agents), are the best approach to combat the information glut. The push technologies are strongly correlated with such knowledge representation features as automatic content categorisation and knowledge-based reasoning.


Content Object Properties
Content object properties characterise the principal object attributes, such as object identifier, origin, author(s), date, etc., as well as provide information, usually in the form of key words, characterising the content. The latter type of properties is usually obtained at the object creation (storage) instant, through automatic content analysis and categorisation, or through a manual content object description process (e.g. description of an ancient manuscript image). Either way, the content object properties provide a convenient access path for content repository queries, taxonomy structure allocations, and for materialisation of content object relationships.

Full Text
Full text indexing and retrieval is a classical approach to content management. The full text retrieval techniques, used in conjunction with conceptual trees, are commonly used in automatic categorisation features. Often content object property values are automatically obtained through a full text search-based categorisation process.

Knowledge Map Graphs
Multi-level taxonomy trees, semantic nets and content object associations are usually represented as graphs on the user interface level. This fits nicely with the user's mental model of the domain ontology structure and its relationships with the underlying content object model. Because of the substantial scope and complexity of knowledge maps, advanced graph construction and manipulation techniques must be employed to provide the required ergonomic level of the KMS user interface. The knowledge map graphs are used, usually in a query mode, for navigation within the semantically meaningful structures and for browsing the associated content.

Semantic Nets
Graphic representation of semantic nets (SN-graphs), although quite straightforward, must be supplemented by manipulation functions supporting traversal, SN-graph node visualisation/retrieval, and SN-graph selection (entry).
SN-graphs, representing a given semantic net class implementation, may either be materialised dynamically or, usually in the case of complex association functions and large scope, may be cached as persistent ontology structures. Transient storage and off-line semantic net materialisation techniques may be used to achieve the required KMS performance levels. Note that SN-graph navigation typically occurs at the content object instance level, where an SN-graph arc represents a 1:1 content object relationship.

Semantic Data Model Nets
SDM net graphs (SDM-graphs) are envisaged as a representation of the UML graphic conceptual model notation. Here, content object class nodes represent subsets of the corresponding content object instances, constrained by the class associations used for navigational selection. Hence, navigation, list manipulation, visualisation/retrieval, and SDM structure entry functions are necessary to exploit the rich semantic potential of navigation on the content object class level. Note that, as opposed to the SN-graph navigation presented above, SDM-graph navigation yields subsets of content object instances at each visit to a corresponding SDM-graph node. The only similarity is the SDM-graph selection, effected as selection of the entry content object instance (e.g. a particular Person occurrence).
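The contrast drawn above, SDM-graph navigation yielding instance subsets rather than single objects, can be sketched like this; the class, association and instance identifiers are hypothetical:

```python
# Sketch of SDM-graph navigation: visiting a class node through an
# association yields a *subset* of instances, not a single object.
# The association "wrote" and the instance identifiers are hypothetical.

# Association "wrote": Person instance -> set of Document instances.
wrote = {
    "person:1": {"doc:a", "doc:b"},
    "person:2": {"doc:b"},
}

def navigate(entry_instances, association):
    """From a set of entry instances, collect all associated instances."""
    result = set()
    for inst in entry_instances:
        result |= association.get(inst, set())
    return result
```

Starting from a single entry instance (the SDM-graph selection) still returns a set, which may then be navigated further through other associations.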

3.2.4 Content Integration features
All entities, regardless of their character (structural, procedural), participating in the content integration process must be accessible via the knowledge map graph, or via other existing access paths to the content repository. Any of the integrated content objects, constrained by the corresponding descriptions of the content repository schema, may either be physically stored in the repository as a content object (snapshot, refreshable), or may be dynamically materialised at reference time. Usage of the above integration modes should be entirely transparent to the KMS user.

Files
Files feature among the candidates for content integration, due to the widely diffused usage of file systems as repositories of large, multimedia content objects. Little, or no, analysis of the multimedia object content, apart from the automatic categorisation analysis, is performed during the integration process.

Data Bases
Heterogeneous databases are a typical source of data for content integration. Multi-database query and integration techniques, as well as the homogenisation of heterogeneous data models, are the underlying technologies. The most straightforward cases entail querying a single database to materialise the required content to be further exploited in the KMS context, either as an element of a content object stored in the repository, or as a virtual content object materialised on-the-fly.


Business Intelligence Systems
Data warehouses and OLAP systems deliver relevant knowledge content that should be integrated into the KMS environment. The BIS-generated content may be integrated into repositories as elements of content objects, or may be delivered dynamically.

Legacy Information Systems
Similarly, legacy information systems are a source of content that may be relevant to the KMS users. Selected legacy system reports may be accessible as content objects, or as their elements, via the KMS content repository.

Intelligent Agents
Intelligent agent (IA) technology is a rapidly growing area of research and new application development. Applications of IA technologies in the KMS context are discussed in [Baek1999]. The definition of an intelligent agent proposed by IBM [IBM1995] states that an intelligent agent is “a software entity that carries out some set of operations on behalf of a user or another program with some degree of independence or autonomy, and in so doing, employs some knowledge or representation of the user’s goals or desires”. The IA technologies are clearly useful and applicable in the KMS context, meeting two broad functionalities: that of a personal assistant, or that of a communicating/collaborating agent. In both roles the intelligent agents are relevant as knowledge-based support for the content integration features.

Document Management Systems
Document management systems are a particular class of legacy information systems, providing a rich content infrastructure directly relevant to the KMS users. Electronic documents and image-based information are typically integrated into the KMS content repositories as principal factual knowledge artifacts. In some KMS architectures the document management functionalities are subsumed by the KMS features.

Web Pages
Paradoxically, genuine knowledge is often well hidden in the enormous volumes of data available on web pages.
Therefore ever more intelligent and flexible mechanisms are to be developed in the area of external knowledge acquisition and, what is even more important, keeping the acquired knowledge up-to-date. Interoperability of systems and the ability to choose the best offered content are of primary importance.

3.2.5 Actor Collaboration features
Message Exchange
Instant messaging, relevant to the socialisation process (tacit to tacit knowledge transformation), is an important vehicle supporting the knowledge production process. Hence, the KMS functionality should provide a platform for a semi-disciplined exchange of electronic messages that may subsequently be categorised and stored in the content repository. Some collaboration metrics, similar to the activity measures used in e-learning systems, may also be usefully applied to management of the knowledge production process.

Discussion Forums
Discussion forums are the electronic equivalent of the water cooler or cafeteria discussions that have long been recognised as vital knowledge production activities. Again, relevant and valuable statements and comments should be categorised, stored in the content repository and measured (e.g. attributed to the originating sources).

Knowledge Engineering
Knowledge-based reasoning applications and intelligent agents require analytical support to glean expert knowledge from individuals (outstanding knowledge workers). The process of obtaining the expert knowledge required to build knowledge-based (or expert) applications, traditionally called knowledge engineering, requires specific methodologies and tools for formal knowledge representation. Such tools may coincide with the knowledge representation paradigms used, both for declarative and procedural knowledge, within a specific KMS environment.

Workflow Management
The workflow management technology is an important platform supporting both the knowledge management processes of the KLC and the business processes of organisations. In the latter case, application of the workflow technology provides insight into the organisation's operations that is an important feedback into the knowledge production process.
In fact it may be argued that, in the case of organisations where knowledge management is an explicit management function, the KLC process may be considered to belong to the realm of


business processes. We believe that keeping the above distinction may be advantageous in the evaluation of alternative KMS architectures viewed as the enabling platforms for KLC-driven knowledge management processes. Distinct workflow management paradigms have been discussed in [Swenson2001, Eder2001, Stader2001]. It has been pointed out that substantially different application requirements pertain to production business processes, which today represent the principal realm of workflow management applications, than to the knowledge worker (also called information worker) processes, and to project-oriented activities such as development of a new product. In the two latter cases, pertaining directly to the knowledge production processes, a workflow management paradigm substantially different from that of the Workflow Management Coalition [WfMC1994] is desirable. Indeed, it has been shown in [Stader2001] that an intelligent, ontology-based workflow management platform is required to support development of complex new industrial products. It is an open question what degree of interaction should be present between the KMS workflow management processes and the classical workflow management supporting the business processes of an organisation. It may very well be that, as in the case of the document management technology, the diverse workflow management paradigms will be reconciled and consequently integrated into the KMS environment.

Internet/Intranet
The web technologies already prevailing in advanced content management systems are paramount to the KMS architectures due to several important factors. First of all, application of the web paradigm removes an important initial barrier between the user and the KMS functions (premise: all educated people use the Internet). Secondly, the cost of ownership, particularly high in large, distributed organisations in the context of complex KMS architectures, may be kept under control.
Since any useful KMS must constantly scout the content resources available on the Net to be integrated, as well as publish information relevant to the organisation's partners and customers, the Internet orientation of the system architecture is a must.

3.2.6 Knowledge Security features
The relevance of the knowledge security features is as obvious in the case of a KMS as in the case of any information system with an architecture open to the Internet. As a result, any practical KMS must integrate such security features as electronic signatures, encryption, access control and user authentication. Our research is not oriented towards adding value in this particular field and, in fact, the use of security features is identical to that in other information systems. Hence, we shall not elaborate the subject of knowledge security any further.


4. Architecture of the Intelligent CONtent management System (ICONS)
4.1 The ICONS architecture specification
The ICONS schematic architecture model is presented in Figure 6. Consistently with the ICONS project goal and objectives, we are aiming at developing a complete ICONS prototype to be demonstrated and verified in a realistic application environment. We propose to adopt an integration strategy combining existing modules, existing modules to be expanded, and newly developed modules to provide building blocks for the ICONS architecture. Such an approach allows us to keep the ICONS project scope under control and to obtain research and development results adding value to the selected technological fields representing the project focus (marked with thick border lines).

Figure 6. The ICONS architecture schematic model.
The project technological areas are discussed in more detail below. We concentrate on the ICONS project primary technological areas, providing only cursory information, representing our view of the technological environment prerequisites of the project, for the secondary technological areas. We assume that our research efforts will concentrate on the ICONS modules that are planned to be developed from scratch, whereas the specification and development work will also comprise the extension efforts planned for the existing functional modules to be adopted.

4.1.1 Development Technologies
Development technologies comprise modules providing basic functionalities and development tools required for web-oriented software development. All of the modules comprised in this technological area are to be adopted “as is” into the ICONS project.


Since no budget has been planned for acquisition of development software licences, preference will be given to “open source” software tools. Detailed specification of the technological requirements with respect to the ICONS modules, comprised in the DT area, will be provided in deliverable [ICONS D5].

4.1.2 Content Management Technologies
The premise of the ICONS project is not to develop solutions in technological areas where a mature commercial technology already exists. Such an approach allows us to realistically plan to achieve the project results on time and within budget. The detailed specification of technological prerequisites will be presented in deliverable [ICONS D5]. We present our current views on the technological requirements with respect to the CM technological area in order to allow for a complete overview of the ICONS architecture to be presented in this section. One principal requirement, due to the necessity of developing extensions of the Content Management modules, is that all software is to be available in source form and with the appropriate licence to modify it.

Content Repository Manager
The Content Repository Manager (CRM) provides an implementation platform for an XML-based, object-oriented content repository, controlled by an enhanced RDF schema and comprising complex XML objects with embedded multimedia objects. The structure of the repository objects is determined and controlled by the DTD statements comprised in the RDF schema. Objects respond to methods implemented in Java classes, each principal class corresponding to an XML object class. Object class inheritance is supported. The embedded multimedia objects are stored as files and their location is managed by the hierarchical storage management functions.

Content Semantic Model Manager
Selected fields of XML objects, as well as the contents of text-oriented multimedia object types, are used for the construction of auxiliary data structures comprising relational database tables, relational database indices, as well as full text search engine indices. These auxiliary structures serve to support the representation of content semantics with the use of such structural constructs as binary N:M relationships, N-ary N:M relationships with attributes, taxonomy hierarchical trees, and dictionaries.
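One of these constructs, a named binary N:M relationship, might be kept roughly as sketched below; the table layout and function names are assumptions for illustration, not the actual CSMM interface:

```python
# Rough sketch of a named binary N:M relationship kept as an auxiliary
# table of (source, target) pairs. The layout and the link/targets
# names are illustrative assumptions, not the actual CSMM API.

relationships = {}  # relationship name -> set of (source_id, target_id)

def link(name, source, target):
    """Record one pair of a named N:M relationship."""
    relationships.setdefault(name, set()).add((source, target))

def targets(name, source):
    """Property-based selection: all targets linked from a source."""
    return {t for s, t in relationships.get(name, set()) if s == source}

# A document may cite many documents, and be cited by many: N:M.
link("cites", "doc:1", "doc:2")
link("cites", "doc:1", "doc:3")
link("cites", "doc:4", "doc:2")
```

In the actual system such pairs would live in relational tables with indices, so the same selection could be answered by an SQL query.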
Operations on the auxiliary storage structures are available to application programmers creating new content repository objects through the CSMM application programming interface (API). All structural semantic constructs are named and are used to reflect the application semantics to be implemented in the Content Repository. The auxiliary data structures are also used to support property-based selection operations as well as full text search operations.

Workflow Manager
The Workflow Manager supports web-oriented business processes, providing standard access to task lists and process execution information via Internet browsers. The process semantics meet the Workflow Management Coalition [WfMC1994] requirements, with some enhancements in the area of dynamic role modification (roles are sets of potential candidates to execute a specified task within a business process).

Hierarchical Storage Manager
The Hierarchical Storage Manager (HSM) provides functionality to manage the allocation of storage space, and the subsequent tracking functions, for the multimedia objects stored in the Content Repository. Hence, the Content Repository storage space extends from the object-relational database (objects stored as BLOBs), through an arbitrary path of file systems, to optical or tape mass storage devices. Object migration is performed automatically, triggered by pre-specified events, according to migration predicates defined by the Content Repository administrator.

External Content Integrator
The external content integration functions accept any schema-compliant XML input, as well as results of predefined parametric queries and procedures developed as Content Manager applications. Such objects are called “integration objects” and they are treated as first-class objects with respect to taxonomies and structural semantics constructs.
Integration objects may be materialised and subsequently stored in the repository, usually taking the form of a report file, or they may be created dynamically as transient objects in response to user-specified parameter values.
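The two integration modes for such objects can be sketched as follows; `run_query` and its parameters are hypothetical stand-ins, not the actual External Content Integrator interface:

```python
# Sketch of the two integration modes for "integration objects": a
# snapshot materialised once and stored, vs. a transient object
# materialised at reference time. run_query and its parameters are
# hypothetical stand-ins for a predefined parametric query.

def run_query(params):
    """Stand-in for a parametric query against an external source."""
    return {"report_for_year": params["year"]}

class SnapshotObject:
    """Materialised and stored at creation time (e.g. a report file)."""
    def __init__(self, params):
        self.content = run_query(params)

class TransientObject:
    """Materialised anew on every reference to the object."""
    def __init__(self, params):
        self.params = params

    @property
    def content(self):
        return run_query(self.params)

snapshot = SnapshotObject({"year": 2001})
transient = TransientObject({"year": 2001})
```

To the KMS user both objects expose the same `content`, which is the transparency requirement stated in section 3.2.4.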


Role Manager
Roles are subsets of the Content Management System users defined by common access rights and operation permissions, as well as by execution rights within specified workflow processes. Roles may be defined by role predicates or by enumeration, and they may be modified on the basis of the processing history.

Content Schema Definition Environment
The Content Schema defines the data model of the Content Repository, including both the XML object structure and the auxiliary data model created to represent the content structural semantics and to facilitate the selection operations. The RDF schema is additionally annotated with system-defined tags or tag parameters to assign internal significance to selected XML document fields. The XML schema also provides the structural information for generation of the Electronic Form processing functions.
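Role definition by predicate or by enumeration, as described under the Role Manager above, can be sketched as follows; the user attributes and role names are illustrative assumptions:

```python
# Sketch of roles defined either by enumeration or by a role predicate
# over user attributes; the attributes and role names are illustrative.

users = [
    {"name": "alice", "dept": "legal", "clearance": 2},
    {"name": "bob", "dept": "sales", "clearance": 1},
    {"name": "carol", "dept": "legal", "clearance": 3},
]

roles = {
    # Defined by enumeration:
    "administrators": lambda u: u["name"] in {"alice"},
    # Defined by a role predicate on user attributes:
    "reviewers": lambda u: u["dept"] == "legal" and u["clearance"] >= 2,
}

def members(role):
    """Evaluate a role definition against the current user population."""
    return {u["name"] for u in users if roles[role](u)}
```

Because predicate-based roles are re-evaluated against current attributes, membership can change with the processing history without editing the role itself.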

4.1.3 Knowledge Management Technologies
Ontology Model Manager
The Ontology Model (OM) comprises the formal knowledge representation pertaining to a particular application domain (hence we interchangeably use the term domain ontology), expressed as declarative or procedural knowledge. The declarative knowledge may formally be represented by structural knowledge representation constructs, such as SDM relationships or Semantic Net links, or by rules supported by an inference engine. The OM Manager is to provide functions to create, maintain, and use the knowledge representation structures, and to make those functions available to other KMS modules.

Structural Knowledge Navigator
The Structural Knowledge Navigator (SKN) is to provide an ontology structure manipulation language, available in the form of an API to developers of other pertinent ICONS modules, providing the navigation and selection facilities that support the graphic object selection and graph navigation features available to ICONS users on the HCI level. The relationship and object link structures are to be defined in terms of link predicates, so the actual navigation is based on dynamically materialised object sets.

Content Categorisation Engine
Content categorisation of text files and other multimedia objects is gaining increasing importance in knowledge management systems. The current automatic categorisation methods are based on evaluation of property values with straightforward SQL-like queries, and on full text queries supported by appropriate full text indices constructed on-the-fly by full text search engines.
In general, the content categorisation engines processing formatted (electronic form) or text data address the problem using algorithms to: (i) select words from text that should be used for indexing, (ii) look for close matches to personal names, company names, product names, or places, (iii) extract data from formatted tables or forms, and (iv) search for words that regularly appear in the same context and therefore may be related. In the case of image data, algorithms already exist that search image catalogues, provide face matching facilities, fingerprint identification, or medical image analysis. We are looking at candidate algorithms, open source solutions, or software products to be potentially integrated into the ICONS architecture.

Datalog Inference Engine
The Datalog Inference Engine is to be based on the DLV system developed by CIES, to be accordingly modified and interfaced to the ICONS architecture. DLV is a deductive database system, based on disjunctive logic programming, which offers front-ends to several advanced KR formalisms. Disjunctive Datalog combines databases and logic programming. For this reason, DLV can be seen as a logic programming system or as a deductive database system. In order to be consistent with deductive database terminology, the input is separated into the extensional database (EDB), which is a collection of facts, and the intensional database (IDB), which is used to deduce facts. An in-core relational DBMS is to be used to host the extensional database, to be materialised as a persistent or transient Content Object comprising the corresponding disjunctive logic programme as one of its methods. Execution of the logic programme on the basis of the EDB structure comprised in the Content Object will be materialised as the in-core relational database structure.
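The EDB/IDB separation described for DLV can be illustrated with a tiny, plain (non-disjunctive) Datalog program evaluated naively to its least fixpoint; this is a didactic sketch, not the DLV system itself:

```python
# Naive evaluation of a plain (non-disjunctive) Datalog program,
# illustrating the EDB (facts) / IDB (deduction rules) separation
# described for DLV. A didactic sketch only, not the DLV system.

# EDB: extensional database, a collection of ground facts.
edb = {("edge", "a", "b"), ("edge", "b", "c")}

# IDB rules, transitive closure:
#   path(X, Y) :- edge(X, Y).
#   path(X, Z) :- path(X, Y), edge(Y, Z).
def step(db):
    new = set(db)
    for pred, x, y in db:
        if pred == "edge":
            new.add(("path", x, y))
        if pred == "path":
            for p2, y2, z in db:
                if p2 == "edge" and y2 == y:
                    new.add(("path", x, z))
    return new

db = set(edb)
while True:  # iterate to the least fixpoint: no new facts derivable
    nxt = step(db)
    if nxt == db:
        break
    db = nxt
```

In the architecture described above, the `edb` set would correspond to the in-core relational EDB structure comprised in a Content Object, and the rules to its disjunctive logic programme method.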
IST-2001-32429 ICONS Intelligent Content Management System
page 27/86
Intelligent Content Management System Architecture of the Intelligent CONtent management System (ICONS)
1.15 April 2002

Intelligent Workflow Manager
Extending workflow applications beyond the realm of classical production-level business process support, into the realm of knowledge workers' activities and large project control, requires extension of the current workflow engine capabilities. The possible directions point at such WfMC architecture enhancements as the application of knowledge-based techniques, in conjunction with advanced time modelling capabilities, to process routing problems and to optimal workload allocation problems. Workload allocation problems, in conjunction with the development and maintenance of reliable process metrics, may be solved with the use of knowledge-based techniques.
Semi-structured Content Integrator
Semi-structured information, such as XML (possibly with RDF annotations) and HTML pages and the associated multimedia objects, usually downloadable as files, represents a wealth of content that may be directly relevant to a knowledge management system. Such information as competitive content, financial and commercial data, news reports, etc., should be directly accessible via the KMS content repository. Such objects should be associated with the repository content via the structural knowledge representations (relationships, links) as well as through taxonomy trees. Mapping the semi-structured objects into a predefined schema representing the corresponding content repository object classes may present a serious structure homogenization problem, in particular in view of the variety of representations used for the same entities in different, highly volatile Web information sources. The knowledge-based wrapper technology may provide one of the promising areas for developing solutions to the above problem. We propose the development of a class of intelligent agents, to be called intelligent content integrators, to solve the above problem.
Intelligent Agent Development Environment
Intelligent agents serving as personal assistants and/or communicating/collaborating agents are an important KMS technology. A framework for the specification and development of knowledge-based agents is to constitute an integral part of the ICONS knowledge representation architecture. At this point we expect that the logic programming reasoning features will provide an important ingredient of the knowledge-based IA solution.

4.1.4 Human Computer Interaction Technologies
HCI Personalisation Engine
Sound personalisation facilities already exist in advanced Web content management systems, called corporate portal platforms, with some of the software already available in "open source" form. We plan further enhancements of the existing technology, principally based on two technical areas: (i) advanced logging facilities for the KMS user activities, and (ii) knowledge-based analysis of user activities in conjunction with dynamic profiling of the user preferences. Personalisation should focus, apart from the preferred layout of the user interface frames (pages), on assisting the user in exploiting the complex ontology structures.
Electronic Form Manager
Electronic forms (EF) are ubiquitous in content management systems, in particular in the Web content management area, as a means to create, update and search content objects. An outstanding problem, in particular pertaining to Web-oriented solutions, is the specification and enforcement of complex integrity constraints on the HCI level. This is the potential area of the EF enhancement research and development to be undertaken in the ICONS project.
Content Presentation Manager
Content presentation pertains to displaying, usually in an Internet browser, the multimedia content objects comprised in the KMS content repository. Standard viewer technologies exist, with most of the current content management systems using products of a few global suppliers of the viewer technology. An appropriate interface is to be developed to accommodate selected viewer technologies, and no further enhancements are planned within the scope of the project.
Knowledge Map Graph Manager
The Knowledge Map graphs are primarily composed of multi-level taxonomy trees providing navigational, entry-level access to the complex ontology structures combining the taxonomy trees and the structural knowledge graphs.
The problem lies in representing large, nested tree structures in a user-friendly graphical way and in providing easy-to-use navigational facilities. The HCI level navigation is to be implemented with the use of the Structural Knowledge Navigator API.
Structural Knowledge Graph Manager
The structural knowledge graphs, representing the SDM nets and the Semantic Nets, also represent a hard problem from the point of view of HCI-level presentation and manipulation. The structural knowledge graph navigation is considered of paramount importance in communicating the semantics of, and in providing the navigational access to, the content repository objects. Intuitive, user-friendly structure navigation, and the result list manipulation, is the cornerstone of the successful ICONS HCI environment. The HCI level navigation is to be implemented with the use of the Structural Knowledge Navigator API.
Process Graph Manager
The Process Graph Manager is to provide the following principal functionalities: (i) graphical design and consistency checking of the intelligent workflow process graphs, and (ii) a graphic interface for monitoring the state of a particular process instance. All principal process parameters and control data should be accessible via the graphic interface.

4.1.5 Distributed Architecture Technologies
Load Balancing Algorithms
Load balancing algorithms should control device/media allocation to the active, i.e. process, elements of the distributed ICONS architecture. Distribution may include the ICONS functional modules, or selected processes of such modules, as well as the application object classes. The object-oriented architecture of the system, both on the ICONS and on the application software levels, lends itself well to distribution in peer-to-peer as well as hierarchical computer system architectures. Load balancing is important to system performance, due to the diffused use of processor-intensive knowledge-based techniques in the ICONS modules.
Distribution Optimisation Algorithms
Distribution optimisation pertains to the static elements of the ICONS architecture, i.e. to control data structures, domain ontology structures, and content object structures. Optimisation of the device/media allocation, with the possible replication of the above data structures, may be an important technique for efficient system implementation.
Scalable Distributed Data Structure (SDDS)
An SDDS system should provide the principal distribution platform for the selected static elements of the ICONS architecture. The system must be adapted to the ICONS requirements, possibly to support distribution of the In-Core relational DBMS module.
Distributed Workflow Communication
Communication among workflow processes, managed by a common or by different workflow platforms, is currently the subject of research and standardisation work of the Workflow Management Coalition task groups [Hayes2001]. XML-based messaging protocols are proposed as the means to transfer process information among heterogeneous platforms. Messaging standards are to be implemented in the ICONS Intelligent Workflow Manager and experimented with in the ICONS distributed architecture environment.

4.2 The ICONS architecture vs. the KMS reference architecture
The goal of the ICONS project is to develop and demonstrate a KMS prototype meeting most of the feature requirements generally accepted for such systems. We have discussed the KMS reference architecture in section 3 in the context of user requirements identified within the principal streams of knowledge management research. We shall now show that the proposed ICONS architecture addresses most of the feature requirements defined in the KMS reference architecture. We relate the ICONS modules to the KMS reference architecture features in cross-reference tables Table 4 through Table 8, one for each principal feature of the KMS reference architecture. We do not discuss the Knowledge Security principal feature, since it clearly lies outside of the project terms of reference and is considered a ready-to-use development technology. We only consider the ICONS focus technology modules, assuming that all the auxiliary technologies will be used as required and appropriately modified or enhanced as indicated in the ICONS architecture discussed in the preceding section.

[Table 4 matrix: rows are the focus-technology modules of the Knowledge Management area (Ontology Model Manager, Structural Knowledge Navigator, Content Categorisation Engine, Datalog Inference Engine, Intelligent Workflow Manager, Semi-structured Content Integrator, Intelligent Agent Development Environment), the Human Computer Interaction area (HCI Personalisation Engine, Electronic Form Manager, Content Presentation Manager, Knowledge Map Graph Manager, Structural Knowledge Graph Manager, Process Graph Manager), and the Distributed Architecture area (Load Balancing Algorithms, Distribution Optimisation Algorithms, Scalable Distributed Data Structure (SDDS), Distributed Workflow Communication); columns are the Domain Ontology features: Concept Trees, Semantic Nets, Taxonomies, Time Modelling, Knowledge-Based Reasoning, Hypertext, Process Graphs, SDM. The R/S/D cell assignments are not recoverable from this layout.]

R – research work; S – specification work; D – development work. Note that the work type notations indicate the starting point of the project effort, i.e. R means that research work is necessary and that, if successful, it will naturally be followed by the specification (S) and development (D) efforts.

Table 4. The ICONS focus technological area modules and the Domain Ontology features cross reference
All Domain Ontology features are addressed, with most of the work starting at the research level. Development pertains to enhancements of the adopted content management functionality to be utilized in the project.

[Table 5 matrix: rows are the same focus-technology modules as in Table 4; columns are the Content Repository features: XML, RDF, DBMS, File System, HSM, Version Control, Rendering. The R/S/D cell assignments are not recoverable from this layout.]

Table 5. The ICONS focus technological area modules and the Content Repository features cross reference
Most of the Content Repository features are outside the ICONS project focus technological area, and they are to be supported by the content management platform to be selected as the baseline development environment.


There is some adaptation work to be performed with respect to the existing electronic form management, content presentation functions, and version control functions to meet the emerging new requirements of the XML and RDF standards. Research will be performed in the area of hierarchical storage management, where distribution optimisation algorithms could substantially enhance the HSM functionality and performance.

[Table 6 matrix: rows are the same focus-technology modules as in Table 4; columns are the Knowledge Dissemination features: Semantic Nets, SDM Nets, Knowledge Map Graphs, Full Text, Content Object Properties, Push Technology. The R/S/D cell assignments are not recoverable from this layout.]

Table 6. The ICONS focus technological area modules and the Knowledge Dissemination features cross reference.
The main thrust of the research effort to be undertaken in the area of Knowledge Dissemination will be directed towards advanced graphic user interfaces to represent the knowledge map nested taxonomical trees and the structural knowledge graphs. The remaining work will focus on adaptation of the existing content management functions.

[Table 7 matrix: rows are the same focus-technology modules as in Table 4; columns are the Content Integration features: Databases, Files, Document Management, Intelligent Agents, Legacy IS, Web Pages, Business Intelligence Systems. The R/S/D cell assignments are not recoverable from this layout.]

Table 7. The ICONS focus technological area modules and the Content Integration features cross reference
The research and specification work in the area of Content Integration will pertain to semi-structured content integration, which may be used to extract information out of the Web and, possibly, document management information resources. Intelligent agent technologies are candidates for the formatted data integration, mainly from pre-existing databases and files, and from legacy information systems or business intelligence systems.

[Table 8 matrix: rows are the same focus-technology modules as in Table 4; columns are the Actor Collaboration features: Knowledge Engineering, Workflow Management, Internet/Intranet, Message Exchange, Discussion Forum. The R/S/D cell assignments are not recoverable from this layout.]

Table 8. The ICONS focus technological area modules and the Actor Collaboration features cross reference.
The major research interests of the ICONS project in the area of KMS agent collaboration pertain to the intelligent workflow management and to the intelligent agent (IA) technologies. Some enhancement of the existing content management technologies is planned to provide support for the knowledge engineering features.


5. The ICONS Knowledge Representation Features
5.1 Requirements for Knowledge Management (KM)
Current syntactic approaches to searching for information and, in its broadest sense, knowledge over networks have proved useful for many applications, most conspicuously in applications using the Internet. However, they do not retrieve the semantic content of documents, and are therefore poorly suited to automated access and analysis. Semantics are needed if we wish to retrieve facts and other knowledge: they can be used for shared practical problem solving by several agents (computers or people), and they support concatenation of knowledge with knowledge from elsewhere. Knowledge representation (KR) and extraction techniques are at the centre of these knowledge management requirements; in particular, the value of shared domain definitions and of the conceptual reasoning approach has been convincingly presented in [O'Leary 1998]. The activities of acquisition (including content-based retrieval of multimedia knowledge and information, such as images), indexing, filtering, linking, distribution and application of knowledge must be supported in ICONS. To match these requirements, the technical skeleton of ICONS is based on ontologies. Ontologies are "specifications of shared conceptualizations of particular domains". They support knowledge access, integration and mediation. The present focus is upon structured and unstructured textually-represented information, but part of our research will seek to widen this scope, with the ultimate goal being to represent and access multimedia information using semantic methods. The basic vision is of a representation and inference superstructure [Fensel, et al, 2000], based on ontologies, over distributed repositories. Three components of the structure of this layer can be discerned. The first is the provision of a formal semantics and efficient reasoning support sub-structure.
At this level knowledge is described in terms of concepts, interrelationships and roles. The specific ICONS mechanisms for this will be detailed below. The second sub-structure supplies a rich set of primitives for modelling the Universe of Discourse. No single technique is adequate for this. The main ICONS technique for this is Datalog, and the other methods outlined in the following sections will be invoked to supplement it where necessary. The third sub-component of the ICONS KM superstructure supports the sharing and co-operative usage of knowledge. In practice, within ICONS the first two components are combined to allow knowledge to be described in a disciplined manner that supports rich modelling of application domains through the use of ontologies. From this it will be possible to derive classification taxonomies. In the research stream, attention will be paid to the need for grounding of concepts and knowledge (especially that derived by data mining). The idea is to be able to support explanation of the answers supplied to users. However, our initial concern is to provide a back-bone modelling capability, and for this reason we focus on Datalog, although other techniques will be used when needed for specific functionality. As suggested earlier, (mined) knowledge has to be shared among compliant applications via an "ontology base", and used as a "content base". Hence a commonly accepted representation is required.

5.2 Syntax/Semantics
One thing above any other distinguishes syntactic and semantic manipulation of information (including conventional web searching), and it applies to next-generation knowledge management in general. Syntactic manipulation is geared to people rather than computers, while semantic manipulation is intended to bring structure to the meaningful content of pages of information [Berners-Lee 1999]. If suitably represented, such content can be invoked and exploited by application programs. To build a semantic web, for example, requires:
• access to structured collections of information
• sets of inference rules with which to reason automatically
Sophisticated KR is required. Now, first-generation KR is centralized, although work has been done on distributed heterogeneous expert systems, for example [Zhang and Bell 1990]. Early systems were also 'shallow', in that compiled hindsight was recorded rather than deeper principles. A third conspicuous failing of first-generation KR was the absence of an explicit, well-understood representation of the uncertainty of knowledge. All three of these deficiencies will be addressed to varying extents in the ICONS system. A language that expresses both data and rules is required, and this makes it possible to export rules from any KR system to the web. The task of developing such a language has been simplified because much of the information we need is of the form:
• "An ancestor of a parent is an ancestor"; or
• "A truck is a kind of land vehicle, which is a kind of vehicle"
Datalog is the obvious choice for this. Three important technologies are already in existence to help in the endeavour of providing a Data/Rules language in a web context:
• XML – tags (hidden labels) can be created to arbitrarily annotate parts of pages, and thus structure them. However, the tags by themselves carry no meaning, although scripts (programs) can use them in sophisticated ways.
• RDF – expresses meaning via a triple: a thing, one of its properties, and the property's value. For example, "this web page was authored by D. Bell". Things and properties are each identified by Universal Resource Identifiers (URIs), like URLs. New properties can be added to the syntax simply by defining a URI for them somewhere on the Web.
• Ontologies – as has long been recognised by distributed database (DDB) designers, databases may use different identifiers, names and structures for a single concept, so there is a need to discover common meanings. An ontology can be a document or file that formally defines relations among terms or, more commonly, a taxonomy plus a set of inference rules. An ontology base is a collection of ontologies.
A research stream will be carried out on the content base and how it can cooperate with the ontology base to support a range of inference functionality in ICONS.
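The RDF triple model described above can be sketched in a few lines of Python; the URIs and property names below are invented placeholders, not real vocabulary terms.

```python
# A sketch of the RDF idea: statements are (subject, property, value)
# triples, with subjects and properties named by URIs. The URIs here
# are invented placeholders.
triples = [
    ("http://example.org/page1", "http://example.org/terms#author", "D. Bell"),
    ("http://example.org/page1", "http://example.org/terms#topic", "KR"),
]

def values_of(subject, prop):
    """All values asserted for a given subject/property pair."""
    return [v for (s, p, v) in triples if s == subject and p == prop]

print(values_of("http://example.org/page1",
                "http://example.org/terms#author"))  # ['D. Bell']
```

A new property needs nothing more than a fresh URI: adding a triple with an unseen property URI immediately makes it queryable, which is the low-cost extensibility the text refers to.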
The goal of this work will be to explore how to capture XML objects (metadata) out of data from external data sources using content models, which will be stored in content repositories, and how to transfer essential metadata as facts to the ontology bases for storage, using the formal knowledge representation and manipulation methods. An ontology base holds domain ontologies, each of which provides a declarative knowledge representation (Datalog; see below) including concepts and semantics, which can be exemplified by hierarchical relationships (semantic nets). It is not normally directly associated with specific applications. The theories and technologies described below will be utilized to implement the ontology bases. A content model, by contrast, may pertain directly to particular applications; it provides a generic way to represent a range of data sources as XML objects, which are metadata for the storage and retrieval of complex multimedia objects in external data sources. In ICONS, a mechanism will be developed to specify all aspects of the data transfer from content bases to ontology bases, as required by the ontology base, including the different kinds of metadata, such as orders, locations and relational structures. All of these can be represented as facts, rules, and semantics. In the ICONS context, a content base is assumed to hold a variety of content models, each of which can be represented as an XML DTD associated with external data sources. The content model determines what data is extracted and how it is ultimately represented in the XML object. A content model contains several pieces of information:
• The original data structure, in the form of a data element. For example, if the data source is a relational table, this takes the form of an SQL statement. In this way, we can use the content model to specify that data should be drawn from more than one relational table.
• The overall structure of the XML DTD.
This is in the form of the root element, which, through attributes, specifies the name of the destination root element and the name of the elements that are to represent tuples.

• The names and contents of data elements. These are contained in a series of elements. The elements include the name and attribute or content elements. These two elements designate the data that should be added and, in the case of attributes, what it should be called.
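A hedged sketch of the content-model idea above: a small model names the destination root element, the per-tuple element, and the data elements, and relational tuples are mapped into an XML object accordingly. All element names and data are illustrative, not part of the ICONS specification.

```python
# Sketch: a content model maps relational tuples into an XML object.
# The root element name and the per-tuple element name come from the
# model; column values become child elements. Names are illustrative.
import xml.etree.ElementTree as ET

model = {
    "root": "patients",          # destination root element
    "tuple_element": "patient",  # element representing one tuple
    "columns": ["id", "disease"],
}

rows = [("006", "Heart disease"), ("175", "Flu")]

def to_xml(model, rows):
    root = ET.Element(model["root"])
    for row in rows:
        tup = ET.SubElement(root, model["tuple_element"])
        for name, value in zip(model["columns"], row):
            ET.SubElement(tup, name).text = value
    return root

xml_text = ET.tostring(to_xml(model, rows), encoding="unicode")
print(xml_text)
```

In the planned pipeline, an object like this would be stored in the content repository, while selected element values would be exported to the ontology base as facts.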

The meaning of XML codes used on web pages can be defined by pointers from the pages to an appropriate ontology. More complex applications use ontologies to relate the information on a page to associated knowledge structures and information rules. The semantic web, in naming every concept simply by a URI, lets users express new concepts with minimum effort. Its unifying language also enables these concepts to be progressively linked into a universal web.

5.3 Formal foundations of knowledge representation
The prevailing approach to representing knowledge embodied in existing information resources, in particular in the web information resources, is by using metadata representing the complex information object relationships and, in some cases, inference rules. A summary and comparative analysis of knowledge management frameworks is presented in [Holsapple 1999]. A knowledge representation approach based on separately defined semantic schemes, usually expressed in special-purpose knowledge representation languages, is increasingly gaining importance. An approach based on conceptual graphs has been proposed in [Martin 2000]. Representation of procedural knowledge and specified domain knowledge is proposed in [Fensel 1998], where two separate knowledge representation languages, for procedural knowledge (P-Karl) and for logic-based inference knowledge (L-Karl), are proposed. The use of logic as a knowledge representation scheme has also been postulated in [Lambrix 1999]. Conceptual reasoning and the semantic net approach have been proposed in [Lassila 1998, Martin 2000]. Prototype system implementations and knowledge management application frameworks have been discussed in [Bassiliades 2000, Bouguetaya 2000, Chang 2001, Goeschka 2001, Hammer 1997, Knoblock 1998, Lawrence 2001]. A novel approach of integrating data mining results into the knowledge representation framework was presented in [Buchner 2000]. We now present the main formalisms to be used for KR in ICONS. The research stream of the project will seek to harmonise the use of these with Datalog methods, both for knowledge acquisition and for knowledge use.

5.3.1 Rules and uncertainty
In recent years, much emphasis has been placed on the "softness" required to model our imperfect world. One aspect of this, on which the University of Ulster has been working for many years (since the ideas of second-generation knowledge representation, e.g. distribution, deep and shallow reasoning (grounding), and uncertainty first appeared), is reasoning under uncertainty, and the implications this has for knowledge representation. This work has been based on the Dempster-Shafer theory of evidence, which we have extended to general Boolean algebras (instead of merely applying it to subsets or propositions). The hypothesis is that the disjunctive nature of DLP matches well the disjunction inherent in the relational representation outlined in Section 3.2/3.3. One aim of the ICONS project is to include uncertainty in data representations (e.g. relations) and, ultimately, after research, to include uncertainty in Datalog representation and use, and in multimedia knowledge representation.

5.3.2 Data Representation using Dempster-Shafer theory
The Dempster-Shafer Theory of Evidence [Guan and Bell 1991] is a well-accepted basis for reasoning under uncertainty. It has been applied to reasoning using both uncertain rules and uncertain evidence. A domain (frame of discernment) is a finite set of mutually exclusive and exhaustive values. Let t be a data object, ai be an attribute of t, and Dj be the domain of ai (i and j do not have to be equal). An attribute ai is a mapping from a set of data objects to a domain Dj ∪ {⊥}, where ⊥ represents an undefined value, and t.ai represents the mapped value in domain Dj ∪ {⊥}. The inclusion of ⊥ in the range of ai allows us to handle the special case where applying an attribute ai to a data object does not make sense. One major feature of the conventional relational database model is that every attribute value is atomic. In order to represent imprecise and uncertain information, we should relax this feature: instead of a single attribute value, a set of values should be allowed for the representation of imprecise data, and a probability distribution should be allowed for the representation of uncertain data.
Definition 3.1. For any attribute aj of a data object ti, let Dk denote the domain the attribute maps into, and let mij represent the mass function for attribute aj of data object ti. Then, the attribute value is
ti.aj = { ⟨d, mij(d)⟩ | d ⊆ Dk ∪ {⊥}, mij(d) > 0 }.
This definition says that a probability distribution on the power set of a domain is allowed as every attribute value (see the example illustrated in Figure 7). Note that |ti.aj| > 1 implies that ti.aj is uncertain.

Patient #   Disease
006         Heart disease (0.90), Stomach upset (0.10)
175         Flu (0.25), Pneumonia (0.64), δ (0.11)
…           …
δ represents the full domain of diseases; the implication is that 11% of our belief is assigned to ignorance.
Figure 7. Treatment relation.
This mechanism also provides a solution to the traditional problem of handling null values in databases. A null value can be naturally handled using a set. The null value is subdivided into three different cases, unknown, inapplicable, and unknown-or-inapplicable, denoted by the special strings nk, na, and nka, respectively. The string nk represents the corresponding domain D itself for an attribute; similarly, na and nka represent {⊥} and D ∪ {⊥}, respectively. Refer to [Bell, et al, 1996] for details.
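Definition 3.1 can be sketched in Python: an attribute value becomes a set of ⟨subset, mass⟩ pairs. The domain, data, and helper names below are illustrative, not part of the ICONS design.

```python
# Sketch of Definition 3.1: a mass-function-valued attribute is a set
# of pairs <d, m(d)>, where d is a subset of the domain (plus the
# undefined value ⊥) and the masses are positive and sum to 1.
UNDEF = "\u22a5"  # the undefined value ⊥
DISEASE = frozenset({"heart disease", "stomach upset", "flu", "pneumonia"})

def attribute_value(pairs):
    """Validate a mass-function-valued attribute: positive masses on
    subsets of D ∪ {⊥}, summing to 1."""
    total = 0.0
    for d, mass in pairs.items():
        assert d <= DISEASE | {UNDEF} and mass > 0
        total += mass
    assert abs(total - 1.0) < 1e-9
    return pairs

# Patient 006 from Figure 7: heart disease with mass .90, stomach upset .10.
v = attribute_value({
    frozenset({"heart disease"}): 0.90,
    frozenset({"stomach upset"}): 0.10,
})
print(len(v) > 1)  # True: the attribute value is uncertain
```

The `nk`, `na`, and `nka` null markers would correspond to putting all mass on D, {⊥}, and D ∪ {⊥}, respectively.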

5.3.3 Extended relational database model

In the conventional relational model, information is represented by set-theoretic relations, which are subsets of the Cartesian product of a list of domains D1 × D2 × … × Dn. With the data representation in Definition 3.1, which is a probability distribution on the power set of a domain (a mass function), the definition of a relation is changed to the following.

Definition 3.2. A relation (or table) T based on D1, D2, …, Dn is defined as T ⊆ G1 × G2 × … × Gn × CL, where Gi is the set of all probability distributions on the power set of a domain Di and CL = { [b, p] | b, p ∈ [0, 1]; b ≤ p }.

Each Gi corresponds to a domain, each element of which can be interpreted as a set of pairs, each pair being a focal element and its value for some mass function m. In the set CL, a pair of values [b, p] is used to represent the confidence level for each tuple in a relation T. CL will also be used as a system attribute name included in every relation. Specifically, b and p represent the bel and pls functions, respectively. For example, in the Treatment Relation of Figure 7, this could represent a doctor's opinion, which could, for example, be valued less strongly for a newly qualified practitioner than for an experienced consultant. It should be noted at this point that the CL (Confidence Level) value is not, in any way, derived from the attribute value uncertainties. It is an independent measure of the strength of the predicate represented by the tuple. In ICONS, uncertainty will be expressed, again, using "special cases" of conventional relations in a standard DBMS, and can be manipulated by supplementary (application) programs.
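The confidence-level attribute of Definition 3.2 can be illustrated with a small sketch, assuming a plain dictionary representation of tuples. Names are ours, not from the ICONS code.

```python
# Each tuple carries a confidence level CL = [b, p] with 0 <= b <= p <= 1,
# where b and p correspond to the bel and pls functions. CL is independent
# of any attribute-value uncertainty inside the tuple.

def make_tuple(attributes, bel, pls):
    assert 0.0 <= bel <= pls <= 1.0, "CL requires b <= p within [0, 1]"
    return {"attrs": attributes, "CL": (bel, pls)}

# The same diagnosis, trusted more strongly when stated by an experienced
# consultant than by a newly qualified practitioner:
consultant = make_tuple({"disease": {"flu"}}, bel=0.80, pls=0.95)
new_doctor = make_tuple({"disease": {"flu"}}, bel=0.40, pls=0.90)
print(consultant["CL"])  # (0.8, 0.95)
```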

5.3.4 Hyperrelations used for representing mined knowledge

Hyperrelations generalise the database concept of a relation, and are particularly useful for representing rules derived from data mining exercises. There exists a semilattice structure (a "more inclusive / less inclusive" ordering) on the set of all hypertuples over a domain, where hypertuples generalise traditional tuples from value-based to set-based. Hyperrelations can represent rules just as decision trees can represent rules. We hypothesise that hyperrelations can also represent semantic nets, and this will be investigated in the ICONS research stream.

5.3.5 Hyperrelations as knowledge representation

The semilattice structure on hypertuples can be used as the basis for a hypothesis space. We take a hypothesis to be a hyperrelation, i.e., a set of hypertuples. A hyperrelation can be interpreted as a disjunction of conjunctions of disjunctions of attribute-value pairs. Such a hypothesis space is much more expressive than a conjunction of attribute-value pairs or a disjunction of conjunctions of attribute-value pairs. For a dataset there is a large number of hypertuples which are consistent with the data, some of which can be merged (through the semilattice


operation) to form a different consistent hypertuple. By definition, each field in a hypertuple is a set of values. For example, the following table is a hyperrelation where, in the first row, the symptom field consists of the two alternative values "sore throat" or "high temperature".

Symptom                                   Disease
Sore throat ∨ high temperature            Flu
High blood pressure ∨ high cholesterol    Heart disease
…                                         …

Figure 8. A hyperrelation.

In ICONS, we propose to focus on those hypertuples which are consistent with the given data and cannot be merged further; they are said to be maximal. The version space is defined as the set of all such hypertuples, which is clearly a subset of the semilattice. An algorithm exists which is able to construct the version space. The implementation for the ICONS system will use conventional relational systems to represent hyperrelations, again as "special cases". These represent knowledge mined from databases.
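The semilattice merge can be sketched as a field-wise union of set-valued fields. This is our own illustration under the assumption that hypertuples are tuples of sets; it is not the ICONS algorithm for constructing the version space.

```python
# Merging two hypertuples takes the field-wise union of their set-valued
# fields; h1 is "less inclusive" than h2 when every field of h1 is a
# subset of the corresponding field of h2.

def merge(h1, h2):
    return tuple(a | b for a, b in zip(h1, h2))

def leq(h1, h2):  # the "more inclusive / less inclusive" ordering
    return all(a <= b for a, b in zip(h1, h2))

# Two ordinary tuples (singleton fields) over (symptom, disease):
t1 = (frozenset({"sore throat"}), frozenset({"flu"}))
t2 = (frozenset({"high temperature"}), frozenset({"flu"}))

h = merge(t1, t2)
# h corresponds to the first row of Figure 8:
# ({"sore throat", "high temperature"}, {"flu"})
assert leq(t1, h) and leq(t2, h)
```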

5.3.6 Metadata

Additional expressiveness for data content is supplied by relationally-specified metadata. We can store such useful information in relational format as a series of tables – e.g., in ADDSIA / MISSION [McClean 2002, McClean 2000] we used categorical tables, numerical tables, note tables, correspondence tables, etc. to tackle the heterogeneity inherent in multiple data sources. Each table can be represented as a conventional relation, and a selection from these table types will be available in ICONS.

Metadata is often described as "data about data". It has increasingly become recognised over the last few decades that such metadata must be encoded alongside data in databases so that it may be used in both a passive and an active role. We consider metadata as providing contextual and operational knowledge about the data in a broad sense, and widen the scope to cover the encoding of general knowledge. [Grossman, 1996] defines metadata as formatted, structured description elements. Metadata may be used for (1) documentation (passive) and (2) automated support (active). Metadata may contain relevant contextual information concerning issues of comparability or elaboration, even interoperability. More generally, we categorise metadata into the following roles (using database examples again for illustration):
1. for data processing, e.g., schema information;
2. for data access, e.g., locational information;
3. for data harmonisation and integration in distributed, heterogeneous environments, e.g., schema matching;
4. providing rules concerning data integrity constraints;
5. providing contextual information to aid interpretation;
6. providing information on quality;
7. providing information on costs.

Agents collaborate within an agency, using metadata concerning processing, access, fusion, rules and context. These can be regarded as forms of knowledge that are utilised by the various agents. Agents compete using metadata on quality and costs.
Thus rival agencies may offer higher quality, or lower cost, services to the user. Time representation issues are an important ingredient of the knowledge representation schemes of a wide class of content repositories. Current results in the area of temporal aspects of knowledge management are presented in [Dyreson 2000, Gregersen 1999]. A pragmatic representation of time will be included in ICONS.

5.3.7 Sharing data

Modelling primitives and their semantics together constitute a very important aspect of an ontology-based information/knowledge exchange language. The syntax of such a language must of course be formulated using existing web standards for information representation.


The knowledge representation approach based on introducing tags in HTML and/or XML objects to represent content semantics has been presented in [Dieng 2000, Ginsburg 1999, Shim 2000]. Prototype system solutions based on this approach have been presented in [Corby 1999, Raborijaona 2000]. The disadvantages of the tag-based knowledge representation approach have been discussed in [Martin 2000]. In ICONS, XML will be used as a serial syntax definition language for ontology-based information exchange. RDF / RDFS can serve the same purpose (the encoding, exchange and reuse of metadata).

The Resource Description Framework (RDF) is the emerging semantic interoperability and knowledge management standard for web information resources. The RDF standard has been exhaustively discussed in [Decker 2000a, Decker 2000b, Lassila 2000]. It provides a means of adding semantics to a document without making assumptions about its structure. RDF has the advantage of providing a standard syntax for writing ontologies, and a standard set of modelling primitives. RDF schemas (RDFS) provide a basic type schema for RDF, in which object-oriented concepts such as objects, classes, and properties can be described. Therefore, ICONS may offer two syntactical variants: one based on XML schemas and one based on RDF schemas.
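The essence of the RDF data model can be shown with a minimal sketch of ours (not an ICONS interface): statements are (subject, predicate, object) triples, so descriptive metadata can be attached to a resource without any assumption about the resource's internal structure. The resource name and property values below are hypothetical; dc:creator is the Dublin Core authorship property.

```python
# RDF statements as plain (subject, predicate, object) triples.
triples = {
    ("doc42", "dc:creator", "Witold Staniszkis"),
    ("doc42", "rdf:type", "icons:Report"),
}

def objects(subject, predicate):
    """All objects asserted for a given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("doc42", "dc:creator"))
```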

5.4 Disjunctive Logic Programming

Disjunctive Logic Programming (DLP) is nowadays widely recognized as a valuable tool for knowledge representation and common sense reasoning. DLP is, just like Datalog [Ullman 1989], a deductive database language but, as explained below, it extends Datalog's expressivity by allowing disjunction in the heads of rules. In this way, the conclusions of implications can be indefinite, which creates different possible models of reality, as shown in the examples below. In general, according to the stable model semantics, a DLP program may have several alternative models (possibly none), each corresponding to a possible view of reality. In [Eiter et al. 1997f] it has been shown that, under the stable model semantics, DLP has a very high expressive power: it captures the complexity class ΣP2. This is strictly higher than Datalog's expressive power, as it is not always possible to emulate disjunction through (non-stratified) negative rules. The use of both disjunction and constraints makes DLP a language well suited to representing and solving a wide class of knowledge-based problems, including deductive database queries, incomplete knowledge, classical optimisation problems, planning, abduction, etc., in a very simple and natural way. For the ICONS project we have selected the DLV system as an implementation of DLP. In the following, we briefly discuss the characteristics of knowledge representation with DLP and the kinds of problems it is suited for. Considering the advantages and disadvantages of this approach, we propose a way of incorporating the DLV system into the ICONS architecture, and we address the questions and research issues that arise from this choice.

Syntax and semantics

In this section, we provide a formal definition of the syntax of Disjunctive Logic Programming (DLP). For further background, see [Lobo et al. 1992, Eiter et al. 1997f, Gelfond and Lifschitz 1991].
We also provide a short informal description of the semantics; for the formal definition, see [Gelfond and Lifschitz 1991]. The main notion in DLP is the rule, which is built from variables, constants, atoms, and literals as follows. An atom is an expression p(t1, … ,tn), where p is a predicate of arity n and t1, … , tn are constants or variables. For example, supervisor(barbara, george) is an atom consisting of the 2-ary predicate supervisor and the two constants barbara and george. Similarly, one can use variables X, Y to form atoms like supervisor(X, george) and supervisor(X, Y). Strings starting with lower case letters denote constants and predicates, while strings starting with upper case letters denote variables. Such atoms or their negated versions (as in ¬supervisor(X, Y)) are called literals. Finally, a rule is a formula of the form a1 v … v an :- b1, …, bk, not bk+1, …, not bm where a1,…,an,b1,…,bm are literals and n ≥ 0, m ≥ k ≥ 0. This rule can be read as "the disjunction of a1,…,an is implied by the conjunction of b1,…, bk and not bk+1, …, not bm". Note that the D in DLP stands for the possible disjunction (i.e. logical "or") in the rules. Furthermore, we call the disjunction a1 v … v an the head of the rule and the conjunction b1,…,bk, not bk+1,…,not bm the body.


For example, the rule employee(X) :- supervisor(X, Y) can be read as "if X is a supervisor of some Y, then X is an employee", and the rule female(X) :- person(X), not male(X) can be read as "if X is a person and not male, then X is female". Finally, as an example with a disjunction and an empty body, the rule female(X) v male(X) can be read as "X is male or female". Note that, when the body is empty, we leave out the implication sign ":-" at the end. A disjunctive datalog program is a finite set of such rules.

From the definition it can be seen that many kinds of rules are possible, each representing a different kind of knowledge. When the body is empty and the rule contains no variables, we call it a fact. Facts are the representation of the extensional database, and there is a correspondence between, e.g., the rows of a relational table and DLP facts. The rules person(barbara) and supervisor(barbara, george) are examples of facts. In the ICONS project, the translation of relational and other external database data into DLP facts is of great importance.

When the head of a rule is empty, the rule is called an (integrity) constraint, as it expresses a condition on what should not occur in the model of reality. For example, the constraint :- male(X), female(X) expresses that no X can be both male and female. Integrity constraints play an important role in database systems.

When the head of a rule is empty or contains only one literal, the rule is called normal. Normal rules express the definite knowledge of implications, where instantiations of the conditions of the body lead either to a contradiction (in the case of a constraint) or to an instantiation of one literal, in other words to a fact. Consider the example rule employee(X) :- supervisor(X, Y), which, using the knowledge supervisor(barbara, george), leads to the sure conclusion employee(barbara), which is a fact.
A rule that is not normal is called disjunctive, and it expresses indefinite knowledge. The rule boss(X, Y) v boss(Y, X) v equal_worker(X,Y) :- same_team(X, Y) expresses that if X and Y are in the same team, then X is Y's boss, or vice versa, or they are equal co-workers. That means that, given the knowledge same_team(tony, beth), we cannot conclude a new fact; we have only the so-called incomplete knowledge that boss(tony, beth) or boss(beth, tony) or equal_worker(tony, beth).

The DLV system produces for every DLP program (which includes the facts, i.e. the data) zero or more possible models of reality, called answer sets. Informally, a model can be seen as a consistent set of facts which are interpreted to be true in that model. An answer set of a DLP program is built up from the constants which appear in the program, and it is closed under that program, that is, applying program rules to the facts in the set leads only to facts that are already in the set. Furthermore, answer sets of a DLP program are minimal with respect to set inclusion: there exists no proper subset that is a model closed under the program. (Note that these descriptions are informal; more precise definitions can be found in [Gelfond and Lifschitz 1991].) As an example, the program consisting of the two rules female(X) v male(X) and person(beth) has only two answer sets: the model {person(beth), female(beth)} and the model {person(beth), male(beth)}. Note that there are no answer sets introducing new constants (such as {person(beth), female(beth), male(tony)}, because tony is not mentioned in the program). Also note that the model {person(beth), female(beth), male(beth)} is not an answer set, even though it is consistent and closed under the program (recall that the disjunction female(X) v male(X) is not exclusive!), because it is a superset of one (in this case both) of the answer sets.
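For the tiny beth program, the answer sets can be found by naive enumeration. The sketch below is our own illustration (DLV does this far more efficiently) and assumes the program has been grounded over its single constant beth; a positive disjunctive program's answer sets are exactly its minimal models.

```python
# Brute-force answer sets of a ground positive disjunctive program:
# a model must, for every rule whose body it satisfies, contain some head
# atom; answer sets are the models minimal under set inclusion.
from itertools import combinations

atoms = ["person(beth)", "female(beth)", "male(beth)"]
# rules as (head_atoms, body_atoms); both example rules have empty bodies
rules = [
    (["person(beth)"], []),                # the fact person(beth)
    (["female(beth)", "male(beth)"], []),  # ground instance of female(X) v male(X)
]

def is_model(m):
    for head, body in rules:
        if set(body) <= m and not (set(head) & m):
            return False
    return True

def subsets(xs):
    for k in range(len(xs) + 1):
        for c in combinations(xs, k):
            yield frozenset(c)

models = [m for m in subsets(atoms) if is_model(m)]
answer_sets = [m for m in models if not any(n < m for n in models)]
print(len(answer_sets))  # 2: {person, female} and {person, male}
```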
Applications

Note that the language of DLP programs is declarative: one need not provide the DLV system with a procedure for finding the matching answer sets; it suffices to tell the system which rules it should obey. Combined with DLP's high expressivity, this allows for a human-understandable description of complex problems. Compared to standard query languages like SQL, which handle only so-called local queries, DLP provides a much more powerful mechanism, with which it is possible to answer questions about the structure of relations, such as reachability and 3-colorability, of which we discuss examples below. In this section we show how DLP allows us to represent and solve a large variety of problems in a simple and highly declarative way. In particular, we concentrate on the following three classes of problems: deductive database queries, incomplete knowledge, and search problems.

Deductive database

A typical deductive database query (inexpressible in SQL) is the transitive closure of a (binary) relation. As an example, consider the classical reachability problem: given a directed graph G, determine all pairs of nodes (a,


b) of G such that there is a (directed) path from a to b. Using edge(a,b) to denote the fact that there is an edge from node a to node b, the encoding of this problem is the following recursive program:

edge(a,b)
edge(a,c)
...
reach(X,Y) :- edge(X,Y)
reach(X,Y) :- edge(X,Z), reach(Z,Y)

In other words, one can reach Y from X if there is an edge from X to Y, or if there is an edge from X to another node Z from where Y can be reached. Finding relatives in a family relation defined by the predicate parent is an example of the usage of the reachability query.

Incomplete Knowledge

Besides database queries, DLP is suitable for representing common sense reasoning. The following is a simple example of how DLP enables the treatment of incomplete knowledge. Consider this situation:
! we have seen Michael with a broken arm, but we do not remember which one;
! we know that Michael is used to writing with his left hand, so Michael is able to write if his left arm is not broken.

The problem is to decide whether Michael can or cannot write. Because of the uncertainty due to our incomplete knowledge about Michael's arms, we cannot answer definitely. However, we can trace two different scenarios:
• "Michael's left arm is broken, so he cannot write";
• "Michael's right arm is broken, so he can write".

This situation can briefly be represented by the following disjunctive logic program:

PMichael = { r1: left_arm_broken v right_arm_broken. ; r2: can_write :- not left_arm_broken. }

What PMichael represents is very intuitive. It has two models: M1 = {left_arm_broken, not right_arm_broken, not can_write} and M2 = {not left_arm_broken, right_arm_broken, can_write}. M1 and M2 are the two possible meanings of the problem, and match the scenarios we wanted to represent. Note that it is possible to represent this situation even through a normal logic program (i.e. without disjunction), simply by replacing the rule r1 with the two rules {r'1: left_arm_broken :- not right_arm_broken ; r''1: right_arm_broken :- not left_arm_broken}. It is easy to see that this second variant (with so-called unstratified negation instead of the disjunction) makes the program less intuitive.

Search Problems

Another class of problems that can naturally be represented and solved by DLP is that of search problems. To this end, we show how the Guess&Check paradigm supports a highly declarative problem representation. The power of disjunctive rules allows one to uniformly express problems which are even more complex than NP, over varying instances of the problem, using a fixed program (i.e., a fixed program containing variables that works on any possible input). Given a set FI of facts that specify an instance I of some problem P, a Guess&Check program P for P consists of the following two parts:

Guessing Part: the guessing part G ⊆ P of the program defines the search space, in such a way that the answer sets of G ∪ FI represent "solution candidates" for I.

Checking Part: the checking part C ⊆ P of the program tests whether a solution candidate is in fact a solution, such that the answer sets of G ∪ C ∪ FI represent the solutions for the problem instance I.


In general, we may allow both G and C to be arbitrary collections of rules in the program, and it may depend on the complexity of the problem which kind of rules are needed to realize these parts (in particular, the checking part). Without imposing restrictions on which rules G and C may contain, in the extreme case we might set G to the full program and let C be empty, i.e., all checking is moved to the guessing part such that solution candidates are always solutions. This is certainly not intended. However, in general the generation of the search space may be guarded by some rules, and such rules might be more appropriately placed in the guessing part than in the checking part. We do not pursue this issue any further here, and thus also refrain from giving a formal definition of how to separate a program into a guessing and a checking part. For a number of problems, however, it is possible to design a natural Guess&Check program in which the two parts are clearly identifiable and have a simple structure:
! the guessing part G consists of a disjunctive rule which "guesses" a solution candidate S;
! the checking part C consists of integrity constraints which check the admissibility of S, possibly using auxiliary predicates which are defined by normal stratified1 rules.

Thus, the disjunctive rule defines the search space2, in which rule applications are branching points, while the integrity constraints prune illegal branches. As an example which matches this scheme, let us consider the well-known 3-Colorability problem.

3COL: Given a graph G=(V,E) in the input, assign each node one of three colors (say, red, green, or blue) such that adjacent nodes always have different colors.

3-Colorability is a classical NP-complete problem. Assuming that the set of nodes V and the set of edges E are specified by means of the predicates node (which is unary) and edge (binary), respectively, it can be encoded by the following Guess&Check program:

r: col(X,r) v col(X,g) v col(X,b) :- node(X).    } Guess
c: :- edge(X,Y), col(X,C), col(Y,C).             } Check

The rule r nondeterministically guesses color assignments for the nodes in the graph, and the constraint c checks that these choices are legal, i.e., that no two nodes which are connected by an edge have the same color3. More precisely, suppose that the nodes and edges of the graph G are represented by a set F of facts with predicates node and edge. Then the ("guessing") rule r states that every node is colored either red or green or blue, while the ("checking") constraint c forbids the assignment of the same color to two adjacent nodes. The answer sets of F ∪ {r} are all possible ways of coloring the graph. Note that the minimality of answer sets guarantees that every node has only one color. If an answer set of F ∪ {r} satisfies the constraint c, then it represents an admissible 3-coloring of the graph. There is in fact a one-to-one correspondence between the solutions of the 3-coloring problem and the answer sets of F ∪ {r,c}. The graph is thus 3-colorable if and only if F ∪ {r,c} has some answer set, and each of the answer sets of F ∪ {r,c} represents a (different) legal 3-coloring of G.

The problem 3COL is a popular example of an NP-complete problem. We next show that an even harder problem, located at the second level of the polynomial hierarchy, can be encoded in a straightforward way in DLP. To this end, we consider the following problem, Strategic Companies.

1 For a definition of stratification, see [Apt et al. 1988].
2 In some cases it would be possible to replace the disjunctive guessing rule by rules with unstratified negation. However, this is not possible in general. Disjunctive rules also have the advantage of being more compact and usually also more natural.
3 In this example, we assume that G contains no loops, i.e., edges from a node to itself. Such loops can be easily handled by adding X ≠ Y to the constraint.
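The Guess&Check encoding of 3COL can be mirrored by a brute-force analogue in Python (our own illustration on a toy graph; DLV achieves the same with the two rules r and c): guess every total colour assignment, then check the integrity constraint on the edges.

```python
# Guess&Check by exhaustion: guess a colour per node, prune assignments
# where two adjacent nodes share a colour.
from itertools import product

nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("a", "c")]  # a triangle
colors = ["r", "g", "b"]

def colorings():
    # guessing part: every total assignment of one colour per node
    for assignment in product(colors, repeat=len(nodes)):
        yield dict(zip(nodes, assignment))

def legal(col):
    # checking part: the integrity constraint prunes equal-coloured edges
    return all(col[x] != col[y] for x, y in edges)

solutions = [c for c in colorings() if legal(c)]
print(len(solutions))  # the triangle has 3! = 6 legal 3-colourings
```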


STRATCOMP: Given the collection C of companies owned by a holding, together with information about the products each company produces and company control, compute the set of the strategic companies in the holding. Let us recall from [Cadoli et al. 1997] what a “strategic company” is in this context. Each company in the holding is producing a collection of goods, such that the holding produces a collection of goods G which consists of all goods produced by its companies. Company control information models that a set of companies D ⊆ C jointly may have control (e.g., by majority in shares) over another company c ∈ C. (Companies not in C, which we do not model here, might have shares in companies as well). The company control information in STRATCOMP lists records of such control information in terms of “controlling sets” D for “controlled” companies c. Note that, in general, a company might have more than one controlling set, and only non-redundant controlling sets (i.e., no proper subset is a controlling set) are recorded then. Now, some companies should be sold by the holding, while the following two conditions have to be maintained: 1. After the transaction, the remaining set of companies C’ ⊂ C still allows one to produce all goods. 2. No company is sold which would still be controlled by the holding after the transaction, i.e., if D is a controlling set for c ∈ C and D ⊆ C’ holds, then also c ∈ C’ holds. A set C’ ⊆ C is called a strategic set, if it is minimal with respect to inclusion, that is, it satisfies both (1) and (2), and no proper subset of C’ satisfies both (1) and (2). In general, the strategic set is not unique, and multiple solutions for C’ exist. A company c ∈ C is called strategic, if it belongs to at least one of these strategic sets. Computing the set of all strategic companies is relevant when companies should be sold, as selling any company which is not strategic for sure does not lead to a violation of any of the conditions (1) and (2). 
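Before turning to the DLP encoding, conditions (1) and (2) and the minimality requirement can be checked by brute force on a toy instance. The instance below (companies a, b, c; two goods; one controlling set) is entirely our own and exponential in cost; it only illustrates the definition.

```python
# Enumerate subsets C' of the companies, keep those that still produce
# every good (1) and are closed under control (2), take the minimal ones
# (the strategic sets), and collect the strategic companies.
from itertools import combinations

companies = {"a", "b", "c"}
producers = {"pasta": {"a", "b"}, "wine": {"b", "c"}}  # good -> producing companies
controls = [({"a"}, "c")]  # the set {a} controls company c

def ok(sub):
    produces_all = all(sub & ps for ps in producers.values())    # condition (1)
    closed = all(not d <= sub or c in sub for d, c in controls)  # condition (2)
    return produces_all and closed

subsets = [frozenset(s) for k in range(len(companies) + 1)
           for s in combinations(sorted(companies), k)]
candidates = [s for s in subsets if ok(s)]
strategic_sets = [s for s in candidates
                  if not any(t < s for t in candidates)]
strategic = set().union(*strategic_sets)  # member of at least one strategic set
print(sorted(strategic))  # ['a', 'b', 'c']
```

Here the strategic sets are {b} and {a, c}, so all three companies are strategic even though {b} alone covers both goods.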
This problem is ΣP2-hard in general [Cadoli et al. 1997]; reformulated as a decision problem ("Given a particular company c in the input, is c strategic?"), it is ΣP2-complete. To our knowledge, it is the only KR problem from the business domain of this complexity that has been considered so far. We next present a program which solves the complex problem STRATCOMP in a surprisingly elegant way with just two rules:

r: strat(Y) v strat(Z) :- produced_by(X,Y,Z).                          } Guess
s: strat(W) :- controlled_by(W,X,Y,Z), strat(X), strat(Y), strat(Z).   } Constraint

Here strat(X) means that X is strategic, produced_by(X,Y,Z) that product X is produced by companies Y and Z, and controlled_by(W,X,Y,Z) that W is jointly controlled by X, Y and Z. We assume that a set of facts for company, controlled_by and produced_by is part of the input, and we have adopted the setting from [Cadoli et al. 1997], where each product is produced by at most two companies and each company is jointly controlled by at most three other companies (in this case, the problem is still ΣP2-hard). The answer sets of the program together with the encoded facts correspond one-to-one to the strategic sets of the holding. Thus, the set of all strategic companies is given by the set of all companies c for which the fact strat(c) is true under brave reasoning. In fact, it is possible to encode the same problem with the Guess&Check paradigm, in the same shape as the previous example. For details, see [Eiter et al. 2000].

Strategic Companies is a good example of the kind of complex knowledge a user of a knowledge management system may want to extract from the repository. Along the same lines, one could think of personnel allocation and management problems, which could be solved by similarly straightforward programs. Further examples can be found in [Eiter et al. 1997f].

DLV system in the ICONS architecture

As is clear from the above examples, the enhancement of a knowledge management system with DLP techniques is a major innovation, as a number of complex problems can be solved that are not solvable (or even expressible) within existing traditional systems. Data from the repository can be used within the DLV system by translating the relational data model into data as modelled by DLP (that is, facts). The DLV system already provides tools that use SQL queries to extract the needed data. Still, the incorporation of the DLP techniques within the ICONS project is not entirely


straightforward, as some complexity issues have to be taken into account. Tests of the DLV system show that its strength lies in solving complex problems on reasonable amounts of data. Because the system does not have its own internal DBMS, it does not deal effectively with larger amounts of data. However, within the ICONS system the amounts of available and accessible data will be large. For that reason, we seek to couple the DLV system with a main-memory database system (MMDB), which would handle data management for DLV. To this end, a mapper is to be developed, which selects the needed data from the data repository and stores it in the MMDB before invoking the DLV system. Research issues are:
- For what kinds of problems is it necessary to speed up DLV by pre-selecting data?
- How do we select the needed data, given a DLP program plus, optionally, a query on the answer sets?
- Given a particular program, can we develop a mapper which selects data with the actual query (or the constants in the query, as focal points) as parameters?
- How can we prove that the selected data give the correct answer, i.e. that in all cases they yield the same answer sets as the full data would have?
- Is it possible to decrease the amount of selected data considerably (and hence increase efficiency) without losing correctness, or must greater efficiency be paid for with a loss of correctness?
- Is correctness a discrete notion for all problems, or can we think of applications where a scale of correctness would make sense? Think of optimisation problems such as the travelling salesman problem, where we may not be interested in the exactly optimal solution (which would take a lot of time), but rather in a fast solution which is, say, at least 90% optimal. Are there straightforward ways to decrease the selected data considerably while retaining a level of correctness (or optimality) that is "good enough"?
- Would it be possible to have the user choose a level of correctness (e.g. 100% or 90%) with a semantics that is easy to understand, also for the non-specialist user?

Part of the selection can be done in a quite straightforward way by selecting only the relational tables which are mentioned in the program, or by calculating the maximum needed "distance" from focal points for relations that are defined in a non-recursive manner. Another part is addressed by ongoing research, such as the research on so-called magic sets.

Usage

The integration of the DLV system as described above will allow for many different user applications. As the complexity of the problems that can be solved with DLP makes the language somewhat complicated, it may be difficult for the incidental user to use it to its full power. This is not a problem in itself; it is in the nature of any computer system that different users will use different powers of the system. In the ICONS system, less experienced users can still be offered the possibility of querying via DLV, by means of available help schemas or pre-defined queries. In that way, we can distinguish the following three ways of accessing the DLV engine, in decreasing order of required familiarity with the system. First, a direct user interface to DLV will be available, where users can construct their own DLP programs and queries, possibly enhanced with options for keeping track of the individual search history and for sharing often-used programs with others. Secondly, one can think of a shared library of expert programs that can serve as schemas to be edited for individual use by experts or other users. Experts could maintain this library, possibly in co-operation with a database expert. Thirdly, there may be often-used queries that could be ready for use without knowledge of DLP. Such queries could be implemented at the system installation phase, and could be maintained by local database experts.
As an example, one can think of regular dependency checks, as in fraud detection or testing, which could be executed at regular times (once a week, overnight), or of individual instances of problems such as dividing a set of persons into groups subject to several constraints. These kinds of settings can be generalized and made available to people who do not (yet) have much knowledge of DLP. On the other hand, this sliding scale can also be seen as a natural means of education: after having used the standard queries several times, one may try to edit an expert query, and after having dealt with several expert queries, one could be ready to write his/her own programs.
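The simplest form of the table-level pre-selection discussed above can be illustrated with a short sketch (in Python, with illustrative relation names and a deliberately crude predicate scanner; the actual ICONS mapper is yet to be designed):

```python
import re

def preselect_tables(program: str, tables: dict) -> dict:
    """Keep only the relations whose predicate names are mentioned in the
    Datalog (DLP) program text -- the simplest pre-selection strategy."""
    # crude predicate extraction: lowercase identifiers followed by '('
    mentioned = set(re.findall(r"\b([a-z]\w*)\s*\(", program))
    return {name: rows for name, rows in tables.items() if name in mentioned}

# hypothetical repository: two relations, only one used by the program
tables = {
    "parent": [("ann", "bob"), ("bob", "cid")],
    "salary": [("ann", 5000)],
}
program = "ancestor(X,Y) :- parent(X,Y). ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z)."
selected = preselect_tables(program, tables)
```

A distance-based selection around focal points, as mentioned above, would refine this by further pruning rows of the mentioned relations.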

5.5 Procedural knowledge representation features
As was stated earlier, there are several types of knowledge representation. One of them is procedural knowledge, which defines algorithms for achieving a given goal. In the context of organisations, such algorithms are called business processes. A business process defines what units of work should be performed, when, and by whom, in order to achieve a given goal, that is, to produce a product or to provide a service. Innovative, efficient and flexible business processes help an organisation to be competitive and to play a leading role in the market.

IST-2001-32429 ICONS Intelligent Content Management System

page 43/86

Intelligent Content Management System The ICONS Knowledge Representation Features

1.15 April 2002

From the repeatability point of view there are two types of business processes: repeatable and non-repeatable. The former are well-defined, mass processes; the influence of management on process control is usually rare, and changes to such processes occur seldom and are evolutionary. The latter require a high degree of flexibility and can be well defined only at a high level of abstraction. They are unique – usually they can be executed only once – and changes to them occur frequently and can be revolutionary. Business processes can be supported, partially or fully, by computer automation. Among the most popular and effective tools to support business processes are workflow management systems (WFMSs). In a WFMS the automatable part of a business process is represented as a workflow definition. According to the WfMC's meta-model defined in [WfMC2001], the main elements of a workflow definition are:
! activities – pieces of work that form logical steps within a workflow process. An activity is performed by one or more workflow participants;
! transitions – points during the execution of a process instance where one activity completes and the thread of control passes to another, which starts. A transition can have a condition, which may be evaluated in order to decide the sequence of activity execution within a workflow process;
! workflow participants – a resource set, resource (specific resource agent), organisational unit (within an organisational model), role (a function of a human within an organisation), human (a WFMS user) or system (an automatic agent) that performs activities;
! control data – representing the dynamic state of workflow instances and the WFMS (e.g. workflow definitions);
! audit data – representing the history of workflow instance execution;
! relevant data – used for the evaluation of conditional expressions, for instance those expressing transitions or participant assignments.
WFMSs enable workflows to be designed, executed, monitored and optimised. A workflow process executed for a given case is called a workflow process instance. The other elements of a workflow definition are fully described in the WfMC's workflow glossary [WfMC1999]. In the ICONS project, workflow definitions will be stored as ordinary information objects and treated as part of the organisational knowledge. Usually, in order to increase the readability of the defined workflows, a workflow definition is modelled in a graphical tool. Such a tool helps users to understand the defined processes and, during execution, to check which activities of a given process are being performed. In addition, such a tool is used to simulate and test workflow processes before their implementation at customer sites. In the ICONS project we are going to use well-known commercial workflow modelling tools such as the Aris Toolset and iGrafx. Organisations expect that implementing their business processes as workflow processes in a WFMS can help them to produce a product or to provide a service:
! of optimal quality,
! in an optimal period of time,
! with optimal resource effort,
! at optimal cost.
In this context, optimal means that something is done at the expected, or the best achievable, level with respect to the other factors of the workflow process. In order to satisfy the above factors, WFMSs should support:
! flexibility – a WFMS should be able to adapt to dynamic changes that are required during the execution of workflow process instances in order to satisfy the expected criteria. Dynamic changes can apply to all aspects of a workflow definition, such as control flow, workflow participant assignments, and time management. Dynamic workflow modifications, depending on their durability, concern workflow definitions or workflow instances. A WFMS should use statistical, heuristic and artificial intelligence techniques to modify workflow definitions or workflow process instances.
Adaptation to dynamic changes should be done on the basis of relevant data as well as control and audit data. Flexibility is especially important for non-repeatable processes, since these processes cannot be fully specified a priori, at the workflow definition stage. In ICONS we would like to implement a method of dynamic modification of control flow presented in [Aalst1999], and to extend a language for dynamic workflow participant assignments as well as control flow conditions (referred to below as WPAs and CFCs respectively). In order to increase the flexibility of the


defined WPAs and CFCs, we would like to use Datalog rules as WPA and CFC functions. The extension to the WfMC's definition of WPA has been described in [Momotko2002].

! risk management – the main aim of risk management is to avoid undesirable situations as well as to minimise the negative effects of those that have already occurred. In our opinion risk management should, at least, take into consideration such aspects of workflow as time management and task scheduling. The former is described in detail in section 7.3 and the latter in section 7.4.

As stated in [Koloupulos1995] and [Stader2001], the above requirements are not fully supported by current WFMSs and should be developed in a knowledge-based, or intelligent, WFMS. Moreover, it seems that at the moment the above features of an intelligent WFMS are not well defined in the appropriate WfMC standards. In the ICONS project we will suggest some extensions to the WfMC standards and develop a prototype to test their usefulness in practice.
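The workflow definition elements listed above (activities, transitions with conditions, participants) can be sketched as minimal data structures in Python. This is an illustrative reading of the WfMC meta-model, not the ICONS implementation; all class and field names are assumptions:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Activity:
    name: str
    participant: str  # role, human, or system performing this logical step

@dataclass
class Transition:
    source: str
    target: str
    # optional condition evaluated on relevant data to select the next step
    condition: Callable[[dict], bool] = lambda data: True

@dataclass
class WorkflowDefinition:
    activities: Dict[str, Activity] = field(default_factory=dict)
    transitions: List[Transition] = field(default_factory=list)

    def next_activities(self, completed: str, relevant_data: dict) -> List[str]:
        """Follow the transitions whose condition holds on the relevant data."""
        return [t.target for t in self.transitions
                if t.source == completed and t.condition(relevant_data)]

# hypothetical claim-handling process with a conditional split
wf = WorkflowDefinition()
wf.activities = {
    "register": Activity("register", "clerk"),
    "approve":  Activity("approve", "manager"),
    "reject":   Activity("reject", "clerk"),
}
wf.transitions = [
    Transition("register", "approve", lambda d: d["amount"] <= 1000),
    Transition("register", "reject",  lambda d: d["amount"] > 1000),
]
```

Here the transition conditions play the role of CFCs evaluated on relevant data; in ICONS such conditions are intended to be expressible as Datalog rules.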

5.6 Knowledge representation and manipulation in the graphic user interface
The ICONS Graphic User Interface (ICONS GUI) is a tool to be used by a Web application developer for the visualisation of user requests and of outputs from the ICONS data/knowledge base. The ICONS GUI cannot be separated from other issues related to the general data/knowledge base architecture. Its main role is the visualisation of data stored in the data/knowledge base. More precisely, it has to deal with the visualisation of user requests to a data/knowledge base, together with the visualisation of data retrieved from the database as the result of those requests. The interface should also allow some manipulations of the data/knowledge base, for instance altering, creating or deleting some data. Hence during ICONS GUI design we must deal with the following issues:
• A data model of the data/knowledge base that the graphic user interface will operate on.
• Stored data structures (presented at the proper level of data independence) that will be searched or manipulated during requests. The data structures must be designed at the level of algorithmic precision, as their semantic properties will be used directly by the ICONS GUI.
• A user language for data description that will give the user a view of what the data/knowledge base contains. This language can be designed at the level of a database schema (cf. CORBA IDL or ODMG ODL) or at the level of a business ontology that describes not only the structural properties of the data/knowledge base, but also some metadata related to the business domain.
• A universal API (a query language) that will allow one to perform retrievals and manipulations on the database. The API must contain not only the specification of a retrieval/manipulation language, but also the specification of the formats that will be returned by retrieval requests. These formats may be (and usually are, cf. ODMG) different from the stored data structures (though based on the same notions).
Since the results of requests will be an input to the GUI module, they must also be specified at the level of algorithmic precision.
• The ICONS GUI should contain features that will allow an application developer to customise the package for a particular application. The customisation can concern the graphical icons that will be presented to the end user, navigation or browsing paradigms (i.e. additional actions connected with a single navigation act), as well as database views that will simplify the conceptual model of the application.

[Figure 9 shows the application program using the graphic API and customization interfaces of the GUI module; the GUI module stores its state in the GUI DB and communicates with the database through an API carrying queries and manipulation requests and returning the results of requests.]

Figure 9. Architecture of the GUI module.


The general GUI architecture, including the context of its use, is presented in Figure 9. The following elements must be considered during the development:
• GUI module: generic software used by the developer of a Web application to prepare the programs through which Web end users interact with the data/knowledge base. To this end the developer uses the following interfaces:
− Customisation: parameterization of the entire GUI according to the wishes of the developer, e.g. fonts, colours, kinds of icons to be displayed, etc. The customisation may also require some (virtual or materialized) database views, which will be used by the developer for the conceptualisation of an application.
− Graphic API: this interface makes it possible to activate/deactivate particular graphical widgets (buttons, menus, pictures, input/output text fields, tables, etc.) on the Web end user's screen. The graphic API should enable the presentation of various forms of graphs at many levels of detail and with some possibilities for manipulation, e.g. changing colours to present the user's navigation in the graph.
− API to the database: used by the developer to write scripts associated with events that can occur on particular widgets. For example, clicking a button named GetCompanies means issuing the request "select * from Company" to the database. The API includes facilities to process the results of requests received from the database. These facilities are used within an application program prepared by the application developer. The results of the requests are the input to the GUI module.
• GUI DB: a database or a file storing customisation information (e.g. a palette of icons) and the current state of the interaction with a particular user (e.g. the history of operations, current results of search, views, etc.).
An important feature of the whole interface is genericity, which means flexibility, robustness and independence of any particular application domain.
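The event-to-request binding described above (e.g. the GetCompanies button issuing "select * from Company") can be sketched as follows. This is a toy illustration; the class, the dictionary-based "database" and the handler names are all assumptions, not the ICONS API:

```python
# minimal sketch: widget events bound to database requests, with the
# results fed back to the GUI module as its input
class GuiModule:
    def __init__(self, database):
        self.database = database
        self.handlers = {}  # widget name -> developer script issuing a request

    def bind(self, widget: str, script):
        self.handlers[widget] = script

    def click(self, widget: str):
        # run the developer's script; its result is the input to the GUI
        return self.handlers[widget](self.database)

# a toy "database" and the GetCompanies button from the example above
db = {"Company": [{"name": "Rodan"}, {"name": "InfoVide"}]}
gui = GuiModule(db)
gui.bind("GetCompanies", lambda d: d["Company"])  # ~ select * from Company
rows = gui.click("GetCompanies")
```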
The ICONS project architecture assumes multi-paradigm data and knowledge representation and processing. In particular, the architecture assumes (more or less explicitly) the following data models and corresponding paradigms:
• the "pure" object-oriented model,
• the relational or object-relational model,
• the XML model, including typing facilities such as DTD and XML Schema, and mapping facilities such as XSL and XSLT,
• the RDF model,
• the Rodan Portal model,
• the Datalog model and semantic networks,
• the temporal model,
• a model for process knowledge, such as the workflow model assumed by the WfMC,
• perhaps other models that will appear as results of the contributions of ICONS participants.
This variety of considered and potential models has led us to the necessity of establishing and developing a kind of canonical data model that will present a "common denominator" of the various other models. As a candidate canonical model we have chosen a variant of an object-oriented database model in the spirit of ODMG, with significant improvements: enhancing it with dynamic object roles, cleaning up its semantics, and observing principles such as object relativism, total internal identification and orthogonal persistence. The model will be equipped with a data query/manipulation API based on the query language SBQL, built in the spirit of ODMG OQL but based on fundamentally new semantic principles known as the Stack-Based Approach (SBA). Figure 10 presents the architecture of the wider context of the ICONS GUI interface, which includes the interface to the canonical model through SBQL and wrappers to databases proprietary to particular data/knowledge representation paradigms. We plan that more sophisticated mappings of source data structures into canonical objects will be possible through object-oriented virtual views built on top of SBQL queries. In effect, the ICONS GUI will be conceptually and physically isolated from particular solutions concerning the representation of data, thus allowing


the developer and user of Web applications to have a unified view of the heterogeneous data resources that the ICONS architecture will deal with. This idea is much influenced by the CORBA IIOP bus, but shifted to a higher conceptual level (i.e. the level of a query language).

[Figure 10 is a diagram: Web client user requests pass through an HTML Page Generator and a user requests processor to the application program, which uses the GUI module (graphic API, customization, GUI DB) and other APIs. SBQL queries go through the canonical-model API to an object view processor and an object query processor (SBQL DB), whose CRUD API is served by wrappers: a Rodan Portal wrapper (Rodan Portal API, RODAN Portal DB), an XML/RDF DB wrapper (XML/RDF API, XML/RDF files), and further wrappers for other APIs and other DBs/files.]

Figure 10. ICONS GUI module with interfaces to databases.
Navigation in a graph of inter-linked objects is a very attractive searching paradigm, which has so far not been sufficiently explored in the context of Web applications. We can distinguish two kinds of such navigation:
• Direct manual browsing in a graph of objects presented explicitly in graphical form. For instance, we can present the graph of connected objects on the screen, and the user is allowed to move along the named edges of this graph according to his/her wishes. Other examples of this kind of searching are navigation in a network of concepts (a semantic network), navigation in a network of HTML pages, etc.
• Manual browsing and searching in a graph presenting some data description or conceptual model of the stored data. In contrast to the previous case, where the navigation concerns explicitly visualised objects, here only some description of the objects is visualised, e.g. a UML schema. The user navigates in this schema; the effect of the navigation is the retrieval of objects that are of interest to the user.
There are several problems connected with this kind of interface:
• Size of the end-user screen: usually it is impossible to present a very big and complex graph, hence it must be presented in parts, with zooming facilities, perhaps with 3D views, and with the details of objects hidden depending on the mode or stage of searching.
• User awareness: the user can very quickly lose orientation during navigation in a complex graph, thus special graphic facilities are necessary to keep him/her aware of the current sub-goals or results of the search.
• Combining manual and predicate-based automatic navigation.
• Elliptic queries: for some kinds of navigation it would be useful for the user to omit some details of the navigation.
In the graph navigation facility we would like to combine manual browsing in a graph of associated objects, selecting objects by predicates, and collecting results in user baskets.
The idea is that the user during navigation collects interesting information within his/her personal baskets. This metaphor is illustrated in Figure 11 and Figure 12.


Figure 11. A graph of objects.
Figure 11 presents a graph in which objects (named A, B, C and D) are connected by directed edges x, y, z, t, w, v. As can be seen, we do not require the names of objects or the names of edges to be unique. Objects can store information (attributes and their values) which can be displayed for the user. The user can select starting objects for navigation through the following actions:
• Manual choice, by clicking and marking the proper objects on the screen (e.g. on the basis of their content, which can optionally be displayed).
• Introducing a name of objects and a condition on their contents.
• Taking the proper objects from his/her basket (filled in during a previous search).
After selecting the initial objects the user can navigate in the graph through named edges (selected from a menu). Suppose the user initially selects the 1st and 3rd objects A, and then uses edge y; this means that she moves to the 2nd and 4th objects B. If she then uses edge z, she moves to the 2nd object C, and if she then uses edge v, she moves to both objects D. Objects that are selected during this search we will call marked; other objects are unmarked. During this process the user is allowed to perform any actions, such as marking/unmarking objects, displaying objects, moving references to objects to her private basket, etc. The idea of the basket corresponds directly to the virtual shop metaphor. It is intended to support user awareness. A basket is a graphical element with icons representing the selected objects. A basket has a unique name. Baskets can be organized hierarchically (similarly to operating system catalogs). They are persistent structures, i.e. they are stored in the database. In this way a single search can be subdivided into many user sessions.
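The navigation step described above (moving from the marked objects along a named edge) can be sketched as a small function. The graph below is a toy example in the spirit of Figure 11, with hypothetical node identifiers:

```python
# minimal sketch of navigation along named edges, assuming each object is a
# node id and the edges are (source, label, target) triples
def navigate(edges, marked, label):
    """Move from the currently marked objects along all edges carrying the
    given label; the targets become the new marked set."""
    return {t for (s, l, t) in edges if s in marked and l == label}

# toy graph: A-objects linked to B-objects by edge y, B-objects to a
# C-object by edge z
edges = [
    ("A1", "y", "B2"), ("A3", "y", "B4"),
    ("B2", "z", "C2"), ("B4", "z", "C2"),
]
marked = navigate(edges, {"A1", "A3"}, "y")  # the B-objects
marked = navigate(edges, marked, "z")        # the C-object
```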


Figure 12. The idea of the user basket.
Each basket has a name, and the user can also assign a longer description or comment to it. The content of a basket can be presented in 3D graphics. The icons representing particular kinds of objects can differ (they could be the subject of customization), and the content of each object in the basket can be displayed. Each object in the basket should be accompanied by the following information (e.g. presented as a table): an icon representing the object, the object identifier and name, a representation of the object content, the date/time of finding the object, and any string comment (annotation) introduced by the user. An example of the content of a basket is illustrated in the following table.


Icon | Id      | Object Name | Retrieval date | Object content  | Comment
"    | 23156   | Person      | 02.01.03       | John Smith      | I have checked him yesterday.
"    | 23456   | Person      | 02.04.05       | Mike Brown      | Smart client!
#    | 766585  | Document    | 02.08.19       | Order 234527    | Currently processed, ready in 2 days
!    | 3453453 | Company     | 02.07.19       | Brainstorm Ltd. | Our best supplier.

Navigation in a graph of objects could be connected with additional options, in particular the calling of applications. For instance, if the navigation concerns a semantic network used as an intelligent searching index, then after searching within the network the user can display the corresponding objects from the database, display a Word file, go to the Web through a URL, etc. A similar idea applies to navigation in a database schema, except that a schema graph is displayed rather than the graph of objects. The schema graph should correspond to the canonical data model and to the data description of the stored data according to that model. The graph will be presented as an improved subset of UML class diagrams (or ODMG ODL), to make it correspond to the data description language assumed for the canonical object model. All the other rules of marking, collecting references to objects within baskets and calling applications should be similar to those for navigation within a network of objects described previously. An unexplored area in graphical querying concerns the paradigm known as Query By Example or Query By Forms (or simply Forms). This paradigm was extremely successful for relational databases, and we can consider applying it to object/XML bases. The basic idea of this paradigm is that the system displays an empty form based on a data description statement. For instance, it can present a DTD-based form, where the corresponding XML values are initially empty. The user fills in an empty field A in the form with a string value V (and possibly with an additional mark determining the kind of comparison). The system then fills in the rest of the form with values stored in the database where field A has the value V. This paradigm can easily be adapted for object or XML databases.
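The Query By Forms idea just described (the user fills in one field, the system completes the rest from matching records) can be sketched over flat records; the record set and field names are illustrative only:

```python
# minimal Query-By-Forms sketch: an "empty form" is a dict whose fields are
# None until the user fills some of them in
def query_by_example(records, form):
    """Return all records matching every non-empty field of the form;
    exact-match comparison only, for simplicity."""
    filled = {k: v for k, v in form.items() if v is not None}
    return [r for r in records
            if all(r.get(k) == v for k, v in filled.items())]

records = [
    {"name": "John Smith", "city": "Warsaw"},
    {"name": "Mike Brown", "city": "Paris"},
]
# empty form with only the 'city' field filled in by the user
matches = query_by_example(records, {"name": None, "city": "Warsaw"})
```

For XML documents the same idea would match filled-in element values against stored document trees rather than flat records.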


6. The ICONS Intelligent Content Integration Features
The ICONS Content Repository (ICR) is to comprise content objects (COs) representing knowledge artifacts stored and manipulated by the ICONS content management functions. The knowledge artifacts either directly represent results of intellectual work or may be derived from external information sources, such as information systems, databases and web sites. The ICONS Global Knowledge Schema (IGKS) is to include partial definitions of the content object data structures, content object methods, definitions of the content object relationships, as well as the content object taxonomies. The ICONS content objects are stored as XML documents conforming to the corresponding XML schema and comprising uninterpreted binary elements stored as files in a hierarchical memory system. The IGKS comprises meta-information pertaining to all CO classes represented in the repository, regardless of the storage and access modes used to materialize their values. An important characteristic of the ICR is its flexible data structure, partially defined in the repository schema, which escapes the traditional database requirement of a consistent and complete database schema vs. database instance correspondence. Rather, the IGKS may be treated as a guide for interpreting the structured parts of the content objects and for navigation in the CO relationship structures. On the other hand, all CO methods must be defined and implemented with the support of the object model inheritance structure, in order to provide facilities to manipulate the content object values. There are two dimensions of ICONS content distribution. The first pertains to the distribution of the system content repository, comprising the Content Base and the Ontology Base and the hierarchical storage management processes, among the ICONS servers.
The second concerns integration of external information sources, such as pre-existing heterogeneous databases, legacy information processing systems, and web information resources. The first case is addressed in chapter 8. Integration of the external information resources is to be performed with the use of the XML-based wrapper technology. Wrapper programs producing required XML documents for extracted data serving as containers for file elements are to be enriched with RDF specifications resulting from extracting semantics from database schemata of the external databases, or appending semantic information in the case of the legacy information processing system outputs. The wrapper programs will be generated in the form of Enterprise Java Bean modules including the necessary query statements. Due to the open nature of the ICR the content integration features are envisaged as natural extensions of the ICR management features and they are discussed below in the context of the repository schema as well as in the context of the repository data structure. Finally, the content integration support to be developed within the ICONS project is outlined in the final section of this chapter.

6.1 The ICONS Global Knowledge Schema
The ICONS Global Knowledge Schema (IGKS) is to comprise the structural knowledge representation features, including the partial specification of CO data structures, the definition of CO methods and inheritance hierarchies, and CO relationship bindings, as well as the knowledge map representation features to be developed as multi-level taxonomic trees. The XML schema is to provide the partial specification of the CO data structure representing an arbitrary XML document tree. The leaf nodes may represent unstructured binary objects stored as files in the ICONS hierarchical storage structure. The CO class methods are to be defined within Java classes, where a Java class corresponds to a CO class defined in the XML schema. The inheritance structure is to be specified within the Java classes. We propose that special CO methods called inference methods be defined as a triple <R, M, F>, where R is a set of Datalog rules, M is the materialization algorithm to dynamically create F, and F is the relational data structure representing facts. The inference methods are to be executed by the ICONS inference engine based on DLV [Buccafurri1998]. The CO relationship bindings, representing relationships that implement the structural knowledge meta-information, are to be specified as relationship predicates. The relationship predicates are logical expressions defined on CO properties. The CO relationship bindings are to specify binary and n-ary object relationships with arbitrary relationship cardinalities (1:1, 1:N, N:M). It is proposed that all CO relationships represented in the ICR


are materialized dynamically during the corresponding query execution. Appropriate data structures are to be developed to support efficient materialization of CO relationships. The knowledge map consists of multi-level taxonomic trees representing either the closed taxonomies based on a specified list of categories, or open taxonomies based on an arbitrary value (values) of CO properties. Taxonomies are defined as logical expressions defined on CO properties and are to be materialized dynamically. We introduce a special class of implicit taxonomies grouping content objects by CO class and CO identifier. Thus, content objects are always accessible by navigation via some taxonomy.
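Dynamic materialization of a taxonomy defined as a logical expression on CO properties, as described above, can be sketched as follows. The category names, property names and repository contents are purely illustrative:

```python
# minimal sketch: a taxonomy node is a predicate over content object
# properties; its extension is materialized on demand, not stored
taxonomy = {
    "Reports/2002": lambda co: co["class"] == "Report" and co["year"] == 2002,
    "Reports/Old":  lambda co: co["class"] == "Report" and co["year"] < 2002,
}

def materialize(category, repository):
    """Evaluate the category predicate against every content object."""
    return [co["id"] for co in repository if taxonomy[category](co)]

repository = [
    {"id": 1, "class": "Report", "year": 2002},
    {"id": 2, "class": "Report", "year": 1999},
    {"id": 3, "class": "Memo",   "year": 2002},
]
ids = materialize("Reports/2002", repository)
```

The implicit taxonomies mentioned above would simply use predicates on the CO class and CO identifier.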

6.2 The ICONS Content Repository
The ICONS Content Repository consists of two distinct, strongly inter-related parts: the Content Base and the Ontology Base. The Content Base, organized as a hierarchical storage configuration, is to store Content Objects in the form of XML documents including binary file elements. The Ontology Base is to be organized as a relational database with tables comprising selected XML object properties represented as table attributes. Appropriate relational tables are to be created for each CO class. The table attributes are to be used for attribute-based CO selection, or as arguments of relationship binding and taxonomy expressions. The XML document properties will typically represent meta-information pertaining to the contents of the included file elements. Such property values are either to be defined manually or extracted automatically from the contents of the file elements. Properties representing structural or taxonomic knowledge will be replicated in the Ontology Base. This data redundancy is introduced in order to enable the efficient manipulation of meta-information and to avoid complex data mappings during XML object manipulation operations. Content objects may either be persistent in the ICONS repository or be materialized on request during a repository user session. The life cycle of a persistent content object starts with the object create operation and ends with an explicit destroy operation. Content objects as well as their file elements may be organized in the form of version trees reflecting the content modifications taking place during the object life cycle. The transient content object classes are to be represented by class templates providing the means to specify properties of objects to be dynamically materialized during the user session. The object materialization algorithms must be implemented in the object class methods.
Transient objects may be stored in the repository for a specified period between user sessions, either as frames comprising the desired content materialization parameters, or as complete content objects. In the latter case, the content object property values and elements may be refreshed at specified intervals.

6.3 Integration of heterogeneous content sources
The integration of heterogeneous, pre-existing databases was an active research field in the 1980s and early 1990s. The collection of papers in [Hurson1994] provides a good insight into the state of the art in the area of multidatabase systems. Current research and development efforts have moved in the direction of integrating Web information resources, as shown in [Goeschka2001, Hammer1997, Knoblock1998], integrating object-oriented and multimedia databases [Chang2001], and extracting database semantics into a global dictionary [Lawrence2001]. Extracting semantic information from text-based information sources is presented in [Soderland1997]. Integrating information from legacy information processing systems, in particular dealing with the results of data mining queries, is discussed in [Buchner2000]. The emerging approach is to represent a common schema of the integrated information resources as an XML repository; the technique for extracting and representing the underlying semantics is based on the construction of wrappers that encapsulate the heterogeneity of accessing the diverse information sources. Wrappers are software modules that can transform data from a less structured representation into a more structured one. Examples of wrapper-based solutions may be found in [Hammer1997, Kushmerick1997, Sahuguet1999]. The ICONS architecture provides facilities in the form of standard interfaces to accommodate diverse wrapper technologies, ranging from Java beans including database queries and the required data mapping algorithms, to intelligent agents scanning predefined information sources for the required information. In all cases, we assume that the required data integration and mapping rules must be specified manually at ICONS application development time. Typically the integrated data will be stored as an XML content object file element with


semantics determined by the integration and mapping rules. The element meta-information may be automatically extracted and stored as XML content object properties. The bulk of our research effort will be directed towards the development of knowledge-based wrappers supporting the integration of semi-structured information comprised in XML documents, possibly enhanced with RDF semantic information. The XML technologies are the emerging information exchange standard facilitating information interchange and interoperability of web-based as well as legacy information systems. The knowledge-based wrappers will be developed as Datalog programs to be executed by the ICONS DLV module. A similar approach to the integration of semi-structured data has been reported in [Baumgartner2001].
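A minimal sketch of such a wrapper may help fix the idea: a manually specified set of mapping rules transforms a raw record from a legacy source into a common XML content object. The field names and mapping rules below are purely illustrative, not part of the ICONS design.

```python
import xml.etree.ElementTree as ET

def wrap_legacy_record(record, mapping):
    """Map a raw dict from a legacy source onto a common XML content object.

    `mapping` gives, for each target XML element name, the key under which
    the legacy source stores that value; as noted above, these rules are
    assumed to be specified manually at application development time."""
    obj = ET.Element("contentObject")
    for target, source_key in mapping.items():
        if source_key in record:
            ET.SubElement(obj, target).text = str(record[source_key])
    return obj

# Hypothetical legacy record and manually specified mapping rules.
legacy = {"TTL": "Annual report", "AUTH": "J. Smith", "YR": 2001}
rules = {"title": "TTL", "author": "AUTH", "year": "YR"}
xml_text = ET.tostring(wrap_legacy_record(legacy, rules), encoding="unicode")
```

A knowledge-based wrapper, as planned for ICONS, would replace the static `rules` table with Datalog rules executed by the DLV module.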


7. The ICONS Intelligent Workflow Features

7.1 Dynamic workflow participant assignment

As reported in [Momotko2002], a modern WFMS needs to adapt to dynamic changes; dynamic changes in workflow participant assignment (WPA) are especially important. Some of the main requirements for WPA stated by WFMS customers are:

• A WPA should be able to use control and audit data – data on finished or currently executed workflows – for example:
− a person that has the lightest workload or a minimal number of tasks to perform,
− the workflow participant that started the workflow,
− the workflow participant that performed the previous/preceding activity,
− a worker that does not have activities that have to be executed by Friday,
− a salesman that in the last week performed more than 30 workflows.

• A WPA should be able to use relevant data – processed data, the organisational structure, or other data – for example:
− a user that is defined as a tester of a given system bug,
− an employee that is the supervisor of Mr John Bean,
− a manager that is the chief of the sales department,
− a person that knows Java and XML,
− a workflow participant that has the ‘knows English’ role,
− a salesman who is responsible for the region of the customer who sent the claim.

• A WPA should be able to express the situation when the workflow participants assigned to a given activity are selected ad hoc, manually, during workflow execution.

• A WPA should be able to express organisational and functional structures, in particular the user groups that exist in an organisation.

• A WPA should be able to express the situation when exactly one workflow participant from a selected group should perform an activity.

• A WPA should be able to define a workflow participant who will perform an activity if the workflow participant assignments return an inadequate set of workflow participants (e.g. an empty set).

In order to satisfy the above requirements and to assure a high level of WPA flexibility, in the ICONS project we will use the WPAL language for defining dynamic WPA, presented in [Momotko2002]. This approach proposes an extension of the WfMC’s definition of WPA. Moreover, we consider using Datalog rules as WPA functions, and an approach of assigning intelligent agents to activities on the basis of knowledge available from ontologies; the latter approach has been described in [Jarvis1999].
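Two of the requirements above (lightest-workload selection and fallback on an empty candidate set) can be sketched as a simple assignment function. The function and data names are illustrative, not WPAL syntax.

```python
def assign_participant(candidates, workload, fallback):
    """Dynamic WPA sketch: choose the candidate with the lightest workload;
    fall back to a designated participant when the candidate set is empty
    or inadequate, as the last requirement above demands."""
    eligible = [c for c in candidates if c in workload]
    if not eligible:
        return fallback
    return min(eligible, key=lambda c: workload[c])

# Hypothetical control-and-audit data: open task counts per participant.
tasks_open = {"anna": 4, "bob": 1, "carol": 7}
chosen = assign_participant(["anna", "bob", "carol"], tasks_open, "admin")
```

In ICONS such a function would be expressed as WPAL or Datalog rules rather than hard-coded logic.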

7.2 Dynamic control flow condition definition

Similarly to the notion of WPA, we suggest defining a procedural language to express control flow conditions (CFCs). A control flow condition is a pre- or post-activity condition or a transition condition. A flow condition should be built on relevant data as well as on control and audit data. It should also use logical operators (AND, OR, NOT) and predefined functions, for example a function to check whether the activity of testing a repaired car is necessary or can be omitted. We consider using Datalog rules as such functions. In addition, it should be possible to have a library of already defined flow conditions in order to reuse them. Such a feature could reduce the cost of implementing a new workflow process. Moreover, such an approach can express optional activities. The same idea, with a different implementation, is presented in [Klingemann2000].
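A toy evaluator illustrates how CFCs combining AND/OR/NOT with a library of predefined functions could be evaluated; the condition encoding and function names below are illustrative assumptions, not the proposed CFC language.

```python
def eval_cfc(cond, funcs, data):
    """Evaluate a control flow condition given as a nested tuple tree.

    Leaves name predefined functions applied to the relevant data; inner
    nodes are the logical operators AND, OR, NOT."""
    op = cond[0]
    if op == "AND":
        return all(eval_cfc(c, funcs, data) for c in cond[1:])
    if op == "OR":
        return any(eval_cfc(c, funcs, data) for c in cond[1:])
    if op == "NOT":
        return not eval_cfc(cond[1], funcs, data)
    return funcs[op](data)  # leaf: a reusable predefined function

# Hypothetical reusable library of flow-condition functions (cf. the
# repaired-car example above).
funcs = {
    "repair_done": lambda d: d["repaired"],
    "test_needed": lambda d: d["severity"] > 2,
}
# "Repair is done AND no test is needed" => the test activity may be omitted.
cond = ("AND", ("repair_done",), ("NOT", ("test_needed",)))
```

In ICONS the leaf functions would be Datalog rules, and the library would let conditions be reused across workflow processes.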

7.3 Time management

In the ICONS project we would like to extend the idea of time management presented by Eder and Panagos in [Eder2001], [Eder1999], and [Eder1997]. In order to represent time information, they defined two basic temporal types, namely durations and deadlines. Both durations and deadlines can be defined for individual activities and for the whole workflow process. Duration is the time needed to perform a given activity/process. It can either be calculated from past workflow executions or be assigned by specialists based on their experience and expectations. The most common duration values are the minimum, maximum, and average. A deadline


corresponds to the maximum allowable execution time for an activity/process. Deadlines do not have to be assigned to every activity of a workflow process, but it is beneficial to assign deadlines to all activities. In our opinion, the above approach to time management in WFMSs seems promising. However, on the basis of our experience we think that in real workflows waiting time also has to be considered. Waiting time is the time between placing an activity in a given workflow participant’s task list and the moment when the participant begins to perform the activity. Especially for workflow participants that have many activities to perform, such time can be significant. Waiting time depends at least on the type of the performed activity, the workflow participant assigned to the activity, and the number of activities that have to be performed by the participant. Moreover, since in distributed WFMSs the time to transfer control flow between two consecutive activities (i.e. between the workflow participants that perform those activities) can also be significant, we suggest considering transfer time as well as waiting time. Transfer time depends mainly on the quality of the communication links between workflow engines. Users who define a workflow process cannot assign waiting and transfer times; these should be calculated from past/current workflow process executions.
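Deriving the common duration values from past executions, and the waiting time from audit timestamps, is straightforward; a minimal sketch (function names are illustrative):

```python
def duration_stats(past_runs):
    """Derive the common duration values (minimum, maximum, average) of an
    activity from its past execution times, as suggested above."""
    return {
        "min": min(past_runs),
        "max": max(past_runs),
        "avg": sum(past_runs) / len(past_runs),
    }

def waiting_time(enqueued_at, started_at):
    """Waiting time: from placing an activity in a participant's task list
    until the moment the participant begins to perform it."""
    return started_at - enqueued_at

stats = duration_stats([2, 4, 6])  # e.g. hours from past executions
```

Transfer time would be measured analogously, from the timestamps of control-flow hand-over between workflow engines.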

7.4 Task scheduling

In order to reduce waiting time we will adapt well-known task scheduling algorithms to WFMS requirements. In our opinion, the function used to prioritise activities should be flexible and defined in the context of a given workflow process. Such a function could use relevant application data as well as control and audit data, for example information about deadlines and durations, the cost of the resources that have to be used to perform a given activity, the significance of the activity, etc. For each type of data, an administrator of the workflow process would be able to define its importance, for example: duration violation – 10%, deadline violation – 30%, overdraft of the activity cost – 60%.
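Using the example weights from the text, such a priority function can be sketched as a weighted sum of violation indicators (the factor names are illustrative):

```python
def activity_priority(factors, weights):
    """Flexible prioritisation sketch: combine violation indicators using
    the importance weights an administrator assigns per workflow process."""
    return sum(weights[name] * float(flag) for name, flag in factors.items())

# Weights from the example above: duration 10%, deadline 30%, cost 60%.
weights = {"duration_violation": 0.1,
           "deadline_violation": 0.3,
           "cost_overdraft": 0.6}
score = activity_priority(
    {"duration_violation": True,
     "deadline_violation": False,
     "cost_overdraft": True},
    weights)
```

Activities in a task list would then be ordered by this score, highest first.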

7.5 Extensions with respect to the WfMC's workflow process meta-model

In order to disseminate the described features of an intelligent WFMS, the following extensions to the WfMC’s standards are needed:
• introduction of the WPAL language to express dynamic workflow participant assignments,
• definition of a language to represent CFCs, introducing CFC functions and the CFC reuse mechanism,
• representation of a complete model for time management.


8. The ICONS Distributed Processing Organisation

8.1 The ICONS scalable, distributed architecture

To reach practical acceptance, the ICONS goals require an especially efficient data storage and processing architecture. This condition is as difficult as it is crucial; it has prevented most ambitious projects with similar goals from becoming widely used (or used at all). The main prerequisites can be listed as follows:
1. The permanent data volume is large (many GBs) and continuously growing, because of new knowledge. Current practice shows that the growth rate can easily reach 100% per year.
2. Temporary data can have a largely unpredictable volume. Joins, transitive closures, or more complex recursive computations often lead to an explosion in the number of tuples. The selectivity of these operations may be impossible to evaluate in practice. Nevertheless, even large temporary files have to be accommodated in real time and without performance deterioration.
3. Queries have to be processed so that the response time is as independent of the data size as possible. In particular, this time cannot be a linear function of the file size.
4. Permanent data are highly valuable. They have to be reliably protected against loss and corruption, and they also have to be highly available. With the Web available anytime and everywhere, 24/7 access is today a must.
It is now well accepted that no traditional centralized architecture can meet such goals [CACM97], [Gray1996]. The capacity of a single server CPU, even a multi-CPU one or an expensive supercomputer, is eventually exceeded. Likewise, the available RAM quickly suffices for only a fraction of the data, and access to the data on disk deteriorates the response time easily by two orders of magnitude. For data sets of many GBs, the disk may overflow to the next level of the storage hierarchy, with a similar performance deterioration ratio.
The number of disk units that can be connected is also often reached rapidly in practice, and must be reached in any case when a scaling data collection is to be managed. Sophisticated data operations often use scans, whose response time at any single server is at least linearly dependent on the data size. These are the constraints that basically no research or industrial system has successfully overcome until now. Finally, a failure of the data server may entirely prohibit access to the data at best, or cause data destruction at worst; many people at the World Trade Center made a bitter experience of this kind on 11 September 2001. This state of the art, together with technological progress, has brought a new type of architecture, often termed a scalable distributed architecture (SD-architecture). Today, this framework seems the only one able to fulfil the ICONS goals and constraints; our goal is therefore to base ICONS on an SD-architecture. The keyword distributed in an SD-architecture is basically quite classical: it means that both data and processing are supported by multiple interconnected nodes. It seems reasonable to assume, and in our case it is necessary, that most of the processing nodes are linked by a high-speed network, typically a local network, most often a 1 Gb/s Ethernet these days. An important new twist is that the nodes linked by the network are mass-produced: cheaply available computers, workstations, PCs… available in large numbers. They also often pre-exist the distributed system to be built. Finally, a node’s role can largely alternate: data server, client, or application tier… All together, such configurations, proposed by prominent US researchers a while ago, e.g. from UC Berkeley [Culler1994], seem today the most efficient practical approach and, given their unbeatable price-performance ratio, perhaps the only one that is not utopian for most users.
Needless to stress, such configurations have triggered growing interest, especially in recent years, at the highest decision-making levels [President1998]. The literature has designated them as multicomputers, or as networks of workstations (NOW) [Culler1994]. More and more often, one also hears the buzzwords peer-to-peer architecture and, most recently, grid computing; finally, IBM is pushing the concept of autonomic architecture [Gibbs2002]. A distributed architecture also meets much better the goals of data reliability and high availability. Data can be mirrored or partitioned over multiple nodes. With mirroring, the unavailability of a node still leaves all data values available, i.e. provides high availability of the data through access to the mirror, at the expense perhaps of some throughput deterioration if both mirrors were regularly in use. In the case of partitioning, the unavailability of a storage node does not block access to the other parts of the collection. Redundant partitioning with parity data may further provide high availability, as mirroring does, but with a much smaller storage overhead [Litwin2000]. The keyword scalable in an SD-architecture is more novel. It appeared in the early 1990s and basically means that the performance of a data unit access should be independent of the data volume; one often talks about flat scaleup. For the constitution or scan time of a relation or file, it means that this time should be a linear function of the


size at worst; this property is often termed linear scaleup. If the scan time, or more generally an operation time, becomes too long because of the size of the data collection it operates upon, the speed-up resulting from partitioning the collection over more nodes should be linear as well. While all these goals may in theory seem wishful thinking, research has proven that they are often reachable in practice. The goal of scalability puts new requirements on distribution management with respect to more traditional architectures. Traditionally, the distribution was designed for some fixed collection of data server nodes, often called a cluster [Gray1996]. At some level of scale-up, any cluster must progressively exhaust its storage and CPU capabilities and start presenting the limitations of a centralized system, which adversely affects the goal of scalability. The new, and only, way out is that the data and processing capabilities are dynamically distributed over an appropriate collection of nodes. The collection may need to scale up in the number of nodes or, less often, scale down. Research is active these days to investigate the underlying technical issues. Probably the most advanced trend for building an SD-architecture are the techniques for scalable distributed data structures (SDDSs). This concept appeared in the early 1990s [Litwin1993] and has been actively investigated since; dozens of references are available at the CERIA Web site [CERIA]. An SDDS is a new type of data structure that dynamically partitions the application data over a collection of available server nodes. The number of servers increases with the data size, while the distribution itself is transparent to the application. The data may remain entirely in distributed RAM or at local disks. The partitioned data can also be mirrored for high availability, or provided with parity data for the same purpose. The CERIA team has widely recognized competence in SD-architectures based on SDDSs.
A number of technical papers are available at [CERIA]. Research co-operation with HP Labs in Palo Alto and IBM Almaden Research led to three US patents (see the IBM Patent Repository through http://www.ibm.com/). Recently, in March 2002, CERIA hosted an international workshop on Distributed Data & Structures (WDAS-2002). The first known prototype of an SDDS manager was also developed by CERIA; a version is available for public non-commercial download at the CERIA Web site. It allows for very large data sets in distributed RAM, with a demonstrated data unit access performance a hundred times faster than disk access. This know-how and performance should be crucial to the ICONS efficiency, and will be used by CERIA to develop the ICONS SD-architecture, which is planned to be based on SDDSs. More precisely, it should obey the following principles, which we now overview. The ICONS SD-architecture should be multi-tier. The ICONS private and permanent data should be stored at SDDS server nodes, servers in short. The application agents, whether dealing with knowledge or database management, should interact with SDDS client nodes, clients in short. The servers manage data storage (data buckets) and scalable distributed partitioning. More precisely, an overloaded server may split its bucket, evacuating a part of it, usually half of the data, to another node allocated dynamically. The main goal of this process is to keep the data for processing in distributed RAM. The corresponding performance gain with respect to disk storage (and centralized or cluster processing of scaling data) should give ICONS application data processing the leverage that previous attempts in the domain crucially lacked. The clients are not made aware of the splitting process. Each client has an image of the data distribution, not necessarily the actual one. The client uses the image to issue key queries.
Such queries address (search, insert, update, delete) data units with identifiers (keys): records, tuples… Since its image can differ from the actual distribution, the client can send a query to an incorrect ICONS server. All servers therefore have the capability to recognize such a query and forward it towards the server that could be the correct one. This process should ultimately deliver the query, possibly in at most a few hops, to the correct server, which processes it. The correct server also sends a specific message to the client, termed the Image Adjustment Message (IAM). The client uses this message to adjust its image. The image may still not be the actual one; however, at least the same addressing error should not happen twice. In addition to the key queries, the ICONS SD-architecture should support scan queries. A scan addresses in parallel all servers in some data range or, ultimately, all the servers. The processing time is then basically bound by the size of the data collection at each server, instead of the entire data size. As this size remains fixed, scalability should be largely attained, and the RAM processing speed should add up to new levels of performance in the processing of complex operations. One new problem with scans is that a client may not know all the servers it should address; hence, it may send the query to only some, but not all, of them. The servers should forward the query to those that did not get it, and the process should guarantee that each server gets the query once and only once. The client then collects the replies; there are several policies for organizing that reception to avoid overloading the client. Furthermore, the client has the choice between probabilistic and deterministic termination protocols. The former means that the client terminates


when no further reply comes after some time-out. The latter corresponds to a subsumption algorithm that guarantees that all replies have been received. The servers should also guarantee high availability. In ICONS this should be done by providing parity data to groups of servers. A group with parity data should then be able to transparently tolerate k ≥ 1 unavailable servers. The degree of protection k should scale up transparently with the collection size. These properties will be provided by a variant of erasure-correcting codes derived from the well-known Reed-Solomon error-correcting codes [Litwin2000a]. There are various choices for the message passing between clients and servers, as well as for the system architecture at each node; those will be analyzed during further work. As an overall assumption, one will use standard and popular components whenever possible. Hence, for the communication, one should use the TCP/IP stack and the faster UDP messaging, unicast and multicast, for service messages, with a dedicated flow control when needed. Likewise, multithreaded processing at each node seems the best basis as well [Diene2000]. Summing up, the ICONS SD-architecture should offer a number of novel features to accommodate stringent performance requirements. These features should allow for the practical acceptance of the project results, as performance is the key need.
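The key-addressing and image-adjustment mechanism described above can be illustrated with a toy simulation; the class and function names, and the range-based partitioning, are illustrative assumptions, not the LH*/RP* internals.

```python
class Image:
    """A view of the key partitioning: boundaries[i] is the (exclusive)
    upper key bound of server i; keys beyond the last bound go to the next
    server. A client image may be outdated with respect to the actual one."""
    def __init__(self, boundaries):
        self.boundaries = boundaries

    def server_for(self, key):
        for i, bound in enumerate(self.boundaries):
            if key < bound:
                return i
        return len(self.boundaries)

def key_query(key, client_image, actual_image):
    """Toy SDDS addressing: the client guesses a server from its possibly
    outdated image; a wrongly addressed server forwards the query, and the
    correct server returns an Image Adjustment Message (IAM)."""
    guess = client_image.server_for(key)
    correct = actual_image.server_for(key)
    hops = 0 if guess == correct else 1   # forwarding, at most a few hops
    iam = actual_image.boundaries          # IAM: lets the client adjust
    return correct, hops, iam

actual = Image([50, 100])                  # a split the client has not seen
stale = Image([100])
server, hops, iam = key_query(75, stale, actual)
```

After adjusting its image from the IAM, the client addresses key 75 correctly on the next attempt, so the same addressing error does not happen twice.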

8.2 The ICONS distributed processing optimisation and load balancing

At the gross architecture level, distributed processing optimization for a data management system traditionally relies on load balancing among the nodes and on inter-query optimization at the clients and servers [Ozsu1999]. The main reason is that at this level the semantics of a query is unknown, hence intra-query optimization can only be quite general. Inter-query optimization then consists in possibly executing one query while another query is waiting for a resource, especially a network transfer. The most widely accepted approach is to organize the client and server processing as threads manipulating queues. We adopt this approach as the basis for the ICONS SD-architecture as well, for both clients and servers. In more depth, there should be a query queue at the client where an application leaves its requests. A request consists of the query and the data, or a local pointer to them. This queue should be read by a number of threads that remains to be determined for a given client. Each thread processes a query, finds the addressed server(s), and places the query in some internal send queue. Its role temporarily ends with the request(s) to the sockets to send out the query, using UDP or TCP/IP messaging depending on the case. Other threads may continue data processing during this time, hence realizing the client-side intra-query optimization. At the server, all incoming requests are placed in the listen queue. Several threads process this queue and search or update the storage. Any data to return, as well as IAMs if any, are sent out. A thread working in pipeline mode can then be blocked while its current reply is being sent; the other threads may continue data processing during this time, hence realizing the server-side intra-query optimization. Several threads at the client listen to the network buffers and transfer the incoming replies into a reply queue as soon as possible.
In the case of an SDDS this approach is particularly useful, as a key query may be sent to one server while the reply comes from another one. Other threads explore the reply queue, match it against the query queue, and finally reply to the applications. Some processing may be in pipeline mode, making a thread block while waiting for the next data item; other replies can be processed in the meantime, hence realizing the other facet of the client-side intra-query optimization. Likewise, the servers in the ICONS SD-architecture should support a load matched to the processing capability of each server. Numerous research results, especially on load balancing in parallel DBMSs, show that processing load balancing usually follows data load balancing [Vocking2002]. Sophisticated and complex research attempts at processing load balancing by analyzing query frequency, resource consumption, etc. have not yet led to any practical acceptance. An SDDS may then allow for load balancing in at least two ways, following ideas similar to those used in parallel DBMSs, which offer hash partitioning of the application data (e.g. DB2), range partitioning (e.g. SQL Server), or both (e.g. Oracle). The most used one is hash partitioning. Well-performing hashing randomizes the data location and renders the server load naturally uniform. One can expand it into double or triple hashing with a symmetric or asymmetric record placement schema [Vocking2002]. In our case, this type of balancing translates to a scalable distributed hash partitioning scheme. An LH* type of scheme appears to be the best candidate [Litwin1996], especially since variants of this scheme are known that also provide for high availability [Litwin2000] and other properties (see [CERIA]).
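The threads-and-queues organisation described above can be sketched with Python's standard queue and threading modules standing in for the actual node architecture; the helper names are illustrative.

```python
import queue
import threading

def serve(requests, handler, workers=3):
    """Sketch of the queue-and-threads organisation: worker threads drain a
    request queue, process each query, and place the replies on a reply
    queue, so that one query can execute while another waits on I/O."""
    req_q, reply_q = queue.Queue(), queue.Queue()
    for r in requests:
        req_q.put(r)

    def worker():
        while True:
            try:
                item = req_q.get_nowait()
            except queue.Empty:
                return               # queue drained: this thread is done
            reply_q.put(handler(item))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Replies arrive in nondeterministic order; sort for a stable result.
    return sorted(reply_q.queue)

results = serve(range(5), lambda x: x * x)
```

In the real architecture the handler would perform the socket send/receive and storage search steps described in the text, with separate send, listen, and reply queues.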


Range partitioning is another common type of partitioning; it leads to an ordered collection of data. In our case, an RP* scheme appears to be the best candidate. Such schemes at present provide ranges such that each server stores about the same number of data items. As with hashing, this property usually provides good load balancing. However, with range partitioning, skewed access loads are also naturally more frequent. Consider, for instance, a range partitioning of the phone book of a region with the city as the partitioning key, where some cities have important administrative centres whose phone numbers are therefore retrieved much more often. That would lead to a higher processing load on the servers whose ranges include those cities. The solution we plan for the ICONS SD-architecture consists in a modification of the RP* schemes [Diene2000], to be selected later for the ICONS needs, so that the ranges on overloaded servers are dynamically made smaller, for instance halved. Such a decision can be made locally by each server, on the basis of some statistics compared with those from other servers. Through the splits triggered by the range change, the data items of an overloaded server spread over several servers, and the processing load re-balances accordingly. Likewise, under-loaded servers could merge. Summing up, distributed processing optimization and load balancing are complex matters. At the SD-architecture level in particular, the query semantics is unknown, so one should concentrate on inter-query optimization and data load balancing [Ozsu1999], [Vocking2002]. The ICONS solution should then rely on threads co-operating through queues, at both servers and clients, as well as on load balancing that generalizes, for the scalable distributed environment, the more traditional, widely used techniques of data partitioning.
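The range-halving idea can be sketched as follows; the representation of ranges and loads is an illustrative assumption, not the RP* protocol.

```python
def split_overloaded_range(ranges, loads, threshold):
    """RP*-style rebalancing sketch: when a server's access load exceeds
    the threshold, halve its key range, so that the subsequent splits
    spread its data items (and hence its processing load) over more
    servers."""
    out = []
    for (lo, hi), load in zip(ranges, loads):
        if load > threshold and hi - lo > 1:
            mid = (lo + hi) // 2
            out.extend([(lo, mid), (mid, hi)])   # halve the hot range
        else:
            out.append((lo, hi))                  # keep the range as is
    return out

# One hot range (load 90) and one cold range (load 10).
rebalanced = split_overloaded_range([(0, 100), (100, 200)], [90, 10], 50)
```

In the real scheme each server would decide locally, comparing its own access statistics with those of the other servers, and under-loaded neighbours could merge symmetrically.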

8.3 The ICONS distributed workflow process communication and synchronisation

One of the most challenging features of WFMSs is workflow interoperability. Such interoperability enables two or more workflow engines to communicate and work together to co-ordinate their work. There are several different models of workflow co-operation, namely: the chained process model, the nested subprocess model, and the parallel synchronised model.

Figure 13. Models of workflow co-operation.

In the chained process model, after one workflow process is completed, another workflow process inherits the processing and starts. This is the most basic model. In the nested subprocess model, one workflow process has a part of its processing done by another workflow process. In the parallel synchronised model, two workflow processes that are proceeding independently become synchronised at some point and exchange information, and


then continue independently. When an activity reaches the synchronisation point, it waits for the other to arrive there, and then they exchange information. On the basis of the WfMC’s reference model and the Interface 4 standard described in [WfMC1996], the Object Management Group (OMG) developed the JointFlow specification, which defines a framework for distributed workflow applications in the world of business objects [OMG1998]. This specification enables the interoperability of workflow process components, the monitoring of workflow execution, and the association of workflow components with the resources involved in a workflow process. In the next step, the Simple Workflow Access Protocol (SWAP) was developed; SWAP was envisioned as a binding of the JointFlow object model and the related WfMC standards to an HTTP-based interaction protocol. Finally, in 1999, the WfMC presented the Wf-XML specification. This specification enhances some of its predecessors’ capabilities, providing:
• a structured and well-formed XML body protocol, consisting of messages containing headers and data,
• a logical interaction model with synchronous, asynchronous, and batch capabilities,
• independence from transport mechanisms,
• easy extensibility through the use of XML and dynamic workflow context data.
In synchronous messaging, a process may wish to initiate a sub-process and suspend its normal processing until that sub-process completes. In asynchronous messaging, the initiating process sends a request to the enacting process. The enacting process then sends only an acknowledgement back to the initiator, informing it that the request has been received. At some later point in time, the enacting process sends a response to the initiating process, which then sends an acknowledgement back, confirming that it received the response. In batch messaging, it is possible to place multiple Wf-XML interactions in a single message.
In the ICONS project we will implement the Wf-XML specification and use e-mail to transport the XML workflow messages.
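A structured message with a header and data part, in the spirit of Wf-XML, can be sketched as follows; the element names are illustrative placeholders, not the exact Wf-XML schema.

```python
import xml.etree.ElementTree as ET

def make_wf_message(sender, receiver, operation, context):
    """Build a structured, well-formed XML workflow message with a header
    and a data body, in the spirit of Wf-XML (the element names here are
    illustrative, not the actual Wf-XML vocabulary)."""
    msg = ET.Element("WfMessage")
    header = ET.SubElement(msg, "Header")
    ET.SubElement(header, "Sender").text = sender
    ET.SubElement(header, "Receiver").text = receiver
    ET.SubElement(header, "Operation").text = operation
    body = ET.SubElement(msg, "Body")
    for key, value in context.items():
        # Dynamic workflow context data travel in the message body.
        ET.SubElement(body, key).text = str(value)
    return ET.tostring(msg, encoding="unicode")

msg = make_wf_message("engineA", "engineB", "CreateProcessInstance",
                      {"ProcessId": "claim-42"})
```

Transport independence means the same message could travel over HTTP or, as planned for ICONS, as the body of an e-mail.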


9. Demonstration of ICONS prototype capabilities

9.1 The “Newly-associated States Best Practices” Portal

9.1.1 Introduction

There is a proliferation of Web content management systems in various application realms. The integration of internal information repositories with external data sources is the current trend in the architecture of management information systems. Examples of active development in the areas of government, the energy industry, and general B2B systems are presented in [Ambite2001, Bouguettaya2001, Elmagarmid2001, Mecella2001, Shim2000]. Although the current systems are designed according to disciplined life-cycles based on various design methodologies, there exists a clear need to formulate a life-cycle and an underlying methodology for the development of large-scale, knowledge-based content management systems. Such a methodology must be substantiated by at least a pilot development of an application based on an intelligent content management system. The novelty of the ICONS project within the realm of this objective is exemplified by the following solution characteristics:
1. Specification of a prototype life-cycle and the underlying methodology for the design and development of intelligent content management system applications.
2. Demonstration of the viability of the ICONS architecture and application development methodology by developing a pilot knowledge-based content management application.
In terms of project organisation, all this corresponds to Objective 4 of the ICONS project, i.e. to develop an analysis and design methodology for large, knowledge-based content repository systems. The ICONS research results, especially those related to the ten technologies identified in Section 4 as ‘to be developed’, will be demonstrated both at the application level and at the methodological level. The planned work (WP7) includes three tasks and the corresponding deliverables: T1 -> D35, T2 -> D24, and T3 -> D25. T1 has already been started.
Indeed, D35 “Conceptual analysis of the ‘NAS Best Practices’ portal” will be developed first. D35 will contain the essential requirements for the ICONS prototype. These requirements will provide relevant “attraction” points for the technology developers active in other WPs. Of course, during the last semester of the project, these same requirements, possibly updated, will serve as the basis for the development of the prototype (D25). Another basic input will be provided by D24 “The knowledge-based content management application design methodology”. For the sake of being specific, the pilot application is described first. The specific objectives of the ICONS prototype portal and its pilot application ‘NAS Best Practices’ are: the development and publishing, for general use over the Internet, of a knowledge repository concerning procedures, management practices, and “best practice” projects funded by the PHARE, ISPA, and SAPARD funds. The knowledge repository is to contain public information to be made available to all interested parties over the Internet. [ICONS D02]

9.1.1.1 NAS

By “Newly Associated States (NAS)” are meant in fact the ten candidates for EU membership from Central and Eastern Europe (CEE), see [Enlarg-Report-2001]. These candidates are: Bulgaria, the Czech Republic, Estonia, Latvia, Hungary, Lithuania, Poland, Romania, Slovakia and Slovenia. “This year’s Regular Reports and the present stage of the accession negotiations do not yet allow the Commission to conclude that the conditions for accession are fulfilled by any of the candidate countries. Among the twelve negotiating countries, ten have target dates of accession compatible with the Göteborg timeframe.”

Footnote 4: “The Copenhagen political criteria continue to be met by all presently negotiating candidate countries. Turkey still does not meet these criteria.” (Political criteria / Conclusions of [Enlargement-Rep2001].)

IST-2001-32429 ICONS Intelligent Content Management System

page 60/86

Intelligent Content Management System Demonstration of ICONS prototype capabilities

1.15 April 2002

“The Union should therefore be prepared to conclude accession negotiations by the end of the Danish Presidency in 2002, in view of accession in 2004, with all countries meeting the necessary conditions. Necessary administrative preparations inside the Institutions are already under way and should be continued.” (Conclusion/§4)

“The 2002 Regular Reports will examine whether the candidate countries will have, by accession, adequate administrative capacity to implement and enforce the acquis.” (Conclusion/§5)

If we look at such a regular report, e.g. for Poland, we will see that progress towards the adoption of the acquis is examined in 29 chapters. Here is the list of examined topics:

1. Free movement of goods
2. Free movement of persons
3. Freedom to provide services
4. Free movement of capital
5. Company law
6. Competition policy
7. Agriculture
8. Fisheries
9. Transport policy
10. Taxation
11. Economic and monetary union
12. Statistics
13. Social policy and employment
14. Energy
15. Industrial policy
16. Small and medium-sized enterprises
17. Science and research
18. Education and training
19. Telecommunications and information technologies
20. Culture and audio-visual policy
21. Regional policy and co-ordination of structural instruments
22. Environment
23. Consumers and health protection
24. Co-operation in the field of justice and home affairs
25. Customs union
26. External relations
27. Common foreign and security policy
28. Financial control
29. Financial and budgetary provisions
Plus: Translation of the acquis into the national languages

Table 9. Checklist of the acquis (chapters in Regular Reports).

In [Enlargement-Rep2001-A] it can be seen that for Poland 11 chapters were still under negotiation in 2001.

9.1.1.2 Phare, ISPA, and Sapard

“During the period 2000-2006 financial assistance from the European Communities to the candidate countries of Central and Eastern Europe will be provided through three instruments: the Phare programme (Council Regulation 3906/89), ISPA (Council Regulation 1267/99) and Sapard (Council Regulation 1268/99)...” The ten countries are those listed above. Turkey, Cyprus, and Malta have access to other funds (namely MEDA). Note that Phare funds have existed since 1989. A synthetic, much simplified view of Phare is given in the following table. We skip the other two instruments, both to remain focused on the ICONS project and because Phare is the most important and best documented of the instruments.

Aim/name: Phare. To assist the candidate countries of central Europe in their preparations for joining the European Union.

Budget: For the period 1995-99, funding under Phare totalled roughly EUR 6.7 billion and covered fifteen sectors, the main five of which were: infrastructure; development of the private sector; education, training and research; environmental protection and nuclear safety; agricultural restructuring. The revamped Phare programme, with a budget of over EUR 10 billion for the period 2000-2006, now has two specific priorities, namely institution building and financing investments. [EU-Glossary]

Instrument(s): Accession Partnerships; National Programmes for the Adoption of the Acquis (NPAAs); Regular Reports.

Reforms: Phare has existed since 1989. In 1997, important reforms were introduced (decentralisation / deconcentration). An Extended Decentralised Implementation System (EDIS) is currently being prepared. The new approaches should help the countries to prepare for a smooth transition from pre-accession assistance to the Structural Funds.

Web sites: Phare: http://europa.eu.int/comm/enlargement/pas/phare/index.htm ; Tenders (EuropeAid): http://europa.eu.int/comm/europeaid/cgi/frame12.pl

Control of the EC: ex-ante.

Main actors: National Aid Co-ordinator (NAC); National Authorising Officer (NAO); National Fund; Implementation Agencies (IAs); Central Financing and Contracting Unit (CFCU); EC Delegations; DG for Enlargement; EuropeAid Cooperation Office (formerly the SCR); final beneficiaries (institutions, municipalities, ministries).

Practical Guide: The main features of the Practical Guide fall into three categories (simplification and harmonisation; increased transparency and more rights for companies participating in tenders; eligibility criteria and other essentials); for Sapard it is only applicable to procurement. Three types of contracts: services, supplies, works. Procedures (more or less complex) vary according to the type of contract and its value. See: Practical Guide to Phare, ISPA and SAPARD Contract Procedures at http://europa.eu.int/comm/enlargement/pas/phare/procedures.htm#6.1

Table 10. Overview of Phare.

9.1.1.3 “Best Practices”

To become members of the EU, candidate countries have to implement a large number of reforms. To help them, funds are made available by the present Member States through the central services of the EC, located mainly in Brussels, and ‘deconcentrated’ services, i.e. EC delegations. Since the inception of Phare, thousands of projects have been tendered, contracted, implemented, assessed, and audited. The fact that the organisational and procedural context of these projects evolves indicates that lessons learned by the various actors have, explicitly or not, been transformed into knowledge and, eventually, into changes in rules and procedures.

It is obvious that Phare programming (see [Phare_Review_2000]) is complex: it spans at least four years and it involves multiple institutions and responsibility functions. Even if we limit ourselves to two phases:

1. “Implementation -- Tenders -- Contracts and Management”
2. “Monitoring and Assessment Reports”

it is clear that large amounts of information, structured or not, quantitative or not, could be considered prime material for our prototype application of ICONS. Here are some examples of best practice which could be supported by our prototype. Each entry below gives an element of context (a step in the project life-cycle), the main actor/unit, and the relevant knowledge (adapted to context!): elements/questions for best practice (see [Phare_Review_2000, p. 44] for the main requirements).

Steps in Project Design (IA and Beneficiary): Which chapters/sections of the acquis are relevant? Criteria for mixing or separating supply with/from services. How to estimate the necessary budget and duration (study, tendering, implementation)? Technical specification or terms of reference by an ad hoc tender expert or by the Beneficiary. Big projects or numerous smaller ones?

Tender preparation (CFCU): Variants allowed in tenders? Clarification meeting desirable? Which sections of the Practical Guide apply? Visit of premises by tenderers to be organised (instead of more detailed specifications)?

Tender evaluation (CFCU): How to formulate evaluation criteria for service contracts? Composition of the evaluation committee, duration of the evaluation.

Prequalification (CFCU & Beneficiary): Optimal length of the “shortlist”.

Tendering (Tenderer): How to evaluate one’s own strong and weak points, and compare with other shortlisted firms?

Contracting (CFCU): Addenda, if any; payment schedules; guarantees, certificates of origin: sources of administrative problems?

Project realisation (Beneficiary): Demanded reporting (frequency, details, languages).

Financial control (CFCU): Was the budget correct?

Assessment of results (Beneficiary and EC): Assessment costs, duration, and results.

Assessment of results (Contractor): Added value for “goodwill”, individual experts; need for developing other/new skills.

Table 11. Best practice taxonomy.

[Figure: ICONS Portal concept. Data sources (other knowledge bases, original documents, centralised/decentralised DB extracts) feed data collection processes into the Information & Knowledge Base (metadata/ontologies, query manager). Knowledge workers and the system manager provide info quality control, knowledge enhancements, and control data & reports; the end-user interface presents and distributes usable knowledge, answers queries in a context of interest, and returns control feedback. End users: EC representatives (central/local), national coordinators and deciders, CFCU & IAs, beneficiaries, tenderers/contractors, and experts.]

Figure 14. Main Concept of ICONS portal for NAS Best Practice.

The main functional requirements are outlined hereafter by considering the different ‘actors’ in succession. It must be underlined that during the elaboration of D35 these requirements will be made more precise and also adapted to the data which will actually be made available (see Remarks below).

9.1.1.4 Actors (1): End Users

These end users will be, or belong to, the institutions listed in Table 10. End users will be identified personally and by their role, if the latter is not unique. Possible outputs of the system are relevant parts of:

1. NPAA, NPD, Regular Report & Negotiations
2. Funding Programmes
3. Community legislation in force, national legislation
4. Fund request procedures
5. Forms / templates, contact points
6. Success stories
7. Calls for Tenders (including technical specifications or terms of reference (ToRs))
8. Contracts and addenda, if any
9. Project implementation reports (from contractors)
10. Project assessment reports (from independent auditors/assessors)

selected on the basis of (assumed) end-user interests combined with the user’s own current indications. A second kind of output consists of advanced queries against databases describing projects and their progress: EC databases such as DESIREE and PERSEUS, or their successors, plus national databases such as the newly developed PENELOPA in Poland.

9.1.1.5 Actors (2): Knowledge Workers

Application Developers and Maintainers. On the basis of the common structure of the text documents (e.g. ToRs, tender forecasts), they will establish the links between classes of document representations and classes of relevant and queryable databases (when permitted). The management and maintenance of the necessary ontologies is an essential part of their work.


In particular, how end users’ own knowledge (or experience) in their domain of interest can be amplified by the system along the successive interactions, and how this knowledge can be made reusable by other users facing similar problems, is of the utmost importance.

Data Collectors. Data sources will evolve, URLs will be modified, and decision centres may change (centrally or locally); therefore, means to detect these changes have to be established. The need for, and permission to make, cache copies of original documents (to guarantee permanent access) also have to be established. Therefore, co-operation protocols need to be defined, especially in cases where classified information has to be accessed, either as such by authorised end users and/or only through statistical queries (reductions).

9.1.1.6 Actors (3): System Administrators

Their functions are classical. They will be responsible for granting permissions to access specific knowledge/data to authenticated actors. The management and monitoring of all security and availability aspects of the system will be in their hands.
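The change-detection task of the Data Collectors could be approached, for instance, with content fingerprinting: keep a digest of each known source and compare it on every visit. The sketch below is a minimal illustration under stated assumptions, not the actual ICONS mechanism; the class and method names are hypothetical, and fetching of URLs is left out (the caller supplies the retrieved bytes, or None if the source is unreachable).

```python
import hashlib
from typing import Dict, Optional


def fingerprint(content: bytes) -> str:
    """Return a stable digest of a source document's content."""
    return hashlib.sha256(content).hexdigest()


class SourceMonitor:
    """Track known sources and classify each visit as new, unchanged,
    changed, or unreachable (hypothetical helper, for illustration)."""

    def __init__(self) -> None:
        self.known: Dict[str, str] = {}  # url -> last seen digest

    def check(self, url: str, content: Optional[bytes]) -> str:
        if content is None:
            # The URL was modified or the decision centre moved.
            return "unreachable"
        digest = fingerprint(content)
        previous = self.known.get(url)
        self.known[url] = digest
        if previous is None:
            return "new"
        return "unchanged" if previous == digest else "changed"
```

A cache copy of the original document (where permitted) would be stored alongside the digest, so that a “changed” or “unreachable” verdict still leaves the last good version accessible.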

9.1.2 Key Issues for Application Development

Reminder 1: ICONS => knowledge-based access to pre-existing, distributed information in various forms (web pages, databases, legacy information systems, etc.).

Reminder 2: in general, knowledge = “understanding gained from experience” [Weidner2002, p. 18].

Hence a working ICONS has to be developed and put into operation in a progressive manner: knowledge needs knowledge to grow, and this growth will be more sustainable if the right information is effectively identified and made accessible in the most efficient way.

9.1.2.1 The Idea

Knowledge growth can be viewed as a spiral made of successive Knowledge Life Cycles (KLCs). Initially, a basic ontology is selected so as to seed the system, together with a minimal set of information sources and reference documents. The following cycles then develop and consolidate what has been integrated in the previous cycles:

• Ontology cycle: enrich the initial ontology (domains of interest such as chapters of the acquis; technologies (IT, environment, civil engineering, agro-bio-technologies, etc.); time models (dates); space (countries, regions, borders, rivers...); programmes; actors; projects; etc.); add connections between these main concepts [footnote 5]. The ontology is subject to validation, hence the status of ontological objects has to be managed too.
• Knowledge extraction: establish mechanisms (intelligent agents) (i) to identify existing and accessible sources of information, (ii) to extract knowledge from these sources using standardised RDF, and (iii) to populate an ‘extensional’ base of “facts” (EDB).
• Knowledge derivation: establish knowledge production rules (an ‘intensional’ DB) to derive additional knowledge from the base of facts (EDB + already derived and integrated facts).
• Intelligent access: by combining the informational goals expressed by end users, deliver relevant facts and supporting information (original documents or parts thereof).
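The extensional/intensional split in the cycles above can be sketched as a small forward-chaining loop: extracted facts form the EDB, and production rules are applied repeatedly until no new facts appear (a fixpoint). This is only an illustrative sketch, not the ICONS rule engine; the relation names (funded_by, concerns, chapter_of_acquis, relevant_to_chapter) and the single rule are invented for the example.

```python
# EDB: extensional base of facts extracted from sources,
# represented here as (relation, argument1, argument2) triples.
edb = {
    ("funded_by", "project_42", "Phare"),
    ("concerns", "project_42", "Environment"),
    ("chapter_of_acquis", "Environment", "22"),
}


def rules(facts):
    """IDB: one production rule deriving new facts from existing ones."""
    derived = set()
    for (r1, a, b) in facts:
        for (r2, c, d) in facts:
            # If a project concerns a topic, and the topic is an acquis
            # chapter, then the project is relevant to that chapter.
            if r1 == "concerns" and r2 == "chapter_of_acquis" and b == c:
                derived.add(("relevant_to_chapter", a, d))
    return derived


def saturate(facts):
    """Forward chaining: apply the rules until a fixpoint is reached."""
    facts = set(facts)
    while True:
        new = rules(facts) - facts
        if not new:
            return facts
        facts |= new


kb = saturate(edb)
# kb now also contains ("relevant_to_chapter", "project_42", "22")
```

In a real deployment the derived facts would be integrated back into the knowledge base, so that later cycles (and later rules) can build on them, which is exactly the “EDB + already derived and integrated facts” formulation above.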
Finally, current achievements will be assessed, and extensions or improvements proposed. It must be underlined that the underlying workflow and co-operation mechanisms between human knowledge workers and automated agents (and hence also their developers) constitute a ubiquitous challenge for this project. Schematically, a prototype development cycle is

Footnote 5: See the references given in Section 9.2.1, e.g. [Holsapple2002]; initial concepts can be drawn from the EU and Phare glossaries published on the Web (e.g. [EU-Glossary], [Phare-Glossary], and [PG-Glossary]).


Initial Configuration. Content sources: well known URLs (