Verifying Integrity Constraints on Web Sites

Mary Fernandez
AT&T Research
180 Park Ave.
Florham Park, NJ 07932 USA
mff@research.att.com

Daniela Florescu
INRIA
BP 105 Rocquencourt
Le Chesnay cedex, France
[email protected]

Alon Levy
Dept. of Computer Science
University of Washington
Seattle, WA 98195 USA
[email protected]

Dan Suciu
AT&T Research
180 Park Ave.
Florham Park, NJ 07932 USA
[email protected]

Abstract

Data-intensive Web sites have created a new form of knowledge base, as richly structured bodies of data. Several novel systems for creating data-intensive Web sites support declarative specification of a site's structure and content (i.e., the pages, the data available in each page, and the links between pages). Declarative systems provide a platform on which AI techniques can be developed that further simplify the tasks of constructing and maintaining Web sites. This paper addresses the problem of specifying and verifying integrity constraints on a Web site's structure. We describe a language that can capture many practical constraints and an accompanying sound and complete verification algorithm. The algorithm has the important property that if the constraints are violated, it proposes fixes to either the constraints or to the site definition. Finally, we establish tight bounds on the complexity of the verification problem we consider.

1 Introduction

Data-intensive Web sites have created a new form of knowledge base. They typically contain and integrate several bodies of data about the enterprise they are describing, and these bodies of data are linked into a rich structure. For example, a company's internal Web site may contain data about its employees, linked to data about the products they produce and/or to the customers they serve. The data in a Web site and the structure of the links in the site can be viewed as a richly structured knowledge base.

The management of data-intensive Web sites has received significant attention in the database community [Fernandez et al., 1998; Atzeni et al., 1998; Arocena and Mendelzon, 1998; Cluet et al., 1998; Paolini and Fraternali, 1998]. The key insight of recent systems is to specify the structure and content of sites declaratively. These systems separate and provide direct support for the three primary steps of site creation: (1) identifying and accessing the data served at the site, (2) defining the site's structure (i.e., the pages, the data in each page, and the links between pages), and (3) specifying the HTML rendering of the site's pages. Step 2 is usually supported by a declarative specification language.

Web-site management systems based on declarative representations offer several benefits. First, since a site's structure and content are defined declaratively, not procedurally by a program, it is easy to create multiple versions of a site. For example, it is possible to build internal and external views of an organization's site or to build sites tailored to novice or expert users. Currently, creating multiple versions requires writing multiple sets of programs or manually creating different sets of HTML files. Second, these systems support the evolution of a site's structure. For example, to reorganize pages based on frequent usage patterns or to extend the site's content, we simply rewrite the site's specification. Another advantage is efficient update of a site when its data sources change.


KNOWLEDGE-BASED APPLICATIONS

Declarative Web-site management systems also allow us to view a site's definition and its content as a knowledge base. A natural next step is to consider how reasoning techniques can further improve the process of building and maintaining Web sites. We consider the reasoning problem of verifying integrity constraints over Web sites. Specifically, when the structure of a site becomes complex, it is hard for a designer to ensure that the site will satisfy a set of desired properties. For example, we may want to enforce that all pages are reachable from the root, every organization homepage points to the homepages of its sub-organizations, or proprietary data is not displayed on the external version of the site. A study on the usability of on-line stores [Lohse and Spiller, 1998] provides other constraints that, if followed, would improve the site design.

For a verification tool to be useful, it must verify constraints against a site definition, not a particular instance of the site, because (1) we do not want to verify the constraints every time the site instance changes, and (2) if a Web site is dynamically generated, an instance is never completely materialized, making it impossible to check the constraints. Verifying the constraints on the site definition ensures that as long as the site is generated according to the definition, the constraints will be satisfied. For this reason, the verification problem requires reasoning, and not just applying a procedure to the site. Furthermore, when the integrity constraints are not verified, the system should automatically propose a set of candidate modifications to the site definition. This raises a search problem in the space of possible modifications.

This paper makes the following contributions. First, we identify an important class of integrity constraints relevant to Web sites. Second, we describe a sound and complete algorithm for verifying the integrity constraints and an analysis of its complexity. The key feature of our algorithms is that they consider only the specification of the site's structure and content, not a particular instance of a site. Hence, the verification is independent of changes to the underlying site, as long as the site is generated by the same specification. Finally, in cases where the verification algorithm shows that the constraints may be violated, it proposes a set of corrections to the Web site's definition.

The problem we consider is closely related to the problem of knowledge-base verification (see [VVT'98, 1998] for a recent workshop). We follow the paradigm proposed in [Levy and Rousset, 1998], where algorithms for verification are based on query containment. However, whereas in [Levy and Rousset, 1998] there was a 1-1 translation between the verification problem and query containment, a challenge in our case is to perform the appropriate transformation.

We believe that Web-site management tools based on declarative specifications will pose several important AI research problems in the near future. Hence, one of the contributions of this paper is to bring the problem to the attention of our community. In the last section, we mention other research problems in this context.

2 Declarative Management of Web Sites

Declarative systems for Web-site management are based on the principle of separating three tasks: (1) the management of the data underlying the site, (2) the definition of the site's structure and content, and (3) the graphical presentation of the site.

The first step requires identifying the sources that contain the site's data. We refer to this data as the raw data. These sources may include databases, structured files, or pre-existing sites. We assume that we interact with each of these sources via a wrapper program that produces the necessary data in tabular form. Here, we assume that the raw data is stored in a single relational database system. In the rest of the paper, we use an example that is a small fragment of a publication's Web site. Fig. 1 contains the schema of the raw data and sample data.

Figure 1: The schema and data underlying the publication Web site.

The second step in building a Web site requires specifying the site's structure. We describe a formalism for specifying this structure that captures features common to many declarative systems for Web-site management [Fernandez et al., 1998; Atzeni et al., 1998; Arocena and Mendelzon, 1998; Cluet et al., 1998; Paolini and Fraternali, 1998]. We emphasize that the declarative specification is concerned with the logical model of the site as a set of nodes and links, not its graphical presentation (i.e., how each node is translated to HTML). When using the systems above, the site designer also specifies the graphical presentation of each page, usually by a set of HTML templates, each of which applies to a group of related pages.

2.1 Specifying Web-site Structure

In order to specify a site's structure, we need to state (1) what pages exist, (2) what data is available in each page, and (3) what links exist between pages. We specify the structure of a site in a site definition. Given a site definition and a database instance, applying the definition to the database produces an instance of the site, called a site graph. Fig. 2 contains our example site definition and Fig. 3 contains the resulting site graph.

Figure 2: The site definition for our example site.

Figure 3: The example site graph.

A site definition is a graph whose nodes are labeled by variables or by functional terms of the form f(X), where X is a (possibly empty) tuple of variables. Functional nodes in the site definition represent sets of pages in the site graph. In our example, the node PersonPage(Y) represents the set of pages PersonPage(p) where p is a constant in the database. Non-functional nodes are leaves and have one incoming edge. They represent the data contained in the page that points to them. For example, the node N represents the name of the person Y. Functional nodes that have no arguments represent unique pages, such as the root.

Each functional node is labeled with a Horn rule that defines the conditions for the existence of instances of the node. A rule's head is an atom of the form Node(f(X)), where Node is a special predicate. For example, the rule for YearPage specifies that there will be a node for a year Z if some article was published in year Z. The rules for our example site are:

Edges in the site definition represent sets of links in the site graph. Each edge has an associated Horn rule, which specifies the conditions for the existence of a link between instances of the source and destination nodes. The Horn rules use the special predicate Link. For example, the fifth rule below specifies that there is a link in the site graph between the page PersonPage(Y) and the page ArticlePage(X) if Y is an author of paper X. The third argument of the predicate Link is the link's label in the site graph.¹ We assume that in all of the rules this argument is always a constant. The rules for the links in our example are given below.

Link(Root(), Publications(), "publications") :- true.
Link(Root(), People(), "people") :- true.
Link(Publications(), YearPage(Z), "year") :- Article(_, _, Z).
Link(People(), PersonPage(Y), "person") :- Person(Y, _).
Link(PersonPage(Y), ArticlePage(X), "article") :- Author(Y, X), Article(X, _, _).
Link(YearPage(Z), ArticlePage(X), "article") :- Article(X, _, Z).
Link(ArticlePage(X), PersonPage(Y), "author") :- Author(Y, X), Person(Y, _).

Finally, the data contained in each page is also specified by Horn rules. For every leaf associated with a functional node, we associate a Horn rule defining the contents of the leaf. The first rule below specifies that the name of a person will be contained in the appropriate person page:

Link(PersonPage(Y), N, "name") :- Person(Y, N).
Link(ArticlePage(X), T, "title") :- Article(X, T, _).
Link(ArticlePage(X), PS, "ps") :- Article(X, _, _), PsFile(X, PS).

¹ This string denotes the name of the relationship between the nodes in the site graph, and not the anchor that will appear on the link in the actual site. Anchors are omitted for clarity.
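To make the semantics of the rules concrete, the following minimal Python sketch applies the example link rules to a small database instance and materializes the resulting site graph. The sample tuples, the tuple encoding of pages, and the function name `build_site_graph` are illustrative assumptions. The node rules other than the one for YearPage are inferred from the link rules rather than taken from the text.

```python
# Hypothetical raw data (not the paper's Fig. 1 sample data).
db = {
    "Person": {("p1", "Ann"), ("p2", "Bob")},
    "Article": {("a1", "First Title", 1998), ("a2", "Second Title", 1999)},
    "Author": {("p1", "a1"), ("p2", "a1"), ("p2", "a2")},
    "PsFile": {("a1", "a1.ps")},
}


def build_site_graph(db):
    """Evaluate the example Node and Link rules over one database instance."""
    root, pubs, people = ("Root",), ("Publications",), ("People",)
    nodes = {root, pubs, people}
    links = {(root, pubs, "publications"), (root, people, "people")}

    for x, title, year in db["Article"]:
        # Assumed node rules: one page per article and one page per year.
        nodes |= {("ArticlePage", x), ("YearPage", year)}
        links.add((pubs, ("YearPage", year), "year"))
        links.add((("YearPage", year), ("ArticlePage", x), "article"))
        links.add((("ArticlePage", x), title, "title"))

    for y, name in db["Person"]:
        nodes.add(("PersonPage", y))  # assumed node rule
        links.add((people, ("PersonPage", y), "person"))
        links.add((("PersonPage", y), name, "name"))

    article_ids = {x for x, _, _ in db["Article"]}
    person_ids = {y for y, _ in db["Person"]}
    for y, x in db["Author"]:
        if x in article_ids:  # rule body: Author(Y, X), Article(X, _, _)
            links.add((("PersonPage", y), ("ArticlePage", x), "article"))
        if y in person_ids:  # rule body: Author(Y, X), Person(Y, _)
            links.add((("ArticlePage", x), ("PersonPage", y), "author"))

    for x, ps in db["PsFile"]:
        if x in article_ids:  # rule body: Article(X, _, _), PsFile(X, PS)
            links.add((("ArticlePage", x), ps, "ps"))

    return nodes, links


nodes, links = build_site_graph(db)
```

In this instance only article a1 has a PsFile tuple, so ArticlePage a2 receives no "ps" link. Leaf values (names, titles, file names) appear only as link targets and are not added to `nodes`, which is a simplification of this sketch.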


Declarative specification of a Web site offers many advantages: rapid modification of the site's structure; creation of multiple versions of the site for different classes of users; and, as we explore next, the ability to reason globally about the site's structure. In principle, restructuring a site or building another version requires modifying the set of rules that define the site, instead of modifying each page and its hard-wired links.

3 Specifying Integrity Constraints

Although declarative specification can simplify the task of creating complex sites, the specification of a richly structured site can be long. For example, the specification of a customer-billing site using the Strudel specification language [Fernandez et al., 1998] is 474 lines. The specification is more concise than the equivalent implementation in a scripting language, but still too large to determine without automated reasoning whether global constraints on the site are satisfied. [Lohse and Spiller, 1998] describes Web sites for on-line stores. They argue that enforcing integrity constraints on such sites is critical to customer satisfaction and describe a set of such constraints. Our goal is to take advantage of a site's declarative definition and develop algorithms for verifying that a given definition only produces sites that satisfy the given set of constraints. For our example, some possible constraints include:

IC1: All article pages are reachable from the root page.
IC2: For every article, there is a link from its article page to its PostScript source.
IC3: If two articles have a common author, there is a path between the corresponding article pages.
IC4: If two articles have been published in the same year, there is a path between the corresponding article pages.

We may also want to specify constraints that limit the length of a path between two nodes, or that force every path to a node to go through some distinguished set of nodes. We define our language for specifying these kinds of integrity constraints and formally define the verification problem.

Integrity constraints express properties we would like the Web site to have. Since the Web site is modeled as a graph, integrity constraints should be able to express the existence of certain paths between pages in the site. We express such paths using regular-path expressions.
A regular-path expression over the set of constants C is formed by the following grammar (R, R1 and R2 denote regular-path expressions):

R ::= a | not(a) | _ | R1.R2 | (R1 | R2) | R+

In the grammar, a denotes a constant in C; not(a) matches any constant in C different from a; _ denotes any constant in C; a period denotes concatenation; and | denotes alternation. R+ denotes 1 or more repetitions of R. For example, a.b._.c+ denotes the set of paths beginning with ab, then an arbitrary element of C, and then any number of occurrences of c. We use * as a shorthand for _+, meaning an arbitrary path of length 1 or more.

Regular-path expressions are used in path atoms of the form X R Y, where R is a regular-path expression, and X and Y are terms. The atom X R Y is satisfied in a labeled directed graph G by each pair of nodes X, Y for which there is a path from X to Y that satisfies the regular-path expression R.

In principle, we can express integrity constraints using arbitrary formulas in first-order logic. However, our main goal here is to identify a more restricted language for which it is possible to develop sound and complete verification algorithms and which is expressive enough to model integrity constraints that are of practical interest. We consider integrity constraints that have the form $\varphi \Rightarrow \psi$, where $\varphi$ and $\psi$ are conjunctions of path atoms, atoms of the relations of the raw data, and atoms of the relation Node. Variables that appear in both $\varphi$ and $\psi$ are assumed to be universally quantified, while the others are existentially quantified. The following sentences express the integrity constraints in our example.

IC1:
IC2:
IC3:
IC4:
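One way to write the four constraints in the language just described is sketched below. This is a plausible transcription of the informal statements, not necessarily the paper's exact sentences: the arrow notation for path atoms, the variable names, and the choice of left-hand-side atoms are assumptions, and the original formulas may carry additional conjuncts (for instance, atoms of Node or Person).

```latex
\begin{align*}
\text{IC1:}\quad & \mathit{Node}(\mathit{ArticlePage}(X)) \Rightarrow
  \mathit{Root}() \xrightarrow{\;*\;} \mathit{ArticlePage}(X) \\
\text{IC2:}\quad & \mathit{Article}(X, \_, \_) \Rightarrow
  \mathit{ArticlePage}(X) \xrightarrow{\;\text{"ps"}\;} P \\
\text{IC3:}\quad & \mathit{Author}(Y, X_1) \wedge \mathit{Author}(Y, X_2) \Rightarrow
  \mathit{ArticlePage}(X_1) \xrightarrow{\;*\;} \mathit{ArticlePage}(X_2) \\
\text{IC4:}\quad & \mathit{Article}(X_1, \_, Z) \wedge \mathit{Article}(X_2, \_, Z) \Rightarrow
  \mathit{ArticlePage}(X_1) \xrightarrow{\;*\;} \mathit{ArticlePage}(X_2)
\end{align*}
```

In each sentence the variables shared by both sides are universally quantified, while a variable such as $P$ in IC2, which occurs only on the right-hand side, is existentially quantified.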

Given a particular site graph, it is straightforward to test whether an integrity constraint holds. However, our goal is to verify at the intensional level whether an integrity constraint is guaranteed to hold, i.e., given a site definition $S$, test whether the integrity constraint will hold for all Web sites that can be generated by $S$, for any possible database state. Formally, our problem is the following.

Definition 1: Let $E_1, \ldots, E_n$ be the relations in the schema of the raw data, and let $S$ be a site definition. Let IC be an integrity constraint. We say that $S$ satisfies IC if, for any given extension $I$ of the relations $E_1, \ldots, E_n$, IC is satisfied in the site graph resulting from $S$ and $I$.

In our example, IC1 is satisfied, because every article has a year of publication and is therefore reachable through the YearPage. Similarly, IC3 is also satisfied. IC2 is not satisfied, because some articles may not have PostScript sources. Although IC4 is satisfied by the site graph in Fig. 3, it is not necessarily satisfied for every site graph.

Next, we describe a sound and complete verification algorithm, and show how the complexity of the verification problem changes with the form of the integrity constraints considered.
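Testing a constraint on one concrete site graph can be sketched as follows. The sketch evaluates a regular-path expression bottom-up into the set of node pairs it connects, then checks IC1 and IC2 on a small hand-written graph. The tuple encoding of expressions, the helper names, and the graph itself are assumptions made for illustration, and repetition is read as one or more repetitions. This is an instance-level check only, not the definition-level verification described above.

```python
def compose(left, right):
    """Relational composition: pairs (x, z) with (x, y) in left, (y, z) in right."""
    return {(x, z) for (x, y) in left for (y2, z) in right if y == y2}


def pairs(r, edges):
    """Set of node pairs (x, y) joined by a path matching expression r.

    Expressions are nested tuples (a hypothetical encoding):
    ("lab", a), ("not", a), ("any",), ("cat", r1, r2), ("alt", r1, r2), ("plus", r1).
    """
    kind = r[0]
    if kind == "lab":
        return {(x, y) for (x, y, lab) in edges if lab == r[1]}
    if kind == "not":
        return {(x, y) for (x, y, lab) in edges if lab != r[1]}
    if kind == "any":
        return {(x, y) for (x, y, _) in edges}
    if kind == "cat":
        return compose(pairs(r[1], edges), pairs(r[2], edges))
    if kind == "alt":
        return pairs(r[1], edges) | pairs(r[2], edges)
    if kind == "plus":
        base = pairs(r[1], edges)
        closure = set(base)
        while True:  # iterate to a fixpoint (transitive closure of base)
            new = compose(closure, base) - closure
            if not new:
                return closure
            closure |= new
    raise ValueError(f"unknown expression kind: {kind}")


# A tiny hypothetical site graph: (source, target, label).
edges = {
    ("Root", "Publications", "publications"),
    ("Publications", "Year1998", "year"),
    ("Year1998", "Article:a1", "article"),
    ("Year1998", "Article:a2", "article"),
    ("Article:a1", "a1.ps", "ps"),
}
article_pages = {"Article:a1", "Article:a2"}

ANY_PATH = ("plus", ("any",))  # the shorthand * for an arbitrary path
reach = pairs(ANY_PATH, edges)

# IC1: every article page is reachable from the root.
ic1 = all(("Root", a) in reach for a in article_pages)

# IC2: every article page has an outgoing "ps" link.
ps_links = pairs(("lab", "ps"), edges)
ic2 = all(any(x == a for (x, _) in ps_links) for a in article_pages)
```

On this graph IC1 holds and IC2 fails, because Article:a2 has no "ps" link. A different database state could yield a different outcome, which is why the check has to be made against the site definition instead.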

4 Verification Algorithm

The crucial step of our verification algorithm is to translate the integrity constraint into a pair of Datalog programs $Q_1$ and $Q_2$. Datalog [Ullman, 1997] is a database query language where queries are specified by sets of Horn rules, and the meaning of the query is given by the least fixpoint model of the database and the rules. Our translation has the property that the integrity constraint is satisfied if and only if the Datalog program $Q_1$ is contained in the program $Q_2$. Informally, given two queries $Q$ and $Q'$, the query $Q$ contains the query $Q'$ if $Q$'s result is a superset of $Q'$'s result for any database instance. Algorithms for query containment have been studied extensively in the database literature [Ullman, 1997]. These algorithms can be viewed as logical-entailment algorithms for specific classes of logical sentences, which is why they are useful in our context.

Our algorithm has two steps.

1. Given the integrity constraint and the site definition, create a pair of Datalog queries $Q_1$ and $Q_2$.
2. Use an extended query-containment algorithm to test whether $Q_1$ is contained in $Q_2$. If the containment holds, then the integrity constraint is guaranteed to hold. If not, the containment algorithm returns a set of candidate fixes.

We describe each step in more detail. The algorithm in Fig. 4 translates either the left-hand side or the right-hand side of the integrity constraint into a Datalog program. This step relies heavily on the possible paths specified in the structure of the site definition in order to generate $Q_1$ and $Q_2$. The subtle part of the translation concerns the path atoms. Given a path atom X R Y, the translation builds in a bottom-up fashion a Datalog program that defines a relation corresponding to each of the subexpressions of R. The translation varies slightly depending on whether X and Y are variables or functional terms, and whether there is another conjunct of the form Node(X) (Node(Y)). In the figure, we show only the case when X and Y are unary functional terms.
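The bottom-up treatment of path atoms can be illustrated with a simplified Python sketch that emits one binary Datalog predicate per subexpression of R. It is not the algorithm of Fig. 4: it works over a single ternary Link relation instead of the rules of a site definition, and the tuple encoding of expressions, the generated predicate names `p0`, `p1`, ..., and the `!=` syntax for inequality are all assumptions of this sketch.

```python
import itertools


def translate(r, rules, counter):
    """Append rules defining a binary predicate for r; return its name.

    Rules for subexpressions are appended before the rule that uses them.
    """
    name = f"p{next(counter)}"
    kind = r[0]
    if kind == "lab":
        rules.append(f'{name}(X, Y) :- Link(X, Y, "{r[1]}").')
    elif kind == "not":
        rules.append(f'{name}(X, Y) :- Link(X, Y, L), L != "{r[1]}".')
    elif kind == "any":
        rules.append(f"{name}(X, Y) :- Link(X, Y, _).")
    elif kind == "cat":
        a = translate(r[1], rules, counter)
        b = translate(r[2], rules, counter)
        rules.append(f"{name}(X, Y) :- {a}(X, Z), {b}(Z, Y).")
    elif kind == "alt":
        a = translate(r[1], rules, counter)
        b = translate(r[2], rules, counter)
        rules.append(f"{name}(X, Y) :- {a}(X, Y).")
        rules.append(f"{name}(X, Y) :- {b}(X, Y).")
    elif kind == "plus":
        a = translate(r[1], rules, counter)
        rules.append(f"{name}(X, Y) :- {a}(X, Y).")
        rules.append(f"{name}(X, Y) :- {a}(X, Z), {name}(Z, Y).")  # recursion
    else:
        raise ValueError(f"unknown expression kind: {kind}")
    return name


def path_atom_program(r):
    """Return (query predicate, rules) for the path expression r."""
    rules = []
    query = translate(r, rules, itertools.count())
    return query, rules


# The shorthand * (an arbitrary path of length 1 or more).
query, rules = path_atom_program(("plus", ("any",)))
```

For the arbitrary-path expression, the generated program is the familiar recursive transitive-closure definition over Link, with `p0` as the query predicate. In the setting of the paper, the Link atoms would additionally be unfolded using the Link rules of the site definition, so that the resulting programs range over the raw data.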
If our extended query-containment algorithm reports that the containment holds, then the integrity constraint is guaranteed to hold. Otherwise the containment algorithm returns a set of candidate fixes. The algorithm considers four kinds of fixes:

• Add conditions to the left-hand side of the integrity constraint,
• Remove conditions from the rules in the site definition,
• Modify the two queries by adding back arcs in the site definition,
• Suggest a set of integrity constraints to enforce on the raw data, which guarantee that the constraints on the site will hold.

The fixes are reported to the site designer, who can then decide how to proceed. Due to space limitations, we only illustrate this phase of the algorithm through the example below. Intuitively, the fixes are generated by searching through the possible modifications to the two queries such that, for the modified queries, the containment holds.

Algorithm IC-translate
Input: a formula that is either the LHS or the RHS of an IC; the site definition; and the universally quantified variables in the IC.
Output: a Datalog program defining the relation for the input formula.
Let the input formula be a conjunction of $n$ atoms. For $1 \le i \le n$, let $P_i$ be the set of Horn rules returned by atomToProg for the $i$-th atom, with query predicate