Consistent Caching of Data Objects in Database Driven Websites

Pawel Leszczyński and Krzysztof Stencel
Faculty of Mathematics and Computer Science, Nicolaus Copernicus University
Chopina 12/18, 87-100 Toruń
{pawel.leszczynski,stencel}@mat.umk.pl
http://www.mat.umk.pl

Abstract. Since databases have become the bottlenecks of modern web applications, several techniques of caching data have been proposed. This paper extends existing caching models with automatic consistency maintenance between the cached data and the data stored in a database. We propose a dependency graph which provides a mapping between update statements in a relational database and cached objects. When an update is performed on the database, the graph allows detecting the cached objects which have to be invalidated in order to preserve the consistency of the cache and the data source. We describe a novel method of caching data and keeping it in a consistent state. We believe that this model keeps the number of invalidations as low as possible. We illustrate the method using a simple web community forum application and provide benchmarks which prove that our method is efficient when compared with other approaches.

Keywords: database caching, cache consistency, scalability, web applications.

1 Introduction

WEB 2.0 applications are data-intensive. As the number of users grows, the backend database rapidly becomes a bottleneck. Thus, various data caching techniques have been developed. Most e-commerce applications have high browse-to-buy ratios [9], which means that read operations are dominant. Furthermore, such applications have a relatively high tolerance for some data being out of date.

In this paper we show a novel method of running the cache of a data-intensive WEB 2.0 application. The goal of our research is to minimize the number of objects residing in a cache which must be invalidated after an update to the stored data. The main contribution of this paper is a method which maps data updates onto sets of cached objects to be invalidated that are as small as possible. The method uses a dependency graph composed of the queries executed in order to populate the cache, the objects of this cache and the updates of the cached data. This graph is analysed at compile time in order to identify dependencies between cached objects and the updates. Then, at run-time, cache objects are invalidated according to the inferred dependencies. This method keeps the invalidations as rare as possible.

Sometimes not all select or update statements are identified at the application's development time. Some new statements can arise during the operation of the application, e.g. because they are generated at run-time by a framework. In such cases the dependency graph is augmented with new vertices and edges.

We use a web community forum application as a running example. We employ it to illustrate the whole method. It also serves as a benchmark. The experimental results prove the efficiency of our method.

The paper is organized as follows. In Section 2 we present the motivating example of a forum application. In Section 3 we describe existing caching techniques and outline our consistency preserving object caching model. In Section 4 we present the model in detail. Section 5 describes the carried out experiments and their results. Section 6 concludes.

2 Motivating Example—A Community Forum Application

Let us consider a community forum application as an example. Suppose we have four tables: user, forum, topic and post. Each forum consists of topics, each of which contains a list of posts. Let us assume the following database schema:

user:  user_id, user_nick
forum: forum_id, forum_name, forum_desc
topic: topic_id, forum_id
post:  post_id, topic_id, post_title, post_text, user_id, created_at

When a new topic is added, its title and other data are stored in the first post. From the database's point of view the website consists of three views which are database intensive: listing forums, listing topics and listing posts. Let us now focus on listing topics. Figure 1 contains a view of a real forum application to show which data is needed when displaying the list of topics. Performing a query to compute it each time the page is loaded would be expensive and would harm the database. Thus such an architecture is not used in practice. Instead, modern systems modify the database schema by adding redundant data.

Fig. 1. The figure shows a single line from a list of visible topics. Each line contains: the topic name, which is the name of the first post, the owner of the topic and the date it was created, a post count, and information about the last post: its author and the date it was added.


In particular, one can add first_post_id, last_post_id and post_count fields to the tables forum and topic, and also topic_count to the table forum. Such a solution resolves the efficiency problem stated before, but it also introduces new ones. It adds derived attributes to the database schema, which is discouraged in OLTP applications. Each time a new post is added, not only the post table needs to be modified but also topic and forum, in order to maintain the post counts and the information about the latest added post. This also does not solve the main problem, since the database is still the bottleneck of the system. Adding redundant data also means adding several logical constraints that have to be maintained and are error prone. It would also introduce problems when trying to replicate the whole database.

The solution is to use a cache to keep all these data in memory. However, the data are updated by the users who add posts. Whenever this happens, many post counters have to be recalculated. The desired property of a cache is to recompute only those counters which have become invalid and possibly nothing else. In this paper we show a method to reduce the invalidations as much as it is practically possible.
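To make the maintenance burden of the denormalised variant concrete, the following sketch shows the statements it would force on every new post. It is only an illustration under the assumptions of this example: the DB-API style cursor and the helper name add_post_denormalised are hypothetical, and the redundant columns are those named in the previous paragraph.

# A sketch of the bookkeeping a denormalised schema forces on every new post.
# The cur argument stands for any DB-API style cursor; nothing here is taken
# from the paper's implementation.
def add_post_denormalised(cur, post_id, topic_id, forum_id,
                          title, text, user_id, created_at):
    cur.execute(
        "INSERT INTO post (post_id, topic_id, post_title, post_text, user_id, created_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (post_id, topic_id, title, text, user_id, created_at))
    # The redundant counters and last-post pointers must be kept in sync by hand,
    # which is exactly the error-prone constraint maintenance discussed above.
    cur.execute("UPDATE topic SET post_count = post_count + 1, last_post_id = ? "
                "WHERE topic_id = ?", (post_id, topic_id))
    cur.execute("UPDATE forum SET post_count = post_count + 1, last_post_id = ? "
                "WHERE forum_id = ?", (post_id, forum_id))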

3 Caching

3.1 Existing Caching Algorithms

Before the WEB 2.0 era, several techniques of caching whole websites were proposed and surveyed in [14,18]. They construct mappers between URLs and database statements. However, this approach loses efficiency in WEB 2.0 applications, since the same HTML fragment can be used on many pages and it does not make sense to invalidate whole pages.

The existing caching models can be divided into caching single queries, materialized views and tables. When caching single queries it may be hard to discover similarities and differences between the queries and their results. Let us consider two similar queries which return the same result set, but the first one produces it in ascending order, while the second one emits the descending order. Caching based on hashing algorithms does not work well in the presented application and applies more to caching files than to OLTP database schemas.

The idea of caching fragments of tables was first shown in [5], where the authors propose a model which divides tables into smaller fragments. It can be understood as storing data sets in caches and allowing them to be queried [21,8,9]. However, this does not solve the whole problem, since it lacks count operations. In the forum example, and in most WEB 2.0 applications, the website contains several counters which cannot be evaluated each time the website is loaded. Performing count operations on the cached data sets is difficult because of the complexity of detecting that all the data to be counted has been loaded into the cache. There is also no need to store the whole data when only counters are needed. The data can be loaded ad hoc or loaded dynamically each time it is needed. When all data is loaded into the cache at once, one can easily discover which count queries can be performed, but it also means caching data that may never be used.


Also, when an update occurs, the whole cache needs to be reloaded. This may cause severe problems because updates occur frequently in OLTP. It also means performing updates to maintain the consistency of cached data which is never used but is stored in the cache. Invalidation can occur each time an update occurs or at specified time intervals [13]. The second case would be efficient but would also allow storing and serving data that is not up to date. Loading data statically is more like database replication than a caching technique.

An interesting recent work on replicating data sources and reducing the communication load between the backend database and cache servers is described in [6]. It presents an approach based on hash functions that divide query results into data chunks in order to preserve the consistency. However, this also does not solve the problem of aggregation queries, whose results are difficult to keep consistent via hash similarity. The authors of [1,2,3] present a model that detects inconsistency based on statement templates, which is similar to the presented model. However, their approach cannot handle joins or aggregation in select statements.

Our approach is based on a graph whose edges determine the impact of the update statements on the cached data. The idea of the graph representation was first presented in [10,11,12]. There, the vertices of the graph represent instances of update statements and cached data objects. However, nowadays most webpages are personalized and the number of data objects has grown, multiplied by the number of application users. Under these conditions the graph size can grow rapidly and the method becomes impractical. In our approach the dependency graph has vertices which represent classes of update and select statements and not individual objects. This idea reduces the number of vertices. The problem can be understood as finding the best trade-off between efficiency and consistency for the given application.

The described schemata show that fine granularity of the cached data is strongly desired. In that case only the atoms which are not up to date would be invalidated, thus improving the efficiency. However, these atoms cannot be understood as table rows, since count would be difficult to define; they should rather be tuples containing data specified by the application logic. The schemata shown above cannot afford this because of the transparent proxy between the application server and the database. On one hand this feature aids software programmers because they do not need to dig into caching techniques. On the other hand it is the software programmer who has to specify what data has to be cached, because it is strongly related to the application's specific logic.

3.2 Caching Objects vs. Caching Queries

Most of the previously described caching techniques involve caching queries. This requires specifying the queries that need to be cached because they are frequently performed. If the query used for listing topics in the community forum is taken into consideration, one can argue whether it makes sense to cache its result. On one hand the query is performed each time a user enters a specific forum, but on the other hand one should be aware of user conditions. If the user searches for topics with a post containing a specific string, it may be useless to cache them because of the low probability that they will ever be reused.


Instead of caching queries, one should consider caching objects. Suppose objects of the classes TOPIC and FORUM are created and each of them contains the following fields:

FORUM: forum_id, forum_name, post_count, topic_count,
       last_post_author_nick, last_post_created_at
TOPIC: topic_id, first_post_title, first_post_created_at,
       first_post_author_nick, last_post_author_nick, last_post_created_at

Having such objects stored in a cache, the query listing topics could be the following:

SELECT topic_id FROM topic
WHERE forum_id = $forum_id AND ## user condition
LIMIT 20;

Having the list of topic ids, we simply get the corresponding objects from the cache. Whenever an object does not exist in the cache, it is loaded from the database and created. This means a significant reduction of the query complexity and a performance improvement.

Memcached [15] is an example of such a mechanism widely used in practice. It is a high-performance, distributed memory object caching system used by Wikipedia, Facebook and LiveJournal. In December 2008 Facebook considered itself the largest memcached user, storing more than 28 terabytes of user data on over 800 servers [17]. From a theoretical point of view, the idea can be seen as using a dictionary for storing data objects created from the specified queries. One can argue whether storing relational data inside a dictionary is sensible. Here performance becomes the issue. Since all the cached data is stored in RAM (it makes no sense to cache data on a disk), a cache server only needs to hash the name of the object and return the data stored under the hashed location. The caching mechanism is outside the database, which means a significant performance gain since the database workload is reduced.
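The following sketch illustrates the cache-aside access pattern described above. It is not taken from any concrete memcached client: the cache is modelled by a plain dictionary, the key scheme is an assumption of this illustration, and load_topic_from_db is a hypothetical helper that runs the object-creation queries listed later in Section 4.2.

# Minimal cache-aside sketch: look the object up by key, fall back to the
# database only on a miss, and store the freshly built object for later use.
cache = {}

def get_topic_object(topic_id, load_topic_from_db):
    key = "topic:%d" % topic_id
    obj = cache.get(key)
    if obj is None:                          # cache miss
        obj = load_topic_from_db(topic_id)   # runs the queries from Section 4.2
        cache[key] = obj                     # subsequent requests hit the cache
    return obj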

3.3 Data Consistency Problem

Data consistency problem arises when caching techniques are applied. Let us assume topic objects in the forum example are cached. Suppose one software programmer has created functionality of caching objects and few months later the other is asked to add functionality of editing user’s user nick. Each time user nick is changed all objects of topics owned by this user or topics where user has his last post need to invalidated. How can the programmer from one department know about the functionality made by the other one which is hidden in the data model? According to [19] Wikipedia uses a global file with a desrciption of all classes stored in a cache, the file is available on [20]. However this is not a general solution to the problem but only an improvement which helps programmers to manually make a safe guard from using objects in an inconsistent state. In this paper we propose the model of fully automatic system which maintains the consistency of cached data.
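A sketch of the manual invalidation the second programmer would have to remember to write follows; find_topics_of_user is a hypothetical query helper and the cache key scheme is the same assumed one as before. The point is that this code lives far away from the caching code, which is exactly what the automatic model proposed here removes.

# Without automatic invalidation, editing user_nick must explicitly drop every
# topic object whose first or last post was written by that user.
def update_user_nick(cur, cache, user_id, new_nick, find_topics_of_user):
    cur.execute("UPDATE user SET user_nick = ? WHERE user_id = ?",
                (new_nick, user_id))
    for topic_id in find_topics_of_user(cur, user_id):
        cache.pop("topic:%d" % topic_id, None)   # easy to forget in another module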

4 The Dependency Graph

4.1 Basic Assumptions

As described before, fine granularity of the cached data is strongly needed in caching systems. However, when caching objects, it is not possible to track the dependencies between all possible update statements and objects. The graph tracks only the dependencies within the database schema. It assumes that update statements are performed on single rows identified by primary keys and that cached objects are identified by the class and the primary key of the specified relation. The following assumptions restrict the syntax of SQL facilitated by our method. This restriction is needed since a mapping of the relational data to the dictionary is performed.

1. Select statements are significantly more frequent than inserts and updates.
2. The database consists of k relations.
3. Each table's primary key consists of a single attribute.
4. One can identify a set of select statements S = {S1, S2, ..., Sr} which are used for creating cached objects.
5. U = {U1, U2, ..., Um} is a set of queries which modify the database, and each of them modifies only a single row in a table reached by its primary key. Additionally, a select statement can have other conditions in the WHERE clause, but they can only involve columns of the parameterised table.
6. Cached objects and queries from S and U are parameterised by the primary key of some relation.
7. For each class of objects we can identify the subset of queries from S used to create the object.
8. The system does not assume anything about the data in cached objects; it only knows the select statements used to create them.

Sometimes it is convenient to know the cached data in order to optimise invalidation clues. However, the model does not rely on the data inside objects, because sometimes, instead of caching data, developers decide to cache whole HTML fragments corresponding to the cached objects. Other database caching systems could not allow this because they are transparent to the application server. Once again the transparency of caching models reveals its drawback. The presented model can also be seen as a conjunction of caching static HTML pages, whose mechanisms are clearly described in [7], and caching data from the database.

4.2 Query Identification

Let us first identify the queries used by the application when creating objects. The list of queries used when creating a topic object follows:

S1:   SELECT * FROM topic WHERE topic_id = $topicId;

S2_1: $max = SELECT max(p.created_at) FROM post p, topic t
      WHERE t.topic_id = p.topic_id AND t.topic_id = $topicId;

S2_2: SELECT u.user_nick FROM post p, user u, topic t
      WHERE t.topic_id = p.topic_id AND t.topic_id = $topicId
      AND p.user_id = u.user_id AND p.created_at = $max;

S3_1: $min = SELECT min(p.created_at) FROM post p, topic t
      WHERE t.topic_id = p.topic_id AND t.topic_id = $topicId;

S3_2: SELECT u.user_nick FROM post p, user u, topic t
      WHERE t.topic_id = p.topic_id AND t.topic_id = $topicId
      AND p.user_id = u.user_id AND p.created_at = $min;

S4:   SELECT count(p.post_id) FROM post p, topic t
      WHERE p.topic_id = t.topic_id AND t.topic_id = $topicId;

The first statement gets a row from the topic table. S2_1, S2_2 and S3_1, S3_2 are very similar and are used for getting the data of the last and the first post in the topic. Additionally, S4 is performed to evaluate the number of posts in the topic.

4.3 Dependency Graph Construction

In this section the creation of the dependency graph is described. As before, let us assume k relations R1, ..., Rk in the database schema and define the set of attributes for each relation. Additionally, let O denote the set of classes of objects stored in the cache:

Attr(R1) = {r11, ..., r1m1}, ..., Attr(Rk) = {rk1, ..., rkmk}    (1)

R = Attr(R1) ∪ Attr(R2) ∪ ... ∪ Attr(Rk)    (2)

O = {O1, O2, ..., On}    (3)
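For instance, in the forum schema of Section 2 we have Attr(user) = {user_id, user_nick} and Attr(post) = {post_id, topic_id, post_title, post_text, user_id, created_at}, and with the objects of Section 3.2 the set of cached classes is O = {FORUM, TOPIC}.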

Vertices. The dependency graph consists of vertices from the sets O, R, S and U, where S and U are the sets of select and update statements defined before. The vertices of the graph form four layers, as presented in Figure 2. Each ellipse denotes the attributes of one relation, but only the attributes themselves are vertices of the graph.


Fig. 2. Vertices of the dependency graph


Edges. Now that the vertices are defined, let us construct the edges of the graph. There are two kinds of edges: weak and strong. Weak edges are used to find the objects to be invalidated, while strong edges are used to identify the keys of these objects.

1. Edges with vertices from U: As assumed before, each update modifies a single row identified by a primary key. We create edges from the update vertex Ui to all attributes that have been modified (Fig. 3). In particular, if a new row is inserted or an existing one is deleted, then all attributes are modified and edges to all attribute vertices are added. Additionally, we distinguish between strong and weak edges: the edge connecting the primary key attribute is strong, while the others are weak. (An illustrative code sketch of this construction is given after this list, following Fig. 6.)

Fig. 3. Edges corresponding to update statement U1. Strong edges are solid, while weak edges are dotted.

2. Edges between attributes run between the primary key and all other attributes within a relation, and also between the primary key attribute and its corresponding foreign keys (see Figure 4). All edges between attributes are strong.


Fig. 4. Edges between attributes

3. Edges for select queries: The simplest case is when a single row in a table is accessed via its primary key. The left-hand side of Figure 5 shows this. The edges connect the query node to all selected attributes and to the attributes used in the WHERE clause. The edge containing the primary key of the relation is strong. On the right of Figure 5 one can see the edges for a join query. Only one-to-one and one-to-many joins are facilitated by our method, which means that a join can be performed only using a primary key. The attributes that are used within the selection are connected by an edge with the query vertex. In our running example the queries S1, S2_1, S2_2, S3_1, S3_2 and S4 have been identified for creating the topic object. The S1 statement returns a single row of the table identified by the primary key. S4 is a select statement with a join on a one-to-many relation and is allowed by the model.


Fig. 5. Selection queries: a single point select query and a join query

The queries S2_1, S2_2 and S3_1, S3_2 look quite similar and are used for retrieving the data of the first and the last post. S2_1 and S3_1 are select statements with a single join on a one-to-many relation. S2_2 and S3_2 include two joins, but the last condition in the WHERE clause does not correspond to the parameterised table. Both statements are parameterised with the topic primary key, while the condition uses columns from the post table, which is ruled out by the restrictions imposed in the model. In that case the caching system treats such a condition as non-existent. The system is then still correct, since it will not miss any object invalidation. However, it can sometimes unnecessarily invalidate objects normally filtered out by the disregarded condition.

4. Edges mapping objects to queries: The lowest layer of the graph connects cached objects with the select statements used for creating them, and only those statements need to be added to the graph (Fig. 6). Cached objects are parameterised by the primary key attribute, and the model works under the assumption that only this parameter can be used in the select statements as the condition.


Fig. 6. Select statements used for creating objects
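As a rough illustration of rules 1-4, the sketch below represents the graph with two adjacency maps, one per edge kind, and builds the edges of the running example's U4 statement. The class name, method names and string labels for the vertices are assumptions made for this sketch only, not the paper's implementation.

# A minimal, assumed representation of the dependency graph: weak edges are
# followed to find classes to invalidate, strong edges to propagate primary keys.
from collections import defaultdict

class DependencyGraph:
    def __init__(self):
        self.weak = defaultdict(set)
        self.strong = defaultdict(set)

    def add_edge(self, u, v, strong=False):
        edges = self.strong if strong else self.weak
        edges[u].add(v)
        edges[v].add(u)      # edges are traversed in both directions

# Rule 1 applied to U4 (INSERT INTO post): an insert modifies all columns, the
# primary key edge is strong and the remaining column edges are weak.
g = DependencyGraph()
g.add_edge("U4", "post.post_id", strong=True)
for col in ("post.topic_id", "post.post_title", "post.post_text",
            "post.user_id", "post.created_at"):
    g.add_edge("U4", col)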

4.4 Forum Example

In the community forum example the update statements also need to be identified. We can find five statements that manipulate data:

U1: INSERT INTO user VALUES ...
U2: INSERT INTO forum VALUES ...
U3: INSERT INTO topic VALUES ...
U4: INSERT INTO post VALUES ...
U5: UPDATE user SET user_nick = ... WHERE user_id = ...

In Section 4.2 we listed select statements used to retrieve a topic object. Since we have the updates and the queries we can create the dependency graph. Figure 7 shows the graph of the running example.


Fig. 7. A graph created for topic objects in the community forum example

4.5 The Algorithm

The presented algorithm relies on the dependency graph, which can be statically generated at system set-up. On one hand, it can be assumed that the graph is created before the web application runs. However, this assumption can be false in some cases. In most modern development frameworks programmers do not write SQL statements by hand; these statements are automatically generated from the database schema. In such a case it is impossible to list upfront the queries used by the application. Therefore, the dependency graph needs to be created dynamically with respect to the performed statements. Even if some graph elements are identified only when the web application runs, many of them are known in advance. Attribute vertices and the edges between them can be created from the database schema:

static G := initial_graph(database schema)
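A sketch of how such an initial graph could be derived from the schema follows. It reuses the DependencyGraph class sketched in Section 4.3; the schema dictionary and the helper name are assumptions of this illustration, not part of the paper's algorithm.

# Static part of the graph (rule 2 of Section 4.3): strong edges from each
# primary key to the other attributes of its relation and to the foreign keys
# referencing it. DependencyGraph is the class sketched in Section 4.3.
FORUM_SCHEMA = {
    "user":  {"pk": "user_id",  "columns": ["user_id", "user_nick"], "fks": {}},
    "forum": {"pk": "forum_id", "columns": ["forum_id", "forum_name", "forum_desc"], "fks": {}},
    "topic": {"pk": "topic_id", "columns": ["topic_id", "forum_id"],
              "fks": {"forum_id": "forum.forum_id"}},
    "post":  {"pk": "post_id",  "columns": ["post_id", "topic_id", "post_title",
                                            "post_text", "user_id", "created_at"],
              "fks": {"topic_id": "topic.topic_id", "user_id": "user.user_id"}},
}

def initial_graph(schema):
    g = DependencyGraph()
    for table, info in schema.items():
        pk = "%s.%s" % (table, info["pk"])
        for col in info["columns"]:
            if col != info["pk"]:
                g.add_edge(pk, "%s.%s" % (table, col), strong=True)
        for col, referenced_pk in info["fks"].items():
            g.add_edge(referenced_pk, "%s.%s" % (table, col), strong=True)
    return g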

The rest of the algorithm can be divided into two separate parts: the action for an update statement and the action for a select statement. When a select statement is performed, the system checks whether new vertices and edges need to be added to the graph. This routine only checks whether an augmentation of the graph is needed; no object gets invalidated. Therefore, this routine is not performed when the graph is static, i.e. when no new select statements can appear at run-time. The following code snippet shows the routine for select statements:

1.  validateSelect(stmt, class) {
2.    if (object_vertex(class) is null)
3.      V(G) += {class}
4.    v := statement_vertex(stmt)
5.    if (v is null)
6.      V(G) += {stmt};
7.      edges := {weak_edges(stmt), strong_edges(stmt)};
8.      E(G) += edges;
9.  }


Given the statement and the class of the cached objects, we first check whether the vertex corresponding to the object class exists. If not, it is added to the graph. Then the vertex of the performed select statement is looked up. If it does not exist, it is added; in that case the weak and strong edges additionally need to be identified. The weak edges are easily identified, since they bind the columns used by the selection. The strong edges between the attributes were already added at system set-up, so only the edges connecting attributes, select statements and the cached object need to be detected. This is easy, since the SQL statement is parameterised by only one parameter.

Let us now focus on the action for update statements. Again, if all update statements are known at compile time, the graph updating part (lines 6-9) of the following routine need not be performed.

1.  validateUpdate(stmt)
2.  {
3.    param := statement_parameter(stmt);
4.
5.    v := statement_vertex(stmt);
6.    if (v is null)
7.      V(G) += {stmt};
8.      E(G) += edges_of_update_statement(stmt);
9.
10.   ov := objects_reached_by_path_three(v);
11.
12.   foreach (o in ov)
13.     spath := shortest_path(v,o);
14.     params := [param];
15.     prev := v;
16.     while (next := next_node_in_path(spath, prev))
17.       params := next_vertex_params(params, prev, next);
18.       prev := next;
19.     invalidate_objects_of_class(o, params);
20. }

When a new update statement is received, we first check whether the corresponding vertex exists. If not, it is added together with all its edges: one strong edge to the primary key attribute and weak edges to the attributes of all modified columns. Then, at line 10, the system identifies the classes of objects that may need an invalidation. It searches for paths of length 3 between the given update vertex and the vertices of object classes, using both strong and weak edges. The length 3 is important, since it means that there is a column modified by the update and used by some select statement. This can also be seen in the example graph (Fig. 7): when a new forum is added, there exists a path between U2 and the topic vertex, but topic objects need not be invalidated since no column used to create them has been changed.


Having found the classes which may contain invalid objects, the algorithm finds the invalid instances. For each object vertex, the system searches for the shortest path between the update vertex and the object node, but using only strong edges. Weak edges are used to find the vertices of object classes, while the strong edges are applied to obtain the keys of those objects. For each object vertex the system walks through the shortest path and gathers the parameters of the reached vertices. Eventually it stops in the object vertex with the list of cached object parameters. Those object instances are removed from the cache. This time, when walking the path, the algorithm uses only strong edges, because they allow it to gather the parameters of the next vertices.

Let us consider statement U4 (adding a new post). The system starts from node U4 and goes through the post_id attribute to the topic_id in the post and the topic table. Having the topic_id, it knows the exact key of the object to invalidate, since topic objects and their select statements are parameterised by the same primary key. In this case no additional queries need to be performed on the database. However, in the case of U5 (updating a user's nick), having a user_id the system needs to find all posts written by the user. This means a query to the database must be performed. The system then invalidates all topics where the user has written posts. This can be seen as a drawback, but it is impossible to examine whether a user's post is really the first or the last one without querying the database. One can argue whether this can be improved for the min() and max() functions, but it surely cannot be done for avg(), so no general solution exists without digging into the SQL syntax. The other issue is a chain of joins between tables. If the shortest path goes through several joins which require querying the database, the system can be inefficient. However, it applies well in most cases, since in OLTP queries should be kept as simple as possible. One should also remember that even with a complicated database schema not all of the data has to be cached, and objects should be kept as granular as possible to prevent extensive invalidation.
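The key-propagation step just described could look roughly like the following sketch. The graph helpers (classes_within_path_length, shortest_strong_path) and the key-translation callback are hypothetical names standing in for the operations used in validateUpdate; translating keys across a join edge may require a database query, as in the U5 case above.

# Invalidate the cached objects affected by one update: find reachable object
# classes over weak and strong edges, walk the strong-edge path translating the
# primary key at every hop, then drop the collected keys from the cache.
def invalidate_for_update(graph, cache, update_vertex, param, translate_keys):
    for obj_class in graph.classes_within_path_length(update_vertex, 3):
        path = graph.shortest_strong_path(update_vertex, obj_class)
        keys = [param]
        for prev, nxt in zip(path, path[1:]):
            keys = translate_keys(keys, prev, nxt)   # may need a database query
        for key in keys:
            cache.pop("%s:%s" % (obj_class, key), None)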

5 Experiments

The presented model has been tested on the RUBiS benchmark [22]. RUBiS is an auction site prototype modelled after www.ebay.com which provides a web application and a client emulator. The emulator is modelled according to a real-life workload and redirects the client from one web page to another according to predefined transition probabilities. In the presented benchmark we used 100 client threads running at the same time for 18 minutes. We tested the application in different modes: without caching, caching with a defined time-to-live of the cached objects, and caching with the invalidation management based on the dependency graph. The application benchmark was run 3 times in each mode. Figures 8 and 9 display the achieved results. Our consistency preserving model reduces up to 54% of the performed queries. It is more efficient than techniques based on time-to-live and does not store stale data. In the experiment no cached objects were invalidated unnecessarily and there were no queries that did not fit the SQL syntax restrictions set by our model.


Fig. 8. The comparison of different caching techniques. The numbers indicate average count of select statements for each technique.

Fig. 9. (a) Number of data modifications. (b) Number of cache invalidations in different models.

We have also measured the number of data modifications and the number of cache invalidations, as shown in Figure 9. In the presented benchmark 7.1% of the database statements modified data and none of them updated more than one row. Our assumption that we deal with read-dominant database communication is therefore correct. The left part of the figure shows that the number of cache invalidations does not grow rapidly when compared to the number of database statements. The right part shows that almost 61% of cache invalidations can be saved by the presented model when compared to time-to-live techniques, which proves a significant improvement over those techniques.

6 Conclusion

The database bottleneck is the most serious problem in modern web applications and is solved in practice by providing scalable key-value cache storage engines. This, however, causes a consistency problem between the relational database and the cache. In this paper we presented a novel cache model based on a dependency graph which detects invalidations of the cached objects when updates occur.


We provided a series of tests on the RUBiS benchmark. We observed that the restrictions on SQL introduced in the model are not harmful in the context of web applications. We observed a significant reduction of the performed database statements. The presented approach is more efficient than time-to-live techniques and does not allow serving data which is not up to date.

When compared to the template approach, several improvements are worth noting. First, we allow joins and aggregation in select statements, which is very important since many aggregation functions are used in modern web applications to provide the counters frequently displayed on websites. Second, template based approaches need to know all classes of performed statements in advance, since the evaluation of invalidation rules is time consuming. Our dependency graph can be easily updated at any time, since adding or removing vertices does not require complex operations. When compared to materialized views, our mechanism does not exploit any knowledge of the cached data and its structure. Also note that materialized views reside on the database side and thus they cannot solve the database communication bottleneck. Since the invalidation policy does not rely on the cached data and its structure, it allows storing semi-structured data.

Future work may involve caching whole HTML code fragments. This can also be understood as an interesting consistency mapper between the database and the website components storing the current HTML. We also consider integration of the algorithm with one of the existing Object Relational Mappers. This could be further extended to an automatic generation of cached objects and invalidation rules from a predefined database schema.

References

1. Garrod, C., Manjhi, A., Ailamaki, A., Maggs, B., Mowry, T., Olston, C., Tomasic, A.: Scalable query result caching for web applications. In: Proceedings of the VLDB Endowment, vol. 1(1), pp. 550–561 (2008)
2. Garrod, C., Manjhi, A., Ailamaki, A., Maggs, B., Mowry, T., Olston, C., Tomasic, A.: Scalable Consistency Management for Web Database Caches. Computer Science Technical Report, School of Computer Science, Carnegie Mellon University (2006)
3. Manjhi, A., Gibbons, P.B., Ailamaki, A., Garrod, C., Maggs, B., Mowry, T.C., Olston, C., Tomasic, A., Yu, H.: Invalidation Clues for Database Scalability Services. In: Proceedings of the 23rd International Conference on Data Engineering (2006)
4. Choi, C.Y., Luo, Q.: Template-based runtime invalidation for database-generated Web contents. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds.) APWeb 2004. LNCS, vol. 3007, pp. 755–764. Springer, Heidelberg (2004)
5. Dar, S., Franklin, M.J., Jónsson, B.P., Srivastava, D., Tan, M.: Semantic Data Caching and Replacement. In: Proceedings of the 22nd International Conference on Very Large Data Bases, pp. 330–341 (1996)
6. Tolia, N., Satyanarayanan, M.: Consistency-preserving caching of dynamic database content. In: Proceedings of the 16th International Conference on World Wide Web, pp. 311–320 (2007)


7. Katsaros, D., Manolopoulos, Y.: Cache Management for Web-Powered Databases. Encyclopedia of Information Science and Technology (I), 362–367 (2005)
8. Altinel, M., Bornhövd, C., Krishnamurthy, S., Mohan, C., Pirahesh, H., Reinwald, B.: Cache tables: Paving the way for an adaptive database cache. In: Proc. VLDB 2003, pp. 718–729 (2003)
9. Luo, Q., Krishnamurthy, S., Mohan, C., Pirahesh, H., Woo, H., Lindsay, B., Naughton, J.: Middle-tier database caching for e-business. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pp. 600–611 (2002)
10. Iyengar, A., Challenger, J., Dias, D., Dantzig, P.: High-Performance Web Site Design Techniques. IEEE Internet Computing (4), 17–26 (2000)
11. Challenger, J., Dantzig, P., Iyengar, A., Squillante, M.S., Zhang, L.: Efficiently Serving Dynamic Data at Highly Accessed Web Sites. IEEE/ACM Transactions on Networking 12, 233–246 (2004)
12. Challenger, J., Iyengar, A., Dantzig, P.: A Scalable System for Consistently Caching Dynamic Web Data (1999)
13. Zhao, W., Schulzrinne, H.: DotSlash: Providing Dynamic Scalability to Web Applications with On-demand Distributed Query Result Caching. Computer Science Technical Report, Columbia University (2005)
14. Katsaros, D., Manolopoulos, Y.: Cache management for Web-powered databases. In: Web-Powered Databases, pp. 201–242 (2002)
15. Memcached, Danga Interactive, http://www.danga.com/memcached/
16. Velocity, http://code.msdn.microsoft.com/velocity
17. Scaling memcached at Facebook, http://www.facebook.com/note.php?note_id=39391378919
18. Li, W., et al.: CachePortal II: Acceleration of very large scale data center-hosted database-driven web applications. In: Proc. VLDB (2003)
19. Wasik, C.: Managing Cache Consistency to Scale Dynamic Web Systems. Master thesis, University of Waterloo (2007), http://uwspace.uwaterloo.ca/handle/10012/3183
20. memcached.txt in Wikipedia, http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/docs/memcached.txt?view=markup (accessed 04.04.2009)
21. Amiri, K., Tewari, R.: DBProxy: A Dynamic Data Cache for Web Applications. In: Proc. ICDE, pp. 821–831 (2003)
22. RUBiS (Rice University Bidding System), http://rubis.ow2.org/