Rough Set Theory in Very Large Databases

T. Y. Lin
Department of Mathematics and Computer Science, San Jose State University, San Jose, California 95192-0103
and Berkeley Initiative in Soft Computing, Department of Electrical Engineering and Computer Science, University of California, Berkeley, California 94720
e-mail: [email protected]   fax: 408-924-5080

ABSTRACT

Rough set theory is an elegant and powerful methodology for extracting and minimizing rules from decision tables and Pawlak information systems. Its central notions are core, reduct, and knowledge dependency. It has been shown that finding a minimal reduct is an NP-hard problem, so this computational complexity has implicitly restricted its effective application to small and clean data sets. In this paper, rough set methodology is extended to very large relational databases, at some sacrifice of its elegance. In essence, techniques for extracting nice subsets of data from noisy data banks are integrated with rough set theory. Given a database, a sequence of interconnected Pawlak information systems (PIS) of various sizes is extracted from the very large data bank. These PIS's represent certain patterns of the data bank. Applying rough set methodology to these PIS's, soft rules can be mined effectively. However, these rules may not be minimal reducts.

* This research is supported by the Electric Power Research Institute, Palo Alto, California.

1. INTRODUCTION

The central notions in rough set theory are core, reduct, and knowledge dependency. Around 1992, Skowron and Rauszer proved a beautiful yet discouraging result: finding minimal reducts is an NP-hard problem [1]. So the full power of rough set methodology may only be effective on clean and small sets of data. With gigabytes of data in modern database applications, direct application of rough set methodology is prohibitively expensive. Even in small data-set environments, researchers have softened or generalized rough set theory by introducing probability [2], fuzzy theory [3], and the neighborhood systems of modal logic [4, 5]. In this paper, we propose to extract a series of interconnected Pawlak information systems (PIS) that represent certain patterns of the data [6, 7, 8]. Then we apply rough set methodology to this series of PIS's. In essence, we integrate database searching techniques and rough set methodology into an effective procedure for mining very large databases.

2. ROUGH SET METHODOLOGY

In this section, we review rough set methodology as explained by Pawlak [9], using examples. Consider the following decision table. The first column, ID#, contains transaction IDs; RESULT is the decision attribute; TEST, LOW, HIGH, NEW, and CASE are the conditional attributes.

ID#     TEST  LOW  HIGH  NEW  CASE  RESULT
ID-1     1     0    0     11    2      1
ID-2     1     0    0     11    2      1
ID-3     1     1    1     11    2      1
ID-4     0     1    1     10    3      1
ID-5     0     1    1     10    3      1
ID-6     0     1    1     10    3      1
ID-7     0     1    1     10    3      1
ID-8     0     1    1     10    3      1
ID-9     0     1    1     10    3      1
ID-10    0     1    1     10    3      1
ID-11    0     1    1     10    3      1
ID-12    0     1    1     10    3      1
ID-13    0     1    1     12    2     10
ID-14    1     1    0     12    2     10
ID-15    1     1    0     12    2     10
ID-16    1     1    0     12    2     10
ID-17    1     1    0     12    2     10
ID-18    1     1    0     23    2     30
ID-19    1     0    0     23    2     30
ID-20    1     0    0     23    2     30
ID-21    1     0    0     23    2     30

TABLE 1

(1) An equivalence relation can be defined by RESULT:


ID-i ≅ ID-j   iff   ID-i.RESULT = ID-j.RESULT

It partitions the transactions into three decision classes:

DECISION1 = {ID-1, ID-2, ..., ID-12}   (RESULT = 1)
DECISION2 = {ID-13, ID-14, ..., ID-17} (RESULT = 10)
DECISION3 = {ID-18, ID-19, ..., ID-21} (RESULT = 30)

(2) For the conditional attributes {TEST, LOW, HIGH, NEW, CASE}, we have the following condition classes:

CONDITION1 = {ID-1, ID-2}
CONDITION2 = {ID-3}
CONDITION3 = {ID-4, ID-5, ..., ID-12}
CONDITION4 = {ID-13}
CONDITION5 = {ID-14, ..., ID-17}
CONDITION6 = {ID-18}
CONDITION7 = {ID-19, ..., ID-21}

(3) Comparing the condition classes with the decision classes, we find seven inclusions, which give us seven decision rules:

R1: CONDITION1 → DECISION1
R2: CONDITION2 → DECISION1
R3: CONDITION3 → DECISION1
R4: CONDITION4 → DECISION2
R5: CONDITION5 → DECISION2
R6: CONDITION6 → DECISION3
R7: CONDITION7 → DECISION3
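To make the construction concrete, here is a small Python sketch (ours, not the paper's; the row helper and variable names are illustrative) that recomputes the decision classes, the condition classes, and the seven rules directly from Table 1.

```python
# Sketch (not from the paper): computing the decision and condition classes
# of Table 1 by grouping transactions on attribute values.
from collections import defaultdict

CONDITION_ATTRS = ["TEST", "LOW", "HIGH", "NEW", "CASE"]

def row(test, low, high, new, case, result):
    return dict(TEST=test, LOW=low, HIGH=high, NEW=new, CASE=case, RESULT=result)

# Table 1, written compactly (row template repeated for identical transactions).
TABLE_1 = (
    [row(1, 0, 0, 11, 2, 1)] * 2 +      # ID-1, ID-2
    [row(1, 1, 1, 11, 2, 1)] +          # ID-3
    [row(0, 1, 1, 10, 3, 1)] * 9 +      # ID-4 .. ID-12
    [row(0, 1, 1, 12, 2, 10)] +         # ID-13
    [row(1, 1, 0, 12, 2, 10)] * 4 +     # ID-14 .. ID-17
    [row(1, 1, 0, 23, 2, 30)] +         # ID-18
    [row(1, 0, 0, 23, 2, 30)] * 3       # ID-19 .. ID-21
)
IDS = ["ID-%d" % (i + 1) for i in range(len(TABLE_1))]

def partition(attrs):
    """Group transaction IDs by their values on the given attributes."""
    classes = defaultdict(list)
    for tid, t in zip(IDS, TABLE_1):
        classes[tuple(t[a] for a in attrs)].append(tid)
    return dict(classes)

decision_classes = partition(["RESULT"])         # three classes: RESULT = 1, 10, 30
condition_classes = partition(CONDITION_ATTRS)   # seven classes CONDITION1..CONDITION7

# Each condition class lies inside exactly one decision class, giving rules R1..R7.
for cond_values, members in condition_classes.items():
    results = {TABLE_1[IDS.index(m)]["RESULT"] for m in members}
    print(cond_values, "->", results, "support =", len(members))
```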

These seven rules can be represented by the following table:

Rule#   TEST  LOW  HIGH  NEW  CASE  RESULT  # of items
R1       1     0    0     11    2      1         2
R2       1     1    1     11    2      1         1
R3       0     1    1     10    3      1         9
R4       0     1    1     12    2     10         1
R5       1     1    0     12    2     10         4
R6       1     1    0     23    2     30         1
R7       1     0    0     23    2     30         3

TABLE 2

(4) Simplification

Step 1: Find the attribute REDUCT. First observe that CASE is RESULT-dispensable; in other words, we can drop CASE without affecting the decision rules. It is clear that the rest of the attributes are indispensable. Hence we have the minimal reduct of attributes, REDUCT = {TEST, LOW, HIGH, NEW}.
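As a quick check of this dispensability claim, the following sketch (ours; the paper gives no code) verifies that projecting the seven rules of Table 2 onto {TEST, LOW, HIGH, NEW} still assigns a unique RESULT to every combination of values, i.e., that CASE can indeed be dropped.

```python
# Sketch (ours): verify that CASE is RESULT-dispensable in Table 2, i.e. the
# remaining attributes still determine RESULT uniquely.
ATTRS = ["TEST", "LOW", "HIGH", "NEW", "CASE"]

# Rules R1..R7 of Table 2 as (TEST, LOW, HIGH, NEW, CASE, RESULT).
RULES = [
    (1, 0, 0, 11, 2, 1), (1, 1, 1, 11, 2, 1), (0, 1, 1, 10, 3, 1),
    (0, 1, 1, 12, 2, 10), (1, 1, 0, 12, 2, 10),
    (1, 1, 0, 23, 2, 30), (1, 0, 0, 23, 2, 30),
]

def determines_result(attrs):
    """True if the chosen attributes assign a unique RESULT to each value pattern."""
    seen = {}
    for r in RULES:
        key = tuple(r[ATTRS.index(a)] for a in attrs)
        if seen.setdefault(key, r[-1]) != r[-1]:
            return False
    return True

print(determines_result(["TEST", "LOW", "HIGH", "NEW"]))  # True: CASE is dispensable
```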

Step 2: Find the value reduct for each rule. To illustrate the idea, we compute the value reduct for the first rule. Let [R1]_TEST denote the equivalence class of rules defined by the attribute TEST in Table 2, namely

[R1]_TEST = {R1, R2, R5, R6, R7}
[R1]_LOW  = {R1, R7}
[R1]_HIGH = {R1, R5, R6, R7}
[R1]_NEW  = {R1, R2}

F  = { [R1]_TEST, [R1]_LOW, [R1]_HIGH, [R1]_NEW }
∩F = [R1]_TEST ∩ [R1]_LOW ∩ [R1]_HIGH ∩ [R1]_NEW = {R1}

By dropping components of F, we find two minimal subfamilies of F, called value reducts, for which the inclusion still holds:

[R1]_HIGH ∩ [R1]_NEW ⊆ ∩F
[R1]_LOW  ∩ [R1]_NEW ⊆ ∩F

So for rule R1 we have two different sets of minimal conditions. We summarize all the value reducts in the table below; in other words, the table lists the minimal conditions for each rule.

Rule#   TEST  LOW  HIGH  NEW  RESULT  # of items
R1                  0     11     1         2
R1             0          11     1         2
R2       1          1            1         1
R2             1          11     1         1
R2                  1     11     1         1
R3                        10     1         9
R4       0                12    10         1
R4                  1     12    10         1
R5                  0     12    10         4
R5       1                12    10         4
R6             1          23    30         1
R7             0          23    30         3

TABLE 3
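The computation illustrated above for R1 can be carried out mechanically for every rule. The sketch below (ours, with illustrative names) finds, for each rule, the minimal attribute subsets whose equivalence classes intersect inside ∩F; for R1 it returns exactly the two value reducts listed above.

```python
# Sketch (ours; the paper gives no code): value reducts of each rule in Table 2,
# following the computation illustrated for R1 — find the minimal subfamilies
# of F whose intersection is still contained in ∩F.
from itertools import combinations

REDUCT = ["TEST", "LOW", "HIGH", "NEW"]
# Rules R1..R7 of Table 2 restricted to the reduct attributes, plus RESULT.
RULES = {
    "R1": dict(TEST=1, LOW=0, HIGH=0, NEW=11, RESULT=1),
    "R2": dict(TEST=1, LOW=1, HIGH=1, NEW=11, RESULT=1),
    "R3": dict(TEST=0, LOW=1, HIGH=1, NEW=10, RESULT=1),
    "R4": dict(TEST=0, LOW=1, HIGH=1, NEW=12, RESULT=10),
    "R5": dict(TEST=1, LOW=1, HIGH=0, NEW=12, RESULT=10),
    "R6": dict(TEST=1, LOW=1, HIGH=0, NEW=23, RESULT=30),
    "R7": dict(TEST=1, LOW=0, HIGH=0, NEW=23, RESULT=30),
}

def eq_class(rule, attr):
    """[rule]_attr: the rules sharing this rule's value of attr."""
    v = RULES[rule][attr]
    return {r for r, vals in RULES.items() if vals[attr] == v}

def value_reducts(rule):
    """Minimal attribute subsets B with  ∩_{a in B} [rule]_a  ⊆  ∩F."""
    target = set.intersection(*(eq_class(rule, a) for a in REDUCT))  # this is ∩F
    reducts = []
    for size in range(1, len(REDUCT) + 1):
        for B in combinations(REDUCT, size):
            if any(set(sub) <= set(B) for sub in reducts):
                continue                    # not minimal: contains a smaller reduct
            inter = set.intersection(*(eq_class(rule, a) for a in B))
            if inter <= target:
                reducts.append(B)
    return reducts

for r in RULES:
    print(r, value_reducts(r))
# R1 yields [('LOW', 'NEW'), ('HIGH', 'NEW')], matching the two value reducts above.
```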

We are not interested in rules with few supporting cases, so we drop all rules whose # of items is less than or equal to one:

Rule#   TEST  LOW  HIGH  NEW  RESULT  # of items
R1                  0     11     1         2
R1             0          11     1         2
R3                        10     1         9
R5                  0     12    10         4
R5       1                12    10         4
R7             0          23    30         3

TABLE 4

3. BOTTOM-UP ROUGH SET METHODOLOGY

The method introduced in Section 2 is an elegant approach when the data are clean and small. If Table 1 were a very large database, however, then Table 3 would also be large. In such a case the rules with very few supporting cases have no real-world significance, and the table reduces to Table 4. In that situation we would, of course, drop the rules with small supporting cases at the very beginning; in other words, we would consider only the data that occur with high frequency.

To appreciate the problem better, let us estimate the actual size of the data. A typical supermarket has about 10 check-out lines and about 5 hours of peak time per day. Assume each cashier can process 10 items per minute. During peak time we then have a total of 10*60*5*10 = 30,000 items per day. We will call the transaction of one item an item-sale, and the transaction of a customer purchase a customer transaction. In addition, there are 20,000 item-sales during the 10 hours of non-peak time. In total, there are 50,000 item-sales per day, or 1,500,000 item-sales per month. A typical supermarket carries about 1,000 items, and on average each customer transaction has about 10 item-sales. So the table may have 1,000 attributes and 1,500,000/10 = 150,000 tuples (customer transactions). One should note that this is a sparse table; most entries are null.

For such databases, rough set methodology is extremely difficult to apply directly. Suppose the decision attribute is always a single item. Then there are 2^999 possible sets of conditional attributes, so we have 1000 * 2^999 potential decision tables, each with 150,000 tuples. Even for just one decision table of this size, a direct application of rough set methodology is prohibitively expensive. Moreover, this is not the only problem: scanning the database and identifying items are expensive operations too.

Adopting the idea from [6], we arrive at a bottom-up version of rough set methodology. The sparse table is of course not stored as described. Each item is represented and stored as an attribute-value pair, i.e., (attribute, integer), where the attribute is an encoding of an item name and the integer is the number of items purchased. We will refer to this final encoding as an encoded pair, an encoded item, or simply an item. Each customer transaction is a variable-length record or tuple, a finite sequence of approximately 10 encoded items. Each encoded item occupies 5 bytes: 3 bytes for the attribute and 2 bytes for the integer. So there are (1,500,000 item-sales) * (5 bytes) = 7,500,000 bytes. Each page is about, say, 2K bytes, so we have 3,750 pages. The access time of a page is about 20 milliseconds. As usual, these data are indexed by B+-trees. The total size of the non-leaf nodes is small, so we can assume the tree stays in main memory at all times.

As in [6], a k-itemset consists of encoded items of uniform length k, k = 2, 3, ... Each k-itemset Pk is a PIS. These itemsets are interconnected: all items of length k-1 contained in the k-itemset appear in the (k-1)-itemset. Itemsets are constructed iteratively; the itemset discovered in the current iteration is used to create the candidate itemset for the next iteration, and the iteration continues until all items are exhausted.
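The level-wise construction just described is in the spirit of the association-rule mining of [6]. The sketch below is our own illustration, not the paper's code; the function and variable names are ours. It builds the k-itemsets iteratively with a minimum support threshold, one database scan per level.

```python
# Sketch of the level-wise itemset construction described above (in the spirit
# of [6]); names and data layout here are illustrative, not the paper's code.
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """transactions: list of sets of encoded items, e.g. ('TEST', 1).
    Returns {k: {itemset(frozenset): support}} for k = 1, 2, ..."""
    # 1-itemsets: one scan of the database.
    counts = Counter(item for t in transactions for item in t)
    level = {frozenset([i]): c for i, c in counts.items() if c >= min_support}
    result, k = {1: level}, 1
    while level:
        k += 1
        # Candidate k-itemsets: join frequent (k-1)-itemsets; every (k-1)-subset
        # of a candidate must itself be frequent.
        prev = list(level)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Scan the database once to count the surviving candidates.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= min_support:
                level[c] = support
        if level:
            result[k] = level
    return result

# Toy usage with Table 1's transactions encoded as (attribute, value) pairs:
# transactions = [{("TEST", 1), ("LOW", 0), ("HIGH", 0), ("NEW", 11),
#                  ("CASE", 2), ("RESULT", 1)}, ...]
# print(frequent_itemsets(transactions, min_support=2))
```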

Let us assume that the required minimum support is 1% of the total number of transactions, i.e., 150,000 * 0.01 = 1,500 transactions. We form a sequence of PIS's with such a minimum support. In database mining we do not know our targets in advance, so PIS's are more suitable than decision tables. However, for the sake of comparison we will base our discussion on decision tables; in other words, we assume the database miner is seeking specific patterns, for example the purchasing patterns of cream-cheese buyers. In that example, the attribute "cream cheese" is the decision attribute. We will apply our methodology to the example of Section 2 for comparison. (All tables are shown in the Appendix.)

Step 1: Create the 1-itemset P1. We scan the whole database and count the number of supporting item-sales. We collect the pair (encoded-item, # of supporting item-sales) for all items that have support ≥ 2. For instance, the pair ((TEST, 1), 3) represents the encoded item (TEST, 1) that has been purchased in 3 transactions.

Step 2: Create the 2-itemset P2.
Step 2.1: We create the candidate 2-itemset C2.
Step 2.2: We then scan the database and collect the set of tuples with two attributes, called the 2-itemset P2. Of course, we collect only 2-items with minimum support ≥ 2. We obtain the five tables (2.a)-(2.e). (For the reader's sake we represent them as tables; they would actually be represented as encoded sequences (encoded item-1, encoded item-2, # of supporting item-sales).)
Step 2.3: Finally, we collect all the consistent rules (the decision attribute is RESULT): Tables (2.a”)-(2.e”).

Step 3: Create the 3-itemset P3. Similarly to Step 2, we first form the candidate 3-itemset C3 (Table (3.a); the remaining candidate tables are skipped). Second, we obtain the 3-itemset P3. Third, we select only the consistent rules, Tables (3.a”)-(3.j”).

Steps 4-6: We create the tables of consistent rules for the remaining levels (only the 6-itemset is shown in full).

5. CONCLUSION

The 6-itemset P6 is the same as Table 2 with the rows whose support is less than 2 dropped. The rules in Table 4 are contained in the various consistent-rule tables of the Pk's. If we had the information on supporting cases, we could identify which consistent rules in P6 come, in fact, from some Pk, k ≤ 6; after such identification we would recover Table 4. In other words, we may obtain all the results of classical rough set theory. In practice, however, there is so much data that such identification is unrealistic. So the consistent rules in the Pk's, k = 2, 3, ..., 6, contain all the rules of Table 4, but not in as elegant a form. If we have no decision attribute, rules can still be deduced from all the k-itemsets Pk.

For example, let us examine a tuple in (2.a): ((TEST, 0), (RESULT, 1), 9). Examining the supports in the 1-itemset, the item (TEST, 0) has 10 supporting items and (RESULT, 1) has 12. The rule (TEST, 0) → (RESULT, 1) therefore has confidence level 9/10, while (RESULT, 1) → (TEST, 0) has confidence level 9/12. In real-world applications these two levels could be very different. This illustrates the kind of soft rules generated by the bottom-up rough set methodology.
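The confidence levels quoted above can be read directly off the itemset supports; a minimal sketch (ours) of that computation:

```python
# Sketch (ours): confidence of a rule X -> Y from itemset supports,
# as in the (TEST, 0) -> (RESULT, 1) example above.
def confidence(support_xy, support_x):
    return support_xy / support_x

support = {
    frozenset([("TEST", 0)]): 10,                    # from the 1-itemset P1
    frozenset([("RESULT", 1)]): 12,                  # from the 1-itemset P1
    frozenset([("TEST", 0), ("RESULT", 1)]): 9,      # from the 2-itemset (2.a)
}

xy = frozenset([("TEST", 0), ("RESULT", 1)])
print(confidence(support[xy], support[frozenset([("TEST", 0)])]))    # 0.9  = 9/10
print(confidence(support[xy], support[frozenset([("RESULT", 1)])]))  # 0.75 = 9/12
```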

6. APPENDIX

Table for Step 1: The 1-itemset P1 can easily be generated from the following table. For example, (TEST, 1) has 2+1+4+1+3 supporting cases and (RESULT, 1) has 2+1+9 supporting cases.

TEST  LOW  HIGH  NEW  CASE  RESULT  # of items
 1     0    0     11    2      1         2
 1     1    1     11    2      1         1
 0     1    1     10    3      1         9
 0     1    1     12    2     10         1
 1     1    0     12    2     10         4
 1     1    0     23    2     30         1
 1     0    0     23    2     30         3

Tables for Step 2:

(2.a)
TEST  RESULT  # of supports
 1       1          3
 0       1          9
 1      10          4
 1      30          4

(2.b)
LOW  RESULT  # of supports
 0       1          2
 1       1         10
 1      10          5
 0      30          3

(2.c)
HIGH  RESULT  # of supports
 0       1          2
 1       1         10
 1      10          5
 0      30          4

(2.d)
NEW  RESULT  # of supports
 11      1          3
 10      1          9
 12     10          5
 23     30          4

(2.e)
CASE  RESULT  # of supports
 2       1          3
 3       1          9
 2      10          5
 2      30          4

Consistent rules from the 2-itemsets:

(2.a”)
TEST  RESULT  # of supports
 1      10          4
 1      30          4

(2.b”)
LOW  RESULT  # of supports
 1      10          5
 0      30          3

(2.c”)
HIGH  RESULT  # of supports
 1      10          5
 0      30          4

(2.d”)
NEW  RESULT  # of supports
 11      1          3
 10      1          9
 12     10          5
 23     30          4

(2.e”)
CASE  RESULT  # of supports
 2      10          5
 2      30          4
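One plausible reading of "consistent rules", sketched below (ours; the paper does not give code), is to keep exactly those condition patterns of a k-itemset that occur with a single RESULT value. Applied to (2.d), every row survives, as in (2.d”).

```python
# Sketch (our reading of "consistent rules"): keep the condition patterns of a
# k-itemset that occur with exactly one RESULT value.
from collections import defaultdict

def consistent_rules(itemset_rows):
    """itemset_rows: list of (condition_dict, result, support)."""
    by_condition = defaultdict(set)
    for cond, result, _ in itemset_rows:
        by_condition[tuple(sorted(cond.items()))].add(result)
    return [(cond, result, support)
            for cond, result, support in itemset_rows
            if len(by_condition[tuple(sorted(cond.items()))]) == 1]

# The 2-itemset (2.d): NEW together with RESULT.
table_2d = [
    ({"NEW": 11}, 1, 3),
    ({"NEW": 10}, 1, 9),
    ({"NEW": 12}, 10, 5),
    ({"NEW": 23}, 30, 4),
]
print(consistent_rules(table_2d))   # here every row survives, as in (2.d")
```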

Table for Step 3: By combining tables from Step 2, we form the candidate 3-itemsets. For example, (2.a) and (2.b) give table (3.a); the rest of the candidate tables are skipped. These tables form the potential 3-itemsets. By searching the database, one can verify whether a "candidate tuple" is indeed a tuple of the database; the collection of all candidate tuples that occur in the database is the 3-itemset. There are various efficient database searching techniques to achieve this.

(3.a)
TEST  LOW  RESULT  # of supports
 0     0      1
 1     0      1
 0     1      1
 1     1      1
 1     1     10
 1     0     30

(The # of supports column is left blank: these are candidates, counted only after the database scan.)

(3.a’) By searching the database, we find the supports. The following table is the 3-itemset obtained from table (3.a):

TEST  LOW  RESULT  # of supports
 1     0      1          2
 0     1      1          9
 1     1     10          4
 1     0     30          3

Consistent 3-itemsets (3.a”)-(3.j”). (The last column, Rule#, is for the reader's reference; such information is too costly to provide in real applications.)

(3.a”)
TEST  LOW  RESULT  # of supports  Rule#
 0     1      1          9         R3
 1     1     10          4         R5

(3.b”)
TEST  HIGH  RESULT  # of supports  Rule#
 1     0       1          2         R1
 0     1       1          9         R3
 1     0      10          4         R5
 1     0      30          4         R7

(3.c”)
TEST  NEW  RESULT  # of supports  Rule#
 1     11      1          3         R1
 0     10      1          9         R3
 1     12     10          4         R5
 1     23     30          4         R7

(3.d”)
TEST  CASE  RESULT  # of supports  Rule#
 1     2       1          3         R1
 0     3       1          9         R3
 1     2      10          4         R5
 1     2      30          4         R7

(3.e”)
LOW  HIGH  RESULT  # of supports  Rule#
 0     0      1          2         R1
 1     1      1         10         R3
 1     0     10          4         R5
 0     0     30          3         R7

(3.f”)
LOW  NEW  RESULT  # of supports  Rule#
 0    11      1          2         R1
 1    10      1          9         R3
 1    12     10          5         R5
 0    23     30          3         R7

(3.g”)
LOW  CASE  RESULT  # of supports  Rule#
 1     3      1          9         R3
 1     2     10          5         R5

(3.h”)
HIGH  NEW  RESULT  # of supports  Rule#
 0     11      1          2         R1
 1     10      1          9         R3
 0     12     10          4         R5
 0     23     30          4         R7

(3.i”)
HIGH  CASE  RESULT  # of supports  Rule#
 1     3       1          9         R3

(3.j”)
NEW  CASE  RESULT  # of supports  Rule#
 11    2      1          3         R1
 10    3      1          9         R3
 12    2     10          5         R5
 23    2     30          4         R7

Table for Step 4:

(4.a”)
TEST  LOW  HIGH  RESULT  # of supports  Rule#
 1     0    0       1          2         R1
 0     1    1       1          9         R3
 1     1    0      10          4         R5
 1     0    0      30          3         R7

and others.

Table for Step 5: 5-itemset

(5.a”)
TEST  LOW  HIGH  NEW  RESULT  # of supports  Rule#
 1     0    0     11      1          2         R1
 0     1    1     10      1          9         R3
 1     1    0     12     10          4         R5
 1     0    0     23     30          3         R7

and others.

Table for Step 6: 6-itemset P6

TEST  LOW  HIGH  NEW  CASE  RESULT  # of supports  Rule#
 1     0    0     11    2      1          2         R1
 0     1    1     10    3      1          9         R3
 1     1    0     12    2     10          4         R5
 1     0    0     23    2     30          3         R7

References

[1] A. Skowron and C. Rauszer, "The Discernibility Matrices and Functions in Information Systems," in R. Slowinski (ed.), Decision Support by Experience - Application of the Rough Sets Theory, Kluwer Academic Publishers, 1992, pp. 331-362.
[2] W. Ziarko, "Variable Precision Rough Set Model," Journal of Computer and System Sciences, pp. 39-58, 1993.
[3] T. Y. Lin, "Coping with Imprecision Information - 'Fuzzy' Logic," Downsizing Expo, Santa Clara Convention Center, Aug. 3-5, 1993.
[4] Y. Y. Yao and T. Y. Lin, "Generalization of Rough Sets Using Modal Logics," Intelligent Automation and Soft Computing, an International Journal, to appear, 1996.
[5] T. Y. Lin, "Neighborhood Systems - A Qualitative Theory for Rough and Fuzzy Sets," Workshop on Rough Set Theory, Proceedings of the Second Annual Joint Conference on Information Sciences, Wrightsville Beach, North Carolina, Sept. 28-Oct. 1, 1995, pp. 257-260.
[6] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules Between Sets of Items in Large Databases," in Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, DC, June 1993, pp. 207-216.
[7] R. Agrawal, T. Imielinski, and A. Swami, "Database Mining: A Performance Perspective," IEEE Transactions on Knowledge and Data Engineering, 5(6), pp. 914-925, December 1993. Special Issue on Learning and Discovery in Knowledge-Based Databases.
[8] M. Houtsma and A. Swami, "Set-Oriented Mining for Association Rules in Relational Databases," in Proceedings of the 11th International Conference on Data Engineering, March 1995, pp. 25-33.
[9] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, 1991.

Tsau Young (T. Y.) Lin received his Ph.D. from Yale University and is now a Professor at San Jose State University and a Visiting Scholar in BISC, Department of Electrical Engineering and Computer Science, University of California at Berkeley. He has chaired and served on the program committees of various conferences and workshops, and has served on the editorial boards of several international journals. His interests include approximation theory (in databases and knowledge bases), data mining, data security, fuzzy sets, Petri nets, and rough sets.