Contingency tables with fuzzy categories

Institut f. Statistik u. Wahrscheinlichkeitstheorie 1040 Wien, Wiedner Hauptstr. 8-10/107 AUSTRIA http://www.statistik.tuwien.ac.at

Contingency tables with fuzzy categories S. Taheri, G. Hesamian, and R. Viertl

Research Report SM-2010-2, December 2010

Contact: [email protected]

Contingency Table with Fuzzy Categories

S.M. Taheri and GH. Hesamian
Department of Mathematical Sciences, Isfahan University of Technology, Isfahan, 84156–83111, Iran
Email: [email protected] Email: [email protected]

R. Viertl
Institute of Statistics and Probability Theory, Vienna University of Technology, 1040 Wien, Austria
Email: [email protected]

Abstract: The analysis of contingency tables is extended to the case in which the variables are categorized by linguistic terms rather than by crisp quantities. To this end, the usual Chi-Square test statistic and p-value are extended to a fuzzy test statistic and a fuzzy $\tilde{p}$-value by means of the $\alpha$-cuts approach. In addition, a measure of association is extended to a fuzzy version for evaluating the relationship between two fuzzy-categorized variables. The proposed method is illustrated by a real-world numerical example.

I. INTRODUCTION

A particular class of non-parametric statistical procedures consists of statistical tests based on categorized variables. Classical procedures in these cases are commonly based on crisp (exact, non-fuzzy) categories. In the real world, however, there are situations in which categories based on linguistic terms are more realistic and more suitable. For example, consider the relationship between income level and energy consumption investigated in economic studies. In such a case, it is more reasonable to categorize the possible amounts of income by a fuzzy partition into linguistic terms, say "very high", "high", "moderate", "low", and "very low". Likewise, it is more natural to categorize the possible amounts of energy consumption by linguistic terms, say "low", "moderate", and "high". Such examples arise in many other fields in which the categories of one or more variables of interest may be described in linguistic terms rather than in exact ones. Note that the borders between such fuzzy categories are too vague to assign a response to exactly one category.

The analysis of contingency tables based on fuzzy categories requires new statistical methods. Fuzzy set theory provides suitable tools for modeling and analyzing such contingency tables. Over the last decades, many attempts have been made in various fields of study to combine statistical methods and fuzzy sets. However, to the best of the authors' knowledge, there have been few works dealing with non-parametric methods in the fuzzy environment. For the purposes of this study, let us briefly review some of the literature on this topic. Kahraman et al. [11] proposed some algorithms for fuzzy non-parametric rank-sum tests based on fuzzy random variables. Grzegorzewski [7] demonstrated a straightforward generalization of some classical non-parametric tests for fuzzy random variables. He also studied the problem of testing the equality of k samples against the so-called "simple-tree alternative" for fuzzy random variables [8], based on the necessity index of strict dominance suggested by Dubois and Prade [3]. In this manner, he obtained a fuzzy test showing a degree of possibility and a degree of necessity for rejecting the underlying hypothesis [8]. Denoeux et al. [2], using the concept of fuzzy partial ordering, extended the non-parametric rank-sum tests to fuzzy data. They introduced the concepts of the fuzzy $\tilde{p}$-value and the degree of rejection of the null hypothesis, quantified by a degree of possibility and a degree of necessity. Hryniewicz [10] introduced a fuzzy version of the Goodman-Kruskal measure in so-called contingency tables with multiple responses described by ordered categorical data, in the case where observations of the response variable are fuzzy (and observations of the explanatory variable are crisp). For more on statistical methods with fuzzy observations, the reader is referred to the relevant literature [12], [15].

In this work, we propose a procedure for analyzing contingency tables when the categories of interest are imprecise rather than crisp. Let us state the main problem more precisely. Suppose we have a random sample of observations $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, and there are two attributes of interest, say $T$ and $S$, for each subject in the sample. Suppose there are $r$ categories of the variable $T$ and $c$ categories of the variable $S$, and each of the $N$ observations $(x_k, y_k)$, $k = 1, 2, \ldots, N$, is classified into exactly one of the $r \times c$ cross-categories. In an $r \times c$ contingency table, the entry in the $(i, j)$ cell, denoted by $f_{ij}$, is the number of items having the cross-classification $T = t_i$, $S = s_j$.
In the classical approach to contingency tables, we are interested in testing the null hypothesis that the two variables are independent, and in determining a measure of association between the two variables (for more details, see [1], [4]). Suppose, however, that the variables of interest are categorized by linguistic terms, so that the boundaries of the categories are not precise. Specifically, suppose that instead of the categories $t_1, t_2, \ldots, t_r$ of the variable $T$, we have $\tilde{t}_1, \tilde{t}_2, \ldots, \tilde{t}_r$ as fuzzy categories of the possible values of the variable $T$. The aim of this study is to provide an appropriate method for analyzing such contingency tables.

This paper is organized as follows. In Section II, we recall some concepts of fuzzy numbers. In Section III, we introduce a method to test the hypothesis of independence in a two-way contingency table with fuzzy categories. To do this, we develop a method to construct the fuzzy cell frequencies, the fuzzy Chi-Square test statistic, and the fuzzy $\tilde{p}$-value. To decide whether or not to reject the null hypothesis, we use an index to compare the observed fuzzy $\tilde{p}$-value with a fuzzy significance level. In addition, a fuzzy measure of association is introduced to evaluate the strength of the relationship between the two variables. A numerical example is provided in Section IV to clarify the discussion. A brief conclusion is given in Section V.

II. FUZZY NUMBERS

A fuzzy set $\tilde{A}$ of the universal set $X$ is defined by its membership function $\tilde{A} : X \to [0, 1]$, with the set $\mathrm{supp}(\tilde{A}) = \{x \in X : \tilde{A}(x) > 0\}$, the support of $\tilde{A}$. We say $\tilde{A}$ is a normal fuzzy set if there exists at least one element $x \in X$ such that $\tilde{A}(x) = 1$. In this work, we consider $\mathbb{R}$ (the real line) as the universal set. We denote by $\tilde{A}_\alpha$ the $\alpha$-cut of the fuzzy set $\tilde{A}$ of $\mathbb{R}$, defined for every $\alpha \in (0, 1]$ by $\tilde{A}_\alpha = \{x \in \mathbb{R} : \tilde{A}(x) \ge \alpha\}$, and $\tilde{A}_0$ is the closure of $\mathrm{supp}(\tilde{A})$.

The fuzzy set $\tilde{A}$ of $\mathbb{R}$ is called a fuzzy number if it is normal and, for every $\alpha \in (0, 1]$, the set $\tilde{A}_\alpha$ is a compact interval. Such an interval will be denoted by $\tilde{A}_\alpha = [\tilde{A}^L_\alpha, \tilde{A}^U_\alpha]$, where $\tilde{A}^L_\alpha = \inf\{x : x \in \tilde{A}_\alpha\}$ and $\tilde{A}^U_\alpha = \sup\{x : x \in \tilde{A}_\alpha\}$.

One of the popular forms of a fuzzy number, to be considered in this work, is the so-called trapezoidal fuzzy number $\tilde{A} = (A^l, A^c, A^s, A^r)_T$, whose membership function is given by

$$\tilde{A}(x) = \begin{cases} 0, & x < A^l, \\ \dfrac{x - A^l}{A^c - A^l}, & A^l \le x < A^c, \\ 1, & A^c \le x < A^s, \\ \dfrac{A^r - x}{A^r - A^s}, & A^s \le x \le A^r, \\ 0, & x > A^r, \end{cases} \qquad \forall x \in \mathbb{R}.$$
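The piecewise definition above translates directly into code. The following Python sketch is illustrative only: the class name `Trapezoid`, its methods, and the helper `triangular` are hypothetical and not part of the original report. It evaluates the membership function of a trapezoidal fuzzy number and returns its $\alpha$-cut, which for this shape is the interval $[A^l + \alpha(A^c - A^l),\ A^r - \alpha(A^r - A^s)]$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trapezoid:
    """Trapezoidal fuzzy number (l, c, s, r) with l <= c <= s <= r."""
    l: float
    c: float
    s: float
    r: float

    def membership(self, x: float) -> float:
        """Degree of membership of the real number x."""
        if x < self.l or x > self.r:
            return 0.0
        if x < self.c:                      # rising edge, l <= x < c
            return (x - self.l) / (self.c - self.l)
        if x <= self.s:                     # plateau, c <= x <= s
            return 1.0
        return (self.r - x) / (self.r - self.s)   # falling edge, s < x <= r

    def alpha_cut(self, alpha: float) -> tuple[float, float]:
        """Closed interval [A^L_alpha, A^U_alpha] for 0 <= alpha <= 1."""
        if not 0.0 <= alpha <= 1.0:
            raise ValueError("alpha must lie in [0, 1]")
        lower = self.l + alpha * (self.c - self.l)
        upper = self.r - alpha * (self.r - self.s)
        return lower, upper


def triangular(l: float, c: float, r: float) -> Trapezoid:
    """Triangular fuzzy number, the special case in which c equals s."""
    return Trapezoid(l, c, c, r)


# Hypothetical shape, chosen only for illustration.
a = Trapezoid(1.0, 2.0, 3.0, 5.0)
print(a.membership(1.5), a.membership(4.0))   # 0.5 0.5
print(a.alpha_cut(0.5))                       # (1.5, 4.0)
```

At $\alpha = 0$ the method returns $[A^l, A^r]$, the closure of the support, in line with the convention for $\tilde{A}_0$ stated above.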

If $A^c = A^s$, it is called a triangular fuzzy number and is denoted by $\tilde{A} = (A^l, A^c, A^r)_T$. For more on fuzzy numbers, see [13].

III. CONTINGENCY TABLE WITH FUZZY CATEGORIES

In this section, we provide an approach for analyzing a two-way contingency table with fuzzy categories $T = \{\tilde{t}_1, \tilde{t}_2, \ldots, \tilde{t}_r\}$ and $S = \{\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_c\}$, based on a sample of crisp observations $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ (briefly: a two-way contingency table with fuzzy categories). In the following, we introduce a method to test the null hypothesis of independence for such cases.

A. Fuzzy Chi-Square test statistic

To explain the motivation for our proposed method, let us look at the construction of a contingency table with crisp categories. Consider an ordinary two-way contingency table with crisp categories $T = \{t_1, t_2, \ldots, t_r\}$ and $S = \{s_1, s_2, \ldots, s_c\}$. Set

$$I^i_k = \begin{cases} 1, & t_i = x_k, \\ 0, & t_i \ne x_k, \end{cases} \qquad I^j_k = \begin{cases} 1, & s_j = y_k, \\ 0, & s_j \ne y_k. \end{cases}$$

Therefore, an observation $(x_k, y_k)$, $k = 1, 2, \ldots, N$, belongs to the cell $ij$ if $\min[I^i_k, I^j_k] = 1$. Now, if we consider a two-way contingency table with fuzzy categories $T = \{\tilde{t}_1, \tilde{t}_2, \ldots, \tilde{t}_r\}$ and $S = \{\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_c\}$, then it is natural to allocate the observation $(x_k, y_k)$ to the cell $ij$, at level $\alpha$, when $\min[\tilde{t}_i(x_k), \tilde{s}_j(y_k)] \ge \alpha$. So, we can develop a two-way contingency table in this case in the following way.

Definition 3.1: In a two-way contingency table with fuzzy categories $T = \{\tilde{t}_1, \tilde{t}_2, \ldots, \tilde{t}_r\}$ and $S = \{\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_c\}$, the fuzzy frequencies $\tilde{f}_{ij}$, $i = 1, 2, \ldots, r$, $j = 1, 2, \ldots, c$, are defined to be the fuzzy sets with the degree of membership at $f \in \{0, 1, \ldots, N\}$ as follows:

$$\tilde{f}_{ij}(f) = \sup\Big\{\alpha \in [0, 1] : \sum_{k=1}^{N} I\big(\min[\tilde{t}_i(x_k), \tilde{s}_j(y_k)] \ge \alpha\big) = f\Big\},$$

where $I$ is the indicator function,

$$I(\rho) = \begin{cases} 1 & \text{if } \rho \text{ is true}, \\ 0 & \text{if } \rho \text{ is false}. \end{cases}$$

Definition 3.2: In a two-way contingency table with fuzzy categories, the fuzzy Chi-Square test statistic is defined to be a fuzzy set $\tilde{Q}$ with the $\alpha$-cuts $\tilde{Q}[\alpha] = [\tilde{Q}^L_\alpha, \tilde{Q}^U_\alpha]$, where

$$\tilde{Q}^L_\alpha = \inf_{S_\alpha} Q, \qquad \tilde{Q}^U_\alpha = \sup_{S_\alpha} Q, \qquad 0 < \alpha \le 1,$$

in which

$$S_\alpha = \{z_{ij} : z_{ij} \in \tilde{f}_{ij}[\alpha],\ i = 1, 2, \ldots, r,\ j = 1, 2, \ldots, c\},$$

and

$$Q = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(z_{\cdot\cdot}\, z_{ij} - z_{i\cdot}\, z_{\cdot j})^2}{z_{\cdot\cdot}\, z_{i\cdot}\, z_{\cdot j}},$$

where $\tilde{f}_{ij}[\alpha]$ denotes the $\alpha$-cut of the fuzzy frequency for the $ij$th cell, and

$$z_{i\cdot} = \sum_{j=1}^{c} z_{ij}, \qquad z_{\cdot j} = \sum_{i=1}^{r} z_{ij}, \qquad z_{\cdot\cdot} = \sum_{i=1}^{r} \sum_{j=1}^{c} z_{ij}.$$

Remark 3.1: Based on the Resolution Identity [13], it is easily concluded that the family of closed intervals $\tilde{Q}[\alpha]$, $\alpha \in (0, 1]$, constitutes a fuzzy number on $[0, +\infty)$.

Remark 3.2: If the fuzzy categories reduce to crisp categories, then the fuzzy frequency and the fuzzy Chi-Square test statistic reduce to the classical frequency and the classical Chi-Square test statistic, respectively.
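Definitions 3.1 and 3.2 can be made concrete with a small brute-force sketch. The Python code below is an illustration under stated assumptions, not the authors' implementation: the function names `fuzzy_frequency`, `alpha_cut`, `chi_square` and `chi_square_cut` are hypothetical, the demonstration table is invented, and the $\alpha$-cut of $\tilde{Q}$ is found by enumerating every combination of cell frequencies allowed at level $\alpha$, which is feasible only for small tables.

```python
from itertools import product


def fuzzy_frequency(degrees):
    """Fuzzy frequency of one cell (Definition 3.1).

    `degrees` holds min(t_i(x_k), s_j(y_k)) for every observation k.
    Returns {f: membership}; frequencies with membership 0 are left out.
    """
    d = sorted(degrees, reverse=True)
    n = len(d)
    result = {}
    for f in range(n + 1):
        # Within [0, 1], the count equals f exactly for alpha in (lower, upper].
        upper = 1.0 if f == 0 else d[f - 1]
        lower = d[f] if f < n else -1.0
        if upper > lower and upper > 0.0:
            result[f] = upper
    return result


def alpha_cut(fuzzy_freq, alpha):
    """Frequencies whose membership degree is at least alpha."""
    return sorted(f for f, mu in fuzzy_freq.items() if mu >= alpha)


def chi_square(table):
    """Statistic Q of Definition 3.2 for one crisp table (list of rows)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    q = 0.0
    for i, row in enumerate(table):
        for j, z in enumerate(row):
            q += (total * z - rows[i] * cols[j]) ** 2 / (total * rows[i] * cols[j])
    return q


def chi_square_cut(fuzzy_table, alpha):
    """[Q^L, Q^U] at level alpha, by enumerating all tables in S_alpha."""
    n_cols = len(fuzzy_table[0])
    cuts = [alpha_cut(cell, alpha) for row in fuzzy_table for cell in row]
    values = []
    for flat in product(*cuts):
        table = [list(flat[k:k + n_cols]) for k in range(0, len(flat), n_cols)]
        # Assumption of this sketch: tables with an empty row or column,
        # for which Q is undefined, are skipped.
        if all(sum(r) > 0 for r in table) and all(sum(c) > 0 for c in zip(*table)):
            values.append(chi_square(table))
    return min(values), max(values)


# Hypothetical 2 x 2 table: three crisp cells and one fuzzy cell.
cell = fuzzy_frequency([1.0] * 10 + [0.5])          # {10: 1.0, 11: 0.5}
table = [[cell, {20: 1.0}], [{20: 1.0}, {10: 1.0}]]
print(chi_square_cut(table, 1.0))   # degenerate interval: only 10 is in the cut
print(chi_square_cut(table, 0.5))   # wider interval: 10 and 11 are both allowed
```

When every degree is 0 or 1, `fuzzy_frequency` returns a single frequency with membership 1, which is the crisp case described in Remark 3.2.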

B. Fuzzy $\tilde{p}$-value

Definition 3.3: In the problem of testing the hypothesis of independence in a two-way contingency table with fuzzy categories, the fuzzy $\tilde{p}$-value is defined to be a fuzzy set with the following $\alpha$-cuts:

$$\tilde{p}\text{-value}[\alpha] = \Big[\inf_{q \in \tilde{Q}[\alpha]} P_{H_0}\big(\chi^2_{(r-1)(c-1)} \ge q\big),\ \sup_{q \in \tilde{Q}[\alpha]} P_{H_0}\big(\chi^2_{(r-1)(c-1)} \ge q\big)\Big] = \Big[P_{H_0}\big(\chi^2_{(r-1)(c-1)} \ge \tilde{Q}^U_\alpha\big),\ P_{H_0}\big(\chi^2_{(r-1)(c-1)} \ge \tilde{Q}^L_\alpha\big)\Big],$$

where $\chi^2_{(r-1)(c-1)}$ denotes the Chi-Square distribution with $(r-1)(c-1)$ degrees of freedom.

Remark 3.3: Based on the Resolution Identity, one can conclude that the family of closed intervals $\tilde{p}\text{-value}[\alpha]$, $\alpha \in (0, 1]$, constitutes a fuzzy number on $[0, 1]$.

Remark 3.4: If the fuzzy categories reduce to crisp categories, then the fuzzy $\tilde{p}$-value reduces to the classical p-value.

C. Method of decision making

Finally, a decision is made by comparing the observed p-value and the given significance level. Since the p-value is defined as a fuzzy set, it is natural to consider the significance level as a fuzzy set, too. In addition, we need a method for comparing the obtained fuzzy $\tilde{p}$-value and the given fuzzy significance level. Here, we recall a definition of the fuzzy significance level [9], and a method of ranking fuzzy numbers called the necessity index of strict dominance (NSD index), suggested by Dubois and Prade [3] and employed in [2], [6] in some problems of statistics.

Definition 3.4: A fuzzy significance level is any fuzzy set $\tilde{\delta}$ on $(0, 1)$.

Definition 3.5: For two fuzzy numbers $\tilde{A}$ and $\tilde{B}$, we can evaluate the degree of necessity to which the relation $\tilde{A} \succ \tilde{B}$ is fulfilled by

$$\mathrm{Nec}(\tilde{A} \succ \tilde{B}) = 1 - \sup_{x, y;\ x \le y} \min\{\tilde{A}(x), \tilde{B}(y)\}.$$

In addition, the degree of possibility to which the relation $\tilde{A} \preceq \tilde{B}$ is fulfilled is defined to be $\mathrm{Pos}(\tilde{A} \preceq \tilde{B}) = 1 - \mathrm{Nec}(\tilde{A} \succ \tilde{B})$.

Definition 3.6: Consider the problem of testing the null hypothesis $H_0$ of independence in a two-way contingency table with fuzzy categories. Then, $\mathrm{Nec}(\tilde{\delta} \succ \tilde{p}\text{-value})$ is called the necessity degree that $H_0$ is rejected. Also, $\mathrm{Pos}(\tilde{\delta} \preceq \tilde{p}\text{-value})$ is called the possibility degree at which $H_0$ would not be rejected.

D. A fuzzy measure of association

In this section, we extend a common measure of association (i.e., the contingency coefficient $C$ [1], [4]) to the fuzzy environment in order to evaluate the relationship between the underlying variables in contingency tables with fuzzy categories.

Definition 3.7: In a two-way contingency table with fuzzy categories, the fuzzy contingency coefficient is defined to be a fuzzy set $\tilde{C}$ with the following $\alpha$-cuts:

$$\tilde{C}[\alpha] = [\tilde{C}^L_\alpha, \tilde{C}^U_\alpha],$$

where

$$\tilde{C}^L_\alpha = \inf_{S_\alpha} \sqrt{\frac{Q}{z_{\cdot\cdot} + Q}}, \qquad \tilde{C}^U_\alpha = \sup_{S_\alpha} \sqrt{\frac{Q}{z_{\cdot\cdot} + Q}}, \qquad 0 < \alpha \le 1.$$

Remark 3.5: It is easy to verify that the family of closed intervals $\tilde{C}[\alpha]$, $\alpha \in (0, 1]$, constitutes a fuzzy number on $[0, 1]$.

Fig. 1. Fuzzy categories for income in Example 4.1 (very low, low, moderate, high, very high; horizontal axis: value of income; vertical axis: degree of membership).

Fig. 2. Fuzzy categories for energy consumption in Example 4.1 (low, moderate, high; horizontal axis: consumption (in unit); vertical axis: degree of membership).

IV. NUMERICAL EXAMPLE
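Before turning to the data, the following Python sketch shows one way the quantities of Definitions 3.3 to 3.6 could be evaluated numerically. It is an illustration under stated assumptions rather than the authors' procedure: all function names are hypothetical, the Chi-Square tail probability uses a closed form that holds only for an even number of degrees of freedom (as in a $5 \times 3$ table, where $(r-1)(c-1) = 8$), the necessity index is approximated on a finite grid, and the fuzzy $\tilde{p}$-value in the demonstration is an invented triangular number, not the one obtained in Example 4.1.

```python
import math


def chi2_sf_even(q, df):
    """P(chi2_df >= q) for even df, via the Poisson-sum closed form."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only covers positive even df")
    if q <= 0:
        return 1.0
    half = q / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # (q/2)^i / i!
        total += term
    return math.exp(-half) * total


def p_value_cut(q_lower, q_upper, df):
    """alpha-cut of the fuzzy p-value from the alpha-cut [Q^L, Q^U] of Q."""
    # The tail probability decreases in q, so the end points swap.
    return chi2_sf_even(q_upper, df), chi2_sf_even(q_lower, df)


def triangular(l, c, r):
    """Membership function of the triangular fuzzy number (l, c, r), l < c < r."""
    def mu(x):
        if x <= l or x >= r:
            return 0.0
        return (x - l) / (c - l) if x <= c else (r - x) / (r - c)
    return mu


def necessity_strict_dominance(mu_a, mu_b, grid):
    """Nec(A > B) = 1 - sup over x <= y of min(A(x), B(y)), on an ascending grid."""
    best = 0.0
    running_max_a = 0.0           # sup of A(x) over grid points x <= y
    for y in grid:
        running_max_a = max(running_max_a, mu_a(y))
        best = max(best, min(running_max_a, mu_b(y)))
    return 1.0 - best


delta = triangular(0.05, 0.10, 0.15)     # significance level "about 0.10"
p_hyp = triangular(0.02, 0.06, 0.10)     # hypothetical fuzzy p-value
grid = [k / 10000 for k in range(0, 2001)]
nec = necessity_strict_dominance(delta, p_hyp, grid)
print(nec, 1.0 - nec)            # necessity of rejecting, possibility of not rejecting
print(p_value_cut(12.0, 16.0, 8))
```

For odd degrees of freedom a general Chi-Square survival function from a statistics library would be needed in place of `chi2_sf_even`.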

To demonstrate the application of our method, we provide an example.

Example 4.1: (See also [14]) A study is designed to investigate the relationship between income level and energy consumption among households. The data collected from 50 households are shown in Table I. Five categories are considered for income level: "very low", "low", "moderate", "high", "very high", and three categories are considered for energy consumption: "low", "moderate", "high". In this example, therefore, we deal with a two-way $5 \times 3$ contingency table with the fuzzy categories $T = \{\tilde{t}_1 = \text{``very low''}, \tilde{t}_2 = \text{``low''}, \tilde{t}_3 = \text{``moderate''}, \tilde{t}_4 = \text{``high''}, \tilde{t}_5 = \text{``very high''}\}$ for income, and $S = \{\tilde{s}_1 = \text{``low''}, \tilde{s}_2 = \text{``moderate''}, \tilde{s}_3 = \text{``high''}\}$ for energy consumption. The related membership functions are shown in Fig. 1 and Fig. 2. For instance, assume that the income and energy consumption of a household are 3300 (\$) and 277 (kWh), respectively. For this case, we have $\tilde{t}_1(3300) = 0$, $\tilde{t}_2(3300) = 0$, $\tilde{t}_3(3300) = 0.47$, $\tilde{t}_4(3300) = 0.53$, $\tilde{t}_5(3300) = 0$, and $\tilde{s}_1(277) = 0$, $\tilde{s}_2(277) = 1$, $\tilde{s}_3(277) = 0$. To construct the related contingency table, we first have to derive the fuzzy frequencies of each cell. By employing Definition 3.1, the related contingency table is obtained as shown in Table II, where (each element is written as membership degree over frequency)

$\tilde{f}_{11} = \{\tfrac{1}{2}, \tfrac{0.78}{3}\}$,
$\tilde{f}_{12} = \{\tfrac{1}{1}, \tfrac{0.21}{2}\}$,
$\tilde{f}_{13} = \{\tfrac{1}{0}, \tfrac{0.94}{1}, \tfrac{0.85}{2}\}$,
$\tilde{f}_{21} = \{\tfrac{1}{5}, \tfrac{0.75}{6}, \tfrac{0.1}{7}\}$,
$\tilde{f}_{22} = \{\tfrac{1}{3}, \tfrac{0.1}{5}\}$,
$\tilde{f}_{23} = \{\tfrac{1}{1}, \tfrac{0.94}{2}, \tfrac{0.8}{3}, \tfrac{0.15}{4}, \tfrac{0.03}{5}\}$,
$\tilde{f}_{31} = \{\tfrac{1}{2}, \tfrac{0.25}{3}\}$,
$\tilde{f}_{32} = \{\tfrac{1}{6}, \tfrac{0.9}{7}, \tfrac{0.46}{8}\}$,
$\tilde{f}_{33} = \{\tfrac{1}{2}, \tfrac{0.2}{3}, \tfrac{0.16}{4}, \tfrac{0.05}{5}\}$,
$\tilde{f}_{41} = \{\tfrac{1}{0}, \tfrac{0.55}{1}\}$,
$\tilde{f}_{42} = \{\tfrac{1}{2}, \tfrac{0.65}{3}, \tfrac{0.53}{4}\}$,
$\tilde{f}_{43} = \{\tfrac{1}{5}, \tfrac{0.83}{6}\}$,
$\tilde{f}_{51} = \{\tfrac{1}{0}, \tfrac{0.63}{1}, \tfrac{0.45}{2}\}$,
$\tilde{f}_{52} = \{\tfrac{1}{3}, \tfrac{0.36}{4}, \tfrac{0.32}{5}, \tfrac{0.13}{6}\}$,
$\tilde{f}_{53} = \{\tfrac{1}{5}, \tfrac{0.85}{6}\}$.

TABLE I
THE DATA SET IN EXAMPLE 4.1 (In.: income; Sat.: energy consumption)

No.  In.   Sat.    No.  In.   Sat.    No.  In.   Sat.    No.  In.   Sat.
1    430   41      2    470   38      3    600   93      4    450   274
5    550   584     6    650   539     7    1500  46      8    1500  39
9    1500  36      10   1500  48      11   1500  47      12   1500  283
13   1500  294     14   1500  258     15   1750  39      16   1550  567
17   1500  581     18   1700  541     19   2500  46      20   2500  41
21   2400  269     22   2500  256     23   2500  263     24   2500  277
25   2500  284     26   2500  293     27   2500  288     28   2500  608
29   2500  576     30   3300  277     31   4000  269     32   4000  288
33   4450  47      34   3750  543     35   4000  530     36   4000  602
37   4000  686     38   4000  574     39   4000  669     40   4350  277
41   5250  277     42   5800  289     43   6400  293     44   6350  123
45   5750  473     46   6450  536     47   5700  672     48   5600  628
49   6500  573     50   5550  663     –    –     –       –    –     –

TABLE II
THE TWO-WAY CONTINGENCY TABLE IN EXAMPLE 4.1

                     Energy consumption (S)
Income level (T)     Low               Moderate          High
Very low             $\tilde{f}_{11}$  $\tilde{f}_{12}$  $\tilde{f}_{13}$
Low                  $\tilde{f}_{21}$  $\tilde{f}_{22}$  $\tilde{f}_{23}$
Moderate             $\tilde{f}_{31}$  $\tilde{f}_{32}$  $\tilde{f}_{33}$
High                 $\tilde{f}_{41}$  $\tilde{f}_{42}$  $\tilde{f}_{43}$
Very high            $\tilde{f}_{51}$  $\tilde{f}_{52}$  $\tilde{f}_{53}$

Now, suppose that we wish to test the null hypothesis that income and energy consumption are independent at the significance level "about 0.10", which is represented by the triangular fuzzy number $\tilde{\delta} = (0.05, 0.10, 0.15)_T$ (Fig. 3). Based on the definitions in Subsections III-A and III-B, and by using computational procedures in MATLAB (optimization over a set of alternatives), the membership function of the fuzzy $\tilde{p}$-value is obtained as drawn in Fig. 3 (which can be interpreted as "about 0.06"). By comparing the observed fuzzy $\tilde{p}$-value and the fuzzy significance level, we see that $\mathrm{Nec}(\tilde{\delta} \succ \tilde{p}\text{-value}) = 0.2$ and $\mathrm{Pos}(\tilde{\delta} \preceq \tilde{p}\text{-value}) = 0.8$. Therefore, we reject the hypothesis of independence with a necessity degree of 0.2.

Fig. 3. Fuzzy $\tilde{p}$-value and fuzzy significance level $\tilde{\delta}$ in Example 4.1 (vertical axis: degree of membership).

Example 4.2: In Example 4.1, the fuzzy contingency coefficient $\tilde{C}$ for evaluating the strength of the relationship between the fuzzy-categorized variables income and energy consumption is calculated to be "about 0.57", as shown in Fig. 4.

Fig. 4. Fuzzy contingency coefficient $\tilde{C}$ in Example 4.2 (horizontal axis: value of contingency coefficient $C$; vertical axis: degree of membership).

V. CONCLUSION

A new method for testing independence in a two-way contingency table with fuzzy categories was developed. For this purpose, we introduced fuzzy versions of the Chi-Square test statistic and of the p-value. To evaluate the independence hypothesis, we use a common index to compare the observed fuzzy $\tilde{p}$-value with the given fuzzy significance level. We also introduced a fuzzy measure of association for evaluating the relationship between the two underlying variables in such contingency tables.

ACKNOWLEDGMENT

Parts of this work were completed during the first author's stay in the Department of Statistics and Probability Theory of the Vienna University of Technology. He is grateful for the hospitality of this department. The first and second authors are grateful to the Isfahan University of Technology.

REFERENCES

[1] A. Agresti, Categorical Data Analysis. Second Edition, J. Wiley & Sons, New Jersey, 2002.
[2] T. Denoeux, M. H. Masson, and P. H. Herbert, Non-parametric rank-based statistics and significance tests for fuzzy data. Fuzzy Sets and Systems, 153, 1–28, 2005.
[3] D. Dubois and H. Prade, Ranking of fuzzy numbers in the setting of possibility theory. Information Sciences, 30, 183–224, 1983.
[4] J. D. Gibbons and S. Chakraborti, Non-parametric Statistical Inference. Fourth Edition, Marcel Dekker, New York, 2003.
[5] P. Grzegorzewski, Statistical inference about the median from vague data. Control and Cybernetics, 27, 447–464, 1998.
[6] P. Grzegorzewski, Testing fuzzy hypotheses with vague data. In: Bertoluzza, C., et al. (Eds.), Statistical modeling, analysis, and management of fuzzy data. Springer, Heidelberg, 213–225, 2002.
[7] P. Grzegorzewski, Distribution-free tests for vague data. In: Lopez-Diaz, M., et al. (Eds.), Soft Methodology and Random Information Systems. Springer, Heidelberg, 495–502, 2004.
[8] P. Grzegorzewski, K-sample median test for vague data. International Journal of Intelligent Systems, 24, 529–539, 2009.
[9] M. Holena, Fuzzy hypotheses testing in a framework of fuzzy logic. Fuzzy Sets and Systems, 145, 229–252, 2004.
[10] O. Hryniewicz, Goodman-Kruskal measure of dependence for fuzzy ordered categorical data. Computational Statistics and Data Analysis, 51, 323–334, 2006.
[11] C. Kahraman, C. F. Bozdag, and D. Ruan, Fuzzy sets approaches to statistical parametric and non-parametric tests. International Journal of Intelligent Systems, 19, 1069–1078, 2004.
[12] R. Kruse and K. D. Meyer, Statistics with Vague Data. Reidel Publishing, New York, 1987.
[13] K. H. Lee, First Course on Fuzzy Theory and Applications. Springer, Heidelberg, 2005.
[14] P. Rajagopalan, Selected Statistical Tests. New Age International (P) Ltd., New Delhi, 2006.
[15] R. Viertl, Statistical Methods for Non-Precise Data. CRC Press, Boca Raton, 1996.