APRIORI algorithm based medical data mining for frequent disease identification

Author: Miranda Small
IPASJ International Journal of Information Technology (IIJIT), A Publisher for Research Motivation
Web Site: http://www.ipasj.org/IIJIT/IIJIT.htm, ISSN 2321-5976
Volume 2, Issue 4, April 2014

Gitanjali J, C. Ranichandra, M. Pounambal
School of Information Technology and Engineering, VIT University, Vellore-632014, Tamil Nadu, India

Abstract
Data mining is the analysis of large data sets from various perspectives to obtain a summary of useful information. This information can be turned into knowledge about history and future trends. Data mining plays a very important role in the information technology domain. The health care sector today generates huge amounts of complex data, including details about diseases, patients, diagnosis methods, electronic patient records, hospital resources, etc. Data mining methods are very helpful in making medical decisions for curing diseases. The vast data collected by the healthcare industry are not mined, so the information they contain remains hidden and decision making is not effective. The knowledge discovered can be used by healthcare administrators to enhance service quality. In this paper, a method is proposed for identifying the frequency of diseases in a particular geographical location for a given period of time, using the Apriori data mining technique based on association rules.

Keywords: KDD, Bayesian classification, Genetic Algorithm.

1. INTRODUCTION
Data mining means the extraction of knowledge from large amounts of data; it is also called knowledge mining. Several other terms carry a similar meaning, such as knowledge extraction, data archaeology, and data/pattern analysis. Another widely used term is knowledge discovery from data, or KDD. Decision making is supported by converting data into knowledge, and this process is called knowledge discovery. Knowledge discovery is an iterative sequence of the following steps:
1. Data cleaning: inconsistent data and noise are removed.
2. Data integration: multiple data sources are combined.
3. Data selection: the data relevant to the analysis task are extracted from the database.
4. Data transformation: the data are transformed into other forms suitable for mining.
5. Data mining: intelligent methods are applied to extract data patterns.
6. Pattern evaluation: the interesting patterns are identified according to some measures.
7. Knowledge presentation: visualization and knowledge representation techniques are used to present the knowledge.
Data mining includes functions such as classification, association, clustering, and prediction. Advanced data mining techniques are used to discover relationships and hidden patterns. Mining association rules is one of the important data mining applications. Association rules have been used since 1993 to identify relationships among item sets in databases; these relationships are not inherent properties of the data. In the medical field, association rules can be used to find the most frequently occurring diseases in different geographical locations during a given time period. Hence medical data are analyzed in this research work.

2. LITERATURE REVIEW
Jyothi Soni et al. [1] provided a survey of recent techniques for predicting heart disease using data mining techniques of knowledge discovery. Many experiments were conducted to compare performance and determine outcomes. The survey reveals that, in terms of accuracy, Bayesian classification gives results similar to those of the decision tree, and that both perform well compared with other methods such as neural networks and classification based on clustering. The decision tree algorithm and Bayesian classification are further improved by applying a genetic algorithm, which reduces the actual data size to obtain an optimal data set that is useful in predicting heart disease.
Carlos Ordonez [2] studied how to constrain association rules in order to predict heart disease. He proposed three measures to decrease the number of patterns. First, attributes are required to appear on only one side of the rule. Second, the attributes are divided into uninteresting groups. Third, the number of rules applied is reduced.
Maria-Luiza Antonie et al. [3] investigated different data mining techniques, namely association rule mining and neural networks, for the detection of tumors in digital mammography. Both techniques performed well in terms of accuracy, which gave 70%


3. APRIORI ALGORITHM
The Apriori algorithm is used for finding frequent item sets. It was proposed by R. Agrawal and R. Srikant in 1994. The algorithm is named Apriori because it uses prior knowledge of frequent item sets. The input is the data set D together with min_sup, the minimum support count threshold. The output is L, the frequent item sets in D. The procedure is as follows:
Step 1: Scan D for the count of each candidate and generate C1. List all the candidate item sets in C1 with their corresponding support counts.
Step 2: Compare each candidate support count with the minimum support count to obtain L1. Generate the candidates C2 from L1.
Step 3: Scan D for the count of each candidate in C2. Compare each candidate support count with the minimum support count and list the remaining item sets in L2.
This process is continued until no further frequent item sets are found.
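The quantities used in these steps can be written compactly. The notation below is a standard formulation and an assumption of this sketch, not taken from the paper: $\sigma(X)$ is the support count of an item set $X$ in $D$, and $C_k$ and $L_k$ are the candidate and frequent item sets of size $k$.

```latex
\sigma(X) = \left|\{\, t \in D : X \subseteq t \,\}\right|,
\qquad
L_k = \{\, X \in C_k : \sigma(X) \ge \mathit{min\_sup} \,\}

% Apriori property: every subset of a frequent item set is itself frequent
X \subseteq Y \;\Longrightarrow\; \sigma(X) \ge \sigma(Y)
```

The second statement is the prior knowledge that gives the algorithm its name: a candidate of size $k+1$ can be discarded without scanning $D$ if any of its subsets of size $k$ is missing from $L_k$.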

4. PROPOSED WORK
The Apriori algorithm can be used for mining disease occurrence details for a specific time range.

Algorithm
Mn : candidate medical data item sets of size n
Fn : frequent item sets of size n
F1 = {frequent items};
for (n = 1; Fn != ∅; n++) do begin
    Mn+1 = candidate medical data item sets derived from Fn;
    for each transaction t in the database do
        increment the count of all the medical data item sets in Mn+1 that are in t;
    Fn+1 = medical data item sets in Mn+1 with min_support;
end
return ∪n Fn;

Result of the Research
The proposed approach is very helpful in identifying the frequently occurring diseases in a large body of medical data. As a result, practitioners can make medical conclusions and decisions regarding frequent diseases accurately. Data for analysis are obtained from different geographical areas during various time ranges.
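The pseudocode in this section could be written in Python roughly as follows. This is a minimal sketch rather than the authors' implementation: the function name `apriori`, the `min_count` parameter, and the toy transactions are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {item set: support count} for every item set with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the database: count the transactions containing each candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: k for c, k in counts.items() if k >= min_count}

    # F1: frequent single items.
    frequent = count({frozenset([i]) for t in transactions for i in t})
    result = dict(frequent)
    n = 1
    while frequent:
        prev = set(frequent)
        # M(n+1): join frequent n-item sets whose union has exactly n+1 items ...
        candidates = {a | b for a in prev for b in prev if len(a | b) == n + 1}
        # ... and prune candidates that have an infrequent n-item subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, n))}
        frequent = count(candidates)
        result.update(frequent)
        n += 1
    return result

# Hypothetical toy data: each transaction is the set of conditions seen in one month.
months = [
    {"heart disease", "smoking", "overweight"},
    {"heart disease", "smoking"},
    {"smoking", "insomnia"},
    {"heart disease", "smoking", "overweight"},
]
frequent_sets = apriori(months, min_count=2)
```

The pruning step is what distinguishes Apriori from brute-force counting: a candidate is only counted against the database if all of its smaller subsets were already found frequent.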

5. EXPERIMENTAL RESULT
This research uses a data set containing the electronic medical details of different patients. These include the patient's name, disease name, age, sex, date, address, etc., for a particular year. Fig. 1 shows a bar graph of the number of diseases affecting the patients monthly. Fig. 2 depicts the number of patients affected by various diseases monthly. It shows that in a particular month some patients are affected by the same disease.
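One way such patient records might be turned into the monthly series shown in the figures, and into one transaction per month for Apriori, is sketched below. The record layout, the patient labels, and the dates are hypothetical and invented for illustration; the paper does not describe its preprocessing.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records for illustration only: (patient, disease, visit date).
records = [
    ("patient A", "Asthma", date(2013, 1, 5)),
    ("patient B", "Asthma", date(2013, 1, 20)),
    ("patient C", "Jaundice", date(2013, 1, 22)),
    ("patient A", "Insomnia", date(2013, 2, 3)),
]

diseases_by_month = defaultdict(set)   # month -> distinct diseases seen
patients_by_month = defaultdict(set)   # month -> distinct patients seen

for patient, disease, visit in records:
    diseases_by_month[visit.month].add(disease)
    patients_by_month[visit.month].add(patient)

# Series of the kind plotted in Fig. 1 and Fig. 2.
disease_counts = {m: len(d) for m, d in sorted(diseases_by_month.items())}
patient_counts = {m: len(p) for m, p in sorted(patients_by_month.items())}

# Assumption: one transaction per month, listing the diseases observed in it.
transactions = [diseases_by_month[m] for m in sorted(diseases_by_month)]
```

Treating each month as one transaction is an assumption that matches the 12 monthly instances reported in the run summary.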

[Figure 1: bar graph; vertical axis "no of diseases" (0 to 14), horizontal axis months (Jan to Nov)]
Fig. 1. Number of diseases affecting the patients monthly


[Figure 2: vertical axis "no of patients" (0 to 160), horizontal axis months (Jan to Nov)]
Fig. 2. Number of patients affected by various diseases monthly

Relation: Disease
Instances: 12 (January, February, March, April, May, June, July, August, September, October, November, December)

Apriori
Minimum support: 0.35 (4 instances)
Minimum metric: 0.9
Number of cycles performed: 13

Attributes: 29
AIDS, Allergies, Heart disease, Asthma, HIV, human papilloma virus, hypertension, Impotence, Insomnia, Jaundice, Kidney Disease, Leukemia, Liver cancer, Liver Disease, Lung Cancer, Lupus, Overweight, Eye Disease, Pain, Pertussis, Pregnancy, Raynauds Phenomenon, sexually transmitted diseases, sleep disorders, smoking, stroke, Thrush, Thyroid disorders, Whooping Cough
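The run summary gives the minimum support both as a fraction (0.35) and as a number of instances (4). The short calculation below shows how the two relate for 12 monthly instances. The rounding rule is an assumption, since the paper does not state how the fraction is converted to a count.

```python
min_support = 0.35     # relative minimum support, from the run summary
num_instances = 12     # one instance per month

exact = min_support * num_instances   # 4.2 in exact arithmetic

# Assumed rule: round to the nearest integer (not stated in the paper).
min_count = int(exact + 0.5)
```

Under this assumption an item set must occur in at least 4 of the 12 months to be reported, which agrees with the smallest counts in the item set listings.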

Large item sets L(1): 20
Item set                         Count
Allergies                        4
Heart disease                    7
Jaundice                         5
HIV                              4
Hypertension                     5
Impotence                        6
Thrush                           4
Whooping Cough                   7
Insomnia                         6
sexually transmitted diseases    5
sleep disorders                  5
Smoking                          9
Raynauds Phenomenon              4
Pregnancy                        4
Pain                             7
Overweight                       5
Lung Cancer                      4
Liver Disease                    4

Large item sets L(2): 21
Item set                                  Count
Heart disease, hypertension               4
Heart disease, Insomnia                   5
Heart disease, Kidney Disease             4
Heart disease, Liver Disease              4
Heart disease, Overweight                 5
Heart disease, Pain                       5
Heart disease, smoking                    6
Asthma, Thrush                            4
HIV, smoking                              4
Hypertension, smoking                     4
Impotence, Raynauds Phenomenon            4
Impotence, smoking                        4
Impotence, Whooping Cough                 5
Insomnia, smoking                         5
Liver Disease, Overweight                 4
Liver Disease, smoking                    4
Pain, sexually transmitted diseases       5
Overweight, smoking                       4
Pain, smoking                             5
sexually transmitted diseases, smoking    4
Smoking, Whooping Cough                   5

Large item sets L(3): 7
Item set                                        Count
Heart disease, Insomnia, smoking                4
Heart disease, Liver Disease, Overweight        4
Heart disease, Liver Disease, smoking           4
Heart disease, Overweight, smoking              4
Impotence, smoking, Whooping Cough              4
Liver Disease, Overweight, smoking              4
Pain, sexually transmitted diseases, smoking    4

Large item sets L(4): 1
Item set                                             Count
Heart disease, Liver Disease, Overweight, smoking    4
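The L(2), L(3) and L(4) listings above can be checked against the Apriori property, which requires every subset of a frequent item set to be frequent as well. The sketch below transcribes the reported item sets and tests this for the two largest levels. The helper names `itemset` and `closed_under_subsets` are hypothetical, and names are lowercased so that "Smoking" and "smoking" compare equal.

```python
from itertools import combinations

def itemset(*names):
    # Lowercase so that differently capitalized names compare equal.
    return frozenset(n.lower() for n in names)

# Item sets transcribed from the L(2), L(3) and L(4) listings.
L2 = {itemset(*p) for p in [
    ("Heart disease", "hypertension"), ("Heart disease", "Insomnia"),
    ("Heart disease", "Kidney Disease"), ("Heart disease", "Liver Disease"),
    ("Heart disease", "Overweight"), ("Heart disease", "Pain"),
    ("Heart disease", "smoking"), ("Asthma", "Thrush"), ("HIV", "smoking"),
    ("Hypertension", "smoking"), ("Impotence", "Raynauds Phenomenon"),
    ("Impotence", "smoking"), ("Impotence", "Whooping Cough"),
    ("Insomnia", "smoking"), ("Liver Disease", "Overweight"),
    ("Liver Disease", "smoking"), ("Pain", "sexually transmitted diseases"),
    ("Overweight", "smoking"), ("Pain", "smoking"),
    ("sexually transmitted diseases", "smoking"), ("Smoking", "Whooping Cough"),
]}
L3 = {itemset(*t) for t in [
    ("Heart disease", "Insomnia", "smoking"),
    ("Heart disease", "Liver Disease", "Overweight"),
    ("Heart disease", "Liver Disease", "smoking"),
    ("Heart disease", "Overweight", "smoking"),
    ("Impotence", "smoking", "Whooping Cough"),
    ("Liver Disease", "Overweight", "smoking"),
    ("Pain", "sexually transmitted diseases", "smoking"),
]}
L4 = {itemset("Heart disease", "Liver Disease", "Overweight", "smoking")}

def closed_under_subsets(upper, lower, k):
    """True if every (k-1)-item subset of each k-item set in upper is in lower."""
    return all(frozenset(s) in lower
               for x in upper for s in combinations(x, k - 1))

l3_ok = closed_under_subsets(L3, L2, 3)
l4_ok = closed_under_subsets(L4, L3, 4)
```

Both checks hold for the reported item sets, which is what the level-wise candidate generation of Apriori guarantees.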

6. CONCLUSION
This research work proposes Apriori data mining based on association rules to generate the frequency of diseases affecting patients and the number of patients affected by these diseases. The study covers various geographical areas and various time periods. Existing electronic medical details obtained from hospitals are used as the training data set for analysis. The analysis concluded that patients are frequently affected by 4 different diseases in different geographical areas during a particular year.

References
[1] Jyothi Soni et al., "Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction".
[2] Carlos Ordonez, "Improving Heart Disease Prediction Using Constrained Association Rules".
[3] Maria-Luiza Antonie et al., "Application of Data Mining Techniques for Medical Image Classification".
[4] M. Ilayaraja, T. Meyyappan, "Mining Medical Data to Identify Frequent Diseases using Apriori".
[5] Murugesan K., Md. Rukunuddin Ghalib, Gitanjali J., Indumathi J., Manjula D. (2009), "A pioneering Cryptic Random Projection based approach for Privacy Preserving Data Mining", in Proceedings of the IEEE International Conference on Information Reuse and Integration (IEEE IRI-09), July 10-12, Las Vegas, USA, pp. 437-439.
[6] "Sprouting Modus Operandi for Selection of the Best PPDM Technique for Health Care Domain", International Journal Conference in Recent Trends in Computer Science, Vol. 1, No. 1, pp. 627-629.

AUTHORS
Gitanjali J received her M.Tech in IT Networking from Vellore Institute of Technology, India, in 2008. She works at Vellore Institute of Technology as an Assistant Professor (Senior) and is currently pursuing her PhD at VIT University, Vellore. Her research interests include Security for Data Mining, Networks, Software Engineering and Ontology.
C. Ranichandra is an Assistant Professor (Selection Grade) at VIT University, Vellore, Tamil Nadu, India. She has fourteen years of teaching experience at VIT. She was born in 1975 in Madurai District. She graduated with a B.Tech (CSE) from Vellore Engineering College in 1997 and received her M.Tech (CSE) from VIT University in 2008. She started research work in Grid Databases in 2009 and is currently working on database issues in the Cloud.
M. Pounambal is an Assistant Professor (Selection Grade) in the School of Information Technology and Engineering at VIT University, Vellore, India. She received B.E and M.Tech degrees in the field of Computer Science. Her area of interest includes Wireless Networks.
