Introduction to Concepts of Data Mining By Susan Miertschin

Data Mining Definitions
• Use computer learning techniques to analyze and extract knowledge from data
• Process of automatically discovering useful information in large repositories of data
Purpose
• Identify trends in data
• Identify patterns in data
• Many experts think that most of human learning is pattern recognition

Knowledge Discovery in Databases (KDD)
• Application of the scientific method to data mining processes
• Converts raw data into useful information
• Useful information is in the form of a model: a generalization based on the data
• Data mining is one step of the KDD process

KDD – Input Data
Data may reside in:
• Database
• Spreadsheet
• Flat file
• Other, or multiple, sources

KDD – Data Preprocessing
• Combine data from multiple sources
• Data cleansing: remove duplicates and incomplete instances
• Feature selection: what data features will you consider for the task at hand (the question you are trying to answer)?
• Normalization or denormalization
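
The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration, not the procedure any particular tool follows: the two sources, the field names, and the helper functions `combine`, `cleanse`, and `select_features` are all hypothetical. Joining the two sources into one flat record per customer is treated here as a simple form of denormalization.

```python
# Hypothetical records from two sources (all names and values are illustrative).
database_rows = [
    {"customer_id": 1, "annual_income": 40000, "owns_home": True},
    {"customer_id": 2, "annual_income": 52000, "owns_home": True},
    {"customer_id": 2, "annual_income": 52000, "owns_home": True},  # duplicate
    {"customer_id": 3, "annual_income": None, "owns_home": False},  # incomplete
]
spreadsheet_rows = [
    {"customer_id": 1, "years_on_job": 6, "favorite_color": "blue"},
    {"customer_id": 2, "years_on_job": 4, "favorite_color": "red"},
    {"customer_id": 3, "years_on_job": 10, "favorite_color": "green"},
]


def combine(left, right, key):
    """Join two sources on a shared key, giving one flat record per match."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


def cleanse(rows):
    """Drop exact duplicates and instances that have a missing (None) value."""
    seen, cleaned = set(), []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen or any(value is None for value in row.values()):
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


def select_features(rows, features):
    """Keep only the features considered relevant to the question at hand."""
    return [{name: row[name] for name in features} for row in rows]


# Assumed question: which customers are good credit risks?
FEATURES = ["customer_id", "annual_income", "years_on_job", "owns_home"]
combined = combine(database_rows, spreadsheet_rows, "customer_id")
prepared = select_features(cleanse(combined), FEATURES)
```

Each step is a separate function so that the order (combine, cleanse, select) can be changed or a step replaced without touching the others.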

KDD – Data Mining
• Choose and apply methods and algorithms
• Use software tools

KDD – Post Processing
• Ensure valid and useful results
• Interpret results in business context
• Visualization of results
• Hypothesis testing of statistical results

KDD – Using Information
• Integrate results into decision-making
• “Close the loop”

A Simple Data Mining Process Model
Figure 1.3 A simple data mining process model

Assembling the Data
• May reside in multiple types of repositories
• Relational databases, spreadsheets, flat files
• Data warehouse

The Data Warehouse
• A repository of historical data (as opposed to transactional data), designed for decision support

Inductive Reasoning: Used in Data Mining
Description
• Look at specific examples; generalize from examples
• Premises support the probable truth of the conclusion
• If premises are true, then it is unlikely that the conclusion is false
Examples
• The average high temperature for the month of July in Houston has been in the 90s (F) for every July in recorded history. Therefore, the average high temperature for this coming July in Houston will be in the 90s, as well.
• This account has never made a purchase outside the state of NM. Thus, the recent three purchases in Uzbekistan are suspicious.
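
The account example can be written as a small Python sketch of inductive reasoning: generalize from the purchases seen so far, then treat anything that contradicts the generalization as suspicious. The purchase records, field names, and the function `flag_suspicious` are hypothetical and only illustrate the idea. A real fraud model would weigh far more evidence.

```python
def flag_suspicious(history, new_purchases):
    """Flag new purchases made in a location the account has never used.

    The generalization ("this account only buys in these locations") is
    induced from specific past examples, so the conclusion is probable,
    not certain.
    """
    known_locations = {purchase["location"] for purchase in history}
    return [p for p in new_purchases if p["location"] not in known_locations]


# Hypothetical purchase history: every past purchase was made in NM.
history = [
    {"location": "NM", "amount": 25.00},
    {"location": "NM", "amount": 112.40},
    {"location": "NM", "amount": 8.75},
]

# Hypothetical recent activity: three purchases in Uzbekistan, one in NM.
recent = [
    {"location": "Uzbekistan", "amount": 300.00},
    {"location": "Uzbekistan", "amount": 450.00},
    {"location": "NM", "amount": 19.99},
    {"location": "Uzbekistan", "amount": 120.00},
]

suspicious = flag_suspicious(history, recent)
```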

What Can Be Learned?
• Facts. Fact: a simple truth
• Concepts. Concept: a grouping of objects, symbols, or events based on common characteristics
• Procedures. Procedure: a step-by-step course of action to get something done
• Principles. Principle: a general truth, a law; informs other truths

Computers & Learning
• Computers are good at learning concepts.
• Concepts are the output of a data mining session.

Computer Learning
• Computers can efficiently extract groupings of items based on common characteristics
• Concept structures (used to illustrate and define groupings):
  • Trees
  • Rules
  • Networks
  • Equations

Concepts: From Different Perspectives
• Classical view or perspective: every concept has specific defining properties
• Probabilistic view or perspective: a concept has probabilistic properties that can aid in making good decisions
• Exemplar view or perspective: given an instance, it is an example of a concept if it possesses characteristics sufficiently similar to other examples of the concept

Classical Concept View - Example
IF Annual_Income >= $40K
AND Time_in_Current_Position >= 5 years
AND Owns_Home = True
THEN Good_Credit_Risk = True
(Looks like a RULE)
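
Because the classical view reads like a rule, it translates directly into code. The sketch below restates the slide's rule as a Python function. The function and parameter names are illustrative, and the thresholds are the ones given in the rule above.

```python
def good_credit_risk(annual_income, years_in_current_position, owns_home):
    """Classical view: every defining property must hold for the concept to apply."""
    return bool(
        annual_income >= 40_000
        and years_in_current_position >= 5
        and owns_home
    )


# An applicant who meets every condition is classified as a good credit risk.
meets_all = good_credit_risk(40_000, 5, True)

# Failing any single condition (here, fewer than 5 years) excludes the applicant.
fails_one = good_credit_risk(52_000, 4, True)
```

The all-or-nothing result is the defining trait of the classical view: an applicant is either inside the concept or outside it, with no degrees of membership.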

Probabilistic Concept View - Example
• Median annual income for people who make loan payments on time 90% of the time = $40K
• Mean length of time in current position for people who make loan payments on time 90% of the time = 5.16 years
• Of people who make loan payments on time 90% of the time, 50% own their own home
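
Summary statistics like these could be computed from records of borrowers who pay on time. The sketch below uses Python's standard `statistics` module on a small, made-up sample. The records and field names are hypothetical and are not the data behind the figures on the slide.

```python
import statistics

# Hypothetical borrowers who make loan payments on time 90% of the time.
on_time_payers = [
    {"annual_income": 40_000, "years_in_position": 6, "owns_home": True},
    {"annual_income": 52_000, "years_in_position": 4, "owns_home": True},
    {"annual_income": 29_000, "years_in_position": 10, "owns_home": False},
    {"annual_income": 38_000, "years_in_position": 3, "owns_home": False},
]

# Probabilistic view: describe the concept by its typical properties
# rather than by hard rules.
median_income = statistics.median(p["annual_income"] for p in on_time_payers)
mean_years = statistics.mean(p["years_in_position"] for p in on_time_payers)
share_homeowners = sum(p["owns_home"] for p in on_time_payers) / len(on_time_payers)
```

Unlike the classical rule, none of these values excludes anyone outright. They describe what a typical member of the group looks like, which can then inform a decision.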

Exemplar Concept View - Example
Of people with the following characteristics, 90% make 90% of loan payments on time:

                        Profile #1    Profile #2    Profile #3
Annual Income           $40K          $52K          $29K
Number of Years on Job  6             4             10
Homeowner               Yes           Yes           Yes
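
Under the exemplar view, a new instance belongs to the concept if it is sufficiently similar to a stored example. The sketch below uses the three profiles from the slide. The similarity test (income within 20%, job tenure within 2 years, same homeowner status) is an assumption chosen only for illustration, as are the function names.

```python
# The three exemplar profiles from the slide.
PROFILES = [
    {"annual_income": 40_000, "years_on_job": 6, "homeowner": True},
    {"annual_income": 52_000, "years_on_job": 4, "homeowner": True},
    {"annual_income": 29_000, "years_on_job": 10, "homeowner": True},
]


def is_similar(instance, profile, income_tolerance=0.20, years_tolerance=2):
    """Assumed similarity test: close income, close tenure, same homeowner status."""
    income_gap = abs(instance["annual_income"] - profile["annual_income"])
    years_gap = abs(instance["years_on_job"] - profile["years_on_job"])
    return (
        income_gap <= income_tolerance * profile["annual_income"]
        and years_gap <= years_tolerance
        and instance["homeowner"] == profile["homeowner"]
    )


def matches_concept(instance, profiles=PROFILES):
    """Exemplar view: the instance matches if it resembles any stored example."""
    return any(is_similar(instance, profile) for profile in profiles)


# Hypothetical applicants.
close_to_profile_1 = {"annual_income": 41_000, "years_on_job": 6, "homeowner": True}
unlike_any_profile = {"annual_income": 90_000, "years_on_job": 1, "homeowner": False}
```

Here `matches_concept(close_to_profile_1)` holds because the applicant resembles Profile #1, while `matches_concept(unlike_any_profile)` does not. Changing the tolerances changes which instances count as "sufficiently similar".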