Author: Jessie Malone
Introduction to Concepts of Data Mining By Susan Miertschin


Data Mining Definitions
• Use computer learning techniques to analyze and extract knowledge from data
• Process of automatically discovering useful information in large repositories of data

Purpose
• Identify trends in data
• Identify patterns in data
• Many experts think that most of human learning is pattern recognition

Knowledge Discovery in Databases (KDD)
• Application of the scientific method to data mining processes
• Converts raw data into useful information
• Useful information is in the form of a model
  • A generalization based on the data
• Data mining is one step of the KDD process

KDD - Input Data
• Data may reside in
  • Database
  • Spreadsheet
  • Flat file
  • Other
  • Multiple

KDD - Data Preprocessing
• Combine data from multiple sources
• Data cleansing
  • Remove duplicates and incomplete instances
• Feature selection
  • What data features will you consider for the task at hand (the question you are trying to answer)
• Normalization or denormalization
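The first three preprocessing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (`income`, `years_on_job`, `owns_home`) and the function `preprocess` are hypothetical and do not come from the slides, and the normalization step is not covered.

```python
# Hypothetical records from two sources; field names and values are illustrative only.
database_rows = [
    {"id": 1, "income": 40000, "years_on_job": 6, "owns_home": True},
    {"id": 2, "income": 52000, "years_on_job": 4, "owns_home": True},
]
spreadsheet_rows = [
    {"id": 2, "income": 52000, "years_on_job": 4, "owns_home": True},  # duplicate
    {"id": 3, "income": None, "years_on_job": 10, "owns_home": True},  # incomplete
    {"id": 4, "income": 29000, "years_on_job": 10, "owns_home": True},
]


def preprocess(sources, features):
    """Combine sources, drop duplicates and incomplete instances, keep chosen features."""
    seen = set()
    cleaned = []
    for source in sources:                      # combine data from multiple sources
        for row in source:
            key = tuple(sorted(row.items()))
            if key in seen:                     # data cleansing: remove duplicates
                continue
            seen.add(key)
            if any(value is None for value in row.values()):
                continue                        # data cleansing: remove incomplete instances
            cleaned.append({name: row[name] for name in features})  # feature selection
    return cleaned


prepared = preprocess([database_rows, spreadsheet_rows], ["income", "years_on_job"])
```

Here `prepared` holds three instances with only the two selected features; which features to keep depends on the question being asked.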

KDD - Data Mining
• Choose and apply methods and algorithms
• Use software tools

KDD - Post Processing
• Ensure valid and useful results
• Interpret results in business context
• Visualization of results
• Hypothesis testing of statistical results

KDD - Using Information
• Integrate results into decision-making
• “Close the loop”

A Simple Data Mining Process Model

Figure 1.3 A simple data mining process model

Assembling the Data
• May reside in multiple types of repositories
• Relational databases, spreadsheets, flat files
• Data warehouse

The Data Warehouse
A repository of historical data (as opposed to transactional) designed for decision support

Inductive Reasoning: Used in Data Mining

Description
• Look at specific examples; generalize from examples
• Premises support the probable truth of the conclusion
• If premises are true, then it is unlikely that the conclusion is false

Examples
• The average high temperature for the month of July in Houston has been in the 90s (F) for every July in recorded history. Therefore, the average high temperature for this coming July in Houston will be in the 90s, as well.
• This account has never made a purchase outside the state of NM. Thus, the recent three purchases in Uzbekistan are suspicious.
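The account example can be written as a small inductive check: generalize from the locations seen so far, then flag anything that falls outside the generalization. This is a minimal sketch; the function name `is_suspicious` and the purchase history are hypothetical.

```python
def is_suspicious(purchase_location, past_locations):
    """Generalize from past examples: a location never seen before is flagged."""
    return purchase_location not in set(past_locations)


# Hypothetical history: every past purchase on this account was made in NM.
history = ["NM"] * 250
recent = ["Uzbekistan", "Uzbekistan", "Uzbekistan", "NM"]

flags = [is_suspicious(location, history) for location in recent]
```

The flag marks a purchase as suspicious, not as fraudulent: the premises make the conclusion probable, not certain.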

What Can be Learned?
• Facts
  • Fact: a simple truth
• Concepts
  • Concept: a grouping of objects, symbols, events based on common characteristics
• Procedures
  • Procedure: a step-by-step course of action to get something done
• Principles
  • Principle: a general truth, a law - informs other truths

Computers & Learning
Computers are good at learning concepts. Concepts are the output of a data mining session.

Computer Learning
• Computers can efficiently extract groupings of items based on common characteristics
• Concept structures (used to illustrate and define groupings)
  • Trees
  • Rules
  • Networks
  • Equations

Concepts: From Different Perspectives
• Classical View or perspective
  • Every concept has specific defining properties
• Probabilistic View or perspective
  • A concept has probabilistic properties that can aid in making good decisions
• Exemplar View or perspective
  • Given an instance, it is an example of a concept if it possesses characteristics sufficiently similar to other examples of the concept

Classical Concept View - Example
IF Annual_Income >= $40K
AND Time_in_Current_Position >= 5 years
AND Owns_Home = True
THEN Good_Credit_Risk = True

Looks like a RULE
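The rule translates directly into a boolean function. The thresholds are those on the slide; the function and parameter names are assumptions made for this sketch.

```python
def good_credit_risk(annual_income, years_in_current_position, owns_home):
    """Classical view: an instance belongs to the concept only if every defining property holds."""
    return bool(
        annual_income >= 40_000
        and years_in_current_position >= 5
        and owns_home
    )


meets_rule = good_credit_risk(40_000, 5, True)    # all three conditions hold
fails_rule = good_credit_risk(52_000, 4, True)    # fewer than 5 years in current position
```

Under the classical view there is no partial membership: failing any single condition excludes the instance from the concept.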

Probabilistic Concept View - Example
• Median annual income for people who make loan payments on time 90% of the time = $40K
• Mean length of time in current position for people who make loan payments on time 90% of the time = 5.16 years
• Of people who make loan payments on time 90% of the time, 50% own their own home
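Summary properties of this kind can be computed from the instances that belong to the concept. The sketch below uses Python's standard `statistics` module on a made-up sample; the records and the function `summarize` are hypothetical and do not reproduce the figures on the slide.

```python
from statistics import mean, median

# Hypothetical records for borrowers who pay on time; all values are made up.
on_time_borrowers = [
    {"income": 29000, "years_in_position": 10, "owns_home": True},
    {"income": 40000, "years_in_position": 6, "owns_home": False},
    {"income": 52000, "years_in_position": 4, "owns_home": True},
    {"income": 61000, "years_in_position": 2, "owns_home": False},
]


def summarize(records):
    """Describe a concept by the typical properties of its members."""
    return {
        "median_income": median(r["income"] for r in records),
        "mean_years_in_position": mean(r["years_in_position"] for r in records),
        "share_homeowners": sum(r["owns_home"] for r in records) / len(records),
    }


summary = summarize(on_time_borrowers)
```

Unlike the classical rule, these values describe what members of the concept tend to look like; an applicant below the median income is not thereby excluded.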

Exemplar Concept View - Example
• Of people with the following characteristics, 90% make 90% of loan payments on time

Profile #1                   Profile #2                   Profile #3
Annual Income = $40K         Annual Income = $52K         Annual Income = $29K
Number of Years on Job = 6   Number of Years on Job = 4   Number of Years on Job = 10
Homeowner                    Homeowner                    Homeowner
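Under the exemplar view, a new instance is compared against stored examples. The sketch below uses the three profiles from the slide; the distance measure, its scaling ($10K of income and 5 years on the job each count as one unit) and the threshold are assumptions chosen only for illustration.

```python
# The three exemplar profiles from the slide.
PROFILES = [
    {"income": 40_000, "years_on_job": 6, "homeowner": True},
    {"income": 52_000, "years_on_job": 4, "homeowner": True},
    {"income": 29_000, "years_on_job": 10, "homeowner": True},
]


def distance(a, b):
    """Hypothetical dissimilarity: scaled gaps in income and tenure, plus a homeowner mismatch."""
    income_gap = abs(a["income"] - b["income"]) / 10_000
    years_gap = abs(a["years_on_job"] - b["years_on_job"]) / 5
    home_gap = 0 if a["homeowner"] == b["homeowner"] else 1
    return income_gap + years_gap + home_gap


def matches_concept(applicant, profiles=PROFILES, threshold=1.0):
    """An instance is an example of the concept if it is close enough to some exemplar."""
    return min(distance(applicant, p) for p in profiles) <= threshold


similar = matches_concept({"income": 42_000, "years_on_job": 5, "homeowner": True})
dissimilar = matches_concept({"income": 90_000, "years_on_job": 1, "homeowner": False})
```

The choice of distance and threshold determines what "sufficiently similar" means, so in practice both would be tuned to the data.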
