Data Foundations. Data Attributes. Data Attributes and Features Data Pre-processing Data Storage Data Analysis

Author: Grant Bell
Data Foundations
• Data Attributes and Features
• Data Pre-processing
• Data Storage
• Data Analysis

Data Attributes
• Describing data content and characteristics
• Representing data dimensions
• Set of all attributes: attribute vector or attribute array

Attribute Types
• Nominal Data
• Sortable (ordinal) Data

Numerical Attributes
• Discrete vs. Continuous

Statistical Features of Data
• Mean
• Median
• Standard Deviation
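The three statistics named on this slide can be computed with Python's standard `statistics` module. The following is a minimal sketch: the sample values are made up for illustration, and because the slide's own formula is not visible, both the population and the sample standard deviation are shown.

```python
import statistics

# Made-up sample values, used only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)          # arithmetic average
median = statistics.median(values)      # middle value of the sorted data
pop_std = statistics.pstdev(values)     # population standard deviation (divide by n)
sample_std = statistics.stdev(values)   # sample standard deviation (divide by n - 1)

print(mean, median, pop_std, sample_std)
```

The median is less sensitive to extreme values than the mean, which is why the two are usually reported together.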

Similarity Matrix

Distance for Categorical Data: $d(i, j) = \frac{p - m}{p}$
• p: total # of types
• m: # of same types

Distance Measures (Numerical)
• Euclidean Distance: L2
• Manhattan Distance: L1
• Minkowski Distance: Lp
• Cosine Distance: $s(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|}$
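The numerical distance measures listed on this slide can be sketched in a few lines of plain Python. This is an illustrative sketch using the standard textbook definitions. The function names and sample vectors are assumptions, not taken from the slides.

```python
import math


def minkowski(x, y, p):
    """Lp distance: (sum |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def manhattan(x, y):
    """L1 distance: Minkowski with p = 1."""
    return minkowski(x, y, 1)


def euclidean(x, y):
    """L2 distance: Minkowski with p = 2."""
    return minkowski(x, y, 2)


def cosine_similarity(x, y):
    """s(X, Y) = (X . Y) / (||X|| ||Y||); undefined for zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Made-up sample vectors.
x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

print(manhattan(x, y), euclidean(x, y), cosine_similarity(x, y))
```

Manhattan and Euclidean distance are the p = 1 and p = 2 cases of the Minkowski distance, so a single function covers all three. Cosine similarity compares direction only and ignores vector length.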

Data Uncertainty
Source of uncertainty:
• Attribute error
• Missing attributes
• Data integration error
• Resolution conversion
• Application uncertainty

Data Preprocessing
[Diagram: Data Products. Labels: Application database; ETL (Extract, Transform, & Load); Data Warehouse; Commercial Intelligence; Analysis]

Data Preprocessing
1. Data Cleaning
2. Data Integration
• Data Quality
  – Accuracy
  – Completeness
  – Consistency
  – Timeliness
  – Believability
  – Interpretability

Data Visualization Quality
Data-Ink Ratio: $\text{data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}$

Data Error Types
• Missing data
  – Replace with constants
  – Replace with average value of attributes
  – Regression
  – Manual filling
• Noisy data
  – Regression
  – Outlier analysis

Visual Data Cleaning
• Using visualization techniques for data cleaning

Data Integration
• Combining data from multiple sources
  – Structural conflicts
  – Schema differences
  – Data conflicts
  – Repeated data
• Providing a uniform visualization

Data Integration Example
[Figure: Form 1, Form 2, and the resulting Integrated Form]

Data Storage
• File Systems
• Databases and DBMS
• Data Warehouse

CSV file (comma-separated values)
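As a brief illustration of the format, a CSV file stores one record per line with fields separated by commas, usually with a header row. The sketch below reads a small in-memory CSV with Python's standard `csv` module. The column names and values are made up.

```python
import csv
import io

# Made-up CSV content: a header row followed by two records.
csv_text = "name,age,city\nAnn,34,Oslo\nBob,27,Rome\n"

# DictReader uses the header row as keys for each record.
reader = csv.DictReader(io.StringIO(csv_text))
rows = [dict(row) for row in reader]

# CSV carries no type information, so numbers arrive as strings.
ages = [int(row["age"]) for row in rows]

print(rows)
print(ages)
```

Every value is read as text, so numeric attributes have to be converted explicitly after loading.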


Structured Files
• XML files: eXtensible Markup Language. Example content (a short note): Tove, Jani, Reminder, "Don't forget me this weekend!"
• XML extensions: IVOA VOTable (space programs), KML (web maps), etc.
• Special formats: e.g. HDF (Hierarchical Data Format) for scientific data

Databases
• A database is an organized collection of data. The data is typically organized to model aspects of reality in a way that supports processes requiring information. For example, modelling the availability of rooms in hotels in a way that supports finding a hotel with vacancies.

Relational Databases
• DBMS: Database Management System
• Data definition and query languages (SQL)

Visualization of a database
Challenge: Interactivity!
Example: National Science Foundation Database

Data Warehouse
• A data warehouse is a system used for reporting and data analysis. It is a central repository of integrated data from one or more disparate sources.

Database vs Data Warehouse

                 | Database                   | Data Warehouse
Purpose          | Data operation             | Information in the data
Application      | Business                   | Analysis
Users            | Employee, DB Administrator | Analyst, manager, executive
Functionalities  | Daily operations           | Decision making support
Data             | Current                    | Historical, time-variant
Access           | Read, write, mean, etc.    | Read
Focus            | Input, query               | Information output
Size             | 1 GB ~ < 1 TB              | TBs

Data Analysis
• Statistical Analysis
• Exploratory Analysis
• Data Mining

Statistical Analysis
• Statistical description: properties, parameters, distribution, correlations.
• Statistical prediction: using probability methods and sampling theory to predict statistical properties (distribution, correlation, parameters, forecasting, etc.)
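As one concrete instance of statistical description, the correlation between two attributes can be measured with the Pearson coefficient. The slides do not name a specific coefficient, so this choice is an assumption, and the sample data are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Made-up paired observations.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r = pearson(hours, scores)
print(r)
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.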

Data Exploration Using Visualization
• Raw data drawing
• Statistical drawing
• Multi-view

Figure-only slides:
• Data Trajectory
• Data Comparison
• Trend and Patterns
• Relations
• Line Chart
• Line Chart: Sunspots
• Bar Chart
• Sorted Bar Chart
• Bar Chart Labeling
• Bar Chart: Scale
• Bar Chart: Variant
• Stacked Bar Chart (two slides)
• Stacked Chart
• Pie Chart (two slides)
• Pie Charts of Van Gogh Paintings
• Histogram
• Contour Map
• Scatter Plot Matrix
• Reference Lines in Scatter Plot
• Heatmap
• Box Plot
• Box Plot Variations
• Multi-View

Data Mining
• "Data mining, also popularly referred to as knowledge discovery from data (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive repositories or data streams" – J. Han and M. Kamber, "Data Mining: Concepts and Techniques"

[Diagram labels: Mining, Data, Model, Validation, Knowledge]

Data Mining Tasks
[Diagram. Description: Algorithm, Features, Data. Prediction, steps 1 and 2: Training Data, Trained Model, Trained Model, Model, New Data, Features]

Data Mining Tasks
Descriptive Tasks:
• Concept Description
• Association Mining
• Clustering
• Outlier Analysis
Predictive Tasks:
• Classification
• Evolution Analysis

Data Mining Methods
• Statistical Method: Regression, Parameter Estimation
• Machine Learning: Decision tree, Neural Networks
• Statistical Learning: Probability model, Bayesian networks
• Algorithmic Method: K-means, Graph operations
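Of the methods listed, k-means is compact enough to sketch directly. The version below alternates the usual assignment and update steps for a fixed number of iterations. The starting centroids, the iteration count, and the sample points are assumptions chosen for illustration. It requires Python 3.8 or later for `math.dist`.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Basic k-means on tuples of floats; returns (centroids, clusters)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, so practical implementations usually run several random initialisations and keep the best one.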

Data Mining New Applications
• Text Mining – summarizes, navigates, and clusters documents contained in a database.
• Web Mining – integrates data and text mining within a Web site; enhances the Web site with intelligent behavior, such as suggesting related links or recommending new products to the consumer

Data Visualization Pipeline

Looping Model

Visual Data Mining and Visual Analytics
• Users are involved in the data mining process through visualization and user interactions.
• Certain tasks are difficult to automate
  – Validation of clustering results
  – Checking data focal points and noise
  – Expert knowledge input, etc.

Visual Analytics Paradigm
[Diagram labels: Knowledge Visualization; Data Processing & Mining; User Interaction; Revision of Mining Engine & Decision Making]

Visual Analytics
By Daniel Keim

Visual Data Mining Examples: Visualizing data correlations

Visual Data Mining Example: Biomarker detection

$Z = \frac{1}{N} \sum_{i=1}^{N} f(D_1, P_i)$

$Z = \frac{1}{7} \sum_{i=1}^{7} f(D_1, P_i)$

$Z = \frac{1}{7} \sum_{i=1}^{7} f(D_2, P_i)$

$Z = \frac{1}{7} \sum_{i=1}^{7} f(P_i, D_j)$

$Z = \frac{1}{3} \sum_{i} f(D_1, P_i), \quad i = 1, 2, 7$

$Z = \frac{1}{3} \sum_{i} f(P_i, D_j), \quad i = 1, 2, 7$

$Z = \frac{1}{5} \sum_{i} f(P_i, D_j), \quad i = 1, 2, 3, 5, 7$

Visual Data Mining Example: Facial feature detection for medical diagnosis

Visual Data Mining Example: Concept detection in text data

Visual Data Mining Example: Visualizing decision trees