Data Foundations
– Data Attributes and Features
– Data Pre-processing
– Data Storage
– Data Analysis
Data Attributes
– Describing data content and characteristics
– Representing data dimensions
– Set of all attributes: attribute vector or attribute array
Attribute Types
– Nominal Data
– Sortable Data
Numerical Attributes
– Discrete vs. Continuous
Statistical Features of Data
– Mean
– Median
– Standard Deviation
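The three statistics named on this slide can be computed with Python's standard `statistics` module. This is a minimal sketch on a hypothetical sample. The slide does not say whether the population or the sample form of the standard deviation is intended, so both are shown.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.fmean(values)        # arithmetic mean: sum of values / count
median = statistics.median(values)     # middle value of the sorted data
pop_std = statistics.pstdev(values)    # population form, divides by n
sample_std = statistics.stdev(values)  # sample form, divides by n - 1

print(mean, median, pop_std, sample_std)
```

The median is less sensitive to extreme values than the mean, which is why the two are usually reported together.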
Similarity Matrix
Distance for Categorical Data: $d(i, j) = \frac{p - m}{p}$
– p: total number of types
– m: number of same types
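A minimal sketch of a categorical distance and the resulting matrix. It assumes the usual simple-matching form, distance = (p - m) / p, with p the number of attributes and m the number of matching ones. That form is an assumption based on the definitions of p and m; the records and function names are hypothetical.

```python
def categorical_distance(a, b):
    """Simple-matching distance (p - m) / p between two records."""
    if len(a) != len(b) or not a:
        raise ValueError("records must be non-empty and of equal length")
    p = len(a)                                  # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)  # number of matching attributes
    return (p - m) / p


def distance_matrix(records):
    """Pairwise distances; entry [i][j] compares record i with record j."""
    return [[categorical_distance(r, s) for s in records] for r in records]


# Hypothetical records with three categorical attributes each.
records = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]
matrix = distance_matrix(records)
for row in matrix:
    print(row)
```

The matrix is symmetric with zeros on the diagonal, so in practice only one triangle needs to be stored.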
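Distance Measures (Numerical)
– Euclidean Distance ($L_2$): $d(X, Y) = \sqrt{\sum_{i} (x_i - y_i)^2}$
– Manhattan Distance ($L_1$): $d(X, Y) = \sum_{i} |x_i - y_i|$
– Minkowski Distance ($L_p$): $d(X, Y) = \left( \sum_{i} |x_i - y_i|^p \right)^{1/p}$
– Cosine Distance: $s(X, Y) = \frac{X \cdot Y}{\sqrt{X \cdot X} \, \sqrt{Y \cdot Y}}$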
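The numerical measures on this slide can be sketched in a few lines of plain Python. Euclidean and Manhattan distance are treated as the p = 2 and p = 1 cases of Minkowski distance. The function names and sample vectors are hypothetical.

```python
import math


def minkowski(x, y, p):
    """L_p distance: p-th root of the sum of |x_i - y_i|^p."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    return minkowski(x, y, 2)  # L_2


def manhattan(x, y):
    return minkowski(x, y, 1)  # L_1


def cosine_similarity(x, y):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hypothetical vectors: two perpendicular directions.
x = [0.0, 3.0]
y = [4.0, 0.0]
print(euclidean(x, y), manhattan(x, y), cosine_similarity(x, y))
```

Cosine similarity ignores vector length and compares direction only, so it is undefined for an all-zero vector.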
Data Uncertainty
Source of uncertainty
– Attribute error
– Missing attributes
– Data integration error
– Resolution conversion
– Application uncertainty
Data Preprocessing
[Diagram labels: Data Products; Application database; ETL (Extract, Transform, & Load); Data Warehouse; Commercial Intelligence; Analysis]
Data Preprocessing
1. Data Cleaning
2. Data Integration
Data Quality
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Interpretability
Data Visualization Quality
Data-Ink Ratio: $\frac{\text{data-ink}}{\text{total ink used in the graphic}}$
Data Error Types
Missing data
– Replace with constants
– Replace with average value of attributes
– Regression
– Manual filling
Noisy data
– Regression
– Outlier analysis
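Two of the missing-data strategies (constant and attribute average) and one simple form of outlier analysis can be sketched as follows. The 1.5 × IQR rule used for outliers is an assumption, since the slide names no specific method. All data and function names are hypothetical.

```python
import statistics


def fill_with_constant(values, constant):
    """Replace missing entries (None) with a fixed constant."""
    return [constant if v is None else v for v in values]


def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def flag_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical attribute column with two missing entries.
ages = [23, None, 31, 27, None, 29]
filled = fill_with_mean(ages)
print(filled)
print(flag_outliers([10, 12, 11, 13, 12, 95]))
```

Mean filling keeps the attribute average unchanged but reduces its variance, which is one reason regression-based filling is listed as an alternative.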
Visual Data Cleaning
Using visualization techniques for data cleaning
Data Integration
Combining data from multiple sources
– Structural conflicts
– Schema differences
– Data Conflicts
– Repeated data
Providing a uniform visualization
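A minimal sketch of two of the issues listed above, schema differences and repeated data, using two hypothetical sources. The field names, records, and the choice of (name, phone) as the duplicate key are all assumptions made for illustration.

```python
# Hypothetical source A uses "name"/"phone"; source B uses "full_name"/"tel".
source_a = [
    {"name": "Ann Lee", "phone": "555-0101"},
    {"name": "Bob Ray", "phone": "555-0102"},
]
source_b = [
    {"full_name": "Bob Ray", "tel": "555-0102"},
    {"full_name": "Cy Doe", "tel": "555-0103"},
]

# Mapping from source B field names to the common schema.
FIELD_MAP_B = {"full_name": "name", "tel": "phone"}


def to_common_schema(record, field_map):
    """Rename fields so that both sources share one schema."""
    return {field_map.get(key, key): value for key, value in record.items()}


def integrate(*sources):
    """Concatenate sources, keeping the first copy of each repeated record."""
    seen = set()
    merged = []
    for source in sources:
        for record in source:
            key = (record["name"], record["phone"])
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


integrated = integrate(
    source_a, [to_common_schema(r, FIELD_MAP_B) for r in source_b]
)
print(integrated)
```

Real data conflicts, such as the same person with two different phone numbers, need a resolution policy that this sketch does not attempt.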
Data Integration Example
[Tables: Form 1; Form 2; Integrated Form]
Data Storage
– File Systems
– Databases and DBMS
– Data Warehouse
CSV file (comma-separated values)
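A minimal sketch of reading CSV data with Python's standard `csv` module. The column names and rows are hypothetical, and an in-memory string stands in for a file on disk.

```python
import csv
import io

# Hypothetical CSV content; the first line holds the column names.
csv_text = "city,population\nOslo,700000\nBergen,285000\n"

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV stores everything as text, so numeric columns need explicit conversion.
populations = {row["city"]: int(row["population"]) for row in rows}
print(populations)
```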
Structured Files
– XML files: eXtensible Markup Language
  <note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>
– XML Extensions: IVOA VOTable (Space programs), KML (Web maps), etc.
– Special formats: e.g. HDF (Hierarchical Data Format) for scientific data
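A minimal sketch of parsing a small XML document with the standard `xml.etree.ElementTree` module. The element names (`note`, `to`, `from`, `heading`, `body`) are an assumption about how the slide's reminder example was marked up.

```python
import xml.etree.ElementTree as ET

# Assumed markup for the reminder example; the tag names are illustrative.
xml_text = """
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
"""

root = ET.fromstring(xml_text)

# Each child element becomes one key/value pair.
note = {child.tag: child.text for child in root}
print(root.tag, note)
```

Unlike CSV, the structure is carried by the tags themselves, so fields can be nested or optional without changing the parser.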
Databases
A database is an organized collection of data. The data is typically organized to model aspects of reality in a way that supports processes requiring information. For example, modelling the availability of rooms in hotels in a way that supports finding a hotel with vacancies.
Relational Databases
– DBMS: Database Management System
– Data definition and query languages (SQL)
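A minimal sketch of data definition and querying in SQL, using Python's built-in `sqlite3` module and the hotel vacancy scenario mentioned earlier. The table layout, hotel names, and room numbers are hypothetical.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Data definition: one row per room, available = 1 means vacant.
conn.execute(
    "CREATE TABLE rooms (hotel TEXT, room_no INTEGER, available INTEGER)"
)
conn.executemany(
    "INSERT INTO rooms VALUES (?, ?, ?)",
    [
        ("Harbor Inn", 101, 0),
        ("Harbor Inn", 102, 1),
        ("City Lodge", 201, 0),
    ],
)

# Query: which hotels have at least one vacant room?
hotels_with_vacancies = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT hotel FROM rooms WHERE available = 1 ORDER BY hotel"
    )
]
conn.close()
print(hotels_with_vacancies)
```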
Visualization of a database
– Challenge: Interactivity!
– Example: National Science Foundation Database
Data Warehouse
A data warehouse is a system used for reporting and data analysis. It is a central repository of integrated data from one or more disparate sources.
Database vs Data Warehouse
                 | Database                   | Data Warehouse
Purpose          | Data operation             | Information in the data
Application      | Business                   | Analysis
Users            | Employee, DB Administrator | Analyst, manager, executive
Functionalities  | Daily operations           | Decision making support
Data             | Current                    | Historical, time-variant
Access           | Read, write, mean, etc.    | Read
Focus            | Input, query               | Information output
Size             | 1 GB ~ < 1 TB              | TBs
Data Analysis
– Statistical Analysis
– Exploratory Analysis
– Data Mining
Statistical Analysis
– Statistical description: properties, parameters, distribution, correlations.
– Statistical prediction: using probability methods and sampling theory to predict statistical properties (distribution, correlation, parameters, forecasting, etc.)
Data Exploration Using Visualization
– Raw data drawing
– Statistical drawing
– Multi-view
Data Trajectory
Data Comparison
Trend and Patterns
Relations
Line Chart
Line Chart: Sunspots
Bar Chart
Sorted Bar Chart
Bar Chart Labeling
Bar Chart: Scale
Bar Chart: Variant
Stacked Bar Chart
Stacked Chart
Pie Chart
Pie Charts of Van Gogh Paintings
Histogram
Contour Map
Scatter Plot Matrix
Reference Lines in Scatter Plot
Heatmap
Box Plot
Box Plot Variations
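A box plot is typically drawn from a five-number summary plus whisker limits. The sketch below computes those quantities, assuming the common 1.5 × IQR whisker convention and the "inclusive" quartile method of Python's `statistics` module. Plotting libraries may use slightly different quartile definitions. The sample data is hypothetical.

```python
import statistics


def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, maximum."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}


def whisker_limits(summary, k=1.5):
    """Assumed convention: points beyond Q1 - k*IQR or Q3 + k*IQR are outliers."""
    iqr = summary["q3"] - summary["q1"]
    return summary["q1"] - k * iqr, summary["q3"] + k * iqr


# Hypothetical sample.
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
limits = whisker_limits(summary)
print(summary, limits)
```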
Multi-View
Data Mining
“Data mining, also popularly referred to as knowledge discovery from data (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive repositories or data streams” – J. Han and M. Kamber, “Data Mining: Concepts and Techniques”
[Diagram: Data → (Mining) → Model → (Validation) → Knowledge]
Data Mining Tasks
[Diagram labels: Description (Algorithm, Features, Data); Prediction, steps 1 and 2 (Training Data, Trained Model, Model, New Data, Features)]
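The two-step prediction flow (train a model on training data, then apply the trained model to features of new data) can be sketched with a very small classifier. The nearest-centroid method, the feature names `x` and `y`, and the data are all hypothetical stand-ins; the slide does not prescribe a particular algorithm.

```python
def extract_features(record):
    """Hypothetical feature extraction: pick two numeric fields."""
    return (float(record["x"]), float(record["y"]))


def train(training_data):
    """Step 1: build a model (one centroid per label) from labelled records."""
    sums = {}
    for record, label in training_data:
        features = extract_features(record)
        total, count = sums.get(label, ((0.0,) * len(features), 0))
        sums[label] = (tuple(t + f for t, f in zip(total, features)), count + 1)
    return {
        label: tuple(t / count for t in total)
        for label, (total, count) in sums.items()
    }


def predict(model, record):
    """Step 2: assign new data to the label with the closest centroid."""
    features = extract_features(record)

    def squared_distance(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical labelled training data.
training_data = [
    ({"x": 1, "y": 1}, "low"),
    ({"x": 2, "y": 1}, "low"),
    ({"x": 8, "y": 9}, "high"),
    ({"x": 9, "y": 8}, "high"),
]
model = train(training_data)
print(model, predict(model, {"x": 2, "y": 2}))
```

The same feature extraction is applied in both steps, so the trained model sees new data in the form it was trained on.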
Data Mining Tasks
Descriptive Tasks:
– Concept Description
– Association Mining
– Clustering
– Outlier Analysis
Predictive Tasks:
– Classification
– Evolution Analysis
Data Mining Methods
– Statistical Method: Regression, Parameter Estimation
– Machine Learning: Decision Trees, Neural Networks
– Statistical Learning: Probability Models, Bayesian Networks
– Algorithmic Method: K-means, Graph Operations
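As one concrete instance of the algorithmic methods, K-means can be sketched in plain Python. This is a minimal version that assumes the caller supplies the initial centroids, which keeps the run deterministic. The points and starting centroids are hypothetical.

```python
def kmeans(points, initial_centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The result depends on the starting centroids, which is one reason clustering output usually needs validation.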
Data Mining New Applications
– Text Mining: summarizes, navigates, and clusters documents contained in a database.
– Web Mining: integrates data and text mining within a Web site; enhances the Web site with intelligent behavior, such as suggesting related links or recommending new products to the consumer.
Data Visualization Pipeline
Looping Model
Visual Data Mining and Visual Analytics
Users are involved in the data mining process through visualization and user interactions.
Certain tasks are difficult to automate:
– Validation of clustering results
– Checking data focal points and noise
– Expert knowledge input, etc.
Visual Analytics Paradigm
[Diagram labels: Knowledge Visualization; Data Processing & Mining; User Interaction; Revision of Mining Engine & Decision Making]
Visual Analytics
By Daniel Keim
Visual Data Mining Examples: Visualizing data correlations
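Visual Data Mining Example: Biomarker detection
$Z = \frac{1}{N} \sum_{i=1}^{N} f(D_1, P_i)$
$Z = \frac{1}{7} \sum_{i=1}^{7} f(D_1, P_i)$
$Z = \frac{1}{7} \sum_{i=1}^{7} f(D_2, P_i)$
$Z = \frac{1}{7} \sum_{i=1}^{7} f(P_i, D_j)$
$Z = \frac{1}{3} \sum_{i} f(D_1, P_i), \quad i = 1, 2, 7$
$Z = \frac{1}{3} \sum_{i} f(P_i, D_j), \quad i = 1, 2, 7$
$Z = \frac{1}{5} \sum_{i} f(P_i, D_j), \quad i = 1, 2, 3, 5, 7$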
Visual Data Mining Example: Facial feature detection for medical diagnosis
Visual Data Mining Example: Concept detection in text data
Visual Data Mining Example: Visualizing decision trees