Applications (contd.) Database System II Preliminaries Instructor: Sharma Chakravarthy [email protected] The University of Texas at Arlington



We are familiar with cpu intensive applications. They keep most/all of their data in main memory for processing (Java projects, data structures projects). The files you used were very small!



We know how to design these algorithms and applications (algorithms and software engineering)



Design and implementation of data intensive applications is different from main memory applications. We deal with this in cse 5330 What is the equivalent of UML for data modeling?



– In fact, EER came way earlier than UML! 1

Applications 

Applications (contd.)

Computation intensive applications



– E. g., finite element analysis, Computer aided design, wind tunnel testing, simulations, numerical analysis 

3

– – – –

Online shopping: Netflix user rentals, Amazon, ebay, … Online indomation: Google, Yahoo, Wiki, … Online services: credit cards, banking, … Others: library book search, comparative shopping, EDGAR database, Facebook database, … – Enterprise databases: …

Data intensive applications – E. g., library of congress book management, Walmart point of sales database, Amazon inventory database, airline reservations, Homeland security database and applications …



What applications are likely to be data intensive?



Computation & data intensive applications – Data mining, drill down and exploration of data warehouse, dimensional analysis, Google search, stream/sensor data processing, scientific computations… – Mining very large volumes of unstructured data: cloud computing, NoSQL processing, Map/Reduce paradigm 2

For these application, we not only have to design the computational part, but we have to design the data to be represented and managed

4

1

Applications (contd.)

Applications (Contd.) 

In addition, we also need to model processing of data  This is done by 

In cpu intensive applications, representation of data was assumed to be simpler of the two

– Listing commonly used computations E.g., which book is borrowed by who? E. g., generate a report of customers who have not paid the last credit card bill (or last 2 credit card bills)  E. g., Track a UPS shipment to know the current status  E.g., 10 most popular movies (overall or in each genre)  E.g., honors students in the COE 



The choice was in the identification of one of the main-memory data structures (arrays, lists, trees, hash tables…)



The focus was on the algorithm, its correctness, and complexity



– Notion of transactions 

The above has to be designed and developed

5

7

Applications (contd.)

Applications (contd.)

We want to model the data component and at the same time keep the computation in mind  For example, 



How do we model a data intensive application

– Payment of rent and receipt is modeled as a data item which indicates what amount was paid on what date! (actions correspond to data insertion!) – The actual exchange of receipt or payment is not captured in that way!

– Requirements analysis – Use of EER model to represent real-world objects (entities and relationships) – Mapping EER diagram to a set of relations – Normalizing them and storing them in a database – Developing applications/GUI on top of a DBMS

– With the captured data, it should be possible to generate a receipt, or a non-payment notice after the due date

6

8

2

Applications (contd.)

Applications (contd.)

We cannot model actions in a database; instead, we capture the underlying data that corresponds to actions; actions need to be modeled/implemented as computations (or code)!  For example, 



– Although invoice is used in real-life, the contents of an invoice is what is captured in the database so that an invoice can be generated – If you capture all the items purchased, receipt or invoice can be generated – Temporally changing data can be captured using date and timestamp as properties

There are a number of other issues: – – – –

Interfaces/gui Use of templates Physical data base design Authorization and access to sensitive information

– – – –

Continuous operation (24 x 7) Recovery/backups, mirroring Optimization of the system Scalability, distribution, etc.



Salary, ss#, …

9

11

In this course

Applications (contd.)

 

Internet applications

– – – –

– accept login id, password, and security question/answers 

We focus on DBMS architecture and its components

The question is: how to model it using EER? – The key is not to concentrate on the process – But on the underlying data needed – For the above, you need to store, for each user, a set of data: login id, password, security questions/answers – The interface that implements the login process can access this data and use it appropriately



Storage and indexing Transaction management and Recovery Query Optimization Buffer Management

Projects will be to implement some of the above components in a realistic setting – We will also look at new approaches, such as  

10

Map/reduce Non-relational or NoSQL DBMSs 12

3

Why Study Databases??

What Is a DBMS? 

A very large, integrated collection of data.



Models a real-world enterprise.



e.g., students, courses, buildings, websites



– Relationships 



Shift from computation to information management – at the “low end”: ad hoc data, spread sheet – at the “high end”: scientific data management

– Entities 

?

Datasets increasing in diversity and volume. – Digital libraries, interactive video, Human Genome project, EOS project, multi-media dbms – ... need for DBMS is exploding

e.g., Billy Joel performs at UTA, students take courses, lenders issue mortgages, an employee manages a department, students submit projects



A Database Management System (DBMS) is a software package designed to store and manage databases.

DBMS study encompasses all aspects of CS – OS, languages, theory, AI, multimedia, logic

13

15

[Adapted from Peter Lyngbaek:OOPSLA:1991]

Why Use a DBMS?

Evolution of DBMSs

Data independence and efficient access.  Reduced application development time.  Data integrity and security.  Uniform data administration.  Concurrent access, recovery from crashes.  Persistence, scalability, portability 

Application

Application

Modeling Semantics Program Tx Mgmt

Modeling Semantics Program

Tx Mgmt File I/O

Application Modeling Semantics

Query Opt Tx Mgmt File I/O

File I/O DBMS Pre 1970’s

Non-relational (1970)

Application

Modeling (static/dynamic) Query Opt Tx Mgmt File I/O

DBMS Relational (1980)

DBMS OODBMS (1990)

14

16

4

Performance Issues

Moving On … Application

Mining XML support Workflow support Modeling (static/dynamic) Query Opt Tx Mgmt File I/O

DBMS



Non-DBMS Applications

Application

– Improving existing abstraction/adding new functionality – Increase in data volume without any other changes

Stream Processing Mining XML support Workflow support



….

Modeling (static/dynamic) Query Opt Tx Mgmt File I/O

[2000]

DSMS

[2009]

There are 2 primary sources of performance problems

[2013]

In the absence of the above, we can continue to improve performance (improvements are offset by the increase in data or additional software needed for supporting the abstraction)

17

Solutions

Impedance mismatch If you are studying DBMS, you have to understand this clearly  Cpu is way faster than disk access. In fact, it is a Million or more times faster





There are several solutions – Doubling of cpu speed every 10 months or sooner (Moore’s Law) Have you heard of Neuromorphic computing? – Increase in disk i/o throughput – Use of multiple processors (Map/Reduce leverages this on a large scale)

– Cpu accessed are in nano secs and disk accesses are still in milli secs!



Because of this speed difference, it takes much longer to bring data from a disk  So, cpu is waiting/idling most of the times for data (if not architected properly)! 



19





This is called impedance mismatch!

The above CANNOT completely address the problems mentioned earlier New algorithms/techniques, optimization are critical (e.g., data structures, indexes, concurrent processing, …) The introduction of SSDs (solid state disks) are warranting a re-examination of some of the above issues – No impedance mismatch!!

18

20

5

My Thoughts

Data Models A data model is a collection of concepts for describing data.  A schema is a description of a particular collection of data, using the a given data model.  The relational model of data is the most widely used model today. 



 



Improving functionality/abstraction is a natural evolution (not only in databases but in all aspects of life) However, this has a critical impact on performance Performance has to keep up with the increased burden in novel ways (e.g., scheduling, load shedding as witnessed in stream processing) There is an adjustment period for the performance to catch up and it always does

– Main concept: relation, basically a table with rows and columns. – Every relation has a schema, which describes the columns, or fields. 21

What can be done   







23

Levels of Abstraction

We seem to bundle to much into our system and make it all or nothing situation! We need to unbundle and provide mechanisms to configure systems based on application needs For example, cc and recovery is not needed for a number of newer applications (e.g., stream processing, mining) Similarly, novel storage representations with appropriate search and efficient access can be useful for other applications We can/should not wait for the CPU and hardware speedup to address our performance issues

 

Many views, single conceptual (logical) schema and physical schema. – Views describe how users see the data. – Conceptual schema defines logical structure – Physical schema describes the files and indexes used.

This is happening now with Map/Reduce and NoSQL Systems

View 1

View 2

View 3

Conceptual Schema Physical Schema

 Schemas are defined using DDL; data is modified/queried using DML. 22

24

6

Example: University Database 

DBMS Components 

Conceptual schema:



– Students(sid: string, name: string, login: string, age: integer, gpa:real) – Courses(cid: string, cname:string, credits:integer) – Enrolled(sid:string, cid:string, grade:string) 

   

Physical schema:

 

– Relations stored as unordered files. – Index on first column of Students. 

External Schema (View):

Architecture – typically client/server DDL, DML parsing, processing Physical DB design (files, indexing, access methods) Query processing, optimization Buffer management Concurrency control Recovery Utilities – backup, archiving, exporting, OLAP, report generator, physical design wizards, access to multiple DBMSs, designer tools, tuning of parameters, etc. etc.

– Course_info(cid:string,enrollment:integer) 25

Data Independence

27

DBMS Architecture

SQL commands from applications

Applications insulated from how data is structured and stored.  Logical data independence: Protection from changes in logical structure of data.  Physical data independence: Protection from changes in physical structure of data. 

Query Optimization and Execution (parser, optimizer, Plan executor) Transaction

Files and Access Methods

Manager

Buffer Management

Recovery Manager

Lock Manager

Disk Space Management

Concurrency control

 One of the most important benefits of using a DBMS!

Log 26

DB

System catalog, data files, Index files 28

7

Query Processing

Transaction: An Execution of a DB Program Key concept is transaction, which is an atomic (eithe all or nothing) sequence of database actions (reads/writes).  Each transaction, executed completely, must leave the DB in a consistent state if DB is consistent when the transaction begins. 

Need an expressive Query Language over a simple data model  Non-procedural query language 

– Burden of optimization on the system (not on the user as in hierarchical and network databases) – Query representation should not affect query optimization 

Need for report generation

29

31

Concurrency Control 

Concurrent execution of user programs essential for good DBMS performance.

The Log is



– Ti writes an object: the old value and the new value.

– Because disk accesses are frequent, and relatively slow, it is important to keep the cpu humming by working on several user programs concurrently. 



The following actions are recorded in the log: 

Log record must go to disk before the changed page!

– Ti commits/aborts: a log record indicating this action.

Log records chained together by Xact id, so it’s easy to undo a specific Xact (e.g., to resolve a deadlock).  Log is often archived on “stable” storage.  All log related activities (and in fact, all CC related activities such as lock/unlock, dealing with deadlocks etc.) are handled transparently by the DBMS.

Interleaving actions of different user programs can lead to inconsistency: e.g., check is cleared while account balance is being computed. DBMS ensures such problems do not arise: users can pretend they are using a single-user system.



30

32

8

Competitors

Databases make these folks happy ...



– ObjectStore, Versant, Ontos – a synthesis: object-relational model

End users and DBMS vendors  DB application programmers 



– E.g. smart webmasters 



Database administrator (DBA) – – – –

Earlier Competitor: Object-Oriented model

Informix Universal Server, UniSQL, O2, Oracle

More recent competitor: Map/reduce, NoSQL – Works on unstructured data – Useful for one-time processing as opposed to data management – Overhead is low, however, programming level is high – Less expensive, use of commodity machines!!

Designs logical /physical schemas Handles security and authorization Data availability, crash recovery Database tuning as needs evolve

Must understand how a DBMS works! 33

Why Study the Relational Model? 

Relational Database: Definitions

Most widely used model. – Vendors: IBM (Informix), Microsoft, Oracle, Sybase, etc.



35

 

“Legacy systems” in older models

– Instance : a table, with rows and columns. #rows = cardinality, #fields = degree / arity – Schema : specifies name of relation, plus name and type of each column.

– e.g., IBM’s IMS – Network models are not used today 

More recently, NoSQL systems – Based on different data models – Key-store, document store, graph – Tradeoffs in functionality ACID vs. CAP (or BASE)

Relational database: a set of relations. Relation: made up of 2 parts:





34

E.g. Students(sid: string, name: string, login: string, age: integer, gpa: real)

Can think of a relation as a set of rows or tuples. (i.e., all rows are distinct) 36

9

Adding and Deleting Tuples

Example Instance of Students Relation sid 53666 53688 53650

name login Jones jones@cs Smith smith@eecs Smith smith@math

age 18 18 19



gpa 3.4 3.2 3.8

Can insert a single tuple using: INSERT INTO Students (sid, name, login, age, gpa) VALUES (‘53688’, ‘Smith’, ‘smith@ee’, 18, 3.2)



 Cardinality = 3, degree = 5 , all rows distinct

Can delete all tuples satisfying some condition (e.g., name = Smith): DELETE FROM Students S WHERE S.name = ‘Smith’

 Do all columns in a relation instance have to be distinct?

 Powerful variants of these commands are available; more later! 37

39

Creating Relations in SQL

Integrity Constraints (ICs)

Creates the Students CREATE TABLE Students (sid: CHAR(20), relation. Observe that the name: CHAR(20), type (domain) of each field login: CHAR(10), is specified, and enforced by age: INTEGER, the DBMS whenever tuples gpa: REAL) are added or modified.  As another example, the CREATE TABLE Enrolled Enrolled table holds (sid: CHAR(20), information about courses cid: CHAR(20), that students take. grade: CHAR(2))





IC: condition that must be true for any instance of the database; e.g., domain constraints. – ICs are specified when schema is defined. – ICs are checked when relations are modified.



A legal instance of a relation is one that satisfies all specified ICs.



If the DBMS checks ICs, stored data is more faithful to real-world meaning.

– DBMS should not allow illegal instances.

– Avoids data entry errors, too! 38

40

10

Relational Query Languages

Relational Model: Summary

A major strength of the relational model: supports simple, powerful querying of data.  Queries can be written intuitively, and the DBMS is responsible for efficient evaluation.



A tabular representation of data.  Simple and intuitive, currently the most widely used.  Integrity constraints can be specified by the DBA, based on application semantics. DBMS checks for violations.



– The key: precise semantics for relational queries. – Allows the optimizer to extensively re-order operations, and still ensure that the answer does not change.

– Two important ICs: primary and foreign keys – In addition, we always have domain constraints. 

Powerful and natural query languages exist.

41

43

The SQL Query Language

Relational Model: Summary (contd.)

The most widely used relational query language. Current standard is SQL-2011.  To find all 18 year old students, we can write:



ACID properties are important



Concurrency control and Recovery together guarantee ACID properties



SELECT * FROM Students S WHERE S.age=18

sid

name

53666 Jones

login

age gpa

jones@cs

18 3.4

– Atomicity, consistency, isolation, and durability

53688 Smith smith@ee 18 3.2

•To find just names and logins, replace the first line: SELECT

S.name, S.login 42

44

11

Other data models 

The older hierarchical and network models are not used much (although they are in use)



However, several non-relational data models have been proposed and currently used in specific contexts



They are called NoSQL DBMSs – Key-value store – Document – graph 45

Thank You !

© Sharma Chakravarthy ©

☞46

46

12