Applications (contd.) Database System II Preliminaries Instructor: Sharma Chakravarthy
[email protected] The University of Texas at Arlington
We are familiar with cpu intensive applications. They keep most/all of their data in main memory for processing (Java projects, data structures projects). The files you used were very small!
We know how to design these algorithms and applications (algorithms and software engineering)
Design and implementation of data intensive applications is different from main memory applications. We deal with this in cse 5330 What is the equivalent of UML for data modeling?
– In fact, EER came way earlier than UML! 1
Applications
Applications (contd.)
Computation intensive applications
– E. g., finite element analysis, Computer aided design, wind tunnel testing, simulations, numerical analysis
3
– – – –
Online shopping: Netflix user rentals, Amazon, ebay, … Online indomation: Google, Yahoo, Wiki, … Online services: credit cards, banking, … Others: library book search, comparative shopping, EDGAR database, Facebook database, … – Enterprise databases: …
Data intensive applications – E. g., library of congress book management, Walmart point of sales database, Amazon inventory database, airline reservations, Homeland security database and applications …
What applications are likely to be data intensive?
Computation & data intensive applications – Data mining, drill down and exploration of data warehouse, dimensional analysis, Google search, stream/sensor data processing, scientific computations… – Mining very large volumes of unstructured data: cloud computing, NoSQL processing, Map/Reduce paradigm 2
For these application, we not only have to design the computational part, but we have to design the data to be represented and managed
4
1
Applications (contd.)
Applications (Contd.)
In addition, we also need to model processing of data This is done by
In cpu intensive applications, representation of data was assumed to be simpler of the two
– Listing commonly used computations E.g., which book is borrowed by who? E. g., generate a report of customers who have not paid the last credit card bill (or last 2 credit card bills) E. g., Track a UPS shipment to know the current status E.g., 10 most popular movies (overall or in each genre) E.g., honors students in the COE
The choice was in the identification of one of the main-memory data structures (arrays, lists, trees, hash tables…)
The focus was on the algorithm, its correctness, and complexity
– Notion of transactions
The above has to be designed and developed
5
7
Applications (contd.)
Applications (contd.)
We want to model the data component and at the same time keep the computation in mind For example,
How do we model a data intensive application
– Payment of rent and receipt is modeled as a data item which indicates what amount was paid on what date! (actions correspond to data insertion!) – The actual exchange of receipt or payment is not captured in that way!
– Requirements analysis – Use of EER model to represent real-world objects (entities and relationships) – Mapping EER diagram to a set of relations – Normalizing them and storing them in a database – Developing applications/GUI on top of a DBMS
– With the captured data, it should be possible to generate a receipt, or a non-payment notice after the due date
6
8
2
Applications (contd.)
Applications (contd.)
We cannot model actions in a database; instead, we capture the underlying data that corresponds to actions; actions need to be modeled/implemented as computations (or code)! For example,
– Although invoice is used in real-life, the contents of an invoice is what is captured in the database so that an invoice can be generated – If you capture all the items purchased, receipt or invoice can be generated – Temporally changing data can be captured using date and timestamp as properties
There are a number of other issues: – – – –
Interfaces/gui Use of templates Physical data base design Authorization and access to sensitive information
– – – –
Continuous operation (24 x 7) Recovery/backups, mirroring Optimization of the system Scalability, distribution, etc.
Salary, ss#, …
9
11
In this course
Applications (contd.)
Internet applications
– – – –
– accept login id, password, and security question/answers
We focus on DBMS architecture and its components
The question is: how to model it using EER? – The key is not to concentrate on the process – But on the underlying data needed – For the above, you need to store, for each user, a set of data: login id, password, security questions/answers – The interface that implements the login process can access this data and use it appropriately
Storage and indexing Transaction management and Recovery Query Optimization Buffer Management
Projects will be to implement some of the above components in a realistic setting – We will also look at new approaches, such as
10
Map/reduce Non-relational or NoSQL DBMSs 12
3
Why Study Databases??
What Is a DBMS?
A very large, integrated collection of data.
Models a real-world enterprise.
e.g., students, courses, buildings, websites
– Relationships
Shift from computation to information management – at the “low end”: ad hoc data, spread sheet – at the “high end”: scientific data management
– Entities
?
Datasets increasing in diversity and volume. – Digital libraries, interactive video, Human Genome project, EOS project, multi-media dbms – ... need for DBMS is exploding
e.g., Billy Joel performs at UTA, students take courses, lenders issue mortgages, an employee manages a department, students submit projects
A Database Management System (DBMS) is a software package designed to store and manage databases.
DBMS study encompasses all aspects of CS – OS, languages, theory, AI, multimedia, logic
13
15
[Adapted from Peter Lyngbaek:OOPSLA:1991]
Why Use a DBMS?
Evolution of DBMSs
Data independence and efficient access. Reduced application development time. Data integrity and security. Uniform data administration. Concurrent access, recovery from crashes. Persistence, scalability, portability
Application
Application
Modeling Semantics Program Tx Mgmt
Modeling Semantics Program
Tx Mgmt File I/O
Application Modeling Semantics
Query Opt Tx Mgmt File I/O
File I/O DBMS Pre 1970’s
Non-relational (1970)
Application
Modeling (static/dynamic) Query Opt Tx Mgmt File I/O
DBMS Relational (1980)
DBMS OODBMS (1990)
14
16
4
Performance Issues
Moving On … Application
Mining XML support Workflow support Modeling (static/dynamic) Query Opt Tx Mgmt File I/O
DBMS
Non-DBMS Applications
Application
– Improving existing abstraction/adding new functionality – Increase in data volume without any other changes
Stream Processing Mining XML support Workflow support
….
Modeling (static/dynamic) Query Opt Tx Mgmt File I/O
[2000]
DSMS
[2009]
There are 2 primary sources of performance problems
[2013]
In the absence of the above, we can continue to improve performance (improvements are offset by the increase in data or additional software needed for supporting the abstraction)
17
Solutions
Impedance mismatch If you are studying DBMS, you have to understand this clearly Cpu is way faster than disk access. In fact, it is a Million or more times faster
There are several solutions – Doubling of cpu speed every 10 months or sooner (Moore’s Law) Have you heard of Neuromorphic computing? – Increase in disk i/o throughput – Use of multiple processors (Map/Reduce leverages this on a large scale)
– Cpu accessed are in nano secs and disk accesses are still in milli secs!
Because of this speed difference, it takes much longer to bring data from a disk So, cpu is waiting/idling most of the times for data (if not architected properly)!
19
This is called impedance mismatch!
The above CANNOT completely address the problems mentioned earlier New algorithms/techniques, optimization are critical (e.g., data structures, indexes, concurrent processing, …) The introduction of SSDs (solid state disks) are warranting a re-examination of some of the above issues – No impedance mismatch!!
18
20
5
My Thoughts
Data Models A data model is a collection of concepts for describing data. A schema is a description of a particular collection of data, using the a given data model. The relational model of data is the most widely used model today.
Improving functionality/abstraction is a natural evolution (not only in databases but in all aspects of life) However, this has a critical impact on performance Performance has to keep up with the increased burden in novel ways (e.g., scheduling, load shedding as witnessed in stream processing) There is an adjustment period for the performance to catch up and it always does
– Main concept: relation, basically a table with rows and columns. – Every relation has a schema, which describes the columns, or fields. 21
What can be done
23
Levels of Abstraction
We seem to bundle to much into our system and make it all or nothing situation! We need to unbundle and provide mechanisms to configure systems based on application needs For example, cc and recovery is not needed for a number of newer applications (e.g., stream processing, mining) Similarly, novel storage representations with appropriate search and efficient access can be useful for other applications We can/should not wait for the CPU and hardware speedup to address our performance issues
Many views, single conceptual (logical) schema and physical schema. – Views describe how users see the data. – Conceptual schema defines logical structure – Physical schema describes the files and indexes used.
This is happening now with Map/Reduce and NoSQL Systems
View 1
View 2
View 3
Conceptual Schema Physical Schema
Schemas are defined using DDL; data is modified/queried using DML. 22
24
6
Example: University Database
DBMS Components
Conceptual schema:
– Students(sid: string, name: string, login: string, age: integer, gpa:real) – Courses(cid: string, cname:string, credits:integer) – Enrolled(sid:string, cid:string, grade:string)
Physical schema:
– Relations stored as unordered files. – Index on first column of Students.
External Schema (View):
Architecture – typically client/server DDL, DML parsing, processing Physical DB design (files, indexing, access methods) Query processing, optimization Buffer management Concurrency control Recovery Utilities – backup, archiving, exporting, OLAP, report generator, physical design wizards, access to multiple DBMSs, designer tools, tuning of parameters, etc. etc.
– Course_info(cid:string,enrollment:integer) 25
Data Independence
27
DBMS Architecture
SQL commands from applications
Applications insulated from how data is structured and stored. Logical data independence: Protection from changes in logical structure of data. Physical data independence: Protection from changes in physical structure of data.
Query Optimization and Execution (parser, optimizer, Plan executor) Transaction
Files and Access Methods
Manager
Buffer Management
Recovery Manager
Lock Manager
Disk Space Management
Concurrency control
One of the most important benefits of using a DBMS!
Log 26
DB
System catalog, data files, Index files 28
7
Query Processing
Transaction: An Execution of a DB Program Key concept is transaction, which is an atomic (eithe all or nothing) sequence of database actions (reads/writes). Each transaction, executed completely, must leave the DB in a consistent state if DB is consistent when the transaction begins.
Need an expressive Query Language over a simple data model Non-procedural query language
– Burden of optimization on the system (not on the user as in hierarchical and network databases) – Query representation should not affect query optimization
Need for report generation
29
31
Concurrency Control
Concurrent execution of user programs essential for good DBMS performance.
The Log is
– Ti writes an object: the old value and the new value.
– Because disk accesses are frequent, and relatively slow, it is important to keep the cpu humming by working on several user programs concurrently.
The following actions are recorded in the log:
Log record must go to disk before the changed page!
– Ti commits/aborts: a log record indicating this action.
Log records chained together by Xact id, so it’s easy to undo a specific Xact (e.g., to resolve a deadlock). Log is often archived on “stable” storage. All log related activities (and in fact, all CC related activities such as lock/unlock, dealing with deadlocks etc.) are handled transparently by the DBMS.
Interleaving actions of different user programs can lead to inconsistency: e.g., check is cleared while account balance is being computed. DBMS ensures such problems do not arise: users can pretend they are using a single-user system.
30
32
8
Competitors
Databases make these folks happy ...
– ObjectStore, Versant, Ontos – a synthesis: object-relational model
End users and DBMS vendors DB application programmers
– E.g. smart webmasters
Database administrator (DBA) – – – –
Earlier Competitor: Object-Oriented model
Informix Universal Server, UniSQL, O2, Oracle
More recent competitor: Map/reduce, NoSQL – Works on unstructured data – Useful for one-time processing as opposed to data management – Overhead is low, however, programming level is high – Less expensive, use of commodity machines!!
Designs logical /physical schemas Handles security and authorization Data availability, crash recovery Database tuning as needs evolve
Must understand how a DBMS works! 33
Why Study the Relational Model?
Relational Database: Definitions
Most widely used model. – Vendors: IBM (Informix), Microsoft, Oracle, Sybase, etc.
35
“Legacy systems” in older models
– Instance : a table, with rows and columns. #rows = cardinality, #fields = degree / arity – Schema : specifies name of relation, plus name and type of each column.
– e.g., IBM’s IMS – Network models are not used today
More recently, NoSQL systems – Based on different data models – Key-store, document store, graph – Tradeoffs in functionality ACID vs. CAP (or BASE)
Relational database: a set of relations. Relation: made up of 2 parts:
34
E.g. Students(sid: string, name: string, login: string, age: integer, gpa: real)
Can think of a relation as a set of rows or tuples. (i.e., all rows are distinct) 36
9
Adding and Deleting Tuples
Example Instance of Students Relation sid 53666 53688 53650
name login Jones jones@cs Smith smith@eecs Smith smith@math
age 18 18 19
gpa 3.4 3.2 3.8
Can insert a single tuple using: INSERT INTO Students (sid, name, login, age, gpa) VALUES (‘53688’, ‘Smith’, ‘smith@ee’, 18, 3.2)
Cardinality = 3, degree = 5 , all rows distinct
Can delete all tuples satisfying some condition (e.g., name = Smith): DELETE FROM Students S WHERE S.name = ‘Smith’
Do all columns in a relation instance have to be distinct?
Powerful variants of these commands are available; more later! 37
39
Creating Relations in SQL
Integrity Constraints (ICs)
Creates the Students CREATE TABLE Students (sid: CHAR(20), relation. Observe that the name: CHAR(20), type (domain) of each field login: CHAR(10), is specified, and enforced by age: INTEGER, the DBMS whenever tuples gpa: REAL) are added or modified. As another example, the CREATE TABLE Enrolled Enrolled table holds (sid: CHAR(20), information about courses cid: CHAR(20), that students take. grade: CHAR(2))
IC: condition that must be true for any instance of the database; e.g., domain constraints. – ICs are specified when schema is defined. – ICs are checked when relations are modified.
A legal instance of a relation is one that satisfies all specified ICs.
If the DBMS checks ICs, stored data is more faithful to real-world meaning.
– DBMS should not allow illegal instances.
– Avoids data entry errors, too! 38
40
10
Relational Query Languages
Relational Model: Summary
A major strength of the relational model: supports simple, powerful querying of data. Queries can be written intuitively, and the DBMS is responsible for efficient evaluation.
A tabular representation of data. Simple and intuitive, currently the most widely used. Integrity constraints can be specified by the DBA, based on application semantics. DBMS checks for violations.
– The key: precise semantics for relational queries. – Allows the optimizer to extensively re-order operations, and still ensure that the answer does not change.
– Two important ICs: primary and foreign keys – In addition, we always have domain constraints.
Powerful and natural query languages exist.
41
43
The SQL Query Language
Relational Model: Summary (contd.)
The most widely used relational query language. Current standard is SQL-2011. To find all 18 year old students, we can write:
ACID properties are important
Concurrency control and Recovery together guarantee ACID properties
SELECT * FROM Students S WHERE S.age=18
sid
name
53666 Jones
login
age gpa
jones@cs
18 3.4
– Atomicity, consistency, isolation, and durability
53688 Smith smith@ee 18 3.2
•To find just names and logins, replace the first line: SELECT
S.name, S.login 42
44
11
Other data models
The older hierarchical and network models are not used much (although they are in use)
However, several non-relational data models have been proposed and currently used in specific contexts
They are called NoSQL DBMSs – Key-value store – Document – graph 45
Thank You !
© Sharma Chakravarthy ©
☞46
46
12