MANAGING ORGANIZATIONAL DATA AND INFORMATION

MANAGING ORGANIZATIONAL DATA AND INFORMATION CHAPTER PREVIEW Our previous chapters gave us an introduction to information systems and organizational t...

Author: Annice Jean Jenkins

3 downloads 0 Views 776KB Size

Report

Download PDF

Recommend Documents

Managing Organizational Interfaces: The Information Technology Factor

Managing Individual and Organizational Change

Information Extraction Challenges in Managing Unstructured Data

Analyzing And Managing Deviant Organizational Leaders

Successfully Planning and Managing Organizational Change

Future Data Centres and managing risks Information Classification: External

Benefits of Managing Organizational Capability

11. Managing Data: Information Systems. Computer Information Systems Categories

Business Information INSIGHT ON MANAGING AND USING DATA

Organizing and Managing Information PartTwo

Information Technology Managing Information Assets

Big Data Governance. Managing Evolving Data Privacy and Security Risks. March 14, Information Protection & Business Resilience

Concepts and use of the organizational information and strategic information

Managing Organizational Change during Institutional Upheaval

Managing Organizational Change: Metrics that Matter

Managing Marketing Spirit with Organizational Symbolism

Managing Mobile data growth

information management ICS 132: Organizational Information Systems keys to information mgmt organisational factors data, database, DBMS

Individuals, Teams, and Organizational Benefits of Managing Diversity

MODFLOW Managing Transient Data

Managing Organizational Change and Commitment in a Global IT Company

Managing Information. November 2003

PostgreSQL for Managing CAD and GIS Data

Data, Information, and Databases

MANAGING ORGANIZATIONAL DATA AND INFORMATION CHAPTER PREVIEW Our previous chapters gave us an introduction to information systems and organizational topics, as well as insights into how IT hardware and software function. These technologies and systems support organizations through their ability to handle—acquire, store, access, analyze, and transmit—electronic data. Properly managed, these data become information, a precious organizational resource and the basis of much competitive advantage. This chapter focuses mainly on how data are stored and accessed, with some discussion of data analysis techniques related to modern databases.

CHAPTER OUTLINE

LEARNING OBJECTIVES

Basics of Data Arrangement and Access The Traditional File Environment Databases: The Modern Approach Database Management Systems Logical Data Models Data Warehouses

1. Discuss traditional data file organization and its problems. 2. Explain how a database approach overcomes the problems associated with the traditional file environment, and discuss disadvantages of the database approach. 3. Describe how the three most common data models organize data, and the advantages and disadvantages of each model. 4. Describe how a multidimensional data model organizes data. 5. Distinguish between a data warehouse and a data mart. 6. Discuss the similarities and differences between data mining and text mining.

5.1 5.2 5.3 5.4 5.5 5.6

5

HOW TO MANAGE “TONS” OF DATA The Business Problem Large corporations manage literally tons of data. Referring to “tons” of data may be intuitive for paper records, but it is an unusual way to describe computer-stored information, which is usually measured by gigabytes and terabytes. Still, using “tons” may give a sense of just how much data a terabyte is. Measuring data by the ton assumes that a common 20-gigabyte hard drive weighs about one pound. Figure that the weight of a shared enclosure, power supply, and electronics will roughly double the drive’s weight, and we can say that 20 terabytes of data is approximately equivalent to one ton (20 GB 1 pound; 1 TB 1000 GB; 40 TB 40 GB 1000 2000 pounds). That much storage is cumbersome and very difficult to manage. How do enterprises deal effectively with such unwieldy mountains of information? Two data-intensive companies, Aetna and Boeing, discuss the problems they face in managing massive data stores, and how they solve them. For each company, data are a significant corporate asset resulting from huge investments of time and effort. The data are also the source of many problems for the employees who manage data. Aetna. Aetna is responsible for 4.4 tons of data (174.6 terabytes). Most of Aetna’s data is health care information. The insurance company maintains records for both health maintenance organization participants and customers covered by insurance policies. Aetna has detailed records of providers, such as doctors, hospitals, dentists, and pharmacies, and it keeps track of all the claims it has processed. Some of Aetna’s larger customers send tapes containing insured employee data, but Aetna is moving toward using the Internet to collect such data. Data integrity, backup, security, and availability are Aetna’s biggest concerns. The company’s data handling tools, procedures, and operations schedules have to stay ahead of not only the normal growth that results from the activities of the sales, underwriting, and claims departments, but also growth from corporate acquisitions and mergers. Boeing. Boeing makes sure the approximately 50 to 150 terabytes (1.3 to almost 4 tons) of data the company owns remain as reliable and safe as the aircraft and spacecraft the company builds. The 50- to 150-terabyte estimate reflects Boeing’s inability to know exactly how much data exist on its 150,000 desktop computers. Users do not necessarily store their data files on a server, which makes quantifying Boeing’s data stores difficult. For Boeing’s diverse terabytes, the firm’s basic concerns are data integrity, backup, security, and availability. Boeing’s data stores are spread out across 27 states and a few overseas locations, but most computing takes place in the Puget Sound area of Washington. A major data loss has not happened yet, but the company is aware of the risks and plans to centralize the backing up and restoring of data in the future.

The IT Solutions Aetna. Of Aetna’s data, 119.2 terabytes reside on mainframe-connected disk drives, while the remaining 55.4 terabytes are on disks attached to midrange computers. Almost all of these data are located at the company’s headquarters. Most of the information is in relational databases. In addition, outside customers have access to about 20 terabytes of the information. Four interconnected data centers containing 14 mainframes and more than 1,000 midrange servers process the data. It takes more than 4,100 direct-access storage devices to hold Aetna’s key databases. Aetna is increasing

its use of storage-area network (SAN) technology (discussed in Chapter 3) to centralize and streamline the management of its data. Aetna knows the importance of maintaining large amounts of data from a logical perspective. While the physical management of large data stores is a huge effort, failing to keep the data organized leads inexorably to user workflow problems, devaluation of the data as a corporate asset, and eventually, customer complaints. Boeing. Boeing has many mainframes and thousands of midrange servers. Much of the data exist in relational form, but across the enterprise, Boeing uses multiple file formats. Boeing uses SAN technology to manage its data. The company currently has dozens of different backup-and-restore software utilities. Each department buys its own backup media and performs its own backup-andrestore operations. Although hard disks are inexpensive these days, data management costs on a per-disk or per-tape basis are high enough that Boeing wants to significantly reduce the amount of disk and tape “white space—the portion of the media that Boeing does not use. In the future, Boeing wants to use a storage-on-demand model, whereby the company could simply rent whatever capacity it needed from an outside vendor and not have to worry about running out of space.

The Results Interestingly, the results of implementing the complex blend of hardware, software, networking, and database technologies have barely enabled these two companies to stay abreast of their data management needs. The reason is that the amount of data generated by each company is growing tremendously fast. The International Data Corporation estimates that by 2003, organizations worldwide will have to manage 1.6 million terabytes of data (that is 1,600 petabytes or 1.6 exabytes of data).

What We Learned from This Case While these companies say that good tools are important for managing terabytes of information, their IT and database administrators also agree that having a clear and comprehensive perspective on the data, via both logical and physical views, is even

more critical. Security, data integrity, and data availability are not trivial concerns, they point out, and giving users easy access to the data is a never-ending job. Managing multiple terabytes of data is far more complex than managing gigabytes of data. You cannot simply extrapolate from experiences with small and medium-size data stores to understand how to successfully manage “tons of data.” Even backing up a database can be a major problem if the time needed to finish copying the data exceeds the time available. Data integrity, backup, security, and availability are collectively the Holy Grail of dealing with large data stores. The sheer volume of data makes these goals a challenge, and a highly decentralized environment complicates matters even more. Developing and adhering to standardized data maintenance procedures in your organization will not only give you the best return on your data dollar investment, but will also let you sleep well at night. Multiple terabytes of the best-maintained data in the world are just a collection of bits without accurate, meaningful data definitions and schemas. When you analyze your company’s operating procedures for administering large data stores, make sure you incorporate the definitions of that information in your plan. Together, the data and their definitions are a corporate asset that contributes to your company’s bottom line, and that you cannot do without. (Source: “Managing Tons of Data,” Computerworld (April 23, 2001); aetna.com, boeing.com).

5.1

BASICS OF DATA ARRANGEMENT AND ACCESS Data, when properly managed, become the information upon which business decisions are based. Few business professionals are comfortable making or justifying a business decision that is not based on solid information, especially when modern data management techniques, coupled with modern hardware, software, and trained IS staff, can make access to that information rapid and easy. Certainly the managers at the two companies in the opening case would not make their marketing decisions without first conducting an extensive analysis of their data. Organizations must be able to collect, organize, analyze, and interpret data in order to survive in hypercompetitive global markets. And data management is vital to all business functions.

The Data Hierarchy A computer system organizes data in a hierarchy that begins with bits, and proceeds to bytes, fields, records, files, and databases (see Figure 5.1). Remember from Chapter 3 that a bit represents the smallest unit of data a computer can process (a 0 or a 1), and a group of eight bits, a byte, represents a single character, which can be a letter, a number, or a symbol. A logical grouping of characters into a word, a small group of words, or a complete number is called a field. For example, a student’s name in a university’s computer files would appear in the “name” field. A logical grouping of related fields, such as the student’s name, the courses taken, the date, and the grade, comprise a record. A logical grouping of related records is called a file. For example, the student records in a single course would constitute a data file for that course. A logical grouping of related files would constitute a database. The student course file could be grouped with files on students’ personal histories and financial backgrounds to create a student database.

Section 5.1

Basics of Data Arrangement and Access

127

File

Record

Field

Byte

Record

Field

Field

Byte

Bit

Field

Byte

Bit

Bit

Byte

Bit

A record describes an entity. An entity is a person, place, thing, or event about which information is maintained (such as a customer, employee, or product). Each characteristic or quality describing a particular entity is called an attribute (for example, customer name, employee number, product color). Every record in a file should contain at least one field that uniquely identifies that record so that the record can be retrieved, updated, and sorted. This identifier field is called the primary key. For example, a student record in a U.S. college would probably use the Social Security number as its primary key. In addition, locating a particular record may require the use of secondary keys. Secondary keys are other fields that have some identifying information, but typically do not identify the file with complete accuracy. For example, the student’s last name might be a secondary key. It should not be the primary key, as more than one student can have the same last name.

Storing and Accessing Records Records are stored in different ways on secondary storage media, and the arrangement determines the manner in which they can be accessed. As we learned in Chapter 3, with sequential access, data records must be retrieved in the same physical sequence in which they are stored. In direct, or random, access, users can retrieve records in any sequence, without regard to the actual physical order on the storage medium. Magnetic tape utilizes sequential file organization, whereas magnetic disks use direct file organization. Indexed sequential access method (ISAM). The indexed sequential access method (ISAM) uses an index of key fields to locate individual records (see Figure 5.2). An index to a file lists the key field of each record and where that record is physically located in storage. Records are stored on disks in their key sequence. A track index shows the highest value of the key field that can be found on a specific track. To locate a specific record, the track index is searched to locate the cylinder and the track containing the record. The track is then sequentially read to find the record. Direct file access method. The direct file access method uses the key field to locate the physical address of a record. This process employs a mathematical formula called a transform algorithm to translate the key field directly into the record’s storage location on disk. The algorithm performs a mathematical calculation on the record key, and the result of that calculation is the record’s address. The direct access method is most appropriate when individual records must be located directly and rapidly for immediate processing, when a few records in the file need to be retrieved at one time, and when the required records are found in no particular sequence.

Figure 5.1 Hierarchy of data for a computer-based file.

128

Chapter 5

Managing Organizational Data and Information

174

Cylinder

Highest Key

1

100

2

200

3

300

4

400

5

500

.

.

.

.

.

.

Data

Key

Track Index Track Number

Cylinder 1

Track Index

Highest Key

Track Number

Cylinder 2

Track Index

Highest Key

Track Number

Cylinder 3 Highest Key

1

10

1

110

1

210

2

20

2

120

2

220

3

30

3

130

3

230

4

40

4

140

4

240

5

50

5

150

5

250

6

60

6

160

6

260

7

70

7

170

7

270

8

80

8

180

8

280

.

.

.

.

.

.

.

.

.

.

.

.

Track #8

171

Figure 5.2

Data

Cylinder 2

172

Data

173

Data

174

Data

175

Data

176

Data

Indexed sequential access method.

Before you go on . . . 1. What are the smallest and largest units of the data hierarchy? 2. What is the difference between sequential and direct file access?

5.2

THE TRADITIONAL FILE ENVIRONMENT From the time of the first computer applications in business (mid-1950s) until the early 1970s, organizations managed their data in a file environment. This environment began because organizations typically began automating one application at a time. These systems grew independently, without overall planning. Each application required its own data, which were organized in a data file. A data file is a collection of logically related records. Therefore, in a file management environment, each application has a specific data file related to it, containing all

Section 5.2

Registrar's office

Class programs

The Traditional File Environment

129

Class file

Instructor data Student data Course data Registration data Accounting dept.

Accounts programs

Accounts file

•Employee data contains information of the instructor who is an employee.

Employee data

•Student data is repeated.

Student data Tuition data Financial aid

Athletics dept.

Sports programs

Sports file

•Athletic data contains information of the student who is an athlete.

Athlete data

•Financial aid data is repeated for the athletes receiving athletic scholarships.

Academic info Team data Financial aid

Figure 5.3 Computer-based files in the traditional file environment cause problems such as redundancy due to partial or full duplication, inconsistency across files, and data isolation. the data records needed by the application. Over time, organizations developed numerous applications, each with an associated, application-specific data file. For example, a university has many computer-based applications involving students. These applications include course registration, fee payment, and grades, among others. In a file management environment, each of these applications would have its own student data file (see Figure 5.3). This approach to data management, where the organization has multiple applications with related data files, is considered the traditional approach.

Problems with the File Approach The traditional file approach led to many problems. First, corporate applications typically share some common core functions, such as input, report generation, querying, and data browsing. However, these common functions typically were designed, coded, documented, and tested, at great expense, for each application. Moreover, users must be trained to use each application. A clerk moving from one functional area to another, for example, would likely have to be fully trained on a new application with different procedures, screen images, data formats, and detailed functionality. Traditional file environments thus often waste valuable resources creating and maintaining similar applications, as well as in training users how to use them. Other problems also arise with the traditional file systems. The first problem is data redundancy. As applications and their data files were created by different programmers over a period of time, the same piece of information could be duplicated in several places. In the university example, each data file will contain records about students,

130

Chapter 5

Managing Organizational Data and Information many of whom will be represented in other data files. Therefore, student files in the aggregate will contain some amount of duplicate data. For example, all student records are likely to have a field for name, student identification number, address, telephone, and so on. It is not uncommon in such file environments for a new student to have to give his or her name, address, phone number, and so forth to many different campus offices—the same information over and over. This process wastes physical computer storage media, the students’ time and effort, and the clerks’ time needed to enter and maintain the data. Data redundancy leads to the potential for data inconsistency. Data inconsistency means that the various copies of the data no longer agree. For example, if a student changes his or her address, the new address must be changed across all applications in the university that require the address. File organization also leads to difficulty in accessing data from different applications, a problem called data isolation. With applications uniquely designed and implemented, data files are likely to be organized differently, stored in different formats (e.g., height in inches versus height in centimeters), and often physically inaccessible to other applications. In the university example, an administrator who wanted to know which students taking advanced courses were also starting players on the football team would most likely not be able to get the answer from the computer-based file system. He or she would probably have to manually compare printed output data from the two data files. This manual process would take a great deal of time and effort and would ignore the greatest strengths of computers—fast processing and accurate storage. Keeping data in the form of files in computer systems seriously limits the productive potential of computers and of information systems users. The file environment does not allow users to retrieve needed data conveniently and efficiently, nor does it allow for multiple users to have concurrent, simultaneous access to the data. Not every user who has access to a computer system should have access to all the data files in the system. Security is difficult to enforce in the file environment, because new applications may be added to the system on an ad-hoc basis. The file environment may also cause data integrity problems. Data values must often meet integrity constraints—that is, they must be accurate and fit for their intended use. For example, the students’ Social Security data field should contain no alphabetic characters, and the students’ grade-point-average field should not be negative. It is difficult to place data integrity constraints across multiple data files. Finally, applications should not have to be developed with regard to how the data are stored. That is, the applications and data in computer systems should be independent—a characteristic called application/data independence. In the file environment, the applications and their associated data files are dependent on each other. Storing data in data files that were tightly linked to their applications eventually led to organizations having hundreds of applications and data files, with no one knowing what the applications did or what data they required. There was no central listing of data files, data elements, or definitions of the data. The numerous problems arising from the file environment approach led to the development of databases.

Before you go on . . . 1. What other problem is often found with the problem of data redundancy? 2. How does data isolation prevent different departments, for example, from using the same data file?

Section 5.3

5.3

Databases: The Modern Approach

DATABASES: THE MODERN APPROACH

A database, which is a logical group of related files, can eliminate many of the problems associated with a traditional file environment. With the database approach, all the data are typically contained in the same storage location, rather than residing in many different files across the organization. Unlike the traditional approach, in which different programs access the different data files, the database is arranged so that one set of software programs—the database management system—provides access to all the data. Therefore, data redundancy, data isolation, and data inconsistency are minimized, and data can be shared among all users of the data. In addition, security and data integrity are increased, and applications and data are independent of one another (see Figure 5.4).

Locating Data in Databases A database is a collection of related files, and where those related files are located can greatly affect user accessibility, query response times, data entry, security, and cost. In general, database files can be centralized or distributed. A centralized database has all the related files in one physical location. Centralized database files on large, mainframe computers were the main database platform for decades, primarily because of the enormous capital and operating costs of other alternatives. Not only do centralized databases save the expenses associated with multiple computers, but they also provide database administrators with the ability to work on a database as a whole at one location. Files can generally be made more consistent with each other when they are physically kept in one location because file changes can be made in a supervised and orderly fashion. Files are not accessible except via the centralized host computer, where they can be protected more easily from unauthorized access or modification. Also, recovery from disasters can be more easily accomplished at a central location. Like all centralized systems, however, centralized databases are vulnerable to a single point of failure. When the centralized database computer fails to function properly, all users suffer. Additionally, access speed is often a problem when users are widely dispersed and must do all of their data manipulations from great distances, thereby incurring transmission delays.

Registrar's office

Class programs Academic info Team data Employee data

Accounting dept.

Accounts programs

Database management system

Tuition data Financial data Student data Course data Registration data

Athletics dept.

Figure 5.4

Sports programs

A database management system (DBMS) provides access to all data in the database.

131

132

Chapter 5

Managing Organizational Data and Information A distributed database has complete copies of a database, or portions of a database, in more than one location, which is usually close to the user (see Figure 5.5). There are two types of distributed databases: replicated and partitioned. A replicated database has complete copies of the entire database in many locations, primarily to alleviate the single-point-of-failure problems of a centralized database as well as to increase user access responsiveness. There is significant overhead, however, in maintaining consistency among replicated databases, as records are added, modified, and deleted. A partitioned database is subdivided, so that each location has a portion of the entire database (usually the portion that meets users’ local needs). This type of database provides the response speed of localized files without the need to replicate all changes in multiple locations. One significant advantage of a partitioned database is that data in the files can be entered more quickly and kept more accurate by the users immediately responsible for the data. On the other hand, widespread access to potentially sensitive company data can significantly increase corporate security problems. Telecommunications costs and associated time delays can also be major factors.

Users New York

Users Los Angeles

Central Location

New York

Users Chicago

Users Kansas City (a)

Users New York

Users Los Angeles

New York

Los Angeles Central Location

New York

Figure 5.5

(a) Centralized database. (b) Distributed database with complete or partial copies of the central database in more than one location.

Kansas City

Chicago

Users Kansas City

Users Chicago (b)

Section 5.3

Databases: The Modern Approach

Creating the Database To create a database, designers must develop a conceptual design and a physical design. The conceptual design of a database is an abstract model of the database from the user or business perspective. The physical design shows how the database is actually arranged on storage devices. The conceptual database design describes how the data elements in the database are to be grouped. The design process identifies relationships among data elements and the most efficient way of grouping data elements together to meet information requirements. The process also identifies redundant data elements and the groupings of data elements required for specific applications. Groups of data are organized, refined, and streamlined until an overall logical view of the relationships among all of the data elements in the database appears. Entity-relationship modeling and normalization are employed to produce optimal database designs. Entity-relationship modeling. Database designers plan the database design in a process called entity-relationship modeling. They often document the conceptual data model with an entity-relationship (ER) diagram. ER diagrams consist of entities, attributes, and relationships, each of which is represented on the diagram. Entities are pictured in boxes, and relationships are shown in diamonds. The attributes for each entity are listed next to the entity, and the key field is underlined. Figure 5.6 shows an entityrelationship diagram. As defined earlier, an entity is something that can be identified in the users’ work environment. For example, consider student registration at a university. Students put their schedules together by selecting from a list of courses. Each class has one professor at this university (no team teaching). In this example, STUDENT, SCHEDULE, COURSE, and PROFESSOR would be examples of entities, as shown in Figure 5.6. Entities of a given type are grouped into entity classes. In our example, STUDENT, SCHEDULE, COURSE, and PROFESSOR are examples of entity classes. An instance of an entity class is the representation of a particular entity. Therefore, a particular STUDENT (James T. Smythe, 145–89–7123) is an instance of the STUDENT entity class; a particular schedule (91778–1) is an instance of the SCHEDULE entity class; a particular course (76890) is an instance of the COURSE entity class; and a particular professor (Ted Wilson, 115–65–7632) is an instance of the PROFESSOR entity class. Entity instances have identifiers, which are attributes that identify entity instances. For example, STUDENT instances can be identified with StudentIdentificationNumber; SCHEDULE instances can be identified with ScheduleNumber; COURSE instances can be identified with CourseNumber; and PROFESSOR instances can be identified with ProfessorIdentificationNumber. These identifiers are underlined on ER diagrams (see Figure 5.6). Entities have attributes, or properties, that describe the entity’s characteristics. In our example, examples of attributes for STUDENT would be StudentIdentificationNumber, StudentName, and StudentAddress. Examples of attributes for SCHEDULE would be ScheduleNumber and StudentName. Examples of attributes for COURSE would be CourseNumber, CourseName, CourseTime, and CoursePlace. Examples of attributes for PROFESSOR would be ProfessorIdentificationNumber, ProfessorName, and ProfessorDepartment. You might wonder why, in our example, StudentName and StudentIdentificationNumber could not be attributes of SCHEDULE and why ProfessorName could not be an attribute of COURSE. The answer is that they could. However, if you consider all interlinked university systems, the STUDENT entity class will be needed for other applications, such as fee payment and, ultimately, graduation. Also, the PROFESSOR entity class will be needed for payroll applications and insurance applications. Therefore, for flexibility, we need separate entity classes for STUDENT and PROFESSOR.

133

134

Chapter 5

Managing Organizational Data and Information

A student can have many courses.

Student

1

Can have

M

Can have

1

Schedule

M Can have A course can have many students.

M Course

A course can have only 1 professor.

1

A course can be on many schedules.

M

A schedule can have many courses.

Key

Entities

Can have Relationships A professor can have many courses.

M Keyfield

Professor (a) STUDENT

SCHEDULE

Student Identification Number

Schedule Number

Student Name Student Address

Student Name

COURSE

PROFESSOR Course Number

Professor Identification Number

Course Name Course Time Course Place

Professor Name Professor Department

(b)

Figure 5.6

Entity-relationship model.

Entities are associated with one another in relationships, which can include many entities. (Remember that relationships are noted by diamonds on ER diagrams.) The number of entities in a relationship is the degree of the relationship. Relationships between two items are common and are called binary relationships. There are three types of binary relationships. • In a 1:1 (one to one) relationship, a single-entity instance of one type is related to a single-entity instance of another type. Figure 5.6 shows STUDENT-SCHEDULE as a 1:1 relationship that relates a single STUDENT with a single SCHEDULE. That is, no student has more than one schedule, and no schedule is for more than one student. • The second type of relationship, 1:M (one to many) is represented by the COURSEPROFESSOR relationship in our example. This relationship means that a professor can have many courses, but each course can have only one professor. (Remember that we have no team-teaching in our university example—see Figure 5.6.)

Section 5.3

Order

Order Number

Figure 5.7

Number Part Part of Number Description Parts

Unit Price

Databases: The Modern Approach

Supplier Supplier Supplier Order Number Name Address Date

Delivery Date

135

Order Customer Customer Customer Total Number Name Address

Nonnormalized relation.

• The third type of relationship, M:M (many to many) is represented by the STUDENTCOURSE and SCHEDULE-COURSE relationships in our example. The first relationship means that a student can have many courses, and a course can have many students. The second relationship means that a schedule can have many courses and a course can appear on many schedules (see Figure 5.6). Normalization. In order to use a relational database model (discussed below) effectively, the data must be analyzed to eliminate redundant data elements. Normalization is a method for analyzing and reducing a relational database to its most streamlined form for minimum redundancy, maximum data integrity, and best processing performance. When data are normalized, attributes in the table depend only on the primary key. As an example, consider an automotive repair garage. This business takes orders from customers who want to have their cars repaired. In this example, ORDER, PART, SUPPLIER, and CUSTOMER would be examples of entities. In this example, there are many PARTS in an ORDER, but each PART can come from only one SUPPLIER. In a nonnormalized relation called ORDER (see Figure 5.7), each ORDER would have to repeat the name, description, and price of each PART needed to complete the ORDER, as well as the name and address of each SUPPLIER. This relation contains repeating groups and describes multiple entities. The normalization process breaks down the relation, ORDER, into smaller relations, each describing a single entity. This process is conceptually simpler and eliminates repeating groups (see Figure 5.8). For example, consider an order at the

1

Order

Order Order Number Date

Delivery Date

Multiple parts contained in an order

Order Customer Total Number

Ordered Parts

Relation containing basic order information

Supplier

4

Supplier Supplier Supplier Number Name Address

Relation contains supplier information

Customer

Customer Customer Customer Number Name Address

Relation contains customer information

Figure 5.8

Normalized relation.

Order Part Number Number

2

Part

3

Which order belongs to which customer

Part Part Number Description

Number of Parts

Relation contains details of order

Each part is ordered multiple times

Unit Price

Which parts are supplied by which supplier

Supplier Number

Relation contains part information

136

Chapter 5

Managing Organizational Data and Information automobile repair shop. The normalized relations can produce the order in the following manner. • The ORDER relation provides the OrderNumber (the key), OrderDate, DeliveryDate, OrderTotal, and CustomerNumber. • The key of the ORDER relation (OrderNumber) provides a link to the ORDEREDPARTS relation (the link numbered 1 in Figure 5.8). • The ORDERED-PARTS relation supplies the NumberofParts information to ORDER. • The key of the ORDERED-PARTS relation (PartNumber) provides a link to the PART relation (the link numbered 2 in Figure 5.8). • The PART relation supplies the PartDescription, UnitPrice, and SupplierNumber to ORDER. • The SupplierNumber in the PART relation provides a link to the SUPPLIER relation (the link numbered 3 in Figure 5.8). • The SUPPLIER relation provides the SupplierName and SupplierAddress to ORDER. • The CustomerNumber in ORDER provides a link to the CUSTOMER relation (the link numbered 4 in Figure 5.8). • The CUSTOMER relation supplies the CustomerName and CustomerAddress to ORDER. The automotive repair shop order now has all the necessary information.

Before you go on . . . 1. What are the common options for locating data in databases? 2. What tools and techniques are used to produce optimal database designs?

5.4

DATABASE MANAGEMENT SYSTEMS The software program (or group of programs) that provides access to a database is known as a database management system (DBMS). The DBMS permits an organization to store data in one location, from which it can be updated and retrieved, and it provides access to the stored data by various application programs. DBMSs also provide mechanisms for maintaining the integrity of stored information, managing security and user access, recovering information when the system fails, and accessing various database functions from within an application written in a third-generation, fourth-generation, or object-oriented language. The DBMS provides users with tools to add, delete, maintain, display, print, search, select, sort, and upgrade data. These tools range from easy-to-use natural-language interfaces to complex programming languages used for developing sophisticated database applications. Today, DBMSs are no longer entirely the domain of the IS department, but are installed in a broad range of information systems. Some are loaded on a single user’s PC and employed in an ad-hoc manner to support individual decision making. Others are located on several interconnected mainframe computers and are used to support large-scale transaction-processing systems, such as order entry and inventory control systems. Still others are interconnected throughout an organization’s local area networks, giving individual departments access to corporate data. Whatever their

Section 5.4

‘s A b o u t B u s i n e s s Box 5.1:

www.lexis-nexis.com

Database Management Systems

MKT

137

POM

Lexis-Nexis uses database to deal with court ruling

A Supreme Court decision is forcing electronic archive Lexis-Nexis to purge what may amount to hundreds of thousands of documents from its database. In New York Times vs. Tasini, a group of freelance writers sued newspaper and magazine publishers, including the New York Times, Newsday, and Time, saying that these news organizations had resold articles to online databases such as Lexis-Nexis without compensating or gaining the permission of freelance writers. Most newspapers now include language in their contracts that deal with this issue, but hundreds of thousands of articles from the 1980s and 1990s were resold to Lexis-Nexis without the writers’ consent. The Supreme Court ruled that the publishers had infringed on freelancers’ copyrights, and passed the case back to a lower court to determine whether the writers should receive damages. In the meantime, the news organizations and Lexis-Nexis are faced with a major data management challenge: removing all the affected stories from their databases. A spokesman for the New York Times Co. says that the company has put together a list of more than 115,000 stories by 27,000 authors, which ran between 1980 and 1995, that could be affected by the ruling. The spokesman

says the list will be passed on to Lexis-Nexis, which will have to remove the stories from its database. The New York Times online database, which is accessible through its Web site, will not be affected, because those stories date back only to 1996. At the moment, Lexis-Nexis is unclear about the scope of the project it faces. “We have no way of knowing offhand how many of the 3 billion documents on our database are written by freelancers,” says a spokesman. He says the company is preparing its database managers for the job as they wait to hear from publishers, who will be responsible for identifying the affected stories. Source: “Lexis-Nexis Faces Database Purge in Wake of Copyright Ruling,” Information Week (June 22, 2001); nytimes.com.

Questions 1. With three billion documents in the Lexis-Nexis database, how would you use the database management system to uncover the documents written by freelance writers? Can you do this realistically? 2. What does the Lexis-Nexis problem tell you about keeping historical data in databases?

purpose, database management systems are designed to be relatively invisible to the user. To interact with them as a user, however, it helps to understand how databases are structured and the procedures for interacting with them—even though much of their work is done behind the scenes and is therefore “transparent” to the end user. As essential as databases and DBMSs are to all areas of business, they must be carefully managed. In fact, as IT’s About Business Box 5.1 shows, databases can present a problem even with the most careful management.

Logical versus Physical View A database management system provides the ability for many different users to share data and process resources. But as there can be many different users, there are many different database needs. How can a single, unified database meet the differing requirements of so many users? For example, how can a single database be structured so that sales personnel can see customer, inventory, and production maintenance data while the human resources department maintains restricted access to private personnel data? A DBMS minimizes these problems by providing two views of the database data: a physical view and a logical view. The physical view deals with the actual, physical arrangement and location of data in the direct access storage devices (DASDs). Database specialists use the physical view to make efficient use of storage and processing resources.

138

Chapter 5

Managing Organizational Data and Information Users, however, may wish to see data differently from how they are stored, and they do not want to know all the technical details of physical storage. After all, a business user is primarily interested in using the information, not in how it is stored. The logical view, or user’s view, of a database program represents data in a format that is meaningful to a user and to the software programs that process that data. That is, the logical view tells the user, in user terms, what is in the database. One strength of a DBMS is that while there is only one physical view of the data, there can be an endless number of different logical views—one specifically tailored to each individual user, if necessary. This feature allows users to see database information in a more business-related way rather than from a technical, processing viewpoint. Clearly, users must adapt to the technical requirements of database information systems to some degree, but DBMS logical views allow the system to adapt to the business needs of the users.

DBMS Components There are four main components in a database management system: the data model, the data definition language, the data manipulation language, and the data dictionary. Data model. The data model defines the way data are conceptually structured. Examples include the hierarchical, network, relational, object-oriented, objectrelational, hypermedia, and multidimensional models. We will present a more detailed discussion of these models in a later section. Data definition language. The data definition language (DDL) defines what types of information are in the database and how they will be structured. The DDL defines each data element as it appears in the database before that data element is translated into the forms required by the applications. The DDL is essentially the link between the logical and physical views of the database. A DBMS user defines views, or schemas, using the DDL. The schema is the logical description of the entire database and the listing of all the data items and the relationships among them. Each user or application program utilizes a set of DDL statements to construct a listing of those data elements that are of interest. Because there may be many users and application programs using the same database, many different “user views” or subschemas can exist. Therefore, a subschema is the specific set of data from the database that is required by each application. The DDL is used to define the physical characteristics of each record, the fields within a record, and each field’s logical name, data type, and character length. The DDL is also used to specify relationships among the records. Other primary functions of the DDL are to: • Provide a means for associating related data. • Indicate the unique identifiers (or keys) of the records. • Set up data security access and change restrictions. Data manipulation language. The data manipulation language (DML) is used with third-generation, fourth-generation, or object-oriented languages to query the contents of the database, store or update information in the database, and develop database applications. The DML allows users to retrieve, sort, display, and delete the contents of a database. Requesting information from a database is the most commonly performed operation. Because users cannot generally request information in a natural-language form, query languages form an important component of a DBMS. Structured query lan-

Section 5.4

Database Management Systems

guage (SQL) is the most popular relational database language, combining both DML and DDL features. SQL offers the ability to perform complicated searches with relatively simple statements. Keywords such as SELECT (to specify a desired attribute), FROM (to specify the table to be used), and WHERE (to specify conditions to apply in the query) are typically used for the purpose of data manipulation. For example, a state legislator wants to send congratulatory letters to all students from her district graduating with honors from the state university. The university information systems staff would query the student relational database with an SQL statement such as SELECT (Student Name), FROM (Student Database), WHERE (Congressional District 7 and Grade Point Average 3.4). Data dictionary. The data dictionary stores definitions of data elements and data characteristics such as individuals, business functions, programs, and reports that use the data elements, as well as the physical representation, responsible parties in the organization (data ownership), and security. A data element represents a field. Besides listing the standard data name and aliases for the element, the dictionary lists the names that reference this element in specific systems and identifies the individuals, business functions, applications, and reports that use this data element. Data dictionaries provide many advantages to the organization. Because the data dictionary provides standard definitions for all data elements, the potential for data inconsistency is reduced. That is, the probability that the same data element will be used in different applications, but with a different name, is reduced. In addition, data dictionaries provide for faster program development because programmers do not have to create new data names. Data dictionaries also make it easier to modify data and information because programmers do not need to know where the data element is stored or what applications use the data element in order to make use of it in a program. Database environments ensure that data in the database are defined once and consistently, and that they are used for all applications whose data reside in the database. Applications request data elements from the database and are found and delivered by the DBMS. The programmer and end user do not have to specify in detail how or where the data are to be found. Database management systems provide many advantages to the organization: • • • • • • • •

Improved strategic use of corporate data Reduced complexity of the organization’s information systems environment Reduced data redundancy and inconsistency Enhanced data integrity Application-data independence Improved security Reduced application development and maintenance costs Improved flexibility of information systems

• Increased access and availability of data and information

Before you go on . . . 1. What is the difference between the logical and the physical views of the data in a database? 2. What are the main components of a DBMS?

139

140

Chapter 5

5.5

Managing Organizational Data and Information

LOGICAL DATA MODELS Just as there are many ways to structure business organizations, so also are there many ways to structure the data those organizations need. A manager’s ability to use a database is highly dependent on how the database is structured logically and physically. The DBMS separates the logical and physical views of the data, meaning that the programmer and end user do not have to know where and how the data are actually stored. In logically structuring a database, businesses need to consider the characteristics of the data and how the data will be accessed. The three most common data models are hierarchical, network, and relational. Other types of data models include multidimensional, object-oriented, objectrelational, and hypermedia. Using these models, database designers can build logical or conceptual views of data that can then be physically implemented into virtually any database with any DBMS. Hierarchical, network, and object-oriented DBMSs usually tie related data together through linked lists. Relational and multidimensional DBMSs relate data through information contained in the data. We’ll look at most of these models in this section.

Hierarchical Database Model The hierarchical database model rigidly structures data into an inverted “tree” in which each record contains two elements. The first is a single root or master field, often called a key, which identifies the type location or ordering of the records. The second is a variable number of subordinate fields, which define the rest of the data within a record. As a rule, while all fields have only one “parent,” each parent may have many “children.” An example of a hierarchical database is shown in Figure 5.9. The hierarchical structure was developed simply because hierarchical relationships are commonly found in many traditional business organizations and processes. For example, organization charts most often describe a hierarchical relationship—top management at the highest level, middle management at lower levels, and other employees at the lowest level. Within each hierarchy, each level of management may have many employees or levels of employees beneath it, but each employee generally

Sales

East Coast

China

Plates

Stemware

Bowls

Midwest

Flatware

China

Plates

Stemware

Bowls

West Coast

Flatware

China

Plates

Stemware

Bowls

Flatware

Key Region Product Category Product

Figure 5.9

Hierarchical database model.

Section 5.5 has only one manager. The hierarchical structure is characterized by this one-to-many relationship among data. The strongest advantage of the hierarchical database approach is the speed and efficiency with which it can be searched for data. This speed is possible because so much of the database is eliminated in the search with each “turn” going down the tree. As shown in Figure 5.9, half the records in the database (East Coast Sales) are eliminated once the search turns toward West Coast sales, and two-thirds of the West Coast Sales are eliminated once the search turns toward stemware. Organizational transaction processing systems (e.g., airline reservation systems) typically use the hierarchical approach for the speed and efficiency it offers. But the hierarchical model does have problems. Access to data in this model is predefined by the database administrator before the programs that access the data are written. Programmers must follow the hierarchy established by the data structure. Also, in the hierarchical model, each relationship must be explicitly defined when the database is created. Each record in a hierarchical database can contain only one key field, and only one relationship is allowed between any two fields. This structure can create a problem because real-world data do not always conform to such a strict hierarchy. For example, a product like dietetic cereal might be found in a cereal section of a grocery store and in a special section for dietetic foods, a situation that would be awkward for a hierarchical structure to handle. Moreover, all data searches must originate at the top or “root” of the tree and work downward from “parent” to “child.”

Network Database Model The network database model creates relationships among data through a linked-list structure in which subordinate records (called members, not children) can be linked to more than one data element (called an owner). Similar to the hierarchical model, the network model uses explicit links, called pointers, to link members and owners. Physically, pointers are storage addresses that contain the location of a related record. With the network approach, a member record can be linked to an owner record and, at the same time, itself can be an owner record linked to other sets of members. In this way, many-to-many relationships are possible with a network database model. (See the Website for an illustration of the network database model.) The network model essentially places no restrictions on the number of relationships or sets in which a field can be involved. This model, then, is more consistent with real-world business relationships where, for example, vendors have many customers and customers have many vendors. However, network databases are very complex. For every set of linked data elements, a pair of pointers must be maintained. As the number of sets or relationships increases, the work involved to expand and maintain the database becomes substantial. The network model is by far the most complicated type of database to design and implement.

Relational Database Model While most business organizations have been organized in a hierarchical fashion, most business data—especially accounting and financial data—have traditionally been organized into simple tables of columns and rows. Tables allow quick comparisons by row or column, and items are easy to retrieve by finding the point of intersection of a particular row and column. The relational database model is based on the simple concept of tables in order to capitalize on characteristics of rows and columns of data. In a relational database, these tables are called relations, and the model is based on the mathematical theory of sets and relations. In this model, each row of data is

Logical Data Models

141

142

Chapter 5

Managing Organizational Data and Information equivalent to a record, and each column of data is equivalent to a field. In the relational model terminology, a row is called a tuple, and a column is called an attribute. A relational database is not always, however, one big table (usually called a flat file) consisting of all attributes and all tuples. That design would likely entail far too much data redundancy. Instead, a database is usually designed as many related tables. There are some basic principles involved in creating a relational database. First, the order of tuples or attributes in a table is irrelevant, because their position relative to other tuples and attributes is irrelevant in finding data based on specific tuples and attributes. Second, each tuple must be uniquely identifiable by the data within the tuple—some sort of primary key data (for example, a Social Security number or employee number). Third, each table must have a unique identifier—the name of the relation. Fourth, there can be no duplicate attributes or tuples. Finally, there can be only one value in each row-column “cell” in a table. In a relational database, three basic operations are used to develop useful sets of data: select, join, and project. The select operation creates a subset consisting of all records in the file that meet stated criteria. “Select” creates, in other words, a subset of rows that meet certain criteria. The join operation combines relational tables to provide the user with more information than is available in individual tables. The project operation creates a subset consisting of columns in a table, permitting the user to create new tables that contain only the information required. One of the greatest advantages of the relational model is its conceptual simplicity and the ability to link records in a way that is not predefined (as is the case with hierarchical and network models). This ability provides great flexibility, particularly for end users. The relational or tabular model of data can be used in a variety of applications. Most people can easily visualize the relational model as a table, although the model does use some unfamiliar terminology. Consider the relational database example on East Coast managers shown in Figure 5.10. The table contains data about the entity called East Coast managers. Attributes or characteristics about the entity are name, title, age, and division. For example, the tuples, or occurrences of the entity, are the two records on A. Smith and W. Jones. The links among the data, and among tables, are implicit, as they are not necessarily physically linked in a storage device but are implicitly linked by the design of the tables into rows and columns. This property of implicit links provides perhaps the strongest benefit of the relational model—flexibility in relating data. Unlike the hierarchical and network models, where the only links are those rigidly built into the design, in the relational model all the data in a table and between tables can be linked, related, and compared. This ability gives the relational model much more data independence than do the other two models. That is, the logical design of data into tables can be more independent of the physical implementation. This independence allows more flexibility in implementing and modifying the logical design. Of course, as with all tables, an end user needs to know only two things: the identifier(s) of the tuple(s) to be searched, and the desired attribute(s).

Figure 5.10

Table of relational database model.

Name

Title

Age

Division

Smith, A.

Dir., Accounting

43

China

Jones, W.

Dir., Total Quality Management

32

Stemware

Lee, J.

Dir., Information Technology

46

China

Durham, K.

Manager, Production

35

Stemware

Stone, L.

Administrative Asst.

28

Flatware

Section 5.5

‘s A b o u t B u s i n e s s

muze.com

Logical Data Models

143

MKT

Box 5.2: Relational database helps Muze grow Some very successful sellers of books, music, and other entertainment on the Internet owe part of their success to a company called Muze. Based in New York City, Muze offers media retailers—both the click-and-modem and brick-and-mortar varieties—the type of information that influences consumers’ buying decisions. Muze aggregates and classifies millions of products from thousands of publishers in order to match contextual information and multimedia clips with the products of distributors around the world. Muze stores this massive amount of information in a relational database and licenses its database at a fraction of what it would cost sellers to compile their own information. For a retailer, the service makes the difference between putting a list of titles on the Internet and enabling shoppers to browse in an information-rich catalog environment. The information provided by Muze enables retail customers to get in-depth information about books, CDs, and videotapes without actually having the product in hand. The shopper initiates the search using an instore kiosk or by browsing the retailer’s Web site. Muze data, such as book reviews or multimedia clips, appear as search results within the format set up by the retailer’s

search engine. Muze also provides classification data that help the retailer’s search engine operate more efficiently. Further, Muze enables consumers to search for related music, books, and video products using a single query. Muze’s database is operable with retailers’ systems, regardless of database structure. An important factor behind Muze’s choice of IBM’s DB2 Database is that it provides support for XML structure, which enables data to be interoperable among different database systems. Amazon.com and hundreds of other Muze customers receive daily feeds from the Muze database, enabling their customers to have the latest information about their purchases. Source: ibm.com; muze.com.

Questions 1. Are there any disadvantages for businesses using Muze’s databases? 2. What is Muze’s core competency and how does database technology facilitate that core competency?

The relational model is currently the most popular of the three most common database structures because it provides the most flexibility and ease of use. But this model has some disadvantages. Because large-scale databases may be composed of many interrelated tables, the overall design may be complex and therefore have slower search and access times (as compared to the hierarchical and network models). The slower search and access times may result in processing inefficiencies that lead to an initial lack of acceptance of the relational model. These processing inefficiencies, however, are continually being reduced through improved database design and programming. Second, data integrity is not inherently a part of this model, as it is with hierarchical and network models. Therefore, it must be enforced with good design principles. IT’s About Business Box 5.2 provides an example of how a properly designed relational database can drive the success of a business.

Advantages and Disadvantages of the Three Database Models The main advantage of the hierarchical and network database models is processing efficiency. The hierarchical and network structures are relatively easy for users to understand because they reflect the pattern of real-world business relationships. In addition, the hierarchical structure allows for data integrity to be easily maintained. Hierarchical and network structures have several disadvantages, though. All the access paths, directories, and indices must be specified in advance. Once specified, they are not easily changed without a major programming effort. Therefore, these designs

144

Chapter 5

Managing Organizational Data and Information

Manager’s Checklist 5.1 Advantages and Disadvantages of Logical Data Models

Model

Advantages

Disadvantages

Hierarchical database

Searching is fast and efficient.

Access to data is predefined by exclusively hierarchical relationships, predetermined by administrator. Limited search/query flexibility. Not all data are naturally hierarchical.

Network database

Many more relationships can be defined. There is greater speed and efficiency than with relational database models.

This is the most complicated model to design, implement, and maintain. Greater query flexibility than with hierarchical model, but less than with relational mode.

Relational database

Conceptual simplicity; there are no predefined relationships among data. High flexibility in ad-hoc querying. New data and records can be added easily.

Processing efficiency and speed are lower. Data redundancy is common, requiring additional maintenance.

have low flexibility. Hierarchical and network structures are programming intensive, time-consuming, difficult to install, and difficult to remedy if design errors occur. The two structures do not support ad-hoc, English-language-like inquiries for information. The advantages of relational DBMSs include high flexibility in regard to ad-hoc queries, power to combine information from different sources, simplicity of design and maintenance, and the ability to add new data and records without disturbing existing applications. The disadvantages of relational DBMSs include their relatively low processing efficiency. These systems are somewhat slower because they typically require many accesses to the data stored on disk to carry out the select, join, and project commands. Relational systems do not have the large number of pointers carried by hierarchical systems, which speed search and retrieval. Further, large relational databases may be designed to have some data redundancy in order to make retrieval of data more efficient. The same data element may be stored in multiple tables. Special arrangements are necessary to ensure that all copies of the same data element are updated together. Manager’s Checklist 5.1 summarizes the advantages and disadvantages of the three common database models.

Emerging Data Models Three emerging data models are multidimensional, object-oriented, and hypermedia. The object-oriented and hypermedia data models are discussed here, and the multidimensional data model is discussed in the section on data warehousing, later in this chapter. Object-oriented database model. A recent development in databases is the objectoriented model. Although no common definition for this model has yet emerged, there is agreement as to some of its features. The central idea is that of an object—a small amount of data put together (encapsulated) with all the data needed in order to

Section 5.5 perform an operation with that data (as described in Chapter 4). Terminology in the object-oriented model, similar to object-oriented programming languages, consists of objects, attributes, classes, methods, and messages. An object is similar to an entity in that it represents a person, place, or thing, but it also contains all of the data that the object needs in order to perform an operation. Similarly, attributes are characteristics that describe the state of that object—the attribute values for an object at a given period in time (for example, the age of an employee). A method is an operation, action, or a behavior the object may undergo (for example, a product may be sold). A message from other objects activates operations contained within the object. Once an operation is activated, it will often send another message to a third object, which, in turn, may activate methods within that object, and so on. Every object is an instance of some class. An object’s class defines all the messages to which the object will respond, as well as the way in which objects of this class are implemented. Classes are typically arranged in a tree-like structure, connecting superclasses to their subclasses. The links or relationships between a superclass and a subclass are often called IS-A links. For example, a “truck” class is a subclass of a “motor vehicle” class; a truck “is a” motor vehicle. The complete chain of IS-A links shows that all subclasses inherit all behaviors and attributes defined by their superclass, as well as having additional behaviors and attributes of their own. Object-oriented databases can be particularly helpful in multimedia environments, such as in many manufacturing sites. Data from design blueprints, photographic images of parts, operational acoustic signatures, and test or quality-control data all can be combined into one object, itself consisting of structures and operations. For companies with widely distributed offices, an object-oriented database can provide users with a view of data throughout the overall system. In general, objectoriented databases allow organizations to structure their data and use them in ways that would be impossible, or at least very difficult, with other database models. Object-oriented databases can be used in some very strategic ways, as shown by the following example.

EXAMPLE Promoting business and tourism via database. The North Cholla Provincial Government, one of eight provincial governments in Korea, will use Computer Associates’ Jasmine for an Internet-based multimedia Web application designed to increase business and tourism in the province. Jasmine is a pure object-oriented database with class libraries that manage multimedia data, including bitmaps, animation, audio, and full-motion video. Using the multimedia capabilities supported by Jasmine, visitors will be able to take a virtual tour of the North Cholla Province by clicking on objects that inform and entertain using text, animation, video, and audio. Current and historical information about the Province will be accessible in such areas as weather, transportation facilities, government services, lodging, and dining. The goal is to promote and enhance the image of North Cholla Province around the world, increase business opportunities and tourism, and ultimately to improve the standard of living of North Cholla Province citizens. (Source: cai.com.)

●

Object-relational database model. The object-relational database model adds new object storage capabilities to relational database management systems. Systems based on this model integrate management of traditional fielded data, complex objects such as time-series and geospatial data (e.g., maps and photos derived from satellite transmissions), and diverse binary media such as audio, video, images, applets, and formatted and unformatted text. Object-relational database management systems include

Logical Data Models

145

146

Chapter 5

Managing Organizational Data and Information both data and processes; that is, what information the users have and what they are going to do with it. Hypermedia database model. The hypermedia database model stores chunks of information in the form of nodes connected by links established by the user. The nodes can contain text, graphics, sound, full-motion video, or executable computer programs. Searching for information does not have to follow a predetermined organizational scheme. Instead, users can branch to related information in any kind of relationship. The relationship between nodes is less structured than in a traditional DBMS. In most systems, each node can be displayed on a screen. The screen also displays the links between the node depicted and other nodes in the database.

Other Database Models Because a database management system need not be confined to storing just words and numbers, firms use them to store graphics, sounds, and video as well. These capabilities have led to specialized databases, depending on the type or format of data stored. For example, a geographical information database may contain locational data for overlaying on maps or images. Using this type of data, users are able to spatially view customer and vendor locations instead of simply reading the actual addresses. A knowledge database can store decision rules used to evaluate situations and help users make decisions like an expert. A multimedia database can store data on many media—sounds, video, images, graphic animation, and text.

Small-Footprint Databases Small-footprint databases enable organizations to put certain types of data in the field where the workers are. These databases offer more information, more readily available, in a form that is accessible. Where once laptops were the only portable machines capable of running a database, advances in technology such as more powerful CPUs and increased memory at lower cost are enabling handheld devices and smart phones to run some form of an SQL database and to synchronize that mobile database with a central database at headquarters. Small-footprint databases have replication mechanisms that take into account the occasionally connected nature of laptops and handhelds, that are programmed to resolve replication conflicts among mobile users, and that ensure that data synchronization will survive a low-quality wireless or modem connection. Small-footprint database technology also runs on personal digital assistants, such as those from Palm or Psion, and embedded specialty devices and appliances, like a barcode scanner or medical tool. The following examples show applications of small-footprint databases.

EXAMPLES Book reps with small footprints. When Simon & Schuster (simonsays.com) built a sales-force automation system, the company needed to fit the software—application, presentation tools, databases, files—on the laptops of its sales representatives. The company wanted to give its salespeople enough information to take on the road and make a presentation about an upcoming title to bookstore buyers in their territories. The company knew that it could not put all the data in the firm’s databases on the laptops, but salespeople needed information about the bookstore’s account (order status, historical sales, and special handling and discounts for which it was eligible) as well as marketing information about new books, such as sales forecasts, cover art, and promotion plans. The key piece of technology that made the laptop project possible was a

Section 5.6 small-footprint database, one whose selective replication features allowed Simon & Schuster to put just the data relevant to each salesperson on each one’s laptop and to collect the orders once the sales call was over. Small footprints on the Palm. Bidcom’s (citadon.com) site-inspection application places a Palm in the hands of construction company supervisors, who come to the job site to count workers, add up the materials consumed, and note safety issues. Each day, the inspectors go to a different site. The Palm has to be loaded with that day’s project information. At the end of the day, the updates are sent to a central Oracle database, where they are converted into Web pages so that company executives can track each day’s progress.

●

Before you go on . . . 1. What are the relative advantages and disadvantages of hierarchical, network, and relational databases? 2. How might a company use a multimedia DBMS for competitive advantage?

5.6

DATA WAREHOUSES

Access to the accurate and timely information needed for the management of daily operations and long-term strategic planning has become increasingly important in the modern global marketplace. Unfortunately, it has not always been easy to identify, access, and retrieve the required information. For that reason, for many years companies have been working to improve access to data for decision making and analysis. To date, most of these efforts have focused on enhancing or replacing the online transaction-processing systems that are the entry point for most company data. These operational systems now contain huge amounts of data, so companies are focusing their attention on making this information available to end users through efficient organization and access management. The focus of technology has shifted from data input and capture through the firm’s operational systems to information access and availability provided by the firm’s data warehouse. A data warehouse is a relational or multidimensional database management system designed to support management decision making. A data warehouse are oriented around the major business subjects of the enterprise, such as customer, vendor, product, or activity. The data in the “warehouse” are stored in a single, agreed-upon format even when underlying operational applications store the data differently. For example, one operational application may store the date as year-month-day, and another may store it as month-day-year. In the data warehouse, the date will have a consistent format throughout (e.g., month-day-year). The data warehouse transforms data into a more useful resource by grouping them more conveniently for end users, putting them into more usable formats, enabling them to be analyzed, and dispersing them to appropriate working groups to increase availability and accessibility (see Figure 5.11). Data warehouses contain current detailed data, historical detailed data, lightly summarized data, highly summarized data, and metadata (discussed below). Current and historical detailed data are voluminous because they are stored at the highest level of detail (the least amount of summarization). Lightly and highly summarized data are necessary to save processing time when users request them and are readily

Data Warehouses

147

148

Chapter 5

Managing Organizational Data and Information Applications EIS/DSS Access Replication Data mart

Legacy

Select Extract

Metadata Repository

Marketing

Transform OLAP

Integrate Maintain

Enterprise Data Warehouse

Data mart Risk management

Preparation External

Operational Systems/Data

Target database(s) (RDB, MDDB)

get sls.mkt

Custom-built applications (4GL tools)

A P I S

M i d d l e w a r e

Data mart

Production reporting tools

Relational query tools OLAP/ROLAP

Engineering Data mining

Web browsers

Figure 5.11 Data warehouse framework and views. (Source: cutter.com and Data Management Strategies, Cutter Information Operations, February 1999.)

accessible. For example, in a retail sales organization such as Sears, current detailed data might be the daily sales data over the last 90 days. Historical detailed data would be sales data over the last 10 years. Lightly summarized data could be weekly sales of a store or weekly sales of a particular product. Highly summarized data could be annual sales for a store, annual sales for a region, or annual sales of a product. Interestingly, older historical data are the least-used data in a data warehouse, while highly summarized data are the most used. Metadata, which are “data about data,” are important for designing, constructing, retrieving, and controlling the warehouse data. Data warehouse users need to know what data are available, what their sources are, where they are, and how to access them. Technical metadata include where the data come from, how the data were changed, how the data are organized, how the data are stored, who owns the data, who is responsible for the data and how to contact them, who can access the data, and the date of last update. Business metadata include what data are available, where the data are, what the data mean (description of data contents, units of measure, definitions and aliases for the data, details of how the data were calculated), how to access the data, predefined reports and queries, and how current the data are. A data warehouse offers many business advantages: • It provides business users with a “customer-centric” view of the company’s heterogeneous data by helping to integrate data from sales, service, manufacturing and distribution, and other customer-related business systems. • It provides added value to the company’s customers by allowing them to access better information when data warehousing is coupled with Internet technology. • It consolidates data about individual customers and provides a repository of all customer contacts for segmentation modeling, customer retention planning, and crosssales analysis. • It removes barriers among functional areas by offering a way to reconcile views from multiple areas, thus providing a look at activities that cross functional lines.

Section 5.6

Data Warehouses

149

• It reports on trends across multidivisional, multinational operating units, including trends or relationships in areas such as merchandising, production planning, and so forth. Table 5.1 presents some strategic uses of data warehousing in various industries, and IT’s About Business Box 5.3 provides an example of Sprint’s strategic application of its data warehouse.

Multidimensional Database Model Multidimensional databases are often the core of data warehouses. In a multidimensional database, the data are intimately related and can be viewed and analyzed from different perspectives, which are called dimensions. A multidimensional database allows for the effective, efficient, and convenient storage and retrieval of large volumes of data. The data in such databases are analyzed by online analytical processing (OLAP). In multidimensional databases, data are stored in arrays, which are the fundamental elements of these databases. Similar to tables in the relational database model, arrays group related information in columns and rows. Multidimensional databases, however, typically consist of at least three dimensions. The visualization of more than three dimensions becomes increasingly complex for human beings. Therefore, most examples pertaining to multidimensional databases artificially limit the dimensions to only three, depicting the resulting database as a cube. Dimensions are the edges of the cube, and represent the primary views of the business data. For example, sales data could be represented as a three-dimensional cube with the dimensions of product, geography (market), and time. A specific block

Table 5.1

Strategic Uses of Data Warehousing

Industry

Functional Areas of Use

Strategic Use

Airline

Operations; marketing

Crew assignment, aircraft deployment, mix of fares, analysis of route profitability, frequent-flyer program promotions

Apparel

Distribution; marketing

Merchandising and product replenishment

Banking

Product development; operations; marketing

Customer service, trend analysis, product and service promotions; reduction of IS expenses

Credit card

Product development; marketing

Customer service, new information service, fraud detection

Health care

Operations

Reduction of operational expenses

Investment and insurance

Product development; operations; marketing

Risk management, market movements analysis; customer tendencies analysis, portfolio management

Personal care

Distribution; marketing

Distributions decisions, product promotions, sales decisions, pricing policy

Public sector

Operations

Intelligence gathering

Retail chain

Distribution; marketing

Trend analysis, buying-pattern analysis, pricing policy, inventory control, sales promotions, optimal distribution channel

Steel

Manufacturing

Pattern analysis (quality control)

Telecommunications

Product development; operations; marketing

New product and service promotions, reduction of IS budget, profitability analysis

Source: Y. T. Park, “Strategic Uses of Data Warehouses,” Journal of Data Warehousing (April 1997).

150

Chapter 5

Managing Organizational Data and Information

‘s A b o u t B u s i n e s s

sprint.com

MKT

Box 5.3: Using a data warehouse to manage customer churn Stiff competition in the telecommunications market is constantly tempting customers with incentives to switch providers. That makes churn management, the process of acquiring and retaining customers, a major challenge for the carriers. To turn capricious service subscribers into loyal customers, Sprint’s Global Markets Group relies on a 5-terabyte customer data warehouse and business intelligence system. The system’s goals are to increase the amount of business Sprint does with its business customers and to identify customers who may be about to defect. Sprint says that churn management is not just about retaining a customer. It is also about retaining and growing its business. It is easier and less expensive to “up-sell” (sell new and perhaps more expensive services to existing customers) than it is to acquire new customers. Sprint’s customer-management effort comes as the range of services has expanded in recent years to include local and long-distance voice communications, Internet access, data communications, and wireless services. Understanding a customer’s needs is critical. Sprint has found that customers who subscribe to two or more services are much more likely to stay with Sprint than are customers who use only one service. Sprint’s data warehouse contains information from its customer-billing and customer-service records, augmented with external, publicly available information

about business customers. It is the first time that Sprint has integrated sales and customer data into one system. Sprint uses the information to build profiles of its business customers and their service needs. That data is analyzed using SAS business-intelligence tools, allowing sales and marketing managers to see what additional services they might sell to customers. Using predictive analysis techniques, Sprint can identify customers who may be about to move to another carrier. Warning signs include reduced use of a Sprint service. That information gives Sprint the chance to correct the problem or offer incentives for users to stay put. Sprint’s customer churn rate is lower than it was before the company began using the system. Since the system went live, it has saved Sprint one million dollars in what the company used to pay outside data-analysis service providers and database marketing firms. Further, the analysis is done quicker—in hours, rather than weeks. Source: “Managing Customer Churn,” Information Week (May 14, 2001); sprint.com; sas.com.

Questions 1. Why is it more important to retain customers than to acquire customers? 2. How does Sprint’s data warehouse contribute to customer retention?

in this cube (at the intersection of all three dimensions) represents the sales of a specific product, to customers in a specific market, on a certain date (time). Multidimensional databases are typically a more efficient and effective means of storing large amounts of data than relational databases. There are several reasons for these advantages over relational databases: • Data in multidimensional databases can be presented and navigated with relative ease. • Multidimensional databases are easier to maintain. • Multidimensional databases are significantly faster than relational databases as a result of the additional dimensions and the anticipation of how the data will be accessed by users. For an example to differentiate between relational and multidimensional databases, suppose your company has four products (nuts, screws, bolts, and washers), which have been sold in three territories (East, West, and Central) for the last three years (1999, 2000, 2001). In a relational database, these sales data would look like Figures 5.12, a, b, and c. In a multidimensional database, these data would be represented by a three-dimensional matrix, as shown in Figure 5.14. We would say that this matrix represents sales dimensioned by products and regions and year. Notice that we can see

Section 5.6 (a) 1999

(b) 2000

Product

Region

Nuts

East

Nuts

West

Nuts

Central

Screws

East

40

Screws

West

Screws Bolts

Sales

(c) 2001

Product

Region

50

Nuts

East

60

Nuts

West

100

Nuts

Central

Screws

East

50

70

Screws

West

Central

80

Screws

Central

East

90

Bolts

East

100

Bolts

West

120

Bolts

West

Bolts

Central

140

Bolts

Central

Washers

East

20

Washers

East

30

Washers

West

10

Washers

West

Washers

Central

30

Washers

Central

Figure 5.12

Data Warehouses

Product

Region

60

Nuts

East

70

70

Nuts

West

80

110

Nuts

Central

Screws

East

60

80

Screws

West

90

90

Screws

Central

100

Bolts

East

110

130

Bolts

West

140

150

Bolts

Central

160

Washers

East

40

20

Washers

West

30

40

Washers

Central

50

Sales

Sales

120

Relational databases.

only sales in 1999 in Figure 5.13a. Therefore, sales in 2000 and 2001 are shown in Figures 5.13 b and c. Figure 5.14 shows the equivalence between relational and multidimensional databases.

Data Marts Data warehousing approaches can range from simple, called a data mart, to complex, the enterprise data warehouse. These approaches differ in scale and complexity. In practice, however, few organizations begin by implementing an enterprise data warehouse for the entire organization because the cost and time frame are prohibitive. Instead, an organization often has a cooperating collection of data marts forming a virtual data warehouse. Therefore, data marts are implemented most often, usually as the first step in proving the usefulness of the technologies to solve business problems.

East West

70

60

110

40

80

90

140

30

100 160

50

Central 120 East

50

40

90

20

West

60

70

120

10

Central

100

80

140

30

Nuts Screws Bolts Washers

(c)

Nuts Screws Bolts Washers 2001

60

50

100

30

70

80

130

20

Central 110

90

150

40

East 2001 2000 1999

West

(b)

Nuts Screws Bolts Washers 2000 East

50

40

90

20

West

60

70

120

10

Central 100

80

140

30

(a)

Nuts Screws Bolts Washers 1999

Figure 5.13

Multidimensional databases.

151

152

Chapter 5

Managing Organizational Data and Information

Product

Region

Nuts

East

Nuts

West

Nuts

Central

Screws

East

40

Screws

West

70

Screws

Central

80

Bolts

East

90

Bolts

West

120

Bolts

Central

140

Washers

East

20

Washers

West

10

Washers

Central

30

Product

Region

Sales

Nuts

East

Nuts

West

Nuts

Central

Screws

East

50

Screws

West

80

Screws

Central

90

Bolts

East

100

Bolts

West

130

Bolts

Central

150

Washers

East

30

Washers

West

20

Washers

Central

40

Product

Region

Sales

Nuts

East

70

Nuts

West

80

Nuts

Central

Screws

East

Screws

West

Screws

Central

100

Bolts

East

110

Bolts

West

140

Bolts

Central

160

Washers

East

40

Washers

West

30

Washers

Central

50

Figure 5.14

Sales 50 60 100 East

50

40

90

20

West

60

70

120

10

Central

100

80

140

30

(a) 1999

Nuts

Screws

Bolts

Washers

East

60

50

100

30

West

70

80

130

20

Central

110

90

150

40

1999

60 70 110

(b) 2000

Nuts

Screws

Bolts

Washers

East

70

60

110

40

West

80

90

140

30

Central

120

100

160

50

2000

120 60 90

(c) 2001

Nuts

Screws

Bolts

2001

Equivalence between relational and multidimensional databases.

Washers

Section 5.6 A data mart is a scaled-down version of a data warehouse that focuses on a particular subject area. The data mart is usually designed to support the unique business requirements of a specific department or business process. A company can have many data marts, each focused on a subset of the entire firm. The trend toward greater use of the data mart often is driven by end users and departments that want to build local data marts to meet their specific needs instead of waiting for a companywide enterprise data warehouse. Because of its reduced scope, a data mart takes less time to build, costs less, and is less complex than an enterprise data warehouse. It is easier to agree on common data definitions for a single subject area than for an entire company. Therefore, a data mart is appropriate when a company needs to improve data access in a targeted area, such as the marketing department. However, the indiscriminate introduction of multiple data marts with no linkage to each other or to an enterprise data warehouse will cause problems. Because these data marts can proliferate quickly, they may create the same problems that earlier file management systems did, including data redundancy, data isolation, data integrity problems, and so on.

Data Mining Data mining provides a means of extracting previously unknown, predictive information from the base of accessible data in data warehouses. Data mining tools use sophisticated, automated algorithms to discover hidden patterns, correlations, and relationships among organizational data. These tools are used to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. For example, one typical predictive problem is targeted marketing. Data mining can use data on past promotional mailings to identify the targets most likely to maximize the return on the company’s investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together, such as milk and greeting cards at a supermarket. Another pattern discovery problem is detecting fraudulent credit card transactions. Functions. Data mining identifies facts or suggests conclusions based on sifting through the data to discover either patterns or anomalies. Data mining has five main functions: • Classification: infers the defining characteristics of a certain group (such as customers who have been lost to competitors). • Clustering: midentifies groups of items that share a particular characteristic. (Clustering differs from classification in that no pre-defining characteristic is given in classification.) • Association: identifies relationships between events that occur at one time (such as the contents of a shopping basket). • Sequencing: similar to association, except that the relationship exists over a period of time (such as repeat visits to a supermarket or use of a financial planning product). • Forecasting: estimates future values based on patterns within large sets of data (such as demand forecasting). Applications. The applications for data mining are wide-ranging. They include customer relationships (that is, customer retention); cross-selling and up-selling;

Data Warehouses

153

154

Chapter 5

Managing Organizational Data and Information

‘s A b o u t B u s i n e s s

eddiebauer.com

MKT

Box 5.4: Data mining at Eddie Bauer A division of the Spiegel Group, apparel retailer Eddie Bauer has more than 500 stores in 49 states. Bauer publishes 44 catalogs annually with a circulation of 105 million in the United States and Canada. The company is trying to build one-to-one relationships with more than 15 million retail, catalog, and Internet customers. An average customer might have placed 20 orders in the last five years, with four items per order. Now, add to those 15 million buying histories what you know about the customer: name, age, and address, for example. The result is several terabytes of raw data—and a problem. Before implementing its data mining and data warehousing projects, the company had data in different places. So, managers decided it was time to implement data mining and data warehousing projects that could consolidate that disparate data and make them available to everyone who needed them. The company realized that it should operate under a customer-centric point of view rather than a channel-centric point of view. To do so, Bauer had to rethink the metrics it used to gauge its success and used data mining of the firm’s data warehouse to determine what these metrics need to be. Metrics such as comparing annual sales figures began to give way to customer-relationship-oriented measures such as determining a customer’s current value (e.g., likelihood to buy, type and quantity of product likely to buy, amount of money likely to spend, satisfaction with Bauer) and projecting his or her lifetime value. Data mining for customer relationship management gives Eddie Bauer effective ways of analyzing customer

behavior, together with the power to make projections based on that customer information. Data mining typically relates to direct-mail campaigns at Eddie Bauer— both within the catalog and retail sectors. Increasingly, such activities also focus on the Internet. Bauer uses predictive modeling to decide who receives specialized mailings and catalogs. For example, each year Eddie Bauer features an outerwear special, and, thanks to data mining, the company can determine which customers are most likely to buy. Data mining also allows Bauer to determine seasonal buying habits. Then the company can identify people with similar characteristics and target them with mailings to bring them into the store or encourage them to buy from the catalog. Customer loyalty is critical to Bauer, and its data mining has made tracking and maintaining the customer base easier and more efficient. Data mining helps Bauer find pieces of information that are not just common sense and use them to retain customers more effectively. Source: sas.com; eddiebauer.com.

Questions 1. What is the difference between a customer-centric point of view for a company and a channel-centric point of view? Describe both in relation to Eddie Bauer. 2. What other metrics might Bauer use to measure its success?

campaign management; market, channel, and pricing analysis; and customer segmentation analysis. IT’s About Business Box 5.4 provides an example of such business applications at Eddie Bauer. However, data mining applications are valuable in a variety of industries as the following examples show.

EXAMPLES The National Basketball Association. A data mining application, Advanced Scout, is proving very useful for NBA coaches. Coaches, like business executives, carefully study data to enhance their natural intuition when making strategic decisions. By helping coaches make better decisions, data mining applications are playing a huge role in boosting fan support and loyalty. That means millions of dollars in gate traffic, television sales, and licensing. Before these data mining applications, the sheer volume of statistics was overwhelming, with as many as 200 possessions a game and about 1,200 games a year. Pre-

Section 5.6

Data Warehouses

155

vious applications produced only basic results—the kind of statistics anyone could find in a local newspaper. Using the data mining software, coaches can drill down into the statistics and data and unearth comprehensible patterns that were previously hidden among seemingly unrelated statistics. Coaches are able to obtain, in real time, statistical evaluations that allow them to put in the very best players for specific points in the game. Coaches can also, in real time, ask the application which play will be the most effective relative to the time elapsed and the specific combinations of players on the court. Data mining, simply put, helps coaches make more effective decisions. Selecting branch locations. The Dallas Teachers Credit Union (DTCU) decided to become a full-fledged community bank, but did not know where to build branches. DTCU used data mining software to comb through demographic and customer data. One target was people who might open checking accounts, because bankers consider such accounts a way to make “cheap money.” DTCU found that if a branch was within a 10-minute drive, customers had a checking account. But if the branch was a 10.5 minute drive, they did not have a checking account. Consequently, when DTCU opened a branch in north Dallas within a 10-minute drive for a large number of potential customers, it became profitable in 90 days. Normally, a branch takes a year to climb into the black. DTCU now uses data mining to select all branch locations.

●

Table 5.2 lists some other common data mining applications. We will reexamine data mining in a decision support role in Chapter 10.

Table 5.2

Common Data Mining Applications

Application

Description

Market segmentation

Identifies the common characteristics of customers who buy the same products from your company

Customer churn

Predicts which customers are likely to leave your company and go to a competitor

Fraud detection

Identifies which transactions are most likely to be fraudulent

Direct marketing

Identifies which prospects should be included in a mailing list to obtain the highest response rate

Market basket analysis

Understands what products or services are commonly purchased together (e.g., beer and diapers)

Trend analysis

Reveals the difference between a typical customer this month versus last month

Science

Simulates nuclear explosions; visualizes quantum physics

Entertainment

Models customer flows in theme parks; analyzes safety of amusement-park rides

Insurance and health care

Predicts which customers will buy new policies; identifies behavior patterns that increase insurance risk; spots fraudulent claims

Manufacturing

Optimizes product design, balancing manufacturability and safety; improves shop-floor scheduling and machine utilization

Medicine

Ranks successful therapies for different illnesses; predicts drug efficacy; discovers new drugs and treatments

Oil and Gas

Analyzes seismic data for signs of underground deposits; prioritizes drilling locations; simulates underground flows to improve recovery

Retailing

Discerns buying-behavior patterns; predicts how customers will respond to marketing campaigns

Transportation

Optimizes distribution schedules and vehicle use; analyzes loading patterns for trucks and railcars

156

Chapter 5

Managing Organizational Data and Information

Text Mining Text mining is the application of data mining to nonstructured or less structured text files. Data mining takes advantage of the infrastructure of stored data to extract predictive information. For example, by data mining a customer database, an analyst might discover that everyone who buys product A also buys products B and C, but does so 6 months later. Text mining, however, operates with less structured information. Documents rarely have strong internal infrastructure, and where they do, it is frequently focused on document format rather than document content. Text mining helps organizations to do the following: • Find the “hidden” content of documents, including additional useful relationships. • Relate documents across previously unnoticed divisions (e.g., discover that customers in two different product divisions have the same characteristics). • Group documents by common themes (e.g., identify all the customers of an insurance firm who have similar complaints and cancel their policies). The following example demonstrates an application of text mining.

EXAMPLE Searching for documents at Procter & Gamble. Procter & Gamble’s (pg.com) intranet contains increasingly large amounts of information from both inside and outside the company. Finding documents relevant to an individual’s needs by utilizing traditional full-text indexing and searching is proving to be more and more difficult as the amount of information increases. Text mining’s ability to organize and retrieve information based on concepts greatly reduces the number of documents returned by a search as well as increasing the relevancy of those documents. P&G’s text mining software also displays a graphical depiction of closely related concepts and documents.

●

WHATS IN

FOR ME ?

ACC

F OR THE ACCOUNTING MAJOR Data gathered about each transaction (business event) in the organization is stored in its databases. Accountants access these data to create an unbroken audit trail from each transaction to the balance sheet and then to show profit and loss for the company on the transaction. The speed with which data can be accessed and searched directly affects the productivity of the accountant. Also, the flexibility with which the data can be searched, stemming from the design of the database, means that the modern accountant can investigate relationships with unprecedented ease. With the advent of modern data mining techniques, the firm can discover relationships that have not even been considered.

FIN

FOR THE FINANCE MAJOR Employees in the finance department make extensive use of computerized databases external to the organization, such as CompuStat or Dow Jones, to obtain financial data on organizations in their industry. They can use these data to determine if their organization meets industry benchmarks in return on investment, cash management, or other financial ratios. As for accounts, the speed and flexibility with which data can be accessed and searched bear directly on the finance professional’s productivity. Modern data mining techniques are also becoming popular in finance, particularly for the automated discovery of relationships in securities and portfolio management.

Summary

FOR THE MARKETING MAJOR When a customer makes a purchase from an organization, the transaction generates data that are stored in the firm’s databases. Marketing personnel access this information to plan targeted marketing campaigns and to evaluate the success of previous campaigns. They also link this information to geographic databases to determine where certain products sell the best. Data mining is also uncovering many unanticipated relationships between some aspect of the buyer’s “profile,” the product, and the marketing and advertising campaigns, that, when identified and exploited, can increase sales substantially.

MKT

FOR THE PRODUCTION/OPERATIONS MANAGEMENT MAJOR Production/operations personnel access organizational databases to determine optimum inventory levels for parts in a production process. They also use information in databases to know when to perform required service on machines. Past production data enable these persons to determine the optimum configuration for assembly lines. Firms also keep quality data that inform them not only about the quality of the finished products, but also about quality issues with incoming raw materials, production irregularities, shipping and logistics, and after-sale use and maintenance of the product. Modern databases allow POM professionals to quickly identify problem areas and resolve them. Data mining automates discovery of previously undetected production issues.

POM

FOR THE HUMAN RESOURCES MANAGEMENT MAJOR Organizational databases contain extensive data on employees, including gender, race, age, current and past job descriptions, and performance evaluations. Human resources personnel access these data to provide reports for governmental agencies regarding compliance with federal Equal Opportunity guidelines. These persons also use these data to evaluate hiring practices in the organization, evaluate salary structures, and manage any discrimination grievances or lawsuits brought against the firm. Cutting-edge technologies such as data mining can help the HR professional investigate relationships in the data that bear upon the health, safety, productivity, and retention of valuable human resources.

HRM

Before you go on . . . 1. What are some of the advantages of data warehousing? 2. How would a firm use data mining and text mining for competitive advantage?

SUMMARY 1 Discuss traditional data file organization and its problems.

In a file management environment, each application has a specific data file related to it, containing all the data records needed by the application. Records stored in a sequential file structure may be accessed sequentially, or they may be accessed directly via an index (and then sequentially) using an indexed sequential access method. Records stored in a direct file structure may be accessed directly without using an index.

157

158

Chapter 5

Managing Organizational Data and Information The traditional data file organization led to many problems, including data redundancy, data inconsistency, data isolation, data integrity, security, and application/data dependence. Storing data in data files that are tightly linked to their applications resulted in organizations having hundreds of applications and data files, with little or no coordination among the applications and files, and no overall plan for managing corporate data. 2 Explain how a database approach overcomes the problems associated with the traditional file environment, and discuss disadvantages of the database approach. A database, which is a logical group of related files, eliminates the problems associated with a traditional file environment. In a database, data are integrated and related so that one set of software programs provides access to all the data. Therefore, data redundancy, data isolation, and data inconsistency are minimized, and data can be shared among all users of the data. In addition, security and data integrity are increased, and applications and data are independent of one another. The database approach does have disadvantages. Databases are expensive and require time and effort to program. Also, databases do provide security for corporate data, but once inside a database, a hacker can cause tremendous damage. 3 Describe how the three most common data models organize data, and the advan tages and disadvantages of each model. The hierarchical model rigidly structures data into an inverted “tree” in which records contain a key field and a number of other fields. All records have only one “parent,” and each parent may have many “children.” Therefore, the hierarchical structure is characterized by one-to-many relationships among data. In the network model records can be linked to more than one parent, allowing many-tomany relationships among the data. The relational model uses tables to capitalize on characteristics of rows and columns of data that are consistent with real-world business situations. The main advantage of the hierarchical and network database models is processing efficiency. The hierarchical and network structures are relatively easy for users to understand because they reflect the pattern of many (but not all) realworld business relationships. In addition, the hierarchical structure allows for data integrity to be easily maintained. Hierarchical and network structures have several disadvantages. These designs have low flexibility and are programmingintensive, time-consuming, difficult to install, and difficult to remedy if design errors occur. Nor do they support ad-hoc, English-language-like inquiries for information. The advantages of relational databases include high flexibility in regard to adhoc queries, power to combine information from different sources, simplicity of design and maintenance, and the ability to add new data and records without disturbing existing applications. The disadvantages of relational databases include their relatively low processing efficiency. 4 Describe how a multidimensional data model organizes data. In multidimensional databases, data are stored in arrays. Similar to tables in the relational database model, arrays group related information in columns and rows. Multidimensional databases, however, typically consist of at least three dimensions. However, due to problems depicting more than three dimensions, most examples artificially limit the dimensions to only three, depicting the resulting database as a cube. Dimensions are the edges of the cube, and represent the primary views of the business data.

Discussion Questions

159

5 Distinguish between a data warehouse and a data mart.

Data warehousing approaches can range from simple, the data mart, to complex, the enterprise data warehouse. These approaches differ in scale and complexity. A data mart is a scaled-down version of a data warehouse that focuses on a particular subject area. The data mart is usually designed to support the unique business requirements of a specific department or business process. Because a data mart takes less time to build, costs less, and is less complex than an enterprise data warehouse, it is appropriate when a company needs to improve data access in a targeted area, such as the marketing department. The enterprise data warehouse provides an enterprisewide, consistent, and comprehensive view of the company with business users employing common terminology and data standards throughout the firm. The warehouse reconciles the various departmental perspectives into a single, integrated corporate perspective. 6 Discuss the similarities and differences between data mining and text mining.

Data mining extracts previously unknown, predictive information from data warehouses. Data mining tools use sophisticated, automated algorithms to discover hidden patterns, correlations, and relationships among organizational data. These tools are used to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Text mining applies data mining to nonstructured or less structured text files. Text mining helps organizations find the “hidden” content of documents, relate documents across previously unnoticed divisions, and group documents by common themes.

INTERACTIVE LEARNING SESSION Go to the book’s Web site (or CD), access Chapter 5: Managing Organizational Data and Information, and read the case presented. It will describe a business problem that will require you to query a database for certain information. You will be able to construct SQL statements to obtain the needed information. As you construct your SQL statements, you will see what information results and decide if it meets your requirements. You may then change your SQL statements as necessary to obtain further (or different) information.

DISCUSSION QUESTIONS 1.

You are the CIO of your company. You have just made a presentation to your CEO, proposing that the company implement a data warehouse. The CEO responds, “We already have several databases, don’t we? Why do we need a data warehouse?” Prepare a response to justify your proposal. 2. As the CIO of a company, you want to implement a series of data marts. The CEO wants to know why you do not just implement one large data warehouse. Make your case to support data marts.

3. Should your university implement a data warehouse? Data marts? What types of uses would your university have for a data warehouse? For data marts? 4. In the university question above, what might the dimensions be in the multidimensional database used in the data warehouse? Give a three-dimensional example. What would each cell represent in your example?

160

Chapter 5

Managing Organizational Data and Information

PROBLEM-SOLVING ACTIVITIES 1. Perform a feasibility study for your university to implement a data warehouse (or data mart). Include in your study the application(s) that you would address with the data warehouse. What types of data would you include? Why? 2. Develop a simple database for your personal use, for your family’s use, or for use in an organization of which you are a member. This might contain data on

names, addresses, telephone numbers, birthdays, and other pertinent information. 3. Entrepreneurs can benefit from using databases and data mining by exploiting demographic data that can be purchased for any geographic region. For a startup business of your choice, what kind of data would you want to purchase, and what kinds of relationships would you want to investigate?

INTERNET ACTIVITIES 1. Access the Web sites of one or two of the major data management vendors, such as Oracle (oracle.com), Sybase (sybase.com), and IBM (ibm.com). What are the capabilities of their latest products? Compare these capabilities. Pay particular attention to the Web connections offered by each vendor’s products. 2. Access the Web sites of one or two of the major data warehouse vendors, such as NCR (ncr.com),

SAS (sas.com), and Comshare (comshare.com). What are the capabilities of their latest products? Pay particular attention to the Web connection offered by each vendor’s product. 3. Access the Web site of the Gartner Group (gartnergroup.com). Examine its research papers on marketing databases, data warehousing, and data management. Prepare a report noting the most current practices in these three areas.

TEAM ACTIVITIES AND ROLE PLAYING 1. Go to your university computer center. Determine what type(s) of database(s) your university is using. Find the reasons why these type(s) of database(s) are in use and determine the applications that run on each type of database. 2. Examine the data that your university keeps on you. (You will have to go to your university computer

REAL-WORLD CASE

center for this project.) List the fields that are included in your record. What applications would be supported by the fields in your record? Are you surprised by how much information your university keeps on students?

harrahs.com

Harrah’s Entertainment Database The Business Problem The difference between Harrah’s and its competitors is that most companies put money into the spectacle. The prevailing wisdom in the casino industry is that it is the property’s attractiveness that drives customers to one site or another. This view that has propelled the spending of ever-greater amounts on increasingly lavish hotels and casinos. While its competitors continued to pour money into extravagant casinos to drum up new business, Harrah’s wanted a national approach to its business. Harrah’s used to operate under the assumption that its customers were partial to one particular casino.

Its casinos around the country each operated independently and competed with one another. Each had its own player card that was valid only at the casino that issued it. None of the information systems at these individual sites were integrated with those at other casinos or could even communicate with them. By doing marketing research, Harrah’s found that customers patronized different Harrah’s sites around the country. That led Harrah’s to extend its card-player program so that customers could use their cards at any Harrah’s casino. Extending the card-player program allowed the company to track particular customers on a

Real World Case national basis. Knowing customers’ habits would enable Harrah’s to improve service, customize the kinds of “comps” (free dinners, show tickets, and hotel rooms) it offered, and better tailor its marketing promotions. All of that would help tie customers closer to the Harrah’s brand and increase the company’s share of the U.S. gaming market. Harrah’s was looking for a customer relationship management (CRM) program, and at the heart of such a program is an excellent database. The IT Solution Harrah’s developed the winner’s information network (WINet), the industry’s first national customer database. The first step was to consolidate all its disparate, IT systems, so that all the company’s properties could communicate with each other and share information about customers. Harrah’s had to make its AS/400 transactional systems at each property communicate with a Unix-based national customer database. This Patron Database contains all the company’s customer information. Harrah’s team used middleware and software developed inhouse to enable this communication between disparate systems. Once WINet was running, Harrah’s could share information across its properties, in real time. The company’s 35,000 slot machines now also connect to the AS/400 systems, and call-center representatives are linked to both the AS/400s and the Unix Patron Database. Analysts in the marketing department access the data warehouse and use SAS software to do predictive modeling. Real-time access is key to enabling Harrah’s Total Rewards program. A customer who receives a promotion in the mail will call Harrah’s to inquire about it. As soon as the customer service agent gets on the phone, a large amount of information identifying the customer pops up on the agent’s computer screen. It indicates the customer’s tier (platinum, gold, or diamond), where he or she usually plays, how much he or she has won or lost, and even the customer’s net worth. The agent then asks where the customer wants to make a hotel reservation and for what dates, and can bring up Harrah’s reservation system to see if a room is available. The agent asks if the customer is responding to an offer. The reservation system then automatically searches the Patron Database to see if the customer has already received or redeemed the offer and if it is still valid.

161

WINet has ability to drill into Harrah’s extensive database, which in turn allows the company to customize its marketing and promotions to individual customers, particularly those at the lower end of the economic scale. Harrah’s has been able to retain those lower-end players by calling them to ask about their trips. Retaining a customer costs one-tenth the cost of acquiring a new customer. Talk time has been reduced by an average of 12 seconds per phone call, because employees no longer have to ask for information that other employees have previously asked for, nor do they have to rekey that information. The Results The CRM strategy was not complete simply with the deployment of WINet. Harrah’s also had to radically change its relationship with its regional properties. Before the Total Rewards program, each of Harrah’s properties operated independently. Regional Harrah’s managers were possessive of their markets, customers, and data and were focused almost exclusively on their own operation’s bottom line. However, the parent company eventually sold the regional properties on this new strategy. It argued that customers favored the idea of extending the benefits that guests received for patronizing the casino in one area so that they could also use them at a casino in another region. The corporate office also convinced the regional property managers that the IT capabilities and marketing tools it was developing would boost their businesses. For example, Harrah’s IT systems link all its properties together for corporate promotional events. Since Total Rewards began, Harrah’s has saved $20 million a year in overall costs, while increasing same-store sales growth. At the same time, the number of Harrah’s customers playing at more than one of Harrah’s properties has increased by 72 percent. Harrah’s has changed its relationship with its customers, gained competitive advantage, and created enterprise value. Source: “Jackpot! Harrah’s Entertainment, “CIO (February 1, 2001); harrahs.com.

Questions 1. What was Harrah’s biggest problem in developing its Total Rewards system—technology or people? Would this be true in developing all new systems? 2. What competitive advantage did the Total Rewards system give Harrah’s?

162

Chapter 5

Managing Organizational Data and Information

VIRTUAL COMPANY ASSIGNMENT