CS101 Lecture 13: Text Representation and Data Compression

John Magee 15 July 2013

Overview/Questions

– How do computers store text information?
– Why do some characters show up as �s on my browser?
– What is compression, and why is it important?
– How can data be compressed?

Binary Representations

Recall: a single bit can be either a 0 or a 1.

What if you need to represent more than 2 choices?

n bits can represent 2^n possible combinations.
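The counting rule above can be checked by listing every pattern. This is a minimal sketch; the helper name `bit_patterns` is made up for illustration and is not part of the lecture.

```python
from itertools import product

def bit_patterns(n):
    """Return every n-bit pattern as a string such as '010'."""
    return ["".join(bits) for bits in product("01", repeat=n)]

# The number of patterns doubles with each added bit.
for n in range(1, 5):
    patterns = bit_patterns(n)
    print(n, "bits:", len(patterns), "patterns, starting with", patterns[:4])
```

Each extra bit doubles the count because every existing pattern can be extended with either a 0 or a 1.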

Representing Text

There are a finite number of characters to represent, so list them all and assign each a binary pattern.

Character set: a list of characters and the binary codes used to represent each one.

Computer manufacturers agreed to standardize in the early 1960s.

The ASCII Character Set

ASCII stands for American Standard Code for Information Interchange.

ASCII originally used seven bits to represent each character, allowing for 128 unique characters.

Later, extended ASCII evolved so that all eight bits were used, allowing for 256 characters.
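To see the seven-bit codes concretely, the sketch below uses Python's built-in `ord` and `format`. The helper `ascii_bits` is a hypothetical name introduced here for illustration.

```python
def ascii_bits(ch):
    """Return the 7-bit ASCII pattern for a single character."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not a 7-bit ASCII character")
    return format(code, "07b")

# Print each character with its decimal code and 7-bit pattern.
for ch in "Hi!":
    print(ch, ord(ch), ascii_bits(ch))
```

Going the other way, `chr(65)` returns the character whose code is 65.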

[Figure: the ASCII character set (7 bits)]

[Figure: the extended ASCII character set]

Can't You Take a Joke? :-)

Carnegie Mellon professor Scott E. Fahlman proposed ASCII emoticons on Sept. 19, 1982.

Source: http://www.wired.com/science/discoveries/news/2008/09/dayintech_0919

ASCII Art

Text-based systems had no graphics…

…so people created art and graphics out of ASCII text!

The Unicode Character Set

Extended ASCII is not enough for international use.

Unicode originally used 16 bits per character; later versions of the standard define characters beyond that range.

How many characters can 16-bit Unicode represent?

Unicode is a superset of ASCII. The first 256 characters correspond exactly to the ISO 8859-1 (Latin-1) extended ASCII character set.
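The 16-bit figure and the overlap with ASCII can be checked directly in Python. This is a rough sketch: `code_point_bits` is a hypothetical helper, and the choice of Latin-1 as the extended ASCII variant is an assumption based on how Unicode's first 256 code points are laid out.

```python
def code_point_bits(ch, width=16):
    """Return the character's Unicode code point as a fixed-width bit string."""
    return format(ord(ch), f"0{width}b")

# Number of distinct values a 16-bit code can take.
print(2 ** 16)

# ASCII characters keep the same numeric code in Unicode.
print("A", ord("A"), code_point_bits("A"))

# Code points 128-255 line up with Latin-1 (assumed extended ASCII variant).
for ch in "éñü":
    assert ch.encode("latin-1") == bytes([ord(ch)])

# A character outside the first 256 needs more than 8 bits.
print("€", ord("€"), code_point_bits("€"))
```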

[Figure: the Unicode character set]

Recall: Morse Code

Invented by Samuel Morse for the telegraph in the 1840s.

Text Compression

Problem: Assigning 16 bits to each character in a document uses a heck of a lot of space.

We need ways to store and transmit text efficiently. Why?

Common compression techniques:
– keyword encoding
– run-length encoding
– Huffman encoding

Keyword Encoding

Replace frequently used words with a single character.

Keyword Encoding Example

Given the following paragraph:

We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness. That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness.

Keyword Encoding Example

The encoded paragraph is:

We hold # truths to be self-evident, $ all men are created equal, $ ~y are endowed by ~ir Creator with certain unalienable Rights, $ among # are Life, Liberty + ~ pursuit of Happiness. — $ to secure # rights, Governments are instituted among Men, deriving ~ir just powers from ~ consent of ~ governed, — $ whenever any Form of Government becomes destructive of # ends, it is ~ Right of ~ People to alter or to abolish it, + to institute new Government, laying its foundation on such principles + organizing its powers in such form, ^ to ~m shall seem most likely to effect ~ir Safety + Happiness.
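The substitution shown above can be sketched in a few lines of Python. The keyword table below is inferred from the encoded paragraph and is an assumption, since the lecture's own table is not reproduced in this text. The function names are also made up for illustration. Unlike the encoded paragraph, this sketch is case-sensitive, so a capitalized "That" is left alone and decoding restores the input exactly.

```python
# Keyword table inferred from the encoded paragraph (an assumption).
KEYWORDS = {"these": "#", "that": "$", "the": "~", "and": "+", "as": "^"}

def encode(text, table=KEYWORDS):
    """Replace each keyword with its symbol, longest keywords first."""
    if any(symbol in text for symbol in table.values()):
        raise ValueError("text already contains a reserved symbol")
    for word in sorted(table, key=len, reverse=True):
        text = text.replace(word, table[word])
    return text

def decode(text, table=KEYWORDS):
    """Replace each symbol with the keyword it stands for."""
    for word, symbol in table.items():
        text = text.replace(symbol, word)
    return text

sample = ("We hold these truths to be self-evident, that all men are "
          "created equal, that they are endowed by their Creator")
encoded = encode(sample)
print(encoded)
print(len(sample), "->", len(encoded), "characters")
assert decode(encoded) == sample
```

Longer keywords are replaced first so that "these" becomes "#" before the shorter "the" can match inside it. Plain substring replacement also explains why "they" and "their" appear as "~y" and "~ir".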

Keyword Encoding Compression ratio The size of the compressed data divided by the size of the original data (0 < c.r.