Transitioning to Unicode: What To Expect

NWEG 2004: May 20, 2004, Session 1: 2:30pm Transitioning To Unicode: What to Expect Francesca Lane Rasmus & Layne Nordgren


Transitioning to Unicode™: What To Expect
Francesca Lane Rasmus
Layne Nordgren

Presentation Pathway
• PLU Context
• Importance of Voyager® with Unicode™ Release
• What is Unicode™?
• Impact of Moving to Unicode™
• Diacritics and Special Characters
• Cleanup Strategies
• Wrap Up

About Pacific Lutheran University
http://www.plu.edu
• Comprehensive Private University
  – Liberal Arts & Professional Schools
  – 3,400 Students
• Languages Taught:
  – Chinese, Classics, French, German, Norwegian/Scandinavian Studies, Spanish
• Wang Center for International Programs
• Our Library Collection
  – 274,000 BIBs currently
  – 104 languages
• Voyager® Customer: almost 3 years
• Voyager® Clients: 30

Obligatory Disclaimer & Caveat
• Beta and Early Release Partner for Voyager® with Unicode™ Release
• Record errors and display problems we describe are due to data entry errors and the peculiar history of our data, upgrades, and migrations
[Timeline chart, 1987 to 2004. Bibliographic Utility: WLN, OCLC. Integrated Library System: Geac, Dynix, Endeavor.]

Why Is Unicode™ Important?
• All BIB, MFHD, and AUTH records will change to Unicode, including your English language records

What is Unicode™?
http://www.unicode.org
• International standard
• “Unicode™ provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language”
• Increases number of unique coded characters from 128 (in ASCII) to over 40,000
• Includes all major scripts and technical symbols, plus expansion space for additional characters
From: http://www.unicode.org/charts


Latin-1 & MARC-8 Character Sets
[Figure: character set chart with regions labeled Basic ASCII, Pre-composed Characters, and Diacritic Characters]

Precomposed & Decomposed
• Latin-1 Precomposed: 1 character space
  – EXTENDED SET “ñ” 00F1
• MARC-8 Decomposed: 2 character spaces
  – ASCII “n” 006E, then EXTENDED SET “~” 0303
  – Diacritic comes AFTER ASCII character
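The difference between the two spellings of “ñ” can be seen directly in code. The following is a minimal sketch using Python's standard unicodedata module; it is only an illustration of precomposed versus decomposed code points, not anything Voyager® itself runs, and the helper name code_points is made up for this example.

```python
import unicodedata

precomposed = "\u00F1"   # "ñ" as a single precomposed code point
decomposed = "n\u0303"   # ASCII "n" followed by COMBINING TILDE

def code_points(text):
    """Return the code points of text as 4-digit hex strings."""
    return [f"{ord(ch):04X}" for ch in text]

print(code_points(precomposed))  # ['00F1']
print(code_points(decomposed))   # ['006E', '0303']

# The two spellings compare unequal until they are normalized to one form.
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Because the raw strings compare unequal, data entry, searching, and sorting all depend on which form a record actually contains.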

Character Composition Influences:
• Data entry
• Searching & sorting
• Non-filing indicators (e.g., 245, 440)

Counting Non-Filing Indicators
http://www.loc.gov/catdir/cpso/nonfil.pdf
• Diacritic comes AFTER ASCII character
• Example: 245 14 $a Los últimos alazapas…

Position:   1  2  3  4  5  6  7  8  9  10  11  12
Character:  L  o  s     u  ´  l  t  i  m   o   s

(Position 5 is ASCII “u” 0075; position 6 is the EXTENDED SET acute, 0341.)
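The position count in the example above can be reproduced with a short sketch. Two things here are assumptions for illustration only: the function name filing_form is hypothetical, and the acute is written as U+0301 (COMBINING ACUTE ACCENT), which is the combining acute Python's normalizer composes with “u”.

```python
import unicodedata

def filing_form(title, nonfiling_count):
    """Skip the leading characters that the non-filing indicator says to ignore."""
    return title[nonfiling_count:]

# Decomposed spelling: the diacritic follows the ASCII letter it modifies.
decomposed_title = "Los u\u0301ltimos alazapas"
precomposed_title = unicodedata.normalize("NFC", decomposed_title)

# "Los últimos" occupies 12 positions decomposed but only 11 precomposed.
print(len("Los u\u0301ltimos"))                                # 12
print(len(unicodedata.normalize("NFC", "Los u\u0301ltimos")))  # 11

# With indicator 4 ("Los" plus the space), filing starts at the letter "u"
# in both spellings, because the diacritic comes after the letter.
print(filing_form(decomposed_title, 4)[0])   # u
print(filing_form(precomposed_title, 4)[0])  # ú
```

The article itself contains no diacritic, so the indicator value stays 4; what changes with decomposition is the position of every character after the accented letter.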

What Happens During Unicode™ Conversion?

• MARC records (BIB, AUTH, & MFHD) converted from Voyager® Encoding (VRLIN) to UTF-8 (UCS Transformation Format-8), which is the MARC standard
  – PLU database conversion (August 20, 2003): 1 day
• SupportWeb estimates:
  – < 300,000 bibs = 1 day
  – > 300,000 bibs but < 1,000,000 bibs = 2 days
  – > 1,000,000 bibs = 3 days
• New clients & ODBC drivers installed on staff computers
• Browser/font adjustments for staff and public workstations
• Cataloging client preference adjustments
  – Font
  – Import profiles
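To make the target encoding concrete, here is a small sketch of what UTF-8 bytes look like for plain ASCII text and for the two spellings of “ñ”. It uses only Python's built-in str.encode and says nothing about how the Voyager® conversion program works internally.

```python
# Plain ASCII text is byte-for-byte identical in UTF-8.
print("Los".encode("utf-8") == b"Los")   # True

# Precomposed "ñ" (U+00F1) becomes two bytes.
print("\u00F1".encode("utf-8").hex())    # c3b1

# Decomposed "n" + COMBINING TILDE (U+0303) becomes three bytes:
# one for the ASCII letter and two for the combining mark.
print("n\u0303".encode("utf-8").hex())   # 6ecc83
```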

Catalog Displays Special Characters & Diacritics


Cataloging Client Display


Unicode™ Transition May…
• display skeletons in database
• affect searching
• require problem identification & cleanup strategies

The Skeletons
• If your diacritics or special characters were not correctly coded and converted, in WebVoyáge you now see… character error icons

Effects on Searching
• Searching for terms with encoding errors may fail to retrieve expected results, e.g. keyword anywhere search for “espanola”

Assessing the Impact
• How many records might have diacritics or special characters?
• How many records actually contain diacritics or special characters?
• In what tags are they most likely to lurk?
• What percentage of the collection is involved?

Frequency of Language Coded Records
• Count BIB_TEXT.LANGUAGE
• “eng” or “”

[Bar chart: Number of Language Records; y-axis 0 to 3,500. Bar values as extracted: 3,188; 2,766; 2,113; 1,428; 1,111; 675; 547; 346; 189; 112. Language codes as extracted: chi, grc, nor, ger, fre, spa, swe, dan, lat, ita.]
[Pie chart: Percent of Collection. No diacritics? 92.7; Possible diacritics 7.3.]
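Counting records by language code is a simple grouped query. The sketch below assumes only what the slide names, a BIB_TEXT table with a LANGUAGE column; everything else is a stand-in: it uses an in-memory SQLite database rather than the Voyager® database reached through Access/ODBC, and the sample rows and function names are invented for illustration.

```python
import sqlite3

# Stand-in table with made-up rows; the real BIB_TEXT table has many more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BIB_TEXT (BIB_ID INTEGER PRIMARY KEY, LANGUAGE TEXT)")
sample_rows = [(1, "eng"), (2, "eng"), (3, ""), (4, "ger"),
               (5, "spa"), (6, "ger"), (7, "eng"), (8, "nor")]
conn.executemany("INSERT INTO BIB_TEXT VALUES (?, ?)", sample_rows)

def language_counts(connection):
    """Return (language code, record count) pairs, most frequent first."""
    query = """
        SELECT LANGUAGE, COUNT(*) AS N
        FROM BIB_TEXT
        GROUP BY LANGUAGE
        ORDER BY N DESC, LANGUAGE
    """
    return connection.execute(query).fetchall()

def percent_possible_diacritics(counts):
    """Share of records whose language code is neither "eng" nor blank."""
    total = sum(n for _, n in counts)
    other = sum(n for lang, n in counts if lang not in ("eng", ""))
    return 100.0 * other / total

counts = language_counts(conn)
print(counts)
print(percent_possible_diacritics(counts))  # 50.0 for the made-up rows
```

Treating “eng” and blank codes as the unlikely group mirrors the split on the slide; records in any language can still contain diacritics, so this is only a first estimate.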


Records with Diacritics & Special Characters
Sub sample Records in Query
• First 100 with diacritics: 8%
• Last 100 with diacritics: 13%; most in Swedish or Norwegian
• RECORDS WITH DIACRITICS (high): 13% of 20,000 = ~2,600 records
• Percentage = ~0.95% of total records
[Pie chart: No diacritics 99.05; Likely Diacritics 0.95]

Another View: Bulk Importing
• Records Sent to OCLC = 235,518
• Total Parse Errors on Bulk Import = 2,614
• Percentage: 1.11% of Bulk Imported records
[Pie chart: No diacritics 98.89; Likely Diacritics 1.11]

Record Cleanup Strategies
• Review Conversion Log Reports
• Develop Access Reports
• Consider Record Replacements
• Make Global Headings Changes
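The percentages on the “Records with Diacritics & Special Characters” slide can be checked with a few lines of arithmetic. This sketch uses only figures given in the presentation (274,000 BIBs, a 20,000-record query, the 13% high estimate, and the bulk import counts); the function name percent is just a helper for this example.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

total_bibs = 274_000        # collection size given earlier in the presentation
query_records = 20_000      # records returned by the language query
high_rate_percent = 13      # "last 100" sample rate, used as the high estimate

estimated_records = query_records * high_rate_percent // 100
print(estimated_records)                                   # 2600
print(round(percent(estimated_records, total_bibs), 2))    # 0.95

# Second view: parse errors on bulk import of records sent to OCLC.
print(round(percent(2_614, 235_518), 2))                   # 1.11
```

The two independent estimates land close together, at roughly 1% of records.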

Conversion Logs
[Screenshot: conversion log with callouts for BIB_ID, tag; loose char; undefined char]

Working with the Conversion Log
• Helpful to import log into a spreadsheet
• Sort by type of entry
  – undefined character errors
  – loose character warnings
• Once you can visually see the data, patterns start to emerge
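The same sort-and-count step can be scripted instead of done in a spreadsheet. The sketch below is built on a loudly stated assumption: the actual conversion log layout is not shown in these slides, so the CSV columns (bib_id, tag, entry_type) and the sample rows are hypothetical, and a real log would first need to be parsed into this shape.

```python
import csv
import io
from collections import Counter

# Hypothetical log extract; the real conversion log format is not documented here.
SAMPLE_LOG = """bib_id,tag,entry_type
1001,650,undefined char
1002,650,undefined char
1003,590,loose char
1004,440,undefined char
1005,590,loose char
1006,650,undefined char
"""

def tally(log_text):
    """Count log entries by type, and by (type, tag) pair."""
    by_type = Counter()
    by_type_and_tag = Counter()
    for row in csv.DictReader(io.StringIO(log_text)):
        by_type[row["entry_type"]] += 1
        by_type_and_tag[(row["entry_type"], row["tag"])] += 1
    return by_type, by_type_and_tag

by_type, by_type_and_tag = tally(SAMPLE_LOG)

# Most frequent (type, tag) combinations first, which is where patterns show up.
for (entry_type, tag), n in by_type_and_tag.most_common():
    print(entry_type, tag, n)
```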

Undefined Character Errors
• Out of 866 errors, we discovered
  – 750 in 650 tag (86%): “—” used instead of delimiter (‡)
  – 70 in 590 tag
  – 25 in 440 tag
  – 21 in all other tags combined
• Nearly all errors were from non-MARC records input in our previous ILS
• Special cleanup projects were already planned for the non-MARC records

Ultimately, there were no diacritic or special character conversion errors in our data!

Loose Character Warnings
• What is a loose character?
  – Examples are carriage returns, line feeds, backspaces, and MARC-8 superscript and subscript numbers
• All but a handful of our 3,432 loose characters were in non-MARC records
• Primary problem caused by improper Backspace key emulation settings on terminals in previous ILS
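A field can be scanned for the control characters named above with a few lines of code. This is a sketch under stated assumptions: it covers only carriage return, line feed, and backspace (not the MARC-8 superscript and subscript numbers), it is not the check the conversion program performs, and the sample note text is invented.

```python
# Control characters treated as "loose" in this sketch.
LOOSE = {"\r": "carriage return", "\n": "line feed", "\b": "backspace"}

def find_loose_characters(field_text):
    """Return (index, name) for each loose control character in a field."""
    return [(i, LOOSE[ch]) for i, ch in enumerate(field_text) if ch in LOOSE]

# Made-up local note with stray backspaces and a trailing CR/LF.
sample = "Local note with a typo\b\bfix\r\n"
for index, name in find_loose_characters(sample):
    print(index, name)
```

Stray backspaces like the ones in the sample are the kind of residue that mis-set Backspace key emulation could leave behind in keyed records.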


What the Conversion Log Missed

• Improper input of ASCII characters as diacritics
  – ~ (tilde)
  – ` (grave)
  – ^ (circumflex)
• Now we knew where the real cleanup focus needed to be in the PLU database

Adding and Editing Diacritics


Finding ASCII Characters Used as Diacritics
• Used Access/SQL to identify the characters
• Opted to search field-by-field for these characters
• Example below is for the 100 tag:

SELECT BIB_INDEX.INDEX_CODE, BIB_INDEX.BIB_ID, BIB_INDEX.DISPLAY_HEADING
FROM BIB_INDEX
WHERE BIB_INDEX.INDEX_CODE Like "*100*"
  AND (BIB_INDEX.DISPLAY_HEADING Like "*~*"
    Or BIB_INDEX.DISPLAY_HEADING Like "*^*"
    Or BIB_INDEX.DISPLAY_HEADING Like "*`*");
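The same filter can be applied to exported headings outside Access. The sketch below is an illustration only: it assumes the headings have already been exported as (BIB_ID, heading) pairs, and the sample headings and the function name suspect_headings are hypothetical.

```python
import re

# ASCII tilde, circumflex, or grave anywhere in the heading.
ASCII_AS_DIACRITIC = re.compile(r"[~^`]")

# Made-up rows standing in for an export of BIB_ID and DISPLAY_HEADING.
headings = [
    (101, "Mun~oz, Juan"),
    (102, "Smith, John"),
    (103, "Pe`re, Marie"),
]

def suspect_headings(rows):
    """Keep rows whose heading contains an ASCII character used as a diacritic."""
    return [(bib_id, heading) for bib_id, heading in rows
            if ASCII_AS_DIACRITIC.search(heading)]

for bib_id, heading in suspect_headings(headings):
    print(bib_id, heading)
```

A single character class replaces the three separate Like tests; matches still need to be reviewed by eye, since these characters can also appear legitimately.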

Record Replacement

• When should you replace a record instead of correcting the diacritics and special characters?
  – When you do not have in-house language expertise
    • For PLU, we will replace our Chinese records that are still in Wade-Giles with Pinyin records
  – Staff Time & Cost
    • When a record has multiple errors, replacing it may save time
    • OCLC search and export charges ~ $1.00 per record
  – Workflow Issues

Global Headings Change
• Potentially useful
• PLU has not exploited this tool (we haven’t quite figured out how to use it correctly)
• If you use this tool, please talk to us after the presentation!

Tools & Features
• UTF8to16 function in Access reports (allows you to see the diacritics in Access reports)
• Character error icon to indicate incorrect diacritic
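The reason a UTF-8 to UTF-16 conversion makes diacritics visible can be sketched in a few lines. This is not the UTF8to16 function itself, whose signature is not shown in these slides; it assumes the function's job is to reinterpret stored UTF-8 bytes as UTF-16 text for display, and the sample heading and the helper name utf8_to_utf16_bytes are invented.

```python
def utf8_to_utf16_bytes(raw_utf8):
    """Decode UTF-8 bytes and re-encode them as little-endian UTF-16."""
    return raw_utf8.decode("utf-8").encode("utf-16-le")

raw = "Pe\u00F1a".encode("utf-8")    # bytes as they might sit in a UTF-8 field

# Read one byte per character, the two UTF-8 bytes of "ñ" show up as two symbols.
print(raw.decode("latin-1"))         # PeÃ±a

# Decoded as UTF-8, the same bytes give the intended text.
print(raw.decode("utf-8"))           # Peña

# Four characters, two bytes each, in UTF-16.
print(len(utf8_to_utf16_bytes(raw)))  # 8
```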


Tools & Features
• ALA Character Entry Table in OCLC


Tools & Features
• Voyager’s® Special Character Entry lists diacritics and special characters by name

Tools & Features for Diacritics
• PLU’s website with a listing of the most frequently found diacritics and special characters in the PLU Library catalog as represented in Voyager and OCLC: http://www.plu.edu/~libr/EndUser2004/diacritics.html
• Princeton’s diacritics by language website: http://infoshare1.princeton.edu/katmandu/catcopy/diatoc.html
• Michael Doran’s Coded Character Sets: http://rocky.uta.edu/doran/charsets

Wrap Up
• At last we can view diacritics and special characters properly!
• Converting to Unicode™ is not a difficult transition, though you may need to do some cleanup
• Tools & strategies exist to locate errors
• Developing, refining, and sharing cleanup strategies would be beneficial

Key Transitions
• New browser
• New client fonts & open type
• Adjust record import/export configurations
• Adopt new schema, Access Reports/ODBC drivers and UTF8to16 functions
• Enter diacritics after character (not before)
• Adjust to new standards for non-filing indicators
• Clean up incorrectly coded diacritics

Contact Info
Francesca Lane Rasmus [email protected] 253-535-7141
Layne Nordgren [email protected] 253-535-7197