NWEG 2004: May 20, 2004, Session 1: 2:30pm Transitioning To Unicode: What to Expect Francesca Lane Rasmus & Layne Nordgren
Transitioning to Unicode™: What To Expect
Francesca Lane Rasmus
Layne Nordgren

Presentation Pathway
• PLU Context
• Importance of Voyager® with Unicode™ Release
• What is Unicode™?
• Impact of Moving to Unicode™
• Diacritics and Special Characters
• Cleanup Strategies
• Wrap Up

About Pacific Lutheran University
• Comprehensive Private University (http://www.plu.edu)
  – Liberal Arts & Professional Schools
  – 3,400 Students
• Languages Taught:
  – Chinese, Classics, French, German, Norwegian/Scandinavian Studies, Spanish
• Wang Center for International Programs
• Our Library Collection
  – 274,000 BIBs currently
  – 104 languages
• Voyager® Customer: almost 3 years
• Voyager® Clients: 30

Obligatory Disclaimer & Caveat
• Beta and Early Release Partner for Voyager® with Unicode™ Release
• The record errors and display problems we describe are due to data entry errors and the peculiar history of our data, upgrades, and migrations
[Timeline figure, 1987 to 2004. Bibliographic Utility: WLN, OCLC. Integrated Library System: Geac, Dynix, Endeavor.]

Why Is Unicode™ Important?
• All BIB, MFHD, and AUTH records will change to Unicode™, including your English language records
What is Unicode™?
http://www.unicode.org
• International standard
• “Unicode™ provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language”
• Increases number of unique coded characters from 128 (in ASCII) to over 40,000
• Includes all major scripts and technical symbols, plus expansion space for additional characters
From: http://www.unicode.org/charts
Latin-1 & MARC-8 Character Sets
[Character set charts. Labels: Basic ASCII; Pre-composed Characters; Diacritic Characters.]

Precomposed & Decomposed
• Latin-1 Precomposed: 1 character space
  – EXTENDED SET 00F1 (ñ)
• MARC-8 Decomposed: 2 character spaces
  – ASCII “n” 006E, then EXTENDED SET “~” 0303
  – Diacritic comes AFTER ASCII character
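As a minimal sketch of the precomposed/decomposed distinction above, Python's standard `unicodedata` module can convert between the two forms. The sample character ñ (00F1, or 006E followed by 0303) comes from the slide; the helper name `code_points` is only illustrative and is not part of any Voyager® tool.

```python
import unicodedata

def code_points(text):
    """Return the code points of text as 4-digit hex strings."""
    return [f"{ord(ch):04X}" for ch in text]

precomposed = "\u00f1"                                   # ñ as one code point
decomposed = unicodedata.normalize("NFD", precomposed)   # n + combining tilde

print(code_points(precomposed))   # ['00F1']
print(code_points(decomposed))    # ['006E', '0303']

# The two forms display identically but compare unequal until normalized.
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

The last two lines hint at why composition matters for searching and sorting: a comparison only succeeds when both sides use the same form.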
Character Composition Influences:
• Data entry
• Searching & sorting
• Non-filing indicators (e.g., 245, 440)

Counting Non-Filing Indicators
http://www.loc.gov/catdir/cpso/nonfil.pdf
Diacritic comes AFTER ASCII character
Example: 245 14 $a Los últimos alazapas…
Position:   1   2   3   4        5   6   7   8   9   10  11  12
Character:  L   o   s   (space)  u   ´   l   t   i   m   o   s
• Position 5: ASCII 0075 (u)
• Position 6: EXTENDED SET 0301 (acute)
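The position count in the example can be reproduced with a short sketch, assuming the title is held in decomposed (NFD) form so that ú becomes u followed by a combining acute accent. The function name `positions` is hypothetical, and the second indicator value 4 is taken from the 245 14 example.

```python
import unicodedata

def positions(text):
    """Pair each character of the decomposed (NFD) text with its 1-based position."""
    decomposed = unicodedata.normalize("NFD", text)
    return list(enumerate(decomposed, start=1))

title = "Los \u00faltimos"   # "Los últimos", entered with a precomposed ú

for pos, ch in positions(title):
    print(pos, f"{ord(ch):04X}", unicodedata.name(ch))

# A second indicator of 4 skips "Los " so filing starts at position 5.
nonfiling = 4
filing_part = unicodedata.normalize("NFD", title)[nonfiling:]
print(len(positions(title)))              # 12
print(filing_part.startswith("u\u0301"))  # True
```

Because the diacritic follows its base letter, the non-filing count covers only the article and the space.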
What Happens During Unicode™ Conversion?
• MARC records (BIB, AUTH, & MFHD) converted from Voyager® Encoding (VRLIN) to UTF-8 (UCS Transformation Format-8), which is the MARC standard
  – PLU database conversion (August 20, 2003): 1 day
• SupportWeb estimates:
  – < 300,000 bibs = 1 day
  – > 300,000 bibs but < 1,000,000 bibs = 2 days
  – > 1,000,000 bibs = 3 days
• New clients & ODBC drivers installed on staff computers
• Browser/font adjustments for staff and public workstations
• Cataloging client preference adjustments
  – Font
  – Import profiles
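To make the UTF-8 step in the first bullet concrete, the sketch below shows the bytes that UTF-8 produces for a plain letter and for a decomposed letter plus diacritic. It illustrates the encoding only and says nothing about how the Voyager® conversion itself is implemented; `utf8_hex` is a hypothetical helper.

```python
import unicodedata

def utf8_hex(text):
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))

plain = "n"
decomposed = unicodedata.normalize("NFD", "\u00f1")   # n + combining tilde (006E 0303)

print(utf8_hex(plain))        # 6E
print(utf8_hex(decomposed))   # 6E CC 83
```

ASCII letters keep their single-byte values, which is why English language records change encoding without visibly changing content.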
Catalog Displays Special Characters & Diacritics
Cataloging Client Display
Unicode™ Transition May…
• display skeletons in database ☠
• affect searching
• require problem identification & cleanup strategies

The Skeletons
• If your diacritics or special characters were not correctly coded and converted, in WebVoyáge you now see… character error icons

Effects on Searching
• Searching for terms with encoding errors may fail to retrieve expected results, e.g. keyword anywhere search for “espanola”

Assessing the Impact
• How many records might have diacritics or special characters?
• How many records actually contain diacritics or special characters?
• In what tags are they most likely to lurk?
• What percentage of the collection is involved?

Frequency of Language Coded Records
• Count BIB_TEXT.LANGUAGE
• “eng” or “”
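A language count of this kind might look like the following sketch. It uses an in-memory SQLite table as a stand-in for the Voyager® database, so the SQL dialect differs from Access. The table and column names BIB_TEXT and LANGUAGE come from the slide; the BIB_ID column, the sample rows, and the choice to exclude "eng" and blank codes are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BIB_TEXT (BIB_ID INTEGER PRIMARY KEY, LANGUAGE TEXT)")
# Hypothetical sample rows, not PLU data.
conn.executemany(
    "INSERT INTO BIB_TEXT (BIB_ID, LANGUAGE) VALUES (?, ?)",
    [(1, "eng"), (2, "nor"), (3, "ger"), (4, "nor"), (5, ""), (6, "spa"), (7, "eng")],
)

# Count records per language code, leaving out "eng" and blank codes.
rows = conn.execute(
    """
    SELECT LANGUAGE, COUNT(*) AS RECORD_COUNT
    FROM BIB_TEXT
    WHERE LANGUAGE NOT IN ('eng', '')
    GROUP BY LANGUAGE
    ORDER BY RECORD_COUNT DESC, LANGUAGE
    """
).fetchall()

total = conn.execute("SELECT COUNT(*) FROM BIB_TEXT").fetchone()[0]
non_english = sum(count for _, count in rows)

print(rows)
print(f"{100 * non_english / total:.1f}% of records may carry diacritics")
```

The per-language counts show where diacritics are most likely to occur, and the percentage gives an upper bound on how much of the collection is involved.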
[Bar chart: Number of Language Records. Vertical axis: 0 to 3,500. Language codes shown: chi, grc, nor, ger, fre, spa, swe, dan, lat, ita. Bar value labels: 3,188; 2,766; 2,113; 1,428; 1,111; 675; 547; 346; 189; 112.]
[Pie chart: Percent of Collection. No diacritics?: 92.7. Possible diacritics: 7.3.]
Records with Diacritics & Special Characters
Sub sample Records in Query
• First 100 with diacritics: 8%
• Last 100 with diacritics: 13%; most in Swedish or Norwegian
RECORDS WITH DIACRITICS (high) 13%
• 13% of 20,000 = ~ 2,600 records
• Percentage = ~ 0.95% of total records
[Pie chart: No diacritics 99.05; Likely Diacritics 0.95]
Another View: Bulk Importing
• Records Sent to OCLC = 235,518
• Total Parse Errors on Bulk Import = 2,614
• Percentage: 1.11% of Bulk Imported records
[Pie chart: No diacritics 98.89; Likely Diacritics 1.11]

Record Cleanup Strategies
• Review Conversion Log Reports
• Develop Access Reports
• Consider Record Replacements
• Make Global Headings Changes
Conversion Logs

Working with the Conversion Log
• Helpful to import log into a spreadsheet
• Sort by type of entry
  – undefined character errors
  – loose character warnings
• Once you can visually see the data, patterns start to emerge
[Screenshot of a conversion log. Callouts: loose char; BIB_ID, tag; undefined char.]

Undefined Character Errors
• Out of 866 errors, we discovered
  – 750 in 650 tag (86%): “—” used instead of delimiter (‡)
  – 70 in 590 tag
  – 25 in 440 tag
  – 21 in all other tags combined
• Nearly all errors were from non-MARC records input in our previous ILS
• Special cleanup projects were already planned for the non-MARC records

Ultimately, there were no diacritic or special character conversion errors in our data!
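The sort-and-count approach described for the conversion log can also be sketched in a few lines of Python instead of a spreadsheet. The slides do not show the log's actual layout, so the tuple shape (entry type, BIB_ID, tag) and the sample entries below are assumptions.

```python
from collections import Counter

# Hypothetical log entries as (entry type, BIB_ID, tag); not real PLU data.
log_entries = [
    ("undefined char", 1001, "650"),
    ("undefined char", 1002, "650"),
    ("loose char", 1003, "590"),
    ("undefined char", 1004, "440"),
    ("loose char", 1005, "650"),
    ("undefined char", 1006, "650"),
]

def tally_by_tag(entries, entry_type):
    """Count entries of one type per MARC tag, most frequent first."""
    counts = Counter(tag for kind, _bib_id, tag in entries if kind == entry_type)
    return counts.most_common()

for tag, count in tally_by_tag(log_entries, "undefined char"):
    print(tag, count)
```

A tally like this surfaces the same kind of pattern the slide reports, where one tag accounts for most of the errors.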
Loose Character Warnings
• What is a loose character?
  – Examples are carriage returns, line feeds, backspaces, and MARC-8 superscript and subscript numbers
• All but a handful of our 3,432 loose characters were in non-MARC records
• Primary problem caused by improper Backspace key emulation settings on terminals in previous ILS
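As a rough illustration of what a loose character looks like inside a field, the sketch below scans a string for the control characters named above. The function name and the sample text are hypothetical, and MARC-8 superscript and subscript numbers are not covered.

```python
# Control characters named on the slide; MARC-8 superscript and subscript
# numbers are left out of this sketch.
LOOSE_CHARS = {"\r": "carriage return", "\n": "line feed", "\b": "backspace"}

def find_loose_chars(field_text):
    """Return (offset, name) for each loose control character in field_text."""
    return [(i, LOOSE_CHARS[ch]) for i, ch in enumerate(field_text) if ch in LOOSE_CHARS]

sample = "Histoy\bry of Norway\r\n"   # hypothetical field with a stray backspace
print(find_loose_chars(sample))
```

A stray backspace stored in the data, as in the sample, is the kind of residue that a wrong Backspace key emulation setting could leave behind.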
What the Conversion Log Missed
• Improper input of ASCII characters as diacritics
  – ~ (tilde)
  – ` (grave)
  – ^ (circumflex)
Now we knew where the real cleanup focus needed to be in the PLU database
Adding and Editing Diacritics
Finding ASCII Characters Used as Diacritics
• Used Access/SQL to identify the characters
• Opted to search field-by-field for these characters. Example below is for the 100 tag:
    SELECT BIB_INDEX.INDEX_CODE, BIB_INDEX.BIB_ID, BIB_INDEX.DISPLAY_HEADING
    FROM BIB_INDEX
    WHERE (((BIB_INDEX.INDEX_CODE) Like "*100*")
      AND ((BIB_INDEX.DISPLAY_HEADING) Like "*~*"
        Or (BIB_INDEX.DISPLAY_HEADING) Like "*^*"
        Or (BIB_INDEX.DISPLAY_HEADING) Like "*`*"));
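The same filter can be sketched outside Access, for example over rows already exported from BIB_INDEX. The row shape (INDEX_CODE, BIB_ID, DISPLAY_HEADING) follows the query's SELECT list; the sample headings and the function name `suspect_headings` are hypothetical.

```python
import re

# Matches any of the three ASCII characters used in place of real diacritics.
ASCII_DIACRITICS = re.compile(r"[~^`]")

# Hypothetical rows shaped like the query's columns; not real PLU data.
rows = [
    ("100", 101, "Pen~a, Juan"),
    ("100", 102, "Pe\u00f1a, Juan"),   # correctly coded ñ, should not match
    ("245", 103, "Le^ voyage"),
    ("100", 104, "Bjo`rk, Anna"),
]

def suspect_headings(rows, index_code="100"):
    """Return rows whose index code contains index_code and whose heading has ~, ^ or `."""
    return [
        row for row in rows
        if index_code in row[0] and ASCII_DIACRITICS.search(row[2])
    ]

for code, bib_id, heading in suspect_headings(rows):
    print(code, bib_id, heading)
```

Passing a different `index_code` repeats the check for another field, in line with the field-by-field approach described above.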
Record Replacement
• When should you replace a record instead of correcting the diacritics and special characters?
  – When you do not have in-house language expertise
    • For PLU, we will replace our Chinese records still in Wade-Giles with Pinyin records
  – Staff Time & Cost
    • When a record has multiple errors, replacement may save time
    • OCLC search and export charges ~ $1.00 per record
  – Workflow Issues
Global Headings Change
• Potentially useful
• PLU has not exploited this tool (we haven’t quite figured out how to use it correctly)
• If you use this tool, please talk to us after the presentation!

Tools & Features
• UTF8to16 function in Access reports (allows you to see the diacritics in Access reports)
• Character error icon to indicate incorrect diacritic
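The slides do not document how UTF8to16 works, so the sketch below only illustrates the general problem such a function addresses: UTF-8 bytes read under the wrong encoding show garbled characters, while decoding them as UTF-8 restores the diacritics. The function name `utf8_to_text` and the sample word are assumptions, and the UTF-16 step reflects only what the name UTF8to16 suggests.

```python
def utf8_to_text(raw_bytes):
    """Decode UTF-8 bytes into a Python string."""
    return raw_bytes.decode("utf-8")

stored = "espa\u00f1ola".encode("utf-8")   # bytes as a UTF-8 database would hold them

# Reading the bytes as Latin-1 shows a garbled pair instead of ñ.
print(stored.decode("latin-1"))   # espaÃ±ola
print(utf8_to_text(stored))       # española

# Re-encoding as UTF-16 uses two bytes for each of the eight characters.
print(len(utf8_to_text(stored).encode("utf-16-le")))   # 16
```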
Tools & Features • ALA Character Entry Table in OCLC
Tools & Features • Voyager’s® Special Character Entry lists diacritics and special characters by name
Tools & Features for Diacritics • PLU’s website with a listing of the most frequently found diacritics and special characters in the PLU Library catalog as represented in Voyager and OCLC: http://www.plu.edu/~libr/EndUser2004/diacritics.html
• Princeton’s diacritics by language website: http://infoshare1.princeton.edu/katmandu/catcopy/diatoc.html
• Michael Doran’s Coded Character Sets http://rocky.uta.edu/doran/charsets
Wrap Up
• At last we can view diacritics and special characters properly!
• Converting to Unicode™ is not a difficult transition, though you may need to do some cleanup
• Tools & strategies exist to locate errors
• Developing, refining, and sharing cleanup strategies would be beneficial

Key Transitions
• New browser
• New client fonts & open type
• Adjust record import/export configurations
• Adopt new schema, Access Reports/ODBC drivers and UTF8to16 functions
• Enter diacritics after character (not before)
• Adjust to new standards for non-filing indicators
• Clean up incorrectly coded diacritics

Contact Info
Francesca Lane Rasmus
[email protected] 253-535-7141
Layne Nordgren
[email protected] 253-535-7197