HTML Parsing in Java for Accessibility Transformations

Beth R. Tibbitts, Susan Crayne, Vicki Hanson, Jonathan Brezin, Cal Swart, John T. Richards

Abstract

As more people begin to use the Web, many who previously could not easily go online are arriving in large numbers. Many of them have minor or major impairments that keep them from fully experiencing web sites. Conditions ranging from the normal aging of vision, to motor disabilities, to moderate vision loss commonly get in the way. Severe disabilities, especially blindness, have already been addressed, usually with special-purpose software or devices, but users with mild to moderate disabilities can be left out.

The Web Accessibility Initiative (http://www.w3.org/WAI/) and the W3C's Web Content Accessibility Guidelines (http://www.w3.org/TR/WCAG10/) explain to authors how to create accessible web content. However, not all web sites adhere to these principles, and those that do not require changes to be accessible. Even when the guidelines are followed, a page can be made more accessible still, so that more users can experience its content fully.

As part of the Web Adaptation Technology Project [WA], we have developed browser extensions and server services that allow users to make dynamic changes to web pages based on their personal preferences. This paper focuses on dynamic visual changes to web sites, accomplished mainly via HTML/XML manipulation of the document, and the problems encountered along the way. One problem is that text size is difficult to change to suit the user, because web authors specify font sizes in different ways. Another is color contrast: poor contrast makes a page hard to view for some users, and changing to a high-contrast color combination can help, but transparent images on the page can then become difficult to see and need adjusting. Finally, while style sheets allow flexibility in rendering the page, their specifications can be challenging to determine while parsing a DOM.
Examples of difficult web sites are shown, different parsing solutions are demonstrated, and recommendations for web authors are made. Even when web pages do not meet the WAI or other accessibility guidelines, dynamic changes based on a user's individual preferences can make the pages more accessible.

Information about this joint project between IBM and SeniorNet [IBM-SNET] is available at: http://www.ibm.com/ibm/ibmgives/grant/helping/seniornet.shtml

This project is implemented in Internet Explorer using a Browser Helper Object (BHO) [BHO] [BHO2] [Roberts] written in Java, which gives the program access to the document object before it is rendered in the browser. The object model is an extension of the W3C DOM that permits the BHO to modify the document's rendering and to handle user-interface events in the document. Our implementation uses Java-COM to call the Microsoft MSHTML [MSHTML] objects that allow query and manipulation of the HTML document. We compile the Java implementation into a DLL that is registered to be called by the browser.

This paper describes some transformations that help persons with disabilities view web pages more easily based on their personal preferences, shows how some simple transformations can be done with existing browser tools, and shows how Web Adaptation Technology can make those and other changes more easily and consistently for the user. The DOM parsing needed for these DOM manipulation transformations is shown using the Microsoft APIs that are associated with BHOs. Suggestions are included to help web authors make pages more accessible to more users, as well as more modifiable by technologies such as the Web Adaptation Technology.

XML 2002 Proceedings by deepX Rendered by www.RenderX.com

Table of Contents

1. Introduction
2. Making Web Pages Accessible
3. Accessibility Options in Internet Explorer
4. Implementation
5. Background Removal
6. Color Changes - Adding Style Sheets
7. Color Changes - Transparent GIF Problems
   7.1. Color Precedence rules for a node
8. Page Linearization
9. Modifying the DOM to facilitate user-interface events
10. Some things are hard to change
11. Joint Projects / Usage
Bibliography
Glossary

1. Introduction

Small and relatively minor disabilities can make a big difference in being able to successfully navigate the Internet. In the normal process of aging, vision loss occurs, and larger fonts and more strongly contrasting colors become helpful in viewing web sites. Animated images can be distracting and sometimes add to the confusion. Images can be small and difficult to see, yet critical to site navigation. Reading difficulties like dyslexia can make even the most well-designed sites difficult to read. Senior citizens are frequently unfamiliar with computers and the Internet, but are going online in increasing numbers.

Popular Internet browsers have built-in facilities for changing how web pages are rendered, and operating systems provide further accessibility assistance, but a user who could benefit from these facilities often has difficulty locating and configuring them. The facilities are also often not very comprehensive in their ability to transform what a web page looks like for an individual user, and making changes requires complex reconfiguration by the user, often prohibitively so.

The Web Adaptation Technology project [WA] is an IBM Research project that provides an enhancement to Microsoft Internet Explorer that allows users to configure their browsing experience based on personal preferences. The user specifies preferences for web page changes on a single interface at the bottom of the IE window, but the changes are implemented in a variety of ways, including operating system / registry changes, style sheet additions and modifications, and parsing of the document. The document is a DOM-like object provided by Microsoft Internet Explorer that can be modified before it is rendered in the browser window and shown to the user. A variety of transformations is possible with the Web Adaptation Technology project, but they are presented to the user in a single unified way.

Some web sites, not surprisingly, are easier to transform than others, just as some web sites are easier to navigate and view, especially for a user with limited vision. Some web sites are hard to view in their original form, yet can easily be changed, perhaps with an increased font size, to be viewable by a user with low vision. Other sites may look quite artistically impressive at first glance, but even a small change in size or color can render them unreadable.

2. Making Web Pages Accessible

Many guidelines for accessible web site design exist. The Web Accessibility Initiative [WAI] and the W3C's Web Content Accessibility Guidelines [W3C] explain to authors how to create web content that is accessible. However, not all web sites adhere to these principles, and those that do not require changes to be accessible. Even when the guidelines are followed, a page can be made more accessible still, so that more users can experience its content fully.


WAI guidelines suggest using style sheets instead of HTML attributes to specify layout and rendering preferences. They also recommend adding ALT text to images to describe them, among many other guidelines. In practice, web sites that adhere to the WAI suggestions also adapt much more easily to font-size and other changes. Tools are also available for testing whether web sites adhere to the guidelines and are accessible. Bobby [BOBBY] tests web pages against the guidelines established by the World Wide Web Consortium's (W3C) Web Accessibility Initiative (WAI) as well as the Section 508 [Section508] guidelines from the Architectural and Transportation Barriers Compliance Board (Access Board) of the U.S. Federal Government.

3. Accessibility Options in Internet Explorer

Before we go further, we should say that many things are configurable by the user in Internet Explorer. See the Tools menu, Internet Options, "Accessibility" dialog, shown in Figure 1 below, and the text size menu in Figure 2. The Web Adaptation Technology project makes use of some of these, but most of the changes described here go beyond what can be done with the built-in Accessibility options, and are done in the DOM.

Figure 1. Accessibility Options in Internet Explorer

Figure 2. Changing Text Size in Internet Explorer

4. Implementation

For this research, we manipulate the DOM after Internet Explorer has constructed it. The BHO [BHO] [BHO2] [Roberts] and MSHTML [MSHTML] programming interfaces allow us to obtain a DOM, already parsed and processed, in the form in which the content will be rendered in the browser. Thus, we do not need to determine how the HTML will be parsed, just what to do with it once the DOM is determined.

We found in a previous version of this project [WA] that targeting multiple browsers was problematic. Netscape and Internet Explorer can render the same HTML differently. At times one will tolerate incorrect HTML and the other won't. We needed to know how the browser was going to render something in order to know which transformations would display the element in the way the user had requested. By obtaining a DOM already parsed as the browser will render it, we bypass this problem.

This paper covers five different transformations that the Web Adaptation Technology project applies to the page, and shows how the changes to the DOM were accomplished.

1. Background Removal
2. Color Changes
3. Transparent GIF fixes during color changes
4. Page Linearization
5. Modifying the DOM to facilitate user-interface events

5. Background Removal

One of the simplest changes that can be made to a page to greatly improve legibility for many users is to remove any background images. Background images are often used for decoration and are not integral to the understanding of the page. If the background image distracts from the page content, users may prefer to remove it to make the text easier to read. The example in Figure 3 below shows a web page with a distracting background.

Figure 3. Web page with distracting background image

The example in Figure 4 below shows the same web page with the background image removed.

Figure 4. Web page with background image removed

A background image is usually specified on the HTML document's BODY tag. It can be removed with the MS DOM APIs simply enough. The following also shows how the basic objects are retrieved when the HTML document loading is complete, and the DOM analysis and possible modification can begin.

    IWebBrowser2 browser = .. // obtained from BHO DocumentComplete
    IHTMLDocument2 document = (IHTMLDocument2) browser.getDocument();
    IHTMLElement body = document.getBody();
    body.removeAttribute("background", 0);
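The MSHTML wrappers above only run inside a BHO, so the same idea can be tried outside the browser with the JDK's standard W3C DOM classes. The following is a minimal sketch under that assumption: the class name `BackgroundRemovalSketch` and its methods are hypothetical, and the input must be well-formed XHTML because the JDK parser, unlike the browser, does not repair broken HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for the BHO code: uses org.w3c.dom, not MSHTML.
final class BackgroundRemovalSketch {

    /** Parses well-formed XHTML into a W3C DOM (a stand-in for the browser's DOM). */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    /** Removes the background attribute from each body element; returns how many were changed. */
    static int removeBackground(Document doc) {
        NodeList bodies = doc.getElementsByTagName("body");
        int changed = 0;
        for (int i = 0; i < bodies.getLength(); i++) {
            Element body = (Element) bodies.item(i);
            if (body.hasAttribute("background")) {
                body.removeAttribute("background");
                changed++;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<html><body background=\"tiles.gif\"><p>Hello</p></body></html>");
        System.out.println("bodies changed: " + removeBackground(doc));
    }
}
```

Only the attribute is removed here; a background image set through a style rule would need the style sheet handling discussed in the following sections.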

6. Color Changes - Adding Style Sheets

Another fairly simple change is to alter the colors on a page by introducing a "user style sheet." We do this from within the BHO to add foreground and background colors to the BODY element. The following is a simplified example that changes text colors to white on black.

    IHTMLStyleSheet ss = document.createStyleSheet("",0);
    ss.addRule("body","color: white; background-color: black");

This is analogous to adding a style sheet at the top of the HTML document like this:

    BODY {color: white; background-color: black}
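For readers without access to MSHTML, the "analogous" markup can also be produced programmatically. The sketch below is an illustration only, using the JDK's org.w3c.dom API on well-formed XHTML; `UserStyleSheetSketch` and `addRule` are hypothetical names that loosely mirror the `addRule` call above, not part of any Microsoft API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: injects a style element instead of calling createStyleSheet().
final class UserStyleSheetSketch {

    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    /** Appends a style element with one CSS rule to head, creating head if it is missing. */
    static Element addRule(Document doc, String selector, String declarations) {
        Element root = doc.getDocumentElement();
        NodeList heads = root.getElementsByTagName("head");
        Element head;
        if (heads.getLength() > 0) {
            head = (Element) heads.item(0);
        } else {
            head = doc.createElement("head");
            root.insertBefore(head, root.getFirstChild());
        }
        Element style = doc.createElement("style");
        style.setAttribute("type", "text/css");
        style.setTextContent(selector + " { " + declarations + " }");
        head.appendChild(style); // last in head, so it follows any author style elements
        return style;
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>Hello</p></body></html>");
        Element style = addRule(doc, "body", "color: white; background-color: black");
        System.out.println(style.getTextContent());
    }
}
```

Appending the rule last lets it win over earlier rules of equal specificity, but it still loses to inline styles and to more specific selectors.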

We will show later that even this will not always change all the colors on the page. There is a set of precedence rules for how colors are determined for an element node, and a rule in a stylesheet will not always be honored over color information specified at the node. The examples in Figure 5 and Figure 6 show text colors before and after a change. Some users, for example, can see light colors on dark backgrounds better than other combinations, and may prefer this color combination.


Figure 5. Web page with original colors


Figure 6. Web page with colors changed

7. Color Changes - Transparent GIF Problems

The most obvious changes that the Web Adaptation Technology can make are color changes. If color changes are simply made to the entire page without considering page content (for example, with the browser options shown above), they don't always improve legibility. Sometimes, for example, background/foreground colors are specifically engineered by the web author to appear over a particular background image or background color. Transparent GIFs are likewise often designed on the assumption that they will be displayed over a specific background color, in order to blend with the layout of the page or for other purposes. A transparent GIF, especially one containing text, can be unreadable if placed over a different background color. See Figure 8 for an example.


Figure 7. Transparent GIF and text

To make the page easier to read, the background color and text size could be changed. Using the IE menus (Tools, Internet Options, General tab, Colors button), the same changes can be made directly to IE (via the Windows registry). Here the background color is changed to a light color, as in Figure 8, in which case some portions of the image may not be visible, or text embedded in the image may not be readable. (Note that background images are always removed whenever text color changes are made.)

Figure 8. Transparent GIF and text, with background color changed. Note illegible text on white background.

The GIF was placed by the web author over a relatively dark background color, and the text in the image is a light color. If a user prefers to read web pages as dark-colored text on a light background, transparent GIFs on this page can cause problems because they were intended to be shown over a dark color. So, in addition to changing the colors of the web page, we also traverse the DOM to determine the color that was intended to appear behind the original GIF. Then the IMG component's background color is adjusted to render the GIF with this original color behind it. The requested color changes for the text are still honored, but the original intended color is restored behind just the GIF images. See Figure 9 for how this is rendered.
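The traversal described above can be sketched outside the browser with the JDK's W3C DOM classes. This is an illustration under stated assumptions, not the project's implementation: `GifBackgroundSketch` and its methods are hypothetical, and the precedence used (inline style, then the bgcolor attribute, then the parent's effective color) is a deliberate simplification that ignores style sheets.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: carries the author's background color down to each img.
final class GifBackgroundSketch {
    private static final Pattern BG =
        Pattern.compile("background-color\\s*:\\s*([^;]+)", Pattern.CASE_INSENSITIVE);

    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    /** Assumed precedence: inline style, then bgcolor attribute, then the inherited color. */
    static String effectiveBackground(Element el, String inherited) {
        Matcher m = BG.matcher(el.getAttribute("style"));
        if (m.find()) {
            return m.group(1).trim();
        }
        String attr = el.getAttribute("bgcolor"); // "" when the attribute is absent
        return attr.isEmpty() ? inherited : attr;
    }

    /** Visits the tree top-down and pins every img to the color found above it. */
    static void pinImageBackgrounds(Element el, String inherited, List<String> log) {
        String bg = effectiveBackground(el, inherited);
        if (el.getTagName().equalsIgnoreCase("img") && bg != null) {
            String decl = "background-color: " + bg;
            String style = el.getAttribute("style");
            el.setAttribute("style", style.isEmpty() ? decl : style + "; " + decl);
            log.add(el.getAttribute("src") + " -> " + bg);
        }
        for (Node kid = el.getFirstChild(); kid != null; kid = kid.getNextSibling()) {
            if (kid instanceof Element) {
                pinImageBackgrounds((Element) kid, bg, log);
            }
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body bgcolor=\"white\"><table><tr>"
            + "<td bgcolor=\"#003366\"><img src=\"logo.gif\"/></td>"
            + "<td><img src=\"dot.gif\"/></td></tr></table></body></html>");
        List<String> log = new ArrayList<>();
        pinImageBackgrounds(doc.getDocumentElement(), null, log);
        System.out.println(log);
    }
}
```

Passing the color as a parameter, rather than keeping it in a shared field, means a color found inside one subtree is never seen by a sibling subtree. The pass would run before the user's color changes are applied, so the author's colors are still present in the DOM.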


Figure 9. Original background color is replaced only behind the gif, making it more legible.

Changes implemented with values stored by the Windows registry/OS result in the original author's color intentions not being discernible at DOM-traversal time. Such changes must therefore be made at runtime on the DOM itself. The DOM is traversed, carrying colors down to the children of each node, to specify color replacement on IMG tags if necessary. A simple recursion over the HTML document will accomplish this, as shown in the following example.

We visit the tree top-down. To compute the current node's effective background color, the precedence order for node color is used (see Section 7.1), which includes the current node's style and attributes, and the parent's effective background color. In the latest API the current node's style may be obtained with getCurrentStyle(). This is a somewhat tricky point, as one might be tempted to call getStyle(), which catches only the style attribute on the node. One thing definitely to avoid is visiting the document's stylesheets to determine the color (or anything else!), because the performance overhead of the required COM calls is prohibitive when, as happens for many popular sites, the stylesheets are very large. The various news services are good examples of sites with large style sheets.

    String currentBGColor;

    public void recurse(IHTMLElement element) {
        currentBGColor = getEffectiveColor(element, currentBGColor);
        IHTMLElementCollection children = (IHTMLElementCollection)element.getchildren();
        for(iKid=0; iKid

[...]

    =0; --i) {
        indexVariant.putInt(i);
        IHTMLElement element = (IHTMLElement)elements.item(indexVariant,dummyVariant);
        String tagName = element.getTagName(element);
        if ((tagName.equals("TD")|| tagName.equals("TH"))){
            String innerHTML=element.getinnerHTML();
            if (innerHTML == null || innerHTML.trim().length() == 0) {
                continue;
            }
            element.putinnerHTML(""+innerHTML+"");
        }
    }
    }

Some comments on the above code: it shows some of the vestiges of the COM API that lies underneath the Java classes, notably the ubiquitous Microsoft Variant object, which is often required to pass information to the APIs. See the next code sample for other comments on usage of this API framework.

Wrapping tags that did not generate user-interface events in tags that do worked well for generating events, and it targeted a more acceptable "chunk" of text to be read at one time, but it caused another problem. Scripting was often broken, since JavaScript code often refers to parent and child elements in a specific expected relationship. We were changing the former child of an element to a grandchild, and a parent to a grandparent. Presently we are careful not to insert elements around anything but leaf nodes, and we make other adjustments so as not to break scripting or the original relationship of most document elements to each other.

Another problem comes from web authors' use of tags in less than optimal ways. The paragraph tag is useful for dividing text into block elements that each represent, well, a paragraph. But since much HTML is written with only the visual effect of tags in mind, web authors often overuse the break tag (BR). Two breaks do not a paragraph make, except visually. So we'd often see

    This is supposed to be a paragraph of text.

    Another "paragraph" here.

And what was selected could sometimes be the entire page, not a good "chunk" for speaking or rendering as banner text. Text like this is also a good candidate for element insertion. The chunks within the inserted elements can then be selected with the mouse.

    This is supposed to be a paragraph of text.

    Another "paragraph" here.


The following code shows an example of wrapping text adjacent to a node with another element. This is used to wrap text adjacent to BR tags. It could be placed in the loop exemplified previously.

    /**
     * Replace text adjacent to the element with the same text,
     * surrounded by another set of tags.
     */
    public void wrapAdjacentText(IHTMLElement element) {
        String before = null;
        String after = null;
        IHTMLElement2 el = null;
        try {
            el = (IHTMLElement2) element;
            before = el.getAdjacentText("beforeBegin");
            after = el.getAdjacentText("afterEnd");
        } catch (Exception e) {
            // some tags have invalid calls, e.g. BR tag has no afterBegin or beforeEnd
            // so, ignore this exception here so we won't be bothered by it
        }
        replaceText(element, before, "beforeBegin");
        replaceText(element, after, "afterEnd");
    }

    /**
     * Replace the text to the left and right of the element node, with the text
     * wrapped with SPAN tags
     */
    private void replaceText(IHTMLElement element, String text, String position) {
        if (text == null) {
            return; // no adjacent text at this position, nothing to wrap
        }
        StringBuffer newStr = new StringBuffer("<SPAN>");
        try {
            IHTMLElement2 el2 = (IHTMLElement2) element;
            // first remove the raw text
            el2.replaceAdjacentText(position, "");
            // force insertion of whitespace that gets lost. Thanks, MS.
            // without this, text can be UPagainst another word.
            if (text.startsWith(" ")) {
                newStr.append("&nbsp;"); // replace leading space with something IE will recognize
                newStr.append(text.substring(1));
            } else {
                newStr.append(text);
            }
            newStr.append("</SPAN>");
            // replace text, wrapped in tag
            element.insertAdjacentHTML(position, newStr.toString());
        } catch (Exception ee) { /* error... */ }
    }

A couple of notes on the above are in order. Note the casting between IHTMLElement and IHTMLElement2. Because the underlying APIs called via Java-COM are in C and cannot implement inheritance, several layers of APIs exist in this framework and casting is a common occurrence. Also, the above example inserts another element by replacing the text of the HTML instead of by manipulating DOM nodes. This is a simple insertion and can be done this way, but DOM manipulation can also be done with a newer set of APIs in this framework, specifically with IHTMLDOMNode and its associated classes. Additionally, you will notice the insertion of a &nbsp; (non-breaking space) to force the re-introduction of whitespace that this manipulation erroneously removes. Otherwise certain words would be upAgainstEachOther.
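The node-based alternative mentioned above can be illustrated with the JDK's W3C DOM API, which is assumed here as a stand-in for IHTMLDOMNode. In this hedged sketch the class `BrTextWrapSketch` and its method names are hypothetical. Each text node next to a br element is moved into a new span, so its character data, including any leading space, is carried over unchanged.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical sketch: wraps text beside br elements by moving nodes, not by editing HTML strings.
final class BrTextWrapSketch {

    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    /** Wraps each non-blank text node adjacent to a br in a span; returns the number wrapped. */
    static int wrapTextAroundBreaks(Document doc) {
        NodeList breaks = doc.getElementsByTagName("br");
        // Collect targets first so the tree is not modified while it is being scanned.
        List<Text> targets = new ArrayList<>();
        for (int i = 0; i < breaks.getLength(); i++) {
            Node br = breaks.item(i);
            addIfText(br.getPreviousSibling(), targets);
            addIfText(br.getNextSibling(), targets);
        }
        for (Text text : targets) {
            Element span = doc.createElement("span");
            text.getParentNode().replaceChild(span, text); // span takes the text node's place
            span.appendChild(text);                        // text becomes the span's only child
        }
        return targets.size();
    }

    private static void addIfText(Node n, List<Text> out) {
        // Text between two br elements is a neighbor of both, so skip duplicates.
        if (n instanceof Text && !n.getNodeValue().trim().isEmpty() && !out.contains(n)) {
            out.add((Text) n);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<div>This is supposed to be a paragraph.<br/><br/>Another one here.</div>");
        System.out.println("wrapped: " + wrapTextAroundBreaks(doc));
        System.out.println("first child: " + doc.getDocumentElement().getFirstChild().getNodeName());
    }
}
```

Because only text nodes (leaves) are wrapped, existing elements keep their original parents and children, which is consistent with the leaf-only rule adopted earlier to avoid breaking scripts.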

10. Some things are hard to change

Some techniques in web page design and implementation make pages more difficult than others to modify for accessibility.

1. Transparent GIFs can be placed over a background image, with the intention that the color "showing through" will contrast with the content of the image. If the background image is removed, the GIF can be illegible. (We can't determine the color that should be behind the GIF.) This is also true of text placed over a background image. Colors for elements can also be specified in JavaScript (especially for rollover menus); these are not evident in the HTML document's DOM, and thus cannot be modified if, for example, a background is removed or other text colors are changed.

2. If vital page contents are found in a background image, and if images are removed by the user to eliminate distractions and clarify the text, the page content may be incomplete.

3. Images without ALT attributes provide no content that can be spoken or enlarged with banner text.

4. If line spacing is specified in absolute units, and the font is enlarged, overlap of lines can easily occur. This can be rectified by removing line-height: rules in inline styles or stylesheets, but in practice most of these are specified in style sheets, which can be prohibitive to parse in terms of performance.

5. Element placement by pixel: although use of tables is discouraged for element placement, and style sheets are encouraged, if components are placed by pixel (x,y) location, then when text is enlarged this placement is often inaccurate, resulting in overlapping elements among other things.
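For item 4 above, the inline-style case is cheap to handle because it needs no style sheet access. The following is a minimal sketch, assuming the property in question is the CSS `line-height` property; `LineHeightSketch` and `stripLineHeight` are hypothetical names. It is deliberately naive: it does not handle the `font` shorthand or semicolons inside quoted values.

```java
// Hypothetical sketch: drops line-height declarations from an inline style value.
final class LineHeightSketch {

    /** Returns the style value with any line-height declaration removed. */
    static String stripLineHeight(String style) {
        StringBuilder kept = new StringBuilder();
        for (String declaration : style.split(";")) {
            String trimmed = declaration.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty pieces such as the one after a trailing semicolon
            }
            String property = trimmed.split(":", 2)[0].trim();
            if (property.equalsIgnoreCase("line-height")) {
                continue; // drop it so the browser falls back to its default spacing
            }
            if (kept.length() > 0) {
                kept.append("; ");
            }
            kept.append(trimmed);
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripLineHeight("font-size: 12px; line-height: 14px; color: red"));
    }
}
```

The result would be written back to the element's style attribute; rules that live in style sheets are left alone, for the performance reason given in item 4.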

11. Joint Projects / Usage

Currently this project is being used in a joint project between IBM [IBM-SNET] and SeniorNet [SNET], helping seniors to better access the Internet.

Acknowledgements

Many thanks to Fran Brown, who did some of the seminal work on transparent GIF color replacement.

Bibliography [BHO] Esposito, Dino, 1999. "Browser Helper Objects: The Browser the way you want it", Microsoft Corporation, 1999, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnwebgen/html/bho.asp [BHO2] Esposito, Dino, "Customizing Microsoft http://www.microsoft.com/mind/1199/cutting/cutting1199.asp

Internet

Explorer

5",

[BOBBY] Bobby, web accessibility software tool, http://bobby.watchfire.com [IBM-SNET] "IBM and SeniorNet expand partnership to http://www.ibm.com/ibm/ibmgives/grant/helping/seniornet.shtml

bring

millions

more

online"

[MSHTML] MSHTML Reference, MSDN library, http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/reference.asp [Roberts] Roberts, Scott, Programming Microsoft Internet Explorer 5 (Microsoft Programming Series) , Microsoft Press, 1999 [Section508] http://www.section508.gov/ [SNET] SeniorNet http://www.seniornet.org [W3C] W3C Web Content Accessibility Guidelines, http://www.w3.org/TR/WCAG10/ [WA] Web Adaptation Technology Project, http://www.research.ibm.com/access [WAI] The Web Accessibility Initiative, http://www.w3.org/WAI/ [VV] IBM ViaVoice, http://www.ibm.com/software/speech

Glossary

BHO
    Browser Helper Object

Biography

Beth R. Tibbitts
IBM Accessibility Research, IBM T. J. Watson Research Center, Lexington, United States of America

Beth Tibbitts is a 25-year veteran of software development in IBM, including APL, LISP, C++, and now Java. She was an IBM manager for a brief stint but escaped back to the technical world when she came to her senses. Programming in APL and LISP, both underdog languages and environments, gave way to C++ and eventually Java about 6 years ago. Beth wrote Java book reviews and articles for the IBM developerWorks site in its earlier days. She developed software in Java for IBM "Reinventing Education" grants and for training/evaluation of ADHD children. She is now part of a group developing tools for making web sites more accessible to persons with disabilities, including an IBM grant and joint development project with SeniorNet (http://www.ibm.com/ibm/ibmgives/grant/helping/seniornet.shtml).

Susan Crayne
IBM Accessibility Research, IBM T. J. Watson Research Center, Hawthorne, United States of America

Vicki Hanson
IBM Accessibility Research, IBM T. J. Watson Research Center, Hawthorne, United States of America

Jonathan Brezin
IBM Accessibility Research, IBM T. J. Watson Research Center, Hawthorne, United States of America

Cal Swart
IBM Accessibility Research, IBM T. J. Watson Research Center, Hawthorne, United States of America

John T. Richards
IBM Accessibility Research, IBM T. J. Watson Research Center, Hawthorne, United States of America
