Using Page Histories for Improving Browsing the Web

Adam Jatowt

Yukiko Kawai

Katsumi Tanaka

Kyoto University Yoshida-Honmachi, Sakyo-ku, 606-8501 Kyoto, Japan Phone: +81-75-7535969

Kyoto Sangyo University Motoyama, Kamigamo, Kita-Ku, 603-8555 Kyoto, Japan Phone: +81-75-7052958

Kyoto University Yoshida-Honmachi, Sakyo-ku, 606-8501 Kyoto, Japan Phone: +81-75-7535969


ABSTRACT
Currently, users have little temporal support when browsing Web pages. The Web is in effect a transient collection in which little effort is made to enable access to the historical content of pages. However, integrating documents with their histories should bring many benefits, such as time travel and easier judgment of documents’ trustworthiness. In this paper we present several ways in which users could interact with page histories. We also demonstrate example systems designed to realize these interaction types and discuss related issues.

Categories and Subject Descriptors
H.5.4 [Information Interfaces and Presentation (e.g., HCI)]: Hypertext/Hypermedia

General Terms
Algorithms

Keywords
interaction with document’s history, web archives, past web, page versioning.

1. INTRODUCTION
The Web is a very dynamic environment in which changes occur continually. Yet temporal support is usually insufficient: users have no easy and direct means to access the past content of pages during browsing, nor to analyze page histories. An average Web user usually has no idea where to look for previous versions of documents, or even how to use them. We think that there is a need for efficient means of access to document histories and for various mechanisms for interacting with these histories.

In this paper, we discuss approaches for empowering users to access, observe and utilize historical data of documents during browsing, hence extending their interaction capabilities with visited pages into the temporal dimension. This should increase the understanding of documents as well as improve the trustworthiness judgment of visited pages. For example, recent topic drifts or average change frequency could be estimated by observing the historical content of pages. Another benefit would be easier re-finding of previously encountered content.

We demonstrate several different interaction mechanisms through example applications. One application allows users to navigate and browse page histories much in the way they browse the present Web. The proposed browser integrates the present versions of pages with their past snapshots and provides navigation and visualization means for efficient time travel. Another system generates visual summaries of page histories by displaying term clouds and spatially arranging thumbnails of past page snapshots on a 2D pane. The third application enriches current views of pages with certain history-derived information. In particular, it annotates content on pages with its approximate age, determined by searching within historical versions of these pages. Lastly, we provide a discussion of the proposed interaction methods, their application scenarios and other related issues.

The remainder of this paper is structured as follows. The next section introduces several history-based interaction methods. In Section 3 we discuss these methods and the related issues. In Section 4 we describe the related work. The last section concludes the paper.

This work is licensed under an Attribution-NonCommercial-NoDerivs 2.0 France Creative Commons Licence. IWAW’08, September 18–19, 2008, Aarhus, Denmark.

2. INTERACTION WITH PAGE HISTORIES
Just as users in the real world can analyze the history of certain objects (e.g., companies, institutions, countries, persons), so in the Web world they should be able to freely analyze the histories of basic Web units: pages. In this section we discuss several potential page history-based interaction mechanisms. Before proceeding we need to provide some definitions.

Def. 1: Page snapshot with a timestamp ti - a copy of the content that was on the page at time point ti during its lifetime. A page snapshot should be distinguished from a page version, as the latter implies the occurrence of changes. Thus, two snapshots of a page with timestamps ti and tj (i ≠ j) may have exactly the same content.

Def. 2: Web archive - a collection of past snapshots of Web pages.

Def. 3: Page history reconstruction - the process of reproducing the past content of a page by using its available snapshots. The reconstructed page history is a continuous approximation of the page’s past content derived from a series of its past snapshots.
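Definitions 1 to 3 can be made concrete with a small data model. The sketch below is only an illustration: the names Snapshot, PageHistory, content_at and versions are hypothetical, and the reconstruction assumes the simplest possible approximation, in which the page is taken to have kept the content of its latest snapshot until the next snapshot was captured.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Snapshot:
    """Def. 1: a copy of the page content at a given time point."""
    timestamp: datetime
    content: str


class PageHistory:
    """Snapshots of a single page, i.e. a per-page slice of a Web archive (Def. 2)."""

    def __init__(self, snapshots: List[Snapshot]):
        self.snapshots = sorted(snapshots, key=lambda s: s.timestamp)
        self._times = [s.timestamp for s in self.snapshots]

    def content_at(self, t: datetime) -> Optional[str]:
        """Def. 3 under the step-function assumption: content of the latest
        snapshot taken at or before t, or None if t precedes all snapshots."""
        i = bisect_right(self._times, t)
        return self.snapshots[i - 1].content if i > 0 else None

    def versions(self) -> List[Snapshot]:
        """Snapshots whose content differs from the preceding snapshot,
        i.e. page versions as opposed to mere snapshots."""
        result: List[Snapshot] = []
        for snap in self.snapshots:
            if not result or result[-1].content != snap.content:
                result.append(snap)
        return result


# Three snapshots, but only two distinct versions.
history = PageHistory([
    Snapshot(datetime(2008, 1, 1), "A"),
    Snapshot(datetime(2008, 2, 1), "A"),
    Snapshot(datetime(2008, 3, 1), "B"),
])
```

Here the first two snapshots carry identical content, so versions() collapses them into one version, which mirrors the distinction drawn in Def. 1.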

2.1 Basic Interaction Types
Accessing page histories is the fundamental interaction mechanism; it means viewing past page snapshots to see the content a given document had at a certain time point in the past. A related interaction type relies on observing page evolution by traversing along the time axis, that is, browsing page history. This way of interacting can be seen as a sequence of individual accesses to single page snapshots. The past snapshots can be shown in sequential order or in any other pattern. These standard access types are currently enabled by Web archives or their interfaces. If links connecting pages in the past are preserved and connected with their corresponding past snapshots (e.g., the rewritten links within the Internet Archive’s Web collection), then users can move between page histories, in effect browsing the Past Web.

Def. 4: Past Web - the structure of page histories with reactivated links, where reactivation means pointing links to the corresponding parts of page histories (snapshots) in the same or a similar way as in the past, thus retaining the historical connectivity between pages.

In the present Web, browsing is usually closely related to the notion of navigation, which implies guiding users through the Web. In a similar way, we can define the processes of navigating page histories and navigating the Past Web, which mean supporting “traversing” and “orienteering” within the history of a single page or of larger structures of pages, respectively. Note that search in page histories and search in the Past Web are related concepts and may provide starting points for accessing or browsing page histories.

Another interesting issue comes from combining user interactions with the present and past page snapshots and, taking it further, with the present and past Web. For example, users can browse pages as they currently do using standard browsers, with the option of making time jumps into document history in order to access or browse page histories when needed. Figure 1 demonstrates this kind of mixed interaction.

Figure 1 The concept of browsing the present and past Web at the same time [4].

There are various reasons why users may need to access and browse page histories.

• They may wish to view some content from the past that they have seen before (e.g., the content may no longer appear on the Web).

• They may wish to check what was on a page before, for example, in order to search for interesting content in addition to that already shown in the present page version.

• Sometimes, users cannot access the current versions of documents due to server-related or other problems. In such cases they may want to view at least the last saved document versions. This concept is similar to the “cached” links published by some search engines next to search results in order to show recent snapshots of the documents. In a questionnaire study administered in Japan to 1000 online users in 2008, we found that about 17% of users sometimes follow “cached” links.

Previously, we proposed and evaluated an application called Past Web Browser [3,6] that provides basic functionalities for accessing, browsing and navigating page histories and the past Web. The browser works as a standard browser yet, at the same time, provides several means of interaction with page histories. Usually, users must know the location of Web archives in order to access the stored collections. In addition, they cannot easily retrieve data on page histories from different collections at the same time. Taking this into consideration, our application merges past snapshots from different collections, providing a kind of virtual link from the present versions of documents to their past snapshots in a unified form. In our experimental implementation we used the Internet Archive’s Web collection¹ as well as caches of Web pages stored in the repositories of the Yahoo!² and Google³ search engines and a local cache. Furthermore, the application facilitates accessing and navigating page histories or the past Web by adding additional layers over distributed Web archive collections.

The browsing style is similar to watching slideshows. It is passive: past snapshots are shown sequentially to users, who can use controls somewhat similar to those found in VCR players (Fig. 2). In addition, in order to facilitate browsing a page history and understanding the page’s evolution, changes between consecutive snapshots are detected and indicated on the displayed pages. Changes (both content additions and deletions) are emphasized by different background colors or, alternatively, presented as animation effects. The latter produces a smoother page transition that mimics the process of page evolution by showing deleted content as a disappearing and added content as an appearing animation effect. The user can control the speed of the presentation using a slider provided in the top-right corner of the browser (Fig. 2).
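The detection of additions and deletions between consecutive snapshots can be illustrated with a line-level text diff. This is a minimal sketch rather than the browser's actual algorithm, which is not specified here: it assumes snapshots are available as plain text, uses Python's standard difflib module, and the function names detect_changes and change_degree are hypothetical.

```python
import difflib
from typing import List, Tuple


def detect_changes(old: str, new: str) -> Tuple[List[str], List[str]]:
    """Return (added_lines, deleted_lines) between two snapshots."""
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    added: List[str] = []
    deleted: List[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.extend(old_lines[i1:i2])   # content that disappeared
        if tag in ("insert", "replace"):
            added.extend(new_lines[j1:j2])     # content that appeared
    return added, deleted


def change_degree(old: str, new: str) -> float:
    """0.0 for identical snapshots, approaching 1.0 for unrelated ones."""
    matcher = difflib.SequenceMatcher(
        a=old.splitlines(), b=new.splitlines(), autojunk=False)
    return 1.0 - matcher.ratio()


older = "Top story: election\nWeather: sunny\nSports: final tonight"
newer = ("Top story: earthquake\nWeather: sunny\n"
         "Sports: final tonight\nMarkets: stocks fall")
added, deleted = detect_changes(older, newer)
```

The two returned lists correspond to the content that a browser could render with the "added" and "deleted" background colors, while change_degree gives one possible scalar measure of how much a page changed between two snapshots.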
Because page snapshots may sometimes be too large to be shown at once, the user can choose between an automatic scrolling option and the option of displaying only the top part of the page content. However, for simplicity and due to certain technical constraints, when the change degree between any two consecutive page snapshots is higher than a predefined threshold, no animation is performed and changes are emphasized using only different background colors.

¹ http://www.archive.org
² http://www.yahoo.com
³ http://www.google.com

The user can stop the browsing at any time by pressing the stop or pause button. Then, upon clicking on any link, the browser loads the snapshot of the linked page that is closest in time to the one currently being viewed⁴ and, after a short time period, automatically starts browsing the history of the visited page.

The navigation support is realized through several functionalities. One is a clickable timeline that is displayed above the page content (Fig. 2) and shows the distribution of available page snapshots. The currently viewed page snapshot is indicated in the timeline by a blue rectangle, informing users about the “browsing time” within the page history. Users can make time jumps to any page snapshot simply by clicking on the red dots that represent available page snapshots. The timeline can also be zoomed to provide a more detailed view. Besides the timeline, a clickable list of all page snapshots with their timestamps is also displayed.

Another navigation functionality is realized by two back and two forward buttons for moving between the histories of different pages in snapshot-consistent or time-consistent mode. The first pair of back and forward buttons realizes the snapshot-consistent mode, which is basically the same as the one employed in standard browsers and simply returns the previously visited snapshots. The second pair of buttons, on the other hand, realizes “vertical” movement in the past Web to minimize the time variance between visited pages. Thus, it accesses the snapshots of previously visited pages that are closest in time (i.e., have the smallest difference between timestamps) to the currently displayed snapshot.

The next navigation mechanism is an automatic jumping facility. It enables the browser to skip changeless periods in the page history or periods during which the content did not change much. When this functionality is switched on, the browser displays only those page snapshots that exhibit more than a certain degree of change. As a result, browsing the page history should be smoother, more interesting and take less time, especially in the case of relatively “static” (unchanging) pages.

Lastly, the browser allows query-constrained navigation within page histories through change selection based on user-specified query terms. Only those content changes that contain a given set of terms are then indicated, and only the snapshots that contain them are visited. Thanks to this function a user can browse the history of a page from the perspective of an arbitrary topic, for example, watching the history of the CNN page⁵ to see any past content about the Iraq War.
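The three navigation aids described above (time-consistent lookup, automatic jumping and query-constrained navigation) can be sketched in a few lines. Everything in this example is an assumption made for illustration: snapshots are modeled as (timestamp, text) pairs with numeric timestamps, the change measure is a simple Jaccard distance over word sets, automatic jumping compares each snapshot with the last one shown, and a change is taken to match a query when all query terms were added or removed relative to the previous snapshot.

```python
from bisect import bisect_left
from typing import Callable, List, Sequence, Set, Tuple

# (timestamp, page text); timestamps are plain numbers, e.g. days since an epoch.
Snap = Tuple[float, str]


def closest_snapshot(history: Sequence[Snap], t: float) -> Snap:
    """Time-consistent lookup: the snapshot whose timestamp is nearest to t.
    `history` must be non-empty and sorted by timestamp."""
    times = [ts for ts, _ in history]
    i = bisect_left(times, t)
    candidates = history[max(i - 1, 0): i + 1]   # neighbors on both sides of t
    return min(candidates, key=lambda s: abs(s[0] - t))


def word_change_degree(old: str, new: str) -> float:
    """Hypothetical change measure: Jaccard distance between word sets."""
    a, b = set(old.split()), set(new.split())
    return 1.0 - len(a & b) / len(a | b) if a | b else 0.0


def snapshots_to_show(
    history: Sequence[Snap],
    threshold: float,
    degree: Callable[[str, str], float] = word_change_degree,
) -> List[Snap]:
    """Automatic jumping: keep the first snapshot, then only those that
    differ from the last shown snapshot by more than `threshold`."""
    shown: List[Snap] = []
    for snap in history:
        if not shown or degree(shown[-1][1], snap[1]) > threshold:
            shown.append(snap)
    return shown


def query_constrained(history: Sequence[Snap], terms: Sequence[str]) -> List[Snap]:
    """Keep snapshots whose change relative to the previous snapshot
    (words added or removed) contains every query term."""
    wanted = {t.lower() for t in terms}
    kept: List[Snap] = []
    previous: Set[str] = set()
    for ts, text in history:
        words = set(text.lower().split())
        if wanted <= (words ^ previous):
            kept.append((ts, text))
        previous = words
    return kept


history = [
    (0, "news about sports"),
    (10, "news about sports"),
    (20, "news about sports today"),
    (30, "war report from iraq"),
]
```

With a threshold of 0.3, snapshots_to_show(history, 0.3) skips the two middle snapshots, and query_constrained(history, ["Iraq"]) visits only the last one. Comparing against the last shown snapshot, rather than the immediately preceding one, is a design choice that prevents many small successive edits from being skipped indefinitely.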

⁴ Due to the links rewritten within the Internet Archive’s Web collection.
⁵ http://www.cnn.com

2.2 Page History Summary
The next interaction mechanism that we discuss generates concise representations of the historical content of pages or, in other words, summaries of page histories. Such temporal summaries should provide rough overviews of the past content of a page and its evolution over time, and may be classified as a kind of higher-level history representation. As pages often have multiple past snapshots stored in Web archives, browsing their histories may be tiresome and require much time. We thus think that there is a need for a system that, upon users’ request, would generate temporal summaries of pages. The summaries could be tailored to specific time periods or embrace the whole lifetimes of pages. They could also be general-purpose or topic- or query-dependent. The summaries can take the form of simple statistics or be produced as a concatenation of salient content parts extracted from the page’s historical content. The generation method should depend on particular user needs and on the purposes of the produced summaries. In general, providing summaries should increase users’ understanding of pages. A page’s historical summary could also be contrasted with the current page content to determine whether the latter is consistent with the main page topics.

In Figure 3 we demonstrate a prototype system called Page History Explorer (PHE) for summarizing the historical content of pages and its evolution [4,7]. Through a term cloud structure the system displays the main terms that occurred in a page over time. Thus, if a given term frequently appeared in the past content of the page, it is shown in a large font size. Term clouds can also be built using the activity levels of terms, thus indicating those terms that are often added to or removed from the page within selected time periods. This should help to better characterize the content of past changes in pages. In addition, term clouds are generated for a selected number of unit time periods of the page history in order to better portray changes in page content over time.

Page dynamics and appearance are further revealed by displaying a series of thumbnails of past page snapshots arranged on a 2D pane chronologically and according to the degree of content change. The vertical distance between any pair of thumbnails indicates the degree of their content difference, while the horizontal distance denotes the length of the time period bounded by the timestamps of the snapshots. The summarized views of page histories produced by PHE can be modified and explored in real time by entering query terms or interacting with the displayed results (e.g., zooming thumbnails, or increasing/decreasing the number of used snapshots or the length of the selected time period or unit time periods).
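The two kinds of term cloud described above (frequency-based and activity-based) can be approximated as follows. This is a rough sketch and not PHE's implementation: it assumes whitespace tokenization, counts a term once per snapshot, and maps weights linearly to font sizes; the function names and the default point sizes are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def term_frequencies(snapshots: List[str]) -> Counter:
    """Count in how many snapshots each term occurs."""
    counts: Counter = Counter()
    for text in snapshots:
        counts.update(set(text.lower().split()))
    return counts


def term_activity(snapshots: List[str]) -> Counter:
    """Count how often each term is added to or removed from the page
    between consecutive snapshots."""
    activity: Counter = Counter()
    for old, new in zip(snapshots, snapshots[1:]):
        activity.update(set(old.lower().split()) ^ set(new.lower().split()))
    return activity


def font_sizes(weights: Counter, min_pt: int = 10, max_pt: int = 36) -> Dict[str, int]:
    """Linearly map term weights to font sizes in points."""
    if not weights:
        return {}
    lo, hi = min(weights.values()), max(weights.values())
    span = (hi - lo) or 1   # avoid division by zero when all weights are equal
    return {term: min_pt + round((w - lo) * (max_pt - min_pt) / span)
            for term, w in weights.items()}


snaps = ["kyoto news", "kyoto weather", "kyoto news sports"]
sizes = font_sizes(term_frequencies(snaps))
```

In this toy history the stable term "kyoto" receives the largest font in the frequency cloud but does not appear at all in the activity cloud, which illustrates how the two views characterize a page differently. Clouds for unit time periods can be obtained by calling the same functions on the snapshots that fall within each period.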

2.3 Enriching Pages with History-derived Information
The last interaction type that we describe relies on an indirect usage of historical data on pages [5]. Users may not wish to access or browse page histories, keeping their interest limited to the present versions of pages and the present Web. However, certain knowledge or content can be extracted from page histories in order to enhance the present views of pages. The difference between this interaction mechanism and the one based on summarizing page histories boils down to the nature of the derived knowledge and its relation to the present page version. While the latter concisely represents the content and behavior of pages in the past, the former may use any kind of history-based knowledge that is relevant to the current page state. Such knowledge is then mapped onto the current page view in order to improve its presentation and understanding. For example, the origin dates of content elements currently seen on pages may be detected so that their age can be approximately judged. The obtained information can then be added to the present page view in the form of static or dynamically generated (e.g., via mouse events) annotations. In another example, the page parts that changed most frequently or remained most static over time can be determined and indicated on the current page view (e.g., by different background colors). In both cases we provide a kind of temporal context for the current content.

Figure 2 Snapshot of Past Web Browser [6].

Figure 3 Example of history-based summary of an example page.

We present here an application that realizes the above-mentioned example of detecting the age of content elements and annotating the page with this information. The motivation for this kind of application is that content is often introduced to pages at different time points, although this may not be apparent to users visiting the pages. Certain pages sometimes indicate their content’s age, for example, to alter its perception by users. However, a large fraction of pages give no indication of the age of their content. On the other hand, users may need to know how old certain content elements in pages are. For example, they may ask since when a given person’s name has been listed on a laboratory’s home page or since when a financial statement has appeared on a company’s home page.

We have thus implemented a browser extension that lets users obtain information on the age of displayed Web content. This information is determined through efficient search in repositories of archived Web page data. In particular, the system triggers binary or sequential⁶ search processes within the page history in order to determine the timestamps of the past snapshots in which given content elements occurred for the first time. This involves multiple comparisons of past page snapshots with the present version (see Fig. 4). Naturally, the precision of the derived origin dates depends on several factors, of which the most important are the number and distribution of past page snapshots. Lastly, the current page content is annotated with the determined age-related information and displayed to users.

⁶ Binary search is preferred over the sequential one for its lower cost.
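The binary and sequential searches for the origin date of a content element can be sketched as below. The sketch makes assumptions that the text above does not state: snapshots are (timestamp, text) pairs sorted by time, a content element is matched by simple substring containment, and, for the binary variant, an element that has appeared is assumed to remain on the page in every later snapshot. All names are hypothetical.

```python
from typing import List, Optional, Tuple

Snap = Tuple[str, str]  # (timestamp as ISO date string, page text)


def first_occurrence_binary(history: List[Snap], element: str) -> Optional[str]:
    """Timestamp of the earliest snapshot containing `element`.

    Assumes `history` is sorted by timestamp and that the element, once
    added, stays on the page in all later snapshots. Returns None if even
    the newest snapshot lacks the element."""
    if not history or element not in history[-1][1]:
        return None
    lo, hi = 0, len(history) - 1      # invariant: history[hi] contains element
    while lo < hi:
        mid = (lo + hi) // 2
        if element in history[mid][1]:
            hi = mid                   # origin is at mid or earlier
        else:
            lo = mid + 1               # origin is later than mid
    return history[lo][0]


def first_occurrence_sequential(history: List[Snap], element: str) -> Optional[str]:
    """Walk backward from the newest snapshot while the element is present.
    Needs no monotonicity assumption but inspects more snapshots."""
    origin: Optional[str] = None
    for ts, text in reversed(history):
        if element not in text:
            break
        origin = ts
    return origin


history = [
    ("2006-01-10", "Staff: Tanaka"),
    ("2006-06-02", "Staff: Tanaka"),
    ("2007-03-15", "Staff: Tanaka, Jatowt"),
    ("2007-11-20", "Staff: Tanaka, Jatowt, Kawai"),
    ("2008-05-01", "Staff: Tanaka, Jatowt, Kawai"),
]
```

The binary variant inspects O(log n) snapshots instead of O(n), which matters when every comparison requires fetching a snapshot from a remote archive. In either case the returned timestamp is only an upper bound on the true origin date: the element was added at some point between the preceding snapshot and the returned one, which is why the number and distribution of snapshots determine the precision.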

Figure 4 Visualization of age detection process through sequential comparison of past page snapshots with the present page snapshot.

3. DISCUSSION
3.1 General Issues
The proposed categorization of history-based interactions is by no means exhaustive. We expect that there are other interesting approaches to interacting with and utilizing page histories. We think that many such interactions may have one feature in common: the reference to the present time point. This is because the key purpose and advantage of studying the history of any real-world object (e.g., countries, persons, companies) is to facilitate the understanding of its present state and, if possible, to make some predictions about its future. We think that a similar rule may apply to the histories of Web pages.

There are still many issues and unresolved obstacles related to interacting with document histories (e.g., technical or legal ones). For example, the proposed interaction methods may suffer from a high temporal cost, causing considerable delays when implemented as browser extensions. In addition, they may place a heavy burden on Web archives if the number of users is too high. From a legal viewpoint, certain copyright restrictions may prevent Web archives from freely providing access to their collections to all users. Alternatively, the heightened popularity of such interactions and unrestricted access to past content might provoke page authors and content owners to request the removal of the past content of their pages from Web archives. However, the objective of this paper is not to discuss the existing or potential problems of the above history interaction mechanisms but rather to propose and emphasize new usage cases and their potential benefits.

The popularity and actual usefulness of Web archives will largely depend on the freedom of access to them, on the variety of interaction methods and on their successful applications. We hope that Web archives will be used more by average Web users rather than only by researchers or professionals. To this end, more research should be conducted on the various usage cases of the stored data.

In [4] we reported some of the results of a large-scale online questionnaire administered in Japan, which aimed at evaluating users’ needs for different kinds of temporal access when interacting with the Web. According to the results, only 1.9% of Japanese Web users know and use any Web archive. However, many subjects indicated that information on the age of content on the Web is quite important to them and that they would like to revisit content that has already disappeared.

3.2 History-based Comparison
Humans often wish to compare and contrast the histories of real-world objects in order to derive some useful knowledge. In the same way, page histories or the information obtained from them can be compared for different purposes. We list the following possible comparison schemes based on the previously introduced basic interaction mechanisms:

• Comparison of single page snapshots of the same page. This includes the natural comparison of the present and past page snapshots or the comparison between any two snapshots. Both indicate the level of changes that the page has undergone through time.

• Comparison of past snapshots of different pages from the same or nearly the same time points. This enables observing how differently the pages looked at the same time points in the past, for example, to see how they reflected some real-world events.

• Comparison of past snapshots of different pages at different time points. This sort of comparison could be, for example, topic-driven, where one could compare content relevant to a certain topic in different pages.

• Comparison of sets of page historical snapshots within the same or different pages. For example, the histories of the same or different pages could be contrasted over certain selected time periods. The sets of snapshots can be consecutive, thus representing the reconstructed page history, or could be arbitrarily arranged to achieve a desired effect. For instance, page snapshots containing a given term could be contrasted with those containing another term within the same or different pages.

• Comparison of historical page summaries (or other higher-level history representations) within the same or different pages.

• Comparison of modifications made to page views by using the knowledge derived from their histories (as in Section 2.3). For example, it could be possible to determine which document currently provides the freshest content.

The first three types represent comparison at the level of single historical snapshots, while the remaining ones operate at the level of sets of historical snapshots. We assume here a page snapshot as the basic comparison unit; yet smaller units (e.g., sentences, paragraphs) could be used, too. In addition, cross-type comparison is also possible (e.g., the comparison of differences between the present and the oldest page snapshots in two different pages). Also, more than two pages can be compared at the same time.

The above categorization schemes call for providing definitions of different types of time in Web pages.

3.3 Types of Time on the Web
Some researchers have already proposed categorizations of time, usually in the context of early hypertext systems. For example, Luesebrink [8] distinguished various times specific to hypertext literature, such as the interface time, the time span during which users interact with documents, or the cognitive time, which determines the chronological ordering of events in the narrative. In addition, the notion of time branches, in which hypertexts could have different versions at the same time, was quite common in hypertext theory. This usually does not apply to Web pages, which have unique identifiers (URLs) and thus generally cannot have different states at the same time⁷. We distinguish four different times related to Web content and to users interacting with it:

• Transaction time - the time of content creation and occurrence in a page.

• Valid time - the time during which the information in page content is valid from the viewpoint of the real-world entities or events that it refers to. For example, the information about hiring new employees posted on a company’s homepage is valid as long as the company is really seeking new staff.

• Interaction time of a user with a page - the time when the user performs any sort of activity on the page (e.g., accessing, reading, editing).

• Social interaction time - the time when other users interact with a page.

The interplay between the above times should provide opportunities for establishing new types of interactions and applications supporting them. One such interplay could involve the social interaction time. Usually, when accessing documents, users have little information about their popularity among other users. If data on the historical popularity of a page were available, it could be used to improve the browsing experience or to construct higher-quality history representations. For example, historical page traffic or the rate at which social bookmarks were created may indicate periods of elevated user interest in the page and may be used for the above purposes.

⁷ Adaptive and, in general, dynamically generated Web pages could be considered to some extent as parallel page versions.

4. RELATED RESEARCH
Several proposals have been made so far for automating version management in Web servers. For example, the DeltaV protocol⁸ supports remote versioning and configuration management of documents stored on Web servers. Dyreson et al. [1] proposed transaction-time Web servers that would automatically preserve previous versions of documents and provide standardized access methods to them.

Visual Knowledge Builder (VKB) [11] was an early proposal of an application for history navigation in private hypertexts. The authors’ objective was to enable users to play back the history of hypertexts much as in VCR players. Users could then witness the authoring styles of hypertexts and understand their various historical contexts.

WERA⁹ (Web ARchive Access) and the Wayback Machine¹⁰ are applications for accessing data stored in Web archives. WERA supports time and URL input for retrieving a particular page snapshot. It shows a timeline with available past page snapshots and indicates the one whose content is displayed in the main window. Browsing page histories along the time axis can be done by clicking arrows in the timeline. The Wayback Machine is a Web-based interface to Web archives that is best known as the gateway to the Internet Archive’s Web collection. Past content of pages can be accessed by directly entering a specially modified URL containing the requested date. Another way is through the directory page listing available past snapshots. Users can access the content of snapshots and even follow their links, as the links are rewritten to point to other snapshots within the archive’s collection. The directory page also indicates page snapshots that contain changes when compared to the consecutive snapshots by marking them with asterisks. The concept of combining browsing of archived data with browsing the live Web has also been implemented in WAXToolbar, a Firefox browser extension to the Wayback Machine¹¹.

⁸ http://www.webdav.org/deltav
⁹ http://archive-access.sourceforge.net/projects/wera
¹⁰ http://web.archive.org

In the multi-authoring area, an interesting application has recently been proposed for effective visualization of the histories of wiki pages [13]. It allows viewing the contributions of different authors and their persistence over time. From a social viewpoint, Wexelblat and Maes [12] demonstrated the Footprints system, which adds social context to browsed document structures by utilizing historical data on user visits. As a result, new users could be guided to useful and popular resources. McCown et al. [10] estimated the persistence and availability of pages in several Web repositories such as search engines and the Internet Archive. Their project, Warrick, was started with the goal of helping users to reproduce the latest content of their Web sites in cases when they were accidentally lost. Lastly, the Web archiving community has long been involved in the issues of content selection, preservation and management. An overview of the current state of the art as well as future directions in this area can be found in [9].
Particular cases in which Web archives should be useful for users were listed by the International Internet Preservation Consortium [2].
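As an illustration of the date-bearing Wayback Machine URLs mentioned earlier in this section, the helper below builds such an address. The URL pattern (a 14-digit YYYYMMDDhhmmss timestamp placed between the archive prefix and the original address) reflects the Wayback Machine's public URL scheme rather than anything specified in this paper, and the function name wayback_url is hypothetical.

```python
from datetime import datetime


def wayback_url(page_url: str, when: datetime,
                base: str = "http://web.archive.org/web") -> str:
    """Build a Wayback Machine address requesting `page_url` as of `when`."""
    stamp = when.strftime("%Y%m%d%H%M%S")   # 14-digit timestamp
    return f"{base}/{stamp}/{page_url}"


example = wayback_url("http://www.cnn.com", datetime(2003, 3, 20, 12, 0, 0))
```

The archive is not guaranteed to hold a capture for the exact requested moment; in that case it typically serves the capture nearest to the requested date.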

5. CONCLUSIONS
Currently, users have little temporal support when visiting Web pages. They cannot directly access and interact with the past content of pages. In this paper we have outlined and classified some possible methods of interaction with page histories. We believe that tighter integration of pages with their histories through different interaction mechanisms and various applications should improve the browsing experience of Web users, for example, by facilitating judgment of the trustworthiness of content, and should indirectly increase the popularity of Web archives. In the future, it will be necessary to determine which kinds of history-based interaction are most useful for users and in which situations.

6. ACKNOWLEDGMENTS
This research was supported in part by the National Institute of Information and Communications Technology, Japan, by the MEXT Grant-in-Aid for Scientific Research in Priority Areas entitled: Content Fusion and Seamless Search for Information Explosion (#18049041, Representative: Katsumi Tanaka), by the Kyoto University Global COE Program: Informatics Education and Research Center for Knowledge-Circulating Society (Representative: Katsumi Tanaka) and by the MEXT Grant-in-Aid for Young Scientists B (#18700111, Representative: Adam Jatowt; #18700110, Representative: Yukiko Kawai).

¹¹ http://archive-access.sourceforge.net/projects/waxtoolbar

7. REFERENCES
[1] Dyreson, C. E., Lin, H.-L. and Wang, Y. Managing versions of web documents in a transaction-time Web server. Proceedings of the 13th International World Wide Web Conference, 2004, 422–432.
[2] IIPC’s Access Working Group. Use cases for Access to Internet Archives, 2006, http://netpreserve.org/publications/iipc-r-003.pdf
[3] Jatowt, A., Kawai, Y., Nakamura, S., Kidawara, Y. and Tanaka, K. Journey to the past: proposal for a past web browser. Proceedings of the 17th ACM Conference on Hypertext and Hypermedia, 2006, 134–144.
[4] Jatowt, A., Kawai, Y., Ohshima, H. and Tanaka, K. What Can History Tell Us? Towards Different Models of Interaction with Document Histories. Proceedings of the 19th ACM Conference on Hypertext and Hypermedia, ACM Press, Pittsburgh, USA, 2008.
[5] Jatowt, A., Kawai, Y. and Tanaka, K. Detecting age of page content. Proceedings of the 8th International Workshop on Web Information and Data Management, 2007, 137–144.
[6] Jatowt, A., Kawai, Y. and Tanaka, K. Utilizing Past Web for Knowledge Discovery. In: Krol, D., Nguyen, N.T. (Eds.): Intelligence Integration in Distributed Knowledge Management, IGI Global, 2008, 283–301.
[7] Jatowt, A., Kawai, Y. and Tanaka, K. Visualizing historical content of web pages. Proceedings of the International World Wide Web Conference, 2008, 1221–1222.
[8] Luesebrink, M.C. The moment in hypertext: a brief lexicon of time. Proceedings of the 12th ACM Conference on Hypertext and Hypermedia, 1998, 106–112.
[9] Masanes, J. (ed.). Web archiving. Berlin Heidelberg New York, Springer Verlag, 2006.
[10] McCown, F., Diawara, N. and Nelson, M.L. Factors affecting website reconstruction from the web infrastructure. Proceedings of the Joint Conference on Digital Libraries, 2007, 39–48.
[11] Shipman, F. M. and Hsieh, H. Navigable history: a reader's view of writer's time. New Review of Hypermedia and Multimedia, vol. 6, 2000, 147–167.
[12] Wexelblat, A. and Maes, P. Footprints: history-rich tools for information foraging. Proceedings of Conference on Human Factors in Computing Systems, 1999, 270–277.
[13] Viégas, F., Wattenberg, M. and Dave, K. Studying cooperation and conflict between authors with history flow visualizations. Proceedings of CHI 2004, 575–582.
