The New Face of Enterprise Search: Bridging Structured and Unstructured Information

New technology has expanded the scope of the search engine, opening the doors for searches beyond the traditional

Joaquin Delgado, Ph.D., Renaud Laplanche, and Viswanathan Krishnamurthy


Enterprise search allows employees to search and retrieve the information needed to accomplish professional activities in a manner compliant with their enterprise's information sharing and access control policy. Such information includes publicly available information, proprietary enterprise information, and private employee information available on the employee's desktop.

As of August 12, 2005, Yahoo estimated that there were more than 20 billion pages on the Internet. In addition, billions of e-mails are traded each day between corporations, and the volume of information stored in corporate intranets, file systems, document management systems, and other enterprise applications is growing faster than ever. Consequently, the task of finding a reference in a six-month-old e-mail or retrieving the latest memo published on a given subject becomes more complex and time-consuming each day.

Concurrent with increasing volume, the information available to corporate workers originates from a larger and more diverse set of sources. Information appears on public and semi-public information networks such as blogs and Really Simple Syndication (RSS) feeds, for example, while within corporations, scanning and optical character recognition processes render paper archives into digital formats. The most unsettling diversification and complexity of data sources, however, has appeared as a function of the extended role, scope, and reach assigned to enterprise search software.

First-generation enterprise search engines were limited to searching a single data source or, in the most complex cases, to combining several internal indexes or federating queries to several underlying search engines. With the emergence of new, more powerful search engines, scope has expanded to provide "universal" search, including the ability to seamlessly access data stored in enterprise applications such as databases and enterprise resource planning systems. Such an expanded scope and functionality implies that structured data (e.g., in a database) and unstructured data, although stored in different sources, can and should now be managed jointly.

The increasing diversity of applications that rely on search as a cornerstone points to another implication. In addition to solving traditional "finding" problems, search is now also being used for information integration, discovery, collaboration and knowledge management, compliance, and records management. Enterprise search must thus go beyond traditional document-based information retrieval. Enterprise information is managed by myriad secure applications and systems that impose an extra layer of abstraction over the original data source. This layer determines protocols, access methods, data control and display, as well as the business logic that ultimately shapes output to end users. Understanding the concept of enterprise search requires a fresh look at documents and data as they exist in today's organizations, the mechanisms for discovering content, and the services that enable users to perform better, more relevant searches.

At the Core

This article:
• Examines the relationship of documents and data
• Describes models that search software uses to discover information
• Relates how software services affect search relevance

Information Restructuring: Revisiting the "Document" Model

To fully understand current enterprise search mechanisms, it is critical to realize one thing: there is constant information restructuring, that is, transitioning back and forth between structured and unstructured data, in modern information systems. Normal electronic documents, or what are referred to as static documents, are the digital equivalent of analog paper documents; they are more or less permanent (or at least stable) snapshots of information saved in document repositories and handled, for example, through word processing and electronic mail.

In addition to those static documents, there is now an increasing number of transient or virtual documents that are generated on the fly and constitute temporary renderings of live data. Most enterprise applications are now web-enabled, allowing users to interface with them via a regular web browser. By accessing these applications over the Internet, users are able to perform transactions and retrieve information in the form of transient documents, which are dynamically generated HTML "screens" or PDF "reports." Transient documents are often composites of text, images, and structured fields assembled from various back-end systems such as relational databases and file systems.

These virtual documents are usually mapped to rows within database tables, views, or query results. The declarative and semi-structured nature of extensible markup language (XML) makes it the preferred vehicle for transmitting, transforming, and rendering data into transient documents; hence the ongoing effort to fully embrace and support XML as a standard within relational databases and applications. Despite connecting to the same back-end systems, such enterprise applications may yield totally different search output, depending upon the user's role and privileges on the system. This applies not only to transactional systems but to any system that handles data fields, whether they contain strings, numbers, or Boolean indicators, or unstructured content such as text, audio, video, and so forth, including all document repositories and the sophisticated systems that sit on top of them.
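To make the row-to-document mapping concrete, the short sketch below renders a database row as a transient XML document of the kind a web-enabled application might serve. It is an illustration only: the invoice table, its columns, and the element names are invented for the example, not taken from any particular product.

    import sqlite3
    import xml.etree.ElementTree as ET

    # Illustrative in-memory database standing in for an enterprise back-end system.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (id INTEGER, customer TEXT, amount REAL, status TEXT)")
    conn.execute("INSERT INTO invoices VALUES (1042, 'Acme Corp', 1875.50, 'OPEN')")

    def render_transient_document(invoice_id):
        """Assemble a temporary XML 'document' from a structured row, on the fly."""
        row = conn.execute(
            "SELECT id, customer, amount, status FROM invoices WHERE id = ?",
            (invoice_id,),
        ).fetchone()
        doc = ET.Element("document", attrib={"type": "invoice", "transient": "true"})
        for name, value in zip(("id", "customer", "amount", "status"), row):
            field = ET.SubElement(doc, name)
            field.text = str(value)
        return ET.tostring(doc, encoding="unicode")

    # The XML exists only for the duration of the request; nothing is saved to disk.
    print(render_transient_document(1042))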


Interestingly, although the actual data may be stored in structured repositories, it is often organized and presented to users in a more coherent, consumable, and natural way via natural language, tabular, or pictorial displays. This is because human beings process and interpret data differently than computers do. Computers require mathematical models and data structures to represent and process information. For example, for computers to index and search text, a text stream must be parsed and broken down into small pieces such as tokens or words, and sometimes into larger logical units such as sentences or parts of speech. These are then heuristically or statistically analyzed for fast retrieval and intelligent language processing.

Considering the possible combinations of applications, users, queries, and data, it becomes obvious that much relevant and useful information resides not just in static documents but also in transient documents, which are temporary in nature and thus more difficult to find. Not surprisingly, employees in large and medium-sized corporations find themselves in a position where, before they can effectively search for any given information, they must first spend a great amount of time finding out which applications contain which information. Only then can they log into each application and perform the required sequence of actions to retrieve the desired information. This is in sharp contrast to the simple, intuitive search and retrieval paradigm of Internet search engines that are beloved by all.
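As a toy illustration of the parsing and indexing step described above, the following sketch breaks a text stream into word tokens and builds a tiny inverted index; the sample documents are invented, and real engines layer heuristic and statistical analysis on top of this basic structure.

    import re
    from collections import defaultdict

    def tokenize(text):
        """Break a text stream into lowercase word tokens."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_index(documents):
        """Map each token to the set of document ids that contain it."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for token in tokenize(text):
                index[token].add(doc_id)
        return index

    docs = {
        "memo-17": "Latest memo on the records management policy",
        "mail-03": "Please review the attached compliance memo",
    }
    index = build_index(docs)
    print(sorted(index["memo"]))   # both documents contain the token 'memo'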


Uncovering "Hidden" Enterprise Data and Information Sources

Most enterprise search tools encounter obstacles that are somewhat similar to those experienced by web search engines in a realm known as the "hidden web" or "deep web." The "deep web" refers to database-driven, publicly accessible or password-protected websites that Internet search engine crawlers and spiders cannot easily access, thus excluding potentially valuable information from the searchable universe. The issue of hidden pages worsens as the number of web-based applications grows, because the volume of documents or web "impressions" provided by these applications grows exponentially. In the context of an enterprise, the problems raised by hidden pages are compounded by the necessity of dealing with named users as opposed to the Internet's anonymous users, as well as by issues such as corporate security and the varying access protocols (for example, WebDAV, HTTP, POP3, IMAP, FILE) that each application may use. Several information discovery paradigms can be implemented to solve these issues.

The Crawling Paradigm

Before indexing a document set, search engines first collect or "crawl" such documents; that is, they must read all static and transient documents provided by the underlying systems regardless of the access protocol each system uses. This usually requires the set-up of multiple crawlers or spiders, each running as a super-user with at least read access to all documents. The crawler will either simulate the user actions required to access all possible documents via a regular user interface – such as by traversing links and filling in forms in web applications – or will use a provided application programming interface (API) to retrieve the documents.
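A minimal sketch of such a crawler, reduced to link traversal over plain HTTP, might look like the following. The seed URL is a hypothetical intranet address, and real crawlers would also handle authentication, form filling, and the other access protocols discussed earlier.

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collect href targets from anchor tags while parsing a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(value for name, value in attrs if name == "href" and value)

    def crawl(seed_url, max_pages=50):
        """Pull-model crawl: fetch pages, follow links, return a url -> html mapping."""
        queue, seen, pages = [seed_url], set(), {}
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except (OSError, ValueError):
                continue  # unreachable or protected page; a real crawler would authenticate
            pages[url] = html
            parser = LinkExtractor()
            parser.feed(html)
            queue.extend(urljoin(url, link) for link in parser.links)
        return pages

    # Hypothetical seed; the intranet host name is an assumption for illustration.
    # pages = crawl("http://intranet.example.com/")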

The crawling paradigm imposes multiple challenges and issues:

• Manageability and maintenance
º Application-specific crawlers use either a standard access protocol or an API. Unfortunately, the number of protocols and APIs inevitably grows as new applications are introduced, and it rapidly becomes cumbersome to maintain these crawlers and ensure that they are kept up to date as the underlying applications evolve. For example, each version upgrade of the underlying application must be accompanied by a version upgrade of its corresponding crawler. This is particularly problematic in the case of API-based development, as software vendors do not always release new versions of their APIs with the same frequency as new versions of their applications.

• Authentication and security
º Enterprise search engines must be able to identify and authenticate users against all the underlying systems they search. Sometimes these identities are not shared, though much progress has been made in the area of enterprise-wide directories and single sign-on technology. Search engines have to work in synchronization with these systems.
º Enterprise search engines must also comply with security and access restrictions at search time. This is normally the case for all secure sources, and additional user-access-related metadata is needed in the form of access control lists. Sometimes, access metadata is difficult to get, simply not available, or encoded as business rules. (A minimal sketch of such a search-time access check follows this list.)

• Application-agnostic search
º Crawlers usually know little about the application they are dealing with and thus provide very little application-specific information to the search engine, especially when it comes to metadata schemas and the ways the search engine should treat metadata when filtering and sorting results.

• Data deprivation and synchronization
º Applications have no way of communicating to the crawler what is important and what is not, and no way to convey other facts that could improve both crawling and information retrieval.
º Crawlers follow a "pull" model of information awareness. This means that crawlers must efficiently scan the source application to learn about the existence of new documents or to update the index to reflect changes in the original system or data source. Since crawlers have to deal with large amounts of information, and because systems do not cooperate by alerting the crawler to such changes, it is unrealistic to expect that crawlers will keep up with changes, a situation that sometimes creates inconsistent indexes.
º Some virtual documents generated by applications are the result of complex operations, including selected options and filled-out forms, of which the crawlers have no knowledge. This is the classic "deep web" problem that deprives the search engine of information it is unable to crawl.
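The search-time access check referenced under "Authentication and security" can be sketched as a simple filter over per-document access control lists; the documents and group names below are invented for illustration.

    # Access control lists: document id -> groups allowed to see it (illustrative data).
    ACLS = {
        "hr-memo-12": {"hr", "executives"},
        "press-release-7": {"everyone"},
        "merger-plan": {"executives"},
    }

    def filter_by_acl(result_ids, user_groups):
        """Drop hits the requesting user is not entitled to see."""
        allowed = set(user_groups) | {"everyone"}
        return [doc_id for doc_id in result_ids if ACLS.get(doc_id, set()) & allowed]

    raw_hits = ["press-release-7", "hr-memo-12", "merger-plan"]
    print(filter_by_acl(raw_hits, {"hr"}))   # ['press-release-7', 'hr-memo-12']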

The Federated Search Paradigm

Federated search is an attempt at solving enterprise search problems by allowing applications to develop and handle their own search engines. The user is authenticated to the main search software via an identification token that is passed back and forth between the main search software and each application. User queries can then be federated to the individual application search engines, and the search results can be merged and displayed to the user in a combined results list.

Although federated search is certainly an attractive and, in some cases, a very effective proposition, the approach also presents several drawbacks. First, the development and maintenance of embedded search engines, even if based on a standard platform, is very computer- and human-resource intensive. Second, query federation still suffers from the problems associated with crawling within each application, especially given the size of some individual systems. Finally, it is desirable that enterprise information management be controlled, and expertise acquired, by a central entity in a position to enforce corporate standards and policies, and this can hardly be achieved with a federated search model.
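A schematic of the fan-out-and-merge step might look like the sketch below. The per-application search functions, the score-based merge, and the token value are stand-ins for illustration, not any product's API.

    from concurrent.futures import ThreadPoolExecutor

    def search_crm(query, token):
        """Stand-in for a CRM application's embedded search engine."""
        return [("crm:account-note-88", 0.72)]

    def search_dms(query, token):
        """Stand-in for a document management system's search engine."""
        return [("dms:contract-v3.pdf", 0.91), ("dms:old-draft.doc", 0.40)]

    def federated_search(query, token, backends):
        """Fan the query out to each application's engine, then merge by score."""
        with ThreadPoolExecutor() as pool:
            result_lists = list(pool.map(lambda search: search(query, token), backends))
        merged = [hit for results in result_lists for hit in results]
        return sorted(merged, key=lambda hit: hit[1], reverse=True)

    # The identification token would normally come from the single sign-on layer.
    print(federated_search("contract renewal", token="demo-token",
                           backends=[search_crm, search_dms]))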

The Search Proxy Paradigm

Search proxies are another interesting approach, one that leverages the most common architecture of today's information systems. Web applications usually rely on application servers or web servers that act as proxies to transmit information to and from end users' web clients. These proxies sometimes cache – that is, temporarily store – virtual documents that users have accessed through applications. Caching accelerates recurrent access because virtual documents are then retrieved from the cache rather than from the original application. Instead of crawling the original disparate sources, proxy servers capture these virtual documents and make them searchable by providing indexing and search services. Proxy servers work well for information that has previously been searched and retrieved, but they provide little help in discovering new information or services. Search proxies must also capture additional facts, such as which application generated each virtual document and which users have access rights to it, which can prove a complex and resource-consuming task.
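A search proxy can be pictured as a thin wrapper that caches each virtual document it passes along and hands a copy to an indexer, as in the sketch below; the application callable is a stand-in, and the access-rights bookkeeping described above is omitted for brevity.

    class SearchProxy:
        """Sit between users and a web application, caching and indexing responses."""

        def __init__(self, fetch_from_application):
            self.fetch = fetch_from_application   # callable: (url, user) -> html text
            self.cache = {}                       # url -> virtual document
            self.index = {}                       # token -> set of urls

        def get(self, url, user):
            if url not in self.cache:
                document = self.fetch(url, user)
                self.cache[url] = document
                self._index(url, document)
            return self.cache[url]

        def _index(self, url, document):
            for token in document.lower().split():
                self.index.setdefault(token, set()).add(url)

        def search(self, term):
            """Only documents someone has already retrieved can be found here."""
            return self.index.get(term.lower(), set())

    proxy = SearchProxy(lambda url, user: f"Quarterly report requested by {user}")
    proxy.get("/reports/q3", user="alice")
    print(proxy.search("quarterly"))   # {'/reports/q3'}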

The Content Syndication Paradigm

The content syndication or publishing paradigm is, in effect, a "push" model that gives each application the ability to publish or syndicate relevant data that a central search server then subscribes to, retrieves, and processes. This model provides better coverage, more up-to-date search results, and better security by giving each application full control over which data is published and when it is published. Some web search engines have implemented services that leverage similar mechanisms and are clearly inspired by the "publishing" model. Examples are Google's Sitemaps (http://www.google.com/webmasters/sitemaps/docs/en/about.html), which allows sites to provide specific information about their content, such as when a page was last modified or how frequently a page changes, and A9's OpenSearch RSS (http://opensearch.a9.com/), which promotes a query syntax and content delivery standard for search services that can be used for both content syndication and federated search.
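In the spirit of the sitemap-style feeds mentioned above, the sketch below shows how a subscribing search server might parse a small published feed and decide which pages to re-fetch; the feed contents and the last-indexed date are invented for illustration.

    import xml.etree.ElementTree as ET
    from datetime import date

    # A small sitemap-style feed an application might publish (illustrative content).
    FEED = """
    <urlset>
      <url><loc>http://apps.example.com/invoices/1042</loc><lastmod>2005-09-01</lastmod></url>
      <url><loc>http://apps.example.com/invoices/1043</loc><lastmod>2005-10-15</lastmod></url>
    </urlset>
    """

    def pages_to_refresh(feed_xml, last_indexed):
        """Return locations whose last modification is newer than the index."""
        stale = []
        for url in ET.fromstring(feed_xml).iter("url"):
            loc = url.findtext("loc")
            lastmod = date.fromisoformat(url.findtext("lastmod"))
            if lastmod > last_indexed:
                stale.append(loc)
        return stale

    # The search server subscribes to the feed and pulls only what has changed.
    print(pages_to_refresh(FEED, last_indexed=date(2005, 10, 1)))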

Why Content Syndication?
• RSS and blogs have popularized it.
• It gives control to each individual application, which can publish or syndicate relevant data that a central search engine can pick up and process.
• Popular search engines are promoting it and pushing it as a standard.

Finding Relevant Content Using Services

If all enterprise data – structured and unstructured, without regard for the application hosting it – can be efficiently searched using one of the paradigms described above, then enterprise information overload has actually worsened. The linkage of structured and unstructured data and the ability to search it all simultaneously increase the volume of data that can potentially be returned by enterprise search engines. Search-related services, however, can determine which documents applications should make available to the search engine and present those documents to users in the most efficient manner.

Document Services

The extraction of metadata from static or virtual documents is an example of a document service that enables better searches. The progress made by concept/entity extraction technology vendors now makes it possible to extract relevant concepts and qualified entities from textual information to create metadata that can be indexed by a search engine or imported into an application. This metadata must be preserved and, if possible, enhanced and standardized when conveyed by the application to the search engine through a crawler, federator, or any other information discovery paradigm described previously. Certain metadata can also be used to tell which information must be made available by the application to the search engine. For example, a "document modified date" can trigger the application to push the document to the search engine in the content syndication paradigm. For applications to extract or locate and publish such valuable metadata, there is a growing need for a core set of interchangeable document services that would enable any application to transform raw, unstructured data into annotated documents containing the right set of metadata.

Another example of document services enabling better searches is the ability to use a document's location, as opposed to its content, to provide additional contextual information. For example, the data structure – which may be the document's place in a corporate file plan's taxonomy, a table's structure in a database, or a site map on an intranet or extranet site – can be used to assign coordinates that are made available to the search engine as additional metadata or as search options for the user. Taking this idea a little further, creating a "data structure index" that can be searched prior to or simultaneously with the term index would provide additional contextual information. Such an index would let organizations leverage the existence, structure, and content of structured information to gain better access to and understanding of unstructured content.
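The general shape of such a document service can be sketched as a function that takes raw text plus its location and returns an annotated document with a few metadata fields; the extraction patterns and field names below are simplified assumptions, not a description of any vendor's extraction technology.

    import re

    def annotate(raw_text, location):
        """Wrap raw text with a small set of extracted and contextual metadata."""
        return {
            "content": raw_text,
            "metadata": {
                # Naive pattern-based 'entity extraction', for illustration only.
                "dates": re.findall(r"\b\d{4}-\d{2}-\d{2}\b", raw_text),
                "emails": re.findall(r"\b[\w.]+@[\w.]+\b", raw_text),
                # Location-derived coordinates, e.g. a place in a file plan taxonomy.
                "taxonomy_path": location.strip("/").split("/"),
            },
        }

    doc = annotate(
        "Contract signed 2005-11-02; contact [email protected] for records retention.",
        location="/finance/contracts/2005/",
    )
    print(doc["metadata"]["taxonomy_path"])   # ['finance', 'contracts', '2005']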

[Figure: Search-Enabled Applications. Source: Oracle. Used with permission.]



Search Services

There are a number of search services that search engines or software applications make available to end users. The most basic search service provides the ability to post a query and get back a set of results in a format that can be displayed efficiently to the user. In addition to this basic search-request-response service, enterprise search engines and other applications often offer both pre-query and post-query search services. Pre-query search services include various query builders, such as a spell checker, query expansion and/or query reduction mechanisms that help the user refine the original query by comparing it to other users' queries, or a thesaurus that helps the user find synonyms or perform concept searching. Post-query search services offer alternative ways to visualize search results, such as graphical result maps, result taxonomies, or clustering, and sometimes also let the user act on the result set, for example by re-ordering it or reducing it to a subset based on metadata (re-ordering by date, say, or viewing only documents from a given author).

Search services are a key driver for expanding the scope and reach of enterprise search and for giving users the ability to find unstructured and structured data stored in various enterprise applications. Each application can provide users with search services covering comprehensive enterprise information, but with a common interface for posting queries and displaying result sets, relying on the enterprise search server to process all queries and search results.

There is a third set of search-related services that are outside the scope of this discussion but worth mentioning: user services, which include identity and session management, user profile management, and mechanisms such as user agents and subscription management.
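The pre-query and post-query services described above can be pictured as small functions wrapped around the basic request-response call, as in the sketch below; the thesaurus entries, result records, and metadata fields are illustrative assumptions.

    # Pre-query service: expand the query with synonyms from a tiny thesaurus.
    THESAURUS = {"invoice": ["bill", "statement"]}

    def expand_query(query):
        terms = query.split()
        for term in list(terms):
            terms.extend(THESAURUS.get(term, []))
        return " OR ".join(terms)

    # Post-query services: re-order by date or keep only one author's documents.
    def reorder_by_date(results):
        return sorted(results, key=lambda r: r["modified"], reverse=True)

    def restrict_to_author(results, author):
        return [r for r in results if r["author"] == author]

    results = [
        {"title": "Invoice policy", "author": "kim", "modified": "2005-06-30"},
        {"title": "Billing FAQ", "author": "lee", "modified": "2005-09-12"},
    ]
    print(expand_query("invoice policy"))          # 'invoice OR policy OR bill OR statement'
    print(reorder_by_date(results)[0]["title"])    # Billing FAQ
    print(restrict_to_author(results, "kim"))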


Crawling Toward the Future

New information discovery paradigms and emerging standards that facilitate search services will help make enterprise search much more efficient. Enterprise search will also become truly ubiquitous when enterprise applications are able to "call" search services. While the four information discovery paradigms mentioned above will undoubtedly continue to be used, it is believed that the crawling model will progressively be replaced by the publishing or content syndication model. This model allows for real-time data push, security, and access control checking. It also allows for centralization of the crawling-indexing process, ensuring better compliance control and seamless integration into existing IT infrastructures and business processes. As increasingly sophisticated and efficient search services become available to enterprise applications, users will gain the ability to find any structured or unstructured data, wherever it resides, without the need to interrupt their work or switch between applications.

Joaquin Delgado, Ph.D., is a Consulting Member of the Technical Staff in Oracle Corporation's Server Technology Division, within the group that develops Oracle Enterprise Search and XML/DB Technologies. Previously, he was Chief Technology Officer and co-founder of TripleHop Technologies Inc., a software firm that developed MatchPoint, an award-winning context-sensitive search, retrieval, and classification software. He can be reached at [email protected].

Renaud Laplanche is a Senior Director in the GSS Technology group of Oracle Corporation, responsible for Enterprise Search. Previously, he was CEO of TripleHop Technologies. He can be reached at [email protected].

Viswanathan Krishnamurthy is a Senior Director in the Server Technologies group of Oracle Corporation, responsible for the development of unstructured data management technology. He can be reached at [email protected].


