Chapter 3. Methodology

3.1. Introduction

This study decomposed the triage process in digital reference services that receive questions via asynchronous media – email and webforms. While real-time services, utilizing synchronous media such as chat environments and instant messaging, are becoming increasingly important in the digital reference community, no literature on real-time reference indicates that any services perform triage on incoming real-time requests, though Francoeur (2001) states that more sophisticated question management applications allow an administrator to "transfer" incoming requests to available librarians (p. 193). This study was therefore concerned only with those services that receive questions via asynchronous media and perform triage.

The first goal of this study was to discover what could be learned from direct observation of the triage process about both the actions performed by triagers and how question type affects those actions. The second goal of this study was to use what was discovered about the triage process to draw up a set of rules that can be utilized as the basis for designing and building a system to automate part or all of the triage process.

The methodology employed by this study was, first, a series of think-aloud studies with triagers from a cross-section of digital reference services, to elicit the actions they perform on questions during the triage process and their reasons for those actions. These actions were analyzed and classified utilizing the constant comparative method from grounded theory. A subset of the questions triaged during the think-aloud studies was then classified by a subset of the triagers themselves, according to three taxonomies of questions at different levels of linguistic analysis that were identified through a review of the literature that deals with questions, from the fields of desk and digital reference, question answering, and linguistics. Intercoder reliability statistics were computed on these coders' classifications, to determine the reliability with which questions can be classified according to these taxonomies. Finally, the strength of correlation between question type and the question attribute that affected the triager's triage decision was computed.

The remainder of this section presents the research questions and research design for this study. The remainder of this chapter discusses in detail the methodology employed by this study.

3.1.1. Research Questions

Digital reference consists of five major processes (Pomerantz et al., forthcoming), all of which are in some way concerned with taking action upon questions. It is the first of these processes that is within the direct control of the digital reference service and that is the subject of this study: the triage process. The first goal of this study was to observe and decompose the triage process, to learn from direct observation of the actions performed by triagers how question type affects those actions. The second goal of this study was to use what was discovered about the triage process to draw up a set of rules that can be utilized as the basis for designing and building a system to automate part or all of the triage process. The research questions for this study follow from these goals:

RQ1. What attributes of questions affect the triage process?

RQ2. How does question type correlate with the action taken on a question in the triage process?

The first research question was exploratory, and sought to discover which characteristics of questions influence the triage decisions made by digital reference triagers. This question breaks down into two sub-questions:

RQ1a. What attributes of questions are taken into account by digital reference triagers when performing triage on received questions?

RQ1b. How do these attributes affect triagers' decisions in triaging questions?

Research question 1a seeks to discover what attributes of questions affect a triager's decision-making process in determining where or to whom to triage a question. Research question 1b seeks to discover how these attributes influence these triage decisions. Other criteria, not specific to the question, that influence triage decisions were also discovered, but these are not the primary focus of this study.

The second research question tests the hypothesis that question type correlates with triagers' decisions in triaging that question. Question type is defined here as the intersection of the classes into which a question is classified under a faceted scheme of question taxonomies.
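As an illustration of this faceted definition, a question's type can be represented as one class from each facet. This is a minimal sketch only; the class labels below are hypothetical placeholders, not classes drawn from the taxonomies identified in Phase 2.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QuestionType:
    """A question's type: the intersection of one class from each facet."""
    wh_word: str          # class from a taxonomy of wh- words
    answer_function: str  # class from a taxonomy of functions of expected answers
    answer_form: str      # class from a taxonomy of forms of expected answers

# Hypothetical classification of a single received question.
q_type = QuestionType(wh_word="where",
                      answer_function="factual",
                      answer_form="list of sources")

# Two questions have the same type only if they agree on every facet.
```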

3.1.2. Research Design

This study employed a two-phase design approach (Creswell, 1994), conducting first a qualitative and then a quantitative phase to investigate the research questions. The steps in the study methodology are represented in Figure 3-1.

Phase 1: Think-aloud Triage Studies → Phase 2: Identification of Question Taxonomies → Phase 3: Classification Task

Figure 3-1: Research Methodology

The three-phase approach taken by this study allowed for a two-pronged approach to the study of the triage process. The first phase employed qualitative methodology to elicit from expert digital reference triagers (1) the attributes of questions that they take into account during the triage process, and (2) how these attributes affect their decisions on actions to take on questions in the triage process. The second phase began with a literature review to identify question taxonomies, and concluded with a quantitative analysis of the intercoder reliability of classification according to these taxonomies and a qualitative evaluation of these taxonomies. The third phase employed both qualitative and quantitative methodology to determine the strength of the correlation between question type and the action taken on it in the triage process. The steps in the methodologies of Phases 1-3 are presented in Tables 3-1 through 3-3, respectively.

Table 3-1: Steps in the Phase 1 Methodology: Think-aloud Triage Studies

1. Selection of digital reference services: Digital reference services were selected as the source of questions and respondents. It was required that the practice of these services is to both forward questions to other services and triage questions to individual answerers (as opposed to letting answerers self-select questions). It was also required that these services be willing to let the researcher use their questions (stripped of identifying information) for the Classification task.

2. Solicitation of respondents: Respondents were solicited from the set of triagers employed by the digital reference services selected in the previous step.

3. Think-aloud task: Respondents were instructed to perform triage as usual. Respondents were instructed to think aloud about the task they were performing, with specific attention to attributes of questions that affected their triage decisions. This task continued for the duration of the respondent's triaging all questions received by the service, up to 30 questions. This step was repeated with 28 digital reference services.

4. Normalization of generated terms: Terms or descriptions suggested by respondents in step 3 for attributes of questions that affected their triage decisions were normalized, so that synonymous ways of referring to the same type of attribute were reduced to one canonical term or phrase.

Table 3-2: Steps in the Phase 2 Methodology: Identification of Question Taxonomies

1. Identification of question taxonomies: Taxonomies of questions were identified in the literature.

2. Development of scope notes: Scope notes were developed for the classes in the identified question taxonomies.

3. Digital reference service selection: One digital reference service was selected as the source of questions for the test data set.

4. Data selection: Questions were selected from the archives of the digital reference service.

5. Classification of questions: Questions were classified according to the question taxonomies.

6. Intercoder reliability testing: The taxonomies and scope notes, and a subset of the set of questions, were provided to coders. Coders were instructed to classify the questions according to the taxonomies.

7. Evaluation of question taxonomies: Question taxonomies were evaluated to determine how expressive each is for the questions in the data set.

8. Modification of question taxonomies: The question taxonomies were modified based on this evaluation, to improve the expressiveness of each for the questions in the data set.

Table 3-3: Steps in the Phase 3 Methodology: Classification Task

1. Solicitation of coders: Coders were solicited from among the triagers who participated in the Phase 1 think-aloud study.

2. Data selection: A stratified random sample of 30 questions was created from those triaged and collected by the researcher during the 28 think-aloud studies.

3. Classification task: The question taxonomies identified and modified in Phase 2, and the sample of questions selected in step 2, were provided to the coders. Coders were instructed to classify 10 questions according to each of the three taxonomies, for a total of 30 questions.

4. Computation of intercoder reliability statistics: The intercoder reliability statistic Cohen's κ was computed between all coders.

5. Computation of strength of correlation: The correlation statistic Cramér's V was computed between the types of questions, as classified by the coders, and the attributes of questions that affected triagers' triage decisions.
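Steps 4 and 5 of Table 3-3 rely on two standard statistics for nominal data: Cohen's κ corrects the observed proportion of agreement between two coders for the agreement expected by chance, and Cramér's V rescales the χ² statistic of a contingency table onto the interval [0, 1]. The following minimal sketch shows one way these statistics can be computed; the coder and attribute labels are hypothetical placeholders, not the study's actual data.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

# Hypothetical data: two coders assigning the same ten questions to classes
# from one taxonomy (the labels below are placeholders, not the study's data).
coder_a = ["who", "what", "where", "what", "why", "what", "how", "who", "what", "where"]
coder_b = ["who", "what", "where", "why", "why", "what", "how", "who", "where", "where"]

# Cohen's kappa: agreement between the two coders, corrected for chance.
kappa = cohen_kappa_score(coder_a, coder_b)

# Cramer's V: strength of association between two nominal variables, here a
# question's type and the attribute that drove the triage decision on it.
question_type = coder_a
triage_attribute = ["subject", "subject", "format", "subject", "difficulty",
                    "subject", "difficulty", "subject", "subject", "format"]

contingency = pd.crosstab(pd.Series(question_type, name="type"),
                          pd.Series(triage_attribute, name="attribute"))
chi2, _, _, _ = chi2_contingency(contingency)
n = contingency.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(contingency.shape) - 1)))

print(f"Cohen's kappa: {kappa:.2f}, Cramer's V: {cramers_v:.2f}")
```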


3.2. Phase 1: Triage Think-aloud Studies

In a Delphi study of digital reference triagers, Pomerantz, Nicholson, and Lankes (2003) identified fifteen factors that triagers agree affect the process of triage (see section 4.2.2.4 for the list of these fifteen factors). These fifteen factors served as a "partial framework" (Glaser and Strauss, 1967, p. 45) for this phase of the present study's analysis and discovery of attributes of questions that influence the triage process. This section discusses the methodology employed in the present study to elicit and analyze these attributes.

This phase of the study sought to answer Research Question 1: What attributes of questions affect the triage process? This question breaks down into two sub-questions:

RQ1a. What attributes of questions are taken into account by digital reference triagers when performing triage on received questions?

RQ1b. How do these attributes affect triagers' decisions in triaging questions?

In order to answer these questions, triagers were observed performing the task of triage, utilizing the think-aloud methodology.

3.2.1. The Think-aloud Method

Ericsson and Simon (1980) describe two possible relationships between cognitive processes and verbalization: concurrent verbalization, in which "information is verbalized at the time the subject is attending to it," and retrospective verbalization, in which "a subject is asked about cognitive processes that occurred at an earlier point in time" (p. 218). The think-aloud method is a methodology to elicit concurrent verbalization of an individual's internal cognitive processes, and further, to structure the verbalization process so that the verbalization can be utilized as data.


The premise of the think-aloud method is that individuals may not have conscious access to all of their internal cognitive processes involved in performing a particular task. The think-aloud method does not, therefore, attempt to gain access to individuals' internal cognitive processes, but rather to elicit verbalizations that are representative of individuals' cognitive processes. The major assumption made by the think-aloud methodology is that elicited verbalizations are in fact representative of individuals' cognitive processes.

Van Someren, Barnard, and Sandberg (1994) discuss a simple model of the human cognitive system, breaking this system down into three parts:

1. The sensory system, "that transforms information from the environment into an internal form,"
2. Long-term memory, "where knowledge is stored more or less permanently," and
3. Working memory, "where the currently 'active' information resides" (p. 20).

Van Someren, Barnard, and Sandberg (1994) claim that neither the contents of the sensory system nor of long-term memory can be verbalized, "unless it is somehow retrieved" (p. 20) – in other words, stored temporarily in working memory. Thus, it is only the contents of working memory that can be verbalized. The think-aloud methodology is able to elicit verbalizations that are representative of individuals' cognitive processes by eliciting verbalizations from the contents of working memory.

The think-aloud methodology elicits verbalizations from the contents of working memory by providing an individual with a specific task, and instructing the individual to speak aloud while performing that task. The individual is instructed to say anything and everything that crosses his or her mind, speaking constantly, without consciously filtering what is being said (insofar as that is possible). In this manner, the respondent will articulate his or her cognitive processes involved in performing the given task. The think-aloud method is used in this phase of this study to elicit data from expert digital reference triagers about the attributes of questions that they take into account during the triage process, and how these attributes affect their decisions in the triage process.

Selecting an appropriate task is the most important factor in the reliability of the think-aloud method. Certain tasks are inappropriate for the think-aloud method because they are either non-verbal or entirely verbal: non-verbal tasks such as dance are inappropriate because they do not require internal verbalization to perform, and so may be difficult or impossible to externally verbalize. On the other hand, a verbal task such as performing a reference interview is inappropriate because the task itself requires the subject to be speaking, thus allowing no "space" for the subject to verbalize about the task without interrupting the task itself. The ideal task for the think-aloud method is one that requires thought that can be expressed verbally, while not requiring much or any verbalization in the process of performing the task. The task of triage is therefore highly appropriate for the think-aloud method, as it is an inherently verbal task: questions, which are textual documents, are routed and assigned within and between services for reasons that can be articulated verbally.

A task must not simply be verbal to be appropriate for the think-aloud method. Van Someren, Barnard, and Sandberg (1994) provide two criteria for a task that will allow elicitation of useful data in a think-aloud study (p. 36):

1. The task must be relevant to the cognitive processes that the study is attempting to elicit, requiring the subjects to engage in the appropriate cognitive processes.
2. The task must be sufficiently difficult to require the subjects to think about solving it; it cannot be so simple that it can be solved automatically.

The task of triage is as relevant as it is possible to be to the cognitive processes that this study is attempting to elicit, since the object of this phase of the study is the triage process itself. The cognitive processes that this phase of the study is interested in are the reasons that triagers have for making triage decisions.


The task of triage is technically quite a simple task to perform: in digital reference services that perform triage on questions received by email, triage involves forwarding an email message to the appropriate email address for an individual expert or other digital reference service. Forwarding an email is technically simple; every email application has a Forward button and an address book. What are more complex are the reasons for choosing that particular expert or other digital reference service. Triage requires a fair amount of thought, as deciding to which expert or service to forward any specific question involves an analysis of a number of variables.

An individual may know how to perform a task, but may not be able to articulate how that performance occurs. Indeed, the more expert the individual performing a task is at performing that particular task, the less able he or she may be to articulate how he or she performs it. As van Someren, Barnard, and Sandberg (1994) state, "they are used to do their job, not to explain it" (p. 1). One individual summed up this difficulty perfectly, in her email response to the initial solicitation: "…my colleague, X, who is actually in charge of question triage for our group, was showing me what she does when she sorts the incoming mail (coincidentally enough!). We opened up your email and… X's response was 'what would I have to say to him?' I don't think she thinks of it as a highly conscious sort of task – more about gut instinct…"

Not everyone makes a good subject for a think-aloud study. There are many factors that may interfere with an individual's thinking aloud in a way that is useful for this methodology. Some experts may perform a task from "gut instinct," and may be unable to articulate the steps that go into the performance of that task. Any explanation from such an expert is therefore likely to be incomplete, and may contain much that is revisionist, as the expert may attempt to make a coherent story out of a process that may be full of inconsistencies and idiosyncrasies. Alternatively, an expert may simply not be comfortable talking aloud constantly while performing an action, either because he or she is reticent by nature or has difficulty "multi-tasking" in that way. In either case, the think-aloud method will elicit spotty or incomplete data from such individuals. The individual referred to as X above may be one of those individuals, and indeed, X declined to be a respondent for this phase of the study.

Van Someren, Barnard, and Sandberg (1994) state that "a little training will help people to become more fluent" with verbalizing their thoughts, but that even after such training, differences in verbalization ability remain between individuals (p. 35). The think-aloud method is therefore susceptible to the following bias: it is possible that the data elicited is biased towards those respondents who are comfortable with verbalizing their thoughts as they perform an action. Even if data is elicited from individuals who are less comfortable verbalizing their thoughts, more data may be elicited from those individuals who are more so. While this may be a bias in data elicited using the think-aloud method, there is no reason to believe that the reasons for making triage decisions differ between those respondents who are comfortable verbalizing their thoughts and those who are not.

In light of these shortcomings of the think-aloud method, it is worth reiterating why it is still a useful method for this phase of the study. In fact, there are two reasons why the think-aloud method is appropriate for this study. First, the task of triage is sufficiently difficult that it requires respondents to think about solving it, and it is an inherently verbal task. Second, having triagers think aloud about the task of triage elicits data about their cognitive processes during their performance of this task, and it is of course precisely triagers' decision-making processes that are the object of this phase of the study.

3.2.2. The Constant Comparative Method

To analyze the data elicited in this phase of the study on the attributes of questions that triagers took into account during the triage process, and how these attributes affected the triage process, the constant comparative method was utilized. The constant comparative method was first described by Glaser and Strauss (1967), as a method for "joint coding and analysis" of qualitative data (p. 102). This method allows data to be collected and analyzed and categories of entities to be developed from the data, and then more data to be collected and analyzed and the categories revised or new categories developed, and this process to be repeated.

Before the constant comparative methodology can be employed, data must be collected. It does not matter how data is collected; the more different methodologies are employed, the better. Indeed, Glaser and Strauss (1967) state that "no one kind of data on a category nor technique for data collection is necessarily appropriate. Different kinds of data give the analyst different views or vantage points from which to understand a category" (p. 65). Beyond simply advocating the use of diverse methods in collecting data, Glaser and Strauss come close to advocating data collection by any means necessary, even relating a story about a colleague of theirs who used bribery to collect data.

Prior to this study, Pomerantz, Nicholson, and Lankes (2003) utilized the Delphi methodology to elicit fifteen factors that triagers agree affect the process of triage. These fifteen factors served as a "partial framework" (Glaser and Strauss, 1967, p. 45) for this phase of the present study's analysis and discovery of attributes of questions that influence the triage process. Building on this partial framework of fifteen factors, the present study utilized the think-aloud method, as discussed above, to discover an additional twenty-three factors that are taken into account by triagers when performing triage.

The first step in the constant comparative methodology, then, is to code the data, classifying each piece of data into as many categories as it can fit into. These categories emerge through the researcher's experience in collecting and analyzing data; categories emerge as the researcher makes generalizations about entities in the data, and is able to state that a specific entity is an example of a specific category. As more and more data is collected, categories are created and refined, and it becomes clearer to which categories specific entities belong. For this phase of the study, the fifteen factors that Pomerantz, Nicholson, and Lankes (2003) discovered influence the triage process served as a partial framework of categories, according to which the attributes of questions elicited in these think-aloud studies were coded (see section 4.2.2.4 for the list of these fifteen factors). As data was collected from the think-aloud studies, more attributes that influence the triage process emerged, and more categories of attributes were developed. In this way, data collection and analysis proceeded side-by-side, each influencing the other, guiding the researcher both in attributes to probe for in the think-aloud studies, and in how to interpret new attributes in the data analysis.

The constant comparative methodology allows this data collection and analysis cycle to proceed for as long as necessary. As long as necessary, according to Glaser and Strauss (1967), is until saturation is achieved: the point in the data analysis at which "no additional data are being found whereby the [researcher] can develop properties of the category" (p. 61). This point is reached when the same or similar entities continue to be elicited from the methods of data collection being employed, and no new categories are being developed. In order to ensure that a "false" saturation is not achieved, Glaser and Strauss state that the researcher should "look for groups that stretch the diversity of data as far as possible, just to make certain that saturation is based on the widest possible range of data" (p. 61).

It was not possible to collect data from a very wide range of sources for this study, since this study is concerned specifically with digital reference triagers – and not triagers from other types of reference services, not individuals who perform other roles in reference services, etc. In order to ensure that data was collected from as wide a range of sources as possible, triagers were solicited for participation in this study from digital reference services of different types: AskA services and academic, public, and special libraries, in all English-speaking countries in the world. The details of this solicitation are discussed in the subsequent section.

3.2.3. Selection of Digital Reference Services

For this phase of the study, digital reference services were selected for solicitation from two sources:

• The libraries listed in the LIBWEB database that offer digital reference services, and
• The digital reference services listed in the VRD's AskA Locator.

The LIBWEB website (sunsite.berkeley.edu/Libweb) is a database of library websites, maintained by the University of California at Berkeley's Digital Library SunSITE. LIBWEB is the most complete database of library websites that exists online as of this writing, listing over 6,500 library websites in over 100 countries. The LIBWEB database lists libraries by geographic regions roughly corresponding to the continents. Within these regions, the database is further subdivided by country. Within the United States, libraries are divided up according to the following categories:

• Academic libraries
• Public libraries
• National libraries and library organizations
• State libraries
• Regional consortia
• Special and school libraries

For the think-aloud studies to be conducted, it was necessary that the researcher speak to the respondents, if not in person then on the telephone. Since the researcher is not fluent in any language but English, in order to avoid any language barrier between the researcher and the respondents, these think-aloud studies had to be conducted in English. In order for this to be possible, libraries were solicited only in nations in which, according to the CIA World Factbook (www.cia.gov/cia/publications/factbook), English is spoken. The CIA World Factbook lists 53 nations in which English is spoken, though not always as the only language spoken. Not all 53 of these nations have libraries listed in the LIBWEB database; thus, libraries were solicited only in nations in which English is spoken and which have libraries listed in the LIBWEB database. This restriction reduced the number of nations in which libraries were solicited to 27:

1. Australia
2. Bermuda
3. Botswana
4. Brunei
5. Canada
6. Cyprus
7. Fiji
8. Ghana
9. India
10. Ireland and Northern Ireland
11. Jamaica
12. Kenya
13. Namibia
14. New Zealand
15. Nigeria
16. Pakistan
17. Philippines
18. Singapore
19. South Africa
20. Sri Lanka
21. Trinidad and Tobago
22. Uganda
23. The United Kingdom
24. The United States
25. The Virgin Islands
26. Zambia
27. Zimbabwe

As of this writing a total of 5,340 libraries were listed in the LIBWEB database in these 27 nations. A sample of 125 libraries was randomly selected from these 5,340 libraries to be solicited for participation in this phase of the study.

The AskA Locator (www.vrd.org/locator) is a database of AskA services, maintained by the Virtual Reference Desk Project. The AskA Locator is perhaps the only database of AskA services that exists online. As of this writing the Locator lists 92 AskA services. The AskA Locator organizes services according to the following fourteen subject areas:

• Arts
• Careers
• Educational Management
• Foreign Language
• General Education
• General Reference
• Health
• Language Arts / Linguistics
• Mathematics
• Philosophy
• Physical Education
• Science
• Social Studies
• Vocational Education

No definitive list of digital reference services exists; it is therefore impossible to know the ratio of libraries that offer digital reference service to AskA services that exist in the world. However, if the LIBWEB database and the AskA Locator are any indication, far more libraries that offer digital reference service exist than AskA services. As stated above, 125 services were randomly selected from the LIBWEB database. So as not to bias the sampling for this phase of the study by sampling a disproportionate number of AskA services relative to the number of libraries that offer digital reference service, a smaller number of AskA services were sampled. A sample of 50 AskA services was randomly selected from the 92 services listed in the AskA Locator to be solicited for participation in this phase of the study. In total, 175 services – both services affiliated with a physical library and AskA services – were solicited for participation in this study. The break-down of these 175 services by type is as follows:

• 79 academic libraries
• 48 AskA services
• 41 public libraries
• 7 special libraries

Additionally, the services solicited are located in several countries. Most are in the United States, but a few other countries are represented. The break-down of the locations of these services is as follows:

• United States: 147
• Canada: 19
• Australia: 3
• Netherlands: 2
• Singapore: 2
• United Kingdom: 2

The values of 125 libraries from the LIBWEB database and 50 services from the AskA Locator were chosen as arbitrarily large numbers that were likely to allow for a sufficiently large number of respondents, given the likelihood that some percentage of those solicited would not participate. If saturation had not been achieved after all of the think-aloud studies had been performed with the respondents from that pool of 175, it was the researcher's intention to solicit another set of libraries from the LIBWEB database and services from the AskA Locator. As it turned out, however, that was not necessary.
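The two random draws described in this section amount to simple unweighted sampling without replacement. The sketch below is illustrative only; the lists of LIBWEB libraries and AskA Locator services are hypothetical stand-ins for the actual directory entries.

```python
import random

# Hypothetical stand-ins for the two source lists described above.
libweb_libraries = [f"library_{i}" for i in range(5340)]  # LIBWEB listings, 27 nations
aska_services = [f"aska_{i}" for i in range(92)]          # VRD AskA Locator listings

rng = random.Random(0)  # arbitrary seed, for a reproducible draw
library_sample = rng.sample(libweb_libraries, 125)
aska_sample = rng.sample(aska_services, 50)

solicitation_pool = library_sample + aska_sample  # 175 services solicited in total
```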

3.2.4. Solicitation of Respondents

Prior to conducting a think-aloud study with the digital reference services selected for solicitation, contact was made with all of the 175 services that it was possible to contact, to determine if a service performed triage on received reference questions. This initial contact was made via the service's question submission webform, if the service maintained one, or via email if it did not (see Appendix A for the text of this solicitation). Initial contact was made in this way because it was a sure way to have the solicitation read by the triagers for the services: since the solicitation was submitted as any question would be submitted to a service, the triager would see the solicitation in the course of triaging the submitted questions for the day. This solicitation asked if the service triages questions to its librarians or experts, and if so, if it would be possible for the researcher to contact, by email or telephone, an individual who performs the triage process. It was then up to the service to respond to the solicitation.

If a service did not perform triage, the reason why triage was not performed was noted, and there was no further contact with that service. If a service did perform triage, the researcher contacted a triager at the service. In order for a triager to participate in this phase of the study, four criteria had to be met:

1. The service performed triage,
2. The service performed triage manually, and not automatically,
3. The triager was willing to participate in a think-aloud study, and
4. The service was willing to share the questions that were triaged during the think-aloud study with the researcher.

In most cases the triager who responded to the solicitation was willing to participate in a think-aloud study. In some cases, the triager who responded to the solicitation forwarded the researcher to another triager in the same library; the researcher then contacted that triager to solicit him or her for participation in this phase of the study. Once a triager was contacted and willing to participate in a think-aloud study, an appointment was set up to conduct a think-aloud study in person or, in more cases, over the telephone. Twenty-eight think-aloud studies were conducted for this phase of the study, with twenty-eight triagers, from twenty-eight different digital reference services.

3.2.5. Think-aloud Task

The goal of this phase of the study was to elicit from expert digital reference triagers the attributes of questions that they take into account when performing triage on received questions, and how these attributes affect their decisions in triaging questions. To elicit this data, triagers were observed performing the task of triage, utilizing the think-aloud methodology. Respondents were instructed to perform their job as usual, triaging incoming questions to the appropriate (in their estimation) reference or subject expert. Respondents were instructed to think aloud while performing triage, with specific attention to attributes of questions that affect their triage decisions. See Appendix B for the text of the instructions provided to the triagers before the think-aloud studies.

The think-aloud studies continued for as many questions as were received by services on that day, up to a maximum of 30 questions, if the service received more than 30 on the day that the think-aloud was performed. As it turned out, twenty-five out of the twenty-eight services studied (89.3%) received fewer than 30 questions on the day that the think-aloud was performed.

As mentioned above, some think-aloud studies were conducted in person, but most were conducted over the telephone. Before beginning the studies, the triager was instructed to try to speak aloud everything that goes through his or her mind, and to verbalize his or her thoughts as they occur, and not interpret or filter them. In conducting the think-aloud studies in person, the researcher sat with the triager while he or she was performing triage. In conducting the think-aloud studies over the telephone, the researcher and the triager were on the telephone while the triager was performing triage.


The think-aloud studies were recorded, with the respondents' consent: for the in-person studies a small hand-held cassette recorder was placed in front of the triager; for the telephone studies a phone tap-like adapter was attached to the telephone and the hand-held cassette recorder. Once a think-aloud study was completed, the researcher transcribed the triager's verbalizations, word for word – including the researcher's own prompts and probes to the triager – to create a written protocol for every respondent.

As the triager was triaging questions, he or she also forwarded them to the researcher. It was necessary that the researcher have these questions so that they could be used in Phase 3 of the study: in order to determine how question type correlates with the action taken on a question in the triage process, the questions to be classified had to be ones that had actually been triaged. Before the think-aloud study, the researcher provided the triager with an email address to which to forward questions. During the think-aloud studies, as the triager triaged questions, he or she forwarded them to that email address. A total of 185 questions were triaged and collected during the think-aloud studies.

3.2.5.1. Prompts and Probes

As discussed above, not everyone makes a good subject for a think-aloud study, as there are a number of reasons that may interfere with an individual's thinking aloud. Fortunately for this study, all of the triagers were willing and able to articulate their thoughts while performing the triage process, though some required more prompting than others. Indeed, several of the librarians studied in this phase of the study were downright effusive in speaking about their job – they seemed pleased that the researcher was interested enough to ask! Nevertheless, the researcher took Van Someren, Barnard, and Sandberg's (1994) advice that "a little training will help people to become more fluent" with verbalizing their thoughts (p. 35), and provided the triagers with a bit of "warm up" time before beginning the process of triage. Van Someren, Barnard, and Sandberg suggest that respondents be given a practice task to perform, to get them used to thinking aloud. So as to impose on the respondents' time as little as possible, a practice task was not used for this study. Instead, the researcher asked the triager to talk aloud through the process of preparing to perform the triage process: logging in to the network, opening up the email account in which questions are received, and whatever other steps were taken by that specific triager.

There were, of course, a few triagers who were not so effusive in thinking aloud about their reasons for triaging questions – who were more reticent by nature, who got distracted and stopped thinking aloud, or any number of other possible reasons – and who therefore required prompting to think aloud. To compensate for this problem, the researcher formulated a list of neutral questions and comments to use as prompts to respondents who were not articulating their reasons for making triage decisions well. These questions and comments fell into two categories. Some respondents would at times cease to think aloud, lapsing into silent thought; some questions and comments to compensate for this situation are:

• "Please go on," "Please continue," etc.
• "Have you decided who/where you're going to send that question to?"
• "What question are you looking at now?"

Some respondents would provide a reason for making a triage decision that amounted to a statement of the obviousness of that particular decision. Some quotes from respondents along these lines are as follows:

• "That one's pretty obviously Disabilities."
• "Give that to John, because it's his kind of thing."
• "So, who am I going to give this to… give it to April."

These sorts of responses are interesting insofar as they make clear the sorts of triage decisions that the respondent perceives as so obvious that they do not even require a decision-making process. However, responses of this type are not at all informative, in that they do not provide a reason for making a triage decision that is meaningful to anyone but the respondent him- or herself. Some questions and comments to compensate for this situation are:

• "Why have you decided to send this question to X?"
• "What makes it obvious to you that this question should go to X?"
• "What is it about this question that makes it such a good fit for him/her/them?"

In a few of the think-aloud studies (though mercifully not many), the respondent was unable or unwilling to articulate his or her reasons for making a specific triage decision in sufficient detail that it was clear that he or she had exhausted these reasons. In these cases, the researcher allowed the respondent to move on to triaging the next question. After all questions had been triaged, the researcher then asked one or more contrast questions – questions which allow the researcher to "discover the dimensions of meaning which informants employ to distinguish the objects and events in their world" (Spradley, 1979, p. 60). Such questions are based in the Contrast Principle, which states that "the meaning of a symbol can be discovered by finding out how it is different from other symbols" (p. 157). By asking contrast questions, the researcher forced the respondent to compare the questions that had just been triaged (specifically, the one for which the triage decisions had not been well articulated) with other types of questions. It is impossible to formulate specific contrast questions ahead of time, as the specific questions that must be asked are highly dependent on the respondent and what he or she does – or more specifically does not say. Some contrast questions that the researcher asked were:

• "Why did you send the question about martial arts to John instead of Tracey?"
• "Even though you have someone [a subject expert] who specializes in medicine, you sent the antibiotics question to Dorothy. Why?"
• "If your numbers had been higher today [received more questions], where would you have sent the database development question?"

3.2.5.2. Content Analysis and the Unit of Analysis

Stemler (2001) states that "content analysis has been defined as a systematic, replicable technique for compressing many words of text into fewer content categories based on explicit rules of coding" (¶ 1). The analysis of the think-aloud data sought to do precisely that with questions: classify a large number of questions according to a finite number of classes.

Krippendorff (1980) states that there are three units that must be identified for any content analysis: sampling units, context units, and recording units (p. 57). For the purposes of this study, the sampling unit was the email message, and both the context unit and the recording unit were the question.

Sampling units are pieces of content that are independent of one another "as far as the phenomenon of interest is concerned" (Krippendorff, 1980, p. 58). The sampling unit for the think-aloud data was the email message from a patron to which a response was sent by VRD answerers in January 2001. Email messages received by VRD may be treated as if they are independent of one another for two reasons: First, given that VRD receives emails from users all over the world, the overwhelming probability is that the users who send these emails are independent of one another and have no knowledge of the email sent by other users. Second, if a single user sends more than one email to a digital reference service, there may be no way to know if subsequent emails are follow-ups to previous ones, unless the entire thread of the conversation is preserved in the text of the email.

There are some obvious objections to these two points made above. First, questions from associated users almost certainly make up some percentage of questions received by a digital reference service, such as several students from the same course submitting questions for an assignment. For example, there were several questions in the data set concerning factors that might affect heart rates. However, all of these questions were sent from different email addresses with different domains, and signed with different names. There is no evidence that these questions are not independent, despite the apparent coincidence. Second, "follow-up" questions do make up some percentage of the questions received by digital reference services. For example, there were two questions in the data set concerning painting a carving, sent by the same user. The chronologically second of these two emails was clearly a follow-up to the first, as it contained the same Subject: line, and the first line of the body of the second message was "Thank you for the information." However, the questions in these two emails can be treated as being independent: the first email contained a question about finding the proper citation for a specific book on painting, while the second email contained a question asking for – in the terminology of reference librarianship – "reader's advisory" service in selecting more sources for information on painting.

As stated above, Krippendorff states that sampling units must be independent of one another "as far as the phenomenon of interest is concerned." The phenomenon of interest to this study was question type – not the content of a question, or the "position" that a question occupies in the series of "moves" (Hutchby and Wooffitt, 1998) that constitute a conversation – in this case an asynchronous conversation between a user and a reference librarian. Despite the fact that some questions were in fact follow-ups to previous questions, questions in the data set were considered to be independent. This is consistent with Lankes' (1998a) conclusion that "digital reference services treat the user as simply a question" (p. 155).

Context units are the smallest piece of content that must be analyzed in order to identify a recording unit (discussed below). Context units do not need to be independent, though in this study they were treated as if they were independent. A single context unit may contain more than one recording unit. For the analysis of the think-aloud data, the context unit was the question as received by the digital reference service, for the following reasons: First, an email message received by a digital reference service may contain more than one question, either associated or independent. Second, a single question may span more than one sentence.

Recording units are "the basic unit of text to be classified" (Weber, 1985, p. 22). Recording units must be independent. For the analysis of the think-aloud data, the recording unit was also the question as received by the digital reference service. As discussed above, questions in the data set were treated as if they were independent.


3.2.6. Post-Think-aloud Survey

After conducting a think-aloud study with a triager, a short structured interview was conducted, to collect data to enable the "demographics" of the services studied in this phase of the study to be determined. This interview collected some supplementary data about the service itself and the triage process as performed by the service, if this data did not come up during the think-aloud study or the solicitation exchange. This data includes the following:

• The number of individuals employed by the service that perform triage, and
• The percentage of questions that come in through the service's webform versus directly to the service's email address.

Other data included in this post-think-aloud survey were seven of the nine key characteristics that differentiate the three different types of services according to their use of automation in the process of providing asynchronous digital reference, as described by Pomerantz and others (forthcoming): high-tech/low-touch, low-tech/high-touch, and high-tech/high-touch:

• Whether the service verifies the patron's email address prior to working on a response,
• Whether the service has the ability to detect follow-up questions,
• Whether the service stores question-answer sets in a knowledge base,
• Whether the service automatically searches a knowledge base when a question is received,
• Whether the service automatically generates a response to a question,
• Whether the service automatically tracks the progress or state of a question, and
• Whether patrons can pick up their responses on the web.

The eighth characteristic, whether a service maintains a webform, was determined by the researcher before contact was made with the service. The ninth characteristic, whether the service automatically triages questions to experts, was determined from the initial solicitation. As mentioned above, if the service performed triage automatically, it was eliminated from the pool of potential respondents.

There is therefore a bias in the data collected in this phase of the study, as this data was collected only from digital reference services that performed triage, and moreover performed triage manually, and not automatically. This bias was unavoidable, however, given that this phase of the study sought to discover what attributes of questions are taken into account by human triagers when performing triage on received questions, and how these attributes affect those human triagers' triage decisions.

3.2.7. Coding of Reasons for Triage Decisions

As discussed above, all think-aloud studies were tape-recorded, and the tape recordings transcribed, to create a written protocol for every respondent. These protocols were imported into ATLAS.ti, a software application for performing content analysis (www.atlasti.de). In ATLAS.ti these protocols were coded so that comments by respondents that indicated reasons for triage decisions were marked in the protocol. These codes were arrived at both deductively and inductively, utilizing the constant comparative method: the fifteen factors that Pomerantz, Nicholson, and Lankes (2003) discovered influence the triage process served as a partial framework of categories, according to which the attributes of questions elicited in these think-aloud studies were coded. As data was collected from the think-aloud studies, more attributes that influence the triage process emerged, and more categories of attributes were developed.

The think-aloud study sought to achieve "theoretical saturation" (Glaser and Strauss, 1967) of reasons for triage decisions. Glaser and Strauss do not offer any guidelines for recognizing when saturation has occurred; rather, it is a heuristic process, involving a great deal of subjectivity on the part of the researcher. Glaser and Strauss state that "after an analyst has coded incidents for the same category a number of times, he learns to see quickly whether or not the next applicable incident points to a new aspect" (p. 111). In other words, the researcher's familiarity with the categories into which entities are coded allows him or her to understand when no new categories are emerging from the data. As it turned out, saturation was achieved quickly in the present study, even before all of the think-aloud studies were performed. This in itself was useful, as it allowed the researcher to conduct several more think-aloud studies than would have been necessary simply to achieve saturation, and thus to be certain that saturation had, in fact, been achieved.

3.2.8. Member Check

After each think-aloud study had been performed, the reasons for triage decisions from the new written protocol coded, and the categories of reasons for triage decisions modified to accommodate the new data, a member check was performed with the triager most recently studied. Lincoln (1986) describes the member check as "the process by which facts and individual constructions are checked, during the data collection, and then again, upon the production of a draft [research report], with members who provided data in the first instance" (p. 13).

Each triager was provided by email with the list of categories derived from all of the think-aloud protocols coded up to that point, the reasons that were coded from the think-aloud study conducted with them, and the categories into which the researcher had classified each reason. The triagers were instructed to critique the categories, and to point out if they disagreed with the researcher's assessment that a specific reason for making a triage decision was an example of a specific category. If the triager did disagree with any of the researcher's assessments, they were instructed to write a short explanation of why, and to suggest an alternative category that the reason should belong to – either one that the researcher had already created, or a new category. In this way, reasons for making triage decisions were classified first by the researcher, but that classification was validated by the triagers themselves. Additionally, the categories of reasons for making triage decisions were developed first by the researcher, but those categories were validated, and new ones developed, by the triagers themselves.


3.3. Phase 2: Identification of Question Taxonomies

Four taxonomies of questions were identified in the literature from the fields of desk and digital reference, question answering, and linguistics. These taxonomies are as follows:

1. Wh- words
2. Subjects of questions
3. Functions of expected answers to questions
4. Forms of expected answers to questions

These four taxonomies can be thought of as corresponding to the top four levels of linguistic analysis: Syntactic, Semantic, Discourse, and Pragmatic, respectively. The identification of these taxonomies was discussed in detail in section 2.8, and will not be discussed further here. Scope notes were created for the four taxonomies identified in the literature. See Appendix C for the versions of these taxonomies and scope notes utilized in this phase of the study.

This study classified questions according to three of the four taxonomies identified in the literature: the taxonomies of wh- words, functions of expected answers to questions, and forms of expected answers to questions. This study did not classify questions according to the subjects of questions. This choice was made because several well-developed classification schemes exist that classify entities by subject, and different digital reference services use different schemes. While it is undeniable that classifying materials by subject is useful, it is a matter of preference which particular subject classification scheme is used. Given the availability of several subject classification schemes, however, the choice of the particular subject classification scheme was left to the particular reference service and its unique requirements and classification practices.

In order to classify questions according to their subjects in this study, either: 1) crosswalks would have had to have been developed between all of the different schemes used by different services, or 2) one scheme would have to have been selected and imposed on all services that did not "natively" use that particular scheme. In the former case, some pre-existing crosswalks could be utilized: both the AskERIC and VRD services, for example, organize their resources according to the Gateway to Educational Materials (GEM) element set, and a crosswalk already exists between the GEM and the Dublin Core element set (www.geminfo.org/Workbench/CrosswalkTemplate.doc). On the other hand, many services have created their own subject classification schemes – and crosswalks between all of these idiosyncratic schemes would have to have been created. Developing crosswalks between multiple subject classification schemes would certainly be a worthwhile task, useful to the digital reference community. It is a task, however, that is not suited to a research study such as this, but would rather be best suited to a subcommittee of the GEM consortium, Dublin Core Metadata Initiative, Library of Congress, or other organization dedicated to the creation and maintenance of a subject classification scheme. In the latter case, unilaterally selecting and imposing one subject classification scheme on all services studied would smooth over one of the most important differences between different digital reference services (as will be discussed in section 4.2.2): the subject scope of the service.

For these reasons, this study did not classify questions according to the subjects of questions. That said, however, the researcher recognizes that in order to fully utilize the power of faceted classification of questions for the development of algorithms to automate processes in digital reference, questions must be classified by subject in addition to the other three taxonomies investigated in this study.

3.3.1. Selection of a Digital Reference Service

The questions that make up the data set for this phase of the study were sampled from the archives of the Virtual Reference Desk Project (VRD)'s AskA service (www.vrd.org). The VRD service was selected as the source of this data because it is a general reference service, and therefore answers questions across a broad range of subjects, for a broad range of types of users. In short, the VRD service's questions are likely to be as diverse as it is possible to get from a reference service of any type. No studies have been conducted to determine whether the types of questions received by digital reference services are correlated with the type of the service (however digital reference services may be classified). That would be a worthwhile subject of study, and could extend the results from this study. In the absence of such studies, however, it seems plausible that a general digital reference service would receive a broader range of types of questions than a service that specializes in a specific subject area or serves a narrower range of users. It was for this reason that the questions used for this phase of the study were sampled from a general reference service.

The VRD AskA service is a digital reference consortium, a network of fifteen AskA services that support one another by accepting each other's out-of-scope and overflow questions. When a participating service receives a question that is outside of its stated scope area, or receives more questions in a day than it can answer, it can forward those questions to the VRD Network for assistance. VRD then forwards what questions it can to another service. If a question cannot be addressed by another participating service, it will be answered by a librarian volunteering for the VRD service. The forwarding of questions between services is performed by email: participating services forward questions to a VRD email address, and a triager for VRD forwards the question to the appropriate service or librarian. In March 2002 the VRD began a web-based service in which users could submit questions directly to the VRD, to be answered by VRD's librarian volunteers.

The questions sampled for this phase of the study were sampled from the VRD's archive of email questions forwarded between services. This set of questions was selected as the pool to sample from for the simple reason that it contained a far greater number of questions than the archives from the web-based service: the email-based consortium has been in operation since September 1998, and at the time that questions were sampled, contained approximately 6,500 questions. The web-based service, on the other hand, was just getting off the ground, and so had a very small number of questions archived.


3.3.2. Sampling The data set of questions was a naturalistic sample taken from the VRD archives. The purpose of naturalistic sampling is to ensure that the sample is representative of the population of interest; in this case, that the questions in the data set were representative of questions in the entire VRD archives. Naturalistic sampling is achieved either by random sampling, or by consecutive sampling of the population of interest (Cook and Campbell, 1979); in this case, the latter approach was utilized. The data set is the 301 email messages from patrons to which responses were sent by VRD answerers in January 2001. One of the functions of the triage process is the filtering out of non-questions (e.g., viruses, advertisements, server error messages, spam, “thank you” messages from patrons, and a variety of other types). Such non-questions are not forwarded to participating services and do not receive a response from VRD librarians, and are therefore filtered out during the triage process, and not stored in the archives. It is therefore impossible to determine the actual number or percentage of non-questions that were filtered out in the triage process in January 2001. In order to estimate the percentage of non-questions that are filtered out in VRD’s triage process, two naturalistic samples were collected of all email messages received by VRD over two-week time spans: the first between 19 March – 1 April 2002, and the second between 3 – 16 June 2002. Analysis of these samples determined that approximately 15% of the total emails received by VRD are non-questions. Of the 301 email messages received by VRD in January 2001, 15% were removed from the data set. Of the messages that were removed:

• 34% were removed because the questions in them were duplicates: either
  o More than one answerer had responded to the same question from the same patron, or
  o The same question-answer pair had been stored in the archive more than once.
• 66% were removed because the patron’s original question did not appear in the archived response, presumably because the answerer had removed the question when composing the response.

After removing 15% of the original 301 email messages received by VRD in January 2001, 257 remained. Many of the remaining 257 email messages contained more than one question. The unit of analysis for this phase of the study, as for Phase 3, was the question, rather than the email message in which the question was “packaged.” Thus, the data set was composed of 396 questions, extracted from 257 email messages in which the patron’s original question or questions were maintained intact, for an average of 1.54 questions per email. This average is consistent with Hert’s (2000) figure of 1.45 questions per email message received by the Bureau of Labor Statistics’ web site in 1997.
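Worked through explicitly, and reading the “approximately 15%” above as 44 of the 301 messages (an interpretation, since the exact count removed is not reported), these figures are:

\[
301 - 44 = 257, \qquad \frac{44}{301} \approx 14.6\% \approx 15\%, \qquad \frac{396}{257} \approx 1.54 \text{ questions per email.}
\]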

3.3.3. Classification of Questions All 396 questions in the data set were classified by the researcher according to the taxonomies and using the scope notes discussed above. In order to ensure that the three processes of classification did not influence or bias each other, all questions were classified according to one taxonomy, then all questions were classified according to a second taxonomy, and then a third. Classes in each taxonomy were treated as being mutually exclusive – that is, every question was classified into one and only one class per taxonomy.
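To make this procedure concrete, the following sketch is offered (illustrative only; no software of this kind was used in the study, and the abbreviated class lists and the classify stub are placeholders). It records exactly one class per taxonomy for each question, in a separate pass per taxonomy, so that the three classifications remain independent:

```python
# Illustrative sketch of the classification bookkeeping; not the study's actual procedure.
# Taxonomy names follow the study; the class lists shown are abbreviated placeholders.

taxonomies = {
    "wh_words":  ["who", "what", "where", "when", "why", "how"],   # abbreviated
    "functions": ["verification", "description", "procedure"],      # abbreviated
    "forms":     ["factual", "source", "instruction"],              # abbreviated
}

def classify(question, classes):
    """Stand-in for the researcher's manual judgment; trivially returns the
    first class so that the sketch runs end to end."""
    return classes[0]

questions = ["How do I cite a web page?", "What is the boiling point of water?"]

# One complete pass per taxonomy, so that classification according to one
# taxonomy cannot influence classification according to another.
classifications = {q: {} for q in questions}
for taxonomy, classes in taxonomies.items():
    for q in questions:
        label = classify(q, classes)
        assert label in classes   # classes within a taxonomy are mutually exclusive
        classifications[q][taxonomy] = label
```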

3.3.4. Intercoder Reliability Testing In order to test the reliability with which questions could be classified according to these taxonomies, as well as to test the usability and clarity of the taxonomies of questions and scope notes, an intercoder reliability test was performed.


This testing was performed by four volunteers – M.L.S. and Ph.D. students at least one year into their programs of study in Information Science. These volunteers were therefore familiar with the principles of classification, but were unfamiliar with and untrained in the use of these specific taxonomies. Each volunteer was provided with a subset of twenty questions from the data set, scope notes for the three question taxonomies, and a set of instructions. See Appendix C for the full text of these instructions. Of the twenty questions classified by each volunteer, ten were also classified by another volunteer. Each question in the subset was therefore classified three times: by the researcher, and by two volunteers. The volunteers were instructed to classify the questions provided to them according to the three question taxonomies, using the scope notes as a guide. These instructions familiarized the volunteers with the three taxonomies and the set of questions to be classified, so that the processes of classification performed by volunteers were comparable. This comparability enabled intercoder reliability statistics to be computed both among the coders and between coders and the researcher. The statistic used to calculate the intercoder reliability was Cohen’s κ (Cohen, 1960). The formula for κ for comparison between two coders is $\kappa = \frac{P_{AO} - P_{AE}}{1 - P_{AE}}$. Percent agreement, $P_{AO} = \frac{A}{n}$, is the “crude agreement” (Neuendorf, 2002, p. 149) between coders, where $A$ = the number of agreements between coders in their classification, and $n$ = the total number of entities coded by both coders. Expected agreement, $P_{AE} = \frac{1}{n^2}\left(\sum pm_i\right)$, is the agreement that would be expected by chance, where $pm_i$ = each product of the marginals of an n x n table of intercoder agreement. The formula for κ for comparison between more than two coders is $\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$, where $\bar{P}$ = the mean of all observed pairs of agreements, and $\bar{P}_e$ = the mean of all pairs of agreements that would be expected by chance. Values for κ may range between 1 and –1, where 1 = perfect agreement beyond chance, 0 = no agreement beyond chance, and negative values = agreement worse than chance.
Carletta (1996) states that κ > 0.8 is considered a good reliability measure, and that 0.67 < κ < 0.8 allows “tentative conclusions to be drawn” (p. 252). Uebersax (1987), however, claims that any attempt to quantify levels of agreement is a misuse of κ, and that κ should instead be treated as a binary statistic: agreement either is or is not greater than would be expected by chance. Cohen’s κ was selected as the measure of intercoder reliability for this study for three reasons. First, it is a common measure of intercoder reliability in content analysis (Krippendorff, 1980; Neuendorf, 2002), in classification, and increasingly in linguistics (Carletta, 1996), and should therefore be interpretable by any reader of this study. Second, unlike some other measures of intercoder reliability, κ explicitly takes into account the deviation of coders’ agreement from the agreement that would be expected by chance. Third, unlike some other measures, κ may be used to determine agreement between more than two coders (Fleiss, 1971). When coders (volunteers and the researcher) did not agree on the class to which a specific question should be assigned, the researcher interviewed the volunteers about their process of classifying the questions on which there was disagreement. The purpose of these interviews was to determine the cause of the disagreement, and to use that information to clarify the scope notes for the classes disagreed upon.
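As an illustration of these formulas only (such code was not part of the study’s procedure), the following Python sketch computes κ for two coders, and a Fleiss-style κ (Fleiss, 1971) for more than two coders; the labels in the example are hypothetical:

```python
from collections import Counter
from itertools import combinations

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who classified the same n items."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n          # P_AO = A / n
    marg_a, marg_b = Counter(coder_a), Counter(coder_b)
    categories = set(coder_a) | set(coder_b)
    expected = sum(marg_a[c] * marg_b[c] for c in categories) / n ** 2    # P_AE
    return (observed - expected) / (1 - expected)

def multi_kappa(codings):
    """Fleiss-style kappa; codings[i] is the list of labels given to item i."""
    per_item = []
    all_labels = []
    for labels in codings:
        agree = sum(a == b for a, b in combinations(labels, 2))
        total = len(labels) * (len(labels) - 1) / 2
        per_item.append(agree / total)        # observed pairwise agreement for this item
        all_labels.extend(labels)
    p_bar = sum(per_item) / len(per_item)     # mean observed agreement
    freq = Counter(all_labels)
    n_total = len(all_labels)
    p_e = sum((f / n_total) ** 2 for f in freq.values())   # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical examples: labels drawn loosely from the wh- taxonomy
print(cohens_kappa(["what", "how", "what", "why", "how"],
                   ["what", "how", "why",  "why", "how"]))
print(multi_kappa([["what", "what", "how"], ["why", "why", "why"]]))
```

Under Carletta’s (1996) reading, the two-coder value printed by this example (approximately 0.71) would allow only tentative conclusions to be drawn.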

3.3.5. Evaluation of Question Taxonomies Evaluation of the three question taxonomies was performed based on the researcher’s experience in classifying all 396 questions in the data set and, where there were disagreements in classification, on the results of the interviews with the volunteers, discussed above, about their process of classifying those questions. This evaluation was performed according to the set of thirteen criteria for the evaluation of classification
schemes derived from Kwaśnik and Liu (2000), discussed in section 2.7.6. These criteria are as follows:

Table 3-4: Thirteen Criteria for the Evaluation of Classification Schemes

The criteria are concerned with: the domain of the entire classification scheme; the domain of classes; relationships between classes and entities in classes; the terminology used in the classification scheme; and the utility of the classification scheme.

Criteria: Scope, Exhaustivity, Expressiveness, Granularity, Hospitality, Structure, Partitioning, Prototype characteristics, Vocabulary, Coherence, Consistency, Usability, Browsability

3.3.6. Modification of Question Taxonomies Modifications to the three question taxonomies utilized in this phase of the study were proposed at the conclusion of this phase. These modifications were based on the set of thirteen criteria for the evaluation of classification schemes presented above, the evaluation performed on the three question taxonomies according to these criteria, and a classification task the purpose of which was to “talk to consensus” the taxonomies. The researcher and the two doctoral students coded thirty questions, randomly sampled from the pool of 185 questions collected during the think-aloud phase of this study. The
researcher and these students then met to discuss any disagreements in their classifications. This “talking to consensus” is a common method for ensuring that all coders agree on the interpretation of a data analysis scheme, and is used as a control mechanism in a number of methodologies: Fowler Jr. (2002), for example, recommends it as a method for training coders and identifying ambiguous coding rules. In discussing their disagreements, the researcher and students were able to uncover ambiguities in some classes and scope notes in the taxonomies, which the researcher was then able to modify the taxonomies to clarify. The taxonomies used for Phase 3 of this study were these modified versions. See Appendix D for the full version of the modified taxonomies and scope notes.

3.4. Phase 3: Classification Task As mentioned above, the services in which the triagers studied in the think-aloud studies were employed must have been willing to share the triaged questions with the researcher. This was necessary so that a subset of the same questions triaged during the think-aloud studies could be classified according to the three taxonomies of questions identified in Phase 2 of the study. This phase of the study sought to answer Research Question 2: How does question type correlate with the action taken on a question in the triage process? In order to answer this question, a subset of the triagers studied in the think-aloud studies was instructed to classify a set of questions. Intercoder reliability statistics were computed between the coders’ classifications. Correlation statistics were computed between the types of questions, as classified by the coders, and the action taken on questions, as recorded in the think-aloud studies.


3.4.1. Inter-indexer Reliability Studies This classification task resembles a number of inter-indexer reliability studies, in that a number of individuals classified the same document according to a specific classification scheme, and from this classification the amount of similarity in the terms chosen by different coders can be calculated. The individuals in this phase of the study are digital reference triagers, not indexers, and the document is a question, not a book, an article, or an abstract. The fundamental task, however, is similar. This phase of the study particularly resembles Borko’s (1964) study of “the reliability with which skilled subject-matter specialists can classify a collection of documents into predetermined categories” (p. 269). Borko’s study serves as a point of comparison for this phase of the study for two reasons: First, the classification scheme used by Borko’s subjects contained eleven classes – roughly the same number as the three taxonomies used in this study, at least in comparison with other classification schemes used in other inter-indexer reliability studies (for example, the entire set of terms in the Medical Subject Headings (MeSH) vocabulary (Funk and Reid, 1983), or the Universal Decimal Classification scheme (Tell, 1969)). Second, by virtue of containing a finite set of classes, Borko’s classification scheme employs a controlled vocabulary – as do the three taxonomies used in this study (as opposed to other studies that allowed indexers free choice in selecting descriptor terms – that is, used uncontrolled vocabularies (e.g., Tinker, 1966; Mullson et al., 1969)). Past inter-indexer reliability studies have spanned a wide range of both number of documents indexed (from thirteen (Clarke and Bennett, 1973) to 760 (Funk and Reid, 1983)) and number of respondents (from three (Borko, 1964) to 340 (Lilley, 1954)). Borko’s (1964) study additionally serves as a point of comparison for this phase of the study in guiding sampling and the solicitation of respondents. Borko employed three indexers to index 338 abstracts. Because this study requires that each question be classified according to three taxonomies, rather than only one (as in most inter-indexer
studies), three times as many coders were employed. These points are discussed below.

3.4.2. Solicitation of Coders Coders for this phase of the study were solicited from the pool of twenty-eight digital reference triagers who were respondents in the think-aloud phase of this study. Nine individuals were randomly selected from this pool. Upon completion of the think-aloud study, an email was sent to these nine individuals, asking if they would be willing to participate in a classification task for this study (see Appendix A for the text of this solicitation email). Not all of the nine individuals first selected were willing to participate; when one individual declined to participate, another was randomly selected.

3.4.3. Sampling Thirty questions were sampled from the pool of 185 questions collected during the think-aloud phase of this study, utilizing stratified random sampling. What makes the sampling for this phase of the study stratified is that all of the questions triaged by the coders during the think-aloud studies were removed from the pool of 185 questions before sampling. Before any questions were sampled, all nine coders had been solicited, and had agreed to participate in this phase of the study. It was necessary to know which triagers would be participating before beginning this phase of the study, so that the questions triaged by those specific triagers could be removed from the pool. Questions triaged by the nine triagers participating in this phase of the study were removed from the pool of 185 questions so that there could be no biasing effect on the classification for this phase of the study due to any of the coders having previously seen any of the questions. And since all nine coders were given the same set of thirty questions to classify, it was necessary to remove all of the questions triaged by all nine of these coders. These questions were removed from the pool because, as mentioned above, triage is essentially a classification task. It may be that the triagers themselves are not aware of
the extent to which question type affects their triage decisions, and the triagers may not even (prior to this phase of the study) have been familiar with the taxonomies of questions utilized in this study. Nevertheless, as Research Question 2 for this study seeks to discover how question type correlates with the action taken on a question in the triage process, it was important to remove any possible source of bias in the classification of questions that might arise from the coders being the triagers themselves. The full set of 185 questions collected during the think-aloud phase of this study was given “accession numbers” indicating the order in which they were collected – the first question from the first think-aloud study was question #1, etc. Removing the questions triaged by these nine triagers removed 34 questions from the pool, leaving 151 from which to sample. Using the random sequence generator on the random.org website (which generates sequences of numbers without duplication), 30 numbers out of 185 were randomly generated; the questions corresponding to these numbers were selected for this phase of the study. As discussed in section 2.5.2, approximately 15% of all messages received by digital reference services do not contain valid questions (e.g., viruses, advertisements, server error messages, spam, “thank you” messages from patrons, and a variety of other types). Additionally, approximately 15% of the questions remaining are out of scope for the service that received them; in this case, the service must either refuse to reply to the question or forward it to another service. Non-questions did not appear in the set of questions sampled for this phase of the study, because non-questions did not appear in the set of questions collected from the think-aloud phase of the study. The reason for this is simple: non-questions are not replied to by digital reference services, and so are not triaged. Whether or not a question was out of scope for the service that received it was ignored in the classification task. What was important for the classification task was that a question was triaged in the think-aloud phase of the study; to what individual or service a question was triaged, and the reasons for that triage decision, were irrelevant to the classification task. These factors re-entered the study when the correlation between question type and the action taken on a question in the triage process was computed.
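The sampling step may be summarized in the following sketch; it is illustrative only, with Python’s random module standing in for the random.org sequence generator and the accession numbers treated simply as integers:

```python
import random

# Accession numbers 1-185, assigned in the order the questions were collected.
all_questions = list(range(1, 186))

# Stand-in for the 34 questions triaged by the nine coders during the
# think-aloud studies; in the study these were known, not randomly chosen.
triaged_by_coders = set(random.sample(all_questions, 34))

pool = [q for q in all_questions if q not in triaged_by_coders]   # 151 questions remain
sample = random.sample(pool, 30)   # thirty questions, drawn without duplication
```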


3.4.4. Classification Task Nine coders were provided with a set of thirty questions, the three taxonomies of questions and scope notes for the taxonomies, and a set of instructions. (See Appendix D for the full text of these instructions.) The coders were instructed to classify the questions provided to them according to the three question taxonomies, using the scope notes for the three taxonomies as a guide. Each coder received the same set of thirty questions. These thirty questions were randomly subdivided into three sets of ten. The coders were instructed to classify each of the three sets of ten questions according to one of the three taxonomies – in other words, the coders classified one set of ten questions according to each of the three taxonomies. Thus, each coder classified each of the thirty questions according to only one taxonomy. The taxonomy according to which each set of ten questions was classified was rotated between three groups of three coders. Within each set of ten questions, the order of questions was randomly shuffled. The order in which the sets of ten questions were presented to the coders was also randomly shuffled. This organization of groups of coders, sets of questions, and taxonomies is laid out in Table 3-5.


Table 3-5: Classification of Sets of Questions among Groups of Coders

                               Group 1           Group 2           Group 3
                               (Coders 1–3)      (Coders 4–6)      (Coders 7–9)
Set 1 (Questions 1–10)         Wh-               Forms             Functions
Set 2 (Questions 11–20)        Functions         Wh-               Forms
Set 3 (Questions 21–30)        Forms             Functions         Wh-

This elaborate randomization of groups of coders, sets of questions, and taxonomies was performed in order to ensure 1) that each question would be coded according to each taxonomy by three coders, and 2) that order effects did not influence the coding of the questions. All thirty questions were coded by three coders per taxonomy so that intercoder reliability statistics could be calculated between more than two coders; this was important so that any coincidental agreement in coding between two coders could be offset by the third coder.
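The rotation in Table 3-5, together with the shuffling of question order and set order, can be expressed as the following sketch (illustrative only; the group-to-taxonomy rotation mirrors the table, but no such code was used in the study, and the per-coder shuffling shown here is one possible reading of the procedure):

```python
import random

taxonomies = ["Wh-", "Functions", "Forms"]

# Illustrative question identifiers; in the study these were the thirty sampled questions.
questions = list(range(1, 31))
random.shuffle(questions)
sets_of_ten = [questions[i:i + 10] for i in range(0, 30, 10)]

# Latin-square rotation matching Table 3-5: coders 1-3 form group 0, 4-6 group 1,
# 7-9 group 2; group g classifies set s according to taxonomy (s - g) mod 3.
assignments = {}   # coder -> list of (taxonomy, questions-in-presentation-order)
for coder in range(1, 10):
    group = (coder - 1) // 3
    tasks = []
    for s, qset in enumerate(sets_of_ten):
        taxonomy = taxonomies[(s - group) % 3]
        qs = qset[:]
        random.shuffle(qs)        # shuffle question order within each set of ten
        tasks.append((taxonomy, qs))
    random.shuffle(tasks)         # shuffle the order in which the sets are presented
    assignments[coder] = tasks

print(assignments[1])             # one coder's assignment, for inspection
```

Each set of ten is thus classified according to all three taxonomies, each by one group of three coders.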

3.4.5. Computation of Intercoder Reliability Statistics The primary purpose of the classification task was to classify a subset of the questions from the think-aloud phase of the study. In addition, however, the classification task allowed the determination of the reliability with which questions could be classified according to these taxonomies – an indication of the usability and clarity of the taxonomies of questions and scope notes – through the computation of intercoder reliability statistics. As in Phase 2 of the study, the statistic used to calculate the intercoder reliability was Cohen’s κ (Cohen, 1960). When coders did not agree on the class to which a specific question should be assigned, the researcher interviewed the coders about their process of classifying that question, asking what their reasons were for classifying that question as they did. The purpose of these interviews was to determine the cause of the disagreement, and so use that information to clarify the scope notes for the classes disagreed upon. The researcher did not attempt to get the coders to change their classification of questions; all that was sought was an explanation of the coders’ reasons for classifying questions.

3.4.6. Computation of Strength of Correlation The correlation statistic Cramér’s V (Cramér, 1966) was computed between the types of questions, as classified by the coders, and the attributes of questions that affected
triagers’ triage decisions. This computation answered Research Question 2, determining how question type correlates with the action taken on a question in the triage process.

The formula for Cramér’s V is $V = \sqrt{\frac{\chi^2}{Nm}}$, where $N$ is the sample size, and $m$ is the smaller of (number of rows – 1) or (number of columns – 1). Cramér’s V is based on the Chi-square test for independence, $\chi^2$, a measure of association for nominal data – which is what the data from this study is. The formula for $\chi^2$ is $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$, where $f_o$ = the observed frequency and $f_e$ = the expected frequency of the entities in an n x n table. The Chi-square test for independence was not itself used to determine the correlation between the type of a question and the action taken on it in the triage process, because it is ineffective when the number in any cell of an n x n table is small, which is the case with the table of question types vs. triage actions. Cramér’s V compensates for this shortcoming of $\chi^2$ by norming from 0 to 1, regardless of table size. This norming requires that the row and column marginals be equal, which is the case with the data from this study. Values for V therefore range between 0 and 1, where 0 = no correlation and 1 = perfect correlation. Because there was not perfect intercoder reliability in the classification of all questions, the correlation between question types and triage action could not be computed with perfect validity for all questions. On the classification of some questions, all three coders agreed, and the correlation was computed between that question type and the triage action taken upon it. On those questions for which two out of three coders agreed on the classification, the correlation was computed between the class that was agreed upon and the triage action taken upon the question, but that correlation must be interpreted as being less than perfect, given that the intercoder reliability of the classification of that question was less than perfect. Fortunately, there were no questions on which all three coders disagreed; this avoided the problem of deciding which class to use in calculating the correlation

with triage action. This demonstrates that either the questions or the taxonomies and scope notes, or both, were clear and unambiguous. Additionally, with only thirty questions classified in this phase of the study, not all possible question types were classified. With ten classes in the taxonomy of wh- words, twenty-one in the taxonomy of functions of expected answers, and twelve in the taxonomy of forms of expected answers, there are 2,520 question types in the “taxonomy space” defined by those three taxonomies. Since there were only 185 questions collected during the think-aloud studies – which is less than 10% of the total number of question types – it would not have been possible to compute the correlation between every question type and the action taken on it during the triage process.
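For illustration only (again, not part of the study’s own procedures), the following Python sketch computes χ² and Cramér’s V from a list of (question type, triage action) pairs; the data in the example are hypothetical:

```python
import math
from collections import Counter

def cramers_v(pairs):
    """Cramer's V for two nominal variables given as (x, y) observation pairs."""
    xs = sorted({x for x, _ in pairs})
    ys = sorted({y for _, y in pairs})
    n = len(pairs)
    counts = Counter(pairs)
    row = Counter(x for x, _ in pairs)    # row marginals
    col = Counter(y for _, y in pairs)    # column marginals

    chi2 = 0.0
    for x in xs:
        for y in ys:
            expected = row[x] * col[y] / n           # f_e
            observed = counts[(x, y)]                # f_o
            chi2 += (observed - expected) ** 2 / expected
    m = min(len(xs) - 1, len(ys) - 1)
    return math.sqrt(chi2 / (n * m))

# Hypothetical example: question type (wh- word) vs. triage action
data = [("what", "answer"), ("what", "forward"), ("how", "forward"),
        ("how", "forward"), ("why", "answer"), ("why", "answer")]
print(cramers_v(data))
```

In this hypothetical example V works out to approximately 0.82, which would indicate a strong correlation between question type and triage action.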

3.5. Chapter Summary This chapter has described the methodology employed in conducting this study. The first goal of this study was to learn from direct observation of the actions performed by triagers how question type affects those actions. The second goal of this study was to use what was discovered about the triage process to draw up a set of rules for the performance of triage based on the question type being triaged, which rules could be utilized as the basis for designing and building a system to automate part or all of the triage process. The research questions for this study follow from these goals: RQ1. What attributes of questions affect the triage process? RQ1a. What attributes of questions are taken into account by digital reference triagers when performing triage on received questions? RQ1b. How do these attributes affect triagers’ decisions in triaging questions? RQ2. How does question type correlate with the action taken on a question in the triage process? The methodology employed by this study was, first, a series of think-aloud studies with triagers from a cross-section of digital reference services to elicit the actions they perform
on questions during the triage process, and their reasons for those actions. These actions were analyzed and classified utilizing the constant comparative method from grounded theory. A subset of the questions triaged during the think-aloud studies was then classified by a subset of the triagers themselves, according to three taxonomies of questions at different levels of linguistic analysis, which were identified through a review of literature that deals with questions, from the fields of desk and digital reference, question answering, and linguistics. Intercoder reliability statistics were computed on these coders’ classifications, to determine the reliability with which questions can be classified according to these taxonomies. Finally, the strength of correlation between question type and the question attribute that affected the triager’s triage decision was computed.
