A Guide to Scientific Evaluation in Information Visualization


Camilla Forsell
Linköping University, Norrköping
[email protected]

practitioners to tackle some of the fundamental and practical issues concerning empirical evaluation of information visualizations" [7, p. 631]. This special issue was a starting point, and in the last decade there has been a continuous presence of research and position papers and workshops focusing on the importance of evaluation as a practice and on the need for new methods and metrics tailored to specific InfoVis needs [4, 5, 6, 7, 8, 12, 13, 22, 24]. Also, the International Symposium on Information Visualization Evaluation, IVE [17], held at the International Conference on Information Visualisation [16], is another example of a forum specifically devoted to InfoVis evaluation. There is also a need to appreciate the difficulty of designing and conducting sound scientific evaluations. Although there is an increasing awareness of its importance and value, many InfoVis techniques are still never subject to evaluation. There are several reasons for this: the effort required to undertake such a study exceeds the resources available, there exists no body of knowledge in terms of toolboxes with established methods to guide developers, and developers often do not have the skills needed to perform sound scientific evaluation. This last statement might seem rude, but it is merely the reality and it is my belief that most would agree: traditionally, development focuses on technical issues and possibilities and not on users' prerequisites and needs. This is also reflected in the call for papers of the annual IEEE VisWeek, which is regarded as the major international conference on visualization: "We do suggest that potential authors who have not had formal training in the design of experiments involving human subjects may wish to partner with a colleague from an area such as psychology or human-computer interaction who has experience with designing rigorous experimental protocols and statistical analysis of the resulting data" [26]. To my knowledge many developers indeed want to evaluate their InfoVis techniques, but scientific evaluation is difficult to master and carry out well. Cross-disciplinary collaboration or consulting an expert on experimental design is an excellent way to overcome this problem but is not always possible to achieve. The result is that it is not uncommon to see publications presenting evaluations where the outcome

Abstract

This paper addresses some fundamental and practical issues that should be considered when pursuing evaluation studies in Information Visualization. The main focus is on quantitative experimental research, but the general information applies to all kinds of studies. The purpose is to increase awareness of what constitutes a sound scientific approach to evaluation and to point out common pitfalls and mistakes during the phases of such a study. These phases cover how to plan, design, conduct and analyse the outcome of an evaluation and, finally, how to report it in a way that enhances readability, provides details relevant to the outcome and allows replication. The paper can be used as a guide when conducting an evaluation, and it can also be helpful when reviewing publications since the same rules apply.

Keywords--Information Visualization, evaluation, experimental research.

1. Introduction

Scientific evaluation is a key research challenge within the Information Visualization (InfoVis) community [8]. To develop successful InfoVis techniques we need to assess their merits and disadvantages, and the value of evaluation cannot be overstated. Well-tested results are meaningful: positive results provide a basis on which the next generation of development can be built with greater confidence, while negative results provide useful knowledge that can help to refine future work and ensure that it proceeds in the right direction. Lack of evaluation, on the other hand, allows less useful ideas to be promoted and promising, potentially useful ideas not to be adopted by industry or the public, since evidence of usability and measurable merits is not presented [13]. Today, evaluation is recognized to be an important part of research when developing InfoVis techniques. In 2000 the International Journal of Human-Computer Studies published a special issue on empirical evaluation of information visualizations with the aim to "provide a timely and uniform forum for researchers and


is, in the worst case, basically meaningless since flaws in the method applied and/or the analysis of the data make it more or less impossible to draw useful scientific insights from the results. This is very unfortunate since the authors have invested a lot of work in a study whose outcome is not as informative and valuable as it could have been. Also, many of these flaws are easily avoided once you are aware of them and of how to keep them from confounding your study. The aim of this paper is to increase awareness of what constitutes a sound scientific approach to evaluation and to outline how to proceed to achieve high-quality results. The paper covers the most basic and relevant issues to consider, points out common pitfalls and mistakes during the different phases of an evaluation study and provides examples of how to avoid them. The main focus is on quantitative experimental research, but the general knowledge applies to all kinds of studies. The main value of this paper should be as a guide when planning to conduct an evaluation study. However, the same rules that guide a sound scientific approach to evaluation in practice also apply when reviewing. Hence, the reader may also find the content of this paper helpful when judging the merits of a reported evaluation in any publication.

levels range from low, as in naturalistic observation, to very high, as in controlled experiments. All levels are scientific when used properly – the level of constraints has to map to the question(s) to be answered [3]. In experimental research we investigate and compare participants' responses under different conditions. We do this by manipulating one (or more) factor or variable (the independent variable) to investigate its effect on one (or more) other factor (the dependent variable) [3]. We could, for example, compare performance using a 2D visualization and a 3D visualization for some task. The independent variable here is visualization method (having two levels, 2D and 3D). The dependent variable is performance, which could be measured as accuracy. This measure is then analysed using statistical tests to investigate whether there was an experimental effect, that is, a difference between the conditions, or not (a minimal code sketch of this example is given after the list below). We want the effect to be attributable to the independent variable and not to any confounding factors that might affect the dependent measure (accuracy in the previous example) [3]. Confounding factors are all threats to the reliability and validity of our study. Reliability refers to the consistency of something from one time to another, for example a measure, a procedure or the behavior of a person moderating an experiment. Validity is about soundness or quality, i.e. whether or not a study can scientifically answer the questions it is intended to answer. This includes [3]:

- Construct validity: Do we investigate and measure what we intend to according to the theory behind our question?
- Internal validity: Is the result based on the design, measures, setting and procedures, or due to confounding factors?
- External (ecological) validity: Can we generalize results to different people, contexts or places?
- Statistical validity: Are our conclusions from statistical testing sound and justified by using the appropriate test?
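As a minimal, purely illustrative sketch of this terminology, the accuracy scores below are hypothetical and do not come from any real study, and the use of Python with SciPy is an assumption rather than anything prescribed in this paper. The independent variable is visualization method (levels 2D and 3D); the dependent variable is accuracy.

    import numpy as np
    from scipy import stats

    # Hypothetical accuracy scores (proportion of correct answers) for two
    # independent groups: one group used the 2D visualization, the other the 3D one.
    acc_2d = np.array([0.71, 0.64, 0.80, 0.68, 0.75, 0.66, 0.72, 0.70, 0.77, 0.69, 0.74, 0.65])
    acc_3d = np.array([0.78, 0.83, 0.74, 0.81, 0.79, 0.85, 0.76, 0.82, 0.80, 0.77, 0.84, 0.75])

    # Independent variable: visualization method (2D vs. 3D).
    # Dependent variable: accuracy. The test asks whether the difference between
    # the two levels is larger than chance alone would plausibly produce.
    t, p = stats.ttest_ind(acc_2d, acc_3d)
    print(f"mean 2D = {acc_2d.mean():.3f}, mean 3D = {acc_3d.mean():.3f}")
    print(f"t = {t:.3f}, p = {p:.4f}")

Everything that follows in this section is about ensuring that a difference found by such a test can actually be attributed to the independent variable and not to confounding factors.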

2. Scientific Evaluation

There are a variety of general aims for pursuing an evaluation study. In [23, p. 20] the authors have classified them into the following common thematic areas:

- Evaluate strengths and weaknesses of different techniques.
- Seek insight into why and for what a particular technique gives good performance.
- Demonstrate that a new technique is useful in a practical sense according to some objective criteria.
- Demonstrate that a new technique is better than an existing one according to some objective criteria.
- Investigate whether theoretical principles from other disciplines apply under certain practical conditions (results from psychophysics or computer vision may or may not extend to InfoVis).

Please note that "a measure cannot be valid unless it is reliable, but a measure can be reliable without being a valid measure of the variable of interest" [3, p. 81]. In experimental research we apply a variety of control procedures to eliminate as many potential confounding factors as possible and to maximize reliability and validity. This lies in the nature of the method and is the way in which we can obtain trustworthy results and have confidence in the conclusions drawn from such studies. Unfortunately, there are many issues that, if not adequately addressed, can compromise the reliability and validity of a study, or make it difficult to draw useful insights from the results. The remainder of this section outlines the most important issues to consider, discusses potential pitfalls and mistakes, and shows how these can be circumvented.

Based on these aims you have a general research question to begin with. Once this is refined into more specific problem statements, or hypotheses, you can decide how to conduct your evaluation, that is, select a method that is appropriate for giving robust answers. A clear question is crucial since it both specifies what to investigate and largely decides how the research should be carried out in terms of level of constraints, method for collecting data and data analysis. Level of constraints refers to the extent to which we limit or control any part of an evaluation study. These

2.1 Experimental design

these kinds of reasons you have to control that there are no important differences between the groups to start with. For instance, if you want to investigate how novices learn to use a visualization, make sure that all participants are novices to begin with and do not differ in their level of previous experience with visualizations. There are many issues to consider when choosing an appropriate design, and it is not always crystal clear which one is right. Within-subjects designs are known to be more sensitive to small differences between conditions and are more likely to detect differences if they exist but, as is evident from the above, for many studies that procedure is not appropriate. To conclude, the result is often that you need a mixed design, which means that your study has both within- and between-group comparisons.
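The following sketch illustrates, with hypothetical completion times invented purely for illustration, how the two basic designs map onto the related and unrelated t-tests named in this section; SciPy is used here simply as one possible analysis tool, not as anything the paper prescribes.

    from scipy import stats

    # Hypothetical completion times in seconds, invented for illustration only.
    # Within-subjects: the SAME ten participants solved tasks with visualization A
    # and with visualization B, so the scores are related (paired).
    times_A = [34.1, 29.8, 41.2, 36.5, 30.9, 38.4, 33.0, 35.7, 39.9, 31.6]
    times_B = [30.2, 27.5, 37.8, 33.1, 29.0, 35.6, 30.4, 32.2, 36.5, 28.9]
    t_rel, p_rel = stats.ttest_rel(times_A, times_B)

    # Between-subjects: two DIFFERENT groups of participants, one group per
    # visualization, so the scores are unrelated.
    group_A = times_A
    group_B = [31.0, 28.3, 36.9, 34.2, 29.5, 37.1, 30.8, 33.4, 35.0, 29.7]
    t_ind, p_ind = stats.ttest_ind(group_A, group_B)

    print(f"related (within-subjects) t-test:    t = {t_rel:.3f}, p = {p_rel:.4f}")
    print(f"unrelated (between-subjects) t-test: t = {t_ind:.3f}, p = {p_ind:.4f}")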

The research question of a study and the design of the study are to a high degree interdependent. You cannot actually settle your research question (hypothesis) if you do not have some clear idea about what study design you will use, since the design will determine what sort of question(s) you can answer with the resulting data [14]. First you need to decide whether you should examine the performance of the same participants or compare different groups of participants. Here it is appropriate to explain some terminology. In the first design the same participants take part in all experimental conditions. Any comparison takes place within the same group of participants, with the scores from each participant being related. Each participant serves as their own control, and any individual difference (experience, high or low motivation, age) is the same across all conditions [1, 3]. In the literature this design can be called a related design, same-subjects design, within-subjects design, or repeated measures design. When the comparison is between groups of participants, their scores are unrelated [ibid]. This is called an unrelated design, a between-subjects design or a between-groups design. Hence, there are several different names meaning the same thing, and this is important to know since they are reflected in the names of the statistical tests associated with the designs. For instance, if you have tested how one group of participants has performed on a variable having two levels (such as a task being simple or complex), then you will use a related t-test [1, 2]. How you decide what design to apply depends on several factors [e.g., 14]:

Availability of participants. With a between-groups design you need a larger number of participants. When it is hard to recruit, this option might simply not be possible.

The aim of the study. If you want to investigate whether there is a difference between groups of people, for example two professions using the same visualization technique at work, then it is fundamental to your research question that you compare performance between groups of participants. But if you are interested in how easy it is to perceive different elements on a display, then it does not matter where the participants come from, and you can test the same participants under all conditions.

Confounding factors. If you suspect that participation in one condition of a study (solving tasks using visualization A) will affect the result in the following one (solving tasks using visualization B), you must use a between-groups design to avoid carry-over effects, e.g., a practice effect. Sometimes this can be avoided by controlling the order of presentation of conditions (see section 2.3), but in other cases that may not help. Participation time is another major consideration. If each participant must spend a long time completing the study, you should consider a between-groups design to avoid fatigue (a negative practice effect) and decreasing motivation. Of course, allowing rest periods can help control this. When you are using a between-groups design for

2.2 Tasks

When it comes to the definition and selection of tasks, sometimes this is predetermined. Your visualization technique is developed with a specific target user population in mind and you have a clear idea about which activities should be included in the evaluation procedure. At other times, when you are developing a visualization concept or only want to investigate a certain phenomenon, you need to invent them. The choice of task has a great impact on the validity of the obtained results, and there are two major issues to consider. The task (and its associated metrics for measurement) has to be appropriate. This implies that it is supported by theory and empirical work related to your research question, i.e. that it assesses what it is supposed to assess. The task also needs to be representative, meaning that it should be characteristic of the intended application domain, the intended users, etc. This is how we allow for generalization of results outside the specific experimental situation [3]. For example, testing participants' performance on some task using a parallel coordinates visualization with a data set of 75 data items may not be considered representative of a practical usage situation where a data set may include 15,000 items. Another issue is whether to use a real or a synthetic data set. The advice is to strive to use real data, or at least a data set that is representative of such data in terms of size and complexity. However, there are studies in which it is more suitable to use a synthetic data set. Only then can you fully specify and control structures in the data, which can be necessary to ensure that participants can execute the tasks without interference from other factors in the data set, and for you to measure performance [15]. For example, in [10] we examined the ability to discriminate between five different patterns in data presented in a parallel coordinates display, and in [20] we investigated threshold levels for perceiving noise in data. For these studies it was crucial to use synthetic data to ensure that only the features we were seeking were present.
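As a hedged illustration of this point (the data set size, number of dimensions and noise level below are invented, and this is not the stimulus generation used in [10] or [20]), a synthetic data set can be generated so that exactly one known structure is present and its strength is fully under the experimenter's control:

    import numpy as np

    rng = np.random.default_rng(42)  # fixed seed so the stimuli can be regenerated exactly

    n_items = 1000      # number of data items (hypothetical)
    n_dims = 8          # number of dimensions, e.g. parallel-coordinates axes
    noise_level = 0.15  # the single factor under the experimenter's control

    # Start from independent uniform noise in every dimension...
    data = rng.uniform(0.0, 1.0, size=(n_items, n_dims))

    # ...then plant exactly one known structure: a linear relationship between
    # dimensions 2 and 3, degraded by a controlled amount of Gaussian noise.
    data[:, 3] = np.clip(data[:, 2] + rng.normal(0.0, noise_level, size=n_items), 0.0, 1.0)

    # Save the stimulus data set so the exact same file can be reused and reported.
    np.savetxt("synthetic_stimulus.csv", data, delimiter=",", fmt="%.4f")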


observed between groups can be attributed to the effect of what is being studied and not to any characteristic of the individuals in the groups [2, 3]. Sometimes you have participants who cannot be considered totally equivalent. Perhaps you have recruited 24 persons and four of them are more experienced than the others. If you have two groups in your design, you then randomly assign two of these more experienced persons to each group. However, please note that random assignment never guarantees that different groups are equivalent, only that any differences are due to chance [3]. Randomization can also be used for the assignment of the order of presentation of conditions or tasks [2, 3]. For instance, suppose you are studying the ease of use of different glyphs and visualization methods and you have a design with two variables (having two levels each): visualization method (2D and 3D) and type of glyph (glyph A and glyph B). This design yields four different conditions: 2D glyph A, 2D glyph B, 3D glyph A and 3D glyph B. Using a within-subjects design, all participants take part in all conditions. Remembering the discussion about carry-over effects in section 2.1, you realize that the order of presentation is a major concern. Let us say that a participant starts by performing tasks with the 2D glyph A visualization and then proceeds to do the equivalent task using the 3D glyph A visualization. This second condition might then be easier. It might be that the 3D version of the glyph is more intuitive, but it might also be that having previously used the 2D counterpart makes it easier to understand the 3D version, or that the participant has become more familiar with the situation. Consequently, the order of presentation has to be considered. There are several techniques one can use to achieve a sound sequencing, and the logic behind them is to control order effects by having them contribute equally to each condition in the study [3]. First, we can randomly assign each participant to a different order of the four conditions. Any differences observed between conditions can then be attributed to a true difference between visualizations and/or glyphs since any potential order effects are assumed to be evened out. Second, we can counterbalance, which means that the presentation order of conditions is systematically varied. With complete counterbalancing an equal number of participants is assigned to each possible order, and you calculate the number of orders as X! (X factorial), where X is the number of conditions [3, p. 226]. This procedure is very useful when you have few conditions; however, in the previous example with four conditions there are 24 orders, and with six conditions there would be 720 orders! The best thing to do here is to use partial counterbalancing. If you have a participant group of 12 persons, you can randomly select 12 out of the 24 (or 720) possible orders and randomly assign participants to these selected orders [3, p. 226]. Another method for partial counterbalancing is the Latin square. To plan the order of presentation with this method you use a matrix where you arrange the

2.3 Participants and assignments

The first critical task when it comes to participants is to select them. Again, we want to be able to generalize our findings, so they should be representative of a larger group of people (the population) and not just of the participants in our study (the sample). Thus we need to find a sample that correctly reflects the properties of the larger group (the population) [3]. This is harder than it sounds. As we all know, many InfoVis tools are intended for expert users, and such persons are difficult to engage for a sufficient period of time and in sufficient numbers. This brings us directly to the next critical question: how many participants are needed? The appropriate sample size depends on the aim of the study, the design and also on how you plan to analyse your results. For statistical testing there is a concern about statistical power, which refers to the sensitivity of a statistical test for detecting a significant experimental effect (difference) assuming it is present. When studies do not have enough power they do not include enough participants to detect an effect with confidence, and you risk missing an actual difference [2]. The traditional way to increase power is to increase the number of participants. The sample size needed for a specific level of power can be computed, and there is good software available for doing this, for example G*Power, which can be downloaded for free from the University of Düsseldorf's website [25]. The result is often that a large number of participants is required, so in reality we often need to accept low statistical power, for example when we need hard-to-recruit participants or when time is short. Hence we have to make an educated decision from study to study. A good rule of thumb is that more participants give a better result, and that you should use no fewer than 12-14 per group. Hence, if you have a design comparing three different groups you need at least 36 participants. Once again, good guidance can be found in the work of trustworthy authors, by reviewing their choices and motivations. One common misconception is that "for all user studies 5 participants is enough". This statement originates from Nielsen's work on heuristic evaluation [21] and is, as he clearly states himself, only valid when the study involves having evaluators (usability experts in the original form of the method) find usability problems in an application. Then, based on a mathematical formula, the gain from adding more evaluators is not in proportion to the findings, since five people will find 85% of the problems [21, p. 33]. In no other case can this number, 5, be recommended. The next critical task is to assign the participants to the different groups and conditions in the study; obviously the first issue is not relevant when using a within-subjects design. Using a between-groups design means that you need to allocate the participants to one group or another. The best way to do this is to use random assignment (when you do not want the groups to differ systematically, that is). By doing so, the group characteristics will be approximately comparable, and therefore any experimental effect (difference)


2.5 Results

conditions in rows and columns so that each condition appears only once in each column and once in each row. In a partial Latin square one order can appear more times than another. There are also complete ones, where all conditions appear in each position in the sequence an equal number of times and all conditions follow each other condition an equal number of times [3, p. 226]. Using this complete version, again, a large number of participants is required. If you do not apply randomization and/or balancing properly to assign participants to groups and conditions, the outcome of your study will be meaningless and no statistical analysis of the data can overcome this flaw [2].
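The following sketch illustrates these sequencing strategies for the four hypothetical conditions from the glyph example above; the participant count and random seed are arbitrary, and the Latin square shown is the simple cyclic form, not the complete (balanced) version discussed above.

    import itertools
    import numpy as np

    # Hypothetical conditions taken from the glyph example above.
    conditions = ["2D glyph A", "2D glyph B", "3D glyph A", "3D glyph B"]
    rng = np.random.default_rng(7)  # arbitrary seed, for a reproducible assignment

    # Complete counterbalancing: every possible order is used equally often,
    # so the number of participants must be a multiple of X! (here 4! = 24).
    all_orders = list(itertools.permutations(conditions))
    print(f"complete counterbalancing: {len(all_orders)} possible orders")

    # Partial counterbalancing: randomly select as many orders as there are
    # participants and randomly assign one participant to each selected order.
    n_participants = 12
    chosen = rng.choice(len(all_orders), size=n_participants, replace=False)
    assignment = {f"P{p + 1:02d}": all_orders[i] for p, i in enumerate(chosen)}
    print("P01 receives order:", assignment["P01"])

    # Simple (cyclic) Latin square: each condition appears exactly once in each
    # row (one presentation order per row) and exactly once in each column (position).
    n = len(conditions)
    latin_square = [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]
    for row in latin_square:
        print(row)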

The choice of which statistical test is suitable for analysing the significance of your data follows from the experimental design you have chosen to investigate your experimental hypothesis. Once you have made your decisions about the design of your study (number of groups of participants, number of variables and number of experimental conditions) you have, in most cases, automatically selected which test to apply [19]. However, the test must also be adequate for the type of data you have obtained, and here you do not have all the answers until you have the actual data. Therefore, the first step in analysing your data should always be to explore it: screen the data, look at descriptive statistics, plot graphs to support exploration and, finally, check some basic assumptions [2]. These activities will aid in the final decision on which test(s) to use to ensure the credibility of the obtained results. Descriptive statistics summarize and describe data with just a few numbers showing central tendency (e.g., mode, median), variability (e.g., range, variance and standard deviation) and measures of relationships (correlation). These statistics tell a lot about the nature of the data and facilitate comparisons; obviously they should be included when reporting your results. They also provide the basis for further analysis of data using inferential statistics, that is, significance testing, and help interpret the overall meaning of your results [3]. Statistical tests are classified into parametric and non-parametric tests [1, 2]. Most parametric tests have four basic assumptions that must be fulfilled for a test to give an accurate result, whereas non-parametric tests do not [2, p. 64]. The assumptions are: normally distributed data, homogeneity of variance, data measured at least at an interval level, and independence. The last two assumptions can only be checked by means of common sense and, unfortunately, a common mistake often seen is rating-scale data, measured at an ordinal level, being analysed with parametric tests. Homogeneity of variance can be tested in different ways depending on the nature of the study (repeated measures design or not). When it comes to whether the data follow a normal distribution or not, this assumption is often explored by plotting the data and just looking at it. This procedure is highly subjective, however, and one should let an objective test in the analysis software decide; here you can use the Kolmogorov-Smirnov or Shapiro-Wilk tests [2]. If your data violate any of the four assumptions you should seriously consider using a non-parametric test instead, to ensure statistical validity. Apart from using the wrong test in relation to the data type, there is another unfortunate mistake that often appears in the literature. The basis of all statistical tests of significance is to calculate the probability, the level of significance, of obtaining a difference (an effect) in scores if the scores are in fact occurring on a random basis rather than being the result of an experimental effect [19, p. 23]. This means that if the probability is very low, then you can reject the null hypothesis (which means that

2.4 Conducting an evaluation

A thorough description of how to moderate an evaluation session is beyond the scope of this paper [see, e.g., 18 for a detailed review of this matter]. However, there are some issues that are really important to adhere to in order to ensure the validity of your data, i.e. that your data will be the result of effects due to the variables you are studying and not due to confounding factors.

Collect relevant background information about the participants. This can be very helpful in combination with statistical analysis when interpreting your results. Sometimes it is also necessary to test participants for color blindness or stereo vision ability prior to the experiment.

Have the participants perform training tasks. This is often very important and you should make sure that they have really grasped what they need to know. This implies that you have to check accuracy (in most cases) somehow and not just ask them whether they know what to do and how to do it. If the study is not self-paced you might have to build in rest phases to avoid fatigue.

Another issue is whether you should monitor the session or not. In most cases the best thing is to leave the room and let the participant work alone. It reduces stress since the participant does not feel observed (if observation is not part of the study, of course). Also, you avoid the participant asking questions. Your presence may encourage them to do so even if you have stated that it is not allowed, which might be difficult to handle. The drawback of not being present is that you have no control over what happens.

I recommend that you use written instructions for participants to review before the evaluation starts, and that you follow a written protocol for how to moderate each part of the study. These uniform procedures ensure that all participants receive equal information and are treated in the same way. They also allow different people to moderate if you cannot deal with all participants yourself.

Finally, conduct a pilot study. This is invaluable in refining and finalizing your study, and it will help you discover things that you have overlooked, miscalculated or perhaps designed totally wrong.


text. Instead, this section covers only what is regarded as the method and results sections of a publication. The method section should describe all relevant details about what you did and how you proceeded when doing it. Consider the section as a recipe: it should be possible to replicate your study by following the description. There are several fundamental issues that need to be covered, and below some typical subsections are described. How to organize these subsections is highly flexible and depends on what will constitute a logical order of presentation, both to enhance readability and to avoid repetition. Naturally, some subsections can be collapsed, or perhaps some of their content would make more sense placed elsewhere. Some example sentences, taken from [9, 10], are inserted below to exemplify how to describe certain things. For greater detail I recommend reviewing these publications. Other examples of how to write a method section can be found in [11].

Stimuli (or material). This subsection presents details about the visual stimuli and other materials, such as questionnaires, that were used. More general issues about the visualization technique can be explained in the introduction to the study, and then this section need cover only what the actual images presented on the display looked like. For example, "Each stimulus display was comprised of a 12x12 matrix of grid cells creating a total of 144 grid positions with a square size of 0.8x0.8 m". If not explained in the introduction to the study, this is also where you describe the data set used in the study. This should be clearly defined to allow for verification and replication.

Apparatus. This is the place to describe the equipment used and also the experimental setup and usage conditions. This includes the type of computer, response apparatus and other hardware, and also what software was used to present stimuli and record responses. For example, "The images were rendered using OpenGL". The name of the manufacturer's products may sometimes be needed, e.g., "The computers were equipped with Nvidia TNT graphics cards".

Participants. This section should describe essential information about the people that were engaged to take part in the evaluation. This could include sex, age (here I advise stating the median age instead of the mean age since it is more informative), level of experience, nationality and whether or not they received any compensation for taking part.

Experimental design. In the design subsection you provide a description of the structure of your experiment: what design was used, what the variables were, what procedures were applied to assign orders, the total number of trials (number of repetitions in the experiment) per participant, etc. For example, "The study was performed as a four-variable mixed design with two within-subject variables: task type A (simple) vs. task type B (complex), and block of trials. The two between-subject variables were: type of visualization (2Dm vs. 3Dm vs. 2Da) and sequence of presentation of

any differences between scores in your experiment are likely to be random). Instead you accept the experimental hypothesis – that your results are significant [ibid]. Most often we use a level of significance, called the alpha level (written α), of 0.05, which means that we accept a 5% (1 in 20) risk that an observed difference is due to chance rather than to an experimental effect. Hence, every time we do a test we must be aware that we might accept an effect that is not actually real. If we do lots of tests on the same data these errors accumulate [1, p. 172]. For instance, it is not unusual that you need to compare more than two means in an experiment. Perhaps you have scores from group A, group B and group C, or one group has performed in four different conditions and you want to compare these four scores, e.g., the mean values for each condition. It is not unusual to see people using t-tests for this kind of analysis. However, a t-test compares only two means at a time, meaning that in the above examples three tests are required in the first case and six in the second case to make all comparisons (I have seen publications with far more comparisons too). Therefore, if we need to make many comparisons we should use tests which instead look for an overall experimental effect across several means (a difference between them) at once while maintaining the 5% level of significance. Analysis of variance (ANOVA) is a well-known example of such a test. The next step is to find out where the difference exists, i.e. between which specific means, using so-called post-hoc tests. These compare every experimental condition with every other one; it is like doing lots of t-tests, but they are calculated in such a way that the overall 5% level of significance is maintained despite many tests having been done. The Bonferroni correction is one example that is often used in the literature [1, pp. 173-174]. Finally, in recent years it has been emphasized that one should go a step beyond statistical significance and also report the effect size of an experimental result. The motivation is that even if you have found a statistically significant effect in your study, this does not automatically mean that it is important or meaningful [2, p. 32]. The magnitude, and thus the importance, of an obtained effect, on the other hand, is an objective and standardized measure that can be used to compare findings across different studies [ibid]. The effect size (small, medium or large) is expressed in standard deviation units, d, or as a correlation using Pearson's r [ibid]. It is still unusual to see authors reporting effect sizes, but it is becoming more common and should be encouraged. To learn more I strongly recommend the very accessible books on statistical testing by Field [1, 2].
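A sketch of this analysis workflow, assuming hypothetical accuracy scores for three independent groups (the numbers are invented, and SciPy is used here instead of the SPSS procedures referred to by Field [1, 2]), might look as follows: assumption checks, an overall ANOVA, and post-hoc pairwise comparisons with a Bonferroni correction and Cohen's d as the effect size.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    # Hypothetical accuracy scores for three independent groups (conditions A, B, C).
    a = rng.normal(0.70, 0.08, 14)
    b = rng.normal(0.74, 0.08, 14)
    c = rng.normal(0.80, 0.08, 14)

    # 1. Check parametric assumptions: normality (Shapiro-Wilk) per group and
    #    homogeneity of variance (Levene's test) across groups.
    for name, g in [("A", a), ("B", b), ("C", c)]:
        w, p = stats.shapiro(g)
        print(f"Shapiro-Wilk group {name}: W = {w:.3f}, p = {p:.3f}")
    lev_stat, lev_p = stats.levene(a, b, c)
    print(f"Levene: W = {lev_stat:.3f}, p = {lev_p:.3f}")

    # 2. One-way between-subjects ANOVA: one overall test across all three means
    #    instead of several separate t-tests.
    f_stat, f_p = stats.f_oneway(a, b, c)
    print(f"ANOVA: F(2,{len(a) + len(b) + len(c) - 3}) = {f_stat:.3f}, p = {f_p:.4f}")

    # 3. Post-hoc pairwise t-tests with a Bonferroni correction: multiply each
    #    p-value by the number of comparisons (capped at 1) so the overall 5%
    #    significance level is maintained.
    pairs = [("A vs B", a, b), ("A vs C", a, c), ("B vs C", b, c)]
    for label, x, y in pairs:
        t, p = stats.ttest_ind(x, y)
        p_bonf = min(p * len(pairs), 1.0)
        # Cohen's d as an effect size: mean difference in pooled-standard-deviation units.
        pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
        d = (y.mean() - x.mean()) / pooled_sd
        print(f"{label}: t = {t:.3f}, corrected p = {p_bonf:.4f}, d = {d:.2f}")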

2.6 How to report

Reporting on your evaluation study will always require some introduction and an overall description of the study itself. That can be organized in several ways, but I have chosen not to include it in this


task type", or "The presentation order was balanced using a Latin-square procedure". Here you also state how many trials in total the design yielded per participant.

Procedure. This part gives details of how the study was conducted in a practical sense. This includes instruction, training, the task(s) to be performed and how this was done (e.g., whether feedback was provided, the response procedure, whether the sequence of trials was self-paced, etc.) and the total participation time. For example, "Participants reviewed written instruction material and completed a block of practice trials to learn the concept and usage of the visualization and the two types of task to be performed", and "Stimuli were displayed until a response was given". Somewhere in the method the actual task(s) have to be thoroughly explained. This could be done in the procedure section, but often the best place is in the overall description of the evaluation.

Results. This section presents the results from applying your method. When you report your results you should include any treatments of the data, for example, "We employed a logarithmic transformation of the data before further statistical testing". You should state what test was used and how the data fitted into this procedure: "Group mean values were calculated and a between-subject ANOVA was carried out using a decision criterion of 0.05. Variables were visualization type (2Dm vs. 3Dm vs. 2Da) and sequence of task type". You state the finding to which the test relates and report the test statistic, usually with its degrees of freedom, the probability value and the associated descriptive statistics. For example, "There was a significant effect of visualization type, F(2,24) = 5.528, p < 0.01". Or, "A t-test was performed. The response times for the 3Dm visualization were significantly faster than for the 2Dm visualization (T = 2.4891, n = 10, p < 0.05). The group mean value for the search times with 3Dm was 23.9 seconds with a standard deviation of 1.35, while in the 2Dm condition it was 37.2 seconds with a standard deviation of 1.61". Sometimes it is helpful to include a short discussion of the results at the end of this section, or in a subsection, especially if the results lead to follow-up experiments. However, in the majority of cases the interpretation and discussion of the results should come in the sections covering the general discussion and conclusions (these sections are not covered in this paper). The results should stand by themselves, i.e. a result can be accurate whereas a conclusion based on it may not be. The information you provide should be presented in a way that is scientific, unambiguous and useful. Here terminology is highly important but often, unfortunately, there is no "universal code" to apply. Several words can be used to refer to the same thing. For example, a trial is one of a number of repetitions in an experiment, but the word can also be used to refer to an entire experiment. Also, a single repetition can be called a task or a case. I have seen publications where several different words are used throughout, actually referring to the same thing, making the description impossible to follow. When you first describe something, that is, when you give it an operational

definition, it should be crystal clear what you mean, and then you should use that term consistently throughout the text. Good operational definitions define and describe variables and procedures so that they cannot be misunderstood and so that other researchers can replicate them by following the descriptions [3, p. 75]. To conclude, writing the method section is not an easy task. A good description of a study requires a considerable amount of space, which can be difficult when faced with a page limit, and there is a trade-off between including irrelevant information and leaving relevant aspects out. A good approach is to have someone without prior knowledge of your study provide feedback about what is missing and what can be left out. When space is critical it is more important than ever to focus on the details most important to the outcome of the experiment and to replication. However, the aim should always be for the description to be as complete as possible. This is how you make it possible for others to evaluate and verify your work, and how you allow them to replicate it to see whether they obtain the same findings.
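One way to enforce such terminological consistency is to generate report sentences directly from the analysis output, so the same wording and level of detail is used every time a test is reported. The sketch below uses invented response-time data and is only meant to illustrate the idea, not to prescribe a reporting format.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    rt_2dm = rng.normal(37.2, 1.6, 10)  # invented search times in seconds
    rt_3dm = rng.normal(23.9, 1.4, 10)

    t, p = stats.ttest_rel(rt_2dm, rt_3dm)
    df = len(rt_2dm) - 1
    # Build one consistently worded sentence directly from the analysis output.
    print(f"A related t-test showed that search times for 3Dm "
          f"(M = {rt_3dm.mean():.1f} s, SD = {rt_3dm.std(ddof=1):.2f}) were faster than "
          f"for 2Dm (M = {rt_2dm.mean():.1f} s, SD = {rt_2dm.std(ddof=1):.2f}), "
          f"t({df}) = {t:.2f}, p = {p:.3f}.")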

3. Conclusions

Readers of this paper will hopefully have understood both the importance of scientific evaluation in InfoVis and some of the basics of how it should and should not be done. Reading this paper will not make them fully capable of designing and conducting an experiment and analysing its results. However, they should come away knowing better what is important to consider if they want to do it themselves and what resources are available to guide them. They should also be better able to find potential problems in an evaluation paper when reviewing such work, see where studies have been poorly designed and executed, identify basic problems with the statistical methods used for analysis and be able to ask why evaluations have been carried out in a less than optimal way when they could have been better.

4. Acknowledgements

The author thanks Matthew Cooper for valuable feedback on this paper.

5. References

[1] A. Field and G. Hole. How to Design and Report Experiments. Sage Publications, 2003.
[2] A. Field. Discovering Statistics Using SPSS, 2nd ed. Sage Publications, 2005.
[3] A.M. Graziano and M.L. Raulin. Research Methods: A Process of Inquiry, 7th ed. Allyn & Bacon, Boston, MA, 2010.
[4] BELIV'06. BEyond time and errors: novel evaLuation methods for Information Visualization. A workshop of the AVI 2006 International Working Conference, May 2006. http://www.dis.uniroma1.it/~beliv06/. Accessed 2010-03-28.
[5] BELIV'08. BEyond time and errors: novel evaLuation methods for Information Visualization. A workshop of the ACM CHI 2008 Conference, April 2008. http://www.dis.uniroma1.it/~beliv08/. Accessed 2010-03-28.
[6] BELIV'10. BEyond time and errors: novel evaLuation methods for Information Visualization. A workshop of the ACM CHI 2010 Conference, April 2010. http://www.beliv.org/beliv2010/index.php?title=Main_Page. Accessed 2010-03-28.
[7] C. Chen and M.P. Czerwinski. Empirical Evaluation of Information Visualization: An Introduction. International Journal of Human-Computer Studies, 53(5), 631-635, 2000.
[8] C. Chen. Top 10 Unsolved Information Visualization Problems. Computer Graphics and Applications, 25(4), 12-16, 2005.
[9] C. Forsell, S. Seipel and M. Lind. Surface Glyphs for Efficient Visualization of Spatial Multivariate Data. Information Visualization, 5(2), 112-124, 2006.
[10] C. Forsell and J. Johansson. Task-Based Evaluation of Multi-Relational 3D and Standard 2D Parallel Coordinates. In Proceedings of SPIE-IS&T Electronic Imaging, SPIE 6495, 64950C-1-12, Jan 2007.
[11] C. North and B. Shneiderman. Snap-Together Visualization: Can Users Construct and Operate Coordinated Visualizations? International Journal of Human-Computer Studies, 53(5), 715-739, 2000.
[12] C. North. Towards Measuring Visualization Insight. Computer Graphics and Applications, 26(3), 6-9, 2006.
[13] C. Plaisant. The Challenge of Information Visualization Evaluation. In Proceedings of AVI 2004, ACM Press, 109-116, 2004.
[14] C. Wood, D. Giles and C. Percy. Your Psychology Project Handbook: Becoming a Researcher. Pearson, 2009.
[15] D.A. Keim, D. Bergeron and R. Pickett. Test Data Sets for Evaluating Data Visualization Techniques. Perceptual Issues in Visualization, 9-22, 1994.
[16] 14th International Conference Information Visualisation, IV10. http://www.graphicslink.co.uk/IV10/. Accessed 2010-03-25.
[17] 2nd International Symposium Information Visualization Evaluation, IVE. http://www.graphicslink.co.uk/IV10/IVE.htm. Accessed 2010-02-25.
[18] J. Dumas and B. Loring. Moderating Usability Tests: Principles & Practices for Interacting. Morgan Kaufmann, 2008.
[19] J. Greene and M. D'Oliveira. Learning to Use Statistical Tests in Psychology, 2nd ed. Open University Press, Philadelphia, 2001.
[20] J. Johansson, C. Forsell, M. Lind and M. Cooper. Perceiving Patterns in Parallel Coordinates: Determining Thresholds for Identification of Relationships. Information Visualization, 7(2), 152-162, 2008.
[21] J. Nielsen. Heuristic Evaluation. In J. Nielsen and R.L. Mack (Eds.), Usability Inspection Methods. John Wiley & Sons, NY, USA, 25-61, 1994.
[22] M. Tory and T. Möller. Human Factors in Visualization Research. Transactions on Visualization and Computer Graphics, 10(1), 72-84, 2004.
[23] R. Kosara, C.G. Healey, V. Interrante, D.H. Laidlaw and C. Ware. Thoughts on User Studies: Why, How, and When. Computer Graphics and Applications, 23(4), 20-25, 2003.
[24] S. Carpendale. Evaluating Information Visualizations. In A. Kerren, J.T. Stasko, J.-D. Fekete and C. North (Eds.), Information Visualization: Human-Centered Issues and Perspectives, LNCS 4950, Springer, 19-45, 2008.
[25] University of Düsseldorf, G*Power. http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/. Accessed 2010-04-06.
[26] VisWeek. http://vis.computer.org/VisWeek2010/vis_cfp_papers.html. Accessed 2010-03-27.
