Statistical analysis of biological data diagnostic tests, gene ontology and gene expression

Clara-Cecilie Günther Statistical analysis of biological data – diagnostic tests, gene ontology and gene expression Thesis for the degree of Philoso...
Author: Ashlyn Green
0 downloads 0 Views 3MB Size
Clara-Cecilie Günther

Statistical analysis of biological data – diagnostic tests, gene ontology and gene expression

Thesis for the degree of Philosophiae Doctor Trondheim, June 2009 Norwegian University of Science and Technology Faculty of Information Technology, Mathematics and Electrical Engineering Department of Mathematical Sciences

NTNU Norwegian University of Science and Technology Thesis for the degree of Philosophiae Doctor Faculty of Information Technology, Mathematics and Electrical Engineering Department of Mathematical Sciences © Clara-Cecilie Günther ISBN 978-82-471-1596-1 (printed ver.) ISBN 978-82-471-1597-8 (electronic ver.) ISSN 1503-8181 Doctoral theses at NTNU, 2009:107 Printed by NTNU-trykk

T

P

p

BMC Bioinformatics

BioMed Central

Open Access

Software

GeneTools – application for functional annotation and statistical hypothesis testing Vidar Beisvag*1, Frode KR Jünge1, Hallgeir Bergum1, Lars Jølsum1, Stian Lydersen1, Clara-Cecilie Günther2, Heri Ramampiaro3, Mette Langaas2, Arne K Sandvik1,4 and Astrid Lægreid1 Address: 1Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway, 2Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway, 3Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway and 4Department of Medicine, St. Olav's University Hospital, Trondheim, Norway Email: Vidar Beisvag* - [email protected]; Frode KR Jünge - [email protected]; Hallgeir Bergum - [email protected]; Lars Jølsum - [email protected]; Stian Lydersen - [email protected]; Clara-Cecilie Günther - [email protected]; Heri Ramampiaro - [email protected]; Mette Langaas - [email protected]; Arne K Sandvik - [email protected]; Astrid Lægreid - [email protected] * Corresponding author

Published: 24 October 2006 BMC Bioinformatics 2006, 7:470

doi:10.1186/1471-2105-7-470

Received: 19 July 2006 Accepted: 24 October 2006

This article is available from: http://www.biomedcentral.com/1471-2105/7/470 © 2006 Beisvag et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract Background: Modern biology has shifted from "one gene" approaches to methods for genomic-scale analysis like microarray technology, which allow simultaneous measurement of thousands of genes. This has created a need for tools facilitating interpretation of biological data in "batch" mode. However, such tools often leave the investigator with large volumes of apparently unorganized information. To meet this interpretation challenge, gene-set, or cluster testing has become a popular analytical tool. Many gene-set testing methods and software packages are now available, most of which use a variety of statistical tests to assess the genes in a set for biological information. However, the field is still evolving, and there is a great need for "integrated" solutions. Results: GeneTools is a web-service providing access to a database that brings together information from a broad range of resources. The annotation data are updated weekly, guaranteeing that users get data most recently available. Data submitted by the user are stored in the database, where it can easily be updated, shared between users and exported in various formats. GeneTools provides three different tools: i) NMC Annotation Tool, which offers annotations from several databases like UniGene, Entrez Gene, SwissProt and GeneOntology, in both single- and batch search mode. ii) GO Annotator Tool, where users can add new gene ontology (GO) annotations to genes of interest. These user defined GO annotations can be used in further analysis or exported for public distribution. iii) eGOn, a tool for visualization and statistical hypothesis testing of GO category representation. As the first GO tool, eGOn supports hypothesis testing for three different situations (master-target situation, mutually exclusive target-target situation and intersecting target-target situation). An important additional function is an evidence-code filter that allows users, to select the GO annotations for the analysis. Conclusion: GeneTools is the first "all in one" annotation tool, providing users with a rapid extraction of highly relevant gene annotation data for e.g. thousands of genes or clones at once. It allows a user to define and archive new GO annotations and it supports hypothesis testing related to GO category representations. GeneTools is freely available through www.genetools.no

Page 1 of 13 (page number not for citation purposes)

BMC Bioinformatics 2006, 7:470

Background Microarray technology allows researchers to monitor transcript levels of thousands of genes in a single experiment [1]. Typically it confronts the researcher with vast amounts of numerical data as a starting point from which to begin to investigate how molecular mechanisms are involved in a specific biological setting. Typically, scientists have to manually query several resources/databases for information. Although these can be highly informative individually, the collection of available content would be more useful if provided in an integrated manner. Highthroughput, automated annotation summaries can expedite this step and today several resources like Source [2], GeneCards [3] and NetAffx [4] already offer this. In order to understand how cells function within a tissue, e.g. in a given state one can use data-driven methods, such as hierarchical clustering and self-organizing maps [5,6], which identify groups of genes with similar expression patterns. However, a complementary approach is to view data at the level of biological background knowledge such as a gene's involvement in a biological processes or pathway. The leading controlled vocabulary for such functional information is Gene Ontology (GO) [7]. Annotation of genes with GO terms creates a biological knowledge profile, in three layers dependent on the toplevel GO branch used (biological process, molecular function or cellular component). Several tools are suited for analysis of the GO hierarchy and for statistical evaluation of GO category representations between gene lists [8]. Comparisons of gene lists are important in order to answer questions such as "are genes involved in process P overrepresented among the total of differentially expressed genes in an experiment" or "does treatment A induce more genes involved in process P than treatment B?". A potential problem using such tools, is that the existing annotation databases are incomplete and for most organisms only a subset of the known genes are functionally annotated [8]. Moreover, a major part of the available annotations e.g. those inferred from electronic annotations may be imprecise or incorrect. The present paper describes GeneTools, a package of webbased tools for gene annotation. GeneTools is built on top of an underlying database that is updated on a weekly basis to provide information as recent as possible. The annotation data is accessible through two user interfaces, the NMC Annotation Tool which offers general functional annotation information in both single- and batch search mode, and the eGOn tool which can annotate, display and perform statistical hypothesis testing to assess the degree of similarity of GO category representation

http://www.biomedcentral.com/1471-2105/7/470

between different gene lists. An important function in eGOn is the possibility to filter on evidence codes. Also, additional user defined GO annotations can be added to the database through the GO Annotator Tool for use in further analysis. Another unique feature in GeneTools is that user submitted data is stored in the database and can be shared with other users. Finally, a significant part of this paper deals with how the hypothesis testing for GO category representations is performed, which we think has been inadequately described for many other resources.

Implementation GeneTools is a web service. It runs on most web browsers, including IE 5.0 or higher, Netscape 7 or higher and Mozilla Firefox 1.0 or higher, and is platform-independent. GeneTools is implemented in the PHP programming language. We have chosen to implement this tool as a web service to make it as user-friendly as possible, as most of the users are not bioinformaticians able to perform programming. However, more advanced use of the service is possible as described later in this chapter. GeneTools is the front-end of a MySQL database containing annotation data from the following publicly available resources: UniGene [9], EntrezGene [10] (including GOA [11], Proteome, MGD [12] and RDG [13] annotations), Gene Ontology [14], SwissProt [15], and HomoloGene [16]. Information from 64 organisms available through UniGene is included, but the most comprehensive information is available for human, rat and mouse genes. All these databases are stored as local copies, enabling quick access to the data in response to the user query. Since many of the resources on which GeneTools draws continuously change their information content, the GeneTools database is updated on a weekly basis to ensure that it contains the most up-to-date information, continuously updating the stored gene reporter lists. An automated process checks for updates of the outside databases, downloads these files, and populates database tables accordingly. This ensures that the connections between external databases made within GeneTools are as accurate as possible. Thus, both the mapping of clones to genes and the functional attributes associated with those genes are dynamic and current. All data and graphics from searches and analysis can be exported in various formats (txt, XML or as Excel files). Due to the heterogeneous nature of annotation information, bioinformaticians and systems biology researchers may want to perform more high-level analysis than offered through our web service. We therefore offer an API solution, based on web services description language (WSDL), for external resources wishing to use data from

Page 2 of 13 (page number not for citation purposes)

BMC Bioinformatics 2006, 7:470

our database. Typically new and important tools like Taverna [17] can easily utilize this system using SOPE/RPC. Currently our API solution is utilized by the Norwegian Microarray Consortium (NMC) which updates their local BASE (BioArray Software Environment) [18] servers with information from this database. Moreover, SciCraft [19], a general data analysis tool, uses data from the GeneTools database in its microarray data analysis tool box. We will also offer R code for the statistical testing in eGOn upon request. The structure of our GeneTools database is built so that it can be used in the future as part of local or external data warehouses.

Results and discussion Inputs Figure 1 gives an overview of GeneTools with its single search and batch search (gene reporter lists) inputs, its underlying database structure and associated tools for analysis. The ability to simultaneously collect data from

http://www.biomedcentral.com/1471-2105/7/470

numerous sources for e.g. thousands of genes from microarray experiments in batch is especially important and made very user friendly through GeneTools. Single search The database enables searching by gene symbols/names, GenBank accession numbers, UniGene cluster IDs, SwissProt entry names and several unique clone IDs (IMAGE clone IDs, University of Iowa clone IDs, Operon oligo IDs, TAIR IDs and a subset of selected Affymetrix and Agilent IDs).

The names and symbols of genes/proteins may be highly ambiguous [20]. We therefore recommend using primary gene IDs, like GeneBank accession numbers or specific probe IDs when querying the database. However, if gene names or symbols are used, caution is advised because only official names/symbols associated with UniProt knowledgebase will be recognized.

Figure Flowchart 1 of the GeneTools program and the underlying database Flowchart of the GeneTools program and the underlying database. The underlying database is updated on a weekly basis with annotation information from several external databases including UniGene, Swiss-Prot, Entrez Gene and GO. User data are submitted to the database as text files of gene reporters and analysis of the annotation data can be performed through three user interfaces: the NMC Annotation Tool, the GO Annotator Tool and eGOn. Analysis results and annotation data can be exported in various formats.

Page 3 of 13 (page number not for citation purposes)

BMC Bioinformatics 2006, 7:470

http://www.biomedcentral.com/1471-2105/7/470

Batch search Input of gene reporter lists for batch search is done by uploading tab-delimited text files to the server. After submission, the gene reporters are automatically mapped to a UniGene cluster, and functional annotations/attributes (e.g. GO annotation) are associated with the specific gene/protein (Figure 1). Uploaded gene reporter lists are stored and can easily be managed in folders or shared with other users. If new annotation information becomes available for any of the stored gene reporter lists, the user will be notified.

available, each single search result view will contain all or a subset of the following categories of data:

Updates The user may at any time choose to update a stored gene reporter list, thus incorporating the most recent annotation information from the weekly update of the GeneTools database in the analysis. The updating process is fast even for lists of thousands of gene reporters. The user receives a specified report detailing which gene reporters are associated with new annotation information and the changes made.

III. Data from Entrez Gene: A. gene name, symbol and aliases. B. biological roles and summary of functions curated by Entrez (Ref.seq summary). C. gene ontology (GO) annotations with direct link to references and links to alternative ontologies like KEGG. D. direct link to curated PubMed Gene RIFs (reference into function).

Tools, analyses and outputs NMC annotation Tool A major challenge when using genomic scale methods like microarrays, is to handle annotation information from the resulting comprehensive gene reporter lists. Thus, one of the most important features of GeneTools is the ability to simultaneously extract pre-existing annotation data from a wide variety of database resources for thousands of genes in a batch. Since the GeneTools database is weekly updated and the NMC Annotation Tool provides user friendly functionalities for associating new annotation information with the reporters in uploaded gene lists, the NMC Annotation Tool is particularly useful when it is important to always have access to the most recent information on the genes and clones being examined. The NMC Annotation Tool enables the user to query the GeneTools database by singe gene search or by batch search after submission of a gene reporter list for a microarray experiments. Given the massive amount of data available through GeneTools (Figure 1), information overload can be a potential problem. Therefore, we have provided the user with the option to select (in the "preferences" menu) the information to be shown on the screen for single search and batch view and to select which information to export. However, we will stress that this option should be used cautiously, because it may introduce selection bias and important information may be lost. Single search outputs The single search function captures the collection of features attributable to the given gene and its products, when a gene is defined by a unique UniGene cluster. Whenever

I. Data from Unigene, including e.g. A. gene cluster, name and symbol. B. protein similarities with selected organisms (with direct link to Entrez protein). C. chromosome localization information. D. UniGene associated sequences with cluster. II. Data from Homologene: Shows homologous genes for human, rat and mouse.

IV. Data from Swiss Prot: A. protein names and aliases. B. biological role and function information curated by Swiss Prot. C. protein sequence information. D. direct links to various external sources associated with current protein are offered for each gene reporter. Batch search outputs One of the most important and unique features of the NMC Annotation Tool is the batch search mode which utilizes all of our database sources for gene reporter lists from microarray experiments. For instance, the users can easily extract biological function, chromosomal localization, and get access to publications (GeneRIFs) that describe gene functions. The results for reporter gene list from a batch search can be viewed in a user-friendly tabular form (Figure 2). Moreover, the annotation data displayed on the screen are associated with hyperlinks to the underlying database or to the single search view. The annotation data can be exported in several formats for printing or storage (XML and XLS).

NMC Annotation Tool provides several features not available in other gene annotation tools. To our knowledge, few other application stores users' gene reporter lists allowing update of the reporter lists at any time with the most recent UniGene, Entrez Gene and GO information. This is important since the clusters in UniGene change rapidly and new GO annotations are being added continuously. To achieve this, the submitted gene reporter lists can easily be updated with all new information. Information about the external databases included in GeneTools and their last updates can be found from a link named "database status" in the menu, and provides useful documentation for publishing purposes. Information about commercial arrays supported by GeneTools (currently Affymetrix, Operon and Agilent) is also given. To our knowledge, a similar

Page 4 of 13 (page number not for citation purposes)

BMC Bioinformatics 2006, 7:470

http://www.biomedcentral.com/1471-2105/7/470

Figure "overview" Typical 2 result output for a submitted gene reporter list Typical "overview" result output for a submitted gene reporter list. Input gene reporter and associated UniGene cluster, gene name, symbol and chromosome localization is shown for all the gene reporters in the submitted lists. Several of the information boxes are hyperlinked redirecting the user to the original source. More specific annotations can be found under the "tabs" named Entrez, SwissProt and GO. By clicking on the gene reporter ID, a single search window for the selected gene reporter will appear.

variety of important features is not available in gene annotation tools like Source [2], GeneCards [3], NetAffx [4], GeneCruiser [21], Onto-Tools [22], GARBAN [23] and GeneLynx [24].

GO annotator tool (user defined GO annotations) The introduction of Gene Ontology (GO) [25] as a standardised vocabulary for describing genes, gene products and their biological functions represents an important

Page 5 of 13 (page number not for citation purposes)

BMC Bioinformatics 2006, 7:470

milestone in the possibilities to handle and include biological background information in functional genomics analyses. Many databases today provide GO annotations for a variety of organisms including humans and other species. However, GO is still incomplete and significant extensions to its structure are needed before all available biological knowledge can be represented as GO annotations in public databases. Also, besides the human research filed other organisms e.g. common model organisms like rat and mouse are still lagging behind when it comes to raising the quality of curation of GO annotations. Thus, a high proportion of GO annotations offered in the rat genome database (RGD) [13] and the mouse genome database (MGD) [12] are associated with the IEA (inferred by electronic annotation) evidence code, which implies a lower degree of certainty than some users may require. To overcome at least some of these problems, GeneTools allows a user to define their own GO annotations to genes of interest. The GO Annotation Tool (accessible through "single search" mode in the NMC Annotation Tool) enables the addition of new, user defined GO annotations as well as the curation of GO annotations e.g. annotations with evidence code IEA. GO Annotation Tool is supported by a GO term search system, simplifying the browsing for GO terms. Evidence codes and references (e.g. PMID) according to GO standards and free text can be added (Figure 3). New annotations are stored in the database and can be included in further analysis (e.g. added to the GO analysis in the eGOn tool). We are in the process of making an export function, where these user defined GO annotations can be exported to the GOA database [11] by an email service. GOA will curate these annotations and make them available for others through the GO annotation database [26]. Explore Gene Ontology (eGOn) Controlled vocabularies facilitate query and retrieval of knowledge from many different sources using a common query structure. Three separate important activities are needed to enable this: the production and maintenance of the ontologies themselves; the creation of associations (or annotations) between the GO terms and gene products, and the development of tools that facilitate the creation, maintenance and use of the ontologies.

eGOn visualizes gene annotations in the GO hierarchy and offers a collection of statistical tests that translate the GO annotation information associated with the reporters in gene lists from functional genomics experiments to provide insight into the biological mechanisms involved. A wide range of resources are available for GO analysis [27]. In a recent review, Khatri et al. [8] question how such

http://www.biomedcentral.com/1471-2105/7/470

resources are built and used. Khatri et al. point out that existing annotation databases are incomplete, that a proportion of the annotations may be imprecise or incorrect, that name space mapping (how to connect a probe sequence to a gene/protein) is a problem, and that available statistical tests are not always validated. We think that the tool eGOn of the GeneTools suite meets many of these challenges since it enables filtering of annotations by evidence code, it allows the entry of new annotations and curation via the GO Annotator Tool and it provides a series of robust statistical tests that are thoroughly validated and documented. For GO annotations, GeneTools uses Entrez Gene which offers curated data from the GO database that includes all registered GO annotations [26]. Some annotations available in the GO database will not be included using the Entrez curated GO annotations but the quality of annotation is most likely better. eGOn offers the possibility to filter the GO annotations from a gene reporter list by evidence codes. A substantial proportion of GO annotations are inferred by electronic methods (evidence code IEA), potentially being imprecise and possibly biasing further analysis. Thus, in a given analysis, it may be beneficial to exclude IEA annotations and only use more robust annotations, like e.g. annotations derived from "traceable author statement" (TAS), "inferred from direct assay" (IDA) or "inferred by curator" (IC). In other situations it may be desirable to include electronic annotations in order to obtain a sufficient amount of data to do a valid analysis, e.g. for rat and mouse genes where most of the annotations up to now are IEA. Another possibility which to our knowledge is not in use by any GO analysis tool today, might be to perform some kind of weighting by the type of evidence code for the statistical calculations. An essential feature of eGOn is the possibility to compare and analyze annotated genes from two or more gene reporter lists in the GO-tree. eGOn both visualizes these comparisons within the GO-tree and formally calculates the degree of GO category representation similarity between the gene lists using statistical tests (Figure 4). Testing statistical hypotheses of association between gene reporter lists To investigate and better interpret the relevance of biological annotations of lists of gene reporters, statistical hypothesis testing can be a valuable tool. Let us for example consider a microarray experiment where the objective of the study is to compare the differentially expressed genes from heart failure tissue between cases and controls where the cases are patients with coronary artery disease (CAD) or dilated cardiomyopathy (DCM) and the controls are tissue from non-failing hearts [28].

Page 6 of 13 (page number not for citation purposes)

BMC Bioinformatics 2006, 7:470

http://www.biomedcentral.com/1471-2105/7/470

Figure User interface 3 for the GO Annotator Tool User interface for the GO Annotator Tool. To add a new GO annotation, the user selects a gene, adds a GO term, chooses an appropriate evidence code and adds a reference article (PMID). The GO annotations are then stored in the database and an exported function to GOA for world wide distribution is under development. A link to the GO Annotator Tool can be launched from the top of the page of the result window from a single gene search, in the NMC Annotation Tool mode.

To formally state the statistical hypothesis, consider a randomly chosen gene and a given GO category denoted G. Define the following three events: • A = the gene is in gene reporter list A • B = the gene is in gene reporter list B

• G = the gene is a member of GO category G. In this example the list A would be the list of differentially expressed genes between CAD and controls while list B would be the differentially expressed genes between DCM and controls. At the given GO category G (e.g. catabolism), we are interested in investigating whether the prob-

Page 7 of 13 (page number not for citation purposes)

BMC Bioinformatics 2006, 7:470

http://www.biomedcentral.com/1471-2105/7/470

Figurereport Result 4 output from eGOn Result report output from eGOn. Gene reporter lists submitted to eGOn can be visualized in tree-view, as result-view or as report-view. In the tree-view (A) the nodes may be collapsed or expanded producing the desired level of detail and the resulting structure can be saved as a template for future use. Several preset levels can also be selected. By clicking on a GO node the gene reporter associated with this GO node in the GO-tree can be interactively examined and links are offered to single gene view in the NMC Annotation Tool. In result view p-values for all GO categories are shown and for the report view (B), only the GO categories that fit the user's p-value cut-off are shown.

Page 8 of 13 (page number not for citation purposes)

BMC Bioinformatics 2006, 7:470

ability of belonging to GO category G is different for genes on gene list A and genes on gene list B. For each gene on list A, there is a conditional probability P(G|A) of belonging to GO category G, and for each gene on list B, there is a conditional probability P(G|B) of belonging to GO category G. Under the null hypothesis these two probabilities are equal. From this the following null hypothesis and alternative hypothesis can be formulated. H0: P(G|A) = P(G|B) vs. H1: P(G|A) z P(G|B) By using the laws of conditional probability, we have the following additional interpretation. For a chosen GO category G, the ratio between the probability of membership of gene reporter list A and membership of gene reporter list B, is the same as the ratio between the probability of being a member of gene reporter list A to the probability of being a member of gene reporter list B in the whole GO-tree. Statistically we need to distinguish between three situations, to correctly handle the possible dependencies between gene reporter lists A and B. An illustration of these situations is given in figure 5. Different statistical hypothesis tests are suitable for the three situations. In eGOn we have implemented three tests for these situations: the master-target test, the mutually exclusive targettarget test and the intersecting target-target test. In brief, all three tests are parametric and the tests for the mastertarget situation and the mutually exclusive target-target situation are based on the same implementation of Fisher's exact test, but with different inputs. The intersecting target-target test is based on a test statistic by Leisenring et al. [29]. The test of Leisenring is designed to test if the positive predictive value (PPV) of two medical diagnostic tests is equal. A further description of the different situations and the corresponding tests can be found in the next chapters. Moreover, a detailed description of the statistical tests is offered in the supplementary material (additional file 1). Master-target situation In the master-target situation the GO categories (e.g. biological processes) of the genes of interest (e.g. differentially expressed) from a given experiment (target list) are compared with the distribution of GO categories for all gene reporters represented as physical probes on the microarray (master list) used in the experiment. The purpose is to find whether, in any of the GO categories, the genes of interest are over- or underrepresented compared to the genes represented on the microarray. For our heart failure example, list M would be a list of all the genes investigated on the microarray and list B would be the genes that are found to be differentially expressed between the DCM hearts and the controls (Figure 5).

http://www.biomedcentral.com/1471-2105/7/470

This type of comparison between two gene reporter lists is useful and most GO tools offer tests for this. Statistically this situation can be transformed into a problem where we for each GO category under consideration want to test if two independent binomial proportions are equal (for details, see Günter et al. [30]). Several statistical approaches can be used, e.g. Fisher's exact test, Pearson's asymptotic Chi-square-test, a conditional mid-p test, or an unconditional test. We refer to Agresti [31] for a presentation of these tests, and to Khatri and Dragici [8] for an overview of different statistical tests implemented in the various GO-tools available in the master-target situation. In eGOn we have chosen the Fisher's exact test for the master-target situation and we call this the master-target test. The implementation is based on a translation to PHP of a JAVA-script by Langsrud [32]. The use of this two sided test is further explained by Zeeberg et al. [33]. Mutually exclusive target-target situation In the mutually exclusive target-target situation there are no common genes in the two lists compared, in the heart failure example list A1 could be the list of differentially expressed genes that are up-regulated for the CAD hearts compared to the controls, while list A2 contains the genes that are down-regulated for the CAD hearts compared to the controls. The purpose with this type of comparison is to find which e.g. biological processes as defined by GO categories are differentially represented in the up- and down-regulated genes in the same experiment (Figure 5).

Statistically this situation is very similar to the master-target situation and can be transformed into a problem where we for each GO category under consideration want to test if two independent binomial proportions are equal. The same statistical tests as listed for the master-target test can be used. In eGOn we have chosen to implement the Fisher's exact test for the mutually exclusive target-target situation, called the mutually exclusive target-target test, using the same implementation, but with different inputs, as for the master-target test. Intersecting target-target situation When two gene reporter lists are compared and a number of gene reporters are represented on both lists, the intersecting target-target test is used to investigate whether the GO categories represented by these genes are over- or under represented in the experiments behind the two lists. In our heart failure example, list A could be the differentially expressed genes between CAD hearts and controls while list B would be the differentially expressed genes between DCM hearts and controls (Figure 5).

In Günther et al. [30], three different statistical tests are presented in the situation where the two gene lists are intersecting. All three tests are constructed for use with

Page 9 of 13 (page number not for citation purposes)

BMC Bioinformatics 2006, 7:470

http://www.biomedcentral.com/1471-2105/7/470

Figure Three different 5 situations covered by the statistical testes in e GOn Three different situations covered by the statistical testes in e GOn. Master-target situation: When one gene reporter list is a subset of the other list (the master list) the master-target test can be used in the comparison. Mutually exclusive targettarget situation: If the gene reporters do not have any reporters in common (e.g. lists of up- vs. down regulated genes form the same experiment) the mutually exclusive target-target test can be used. Intersecting target-target situation: if the two lists compared include common gene reporters, from e.g. two experiments, then the intersecting target-target test can be used.

large samples, and are based on an asymptotic relation to the Chi-square distribution. In eGOn we have chosen to implement the test based on Leisenring et al. [29], originally constructed for comparing positive predictive values of two diagnostic tests, tests A and B, with respect to a disease G. This test uses a score statistic based on generalized estimating equations to fit a generalized linear model. We have translated this test into the setting of comparing two gene lists at a given GO category. Further details can be

found in Günther et al. [30] or in the supplementary material (additional file 1). Methodical considerations The statistical tests for association between two gene reporter lists under consideration are based only on the gene lists submitted to eGOn, and the raw data underlying the statistical analyses producing the gene reporter lists are not submitted to eGOn. This means that eGOn does not

Page 10 of 13 (page number not for citation purposes)

BMC Bioinformatics 2006, 7:470

offer permutation based methods for addressing the dependence structure between the genes. The statistical tests in eGOn are thus based on the assumption that under the null hypothesis the genes on the lists (or subsets of the lists in the intersecting target-target situation) act independently, as is also commonly assumed in other GO-tools. This should be taken into consideration when analysis is performed, and duplicate genes/reporters, close family members or pathways partners may be removed. This can easily be done by the filtering tool in GeneTools. The p-values produced by the statistical test can be displayed for all GO categories or only those satisfying a certain p-value cut-off. Adjusted p-values can be calculated for a selected set of GO categories and is dependent on how the GO hierarchy is collapsed/expanded, using the step-up procedure of Benjamini and Hochberg [34] for controlling the False Discovery Rate (FDR). Setting a cutoff at 0.05 for the adjusted p-value will control the (FDR) at level 0.05. The Benjamini-Hochberg step-up procedure controls the FDR under certain dependence structures (for example positive regression dependency, see Benjamini and Yekutieli [35] for a detailed presentation). However, the dependency structure among the selected GO-categories in the GO-tree is not known, and questions remain about controlling the FDR in hierarchical structures. One important "consensus point" within statistical inference discussed by Allison et al. [36] is that gene set testing is desirable, and has become a popular and widely accepted analytical tool. However, one problem with gene class testing, according to Allison et al. [36], is that the null hypotheses of these tests are not, or poorly defined. By formally stating the null and alternative hypotheses, we think our paper has addressed these concerns in a thorough manner. An important consideration when searching for statistically significant GO categories within a gene reporter list (our master-target test) is the choice of the reference (master) list of gene reporters from which the pvalues for each GO category in the results are calculated. Some tools use the total set of genes in a genome as a reference (the master list). We do not think this is the best solution since the observed number of gene reporters for a specific GO category should be compared with the number of gene reporters that could appear if a random selection was taken from the list of all genes that was under study in the experiment. In eGOn p-values can be shown for the whole GO tree and unlike most other tools several preset levels can be chosen and users can modify the tree as they like. In addition a result report view is accessible, showing only the GO nodes which satisfy a specific pre set p-value cutoff. Unique in the eGOn tool, we offer statistical tests for comparisons between gene reporter lists. The master-target test

http://www.biomedcentral.com/1471-2105/7/470

and mutually exclusive target-target test are both used in different variations in several programs today, but no other GO-tool, to our knowledge, offers tests for the intersecting target-target situation. However, the statistical test of FatiGO [37] is valid for the mutually exclusive targettarget situation, and was in a simulation study found to preserve the test size when the gene reporter lists are of equal length [30]. Our intersecting target-target test is valid when the two gene reporter lists are intersecting, potentially constituting a useful test, since it offers the opportunity to compare gene reporter list for different experiments (as previously described by the heart diseases example). In this way both our target-target tests may answer questions not necessary answered by the standard master-target tests applied to most tools. Future plans GeneTools was released in September 2005 and has steadily gained popularity since then. In October 2006 over 1 700 users from 60 countries were registered and over 4 000 gene reporter lists were submitted to the database. We plan to continue adding new features to GeneTools, including more information from external databases like e.g. Ensembl and OMIM. Furthermore, we hope to provide developers of other tools an extended version of our API and extend the export function to support SBML (systems biology markup language) [38] which will make more high-level analysis possible. We think the need for central and publicly available resources which curate biological data will only continue to grow and that GeneTools and similar tools will be essential for biologists and bioinformaticians to efficiently analyze genome-scale datasets. Today their main utility is for gene expression analysis, but in the future proteomic and SNP data need to be analyzed by similar tools. In addition, an important future use of annotation tools will be in systems biology approaches that are now evolving rapidly.

Conclusion GeneTools is a flexible and user friendly "all in one" annotation tool, where the users can rapidly extract gene annotation data for e.g. thousands of genes or clones at once. The user can add "user defined" GO annotation to gene products and all annotation information is stored in a database which can easily be shared with other users and exported in different formats. eGOn is the first tool that can perform hypothesis testing for three different situations, looking for over- or under-representation of GO categories between gene reporter lists.

Availability and requirements Project name: GeneTools Project Homepage: http://www.genetools.no

Page 11 of 13 (page number not for citation purposes)

BMC Bioinformatics 2006, 7:470

http://www.biomedcentral.com/1471-2105/7/470

Operating System: Platform independent

5.

Programming Language: PHP

6.

Underlying Database: mySQL

Authors' contributions VB initiated and coordinated the project and wrote the manuscript. AL co-initiated the project and together with AKS they supervised the project and were involved in drafting and reviewing the manuscript. ML, CCG and SL devised the statistical algorithms. FKJ, HB and HR designed and built the underlying database. LJ, HB and FKJ contributed equally in writing the program code and maintain the underlying database. FKJ, HB, and VB designed the GeneTools web interface. All authors read and contributed to revising the manuscript for intellectual content and approved the final manuscript.

7.

8. 9. 10. 11.

12.

Additional material Additional file 1 eGOnv2_statistics.pdf. As supplementary materials a detailed description of the background for the statistical tests in eGOn is offered. Click here for file [http://www.biomedcentral.com/content/supplementary/14712105-7-470-S1.pdf]

13.

14. 15. 16. 17.

Acknowledgements We wish to thank Martin Kuiper for reviewing the manuscript and øyvind Møll for his great help with debugging and implementing the final features of the software. In addition, we wish to thank Vladimir Yankovski and Jan Komorowski for their contributions to early stages of GeneTools.

18.

19.

VB is the recipient of a research fellowship from the National Council on Cardiovascular Diseases. CCG is funded by a NTNU grant.

20.

GeneTools was developed with support from the National Microarray Technology platform (Norwegian Microarray Consortium, NMC), which is supported by the Functional genomics program (FUGE) of the Norwegian Research Council.

21.

References 1.

2.

3.

4.

Schena M, Shalon D, Heller R, Chai A, Brown PO, Davis RW: Parallel human genome analysis: microarray-based expression monitoring of 1000 genes. Proc Natl Acad Sci USA 1996, 93:10614-10619. Diehn M, Sherlock G, Binkley G, Jin H, Matese JC, Hernandez-Boussard T, Rees CA, Cherry JM, Botstein D, Brown PO, Alizadeh AA: SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res 2003, 31:219-223. Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D: GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998, 14:656-664. Liu G, Loraine AE, Shigeta R, Cline M, Cheng J, Valmeekam V, Sun S, Kulp D, Siani-Rose MA: NetAffx: Affymetrix probesets and annotations. Nucleic Acids Res 2003, 31:82-86.

22.

23.

24. 25. 26. 27. 28.

Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95:14863-14868. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 1999, 96:2907-2912. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, IsselTarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29. Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005, 21:3587-3595. UniGene website [http://www.ncbi.nlm.nih.gov/./UniGene/] Entrez website [http://www.ncbi.nlm.nih.gov/entrez/ query.fcgi?db=gene] Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32:D262-D266. Blake JA, Richardson JE, Davisson MT, Eppig JT: The Mouse Genome Database (MGD). A comprehensive public resource of genetic, phenotypic and genomic data. The Mouse Genome Informatics Group. Nucleic Acids Res 1997, 25:85-91. Twigger S, Lu J, Shimoyama M, Chen D, Pasko D, Long H, Ginster J, Chen CF, Nigam R, Kwitek A, Eppig J, Maltais L, Maglott D, Schuler G, Jacob H, Tonellato PJ: Rat Genome Database (RGD): mapping disease onto the genome. Nucleic Acids Res 2002, 30:125-128. GeneOntology website [http://www.geneontology.org/] SwissProt website [http://ca.expasy.org/sprot/] HomoloGene website [http://www.ncbi.nlm.nih.gov/Homolo Gene/] Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T: Taverna: a tool for building and running workflows of services. Nucleic Acids Res 2006, 34:W729-W732. Saal LH, Troein C, Vallon-Christersson J, Gruvberger S, Borg A, Peterson C: BioArray Software Environment (BASE): a platform for comprehensive management and analysis of microarray data. Genome Biol 2002, 3:SOFTWARE0003. Alsberg B, Kirkhus L, Tangstad T, Andressen E: Data analysis of microarrays using SciCraft. Knowledge exploration in life science informatics, proceedings lecture notes in artificial intelligence 2004, 3303:58-68. Ref Type: Journal (Full) Liu H, Hu ZZ, Torii M, Wu C, Friedman C: Quantitative assessment of dictionary-based protein named entity tagging. J Am Med Inform Assoc 2006, 13:497-507. Liefeld T, Reich M, Gould J, Zhang P, Tamayo P, Mesirov JP: GeneCruiser: a web service for the annotation of microarray data. Bioinformatics 2005, 21:3681-3682. Draghici S, Khatri P, Bhavsar P, Shah A, Krawetz SA, Tainsky MA: Onto-Tools, the toolkit of the modern biologist: OntoExpress, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Res 2003, 31:3775-3781. Martinez-Cruz LA, Rubio A, Martinez-Chantar ML, Labarga A, Barrio I, Podhorski A, Segura V, Sevilla Campo JL, Avila MA, Mato JM: GARBAN: genomic analysis and rapid biological annotation of cDNA microarray and proteomic data. Bioinformatics 2003, 19:2158-2160. Lenhard B, Wahlestedt C, Wasserman WW: GeneLynx mouse: integrated portal to the mouse genome. Genome Res 2003, 13:1501-1504. The Gene Ontology (GO) project in 2006. Nucleic Acids Res 2006, 34:D322-D326. Gene Onology: Current annotations website [http:// www.geneontology.org/GO.current.annotations.shtml] Gene Ontology: Tools website [http://www.geneontology.org/ GO.tools.shtml] Beisvag V, Lehre PK, Midelfart H, Aass H, Geiran O, Sandvik AK, Laegreid A, Komorowski J, Ellingsen O: Aetiology-specific patterns in end-stage heart failure patients identified by functional

Page 12 of 13 (page number not for citation purposes)

BMC Bioinformatics 2006, 7:470

29. 30.

31. 32. 33.

34. 35. 36. 37. 38.

http://www.biomedcentral.com/1471-2105/7/470

annotation and classification of microarray data. Eur J Heart Fail 2006, 8:381-389. Leisenring W, Alonzo T, Sullivan S: Comparisons of Predictive Values of Binary Medical Diagnostic Tests for Paired Designs. Biometrics 2000, 56:345-351. Günther CC, Langaas M, Lydersen S: Statistical Hyhpothesis Tesing of Association Between Two Lists of Genes for a Given Gene Class. 4-1-2006. Preprint Statistics 1/2006. [http:// www.math.ntnu.no/preprint/statistics/2006/]. Department of Mathematical Sciences, The Norwegian University of Science and Technology Ref Type: Report Agresti A: Categorical Data Analysis 2nd edition. John Wiley & Sons, New York; 2002. Langsrud Ø: Fisher's exact test. 2004 [http://www.matforsk.no/ ola/fisher.htm]. Ref Type: Computer Program Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ, Riss J, Barrett JC, Weinstein JN: GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 2003, 4:R28. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society 1995, 57:289-300. Yekutieli D, Benjamini Y: The Control of the FDR multiple testing under dependency. The Annuals of Statistics 2001, 29:1165-1188. Allison DB, Cui X, Page GP, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 2006, 7:55-65. Al-Shahrour F, az-Uriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 2004, 20:578-580. Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H, et al.: The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003, 19:524-531.

Publish with Bio Med Central and every scientist can read your work free of charge "BioMed Central will be the most significant development for disseminating the results of biomedical researc h in our lifetime." Sir Paul Nurse, Cancer Research UK

Your research papers will be: available free of charge to the entire biomedical community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours — you keep the copyright

BioMedcentral

Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp

Page 13 of 13 (page number not for citation purposes)

ThLV article is reprinted with kind permission from Elsevier, sciencedirect.com

Journal of Reproductive Immunology 74 (2007) 7–14

Fetal growth restriction is associated with reduced FasL expression by decidual cells ˚ Salvesen b,d , Mette Langaas c , Irina P. Eide a,d,∗ , Christina V. Isaksen b,e , Kjell A. c Clara-Cecilie G¨unther , Ann-Charlotte Iversen a , Rigmor Austgulen a a

Department of Cancer Research and Molecular Medicine, DMF, Norwegian University of Science and Technology (NTNU), Medisinsk teknisk forskningssenter, Olav Kyrres gt. 3, N-7489 Trondheim, Norway b Department of Laboratory Medicine, Children’s and Women’s Health, Norwegian University of Science and Technology (NTNU), N-7489 Trondheim, Norway c Department of Mathematical Sciences, Norwegian University of Science and Technology (NTNU), N-7489 Trondheim, Norway d Department of Obstetrics and Gynecology, St. Olavs Hospital, Trondheim, Norway e Department of Pathology, St. Olavs Hospital, Trondheim, Norway Received 1 May 2006; received in revised form 27 November 2006; accepted 28 November 2006

Abstract The Fas-Fas ligand (FasL) system contributes to immune tolerance at the feto-maternal site and has been ascribed a role in implantation and placental development by regulating trophoblast invasion and spiral artery remodelling. In the present study, we have examined FasL expression in decidual tissue from pregnancies with impaired placental development. Women with preeclampsia (PE) and/or fetal growth restriction (FGR) were enrolled as cases (n = 33), and women with normal pregnancies were used as controls (n = 27). Decidua basalis tissue was obtained by vacuum suction of the placental bed after delivery. FasL expression by extravillous trophoblasts (EVTs) and decidual cells (DeCs), together with EVT apoptosis, were assessed by immunohistochemistry. Levels of soluble FasL in maternal serum and apoptosis-related gene expression in decidual tissue were determined. The proportion of FasL-expressing DeCs was high in controls (72.0 ± 10.2%), with a significant reduction among cases (58.1 ± 19.7%; p = 0.002), especially in those with FGR (54.3 ± 19.9%; p < 0.001). EVTs had a lower proportion of FasL expression than DeCs, with a less pronounced reduction in cases compared to controls (10.9 ± 3.9 and 8.3 ± 4.0%, respectively; p = 0.02). Decidual FasL expression correlated with placental growth. The EVT apoptosis rate did not differ between cases and controls (1.1 ± 1.9 and 1.1 ± 1.3%, respectively). These findings indicate a reduction of immune privilege in decidua of PE/FGR pregnancies by reduced FasL expression and that DeCs may have a central role in the Fas-FasL-based feto-maternal immune balance. © 2006 Elsevier Ireland Ltd. All rights reserved. Keywords: FasL; Decidual cells; Extravillous trophoblasts; Fetal growth restriction; Pre-eclampsia

1. Introduction ∗ Corresponding author at: Department of Cancer Research and Molecular Medicine, DMF, Norwegian University of Science and Technology (NTNU), Medisinsk teknisk forskningssenter, Olav Kyrres gt. 3, N-7489 Trondheim, Norway. Tel.: +47 73550290; fax: +47 73598801. E-mail address: [email protected] (I.P. Eide).

Fas ligand (FasL) is a member of the tumour necrosis (TNF) superfamily (Bohana-Kashtan and Civin, 2004). Its expression is confined to activated T lymphocytes, natural killer (NK) cells and cells in immune privileged sites (Medvedev et al., 1997; Bohana-Kashtan

0165-0378/$ – see front matter © 2006 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.jri.2006.11.002

8

I.P. Eide et al. / Journal of Reproductive Immunology 74 (2007) 7–14

and Civin, 2004). In contrast, its receptor Fas (belonging to the TNF receptor superfamily) is expressed by most cell types (Bohana-Kashtan and Civin, 2004). The interaction between Fas-FasL induces apoptosis of the Fas-expressing cell (Bohana-Kashtan and Civin, 2004). The Fas-FasL system has been ascribed a role in immune tolerance at the feto-maternal interface (StraszewskiChavez et al., 2005), and has been considered a potential pathway to prevent excessive infiltration of leukocytes into the decidua (Hunt et al., 1997; Mor et al., 1998; Kauma et al., 1999; Qiu et al., 2005). Both extravillous trophoblasts (EVTs) (Runic et al., 1996; Uckan et al., 1997; Hammer and Dohr, 2000) and decidual cells (DeCs) (Hunt et al., 1997; Makrigiannakis et al., 2003; Kayisli et al., 2003) express FasL. Apoptotic leukocytes have been detected in close proximity to FasL-expressing EVTs in in vivo materials (Mor et al., 1998), and trophoblast-induced FasL-mediated apoptosis of activated lymphocytes has been confirmed in vitro (Coumans et al., 1999). In a recent study, an inverse relationship was reported between FasL-expressing DeCs and maternal leukocytes (Qiu et al., 2005). Based on these data, Qiu et al. (2005) hypothesized that FasLexpressing DeCs may regulate infiltration of maternal Fas-positive leukocytes at the feto-maternal interface. Superficial trophoblast invasion and insufficient spiral artery remodelling characterize pregnancies complicated by pre-eclampsia (PE) and/or fetal growth restriction (FGR) (Redman and Sargent, 2005). The FasFasL system is involved in these processes (Murakoshi et al., 2003; Ashton et al., 2005; Harris et al., 2005) and, in accordance with this, abnormal Fas-FasL expression has been reported in PE/FGR. FasL exists also in a soluble (s) form. The physiological functions of sFasL are not completely understood, but it is assumed to play a role in the regulating functions of the immune system. In women with PE/FGR, reduced leukocyte expression of FasL has been reported (Kuntz et al., 2001), whereas observations regarding sFasL are conflicting; some report elevated levels in PE/FGR (Kuntz et al., 2001; Hu et al., 2005) and others find sFasL concentrations corresponding to those in normal pregnant women (Laskowska et al., 2006). Similarly, observations at the local site are diverging—reduced (Allaire et al., 2000; Yue et al., 2005), unchanged (Hu et al., 2005) and increased (Koenig and Chegini, 2000) FasL expression have been reported in PE/FGR pregnancies. The closest interaction between fetal and maternal cells occurs in decidua and thus this tissue will probably provide the most sensitive detection of changes in FasL expression, disturbing immune privilege and physiological implantation processes. Thus, in this study, we

have used decidual tissue to examine FasL expression and apoptosis in pregnancies with PE and/or FGR. Concomitantly, maternal sFasL levels were assessed to learn whether disease is associated with changes. 2. Materials and methods 2.1. Study groups Since we aimed to study FasL expression in impaired placental development, cases with suspected placental insufficiency (with PE and/or FGR) were recruited. PE was defined as persistent hypertension (blood pressure of ≥140/90 mmHg) plus proteinuria (≥0.3 g/24 h or ≥2+ according to a dipstick test), developing after 20 weeks of pregnancy (Gifford et al., 2000). FGR implied birth weight ≤2 standard deviations (S.D.) below the expected birth weight as related to gestational age (GA) and sex (Marsal et al., 1996). Due to tissue sampling procedures, only cases delivered by caesarean section (CS) could be included. Healthy women with normal pregnancies undergoing CS for various reasons considered irrelevant to the aim of this study (e.g. breech presentation and previous CS) served as controls. Materials were collected at St. Olavs Hospital (the University Hospital in Trondheim). The study was approved by the Regional Committee for Medical Research Ethics. Informed consent was obtained from all participants. 2.2. Sample collection and preparation Decidual tissue was obtained by vacuum suction of the placental bed after the placenta was delivered (Staff et al., 1999; Harsem et al., 2004). Some tissue was snapfrozen in liquid nitrogen and stored at −80 ◦ C, some was fixed in 10% neutral buffered formalin and paraffinembedded. Tissue for gene expression was immediately submerged in a RNA stabilization solution (RNAlater, Ambion, Huntington, UK), incubated at 4 ◦ C overnight and stored at −80 ◦ C. Peripheral maternal blood was collected, centrifuged and serum was stored at −80 ◦ C. 2.3. Placental growth and histology Placental weight and placental weight ratio (PWR: observed placental weight/expected placental weight) were registered. Expected placental weight according to GA was obtained from published standards (Benirschke and Kaufmann, 2000). All placentas were examined by one pathologist in accordance with established clinical routines.

I.P. Eide et al. / Journal of Reproductive Immunology 74 (2007) 7–14

9

2.4. Expression of FasL by DeCs and EVTs

2.5. Soluble FasL in maternal blood

The expression of FasL in decidual tissue was assessed by a double immunofluorescence technique using an antibody against FasL in combination with antibodies to detect decidual cells (anti-prolactin) or trophoblasts (anti-cytokeratin 7). A polyclonal rabbit antibody (pAb) against FasL (Q-20; Santa Cruz Biotechnology, Santa Cruz, CA; diluted 1:20), with highly ranked specificity (Hammer and Dohr, 2000), was used for FasL detection. Tissue was cut in three serial sections at 5 ␮m on a freeze microtome. One section was stained with haematoxylin and eosin (HE) and used for quality control. Only specimens containing both DeCs and EVTs, with a low contamination of blood and/or placental tissue, were included for further studies. The two remaining cryosections were fixed in acetone (10 min at 4 ◦ C). Bovine serum albumin (2%) was added for 10 min to inhibit non-specific binding and, subsequently, pAb against FasL (Q-20) was added for 1 h either in combination with a mouse monoclonal antibody (mAb) against cytokeratin 7 (CK7, clone OV-TL 12/30; DakoCytomation, Glostrup, Denmark; diluted 1:300) or mouse mAb against human prolactin (clone PRL02; Neo Markers, Fremont, CA, USA; diluted 1:2). Sections were incubated with secondary antibodies (TRITC-conjugated swine anti-rabbit pAb (code R0156; DakoCytomation; diluted 1:30) and FITC-conjugated goat anti-mouse pAb (code F0479; DakoCytomation; diluted 1:10) for 30 min. All incubations were performed at room temperature in a dark moist chamber, and phosphate-buffered saline (PBS) was used for washing between all incubation periods. Sections were counterstained with 1000 ng/mL 4,6-diamidino-2-phenylindole (DAPI 1, Abbott Laboratories AS, Gentofte, Denmark) and examined with a fluorescent microscope (Nicon Eclipse E600) at ×600 magnification using CytoVision 3.6 software (Applied Imaging, Newcastle upon Tyne, UK). Placental tissue (with FasL-expressing villous trophoblasts, Hofbauer cells and fetal blood vessels) (Hammer and Dohr, 2000) were used as positive controls, whereas sections incubated with serum from non-immunized rabbits (Code X0936; DakoCytomation; diluted 1:20) served as negative controls in the FasL experiments. Glands in decidua were used as internal positive control for mAb against CK7, and sections of pituitary glands were included as a positive control for anti-prolactin (not shown). The ratio of FasL-expressing DeCs and EVTs was calculated as the percentage of FasL-positive DeCs and EVTs in each section (number of FasL-positive prolactin or CK7-positive cells among 100 of the corresponding cell type).

Levels of sFasL in maternal serum were assessed by a sandwich enzyme-linked immunoassay (ELISA, R&D Systems, Abingdon, UK), according to the manufacturer’s standard procedure. Samples were tested in duplicates (50 ␮L each). The detection level of the assay was 2.7 pg/mL. 2.6. Apoptosis-related gene expression in decidua The expression of apoptosis-relevant genes in decidual tissue was studied by Affymetrix GeneChip analysis. Total RNA was purified from decidual tissue using a RNeasy Midi Kit according to the manufacturer’s instructions (Qiagen, Crawley, UK), and RNA concentration and purity were measured using NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Rockland, DE, USA). Double stranded cDNA was synthesized using 5 ␮g total RNA. Synthesis of cDNA (cDNA synthesis kit; Invitrogen, Carlsbad, CA, USA) and biotin-labeled cRNA (BioArrayTM HighYieldTM kit; Enzo Life Sciences, NY, USA), cRNA fragmentation, target hybridization, washing, staining and scanning were performed using standard Affymetrix protocols (Affymetrix, Santa Clara, USA). Fragmented biotinylated cRNA targets were hybridized to Human Genome Focus Arrays (Affymetrix) using 10 ␮g fragmented biotinylated cRNA. The arrays were scanned using a Agilent GeneArray Scanner controlled by MAS 5.0. Gene expression data were analyzed with a Gene Map Annotator and Pathway Profiler (GenMAPP, Gladstone Institute, CA, www.GenMAPP.org). 2.7. Trophoblast apoptosis Apoptosis of EVTs was assessed in two serial sections of formalin-fixed decidual tissues by counting mAb M30 (clone M30; Roche, Mannheim, Germany; diluted 1:50) (Kadyrov et al., 2001) positive cells in one section and relating numbers to mAb CK7 (clone OV-TL 12/30; Dako Cytomation; diluted 1:1000) positive cells in section two. The staining was performed in an automated slide stainer (DakoCytomation) after deparaffination, rehydration and heat-induced antigen retrieval. Endogenous peroxidase activity was blocked, and primary antibodies added. Incubation time was 30 and 40 min for anti-CK7 and M30, respectively. Secondly, a peroxidase-conjugated polymer with antibodies against rabbit/mouse (ChemMate Dako Envision Detection Kit Peroxidase/DAB) (DakoCytomation) was used according to the manufacturer’s standard procedure.

10

I.P. Eide et al. / Journal of Reproductive Immunology 74 (2007) 7–14

Antibody-diluent without antibody was used as a control for non-specific staining, and sections from colon adenocarcinoma were used as positive controls for M30. The sections were counterstained in haematoxylin for 1 min and analyzed at ×200 magnification in a light microscope. The trophoblast apoptosis ratio in decidual tissues was calculated as M30-positive cells among CK7positive cells. Cell counting was performed in areas that: (1) contained at least 100 trophoblasts, and (2) were easily reproduced from one section to the other. Areas that fulfilled these criteria were demarcated with Indian ink in paired serial sections. One case with 35 EVTs in the whole section was excluded from further apoptosis analyses due to statistical considerations. Cell counting was performed by two independent individuals, blinded for the case/control status at ×200 magnification, using a 13 mm × 13 mm reticule. The mean count was used for statistical analysis.

Table 1 Clinical information Cases (n = 33) Maternal age (years) Gestational age (weeks) at delivery Systolic blood pressure (mmHg) Diastolic blood pressure (mmHg) Birth weight (g) BWRa Placental weight (g) PWRb

Controls (n = 27)

30 ± 5 32 ± 4*

31 ± 5 39 ± 1

148 ± 20*

121 ± 14

93 ± 12*

71 ± 13

1525 0.71 323 0.96

± ± ± ±

668* 0.16* 116* 0.25*

3662 1.10 644 1.45

± ± ± ±

501 0.15 144 0.33

Values are expressed as mean ± S.D., unless stated otherwise. All case groups were compared to controls. a BWR: birth weight ratio (observed birth weight/expected birth weight). b PWR: placental weight ratio (observed placental weight/expected placental weight). * p < 0.001.

2.8. Statistical analysis 3.1. Placental weight and histology The clinical data and the results of cell counting were expressed as the mean ± the corresponding S.D. The Mann–Whitney test was used for comparison between groups. Spearman’s ranked correlation test was used to examine correlation between FasL expression, EVT apoptosis, placental weight and clinical data. p < 0.05 was considered significant. SPSS Version 12 was used for all statistical analyses. Statistical analysis of the Affymetrix GeneChip arrays data was based on summary expression measures for each gene probe set. These summary measures were computed from quantile-normalized and logtransformed perfect match values for each probe pair using the robust multiarray average method (Irizarry et al., 2003). Tests for significant differential expression for cases versus controls were performed using moderated T-tests (Smyth, 2004). To account for multiple testing, we calculated adjusted p-values controlling the false discovery rate and inserted the estimated value of the proportion of non-differentially expressed genes (Langaas et al., 2005). All analyses were performed using the R statistical package and the Affy & Limma sofware packages from the Bioconductor project (Gentleman et al., 2004). 3. Results Sixty women were included in the study (33 cases and 27 controls). Cases and controls differed significantly with respect to GA, birth weight and placental weight (Table 1).

The PWR was significantly lower among cases (0.96 ± 0.25) than the control group (1.45 ± 0.33) (p < 0.001). Reduced PWR was observed in pregnancies with FGR (p < 0.001), whereas pregnancies with isolated PE had a PWR comparable with controls. Abnormal morphological findings in the placenta were more frequent among cases than controls; 13 (39.4%) cases demonstrated advanced villous maturation compared with 3 (11.1%) controls, and 4 (12.1%) cases had multiple infarcts compared with none in the control group. 3.2. Expression of FasL by DeCs and EVTs FasL staining of DeCs demonstrated a distinct punctuate staining pattern in the cytoplasm (Fig. 1A), with a less intense staining of EVTs (Fig. 1C). The proportion of FasL-expressing DeCs was reduced among cases (58.1 ± 19.7%) compared with controls (72.0 ± 10.2%) (p = 0.002) (Fig. 1B). The reduction was most pronounced in pregnancies complicated with FGR (54.3 ± 19.9%) (p = 0.001), whereas isolated PE did not differ significantly from controls. A positive correlation was observed between the proportion of FasL expression by DeCs and PWR (p < 0.001). EVTs demonstrated a much lower proportion of FasLexpressing cells than DeCs, both in cases and controls (Fig. 1D). As with DeCs, FasL expression by EVTs was reduced among cases (8.3 ± 4.0%) compared with

I.P. Eide et al. / Journal of Reproductive Immunology 74 (2007) 7–14

11

Fig. 1. Double immunofluorescence and single immunohistochemical staining on cryosections (A and C) and formalin-fixed, paraffin-embedded section (E) of decidua basalis obtained from women at delivery. The immunofluorescence staining was performed on two serial sections: (A) demonstrates FasL staining (red signal) of decidual cells (DeCs) with anti-prolactin (green signal) as DeC marker, and (C) illustrates FasL expression (red signal) in extravillous trophoblasts (EVTs) with anti-CK7 (green signal) as EVT marker. Two FasL-positive DeCs are shown in (A), whereas both FasL-positive ( ) and FasL negative ( ) trophoblasts are presented in (B). DeCs demonstrated a distinct and punctuate FasL staining in ) (A), whereas the FasL staining of EVTs was less intense ( ) (B). Trophoblast apoptosis was detected by M30 monoclonal cytoplasm ( antibody (brown staining) (E). The proportions of FasL-positive DeCs (B), FasL-positive EVTs (D) and apoptotic EVTs (F) were calculated. Results are displayed as mean values with the corresponding S.D.

controls (10.9 ± 3.9%) (p = 0.02) (Fig. 1C). A positive correlation was observed between the proportion of FasL-positive EVTs and PWR (p = 0.045). 3.3. Soluble FasL levels in maternal serum Levels of sFasL in maternal serum did not differ between cases (n = 14) and controls (n = 12); 77.6 ± 16.4 and 81.3 ± 13.5 pg/mL, respectively. No correlation was observed between decidual FasL expression and sFasL concentrations.

3.4. Apoptosis-related gene expression in decidua Gene expression analyses in decidual tissue were performed on 37 of the specimens collected (18 cases and 19 controls). Due to operating costs, the number of samples subjected to gene expression analyses was restricted. The pregnancies included were randomly selected, and the gene expression analysis of cases and controls did not differ from the total case and control group with respect to GA, birth weight and blood pressure.

12

I.P. Eide et al. / Journal of Reproductive Immunology 74 (2007) 7–14

In total, 82 genes involved in main apoptotic pathways (both activation and inhibition, including genes involved in the Fas-FasL pathway) were studied. Expression of apoptotic pathway genes did not differ between groups. Different expression of genes known to change in PE was verified (Gallery et al., 1999), suggesting that the experimental setting had the statistical power required to detect differences between study groups. 3.5. Trophoblast apoptosis The apoptosis rate was low both in cases and in controls, and the proportion of apoptotic EVTs (M30+/CK7+) did not differ between groups (1.1 ± 1.9 and 1.1 ± 1.3%, respectively) (Fig. 1E and F). No correlation was observed between the proportion of apoptotic EVTs and FasL expression by DeCs and by EVTs. 4. Discussion In the present investigation we have found that FasL was expressed by both DeCs and EVTs. This observation is in accordance with previous reports (Hammer and Dohr, 1999; Kayisli et al., 2003). The physiological proportion of FasL-expressing DeCs was much higher (72%) than that of EVTs (11%). Both were reduced in association with placental disease (i.e. PE and/or FGR), but the reduction in FasL expressed by DeCs was most pronounced. Our findings are in agreement with earlier observations indicating that decidual cells are involved in immune interactions at the feto-maternal site, with possible consequences for implantation and placental development (Olivares et al., 1997; Ruiz et al., 1997; Kitaya et al., 2000; Dimitriadis et al., 2002). Correlations between PWR and the proportion of FasL-positive DeCs (p < 0.001) and EVTs (p = 0.045) found in the present study suggest a possible association between decidual FasL expression and placental development and growth. Trophoblast expression of FasL has been suggested as a protective mechanism against maternal leukocyteinduced apoptosis (Hammer and Dohr, 2000). Reduced, enhanced and unchanged FasL expression have been reported in PE/FGR (Koenig and Chegini, 2000; Allaire et al., 2000; Hu et al., 2005; Yue et al., 2005). All these studies were, however, performed on placental tissue/villous trophoblasts. Only one previous study has focused on changes in decidual FasL expression in association with disease (Koenig and Chegini, 2000). Koenig and Chegini (2000) found raised expression of Fas-L in cases, but their diagnosis is pregnancy-induced hyper-

tension and the clinical information given is sparse. Accordingly, we are not sure whether the cases included in the study of Koenig and Chegini (2000) may be compared to the cases with quite severe pregnancy complications included in the present study. Since PE/FGR pregnancies are commonly delivered preterm, sampling of GA-matched healthy controls is difficult and the mean GA among cases was 6 weeks shorter than controls (Table 1). However, since FasL expression on villous trophoblasts has been reported to decrease towards term (Balkundi et al., 2000), it is likely that the observed differences in FasL expression are underestimated due to the fact that GA-matching between groups cannot be carried out. Some investigators have reported increased EVT apoptosis in PE (Genbacev et al., 1999), whereas others found reduced apoptosis of interstitial EVTs and increased apoptosis of endovascular EVTs in PE and FGR (Kadyrov et al., 2003, 2006). No difference and generally low proportion of apoptotic EVTs (both interstitial and endovascular) has been shown in the present study (1.1% in both cases and controls). Contrasting reports on EVT apoptosis can be partly explained by use of different methods and reflect difficulties in apoptosis assessment. As decidual samples were taken from women with active disease, the presence of EVT apoptosis may have preceded the time of analyses; thus, the possibility of altered EVT apoptosis in placental insufficiency cannot be excluded. Some investigators have reported raised maternal sFasL concentrations in PE (Kuntz et al., 2001; Hu et al., 2005), whereas we and others (Laskowska et al., 2006) found similar sFasL levels in cases and controls. Discrepancies between studies may be due to differences in diagnostic criteria and small study groups. The fact that decidual tissue from cases with placental disease demonstrated altered decidual FasL expression, whereas normal sFasL serum levels remained unchanged, makes it tempting to assume that the local immunological environment is more important for successful pregnancy than systemic regulatory events (Saito and Makino, 2005). The specific role of DeCs, compared to that of EVTs, in FasL-based immune privilege (Hunt et al., 1997; Kauma et al., 1999) remains to be elucidated, but some murine experiments are indicated. Hunt et al. (1997) found decidual necrosis and poor pregnancy outcome in mice lacking functional FasL on both fetal trophoblasts and maternal decidual cells. Thus, a normal pregnancy outcome is obtained if non-functional fetal FasL is combined with functional FasL expressed by maternal decidual cells (Rogers et al., 1998). This is supported by observations made in the present study that DeCs seem to

I.P. Eide et al. / Journal of Reproductive Immunology 74 (2007) 7–14

be a fundamental contributor to appropriate FasL-based balance between mother and fetus in decidual tissues. Acknowledgements Toril Rolfseng and Borgny Ytterhus, Department of Laboratory Medicine, Children’s and Women’s Health, Norwegian University of Science and Technology, Trondheim, Norway, performed the double immunofluorescence staining. Affymetrix GeneChip Analysis was supported by Norwegian Microarray Consortium. The study was supported by the Norwegian University of Science and Technology and St. Olavs Hospital HF. References Allaire, A.D., Ballenger, K.A., Wells, S.R., McMahon, M.J., Lessey, B.A., 2000. Placental apoptosis in preeclampsia. Obstet. Gynecol. 96, 271–276. Ashton, S.V., Whitley, G.S., Dash, P.R., Wareing, M., Crocker, I.P., Baker, P.N., Cartwright, J.E., 2005. Uterine spiral artery remodeling involves endothelial apoptosis induced by extravillous trophoblasts through Fas/FasL interactions. Arterioscler. Thromb. Vasc. Biol. 25, 102–108. Balkundi, D.R., Hanna, N., Hileb, M., Dougherty, J., Sharma, S., 2000. Labor-associated changes in Fas ligand expression and function in human placenta. Pediatr. Res. 47, 301–308. Benirschke, K., Kaufmann, P., 2000. Pathology of the Human Placenta. Springer, New York, p. 921. Bohana-Kashtan, O., Civin, C.I., 2004. Fas ligand as a tool for immunosuppression and generation of immune tolerance. Stem Cells 22, 908–924. Coumans, B., Thellin, O., Zorzi, W., Melot, F., Bougoussa, M., Melen, L., Zorzi, D., Hennen, G., Igout, A., Heinen, E., 1999. Lymphoid cell apoptosis induced by trophoblastic cells: a model of active foeto-placental tolerance. J. Immunol. Methods 224, 185–196. Dimitriadis, E., Robb, L., Salamonsen, L.A., 2002. Interleukin-11 advances progesterone-induced decidualization of human endometrial stromal cells. Mol. Hum. Reprod. 8, 636–643. Gallery, E.D., Campbell, S., Arkell, J., Nguyen, M., Jackson, C.J., 1999. Preeclamptic decidual microvascular endothelial cells express lower levels of matrix metalloproteinase-1 than normals. Microvasc. Res. 57, 340–346. Genbacev, O., Difederico, E., Mcmaster, M., Fisher, S.J., 1999. Invasive cytotrophoblast apoptosis in pre-eclampsia. Hum. Reprod. 14 (Suppl. 2), 59–66. Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A.J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J.Y., Zhang, J., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80. Gifford, R.W., August, P.A., Cunningham, G., Green, L.A., Lindheimer, M.D., Mcnellis, D., Roberts, J.M., Sibai, B.M., Taler, S.J., 2000. Report of the National High Blood Pressure Education Program Working Group on high blood pressure in pregnancy. Am. J. Obstet. Gynecol. 183, S1–S22.

13

Hammer, A., Dohr, G., 1999. Apoptotic nuclei within the uterine decidua of first trimester pregnancy arise from CD45 positive leukocytes. Am. J. Reprod. Immunol. 42, 88–94. Hammer, A., Dohr, G., 2000. Expression of Fas-ligand in first trimester and term human placental villi. J. Reprod. Immunol. 46, 83–90. Harris, L.K., Baker, P.N., Keogh, R.J., Cartwright, J.E., Whitley, G.S., Aplin, J.D., 2005. Trophoblast-induced smooth muscle cell apoptosis during spiral artery remodelling involves Fas/Fas ligand interactions. Placenta 26, A77. Harsem, N.K., Staff, A.C., He, L., Roald, B., 2004. The decidual suction method: a new way of collecting decidual tissue for functional and morphological studies. Acta Obstet. Gynecol. Scand. 83, 724–730. Hu, W.S., Wang, Z.P., Dong, M.Y., Wang, H.Z., 2005. Expression of Fas and FasL in serum and placenta of preeclamptic pregnancy and its significance. Zhejiang Da Xue Xue Bao Yi Xue Ban 34, 499–502. Hunt, J.S., Vassmer, D., Ferguson, T.A., Miller, L., 1997. Fas ligand is positioned in mouse uterus and placenta to prevent trafficking of activated leukocytes between the mother and the conceptus. J. Immunol. 158, 4122–4128. Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., Speed, T.P., 2003. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31, e15. Kadyrov, M., Kaufmann, P., Huppertz, B., 2001. Expression of a cytokeratin 18 neo-epitope is a specific marker for trophoblast apoptosis in human placenta. Placenta 22, 44–48. Kadyrov, M., Kingdom, J.C., Huppertz, B., 2006. Divergent trophoblast invasion and apoptosis in placental bed spiral arteries from pregnancies complicated by maternal anemia and earlyonset preeclampsia/intrauterine growth restriction. Am. J. Obstet. Gynecol. 194, 557–563. Kadyrov, M., Schmitz, C., Black, S., Kaufmann, P., Huppertz, B., 2003. Pre-eclampsia and maternal anaemia display reduced apoptosis and opposite invasive phenotypes of extravillous trophoblast. Placenta 24, 540–548. Kauma, S.W., Huff, T.F., Hayes, N., Nilkaeo, A., 1999. Placental Fas ligand expression is a mechanism for maternal immune tolerance to the fetus. J. Clin. Endocrinol. Metab. 84, 2188–2194. Kayisli, U.A., Selam, B., Guzeloglu-Kayisli, O., Demir, R., Arici, A., 2003. Human chorionic gonadotropin contributes to maternal immunotolerance and endometrial apoptosis by regulating Fas-Fas ligand system. J. Immunol. 171, 2305–2313. Kitaya, K., Yasuda, J., Yagi, I., Tada, Y., Fushiki, S., Honjo, H., 2000. IL-15 expression at human endometrium and decidua. Biol. Reprod. 63, 683–687. Koenig, J.M., Chegini, N., 2000. Enhanced expression of Fasassociated proteins in decidual and trophoblastic tissues in pregnancy-induced hypertension. Am. J. Reprod. Immunol. 44, 347–349. Kuntz, T.B., Christensen, R.D., Stegner, J., Duff, P., Koenig, J.M., 2001. Fas and Fas ligand expression in maternal blood and in umbilical cord blood in preeclampsia. Pediatr. Res. 50, 743–749. Langaas, M., Ferkingstad, E., Lindqvist, B.H., 2005. Estimating the proportion of true null hypotheses, with application to DNA microarray data. J. R. Stat. Soc., Ser. B 67, 555–572. Laskowska, M., Laskowska, K., Leszczynska-Gorzelak, B., Oleszczuk, J., 2006. Evaluation of the maternal and umbilical vein serum sFas/sFasL system in pregnancies complicated by preeclampsia with intrauterine growth retardation. Eur. J. Obstet. Gynecol. Reprod. Biol. 126, 155–159.

14

I.P. Eide et al. / Journal of Reproductive Immunology 74 (2007) 7–14

Makrigiannakis, A., Zoumakis, E., Kalantaridou, S., Mitsiades, N., Margioris, A., Chrousos, G.P., Gravanis, A., 2003. Corticotropinreleasing hormone (CRH) and immunotolerance of the fetus. Biochem. Pharmacol. 65, 917–921. Marsal, K., Persson, P.H., Larsen, T., Lilja, H., Selbing, A., Sultan, B., 1996. Intrauterine growth curves based on ultrasonically estimated foetal weights. Acta Paediatr. 85, 843–848. Medvedev, A.E., Johnsen, A.C., Haux, J., Steinkjer, B., Egeberg, K., Lynch, D.H., Sundan, A., Espevik, T., 1997. Regulation of Fas and Fas-ligand expression in NK cells by cytokines and the involvement of Fas-ligand in NK/LAK cell-mediated cytotoxicity. Cytokine 9, 394–404. Mor, G., Gutierrez, L.S., Eliza, M., Kahyaoglu, F., Arici, A., 1998. Fas-fas ligand system-induced apoptosis in human placenta and gestational trophoblastic disease. Am. J. Reprod. Immunol. 40, 89–94. Murakoshi, H., Matsuo, H., Laoag-Fernandez, J.B., Samoto, T., Maruo, T., 2003. Expression of Fas/Fas-ligand, Bcl-2 protein and apoptosis in extravillous trophoblast along invasion to the decidua in human term placenta. Endocr. J. 50, 199–207. Olivares, E.G., Montes, M.J., Oliver, C., Galindo, J.A., Ruiz, C., 1997. Cultured human decidual stromal cells express B7-1 (CD80) and B7-2 (CD86) and stimulate allogeneic T cells. Biol. Reprod. 57, 609–615. Qiu, Q., Yang, M., Tsang, B.K., Gruslin, A., 2005. Fas ligand expression by maternal decidual cells is negatively correlated with the abundance of leukocytes present at the maternal–fetal interface. J. Reprod. Immunol. 65, 121–132. Redman, C.W., Sargent, I.L., 2005. Latest advances in understanding preeclampsia. Science 308, 1592–1594. Rogers, A.M., Boime, I., Connolly, J., Cook, J.R., Russell, J.H., 1998. Maternal-fetal tolerance is maintained despite transgene-driven

trophoblast expression of MHC class I, and defects in Fas and its ligand. Eur. J. Immunol. 28, 3479–3487. Ruiz, C., Montes, M.J., Abadia-Molina, A.C., Olivares, E.G., 1997. Phagocytosis by fresh and cultured human decidual stromal cells: opposite effects of interleukin-1 alpha and progesterone. J. Reprod. Immunol. 33, 15–26. Runic, R., Lockwood, C.J., Ma, Y., Dipasquale, B., Guller, S., 1996. Expression of Fas ligand by human cytotrophoblasts: implications in placentation and fetal survival. J. Clin. Endocrinol. Metab. 81, 3119–3122. Saito, S., Makino, T., 2005. IX International Congress on Reproductive Immunology, 11–15 October 2004, Hakone, Japan. J. Reprod. Immunol. 68, 121–126. Smyth, G.K., 2004. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3, Article 3 http://www.bepress.com/ sagmb/vol3/iss1/art3. Staff, A.C., Ranheim, T., Khoury, J., Henriksen, T., 1999. Increased contents of phospholipids, cholesterol, and lipid peroxides in decidua basalis in women with preeclampsia. Am. J. Obstet. Gynecol. 180, 587–592. Straszewski-Chavez, S.L., Abrahams, V.M., Mor, G., 2005. The role of apoptosis in the regulation of trophoblast survival and differentiation during pregnancy. Endocr. Rev. 26, 877–897. Uckan, D., Steele, A., Cherry, Wang, B.Y., Chamizo, W., Koutsonikolis, A., Gilbert-Barness, E., Good, R.A., 1997. Trophoblasts express Fas ligand: a proposed mechanism for immune privilege in placenta and maternal invasion. Mol. Hum. Reprod. 3, 655–662. Yue, X.Y., Zhang, X., Cui, S.H., Wang, X.Q., 2005. Expression of Fas antigen and ligand, placental growth factor in placenta of pregnant women with pre-eclampsia. Zhonghua Fu Chan Ke Za Zhi 40, 320–322.

C OMPARISON OF PREDICTIVE VALUES FROM TWO DIAGNOSTIC TESTS IN LARGE SAMPLES C LARA -C ECILIE G ÜNTHER *, Ø YVIND BAKKE *, S TIAN LYDERSEN# AND M ETTE L ANGAAS * * Department of Mathematical Sciences. # Department of Cancer Research and Molecular Medicine. The Norwegian University of Science and Technology, NO-7491 Trondheim, Norway. D ECEMBER 2008

S UMMARY Within the field of diagnostic tests, the positive predictive value is the probability of being diseased given that the diagnostic test is positive. Two diagnostic tests are applied to each subject in a study and in this report we look at statistical hypothesis tests for large samples to compare the positive predictive values of the two diagnostic tests. We propose a likelihood ratio test and a restricted difference test, and we perform simulation experiments to compare these tests with existing tests. For comparing the negative predictive values of the diagnostic tests, i.e. the probability of not being diseased given that the test is negative, we propose negative predictive versions of the same tests. The simulation experiments show that the restricted difference test performs well in terms of test size.

1

I NTRODUCTION

Diagnostic tests are used in medicine to e.g. detect diseases and give prognoses. Diagnostic tests can for example be based on blood samples, X-ray scans, mammography, ultrasound or computed tomography (CT). Mammography is used for detecting breast cancer, a blood sample may show if an individual has an infection, fractures may be detected from X-ray images, gallstones in the gallbladder can be found using ultrasound, and CT scans are useful for identifying tumours in the liver. A diagnostic test can have several outcomes or the outcome may be continuous, but it can often be dichotomized in terms of presence or absence of a disease and we will only consider diagnostic tests for which the disease status is binary. When evaluating the performance of diagnostic tests, the sensitivity, specificity and positive and negative predictive values are the common accuracy measures. The sensitivity and specificity are probabilities of the test outcomes given the disease status. The sensitivity is the probability of a positive test outcome given that the disease is present and the specificity is the probability of a negative outcome given no disease. These measures tell us the degree to which the test reflects the true disease status. The predictive values are probabilities of disease given the test result. The positive predictive value (PPV) is the probability that a subject who has a positive test outcome has the disease and the negative 1

predictive value (NPV) is the probability that a subject who has a negative test outcome does not have the disease. The predictive values give information about the prediction capabilities of the test. For a perfect test both the PPV and NPV are 1, the test result will then give the true disease status for each subject. When there are several diagnostic tests available for the same disease, we are interested in knowing which test is the best to use, but depending on what we mean by best, there are different methods available. If we want to find the best test regarding the ability to give a correct test outcome given the disease status then e.g. McNemar’s test, see Alan Agresti (2002), can be used for comparing the sensitivity or specificity of two tests evaluated on the same subjects. A test that has a high sensitivity and specificity will be most likely to give the patient the correct test result. However, for the patient it is utterly important to be correctly diagnosed and thereby getting the right treatment. We need to take into account the prevalence of the disease. If the prevalence is low, the probability that the patient does have the disease when the test result is positive, will be small even if the sensitivity of the applied test is high. Therefore, comparing the positive or negative predictive values is often more relevant in clinical practise as discussed by Guggenmoos-Holzmann and van Houwelingen (2000). In the remainder of this work, we wish to test if the positive or negative predictive values of two diagnostic tests are equal. In this report we apply existing tests by Leisenring, Alonzo and Pepe (2000) and Wang, Davis and Soong (2006), we propose a likelihood ratio test, and suggest improvements for some of the already existing tests in the large sample case. In Section 2 we describe the model and the structure of the data and define the predictive values. The null hypothesis, along with our proposed methods and already existing methods are presented in Section 3. A simulation study is conducted to compare the methods in Section 4. In Section 5 the methods are applied to data from the literature. We also present an alternative model and test statistic for the likelihood ratio test in Section 6. The results are summarised in the conclusions in Section 7.

2

M ODEL

AND DATA

Next we define the random variables and the model used to describe the situation when comparing the predictive values.

2.1

D EFINITIONS

Two tests, test A and test B, are evaluated on each subject in a study. Each test can have a positive or negative outcome, i.e. indicating whether the subject has the disease under study or not. The true disease status for each subject is assumed to be known. For each subject, we define three events: • D: The subject has the disease. • A: Test A is positive. • B: Test B is positive. Let D∗ , A∗ and B ∗ denote the complementary events. The situation can then be illustrated by a Venn diagram as in Figure 1. There are eight mutually exclusive events and we define the random variable 2

Ni , i = 1, ..., 8, to be the number of times event i occurs. In total there are N = N1 + . . . + N8 subjects in the study. Table 1 gives an overview of the notation for the eight random variables in terms of the events A, B, D and their complements. Notation N1 N2 N3 N4 N5 N6 N7 N8

Alternative notation NA∩B∩D∗ NA∩B ∗ ∩D∗ NA∗ ∩B∩D∗ NA∗ ∩B ∗ ∩D∗ NA∩B∩D NA∩B ∗ ∩D NA∗ ∩B∩D NA∗ ∩B ∗ ∩D

Explanation number of non-diseased subjects with positive tests A and B number of non-diseased subjects with positive test A and negative test B number of non-diseased subjects with negative test A and positive test B number of non-diseased subjects with negative tests A and B number of diseased subjects with positive tests A and B number of diseased subjects with positive test A and negative test B number of diseased subjects with negative test A and positive test B number of diseased subjects with negative tests A and B

TABLE 1: Notation for the random variables defined by the events A, B and D and their complements. N

B

A N1

N2

N3 N5 N7

N6

N4

N8

D

F IGURE 1: Venn diagram for the events D, A and B showing which events the random variables N1 , ..., N8 correspond to. To each of the eight mutually exclusive events there corresponds an unknown probability pi ,  i = 1, . . . , 8, where 8i=1 pi = 1, which is the probability that event i occurs for a randomly chosen subject. The positive predictive values of test A and test B can be expressed in terms of these probabilities and are given as PPVA = P (D|A) =

p5 + p6 P (D ∩ A) = P (A) p1 + p2 + p5 + p6

PPVB = P (D|B) =

p5 + p7 P (D ∩ B) . = P (B) p1 + p3 + p5 + p7

and

3

Similarly, the negative predictive values of test A and B are NPVA = P (D ∗ |A∗ ) =

p3 + p4 P (D∗ ∩ A∗ ) = P (A∗ ) p3 + p4 + p7 + p8

NPVB = P (D∗ |B ∗ ) =

p2 + p4 P (D∗ ∩ B ∗ ) = . P (B ∗ ) p2 + p4 + p6 + p8

and

The predictive values are dependent on the prevalence of the disease, P (D), which is the probability that a randomly chosen subject has the disease. For the positive predictive value, PPVA = P (D|A) =

P (A|D) · P (D) P (D ∩ A) = , P (A) P (A|D) · P (D) + (1 − P (A∗ |D∗ )) · (1 − P (D))

where P (A|D) is the sensitivity and P (A∗ |D ∗ ) is the specificity of test A. When P (A) = P (B) testing if PPVA = PPVB is equivalent to testing if P (A|D) = P (B|D), i.e. testing whether the sensitivities of the two tests are equal. We assume that the prevalence among the subjects in the study is the same as the prevalence in the population, and this can be achieved with a cohort study in which the subjects are randomly selected.

2.2

T HE MULTINOMIAL MODEL

Given the total number of subjects N in the study, the random variables N1 , N2 , ..., N8 can be seen to 8be multinomially distributed with parameters p = (p1 , p2 , p3 , p4 , p5 , p6 , p7 , p8 ) and N , where i=1 pi = 1. The joint probability distribution of N1 , N2 , ..., N8 is   8 8   p i ni . (Ni = ni ) = N ! P ni ! i=1

The expectation of Ni is

i=1

E(Ni ) = μi = N pi

for i = 1, ..., 8, and the variance is Var(Ni ) = N pi (1 − pi ). The covariance between Ni and Nj is Cov(Ni , Nj ) = −N pi pj for i = j. This leads to the covariance matrix Σ = Cov(N ) = N (Diag(p) − pT p), for the multinomial distribution, Johnson, Kotz and Balakrishan (1997). The general unrestricted maximum likelihood estimator of pi is pˆi = ni /N for i = 1, ..., 8. 4

(1)

2.3

DATA

For a number of subjects under study, we observe for each i = 1, ..., 8, the number of times event i occurs among the N subjects, ni . Table 2 shows the observed data in a 23 contingency table. In the following, let n = (n1 , n2 , n3 , n4 , n5 , n6 , n7 , n8 ) be the vector of the observed data. Using the unrestricted maximum likelihood estimators for p, we can then estimate the positive and negative predictive values of test A and B as follows: A = PPV

n5 + n6 , n1 + n2 + n5 + n6

B = PPV

n5 + n7 n1 + n3 + n5 + n7

A= NPV

n3 + n4 , n3 + n4 + n7 + n8

B= NPV

n2 + n4 . n2 + n4 + n6 + n8

Test A

+ −

Subjects without disease Test B + − n1 n2 n3 n4

Subjects with disease Test B + − n5 n6 n7 n8

TABLE 2: Observed data n1 , ..., n8 presented in a 23 contingency table.

3

M ETHOD

Assume that we would like to test the null hypothesis that the positive predictive values are equal for test A and B, i.e. PPVA = PPVB . The null hypothesis can be written as HP0 : P (D|A) = P (D|B), i.e. HP0 :

p5 + p6 p5 + p7 = . p1 + p2 + p5 + p6 p1 + p3 + p5 + p7

(2)

Alternatively, if we would like to test whether the negative predictive values are equal for test A and B, i.e. if NPVA = NPVB , the null hypothesis is ∗ ∗ ∗ ∗ N HN 0 : P (D |A ) = P (D |B ), i.e. H0 :

p3 + p4 p2 + p4 = . p3 + p4 + p7 + p8 p2 + p4 + p6 + p8

(3)

Our alternative hypotheses will be that the predictive values are not equal, i.e. H1P : P (D|A) = P (D|B) and H1N : P (D∗ |A∗ ) = P (D∗ |B ∗ ).

3.1

L IKELIHOOD RATIO TEST

One possibility to test the null hypothesis in (2) is to use a likelihood ratio test. We first write down the test statistic and then describe how to find the maximum likelihood estimates of parameters.

5

3.1.1

T EST

STATISTIC

In a general setting, if we want to test H0 : θ ∈ Θ0 versus H1 : θ ∈ Θc0 where Θ0 ∪ Θc0 = Θ and Θ denotes the entire parameter space, we may use a likelihood ratio test. This approach was also suggested by Leisenring et al. (2000), who faced numerical difficulties trying to implement it. The likelihood ratio test statistic is in general defined as λ(n) =

supΘ0 L(θ|n) supΘ L(θ|n)

where n is the observed data, Casella and Berger (2002). The denominator of λ(n) is the maximum likelihood of the observed sample over the entire parameter space and the numerator is the maximum likelihood of the observed sample over the parameters satisfying the null hypothesis. Let N = (N1 , N2 , N3 , N4 , N5 , N6 , N7 , N8 ) be the vector of the random variables. When the sample size is large, −2 · logλ(N ) ≈ χ2k i.e. −2 · logλ(N ) is χ2 distributed with k degrees of freedom where k is the difference between the number of free parameters in the unrestricted case and under the null hypothesis. Let θ = p = (p1 , . . . , p8 ) be the parameters in the multinomial distribution and n = (n1 , . . . , n8 ) the observed data. The log-likelihood to be maximized for the multinomial distribution is l(p) = logL(p|n) = c +

8 

ni · log(pi )

(4)

i=1

where c is a constant. The sum of p1 , p2 , ..., p8 must equal 1,

8 

pi = 1.

(5)

i=1

Under the null hypothesis that the positive predictive values for the two tests are equal, their difference δP is zero, i.e. p5 + p6 p5 + p7 − = 0. (6) δP = p1 + p2 + p5 + p6 p1 + p3 + p5 + p7 In the unrestricted case (i.e. H0 ∪ H1 ), the maximum likelihood estimates for p1 , . . . , p8 are the estimates given by (1), which satisfy (5). Under the null hypothesis, the estimates cannot be given in closed form and we will need to use an optimization routine to estimate p1 , . . . , p8 by maximizing the log-likelihood (4) under the constraints (5) and (6). Let pˆ = (ˆ p1 , pˆ2 , pˆ3 , pˆ4 , pˆ5 , pˆ6 , pˆ7 , pˆ8 ) be the unconstrained maximum likelihood estimates and p˜ = (˜ p1 , p˜2 , p˜3 , p˜4 , p˜5 , p˜6 , p˜7 , p˜8 ) the maximum likelihood estimates under the null hypothesis. Then, in our model, asymptotically as N is large,   8  ni · (log(˜ pi ) − log(ˆ pi )) ≈ χ21 . (7) −2 · log(λ(n)) = −2 i=1

We have one less free parameter in the restricted case because of the constraint (6). 6

For testing whether the negative predictive values for the two tests are equal, the constraint δP (6) is replaced by δN , where δN = 3.1.2

F INDING

p3 + p4 p2 + p4 − = 0. p3 + p4 + p7 + p8 p2 + p4 + p6 + p8

(8)

MAXIMUM LIKELIHOOD ESTIMATES UNDER THE NULL HYPOTHESES

To find the maximum likelihood estimates under the null hypothesis, we can either maximize the likelihood function under the given constraints using a numerical optimization routine or find the estimates analytically by solving a system of equations. In both approaches we use Lagrange multipliers and in either case we have two constraints. N UMERICAL MAXIMIZATION OF THE LOG - LIKELIHOOD If we want to find the maximum likeli˜ under the null hypothesis hood estimates using an optimization routine, the goal is to find the values p such that logL(˜ p) ≥ logL(p) for all p that satisfies the two constraints (5) and (6). To maximize the log-likelihood (4) under the null hypotheses, we use the R interface version of TANGO (Trustable Algorithms for Nonlinear General Optimization), see Andreani, Birgin E. G., Martinez and Schuverdt (2007) and Andreani, Birgin, Martinez and Schuverdt (2008), which is a set of Fortran routines for optimization. In order to run the program, one must specify the objective function and the constraint and their corresponding first order derivatives. We reparametrize the problem by setting p1 = p2 = p3 =

1 , 1 + ey1 + ... + ey7 y e1 , 1 + ey1 + ey2 + ... + ey7 y 2 e , 1 + ey1 + ey2 + ... + ey7

.. . p8 =

1+

ey1

ey7 + ey2 + ... + ey7

where −∞ < yi < ∞, i = 1, . . . , 7. This reparametrization ensures that the constraint (5) is satisfied, in addition to restricting the estimated probabilities to be 0 ≤ pi ≤ 1, i = 1, . . . , 8. Let y = (y1 , y2 , y3 , y4 , y5 , y6 , y7 ). The constraint under the null hypothesis (2) is then δP,y =

ey4 + ey5 ey4 + ey6 − =0 y y y 1 4 5 1+e +e +e 1 + ey2 + ey4 + ey6

(9)

and the constraint under the null hypothesis (3) is δN,y =

ey2

ey2 + ey3 ey1 + ey3 − = 0. + ey3 + ey6 + ey7 ey1 + ey3 + ey5 + ey7

(10)

These constraints are both non-linear equality constraints. The TANGO program uses an augmented Lagrangian algorithm to find the minimum of the negative log-likelihood while ensuring that the H0 constraints (9) and (10) are satisfied when testing the null hypotheses (2) and (3) respectively. The 7

Lagrangian multiplier is updated successively starting by an initial value that must be set. We also set the initial value of y and its lower and upper bounds. The value of y at the optimum is returned. Some computational remarks are given in Appendix D. A NALYTICAL MAXIMIZATION OF THE LOG - LIKELIHOOD Another approach is to find the estimates analytically by solving a system of equations arising from the method of Lagrange multipliers, for an introduction see Edwards and Penney (1998). The constraint under the null hypothesis can be rewritten as k(p) = p1 p7 + p2 p7 + p2 p5 − p1 p6 − p3 p5 − p3 p6 = 0.

(11)

In addition, let h(p) be the constraint that p1 , ..., p8 must sum to one, h(p) =

8 

pi = 1,

(12)

i=1

and let l(p) be the log-likelihood function given in (4). The system of equations to be solved then consists of ∇l = γ∇h + κ∇k

(13)

where γ and κ are Lagrangian multipliers, together with the above constraints. The partial derivatives of the log-likelihood l and the constraints h and k with respect to p1 , p2 , p3 , p4 , p5 , p6 , p7 and p8 are given by ∇l =

n1 n2 n3 n4 n5 n6 n7 n8 , , , , , , , p1 p2 p3 p4 p5 p6 p7 p8

,

(14)

∇k = (p7 − p6 , p5 + p7 , −p5 − p6 , 0, p2 − p3 , −p1 − p3 , p1 + p2 , 0),

(15)

∇h = (1, 1, 1, 1, 1, 1, 1, 1).

(16)

and

From Equations (11) – (16) we obtain the following system of equations, which consists of ten equa-

8

tions and ten unknown variables n1 = p1 (γ + κ(p7 − p6 )) n2 = p2 (γ + κ(p5 + p7 )) n3 = p3 (γ + κ(−p5 − p6 )) n 4 = p4 γ n5 = p5 (γ + κ(p2 − p3 ))

(17)

n6 = p6 (γ + κ(−p1 − p3 )) n7 = p7 (γ + κ(p1 + p2 )) n 8 = p8 γ 8 

pi = 1

i=1

p1 p7 + p2 p7 + p2 p5 − p1 p6 − p3 p5 − p3 p6 = 0. The denominators of (14) have been multiplied over to the right hand side in order to allow for pi = 0 as a possible solution for ni = 0. Obviously, l cannot have a maximum value pi = 0 if ni = 0, as l(p) would be −∞ in this case. The solutions of this system of equations involve roots of third degree polynomials, and we have used the Maple 12 command solve to find solutions. Among its solutions, ˜ i under the null the one that maximizes l(p) and where all pi ≥ 0 yields the likelihood estimates p hypothesis. We can show that when ni = 0, the corresponding likelihood estimate under the null hypothesis p˜i is 0 for i = 1, 4, 5, 8, but that this is not necessarily true for i = 2, 3, 6, 7. For p˜4 and p˜8 we have the more general result that p˜4 = n4 /N and p˜8 = n8 /N , see Appendix A for the proofs. A gradient based optimization routine searches for the global minimum across the negative loglikelihood surface and it can get stuck in a local minimum. In our experience this especially happens when some of the cell counts in the contingency table are small. The analytical maximization might yield more accurate estimates in these cases, see Appendix D.

3.2

D IFFERENCE BASED TESTS

Other possible test statistics start out by looking at the difference of the PPVs, and then these test statistics can be standardized by using Taylor series expansion. We also suggest some improvement to these tests. 3.2.1

T EST

STATISTICS

Based on the difference δP given in Equation (6), which equals zero under the null hypothesis, we may suggest a variety of possible test statistics. Wang et al. (2006) suggested the test statistics g1 (N ) =

N5 + N6 N5 + N7 − N1 + N2 + N5 + N6 N1 + N3 + N5 + N7

and

9

(18)

g2 (N ) = log

(N5 + N6 )(N1 + N3 + N5 + N7 ) . (N5 + N7 )(N1 + N2 + N5 + N6 )

(19)

For a more detailed description of their work, see Appendix B.1. Moskowitz and Pepe (2006) also suggest a similar test statistic to g2 (N ), see Appendix B.2. Since the null hypothesis can be written p1 + p3 p1 + p2 = p5 + p7 p5 + p6

H0P : another test statistic to be used may be g3 (N ) =

N1 + N3 N1 + N2 − . N5 + N7 N5 + N6

Another possibility is to use the log ratio of the terms, instead of their difference, g4 (N ) = log

(N1 + N3 )(N5 + N6 ) (N1 + N2 )(N5 + N7 )

or we may rewrite the null hypothesis in order to obtain g5 (N ) = 3.2.2

S TANDARDIZATION

BY

TAYLOR

N5 + N6 N5 + N7 − . N1 + N2 N1 + N3

SERIES EXPANSION

For a general test statistic, g(N ), we may construct a standardized test statistic by subtracting the expectation of the test statistic, E(g(N )), and dividing by its standard deviation, Var(g(N )). In the large sample case the square of the standardized test statistics may then be assumed to be approximately χ21 -distributed, T (N , μ, Σ) =

g(N ) − E(g(N )) Var(g(N ))

2 ≈ χ21 .

(20)

The expectation and variance of the test statistic can be approximated with the aid of Taylor series expansion as suggested by Wang et al. (2006). Let E(N ) = μ be the point around which the expansion is centered. As before, Σ = Cov(N ). A second order Taylor expansion in matrix notation is given as 1 g(N ) ≈ g(μ) + GT (μ)(N − μ) + (N − μ)T H(μ)(N − μ) 2

(21)

where G is a vector containing the first order partial derivatives of g(N ) with respect to the components of N and GT is the transpose of G. Further H is a matrix with second order partial derivatives of g(N ) with respect to the components of N , i.e. the Hessian matrix. The expectation of g(N ) can then be approximated as E(g(N )) ≈ g(μ) 10

for the first order Taylor expansion and as 1 E(g(N )) ≈ g(μ) + tr(H(μ)Σ) 2

(22)

for the second order Taylor expansion, since   E 1 (N − μ)T H(μ)(N − μ)   21 = E tr 2 H(μ)(N − μ)T (N − μ)   = 12 tr H(μ)E((N − μ)T (N − μ)) where we have used the result xT Ax = tr(xT Ax) = tr(AxxT ) where x = N − μ and A is the Hessian matrix H. E((N − μ)(N − μ)T ) is the covariance matrix Σ of N . The variance of g(N ) can be approximated as Var(g(N )) ≈ GT (μ)Σ G(μ) for the first order Taylor expansion. Using a second order Taylor expansion for the variance requires finding the third and fourth order moments of N . Using the first order Taylor approximation of E(g(N )) and Var(g(N )) in the standardized test statistic of (20) yields T (N ) =

(g(N ) − g(μ))2 ≈ χ21 . GT (μ)ΣG(μ)

(23)

Under the null hypothesis, g(μ) = 0. G(μ) and Σ are functions of the unknown parameters p and needs to be estimated. We can either insert the general maximum likelihood estimates pˆi = ni /N or the maximum likelihood estimates p˜i under H0P , as found in Section 3.1.2. When we use the standardized test statistic (23) with g1 (N ) and estimate the variance using the maximum likelihood estimates under H0 we refer to it as the restricted difference test. If we instead use the unrestricted maximum likelihood estimates to estimate the variance, we refer to it as the unrestricted difference test. We have investigated two possible improvements of the standardized test statistics. In addition to using the restricted maximum likelihood estimates to estimate the variance of (23), we have looked at the effect of using a second order Taylor series approximation to E(g(N )) as an attempt to arrive at a more accurate approximation to a χ21 distributed test statistic. The expectation and variance in the standardized test statistic given in (20) is found using a first order Taylor series expansion and the difference between using the first order and the second order Taylor series approximation to E(g(N )) is the term 1/2 · tr(H(μ)Σ). For the simulation experiment in Section 4 this turned out to be very small as compared to the denominator of (23).

3.3

T EST BY L EISENRING , A LONZO AND P EPE (LAP)

Leisenring et al. (2000) present a test for the null hypothesis given in (2). We will denote this the LAP test. They define three binary random variables; Dij that denotes disease status, Zij that indicates which test was used and Xij that describes the outcome of the diagnostic test for test j, j = 1, 2, for subject i, i = 1, . . . , N . 11



0, non-diseased 1, diseased  0, test A Zij = 1, test B

Dij =

 Xij =

0, negative 1, positive

The PPV of test A can be written as PPVA = P (Dij = 1 | Zij = 0, Xij = 1) and the PPV of test B as PPVB = P (Dij = 1 | Zij = 1, Xij = 1). Based on generalized estimation equations Leisenring et al. (2000) fit the generalized linear model logit(P (Dij = 1 | Zij , Xij = 1)) = αP + βP Zij . Testing the null hypothesis H0 : PPVA = PPVB is equivalent to testing the null hypothesis H0 : βP = 0. To derive the generalized score statistic, an independent working correlation structure is assumed for the score function and the corresponding variance function is vij = μij (1 − μij ) where Np mi ¯ μij = E(Dij ). The score function is then SP = i=1 j=1 Zij (Dij − D) which also can be written Np mi ¯ Here Np is the number of subjects with at least one positive test as SP = i=1 j=1 Dij (Zij − Z). outcome and mi is the number of positive test results for subject i. Np mi Zi Di Z¯ = i=1 Np i=1 mi is the proportion of positive B tests for the diseased subjects among all the positive tests and NP i=1 mi Di ¯ =  D NP i=1 mi is the proportion of positive tests for the diseased subjects among all the positive tests. The resulting test statistic for testing the null hypothesis H0 : βP = 0 is obtained by taking the square of the score function divided by its variance:   Np mi i=1

TP P V =   Np mi i=1

j=1 Dij (Zij

j=1 (Dij

2 ¯ − Z)

2 . ¯ ¯ − D)(Z ij − Z)

(24)

Under the null hypothesis, this test statistic is asymptotically χ21 -distributed. It is worth noting that only the subjects with at least one positive test outcome contribute to the test statistic (24). The test statistic in (24) is general and can be used even if the disease status is not constant within a subject. Usually the disease status be constant within the subject and the test statistic can be then  will i Z , the number of positive B tests subject i contributes to the simplified. By defining Ti = m ij j=1 analysis, the statistic can be written 2  Np ¯ D (T − m Z) i i i i=1 . TPPV = Np ¯ 2 ¯ 2 (D i − D) (Ti − mi Z) i=1 12

We derived the test statistic by using our notation of the eight mutually exclusive events in Figure 1. The numerator can be separated into six terms, in three of which the disease status D = 0 and three where D = 1, by noting that Ti = 0 and mi = 1 when only test A is positive, Ti = 1 and mi = 1 when only test B is positive and Ti = 1 and mi = 2 when both tests are positive. Then TPPV =

((N1 + N2 + N5 + N6 )(N5 + N7 ) − (N1 + N3 + N5 + N7 )(N5 + N6 ))2 f (N1 , N2 , N3 , N5 , N6 , N7 )

(25)

where f (N1 , N2 , N3 , N5 , N6 , N7 )



2 2N5 + N6 + N7 2N1 + N2 + N3 + 2N5 + N6 + N7

2 2N5 + N6 + N7 2 + N2 (N1 + N3 + N5 + N7 ) 2N1 + N2 + N3 + 2N5 + N6 + N7

2 2N5 + N6 + N7 + N3 (N1 + N2 + N5 + N6 )2 2N1 + N2 + N3 + 2N5 + N6 + N7

2 2N5 + N6 + N7 2 + N5 (N2 − N3 + N6 − N7 ) 1 − 2N1 + N2 + N3 + 2N5 + N6 + N7

2 2N5 + N6 + N7 + N6 (N1 + N3 + N5 + N7 )2 1 − 2N1 + N2 + N3 + 2N5 + N6 + N7

2 2N5 + N6 + N7 + N7 (N1 + N2 + N5 + N6 )2 1 − . 2N1 + N2 + N3 + 2N5 + N6 + N7

= N1 (N2 − N3 + N6 − N7 )2

To compare the NPVs for test A and test B, Leisenring et al. (2000) fit the generalized linear model logit(P (Dij = 1|Zij , Xij = 0)) = αN + βN Zij . by using the generalized estimating equations method. The null hypothesis in this case is H0 : βN = 0. Under the assumption that disease status is constant within a subject, this leads to the test statistic  Nn TNPV = Nn

i=1 Di (Ti

i=1 (Di

2 ¯ − ki Z)

¯ 2 (Ti − ki Z) ¯ 2 − D)

where Nn is the number of subjects with at least one negative test outcome and ki is the number of negative test results for subject i. Only the subjects with at least one negative test outcome contribute to this test statistic. Leisenring et al. (2000) also propose a Wald test based on the estimates of the regression coefficients, but their simulation studies show that the score test performs better.

4

S IMULATION

STUDY

In order to compare the test size under the null hypothesis for the tests presented in Section 3 and to assess the power of the tests under the alternative hypothesis, we perform a simulation experiment. 13

All the tests are asymptotic tests, but it is not clear how large the sample size has to be for the tests to preserve their test size. Therefore we will consider different sample sizes. Two different simulation strategies for generating datasets will be presented. The maximum likelihood estimates under the null hypotheses needed for the likelihood ratio test and the restricted difference test are found using TANGO as described in Section 3.1.2. All analyses are performed using the R language, R Development Core Team (2008).

4.1

S IMULATION EXPERIMENT FROM L EISENRING , A LONZO AND P EPE

The first simulation experiment is based on the simulation experiment of Leisenring et al. (2000) and we use their algorithm to generate the data. Therefore we denote this simulation experiment the LAP simulation experiment. 4.1.1

A LGORITHM

We generate datasets by using the algorithm presented in Appendix B in Leisenring et al. (2000). Let ID denote the disease status,  1, diseased ID = 0, non-diseased and IA and IB the test results of test A and B,  1, IA = 0,  1, IB = 0,

test A positive test A negative test B positive test B negative

In order to generate the datasets, the number of subjects tested, N , the positive and negative predictive values for both tests, the prevalence of the disease P (D) and the variance σ 2 for the random effect for each subject must be set. The random effect introduces correlation between the test outcomes for each subject. Our interpretation of the simulation algorithm is as follows: 1. Set N , P (D), PPVA , NPVA , PPVB , NPVB and σ. 2. Calculate the true positive rate TP and the false positive rate FP for test A and test B defined by the equations (1 − P (D) − NPV) · PPV TP = (1 − PPV − NPV) · P (D) and FP =

1 − P (D) − NPV . (1 − P (D))(1 − NPV − PPV·NPV 1−PPV )

3. Given TP and FP, the parameters αi and βi , i = 1, 2, for each test are calculated from the following equations, αi = Φ−1 (FP) 1 + σ 2 βi = Φ−1 (TP) 1 + σ 2 , where Φ(·) is the cumulative distribution function of the standard normal distribution. 14

Case no. 1 2 3 4 5 6 7 8

N 100 500 100 500 100 500 100 500

P (D) 0.25 0.25 0.50 0.50 0.25 0.25 0.50 0.50

σ 0.1 0.1 0.1 0.1 1.0 1.0 1.0 1.0

PPVA 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75

PPVB 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75

NPVA 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85

NPVB 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85

TABLE 3: Specifications of the cases under the null hypotheses in the LAP simulation experiment. 4. For each subject the disease status ID is drawn independently with probability P (D). 5. A random effect r ∼ N (0, σ 2 ) is generated for each subject. 6. Given the disease status and the random effect r, the probability of a positive test outcome for each subject is given by P (IA = 1|ID , r) = Φ(α1 (1 − ID ) + β1 ID + r) for test A and by P (IB = 1|ID , r) = Φ(α2 (1 − ID ) + β2 ID + r) for test B. The test outcomes are drawn with these probabilities for each subject. 7. Find n1 , . . ., n8 by counting the number of subjects that belongs to each of the eight events described in Section 2, e.g. n1 is the number of subjects for which ID = 0, IA = 1 and IB =1, the number of subjects that are not diseased and have positive tests A and B. The algorithm is repeated M times, providing M datasets of n1 , . . . , n8 .

4.1.2

C ASES

UNDER STUDY

In the simulation experiment, we suggest eight cases by varying the input parameters N , P (D) and σ in the LAP simulation algorithm. The setup of the experiment is a 23 factorial experiment, i.e. we have three factors, N , P (D) and σ, and each factor has two levels. The low level for N is 100 and the high level is 500, while the low level for P (D) is 0.25 and the high level is 0.50. For σ the low level is 0.1 and the high level is 1.0. The response in this experiment is the estimated test size for each test. The cases that are under the null hypotheses H0P and H0N in equations (2) and (3) are given in Table 3. For cases not under the null hypotheses, the parameters N , P (D) and σ are the same, but the remaining parameters are changed and will be described below. For each of these eight cases we simulate M = 5000 datasets. We generate data under the null hypotheses (2) and (3), by setting PPVA = PPVB = 0.75 and NPV1 = NPV2 = 0.85. These datasets are used to estimate the test size under H0 for both the PPV and NPV tests. To estimate the power of the tests, we need datasets under H1 , and for PPV 15

we generate datasets where PPVA = 0.85 and PPVB = 0.75 and NPV1 = NPV2 = 0.85. To compare the power for the NPV tests, we generate datasets where NPV1 = 0.85 and NPV2 = 0.80 and PPVA = PPVB = 0.75. To compare the positive predictive values of test A and B, we calculate the test statistics for the LAP test, the likelihood ratio test and the unrestricted and restricted difference tests. To compare the negative predictive values for test A and B we use the negative predictive value versions of these test statistics. We calculate p-values based on the χ21 distribution. We also assess the performance of the four other difference based tests as proposed in Section 3.2.1. 4.1.3

R ESULTS

A summary of the results of the simulation experiment will follow. For each case and selected value of the nominal significance level α, let W be a random variable counting the number of p-values smaller than or equal to α. Then W is binomially distributed with size M , the number of p-values generated, and probability α. An estimate of the true significance level of the test, α ˆ is then α ˆ=

W . M

(26)

Let  = W +2 W  = M +4 M  W . α ˜ =  M A 100 · (1 − γ)% confidence interval for α ˆ with limits α ˆ L and α ˆ U , according to Agresti and Coull (1998) is given as  α ˜ · (1 − α ˜) ˜ − zγ α ˆL = α (27) 2  M 

and α ˆU = α ˜ + zγ

2

α ˜ · (1 − α ˜)  M

(28)

where zγ/2 is the γ/2-quantile in the standard normal distribution. When the samples are drawn under ˆ will be an estimate of the test size, i.e. the probability of making a type I error, P (reject H0 |H0 ). H0 , α A p-value is valid, as defined by Lloyd and Moldovan (2008), if the actual probability of rejecting the null hypothesis never exceeds the nominal significance level. We choose the nominal significance level to be 0.05 and we say that the test preserves its test size if the lower confidence limit is less ˆ U < 0.05, the test is said to be conservative, while if than or equal to 0.05, i.e. if α ˆ L ≤ 0.05. If α α ˆ L > 0.05 it does not keep its test size and it is then optimistic. If the samples are drawn under the ˆ is an estimate of the power of the test, i.e. P (reject H0 | H1 ), the probability to alternative H1 , α correctly reject the null hypothesis when it is not true. Table 4 shows the estimated test size with 95% confidence limits for the LAP test, the likelihood ratio test and the restricted and unrestricted difference tests in Case 1–8 where the data is generated under the null hypothesis that PPVA = PPVB . 16

α ˆ

α ˆL

α ˆU

Case 1 LAP test Case 1 Likelihood ratio test Case 1 Restricted difference test Case 1 Unrestricted difference test Case 2 LAP test Case 2 Likelihood ratio test Case 2 Restricted difference test Case 2 Unrestricted difference test Case 3 LAP test Case 3 Likelihood ratio test Case 3 Restricted difference test Case 3 Unrestricted difference test Case 4 LAP test Case 4 Likelihood ratio test Case 4 Restricted difference test Case 4 Unrestricted difference test Case 5 LAP test Case 5 Likelihood ratio test Case 5 Restricted difference test Case 5 Unrestricted difference test Case 6 LAP test Case 6 Likelihood ratio test Case 6 Restricted difference test Case 6 Unrestricted difference test

0.058 0.065 0.051 0.067 0.056 0.056 0.055 0.058 0.051 0.050 0.048 0.051 0.057 0.057 0.057 0.057 0.058 0.070 0.053 0.065 0.054 0.053 0.052 0.055

0.052 0.059 0.046 0.060 0.050 0.050 0.049 0.051 0.046 0.044 0.043 0.045 0.051 0.051 0.051 0.051 0.052 0.063 0.047 0.058 0.048 0.048 0.046 0.049

0.065 0.072 0.058 0.074 0.063 0.063 0.062 0.064 0.058 0.056 0.055 0.058 0.064 0.064 0.064 0.064 0.065 0.077 0.059 0.072 0.060 0.060 0.058 0.061

Case 7 LAP test Case 7 Likelihood ratio test Case 7 Restricted difference test Case 7 Unrestricted difference test Case 8 LAP test Case 8 Likelihood ratio test Case 8 Restricted difference test Case 8 Unrestricted difference test

0.053 0.055 0.049 0.054 0.055 0.055 0.054 0.055

0.047 0.049 0.044 0.048 0.049 0.049 0.048 0.049

0.060 0.061 0.056 0.060 0.062 0.062 0.061 0.062

Case/test

TABLE 4: Estimated test size with 95% confidence limits when testing PPVA = PPVB for data generated under the null hypothesis using the LAP-simulation algorithm.

17

In Case 3, 6, 7 and 8 all four test preserve the test size. In Case 2 the unrestricted difference test is too optimistic, while the other tests preserve the test size. In Case 1 and 5 the restricted difference test is the only test preserving the test size. The other tests are too optimistic. These cases have small cell counts, and it might be that the restricted difference test is more robust towards small cell counts than the other tests. Table 5 shows the mean observed cell counts in Case 1–8 for the data generated under the null hypothesis that PPVA = PPVB . We see that in Case 1 and 5, n ¯ 1 is 0.2 and 1.1 respectively, and thereby n1 = 0 in many of the datasets, and also some of the other cell counts are small. In Case 4 none of the four tests preserve the test size, i.e. all the tests are slightly optimistic. As all the cell counts are high in this case it is not surprising that the estimated test size is the same for all the tests, however we see no apparent reason why the test size is not preserved, and thus this may perhaps be a purely random event. ˆ as the response and found For the likelihood ratio test we analysed the 23 factorial experiment using α that the interaction between the factors N and P (D) is the most significant effect on the the test size with a p-value of 0.012. When N is at its high level, N = 500, the test size is less affected by P (D) which makes sense, since the high value of N ensures that all the cells will have large expected values unless the corresponding cell probabilities are very small. There is also a significant interaction between N and σ, when N = 100, the estimated test size is higher for σ = 1.0 than for σ = 0.01 and when N = 500, the estimated test size is lower for σ = 1.0 than for σ = 0.01. Table 12 (see Appendix C) shows the estimated test size with 95% confidence limits for the NPV versions of the LAP test, the likelihood ratio test and the restricted and unrestricted difference tests in Case 1–8 where the data is generated under the null hypothesis that NPV1 = NPV2 . In Case 1, 2, 4 and 8, all the cases preserve the test size. In Case 5 all the cases except the likelihood ratio test preserve the test size too. In Case 3 and 7 however, only the restricted difference test preserves the test size, none of the other tests do. It may be because it is more robust to the small cell counts in the eight cell, n ¯ 8 , which is 0.8 and 2.2 respectively in these two cases. Table 13 and 14 (see Appendix C) show the estimated power with 95% confidence intervals for the PPV and NPV versions respectively of the LAP test, the likelihood ratio test and the restricted and unrestricted difference tests in Case 1–8 for the data generated under the two alternative hypotheses. The power of the restricted difference test is generally lower than the power of the other tests, which is not surprising since it preserves its test size when the other tests do not. The power increases with the number of subjects N as we would expect. For the PPV tests, it also increases when the prevalence P (D) increases. When the prevalence increases it is more likely that a random subject has the disease, therefore more subjects will have the disease and there will be more positive tests. P (D) = 0.50 in Case 4 and 8 where the tests have higher power than in Case 2 and 6 where P (D) = 0.25. We also note that in general the test power is higher when σ = 1 compared to when σ = 0.1. For the NPV tests, the power increases when N increases and when P (D) decreases. When P (D) = 0.25, P (D∗ ) = 1 − P (D) = 0.75, and the higher this probability is the more likely it is that a random subject does not have the disease. The number of subjects that do not have the disease are then expected to be higher than when P (D) = 0.50 and P (D∗ ) = 0.50. As for the PPV tests, the power increases when σ increases. Table 6 shows the estimated test size with 95% confidence intervals in Case 1–8 for the four other difference based test statistics from Section 3.2.1. When calculating the observed value of the standardized test statistics the unrestricted maximum likelihood estimates in the variance are inserted

18

Case no. 1 2 3 4 5 6 7 8

n ¯1 0.2 1.2 4.3 21.4 1.1 5.3 7.5 37.6

n ¯2 3.9 19.7 10.3 51.5 3.1 15.6 7.0 35.3

n ¯3 3.9 19.6 10.2 51.3 3.1 15.5 7.0 35.2

n ¯4 67.0 334.6 25.1 125.7 67.8 338.6 28.3 141.9

n ¯5 6.3 31.4 38.4 191.8 8.3 41.7 39.8 198.7

n ¯6 6.2 31.1 5.4 27.1 4.2 20.9 4.1 20.1

n ¯7 6.2 31.0 5.5 27.1 4.2 20.8 4.0 20.0

n ¯8 6.3 31.5 0.8 4.0 8.4 41.6 2.2 11.2

TABLE 5: Mean cell counts for the cases in the LAP simulation study under H0 . since in the LAP simulation experiment, the test size for the restricted difference test was lower than the test size for the unrestricted difference test. If we compare these results with the results for the unrestricted difference test, we see that the estimated test size depend highly on the choice of test statistic. The test based on g2 (N ) preserves its test size in all the cases except Case 4. It is however conservative in Case 1 and 5. The test based on g3 (N ) preserves its test size in all the cases, but it is conservative in all except Case 4 and 8. In Case 1 and 5 it is very conservative with an estimated test size of just 0.008 and 0.007 respectively. For the fourth difference based test statistic, g4 (N ), the test size is preserved in all the cases except Case 4. It is conservative in Case 1, 3 and 5. The test based on g5 (N ) is conservative in all the cases, and more conservative than the other tests. In Case 1 and 5 the estimated test size is 0 and 0.001 which shows that this test statistic almost never rejects the null hypothesis. The tests based on g2 (N ) and g4 (N ) can be used as their estimated test size is reasonable, although conservative in some of the cases. We do not recommend using the tests based on g3 (N ) and g5 (N ) as these are even more conservative than the other tests.

4.2

M ULTINOMIAL SIMULATION EXPERIMENT

In the LAP-simulation algorithm, n1 , . . . , n8 are not drawn from a particular probability distribution, but obtained from the disease status and test results which are drawn with the specified probabilities in Section 4.1.1. However, in our model for the likelihood ratio test we assume that N1 , ..., N8 are multinomially distributed. This can be used in the sampling strategy and we simulate data by sampling n1 , ..., n8 from the multinomial distribution given the total number of subjects N and the parameters p1 , ..., p8 . This is less challenging to implement than the LAP-simulation algorithm and when using the likelihood ratio test it is natural to sample data from the distribution assumed when deriving the test statistic. 4.2.1

A LGORITHM

Given p = (p1 , p2 , p3 , p4 , p5 , p6 , p7 , p8 ) and the total number of subjects N , we can generate datasets by drawing n1 , n2 , ..., n8 from a multinomial distribution with parameters p and N . We first need to set p and if we want to sample under the null hypotheses, we need to ensure that p satisfy the constraints δP in Equation (6) and/or δN in Equation (8). In addition p1 , ..., p8 must sum to 1. Our simulation algorithm is as follows: 19

Case/test Case 1 g2 (N ) Case 1 g3 (N ) Case 1 g4 (N ) Case 1 g5 (N ) Case 2 g2 (N ) Case 2 g3 (N ) Case 2 g4 (N ) Case 2 g5 (N )

α ˆ 0.042 0.008 0.021 0.000 0.056 0.043 0.051 0.008

α ˆL 0.037 0.006 0.017 0.000 0.050 0.038 0.045 0.006

α ˆU 0.048 0.011 0.025 0.001 0.063 0.049 0.058 0.011

Case 3 g2 (N ) Case 3 g3 (N ) Case 3 g4 (N ) Case 3 g5 (N ) Case 4 g2 (N ) Case 4 g3 (N ) Case 4 g4 (N ) Case 4 g5 (N ) Case 5 g2 (N ) Case 5 g3 (N ) Case 5 g4 (N ) Case 5 g5 (N ) Case 6 g2 (N ) Case 6 g3 (N ) Case 6 g4 (N ) Case 6 g5 (N ) Case 7 g2 (N ) Case 7 g3 (N ) Case 7 g4 (N ) Case 7 g5 (N ) Case 8 g2 (N ) Case 8 g3 (N ) Case 8 g4 (N ) Case 8 g5 (N )

0.045 0.037 0.043 0.001 0.057 0.056 0.057 0.042 0.039 0.007 0.027 0.000 0.049 0.040 0.047 0.007 0.047 0.035 0.047 0.001 0.054 0.052 0.054 0.042

0.040 0.032 0.038 0.000 0.051 0.050 0.051 0.036 0.034 0.005 0.023 0.000 0.043 0.035 0.042 0.005 0.042 0.031 0.042 0.000 0.048 0.046 0.048 0.036

0.051 0.042 0.049 0.002 0.063 0.063 0.064 0.048 0.044 0.010 0.032 0.001 0.055 0.046 0.054 0.010 0.053 0.041 0.054 0.003 0.061 0.058 0.061 0.048

TABLE 6: Estimated test size with 95% confidence limits when testing PPVA = PPVB using the difference based tests.

20

Case Case 3MN Case 5MN

N 100 100

p1 0.05 0.01

p2 0.10 0.03

p3 0.10 0.03

p4 0.25 0.68

p5 0.39 0.08

p6 0.05 0.04

p7 0.05 0.04

p8 0.01 0.09

TABLE 7: Specification of the parameters in the multinomial simulation experiment. 1. Set p = (p1 , p2 , p3 , p4 , p5 , p6 , p7 , p8 ) and N . 2. Draw n1 , n2 , ..., n8 ∼ multinom(p, N ). Repeat M times. 4.2.2

C ASES

UNDER STUDY

We performed a small simulation study by drawing data from a multinomial distribution. Under the null hypothesis (2) we defined two cases called Case 3MN and Case 5MN. The parameters for these cases are given in Table 7. The parameters p1 , ..., p8 for each of the cases sum to one and the δP -constraint (6) and δN -constraint (8) are both satisfied. The parameters were set in order to represent Case 3 and 5 from the LAPsimulation experiment. In both of these cases N = 100, while P (D) is 0.5 in Case 3MN and 0.25 in Case 5MN as in Case 3 and 5 in the LAP-simulation experiment. For both Case 3MN and 5MN the PPVs are equal and approximately 0.75, the NPVs are equal and approximately 0.85. However, since the datasets in the LAP simulation experiment were not drawn from a multinomial distribution, the mean and the variance of n will not be exactly the same in Case 3MN and 5MN as in Case 3 and 5. The parameters p1 , ..., p8 were found by setting the value of P (D), the values of PPV1 = PPV2 and NPV1 = NPV2 and by considering the mean observed values for Case 3 and 5 in Table 5. These two cases were chosen because we would like to test the multinomial sampling strategy for one case where the likelihood ratio test did not preserve its test size (Case 5) as well as one case where the test size was preserved (Case 3) in the LAP simulation experiment when testing if the positive predictive values are equal. For each of the cases we draw M = 5000 samples from the multinomial distribution with parameters as given in Table 7. 4.2.3

R ESULTS

The estimated test size and 95% confidence limits for the LAP test, the likelihood ratio test and the restricted and unrestricted difference tests for the two cases in the simulation study using the multinomial simulation algorithm are given in Table 8. In Case 3MN all the tests preserve the test size. We note that the estimated test size is lower for the restricted difference test than for the other tests. In Case 5MN only the restricted difference test and the LAP test preserve their test size. If we compare the results to Case 3 in the LAP simulation experiment we see that α ˆ is higher in Case 3MN than in Case 3 for all the tests. In Case 5MN, α ˆ is higher for the likelihood ratio test and lower for the other tests compared to Case 5 in the LAP simulation experiment. The datasets in the two simulation experiments are not identical, but since they are generated with approximately the same

21

Case/test Case 3MN LAP Case 3MN Likelihood ratio test Case 3MN Restricted difference test Case 3MN Unrestricted difference test

α ˆ 0.054 0.054 0.052 0.054

α ˆL 0.048 0.048 0.046 0.048

α ˆU 0.060 0.060 0.058 0.061

Case 5MN LAP Case 5MN Likelihood ratio test Case 5MN Restricted difference test Case 5MN Unrestricted difference test

0.056 0.072 0.050 0.064

0.050 0.065 0.044 0.058

0.063 0.079 0.057 0.071

TABLE 8: Estimated test size with 95% confidence limits for testing PPV1 = PPV2 under the null hypothesis using the multinomial simulation algorithm. values for PPV1 , PPV2 , NPV1 , NPV2 and P (D) we find it surprising that the estimated test size for the likelihood ratio test is higher in the multinomial simulation experiment than in the LAP simulation experiment. We would expect the likelihood ratio test to perform better, i.e. have a lower test size, on datasets that are drawn from the model on which the test statistic is based, namely the multinomial model. Case/test Case 3MN LAP Case 3MN Likelihood ratio test Case 3MN Restricted difference test Case 3MN Unrestricted difference test

α ˆ 0.059 0.061 0.052 0.060

α ˆL 0.053 0.054 0.046 0.054

α ˆU 0.066 0.068 0.058 0.067

Case 5MN LAP Case 5MN Likelihood ratio test Case 5MN Restricted difference test Case 5MN Unrestricted difference test

0.049 0.062 0.051 0.049

0.044 0.056 0.045 0.044

0.056 0.069 0.057 0.056

TABLE 9: Estimated test size with 95% confidence limits for testing NPV1 = NPV2 under the null hypothesis using the multinomial simulation algorithm. The estimated test size with 95% confidence limits for testing if the NPVs are equal in the same cases are shown in Table 9. In Case 3MN only the restricted difference test preserves its test size, while in Case 5MN the LAP test and the unrestricted difference test also preserve their test size. The likelihood ratio test does not preserve its test size in any of these cases.

5

DATA

FROM LITERATURE

We will use the dataset from Weiner, Ryan, McCabe, Kennedy, Schloss, Tristani and Fisher (1979) which is the same dataset as used in Leisenring et al. (2000) and Wang et al. (2006). There were 871 subjects of which 608 subjects had coronary artery disease (CAD) and 263 subjects did not have CAD. For all the subjects the results of clinical history (test A) and exercise stress testing (EST) (test 22

B) were registered. The dataset is shown in Table 10.

Result of clinical history

+ -

Coronary artery disease Result of EST + 22 44 46 151

Coronary artery disease + Result of EST + 473 81 29 25

TABLE 10: Data from the coronary artery disease study. Table 11 shows the resulting p-values for comparing the positive and negative predictive values using the LAP-test, the likelihood ratio test and the restricted and unrestricted difference test. Test LAP Likelihood ratio test Restricted difference test Unrestricted difference test

PPV 0.3706 0.3710 0.3696 0.3705

NPV logL(p1 , ..., p8 ). We start by writing down the expression for p2 and check if it is greater than p2 . k(p1 + p2 ) > p2 p1 + p2 + p3 (p1 + p2 ) > p2 2p1 + p2 + p3 (p1 + p2 + p3 ) · (p1 + p2 ) > (p1 + p1 + p2 + p3 )p2 (p1 + p2 + p3 )p1 > p1 · p2 p1 + p2 + p3 > p2 The inequality is satisfied and therefore p2 > p2 . The same argument can be used to show that p3 > p3 .

26

 The non-constant part of the log likelihood function is 8i=1 ni · logpi , and when n1 = 0, the first term in the sum is 0, regardless of the value of p1 . When p2 > p2 and p3 > p3 we see that logL(0, p2 , p3 , p4 , p5 , p6 , p7 , p8 ) > logL(p1 , p2 , p3 , p4 , p5 , p6 , p7 , p8 ). Therefore p˜1 = 0. The same argument is valid for p˜5 , i.e. p˜5 = 0 when n5 = 0. When n4 and/or n8 is 0, then p˜4 and/or p˜8 are also 0, see below. However, the argument does not hold for p˜2 , p˜3 , p˜6 and p˜7 when n2 , n3 , n6 or n7 is 0. Even though e.g. p˜2 may sometimes be 0 when n2 = 0, this is not always true. If e.g. p˜1 = 0 and p˜3 > 0, then p˜2 cannot be equal to 0 even if n2 = 0 because then the null hypothesis constraint (31) will not be satisfied. One example of this situation is the table n = (0, 0, 6, 0, 2, 6, 0, 0). The analytic solution ˜ = (0, 2/7, 1/7, 0, 2/7, 2/7, 0, 0) and we see that p˜2 = 0 of the Lagrangian system of equations is p even though n2 = 0. We proceed to show that p˜4 = n4 /N and p˜8 = n8 /N . If we add the first eight Lagrangian equations in (18), we get N = γh(p) + 2κk(p) = γ, where h(p) = 1 and k(p) = 0 are the two constraints. Thus γ = N , and p˜4 = n4 /N and p˜8 = n8 /N follow from (18). ˜ under the null hypothesis is among the p = (p1 , . . . , p8 ) So the maximum likelihood estimate p for which p4 = n4 /N and p8 = n8 /N . For such a p, let s(p) = r · (p1 , p2 , p3 , p5 , p6 , p7 ), where r = N/(N − n4 − n8 ) so that the sum of the components of s(p) is 1. Let logL and logL denote the log-likelihood of the original multinomial model with eight parameters and the alternative multinomial model with six parameters in Section 6. Then logL(p) − logL (s(p)) is constant, showing that logL(p) is maximal if and only if logL (s(p)) is. Furthermore, p satisfies the null hypothesis for the multinomial model with eight parameters if and only if s(p) does for the multinomial model with six parameters, showing that the maximum likelihood estimates under the null hypothesis for the multinomial model with eight and six parameters are obtained from the other model by up- and downscaling, respectively. There are also other relationships between the restricted parameter estimates that can easily be shown and used in the estimation of the parameters: p1 + p2 + p3 =

n1 + n2 + n3 N

p5 + p6 + p7 =

n5 + n6 + n7 N

n2 n3 n1 +N = + p1 p2 p3 n5 n6 n7 +N = + p5 p6 p7

B

E XISTING

DIFFERENCE BASED METHODS

The already published difference based tests for comparing predictive values will here be described briefly. 27

B.1

T EST BY WANG

Recently Wang et al. (2006) presented two tests, one based on the difference of the PPVs and one based on the log ratio of the PPVs for testing the null hypothesis in (2). The data are assumed to be multinomially distributed. They fit the model PPVA − PPVB = β1P using the weighted least squares approach.1 Testing if the positive predictive values for test A and B are equal is equivalent to testing H0 : β1P = 0. The test statistic is

 W1P =

2 N   ( PPV − PPV ) , A B ˆP Σ

(32)

1

ˆ P is the estimated variance of βˆP = PPV  A − PPV  B . To which is asymptotically χ21 -distributed. Σ 1 1 compare the negative predictive values the same approach is followed by looking at the difference of the two negative predictive values. They fit the model NPVA − NPVB = β1N and test the null hypothesis H0 : β1N = 0 using the following test statistic  W1N

=

2 N   (NPVA − NPVB ) ˆN Σ

(33)

1

ˆ N is the estimated variance of βˆN = NPV  A − NPV B . W1N is asymptotically χ21 -distributed. where Σ 1 1 In the second test they consider the log ratio of the PPVs as their test statistic and fit the model PPVA log PPV = β2P . Testing if the positive predictive values are equal is in this case equivalent to testing B the null hypothesis H0 : β2P = 0. The test statistic is 2 √ A N PPV W2P = log (34) ˆP Σ P PVB 2 dA ˆ P is the estimated variance of βˆP = log PPV which is asymptotically χ21 distributed. Σ 2 2 d B . The same PPV approach is followed to derive a second test for the negative predictive values bylooking  at the log NPVA ratio of the negative predictive values for test A and test B, the model fitted is log NPVB = β2N . To

test if the negative predictive values are equal, the null hypothesis is H0 : β2N = 0 and they use the following test statistic √ 2 A N NPV N W2 = log (35) ˆN B Σ NPV 2

ˆ N is the estimated variance of βˆN = log NPVA . The test statistic in (35) is χ2 -distributed. where Σ 2 2 1 dB NPV They recommend using the tests based on the difference of the predictive values because it performs better than the tests based on the log ratio of the predictive values in terms of test size and power. d

B.2

T EST BY M OSKOWITZ AND P EPE

PPVA NPVA Moskowitz and Pepe (2006) look at the relative predictive values, rPPV = PPV and rNPV = NPV . B B By using the multivariate central limit theorem and the Delta method (which uses Taylor series ex1

The notation in Appendix B.1 differs from the notation used in Wang et al. (2006).

28

pansions to derive the asymptotic variance), the following 100 · (1 − α)% confidence intervals can be estimated for log rPPV and log rNPV,  σ ˆP2 log rPPV ± z1−α/2 N  2 σ ˆN log rNPV ± z1−α/2 N 2 are the estimated variances of √1 log rPPV  and √1 log rNPV  respectively and N where σ P2 and σ N N N is the number of subjects under study. Moskowitz and Pepe (2006) do not present a hypothesis test, but based on the confidence intervals we have the asymptotically χ21 distributed test statistic



√ 2 N log rPPV σ P

ZP =

(36)

for testing the null hypothesis (2) for the positive predictive values. When testing the null hypothesis (3) whether the negative predictive values are equal the test statistic  ZN =

√ log

N

σ N

2 rNPV

,

(37)

which has an asymptotic χ21 distribution can be used. The test statistic in (36) only differs from the test statistic in (32) in the estimated variance. Moskowitz and Pepe (2006) use the multinomial Poisson transformation to simplify the variances.

C

R ESULTS

FROM THE

LAP

SIMULATION EXPERIMENT

Table 12 shows the estimated test size with 95% confidence limits when comparing the negative predictive values for data generated the null hypothesis. Table 13 and 14 show the estimated test power when comparing the positive and negative predictive values respectively for data generated under the alternative hypothesis.

29

α ˆ

α ˆL

α ˆU

Case 1 LAP test Case 1 Likelihood ratio test Case 1 Restricted difference test Case 1 Unrestricted difference test Case 2 LAP test Case 2 Likelihood ratio test Case 2 Restricted difference test Case 2 Unrestricted difference test Case 3 LAP test Case 3 Likelihood ratio test Case 3 Restricted difference test Case 3 Unrestricted difference test Case 4 LAP test Case 4 Likelihood ratio test Case 4 Restricted difference test 4 Case 4 Unrestricted difference test Case 5 LAP test Case 5 Likelihood ratio test Case 5 Restricted difference test Case 5 Unrestricted difference test Case 6 LAP test Case 6 Likelihood ratio test Case 6 Restricted difference test Case 6 Unrestricted difference test

0.052 0.056 0.048 0.052 0.050 0.050 0.050 0.050 0.058 0.059 0.052 0.060 0.046 0.046 0.045 0.047 0.050 0.061 0.049 0.050 0.049 0.049 0.049 0.049

0.046 0.050 0.043 0.046 0.045 0.044 0.044 0.045 0.052 0.053 0.047 0.053 0.041 0.040 0.039 0.041 0.044 0.055 0.044 0.044 0.044 0.044 0.043 0.044

0.059 0.063 0.055 0.059 0.057 0.057 0.056 0.057 0.065 0.066 0.059 0.067 0.052 0.052 0.051 0.053 0.056 0.068 0.056 0.056 0.056 0.056 0.055 0.056

Case 7 LAP test Case 7 Likelihood ratio test Case 7 Restricted difference test Case 7 Unrestricted difference test Case 8 LAP test Case 8 Likelihood ratio test Case 8 Restricted difference test Case 8 Unrestricted difference test

0.060 0.067 0.056 0.060 0.046 0.047 0.044 0.046

0.053 0.061 0.050 0.054 0.040 0.041 0.039 0.040

0.067 0.074 0.063 0.067 0.052 0.053 0.050 0.052

Case/test

TABLE 12: Estimated test size with 95% confidence limits when testing NPVA = NPVB for data generated under the null hypothesis using the LAP simulation algorithm.

30

α ˆ

α ˆL

α ˆU

Case 1 LAP test Case 1 Likelihood ratio test Case 1 Restricted difference test Case 1 Unrestricted difference test Case 2 LAP test Case 2 Likelihood ratio test Case 2 Restricted difference test Case 2 Unrestricted difference test Case 3 LAP test Case 3 Likelihood ratio test Case 3 Restricted difference test Case 3 Unrestricted difference test Case 4 LAP test Case 4 Likelihood ratio test Case 4 Restricted difference test Case 4 Unrestricted difference test Case 5 LAP test Case 5 Likelihood ratio test Case 5 Restricted difference test Case 5 Unrestricted difference test Case 6 LAP test Case 6 Likelihood ratio test Case 6 Restricted difference test Case 6 Unrestricted difference test

0.125 0.143 0.111 0.141 0.396 0.390 0.380 0.400 0.369 0.361 0.349 0.369 0.945 0.944 0.943 0.944 0.146 0.174 0.124 0.158 0.463 0.458 0.449 0.466

0.116 0.134 0.102 0.131 0.383 0.376 0.367 0.386 0.356 0.348 0.336 0.356 0.938 0.937 0.936 0.938 0.137 0.163 0.116 0.149 0.450 0.444 0.435 0.452

0.135 0.153 0.120 0.151 0.410 0.403 0.394 0.413 0.383 0.375 0.363 0.383 0.951 0.950 0.949 0.950 0.156 0.184 0.134 0.169 0.477 0.472 0.463 0.479

Case 7 LAP test Case 7 Likelihood ratio test Case 7 Restricted difference test Case 7 Unrestricted difference test Case 8 LAP test Case 8 Likelihood ratio test Case 8 Restricted difference test Case 8 Unrestricted difference test

0.485 0.484 0.468 0.485 0.987 0.987 0.987 0.987

0.471 0.470 0.454 0.471 0.984 0.984 0.983 0.984

0.498 0.498 0.482 0.498 0.990 0.990 0.990 0.990

Case/test

TABLE 13: Estimated power with 95% confidence limits when testing PPVA = PPVB for data generated under the alternative hypothesis.

31

Case/test Case 1 LAP test Case 1 Likelihood ratio test Case 1 Restricted difference test Case 1 Unrestricted difference test

α ˆ 0.324 0.336 0.048 0.052

α ˆL 0.312 0.323 0.043 0.046

α ˆU 0.338 0.349 0.055 0.059

Case 2 LAP test Case 2 Likelihood ratio test Case 2 Restricted difference test Case 2 Unrestricted difference test Case 3 LAP test Case 3 Likelihood ratio test Case 3 Restricted difference test Case 3 Unrestricted difference test Case 4 LAP test Case 4 Likelihood ratio test

0.929 0.929 0.050 0.050 0.113 0.114 0.052 0.060 0.350 0.349

0.921 0.921 0.044 0.045 0.105 0.105 0.047 0.053 0.337 0.336

0.935 0.935 0.056 0.057 0.122 0.123 0.059 0.067 0.364 0.362

Case 4 Restricted difference test Case 4 Unrestricted difference test Case 5 LAP test Case 5 Likelihood ratio test Case 5 Restricted difference test Case 5 Unrestricted difference test Case 6 LAP test Case 6 Likelihood ratio test Case 6 Restricted difference test Case 6 Unrestricted difference test

0.045 0.047 0.427 0.458 0.049 0.050 0.986 0.986 0.049 0.049

0.039 0.041 0.413 0.444 0.044 0.044 0.982 0.982 0.043 0.044

0.051 0.053 0.441 0.471 0.056 0.056 0.989 0.989 0.055 0.056

Case 7 LAP test Case 7 Likelihood ratio test Case 7 Restricted difference test Case 7 Unrestricted difference test Case 8 LAP test Case 8 Likelihood ratio test Case 8 Restricted difference test Case 8 Unrestricted difference test

0.136 0.146 0.056 0.060 0.465 0.463 0.044 0.046

0.127 0.136 0.050 0.054 0.451 0.450 0.039 0.040

0.146 0.156 0.063 0.067 0.478 0.477 0.050 0.052

TABLE 14: Estimated power with 95% confidence limits when testing NPVA = NPVB for data generated under the alternative hypothesis using the LAP simulation algorithm.

32

D

C OMPUTATIONAL REMARKS

When using the TANGO program Andreani et al. (2007), Andreani et al. (2008), there are several parameters that can be set or modified by the user. Along with the specification of the objective function and the constraints, the initial estimates of the Lagrange multipliers, the initial values of the variables and their lower and upper bounds must be set. Other parameters have a default value, but these can be altered by the user. These parameters include tolerance limits and the maximum number of iterations. In our simulation studies we have chosen the initial value 0.0 for all the variables with upper and lower bounds ±200000. The initial value for the Lagrangian multiplier was set to 0.0 as advised in the program when one does not believe it should have a specific value. The feasibility and optimality tolerances are 10−4 by default. We found that with these tolerances, the resulting variable values depend on both the initial value of the Lagrange multiplier and the initial values of the variables. However, different initial values for the variables give more similar results than different initial values of the Lagrange multipliers. The smaller the tolerance is, the more similar the results will be, so in order to get results that do not depend on any of the initial values one should use smaller values for the tolerances and in our problems, smaller than 10−4 . The problem is then that it takes longer for the algorithm to converge. When performing the likelihood ratio test for one or a few datasets this is not an issue, but when performing simulation experiments with several thousand datasets this will slow down the experiment considerably. Another problem is that of the algorithm converging to a local maximum. For example, the analytical restricted likelihood estimates for the table n = (0, 7, 0, 69, 5, 3, 11, 5) is -115.73 while it is -166.38 using the numerical estimates from TANGO. The difference is caused by the fact that p˜3 = 0 using the numerical optimization routine, while it is 0.04 using the analytical optimization. For most of the large sample datasets the difference is less, with e.g. 15 of 5000 estimates in Case 1 in the LAP simulation experiment differ by more than 1.0 between the analytical and numerical estimates. We recommend using the analytical estimates.

33

C OMPARING POSITIVE PREDICTIVE VALUES FOR SMALL SAMPLES WITH APPLICATION TO GENE ONTOLOGY TESTING C LARA -C ECILIE G ÜNTHER , Ø YVIND BAKKE AND M ETTE L ANGAAS Department of Mathematical Sciences. The Norwegian University of Science and Technology, NO-7491 Trondheim, Norway. M ARCH 2009

S UMMARY Motivated by the challenge of detecting Gene Ontology (GO) categories which are overrepresented or depleted when comparing biological findings represented by two over-lapping lists of genes, we examine the performance of different statistical tests. One key feature with this type of data is that the sample size at each GO category often is small and thus large sample asymptotic tests are not suitable. We look at four different test statistics in combination with parametric bootstrapping, and compare the methods with their asymptotic alternatives. We find that the choice of test statistic influence which GO categories are found to be significant, and all tests under study perform increasingly conservative as the sample size decreases. We observe that this problem is statistically the same as comparing the positive predictive values of two diagnostic tests.

1

I NTRODUCTION

In some biological experiments the aim is, e.g. by using DNA microarrays, to discover genes that are differentially expressed between two or more conditions. The conditions may be defined by the presence or absence of a disease or by different treatments like diets, drugs or amount of physical exercise. As an example we consider a situation where the relationship between inborn aerobic capacity and cardiac gene expression in rats was studied, Bye, Langaas, Høydahl, Kemi, Heinrich, Koch, Britton, Najjar, Ellingsen and Wisløff (2008). The rats were born with either high running capacity (HCR) or low running capacity (LCR), and half of the rats were trained, while the others remained sedentary. Thus there were four groups of rats, LCR trained, LCR sedentary, HCR trained and HCR sedentary. Several comparisons were done, and the comparison of the gene expression for the sedentary HCR rats with the gene expression for the sedentary LCR rats resulted in a list of 1540 differentially expressed genes between these two groups. However, since such lists contains only single genes, i.e. without information about potential connections to the other genes on the list, it can be challenging to interpret the biological meaning of the results. What may be more interesting for interpretation purposes is the biological pathways that are active in the conditions under study. To do this, groups of genes instead of single genes are considered. In this paper we consider groups of genes selected from a predefined set using the Gene Ontology (GO) vocabulary, The Gene Ontology Consortium (2000). GO is a vocabulary that classifies 1

genes into the three main categories: biological process, molecular function and cellular component and their subcategories. Given the list of differentially expressed genes from the experiment and the list of all genes present on the microarray chip, called the master list, the biologist wants to know whether certain gene classes are over-represented or depleted in the list of differentially expressed genes compared to the master list. In the rat example, we are interested in knowing if the number of genes related to aerobic capacity among those differentially expressed between the sedentary HCR rats and the sedentary LCR rats is higher than what we would expect by chance if we compare it to the master list. The list of differentially expressed genes is contained in the master list, and the statistical hypothesis problem is to test whether two binomial proportions are equal. Common approaches are Pearson’s asymptotic χ2 -test and Fisher’s exact test for large and small samples, respectively. If there are more than two conditions in the experiment, several comparisons can be done which may each result in a list of differentially expressed genes between the conditions being compared. Then we would like to see whether some specific gene classes of interest are over-represented or depleted on one of the lists compared to one of the others. The two lists may either be mutually exclusive or partly overlapping. If they are mutually exclusive the problem reduces to test whether two binomial proportions are equal as for the master list problem and the same approaches can be used. We will consider only the situation of overlapping gene lists. In the rat example, we want to compare the list of differentially expressed genes between trained HCR and LCR rats to the list of differentially expressed genes between trained HCR rats and sedentary LCR rats. Comparing two overlapping gene lists in terms of over-represented or depleted gene classes is statistically the same situation as comparing the positive predictive values for two diagnostic tests and several hypothesis tests for this situation can be found in the literature, see Leisenring, Alonzo and Pepe (2000), Wang, Davis and Soong (2006) and Moskowitz and Pepe (2006). Günther, Bakke, Lydersen and Langaas (2008) presented a likelihood ratio test and a restricted difference test and compared them to the other existing tests. Simulation experiments showed that for smaller sample sizes these tests did not preserve their test size. When comparing gene lists, the actual sample size is the number of genes associated with each of the three main GO-categories, not the number of genes on the microarray chip, nor the total number of genes on the lists. This number is usually quite small and large sample tests are not a suitable approach. Instead small sample tests should be applied. In this paper we evaluate small sample tests for comparing two overlapping gene lists, i.e. to test whether the probabilities that a randomly chosen gene belongs to a specific gene class are equal for the two lists. We first describe the assumed model and define the null and alternative hypotheses in Section 2, and then present the test statistics and how to calculate the p-values in Section 3. A simulation study is conducted to assess the method and described in Section 4 and in Section 5 an example in which data from the literature is given. A short discussion is given in Section 6 before we end with the conclusions in Section 7.

2

M ODEL AND DATA

We assume that we have two lists of genes, list A and list B. For each gene we are interested in comparing the probability that it belongs to a certain gene class D given that it is on list A, with the probability that it belongs to D given that it is on list B. Our null hypothesis is that these two probabilities are equal. By defining the three events 2

• D: The gene belongs to gene class D. • A: The gene is present on gene list A. • B: The gene is present on gene list B. we can express the null hypothesis as H0 : P (D | A) = P (D | B).

(1)

Statistically, this is the same problem as testing equality of the positive predictive values of two diagnostic tests for the same disease. Two diagnostic tests with binary outcomes, i.e. positive or negative, are applied to each subject in the study. The positive predictive value (PPV) is defined as the probability that the subject has the disease of study given that the test is positive. If we let event D be that the subject has the disease, A the event that the outcome of test A is positive and B the event that the outcome of test B is positive, then the positive predictive value of test A is PPVA = P (D | A), and the positive predictive value of test B is PPVB = P (D | B). Our null hypothesis is that the two positive predictive values are equal, i.e. H0 : P (D | A) = P (D | B) as in (1). The Venn diagram in Figure 1 shows the six mutually exclusive events defined by A, B and D. We only look at the restricted sample space, i.e. A ∪ B, and thereby only the part of D that intersects A ∪ B. Let A∗ , B ∗ and D∗ be the complementary events of A, B and D respectively. Günther et al. (2008) argue that when comparing positive predictive values it suffices to consider only the subjects with at least one positive test result, which equals the set A∪B. In the GO setting the number of genes belonging to the GO category D that are not present on any of the lists, i.e. the event (A∗ ∪ B ∗ ) ∩ D, is unknown as is the number of genes not present on the lists that do not belong to the GO category D, therefore the part of D that intersects with A∗ ∪ B ∗ is not included. To each of the six events in the Venn diagram there corresponds 6the probability qi that event i occurs, i = 1, ..., 6. The sum of these probabilities is one, i.e. i=1 qi = 1. Associated with each event is also a random variable Ni , i = 1, . . . , 6, Ni being the number of times event i occurs. We 6 consider one main category at a time, such that in total there are N = i=1 Ni unique genes on the two lists associated with either biological process, cellular component or molecular function. The number N will typically change between the three main categories. Given N , the random variables N1 , N2 , . . . , N6 are multinomially distributed with parameters N and q = (q1 , q2 , q3 , q4 , q5 , q6 ). The joint probability function of N1 , N2 , . . . , N6 is   6 6   qini (Ni = ni ) = N ! P . (2) ni ! i=1

i=1

The expected value of N = (N1 , N2 , N3 , N4 , N5 , N6 ) is μ = E(N ) = N · q and the covariance matrix is Σ = Cov(N ) = N (Diag(q) − q T q), Johnson, Kotz and Balakrishan (1997). We do not assume that a random gene’s presence on list A is independent on its presence on list B and of whether it belongs to GO category D. This is implicitly handled by the multinomial model, since each gene yields only one observation of one of the six mutually exclusive events. We do however assume that the genes are sampled independently of each other and we will comment this further in Section 6. Throughout this work, we assume that only N is fixed and that N are realisations of multinomial samples. Other sampling schemes are possible as well, for instance by fixing ND , NA and NB , the 3

number of genes belonging to gene class D, are present on list A and are present on list B respectively, and sampling N independently from three binomial distributions. In this report, we will not consider these approaches. The probabilities P (D|A) and P (D|B) can be expressed in terms of the parameters q since P (D|A) =

q4 + q5 P (D ∩ A) = P (A) q 1 + q 2 + q 4 + q5

and

q4 + q6 P (D ∩ B) . = P (B) q1 + q3 + q4 + q6 Thus, the null hypothesis can be written P (D|B) =

H0 : δ =

q4 + q5 q4 + q6 − = 0. q 1 + q2 + q4 + q5 q1 + q 3 + q 4 + q 6

A

(3)

B

N2

N1

N3

N4 N5

N6 D

F IGURE 1: Venn diagram for the events D, A and B showing which events the random variables N1 , . . . , N6 correspond to. There are several possible alternative hypotheses. If we are interested in whether there is an enrichment or depletion of genes belonging to gene class D on list A compared to list B, we have the two-sided alternative (4) H1 : P (D|A) = P (D|B). If we are interested in testing only whether there is an enrichment of genes belonging to gene class D on list A compared to list B, we have the one-sided alternative H1 : P (D|A) > P (D|B).

(5)

When testing whether there is a depletion of genes belonging to gene class D on list A compared to list B, the alternative hypothesis is H1 : P (D|A) < P (D|B).

(6)

In this work we will focus on the two sided alternative. We observe data n = (n1 , n2 , n3 , n4 , n5 , n6 ) which are realizations of N = (N1 , N2 , N3 , N4 , N5 , N6 ) and can be represented in a table as shown in Table 1. 4

D∗ n1 n2 n3

Event A∩B A ∩ B∗ A∗ ∩ B

D n4 n5 n6

TABLE 1: The observed data classified by the events A, B and D.

3

M ETHOD

In this section we present the test statistics we considered to test whether the probability of a gene belonging to gene class D given that it is present on list A is equal to the probability of a gene belonging to gene class D given that it is present on list B. We also describe how to calculate the p-values.

3.1

T EST STATISTICS

To test the null hypothesis (3), we consider four test statistics: a likelihood ratio test statistic, a score test statistic and two difference test statistics. They have all been shown to be asymptotically χ21 distributed when the sample size is large, Casella and Berger (2002), Leisenring et al. (2000), Wang et al. (2006), but here we will use parametric bootstrapping to approximate their distribution under the null hypothesis for small samples. We describe the test statistics briefly, more details can be found in Günther et al. (2008). The likelihood ratio test statistic is TLR = −2 · log(λ(N )) where λ(N ) is the maximum likelihood of a multinomial sample under the null hypothesis divided by the general maximum likelihood of the multinomial sample. Let Q denote the parameter space for q and Q0 the subspace of Q in which q satisfy the constraint given by the null hypothesis (3). Then, λ(n) =

supQ0 L(q|n) supQL(q|n)

.

Let q˜i , i = 1, . . . , 6, be the restricted maximum likelihood estimates, that is, the maximum likelihood estimates under H0 , and let qˆi , i = 1, . . . , 6 be the unrestricted general maximum likelihood estimates for the multinomial distribution, i.e. qˆi = ni /N . Inserting these estimates in the log-likelihood function for the multinomial distribution leads to the test statistic  6   TLR = −2 ni · (log q˜i − log qˆi ) . (7) i=1

Note that q˜i , i = 1, . . . , 6, cannot be written in any comprehensible closed form, but can be found using an optimization routine or analytically by solving a Lagrangian system of equations. We do the latter using Maple 12, for details see Günther et al. (2008).

5

The difference tests are based on the estimator g(N ) for the difference δ in (3), g(N ) =

N4 + N6 N4 + N5 − N 1 + N2 + N4 + N5 N 1 + N3 + N4 + N6

(8)

and the test statistic is derived by subtracting the expectation of g(N ) and dividing by its approximate standard deviation, which is found by taking the variance of the first order Taylor expansion of g(N ). Let μ = E(N ) and Σ = Cov(N ) as defined in Section 2. This yields Tg =

(g(N ) − g(μ))2 GT (μ)Σ G(μ)

(9)

where G is a vector containing the first order partial derivatives of g(N ) with respect to the components of N and GT is the transpose of G. G(μ) is G with μ inserted for N . Under the null hypothesis g(μ) = 0. G(μ) and Σ depend on the unknown parameters q which must be estimated when calculating the test statistic. We can either use the unrestricted maximum likelihood ˆ for the multinomial distribution or the restricted maximum likelihood estimates q ˜ under estimates q H0 . In the first case we refer to the test as the unrestricted difference test (uDT) and denote the test statistic TuDT and in the second case we refer to the test as the restricted difference test (rDT) and denote the test statistic TrDT . Leisenring et al. (2000) presented a score test, which we denote the LAP test, for testing equivalence of positive predictive values of two diagnostic tests, based on generalized estimating equations. They define three indicator variables. First Dij indicates the disease status of subject i for diagnostic test j, i.e. Dij = 0 if the subject does not have the disease and Dij = 1 if it does have the disease. Zij indicates which test is used, it is 0 for test A and 1 for test B. Xij indicates the test result, it is 0 if the test is negative and 1 if it is positive. Then the positive predictive value for test A can be written PPVA = P (Dij = 1 | Zij = 0, Xij = 1) and the positive predictive value for test B is PPVB = P (Dij = 1 | Zij = 1, Xij = 1). Leisenring et al. (2000) fit the generalized linear model logit(P (Dij = 1 | Zij , Xij = 1)) = αP + βP Zij , and test whether βP = 0 which is equivalent to testing whether PPVA = PPVB . We translate the test to the GO situation and in our notation the test statistic can be written as TLAP =

((N1 + N2 + N4 + N5 )(N4 + N6 ) − (N1 + N3 + N4 + N6 )(N4 + N5 ))2 , f (N1 , N2 , N3 , N4 , N5 , N6 )

6

(10)

where f (N1 , N2 , N3 , N4 , N5 , N6 )

2 2N4 + N5 + N6 2N1 + N2 + N3 + 2N4 + N5 + N6  2 2N4 + N5 + N6 + N2 (N1 + N3 + N4 + N6 )2 2N1 + N2 + N3 + 2N4 + N5 + N6  2 2N4 + N5 + N6 + N3 (N1 + N2 + N4 + N5 )2 2N1 + N2 + N3 + 2N4 + N5 + N6  2 2N4 + N5 + N6 + N4 (N2 − N3 + N5 − N6 )2 1 − 2N1 + N2 + N3 + 2N4 + N5 + N6  2 2N4 + N5 + N6 + N5 (N1 + N3 + N4 + N6 )2 1 − 2N1 + N2 + N3 + 2N4 + N5 + N6  2 2N4 + N5 + N6 + N6 (N1 + N2 + N4 + N5 )2 1 − . 2N1 + N2 + N3 + 2N4 + N5 + N6

= N1 (N2 − N3 + N5 − N6 )2



The numerator can be found from by setting the difference in (8) equal to 0 and rearranging the terms. In the denominator, the number of genes that do not belong to GO category D, N1 , N2 and N3 , are each multiplied by the proportion of genes that belong to the category D and in this proportion, the genes that are present on both lists, N1 and N4 , are given double weight. The number of genes that belong to GO category D, N4 , N5 and N6 , are each multiplied by the proportion of genes that do not belong to the gene class D where the genes that are present on both lists are given double weight.

3.2

C ALCULATION OF p- VALUES

We will use parametric bootstrapping to approximate the distribution of the test statistics under the null hypothesis and find approximate p-values. The test statistic of interest, is either TLAP , TLR , TuDT or TrDT . To calculate the p-values we use the following algorithm: 1. For a given sample of size N , find the maximum likelihood estimates of the parameters under H0 , q˜, and calculate the test statistic t for this sample, denoted ts . 2. Draw B samples from the multinomial distribution with parameters N and q˜. 3. Calculate the test statistic tk for each of these samples, 1 ≤ k ≤ B.

B 1 if tk ≥ ts , thus the p4. The p-value is given as k=1 I(tk ≥ ts ), where I(tk ≥ ts ) = 0 if tk < ts value is the proportion of simulated test statistics greater than or equal to the given test statistics.

4

A SSESSMENT OF METHOD

To assess the performance of the four tests in terms of test size, we perform a simulation study. The test size is the probability of making a type I error, i.e. for rejecting H0 when H0 is true. We 7

consider different sample sizes to evaluate the effect of N on the test size, and we also use several parameter values of q in the multinomial distribution to explore different areas of the null hypothesis. All analysis are performed using the R language, R Development Core Team (2008), except finding the maximum likelihood estimates under H0 which is done using Maple 12.

4.1

S IMULATION ALGORITHM

Given q and N , we draw M datasets from the multinomial distribution with parameters q and N . For each of these datasets we find the p-value using parametric bootstrapping as described in Section 3.2.

4.2

C ASES UNDER STUDY

The data sets are generated from the parameters q in the multinomial distribution and we choose six cases of parameters, given in Table 2 and depicted in Figure 2. Case 1 2 3 4 5 6

q1 0.068 0.043 0.267 0.300 0.400 0.450

q2 0.135 0.130 0.267 0.267 0.200 0.200

q3 0.135 0.130 0.267 0.267 0.200 0.200

q4 0.527 0.348 0.067 0.033 0.100 0.050

q5 0.068 0.174 0.067 0.067 0.050 0.050

q6 0.068 0.174 0.067 0.067 0.050 0.050

TABLE 2: Specification of parameters in the simulation study. The parameters in case 1 and 2 are motivated by the setting for diagnostic tests and chosen as described in the multinomial simulation experiment of Günther et al. (2008). In case 3–6, we first set the probabilities o1 = P (A ∩ B), o2 = P (A ∩ B ∗ ) and o3 = P (A∗ ∩ B) and then p1 = P (D|A ∩ B), p2 = P (D|A ∩ B ∗ ) and p3 = P (D|A∗ ∩ B). From these probabilities q are calculated as follows,

oi (1 − pi ) i = 1, 2, 3 qi = oi pi i = 4, 5, 6. In case 3 o1 = o2 = o3 = 1/3 and p1 = p2 = p3 = 1/5. In case 4 o1 = o2 = o3 = 1/3 and p1 = 1/10 while p2 = p3 = 2/10. The probabilities in case 5 are o1 = 1/2, o3 = o4 = 1/4 and p1 = p2 = p3 = 1/5. Finally, in case 6, o1 = 1/2, o2 = o3 = 1/4, p1 = 1/10 and p2 = p3 = 2/10. The remaining parameter in the multinomial distribution, N , must also be chosen and since we are considering small sample sizes, we use N = 10, 15, 20 and 25. For each of the values of N all the cases given in Table 2 are run. In each of the six cases we draw M = 10000 samples and for each of these samples we draw B = 10000 bootstrap samples.

4.3

R ESULTS

The test size is estimated as the proportion of p-values being less than or equal to the chosen nominal level α. Let W be a random variable counting the number of p-values smaller than or equal to α. 8

F IGURE 2: Values of qi , i = 1, . . . , 6, black: case 1, red: case 2, green: case 3, dark blue: case 4, turquoise: case 5, cyan: case 6. Then W is binomially distributed with size M , the number of p-values generated, and probability α. The estimate of the test size of the test, α ˆ is then α ˆ=

W . M

(11)

We say that the test preserves its test size if α ˆ ≤ α. The smaller α ˆ is, while less than α, the more conservative the test is. If α ˆ > α the test does not preserve its test size and we say that it is too optimistic. We choose α = 0.05 and calculate α ˆ for the four test statistics, six cases and four values of N . Table 3 show the estimated test size for all the combinations of q, N and test statistic. There is one table for each of the six cases. The likelihood ratio test has the largest test size in all the cases and for all values of N except for N = 20 and N = 25 in case 1 and N = 25 in case 2. The unrestricted difference test has the smallest test size in all the cases and for all values of N except for N = 20 and N = 25 in case 1 and N = 25 in case 2, which are the exceptions when the likelihood ratio test has the smallest test size. The test size of the LAP test and the unrestricted difference test lies somewhere in between, which one is the largest varies. When N = 10, all the tests are conservative for all the cases, but when N increases, the test size also increases and in case 1 and 2 the tests do not preserve their test size for N ≥ 15. This also happens in case 3 for N ≥ 20 for the likelihood ratio test and for N = 25 for the restricted difference test. In case 4, 5 and 6, the tests are conservative for all values of N except the likelihood ratio test which test 9

(a) Case 1

N 10 15 20 25

LAP 0.042 0.059 0.063 0.055

N 10 15 20 25

LAP 0.014 0.029 0.041 0.047

LRT 0.046 0.061 0.061 0.052

rDT 0.039 0.060 0.062 0.054

(b) Case 2

uDT 0.035 0.057 0.062 0.055

N 10 15 20 25

LAP 0.038 0.056 0.056 0.058

uDT 0.007 0.024 0.038 0.046

N 10 15 20 25

LAP 0.012 0.026 0.035 0.044

(c) Case 3

LRT 0.028 0.044 0.055 0.056

rDT 0.026 0.041 0.051 0.053

LAP 0.010 0.021 0.032 0.040

LRT 0.022 0.037 0.042 0.049

rDT 0.020 0.033 0.039 0.046

rDT 0.040 0.057 0.057 0.058

uDT 0.018 0.047 0.054 0.057

(d) Case 4

(e) Case 5

N 10 15 20 25

LRT 0.043 0.058 0.057 0.057 LRT 0.023 0.041 0.047 0.051

rDT 0.021 0.036 0.044 0.047

uDT 0.004 0.020 0.029 0.041

(f) Case 6

N 10 15 20 25

uDT 0.008 0.020 0.031 0.040

LAP 0.007 0.014 0.029 0.039

LRT 0.013 0.024 0.037 0.045

rDT 0.012 0.022 0.034 0.041

uDT 0.004 0.011 0.026 0.037

TABLE 3: Estimated test size, α ˆ , for α = 0.05. LAP denotes the LAP test, LRT the likelihood ratio test and uDT and rDT denote the unrestricted and restricted difference test respectively. size is 0.050 for N = 25 in case 4 and 6. Figure 3 shows the estimated test size for the asymptotic methods plotted against the estimated test size for the parametric bootstrap methods, there is one plot for each method for α = 0.05. If the points lie above the diagonal line, the test size of the asymptotic test is higher than the test size of the parametric bootstrap test, and lower if the points are below the line. If the points lie above the horizontal line the test size for the asymptotic test is greater than α = 0.05 and smaller if they lie below the line. Similarly, for the points that lie to the right of the vertical line, the test size for the parametric bootstrap test is higher than 0.05 and it is lower than 0.05 if they lie to the left of this line. We note in particular that for all the cases and for all values of N , the test size for the parametric bootstrap restricted difference test is greater than the test size for the large sample restricted difference test. For the likelihood ratio test, the opposite is true, the test size of the asymptotic likelihood ratio test is greater than the test size of the parametric bootstrap likelihood ratio test. This indicates that the parametric bootstrap test is an improvement compared to the asymptotic likelihood ratio test for small samples. However, the asymptotic likelihood ratio test does not preserve its test size in 15 of the 24 combinations of N and q and in six of those the parametric bootstrap test is still too optimistic. To use the parametric bootstrap restricted difference test does not yield an improvement compared to using the asymptotic restricted difference test. Figure 4 shows the observed level, i.e. the test size, of the tests plotted against the nominal level for N = 10, 15, 20, 25 in case 3 for a chosen nominal level in the range from 0 to 0.10. We see that the test size increases when N increases and also that the tests yield more similar results for the higher values of N . The unrestricted difference test and the LAP test preserve the test size in all the cases while the likelihood ratio test is too optimistic when N = 20 or 25. 10

(a)

(b)

(c)

(d)

F IGURE 3: Estimated test size for the asymptotic versus the small sample tests, for different values of N : Red=10, green=15, dark blue=20, cyan=25.

11

(a)

(b)

(c)

(d)

F IGURE 4: Observed level versus nominal level for (a) N = 10, (b) N = 15, (c) N = 20, (d) N = 25, green = LRT, red = LAP, dark blue = rDT, grey = uDT.

12

5

E XAMPLE FROM G ENE O NTOLOGY

As an example of how the tests perform on a data set from literature, we use part of the data presented by Bye et al. (2008). To estimate the effect of running capacity of the trained rats, we compare the gene expression for the HCR trained rats with the LCR trained rats. This gives us a list of differentially expressed genes between these two groups, we call this list A. It may be of interest to estimate the joint effect of training and inbread running capacity by comparing the trained HCR rats versus the sedentary LCR rats. This gives us another list of differentially expressed genes which we call list B. To determine which genes are differentially expressed a cut-off must be chosen. For each gene, a p-value and an adjusted p-value are calculated. The adjusted p-values are adjusted using the Benjamini-Hochberg step-up procedure to control the false discovery rate (FDR), Benjamini and Hochberg (1995). The cut-off is chosen such that all the genes that have a p-value smaller than or equal to this value are said to be differentially expressed. We will use two different cut-offs and first we choose an FDR cut-off of 0.025 for both lists, which yields 12 genes on list A and 24 genes on list B. These lists are submitted to eGOn, Beisvåg, Jünge, Bergum, Jølsum, Lydersen, Günther, Ramampiaro, Langaas, Sandvik and Lægreid (2006), a web-based tool that automatically translates the lists to GO categories. We are interested in genes annotated to the main category molecular function. There are three genes from the first list and nine genes from the second list annotated to this category. Of these genes two are on both list A and B and therefore there are N = 10 unique genes on the two lists associated with molecular function. The GO tree has several levels corresponding to the hierarchy of the GO categories. One gene can belong to more than one GO category, and given that it belongs to a subcategory it will also belong to the parent categories of this subcategory on the upper levels. After submitting the lists, one has to choose which main category to consider, i.e. either molecular function, biological process or cellular component. Level 1 is the main category itself with no subcategories, e.g. molecular function. The higher level number is chosen, the more subcategories are included, and they are all subcategories of the chosen main category. We choose to display the GO tree at level 3 for the main category molecular function and Table 4 shows the 11 GO categories that are represented on the lists, i.e. the categories which the genes on the lists belong to. A hypothesis test is performed for each category, testing whether it is over-represented or depleted on one of the lists compared to the other list. If we use an FDR cut-off of 0.05 on differential expression instead we get two lists of 30 and 63 genes, 42 of these genes can be classified under the main category molecular function. Within this category, seven genes are present on both lists, seven genes are present only on list A and 21 genes are present only on list B, yielding N = 35 unique genes. Table 5 shows the GO categories for these genes along with their p-values. Table 4 and 5 both include a column with the p-value calculated by eGOn. These p-values are calculated using the asymptotic LAP test. We calculate the p-values for the three other tests, i.e. the likelihood ratio test and the restricted and unrestricted difference test using parametric bootstrapping and compare them to the asymptotic p-values for all four tests. When performing the bootstrapping we draw B = 10000 bootstrap samples for each GO category. Table 6 shows the results for the FDR cut-off of 0.025. The GO category ion binding (GO:0043167) is significant when using either the parametric bootstrap or asymptotic likelihood ratio or restricted difference test. It is also significant when using the asymptotic unrestricted difference test, while it is not significant when using the parametric bootstrap or asymptotic LAP test or the asymptotic unrestricted difference test. None of the other GO categories are significant for any of the tests. 13

GO identifier GO:0005488 GO:0030246 GO:0043167 GO:0008289 GO:0003676 GO:0005515 GO:0046906 GO:0003824 GO:0016787 GO:0016874 GO:0016491

Name binding carbohydrate binding ion binding lipid binding nucleic acid binding protein binding tetrapyrrole binding catalytic activity hydrolase activity ligase activity oxidoreductase activity

p-value 0.157 0.273 0.077 0.279 0.317 0.705 0.279 0.245 0.273 0.317 0.46

n4 2 1 1 0 0 2 0 1 1 0 0

n5 1 0 1 1 0 0 1 1 0 0 1

n6 5 0 0 0 1 5 0 2 0 1 1

TABLE 4: GO categories within molecular function with their corresponding p-values and number of genes on the lists. When considering only the parametric bootstrap tests, in general the likelihood ratio test and restricted difference test give similar p-values which in some cases are smaller than the p-values for the LAP test and the unrestricted difference test. One example is the GO category lipid binding (GO:0008289) where the p-values are 0.160, 0.065, 0.065 and 0.220 for the LAP, likelihood ratio, restricted difference and unrestricted difference tests respectively. Even though lipid binding is not significant for any of these tests, it is not far from being significant for the LRT and rDT tests which is not the case for the LAP and unrestricted difference tests. Together with the example ion binding, this indicates that a GO category may be declared significant more often for the likelihood ratio test and restricted difference tests than with the LAP and unrestricted difference tests. This coincide with the findings in the simulation experiments where the estimated test size in several cases were higher for the likelihood ratio and restricted difference tests than for the LAP and uDT tests. Table 7 shows the results for the FDR cut-off of 0.05. The GO category catalytic activity (GO:0003824) is significant with a p-value 0.

11

Assuming independence between the treatment and placebo group, the joint distribution of X1 and X2 is the product of the two binomial distributions,   n 1 x1 n 2 x2 Pr(X1 = x1 , X2 = x2 ) = p1 (1 − p1 )n1 −x1 · p (1 − p2 )n2 −x2 x1 x2 2 In this situation, the reference set is all possible outcomes (x1 , x2 ) given n1 = 47 and n2 = 283 which is a set of 13 682 outcomes. When calculating p-values various test statistics can be used to define the tail set. One of the test statistics used by Berger and Boos (1994) and Lloyd (2008) is x1 /n1 − x2 /n2 . TT (x1 , x2 ) = (x1 + x2 )(n − x1 − x2 )/(nn1 n2 ) When testing the null hypothesis (7) the tail set of an observed outcome (x1,obs , x2,obs ) is R(x1,obs , x2,obs ) = {(x1 , x2 ) : |TT (x1 , x2 )| ≥ |TT (x1,obs , x2,obs )|}, and when testing the null hypothesis (8) the tail set is R(x1,obs , x2,obs ) = {(x1 , x2 ) : TT (x1 , x2 ) ≥ TT (x1,obs , x2,obs )}. Lloyd (2008) also uses the likelihood ratio test statistic 2   pˆi 1 − pˆi TLR = 2 xi log + (ni − xi )log p˜i 1 − p˜i i=1

where pˆi = xi /ni is the general maximum likelihood estimate for pi , i = 1, 2 and p˜i is the maximum likelihood estimate for pi , i = 1, 2 under the null hypothesis. If we are testing the two-sided null hypothesis p˜1 = p˜2 = (x1 + x2 )/(2n), and if we are testing the one-sided null hypothesis, p˜1 = p˜2 = (x1 + x2 )/(2n) when x1 /n1 ≥ x2 /n2 and p˜i = xi /ni , i = 1, 2, when x1 /n2 < x2 /n2 . These estimates were also used for the E step. For the maximization in the M step, 1001 equally spaced values of p1 = p2 in [0,1] were used for the two-sided test and 5151 equally spaced points in a rectangular grid in the triangular region 0 ≤ p1 ≤ 1, p1 ≤ p2 ≤ 1, were used for the one-sided test. In addition to TT and TLR we propose three additional test statistics. Let π(x; p) denote Pr(X1 = x1 , X2 = x2 ; p). First, we define a simplified version of the likelihood ratio test statistic, ˜ obs ), Tπe (xobs ) = π(xobs ; p which is simply the probability of the observed outcome xobs = (x1,obs , x2,obs ) with the maximum ˜ obs , inserted for p. likelihood estimate of p under H0 for this outcome, p In our second and third additional test statistic, TπE and TπM , we let the probability π(x; p) of an outcome x play the role of a test statistic. It is of course dependent on the unknown parameters, and thus not a test statistic in the ordinary sense. It still makes sense to apply an E or M step to it, yielding the πE p-value  ˜ obs ) ≤ π(xobs ; p ˜ obs ); p ˜ obs ) = ˜ obs ) π(x; p TπE (xobs ) = Pr(π(X; p x∈R∗ (xobs )

˜ obs ) ≤ π(xobs ; p ˜ obs ), and the πM p-value where R∗ (xobs ) consists of those x for which π(x; p  π(x; p), TπM (xobs ) = supp∈P0 Pr(π(X; p) ≤ π(xobs ; p); p) = supp∈P0 x∈R∗ (xobs )

R∗ (x

∗ where obs ) consists of those x for which π(x; p) ≤ π(xobs ; p). Note that the sets R (x) for these two statistics are dependent on the parameter p, as opposed to the tail sets defined by an ordinary statistic. Although TπE and TπM are constructed in a similar manner as p-values constructed by an E and an M step, respectively, their use will be as test statistics, rather than p-values.

12

3.2

C OMPARISON OF TEST STATISTICS

Lloyd (2008) recommends using the E+M p-values in the problem of comparing independent binomial proportions and we want to compare the test size and power of the TT , TLR , Tπe , TπE and TπM test statistics for these p-values when testing both the one-sided and two-sided null hypotheses given in (7) and (8). To calculate the test size, i.e. the probability of rejecting the null hypothesis given that the null hypothesis is true, we generated 10001 equally spaced values of p1 = p2 in [0, 1] and calculated the test size for each value p according to (5) by adding the probabilities of the outcomes that had a p-value less than or equal 0.05, which was the chosen significance level. To assess power, we used 9001 equally spaced points on the line p1 = p2 + 0.1, for which we calculated the power by adding the probabilities of the outcomes that had p-values less than or equal to 0.05, i.e. using (6). Table 6 shows the mean test size and power for the five test statistics followed by an E and an M step. For the two-sided hypothesis, TT , Tπe , TπE and TπM have similar mean test size, of which Tπe has the greatest. TLR yields a smaller test size than the other test statistics. When testing the one-sided hypothesis, the test size of Tπe is 0.0434 which is greater than the test size for the other test statistics which ranges from 0.0382 to 0.0387. We see that the test statistics have similar power, lower for the two-sided test than for the one-sided test, and for the two-sided test, TLR has the smallest power and Tπe has the largest power. For the one-sided test, TπE has the lowest power and Tπe has the largest power. This indicates that for this problem, the Tπe statistic performs best and should be considered an alternative to TT and TLR . Table 7 shows the E+M p-values for the observed outcome (x1 , x2 ) = (14, 48) which for TT and TLR agree with Lloyd (2008). For the two-sided test, the p-value for outcome (14,48), which is used as a test case by Berger and Boos (1994) and Lloyd (2008), is less than 0.05 for all the test statistics except TLR so the null hypothesis would be rejected on a 5% significance level for four of the test statistics. TT yields the smallest p-value. For TLR we reject the one-sided null hypothesis. All test statistics yield the p-value 0.025 for the one-sided test and thus reject the null hypothesis. Hypothesis Two-sided One-sided Two-sided One-sided

TT 0.0472 0.0386 0.3629 0.4421

TLR 0.0435 0.0384 0.3475 0.4410

T πe 0.0479 0.0434 0.3649 0.4639

T πE 0.0478 0.0382 0.3644 0.4388

T πM 0.0472 0.0387 0.3622 0.4427

TABLE 6: Mean test size in the two upper rows and mean test power in the two lower rows for the two-sided and one-sided hypothesis for the E+M p-values using different test statistics.

Hypothesis Two-sided One-sided

TT 0.037 0.025

TLR 0.057 0.025

T πe 0.040 0.025

T πE 0.041 0.025

T πM 0.040 0.025

TABLE 7: E+M p-values from the two-sided and one-sided tests for outcome (14,48). Figure 1 shows the E+M p-values for Tπe test plotted against the E+M p-values for TT for p-values less than or equal to 0.11. The plot shows that Tπe overall yields smaller p-values than TT and as we want p-values that are as small as possible provided that they are valid, Tπe seems to be preferable 13

0.10

over TT .

● ●



0.06

πe

0.08

● ● ● ● ● ● ● ●

● ●●

0.04

●● ●

0.00

0.02

● ●

● ● ● ● ● ● ●● ● ● ●● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

0.00

● ●● ● ● ●

0.02

●● ● ● ● ● ● ●●

● ● ● ● ● ● ● ●

●● ● ● ● ● ● ● ● ● ●● ● ● ● ●

● ● ● ● ●●

●● ●

● ● ●● ●●

●● ● ● ● ● ● ● ●● ● ● ●

● ●● ● ● ● ● ● ● ● ● ●

● ● ● ● ●●

● ●● ● ●● ●

● ● ●

0.04

0.06

0.08

0.10

T

F IGURE 1: E+M p-values for Tπe plotted against E+M p-values for TT . To explain what happens in the E and M steps, we look into the results for three particular outcomes, (16,52), (1,0) and (30,126), which are consecutive decreasing outcomes when ordering by TT . Table 8 shows the value of TT for these outcomes in the third column and the probabilities π(x; p) of the outcomes inserted the maximum likelihood estimates under the null hypothesis in the second column. ˜ (1,0) ) is than π((16, 52); p ˜ (16,52) ) and π((30, 126); p ˜ (30,126) ). When Note how much larger π((1, 0); p performing the E step, this larger probability has a significant effect, as it is included in the sum of probabilities that yields the p-value for outcome (1,0). The fourth column of Table 8 shows the E p-values for the three outcomes. The p-value for outcome (1,0) is 0.05970 which is much greater than the two other p-values and H0 is rejected on a 5% significance level, the other two p-values are both less than 0.01 and H0 will not be rejected. Thus, the decision of whether to reject H0 differ for these three outcomes, even though the values of TT are almost the same. The effect of the large probability of outcome (1,0) also shows in the M p-values in the fifth column in Table 8. We note a large increase in the M p-value for outcome (30,126) as compared to the E p-value, the reason being that the M p-value is at least as large as the E p-value by construction, in particular for (1,0), and next, that the the M p-value is at least as large as the M p-value of (1,0), since (1,0) has a larger test statistic value. According to the M p-values, we would not reject H0 for any of those outcomes as opposite to the E p-values where H0 is rejected for outcome (30,126). To avoid the effect of outcome (1,0), we need a different ordering of the outcomes, in which (1,0) is placed further down on the list where the outcomes are sorted by decreasing value of a test statistic. This is obtained by treating the E p-value as 14

x (16,52) (1,0) (30,126)

˜) π(x; p 0.00049 0.05247 0.00028

TT (x1 , x2 ) 2.45928 2.45756 2.45512

PE 0.00968 0.05970 0.00722

PM 0.02458 0.06112 0.06112

PE+M 0.01029 0.07229 0.00746

TABLE 8: The test statistic TT (x1 , x2 ) and corresponding E, M and E+M p-values for outcomes (16,52), (1,0) and (30,126). a test statistic and then applying the M step. The outcomes that had unusually large p-values after the E step compared to their neighbours, e.g. as outcome (1,0) had, is then moved down on the list when sorting the outcomes by decreasing negative E p-values. The sixth column of Table 8 shows these E+M p-values and we see that outcome (30,126) now has a p-value of 0.00746 and is thus unaffected by the outcome (1,0). This example indicates why it is beneficial to perform an E step prior to the M step. The M step is necessary to obtain valid p-values and therefore the E step alone is not sufficient. The M p-value for the outcome (14,48) is 0.06114, see Lloyd (2008), which is significantly greater than the E+M p-value of Table 7. Our investigation showed that the value 0.06114 also arises from outcome (1,0), because the value of TT is less for outcome (14,48) than for outcome (1,0) which is thus in the tail set of (14,48). The E+M p-value is smaller (0.025) because the E step changed the ordering of the outcomes and (1,0) was placed behind (14,48) so that after performing the E step, (1,0) is no longer in the tail set of (14,48). It should also be noted that π((14, 48); φ), as a function of φ = p1 = p2 has a prominent and narrow peak near φ = 0 (but φ > 0), which explains the large ˜ ) and thus of PE (1, 0). Partial maximization avoids this peak, explaining that the value of π((1, 0); p PM p-value is reasonable as reported by Berger and Boos (1994), though not as small as the E+M p-values as reported by Lloyd (2008).

4

C OMPARING POSITIVE PREDICTIVE VALUES

We now present the main problem of comparing the positive predictive values of two diagnostic tests. The performance of various test statistics and p-values are compared in terms of test size and power.

4.1

P RESENTATION OF THE PROBLEM

Suppose that two diagnostic tests are available for a particular disease of interest. We want to compare the prediction abilities of the two tests, which can be quantified by the positive and negative predictive values. The positive predictive value is defined as the probability that a subject has the disease given that the test is positive and the negative predictive value is the probability that a subject does not have the disease given that the test is negative. Without loss of generality, in this work we will only consider the positive predictive values, as the tests for the negative predictive values can easily be derived along the same lines. We want to test whether the positive predictive value of test A is equal to the positive predictive value of test B against the alternative that they are not equal; H0 : PPVA = PPVB

vs

H1 : PPVA = PPVB .

In this situation we define six random variables that are given in Table 9. Let Y be the vector of these 15

random variables, i.e. Y = (Y1 , Y2 , Y3 , Y4 , Y5 , Y6 ). We assume that Y is multinomially distributed with parameters N and p = (p1 , p2 , p3 , p4 , p5 , p6 ). Thus, the probability function of Y is Pr(Y = y) = N !

6  p yi i

i=1

Variable Y1 Y2 Y3 Y4 Y5 Y6

yi !

.

Description Number of non-diseased subjects with positive test A and B. Number of non-diseased subjects with positive test A and negative test B. Number of non-diseased subjects with negative test A and positive test B. Number of diseased subjects with positive test A and B. Number of diseased subjects with positive test A and negative test B. Number of diseased subjects with negative test A and positive test B. TABLE 9: Definition of the random variables Y1 , . . . , Y6 .

The positive predictive value of test A is PPVA =

p4 + p5 p 1 + p 2 + p 4 + p5

and the positive predictive value of test B is PPVB =

p4 + p 6 . p 1 + p3 + p4 + p 6

The null hypothesis is then H0 : fPPV (p) =

p4 + p5 p 4 + p6 − = 0. p1 + p2 + p4 + p5 p1 + p3 + p4 + p6

(9)

The parameters  p are not completely determined by the null hypothesis, we only know that fPPV (p) = 0 and that 6i=1 pi = 1. Thus, from these two constraints, two of the parameters can be expressed in terms of the four other remaining parameters, but these four parameters will be unknown nuisance parameters. In order to test the null hypothesis (9) we will calculate p-values by enumeration as described by the algorithm in Section 2.3. The first step is to find the reference set. 4.1.1

F INDING THE REFERENCE SET

In the first step in the algorithm for calculating p-values, we enumerate using five nested for-loops to find all possible outcomes having the value of N , which is the number of observations. It can be distributed amongsix non-negative integer random variables having sum N , and the number of  possible outcomes is N 5+5 . This is a general result for the number of distinct unordered selections of N elements from six elements drawn with replacement. Figure 2 shows the number of outcomes on a log10 scale plotted against N . The number of outcomes grows very quickly when N increases. When N = 25 there are 142506 possible outcomes and when N = 50, there are nearly 3.5 million possible outcomes. 16

8

























6

● ● ● ● ●

4





0

2





0

20

40

60

80

100

N

F IGURE 2: Number of possible outcomes on log10 scale as a function of N . 4.1.2

C ALCULATING THE PROBABILITY OF THE OBSERVED OUTCOME

In the setting of comparing positive predictive values we will focus on calculating the probability of an outcome either by substituting maximum likelihood estimates for p, maximize over the parameter space for p or integrate out p by a Bayesian approach. This results in the estimation (E), maximization (M) or combinations of these like the estimation and maximization (E+M) p-values, and the Bayesian prior predictive p-values. As far as we know, there is no sufficient or ancillary statistic for p in this problem. To calculate the p-values, a test statistic T (Y ) must be chosen. There are several possible test statistics for this problem, and they will be presented in Section 4.1.3. ˜ for p E STIMATION AND MAXIMIZATION If we substitute the maximum likelihood estimates p under H0 , the E p-value for an outcome y obs is ˜ obs ) = PE (y obs ) = Pr(T (Y ) ≥ T (y obs ); p



N!

y∈R(yobs )

6 i  p˜yi,obs i=1

(10)

yi !

where the tail set R(y) is defined by the chosen test statistic, T (Y ), and p˜i,obs is the maximum likelihood estimate under H0 for pi , i = 1, . . . , 6 for the outcome y obs . By maximizing the probability of the outcome y obs over the parameter space P0 where fPPV (p) = 0, the M p-value is given by PM (y obs ) = supp∈P0 Pr(T (Y ) ≥ T (y obs ); p) = supp∈P0

 y∈R(yobs )

N!

6 i  pyi,obs i=1

yi !

,

(11)

where R(y) is defined by the chosen test statistic. When we calculate the E+M p-value, the expression is the same as in (11), but the tail set is then defined by the p-values calculated from (10). We will also 17

consider the double estimation (E2 ) p-values, where we first calculate p-values from (10) and then use these p-values as test statistics to define the tail set when calculating p-values from (10) once more. Finally we will maximize the E2 p-values by using these as test statistics to define the tail set in (11), which results in E2 M p-values. We also consider the Bayesian prior predictive p-values, which requires a different I NTEGRATION approach. The starting point is still that the probability of an outcome under the null hypothesis is unknown because the parameters p are not completely specified. Instead of estimating p or maximizing the p-values over p, we weigh them according to how likely they are under the null hypothesis. We start out by conditioning on the parameter p. Let π(y|H0 ) be the probability of the outcome y under the null hypothesis (9). Then π(y|p) · π(p|H0 )dp (12) π(y|H0 ) = P0

The first factor of the integrand, the probability of y given p is simply the multinomial distribution, i.e., 6  pyi i π(y|p) = N ! , yi ! i=1

and π(p|H0 ) is the probability density function for p under the null hypothesis.  Since p6 = 1 − 5i=1 pi we first reduce the problem to five unknown parameters. Let 1 − p 1 − p 2 − p 3 − p5 p4 + p5 − . p1 + p2 + p4 + p5 1 − p 2 − p5  which is fPPV (p) (9) with 1 − 5i=1 pi inserted for p6 . z=

(13)

Under the null hypothesis z = 0 and from this an expression for p4 can be derived, yielding four unknown parameters. We change variables from p1 , p2 , p3 , p4 , p5 to p1 , p2 , p3 , z, p5 . The vector of the new parameters is denoted p∗ in the following. Then π(p∗ |z = 0) ∝ π(p∗ , z = 0) so therefore we start by finding π(p∗ , z). We use the formula for change of variables, π(p) = π(p∗ , z) · |J|, where J is the Jacobian determinant,      J =    

1 0 0

∂z ∂p1

0

0 1 0

∂z ∂p2

0

0 0 1

∂z ∂p3

0

0 0 0

∂z ∂p4

0

      = ∂z  ∂p4 ∂z  ∂p5  1  0 0 0

+p2 The absolute value |J| is (p1 +pp21+p 2 . As a prior distribution for p we first apply the Dirichlet 4 +p5 ) distribution with parameters α1 = α2 = . . . α5 = 1, thus π1 (p) is constant. Then

π(p∗ |z = 0) ∝ π(p∗ , z = 0) = π1 (p) · |J|−1 ∝ and π(p|H0 ) =

(p1 + p2 + p4 + p5 )2 , p1 + p2

1 (p1 + p2 + p4 + p5 )2 · , k1 p1 + p2 18

where k1 is a normalizing constant which can be found from

 P0

π(p|H0 ) = 1.

This leads to the following expression for the probability under H0 of an outcome y,  6  N !  pyi i (p1 + p2 + p4 + p5 )2 π(y|H0 ) = dp. yi ! p 1 + p2 P0 k1

(14)

i=1

For details on how to numerically compute the integral, see Section 5. The p-value is the sum of these probabilities for outcomes in the tail set of the observed outcome. In this setting we use the probability of an outcome under H0 as the test statistic and the tail set for an outcome y obs is defined as the outcomes for which π(y|H0 ) ≤ π(y obs |H0 ), so the p-value is given as PPP,1 (y obs ) = Pr(π(Y |H0 ) ≤ π(y obs |H0 ))   6  N !  pyi i (p1 + p2 + p4 + p5 )2 dp. = yi ! p 1 + p2 P0 k1

(15)

i=1

π(y|H0 )≤π(yobs |H0 )

To assess the effect of the choice of prior, we choose the non-uniform prior π2 (p) ∝ p1 as an alternative prior, which leads to p1 (p1 + p2 + p4 + p5 )2 π2 (p|H0 ) = , k2 (p1 + p2 ) where k2 is a normalizing constant and the probability under H0 of an outcome y is  6  N !  pyi i p1 (p1 + p2 + p4 + p5 )2 π(y|H0 ) = dp. yi ! p 1 + p2 P0 k2

(16)

i=1

We see that the probability of the outcome depends on the chosen prior π2 (p) as expected. The pvalue, which is the sum of the probabilities in (16) for the outcomes that are in the tail set of the one observed, is denoted PPP,2 and given by PPP,2 (y obs ) = Pr(π(Y |H0 ) ≤ π(y obs |H0 ))  6   N !  pyi i p1 (p1 + p2 + p4 + p5 )2 = dp. yi ! p 1 + p2 P0 k2

(17)

i=1

π(y|H0 )≤π(yobs |H0 )

An alternative formulation of the null hypothesis (9) is ∗ (p) = p1 (p1 + p2 + p3 + p4 + 2p5 − 1) − p2 (1 − p1 − p2 − p3 − p5 ) + p3 (p4 + p5 ) = 0. (18) fPPV

In this case the absolute value of the Jacobi determinant will be p1 + p3 and if we assume the uniform Dirichlet prior π1 (p), 1 1 · , π(p|H0 ) = k3 p1 + p3 where k3 is a normalizing constant and the probability under H0 of an outcome is  6  1 N !  pyi i · dp. π(y|H0 ) = yi ! p1 + p 3 P0 k3 i=1

19

(19)

This probability is clearly not equal to the probability (14) and this is an example of Borel’s paradox. The p-value is the sum of the probabilities in (19) for the outcomes in the tail set of the observed outcome. It is denoted PPP,3 and given by PPP,3 (y obs ) = Pr(π(Y |H0 ) ≤ π(y obs |H0 ))  6   1 N !  pyi i dp. = y i ! p 1 + p3 P0 k3 4.1.3

(20)

i=1

π(y|H0 )≤π(yobs |H0 )

D EFINING THE TAIL SET

The tail set of an outcome y is defined by a test statistic T (y). To test the null hypothesis in (9) there are several tests available that are used for large samples, see Günther, Bakke, Lydersen and Langaas (2008) for a detailed description of four possible test statistics. In this work we will use these test statistics to define the tail set while ignoring their asymptotic distribution. The first test statistic is the likelihood ratio test statistic which is the ratio between the maximum likelihood under the null hypothesis and the general maximum likelihood, of which by convenience the logarithm is taken and which is multiplied by −2, Casella and Berger (2002). In our multinomial situation, it is given as TLR = −2 · log

6  supp∈P0 L(p|y) yi · (log p˜i − log pˆi ), = −2 supp∈P L(p|y)

(21)

i=1

where p˜i is the restricted maximum likelihood estimates of pi , i.e. under H0 , i = 1, . . . , 6, and pˆi is the unrestricted general maximum likelihood estimates for the multinomial distribution, i.e., pˆi = ni /N , i = 1, . . . , 6, Johnson, Kotz and Balakrishan (1997). The maximum likelihood estimates under H0 , p˜i , i = 1, . . . , 6 cannot be written in closed form, but can be found analytically by solving a system of equations arising from the method of Lagrange multipliers, which we did using Maple 12. More details can be found in Günther et al. (2008), Section 3.1.2. The difference test statistic is given by Tg (y) =

(g(Y ) − g(μ))2 GT (μ)Σ G(μ)

(22)

where g(Y ) is an estimator for the difference fPPV (p) in (9), i.e. g(Y ) =

Y4 + Y5 Y4 + Y6 − , Y1 + Y2 + Y4 + Y5 Y1 + Y3 + Y4 + Y6

and μ = E(Y ) = N · p, Σ = Cov(Y ) = N (Diag(p) − pT p), G is a vector containing the first order partial derivatives of g(Y ) with respect to the components of Y , GT is the transpose of G, and G(μ) is G with μ inserted for Y . Under the null hypothesis g(μ) = 0. G(μ) and Σ depend on the unknown parameters p which must be estimated when calculating the test statistic. We can either ˆ and then we insert the unrestricted maximum likelihood estimates for the multinomial distribution p refer to the test as the unrestricted difference test (uDT) and denote the test statistic TuDT , or insert ˜ . Then the test is referred to as the restricted restricted maximum likelihood estimates under H0 , p difference test (rDT) and the test statistic is denoted TrDT . 20

Leisenring, Alonzo and Pepe (2000) presented a score test based on generalized estimating equations. We denote this test the LAP test. The test statistic can be written as TLAP =

((Y1 + Y2 + Y4 + Y5 )(Y4 + Y6 ) − (Y1 + Y3 + Y4 + Y6 )(Y4 + Y5 ))2 , h(Y1 , Y2 , Y3 , Y4 , Y5 , Y6 )

(23)

where h(Y1 , Y2 , Y3 , Y4 , Y5 , Y6 )

2 2Y4 + Y5 + Y6 2Y1 + Y2 + Y3 + 2Y4 + Y5 + Y6  2 2Y4 + Y5 + Y6 + Y2 (Y1 + Y3 + Y4 + Y6 )2 2Y1 + Y2 + Y3 + 2Y4 + Y5 + Y6  2 2Y4 + Y5 + Y6 + Y3 (Y1 + Y2 + Y4 + Y5 )2 2Y1 + Y2 + Y3 + 2Y4 + Y5 + Y6  2 2Y4 + Y5 + Y6 + Y4 (Y2 − Y3 + Y5 − Y6 )2 1 − 2Y1 + Y2 + Y3 + 2Y4 + Y5 + Y6  2 2Y4 + Y5 + Y6 + Y5 (Y1 + Y3 + Y4 + Y6 )2 1 − 2Y1 + Y2 + Y3 + 2Y4 + Y5 + Y6  2 2Y4 + Y5 + Y6 + Y6 (Y1 + Y2 + Y4 + Y5 )2 1 − . 2Y1 + Y2 + Y3 + 2Y4 + Y5 + Y6

= Y1 (Y2 − Y3 + Y5 − Y6 )2



These four test statistics will be used to define the tail set for the E and M p-values. Other test statistics are possible and we suggest three, Tπe , TπE and TπM , which are defined in the same way as they were for the independent binomial proportions (Section 3), but with the multinomial distribution with six parameters substituted for the joint distribution of two independent binomial distributions, that is, ˜ obs ), Tπe (y obs ) = π(y obs ; p

(24)

˜ obs ) ≤ π(y obs ; p ˜ obs ); p ˜ obs ) TπE (y obs ) = Pr(π(Y ; p

(25)

TπM (y obs ) = sup Pr(π(Y ; p) ≤ π(y obs ; p); p).

(26)

and p∈P0

Finally, we also consider the Bayesian prior predictive p-values, that in addition to being p-values in their own right, can be used as test statistics to define the critical region for the E and M p-values, and we denote them TPP where  π(y | p) · π(p | H0 )dp. (27) TPP (y obs ) = π(y|H0 )≤π(yobs |H0 ) P0

4.2

R ESULTS

In the PPV setting, we have studied the performance of the different types of p-values with respect to test statistics and the parameters in the multinomial distribution. The performance will be evaluated in terms of test size and test power, which are calculated as given by (5) and (6). We choose the significance level α = 0.05. 21

4.2.1

E VALUATION OF TEST SIZE

The test statistics considered were the LAP, likelihood ratio, unrestricted difference and restricted difference test statistics, TLAP , TLR , TuDT and TrDT . In addition we used the πe -probabilities and the πE , πM and Bayesian p-values as test statistics, i.e. Tπe , TπE , TπM and TPP . For each of these test statistics and for the chosen values of N we calculated the E, M, E+M, E2 and E2 M p-values. We also considered the performance of the Bayesian p-values as p-values in itself. The performance of the test statistics can depend highly on the parameters p in the multinomial distribution. Both the overall mean performance as well as the performance for specific values are evaluated. For the mean performance a set of 10385 values of p in P0 is used, which are obtained by using a four dimensional grid for the four free parameters where each side in the grid is divided into 30 subintervals, and the 10385 values of p in the grid that belong to P0 are then the cases we consider. For this set of cases we calculate the mean test size, i.e. we calculate the test size from (5) for each case and then find the average for the 10385 cases. In addition, six specific cases of p in P0 are evaluated, see Table 10. These are the same cases as in Günther, Bakke and Langaas (2009) where the reasoning for choosing these values can be found. Case 1 2 3 4 5 6

p1 0.068 0.043 0.267 0.300 0.400 0.450

p2 0.135 0.130 0.267 0.267 0.200 0.200

p3 0.135 0.130 0.267 0.267 0.200 0.200

p4 0.527 0.348 0.067 0.033 0.100 0.050

p5 0.068 0.174 0.067 0.067 0.050 0.050

p6 0.068 0.174 0.067 0.067 0.050 0.050

TABLE 10: Specification of multinomial parameters under H0 . The size of the multinomial sample determines how many possible outcomes there are and is an interesting factor to consider. We want to investigate whether the performance of the test statistics depends on sample size, in particular for small sample sizes, so we use N = 10, 15, 20, 25. When maximizing the p-values over p in P0 we used a four-dimensional grid since there are four free parameters, with 50 points on each side. In addition, the maximum likelihood estimates for all possible outcomes given N were included in the grid. The Bayesian p-values were calculated on the grid with 50 points in each side, for further details see Section 5. Table 11 shows the mean test size for all the test statistics, values of N and type of p-values investigated. We first compare the performance of the different types of p-values, E, M, E+M, E2 and E2 M. The M p-values yield the smallest test size for all values of N for all the test statistics except for the likelihood ratio test when N = 20, there the E2 M p-values yields the smallest test size. In general the E and E2 p-values result in larger test sizes than the E+M and E2 M p-values which we would expect since the E and E2 p-values are not valid, whereas the E+M and E2 M p-values are. The exception is TuDT , when N = 10, the E p-values yield smaller test size than the E+M and E2 M p-values, and when N = 15, the E p-values yield smaller test size than the E2 M p-values. The E2 p-values yield larger test sizes than the E p-values, except for Tπe . Next we compare the performance of the different test statistics. First we consider the test statistics that originated from large samples where their asymptotic distributions were utilized, i.e. the LAP, 22

N 10 10 10 10 10 15 15 15 15 15 20 20 20 20 20 25 25 25 25 25

p-value M E E+M E2 E2 M M E E+M E2 E2 M M E E+M E2 E2 M M E E+M E2 E2 M

LRT 0.0164 0.0381 0.0285 0.0456 0.0273 0.0256 0.0451 0.0354 0.0470 0.0332 0.0333 0.0469 0.0395 0.0488 0.0319 0.0335 0.0483 0.0385 0.0492 0.0359

LAP 0.0092 0.0282 0.0198 0.0420 0.0247 0.0122 0.0395 0.0243 0.0466 0.0303 0.0140 0.0430 0.0308 0.0477 0.0331 0.0124 0.0449 0.0324 0.0479 0.0353

uDT 0.0019 0.0179 0.0181 0.0395 0.0220 0.0006 0.0297 0.0234 0.0462 0.0302 0.0002 0.0365 0.0283 0.0492 0.0329 0.0001 0.0403 0.0313 0.0496 0.0343

rDT 0.0130 0.0352 0.0285 0.0439 0.0268 0.0242 0.0447 0.0354 0.0470 0.0350 0.0277 0.0475 0.0386 0.0482 0.0339 0.0136 0.0486 0.0403 0.0490 0.0360

πM 0.0177 0.0388 0.0286 0.0435 0.0279 0.0227 0.0438 0.0364 0.0479 0.0355 0.0239 0.0468 0.0392 0.0490 0.0365 0.0214 0.0479 0.0403 0.0497 0.0362

πE 0.0088 0.0359 0.0242 0.0441 0.0213 0.0085 0.0376 0.0317 0.0453 0.0305 0.0091 0.0394 0.0337 0.0465 0.0341 0.0089 0.0405 0.0314 0.0482 0.0374

πe 0.0145 0.0643 0.0207 0.0549 0.0260 0.0138 0.0650 0.0274 0.0538 0.0311 0.0130 0.0640 0.0317 0.0544 0.0324 0.0131 0.0636 0.0343 0.0540 0.0349

PP1 0.0104 0.0330 0.0257 0.0441 0.0297 0.0061 0.0395 0.0339 0.0472 0.0374 0.0024 0.0433 0.0369 0.0491 0.0381 0.0009 0.0449 0.0402 0.0494 0.0381

PP3 0.0190 0.0366 0.0297 0.0435 0.0299 0.0150 0.0428 0.0360 0.0480 0.0370 0.0088 0.0458 0.0378 0.0491 0.0388 0.0032 0.0469 0.0398 0.0494 0.0379

TABLE 11: Mean test size for the 10385 values of p in P0 , for all the test statistics and M, E, E+M, E2 and E2 M p-values when the chosen significance level is α = 0.05. The top row denotes the test statistics, LRT is given in (21), LAP in (23), uDT and rDT in (22) with unrestricted and restricted maximum likelihood estimates inserted for p respectively, πM in (26), πE in (25), πe in (24), PP1 in (15) and PP3 in (20). N is the sample size.

23

likelihood ratio, unrestricted difference and restricted difference test statistics. In general, for all types of p-values, TLR and TrDT have the largest test size, TuDT and TLAP have the smallest test size and TuDT has mostly smaller test size than TLAP . TLR has the largest test size for the M p-values. When N = 20, TuDT has the largest test size for the E2 p-values, a result for which we have found no apparent reason. If we look into the performance of Tπe , TπE and TπM we see that Tπe has the largest test size compared to all the other test statistics for the E and E2 p-values for all N . TπM has the largest test size for the M, E+M and E2 M p-values for N = 10, for the E+M and E2 M p-values when N = 15 and for the E2 M p-values when N = 20, whereas TπE has the largest test size for the E2 M p-values when N = 25. We note that when N increases the likelihood ratio or restricted difference test performs better than the πM statistic for the M, E+M and E2 M p-values, therefore the πM test statistic is probably a better choice only when N is small. What is worth noting, is that for the test statistics that are most conservative with respect to test size for the M p-values, the gain is greater when performing one or more E step(s) before the M step compared to the test statistics for which the test size for the M p-values is less conservative. This is particularly evident for TuDT , TLAP and TπE compared to TLR . The test size for TLR increases less than the test size for the other three test statistics when comparing the M and E+M p-values. For TLR the test size is also reduced if two E steps instead of one are applied before the M step, whereas for TLAP and TuDT the test size increases when two E steps are applied before the M step. The mean test size increases when N increases. Comparing the test sizes for N = 10 to the test sizes for N = 25 reveals an increase for all test statistics and type of p-values except for some of the M p-values which have test size that is approximately 0. As an illustration, the mean test size for the M p-value for the likelihood ratio test statistic is 0.0164 when N = 10 and 0.0335 when N = 25. In addition to the test statistics discussed so far, the Bayesian prior predictive p-values PPP,1 and PPP,3 , originated from using the same prior π1 (p), but different formulations of the null hypothesis, were used as test statistics to compute E, M, E+M, E2 and E2 M p-values. When N = 10, the PP3 test statistic yields larger test size than all the other test statistics for the M, E+M and E2 M p-values. Otherwise the test size of these two test statistics lies between the test size of the other test statistics, not following a clear pattern, except that the PP3 yields larger test size than the PP1 in general. We also evaluated the performance of all the test statistics, values of N and types of p-values for the six multinomial cases of Table 10. Table 12 shows the test size for TLR for N = 10 and N = 25. We see that the E and E2 p-values yield a test size greater than 0.05 in case 1–5 for N = 25 and thus proves that these p-values are not valid. We also see that the test size is greater when N = 25 compared to N = 10. The results for the other test statistics and values of N are omitted in this report since the findings in respect to test statistics and p-values in the six specific multinomial cases were similar to the overall findings, however the test size for all test statistics was clearly dependent of the chosen multinomial cases, i.e. the parameter p in the multinomial distribution. In general, which can also be seen in Table 12, case 1 and 2 have larger test size than case 3–6. This trend was consistent through the different test statistics, types of p-values and N and indicates that when comparing test sizes the multinomial case chosen will have a large influence the test size, but it will not change the conclusions with respect to which test statistic or which p-value results in the largest or smallest test size. Figure 3 shows histograms for the test size in the 10385 cases under H0 for the M, E, E+M, E2 and E2 M p-values for each of the test statistics TLAP , TLR , TuDT , TrDT , TπM , TπE , Tπe and TPP,1 for N = 10. We see that for TLR , TrDT , TπE , TπM ,and TPP,1 the distribution of the test size for the E

24

N 10 10 10 10 10 25 25 25 25 25

p-value M E EM E2 E2 M M E EM E2 E2 M

case 1 0.0199 0.0412 0.0281 0.0475 0.0322 0.0431 0.0528 0.0431 0.0510 0.0401

case 2 0.0193 0.0426 0.0366 0.0500 0.0294 0.0428 0.0529 0.0435 0.0514 0.0387

case 3 0.0097 0.0281 0.0242 0.0379 0.0200 0.0386 0.0573 0.0458 0.0573 0.0415

case 4 0.0071 0.0218 0.0193 0.0302 0.0157 0.0391 0.0563 0.0448 0.0569 0.0404

case 5 0.0061 0.0197 0.0170 0.0333 0.0166 0.0313 0.0495 0.0389 0.0523 0.0367

case 6 0.0035 0.0122 0.0111 0.0220 0.0101 0.0287 0.0441 0.0339 0.0469 0.0322

TABLE 12: Test size for the likelihood ratio test statistic for the six multinomical cases. p-values is skewed towards the right compared to the distribution for the M p-values, and we note that the test size is sometimes larger than 0.05, showing that the E p-values are not valid. The E+M p-values preserve the skewed distribution while shifting it to the left so that no test size is greater than 0.05. For the LAP and uDT test statistics, we note that the distribution of test size for the E p-values is not skewed in the same way, but the E2 p-values are, so apparently it is necessary to do two E steps before maximization for the LAP and uDT statistics. Figure 3 illustrates what happens under the E and M steps. To obtain an even better understanding of the effect of the E and M steps, we consider two possible outcomes when N = 10, y 1 = (1, 3, 0, 6, 0, 0) and y 2 = (1, 0, 1, 3, 5, 0). Table 13 shows the p-values for these outcomes using the likelihood ratio and LAP test statistics. Outcome y1 y2 y1 y2

Test statistic TLRT = 4.159 TLRT = 4.077 TLAP = 3.932 TLAP = 2.492

M 0.1025 0.1025 0.1048 0.2297

E 0.0940 0.0349 0.0805 0.0978

E+M 0.1108 0.0450 0.1056 0.1329

E2 0.0770 0.0279 0.0768 0.0654

E2 M 0.1062 0.0408 0.1064 0.0855

TABLE 13: P -values for the likelihood ratio and LAP test statistics for the outcomes y 1 = (1, 3, 0, 6, 0, 0) and y 2 = (1, 0, 1, 3, 5, 0). Let us first consider the p-values for the likelihood ratio test statistic. We note that for y 2 with a 5% significance level, we would reject the null hypothesis based on the E p-value and not reject it based on the M p-value. Since the likelihood ratio test statistic is greater for y 1 than for y 2 , the M p-value for y 2 will necessarily be greater than for y 1 , which will be greater than the E p-value for y 1 . Since the E p-value is less for y 2 than for y 1 , y 1 will not be in the tail set for y 2 when performing the M step after the E step and the E+M p-value results in rejection of the null hypothesis on a 5% significance level for y 2 . Since the E and E2 steps alone do not result in valid p-values, we should perform an M step afterwards. But as we see, the E step(s) are means to avoid certain outcomes having a large p-value because of other outcomes having greater test statistics and artifically large E and M p-values compared to other outcomes with similar magnitude of the test statistics.

25

E2 2000

E2 M

2000

2000

E+M

2000

E

2000

M

LRT 0.05

2000

0.05

2000

0.05

2000

0.05

2000

2000

0.05

LAP 0.05

2000

0.05

2000

0.05

2000

0.05

2000

2000

0.05

uDT 0.05

2000

0.05

2000

0.05

2000

0.05

2000

2000

0.05

rDT

2000

0.05

2000

0.05

2000

0.05

2000

0.05

2000

0.05

πM 0.05

2000

0.05

2000

0.05

2000

0.05

2000

2000

0.05

πE 0.05

2000

0.05

2000

0.05

2000

0.05

2000

2000

0.05

πe 0.05

2000

0.05

2000

0.05

2000

0.05

2000

2000

0.05

PP1 0.05

0.05

0.05

0.05

0.05

F IGURE 3: Distribution of test size for the various test statistics and p-values, N = 10, the x-axis is cut at 0.08 and the y-axis at 3000. 26

For the LAP test statistic, we do not see the same effect, even though the p-value with the smallest test statistic, y 2 , has an M p-value greater than the M p-value for y 1 . Here the E p-value for y 2 is also greater than the one for y 1 , and thus this ordering is preserved when performing an M step after the E step. The E2 p-value however, is smaller for y 2 than y 1 , and performing the M step afterwards does not change this. Here y 1 is an example of an outcome for which the decision of rejecting the null hypothesis does not only depend on the type of p-value, but also of the chosen test statistic. Using the likelihood ratio test statistic, we reject the null hypothesis on a 5% confidence level for all of the p-values E, E+M, E2 and E2 M. If we use the LAP test statistic instead, we do not reject it for any of the p-values. Table 14 shows the mean test size for the Bayesian prior predictive p-values for N = 10, 15, 20, 25 ∗ (p) and using a grid with 50 points in each direction for both formulations of H0 , i.e. fPPV (p) and fPPV both priors for p. We see that the test size depends highly on the choice of prior and formulation of ∗ (p) = 0 as H , it increases to H0 . The test size is smallest using the uniform Dirichlet prior and fPPV 0 around 0.055 with fPPV (p) = 0 as H0 and if we choose the non-uniform Dirichlet prior π2 (p), the test size becomes very high. Clearly, the non-uniform prior is not a good choice and it also indicates that the choice of prior has a larger effect than how we choose to formulate the null hypothesis. Comparing these results to the results when using the prior predictive p-values as test statistics to define the tail sets for the M, E, E+M and E2 M p-values in Table 11 shows that the M step reduces the test size in all cases for all values of N and for both formulations of H0 as expected. The test size for the E2 p-values is higher than for the Bayesian p-values in many of the cases and in some cases, e.g. ∗ (p) = 0, the test case 5 for N = 15, 20, 25 for H0 : fPPV (p) = 0 and for N = 20, 25 for H0 : fPPV size increases for all the p-values except the M p-values. Table 14 also shows that the Bayesian prior predictive p-values are not valid since the test size is larger than the significance level. N 10 15 20 25

PP1 0.0561 0.0557 0.0556 0.0552

PP2 0.1272 0.1465 0.1533 0.1556

PP3 0.0489 0.0491 0.0494 0.0494

TABLE 14: Test size for the Bayesian positive predictive p-values using different priors, formulation of H0 and values of N , PP1 is given in (15), PP2 is given in (17) and PP3 is given in (20). The prior predictive p-values for the two outcomes y 1 = (1, 3, 0, 6, 0, 0) and y 2 = (1, 0, 1, 3, 5, 0) are given in Table 15. We see that the three Bayesian p-values are quite different for both outcomes. For y 2 we reject the null hypothesis, whereas for y 1 we do not reject the null hypothesis. The two p-values both found from the model with uniform Dirichlet prior are similar for y 2 , but for y 1 it is the two p-values that are based on the same formulation of H0 that are similar. The null hypothesis is rejected for y 2 , but not for y 1 for any of the p-values. 4.2.2

E VALUATION OF TEST POWER

We would like to compare the test power of TLR , TLAP , TuDT , TrDT and TπM . Since the results of the test size comparisons showed that the M p-values have the smallest test sizes and since the E and E2 p-values are not valid, we consider only the E+M and E2 M p-values when comparing test power. We 27

Outcome y1 y2

PP1 0.1086 0.0382

PP2 0.1084 0.0171

PP3 0.0506 0.0323

TABLE 15: Bayesian prior predictive p-values for the outcomes y 1 = (1, 3, 0, 6, 0, 0) and y 2 = (1, 0, 1, 3, 5, 0).

expect that the power increases with N and we used N = 10 and N = 25 to investigate the magnitude of the increase. The power was calculated the same way as the test size except that the values of p are chosen so that p does not satisfy the null hypothesis (9). We wanted to compare the test power in specific multinomial cases and we chose six sets of the parameters p, these were denoted case 7–12 and are given in Table 16. They were chosen because of their decreasing distance from H0 which is measured by the magnitude of fPPV (p). If fPPV (p) is close to 0, then p nearly satisfies H0 while the greater |fPPV (p)| is, the further away from H0 p is. Since the power in our chosen cases may not be representative for a randomly chosen case, we also generate 10385 random cases under H1 , by drawing 10385 vectors of length 6 from the uniform distribution and scaling each vector to sum to 1. Case 7 8 9 10 11 12

p1 0.06 0.01 0.20 0.01 0.06 0.17

p2 0.01 0.10 0.05 0.07 0.12 0.12

p3 0.44 0.44 0.24 0.27 0.18 0.18

p4 0.26 0.01 0.28 0.28 0.14 0.21

p5 0.22 0.43 0.22 0.26 0.35 0.16

p6 0.01 0.01 0.01 0.11 0.15 0.16

fPPV (p) 0.52 0.76 0.27 0.29 0.18 0.05

TABLE 16: Specification of cases under H1 . Table 17 and 18 shows the test power for the chosen cases, test statistics and p-values when N = 10 and N = 25 respectively. As expected the power increases when N increases. The test statistics TuDT and TLAP have the smallest power except for the E2 M p-values in case 6 when N = 25. When N = 25 the TπM statistic has the highest power except in case 6 for the E2 M p-values. For N = 10, the TLR statistic has highest power for the E+M p-values in four of six cases, while only in one case for the E2 M p-values. The E+M p-values yields in general higher power than the E2 M p-values, except for the TLAP and TuDT statistics when N = 10. If we compare these results to the calculated mean power for all the power cases, given in the last column of Table 17 and 18, we see that TLAP and TuDT have smaller power for the E+M than the E2 M p-values when N = 10 and also when N = 25 for TuDT . TLR has the largest power for the E+M p-values when N = 10, otherwise the πM has the largest test power. When comparing the power in each of the six cases by considering the value of fPPV (p) in Table 16 we see that the power seems to decrease when fPPV (p) decreases which we would expect since in cases that are far from H0 the test should have higher power than in cases closer to H0 . However, in case 7 and 8 fPPV (p) is 0.52 and 0.76 respectively and yet case 7 has the highest power, particularly 28

Test statistic TLRT TLAP TuDT TrDT T πM TLRT TLAP TuDT TrDT T πM

p-value E+M E+M E+M E+M E+M E2 M E2 M E2 M E2 M E2 M

case 7 0.8344 0.7165 0.6997 0.8372 0.8271 0.8219 0.7492 0.7640 0.8240 0.8274

case 8 0.7210 0.7203 0.3815 0.7173 0.7568 0.7159 0.7044 0.3903 0.7160 0.7500

case 9 0.4125 0.2909 0.3269 0.4096 0.3996 0.4140 0.3498 0.3712 0.4081 0.4112

case 10 0.2332 0.1863 0.1446 0.2286 0.2119 0.2020 0.1857 0.1923 0.2162 0.2031

case 11 0.1161 0.0801 0.0556 0.1108 0.1080 0.1006 0.0848 0.0755 0.1028 0.1060

case 12 0.0547 0.0360 0.0345 0.0529 0.0472 0.0463 0.0416 0.0442 0.0476 0.0467

mean 0.1064 0.0810 0.0697 0.1044 0.1016 0.0967 0.0881 0.0844 0.0977 0.0994

TABLE 17: Test power for the E and E2 M p-values in case 7–12 and mean over 10385 cases for N = 10. Test statistic TLRT TLAP TuDT TrDT T πM TLRT TLAP TuDT TrDT T πM

p-value E+M E+M E+M E+M E+M E2 M E2 M E2 M E2 M E2 M

case 7 0.9979 0.9967 0.9970 0.9985 0.9987 0.9976 0.9961 0.9969 0.9980 0.9983

case 8 0.9967 0.9935 0.9592 0.9976 0.9978 0.9963 0.9935 0.9622 0.9969 0.9969

case 9 0.8014 0.7897 0.8056 0.8179 0.8296 0.7908 0.7866 0.7874 0.7938 0.8091

case 10 0.4974 0.4713 0.4788 0.5275 0.5378 0.4793 0.4609 0.4660 0.5017 0.5080

case 11 0.1936 0.1686 0.1699 0.2193 0.2257 0.1835 0.1691 0.1704 0.2005 0.2070

case 12 0.0537 0.0509 0.0547 0.0585 0.0586 0.0499 0.0515 0.0504 0.0501 0.0510

mean 0.2163 0.2038 0.2075 0.2241 0.2242 0.2080 0.2063 0.2068 0.2087 0.2101

TABLE 18: Test power for the E and E2 M p-values in case 7–12 and mean over 10385 cases for N = 25.

when N = 10. We see the same in case 9 and 10, fPPV (p) is then 0.27 and 0.29, and the power in case 9 is a lot higher than in case 10. The mean value of |fPPV (p)| for the 10385 cases is 0.13, which explains the small overall power, since the mean value is not as far from H0 as e.g. case 7 or 8. The mean power is comparable to case 11 where the distance from H0 is 0.18. It is not surprising that the likelihood ratio, restricted difference and πM test statistics perform simi˜ . The larly, considering they are all functions of the maximum likelihood estimates for p under H0 , p LAP and unrestricted difference test statistics however, do not depend on these estimates and this can be the reason their performance is poorer.

29

5

C OMPUTATIONAL DETAILS

To compute the integral (14), we used the midpoint rule on a 4-dimensional grid. The four dimensions correspond to p1 , p2 , p3 and p5 . Each side in the grid is divided into a number of subintervals of equal length, and the midpoint in each subinterval is calculated. For each point (p1 , p2 , p3 , p5 ) in the grid, we set p4 , p1 (1 − p1 − 2p2 − p3 − 2p5 ) + p2 (1 − p2 − p3 − p5 ) − p3 p5 p4 = p 1 + p3 which is derived from (13). If 0 < p4 < 1, we set p6 = 1 −

5

i=1 pi

N!

and if 0 < p6 < 1 then the value

  6  pyi (p1 + p2 + p4 + p5 )2 i

i=1

yi !

p 1 + p2

,

(28)

which is π(y|p) multiplied with the non-normalized density of p, is added to the present value of the integral. If either p4 or p6 are less than 0 or greater than 1, the current point in the grid is discarded. The total non-normalized integral is the sum of (28) over the p’s satisfying the constraints for p4 and p6 . The integrals (16) and (19) are computed similarly. The number of points in the grid has to be chosen and in the results presented in this report, a grid where each side is divided into 50 subintervals was used. This resulted in 79876 points after discarding those with p4 or p6 outside [0,1]. Table 19 shows the test size for the Bayesian prior predictive values using the uniform Dirichlet prior and original formulation of H0 (9) for the six values of p given in Table 10 and N = 10 when the number of subintervals, nint , on each side in the grid is 30, 35, 40, 45 and 50. We see that the test size varies with the grid size to some extent, the largest difference is in case 1 between nint = 45 and nint = 50. nint 30 35 40 45 50

case 1 0.0395 0.0389 0.0384 0.0384 0.0407

case 2 0.0642 0.0632 0.0620 0.0620 0.0625

case 3 0.0401 0.0405 0.0405 0.0405 0.0408

case 4 0.0365 0.0367 0.0367 0.0367 0.0372

case 5 0.0186 0.0191 0.0191 0.0191 0.0193

case 6 0.0157 0.0159 0.0159 0.0159 0.0162

TABLE 19: Test size in case 1–6 for the Bayesian prior predictive p-value in (15) for N = 10 using different grid sizes, nint is the number of sub intervals on each of the four sides in the grid. A grid for p is also needed for the p-values that include a maximization step, i.e. the M, E+M and E2 M p-values. In the positive predictive value setting, we used the same grid as for the Bayesian prior predictive p-values with 50 possible subintervals for each of the four sides in the grid, but in addition ˜ of p under H0 for all possible outcomes given N . we included the maximum likelihood estimates p Table 20 shows the number of possible outcomes in the positive predictive value situation for a given value of N and the size of the grid with the maximum likelihood estimates included. Thus, the size of this grid increased with N , for N = 10, the grid consisted of 3003 + 79876 = 82879 points and when N = 25, it consisted of 142506 + 79876 = 222382 points. Comparisons of the test size for different grid sizes showed that the grid did not have a great influence on the test size when the 30

N 10 15 20 25

Number of outcomes 3003 15504 53130 142506

Size of grid 82879 95380 133006 222382

TABLE 20: Number of possible outcomes given N and size of grid used when calculating p-values in the problem of comparing positive predictive values. grid is used for maximization. We also investigated how often the maximum p-value was obtained in one of maximum likelihood points compared to the other points. The percentage increased with N and decreased with the size of the grid without maximum likelihood estimates. When N increases the number of maximum likelihood estimates increases, and it is not surprising that more of these points will give the maximum p-value and similarly, when the number of grid points in the grid without maximum likelihood estimates increases, more of the points that are not maximum likelihood estimates will give the maximum p-value. The p-value computations for a sequence of E and M steps are quite computer intensive, as p-values for all outcomes (except in the last step), not only the one of interest in a specific study, must be computed for further use as a test statistic in the next step. The test statistic giving the original ordering of outcomes, e.g. the likelihood ratio test statistic, should be computed only once, as should the maximum likelihood estimates of p under the null hypothesis. The grid used for the numerical maximization in the M step and for calculation of the πM statistic was also calculated in advance. In both the E and the M step, the outcomes should be sorted according to the test statistic (original test statistic or negative output of a previous E or M step). In the E step, the p-values are then accumulated, starting with the probability of the outcome having the most extreme value of the test statistic, and the probabilities (with the maximum likelihood estimates of p under the null hypothesis of the outcome of interest as parameters) of the forthcoming outcomes successively being added until the outcome of interest is reached. Special care must be taken to include possible outcomes having an equal test statistic value (“draws”), and because of possible numerical inaccuracies also a threshold for when two values are counted as equal should be specified. In order to compute all possible p-values, this should be repeated for all outcomes – we have chosen to accumulate probabilities for all outcomes in parallel. Taking care when dealing with draws also applies to the M step and calculation of the πE and πM test statistics. In the M step we accumulated probabilities given by the grid points as parameters in parallel while going through the sorted outcomes. As the number of grid points times the number of outcomes may be huge, only the accumulated probabilities for each outcome were saved, and for each outcome reached, the maximum of the accumulated probabilities were saved as the p-value of that outcome. Calculation of the πE and πM test statistics, based on the probabilities of the outcomes themselves instead of on an external test statistic, are more computer intensive, as the ordering of the outcomes is specific for each outcome of interest, and not to a given test statistic. For πE , the p-value for an outcome of interest is found by adding probabilities of all outcomes having a probability that is not greater, using the maximum likelihood estimate of p under the null hypothesis of the outcome of interest as parameters, thus the probability of each outcome has to be calculated for each outcome of interest. We found some gain in computation speed by sorting the outcomes before adding. 31

N 10 25 10 25

E 0m0.33s 16m17.13s 0m0.34s 15m51.13s

M 0m17.69s 14m4.51s 0m18.65s 39m17.33s

πe 0m0.06s 0m0.96s 0m0.01s 0m0.97s

πE 0m2.84s 155m13.68s 0m2.86s 156m40.3s

πM 1m29.94s 101m20.51s 1m33.01s 262m6.08s

TABLE 21: Running time for E and M p-values and for calculating the test statistics Tπe , TπE and TπM in the positive predictive values setting for samples sizes N = 10, 25 (3003 and 142506 outcomes, respectively). The two upper rows show the time when using a grid with nint = 50 without maximum likelihood estimates and the time in the two lower rows is the time when using the grid with nint = 50 including maximum likelihood estimates (79876 points without estimates, 82879 including estimates for N = 10, and 222382 points including estimates for N = 25). For πM , the grid points rather than the outcomes were gone through in an outer loop. For each grid point, the probability of each outcome was calculated, the outcomes sorted accordingly, and probabilities accumulated from the smallest to the greatest. If the cumulative probability of an outcome was greater than some earlier maximum for that outcome, the maximum was replaced by the current sum. In contrast, calculation of πe is trivial, this is simply the probability of an outcome taking its maximum likelihood estimate of p under the null hypothesis as the parameter vector. Power and size calculations for a given parameter vector are simply a matter of adding probabilities of outcomes having p-values not exceeding the significance level (in our case 0.05). The code was written in C++, implemented in GCC and the calculations were performed with the Standard Template Library, using one of eight processors on a Dell PowerEdge 2950 with two Quadcore Xeon X5365 3.0 GHz processors, 4 MB cache, 16 GB RAM. The running time for calculating E and M p-values for any test statistic, along with the running time for calculating the πe , πE and πM test statistics when comparing positive predictive values for N = 10 and N = 25 are given in Table 21 for the grid with nint = 50, without and with the maximum likelihood estimates of p included. When N = 10, all the calculations are performed rather fast, except calculating the values of the πM test statistic which takes one and a half minute. When N increases, the running time naturally increases severely since all calculations must be performed for all possible outcomes. We note that calculating the πE test statistic takes longer than calculating the πM test statistic when N = 25 for the grid without maximum likelihood estimates. This is because the number of possible outcomes is less than the number of grid points in this case. If the number of grid points is larger than the number of outcomes, as in the grid where the maximum likelihood estimates are included, calculating the πM statistic takes much longer than calculating the πE statistic.

6

D ISCUSSION

The enumeration idea is not new as it goes back to Fisher (1935), but it has often been overlooked. We have demonstrated how to apply the idea for testing independent binomial proportions and comparing positive predictive values. Another recent application of the idea is in genome-wide association studies, in which single nucleotide polymorphisms (SNP) across the human genome are studied. When the mode of inheritance is unknown, the MAX test statistic, which is the maximum of the three Cochrane–

32

Armitage trend statistics for dominant, recessive and additive inheritance modes, see Freidlin, Zheng, Li and Gastwirth (2002), tests the association between the genotype and phenotype. The exact distribution of the MAX test statistic is unknown and calculating p-values based on proposed asymptotic distributions involves numerical integration. Another common approach is to use permutations tests, but both solutions leads to possible random errors in the calculated p-values. Moldovan, Langaas and Bahlo (2009) instead calculate exact p-values using the enumeration approach and thereby avoid this uncertainty. When the sample size increases and enumeration will be too time consuming, the parametric bootstrapping approach can be used instead. Günther et al. (2009) used parametric bootstrapping to approximate the distribution of the likelihood ratio, LAP and restricted and unrestricted difference test statistics. The p-values obtained from this distribution are approximately the same as the E p-values we find by enumeration in this report, and the parametric bootstrap approach involving simulated outcomes is actually a numerical approximation that calculates the tail without using enumeration. This is seen if the test size for case 1–6 in Table 12 is compared to the test size for the small sample parametric bootstrap likelihood ratio test in Table 3 of Günther et al. (2009) – the values are almost the same. It may be of use for larger sample sizes when calculating maximum likelihood estimates and p-values for the bootstrap samples is less time consuming than calculating the maximum likelihood estimates and p-values for all possible outcomes. When using the formulas for calculating exact test size and power, i.e., (5) and (6), drawing outcomes from the multinomial distribution under H0 or H1 and estimating the test size or power by the proportion of these outcomes having p-values less than or equal to the significance level as was done in Günther et al. (2009) is not necessary, and therefore the uncertainty in the estimates are removed. This is however, only possible when the sample size is small enough so that the p-values for all possible outcomes can be calculated. Another option when the sample size increases is to condition on sums of Ni , i = 1, . . . , 6, which in a contingency table setting corresponds to conditioning on the marginals. This reduces the number of possible outcomes and makes it possible to use exact tests for higher values of N . The usability of this approach depends on the actual problem. In the example from Lloyd (2008), n1 and n2 are fixed as the number of subjects who receives treatment and placebo respectively. In the setting of positive predictive values, it is not clear which values that should be fixed. It could be the number of diseased and non-diseased subjects, if the disease status is decided before the two tests are applied, or it could be the number of subjects with positive test A, positive test B and positive tests A and B, but in practise, these numbers will usually not be fixed in advance. As Table 12 showed, the test size of a test statistic for any p-value depends on the chosen value of p, the parameter in the multinomial distribution. When the chosen significance level is 0.05, some cases have test size close to 0.05, whereas other cases have smaller test sizes. A further investigation reveals what the cases for which the test size is close to 0.05 have in common. Assume the outcomes are sorted by decreasing value of some chosen test statistic. The M step will result in rejection of the null hypothesis for outcomes that are above a certain limit, where the limit is the p-value closest to 0.05 (but not greater than 0.05). The null hypothesis is not rejected for any of the outcomes below the limit. Assume that the last outcome for which H0 is rejected, y 0 has a maximum tail probability PM,0 , i.e. p-value, in the point p0 . If the true value of p is in fact p0 , then the probability of rejecting H0 is the sum of the probabilities of this outcome and the outcomes above, which is PM,0 . Thus a test size of almost 0.05 is always obtained for a particular p, it is only the discreteness that prevents it from exactly being obtained for a specific value of p. This value is the value of p that maximizes the p-value for the outcome that has the largest p-value less than or equal to 0.05. If one wants to report

33

the test size in a certain multinomial case, choosing this value of p will ensure that the test size is close to 0.05 unlike the six multinomial cases we chose.

7

C ONCLUSIONS

In this work we have provided an in-depth effort of using enumeration and exact p-values to address the problem of comparing positive predictive values. The existing tests for this situation rely on asymptotic distributions and have previously been shown not to preserve the test size when the sample size was moderate. The test size and power of nine test statistics in combination with five types of p-values have been thoroughly evaluated for different sample sizes. As demonstrated, the M step yields valid p-values, although these are often conservative. The E step provides a reordering of the reference set in contrast to the M step and one or two E steps before the M step increases the test size while yielding valid p-values. We have presented three new test statistics, Tπe , TπE and TπM , that can be applied to any problem. In the problem of comparing binomial proportions, the πe test statistic performed better than the test statistics analyzed by Lloyd (2008) in terms of test size and power for the E+M p-values. For comparing the positive predictive values from two diagnostic tests, we recommend using either the likelihood ratio, restricted difference or πM test statistic and to calculate the E+M p-values. These p-values are valid, and for these test statistics the results have indicated that there is no need to do more than one E step before the final M step. However, the importance of one or more E steps before maximization is greater for e.g. the LAP and unrestricted difference test than for the likelihood ratio test as it increases the test size more significantly, suggesting that the ordering provided by the LAP and unrestricted difference test is not optimal with respect to test size and power. We do not recommend using the prior predictive p-values, as these are very sensitive to the choice of prior and on the null hypothesis formulation. This report gives further general insight into the mechanisms behind the E, M and E+M p-values in general and in the example discussed by Lloyd (2008). We describe how the E p-value changes the ordering of outcomes and why this reduces the conservativeness of the M p-values if the E p-values are applied before the M step. In further work, it would be of interest to find a test statistic that in some sense provides an optimal ordering of the outcomes with respect to test size and power and in particular, the πe , πE and πM should be studied in greater detail and compared to other test statistics. We would also like to investigate if ordering of the outcomes converges after a certain number of E steps, and also the effect of performing two or more consecutive sequences of the form Ek M.

R EFERENCES Agresti, A. (2002). Categorical data analysis, second edn, John Wiley & Sons, Inc., Hoboken, NJ, chapter 1.4.4. Bayarri, M. J. and Berger, J. O. (2000). P values for composite null models, Journal of the American Statistical Association 95(452): 1127–1142.

34

Berger, R. L. and Boos, D. D. (1994). P values maximized over a confidence set for the nuisance parameter, Journal of the American Statistical Association 89(427): 1012–1016. Bickel, J. and Doksum, K. A. (2001). Mathematical statistics, second edn, Prentice Hall, Inc., chapter 4. Casella, G. and Berger, R. L. (2002). Statistical inference, second edn, Duxbury, chapter 8. Fisher, R. A. (1935). The logic of inductive inference, Journal of the Royal Statistical Society 98: 39– 82. Freidlin, B., Zheng, G., Li, Z. and Gastwirth, J. L. (2002). Trend tests for case-control studies of genetic markers: power, sample size and robustness, Human Heredity 53: 146–152. Günther, C.-C., Bakke, Ø. and Langaas, M. (2009). Comparing positive predictive values for small samples with application to gene ontology testing. Preprint Statistics No. 3, Department of Mathematical Sciences, Norwegian University of Science and Technology. Günther, C.-C., Bakke, Ø., Lydersen, S. and Langaas, M. (2008). Comparison of predictive values from two diagnostic tests in large samples. Preprint Statistics No. 9, Department of Mathematical Sciences, Norwegian University of Science and Technology. Johnson, N. L., Kotz, S. and Balakrishan, N. (1997). Discrete multivariate distributions, Wiley series in probability and statistics, chapter 35. Leisenring, W., Alonzo, T. and Pepe, M. S. (2000). Comparisons of predictive values of binary medical diagnostic tests for paired designs, Biometrics 56: 345–351. Lloyd, C. J. (2008). Exact p-values for discrete models obtained by estimation and maximization, Australian & New Zealand Journal of Statistics 50(4): 329–345. Moldovan, M., Langaas, M. and Bahlo, M. (2009). Efficient error-free computation of MAX pvalues with an application to genome-wide association studies. Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia, submitted. Zelterman, D., Chan, I. S.-F. and Mielke, P. W. (1995). Exact tests of significance in higher dimensional tables, The American Statistician 49(4): 357–361.

35

Suggest Documents