NESUG 2007

Statistics and Data Analysis

A Gene Prognosis Profile using Partial Least Square (PLS)

Chenwei Liu, MS, Clinical Proteomics Reference Lab, SAIC-Frederick, Inc., Gaithersburg, MD
Gordon R. Whiteley, Ph.D., Director, Advanced Technology Program (ATP), Clinical Proteomics Reference Laboratory (CPRL), SAIC-Frederick, Inc., NCI-Frederick, Frederick, MD

ABSTRACT
In a typical clinical study based on a microarray gene expression experiment, there are often more genes than subject samples, and many genes are correlated. Partial least squares regression, though not originally designed for classification, can be used to build a classifier that predicts outcomes from high-dimensional, correlated gene expression data. In this article, using publicly available data from a breast cancer study, we show how to use PROC PLS in SAS to construct a metastasis risk classifier from a training dataset. The classifier is then used to assign patients from an independent validation set to high- and low-risk groups based on their gene expression. The result confirms that patients in the two predicted risk groups show a significant difference in survival outcome.

INTRODUCTION
With the maturing of DNA microarray technology, many gene expression based applications in the cancer diagnostics field are becoming more acceptable to physicians and regulatory agencies. On February 14, 2007, the FDA approved the first gene expression based in vitro prognostic tool, MammaPrint, allowing it to enter the US market to help doctors assess young breast cancer patients' risk of distant metastasis. This approval will certainly encourage more clinical studies involving microarray technology, and more gene data analysis methods and tools will be developed as a result.

In a DNA microarray experiment dataset, the number of subjects (observations) is usually much smaller than the number of genes (variables), and many genes are correlated, i.e., they are not independent. Strictly speaking, statistical methods that assume independent predictors cannot be applied directly in this setting. Many machine learning methods, such as the Support Vector Machine, the C5.0 decision tree, and neural networks, have been used to find gene expression patterns related to disease [1]. Partial least squares is another good method for exploring the information hidden in such a large amount of gene data. While ordinary least squares regression has the goal of minimizing sample response prediction error, seeking linear functions of the predictors that explain as much variation in each response as possible, partial least squares has the additional goal of accounting for variation in the predictors, under the assumption that directions in the predictor space that are well sampled should provide better prediction for new observations when the predictors are highly correlated [2]. The partial least squares method is also implemented in SAS and is easy to use. A simple approach is to first use PROC PLS to analyze the data and see whether a gene expression difference exists among the patient groups of interest in a study. Other methods may then be brought in to refine models and build more powerful tools. In this paper we show the details of using PROC PLS for classification applications in SAS.

EXAMPLE: GENE EXPRESSION OF YOUNG BREAST CANCER PATIENTS
Breast cancer is the most common cancer among women, except for non-melanoma skin cancers, and it is the second leading cause of cancer death in women, exceeded only by lung cancer. According to the American Cancer Society, an estimated 178,480 new cases of invasive breast cancer will be diagnosed among women in the United States this year, and over 40,000 women are expected to die from the disease. Death rates from breast cancer continue to decline, with larger decreases in women younger than 50. Early detection through screening based on various technologies, refined prognosis, increased awareness, and improved treatment are all believed to have contributed to this decline. Microarray technology is widely used to provide a better screening tool and to refine prognosis, giving physicians accurate information to guide tailored treatment for each patient. Using DNA microarray analysis on primary breast tumors of 117 young patients, Van't Veer et al reported a 70-gene prognosis profile associated with the risk of early development of distant metastasis in patients with lymph-node-negative breast cancer [3].

Van de Vijver et al further evaluated the 70-gene classifier on a series of 295 consecutive patients with stage I or stage II breast cancer and made the gene expression data for those 295 patients available to the public [4]. The log ratio column in the dataset represents the normalized expression level of a gene. The dataset consists of log ratios of over 20,000 genes for 295 patients who were diagnosed with breast cancer between 1984 and 1995 and treated by modified radical mastectomy or breast-conserving surgery, followed by radiotherapy, at the hospital of the Netherlands Cancer Institute. The median follow-up time was 7.2 years and the median survival time was 3.8 years for those 295 patients. We first used the following macro to randomly split the 295 patients into two independent sets, a 70% training set and a 30% validation set. Here all295 is the input data set and 1224 is a seed for the random process. Since a random split is relatively easy to code in SAS, the details of the macro are omitted here.

%ran_split(all295, eventmeta, train, valid, 0.7, 1224, idvar=sampleID, stratify=yes);
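One possible implementation of such a stratified split, shown here as a minimal sketch rather than the actual %ran_split macro used in the paper, relies on PROC SURVEYSELECT with the outcome variable as a stratum:

```sas
/* Hypothetical sketch of a stratified 70/30 split; the paper's
   %ran_split macro is not shown, so this is one way to do it. */
proc sort data=all295 out=sorted;
   by eventmeta;                      /* stratify on the outcome */
run;

proc surveyselect data=sorted out=split seed=1224
                  samprate=0.7 outall; /* flag instead of subset */
   strata eventmeta;
run;

data train valid;
   set split;
   if Selected then output train;     /* ~70% of each stratum    */
   else output valid;                 /* remaining ~30%          */
run;
```

The outall option keeps every observation and adds a Selected indicator, which makes routing the two subsets into separate datasets straightforward.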

BUILDING A CLASSIFIER FROM THE TRAINING SET
There are many ways to select genes that might be associated with clinical outcomes; discussing and comparing the different methods of feature selection is beyond the scope of this article. As a demonstration of PROC PLS, we used the 70 genes identified in the original paper of Van't Veer et al as our predictor variables.

The number of factors to extract depends on the training data. However, extracting excess factors must be avoided in order to prevent overfitting. The PLS procedure lets us choose the number of extracted factors by cross validation. Various methods of cross validation are available, including one-at-a-time validation and splitting the data into blocks.

proc pls data=train cv=split;
   model eventmeta = &pls_mvars;
run;

The cv=split option sets aside every 7th (the default) observation as a validation set; this number can be tuned according to the size of the sample set. The pls_mvars macro variable above lists all 70 genes. The output is as follows:

The PLS Procedure
Split-sample Validation for the Number of Extracted Factors

Number of Extracted Factors    Root Mean PRESS
 0                             1.079415
 1                             1.025679
 2                             1.017247
 3                             1.050493
 4                             1.073688
 5                             1.081337
 6                             1.090805
 7                             1.103451
 8                             1.113767
 9                             1.12401
10                             1.128974
11                             1.139266
12                             1.143572
13                             1.149865
14                             1.154781
15                             1.15846

Minimum root mean PRESS         1.0172
Minimizing number of factors         2
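Split-sample validation is only one of the choices PROC PLS offers. As a sketch (option names as given in the SAS/STAT PROC PLS documentation; the paper itself uses only cv=split), leave-one-out, block, and random cross validation can be requested as follows:

```sas
/* Alternative cross-validation methods in PROC PLS (a sketch). */
proc pls data=train cv=one;               /* leave-one-out       */
   model eventmeta = &pls_mvars;
run;

proc pls data=train cv=block(7);          /* 7 contiguous blocks */
   model eventmeta = &pls_mvars;
run;

proc pls data=train cv=random(niter=10 ntest=30 seed=1224);
   /* 10 random splits, each holding out 30 observations */
   model eventmeta = &pls_mvars;
run;
```

Whichever method is chosen, the procedure reports the root mean PRESS for each candidate number of factors, and the minimizing number of factors is read off the same way as above.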

PREDICT RISK OUTCOME IN THE VALIDATION SET
Based on the training data, it is recommended to extract 2 factors for the model. The following code brings in the independent validation set, specifies 2 factors to extract, and writes the predicted values to an output data set, pred.

data all;
   set train valid(rename=(eventmeta=meta));
run;

proc pls data=all nfac=2;
   model eventmeta = &pls_mvars;
   output out=pred p=p_meta;
run;

It is necessary to mention that when combining the training set with the validation set, the "rename" option above is important, because we want the model to predict the outcomes of the observations from the valid set. If observations from the validation set carried eventmeta values, they would enter the PROC PLS model-building process and a different model would be fit, which we want to avoid. In the output dataset, the predicted values of the response, p_meta, range from 0 to 1; we can choose different cutoff values to dichotomize them, thus tuning the sensitivity and specificity. The following data step and two PROC FREQs were then used to see how well the model works on the training set and the validation set.

data pred;
   set pred(keep=SampleID EventMeta TimeSurvival Meta P_meta);
   if P_meta > 0.37 then pls = 1;
   else pls = 0;
run;

proc freq data=pred(where=(eventmeta>.));
   table eventmeta*pls / nopercent nocol;
run;

proc freq data=pred(where=(eventmeta=.));
   table meta*pls / nopercent nocol;
run;

The output is shown in Table 1 and Table 2. As seen in Table 1, with the 0.37 cutoff value the sensitivity on the training set is 73.24% and the specificity is 76.12%; they are quite balanced. On the validation set, the sensitivity is 63.33% and the specificity is 75.44%. It is expected that sensitivity, specificity, or both will not be as high as on the training set. Increasing the sensitivity by lowering the cutoff value results in a decrease in specificity.
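Rather than fixing a single cutoff by inspection, one could sweep a grid of candidate cutoffs and tabulate sensitivity and specificity on the training portion. The following is a hypothetical sketch (the variable names follow the pred dataset described in this section; the cutoff grid itself is our own choice):

```sas
/* Sketch: evaluate several cutoffs on the training portion of
   pred, i.e., observations with a non-missing eventmeta.      */
data cutoffs;
   set pred(where=(eventmeta > .));
   do cutoff = 0.25 to 0.50 by 0.05;
      pred_pos = (p_meta > cutoff);
      tp = (pred_pos = 1 and eventmeta = 1);   /* true positive */
      tn = (pred_pos = 0 and eventmeta = 0);   /* true negative */
      output;
   end;
run;

proc sql;
   select cutoff,
          sum(tp) / sum(eventmeta = 1) as sensitivity format=percent8.2,
          sum(tn) / sum(eventmeta = 0) as specificity format=percent8.2
   from cutoffs
   group by cutoff;
quit;
```

Scanning the resulting table makes the sensitivity/specificity trade-off at each cutoff explicit before committing to one value.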

SURVIVAL DIFFERENCE OF THE PREDICTED RISK GROUPS
Even when the full clinical data are not available, survival data, if provided, can be utilized. The classifier divides patients into high- and low-risk groups based on their gene data; the survival information can then be used to confirm the risk classification. The following is an example:

proc lifetest data=pred(where=(eventmeta=.)) method=KM plots=(s);
   time timesurvival*eventdeath(0);
   strata pls;
run;
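If the log-rank test result is needed as a dataset rather than only in the printed listing, it can be captured with ODS OUTPUT. This is a sketch; HomTests is the PROC LIFETEST ODS table that holds the tests of homogeneity over strata:

```sas
/* Sketch: capture the log-rank (and Wilcoxon) test table. */
ods output HomTests=homtests;
proc lifetest data=pred(where=(eventmeta=.)) method=KM;
   time timesurvival*eventdeath(0);
   strata pls;
run;

proc print data=homtests;    /* columns: Test, ChiSq, DF, ProbChiSq */
run;
```

The ProbChiSq value for the Log-Rank row is the p-value quoted when comparing the two predicted risk groups.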

The Kaplan-Meier curve plot is shown in Figure 1. The predicted high-risk group (pls=1, red) has a significantly lower survival rate than the predicted low-risk group (pls=0, black), as shown by a log rank test p-value