Multidimensional Stochastic Process Model and its Applications to Analysis of Longitudinal Data with Genetic Information

Multidimensional Stochastic Process Model and its Applications to Analysis of Longitudinal Data with Genetic Information Ilya Zhbannikov Konstantin A...

Author: Monica Blake

18 downloads 0 Views 1MB Size

Report

Download PDF

Recommend Documents

A TUTORIAL INTRODUCTION TO STOCHASTIC ANALYSIS AND ITS APPLICATIONS

Longitudinal AIDS Data Analysis

Multidimensional Social Network: Model and Analysis

STOCHASTIC ANALYSIS OF A TWO SPECIES MODEL WITH COMMENSALISM

Multidimensional analysis model of using Information technologies and information systems in higher education

Data and process analysis

Probability Theory and Stochastic Processes with Applications

Longitudinal Data Analysis CATEGORICAL RESPONSE DATA

LONGITUDINAL ANALYSIS OF LABOUR FORCE SURVEY DATA

Key-Words: - multidimensional view of data, multidimensional data model, experimental approach, Anchor modeling, query execution performance

Mobile Network Performance from User Devices: A Longitudinal, Multidimensional Analysis

A Conceptual Model for Multidimensional Data

An Introduction to Stochastic Processes with Applications to Biology

thinkteam with Replicated Data Repositories: Stochastic Model Checking in Industry

Chapter 2. The Entity-Relationship. Data Model. The process of designing a database begins with an analysis of what information

Data Mining Technology and its Applications to Power Systems

Nonlinear Regression Analysis and Its Applications

POLS571 - Longitudinal Data Analysis September 25, 2001

Time Series Analysis and Its Applications

Panel Data Stochastic Frontier Model with Determinants of Persistent and Transient Inefficiency

Dynamic Spectrum Access in WLAN Channels: Empirical Model and Its Stochastic Analysis

STOCHASTIC PROCESSES AND THEIR APPLICATIONS

Analysis, Characterization and Design of Data Mining Applications and Applications to Computer Architecture

1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs

Multidimensional Stochastic Process Model and its Applications to Analysis of Longitudinal Data with Genetic Information Ilya Zhbannikov

Konstantin Arbeev,

Anatoliy Yashin

Biodemography of Aging Research Unit (BARU), Social Science Research Institute (SSRI) Duke University Durham, NC, 27708

Biodemography of Aging Research Unit (BARU), Social Science Research Institute (SSRI) Duke University Durham, NC, 27708

[email protected]

[email protected]

Biodemography of Aging Research Unit (BARU), Social Science Research Institute (SSRI) Duke University Durham, NC, 27708

ABSTRACT Stochastic Process Model has many applications in analysis of longitudinal biodemographic data. Such data contain various physiological variables (sometimes known as covariates). It also can potentially contain genetic information available for all or a part of participants. Taking advantage from both genetic and nongenetic information can provide future insights into a broad range of processes describing aging-related changes in the organism. In this paper, we implemented a multi-dimensional Genetic Stochastic Process Model (GenSPM) in newly developed software tool, R-package stpm, which allows researchers performing such kind of analysis.

Keywords Stochastic process model; allostatic load; quadratic hazard; longitudinal data; life tables; risk factors; aging-related changes;

1. INTRODUCTION Developing aging-related diseases is mediated by thousands of biological and physiological variables, which are undergo environmental and social factors, individual behavioral patterns. Various studies show involvement of genetic component in developing aging-related diseases and its potential effect on longevity [1-3]. Longitudinal study is probably the most important component of study and evaluating contribution of physiological variables to decline of the health/well-being status and a lifespan. Longitudinal data can also contain genetic information of individuals participated to the study. In such data genetic component represents genetic information in form of a genetic marker describing a particular allele. However, such longitudinal data often comes from different studies and often is incomplete, meaning that only a part of individuals are genotyped. Incompleteness in longitudinal data may arise from several factors, for example: (a) not all individuals are genotyped since some of those who initially participated in a longitudinal study have already deceased or left the study; (b) difficulty in gathering genetic information, which may arise from various reasons, for example, cost of genotyping or, perhaps, an individual refused to provide genetic samples; (c) this also can happen due to loss of

[email protected]

some parts of experimental data during data preparation or storage. Incomplete data often confounds the analysis leading to potential misleading results. Such cases are typical in epidemiological studies when a measure of a certain variable (covariate) is difficult to obtain for majority of participants. The common approach is to divide participants into two subgroups: the larger subgroup, which does not carry information of a particular variable, which is difficult to collect; and a smaller subgroup in which measures of such variable are presented. This is so-called two-stage analysis. Methods for analyzing such kind of data are well developed for regression models [4,5]. These methods employ information from the first subgroup (named the “first stage”) and combine it with data from the second subgroup (or “second stage”) in order to better estimate regression parameters. Using such methods can potentially improve precision of estimates in comparison to analysis based on the smaller group exclusively. A common way of evaluating effects of genetic variability on health/well-being/survival condition is to estimate respective hazards (can be mortality rate) separately for carriers and noncarriers of particular allele (genotype). In absence of information of physical factors and processes affecting a hazard rate, an evaluation of a genetic effect is well developed in GWAS. Genetic data combined with longitudinal data can potentially, provide an opportunity of studying indirect genetic effects with trajectories of physiological variables, mediated by age, which, in turn, can model processes in organism not directly measured in pure longitudinal data. Therefore there is a need for special statistical methods and software tools that perform such kind of analysis. The Stochastic Process Model (SPM) [6, 8, 9], and its extension, a genetic SPM (“GenSPM”) [10], developed to deal with longitudinal data with presence of genetic information, and represents an important step toward joint analysis of longitudinal data (with corresponding genetic information) by considering together genetic- and non-genetic groups (genotyped and nongenotyped groups of participants). In this work, we (i) further extend the conception of GenSPM to a multi-dimensional case and (ii) provide a corresponding software tool: an R-package

‘stpm’, that implements GenSPM methodology. The stpm was verified thought extensive simulation and validation studies. This paper is also an attempt to partially close the problem of limited usage of the SPM methodology, arising from the lack of userfriendly software, documentation and examples, and promote and popularize it to academic and clinical audience.

We also present a corresponding software tool, an R-package ‘stpm’, freely available for download from CRAN: https://cran.rproject.org/web/packages/stpm/ which is the first publicly available software, implementing the general SPM and its extension, GenSPM, allowing researchers analyzing and making predictions from longitudinal data with genetic component.

2. METHODS

2.1 Description of algorithms

Originally, the Stochastic Process Model (SPM) was developed several decades ago at Duke University [6] and represents a general framework for modeling joint evolution of repeatedly measured variables (e.g., physiological or biological measures, also called covariates) and time-to-event outcomes observed in longitudinal studies. In other words, SPM links the stochastic dynamics of variables to the probabilities of end points (e.g., death or system failure). The dynamics of the stochastic variables is modeled by N-dimensional stochastic process (where N represents a number of physiological variables used in the study) and has two components: the first component, which is related to the basic regularities of the age-related physiological changes and the second is a stochastic component which integrates the effects of external and internal perturbations of the dynamics of the physiological covariates. In SPM methodology, the morbidity/mortality risk is presented as the quadratic hazard function, also known as U- or J- shaped hazard function [12-17], which is justified empirically based on many epidemiological observations for various biomarkers (see, e.g., [12,17]). The minimum of a hazard function, a paraboloid in the multivariable case, is a point (or domain) in the variable state space, which corresponds to the optimal system status (e.g., the "normal" health status) with the minimal hazard at a specific time (age). In general, the SPM can be applied in the same manner as the Cox model with time-dependent covariates [7] (delete this reference). However, the advantage of the SPM methodology is that it takes into account the stochastic dynamics of variables assuming that the respective process satisfies a certain stochastic model, which better describes the reality in many applications. Finally, SPM allows projection/prediction of individual physiological trajectories, which opens possibilities for targeted research such as personalized prognoses. GenSPM, presented in 2009 by Arbeev at al [10] (and further elaborated in 2014 “Joint analysis of longitudinal and time to event data”, 2015 “Latent class and genetic stochastic processs model” (JSM) by Arbeev), further elaborates the basic stochastic process model conception by introducing a categorical variable, Z, which may be a specific value of a genetic marker or, in general, any categorical variable. Currently, Z has two gradations: 0 or 1 in a genetic group of interest, assuming that P(Z=1) = p, p ∈ [0, 1], were p is the proportion of carriers and non-carriers of an allele in a population. Example of longitudinal data with genetic component Z is provided in Table 1. In the previous study, the GenSPM model was verified on a single physiological variable but its behavior if more than one variable used is not known. The general concern is that using a single variable may not be enough for performing comprehensive hypotheses evaluations. In this work we conducted corresponding simulation studies to evaluate second GenSPM dimension using simulation studies with two physiological variables by providing its validation and testing using one and two physiological covariates.

The block-scheme of the SPM is presented in Figure 1. In the specification of the SPM described in 2007 paper by Yashin and colleagues [8] the stochastic differential equation describing the age dynamics of a physiological variable (a dynamic component of the model) is: dY(t) = a(Z, t)(Y(t) – f1(Z, t))dt + b(Z, t)dW(t), Y(t = t0).

(1)

Here in this equation, Y(t) is a k × 1 matrix, where k is a number of covariates, which is a model dimension) describing the value of a physiological variable at a time (e.g. age) t. f1(Z,t) is a k × 1 matrix that corresponds to the long-term average value of the stochastic process Y(t), which describes a trajectory of individual variable influenced by different factors represented by a random Wiener process W(t). The negative feedback coefficient a(Z,t) (k × k matrix) characterizes the rate at which the stochastic process goes to its mean. In research on aging and well-being, f1(Z,t) represents the average allostatic trajectory and a(t) in this case represents the adaptive capacity of the organism. Coefficient b(Z,t) (k × 1 matrix) characterizes a strength of the random disturbances from Wiener process W(t). All of these parameters depend on Z (a genetic marker having values 1 or 0). The following function μ(t,Y(t)) represents a hazard rate: μ(t,Y(t)) = μ0(t) + (Y(t) - f(Z, t))*Q(Z, t)(Y(t) - f(Z, t)), (2) In this equation: μ0(t) is the baseline hazard, which represents a risk when Y(t) follows its optimal trajectory; f(t) (k × 1 matrix) represents the optimal trajectory that minimizes the risk and Q(Z, t) (k x k matrix) represents a sensitivity of risk function to deviation from the norm. In general, model coefficients a(Z, t), f1(Z, t), Q(Z, t), f(Z, t), b(Z, t) and μ0(t) are time(age)-dependent. For example, the coefficient a can be assumed as (i) -0.05 (a constant, time-independent, one-dimensional model) or (ii) a(t) = a0 + b0t (time-dependent), in which a0 and b0 are unknown parameters to be estimated. Presented model can handle, in

Figure 1. Block-scheme of GenSPM. In this picture, part I is described by the equation (1) and part II is described by the equation 2 (morbidity/mortality risk).

Table 1. Example of longitudinal data with genetic component Z presented as a dichotomous value of some genetic marker (Z = 1 for carriers and 0 for noncarriers). Here “ID” is a person’s identification number in a database; “Status” represents death (1) or censoring (0) of an individual at age “Age.next”; “DBP” is a physiological variable, which is diastolic blood pressure in this example.

theory, any number of physiological variables, however using many variables may lead to extensive usage of computational resources. Symbol ‘*’ denotes transpose operation. In order to estimate coefficients a(Z, t), f1(Z, t), Q(Z, t), f(Z, t), b(Z, t) and μ0(t) which are the parameters of stochastic process, a method of maximizing likelihood is used (described in the next section). ID

Status

Age

Age.next

Z

DBP

DBP.next

1

0

96.61

97.59

1

0

97.59

98.67

0

94.62

100.68

0

100.68

100.59

1

0

98.67

99.67

0

100.59

102.31

1

1

99.67

100.70

0

102.31

NA

2

0

64.78

65.78

1

81.77

80.62

2

0

65.78

66.78

1

80.62

70.49

2

0

66.78

67.68

1

70.49

69.20

2

0

67.68

68.66

1

69.20

67.74

…

…

…

…

…

…

…

These equations are then solved at the intervals between observation times [𝑡!! , 𝑡!! ), [𝑡!! , 𝑡!! ),…,[𝑡!! ! ,𝜏! ), with initial conditions 𝑦 ! (𝑡!! ), …,𝑦 ! (𝑡!! ! ) and γz,0,0,..0 respectively. Therefore, trajectories for mi(z, t) and γi(z, t) are different for each individual and so estimates of the chances of death for each individual are different; 𝛿 ! is a censoring indicator for i-th individual (1 – death, 0 - censored); 𝑡!! ! is the last measurement of physiological variable before death/censoring at 𝜏! for i-th individual.

2.1.2 Likelihood function for non-genetic group Assuming that the population of interest is heterogeneous, i.e. a mix of carriers and non-carries of particular allele or genotype, randomly selected from the data, we can write the following likelihood equation for non-genetic group [10]: N ng (0)

Lng =

∏ ( pL (1) + (1− p)L (0)) i g

i g

(8)

i=1

here p is an expected proportion of carriers in a population; Nng(0) represents the number of individuals in the non-genetic group at a time (age) t0. Likelihoods Lig (1) and Lig (0) are calculated using equation (4) above.

2.1.3 Joint analysis of genetic and non-genetic data 2.1.1 Likelihood function for genetic group The likelihood equations for genotyped (genetic) group were presented in [10] and is shown below: g 1

g 0

N1g (0)

N 0g (0)

i=1

i=1

Lg = p N (0) (1− p)N (0) ∏ Lig (1) ∏ Lig (0)

(3)

N1g (0) and N 0g (0) are the numbers of individuals having Z=1 and 0 in the beginning of the study. The likelihood for i-th individual calculated as follows: ⎧⎪ τ i ⎫⎪ n j −1/2 Lig (z) = µ i (z, τ i )δi exp ⎨− ∫ µ i (z, t)dt ⎬ ∏ γ i (z, t ij ) ⎪⎩ t0i ⎪⎭ j=0 ⎧ 1 ⎫ (4) ×exp ⎨− (y i (t ij ) − m i (z, t ij− ))γ i (z, t ij− )−1 (y i (t ij ) − m i (z, t ij− ))* ⎬ ⎩ 2 ⎭ Then the hazard rate can be estimated with the following equation:

+Tr(Q(z, t)γ i (z, t))

(5)

Functions m(z, t) and γ(z, t) are mean and variance of the conditional distribution P(Y(t) ≤ y | Z = z, T > t) that satisfy the following system of differential equations: dm i (z, t) = a(z, t)(m i (z, t) − f (z, t)) dt −2γ i (z, t)Q(z, t)(m i (z, t) − f (z, t))

(6)

dγ i (z, t) = a(z, t)γ i (z, t) − γ i (z, t)a(z, t)* +b(z, t)b(z, t)* dt −2γ i (z, t)Q(z, t)γ i (z, t)*

L = Lg Lng

(9)

In this equation Lg and Lng are likelihoods computed from

In this equation,

µ i (z, t) = µ 0 (z, t) + (m i (z, t) − f (z, t))* Q(z, t)(m i (z, t) − f (z, t))

In order to combine the data from genetic and non-genetic groups, we use the following joint likelihood:

genetic and non-genetic groups. This joint likelihood is then maximized used some well-known optimization method (we used Nelder-Mead [18] or COBYLA [19] optimization methods other methods are available in the R-package stpm) and thereby parameter estimates are obtained. These estimates can be used to test the hypotheses on different respective parameters for carriers and non-carriers of particular genetic marker: differences between the general model with parameters do not depend on Z to the model with Z-dependent parameters. This, in turn, will show the presence of a genetic effect in respective component of the model.

2.2 R-package stpm We developed an R-package “stpm” that comprises the SPMmethodology. The package allows for estimating several versions of SPM currently available in the literature including discrete[20] and continuous-time multidimensional models [8] and a onedimensional model with time-dependent parameters [9] and SPM with genetic component, described in this work. Also, the package provides routines for data preprocessing, simulation and projection of individual trajectories and hazard functions (microsimulations). The R-package stpm is available as open source software from the following link: https://cran.rproject.org/web/packages/stpm/index.html. [Add functions how to run the analysis] The R-package stpm contains two following functions that basically objectify GenSPM methodology described in this article:

(7)

(i) (ii)

simdata_gen(…) spm_gen(…)

The simdata_gen(…) is used for data simulation which may be useful in case of data absence and the spm_gen(…) provides

estimation of the parameters of the model described above. In the example below we show how to work with this function: library(stpm)

Table 3 contains results for Test 1. In this test we performed evaluation of model behavior for one physiological variable. For genetic group, parameter estimates, such as p, Q, f1 are getting close to their true values used in simulation as number of individuals in the cohort increases. Estimates obtained from combined data: genetic (100 of individuals) + non-genetic groups (5,000 of individuals) are very close to those parameters used in simulation.

#Data simulation: data