Genetic Association, Post-translational Modification and Protein-protein Interactions in Type 2 Diabetes Mellitus

MCP Papers in Press. Published on May 10, 2005 as Manuscript M500024-MCP200
Author: Myles Stewart

Genetic Association, Post-translational Modification and Protein-protein Interactions in Type 2 Diabetes Mellitus

Amitabh Sharma£1, Sreenivas Chavali£1, Anubha Mahajan1, Rubina Tabassum1, Vijaya Banerjee1, Nikhil Tandon2, Dwaipayan Bharadwaj1∗

1 Functional Genomics Unit, Institute of Genomics and Integrative Biology, CSIR, Delhi, India

2 Department of Endocrinology, All India Institute of Medical Sciences, New Delhi, India

£ These authors contributed equally to this work.



∗ Corresponding author:

Dr. Dwaipayan Bharadwaj
Functional Genomics Unit
Institute of Genomics and Integrative Biology (CSIR)
Mall Road, Delhi - 110 007, India
Tel: +91 11 2766 6156/6157
Fax: +91 11 2766 7471
E-mail: [email protected]

Running Title: Functional assessment of variations in Type 2 Diabetes Mellitus

Copyright 2005 by The American Society for Biochemistry and Molecular Biology, Inc.

Downloaded from www.mcponline.org by on February 12, 2008


Abbreviations Used: SNPs - Single Nucleotide Polymorphisms; T2DM - Type 2 Diabetes Mellitus; DCVs - Disease Causing Variations; DAVs - Disease Associated nonsynonymous Variations; PSIC - Position Specific Independent Count; CNVs - Control Nonsynonymous Variations


Summary: Type 2 Diabetes Mellitus is a complex disorder with a strong genetic component. Inherited susceptibility to complex disease in humans is most commonly associated with single nucleotide polymorphisms, but the mechanisms by which this occurs are still poorly understood. Here, we analyze the effect of a set of disease causing missense variations of monogenic forms of Type 2 Diabetes Mellitus and a set of disease associated nonsynonymous variations, in comparison with nonsynonymous variations that have no experimental evidence of association with any disease. Analysis of different properties such as evolutionary conservation status, solvent accessibility, secondary structure, etc. suggests that disease causing variations are associated with extreme changes in the values of the parameters relating to evolutionary conservation and/or protein stability. Disease associated variations are rather moderately conserved and have a milder effect on protein function and stability. The majority of the genes harboring these variations are clustered in or near the insulin signaling network. Most of these variations are identified as potential sites for post-translational modifications; for certain predictions, experimental evidence has already been reported. Overall, our results indicate that Type 2 Diabetes Mellitus may result from a large number of SNPs which impair modular domain function and post-translational modifications involved in signaling. Our emphasis is more on the conserved corresponding residues than on the variation alone. We believe that considering a stretch of peptide sequence surrounding a polymorphism is a better way of defining its role in the manifestation of this disease. Since most of the variations associated with the disease are rare, we hypothesize that this disease follows a ‘Mosaic model’ of interaction between a large number of rare alleles and a small number of common alleles along with the environment, which is somewhat contrary to the existing Common Disease Common Variants model.

Introduction: Type 2 Diabetes Mellitus (T2DM) is a genetically heterogeneous, polygenic disease with a complex inheritance pattern, caused by genetic predisposition and environmental factors. The precise biochemical defects are unknown but almost certainly include impairments in insulin secretion and insulin action. T2DM is characterized by abnormal glucose homeostasis leading to hyperglycemia and is marked primarily by insulin resistance. The vast majority of insulin resistance in T2DM has been shown to arise from defects at the post-receptor level [1]. T2DM is also heterogeneous in the associated pathological and physiological symptoms, leading to a variety of complications such as coronary heart disease, neuropathy, retinopathy, etc.

Genetic dissection of any complex trait is based on two approaches: genome wide scan studies and association studies. The concept of association studies [2] is widely applied as an experimental technique to identify the Single Nucleotide Polymorphisms (SNPs) underlying complex phenotypes; SNPs represent the most common form (90%) of genetic variation in humans [3]. Association is defined as a statistical statement about the co-occurrence of alleles or phenotypes. Owing to the application of high-throughput SNP detection techniques, the number of identified SNPs is growing rapidly, enabling detailed statistical studies. Over the past decade many laboratories have sought to clarify the etiology of T2DM by attempting to associate clear differences in metabolic phenotype with mutations or polymorphisms in genes. As a result, a large amount of data has accumulated, associating SNPs in a large number of candidate genes with the disease across different populations. Unlike the fully penetrant mutations that cause Mendelian diseases, SNPs involved in complex human phenotypes are not a necessary and sufficient condition defining the phenotype; their effect depends on many other genetic and environmental components. In other words, SNPs are risk factors for a specific phenotype more in a statistical sense. This raises the question of whether the associated SNPs are only of statistical significance. If not, what might explain the differences in variation statistics across populations shown by Cargill et al. [4]? Identifying the SNPs responsible for specific phenotypes remains an enigma that is very difficult to solve. Several recent studies [5-10] have applied computational methods to predict the potential effects of nonsynonymous coding SNPs in bringing about variations in humans.

A focus on the individual factors that highlights their maximum potential effect (whether positive or deleterious) is often optimistic, as in practice they do not operate in isolation. Instead, they work jointly to generate the disease gene architecture, and hence a study to determine the contribution of these interactions towards the disease is essential. Ideally, the end point of disease gene identification should be functional analysis of the disease associated allele and an understanding of the molecular mechanism by which it causes the disease phenotype. This functional characterization can be facilitated by computational analysis. Vitkup et al. [9] have shown that the probability of a nonsynonymous mutation causing a genetic disease increases monotonically with an increase in the degree of evolutionary conservation of the mutation site and a decrease in the solvent accessibility of the site; opposite trends are observed for non-disease polymorphisms. In the current study we have extensively analyzed the effect of nonsynonymous variations on the structure and function of proteins and have attempted to determine their possible role in the disease phenotype.

Experimental procedures:

Data set extraction: The data set considered for the study includes a set of 29 mutations shown to cause monogenic T2DM in families or Maturity Onset Diabetes of the Young (disease causing variations, DCVs); 113 polymorphisms associated with the disease in various populations, in a total of 76 different candidate genes; and, as a control data set, 92 random nonsynonymous variations in 32 genes that have no experimental evidence of association with any disease (Supplementary Table 1 online). The selection of these random variations helps to distinguish specific behavior patterns of the disease related variations from those of chance occurrence. Hence these random variations, spread throughout the sequence of genes that have been implicated in T2DM, were selected. The disease associated polymorphisms fall into four major categories: nonsynonymous (45), regulatory (42), synonymous (11) and intronic SNPs (15). In this study we determine the effect on the phenotype of the disease associated nonsynonymous variations (referred to hereafter as DAVs) in comparison to the control nonsynonymous variations (CNVs). DCVs were obtained by querying Medline for ‘Type 2 Diabetes, Mutations’; DAVs by querying for ‘Type 2 Diabetes, SNPs’ and ‘Type 2 Diabetes, Polymorphisms’; and CNVs from the SWISSPROT database [11]. The protein sequences needed for the analysis of all these variations were extracted from SWISSPROT. Relationships between the genes harboring DAVs were determined using Pathway Assist [12]. Pathway Assist is a software application for navigation and analysis of biological pathways, gene-regulation networks and protein interaction maps. It comes with the built-in natural language processing module MedScan and a comprehensive database.

Evaluating evolutionary conservation status of the variations: The best method to evaluate the significance of a variation using evolutionary information is to consider the nature of the change with respect to the variability of the affected residue, as estimated from the wild type sequences of different proteins in a protein family. A set of similar sequences can be characterized by a multiple sequence alignment within common sequence domains (in the case of protein families) or just a small sequence region (motif). We systematically examined the positions of the variations in motif regions of proteins, using the Pfam database [13] of probabilistic models of protein domains and families derived using the HMM method, and the eMATRIX database [14]. eMATRIX [15] is a minimum risk method for estimating the frequencies of amino acids at conserved positions in a protein family. Minimum risk estimation finds the optimal weighting between a set of observed amino acid counts and a set of pseudo frequencies. These provide information on the position of the variations in specific domains and functional motifs, respectively. Residue conservation amongst the homologous proteins was predicted with Scorecons [16]. The Scorecons algorithm scores each residue position of a multiple sequence alignment in terms of conservation. Multiple sequence alignment of homologous proteins was done using the ClustalW [17] algorithm and was formatted in ClustalX (1.81). The mutation matrix of Jones et al. [18] is used to determine the likelihood of a particular residue being replaced by another and to calculate a score based on the variability of each position. Normalized Shannon entropy scores for each amino acid position were calculated using the general formulae [16]:

$$C_{ent} = -\frac{\sum_{a}^{K} p_a \log_2 p_a}{\log_2 [\min(N, K)]}, \qquad p_a = \frac{n_a}{N}$$

where $n_a$ is the number of amino acid residues of type $a$, $N$ is the number of residues in the sequence database and $K$ is the number of residue types. The program Scorecons http://www.biochem.ucl.ac.uk/cgi-bin/valder/Scorecons_server.pl was used for all calculations. A score of zero indicates a lack of conservation at that position, whereas a score of 1 indicates very high sequence conservation.
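As an illustration of the normalized Shannon entropy defined above, the following is a minimal sketch for a single alignment column, not the Scorecons implementation. It assumes K = 20 standard residue types, takes N as the number of non-gap residues in the column, and ignores gap characters; the function names are hypothetical. Because the entropy as written is 0 for an invariant column, the sketch also assumes that a conservation score on the scale described (1 = highly conserved) is obtained as one minus the normalized entropy.

```python
import math
from collections import Counter

K = 20  # assumed number of residue types (standard amino acids)

def normalized_entropy(column: str) -> float:
    """Normalized Shannon entropy of one alignment column (0 = invariant)."""
    residues = [r for r in column.upper() if r.isalpha()]  # gaps are skipped
    n = len(residues)
    if n == 0:
        raise ValueError("column contains no residues")
    denom = math.log2(min(n, K))
    if denom == 0:  # a single residue: no variability can be measured
        return 0.0
    counts = Counter(residues)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / denom

def conservation(column: str) -> float:
    """Assumed conservation score: 1 minus the normalized entropy."""
    return 1.0 - normalized_entropy(column)

if __name__ == "__main__":
    for col in ("AAAA", "AACC", "ACDE"):
        print(col, round(conservation(col), 3))
```

An invariant column such as `AAAA` scores 1, while a column in which every sequence carries a different residue scores 0.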

Determining the involvement in formation of specific patterns: Non-conserved residues adjacent to conserved residues in the primary sequence are generally less substitutable than other non-conserved residues, reflecting their involvement in functionally important regions [19]. A peptide sequence containing the variant along with ten neighboring residues on either side was selected from the protein sequence, and a pattern search was done using the PROSITE [20] database to determine the involvement of the variants in the formation of specific patterns. PROSITE consists of biologically significant sites, patterns (such as phosphorylation and glycosylation sites) and profiles that help to reliably identify specific motifs within a peptide sequence. Sequences in which the variants showed potential phosphorylation sites were evaluated for the effect on phosphorylation using NetPhos 2.0. NetPhos 2.0 is an artificial neural-network method that predicts phosphorylation sites in independent sequences with sensitivity in the range of 69-96% [21].

Assessing the effect of variation on structural parameters of proteins: It is apparent that amino acid allelic variants have an impact on protein structure and function, and this impact has been shown to be predictable by analysis of multiple sequence alignments and protein 3D structures [8]. To assess the effect of the variations on the structure and function of proteins, Polyphen [22] was used. Polyphen is a World Wide Web server that automates functional annotation of nonsynonymous SNPs, based on sequence-based characterization of the substitution site and structural parameters. It provides the PSIC (Position Specific Independent Count) score, calculated with the help of statistical concepts from the overall similarity of the sequences that share the amino acid type at this position, and predicts whether a nonsynonymous variation is damaging, i.e. is supposed to affect protein function, or benign, i.e. most likely lacking a profound phenotypic effect. Large differences in PSIC values (difference range above 1.5) for specific genetic variants might indicate that the substitution of interest is rarely or never observed in the protein family [23]. Variations in the protein core involving a change in the hydrophobic character of a buried residue may result in different degrees of protein destabilization [24]. The hydrophobic effect is measured by the solvent accessible surface area, i.e. the part of a protein's surface that is in direct contact with solvent. Solvent accessibility was predicted using RVPNET [25], which uses single residue information of neighbors and provides real value predictions of accessible surface area. Hydrophobic interactions are considered to be the primary factor stabilizing β-sheets [26]; therefore secondary structure elements were identified using Chou-Fasman predictions [27].

Statistical evaluation: To compare the DCVs and DAVs with CNVs during the assessment of their effect on the disease phenotype, χ2 tests were performed and p-values were calculated.
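The χ2 comparison of variation classes described under Statistical evaluation can be sketched as a Pearson test on a contingency table. This is only a minimal illustration, not the authors' procedure: the counts below are hypothetical and are not taken from the study, no continuity correction is applied, every expected count is assumed to be non-zero, and the closed-form p-value is implemented only for 1 and 2 degrees of freedom.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

def p_value(stat, dof):
    """Upper-tail probability of the chi-square distribution (dof 1 or 2 only)."""
    if dof == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if dof == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("closed form implemented only for 1 or 2 degrees of freedom")

if __name__ == "__main__":
    # Hypothetical counts: rows = disease-related vs control variations,
    # columns = inside vs outside a functional signature.
    table = [[10, 20],
             [20, 10]]
    stat, dof = chi_square(table)
    print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p_value(stat, dof):.4f}")
```

For larger tables a statistics library would normally be used to obtain the p-value; the closed forms here are kept only so that the sketch has no external dependencies.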

Results: Pathway Assist analysis establishes the products of the genes harboring DAVs as potential interacting members of the insulin signaling cascade (Fig. 1). Nevertheless, it should be noted that Pathway Assist connects any two input proteins, and some of the proteins it identified while networking the input proteins might not be involved in Type 2 Diabetes as it is understood at this point in time. Functional segregation of the proteins harboring the DAVs categorized enzymes as the major class (31%), whereas transcription regulators were the major class harboring DCVs (58%) (Fig. 2). Pfam analyses showed that most of the DCVs (67%), 49% of DAVs and 63% of CNVs correspond to the functional domains of the respective proteins (Supplementary Table 2 online). Therefore, of the total variations, an average of 56% lie in functional domains of proteins (p=0.02). Further, in the total sequence space of the identified proteins, 60% is occupied by functional domains. eMATRIX analysis revealed that 50% of DCVs and 62% of DAVs corresponded to functional signatures, in comparison to only 27% of CNVs. This clearly indicates that the DCVs and DAVs correspond significantly more to functional signatures than the randomly picked CNVs (p=0.0002). Scorecons analysis (Fig. 3) reveals that DCVs are mostly conservative changes (90% above the value of 0.5), whereas DAVs are radical (56% above 0.5), in comparison to CNVs (47% above 0.5), which are mostly changes in variable regions with low Scorecons values (p=0.0003). Most of the patterns obtained from PROSITE for DCVs (51.7%) and DAVs (51.1%) represented consensus post-translational modification motifs for phosphorylation, glycosylation and myristoylation (Supplementary Table 2 online), in contrast to only 37% of CNVs. A few peptides showed more than one post-translational motif. Phosphorylation changes predicted by NetPhos 2.0 for the patterns indicated a probable decrease in the phosphorylation of DCV-

T608R of IRS1 (

[Figure: Common Disease Common Variant Model. Legend entries: common variant; rare variant (hollow); disease causing; no role in the disease.]
